Population Genomics of Paramecium Species

Parul Johri,*,1 Sascha Krenek,2 Georgi K. Marinov,1 Thomas G. Doak,1,3 Thomas U. Berendonk,2 and Michael Lynch1

1Department of Biology, Indiana University, Bloomington, IN
2Institute of Hydrobiology, Technische Universität Dresden, Dresden, Germany
3National Center for Genome Analysis Support, Indiana University, Bloomington, IN

*Corresponding author: E-mail: pjohri@indiana.edu.

Associate editor: Stephen Wright

Article. © The Author 2017. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com
Mol. Biol. Evol. 34(5):1194–1216 doi:10.1093/molbev/msx074 Advance Access publication February 13, 2017

Abstract

Population-genomic analyses are essential to understanding factors shaping genomic variation and lineage-specific sequence constraints. The dearth of such analyses for unicellular eukaryotes prompted us to assess genomic variation in Paramecium, one of the most well-studied ciliate genera. The Paramecium aurelia complex consists of 15 morphologically indistinguishable species that diverged subsequent to two rounds of whole-genome duplications (WGDs, as long as 320 MYA) and possess extremely streamlined genomes. We examine patterns of both nuclear and mitochondrial polymorphism by sequencing whole genomes of 10–13 worldwide isolates of each of three species belonging to the P. aurelia complex (P. tetraurelia, P. biaurelia, P. sexaurelia), as well as two outgroup species that do not share the WGDs (P. caudatum and P. multimicronucleatum). An apparent absence of global geographic population structure suggests continuous or recent dispersal of Paramecium over long distances. Intergenic regions are highly constrained relative to coding sequences, especially in P. caudatum and P. multimicronucleatum, which have shorter intergenic distances. Sequence diversity and divergence are reduced up to 100–150 bp both upstream and downstream of genes, suggesting strong constraints imposed by the presence of densely packed regulatory modules. In addition, comparison of sequence variation at non-synonymous and synonymous sites suggests similar recent selective pressures on paralogs within and orthologs across the deeply diverging species. This study presents the first genome-wide population-genomic analysis in ciliates and provides a valuable resource for future studies in evolutionary and functional genetics in Paramecium.

Key words: population genomics, Paramecium, mitochondrial genomic variation.

Introduction

Quantifying within-species genetic variation is important for a population-level understanding of key processes in evolution. Studies of patterns of variation across the genome have shed light on factors shaping levels of variation, e.g., recombination, mutation, drift, and selection. Moreover, analysis of variant frequencies and comparisons of polymorphisms and divergence have allowed the identification of lineage-specific regions of the genome under functional constraint (Abecasis et al. 2012; Schrider and Kern 2015) and those experiencing recent positive selection (Voight et al. 2006; Tishkoff et al. 2007). Finally, population-genomic studies have uncovered spatial structure (e.g., Novembre et al. 2008) and local adaptation (e.g., Fournier-Level et al. 2011; Hancock et al. 2011) in populations. Such studies have, however, generally been restricted to multicellular organisms such as humans (Abecasis et al. 2010; Abecasis et al. 2012; Auton et al. 2015), Drosophila (Begun et al. 2007; Langley et al. 2012), Arabidopsis (Cao et al. 2011), maize (Chia et al. 2012), rice (Huang et al. 2012; Huang et al. 2015), and Caenorhabditis brenneri (Dey et al. 2013), with the fungi Saccharomyces cerevisiae (Liti et al. 2009; Schacherer et al. 2009), Schizosaccharomyces pombe (Fawcett et al. 2014), and Neurospora crassa (Ellison et al. 2011) being the primary exceptions among unicellular eukaryotes.

Although unicellular organisms span all eukaryotic subgroups and comprise most of the eukaryotic phylogenetic diversity (Burki 2014), they are heavily under-studied (Pawlowski et al. 2012; Lynch et al. 2014), representing <11% of all eukaryotic genomes available on NCBI (as of February 2017, calculated from http://www.ncbi.nlm.nih.gov/genome/browse/). Measurements of genome-wide variation in unicellular eukaryotes, besides yeast, have largely been restricted to pathogenic species such as Plasmodium (Volkman et al. 2007; Neafsey et al. 2012; Assefa et al. 2015) and Trypanosoma (Ackermann et al. 2012; Weir et al. 2016) species, Leishmania donovani (Downing et al. 2011), Cryptococcus gattii (Engelthaler et al. 2014), etc. Additional free-living unicellular eukaryotes for which the levels of genetic variation have been obtained from a limited number of loci (1–7) include Chlamydomonas reinhardtii (Smith and Lee 2008), Tetrahymena thermophila (Katz et al. 2006), and the Paramecium aurelia complex (Catania et al. 2009).

There are two main challenges to studying diversity across unicellular eukaryotes. First, characterizing genetic variation
has been especially difficult in microbes due to frequent discordance between morphological and phylogenetic species concepts and difficulties in performing mating tests under precise laboratory conditions (Lahr et al. 2014; Pawlowski et al. 2012). For instance, many lineages in ciliates have been shown to harbor a number of cryptic species that are morphologically identical but reproductively isolated (Foissner et al. 2008; Krenek et al. 2015). Some cryptic species complexes in ciliated protozoans include the P. aurelia complex (Sonneborn 1975), Halteria grandinella (Katz et al. 2005), and Tetrahymena pyriformis (Gruchy 1955). Second, the sampling of a few selected loci does not provide a true representation of genetic diversity across the genome, and this may in part be responsible for contradictory reports of extremely high or low nucleotide diversity in Paramecium (Hori et al. 2006; Snoke et al. 2006; Catania et al. 2009). Paramecium is one of the most well-studied ciliates, and cryptic species in the Paramecium aurelia complex have now been extensively characterized and resolved (Sonneborn 1975; Beale and Preer 2008), making it an ideal candidate in which to characterize genome-wide variation.

Like all ciliates, Paramecium species possess two kinds of nuclei: a macronucleus (MAC) and at least one micronucleus (MIC) (Prescott 1994). The MIC harbors a diploid, transcriptionally silent germline genome, while the MAC harbors a streamlined polyploid version of the MIC genome and is transcriptionally active (Chalker and Yao 2011). As a result of the massive genome rearrangements and sequence excisions that occur during MAC formation (Betermier 2004), the Paramecium MAC genome is relatively small (35–70 Mb) and almost entirely devoid of repetitive DNA. In addition, because P. aurelia species can undergo autogamy (self-fertilization), their genomes can be reduced to a homozygous state prior to genomic sequencing. These features allow for high-quality reference genome assemblies (McGrath et al. 2014a, 2014b). More interestingly, the Paramecium macronuclear genome is extremely compact, with one of the shortest known average intron lengths (20–25 bp) (Russell et al. 1994) and intergenic regions comprising an average of 100–200 bp. The following study therefore elucidates the forces shaping genetic variation across one of the most compact eukaryotic genomes.

Another fascinating feature of the P. aurelia complex is that, despite the simplicity of their genomes, the species diverged following at least two rounds of whole-genome duplications (WGDs) that are not shared by the outgroup species, and ~50% of genes duplicated by the most recent WGD (320 MYA) are still maintained in duplicate copies (Aury et al. 2006; McGrath et al. 2014b). The Paramecium aurelia complex thus has extremely gene-rich genomes (~40,000 protein-coding genes in each species) and provides an excellent platform for studying the evolutionary forces responsible for the retention or loss of genes after WGD.

We thus aim to characterize genome-wide patterns of variation in the macronucleus of five deeply diverging species belonging to the genus Paramecium: three species belonging to the Paramecium aurelia complex (P. tetraurelia, P. biaurelia, P. sexaurelia) and two of the closest outgroups to the complex (P. caudatum and P. multimicronucleatum). This population-genomic study involves resequencing whole genomes of 10–13 isolates of each species sampled worldwide. The study provides insights about the global population structure of each species and the genome-wide diversity levels across different sites and genomic elements. Heterozygosity levels are used to shed light on constraints on noncoding regions as well as individual genes. This study provides a unique and extensive dataset to understand genomic variation across five separate free-living microbial eukaryotes in both the macronucleus and the mitochondrion, and will serve as a valuable resource for future studies of the comparative and population genomics of Paramecium.

Results and Discussion

Genetic Variation in Paramecium

We sequenced the macronuclear genomes of 10–13 isolates of each of the five species, P. tetraurelia, P. biaurelia, P. sexaurelia, P. caudatum, and P. multimicronucleatum (totaling 56 individuals), attempting to sample as globally as possible (details in supplementary table S1, Supplementary Material online). Reads were mapped (see Methods for details) to previously assembled and annotated reference genomes for each species (Marinov et al., in preparation; McGrath et al. 2014a, 2014b). Most strains mapped to 85–97% of the reference genome, with the exception of two isolates of P. caudatum and four isolates of P. multimicronucleatum, where a smaller fraction (50–70%) mapped to the reference genome (supplementary table S1, Supplementary Material online). Isolates mapping to <75% of the reference genome are hereafter referred to as divergent isolates. We thus conducted each of our analyses in P. caudatum and P. multimicronucleatum with two data sets: one including all sampled strains, and one (the subset) excluding the divergent isolates.
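The split into a full data set and a subset can be sketched as a simple threshold on the mapped fraction. The isolate names and fractions below are hypothetical placeholders (the real values are in supplementary table S1); only the 75% cutoff comes from the text.

```python
# Cutoff from the text: isolates mapping to <75% of the reference are "divergent".
DIVERGENT_THRESHOLD = 0.75


def split_isolates(mapped_fraction, threshold=DIVERGENT_THRESHOLD):
    """Return (subset, divergent) as sorted lists of isolate names."""
    subset = sorted(n for n, f in mapped_fraction.items() if f >= threshold)
    divergent = sorted(n for n, f in mapped_fraction.items() if f < threshold)
    return subset, divergent


# Hypothetical isolates and mapped fractions, for illustration only.
example = {"iso1": 0.93, "iso2": 0.88, "iso3": 0.62, "iso4": 0.55}
subset, divergent = split_isolates(example)
print("subset:", subset)
print("divergent:", divergent)
```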

We identified a total of 545,750 (P. tetraurelia), 948,377 (P. biaurelia), 3,013,676 (P. sexaurelia), 4,812,175 (P. caudatum), and 4,333,282 (P. multimicronucleatum) single-nucleotide polymorphisms (SNPs), representing about 0.7%, 1.3%, 4.7%, 17.9%, and 14.0% of their genomes, respectively. When the divergent isolates are excluded, there is a significant reduction in the number of SNPs identified in P. caudatum and P. multimicronucleatum, to 2,218,282 and 2,502,368, respectively, in both cases comprising 7.4% of the genome.

Population Structure of Paramecium

All Paramecium species in this study are cosmopolitan and found in freshwater bodies across varying habitats/climate zones (Przybos 2005; Przybos et al. 2008; Przybos and Surmacz 2010). Three different hypotheses might explain such global distributions. First, the ubiquity model, or “everything is everywhere” hypothesis, suggests that high and continuous dispersal rates of microbes would result in their global distribution with no genetic isolation by distance (Fenchel and Finlay 2004; Finlay 2002). Paramecium has been proposed to disperse from one expanse of water to another by insects, birds, and larger migrating animals (Maguire and Belk 1967), including artificial dispersal by human activities (Foissner 2011). Alternatively, it is possible that the rates of dispersal are not very high, but rare events of migration over a long period of time have resulted in cosmopolitan distribution (Foissner 2006). Lastly, it is possible that the older Paramecium species, P. multimicronucleatum, P. caudatum, and P. sexaurelia (McGrath et al. 2014b), were distributed worldwide before the continents separated and therefore achieved a global distribution without extensive recent migration (Foissner 2008; Weisse 2008). Such scenarios are plausible given that Paramecia do not form cysts (Sonneborn 1975), and thus their long-distance dispersal capabilities might be limited. A consequence of either of the latter two scenarios would be strong subdivision among geographically distant populations. In principle, understanding the worldwide population structure of Paramecium would provide more insight into the ongoing debate of whether unicellular eukaryote populations have biogeographical patterns similar to those of many multicellular organisms (Foissner 2008; Foissner et al. 2008).

To gain more insight into this issue, we built phylogenetic trees using 883 single-copy orthologous protein-coding genes that are present in all species. For each orthologous gene set, all sequences (derived from each isolate and reference genomes) were concatenated into one supermatrix, and RAxML 8.0.26 (Stamatakis 2014) was used to generate a phylogenetic tree under the substitution model GTRGAMMA (fig. 1). We also generated gene trees for each of the orthologous gene sets separately and obtained a consensus tree (majority rule) (supplementary fig. S1, Supplementary Material online) (Aberer et al. 2010). Both trees exhibit very similar topologies, and all species are monophyletic. The only exception is one strain assigned to P. biaurelia (B_A, Poland) that appears to cluster within the P. tetraurelia clade and for which the sequencing data map to 97% (34× coverage) of the P. tetraurelia reference genome. We therefore excluded this most likely misidentified strain from all subsequent analyses.

P. caudatum consists of two major clades (A and B; genome-wide Fst = 0.38) and two outgroup strains, which represent the two divergent isolates noted above. The two major clades, A and B, which do not assort geographically, represent the two haplotype groups into which P. caudatum has previously been suggested to partition (Barth et al. 2006; Hori et al. 2006; Krenek et al. 2015). Interestingly, 92% of genes support the monophyly of Clade A, whereas only 77% of genes support the monophyly of Clade B (supplementary fig. S1, Supplementary Material online), possibly suggesting unidirectional gene flow between the two clades. Similarly, P. multimicronucleatum partitions into two major clades (C and D; genome-wide Fst = 0.50), and all four divergent isolates form clade D.
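The text does not say which Fst estimator was used. As an illustration only, a genome-wide Fst between two clades could be computed from per-SNP allele frequencies with Hudson's estimator in its ratio-of-averages form; the function name, the frequencies, and the sample sizes below are all assumptions made for this sketch.

```python
def hudson_fst(freqs1, freqs2, n1, n2):
    """Hudson's Fst (ratio of averages) for biallelic SNPs.

    freqs1, freqs2: per-SNP sample frequencies of the same allele in each clade.
    n1, n2: number of sampled haplotypes in each clade (assumed constant here).
    """
    num = 0.0
    den = 0.0
    for p1, p2 in zip(freqs1, freqs2):
        # Numerator: squared frequency difference, corrected for sampling noise.
        num += (p1 - p2) ** 2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)
        # Denominator: probability that two alleles, one from each clade, differ.
        den += p1 * (1 - p2) + p2 * (1 - p1)
    return num / den


# Hypothetical allele frequencies at four SNPs in two clades of five haplotypes each.
clade_a = [1.0, 0.8, 0.2, 0.6]
clade_b = [0.0, 0.6, 0.4, 0.6]
print(round(hudson_fst(clade_a, clade_b, n1=5, n2=5), 3))
```

Summing numerator and denominator across SNPs before dividing avoids giving undue weight to rare variants, which is why the ratio-of-averages form is chosen here.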

The P. caudatum subset (Clades A and B) and the P. multimicronucleatum subset (Clade C) seem to be comprised of individuals that can successfully mate. Multiple lines of evidence have suggested that clades A and B are not reproductively isolated (Hiwatashi 1968; Tsukii and Hiwatashi 1983, 1985; Stoeck et al. 2000; Hori et al. 2006) and are not different species according to the biological species concept. Although all strains used in the study were identified via mating tests conducted in previous studies, the extent of reproductive isolation between clades C and D is not entirely clear. We therefore consider both the P. caudatum and P. multimicronucleatum subsets to be comprised of a conservative set of individuals that are not reproductively isolated.

Remarkably, in none of the species do isolates appear to group together in accordance with geographical location, suggesting the absence of strong geographic population structure. To explore this issue, we performed principal component analysis (PCA) with approximately 10,000 uniformly distributed SNPs for each of the five species separately (fig. 2; supplementary fig. S2, Supplementary Material online). In general, the genetic distances between isolates do not appear to correlate with geography, with the exception of P. biaurelia, wherein the PCA plots suggest a USA-Europe separation and mildly mirror geography within Europe. In addition, linear regression shows no significant correlation between pairwise geographical and genetic distance between isolates within each species (fig. 2; supplementary fig. S2, Supplementary Material online).
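The comparison of pairwise geographical and genetic distance can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the isolate names, coordinates, and genetic distances are invented, the great-circle (haversine) distance is an assumed choice of geographic metric, and the Pearson correlation is used because it equals the signed square root of the R² of a simple linear regression.

```python
import math
from itertools import combinations


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def distance_pairs(coords, genetic):
    """coords: name -> (lat, lon); genetic: frozenset({a, b}) -> genetic distance."""
    geo, gen = [], []
    for a, b in combinations(sorted(coords), 2):
        geo.append(haversine_km(*coords[a], *coords[b]))
        gen.append(genetic[frozenset((a, b))])
    return geo, gen


# Hypothetical isolates (approximate city coordinates) and made-up genetic distances.
coords = {"iso_US": (42.36, -71.06), "iso_ES": (37.39, -5.98), "iso_GR": (39.67, 20.85)}
genetic = {
    frozenset(("iso_US", "iso_ES")): 0.021,
    frozenset(("iso_US", "iso_GR")): 0.019,
    frozenset(("iso_ES", "iso_GR")): 0.022,
}
geo, gen = distance_pairs(coords, genetic)
print(f"Pearson r = {pearson_r(geo, gen):.3f}")
```

Pairwise distances share isolates and are therefore not independent, so a significance test on such data needs care (for example, a permutation test); that step is outside this sketch.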

Conversely, we find that any two isolates sampled from exactly the same lake or city are always closely related on phylogenies and are indistinguishable by PCA. This can be seen in P. biaurelia (B_256-UB4 and B_256-UB2 from USA, Boston) and P. sexaurelia (S_133 and S_132 from Spain, Seville; S_129 and S_130 from Greece, Ioannina Lake). None of these pairs of isolates are identical, with 11,793, 6,896, and 3,736 bases (0.00058, 0.00015, and 0.00010 divergence per site) at which their genotypes differ, respectively.

Thus, overall, within the limitations of our sampling, we do not observe any clear population structure correlated with geography among any of the investigated Paramecium species, suggesting that these ciliates are effectively ubiquitously distributed and hence supporting the “cosmopolitan model” or the “everything is everywhere” hypothesis (Finlay 2002; Finlay et al. 2004). As an exception, P. biaurelia might mildly support the “moderate endemicity model” (Foissner 1999, 2006, 2008). Although many microbial species have now been shown to exhibit biogeographies (reviewed in Foissner 2006), no such patterns are observed on the continental scale in the examined Paramecium species. Finally, the scenario that the worldwide distribution of these species occurred only before the split of Pangea is strongly refuted. Our findings rather suggest recent or ongoing migration and gene flow of Paramecium across the globe, supporting observations from previous studies (Przybos et al. 2008; Catania et al. 2009; Krenek et al. 2012). Interestingly, although there appears to be a correlation between physical and genetic distance on a very small geographical scale, this correlation disappears when evaluated globally across continents. This could be a result of very different mechanisms and rates of dispersal at shorter versus larger distances, for instance, via organismal vectors (short-distance carriers) and global atmospheric circulation or human-mediated transport (long-distance dispersal). It is also possible that autogamy in the P. aurelia species plays a role in reducing genetic variation within populations and increasing genetic differentiation between populations, as observed in self-fertilizing plant species (Loveless and Hamrick 1984; Abbott and Gomes 1989; Sun 1999; Charlesworth 2003).


Nucleotide Diversity across the Nuclear Genome

Nucleotide diversity (measured by π) at 4-fold degenerate sites across the genome is extremely high in both P. caudatum (0.135) and P. multimicronucleatum (0.159), but much lower for the P. aurelia species: 0.027 for P. sexaurelia, 0.009 for P. biaurelia, and 0.006 for P. tetraurelia. Upon excluding the two divergent strains in P. caudatum, the nucleotide diversity at 4-fold degenerate sites halves to about 0.069. Similarly, upon excluding the four divergent strains in P. multimicronucleatum, the diversity levels fall to 0.074. Table 1 shows the nucleotide diversity across different site types: intronic, coding (4-fold degenerate, 2-fold degenerate, and 0-fold degenerate), and intergenic regions (noncoding regions that extend from the start codon of a gene to the start/stop codon of the adjacent gene).
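For readers unfamiliar with π, the sketch below computes it under its textbook definition: the average number of pairwise differences per site among sampled haplotypes. This is not necessarily the estimator used in the study, and the four short sequences are invented for illustration.

```python
from itertools import combinations


def nucleotide_diversity(haplotypes):
    """Average pairwise differences per site among equal-length haplotypes."""
    length = len(haplotypes[0])
    pairs = list(combinations(haplotypes, 2))
    # Count mismatching positions for every pair of haplotypes.
    diffs = sum(sum(a != b for a, b in zip(h1, h2)) for h1, h2 in pairs)
    return diffs / (len(pairs) * length)


# Hypothetical aligned haplotypes, one per (homozygous) isolate.
sample = ["ACGT", "ACGA", "ACGT", "TCGA"]
pi = nucleotide_diversity(sample)
print(f"pi = {pi:.4f}")
```

In practice π would be computed separately for each site class (4-fold degenerate, intronic, intergenic, and so on) by restricting the alignment to sites of that class.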

FIG. 1. A concatenated tree of all single-copy gene sequences (maximum likelihood, GTRGAMMA) in five species, P. tetraurelia (green, T), P. biaurelia (blue, B), P. sexaurelia (orange, S), P. caudatum (red, C), and P. multimicronucleatum (purple, M), shows that all species are monophyletic, excluding one strain of P. biaurelia (B_A) that branches with P. tetraurelia. Clades A and B together constitute the P. caudatum subset, and Clade C represents the P. multimicronucleatum subset. Marked isolates are the divergent isolates that mapped to <75% of the reference genome. The numbers represent maximum likelihood estimates of branch lengths (mean number of substitutions per site).


Expected heterozygosity in introns increases significantly across all species upon exclusion of the first three and last three sites within each intron, indicating that these sites are conserved, presumably for splicing. Overall, diversity in intronic and intergenic regions is remarkably low compared to the diversity levels at 4-fold degenerate sites for all species (also seen in supplementary fig. S3, Supplementary Material online), suggesting that the non-coding regions in Paramecium are under strong selective constraints, most likely because they harbor functionally important splice sites and regulatory modules, respectively. Interestingly, while nucleotide diversity levels in the P. aurelia species are much lower in introns relative to intergenic sequences, this trend is reversed in the two outgroups. This observation is consistent with the fact that, on average, intergenic regions in the P. aurelia species are longer. A possible explanation is the nonfunctionalization of genes following the two WGDs, which would increase the amount of non-functional DNA between the two flanking functional genes and thus reduce the fraction of intergenic sites under constraints. Among the P. aurelia species, P. sexaurelia shows much higher genetic variation across the genome, consistent with previous studies (Przybos et al. 2007) that found significant genotypic differences between P. sexaurelia strains. This is also consistent with relatively higher divergence of P. sexaurelia (McGrath et al. 2014b), suggesting that the differences in genome-wide levels of sequence variation could be driven by slight differences in mutation rate.

FIG. 2. Population structure of Paramecium species. Left: Principal component analysis using genome-wide genotypes for P. tetraurelia, P. biaurelia, and P. sexaurelia. Right: Average pairwise genetic distance as a function of physical distance between individuals within P. tetraurelia, P. biaurelia, and P. sexaurelia.

Overall, 4-fold degenerate sites have the highest levels of nucleotide diversity across all elements of the genome for all species, and thus might represent the most neutrally evolving sites in the Paramecium macronucleus. The diversity values reported for P. aurelia species fall within the range of known diversity levels for other unicellular eukaryotes such as Plasmodium falciparum (4-fold degenerate π = 0.0015) (Volkman et al. 2007), Chlamydomonas reinhardtii (silent site π = 0.0319) (Smith and Lee 2008), and Tetrahymena thermophila (coding π = 0.0020) (Katz et al. 2006). On the other hand, the values of 4-fold degenerate diversity in the outgroup species P. caudatum and P. multimicronucleatum are among the highest reported (Leffler et al. 2012; Dey et al. 2013). It is still not entirely clear if such high diversity values are due to a large panmictic global population or because of population structure despite infrequent gene flow. Sampling individuals from the same pond will help dissect these two scenarios. However, we cannot presently rule out other factors, such as the presence of cryptic species among the outgroups, introgression with closely related species, or a much higher intrinsic mutation rate, that could also potentially contribute to these remarkably high levels of variation.

Intergenic Regions in the Outgroups are Under Strong Constraints

The macronuclear genomes of all Paramecium species are extremely compact (fig. 3a and 3b), with the median intergenic region (non-coding DNA between start and start/stop sites of adjacent genes) in P. caudatum and P. multimicronucleatum comprising 43 and 93 bp, respectively. The aurelia species have slightly longer intergenic regions (median of 161 bp, 195 bp, and 229 bp in P. tetraurelia, P. biaurelia, and P. sexaurelia, respectively). Paramecium also has one of the shortest eukaryotic introns, with an average size of just 20–25 base pairs (Russell et al. 1994). The internally eliminated sequences (IESs), which are excised during the formation of the macronucleus and are well characterized only in P. tetraurelia, have a median length of 26–28 bp (Arnaiz et al. 2012). In fact, 93% of the IESs in P. tetraurelia are shorter than 150 bp (Arnaiz et al. 2012), suggesting that the micronuclear genomes of the Paramecium species are also fairly compact. In most organisms, diversity in coding regions is well below the diversity in intergenic and intronic regions. For instance, in S. cerevisiae, which has a fairly compact genome, the non-coding to coding diversity ratio is 1.23 (Schacherer et al. 2009). In S. pombe, the coding heterozygosity is 0.2122% and diversity at intergenic regions (including 5′ and 3′ UTRs) is 0.3351%, resulting in an intergenic to coding diversity ratio of 1.58 (calculated from Fawcett et al. 2014).

In the P. aurelia species, nucleotide diversity in coding regions is ~50% of the intergenic diversity, while for the outgroup species, intergenic regions have much less diversity than the coding regions (table 1). In order to reduce biases that might result from differences in sequence depth of coverage across different sites, we compared expected heterozygosity at coding and intergenic regions using only sites that were covered by all individuals and had coverage very close to the median population coverage (±20 reads) (fig. 3c). While the ratio of intergenic to coding diversity in the P. aurelia species is comparable to that in other organisms (1.66 for P. tetraurelia, 1.64 for P. biaurelia, and 1.69 for P. sexaurelia), in the outgroup species the ratio is 0.58 [subset: 0.74] for P. caudatum and 0.80 [subset: 0.75] for P. multimicronucleatum. Hereafter, the values in square brackets correspond to those belonging to the subsets. We know of no other species with such low nucleotide diversity in intergenic relative to coding regions, suggesting strong conservation of intergenic sequence.
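The first comparison in this paragraph can be reproduced from the intergenic and average-coding columns of Table 1 for the full data sets, as in the sketch below. Note that these are the unfiltered Table 1 values, so the ratios are not expected to match the coverage-filtered intergenic-to-coding ratios quoted from fig. 3c; the dictionary layout and function name are illustrative choices.

```python
# Table 1, all isolates: (intergenic, average coding) nucleotide diversity per site.
TABLE1 = {
    "P. tetraurelia": (0.0045, 0.0023),
    "P. biaurelia": (0.0071, 0.0035),
    "P. sexaurelia": (0.0209, 0.0113),
    "P. caudatum": (0.0350, 0.0472),
    "P. multimicronucleatum": (0.0337, 0.0530),
}


def coding_to_intergenic(table):
    """Ratio of average coding diversity to intergenic diversity per species."""
    return {sp: coding / intergenic for sp, (intergenic, coding) in table.items()}


ratios = coding_to_intergenic(TABLE1)
for species, ratio in ratios.items():
    # A ratio near 0.5 means coding diversity is about half the intergenic diversity;
    # a ratio above 1 means intergenic regions are less diverse than coding regions.
    print(f"{species}: {ratio:.2f}")
```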

Despite accounting for bias in coverage, it is possible that if the reference genome is highly diverged from the sequenced individuals (as in our case), say at intergenic regions, those reads might fail to map and hence result in a lower estimate of intergenic diversity. We therefore performed simulations to address mapping and SNP detection when the reference genome is highly diverged from individuals in the population. Our simulations show that up to 10% divergence from the reference, SNP detection is remarkably accurate. At 10% divergence from the reference, 93% of true SNPs can be detected, and estimated divergence (9.5%) is only slightly lower than true divergence (10%) (supplementary fig. S4, supplementary table S4, Supplementary Material online). However, reads obtained from genomes that were more than 10% divergent from the reference failed to map successfully, and there is a steep reduction in the fraction of true SNPs detected, with only about 10% of the true SNPs detected at 20% divergence from the reference (supplementary fig. S4, Supplementary Material online). Interestingly, most of our strains in P. caudatum are far below 10% divergence (including coding and non-coding regions separately) (supplementary table S5, Supplementary Material online), and thus overall analyses across the majority of the genome should not be affected by this bias. In P. multimicronucleatum, although the divergent strains are close to being 10% divergent from the reference (supplementary table S5, Supplementary Material online), when we consider only the P. multimicronucleatum subset, all strains are less than 6% divergent from the reference. Therefore, it is possible that such a bias is affecting our results when we include all the strains in the outgroup species, but analyses with the subsets should be minimally influenced by this bias. Nonetheless, it should be noted that due to heterogeneity of divergence across the genome, some regions of the genome might still be affected by such a mapping bias. Accordingly, we repeated the above analysis by calling SNPs using only the 4 and 5 closest individuals to the reference genome (forming polytomies in supplementary fig. S1, Supplementary Material online) in P. caudatum and P. multimicronucleatum, respectively. These individuals were even closer to the reference genome, with average expected heterozygosity at 4-fold degenerate sites of 0.04302 and 0.06255, respectively. Again, we observe that for all sites that are covered by all individuals and have coverage very close to the median population coverage (±20 reads), the ratio of intergenic to coding diversity was 0.78 and 0.72, respectively. This suggests that the observation of relatively lower genetic variation in the intergenic relative to coding regions in the outgroup species is not an artifact of our analysis.

Table 1. Mean Nucleotide Diversity (per site) Estimated for All Isolates across the Genome (SE < 3.0 × 10⁻⁴ for all sites). Intron* Represents Introns Excluding the First Three and Last Three Sites. Intergenic Region Refers to the Noncoding DNA that Extends between the Stop and Start Site of Protein-Coding Genes.

Species                            4-fold   2-fold   0-fold   syn      non-syn  intron   Intron*  intergenic  avg coding
P. tetraurelia                     0.0058   0.0033   0.0012   0.0042   0.0011   0.0026   0.0032   0.0045      0.0023
P. biaurelia                       0.0089   0.0049   0.0021   0.0059   0.0018   0.0049   0.0050   0.0071      0.0035
P. sexaurelia                      0.0269   0.0165   0.0061   0.0194   0.0052   0.0151   0.0167   0.0209      0.0113
P. caudatum                        0.1352   0.0819   0.0150   0.1056   0.0134   0.0466   0.0602   0.0350      0.0472
P. caudatum (subset)               0.0693   0.0381   0.0090   0.0518   0.0080   0.0316   0.0406   0.0262      0.0241
P. multimicronucleatum             0.1599   0.0781   0.0216   0.1146   0.0193   0.0405   0.0492   0.0337      0.0530
P. multimicronucleatum (subset)    0.0742   0.0418   0.0100   0.0549   0.0089   0.0249   0.0330   0.0201      0.0259

Population Genomics of Paramecium Species . doi:10.1093/molbev/msx074 MBE

Downloaded from https://academic.oup.com/mbe/article-abstract/34/5/1194/2992914 by Arizona State University user on 02 July 2018

As an alternative approach to examine the relative strength of selective constraint on coding versus non-coding DNA, we examined the distribution of insertions/deletions (indels) across the genome. A total of 29,096 (P. tetraurelia), 54,406 (P. biaurelia), 88,724 (P. sexaurelia), 32,631 [28,776] (P. caudatum [subset]), and 31,338 [28,347] (P. multimicronucleatum [subset]) indels were found across the genomes, which translates to SNP/indel ratios of 19, 17, 34, 147 [77], and 138 [88], respectively. The indel lengths are exponentially distributed in all species, with a median of 1 bp. Indels are heavily under-represented in coding regions (fig. 3d), where the median indel length is 3 bp, suggesting purifying selection against indels that cause frame-shifts. Introns and intergenic regions, however, tend to have a median indel length of 1 bp (supplementary table S6, Supplementary Material online). Again, we see that in the P. aurelia species, there are proportionally more indels present in the intergenic regions than in introns. On the other hand, in the outgroup species, intergenic regions tolerate fewer indels than do intronic regions. This shows that the very short intergenic regions in P. caudatum and P. multimicronucleatum are under extremely strong constraint.

FIG. 3. The compactness of Paramecium genomes. (A, B) Distribution of intron and intergenic length across Paramecium genomes: (A) P. tetraurelia and (B) P. caudatum. Mean intron lengths in P. tetraurelia and P. caudatum: 25 and 22 bp; median intergenic length in P. tetraurelia and P. caudatum: 161 and 43 bp. (C) πcoding/πintergenic calculated genome-wide from sites that had coverage very close to the median population coverage (±20 reads) and were contributed by the same number of individuals. (D) Incidence of indels across the genome for each Paramecium species, calculated by dividing the total number of indels found in that region by the total number of utilizable nucleotide sites in that region.
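The indel summaries above rest on three simple quantities: indels per utilizable site within a region class, the genome-wide SNP/indel ratio, and whether an indel length is a multiple of three (and so preserves the reading frame). A small sketch of these calculations follows; all counts in the example are hypothetical placeholders rather than values from this study, and the function names are illustrative only.

```python
def indel_incidence(n_indels, n_usable_sites):
    """Indels per utilizable nucleotide site within one region class."""
    if n_usable_sites <= 0:
        raise ValueError("region must contain at least one usable site")
    return n_indels / n_usable_sites


def snp_indel_ratio(n_snps, n_indels):
    """Number of SNPs observed per indel."""
    if n_indels <= 0:
        raise ValueError("need at least one indel")
    return n_snps / n_indels


def preserves_frame(indel_length):
    """True if an indel of this length keeps the reading frame intact."""
    return indel_length % 3 == 0


# Hypothetical (indel count, usable sites) per region class.
regions = {
    "coding": (120, 2_000_000),
    "intron": (300, 400_000),
    "intergenic": (900, 1_000_000),
}
for name, (indels, sites) in regions.items():
    print(f"{name}: {indel_incidence(indels, sites):.2e} indels per site")
```

Under purifying selection against frame-shifts, coding indels that survive are expected to be enriched for lengths where `preserves_frame` is true, which is consistent with the 3-bp median reported for coding regions.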

Spatial Patterns of Constraints across Non-Coding Regions in the Genomes
Selection in natural populations can remove strongly deleterious mutations, but weakly deleterious mutations can continue to persist in the population at lower allele frequencies (Ohta 1974, 1992). Thus, within-population nucleotide diversity can be used to infer constraints across the genome and possibly identify functional parts of the non-coding genome. We therefore evaluated spatial patterns of nucleotide diversity in exons flanking all introns and intergenic regions across the genome for each species separately (fig. 4; supplementary fig. S5, Supplementary Material online). However, neutral sites linked to functionally important sites also experience a reduction in diversity as a result of either purging of deleterious alleles (background selection; Charlesworth et al. 1993) or fixation of a beneficial mutation (hitchhiking; Smith and Haigh 1974; Kaplan et al. 1989) at functional loci due to linkage (Comeron et al. 2008). As selection has a minimal effect on divergence at linked neutral sites (Birky and Walsh 1988), a comparison of polymorphism and divergence can distinguish between the effects of variation in the neutral mutation rate (which is determined by selective constraint in addition to mutational processes) versus linked selection (Begun and Aquadro 1992). To this end, we evaluated patterns of divergence across introns and intergenic regions for the relatively more closely related P. aurelia species.

We observed a reduction in both expected heterozygosity and divergence in intergenic regions near the start and end of coding sequences, as well as at some key sites within introns (figs. 4 and 5 and supplementary fig. S6, Supplementary Material online). In fact, divergence-normalized values of expected heterozygosity fail to exhibit any spatial pattern across these non-coding elements (fig. 5; supplementary fig. S6, Supplementary Material online), suggesting that these patterns are driven predominantly by negative selection and mutational processes (although see below) rather than linked selection. This is interesting because, although asexual species are expected to experience the effects of linked selection more strongly (Charlesworth et al. 1993), very compact genomes such as those in Paramecium might experience relatively more direct selection, as they have a higher proportion of functional sites.
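The logic of the divergence normalization can be made concrete with a short sketch. If diversity and divergence fall together toward a gene end, their ratio stays flat, which points to direct constraint or mutation rather than linked selection; if only diversity falls, the ratio dips. The profiles below are hypothetical, and the function name is illustrative rather than part of the analysis code of this study.

```python
def normalize_by_divergence(heterozygosity, divergence):
    """Per-position ratio H/D; positions with zero divergence yield None."""
    if len(heterozygosity) != len(divergence):
        raise ValueError("profiles must have the same length")
    return [h / d if d > 0 else None
            for h, d in zip(heterozygosity, divergence)]


# Hypothetical average profiles at increasing distance from a stop codon.
het = [0.002, 0.004, 0.006, 0.008]
div = [0.05, 0.10, 0.15, 0.20]
normalized = normalize_by_divergence(het, div)

# Both raw profiles rise with distance, but the normalized profile is flat.
print(normalized)
```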

The first and last three sites of introns are highly conserved, even more so than 0-fold degenerate sites, as observed in patterns of both divergence and nucleotide diversity (figs. 4 and 5), consistent with the role of these positions in intron splicing (Jaillon et al. 2008). In fact, it has been suggested that the first and last three sites in short eukaryotic introns might be sufficient for accurate splicing (Lim and Burge 2001; Jaillon et al. 2008). However, we find that in Paramecium, on average, 8–10 base pairs of coding DNA immediately flanking an intron are also under stronger constraints than the remainder of the coding sequence, possibly suggesting the presence of splicing regulatory elements in the coding sequence. Another interesting observation is the clear decrease in divergence (fig. 5e and 5g) in the center of introns, suggesting that longer introns might possess functional elements related to gene regulation.

We also observe that average expected heterozygosity and divergence increase gradually moving upstream from the start codon and downstream from the stop codon into the intergenic regions, stabilizing at around 100–150 base pairs (fig. 4), suggesting the presence of regulatory modules or UTRs. All four types of intergenic regions with respect to the direction of transcription, namely (i) regions that are upstream of both flanking genes, (ii) those downstream of one gene and upstream of another, (iii) upstream of one gene and downstream of another, and (iv) downstream of both flanking genes, exhibit the same pattern of reduction in heterozygosity levels just before the start and after the stop codons (supplementary fig. S7, Supplementary Material online). This indicates that constraints on downstream intergenic regions are not an artifact of overlapping upstream intergenic regions. Instead, these observations suggest the presence of gene regulatory elements at the 3′ end of genes, involved either in posttranscriptional regulation, in the specification of transcription termination, or in the regulation of gene expression. We note that in the P. aurelia species, diversity levels in the intergenic regions stabilize at the average diversity of 4-fold degenerate sites. In contrast, in the two outgroup species, diversity in intergenic regions stabilizes only at the equivalent of the average diversity at 2-fold degenerate sites. Such differences in patterns of nucleotide diversity between the P. aurelia species and the outgroup species can be attributed to a much higher proportion of functionally important sites comprising the intergenic space in the outgroup species, possibly as a result of smaller numbers of bases between adjacent genes in the latter.

A similar reduction in diversity in intergenic regions extends up to 85 kb around transcription start/stop sites of genes in humans (Abecasis et al. 2010). In S. cerevisiae, intergenic regions are under strong purifying selection, and those shorter than 300 bp have significantly lower nucleotide diversity than longer intergenic regions (Schacherer et al. 2009).

Interestingly, we also observe, on average, a 70–100 bp (20–35 amino acids) decrease in diversity as well as divergence at both the 5′ and 3′ ends of the coding sequence (CDS) of protein-coding genes. Such a reduction in diversity at 4-fold and 2-fold degenerate sites at the beginning and end of protein-coding sequence can be explained by a number of different hypotheses. The 5′ ends of CDSs might be required to maintain low thermodynamic stability of mRNA secondary structure near the start codon for efficient translation initiation. Alternatively, CDS 5′ ends might possess fewer tRNA-adapted codons to slow down translation initially, which has been proposed to reduce ribosome collision-induced premature translation terminations and facilitate protein production (see Tuller and Zur 2015). Consistent with this, the 5′ and 3′ ends of protein-coding genes in bacteria (32–35 codons) have been shown to be important for efficient translation (Kudla et al. 2009; Boel et al. 2016), and their mRNA folding energies can explain observed variation in expression levels (Kudla et al. 2009). In fact, it has been shown previously that the first few codons in proteins of P. tetraurelia and T. thermophila are, on average, comprised of significantly different numbers of effective codons than the rest of the protein (Salim et al. 2008). Thus, the role played by the first and last few bases of all coding sequences in translation might contribute to additional sequence constraints, resulting in reduced nucleotide diversity and divergence.

FIG. 4. Average expected heterozygosity (per site) within and flanking introns (left) and intergenic regions (right) in P. tetraurelia (a, b), P. biaurelia (c, d), P. sexaurelia (e, f), P. caudatum subset (g, h), and P. multimicronucleatum subset (i, j), separately for 4-fold degenerate sites (red), 2-fold degenerate sites (blue), 0-fold degenerate sites (green), and non-coding sites (black).

FIG. 5. Patterns of average divergence (per site) within and flanking intergenic regions (a, c) and introns (e, g) are shown in the left panels. Patterns of average expected heterozygosity per site normalized by average divergence per site within and flanking intergenic regions (b, d) and introns (f, h) are shown on the right. The divergence values are calculated between P. tetraurelia and P. biaurelia, separately for 4-fold degenerate sites (red), 2-fold degenerate sites (blue), 0-fold degenerate sites (green), and non-coding sites (black).

Another factor that might contribute to constraining the CDS ends is the presence of regulatory modules at the beginning and end of genes. Exonic enhancers have been observed in multiple species, including teleosts (Dong et al. 2010; Eichenlaub and Ettwiller 2011), humans (Birnbaum et al. 2012), and other vertebrates (Ritter et al. 2012), where they have been shown to regulate nearby genes. In the highly compact genomes of Paramecium, it is even more likely that regulatory regions extend beyond non-coding regions into the CDS. For instance, in P. caudatum, the median length of 5′ intergenic regions is 43 bp, which increases to 56 bp when both genes share the same 5′ intergenic region, suggesting the importance of upstream intergenic length in such compact genomes. Interestingly, 965 genes (7%) have 5′ intergenic regions shorter than 10 bp and 3417 genes (25%) have 5′ intergenic regions shorter than 20 bp in P. caudatum. Similarly, in P. multimicronucleatum, the median length of the 5′ intergenic region is 90 bp, which increases to 127 bp when both genes share the same 5′ intergenic region. Again, there are 456 genes (3.5%) with less than 10 bp and 1760 genes (13%) with less than 20 bp comprising their upstream intergenic region. In fact, in both species the 5′ intergenic region can be as small as 3 bp (found in 38 genes in P. caudatum and 18 genes in P. multimicronucleatum). An interesting question this raises is what the minimum number of bases required to encode regulatory information about transcription of genes in eukaryotes might be. A recent study in yeast created a high-strength minimal constitutive promoter that is only 116 bp in length (Redden and Alper 2015). Even smaller promoter regions might suffice to drive expression patterns in Paramecium, especially in P. caudatum.

Spatial Patterns of Nucleotide Diversity across Non-Coding Regions are Not Driven by Mutation Alone
Although we have assumed that the divergence-normalized diversity patterns indicate the effect of negative selection, it is possible that the observed spatial patterns of nucleotide diversity across and flanking non-coding regions are driven entirely by mutation. One way to potentially distinguish between mutation and direct selection as the cause of these patterns is to evaluate the allele frequency of the variants observed at different site types. If the reduction in nucleotide diversity near the start and end of coding sequences is mostly a consequence of negative selection, we would expect a higher fraction of rare alleles at 4-fold degenerate sites under stronger constraints relative to that seen at 4-fold degenerate sites not under constraints. If, instead, mutation is the predominant driver of the extent of nucleotide diversity at these sites, then there should be no difference in the fraction of rare alleles between the 4-fold degenerate sites under stronger negative selection and those not under constraints. Because Tajima's D (Tajima 1989) is a good statistic to summarize such a shift in rare alleles, we compared average Tajima's D across site classes to evaluate this.
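For readers who want to see how an excess of rare alleles pushes Tajima's D negative, the statistic can be sketched directly from its definition (Tajima 1989). The implementation below assumes biallelic sites summarized by their minor-allele counts in a sample of n sequences; the function names and the toy inputs are illustrative and do not reproduce the software used in this study.

```python
import math


def pairwise_diversity(minor_allele_counts, n):
    """Mean number of pairwise differences, summed over biallelic sites."""
    return sum(2 * k * (n - k) / (n * (n - 1)) for k in minor_allele_counts)


def tajimas_d(minor_allele_counts, n):
    """Tajima's D for n sampled sequences; None if it is undefined."""
    # Keep only sites that actually segregate in the sample.
    counts = [k for k in minor_allele_counts if 0 < k < n]
    s = len(counts)
    if s == 0 or n < 4:  # the variance term vanishes for n < 4
        return None
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    pi = pairwise_diversity(counts, n)
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))


# Ten sites that are all singletons: excess of rare alleles, D < 0.
d_rare = tajimas_d([1] * 10, n=10)
# Ten sites at intermediate frequency: D > 0.
d_common = tajimas_d([5] * 10, n=10)
print(d_rare, d_common)
```

The two toy inputs contain the same number of segregating sites and differ only in allele frequency, which is exactly the contrast the comparison of site classes is designed to detect.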

In each Paramecium species, we first characterized sites genome-wide into five different categories: (1) splice sites (first and last 3 sites) in introns, (2) intergenic sites within 50 bp of start/stop sites, (3) intergenic sites at least 150 bp away from start/stop sites (which should exclude most UTR regions), (4) 4-fold degenerate sites within 50 bp of start/stop sites, and (5) 4-fold degenerate sites at least 150 bp away from start/stop sites and 30 bp away from splice sites. We expected category 5 sites to be the most neutral. Interestingly, in most species, Tajima's D at splice sites, intergenic sites close to start/stop sites, and 4-fold degenerate sites close to start/stop sites is significantly smaller than that of 4-fold degenerate and intergenic sites far away from any functional elements (see supplementary table S7, Supplementary Material online, for P-values). This suggests that sites in categories 1, 2, and 4 are under relatively stronger purifying selection. Interestingly, category 2 sites seem to possess significantly more rare alleles than category 3 consistently in all species, suggesting that intergenic regions very close to start/stop sites are under stronger purifying selection than intergenic sites further away from gene ends. We also find Tajima's D at intergenic sites (category 3) to be significantly higher than that for 4-fold degenerate sites (category 5) in most species. This could suggest either that intergenic sites are under some form of positive/balancing selection or that silent sites are not completely neutral and in fact experience some purifying selection.

Gene-Based Analysis of Patterns of Synonymous and Non-Synonymous Diversity
Selective constraint can vary among genes within the same species, and it could also be species-specific. To examine such variation in our dataset, we evaluated the distribution of πn/πs across all genes (fig. 6a). If we assume that synonymous polymorphisms are fitness neutral, then πn/πs is a measure of the relative functional constraints on non-synonymous polymorphisms and thus an indication of functional constraints on the protein. The majority of genes have πn/πs < 1.0, consistent with constraints on non-synonymous changes. About 98% of genes in P. tetraurelia, 96% in P. biaurelia, 96% in P. sexaurelia, 99% in both P. caudatum and its subset, and 99% in both P. multimicronucleatum and its subset have πn/πs < 1. Upon comparing the average πn/πs for the same set of orthologous genes across all species, we find the mean to be 0.10 for P. tetraurelia, 0.17 for P. biaurelia, 0.24 for P. sexaurelia, 0.12 [0.12] for P. caudatum [subset], and 0.17 [0.14] for P. multimicronucleatum [subset]. Among these species, P. sexaurelia has the highest average πn/πs, which suggests either relaxed selection because of a relatively small effective population size or more local adaptation. Local adaptation might result in fixation of more non-synonymous than synonymous differences between subpopulations, thus increasing πn/πs. Differences in πn/πs between species could also reflect genome-wide variation in the fraction of gene duplicates that are subfunctionalized, neofunctionalized, or nonfunctionalized. Alternatively, an elevation in average levels of πn/πs could be a consequence of depression in the average levels of πs, which could be caused by stronger selection on synonymous sites or a recent bottleneck (Brandvain and Wright 2016) experienced by the species. Evidence that genes with higher πn/πs ratios are under more relaxed constraints is provided by a comparison between the average number of frame-shift causing indels occurring in the four sets of genes with (1) πn/πs ≤ 0.25, (2) 0.25 < πn/πs ≤ 0.5, (3) 0.5 < πn/πs ≤ 1.0, and (4) πn/πs > 1.0. As expected, we find an increase in the number of indels per site with increasing πn/πs (fig. 6b).
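The comparison of frame-shift indels across πn/πs classes can be sketched as a simple binning step followed by a per-class rate. The sketch below assumes the four classes listed above, uses gene length as the per-site denominator, and skips genes with five or fewer synonymous polymorphisms, as in the figure legends; the dictionary keys and the toy gene records are hypothetical.

```python
def constraint_bin(pi_n, pi_s):
    """Assign a gene to one of four pi_n/pi_s classes (1 = most constrained)."""
    ratio = pi_n / pi_s
    if ratio <= 0.25:
        return 1
    if ratio <= 0.5:
        return 2
    if ratio <= 1.0:
        return 3
    return 4


def frameshift_rate_by_bin(genes, min_syn_polymorphisms=6):
    """Frame-shift indels per site in each class; None for empty classes."""
    totals = {b: [0, 0] for b in (1, 2, 3, 4)}  # [indels, sites]
    for g in genes:
        # Require more than five synonymous polymorphisms and pi_s > 0.
        if g["n_syn_snps"] < min_syn_polymorphisms or g["pi_s"] <= 0:
            continue
        b = constraint_bin(g["pi_n"], g["pi_s"])
        totals[b][0] += g["n_frameshift_indels"]
        totals[b][1] += g["length"]
    return {b: (indels / sites if sites else None)
            for b, (indels, sites) in totals.items()}


# Hypothetical gene records.
genes = [
    {"pi_n": 0.001, "pi_s": 0.01, "n_syn_snps": 12, "n_frameshift_indels": 0, "length": 1500},
    {"pi_n": 0.004, "pi_s": 0.01, "n_syn_snps": 9, "n_frameshift_indels": 1, "length": 1200},
    {"pi_n": 0.020, "pi_s": 0.01, "n_syn_snps": 7, "n_frameshift_indels": 3, "length": 900},
    {"pi_n": 0.020, "pi_s": 0.01, "n_syn_snps": 3, "n_frameshift_indels": 5, "length": 900},
]
rates = frameshift_rate_by_bin(genes)
print(rates)
```

The last toy record is excluded by the polymorphism filter, which illustrates why the filter matters: ratios from genes with very few synonymous polymorphisms are too noisy to classify.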

Paralogs and Orthologs Evolve under Similar Selective Constraints within and across Species
We evaluated the correlation between πn/πs across orthologs and within-species paralogs, which were created by the most recent WGD (fig. 7). Both paralogs and orthologs show a significant positive correlation within and across all species, suggesting that most WGD duplicates are under similar selective constraints. We also confirmed that these correlations are not being driven simply by selection on synonymous sites, as πs is not necessarily correlated between orthologs and within-species paralogs across all species (supplementary fig. S8, Supplementary Material online). Similar findings in the maize genome also suggested correlations between nucleotide diversity and divergence between WGD paralogs (Pophaly and Tellier 2015). Our results further suggest that most orthologs are under very similar evolutionary constraints across the Paramecium species. This is in agreement with a recent study in Drosophila, where πn and πs were found to be strongly correlated between D. melanogaster and D. simulans (Langley et al. 2012). Such a correlation was also observed between orthologs in S. cerevisiae and S. paradoxus (Elyashiv et al. 2010).

FIG. 6. (A) πn versus πs of each gene. The red line represents πn = πs. (B) The distribution of πn/πs of each gene across the genome. Only genes with more than five synonymous polymorphisms were used in this analysis. (C) Number of frame-shift causing indels per site in genes belonging to three gene categories: (1) πn/πs ≤ 0.5 (light grey), (2) 0.5 < πn/πs ≤ 1.0 (grey), and (3) πn/πs > 1.0 (black). Only genes with more than five synonymous polymorphisms were used in this analysis.

A possible mechanism governing the degree of selective constraint acting on proteins is their level of expression (Zhang and Yang 2015). Non-synonymous mutations are thought to have more severe fitness consequences in highly expressed genes than in weakly expressed ones (Drummond et al. 2005; Gout et al. 2010). Indeed, we find a strong negative correlation between expression and πn/πs in all species (fig. 8). However, only 6–13% of the variance is explained by expression, implying that additional factors may play a greater role in determining the degree of selective constraint acting on a gene. Simple biochemical properties of proteins, like the protein's function, tertiary structure, and the number of interacting partners, might determine the proportion of functionally important residues in the protein and in turn govern the degree of selection.
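The statement that expression explains only a small share of the variance corresponds to the R² of a simple linear regression. A minimal least-squares sketch is given below; it assumes that both expression and πn/πs are log-transformed before fitting, which is an assumption made here for illustration, and the toy values are hypothetical.

```python
import math


def linear_fit(x, y):
    """Ordinary least squares of y on x; returns (slope, intercept, r_squared)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with >= 2 points")
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept, sxy ** 2 / (sxx * syy)


# Hypothetical per-gene expression levels and pi_n/pi_s ratios.
expression = [1.0, 10.0, 100.0, 1000.0]
ratios = [0.5, 0.3, 0.2, 0.05]
log_expr = [math.log10(e) for e in expression]
log_ratio = [math.log10(r) for r in ratios]
slope, intercept, r2 = linear_fit(log_expr, log_ratio)
print(f"slope={slope:.3f}, R^2={r2:.3f}")
```

A negative slope with a low R² is the combination described above: the relationship is consistent in direction but leaves most of the between-gene variation unexplained.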

Our observation that orthologous genes across such deeply diverged species experience similar degrees of selective constraint (fig. 7) might imply a low rate of turnover of expression, protein structure, function, and number of interacting partners. This finding may also be due in part to similar ecologies, and therefore similar selective pressures, experienced by the Paramecium species examined here. It may thus be informative to compare closely related species that live in, or have recently experienced, different environments. For instance, Elyashiv et al. (2010) found significant differences in levels of gene-specific purifying selection (πn/πs) in the sake population of S. cerevisiae in comparison to other populations of S. cerevisiae as well as S. paradoxus. In fact, this deviation in selective constraints experienced by the sake population is complemented by differences in response to different stresses (Kvitek et al. 2008), suggesting that changes in environment might be reflected in gene-specific constraints.

FIG. 7. Correlation of the logarithm of πn/πs for paralogs within species (three diagonal plots for P. tetraurelia, P. biaurelia, and P. sexaurelia) and for orthologs between species pairwise [non-diagonal plots, with the plots below the diagonal (outlined red) representing the first set of WGD paralogs (T1, B1, S1) and the upper half of the matrix (outlined blue) representing the second set of paralogs (T2, B2, S2)]. The number in the top left corner of each plot is the slope of the linear regression in cases where the regression is significant (P < 0.01). Only genes with more than five synonymous polymorphisms were used in this analysis. The schematic tree in the bottom right corner shows the phylogenetic relationship between the species P. tetraurelia (T), P. biaurelia (B), P. sexaurelia (S), P. caudatum (C), and P. multimicronucleatum (M).

FIG. 8. Negative correlation of πn/πs with expression levels of genes in four Paramecium species. All correlations are significant with P < 2.2 × 10⁻¹⁶. Only genes with more than five synonymous polymorphisms were used in this analysis.

Distribution of Tajima's D across the Genome
Contrary to genes under purifying selection, genes under balancing or positive selection might often be species-specific. We tested this hypothesis using Tajima's D, a statistic that uses the difference between nucleotide diversity and the number of segregating sites to detect deviations from neutrality. Because Tajima's D is sensitive to population size changes and population structure (Simonsen et al. 1995), we used an outlier approach. We calculated Tajima's D for non-overlapping 1-kb windows across the genome (supplementary fig. S9, Supplementary Material online) and looked for the presence of candidate genes in the outlier regions (0.25% of the two tails). The outlier regions with negative values of Tajima's D might represent genes under either weak purifying or positive selection (i.e., experiencing a recent sweep), whereas the ones with positive Tajima's D might be under balancing selection. Interestingly, both sets of outliers show an over-representation of genes involved in glycosylation, glycoprotein production and metabolism, and carbohydrate metabolism (supplementary table S8, Supplementary Material online) for all Paramecium species. For each species, the large functional overlap of genes in both tails suggests an important role for these processes in Paramecium growth. More importantly, this further suggests similar genome-wide selective pressures acting on all Paramecium species examined in this study. These conclusions can, however, be biased by gene length, as shorter genes might not have sufficient power for an accurate calculation of Tajima's D. Indeed, the outliers are all significantly longer than the rest of the genes in P. tetraurelia (outliers: 1567 bp, non-outliers: 1321 bp; P = 4.7 × 10⁻¹³), P. biaurelia (outliers: 1681 bp, non-outliers: 1372 bp; P < 2.2 × 10⁻¹⁶), and P. sexaurelia (outliers: 1860 bp, non-outliers: 1379 bp; P < 2.2 × 10⁻¹⁶). This bias is, however, much lower or not significant for P. caudatum (outliers: 1463 bp, non-outliers: 1383 bp; P = 7.3 × 10⁻³) and P. multimicronucleatum (outliers: 1509 bp, non-outliers: 1445 bp; P = 0.19), possibly because of higher genome-wide nucleotide diversity, indicating that, despite some bias, the results have true biological significance.
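The outlier approach can be illustrated with a short sketch that takes one Tajima's D value per window and returns the windows falling in each tail of the empirical distribution. The tail fraction is left as a parameter, the window identifiers and values below are hypothetical, and the handling of ties and rounding is an assumption rather than a description of the procedure used in this study.

```python
def tail_cutoffs(values, tail_fraction):
    """Return (low, high) cutoffs enclosing roughly tail_fraction per tail."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values supplied")
    k = max(1, int(len(ordered) * tail_fraction))
    return ordered[k - 1], ordered[-k]


def outlier_windows(window_d, tail_fraction):
    """Split windows into negative-tail and positive-tail outliers."""
    low, high = tail_cutoffs(list(window_d.values()), tail_fraction)
    negative = sorted(w for w, d in window_d.items() if d <= low)
    positive = sorted(w for w, d in window_d.items() if d >= high)
    return negative, positive


# Hypothetical per-window Tajima's D values keyed by window index.
window_d = {i: float(i) for i in range(100)}
negative, positive = outlier_windows(window_d, tail_fraction=0.05)
print(negative, positive)
```

Genes overlapping the returned windows would then be the candidates examined for functional enrichment.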

Tajima's D, as a summary statistic of the site-frequency spectrum, can also indicate overall patterns of selection or recent changes in demography. Overall, in most species, the genome-wide distribution of Tajima's D is shifted towards positive values, with means of 0.862 for P. tetraurelia, 0.504 for P. biaurelia, 0.274 for P. sexaurelia, -0.148 [0.459] for P. caudatum [subset], and 1.325 [0.855] for P. multimicronucleatum [subset]. This could suggest pervasive local adaptation (or balancing selection), a recent contraction in population size, or a fairly ancient population subdivision in all Paramecium species. A previous study that sampled variation at 10 nuclear loci found that strains from different continents shared the same alleles and interpreted that as evidence of regular global dispersal of Paramecium from relatively few permanent locations (Catania et al. 2009). This could also be a potential explanation for the observed increase in nucleotide diversity in comparison to the number of segregating sites.

Fraction of Heterozygous Sites across the Genome
On average, about 0.1% of sites in an individual were heterozygous within the P. aurelia species, whereas about 0.7% of sites in individuals belonging to P. caudatum and 1.1% of sites in isolates of P. multimicronucleatum were found to be heterozygous (supplementary table S1, Supplementary Material online). This is consistent with the observation that the P. aurelia species undergo autogamy, whereas the outgroup species do not. On evaluating the fraction of heterozygous sites in windows of 10 kb across the genome, we find large blocks of homozygosity in the P. aurelia genomes (supplementary fig. S10a–e, Supplementary Material online), consistent with loss of heterozygosity in the lab, possibly through multiple episodes of autogamy. It should, however, be noted that peaks in heterozygous regions in the P. aurelia genomes are highly correlated between isolates, unlike in P. caudatum and P. multimicronucleatum, and might reflect mapping or assembly errors. These heterozygous regions (summarized in supplementary Files S1–S3, Supplementary Material online) comprise a very small part (0.82–0.97%) of the genome and, most importantly, their exclusion does not affect any of our analyses qualitatively.
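The windowed scan for blocks of homozygosity can be sketched as a count of heterozygous positions per 10-kb window divided by the window size. The sketch assumes 0-based positions on a single scaffold and treats every site in a window as callable, which is a simplification; the function name and the toy positions are hypothetical.

```python
def het_fraction_by_window(het_positions, sequence_length, window=10_000):
    """Fraction of heterozygous sites in consecutive non-overlapping windows."""
    if sequence_length <= 0 or window <= 0:
        raise ValueError("sequence_length and window must be positive")
    n_windows = -(-sequence_length // window)  # ceiling division
    counts = [0] * n_windows
    for pos in het_positions:
        if not 0 <= pos < sequence_length:
            raise ValueError(f"position {pos} is outside the sequence")
        counts[pos // window] += 1
    fractions = []
    for i, count in enumerate(counts):
        # The final window may be shorter than the others.
        size = min(window, sequence_length - i * window)
        fractions.append(count / size)
    return fractions


# Hypothetical heterozygous positions on a 25-kb scaffold.
fractions = het_fraction_by_window([5, 15, 25, 10_005], 25_000)
homozygous_blocks = [i for i, f in enumerate(fractions) if f == 0.0]
print(fractions, homozygous_blocks)
```

Runs of consecutive windows with a fraction of zero correspond to the large homozygous blocks described for the P. aurelia genomes.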

Variation across the Mitochondrial Genome
Few ciliate mitochondrial genomes have been sequenced until now, most belonging to the genus Tetrahymena (Brunk et al. 2003; Burger et al. 2000; Moradian et al. 2007). Others include Paramecium tetraurelia (Pritchard et al. 1990), Paramecium caudatum (Barth and Berendonk 2011), and Oxytricha trifallax (Swart et al. 2012). These studies have illustrated that ciliate mitochondrial genomes are linear and range between 40 and 47 kb, with 50 intronless genes. However, levels of sequence diversity across the entire mitochondrial genome have not been characterized.

To obtain mitochondrial genomes for all species included in this study, we carried out de novo genome assembly using the whole-genome resequencing data for each strain and extracted mitochondrial contigs using the published Paramecium mitochondrial genomes as reference (see Methods for more details). Using this approach, we derived complete genome assemblies for multiple strains, including from species without a previously published mitochondrial genome (P. biaurelia, P. sexaurelia, and P. multimicronucleatum). A comprehensive analysis of these sequences will be published elsewhere (Johri et al., in preparation). The mitochondrial genome assembly from each isolate was annotated separately, and the eight genes that were most ubiquitously present in all individuals were concatenated and used to construct a phylogeny (see Methods). The mitochondrial phylogeny closely tracked the one based on nuclear sequences (supplementary fig. S11, Supplementary Material online).


One strain from each species was chosen as a reference genome (the genome with the longest assembly), and reads from other individuals were mapped to the reference to call SNPs. During the original DNA preparation, efforts were made to remove non-MAC DNA, resulting in varying amounts of mitochondrial DNA present in the extracts from each isolate. Hence, some individuals mapped with extremely low coverage (4 to zero read depth) to the reference genome, especially after applying stringent filters to call SNPs (supplementary table S9, Supplementary Material online). Thus, some individuals were excluded from this analysis, including all P. biaurelia isolates. The subsequent analysis was therefore restricted to four species.

All of the mitochondrial genomes are about 40 kb in length. We identified 626 SNPs in P. tetraurelia, 5,307 in P. sexaurelia, 4,771 [3,041] in P. caudatum [subset], and 4,600 [1,478] in P. multimicronucleatum [subset]. Within all species, very few indels were found: just 1 in P. tetraurelia, 9 in P. sexaurelia, and 15 [13] and 23 [9] in P. caudatum [subset] and P. multimicronucleatum [subset], respectively. The nucleotide diversity at 4-fold degenerate sites was found to be 0.009 in P. tetraurelia, 0.169 in P. sexaurelia, 0.127 [0.093] in P. caudatum [subset], and 0.172 [0.073] in P. multimicronucleatum [subset]. Nucleotide diversity across different sites shows that rRNAs are the most conserved sequences in the ciliate mitochondrial genomes. Transfer RNAs are the next most highly conserved, with their diversity levels only slightly above those of replacement sites (table 2). Our estimates of diversity in the Paramecium mitochondria are higher than those of a number of other unicellular eukaryotes studied so far: P. falciparum (π_syn = 0.0012) (Joy et al. 2003) and Chlamydomonas reinhardtii (π_silent = 0.0085) (Smith and Lee 2008).

We also recalculated nuclear nucleotide diversity (table 2) at 4-fold degenerate sites for the same set of individuals used in the mitochondrial analysis and observed that the mitochondrial to nuclear diversity ratio is close to 1.0 in all species (table 2) except P. sexaurelia, where the mitochondrial diversity is 5.7-fold higher than that in the nucleus. On excluding P. sexaurelia, we obtain a strong correlation between 4-fold degenerate nucleotide diversity in the mitochondria and nucleus (slope = 0.95, R² = 0.96). It should be noted that there are not enough data points to make any inferences about statistical significance. However, such a positive correlation between the nuclear and mitochondrial diversity suggests that mitochondrial diversity might scale with population size in ciliates. This bears on previous studies in animals, which have yielded conflicting findings about whether mitochondrial diversity correlates with population size (Bazin et al. 2006; Nabholz et al. 2008; Nabholz et al. 2009; Piganeau and Eyre-Walker 2009). The observed correlation also suggests that mitochondrial and nuclear spontaneous mutation rates may be correlated. Direct estimates of mutation rates in C. elegans, D. melanogaster, S. cerevisiae, and H. sapiens, obtained via mutation accumulation experiments, suggest a moderate correlation (slope = 0.05, R² = 0.48) between mitochondrial and nuclear base substitution rates (using data from Lynch et al. 2008). Discordance between the nuclear and mitochondrial diversity, for instance in P. sexaurelia, can be attributed to various factors such as positive selection (Hudson and Turelli 2003; Galtier et al. 2009; James et al. 2016), demographic events (Pool and Nielsen 2007), or simply differences in nuclear and/or mitochondrial mutation rates (Lynch 2007). Future studies are needed to dissect the factors responsible for the remarkably different ratio of mitochondrial to nuclear diversity in P. sexaurelia.

Table 2. Nucleotide Diversity (per site) of Different Sites across the Mitochondrial Genomes.

Species | 4-fold (π_m) [SE] | 2-fold [SE] | 0-fold [SE] | tRNA [SE] | rRNA [SE] | intergenic [SE] | 4-fold nuclear (π_n) [SE] | π_m/π_n
P. tetraurelia | 0.0091 [0.0010] | 0.0107 [0.0009] | 0.0019 [0.0002] | 0.0000 [0.0000] | 0.0006 [0.0002] | 0.0041 [0.0003] | 0.0058 [2.26 × 10^-5] | 1.57
P. sexaurelia | 0.1692 [0.0033] | 0.1427 [0.0024] | 0.0097 [0.00043] | 0.0000 [0.0000] | 0.0059 [0.0007] | 0.0417 [0.0027] | 0.02945 [4.73 × 10^-5] | 5.75
P. caudatum | 0.1272 [0.0037] | 0.0942 [0.0019] | 0.0179 [0.0006] | 0.0455 [0.0023] | 0.0136 [0.0012] | 0.0333 [0.0027] | 0.1267 [1.08 × 10^-4] | 1.00
P. caudatum (subset) | 0.0929 [0.0039] | 0.0859 [0.0021] | 0.0149 [0.0006] | 0.0375 [0.0023] | 0.0093 [0.0011] | 0.0292 [0.0028] | 0.0658 [9.63 × 10^-5] | 1.41
P. multimicronucleatum | 0.1727 [0.0047] | 0.0831 [0.0019] | 0.0227 [0.0008] | 0.0263 [0.0011] | 0.0269 [0.0016] | 0.0329 [0.0022] | 0.1599 [1.34 × 10^-4] | 1.08
P. multimicronucleatum (subset) | 0.0730 [0.0034] | 0.0383 [0.0015] | 0.0121 [0.0006] | 0.0126 [0.0009] | 0.0135 [0.0012] | 0.0312 [0.0022] | 0.0742 [9.56 × 10^-5] | 0.98
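The mitochondrial to nuclear diversity ratios discussed above can be re-derived from the 4-fold degenerate diversity estimates reported in table 2. The following is a minimal sketch; the dictionary and function names are illustrative and are not part of the authors' pipeline.

```python
# 4-fold degenerate diversity per site as reported in table 2:
# (mitochondrial pi_m, nuclear pi_n) for each species or subset.
TABLE2 = {
    "P. tetraurelia": (0.0091, 0.0058),
    "P. sexaurelia": (0.1692, 0.02945),
    "P. caudatum": (0.1272, 0.1267),
    "P. caudatum (subset)": (0.0929, 0.0658),
    "P. multimicronucleatum": (0.1727, 0.1599),
    "P. multimicronucleatum (subset)": (0.0730, 0.0742),
}


def diversity_ratio(pi_m, pi_n):
    """Ratio of mitochondrial to nuclear nucleotide diversity."""
    return pi_m / pi_n


# Hypothetical helper structure: one ratio per species or subset.
ratios = {name: diversity_ratio(m, n) for name, (m, n) in TABLE2.items()}

if __name__ == "__main__":
    for name, ratio in ratios.items():
        print(f"{name}: {ratio:.2f}")
```

Only P. sexaurelia departs strongly from a ratio of about 1, which is the discordance the text attributes to selection, demography, or mutation-rate differences.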

Conclusions

This is the first study in which multiple individuals of several species of free-living unicellular eukaryotes have been sequenced and used to reveal genome-wide patterns of variation. This study thus provides an invaluable resource for future research in Paramecium, a major model organism in cell biology. Sequencing of multiple individuals from the same lake or pond is needed to further resolve the global versus local population structure of these cosmopolitan unicellular species. Overall, our results suggest that most of the intergenic regions across the highly compact genomes of Paramecium species are strongly constrained and probably harbor functionally important regulatory modules, which might extend into coding regions. We show that patterns of nucleotide diversity across the genome are driven primarily by a combination of purifying selection and mutation rate variation. In future, such analyses will benefit greatly from incorporating variation in recombination rates across the genomic landscape. We also find that duplicates created by the WGD appear to be evolving under similar constraints. Synonymous to non-synonymous diversity ratios need to be supplemented with divergence levels to distinguish between the various types of selective forces acting on the duplicates. Thus, sequencing of much closer outgroup species would allow us to further understand the role of selection in the evolution of different genomic elements, including gene duplicates that resulted from ancient WGDs.

Methods

Sampling, DNA Extraction, and Sequencing

We examined a total of 56 Paramecium strains originally isolated from environmental sources, including 11 isolates of P. biaurelia, 13 isolates of P. sexaurelia, 12 isolates of P. tetraurelia, 10 isolates of P. caudatum, and 10 isolates of P. multimicronucleatum. A total of 14 strains from P. biaurelia and P. sexaurelia were kindly provided by E. Przybos, whereas the remaining strains were cultured in the Paramecium collections of the Institute of Hydrobiology, TU Dresden, Germany, or the Department of Biology, Indiana University, Bloomington, USA. Prior to our analyses, P. tetraurelia strain Moz 13CXI, as well as P. sexaurelia strains Moz 13BIII and Indo17I, were kindly identified by E. Przybos by performing inter- and intrastrain crosses (Przybos 1975). Paramecia were maintained in 0.25% Cerophyl medium (modified according to Sonneborn 1970; Krenek et al. 2011) inoculated with Enterobacter aerogenes. Since P. aurelia species can undergo autogamy (self-fertilization) induced by starvation, and all strains have been cultured under lab conditions for several years, all P. aurelia strains in this study are expected to be more homozygous than wild isolates.

Prior to DNA extraction, all cultures were initiated from a single cell, ensuring clonality. A modified sucrose density separation (Aury et al. 2006) was performed to extract primarily macronuclei (MACs) and to reduce the amount of non-MAC DNA potentially compromising subsequent analyses. Briefly, Paramecium cells were starved for at least two days and filtered through several layers of sterile cheesecloth prior to cell harvesting via gentle centrifugation. Pelleted cells were gently lysed in 0.25 M sucrose medium (pH 6.8) containing 10 mM Tris-HCl, 8 mM MgCl2, 3 mM CaCl2, 0.5–1% Nonidet-P40, and 0.1–0.2% sodium deoxycholate. Macronuclei were subsequently isolated away from cellular debris, micronuclei, and mitochondria by centrifugation at 100 rcf in 0.25 M sucrose wash buffer (see above, without Nonidet-P40 and Na-deoxycholate). Genomic DNA extraction from MAC isolations was performed by a modified CTAB procedure (Gustincich et al. 1991).

All genomes were sequenced using the Illumina HiSeq 2000, and paired-end reads with insert sizes between 200 and 400 bp and a mean length of 151 bp were obtained. The reads mapped with coverage ranging between 40× and 80× (details given in supplementary table S1, Supplementary Material online).

Paramecium Reference Genomes

Complete genomes of P. tetraurelia, P. biaurelia, P. sexaurelia, and P. caudatum were downloaded from ParameciumDB (Arnaiz et al. 2007; Arnaiz and Sperling 2011) along with the annotations. Ortho-paralog relationships as well as functional annotations were retrieved from the supplementary information provided in McGrath et al. (2014a, 2014b). RNAseq data for all four species were downloaded and processed as in Gout and Lynch (2015).

The genome of P. multimicronucleatum was assembled using Celera (Miller et al. 2008) and annotated using EuGene (Foissac et al. 2008) as previously described (McGrath et al. 2014a, 2014b). A detailed description of the sequencing, assembly, and annotation of P. multimicronucleatum can be found in Marinov et al. (in preparation). OMA (Orthology MAtrix) (Altenhoff et al. 2015) was used to find orthologs between P. caudatum and P. multimicronucleatum. A total of 11,002 genes of P. caudatum were found to have 1-1 orthologs in P. multimicronucleatum, and only 1-1 orthologs were used for the current analysis.

Mapping of Reads and Detection of SNPs and Indels

Sequenced paired-end reads were mapped to the reference genomes for each species using BWA (bwa mem) (Li and Durbin 2010) with default parameters, and the mpileup function in Samtools (Li et al. 2009) was used to generate mpileup files for each individual. Two distinct pipelines were employed to identify SNPs to ensure detection accuracy: Bcftools-Vcftools (Li 2011) and the maximum likelihood Genotype Frequency Estimator (referred to as GFE) (Lynch 2009; Maruki and Lynch 2015). Both pipelines produced highly overlapping sets of SNPs (supplementary table S2, Supplementary Material online). For both pipelines, sites were only considered for SNP detection and further analysis if the mapping quality was above 30, base quality was above 20, per-base alignment quality was above 15, and the sum of the depth of coverage for all individuals was at least 5 times the number of individuals and less than twice the average population coverage. SNPs were called by bcftools if the SNP quality score was above 30, and by GFE if the significance score was > 5.91 (P < 0.05). The exact parameters of the Bcftools-Vcftools pipeline to call SNPs and indels are given below.

samtools mpileup -q 30 -Q 20 -t DP -uvf reference_genome.fasta sorted.bam > species.vcf

bcftools call -f GQ -o species_snps.vcf -O v -V indels -cv species.vcf

bcftools call -f GQ -o species_indels.vcf -O v -V snps -cv species.vcf

# filtering snp and indel files for depth and biases

perl vcfutils.pl varFilter -d 50 -D 800 -1 -2 -3 -4 species_snps.vcf > species_snps.filtered1.vcf

perl vcfutils.pl varFilter -d 50 -D 800 -1 -2 -3 -4 species_indels.vcf > species_indels.filtered1.vcf

vcftools --vcf species_snps.filtered1.vcf --minQ 20 --minGQ 30 --minDP 4 --recode --recode-INFO-all --out species_snps.filtered2

vcftools --vcf species_indels.filtered1.vcf --minQ 20 --minGQ 30 --minDP 4 --recode --recode-INFO-all --out species_indels.filtered2

# filter out the recoded sites that are not polymorphic anymore, that is, AF1 = 1

bcftools view -q 0.001 species_snps.filtered2.recode.vcf -o species_snps.final.vcf

bcftools view -q 0.001 species_indels.filtered2.recode.vcf -o species_indels.final.vcf

Mitochondrial SNPs were called using the Bcftools-Vcftools pipeline with the same parameters as noted above.
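The site-level eligibility criteria described in this section (mapping, base, and alignment quality thresholds, plus lower and upper bounds on summed depth) can be expressed as a single predicate. This is a sketch under stated assumptions: the function name and argument layout are hypothetical, and the lower depth bound is treated as inclusive, which is an interpretation rather than something the text states.

```python
def site_passes_filters(map_q, base_q, align_q, total_depth,
                        n_individuals, mean_population_depth):
    """Return True if a site is eligible for SNP calling.

    Thresholds follow the text: mapping quality > 30, base quality > 20,
    per-base alignment quality > 15, and summed depth across individuals
    of at least 5 x n_individuals (assumed inclusive) and below
    2 x the average population coverage.
    """
    min_depth = 5 * n_individuals
    max_depth = 2 * mean_population_depth
    quality_ok = map_q > 30 and base_q > 20 and align_q > 15
    depth_ok = min_depth <= total_depth < max_depth
    return quality_ok and depth_ok


if __name__ == "__main__":
    # Illustrative values: 10 isolates, average summed coverage of 400.
    print(site_passes_filters(40, 30, 20, 300, 10, 400))
```

With 10 isolates and an average summed coverage of 400 (values chosen for illustration), the bounds come out as 50 and 800, the same numbers passed to varFilter via -d and -D in the commands above.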

Simulation of SNP Detection with Varying Levels of Divergence from the Reference

We simulated reads using wgsim (https://github.com/lh3/wgsim) for a diploid genome with varying levels of divergence from the reference genome of P. caudatum and P. multimicronucleatum, using average values of read length, total number of reads, and insert sizes.

For the P. caudatum reference genome:

wgsim -d 287 -s 113 -N 10201087 -1 151 -2 151 -r 0.10 -R 0.013 caudatum_assembly.fasta caud_0.10_1.fastq caud_0.10_2.fastq >& caud_0.10.info

For the P. multimicronucleatum reference genome:

wgsim -d 261 -s 102 -N 12154044 -1 151 -2 151 -r 0.10 -R 0.011 multimicronucleatum_assembly.fasta multi_0.10_1.fastq multi_0.10_2.fastq >& multi_0.10.info

Reads were simulated for r = 0.0, 0.05, 0.10, 0.15, 0.20, 0.25, and 0.30, where r represents the fraction of the genome with SNPs. The SNP detection pipeline described above was used to call SNPs, which were compared to the simulated set of SNPs (supplementary table S4, Supplementary Material online).

Building Phylogenies

A consensus genome (bcftools view species_filtered.recode.vcf | vcfutils.pl vcf2fq > species.cns.fq) was obtained for each individual using bcftools; heterozygous sites were represented using DNA ambiguity codes. Gene nucleotide sequences that are present in single copy in the P. aurelia species and are also present in both outgroup species (P. caudatum and P. multimicronucleatum) were extracted, and multiple sequence alignments were generated using MUSCLE (Edgar 2004a, 2004b). This amounted to about 55 sequences (including the reference genomes) in each of 883 sets of orthologous genes. RAxML 8.0.26 (Stamatakis 2014) was used to build a tree using the concatenation of all 883 aligned gene sequences (raxml -s concatenated_sequences.fasta -n gene -m GTRGAMMA -T 4 -p 7). In addition, individual gene trees (raxml -s gene.fasta -n gene -m GTRGAMMA -T 4 -p 7) were built, which were then used to create a consensus tree (raxml -J MR -z trees.newick -T 4 -m GTRGAMMA) (Aberer et al. 2010).

The mitochondrial phylogeny was built using the same parameters as above by concatenating eight sets of orthologous genes (NADH_dehydrogenase_subunit_1_a, NADH_dehydrogenase_subunit_5, Ymf56orf78, Ymf67orf265_1, Ymf68orf393, apocytochrome_b, cytochrome_c_oxidase_subunit_1, cytochrome_c_oxidase_subunit_2, ribosomal_protein_S14) that were ubiquitously present across all sampled species.

Principal Component Analysis

Genotypes for all individuals were obtained from the vcf files generated by vcftools for all scaffolds. Only the sites that had a genotype available for all individuals were used in the analysis; i.e., if even a single individual had missing data, the site was excluded. A total of 10,000 markers were used for each species. The first marker of each scaffold was picked randomly within the first 50 SNP positions, and the rest were selected so that all markers were uniformly distributed across all scaffolds. PCA was carried out on these markers using the R (R-Core-Team 2014) command prcomp (https://stat.ethz.ch/R-manual/R-devel/library/stats/html/prcomp.html) and visualized using the first two components with the R graphing package ggplot2 (Wickham 2010).
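The marker thinning step can be sketched as follows for a single scaffold. This is only one possible reading of the procedure: it assumes that "uniformly distributed" means taking every step-th SNP by index after a random starting SNP, whereas the original analysis may have spaced markers by physical distance. The function name and the step parameter are hypothetical.

```python
import random


def pick_markers(snp_positions, step, rng, first_window=50):
    """Pick evenly spaced markers from one scaffold's sorted SNP positions.

    The first marker is drawn at random from the first `first_window`
    SNPs; later markers are every `step`-th SNP after it (an assumed
    interpretation of uniform spacing).
    """
    if not snp_positions:
        return []
    start = rng.randrange(min(first_window, len(snp_positions)))
    return snp_positions[start::step]


if __name__ == "__main__":
    rng = random.Random(7)  # fixed seed so the example is repeatable
    snps = list(range(0, 10000, 10))  # toy scaffold with 1,000 SNPs
    print(pick_markers(snps, 100, rng))
```

Passing an explicit random.Random instance keeps the selection reproducible across runs, which matters when the same marker set has to be reused for plotting.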

Calculation of Population Genetic Statistics

Nucleotide Diversity and Tajima's D

Nucleotide diversity was calculated as 2pqn/(n-1), where p and q are the major and minor allele frequencies and n is the number of chromosomes. The major and minor allele frequencies were given for each site by GFE. Diversity at coding sites was partitioned into synonymous and non-synonymous components (e.g., 2-fold degenerate sites are one-third synonymous and two-thirds non-synonymous). Major allele frequencies were also extracted from vcftools and returned very similar results to those from GFE (supplementary table S3, Supplementary Material online).
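As a worked illustration of the per-site formula 2pqn/(n-1), the sketch below computes diversity from major allele frequencies. The function names are illustrative; in the study the frequencies came from GFE, whereas here they are simply passed in as numbers.

```python
def site_diversity(p, n):
    """Per-site nucleotide diversity 2pq * n / (n - 1), with q = 1 - p.

    p is the major allele frequency and n the number of chromosomes.
    """
    if n < 2:
        raise ValueError("need at least two chromosomes")
    q = 1.0 - p
    return 2.0 * p * q * n / (n - 1)


def mean_diversity(major_freqs, n):
    """Average per-site diversity over a list of sites (monomorphic
    sites are included by passing p = 1.0)."""
    return sum(site_diversity(p, n) for p in major_freqs) / len(major_freqs)


if __name__ == "__main__":
    # Two sites, one monomorphic and one with a major allele at 0.9.
    print(mean_diversity([1.0, 0.9], n=10))
```

The n/(n-1) factor is the usual small-sample correction, so a site with p = q = 0.5 in a sample of two chromosomes has a diversity of exactly 1.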


A sliding window approach was used to calculate Tajima's D across the genome. Vcftools was used to obtain these values for non-overlapping windows of 1 kb in length. The program ms (Hudson 2002) was used to perform 100,000 simulations of 10 individuals with 1 kb locus length, and niceStats (https://github.com/kern-lab/shIC/blob/master/niceStats.c) was used to calculate Tajima's D from the ms output file. To obtain Tajima's D for different categories of sites, SNPs belonging to a particular category of sites (specified in Results) were selected, and vcftools was used to calculate Tajima's D for non-overlapping windows of 10 kb. Windows with 2 or more SNPs were used to calculate the mean, and only alternate windows were used to compare distributions of Tajima's D across different categories by Student's t-test, to maximise independence of sampling.
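For readers who want to see what the per-window statistic involves, the following sketch implements the standard Tajima (1989) formula for a single window. It is not the vcftools or niceStats code; the function name is hypothetical, and pi here is the mean number of pairwise differences summed over the window rather than a per-site value.

```python
from math import sqrt


def tajimas_d(n, seg_sites, pi):
    """Tajima's D for n sequences, seg_sites segregating sites, and
    pi = mean number of pairwise differences in the window.

    Returns None when the statistic is undefined (fewer than four
    sequences, where the variance term vanishes, or no segregating sites).
    """
    if n < 4 or seg_sites < 1:
        return None
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / sqrt(variance)


if __name__ == "__main__":
    # Illustrative inputs: 10 sequences, 16 segregating sites.
    print(tajimas_d(10, 16, 3.888889))
```

A negative value indicates an excess of low-frequency variants relative to the neutral expectation, and a positive value an excess of intermediate-frequency variants.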

Plotting Average Expected Heterozygosity across Introns and Intergenic Regions

All introns (< 500 bp in length) were split midway and aligned according to the 5′ and 3′ splice sites. All intergenic regions (< 500 bp in the outgroup species and < 1,000 bp in the P. aurelia species) were split midway and aligned according to the start and stop codons. Note that these were not sequence alignments. At each site, the mean expected heterozygosity was calculated. Sites with an insufficient number of data points (< 200 data points per site for introns; for intergenic regions, < 100 data points per site for the outgroups and < 1,000 data points per site in the P. aurelia species) were discarded.

Calculating and Plotting Divergence across Introns and Intergenic Regions

Orthologous coding sequences along with introns were aligned using MUSCLE (Edgar 2004a). Intergenic regions flanked by orthologous genes on both sides were also aligned, along with the flanking coding exon, using MUSCLE. For each species separately, divergence at different positions (with respect to the start and stop sites) was averaged across all introns and intergenic regions. The divergence values were corrected for multiple substitutions using the Jukes-Cantor sequence substitution model and plotted in the same manner as described above.
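The Jukes-Cantor correction mentioned above converts an observed proportion of differing sites, p, into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). A minimal sketch, with an illustrative function name:

```python
from math import log


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed proportion p of
    differing sites. The correction is undefined for p >= 0.75."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75)")
    return -0.75 * log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    for p in (0.01, 0.10, 0.30):
        print(p, round(jukes_cantor(p), 4))
```

The corrected distance is always at least as large as the raw proportion, and the gap widens as divergence increases, which is why the correction matters most for comparisons between the more distantly related species.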

GO Analysis

All gene ontology enrichment analyses were performed using the BiNGO plugin for Cytoscape 3.2.1 (Shannon et al. 2003; Maere et al. 2005), using the hypergeometric test to assess significance with Benjamini and Hochberg FDR correction for multiple testing (Benjamini and Hochberg 1995). The latest ontology files (.obo) were downloaded from www.geneontology.org, and custom annotation files were prepared as required using python scripts.

Identification of Heterozygous Regions in the P. aurelia Species

The fraction of heterozygous sites was calculated for 10 bp non-overlapping windows across the genome for all three P. aurelia species, and the top 1% of windows with the highest fraction of heterozygous sites were identified as heterozygous regions.
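The window-ranking step can be sketched as below. This is an illustration under assumptions rather than the authors' script: it takes 0-based positions of heterozygous sites on a single sequence, treats every position in a window as callable, and breaks ties by window order. The function and parameter names are hypothetical.

```python
def heterozygous_windows(het_positions, genome_length, window=10,
                         top_fraction=0.01):
    """Return start coordinates of the non-overlapping windows with the
    highest fraction of heterozygous sites (top `top_fraction` of windows).

    het_positions are 0-based coordinates of heterozygous sites.
    """
    n_windows = (genome_length + window - 1) // window
    counts = [0] * n_windows
    for pos in het_positions:
        counts[pos // window] += 1
    fractions = []
    for i, count in enumerate(counts):
        # The last window may be shorter than `window`.
        size = min(window, genome_length - i * window)
        fractions.append(count / size)
    n_top = max(1, round(n_windows * top_fraction))
    ranked = sorted(range(n_windows), key=lambda i: fractions[i],
                    reverse=True)
    return sorted(i * window for i in ranked[:n_top])


if __name__ == "__main__":
    print(heterozygous_windows([5, 250, 251, 252, 999], 1000))
```

In a real analysis the denominator would more likely be the number of sites with a confident genotype call in each window, which this sketch does not track.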

Mitochondrial Genome Assembly and Annotation

Mitochondrial genomes were assembled as follows. Sequence reads were assembled using SPAdes (version 3.5.0) (Bankevich et al. 2012) after removing potential adapter sequence with Trimmomatic (version 0.33) (Bolger et al. 2014). Mitochondrial contigs were identified from the resulting assemblies by BLAST searches against the published P. caudatum and P. tetraurelia mitochondrial genomes (Pritchard et al. 1990; Barth and Berendonk 2011).

Protein coding genes were identified by generating all ORFs longer than 60 aa in all six frames using the Mold, Protozoan, and Coelenterate Mitochondrial Code (i.e., UGA codes for W instead of being a stop codon) and all alternative start codons specific to Paramecium (AUU, AUA, AUG, AUC, GUG, and GUA), and retaining the longest ORF associated with each stop codon. BLASTP (Altschul et al. 1997) was then used to identify homologs of annotated mitochondrial proteins in P. tetraurelia and P. caudatum. Additional ORFs were identified by imposing the requirement that their length exceeds 100 aa and were subsequently annotated using BLASTP against the non-redundant protein sequences (nr) database and HMMER 3.0 (Eddy 2011) scans against the Pfam 27.0 database (Finn et al. 2014).
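The ORF generation step can be illustrated with a short six-frame scan. This is a sketch under assumptions, not the authors' code: sequences are given in the DNA alphabet, only TAA and TAG terminate an ORF (TGA is read as tryptophan), ORFs that run off the end of the sequence without a stop are ignored, and coordinates are reported on the scanned strand. All names are hypothetical.

```python
# Start codons listed in the text, written in the DNA alphabet.
STARTS = {"ATT", "ATA", "ATG", "ATC", "GTG", "GTA"}
# TGA is not a stop here because it encodes tryptophan in this code.
STOPS = {"TAA", "TAG"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def longest_orfs(seq, min_aa=60):
    """Return (strand, frame, start, end, n_codons) for the longest ORF
    ending at each stop codon, in all six frames.

    start and end are 0-based, end-exclusive coordinates on the scanned
    strand; n_codons excludes the stop codon and must exceed min_aa.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            first_start = None  # earliest start since the previous stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if codon in STOPS:
                    if first_start is not None:
                        n_codons = (i - first_start) // 3
                        if n_codons > min_aa:
                            orfs.append((strand, frame, first_start,
                                         i + 3, n_codons))
                    first_start = None
                elif first_start is None and codon in STARTS:
                    first_start = i
    return orfs


if __name__ == "__main__":
    toy = "ATG" + "GCT" * 61 + "TGA" + "GCT" * 3 + "TAA"
    print(longest_orfs(toy))
```

Keeping only the earliest in-frame start before each stop is what yields the longest ORF per stop codon; the internal TGA in the toy sequence is read through rather than ending the ORF.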

tRNA genes were annotated with tRNAscan-SE 1.21 (Schattner et al. 2005) using the "Mito/Chloroplast" source.

Supplementary Material

Supplementary data are available at Molecular Biology and Evolution online.

Acknowledgments

We thank Lydia Bright and Jean-Francois Gout for helpful comments on the manuscript, Matthew S. Ackerman for insightful discussions related to the project, and the anonymous reviewers and editor for their valuable suggestions. We would also like to thank Kelley Thomas for providing help with the sequencing and E. Przybos for kindly providing several Paramecium strains and the species identification of some so far unspecified strains. This work was financially supported by the National Science Foundation (MCB-1050161 and DEB-1257806), the European Commission FP7-PEOPLE-2009-IRSES project CINAR PATHOBACTER (project number 247658), and the CNRS research network GDRE "Paramecium Genome Dynamics and Evolution". T.U.B. also acknowledges funding of the "support the best" of the ZUK TU Dresden and the Deutsche Forschungsgemeinschaft (DFG).

References

Abbott RJ, Gomes MF. 1989. Population genetic-structure and outcrossing rate of Arabidopsis-Thaliana (L.) Heynh. Heredity 62:411–418.

Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, McVean GA. 2010. A map of human genome variation from population-scale sequencing. Nature 467:1061–1073.


Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA. 2012. An integrated map of genetic variation from 1,092 human genomes. Nature 491:56–65.

Aberer AJ, Pattengale ND, Stamatakis A. 2010. Parallelized phylogenetic post-analysis on multi-core architectures. J Comput Sci-Neth 1:107–114.

Ackermann AA, Panunzi LG, Cosentino RO, Sanchez DO, Aguero F. 2012. A genomic scale map of genetic diversity in Trypanosoma cruzi. BMC Genomics 13:736.

Altenhoff AM, Skunca N, Glover N, Train CM, Sueki A, Pilizota I, Gori K, Tomiczek B, Muller S, Redestig H, et al. 2015. The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Res 43:D240–D249.

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402.

Arnaiz O, Cain S, Cohen J, Sperling L. 2007. ParameciumDB: a community resource that integrates the Paramecium tetraurelia genome sequence with genetic data. Nucleic Acids Res 35:D439–D444.

Arnaiz O, Mathy N, Baudry C, Malinsky S, Aury JM, Wilkes CD, Garnier O, Labadie K, Lauderdale BE, Le Mouel A, et al. 2012. The Paramecium germline genome provides a niche for intragenic parasitic DNA: evolutionary dynamics of internal eliminated sequences. PLoS Genet 8:e1002984.

Arnaiz O, Sperling L. 2011. ParameciumDB in 2011: new tools and new data for functional and comparative genomics of the model ciliate Paramecium tetraurelia. Nucleic Acids Res 39:D632–D636.

Assefa S, Lim C, Preston MD, Duffy CW, Nair MB, Adroub SA, Kadir KA, Goldberg JM, Neafsey DE, Divis P, et al. 2015. Population genomic structure and adaptation in the zoonotic malaria parasite Plasmodium knowlesi. Proc Natl Acad Sci U S A 112:13027–13032.

Aury JM, Jaillon O, Duret L, Noel B, Jubin C, Porcel BM, Segurens B, Daubin V, Anthouard V, Aiach N, et al. 2006. Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature 444:171–178.

Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR. 2015. A global reference for human genetic variation. Nature 526:68–74.

Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, et al. 2012. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol 19:455–477.

Barth D, Berendonk TU. 2011. The mitochondrial genome sequence of the ciliate Paramecium caudatum reveals a shift in nucleotide composition and codon usage within the genus Paramecium. BMC Genomics 12:272.

Barth D, Krenek S, Fokin SI, Berendonk TU. 2006. Intraspecific genetic variation in Paramecium revealed by mitochondrial cytochrome C oxidase I sequences. J Eukaryot Microbiol 53:20–25.

Bazin E, Glemin S, Galtier N. 2006. Population size does not influence mitochondrial genetic diversity in animals. Science 312:570–572.

Beale G, Preer JR. 2008. Paramecium: genetics and epigenetics. Boca Raton (FL): CRC Press.

Begun DJ, Aquadro CF. 1992. Levels of naturally-occurring DNA polymorphism correlate with recombination rates in Drosophila melanogaster. Nature 356:519–520.

Begun DJ, Holloway AK, Stevens K, Hillier LW, Poh YP, Hahn MW, Nista PM, Jones CD, Kern AD, Dewey CN, et al. 2007. Population genomics: whole-genome analysis of polymorphism and divergence in Drosophila simulans. PLoS Biol 5:e310.

Benjamini Y, Hochberg Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B Met 57:289–300.

Betermier M. 2004. Large-scale genome remodelling by the developmentally programmed elimination of germ line sequences in the ciliate Paramecium. Res Microbiol 155:399–408.

Birky CW, Walsh JB. 1988. Effects of linkage on rates of molecular evolution. Proc Natl Acad Sci U S A 85:6414–6418.

Birnbaum RY, Clowney EJ, Agamy O, Kim MJ, Zhao J, Yamanaka T, Pappalardo Z, Clarke SL, Wenger AM, Nguyen L, et al. 2012. Coding exons function as tissue-specific enhancers of nearby genes. Genome Res 22:1059–1068.

Boel G, Letso R, Neely H, Price WN, Wong KH, Su M, Luff JD, Valecha M, Everett JK, Acton TB, et al. 2016. Codon influence on protein expression in E. coli correlates with mRNA levels. Nature 529:358–363.

Bolger AM, Lohse M, Usadel B. 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30:2114–2120.

Brandvain Y, Wright SI. 2016. The limits of natural selection in a nonequilibrium world. Trends Genet 32:201–210.

Brunk CF, Lee LC, Tran AB, Li J. 2003. Complete sequence of the mitochondrial genome of Tetrahymena thermophila and comparative methods for identifying highly divergent genes. Nucleic Acids Res 31:1673–1682.

Burger G, Zhu Y, Littlejohn TG, Greenwood SJ, Schnare MN, Lang BF, Gray MW. 2000. Complete sequence of the mitochondrial genome of Tetrahymena pyriformis and comparison with Paramecium aurelia mitochondrial DNA. J Mol Biol 297:365–380.

Burki F. 2014. The eukaryotic tree of life from a global phylogenomic perspective. Cold Spring Harb Perspect Biol 6:a016147.

Cao J, Schneeberger K, Ossowski S, Gunther T, Bender S, Fitz J, Koenig D, Lanz C, Stegle O, Lippert C, et al. 2011. Whole-genome sequencing of multiple Arabidopsis thaliana populations. Nat Genet 43:956–963.

Catania F, Wurmser F, Potekhin AA, Przybos E, Lynch M. 2009. Genetic diversity in the Paramecium aurelia species complex. Mol Biol Evol 26:421–431.

Chalker DL, Yao MC. 2011. DNA elimination in ciliates: transposon domestication and genome surveillance. Annu Rev Genet 45:227–246.

Charlesworth D. 2003. Effects of inbreeding on the genetic diversity of populations. Philos T Roy Soc B 358:1051–1070.

Charlesworth B, Morgan MT, Charlesworth D. 1993. The effect of deleterious mutations on neutral molecular variation. Genetics 134:1289–1303.

Chia JM, Song C, Bradbury PJ, Costich D, de Leon N, Doebley J, Elshire RJ, Gaut B, Geller L, Glaubitz JC, et al. 2012. Maize HapMap2 identifies extant variation from a genome in flux. Nat Genet 44:803–807.

Comeron JM, Williford A, Kliman RM. 2008. The Hill-Robertson effect: evolutionary consequences of weak selection and linkage in finite populations. Heredity 100:19–31.

Dey A, Chan CK, Thomas CG, Cutter AD. 2013. Molecular hyperdiversity defines populations of the nematode Caenorhabditis brenneri. Proc Natl Acad Sci U S A 110:11056–11060.

Dong X, Navratilova P, Fredman D, Drivenes O, Becker TS, Lenhard B. 2010. Exonic remnants of whole-genome duplication reveal cis-regulatory function of coding exons. Nucleic Acids Res 38:1071–1085.

Downing T, Imamura H, Decuypere S, Clark TG, Coombs GH, Cotton JA, Hilley JD, de Doncker S, Maes I, Mottram JC, et al. 2011. Whole genome sequencing of multiple Leishmania donovani clinical isolates provides insights into population structure and mechanisms of drug resistance. Genome Res 21:2143–2156.

Drummond DA, Bloom JD, Adami C, Wilke CO, Arnold FH. 2005. Why highly expressed proteins evolve slowly. Proc Natl Acad Sci U S A 102:14338–14343.

Eddy SR. 2011. Accelerated profile HMM searches. PLoS Comput Biol 7:e1002195.

Edgar RC. 2004a. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5:113.

Edgar RC. 2004b. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32:1792–1797.

Eichenlaub MP, Ettwiller L. 2011. De novo genesis of enhancers in vertebrates. PLoS Biol 9:e1001188.

Ellison CE, Hall C, Kowbel D, Welch J, Brem RB, Glass NL, Taylor JW. 2011. Population genomics and local adaptation in wild isolates of a model microbial eukaryote. Proc Natl Acad Sci U S A 108:2831–2836.

Elyashiv E, Bullaughey K, Sattath S, Rinott Y, Przeworski M, Sella G. 2010. Shifts in the intensity of purifying selection: an analysis of genome-wide polymorphism data from two closely related yeast species. Genome Res 20:1558–1573.

Engelthaler DM, Hicks ND, Gillece JD, Roe CC, Schupp JM, Driebe EM, Gilgado F, Carriconde F, Trilles L, Firacative C, et al. 2014. Cryptococcus gattii in North American Pacific Northwest: whole-population genome analysis provides insights into species evolution and dispersal. mBio 5:e01464–e01414.

Fawcett JA, Iida T, Takuno S, Sugino RP, Kado T, Kugou K, Mura S, Kobayashi T, Ohta K, Nakayama J, et al. 2014. Population genomics of the fission yeast Schizosaccharomyces pombe. PLoS ONE 9:e104241.

Fenchel T, Finlay BJ. 2004. The ubiquity of small species: patterns of local and global diversity. Bioscience 54:777–784.

Finlay BJ. 2002. Global dispersal of free-living microbial eukaryote species. Science 296:1061–1063.

Finlay BJ, Esteban GF, Fenchel T. 2004. Protist diversity is different. Protist 155:15–22.

Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J, et al. 2014. Pfam: the protein families database. Nucleic Acids Res 42:D222–D230.

Foissac S, Gouzy J, Rombauts S, Mathe C, Amselem J, Sterck L, Van de Peer Y, Rouze P, Schiex T. 2008. Genome annotation in plants and fungi: EuGene as a model platform. Curr Bioinform 3:87–97.

Foissner W. 1999. Protist diversity: estimates of the near-imponderable. Protist 150:363–368.

Foissner W. 2006. Biogeography and dispersal of micro-organisms: a review emphasizing protists. Acta Protozool 45:111–136.

Foissner W. 2008. Protist diversity and distribution: some basic considerations. Biodivers Conserv 17:235–242.

Foissner W. 2011. Dispersal of protists: the role of cysts and human introductions. In: Fontaneto D, editor. Biogeography of microscopic organisms: is everything small everywhere. Cambridge University Press. p. 61–87.

Foissner W, Chao A, Katz LA. 2008. Diversity and geographic distribution of ciliates (Protista: Ciliophora). Biodivers Conserv 17:345–363.

Fournier-Level A, Korte A, Cooper MD, Nordborg M, Schmitt J, Wilczek AM. 2011. A map of local adaptation in Arabidopsis thaliana. Science 334:86–89.

Galtier N, Nabholz B, Glemin S, Hurst GDD. 2009. Mitochondrial DNA as a marker of molecular diversity: a reappraisal. Mol Ecol 18:4541–4550.

Gout JF, Kahn D, Duret L, Paramecium Post-Genomics C. 2010. The relationship among gene expression, the evolution of gene dosage, and the rate of protein evolution. PLoS Genet 6:e1000944.

Gout JF Lynch M 2015 Maintenance and loss of duplicated genes bydosage subfunctionalization Mol Biol Evol 322141ndash2148

Gruchy DF 1955 The breeding system and distribution of Tetrahymenapyriformis J Protozool 2178ndash185

Gustincich S, Manfioletti G, Del Sal G, Schneider C, Carninci P. 1991. A fast method for high-quality genomic DNA extraction from whole human blood. BioTechniques 11:298–300, 302.

Hancock AM, Witonsky DB, Alkorta-Aranburu G, Beall CM, Gebremedhin A, Sukernik R, Utermann G, Pritchard JK, Coop G, Di Rienzo A. 2011. Adaptations to climate-mediated selective pressures in humans. PLoS Genet. 7:e1001375.

Hiwatashi K. 1968. Determination and inheritance of mating type in Paramecium caudatum. Genetics 58:373–386.

Hori M, Tomikawa I, Przybos E, Fujishima M. 2006. Comparison of the evolutionary distances among syngens and sibling species of Paramecium. Mol Phylogenet Evol. 38:697–704.

Huang X, Kurata N, Wei X, Wang ZX, Wang A, Zhao Q, Zhao Y, Liu K, Lu H, Li W, et al. 2012. A map of rice genome variation reveals the origin of cultivated rice. Nature 490:497–501.

Huang X, Zhao Q, Han B. 2015. Comparative population genomics reveals strong divergence and infrequent introgression between Asian and African rice. Mol Plant 8:958–960.

Hudson RR. 2002. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18:337–338.

Hudson RR, Turelli M. 2003. Stochasticity overrules the "three-times rule": genetic drift, genetic draft, and coalescence times for nuclear loci versus mitochondrial DNA. Evolution 57:182–190.

Jaillon O, Bouhouche K, Gout JF, Aury JM, Noel B, Saudemont B, Nowacki M, Serrano V, Porcel BM, Segurens B, et al. 2008. Translational control of intron splicing in eukaryotes. Nature 451:359–362.

James JE, Piganeau G, Eyre-Walker A. 2016. The rate of adaptive evolution in animal mitochondria. Mol Ecol. 25:67–78.

Joy DA, Feng X, Mu J, Furuya T, Chotivanich K, Krettli AU, Ho M, Wang A, White NJ, Suh E, et al. 2003. Early origin and recent expansion of Plasmodium falciparum. Science 300:318–321.

Kaplan NL, Hudson RR, Langley CH. 1989. The hitchhiking effect revisited. Genetics 123:887–899.

Katz LA, McManus GB, Snoeyenbos-West OLO, Griffin A, Pirog K, Costas B, Foissner W. 2005. Reframing the 'Everything is everywhere' debate: evidence for high gene flow and diversity in ciliate morphospecies. Aquat Microb Ecol. 41:55–65.

Katz LA, Snoeyenbos-West O, Doerder FP. 2006. Patterns of protein evolution in Tetrahymena thermophila: implications for estimates of effective population size. Mol Biol Evol. 23:608–614.

Krenek S, Berendonk TU, Fokin SI. 2015. New Paramecium (Ciliophora, Oligohymenophorea) congeners shape our view on its biodiversity. Org Divers Evol. 15:215–233.

Krenek S, Berendonk TU, Petzoldt T. 2011. Thermal performance curves of Paramecium caudatum: a model selection approach. Eur J Protistol. 47:124–137.

Krenek S, Petzoldt T, Berendonk TU. 2012. Coping with temperature at the warm edge – patterns of thermal adaptation in the microbial eukaryote Paramecium caudatum. PLoS One 7:e30598.

Kudla G, Murray AW, Tollervey D, Plotkin JB. 2009. Coding-sequence determinants of gene expression in Escherichia coli. Science 324:255–258.

Kvitek DJ, Will JL, Gasch AP. 2008. Variations in stress sensitivity and genomic expression in diverse S. cerevisiae isolates. PLoS Genet. 4:e1000223.

Lahr DJ, Laughinghouse HDt, Oliverio AM, Gao F, Katz LA. 2014. How discordant morphological and molecular evolution among microorganisms can revise our notions of biodiversity on Earth. Bioessays 36:950–959.

Langley CH, Stevens K, Cardeno C, Lee YC, Schrider DR, Pool JE, Langley SA, Suarez C, Corbett-Detig RB, Kolaczkowski B, et al. 2012. Genomic variation in natural populations of Drosophila melanogaster. Genetics 192:533–598.

Leffler EM, Bullaughey K, Matute DR, Meyer WK, Segurel L, Venkat A, Andolfatto P, Przeworski M. 2012. Revisiting an old riddle: what determines genetic diversity levels within species? PLoS Biol. 10:e1001388.

Li H. 2011. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27:2987–2993.

Li H, Durbin R. 2010. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26:589–595.

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Genome Project Data Processing S. 2009. The sequence alignment/Map format and SAMtools. Bioinformatics 25:2078–2079.

Lim LP, Burge CB. 2001. A computational analysis of sequence features involved in recognition of short introns. Proc Natl Acad Sci U S A. 98:11193–11198.

Liti G, Carter DM, Moses AM, Warringer J, Parts L, James SA, Davey RP, Roberts IN, Burt A, Koufopanou V, et al. 2009. Population genomics of domestic and wild yeasts. Nature 458:337–341.

Loveless MD, Hamrick JL. 1984. Ecological determinants of genetic structure in plant populations. Annu Rev Ecol Syst. 15:65–95.

Lynch M. 2007. The origins of genome architecture. Sunderland (MA): Sinauer Associates.

Lynch M. 2009. Estimation of allele frequencies from high-coverage genome-sequencing projects. Genetics 182:295–301.


Lynch M, Field MC, Goodson HV, Malik HS, Pereira-Leal JB, Roos DS, Turkewitz AP, Sazer S. 2014. Evolutionary cell biology: two origins, one objective. Proc Natl Acad Sci U S A. 111:16990–16994.

Lynch M, Sung W, Morris K, Coffey N, Landry CR, Dopman EB, Dickinson WJ, Okamoto K, Kulkarni S, Hartl DL, et al. 2008. A genome-wide view of the spectrum of spontaneous mutations in yeast. Proc Natl Acad Sci U S A. 105:9272–9277.

Maere S, Heymans K, Kuiper M. 2005. BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics 21:3448–3449.

Maguire B, Belk D. 1967. Paramecium transport by land snails. J Protozool. 14:445–447.

Maruki T, Lynch M. 2015. Genotype-frequency estimation from high-throughput sequencing data. Genetics 201:473–486.

McGrath CL, Gout JF, Doak TG, Yanagi A, Lynch M. 2014a. Insights into three whole-genome duplications gleaned from the Paramecium caudatum genome sequence. Genetics 197:1417–1428.

McGrath CL, Gout JF, Johri P, Doak TG, Lynch M. 2014b. Differential retention and divergent resolution of duplicate genes following whole-genome duplication. Genome Res. 24:1665–1675.

Miller JR, Delcher AL, Koren S, Venter E, Walenz BP, Brownley A, Johnson J, Li K, Mobarry C, Sutton G. 2008. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics 24:2818–2824.

Moradian MM, Beglaryan D, Skozylas JM, Kerikorian V. 2007. Complete mitochondrial genome sequence of three Tetrahymena species reveals mutation hot spots and accelerated nonsynonymous substitutions in Ymf genes. PLoS ONE 2:e650.

Nabholz B, Glemin S, Galtier N. 2009. The erratic mitochondrial clock: variations of mutation rate, not population size, affect mtDNA diversity across birds and mammals. BMC Evol Biol. 9:54.

Nabholz B, Mauffrey JF, Bazin E, Galtier N, Glemin S. 2008. Determination of mitochondrial genetic diversity in mammals. Genetics 178:351–361.

Neafsey DE, Galinsky K, Jiang RH, Young L, Sykes SM, Saif S, Gujja S, Goldberg JM, Young S, Zeng Q, et al. 2012. The malaria parasite Plasmodium vivax exhibits greater genetic diversity than Plasmodium falciparum. Nat Genet. 44:1046–1050.

Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, Indap A, King KS, Bergmann S, Nelson MR, et al. 2008. Genes mirror geography within Europe. Nature 456:98–101.

Ohta T. 1974. Mutational pressure as main cause of molecular evolution and polymorphism. Nature 252:351–354.

Ohta T. 1992. The nearly neutral theory of molecular evolution. Annu Rev Ecol Syst. 23:263–286.

Pawlowski J, Audic S, Adl S, Bass D, Belbahri L, Berney C, Bowser SS, Cepicka I, Decelle J, Dunthorn M, et al. 2012. CBOL protist working group: barcoding eukaryotic richness beyond the animal, plant, and fungal kingdoms. PLoS Biol. 10:e1001419.

Piganeau G, Eyre-Walker A. 2009. Evidence for variation in the effective population size of animal mitochondrial DNA. PLoS One 4:e4396.

Pool JE, Nielsen R. 2007. Population size changes reshape genomic patterns of diversity. Evolution 61:3001–3006.

Pophaly SD, Tellier A. 2015. Population level purifying selection and gene expression shape subgenome evolution in maize. Mol Biol Evol. 32:3226–3235.

Prescott DM. 1994. The DNA of ciliated protozoa. Microbiol Rev. 58:233–267.

Pritchard AE, Seilhamer JJ, Mahalingam R, Sable CL, Venuti SE, Cummings DJ. 1990. Nucleotide sequence of the mitochondrial genome of Paramecium. Nucleic Acids Res. 18:173–180.

Przybos E. 1975. Genetic studies of Paramecium jenningsi strains (Diller, Earl, 1958). Folia Biol (Krakow). 23:425–471.

Przybos E. 2005. Recent data on the occurrence of species of the Paramecium aurelia complex in Europe. Folia Biol-Krakow. 53:61–63.

Przybos E, Barth D, Berendonk TU. 2008. The Paramecium aurelia species complex, frequency and co-occurrence across Europe. Folia Biol (Krakow). 56:77–81.

Przybos E, Rautian M, Greczek-Stachura M, Potekhin A. 2007. Polymorphism within Paramecium sexaurelia (Ciliophora, Oligohymenophorea) and description of a new stand of the species in China. Folia Biol (Krakow). 55:121–125.

Przybos E, Surmacz M. 2010. New, world-wide data on the distribution of species of the Paramecium aurelia complex (Ciliophora, Protozoa). Folia Biol-Krakow. 58:185–188.

R-Core-Team. 2014. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

Redden H, Alper HS. 2015. The development and characterization of synthetic minimal yeast promoters. Nat Commun. 6:7810.

Ritter DI, Dong Z, Guo S, Chuang JH. 2012. Transcriptional enhancers in protein-coding exons of vertebrate developmental genes. PLoS One 7:e35202.

Russell CB, Fraga D, Hinrichsen RD. 1994. Extremely short 20-33 nucleotide introns are the standard length in Paramecium tetraurelia. Nucleic Acids Res. 22:1221–1225.

Salim HMW, Ring KL, Cavalcanti ARO. 2008. Patterns of codon usage in two ciliates that reassign the genetic code: Tetrahymena thermophila and Paramecium tetraurelia. Protist 159:283–298.

Schacherer J, Shapiro JA, Ruderfer DM, Kruglyak L. 2009. Comprehensive polymorphism survey elucidates population structure of Saccharomyces cerevisiae. Nature 458:342–345.

Schattner P, Brooks AN, Lowe TM. 2005. The tRNAscan-SE, snoscan and snoGPS web servers for the detection of tRNAs and snoRNAs. Nucleic Acids Res. 33:W686–W689.

Schrider DR, Kern AD. 2015. Inferring selective constraint from population genomic data suggests recent regulatory turnover in the human brain. Genome Biol Evol. 7:3511–3528.

Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. 2003. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13:2498–2504.

Simonsen KL, Churchill GA, Aquadro CF. 1995. Properties of statistical tests of neutrality for DNA polymorphism data. Genetics 141:413–429.

Smith DR, Lee RW. 2008. Nucleotide diversity in the mitochondrial and nuclear compartments of Chlamydomonas reinhardtii: investigating the origins of genome architecture. BMC Evol Biol. 8:156.

Smith JM, Haigh J. 1974. The hitch-hiking effect of a favourable gene. Genetics Res. 89:391–403.

Snoke MS, Berendonk TU, Barth D, Lynch M. 2006. Large global effective population sizes in Paramecium. Mol Biol Evol. 23:2474–2479.

Sonneborn TM. 1970. Methods in Paramecium research. In: Methods Cell Biol. New York and London: Academic Press. p. 241–339.

Sonneborn TM. 1975. Paramecium aurelia complex of fourteen sibling species. T Am Microsc Soc. 94:155–178.

Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30:1312–1313.

Stoeck T, Welter H, Seitz-Bender D, Kusch J, Schmidt HJ. 2000. ARDRA and RAPD-fingerprinting reject the sibling species concept for the ciliate Paramecium caudatum (Ciliophora, Protoctista). Zool Scr. 29:75–82.

Sun M. 1999. Cleistogamy in Scutellaria indica (Labiatae): effective mating system and population genetic structure. Mol Ecol. 8:1285–1295.

Swart EC, Nowacki M, Shum J, Stiles H, Higgins BP, Doak TG, Schotanus K, Magrini VJ, Minx P, Mardis ER, et al. 2012. The Oxytricha trifallax mitochondrial genome. Genome Biol Evol. 4:136–154.

Tajima F. 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123:585–595.

Tishkoff SA, Reed FA, Ranciaro A, Voight BF, Babbitt CC, Silverman JS, Powell K, Mortensen HM, Hirbo JB, Osman M, et al. 2007. Convergent adaptation of human lactase persistence in Africa and Europe. Nat Genet. 39:31–40.

Tsukii Y, Hiwatashi K. 1983. Genes controlling mating-type specificity in Paramecium caudatum: three loci revealed by intersyngenic crosses. Genetics 104:41–62.


Tsukii Y, Hiwatashi K. 1985. Meiotic nondisjunction and aneuploids in intersyngenic hybrids of Paramecium caudatum. Genetics 111:779–794.

Tuller T, Zur H. 2015. Multiple roles of the coding sequence 5′ end in gene expression regulation. Nucleic Acids Res. 43:13–28.

Voight BF, Kudaravalli S, Wen X, Pritchard JK. 2006. A map of recent positive selection in the human genome. PLoS Biol. 4:e72.

Volkman SK, Sabeti PC, DeCaprio D, Neafsey DE, Schaffner SF, Milner DA Jr, Daily JP, Sarr O, Ndiaye D, Ndir O, et al. 2007. A genome-wide map of diversity in Plasmodium falciparum. Nat Genet. 39:113–119.

Weir W, Capewell P, Foth B, Clucas C, Pountain A, Steketee P, Veitch N, Koffi M, De Meeus T, Kabore J, et al. 2016. Population genomics reveals the origin and asexual evolution of human infective trypanosomes. eLife 5:e11473.

Weisse T. 2008. Distribution and diversity of aquatic protists: an evolutionary and ecological perspective. Biodivers Conserv. 17:243–259.

Wickham H. 2010. ggplot2: Elegant graphics for data analysis. New York: Springer-Verlag.

Zhang JZ, Yang JR. 2015. Determinants of the rate of protein sequence evolution. Nat Rev Genet. 16:409–420.


has been especially difficult in microbes due to frequent discordance between morphological and phylogenetic species concepts and difficulties in performing mating tests under precise laboratory conditions (Lahr et al. 2014; Pawlowski et al. 2012). For instance, many lineages in ciliates have been shown to harbor a number of cryptic species that are morphologically identical but reproductively isolated (Foissner et al. 2008; Krenek et al. 2015). Some cryptic species complexes in ciliated protozoans include the P. aurelia complex (Sonneborn 1975), Halteria grandinella (Katz et al. 2005), and Tetrahymena pyriformis (Gruchy 1955). Second, the sampling of a few selected loci does not provide a true representation of genetic diversity across the genome, and this may in part be responsible for contradictory reports of extremely high or low nucleotide diversity in Paramecium (Hori et al. 2006; Snoke et al. 2006; Catania et al. 2009). Paramecium is one of the most well-studied ciliates, and cryptic species in the Paramecium aurelia complex have now been extensively characterized and resolved (Sonneborn 1975; Beale and Preer 2008), making it an ideal candidate in which to characterize genome-wide variation.

Like all ciliates, Paramecium species possess two kinds of nuclei: a macronucleus (MAC) and at least one micronucleus (MIC) (Prescott 1994). The MIC harbors a diploid, transcriptionally silent germline genome, while the MAC harbors a streamlined polyploid version of the MIC genome and is transcriptionally active (Chalker and Yao 2011). As a result of the massive genome rearrangements and sequence excisions that occur during MAC formation (Betermier 2004), the Paramecium MAC genome is relatively small (35–70 Mb) and almost entirely devoid of repetitive DNA. In addition, because P. aurelia species can undergo autogamy (self-fertilization), their genomes can be reduced to a homozygous state prior to genomic sequencing. These features allow for high-quality reference genome assemblies (McGrath et al. 2014a, 2014b). More interestingly, the Paramecium macronuclear genome is extremely compact, with one of the shortest known average intron lengths (20–25 bp) (Russell et al. 1994) and intergenic regions comprising an average of 100–200 bp. The following study therefore elucidates the forces shaping genetic variation across one of the most compact eukaryotic genomes.

Another fascinating feature of the P. aurelia complex is that, despite the simplicity of their genomes, the species diverged following at least two rounds of whole-genome duplications (WGDs) that are not shared by the outgroup species, and about 50% of genes duplicated by the most recent WGD (320 MYA) are still maintained in duplicate copies (Aury et al. 2006; McGrath et al. 2014b). The Paramecium aurelia complex thus has extremely gene-rich genomes (about 40,000 protein-coding genes in each species) and provides an excellent platform for studying the evolutionary forces responsible for the retention or loss of genes after WGD.

We thus aim to characterize genome-wide patterns of variation in the macronucleus of five deeply diverging species belonging to the genus Paramecium: three species belonging to the Paramecium aurelia complex (P. tetraurelia, P. biaurelia, P. sexaurelia) and two of the closest outgroups to the complex (P. caudatum and P. multimicronucleatum). This population-genomic study involves resequencing whole genomes of 10–13 isolates of each species, sampled worldwide. The study provides insights about the global population structure of each species and the genome-wide diversity levels across different sites and genomic elements. Heterozygosity levels are used to shed light on constraints on noncoding regions as well as individual genes. This study provides a unique and extensive dataset to understand genomic variation across five separate free-living microbial eukaryotes, in both the macronucleus and the mitochondrion, and will serve as a valuable resource for future studies of the comparative and population genomics of Paramecium.

Results and Discussion

Genetic Variation in Paramecium

We sequenced the macronuclear genomes of 10–13 isolates of each of the five species, P. tetraurelia, P. biaurelia, P. sexaurelia, P. caudatum, and P. multimicronucleatum (totaling 56 individuals), attempting to sample as globally as possible (details in supplementary table S1, Supplementary Material online). Reads were mapped (see Methods for details) to previously assembled and annotated reference genomes for each species (Marinov et al., in preparation; McGrath et al. 2014a, 2014b). Most strains mapped to 85–97% of the reference genome, with the exception of two isolates of P. caudatum and four isolates of P. multimicronucleatum, where a smaller fraction (50–70%) mapped to the reference genome (supplementary table S1, Supplementary Material online). Isolates mapping to <75% of the reference genome are hereafter referred to as divergent isolates. We thus conducted each of our analyses in P. caudatum and P. multimicronucleatum with two data sets: one including all sampled strains, and a subset excluding the divergent isolates.

We identified a total of 545,750 (P. tetraurelia), 948,377 (P. biaurelia), 3,013,676 (P. sexaurelia), 4,812,175 (P. caudatum), and 4,333,282 (P. multimicronucleatum) single-nucleotide polymorphisms (SNPs), representing about 0.7%, 1.3%, 4.7%, 17.9%, and 14.0% of their genomes, respectively. When the divergent isolates are excluded, the number of SNPs identified in P. caudatum and P. multimicronucleatum drops substantially, to 2,218,282 and 2,502,368, respectively, in both cases comprising 7.4% of the genome.
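The size of the reduction can be checked directly from the counts quoted above. The following is a minimal arithmetic sketch; the function name `fraction_removed` is illustrative and not part of the study's pipeline.

```python
# SNP counts quoted in the text: (all isolates, subset without divergent isolates).
snp_counts = {
    "P. caudatum": (4_812_175, 2_218_282),
    "P. multimicronucleatum": (4_333_282, 2_502_368),
}


def fraction_removed(total: int, subset: int) -> float:
    """Fraction of SNPs that disappear when divergent isolates are excluded."""
    if total <= 0 or subset > total:
        raise ValueError("expected 0 < subset <= total")
    return (total - subset) / total


for species, (total, subset) in snp_counts.items():
    print(f"{species}: {fraction_removed(total, subset):.1%} of SNPs removed")
```

Under these counts, roughly half of the SNPs in each outgroup species are contributed by the few divergent isolates.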

Population Structure of Paramecium

All Paramecium species in this study are cosmopolitan and found in freshwater bodies across varying habitats/climate zones (Przybos 2005; Przybos et al. 2008; Przybos and Surmacz 2010). Three different hypotheses might explain such global distributions. First, the ubiquity model or "everything is everywhere" hypothesis suggests that high and continuous dispersal rates of microbes would result in their global distribution, with no genetic isolation by distance (Fenchel and Finlay 2004; Finlay 2002). Paramecium has been proposed to disperse from one expanse of water to another by insects, birds, and larger migrating animals (Maguire and Belk 1967), including artificial dispersal by human activities (Foissner 2011). Alternatively, it is possible that the rates of dispersal are not very high, but rare events of migration over a long period of time have resulted in cosmopolitan distribution (Foissner 2006). Lastly, it is possible that the older Paramecium species, P. multimicronucleatum, P. caudatum, and P. sexaurelia (McGrath et al. 2014b), were distributed worldwide before the continents separated and therefore achieved a global distribution without extensive recent migration (Foissner 2008; Weisse 2008). Such scenarios are plausible given that Paramecia do not form cysts (Sonneborn 1975), and thus their long-distance dispersal capabilities might be limited. A consequence of either of the latter two scenarios would be strong subdivision among geographically distant populations. In principle, understanding the worldwide population structure of Paramecium would provide more insight into the ongoing debate of whether unicellular eukaryote populations have biogeographical patterns similar to those of many multicellular organisms (Foissner 2008; Foissner et al. 2008).

To gain more insight into this issue, we built phylogenetic trees using 883 single-copy orthologous protein-coding genes that are present in all species. For each orthologous gene set, all sequences (derived from each isolate and reference genomes) were concatenated into one supermatrix, and RAxML 8.0.26 (Stamatakis 2014) was used to generate a phylogenetic tree under the substitution model GTRGAMMA (fig. 1). We also generated gene trees for each of the orthologous gene sets separately and obtained a consensus tree (majority rule) (supplementary fig. S1, Supplementary Material online) (Aberer et al. 2010). Both trees exhibit very similar topologies, and all species are monophyletic. The only exception is one strain assigned to P. biaurelia (B_A, Poland) that appears to cluster within the P. tetraurelia clade and for which the sequencing data map to 97% (34× coverage) of the P. tetraurelia reference genome. We therefore excluded this most likely misidentified strain from all subsequent analyses.
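The concatenation step described above can be sketched as follows. This is an illustration only: the text does not say which tool built the supermatrix, and the function names, the gap-padding of missing taxa, and the sample names are assumptions introduced here.

```python
from typing import Dict, List


def concatenate_alignments(
    gene_alignments: List[Dict[str, str]], taxa: List[str]
) -> Dict[str, str]:
    """Concatenate per-gene alignments (taxon -> aligned sequence) into one supermatrix.

    A taxon missing from a gene is padded with gaps so that all rows keep
    the same length (an assumption; the study used genes present in all species).
    """
    parts: Dict[str, List[str]] = {taxon: [] for taxon in taxa}
    for alignment in gene_alignments:
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one alignment must have equal length")
        width = lengths.pop()
        for taxon in taxa:
            parts[taxon].append(alignment.get(taxon, "-" * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


def to_phylip(matrix: Dict[str, str]) -> str:
    """Serialize the supermatrix as relaxed sequential PHYLIP text."""
    n_sites = len(next(iter(matrix.values())))
    lines = [f"{len(matrix)} {n_sites}"]
    lines += [f"{name} {seq}" for name, seq in matrix.items()]
    return "\n".join(lines) + "\n"


# Hypothetical two-gene example with made-up sample names.
genes = [{"T_ref": "ATGC", "B_ref": "ATGA"}, {"T_ref": "GG"}]
supermatrix = concatenate_alignments(genes, ["T_ref", "B_ref"])
print(to_phylip(supermatrix))
```

The resulting file would then be passed to a tree-building program; the exact RAxML invocation used in the study is not given in this section and is not reproduced here.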

P. caudatum consists of two major clades (A and B; genome-wide Fst = 0.38) and two outgroup strains, which represent the two divergent isolates noted above. The two major clades A and B, which do not assort geographically, represent the two haplotype groups into which P. caudatum has previously been suggested to partition (Barth et al. 2006; Hori et al. 2006; Krenek et al. 2015). Interestingly, 92% of genes support the monophyly of Clade A, whereas only 77% of genes support the monophyly of Clade B (supplementary fig. S1, Supplementary Material online), possibly suggesting unidirectional gene flow between the two clades. Similarly, P. multimicronucleatum partitions into two major clades (C and D; genome-wide Fst = 0.50), and all four divergent isolates form clade D.
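This section does not state which Fst estimator was used. As an illustration of what such a statistic measures, the sketch below assumes the common form Fst = 1 − π_within / π_between computed from haploid sequences; the function names and the toy sequences are hypothetical.

```python
from itertools import combinations, product
from typing import Iterable, Sequence, Tuple


def mean_pairwise_diff(pairs: Iterable[Tuple[str, str]]) -> float:
    """Mean per-site mismatch proportion over an iterable of sequence pairs."""
    values = [sum(a != b for a, b in zip(x, y)) / len(x) for x, y in pairs]
    return sum(values) / len(values)


def fst_from_sequences(pop1: Sequence[str], pop2: Sequence[str]) -> float:
    """Fst = 1 - pi_within / pi_between; each population needs >= 2 sequences."""
    within1 = mean_pairwise_diff(combinations(pop1, 2))
    within2 = mean_pairwise_diff(combinations(pop2, 2))
    between = mean_pairwise_diff(product(pop1, pop2))
    pi_within = (within1 + within2) / 2
    return 1 - pi_within / between


# Toy example: two well-separated groups of two sequences each.
clade_a = ["AAAA", "AAAT"]
clade_b = ["TTTT", "TTTA"]
print(round(fst_from_sequences(clade_a, clade_b), 3))
```

Values near 0 indicate that within-clade diversity accounts for nearly all diversity, while values near 1 indicate strong differentiation; with very small samples this simple estimator can also return negative values.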

The P. caudatum subset (Clades A and B) and the P. multimicronucleatum subset (Clade C) seem to be composed of individuals that can successfully mate. Multiple lines of evidence have suggested that clades A and B are not reproductively isolated (Hiwatashi 1968; Tsukii and Hiwatashi 1983, 1985; Stoeck et al. 2000; Hori et al. 2006) and are not different species according to the biological species concept. Although all strains used in the study were identified via mating tests conducted in previous studies, the extent of reproductive isolation between clades C and D is not entirely clear. We therefore consider both the P. caudatum and P. multimicronucleatum subsets to be composed of a conservative set of individuals that are not reproductively isolated.

Remarkably, in none of the species do isolates appear to group together in accordance with geographical location, suggesting the absence of strong geographic population structure. To explore this issue, we performed principal component analysis (PCA) with approximately 10,000 uniformly distributed SNPs for each of the five species separately (fig. 2; supplementary fig. S2, Supplementary Material online). In general, the genetic distances between isolates do not appear to correlate with geography, with the exception of P. biaurelia, wherein the PCA plots suggest a USA-Europe separation and mildly mirror geography within Europe. In addition, linear regression shows no significant correlation between pairwise geographical and genetic distance between isolates within each species (fig. 2; supplementary fig. S2, Supplementary Material online).
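A PCA of this kind can be sketched with a singular value decomposition of the centered genotype matrix. The sketch assumes NumPy is available and that genotypes are coded as allele counts per isolate and SNP; the coding, the function name `genotype_pca`, and the toy matrix are assumptions, not the study's actual procedure.

```python
import numpy as np


def genotype_pca(genotypes, n_components: int = 2):
    """PCA of an (isolates x SNPs) matrix of allele counts.

    Returns the isolate coordinates on the leading components and the
    fraction of variance each of those components explains.
    """
    g = np.asarray(genotypes, dtype=float)
    g = g[:, g.std(axis=0) > 0]           # drop monomorphic SNPs
    centered = g - g.mean(axis=0)         # center each SNP column
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    coords = u[:, :n_components] * s[:n_components]
    explained = s**2 / np.sum(s**2)
    return coords, explained[:n_components]


# Toy matrix: isolates 0-1 and isolates 2-3 differ at the first three SNPs.
toy = np.array([[0, 0, 0, 1],
                [0, 0, 0, 0],
                [1, 1, 1, 0],
                [1, 1, 1, 1]])
coords, explained = genotype_pca(toy)
print(coords[:, 0], explained)
```

In the toy example the first component separates the two groups of isolates, which is the pattern one would look for when asking whether isolates cluster by sampling location.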

Conversely, we find that any two isolates sampled from exactly the same lake or city are always closely related on phylogenies and are indistinguishable by PCA. This can be seen in P. biaurelia (B_256-UB4 and B_256-UB2 from USA, Boston) and P. sexaurelia (S_133 and S_132 from Spain, Seville; S_129 and S_130 from Greece, Ioannina Lake). None of these pairs of isolates are identical: their genotypes differ at 11,793, 6,896, and 3,736 bases (0.00058, 0.00015, and 0.00010 divergence per site), respectively.
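Per-site divergence between two isolates, as reported above, is the number of differing genotype calls divided by the number of sites compared. A minimal sketch, assuming homozygous calls represented as single bases and `None` for uncalled sites (both representation choices are assumptions made here):

```python
from typing import Optional, Sequence, Tuple


def pairwise_divergence(
    calls_a: Sequence[Optional[str]], calls_b: Sequence[Optional[str]]
) -> Tuple[int, float]:
    """Return (number of differing sites, differences per compared site).

    Only sites with a genotype call in both isolates are compared.
    """
    if len(calls_a) != len(calls_b):
        raise ValueError("genotype vectors must cover the same sites")
    compared = [(a, b) for a, b in zip(calls_a, calls_b)
                if a is not None and b is not None]
    if not compared:
        raise ValueError("no sites called in both isolates")
    differences = sum(a != b for a, b in compared)
    return differences, differences / len(compared)


# Toy example: five sites, one uncalled in the second isolate.
diffs, per_site = pairwise_divergence(list("AACGT"), ["A", "C", None, "G", "T"])
print(diffs, per_site)
```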

Thus, overall, within the limitations of our sampling, we do not observe any clear population structure correlated with geography among any of the investigated Paramecium species. This suggests that these ciliates are effectively ubiquitously distributed and supports the "cosmopolitan model" or the "everything is everywhere" hypothesis (Finlay 2002; Finlay et al. 2004). As an exception, P. biaurelia might mildly support the "moderate endemicity model" (Foissner 1999, 2006, 2008). Although many microbial species have now been shown to exhibit biogeographies (reviewed in Foissner 2006), no such patterns are observed on the continental scale in the examined Paramecium species. Finally, the scenario that the worldwide distribution of these species occurred only before the split of Pangea is strongly refuted. Our findings rather suggest recent or ongoing migration and gene flow of Paramecium across the globe, supporting observations from previous studies (Przybos et al. 2008; Catania et al. 2009; Krenek et al. 2012). Interestingly, although there appears to be a correlation between physical and genetic distance on a very small geographical scale, this correlation disappears when evaluated globally across continents. This could be a result of very different mechanisms and rates of dispersal at shorter versus larger distances, for instance, via organismal vectors (short-distance carriers) and global atmospheric circulation or human-mediated transport (long-distance dispersal). It is also possible that autogamy in the P. aurelia species plays a role in reducing genetic variation within populations and increasing genetic differentiation between populations, as observed in self-fertilizing plant species (Loveless and Hamrick 1984; Abbott and Gomes 1989; Sun 1999; Charlesworth 2003).
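The relationship between physical and genetic distance discussed above can be explored with a simple correlation over all isolate pairs. The sketch below assumes great-circle (haversine) distances between sampling coordinates and a plain Pearson correlation; the data structures and function names are hypothetical, and the study's exact regression procedure is not described in this section.

```python
import math
from itertools import combinations
from typing import Dict, FrozenSet, Sequence, Tuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def distance_correlation(
    coords: Dict[str, Tuple[float, float]],
    genetic: Dict[FrozenSet[str], float],
) -> float:
    """Correlate geographic and genetic distance over all isolate pairs.

    coords maps isolate -> (latitude, longitude); genetic maps an unordered
    isolate pair to its genetic distance.
    """
    geo, gen = [], []
    for a, b in combinations(sorted(coords), 2):
        geo.append(haversine_km(*coords[a], *coords[b]))
        gen.append(genetic[frozenset((a, b))])
    return pearson_r(geo, gen)
```

Because pairwise distances share isolates and are therefore not independent, the significance of such a correlation is usually assessed with a permutation approach such as a Mantel test rather than read off a standard regression p-value.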


Nucleotide Diversity across the Nuclear Genome

Nucleotide diversity (measured by π) at 4-fold degenerate sites across the genome is extremely high in both P. caudatum (0.135) and P. multimicronucleatum (0.159), but much lower for the P. aurelia species: 0.027 for P. sexaurelia, 0.009 for P. biaurelia, and 0.006 for P. tetraurelia. Upon excluding the two divergent strains in P. caudatum, the nucleotide diversity at 4-fold degenerate sites halves to about 0.069. Similarly, upon excluding the four divergent strains in P. multimicronucleatum, the diversity levels fall to 0.074. Table 1 shows the nucleotide diversity across different site types: intronic, coding (4-fold degenerate, 2-fold degenerate, and 0-fold degenerate), and intergenic regions (noncoding regions that extend from the start codon of a gene to the start/stop codon of the adjacent gene).
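For readers unfamiliar with π, the sketch below computes it as the average expected heterozygosity per site, using the textbook small-sample correction n/(n−1). This is an assumption for illustration; the reference list cites likelihood-based estimators for sequencing data, and the estimator actually used for Table 1 may differ.

```python
from collections import Counter
from typing import Iterable, Sequence


def site_heterozygosity(alleles: Sequence[str]) -> float:
    """Unbiased expected heterozygosity at one site: n/(n-1) * (1 - sum p_i^2)."""
    n = len(alleles)
    if n < 2:
        raise ValueError("need at least two sampled alleles")
    freqs = [count / n for count in Counter(alleles).values()]
    return n / (n - 1) * (1 - sum(p * p for p in freqs))


def nucleotide_diversity(sites: Iterable[Sequence[str]]) -> float:
    """Average heterozygosity over all sites of a class, invariant sites included."""
    values = [site_heterozygosity(site) for site in sites]
    if not values:
        raise ValueError("no sites supplied")
    return sum(values) / len(values)


# Toy example: one invariant site and one site with a 3:1 allele split.
print(nucleotide_diversity(["AAAA", "AAAT"]))
```

Restricting the input to one site class at a time (for example only 4-fold degenerate sites, or only intergenic sites) gives the per-class comparison described in the text.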

FIG. 1. A concatenated tree of all single-copy gene sequences (maximum likelihood, GTRGAMMA) in five species, P. tetraurelia (green, T), P. biaurelia (blue, B), P. sexaurelia (orange, S), P. caudatum (red, C), and P. multimicronucleatum (purple, M), shows that all species are monophyletic, excluding one strain of P. biaurelia (B_A) that branches with P. tetraurelia. Clades A and B together constitute the P. caudatum subset, and Clade C represents the P. multimicronucleatum subset. Marked isolates are the divergent isolates that mapped to <75% of the reference genome. The numbers represent maximum likelihood estimates of branch lengths (mean number of substitutions per site).


Expected heterozygosity in introns increases significantly across all species upon exclusion of the first three and last three sites within each intron, indicating that these sites are conserved, presumably for splicing. Overall, diversity in intronic and intergenic regions is remarkably low compared to the diversity levels at 4-fold degenerate sites for all species (also seen in supplementary fig. S3, Supplementary Material online), suggesting that the non-coding regions in Paramecium are under strong selective constraints, most likely because they harbor functionally important splice sites and regulatory modules, respectively. Interestingly, while nucleotide diversity levels in the P. aurelia species are much lower in introns relative to intergenic sequences, this trend is reversed in the two outgroups. This observation is consistent with the fact that, on average, intergenic regions in the P. aurelia species are longer. A possible explanation is the nonfunctionalization of genes following the two WGDs, which would increase the amount of non-functional DNA between the two flanking functional genes and thus reduce the fraction of intergenic sites under constraints. Among the P. aurelia species, P. sexaurelia shows much higher genetic variation across the genome, consistent with previous studies (Przybos et al. 2007) that found significant genotypic differences between P. sexaurelia strains. This is also consistent with the relatively higher divergence of P. sexaurelia (McGrath et al. 2014b), suggesting that the differences in genome-wide levels of sequence variation could be driven by slight differences in mutation rate.

FIG. 2. Population structure of Paramecium species. Left: Principal component analysis using genome-wide genotypes for P. tetraurelia, P. biaurelia, and P. sexaurelia. Right: Average pairwise genetic distance as a function of physical distance between individuals within P. tetraurelia, P. biaurelia, and P. sexaurelia.

Overall, 4-fold degenerate sites have the highest levels of nucleotide diversity across all elements of the genome for all species and thus might represent the most neutrally evolving sites in the Paramecium macronucleus. The diversity values reported for P. aurelia species fall within the range of known diversity levels for other unicellular eukaryotes such as Plasmodium falciparum (4-fold degenerate π = 0.0015) (Volkman et al. 2007), Chlamydomonas reinhardtii (silent site π = 0.0319) (Smith and Lee 2008), and Tetrahymena thermophila (coding π = 0.0020) (Katz et al. 2006). On the other hand, the values of 4-fold degenerate diversity in the outgroup species P. caudatum and P. multimicronucleatum are among the highest reported (Leffler et al. 2012; Dey et al. 2013). It is still not entirely clear if such high diversity values are due to a large panmictic global population or because of population structure despite infrequent gene flow. Sampling individuals from the same pond will help dissect these two scenarios. However, we cannot presently rule out other factors, such as the presence of cryptic species among the outgroups, introgression with closely related species, or a much higher intrinsic mutation rate, that could also potentially contribute to these remarkably high levels of variation.
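Site degeneracy classes such as the 4-fold degenerate sites discussed above depend on the genetic code in use. The sketch below assumes the ciliate nuclear code (NCBI translation table 6), in which TAA and TAG encode glutamine and TGA is the only stop codon; the function name `degeneracy` and its return convention are illustrative choices.

```python
from itertools import product

BASES = "TCAG"
# Amino acids in NCBI codon order (first base varies slowest) for translation
# table 6, the ciliate nuclear code: TAA and TAG are read as Q, only TGA is a stop.
CILIATE_AA = "FFLLSSSSYYQQCC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), CILIATE_AA)}


def degeneracy(codon: str, position: int) -> int:
    """Degeneracy of one codon position (0, 1 or 2).

    Returns 4, 3 or 2 when that many bases at the position keep the amino
    acid, and 0 when every change alters it.
    """
    codon = codon.upper()
    amino_acid = CODON_TABLE[codon]
    synonymous = sum(
        CODON_TABLE[codon[:position] + base + codon[position + 1:]] == amino_acid
        for base in BASES
    )
    return 0 if synonymous == 1 else synonymous


print(degeneracy("GGA", 2))  # third position of a glycine codon
print(degeneracy("TAA", 2))  # a sense codon under the ciliate code
```

The eight 4-fold degenerate codon families are the same as under the standard code; the reassignment mainly changes how TAA and TAG sites are classified.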

Intergenic Regions in the Outgroups are Under Strong Constraints

The macronuclear genomes of all Paramecium species are extremely compact (fig. 3a and 3b), with the median intergenic region (non-coding DNA between start and start/stop sites of adjacent genes) in P. caudatum and P. multimicronucleatum comprising 43 and 93 bp, respectively. The aurelia species have slightly longer intergenic regions (median of 161 bp, 195 bp, and 229 bp in P. tetraurelia, P. biaurelia, and P. sexaurelia, respectively). Paramecium also has some of the shortest eukaryotic introns, with an average size of just 20–25 base pairs (Russell et al. 1994). The internally eliminated sequences (IESs), which are excised during the formation of the macronucleus and are well characterized only in P. tetraurelia, have a median length of 26–28 bp (Arnaiz et al. 2012). In fact, 93% of the IESs in P. tetraurelia are shorter than 150 bp (Arnaiz et al. 2012), suggesting that the micronuclear genomes of the Paramecium species are also fairly compact. In most organisms, diversity in coding regions is well below the diversity in intergenic and intronic regions. For instance, in S. cerevisiae, which has a fairly compact genome, the non-coding to coding diversity ratio is 1.23 (Schacherer et al. 2009). In S. pombe, the coding heterozygosity is 0.2122% and diversity at intergenic regions (including 5′ and 3′ UTRs) is 0.3351%, resulting in an intergenic to coding diversity ratio of 1.58 (calculated from Fawcett et al. 2014).

In the P. aurelia species, nucleotide diversity in coding regions is ~50% of the intergenic diversity, while for the outgroup species intergenic regions have much less diversity than the coding regions (table 1). In order to reduce biases that might result from differences in sequencing depth of coverage across sites, we compared expected heterozygosity at coding and intergenic regions using only sites that were covered by all individuals and had coverage very close to the median population coverage (±20 reads) (fig. 3c). While the ratio of intergenic to coding diversity in the P. aurelia species is comparable to that in other organisms, at 1.66 (P. tetraurelia), 1.64 (P. biaurelia), and 1.69 (P. sexaurelia), in the outgroup species the ratio is 0.58 [subset: 0.74] for P. caudatum and 0.80 [subset: 0.75] for P. multimicronucleatum. Hereafter, values in square brackets correspond to the subsets. We know of no other species with such low nucleotide diversity in intergenic relative to coding regions, suggesting strong conservation of intergenic sequence.
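The coverage-filtered comparison described above can be illustrated with a minimal Python sketch. It assumes biallelic sites and the standard unbiased estimator of expected heterozygosity, 2p(1 − p) · n/(n − 1), which is not necessarily the estimator used in this study; the function names, the tuple layout, and the toy data are all hypothetical.

```python
from statistics import mean

def expected_heterozygosity(alt_count, n_alleles):
    """Unbiased expected heterozygosity at one biallelic site: 2p(1-p) * n/(n-1)."""
    if n_alleles < 2:
        raise ValueError("need at least two sampled alleles")
    p = alt_count / n_alleles
    return 2.0 * p * (1.0 - p) * n_alleles / (n_alleles - 1)

def mean_diversity(sites, region, n_total, median_cov, tolerance=20):
    """Mean per-site diversity for one region class.

    Only sites called in all n_total samples, with coverage within
    +/- tolerance reads of the median population coverage, are used.
    Invariant sites (alt_count == 0) must be included in `sites`.
    """
    kept = [
        expected_heterozygosity(alt, n)
        for (reg, cov, n, alt) in sites
        if reg == region and n == n_total and abs(cov - median_cov) <= tolerance
    ]
    if not kept:
        raise ValueError(f"no usable sites for region {region!r}")
    return mean(kept)

# Hypothetical toy data: (region, coverage, sampled alleles, alternate-allele count)
sites = [
    ("coding", 100, 10, 5),
    ("coding", 105, 10, 0),
    ("coding", 300, 10, 5),      # dropped: coverage far from the median
    ("intergenic", 95, 10, 1),
    ("intergenic", 100, 9, 3),   # dropped: not covered in all samples
    ("intergenic", 110, 10, 0),
]

ratio = (mean_diversity(sites, "intergenic", 10, 100)
         / mean_diversity(sites, "coding", 10, 100))
print(round(ratio, 2))
```

Restricting both region classes to the same coverage band means that differences in the ratio are less likely to reflect differences in how well each class was sequenced.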

Despite accounting for bias in coverage, it is possible that if the reference genome is highly diverged from the sequenced individuals (as in our case), say at intergenic regions, those reads might fail to map and hence result in a lower estimate of intergenic diversity. We therefore performed simulations to assess mapping and SNP detection when the reference genome is highly diverged from individuals in the population. Our simulations show that, up to 10% divergence from the reference, SNP detection is remarkably accurate. At 10% divergence from the reference, 93% of true SNPs can be detected, and estimated divergence (9.5%) is only slightly lower than true divergence (10%) (supplementary fig. S4, supplementary table S4, Supplementary Material online). However, reads obtained from genomes that were more than 10% divergent from the reference failed to map successfully, and there is a steep reduction in the fraction of true SNPs detected, with only ~10% of the true SNPs detected at 20% divergence from the reference (supplementary fig. S4, Supplementary Material online). Most of our strains in P. caudatum are far below 10% divergence (including coding and non-coding regions separately) (supplementary table S5, Supplementary Material online), and thus overall analyses across the majority of the genome should not be affected by this bias. In P. multimicronucleatum, although the divergent strains are close to being 10% divergent from the reference (supplementary table S5, Supplementary Material online), when we consider only the P. multimicronucleatum subset, all strains are less than 6% divergent from the reference. Therefore, it is possible that such a bias is affecting our results when we include all the strains in the outgroup species, but analyses with the subsets should be minimally influenced by this bias. Nonetheless, it should be noted that, due to heterogeneity of divergence across the genome, some regions of the genome might still be affected by such a mapping bias. Accordingly, we repeated the above analysis by calling SNPs using only the 4 and 5 individuals closest to the reference genome (forming polytomies in supplementary fig. S1, Supplementary Material online) in P. caudatum and P. multimicronucleatum, respectively. These individuals were even closer to the reference genome, with average expected heterozygosity at 4-fold degenerate sites of 0.04302 and 0.06255, respectively. Again, we observe that for all sites that are covered by all individuals and have coverage very close to the median population coverage (±20 reads), the ratio of intergenic to coding diversity was 0.78 and 0.72, respectively. This suggests that the observation of relatively lower genetic variation in the intergenic relative to coding regions in the outgroup species is not an artifact of our analysis.

Table 1. Mean Nucleotide Diversity (per site) Estimated for All Isolates across the Genome (SE < 3.0 × 10⁻⁴ for all sites). Intron Represents Introns Excluding the First Three and Last Three Sites. Intergenic Region Refers to the Noncoding DNA that Extends between the Stop and Start Site of Protein-Coding Genes.

Species                           4-fold   2-fold   0-fold   syn      non-syn  intron   Intron   intergenic  avg coding
P. tetraurelia                    0.0058   0.0033   0.0012   0.0042   0.0011   0.0026   0.0032   0.0045      0.0023
P. biaurelia                      0.0089   0.0049   0.0021   0.0059   0.0018   0.0049   0.0050   0.0071      0.0035
P. sexaurelia                     0.0269   0.0165   0.0061   0.0194   0.0052   0.0151   0.0167   0.0209      0.0113
P. caudatum                       0.1352   0.0819   0.0150   0.1056   0.0134   0.0466   0.0602   0.0350      0.0472
P. caudatum (subset)              0.0693   0.0381   0.0090   0.0518   0.0080   0.0316   0.0406   0.0262      0.0241
P. multimicronucleatum            0.1599   0.0781   0.0216   0.1146   0.0193   0.0405   0.0492   0.0337      0.0530
P. multimicronucleatum (subset)   0.0742   0.0418   0.0100   0.0549   0.0089   0.0249   0.0330   0.0201      0.0259

As an alternative approach to examine the relative strength of selective constraint on coding versus non-coding DNA, we examined the distribution of insertions/deletions (indels) across the genome. A total of 29,096 (P. tetraurelia), 54,406 (P. biaurelia), 88,724 (P. sexaurelia), 32,631 [28,776] (P. caudatum [subset]), and 31,338 [28,347] (P. multimicronucleatum [subset]) indels were found across the genomes, which translates to SNP/indel ratios of 19, 17, 34, 147 [77], and 138 [88], respectively. The indel lengths are exponentially distributed in all species, with a median of 1 bp. Indels are heavily under-represented in coding regions (fig. 3d), where the median indel length is 3 bp, suggesting purifying selection against indels that cause frame-shifts. Introns and intergenic regions, however, tend to have a median indel length of 1 bp (supplementary table S6, Supplementary Material online). Again, we see that in the P. aurelia species there are proportionally more indels in the intergenic regions than in introns. On the other hand, in the outgroup species, intergenic regions tolerate fewer indels than do intronic regions. This shows that the very short intergenic regions in P. caudatum and P. multimicronucleatum are under extremely strong constraint.

FIG. 3. The compactness of Paramecium genomes. (A, B) Distribution of intron and intergenic lengths across Paramecium genomes: (A) P. tetraurelia and (B) P. caudatum. Mean intron lengths in P. tetraurelia and P. caudatum: 25 and 22 bp; median intergenic lengths in P. tetraurelia and P. caudatum: 161 and 43 bp. (C) πcoding/πintergenic calculated genome-wide from sites that had coverage very close to the median population coverage (±20 reads) and were contributed by the same number of individuals. (D) Incidence of indels across the genome for each Paramecium species, calculated by dividing the total number of indels found in a region by the total number of utilizable nucleotide sites in that region.

Spatial Patterns of Constraints across Non-Coding Regions in the Genomes

Selection in natural populations can remove strongly deleterious mutations, but weakly deleterious mutations can continue to persist in the population at lower allele frequencies (Ohta 1974, 1992). Thus, within-population nucleotide diversity can be used to infer constraints across the genome and possibly identify functional parts of the non-coding genome. We therefore evaluated spatial patterns of nucleotide diversity in exons flanking all introns and intergenic regions across the genome, for each species separately (fig. 4; supplementary fig. S5, Supplementary Material online). However, neutral sites linked to functionally important sites also experience a reduction in diversity, as a result of either purging of deleterious alleles (background selection; Charlesworth et al. 1993) or fixation of a beneficial mutation (hitchhiking; Smith and Haigh 1974; Kaplan et al. 1989) at functional loci, due to linkage (Comeron et al. 2008). As selection has a minimal effect on divergence at linked neutral sites (Birky and Walsh 1988), a comparison of polymorphism and divergence can distinguish between the effects of variation in the neutral mutation rate (which is determined by selective constraint in addition to mutational processes) and those of linked selection (Begun and Aquadro 1992). To this end, we evaluated patterns of divergence across introns and intergenic regions for the relatively more closely related P. aurelia species.

We observed a reduction in both expected heterozygosity and divergence in intergenic regions near the start and end of coding sequences, as well as at some key sites within introns (figs. 4 and 5 and supplementary fig. S6, Supplementary Material online). In fact, divergence-normalized values of expected heterozygosity fail to exhibit any spatial pattern across these non-coding elements (fig. 5; supplementary fig. S6, Supplementary Material online), suggesting that these patterns are driven predominantly by negative selection and mutational processes (although see below) rather than linked selection. This is interesting because, although asexual species are expected to experience the effects of linked selection more strongly (Charlesworth et al. 1993), very compact genomes such as those in Paramecium might experience relatively more direct selection, as they have a higher proportion of functional sites.

The first and last three sites of introns are highly conserved, even more so than 0-fold degenerate sites, as observed in patterns of both divergence and nucleotide diversity (figs. 4 and 5), consistent with the role of these positions in intron splicing (Jaillon et al. 2008). In fact, it has been suggested that the first and last three sites in short eukaryotic introns might be sufficient for accurate splicing (Lim and Burge 2001; Jaillon et al. 2008). However, we find that in Paramecium, on average, 8–10 base pairs of coding DNA immediately flanking an intron are also under stronger constraints than the remainder of the coding sequence, possibly suggesting the presence of splicing regulatory elements in the coding sequence. Another interesting observation is the clear decrease in divergence (fig. 5e and 5g) in the center of introns, suggesting that longer introns might possess functional elements related to gene regulation.

We also observe that average expected heterozygosity and divergence increase gradually moving upstream from the start codon and downstream from the stop codon into the intergenic regions, stabilizing at around 100–150 base pairs (fig. 4), suggesting the presence of regulatory modules or UTRs. All four types of intergenic regions with respect to the direction of transcription, namely (i) regions that are upstream of both flanking genes, (ii) those downstream of one gene and upstream of another, (iii) those upstream of one gene and downstream of another, and (iv) those downstream of both flanking genes, exhibit the same pattern of reduction in heterozygosity levels just before the start and after the stop codons (supplementary fig. S7, Supplementary Material online). This indicates that constraints on downstream intergenic regions are not an artifact of overlapping upstream intergenic regions. Instead, these observations suggest the presence of gene regulatory elements at the 3′ end of genes, involved either in posttranscriptional regulation, in the specification of transcription termination, or in the regulation of gene expression. We note that in the P. aurelia species, diversity levels in the intergenic regions stabilize to reach the average diversity at 4-fold degenerate sites. In contrast, in the two outgroup species, diversity in intergenic regions stabilizes only up to the equivalent of the average diversity at 2-fold degenerate sites. Such differences in patterns of nucleotide diversity between the P. aurelia species and the outgroup species can be attributed to a much higher proportion of functionally important sites in the intergenic space of the outgroup species, possibly as a result of the smaller number of bases between adjacent genes in the latter.

A similar reduction in diversity in intergenic regions extends up to 85 kb around transcription start/stop sites of genes in humans (Abecasis et al. 2010). In S. cerevisiae, intergenic regions are under strong purifying selection, and those shorter than 300 bp have significantly lower nucleotide diversity than longer intergenic regions (Schacherer et al. 2009).

Interestingly, we also observe, on average, a decrease in diversity as well as divergence over 70–100 bp (20–35 amino acids) at both the 5′ and 3′ ends of the coding sequence (CDS) of protein-coding genes. Such a reduction in diversity at 4-fold and 2-fold degenerate sites at the beginning and end of protein-coding sequences can be explained by a number of different hypotheses. The 5′ ends of CDSs might be required to maintain low thermodynamic stability of mRNA secondary structure near the start codon for efficient translation initiation. Alternatively, CDS 5′ ends might possess fewer tRNA-adapted codons to slow down translation initially, which has been proposed to reduce ribosome collision-induced premature translation terminations and facilitate protein production (see Tuller and Zur 2015). Consistent with this, the 5′ and 3′ ends of protein-coding genes in bacteria (32–35 codons) have been shown to be important for efficient translation (Kudla et al. 2009; Boel et al. 2016), and their mRNA folding energies can explain observed variation in expression levels (Kudla et al. 2009). In fact, it has been shown previously that the first few codons in proteins of P. tetraurelia and T. thermophila comprise, on average, significantly different numbers of effective codons than the rest of the protein (Salim et al. 2008). Thus, the role played by the first and last few bases of all coding sequences in translation might contribute to additional sequence constraints, resulting in reduced nucleotide diversity and divergence.

FIG. 4. Average expected heterozygosity (per site) within and flanking introns (left) and intergenic regions (right) in P. tetraurelia (a, b), P. biaurelia (c, d), P. sexaurelia (e, f), P. caudatum subset (g, h), and P. multimicronucleatum subset (i, j), separately for 4-fold degenerate sites (red), 2-fold degenerate sites (blue), 0-fold degenerate sites (green), and non-coding sites (black).

FIG. 5. Patterns of average divergence (per site) within and flanking intergenic regions (a, c) and introns (e, g) are shown in the left panels. Patterns of average expected heterozygosity per site normalized by average divergence per site within and flanking intergenic regions (b, d) and introns (f, h) are shown on the right. The divergence values are calculated between P. tetraurelia and P. biaurelia, separately for 4-fold degenerate sites (red), 2-fold degenerate sites (blue), 0-fold degenerate sites (green), and non-coding sites (black).

Another factor that might contribute to constraining the CDS ends is the presence of regulatory modules at the beginning and end of genes. Exonic enhancers have been observed in multiple species, including teleosts (Dong et al. 2010; Eichenlaub and Ettwiller 2011), humans (Birnbaum et al. 2012), and other vertebrates (Ritter et al. 2012), where they have been shown to regulate nearby genes. In the highly compact genomes of Paramecium, it is even more likely that regulatory regions extend beyond non-coding regions into the CDS. For instance, in P. caudatum the median length of 5′ intergenic regions is 43 bp, which increases to 56 bp when both genes share the same 5′ intergenic region, suggesting the importance of upstream intergenic length in such compact genomes. Interestingly, in P. caudatum, 965 genes (7%) have 5′ intergenic regions shorter than 10 bp and 3,417 genes (25%) have 5′ intergenic regions shorter than 20 bp. Similarly, in P. multimicronucleatum the median length of the 5′ intergenic region is 90 bp, which increases to 127 bp when both genes share the same 5′ intergenic region. Again, there are 456 genes (3.5%) with fewer than 10 bp and 1,760 genes (13%) with fewer than 20 bp comprising their upstream intergenic region. In fact, in both species the 5′ intergenic region can be as small as 3 bp (found in 38 genes in P. caudatum and 18 genes in P. multimicronucleatum). An interesting question this raises is what minimum number of bases is required to encode regulatory information about the transcription of genes in eukaryotes. A recent study in yeast created a high-strength minimal constitutive promoter that is only 116 bp in length (Redden and Alper 2015). It might be possible to observe even smaller promoter regions driving expression patterns in Paramecium, especially in P. caudatum.

Spatial Patterns of Nucleotide Diversity across Non-Coding Regions are Not Driven by Mutation Alone

Although we have assumed that the diversity-normalized divergence patterns indicate the effect of negative selection, it is possible that the observed spatial patterns of nucleotide diversity across and flanking non-coding regions are driven entirely by mutation. One way to distinguish between mutation and direct selection as the cause of these patterns is to evaluate the allele frequencies of the variants observed at different site types. If the reduction in nucleotide diversity near the start and end of coding sequences is mostly a consequence of negative selection, we would expect a higher fraction of rare alleles at 4-fold degenerate sites under stronger constraints relative to that seen at 4-fold degenerate sites not under constraints. If, instead, mutation is the predominant driver of the extent of nucleotide diversity at these sites, then there should be no difference in the fraction of rare alleles between the 4-fold degenerate sites under stronger negative selection and those not under constraints. Because Tajima's D (Tajima 1989) is a good statistic to summarize such a shift in rare alleles, we compared the average Tajima's D statistic to evaluate this.
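For readers less familiar with the statistic, the following is a minimal Python sketch of Tajima's D as defined by Tajima (1989), computed from allele counts at segregating biallelic sites. It is an illustration only, not the pipeline used in this study; the function name and input layout are hypothetical. An excess of rare alleles pushes D below zero, and an excess of intermediate-frequency alleles pushes it above zero.

```python
import math

def tajimas_d(allele_counts, n):
    """Tajima's D from allele counts at segregating biallelic sites.

    allele_counts: one count k (0 < k < n) per segregating site
    n: number of sampled sequences; n >= 4 is required here because the
       variance constants c1 and c2 are both zero for n = 2 and n = 3
    Returns None when there are no segregating sites.
    """
    if n < 4:
        raise ValueError("need at least four sampled sequences")
    if any(not 0 < k < n for k in allele_counts):
        raise ValueError("every site must be segregating (0 < k < n)")
    S = len(allele_counts)
    if S == 0:
        return None

    # Mean number of pairwise differences, summed over sites
    pi = sum(2.0 * k * (n - k) / (n * (n - 1)) for k in allele_counts)

    # Constants from Tajima (1989)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    # Difference between the two estimators of theta, scaled by its
    # estimated standard deviation
    return (pi - S / a1) / math.sqrt(e1 * S + e2 * S * (S - 1))

# Hypothetical toy inputs for n = 10 sequences
rare = tajimas_d([1] * 10, 10)          # all singletons
intermediate = tajimas_d([5] * 10, 10)  # all at 50% frequency
print(rare < 0, intermediate > 0)
```

Comparing the average of such values between site categories, rather than interpreting any single value, is what allows constrained and unconstrained sites to be contrasted under the same demographic history.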

In each Paramecium species, we first classified sites genome-wide into five categories: (1) splice sites (first and last 3 sites) in introns; (2) intergenic sites within 50 bp of start/stop sites; (3) intergenic sites at least 150 bp away from start/stop sites (which should exclude most UTR regions); (4) 4-fold degenerate sites within 50 bp of start/stop sites; and (5) 4-fold degenerate sites at least 150 bp away from start/stop sites and 30 bp away from splice sites. We expected category 5 sites to be the most neutral. Interestingly, in most species, Tajima's D at splice sites, intergenic sites close to start/stop sites, and 4-fold degenerate sites close to start/stop sites is significantly smaller than that of 4-fold degenerate and intergenic sites far away from any functional elements (see supplementary table S7, Supplementary Material online, for P-values). This suggests that sites in categories 1, 2, and 4 are under relatively stronger purifying selection. Category 2 sites also possess significantly more rare alleles than category 3 sites, consistently in all species, suggesting that intergenic regions very close to start/stop sites are under stronger purifying selection than intergenic sites further away from gene ends. We also find Tajima's D at intergenic sites (category 3) to be significantly higher than that at 4-fold degenerate sites (category 5) in most species. This could suggest either that intergenic sites are under some form of positive/balancing selection, or that silent sites are not completely neutral and in fact experience some purifying selection.

Gene-Based Analysis of Patterns of Synonymous and Non-Synonymous Diversity

Selective constraint can vary among genes within the same species, and it could also be species-specific. To examine such variation in our dataset, we evaluated the distribution of πn/πs across all genes (fig. 6a). If we assume that synonymous polymorphisms are fitness-neutral, then πn/πs is a measure of the relative functional constraints on non-synonymous polymorphisms, and thus an indication of functional constraints on the protein. The majority of genes have πn/πs < 1.0, consistent with constraints on non-synonymous changes. About 98% of genes in P. tetraurelia, 96% in P. biaurelia, 96% in P. sexaurelia, 99% in both P. caudatum and its subset, and 99% in both P. multimicronucleatum and its subset have πn/πs < 1. Upon comparing the average πn/πs for the same set of orthologous genes across all species, we find the mean to be 0.10 for P. tetraurelia, 0.17 for P. biaurelia, 0.24 for P. sexaurelia, 0.12 [0.12] for P. caudatum [subset], and 0.17 [0.14] for P. multimicronucleatum [subset]. Among these species, P. sexaurelia has the highest average πn/πs, which suggests either relaxed selection because of a relatively small effective population size or more local adaptation. Local adaptation might result in the fixation of more non-synonymous than synonymous differences between subpopulations, thus increasing πn/πs. Differences in πn/πs between species could also reflect genome-wide variation in the fraction of gene duplicates that are subfunctionalized, neofunctionalized, or nonfunctionalized. Alternatively, an elevation in average levels of πn/πs could be a consequence of a depression in the average levels of πs, which could be caused by stronger selection on synonymous sites or by a recent bottleneck (Brandvain and Wright 2016) experienced by the species. Evidence that genes with higher πn/πs ratios are under more relaxed constraints is provided by a comparison of the average number of frame-shift causing indels occurring in four sets of genes with (1) πn/πs ≤ 0.25, (2) 0.25 < πn/πs ≤ 0.5, (3) 0.5 < πn/πs ≤ 1.0, and (4) πn/πs > 1.0. As expected, we find an increase in the number of indels per site with increasing πn/πs (fig. 6b).
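The binning described above can be sketched in a few lines of Python. This is a hypothetical illustration rather than the authors' code: the dictionary keys, function names, and toy genes are assumptions, and pooling indels and sites within each bin is one of several reasonable ways to obtain a per-site rate. The filter on genes with more than five synonymous polymorphisms follows the figure legends.

```python
def pn_ps_bin(ratio):
    """Assign a pi_n/pi_s ratio to one of the four bins named in the text."""
    if ratio <= 0.25:
        return "<=0.25"
    if ratio <= 0.5:
        return "0.25-0.5"
    if ratio <= 1.0:
        return "0.5-1.0"
    return ">1.0"

def frameshift_indels_per_site(genes, min_syn_polymorphisms=5):
    """Pooled frame-shift indels per coding site within each pi_n/pi_s bin.

    genes: iterable of dicts with the (hypothetical) keys
           pi_n, pi_s, syn_polymorphisms, frameshift_indels, cds_length
    Genes with min_syn_polymorphisms or fewer synonymous polymorphisms,
    or with pi_s == 0, are skipped.
    """
    totals = {}
    for g in genes:
        if g["syn_polymorphisms"] <= min_syn_polymorphisms or g["pi_s"] == 0:
            continue
        label = pn_ps_bin(g["pi_n"] / g["pi_s"])
        indels, sites = totals.get(label, (0, 0))
        totals[label] = (indels + g["frameshift_indels"], sites + g["cds_length"])
    return {label: indels / sites for label, (indels, sites) in totals.items()}

# Hypothetical toy genes
genes = [
    {"pi_n": 0.001, "pi_s": 0.01, "syn_polymorphisms": 8,
     "frameshift_indels": 0, "cds_length": 1000},
    {"pi_n": 0.004, "pi_s": 0.01, "syn_polymorphisms": 10,
     "frameshift_indels": 1, "cds_length": 2000},
    {"pi_n": 0.020, "pi_s": 0.01, "syn_polymorphisms": 6,
     "frameshift_indels": 3, "cds_length": 1500},
    {"pi_n": 0.010, "pi_s": 0.01, "syn_polymorphisms": 3,   # skipped: too few
     "frameshift_indels": 5, "cds_length": 900},
]
rates = frameshift_indels_per_site(genes)
print(rates)
```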

Paralogs and Orthologs Evolve under Similar Selective Constraints within and across Species

We evaluated the correlation between πn/πs across orthologs and across within-species paralogs created by the most recent WGD (fig. 7). Both paralogs and orthologs show a significant positive correlation within and across all species, suggesting that most WGD duplicates are under similar selective constraints. We also confirmed that these correlations are not being driven simply by selection on synonymous sites, as πs is not necessarily correlated between orthologs and within-species paralogs across all species (supplementary fig. S8, Supplementary Material online). Similar findings in the maize genome also suggested correlations between nucleotide diversity and divergence between WGD paralogs (Pophaly and Tellier 2015). Our results further suggest that most orthologs are under very similar evolutionary constraints across the Paramecium species. This is in agreement with a recent study in Drosophila, where πn and πs were found to be strongly correlated between D. melanogaster and D. simulans (Langley et al. 2012). Such a correlation was also observed between orthologs in S. cerevisiae and S. paradoxus (Elyashiv et al. 2010).

FIG. 6. (A) πn versus πs of each gene. The red line represents πn = πs. (B) The distribution of πn/πs of each gene across the genome. Only genes with more than five synonymous polymorphisms were used in this analysis. (C) Number of frame-shift causing indels per site in genes belonging to three gene categories: (1) πn/πs ≤ 0.5 (light grey), (2) 0.5 < πn/πs ≤ 1.0 (grey), and (3) πn/πs > 1.0 (black). Only genes with more than five synonymous polymorphisms were used in this analysis.

A possible mechanism governing the degree of selective constraint acting on proteins is their level of expression (Zhang and Yang 2015). Non-synonymous mutations are thought to have more severe fitness consequences in highly expressed genes than in weakly expressed ones (Drummond et al. 2005; Gout et al. 2010). Indeed, we find a strong negative correlation between expression and πn/πs in all species (fig. 8). However, only 6–13% of the variance is explained by expression, implying that additional factors may play a greater role in determining the degree of selective constraint acting on a gene. Simple biochemical properties of proteins, such as the protein's function, tertiary structure, and number of interacting partners, might determine the proportion of functionally important residues in the protein and in turn govern the degree of selection.
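The distinction between a significant correlation and the fraction of variance it explains can be made concrete with a short Python sketch. It assumes that "variance explained" refers to the squared Pearson correlation of a simple linear regression, which the text does not state explicitly; the function name and the toy values are hypothetical.

```python
import math

def variance_explained(x, y):
    """Squared Pearson correlation between x and y, i.e., the R^2 of a
    simple linear regression of y on x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (>= 2)")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must both vary")
    return sxy * sxy / (sxx * syy)

# Hypothetical toy values: expression level and pi_n/pi_s for five genes,
# both log-transformed before the correlation is computed
expression = [10.0, 50.0, 200.0, 1000.0, 5000.0]
pn_ps = [0.40, 0.10, 0.30, 0.05, 0.12]
r2 = variance_explained([math.log10(v) for v in expression],
                        [math.log10(v) for v in pn_ps])
print(0.0 <= r2 <= 1.0)
```

With thousands of genes, even a small R² can be associated with a very small P-value, which is why the two quantities are reported separately.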

Our observation that orthologous genes across such deeply diverged species experience similar degrees of selective constraint (fig. 7) might imply a low rate of turnover of expression, protein structure, function, and number of interacting partners. This finding may also be due in part to similar ecologies, and therefore similar selective pressures, experienced by the Paramecium species examined here. It may thus be informative to compare closely related species that live in, or have recently experienced, different environments. For instance, Elyashiv et al. (2010) found significant differences in levels of gene-specific purifying selection (πn/πs) in the sake population of S. cerevisiae in comparison to other populations of S. cerevisiae as well as S. paradoxus. In fact, this deviation in selective constraints experienced by the sake population is complemented by differences in response to different stresses (Kvitek et al. 2008), suggesting that changes in environment might be reflected in gene-specific constraints.

FIG. 7. Correlation of the logarithm of πn/πs for paralogs within species (three diagonal plots for P. tetraurelia, P. biaurelia, and P. sexaurelia) and for orthologs between species pairwise [non-diagonal plots, with the plots below the diagonal (outlined red) representing the first set of WGD paralogs (T1, B1, S1) and the upper half of the matrix (outlined blue) representing the second set of paralogs (T2, B2, S2)]. The number in the top left corner of each plot is the slope of the linear regression in cases where the regression is significant (P < 0.01). Only genes with more than five synonymous polymorphisms were used in this analysis. The schematic tree in the bottom right corner shows the phylogenetic relationship between the species P. tetraurelia (T), P. biaurelia (B), P. sexaurelia (S), P. caudatum (C), and P. multimicronucleatum (M).

FIG. 8. Negative correlation of πn/πs with expression levels of genes in four Paramecium species. All correlations are significant, with P < 2.2 × 10⁻¹⁶. Only genes with more than five synonymous polymorphisms were used in this analysis.

Distribution of Tajima's D across the Genome

Contrary to genes under purifying selection, genes under balancing or positive selection might often be species-specific. We tested this hypothesis using Tajima's D, a statistic that uses the difference between nucleotide diversity and the number of segregating sites to detect deviations from neutrality. Because Tajima's D is sensitive to population size changes and population structure (Simonsen et al. 1995), we used an outlier approach. We calculated Tajima's D for non-overlapping 1-kb windows across the genome (supplementary fig. S9, Supplementary Material online) and looked for the presence of candidate genes in the outlier regions (0.25% of the two tails). The outlier regions with negative values of Tajima's D might represent genes under either weak purifying or positive selection (i.e., experiencing a recent sweep), whereas the ones with positive Tajima's D might be under balancing selection. Interestingly, both sets of outliers show an over-representation of genes involved in glycosylation, glycoprotein production and metabolism, and carbohydrate metabolism (supplementary table S8, Supplementary Material online) for all Paramecium species. For each species, the large functional overlap of genes in both tails suggests an important role of these processes in Paramecium growth. More importantly, this further suggests that similar genome-wide selective pressures act on all Paramecium species examined in this study. These conclusions can, however, be biased by gene length, as shorter genes might not provide sufficient power for an accurate calculation of Tajima's D. Indeed, the outliers are all significantly longer than the rest of the genes in P. tetraurelia (outliers: 1,567 bp; non-outliers: 1,321 bp; P = 4.7 × 10⁻¹³), P. biaurelia (outliers: 1,681 bp; non-outliers: 1,372 bp; P < 2.2 × 10⁻¹⁶), and P. sexaurelia (outliers: 1,860 bp; non-outliers: 1,379 bp; P < 2.2 × 10⁻¹⁶). This bias is, however, much lower or not significant for P. caudatum (outliers: 1,463 bp; non-outliers: 1,383 bp; P = 7.3 × 10⁻³) and P. multimicronucleatum (outliers: 1,509 bp; non-outliers: 1,445 bp; P = 0.19), possibly because of higher genome-wide nucleotide diversity, indicating that despite some bias the results have true biological significance.
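The outlier approach can be sketched as a simple ranking of per-window values. The Python example below is hypothetical: the function name is invented, the tail fraction is left as a parameter because the exact cutoff convention (per tail or across both tails) is an assumption here, and the toy window values stand in for real Tajima's D estimates.

```python
def tail_outliers(window_values, tail_fraction):
    """Return (low, high): window IDs in the lowest and highest tails.

    window_values: dict mapping a window ID to its Tajima's D
    tail_fraction: fraction of windows taken from EACH tail
                   (e.g., 0.0025 for 0.25% per tail); at least one
                   window is always returned per tail
    """
    if not 0 < tail_fraction < 0.5:
        raise ValueError("tail_fraction must be between 0 and 0.5")
    if not window_values:
        raise ValueError("no windows supplied")
    ranked = sorted(window_values, key=window_values.get)
    k = max(1, int(len(ranked) * tail_fraction))
    return ranked[:k], ranked[-k:]

# Hypothetical toy input: 1,000 windows whose value equals their index
toy = {i: float(i) for i in range(1000)}
low, high = tail_outliers(toy, 0.0025)
print(low, high)
```

Because the cutoffs are defined relative to the genome-wide distribution, a demographic shift that moves every window in the same direction changes the cutoffs but not which windows are most extreme.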

Tajima's D, as a summary statistic of the site-frequency spectrum, can also indicate overall patterns of selection or recent changes in demography. Overall, in most species, the genome-wide distribution of Tajima's D is shifted towards positive values, with means of 0.862 for P. tetraurelia, 0.504 for P. biaurelia, 0.274 for P. sexaurelia, −0.148 [0.459] for P. caudatum [subset], and 1.325 [0.855] for P. multimicronucleatum [subset]. This could suggest pervasive local adaptation (or balancing selection), a recent contraction in population size, or a fairly ancient population subdivision in all Paramecium species. A previous study that sampled variation at 10 nuclear loci found that strains from different continents shared the same alleles, and interpreted that as evidence of regular global dispersal of Paramecium from relatively few permanent locations (Catania et al. 2009). This could also be a potential explanation for the observed increase in nucleotide diversity in comparison to the number of segregating sites.

Fraction of Heterozygous Sites across the Genome

On average, about 0.1% of sites in an individual were heterozygous within the P. aurelia species, whereas ~0.7% of sites in individuals belonging to P. caudatum and 1.1% of sites in isolates of P. multimicronucleatum were found to be heterozygous (supplementary table S1, Supplementary Material online). This is consistent with the observation that the P. aurelia species undergo autogamy, whereas the outgroup species do not. On evaluating the fraction of heterozygous sites in windows of 10 kb across the genome, we find large blocks of homozygosity in the P. aurelia genomes (supplementary fig. S10a–e, Supplementary Material online), consistent with a loss of heterozygosity in the lab, possibly through multiple episodes of autogamy. It should be noted, however, that peaks in heterozygous regions in the P. aurelia genomes are highly correlated between isolates, unlike in P. caudatum and P. multimicronucleatum, and might reflect mapping or assembly errors. These heterozygous regions (summarized in supplementary Files S1–S3, Supplementary Material online) comprise a very small part (0.82–0.97%) of the genome and, most importantly, their exclusion does not affect any of our analyses qualitatively.
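A windowed heterozygosity scan of this kind can be sketched as follows. The Python example is a hypothetical illustration, not the authors' pipeline: it assumes 0-based positions on a single scaffold, one (position, is_heterozygous) pair per site with a genotype call, and fixed non-overlapping windows; the function name and toy data are invented.

```python
from collections import defaultdict

def het_fraction_by_window(calls, window=10_000):
    """Fraction of called sites that are heterozygous, per fixed window.

    calls: iterable of (position, is_heterozygous) for one individual on
           one scaffold; only sites with a genotype call should be given
    window: window size in bp (10 kb in the text)
    Returns {window_start: fraction}, ordered by window start.
    """
    counts = defaultdict(lambda: [0, 0])  # window_start -> [called, heterozygous]
    for pos, is_het in calls:
        start = (pos // window) * window
        counts[start][0] += 1
        counts[start][1] += int(is_het)
    return {start: het / called for start, (called, het) in sorted(counts.items())}

# Hypothetical toy calls spanning two 10-kb windows
calls = [(0, False), (5_000, True), (9_999, False),
         (10_000, False), (15_000, False)]
fractions = het_fraction_by_window(calls)
print(fractions)
```

Runs of consecutive windows with a fraction of zero correspond to the blocks of homozygosity described above, while windows that are elevated in the same place across isolates are candidates for mapping or assembly artifacts.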

Variation across the Mitochondrial Genome

Few ciliate mitochondrial genomes have been sequenced until now, most belonging to the genus Tetrahymena (Brunk et al. 2003; Burger et al. 2000; Moradian et al. 2007). Others include Paramecium tetraurelia (Pritchard et al. 1990), Paramecium caudatum (Barth and Berendonk 2011), and Oxytricha trifallax (Swart et al. 2012). These studies have illustrated that ciliate mitochondrial genomes are linear and range between 40 and 47 kb, with 50 intronless genes. However, the levels of sequence diversity across the entire mitochondrial genome have not yet been characterized.

To obtain mitochondrial genomes for all species included in this study, we carried out de novo genome assembly using the whole-genome resequencing data for each strain and extracted mitochondrial contigs using the published Paramecium mitochondrial genomes as reference (see Methods for more details). Using this approach, we derived complete genome assemblies for multiple strains, including from species without a previously published mitochondrial genome (P. biaurelia, P. sexaurelia, and P. multimicronucleatum). A comprehensive analysis of these sequences will be published elsewhere (Johri et al., in preparation). The mitochondrial genome assembly from each isolate was annotated separately, and the eight genes that were most ubiquitously present in all individuals were concatenated and used to construct a phylogeny (see Methods). The mitochondrial phylogeny closely tracked the one based on nuclear sequences (supplementary fig. S11, Supplementary Material online).


One strain from each species was chosen as a reference genome (the genome with the longest assembly), and reads from other individuals were mapped to the reference to call SNPs. During the original DNA preparation, efforts were made to remove non-MAC DNA, resulting in varying amounts of mitochondrial DNA present in the extracts from each isolate. Hence, some individuals mapped with extremely low coverage (read depth of 4 to zero) to the reference genome, especially after applying stringent filters to call SNPs (supplementary table S9, Supplementary Material online). Thus, some individuals were excluded from this analysis, including all P. biaurelia isolates. The subsequent analysis was therefore restricted to four species.

All of the mitochondrial genomes are about 40 kb in length. We identified 626 SNPs in P. tetraurelia, 5307 in P. sexaurelia, 4771 [3041] in P. caudatum [subset], and 4600 [1478] in P. multimicronucleatum [subset]. Within all species, very few indels were found, with just 1 in P. tetraurelia, 9 in P. sexaurelia, and 15 [13] and 23 [9] in P. caudatum [subset] and P. multimicronucleatum [subset], respectively. The nucleotide diversity at 4-fold degenerate sites was found to be 0.009 in P. tetraurelia, 0.169 in P. sexaurelia, 0.127 [0.093] in P. caudatum [subset], and 0.172 [0.073] in P. multimicronucleatum [subset]. Nucleotide diversity across different sites shows that rRNAs are the most conserved sequences in the ciliate mitochondrial genomes. Transfer RNAs are the next most highly conserved, with their diversity levels only slightly above those of replacement sites (table 2). Our estimates of diversity in the Paramecium mitochondria are higher than those of a number of other unicellular eukaryotes studied so far: P. falciparum (π_syn = 0.0012) (Joy et al. 2003) and Chlamydomonas reinhardtii (π_silent = 0.0085) (Smith and Lee 2008).

We also recalculated nuclear nucleotide diversity (table 2) at 4-fold degenerate sites for the same set of individuals used in the mitochondrial analysis and observed that the mitochondrial to nuclear diversity ratio is close to 1.0 in all species (table 2) except P. sexaurelia, where the mitochondrial diversity is 5.7-fold higher than that in the nucleus. On excluding P. sexaurelia, we obtain a strong correlation between 4-fold degenerate nucleotide diversity in the mitochondria and nucleus (slope = 0.95, R² = 0.96). It should be noted that there are not enough data points to make any inferences about statistical significance. However, such a positive correlation between the nuclear and mitochondrial diversity suggests that mitochondrial diversity might scale with population size in ciliates, a question on which previous studies in animals have yielded conflicting findings (Bazin et al. 2006; Nabholz et al. 2008; Nabholz et al. 2009; Piganeau and Eyre-Walker 2009). The observed correlation also suggests that mitochondrial and nuclear spontaneous mutation rates may be correlated. Direct estimates of mutation rates in metazoans (C. elegans, D. melanogaster, and H. sapiens) and S. cerevisiae, obtained via mutation accumulation experiments, suggest a moderate correlation (slope = 0.05, R² = 0.48) between mitochondrial and nuclear base substitution rates (using data from Lynch et al. 2008). Discordance between the nuclear and mitochondrial diversity, for instance in P. sexaurelia, can be attributed to various factors such as positive selection (Hudson and Turelli 2003; Galtier et al. 2009; James et al. 2016), demographic events (Pool and Nielsen 2007), or simply differences in nuclear and/or mitochondrial mutation rates (Lynch 2007). Future studies are needed to dissect the factors responsible for the remarkably different ratio of mitochondrial to nuclear diversity in P. sexaurelia.

Table 2. Nucleotide Diversity (per site) of Different Sites across the Mitochondrial Genomes.

Species                            4-fold (πm) [SE]   2-fold [SE]        0-fold [SE]        tRNA [SE]          rRNA [SE]          Intergenic [SE]    4-fold nuclear (πn) [SE]   πm/πn
P. tetraurelia                     0.0091 [0.0010]    0.0107 [0.0009]    0.0019 [0.0002]    0.0000 [0.0000]    0.0006 [0.0002]    0.0041 [0.0003]    0.0058 [2.26 × 10^-5]      1.57
P. sexaurelia                      0.1692 [0.0033]    0.1427 [0.0024]    0.0097 [0.00043]   0.0000 [0.0000]    0.0059 [0.0007]    0.0417 [0.0027]    0.02945 [4.73 × 10^-5]     5.75
P. caudatum                        0.1272 [0.0037]    0.0942 [0.0019]    0.0179 [0.0006]    0.0455 [0.0023]    0.0136 [0.0012]    0.0333 [0.0027]    0.1267 [1.08 × 10^-4]      1.00
P. caudatum (subset)               0.0929 [0.0039]    0.0859 [0.0021]    0.0149 [0.0006]    0.0375 [0.0023]    0.0093 [0.0011]    0.0292 [0.0028]    0.0658 [9.63 × 10^-5]      1.41
P. multimicronucleatum             0.1727 [0.0047]    0.0831 [0.0019]    0.0227 [0.0008]    0.0263 [0.0011]    0.0269 [0.0016]    0.0329 [0.0022]    0.1599 [1.34 × 10^-4]      1.08
P. multimicronucleatum (subset)    0.0730 [0.0034]    0.0383 [0.0015]    0.0121 [0.0006]    0.0126 [0.0009]    0.0135 [0.0012]    0.0312 [0.0022]    0.0742 [9.56 × 10^-5]      0.98
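The mitochondrial to nuclear diversity ratios discussed here can be recomputed directly from the 4-fold degenerate diversities reported in table 2. The following is a minimal sketch in Python, not part of the original analysis; the dictionary and function names are hypothetical, and the values are the table 2 entries as read from this copy of the text.

```python
# Values as read from table 2: species -> (pi_m, pi_n), the mitochondrial
# and nuclear nucleotide diversity at 4-fold degenerate sites.
TABLE2_FOURFOLD = {
    "P. tetraurelia": (0.0091, 0.0058),
    "P. sexaurelia": (0.1692, 0.02945),
    "P. caudatum": (0.1272, 0.1267),
    "P. caudatum (subset)": (0.0929, 0.0658),
    "P. multimicronucleatum": (0.1727, 0.1599),
    "P. multimicronucleatum (subset)": (0.0730, 0.0742),
}


def diversity_ratio(pi_m, pi_n):
    """Ratio of mitochondrial to nuclear diversity (hypothetical helper)."""
    if pi_n <= 0:
        raise ValueError("nuclear diversity must be positive")
    return pi_m / pi_n


# Ratios rounded to two decimals, as in the last column of table 2.
ratios = {
    species: round(diversity_ratio(pi_m, pi_n), 2)
    for species, (pi_m, pi_n) in TABLE2_FOURFOLD.items()
}

if __name__ == "__main__":
    for species, ratio in ratios.items():
        print(f"{species}: {ratio}")
```

Under these inputs, P. sexaurelia is the only species whose ratio departs strongly from 1.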

Conclusions

This is the first study where multiple individuals of several species of free-living unicellular eukaryotes have been sequenced and used to reveal genome-wide patterns of variation. This study thus provides an invaluable resource for future research in Paramecium, a major model organism in cell biology. Sequencing of multiple individuals from the same lake or pond is needed to further resolve the global versus local population structure of these cosmopolitan unicellular species. Overall, our results suggest that most of the intergenic regions across the highly compact genomes of Paramecium species are strongly constrained and probably harbor functionally important regulatory modules, which might extend into coding regions. We show that patterns of nucleotide diversity across the genome are driven primarily by a combination of purifying selection and mutation rate variation. In the future, such analyses will benefit greatly from incorporating variation in recombination rates across the genomic landscape. We also find that duplicates created by the WGD appear to be evolving under similar constraints. Synonymous to non-synonymous diversity ratios need to be supplemented with divergence levels to distinguish between the various types of selective forces acting on the duplicates. Thus, sequencing of much closer outgroup species would allow us to further understand the role of selection in the evolution of different genomic elements, including gene duplicates that resulted from ancient WGDs.

Methods

Sampling, DNA Extraction, and Sequencing

We examined a total of 56 Paramecium strains originally isolated from environmental sources, including 11 isolates of P. biaurelia, 13 isolates of P. sexaurelia, 12 isolates of P. tetraurelia, 10 isolates of P. caudatum, and 10 isolates of P. multimicronucleatum. A total of 14 strains from P. biaurelia and P. sexaurelia were kindly provided by E. Przybos, whereas the remaining strains were cultured in the Paramecium collections of the Institute of Hydrobiology, TU Dresden, Germany, or the Department of Biology, Indiana University, Bloomington, USA. Prior to our analyses, P. tetraurelia strain Moz 13CXI as well as P. sexaurelia strains Moz 13BIII and Indo 17I were kindly identified by E. Przybos by performing inter- and intrastrain crosses (Przybos 1975). Paramecia were maintained in 0.25% Cerophyl medium (modified according to Sonneborn 1970; Krenek et al. 2011) inoculated with Enterobacter aerogenes. Since P. aurelia species can undergo autogamy (self-fertilization) induced by starvation, and all strains have been cultured under lab conditions for several years, all P. aurelia strains in this study are expected to be more homozygous than wild isolates.

Prior to DNA extraction, all cultures were initiated from a single cell, ensuring clonality. A modified sucrose density separation (Aury et al. 2006) was performed to extract primarily macronuclei (MACs) and to reduce the amount of non-MAC DNA potentially compromising subsequent analyses. Briefly, Paramecium cells were starved for at least two days and filtered through several layers of sterile cheesecloth prior to cell harvesting via gentle centrifugation. Pelleted cells were gently lysed in 0.25 M sucrose medium (pH 6.8) containing 10 mM Tris-HCl, 8 mM MgCl2, 3 mM CaCl2, 0.5–1% Nonidet-P40, and 0.1–0.2% sodium deoxycholate. Macronuclei were subsequently isolated away from cellular debris, micronuclei, and mitochondria by centrifugation at 100 rcf in 0.25 M sucrose wash buffer (see above, without Nonidet-P40 and Na-deoxycholate). Genomic DNA extraction from MAC isolations was performed by a modified CTAB procedure (Gustincich et al. 1991).

All genomes were sequenced using the Illumina HiSeq 2000, and paired-end reads with insert sizes between 200 and 400 bp and a mean length of 151 bp were obtained. The reads mapped with coverage ranging between 40× and 80× (details given in supplementary table S1, Supplementary Material online).

Paramecium Reference Genomes

Complete genomes of P. tetraurelia, P. biaurelia, P. sexaurelia, and P. caudatum were downloaded from ParameciumDB (Arnaiz et al. 2007; Arnaiz and Sperling 2011) along with the annotations. Ortho-paralog relationships as well as functional annotations were retrieved from the supplementary information provided in McGrath et al. (2014a, 2014b). RNAseq data for all four species were downloaded and processed as in Gout and Lynch (2015).

The genome of P. multimicronucleatum was assembled using Celera (Miller et al. 2008) and annotated using Eugene (Foissac et al. 2008), as previously described (McGrath et al. 2014a, 2014b). A detailed description of the sequencing, assembly, and annotation of P. multimicronucleatum can be found in Marinov et al. (in preparation). OMA (Orthology MAtrix) (Altenhoff et al. 2015) was used to find orthologs between P. caudatum and P. multimicronucleatum. 11,002 genes of P. caudatum were found to have 1-1 orthologs in P. multimicronucleatum, and only 1-1 orthologs were used for the current analysis.

Mapping of Reads and Detection of SNPs and Indels

Sequenced paired-end reads were mapped to the reference genomes for each species using BWA (bwa mem) (Li and Durbin 2010) with default parameters, and the mpileup function in Samtools (Li et al. 2009) was used to generate mpileup files for each individual. Two distinct pipelines were employed to identify SNPs to ensure detection accuracy: Bcftools-Vcftools (Li 2011) and the maximum likelihood Genotype Frequency Estimator (referred to as GFE) (Lynch 2009; Maruki and Lynch 2015). Both pipelines produced highly overlapping sets of SNPs (supplementary table S2, Supplementary Material online). For both pipelines, sites were only considered for SNP detection and further analysis if the mapping quality was above 30, base quality was above 20, per-base alignment quality was above 15, and the sum of the depth of coverage for all individuals was above 5 times the number of individuals and less than twice the average population coverage. SNPs were called by bcftools if the SNP quality score was above 30 and by GFE if the significance score was > 5.91 (P < 0.05). The exact parameters of the Bcftools-Vcftools pipeline to call SNPs and indels are given below.

samtools mpileup -q 30 -Q 20 -t DP -uvf reference_genome.fasta sorted.bam > species.vcf

bcftools call -f GQ -o species_snps.vcf -O v -V indels -cv species.vcf

bcftools call -f GQ -o species_indels.vcf -O v -V snps -cv species.vcf

# filtering snp and indel files for depth and biases
perl vcfutils.pl varFilter -d 50 -D 800 -1 -2 -3 -4 species_snps.vcf > species_snps.filtered1.vcf

perl vcfutils.pl varFilter -d 50 -D 800 -1 -2 -3 -4 species_indels.vcf > species_indels.filtered1.vcf

vcftools --vcf species_snps.filtered1.vcf --minQ 20 --minGQ 30 --minDP 4 --recode --recode-INFO-all --out species_snps.filtered2

vcftools --vcf species_indels.filtered1.vcf --minQ 20 --minGQ 30 --minDP 4 --recode --recode-INFO-all --out species_indels.filtered2

# filter out the recoded sites that are not polymorphic anymore, that is, AF1 = 1
bcftools view -q 0.001 species_snps.filtered2.recode.vcf -o species_snps.final.vcf

bcftools view -q 0.001 species_indels.filtered2.recode.vcf -o species_indels.final.vcf

Mitochondrial SNPs were called using the Bcftools-Vcftools pipeline with the same parameters as noted above.

Simulation of SNP Detection with Varying Levels of Divergence from the Reference

We simulated reads using wgsim (https://github.com/lh3/wgsim) for a diploid genome with varying levels of divergence from the reference genome of P. caudatum and P. multimicronucleatum, using average values of read length, total number of reads, and insert sizes.

For the P. caudatum reference genome:
wgsim -d 287 -s 113 -N 10201087 -1 151 -2 151 -r 0.10 -R 0.013 caudatum_assembly.fasta caud_010_1.fastq caud_010_2.fastq >& caud_010.info

For the P. multimicronucleatum reference genome:
wgsim -d 261 -s 102 -N 12154044 -1 151 -2 151 -r 0.10 -R 0.011 multimicronucleatum_assembly.fasta multi_010_1.fastq multi_010_2.fastq >& multi_010.info

Reads were simulated for r = 0.0, 0.05, 0.10, 0.15, 0.20, 0.25, and 0.30, which represents the fraction of the genome with SNPs. The SNP detection pipeline described above was used to call SNPs, which were compared to the simulated set of SNPs (supplementary table S4, Supplementary Material online).
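The comparison between called and simulated SNPs can be illustrated with a short sketch. This is an assumed formulation rather than the script used for supplementary table S4: SNPs are represented as hypothetical (scaffold, position) tuples, and the function name and the reported metrics are illustrative choices.

```python
def compare_snp_calls(simulated, called):
    """Compare called SNP sites against the simulated (true) set.

    Both arguments are iterables of (scaffold, position) tuples.
    Returns counts plus recall (fraction of simulated SNPs recovered)
    and precision (fraction of called SNPs that were simulated).
    """
    simulated, called = set(simulated), set(called)
    true_pos = simulated & called
    recall = len(true_pos) / len(simulated) if simulated else float("nan")
    precision = len(true_pos) / len(called) if called else float("nan")
    return {
        "true_positives": len(true_pos),
        "false_negatives": len(simulated - called),
        "false_positives": len(called - simulated),
        "recall": recall,
        "precision": precision,
    }


if __name__ == "__main__":
    # Toy data, not taken from the study.
    sim = [("scaffold_1", 10), ("scaffold_1", 20), ("scaffold_2", 5), ("scaffold_2", 7)]
    obs = [("scaffold_1", 10), ("scaffold_2", 5), ("scaffold_2", 9)]
    print(compare_snp_calls(sim, obs))
```

Repeating such a comparison for each simulated value of r would show how detection degrades as divergence from the reference grows.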

Building Phylogenies

A consensus genome (bcftools view species_filtered.recode.vcf | vcfutils.pl vcf2fq > species.cns.fq) was obtained for each individual using bcftools; it represented heterozygous sites using DNA ambiguity codes. Gene nucleotide sequences that are present in single copy in the P. aurelia species and are also present in both the outgroup species (P. caudatum and P. multimicronucleatum) were extracted, and multiple sequence alignments were generated using MUSCLE (Edgar 2004a, 2004b). This amounted to about 55 sequences (including the reference genomes) each in 883 sets of orthologous genes. RAxML 8.0.26 (Stamatakis 2014) was used to build a tree using the concatenation of all 883 aligned gene sequences (raxml -s concatenated_sequences.fasta -n gene -m GTRGAMMA -T 4 -p 7). In addition, individual gene trees (raxml -s gene.fasta -n gene -m GTRGAMMA -T 4 -p 7) were built, which were then used to create a consensus tree (raxml -J MR -z trees.newick -T 4 -m GTRGAMMA) (Aberer et al. 2010).

Mitochondrial phylogeny was built using the same parameters as above by concatenating eight sets of orthologous genes (NADH_dehydrogenase_subunit_1_a, NADH_dehydrogenase_subunit_5, Ymf56orf78, Ymf67orf265_1, Ymf68orf393, apocytochrome_b, cytochrome_c_oxidase_subunit_1, cytochrome_c_oxidase_subunit_2, ribosomal_protein_S14) that were ubiquitously present across all sampled species.

Principal Component Analysis

Genotypes for all individuals were obtained from the vcf files generated by vcftools for all scaffolds. Only the sites that had a genotype available for all individuals were used in the analysis, i.e., if even a single individual had missing data, the site was excluded from analysis. A total of 10,000 markers were used for each species. The first marker of each scaffold was picked randomly within the first 50 SNP positions, and the rest were selected so that all markers were uniformly distributed across all scaffolds. PCA was carried out on these markers by using the R (R-Core-Team 2014) command prcomp (https://stat.ethz.ch/R-manual/R-devel/library/stats/html/prcomp.html) and visualized using the first two components with the R graphing package ggplot2 (Wickham 2010).

Calculation of Population Genetic Statistics

Nucleotide Diversity and Tajima's D

Nucleotide diversity was calculated as 2pqn/(n - 1), where p and q are the major and minor allele frequencies and n is the number of chromosomes. The major and minor allele frequencies were given for each site by GFE. Diversity at coding sites was partitioned into synonymous and non-synonymous components (e.g., 2-fold degenerate sites are one-third synonymous and two-thirds non-synonymous). Major allele frequencies were also extracted from vcftools and returned very similar results to those from GFE (supplementary table S3, Supplementary Material online).
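As an illustration of the estimator above, the per-site diversity 2pqn/(n - 1) and its average over a set of sites can be sketched as follows. The function names are hypothetical, and averaging over a caller-supplied number of sites (including monomorphic ones) is an assumption about how per-site values are combined.

```python
def site_diversity(p, n):
    """Per-site nucleotide diversity 2pq * n / (n - 1).

    p is the major allele frequency, q = 1 - p the minor allele
    frequency, and n the number of sampled chromosomes.
    """
    if n < 2:
        raise ValueError("at least two chromosomes are required")
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    q = 1.0 - p
    return 2.0 * p * q * n / (n - 1)


def mean_diversity(major_freqs, n, total_sites):
    """Average diversity over total_sites sites.

    major_freqs lists the major allele frequency at each polymorphic
    site; monomorphic sites contribute zero and only enter the denominator.
    """
    if total_sites <= 0:
        raise ValueError("total_sites must be positive")
    return sum(site_diversity(p, n) for p in major_freqs) / total_sites


if __name__ == "__main__":
    # Toy example: two polymorphic sites among 100 surveyed, 20 chromosomes.
    print(mean_diversity([0.9, 0.5], n=20, total_sites=100))
```

The factor n/(n - 1) corrects the downward bias of 2pq in small samples, which matters for sample sizes of 10 to 13 isolates per species.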


A sliding window approach was used to calculate Tajima's D across the genome. Vcftools was used to obtain these values for non-overlapping windows of 1 kb in length. The program ms (Hudson 2002) was used to perform 100,000 simulations of 10 individuals with 1 kb locus length, and niceStats (https://github.com/kern-lab/shIC/blob/master/niceStats.c) was used to calculate Tajima's D from the ms output file. To obtain Tajima's D for different categories of sites, SNPs belonging to a particular category of sites (specified in Results) were selected, and vcftools was used to calculate Tajima's D for non-overlapping windows of 10 kb. Windows with 2 or more SNPs were used to calculate the mean, and only alternate windows were used to compare distributions of Tajima's D across different categories by Student's t-test, to maximise independence of sampling.
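For readers unfamiliar with the statistic, the sketch below computes Tajima's D for a single window from the standard formula (Tajima 1989). It is not the vcftools or niceStats implementation; the function names are hypothetical, and returning None when the statistic is undefined is a design choice made here.

```python
import math


def pairwise_diff_from_counts(minor_counts, n):
    """Mean number of pairwise differences in a window, given the minor
    allele count at each segregating site and n sampled chromosomes."""
    return sum(2.0 * c * (n - c) / (n * (n - 1)) for c in minor_counts)


def tajimas_d(n, num_segregating, mean_pairwise_diff):
    """Tajima's D from sample size n, number of segregating sites S,
    and mean pairwise difference k. Returns None if D is undefined."""
    if n < 2 or num_segregating == 0:
        return None
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    s = num_segregating
    variance = e1 * s + e2 * s * (s - 1)
    if variance <= 0:
        return None  # e.g. n < 4, where the variance term vanishes
    return (mean_pairwise_diff - s / a1) / math.sqrt(variance)


if __name__ == "__main__":
    # Toy window: 10 chromosomes, 16 segregating sites, all singletons.
    counts = [1] * 16
    k = pairwise_diff_from_counts(counts, 10)
    print(tajimas_d(10, len(counts), k))
```

An excess of rare variants, as in the toy window, gives a negative D, whereas an excess of intermediate-frequency variants gives a positive D, the direction of the genome-wide shift reported in Results.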

Plotting Average Expected Heterozygosity across Introns and Intergenic Regions

All introns (< 500 bp in length) were split midway and aligned according to the 5′ and 3′ splice sites. All intergenic regions (< 500 bp in the outgroup species and < 1,000 bp in the P. aurelia species) were split midway and aligned according to the start and stop codons. Note that these were not sequence alignments. At each site, the mean expected heterozygosity was calculated. Sites with an insufficient number of data points (< 200 data points per site for introns; for intergenic regions, < 100 data points per site in the outgroups and < 1,000 data points per site in the P. aurelia species) were discarded.

Calculating and Plotting Divergence across Introns and Intergenic Regions

Orthologous coding sequences along with introns were aligned using MUSCLE (Edgar 2004a). Intergenic regions flanked by orthologous genes on both sides were also aligned along with the flanking coding exon using MUSCLE. For each species separately, divergence for different positions (with respect to the start and stop sites) was averaged across all introns and intergenic regions. The divergence values were corrected for multiple substitutions by the Jukes-Cantor sequence substitution model and plotted in the same manner as described above.
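The Jukes-Cantor correction converts an observed proportion of differing sites p into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). A minimal sketch follows; the function name is hypothetical, and raising an error for p at or above 0.75 (where the correction is undefined) is a choice made here.

```python
import math


def jukes_cantor_distance(p):
    """Jukes-Cantor corrected divergence for an observed proportion p
    of differing sites: d = -(3/4) * ln(1 - 4p/3)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("observed divergence must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    for observed in (0.01, 0.10, 0.30):
        print(observed, round(jukes_cantor_distance(observed), 4))
```

The correction is negligible at low divergence and grows as p increases, which is relevant when comparing the deeply diverged species analyzed here.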

GO Analysis

All gene ontology enrichment analyses were performed using the BiNGO plugin for Cytoscape 3.2.1 (Shannon et al. 2003; Maere et al. 2005), using the hypergeometric test to assess significance with Benjamini and Hochberg FDR correction for multiple testing (Benjamini and Hochberg 1995). The latest ontology files (.obo) were downloaded from www.geneontology.org, and custom annotation files were prepared as required using python scripts.
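The two statistical steps named above can be sketched in plain Python. This is an illustration of the hypergeometric enrichment test and the Benjamini-Hochberg adjustment, not BiNGO's code; the function names and argument order are hypothetical.

```python
from math import comb


def hypergeom_enrichment_pvalue(k, K, n, N):
    """Upper-tail hypergeometric P-value P(X >= k).

    N: genes in the background, K: background genes carrying the term,
    n: genes in the test set, k: test-set genes carrying the term.
    """
    upper = min(K, n)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # Toy numbers: 8 of 50 test genes carry a term found in 200 of 10000 genes.
    p = hypergeom_enrichment_pvalue(8, 200, 50, 10000)
    print(p, benjamini_hochberg([p, 0.04, 0.5]))
```

Terms would then be reported as enriched when their adjusted P-value falls below the chosen FDR threshold.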

Identification of Heterozygous Regions in the P. aurelia Species

The fraction of heterozygous sites was calculated for 10 bp non-overlapping windows across the genome for all three P. aurelia species, and the top 1% of such regions, those with the highest fraction of heterozygous sites, were identified as heterozygous regions.
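The windowing step can be sketched as below for a single scaffold. The function names, the use of 0-based positions, and the handling of ties and of the final partial window are assumptions made for illustration; the window size is left as a parameter.

```python
def window_het_fractions(het_positions, scaffold_length, window):
    """Fraction of heterozygous sites in non-overlapping windows.

    het_positions: 0-based positions of heterozygous sites on one scaffold.
    The last window may be shorter than `window`.
    """
    n_windows = -(-scaffold_length // window)  # ceiling division
    counts = [0] * n_windows
    for pos in het_positions:
        counts[pos // window] += 1
    fractions = []
    for w in range(n_windows):
        size = min(window, scaffold_length - w * window)
        fractions.append(counts[w] / size)
    return fractions


def top_windows(fractions, top=0.01):
    """Indices of the windows in the top `top` share by heterozygosity
    (at least one window is always returned)."""
    k = max(1, int(len(fractions) * top))
    ranked = sorted(range(len(fractions)), key=lambda w: fractions[w], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    fr = window_het_fractions([0, 1, 2, 15], scaffold_length=30, window=10)
    print(fr, top_windows(fr, top=0.34))
```

Windows flagged this way could then be excluded to check that downstream results are qualitatively unchanged.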

Mitochondrial Genome Assembly and Annotation

Mitochondrial genomes were assembled as follows. Sequence reads were assembled using SPAdes (version 3.5.0) (Bankevich et al. 2012) after removing potential adapter sequence with Trimmomatic (version 0.33) (Bolger et al. 2014). Mitochondrial contigs were identified from the resulting assemblies by BLAST searches against the published P. caudatum and P. tetraurelia mitochondrial genomes (Pritchard et al. 1990; Barth and Berendonk 2011).

Protein coding genes were identified by generating all ORFs longer than 60 aa in all six frames using the Mold, Protozoan, and Coelenterate Mitochondrial Code (i.e., UGA codes for W instead of being a stop codon) and all alternative start codons specific to Paramecium (AUU, AUA, AUG, AUC, GUG, and GUA), and retaining the longest ORFs associated with each stop codon. BLASTP (Altschul et al. 1997) was then used to identify homologs of annotated mitochondrial proteins in P. tetraurelia and P. caudatum. Additional ORFs were identified by imposing the requirement that their length exceeds 100 aa and subsequently annotated using BLASTP against the Non-redundant protein sequences (nr) database and HMMER 3.0 (Eddy 2011) scans against the Pfam 27.0 database (Finn et al. 2014).
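The ORF-generation step can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the script used in the study: the function names are hypothetical, sequences are given as DNA (T instead of U), and coordinates are reported on the scanned strand. Under the mitochondrial code described above, only TAA and TAG terminate translation, since TGA encodes tryptophan.

```python
STOP_CODONS = {"TAA", "TAG"}  # TGA codes for W under this genetic code
START_CODONS = {"ATT", "ATA", "ATG", "ATC", "GTG", "GTA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def longest_orfs(seq, min_aa=60):
    """Longest ORF per stop codon in all six frames.

    Returns (strand, frame, start, end) tuples with 0-based, end-exclusive
    coordinates on the scanned strand; `end` includes the stop codon.
    Only ORFs longer than min_aa codons (excluding the stop) are kept.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # first start codon seen since the previous stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if codon in STOP_CODONS:
                    if start is not None and (i - start) // 3 > min_aa:
                        orfs.append((strand, frame, start, i + 3))
                    start = None
                elif start is None and codon in START_CODONS:
                    start = i
    return orfs


if __name__ == "__main__":
    toy = "ATG" + "GCA" * 3 + "TGA" + "GCA" + "TAA"
    print(longest_orfs(toy, min_aa=2))
```

Taking the first start codon after the preceding stop yields the longest ORF for each stop codon; the resulting ORFs would then be passed to homology searches.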

tRNA genes were annotated with tRNAscan-SE 1.21 (Schattner et al. 2005) using the “Mito/Chloroplast” source.

Supplementary Material

Supplementary data are available at Molecular Biology and Evolution online.

Acknowledgments

We thank Lydia Bright and Jean-Francois Gout for helpful comments on the manuscript, Matthew S. Ackerman for insightful discussions related to the project, and the anonymous reviewers and editor for their valuable suggestions. We would also like to thank Kelley Thomas for providing help with the sequencing and E. Przybos for kindly providing several Paramecium strains and the species identification of some so far unspecified strains. This work was financially supported by the National Science Foundation (MCB-1050161 and DEB-1257806), the European Commission FP7-PEOPLE-2009-IRSES project CINAR PATHOBACTER (project number 247658), and the CNRS research network GDRE “Paramecium Genome Dynamics and Evolution”. T.U.B. also acknowledges funding of the “support the best” of the ZUK TU Dresden and the Deutsche Forschungsgemeinschaft (DFG).

References

Abbott RJ, Gomes MF. 1989. Population genetic structure and outcrossing rate of Arabidopsis thaliana (L.) Heynh. Heredity 62:411–418.

Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, McVean GA. 2010. A map of human genome variation from population-scale sequencing. Nature 467:1061–1073.

Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA. 2012. An integrated map of genetic variation from 1,092 human genomes. Nature 491:56–65.

Aberer AJ, Pattengale ND, Stamatakis A. 2010. Parallelized phylogenetic post-analysis on multi-core architectures. J Comput Sci-Neth 1:107–114.

Ackermann AA, Panunzi LG, Cosentino RO, Sanchez DO, Aguero F. 2012. A genomic scale map of genetic diversity in Trypanosoma cruzi. BMC Genomics 13:736.

Altenhoff AM, Skunca N, Glover N, Train CM, Sueki A, Pilizota I, Gori K, Tomiczek B, Muller S, Redestig H, et al. 2015. The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Res 43:D240–D249.

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402.

Arnaiz O, Cain S, Cohen J, Sperling L. 2007. ParameciumDB: a community resource that integrates the Paramecium tetraurelia genome sequence with genetic data. Nucleic Acids Res 35:D439–D444.

Arnaiz O, Mathy N, Baudry C, Malinsky S, Aury JM, Wilkes CD, Garnier O, Labadie K, Lauderdale BE, Le Mouel A, et al. 2012. The Paramecium germline genome provides a niche for intragenic parasitic DNA: evolutionary dynamics of internal eliminated sequences. PLoS Genet 8:e1002984.

Arnaiz O, Sperling L. 2011. ParameciumDB in 2011: new tools and new data for functional and comparative genomics of the model ciliate Paramecium tetraurelia. Nucleic Acids Res 39:D632–D636.

Assefa S, Lim C, Preston MD, Duffy CW, Nair MB, Adroub SA, Kadir KA, Goldberg JM, Neafsey DE, Divis P, et al. 2015. Population genomic structure and adaptation in the zoonotic malaria parasite Plasmodium knowlesi. Proc Natl Acad Sci U S A 112:13027–13032.

Aury JM, Jaillon O, Duret L, Noel B, Jubin C, Porcel BM, Segurens B, Daubin V, Anthouard V, Aiach N, et al. 2006. Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature 444:171–178.

Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR. 2015. A global reference for human genetic variation. Nature 526:68–74.

Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, et al. 2012. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol 19:455–477.

Barth D, Berendonk TU. 2011. The mitochondrial genome sequence of the ciliate Paramecium caudatum reveals a shift in nucleotide composition and codon usage within the genus Paramecium. BMC Genomics 12:272.

Barth D, Krenek S, Fokin SI, Berendonk TU. 2006. Intraspecific genetic variation in Paramecium revealed by mitochondrial cytochrome C oxidase I sequences. J Eukaryot Microbiol 53:20–25.

Bazin E, Glemin S, Galtier N. 2006. Population size does not influence mitochondrial genetic diversity in animals. Science 312:570–572.

Beale G, Preer JR. 2008. Paramecium: genetics and epigenetics. Boca Raton (FL): CRC Press.

Begun DJ, Aquadro CF. 1992. Levels of naturally-occurring DNA polymorphism correlate with recombination rates in Drosophila melanogaster. Nature 356:519–520.

Begun DJ, Holloway AK, Stevens K, Hillier LW, Poh YP, Hahn MW, Nista PM, Jones CD, Kern AD, Dewey CN, et al. 2007. Population genomics: whole-genome analysis of polymorphism and divergence in Drosophila simulans. PLoS Biol 5:e310.

Benjamini Y, Hochberg Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B Met 57:289–300.

Betermier M. 2004. Large-scale genome remodelling by the developmentally programmed elimination of germ line sequences in the ciliate Paramecium. Res Microbiol 155:399–408.

Birky CW, Walsh JB. 1988. Effects of linkage on rates of molecular evolution. Proc Natl Acad Sci U S A 85:6414–6418.

Birnbaum RY, Clowney EJ, Agamy O, Kim MJ, Zhao J, Yamanaka T, Pappalardo Z, Clarke SL, Wenger AM, Nguyen L, et al. 2012. Coding exons function as tissue-specific enhancers of nearby genes. Genome Res 22:1059–1068.

Boel G, Letso R, Neely H, Price WN, Wong KH, Su M, Luff JD, Valecha M, Everett JK, Acton TB, et al. 2016. Codon influence on protein expression in E. coli correlates with mRNA levels. Nature 529:358–363.

Bolger AM, Lohse M, Usadel B. 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30:2114–2120.

Brandvain Y, Wright SI. 2016. The limits of natural selection in a non-equilibrium world. Trends Genet 32:201–210.

Brunk CF, Lee LC, Tran AB, Li J. 2003. Complete sequence of the mitochondrial genome of Tetrahymena thermophila and comparative methods for identifying highly divergent genes. Nucleic Acids Res 31:1673–1682.

Burger G, Zhu Y, Littlejohn TG, Greenwood SJ, Schnare MN, Lang BF, Gray MW. 2000. Complete sequence of the mitochondrial genome of Tetrahymena pyriformis and comparison with Paramecium aurelia mitochondrial DNA. J Mol Biol 297:365–380.

Burki F. 2014. The eukaryotic tree of life from a global phylogenomic perspective. Cold Spring Harb Perspect Biol 6:a016147.

Cao J, Schneeberger K, Ossowski S, Gunther T, Bender S, Fitz J, Koenig D, Lanz C, Stegle O, Lippert C, et al. 2011. Whole-genome sequencing of multiple Arabidopsis thaliana populations. Nat Genet 43:956–963.

Catania F, Wurmser F, Potekhin AA, Przybos E, Lynch M. 2009. Genetic diversity in the Paramecium aurelia species complex. Mol Biol Evol 26:421–431.

Chalker DL, Yao MC. 2011. DNA elimination in ciliates: transposon domestication and genome surveillance. Annu Rev Genet 45:227–246.

Charlesworth D. 2003. Effects of inbreeding on the genetic diversity of populations. Philos T Roy Soc B 358:1051–1070.

Charlesworth B, Morgan MT, Charlesworth D. 1993. The effect of deleterious mutations on neutral molecular variation. Genetics 134:1289–1303.

Chia JM, Song C, Bradbury PJ, Costich D, de Leon N, Doebley J, Elshire RJ, Gaut B, Geller L, Glaubitz JC, et al. 2012. Maize HapMap2 identifies extant variation from a genome in flux. Nat Genet 44:803–807.

Comeron JM, Williford A, Kliman RM. 2008. The Hill-Robertson effect: evolutionary consequences of weak selection and linkage in finite populations. Heredity 100:19–31.

Dey A, Chan CK, Thomas CG, Cutter AD. 2013. Molecular hyperdiversity defines populations of the nematode Caenorhabditis brenneri. Proc Natl Acad Sci U S A 110:11056–11060.

Dong X, Navratilova P, Fredman D, Drivenes O, Becker TS, Lenhard B. 2010. Exonic remnants of whole-genome duplication reveal cis-regulatory function of coding exons. Nucleic Acids Res 38:1071–1085.

Downing T, Imamura H, Decuypere S, Clark TG, Coombs GH, Cotton JA, Hilley JD, de Doncker S, Maes I, Mottram JC, et al. 2011. Whole genome sequencing of multiple Leishmania donovani clinical isolates provides insights into population structure and mechanisms of drug resistance. Genome Res 21:2143–2156.

Drummond DA, Bloom JD, Adami C, Wilke CO, Arnold FH. 2005. Why highly expressed proteins evolve slowly. Proc Natl Acad Sci U S A 102:14338–14343.

Eddy SR. 2011. Accelerated profile HMM searches. PLoS Comput Biol 7:e1002195.

Edgar RC. 2004a. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5:113.

Edgar RC. 2004b. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32:1792–1797.

Eichenlaub MP, Ettwiller L. 2011. De novo genesis of enhancers in vertebrates. PLoS Biol 9:e1001188.

Ellison CE, Hall C, Kowbel D, Welch J, Brem RB, Glass NL, Taylor JW. 2011. Population genomics and local adaptation in wild isolates of a model microbial eukaryote. Proc Natl Acad Sci U S A 108:2831–2836.

Elyashiv E, Bullaughey K, Sattath S, Rinott Y, Przeworski M, Sella G. 2010. Shifts in the intensity of purifying selection: an analysis of genome-wide polymorphism data from two closely related yeast species. Genome Res 20:1558–1573.

Engelthaler DM, Hicks ND, Gillece JD, Roe CC, Schupp JM, Driebe EM, Gilgado F, Carriconde F, Trilles L, Firacative C, et al. 2014. Cryptococcus gattii in North American Pacific Northwest: whole-population genome analysis provides insights into species evolution and dispersal. mBio. 5:e01464–e01414.

Fawcett JA, Iida T, Takuno S, Sugino RP, Kado T, Kugou K, Mura S, Kobayashi T, Ohta K, Nakayama J, et al. 2014. Population genomics of the fission yeast Schizosaccharomyces pombe. PLoS ONE. 9:e104241.

Fenchel T, Finlay BJ. 2004. The ubiquity of small species: patterns of local and global diversity. Bioscience. 54:777–784.

Finlay BJ. 2002. Global dispersal of free-living microbial eukaryote species. Science. 296:1061–1063.

Finlay BJ, Esteban GF, Fenchel T. 2004. Protist diversity is different? Protist. 155:15–22.

Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J, et al. 2014. Pfam: the protein families database. Nucleic Acids Res. 42:D222–D230.

Foissac S, Gouzy J, Rombauts S, Mathe C, Amselem J, Sterck L, Van de Peer Y, Rouze P, Schiex T. 2008. Genome annotation in plants and fungi: EuGene as a model platform. Curr Bioinform. 3:87–97.

Foissner W. 1999. Protist diversity: estimates of the near-imponderable. Protist. 150:363–368.

Foissner W. 2006. Biogeography and dispersal of micro-organisms: a review emphasizing protists. Acta Protozool. 45:111–136.

Foissner W. 2008. Protist diversity and distribution: some basic considerations. Biodivers Conserv. 17:235–242.

Foissner W. 2011. Dispersal of protists: the role of cysts and human introductions. In: Fontaneto D, editor. Biogeography of microscopic organisms: is everything small everywhere? Cambridge University Press. p. 61–87.

Foissner W, Chao A, Katz LA. 2008. Diversity and geographic distribution of ciliates (Protista: Ciliophora). Biodivers Conserv. 17:345–363.

Fournier-Level A, Korte A, Cooper MD, Nordborg M, Schmitt J, Wilczek AM. 2011. A map of local adaptation in Arabidopsis thaliana. Science. 334:86–89.

Galtier N, Nabholz B, Glemin S, Hurst GDD. 2009. Mitochondrial DNA as a marker of molecular diversity: a reappraisal. Mol Ecol. 18:4541–4550.

Gout JF, Kahn D, Duret L, Paramecium Post-Genomics C. 2010. The relationship among gene expression, the evolution of gene dosage, and the rate of protein evolution. PLoS Genet. 6:e1000944.

Gout JF, Lynch M. 2015. Maintenance and loss of duplicated genes by dosage subfunctionalization. Mol Biol Evol. 32:2141–2148.

Gruchy DF. 1955. The breeding system and distribution of Tetrahymena pyriformis. J Protozool. 2:178–185.

Gustincich S, Manfioletti G, Del Sal G, Schneider C, Carninci P. 1991. A fast method for high-quality genomic DNA extraction from whole human blood. BioTechniques. 11:298–300, 302.

Hancock AM, Witonsky DB, Alkorta-Aranburu G, Beall CM, Gebremedhin A, Sukernik R, Utermann G, Pritchard JK, Coop G, Di Rienzo A. 2011. Adaptations to climate-mediated selective pressures in humans. PLoS Genet. 7:e1001375.

Hiwatashi K. 1968. Determination and inheritance of mating type in Paramecium caudatum. Genetics. 58:373–386.

Hori M, Tomikawa I, Przybos E, Fujishima M. 2006. Comparison of the evolutionary distances among syngens and sibling species of Paramecium. Mol Phylogenet Evol. 38:697–704.

Huang X, Kurata N, Wei X, Wang ZX, Wang A, Zhao Q, Zhao Y, Liu K, Lu H, Li W, et al. 2012. A map of rice genome variation reveals the origin of cultivated rice. Nature. 490:497–501.

Huang X, Zhao Q, Han B. 2015. Comparative population genomics reveals strong divergence and infrequent introgression between Asian and African rice. Mol Plant. 8:958–960.

Hudson RR. 2002. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics. 18:337–338.

Hudson RR, Turelli M. 2003. Stochasticity overrules the “three-times rule”: genetic drift, genetic draft, and coalescence times for nuclear loci versus mitochondrial DNA. Evolution. 57:182–190.

Jaillon O, Bouhouche K, Gout JF, Aury JM, Noel B, Saudemont B, Nowacki M, Serrano V, Porcel BM, Segurens B, et al. 2008. Translational control of intron splicing in eukaryotes. Nature. 451:359–362.

James JE, Piganeau G, Eyre-Walker A. 2016. The rate of adaptive evolution in animal mitochondria. Mol Ecol. 25:67–78.

Joy DA, Feng X, Mu J, Furuya T, Chotivanich K, Krettli AU, Ho M, Wang A, White NJ, Suh E, et al. 2003. Early origin and recent expansion of Plasmodium falciparum. Science. 300:318–321.

Kaplan NL, Hudson RR, Langley CH. 1989. The hitchhiking effect revisited. Genetics. 123:887–899.

Katz LA, McManus GB, Snoeyenbos-West OLO, Griffin A, Pirog K, Costas B, Foissner W. 2005. Reframing the 'Everything is everywhere' debate: evidence for high gene flow and diversity in ciliate morphospecies. Aquat Microb Ecol. 41:55–65.

Katz LA, Snoeyenbos-West O, Doerder FP. 2006. Patterns of protein evolution in Tetrahymena thermophila: implications for estimates of effective population size. Mol Biol Evol. 23:608–614.

Krenek S, Berendonk TU, Fokin SI. 2015. New Paramecium (Ciliophora, Oligohymenophorea) congeners shape our view on its biodiversity. Org Divers Evol. 15:215–233.

Krenek S, Berendonk TU, Petzoldt T. 2011. Thermal performance curves of Paramecium caudatum: a model selection approach. Eur J Protistol. 47:124–137.

Krenek S, Petzoldt T, Berendonk TU. 2012. Coping with temperature at the warm edge–patterns of thermal adaptation in the microbial eukaryote Paramecium caudatum. PLoS One. 7:e30598.

Kudla G, Murray AW, Tollervey D, Plotkin JB. 2009. Coding-sequence determinants of gene expression in Escherichia coli. Science. 324:255–258.

Kvitek DJ, Will JL, Gasch AP. 2008. Variations in stress sensitivity and genomic expression in diverse S. cerevisiae isolates. PLoS Genet. 4:e1000223.

Lahr DJ, Laughinghouse HDt, Oliverio AM, Gao F, Katz LA. 2014. How discordant morphological and molecular evolution among microorganisms can revise our notions of biodiversity on Earth. Bioessays. 36:950–959.

Langley CH, Stevens K, Cardeno C, Lee YC, Schrider DR, Pool JE, Langley SA, Suarez C, Corbett-Detig RB, Kolaczkowski B, et al. 2012. Genomic variation in natural populations of Drosophila melanogaster. Genetics. 192:533–598.

Leffler EM, Bullaughey K, Matute DR, Meyer WK, Segurel L, Venkat A, Andolfatto P, Przeworski M. 2012. Revisiting an old riddle: what determines genetic diversity levels within species? PLoS Biol. 10:e1001388.

Li H. 2011. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 27:2987–2993.

Li H, Durbin R. 2010. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics. 26:589–595.

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Genome Project Data Processing S. 2009. The sequence alignment/map format and SAMtools. Bioinformatics. 25:2078–2079.

Lim LP, Burge CB. 2001. A computational analysis of sequence features involved in recognition of short introns. Proc Natl Acad Sci U S A. 98:11193–11198.

Liti G, Carter DM, Moses AM, Warringer J, Parts L, James SA, Davey RP, Roberts IN, Burt A, Koufopanou V, et al. 2009. Population genomics of domestic and wild yeasts. Nature. 458:337–341.

Loveless MD, Hamrick JL. 1984. Ecological determinants of genetic-structure in plant-populations. Annu Rev Ecol Syst. 15:65–95.

Lynch M. 2007. The origins of genome architecture. Sunderland (MA): Sinauer Associates.

Lynch M. 2009. Estimation of allele frequencies from high-coverage genome-sequencing projects. Genetics. 182:295–301.

Lynch M, Field MC, Goodson HV, Malik HS, Pereira-Leal JB, Roos DS, Turkewitz AP, Sazer S. 2014. Evolutionary cell biology: two origins, one objective. Proc Natl Acad Sci U S A. 111:16990–16994.

Lynch M, Sung W, Morris K, Coffey N, Landry CR, Dopman EB, Dickinson WJ, Okamoto K, Kulkarni S, Hartl DL, et al. 2008. A genome-wide view of the spectrum of spontaneous mutations in yeast. Proc Natl Acad Sci U S A. 105:9272–9277.

Maere S, Heymans K, Kuiper M. 2005. BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics. 21:3448–3449.

Maguire B, Belk D. 1967. Paramecium transport by land snails. J Protozool. 14:445–447.

Maruki T, Lynch M. 2015. Genotype-frequency estimation from high-throughput sequencing data. Genetics. 201:473–486.

McGrath CL, Gout JF, Doak TG, Yanagi A, Lynch M. 2014a. Insights into three whole-genome duplications gleaned from the Paramecium caudatum genome sequence. Genetics. 197:1417–1428.

McGrath CL, Gout JF, Johri P, Doak TG, Lynch M. 2014b. Differential retention and divergent resolution of duplicate genes following whole-genome duplication. Genome Res. 24:1665–1675.

Miller JR, Delcher AL, Koren S, Venter E, Walenz BP, Brownley A, Johnson J, Li K, Mobarry C, Sutton G. 2008. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics. 24:2818–2824.

Moradian MM, Beglaryan D, Skozylas JM, Kerikorian V. 2007. Complete mitochondrial genome sequence of three Tetrahymena species reveals mutation hot spots and accelerated nonsynonymous substitutions in Ymf genes. PLoS ONE. 2:e650.

Nabholz B, Glemin S, Galtier N. 2009. The erratic mitochondrial clock: variations of mutation rate, not population size, affect mtDNA diversity across birds and mammals. BMC Evol Biol. 9:54.

Nabholz B, Mauffrey JF, Bazin E, Galtier N, Glemin S. 2008. Determination of mitochondrial genetic diversity in mammals. Genetics. 178:351–361.

Neafsey DE, Galinsky K, Jiang RH, Young L, Sykes SM, Saif S, Gujja S, Goldberg JM, Young S, Zeng Q, et al. 2012. The malaria parasite Plasmodium vivax exhibits greater genetic diversity than Plasmodium falciparum. Nat Genet. 44:1046–1050.

Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, Indap A, King KS, Bergmann S, Nelson MR, et al. 2008. Genes mirror geography within Europe. Nature. 456:98–101.

Ohta T. 1974. Mutational pressure as main cause of molecular evolution and polymorphism. Nature. 252:351–354.

Ohta T. 1992. The nearly neutral theory of molecular evolution. Annu Rev Ecol Syst. 23:263–286.

Pawlowski J, Audic S, Adl S, Bass D, Belbahri L, Berney C, Bowser SS, Cepicka I, Decelle J, Dunthorn M, et al. 2012. CBOL protist working group: barcoding eukaryotic richness beyond the animal, plant, and fungal kingdoms. PLoS Biol. 10:e1001419.

Piganeau G, Eyre-Walker A. 2009. Evidence for variation in the effective population size of animal mitochondrial DNA. PLoS One. 4:e4396.

Pool JE, Nielsen R. 2007. Population size changes reshape genomic patterns of diversity. Evolution. 61:3001–3006.

Pophaly SD, Tellier A. 2015. Population level purifying selection and gene expression shape subgenome evolution in maize. Mol Biol Evol. 32:3226–3235.

Prescott DM. 1994. The DNA of ciliated protozoa. Microbiol Rev. 58:233–267.

Pritchard AE, Seilhamer JJ, Mahalingam R, Sable CL, Venuti SE, Cummings DJ. 1990. Nucleotide sequence of the mitochondrial genome of Paramecium. Nucleic Acids Res. 18:173–180.

Przybos E. 1975. Genetic studies of Paramecium jenningsi strains (Diller, Earl, 1958). Folia Biol (Krakow). 23:425–471.

Przybos E. 2005. Recent data on the occurrence of species of the Paramecium aurelia complex in Europe. Folia Biol-Krakow. 53:61–63.

Przybos E, Barth D, Berendonk TU. 2008. The Paramecium aurelia species complex, frequency and co-occurrence across Europe. Folia Biol (Krakow). 56:77–81.

Przybos E, Rautian M, Greczek-Stachura M, Potekhin A. 2007. Polymorphism within Paramecium sexaurelia (Ciliophora, Oligohymenophorea) and description of a new stand of the species in China. Folia Biol (Krakow). 55:121–125.

Przybos E, Surmacz M. 2010. New, world-wide data on the distribution of species of the Paramecium aurelia complex (Ciliophora, Protozoa). Folia Biol-Krakow. 58:185–188.

R-Core-Team. 2014. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

Redden H, Alper HS. 2015. The development and characterization of synthetic minimal yeast promoters. Nat Commun. 6:7810.

Ritter DI, Dong Z, Guo S, Chuang JH. 2012. Transcriptional enhancers in protein-coding exons of vertebrate developmental genes. PLoS One. 7:e35202.

Russell CB, Fraga D, Hinrichsen RD. 1994. Extremely short 20-33 nucleotide introns are the standard length in Paramecium tetraurelia. Nucleic Acids Res. 22:1221–1225.

Salim HMW, Ring KL, Cavalcanti ARO. 2008. Patterns of codon usage in two ciliates that reassign the genetic code: Tetrahymena thermophila and Paramecium tetraurelia. Protist. 159:283–298.

Schacherer J, Shapiro JA, Ruderfer DM, Kruglyak L. 2009. Comprehensive polymorphism survey elucidates population structure of Saccharomyces cerevisiae. Nature. 458:342–345.

Schattner P, Brooks AN, Lowe TM. 2005. The tRNAscan-SE, snoscan and snoGPS web servers for the detection of tRNAs and snoRNAs. Nucleic Acids Res. 33:W686–W689.

Schrider DR, Kern AD. 2015. Inferring selective constraint from population genomic data suggests recent regulatory turnover in the human brain. Genome Biol Evol. 7:3511–3528.

Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. 2003. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13:2498–2504.

Simonsen KL, Churchill GA, Aquadro CF. 1995. Properties of statistical tests of neutrality for DNA polymorphism data. Genetics. 141:413–429.

Smith DR, Lee RW. 2008. Nucleotide diversity in the mitochondrial and nuclear compartments of Chlamydomonas reinhardtii: investigating the origins of genome architecture. BMC Evol Biol. 8:156.

Smith JM, Haigh J. 1974. The hitch-hiking effect of a favourable gene. Genetics Res. 89:391–403.

Snoke MS, Berendonk TU, Barth D, Lynch M. 2006. Large global effective population sizes in Paramecium. Mol Biol Evol. 23:2474–2479.

Sonneborn TM. 1970. Methods in Paramecium research. In: Methods Cell Biol. New York and London: Academic Press. p. 241–339.

Sonneborn TM. 1975. Paramecium aurelia complex of fourteen sibling species. T Am Microsc Soc. 94:155–178.

Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 30:1312–1313.

Stoeck T, Welter H, Seitz-Bender D, Kusch J, Schmidt HJ. 2000. ARDRA and RAPD-fingerprinting reject the sibling species concept for the ciliate Paramecium caudatum (Ciliophora, Protoctista). Zool Scr. 29:75–82.

Sun M. 1999. Cleistogamy in Scutellaria indica (Labiatae): effective mating system and population genetic structure. Mol Ecol. 8:1285–1295.

Swart EC, Nowacki M, Shum J, Stiles H, Higgins BP, Doak TG, Schotanus K, Magrini VJ, Minx P, Mardis ER, et al. 2012. The Oxytricha trifallax mitochondrial genome. Genome Biol Evol. 4:136–154.

Tajima F. 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics. 123:585–595.

Tishkoff SA, Reed FA, Ranciaro A, Voight BF, Babbitt CC, Silverman JS, Powell K, Mortensen HM, Hirbo JB, Osman M, et al. 2007. Convergent adaptation of human lactase persistence in Africa and Europe. Nat Genet. 39:31–40.

Tsukii Y, Hiwatashi K. 1983. Genes controlling mating-type specificity in Paramecium caudatum: three loci revealed by intersyngenic crosses. Genetics. 104:41–62.

Tsukii Y, Hiwatashi K. 1985. Meiotic nondisjunction and aneuploids in intersyngenic hybrids of Paramecium caudatum. Genetics. 111:779–794.

Tuller T, Zur H. 2015. Multiple roles of the coding sequence 5′ end in gene expression regulation. Nucleic Acids Res. 43:13–28.

Voight BF, Kudaravalli S, Wen X, Pritchard JK. 2006. A map of recent positive selection in the human genome. PLoS Biol. 4:e72.

Volkman SK, Sabeti PC, DeCaprio D, Neafsey DE, Schaffner SF, Milner DA Jr, Daily JP, Sarr O, Ndiaye D, Ndir O, et al. 2007. A genome-wide map of diversity in Plasmodium falciparum. Nat Genet. 39:113–119.

Weir W, Capewell P, Foth B, Clucas C, Pountain A, Steketee P, Veitch N, Koffi M, De Meeus T, Kabore J, et al. 2016. Population genomics reveals the origin and asexual evolution of human infective trypanosomes. eLife. 5:e11473.

Weisse T. 2008. Distribution and diversity of aquatic protists: an evolutionary and ecological perspective. Biodivers Conserv. 17:243–259.

Wickham H. 2010. ggplot2: Elegant graphics for data analysis. New York: Springer-Verlag.

Zhang JZ, Yang JR. 2015. Determinants of the rate of protein sequence evolution. Nat Rev Genet. 16:409–420.

2011). Alternatively, it is possible that the rates of dispersal are not very high, but rare events of migration over a long period of time have resulted in cosmopolitan distribution (Foissner 2006). Lastly, it is possible that the older Paramecium species, P. multimicronucleatum, P. caudatum, and P. sexaurelia (McGrath et al. 2014b), were distributed worldwide before the continents separated and therefore achieved a global distribution without extensive recent migration (Foissner 2008; Weisse 2008). Such scenarios are plausible given that Paramecia do not form cysts (Sonneborn 1975), and thus their long-distance dispersal capabilities might be limited. A consequence of either of the latter two scenarios would be strong subdivision among geographically distant populations. In principle, understanding the worldwide population structure of Paramecium would provide more insight into the ongoing debate of whether unicellular eukaryote populations have biogeographical patterns similar to those of many multicellular organisms (Foissner 2008; Foissner et al. 2008).

To gain more insight into this issue, we built phylogenetic trees using 883 single-copy orthologous protein-coding genes that are present in all species. For each orthologous gene set, all sequences (derived from each isolate and the reference genomes) were concatenated into one supermatrix, and RAxML 8.0.26 (Stamatakis 2014) was used to generate a phylogenetic tree under the substitution model GTRGAMMA (fig. 1). We also generated gene trees for each of the orthologous gene sets separately and obtained a majority-rule consensus tree (supplementary fig. S1, Supplementary Material online) (Aberer et al. 2010). Both trees exhibit very similar topologies, and all species are monophyletic. The only exception is one strain assigned to P. biaurelia (B_A, Poland) that clusters within the P. tetraurelia clade and for which the sequencing data map to 97% (34× coverage) of the P. tetraurelia reference genome. We therefore excluded this most likely misidentified strain from all subsequent analyses.
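The concatenation step described above can be illustrated with a minimal Python sketch. It assumes each orthologous gene set is already aligned and stored as a mapping from isolate name to aligned sequence; the function name `concatenate_alignments` and the gap-padding of isolates missing from a gene are assumptions made for illustration, not details taken from the study.

```python
def concatenate_alignments(gene_alignments):
    """Concatenate per-gene alignments into one supermatrix.

    gene_alignments: list of dicts mapping isolate name -> aligned sequence.
    Isolates absent from a gene are padded with gaps (an assumption of this
    sketch) so that every row of the supermatrix has the same length.
    """
    taxa = sorted({name for aln in gene_alignments for name in aln})
    rows = {name: [] for name in taxa}
    for aln in gene_alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within a gene alignment must have equal length")
        width = lengths.pop()
        for name in taxa:
            rows[name].append(aln.get(name, "-" * width))
    return {name: "".join(parts) for name, parts in rows.items()}


# Hypothetical toy input: two genes, the second missing isolate "B".
genes = [{"A": "ACG", "B": "ACT"}, {"A": "TT"}]
supermatrix = concatenate_alignments(genes)
```

The resulting rows could then be written out in a format accepted by a tree-building program; that step is left out here.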

P. caudatum consists of two major clades (A and B; genome-wide Fst = 0.38) and two outgroup strains, which represent the two divergent isolates noted above. The two major clades, A and B, which do not assort geographically, represent the two haplotype groups into which P. caudatum has previously been suggested to partition (Barth et al. 2006; Hori et al. 2006; Krenek et al. 2015). Interestingly, 92% of genes support the monophyly of Clade A, whereas only 77% of genes support the monophyly of Clade B (supplementary fig. S1, Supplementary Material online), possibly suggesting unidirectional gene flow between the two clades. Similarly, P. multimicronucleatum partitions into two major clades (C and D; genome-wide Fst = 0.50), and all four divergent isolates form clade D.
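For readers unfamiliar with Fst, the following sketch shows one common way to compute it between two clades from biallelic allele frequencies, as (H_T - H_S) / H_T summed over sites. The estimator used in the study is not specified in this section, so the formula choice, the equal weighting of the two clades, and the function name `fst_two_populations` are assumptions.

```python
def fst_two_populations(freqs_a, freqs_b):
    """Fst = (H_T - H_S) / H_T for two populations at biallelic sites.

    freqs_a, freqs_b: per-site frequencies of the same allele in each
    population. Heterozygosities are summed over sites before taking the
    ratio, and both populations are weighted equally (sketch assumptions).
    """
    h_within = 0.0
    h_total = 0.0
    for p_a, p_b in zip(freqs_a, freqs_b):
        # Mean expected heterozygosity within the two populations.
        h_within += (2 * p_a * (1 - p_a) + 2 * p_b * (1 - p_b)) / 2
        # Expected heterozygosity of the pooled population.
        p_bar = (p_a + p_b) / 2
        h_total += 2 * p_bar * (1 - p_bar)
    if h_total == 0:
        return 0.0  # no variation at any site
    return (h_total - h_within) / h_total
```

Under this definition, two clades fixed for different alleles at every variable site give Fst = 1, and clades with identical frequencies give Fst = 0.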

The P. caudatum subset (Clades A and B) and the P. multimicronucleatum subset (Clade C) seem to be composed of individuals that can successfully mate. Multiple lines of evidence have suggested that clades A and B are not reproductively isolated (Hiwatashi 1968; Tsukii and Hiwatashi 1983, 1985; Stoeck et al. 2000; Hori et al. 2006) and are not different species according to the biological species concept. Although all strains used in the study were identified via mating tests conducted in previous studies, the extent of reproductive isolation between clades C and D is not entirely clear. We therefore consider both the P. caudatum and P. multimicronucleatum subsets to be composed of a conservative set of individuals that are not reproductively isolated.

Remarkably, in none of the species do isolates appear to group together in accordance with geographical location, suggesting the absence of strong geographic population structure. To explore this issue, we performed principal component analysis (PCA) with approximately 10,000 uniformly distributed SNPs for each of the five species separately (fig. 2; supplementary fig. S2, Supplementary Material online). In general, the genetic distances between isolates do not appear to correlate with geography, with the exception of P. biaurelia, wherein the PCA plots suggest a USA-Europe separation and mildly mirror geography within Europe. In addition, linear regression shows no significant correlation between pairwise geographical and genetic distance between isolates within each species (fig. 2; supplementary fig. S2, Supplementary Material online).
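The comparison of pairwise geographic and genetic distance can be sketched as follows. This is an illustrative outline only: it assumes each isolate has a sampling latitude and longitude and a genotype string over the same SNP sites, uses great-circle (haversine) distance, and summarizes the relationship with a Pearson correlation. The function names and the input layout are hypothetical.

```python
import math
from itertools import combinations


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(min(1.0, a)))


def genetic_distance(g1, g2):
    """Fraction of SNP sites at which two genotype strings differ."""
    if len(g1) != len(g2) or not g1:
        raise ValueError("genotype strings must be non-empty and of equal length")
    return sum(a != b for a, b in zip(g1, g2)) / len(g1)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)


def geography_vs_genetics(isolates):
    """isolates: dict name -> (lat, lon, genotype string). Returns Pearson r."""
    geo, gen = [], []
    for a, b in combinations(sorted(isolates), 2):
        lat_a, lon_a, g_a = isolates[a]
        lat_b, lon_b, g_b = isolates[b]
        geo.append(haversine_km(lat_a, lon_a, lat_b, lon_b))
        gen.append(genetic_distance(g_a, g_b))
    return pearson_r(geo, gen)
```

Because pairwise distances share isolates and are therefore not independent, a permutation-based test such as a Mantel test is often preferred over the regression p-value when assessing significance.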

Conversely, we find that any two isolates sampled from exactly the same lake or city are always closely related on phylogenies and are indistinguishable by PCA. This can be seen in P. biaurelia (B_256-UB4 and B_256-UB2 from USA, Boston) and P. sexaurelia (S_133 and S_132 from Spain, Seville; S_129 and S_130 from Greece, Ioannina Lake). None of these pairs of isolates are identical: their genotypes differ at 11793, 6896, and 3736 bases (0.00058, 0.00015, and 0.00010 divergence per site), respectively.

Thus, overall, within the limitations of our sampling, we do not observe any clear population structure correlated with geography among any of the investigated Paramecium species, suggesting that these ciliates are effectively ubiquitously distributed and hence providing support for the “cosmopolitan model” or the “everything is everywhere” hypothesis (Finlay 2002; Finlay et al. 2004). As an exception, P. biaurelia might mildly support the “moderate endemicity model” (Foissner 2006, 2008, 1999). Although many microbial species have now been shown to exhibit biogeographies (reviewed in Foissner 2006), no such patterns are observed on the continental scale in the examined Paramecium species. Finally, the scenario that the worldwide distribution of these species occurred only before the split of Pangea is strongly refuted. Our findings rather suggest recent or ongoing migration and gene flow of Paramecium across the globe, supporting observations from previous studies (Przybos et al. 2008; Catania et al. 2009; Krenek et al. 2012). Interestingly, although there appears to be a correlation between physical and genetic distance on a very small geographical scale, this correlation disappears when evaluated globally across continents. This could be a result of very different mechanisms and rates of dispersal at shorter versus larger distances, for instance, via organismal vectors (short-distance carriers) and global atmospheric circulation or human-mediated transport (long-distance dispersal). It is also possible that autogamy in the P. aurelia species plays a role in reducing genetic variation within populations and increasing genetic differentiation between populations, as observed in self-fertilizing plant species (Loveless and Hamrick 1984; Abbott and Gomes 1989; Sun 1999; Charlesworth 2003).

Nucleotide Diversity across the Nuclear Genome

Nucleotide diversity (measured by π) at 4-fold degenerate sites across the genome is extremely high in both P. caudatum (0.135) and P. multimicronucleatum (0.159) but much lower for the P. aurelia species: 0.027 for P. sexaurelia, 0.009 for P. biaurelia, and 0.006 for P. tetraurelia. Upon excluding the two divergent strains in P. caudatum, the nucleotide diversity at 4-fold degenerate sites halves to about 0.069. Similarly, upon excluding the four divergent strains in P. multimicronucleatum, the diversity levels fall to 0.074. Table 1 shows the nucleotide diversity across different site types: intronic, coding (4-fold degenerate, 2-fold degenerate, and 0-fold degenerate), and intergenic regions (noncoding regions that extend from the start codon of a gene to the start/stop codon of the adjacent gene).
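As a reminder of what π measures, the sketch below computes per-site nucleotide diversity for biallelic sites from allele counts, using the standard unbiased estimator 2k(n - k) / (n(n - 1)) for k copies of one allele among n sampled sequences, and averages it over sites. It assumes allele counts are already known; how the study estimated heterozygosity from sequencing reads is not reproduced here, and the function names are hypothetical.

```python
def site_diversity(alt_count, n):
    """Unbiased diversity at one biallelic site: 2k(n-k) / (n(n-1)).

    alt_count: number of sampled sequences carrying the alternate allele (k).
    n: total number of sampled sequences at the site.
    """
    if n < 2:
        raise ValueError("at least two sampled sequences are required")
    if not 0 <= alt_count <= n:
        raise ValueError("alt_count must lie between 0 and n")
    return 2.0 * alt_count * (n - alt_count) / (n * (n - 1))


def mean_diversity(sites):
    """Average per-site diversity over (alt_count, n) pairs.

    Monomorphic sites (alt_count of 0 or n) must be included so that the
    result is a per-site value rather than a per-SNP value.
    """
    values = [site_diversity(k, n) for k, n in sites]
    if not values:
        raise ValueError("no sites supplied")
    return sum(values) / len(values)
```

Restricting `sites` to one class (for example, 4-fold degenerate sites only) gives the class-specific values of the kind reported in Table 1.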

FIG. 1. A concatenated tree of all single-copy gene sequences (maximum likelihood, GTRGAMMA) in five species, P. tetraurelia (green, T), P. biaurelia (blue, B), P. sexaurelia (orange, S), P. caudatum (red, C), and P. multimicronucleatum (purple, M), shows that all species are monophyletic, excluding one strain of P. biaurelia (B_A) that branches with P. tetraurelia. Clades A and B together constitute the P. caudatum subset, and Clade C represents the P. multimicronucleatum subset. Marked isolates are the divergent isolates that mapped to <75% of the reference genome. The numbers represent maximum likelihood estimates of branch lengths (mean number of substitutions per site).

Expected heterozygosity in introns increases significantly across all species upon exclusion of the first three and last three sites within each intron, indicating that these sites are conserved, presumably for splicing. Overall, diversity in intronic and intergenic regions is remarkably low compared to the diversity levels at 4-fold degenerate sites for all species (also seen in supplementary fig. S3, Supplementary Material online), suggesting that the non-coding regions in Paramecium are under strong selective constraints, most likely because they harbor functionally important splice sites and regulatory modules, respectively. Interestingly, while nucleotide diversity levels in the P. aurelia species are much lower in introns relative to intergenic sequences, this trend is reversed in the two outgroups. This observation is consistent with the fact that, on average, intergenic regions in the P. aurelia species are longer. A possible explanation is the nonfunctionalization of genes following the two WGDs, which would increase the amount of non-functional DNA between the two flanking functional genes and thus reduce the fraction of intergenic sites under constraint. Among the P. aurelia species, P. sexaurelia shows much higher genetic variation across the genome, consistent with previous studies (Przybos et al. 2007) that found significant genotypic differences between P. sexaurelia strains. This is also consistent with the relatively higher divergence of P. sexaurelia (McGrath et al. 2014b), suggesting that the differences in genome-wide levels of sequence variation could be driven by slight differences in mutation rate.

FIG. 2. Population structure of Paramecium species. Left: Principal component analysis using genome-wide genotypes for P. tetraurelia, P. biaurelia, and P. sexaurelia. Right: Average pairwise genetic distance as a function of physical distance between individuals within P. tetraurelia, P. biaurelia, and P. sexaurelia.

Overall, 4-fold degenerate sites have the highest levels of nucleotide diversity across all elements of the genome for all species and thus might represent the most neutrally evolving sites in the Paramecium macronucleus. The diversity values reported for the P. aurelia species fall within the range of known diversity levels for other unicellular eukaryotes such as Plasmodium falciparum (4-fold degenerate π = 0.0015) (Volkman et al. 2007), Chlamydomonas reinhardtii (silent site π = 0.0319) (Smith and Lee 2008), and Tetrahymena thermophila (coding π = 0.0020) (Katz et al. 2006). On the other hand, the values of 4-fold degenerate diversity in the outgroup species P. caudatum and P. multimicronucleatum are among the highest reported (Leffler et al. 2012; Dey et al. 2013). It is still not entirely clear whether such high diversity values are due to a large panmictic global population or to population structure despite infrequent gene flow. Sampling individuals from the same pond will help distinguish between these two scenarios. However, we cannot presently rule out other factors, such as the presence of cryptic species among the outgroups, introgression with closely related species, or a much higher intrinsic mutation rate, that could also contribute to these remarkably high levels of variation.

Intergenic Regions in the Outgroups are Under Strong Constraints

The macronuclear genomes of all Paramecium species are extremely compact (fig. 3a and 3b), with the median intergenic region (non-coding DNA between start and start/stop sites of adjacent genes) in P. caudatum and P. multimicronucleatum comprising 43 and 93 bp, respectively. The aurelia species have slightly longer intergenic regions (median of 161 bp, 195 bp, and 229 bp in P. tetraurelia, P. biaurelia, and P. sexaurelia, respectively). Paramecium also has some of the shortest eukaryotic introns, with an average size of just 20–25 base pairs (Russell et al. 1994). The internally eliminated sequences (IESs), which are excised during the formation of the macronucleus and are well characterized only in P. tetraurelia, have a median length of 26–28 bp (Arnaiz et al. 2012). In fact, 93% of the IESs in P. tetraurelia are shorter than 150 bp (Arnaiz et al. 2012), suggesting that the micronuclear genomes of the Paramecium species are also fairly compact. In most organisms, diversity in coding regions is well below the diversity in intergenic and intronic regions. For instance, in S. cerevisiae, which has a fairly compact genome, the non-coding to coding diversity ratio is 1.23 (Schacherer et al. 2009). In S. pombe, the coding heterozygosity is 0.2122 and diversity at intergenic regions (including 5′ and 3′ UTRs) is 0.3351, resulting in an intergenic to coding diversity ratio of 1.58 (calculated from Fawcett et al. 2014).

In the P. aurelia species, nucleotide diversity in coding regions is about 50% of the intergenic diversity, while for the outgroup species, intergenic regions have much less diversity than the coding regions (table 1). In order to reduce biases that might result from differences in sequence depth of coverage across different sites, we compared expected heterozygosity at coding and intergenic regions using only sites that were covered in all individuals and had coverage very close to the median population coverage (±20 reads) (fig. 3c). While the ratio of intergenic to coding diversity in the P. aurelia species is comparable to that in other organisms, 1.66 (P. tetraurelia), 1.64 (P. biaurelia), and 1.69 (P. sexaurelia), in the outgroup species the ratio is 0.58 [subset: 0.74] for P. caudatum and 0.80 [subset: 0.75] for P. multimicronucleatum. Hereafter, the values in square brackets correspond to the subsets. We know of no other species with such low nucleotide diversity in intergenic relative to coding regions, suggesting strong conservation of intergenic sequence.
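The coverage-matched comparison described above can be sketched in a few lines. The example assumes a list of per-site records that have already been restricted to sites covered in all individuals; the record layout (`category`, `coverage`, `heterozygosity`), the function name, and the default tolerance of 20 reads around the median are illustrative assumptions.

```python
from statistics import median


def intergenic_to_coding_ratio(sites, tolerance=20):
    """Ratio of mean intergenic to mean coding heterozygosity.

    sites: list of dicts with keys
        'category'       -> 'intergenic' or 'coding'
        'coverage'       -> population-wide read depth at the site
        'heterozygosity' -> expected heterozygosity at the site
    Only sites whose coverage lies within +/- tolerance reads of the median
    coverage across all supplied sites are used.
    """
    med = median(site["coverage"] for site in sites)
    kept = [s for s in sites if abs(s["coverage"] - med) <= tolerance]

    def mean_heterozygosity(category):
        values = [s["heterozygosity"] for s in kept if s["category"] == category]
        if not values:
            raise ValueError("no sites left in category: " + category)
        return sum(values) / len(values)

    return mean_heterozygosity("intergenic") / mean_heterozygosity("coding")
```

A ratio above 1 corresponds to the pattern reported for the P. aurelia species, and a ratio below 1 to the pattern reported for the outgroups.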

Despite accounting for bias in coverage, it is possible that if the reference genome is highly diverged from the sequenced individuals (as in our case), say at intergenic regions, those reads might fail to map and hence result in a lower estimate of intergenic diversity. We therefore performed simulations to address mapping and SNP detection when the reference genome is highly diverged from individuals in the population. Our simulations show that up to 10% divergence from the reference, SNP detection is remarkably accurate. At 10% divergence from the reference, 93% of true SNPs can be detected, and estimated divergence (9.5%) is only slightly lower than true divergence (10%) (supplementary fig. S4, supplementary table S4, Supplementary Material online). However, reads obtained from genomes that were more than 10% divergent from the reference failed to map successfully, and there is a steep reduction in the fraction of true SNPs detected, with only ~10% of the true SNPs detected at 20% divergence from the reference (supplementary fig. S4, Supplementary Material online). Most of our strains in P. caudatum are far below 10% divergence (including coding and non-coding regions separately) (supplementary table S5, Supplementary Material online), and thus overall analyses across the majority of the genome should not be affected by this bias. In P. multimicronucleatum, although the divergent strains are close to being 10% divergent from the reference (supplementary table S5, Supplementary Material online), when we consider only the P. multimicronucleatum subset, all strains are less than 6% divergent from the reference. Therefore, it is possible that such a bias is affecting our results when we include all the strains in the outgroup species, but analyses with the subsets should be minimally influenced by this bias. Nonetheless, it should be noted that due to heterogeneity of divergence across the genome, some regions of the genome might still be affected by such a mapping bias. Accordingly, we repeated the above analysis by calling SNPs using only the 4 and 5 closest individuals to the reference genome (forming polytomies in supplementary fig. S1, Supplementary Material online) in P. caudatum and P. multimicronucleatum, respectively. These individuals were even closer to the reference genome, with average expected heterozygosity at 4-fold degenerate sites of 0.04302 and 0.06255, respectively. Again, we observe that for all sites that are covered by all individuals and have coverage very close to the median population coverage (±20 reads), the ratio of intergenic to coding diversity was 0.78 and 0.72, respectively. This suggests that the observation of relatively lower genetic variation in the intergenic relative to coding regions in the outgroup species is not an artifact of our analysis.

Table 1. Mean Nucleotide Diversity (per site) Estimated for All Isolates across the Genome (SE < 3.0 × 10^-4 for all sites). Intron Represents Introns Excluding the First Three and Last Three Sites. Intergenic Region Refers to the Noncoding DNA that Extends between the Stop and Start Site of Protein-Coding Genes.

Species                          4-fold  2-fold  0-fold  syn     non-syn  intron  Intron  intergenic  avg coding
P. tetraurelia                   0.0058  0.0033  0.0012  0.0042  0.0011   0.0026  0.0032  0.0045      0.0023
P. biaurelia                     0.0089  0.0049  0.0021  0.0059  0.0018   0.0049  0.0050  0.0071      0.0035
P. sexaurelia                    0.0269  0.0165  0.0061  0.0194  0.0052   0.0151  0.0167  0.0209      0.0113
P. caudatum                      0.1352  0.0819  0.0150  0.1056  0.0134   0.0466  0.0602  0.0350      0.0472
P. caudatum (subset)             0.0693  0.0381  0.0090  0.0518  0.0080   0.0316  0.0406  0.0262      0.0241
P. multimicronucleatum           0.1599  0.0781  0.0216  0.1146  0.0193   0.0405  0.0492  0.0337      0.0530
P. multimicronucleatum (subset)  0.0742  0.0418  0.0100  0.0549  0.0089   0.0249  0.0330  0.0201      0.0259

Population Genomics of Paramecium Species. doi:10.1093/molbev/msx074. MBE

FIG. 3. The compactness of Paramecium genomes. (A, B) Distribution of intron and intergenic length across Paramecium genomes: (A) P. tetraurelia and (B) P. caudatum. Mean intron lengths in P. tetraurelia and P. caudatum: 25 and 22 bp; median intergenic lengths in P. tetraurelia and P. caudatum: 161 and 43 bp. (C) πcoding/πintergenic calculated genome-wide from sites that had coverage very close to the median population coverage (±20 reads) and were contributed by the same number of individuals. (D) Incidence of indels across the genome for each Paramecium species, calculated by dividing the total number of indels found in that region by the total number of utilizable nucleotide sites in that region.

As an alternative approach to examine the relative strength of selective constraint on coding versus non-coding DNA, we examined the distribution of insertions/deletions (indels) across the genome. A total of 29,096 (P. tetraurelia), 54,406 (P. biaurelia), 88,724 (P. sexaurelia), 32,631 [28,776] (P. caudatum [subset]), and 31,338 [28,347] (P. multimicronucleatum [subset]) indels were found across the genomes, which translates to SNP:indel ratios of 19, 17, 34, 147 [77], and 138 [88], respectively. The indel lengths are exponentially distributed in all species, with a median of 1 bp. Indels are heavily under-represented in coding regions (fig. 3d), where the median indel length is 3 bp, suggesting purifying selection against indels that cause frame-shifts. Introns and intergenic regions, however, tend to have a median indel length of 1 bp (supplementary table S6, Supplementary Material online). Again, we see that in the P. aurelia species, there are proportionally more indels present in the intergenic regions than in introns. On the other hand, in the outgroup species, intergenic regions tolerate fewer indels than do intronic regions. This indicates that the very short intergenic regions in P. caudatum and P. multimicronucleatum are under extremely strong constraint.
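The indel summaries discussed above (incidence per utilizable site, median length, and the share of indels that would shift the reading frame) can be sketched as below. The input layout, the function names, and the rule that an indel is frame-shifting when its length is not a multiple of three are illustrative assumptions, not a description of the authors' code.

```python
from statistics import median


def indel_summary(indels, usable_sites):
    """Summarise indels per genomic region.

    indels        list of (region, length_bp) tuples
    usable_sites  dict mapping region -> number of utilizable sites
    Returns a dict mapping region ->
      (incidence per site, median length, fraction that are frame-shifting).
    """
    by_region = {}
    for region, length in indels:
        by_region.setdefault(region, []).append(length)

    summary = {}
    for region, lengths in by_region.items():
        # An indel whose length is not a multiple of 3 shifts the frame.
        frameshifts = sum(1 for n in lengths if n % 3 != 0)
        summary[region] = (
            len(lengths) / usable_sites[region],
            median(lengths),
            frameshifts / len(lengths),
        )
    return summary


def snp_indel_ratio(n_snps, n_indels):
    """Number of SNPs per indel."""
    return n_snps / n_indels
```

A median coding indel length of 3 bp alongside a 1 bp median elsewhere, as reported in the text, would appear here as a low frame-shifting fraction in the coding region relative to introns and intergenic regions.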

Spatial Patterns of Constraints across Non-Coding Regions in the Genomes

Selection in natural populations can remove strongly deleterious mutations, but weakly deleterious mutations can continue to persist in the population at lower allele frequencies (Ohta 1974, 1992). Thus, within-population nucleotide diversity can be used to infer constraints across the genome and possibly identify functional parts of the non-coding genome. We therefore evaluated spatial patterns of nucleotide diversity in exons flanking all introns and intergenic regions across the genome for each species separately (fig. 4; supplementary fig. S5, Supplementary Material online). However, neutral sites linked to functionally important sites also experience a reduction in diversity as a result of either purging of deleterious alleles (background selection; Charlesworth et al. 1993) or fixation of a beneficial mutation (hitchhiking; Smith and Haigh 1974; Kaplan et al. 1989) at functional loci, due to linkage (Comeron et al. 2008). As selection has a minimal effect on divergence at linked neutral sites (Birky and Walsh 1988), a comparison of polymorphism and divergence can distinguish between the effects of variation in the neutral mutation rate (which is determined by selective constraint in addition to mutational processes) and linked selection (Begun and Aquadro 1992). To this end, we evaluated patterns of divergence across introns and intergenic regions for the relatively more closely related P. aurelia species.

We observed a reduction in both expected heterozygosity and divergence in intergenic regions near the start and end of coding sequences, as well as at some key sites within introns (figs. 4 and 5 and supplementary fig. S6, Supplementary Material online). In fact, divergence-normalized values of expected heterozygosity fail to exhibit any spatial pattern across these non-coding elements (fig. 5; supplementary fig. S6, Supplementary Material online), suggesting that these patterns are driven predominantly by negative selection and mutational processes (although see below) rather than linked selection. This is interesting because, although asexual species are expected to experience the effects of linked selection more strongly (Charlesworth et al. 1993), very compact genomes such as those in Paramecium might experience relatively more direct selection, as they have a higher proportion of functional sites.
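The logic of normalizing heterozygosity by divergence can be illustrated with a small sketch. If diversity and divergence drop by the same factor near gene boundaries, their ratio stays flat, which is the signature the text attributes to direct selection and mutation rather than linked selection. The function names and the flatness measure below are hypothetical choices for the example.

```python
def normalized_heterozygosity(heterozygosity, divergence):
    """Per-position heterozygosity divided by divergence.

    Positions with zero divergence are returned as None because the
    ratio is undefined there.
    """
    if len(heterozygosity) != len(divergence):
        raise ValueError("inputs must have the same length")
    return [h / d if d > 0 else None
            for h, d in zip(heterozygosity, divergence)]


def relative_range(values):
    """(max - min) / mean over defined values.

    A value near zero means the profile is flat across positions.
    """
    defined = [v for v in values if v is not None]
    mean = sum(defined) / len(defined)
    return (max(defined) - min(defined)) / mean
```

For example, a heterozygosity profile of 0.001, 0.002, 0.004 paired with divergence of 0.05, 0.10, 0.20 has a clear spatial trend on its own but a flat normalized profile.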

The first and last three sites of introns are highly conserved, even more so than 0-fold degenerate sites, as observed in patterns of both divergence and nucleotide diversity (figs. 4 and 5), consistent with the role of these positions in intron splicing (Jaillon et al. 2008). In fact, it has been suggested that the first and last three sites in short eukaryotic introns might be sufficient for accurate splicing (Lim and Burge 2001; Jaillon et al. 2008). However, we find that in Paramecium, on average 8–10 base pairs of coding DNA immediately flanking an intron are also under stronger constraints than the remainder of the coding sequence, possibly suggesting the presence of splicing regulatory elements in the coding sequence. Another interesting observation is the clear decrease in divergence (fig. 5e and 5g) in the center of introns, suggesting that longer introns might possess functional elements related to gene regulation.

We also observe that average expected heterozygosity and divergence increase gradually moving upstream from the start codon and downstream from the stop codon into the intergenic regions, stabilizing at around 100–150 base pairs (fig. 4), suggesting the presence of regulatory modules or UTRs. All four types of intergenic regions with respect to the direction of transcription, namely (i) regions that are upstream of both flanking genes, (ii) those downstream of one gene and upstream of another, (iii) those upstream of one gene and downstream of another, and (iv) those downstream of both flanking genes, exhibit the same pattern of reduction in heterozygosity levels just before the start and after the stop codons (supplementary fig. S7, Supplementary Material online). This indicates that constraints on downstream intergenic regions are not an artifact of overlapping upstream intergenic regions. Instead, these observations suggest the presence of gene regulatory elements at the 3′ end of genes, involved either in posttranscriptional regulation, in the specification of transcription termination, or in the regulation of gene expression. We note that in the P. aurelia species, diversity levels in the intergenic regions stabilize to reach average diversity at 4-fold degenerate sites. In contrast, in the two outgroup species, diversity in intergenic regions stabilizes only up to the equivalent average diversity at 2-fold degenerate sites. Such differences in patterns of nucleotide diversity between the P. aurelia species and the outgroup species can be attributed to a much higher proportion of functionally important sites comprising the intergenic space in the outgroup species, possibly as a result of smaller numbers of bases between adjacent genes in the latter.

A similar reduction in diversity in intergenic regions extends up to 85 kb around transcription start/stop sites of genes in humans (Abecasis et al. 2010). In S. cerevisiae, intergenic regions are under strong purifying selection, and those shorter than 300 bp have significantly lower nucleotide diversity than longer intergenic regions (Schacherer et al. 2009).

Interestingly, we also observe on average a 70–100 bp (20–35 amino acids) decrease in diversity as well as divergence at both the 5′ and 3′ ends of the coding sequence (CDS) of protein-coding genes. Such a reduction in diversity at 4-fold and 2-fold degenerate sites at the beginning and end of protein-coding sequence can be explained by a number of different hypotheses. The 5′ ends of CDSs might be required to maintain low thermodynamic stability of mRNA secondary structure near the start codon for efficient translation initiation. Alternatively, CDS 5′ ends might possess fewer tRNA-adapted codons to slow down translation initially, which has been proposed to reduce ribosome collision-induced premature translation terminations and facilitate protein production (see Tuller and Zur 2015). Consistent with this, the 5′ and 3′ ends of protein-coding genes in bacteria (32–35 codons) have been shown to be important for efficient translation (Kudla et al. 2009; Boel et al. 2016), and their mRNA folding energies can explain observed variation in expression levels (Kudla et al. 2009). In fact, it has been shown previously that the first few codons in proteins of P. tetraurelia and T. thermophila are on average comprised of significantly different numbers of effective codons than the rest of the protein (Salim et al. 2008). Thus, the role played by the first and last few bases of all coding sequences in translation might contribute to additional sequence constraints, resulting in reduced nucleotide diversity and divergence.

FIG. 4. Average expected heterozygosity (per site) within and flanking introns (left) and intergenic regions (right) in P. tetraurelia (a, b), P. biaurelia (c, d), P. sexaurelia (e, f), P. caudatum subset (g, h), and P. multimicronucleatum subset (i, j), separately for 4-fold degenerate sites (red), 2-fold degenerate sites (blue), 0-fold degenerate sites (green), and non-coding sites (black).

FIG. 5. Patterns of average divergence (per site) within and flanking intergenic regions (a, c) and introns (e, g) are shown in the left panels. Patterns of average expected heterozygosity per site normalized by average divergence per site within and flanking intergenic regions (b, d) and introns (f, h) are shown on the right. The divergence values are calculated between P. tetraurelia and P. biaurelia, separately for 4-fold degenerate sites (red), 2-fold degenerate sites (blue), 0-fold degenerate sites (green), and non-coding sites (black).

Another factor that might contribute to constraining the CDS ends is the presence of regulatory modules at the beginning and end of genes. Exonic enhancers have been observed in multiple species, including teleosts (Dong et al. 2010; Eichenlaub and Ettwiller 2011), humans (Birnbaum et al. 2012), and other vertebrates (Ritter et al. 2012), where they have been shown to regulate nearby genes. In the highly compact genomes of Paramecium, it is even more likely that regulatory regions extend beyond non-coding regions into the CDS. For instance, in P. caudatum the median length of 5′ intergenic regions is 43 bp, which increases to 56 bp when both genes share the same 5′ intergenic region, suggesting the importance of upstream intergenic length in such compact genomes. Interestingly, in P. caudatum 965 genes (~7%) have 5′ intergenic regions shorter than 10 bp, and 3,417 genes (~25%) have 5′ intergenic regions shorter than 20 bp. Similarly, in P. multimicronucleatum, the median length of the 5′ intergenic region is 90 bp, which increases to 127 bp when both genes share the same 5′ intergenic region. Again, there are 456 genes (~3.5%) with fewer than 10 bp and 1,760 genes (~13%) with fewer than 20 bp comprising their upstream intergenic region. In fact, in both species the 5′ intergenic region can be as small as 3 bp (found in 38 genes in P. caudatum and 18 genes in P. multimicronucleatum). An interesting question this raises is what the minimum number of bases required to encode regulatory information about transcription of genes in eukaryotes might be. A recent study in yeast has now created a high-strength minimal constitutive promoter that is only 116 bp in length (Redden and Alper 2015). It might be possible to observe even smaller promoter regions needed to drive expression patterns in Paramecium, especially in P. caudatum.

Spatial Patterns of Nucleotide Diversity across Non-Coding Regions are Not Driven by Mutation Alone

Although we have assumed that the divergence-normalized diversity patterns indicate the effect of negative selection, it is possible that the observed spatial patterns of nucleotide diversity across and flanking non-coding regions are driven by mutation alone. One way to potentially distinguish between mutation and direct selection as the cause of these patterns is to evaluate the allele frequency of the variants observed at different site types. If the reduction in nucleotide diversity near the start and end of coding sequences is mostly a consequence of negative selection, we would expect a higher fraction of rare alleles at 4-fold degenerate sites under stronger constraints relative to that seen at 4-fold degenerate sites not under constraints. If instead mutation is the predominant driver of the extent of nucleotide diversity at these sites, then there should be no difference in the fraction of rare alleles between the 4-fold degenerate sites under stronger negative selection and those not under constraints. Because Tajima's D (Tajima 1989) is a good statistic to summarize such a shift in rare alleles, we compared the average Tajima's D statistic to evaluate this.

In each Paramecium species, we first classified sites genome-wide into five different categories: (1) splice sites (first and last 3 sites) in introns, (2) intergenic sites within 50 bp of start/stop sites, (3) intergenic sites at least 150 bp away from start/stop sites (should exclude most UTR regions), (4) 4-fold degenerate sites within 50 bp of start/stop sites, and (5) 4-fold degenerate sites at least 150 bp away from start/stop sites and 30 bp away from splice sites. We expected category 5 sites to be the most neutral. Interestingly, in most species, Tajima's D at splice sites, intergenic sites close to start/stop sites, and 4-fold degenerate sites close to start/stop sites is significantly smaller than that of 4-fold degenerate and intergenic sites far away from any functional elements (see supplementary table S7, Supplementary Material online for P-values). This suggests that sites in categories 1, 2, and 4 are under relatively stronger purifying selection. Category 2 sites seem to possess significantly more rare alleles than category 3 sites consistently in all species, suggesting that intergenic regions very close to start/stop sites are under stronger purifying selection than intergenic sites further away from gene ends. We also find Tajima's D at intergenic sites (category 3) to be significantly higher than that for 4-fold degenerate sites (category 5) in most species. This could suggest either that intergenic sites are under some form of positive/balancing selection or that silent sites are not completely neutral and in fact experience some purifying selection.
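For reference, Tajima's D for one category of sites can be computed as in the sketch below, which follows the standard formulation from Tajima (1989). The function names and the input format (minor-allele counts at segregating sites, all with the same sample size) are assumptions for illustration; the source does not describe the implementation used.

```python
from math import sqrt


def tajimas_d(pi, n_segregating, n):
    """Tajima's D from summed pairwise diversity, segregating sites, and sample size.

    pi             mean number of pairwise differences, summed over sites
    n_segregating  number of segregating sites (S)
    n              number of sampled sequences
    """
    if n < 4 or n_segregating == 0:
        return float("nan")  # variance term is zero or undefined
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    s = n_segregating
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


def tajimas_d_from_counts(minor_counts, n):
    """Tajima's D from minor-allele counts (0 < k < n) at segregating sites."""
    pi = sum(2.0 * k * (n - k) / (n * (n - 1)) for k in minor_counts)
    return tajimas_d(pi, len(minor_counts), n)
```

An excess of rare alleles, as expected under purifying selection, pushes D negative; an excess of intermediate-frequency alleles pushes it positive. Comparing the statistic between site categories, as done in the text, therefore compares the shape of their frequency spectra rather than their overall diversity.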

Gene-Based Analysis of Patterns of Synonymous and Non-Synonymous Diversity

Selective constraint can vary among genes within the same species, and it could also be species-specific. To examine such variation in our dataset, we evaluated the distribution of πn/πs across all genes (fig. 6a). If we assume that synonymous polymorphisms are fitness neutral, then πn/πs is a measure of the relative functional constraints on non-synonymous polymorphisms and thus an indication of functional constraints on the protein. The majority of genes have πn/πs < 1.0, consistent with constraints on non-synonymous changes. About 98% of genes in P. tetraurelia, 96% in P. biaurelia, 96% in P. sexaurelia, 99% in both P. caudatum and its subset, and 99% in both P. multimicronucleatum and its subset have πn/πs < 1. Upon comparing the average πn/πs for the same set of orthologous genes across all species, we find the mean to be 0.10 for P. tetraurelia, 0.17 for P. biaurelia, 0.24 for P. sexaurelia, 0.12 [0.12] for P. caudatum [subset], and 0.17 [0.14] for P. multimicronucleatum [subset]. Among these species, P. sexaurelia has the highest average πn/πs, which suggests either relaxed selection because of a relatively small effective population size or more local adaptation. Local adaptation might result in fixation of more non-synonymous than synonymous differences between subpopulations, thus increasing πn/πs. Differences in πn/πs between species could also reflect genome-wide variation in the fraction of gene duplicates that are subfunctionalized, neofunctionalized, or nonfunctionalized. Alternatively, an elevation in average levels of πn/πs could be a consequence of a depression in the average levels of πs, which could be caused by stronger selection on synonymous sites or a recent bottleneck (Brandvain and Wright 2016) experienced by the species. Evidence that genes with higher πn/πs ratios are under more relaxed constraints is provided by a comparison between the average number of frame-shift causing indels occurring in the four sets of genes with (1) πn/πs ≤ 0.25, (2) 0.25 < πn/πs ≤ 0.5, (3) 0.5 < πn/πs ≤ 1.0, and (4) πn/πs > 1.0. As expected, we find an increase in the number of indels per site with increase in πn/πs (fig. 6b).
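The comparison of frame-shifting indels across πn/πs classes can be sketched as follows. The gene record fields, the bin labels, and the pooling of indels and CDS length within each bin are assumptions made for this example; only the bin boundaries and the requirement of more than five synonymous polymorphisms come from the text and figure legends.

```python
def bin_label(ratio):
    """Assign a pi_n/pi_s ratio to one of the four classes used in the text."""
    if ratio <= 0.25:
        return "<=0.25"
    if ratio <= 0.5:
        return "0.25-0.5"
    if ratio <= 1.0:
        return "0.5-1.0"
    return ">1.0"


def frameshift_rate_by_bin(genes, min_syn=5):
    """Frame-shifting indels per coding site, pooled within each class.

    Each gene is a dict with hypothetical keys:
      pi_n, pi_s          non-synonymous and synonymous diversity
      n_syn_poly          number of synonymous polymorphisms
      frameshift_indels   count of frame-shifting indels in the CDS
      cds_length          number of coding sites
    Genes with n_syn_poly <= min_syn (or pi_s of zero) are skipped.
    """
    totals = {}
    for g in genes:
        if g["n_syn_poly"] <= min_syn or g["pi_s"] <= 0:
            continue
        label = bin_label(g["pi_n"] / g["pi_s"])
        indels, sites = totals.get(label, (0, 0))
        totals[label] = (indels + g["frameshift_indels"],
                         sites + g["cds_length"])
    return {label: indels / sites for label, (indels, sites) in totals.items()}
```

If genes with higher πn/πs are indeed under more relaxed constraint, the returned rates should increase from the lowest to the highest class.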

Paralogs and Orthologs Evolve under Similar Selective Constraints within and across Species

We evaluated the correlation between πn/πs across orthologs and within-species paralogs, which were created by the most recent WGD (fig. 7). Both paralogs and orthologs show a significant positive correlation within and across all species, suggesting that most WGD duplicates are under similar selective constraints. We also confirmed that these correlations are not being driven simply by selection on synonymous sites, as πs is not necessarily correlated between orthologs and within-species paralogs across all species (supplementary fig. S8, Supplementary Material online). Similar findings in the maize genome also suggested correlations between nucleotide diversity and divergence between WGD paralogs (Pophaly and Tellier 2015). Our results further suggest that most orthologs are under very similar evolutionary constraints across the Paramecium species. This is in agreement with a recent study in Drosophila, where πn and πs were found to be strongly correlated between D. melanogaster and D. simulans (Langley et al. 2012). Such a correlation was also observed between orthologs in S. cerevisiae and S. paradoxus (Elyashiv et al. 2010).

FIG. 6. (A) πn versus πs of each gene. The red line represents πn = πs. (B) The distribution of πn/πs of each gene across the genome. Only genes with more than five synonymous polymorphisms were used in this analysis. (C) Number of frame-shift causing indels per site in genes belonging to three gene categories: (1) πn/πs ≤ 0.5 (light grey), (2) 0.5 < πn/πs ≤ 1.0 (grey), and (3) πn/πs > 1.0 (black). Only genes with more than five synonymous polymorphisms were used in this analysis.

A possible mechanism governing the degree of selective constraint acting on proteins is their level of expression (Zhang and Yang 2015). Non-synonymous mutations are thought to have more severe fitness consequences in highly expressed genes than in weakly expressed ones (Drummond et al. 2005; Gout et al. 2010). Indeed, we find a strong negative correlation between expression and πn/πs in all species (fig. 8). However, only 6–13% of the variance is explained by expression, implying that additional factors may play a greater role in determining the degree of selective constraint acting on a gene. Simple biochemical properties of proteins, like the protein's function, tertiary structure, and number of interacting partners, might determine the proportion of functionally important residues in the protein and in turn govern the degree of selection.

Our observation that orthologous genes across such deeply diverged species experience similar degrees of selective constraint (fig. 7) might imply a low rate of turnover of expression, protein structure, function, and number of interacting partners. This finding may also be in part due to similar ecologies, and therefore similar selective pressures, experienced by the Paramecium species examined here. It may thus be informative to compare closely related species that live in, or have recently experienced, different environments. For instance, Elyashiv et al. (2010) found significant differences in levels of gene-specific purifying selection (πn/πs) in the sake population of S. cerevisiae in comparison to other populations of S. cerevisiae as well as S. paradoxus. In fact, this deviation in selective constraints experienced by the sake population is complemented by differences in response to different stresses (Kvitek et al. 2008), suggesting that changes in environment might be reflected in gene-specific constraints.

FIG. 7. Correlation of logarithm of πn/πs for paralogs within species (three diagonal plots for P. tetraurelia, P. biaurelia, P. sexaurelia) and for orthologs between species pairwise [non-diagonal plots, with the plots below the diagonal (outlined red) representing the first set of WGD paralogs (T1, B1, S1) and the upper half of the matrix (outlined blue) representing the second set of paralogs (T2, B2, S2)]. The number in the top left corner of each plot is the slope of the linear regression in cases where the regression is significant (P < 0.01). Only genes with more than five synonymous polymorphisms were used in this analysis. The schematic tree in the bottom right corner shows the phylogenetic relationship between the species P. tetraurelia (T), P. biaurelia (B), P. sexaurelia (S), P. caudatum (C), and P. multimicronucleatum (M).

FIG. 8. Negative correlation of πn/πs with expression levels of genes in four Paramecium species. All correlations are significant with P < 2.2 × 10^-16. Only genes with more than five synonymous polymorphisms were used in this analysis.

Distribution of Tajima's D across the Genome

Contrary to genes under purifying selection, genes under balancing or positive selection might often be species-specific. We tested this hypothesis using Tajima's D, a statistic that uses the difference between nucleotide diversity and the number of segregating sites to detect deviations from neutrality. Because Tajima's D is sensitive to population size changes and population structure (Simonsen et al. 1995), we used an outlier approach. We calculated Tajima's D for non-overlapping 1-kb windows across the genome (supplementary fig. S9, Supplementary Material online) and looked for the presence of candidate genes in the outlier regions (0.25% of the two tails). The outlier regions with negative values of Tajima's D might represent genes under either weak purifying or positive selection (i.e., experiencing a recent sweep), whereas the ones with positive Tajima's D might be under balancing selection. Interestingly, both sets of outliers show an over-representation of genes involved in glycosylation, glycoprotein production and metabolism, and carbohydrate metabolism (supplementary table S8, Supplementary Material online) for all Paramecium species. For each species, the large functional overlap of genes in both tails suggests an important role of these processes in Paramecium growth. More importantly, this further suggests similar selective pressures genome-wide acting on all Paramecium species examined in this study. These conclusions can, however, be biased by gene length, as shorter genes might not have sufficient power for an accurate calculation of Tajima's D. Indeed, the outliers are all significantly longer than the rest of the genes in P. tetraurelia (outliers: 1,567 bp; non-outliers: 1,321 bp; P = 4.7 × 10^-13), P. biaurelia (outliers: 1,681 bp; non-outliers: 1,372 bp; P < 2.2 × 10^-16), and P. sexaurelia (outliers: 1,860 bp; non-outliers: 1,379 bp; P < 2.2 × 10^-16). This bias is, however, much lower or not significant for P. caudatum (outliers: 1,463 bp; non-outliers: 1,383 bp; P = 7.3 × 10^-3) and P. multimicronucleatum (outliers: 1,509 bp; non-outliers: 1,445 bp; P = 0.19), possibly because of higher genome-wide nucleotide diversity, indicating that despite some bias, the results have true biological significance.
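The outlier approach described above, which takes windows in the extreme tails of the genome-wide distribution rather than relying on a neutral expectation, can be sketched as below. The tail fraction is left as a parameter; the helper names, the single-contig coordinate scheme, and the overlap rule for assigning genes to windows are hypothetical.

```python
def tail_outliers(values, tail_fraction):
    """Indices of windows in the lower and upper tails of `values`.

    tail_fraction is the share of windows taken from each tail,
    e.g. 0.0025 for 0.25%.
    """
    k = int(len(values) * tail_fraction)
    if k == 0:
        return [], []
    order = sorted(range(len(values)), key=lambda i: values[i])
    return order[:k], order[-k:]


def genes_in_windows(window_indices, genes, window_size=1000):
    """Names of genes overlapping any selected non-overlapping window.

    genes is a list of (name, start, end) with 0-based, half-open
    coordinates on a single contig.
    """
    hits = set()
    for w in window_indices:
        w_start, w_end = w * window_size, (w + 1) * window_size
        for name, start, end in genes:
            if start < w_end and end > w_start:
                hits.add(name)
    return sorted(hits)
```

Because the thresholds are empirical quantiles, demographic effects that shift the whole distribution of Tajima's D shift the thresholds with it, which is the motivation given in the text for using outliers.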

Tajima's D, as a summary statistic of the site-frequency spectrum, can also indicate overall patterns of selection or recent changes in demography. Overall, in most species the genome-wide distribution of Tajima's D is shifted towards positive values, with means of 0.862 for P. tetraurelia, 0.504 for P. biaurelia, 0.274 for P. sexaurelia, -0.148 [0.459] for P. caudatum [subset], and 1.325 [0.855] for P. multimicronucleatum [subset]. This could suggest pervasive local adaptation (or balancing selection), a recent contraction in population size, or a fairly ancient population subdivision in all Paramecium species. A previous study that sampled variation at 10 nuclear loci found that strains from different continents shared the same alleles and interpreted that as evidence of regular global dispersal of Paramecium from relatively few permanent locations (Catania et al. 2009). This could also be a potential explanation for the observed increase in nucleotide diversity in comparison to the number of segregating sites.

Fraction of Heterozygous Sites across the Genome

On average, about 0.1% of sites in an individual were heterozygous within the P. aurelia species, whereas ~0.7% of sites in individuals belonging to P. caudatum and ~1.1% of sites in isolates of P. multimicronucleatum were found to be heterozygous (supplementary table S1, Supplementary Material online). This is consistent with the observation that the P. aurelia species undergo autogamy, whereas the outgroup species do not. On evaluating the fraction of heterozygous sites in windows of 10 kb across the genome, we find large blocks of homozygosity in the P. aurelia genomes (supplementary fig. S10a–e, Supplementary Material online), consistent with loss of heterozygosity in the lab, possibly through multiple episodes of autogamy. It should be noted, however, that peaks in heterozygous regions in the P. aurelia genomes are highly correlated between isolates, unlike in P. caudatum and P. multimicronucleatum, and might reflect mapping or assembly errors. These heterozygous regions (summarized in supplementary Files S1–S3, Supplementary Material online) comprise a very small part of the genome (0.82–0.97%) and, most importantly, their exclusion does not affect any of our analyses qualitatively.

Variation across the Mitochondrial Genome

Few ciliate mitochondrial genomes have been sequenced until now, most belonging to the genus Tetrahymena (Brunk et al. 2003; Burger et al. 2000; Moradian et al. 2007). Others include Paramecium tetraurelia (Pritchard et al. 1990), Paramecium caudatum (Barth and Berendonk 2011), and Oxytricha trifallax (Swart et al. 2012). These studies have illustrated that ciliate mitochondrial genomes are linear and range between 40 and 47 kb, with ~50 intronless genes. However, there has been no understanding of the levels of sequence diversity across the entire mitochondrial genome.

To obtain mitochondrial genomes for all species included in this study, we carried out de novo genome assembly using the whole-genome resequencing data for each strain and extracted mitochondrial contigs using the published Paramecium mitochondrial genomes as reference (see Methods for more details). Using this approach, we derived complete genome assemblies for multiple strains, including from species without a previously published mitochondrial genome (P. biaurelia, P. sexaurelia, and P. multimicronucleatum). A comprehensive analysis of these sequences will be published elsewhere (Johri et al., in preparation). The mitochondrial genome assembly from each isolate was annotated separately, and the eight genes that were most ubiquitously present in all individuals were concatenated and used to construct a phylogeny (see Methods). The mitochondrial phylogeny closely tracked the one based on nuclear sequences (supplementary fig. S11, Supplementary Material online).


One strain from each species was chosen as a reference genome (the genome with the longest assembly), and reads from other individuals were mapped to the reference to call SNPs. During the original DNA preparation, efforts were made to remove non-MAC DNA, resulting in varying amounts of mitochondrial DNA present in the extracts from each isolate. Hence, some individuals mapped with extremely low coverage (4 to zero read depth) to the reference genome, especially after applying stringent filters to call SNPs (supplementary table S9, Supplementary Material online). Thus, some individuals were excluded from this analysis, including all P. biaurelia isolates. The subsequent analysis was therefore restricted to four species.

All of the mitochondrial genomes are about 40 kb in length. We identified 626 SNPs in P. tetraurelia, 5,307 in P. sexaurelia, 4,771 [3,041] in P. caudatum [subset], and 4,600 [1,478] in P. multimicronucleatum [subset]. Within all species, very few indels were found, with just 1 in P. tetraurelia, 9 in P. sexaurelia, and 15 [13] and 23 [9] in P. caudatum [subset] and P. multimicronucleatum [subset], respectively. The nucleotide diversity at 4-fold degenerate sites was found to be 0.009 in P. tetraurelia, 0.169 in P. sexaurelia, 0.127 [0.093] in P. caudatum [subset], and 0.172 [0.073] in P. multimicronucleatum [subset]. Nucleotide diversity across different sites shows that rRNAs are the most conserved sequences in the ciliate mitochondrial genomes. Transfer RNAs are the next most highly conserved, with their diversity levels only slightly above those of replacement sites (table 2). Our estimates of diversity in the Paramecium mitochondria are higher than those of a number of other unicellular eukaryotes studied so far: P. falciparum (πsyn = 0.0012) (Joy et al. 2003) and Chlamydomonas reinhardtii (πsilent = 0.0085) (Smith and Lee 2008).

We also recalculated nuclear nucleotide diversity (table 2) at 4-fold degenerate sites for the same set of individuals used in the mitochondrial analysis and observed that the mitochondrial to nuclear diversity ratio is close to 1.0 in all species (table 2) except P. sexaurelia, where the mitochondrial diversity is 5.7-fold higher than that in the nucleus. On excluding P. sexaurelia, we obtain a strong correlation between 4-fold degenerate nucleotide diversity in the mitochondria and nucleus (slope = 0.95, R² = 0.96). It should be noted that there are not enough data points to make any inferences about statistical significance. However, such a positive correlation between the nuclear and mitochondrial diversity suggests that mitochondrial diversity might scale with population size in ciliates, bringing to light previous studies in animals that have yielded conflicting findings about whether mitochondrial diversity correlates with population size (Bazin et al. 2006; Nabholz et al. 2008; Nabholz et al. 2009; Piganeau and Eyre-Walker 2009). The observed correlation also suggests that mitochondrial and nuclear spontaneous mutation rates may be correlated. Direct estimates of mutation rates in C. elegans, D. melanogaster, S. cerevisiae, and H. sapiens, obtained via mutation accumulation experiments, suggest a moderate correlation (slope = 0.05, R² = 0.48) between mitochondrial and nuclear base substitution rates (using data from Lynch et al. 2008). Discordance between the nuclear and mitochondrial diversity, for instance in P. sexaurelia, can be attributed to various factors such as positive selection (Hudson and Turelli 2003; Galtier et al. 2009; James et al. 2016), demographic events (Pool and Nielsen 2007), or simply differences in nuclear and/or mitochondrial mutation rates (Lynch 2007). Future studies are needed to dissect the factors responsible for the remarkably different ratio of mitochondrial to nuclear diversity in P. sexaurelia.

Table 2. Nucleotide Diversity (per site) of Different Sites across the Mitochondrial Genomes.

Species | 4-fold (π_m) [SE] | 2-fold [SE] | 0-fold [SE] | tRNA [SE] | rRNA [SE] | intergenic [SE] | 4-fold nuclear (π_n) [SE] | π_m/π_n
P. tetraurelia | 0.0091 [0.0010] | 0.0107 [0.0009] | 0.0019 [0.0002] | 0.0000 [0.0000] | 0.0006 [0.0002] | 0.0041 [0.0003] | 0.0058 [2.26 × 10^-5] | 1.57
P. sexaurelia | 0.1692 [0.0033] | 0.1427 [0.0024] | 0.0097 [0.00043] | 0.0000 [0.0000] | 0.0059 [0.0007] | 0.0417 [0.0027] | 0.02945 [4.73 × 10^-5] | 5.75
P. caudatum | 0.1272 [0.0037] | 0.0942 [0.0019] | 0.0179 [0.0006] | 0.0455 [0.0023] | 0.0136 [0.0012] | 0.0333 [0.0027] | 0.1267 [1.08 × 10^-4] | 1.00
P. caudatum (subset) | 0.0929 [0.0039] | 0.0859 [0.0021] | 0.0149 [0.0006] | 0.0375 [0.0023] | 0.0093 [0.0011] | 0.0292 [0.0028] | 0.0658 [9.63 × 10^-5] | 1.41
P. multimicronucleatum | 0.1727 [0.0047] | 0.0831 [0.0019] | 0.0227 [0.0008] | 0.0263 [0.0011] | 0.0269 [0.0016] | 0.0329 [0.0022] | 0.1599 [1.34 × 10^-4] | 1.08
P. multimicronucleatum (subset) | 0.0730 [0.0034] | 0.0383 [0.0015] | 0.0121 [0.0006] | 0.0126 [0.0009] | 0.0135 [0.0012] | 0.0312 [0.0022] | 0.0742 [9.56 × 10^-5] | 0.98
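The mitochondrial to nuclear diversity ratios discussed above can be rechecked with a few lines of arithmetic. The sketch below uses the 4-fold degenerate values as read from table 2; the dictionary layout and the function name are illustrative assumptions, not part of the original analysis.

```python
# 4-fold degenerate diversity as read from table 2: (mitochondrial, nuclear).
# The data structure and names here are illustrative assumptions.
TABLE2_FOURFOLD = {
    "P. tetraurelia": (0.0091, 0.0058),
    "P. sexaurelia": (0.1692, 0.02945),
    "P. caudatum": (0.1272, 0.1267),
    "P. caudatum (subset)": (0.0929, 0.0658),
    "P. multimicronucleatum": (0.1727, 0.1599),
    "P. multimicronucleatum (subset)": (0.0730, 0.0742),
}


def diversity_ratio(pi_mito, pi_nuc):
    """Ratio of mitochondrial to nuclear nucleotide diversity."""
    if pi_nuc <= 0:
        raise ValueError("nuclear diversity must be positive")
    return pi_mito / pi_nuc


ratios = {
    species: diversity_ratio(pi_m, pi_n)
    for species, (pi_m, pi_n) in TABLE2_FOURFOLD.items()
}

if __name__ == "__main__":
    for species, ratio in ratios.items():
        print(f"{species}: {ratio:.2f}")
```

P. sexaurelia stands out with a ratio near 5.7, whereas the remaining entries fall between roughly 1.0 and 1.6.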

Conclusions

This is the first study where multiple individuals of several species of free-living unicellular eukaryotes have been sequenced and used to reveal genome-wide patterns of variation. This study thus provides an invaluable resource for future research in Paramecium, a major model organism in cell biology. Sequencing of multiple individuals from the same lake or pond is needed to further resolve the global versus local population structure of these cosmopolitan unicellular species. Overall, our results suggest that most of the intergenic regions across the highly compact genomes of Paramecium species are strongly constrained and probably harbor functionally important regulatory modules, which might extend into coding regions. We show that patterns of nucleotide diversity across the genome are driven primarily by a combination of purifying selection and mutation rate variation. In future, such analyses will benefit greatly from incorporating variation in recombination rates across the genomic landscape. We also find that duplicates created by the WGD appear to be evolving under similar constraints. Synonymous to non-synonymous diversity ratios need to be supplemented with divergence levels to distinguish between the various types of selective forces acting on the duplicates. Thus, sequencing of much closer outgroup species would allow us to further understand the role of selection in the evolution of different genomic elements, including gene duplicates that resulted from ancient WGDs.

Methods

Sampling, DNA Extraction, and Sequencing

We examined a total of 56 Paramecium strains originally isolated from environmental sources, including 11 isolates of P. biaurelia, 13 isolates of P. sexaurelia, 12 isolates of P. tetraurelia, 10 isolates of P. caudatum, and 10 isolates of P. multimicronucleatum. A total of 14 strains from P. biaurelia and P. sexaurelia were kindly provided by E. Przybos, whereas the remaining strains were cultured in the Paramecium collections of the Institute of Hydrobiology, TU Dresden, Germany, or the Department of Biology, Indiana University, Bloomington, USA. Prior to our analyses, P. tetraurelia strain Moz 13CXI as well as P. sexaurelia strains Moz 13BIII and Indo 17I were kindly identified by E. Przybos by performing inter- and intrastrain crosses (Przybos 1975). Paramecia were maintained in 0.25% Cerophyl medium (modified according to Sonneborn 1970; Krenek et al. 2011) inoculated with Enterobacter aerogenes. Since P. aurelia species can undergo autogamy (self-fertilization) induced by starvation, and all strains have been cultured under lab conditions for several years, all P. aurelia strains in this study are expected to be more homozygous than wild isolates.

Prior to DNA extraction, all cultures were initiated from a single cell, ensuring clonality. A modified sucrose density separation (Aury et al. 2006) was performed to extract primarily macronuclei (MACs) and to reduce the amount of non-MAC DNA potentially compromising subsequent analyses. Briefly, Paramecium cells were starved for at least two days and filtered through several layers of sterile cheesecloth prior to cell harvesting via gentle centrifugation. Pelleted cells were gently lysed in 0.25 M sucrose medium (pH 6.8) containing 10 mM Tris-HCl, 8 mM MgCl2, 3 mM CaCl2, 0.5–1% Nonidet-P40, and 0.1–0.2% sodium deoxycholate. Macronuclei were subsequently isolated away from cellular debris, micronuclei, and mitochondria by centrifugation at 100 rcf in 0.25 M sucrose wash buffer (see above, without Nonidet-P40 and Na-deoxycholate). Genomic DNA extraction from MAC isolations was performed by a modified CTAB procedure (Gustincich et al. 1991).

All genomes were sequenced using the Illumina HiSeq 2000, and paired-end reads with insert sizes between 200 and 400 bp and mean length of 151 bp were obtained. The reads mapped with coverage ranging between 40× and 80× (details given in supplementary table S1, Supplementary Material online).

Paramecium Reference Genomes

Complete genomes of P. tetraurelia, P. biaurelia, P. sexaurelia, and P. caudatum were downloaded from ParameciumDB (Arnaiz et al. 2007; Arnaiz and Sperling 2011) along with the annotations. Ortho-paralog relationships as well as functional annotations were retrieved from the supplementary information provided in McGrath et al. (2014a, 2014b). RNAseq data for all four species were downloaded and processed as in Gout and Lynch (2015).

The genome of P. multimicronucleatum was assembled using Celera (Miller et al. 2008) and annotated using EuGene (Foissac et al. 2008) as previously described (McGrath et al. 2014a, 2014b). A detailed description of the sequencing, assembly, and annotation of P. multimicronucleatum can be found in Marinov et al. (in preparation). OMA (Orthology MAtrix) (Altenhoff et al. 2015) was used to find orthologs between P. caudatum and P. multimicronucleatum. A total of 11,002 genes of P. caudatum were found to have 1-1 orthologs in P. multimicronucleatum, and only 1-1 orthologs were used for the current analysis.

Mapping of Reads and Detection of SNPs and Indels

Sequenced paired-end reads were mapped to the reference genomes for each species using BWA (bwa mem) (Li and Durbin 2010) with default parameters, and the mpileup function in Samtools (Li et al. 2009) was used to generate mpileup files for each individual. Two distinct pipelines were employed to identify SNPs to ensure detection accuracy: Bcftools-Vcftools (Li 2011) as well as the maximum likelihood Genotype Frequency Estimator (referred to as GFE) (Lynch 2009; Maruki and Lynch 2015). Both pipelines produced highly overlapping sets of SNPs (supplementary table S2, Supplementary Material online). For both pipelines, sites were only considered for SNP detection and further analysis if the mapping quality was above 30, base quality was above 20, per-base alignment quality was above 15, and the sum of the depth of coverage for all individuals was about 5 times the number of individuals and less than twice the average population coverage. SNPs were called by bcftools if the SNP quality score was above 30 and by GFE if the significance score > 5.91 (P > 0.05). The exact parameters of the Bcftools-Vcftools pipeline to call SNPs and indels are given below.

samtools mpileup -q 30 -Q 20 -t DP -uvf reference_genome.fasta sorted.bam > species.vcf

bcftools call -f GQ -o species_snps.vcf -O v -V indels -cv species.vcf

bcftools call -f GQ -o species_indels.vcf -O v -V snps -cv species.vcf

#filtering snp and indel files for depth and biases

perl vcfutils.pl varFilter -d 50 -D 800 -1 -2 -3 -4 species_snps.vcf > species_snps.filtered1.vcf

perl vcfutils.pl varFilter -d 50 -D 800 -1 -2 -3 -4 species_indels.vcf > species_indels.filtered1.vcf

vcftools --vcf species_snps.filtered1.vcf --minQ 20 --minGQ 30 --minDP 4 --recode --recode-INFO-all --out species_snps.filtered2

vcftools --vcf species_indels.filtered1.vcf --minQ 20 --minGQ 30 --minDP 4 --recode --recode-INFO-all --out species_indels.filtered2

#filter out the recoded sites that are not polymorphic anymore, that is, AF1 = 1

bcftools view -q 0.001 species_snps.filtered2.recode.vcf -o species_snps.final.vcf

bcftools view -q 0.001 species_indels.filtered2.recode.vcf -o species_indels.final.vcf

Mitochondrial SNPs were called using the Bcftools-Vcftools pipeline with the same parameters as noted above.

Simulation of SNP Detection with Varying Levels of Divergence from the Reference

We simulated reads using wgsim (https://github.com/lh3/wgsim) for a diploid genome with varying levels of divergence from the reference genome of P. caudatum and P. multimicronucleatum, using average values of read length, total number of reads, and insert sizes.

For the P. caudatum reference genome:

wgsim -d 287 -s 113 -N 10201087 -1 151 -2 151 -r 0.10 -R 0.013 caudatum_assembly.fasta caud_010_1.fastq caud_010_2.fastq >& caud_010.info

For the P. multimicronucleatum reference genome:

wgsim -d 261 -s 102 -N 12154044 -1 151 -2 151 -r 0.10 -R 0.011 multimicronucleatum_assembly.fasta multi_010_1.fastq multi_010_2.fastq >& multi_010.info

Reads were simulated for r = 0.0, 0.05, 0.10, 0.15, 0.20, 0.25, and 0.30, which represents the fraction of the genome with SNPs. The SNP detection pipeline described above was used to call SNPs, which were compared to the simulated set of SNPs (supplementary table S4, Supplementary Material online).
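The comparison between simulated and called SNPs can be summarised by set operations on site coordinates. The following is a minimal sketch under the assumption that each SNP is identified by a (scaffold, position) pair; the function name and the two summary measures are illustrative and are not taken from the original pipeline.

```python
def snp_call_accuracy(simulated, called):
    """Compare called SNP sites against the simulated truth set.

    Both arguments are iterables of (scaffold, position) tuples
    (a hypothetical representation chosen for this sketch).
    Returns (sensitivity, false_discovery_rate).
    """
    simulated, called = set(simulated), set(called)
    true_positives = simulated & called
    false_positives = called - simulated
    sensitivity = len(true_positives) / len(simulated) if simulated else float("nan")
    fdr = len(false_positives) / len(called) if called else float("nan")
    return sensitivity, fdr


if __name__ == "__main__":
    truth = {("scaffold_1", 10), ("scaffold_1", 20), ("scaffold_2", 5), ("scaffold_2", 9)}
    calls = {("scaffold_1", 10), ("scaffold_2", 5), ("scaffold_2", 7)}
    print(snp_call_accuracy(truth, calls))
```

Running the same comparison at each simulated value of r would show how detection degrades as divergence from the reference increases.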

Building Phylogenies

A consensus genome (bcftools view species_filtered.recode.vcf | vcfutils.pl vcf2fq > species.cns.fq) was obtained for each individual using bcftools that represented heterozygous sites using DNA ambiguity codes. Gene nucleotide sequences that are present in single copy in the P. aurelia species and are also present in both the outgroup species (P. caudatum and P. multimicronucleatum) were extracted, and multiple sequence alignments were generated using MUSCLE (Edgar 2004a, 2004b). This amounted to about 55 sequences (including the reference genomes) each in 883 sets of orthologous genes. RAxML 8.0.26 (Stamatakis 2014) was used to build a tree using the concatenation of all 883 aligned gene sequences (raxml -s concatenated_sequences.fasta -n gene -m GTRGAMMA -T 4 -p 7). In addition, individual gene trees (raxml -s gene.fasta -n gene -m GTRGAMMA -T 4 -p 7) were built, which were then used to create a consensus tree (raxml -J MR -z trees.newick -T 4 -m GTRGAMMA) (Aberer et al. 2010).

Mitochondrial phylogeny was built using the same parameters as above by concatenating eight sets of orthologous genes (NADH_dehydrogenase_subunit_1_a, NADH_dehydrogenase_subunit_5, Ymf56orf78, Ymf67orf265_1, Ymf68orf393, apocytochrome_b, cytochrome_c_oxidase_subunit_1, cytochrome_c_oxidase_subunit_2, ribosomal_protein_S14) that were ubiquitously present across all sampled species.

Principal Component Analysis

Genotypes for all individuals were obtained from the vcf files generated by vcftools for all scaffolds. Only the sites that had a genotype available for all individuals were used in the analysis, i.e., if even a single individual had missing data, the site was excluded from analysis. A total of 10,000 markers were used for each species. The first marker of each scaffold was picked randomly within the first 50 SNP positions, and the rest were selected so that all markers were uniformly distributed across all scaffolds. PCA was carried out on these markers by using the R (R-Core-Team 2014) command prcomp (https://stat.ethz.ch/R-manual/R-devel/library/stats/html/prcomp.html) and visualized using the first two components with the R graphing package ggplot2 (Wickham 2010).
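The marker thinning step described above (a random first marker within the first 50 SNPs of a scaffold, then evenly spaced markers) might look roughly like the sketch below. The spacing parameter `step`, the function name, and the use of a seeded `random.Random` instance are assumptions for illustration; the original text does not state how the spacing was chosen.

```python
import random


def pick_markers(snp_positions, step, rng, first_window=50):
    """Thin the ordered SNP positions of one scaffold to evenly spaced markers.

    The first marker is drawn at random from the first `first_window` SNPs;
    every `step`-th SNP after it is then retained. `step` is a hypothetical
    parameter that would be tuned to reach the desired total marker count.
    """
    if not snp_positions:
        return []
    start = rng.randrange(min(first_window, len(snp_positions)))
    return snp_positions[start::step]


if __name__ == "__main__":
    rng = random.Random(1)
    scaffold_snps = list(range(0, 1000, 2))  # 500 toy SNP coordinates
    print(pick_markers(scaffold_snps, step=25, rng=rng))
```

Passing the random generator explicitly keeps the selection reproducible across runs.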

Calculation of Population Genetic Statistics

Nucleotide Diversity and Tajima's D

Nucleotide diversity was calculated as 2pqn/(n-1), where p and q are the major and minor allele frequencies and n is the number of chromosomes. The major and minor allele frequencies were given for each site by GFE. Diversity at coding sites was partitioned into synonymous and non-synonymous components (e.g., 2-fold degenerate sites are one-third synonymous and two-thirds non-synonymous). Major allele frequencies were also extracted from vcftools and returned very similar results as those from GFE (supplementary table S3, Supplementary Material online).
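As a worked illustration of the per-site estimator 2pqn/(n-1), the sketch below computes diversity from a major allele frequency and the number of sampled chromosomes. The function names are hypothetical; only the formula itself comes from the text.

```python
def site_diversity(major_freq, n_chrom):
    """Per-site nucleotide diversity, 2pq * n / (n - 1).

    major_freq: major allele frequency p (so q = 1 - p).
    n_chrom: number of sampled chromosomes n.
    """
    if n_chrom < 2:
        raise ValueError("at least two chromosomes are required")
    p = major_freq
    q = 1.0 - p
    return 2.0 * p * q * n_chrom / (n_chrom - 1)


def mean_diversity(major_freqs, n_chrom):
    """Average per-site diversity over a set of sites (monomorphic sites have p = 1)."""
    values = [site_diversity(p, n_chrom) for p in major_freqs]
    return sum(values) / len(values)


if __name__ == "__main__":
    # Two sites in a sample of 20 chromosomes: one monomorphic, one at p = 0.5.
    print(mean_diversity([1.0, 0.5], 20))
```

The n/(n-1) factor is the usual small-sample correction, which matters with only 10 to 13 isolates per species.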


A sliding window approach was used to calculate Tajima's D across the genome. Vcftools was used to obtain these values for non-overlapping windows of 1 kb in length. The program ms (Hudson 2002) was used to perform 100,000 simulations of 10 individuals with 1 kb locus length, and niceStats (https://github.com/kern-lab/shIC/blob/master/niceStats.c) was used to calculate Tajima's D from the ms output file. To obtain Tajima's D for different categories of sites, SNPs belonging to a particular category of sites (specified in Results) were selected, and vcftools was used to calculate Tajima's D for non-overlapping windows of 10 kb. Windows with 2 or more SNPs were used to calculate the mean, and only alternate windows were used to compare distributions of Tajima's D across different categories by Student's t-test, to maximise independence of sampling.
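For readers unfamiliar with the statistic, a minimal sketch of Tajima's D for a single window follows. It implements the standard formula of Tajima (1989) and is not the vcftools or niceStats code; the function names and the representation of a window as a list of minor allele counts are assumptions.

```python
from math import sqrt


def pairwise_diversity(minor_counts, n):
    """Mean number of pairwise differences, summed over segregating sites.

    minor_counts: minor allele count at each segregating site in the window.
    n: number of sampled chromosomes.
    """
    return sum(2.0 * k * (n - k) / (n * (n - 1)) for k in minor_counts)


def tajimas_d(n, num_seg_sites, pi):
    """Tajima's D from sample size n, segregating sites S, and pairwise diversity pi.

    Returns None when the window has no segregating sites.
    """
    if n < 4:
        raise ValueError("the variance term is zero for n < 4")
    s = num_seg_sites
    if s == 0:
        return None
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


if __name__ == "__main__":
    counts = [1, 1, 1, 1, 1]  # five singletons in a sample of 10 chromosomes
    print(tajimas_d(10, len(counts), pairwise_diversity(counts, 10)))
```

An excess of rare variants, as in the singleton-only example, yields a negative D.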

Plotting Average Expected Heterozygosity across Introns and Intergenic Regions

All introns (< 500 bp in length) were split midway and aligned according to the 5′ and 3′ splice sites. All intergenic regions (< 500 bp in the outgroup species and < 1000 bp in the P. aurelia species) were split midway and aligned according to the start and stop codons. Note that these were not sequence alignments. At each site, the mean expected heterozygosity was calculated. Sites with an insufficient number of data points (< 200 data points per site for introns, while for intergenic regions < 100 data points per site for outgroups and < 1000 data points per site in the P. aurelia species) were discarded.
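The split-and-align procedure can be sketched as follows: each region is cut at its midpoint, the left half is indexed by distance from the left boundary, the right half by distance from the right boundary, and per-offset means are kept only where enough regions contribute. The function name, the input format (one list of per-site heterozygosity values per region), and the handling of odd lengths are assumptions of this sketch.

```python
from collections import defaultdict


def positional_mean(regions, min_points):
    """Average per-site values by distance from each boundary of a region.

    regions: list of lists, each holding per-site expected heterozygosity
             for one intron or intergenic region (hypothetical input format).
    min_points: offsets supported by fewer regions than this are discarded.
    Returns (left_means, right_means), dictionaries keyed by offset.
    """
    left, right = defaultdict(list), defaultdict(list)
    for values in regions:
        mid = len(values) // 2  # for odd lengths the middle site joins the right half
        for offset, value in enumerate(values[:mid]):
            left[offset].append(value)
        for offset, value in enumerate(reversed(values[mid:])):
            right[offset].append(value)

    def summarize(bins):
        return {
            offset: sum(vals) / len(vals)
            for offset, vals in sorted(bins.items())
            if len(vals) >= min_points
        }

    return summarize(left), summarize(right)


if __name__ == "__main__":
    toy_regions = [[0.1, 0.2, 0.3, 0.4], [0.3, 0.4]]
    print(positional_mean(toy_regions, min_points=2))
```

Because regions differ in length, offsets far from the boundaries are supported by fewer regions, which is why a minimum count per site is needed.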

Calculating and Plotting Divergence across Introns and Intergenic Regions

Orthologous coding sequences along with introns were aligned using MUSCLE (Edgar 2004a). Intergenic regions flanked by orthologous genes on both sides were also aligned along with the flanking coding exon using MUSCLE. For each species separately, divergence for different positions (relative to the start and stop sites) was averaged across all introns and intergenic regions. The divergence values were corrected for multiple substitutions by the Jukes-Cantor sequence substitution model and plotted in the same manner as described above.
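The Jukes-Cantor correction converts an observed proportion of differing sites p into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). A minimal sketch, with a hypothetical function name:

```python
from math import log


def jukes_cantor(p):
    """Jukes-Cantor distance for an observed proportion of differing sites p.

    The correction is undefined for p >= 0.75 (saturation).
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    return -0.75 * log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    for observed in (0.01, 0.1, 0.3):
        print(observed, round(jukes_cantor(observed), 4))
```

The corrected distance always exceeds the observed proportion, and the gap widens as divergence grows.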

GO Analysis

All gene ontology enrichment analyses were performed using the BiNGO plugin for Cytoscape 3.2.1 (Shannon et al. 2003; Maere et al. 2005), using the hypergeometric test to assess significance with Benjamini and Hochberg FDR correction for multiple testing (Benjamini and Hochberg 1995). The latest ontology files (.obo) were downloaded from www.geneontology.org, and custom annotation files were prepared as required using python scripts.
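The statistics behind such an enrichment analysis can be sketched in a few lines. This is not BiNGO's code; it only illustrates a one-sided hypergeometric test followed by Benjamini-Hochberg adjustment, with hypothetical function and argument names.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the GO term."""
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # Toy example: 3 genes of interest, 2 of them annotated, 4 of 10 genes annotated overall.
    print(hypergeom_pvalue(2, 3, 4, 10))
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Each GO term would contribute one p-value, and the adjustment is applied across all tested terms.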

Identification of Heterozygous Regions in the P. aurelia Species

The fraction of heterozygous sites was calculated for 10 bp non-overlapping windows across the genome for all three P. aurelia species, and the top 1% of such regions with the highest fraction of heterozygous sites were identified as heterozygous regions.
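A windowed scan of this kind could be sketched as below, reading the cutoff as the top 1% of windows (an interpretation, since the percent sign is not legible in the source). The input format (one 0/1 flag per site) and the function name are assumptions.

```python
def top_heterozygous_windows(het_flags, window=10, top_fraction=0.01):
    """Rank non-overlapping windows by their fraction of heterozygous sites.

    het_flags: sequence of 0/1 values, 1 marking a heterozygous site
               (a hypothetical per-scaffold representation).
    Returns the (window_start, fraction) pairs in the top `top_fraction`.
    """
    fractions = []
    for start in range(0, len(het_flags) - window + 1, window):
        chunk = het_flags[start:start + window]
        fractions.append((start, sum(chunk) / window))
    if not fractions:
        return []
    n_top = max(1, int(len(fractions) * top_fraction))
    ranked = sorted(fractions, key=lambda item: item[1], reverse=True)
    return ranked[:n_top]


if __name__ == "__main__":
    flags = [0] * 1000
    for site in (100, 102, 104, 106, 108):
        flags[site] = 1
    print(top_heterozygous_windows(flags))
```

In practice the ranking would be done genome-wide rather than per scaffold, so that the threshold is shared by all windows.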

Mitochondrial Genome Assembly and Annotation

Mitochondrial genomes were assembled as follows. Sequence reads were assembled using SPAdes (version 3.5.0) (Bankevich et al. 2012) after removing potential adapter sequence with Trimmomatic (version 0.33) (Bolger et al. 2014). Mitochondrial contigs were identified from the resulting assemblies by BLAST searches against the published P. caudatum and P. tetraurelia mitochondrial genomes (Pritchard et al. 1990; Barth and Berendonk 2011).

Protein coding genes were identified by generating all ORFs longer than 60 aa in all six frames using the Mold, Protozoan, and Coelenterate Mitochondrial Code (i.e., UGA codes for W instead of being a stop codon) and all alternative start codons specific to Paramecium (AUU, AUA, AUG, AUC, GUG, and GUA), and retaining the longest ORFs associated with each stop codon. BLASTP (Altschul et al. 1997) was then used to identify homologs of annotated mitochondrial proteins in P. tetraurelia and P. caudatum. Additional ORFs were identified by imposing the requirement that their length exceeds 100 aa and subsequently annotated using BLASTP against the Non-redundant protein sequences (nr) database and HMMER 3.0 (Eddy 2011) scans against the PFAM 27.0 database (Finn et al. 2014).
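The ORF-generation step can be illustrated with a short six-frame scan. This is a sketch under stated assumptions rather than the original annotation code: it works on DNA (T instead of U), treats only TAA and TAG as stops because TGA encodes W under this code, keeps the earliest start codon before each stop (which gives the longest ORF for that stop), and ignores ORFs that run off the end of a contig. Function names are hypothetical.

```python
STOPS = {"TAA", "TAG"}  # TGA is read as tryptophan under this mitochondrial code
STARTS = {"ATT", "ATA", "ATG", "ATC", "GTG", "GTA"}  # alternative starts listed in the text


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def longest_orfs(seq, min_aa=60):
    """Longest ORF per stop codon in all six frames.

    Returns tuples (strand, frame, start, end, length_aa); coordinates on the
    minus strand refer to the reverse-complemented sequence.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        for frame in range(3):
            first_start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if codon in STOPS:
                    if first_start is not None:
                        n_aa = (i - first_start) // 3
                        if n_aa > min_aa:
                            orfs.append((strand, frame, first_start, i + 3, n_aa))
                    first_start = None
                elif first_start is None and codon in STARTS:
                    first_start = i
    return orfs


if __name__ == "__main__":
    toy = "ATG" + "TGA" * 3 + "GCT" * 70 + "TAA"
    print(longest_orfs(toy))
```

In the toy sequence the three in-frame TGA codons do not terminate the ORF, which is the behaviour the alternative genetic code requires.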

tRNA genes were annotated with tRNAscan-SE 1.21 (Schattner et al. 2005) using the "Mito/Chloroplast" source.

Supplementary Material

Supplementary data are available at Molecular Biology and Evolution online.

Acknowledgments

We thank Lydia Bright and Jean-Francois Gout for helpful comments on the manuscript, Matthew S. Ackerman for insightful discussions related to the project, and the anonymous reviewers and editor for their valuable suggestions. We would also like to thank Kelley Thomas for providing help with the sequencing and E. Przybos for kindly providing several Paramecium strains and the species identification of some so far unspecified strains. This work was financially supported by the National Science Foundation (MCB-1050161 and DEB-1257806), the European Commission FP7-PEOPLE-2009-IRSES project CINAR PATHOBACTER (project number 247658), and the CNRS research network GDRE "Paramecium Genome Dynamics and Evolution". T.U.B. also acknowledges funding of the "support the best" of the ZUK TU Dresden and the Deutsche Forschungsgemeinschaft (DFG).

References

Abbott RJ, Gomes MF. 1989. Population genetic-structure and outcrossing rate of Arabidopsis thaliana (L.) Heynh. Heredity 62:411–418.

Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, McVean GA. 2010. A map of human genome variation from population-scale sequencing. Nature 467:1061–1073.


Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA. 2012. An integrated map of genetic variation from 1,092 human genomes. Nature 491:56–65.

Aberer AJ, Pattengale ND, Stamatakis A. 2010. Parallelized phylogenetic post-analysis on multi-core architectures. J Comput Sci-Neth. 1:107–114.

Ackermann AA, Panunzi LG, Cosentino RO, Sanchez DO, Aguero F. 2012. A genomic scale map of genetic diversity in Trypanosoma cruzi. BMC Genomics 13:736.

Altenhoff AM, Skunca N, Glover N, Train CM, Sueki A, Pilizota I, Gori K, Tomiczek B, Muller S, Redestig H, et al. 2015. The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Res. 43:D240–D249.

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389–3402.

Arnaiz O, Cain S, Cohen J, Sperling L. 2007. ParameciumDB: a community resource that integrates the Paramecium tetraurelia genome sequence with genetic data. Nucleic Acids Res. 35:D439–D444.

Arnaiz O, Mathy N, Baudry C, Malinsky S, Aury JM, Wilkes CD, Garnier O, Labadie K, Lauderdale BE, Le Mouel A, et al. 2012. The Paramecium germline genome provides a niche for intragenic parasitic DNA: evolutionary dynamics of internal eliminated sequences. PLoS Genet. 8:e1002984.

Arnaiz O, Sperling L. 2011. ParameciumDB in 2011: new tools and new data for functional and comparative genomics of the model ciliate Paramecium tetraurelia. Nucleic Acids Res. 39:D632–D636.

Assefa S, Lim C, Preston MD, Duffy CW, Nair MB, Adroub SA, Kadir KA, Goldberg JM, Neafsey DE, Divis P, et al. 2015. Population genomic structure and adaptation in the zoonotic malaria parasite Plasmodium knowlesi. Proc Natl Acad Sci U S A. 112:13027–13032.

Aury JM, Jaillon O, Duret L, Noel B, Jubin C, Porcel BM, Segurens B, Daubin V, Anthouard V, Aiach N, et al. 2006. Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature 444:171–178.

Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR. 2015. A global reference for human genetic variation. Nature 526:68–74.

Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, et al. 2012. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 19:455–477.

Barth D, Berendonk TU. 2011. The mitochondrial genome sequence of the ciliate Paramecium caudatum reveals a shift in nucleotide composition and codon usage within the genus Paramecium. BMC Genomics 12:272.

Barth D, Krenek S, Fokin SI, Berendonk TU. 2006. Intraspecific genetic variation in Paramecium revealed by mitochondrial cytochrome C oxidase I sequences. J Eukaryot Microbiol. 53:20–25.

Bazin E, Glemin S, Galtier N. 2006. Population size does not influence mitochondrial genetic diversity in animals. Science 312:570–572.

Beale G, Preer JR. 2008. Paramecium: genetics and epigenetics. Boca Raton (FL): CRC Press.

Begun DJ, Aquadro CF. 1992. Levels of naturally-occurring DNA polymorphism correlate with recombination rates in Drosophila melanogaster. Nature 356:519–520.

Begun DJ, Holloway AK, Stevens K, Hillier LW, Poh YP, Hahn MW, Nista PM, Jones CD, Kern AD, Dewey CN, et al. 2007. Population genomics: whole-genome analysis of polymorphism and divergence in Drosophila simulans. PLoS Biol. 5:e310.

Benjamini Y, Hochberg Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B Met. 57:289–300.

Betermier M. 2004. Large-scale genome remodelling by the developmentally programmed elimination of germ line sequences in the ciliate Paramecium. Res Microbiol. 155:399–408.

Birky CW, Walsh JB. 1988. Effects of linkage on rates of molecular evolution. Proc Natl Acad Sci U S A. 85:6414–6418.

Birnbaum RY, Clowney EJ, Agamy O, Kim MJ, Zhao J, Yamanaka T, Pappalardo Z, Clarke SL, Wenger AM, Nguyen L, et al. 2012. Coding exons function as tissue-specific enhancers of nearby genes. Genome Res. 22:1059–1068.

Boel G, Letso R, Neely H, Price WN, Wong KH, Su M, Luff JD, Valecha M, Everett JK, Acton TB, et al. 2016. Codon influence on protein expression in E. coli correlates with mRNA levels. Nature 529:358–363.

Bolger AM, Lohse M, Usadel B. 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30:2114–2120.

Brandvain Y, Wright SI. 2016. The limits of natural selection in a nonequilibrium world. Trends Genet. 32:201–210.

Brunk CF, Lee LC, Tran AB, Li J. 2003. Complete sequence of the mitochondrial genome of Tetrahymena thermophila and comparative methods for identifying highly divergent genes. Nucleic Acids Res. 31:1673–1682.

Burger G, Zhu Y, Littlejohn TG, Greenwood SJ, Schnare MN, Lang BF, Gray MW. 2000. Complete sequence of the mitochondrial genome of Tetrahymena pyriformis and comparison with Paramecium aurelia mitochondrial DNA. J Mol Biol. 297:365–380.

Burki F. 2014. The eukaryotic tree of life from a global phylogenomic perspective. Cold Spring Harb Perspect Biol. 6:a016147.

Cao J, Schneeberger K, Ossowski S, Gunther T, Bender S, Fitz J, Koenig D, Lanz C, Stegle O, Lippert C, et al. 2011. Whole-genome sequencing of multiple Arabidopsis thaliana populations. Nat Genet. 43:956–963.

Catania F, Wurmser F, Potekhin AA, Przybos E, Lynch M. 2009. Genetic diversity in the Paramecium aurelia species complex. Mol Biol Evol. 26:421–431.

Chalker DL, Yao MC. 2011. DNA elimination in ciliates: transposon domestication and genome surveillance. Annu Rev Genet. 45:227–246.

Charlesworth D. 2003. Effects of inbreeding on the genetic diversity of populations. Philos T Roy Soc B. 358:1051–1070.

Charlesworth B, Morgan MT, Charlesworth D. 1993. The effect of deleterious mutations on neutral molecular variation. Genetics 134:1289–1303.

Chia JM, Song C, Bradbury PJ, Costich D, de Leon N, Doebley J, Elshire RJ, Gaut B, Geller L, Glaubitz JC, et al. 2012. Maize HapMap2 identifies extant variation from a genome in flux. Nat Genet. 44:803–807.

Comeron JM, Williford A, Kliman RM. 2008. The Hill-Robertson effect: evolutionary consequences of weak selection and linkage in finite populations. Heredity 100:19–31.

Dey A, Chan CK, Thomas CG, Cutter AD. 2013. Molecular hyperdiversity defines populations of the nematode Caenorhabditis brenneri. Proc Natl Acad Sci U S A. 110:11056–11060.

Dong X, Navratilova P, Fredman D, Drivenes O, Becker TS, Lenhard B. 2010. Exonic remnants of whole-genome duplication reveal cis-regulatory function of coding exons. Nucleic Acids Res. 38:1071–1085.

Downing T, Imamura H, Decuypere S, Clark TG, Coombs GH, Cotton JA, Hilley JD, de Doncker S, Maes I, Mottram JC, et al. 2011. Whole genome sequencing of multiple Leishmania donovani clinical isolates provides insights into population structure and mechanisms of drug resistance. Genome Res. 21:2143–2156.

Drummond DA, Bloom JD, Adami C, Wilke CO, Arnold FH. 2005. Why highly expressed proteins evolve slowly. Proc Natl Acad Sci U S A. 102:14338–14343.

Eddy SR. 2011. Accelerated profile HMM searches. PLoS Comput Biol. 7:e1002195.

Edgar RC. 2004a. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5:113.

Edgar RC. 2004b. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32:1792–1797.

Eichenlaub MP, Ettwiller L. 2011. De novo genesis of enhancers in vertebrates. PLoS Biol. 9:e1001188.

Ellison CE, Hall C, Kowbel D, Welch J, Brem RB, Glass NL, Taylor JW. 2011. Population genomics and local adaptation in wild isolates of a model microbial eukaryote. Proc Natl Acad Sci U S A. 108:2831–2836.

Elyashiv E, Bullaughey K, Sattath S, Rinott Y, Przeworski M, Sella G. 2010. Shifts in the intensity of purifying selection: an analysis of genome-wide polymorphism data from two closely related yeast species. Genome Res. 20:1558–1573.

Engelthaler DM, Hicks ND, Gillece JD, Roe CC, Schupp JM, Driebe EM, Gilgado F, Carriconde F, Trilles L, Firacative C, et al. 2014. Cryptococcus gattii in North American Pacific Northwest: whole-population genome analysis provides insights into species evolution and dispersal. mBio 5:e01464–e01414.

Fawcett JA, Iida T, Takuno S, Sugino RP, Kado T, Kugou K, Mura S, Kobayashi T, Ohta K, Nakayama J, et al. 2014. Population genomics of the fission yeast Schizosaccharomyces pombe. PLoS ONE 9:e104241.

Fenchel T, Finlay BJ. 2004. The ubiquity of small species: patterns of local and global diversity. Bioscience 54:777–784.

Finlay BJ. 2002. Global dispersal of free-living microbial eukaryote species. Science 296:1061–1063.

Finlay BJ, Esteban GF, Fenchel T. 2004. Protist diversity is different? Protist 155:15–22.

Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J, et al. 2014. Pfam: the protein families database. Nucleic Acids Res. 42:D222–D230.

Foissac S, Gouzy J, Rombauts S, Mathe C, Amselem J, Sterck L, Van de Peer Y, Rouze P, Schiex T. 2008. Genome annotation in plants and fungi: EuGene as a model platform. Curr Bioinform. 3:87–97.

Foissner W. 1999. Protist diversity: estimates of the near-imponderable. Protist 150:363–368.

Foissner W. 2006. Biogeography and dispersal of micro-organisms: a review emphasizing protists. Acta Protozool. 45:111–136.

Foissner W. 2008. Protist diversity and distribution: some basic considerations. Biodivers Conserv. 17:235–242.

Foissner W. 2011. Dispersal of protists: the role of cysts and human introductions. In: Fontaneto D, editor. Biogeography of microscopic organisms: is everything small everywhere? Cambridge University Press. p. 61–87.

Foissner W, Chao A, Katz LA. 2008. Diversity and geographic distribution of ciliates (Protista: Ciliophora). Biodivers Conserv. 17:345–363.

Fournier-Level A, Korte A, Cooper MD, Nordborg M, Schmitt J, Wilczek AM. 2011. A map of local adaptation in Arabidopsis thaliana. Science 334:86–89.

Galtier N, Nabholz B, Glemin S, Hurst GDD. 2009. Mitochondrial DNA as a marker of molecular diversity: a reappraisal. Mol Ecol. 18:4541–4550.

Gout JF, Kahn D, Duret L, Paramecium Post-Genomics C. 2010. The relationship among gene expression, the evolution of gene dosage, and the rate of protein evolution. PLoS Genet. 6:e1000944.

Gout JF, Lynch M. 2015. Maintenance and loss of duplicated genes by dosage subfunctionalization. Mol Biol Evol. 32:2141–2148.

Gruchy DF. 1955. The breeding system and distribution of Tetrahymena pyriformis. J Protozool. 2:178–185.

Gustincich S, Manfioletti G, Del Sal G, Schneider C, Carninci P. 1991. A fast method for high-quality genomic DNA extraction from whole human blood. BioTechniques 11:298–300, 302.

Hancock AM, Witonsky DB, Alkorta-Aranburu G, Beall CM, Gebremedhin A, Sukernik R, Utermann G, Pritchard JK, Coop G, Di Rienzo A. 2011. Adaptations to climate-mediated selective pressures in humans. PLoS Genet. 7:e1001375.

Hiwatashi K 1968 Determination and inheritance of mating type inParamecium caudatum Genetics 58373ndash386

Hori M Tomikawa I Przybos E Fujishima M 2006 Comparison of theevolutionary distances among syngens and sibling species ofParamecium Mol Phylogenet Evol 38697ndash704

Huang X, Kurata N, Wei X, Wang ZX, Wang A, Zhao Q, Zhao Y, Liu K, Lu H, Li W, et al. 2012. A map of rice genome variation reveals the origin of cultivated rice. Nature 490:497–501.

Huang X, Zhao Q, Han B. 2015. Comparative population genomics reveals strong divergence and infrequent introgression between Asian and African rice. Mol Plant 8:958–960.

Hudson RR. 2002. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18:337–338.

Hudson RR, Turelli M. 2003. Stochasticity overrules the "three-times rule": genetic drift, genetic draft, and coalescence times for nuclear loci versus mitochondrial DNA. Evolution 57:182–190.

Jaillon O, Bouhouche K, Gout JF, Aury JM, Noel B, Saudemont B, Nowacki M, Serrano V, Porcel BM, Segurens B, et al. 2008. Translational control of intron splicing in eukaryotes. Nature 451:359–362.

James JE, Piganeau G, Eyre-Walker A. 2016. The rate of adaptive evolution in animal mitochondria. Mol Ecol 25:67–78.

Joy DA, Feng X, Mu J, Furuya T, Chotivanich K, Krettli AU, Ho M, Wang A, White NJ, Suh E, et al. 2003. Early origin and recent expansion of Plasmodium falciparum. Science 300:318–321.

Kaplan NL, Hudson RR, Langley CH. 1989. The hitchhiking effect revisited. Genetics 123:887–899.

Katz LA, McManus GB, Snoeyenbos-West OLO, Griffin A, Pirog K, Costas B, Foissner W. 2005. Reframing the 'Everything is everywhere' debate: evidence for high gene flow and diversity in ciliate morphospecies. Aquat Microb Ecol 41:55–65.

Katz LA, Snoeyenbos-West O, Doerder FP. 2006. Patterns of protein evolution in Tetrahymena thermophila: implications for estimates of effective population size. Mol Biol Evol 23:608–614.

Krenek S, Berendonk TU, Fokin SI. 2015. New Paramecium (Ciliophora, Oligohymenophorea) congeners shape our view on its biodiversity. Org Divers Evol 15:215–233.

Krenek S, Berendonk TU, Petzoldt T. 2011. Thermal performance curves of Paramecium caudatum: a model selection approach. Eur J Protistol 47:124–137.

Krenek S, Petzoldt T, Berendonk TU. 2012. Coping with temperature at the warm edge–patterns of thermal adaptation in the microbial eukaryote Paramecium caudatum. PLoS One 7:e30598.

Kudla G, Murray AW, Tollervey D, Plotkin JB. 2009. Coding-sequence determinants of gene expression in Escherichia coli. Science 324:255–258.

Kvitek DJ, Will JL, Gasch AP. 2008. Variations in stress sensitivity and genomic expression in diverse S. cerevisiae isolates. PLoS Genet 4:e1000223.

Lahr DJ, Laughinghouse HDt, Oliverio AM, Gao F, Katz LA. 2014. How discordant morphological and molecular evolution among microorganisms can revise our notions of biodiversity on Earth. Bioessays 36:950–959.

Langley CH, Stevens K, Cardeno C, Lee YC, Schrider DR, Pool JE, Langley SA, Suarez C, Corbett-Detig RB, Kolaczkowski B, et al. 2012. Genomic variation in natural populations of Drosophila melanogaster. Genetics 192:533–598.

Leffler EM, Bullaughey K, Matute DR, Meyer WK, Segurel L, Venkat A, Andolfatto P, Przeworski M. 2012. Revisiting an old riddle: what determines genetic diversity levels within species? PLoS Biol 10:e1001388.

Li H. 2011. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27:2987–2993.

Li H, Durbin R. 2010. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26:589–595.

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Genome Project Data Processing S. 2009. The sequence alignment/Map format and SAMtools. Bioinformatics 25:2078–2079.

Lim LP, Burge CB. 2001. A computational analysis of sequence features involved in recognition of short introns. Proc Natl Acad Sci U S A 98:11193–11198.

Liti G, Carter DM, Moses AM, Warringer J, Parts L, James SA, Davey RP, Roberts IN, Burt A, Koufopanou V, et al. 2009. Population genomics of domestic and wild yeasts. Nature 458:337–341.

Loveless MD, Hamrick JL. 1984. Ecological determinants of genetic structure in plant populations. Annu Rev Ecol Syst 15:65–95.

Lynch M. 2007. The origins of genome architecture. Sunderland (MA): Sinauer Associates.

Lynch M. 2009. Estimation of allele frequencies from high-coverage genome-sequencing projects. Genetics 182:295–301.

Lynch M, Field MC, Goodson HV, Malik HS, Pereira-Leal JB, Roos DS, Turkewitz AP, Sazer S. 2014. Evolutionary cell biology: two origins, one objective. Proc Natl Acad Sci U S A 111:16990–16994.

Lynch M, Sung W, Morris K, Coffey N, Landry CR, Dopman EB, Dickinson WJ, Okamoto K, Kulkarni S, Hartl DL, et al. 2008. A genome-wide view of the spectrum of spontaneous mutations in yeast. Proc Natl Acad Sci U S A 105:9272–9277.

Maere S, Heymans K, Kuiper M. 2005. BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics 21:3448–3449.

Maguire B, Belk D. 1967. Paramecium transport by land snails. J Protozool 14:445–447.

Maruki T, Lynch M. 2015. Genotype-frequency estimation from high-throughput sequencing data. Genetics 201:473–486.

McGrath CL, Gout JF, Doak TG, Yanagi A, Lynch M. 2014a. Insights into three whole-genome duplications gleaned from the Paramecium caudatum genome sequence. Genetics 197:1417–1428.

McGrath CL, Gout JF, Johri P, Doak TG, Lynch M. 2014b. Differential retention and divergent resolution of duplicate genes following whole-genome duplication. Genome Res 24:1665–1675.

Miller JR, Delcher AL, Koren S, Venter E, Walenz BP, Brownley A, Johnson J, Li K, Mobarry C, Sutton G. 2008. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics 24:2818–2824.

Moradian MM, Beglaryan D, Skozylas JM, Kerikorian V. 2007. Complete mitochondrial genome sequence of three Tetrahymena species reveals mutation hot spots and accelerated nonsynonymous substitutions in Ymf genes. PLoS One 2:e650.

Nabholz B, Glemin S, Galtier N. 2009. The erratic mitochondrial clock: variations of mutation rate, not population size, affect mtDNA diversity across birds and mammals. BMC Evol Biol 9:54.

Nabholz B, Mauffrey JF, Bazin E, Galtier N, Glemin S. 2008. Determination of mitochondrial genetic diversity in mammals. Genetics 178:351–361.

Neafsey DE, Galinsky K, Jiang RH, Young L, Sykes SM, Saif S, Gujja S, Goldberg JM, Young S, Zeng Q, et al. 2012. The malaria parasite Plasmodium vivax exhibits greater genetic diversity than Plasmodium falciparum. Nat Genet 44:1046–1050.

Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, Indap A, King KS, Bergmann S, Nelson MR, et al. 2008. Genes mirror geography within Europe. Nature 456:98–101.

Ohta T. 1974. Mutational pressure as main cause of molecular evolution and polymorphism. Nature 252:351–354.

Ohta T. 1992. The nearly neutral theory of molecular evolution. Annu Rev Ecol Syst 23:263–286.

Pawlowski J, Audic S, Adl S, Bass D, Belbahri L, Berney C, Bowser SS, Cepicka I, Decelle J, Dunthorn M, et al. 2012. CBOL protist working group: barcoding eukaryotic richness beyond the animal, plant, and fungal kingdoms. PLoS Biol 10:e1001419.

Piganeau G, Eyre-Walker A. 2009. Evidence for variation in the effective population size of animal mitochondrial DNA. PLoS One 4:e4396.

Pool JE, Nielsen R. 2007. Population size changes reshape genomic patterns of diversity. Evolution 61:3001–3006.

Pophaly SD, Tellier A. 2015. Population level purifying selection and gene expression shape subgenome evolution in maize. Mol Biol Evol 32:3226–3235.

Prescott DM. 1994. The DNA of ciliated protozoa. Microbiol Rev 58:233–267.

Pritchard AE, Seilhamer JJ, Mahalingam R, Sable CL, Venuti SE, Cummings DJ. 1990. Nucleotide sequence of the mitochondrial genome of Paramecium. Nucleic Acids Res 18:173–180.

Przybos E. 1975. Genetic studies of Paramecium jenningsi strains (Diller, Earl, 1958). Folia Biol (Krakow) 23:425–471.

Przybos E. 2005. Recent data on the occurrence of species of the Paramecium aurelia complex in Europe. Folia Biol (Krakow) 53:61–63.

Przybos E, Barth D, Berendonk TU. 2008. The Paramecium aurelia species complex, frequency and co-occurrence across Europe. Folia Biol (Krakow) 56:77–81.

Przybos E, Rautian M, Greczek-Stachura M, Potekhin A. 2007. Polymorphism within Paramecium sexaurelia (Ciliophora, Oligohymenophorea) and description of a new stand of the species in China. Folia Biol (Krakow) 55:121–125.

Przybos E, Surmacz M. 2010. New world-wide data on the distribution of species of the Paramecium aurelia complex (Ciliophora, Protozoa). Folia Biol (Krakow) 58:185–188.

R-Core-Team. 2014. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

Redden H, Alper HS. 2015. The development and characterization of synthetic minimal yeast promoters. Nat Commun 6:7810.

Ritter DI, Dong Z, Guo S, Chuang JH. 2012. Transcriptional enhancers in protein-coding exons of vertebrate developmental genes. PLoS One 7:e35202.

Russell CB, Fraga D, Hinrichsen RD. 1994. Extremely short 20-33 nucleotide introns are the standard length in Paramecium tetraurelia. Nucleic Acids Res 22:1221–1225.

Salim HMW, Ring KL, Cavalcanti ARO. 2008. Patterns of codon usage in two ciliates that reassign the genetic code: Tetrahymena thermophila and Paramecium tetraurelia. Protist 159:283–298.

Schacherer J, Shapiro JA, Ruderfer DM, Kruglyak L. 2009. Comprehensive polymorphism survey elucidates population structure of Saccharomyces cerevisiae. Nature 458:342–345.

Schattner P, Brooks AN, Lowe TM. 2005. The tRNAscan-SE, snoscan and snoGPS web servers for the detection of tRNAs and snoRNAs. Nucleic Acids Res 33:W686–W689.

Schrider DR, Kern AD. 2015. Inferring selective constraint from population genomic data suggests recent regulatory turnover in the human brain. Genome Biol Evol 7:3511–3528.

Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. 2003. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13:2498–2504.

Simonsen KL, Churchill GA, Aquadro CF. 1995. Properties of statistical tests of neutrality for DNA polymorphism data. Genetics 141:413–429.

Smith DR, Lee RW. 2008. Nucleotide diversity in the mitochondrial and nuclear compartments of Chlamydomonas reinhardtii: investigating the origins of genome architecture. BMC Evol Biol 8:156.

Smith JM, Haigh J. 1974. The hitch-hiking effect of a favourable gene. Genetics Res 89:391–403.

Snoke MS, Berendonk TU, Barth D, Lynch M. 2006. Large global effective population sizes in Paramecium. Mol Biol Evol 23:2474–2479.

Sonneborn TM. 1970. Methods in Paramecium research. In: Methods Cell Biol. New York and London: Academic Press. p. 241–339.

Sonneborn TM. 1975. Paramecium aurelia complex of fourteen sibling species. T Am Microsc Soc 94:155–178.

Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30:1312–1313.

Stoeck T, Welter H, Seitz-Bender D, Kusch J, Schmidt HJ. 2000. ARDRA and RAPD-fingerprinting reject the sibling species concept for the ciliate Paramecium caudatum (Ciliophora, Protoctista). Zool Scr 29:75–82.

Sun M. 1999. Cleistogamy in Scutellaria indica (Labiatae): effective mating system and population genetic structure. Mol Ecol 8:1285–1295.

Swart EC, Nowacki M, Shum J, Stiles H, Higgins BP, Doak TG, Schotanus K, Magrini VJ, Minx P, Mardis ER, et al. 2012. The Oxytricha trifallax mitochondrial genome. Genome Biol Evol 4:136–154.

Tajima F. 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123:585–595.

Tishkoff SA, Reed FA, Ranciaro A, Voight BF, Babbitt CC, Silverman JS, Powell K, Mortensen HM, Hirbo JB, Osman M, et al. 2007. Convergent adaptation of human lactase persistence in Africa and Europe. Nat Genet 39:31–40.

Tsukii Y, Hiwatashi K. 1983. Genes controlling mating-type specificity in Paramecium caudatum: three loci revealed by intersyngenic crosses. Genetics 104:41–62.

Tsukii Y, Hiwatashi K. 1985. Meiotic nondisjunction and aneuploids in intersyngenic hybrids of Paramecium caudatum. Genetics 111:779–794.

Tuller T, Zur H. 2015. Multiple roles of the coding sequence 5′ end in gene expression regulation. Nucleic Acids Res 43:13–28.

Voight BF, Kudaravalli S, Wen X, Pritchard JK. 2006. A map of recent positive selection in the human genome. PLoS Biol 4:e72.

Volkman SK, Sabeti PC, DeCaprio D, Neafsey DE, Schaffner SF, Milner DA Jr, Daily JP, Sarr O, Ndiaye D, Ndir O, et al. 2007. A genome-wide map of diversity in Plasmodium falciparum. Nat Genet 39:113–119.

Weir W, Capewell P, Foth B, Clucas C, Pountain A, Steketee P, Veitch N, Koffi M, De Meeus T, Kabore J, et al. 2016. Population genomics reveals the origin and asexual evolution of human infective trypanosomes. eLife 5:e11473.

Weisse T. 2008. Distribution and diversity of aquatic protists: an evolutionary and ecological perspective. Biodivers Conserv 17:243–259.

Wickham H. 2010. ggplot2: Elegant graphics for data analysis. New York: Springer-Verlag.

Zhang JZ, Yang JR. 2015. Determinants of the rate of protein sequence evolution. Nat Rev Genet 16:409–420.

Nucleotide Diversity across the Nuclear Genome

Nucleotide diversity (measured by π) at 4-fold degenerate sites across the genome is extremely high in both P. caudatum (0.135) and P. multimicronucleatum (0.159), but much lower for the P. aurelia species: 0.027 for P. sexaurelia, 0.009 for P. biaurelia, and 0.006 for P. tetraurelia. Upon excluding the two divergent strains in P. caudatum, the nucleotide diversity at 4-fold degenerate sites halves to about 0.069. Similarly, upon excluding the four divergent strains in P. multimicronucleatum, the diversity levels fall to 0.074. Table 1 shows the nucleotide diversity across different site types: intronic, coding (4-fold degenerate, 2-fold degenerate, and 0-fold degenerate), and intergenic regions (noncoding regions that extend from the start codon of a gene to the start/stop codon of the adjacent gene).
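To make the quantity reported above concrete, the following minimal Python sketch computes per-site expected heterozygosity from sampled alleles and averages it over sites. It uses the standard sample-size-corrected estimator, which is an assumption made here for illustration and not necessarily the estimator used to obtain the values in this study; the function names are hypothetical.

```python
from collections import Counter


def expected_heterozygosity(alleles):
    """Sample-size-corrected expected heterozygosity at one site.

    `alleles` is a list of the alleles sampled at that site, one per
    sequenced chromosome (hypothetical input format).
    """
    n = len(alleles)
    if n < 2:
        raise ValueError("need at least two sampled alleles")
    freqs = [count / n for count in Counter(alleles).values()]
    # 1 - sum(p_i^2) is the chance two draws differ; n/(n-1) corrects for sampling.
    return n / (n - 1) * (1.0 - sum(f * f for f in freqs))


def nucleotide_diversity(sites):
    """Mean per-site heterozygosity over a list of sites."""
    return sum(expected_heterozygosity(s) for s in sites) / len(sites)


# Toy example: one invariant site and one site with two alleles at 50%.
toy_sites = [list("AAAA"), list("AATT")]
toy_pi = nucleotide_diversity(toy_sites)
```

In this toy input the invariant site contributes zero and the segregating site contributes 2/3, so the average over the two sites is 1/3.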

FIG. 1. A concatenated tree of all single-copy gene sequences (maximum likelihood, GTRGAMMA) in five species, P. tetraurelia (green, T), P. biaurelia (blue, B), P. sexaurelia (orange, S), P. caudatum (red, C), and P. multimicronucleatum (purple, M), shows that all species are monophyletic, excluding one strain of P. biaurelia (B_A) that branches with P. tetraurelia. Clades A and B together constitute the P. caudatum subset, and Clade C represents the P. multimicronucleatum subset. Marked isolates are the divergent isolates that mapped to <75% of the reference genome. The numbers represent maximum likelihood estimates of branch lengths (mean number of substitutions per site).

Expected heterozygosity in introns increases significantly across all species upon exclusion of the first three and last three sites within each intron, indicating that these sites are conserved, presumably for splicing. Overall, diversity in intronic and intergenic regions is remarkably low compared to the diversity levels at 4-fold degenerate sites for all species (also seen in supplementary fig. S3, Supplementary Material online), suggesting that the non-coding regions in Paramecium are under strong selective constraints, most likely because they harbor functionally important splice sites and regulatory modules, respectively. Interestingly, while nucleotide diversity levels in the P. aurelia species are much lower in introns relative to intergenic sequences, this trend is reversed in the two outgroups. This observation is consistent with the fact that, on average, intergenic regions in the P. aurelia species are longer. A possible explanation is the nonfunctionalization of genes following the two WGDs, which would increase the amount of non-functional DNA between the two flanking functional genes and thus reduce the fraction of intergenic sites under constraint. Among the P. aurelia species, P. sexaurelia shows much higher genetic variation across the genome, consistent with previous studies (Przybos et al. 2007) that found significant genotypic differences between P. sexaurelia strains. This is also consistent with the relatively higher divergence of P. sexaurelia (McGrath et al. 2014b), suggesting that the differences in genome-wide levels of sequence variation could be driven by slight differences in mutation rate.

FIG. 2. Population structure of Paramecium species. Left: Principal component analysis using genome-wide genotypes for P. tetraurelia, P. biaurelia, and P. sexaurelia. Right: Average pairwise genetic distance as a function of physical distance between individuals within P. tetraurelia, P. biaurelia, and P. sexaurelia.

Overall, 4-fold degenerate sites have the highest levels of nucleotide diversity across all elements of the genome for all species and thus might represent the most neutrally evolving sites in the Paramecium macronucleus. The diversity values reported for the P. aurelia species fall within the range of known diversity levels for other unicellular eukaryotes, such as Plasmodium falciparum (4-fold degenerate π = 0.0015) (Volkman et al. 2007), Chlamydomonas reinhardtii (silent-site π = 0.0319) (Smith and Lee 2008), and Tetrahymena thermophila (coding π = 0.0020) (Katz et al. 2006). On the other hand, the values of 4-fold degenerate diversity in the outgroup species P. caudatum and P. multimicronucleatum are among the highest reported (Leffler et al. 2012; Dey et al. 2013). It is still not entirely clear whether such high diversity values are due to a large panmictic global population or to population structure despite infrequent gene flow. Sampling individuals from the same pond will help distinguish between these two scenarios. However, we cannot presently rule out other factors, such as the presence of cryptic species among the outgroups, introgression with closely related species, or a much higher intrinsic mutation rate, that could also contribute to these remarkably high levels of variation.

Intergenic Regions in the Outgroups are Under Strong Constraints

The macronuclear genomes of all Paramecium species are extremely compact (fig. 3a and b), with the median intergenic region (non-coding DNA between start and start/stop sites of adjacent genes) in P. caudatum and P. multimicronucleatum comprising 43 and 93 bp, respectively. The aurelia species have slightly longer intergenic regions (medians of 161, 195, and 229 bp in P. tetraurelia, P. biaurelia, and P. sexaurelia, respectively). Paramecium also has some of the shortest eukaryotic introns, with an average size of just 20–25 base pairs (Russell et al. 1994). The internally eliminated sequences (IESs), which are excised during the formation of the macronucleus and are well characterized only in P. tetraurelia, have a median length of 26–28 bp (Arnaiz et al. 2012). In fact, 93% of the IESs in P. tetraurelia are shorter than 150 bp (Arnaiz et al. 2012), suggesting that the micronuclear genomes of the Paramecium species are also fairly compact. In most organisms, diversity in coding regions is well below the diversity in intergenic and intronic regions. For instance, in S. cerevisiae, which has a fairly compact genome, the non-coding to coding diversity ratio is 1.23 (Schacherer et al. 2009). In S. pombe, the coding heterozygosity is 0.2122 and diversity at intergenic regions (including 5′ and 3′ UTRs) is 0.3351, resulting in an intergenic to coding diversity ratio of 1.58 (calculated from Fawcett et al. 2014).
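The S. pombe ratio quoted above is simple arithmetic on the two diversity values, and can be checked with a short Python sketch. The function name is hypothetical, and the decimal placement of the two inputs is taken from the values as read in the text.

```python
def intergenic_to_coding_ratio(pi_intergenic, pi_coding):
    """Ratio of intergenic to coding diversity; values above 1 mean that
    intergenic sites are more variable than coding sites."""
    if pi_coding <= 0:
        raise ValueError("coding diversity must be positive")
    return pi_intergenic / pi_coding


# S. pombe values as quoted in the text (calculated from Fawcett et al. 2014).
s_pombe_ratio = intergenic_to_coding_ratio(0.3351, 0.2122)
print(round(s_pombe_ratio, 2))
```

A ratio above 1, as here, is the pattern described for most organisms; a ratio below 1 would indicate that intergenic sites are less variable than coding sites.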

In the P. aurelia species, nucleotide diversity in coding regions is ~50% of the intergenic diversity, while for the outgroup species, intergenic regions have much less diversity than the coding regions (table 1). In order to reduce biases that might result from differences in sequence depth of coverage across different sites, we compared expected heterozygosity at coding and intergenic regions using only sites that were covered by all individuals and had coverage very close to the median population coverage (±20 reads) (fig. 3c). While the ratio of intergenic to coding diversity in the P. aurelia species is comparable to that in other organisms (1.66 in P. tetraurelia, 1.64 in P. biaurelia, and 1.69 in P. sexaurelia), in the outgroup species the ratio is 0.58 [subset: 0.74] for P. caudatum and 0.80 [subset: 0.75] for P. multimicronucleatum. Hereafter, the values in square brackets correspond to the subsets. We know of no other species with such low nucleotide diversity in intergenic relative to coding regions, suggesting strong conservation of intergenic sequence.
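The coverage-based site filter described above could be sketched as follows. This is an illustrative reading rather than the pipeline used in the study: it assumes that "population coverage" means the read depth summed over individuals at a site, that the tolerance is a window of 20 reads around the median of that sum, and that per-site depths are available as a dictionary; all names are hypothetical.

```python
from statistics import median


def usable_sites(depth_by_site, n_individuals, window=20):
    """Return sites covered by every individual whose summed depth lies
    within `window` reads of the median summed depth.

    `depth_by_site` maps a site identifier to a list of per-individual
    read depths (hypothetical input format).
    """
    totals = {site: sum(depths) for site, depths in depth_by_site.items()}
    med = median(totals.values())
    kept = []
    for site, depths in depth_by_site.items():
        covered_by_all = len(depths) == n_individuals and all(d > 0 for d in depths)
        near_median = abs(totals[site] - med) <= window
        if covered_by_all and near_median:
            kept.append(site)
    return sorted(kept)


# Toy example with three individuals: site 2 lacks reads in one individual,
# and site 3 has far more reads than the median.
toy_depths = {1: [10, 10, 10], 2: [10, 0, 20], 3: [50, 50, 50], 4: [12, 9, 11]}
kept_sites = usable_sites(toy_depths, n_individuals=3)
```

Restricting both site classes to the same depth window means that coding and intergenic heterozygosity are compared at sites with similar power to detect variants.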

Table 1. Mean Nucleotide Diversity (per site) Estimated for All Isolates across the Genome (SE < 3.0 × 10⁻⁴ for all sites).

Species                          4-fold  2-fold  0-fold  syn     non-syn  intron  Intron  intergenic  avg coding
P. tetraurelia                   0.0058  0.0033  0.0012  0.0042  0.0011   0.0026  0.0032  0.0045      0.0023
P. biaurelia                     0.0089  0.0049  0.0021  0.0059  0.0018   0.0049  0.0050  0.0071      0.0035
P. sexaurelia                    0.0269  0.0165  0.0061  0.0194  0.0052   0.0151  0.0167  0.0209      0.0113
P. caudatum                      0.1352  0.0819  0.0150  0.1056  0.0134   0.0466  0.0602  0.0350      0.0472
P. caudatum (subset)             0.0693  0.0381  0.0090  0.0518  0.0080   0.0316  0.0406  0.0262      0.0241
P. multimicronucleatum           0.1599  0.0781  0.0216  0.1146  0.0193   0.0405  0.0492  0.0337      0.0530
P. multimicronucleatum (subset)  0.0742  0.0418  0.0100  0.0549  0.0089   0.0249  0.0330  0.0201      0.0259

NOTE. The column headed "Intron" represents introns excluding the first three and last three sites. Intergenic region refers to the noncoding DNA that extends between the stop and start site of protein-coding genes.

Despite accounting for bias in coverage, it is possible that if the reference genome is highly diverged from the sequenced individuals (as in our case), say at intergenic regions, those reads might fail to map and hence result in a lower estimate of intergenic diversity. We therefore performed simulations to address mapping and SNP detection when the reference genome is highly diverged from individuals in the population. Our simulations show that, up to 10% divergence from the reference, SNP detection is remarkably accurate. At 10% divergence from the reference, 93% of true SNPs can be detected, and estimated divergence (9.5%) is only slightly lower than true divergence (10%) (supplementary fig. S4, supplementary table S4, Supplementary Material online). However, reads obtained from genomes that were more than 10% divergent from the reference failed to map successfully, and there is a steep reduction in the fraction of true SNPs detected, with only ~10% of the true SNPs detected at 20% divergence from the reference (supplementary fig. S4, Supplementary Material online). Interestingly, most of our strains in P. caudatum are far below 10% divergence (including coding and non-coding regions separately) (supplementary table S5, Supplementary Material online), and thus overall analyses across the majority of the genome should not be affected by this bias. In P. multimicronucleatum, although the divergent strains are close to being 10% divergent from the reference (supplementary table S5, Supplementary Material online), when we consider only the P. multimicronucleatum subset, all strains are less than 6% divergent from the reference. Therefore, it is possible that such a bias is affecting our results when we include all the strains in the outgroup species, but analyses with the subsets should be minimally influenced by this bias. Nonetheless, it should be noted that, due to heterogeneity of divergence across the genome, some regions of the genome might still be affected by such a mapping bias. Accordingly, we repeated the above analysis by calling SNPs using only the 4 and 5 closest individuals to the reference genome (forming polytomies in supplementary fig. S1, Supplementary Material online) in P. caudatum and P. multimicronucleatum, respectively. These individuals were even closer to the reference genome, with average expected heterozygosity at 4-fold degenerate sites of 0.04302 and 0.06255, respectively. Again, we observe that for all sites that are covered by all individuals and have coverage very close to the median population coverage (±20 reads), the ratio of intergenic to coding diversity was 0.78 and 0.72, respectively. This suggests that the observation of relatively lower genetic variation in the intergenic relative to coding regions in the outgroup species is not an artifact of our analysis.

As an alternative approach to examine the relative strength of selective constraint on coding versus non-coding DNA, we examined the distribution of insertions/deletions (indels) across the genome. A total of 29,096 (P. tetraurelia), 54,406 (P. biaurelia), 88,724 (P. sexaurelia), 32,631 [28,776] (P. caudatum [subset]), and 31,338 [28,347] (P. multimicronucleatum [subset]) indels were found across the genomes, which translates to SNP/indel ratios of 19, 17, 34, 147 [77], and 138 [88], respectively. The indel lengths are exponentially distributed in all species, with a median of 1 bp. Indels are heavily under-represented in coding regions (fig. 3d), where the median indel length is 3 bp, suggesting purifying selection against indels that cause frameshifts. Introns and intergenic regions, however, tend to have a median indel length of 1 bp (supplementary table S6, Supplementary Material online). Again, we see that in the P. aurelia species there are proportionally more indels present in the intergenic regions than in introns. On the other hand, in the outgroup species, intergenic regions tolerate fewer indels than do intronic regions. This shows that the very short intergenic regions in P. caudatum and P. multimicronucleatum are under extremely strong constraint.

FIG. 3. The compactness of Paramecium genomes. (A, B) Distribution of intron and intergenic length across Paramecium genomes: (A) P. tetraurelia and (B) P. caudatum. Mean intron lengths in P. tetraurelia and P. caudatum: 25 and 22 bp; median intergenic lengths in P. tetraurelia and P. caudatum: 161 and 43 bp. (C) πcoding/πintergenic calculated genome-wide from sites that had coverage very close to the median population coverage (±20 reads) and were contributed by the same number of individuals. (D) Incidence of indels across the genome for each Paramecium species, calculated by dividing the total number of indels found in that region by the total number of utilizable nucleotide sites in that region.
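The indel summaries discussed above (incidence per utilizable site, median length, and the link between a 3-bp median and frameshift avoidance) can be illustrated with a minimal Python sketch. The function names and the toy input are hypothetical; only the definition of incidence follows the description of fig. 3d.

```python
from statistics import median


def is_frameshift(indel_length):
    """An indel in coding sequence shifts the reading frame unless its
    length is a multiple of three."""
    return indel_length % 3 != 0


def summarize_indels(lengths, utilizable_sites):
    """Summarize indels observed in one region class.

    `lengths` is a list of indel lengths in bp; `utilizable_sites` is the
    number of nucleotide sites in that region class that could be scored.
    """
    return {
        "incidence": len(lengths) / utilizable_sites,
        "median_length": median(lengths),
        "frameshift_fraction": sum(is_frameshift(n) for n in lengths) / len(lengths),
    }


# Toy coding-region example: mostly in-frame indels, as expected under
# purifying selection against frameshifts.
toy_summary = summarize_indels([3, 3, 6, 1, 3], utilizable_sites=1000)
```

In this toy input the median length is 3 bp and only one of five indels disrupts the frame, the qualitative signature the text attributes to coding regions.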

Spatial Patterns of Constraints across Non-Coding Regions in the Genomes

Selection in natural populations can remove strongly deleterious mutations, but weakly deleterious mutations can continue to persist in the population at lower allele frequencies (Ohta 1974, 1992). Thus, within-population nucleotide diversity can be used to infer constraints across the genome and possibly identify functional parts of the non-coding genome. We therefore evaluated spatial patterns of nucleotide diversity in exons flanking all introns and intergenic regions across the genome for each species separately (fig. 4; supplementary fig. S5, Supplementary Material online). However, neutral sites linked to functionally important sites also experience reduction in diversity as a result of either purging of deleterious alleles (background selection; Charlesworth et al. 1993) or fixation of a beneficial mutation (hitchhiking; Smith and Haigh 1974; Kaplan et al. 1989) at functional loci due to linkage (Comeron et al. 2008). As selection has a minimal effect on divergence at linked neutral sites (Birky and Walsh 1988), a comparison of polymorphism and divergence can distinguish between the effects of variation in the neutral mutation rate (which is determined by selective constraint in addition to mutational processes) versus linked selection (Begun and Aquadro 1992). To this end, we evaluated patterns of divergence across introns and intergenic regions for the relatively more closely related P. aurelia species.

We observed a reduction in both expected heterozygosity and divergence in intergenic regions near the start and end of coding sequences, as well as at some key sites within introns (figs. 4 and 5 and supplementary fig. S6, Supplementary Material online). In fact, divergence-normalized values of expected heterozygosity fail to exhibit any spatial pattern across these non-coding elements (fig. 5; supplementary fig. S6, Supplementary Material online), suggesting that these patterns are driven predominantly by negative selection and mutational processes (although see below) rather than linked selection. This is interesting because, although asexual species are expected to experience the effects of linked selection more strongly (Charlesworth et al. 1993), very compact genomes such as those in Paramecium might experience relatively more direct selection, as they have a higher proportion of functional sites.
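The logic of the divergence-normalized comparison can be sketched in a few lines of Python. The sketch assumes per-position averages of heterozygosity and divergence are already available as dictionaries keyed by position relative to a gene boundary; the names and toy values are hypothetical and do not come from the study.

```python
def normalized_profile(het_by_pos, div_by_pos):
    """Return {position: heterozygosity / divergence}, skipping positions
    where divergence is zero or missing."""
    return {
        pos: het_by_pos[pos] / div_by_pos[pos]
        for pos in sorted(het_by_pos)
        if div_by_pos.get(pos, 0) > 0
    }


# Toy example: heterozygosity and divergence both drop toward the gene
# boundary (position -1 is farthest here), but in the same proportion.
toy_het = {-3: 0.002, -2: 0.004, -1: 0.008}
toy_div = {-3: 0.05, -2: 0.10, -1: 0.20}
toy_ratio = normalized_profile(toy_het, toy_div)
```

When polymorphism and divergence are reduced in proportion, as in this toy input, the normalized profile is flat, which is the pattern expected under direct constraint or mutation-rate variation. Linked selection would instead lower heterozygosity more than divergence and leave a dip in the ratio.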

The first and last three sites of introns are highly conserved, even more so than 0-fold degenerate sites, as observed in patterns of both divergence and nucleotide diversity (figs. 4 and 5), consistent with the role of these positions in intron splicing (Jaillon et al. 2008). In fact, it has been suggested that the first and last three sites in short eukaryotic introns might be sufficient for accurate splicing (Lim and Burge 2001; Jaillon et al. 2008). However, we find that in Paramecium, on average, 8–10 base pairs of coding DNA immediately flanking an intron are also under stronger constraints than the remainder of the coding sequence, possibly suggesting the presence of splicing regulatory elements in the coding sequence. Another interesting observation is the clear decrease in divergence (fig. 5e and g) in the center of introns, suggesting that longer introns might possess functional elements related to gene regulation.

We also observe that average expected heterozygosity and divergence increase gradually moving upstream from the start codon and downstream from the stop codon into the intergenic regions, stabilizing at around 100–150 base pairs (fig. 4), suggesting the presence of regulatory modules or UTRs. All four types of intergenic regions with respect to the direction of transcription, (i) regions that are upstream of both flanking genes, (ii) those downstream of one gene and upstream of another, (iii) upstream of one gene and downstream of another, and (iv) downstream of both flanking genes, exhibit the same pattern of reduction in heterozygosity levels just before the start and after the stop codons (supplementary fig. S7, Supplementary Material online). This indicates that constraints on downstream intergenic regions are not an artifact of overlapping upstream intergenic regions. Instead, these observations suggest the presence of gene regulatory elements at the 3′ end of genes, involved in posttranscriptional regulation, in the specification of transcription termination, or in the regulation of gene expression. We note that in the P. aurelia species, diversity levels in the intergenic regions stabilize at the average diversity at 4-fold degenerate sites. In contrast, in the two outgroup species, diversity in intergenic regions stabilizes only at a level equivalent to the average diversity at 2-fold degenerate sites. Such differences in patterns of nucleotide diversity between the P. aurelia species and the outgroup species can be attributed to a much higher proportion of functionally important sites comprising the intergenic space in the outgroup species, possibly as a result of smaller numbers of bases between adjacent genes in the latter.
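A spatial profile of the kind described above, average heterozygosity as a function of distance from a start or stop codon, can be built by pooling sites from all intergenic regions by their distance to the nearest codon. The following Python sketch shows one way to do this; the input format, the function name, and the 200-bp cutoff are assumptions for illustration only.

```python
from collections import defaultdict


def mean_by_distance(observations, max_distance=200):
    """Average heterozygosity at each distance (in bp) from a start or
    stop codon, pooling sites across all intergenic regions.

    `observations` is an iterable of (distance_bp, heterozygosity) pairs
    (hypothetical input format); distances beyond `max_distance` are ignored.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for distance, het in observations:
        if 1 <= distance <= max_distance:
            sums[distance] += het
            counts[distance] += 1
    return {d: sums[d] / counts[d] for d in sorted(sums)}


# Toy example: two genes contribute a site 1 bp from the codon, one gene
# contributes a site at 2 bp, and one site lies beyond the cutoff.
toy_obs = [(1, 0.0), (1, 0.002), (2, 0.004), (300, 0.1)]
toy_profile = mean_by_distance(toy_obs)
```

Plotting such a profile separately for upstream and downstream regions, and for each of the four intergenic orientations, would show where the average levels off.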

A similar reduction in diversity in intergenic regions extends up to 85 kb around transcription start/stop sites of genes in humans (Abecasis et al. 2010). In S. cerevisiae, intergenic regions are under strong purifying selection, and those shorter than 300 bp have significantly lower nucleotide diversity than longer intergenic regions (Schacherer et al. 2009).

Interestingly, we also observe, on average, a 70–100 bp (20–35 amino acids) decrease in diversity as well as divergence at both the 5′ and 3′ ends of the coding sequence (CDS) of protein-coding genes. Such reduction in diversity at 4-fold and 2-fold degenerate sites at the beginning and end of protein-coding sequence can be explained by a number of different hypotheses. The 5′ ends of CDSs might be required to maintain low thermodynamic stability of mRNA secondary structure near the start codon for efficient translation initiation. Alternatively, CDS 5′ ends might possess fewer tRNA-adapted codons to slow down translation initially, which has been proposed to reduce ribosome collision-induced premature translation terminations and facilitate protein production (see Tuller and Zur 2015). Consistent with this, the 5′ and 3′ ends of protein-coding genes in bacteria (32–35 codons) have been shown to be important for efficient translation (Kudla et al. 2009; Boel et al. 2016), and their mRNA folding energies can explain observed variation in expression levels (Kudla et al. 2009). In fact, it has been shown previously that the first few codons in proteins of P. tetraurelia and T. thermophila are, on average, comprised of significantly different numbers of effective codons than the rest of the protein (Salim et al. 2008). Thus, the role played by the first and last few bases of all coding sequences in translation might contribute to additional sequence constraints, resulting in reduced nucleotide diversity and divergence.

FIG. 4. Average expected heterozygosity (per site) within and flanking introns (left) and intergenic regions (right) in P. tetraurelia (a, b), P. biaurelia (c, d), P. sexaurelia (e, f), P. caudatum subset (g, h), and P. multimicronucleatum subset (i, j), separately for 4-fold degenerate sites (red), 2-fold degenerate sites (blue), 0-fold degenerate sites (green), and non-coding sites (black).

FIG. 5. Patterns of average divergence (per site) within and flanking intergenic regions (a, c) and introns (e, g) are shown in the left panels. Patterns of average expected heterozygosity per site normalized by average divergence per site within and flanking intergenic regions (b, d) and introns (f, h) are shown on the right. The divergence values are calculated between P. tetraurelia and P. biaurelia, separately for 4-fold degenerate sites (red), 2-fold degenerate sites (blue), 0-fold degenerate sites (green), and non-coding sites (black).

Another factor that might contribute to constraining the CDS ends is the presence of regulatory modules at the beginning and end of genes. Exonic enhancers have been observed in multiple species, including teleosts (Dong et al. 2010; Eichenlaub and Ettwiller 2011), humans (Birnbaum et al. 2012), and other vertebrates (Ritter et al. 2012), where they have been shown to regulate nearby genes. In the highly compact genomes of Paramecium, it is even more likely that regulatory regions extend beyond non-coding regions into the CDS. For instance, in P. caudatum the median length of 5′ intergenic regions is 43 bp, which increases to 56 bp when both genes share the same 5′ intergenic region, suggesting the importance of upstream intergenic length in such compact genomes. Interestingly, in P. caudatum 965 genes (7%) have 5′ intergenic regions shorter than 10 bp, and 3417 genes (25%) have 5′ intergenic regions shorter than 20 bp. Similarly, in P. multimicronucleatum, the median length of the 5′ intergenic region is 90 bp, which increases to 127 bp when both genes share the same 5′ intergenic region. Again, there are 456 genes (3.5%) with fewer than 10 bp and 1760 genes (13%) with fewer than 20 bp comprising their upstream intergenic region. In fact, in both species the 5′ intergenic region can be as small as 3 bp (found in 38 genes in P. caudatum and 18 genes in P. multimicronucleatum). An interesting question this raises is what the minimum number of bases required to encode regulatory information about transcription of genes in eukaryotes might be. A recent study in yeast created a high-strength minimal constitutive promoter that is only 116 bp in length (Redden and Alper 2015). It might be possible to observe even smaller promoter regions needed to drive expression patterns in Paramecium, especially in P. caudatum.

Spatial Patterns of Nucleotide Diversity across Non-Coding Regions Are Not Driven by Mutation Alone

Although we have assumed that the diversity-normalized divergence patterns indicate the effect of negative selection, it is possible that the observed spatial patterns of nucleotide diversity across and flanking non-coding regions are driven by mutation alone. One way to potentially distinguish between mutation and direct selection as the cause of these patterns is to evaluate the allele frequency of the variants observed at different site types. If the reduction in nucleotide diversity near the start and end of coding sequences is mostly a consequence of negative selection, we would expect a higher fraction of rare alleles at 4-fold degenerate sites under stronger constraints relative to that seen at 4-fold degenerate sites not under constraints. If, instead, mutation is the predominant driver of the extent of nucleotide diversity at these sites, then there should be no difference in the fraction of rare alleles between the 4-fold degenerate sites under stronger negative selection and those not under constraints. Because Tajima's D (Tajima 1989) is a good statistic to summarize such a shift in rare alleles, we compared the average Tajima's D across site categories to evaluate this.
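To make the logic of this test concrete, the following is a minimal Python sketch of Tajima's D computed from the number of sampled chromosomes, the number of segregating sites, and the mean number of pairwise differences, using the standard constants from Tajima (1989). The function names and the toy allele counts are illustrative assumptions and are not taken from the pipeline used in this study.

```python
import math


def nucleotide_diversity(minor_allele_counts, n):
    """Mean number of pairwise differences, summed over segregating sites.

    minor_allele_counts: one count per segregating site (hypothetical input).
    n: number of sampled chromosomes.
    """
    pairs = n * (n - 1) / 2
    return sum(k * (n - k) for k in minor_allele_counts) / pairs


def tajimas_d(n, num_segregating, pi):
    """Tajima's D (Tajima 1989); returns None when D is undefined."""
    if n < 4 or num_segregating == 0:
        return None  # the variance term is zero for n < 4
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    s = num_segregating
    variance = e1 * s + e2 * s * (s - 1)
    return (pi - s / a1) / math.sqrt(variance)


# Toy comparison: an excess of rare alleles versus intermediate frequencies.
n = 10
rare = [1] * 10     # every variant is a singleton
common = [5] * 10   # every variant is at 50% frequency
d_rare = tajimas_d(n, len(rare), nucleotide_diversity(rare, n))
d_common = tajimas_d(n, len(common), nucleotide_diversity(common, n))
```

In this toy case the singleton-only set gives a negative D and the intermediate-frequency set a positive D, which is the direction of shift the comparison between constrained and unconstrained 4-fold degenerate sites relies on.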

In each Paramecium species, we first characterized sites genome-wide into five different categories: (1) splice sites (first and last 3 sites) in introns, (2) intergenic sites within 50 bp of start/stop sites, (3) intergenic sites at least 150 bp away from start/stop sites (which should exclude most UTR regions), (4) 4-fold degenerate sites within 50 bp of start/stop sites, and (5) 4-fold degenerate sites at least 150 bp away from start/stop sites and 30 bp away from splice sites. We expected category 5 sites to be the most neutral. Interestingly, in most species, Tajima's D at splice sites, intergenic sites close to start/stop sites, and 4-fold degenerate sites close to start/stop sites is significantly smaller than that of 4-fold degenerate and intergenic sites far away from any functional elements (see supplementary table S7, Supplementary Material online, for P-values). This suggests that sites in categories 1, 2, and 4 are under relatively stronger purifying selection. Category 2 sites seem to possess significantly more rare alleles than category 3 sites, consistently in all species, suggesting that intergenic regions very close to start/stop sites are under stronger purifying selection than intergenic sites further away from gene ends. We also find Tajima's D at intergenic sites (category 3) to be significantly higher than that for 4-fold degenerate sites (category 5) in most species. This could suggest either that intergenic sites are under some form of positive/balancing selection or that silent sites are not completely neutral and in fact experience some purifying selection.

Gene-Based Analysis of Patterns of Synonymous and Non-Synonymous Diversity

Selective constraint can vary among genes within the same species, and it could also be species-specific. To examine such variation in our dataset, we evaluated the distribution of πn/πs across all genes (fig. 6a). If we assume that synonymous polymorphisms are fitness neutral, then πn/πs is a measure of the relative functional constraints on non-synonymous polymorphisms and thus an indication of functional constraints on the protein. The majority of genes have πn/πs < 1.0, consistent with constraints on non-synonymous changes. About 98% of genes in P. tetraurelia, 96% in P. biaurelia, 96% in P. sexaurelia, 99% in both P. caudatum and its subset, and 99% in both P. multimicronucleatum and its subset have πn/πs < 1. Upon comparing the average πn/πs for the same set of orthologous genes across all species, we find the mean to be 0.10 for P. tetraurelia, 0.17 for P. biaurelia, 0.24 for P. sexaurelia, 0.12 [0.12] for P. caudatum [subset], and 0.17 [0.14] for P. multimicronucleatum [subset]. Among these species, P. sexaurelia has the highest average πn/πs, which suggests either relaxed selection because of a relatively small effective population size or more local adaptation. Local adaptation might result in fixation of more non-synonymous than synonymous differences between subpopulations, thus increasing πn/πs. Differences in πn/πs between species could also reflect genome-wide variation in the fraction of gene duplicates that are subfunctionalized, neofunctionalized, or nonfunctionalized. Alternatively, an elevation in average levels of πn/πs could be a consequence of a depression in the average levels of πs, which could be caused by stronger selection on synonymous sites or a recent bottleneck (Brandvain and Wright 2016) experienced by the species. Evidence that genes with higher πn/πs ratios are under more relaxed constraints is provided by a comparison of the average number of frame-shift-causing indels occurring in four sets of genes with (1) πn/πs ≤ 0.25, (2) 0.25 < πn/πs ≤ 0.5, (3) 0.5 < πn/πs ≤ 1.0, and (4) πn/πs > 1.0. As expected, we find an increase in the number of indels per site with increasing πn/πs (fig. 6b).
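The binning of genes by their ratio of non-synonymous to synonymous diversity can be sketched as follows. This is a minimal Python illustration under stated assumptions: the tuple layout (pn, ps, number of frame-shift indels, coding length), the function names, and the choice to pool indels and sites within each bin are hypothetical and are not taken from the original analysis scripts.

```python
def constraint_bin(pn, ps):
    """Assign a gene to one of the four ratio bins described in the text."""
    if ps <= 0:
        raise ValueError("ratio is undefined without synonymous diversity")
    ratio = pn / ps
    if ratio <= 0.25:
        return 1
    if ratio <= 0.5:
        return 2
    if ratio <= 1.0:
        return 3
    return 4


def indels_per_site_by_bin(genes):
    """genes: iterable of (pn, ps, n_frameshift_indels, cds_length).

    Returns {bin: pooled indels per site}; pooling is an assumption here.
    """
    indels = {}
    sites = {}
    for pn, ps, n_indels, length in genes:
        b = constraint_bin(pn, ps)
        indels[b] = indels.get(b, 0) + n_indels
        sites[b] = sites.get(b, 0) + length
    return {b: indels[b] / sites[b] for b in sorted(indels)}


# Hypothetical toy genes, for illustration only.
toy_genes = [
    (0.001, 0.01, 0, 1000),
    (0.001, 0.01, 1, 1000),
    (0.020, 0.01, 3, 1000),
]
toy_rates = indels_per_site_by_bin(toy_genes)
```

With real data, an increase in the per-site indel rate from bin 1 to bin 4 would correspond to the trend described above.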

Paralogs and Orthologs Evolve under Similar Selective Constraints within and across Species

We evaluated the correlation between πn/πs across orthologs and within-species paralogs that were created by the most recent WGD (fig. 7). Both paralogs and orthologs show a significant positive correlation within and across all species, suggesting that most WGD duplicates are under similar selective constraints. We also confirmed that these correlations are not being driven simply by selection on synonymous sites, as πs is not necessarily correlated between orthologs and within-species paralogs across all species (supplementary fig. S8, Supplementary Material online). Similar findings in the maize genome also suggested correlations between nucleotide diversity and divergence between WGD paralogs (Pophaly and Tellier 2015). Our results further suggest that most orthologs are under very similar evolutionary constraints across the Paramecium species. This is in agreement with a recent study in Drosophila, where πn and πs were found to be strongly correlated between D. melanogaster and D. simulans (Langley et al. 2012). Such a correlation was also observed between orthologs in S. cerevisiae and S. paradoxus (Elyashiv et al. 2010).

FIG. 6. (A) πn versus πs of each gene. The red line represents πn = πs. (B) The distribution of πn/πs of each gene across the genome. Only genes with more than five synonymous polymorphisms were used in this analysis. (C) Number of frame-shift-causing indels per site in genes belonging to three gene categories: (1) πn/πs ≤ 0.5 (light grey), (2) 0.5 < πn/πs ≤ 1.0 (grey), and (3) πn/πs > 1.0 (black). Only genes with more than five synonymous polymorphisms were used in this analysis.

A possible mechanism governing the degree of selective constraint acting on proteins is their level of expression (Zhang and Yang 2015). Non-synonymous mutations are thought to have more severe fitness consequences in highly expressed genes than in weakly expressed ones (Drummond et al. 2005; Gout et al. 2010). Indeed, we find a strong negative correlation between expression and πn/πs in all species (fig. 8). However, only 6–13% of the variance is explained by expression, implying that additional factors may play a greater role in determining the degree of selective constraint acting on a gene. Simple biochemical properties of proteins, such as the protein's function, tertiary structure, and number of interacting partners, might determine the proportion of functionally important residues in the protein and in turn govern the degree of selection.
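The "variance explained" quoted here is the squared correlation coefficient. The sketch below shows one way to compute it in Python; the log10 transformation of both variables and the toy values are assumptions made for illustration, since the exact transformation used for the expression analysis is not stated in this section.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)


def variance_explained(expression, pn_ps):
    """Return (r, r squared) for log10 expression versus log10 ratio."""
    log_expr = [math.log10(e) for e in expression]
    log_ratio = [math.log10(v) for v in pn_ps]
    r = pearson_r(log_expr, log_ratio)
    return r, r * r


# Hypothetical values: the ratio halves for every 10-fold rise in expression.
toy_r, toy_r2 = variance_explained([1, 10, 100, 1000], [0.8, 0.4, 0.2, 0.1])
```

The toy data are perfectly log-linear, so the squared correlation is 1; values of 0.06 to 0.13, as reported above, indicate that most of the variation in constraint is left unexplained by expression.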

Our observation that orthologous genes across such deeply diverged species experience similar degrees of selective constraint (fig. 7) might imply a low rate of turnover of expression, protein structure, function, and number of interacting partners. This finding may also be due in part to similar ecologies, and therefore similar selective pressures, experienced by the Paramecium species examined here. It may thus be informative to compare closely related species that live in, or have recently experienced, different environments. For instance, Elyashiv et al. (2010) found significant differences in levels of gene-specific purifying selection (πn/πs) in the sake population of S. cerevisiae in comparison to other populations of S. cerevisiae as well as S. paradoxus. In fact, this deviation in selective constraints experienced by the sake population is complemented by differences in response to different stresses (Kvitek et al. 2008), suggesting that changes in environment might be reflected in gene-specific constraints.

FIG. 7. Correlation of the logarithm of πn/πs for paralogs within species (three diagonal plots for P. tetraurelia, P. biaurelia, and P. sexaurelia) and for orthologs between species pairwise [non-diagonal plots, with the plots below the diagonal (outlined in red) representing the first set of WGD paralogs (T1, B1, S1) and the upper half of the matrix (outlined in blue) representing the second set of paralogs (T2, B2, S2)]. The number in the top left corner of each plot is the slope of the linear regression in cases where the regression is significant (P < 0.01). Only genes with more than five synonymous polymorphisms were used in this analysis. The schematic tree in the bottom right corner shows the phylogenetic relationship between the species P. tetraurelia (T), P. biaurelia (B), P. sexaurelia (S), P. caudatum (C), and P. multimicronucleatum (M).

FIG. 8. Negative correlation of πn/πs with expression levels of genes in four Paramecium species. All correlations are significant, with P < 2.2 × 10^-16. Only genes with more than five synonymous polymorphisms were used in this analysis.

Distribution of Tajima's D across the Genome

Contrary to genes under purifying selection, genes under balancing or positive selection might often be species-specific. We tested this hypothesis using Tajima's D, a statistic that uses the difference between nucleotide diversity and the number of segregating sites to detect deviations from neutrality. Because Tajima's D is sensitive to population size changes and population structure (Simonsen et al. 1995), we used an outlier approach. We calculated Tajima's D for non-overlapping 1-kb windows across the genome (supplementary fig. S9, Supplementary Material online) and looked for the presence of candidate genes in the outlier regions (0.25% of the two tails). The outlier regions with negative values of Tajima's D might represent genes under either weak purifying or positive selection (i.e., experiencing a recent sweep), whereas the ones with positive Tajima's D might be under balancing selection. Interestingly, both sets of outliers show an over-representation of genes involved in glycosylation, glycoprotein production and metabolism, and carbohydrate metabolism (supplementary table S8, Supplementary Material online) for all Paramecium species. For each species, the large functional overlap of genes in both tails suggests an important role of these processes in Paramecium growth. More importantly, this further suggests that similar selective pressures act genome-wide on all Paramecium species examined in this study. These conclusions can, however, be biased by gene length, as shorter genes might not have sufficient power for an accurate calculation of Tajima's D. Indeed, the outliers are all significantly longer than the rest of the genes in P. tetraurelia (outliers: 1567 bp; non-outliers: 1321 bp; P = 4.7 × 10^-13), P. biaurelia (outliers: 1681 bp; non-outliers: 1372 bp; P < 2.2 × 10^-16), and P. sexaurelia (outliers: 1860 bp; non-outliers: 1379 bp; P < 2.2 × 10^-16). This bias is, however, much lower or not significant for P. caudatum (outliers: 1463 bp; non-outliers: 1383 bp; P = 7.3 × 10^-3) and P. multimicronucleatum (outliers: 1509 bp; non-outliers: 1445 bp; P = 0.19), possibly because of higher genome-wide nucleotide diversity, indicating that despite some bias, the results have true biological significance.
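The outlier approach described above amounts to ranking per-window values and taking a fixed fraction from each tail. A minimal Python sketch follows; the function name, the default tail fraction, and the rule of keeping at least one window per tail are illustrative assumptions rather than a description of the scripts used in this study.

```python
def tail_outliers(window_values, tail_fraction=0.0025):
    """Return (lower_tail, upper_tail) as sorted lists of window indices.

    window_values: one statistic (e.g., Tajima's D) per non-overlapping window.
    tail_fraction: fraction of windows taken from EACH tail (assumed value).
    """
    n = len(window_values)
    if n == 0:
        return [], []
    k = max(1, int(n * tail_fraction))  # keep at least one window per tail
    order = sorted(range(n), key=lambda i: window_values[i])
    return sorted(order[:k]), sorted(order[-k:])


# Toy input: 1,000 windows whose statistic simply increases with position.
toy_values = [float(i) for i in range(1000)]
toy_low, toy_high = tail_outliers(toy_values)
```

Genes overlapping the windows returned in each tail would then be passed to a functional enrichment analysis. Because the thresholds are empirical quantiles, the approach identifies the most extreme windows without assuming a neutral demographic model.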

Tajima's D, as a summary statistic of the site-frequency spectrum, can also indicate overall patterns of selection or recent changes in demography. Overall, in most species, the genome-wide distribution of Tajima's D is shifted towards positive values, with means of 0.862 for P. tetraurelia, 0.504 for P. biaurelia, 0.274 for P. sexaurelia, -0.148 [0.459] for P. caudatum [subset], and 1.325 [0.855] for P. multimicronucleatum [subset]. This could suggest pervasive local adaptation (or balancing selection), a recent contraction in population size, or a fairly ancient population subdivision in all Paramecium species. A previous study that sampled variation at 10 nuclear loci found that strains from different continents shared the same alleles and interpreted that as evidence of regular global dispersal of Paramecium from relatively few permanent locations (Catania et al. 2009). This could also be a potential explanation for the observed increase in nucleotide diversity in comparison to the number of segregating sites.

Fraction of Heterozygous Sites across the Genome

On average, about 0.1% of sites in an individual were heterozygous within the P. aurelia species, whereas 0.7% of sites in individuals belonging to P. caudatum and 1.1% of sites in isolates of P. multimicronucleatum were found to be heterozygous (supplementary table S1, Supplementary Material online). This is consistent with the observation that the P. aurelia species undergo autogamy, whereas the outgroup species do not. On evaluating the fraction of heterozygous sites in windows of 10 kb across the genome, we find large blocks of homozygosity in the P. aurelia genomes (supplementary fig. S10a–e, Supplementary Material online), consistent with loss of heterozygosity in the lab, possibly through multiple episodes of autogamy. It should, however, be noted that peaks in heterozygous regions in the P. aurelia genomes are highly correlated between isolates, unlike in P. caudatum and P. multimicronucleatum, and might potentially reflect mapping or assembly errors. These heterozygous regions (summarized in supplementary Files S1–S3, Supplementary Material online) comprise a very small part (0.82–0.97%) of the genome and, most importantly, their exclusion does not affect any of our analyses qualitatively.

Variation across the Mitochondrial Genome

Few ciliate mitochondrial genomes have been sequenced until now, most belonging to the genus Tetrahymena (Brunk et al. 2003; Burger et al. 2000; Moradian et al. 2007). Others include Paramecium tetraurelia (Pritchard et al. 1990), Paramecium caudatum (Barth and Berendonk 2011), and Oxytricha trifallax (Swart et al. 2012). These studies have illustrated that ciliate mitochondrial genomes are linear and range between 40 and 47 kb, with 50 intronless genes. However, the levels of sequence diversity across the entire mitochondrial genome have so far remained uncharacterized.

To obtain mitochondrial genomes for all species included in this study, we carried out de novo genome assembly using the whole-genome resequencing data for each strain and extracted mitochondrial contigs using the published Paramecium mitochondrial genomes as reference (see Methods for more details). Using this approach, we derived complete genome assemblies for multiple strains, including from species without a previously published mitochondrial genome (P. biaurelia, P. sexaurelia, and P. multimicronucleatum). A comprehensive analysis of these sequences will be published elsewhere (Johri et al., in preparation). The mitochondrial genome assembly from each isolate was annotated separately, and the eight genes that were most ubiquitously present in all individuals were concatenated and used to construct a phylogeny (see Methods). The mitochondrial phylogeny closely tracked the one based on nuclear sequences (supplementary fig. S11, Supplementary Material online).

One strain from each species was chosen as a reference genome (the genome with the longest assembly), and reads from other individuals were mapped to the reference to call SNPs. During the original DNA preparation, efforts were made to remove non-MAC DNA, resulting in varying amounts of mitochondrial DNA present in the extracts from each isolate. Hence, some individuals mapped with extremely low coverage (4 to zero read depth) to the reference genome, especially after applying stringent filters to call SNPs (supplementary table S9, Supplementary Material online). Thus, some individuals were excluded from this analysis, including all P. biaurelia isolates. The subsequent analysis was therefore restricted to four species.

All of the mitochondrial genomes are about 40 kb in length. We identified 626 SNPs in P. tetraurelia, 5307 in P. sexaurelia, 4771 [3041] in P. caudatum [subset], and 4600 [1478] in P. multimicronucleatum [subset]. Within all species, very few indels were found, with just 1 in P. tetraurelia, 9 in P. sexaurelia, and 15 [13] and 23 [9] in P. caudatum [subset] and P. multimicronucleatum [subset], respectively. The nucleotide diversity at 4-fold degenerate sites was found to be 0.009 in P. tetraurelia, 0.169 in P. sexaurelia, 0.127 [0.093] in P. caudatum [subset], and 0.172 [0.073] in P. multimicronucleatum [subset]. Nucleotide diversity across different sites shows that rRNAs are the most conserved sequences in the ciliate mitochondrial genomes. Transfer RNAs are the next most highly conserved, with their diversity levels only slightly above those of replacement sites (table 2). Our estimates of diversity in the Paramecium mitochondria are higher than those of a number of other unicellular eukaryotes studied so far: P. falciparum (πsyn = 0.0012) (Joy et al. 2003) and Chlamydomonas reinhardtii (πsilent = 0.0085) (Smith and Lee 2008).

We also recalculated nuclear nucleotide diversity (table 2) at 4-fold degenerate sites for the same set of individuals used in the mitochondrial analysis and observed that the mitochondrial to nuclear diversity ratio is close to 1.0 in all species (table 2) except P. sexaurelia, where the mitochondrial diversity is 5.7-fold higher than that in the nucleus. On excluding P. sexaurelia, we obtain a strong correlation between 4-fold degenerate nucleotide diversity in the mitochondria and nucleus (slope = 0.95, R² = 0.96). It should be noted that there are not enough data points to make any inferences about statistical significance. However, such a positive correlation between the nuclear and mitochondrial diversity suggests that mitochondrial diversity might scale with population size in ciliates, which bears on previous studies in animals that have yielded conflicting findings about whether mitochondrial diversity correlates with population size (Bazin et al. 2006; Nabholz et al. 2008; Nabholz et al. 2009; Piganeau and Eyre-Walker 2009). The observed correlation also suggests that mitochondrial and nuclear spontaneous mutation rates may be correlated. Direct estimates of mutation rates in C. elegans, D. melanogaster, S. cerevisiae, and H. sapiens obtained via mutation accumulation experiments suggest a moderate correlation (slope = 0.05, R² = 0.48) between mitochondrial and nuclear base substitution rates (using data from Lynch et al. 2008). Discordance between the nuclear and mitochondrial diversity, for instance in P. sexaurelia, can be attributed to various factors such as positive selection (Hudson and Turelli 2003; Galtier et al. 2009; James et al. 2016), demographic events (Pool and Nielsen 2007), or simply differences in nuclear and/or mitochondrial mutation rates (Lynch 2007). Future studies are needed to dissect the factors responsible for the remarkably different ratio of mitochondrial to nuclear diversity in P. sexaurelia.

Table 2. Nucleotide Diversity (per site) of Different Sites across the Mitochondrial Genomes.

Species                           4-fold (πm) [SE]   2-fold [SE]       0-fold [SE]        tRNA [SE]         rRNA [SE]         Intergenic [SE]   4-fold nuclear (πn) [SE]   πm/πn
P. tetraurelia                    0.0091 [0.0010]    0.0107 [0.0009]   0.0019 [0.0002]    0.0000 [0.0000]   0.0006 [0.0002]   0.0041 [0.0003]   0.0058 [2.26 × 10^-5]      1.57
P. sexaurelia                     0.1692 [0.0033]    0.1427 [0.0024]   0.0097 [0.00043]   0.0000 [0.0000]   0.0059 [0.0007]   0.0417 [0.0027]   0.02945 [4.73 × 10^-5]     5.75
P. caudatum                       0.1272 [0.0037]    0.0942 [0.0019]   0.0179 [0.0006]    0.0455 [0.0023]   0.0136 [0.0012]   0.0333 [0.0027]   0.1267 [1.08 × 10^-4]      1.00
P. caudatum (subset)              0.0929 [0.0039]    0.0859 [0.0021]   0.0149 [0.0006]    0.0375 [0.0023]   0.0093 [0.0011]   0.0292 [0.0028]   0.0658 [9.63 × 10^-5]      1.41
P. multimicronucleatum            0.1727 [0.0047]    0.0831 [0.0019]   0.0227 [0.0008]    0.0263 [0.0011]   0.0269 [0.0016]   0.0329 [0.0022]   0.1599 [1.34 × 10^-4]      1.08
P. multimicronucleatum (subset)   0.0730 [0.0034]    0.0383 [0.0015]   0.0121 [0.0006]    0.0126 [0.0009]   0.0135 [0.0012]   0.0312 [0.0022]   0.0742 [9.56 × 10^-5]      0.98
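The mitochondrial to nuclear diversity ratios discussed above follow directly from the 4-fold degenerate diversity estimates in table 2. The short Python sketch below recomputes them as a consistency check; the dictionary layout and variable names are illustrative assumptions, and the numerical inputs are the table 2 point estimates.

```python
# 4-fold degenerate diversity from table 2: (mitochondrial, nuclear).
TABLE2_FOURFOLD = {
    "P. tetraurelia": (0.0091, 0.0058),
    "P. sexaurelia": (0.1692, 0.02945),
    "P. caudatum": (0.1272, 0.1267),
    "P. caudatum (subset)": (0.0929, 0.0658),
    "P. multimicronucleatum": (0.1727, 0.1599),
    "P. multimicronucleatum (subset)": (0.0730, 0.0742),
}


def mito_nuclear_ratio(pi_mito, pi_nuclear):
    """Ratio of mitochondrial to nuclear diversity at 4-fold degenerate sites."""
    if pi_nuclear <= 0:
        raise ValueError("nuclear diversity must be positive")
    return pi_mito / pi_nuclear


ratios = {
    species: round(mito_nuclear_ratio(pi_m, pi_n), 2)
    for species, (pi_m, pi_n) in TABLE2_FOURFOLD.items()
}

# Species whose ratio departs strongly from 1 (the 2-fold cutoff is arbitrary).
discordant = [sp for sp, ratio in ratios.items() if ratio > 2.0]
```

Only P. sexaurelia exceeds the arbitrary 2-fold cutoff used in this sketch, in line with the discordance noted in the text.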

Conclusions

This is the first study in which multiple individuals of several species of free-living unicellular eukaryotes have been sequenced and used to reveal genome-wide patterns of variation. This study thus provides an invaluable resource for future research in Paramecium, a major model organism in cell biology. Sequencing of multiple individuals from the same lake or pond is needed to further resolve the global versus local population structure of these cosmopolitan unicellular species. Overall, our results suggest that most of the intergenic regions across the highly compact genomes of Paramecium species are strongly constrained and probably harbor functionally important regulatory modules, which might extend into coding regions. We show that patterns of nucleotide diversity across the genome are driven primarily by a combination of purifying selection and mutation rate variation. In the future, such analyses will benefit greatly from incorporating variation in recombination rates across the genomic landscape. We also find that duplicates created by the WGD appear to be evolving under similar constraints. Synonymous to non-synonymous diversity ratios need to be supplemented with divergence levels to distinguish between the various types of selective forces acting on the duplicates. Thus, sequencing of much closer outgroup species would allow us to further understand the role of selection in the evolution of different genomic elements, including gene duplicates that resulted from ancient WGDs.

Methods

Sampling, DNA Extraction, and Sequencing

We examined a total of 56 Paramecium strains originally isolated from environmental sources, including 11 isolates of P. biaurelia, 13 isolates of P. sexaurelia, 12 isolates of P. tetraurelia, 10 isolates of P. caudatum, and 10 isolates of P. multimicronucleatum. A total of 14 strains from P. biaurelia and P. sexaurelia were kindly provided by E. Przybos, whereas the remaining strains were cultured in the Paramecium collections of the Institute of Hydrobiology, TU Dresden, Germany, or the Department of Biology, Indiana University, Bloomington, USA. Prior to our analyses, P. tetraurelia strain Moz 13CXI as well as P. sexaurelia strains Moz 13BIII and Indo 17I were kindly identified by E. Przybos by performing inter- and intrastrain crosses (Przybos 1975). Paramecia were maintained in 0.25% Cerophyl medium (modified according to Sonneborn 1970; Krenek et al. 2011) inoculated with Enterobacter aerogenes. Since P. aurelia species can undergo autogamy (self-fertilization) induced by starvation, and all strains have been cultured under lab conditions for several years, all P. aurelia strains in this study are expected to be more homozygous than wild isolates.

Prior to DNA extraction, all cultures were initiated from a single cell, ensuring clonality. A modified sucrose density separation (Aury et al. 2006) was performed to extract primarily macronuclei (MACs) and to reduce the amount of non-MAC DNA potentially compromising subsequent analyses. Briefly, Paramecium cells were starved for at least two days and filtered through several layers of sterile cheesecloth prior to cell harvesting via gentle centrifugation. Pelleted cells were gently lysed in 0.25 M sucrose medium (pH 6.8) containing 10 mM Tris-HCl, 8 mM MgCl2, 3 mM CaCl2, 0.5–1% Nonidet-P40, and 0.1–0.2% sodium deoxycholate. Macronuclei were subsequently isolated away from cellular debris, micronuclei, and mitochondria by centrifugation at 100 rcf in 0.25 M sucrose wash buffer (see above, without Nonidet-P40 and Na-deoxycholate). Genomic DNA extraction from MAC isolations was performed by a modified CTAB procedure (Gustincich et al. 1991).

All genomes were sequenced using the Illumina HiSeq 2000, and paired-end reads with insert sizes between 200 and 400 bp and a mean length of 151 bp were obtained. The reads mapped with coverage ranging between 40× and 80× (details given in supplementary table S1, Supplementary Material online).

Paramecium Reference Genomes

Complete genomes of P. tetraurelia, P. biaurelia, P. sexaurelia, and P. caudatum were downloaded from ParameciumDB (Arnaiz et al. 2007; Arnaiz and Sperling 2011), along with the annotations. Ortho-paralog relationships as well as functional annotations were retrieved from the supplementary information provided in McGrath et al. (2014a, 2014b). RNAseq data for all four species were downloaded and processed as in Gout and Lynch (2015).

The genome of P. multimicronucleatum was assembled using Celera (Miller et al. 2008) and annotated using EuGene (Foissac et al. 2008), as previously described (McGrath et al. 2014a, 2014b). A detailed description of the sequencing, assembly, and annotation of P. multimicronucleatum can be found in Marinov et al. (in preparation). OMA (Orthology MAtrix) (Altenhoff et al. 2015) was used to find orthologs between P. caudatum and P. multimicronucleatum. A total of 11,002 genes of P. caudatum were found to have 1-1 orthologs in P. multimicronucleatum, and only 1-1 orthologs were used for the current analysis.

Mapping of Reads and Detection of SNPs and Indels

Sequenced paired-end reads were mapped to the reference genomes for each species using BWA (bwa mem) (Li and Durbin 2010) with default parameters, and the mpileup function in Samtools (Li et al. 2009) was used to generate mpileup files for each individual. Two distinct pipelines were employed to identify SNPs to ensure detection accuracy: Bcftools-Vcftools (Li 2011) and the maximum likelihood Genotype Frequency Estimator (referred to as GFE) (Lynch 2009; Maruki and Lynch 2015). Both pipelines produced highly overlapping sets of SNPs (supplementary table S2, Supplementary Material online). For both pipelines, sites were only considered for SNP detection and further analysis if the mapping quality was above 30, base quality was above 20, per-base alignment quality was above 15, and the sum of the depth of coverage for all individuals was about 5 times the number of individuals and less than twice the average population coverage. SNPs were called by bcftools if the SNP quality score was above 30 and by GFE if the significance score > 5.91 (P > 0.05). The exact parameters of the Bcftools-Vcftools pipeline to call SNPs and indels are given below.

samtools mpileup -q 30 -Q 20 -t DP -uvf reference_genome.fasta sorted.bam > species.vcf

bcftools call -f GQ -o species_snps.vcf -O v -V indels -cv species.vcf

bcftools call -f GQ -o species_indels.vcf -O v -V snps -cv species.vcf

#filtering snp and indel files for depth and biases
perl vcfutils.pl varFilter -d 50 -D 800 -1 -2 -3 -4 species_snps.vcf > species_snps.filtered1.vcf

perl vcfutils.pl varFilter -d 50 -D 800 -1 -2 -3 -4 species_indels.vcf > species_indels.filtered1.vcf

vcftools --vcf species_snps.filtered1.vcf --minQ 20 --minGQ 30 --minDP 4 --recode --recode-INFO-all --out species_snps.filtered2

vcftools --vcf species_indels.filtered1.vcf --minQ 20 --minGQ 30 --minDP 4 --recode --recode-INFO-all --out species_indels.filtered2

#filter out the recoded sites that are not polymorphic anymore, that is, AF1 = 1
bcftools view -q 0.001 species_snps.filtered2.recode.vcf -o species_snps.final.vcf

bcftools view -q 0.001 species_indels.filtered2.recode.vcf -o species_indels.final.vcf
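The site-level criteria described in the text (minimum qualities and a window of acceptable summed depth) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the sample size of 10 and mean population coverage of 400 are assumed values chosen only because they reproduce the `-d 50 -D 800` bounds shown in the commands.

```python
def depth_bounds(n_individuals, mean_population_coverage):
    """Return (min_depth, max_depth) for the depth summed over all individuals."""
    min_depth = 5 * n_individuals               # above 5x the number of individuals
    max_depth = 2 * mean_population_coverage    # less than twice the average coverage
    return min_depth, max_depth


def passes_site_filters(map_q, base_q, align_q, total_depth, bounds):
    """Apply the quality and depth thresholds stated in the text to one site."""
    low, high = bounds
    return (map_q > 30 and base_q > 20 and align_q > 15
            and low < total_depth < high)


# Assumed illustrative values: 10 isolates, mean summed coverage of 400.
bounds = depth_bounds(n_individuals=10, mean_population_coverage=400)
print(bounds)  # (50, 800)
print(passes_site_filters(40, 30, 20, 120, bounds))
```

Deriving the bounds from the sample size keeps the thresholds comparable across species sequenced with different numbers of isolates.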

Mitochondrial SNPs were called using the Bcftools-Vcftools pipeline with the same parameters as noted above.

Simulation of SNP Detection with Varying Levels of Divergence from the Reference
We simulated reads using wgsim (https://github.com/lh3/wgsim) for a diploid genome with varying levels of divergence from the reference genome of P. caudatum and P. multimicronucleatum, using average values of read length, total number of reads, and insert sizes.

#For the P. caudatum reference genome
wgsim -d 287 -s 113 -N 10201087 -1 151 -2 151 -r 0.10 -R 0.013 caudatum_assembly.fasta caud_0.10_1.fastq caud_0.10_2.fastq >& caud_0.10.info

#For the P. multimicronucleatum reference genome
wgsim -d 261 -s 102 -N 12154044 -1 151 -2 151 -r 0.10 -R 0.011 multimicronucleatum_assembly.fasta multi_0.10_1.fastq multi_0.10_2.fastq >& multi_0.10.info

Reads were simulated for r = 0.0, 0.05, 0.10, 0.15, 0.20, 0.25, and 0.30, where r represents the fraction of the genome with SNPs. The SNP detection pipeline described above was used to call SNPs, which were compared to the simulated set of SNPs (supplementary table S4, Supplementary Material online).
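One way to compare called SNPs against the simulated set is to treat both as sets of genomic positions and report the fraction of simulated SNPs recovered and the fraction of calls that were never simulated. The sketch below assumes this set-based comparison; the function name, the (scaffold, position) keys, and the toy coordinates are hypothetical and do not come from the supplementary table.

```python
def compare_calls(simulated, called):
    """Return (sensitivity, false_discovery_rate) for two collections of SNP sites."""
    simulated, called = set(simulated), set(called)
    true_positives = simulated & called
    sensitivity = len(true_positives) / len(simulated) if simulated else float("nan")
    false_discovery = len(called - simulated) / len(called) if called else float("nan")
    return sensitivity, false_discovery


# Toy example with made-up coordinates keyed by (scaffold, position).
simulated_sites = {("scaffold_1", 101), ("scaffold_1", 250),
                   ("scaffold_2", 40), ("scaffold_2", 77)}
called_sites = {("scaffold_1", 101), ("scaffold_2", 40), ("scaffold_2", 90)}

sensitivity, fdr = compare_calls(simulated_sites, called_sites)
print(f"sensitivity={sensitivity:.2f} false discovery rate={fdr:.2f}")
```

Repeating the comparison for each simulated value of r shows how detection degrades as divergence from the reference increases.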

Building Phylogenies
A consensus genome (bcftools view species_filtered.recode.vcf | vcfutils.pl vcf2fq > species.cns.fq) was obtained for each individual using bcftools, with heterozygous sites represented by DNA ambiguity codes. Gene nucleotide sequences that are present in single copy in the P. aurelia species and are also present in both outgroup species (P. caudatum and P. multimicronucleatum) were extracted, and multiple sequence alignments were generated using MUSCLE (Edgar 2004a, 2004b). This amounted to about 55 sequences (including the reference genomes) in each of 883 sets of orthologous genes. RAxML 8.0.26 (Stamatakis 2014) was used to build a tree using the concatenation of all 883 aligned gene sequences (raxml -s concatenated_sequences.fasta -n gene -m GTRGAMMA -T 4 -p 7). In addition, individual gene trees (raxml -s gene.fasta -n gene -m GTRGAMMA -T 4 -p 7) were built, which were then used to create a consensus tree (raxml -J MR -z trees.newick -T 4 -m GTRGAMMA) (Aberer et al. 2010).
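The use of DNA ambiguity codes for heterozygous sites can be illustrated with the standard IUPAC two-base codes. This is only a sketch of the encoding itself (the consensus in the text was produced by vcfutils.pl); the function name is hypothetical.

```python
# Standard IUPAC codes for single bases and for two-base ambiguities.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
}


def consensus_base(allele1, allele2):
    """Return the IUPAC symbol for a diploid genotype given as two bases."""
    return IUPAC[frozenset((allele1.upper(), allele2.upper()))]


print(consensus_base("A", "G"))  # R
print(consensus_base("C", "C"))  # C
```

Keeping heterozygous sites as ambiguity codes lets the alignment and tree-building programs treat them as partially informative rather than forcing an arbitrary allele.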

Mitochondrial phylogeny was built using the same parameters as above by concatenating eight sets of orthologous genes (NADH_dehydrogenase_subunit_1_a, NADH_dehydrogenase_subunit_5, Ymf56/orf78, Ymf67/orf265_1, Ymf68/orf393, apocytochrome_b, cytochrome_c_oxidase_subunit_1, cytochrome_c_oxidase_subunit_2, ribosomal_protein_S14) that were ubiquitously present across all sampled species.

Principal Component Analysis
Genotypes for all individuals were obtained from the vcf files generated by vcftools for all scaffolds. Only the sites that had a genotype available for all individuals were used in the analysis, i.e., if even a single individual had missing data, the site was excluded from analysis. A total of ~10,000 markers were used for each species. The first marker of each scaffold was picked randomly within the first 50 SNP positions, and the rest were selected so that all markers were uniformly distributed across all scaffolds. PCA was carried out on these markers using the R (R-Core-Team 2014) command prcomp (https://stat.ethz.ch/R-manual/R-devel/library/stats/html/prcomp.html) and visualized using the first two components with the R graphing package ggplot2 (Wickham 2010).
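The marker-thinning step can be sketched as picking a random starting SNP among the first 50 on a scaffold and then taking every k-th SNP after it. The text does not specify how uniform spacing was achieved, so the fixed step by SNP index used here is an assumption, as are the function name and the toy positions.

```python
import random


def pick_markers(snp_positions, n_markers, rng=random):
    """Pick evenly spaced markers from an ordered list of SNP positions.

    The first marker is drawn at random from the first 50 SNPs; the step
    between markers (in SNP index) is an assumed way to obtain uniformity.
    """
    if n_markers <= 0 or not snp_positions:
        return []
    step = max(1, len(snp_positions) // n_markers)
    start = rng.randrange(min(50, len(snp_positions)))
    return snp_positions[start::step][:n_markers]


# Toy scaffold with one SNP every 10 bp (illustrative values only).
positions = list(range(0, 10000, 10))
markers = pick_markers(positions, 100, random.Random(1))
print(len(markers), markers[:3])
```

Spacing markers out in this way reduces the influence of tightly linked SNPs on the principal components.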

Calculation of Population Genetic Statistics
Nucleotide Diversity and Tajima's D
Nucleotide diversity was calculated as $2pqn/(n-1)$, where p and q are the major and minor allele frequencies and n is the number of chromosomes. The major and minor allele frequencies were given for each site by GFE. Diversity at coding sites was partitioned into synonymous and non-synonymous components (e.g., 2-fold degenerate sites are one-third synonymous and two-thirds non-synonymous). Major allele frequencies were also extracted from vcftools and returned very similar results to those from GFE (supplementary table S3, Supplementary Material online).
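The per-site diversity estimator and the weighting of sites by degeneracy can be written out directly. In this sketch only the 2-fold weighting comes from the text; the weights for 0-fold and 4-fold sites follow the usual convention (no change, or every change, is synonymous), and the function names are hypothetical.

```python
def site_diversity(major_freq, n_chromosomes):
    """Unbiased per-site diversity: 2pq * n / (n - 1)."""
    p = major_freq
    q = 1.0 - p
    return 2.0 * p * q * n_chromosomes / (n_chromosomes - 1)


# Fraction of a site counted as synonymous, by degeneracy class.
# The 2-fold value is stated in the text; 0-fold and 4-fold are conventional.
SYNONYMOUS_WEIGHT = {0: 0.0, 2: 1.0 / 3.0, 4: 1.0}


def site_weights(degeneracy):
    """Return (synonymous, non-synonymous) weights for one coding site."""
    syn = SYNONYMOUS_WEIGHT[degeneracy]
    return syn, 1.0 - syn


print(site_diversity(0.5, 20))
print(site_weights(2))
```

Summing per-site diversity and dividing by the summed synonymous (or non-synonymous) weights gives the diversity for each class of sites.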


A sliding window approach was used to calculate Tajima's D across the genome. Vcftools was used to obtain these values for non-overlapping windows of 1 kb in length. The program ms (Hudson 2002) was used to perform 100,000 simulations of 10 individuals with 1 kb locus length, and niceStats (https://github.com/kern-lab/shIC/blob/master/niceStats.c) was used to calculate Tajima's D from the ms output file. To obtain Tajima's D for different categories of sites, SNPs belonging to a particular category of sites (specified in Results) were selected, and vcftools was used to calculate Tajima's D for non-overlapping windows of 10 kb. Windows with 2 or more SNPs were used to calculate the mean, and only alternate windows were used to compare distributions of Tajima's D across different categories by Student's t-test, to maximise independence of sampling.
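For reference, Tajima's D for a single window can be computed from the number of sampled chromosomes, the number of segregating sites, and the mean number of pairwise differences, following Tajima (1989). The sketch below is a plain restatement of that formula, not the vcftools or niceStats implementation; the function name and example values are hypothetical.

```python
import math


def tajimas_d(n, seg_sites, pairwise_diff):
    """Tajima's D for one window.

    n: number of sampled chromosomes; seg_sites: segregating sites (S);
    pairwise_diff: mean number of pairwise differences in the window.
    """
    if seg_sites == 0:
        return float("nan")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pairwise_diff - seg_sites / a1) / math.sqrt(variance)


windows = [tajimas_d(20, 5, 1.0), tajimas_d(20, 8, 3.0), tajimas_d(20, 4, 1.5)]
alternate = windows[::2]  # every other window, as used for the t-tests
print(alternate)
```

Taking alternate windows, as in the last lines, is the simple thinning described in the text for reducing dependence between neighbouring windows.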

Plotting Average Expected Heterozygosity across Introns and Intergenic Regions
All introns (< 500 bp in length) were split midway and aligned according to the 5′ and 3′ splice sites. All intergenic regions (< 500 bp in the outgroup species and < 1000 bp in the P. aurelia species) were split midway and aligned according to the start and stop codons. Note that these were not sequence alignments. At each site, the mean expected heterozygosity was calculated. Sites with an insufficient number of data points (< 200 data points per site for introns; for intergenic regions, < 100 data points per site for outgroups and < 1000 data points per site in the P. aurelia species) were discarded.
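The positional averaging described above (split each region midway, index one half by distance from the 5′ end and the other by distance from the 3′ end, then average across regions) can be sketched as follows. The function name and input layout are hypothetical, and assigning the middle site of odd-length regions to the 3′ half is an arbitrary choice made for this illustration.

```python
from collections import defaultdict


def mean_het_by_position(regions, min_points):
    """Average heterozygosity by distance from each end of a set of regions.

    regions: list of per-site heterozygosity lists, each ordered 5' to 3'.
    Positions supported by fewer than min_points regions are discarded.
    """
    five_prime, three_prime = defaultdict(list), defaultdict(list)
    for het in regions:
        half = len(het) // 2
        for i, value in enumerate(het[:half]):            # distance from 5' end
            five_prime[i].append(value)
        for i, value in enumerate(reversed(het[half:])):  # distance from 3' end
            three_prime[i].append(value)

    def summarise(bins):
        return {pos: sum(vals) / len(vals)
                for pos, vals in bins.items() if len(vals) >= min_points}

    return summarise(five_prime), summarise(three_prime)


five, three = mean_het_by_position([[0.1, 0.2, 0.3, 0.4], [0.3, 0.0]], min_points=2)
print(five, three)
```

Because regions differ in length, positions far from either end are covered by fewer regions, which is why a minimum number of data points per position is required.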

Calculating and Plotting Divergence across Introns and Intergenic Regions
Orthologous coding sequences along with introns were aligned using MUSCLE (Edgar 2004a). Intergenic regions flanked by orthologous genes on both sides were also aligned, along with the flanking coding exon, using MUSCLE. For each species separately, divergence for different positions (with respect to the start and stop sites) was averaged across all introns and intergenic regions. The divergence values were corrected for multiple substitutions by the Jukes-Cantor sequence substitution model and plotted in the same manner as described above.
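The Jukes-Cantor correction converts an observed proportion of differing sites p into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). A minimal sketch, with a hypothetical function name:

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor distance for an observed proportion of differing sites p."""
    if p >= 0.75:
        return float("inf")  # saturation: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


for observed in (0.0, 0.05, 0.10, 0.30):
    print(observed, round(jukes_cantor(observed), 4))
```

The corrected value is always at least as large as the observed proportion, and the gap widens as divergence increases.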

GO Analysis
All gene ontology enrichment analyses were performed using the BiNGO plugin for Cytoscape 3.2.1 (Shannon et al. 2003; Maere et al. 2005), using the hypergeometric test to assess significance, with Benjamini and Hochberg FDR correction for multiple testing (Benjamini and Hochberg 1995). The latest ontology files (.obo) were downloaded from www.geneontology.org, and custom annotation files were prepared as required using python scripts.
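The two statistical steps named above, a one-sided hypergeometric test for over-representation followed by Benjamini-Hochberg adjustment, can be sketched with the standard library alone. This illustrates the statistics and is not BiNGO's code; the function names and example counts are hypothetical.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): n genes drawn from N, of which K are annotated with the term."""
    total = comb(N, n)
    upper = min(n, K)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


p = hypergeom_pvalue(k=8, n=50, K=100, N=10000)
print(p)
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```

Processing the p-values from largest to smallest enforces the monotonicity of the adjusted values.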

Identification of Heterozygous Regions in the P. aurelia Species
The fraction of heterozygous sites was calculated for 10 bp non-overlapping windows across the genome for all three P. aurelia species, and the top 1% of windows with the highest fraction of heterozygous sites were identified as heterozygous regions.
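The window scan can be sketched as computing the heterozygous fraction in each non-overlapping window and keeping the top-ranked 1%. The input representation (one 0/1 flag per site), the function name, and the handling of ties are assumptions made for this illustration.

```python
def heterozygous_windows(het_flags, window=10, top_fraction=0.01):
    """Return start coordinates of the windows with the highest heterozygous fraction.

    het_flags: sequence of 0/1 values, one per site (1 = heterozygous).
    """
    scored = []
    for start in range(0, len(het_flags) - window + 1, window):
        fraction = sum(het_flags[start:start + window]) / window
        scored.append((fraction, start))
    if not scored:
        return []
    scored.sort(reverse=True)                       # highest fraction first
    n_top = max(1, int(len(scored) * top_fraction))
    return sorted(start for _, start in scored[:n_top])


# Toy scaffold of 1,000 sites with one fully heterozygous 10 bp stretch.
flags = [0] * 1000
flags[500:510] = [1] * 10
print(heterozygous_windows(flags))  # [500]
```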

Mitochondrial Genome Assembly and Annotation
Mitochondrial genomes were assembled as follows. Sequence reads were assembled using SPAdes (version 3.5.0) (Bankevich et al. 2012) after removing potential adapter sequence with Trimmomatic (version 0.33) (Bolger et al. 2014). Mitochondrial contigs were identified from the resulting assemblies by BLAST searches against the published P. caudatum and P. tetraurelia mitochondrial genomes (Pritchard et al. 1990; Barth and Berendonk 2011).

Protein coding genes were identified by generating all ORFs longer than 60 aa in all six frames using the Mold, Protozoan, and Coelenterate Mitochondrial Code (i.e., UGA codes for W instead of being a stop codon) and all alternative start codons specific to Paramecium (AUU, AUA, AUG, AUC, GUG, and GUA), and retaining the longest ORFs associated with each stop codon. BLASTP (Altschul et al. 1997) was then used to identify homologs of annotated mitochondrial proteins in P. tetraurelia and P. caudatum. Additional ORFs were identified by imposing the requirement that their length exceeds 100 aa, and subsequently annotated using BLASTP against the Non-redundant protein sequences (nr) database and HMMER 3.0 (Eddy 2011) scans against the PFAM 27.0 database (Finn et al. 2014).
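The ORF-generation rule (six frames, UGA read as tryptophan, the listed alternative start codons, and the longest ORF kept for each stop codon) can be sketched as below. This is an illustrative reimplementation under those stated rules, not the authors' script: the function name is hypothetical, codons are written in DNA rather than RNA letters, and coordinates on the minus strand refer to the reverse-complemented sequence.

```python
STOPS = {"TAA", "TAG"}  # TGA (UGA) encodes W under this code, so it is not a stop
STARTS = {"ATT", "ATA", "ATG", "ATC", "GTG", "GTA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def find_orfs(seq, min_aa=60):
    """Return (strand, frame, start, end, length_aa) for the longest ORF per stop codon."""
    orfs = []
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", seq.translate(COMPLEMENT)[::-1])):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if codon in STOPS:
                    # The earliest start since the previous stop gives the longest ORF.
                    if start is not None and (i - start) // 3 >= min_aa:
                        orfs.append((strand, frame, start, i + 3, (i - start) // 3))
                    start = None
                elif start is None and codon in STARTS:
                    start = i
    return orfs


toy = "ATG" + "GCT" * 5 + "TGA" + "GCT" * 2 + "TAA"
print(find_orfs(toy, min_aa=5))
```

In the toy sequence the internal TGA is read through, so the ORF extends to the TAA codon.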

tRNA genes were annotated with tRNAscan-SE 1.21 (Schattner et al. 2005) using the "Mito/Chloroplast" source.

Supplementary Material
Supplementary data are available at Molecular Biology and Evolution online.

Acknowledgments
We thank Lydia Bright and Jean-Francois Gout for helpful comments on the manuscript, Matthew S. Ackerman for insightful discussions related to the project, and the anonymous reviewers and editor for their valuable suggestions. We would also like to thank Kelley Thomas for providing help with the sequencing and E. Przybos for kindly providing several Paramecium strains and the species identification of some so far unspecified strains. This work was financially supported by the National Science Foundation (MCB-1050161 and DEB-1257806), the European Commission FP7-PEOPLE-2009-IRSES project CINAR PATHOBACTER (project number 247658), and the CNRS research network GDRE "Paramecium Genome Dynamics and Evolution". T.U.B. also acknowledges funding of the "support the best" of the ZUK, TU Dresden, and the Deutsche Forschungsgemeinschaft (DFG).

References
Abbott RJ, Gomes MF. 1989. Population genetic-structure and outcrossing rate of Arabidopsis-Thaliana (L.) Heynh. Heredity 62:411–418.

Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, McVean GA. 2010. A map of human genome variation from population-scale sequencing. Nature 467:1061–1073.


Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA. 2012. An integrated map of genetic variation from 1,092 human genomes. Nature 491:56–65.

Aberer AJ, Pattengale ND, Stamatakis A. 2010. Parallelized phylogenetic post-analysis on multi-core architectures. J Comput Sci-Neth. 1:107–114.

Ackermann AA, Panunzi LG, Cosentino RO, Sanchez DO, Aguero F. 2012. A genomic scale map of genetic diversity in Trypanosoma cruzi. BMC Genomics 13:736.

Altenhoff AM, Skunca N, Glover N, Train CM, Sueki A, Pilizota I, Gori K, Tomiczek B, Muller S, Redestig H, et al. 2015. The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Res. 43:D240–D249.

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389–3402.

Arnaiz O, Cain S, Cohen J, Sperling L. 2007. ParameciumDB: a community resource that integrates the Paramecium tetraurelia genome sequence with genetic data. Nucleic Acids Res. 35:D439–D444.

Arnaiz O, Mathy N, Baudry C, Malinsky S, Aury JM, Wilkes CD, Garnier O, Labadie K, Lauderdale BE, Le Mouel A, et al. 2012. The Paramecium germline genome provides a niche for intragenic parasitic DNA: evolutionary dynamics of internal eliminated sequences. PLoS Genet. 8:e1002984.

Arnaiz O, Sperling L. 2011. ParameciumDB in 2011: new tools and new data for functional and comparative genomics of the model ciliate Paramecium tetraurelia. Nucleic Acids Res. 39:D632–D636.

Assefa S, Lim C, Preston MD, Duffy CW, Nair MB, Adroub SA, Kadir KA, Goldberg JM, Neafsey DE, Divis P, et al. 2015. Population genomic structure and adaptation in the zoonotic malaria parasite Plasmodium knowlesi. Proc Natl Acad Sci U S A. 112:13027–13032.

Aury JM, Jaillon O, Duret L, Noel B, Jubin C, Porcel BM, Segurens B, Daubin V, Anthouard V, Aiach N, et al. 2006. Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature 444:171–178.

Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR. 2015. A global reference for human genetic variation. Nature 526:68–74.

Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, et al. 2012. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 19:455–477.

Barth D, Berendonk TU. 2011. The mitochondrial genome sequence of the ciliate Paramecium caudatum reveals a shift in nucleotide composition and codon usage within the genus Paramecium. BMC Genomics 12:272.

Barth D, Krenek S, Fokin SI, Berendonk TU. 2006. Intraspecific genetic variation in Paramecium revealed by mitochondrial cytochrome C oxidase I sequences. J Eukaryot Microbiol. 53:20–25.

Bazin E, Glemin S, Galtier N. 2006. Population size does not influence mitochondrial genetic diversity in animals. Science 312:570–572.

Beale G, Preer JR. 2008. Paramecium: genetics and epigenetics. Boca Raton (FL): CRC Press.

Begun DJ, Aquadro CF. 1992. Levels of naturally-occurring DNA polymorphism correlate with recombination rates in Drosophila melanogaster. Nature 356:519–520.

Begun DJ, Holloway AK, Stevens K, Hillier LW, Poh YP, Hahn MW, Nista PM, Jones CD, Kern AD, Dewey CN, et al. 2007. Population genomics: whole-genome analysis of polymorphism and divergence in Drosophila simulans. PLoS Biol. 5:e310.

Benjamini Y, Hochberg Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B Met. 57:289–300.

Betermier M. 2004. Large-scale genome remodelling by the developmentally programmed elimination of germ line sequences in the ciliate Paramecium. Res Microbiol. 155:399–408.

Birky CW, Walsh JB. 1988. Effects of linkage on rates of molecular evolution. Proc Natl Acad Sci U S A. 85:6414–6418.

Birnbaum RY, Clowney EJ, Agamy O, Kim MJ, Zhao J, Yamanaka T, Pappalardo Z, Clarke SL, Wenger AM, Nguyen L, et al. 2012. Coding exons function as tissue-specific enhancers of nearby genes. Genome Res. 22:1059–1068.

Boel G, Letso R, Neely H, Price WN, Wong KH, Su M, Luff JD, Valecha M, Everett JK, Acton TB, et al. 2016. Codon influence on protein expression in E. coli correlates with mRNA levels. Nature 529:358–363.

Bolger AM, Lohse M, Usadel B. 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30:2114–2120.

Brandvain Y, Wright SI. 2016. The limits of natural selection in a nonequilibrium world. Trends Genet. 32:201–210.

Brunk CF, Lee LC, Tran AB, Li J. 2003. Complete sequence of the mitochondrial genome of Tetrahymena thermophila and comparative methods for identifying highly divergent genes. Nucleic Acids Res. 31:1673–1682.

Burger G, Zhu Y, Littlejohn TG, Greenwood SJ, Schnare MN, Lang BF, Gray MW. 2000. Complete sequence of the mitochondrial genome of Tetrahymena pyriformis and comparison with Paramecium aurelia mitochondrial DNA. J Mol Biol. 297:365–380.

Burki F. 2014. The eukaryotic tree of life from a global phylogenomic perspective. Cold Spring Harb Perspect Biol. 6:a016147.

Cao J, Schneeberger K, Ossowski S, Gunther T, Bender S, Fitz J, Koenig D, Lanz C, Stegle O, Lippert C, et al. 2011. Whole-genome sequencing of multiple Arabidopsis thaliana populations. Nat Genet. 43:956–963.

Catania F, Wurmser F, Potekhin AA, Przybos E, Lynch M. 2009. Genetic diversity in the Paramecium aurelia species complex. Mol Biol Evol. 26:421–431.

Chalker DL, Yao MC. 2011. DNA elimination in ciliates: transposon domestication and genome surveillance. Annu Rev Genet. 45:227–246.

Charlesworth D. 2003. Effects of inbreeding on the genetic diversity of populations. Philos T Roy Soc B. 358:1051–1070.

Charlesworth B, Morgan MT, Charlesworth D. 1993. The effect of deleterious mutations on neutral molecular variation. Genetics 134:1289–1303.

Chia JM, Song C, Bradbury PJ, Costich D, de Leon N, Doebley J, Elshire RJ, Gaut B, Geller L, Glaubitz JC, et al. 2012. Maize HapMap2 identifies extant variation from a genome in flux. Nat Genet. 44:803–807.

Comeron JM, Williford A, Kliman RM. 2008. The Hill-Robertson effect: evolutionary consequences of weak selection and linkage in finite populations. Heredity 100:19–31.

Dey A, Chan CK, Thomas CG, Cutter AD. 2013. Molecular hyperdiversity defines populations of the nematode Caenorhabditis brenneri. Proc Natl Acad Sci U S A. 110:11056–11060.

Dong X, Navratilova P, Fredman D, Drivenes O, Becker TS, Lenhard B. 2010. Exonic remnants of whole-genome duplication reveal cis-regulatory function of coding exons. Nucleic Acids Res. 38:1071–1085.

Downing T, Imamura H, Decuypere S, Clark TG, Coombs GH, Cotton JA, Hilley JD, de Doncker S, Maes I, Mottram JC, et al. 2011. Whole genome sequencing of multiple Leishmania donovani clinical isolates provides insights into population structure and mechanisms of drug resistance. Genome Res. 21:2143–2156.

Drummond DA, Bloom JD, Adami C, Wilke CO, Arnold FH. 2005. Why highly expressed proteins evolve slowly. Proc Natl Acad Sci U S A. 102:14338–14343.

Eddy SR. 2011. Accelerated profile HMM searches. PLoS Comput Biol. 7:e1002195.

Edgar RC. 2004a. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5:113.

Edgar RC. 2004b. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32:1792–1797.

Eichenlaub MP, Ettwiller L. 2011. De novo genesis of enhancers in vertebrates. PLoS Biol. 9:e1001188.

Ellison CE, Hall C, Kowbel D, Welch J, Brem RB, Glass NL, Taylor JW. 2011. Population genomics and local adaptation in wild isolates of a model microbial eukaryote. Proc Natl Acad Sci U S A. 108:2831–2836.

Elyashiv E, Bullaughey K, Sattath S, Rinott Y, Przeworski M, Sella G. 2010. Shifts in the intensity of purifying selection: an analysis of genome-wide polymorphism data from two closely related yeast species. Genome Res. 20:1558–1573.

Engelthaler DM, Hicks ND, Gillece JD, Roe CC, Schupp JM, Driebe EM, Gilgado F, Carriconde F, Trilles L, Firacative C, et al. 2014. Cryptococcus gattii in North American Pacific Northwest: whole-population genome analysis provides insights into species evolution and dispersal. mBio 5:e01464–e01414.

Fawcett JA, Iida T, Takuno S, Sugino RP, Kado T, Kugou K, Mura S, Kobayashi T, Ohta K, Nakayama J, et al. 2014. Population genomics of the fission yeast Schizosaccharomyces pombe. PLoS One. 9:e104241.

Fenchel T, Finlay BJ. 2004. The ubiquity of small species: patterns of local and global diversity. Bioscience 54:777–784.

Finlay BJ. 2002. Global dispersal of free-living microbial eukaryote species. Science 296:1061–1063.

Finlay BJ, Esteban GF, Fenchel T. 2004. Protist diversity is different? Protist 155:15–22.

Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J, et al. 2014. Pfam: the protein families database. Nucleic Acids Res. 42:D222–D230.

Foissac S, Gouzy J, Rombauts S, Mathe C, Amselem J, Sterck L, Van de Peer Y, Rouze P, Schiex T. 2008. Genome annotation in plants and fungi: EuGene as a model platform. Curr Bioinform. 3:87–97.

Foissner W. 1999. Protist diversity: estimates of the near-imponderable. Protist 150:363–368.

Foissner W. 2006. Biogeography and dispersal of micro-organisms: a review emphasizing protists. Acta Protozool. 45:111–136.

Foissner W. 2008. Protist diversity and distribution: some basic considerations. Biodivers Conserv. 17:235–242.

Foissner W. 2011. Dispersal of protists: the role of cysts and human introductions. In: Fontaneto D, editor. Biogeography of microscopic organisms: is everything small everywhere? Cambridge University Press. p. 61–87.

Foissner W, Chao A, Katz LA. 2008. Diversity and geographic distribution of ciliates (Protista: Ciliophora). Biodivers Conserv. 17:345–363.

Fournier-Level A, Korte A, Cooper MD, Nordborg M, Schmitt J, Wilczek AM. 2011. A map of local adaptation in Arabidopsis thaliana. Science 334:86–89.

Galtier N, Nabholz B, Glemin S, Hurst GDD. 2009. Mitochondrial DNA as a marker of molecular diversity: a reappraisal. Mol Ecol. 18:4541–4550.

Gout JF, Kahn D, Duret L, Paramecium Post-Genomics C. 2010. The relationship among gene expression, the evolution of gene dosage, and the rate of protein evolution. PLoS Genet. 6:e1000944.

Gout JF, Lynch M. 2015. Maintenance and loss of duplicated genes by dosage subfunctionalization. Mol Biol Evol. 32:2141–2148.

Gruchy DF. 1955. The breeding system and distribution of Tetrahymena pyriformis. J Protozool. 2:178–185.

Gustincich S, Manfioletti G, Del Sal G, Schneider C, Carninci P. 1991. A fast method for high-quality genomic DNA extraction from whole human blood. BioTechniques 11:298–300, 302.

Hancock AM, Witonsky DB, Alkorta-Aranburu G, Beall CM, Gebremedhin A, Sukernik R, Utermann G, Pritchard JK, Coop G, Di Rienzo A. 2011. Adaptations to climate-mediated selective pressures in humans. PLoS Genet. 7:e1001375.

Hiwatashi K. 1968. Determination and inheritance of mating type in Paramecium caudatum. Genetics 58:373–386.

Hori M, Tomikawa I, Przybos E, Fujishima M. 2006. Comparison of the evolutionary distances among syngens and sibling species of Paramecium. Mol Phylogenet Evol. 38:697–704.

Huang X, Kurata N, Wei X, Wang ZX, Wang A, Zhao Q, Zhao Y, Liu K, Lu H, Li W, et al. 2012. A map of rice genome variation reveals the origin of cultivated rice. Nature 490:497–501.

Huang X, Zhao Q, Han B. 2015. Comparative population genomics reveals strong divergence and infrequent introgression between Asian and African rice. Mol Plant. 8:958–960.

Hudson RR. 2002. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18:337–338.

Hudson RR, Turelli M. 2003. Stochasticity overrules the "three-times rule": genetic drift, genetic draft, and coalescence times for nuclear loci versus mitochondrial DNA. Evolution 57:182–190.

Jaillon O, Bouhouche K, Gout JF, Aury JM, Noel B, Saudemont B, Nowacki M, Serrano V, Porcel BM, Segurens B, et al. 2008. Translational control of intron splicing in eukaryotes. Nature 451:359–362.

James JE, Piganeau G, Eyre-Walker A. 2016. The rate of adaptive evolution in animal mitochondria. Mol Ecol. 25:67–78.

Joy DA, Feng X, Mu J, Furuya T, Chotivanich K, Krettli AU, Ho M, Wang A, White NJ, Suh E, et al. 2003. Early origin and recent expansion of Plasmodium falciparum. Science 300:318–321.

Kaplan NL, Hudson RR, Langley CH. 1989. The hitchhiking effect revisited. Genetics 123:887–899.

Katz LA, McManus GB, Snoeyenbos-West OLO, Griffin A, Pirog K, Costas B, Foissner W. 2005. Reframing the 'Everything is everywhere' debate: evidence for high gene flow and diversity in ciliate morphospecies. Aquat Microb Ecol. 41:55–65.

Katz LA, Snoeyenbos-West O, Doerder FP. 2006. Patterns of protein evolution in Tetrahymena thermophila: implications for estimates of effective population size. Mol Biol Evol. 23:608–614.

Krenek S, Berendonk TU, Fokin SI. 2015. New Paramecium (Ciliophora, Oligohymenophorea) congeners shape our view on its biodiversity. Org Divers Evol. 15:215–233.

Krenek S, Berendonk TU, Petzoldt T. 2011. Thermal performance curves of Paramecium caudatum: a model selection approach. Eur J Protistol. 47:124–137.

Krenek S, Petzoldt T, Berendonk TU. 2012. Coping with temperature at the warm edge – patterns of thermal adaptation in the microbial eukaryote Paramecium caudatum. PLoS One. 7:e30598.

Kudla G, Murray AW, Tollervey D, Plotkin JB. 2009. Coding-sequence determinants of gene expression in Escherichia coli. Science 324:255–258.

Kvitek DJ, Will JL, Gasch AP. 2008. Variations in stress sensitivity and genomic expression in diverse S. cerevisiae isolates. PLoS Genet. 4:e1000223.

Lahr DJ, Laughinghouse HDt, Oliverio AM, Gao F, Katz LA. 2014. How discordant morphological and molecular evolution among microorganisms can revise our notions of biodiversity on Earth. Bioessays 36:950–959.

Langley CH, Stevens K, Cardeno C, Lee YC, Schrider DR, Pool JE, Langley SA, Suarez C, Corbett-Detig RB, Kolaczkowski B, et al. 2012. Genomic variation in natural populations of Drosophila melanogaster. Genetics 192:533–598.

Leffler EM, Bullaughey K, Matute DR, Meyer WK, Segurel L, Venkat A, Andolfatto P, Przeworski M. 2012. Revisiting an old riddle: what determines genetic diversity levels within species? PLoS Biol. 10:e1001388.

Li H. 2011. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27:2987–2993.

Li H, Durbin R. 2010. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26:589–595.

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Genome Project Data Processing S. 2009. The sequence alignment/map format and SAMtools. Bioinformatics 25:2078–2079.

Lim LP, Burge CB. 2001. A computational analysis of sequence features involved in recognition of short introns. Proc Natl Acad Sci U S A. 98:11193–11198.

Liti G, Carter DM, Moses AM, Warringer J, Parts L, James SA, Davey RP, Roberts IN, Burt A, Koufopanou V, et al. 2009. Population genomics of domestic and wild yeasts. Nature 458:337–341.

Loveless MD, Hamrick JL. 1984. Ecological determinants of genetic-structure in plant-populations. Annu Rev Ecol Syst. 15:65–95.

Lynch M. 2007. The origins of genome architecture. Sunderland (MA): Sinauer Associates.

Lynch M. 2009. Estimation of allele frequencies from high-coverage genome-sequencing projects. Genetics 182:295–301.


Lynch M, Field MC, Goodson HV, Malik HS, Pereira-Leal JB, Roos DS, Turkewitz AP, Sazer S. 2014. Evolutionary cell biology: two origins, one objective. Proc Natl Acad Sci U S A. 111:16990–16994.

Lynch M, Sung W, Morris K, Coffey N, Landry CR, Dopman EB, Dickinson WJ, Okamoto K, Kulkarni S, Hartl DL, et al. 2008. A genome-wide view of the spectrum of spontaneous mutations in yeast. Proc Natl Acad Sci U S A. 105:9272–9277.

Maere S, Heymans K, Kuiper M. 2005. BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics 21:3448–3449.

Maguire B, Belk D. 1967. Paramecium transport by land snails. J Protozool. 14:445–447.

Maruki T, Lynch M. 2015. Genotype-frequency estimation from high-throughput sequencing data. Genetics 201:473–486.

McGrath CL, Gout JF, Doak TG, Yanagi A, Lynch M. 2014a. Insights into three whole-genome duplications gleaned from the Paramecium caudatum genome sequence. Genetics 197:1417–1428.

McGrath CL, Gout JF, Johri P, Doak TG, Lynch M. 2014b. Differential retention and divergent resolution of duplicate genes following whole-genome duplication. Genome Res. 24:1665–1675.

Miller JR, Delcher AL, Koren S, Venter E, Walenz BP, Brownley A, Johnson J, Li K, Mobarry C, Sutton G. 2008. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics 24:2818–2824.

Moradian MM, Beglaryan D, Skozylas JM, Kerikorian V. 2007. Complete mitochondrial genome sequence of three Tetrahymena species reveals mutation hot spots and accelerated nonsynonymous substitutions in Ymf genes. PLoS One. 2:e650.

Nabholz B, Glemin S, Galtier N. 2009. The erratic mitochondrial clock: variations of mutation rate, not population size, affect mtDNA diversity across birds and mammals. BMC Evol Biol. 9:54.

Nabholz B, Mauffrey JF, Bazin E, Galtier N, Glemin S. 2008. Determination of mitochondrial genetic diversity in mammals. Genetics 178:351–361.

Neafsey DE, Galinsky K, Jiang RH, Young L, Sykes SM, Saif S, Gujja S, Goldberg JM, Young S, Zeng Q, et al. 2012. The malaria parasite Plasmodium vivax exhibits greater genetic diversity than Plasmodium falciparum. Nat Genet. 44:1046–1050.

Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, Indap A, King KS, Bergmann S, Nelson MR, et al. 2008. Genes mirror geography within Europe. Nature 456:98–101.

Ohta T. 1974. Mutational pressure as main cause of molecular evolution and polymorphism. Nature 252:351–354.

Ohta T. 1992. The nearly neutral theory of molecular evolution. Annu Rev Ecol Syst. 23:263–286.

Pawlowski J, Audic S, Adl S, Bass D, Belbahri L, Berney C, Bowser SS, Cepicka I, Decelle J, Dunthorn M, et al. 2012. CBOL protist working group: barcoding eukaryotic richness beyond the animal, plant, and fungal kingdoms. PLoS Biol. 10:e1001419.

Piganeau G, Eyre-Walker A. 2009. Evidence for variation in the effective population size of animal mitochondrial DNA. PLoS One. 4:e4396.

Pool JE, Nielsen R. 2007. Population size changes reshape genomic patterns of diversity. Evolution 61:3001–3006.

Pophaly SD, Tellier A. 2015. Population level purifying selection and gene expression shape subgenome evolution in maize. Mol Biol Evol. 32:3226–3235.

Prescott DM. 1994. The DNA of ciliated protozoa. Microbiol Rev. 58:233–267.

Pritchard AE, Seilhamer JJ, Mahalingam R, Sable CL, Venuti SE, Cummings DJ. 1990. Nucleotide sequence of the mitochondrial genome of Paramecium. Nucleic Acids Res. 18:173–180.

Przybos E. 1975. Genetic studies of Paramecium jenningsi strains (Diller, Earl, 1958). Folia Biol (Krakow). 23:425–471.

Przybos E. 2005. Recent data on the occurrence of species of the Paramecium aurelia complex in Europe. Folia Biol-Krakow. 53:61–63.

Przybos E, Barth D, Berendonk TU. 2008. The Paramecium aurelia species complex, frequency and co-occurrence across Europe. Folia Biol (Krakow). 56:77–81.

Przybos E, Rautian M, Greczek-Stachura M, Potekhin A. 2007. Polymorphism within Paramecium sexaurelia (Ciliophora, Oligohymenophorea) and description of a new stand of the species in China. Folia Biol (Krakow). 55:121–125.

Przybos E, Surmacz M. 2010. New, world-wide data on the distribution of species of the Paramecium aurelia complex (Ciliophora, Protozoa). Folia Biol-Krakow. 58:185–188.

R-Core-Team. 2014. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

Redden H, Alper HS. 2015. The development and characterization of synthetic minimal yeast promoters. Nat Commun. 6:7810.

Ritter DI, Dong Z, Guo S, Chuang JH. 2012. Transcriptional enhancers in protein-coding exons of vertebrate developmental genes. PLoS One. 7:e35202.

Russell CB, Fraga D, Hinrichsen RD. 1994. Extremely short 20-33 nucleotide introns are the standard length in Paramecium tetraurelia. Nucleic Acids Res. 22:1221–1225.

Salim HMW, Ring KL, Cavalcanti ARO. 2008. Patterns of codon usage in two ciliates that reassign the genetic code: Tetrahymena thermophila and Paramecium tetraurelia. Protist 159:283–298.

Schacherer J, Shapiro JA, Ruderfer DM, Kruglyak L. 2009. Comprehensive polymorphism survey elucidates population structure of Saccharomyces cerevisiae. Nature 458:342–345.

Schattner P, Brooks AN, Lowe TM. 2005. The tRNAscan-SE, snoscan and snoGPS web servers for the detection of tRNAs and snoRNAs. Nucleic Acids Res. 33:W686–W689.

Schrider DR, Kern AD. 2015. Inferring selective constraint from population genomic data suggests recent regulatory turnover in the human brain. Genome Biol Evol. 7:3511–3528.

Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. 2003. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13:2498–2504.

Simonsen KL, Churchill GA, Aquadro CF. 1995. Properties of statistical tests of neutrality for DNA polymorphism data. Genetics 141:413–429.

Smith DR, Lee RW. 2008. Nucleotide diversity in the mitochondrial and nuclear compartments of Chlamydomonas reinhardtii: investigating the origins of genome architecture. BMC Evol Biol. 8:156.

Smith JM, Haigh J. 1974. The hitch-hiking effect of a favourable gene. Genetics Res. 89:391–403.

Snoke MS, Berendonk TU, Barth D, Lynch M. 2006. Large global effective population sizes in Paramecium. Mol Biol Evol. 23:2474–2479.

Sonneborn TM. 1970. Methods in Paramecium research. In: Methods Cell Biol. New York and London: Academic Press. p. 241–339.

Sonneborn TM. 1975. Paramecium aurelia complex of fourteen sibling species. T Am Microsc Soc. 94:155–178.

Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30:1312–1313.

Stoeck T, Welter H, Seitz-Bender D, Kusch J, Schmidt HJ. 2000. ARDRA and RAPD-fingerprinting reject the sibling species concept for the ciliate Paramecium caudatum (Ciliophora, Protoctista). Zool Scr. 29:75–82.

Sun M. 1999. Cleistogamy in Scutellaria indica (Labiatae): effective mating system and population genetic structure. Mol Ecol. 8:1285–1295.

Swart EC, Nowacki M, Shum J, Stiles H, Higgins BP, Doak TG, Schotanus K, Magrini VJ, Minx P, Mardis ER, et al. 2012. The Oxytricha trifallax mitochondrial genome. Genome Biol Evol. 4:136–154.

Tajima F. 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123:585–595.

Tishkoff SA, Reed FA, Ranciaro A, Voight BF, Babbitt CC, Silverman JS, Powell K, Mortensen HM, Hirbo JB, Osman M, et al. 2007. Convergent adaptation of human lactase persistence in Africa and Europe. Nat Genet. 39:31–40.

Tsukii Y, Hiwatashi K. 1983. Genes controlling mating-type specificity in Paramecium caudatum: three loci revealed by intersyngenic crosses. Genetics 104:41–62.

Population Genomics of Paramecium Species · doi:10.1093/molbev/msx074 MBE


Tsukii Y, Hiwatashi K. 1985. Meiotic nondisjunction and aneuploids in intersyngenic hybrids of Paramecium caudatum. Genetics 111:779–794.

Tuller T, Zur H. 2015. Multiple roles of the coding sequence 5′ end in gene expression regulation. Nucleic Acids Res. 43:13–28.

Voight BF, Kudaravalli S, Wen X, Pritchard JK. 2006. A map of recent positive selection in the human genome. PLoS Biol. 4:e72.

Volkman SK, Sabeti PC, DeCaprio D, Neafsey DE, Schaffner SF, Milner DA Jr, Daily JP, Sarr O, Ndiaye D, Ndir O, et al. 2007. A genome-wide map of diversity in Plasmodium falciparum. Nat Genet. 39:113–119.

Weir W, Capewell P, Foth B, Clucas C, Pountain A, Steketee P, Veitch N, Koffi M, De Meeus T, Kabore J, et al. 2016. Population genomics reveals the origin and asexual evolution of human infective trypanosomes. eLife 5:e11473.

Weisse T. 2008. Distribution and diversity of aquatic protists: an evolutionary and ecological perspective. Biodivers Conserv. 17:243–259.

Wickham H. 2010. ggplot2: Elegant graphics for data analysis. New York: Springer-Verlag.

Zhang JZ, Yang JR. 2015. Determinants of the rate of protein sequence evolution. Nat Rev Genet. 16:409–420.



Expected heterozygosity in introns increases significantly across all species upon exclusion of the first three and last three sites within each intron, indicating that these sites are conserved, presumably for splicing. Overall, diversity in intronic and intergenic regions is remarkably low compared to the diversity levels at 4-fold degenerate sites for all species (also seen in supplementary fig. S3, Supplementary Material online), suggesting that the non-coding regions in Paramecium are under strong selective constraints, most likely because they harbor functionally important splice sites and regulatory modules, respectively. Interestingly, while nucleotide diversity levels in the P. aurelia species are much lower in introns relative to intergenic sequences, this trend is reversed in the two outgroups. This observation is consistent with the fact that, on average, intergenic regions in the P. aurelia species are longer. A possible explanation is the nonfunctionalization of genes following the two WGDs, which would increase the amount of non-functional DNA between the two flanking functional genes and thus reduce the fraction of intergenic sites under constraints. Among the P. aurelia species, P. sexaurelia shows much higher genetic variation across the genome, consistent with previous studies (Przybos et al. 2007) that found significant genotypic differences between P. sexaurelia strains. This is also consistent with the relatively higher divergence of P. sexaurelia (McGrath et al. 2014b), suggesting that the differences in genome-wide levels of sequence variation could be driven by slight differences in mutation rate.

FIG. 2. Population structure of Paramecium species. Left: Principal component analysis using genome-wide genotypes for P. tetraurelia, P. biaurelia, and P. sexaurelia. Right: Average pairwise genetic distance as a function of physical distance between individuals within P. tetraurelia, P. biaurelia, and P. sexaurelia.

Overall, 4-fold degenerate sites have the highest levels of nucleotide diversity across all elements of the genome for all species and thus might represent the most neutrally evolving sites in the Paramecium macronucleus. The diversity values reported for P. aurelia species fall within the range of known diversity levels for other unicellular eukaryotes such as Plasmodium falciparum (4-fold degenerate π = 0.0015) (Volkman et al. 2007), Chlamydomonas reinhardtii (silent site π = 0.0319) (Smith and Lee 2008), and Tetrahymena thermophila (coding π = 0.0020) (Katz et al. 2006). On the other hand, the values of 4-fold degenerate diversity in the outgroup species P. caudatum and P. multimicronucleatum are among the highest reported (Leffler et al. 2012; Dey et al. 2013). It is still not entirely clear if such high diversity values are due to a large panmictic global population or because of population structure despite infrequent gene flow. Sampling individuals from the same pond will help dissect these two scenarios. However, we cannot presently rule out other factors, such as the presence of cryptic species among the outgroups, introgression with closely related species, or a much higher intrinsic mutation rate, that could also potentially contribute to these remarkably high levels of variation.
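As a rough illustration of the per-site quantity compared above, the following Python sketch computes expected heterozygosity (nucleotide diversity) at biallelic sites from allele counts, using the standard unbiased estimator 2p(1 − p)·n/(n − 1). The function names, the input format, and the choice of estimator are assumptions made for illustration; this section does not state how the reported estimates were computed.

```python
def site_heterozygosity(alt_count: int, n: int) -> float:
    """Unbiased expected heterozygosity at one biallelic site.

    alt_count: number of sampled alleles carrying the alternate base
    n: total number of sampled alleles at the site (must be >= 2)
    """
    if n < 2:
        raise ValueError("need at least two sampled alleles")
    if not 0 <= alt_count <= n:
        raise ValueError("alt_count must lie between 0 and n")
    p = alt_count / n
    return 2.0 * p * (1.0 - p) * n / (n - 1)


def mean_diversity(alt_counts, n: int) -> float:
    """Average per-site diversity over a set of sites.

    Monomorphic sites (alt_count of 0 or n) contribute 0, so they must be
    included in alt_counts for the result to be a per-site mean.
    """
    alt_counts = list(alt_counts)
    if not alt_counts:
        raise ValueError("no sites supplied")
    return sum(site_heterozygosity(k, n) for k in alt_counts) / len(alt_counts)
```

For example, `mean_diversity([0, 0, 0, 5], 10)` averages one polymorphic site with three monomorphic ones, which is why diversity per site is much smaller than heterozygosity at the segregating sites alone.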

Intergenic Regions in the Outgroups are Under Strong Constraints
The macronuclear genomes of all Paramecium species are extremely compact (fig. 3a and 3b), with the median intergenic region (non-coding DNA between start and start/stop sites of adjacent genes) in P. caudatum and P. multimicronucleatum comprising 43 and 93 bp, respectively. The aurelia species have slightly longer intergenic regions (median of 161 bp, 195 bp, and 229 bp in P. tetraurelia, P. biaurelia, and P. sexaurelia, respectively). Paramecium also has one of the shortest eukaryotic introns, with an average size of just 20–25 base pairs (Russell et al. 1994). The internally eliminated sequences (IESs), which are excised during the formation of the macronucleus and are well characterized only in P. tetraurelia, have a median length of 26–28 bp (Arnaiz et al. 2012). In fact, 93% of the IESs in P. tetraurelia are shorter than 150 bp (Arnaiz et al. 2012), suggesting that the micronuclear genomes of the Paramecium species are also fairly compact. In most organisms, diversity in coding regions is well below the diversity in intergenic and intronic regions. For instance, in S. cerevisiae, which has a fairly compact genome, the non-coding to coding diversity ratio is 1.23 (Schacherer et al. 2009). In S. pombe, the coding heterozygosity is 0.2122 and diversity at intergenic regions (including 5′ and 3′ UTRs) is 0.3351, resulting in an intergenic to coding diversity ratio of 1.58 (calculated from Fawcett et al. 2014).

In the P. aurelia species, nucleotide diversity in coding regions is ~50% of the intergenic diversity, while for the outgroup species, intergenic regions have much less diversity than the coding regions (table 1). In order to reduce biases that might result from differences in sequence depth of coverage across sites, we compared expected heterozygosity at coding and intergenic regions using only sites that were covered by all individuals and had coverage very close to the median population coverage (±20 reads) (fig. 3c). While the ratio of intergenic to coding diversity in the P. aurelia species is comparable to that in other organisms, at 1.66 (P. tetraurelia), 1.64 (P. biaurelia), and 1.69 (P. sexaurelia), in the outgroup species the ratio is 0.58 [subset: 0.74] for P. caudatum and 0.80 [subset: 0.75] for P. multimicronucleatum. Hereafter, the values in square brackets correspond to the subsets. We know of no other species with such low nucleotide diversity in intergenic relative to coding regions, suggesting strong conservation of intergenic sequence.
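The direction of this contrast can also be checked directly against the genome-wide means in Table 1. The sketch below is only an illustration: the dictionary holds the intergenic and average coding values as read from Table 1 for the full sets of isolates, and the helper name is hypothetical. Because these are unfiltered genome-wide means, the resulting ratios are not expected to match the coverage-filtered ratios quoted in the text.

```python
# Mean per-site diversity as read from Table 1: (intergenic, average coding).
TABLE1 = {
    "P. tetraurelia": (0.0045, 0.0023),
    "P. biaurelia": (0.0071, 0.0035),
    "P. sexaurelia": (0.0209, 0.0113),
    "P. caudatum": (0.0350, 0.0472),
    "P. multimicronucleatum": (0.0337, 0.0530),
}


def intergenic_to_coding_ratio(intergenic: float, coding: float) -> float:
    """Ratio of intergenic to coding diversity; values below 1 mean that
    intergenic DNA is less variable than coding DNA."""
    if coding <= 0:
        raise ValueError("coding diversity must be positive")
    return intergenic / coding


ratios = {sp: intergenic_to_coding_ratio(*vals) for sp, vals in TABLE1.items()}

for species, ratio in ratios.items():
    label = "intergenic more diverse" if ratio > 1 else "intergenic less diverse"
    print(f"{species}: {ratio:.2f} ({label})")
```

Under this reading of the table, the three P. aurelia species fall above 1 and both outgroups fall below 1, in line with the pattern described in the paragraph above.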

Table 1. Mean Nucleotide Diversity (per site) Estimated for All Isolates across the Genome (SE < 3.0 × 10^-4 for all sites).

Species                          4-fold  2-fold  0-fold  syn     non-syn  intron  Intron  intergenic  avg coding
P. tetraurelia                   0.0058  0.0033  0.0012  0.0042  0.0011   0.0026  0.0032  0.0045      0.0023
P. biaurelia                     0.0089  0.0049  0.0021  0.0059  0.0018   0.0049  0.0050  0.0071      0.0035
P. sexaurelia                    0.0269  0.0165  0.0061  0.0194  0.0052   0.0151  0.0167  0.0209      0.0113
P. caudatum                      0.1352  0.0819  0.0150  0.1056  0.0134   0.0466  0.0602  0.0350      0.0472
P. caudatum (subset)             0.0693  0.0381  0.0090  0.0518  0.0080   0.0316  0.0406  0.0262      0.0241
P. multimicronucleatum           0.1599  0.0781  0.0216  0.1146  0.0193   0.0405  0.0492  0.0337      0.0530
P. multimicronucleatum (subset)  0.0742  0.0418  0.0100  0.0549  0.0089   0.0249  0.0330  0.0201      0.0259

NOTE. Intron (capitalized column) represents introns excluding the first three and last three sites. Intergenic region refers to the noncoding DNA that extends between the stop and start site of protein-coding genes.

Despite accounting for bias in coverage, it is possible that if the reference genome is highly diverged from the sequenced individuals (as in our case), say at intergenic regions, those reads might fail to map and hence result in a lower estimate of intergenic diversity. We therefore performed simulations to address mapping and SNP detection when the reference genome is highly diverged from individuals in the population. Our simulations show that up to 10% divergence from the reference, SNP detection is remarkably accurate. At 10% divergence from the reference, 93% of true SNPs can be detected, and estimated divergence (9.5%) is only slightly lower than true divergence (10%) (supplementary fig. S4, supplementary table S4, Supplementary Material online). However, reads obtained from genomes that were more than 10% divergent from the reference failed to map successfully, and there is a steep reduction in the fraction of true SNPs detected, with only ~10% of the true SNPs detected at 20% divergence from the reference (supplementary fig. S4, Supplementary Material online). Interestingly, most of our strains in P. caudatum are far below 10% divergence (including coding and non-coding regions separately) (supplementary table S5, Supplementary Material online), and thus overall analyses across the majority of the genome should not be affected by this bias. In P. multimicronucleatum, although the divergent strains are close to being 10% divergent from the reference (supplementary table S5, Supplementary Material online), when we consider only the P. multimicronucleatum subset, all strains are less than 6% divergent from the reference. Therefore, it is possible that such a bias is affecting our results when we include all the strains in the outgroup species, but analyses with the subsets should be minimally influenced by this bias. Nonetheless, it should be noted that, due to heterogeneity of divergence across the genome, some regions of the genome might still be affected by such a mapping bias. Accordingly, we repeated the above analysis by calling SNPs using only the 4 and 5 closest individuals to the reference genome (forming polytomies in supplementary fig. S1, Supplementary Material online) in P. caudatum and P. multimicronucleatum, respectively. These individuals were even closer to the reference genome, with average expected heterozygosity at 4-fold degenerate sites of 0.04302 and 0.06255, respectively. Again, we observe that for all sites that are covered by all individuals and have coverage very close to the median population coverage (±20 reads), the ratio of intergenic to coding diversity was 0.78 and 0.72, respectively. This suggests that the observation of relatively lower genetic variation in the intergenic relative to coding regions in the outgroup species is not an artifact of our analysis.

FIG. 3. The compactness of Paramecium genomes. (A, B) Distribution of intron and intergenic length across Paramecium genomes: (A) P. tetraurelia and (B) P. caudatum. Mean intron lengths in P. tetraurelia and P. caudatum: 25 and 22 bp; median intergenic length in P. tetraurelia and P. caudatum: 161 and 43 bp. (C) πcoding/πintergenic calculated genome-wide from sites that had coverage very close to the median population coverage (±20 reads) and were contributed by the same number of individuals. (D) Incidence of indels across the genome for each Paramecium species, calculated by dividing the total number of indels found in that region by the total number of utilizable nucleotide sites in that region.

As an alternative approach to examine the relative strength of selective constraint on coding versus non-coding DNA, we examined the distribution of insertions/deletions (indels) across the genome. A total of 29,096 (P. tetraurelia), 54,406 (P. biaurelia), 88,724 (P. sexaurelia), 32,631 [28,776] (P. caudatum [subset]), and 31,338 [28,347] (P. multimicronucleatum [subset]) indels were found across the genomes, which translates to SNP:indel ratios of 19, 17, 34, 147 [77], and 138 [88], respectively. The indel lengths are exponentially distributed in all species, with a median of 1 bp. Indels are heavily under-represented in coding regions (fig. 3d), where the median indel length is 3 bp, suggesting purifying selection against indels that cause frame-shifts. Introns and intergenic regions, however, tend to have a median indel length of 1 bp (supplementary table S6, Supplementary Material online). Again, we see that in the P. aurelia species there are proportionally more indels present in the intergenic regions than in introns. On the other hand, in the outgroup species, intergenic regions tolerate fewer indels than do intronic regions. This shows that the very short intergenic regions in P. caudatum and P. multimicronucleatum are under extremely strong constraint.
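The reasoning behind the 3 bp median in coding regions is that only indels whose length is a multiple of three preserve the reading frame. A minimal Python sketch of that check, together with the per-site indel incidence described for fig. 3D, is given below; the function names and the toy length lists are hypothetical and are not taken from the study's data.

```python
from statistics import median


def is_frameshift(indel_length: int) -> bool:
    """An indel inside a coding sequence shifts the reading frame unless
    its length is a multiple of 3."""
    if indel_length <= 0:
        raise ValueError("indel length must be a positive number of bases")
    return indel_length % 3 != 0


def indel_incidence(n_indels: int, n_usable_sites: int) -> float:
    """Indels per usable nucleotide site in a region."""
    if n_usable_sites <= 0:
        raise ValueError("need at least one usable site")
    return n_indels / n_usable_sites


# Toy example (made-up lengths): coding indels enriched for multiples of 3,
# non-coding indels dominated by single-base events.
coding_lengths = [3, 3, 6, 1, 3]
noncoding_lengths = [1, 1, 2, 1, 4]

print(median(coding_lengths), median(noncoding_lengths))
print(sum(is_frameshift(x) for x in coding_lengths), "frame-shifting coding indels")
```

If purifying selection removes frame-shifting indels from coding sequence, the surviving coding indels are skewed toward lengths of 3, 6, and so on, which is the pattern the median of 3 bp suggests.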

Spatial Patterns of Constraints across Non-Coding Regions in the Genomes
Selection in natural populations can remove strongly deleterious mutations, but weakly deleterious mutations can continue to persist in the population at lower allele frequencies (Ohta 1974, 1992). Thus, within-population nucleotide diversity can be used to infer constraints across the genome and possibly identify functional parts of the non-coding genome. We therefore evaluated spatial patterns of nucleotide diversity in exons flanking all introns and intergenic regions across the genome for each species separately (fig. 4; supplementary fig. S5, Supplementary Material online). However, neutral sites linked to functionally important sites also experience a reduction in diversity as a result of either purging of deleterious alleles (background selection; Charlesworth et al. 1993) or fixation of a beneficial mutation (hitchhiking; Smith and Haigh 1974; Kaplan et al. 1989) at functional loci due to linkage (Comeron et al. 2008). As selection has a minimal effect on divergence at linked neutral sites (Birky and Walsh 1988), a comparison of polymorphism and divergence can distinguish between the effects of variation in the neutral mutation rate (which is determined by selective constraint in addition to mutational processes) versus linked selection (Begun and Aquadro 1992). To this end, we evaluated patterns of divergence across introns and intergenic regions for the relatively more closely related P. aurelia species.

We observed a reduction in both expected heterozygosity and divergence in intergenic regions near the start and end of coding sequences, as well as at some key sites within introns (figs. 4 and 5 and supplementary fig. S6, Supplementary Material online). In fact, divergence-normalized values of expected heterozygosity fail to exhibit any spatial pattern across these non-coding elements (fig. 5; supplementary fig. S6, Supplementary Material online), suggesting that these patterns are driven predominantly by negative selection and mutational processes (although see below) rather than linked selection. This is interesting because, although asexual species are expected to experience the effects of linked selection more strongly (Charlesworth et al. 1993), very compact genomes such as those in Paramecium might experience relatively more direct selection, as they have a higher proportion of functional sites.
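The normalization used in this argument can be sketched in a few lines of Python. The function below divides heterozygosity by divergence position by position; the name, the handling of zero divergence, and the toy profile are assumptions for illustration only. The point of the toy values is that when diversity and divergence fall together near a gene boundary, the normalized profile is flat, which is the signature the text attributes to direct constraint or mutation rather than linked selection.

```python
def normalized_profile(heterozygosity, divergence):
    """Per-position heterozygosity divided by divergence.

    Positions with zero divergence are returned as None because the
    ratio is undefined there.
    """
    if len(heterozygosity) != len(divergence):
        raise ValueError("profiles must have the same length")
    return [h / d if d > 0 else None for h, d in zip(heterozygosity, divergence)]


# Toy profile (made-up numbers) moving away from a gene boundary: both
# statistics rise with distance, but their ratio stays constant.
het = [0.002, 0.004, 0.006]
div = [0.05, 0.10, 0.15]
print(normalized_profile(het, div))
```

Under linked selection alone, diversity would drop near the gene while divergence would not, so the normalized profile would dip instead of staying flat.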

The first and last three sites of introns are highly conserved, even more so than 0-fold degenerate sites, as observed in patterns of both divergence and nucleotide diversity (figs. 4 and 5), consistent with the role of these positions in intron splicing (Jaillon et al. 2008). In fact, it has been suggested that the first and last three sites in short eukaryotic introns might be sufficient for accurate splicing (Lim and Burge 2001; Jaillon et al. 2008). However, we find that in Paramecium, on average, 8–10 base pairs of coding DNA immediately flanking an intron are also under stronger constraints than the remainder of the coding sequence, possibly suggesting the presence of splicing regulatory elements in the coding sequence. Another interesting observation is the clear decrease in divergence (fig. 5e and 5g) in the center of introns, suggesting that longer introns might possess functional elements related to gene regulation.

We also observe that average expected heterozygosity and divergence increase gradually moving upstream from the start codon and downstream from the stop codon into the intergenic regions, stabilizing at around 100–150 base pairs (fig. 4), suggesting the presence of regulatory modules or UTRs. All four types of intergenic regions with respect to the direction of transcription, namely (i) regions that are upstream to both flanking genes, (ii) those downstream of one gene and upstream of another, (iii) upstream of one gene and downstream of another, and (iv) downstream of both flanking genes, exhibit the same pattern of reduction in heterozygosity levels just before the start and after the stop codons (supplementary fig. S7, Supplementary Material online). This indicates that constraints on downstream intergenic regions are not an artifact of overlapping upstream intergenic regions. Instead, these observations suggest the presence of gene regulatory elements at the 3′ end of genes, involved in posttranscriptional regulation, in the specification of transcription termination, or in the regulation of gene expression. We note that in the P. aurelia species, diversity levels in the intergenic regions stabilize to reach average diversity at 4-fold degenerate sites. In contrast, in the two outgroup species, diversity in intergenic regions stabilizes only up to the equivalent average diversity at 2-fold degenerate sites. Such differences in patterns of nucleotide diversity between the P. aurelia species and the outgroup species can be attributed to a much higher proportion of functionally important sites comprising the intergenic space in the outgroup species, possibly as a result of smaller numbers of bases between adjacent genes in the latter.

A similar reduction in diversity in intergenic regions extends up to 85 kb around transcription start/stop sites of genes in humans (Abecasis et al. 2010). In S. cerevisiae, intergenic regions are under strong purifying selection, and those shorter than 300 bp have significantly lower nucleotide diversity than longer intergenic regions (Schacherer et al. 2009).

Interestingly, we also observe, on average, a 70–100 bp (20–35 amino acids) decrease in diversity as well as divergence at both the 5′ and 3′ ends of the coding sequence (CDS) of protein-coding genes. Such reduction in diversity at 4-fold and 2-fold degenerate sites at the beginning and end of protein-coding sequence can be explained by a number of different hypotheses. The 5′ ends of CDSs might be required to maintain low thermodynamic stability of mRNA secondary structure near the start codon for efficient translation initiation. Alternatively, CDS 5′ ends might possess fewer tRNA-adapted codons to slow down translation initially, which has been proposed to reduce ribosome collision-induced premature translation terminations and facilitate protein production (see Tuller and Zur 2015). Consistent with this, the 5′ and 3′ ends of protein-coding genes in bacteria (32–35 codons) have been shown to be important for efficient translation (Kudla et al. 2009; Boel et al. 2016), and their mRNA folding energies can explain observed variation in expression levels (Kudla et al. 2009). In fact, it has been shown previously that the first few codons in proteins of P. tetraurelia and T. thermophila are, on average, comprised of significantly different numbers of effective codons than the rest of the protein (Salim et al. 2008). Thus, the role played by the first and last few bases of all coding sequences in translation might contribute to additional sequence constraints, resulting in reduced nucleotide diversity and divergence.

FIG. 4. Average expected heterozygosity (per site) within and flanking introns (left) and intergenic regions (right) in P. tetraurelia (a, b), P. biaurelia (c, d), P. sexaurelia (e, f), P. caudatum subset (g, h), and P. multimicronucleatum subset (i, j), shown separately for 4-fold degenerate sites (red), 2-fold degenerate sites (blue), 0-fold degenerate sites (green), and non-coding sites (black).

FIG. 5. Patterns of average divergence (per site) within and flanking intergenic regions (a, c) and introns (e, g) are shown in the left panels. Patterns of average expected heterozygosity per site normalized by average divergence per site within and flanking intergenic regions (b, d) and introns (f, h) are shown on the right. The divergence values are calculated between P. tetraurelia and P. biaurelia, separately for 4-fold degenerate sites (red), 2-fold degenerate sites (blue), 0-fold degenerate sites (green), and non-coding sites (black).

Another factor that might contribute to constraining the CDS ends is the presence of regulatory modules at the beginning and end of genes. Exonic enhancers have been observed in multiple species, including teleosts (Dong et al. 2010; Eichenlaub and Ettwiller 2011), humans (Birnbaum et al. 2012), and other vertebrates (Ritter et al. 2012), where they have been shown to regulate nearby genes. In the highly compact genomes of Paramecium, it is even more likely that regulatory regions extend beyond non-coding regions into the CDS. For instance, in P. caudatum, the median length of 5′ intergenic regions is 43 bp, which increases to 56 bp when both genes share the same 5′ intergenic region, suggesting the importance of upstream intergenic length in such compact genomes. Interestingly, 965 genes (7%) have 5′ intergenic regions shorter than 10 bp, and 3,417 genes (25%) have 5′ intergenic regions shorter than 20 bp in P. caudatum. Similarly, in P. multimicronucleatum, the median length of the 5′ intergenic region is 90 bp, which increases to 127 bp when both genes share the same 5′ intergenic region. Again, there are 456 genes (3.5%) with less than 10 bp and 1,760 genes (13%) with less than 20 bases comprising their upstream intergenic region. In fact, in both species the 5′ intergenic region can be as small as 3 bp (found in 38 genes in P. caudatum and 18 genes in P. multimicronucleatum). An interesting question this raises is what might be the minimum number of bases required to encode regulatory information about transcription of genes in eukaryotes. A recent study in yeast has now created a high-strength minimal constitutive promoter that is only 116 bp in length (Redden and Alper 2015). It might be possible to observe even smaller promoter regions needed to drive expression patterns in Paramecium, especially in P. caudatum.

Spatial Patterns of Nucleotide Diversity across Non-Coding Regions are Not Driven by Mutation Alone
Although we have assumed that the divergence-normalized diversity patterns indicate the effect of negative selection, it is possible that the observed spatial patterns of nucleotide diversity across and flanking non-coding regions are driven entirely by mutation. One way to potentially distinguish between mutation and direct selection as the cause of these patterns is to evaluate the allele frequency of the variants observed at different site types. If the reduction in nucleotide diversity near the start and end of coding sequences is mostly a consequence of negative selection, we would expect a higher fraction of rare alleles at 4-fold degenerate sites under stronger constraints relative to that seen at 4-fold degenerate sites not under constraints. Instead, if mutation is the predominant driver of the extent of nucleotide diversity at these sites, then there should be no difference in the fraction of rare alleles between the 4-fold degenerate sites under stronger negative selection and those not under constraints. Because Tajima's D (Tajima 1989) is a good statistic to summarize such a shift in rare alleles, we compared the average Tajima's D statistic across these site types.

In each Paramecium species, we first characterized sites genome-wide into five different categories: (1) splice sites (first and last 3 sites) in introns; (2) intergenic sites within 50 bp of start/stop sites; (3) intergenic sites at least 150 bp away from start/stop sites (should exclude most UTR regions); (4) 4-fold degenerate sites within 50 bp of start/stop sites; and (5) 4-fold degenerate sites at least 150 bp away from start/stop sites and 30 bp away from splice sites. We expected category 5 sites to be the most neutral. Interestingly, in most species, Tajima's D at splice sites, intergenic sites close to start/stop sites, and 4-fold degenerate sites close to start/stop sites is significantly smaller than that of 4-fold degenerate and intergenic sites far away from any functional elements (see supplementary table S7, Supplementary Material online, for P-values). This suggests that sites in categories 1, 2, and 4 are under relatively stronger purifying selection. Interestingly, category 2 sites seem to possess significantly more rare alleles than category 3 sites, consistently in all species, suggesting that intergenic regions very close to start/stop sites are under stronger purifying selection than intergenic sites further away from gene ends. We also find Tajima's D at intergenic sites (category 3) to be significantly higher than that for 4-fold degenerate sites (category 5) in most species. This could suggest either that intergenic sites are under some form of positive/balancing selection or that silent sites are not completely neutral and in fact experience some purifying selection.
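To make the logic of this comparison concrete, the following Python sketch implements the standard Tajima's D statistic (Tajima 1989) from per-site allele counts. It is a textbook implementation for illustration, not the pipeline used in the study; the function name and the input format (one allele count per segregating site, no missing data, a common sample size n) are assumptions.

```python
import math


def tajimas_d(allele_counts, n: int) -> float:
    """Tajima's D from allele counts at segregating biallelic sites.

    allele_counts: for each segregating site, the number of sampled
        sequences (between 1 and n - 1) carrying one of the two alleles
    n: number of sampled sequences
    """
    counts = list(allele_counts)
    s = len(counts)  # number of segregating sites
    if n < 4 or s == 0:
        raise ValueError("need n >= 4 sequences and at least one segregating site")
    if any(k <= 0 or k >= n for k in counts):
        raise ValueError("every site must be polymorphic in the sample")

    # Mean number of pairwise differences, summed over sites.
    pi = sum(2.0 * k * (n - k) / (n * (n - 1)) for k in counts)

    # Constants from Tajima (1989).
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    # Difference between the two estimators of theta, scaled by its
    # estimated standard deviation.
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))
```

With ten sequences, a set of sites made up only of singletons (`tajimas_d([1] * 10, 10)`) gives a negative D, while sites at intermediate frequency (`tajimas_d([5] * 10, 10)`) give a positive D. An excess of rare alleles at constrained sites therefore shows up as a lower average D, which is the contrast drawn between the site categories above.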

Gene-Based Analysis of Patterns of Synonymous and Non-Synonymous Diversity
Selective constraint can vary among genes within the same species, and it could also be species-specific. To examine such variation in our dataset, we evaluated the distribution of πn/πs across all genes (fig. 6a). If we assume that synonymous polymorphisms are fitness neutral, then πn/πs is a measure of the relative functional constraints on non-synonymous polymorphisms and thus an indication of functional constraints on the protein. The majority of genes have πn/πs < 1.0, consistent with constraints on non-synonymous changes. About 98% of genes in P. tetraurelia, 96% in P. biaurelia, 96% in P. sexaurelia, 99% in both P. caudatum and its subset, and 99% in both P. multimicronucleatum and its subset have πn/πs < 1. Upon comparing the average πn/πs for the same set of orthologous genes across all species, we find the mean to be 0.10 for P. tetraurelia, 0.17 for P. biaurelia, 0.24 for P. sexaurelia, 0.12 [0.12] for P. caudatum [subset], and 0.17 [0.14] for P. multimicronucleatum [subset]. Among these species, P. sexaurelia has the highest average πn/πs, which suggests either relaxed selection because of a relatively small effective population size or more local adaptation. Local adaptation might result in fixation of more non-synonymous than synonymous differences between subpopulations, thus increasing πn/πs. Differences in πn/πs between species could also reflect genome-wide variation in the fraction of gene duplicates that are subfunctionalized, neofunctionalized, or nonfunctionalized. Alternatively, an elevation in average levels of πn/πs could be a consequence of depression in the average levels of πs, which could be caused by stronger selection on synonymous sites or a recent bottleneck (Brandvain and Wright 2016) experienced by the species. Evidence that genes with higher πn/πs ratios are under more relaxed constraints is provided by a comparison between the average number of frame-shift-causing indels occurring in the four sets of genes with (1) πn/πs ≤ 0.25, (2) 0.25 < πn/πs ≤ 0.5, (3) 0.5 < πn/πs ≤ 1.0, and (4) πn/πs > 1.0. As expected, we find an increase in the number of indels per site with increase in πn/πs (fig. 6b).
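The binning described above can be sketched as follows. This is a hedged illustration only: the function names and the tuple layout of the input are hypothetical, and the bin edges are those listed in the paragraph above. The filter on the number of synonymous polymorphisms mentioned in the figure legends is not applied here.

```python
def pn_ps_category(pi_n: float, pi_s: float) -> int:
    """Assign a gene to one of four bins by its pi_n/pi_s ratio."""
    if pi_s <= 0:
        raise ValueError("pi_s must be positive for the ratio to be defined")
    ratio = pi_n / pi_s
    if ratio <= 0.25:
        return 1
    if ratio <= 0.5:
        return 2
    if ratio <= 1.0:
        return 3
    return 4


def frameshift_rate_by_category(genes):
    """Frame-shifting indels per site, pooled within each pi_n/pi_s bin.

    genes: iterable of (pi_n, pi_s, n_frameshift_indels, n_sites) tuples.
    Bins that contain no sites are reported as None.
    """
    totals = {c: [0, 0] for c in (1, 2, 3, 4)}
    for pi_n, pi_s, n_frameshift, n_sites in genes:
        category = pn_ps_category(pi_n, pi_s)
        totals[category][0] += n_frameshift
        totals[category][1] += n_sites
    return {c: (fs / sites if sites else None) for c, (fs, sites) in totals.items()}


# Toy input (made-up numbers): a constrained gene and a relaxed gene.
rates = frameshift_rate_by_category([
    (0.001, 0.01, 1, 1000),
    (0.02, 0.01, 5, 1000),
])
print(rates)
```

Pooling indels and sites within each bin, rather than averaging per-gene rates, keeps short genes from dominating the comparison; whether the study pooled or averaged is not stated in this section.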

Paralogs and Orthologs Evolve under Similar Selective Constraints within and across Species
We evaluated the correlation between πn/πs across orthologs and within-species paralogs, which were created by the most recent WGD (fig. 7). Both paralogs and orthologs show a significant positive correlation within and across all species, suggesting that most WGD duplicates are under similar selective constraints. We also confirmed that these correlations are not being driven simply by selection on synonymous sites, as πs is not necessarily correlated between orthologs and within-species paralogs across all species (supplementary fig. S8, Supplementary Material online). Similar findings in the maize genome also suggested correlations between nucleotide diversity and divergence between WGD paralogs (Pophaly and Tellier 2015). Our results further suggest that most orthologs are under very similar evolutionary constraints across the Paramecium species. This is in agreement with a recent study in Drosophila, where πn and πs were found to be strongly correlated between D. melanogaster and D. simulans (Langley et al. 2012). Such a correlation was also observed between orthologs in S. cerevisiae and S. paradoxus (Elyashiv et al. 2010).

FIG. 6. (A) πn versus πs of each gene. The red line represents πn = πs. (B) The distribution of πn/πs of each gene across the genome. Only genes with more than five synonymous polymorphisms were used in this analysis. (C) Number of frame-shift-causing indels per site in genes belonging to three gene categories: (1) πn/πs ≤ 0.5 (light grey), (2) 0.5 < πn/πs ≤ 1.0 (grey), and (3) πn/πs > 1.0 (black). Only genes with more than five synonymous polymorphisms were used in this analysis.

A possible mechanism governing the degree of selective constraint acting on proteins is their level of expression (Zhang and Yang 2015). Non-synonymous mutations are thought to have more severe fitness consequences in highly expressed genes than in weakly expressed ones (Drummond et al. 2005; Gout et al. 2010). Indeed, we find a strong negative correlation between expression and πn/πs in all species (fig. 8). However, only 6–13% of the variance is explained by expression, implying that additional factors may play a greater role in determining the degree of selective constraint acting on a gene. Simple biochemical properties of proteins, like the protein's function, tertiary structure, and the number of interacting partners, might determine the proportion of functionally important residues in the protein and in turn govern the degree of selection.

Our observation that orthologous genes across such deeply diverged species experience similar degrees of selective constraint (fig. 7) might imply a low rate of turnover of expression, protein structure, function, and number of interacting partners. This finding may also be due in part to similar ecologies, and therefore similar selective pressures, experienced by the Paramecium species examined here. It may thus be informative to compare closely related species that live in, or have recently experienced, different environments. For instance, Elyashiv et al. (2010) found significant differences in levels of gene-specific purifying selection (πn/πs) in the sake population of S. cerevisiae in comparison to other populations of S. cerevisiae as well as S. paradoxus. In fact, this deviation in selective constraints experienced by the sake population is complemented by differences in response to different stresses (Kvitek et al. 2008), suggesting that changes in environment might be reflected in gene-specific constraints.

FIG. 7. Correlation of the logarithm of πn/πs for paralogs within species (three diagonal plots for P. tetraurelia, P. biaurelia, and P. sexaurelia) and for orthologs between species pairwise [non-diagonal plots, with the plots below the diagonal (outlined in red) representing the first set of WGD paralogs (T1, B1, S1) and the upper half of the matrix (outlined in blue) representing the second set of paralogs (T2, B2, S2)]. The number in the top left corner of each plot is the slope of the linear regression in cases where the regression is significant (P < 0.01). Only genes with more than five synonymous polymorphisms were used in this analysis. The schematic tree in the bottom right corner shows the phylogenetic relationship between the species P. tetraurelia (T), P. biaurelia (B), P. sexaurelia (S), P. caudatum (C), and P. multimicronucleatum (M).

FIG. 8. Negative correlation of πn/πs with expression levels of genes in four Paramecium species. All correlations are significant with P < 2.2 × 10⁻¹⁶. Only genes with more than five synonymous polymorphisms were used in this analysis.

Distribution of Tajima’s D across the Genome

Contrary to genes under purifying selection, genes under balancing or positive selection might often be species-specific. We tested this hypothesis using Tajima’s D, a statistic that uses the difference between nucleotide diversity and the number of segregating sites to detect deviations from neutrality. Because Tajima’s D is sensitive to population size changes and population structure (Simonsen et al. 1995), we used an outlier approach. We calculated Tajima’s D for non-overlapping 1 kb windows across the genome (supplementary fig. S9, Supplementary Material online) and looked for the presence of candidate genes in the outlier regions (0.25% of the two tails). The outlier regions with negative values of Tajima’s D might represent genes under either weak purifying selection or positive selection (i.e., experiencing a recent sweep), whereas those with positive Tajima’s D might be under balancing selection. Interestingly, both sets of outliers show an over-representation of genes involved in glycosylation, glycoprotein production and metabolism, and carbohydrate metabolism (supplementary table S8, Supplementary Material online) for all Paramecium species. For each species, the large functional overlap of genes in both tails suggests an important role of these processes in Paramecium growth. More importantly, it further suggests that similar selective pressures act genome-wide on all Paramecium species examined in this study. These conclusions can, however, be biased by gene length, as shorter genes might not provide sufficient power for an accurate calculation of Tajima’s D. Indeed, the outliers are significantly longer than the rest of the genes in P. tetraurelia (outliers: 1567 bp; non-outliers: 1321 bp; P = 4.7 × 10⁻¹³), P. biaurelia (outliers: 1681 bp; non-outliers: 1372 bp; P < 2.2 × 10⁻¹⁶), and P. sexaurelia (outliers: 1860 bp; non-outliers: 1379 bp; P < 2.2 × 10⁻¹⁶). This bias is, however, much lower or not significant for P. caudatum (outliers: 1463 bp; non-outliers: 1383 bp; P = 7.3 × 10⁻³) and P. multimicronucleatum (outliers: 1509 bp; non-outliers: 1445 bp; P = 0.19), possibly because of higher genome-wide nucleotide diversity, indicating that, despite some bias, the results have true biological significance.
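The outlier selection described in this section can be illustrated with a minimal sketch. It assumes that one Tajima's D value per window is already available and that the 0.25% cutoff applies to each tail separately, which the text does not state explicitly. The function name `tail_outliers` and the toy values are hypothetical and are not part of the published pipeline.

```python
import math


def tail_outliers(values, tail_fraction=0.0025):
    """Return (lower_tail, upper_tail) lists of window indices.

    values: one Tajima's D value per genomic window.
    tail_fraction: fraction of windows taken from EACH tail (an assumption).
    """
    if not values:
        return [], []
    # Keep at least one window per tail so that small inputs still return a result.
    k = max(1, math.floor(len(values) * tail_fraction))
    order = sorted(range(len(values)), key=lambda i: values[i])
    return order[:k], order[-k:]


# Toy data: 1000 windows with steadily increasing D values.
d_values = [i / 100.0 - 5.0 for i in range(1000)]
negative_tail, positive_tail = tail_outliers(d_values)
print(negative_tail, positive_tail)
```

Genes overlapping the returned windows would then be the candidates passed on to the enrichment analysis.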

Tajima’s D, as a summary statistic of the site-frequency spectrum, can also indicate overall patterns of selection or recent changes in demography. In most species, the genome-wide distribution of Tajima’s D is shifted towards positive values, with means of 0.862 for P. tetraurelia, 0.504 for P. biaurelia, 0.274 for P. sexaurelia, -0.148 [0.459] for P. caudatum [subset], and 1.325 [0.855] for P. multimicronucleatum [subset]. This could suggest pervasive local adaptation (or balancing selection), a recent contraction in population size, or a fairly ancient population subdivision in all Paramecium species. A previous study that sampled variation at 10 nuclear loci found that strains from different continents shared the same alleles and interpreted this as evidence of regular global dispersal of Paramecium from relatively few permanent locations (Catania et al. 2009). This could also be a potential explanation for the observed increase in nucleotide diversity in comparison to the number of segregating sites.

Fraction of Heterozygous Sites across the Genome

On average, about 0.1% of sites in an individual were heterozygous within the P. aurelia species, whereas about 0.7% of sites in individuals belonging to P. caudatum and 1.1% of sites in isolates of P. multimicronucleatum were found to be heterozygous (supplementary table S1, Supplementary Material online). This is consistent with the observation that the P. aurelia species undergo autogamy, whereas the outgroup species do not. On evaluating the fraction of heterozygous sites in windows of 10 kb across the genome, we find large blocks of homozygosity in the P. aurelia genomes (supplementary fig. S10a–e, Supplementary Material online), consistent with a loss of heterozygosity in the lab, possibly through multiple episodes of autogamy. It should be noted, however, that peaks of heterozygosity in the P. aurelia genomes are highly correlated between isolates, unlike in P. caudatum and P. multimicronucleatum, and might reflect mapping or assembly errors. These heterozygous regions (summarized in supplementary files S1–S3, Supplementary Material online) comprise a very small part (0.82–0.97%) of the genome and, most importantly, their exclusion does not affect any of our analyses qualitatively.

Variation across the Mitochondrial Genome

Few ciliate mitochondrial genomes have been sequenced until now, most belonging to the genus Tetrahymena (Brunk et al. 2003; Burger et al. 2000; Moradian et al. 2007). Others include Paramecium tetraurelia (Pritchard et al. 1990), Paramecium caudatum (Barth and Berendonk 2011), and Oxytricha trifallax (Swart et al. 2012). These studies have shown that ciliate mitochondrial genomes are linear and range between 40 and 47 kb, with 50 intronless genes. However, levels of sequence diversity across the entire mitochondrial genome have not been characterized.

To obtain mitochondrial genomes for all species included in this study, we carried out de novo genome assembly using the whole-genome resequencing data for each strain and extracted mitochondrial contigs using the published Paramecium mitochondrial genomes as reference (see Methods for more details). Using this approach, we derived complete genome assemblies for multiple strains, including from species without a previously published mitochondrial genome (P. biaurelia, P. sexaurelia, and P. multimicronucleatum). A comprehensive analysis of these sequences will be published elsewhere (Johri et al., in preparation). The mitochondrial genome assembly from each isolate was annotated separately, and the eight genes that were most ubiquitously present in all individuals were concatenated and used to construct a phylogeny (see Methods). The mitochondrial phylogeny closely tracked the one based on nuclear sequences (supplementary fig. S11, Supplementary Material online).


One strain from each species (the one with the longest assembly) was chosen as the reference genome, and reads from the other individuals were mapped to this reference to call SNPs. During the original DNA preparation, efforts were made to remove non-MAC DNA, resulting in varying amounts of mitochondrial DNA in the extracts from each isolate. Hence, some individuals mapped with extremely low coverage (4× to zero read depth) to the reference genome, especially after applying stringent filters to call SNPs (supplementary table S9, Supplementary Material online). Some individuals were therefore excluded from this analysis, including all P. biaurelia isolates, and the subsequent analysis was restricted to four species.

All of the mitochondrial genomes are about 40 kb in length. We identified 626 SNPs in P. tetraurelia, 5307 in P. sexaurelia, 4771 [3041] in P. caudatum [subset], and 4600 [1478] in P. multimicronucleatum [subset]. Within all species, very few indels were found, with just 1 in P. tetraurelia, 9 in P. sexaurelia, and 15 [13] and 23 [9] in P. caudatum [subset] and P. multimicronucleatum [subset], respectively. The nucleotide diversity at 4-fold degenerate sites was found to be 0.009 in P. tetraurelia, 0.169 in P. sexaurelia, 0.127 [0.093] in P. caudatum [subset], and 0.172 [0.073] in P. multimicronucleatum [subset]. Nucleotide diversity across different sites shows that rRNAs are the most conserved sequences in the ciliate mitochondrial genomes. Transfer RNAs are the next most highly conserved, with their diversity levels only slightly above those of replacement sites (table 2). Our estimates of diversity in the Paramecium mitochondria are higher than those of a number of other unicellular eukaryotes studied so far: P. falciparum (πsyn = 0.0012) (Joy et al. 2003) and Chlamydomonas reinhardtii (πsilent = 0.0085) (Smith and Lee 2008).

We also recalculated nuclear nucleotide diversity (table 2) at 4-fold degenerate sites for the same set of individuals used in the mitochondrial analysis and observed that the mitochondrial to nuclear diversity ratio is close to 1.0 in all species (table 2) except P. sexaurelia, where the mitochondrial diversity is 5.7-fold higher than that in the nucleus. On excluding P. sexaurelia, we obtain a strong correlation between 4-fold degenerate nucleotide diversity in the mitochondria and nucleus (slope = 0.95, R² = 0.96). It should be noted that there are not enough data points to make any inferences about statistical significance. However, such a positive correlation between nuclear and mitochondrial diversity suggests that mitochondrial diversity might scale with population size in ciliates, which bears on previous studies in animals that have yielded conflicting findings about whether mitochondrial diversity correlates with population size (Bazin et al. 2006; Nabholz et al. 2008; Nabholz et al. 2009; Piganeau and Eyre-Walker 2009). The observed correlation also suggests that mitochondrial and nuclear spontaneous mutation rates may be correlated. Direct estimates of mutation rates in several eukaryotes (C. elegans, D. melanogaster, S. cerevisiae, and H. sapiens), obtained via mutation accumulation experiments, suggest a moderate correlation (slope = 0.05, R² = 0.48) between mitochondrial and nuclear base substitution rates (using data from Lynch et al. 2008). Discordance between the nuclear and mitochondrial diversity, for instance in

TABLE 2. Nucleotide Diversity (per site) of Different Sites across the Mitochondrial Genomes.

Species                          4-fold (πm) [SE]  2-fold [SE]       0-fold [SE]       tRNA [SE]         rRNA [SE]         Intergenic [SE]   4-fold nuclear (πn) [SE]  πm/πn
P. tetraurelia                   0.0091 [0.0010]   0.0107 [0.0009]   0.0019 [0.0002]   0.0000 [0.0000]   0.0006 [0.0002]   0.0041 [0.0003]   0.0058 [2.26 × 10⁻⁵]      1.57
P. sexaurelia                    0.1692 [0.0033]   0.1427 [0.0024]   0.0097 [0.00043]  0.0000 [0.0000]   0.0059 [0.0007]   0.0417 [0.0027]   0.02945 [4.73 × 10⁻⁵]     5.75
P. caudatum                      0.1272 [0.0037]   0.0942 [0.0019]   0.0179 [0.0006]   0.0455 [0.0023]   0.0136 [0.0012]   0.0333 [0.0027]   0.1267 [1.08 × 10⁻⁴]      1.00
P. caudatum (subset)             0.0929 [0.0039]   0.0859 [0.0021]   0.0149 [0.0006]   0.0375 [0.0023]   0.0093 [0.0011]   0.0292 [0.0028]   0.0658 [9.63 × 10⁻⁵]      1.41
P. multimicronucleatum           0.1727 [0.0047]   0.0831 [0.0019]   0.0227 [0.0008]   0.0263 [0.0011]   0.0269 [0.0016]   0.0329 [0.0022]   0.1599 [1.34 × 10⁻⁴]      1.08
P. multimicronucleatum (subset)  0.0730 [0.0034]   0.0383 [0.0015]   0.0121 [0.0006]   0.0126 [0.0009]   0.0135 [0.0012]   0.0312 [0.0022]   0.0742 [9.56 × 10⁻⁵]      0.98


P. sexaurelia, can be attributed to various factors, such as positive selection (Hudson and Turelli 2003; Galtier et al. 2009; James et al. 2016), demographic events (Pool and Nielsen 2007), or simply differences in nuclear and/or mitochondrial mutation rates (Lynch 2007). Future studies are needed to dissect the factors responsible for the remarkably different ratio of mitochondrial to nuclear diversity in P. sexaurelia.
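The mitochondrial to nuclear diversity ratios discussed here can be recomputed from the 4-fold degenerate estimates. The following is a small sketch in which the values are transcribed from table 2; the dictionary and function names are hypothetical.

```python
# 4-fold degenerate diversity as read from table 2:
# (mitochondrial pi, nuclear pi) for the same set of individuals.
TABLE2_FOURFOLD = {
    "P. tetraurelia": (0.0091, 0.0058),
    "P. sexaurelia": (0.1692, 0.02945),
    "P. caudatum": (0.1272, 0.1267),
    "P. caudatum (subset)": (0.0929, 0.0658),
    "P. multimicronucleatum": (0.1727, 0.1599),
    "P. multimicronucleatum (subset)": (0.0730, 0.0742),
}


def diversity_ratio(pi_mito, pi_nuclear):
    """Ratio of mitochondrial to nuclear nucleotide diversity."""
    if pi_nuclear <= 0:
        raise ValueError("nuclear diversity must be positive")
    return pi_mito / pi_nuclear


ratios = {
    species: diversity_ratio(pi_mito, pi_nuclear)
    for species, (pi_mito, pi_nuclear) in TABLE2_FOURFOLD.items()
}

for species, ratio in ratios.items():
    print(f"{species}: {ratio:.2f}")
```

Under these inputs, P. sexaurelia is the only species whose ratio departs markedly from 1.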

Conclusions

This is the first study in which multiple individuals of several species of free-living unicellular eukaryotes have been sequenced and used to reveal genome-wide patterns of variation. This study thus provides an invaluable resource for future research in Paramecium, a major model organism in cell biology. Sequencing of multiple individuals from the same lake or pond is needed to further resolve the global versus local population structure of these cosmopolitan unicellular species. Overall, our results suggest that most of the intergenic regions across the highly compact genomes of Paramecium species are strongly constrained and probably harbor functionally important regulatory modules, which might extend into coding regions. We show that patterns of nucleotide diversity across the genome are driven primarily by a combination of purifying selection and mutation rate variation. In the future, such analyses will benefit greatly from incorporating variation in recombination rates across the genomic landscape. We also find that duplicates created by the WGD appear to be evolving under similar constraints. Synonymous to non-synonymous diversity ratios need to be supplemented with divergence levels to distinguish between the various types of selective forces acting on the duplicates. Thus, sequencing of much closer outgroup species would allow us to further understand the role of selection in the evolution of different genomic elements, including gene duplicates that resulted from ancient WGDs.

Methods

Sampling, DNA Extraction, and Sequencing

We examined a total of 56 Paramecium strains originally isolated from environmental sources, including 11 isolates of P. biaurelia, 13 isolates of P. sexaurelia, 12 isolates of P. tetraurelia, 10 isolates of P. caudatum, and 10 isolates of P. multimicronucleatum. A total of 14 strains from P. biaurelia and P. sexaurelia were kindly provided by E. Przybos, whereas the remaining strains were cultured in the Paramecium collections of the Institute of Hydrobiology, TU Dresden, Germany, or the Department of Biology, Indiana University, Bloomington, USA. Prior to our analyses, P. tetraurelia strain Moz 13CXI as well as P. sexaurelia strains Moz 13BIII and Indo 17I were kindly identified by E. Przybos by performing inter- and intrastrain crosses (Przybos 1975). Paramecia were maintained in 0.25% Cerophyl medium (modified according to Sonneborn 1970; Krenek et al. 2011) inoculated with Enterobacter aerogenes. Since P. aurelia species can undergo autogamy (self-fertilization) induced by starvation, and all strains have been cultured under lab conditions for several years, all P. aurelia strains in this study are expected to be more homozygous than wild isolates.

Prior to DNA extraction, all cultures were initiated from a single cell, ensuring clonality. A modified sucrose density separation (Aury et al. 2006) was performed to extract primarily macronuclei (MACs) and to reduce the amount of non-MAC DNA potentially compromising subsequent analyses. Briefly, Paramecium cells were starved for at least two days and filtered through several layers of sterile cheesecloth prior to cell harvesting via gentle centrifugation. Pelleted cells were gently lysed in 0.25 M sucrose medium (pH 6.8) containing 10 mM Tris-HCl, 8 mM MgCl2, 3 mM CaCl2, 0.5–1% Nonidet-P40, and 0.1–0.2% sodium deoxycholate. Macronuclei were subsequently isolated away from cellular debris, micronuclei, and mitochondria by centrifugation at 100 rcf in 0.25 M sucrose wash buffer (see above, without Nonidet-P40 and Na-deoxycholate). Genomic DNA extraction from MAC isolations was performed by a modified CTAB procedure (Gustincich et al. 1991).

All genomes were sequenced using the Illumina HiSeq 2000, and paired-end reads with insert sizes between 200 and 400 bp and a mean length of 151 bp were obtained. The reads mapped with coverage ranging between 40× and 80× (details given in supplementary table S1, Supplementary Material online).

Paramecium Reference Genomes

Complete genomes of P. tetraurelia, P. biaurelia, P. sexaurelia, and P. caudatum were downloaded from ParameciumDB (Arnaiz et al. 2007; Arnaiz and Sperling 2011) along with the annotations. Ortho-paralog relationships as well as functional annotations were retrieved from the supplementary information provided in McGrath et al. (2014a, 2014b). RNAseq data for all four species were downloaded and processed as in Gout and Lynch (2015).

The genome of P. multimicronucleatum was assembled using Celera (Miller et al. 2008) and annotated using EuGene (Foissac et al. 2008) as previously described (McGrath et al. 2014a, 2014b). A detailed description of the sequencing, assembly, and annotation of P. multimicronucleatum can be found in Marinov et al. (in preparation). OMA (Orthology MAtrix) (Altenhoff et al. 2015) was used to find orthologs between P. caudatum and P. multimicronucleatum. In total, 11,002 genes of P. caudatum were found to have 1-1 orthologs in P. multimicronucleatum, and only 1-1 orthologs were used for the current analysis.

Mapping of Reads and Detection of SNPs and Indels

Sequenced paired-end reads were mapped to the reference genome of each species using BWA (bwa mem) (Li and Durbin 2010) with default parameters, and the mpileup function in Samtools (Li et al. 2009) was used to generate mpileup files for each individual. Two distinct pipelines were employed to identify SNPs and ensure detection accuracy: Bcftools-Vcftools (Li 2011) and the maximum-likelihood Genotype Frequency Estimator (referred to as GFE) (Lynch 2009; Maruki and Lynch 2015). Both pipelines produced highly overlapping sets of SNPs (supplementary table S2, Supplementary Material online). For both pipelines, sites were only considered for SNP detection and further analysis if the mapping quality was above 30, base quality was above 20, per-base alignment quality was above 15, and the sum of the depth of coverage for all individuals was above 5 times the number of individuals and less than twice the average population coverage. SNPs were called by bcftools if the SNP quality score was above 30 and by GFE if the significance score was above 5.91 (P < 0.05). The exact parameters of the Bcftools-Vcftools pipeline to call SNPs and indels are given below.

samtools mpileup -q 30 -Q 20 -t DP -uvf reference_genome.fasta sorted.bam > species.vcf

bcftools call -f GQ -o species_snps.vcf -O v -V indels -cv species.vcf

bcftools call -f GQ -o species_indels.vcf -O v -V snps -cv species.vcf

# filtering snp and indel files for depth and biases
perl vcfutils.pl varFilter -d 50 -D 800 -1 -2 -3 -4 species_snps.vcf > species_snps.filtered1.vcf

perl vcfutils.pl varFilter -d 50 -D 800 -1 -2 -3 -4 species_indels.vcf > species_indels.filtered1.vcf

vcftools --vcf species_snps.filtered1.vcf --minQ 20 --minGQ 30 --minDP 4 --recode --recode-INFO-all --out species_snps.filtered2

vcftools --vcf species_indels.filtered1.vcf --minQ 20 --minGQ 30 --minDP 4 --recode --recode-INFO-all --out species_indels.filtered2

# filter out the recoded sites that are not polymorphic anymore, that is, AF1 = 1
bcftools view -q 0.001 species_snps.filtered2.recode.vcf -o species_snps.final.vcf

bcftools view -q 0.001 species_indels.filtered2.recode.vcf -o species_indels.final.vcf

Mitochondrial SNPs were called using the Bcftools-Vcftools pipeline with the same parameters as noted above.
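The depth criterion used for site filtering can be written out as a short sketch. It assumes strict inequalities and takes the "average population coverage" to be the mean depth summed over all individuals; the function names and the example value of 400 are hypothetical. With 10 individuals and a mean summed depth of 400, the resulting bounds (50 and 800) coincide with the -d and -D values passed to varFilter in the commands listed here.

```python
def depth_bounds(n_individuals, mean_total_depth):
    """Lower and upper bounds on the depth summed over all individuals.

    Lower bound: 5 times the number of individuals.
    Upper bound: twice the mean summed depth (assumed interpretation).
    """
    return 5 * n_individuals, 2 * mean_total_depth


def site_passes_depth(per_individual_depths, mean_total_depth):
    """True if the summed depth at a site lies strictly between the bounds."""
    lower, upper = depth_bounds(len(per_individual_depths), mean_total_depth)
    total = sum(per_individual_depths)
    return lower < total < upper


# Illustrative values: 10 individuals, mean summed depth of 400.
print(depth_bounds(10, 400))
print(site_passes_depth([40] * 10, 400))
```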

Simulation of SNP Detection with Varying Levels of Divergence from the Reference

We simulated reads using wgsim (https://github.com/lh3/wgsim) for a diploid genome with varying levels of divergence from the reference genomes of P. caudatum and P. multimicronucleatum, using average values of read length, total number of reads, and insert size.

For the P. caudatum reference genome:
wgsim -d 287 -s 113 -N 10201087 -1 151 -2 151 -r 0.10 -R 0.013 caudatum_assembly.fasta caud_010_1.fastq caud_010_2.fastq >& caud_010.info

For the P. multimicronucleatum reference genome:
wgsim -d 261 -s 102 -N 12154044 -1 151 -2 151 -r 0.10 -R 0.011 multimicronucleatum_assembly.fasta multi_010_1.fastq multi_010_2.fastq >& multi_010.info

Reads were simulated for r = 0.0, 0.05, 0.10, 0.15, 0.20, 0.25, and 0.30, where r represents the fraction of the genome with SNPs. The SNP detection pipeline described above was used to call SNPs, which were compared to the simulated set of SNPs (supplementary table S4, Supplementary Material online).
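Comparing called SNPs against the simulated set reduces to set arithmetic on genomic positions. The sketch below assumes each SNP is identified by a (scaffold, position) pair; the function name, the reported metrics, and the toy positions are hypothetical and do not reproduce the contents of supplementary table S4.

```python
def compare_calls(simulated, called):
    """Summarize agreement between simulated and called SNP positions.

    Both arguments are iterables of (scaffold, position) tuples.
    """
    sim, obs = set(simulated), set(called)
    true_pos = len(sim & obs)
    false_neg = len(sim - obs)
    false_pos = len(obs - sim)
    return {
        "true_positives": true_pos,
        "false_negatives": false_neg,
        "false_positives": false_pos,
        # Fraction of simulated SNPs that were recovered.
        "sensitivity": true_pos / len(sim) if sim else float("nan"),
        # Fraction of called SNPs that were never simulated.
        "false_discovery_rate": false_pos / len(obs) if obs else float("nan"),
    }


simulated = [("scaffold_1", 10), ("scaffold_1", 20), ("scaffold_2", 5), ("scaffold_2", 9)]
called = [("scaffold_1", 10), ("scaffold_2", 5), ("scaffold_2", 7)]
summary = compare_calls(simulated, called)
print(summary)
```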

Building Phylogenies

A consensus genome was obtained for each individual using bcftools (bcftools view species_filtered.recode.vcf | vcfutils.pl vcf2fq > species.cns.fq), with heterozygous sites represented by DNA ambiguity codes. Gene nucleotide sequences that are present in single copy in the P. aurelia species and are also present in both outgroup species (P. caudatum and P. multimicronucleatum) were extracted, and multiple sequence alignments were generated using MUSCLE (Edgar 2004a, 2004b). This amounted to about 55 sequences (including the reference genomes) in each of 883 sets of orthologous genes. RAxML 8.0.26 (Stamatakis 2014) was used to build a tree from the concatenation of all 883 aligned gene sequences (raxml -s concatenated_sequences.fasta -n gene -m GTRGAMMA -T 4 -p 7). In addition, individual gene trees (raxml -s gene.fasta -n gene -m GTRGAMMA -T 4 -p 7) were built and then used to create a consensus tree (raxml -J MR -z trees.newick -T 4 -m GTRGAMMA) (Aberer et al. 2010).

The mitochondrial phylogeny was built using the same parameters as above by concatenating eight sets of orthologous genes (NADH_dehydrogenase_subunit_1_a, NADH_dehydrogenase_subunit_5, Ymf56orf78, Ymf67orf265_1, Ymf68orf393, apocytochrome_b, cytochrome_c_oxidase_subunit_1, cytochrome_c_oxidase_subunit_2, ribosomal_protein_S14) that were ubiquitously present across all sampled species.

Principal Component Analysis

Genotypes for all individuals were obtained from the vcf files generated by vcftools for all scaffolds. Only sites that had a genotype available for all individuals were used in the analysis, i.e., if even a single individual had missing data, the site was excluded. A total of approximately 10,000 markers were used for each species. The first marker of each scaffold was picked randomly within the first 50 SNP positions, and the rest were selected so that all markers were uniformly distributed across all scaffolds. PCA was carried out on these markers using the R (R-Core-Team 2014) command prcomp (https://stat.ethz.ch/R-manual/R-devel/library/stats/html/prcomp.html) and visualized using the first two components with the R graphing package ggplot2 (Wickham 2010).
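The marker selection can be sketched as follows. This is a simplified illustration for a single scaffold, assuming missing genotypes are encoded as None and that uniform spacing means a constant step between SNP indices; the function names are hypothetical and the authors' handling of multiple scaffolds is not reproduced.

```python
import random


def complete_sites(genotype_rows):
    """Indices of sites at which no individual has a missing genotype (None)."""
    return [
        i for i, row in enumerate(genotype_rows)
        if all(genotype is not None for genotype in row)
    ]


def pick_markers(n_sites, n_markers, rng, first_window=50):
    """Pick evenly spaced SNP indices, starting at a random offset.

    The first index is drawn at random from the first `first_window` SNPs;
    the remaining indices follow at a constant step.
    """
    if n_sites <= 0 or n_markers <= 0:
        return []
    first = rng.randrange(min(first_window, n_sites))
    step = max(1, (n_sites - first) // n_markers)
    return list(range(first, n_sites, step))[:n_markers]


rng = random.Random(1)  # fixed seed so the example is reproducible
markers = pick_markers(100_000, 10_000, rng)
print(len(markers), markers[0])
```

The selected columns of the genotype matrix would then be passed to a PCA routine such as prcomp.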

Calculation of Population Genetic Statistics

Nucleotide Diversity and Tajima’s D

Nucleotide diversity was calculated as 2pq · n/(n − 1), where p and q are the major and minor allele frequencies and n is the number of chromosomes. The major and minor allele frequencies were given for each site by GFE. Diversity at coding sites was partitioned into synonymous and non-synonymous components (e.g., 2-fold degenerate sites are one-third synonymous and two-thirds non-synonymous). Major allele frequencies were also extracted from vcftools and returned results very similar to those from GFE (supplementary table S3, Supplementary Material online).
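As a worked illustration of this estimator, the sketch below computes the per-site diversity from a major allele frequency and splits the diversity of a 2-fold degenerate site into its two components. The function names are hypothetical, and the literal one-third/two-thirds split is an assumption about how the partition was applied.

```python
def site_diversity(major_freq, n_chromosomes):
    """Per-site nucleotide diversity, 2pq * n / (n - 1), for a biallelic site."""
    if n_chromosomes < 2:
        raise ValueError("at least two chromosomes are required")
    if not 0.0 <= major_freq <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    minor_freq = 1.0 - major_freq
    return 2.0 * major_freq * minor_freq * n_chromosomes / (n_chromosomes - 1)


def partition_twofold(pi_site):
    """Split diversity at a 2-fold degenerate site into
    (synonymous, non-synonymous) parts: one third and two thirds."""
    return pi_site / 3.0, 2.0 * pi_site / 3.0


# Example: 10 diploid individuals (20 chromosomes), major allele frequency 0.5.
pi = site_diversity(0.5, 20)
pi_syn, pi_nonsyn = partition_twofold(pi)
print(pi, pi_syn, pi_nonsyn)
```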


A sliding-window approach was used to calculate Tajima’s D across the genome. Vcftools was used to obtain these values for non-overlapping windows of 1 kb in length. The program ms (Hudson 2002) was used to perform 100,000 simulations of 10 individuals with a 1 kb locus length, and niceStats (https://github.com/kern-lab/shIC/blob/master/niceStats.c) was used to calculate Tajima’s D from the ms output file. To obtain Tajima’s D for different categories of sites, SNPs belonging to a particular category of sites (specified in Results) were selected, and vcftools was used to calculate Tajima’s D for non-overlapping windows of 10 kb. Windows with 2 or more SNPs were used to calculate the mean, and only alternate windows were used to compare distributions of Tajima’s D across different categories by Student’s t-test, to maximise independence of sampling.
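For readers unfamiliar with the statistic, the following sketch implements the standard Tajima (1989) formula for a single window. It is not the vcftools or niceStats code used in the study; the function name and example inputs are hypothetical.

```python
import math


def tajimas_d(n, seg_sites, mean_pairwise_diff):
    """Tajima's D for one window.

    n: number of sampled chromosomes.
    seg_sites: number of segregating sites S in the window.
    mean_pairwise_diff: mean number of pairwise differences (pi summed over sites).
    Returns None when the statistic is undefined (S == 0 or n < 2).
    """
    if n < 2 or seg_sites == 0:
        return None
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    if variance <= 0:
        return None
    # Difference between the two estimators of theta, scaled by its standard deviation.
    return (mean_pairwise_diff - seg_sites / a1) / math.sqrt(variance)


# Example: 10 chromosomes, 16 segregating sites, mean pairwise difference 3.888889.
d_example = tajimas_d(10, 16, 3.888889)
print(d_example)
```

A negative value indicates an excess of rare variants relative to the neutral expectation, and a positive value an excess of intermediate-frequency variants.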

Plotting Average Expected Heterozygosity across Introns and Intergenic Regions

All introns (< 500 bp in length) were split midway and aligned according to the 5′ and 3′ splice sites. All intergenic regions (< 500 bp in the outgroup species and < 1000 bp in the P. aurelia species) were split midway and aligned according to the start and stop codons. Note that these were not sequence alignments. At each site, the mean expected heterozygosity was calculated. Sites with an insufficient number of data points (< 200 per site for introns; for intergenic regions, < 100 per site in the outgroups and < 1000 per site in the P. aurelia species) were discarded.
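The split-and-stack procedure can be sketched as below. The example assumes each region is supplied as a list of per-site expected heterozygosity values ordered 5′ to 3′, and that for odd lengths the middle site is assigned to the 3′ half; the function name and toy values are hypothetical.

```python
from collections import defaultdict


def mean_profile(regions, min_points):
    """Average per-site values by distance from each end of a region.

    regions: list of lists of per-site values, each ordered 5' to 3'.
    min_points: positions supported by fewer regions than this are discarded.
    Returns (from_5prime, from_3prime): dicts mapping distance -> mean value.
    """
    left, right = defaultdict(list), defaultdict(list)
    for values in regions:
        half = len(values) // 2
        # First half is indexed by distance from the 5' end.
        for distance, value in enumerate(values[:half]):
            left[distance].append(value)
        # Second half is indexed by distance from the 3' end.
        for distance, value in enumerate(reversed(values[half:])):
            right[distance].append(value)

    def summarize(bins):
        return {
            distance: sum(vals) / len(vals)
            for distance, vals in sorted(bins.items())
            if len(vals) >= min_points
        }

    return summarize(left), summarize(right)


toy_regions = [[0.1, 0.2, 0.3, 0.4], [0.3, 0.4]]
from_5prime, from_3prime = mean_profile(toy_regions, min_points=2)
print(from_5prime, from_3prime)
```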

Calculating and Plotting Divergence across Introns and Intergenic Regions

Orthologous coding sequences, along with introns, were aligned using MUSCLE (Edgar 2004a). Intergenic regions flanked by orthologous genes on both sides were also aligned, along with the flanking coding exon, using MUSCLE. For each species separately, divergence at different positions (with respect to the start and stop sites) was averaged across all introns and intergenic regions. The divergence values were corrected for multiple substitutions using the Jukes-Cantor substitution model and plotted in the same manner as described above.
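The Jukes-Cantor correction converts an observed proportion of differing sites p into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). A minimal sketch follows; the function name is hypothetical.

```python
import math


def jukes_cantor(p_observed):
    """Jukes-Cantor corrected distance from the observed proportion of differences.

    The correction is undefined for p >= 0.75, where sequences are saturated.
    """
    if p_observed < 0.0 or p_observed >= 0.75:
        raise ValueError("observed divergence must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p_observed / 3.0)


# Example: 10% of aligned sites differ.
d_corrected = jukes_cantor(0.10)
print(d_corrected)
```

The corrected value always exceeds the observed proportion, and the gap widens as divergence increases.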

GO Analysis

All gene ontology enrichment analyses were performed using the BiNGO plugin for Cytoscape 3.2.1 (Shannon et al. 2003; Maere et al. 2005), using the hypergeometric test to assess significance with Benjamini and Hochberg FDR correction for multiple testing (Benjamini and Hochberg 1995). The latest ontology files (.obo) were downloaded from www.geneontology.org, and custom annotation files were prepared as required using python scripts.
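The two statistical steps named here can be illustrated with a standard-library sketch: an upper-tail hypergeometric test for over-representation followed by Benjamini-Hochberg adjustment. This is not BiNGO's implementation; the function names and example numbers are hypothetical.

```python
from math import comb


def hypergeom_sf(k, population, annotated, sample):
    """P(X >= k): probability of drawing at least k annotated genes
    when `sample` genes are drawn from `population`, of which
    `annotated` carry the GO term."""
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(population, sample)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


p_term = hypergeom_sf(1, 10, 5, 2)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(p_term, adjusted)
```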

Identification of Heterozygous Regions in the P. aurelia Species

The fraction of heterozygous sites was calculated for 10 kb non-overlapping windows across the genome for all three P. aurelia species, and the top 1% of windows with the highest fraction of heterozygous sites were identified as heterozygous regions.
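A minimal sketch of this windowing step is given below. It assumes a per-site list of 0/1 heterozygosity flags in genomic order for one individual; the function names, the tie handling, and the toy window size are hypothetical.

```python
import math


def het_fraction_by_window(het_flags, window):
    """Fraction of heterozygous sites in consecutive non-overlapping windows.

    het_flags: sequence of 0/1 values, one per site in genomic order.
    window: number of sites per window (the last window may be shorter).
    """
    fractions = []
    for start in range(0, len(het_flags), window):
        chunk = het_flags[start:start + window]
        fractions.append(sum(chunk) / len(chunk))
    return fractions


def top_windows(fractions, top_fraction=0.01):
    """Indices of the windows with the highest heterozygous fraction."""
    if not fractions:
        return []
    k = max(1, math.ceil(len(fractions) * top_fraction))
    ranked = sorted(range(len(fractions)), key=lambda i: fractions[i], reverse=True)
    return sorted(ranked[:k])


# Toy genome: 1000 sites, the last 10 of which are heterozygous.
flags = [0] * 990 + [1] * 10
window_fractions = het_fraction_by_window(flags, window=10)
print(top_windows(window_fractions))
```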

Mitochondrial Genome Assembly and Annotation

Mitochondrial genomes were assembled as follows. Sequence reads were assembled using SPAdes (version 3.5.0) (Bankevich et al. 2012) after removing potential adapter sequence with Trimmomatic (version 0.33) (Bolger et al. 2014). Mitochondrial contigs were identified from the resulting assemblies by BLAST searches against the published P. caudatum and P. tetraurelia mitochondrial genomes (Pritchard et al. 1990; Barth and Berendonk 2011).

Protein-coding genes were identified by generating all ORFs longer than 60 aa in all six frames using the Mold, Protozoan, and Coelenterate Mitochondrial Code (i.e., UGA codes for W instead of being a stop codon) and all alternative start codons specific to Paramecium (AUU, AUA, AUG, AUC, GUG, and GUA), and retaining the longest ORF associated with each stop codon. BLASTP (Altschul et al. 1997) was then used to identify homologs of annotated mitochondrial proteins in P. tetraurelia and P. caudatum. Additional ORFs were identified by imposing the requirement that their length exceed 100 aa and were subsequently annotated using BLASTP against the non-redundant protein sequences (nr) database and HMMER 3.0 (Eddy 2011) scans against the Pfam 27.0 database (Finn et al. 2014).
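The ORF search can be sketched as a six-frame scan over DNA. The example assumes that only TAA and TAG terminate translation (TGA encodes tryptophan under this code), uses the start codons listed in the text written as DNA, and keeps, for each stop codon, the ORF beginning at the most upstream in-frame start. The function names and coordinate convention (0-based, relative to the scanned strand, stop codon included) are hypothetical.

```python
START_CODONS = {"ATT", "ATA", "ATG", "ATC", "GTG", "GTA"}
STOP_CODONS = {"TAA", "TAG"}  # TGA is read as Trp under this genetic code


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def longest_orfs(seq, min_aa=60):
    """Return (strand, frame, start, end) for ORFs longer than min_aa codons.

    For each stop codon, only the ORF starting at the first in-frame start
    codon after the previous stop is reported (the longest one).
    """
    seq = seq.upper()
    orfs = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # first start codon seen since the last stop
            for i in range(frame, len(strand_seq) - 2, 3):
                codon = strand_seq[i:i + 3]
                if codon in STOP_CODONS:
                    if start is not None and (i - start) // 3 > min_aa:
                        orfs.append((strand, frame, start, i + 3))
                    start = None
                elif start is None and codon in START_CODONS:
                    start = i
    return orfs


# Toy sequence: ATG, 60 TGA (Trp) codons, then a TAA stop.
toy = "ATG" + "TGA" * 60 + "TAA"
print(longest_orfs(toy))
```

Under the standard genetic code the toy sequence would terminate at the first TGA, so it yields an ORF only because TGA is treated as a sense codon.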

tRNA genes were annotated with tRNAscan-SE 1.21 (Schattner et al. 2005) using the “Mito/Chloroplast” source.

Supplementary Material

Supplementary data are available at Molecular Biology and Evolution online.

Acknowledgments

We thank Lydia Bright and Jean-Francois Gout for helpful comments on the manuscript, Matthew S. Ackerman for insightful discussions related to the project, and the anonymous reviewers and editor for their valuable suggestions. We would also like to thank Kelley Thomas for providing help with the sequencing and E. Przybos for kindly providing several Paramecium strains and the species identification of some so far unspecified strains. This work was financially supported by the National Science Foundation (MCB-1050161 and DEB-1257806), the European Commission FP7-PEOPLE-2009-IRSES project CINAR PATHOBACTER (project number 247658), and the CNRS research network GDRE “Paramecium Genome Dynamics and Evolution.” T.U.B. also acknowledges funding of the “support the best” of the ZUK TU Dresden and the Deutsche Forschungsgemeinschaft (DFG).

ReferencesAbbott RJ Gomes MF 1989 Population genetic-structure and outcross-

ing rate of Arabidopsis-Thaliana (L) Heynh Heredity 62411ndash418Abecasis GR Altshuler D Auton A Brooks LD Durbin RM Gibbs RA

Hurles ME McVean GA 2010 A map of human genome variationfrom population-scale sequencing Nature 4671061ndash1073

Johri et al doi101093molbevmsx074 MBE

1212Downloaded from httpsacademicoupcommbearticle-abstract34511942992914by Arizona State University useron 02 July 2018

Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA. 2012. An integrated map of genetic variation from 1,092 human genomes. Nature 491:56–65.

Aberer AJ, Pattengale ND, Stamatakis A. 2010. Parallelized phylogenetic post-analysis on multi-core architectures. J Comput Sci-Neth. 1:107–114.

Ackermann AA, Panunzi LG, Cosentino RO, Sanchez DO, Aguero F. 2012. A genomic scale map of genetic diversity in Trypanosoma cruzi. BMC Genomics 13:736.

Altenhoff AM, Skunca N, Glover N, Train CM, Sueki A, Pilizota I, Gori K, Tomiczek B, Muller S, Redestig H, et al. 2015. The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Res. 43:D240–D249.

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389–3402.

Arnaiz O, Cain S, Cohen J, Sperling L. 2007. ParameciumDB: a community resource that integrates the Paramecium tetraurelia genome sequence with genetic data. Nucleic Acids Res. 35:D439–D444.

Arnaiz O, Mathy N, Baudry C, Malinsky S, Aury JM, Wilkes CD, Garnier O, Labadie K, Lauderdale BE, Le Mouel A, et al. 2012. The Paramecium germline genome provides a niche for intragenic parasitic DNA: evolutionary dynamics of internal eliminated sequences. PLoS Genet. 8:e1002984.

Arnaiz O, Sperling L. 2011. ParameciumDB in 2011: new tools and new data for functional and comparative genomics of the model ciliate Paramecium tetraurelia. Nucleic Acids Res. 39:D632–D636.

Assefa S, Lim C, Preston MD, Duffy CW, Nair MB, Adroub SA, Kadir KA, Goldberg JM, Neafsey DE, Divis P, et al. 2015. Population genomic structure and adaptation in the zoonotic malaria parasite Plasmodium knowlesi. Proc Natl Acad Sci U S A. 112:13027–13032.

Aury JM, Jaillon O, Duret L, Noel B, Jubin C, Porcel BM, Segurens B, Daubin V, Anthouard V, Aiach N, et al. 2006. Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature 444:171–178.

Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR. 2015. A global reference for human genetic variation. Nature 526:68–74.

Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, et al. 2012. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 19:455–477.

Barth D, Berendonk TU. 2011. The mitochondrial genome sequence of the ciliate Paramecium caudatum reveals a shift in nucleotide composition and codon usage within the genus Paramecium. BMC Genomics 12:272.

Barth D, Krenek S, Fokin SI, Berendonk TU. 2006. Intraspecific genetic variation in Paramecium revealed by mitochondrial cytochrome C oxidase I sequences. J Eukaryot Microbiol. 53:20–25.

Bazin E, Glemin S, Galtier N. 2006. Population size does not influence mitochondrial genetic diversity in animals. Science 312:570–572.

Beale G, Preer JR. 2008. Paramecium: genetics and epigenetics. Boca Raton (FL): CRC Press.

Begun DJ, Aquadro CF. 1992. Levels of naturally-occurring DNA polymorphism correlate with recombination rates in Drosophila melanogaster. Nature 356:519–520.

Begun DJ, Holloway AK, Stevens K, Hillier LW, Poh YP, Hahn MW, Nista PM, Jones CD, Kern AD, Dewey CN, et al. 2007. Population genomics: whole-genome analysis of polymorphism and divergence in Drosophila simulans. PLoS Biol. 5:e310.

Benjamini Y, Hochberg Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B Met. 57:289–300.

Betermier M. 2004. Large-scale genome remodelling by the developmentally programmed elimination of germ line sequences in the ciliate Paramecium. Res Microbiol. 155:399–408.

Birky CW, Walsh JB. 1988. Effects of linkage on rates of molecular evolution. Proc Natl Acad Sci U S A. 85:6414–6418.

Birnbaum RY, Clowney EJ, Agamy O, Kim MJ, Zhao J, Yamanaka T, Pappalardo Z, Clarke SL, Wenger AM, Nguyen L, et al. 2012. Coding exons function as tissue-specific enhancers of nearby genes. Genome Res. 22:1059–1068.

Boel G, Letso R, Neely H, Price WN, Wong KH, Su M, Luff JD, Valecha M, Everett JK, Acton TB, et al. 2016. Codon influence on protein expression in E. coli correlates with mRNA levels. Nature 529:358–363.

Bolger AM, Lohse M, Usadel B. 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30:2114–2120.

Brandvain Y, Wright SI. 2016. The limits of natural selection in a nonequilibrium world. Trends Genet. 32:201–210.

Brunk CF, Lee LC, Tran AB, Li J. 2003. Complete sequence of the mitochondrial genome of Tetrahymena thermophila and comparative methods for identifying highly divergent genes. Nucleic Acids Res. 31:1673–1682.

Burger G, Zhu Y, Littlejohn TG, Greenwood SJ, Schnare MN, Lang BF, Gray MW. 2000. Complete sequence of the mitochondrial genome of Tetrahymena pyriformis and comparison with Paramecium aurelia mitochondrial DNA. J Mol Biol. 297:365–380.

Burki F. 2014. The eukaryotic tree of life from a global phylogenomic perspective. Cold Spring Harb Perspect Biol. 6:a016147.

Cao J, Schneeberger K, Ossowski S, Gunther T, Bender S, Fitz J, Koenig D, Lanz C, Stegle O, Lippert C, et al. 2011. Whole-genome sequencing of multiple Arabidopsis thaliana populations. Nat Genet. 43:956–963.

Catania F, Wurmser F, Potekhin AA, Przybos E, Lynch M. 2009. Genetic diversity in the Paramecium aurelia species complex. Mol Biol Evol. 26:421–431.

Chalker DL, Yao MC. 2011. DNA elimination in ciliates: transposon domestication and genome surveillance. Annu Rev Genet. 45:227–246.

Charlesworth D. 2003. Effects of inbreeding on the genetic diversity of populations. Philos T Roy Soc B. 358:1051–1070.

Charlesworth B, Morgan MT, Charlesworth D. 1993. The effect of deleterious mutations on neutral molecular variation. Genetics 134:1289–1303.

Chia JM, Song C, Bradbury PJ, Costich D, de Leon N, Doebley J, Elshire RJ, Gaut B, Geller L, Glaubitz JC, et al. 2012. Maize HapMap2 identifies extant variation from a genome in flux. Nat Genet. 44:803–807.

Comeron JM, Williford A, Kliman RM. 2008. The Hill-Robertson effect: evolutionary consequences of weak selection and linkage in finite populations. Heredity 100:19–31.

Dey A, Chan CK, Thomas CG, Cutter AD. 2013. Molecular hyperdiversity defines populations of the nematode Caenorhabditis brenneri. Proc Natl Acad Sci U S A. 110:11056–11060.

Dong X, Navratilova P, Fredman D, Drivenes O, Becker TS, Lenhard B. 2010. Exonic remnants of whole-genome duplication reveal cis-regulatory function of coding exons. Nucleic Acids Res. 38:1071–1085.

Downing T, Imamura H, Decuypere S, Clark TG, Coombs GH, Cotton JA, Hilley JD, de Doncker S, Maes I, Mottram JC, et al. 2011. Whole genome sequencing of multiple Leishmania donovani clinical isolates provides insights into population structure and mechanisms of drug resistance. Genome Res. 21:2143–2156.

Drummond DA, Bloom JD, Adami C, Wilke CO, Arnold FH. 2005. Why highly expressed proteins evolve slowly. Proc Natl Acad Sci U S A. 102:14338–14343.

Eddy SR. 2011. Accelerated profile HMM searches. PLoS Comput Biol. 7:e1002195.

Edgar RC. 2004a. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5:113.

Edgar RC. 2004b. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32:1792–1797.

Eichenlaub MP, Ettwiller L. 2011. De novo genesis of enhancers in vertebrates. PLoS Biol. 9:e1001188.

Ellison CE, Hall C, Kowbel D, Welch J, Brem RB, Glass NL, Taylor JW. 2011. Population genomics and local adaptation in wild isolates of a model microbial eukaryote. Proc Natl Acad Sci U S A. 108:2831–2836.

Elyashiv E, Bullaughey K, Sattath S, Rinott Y, Przeworski M, Sella G. 2010. Shifts in the intensity of purifying selection: an analysis of genome-wide polymorphism data from two closely related yeast species. Genome Res. 20:1558–1573.

Engelthaler DM, Hicks ND, Gillece JD, Roe CC, Schupp JM, Driebe EM, Gilgado F, Carriconde F, Trilles L, Firacative C, et al. 2014. Cryptococcus gattii in North American Pacific Northwest: whole-population genome analysis provides insights into species evolution and dispersal. mBio 5:e01464–e01414.

Fawcett JA, Iida T, Takuno S, Sugino RP, Kado T, Kugou K, Mura S, Kobayashi T, Ohta K, Nakayama J, et al. 2014. Population genomics of the fission yeast Schizosaccharomyces pombe. PLoS ONE 9:e104241.

Fenchel T, Finlay BJ. 2004. The ubiquity of small species: patterns of local and global diversity. Bioscience 54:777–784.

Finlay BJ. 2002. Global dispersal of free-living microbial eukaryote species. Science 296:1061–1063.

Finlay BJ, Esteban GF, Fenchel T. 2004. Protist diversity is different? Protist 155:15–22.

Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J, et al. 2014. Pfam: the protein families database. Nucleic Acids Res. 42:D222–D230.

Foissac S, Gouzy J, Rombauts S, Mathe C, Amselem J, Sterck L, Van de Peer Y, Rouze P, Schiex T. 2008. Genome annotation in plants and fungi: EuGene as a model platform. Curr Bioinform. 3:87–97.

Foissner W. 1999. Protist diversity: estimates of the near-imponderable. Protist 150:363–368.

Foissner W. 2006. Biogeography and dispersal of micro-organisms: a review emphasizing protists. Acta Protozool. 45:111–136.

Foissner W. 2008. Protist diversity and distribution: some basic considerations. Biodivers Conserv. 17:235–242.

Foissner W. 2011. Dispersal of protists: the role of cysts and human introductions. In: Fontaneto D, editor. Biogeography of microscopic organisms: is everything small everywhere? Cambridge University Press. p. 61–87.

Foissner W, Chao A, Katz LA. 2008. Diversity and geographic distribution of ciliates (Protista: Ciliophora). Biodivers Conserv. 17:345–363.

Fournier-Level A, Korte A, Cooper MD, Nordborg M, Schmitt J, Wilczek AM. 2011. A map of local adaptation in Arabidopsis thaliana. Science 334:86–89.

Galtier N, Nabholz B, Glemin S, Hurst GDD. 2009. Mitochondrial DNA as a marker of molecular diversity: a reappraisal. Mol Ecol. 18:4541–4550.

Gout JF, Kahn D, Duret L, Paramecium Post-Genomics C. 2010. The relationship among gene expression, the evolution of gene dosage, and the rate of protein evolution. PLoS Genet. 6:e1000944.

Gout JF, Lynch M. 2015. Maintenance and loss of duplicated genes by dosage subfunctionalization. Mol Biol Evol. 32:2141–2148.

Gruchy DF. 1955. The breeding system and distribution of Tetrahymena pyriformis. J Protozool. 2:178–185.

Gustincich S, Manfioletti G, Del Sal G, Schneider C, Carninci P. 1991. A fast method for high-quality genomic DNA extraction from whole human blood. BioTechniques 11:298–300, 302.

Hancock AM, Witonsky DB, Alkorta-Aranburu G, Beall CM, Gebremedhin A, Sukernik R, Utermann G, Pritchard JK, Coop G, Di Rienzo A. 2011. Adaptations to climate-mediated selective pressures in humans. PLoS Genet. 7:e1001375.

Hiwatashi K. 1968. Determination and inheritance of mating type in Paramecium caudatum. Genetics 58:373–386.

Hori M, Tomikawa I, Przybos E, Fujishima M. 2006. Comparison of the evolutionary distances among syngens and sibling species of Paramecium. Mol Phylogenet Evol. 38:697–704.

Huang X, Kurata N, Wei X, Wang ZX, Wang A, Zhao Q, Zhao Y, Liu K, Lu H, Li W, et al. 2012. A map of rice genome variation reveals the origin of cultivated rice. Nature 490:497–501.

Huang X, Zhao Q, Han B. 2015. Comparative population genomics reveals strong divergence and infrequent introgression between Asian and African rice. Mol Plant 8:958–960.

Hudson RR. 2002. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18:337–338.

Hudson RR, Turelli M. 2003. Stochasticity overrules the "three-times rule": genetic drift, genetic draft, and coalescence times for nuclear loci versus mitochondrial DNA. Evolution 57:182–190.

Jaillon O, Bouhouche K, Gout JF, Aury JM, Noel B, Saudemont B, Nowacki M, Serrano V, Porcel BM, Segurens B, et al. 2008. Translational control of intron splicing in eukaryotes. Nature 451:359–362.

James JE, Piganeau G, Eyre-Walker A. 2016. The rate of adaptive evolution in animal mitochondria. Mol Ecol. 25:67–78.

Joy DA, Feng X, Mu J, Furuya T, Chotivanich K, Krettli AU, Ho M, Wang A, White NJ, Suh E, et al. 2003. Early origin and recent expansion of Plasmodium falciparum. Science 300:318–321.

Kaplan NL, Hudson RR, Langley CH. 1989. The hitchhiking effect revisited. Genetics 123:887–899.

Katz LA, McManus GB, Snoeyenbos-West OLO, Griffin A, Pirog K, Costas B, Foissner W. 2005. Reframing the 'Everything is everywhere' debate: evidence for high gene flow and diversity in ciliate morphospecies. Aquat Microb Ecol. 41:55–65.

Katz LA, Snoeyenbos-West O, Doerder FP. 2006. Patterns of protein evolution in Tetrahymena thermophila: implications for estimates of effective population size. Mol Biol Evol. 23:608–614.

Krenek S, Berendonk TU, Fokin SI. 2015. New Paramecium (Ciliophora, Oligohymenophorea) congeners shape our view on its biodiversity. Org Divers Evol. 15:215–233.

Krenek S, Berendonk TU, Petzoldt T. 2011. Thermal performance curves of Paramecium caudatum: a model selection approach. Eur J Protistol. 47:124–137.

Krenek S, Petzoldt T, Berendonk TU. 2012. Coping with temperature at the warm edge–patterns of thermal adaptation in the microbial eukaryote Paramecium caudatum. PLoS One 7:e30598.

Kudla G, Murray AW, Tollervey D, Plotkin JB. 2009. Coding-sequence determinants of gene expression in Escherichia coli. Science 324:255–258.

Kvitek DJ, Will JL, Gasch AP. 2008. Variations in stress sensitivity and genomic expression in diverse S. cerevisiae isolates. PLoS Genet. 4:e1000223.

Lahr DJ, Laughinghouse HDt, Oliverio AM, Gao F, Katz LA. 2014. How discordant morphological and molecular evolution among microorganisms can revise our notions of biodiversity on Earth. Bioessays 36:950–959.

Langley CH, Stevens K, Cardeno C, Lee YC, Schrider DR, Pool JE, Langley SA, Suarez C, Corbett-Detig RB, Kolaczkowski B, et al. 2012. Genomic variation in natural populations of Drosophila melanogaster. Genetics 192:533–598.

Leffler EM, Bullaughey K, Matute DR, Meyer WK, Segurel L, Venkat A, Andolfatto P, Przeworski M. 2012. Revisiting an old riddle: what determines genetic diversity levels within species? PLoS Biol. 10:e1001388.

Li H. 2011. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27:2987–2993.

Li H, Durbin R. 2010. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26:589–595.

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Genome Project Data Processing S. 2009. The sequence alignment/map format and SAMtools. Bioinformatics 25:2078–2079.

Lim LP, Burge CB. 2001. A computational analysis of sequence features involved in recognition of short introns. Proc Natl Acad Sci U S A. 98:11193–11198.

Liti G, Carter DM, Moses AM, Warringer J, Parts L, James SA, Davey RP, Roberts IN, Burt A, Koufopanou V, et al. 2009. Population genomics of domestic and wild yeasts. Nature 458:337–341.

Loveless MD, Hamrick JL. 1984. Ecological determinants of genetic-structure in plant-populations. Annu Rev Ecol Syst. 15:65–95.

Lynch M. 2007. The origins of genome architecture. Sunderland (MA): Sinauer Associates.

Lynch M. 2009. Estimation of allele frequencies from high-coverage genome-sequencing projects. Genetics 182:295–301.

Lynch M, Field MC, Goodson HV, Malik HS, Pereira-Leal JB, Roos DS, Turkewitz AP, Sazer S. 2014. Evolutionary cell biology: two origins, one objective. Proc Natl Acad Sci U S A. 111:16990–16994.

Lynch M, Sung W, Morris K, Coffey N, Landry CR, Dopman EB, Dickinson WJ, Okamoto K, Kulkarni S, Hartl DL, et al. 2008. A genome-wide view of the spectrum of spontaneous mutations in yeast. Proc Natl Acad Sci U S A. 105:9272–9277.

Maere S, Heymans K, Kuiper M. 2005. BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics 21:3448–3449.

Maguire B, Belk D. 1967. Paramecium transport by land snails. J Protozool. 14:445–447.

Maruki T, Lynch M. 2015. Genotype-frequency estimation from high-throughput sequencing data. Genetics 201:473–486.

McGrath CL, Gout JF, Doak TG, Yanagi A, Lynch M. 2014a. Insights into three whole-genome duplications gleaned from the Paramecium caudatum genome sequence. Genetics 197:1417–1428.

McGrath CL, Gout JF, Johri P, Doak TG, Lynch M. 2014b. Differential retention and divergent resolution of duplicate genes following whole-genome duplication. Genome Res. 24:1665–1675.

Miller JR, Delcher AL, Koren S, Venter E, Walenz BP, Brownley A, Johnson J, Li K, Mobarry C, Sutton G. 2008. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics 24:2818–2824.

Moradian MM, Beglaryan D, Skozylas JM, Kerikorian V. 2007. Complete mitochondrial genome sequence of three Tetrahymena species reveals mutation hot spots and accelerated nonsynonymous substitutions in Ymf genes. PLoS ONE 2:e650.

Nabholz B, Glemin S, Galtier N. 2009. The erratic mitochondrial clock: variations of mutation rate, not population size, affect mtDNA diversity across birds and mammals. BMC Evol Biol. 9:54.

Nabholz B, Mauffrey JF, Bazin E, Galtier N, Glemin S. 2008. Determination of mitochondrial genetic diversity in mammals. Genetics 178:351–361.

Neafsey DE, Galinsky K, Jiang RH, Young L, Sykes SM, Saif S, Gujja S, Goldberg JM, Young S, Zeng Q, et al. 2012. The malaria parasite Plasmodium vivax exhibits greater genetic diversity than Plasmodium falciparum. Nat Genet. 44:1046–1050.

Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, Indap A, King KS, Bergmann S, Nelson MR, et al. 2008. Genes mirror geography within Europe. Nature 456:98–101.

Ohta T. 1974. Mutational pressure as main cause of molecular evolution and polymorphism. Nature 252:351–354.

Ohta T. 1992. The nearly neutral theory of molecular evolution. Annu Rev Ecol Syst. 23:263–286.

Pawlowski J, Audic S, Adl S, Bass D, Belbahri L, Berney C, Bowser SS, Cepicka I, Decelle J, Dunthorn M, et al. 2012. CBOL protist working group: barcoding eukaryotic richness beyond the animal, plant, and fungal kingdoms. PLoS Biol. 10:e1001419.

Piganeau G, Eyre-Walker A. 2009. Evidence for variation in the effective population size of animal mitochondrial DNA. PLoS One 4:e4396.

Pool JE, Nielsen R. 2007. Population size changes reshape genomic patterns of diversity. Evolution 61:3001–3006.

Pophaly SD, Tellier A. 2015. Population level purifying selection and gene expression shape subgenome evolution in maize. Mol Biol Evol. 32:3226–3235.

Prescott DM. 1994. The DNA of ciliated protozoa. Microbiol Rev. 58:233–267.

Pritchard AE, Seilhamer JJ, Mahalingam R, Sable CL, Venuti SE, Cummings DJ. 1990. Nucleotide sequence of the mitochondrial genome of Paramecium. Nucleic Acids Res. 18:173–180.

Przybos E. 1975. Genetic studies of Paramecium jenningsi strains (Diller, Earl, 1958). Folia Biol (Krakow). 23:425–471.

Przybos E. 2005. Recent data on the occurrence of species of the Paramecium aurelia complex in Europe. Folia Biol-Krakow. 53:61–63.

Przybos E, Barth D, Berendonk TU. 2008. The Paramecium aurelia species complex, frequency and co-occurrence across Europe. Folia Biol (Krakow). 56:77–81.

Przybos E, Rautian M, Greczek-Stachura M, Potekhin A. 2007. Polymorphism within Paramecium sexaurelia (Ciliophora, Oligohymenophorea) and description of a new stand of the species in China. Folia Biol (Krakow). 55:121–125.

Przybos E, Surmacz M. 2010. New, world-wide data on the distribution of species of the Paramecium aurelia complex (Ciliophora, Protozoa). Folia Biol-Krakow. 58:185–188.

R-Core-Team. 2014. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

Redden H, Alper HS. 2015. The development and characterization of synthetic minimal yeast promoters. Nat Commun. 6:7810.

Ritter DI, Dong Z, Guo S, Chuang JH. 2012. Transcriptional enhancers in protein-coding exons of vertebrate developmental genes. PLoS One 7:e35202.

Russell CB, Fraga D, Hinrichsen RD. 1994. Extremely short 20-33 nucleotide introns are the standard length in Paramecium tetraurelia. Nucleic Acids Res. 22:1221–1225.

Salim HMW, Ring KL, Cavalcanti ARO. 2008. Patterns of codon usage in two ciliates that reassign the genetic code: Tetrahymena thermophila and Paramecium tetraurelia. Protist 159:283–298.

Schacherer J, Shapiro JA, Ruderfer DM, Kruglyak L. 2009. Comprehensive polymorphism survey elucidates population structure of Saccharomyces cerevisiae. Nature 458:342–345.

Schattner P, Brooks AN, Lowe TM. 2005. The tRNAscan-SE, snoscan and snoGPS web servers for the detection of tRNAs and snoRNAs. Nucleic Acids Res. 33:W686–W689.

Schrider DR, Kern AD. 2015. Inferring selective constraint from population genomic data suggests recent regulatory turnover in the human brain. Genome Biol Evol. 7:3511–3528.

Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. 2003. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13:2498–2504.

Simonsen KL, Churchill GA, Aquadro CF. 1995. Properties of statistical tests of neutrality for DNA polymorphism data. Genetics 141:413–429.

Smith DR, Lee RW. 2008. Nucleotide diversity in the mitochondrial and nuclear compartments of Chlamydomonas reinhardtii: investigating the origins of genome architecture. BMC Evol Biol. 8:156.

Smith JM, Haigh J. 1974. The hitch-hiking effect of a favourable gene. Genetics Res. 89:391–403.

Snoke MS, Berendonk TU, Barth D, Lynch M. 2006. Large global effective population sizes in Paramecium. Mol Biol Evol. 23:2474–2479.

Sonneborn TM. 1970. Methods in Paramecium research. In: Methods Cell Biol. New York and London: Academic Press. p. 241–339.

Sonneborn TM. 1975. Paramecium aurelia complex of fourteen sibling species. T Am Microsc Soc. 94:155–178.

Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30:1312–1313.

Stoeck T, Welter H, Seitz-Bender D, Kusch J, Schmidt HJ. 2000. ARDRA and RAPD-fingerprinting reject the sibling species concept for the ciliate Paramecium caudatum (Ciliophora, Protoctista). Zool Scr. 29:75–82.

Sun M. 1999. Cleistogamy in Scutellaria indica (Labiatae): effective mating system and population genetic structure. Mol Ecol. 8:1285–1295.

Swart EC, Nowacki M, Shum J, Stiles H, Higgins BP, Doak TG, Schotanus K, Magrini VJ, Minx P, Mardis ER, et al. 2012. The Oxytricha trifallax mitochondrial genome. Genome Biol Evol. 4:136–154.

Tajima F. 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123:585–595.

Tishkoff SA, Reed FA, Ranciaro A, Voight BF, Babbitt CC, Silverman JS, Powell K, Mortensen HM, Hirbo JB, Osman M, et al. 2007. Convergent adaptation of human lactase persistence in Africa and Europe. Nat Genet. 39:31–40.

Tsukii Y, Hiwatashi K. 1983. Genes controlling mating-type specificity in Paramecium caudatum: three loci revealed by intersyngenic crosses. Genetics 104:41–62.

Tsukii Y, Hiwatashi K. 1985. Meiotic nondisjunction and aneuploids in intersyngenic hybrids of Paramecium caudatum. Genetics 111:779–794.

Tuller T, Zur H. 2015. Multiple roles of the coding sequence 5′ end in gene expression regulation. Nucleic Acids Res. 43:13–28.

Voight BF, Kudaravalli S, Wen X, Pritchard JK. 2006. A map of recent positive selection in the human genome. PLoS Biol. 4:e72.

Volkman SK, Sabeti PC, DeCaprio D, Neafsey DE, Schaffner SF, Milner DA Jr, Daily JP, Sarr O, Ndiaye D, Ndir O, et al. 2007. A genome-wide map of diversity in Plasmodium falciparum. Nat Genet. 39:113–119.

Weir W, Capewell P, Foth B, Clucas C, Pountain A, Steketee P, Veitch N, Koffi M, De Meeus T, Kabore J, et al. 2016. Population genomics reveals the origin and asexual evolution of human infective trypanosomes. eLife 5:e11473.

Weisse T. 2008. Distribution and diversity of aquatic protists: an evolutionary and ecological perspective. Biodivers Conserv. 17:243–259.

Wickham H. 2010. ggplot2: Elegant graphics for data analysis. New York: Springer-Verlag.

Zhang JZ, Yang JR. 2015. Determinants of the rate of protein sequence evolution. Nat Rev Genet. 16:409–420.

fact that, on average, intergenic regions in the P. aurelia species are longer. A possible explanation is the nonfunctionalization of genes following the two WGDs, which would increase the amount of non-functional DNA between the two flanking functional genes and thus reduce the fraction of intergenic sites under constraint. Among the P. aurelia species, P. sexaurelia shows much higher genetic variation across the genome, consistent with previous studies (Przybos et al. 2007) that found significant genotypic differences between P. sexaurelia strains. This is also consistent with the relatively higher divergence of P. sexaurelia (McGrath et al. 2014b), suggesting that the differences in genome-wide levels of sequence variation could be driven by slight differences in mutation rate.

Overall, 4-fold degenerate sites have the highest levels of nucleotide diversity across all elements of the genome for all species, and thus might represent the most neutrally evolving sites in the Paramecium macronucleus. The diversity values reported for P. aurelia species fall within the range of known diversity levels for other unicellular eukaryotes such as Plasmodium falciparum (4-fold degenerate π = 0.0015) (Volkman et al. 2007), Chlamydomonas reinhardtii (silent site π = 0.0319) (Smith and Lee 2008), and Tetrahymena thermophila (coding π = 0.0020) (Katz et al. 2006). On the other hand, the values of 4-fold degenerate diversity in the outgroup species P. caudatum and P. multimicronucleatum are among the highest reported (Leffler et al. 2012; Dey et al. 2013). It is still not entirely clear if such high diversity values are due to a large panmictic global population or because of population structure despite infrequent gene flow. Sampling individuals from the same pond will help dissect these two scenarios. However, we cannot presently rule out other factors, such as the presence of cryptic species among the outgroups, introgression with closely-related species, or a much higher intrinsic mutation rate, that could also potentially contribute to these remarkably high levels of variation.

Intergenic Regions in the Outgroups are Under Strong Constraints
The macronuclear genomes of all Paramecium species are extremely compact (fig. 3a and b), with the median intergenic region (non-coding DNA between start and start/stop sites of adjacent genes) in P. caudatum and P. multimicronucleatum comprising 43 and 93 bp, respectively. The aurelia species have slightly longer intergenic regions (median of 161 bp, 195 bp, and 229 bp in P. tetraurelia, P. biaurelia, and P. sexaurelia, respectively). Paramecium also has some of the shortest eukaryotic introns, with an average size of just 20–25 base pairs (Russell et al. 1994). The internally eliminated sequences (IESs), which are excised during the formation of the macronucleus and are well characterized only in P. tetraurelia, have a median length of 26–28 bp (Arnaiz et al. 2012). In fact, 93% of the IESs in P. tetraurelia are shorter than 150 bp (Arnaiz et al. 2012), suggesting that the micronuclear genomes of the Paramecium species are also fairly compact. In most organisms, diversity in coding regions is well below the diversity in intergenic and intronic regions. For instance, in S. cerevisiae, which has a fairly compact genome, the non-coding to coding diversity ratio is 1.23 (Schacherer et al. 2009). In S. pombe, the coding heterozygosity is 0.2122 and diversity at intergenic regions (including 5′ and 3′ UTRs) is 0.3351, resulting in an intergenic to coding diversity ratio of 1.58 (calculated from Fawcett et al. 2014).

In the P. aurelia species, nucleotide diversity in coding regions is ~50% of the intergenic diversity, while for the outgroup species, intergenic regions have much less diversity than the coding regions (table 1). In order to reduce biases that might be the result of differences in sequence depth of coverage across different sites, we compared expected heterozygosity at coding and intergenic regions using only sites that were covered by all individuals and had coverage very close to the median population coverage (±20 reads) (fig. 3c). While the ratio of intergenic to coding diversity in the P. aurelia species is comparable to that in other organisms (1.66 in P. tetraurelia, 1.64 in P. biaurelia, and 1.69 in P. sexaurelia), in the outgroup species the ratio is 0.58 [subset: 0.74] for P. caudatum and 0.80 [subset: 0.75] for P. multimicronucleatum. Hereafter, the values in square brackets correspond to those belonging to the subsets. We know of no other species with such low nucleotide diversity in intergenic relative to coding regions, suggesting strong conservation of intergenic sequence.
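The coverage-filtered comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used in the study: the per-site record layout (`coverage`, `n_called`, `counts`), the function names, the toy values, and the use of the simple unbiased heterozygosity estimator are all hypothetical choices made for the example.

```python
def site_heterozygosity(counts):
    """Unbiased expected heterozygosity at one site from allele counts."""
    n = sum(counts)
    if n < 2:
        return 0.0
    # Probability that two alleles drawn without replacement are identical.
    same = sum(c * (c - 1) for c in counts) / (n * (n - 1))
    return 1.0 - same


def mean_diversity(sites, n_individuals, median_cov, window=20):
    """Mean per-site diversity over sites that pass the coverage filter.

    A site is kept only if every individual contributed a call and its
    depth lies within `window` reads of the population median depth.
    """
    kept = [
        s for s in sites
        if s["n_called"] == n_individuals
        and abs(s["coverage"] - median_cov) <= window
    ]
    if not kept:
        return float("nan")
    return sum(site_heterozygosity(s["counts"]) for s in kept) / len(kept)


# Hypothetical toy data: 10 individuals, median population coverage of 100.
coding = [
    {"coverage": 100, "n_called": 10, "counts": [5, 5]},
    {"coverage": 110, "n_called": 10, "counts": [10]},
    {"coverage": 300, "n_called": 10, "counts": [5, 5]},  # depth too high
]
intergenic = [
    {"coverage": 95, "n_called": 10, "counts": [9, 1]},
    {"coverage": 105, "n_called": 10, "counts": [10]},
    {"coverage": 100, "n_called": 8, "counts": [4, 4]},  # missing individuals
]

pi_coding = mean_diversity(coding, n_individuals=10, median_cov=100)
pi_intergenic = mean_diversity(intergenic, n_individuals=10, median_cov=100)
ratio = pi_intergenic / pi_coding  # values below 1 mirror the outgroup pattern
```

Filtering on both completeness and depth before averaging keeps the two site classes comparable, so a ratio below 1 is less likely to reflect uneven coverage.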

Despite accounting for bias in coverage, it is possible that if the reference genome is highly diverged from the sequenced individuals (as in our case), say at intergenic regions, those reads might fail to map and hence result in a lower estimate of intergenic diversity. We therefore performed simulations to address mapping and SNP detection when the reference genome is highly diverged from individuals in the population. Our simulations show that up to 10% divergence from the reference, SNP detection is remarkably accurate. At 10% divergence from the reference, 93% of true SNPs can be detected, and estimated divergence (9.5%) is only slightly lower than true divergence (10%) (supplementary fig. S4, supplementary table S4, Supplementary Material online). However, reads obtained from genomes that were more than 10% divergent from the reference failed to map successfully, and there is a steep reduction in the fraction of true SNPs detected, with only ~10% of the true SNPs detected at 20% divergence from the reference (supplementary fig. S4, Supplementary Material online). Most of our strains in P. caudatum are far below 10% divergence (including coding and non-coding regions separately) (supplementary table S5, Supplementary Material online), and thus overall analyses across the majority of the genome should not be affected by this bias. In P. multimicronucleatum, although the divergent strains are close to being 10% divergent from the reference (supplementary table S5, Supplementary Material online), when we consider only the P. multimicronucleatum subset, all strains are less than 6% divergent from the reference. Therefore, it is possible that such a bias is affecting our results when we include all the strains in the outgroup species, but analyses with the subsets should be minimally influenced by this bias. Nonetheless, it should be noted that due to heterogeneity of divergence across the genome, some regions of the genome might still be affected by such a mapping bias. Accordingly, we repeated the above analysis by calling SNPs using only the 4 and 5 closest individuals to the reference genome (forming polytomies in supplementary fig. S1, Supplementary Material online) in P. caudatum and P. multimicronucleatum, respectively. These individuals were even closer to the reference genome, with average expected heterozygosity at 4-fold degenerate sites of 0.04302 and 0.06255, respectively. Again, we observe that for all sites that are covered by all individuals and have coverage very close to the median population coverage (±20 reads), the ratio of intergenic to coding diversity was 0.78 and 0.72, respectively. This suggests that the observation of relatively lower genetic variation in the intergenic relative to coding regions in the outgroup species is not an artifact of our analysis.

Table 1. Mean Nucleotide Diversity (per site) Estimated for All Isolates across the Genome (SE < 3.0 × 10⁻⁴ for all sites). "Intron" represents introns excluding the first three and last three sites. Intergenic region refers to the noncoding DNA that extends between the stop and start site of protein-coding genes.

Species                          4-fold  2-fold  0-fold  syn     non-syn  intron  Intron  intergenic  avg coding
P. tetraurelia                   0.0058  0.0033  0.0012  0.0042  0.0011   0.0026  0.0032  0.0045      0.0023
P. biaurelia                     0.0089  0.0049  0.0021  0.0059  0.0018   0.0049  0.0050  0.0071      0.0035
P. sexaurelia                    0.0269  0.0165  0.0061  0.0194  0.0052   0.0151  0.0167  0.0209      0.0113
P. caudatum                      0.1352  0.0819  0.0150  0.1056  0.0134   0.0466  0.0602  0.0350      0.0472
P. caudatum (subset)             0.0693  0.0381  0.0090  0.0518  0.0080   0.0316  0.0406  0.0262      0.0241
P. multimicronucleatum           0.1599  0.0781  0.0216  0.1146  0.0193   0.0405  0.0492  0.0337      0.0530
P. multimicronucleatum (subset)  0.0742  0.0418  0.0100  0.0549  0.0089   0.0249  0.0330  0.0201      0.0259

As an alternative approach to examine the relativestrength of selective constraint on coding versus non-coding DNA we examined the distribution of insertionsde-letions (indels) across the genome A total of 29096 (P

FIG. 3. The compactness of Paramecium genomes. (A, B) Distribution of intron and intergenic length across Paramecium genomes: (A) P. tetraurelia and (B) P. caudatum. Mean intron lengths in P. tetraurelia and P. caudatum: 25 and 22 bp; median intergenic length in P. tetraurelia and P. caudatum: 161 and 43 bp. (C) π_coding/π_intergenic calculated genome-wide from sites that had coverage very close to the median population coverage (±20 reads) and were contributed by the same number of individuals. (D) Incidence of indels across the genome for each Paramecium species, calculated by dividing the total number of indels found in that region by the total number of utilizable nucleotide sites in that region.

Johri et al. doi:10.1093/molbev/msx074 MBE

1200 Downloaded from https://academic.oup.com/mbe/article-abstract/34/5/1194/2992914 by Arizona State University user on 02 July 2018

tetraurelia), 54,406 (P. biaurelia), 88,724 (P. sexaurelia), 32,631 [28,776] (P. caudatum [subset]), and 31,338 [28,347] (P. multimicronucleatum [subset]) indels were found across the genomes, which translates to SNP/indel ratios of 19, 17, 34, 147 [77], and 138 [88], respectively. The indel lengths are exponentially distributed in all species, with a median of 1 bp. Indels are heavily under-represented in coding regions (fig. 3d), where the median indel length is 3 bp, suggesting purifying selection against indels that cause frame-shifts. Introns and intergenic regions, however, tend to have a median indel length of 1 bp (supplementary table S6, Supplementary Material online). Again, we see that in the P. aurelia species, there are proportionally more indels present in the intergenic regions than in introns. On the other hand, in the outgroup species, intergenic regions tolerate fewer indels than do intronic regions. This shows that the very short intergenic regions in P. caudatum and P. multimicronucleatum are under extremely strong constraint.
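The frame-shift argument above can be illustrated with a short sketch. It assumes only that an indel inside a coding sequence preserves the reading frame when its length is a multiple of 3; the function name and the toy length lists are hypothetical and are not data from the study.

```python
from statistics import median


def summarize_indels(lengths):
    """Return (median length, fraction of frame-shifting indels).

    `lengths` holds absolute indel lengths in bp. An indel is counted as
    frame-shifting when its length is not a multiple of 3.
    """
    if not lengths:
        raise ValueError("no indels supplied")
    frameshift = sum(1 for n in lengths if n % 3 != 0)
    return median(lengths), frameshift / len(lengths)


# Hypothetical toy data chosen only to mimic the qualitative pattern.
coding = [3, 3, 6, 1, 3, 9, 3]
intergenic = [1, 1, 2, 1, 4, 1, 3]

coding_median, coding_fs = summarize_indels(coding)
intergenic_median, intergenic_fs = summarize_indels(intergenic)
```

A coding-region median of 3 bp together with a low frame-shifting fraction is the signature the text interprets as purifying selection against frame-shifts.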

Spatial Patterns of Constraints across Non-Coding Regions in the Genomes

Selection in natural populations can remove strongly deleterious mutations, but weakly deleterious mutations can continue to persist in the population at lower allele frequencies (Ohta 1974, 1992). Thus, within-population nucleotide diversity can be used to infer constraints across the genome and possibly identify functional parts of the non-coding genome. We therefore evaluated spatial patterns of nucleotide diversity in exons flanking all introns and intergenic regions across the genome for each species separately (fig. 4; supplementary fig. S5, Supplementary Material online). However, neutral sites linked to functionally important sites also experience reduction in diversity as a result of either purging of deleterious alleles (background selection (Charlesworth et al. 1993)) or fixation of a beneficial mutation (hitchhiking (Smith and Haigh 1974; Kaplan et al. 1989)) at functional loci due to linkage (Comeron et al. 2008). As selection has a minimal effect on divergence at linked neutral sites (Birky and Walsh 1988), a comparison of polymorphism and divergence can distinguish between the effects of variation in the neutral mutation rate (which is determined by selective constraint in addition to mutational processes) versus linked selection (Begun and Aquadro 1992). To this end, we evaluated patterns of divergence across introns and intergenic regions for the relatively more closely related P. aurelia species.

We observed a reduction in both expected heterozygosity and divergence in intergenic regions near the start and end of coding sequences, as well as at some key sites within introns (figs. 4 and 5 and supplementary fig. S6, Supplementary Material online). In fact, divergence-normalized values of expected heterozygosity fail to exhibit any spatial pattern across these non-coding elements (fig. 5; supplementary fig. S6, Supplementary Material online), suggesting that these patterns are driven predominantly by negative selection and mutational processes (although see below) rather than linked selection. This is interesting because, although asexual species are expected to experience the effects of linked selection more strongly (Charlesworth et al. 1993), very compact genomes such as those in Paramecium might experience relatively more direct selection, as they have a higher proportion of functional sites.
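As a rough illustration of what a divergence-normalized heterozygosity profile involves, the sketch below bins sites by distance from a gene boundary and divides mean expected heterozygosity by mean divergence in each bin. The input layout, the 10-bp bin size, and the function names are assumptions made for the example; the study's actual pipeline is described in its Methods and is not reproduced here.

```python
def expected_heterozygosity(alt_count, n):
    """Unbiased expected heterozygosity at one biallelic site.

    alt_count: copies of the alternate allele; n: sampled chromosomes.
    """
    if n < 2:
        raise ValueError("need at least two sampled chromosomes")
    p = alt_count / n
    return 2 * p * (1 - p) * n / (n - 1)


def normalized_profile(sites, bin_size=10):
    """Mean heterozygosity divided by mean divergence per distance bin.

    sites: iterable of (distance_bp, alt_count, n, diverged) tuples, where
    `diverged` is 1 if the site differs from the outgroup and 0 otherwise.
    Bins with no divergent site map to None.
    """
    bins = {}
    for distance, alt_count, n, diverged in sites:
        key = distance // bin_size
        h_sum, d_sum, count = bins.get(key, (0.0, 0.0, 0))
        h_sum += expected_heterozygosity(alt_count, n)
        bins[key] = (h_sum, d_sum + diverged, count + 1)
    profile = {}
    for key, (h_sum, d_sum, count) in sorted(bins.items()):
        mean_h = h_sum / count
        mean_d = d_sum / count
        profile[key * bin_size] = mean_h / mean_d if mean_d > 0 else None
    return profile


# Hypothetical toy sites: (distance from start codon, alt count, n, diverged)
toy_sites = [(0, 1, 10, 0), (5, 0, 10, 1), (12, 5, 10, 1), (18, 0, 10, 1)]
profile = normalized_profile(toy_sites)
```

A flat profile across bins would be consistent with the reading in the text that constraint and mutation, rather than linked selection, shape the pattern.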

The first and last three sites of introns are highly conserved, even more so than 0-fold degenerate sites, observed in patterns of both divergence and nucleotide diversity (figs. 4 and 5), consistent with the role of these positions in intron splicing (Jaillon et al. 2008). In fact, it has been suggested that the first and last three sites in short eukaryotic introns might be sufficient for accurate splicing (Lim and Burge 2001; Jaillon et al. 2008). However, we find that in Paramecium, on average, 8–10 base pairs of coding DNA immediately flanking an intron are also under stronger constraints than the remainder of coding sequence, possibly suggesting the presence of splicing regulatory elements in the coding sequence. Another interesting observation is the clear decrease in divergence (fig. 5e and 5g) in the center of introns, suggesting that longer introns might possess functional elements related to gene regulation.

We also observe that average expected heterozygosity and divergence increase gradually moving upstream from the start codon and downstream from the stop codon into the intergenic regions, stabilizing at around 100–150 base pairs (fig. 4), suggesting the presence of regulatory modules or UTRs. All four types of intergenic regions with respect to the direction of transcription, namely (i) regions that are upstream to both flanking genes, (ii) those downstream of one gene and upstream of another, (iii) upstream of one gene and downstream of another, and (iv) downstream of both flanking genes, exhibit the same pattern of reduction in heterozygosity levels just before the start and after the stop codons (supplementary fig. S7, Supplementary Material online). This indicates that constraints on downstream intergenic regions are not an artifact of overlapping upstream intergenic regions. Instead, these observations suggest the presence of gene regulatory elements at the 3′ end of genes, either involved in posttranscriptional regulation, in the specification of transcription termination, or in the regulation of gene expression. We note that in the P. aurelia species, diversity levels in the intergenic regions stabilize to reach average diversity at 4-fold degenerate sites. In contrast, in the two outgroup species, diversity in intergenic regions stabilizes only up to the equivalent average diversity at 2-fold degenerate sites. Such differences in patterns of nucleotide diversity between the P. aurelia species and the outgroup species can be attributed to a much higher proportion of functionally important sites comprising the intergenic space in the outgroup species, possibly as a result of smaller numbers of bases between adjacent genes in the latter.

A similar reduction in diversity in intergenic regions extends up to 85 kb around transcription start/stop sites of genes in humans (Abecasis et al. 2010). In S. cerevisiae, intergenic regions are under strong purifying selection, and those shorter than 300 bp have significantly lower nucleotide diversity than longer intergenic regions (Schacherer et al. 2009).

Interestingly, we also observe, on average, a 70–100 bp (20–35 amino acids) decrease in diversity as well as divergence at both the 5′ and 3′ ends of the coding sequence (CDS) of protein-coding genes. Such reduction in diversity at 4-fold and 2-fold degenerate sites at the beginning and end of protein-


FIG. 4. Average expected heterozygosity (per site) within and flanking introns (left) and intergenic regions (right) in P. tetraurelia (a, b), P. biaurelia (c, d), P. sexaurelia (e, f), P. caudatum subset (g, h), and P. multimicronucleatum subset (i, j), separately for 4-fold degenerate sites (red), 2-fold degenerate sites (blue), 0-fold degenerate sites (green), and non-coding sites (black).


FIG. 5. Patterns of average divergence (per site) within and flanking intergenic regions (a, c) and introns (e, g) are shown in the left panels. Patterns of average expected heterozygosity per site normalized by average divergence per site within and flanking intergenic regions (b, d) and introns (f, h) are shown on the right. The divergence values are calculated between P. tetraurelia and P. biaurelia, separately for 4-fold degenerate sites (red), 2-fold degenerate sites (blue), 0-fold degenerate sites (green), and non-coding sites (black).


coding sequence can be explained by a number of different hypotheses. 5′ ends of CDSs might be required to maintain low thermodynamic stability of mRNA secondary structure near the start codon for efficient translation initiation. Alternatively, CDS 5′ ends might possess fewer tRNA-adapted codons to slow down translation initially, which has been proposed to reduce ribosome collision-induced premature translation terminations and facilitate protein production (see Tuller and Zur 2015). Consistent with this, the 5′ and 3′ ends of protein-coding genes in bacteria (32–35 codons) have been shown to be important for efficient translation (Kudla et al. 2009; Boel et al. 2016), and their mRNA folding energies can explain observed variation in expression levels (Kudla et al. 2009). In fact, it has been shown previously that the first few codons in proteins of P. tetraurelia and T. thermophila are, on average, comprised of significantly different numbers of effective codons than the rest of the protein (Salim et al. 2008). Thus, the role played by the first and last few bases of all coding sequences in translation might contribute to additional sequence constraints, resulting in reduced nucleotide diversity and divergence.

Another factor that might contribute to constraining the CDS ends is the presence of regulatory modules at the beginning and end of genes. Exonic enhancers have been observed in multiple species, including teleosts (Dong et al. 2010; Eichenlaub and Ettwiller 2011), humans (Birnbaum et al. 2012), and other vertebrates (Ritter et al. 2012), where they have been shown to regulate nearby genes. In the highly compact genomes of Paramecium, it is even more likely that regulatory regions extend beyond non-coding regions into the CDS. For instance, in P. caudatum, the median length of 5′ intergenic regions is 43 bp, which increases to 56 bp when both genes share the same 5′ intergenic region, suggesting the importance of upstream intergenic length in such compact genomes. Interestingly, 965 genes (7%) have 5′ intergenic regions shorter than 10 bp and 3,417 genes (25%) have 5′ intergenic regions shorter than 20 bp in P. caudatum. Similarly, in P. multimicronucleatum, the median length of the 5′ intergenic region is 90 bp, which increases to 127 bp when both genes share the same 5′ intergenic region. Again, there are 456 genes (3.5%) with less than 10 bp and 1,760 genes (13%) with less than 20 bases comprising their upstream intergenic region. In fact, in both species, the 5′ intergenic region can be as small as 3 bp (found in 38 genes in P. caudatum and 18 genes in P. multimicronucleatum). An interesting question this raises is what the minimum number of bases required to encode regulatory information about transcription of genes in eukaryotes might be. A recent study in yeast has now created a high-strength minimal constitutive promoter that is only 116 bp in length (Redden and Alper 2015). It might be possible to observe even smaller promoter regions needed to drive expression patterns in Paramecium, especially in P. caudatum.

Spatial Patterns of Nucleotide Diversity across Non-Coding Regions are Not Driven by Mutation Alone

Although we have assumed that the diversity-normalized divergence patterns indicate the effect of negative selection, it is possible that the observed spatial patterns of nucleotide diversity across and flanking non-coding regions are driven entirely by mutation alone. One way to potentially distinguish between mutation versus direct selection causing these patterns is to evaluate the allele frequency of the variants observed at different site types. If the reduction in nucleotide diversity near the start and end of coding sequences is mostly a consequence of negative selection, we would expect a higher fraction of rare alleles at 4-fold degenerate sites under stronger constraints relative to that seen at 4-fold degenerate sites not under constraints. Instead, if mutation is the predominant driver of the extent of nucleotide diversity at these sites, then there should be no difference between the fraction of rare alleles between the 4-fold degenerate sites under stronger negative selection and those not under constraints. Because Tajima's D (Tajima 1989) is a good statistic to summarize such a shift in rare alleles, we compared average Tajima's D statistic to evaluate this.
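To make the logic concrete, the following is a minimal sketch of Tajima's D computed from allele counts at segregating sites, using the standard constants from Tajima (1989). The function name and input format are assumptions for illustration, not the software used in the study.

```python
from math import sqrt


def tajimas_d(allele_counts, n):
    """Tajima's D for one region.

    allele_counts: for each segregating site, the number of sampled
    chromosomes carrying one of the two alleles (between 1 and n - 1).
    n: number of sampled chromosomes (at least 4, so the variance is > 0).
    """
    seg = len(allele_counts)
    if n < 4 or seg == 0:
        raise ValueError("need n >= 4 and at least one segregating site")
    if any(k < 1 or k > n - 1 for k in allele_counts):
        raise ValueError("allele counts must lie between 1 and n - 1")
    # Mean pairwise differences, summed over sites.
    pi = sum(2 * k * (n - k) / (n * (n - 1)) for k in allele_counts)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # Difference between the two estimators of theta, scaled by its SD.
    return (pi - seg / a1) / sqrt(e1 * seg + e2 * seg * (seg - 1))


# Toy regions with 10 chromosomes and 10 segregating sites each.
d_rare = tajimas_d([1] * 10, 10)    # every variant is a singleton
d_common = tajimas_d([5] * 10, 10)  # every variant is at 50% frequency
```

An excess of rare alleles, as expected under purifying selection, pushes D negative, whereas an excess of intermediate-frequency alleles pushes it positive.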

In each Paramecium species, we first characterized sites genome-wide into five different categories: (1) splice sites (first and last 3 sites) in introns, (2) intergenic sites within 50 bp of start/stop sites, (3) intergenic sites at least 150 bp away from start/stop sites (should exclude most UTR regions), (4) 4-fold degenerate sites within 50 bp of start/stop sites, and (5) 4-fold degenerate sites at least 150 bp away from start/stop sites and 30 bp away from splice sites. We expected category 5 sites to be the most neutral. Interestingly, in most species, Tajima's D at splice sites, intergenic sites close to start/stop sites, and 4-fold degenerate sites close to start/stop sites is significantly smaller than that of 4-fold degenerate and intergenic sites far away from any functional elements (see supplementary table S7, Supplementary Material online, for P-values). This suggests that sites in categories 1, 2, and 4 are under relatively stronger purifying selection. Interestingly, category 2 sites seem to possess significantly more rare alleles than category 3 sites, consistently in all species, suggesting that intergenic regions very close to start/stop sites are under stronger purifying selection than intergenic sites further away from gene ends. We also find Tajima's D at intergenic sites (category 3) to be significantly higher than that for 4-fold degenerate sites (category 5) in most species. This could either suggest that intergenic sites are under some form of positive/balancing selection or that silent sites are not completely neutral and in fact experience some purifying selection.

Gene-Based Analysis of Patterns of Synonymous and Non-Synonymous Diversity

Selective constraint can vary among genes within the same species, and it could also be species-specific. To examine such variation in our dataset, we evaluated the distribution of πn/πs across all genes (fig. 6a). If we assume that synonymous polymorphisms are fitness neutral, then πn/πs is a measure of the relative functional constraints on non-synonymous polymorphisms, and thus an indication of functional constraints on the protein. The majority of genes have πn/πs < 1.0, consistent with constraints on non-synonymous changes. About 98% of genes in P. tetraurelia, 96% in P. biaurelia, 96% in P.


sexaurelia, 99% in both P. caudatum and its subset, and 99% in both P. multimicronucleatum and its subset have πn/πs < 1. Upon comparing the average πn/πs for the same set of orthologous genes across all species, we find the mean to be 0.10 for P. tetraurelia, 0.17 for P. biaurelia, 0.24 for P. sexaurelia, 0.12 [0.12] for P. caudatum [subset], and 0.17 [0.14] for P. multimicronucleatum [subset]. Among these species, P. sexaurelia has the highest average πn/πs, which either suggests relaxed selection because of relatively small effective population size or more local adaptation. Local adaptation might result in fixation of more non-synonymous than synonymous differences between subpopulations, thus increasing πn/πs. Differences in πn/πs between species could also reflect genome-wide variation in the fraction of gene duplicates that are subfunctionalized, neofunctionalized, or nonfunctionalized. Alternatively, an elevation in average levels of πn/πs could be a consequence of depression in the average levels of πs, which could be caused by stronger selection on synonymous sites or a recent bottleneck (Brandvain and Wright 2016) experienced by the species. Evidence that genes with higher πn/πs ratios are under more relaxed constraints is provided by a comparison between the average number of frame-shift causing indels occurring in the four sets of genes with (1) πn/πs <= 0.25, (2) 0.25 < πn/πs <= 0.5, (3) 0.5 < πn/πs <= 1.0, and (4) πn/πs > 1.0. As expected, we find an increase in the number of indels per site with increase in πn/πs (fig. 6b).
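The four-way binning described above can be sketched as follows. The category boundaries follow the text; the dictionary keys, function names, and toy gene records are hypothetical, and genes with no synonymous diversity are simply skipped here, which is an assumption rather than the study's stated filter.

```python
def pn_ps_category(pi_n, pi_s):
    """Assign a gene to one of the four pi_n/pi_s bins, or None if pi_s is 0."""
    if pi_s <= 0:
        return None
    ratio = pi_n / pi_s
    if ratio <= 0.25:
        return 1
    if ratio <= 0.5:
        return 2
    if ratio <= 1.0:
        return 3
    return 4


def frameshift_indels_per_site(genes):
    """Frame-shifting indels per coding site, pooled within each bin.

    genes: iterable of dicts with keys 'pi_n', 'pi_s',
    'frameshift_indels', and 'coding_sites'.
    """
    totals = {}
    for gene in genes:
        category = pn_ps_category(gene["pi_n"], gene["pi_s"])
        if category is None:
            continue
        indels, sites = totals.get(category, (0, 0))
        totals[category] = (indels + gene["frameshift_indels"],
                            sites + gene["coding_sites"])
    return {c: indels / sites for c, (indels, sites) in sorted(totals.items())}


# Hypothetical toy genes, not values from the study.
toy_genes = [
    {"pi_n": 0.001, "pi_s": 0.01, "frameshift_indels": 0, "coding_sites": 1000},
    {"pi_n": 0.004, "pi_s": 0.01, "frameshift_indels": 1, "coding_sites": 1000},
    {"pi_n": 0.020, "pi_s": 0.01, "frameshift_indels": 3, "coding_sites": 1000},
    {"pi_n": 0.001, "pi_s": 0.00, "frameshift_indels": 2, "coding_sites": 1000},
]
per_site = frameshift_indels_per_site(toy_genes)
```

Pooling indels and sites within a bin, rather than averaging per-gene rates, keeps short genes from dominating the estimate; either choice could be defended.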

Paralogs and Orthologs Evolve under Similar Selective Constraints within and across Species

We evaluated the correlation between πn/πs across orthologs and within-species paralogs, which were created by the most recent WGD (fig. 7). Both paralogs and orthologs show a significant positive correlation within and across all species, suggesting that most WGD duplicates are under similar selective constraints. We also confirmed that these correlations are not being driven simply by selection on synonymous sites, as πs is not necessarily correlated between orthologs and within-species paralogs across all species (supplementary fig. S8, Supplementary Material online). Similar findings in the maize genome also suggested correlations between nucleotide diversity and divergence between WGD paralogs (Pophaly and Tellier 2015). Our results further suggest that most orthologs are under very similar evolutionary constraints across the Paramecium species. This is in agreement

FIG. 6. (A) πn versus πs of each gene. The red line represents πn = πs. (B) The distribution of πn/πs of each gene across the genome. Only genes with more than five synonymous polymorphisms were used in this analysis. (C) Number of frame-shift causing indels per site in genes belonging to three gene categories: (1) πn/πs <= 0.5 (light grey), (2) 0.5 < πn/πs <= 1.0 (grey), and (3) πn/πs > 1.0 (black). Only genes with more than five synonymous polymorphisms are used in this analysis.


with a recent study in Drosophila, where πn and πs were found to be strongly correlated between D. melanogaster and D. simulans (Langley et al. 2012). Such a correlation was also observed between orthologs in S. cerevisiae and S. paradoxus (Elyashiv et al. 2010).

A possible mechanism governing the degree of selective constraint acting on proteins is their level of expression (Zhang and Yang 2015). Non-synonymous mutations are thought to have more severe fitness consequences in highly expressed genes than in weakly expressed ones (Drummond et al. 2005; Gout et al. 2010). Indeed, we find a strong negative correlation between expression and πn/πs in all species (fig. 8). However, only 6–13% of the variance is explained by expression, implying that additional factors may play a greater role in determining the degree of selective constraint acting on a gene. Simple biochemical properties of proteins, like the protein's function, tertiary structure, and the number of interacting partners, might determine the proportion of functionally important residues in the protein and, in turn, govern the degree of selection.

Our observation that orthologous genes across such deeply diverged species experience similar degrees of selective

FIG. 7. Correlation of logarithm of πn/πs for paralogs within species (three diagonal plots for P. tetraurelia, P. biaurelia, P. sexaurelia) and for orthologs between species pairwise [non-diagonal plots, with the plots below the diagonal (outlined red) representing the first set of WGD paralogs (T1, B1, S1) and the upper half of the matrix (outlined blue) representing the second set of paralogs (T2, B2, S2)]. The number in the top left corner of each plot is the slope of the linear regression in cases where the regression is significant (P < 0.01). Only genes with more than five synonymous polymorphisms were used in this analysis. The schematic tree in the bottom right corner shows the phylogenetic relationship between the species P. tetraurelia (T), P. biaurelia (B), P. sexaurelia (S), P. caudatum (C), and P. multimicronucleatum (M).


constraint (fig. 7) might imply a low rate of turnover of expression, protein structure, function, and number of interacting partners. This finding may also be in part due to similar ecologies, and therefore similar selective pressures, experienced by the Paramecium species examined here. It may thus be informative to compare closely related species that live in or have recently experienced different environments. For instance, Elyashiv et al. (2010) found significant differences in levels of gene-specific purifying selection (πn/πs) in the sake population of S. cerevisiae in comparison to other populations

FIG. 8. Negative correlation of πn/πs with expression levels of genes in four Paramecium species. All correlations are significant with P < 2.2 × 10^-16. Only genes with more than five synonymous polymorphisms were used in this analysis.


of S. cerevisiae as well as S. paradoxus. In fact, this deviation in selective constraints experienced by the sake population is complemented by differences in response to different stresses (Kvitek et al. 2008), suggesting that changes in environment might be reflected in gene-specific constraints.

Distribution of Tajima's D across the Genome

Contrary to genes under purifying selection, genes under balancing or positive selection might often be species-specific. We tested this hypothesis using Tajima's D as a statistic, which uses the difference between nucleotide diversity and the number of segregating sites to detect deviations from neutrality. Because Tajima's D is sensitive to population size changes and population structure (Simonsen et al. 1995), we used an outlier approach. We calculated Tajima's D for non-overlapping 1-kb windows across the genome (supplementary fig. S9, Supplementary Material online) and looked for the presence of candidate genes in the outlier regions (0.25% of the two tails). The outlier regions with negative values of Tajima's D might represent genes either under weak purifying or positive selection (i.e., experiencing a recent sweep), whereas the ones with positive Tajima's D might be under balancing selection. Interestingly, both sets of outliers show an over-representation of genes involved in glycosylation, glycoprotein production and metabolism, and carbohydrate metabolism (supplementary table S8, Supplementary Material online) for all Paramecium species. For each species, the huge functional overlap of genes in both tails suggests the important role of these processes in Paramecium growth. But more importantly, this further suggests similar selective pressures genome-wide acting on all Paramecium species examined in this study. These conclusions can, however, be biased by gene length, as shorter genes might not have sufficient power for an accurate calculation of Tajima's D. Indeed, the outliers are all significantly longer than the rest of the genes in P. tetraurelia (outliers: 1,567 bp; non-outliers: 1,321 bp; P = 4.7 × 10^-13), P. biaurelia (outliers: 1,681 bp; non-outliers: 1,372 bp; P < 2.2 × 10^-16), and in P. sexaurelia (outliers: 1,860 bp; non-outliers: 1,379 bp; P < 2.2 × 10^-16). This bias is, however, much lower or not significant for P. caudatum (outliers: 1,463 bp; non-outliers: 1,383 bp; P = 7.3 × 10^-3) and P. multimicronucleatum (outliers: 1,509 bp; non-outliers: 1,445 bp; P = 0.19), possibly because of higher genome-wide nucleotide diversity, indicating that despite some bias, the results have true biological significance.
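The outlier approach can be sketched as a simple ranking of windows by their Tajima's D value, keeping a fixed fraction from each tail. The function below is a hypothetical illustration: the window identifiers, the default tail fraction, and the rule of keeping at least one window per tail are assumptions, not details confirmed by the source.

```python
def tail_outliers(window_values, tail_fraction=0.0025):
    """Return (lowest, highest) window identifiers from each tail.

    window_values: dict mapping a window identifier to its Tajima's D.
    tail_fraction: fraction of windows taken from each tail (assumed value).
    """
    if not window_values:
        raise ValueError("no windows supplied")
    ranked = sorted(window_values.items(), key=lambda item: item[1])
    # Keep at least one window per tail so small inputs still return a result.
    k = max(1, int(len(ranked) * tail_fraction))
    lowest = [name for name, _ in ranked[:k]]
    highest = [name for name, _ in ranked[-k:]]
    return lowest, highest


# Hypothetical toy input: 1,000 windows whose D value rises with the index.
toy_windows = {index: index / 100 - 5 for index in range(1000)}
negative_tail, positive_tail = tail_outliers(toy_windows)
```

Genes overlapping the returned windows would then be the candidates examined for functional enrichment.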

Tajima's D, as a summary statistic of the site-frequency spectrum, can also indicate overall patterns of selection or recent changes in demography. Overall, in most species, the genome-wide distribution of Tajima's D is shifted towards positive values, with means of 0.862 for P. tetraurelia, 0.504 for P. biaurelia, 0.274 in P. sexaurelia, -0.148 [0.459] for P. caudatum [subset], and 1.325 [0.855] for P. multimicronucleatum [subset]. This could suggest pervasive local adaptation (or balancing selection), a recent contraction in population size, or a fairly ancient population subdivision in all Paramecium species. A previous study that sampled variation at 10 nuclear loci found that strains from different continents shared the same alleles and interpreted that as evidence of regular global dispersal of Paramecium from relatively few permanent locations (Catania et al. 2009). This could also be a potential explanation for the observed increase in nucleotide diversity in comparison to the number of segregating sites.

Fraction of Heterozygous Sites across the Genome

On average, about 0.1% of sites in an individual were heterozygous within the P. aurelia species, whereas 0.7% of sites in individuals belonging to P. caudatum and 1.1% of sites in isolates of P. multimicronucleatum were found to be heterozygous (supplementary table S1, Supplementary Material online). This is consistent with the observation that the P. aurelia species undergo autogamy, whereas the outgroup species do not. On evaluating the fraction of heterozygous sites in windows of 10 kb across the genome, we find large blocks of homozygosity in the P. aurelia genomes (supplementary fig. S10a–e, Supplementary Material online), consistent with losing heterozygosity in the lab while possibly undergoing multiple episodes of autogamy. It can, however, be noted that peaks in heterozygous regions in the P. aurelia genomes are highly correlated between isolates, unlike in P. caudatum and P. multimicronucleatum, and might potentially reflect mapping or assembly errors. These heterozygous regions (summarized in supplementary Files S1–S3, Supplementary Material online) comprise a very small part, 0.82–0.97%, of the genome and, most importantly, on exclusion do not affect any of our analyses qualitatively.

Variation across the Mitochondrial Genome

Few ciliate mitochondrial genomes have been sequenced until now, most belonging to the genus Tetrahymena (Brunk et al. 2003; Burger et al. 2000; Moradian et al. 2007). Some others include Paramecium tetraurelia (Pritchard et al. 1990), Paramecium caudatum (Barth and Berendonk 2011), and Oxytricha trifallax (Swart et al. 2012). These studies have illustrated that ciliate mitochondrial genomes are linear and range between 40 and 47 kb, with 50 intronless genes. However, there has been no understanding of the levels of sequence diversity across the entire mitochondrial genome.

To obtain mitochondrial genomes for all species included in this study, we carried out de novo genome assembly using the whole-genome resequencing data for each strain and extracted mitochondrial contigs using the published Paramecium mitochondrial genomes as reference (see Methods for more details). Using this approach, we derived complete genome assemblies for multiple strains, including from species without a previously published mitochondrial genome (P. biaurelia, P. sexaurelia, and P. multimicronucleatum). A comprehensive analysis of these sequences will be published elsewhere (Johri et al., in preparation). The mitochondrial genome assembly from each isolate was annotated separately, and the eight genes that were most ubiquitously present in all individuals were concatenated and used to construct a phylogeny (see Methods). The mitochondrial phylogeny closely tracked the one based on nuclear sequences (supplementary fig. S11, Supplementary Material online).


One strain from each species was chosen as a reference genome (the genome with the longest assembly), and reads from other individuals were mapped to the reference to call SNPs. During the original DNA preparation, efforts were made to remove non-MAC DNA, resulting in varying amounts of mitochondrial DNA present in the extracts from each isolate. Hence, some individuals mapped with extremely low coverage (4 to zero read depth) to the reference genome, especially after applying stringent filters to call SNPs (supplementary table S9, Supplementary Material online). Thus, some individuals were excluded from this analysis, including all P. biaurelia isolates. The subsequent analysis was therefore restricted to four species.

All of the mitochondrial genomes are about 40 kb in length. We identified 626 SNPs in P. tetraurelia, 5,307 in P. sexaurelia, 4,771 [3,041] in P. caudatum [subset], and 4,600 [1,478] in P. multimicronucleatum [subset]. Within all species, very few indels were found, with just 1 in P. tetraurelia, 9 in P. sexaurelia, and 15 [13] and 23 [9] in P. caudatum [subset] and P. multimicronucleatum [subset], respectively. The nucleotide diversity at 4-fold degenerate sites was found to be 0.009 in P. tetraurelia, 0.169 in P. sexaurelia, 0.127 [0.093] in P. caudatum [subset], and 0.172 [0.073] in P. multimicronucleatum [subset]. Nucleotide diversity across different sites shows that rRNAs are the most conserved sequences in the ciliate mitochondrial genomes. Transfer RNAs are the next most highly conserved, with their diversity levels only slightly above those of replacement sites (table 2). Our estimates of diversity in the Paramecium mitochondria are higher than those of a number of other unicellular eukaryotes studied so far: Plasmodium falciparum (π_syn = 0.0012) (Joy et al. 2003) and Chlamydomonas reinhardtii (π_silent = 0.0085) (Smith and Lee 2008).

We also recalculated nuclear nucleotide diversity (table 2) at 4-fold degenerate sites for the same set of individuals used in the mitochondrial analysis and observed that the mitochondrial to nuclear diversity ratio is close to 1.0 in all species (table 2) except P. sexaurelia, where the mitochondrial diversity is 5.7-fold higher than that in the nucleus. On excluding P. sexaurelia, we obtain a strong correlation between 4-fold degenerate nucleotide diversity in the mitochondria and nucleus (slope = 0.95, R² = 0.96). It should be noted that there are not enough data points to make any inferences about statistical significance. However, such a positive correlation between the nuclear and mitochondrial diversity suggests that mitochondrial diversity might scale with population size in ciliates, a question on which previous studies in animals have yielded conflicting findings (Bazin et al. 2006; Nabholz et al. 2008; Nabholz et al. 2009; Piganeau and Eyre-Walker 2009). The observed correlation also suggests that mitochondrial and nuclear spontaneous mutation rates may be correlated. Direct estimation of mutation rates in metazoans, C. elegans, D. melanogaster, S. cerevisiae, and H. sapiens, obtained via mutation accumulation experiments, suggests a moderate correlation (slope = 0.05, R² = 0.48) between mitochondrial and nuclear base substitution rates (using data from Lynch et al. 2008). Discordance between the nuclear and mitochondrial diversity, for instance in P. sexaurelia, can be attributed to various factors such as positive selection (Hudson and Turelli 2003; Galtier et al. 2009; James et al. 2016), demographic events (Pool and Nielsen 2007), or simply differences in nuclear and/or mitochondrial mutation rates (Lynch 2007). Future studies are needed to dissect the factors responsible for the remarkably different ratio of mitochondrial to nuclear diversity in P. sexaurelia.

Table 2. Nucleotide Diversity (per site) of Different Sites across the Mitochondrial Genomes.

Species                          4-fold (π_m) [SE]  2-fold [SE]      0-fold [SE]       tRNA [SE]        rRNA [SE]        Intergenic [SE]  4-fold nuclear (π_n) [SE]  π_m/π_n
P. tetraurelia                   0.0091 [0.0010]    0.0107 [0.0009]  0.0019 [0.0002]   0.0000 [0.0000]  0.0006 [0.0002]  0.0041 [0.0003]  0.0058 [2.26 × 10^-5]      1.57
P. sexaurelia                    0.1692 [0.0033]    0.1427 [0.0024]  0.0097 [0.00043]  0.0000 [0.0000]  0.0059 [0.0007]  0.0417 [0.0027]  0.02945 [4.73 × 10^-5]     5.75
P. caudatum                      0.1272 [0.0037]    0.0942 [0.0019]  0.0179 [0.0006]   0.0455 [0.0023]  0.0136 [0.0012]  0.0333 [0.0027]  0.1267 [1.08 × 10^-4]      1.00
P. caudatum (subset)             0.0929 [0.0039]    0.0859 [0.0021]  0.0149 [0.0006]   0.0375 [0.0023]  0.0093 [0.0011]  0.0292 [0.0028]  0.0658 [9.63 × 10^-5]      1.41
P. multimicronucleatum           0.1727 [0.0047]    0.0831 [0.0019]  0.0227 [0.0008]   0.0263 [0.0011]  0.0269 [0.0016]  0.0329 [0.0022]  0.1599 [1.34 × 10^-4]      1.08
P. multimicronucleatum (subset)  0.0730 [0.0034]    0.0383 [0.0015]  0.0121 [0.0006]   0.0126 [0.0009]  0.0135 [0.0012]  0.0312 [0.0022]  0.0742 [9.56 × 10^-5]      0.98
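The ratios and the correlation discussed above can be rechecked from the 4-fold degenerate diversities listed in table 2. The sketch below is only an illustration: it assumes that the regression used the five rows other than P. sexaurelia (full and subset samples) and that nuclear diversity was regressed on mitochondrial diversity. The text does not state either choice, but under these assumptions the fit comes out close to the reported slope and R². All names are hypothetical.

```python
# 4-fold degenerate diversity from table 2: (mitochondrial pi_m, nuclear pi_n).
TABLE2 = {
    "P. tetraurelia": (0.0091, 0.0058),
    "P. sexaurelia": (0.1692, 0.02945),
    "P. caudatum": (0.1272, 0.1267),
    "P. caudatum (subset)": (0.0929, 0.0658),
    "P. multimicronucleatum": (0.1727, 0.1599),
    "P. multimicronucleatum (subset)": (0.0730, 0.0742),
}


def diversity_ratio(pi_m, pi_n):
    """Mitochondrial to nuclear diversity ratio."""
    return pi_m / pi_n


def least_squares(x, y):
    """Return (slope, R^2) of an ordinary least-squares fit of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sxx, sxy ** 2 / (sxx * syy)


# Assumption: exclude P. sexaurelia and keep both full and subset samples.
rows = [v for name, v in TABLE2.items() if name != "P. sexaurelia"]
mito = [m for m, _ in rows]
nuclear = [n for _, n in rows]
slope, r_squared = least_squares(mito, nuclear)

ratios = {name: round(diversity_ratio(*v), 2) for name, v in TABLE2.items()}
```

With only five points the fit is descriptive, in line with the caveat above about statistical significance.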

Conclusions

This is the first study in which multiple individuals of several species of free-living unicellular eukaryotes have been sequenced and used to reveal genome-wide patterns of variation. This study thus provides an invaluable resource for future research in Paramecium, a major model organism in cell biology. Sequencing of multiple individuals from the same lake or pond is needed to further resolve the global versus local population structure of these cosmopolitan unicellular species. Overall, our results suggest that most of the intergenic regions across the highly compact genomes of Paramecium species are strongly constrained and probably harbor functionally important regulatory modules, which might extend into coding regions. We show that patterns of nucleotide diversity across the genome are driven primarily by a combination of purifying selection and mutation rate variation. In the future, such analyses will benefit greatly from incorporating variation in recombination rates across the genomic landscape. We also find that duplicates created by the WGD appear to be evolving under similar constraints. Synonymous to non-synonymous diversity ratios need to be supplemented with divergence levels to distinguish between the various types of selective forces acting on the duplicates. Thus, sequencing of much closer outgroup species would allow us to further understand the role of selection in the evolution of different genomic elements, including gene duplicates that resulted from ancient WGDs.

Methods

Sampling, DNA Extraction, and Sequencing

We examined a total of 56 Paramecium strains originally isolated from environmental sources, including 11 isolates of P. biaurelia, 13 isolates of P. sexaurelia, 12 isolates of P. tetraurelia, 10 isolates of P. caudatum, and 10 isolates of P. multimicronucleatum. A total of 14 strains from P. biaurelia and P. sexaurelia were kindly provided by E. Przybos, whereas the remaining strains were cultured in the Paramecium collections of the Institute of Hydrobiology, TU Dresden, Germany, or the Department of Biology, Indiana University, Bloomington, USA. Prior to our analyses, P. tetraurelia strain Moz 13CXI as well as P. sexaurelia strains Moz 13BIII and Indo17I were kindly identified by E. Przybos by performing inter- and intrastrain crosses (Przybos 1975). Paramecia were maintained in 0.25% Cerophyl medium (modified according to Sonneborn 1970; Krenek et al. 2011) inoculated with Enterobacter aerogenes. Since P. aurelia species can undergo autogamy (self-fertilization) induced by starvation, and all strains have been cultured under lab conditions for several years, all P. aurelia strains in this study are expected to be more homozygous than wild isolates.

Prior to DNA extraction, all cultures were initiated from a single cell, ensuring clonality. A modified sucrose density separation (Aury et al. 2006) was performed to extract primarily macronuclei (MACs) and to reduce the amount of non-MAC DNA potentially compromising subsequent analyses. Briefly, Paramecium cells were starved for at least two days and filtered through several layers of sterile cheesecloth prior to cell harvesting via gentle centrifugation. Pelleted cells were gently lysed in 0.25 M sucrose medium (pH 6.8) containing 10 mM Tris-HCl, 8 mM MgCl2, 3 mM CaCl2, 0.5–1% Nonidet-P40, and 0.1–0.2% sodium deoxycholate. Macronuclei were subsequently isolated away from cellular debris, micronuclei, and mitochondria by centrifugation at 100 rcf in 0.25 M sucrose wash buffer (see above, without Nonidet-P40 and Na-deoxycholate). Genomic DNA extraction from MAC isolations was performed by a modified CTAB procedure (Gustincich et al. 1991).

All genomes were sequenced using the Illumina HiSeq 2000, and paired-end reads with insert sizes between 200 and 400 bp and a mean length of 151 bp were obtained. The reads mapped with coverage ranging between 40× and 80× (details given in supplementary table S1, Supplementary Material online).

Paramecium Reference Genomes

Complete genomes of P. tetraurelia, P. biaurelia, P. sexaurelia, and P. caudatum were downloaded from ParameciumDB (Arnaiz et al. 2007; Arnaiz and Sperling 2011) along with the annotations. Ortho-paralog relationships as well as functional annotations were retrieved from the supplementary information provided in McGrath et al. (2014a, 2014b). RNAseq data for all four species were downloaded and processed as in Gout and Lynch (2015).

The genome of P. multimicronucleatum was assembled using Celera (Miller et al. 2008) and annotated using EuGene (Foissac et al. 2008) as previously described (McGrath et al. 2014a, 2014b). A detailed description of the sequencing, assembly, and annotation of P. multimicronucleatum can be found in Marinov et al. (in preparation). OMA (Orthology MAtrix) (Altenhoff et al. 2015) was used to find orthologs between P. caudatum and P. multimicronucleatum. A total of 11,002 genes of P. caudatum were found to have 1-1 orthologs in P. multimicronucleatum, and only 1-1 orthologs were used for the current analysis.

Mapping of Reads and Detection of SNPs and Indels

Sequenced paired-end reads were mapped to the reference genomes for each species using BWA (bwa mem) (Li and Durbin 2010) with default parameters, and the mpileup function in Samtools (Li et al. 2009) was used to generate mpileup files for each individual. Two distinct pipelines were employed to identify SNPs to ensure detection accuracy: Bcftools-Vcftools (Li 2011), as well as the maximum likelihood Genotype Frequency Estimator (referred to as GFE) (Lynch 2009; Maruki and Lynch 2015). Both pipelines produced highly overlapping sets of SNPs (supplementary table S2, Supplementary Material online). For both pipelines, sites were only considered for SNP detection and further analysis if the mapping quality was above 30, base quality was above 20, per-base alignment quality was above 15, and the sum of the depth of coverage for all individuals was at least 5 times the number of individuals and less than twice the average population coverage. SNPs were called by bcftools if the SNP quality score was above 30 and by GFE if the significance score was > 5.91 (P < 0.05). The exact parameters of the Bcftools-Vcftools pipeline to call SNPs and indels are given below.

samtools mpileup -q 30 -Q 20 -t DP -uvf reference_genome.fasta sorted.bam > species.vcf

bcftools call -f GQ -o species_snps.vcf -O v -V indels -cv species.vcf

bcftools call -f GQ -o species_indels.vcf -O v -V snps -cv species.vcf

# filtering snp and indel files for depth and biases
perl vcfutils.pl varFilter -d 50 -D 800 -1 -2 -3 -4 species_snps.vcf > species_snps.filtered1.vcf

perl vcfutils.pl varFilter -d 50 -D 800 -1 -2 -3 -4 species_indels.vcf > species_indels.filtered1.vcf

vcftools --vcf species_snps.filtered1.vcf --minQ 20 --minGQ 30 --minDP 4 --recode --recode-INFO-all --out species_snps.filtered2

vcftools --vcf species_indels.filtered1.vcf --minQ 20 --minGQ 30 --minDP 4 --recode --recode-INFO-all --out species_indels.filtered2

# filter out the recoded sites that are not polymorphic anymore, that is, AF1 = 1
bcftools view -q 0.001 species_snps.filtered2.recode.vcf -o species_snps.final.vcf

bcftools view -q 0.001 species_indels.filtered2.recode.vcf -o species_indels.final.vcf

Mitochondrial SNPs were called using the Bcftools-Vcftools pipeline with the same parameters as noted above.
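The site-level criteria described in this section can be expressed as a single predicate. The following is a minimal sketch, not the pipeline itself: it takes per-site summary values as inputs (in practice mapping and base qualities are per-read quantities), it treats the lower depth bound as a minimum of five reads per individual, and the function and parameter names are hypothetical.

```python
def site_passes(map_q, base_q, baq, total_depth, n_individuals, mean_pop_coverage):
    """Return True if a site meets the quality and depth criteria described in the text.

    total_depth is the depth of coverage summed over all individuals;
    mean_pop_coverage is the average of that summed depth across sites.
    """
    quality_ok = map_q > 30 and base_q > 20 and baq > 15
    # Assumed reading of the depth rule: at least 5 reads per individual on average,
    # and less than twice the average population coverage.
    depth_ok = 5 * n_individuals <= total_depth < 2 * mean_pop_coverage
    return quality_ok and depth_ok


# Example: 10 individuals and an average summed coverage of 400
# give depth bounds of 50 and 800.
print(site_passes(40, 30, 20, 120, 10, 400))
```

For 10 individuals with an average summed coverage of 400, the bounds evaluate to 50 and 800, which is consistent with the -d 50 -D 800 values in the commands above, although the text does not say how those values were derived.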

Simulation of SNP Detection with Varying Levels of Divergence from the Reference

We simulated reads using wgsim (https://github.com/lh3/wgsim) for a diploid genome with varying levels of divergence from the reference genome of P. caudatum and P. multimicronucleatum, using average values of read length, total number of reads, and insert sizes.

# For the P. caudatum reference genome
wgsim -d 287 -s 113 -N 10201087 -1 151 -2 151 -r 0.10 -R 0.013 caudatum_assembly.fasta caud_0.10_1.fastq caud_0.10_2.fastq >& caud_0.10.info

# For the P. multimicronucleatum reference genome
wgsim -d 261 -s 102 -N 12154044 -1 151 -2 151 -r 0.10 -R 0.011 multimicronucleatum_assembly.fasta multi_0.10_1.fastq multi_0.10_2.fastq >& multi_0.10.info

Reads were simulated for r = 0.0, 0.05, 0.10, 0.15, 0.20, 0.25, and 0.30, where r represents the fraction of the genome with SNPs. The SNP detection pipeline described above was used to call SNPs, which were compared to the simulated set of SNPs (supplementary table S4, Supplementary Material online).
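Comparing called SNPs against the simulated set amounts to a set comparison on site coordinates. The sketch below shows one way to summarize such a comparison; the choice of sensitivity and precision as summary statistics is an assumption, since the text does not specify which measures were reported, and the names are illustrative.

```python
def snp_call_accuracy(simulated, called):
    """Compare two collections of (scaffold, position) SNP coordinates."""
    simulated, called = set(simulated), set(called)
    true_positives = len(simulated & called)
    return {
        "true_positives": true_positives,
        "false_negatives": len(simulated - called),
        "false_positives": len(called - simulated),
        # Fraction of simulated SNPs that were recovered.
        "sensitivity": true_positives / len(simulated) if simulated else float("nan"),
        # Fraction of called SNPs that were actually simulated.
        "precision": true_positives / len(called) if called else float("nan"),
    }


simulated_sites = {("scaffold_1", 10), ("scaffold_1", 20), ("scaffold_2", 5), ("scaffold_2", 9)}
called_sites = {("scaffold_1", 10), ("scaffold_2", 5), ("scaffold_2", 7)}
summary = snp_call_accuracy(simulated_sites, called_sites)
```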

Building Phylogenies

A consensus genome (bcftools view species_filtered.recode.vcf | vcfutils.pl vcf2fq > species.cns.fq) was obtained for each individual using bcftools, with heterozygous sites represented by DNA ambiguity codes. Gene nucleotide sequences that are present in single copy in the P. aurelia species and are also present in both outgroup species (P. caudatum and P. multimicronucleatum) were extracted, and multiple sequence alignments were generated using MUSCLE (Edgar 2004a, 2004b). This amounted to about 55 sequences (including the reference genomes) in each of 883 sets of orthologous genes. RAxML 8.0.26 (Stamatakis 2014) was used to build a tree using the concatenation of all 883 aligned gene sequences (raxml -s concatenated_sequences.fasta -n gene -m GTRGAMMA -T 4 -p 7). In addition, individual gene trees (raxml -s gene.fasta -n gene -m GTRGAMMA -T 4 -p 7) were built, which were then used to create a consensus tree (raxml -J MR -z trees.newick -T 4 -m GTRGAMMA) (Aberer et al. 2010).

Mitochondrial phylogeny was built using the same parameters as above by concatenating eight sets of orthologous genes (NADH_dehydrogenase_subunit_1_a, NADH_dehydrogenase_subunit_5, Ymf56orf78, Ymf67orf265_1, Ymf68orf393, apocytochrome_b, cytochrome_c_oxidase_subunit_1, cytochrome_c_oxidase_subunit_2, ribosomal_protein_S14) that were ubiquitously present across all sampled species.
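Representing heterozygous sites with DNA ambiguity codes, as in the consensus genomes described in this section, follows the standard IUPAC nucleotide code. The sketch below illustrates that mapping for diploid genotypes; it is not the vcf2fq implementation, and the function names are hypothetical.

```python
# Standard IUPAC ambiguity codes for two-base combinations.
IUPAC_TWO_BASE = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def consensus_base(allele_1, allele_2):
    """Return the base for a homozygous site or the IUPAC code for a heterozygous one."""
    if allele_1 == allele_2:
        return allele_1
    return IUPAC_TWO_BASE[frozenset((allele_1, allele_2))]


def consensus_sequence(genotypes):
    """Build a consensus string from a list of (allele_1, allele_2) pairs."""
    return "".join(consensus_base(a, b) for a, b in genotypes)


print(consensus_sequence([("A", "A"), ("A", "G"), ("C", "T")]))
```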

Principal Component Analysis

Genotypes for all individuals were obtained from the vcf files generated by vcftools for all scaffolds. Only the sites that had a genotype available for all individuals were used in the analysis, i.e., if even a single individual had missing data, the site was excluded from analysis. A total of 10,000 markers were used for each species. The first marker of each scaffold was picked randomly within the first 50 SNP positions, and the rest were selected so that all markers were uniformly distributed across all scaffolds. PCA was carried out on these markers by using the R (R-Core-Team 2014) command prcomp (https://stat.ethz.ch/R-manual/R-devel/library/stats/html/prcomp.html) and visualized using the first two components with the R graphing package ggplot2 (Wickham 2010).
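The marker selection described above can be sketched as follows. This is a simplified illustration for a single scaffold under stated assumptions: missing genotypes are represented as None, uniform spacing is taken over the retained SNPs rather than over physical coordinates, and the function name and parameters are hypothetical.

```python
import random


def select_markers(sites, n_markers, rng=None, first_window=50):
    """Pick evenly spaced SNP indices from one scaffold.

    sites is a list with one entry per SNP, each entry being the list of
    genotypes across individuals (None marks missing data).
    """
    rng = rng or random.Random()
    # Keep only SNPs genotyped in every individual.
    complete = [i for i, genotypes in enumerate(sites)
                if all(g is not None for g in genotypes)]
    if not complete or n_markers <= 0:
        return []
    # Random starting point within the first 50 retained SNPs.
    start = rng.randrange(min(first_window, len(complete)))
    # Constant step so that the remaining markers are evenly spaced.
    step = max(1, (len(complete) - start) // n_markers)
    return complete[start::step][:n_markers]


example_sites = [["0/0", "0/1"] for _ in range(200)]
example_sites[3] = ["0/0", None]  # one SNP with missing data
picked = select_markers(example_sites, 10, random.Random(1))
```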

Calculation of Population Genetic Statistics

Nucleotide Diversity and Tajima's D

Nucleotide diversity was calculated as $2pq\,n/(n-1)$, where p and q are the major and minor allele frequencies and n is the number of chromosomes. The major and minor allele frequencies were given for each site by GFE. Diversity at coding sites was partitioned into synonymous and non-synonymous components (e.g., 2-fold degenerate sites are one-third synonymous and two-thirds non-synonymous). Major allele frequencies were also extracted from vcftools and returned results very similar to those from GFE (supplementary table S3, Supplementary Material online).
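The diversity estimator and the degeneracy-based partitioning can be illustrated in a few lines. The sketch below assumes that 4-fold sites count as fully synonymous and 0-fold sites as fully non-synonymous, with 2-fold sites split one-third to two-thirds as stated in the text; how the partition was applied in detail is not specified, so the second function is one plausible reading. Names are hypothetical.

```python
def site_diversity(major_freq, n_chromosomes):
    """Per-site nucleotide diversity: 2pq * n / (n - 1)."""
    p = major_freq
    q = 1.0 - p
    return 2.0 * p * q * n_chromosomes / (n_chromosomes - 1)


# Assumed fraction of each site class that is synonymous.
SYNONYMOUS_FRACTION = {"4-fold": 1.0, "2-fold": 1.0 / 3.0, "0-fold": 0.0}


def effective_site_counts(class_counts):
    """Return (synonymous, non-synonymous) site counts from counts per degeneracy class."""
    synonymous = sum(SYNONYMOUS_FRACTION[c] * k for c, k in class_counts.items())
    total = sum(class_counts.values())
    return synonymous, total - synonymous


print(site_diversity(0.5, 10))
print(effective_site_counts({"4-fold": 3, "2-fold": 3, "0-fold": 4}))
```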


A sliding window approach was used to calculate Tajima's D across the genome. Vcftools was used to obtain these values for non-overlapping windows of 1 kb in length. The program ms (Hudson 2002) was used to perform 100,000 simulations of 10 individuals with 1 kb locus length, and niceStats (https://github.com/kern-lab/shIC/blob/master/niceStats.c) was used to calculate Tajima's D from the ms output file. To obtain Tajima's D for different categories of sites, SNPs belonging to a particular category of sites (specified in Results) were selected, and vcftools was used to calculate Tajima's D for non-overlapping windows of 10 kb. Windows with 2 or more SNPs were used to calculate the mean, and only alternate windows were used to compare distributions of Tajima's D across different categories by Student's t-test, to maximise independence of sampling.
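For reference, the statistic computed per window is the standard Tajima's D (Tajima 1989), which contrasts the mean number of pairwise differences with the number of segregating sites. The sketch below implements that textbook formula directly; it is not the vcftools or niceStats code, and the function name is hypothetical.

```python
import math


def tajimas_d(n, seg_sites, pi):
    """Tajima's D for n sequences in one window.

    seg_sites is the number of segregating sites and pi the mean number of
    pairwise differences in the window (not per site).
    """
    if n < 4 or seg_sites == 0:
        return float("nan")  # undefined: the variance term is zero
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / math.sqrt(variance)


# Example: 10 sequences, 16 segregating sites, 3.888889 mean pairwise differences.
d_value = tajimas_d(10, 16, 3.888889)
```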

Plotting Average Expected Heterozygosity across Introns and Intergenic Regions

All introns (< 500 bp in length) were split midway and aligned according to the 5′ and 3′ splice sites. All intergenic regions (< 500 bp in the outgroup species and < 1,000 bp in the P. aurelia species) were split midway and aligned according to the start and stop codons. Note that these were not sequence alignments. At each site, the mean expected heterozygosity was calculated. Sites with an insufficient number of data points (< 200 data points per site for introns; for intergenic regions, < 100 data points per site in the outgroups and < 1,000 data points per site in the P. aurelia species) were discarded.
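The positional averaging described above can be sketched as follows: each region is cut at its midpoint, the first half is indexed from its 5′ boundary and the second half from its 3′ boundary, and positions supported by too few regions are dropped. This is an illustrative sketch; assigning the middle site of odd-length regions to the 3′ half is an arbitrary choice, and all names are hypothetical.

```python
from collections import defaultdict


def split_midway(values):
    """Return (5' half indexed from the start, 3' half indexed from the end)."""
    half = len(values) // 2
    return values[:half], values[half:][::-1]


def mean_by_position(regions, min_points):
    """Average per-site heterozygosity by distance from each boundary.

    regions is a list of per-site heterozygosity lists, one per intron or
    intergenic region. Positions with fewer than min_points values are dropped.
    """
    sums = {"5prime": defaultdict(float), "3prime": defaultdict(float)}
    counts = {"5prime": defaultdict(int), "3prime": defaultdict(int)}
    for values in regions:
        for side, part in zip(("5prime", "3prime"), split_midway(values)):
            for pos, het in enumerate(part):
                sums[side][pos] += het
                counts[side][pos] += 1
    return {
        side: {pos: sums[side][pos] / counts[side][pos]
               for pos in sorted(counts[side]) if counts[side][pos] >= min_points}
        for side in sums
    }


profile = mean_by_position([[0.1, 0.2, 0.3, 0.4], [0.3, 0.4]], min_points=2)
```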

Calculating and Plotting Divergence across Introns and Intergenic Regions

Orthologous coding sequences along with introns were aligned using MUSCLE (Edgar 2004a). Intergenic regions flanked by orthologous genes on both sides were also aligned along with the flanking coding exon using MUSCLE. For each species separately, divergence for different positions (with respect to the start and stop sites) was averaged across all introns and intergenic regions. The divergence values were corrected for multiple substitutions by the Jukes-Cantor sequence substitution model and plotted in the same manner as described above.
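The Jukes-Cantor correction converts an observed proportion of differing sites p into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). A minimal sketch of that formula, with a hypothetical function name:

```python
import math


def jukes_cantor_distance(p):
    """Jukes-Cantor corrected distance for an observed proportion of differences p."""
    if p < 0:
        raise ValueError("p must be non-negative")
    if p >= 0.75:
        return math.inf  # saturation: the correction is undefined at or beyond 0.75
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


print(jukes_cantor_distance(0.10))
```

The correction is small for low divergence and grows rapidly as p approaches 0.75.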

GO Analysis

All gene ontology enrichment analyses were performed using the BiNGO plugin for Cytoscape 3.2.1 (Shannon et al. 2003; Maere et al. 2005), using the hypergeometric test to assess significance with Benjamini and Hochberg FDR correction for multiple testing (Benjamini and Hochberg 1995). The latest ontology files (.obo) were downloaded from www.geneontology.org, and custom annotation files were prepared as required using python scripts.
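The two statistical steps named above, a hypergeometric enrichment test followed by Benjamini and Hochberg correction, can be illustrated with generic textbook implementations. The sketch below is not BiNGO's code, and the function names are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn from N genes, K of which carry the term."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return hits / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


enrichment_p = hypergeom_upper_tail(5, 5, 5, 10)
adjusted_p = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```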

Identification of Heterozygous Regions in the P. aurelia Species

The fraction of heterozygous sites was calculated for 10 bp non-overlapping windows across the genome for all three P. aurelia species, and the top 1% of such regions, those with the highest fraction of heterozygous sites, were identified as heterozygous regions.
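The window-based procedure above can be sketched as follows, assuming a per-site list of 0/1 flags marking heterozygous calls along one scaffold. The representation and the names are hypothetical, and handling of uncalled sites is omitted for brevity.

```python
def window_fractions(het_flags, window=10):
    """Fraction of heterozygous sites in consecutive non-overlapping windows."""
    return [sum(het_flags[i:i + window]) / window
            for i in range(0, len(het_flags) - window + 1, window)]


def top_fraction_windows(fractions, top=0.01):
    """Indices of the windows in the top fraction (default 1%) by heterozygosity."""
    k = max(1, int(len(fractions) * top))
    ranked = sorted(range(len(fractions)), key=lambda i: fractions[i], reverse=True)
    return sorted(ranked[:k])


flags = [0] * 1000
for site in range(20, 25):
    flags[site] = 1  # five heterozygous sites in the third window
fractions = window_fractions(flags)
heterozygous_windows = top_fraction_windows(fractions)
```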

Mitochondrial Genome Assembly and Annotation

Mitochondrial genomes were assembled as follows. Sequence reads were assembled using SPAdes (version 3.5.0) (Bankevich et al. 2012) after removing potential adapter sequence with Trimmomatic (version 0.33) (Bolger et al. 2014). Mitochondrial contigs were identified from the resulting assemblies by BLAST searches against the published P. caudatum and P. tetraurelia mitochondrial genomes (Pritchard et al. 1990; Barth and Berendonk 2011).

Protein-coding genes were identified by generating all ORFs longer than 60 aa in all six frames using the Mold, Protozoan, and Coelenterate Mitochondrial Code (i.e., UGA codes for W instead of being a stop codon) and all alternative start codons specific to Paramecium (AUU, AUA, AUG, AUC, GUG, and GUA), and retaining the longest ORFs associated with each stop codon. BLASTP (Altschul et al. 1997) was then used to identify homologs of annotated mitochondrial proteins in P. tetraurelia and P. caudatum. Additional ORFs were identified by imposing the requirement that their length exceed 100 aa and were subsequently annotated using BLASTP against the Non-redundant protein sequences (nr) database and HMMER 3.0 (Eddy 2011) scans against the Pfam 27.0 database (Finn et al. 2014).
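The ORF-generation step can be sketched as a six-frame scan in which UGA is not a stop codon and the first permitted start codon after the previous stop defines the longest ORF for a given stop. This is a simplified illustration written on DNA (T instead of U); the coordinate convention, the treatment of the sequence as linear, and all names are assumptions rather than the authors' implementation.

```python
STOPS = {"TAA", "TAG"}  # TGA encodes Trp under this mitochondrial code
STARTS = {"ATT", "ATA", "ATG", "ATC", "GTG", "GTA"}  # start codons listed in the text
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def longest_orfs(seq, min_aa=60):
    """Return (strand, frame, start, end) for the longest ORF ending at each stop.

    Coordinates are 0-based and end-exclusive on the scanned strand; the stop
    codon itself is excluded. Only ORFs longer than min_aa codons are kept.
    """
    seq = seq.upper()
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            first_start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if codon in STOPS:
                    if first_start is not None and (i - first_start) // 3 > min_aa:
                        found.append((strand, frame, first_start, i))
                    first_start = None
                elif first_start is None and codon in STARTS:
                    # The earliest start after the previous stop gives the longest ORF.
                    first_start = i
    return found


toy = "ATG" + "GCT" * 3 + "TGA" + "GCT" + "TAA"
toy_orfs = longest_orfs(toy, min_aa=2)
```

In the toy sequence, the in-frame TGA is read through, so the single ORF runs from the ATG to the TAA.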

tRNA genes were annotated with tRNAscan-SE 1.21 (Schattner et al. 2005) using the “Mito/Chloroplast” source.

Supplementary Material

Supplementary data are available at Molecular Biology and Evolution online.

Acknowledgments

We thank Lydia Bright and Jean-Francois Gout for helpful comments on the manuscript, Matthew S. Ackerman for insightful discussions related to the project, and the anonymous reviewers and editor for their valuable suggestions. We would also like to thank Kelley Thomas for providing help with the sequencing and E. Przybos for kindly providing several Paramecium strains and the species identification of some so far unspecified strains. This work was financially supported by the National Science Foundation (MCB-1050161 and DEB-1257806), the European Commission FP7-PEOPLE-2009-IRSES project CINAR PATHOBACTER (project number 247658), and the CNRS research network GDRE “Paramecium Genome Dynamics and Evolution”. T.U.B. also acknowledges funding of the “support the best” of the ZUK TU Dresden and the Deutsche Forschungsgemeinschaft (DFG).

References

Abbott RJ, Gomes MF. 1989. Population genetic-structure and outcrossing rate of Arabidopsis-Thaliana (L.) Heynh. Heredity 62:411–418.

Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, McVean GA. 2010. A map of human genome variation from population-scale sequencing. Nature 467:1061–1073.


Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA. 2012. An integrated map of genetic variation from 1,092 human genomes. Nature 491:56–65.

Aberer AJ, Pattengale ND, Stamatakis A. 2010. Parallelized phylogenetic post-analysis on multi-core architectures. J Comput Sci-Neth. 1:107–114.

Ackermann AA, Panunzi LG, Cosentino RO, Sanchez DO, Aguero F. 2012. A genomic scale map of genetic diversity in Trypanosoma cruzi. BMC Genomics 13:736.

Altenhoff AM, Skunca N, Glover N, Train CM, Sueki A, Pilizota I, Gori K, Tomiczek B, Muller S, Redestig H, et al. 2015. The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Res. 43:D240–D249.

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389–3402.

Arnaiz O, Cain S, Cohen J, Sperling L. 2007. ParameciumDB: a community resource that integrates the Paramecium tetraurelia genome sequence with genetic data. Nucleic Acids Res. 35:D439–D444.

Arnaiz O, Mathy N, Baudry C, Malinsky S, Aury JM, Wilkes CD, Garnier O, Labadie K, Lauderdale BE, Le Mouel A, et al. 2012. The Paramecium germline genome provides a niche for intragenic parasitic DNA: evolutionary dynamics of internal eliminated sequences. PLoS Genet. 8:e1002984.

Arnaiz O, Sperling L. 2011. ParameciumDB in 2011: new tools and new data for functional and comparative genomics of the model ciliate Paramecium tetraurelia. Nucleic Acids Res. 39:D632–D636.

Assefa S, Lim C, Preston MD, Duffy CW, Nair MB, Adroub SA, Kadir KA, Goldberg JM, Neafsey DE, Divis P, et al. 2015. Population genomic structure and adaptation in the zoonotic malaria parasite Plasmodium knowlesi. Proc Natl Acad Sci U S A. 112:13027–13032.

Aury JM, Jaillon O, Duret L, Noel B, Jubin C, Porcel BM, Segurens B, Daubin V, Anthouard V, Aiach N, et al. 2006. Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature 444:171–178.

Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR. 2015. A global reference for human genetic variation. Nature 526:68–74.

Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, et al. 2012. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 19:455–477.

Barth D, Berendonk TU. 2011. The mitochondrial genome sequence of the ciliate Paramecium caudatum reveals a shift in nucleotide composition and codon usage within the genus Paramecium. BMC Genomics 12:272.

Barth D, Krenek S, Fokin SI, Berendonk TU. 2006. Intraspecific genetic variation in Paramecium revealed by mitochondrial cytochrome C oxidase I sequences. J Eukaryot Microbiol. 53:20–25.

Bazin E, Glemin S, Galtier N. 2006. Population size does not influence mitochondrial genetic diversity in animals. Science 312:570–572.

Beale G, Preer JR. 2008. Paramecium: genetics and epigenetics. Boca Raton (FL): CRC Press.

Begun DJ, Aquadro CF. 1992. Levels of naturally-occurring DNA polymorphism correlate with recombination rates in Drosophila melanogaster. Nature 356:519–520.

Begun DJ, Holloway AK, Stevens K, Hillier LW, Poh YP, Hahn MW, Nista PM, Jones CD, Kern AD, Dewey CN, et al. 2007. Population genomics: whole-genome analysis of polymorphism and divergence in Drosophila simulans. PLoS Biol. 5:e310.

Benjamini Y, Hochberg Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B Met. 57:289–300.

Betermier M. 2004. Large-scale genome remodelling by the developmentally programmed elimination of germ line sequences in the ciliate Paramecium. Res Microbiol. 155:399–408.

Birky CW, Walsh JB. 1988. Effects of linkage on rates of molecular evolution. Proc Natl Acad Sci U S A. 85:6414–6418.

Birnbaum RY, Clowney EJ, Agamy O, Kim MJ, Zhao J, Yamanaka T, Pappalardo Z, Clarke SL, Wenger AM, Nguyen L, et al. 2012. Coding exons function as tissue-specific enhancers of nearby genes. Genome Res. 22:1059–1068.

Boel G, Letso R, Neely H, Price WN, Wong KH, Su M, Luff JD, Valecha M, Everett JK, Acton TB, et al. 2016. Codon influence on protein expression in E. coli correlates with mRNA levels. Nature 529:358–363.

Bolger AM, Lohse M, Usadel B. 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30:2114–2120.

Brandvain Y, Wright SI. 2016. The limits of natural selection in a nonequilibrium world. Trends Genet. 32:201–210.

Brunk CF, Lee LC, Tran AB, Li J. 2003. Complete sequence of the mitochondrial genome of Tetrahymena thermophila and comparative methods for identifying highly divergent genes. Nucleic Acids Res. 31:1673–1682.

Burger G, Zhu Y, Littlejohn TG, Greenwood SJ, Schnare MN, Lang BF, Gray MW. 2000. Complete sequence of the mitochondrial genome of Tetrahymena pyriformis and comparison with Paramecium aurelia mitochondrial DNA. J Mol Biol. 297:365–380.

Burki F. 2014. The eukaryotic tree of life from a global phylogenomic perspective. Cold Spring Harb Perspect Biol. 6:a016147.

Cao J, Schneeberger K, Ossowski S, Gunther T, Bender S, Fitz J, Koenig D, Lanz C, Stegle O, Lippert C, et al. 2011. Whole-genome sequencing of multiple Arabidopsis thaliana populations. Nat Genet. 43:956–963.

Catania F, Wurmser F, Potekhin AA, Przybos E, Lynch M. 2009. Genetic diversity in the Paramecium aurelia species complex. Mol Biol Evol. 26:421–431.

Chalker DL, Yao MC. 2011. DNA elimination in ciliates: transposon domestication and genome surveillance. Annu Rev Genet. 45:227–246.

Charlesworth D. 2003. Effects of inbreeding on the genetic diversity of populations. Philos T Roy Soc B. 358:1051–1070.

Charlesworth B, Morgan MT, Charlesworth D. 1993. The effect of deleterious mutations on neutral molecular variation. Genetics 134:1289–1303.

Chia JM, Song C, Bradbury PJ, Costich D, de Leon N, Doebley J, Elshire RJ, Gaut B, Geller L, Glaubitz JC, et al. 2012. Maize HapMap2 identifies extant variation from a genome in flux. Nat Genet. 44:803–807.

Comeron JM, Williford A, Kliman RM. 2008. The Hill-Robertson effect: evolutionary consequences of weak selection and linkage in finite populations. Heredity 100:19–31.

Dey A, Chan CK, Thomas CG, Cutter AD. 2013. Molecular hyperdiversity defines populations of the nematode Caenorhabditis brenneri. Proc Natl Acad Sci U S A. 110:11056–11060.

Dong X, Navratilova P, Fredman D, Drivenes O, Becker TS, Lenhard B. 2010. Exonic remnants of whole-genome duplication reveal cis-regulatory function of coding exons. Nucleic Acids Res. 38:1071–1085.

Downing T, Imamura H, Decuypere S, Clark TG, Coombs GH, Cotton JA, Hilley JD, de Doncker S, Maes I, Mottram JC, et al. 2011. Whole genome sequencing of multiple Leishmania donovani clinical isolates provides insights into population structure and mechanisms of drug resistance. Genome Res. 21:2143–2156.

Drummond DA, Bloom JD, Adami C, Wilke CO, Arnold FH. 2005. Why highly expressed proteins evolve slowly. Proc Natl Acad Sci U S A. 102:14338–14343.

Eddy SR. 2011. Accelerated profile HMM searches. PLoS Comput Biol. 7:e1002195.

Edgar RC. 2004a. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5:113.

Edgar RC. 2004b. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32:1792–1797.

Eichenlaub MP, Ettwiller L. 2011. De novo genesis of enhancers in vertebrates. PLoS Biol. 9:e1001188.

Ellison CE, Hall C, Kowbel D, Welch J, Brem RB, Glass NL, Taylor JW. 2011. Population genomics and local adaptation in wild isolates of a model microbial eukaryote. Proc Natl Acad Sci U S A. 108:2831–2836.

Elyashiv E, Bullaughey K, Sattath S, Rinott Y, Przeworski M, Sella G. 2010. Shifts in the intensity of purifying selection: an analysis of genome-wide polymorphism data from two closely related yeast species. Genome Res. 20:1558–1573.

Engelthaler DM, Hicks ND, Gillece JD, Roe CC, Schupp JM, Driebe EM, Gilgado F, Carriconde F, Trilles L, Firacative C, et al. 2014. Cryptococcus gattii in North American Pacific Northwest: whole-population genome analysis provides insights into species evolution and dispersal. mBio 5:e01464–e01414.

Fawcett JA, Iida T, Takuno S, Sugino RP, Kado T, Kugou K, Mura S, Kobayashi T, Ohta K, Nakayama J, et al. 2014. Population genomics of the fission yeast Schizosaccharomyces pombe. PLoS ONE 9:e104241.

Fenchel T, Finlay BJ. 2004. The ubiquity of small species: patterns of local and global diversity. Bioscience 54:777–784.

Finlay BJ. 2002. Global dispersal of free-living microbial eukaryote species. Science 296:1061–1063.

Finlay BJ, Esteban GF, Fenchel T. 2004. Protist diversity is different? Protist 155:15–22.

Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J, et al. 2014. Pfam: the protein families database. Nucleic Acids Res. 42:D222–D230.

Foissac S, Gouzy J, Rombauts S, Mathe C, Amselem J, Sterck L, Van de Peer Y, Rouze P, Schiex T. 2008. Genome annotation in plants and fungi: EuGene as a model platform. Curr Bioinform. 3:87–97.

Foissner W. 1999. Protist diversity: estimates of the near-imponderable. Protist 150:363–368.

Foissner W. 2006. Biogeography and dispersal of micro-organisms: a review emphasizing protists. Acta Protozool. 45:111–136.

Foissner W. 2008. Protist diversity and distribution: some basic considerations. Biodivers Conserv. 17:235–242.

Foissner W. 2011. Dispersal of protists: the role of cysts and human introductions. In: Fontaneto D, editor. Biogeography of microscopic organisms: is everything small everywhere? Cambridge University Press. p. 61–87.

Foissner W, Chao A, Katz LA. 2008. Diversity and geographic distribution of ciliates (Protista: Ciliophora). Biodivers Conserv. 17:345–363.

Fournier-Level A, Korte A, Cooper MD, Nordborg M, Schmitt J, Wilczek AM. 2011. A map of local adaptation in Arabidopsis thaliana. Science 334:86–89.

Galtier N, Nabholz B, Glemin S, Hurst GDD. 2009. Mitochondrial DNA as a marker of molecular diversity: a reappraisal. Mol Ecol. 18:4541–4550.

Gout JF, Kahn D, Duret L, Paramecium Post-Genomics C. 2010. The relationship among gene expression, the evolution of gene dosage, and the rate of protein evolution. PLoS Genet. 6:e1000944.

Gout JF, Lynch M. 2015. Maintenance and loss of duplicated genes by dosage subfunctionalization. Mol Biol Evol. 32:2141–2148.

Gruchy DF. 1955. The breeding system and distribution of Tetrahymena pyriformis. J Protozool. 2:178–185.

Gustincich S, Manfioletti G, Del Sal G, Schneider C, Carninci P. 1991. A fast method for high-quality genomic DNA extraction from whole human blood. BioTechniques 11:298–300, 302.

Hancock AM, Witonsky DB, Alkorta-Aranburu G, Beall CM, Gebremedhin A, Sukernik R, Utermann G, Pritchard JK, Coop G, Di Rienzo A. 2011. Adaptations to climate-mediated selective pressures in humans. PLoS Genet. 7:e1001375.

Hiwatashi K. 1968. Determination and inheritance of mating type in Paramecium caudatum. Genetics 58:373–386.

Hori M, Tomikawa I, Przybos E, Fujishima M. 2006. Comparison of the evolutionary distances among syngens and sibling species of Paramecium. Mol Phylogenet Evol. 38:697–704.

Huang X, Kurata N, Wei X, Wang ZX, Wang A, Zhao Q, Zhao Y, Liu K, Lu H, Li W, et al. 2012. A map of rice genome variation reveals the origin of cultivated rice. Nature 490:497–501.

Huang X, Zhao Q, Han B. 2015. Comparative population genomics reveals strong divergence and infrequent introgression between Asian and African rice. Mol Plant 8:958–960.

Hudson RR. 2002. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18:337–338.

Hudson RR, Turelli M. 2003. Stochasticity overrules the “three-times rule”: genetic drift, genetic draft, and coalescence times for nuclear loci versus mitochondrial DNA. Evolution 57:182–190.

Jaillon O, Bouhouche K, Gout JF, Aury JM, Noel B, Saudemont B, Nowacki M, Serrano V, Porcel BM, Segurens B, et al. 2008. Translational control of intron splicing in eukaryotes. Nature 451:359–362.

James JE, Piganeau G, Eyre-Walker A. 2016. The rate of adaptive evolution in animal mitochondria. Mol Ecol. 25:67–78.

Joy DA, Feng X, Mu J, Furuya T, Chotivanich K, Krettli AU, Ho M, Wang A, White NJ, Suh E, et al. 2003. Early origin and recent expansion of Plasmodium falciparum. Science 300:318–321.

Kaplan NL, Hudson RR, Langley CH. 1989. The “hitchhiking effect” revisited. Genetics 123:887–899.

Katz LA, McManus GB, Snoeyenbos-West OLO, Griffin A, Pirog K, Costas B, Foissner W. 2005. Reframing the ‘Everything is everywhere’ debate: evidence for high gene flow and diversity in ciliate morphospecies. Aquat Microb Ecol. 41:55–65.

Katz LA, Snoeyenbos-West O, Doerder FP. 2006. Patterns of protein evolution in Tetrahymena thermophila: implications for estimates of effective population size. Mol Biol Evol. 23:608–614.

Krenek S, Berendonk TU, Fokin SI. 2015. New Paramecium (Ciliophora, Oligohymenophorea) congeners shape our view on its biodiversity. Org Divers Evol. 15:215–233.

Krenek S, Berendonk TU, Petzoldt T. 2011. Thermal performance curves of Paramecium caudatum: a model selection approach. Eur J Protistol. 47:124–137.

Krenek S, Petzoldt T, Berendonk TU. 2012. Coping with temperature at the warm edge – patterns of thermal adaptation in the microbial eukaryote Paramecium caudatum. PLoS One 7:e30598.

Kudla G, Murray AW, Tollervey D, Plotkin JB. 2009. Coding-sequence determinants of gene expression in Escherichia coli. Science 324:255–258.

Kvitek DJ, Will JL, Gasch AP. 2008. Variations in stress sensitivity and genomic expression in diverse S. cerevisiae isolates. PLoS Genet. 4:e1000223.

Lahr DJ, Laughinghouse HDt, Oliverio AM, Gao F, Katz LA. 2014. How discordant morphological and molecular evolution among microorganisms can revise our notions of biodiversity on Earth. Bioessays 36:950–959.

Langley CH, Stevens K, Cardeno C, Lee YC, Schrider DR, Pool JE, Langley SA, Suarez C, Corbett-Detig RB, Kolaczkowski B, et al. 2012. Genomic variation in natural populations of Drosophila melanogaster. Genetics 192:533–598.

Leffler EM, Bullaughey K, Matute DR, Meyer WK, Segurel L, Venkat A, Andolfatto P, Przeworski M. 2012. Revisiting an old riddle: what determines genetic diversity levels within species? PLoS Biol. 10:e1001388.

Li H. 2011. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27:2987–2993.

Li H, Durbin R. 2010. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26:589–595.

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Genome Project Data Processing S. 2009. The sequence alignment/map format and SAMtools. Bioinformatics 25:2078–2079.

Lim LP, Burge CB. 2001. A computational analysis of sequence features involved in recognition of short introns. Proc Natl Acad Sci U S A. 98:11193–11198.

Liti G, Carter DM, Moses AM, Warringer J, Parts L, James SA, Davey RP, Roberts IN, Burt A, Koufopanou V, et al. 2009. Population genomics of domestic and wild yeasts. Nature 458:337–341.

Loveless MD, Hamrick JL. 1984. Ecological determinants of genetic-structure in plant-populations. Annu Rev Ecol Syst. 15:65–95.

Lynch M. 2007. The origins of genome architecture. Sunderland (MA): Sinauer Associates.

Lynch M. 2009. Estimation of allele frequencies from high-coverage genome-sequencing projects. Genetics 182:295–301.

Johri et al. doi:10.1093/molbev/msx074 MBE

1214 Downloaded from https://academic.oup.com/mbe/article-abstract/34/5/1194/2992914 by Arizona State University user on 02 July 2018

Lynch M, Field MC, Goodson HV, Malik HS, Pereira-Leal JB, Roos DS, Turkewitz AP, Sazer S. 2014. Evolutionary cell biology: two origins, one objective. Proc Natl Acad Sci U S A 111:16990–16994.

Lynch M, Sung W, Morris K, Coffey N, Landry CR, Dopman EB, Dickinson WJ, Okamoto K, Kulkarni S, Hartl DL, et al. 2008. A genome-wide view of the spectrum of spontaneous mutations in yeast. Proc Natl Acad Sci U S A 105:9272–9277.

Maere S, Heymans K, Kuiper M. 2005. BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics 21:3448–3449.

Maguire B, Belk D. 1967. Paramecium transport by land snails. J Protozool 14:445–447.

Maruki T, Lynch M. 2015. Genotype-frequency estimation from high-throughput sequencing data. Genetics 201:473–486.

McGrath CL, Gout JF, Doak TG, Yanagi A, Lynch M. 2014a. Insights into three whole-genome duplications gleaned from the Paramecium caudatum genome sequence. Genetics 197:1417–1428.

McGrath CL, Gout JF, Johri P, Doak TG, Lynch M. 2014b. Differential retention and divergent resolution of duplicate genes following whole-genome duplication. Genome Res 24:1665–1675.

Miller JR, Delcher AL, Koren S, Venter E, Walenz BP, Brownley A, Johnson J, Li K, Mobarry C, Sutton G. 2008. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics 24:2818–2824.

Moradian MM, Beglaryan D, Skozylas JM, Kerikorian V. 2007. Complete mitochondrial genome sequence of three Tetrahymena species reveals mutation hot spots and accelerated nonsynonymous substitutions in Ymf genes. PLoS ONE 2:e650.

Nabholz B, Glemin S, Galtier N. 2009. The erratic mitochondrial clock: variations of mutation rate, not population size, affect mtDNA diversity across birds and mammals. BMC Evol Biol 9:54.

Nabholz B, Mauffrey JF, Bazin E, Galtier N, Glemin S. 2008. Determination of mitochondrial genetic diversity in mammals. Genetics 178:351–361.

Neafsey DE, Galinsky K, Jiang RH, Young L, Sykes SM, Saif S, Gujja S, Goldberg JM, Young S, Zeng Q, et al. 2012. The malaria parasite Plasmodium vivax exhibits greater genetic diversity than Plasmodium falciparum. Nat Genet 44:1046–1050.

Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, Indap A, King KS, Bergmann S, Nelson MR, et al. 2008. Genes mirror geography within Europe. Nature 456:98–101.

Ohta T. 1974. Mutational pressure as main cause of molecular evolution and polymorphism. Nature 252:351–354.

Ohta T. 1992. The nearly neutral theory of molecular evolution. Annu Rev Ecol Syst 23:263–286.

Pawlowski J, Audic S, Adl S, Bass D, Belbahri L, Berney C, Bowser SS, Cepicka I, Decelle J, Dunthorn M, et al. 2012. CBOL protist working group: barcoding eukaryotic richness beyond the animal, plant, and fungal kingdoms. PLoS Biol 10:e1001419.

Piganeau G, Eyre-Walker A. 2009. Evidence for variation in the effective population size of animal mitochondrial DNA. PLoS One 4:e4396.

Pool JE, Nielsen R. 2007. Population size changes reshape genomic patterns of diversity. Evolution 61:3001–3006.

Pophaly SD, Tellier A. 2015. Population level purifying selection and gene expression shape subgenome evolution in maize. Mol Biol Evol 32:3226–3235.

Prescott DM. 1994. The DNA of ciliated protozoa. Microbiol Rev 58:233–267.

Pritchard AE, Seilhamer JJ, Mahalingam R, Sable CL, Venuti SE, Cummings DJ. 1990. Nucleotide sequence of the mitochondrial genome of Paramecium. Nucleic Acids Res 18:173–180.

Przybos E. 1975. Genetic studies of Paramecium jenningsi strains (Diller, Earl, 1958). Folia Biol (Krakow) 23:425–471.

Przybos E. 2005. Recent data on the occurrence of species of the Paramecium aurelia complex in Europe. Folia Biol-Krakow 53:61–63.

Przybos E, Barth D, Berendonk TU. 2008. The Paramecium aurelia species complex, frequency and co-occurrence across Europe. Folia Biol (Krakow) 56:77–81.

Przybos E, Rautian M, Greczek-Stachura M, Potekhin A. 2007. Polymorphism within Paramecium sexaurelia (Ciliophora, Oligohymenophorea) and description of a new stand of the species in China. Folia Biol (Krakow) 55:121–125.

Przybos E, Surmacz M. 2010. New world-wide data on the distribution of species of the Paramecium aurelia complex (Ciliophora, Protozoa). Folia Biol-Krakow 58:185–188.

R-Core-Team. 2014. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

Redden H, Alper HS. 2015. The development and characterization of synthetic minimal yeast promoters. Nat Commun 6:7810.

Ritter DI, Dong Z, Guo S, Chuang JH. 2012. Transcriptional enhancers in protein-coding exons of vertebrate developmental genes. PLoS One 7:e35202.

Russell CB, Fraga D, Hinrichsen RD. 1994. Extremely short 20-33 nucleotide introns are the standard length in Paramecium tetraurelia. Nucleic Acids Res 22:1221–1225.

Salim HMW, Ring KL, Cavalcanti ARO. 2008. Patterns of codon usage in two ciliates that reassign the genetic code: Tetrahymena thermophila and Paramecium tetraurelia. Protist 159:283–298.

Schacherer J, Shapiro JA, Ruderfer DM, Kruglyak L. 2009. Comprehensive polymorphism survey elucidates population structure of Saccharomyces cerevisiae. Nature 458:342–345.

Schattner P, Brooks AN, Lowe TM. 2005. The tRNAscan-SE, snoscan and snoGPS web servers for the detection of tRNAs and snoRNAs. Nucleic Acids Res 33:W686–W689.

Schrider DR, Kern AD. 2015. Inferring selective constraint from population genomic data suggests recent regulatory turnover in the human brain. Genome Biol Evol 7:3511–3528.

Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. 2003. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13:2498–2504.

Simonsen KL, Churchill GA, Aquadro CF. 1995. Properties of statistical tests of neutrality for DNA polymorphism data. Genetics 141:413–429.

Smith DR, Lee RW. 2008. Nucleotide diversity in the mitochondrial and nuclear compartments of Chlamydomonas reinhardtii: investigating the origins of genome architecture. BMC Evol Biol 8:156.

Smith JM, Haigh J. 1974. The hitch-hiking effect of a favourable gene. Genetics Res 89:391–403.

Snoke MS, Berendonk TU, Barth D, Lynch M. 2006. Large global effective population sizes in Paramecium. Mol Biol Evol 23:2474–2479.

Sonneborn TM. 1970. Methods in Paramecium research. In: Methods Cell Biol. New York and London: Academic Press. p. 241–339.

Sonneborn TM. 1975. Paramecium aurelia complex of fourteen sibling species. T Am Microsc Soc 94:155–178.

Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30:1312–1313.

Stoeck T, Welter H, Seitz-Bender D, Kusch J, Schmidt HJ. 2000. ARDRA and RAPD-fingerprinting reject the sibling species concept for the ciliate Paramecium caudatum (Ciliophora, Protoctista). Zool Scr 29:75–82.

Sun M. 1999. Cleistogamy in Scutellaria indica (Labiatae): effective mating system and population genetic structure. Mol Ecol 8:1285–1295.

Swart EC, Nowacki M, Shum J, Stiles H, Higgins BP, Doak TG, Schotanus K, Magrini VJ, Minx P, Mardis ER, et al. 2012. The Oxytricha trifallax mitochondrial genome. Genome Biol Evol 4:136–154.

Tajima F. 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123:585–595.

Tishkoff SA, Reed FA, Ranciaro A, Voight BF, Babbitt CC, Silverman JS, Powell K, Mortensen HM, Hirbo JB, Osman M, et al. 2007. Convergent adaptation of human lactase persistence in Africa and Europe. Nat Genet 39:31–40.

Tsukii Y, Hiwatashi K. 1983. Genes controlling mating-type specificity in Paramecium caudatum: three loci revealed by intersyngenic crosses. Genetics 104:41–62.

Tsukii Y, Hiwatashi K. 1985. Meiotic nondisjunction and aneuploids in intersyngenic hybrids of Paramecium caudatum. Genetics 111:779–794.

Tuller T, Zur H. 2015. Multiple roles of the coding sequence 5′ end in gene expression regulation. Nucleic Acids Res 43:13–28.

Voight BF, Kudaravalli S, Wen X, Pritchard JK. 2006. A map of recent positive selection in the human genome. PLoS Biol 4:e72.

Volkman SK, Sabeti PC, DeCaprio D, Neafsey DE, Schaffner SF, Milner DA Jr, Daily JP, Sarr O, Ndiaye D, Ndir O, et al. 2007. A genome-wide map of diversity in Plasmodium falciparum. Nat Genet 39:113–119.

Weir W, Capewell P, Foth B, Clucas C, Pountain A, Steketee P, Veitch N, Koffi M, De Meeus T, Kabore J, et al. 2016. Population genomics reveals the origin and asexual evolution of human infective trypanosomes. eLife 5:e11473.

Weisse T. 2008. Distribution and diversity of aquatic protists: an evolutionary and ecological perspective. Biodivers Conserv 17:243–259.

Wickham H. 2010. ggplot2: Elegant graphics for data analysis. New York: Springer-Verlag.

Zhang JZ, Yang JR. 2015. Determinants of the rate of protein sequence evolution. Nat Rev Genet 16:409–420.

plementary table S4, Supplementary Material online). However, reads obtained from genomes that were more than 10% divergent from the reference failed to map successfully, and there is a steep reduction in the fraction of true SNPs detected, with only 10% of the true SNPs detected at 20% divergence from the reference (supplementary fig. S4, Supplementary Material online). Interestingly, most of our strains in P. caudatum are far below 10% divergence (including coding and non-coding regions separately) (supplementary table S5, Supplementary Material online), and thus overall analyses across the majority of the genome should not be affected by this bias. In P. multimicronucleatum, although the divergent strains are close to being 10% divergent from the reference (supplementary table S5, Supplementary Material online), when we consider only the P. multimicronucleatum subset, all strains are less than 6% divergent from the reference. Therefore, it is possible that such a bias is affecting our results when we include all the strains in the outgroup species, but analyses with the subsets should be minimally influenced by this bias. Nonetheless, it should be noted that due to heterogeneity of divergence across the genome, some regions of the genome might still be affected by such a mapping bias. Accordingly, we repeated the above analysis by calling SNPs using only the 4 and 5 closest individuals to the reference genome (forming polytomies in supplementary fig. S1, Supplementary Material online) in P. caudatum and P. multimicronucleatum, respectively. These individuals were even closer to the reference genome, with average expected heterozygosity at 4-fold degenerate sites of 0.04302 and 0.06255, respectively. Again, we observe that for all sites that are covered by all individuals and have coverage very close to the median population coverage (±20 reads), the ratio of intergenic to coding diversity was 0.78 and 0.72, respectively. This suggests that the observation of relatively lower genetic variation in the intergenic relative to coding regions in the outgroup species is not an artifact of our analysis.
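The site filter and diversity ratio described above can be illustrated with a minimal sketch. It assumes a simple unbiased expected-heterozygosity estimator computed from allele counts, which is not necessarily the estimator used in this study, and the record layout, the function names, and the `tolerance` parameter are hypothetical choices made only for illustration.

```python
from statistics import median


def expected_heterozygosity(allele_counts):
    """Unbiased expected heterozygosity at one site from sampled allele counts."""
    n = sum(allele_counts)
    if n < 2:
        raise ValueError("need at least two sampled alleles")
    homozygosity = sum((c / n) ** 2 for c in allele_counts)
    return n / (n - 1) * (1.0 - homozygosity)


def intergenic_to_coding_ratio(sites, n_individuals, tolerance=20):
    """Ratio of mean per-site diversity in intergenic versus coding sites.

    sites: list of dicts with keys 'region', 'coverage', 'n_called' and
    'allele_counts' (a hypothetical layout chosen for this sketch).
    Only sites called in every individual and with coverage within
    `tolerance` reads of the median coverage are retained.
    """
    med = median(s["coverage"] for s in sites)
    kept = [
        s for s in sites
        if abs(s["coverage"] - med) <= tolerance and s["n_called"] == n_individuals
    ]

    def mean_diversity(region):
        values = [expected_heterozygosity(s["allele_counts"])
                  for s in kept if s["region"] == region]
        if not values:
            raise ValueError(f"no usable sites in region {region!r}")
        return sum(values) / len(values)

    return mean_diversity("intergenic") / mean_diversity("coding")


# Toy data, invented for illustration only.
toy_sites = [
    {"region": "coding", "coverage": 100, "n_called": 4, "allele_counts": [2, 2]},
    {"region": "intergenic", "coverage": 100, "n_called": 4, "allele_counts": [3, 1]},
    {"region": "intergenic", "coverage": 100, "n_called": 4, "allele_counts": [4, 0]},
    # Dropped by the filter: coverage is far from the median.
    {"region": "intergenic", "coverage": 300, "n_called": 4, "allele_counts": [2, 2]},
]
ratio = intergenic_to_coding_ratio(toy_sites, n_individuals=4)
```

A ratio below 1 in this sketch corresponds to lower diversity in intergenic than in coding sites, the pattern reported for the outgroup species.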

As an alternative approach to examine the relative strength of selective constraint on coding versus non-coding DNA, we examined the distribution of insertions/deletions (indels) across the genome. A total of 29096 (P. tetraurelia), 54406 (P. biaurelia), 88724 (P. sexaurelia), 32631 [28776] (P. caudatum [subset]), and 31338 [28347] (P. multimicronucleatum [subset]) indels were found across the genomes, which translates to SNP:indel ratios of 19, 17, 34, 147 [77], and 138 [88], respectively. The indel lengths are exponentially distributed in all species, with a median of 1 bp. Indels are heavily under-represented in coding regions (fig. 3d), where the median indel length is 3 bp, suggesting purifying selection against indels that cause frame-shifts. Introns and intergenic regions, however, tend to have a median indel length of 1 bp (supplementary table S6, Supplementary Material online). Again, we see that in the P. aurelia species there are proportionally more indels present in the intergenic regions than in introns. On the other hand, in the outgroup species, intergenic regions tolerate fewer indels than do intronic regions. This shows that the very short intergenic regions in P. caudatum and P. multimicronucleatum are under extremely strong constraint.

FIG. 3. The compactness of Paramecium genomes. (A, B) Distribution of intron and intergenic length across Paramecium genomes: (A) P. tetraurelia and (B) P. caudatum. Mean intron lengths in P. tetraurelia and P. caudatum: 25 and 22 bp; median intergenic lengths in P. tetraurelia and P. caudatum: 161 and 43 bp. (C) πcoding/πintergenic, calculated genome-wide from sites that had coverage very close to the median population coverage (±20 reads) and were contributed by the same number of individuals. (D) Incidence of indels across the genome for each Paramecium species, calculated by dividing the total number of indels found in that region by the total number of utilizable nucleotide sites in that region.
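The per-region indel summaries discussed here (incidence per utilizable site, median length, and the share of indels that would shift the reading frame) can be sketched as follows. The input layout and the function name are assumptions made for this illustration, and the toy numbers are invented rather than taken from the study.

```python
from statistics import median


def summarize_indels(indels, utilizable_sites):
    """Per-region indel incidence, median length and frame-shift fraction.

    indels: list of (region, length_bp) tuples (hypothetical layout).
    utilizable_sites: dict mapping region -> number of utilizable sites.
    """
    summary = {}
    for region, n_sites in utilizable_sites.items():
        lengths = [length for r, length in indels if r == region]
        if not lengths:
            summary[region] = {"incidence": 0.0, "median_length": None,
                               "frameshift_fraction": None}
            continue
        # An indel shifts the reading frame when its length is not a multiple of 3.
        frameshifts = sum(1 for length in lengths if length % 3 != 0)
        summary[region] = {
            "incidence": len(lengths) / n_sites,
            "median_length": median(lengths),
            "frameshift_fraction": frameshifts / len(lengths),
        }
    return summary


# Toy data, invented for illustration only.
toy_indels = [("coding", 3), ("coding", 3), ("coding", 1),
              ("intergenic", 1), ("intergenic", 1), ("intergenic", 2), ("intergenic", 1)]
toy_sites = {"coding": 1000, "intergenic": 500}
indel_summary = summarize_indels(toy_indels, toy_sites)
```

The frame-shift fraction only has a biological meaning for coding regions; it is computed for every region here simply to keep the sketch uniform.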

Spatial Patterns of Constraints across Non-Coding Regions in the Genomes

Selection in natural populations can remove strongly deleterious mutations, but weakly deleterious mutations can continue to persist in the population at lower allele frequencies (Ohta 1974, 1992). Thus, within-population nucleotide diversity can be used to infer constraints across the genome and possibly identify functional parts of the non-coding genome. We therefore evaluated spatial patterns of nucleotide diversity in exons flanking all introns and intergenic regions across the genome for each species separately (fig. 4; supplementary fig. S5, Supplementary Material online). However, neutral sites linked to functionally important sites also experience reduction in diversity as a result of either purging of deleterious alleles (background selection; Charlesworth et al. 1993) or fixation of a beneficial mutation (hitchhiking; Smith and Haigh 1974; Kaplan et al. 1989) at functional loci due to linkage (Comeron et al. 2008). As selection has a minimal effect on divergence at linked neutral sites (Birky and Walsh 1988), a comparison of polymorphism and divergence can distinguish between the effects of variation in the neutral mutation rate (which is determined by selective constraint in addition to mutational processes) versus linked selection (Begun and Aquadro 1992). To this end, we evaluated patterns of divergence across introns and intergenic regions for the relatively more closely related P. aurelia species.

We observed a reduction in both expected heterozygosity and divergence in intergenic regions near the start and end of coding sequences, as well as at some key sites within introns (figs. 4 and 5 and supplementary fig. S6, Supplementary Material online). In fact, divergence-normalized values of expected heterozygosity fail to exhibit any spatial pattern across these non-coding elements (fig. 5; supplementary fig. S6, Supplementary Material online), suggesting that these patterns are driven predominantly by negative selection and mutational processes (although see below) rather than linked selection. This is interesting because, although asexual species are expected to experience the effects of linked selection more strongly (Charlesworth et al. 1993), very compact genomes such as those in Paramecium might experience relatively more direct selection, as they have a higher proportion of functional sites.
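The logic of the normalization can be made concrete with a small sketch: if heterozygosity and divergence fall together near a gene boundary, their position-wise ratio stays flat, whereas linked selection is expected to lower heterozygosity without a matching drop in divergence. The function name, the `min_divergence` cutoff, and the toy profiles below are assumptions for illustration only.

```python
def divergence_normalized_diversity(heterozygosity, divergence, min_divergence=1e-6):
    """Position-wise ratio of expected heterozygosity to divergence.

    Both inputs are per-position averages ordered by distance from a
    gene boundary. Positions with negligible divergence yield None.
    """
    if len(heterozygosity) != len(divergence):
        raise ValueError("profiles must have the same length")
    return [
        h / d if d > min_divergence else None
        for h, d in zip(heterozygosity, divergence)
    ]


# Toy profiles moving away from a start codon (invented numbers):
# both statistics rise together, so the normalized profile is flat.
toy_heterozygosity = [0.002, 0.004, 0.008, 0.008]
toy_divergence = [0.05, 0.10, 0.20, 0.20]
normalized_profile = divergence_normalized_diversity(toy_heterozygosity, toy_divergence)
```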

The first and last three sites of introns are highly conserved, even more so than 0-fold degenerate sites, as observed in patterns of both divergence and nucleotide diversity (figs. 4 and 5), consistent with the role of these positions in intron splicing (Jaillon et al. 2008). In fact, it has been suggested that the first and last three sites in short eukaryotic introns might be sufficient for accurate splicing (Lim and Burge 2001; Jaillon et al. 2008). However, we find that in Paramecium, on average, 8–10 base pairs of coding DNA immediately flanking an intron are also under stronger constraints than the remainder of coding sequence, possibly suggesting the presence of splicing regulatory elements in the coding sequence. Another interesting observation is the clear decrease in divergence (fig. 5e and 5g) in the center of introns, suggesting that longer introns might possess functional elements related to gene regulation.

We also observe that average expected heterozygosity and divergence increase gradually moving upstream from the start codon and downstream from the stop codon into the intergenic regions, stabilizing at around 100–150 base pairs (fig. 4), suggesting the presence of regulatory modules or UTRs. All four types of intergenic regions with respect to the direction of transcription, namely (i) regions that are upstream to both flanking genes, (ii) those downstream of one gene and upstream of another, (iii) upstream of one gene and downstream of another, and (iv) downstream of both flanking genes, exhibit the same pattern of reduction in heterozygosity levels just before the start and after the stop codons (supplementary fig. S7, Supplementary Material online). This indicates that constraints on downstream intergenic regions are not an artifact of overlapping upstream intergenic regions. Instead, these observations suggest the presence of gene regulatory elements at the 3′ end of genes, either involved in posttranscriptional regulation, in the specification of transcription termination, or in the regulation of gene expression. We note that in the P. aurelia species, diversity levels in the intergenic regions stabilize to reach average diversity at 4-fold degenerate sites. In contrast, in the two outgroup species, diversity in intergenic regions stabilizes only up to the equivalent average diversity at 2-fold degenerate sites. Such differences in patterns of nucleotide diversity between the P. aurelia species and the outgroup species can be attributed to a much higher proportion of functionally important sites comprising the intergenic space in the outgroup species, possibly as a result of smaller numbers of bases between adjacent genes in the latter.
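The four intergenic categories follow directly from the strands of the two flanking genes. A minimal sketch of that classification is given below; the function name and the strand encoding ('+' and '-') are assumptions for illustration, not a description of the pipeline used here.

```python
VALID_STRANDS = {"+", "-"}


def classify_intergenic(left_strand, right_strand):
    """Describe an intergenic region relative to its two flanking genes.

    left_strand / right_strand: strand of the gene at lower / higher
    coordinates. Returns (relation_to_left_gene, relation_to_right_gene).
    """
    if left_strand not in VALID_STRANDS or right_strand not in VALID_STRANDS:
        raise ValueError("strand must be '+' or '-'")
    # A '+' gene on the left ends (3' end) at the region; a '-' gene starts there.
    relation_to_left = "downstream" if left_strand == "+" else "upstream"
    # A '+' gene on the right starts (5' end) at the region; a '-' gene ends there.
    relation_to_right = "upstream" if right_strand == "+" else "downstream"
    return relation_to_left, relation_to_right
```

Under this encoding, divergently transcribed neighbours ('-', '+') give a region upstream of both genes, convergently transcribed neighbours ('+', '-') give a region downstream of both, and the two tandem orientations give the mixed categories.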

A similar reduction in diversity in intergenic regions extends up to 85 kb around transcription start/stop sites of genes in humans (Abecasis et al. 2010). In S. cerevisiae, intergenic regions are under strong purifying selection, and those shorter than 300 bp have significantly lower nucleotide diversity than longer intergenic regions (Schacherer et al. 2009).

Interestingly, we also observe, on average, a 70–100 bp (20–35 amino acids) decrease in diversity as well as divergence at both the 5′ and 3′ end of the coding sequence (CDS) of protein-coding genes. Such reduction in diversity at 4-fold and 2-fold degenerate sites at the beginning and end of protein-coding sequence can be explained by a number of different hypotheses. The 5′ ends of CDSs might be required to maintain low thermodynamic stability of mRNA secondary structure near the start codon for efficient translation initiation. Alternatively, CDS 5′ ends might possess fewer tRNA-adapted codons to slow down translation initially, which has been proposed to reduce ribosome collision-induced premature translation terminations and facilitate protein production (see Tuller and Zur 2015). Consistent with this, the 5′ and 3′ ends of protein-coding genes in bacteria (32–35 codons) have been shown to be important for efficient translation (Kudla et al. 2009; Boel et al. 2016), and their mRNA folding energies can explain observed variation in expression levels (Kudla et al. 2009). In fact, it has been shown previously that the first few codons in proteins of P. tetraurelia and T. thermophila are, on average, comprised of significantly different numbers of effective codons than the rest of the protein (Salim et al. 2008). Thus, the role played by the first and last few bases of all coding sequences in translation might contribute to additional sequence constraints, resulting in reduced nucleotide diversity and divergence.

FIG. 4. Average expected heterozygosity (per site) within and flanking introns (left) and intergenic regions (right) in P. tetraurelia (a, b), P. biaurelia (c, d), P. sexaurelia (e, f), P. caudatum subset (g, h), and P. multimicronucleatum subset (i, j), separately for 4-fold degenerate sites (red), 2-fold degenerate sites (blue), 0-fold degenerate sites (green), and non-coding sites (black).

FIG. 5. Patterns of average divergence (per site) within and flanking intergenic regions (a, c) and introns (e, g) are shown in the left panels. Patterns of average expected heterozygosity per site normalized by average divergence per site within and flanking intergenic regions (b, d) and introns (f, h) are shown on the right. The divergence values are calculated between P. tetraurelia and P. biaurelia, separately for 4-fold degenerate sites (red), 2-fold degenerate sites (blue), 0-fold degenerate sites (green), and non-coding sites (black).

Another factor that might contribute to constraining the CDS ends is the presence of regulatory modules at the beginning and end of genes. Exonic enhancers have been observed in multiple species, including teleosts (Dong et al. 2010; Eichenlaub and Ettwiller 2011), humans (Birnbaum et al. 2012), and other vertebrates (Ritter et al. 2012), where they have been shown to regulate nearby genes. In the highly compact genomes of Paramecium, it is even more likely that regulatory regions extend beyond non-coding regions into the CDS. For instance, in P. caudatum, the median length of 5′ intergenic regions is 43 bp, which increases to 56 bp when both genes share the same 5′ intergenic region, suggesting the importance of upstream intergenic length in such compact genomes. Interestingly, in P. caudatum, 965 genes (7%) have 5′ intergenic regions shorter than 10 bp and 3417 genes (25%) have 5′ intergenic regions shorter than 20 bp. Similarly, in P. multimicronucleatum, the median length of the 5′ intergenic region is 90 bp, which increases to 127 bp when both genes share the same 5′ intergenic region. Again, there are 456 genes (3.5%) with fewer than 10 bp and 1760 genes (13%) with fewer than 20 bp comprising their upstream intergenic region. In fact, in both species the 5′ intergenic region can be as small as 3 bp (found in 38 genes in P. caudatum and 18 genes in P. multimicronucleatum). An interesting question this raises is what the minimum number of bases required to encode regulatory information about transcription of genes in eukaryotes might be. A recent study in yeast has now created a high-strength minimal constitutive promoter that is only 116 bp in length (Redden and Alper 2015). It might be possible to observe even smaller promoter regions needed to drive expression patterns in Paramecium, especially in P. caudatum.

Spatial Patterns of Nucleotide Diversity across Non-Coding Regions are Not Driven by Mutation Alone

Although we have assumed that the divergence-normalized diversity patterns indicate the effect of negative selection, it is possible that the observed spatial patterns of nucleotide diversity across and flanking non-coding regions are driven entirely by mutation. One way to potentially distinguish between mutation and direct selection as the cause of these patterns is to evaluate the allele frequency of the variants observed at different site types. If the reduction in nucleotide diversity near the start and end of coding sequences is mostly a consequence of negative selection, we would expect a higher fraction of rare alleles at 4-fold degenerate sites under stronger constraints relative to that seen at 4-fold degenerate sites not under constraints. Instead, if mutation is the predominant driver of the extent of nucleotide diversity at these sites, then there should be no difference in the fraction of rare alleles between the 4-fold degenerate sites under stronger negative selection and those not under constraints. Because Tajima's D (Tajima 1989) is a good statistic to summarize such a shift in rare alleles, we compared average Tajima's D across these site types.

In each Paramecium species, we first characterized sites genome-wide into five different categories: (1) splice sites (first and last 3 sites) in introns, (2) intergenic sites within 50 bp of start/stop sites, (3) intergenic sites at least 150 bp away from start/stop sites (which should exclude most UTR regions), (4) 4-fold degenerate sites within 50 bp of start/stop sites, and (5) 4-fold degenerate sites at least 150 bp away from start/stop sites and 30 bp away from splice sites. We expected category 5 sites to be the most neutral. Interestingly, in most species, Tajima's D at splice sites, intergenic sites close to start/stop sites, and 4-fold degenerate sites close to start/stop sites is significantly smaller than that of 4-fold degenerate and intergenic sites far away from any functional elements (see supplementary table S7, Supplementary Material online for P-values). This suggests that sites in categories 1, 2, and 4 are under relatively stronger purifying selection. Category 2 sites also seem to possess significantly more rare alleles than category 3 sites consistently in all species, suggesting that intergenic regions very close to start/stop sites are under stronger purifying selection than intergenic sites further away from gene ends. We also find Tajima's D at intergenic sites (category 3) to be significantly higher than that for 4-fold degenerate sites (category 5) in most species. This could suggest either that intergenic sites are under some form of positive/balancing selection or that silent sites are not completely neutral and in fact experience some purifying selection.
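For readers unfamiliar with the statistic, the following is a minimal sketch of Tajima's D following the standard formulas of Tajima (1989), computed from minor-allele counts at biallelic segregating sites. It assumes a fixed sample size n at every site and is not the authors' implementation; the function name and input layout are hypothetical.

```python
import math


def tajimas_d(minor_allele_counts, n):
    """Tajima's D from minor-allele counts at biallelic segregating sites.

    minor_allele_counts: one count (between 1 and n - 1) per segregating site.
    n: number of sampled sequences. Returns None when D is undefined.
    """
    s = len(minor_allele_counts)
    if n < 4 or s == 0:
        return None
    if any(k < 1 or k > n - 1 for k in minor_allele_counts):
        raise ValueError("each count must lie between 1 and n - 1")
    # pi: mean number of pairwise differences, summed over sites.
    pi = sum(2.0 * k * (n - k) / (n * (n - 1)) for k in minor_allele_counts)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * s + e2 * s * (s - 1)
    # Numerator: pi minus Watterson's estimator S / a1.
    return (pi - s / a1) / math.sqrt(variance)


# Toy site categories (invented): an excess of singletons versus
# an excess of intermediate-frequency variants, with n = 10.
d_rare = tajimas_d([1] * 8, 10)
d_common = tajimas_d([5] * 8, 10)
```

An excess of rare alleles, as expected under purifying selection, pushes D below zero, whereas an excess of intermediate-frequency alleles pushes it above zero; comparing the average D between site categories uses this property.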

Gene-Based Analysis of Patterns of Synonymous and Non-Synonymous Diversity

Selective constraint can vary among genes within the same species, and it could also be species-specific. To examine such variation in our dataset, we evaluated the distribution of πn/πs across all genes (fig. 6a). If we assume that synonymous polymorphisms are fitness neutral, then πn/πs is a measure of the relative functional constraints on non-synonymous polymorphisms and thus an indication of functional constraints on the protein. The majority of genes have πn/πs < 1.0, consistent with constraints on non-synonymous changes. About 98% of genes in P. tetraurelia, 96% in P. biaurelia, 96% in P. sexaurelia, 99% in both P. caudatum and its subset, and 99% in both P. multimicronucleatum and its subset have πn/πs < 1. Upon comparing the average πn/πs for the same set of orthologous genes across all species, we find the mean to be 0.10 for P. tetraurelia, 0.17 for P. biaurelia, 0.24 for P. sexaurelia, 0.12 [0.12] for P. caudatum [subset], and 0.17 [0.14] for P. multimicronucleatum [subset]. Among these species, P. sexaurelia has the highest average πn/πs, which suggests either relaxed selection because of a relatively small effective population size or more local adaptation. Local adaptation might result in fixation of more non-synonymous than synonymous differences between subpopulations, thus increasing πn/πs. Differences in πn/πs between species could also reflect genome-wide variation in the fraction of gene duplicates that are subfunctionalized, neofunctionalized, or nonfunctionalized. Alternatively, an elevation in average levels of πn/πs could be a consequence of depression in the average levels of πs, which could be caused by stronger selection on synonymous sites or a recent bottleneck (Brandvain and Wright 2016) experienced by the species. Evidence that genes with higher πn/πs ratios are under more relaxed constraints is provided by a comparison between the average number of frame-shift causing indels occurring in the four sets of genes with (1) πn/πs ≤ 0.25, (2) 0.25 < πn/πs ≤ 0.5, (3) 0.5 < πn/πs ≤ 1.0, and (4) πn/πs > 1.0. As expected, we find an increase in the number of indels per site with increase in πn/πs (fig. 6b).
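The binning of genes by πn/πs and the per-bin rate of frame-shift causing indels can be sketched as below. The bin boundaries follow the four sets listed above, and the filter on synonymous polymorphisms follows the figure legend; the record layout, function names, and toy values are hypothetical and only illustrate the bookkeeping.

```python
def pn_ps_bin(pi_n, pi_s):
    """Assign a gene to one of the four pi_n/pi_s sets (1 to 4)."""
    ratio = pi_n / pi_s
    if ratio <= 0.25:
        return 1
    if ratio <= 0.5:
        return 2
    if ratio <= 1.0:
        return 3
    return 4


def frameshift_indels_per_site(genes, min_syn_polymorphisms=5):
    """Frame-shift indels per coding site, pooled within each pi_n/pi_s set.

    genes: list of dicts with keys 'pi_n', 'pi_s', 'n_syn_polymorphisms',
    'frameshift_indels' and 'coding_sites' (a hypothetical layout).
    Genes with too few synonymous polymorphisms are skipped.
    """
    totals = {b: [0, 0] for b in (1, 2, 3, 4)}  # bin -> [indels, sites]
    for g in genes:
        if g["n_syn_polymorphisms"] <= min_syn_polymorphisms or g["pi_s"] <= 0:
            continue
        b = pn_ps_bin(g["pi_n"], g["pi_s"])
        totals[b][0] += g["frameshift_indels"]
        totals[b][1] += g["coding_sites"]
    return {b: (indels / sites if sites else None)
            for b, (indels, sites) in totals.items()}


# Toy genes, invented for illustration only.
toy_genes = [
    {"pi_n": 0.001, "pi_s": 0.01, "n_syn_polymorphisms": 10,
     "frameshift_indels": 0, "coding_sites": 1000},
    {"pi_n": 0.02, "pi_s": 0.01, "n_syn_polymorphisms": 8,
     "frameshift_indels": 2, "coding_sites": 1000},
    # Skipped: too few synonymous polymorphisms.
    {"pi_n": 0.02, "pi_s": 0.01, "n_syn_polymorphisms": 3,
     "frameshift_indels": 50, "coding_sites": 1000},
]
indel_rates = frameshift_indels_per_site(toy_genes)
```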

Paralogs and Orthologs Evolve under Similar Selective Constraints within and across Species

We evaluated the correlation between πn/πs across orthologs and within-species paralogs, which were created by the most recent WGD (fig. 7). Both paralogs and orthologs show a significant positive correlation within and across all species, suggesting that most WGD duplicates are under similar selective constraints. We also confirmed that these correlations are not being driven simply by selection on synonymous sites, as πs is not necessarily correlated between orthologs and within-species paralogs across all species (supplementary fig. S8, Supplementary Material online). Similar findings in the maize genome also suggested correlations between nucleotide diversity and divergence between WGD paralogs (Pophaly and Tellier 2015). Our results further suggest that most orthologs are under very similar evolutionary constraints across the Paramecium species. This is in agreement with a recent study in Drosophila, where πn and πs were found to be strongly correlated between D. melanogaster and D. simulans (Langley et al. 2012). Such a correlation was also observed between orthologs in S. cerevisiae and S. paradoxus (Elyashiv et al. 2010).

FIG. 6. (A) πn versus πs of each gene. The red line represents πn = πs. (B) The distribution of πn/πs of each gene across the genome. Only genes with more than five synonymous polymorphisms were used in this analysis. (C) Number of frame-shift causing indels per site in genes belonging to three gene categories: (1) πn/πs ≤ 0.5 (light grey), (2) 0.5 < πn/πs ≤ 1.0 (grey), and (3) πn/πs > 1.0 (black). Only genes with more than five synonymous polymorphisms were used in this analysis.

A possible mechanism governing the degree of selective constraint acting on proteins is their level of expression (Zhang and Yang 2015). Non-synonymous mutations are thought to have more severe fitness consequences in highly expressed genes than in weakly expressed ones (Drummond et al. 2005; Gout et al. 2010). Indeed, we find a strong negative correlation between expression and πn/πs in all species (fig. 8). However, only 6–13% of the variance is explained by expression, implying that additional factors may play a greater role in determining the degree of selective constraint acting on a gene. Simple biochemical properties of proteins, like the protein's function, tertiary structure, and the number of interacting partners, might determine the proportion of functionally important residues in the protein and in turn govern the degree of selection.

Our observation that orthologous genes across such deeply diverged species experience similar degrees of selective

FIG. 7. Correlation of logarithm of πn/πs for paralogs within species (three diagonal plots for P. tetraurelia, P. biaurelia, P. sexaurelia) and for orthologs between species pairwise [non-diagonal plots, with the plots below the diagonal (outlined red) representing the first set of WGD paralogs (T1, B1, S1) and the upper half of the matrix (outlined blue) representing the second set of paralogs (T2, B2, S2)]. The number in the top left corner of each plot is the slope of the linear regression in cases where the regression is significant (P < 0.01). Only genes with more than five synonymous polymorphisms were used in this analysis. The schematic tree in the bottom right corner shows the phylogenetic relationship between the species P. tetraurelia (T), P. biaurelia (B), P. sexaurelia (S), P. caudatum (C), and P. multimicronucleatum (M).


constraint (fig. 7) might imply a low rate of turnover of expression, protein structure, function, and number of interacting partners. This finding may also be in part due to similar ecologies, and therefore similar selective pressures, experienced by the Paramecium species examined here. It may thus be informative to compare closely related species that live in, or have recently experienced, different environments. For instance, Elyashiv et al. (2010) found significant differences in levels of gene-specific purifying selection (πn/πs) in the sake population of S. cerevisiae in comparison to other populations

FIG. 8. Negative correlation of πn/πs with expression levels of genes in four Paramecium species. All correlations are significant with P < 2.2 × 10⁻¹⁶. Only genes with more than five synonymous polymorphisms were used in this analysis.


of S. cerevisiae, as well as S. paradoxus. In fact, this deviation in selective constraints experienced by the sake population is complemented by differences in response to different stresses (Kvitek et al. 2008), suggesting that changes in environment might be reflected in gene-specific constraints.

Distribution of Tajima's D across the Genome

Contrary to genes under purifying selection, genes under balancing or positive selection might often be species-specific. We tested this hypothesis using Tajima's D as a statistic, which uses the difference between nucleotide diversity and the number of segregating sites to detect deviations from neutrality. Because Tajima's D is sensitive to population size changes and population structure (Simonsen et al. 1995), we used an outlier approach. We calculated Tajima's D for non-overlapping 1-kb windows across the genome (supplementary fig. S9, Supplementary Material online) and looked for the presence of candidate genes in the outlier regions (0.25% of the two tails). The outlier regions with negative values of Tajima's D might represent genes either under weak purifying or positive selection (i.e., experiencing a recent sweep), whereas the ones with positive Tajima's D might be under balancing selection. Interestingly, both sets of outliers show an over-representation of genes involved in glycosylation, glycoprotein production and metabolism, and carbohydrate metabolism (supplementary table S8, Supplementary Material online) for all Paramecium species. For each species, the huge functional overlap of genes in both tails suggests the important role of these processes in Paramecium growth. But more importantly, this further suggests similar selective pressures genome-wide acting on all Paramecium species examined in this study. These conclusions can, however, be biased by gene length, as shorter genes might not have sufficient power for an accurate calculation of Tajima's D. Indeed, the outliers are all significantly longer than the rest of the genes in P. tetraurelia (outliers: 1567 bp; non-outliers: 1321 bp; P = 4.7 × 10⁻¹³), P. biaurelia (outliers: 1681 bp; non-outliers: 1372 bp; P < 2.2 × 10⁻¹⁶), and P. sexaurelia (outliers: 1860 bp; non-outliers: 1379 bp; P < 2.2 × 10⁻¹⁶). This bias is, however, much lower or not significant for P. caudatum (outliers: 1463 bp; non-outliers: 1383 bp; P = 7.3 × 10⁻³) and P. multimicronucleatum (outliers: 1509 bp; non-outliers: 1445 bp; P = 0.19), possibly because of higher genome-wide nucleotide diversity, indicating that despite some bias, the results have true biological significance.

Tajima's D, as a summary statistic of the site-frequency spectrum, can also indicate overall patterns of selection or recent changes in demography. Overall, in most species the genome-wide distribution of Tajima's D is shifted towards positive values, with means of 0.862 for P. tetraurelia, 0.504 for P. biaurelia, 0.274 for P. sexaurelia, −0.148 [0.459] for P. caudatum [subset], and 1.325 [0.855] for P. multimicronucleatum [subset]. This could suggest pervasive local adaptation (or balancing selection), a recent contraction in population size, or a fairly ancient population subdivision in all Paramecium species. A previous study that sampled variation at 10 nuclear loci found that strains from different continents shared the same alleles and interpreted that as evidence of regular global dispersal of Paramecium from relatively few permanent locations (Catania et al. 2009). This could also be a potential explanation for the observed increase in nucleotide diversity in comparison to the number of segregating sites.

Fraction of Heterozygous Sites across the Genome

On average, about 0.1% of sites in an individual were heterozygous within the P. aurelia species, whereas 0.7% of sites in individuals belonging to P. caudatum and 1.1% of sites in isolates of P. multimicronucleatum were found to be heterozygous (supplementary table S1, Supplementary Material online). This is consistent with the observation that the P. aurelia species undergo autogamy, whereas the outgroup species do not. On evaluating the fraction of heterozygous sites in windows of 10 kb across the genome, we find large blocks of homozygosity in the P. aurelia genomes (supplementary fig. S10a–e, Supplementary Material online), consistent with losing heterozygosity in the lab while possibly undergoing multiple episodes of autogamy. It can, however, be noted that peaks in heterozygous regions in the P. aurelia genomes are highly correlated between isolates, unlike in P. caudatum and P. multimicronucleatum, and might potentially reflect mapping or assembly errors. These heterozygous regions (summarized in supplementary Files S1–S3, Supplementary Material online) comprise a very small part (0.82–0.97%) of the genome and, most importantly, on exclusion do not affect any of our analyses qualitatively.

Variation across the Mitochondrial Genome

Few ciliate mitochondrial genomes have been sequenced until now, most belonging to the genus Tetrahymena (Brunk et al. 2003; Burger et al. 2000; Moradian et al. 2007). Some others include Paramecium tetraurelia (Pritchard et al. 1990), Paramecium caudatum (Barth and Berendonk 2011), and Oxytricha trifallax (Swart et al. 2012). These studies have illustrated that ciliate mitochondrial genomes are linear and range between 40 and 47 kb, with 50 intronless genes. However, there has been no understanding of the levels of sequence diversity across the entire mitochondrial genome.

To obtain mitochondrial genomes for all species included in this study, we carried out de novo genome assembly using the whole-genome resequencing data for each strain and extracted mitochondrial contigs using the published Paramecium mitochondrial genomes as reference (see Methods for more details). Using this approach, we derived complete genome assemblies for multiple strains, including from species without a previously published mitochondrial genome (P. biaurelia, P. sexaurelia, and P. multimicronucleatum). A comprehensive analysis of these sequences will be published elsewhere (Johri et al., in preparation). The mitochondrial genome assembly from each isolate was annotated separately, and the eight genes that were most ubiquitously present in all individuals were concatenated and used to construct a phylogeny (see Methods). The mitochondrial phylogeny closely tracked the one based on nuclear sequences (supplementary fig. S11, Supplementary Material online).


One strain from each species was chosen as a reference genome (the genome with the longest assembly), and reads from other individuals were mapped to the reference to call SNPs. During the original DNA preparation, efforts were made to remove non-MAC DNA, resulting in varying amounts of mitochondrial DNA present in the extracts from each isolate. Hence, some individuals mapped with extremely low coverage (4× to zero read depth) to the reference genome, especially after applying stringent filters to call SNPs (supplementary table S9, Supplementary Material online). Thus, some individuals were excluded from this analysis, including all P. biaurelia isolates. The subsequent analysis was therefore restricted to four species.

All of the mitochondrial genomes are about 40 kb in length. We identified 626 SNPs in P. tetraurelia, 5307 in P. sexaurelia, 4771 [3041] in P. caudatum [subset], and 4600 [1478] in P. multimicronucleatum [subset]. Within all species, very few indels were found, with just 1 in P. tetraurelia, 9 in P. sexaurelia, and 15 [13] and 23 [9] in P. caudatum [subset] and P. multimicronucleatum [subset], respectively. The nucleotide diversity at 4-fold degenerate sites was found to be 0.009 in P. tetraurelia, 0.169 in P. sexaurelia, 0.127 [0.093] in P. caudatum [subset], and 0.172 [0.073] in P. multimicronucleatum [subset]. Nucleotide diversity across different sites shows that rRNAs are the most conserved sequences in the ciliate mitochondrial genomes. Transfer RNAs are the next most highly conserved, with their diversity levels only slightly above those of replacement sites (table 2). Our estimates of diversity in the Paramecium mitochondria are higher than those of a number of other unicellular eukaryotes studied so far: P. falciparum (πsyn = 0.0012) (Joy et al. 2003) and Chlamydomonas reinhardtii (πsilent = 0.0085) (Smith and Lee 2008).

We also recalculated nuclear nucleotide diversity (table 2) at 4-fold degenerate sites for the same set of individuals used in the mitochondrial analysis and observed that the mitochondrial to nuclear diversity ratio is close to 1.0 in all species (table 2) except P. sexaurelia, where the mitochondrial diversity is 5.7-fold higher than that in the nucleus. On excluding P. sexaurelia, we obtain a strong correlation between 4-fold degenerate nucleotide diversity in the mitochondria and nucleus (slope = 0.95, R² = 0.96). It should be noted that there are not enough data points to make any inferences about statistical significance. However, such a positive correlation between the nuclear and mitochondrial diversity suggests that mitochondrial diversity might scale with population size in ciliates, bringing to light previous studies in animals that have yielded conflicting findings about whether mitochondrial diversity correlates with population size (Bazin et al. 2006; Nabholz et al. 2008; Nabholz et al. 2009; Piganeau and Eyre-Walker 2009). The observed correlation also suggests that mitochondrial and nuclear spontaneous mutation rates may be correlated. Direct estimation of mutation rates in metazoans, C. elegans, D. melanogaster, S. cerevisiae, and H. sapiens, obtained via mutation accumulation experiments, suggests a moderate correlation (slope = 0.05, R² = 0.48) between mitochondrial and nuclear base substitution rates (using data from Lynch et al. 2008). Discordance between the nuclear and mitochondrial diversity, for instance in

Table 2. Nucleotide Diversity (per site) of Different Sites across the Mitochondrial Genomes.

Species                           4-fold (πm) [SE]   2-fold [SE]        0-fold [SE]        tRNA [SE]          rRNA [SE]          intergenic [SE]    4-fold nuclear (πn) [SE]   πm/πn
P. tetraurelia                    0.0091 [0.0010]    0.0107 [0.0009]    0.0019 [0.0002]    0.0000 [0.0000]    0.0006 [0.0002]    0.0041 [0.0003]    0.0058 [2.26 × 10⁻⁵]       1.57
P. sexaurelia                     0.1692 [0.0033]    0.1427 [0.0024]    0.0097 [0.00043]   0.0000 [0.0000]    0.0059 [0.0007]    0.0417 [0.0027]    0.02945 [4.73 × 10⁻⁵]      5.75
P. caudatum                       0.1272 [0.0037]    0.0942 [0.0019]    0.0179 [0.0006]    0.0455 [0.0023]    0.0136 [0.0012]    0.0333 [0.0027]    0.1267 [1.08 × 10⁻⁴]       1.00
P. caudatum (subset)              0.0929 [0.0039]    0.0859 [0.0021]    0.0149 [0.0006]    0.0375 [0.0023]    0.0093 [0.0011]    0.0292 [0.0028]    0.0658 [9.63 × 10⁻⁵]       1.41
P. multimicronucleatum            0.1727 [0.0047]    0.0831 [0.0019]    0.0227 [0.0008]    0.0263 [0.0011]    0.0269 [0.0016]    0.0329 [0.0022]    0.1599 [1.34 × 10⁻⁴]       1.08
P. multimicronucleatum (subset)   0.0730 [0.0034]    0.0383 [0.0015]    0.0121 [0.0006]    0.0126 [0.0009]    0.0135 [0.0012]    0.0312 [0.0022]    0.0742 [9.56 × 10⁻⁵]       0.98


P. sexaurelia, can be attributed to various factors such as positive selection (Hudson and Turelli 2003; Galtier et al. 2009; James et al. 2016), demographic events (Pool and Nielsen 2007), or simply differences in nuclear and/or mitochondrial mutation rates (Lynch 2007). Future studies are needed to dissect the factors responsible for the remarkably different ratio of mitochondrial to nuclear diversity in P. sexaurelia.
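
The mitochondrial-to-nuclear ratios reported in table 2 follow directly from the two 4-fold diversity columns. As a minimal sketch of that arithmetic (values transcribed from table 2; the dictionary and function names are illustrative and not part of the original analysis):

```python
# 4-fold degenerate diversity per species: (mitochondrial, nuclear),
# transcribed from table 2. Names below are illustrative only.
TABLE2_FOURFOLD = {
    "P. tetraurelia": (0.0091, 0.0058),
    "P. sexaurelia": (0.1692, 0.02945),
    "P. caudatum": (0.1272, 0.1267),
    "P. caudatum (subset)": (0.0929, 0.0658),
    "P. multimicronucleatum": (0.1727, 0.1599),
    "P. multimicronucleatum (subset)": (0.0730, 0.0742),
}


def diversity_ratio(pi_mito, pi_nuclear):
    """Ratio of mitochondrial to nuclear diversity at 4-fold sites."""
    return pi_mito / pi_nuclear


ratios = {
    species: round(diversity_ratio(mito, nuclear), 2)
    for species, (mito, nuclear) in TABLE2_FOURFOLD.items()
}

for species, ratio in ratios.items():
    print(f"{species}: {ratio:.2f}")
```

Only P. sexaurelia departs strongly from a ratio of about 1, which is the discordance discussed above.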

Conclusions

This is the first study where multiple individuals of several species of free-living unicellular eukaryotes have been sequenced and used to reveal genome-wide patterns of variation. This study thus provides an invaluable resource for future research in Paramecium, a major model organism in cell biology. Sequencing of multiple individuals from the same lake or pond is needed to further resolve the global versus local population structure of these cosmopolitan unicellular species. Overall, our results suggest that most of the intergenic regions across the highly compact genomes of Paramecium species are strongly constrained and probably harbor functionally important regulatory modules, which might extend into coding regions. We show that patterns of nucleotide diversity across the genome are driven primarily by a combination of purifying selection and mutation rate variation. In future, such analyses will benefit greatly from incorporating variation in recombination rates across the genomic landscape. We also find that duplicates created by the WGD appear to be evolving under similar constraints. Synonymous to non-synonymous diversity ratios need to be supplemented with divergence levels to distinguish between the various types of selective forces acting on the duplicates. Thus, sequencing of much closer outgroup species would allow us to further understand the role of selection in the evolution of different genomic elements, including gene duplicates that resulted from ancient WGDs.

Methods

Sampling, DNA Extraction, and Sequencing

We examined a total of 56 Paramecium strains originally isolated from environmental sources, including 11 isolates of P. biaurelia, 13 isolates of P. sexaurelia, 12 isolates of P. tetraurelia, 10 isolates of P. caudatum, and 10 isolates of P. multimicronucleatum. A total of 14 strains from P. biaurelia and P. sexaurelia were kindly provided by E. Przybos, whereas the remaining strains were cultured in the Paramecium collections of the Institute of Hydrobiology, TU Dresden, Germany, or the Department of Biology, Indiana University, Bloomington, USA. Prior to our analyses, P. tetraurelia strain Moz 13CXI, as well as P. sexaurelia strains Moz 13BIII and Indo 17I, were kindly identified by E. Przybos by performing inter- and intrastrain crosses (Przybos 1975). Paramecia were maintained in 0.25% Cerophyl medium (modified according to Sonneborn 1970; Krenek et al. 2011) inoculated with Enterobacter aerogenes. Since P. aurelia species can undergo autogamy (self-fertilization) induced by starvation, and all strains have been cultured under lab conditions for several years, all P. aurelia strains in this study are expected to be more homozygous than wild isolates.

Prior to DNA extraction, all cultures were initiated from a single cell, ensuring clonality. A modified sucrose density separation (Aury et al. 2006) was performed to extract primarily macronuclei (MACs) and to reduce the amount of non-MAC DNA potentially compromising subsequent analyses. Briefly, Paramecium cells were starved for at least two days and filtered through several layers of sterile cheesecloth prior to cell harvesting via gentle centrifugation. Pelleted cells were gently lysed in 0.25 M sucrose medium (pH 6.8) containing 10 mM Tris-HCl, 8 mM MgCl2, 3 mM CaCl2, 0.5–1% Nonidet-P40, and 0.1–0.2% sodium deoxycholate. Macronuclei were subsequently isolated away from cellular debris, micronuclei, and mitochondria by centrifugation at 100 rcf in 0.25 M sucrose wash buffer (see above, without Nonidet-P40 and Na-deoxycholate). Genomic DNA extraction from MAC isolations was performed by a modified CTAB procedure (Gustincich et al. 1991).

All genomes were sequenced using the Illumina HiSeq 2000, and paired-end reads with insert sizes between 200 and 400 bp and a mean length of 151 bp were obtained. The reads mapped with coverage ranging between 40× and 80× (details given in supplementary table S1, Supplementary Material online).

Paramecium Reference Genomes

Complete genomes of P. tetraurelia, P. biaurelia, P. sexaurelia, and P. caudatum were downloaded from ParameciumDB (Arnaiz et al. 2007; Arnaiz and Sperling 2011) along with the annotations. Ortho-paralog relationships, as well as functional annotations, were retrieved from the supplementary information provided in McGrath et al. (2014a, 2014b). RNAseq data for all four species were downloaded and processed as in Gout and Lynch (2015).

The genome of P. multimicronucleatum was assembled using Celera (Miller et al. 2008) and annotated using Eugene (Foissac et al. 2008) as previously described (McGrath et al. 2014a, 2014b). A detailed description of the sequencing, assembly, and annotation of P. multimicronucleatum can be found in Marinov et al. (in preparation). OMA (Orthology MAtrix) (Altenhoff et al. 2015) was used to find orthologs between P. caudatum and P. multimicronucleatum. A total of 11,002 genes of P. caudatum were found to have 1-1 orthologs in P. multimicronucleatum, and only 1-1 orthologs were used for the current analysis.

Mapping of Reads and Detection of SNPs and Indels

Sequenced paired-end reads were mapped to the reference genomes for each species using BWA (bwa mem) (Li and Durbin 2010) with default parameters, and the mpileup function in Samtools (Li et al. 2009) was used to generate mpileup files for each individual. Two distinct pipelines were employed to identify SNPs to ensure detection accuracy: Bcftools-Vcftools (Li 2011), as well as the maximum likelihood Genotype Frequency Estimator (referred to as GFE) (Lynch 2009; Maruki and Lynch 2015). Both pipelines produced highly overlapping sets of SNPs (supplementary table S2,


Supplementary Material online). For both pipelines, sites were only considered for SNP detection and further analysis if the mapping quality was above 30, base quality was above 20, per-base alignment quality was above 15, and the sum of the depth of coverage for all individuals was above 5 times the number of individuals and less than twice the average population coverage. SNPs were called by bcftools if the SNP quality score was above 30 and by GFE if the significance score was > 5.91 (P < 0.05). The exact parameters of the Bcftools-Vcftools pipeline to call SNPs and indels are given below.

samtools mpileup -q 30 -Q 20 -t DP -uvf reference_genome.fasta sorted.bam > species.vcf

bcftools call -f GQ -o species_snps.vcf -O v -V indels -cv species.vcf

bcftools call -f GQ -o species_indels.vcf -O v -V snps -cv species.vcf

# filtering snp and indel files for depth and biases
perl vcfutils.pl varFilter -d 50 -D 800 -1 -2 -3 -4 species_snps.vcf > species_snps.filtered1.vcf

perl vcfutils.pl varFilter -d 50 -D 800 -1 -2 -3 -4 species_indels.vcf > species_indels.filtered1.vcf

vcftools --vcf species_snps.filtered1.vcf --minQ 20 --minGQ 30 --minDP 4 --recode --recode-INFO-all --out species_snps.filtered2

vcftools --vcf species_indels.filtered1.vcf --minQ 20 --minGQ 30 --minDP 4 --recode --recode-INFO-all --out species_indels.filtered2

# filter out the recoded sites that are not polymorphic anymore, that is, AF1 = 1
bcftools view -q 0.001 species_snps.filtered2.recode.vcf -o species_snps.final.vcf

bcftools view -q 0.001 species_indels.filtered2.recode.vcf -o species_indels.final.vcf

Mitochondrial SNPs were called using the Bcftools-Vcftools pipeline with the same parameters as noted above.

Simulation of SNP Detection with Varying Levels of Divergence from the Reference

We simulated reads using wgsim (https://github.com/lh3/wgsim) for a diploid genome with varying levels of divergence from the reference genome of P. caudatum and P. multimicronucleatum, using average values of read length, total number of reads, and insert sizes.

For the P. caudatum reference genome:

wgsim -d 287 -s 113 -N 10201087 -1 151 -2 151 -r 0.10 -R 0.013 caudatum_assembly.fasta caud_010_1.fastq caud_010_2.fastq >& caud_010.info

For the P. multimicronucleatum reference genome:

wgsim -d 261 -s 102 -N 12154044 -1 151 -2 151 -r 0.10 -R 0.011 multimicronucleatum_assembly.fasta multi_010_1.fastq multi_010_2.fastq >& multi_010.info

Reads were simulated for r = 0.0, 0.05, 0.10, 0.15, 0.20, 0.25, and 0.30, which represents the fraction of the genome with SNPs. The SNP detection pipeline described above was used to call SNPs, which were compared to the simulated set of SNPs (supplementary table S4, Supplementary Material online).
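
The comparison between simulated and called SNPs can be pictured as a set comparison on (scaffold, position) pairs. The following is only a sketch under that assumption; the function name and the reported metrics are illustrative and are not taken from the original pipeline.

```python
def compare_calls(simulated, called):
    """Compare called SNPs with the simulated truth set.

    Both arguments are iterables of (scaffold, position) tuples.
    Returns counts plus recall and precision (NaN when undefined).
    """
    simulated, called = set(simulated), set(called)
    true_pos = len(simulated & called)
    return {
        "true_positives": true_pos,
        "false_negatives": len(simulated - called),  # simulated but missed
        "false_positives": len(called - simulated),  # called but not simulated
        "recall": true_pos / len(simulated) if simulated else float("nan"),
        "precision": true_pos / len(called) if called else float("nan"),
    }


# Toy example with hypothetical coordinates.
truth = {("scaffold_1", 10), ("scaffold_1", 20), ("scaffold_2", 5)}
calls = {("scaffold_1", 10), ("scaffold_2", 5), ("scaffold_2", 99)}
summary = compare_calls(truth, calls)
print(summary)
```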

Building Phylogenies

A consensus genome (bcftools view species_filtered.recode.vcf | vcfutils.pl vcf2fq > species.cns.fq) was obtained for each individual using bcftools that represented heterozygous sites using DNA ambiguity codes. Gene nucleotide sequences that are present in single copy in the P. aurelia species and are also present in both the outgroup species (P. caudatum and P. multimicronucleatum) were extracted, and multiple sequence alignments were generated using MUSCLE (Edgar 2004a, 2004b). This amounted to about 55 sequences (including the reference genomes) each in 883 sets of orthologous genes. RAxML 8.0.26 (Stamatakis 2014) was used to build a tree using the concatenation of all 883 aligned gene sequences (raxml -s concatenated_sequences.fasta -n gene -m GTRGAMMA -T 4 -p 7). In addition, individual gene trees (raxml -s gene.fasta -n gene -m GTRGAMMA -T 4 -p 7) were built, which were then used to create a consensus tree (raxml -J MR -z trees.newick -T 4 -m GTRGAMMA) (Aberer et al. 2010).
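
To make the use of DNA ambiguity codes concrete, the sketch below maps a diploid genotype to a single IUPAC symbol. It illustrates the encoding only and is not the vcfutils.pl implementation; the function and table names are hypothetical.

```python
# IUPAC ambiguity codes for the six possible heterozygous base pairs.
IUPAC_HET = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def consensus_base(allele1, allele2):
    """Return the base itself for a homozygote, else the IUPAC code."""
    a, b = allele1.upper(), allele2.upper()
    if a == b:
        return a
    return IUPAC_HET[frozenset((a, b))]


def consensus_sequence(genotypes):
    """Build a consensus string from a list of (allele1, allele2) pairs."""
    return "".join(consensus_base(a, b) for a, b in genotypes)


print(consensus_sequence([("A", "A"), ("A", "G"), ("T", "C"), ("G", "G")]))
```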

Mitochondrial phylogeny was built using the same parameters as above by concatenating eight sets of orthologous genes (NADH_dehydrogenase_subunit_1_a, NADH_dehydrogenase_subunit_5, Ymf56orf78, Ymf67orf265_1, Ymf68orf393, apocytochrome_b, cytochrome_c_oxidase_subunit_1, cytochrome_c_oxidase_subunit_2, ribosomal_protein_S14) that were ubiquitously present across all sampled species.

Principal Component Analysis

Genotypes for all individuals were obtained from the vcf files generated by vcftools for all scaffolds. Only the sites that had a genotype available for all individuals were used in the analysis, i.e., if even a single individual had missing data, the site was excluded from analysis. A total of 10,000 markers were used for each species. The first marker of each scaffold was picked randomly within the first 50 SNP positions, and the rest were selected so that all markers were uniformly distributed across all scaffolds. PCA was carried out on these markers by using the R (R-Core-Team 2014) command prcomp (https://stat.ethz.ch/R-manual/R-devel/library/stats/html/prcomp.html) and visualized using the first two components with the R graphing package ggplot2 (Wickham 2010).
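
One way to read the marker-selection step is sketched below: a random starting SNP within the first 50 on a scaffold, followed by every `step`-th SNP. Interpreting "uniformly distributed" as evenly spaced in SNP index is an assumption of this sketch, as are the function name and the `step` parameter.

```python
import random


def pick_markers(snp_positions, step, rng, first_window=50):
    """Thin the SNPs of one scaffold to evenly spaced markers.

    snp_positions: sorted SNP coordinates with no missing genotypes.
    step: keep every step-th SNP after the random start (assumed spacing).
    rng: a random.Random instance, passed in for reproducibility.
    """
    if not snp_positions:
        return []
    # Random start among the first `first_window` SNPs (or fewer if short).
    start = rng.randrange(min(first_window, len(snp_positions)))
    return snp_positions[start::step]


# Toy scaffold: 1,000 SNPs, one every 10 bp (hypothetical data).
positions = list(range(0, 10000, 10))
markers = pick_markers(positions, step=100, rng=random.Random(1))
print(len(markers), markers[:3])
```

In practice `step` would be chosen from the total number of usable SNPs and the target number of markers per species.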

Calculation of Population Genetic Statistics

Nucleotide Diversity and Tajima's D

Nucleotide diversity was calculated as 2pqn/(n − 1), where p and q are the major and minor allele frequencies and n is the number of chromosomes. The major and minor allele frequencies were given for each site by GFE. Diversity at coding sites was partitioned into synonymous and non-synonymous components (e.g., 2-fold degenerate sites are one-third synonymous and two-thirds non-synonymous). Major allele frequencies were also extracted from vcftools and returned very similar results to those from GFE (supplementary table S3, Supplementary Material online).
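
As a minimal sketch of the per-site estimator 2pqn/(n − 1) and of the partitioning of 2-fold degenerate sites described above (function names are illustrative; allele frequencies would come from GFE or vcftools in the actual analysis):

```python
def site_diversity(major_freq, n_chromosomes):
    """Per-site nucleotide diversity 2pq * n / (n - 1) for a biallelic site."""
    if n_chromosomes < 2:
        raise ValueError("at least two chromosomes are required")
    p = major_freq
    q = 1.0 - p
    return 2.0 * p * q * n_chromosomes / (n_chromosomes - 1)


def split_twofold(pi_site):
    """Partition diversity at a 2-fold degenerate site.

    One-third is counted as synonymous, two-thirds as non-synonymous.
    Returns (synonymous, non_synonymous).
    """
    return pi_site / 3.0, 2.0 * pi_site / 3.0


# Example: 10 diploid individuals (20 chromosomes), major allele at 50%.
pi = site_diversity(0.5, 20)
syn, nonsyn = split_twofold(pi)
print(pi, syn, nonsyn)
```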


A sliding window approach was used to calculate Tajima's D across the genome. Vcftools was used to obtain these values for non-overlapping windows of 1 kb in length. The program ms (Hudson 2002) was used to perform 100,000 simulations of 10 individuals with 1 kb locus length, and niceStats (https://github.com/kern-lab/shIC/blob/master/niceStats.c) was used to calculate Tajima's D from the ms output file. To obtain Tajima's D for different categories of sites, SNPs belonging to a particular category of sites (specified in Results) were selected, and vcftools was used to calculate Tajima's D for non-overlapping windows of 10 kb. Windows with 2 or more SNPs were used to calculate the mean, and only alternate windows were used to compare distributions of Tajima's D across different categories by Student's t-test to maximise independence of sampling.
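
For readers unfamiliar with the statistic, the sketch below implements the standard textbook form of Tajima's D for a single window. It is not the vcftools or niceStats code; the function name and arguments are illustrative, and `pi_total` is assumed to be the average number of pairwise differences summed over the window.

```python
import math


def tajimas_d(n, seg_sites, pi_total):
    """Tajima's D for one window.

    n: number of sampled chromosomes.
    seg_sites: number of segregating sites (S) in the window.
    pi_total: mean number of pairwise differences in the window.
    Returns None when the statistic is undefined (S = 0 or n < 4).
    """
    if n < 4 or seg_sites == 0:
        return None
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    if variance <= 0:
        return None
    # Difference between the two estimators of theta, scaled by its SD.
    return (pi_total - seg_sites / a1) / math.sqrt(variance)


# Classic worked example: 10 sequences, 16 segregating sites.
d_example = tajimas_d(10, 16, 3.888889)
print(round(d_example, 3))
```

Negative values indicate an excess of rare variants relative to neutral expectation, and positive values an excess of intermediate-frequency variants.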

Plotting Average Expected Heterozygosity across Introns and Intergenic Regions

All introns (< 500 bp in length) were split midway and aligned according to the 5′ and 3′ splice sites. All intergenic regions (< 500 bp in the outgroup species and < 1000 bp in the P. aurelia species) were split midway and aligned according to the start and stop codons. Note that these were not sequence alignments. At each site, the mean expected heterozygosity was calculated. Sites with an insufficient number of data points (< 200 data points per site for introns; for intergenic regions, < 100 data points per site in the outgroups and < 1000 data points per site in the P. aurelia species) were discarded.

Calculating and Plotting Divergence across Introns and Intergenic Regions

Orthologous coding sequences along with introns were aligned using MUSCLE (Edgar 2004a). Intergenic regions flanked by orthologous genes on both sides were also aligned along with the flanking coding exon using MUSCLE. For each species separately, divergence for different positions (with respect to the start and stop sites) was averaged across all introns and intergenic regions. The divergence values were corrected for multiple substitutions by the Jukes-Cantor sequence substitution model and plotted in the same manner as described above.
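
The Jukes-Cantor correction converts an observed proportion of differing sites p into an estimated number of substitutions per site, d = −(3/4) ln(1 − 4p/3). A minimal sketch follows; the function name is illustrative.

```python
import math


def jukes_cantor(p_observed):
    """Jukes-Cantor corrected divergence for an observed p-distance.

    Valid for 0 <= p < 0.75; beyond that the correction is undefined.
    """
    if not 0.0 <= p_observed < 0.75:
        raise ValueError("observed divergence must be in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p_observed / 3.0)


# The correction grows with the observed divergence.
for p in (0.0, 0.05, 0.1, 0.3):
    print(p, round(jukes_cantor(p), 4))
```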

GO Analysis

All gene ontology enrichment analyses were performed using the BiNGO plugin for Cytoscape 3.2.1 (Shannon et al. 2003; Maere et al. 2005), using the hypergeometric test to assess significance, with Benjamini and Hochberg FDR correction for multiple testing (Benjamini and Hochberg 1995). The latest ontology files (.obo) were downloaded from www.geneontology.org, and custom annotation files were prepared as required using python scripts.
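
The two statistical ingredients named above, a one-sided hypergeometric test for over-representation and Benjamini-Hochberg adjustment, can be sketched in plain Python as follows. This is an illustration of the statistics only, not BiNGO's implementation; all names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(k, population, annotated, sample):
    """P(X >= k) when drawing `sample` genes from `population`,
    of which `annotated` carry the GO term."""
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(population, sample)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


p_term = hypergeom_enrichment_p(k=1, population=10, annotated=3, sample=4)
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(p_term, adj)
```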

Identification of Heterozygous Regions in the P. aurelia Species

The fraction of heterozygous sites was calculated for 10 bp non-overlapping windows across the genome for all three P. aurelia species, and the top 1% of such regions that have the highest fraction of heterozygous sites were identified as heterozygous regions.
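
The windowing and ranking step can be sketched as follows, assuming a per-site list of heterozygosity flags for one scaffold. The function names, the handling of a trailing partial window, and the tie-breaking are assumptions of this sketch.

```python
def heterozygous_fraction_by_window(het_flags, window):
    """Fraction of heterozygous sites in non-overlapping windows.

    het_flags: list of booleans, one per site. A trailing partial
    window is ignored (an assumption of this sketch).
    """
    fractions = []
    for start in range(0, len(het_flags) - window + 1, window):
        chunk = het_flags[start:start + window]
        fractions.append(sum(chunk) / window)
    return fractions


def top_windows(fractions, top=0.01):
    """Indices of the windows in the top `top` fraction by heterozygosity."""
    keep = max(1, int(len(fractions) * top))
    ranked = sorted(range(len(fractions)), key=lambda i: fractions[i], reverse=True)
    return sorted(ranked[:keep])


# Toy scaffold: 1,000 sites with one fully heterozygous 10-bp stretch.
flags = [False] * 1000
for pos in range(50, 60):
    flags[pos] = True
fractions = heterozygous_fraction_by_window(flags, window=10)
print(top_windows(fractions))
```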

Mitochondrial Genome Assembly and Annotation

Mitochondrial genomes were assembled as follows. Sequence reads were assembled using SPAdes (version 3.5.0) (Bankevich et al. 2012) after removing potential adapter sequence with Trimmomatic (version 0.33) (Bolger et al. 2014). Mitochondrial contigs were identified from the resulting assemblies by BLAST searches against the published P. caudatum and P. tetraurelia mitochondrial genomes (Pritchard et al. 1990; Barth and Berendonk 2011).

Protein coding genes were identified by generating all ORFs longer than 60 aa in all six frames using the Mold, Protozoan, and Coelenterate Mitochondrial Code (i.e., UGA codes for W instead of being a stop codon) and all alternative start codons specific to Paramecium (AUU, AUA, AUG, AUC, GUG, and GUA), and retaining the longest ORFs associated with each stop codon. BLASTP (Altschul et al. 1997) was then used to identify homologs of annotated mitochondrial proteins in P. tetraurelia and P. caudatum. Additional ORFs were identified by imposing the requirement that their length exceeds 100 aa and subsequently annotated using BLASTP against the Non-redundant protein sequences (nr) database and HMMER 3.0 (Eddy 2011) scans against the PFAM 27.0 database (Finn et al. 2014).
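
The ORF-generation step can be sketched as a six-frame scan in which UGA is read as a sense codon and, for each stop codon, the earliest in-frame start since the previous stop is kept (which gives the longest ORF for that stop). This is a simplified illustration rather than the annotation code used here; names and the coordinate convention are assumptions.

```python
# Start codons listed in the text (DNA alphabet) and the two stops that
# remain under the Mold, Protozoan, and Coelenterate code (TGA = Trp).
STARTS = {"ATT", "ATA", "ATG", "ATC", "GTG", "GTA"}
STOPS = {"TAA", "TAG"}


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def longest_orfs(seq, min_aa=60):
    """Longest ORF per stop codon in all six frames.

    Returns (strand, start, end, length_aa) tuples; coordinates are
    0-based, end-exclusive, and relative to the scanned strand.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            first_start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if codon in STOPS:
                    if first_start is not None:
                        n_aa = (i - first_start) // 3
                        if n_aa > min_aa:  # "longer than 60 aa"
                            orfs.append((strand, first_start, i + 3, n_aa))
                    first_start = None
                elif first_start is None and codon in STARTS:
                    first_start = i
    return orfs


# Toy sequence: ATG, 70 alanine codons, then a TAA stop.
toy = "ATG" + "GCT" * 70 + "TAA"
print(longest_orfs(toy))
```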

tRNA genes were annotated with tRNAscan-SE 1.21 (Schattner et al. 2005) using the "Mito/Chloroplast" source.

Supplementary Material

Supplementary data are available at Molecular Biology and Evolution online.

Acknowledgments

We thank Lydia Bright and Jean-Francois Gout for helpful comments on the manuscript, Matthew S. Ackerman for insightful discussions related to the project, and the anonymous reviewers and editor for their valuable suggestions. We would also like to thank Kelley Thomas for providing help with the sequencing and E. Przybos for kindly providing several Paramecium strains and the species identification of some so far unspecified strains. This work was financially supported by the National Science Foundation (MCB-1050161 and DEB-1257806), the European Commission FP7-PEOPLE-2009-IRSES project CINAR PATHOBACTER (project number 247658), and the CNRS research network GDRE "Paramecium Genome Dynamics and Evolution". T.U.B. also acknowledges funding of the "support the best" of the ZUK TU Dresden and the Deutsche Forschungsgemeinschaft (DFG).

References

Abbott RJ, Gomes MF. 1989. Population genetic structure and outcrossing rate of Arabidopsis thaliana (L.) Heynh. Heredity 62:411–418.

Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, McVean GA. 2010. A map of human genome variation from population-scale sequencing. Nature 467:1061–1073.


Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA. 2012. An integrated map of genetic variation from 1,092 human genomes. Nature 491:56–65.

Aberer AJ, Pattengale ND, Stamatakis A. 2010. Parallelized phylogenetic post-analysis on multi-core architectures. J Comput Sci-Neth 1:107–114.

Ackermann AA, Panunzi LG, Cosentino RO, Sanchez DO, Aguero F. 2012. A genomic scale map of genetic diversity in Trypanosoma cruzi. BMC Genomics 13:736.

Altenhoff AM, Skunca N, Glover N, Train CM, Sueki A, Pilizota I, Gori K, Tomiczek B, Muller S, Redestig H, et al. 2015. The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Res 43:D240–D249.

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402.

Arnaiz O, Cain S, Cohen J, Sperling L. 2007. ParameciumDB: a community resource that integrates the Paramecium tetraurelia genome sequence with genetic data. Nucleic Acids Res 35:D439–D444.

Arnaiz O, Mathy N, Baudry C, Malinsky S, Aury JM, Wilkes CD, Garnier O, Labadie K, Lauderdale BE, Le Mouel A, et al. 2012. The Paramecium germline genome provides a niche for intragenic parasitic DNA: evolutionary dynamics of internal eliminated sequences. PLoS Genet 8:e1002984.

Arnaiz O, Sperling L. 2011. ParameciumDB in 2011: new tools and new data for functional and comparative genomics of the model ciliate Paramecium tetraurelia. Nucleic Acids Res 39:D632–D636.

Assefa S, Lim C, Preston MD, Duffy CW, Nair MB, Adroub SA, Kadir KA, Goldberg JM, Neafsey DE, Divis P, et al. 2015. Population genomic structure and adaptation in the zoonotic malaria parasite Plasmodium knowlesi. Proc Natl Acad Sci U S A 112:13027–13032.

Aury JM, Jaillon O, Duret L, Noel B, Jubin C, Porcel BM, Segurens B, Daubin V, Anthouard V, Aiach N, et al. 2006. Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature 444:171–178.

Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR. 2015. A global reference for human genetic variation. Nature 526:68–74.

Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, et al. 2012. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol 19:455–477.

Barth D, Berendonk TU. 2011. The mitochondrial genome sequence of the ciliate Paramecium caudatum reveals a shift in nucleotide composition and codon usage within the genus Paramecium. BMC Genomics 12:272.

Barth D, Krenek S, Fokin SI, Berendonk TU. 2006. Intraspecific genetic variation in Paramecium revealed by mitochondrial cytochrome C oxidase I sequences. J Eukaryot Microbiol 53:20–25.

Bazin E, Glemin S, Galtier N. 2006. Population size does not influence mitochondrial genetic diversity in animals. Science 312:570–572.

Beale G, Preer JR. 2008. Paramecium: genetics and epigenetics. Boca Raton (FL): CRC Press.

Begun DJ, Aquadro CF. 1992. Levels of naturally-occurring DNA polymorphism correlate with recombination rates in Drosophila melanogaster. Nature 356:519–520.

Begun DJ, Holloway AK, Stevens K, Hillier LW, Poh YP, Hahn MW, Nista PM, Jones CD, Kern AD, Dewey CN, et al. 2007. Population genomics: whole-genome analysis of polymorphism and divergence in Drosophila simulans. PLoS Biol 5:e310.

Benjamini Y, Hochberg Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B Met 57:289–300.

Betermier M. 2004. Large-scale genome remodelling by the developmentally programmed elimination of germ line sequences in the ciliate Paramecium. Res Microbiol 155:399–408.

Birky CW, Walsh JB. 1988. Effects of linkage on rates of molecular evolution. Proc Natl Acad Sci U S A 85:6414–6418.

Birnbaum RY, Clowney EJ, Agamy O, Kim MJ, Zhao J, Yamanaka T, Pappalardo Z, Clarke SL, Wenger AM, Nguyen L, et al. 2012. Coding exons function as tissue-specific enhancers of nearby genes. Genome Res 22:1059–1068.

Boel G, Letso R, Neely H, Price WN, Wong KH, Su M, Luff JD, Valecha M, Everett JK, Acton TB, et al. 2016. Codon influence on protein expression in E. coli correlates with mRNA levels. Nature 529:358–363.

Bolger AM, Lohse M, Usadel B. 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30:2114–2120.

Brandvain Y, Wright SI. 2016. The limits of natural selection in a nonequilibrium world. Trends Genet 32:201–210.

Brunk CF, Lee LC, Tran AB, Li J. 2003. Complete sequence of the mitochondrial genome of Tetrahymena thermophila and comparative methods for identifying highly divergent genes. Nucleic Acids Res 31:1673–1682.

Burger G, Zhu Y, Littlejohn TG, Greenwood SJ, Schnare MN, Lang BF, Gray MW. 2000. Complete sequence of the mitochondrial genome of Tetrahymena pyriformis and comparison with Paramecium aurelia mitochondrial DNA. J Mol Biol 297:365–380.

Burki F. 2014. The eukaryotic tree of life from a global phylogenomic perspective. Cold Spring Harb Perspect Biol 6:a016147.

Cao J, Schneeberger K, Ossowski S, Gunther T, Bender S, Fitz J, Koenig D, Lanz C, Stegle O, Lippert C, et al. 2011. Whole-genome sequencing of multiple Arabidopsis thaliana populations. Nat Genet 43:956–963.

Catania F, Wurmser F, Potekhin AA, Przybos E, Lynch M. 2009. Genetic diversity in the Paramecium aurelia species complex. Mol Biol Evol 26:421–431.

Chalker DL, Yao MC. 2011. DNA elimination in ciliates: transposon domestication and genome surveillance. Annu Rev Genet 45:227–246.

Charlesworth D. 2003. Effects of inbreeding on the genetic diversity of populations. Philos T Roy Soc B 358:1051–1070.

Charlesworth B, Morgan MT, Charlesworth D. 1993. The effect of deleterious mutations on neutral molecular variation. Genetics 134:1289–1303.

Chia JM, Song C, Bradbury PJ, Costich D, de Leon N, Doebley J, Elshire RJ, Gaut B, Geller L, Glaubitz JC, et al. 2012. Maize HapMap2 identifies extant variation from a genome in flux. Nat Genet 44:803–807.

Comeron JM, Williford A, Kliman RM. 2008. The Hill-Robertson effect: evolutionary consequences of weak selection and linkage in finite populations. Heredity 100:19–31.

Dey A, Chan CK, Thomas CG, Cutter AD. 2013. Molecular hyperdiversity defines populations of the nematode Caenorhabditis brenneri. Proc Natl Acad Sci U S A 110:11056–11060.

Dong X, Navratilova P, Fredman D, Drivenes O, Becker TS, Lenhard B. 2010. Exonic remnants of whole-genome duplication reveal cis-regulatory function of coding exons. Nucleic Acids Res 38:1071–1085.

Downing T, Imamura H, Decuypere S, Clark TG, Coombs GH, Cotton JA, Hilley JD, de Doncker S, Maes I, Mottram JC, et al. 2011. Whole genome sequencing of multiple Leishmania donovani clinical isolates provides insights into population structure and mechanisms of drug resistance. Genome Res 21:2143–2156.

Drummond DA, Bloom JD, Adami C, Wilke CO, Arnold FH. 2005. Why highly expressed proteins evolve slowly. Proc Natl Acad Sci U S A 102:14338–14343.

Eddy SR. 2011. Accelerated profile HMM searches. PLoS Comput Biol 7:e1002195.

Edgar RC. 2004a. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5:113.

Edgar RC. 2004b. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32:1792–1797.

Eichenlaub MP, Ettwiller L. 2011. De novo genesis of enhancers in vertebrates. PLoS Biol 9:e1001188.

Ellison CE, Hall C, Kowbel D, Welch J, Brem RB, Glass NL, Taylor JW. 2011. Population genomics and local adaptation in wild isolates of a model microbial eukaryote. Proc Natl Acad Sci U S A 108:2831–2836.

Elyashiv E, Bullaughey K, Sattath S, Rinott Y, Przeworski M, Sella G. 2010. Shifts in the intensity of purifying selection: an analysis of genome-wide polymorphism data from two closely related yeast species. Genome Res 20:1558–1573.

Engelthaler DM, Hicks ND, Gillece JD, Roe CC, Schupp JM, Driebe EM, Gilgado F, Carriconde F, Trilles L, Firacative C, et al. 2014. Cryptococcus gattii in North American Pacific Northwest: whole-population genome analysis provides insights into species evolution and dispersal. mBio 5:e01464–e01414.

Fawcett JA, Iida T, Takuno S, Sugino RP, Kado T, Kugou K, Mura S, Kobayashi T, Ohta K, Nakayama J, et al. 2014. Population genomics of the fission yeast Schizosaccharomyces pombe. PLoS ONE 9:e104241.

Fenchel T, Finlay BJ. 2004. The ubiquity of small species: patterns of local and global diversity. Bioscience 54:777–784.

Finlay BJ. 2002. Global dispersal of free-living microbial eukaryote species. Science 296:1061–1063.

Finlay BJ, Esteban GF, Fenchel T. 2004. Protist diversity is different? Protist 155:15–22.

Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J, et al. 2014. Pfam: the protein families database. Nucleic Acids Res 42:D222–D230.

Foissac S, Gouzy J, Rombauts S, Mathe C, Amselem J, Sterck L, Van de Peer Y, Rouze P, Schiex T. 2008. Genome annotation in plants and fungi: EuGene as a model platform. Curr Bioinform 3:87–97.

Foissner W. 1999. Protist diversity: estimates of the near-imponderable. Protist 150:363–368.

Foissner W. 2006. Biogeography and dispersal of micro-organisms: a review emphasizing protists. Acta Protozool 45:111–136.

Foissner W. 2008. Protist diversity and distribution: some basic considerations. Biodivers Conserv 17:235–242.

Foissner W. 2011. Dispersal of protists: the role of cysts and human introductions. In: Fontaneto D, editor. Biogeography of microscopic organisms: is everything small everywhere? Cambridge University Press. p. 61–87.

Foissner W, Chao A, Katz LA. 2008. Diversity and geographic distribution of ciliates (Protista: Ciliophora). Biodivers Conserv 17:345–363.

Fournier-Level A, Korte A, Cooper MD, Nordborg M, Schmitt J, Wilczek AM. 2011. A map of local adaptation in Arabidopsis thaliana. Science 334:86–89.

Galtier N, Nabholz B, Glemin S, Hurst GDD. 2009. Mitochondrial DNA as a marker of molecular diversity: a reappraisal. Mol Ecol 18:4541–4550.

Gout JF, Kahn D, Duret L, Paramecium Post-Genomics C. 2010. The relationship among gene expression, the evolution of gene dosage, and the rate of protein evolution. PLoS Genet 6:e1000944.

Gout JF, Lynch M. 2015. Maintenance and loss of duplicated genes by dosage subfunctionalization. Mol Biol Evol 32:2141–2148.

Gruchy DF. 1955. The breeding system and distribution of Tetrahymena pyriformis. J Protozool 2:178–185.

Gustincich S, Manfioletti G, Del Sal G, Schneider C, Carninci P. 1991. A fast method for high-quality genomic DNA extraction from whole human blood. BioTechniques 11:298–300, 302.

Hancock AM, Witonsky DB, Alkorta-Aranburu G, Beall CM, Gebremedhin A, Sukernik R, Utermann G, Pritchard JK, Coop G, Di Rienzo A. 2011. Adaptations to climate-mediated selective pressures in humans. PLoS Genet 7:e1001375.

Hiwatashi K. 1968. Determination and inheritance of mating type in Paramecium caudatum. Genetics 58:373–386.

Hori M, Tomikawa I, Przybos E, Fujishima M. 2006. Comparison of the evolutionary distances among syngens and sibling species of Paramecium. Mol Phylogenet Evol 38:697–704.

Huang X, Kurata N, Wei X, Wang ZX, Wang A, Zhao Q, Zhao Y, Liu K, Lu H, Li W, et al. 2012. A map of rice genome variation reveals the origin of cultivated rice. Nature 490:497–501.

Huang X, Zhao Q, Han B. 2015. Comparative population genomics reveals strong divergence and infrequent introgression between Asian and African rice. Mol Plant 8:958–960.

Hudson RR. 2002. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18:337–338.

Hudson RR, Turelli M. 2003. Stochasticity overrules the “three-times rule”: genetic drift, genetic draft, and coalescence times for nuclear loci versus mitochondrial DNA. Evolution 57:182–190.

Jaillon O, Bouhouche K, Gout JF, Aury JM, Noel B, Saudemont B, Nowacki M, Serrano V, Porcel BM, Segurens B, et al. 2008. Translational control of intron splicing in eukaryotes. Nature 451:359–362.

James JE, Piganeau G, Eyre-Walker A. 2016. The rate of adaptive evolution in animal mitochondria. Mol Ecol 25:67–78.

Joy DA, Feng X, Mu J, Furuya T, Chotivanich K, Krettli AU, Ho M, Wang A, White NJ, Suh E, et al. 2003. Early origin and recent expansion of Plasmodium falciparum. Science 300:318–321.

Kaplan NL, Hudson RR, Langley CH. 1989. The hitchhiking effect revisited. Genetics 123:887–899.

Katz LA, McManus GB, Snoeyenbos-West OLO, Griffin A, Pirog K, Costas B, Foissner W. 2005. Reframing the ‘Everything is everywhere’ debate: evidence for high gene flow and diversity in ciliate morphospecies. Aquat Microb Ecol 41:55–65.

Katz LA, Snoeyenbos-West O, Doerder FP. 2006. Patterns of protein evolution in Tetrahymena thermophila: implications for estimates of effective population size. Mol Biol Evol 23:608–614.

Krenek S, Berendonk TU, Fokin SI. 2015. New Paramecium (Ciliophora, Oligohymenophorea) congeners shape our view on its biodiversity. Org Divers Evol 15:215–233.

Krenek S, Berendonk TU, Petzoldt T. 2011. Thermal performance curves of Paramecium caudatum: a model selection approach. Eur J Protistol 47:124–137.

Krenek S, Petzoldt T, Berendonk TU. 2012. Coping with temperature at the warm edge–patterns of thermal adaptation in the microbial eukaryote Paramecium caudatum. PLoS One 7:e30598.

Kudla G, Murray AW, Tollervey D, Plotkin JB. 2009. Coding-sequence determinants of gene expression in Escherichia coli. Science 324:255–258.

Kvitek DJ, Will JL, Gasch AP. 2008. Variations in stress sensitivity and genomic expression in diverse S. cerevisiae isolates. PLoS Genet 4:e1000223.

Lahr DJ, Laughinghouse HDt, Oliverio AM, Gao F, Katz LA. 2014. How discordant morphological and molecular evolution among microorganisms can revise our notions of biodiversity on Earth. Bioessays 36:950–959.

Langley CH, Stevens K, Cardeno C, Lee YC, Schrider DR, Pool JE, Langley SA, Suarez C, Corbett-Detig RB, Kolaczkowski B, et al. 2012. Genomic variation in natural populations of Drosophila melanogaster. Genetics 192:533–598.

Leffler EM, Bullaughey K, Matute DR, Meyer WK, Segurel L, Venkat A, Andolfatto P, Przeworski M. 2012. Revisiting an old riddle: what determines genetic diversity levels within species? PLoS Biol 10:e1001388.

Li H. 2011. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27:2987–2993.

Li H, Durbin R. 2010. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26:589–595.

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Genome Project Data Processing S. 2009. The sequence alignment/Map format and SAMtools. Bioinformatics 25:2078–2079.

Lim LP, Burge CB. 2001. A computational analysis of sequence features involved in recognition of short introns. Proc Natl Acad Sci U S A 98:11193–11198.

Liti G, Carter DM, Moses AM, Warringer J, Parts L, James SA, Davey RP, Roberts IN, Burt A, Koufopanou V, et al. 2009. Population genomics of domestic and wild yeasts. Nature 458:337–341.

Loveless MD, Hamrick JL. 1984. Ecological determinants of genetic-structure in plant-populations. Annu Rev Ecol Syst 15:65–95.

Lynch M. 2007. The origins of genome architecture. Sunderland (MA): Sinauer Associates.

Lynch M. 2009. Estimation of allele frequencies from high-coverage genome-sequencing projects. Genetics 182:295–301.

Lynch M, Field MC, Goodson HV, Malik HS, Pereira-Leal JB, Roos DS, Turkewitz AP, Sazer S. 2014. Evolutionary cell biology: two origins, one objective. Proc Natl Acad Sci U S A 111:16990–16994.

Lynch M, Sung W, Morris K, Coffey N, Landry CR, Dopman EB, Dickinson WJ, Okamoto K, Kulkarni S, Hartl DL, et al. 2008. A genome-wide view of the spectrum of spontaneous mutations in yeast. Proc Natl Acad Sci U S A 105:9272–9277.

Maere S, Heymans K, Kuiper M. 2005. BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics 21:3448–3449.

Maguire B, Belk D. 1967. Paramecium transport by land snails. J Protozool 14:445–447.

Maruki T, Lynch M. 2015. Genotype-frequency estimation from high-throughput sequencing data. Genetics 201:473–486.

McGrath CL, Gout JF, Doak TG, Yanagi A, Lynch M. 2014a. Insights into three whole-genome duplications gleaned from the Paramecium caudatum genome sequence. Genetics 197:1417–1428.

McGrath CL, Gout JF, Johri P, Doak TG, Lynch M. 2014b. Differential retention and divergent resolution of duplicate genes following whole-genome duplication. Genome Res 24:1665–1675.

Miller JR, Delcher AL, Koren S, Venter E, Walenz BP, Brownley A, Johnson J, Li K, Mobarry C, Sutton G. 2008. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics 24:2818–2824.

Moradian MM, Beglaryan D, Skozylas JM, Kerikorian V. 2007. Complete mitochondrial genome sequence of three Tetrahymena species reveals mutation hot spots and accelerated nonsynonymous substitutions in Ymf genes. PLoS ONE 2:e650.

Nabholz B, Glemin S, Galtier N. 2009. The erratic mitochondrial clock: variations of mutation rate, not population size, affect mtDNA diversity across birds and mammals. BMC Evol Biol 9:54.

Nabholz B, Mauffrey JF, Bazin E, Galtier N, Glemin S. 2008. Determination of mitochondrial genetic diversity in mammals. Genetics 178:351–361.

Neafsey DE, Galinsky K, Jiang RH, Young L, Sykes SM, Saif S, Gujja S, Goldberg JM, Young S, Zeng Q, et al. 2012. The malaria parasite Plasmodium vivax exhibits greater genetic diversity than Plasmodium falciparum. Nat Genet 44:1046–1050.

Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, Indap A, King KS, Bergmann S, Nelson MR, et al. 2008. Genes mirror geography within Europe. Nature 456:98–101.

Ohta T. 1974. Mutational pressure as main cause of molecular evolution and polymorphism. Nature 252:351–354.

Ohta T. 1992. The nearly neutral theory of molecular evolution. Annu Rev Ecol Syst 23:263–286.

Pawlowski J, Audic S, Adl S, Bass D, Belbahri L, Berney C, Bowser SS, Cepicka I, Decelle J, Dunthorn M, et al. 2012. CBOL protist working group: barcoding eukaryotic richness beyond the animal, plant, and fungal kingdoms. PLoS Biol 10:e1001419.

Piganeau G, Eyre-Walker A. 2009. Evidence for variation in the effective population size of animal mitochondrial DNA. PLoS One 4:e4396.

Pool JE, Nielsen R. 2007. Population size changes reshape genomic patterns of diversity. Evolution 61:3001–3006.

Pophaly SD, Tellier A. 2015. Population level purifying selection and gene expression shape subgenome evolution in maize. Mol Biol Evol 32:3226–3235.

Prescott DM. 1994. The DNA of ciliated protozoa. Microbiol Rev 58:233–267.

Pritchard AE, Seilhamer JJ, Mahalingam R, Sable CL, Venuti SE, Cummings DJ. 1990. Nucleotide sequence of the mitochondrial genome of Paramecium. Nucleic Acids Res 18:173–180.

Przybos E. 1975. Genetic studies of Paramecium jenningsi strains (Diller, Earl, 1958). Folia Biol (Krakow) 23:425–471.

Przybos E. 2005. Recent data on the occurrence of species of the Paramecium aurelia complex in Europe. Folia Biol-Krakow 53:61–63.

Przybos E, Barth D, Berendonk TU. 2008. The Paramecium aurelia species complex, frequency and co-occurrence across Europe. Folia Biol (Krakow) 56:77–81.

Przybos E, Rautian M, Greczek-Stachura M, Potekhin A. 2007. Polymorphism within Paramecium sexaurelia (Ciliophora, Oligohymenophorea) and description of a new stand of the species in China. Folia Biol (Krakow) 55:121–125.

Przybos E, Surmacz M. 2010. New world-wide data on the distribution of species of the Paramecium aurelia complex (Ciliophora, Protozoa). Folia Biol-Krakow 58:185–188.

R-Core-Team. 2014. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

Redden H, Alper HS. 2015. The development and characterization of synthetic minimal yeast promoters. Nat Commun 6:7810.

Ritter DI, Dong Z, Guo S, Chuang JH. 2012. Transcriptional enhancers in protein-coding exons of vertebrate developmental genes. PLoS One 7:e35202.

Russell CB, Fraga D, Hinrichsen RD. 1994. Extremely short 20-33 nucleotide introns are the standard length in Paramecium tetraurelia. Nucleic Acids Res 22:1221–1225.

Salim HMW, Ring KL, Cavalcanti ARO. 2008. Patterns of codon usage in two ciliates that reassign the genetic code: Tetrahymena thermophila and Paramecium tetraurelia. Protist 159:283–298.

Schacherer J, Shapiro JA, Ruderfer DM, Kruglyak L. 2009. Comprehensive polymorphism survey elucidates population structure of Saccharomyces cerevisiae. Nature 458:342–345.

Schattner P, Brooks AN, Lowe TM. 2005. The tRNAscan-SE, snoscan and snoGPS web servers for the detection of tRNAs and snoRNAs. Nucleic Acids Res 33:W686–W689.

Schrider DR, Kern AD. 2015. Inferring selective constraint from population genomic data suggests recent regulatory turnover in the human brain. Genome Biol Evol 7:3511–3528.

Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. 2003. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13:2498–2504.

Simonsen KL, Churchill GA, Aquadro CF. 1995. Properties of statistical tests of neutrality for DNA polymorphism data. Genetics 141:413–429.

Smith DR, Lee RW. 2008. Nucleotide diversity in the mitochondrial and nuclear compartments of Chlamydomonas reinhardtii: investigating the origins of genome architecture. BMC Evol Biol 8:156.

Smith JM, Haigh J. 1974. The hitch-hiking effect of a favourable gene. Genetics Res 89:391–403.

Snoke MS, Berendonk TU, Barth D, Lynch M. 2006. Large global effective population sizes in Paramecium. Mol Biol Evol 23:2474–2479.

Sonneborn TM. 1970. Methods in Paramecium research. In: Methods Cell Biol. New York and London: Academic Press. p. 241–339.

Sonneborn TM. 1975. Paramecium aurelia complex of fourteen sibling species. T Am Microsc Soc 94:155–178.

Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30:1312–1313.

Stoeck T, Welter H, Seitz-Bender D, Kusch J, Schmidt HJ. 2000. ARDRA and RAPD-fingerprinting reject the sibling species concept for the ciliate Paramecium caudatum (Ciliophora, Protoctista). Zool Scr 29:75–82.

Sun M. 1999. Cleistogamy in Scutellaria indica (Labiatae): effective mating system and population genetic structure. Mol Ecol 8:1285–1295.

Swart EC, Nowacki M, Shum J, Stiles H, Higgins BP, Doak TG, Schotanus K, Magrini VJ, Minx P, Mardis ER, et al. 2012. The Oxytricha trifallax mitochondrial genome. Genome Biol Evol 4:136–154.

Tajima F. 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123:585–595.

Tishkoff SA, Reed FA, Ranciaro A, Voight BF, Babbitt CC, Silverman JS, Powell K, Mortensen HM, Hirbo JB, Osman M, et al. 2007. Convergent adaptation of human lactase persistence in Africa and Europe. Nat Genet 39:31–40.

Tsukii Y, Hiwatashi K. 1983. Genes controlling mating-type specificity in Paramecium caudatum: three loci revealed by intersyngenic crosses. Genetics 104:41–62.

Tsukii Y, Hiwatashi K. 1985. Meiotic nondisjunction and aneuploids in intersyngenic hybrids of Paramecium caudatum. Genetics 111:779–794.

Tuller T, Zur H. 2015. Multiple roles of the coding sequence 5′ end in gene expression regulation. Nucleic Acids Res 43:13–28.

Voight BF, Kudaravalli S, Wen X, Pritchard JK. 2006. A map of recent positive selection in the human genome. PLoS Biol 4:e72.

Volkman SK, Sabeti PC, DeCaprio D, Neafsey DE, Schaffner SF, Milner DA Jr, Daily JP, Sarr O, Ndiaye D, Ndir O, et al. 2007. A genome-wide map of diversity in Plasmodium falciparum. Nat Genet 39:113–119.

Weir W, Capewell P, Foth B, Clucas C, Pountain A, Steketee P, Veitch N, Koffi M, De Meeus T, Kabore J, et al. 2016. Population genomics reveals the origin and asexual evolution of human infective trypanosomes. eLife 5:e11473.

Weisse T. 2008. Distribution and diversity of aquatic protists: an evolutionary and ecological perspective. Biodivers Conserv 17:243–259.

Wickham H. 2010. ggplot2: Elegant graphics for data analysis. New York: Springer-Verlag.

Zhang JZ, Yang JR. 2015. Determinants of the rate of protein sequence evolution. Nat Rev Genet 16:409–420.

tetraurelia), 54,406 (P. biaurelia), 88,724 (P. sexaurelia), 32,631 [28,776] (P. caudatum [subset]), and 31,338 [28,347] (P. multimicronucleatum [subset]) indels were found across the genomes, which translates to SNP:indel ratios of 19, 17, 34, 14.7 [7.7], and 13.8 [8.8], respectively. The indel lengths are exponentially distributed in all species, with a median of 1 bp. Indels are heavily under-represented in coding regions (fig. 3d), where the median indel length is 3 bp, suggesting purifying selection against indels that cause frame-shifts. Introns and intergenic regions, however, tend to have a median indel length of 1 bp (supplementary table S6, Supplementary Material online). Again, we see that in the P. aurelia species there are proportionally more indels present in the intergenic regions than in introns. On the other hand, in the outgroup species, intergenic regions tolerate fewer indels than do intronic regions. This indicates that the very short intergenic regions in P. caudatum and P. multimicronucleatum are under extremely strong constraint.
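The contrast between a median indel length of 3 bp in coding regions and 1 bp elsewhere can be checked with a few lines of code. The sketch below is only an illustration, not the pipeline used in the study: the function name `summarize_indels` and the toy length lists are hypothetical. It reports the median length and the fraction of indels whose length is a multiple of three, which leave the reading frame intact.

```python
from statistics import median


def summarize_indels(lengths):
    """Summarize a list of indel lengths (in bp).

    Returns the median length and the fraction of indels whose length is a
    multiple of 3, i.e. indels that would not shift the reading frame.
    """
    if not lengths:
        raise ValueError("no indel lengths supplied")
    in_frame = sum(1 for n in lengths if n % 3 == 0)
    return {
        "median": median(lengths),
        "in_frame_fraction": in_frame / len(lengths),
    }


# Hypothetical toy data, chosen only to mimic the reported medians.
coding_lengths = [3, 3, 6, 1, 3, 9, 3]
noncoding_lengths = [1, 1, 2, 1, 4, 1, 3]

print(summarize_indels(coding_lengths))
print(summarize_indels(noncoding_lengths))
```

An excess of in-frame indels in coding sequence relative to introns and intergenic regions is the pattern expected under purifying selection against frame-shifts.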

Spatial Patterns of Constraints across Non-Coding Regions in the Genomes
Selection in natural populations can remove strongly deleterious mutations, but weakly deleterious mutations can continue to persist in the population at lower allele frequencies (Ohta 1974, 1992). Thus, within-population nucleotide diversity can be used to infer constraints across the genome and possibly identify functional parts of the non-coding genome. We therefore evaluated spatial patterns of nucleotide diversity in exons, flanking all introns and intergenic regions across the genome, for each species separately (fig. 4; supplementary fig. S5, Supplementary Material online). However, neutral sites linked to functionally important sites also experience a reduction in diversity as a result of either purging of deleterious alleles (background selection; Charlesworth et al. 1993) or fixation of a beneficial mutation (hitchhiking; Smith and Haigh 1974; Kaplan et al. 1989) at functional loci due to linkage (Comeron et al. 2008). As selection has a minimal effect on divergence at linked neutral sites (Birky and Walsh 1988), a comparison of polymorphism and divergence can distinguish between the effects of variation in the neutral mutation rate (which is determined by selective constraint in addition to mutational processes) versus linked selection (Begun and Aquadro 1992). To this end, we evaluated patterns of divergence across introns and intergenic regions for the relatively more closely related P. aurelia species.

We observed a reduction in both expected heterozygosity and divergence in intergenic regions near the start and end of coding sequences, as well as at some key sites within introns (figs. 4 and 5 and supplementary fig. S6, Supplementary Material online). In fact, divergence-normalized values of expected heterozygosity fail to exhibit any spatial pattern across these non-coding elements (fig. 5; supplementary fig. S6, Supplementary Material online), suggesting that these patterns are driven predominantly by negative selection and mutational processes (although see below) rather than linked selection. This is interesting because, although asexual species are expected to experience the effects of linked selection more strongly (Charlesworth et al. 1993), very compact genomes such as those in Paramecium might experience relatively more direct selection, as they have a higher proportion of functional sites.
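The logic of normalizing diversity by divergence can be sketched as follows. This is a minimal illustration under stated assumptions, not the estimator used in the study: expected heterozygosity is computed here from raw allele counts with the standard small-sample correction, divergence is coded as a per-site 0/1 indicator, and the names `expected_heterozygosity` and `normalized_profile` are hypothetical. Sites are grouped by their offset from a feature boundary (for example, a start codon), and the ratio of mean heterozygosity to mean divergence is reported per offset.

```python
from collections import defaultdict


def expected_heterozygosity(allele_counts):
    """Unbiased expected heterozygosity at one site: n/(n-1) * (1 - sum p_i^2)."""
    n = sum(allele_counts)
    if n < 2:
        raise ValueError("need at least two sampled alleles")
    homozygosity = sum((c / n) ** 2 for c in allele_counts)
    return n / (n - 1) * (1.0 - homozygosity)


def normalized_profile(sites):
    """Average diversity and divergence by offset from a feature boundary.

    `sites` is an iterable of (offset, allele_counts, diverged), where
    `diverged` is 1 if the site differs from the other species, else 0.
    Returns {offset: (mean_het, mean_div, mean_het / mean_div or None)}.
    """
    het = defaultdict(list)
    div = defaultdict(list)
    for offset, counts, diverged in sites:
        het[offset].append(expected_heterozygosity(counts))
        div[offset].append(diverged)
    profile = {}
    for offset in sorted(het):
        mean_h = sum(het[offset]) / len(het[offset])
        mean_d = sum(div[offset]) / len(div[offset])
        ratio = mean_h / mean_d if mean_d > 0 else None
        profile[offset] = (mean_h, mean_d, ratio)
    return profile


# Hypothetical toy input: two sites at offset 1, one site at offset 2.
toy_sites = [(1, [5, 5], 1), (1, [10, 0], 0), (2, [10, 0], 0)]
print(normalized_profile(toy_sites))
```

If constraint or mutation rate drives the spatial pattern, heterozygosity and divergence fall together and the ratio stays flat across offsets; under linked selection alone, heterozygosity would drop while divergence would not, so the ratio would dip near functional sites.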

The first and last three sites of introns are highly conserved, even more so than 0-fold degenerate sites, as observed in patterns of both divergence and nucleotide diversity (figs. 4 and 5), consistent with the role of these positions in intron splicing (Jaillon et al. 2008). In fact, it has been suggested that the first and last three sites in short eukaryotic introns might be sufficient for accurate splicing (Lim and Burge 2001; Jaillon et al. 2008). However, we find that in Paramecium, on average, 8–10 base pairs of coding DNA immediately flanking an intron are also under stronger constraints than the remainder of the coding sequence, possibly suggesting the presence of splicing regulatory elements in the coding sequence. Another interesting observation is the clear decrease in divergence (fig. 5e and 5g) in the center of introns, suggesting that longer introns might possess functional elements related to gene regulation.

We also observe that average expected heterozygosity and divergence increase gradually moving upstream from the start codon and downstream from the stop codon into the intergenic regions, stabilizing at around 100–150 base pairs (fig. 4), suggesting the presence of regulatory modules or UTRs. All four types of intergenic regions with respect to the direction of transcription, namely (i) regions that are upstream to both flanking genes, (ii) those downstream of one gene and upstream of another, (iii) upstream of one gene and downstream of another, and (iv) downstream of both flanking genes, exhibit the same pattern of reduction in heterozygosity levels just before the start and after the stop codons (supplementary fig. S7, Supplementary Material online). This indicates that constraints on downstream intergenic regions are not an artifact of overlapping upstream intergenic regions. Instead, these observations suggest the presence of gene regulatory elements at the 3′ end of genes, either involved in posttranscriptional regulation, in the specification of transcription termination, or in the regulation of gene expression. We note that in the P. aurelia species, diversity levels in the intergenic regions stabilize to reach average diversity at 4-fold degenerate sites. In contrast, in the two outgroup species, diversity in intergenic regions stabilizes only up to the equivalent average diversity at 2-fold degenerate sites. Such differences in patterns of nucleotide diversity between the P. aurelia species and the outgroup species can be attributed to a much higher proportion of functionally important sites comprising the intergenic space in the outgroup species, possibly as a result of smaller numbers of bases between adjacent genes in the latter.
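The four orientation classes of intergenic regions follow directly from the strands of the two flanking genes. The sketch below is a hypothetical illustration rather than the authors' code: it assumes genes are given as (start, end, strand) tuples in 1-based inclusive coordinates on a single scaffold, and the mapping of the two tandem arrangements (+/+ and −/−) onto classes (ii) and (iii) is our assumption.

```python
def classify_intergenic(left_strand, right_strand):
    """Classify an intergenic region by the strands of its flanking genes.

    `left_strand` belongs to the gene at lower coordinates, `right_strand`
    to the gene at higher coordinates.
    """
    classes = {
        ("-", "+"): "upstream of both genes",
        ("+", "+"): "downstream of left gene, upstream of right gene",
        ("-", "-"): "upstream of left gene, downstream of right gene",
        ("+", "-"): "downstream of both genes",
    }
    key = (left_strand, right_strand)
    if key not in classes:
        raise ValueError(f"unexpected strands: {key}")
    return classes[key]


def intergenic_regions(genes):
    """List (start, end, length, class) for gaps between adjacent genes.

    `genes` is a list of (start, end, strand) tuples on one scaffold.
    Overlapping or abutting genes yield no intergenic region.
    """
    ordered = sorted(genes)
    regions = []
    for (_, end1, strand1), (start2, _, strand2) in zip(ordered, ordered[1:]):
        length = start2 - end1 - 1
        if length > 0:
            regions.append(
                (end1 + 1, start2 - 1, length, classify_intergenic(strand1, strand2))
            )
    return regions


# Hypothetical gene models on one scaffold.
toy_genes = [(100, 200, "-"), (244, 400, "+"), (420, 500, "+"), (510, 600, "-")]
for region in intergenic_regions(toy_genes):
    print(region)
```

Tabulating region lengths per class in this way also gives the length distributions of 5′ intergenic regions, including those shared by two divergently transcribed genes.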

A similar reduction in diversity in intergenic regions extends up to 85 kb around transcription start/stop sites of genes in humans (Abecasis et al. 2010). In S. cerevisiae, intergenic regions are under strong purifying selection, and those shorter than 300 bp have significantly lower nucleotide diversity than longer intergenic regions (Schacherer et al. 2009).

Interestingly, we also observe, on average, a 70–100 bp (20–35 amino acids) decrease in diversity as well as divergence at both the 5′ and 3′ ends of the coding sequence (CDS) of protein-coding genes. Such reduction in diversity at 4-fold and 2-fold degenerate sites at the beginning and end of protein-coding sequence can be explained by a number of different hypotheses. The 5′ ends of CDSs might be required to maintain low thermodynamic stability of mRNA secondary structure near the start codon for efficient translation initiation. Alternatively, CDS 5′ ends might possess fewer tRNA-adapted codons to slow down translation initially, which has been proposed to reduce ribosome collision-induced premature translation terminations and facilitate protein production (see Tuller and Zur 2015). Consistent with this, the 5′ and 3′ ends of protein-coding genes in bacteria (32–35 codons) have been shown to be important for efficient translation (Kudla et al. 2009; Boel et al. 2016), and their mRNA folding energies can explain observed variation in expression levels (Kudla et al. 2009). In fact, it has been shown previously that the first few codons in proteins of P. tetraurelia and T. thermophila are, on average, comprised of significantly different numbers of effective codons than the rest of the protein (Salim et al. 2008). Thus, the role played by the first and last few bases of all coding sequences in translation might contribute to additional sequence constraints, resulting in reduced nucleotide diversity and divergence.

FIG. 4. Average expected heterozygosity (per site) within and flanking introns (left) and intergenic regions (right) in P. tetraurelia (a, b), P. biaurelia (c, d), P. sexaurelia (e, f), P. caudatum subset (g, h), and P. multimicronucleatum subset (i, j), separately for 4-fold degenerate sites (red), 2-fold degenerate sites (blue), 0-fold degenerate sites (green), and non-coding sites (black).

FIG. 5. Patterns of average divergence (per site) within and flanking intergenic regions (a, c) and introns (e, g) are shown in the left panels. Patterns of average expected heterozygosity per site normalized by average divergence per site within and flanking intergenic regions (b, d) and introns (f, h) are shown on the right. The divergence values are calculated between P. tetraurelia and P. biaurelia, separately for 4-fold degenerate sites (red), 2-fold degenerate sites (blue), 0-fold degenerate sites (green), and non-coding sites (black).

Another factor that might contribute to constraining the CDS ends is the presence of regulatory modules at the beginning and end of genes. Exonic enhancers have been observed in multiple species, including teleosts (Dong et al. 2010; Eichenlaub and Ettwiller 2011), humans (Birnbaum et al. 2012), and other vertebrates (Ritter et al. 2012), where they have been shown to regulate nearby genes. In the highly compact genomes of Paramecium, it is even more likely that regulatory regions extend beyond non-coding regions into the CDS. For instance, in P. caudatum, the median length of 5′ intergenic regions is 43 bp, which increases to 56 bp when both genes share the same 5′ intergenic region, suggesting the importance of upstream intergenic length in such compact genomes. Interestingly, 965 genes (7%) have 5′ intergenic regions shorter than 10 bp, and 3417 genes (25%) have 5′ intergenic regions shorter than 20 bp in P. caudatum. Similarly, in P. multimicronucleatum, the median length of the 5′ intergenic region is 90 bp, which increases to 127 bp when both genes share the same 5′ intergenic region. Again, there are 456 genes (3.5%) with less than 10 bp and 1760 genes (13%) with less than 20 bp comprising their upstream intergenic region. In fact, in both species the 5′ intergenic region can be as small as 3 bp (found in 38 genes in P. caudatum and 18 genes in P. multimicronucleatum). An interesting question this raises is what might be the minimum number of bases required to encode regulatory information about transcription of genes in eukaryotes. A recent study in yeast created a high-strength minimal constitutive promoter that is only 116 bp in length (Redden and Alper 2015). It might be possible to observe even smaller promoter regions needed to drive expression patterns in Paramecium, especially in P. caudatum.

Spatial Patterns of Nucleotide Diversity across Non-Coding Regions are Not Driven by Mutation Alone

Although we have assumed that the diversity-normalized divergence patterns indicate the effect of negative selection, it is possible that the observed spatial patterns of nucleotide diversity across and flanking non-coding regions are driven entirely by mutation alone. One way to potentially distinguish between mutation and direct selection as the cause of these patterns is to evaluate the allele frequencies of the variants observed at different site types. If the reduction in nucleotide diversity near the start and end of coding sequences is mostly a consequence of negative selection, we would expect a higher fraction of rare alleles at 4-fold degenerate sites under stronger constraints relative to that seen at 4-fold degenerate sites not under constraints. If instead mutation is the predominant driver of the extent of nucleotide diversity at these sites, then there should be no difference in the fraction of rare alleles between the 4-fold degenerate sites under stronger negative selection and those not under constraints. Because Tajima's D (Tajima 1989) is a good statistic to summarize such a shift in rare alleles, we compared the average Tajima's D across site categories to evaluate this.

In each Paramecium species, we first classified sites genome-wide into five different categories: (1) splice sites (first and last 3 sites) in introns, (2) intergenic sites within 50 bp of start/stop sites, (3) intergenic sites at least 150 bp away from start/stop sites (should exclude most UTR regions), (4) 4-fold degenerate sites within 50 bp of start/stop sites, and (5) 4-fold degenerate sites at least 150 bp away from start/stop sites and 30 bp away from splice sites. We expected category 5 sites to be the most neutral. Interestingly, in most species, Tajima's D at splice sites, intergenic sites close to start/stop sites, and 4-fold degenerate sites close to start/stop sites is significantly smaller than that of 4-fold degenerate and intergenic sites far away from any functional elements (see supplementary table S7, Supplementary Material online, for P-values). This suggests that sites in categories 1, 2, and 4 are under relatively stronger purifying selection. Category 2 sites seem to possess significantly more rare alleles than category 3 sites consistently in all species, suggesting that intergenic regions very close to start/stop sites are under stronger purifying selection than intergenic sites further away from gene ends. We also find Tajima's D at intergenic sites (category 3) to be significantly higher than that for 4-fold degenerate sites (category 5) in most species. This could suggest either that intergenic sites are under some form of positive/balancing selection or that silent sites are not completely neutral and in fact experience some purifying selection.
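The comparison above rests on Tajima's D being more negative when rare alleles are in excess. A minimal sketch of the standard estimator (Tajima 1989) is given here for orientation; the function name `tajimas_d` and its input format (one minor-allele count per segregating site, in a sample of `n` chromosomes) are illustrative assumptions, not the pipeline used in this study.

```python
import math

def tajimas_d(minor_allele_counts, n):
    """Tajima's D for a sample of n chromosomes (hypothetical helper).

    minor_allele_counts holds one count k (0 < k < n) per segregating site.
    Returns None when there are no segregating sites.
    """
    if n < 4:
        raise ValueError("the variance term is zero for n < 4")
    s = len(minor_allele_counts)
    if s == 0:
        return None
    # Mean number of pairwise differences, summed over segregating sites.
    pi = sum(2.0 * k * (n - k) / (n * (n - 1)) for k in minor_allele_counts)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # Difference between the two diversity estimators, scaled by its expected s.d.
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))
```

With ten singleton sites in a sample of ten chromosomes the statistic is negative (excess of rare alleles), whereas ten sites at intermediate frequency give a positive value.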

Gene-Based Analysis of Patterns of Synonymous and Non-Synonymous Diversity

Selective constraint can vary among genes within the same species, and it could also be species-specific. To examine such variation in our dataset, we evaluated the distribution of πn/πs across all genes (fig. 6a). If we assume that synonymous polymorphisms are fitness neutral, then πn/πs is a measure of the relative functional constraints on non-synonymous polymorphisms and thus an indication of functional constraints on the protein. The majority of genes have πn/πs < 1.0, consistent with constraints on non-synonymous changes. About 98% of genes in P. tetraurelia, 96% in P. biaurelia, 96% in P. sexaurelia, 99% in both P. caudatum and its subset, and 99% in both P. multimicronucleatum and its subset have πn/πs < 1. Upon comparing the average πn/πs for the same set of orthologous genes across all species, we find the mean to be 0.10 for P. tetraurelia, 0.17 for P. biaurelia, 0.24 for P. sexaurelia, 0.12 [0.12] for P. caudatum [subset], and 0.17 [0.14] for P. multimicronucleatum [subset]. Among these species, P. sexaurelia has the highest average πn/πs, which suggests either relaxed selection because of a relatively small effective population size or more local adaptation. Local adaptation might result in fixation of more non-synonymous than synonymous differences between subpopulations, thus increasing πn/πs. Differences in πn/πs between species could also reflect genome-wide variation in the fraction of gene duplicates that are subfunctionalized, neofunctionalized, or nonfunctionalized. Alternatively, an elevation in average levels of πn/πs could be a consequence of a depression in the average levels of πs, which could be caused by stronger selection on synonymous sites or a recent bottleneck (Brandvain and Wright 2016) experienced by the species. Evidence that genes with higher πn/πs ratios are under more relaxed constraints is provided by a comparison of the average number of frame-shift-causing indels occurring in four sets of genes with (1) πn/πs ≤ 0.25, (2) 0.25 < πn/πs ≤ 0.5, (3) 0.5 < πn/πs ≤ 1.0, and (4) πn/πs > 1.0. As expected, we find an increase in the number of indels per site with increasing πn/πs (fig. 6b).
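The four-way grouping of genes described above can be sketched as follows. The input layout (a mapping from a gene name to its non-synonymous diversity, synonymous diversity, and number of synonymous polymorphisms) and the names `constraint_bin` and `bin_genes` are hypothetical; the requirement of more than five synonymous polymorphisms follows the figure legends.

```python
def constraint_bin(ratio):
    """Assign a pi_n/pi_s ratio to one of the four sets described in the text."""
    if ratio <= 0.25:
        return "<=0.25"
    if ratio <= 0.5:
        return "0.25-0.5"
    if ratio <= 1.0:
        return "0.5-1.0"
    return ">1.0"

def bin_genes(genes, min_syn_polymorphisms=5):
    """Group genes by pi_n/pi_s.

    genes maps a gene name to (pi_n, pi_s, n_synonymous_polymorphisms).
    Genes with too few synonymous polymorphisms are skipped (assumed filter:
    strictly more than min_syn_polymorphisms).
    """
    bins = {"<=0.25": [], "0.25-0.5": [], "0.5-1.0": [], ">1.0": []}
    for name, (pi_n, pi_s, n_syn) in genes.items():
        if n_syn <= min_syn_polymorphisms or pi_s == 0:
            continue
        bins[constraint_bin(pi_n / pi_s)].append(name)
    return bins

# Made-up values for illustration only.
example = {
    "g1": (0.001, 0.02, 10),
    "g2": (0.006, 0.02, 8),
    "g3": (0.015, 0.02, 7),
    "g4": (0.030, 0.02, 9),
    "g5": (0.030, 0.02, 3),  # too few synonymous polymorphisms, skipped
}
binned = bin_genes(example)
```

The indel comparison would then amount to averaging a per-gene indel count within each of the four lists.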

Paralogs and Orthologs Evolve under Similar Selective Constraints within and across Species

We evaluated the correlation of πn/πs across orthologs and within-species paralogs, which were created by the most recent WGD (fig. 7). Both paralogs and orthologs show a significant positive correlation within and across all species, suggesting that most WGD duplicates are under similar selective constraints. We also confirmed that these correlations are not being driven simply by selection on synonymous sites, as πs is not necessarily correlated between orthologs and within-species paralogs across all species (supplementary fig. S8, Supplementary Material online). Similar findings in the maize genome also suggested correlations between nucleotide diversity and divergence between WGD paralogs (Pophaly and Tellier 2015). Our results further suggest that most orthologs are under very similar evolutionary constraints across the Paramecium species. This is in agreement with a recent study in Drosophila, where πn and πs were found to be strongly correlated between D. melanogaster and D. simulans (Langley et al. 2012). Such a correlation was also observed between orthologs in S. cerevisiae and S. paradoxus (Elyashiv et al. 2010).

FIG. 6. (A) πn versus πs of each gene. The red line represents πn = πs. (B) The distribution of πn/πs of each gene across the genome. Only genes with more than five synonymous polymorphisms were used in this analysis. (C) Number of frame-shift-causing indels per site in genes belonging to three gene categories: (1) πn/πs ≤ 0.5 (light grey), (2) 0.5 < πn/πs ≤ 1.0 (grey), and (3) πn/πs > 1.0 (black). Only genes with more than five synonymous polymorphisms were used in this analysis.

A possible mechanism governing the degree of selective constraint acting on proteins is their level of expression (Zhang and Yang 2015). Non-synonymous mutations are thought to have more severe fitness consequences in highly expressed genes than in weakly expressed ones (Drummond et al. 2005; Gout et al. 2010). Indeed, we find a strong negative correlation between expression and πn/πs in all species (fig. 8). However, only 6–13% of the variance is explained by expression, implying that additional factors may play a greater role in determining the degree of selective constraint acting on a gene. Simple biochemical properties of proteins, such as the protein's function, tertiary structure, and number of interacting partners, might determine the proportion of functionally important residues in the protein and in turn govern the degree of selection.

Our observation that orthologous genes across such deeply diverged species experience similar degrees of selective constraint (fig. 7) might imply a low rate of turnover of expression, protein structure, function, and number of interacting partners. This finding may also be in part due to similar ecologies, and therefore similar selective pressures, experienced by the Paramecium species examined here. It may thus be informative to compare closely related species that live in, or have recently experienced, different environments. For instance, Elyashiv et al. (2010) found significant differences in levels of gene-specific purifying selection (πn/πs) in the sake population of S. cerevisiae in comparison to other populations of S. cerevisiae as well as S. paradoxus. In fact, this deviation in selective constraints experienced by the sake population is complemented by differences in response to different stresses (Kvitek et al. 2008), suggesting that changes in environment might be reflected in gene-specific constraints.

FIG. 7. Correlation of the logarithm of πn/πs for paralogs within species (three diagonal plots for P. tetraurelia, P. biaurelia, P. sexaurelia) and for orthologs between species pairwise [non-diagonal plots, with the plots below the diagonal (outlined red) representing the first set of WGD paralogs (T1, B1, S1) and the upper half of the matrix (outlined blue) representing the second set of paralogs (T2, B2, S2)]. The number in the top left corner of each plot is the slope of the linear regression in cases where the regression is significant (P < 0.01). Only genes with more than five synonymous polymorphisms were used in this analysis. The schematic tree in the bottom right corner shows the phylogenetic relationship between the species P. tetraurelia (T), P. biaurelia (B), P. sexaurelia (S), P. caudatum (C), and P. multimicronucleatum (M).

FIG. 8. Negative correlation of πn/πs with expression levels of genes in four Paramecium species. All correlations are significant with P < 2.2 × 10⁻¹⁶. Only genes with more than five synonymous polymorphisms were used in this analysis.

Distribution of Tajima's D across the Genome

Contrary to genes under purifying selection, genes under balancing or positive selection might often be species-specific. We tested this hypothesis using Tajima's D, a statistic that uses the difference between nucleotide diversity and the number of segregating sites to detect deviations from neutrality. Because Tajima's D is sensitive to population size changes and population structure (Simonsen et al. 1995), we used an outlier approach. We calculated Tajima's D for non-overlapping 1 kb windows across the genome (supplementary fig. S9, Supplementary Material online) and looked for the presence of candidate genes in the outlier regions (0.25% of the two tails). The outlier regions with negative values of Tajima's D might represent genes under either weak purifying or positive selection (i.e., experiencing a recent sweep), whereas those with positive Tajima's D might be under balancing selection. Interestingly, both sets of outliers show an over-representation of genes involved in glycosylation, glycoprotein production and metabolism, and carbohydrate metabolism (supplementary table S8, Supplementary Material online) for all Paramecium species. For each species, the large functional overlap of genes in both tails suggests an important role for these processes in Paramecium growth. More importantly, it further suggests that similar selective pressures act genome-wide on all Paramecium species examined in this study. These conclusions can, however, be biased by gene length, as shorter genes might not have sufficient power for an accurate calculation of Tajima's D. Indeed, the outliers are significantly longer than the rest of the genes in P. tetraurelia (outliers: 1567 bp; non-outliers: 1321 bp; P = 4.7 × 10⁻¹³), P. biaurelia (outliers: 1681 bp; non-outliers: 1372 bp; P < 2.2 × 10⁻¹⁶), and P. sexaurelia (outliers: 1860 bp; non-outliers: 1379 bp; P < 2.2 × 10⁻¹⁶). This bias is, however, much lower or not significant for P. caudatum (outliers: 1463 bp; non-outliers: 1383 bp; P = 7.3 × 10⁻³) and P. multimicronucleatum (outliers: 1509 bp; non-outliers: 1445 bp; P = 0.19), possibly because of higher genome-wide nucleotide diversity, indicating that, despite some bias, the results have true biological significance.
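The outlier approach described above reduces to ranking windows by their statistic and keeping a fixed fraction from each tail. The sketch below illustrates that selection step only; `tail_outliers` is a hypothetical helper, and the tail fraction is left as an explicit argument rather than assumed.

```python
import math

def tail_outliers(window_values, tail_fraction):
    """Return (lower, upper): indices of the windows in each tail.

    window_values: one Tajima's D value per genomic window.
    tail_fraction: fraction of windows to take from each tail.
    """
    n = len(window_values)
    # Small epsilon guards against floating-point error in n * tail_fraction.
    k = int(math.floor(n * tail_fraction + 1e-9))
    order = sorted(range(n), key=lambda i: window_values[i])
    lower = order[:k]
    upper = order[n - k:] if k > 0 else []
    return lower, upper

# Illustrative values only: 1000 windows with evenly spaced statistics.
values = [i / 100 - 5 for i in range(1000)]
low, high = tail_outliers(values, 0.01)
```

Genes overlapping the windows indexed by `low` and `high` would then be the candidates examined for functional enrichment.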

Tajima's D, as a summary statistic of the site-frequency spectrum, can also indicate overall patterns of selection or recent changes in demography. In most species, the genome-wide distribution of Tajima's D is shifted towards positive values, with means of 0.862 for P. tetraurelia, 0.504 for P. biaurelia, 0.274 for P. sexaurelia, -0.148 [0.459] for P. caudatum [subset], and 1.325 [0.855] for P. multimicronucleatum [subset]. This could suggest pervasive local adaptation (or balancing selection), a recent contraction in population size, or a fairly ancient population subdivision in all Paramecium species. A previous study that sampled variation at 10 nuclear loci found that strains from different continents shared the same alleles and interpreted that as evidence of regular global dispersal of Paramecium from relatively few permanent locations (Catania et al. 2009). This could also be a potential explanation for the observed increase in nucleotide diversity in comparison to the number of segregating sites.

Fraction of Heterozygous Sites across the Genome

On average, about 0.1% of sites in an individual were heterozygous within the P. aurelia species, whereas about 0.7% of sites in individuals belonging to P. caudatum and 1.1% of sites in isolates of P. multimicronucleatum were found to be heterozygous (supplementary table S1, Supplementary Material online). This is consistent with the observation that the P. aurelia species undergo autogamy, whereas the outgroup species do not. On evaluating the fraction of heterozygous sites in windows of 10 kb across the genome, we find large blocks of homozygosity in the P. aurelia genomes (supplementary fig. S10a–e, Supplementary Material online), consistent with loss of heterozygosity in the lab, possibly through multiple episodes of autogamy. It should, however, be noted that peaks in heterozygous regions in the P. aurelia genomes are highly correlated between isolates, unlike in P. caudatum and P. multimicronucleatum, and might reflect mapping or assembly errors. These heterozygous regions (summarized in supplementary files S1–S3, Supplementary Material online) comprise a very small part (0.82–0.97%) of the genome and, most importantly, their exclusion does not affect any of our analyses qualitatively.
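The windowed heterozygosity scan can be illustrated with a short sketch. It assumes 0-based positions of heterozygous sites on a single scaffold and treats every site in a window as callable, which is a simplification; `het_fraction_by_window` is a hypothetical name.

```python
def het_fraction_by_window(het_positions, scaffold_length, window=10_000):
    """Fraction of heterozygous sites in consecutive non-overlapping windows.

    het_positions: 0-based coordinates of heterozygous sites on one scaffold.
    Assumes all sites are callable; the last window may be shorter.
    """
    n_windows = (scaffold_length + window - 1) // window
    counts = [0] * n_windows
    for pos in het_positions:
        counts[pos // window] += 1
    fractions = []
    for w, count in enumerate(counts):
        size = min(window, scaffold_length - w * window)
        fractions.append(count / size)
    return fractions

# Made-up positions: two heterozygous sites in the first window, one in the second.
fractions = het_fraction_by_window([5, 15, 10005], 20000)
```

Runs of consecutive windows with a fraction of zero would correspond to the blocks of homozygosity mentioned above.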

Variation across the Mitochondrial Genome

Few ciliate mitochondrial genomes have been sequenced until now, most belonging to the genus Tetrahymena (Brunk et al. 2003; Burger et al. 2000; Moradian et al. 2007). Others include Paramecium tetraurelia (Pritchard et al. 1990), Paramecium caudatum (Barth and Berendonk 2011), and Oxytricha trifallax (Swart et al. 2012). These studies have illustrated that ciliate mitochondrial genomes are linear and range between 40 and 47 kb, with 50 intronless genes. However, levels of sequence diversity across the entire mitochondrial genome have not yet been characterized.

To obtain mitochondrial genomes for all species included in this study, we carried out de novo genome assembly using the whole-genome resequencing data for each strain and extracted mitochondrial contigs using the published Paramecium mitochondrial genomes as references (see Methods for more details). Using this approach, we derived complete genome assemblies for multiple strains, including from species without a previously published mitochondrial genome (P. biaurelia, P. sexaurelia, and P. multimicronucleatum). A comprehensive analysis of these sequences will be published elsewhere (Johri et al., in preparation). The mitochondrial genome assembly from each isolate was annotated separately, and the eight genes that were most ubiquitously present in all individuals were concatenated and used to construct a phylogeny (see Methods). The mitochondrial phylogeny closely tracked the one based on nuclear sequences (supplementary fig. S11, Supplementary Material online).


One strain from each species was chosen as a reference genome (the genome with the longest assembly), and reads from other individuals were mapped to the reference to call SNPs. During the original DNA preparation, efforts were made to remove non-MAC DNA, resulting in varying amounts of mitochondrial DNA in the extracts from each isolate. Hence, some individuals mapped with extremely low coverage (4 to zero read depth) to the reference genome, especially after applying stringent filters to call SNPs (supplementary table S9, Supplementary Material online). Thus, some individuals were excluded from this analysis, including all P. biaurelia isolates. The subsequent analysis was therefore restricted to four species.

All of the mitochondrial genomes are about 40 kb in length. We identified 626 SNPs in P. tetraurelia, 5307 in P. sexaurelia, 4771 [3041] in P. caudatum [subset], and 4600 [1478] in P. multimicronucleatum [subset]. Within all species, very few indels were found, with just 1 in P. tetraurelia, 9 in P. sexaurelia, and 15 [13] and 23 [9] in P. caudatum [subset] and P. multimicronucleatum [subset], respectively. The nucleotide diversity at 4-fold degenerate sites was found to be 0.009 in P. tetraurelia, 0.169 in P. sexaurelia, 0.127 [0.093] in P. caudatum [subset], and 0.172 [0.073] in P. multimicronucleatum [subset]. Nucleotide diversity across different sites shows that rRNAs are the most conserved sequences in the ciliate mitochondrial genomes. Transfer RNAs are the next most highly conserved, with their diversity levels only slightly above those of replacement sites (table 2). Our estimates of diversity in the Paramecium mitochondria are higher than those of a number of other unicellular eukaryotes studied so far: P. falciparum (πsyn = 0.0012) (Joy et al. 2003) and Chlamydomonas reinhardtii (πsilent = 0.0085) (Smith and Lee 2008).

We also recalculated nuclear nucleotide diversity (table 2) at 4-fold degenerate sites for the same set of individuals used in the mitochondrial analysis and observed that the mitochondrial to nuclear diversity ratio is close to 1.0 in all species (table 2) except P. sexaurelia, where the mitochondrial diversity is 5.7-fold higher than that in the nucleus. On excluding P. sexaurelia, we obtain a strong correlation between 4-fold degenerate nucleotide diversity in the mitochondria and nucleus (slope = 0.95, R² = 0.96). It should be noted that there are not enough data points to make any inferences about statistical significance. However, such a positive correlation between nuclear and mitochondrial diversity suggests that mitochondrial diversity might scale with population size in ciliates, which bears on previous studies in animals that have yielded conflicting findings about whether mitochondrial diversity correlates with population size (Bazin et al. 2006; Nabholz et al. 2008; Nabholz et al. 2009; Piganeau and Eyre-Walker 2009). The observed correlation also suggests that mitochondrial and nuclear spontaneous mutation rates may be correlated. Direct estimates of mutation rates in C. elegans, D. melanogaster, S. cerevisiae, and H. sapiens, obtained via mutation accumulation experiments, suggest a moderate correlation (slope = 0.05, R² = 0.48) between mitochondrial and nuclear base substitution rates (using data from Lynch et al. 2008). Discordance between nuclear and mitochondrial diversity, for instance in P. sexaurelia, can be attributed to various factors such as positive selection (Hudson and Turelli 2003; Galtier et al. 2009; James et al. 2016), demographic events (Pool and Nielsen 2007), or simply differences in nuclear and/or mitochondrial mutation rates (Lynch 2007). Future studies are needed to dissect the factors responsible for the remarkably different ratio of mitochondrial to nuclear diversity in P. sexaurelia.

Table 2. Nucleotide Diversity (per site) of Different Sites across the Mitochondrial Genomes.

Species | 4-fold (πm) [SE] | 2-fold [SE] | 0-fold [SE] | tRNA [SE] | rRNA [SE] | Intergenic [SE] | 4-fold nuclear (πn) [SE] | πm/πn
P. tetraurelia | 0.0091 [0.0010] | 0.0107 [0.0009] | 0.0019 [0.0002] | 0.0000 [0.0000] | 0.0006 [0.0002] | 0.0041 [0.0003] | 0.0058 [2.26 × 10⁻⁵] | 1.57
P. sexaurelia | 0.1692 [0.0033] | 0.1427 [0.0024] | 0.0097 [0.00043] | 0.0000 [0.0000] | 0.0059 [0.0007] | 0.0417 [0.0027] | 0.02945 [4.73 × 10⁻⁵] | 5.75
P. caudatum | 0.1272 [0.0037] | 0.0942 [0.0019] | 0.0179 [0.0006] | 0.0455 [0.0023] | 0.0136 [0.0012] | 0.0333 [0.0027] | 0.1267 [1.08 × 10⁻⁴] | 1.00
P. caudatum (subset) | 0.0929 [0.0039] | 0.0859 [0.0021] | 0.0149 [0.0006] | 0.0375 [0.0023] | 0.0093 [0.0011] | 0.0292 [0.0028] | 0.0658 [9.63 × 10⁻⁵] | 1.41
P. multimicronucleatum | 0.1727 [0.0047] | 0.0831 [0.0019] | 0.0227 [0.0008] | 0.0263 [0.0011] | 0.0269 [0.0016] | 0.0329 [0.0022] | 0.1599 [1.34 × 10⁻⁴] | 1.08
P. multimicronucleatum (subset) | 0.0730 [0.0034] | 0.0383 [0.0015] | 0.0121 [0.0006] | 0.0126 [0.0009] | 0.0135 [0.0012] | 0.0312 [0.0022] | 0.0742 [9.56 × 10⁻⁵] | 0.98
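The mitochondrial to nuclear diversity ratios discussed above follow directly from the 4-fold degenerate values in table 2. The short sketch below recomputes them as a worked check; the dictionary layout and variable names are illustrative, and the values are those read from table 2.

```python
# (pi_m, pi_n): 4-fold degenerate diversity in the mitochondrion and the
# nucleus, as read from table 2.
table2 = {
    "P. tetraurelia": (0.0091, 0.0058),
    "P. sexaurelia": (0.1692, 0.02945),
    "P. caudatum": (0.1272, 0.1267),
    "P. caudatum (subset)": (0.0929, 0.0658),
    "P. multimicronucleatum": (0.1727, 0.1599),
    "P. multimicronucleatum (subset)": (0.0730, 0.0742),
}

# Ratio of mitochondrial to nuclear diversity, rounded as in the table.
ratios = {species: round(pi_m / pi_n, 2) for species, (pi_m, pi_n) in table2.items()}

for species, ratio in ratios.items():
    print(f"{species}: {ratio:.2f}")
```

Only P. sexaurelia departs strongly from a ratio near 1, which is the discordance the text refers to.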

Conclusions

This is the first study in which multiple individuals of several species of free-living unicellular eukaryotes have been sequenced and used to reveal genome-wide patterns of variation. This study thus provides an invaluable resource for future research in Paramecium, a major model organism in cell biology. Sequencing of multiple individuals from the same lake or pond is needed to further resolve the global versus local population structure of these cosmopolitan unicellular species. Overall, our results suggest that most of the intergenic regions across the highly compact genomes of Paramecium species are strongly constrained and probably harbor functionally important regulatory modules, which might extend into coding regions. We show that patterns of nucleotide diversity across the genome are driven primarily by a combination of purifying selection and mutation rate variation. In the future, such analyses will benefit greatly from incorporating variation in recombination rates across the genomic landscape. We also find that duplicates created by the WGD appear to be evolving under similar constraints. Ratios of non-synonymous to synonymous diversity need to be supplemented with divergence levels to distinguish between the various types of selective forces acting on the duplicates. Thus, sequencing of much closer outgroup species would allow us to further understand the role of selection in the evolution of different genomic elements, including gene duplicates that resulted from ancient WGDs.

Methods

Sampling, DNA Extraction, and Sequencing

We examined a total of 56 Paramecium strains originally isolated from environmental sources, including 11 isolates of P. biaurelia, 13 isolates of P. sexaurelia, 12 isolates of P. tetraurelia, 10 isolates of P. caudatum, and 10 isolates of P. multimicronucleatum. A total of 14 strains from P. biaurelia and P. sexaurelia were kindly provided by E. Przybos, whereas the remaining strains were cultured in the Paramecium collections of the Institute of Hydrobiology, TU Dresden, Germany, or the Department of Biology, Indiana University, Bloomington, USA. Prior to our analyses, P. tetraurelia strain Moz 13CXI as well as P. sexaurelia strains Moz 13BIII and Indo17I were kindly identified by E. Przybos by performing inter- and intrastrain crosses (Przybos 1975). Paramecia were maintained in 0.25% Cerophyl medium (modified according to Sonneborn 1970; Krenek et al. 2011) inoculated with Enterobacter aerogenes. Since P. aurelia species can undergo autogamy (self-fertilization) induced by starvation, and all strains have been cultured under lab conditions for several years, all P. aurelia strains in this study are expected to be more homozygous than wild isolates.

Prior to DNA extraction, all cultures were initiated from a single cell, ensuring clonality. A modified sucrose density separation (Aury et al. 2006) was performed to extract primarily macronuclei (MACs) and to reduce the amount of non-MAC DNA potentially compromising subsequent analyses. Briefly, Paramecium cells were starved for at least two days and filtered through several layers of sterile cheesecloth prior to cell harvesting via gentle centrifugation. Pelleted cells were gently lysed in 0.25 M sucrose medium (pH 6.8) containing 10 mM Tris-HCl, 8 mM MgCl2, 3 mM CaCl2, 0.5–1% Nonidet-P40, and 0.1–0.2% sodium deoxycholate. Macronuclei were subsequently isolated away from cellular debris, micronuclei, and mitochondria by centrifugation at 100 rcf in 0.25 M sucrose wash buffer (see above, without Nonidet-P40 and Na-deoxycholate). Genomic DNA extraction from MAC isolations was performed by a modified CTAB procedure (Gustincich et al. 1991).

All genomes were sequenced using the Illumina HiSeq 2000, and paired-end reads with insert sizes between 200 and 400 bp and a mean length of 151 bp were obtained. The reads mapped with coverage ranging between 40× and 80× (details given in supplementary table S1, Supplementary Material online).

Paramecium Reference Genomes

Complete genomes of P. tetraurelia, P. biaurelia, P. sexaurelia, and P. caudatum were downloaded from ParameciumDB (Arnaiz et al. 2007; Arnaiz and Sperling 2011) along with the annotations. Ortho-paralog relationships as well as functional annotations were retrieved from the supplementary information provided in McGrath et al. (2014a, 2014b). RNAseq data for all four species were downloaded and processed as in Gout and Lynch (2015).

The genome of P. multimicronucleatum was assembled using Celera (Miller et al. 2008) and annotated using Eugene (Foissac et al. 2008) as previously described (McGrath et al. 2014a, 2014b). A detailed description of the sequencing, assembly, and annotation of P. multimicronucleatum can be found in Marinov et al. (in preparation). OMA (Orthology MAtrix) (Altenhoff et al. 2015) was used to find orthologs between P. caudatum and P. multimicronucleatum. A total of 11002 genes of P. caudatum were found to have 1-1 orthologs in P. multimicronucleatum, and only 1-1 orthologs were used for the current analysis.

Mapping of Reads and Detection of SNPs and Indels

Sequenced paired-end reads were mapped to the reference genomes for each species using BWA (bwa mem) (Li and Durbin 2010) with default parameters, and the mpileup function in Samtools (Li et al. 2009) was used to generate mpileup files for each individual. Two distinct pipelines were employed to identify SNPs to ensure detection accuracy: Bcftools-Vcftools (Li 2011) and the maximum likelihood Genotype Frequency Estimator (referred to as GFE) (Lynch 2009; Maruki and Lynch 2015). Both pipelines produced highly overlapping sets of SNPs (supplementary table S2, Supplementary Material online). For both pipelines, sites were only considered for SNP detection and further analysis if the mapping quality was above 30, base quality was above 20, per-base alignment quality was above 15, and the sum of the depth of coverage for all individuals was above 5 times the number of individuals and less than twice the average population coverage. SNPs were called by bcftools if the SNP quality score was above 30 and by GFE if the significance score > 5.91 (P > 0.05). The exact parameters of the Bcftools-Vcftools pipeline to call SNPs and indels are given below.
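The depth criterion in the preceding paragraph can be written as a simple predicate. This is a sketch of the stated rule only (summed depth across individuals above five times the number of individuals and below twice the average population coverage); the function names are hypothetical and the example coverage is made up.

```python
def depth_bounds(n_individuals, mean_population_coverage):
    """Lower and upper bounds on the summed depth across all individuals."""
    lower = 5 * n_individuals
    upper = 2 * mean_population_coverage
    return lower, upper

def passes_depth_filter(total_depth, n_individuals, mean_population_coverage):
    """True if a site's summed depth lies strictly between the two bounds."""
    lower, upper = depth_bounds(n_individuals, mean_population_coverage)
    return lower < total_depth < upper

# Illustrative numbers: 10 individuals and an assumed average population
# coverage of 400 give bounds of 50 and 800.
bounds = depth_bounds(10, 400)
```

With these illustrative inputs the bounds coincide with the -d 50 -D 800 values that appear in the varFilter commands, although the coverage figure of 400 is an assumption made for the example and is not taken from the text.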

samtools mpileup -q 30 -Q 20 -t DP -uvf reference_genome.fasta sorted.bam > species.vcf

bcftools call -f GQ -o species_snps.vcf -O v -V indels -cv species.vcf

bcftools call -f GQ -o species_indels.vcf -O v -V snps -cv species.vcf

# filtering snp and indel files for depth and biases
perl vcfutils.pl varFilter -d 50 -D 800 -1 -2 -3 -4 species_snps.vcf > species_snps.filtered1.vcf

perl vcfutils.pl varFilter -d 50 -D 800 -1 -2 -3 -4 species_indels.vcf > species_indels.filtered1.vcf

vcftools --vcf species_snps.filtered1.vcf --minQ 20 --minGQ 30 --minDP 4 --recode --recode-INFO-all --out species_snps.filtered2

vcftools --vcf species_indels.filtered1.vcf --minQ 20 --minGQ 30 --minDP 4 --recode --recode-INFO-all --out species_indels.filtered2

# filter out the recoded sites that are not polymorphic anymore, that is, AF1 = 1
bcftools view -q 0.001 species_snps.filtered2.recode.vcf -o species_snps.final.vcf

bcftools view -q 0.001 species_indels.filtered2.recode.vcf -o species_indels.final.vcf

Mitochondrial SNPs were called using the Bcftools-Vcftools pipeline with the same parameters as noted above.

Simulation of SNP Detection with Varying Levels of Divergence from the Reference

We simulated reads using wgsim (https://github.com/lh3/wgsim) for a diploid genome with varying levels of divergence from the reference genome of P. caudatum and P. multimicronucleatum, using average values of read length, total number of reads, and insert sizes.

For the P. caudatum reference genome:
wgsim -d 287 -s 113 -N 10201087 -1 151 -2 151 -r 0.10 -R 0.013 caudatum_assembly.fasta caud_010_1.fastq caud_010_2.fastq >& caud_010.info

For the P. multimicronucleatum reference genome:
wgsim -d 261 -s 102 -N 12154044 -1 151 -2 151 -r 0.10 -R 0.011 multimicronucleatum_assembly.fasta multi_010_1.fastq multi_010_2.fastq >& multi_010.info

Reads were simulated for r = 0.0, 0.05, 0.10, 0.15, 0.20, 0.25, and 0.30, which represents the fraction of the genome with SNPs. The SNP detection pipeline described above was used to call SNPs, which were compared to the simulated set of SNPs (supplementary table S4, Supplementary Material online).
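The comparison between called and simulated SNPs amounts to counting shared and unshared positions. A minimal sketch is given below, assuming each SNP is represented as a (scaffold, position) pair; the function name and the choice of summary statistics are illustrative and are not taken from the original analysis.

```python
def compare_snp_calls(simulated, called):
    """Compare two collections of (scaffold, position) SNP coordinates."""
    simulated, called = set(simulated), set(called)
    true_pos = len(simulated & called)
    return {
        "true_positives": true_pos,
        "false_positives": len(called - simulated),
        "false_negatives": len(simulated - called),
        # Fraction of simulated SNPs that were recovered by the pipeline.
        "sensitivity": true_pos / len(simulated) if simulated else float("nan"),
        # Fraction of called SNPs that were actually simulated.
        "precision": true_pos / len(called) if called else float("nan"),
    }
```

Running this once per simulated divergence level (r) would give one row of recovery statistics per level.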

Building Phylogenies

A consensus genome (bcftools view species_filtered.recode.vcf | vcfutils.pl vcf2fq > species.cns.fq) was obtained for each individual using bcftools that represented heterozygous sites using DNA ambiguity codes. Gene nucleotide sequences that are present in single-copy in the P. aurelia species and are also present in both the outgroup species (P. caudatum and P. multimicronucleatum) were extracted, and multiple sequence alignments were generated using MUSCLE (Edgar 2004a, 2004b). This amounted to about 55 sequences (including the reference genomes) each in 883 sets of orthologous genes. RAxML 8.0.26 (Stamatakis 2014) was used to build a tree using the concatenation of all 883 aligned gene sequences (raxml -s concatenated_sequences.fasta -n gene -m GTRGAMMA -T 4 -p 7). In addition, individual gene trees (raxml -s gene.fasta -n gene -m GTRGAMMA -T 4 -p 7) were built, which were then used to create a consensus tree (raxml -J MR -z trees.newick -T 4 -m GTRGAMMA) (Aberer et al. 2010).
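To illustrate how heterozygous sites are represented with DNA ambiguity codes in a consensus sequence, the sketch below maps a diploid genotype to its IUPAC symbol. It covers only biallelic SNP genotypes and is not the vcfutils.pl implementation; the function names are hypothetical.

```python
# IUPAC nucleotide codes for homozygous and two-allele heterozygous genotypes.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
}


def consensus_base(allele1, allele2):
    """Return the IUPAC symbol for a diploid genotype such as ('A', 'G')."""
    return IUPAC[frozenset((allele1.upper(), allele2.upper()))]


def consensus_sequence(genotypes):
    """Build a consensus string from an iterable of (allele1, allele2) pairs."""
    return "".join(consensus_base(a, b) for a, b in genotypes)
```

For instance, the genotypes A/A, A/G, T/C, and G/T yield the consensus string ARYK.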

Mitochondrial phylogeny was built using the same parameters as above by concatenating eight sets of orthologous genes (NADH_dehydrogenase_subunit_1_a, NADH_dehydrogenase_subunit_5, Ymf56orf78, Ymf67orf265_1, Ymf68orf393, apocytochrome_b, cytochrome_c_oxidase_subunit_1, cytochrome_c_oxidase_subunit_2, ribosomal_protein_S14) that were ubiquitously present across all sampled species.

Principal Component Analysis

Genotypes for all individuals were obtained from the vcf files generated by vcftools for all scaffolds. Only the sites that had a genotype available for all individuals were used in the analysis, i.e., if even a single individual had missing data, the site was excluded from analysis. A total of 10,000 markers were used for each species. The first marker of each scaffold was picked randomly within the first 50 SNP positions, and the rest were selected so that all markers were uniformly distributed across all scaffolds. PCA was carried out on these markers by using the R (R-Core-Team 2014) command prcomp (https://stat.ethz.ch/R-manual/R-devel/library/stats/html/prcomp.html) and visualized using the first two components with the R graphing package ggplot2 (Wickham 2010).
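The marker-selection step can be sketched as follows for a single scaffold. This is an illustrative reading of the procedure, not the authors' script: missing genotypes are assumed to be encoded as None, markers are spaced evenly by SNP rank rather than by physical distance, and the function names are hypothetical.

```python
import random


def complete_sites(genotypes_by_site):
    """Keep sites at which no individual has a missing genotype (None)."""
    return [site for site, calls in genotypes_by_site.items()
            if all(call is not None for call in calls)]


def pick_markers(snp_positions, n_markers, rng=None, first_window=50):
    """Pick evenly spaced markers from one scaffold.

    The first marker is drawn at random from the first `first_window`
    SNPs; the remaining markers are spread evenly over the SNPs that follow.
    """
    rng = rng or random.Random()
    positions = sorted(snp_positions)
    if not positions or n_markers <= 0:
        return []
    start = rng.randrange(min(first_window, len(positions)))
    remaining = len(positions) - start
    n_markers = min(n_markers, remaining)
    step = remaining / n_markers  # >= 1, so the chosen indices are distinct
    return [positions[start + int(i * step)] for i in range(n_markers)]
```

The resulting genotype matrix (individuals by markers) would then be passed to a PCA routine such as prcomp in R.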

Calculation of Population Genetic Statistics

Nucleotide Diversity and Tajima's D

Nucleotide diversity was calculated as 2pqn/(n - 1), where p and q are the major and minor allele frequencies and n is the number of chromosomes. The major and minor allele frequencies were given for each site by GFE. Diversity at coding sites was partitioned into synonymous and non-synonymous components (e.g., 2-fold degenerate sites are one-third synonymous and two-thirds non-synonymous). Major allele frequencies were also extracted from vcftools and returned very similar results to those from GFE (supplementary table S3, Supplementary Material online).
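As a worked illustration of the diversity estimator, the sketch below computes 2pqn/(n - 1) for a single site and averages it over sites. The synonymous weights per degeneracy class follow the example in the text for 2-fold sites; the values for 0-fold and 4-fold sites are the conventional ones and are stated here as an assumption. All names are hypothetical.

```python
# Fraction of a site counted as synonymous, by degeneracy class.
# The 2-fold value follows the text; 0-fold and 4-fold values are assumed.
SYNONYMOUS_FRACTION = {0: 0.0, 2: 1.0 / 3.0, 4: 1.0}


def site_diversity(major_freq, n_chromosomes):
    """Per-site nucleotide diversity, 2pq * n / (n - 1)."""
    if n_chromosomes < 2:
        raise ValueError("at least two chromosomes are required")
    p = major_freq
    q = 1.0 - p
    return 2.0 * p * q * n_chromosomes / (n_chromosomes - 1)


def mean_diversity(major_freqs, n_chromosomes):
    """Average per-site diversity over a list of major allele frequencies."""
    values = [site_diversity(p, n_chromosomes) for p in major_freqs]
    return sum(values) / len(values) if values else float("nan")
```

For example, a site with p = q = 0.5 sampled from 10 chromosomes has a diversity of 2 x 0.25 x 10/9, or about 0.556, while a monomorphic site contributes 0.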


A sliding window approach was used to calculate Tajima's D across the genome. Vcftools was used to obtain these values for non-overlapping windows of 1 kb in length. The program ms (Hudson 2002) was used to perform 100,000 simulations of 10 individuals with 1 kb locus length, and niceStats (https://github.com/kern-lab/shIC/blob/master/niceStats.c) was used to calculate Tajima's D from the ms output file. To obtain Tajima's D for different categories of sites, SNPs belonging to a particular category of sites (specified in Results) were selected and vcftools was used to calculate Tajima's D for non-overlapping windows of 10 kb. Windows with 2 or more SNPs were used to calculate the mean, and only alternate windows were used to compare distributions of Tajima's D across different categories by Student's t-test, to maximise independence of sampling.
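For readers unfamiliar with the statistic, the sketch below computes Tajima's D (Tajima 1989) for one window from per-site allele counts, and shows how alternate windows can be taken from a list. It is a plain restatement of the standard formula, not the vcftools or niceStats code; the function names are hypothetical.

```python
import math


def tajimas_d(minor_allele_counts, n):
    """Tajima's D for one window.

    minor_allele_counts: allele count at each site in the window.
    n: number of sampled chromosomes (D is undefined for n < 4 here).
    """
    counts = [c for c in minor_allele_counts if 0 < c < n]  # segregating sites
    s = len(counts)
    if n < 4 or s == 0:
        return float("nan")
    # Mean pairwise differences, summed over sites.
    pi = sum(2.0 * c * (n - c) / (n * (n - 1)) for c in counts)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))


def alternate(windows):
    """Return every other window, starting with the first."""
    return windows[::2]
```

A window containing only a singleton gives a negative D, and a window containing only an intermediate-frequency variant gives a positive D, as expected.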

Plotting Average Expected Heterozygosity across Introns and Intergenic Regions

All introns (< 500 bp in length) were split midway and aligned according to the 5′ and 3′ splice sites. All intergenic regions (< 500 bp in the outgroup species and < 1,000 bp in the P. aurelia species) were split midway and aligned according to the start and stop codons. Note that these were not sequence alignments. At each site, the mean expected heterozygosity was calculated. Sites with an insufficient number of data points (< 200 data points per site for introns, while for intergenic regions, < 100 data points per site for outgroups and < 1,000 data points per site in the P. aurelia species) were discarded.
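The positional averaging described above can be sketched as follows. Each region is split at its midpoint; sites in the first half are indexed by distance from the 5′ boundary and sites in the second half by distance from the 3′ boundary, so that regions of different lengths are stacked by position rather than by sequence alignment. The offset convention and function name are assumptions made for illustration.

```python
from collections import defaultdict


def positional_means(regions, min_points):
    """Average per-site values across regions aligned at both boundaries.

    regions: list of per-site heterozygosity lists, ordered 5' to 3'.
    Returns {offset: mean}. Offsets >= 0 count from the 5' boundary;
    negative offsets count from the 3' boundary (-1 is the last site).
    Positions supported by fewer than `min_points` regions are dropped.
    """
    values = defaultdict(list)
    for region in regions:
        half = len(region) // 2
        for i, h in enumerate(region):
            offset = i if i < half else i - len(region)
            values[offset].append(h)
    return {off: sum(v) / len(v)
            for off, v in values.items() if len(v) >= min_points}
```

With min_points set to 200 for introns, for example, only positions covered by at least 200 introns would be plotted.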

Calculating and Plotting Divergence across Introns and Intergenic Regions

Orthologous coding sequences along with introns were aligned using MUSCLE (Edgar 2004a). Intergenic regions flanked by orthologous genes on both sides were also aligned, along with the flanking coding exon, using MUSCLE. For each species separately, divergence for different positions (with respect to the start and stop sites) was averaged across all introns and intergenic regions. The divergence values were corrected for multiple substitutions by the Jukes-Cantor sequence substitution model and plotted in the same manner as described above.
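The Jukes-Cantor correction converts an observed proportion of differing sites, p, into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). A minimal sketch, with a hypothetical function name:

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed divergence p.

    p is the proportion of sites that differ; the correction is only
    defined for 0 <= p < 0.75.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

For small p the correction is slight (p = 0.10 gives d of about 0.107), and it grows rapidly as p approaches saturation at 0.75.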

GO Analysis

All gene ontology enrichment analyses were performed using the BiNGO plugin for Cytoscape 3.2.1 (Shannon et al. 2003; Maere et al. 2005), using the hypergeometric test to assess significance with Benjamini and Hochberg FDR correction for multiple testing (Benjamini and Hochberg 1995). The latest ontology files (.obo) were downloaded from www.geneontology.org, and custom annotation files were prepared as required using python scripts.
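The statistics underlying this enrichment test can be written out directly. The sketch below computes the one-sided hypergeometric P-value for over-representation of a GO term and applies the Benjamini-Hochberg adjustment to a list of P-values. It illustrates the two procedures named in the text and is not BiNGO's code; function and argument names are hypothetical.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) for a hypergeometric draw.

    N: genes in the background, K: background genes annotated with the term,
    n: genes in the test set, k: test-set genes annotated with the term.
    """
    upper = min(n, K)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, upper + 1)) / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Terms whose adjusted P-value falls below the chosen FDR threshold would be reported as enriched.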

Identification of Heterozygous Regions in the P. aurelia Species

The fraction of heterozygous sites was calculated for 10 bp non-overlapping windows across the genome for all three P. aurelia species, and the top 1% of such regions that have the highest fraction of heterozygous sites were identified as heterozygous regions.
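A minimal sketch of this windowing step is shown below, assuming heterozygosity has already been scored as one boolean per genomic site. The function names, the handling of a trailing partial window (dropped), and the tie-breaking among equal fractions are assumptions made for illustration.

```python
def heterozygous_fraction_windows(is_het, window=10):
    """Fraction of heterozygous sites in non-overlapping windows.

    is_het: sequence of booleans, one per site along a scaffold.
    Returns (window_start, fraction) for each complete window.
    """
    out = []
    for start in range(0, len(is_het) - window + 1, window):
        chunk = is_het[start:start + window]
        out.append((start, sum(chunk) / window))
    return out


def top_fraction(windows, fraction=0.01):
    """Return the windows with the highest heterozygous fraction (top 1% by default)."""
    n_top = max(1, int(len(windows) * fraction))
    return sorted(windows, key=lambda w: w[1], reverse=True)[:n_top]
```

For a scaffold of 1,000 sites, this yields 100 windows, of which the single most heterozygous window is reported.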

Mitochondrial Genome Assembly and Annotation

Mitochondrial genomes were assembled as follows. Sequence reads were assembled using SPAdes (version 3.5.0) (Bankevich et al. 2012) after removing potential adapter sequence with Trimmomatic (version 0.33) (Bolger et al. 2014). Mitochondrial contigs were identified from the resulting assemblies by BLAST searches against the published P. caudatum and P. tetraurelia mitochondrial genomes (Pritchard et al. 1990; Barth and Berendonk 2011).

Protein coding genes were identified by generating all ORFs longer than 60 aa in all six frames using the Mold, Protozoan, and Coelenterate Mitochondrial Code (i.e., UGA codes for W instead of being a stop codon) and all alternative start codons specific to Paramecium (AUU, AUA, AUG, AUC, GUG, and GUA), and retaining the longest ORFs associated with each stop codon. BLASTP (Altschul et al. 1997) was then used to identify homologs of annotated mitochondrial proteins in P. tetraurelia and P. caudatum. Additional ORFs were identified by imposing the requirement that their length exceeds 100 aa, and subsequently annotated using BLASTP against the Non-redundant protein sequences (nr) database and HMMER 3.0 (Eddy 2011) scans against the PFAM 27.0 database (Finn et al. 2014).
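The ORF-generation step can be sketched as a six-frame scan in which TGA is read through (it encodes W under this genetic code), only TAA and TAG terminate an ORF, and the earliest of the listed start codons after the previous stop defines the longest ORF for each stop codon. This is an illustrative reimplementation on DNA (T in place of U), not the authors' script; the function names and coordinate convention are hypothetical.

```python
STOPS = {"TAA", "TAG"}  # TGA encodes W under this mitochondrial code
STARTS = {"ATT", "ATA", "ATG", "ATC", "GTG", "GTA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_aa=60):
    """Return (strand, frame, start, end, length_aa) for each ORF.

    start and end are 0-based offsets on the scanned strand; end is the
    offset of the stop codon. Only ORFs longer than min_aa are kept, and
    for each stop codon the ORF begins at the earliest in-frame start.
    """
    orfs = []
    forward = seq.upper()
    for strand, s in (("+", forward), ("-", reverse_complement(forward))):
        for frame in range(3):
            first_start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if codon in STOPS:
                    if first_start is not None:
                        length_aa = (i - first_start) // 3
                        if length_aa > min_aa:
                            orfs.append((strand, frame, first_start, i, length_aa))
                    first_start = None
                elif first_start is None and codon in STARTS:
                    first_start = i
    return orfs
```

Raising min_aa to 100 gives the stricter length requirement applied to the additional ORFs.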

tRNA genes were annotated with tRNAscan-SE 1.21 (Schattner et al. 2005) using the "Mito/Chloroplast" source.

Supplementary Material

Supplementary data are available at Molecular Biology and Evolution online.

Acknowledgments

We thank Lydia Bright and Jean-Francois Gout for helpful comments on the manuscript, Matthew S. Ackerman for insightful discussions related to the project, and the anonymous reviewers and editor for their valuable suggestions. We would also like to thank Kelley Thomas for providing help with the sequencing, and E. Przybos for kindly providing several Paramecium strains and the species identification of some so far unspecified strains. This work was financially supported by the National Science Foundation (MCB-1050161 and DEB-1257806), the European Commission FP7-PEOPLE-2009-IRSES project CINAR PATHOBACTER (project number 247658), and the CNRS research network GDRE "Paramecium Genome Dynamics and Evolution". T.U.B. also acknowledges funding of the "support the best" of the ZUK TU Dresden and the Deutsche Forschungsgemeinschaft (DFG).

References

Abbott RJ, Gomes MF. 1989. Population genetic structure and outcrossing rate of Arabidopsis thaliana (L.) Heynh. Heredity 62:411–418.

Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, McVean GA. 2010. A map of human genome variation from population-scale sequencing. Nature 467:1061–1073.


Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA. 2012. An integrated map of genetic variation from 1,092 human genomes. Nature 491:56–65.

Aberer AJ, Pattengale ND, Stamatakis A. 2010. Parallelized phylogenetic post-analysis on multi-core architectures. J Comput Sci-Neth. 1:107–114.

Ackermann AA, Panunzi LG, Cosentino RO, Sanchez DO, Aguero F. 2012. A genomic scale map of genetic diversity in Trypanosoma cruzi. BMC Genomics 13:736.

Altenhoff AM, Skunca N, Glover N, Train CM, Sueki A, Pilizota I, Gori K, Tomiczek B, Muller S, Redestig H, et al. 2015. The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Res. 43:D240–D249.

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389–3402.

Arnaiz O, Cain S, Cohen J, Sperling L. 2007. ParameciumDB: a community resource that integrates the Paramecium tetraurelia genome sequence with genetic data. Nucleic Acids Res. 35:D439–D444.

Arnaiz O, Mathy N, Baudry C, Malinsky S, Aury JM, Wilkes CD, Garnier O, Labadie K, Lauderdale BE, Le Mouel A, et al. 2012. The Paramecium germline genome provides a niche for intragenic parasitic DNA: evolutionary dynamics of internal eliminated sequences. PLoS Genet. 8:e1002984.

Arnaiz O, Sperling L. 2011. ParameciumDB in 2011: new tools and new data for functional and comparative genomics of the model ciliate Paramecium tetraurelia. Nucleic Acids Res. 39:D632–D636.

Assefa S, Lim C, Preston MD, Duffy CW, Nair MB, Adroub SA, Kadir KA, Goldberg JM, Neafsey DE, Divis P, et al. 2015. Population genomic structure and adaptation in the zoonotic malaria parasite Plasmodium knowlesi. Proc Natl Acad Sci U S A. 112:13027–13032.

Aury JM, Jaillon O, Duret L, Noel B, Jubin C, Porcel BM, Segurens B, Daubin V, Anthouard V, Aiach N, et al. 2006. Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature 444:171–178.

Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR. 2015. A global reference for human genetic variation. Nature 526:68–74.

Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, et al. 2012. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 19:455–477.

Barth D, Berendonk TU. 2011. The mitochondrial genome sequence of the ciliate Paramecium caudatum reveals a shift in nucleotide composition and codon usage within the genus Paramecium. BMC Genomics 12:272.

Barth D, Krenek S, Fokin SI, Berendonk TU. 2006. Intraspecific genetic variation in Paramecium revealed by mitochondrial cytochrome C oxidase I sequences. J Eukaryot Microbiol. 53:20–25.

Bazin E, Glemin S, Galtier N. 2006. Population size does not influence mitochondrial genetic diversity in animals. Science 312:570–572.

Beale G, Preer JR. 2008. Paramecium: genetics and epigenetics. Boca Raton (FL): CRC Press.

Begun DJ, Aquadro CF. 1992. Levels of naturally-occurring DNA polymorphism correlate with recombination rates in Drosophila melanogaster. Nature 356:519–520.

Begun DJ, Holloway AK, Stevens K, Hillier LW, Poh YP, Hahn MW, Nista PM, Jones CD, Kern AD, Dewey CN, et al. 2007. Population genomics: whole-genome analysis of polymorphism and divergence in Drosophila simulans. PLoS Biol. 5:e310.

Benjamini Y, Hochberg Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B Met. 57:289–300.

Betermier M. 2004. Large-scale genome remodelling by the developmentally programmed elimination of germ line sequences in the ciliate Paramecium. Res Microbiol. 155:399–408.

Birky CW, Walsh JB. 1988. Effects of linkage on rates of molecular evolution. Proc Natl Acad Sci U S A. 85:6414–6418.

Birnbaum RY, Clowney EJ, Agamy O, Kim MJ, Zhao J, Yamanaka T, Pappalardo Z, Clarke SL, Wenger AM, Nguyen L, et al. 2012. Coding exons function as tissue-specific enhancers of nearby genes. Genome Res. 22:1059–1068.

Boel G, Letso R, Neely H, Price WN, Wong KH, Su M, Luff JD, Valecha M, Everett JK, Acton TB, et al. 2016. Codon influence on protein expression in E. coli correlates with mRNA levels. Nature 529:358–363.

Bolger AM, Lohse M, Usadel B. 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30:2114–2120.

Brandvain Y, Wright SI. 2016. The limits of natural selection in a nonequilibrium world. Trends Genet. 32:201–210.

Brunk CF, Lee LC, Tran AB, Li J. 2003. Complete sequence of the mitochondrial genome of Tetrahymena thermophila and comparative methods for identifying highly divergent genes. Nucleic Acids Res. 31:1673–1682.

Burger G, Zhu Y, Littlejohn TG, Greenwood SJ, Schnare MN, Lang BF, Gray MW. 2000. Complete sequence of the mitochondrial genome of Tetrahymena pyriformis and comparison with Paramecium aurelia mitochondrial DNA. J Mol Biol. 297:365–380.

Burki F. 2014. The eukaryotic tree of life from a global phylogenomic perspective. Cold Spring Harb Perspect Biol. 6:a016147.

Cao J, Schneeberger K, Ossowski S, Gunther T, Bender S, Fitz J, Koenig D, Lanz C, Stegle O, Lippert C, et al. 2011. Whole-genome sequencing of multiple Arabidopsis thaliana populations. Nat Genet. 43:956–963.

Catania F, Wurmser F, Potekhin AA, Przybos E, Lynch M. 2009. Genetic diversity in the Paramecium aurelia species complex. Mol Biol Evol. 26:421–431.

Chalker DL, Yao MC. 2011. DNA elimination in ciliates: transposon domestication and genome surveillance. Annu Rev Genet. 45:227–246.

Charlesworth D. 2003. Effects of inbreeding on the genetic diversity of populations. Philos T Roy Soc B. 358:1051–1070.

Charlesworth B, Morgan MT, Charlesworth D. 1993. The effect of deleterious mutations on neutral molecular variation. Genetics 134:1289–1303.

Chia JM, Song C, Bradbury PJ, Costich D, de Leon N, Doebley J, Elshire RJ, Gaut B, Geller L, Glaubitz JC, et al. 2012. Maize HapMap2 identifies extant variation from a genome in flux. Nat Genet. 44:803–807.

Comeron JM, Williford A, Kliman RM. 2008. The Hill-Robertson effect: evolutionary consequences of weak selection and linkage in finite populations. Heredity 100:19–31.

Dey A, Chan CK, Thomas CG, Cutter AD. 2013. Molecular hyperdiversity defines populations of the nematode Caenorhabditis brenneri. Proc Natl Acad Sci U S A. 110:11056–11060.

Dong X, Navratilova P, Fredman D, Drivenes O, Becker TS, Lenhard B. 2010. Exonic remnants of whole-genome duplication reveal cis-regulatory function of coding exons. Nucleic Acids Res. 38:1071–1085.

Downing T, Imamura H, Decuypere S, Clark TG, Coombs GH, Cotton JA, Hilley JD, de Doncker S, Maes I, Mottram JC, et al. 2011. Whole genome sequencing of multiple Leishmania donovani clinical isolates provides insights into population structure and mechanisms of drug resistance. Genome Res. 21:2143–2156.

Drummond DA, Bloom JD, Adami C, Wilke CO, Arnold FH. 2005. Why highly expressed proteins evolve slowly. Proc Natl Acad Sci U S A. 102:14338–14343.

Eddy SR. 2011. Accelerated profile HMM searches. PLoS Comput Biol. 7:e1002195.

Edgar RC. 2004a. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5:113.

Edgar RC. 2004b. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32:1792–1797.

Eichenlaub MP, Ettwiller L. 2011. De novo genesis of enhancers in vertebrates. PLoS Biol. 9:e1001188.

Ellison CE, Hall C, Kowbel D, Welch J, Brem RB, Glass NL, Taylor JW. 2011. Population genomics and local adaptation in wild isolates of a model microbial eukaryote. Proc Natl Acad Sci U S A. 108:2831–2836.

Elyashiv E, Bullaughey K, Sattath S, Rinott Y, Przeworski M, Sella G. 2010. Shifts in the intensity of purifying selection: an analysis of genome-wide polymorphism data from two closely related yeast species. Genome Res. 20:1558–1573.

Engelthaler DM, Hicks ND, Gillece JD, Roe CC, Schupp JM, Driebe EM, Gilgado F, Carriconde F, Trilles L, Firacative C, et al. 2014. Cryptococcus gattii in North American Pacific Northwest: whole-population genome analysis provides insights into species evolution and dispersal. mBio 5:e01464–e01414.

Fawcett JA, Iida T, Takuno S, Sugino RP, Kado T, Kugou K, Mura S, Kobayashi T, Ohta K, Nakayama J, et al. 2014. Population genomics of the fission yeast Schizosaccharomyces pombe. PLoS ONE 9:e104241.

Fenchel T, Finlay BJ. 2004. The ubiquity of small species: patterns of local and global diversity. Bioscience 54:777–784.

Finlay BJ. 2002. Global dispersal of free-living microbial eukaryote species. Science 296:1061–1063.

Finlay BJ, Esteban GF, Fenchel T. 2004. Protist diversity is different? Protist 155:15–22.

Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J, et al. 2014. Pfam: the protein families database. Nucleic Acids Res. 42:D222–D230.

Foissac S, Gouzy J, Rombauts S, Mathe C, Amselem J, Sterck L, Van de Peer Y, Rouze P, Schiex T. 2008. Genome annotation in plants and fungi: EuGene as a model platform. Curr Bioinform. 3:87–97.

Foissner W. 1999. Protist diversity: estimates of the near-imponderable. Protist 150:363–368.

Foissner W. 2006. Biogeography and dispersal of micro-organisms: a review emphasizing protists. Acta Protozool. 45:111–136.

Foissner W. 2008. Protist diversity and distribution: some basic considerations. Biodivers Conserv. 17:235–242.

Foissner W. 2011. Dispersal of protists: the role of cysts and human introductions. In: Fontaneto D, editor. Biogeography of microscopic organisms: is everything small everywhere? Cambridge University Press. p. 61–87.

Foissner W, Chao A, Katz LA. 2008. Diversity and geographic distribution of ciliates (Protista: Ciliophora). Biodivers Conserv. 17:345–363.

Fournier-Level A, Korte A, Cooper MD, Nordborg M, Schmitt J, Wilczek AM. 2011. A map of local adaptation in Arabidopsis thaliana. Science 334:86–89.

Galtier N, Nabholz B, Glemin S, Hurst GDD. 2009. Mitochondrial DNA as a marker of molecular diversity: a reappraisal. Mol Ecol. 18:4541–4550.

Gout JF, Kahn D, Duret L, Paramecium Post-Genomics C. 2010. The relationship among gene expression, the evolution of gene dosage, and the rate of protein evolution. PLoS Genet. 6:e1000944.

Gout JF, Lynch M. 2015. Maintenance and loss of duplicated genes by dosage subfunctionalization. Mol Biol Evol. 32:2141–2148.

Gruchy DF. 1955. The breeding system and distribution of Tetrahymena pyriformis. J Protozool. 2:178–185.

Gustincich S, Manfioletti G, Del Sal G, Schneider C, Carninci P. 1991. A fast method for high-quality genomic DNA extraction from whole human blood. BioTechniques 11:298–300, 302.

Hancock AM, Witonsky DB, Alkorta-Aranburu G, Beall CM, Gebremedhin A, Sukernik R, Utermann G, Pritchard JK, Coop G, Di Rienzo A. 2011. Adaptations to climate-mediated selective pressures in humans. PLoS Genet. 7:e1001375.

Hiwatashi K. 1968. Determination and inheritance of mating type in Paramecium caudatum. Genetics 58:373–386.

Hori M, Tomikawa I, Przybos E, Fujishima M. 2006. Comparison of the evolutionary distances among syngens and sibling species of Paramecium. Mol Phylogenet Evol. 38:697–704.

Huang X, Kurata N, Wei X, Wang ZX, Wang A, Zhao Q, Zhao Y, Liu K, Lu H, Li W, et al. 2012. A map of rice genome variation reveals the origin of cultivated rice. Nature 490:497–501.

Huang X, Zhao Q, Han B. 2015. Comparative population genomics reveals strong divergence and infrequent introgression between Asian and African rice. Mol Plant 8:958–960.

Hudson RR. 2002. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18:337–338.

Hudson RR, Turelli M. 2003. Stochasticity overrules the "three-times rule": genetic drift, genetic draft, and coalescence times for nuclear loci versus mitochondrial DNA. Evolution 57:182–190.

Jaillon O, Bouhouche K, Gout JF, Aury JM, Noel B, Saudemont B, Nowacki M, Serrano V, Porcel BM, Segurens B, et al. 2008. Translational control of intron splicing in eukaryotes. Nature 451:359–362.

James JE, Piganeau G, Eyre-Walker A. 2016. The rate of adaptive evolution in animal mitochondria. Mol Ecol. 25:67–78.

Joy DA, Feng X, Mu J, Furuya T, Chotivanich K, Krettli AU, Ho M, Wang A, White NJ, Suh E, et al. 2003. Early origin and recent expansion of Plasmodium falciparum. Science 300:318–321.

Kaplan NL, Hudson RR, Langley CH. 1989. The "hitchhiking effect" revisited. Genetics 123:887–899.

Katz LA, McManus GB, Snoeyenbos-West OLO, Griffin A, Pirog K, Costas B, Foissner W. 2005. Reframing the "Everything is everywhere" debate: evidence for high gene flow and diversity in ciliate morphospecies. Aquat Microb Ecol. 41:55–65.

Katz LA, Snoeyenbos-West O, Doerder FP. 2006. Patterns of protein evolution in Tetrahymena thermophila: implications for estimates of effective population size. Mol Biol Evol. 23:608–614.

Krenek S, Berendonk TU, Fokin SI. 2015. New Paramecium (Ciliophora, Oligohymenophorea) congeners shape our view on its biodiversity. Org Divers Evol. 15:215–233.

Krenek S, Berendonk TU, Petzoldt T. 2011. Thermal performance curves of Paramecium caudatum: a model selection approach. Eur J Protistol. 47:124–137.

Krenek S, Petzoldt T, Berendonk TU. 2012. Coping with temperature at the warm edge–patterns of thermal adaptation in the microbial eukaryote Paramecium caudatum. PLoS One 7:e30598.

Kudla G, Murray AW, Tollervey D, Plotkin JB. 2009. Coding-sequence determinants of gene expression in Escherichia coli. Science 324:255–258.

Kvitek DJ, Will JL, Gasch AP. 2008. Variations in stress sensitivity and genomic expression in diverse S. cerevisiae isolates. PLoS Genet. 4:e1000223.

Lahr DJ, Laughinghouse HDt, Oliverio AM, Gao F, Katz LA. 2014. How discordant morphological and molecular evolution among microorganisms can revise our notions of biodiversity on Earth. Bioessays 36:950–959.

Langley CH, Stevens K, Cardeno C, Lee YC, Schrider DR, Pool JE, Langley SA, Suarez C, Corbett-Detig RB, Kolaczkowski B, et al. 2012. Genomic variation in natural populations of Drosophila melanogaster. Genetics 192:533–598.

Leffler EM, Bullaughey K, Matute DR, Meyer WK, Segurel L, Venkat A, Andolfatto P, Przeworski M. 2012. Revisiting an old riddle: what determines genetic diversity levels within species? PLoS Biol. 10:e1001388.

Li H. 2011. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27:2987–2993.

Li H, Durbin R. 2010. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26:589–595.

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Genome Project Data Processing S. 2009. The sequence alignment/map format and SAMtools. Bioinformatics 25:2078–2079.

Lim LP, Burge CB. 2001. A computational analysis of sequence features involved in recognition of short introns. Proc Natl Acad Sci U S A. 98:11193–11198.

Liti G, Carter DM, Moses AM, Warringer J, Parts L, James SA, Davey RP, Roberts IN, Burt A, Koufopanou V, et al. 2009. Population genomics of domestic and wild yeasts. Nature 458:337–341.

Loveless MD, Hamrick JL. 1984. Ecological determinants of genetic structure in plant populations. Annu Rev Ecol Syst. 15:65–95.

Lynch M. 2007. The origins of genome architecture. Sunderland (MA): Sinauer Associates.

Lynch M. 2009. Estimation of allele frequencies from high-coverage genome-sequencing projects. Genetics 182:295–301.


Lynch M, Field MC, Goodson HV, Malik HS, Pereira-Leal JB, Roos DS, Turkewitz AP, Sazer S. 2014. Evolutionary cell biology: two origins, one objective. Proc Natl Acad Sci U S A. 111:16990–16994.

Lynch M, Sung W, Morris K, Coffey N, Landry CR, Dopman EB, Dickinson WJ, Okamoto K, Kulkarni S, Hartl DL, et al. 2008. A genome-wide view of the spectrum of spontaneous mutations in yeast. Proc Natl Acad Sci U S A. 105:9272–9277.

Maere S, Heymans K, Kuiper M. 2005. BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics 21:3448–3449.

Maguire B, Belk D. 1967. Paramecium transport by land snails. J Protozool. 14:445–447.

Maruki T, Lynch M. 2015. Genotype-frequency estimation from high-throughput sequencing data. Genetics 201:473–486.

McGrath CL, Gout JF, Doak TG, Yanagi A, Lynch M. 2014a. Insights into three whole-genome duplications gleaned from the Paramecium caudatum genome sequence. Genetics 197:1417–1428.

McGrath CL, Gout JF, Johri P, Doak TG, Lynch M. 2014b. Differential retention and divergent resolution of duplicate genes following whole-genome duplication. Genome Res. 24:1665–1675.

Miller JR, Delcher AL, Koren S, Venter E, Walenz BP, Brownley A, Johnson J, Li K, Mobarry C, Sutton G. 2008. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics 24:2818–2824.

Moradian MM, Beglaryan D, Skozylas JM, Kerikorian V. 2007. Complete mitochondrial genome sequence of three Tetrahymena species reveals mutation hot spots and accelerated nonsynonymous substitutions in Ymf genes. PLoS ONE 2:e650.

Nabholz B, Glemin S, Galtier N. 2009. The erratic mitochondrial clock: variations of mutation rate, not population size, affect mtDNA diversity across birds and mammals. BMC Evol Biol. 9:54.

Nabholz B, Mauffrey JF, Bazin E, Galtier N, Glemin S. 2008. Determination of mitochondrial genetic diversity in mammals. Genetics 178:351–361.

Neafsey DE, Galinsky K, Jiang RH, Young L, Sykes SM, Saif S, Gujja S, Goldberg JM, Young S, Zeng Q, et al. 2012. The malaria parasite Plasmodium vivax exhibits greater genetic diversity than Plasmodium falciparum. Nat Genet. 44:1046–1050.

Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, Indap A, King KS, Bergmann S, Nelson MR, et al. 2008. Genes mirror geography within Europe. Nature 456:98–101.

Ohta T. 1974. Mutational pressure as main cause of molecular evolution and polymorphism. Nature 252:351–354.

Ohta T. 1992. The nearly neutral theory of molecular evolution. Annu Rev Ecol Syst. 23:263–286.

Pawlowski J, Audic S, Adl S, Bass D, Belbahri L, Berney C, Bowser SS, Cepicka I, Decelle J, Dunthorn M, et al. 2012. CBOL protist working group: barcoding eukaryotic richness beyond the animal, plant, and fungal kingdoms. PLoS Biol. 10:e1001419.

Piganeau G, Eyre-Walker A. 2009. Evidence for variation in the effective population size of animal mitochondrial DNA. PLoS One 4:e4396.

Pool JE, Nielsen R. 2007. Population size changes reshape genomic patterns of diversity. Evolution 61:3001–3006.

Pophaly SD, Tellier A. 2015. Population level purifying selection and gene expression shape subgenome evolution in maize. Mol Biol Evol. 32:3226–3235.

Prescott DM. 1994. The DNA of ciliated protozoa. Microbiol Rev. 58:233–267.

Pritchard AE, Seilhamer JJ, Mahalingam R, Sable CL, Venuti SE, Cummings DJ. 1990. Nucleotide sequence of the mitochondrial genome of Paramecium. Nucleic Acids Res. 18:173–180.

Przybos E. 1975. Genetic studies of Paramecium jenningsi strains (Diller, Earl, 1958). Folia Biol (Krakow). 23:425–471.

Przybos E. 2005. Recent data on the occurrence of species of the Paramecium aurelia complex in Europe. Folia Biol-Krakow. 53:61–63.

Przybos E, Barth D, Berendonk TU. 2008. The Paramecium aurelia species complex, frequency and co-occurrence across Europe. Folia Biol (Krakow). 56:77–81.

Przybos E, Rautian M, Greczek-Stachura M, Potekhin A. 2007. Polymorphism within Paramecium sexaurelia (Ciliophora, Oligohymenophorea) and description of a new stand of the species in China. Folia Biol (Krakow). 55:121–125.

Przybos E, Surmacz M. 2010. New world-wide data on the distribution of species of the Paramecium aurelia complex (Ciliophora, Protozoa). Folia Biol-Krakow. 58:185–188.

R-Core-Team. 2014. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

Redden H, Alper HS. 2015. The development and characterization of synthetic minimal yeast promoters. Nat Commun. 6:7810.

Ritter DI, Dong Z, Guo S, Chuang JH. 2012. Transcriptional enhancers in protein-coding exons of vertebrate developmental genes. PLoS One 7:e35202.

Russell CB, Fraga D, Hinrichsen RD. 1994. Extremely short 20-33 nucleotide introns are the standard length in Paramecium tetraurelia. Nucleic Acids Res. 22:1221–1225.

Salim HMW, Ring KL, Cavalcanti ARO. 2008. Patterns of codon usage in two ciliates that reassign the genetic code: Tetrahymena thermophila and Paramecium tetraurelia. Protist 159:283–298.

Schacherer J, Shapiro JA, Ruderfer DM, Kruglyak L. 2009. Comprehensive polymorphism survey elucidates population structure of Saccharomyces cerevisiae. Nature 458:342–345.

Schattner P, Brooks AN, Lowe TM. 2005. The tRNAscan-SE, snoscan and snoGPS web servers for the detection of tRNAs and snoRNAs. Nucleic Acids Res. 33:W686–W689.

Schrider DR, Kern AD. 2015. Inferring selective constraint from population genomic data suggests recent regulatory turnover in the human brain. Genome Biol Evol. 7:3511–3528.

Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. 2003. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13:2498–2504.

Simonsen KL, Churchill GA, Aquadro CF. 1995. Properties of statistical tests of neutrality for DNA polymorphism data. Genetics 141:413–429.

Smith DR, Lee RW. 2008. Nucleotide diversity in the mitochondrial and nuclear compartments of Chlamydomonas reinhardtii: investigating the origins of genome architecture. BMC Evol Biol. 8:156.

Smith JM, Haigh J. 1974. The hitch-hiking effect of a favourable gene. Genetics Res. 89:391–403.

Snoke MS, Berendonk TU, Barth D, Lynch M. 2006. Large global effective population sizes in Paramecium. Mol Biol Evol. 23:2474–2479.

Sonneborn TM. 1970. Methods in Paramecium research. In: Methods Cell Biol. New York and London: Academic Press. p. 241–339.

Sonneborn TM. 1975. Paramecium aurelia complex of fourteen sibling species. T Am Microsc Soc. 94:155–178.

Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30:1312–1313.

Stoeck T, Welter H, Seitz-Bender D, Kusch J, Schmidt HJ. 2000. ARDRA and RAPD-fingerprinting reject the sibling species concept for the ciliate Paramecium caudatum (Ciliophora, Protoctista). Zool Scr. 29:75–82.

Sun M. 1999. Cleistogamy in Scutellaria indica (Labiatae): effective mating system and population genetic structure. Mol Ecol. 8:1285–1295.

Swart EC, Nowacki M, Shum J, Stiles H, Higgins BP, Doak TG, Schotanus K, Magrini VJ, Minx P, Mardis ER, et al. 2012. The Oxytricha trifallax mitochondrial genome. Genome Biol Evol. 4:136–154.

Tajima F. 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123:585–595.

Tishkoff SA, Reed FA, Ranciaro A, Voight BF, Babbitt CC, Silverman JS, Powell K, Mortensen HM, Hirbo JB, Osman M, et al. 2007. Convergent adaptation of human lactase persistence in Africa and Europe. Nat Genet. 39:31–40.

Tsukii Y, Hiwatashi K. 1983. Genes controlling mating-type specificity in Paramecium caudatum: three loci revealed by intersyngenic crosses. Genetics 104:41–62.


Tsukii Y, Hiwatashi K. 1985. Meiotic nondisjunction and aneuploids in intersyngenic hybrids of Paramecium caudatum. Genetics 111:779–794.

Tuller T, Zur H. 2015. Multiple roles of the coding sequence 5′ end in gene expression regulation. Nucleic Acids Res. 43:13–28.

Voight BF, Kudaravalli S, Wen X, Pritchard JK. 2006. A map of recent positive selection in the human genome. PLoS Biol. 4:e72.

Volkman SK, Sabeti PC, DeCaprio D, Neafsey DE, Schaffner SF, Milner DA Jr, Daily JP, Sarr O, Ndiaye D, Ndir O, et al. 2007. A genome-wide map of diversity in Plasmodium falciparum. Nat Genet. 39:113–119.

Weir W, Capewell P, Foth B, Clucas C, Pountain A, Steketee P, Veitch N, Koffi M, De Meeus T, Kabore J, et al. 2016. Population genomics reveals the origin and asexual evolution of human infective trypanosomes. eLife 5:e11473.

Weisse T. 2008. Distribution and diversity of aquatic protists: an evolutionary and ecological perspective. Biodivers Conserv. 17:243–259.

Wickham H. 2010. ggplot2: elegant graphics for data analysis. New York: Springer-Verlag.

Zhang JZ, Yang JR. 2015. Determinants of the rate of protein sequence evolution. Nat Rev Genet. 16:409–420.


FIG. 4. Average expected heterozygosity (per site) within and flanking introns (left) and intergenic regions (right) in P. tetraurelia (a, b), P. biaurelia (c, d), P. sexaurelia (e, f), P. caudatum subset (g, h), and P. multimicronucleatum subset (i, j), separately for 4-fold degenerate sites (red), 2-fold degenerate sites (blue), 0-fold degenerate sites (green), and non-coding sites (black).


FIG. 5. Patterns of average divergence (per site) within and flanking intergenic regions (a, c) and introns (e, g) are shown in the left panels. Patterns of average expected heterozygosity per site, normalized by average divergence per site, within and flanking intergenic regions (b, d) and introns (f, h) are shown on the right. The divergence values are calculated between P. tetraurelia and P. biaurelia, separately for 4-fold degenerate sites (red), 2-fold degenerate sites (blue), 0-fold degenerate sites (green), and non-coding sites (black).


coding sequence can be explained by a number of different hypotheses. The 5′ ends of CDSs might be required to maintain low thermodynamic stability of mRNA secondary structure near the start codon for efficient translation initiation. Alternatively, CDS 5′ ends might possess fewer tRNA-adapted codons to slow down translation initially, which has been proposed to reduce ribosome collision-induced premature translation terminations and facilitate protein production (see Tuller and Zur 2015). Consistent with this, the 5′ and 3′ ends of protein-coding genes in bacteria (3235codons) have been shown to be important for efficient translation (Kudla et al. 2009; Boel et al. 2016), and their mRNA folding energies can explain observed variation in expression levels (Kudla et al. 2009). In fact, it has been shown previously that the first few codons in proteins of P. tetraurelia and T. thermophila are, on average, composed of significantly different numbers of effective codons than the rest of the protein (Salim et al. 2008). Thus, the role played by the first and last few bases of all coding sequences in translation might contribute to additional sequence constraints, resulting in reduced nucleotide diversity and divergence.

Another factor that might contribute to constraining the CDS ends is the presence of regulatory modules at the beginning and end of genes. Exonic enhancers have been observed in multiple species, including teleosts (Dong et al. 2010; Eichenlaub and Ettwiller 2011), humans (Birnbaum et al. 2012), and other vertebrates (Ritter et al. 2012), where they have been shown to regulate nearby genes. In the highly compact genomes of Paramecium, it is even more likely that regulatory regions extend beyond non-coding regions into the CDS. For instance, in P. caudatum the median length of 5′ intergenic regions is 43 bp, which increases to 56 bp when both genes share the same 5′ intergenic region, suggesting the importance of upstream intergenic length in such compact genomes. Interestingly, in P. caudatum 965 genes (7%) have 5′ intergenic regions shorter than 10 bp and 3417 genes (25%) have 5′ intergenic regions shorter than 20 bp. Similarly, in P. multimicronucleatum the median length of the 5′ intergenic region is 90 bp, which increases to 127 bp when both genes share the same 5′ intergenic region. Again, there are 456 genes (3.5%) with fewer than 10 bp and 1760 genes (13%) with fewer than 20 bp in their upstream intergenic region. In fact, in both species the 5′ intergenic region can be as small as 3 bp (found in 38 genes in P. caudatum and 18 genes in P. multimicronucleatum). An interesting question this raises is what the minimum number of bases required to encode regulatory information about the transcription of genes in eukaryotes might be. A recent study in yeast created a high-strength minimal constitutive promoter that is only 116 bp in length (Redden and Alper 2015). It might be possible to observe even smaller promoter regions driving expression in Paramecium, especially in P. caudatum.

Spatial Patterns of Nucleotide Diversity across Non-Coding Regions are Not Driven by Mutation Alone

Although we have assumed that the diversity-normalized divergence patterns indicate the effect of negative selection, it is possible that the observed spatial patterns of nucleotide diversity across and flanking non-coding regions are driven entirely by mutation. One way to distinguish between mutation and direct selection as the cause of these patterns is to evaluate the allele frequencies of the variants observed at different site types. If the reduction in nucleotide diversity near the start and end of coding sequences is mostly a consequence of negative selection, we would expect a higher fraction of rare alleles at 4-fold degenerate sites under stronger constraints relative to that seen at 4-fold degenerate sites not under constraints. If instead mutation is the predominant driver of the extent of nucleotide diversity at these sites, then there should be no difference in the fraction of rare alleles between the 4-fold degenerate sites under stronger negative selection and those not under constraints. Because Tajima's D (Tajima 1989) is a good statistic to summarize such a shift in rare alleles, we compared the average Tajima's D across site categories to evaluate this.
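The logic of using Tajima's D to detect an excess of rare alleles can be illustrated with a minimal sketch. The function below follows the standard definition in Tajima (1989); the function name, the toy alignments, and the choice to return None when the statistic is undefined are assumptions made for illustration and are not the pipeline used in this study.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(sequences):
    """Return Tajima's D for equal-length aligned sequences, or None if undefined."""
    n = len(sequences)
    if n < 4:
        return None
    # Segregating sites: alignment columns carrying more than one allele.
    s = sum(1 for column in zip(*sequences) if len(set(column)) > 1)
    if s == 0:
        return None
    # Mean number of pairwise differences (the diversity-based estimator).
    pairs = list(combinations(sequences, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    # Constants from Tajima (1989).
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


# Toy alignments (hypothetical): singletons only versus balanced alleles.
rare = ["AAAAA", "TAAAA", "ATAAA", "AATAA", "AAATA", "AAAAT"]
common = ["AAAA", "AAAA", "AAAA", "TTTT", "TTTT", "TTTT"]
d_rare = tajimas_d(rare)
d_common = tajimas_d(common)
```

In this sketch an alignment dominated by singletons gives a negative D, whereas alleles at intermediate frequency give a positive D, which is the contrast the comparison between site categories relies on.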

In each Paramecium species, we first classified sites genome-wide into five categories: (1) splice sites (first and last 3 sites) in introns, (2) intergenic sites within 50 bp of start/stop sites, (3) intergenic sites at least 150 bp away from start/stop sites (which should exclude most UTR regions), (4) 4-fold degenerate sites within 50 bp of start/stop sites, and (5) 4-fold degenerate sites at least 150 bp away from start/stop sites and 30 bp away from splice sites. We expected category 5 sites to be the most neutral. Interestingly, in most species, Tajima's D at splice sites, intergenic sites close to start/stop sites, and 4-fold degenerate sites close to start/stop sites is significantly smaller than that of 4-fold degenerate and intergenic sites far away from any functional elements (see supplementary table S7, Supplementary Material online, for P-values). This suggests that sites in categories 1, 2, and 4 are under relatively stronger purifying selection. Category 2 sites seem to possess significantly more rare alleles than category 3 sites consistently in all species, suggesting that intergenic regions very close to start/stop sites are under stronger purifying selection than intergenic sites further away from gene ends. We also find Tajima's D at intergenic sites (category 3) to be significantly higher than that for 4-fold degenerate sites (category 5) in most species. This could suggest either that intergenic sites are under some form of positive/balancing selection or that silent sites are not completely neutral and in fact experience some purifying selection.
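The five site categories can be expressed as a small classification rule. The sketch below is only one possible reading of the definitions above: the labels "intron", "intergenic", and "fourfold", the argument names, and the interpretation of "30 bp away from splice sites" as "at least 30 bp" are assumptions introduced for illustration.

```python
def site_category(kind, dist_gene_end=None, dist_splice=None, intron_offset=None):
    """Return the category (1-5) described in the text, or None if none applies.

    kind          -- "intron", "intergenic", or "fourfold" (hypothetical labels)
    dist_gene_end -- distance in bp to the nearest start/stop site
    dist_splice   -- distance in bp to the nearest splice site (None if no intron)
    intron_offset -- for intronic sites, 1-based distance from the nearer intron end
    """
    if kind == "intron":
        # Category 1: first and last 3 sites of an intron.
        if intron_offset is not None and intron_offset <= 3:
            return 1
        return None
    if kind == "intergenic":
        if dist_gene_end <= 50:
            return 2
        if dist_gene_end >= 150:
            return 3
        return None
    if kind == "fourfold":
        if dist_gene_end <= 50:
            return 4
        far_from_splice = dist_splice is None or dist_splice >= 30
        if dist_gene_end >= 150 and far_from_splice:
            return 5
        return None
    return None
```

Sites that fall between the thresholds (for example, intergenic sites 51 to 149 bp from a gene end) are left unclassified in this sketch, mirroring the gap between categories 2 and 3 in the text.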

Gene-Based Analysis of Patterns of Synonymous and Non-Synonymous Diversity

Selective constraint can vary among genes within the same species, and it could also be species-specific. To examine such variation in our dataset, we evaluated the distribution of πn/πs across all genes (fig. 6a). If we assume that synonymous polymorphisms are fitness-neutral, then πn/πs is a measure of the relative functional constraints on non-synonymous polymorphisms and thus an indication of functional constraints on the protein. The majority of genes have πn/πs < 1.0, consistent with constraints on non-synonymous changes. About 98% of genes in P. tetraurelia, 96% in P. biaurelia, 96% in P. sexaurelia, 99% in both P. caudatum and its subset, and 99% in both P. multimicronucleatum and its subset have πn/πs < 1. Upon comparing the average πn/πs for the same set of orthologous genes across all species, we find the mean to be 0.10 for P. tetraurelia, 0.17 for P. biaurelia, 0.24 for P. sexaurelia, 0.12 [0.12] for P. caudatum [subset], and 0.17 [0.14] for P. multimicronucleatum [subset]. Among these species, P. sexaurelia has the highest average πn/πs, which suggests either relaxed selection because of a relatively small effective population size or more local adaptation. Local adaptation might result in fixation of more non-synonymous than synonymous differences between subpopulations, thus increasing πn/πs. Differences in πn/πs between species could also reflect genome-wide variation in the fraction of gene duplicates that are subfunctionalized, neofunctionalized, or nonfunctionalized. Alternatively, an elevation in average levels of πn/πs could be a consequence of a depression in the average levels of πs, which could be caused by stronger selection on synonymous sites or a recent bottleneck (Brandvain and Wright 2016) experienced by the species. Evidence that genes with higher πn/πs ratios are under more relaxed constraints is provided by a comparison between the average number of frame-shift-causing indels occurring in four sets of genes with (1) πn/πs ≤ 0.25, (2) 0.25 < πn/πs ≤ 0.5, (3) 0.5 < πn/πs ≤ 1.0, and (4) πn/πs > 1.0. As expected, we find an increase in the number of indels per site with increasing πn/πs (fig. 6b).
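The binning of genes by πn/πs and the per-bin indel comparison can be sketched as follows. The dictionary keys, the function names, and the choice to pool indels and sites within a bin (rather than averaging per-gene rates) are assumptions for illustration; the minimum of more than five synonymous polymorphisms follows the figure legends.

```python
def pn_ps_bin(pi_n, pi_s, n_syn_snps, min_syn_snps=5):
    """Return the bin (1-4) for a gene, or None if the gene is excluded."""
    # Genes need more than `min_syn_snps` synonymous polymorphisms.
    if n_syn_snps <= min_syn_snps or pi_s <= 0:
        return None
    ratio = pi_n / pi_s
    if ratio <= 0.25:
        return 1
    if ratio <= 0.5:
        return 2
    if ratio <= 1.0:
        return 3
    return 4


def indels_per_site_by_bin(genes):
    """Pool frame-shift indels and gene lengths within each bin.

    `genes` is an iterable of dicts with the hypothetical keys
    pi_n, pi_s, n_syn_snps, frameshift_indels, and length.
    """
    totals = {}
    for gene in genes:
        b = pn_ps_bin(gene["pi_n"], gene["pi_s"], gene["n_syn_snps"])
        if b is None:
            continue
        indels, sites = totals.get(b, (0, 0))
        totals[b] = (indels + gene["frameshift_indels"], sites + gene["length"])
    return {b: indels / sites for b, (indels, sites) in sorted(totals.items())}


# Hypothetical toy genes, not data from this study.
toy_genes = [
    {"pi_n": 0.001, "pi_s": 0.010, "n_syn_snps": 12, "frameshift_indels": 0, "length": 1000},
    {"pi_n": 0.003, "pi_s": 0.010, "n_syn_snps": 9, "frameshift_indels": 1, "length": 2000},
    {"pi_n": 0.012, "pi_s": 0.010, "n_syn_snps": 8, "frameshift_indels": 3, "length": 1500},
    {"pi_n": 0.020, "pi_s": 0.010, "n_syn_snps": 2, "frameshift_indels": 5, "length": 900},
]
rates = indels_per_site_by_bin(toy_genes)
```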

Paralogs and Orthologs Evolve under Similar Selective Constraints within and across Species

We evaluated the correlation of πn/πs across orthologs and across within-species paralogs created by the most recent WGD (fig. 7). Both paralogs and orthologs show a significant positive correlation within and across all species, suggesting that most WGD duplicates are under similar selective constraints. We also confirmed that these correlations are not driven simply by selection on synonymous sites, as πs is not necessarily correlated between orthologs and within-species paralogs across all species (supplementary fig. S8, Supplementary Material online). Similar findings in the maize genome also suggested correlations in nucleotide diversity and divergence between WGD paralogs (Pophaly and Tellier 2015). Our results further suggest that most orthologs are under very similar evolutionary constraints across the Paramecium species. This is in agreement with a recent study in Drosophila, where πn and πs were found to be strongly correlated between D. melanogaster and D. simulans (Langley et al. 2012). Such a correlation was also observed between orthologs in S. cerevisiae and S. paradoxus (Elyashiv et al. 2010).

FIG. 6. (A) πn versus πs of each gene. The red line represents πn = πs. (B) The distribution of πn/πs of each gene across the genome. Only genes with more than five synonymous polymorphisms were used in this analysis. (C) Number of frame-shift-causing indels per site in genes belonging to three gene categories: (1) πn/πs ≤ 0.5 (light grey), (2) 0.5 < πn/πs ≤ 1.0 (grey), and (3) πn/πs > 1.0 (black). Only genes with more than five synonymous polymorphisms were used in this analysis.

A possible mechanism governing the degree of selective constraint acting on proteins is their level of expression (Zhang and Yang 2015). Non-synonymous mutations are thought to have more severe fitness consequences in highly expressed genes than in weakly expressed ones (Drummond et al. 2005; Gout et al. 2010). Indeed, we find a significant negative correlation between expression and πn/πs in all species (fig. 8). However, only 6–13% of the variance is explained by expression, implying that additional factors may play a greater role in determining the degree of selective constraint acting on a gene. Simple biochemical properties of proteins, such as the protein's function, tertiary structure, and number of interacting partners, might determine the proportion of functionally important residues in the protein and, in turn, govern the degree of selection.
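The statement that expression explains only a small fraction of the variance corresponds to the squared correlation coefficient. A minimal sketch is given below; the log10 transformation of both variables and the function names are assumptions for illustration, and the toy values are constructed to lie on a perfect line rather than taken from the data.

```python
from math import log10, sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def variance_explained(expression, pn_ps):
    """Return (r, r squared) for log-transformed expression versus pn/ps."""
    # Genes with zero expression or zero pn/ps cannot be log-transformed.
    pairs = [(log10(e), log10(r)) for e, r in zip(expression, pn_ps) if e > 0 and r > 0]
    xs, ys = zip(*pairs)
    r = pearson_r(xs, ys)
    return r, r * r


# Hypothetical toy values in which pn/ps halves with every 10-fold rise in expression.
r, r_squared = variance_explained([1, 10, 100, 1000], [0.8, 0.4, 0.2, 0.1])
```

With real data, an r between roughly -0.25 and -0.36 would correspond to the 6–13% of variance reported above.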

Our observation that orthologous genes across such deeply diverged species experience similar degrees of selective constraint (fig. 7) might imply a low rate of turnover of expression, protein structure, function, and number of interacting partners. This finding may also be due in part to similar ecologies, and therefore similar selective pressures, experienced by the Paramecium species examined here. It may thus be informative to compare closely related species that live in, or have recently experienced, different environments. For instance, Elyashiv et al. (2010) found significant differences in levels of gene-specific purifying selection (πn/πs) in the sake population of S. cerevisiae in comparison to other populations of S. cerevisiae as well as S. paradoxus. In fact, this deviation in selective constraints experienced by the sake population is complemented by differences in response to different stresses (Kvitek et al. 2008), suggesting that changes in environment might be reflected in gene-specific constraints.

FIG. 7. Correlation of the logarithm of πn/πs for paralogs within species (three diagonal plots for P. tetraurelia, P. biaurelia, and P. sexaurelia) and for orthologs between species pairwise (non-diagonal plots, with the plots below the diagonal [outlined red] representing the first set of WGD paralogs [T1, B1, S1] and the upper half of the matrix [outlined blue] representing the second set of paralogs [T2, B2, S2]). The number in the top left corner of each plot is the slope of the linear regression in cases where the regression is significant (P < 0.01). Only genes with more than five synonymous polymorphisms were used in this analysis. The schematic tree in the bottom right corner shows the phylogenetic relationship between the species P. tetraurelia (T), P. biaurelia (B), P. sexaurelia (S), P. caudatum (C), and P. multimicronucleatum (M).

FIG. 8. Negative correlation of πn/πs with expression levels of genes in four Paramecium species. All correlations are significant with P < 2.2 × 10⁻¹⁶. Only genes with more than five synonymous polymorphisms were used in this analysis.

Distribution of Tajima's D across the Genome

In contrast to genes under purifying selection, genes under balancing or positive selection might often be species-specific. We tested this hypothesis using Tajima's D, a statistic that uses the difference between nucleotide diversity and the number of segregating sites to detect deviations from neutrality. Because Tajima's D is sensitive to population size changes and population structure (Simonsen et al. 1995), we used an outlier approach. We calculated Tajima's D for non-overlapping 1-kb windows across the genome (supplementary fig. S9, Supplementary Material online) and looked for candidate genes in the outlier regions (0.25% of the two tails). The outlier regions with negative values of Tajima's D might represent genes under either weak purifying or positive selection (i.e., experiencing a recent sweep), whereas those with positive Tajima's D might be under balancing selection. Interestingly, both sets of outliers show an over-representation of genes involved in glycosylation, glycoprotein production and metabolism, and carbohydrate metabolism (supplementary table S8, Supplementary Material online) for all Paramecium species. For each species, the large functional overlap of genes in both tails suggests an important role of these processes in Paramecium growth. More importantly, it further suggests similar genome-wide selective pressures acting on all Paramecium species examined in this study. These conclusions can, however, be biased by gene length, as shorter genes might not provide sufficient power for an accurate calculation of Tajima's D. Indeed, the outliers are significantly longer than the rest of the genes in P. tetraurelia (outliers: 1567 bp; non-outliers: 1321 bp; P = 4.7 × 10⁻¹³), P. biaurelia (outliers: 1681 bp; non-outliers: 1372 bp; P < 2.2 × 10⁻¹⁶), and P. sexaurelia (outliers: 1860 bp; non-outliers: 1379 bp; P < 2.2 × 10⁻¹⁶). This bias is, however, much lower or not significant for P. caudatum (outliers: 1463 bp; non-outliers: 1383 bp; P = 7.3 × 10⁻³) and P. multimicronucleatum (outliers: 1509 bp; non-outliers: 1445 bp; P = 0.19), possibly because of higher genome-wide nucleotide diversity, suggesting that, despite some bias, the results are biologically meaningful.
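The outlier approach on non-overlapping windows can be sketched as a simple ranking. In the sketch below, the function names and the dictionary layout (window start mapped to Tajima's D) are assumptions for illustration, and the tail fraction is left as a parameter so that whichever cutoff is intended can be supplied.

```python
def window_start(position, window_size=1000):
    """Start coordinate of the non-overlapping window containing `position`."""
    return (position // window_size) * window_size


def tail_outliers(d_by_window, tail_fraction=0.0025):
    """Return (lower_tail, upper_tail) lists of window keys ranked by Tajima's D."""
    ranked = sorted(d_by_window, key=d_by_window.get)
    k = int(len(ranked) * tail_fraction)
    if k == 0:
        # Too few windows for the requested tail fraction.
        return [], []
    return ranked[:k], ranked[-k:]


# Hypothetical toy input: 1000 windows whose D values increase along the scaffold.
toy = {i * 1000: (i - 500) / 250 for i in range(1000)}
lower, upper = tail_outliers(toy)
```

Genes overlapping the windows in `lower` would be candidates for weak purifying or positive selection, and those in `upper` candidates for balancing selection, following the interpretation given above.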

Tajima's D, as a summary statistic of the site-frequency spectrum, can also indicate overall patterns of selection or recent changes in demography. In most species, the genome-wide distribution of Tajima's D is shifted towards positive values, with means of 0.862 for P. tetraurelia, 0.504 for P. biaurelia, 0.274 for P. sexaurelia, -0.148 [0.459] for P. caudatum [subset], and 1.325 [0.855] for P. multimicronucleatum [subset]. This could suggest pervasive local adaptation (or balancing selection), a recent contraction in population size, or a fairly ancient population subdivision in all Paramecium species. A previous study that sampled variation at 10 nuclear loci found that strains from different continents shared the same alleles and interpreted that as evidence of regular global dispersal of Paramecium from relatively few permanent locations (Catania et al. 2009). This could also be a potential explanation for the observed excess of nucleotide diversity relative to the number of segregating sites.

Fraction of Heterozygous Sites across the Genome

On average, about 0.1% of sites in an individual were heterozygous within the P. aurelia species, whereas 0.7% of sites in individuals belonging to P. caudatum and 1.1% of sites in isolates of P. multimicronucleatum were found to be heterozygous (supplementary table S1, Supplementary Material online). This is consistent with the observation that the P. aurelia species undergo autogamy, whereas the outgroup species do not. On evaluating the fraction of heterozygous sites in windows of 10 kb across the genome, we find large blocks of homozygosity in the P. aurelia genomes (supplementary fig. S10a–e, Supplementary Material online), consistent with loss of heterozygosity in the lab, possibly through multiple episodes of autogamy. It should be noted, however, that peaks of heterozygosity in the P. aurelia genomes are highly correlated between isolates, unlike in P. caudatum and P. multimicronucleatum, and might reflect mapping or assembly errors. These heterozygous regions (summarized in supplementary Files S1–S3, Supplementary Material online) comprise a very small part (0.82–0.97%) of the genome and, most importantly, their exclusion does not affect any of our analyses qualitatively.
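The windowed heterozygosity scan can be sketched as below. The input format (pairs of position and a heterozygous flag for every genotyped site of one individual) and the function name are assumptions for illustration; only the 10-kb window size comes from the text.

```python
from collections import defaultdict


def het_fraction_by_window(genotyped_sites, window_size=10_000):
    """Fraction of heterozygous sites per non-overlapping window.

    `genotyped_sites` is an iterable of (position, is_heterozygous) pairs
    for one individual on one scaffold (hypothetical input format).
    """
    called = defaultdict(int)
    het = defaultdict(int)
    for position, is_het in genotyped_sites:
        window = position // window_size
        called[window] += 1
        if is_het:
            het[window] += 1
    # Windows without any genotyped site are simply absent from the result.
    return {window: het[window] / called[window] for window in sorted(called)}


# Hypothetical toy genotypes: one heterozygous site in the first window only.
toy_sites = [(10, False), (20, True), (9999, False), (10000, False), (15000, False)]
fractions = het_fraction_by_window(toy_sites)
```

Long runs of windows with a fraction of zero would correspond to the blocks of homozygosity described above, and windows that are elevated in every isolate would be candidates for mapping or assembly artifacts.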

Variation across the Mitochondrial Genome

Few ciliate mitochondrial genomes have been sequenced until now, most belonging to the genus Tetrahymena (Brunk et al. 2003; Burger et al. 2000; Moradian et al. 2007). Others include Paramecium tetraurelia (Pritchard et al. 1990), Paramecium caudatum (Barth and Berendonk 2011), and Oxytricha trifallax (Swart et al. 2012). These studies have illustrated that ciliate mitochondrial genomes are linear and range between 40 and 47 kb, with 50 intronless genes. However, levels of sequence diversity across the entire mitochondrial genome have not been characterized.

To obtain mitochondrial genomes for all species included in this study, we carried out de novo genome assembly using the whole-genome resequencing data for each strain and extracted mitochondrial contigs using the published Paramecium mitochondrial genomes as references (see Methods for more details). Using this approach, we derived complete genome assemblies for multiple strains, including from species without a previously published mitochondrial genome (P. biaurelia, P. sexaurelia, and P. multimicronucleatum). A comprehensive analysis of these sequences will be published elsewhere (Johri et al., in preparation). The mitochondrial genome assembly from each isolate was annotated separately, and the eight genes that were most ubiquitously present in all individuals were concatenated and used to construct a phylogeny (see Methods). The mitochondrial phylogeny closely tracked the one based on nuclear sequences (supplementary fig. S11, Supplementary Material online).


One strain from each species was chosen as a reference genome (the genome with the longest assembly), and reads from other individuals were mapped to the reference to call SNPs. During the original DNA preparation, efforts were made to remove non-MAC DNA, resulting in varying amounts of mitochondrial DNA present in the extracts from each isolate. Hence, some individuals mapped with extremely low coverage (4× to zero read depth) to the reference genome, especially after applying stringent filters to call SNPs (supplementary table S9, Supplementary Material online). Thus, some individuals were excluded from this analysis, including all P. biaurelia isolates. The subsequent analysis was therefore restricted to four species.

All of the mitochondrial genomes are about 40 kb in length. We identified 626 SNPs in P. tetraurelia, 5307 in P. sexaurelia, 4771 [3041] in P. caudatum [subset], and 4600 [1478] in P. multimicronucleatum [subset]. Within all species, very few indels were found: just 1 in P. tetraurelia, 9 in P. sexaurelia, and 15 [13] and 23 [9] in P. caudatum [subset] and P. multimicronucleatum [subset], respectively. The nucleotide diversity at 4-fold degenerate sites was found to be 0.009 in P. tetraurelia, 0.169 in P. sexaurelia, 0.127 [0.093] in P. caudatum [subset], and 0.172 [0.073] in P. multimicronucleatum [subset]. Nucleotide diversity across different site types shows that rRNAs are the most conserved sequences in the ciliate mitochondrial genomes. Transfer RNAs are the next most highly conserved, with their diversity levels only slightly above those of replacement sites (table 2). Our estimates of diversity in the Paramecium mitochondria are higher than those of a number of other unicellular eukaryotes studied so far: P. falciparum (πsyn = 0.0012) (Joy et al. 2003) and Chlamydomonas reinhardtii (πsilent = 0.0085) (Smith and Lee 2008).

We also recalculated nuclear nucleotide diversity (table 2) at 4-fold degenerate sites for the same set of individuals used in the mitochondrial analysis and observed that the mitochondrial to nuclear diversity ratio is close to 1.0 in all species (table 2) except P. sexaurelia, where the mitochondrial diversity is 5.7-fold higher than that in the nucleus. On excluding P. sexaurelia, we obtain a strong correlation between 4-fold degenerate nucleotide diversity in the mitochondria and nucleus (slope = 0.95, R² = 0.96). It should be noted that there are not enough data points to make any inferences about statistical significance. However, such a positive correlation between nuclear and mitochondrial diversity suggests that mitochondrial diversity might scale with population size in ciliates, which bears on previous studies in animals that have yielded conflicting findings about whether mitochondrial diversity correlates with population size (Bazin et al. 2006; Nabholz et al. 2008; Nabholz et al. 2009; Piganeau and Eyre-Walker 2009). The observed correlation also suggests that mitochondrial and nuclear spontaneous mutation rates may be correlated. Direct estimates of mutation rates in C. elegans, D. melanogaster, S. cerevisiae, and H. sapiens obtained via mutation accumulation experiments suggest a moderate correlation (slope = 0.05, R² = 0.48) between mitochondrial and nuclear base substitution rates (using data from Lynch et al. 2008). Discordance between the nuclear and mitochondrial diversity, for instance in P. sexaurelia, can be attributed to various factors such as positive selection (Hudson and Turelli 2003; Galtier et al. 2009; James et al. 2016), demographic events (Pool and Nielsen 2007), or simply differences in nuclear and/or mitochondrial mutation rates (Lynch 2007). Future studies are needed to dissect the factors responsible for the remarkably different ratio of mitochondrial to nuclear diversity in P. sexaurelia.

Table 2. Nucleotide Diversity (per site) of Different Sites across the Mitochondrial Genomes.

Species                             4-fold (πm) [SE]   2-fold [SE]       0-fold [SE]        tRNA [SE]         rRNA [SE]         Intergenic [SE]   4-fold nuclear (πn) [SE]   πm/πn
P. tetraurelia                      0.0091 [0.0010]    0.0107 [0.0009]   0.0019 [0.0002]    0.0000 [0.0000]   0.0006 [0.0002]   0.0041 [0.0003]   0.0058 [2.26 × 10⁻⁵]       1.57
P. sexaurelia                       0.1692 [0.0033]    0.1427 [0.0024]   0.0097 [0.00043]   0.0000 [0.0000]   0.0059 [0.0007]   0.0417 [0.0027]   0.02945 [4.73 × 10⁻⁵]      5.75
P. caudatum                         0.1272 [0.0037]    0.0942 [0.0019]   0.0179 [0.0006]    0.0455 [0.0023]   0.0136 [0.0012]   0.0333 [0.0027]   0.1267 [1.08 × 10⁻⁴]       1.00
P. caudatum (subset)                0.0929 [0.0039]    0.0859 [0.0021]   0.0149 [0.0006]    0.0375 [0.0023]   0.0093 [0.0011]   0.0292 [0.0028]   0.0658 [9.63 × 10⁻⁵]       1.41
P. multimicronucleatum              0.1727 [0.0047]    0.0831 [0.0019]   0.0227 [0.0008]    0.0263 [0.0011]   0.0269 [0.0016]   0.0329 [0.0022]   0.1599 [1.34 × 10⁻⁴]       1.08
P. multimicronucleatum (subset)     0.0730 [0.0034]    0.0383 [0.0015]   0.0121 [0.0006]    0.0126 [0.0009]   0.0135 [0.0012]   0.0312 [0.0022]   0.0742 [9.56 × 10⁻⁵]       0.98
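The mitochondrial to nuclear diversity ratios discussed above follow directly from the 4-fold degenerate diversity values in table 2. The short sketch below recomputes them; the dictionary and function names are introduced for illustration, and the input values are those read from table 2.

```python
# (mitochondrial, nuclear) diversity at 4-fold degenerate sites, as read from table 2.
FOURFOLD_DIVERSITY = {
    "P. tetraurelia": (0.0091, 0.0058),
    "P. sexaurelia": (0.1692, 0.02945),
    "P. caudatum": (0.1272, 0.1267),
    "P. caudatum (subset)": (0.0929, 0.0658),
    "P. multimicronucleatum": (0.1727, 0.1599),
    "P. multimicronucleatum (subset)": (0.0730, 0.0742),
}


def mito_nuclear_ratio(pi_mito, pi_nuclear):
    """Ratio of mitochondrial to nuclear nucleotide diversity."""
    return pi_mito / pi_nuclear


ratios = {
    species: round(mito_nuclear_ratio(pi_m, pi_n), 2)
    for species, (pi_m, pi_n) in FOURFOLD_DIVERSITY.items()
}
# The species with the largest ratio stands out from the rest.
most_discordant = max(ratios, key=ratios.get)
```

Recomputing the ratios this way is also a quick consistency check on the table: each value in the last column should equal the first column divided by the nuclear column, up to rounding.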

Conclusions

This is the first study in which multiple individuals of several species of free-living unicellular eukaryotes have been sequenced and used to reveal genome-wide patterns of variation. This study thus provides a valuable resource for future research in Paramecium, a major model organism in cell biology. Sequencing of multiple individuals from the same lake or pond is needed to further resolve the global versus local population structure of these cosmopolitan unicellular species. Overall, our results suggest that most of the intergenic regions across the highly compact genomes of Paramecium species are strongly constrained and probably harbor functionally important regulatory modules, which might extend into coding regions. We show that patterns of nucleotide diversity across the genome are driven primarily by a combination of purifying selection and mutation rate variation. In the future, such analyses will benefit greatly from incorporating variation in recombination rates across the genomic landscape. We also find that duplicates created by the WGD appear to be evolving under similar constraints. Synonymous to non-synonymous diversity ratios need to be supplemented with divergence levels to distinguish between the various types of selective forces acting on the duplicates. Thus, sequencing of much closer outgroup species would allow us to further understand the role of selection in the evolution of different genomic elements, including gene duplicates that resulted from ancient WGDs.

Methods

Sampling, DNA Extraction, and Sequencing

We examined a total of 56 Paramecium strains originally isolated from environmental sources, including 11 isolates of P. biaurelia, 13 isolates of P. sexaurelia, 12 isolates of P. tetraurelia, 10 isolates of P. caudatum, and 10 isolates of P. multimicronucleatum. A total of 14 strains from P. biaurelia and P. sexaurelia were kindly provided by E. Przybos, whereas the remaining strains were cultured in the Paramecium collections of the Institute of Hydrobiology, TU Dresden, Germany, or the Department of Biology, Indiana University, Bloomington, USA. Prior to our analyses, P. tetraurelia strain Moz 13CXI as well as P. sexaurelia strains Moz 13BIII and Indo 17I were kindly identified by E. Przybos by performing inter- and intrastrain crosses (Przybos 1975). Paramecia were maintained in 0.25% Cerophyl medium (modified according to Sonneborn 1970; Krenek et al. 2011) inoculated with Enterobacter aerogenes. Since P. aurelia species can undergo autogamy (self-fertilization) induced by starvation, and all strains have been cultured under lab conditions for several years, all P. aurelia strains in this study are expected to be more homozygous than wild isolates.

Prior to DNA extraction, all cultures were initiated from a single cell, ensuring clonality. A modified sucrose density separation (Aury et al. 2006) was performed to extract primarily macronuclei (MACs) and to reduce the amount of non-MAC DNA potentially compromising subsequent analyses. Briefly, Paramecium cells were starved for at least two days and filtered through several layers of sterile cheesecloth prior to cell harvesting via gentle centrifugation. Pelleted cells were gently lysed in 0.25 M sucrose medium (pH 6.8) containing 10 mM Tris-HCl, 8 mM MgCl2, 3 mM CaCl2, 0.5–1% Nonidet-P40, and 0.1–0.2% sodium deoxycholate. Macronuclei were subsequently isolated away from cellular debris, micronuclei, and mitochondria by centrifugation at 100 rcf in 0.25 M sucrose wash buffer (see above, w/o Nonidet-P40 and Na-deoxycholate). Genomic DNA extraction from MAC isolations was performed by a modified CTAB procedure (Gustincich et al. 1991).

All genomes were sequenced using the Illumina HiSeq 2000, and paired-end reads with insert sizes between 200 and 400 bp and a mean length of 151 bp were obtained. The reads mapped with coverage ranging between 40× and 80× (details given in supplementary table S1, Supplementary Material online).

Paramecium Reference Genomes

Complete genomes of P. tetraurelia, P. biaurelia, P. sexaurelia, and P. caudatum were downloaded from ParameciumDB (Arnaiz et al. 2007; Arnaiz and Sperling 2011) along with the annotations. Ortho-paralog relationships as well as functional annotations were retrieved from the supplementary information provided in McGrath et al. (2014a, 2014b). RNAseq data for all four species were downloaded and processed as in Gout and Lynch (2015).

The genome of P. multimicronucleatum was assembled using Celera (Miller et al. 2008) and annotated using EuGene (Foissac et al. 2008), as previously described (McGrath et al. 2014a, 2014b). A detailed description of the sequencing, assembly, and annotation of P. multimicronucleatum can be found in Marinov et al. (in preparation). OMA (Orthology MAtrix) (Altenhoff et al. 2015) was used to find orthologs between P. caudatum and P. multimicronucleatum. A total of 11,002 genes of P. caudatum were found to have 1-1 orthologs in P. multimicronucleatum, and only 1-1 orthologs were used for the current analysis.

Mapping of Reads and Detection of SNPs and Indels
Sequenced paired-end reads were mapped to the reference genomes for each species using BWA (bwa mem) (Li and Durbin 2010) with default parameters, and the mpileup function in Samtools (Li et al. 2009) was used to generate mpileup files for each individual. Two distinct pipelines were employed to identify SNPs to ensure detection accuracy: Bcftools-Vcftools (Li 2011) and the maximum likelihood Genotype Frequency Estimator (referred to as GFE) (Lynch 2009; Maruki and Lynch 2015). Both pipelines produced highly overlapping sets of SNPs (supplementary table S2, Supplementary Material online). For both pipelines, sites were only considered for SNP detection and further analysis if the mapping quality was above 30, base quality was above 20, per-base alignment quality was above 15, and the sum of the depth of coverage for all individuals was above 5 times the number of individuals and less than twice the average population coverage. SNPs were called by bcftools if the SNP quality score was above 30 and by GFE if the significance score was > 5.91 (P < 0.05). The exact parameters of the Bcftools-Vcftools pipeline to call SNPs and indels are given below.

samtools mpileup -q 30 -Q 20 -t DP -uvf reference_genome.fasta sorted.bam > species.vcf

bcftools call -f GQ -o species_snps.vcf -O v -V indels -cv species.vcf

bcftools call -f GQ -o species_indels.vcf -O v -V snps -cv species.vcf

# filtering SNP and indel files for depth and biases
perl vcfutils.pl varFilter -d 50 -D 800 -1 -2 -3 -4 species_snps.vcf > species_snps.filtered1.vcf

perl vcfutils.pl varFilter -d 50 -D 800 -1 -2 -3 -4 species_indels.vcf > species_indels.filtered1.vcf

vcftools --vcf species_snps.filtered1.vcf --minQ 20 --minGQ 30 --minDP 4 --recode --recode-INFO-all --out species_snps.filtered2

vcftools --vcf species_indels.filtered1.vcf --minQ 20 --minGQ 30 --minDP 4 --recode --recode-INFO-all --out species_indels.filtered2

# filter out the recoded sites that are no longer polymorphic, that is, AF1 = 1
bcftools view -q 0.001 species_snps.filtered2.recode.vcf -o species_snps.final.vcf

bcftools view -q 0.001 species_indels.filtered2.recode.vcf -o species_indels.final.vcf

Mitochondrial SNPs were called using the Bcftools-Vcftools pipeline with the same parameters as noted above.
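The site-level depth criterion used by both pipelines can be illustrated with a minimal Python sketch. This is not the authors' code: the function name passes_depth_filter and its arguments are hypothetical, and the sketch assumes that the lower bound is a minimum (summed depth greater than 5 reads per individual) and that the upper bound is twice the average summed coverage of the population.

```python
def passes_depth_filter(depths, mean_population_depth, min_per_individual=5):
    """Return True if the summed depth at one site lies inside the accepted range.

    depths: per-individual read depths at the site (hypothetical input format).
    mean_population_depth: average summed depth across all sites (assumed given).
    """
    total = sum(depths)
    lower = min_per_individual * len(depths)   # 5 x number of individuals
    upper = 2 * mean_population_depth          # twice the average population coverage
    return lower < total < upper
```

Sites with unusually high summed depth are excluded because they are more likely to reflect collapsed repeats or paralogous mapping than true single-copy sequence.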

Simulation of SNP Detection with Varying Levels of Divergence from the Reference
We simulated reads using wgsim (https://github.com/lh3/wgsim) for a diploid genome with varying levels of divergence from the reference genomes of P. caudatum and P. multimicronucleatum, using average values of read length, total number of reads, and insert sizes.

For the P. caudatum reference genome:
wgsim -d 287 -s 113 -N 10201087 -1 151 -2 151 -r 0.10 -R 0.013 caudatum_assembly.fasta caud_0.10_1.fastq caud_0.10_2.fastq >& caud_0.10.info

For the P. multimicronucleatum reference genome:
wgsim -d 261 -s 102 -N 12154044 -1 151 -2 151 -r 0.10 -R 0.011 multimicronucleatum_assembly.fasta multi_0.10_1.fastq multi_0.10_2.fastq >& multi_0.10.info

Reads were simulated for r = 0.0, 0.05, 0.10, 0.15, 0.20, 0.25, and 0.30, where r represents the fraction of the genome with SNPs. The SNP detection pipeline described above was used to call SNPs, which were compared to the simulated set of SNPs (supplementary table S4, Supplementary Material online).
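One way to summarise such a comparison between simulated and called SNPs is sketched below in Python. The function snp_call_accuracy and the (scaffold, position) tuple representation are assumptions made for illustration; the source does not state which summary statistics were reported in supplementary table S4.

```python
def snp_call_accuracy(simulated, called):
    """Compare two collections of SNP sites given as (scaffold, position) tuples.

    Returns (sensitivity, false_discovery_rate); both names are illustrative.
    """
    simulated, called = set(simulated), set(called)
    true_pos = len(simulated & called)
    # Fraction of simulated SNPs that the pipeline recovered.
    sensitivity = true_pos / len(simulated) if simulated else float("nan")
    # Fraction of called SNPs that were never simulated.
    false_discovery = (len(called) - true_pos) / len(called) if called else float("nan")
    return sensitivity, false_discovery
```

Tracking both quantities as r increases would show whether higher divergence from the reference mainly costs sensitivity (reads failing to map) or inflates false calls.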

Building Phylogenies
A consensus genome (bcftools view species_filtered.recode.vcf | vcfutils.pl vcf2fq > species.cns.fq) was obtained for each individual using bcftools, with heterozygous sites represented by DNA ambiguity codes. Gene nucleotide sequences that are present in single copy in the P. aurelia species and are also present in both outgroup species (P. caudatum and P. multimicronucleatum) were extracted, and multiple sequence alignments were generated using MUSCLE (Edgar 2004a, 2004b). This amounted to about 55 sequences (including the reference genomes) in each of 883 sets of orthologous genes. RAxML 8.0.26 (Stamatakis 2014) was used to build a tree using the concatenation of all 883 aligned gene sequences (raxml -s concatenated_sequences.fasta -n gene -m GTRGAMMA -T 4 -p 7). In addition, individual gene trees (raxml -s gene.fasta -n gene -m GTRGAMMA -T 4 -p 7) were built, which were then used to create a consensus tree (raxml -J MR -z trees.newick -T 4 -m GTRGAMMA) (Aberer et al. 2010).
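The representation of heterozygous sites by DNA ambiguity codes can be illustrated with the standard IUPAC two-base codes. The Python sketch below is an illustration only; in the pipeline this step is performed by vcfutils.pl vcf2fq, and the function name consensus_base is hypothetical.

```python
# Standard IUPAC codes for the six possible two-base combinations.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}

def consensus_base(allele1, allele2):
    """Return the base for a homozygous site or the IUPAC code for a heterozygous one."""
    a, b = allele1.upper(), allele2.upper()
    if a == b:
        return a
    return IUPAC[frozenset((a, b))]
```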

The mitochondrial phylogeny was built using the same parameters as above by concatenating eight sets of orthologous genes (NADH_dehydrogenase_subunit_1_a, NADH_dehydrogenase_subunit_5, Ymf56orf78, Ymf67orf265_1, Ymf68orf393, apocytochrome_b, cytochrome_c_oxidase_subunit_1, cytochrome_c_oxidase_subunit_2, ribosomal_protein_S14) that were ubiquitously present across all sampled species.

Principal Component Analysis
Genotypes for all individuals were obtained from the vcf files generated by vcftools for all scaffolds. Only the sites that had a genotype available for all individuals were used in the analysis, i.e., if even a single individual had missing data, the site was excluded from analysis. A total of approximately 10,000 markers were used for each species. The first marker of each scaffold was picked randomly within the first 50 SNP positions, and the rest were selected so that all markers were uniformly distributed across all scaffolds. PCA was carried out on these markers by using the R (R-Core-Team 2014) command prcomp (https://stat.ethz.ch/R-manual/R-devel/library/stats/html/prcomp.html) and visualized using the first two components with the R graphing package ggplot2 (Wickham 2010).
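The marker-selection step can be sketched as follows. This Python example is a simplified illustration for a single scaffold, not the authors' implementation: the function select_markers, the use of None for missing genotypes, and the fixed random seed are all assumptions, and the actual analysis distributed markers across all scaffolds.

```python
import random

def select_markers(genotypes_by_site, n_markers, max_offset=50, seed=0):
    """Pick evenly spaced SNP indices on one scaffold.

    genotypes_by_site: list of per-site genotype lists, None marking missing data.
    The first marker is drawn at random among the first `max_offset` complete sites.
    """
    # Keep only sites genotyped in every individual.
    complete = [i for i, g in enumerate(genotypes_by_site)
                if all(x is not None for x in g)]
    if not complete or n_markers <= 0:
        return []
    rng = random.Random(seed)
    start = rng.randrange(min(max_offset, len(complete)))
    # Constant step so the remaining markers are uniformly spread.
    step = max(1, (len(complete) - start) // n_markers)
    return complete[start::step][:n_markers]
```

Thinning to evenly spaced markers reduces the influence of tightly linked SNPs on the principal components.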

Calculation of Population Genetic Statistics
Nucleotide Diversity and Tajima's D
Nucleotide diversity was calculated as 2pq[n/(n − 1)], where p and q are the major and minor allele frequencies and n is the number of chromosomes. The major and minor allele frequencies were given for each site by GFE. Diversity at coding sites was partitioned into synonymous and non-synonymous components (e.g., 2-fold degenerate sites are one-third synonymous and two-thirds non-synonymous). Major allele frequencies were also extracted from vcftools and returned very similar results to those from GFE (supplementary table S3, Supplementary Material online).
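The per-site diversity estimator above translates directly into code. The following Python sketch assumes that major allele frequencies are available as a list of floats; the function names site_diversity and mean_diversity are hypothetical, and averaging over the number of analysed sites (including monomorphic ones) is an assumption about how the per-site values were combined.

```python
def site_diversity(p, n):
    """Unbiased heterozygosity at one biallelic site: 2pq * n / (n - 1)."""
    q = 1.0 - p
    return 2.0 * p * q * n / (n - 1)

def mean_diversity(major_allele_freqs, n, n_sites):
    """Average diversity over n_sites analysed sites (monomorphic sites contribute 0)."""
    return sum(site_diversity(p, n) for p in major_allele_freqs) / n_sites
```

For example, a site with p = 0.5 sampled in n = 20 chromosomes contributes 2(0.5)(0.5)(20/19) = 10/19, or about 0.526.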


A sliding window approach was used to calculate Tajima's D across the genome. Vcftools was used to obtain these values for non-overlapping windows of 1 kb in length. The program ms (Hudson 2002) was used to perform 100,000 simulations of 10 individuals with 1 kb locus length, and niceStats (https://github.com/kern-lab/shIC/blob/master/niceStats.c) was used to calculate Tajima's D from the ms output file. To obtain Tajima's D for different categories of sites, SNPs belonging to a particular category of sites (specified in Results) were selected, and vcftools was used to calculate Tajima's D for non-overlapping windows of 10 kb. Windows with 2 or more SNPs were used to calculate the mean, and only alternate windows were used to compare distributions of Tajima's D across different categories by Student's t-test to maximise independence of sampling.
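For reference, the statistic computed per window by vcftools and niceStats follows the standard definition of Tajima's D. The Python sketch below implements that textbook formula; it is not taken from either tool, and the function name tajimas_d and its arguments are hypothetical.

```python
import math

def tajimas_d(n, seg_sites, pi):
    """Tajima's D for n sequences, seg_sites segregating sites, and mean
    pairwise difference pi within a window (standard formula)."""
    if seg_sites == 0:
        return float("nan")  # undefined without segregating sites
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # Difference between the two estimators of theta, scaled by its standard deviation.
    return (pi - seg_sites / a1) / math.sqrt(e1 * seg_sites + e2 * seg_sites * (seg_sites - 1))
```

Negative values indicate an excess of rare variants relative to neutral expectation, and positive values an excess of intermediate-frequency variants.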

Plotting Average Expected Heterozygosity across Introns and Intergenic Regions
All introns (< 500 bp in length) were split midway and aligned according to the 5′ and 3′ splice sites. All intergenic regions (< 500 bp in the outgroup species and < 1,000 bp in the P. aurelia species) were split midway and aligned according to the start and stop codons. Note that these were not sequence alignments. At each site, the mean expected heterozygosity was calculated. Sites with an insufficient number of data points (< 200 data points per site for introns; for intergenic regions, < 100 data points per site for the outgroups and < 1,000 data points per site in the P. aurelia species) were discarded.
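The split-and-anchor procedure can be sketched in Python as follows. This is an illustrative reading of the description, not the authors' script: the function positional_means, the ("5p", distance) and ("3p", distance) keys, and the convention that the middle site of an odd-length region goes to the 3′ half are all assumptions.

```python
from collections import defaultdict

def positional_means(regions, min_points=1):
    """Average per-site heterozygosity by distance from each end.

    regions: list of lists of per-site heterozygosity, each ordered 5' to 3'.
    Positions supported by fewer than min_points regions are discarded.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for het in regions:
        half = len(het) // 2
        for i, h in enumerate(het):
            # First half is indexed from the 5' end, second half from the 3' end.
            key = ("5p", i) if i < half else ("3p", len(het) - 1 - i)
            sums[key] += h
            counts[key] += 1
    return {k: sums[k] / counts[k] for k in sums if counts[k] >= min_points}
```

Because regions differ in length, anchoring each half to its nearest boundary keeps positions comparable without requiring a sequence alignment.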

Calculating and Plotting Divergence across Introns and Intergenic Regions
Orthologous coding sequences along with introns were aligned using MUSCLE (Edgar 2004a). Intergenic regions flanked by orthologous genes on both sides were also aligned, along with the flanking coding exon, using MUSCLE. For each species separately, divergence for different positions (with respect to the start and stop sites) was averaged across all introns and intergenic regions. The divergence values were corrected for multiple substitutions by the Jukes-Cantor sequence substitution model and plotted in the same manner as described above.
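The Jukes-Cantor correction converts an observed proportion of differing sites, p, into an estimated number of substitutions per site, d = −(3/4) ln(1 − 4p/3). A minimal Python sketch follows; the function name jukes_cantor is hypothetical, and returning infinity at saturation (p ≥ 0.75) is a choice made here for illustration.

```python
import math

def jukes_cantor(p):
    """Jukes-Cantor distance from the observed proportion of differing sites."""
    if p < 0:
        raise ValueError("proportion of differences must be non-negative")
    if p >= 0.75:
        return float("inf")  # saturated: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

The correction is small at low divergence (p = 0.10 gives d of about 0.107) and grows rapidly as p approaches 0.75.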

GO Analysis
All gene ontology enrichment analyses were performed using the BiNGO plugin for Cytoscape 3.2.1 (Shannon et al. 2003; Maere et al. 2005), using the hypergeometric test to assess significance with Benjamini and Hochberg FDR correction for multiple testing (Benjamini and Hochberg 1995). The latest ontology files (.obo) were downloaded from www.geneontology.org, and custom annotation files were prepared as required using Python scripts.
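The statistics behind such an enrichment analysis can be sketched in a few lines of Python. This illustrates the hypergeometric test and the Benjamini-Hochberg adjustment in general; it is not BiNGO's code, and the function names and argument order are assumptions.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the GO term."""
    upper = min(n, K)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total

def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```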

Identification of Heterozygous Regions in the P. aurelia Species
The fraction of heterozygous sites was calculated for 10 bp non-overlapping windows across the genome for all three P. aurelia species, and the top 1% of such regions, i.e., those with the highest fraction of heterozygous sites, were identified as heterozygous regions.
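A windowed scan of this kind can be sketched in Python as below. The function top_heterozygous_windows and its input (a per-site list of 0/1 flags for one scaffold) are hypothetical, and ties among windows are broken arbitrarily by sort order, which the source does not specify.

```python
def top_heterozygous_windows(is_het, window=10, top_fraction=0.01):
    """Rank non-overlapping windows by fraction of heterozygous sites.

    is_het: sequence of 0/1 flags, one per site.
    Returns (window_start, fraction) tuples for the top fraction of windows.
    """
    fractions = []
    for start in range(0, len(is_het) - window + 1, window):
        chunk = is_het[start:start + window]
        fractions.append((start, sum(chunk) / window))
    n_top = max(1, int(len(fractions) * top_fraction))
    ranked = sorted(fractions, key=lambda x: x[1], reverse=True)
    return ranked[:n_top]
```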

Mitochondrial Genome Assembly and Annotation
Mitochondrial genomes were assembled as follows. Sequence reads were assembled using SPAdes (version 3.5.0) (Bankevich et al. 2012) after removing potential adapter sequence with Trimmomatic (version 0.33) (Bolger et al. 2014). Mitochondrial contigs were identified from the resulting assemblies by BLAST searches against the published P. caudatum and P. tetraurelia mitochondrial genomes (Pritchard et al. 1990; Barth and Berendonk 2011).

Protein coding genes were identified by generating all ORFs longer than 60 aa in all six frames using the Mold, Protozoan, and Coelenterate Mitochondrial Code (i.e., UGA codes for W instead of being a stop codon) and all alternative start codons specific to Paramecium (AUU, AUA, AUG, AUC, GUG, and GUA), and retaining the longest ORFs associated with each stop codon. BLASTP (Altschul et al. 1997) was then used to identify homologs of annotated mitochondrial proteins in P. tetraurelia and P. caudatum. Additional ORFs were identified by imposing the requirement that their length exceeds 100 aa and subsequently annotated using BLASTP against the non-redundant protein sequences (nr) database and HMMER 3.0 (Eddy 2011) scans against the PFAM 27.0 database (Finn et al. 2014).
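The ORF-generation step can be illustrated with a short Python sketch. It is an assumption-laden simplification rather than the authors' code: the function longest_orfs is hypothetical, sequences are treated as DNA (T in place of U), only TAA and TAG are taken as stop codons because TGA encodes W under this code, and coordinates on the minus strand refer to the reverse-complemented sequence.

```python
STARTS = {"ATT", "ATA", "ATG", "ATC", "GTG", "GTA"}  # start codons listed in the text
STOPS = {"TAA", "TAG"}                               # TGA is read as W, not stop
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def longest_orfs(seq, min_codons=60):
    """Return (strand, frame, start, end) for the longest ORF ending at each stop.

    An ORF runs from the first start codon after the previous in-frame stop
    up to and including the next in-frame stop; end is exclusive.
    """
    seq = seq.upper()
    strands = {"+": seq, "-": seq.translate(COMPLEMENT)[::-1]}
    orfs = []
    for strand, s in strands.items():
        for frame in range(3):
            start = None
            for pos in range(frame, len(s) - 2, 3):
                codon = s[pos:pos + 3]
                if start is None and codon in STARTS:
                    start = pos  # earliest start gives the longest ORF
                elif codon in STOPS:
                    if start is not None and (pos - start) // 3 > min_codons:
                        orfs.append((strand, frame, start, pos + 3))
                    start = None
    return orfs
```

Keeping only the earliest start codon per stop corresponds to retaining the longest ORF associated with each stop codon.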

tRNA genes were annotated with tRNAscan-SE 1.21 (Schattner et al. 2005) using the "Mito/Chloroplast" source.

Supplementary Material
Supplementary data are available at Molecular Biology and Evolution online.

Acknowledgments
We thank Lydia Bright and Jean-Francois Gout for helpful comments on the manuscript, Matthew S. Ackerman for insightful discussions related to the project, and the anonymous reviewers and editor for their valuable suggestions. We would also like to thank Kelley Thomas for providing help with the sequencing and E. Przybos for kindly providing several Paramecium strains and the species identification of some so far unspecified strains. This work was financially supported by the National Science Foundation (MCB-1050161 and DEB-1257806), the European Commission FP7-PEOPLE-2009-IRSES project CINAR PATHOBACTER (project number 247658), and the CNRS research network GDRE "Paramecium Genome Dynamics and Evolution". T.U.B. also acknowledges funding of the "support the best" of the ZUK TU Dresden and the Deutsche Forschungsgemeinschaft (DFG).

References
Abbott RJ, Gomes MF. 1989. Population genetic-structure and outcrossing rate of Arabidopsis thaliana (L.) Heynh. Heredity 62:411–418.

Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, McVean GA. 2010. A map of human genome variation from population-scale sequencing. Nature 467:1061–1073.


Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA. 2012. An integrated map of genetic variation from 1,092 human genomes. Nature 491:56–65.

Aberer AJ, Pattengale ND, Stamatakis A. 2010. Parallelized phylogenetic post-analysis on multi-core architectures. J Comput Sci-Neth. 1:107–114.

Ackermann AA, Panunzi LG, Cosentino RO, Sanchez DO, Aguero F. 2012. A genomic scale map of genetic diversity in Trypanosoma cruzi. BMC Genomics 13:736.

Altenhoff AM, Skunca N, Glover N, Train CM, Sueki A, Pilizota I, Gori K, Tomiczek B, Muller S, Redestig H, et al. 2015. The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Res. 43:D240–D249.

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389–3402.

Arnaiz O, Cain S, Cohen J, Sperling L. 2007. ParameciumDB: a community resource that integrates the Paramecium tetraurelia genome sequence with genetic data. Nucleic Acids Res. 35:D439–D444.

Arnaiz O, Mathy N, Baudry C, Malinsky S, Aury JM, Wilkes CD, Garnier O, Labadie K, Lauderdale BE, Le Mouel A, et al. 2012. The Paramecium germline genome provides a niche for intragenic parasitic DNA: evolutionary dynamics of internal eliminated sequences. PLoS Genet. 8:e1002984.

Arnaiz O, Sperling L. 2011. ParameciumDB in 2011: new tools and new data for functional and comparative genomics of the model ciliate Paramecium tetraurelia. Nucleic Acids Res. 39:D632–D636.

Assefa S, Lim C, Preston MD, Duffy CW, Nair MB, Adroub SA, Kadir KA, Goldberg JM, Neafsey DE, Divis P, et al. 2015. Population genomic structure and adaptation in the zoonotic malaria parasite Plasmodium knowlesi. Proc Natl Acad Sci U S A. 112:13027–13032.

Aury JM, Jaillon O, Duret L, Noel B, Jubin C, Porcel BM, Segurens B, Daubin V, Anthouard V, Aiach N, et al. 2006. Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature 444:171–178.

Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR. 2015. A global reference for human genetic variation. Nature 526:68–74.

Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, et al. 2012. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 19:455–477.

Barth D, Berendonk TU. 2011. The mitochondrial genome sequence of the ciliate Paramecium caudatum reveals a shift in nucleotide composition and codon usage within the genus Paramecium. BMC Genomics 12:272.

Barth D, Krenek S, Fokin SI, Berendonk TU. 2006. Intraspecific genetic variation in Paramecium revealed by mitochondrial cytochrome c oxidase I sequences. J Eukaryot Microbiol. 53:20–25.

Bazin E, Glemin S, Galtier N. 2006. Population size does not influence mitochondrial genetic diversity in animals. Science 312:570–572.

Beale G, Preer JR. 2008. Paramecium: genetics and epigenetics. Boca Raton (FL): CRC Press.

Begun DJ, Aquadro CF. 1992. Levels of naturally-occurring DNA polymorphism correlate with recombination rates in Drosophila melanogaster. Nature 356:519–520.

Begun DJ, Holloway AK, Stevens K, Hillier LW, Poh YP, Hahn MW, Nista PM, Jones CD, Kern AD, Dewey CN, et al. 2007. Population genomics: whole-genome analysis of polymorphism and divergence in Drosophila simulans. PLoS Biol. 5:e310.

Benjamini Y, Hochberg Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B Met. 57:289–300.

Betermier M. 2004. Large-scale genome remodelling by the developmentally programmed elimination of germ line sequences in the ciliate Paramecium. Res Microbiol. 155:399–408.

Birky CW, Walsh JB. 1988. Effects of linkage on rates of molecular evolution. Proc Natl Acad Sci U S A. 85:6414–6418.

Birnbaum RY, Clowney EJ, Agamy O, Kim MJ, Zhao J, Yamanaka T, Pappalardo Z, Clarke SL, Wenger AM, Nguyen L, et al. 2012. Coding exons function as tissue-specific enhancers of nearby genes. Genome Res. 22:1059–1068.

Boel G, Letso R, Neely H, Price WN, Wong KH, Su M, Luff JD, Valecha M, Everett JK, Acton TB, et al. 2016. Codon influence on protein expression in E. coli correlates with mRNA levels. Nature 529:358–363.

Bolger AM, Lohse M, Usadel B. 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30:2114–2120.

Brandvain Y, Wright SI. 2016. The limits of natural selection in a nonequilibrium world. Trends Genet. 32:201–210.

Brunk CF, Lee LC, Tran AB, Li J. 2003. Complete sequence of the mitochondrial genome of Tetrahymena thermophila and comparative methods for identifying highly divergent genes. Nucleic Acids Res. 31:1673–1682.

Burger G, Zhu Y, Littlejohn TG, Greenwood SJ, Schnare MN, Lang BF, Gray MW. 2000. Complete sequence of the mitochondrial genome of Tetrahymena pyriformis and comparison with Paramecium aurelia mitochondrial DNA. J Mol Biol. 297:365–380.

Burki F. 2014. The eukaryotic tree of life from a global phylogenomic perspective. Cold Spring Harb Perspect Biol. 6:a016147.

Cao J, Schneeberger K, Ossowski S, Gunther T, Bender S, Fitz J, Koenig D, Lanz C, Stegle O, Lippert C, et al. 2011. Whole-genome sequencing of multiple Arabidopsis thaliana populations. Nat Genet. 43:956–963.

Catania F, Wurmser F, Potekhin AA, Przybos E, Lynch M. 2009. Genetic diversity in the Paramecium aurelia species complex. Mol Biol Evol. 26:421–431.

Chalker DL, Yao MC. 2011. DNA elimination in ciliates: transposon domestication and genome surveillance. Annu Rev Genet. 45:227–246.

Charlesworth D. 2003. Effects of inbreeding on the genetic diversity of populations. Philos T Roy Soc B. 358:1051–1070.

Charlesworth B, Morgan MT, Charlesworth D. 1993. The effect of deleterious mutations on neutral molecular variation. Genetics 134:1289–1303.

Chia JM, Song C, Bradbury PJ, Costich D, de Leon N, Doebley J, Elshire RJ, Gaut B, Geller L, Glaubitz JC, et al. 2012. Maize HapMap2 identifies extant variation from a genome in flux. Nat Genet. 44:803–807.

Comeron JM, Williford A, Kliman RM. 2008. The Hill-Robertson effect: evolutionary consequences of weak selection and linkage in finite populations. Heredity 100:19–31.

Dey A, Chan CK, Thomas CG, Cutter AD. 2013. Molecular hyperdiversity defines populations of the nematode Caenorhabditis brenneri. Proc Natl Acad Sci U S A. 110:11056–11060.

Dong X, Navratilova P, Fredman D, Drivenes O, Becker TS, Lenhard B. 2010. Exonic remnants of whole-genome duplication reveal cis-regulatory function of coding exons. Nucleic Acids Res. 38:1071–1085.

Downing T, Imamura H, Decuypere S, Clark TG, Coombs GH, Cotton JA, Hilley JD, de Doncker S, Maes I, Mottram JC, et al. 2011. Whole genome sequencing of multiple Leishmania donovani clinical isolates provides insights into population structure and mechanisms of drug resistance. Genome Res. 21:2143–2156.

Drummond DA, Bloom JD, Adami C, Wilke CO, Arnold FH. 2005. Why highly expressed proteins evolve slowly. Proc Natl Acad Sci U S A. 102:14338–14343.

Eddy SR. 2011. Accelerated profile HMM searches. PLoS Comput Biol. 7:e1002195.

Edgar RC. 2004a. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5:113.

Edgar RC. 2004b. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32:1792–1797.

Eichenlaub MP, Ettwiller L. 2011. De novo genesis of enhancers in vertebrates. PLoS Biol. 9:e1001188.

Ellison CE, Hall C, Kowbel D, Welch J, Brem RB, Glass NL, Taylor JW. 2011. Population genomics and local adaptation in wild isolates of a model microbial eukaryote. Proc Natl Acad Sci U S A. 108:2831–2836.

Elyashiv E, Bullaughey K, Sattath S, Rinott Y, Przeworski M, Sella G. 2010. Shifts in the intensity of purifying selection: an analysis of genome-wide polymorphism data from two closely related yeast species. Genome Res. 20:1558–1573.

Engelthaler DM, Hicks ND, Gillece JD, Roe CC, Schupp JM, Driebe EM, Gilgado F, Carriconde F, Trilles L, Firacative C, et al. 2014. Cryptococcus gattii in North American Pacific Northwest: whole-population genome analysis provides insights into species evolution and dispersal. mBio 5:e01464–e01414.

Fawcett JA, Iida T, Takuno S, Sugino RP, Kado T, Kugou K, Mura S, Kobayashi T, Ohta K, Nakayama J, et al. 2014. Population genomics of the fission yeast Schizosaccharomyces pombe. PLoS ONE 9:e104241.

Fenchel T, Finlay BJ. 2004. The ubiquity of small species: patterns of local and global diversity. Bioscience 54:777–784.

Finlay BJ. 2002. Global dispersal of free-living microbial eukaryote species. Science 296:1061–1063.

Finlay BJ, Esteban GF, Fenchel T. 2004. Protist diversity is different? Protist 155:15–22.

Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J, et al. 2014. Pfam: the protein families database. Nucleic Acids Res. 42:D222–D230.

Foissac S, Gouzy J, Rombauts S, Mathe C, Amselem J, Sterck L, Van de Peer Y, Rouze P, Schiex T. 2008. Genome annotation in plants and fungi: EuGene as a model platform. Curr Bioinform. 3:87–97.

Foissner W. 1999. Protist diversity: estimates of the near-imponderable. Protist 150:363–368.

Foissner W. 2006. Biogeography and dispersal of micro-organisms: a review emphasizing protists. Acta Protozool. 45:111–136.

Foissner W. 2008. Protist diversity and distribution: some basic considerations. Biodivers Conserv. 17:235–242.

Foissner W. 2011. Dispersal of protists: the role of cysts and human introductions. In: Fontaneto D, editor. Biogeography of microscopic organisms: is everything small everywhere? Cambridge University Press. p. 61–87.

Foissner W, Chao A, Katz LA. 2008. Diversity and geographic distribution of ciliates (Protista: Ciliophora). Biodivers Conserv. 17:345–363.

Fournier-Level A, Korte A, Cooper MD, Nordborg M, Schmitt J, Wilczek AM. 2011. A map of local adaptation in Arabidopsis thaliana. Science 334:86–89.

Galtier N, Nabholz B, Glemin S, Hurst GDD. 2009. Mitochondrial DNA as a marker of molecular diversity: a reappraisal. Mol Ecol. 18:4541–4550.

Gout JF, Kahn D, Duret L, Paramecium Post-Genomics C. 2010. The relationship among gene expression, the evolution of gene dosage, and the rate of protein evolution. PLoS Genet. 6:e1000944.

Gout JF, Lynch M. 2015. Maintenance and loss of duplicated genes by dosage subfunctionalization. Mol Biol Evol. 32:2141–2148.

Gruchy DF. 1955. The breeding system and distribution of Tetrahymena pyriformis. J Protozool. 2:178–185.

Gustincich S, Manfioletti G, Del Sal G, Schneider C, Carninci P. 1991. A fast method for high-quality genomic DNA extraction from whole human blood. BioTechniques 11:298–300, 302.

Hancock AM, Witonsky DB, Alkorta-Aranburu G, Beall CM, Gebremedhin A, Sukernik R, Utermann G, Pritchard JK, Coop G, Di Rienzo A. 2011. Adaptations to climate-mediated selective pressures in humans. PLoS Genet. 7:e1001375.

Hiwatashi K. 1968. Determination and inheritance of mating type in Paramecium caudatum. Genetics 58:373–386.

Hori M, Tomikawa I, Przybos E, Fujishima M. 2006. Comparison of the evolutionary distances among syngens and sibling species of Paramecium. Mol Phylogenet Evol. 38:697–704.

Huang X, Kurata N, Wei X, Wang ZX, Wang A, Zhao Q, Zhao Y, Liu K, Lu H, Li W, et al. 2012. A map of rice genome variation reveals the origin of cultivated rice. Nature 490:497–501.

Huang X, Zhao Q, Han B. 2015. Comparative population genomics reveals strong divergence and infrequent introgression between Asian and African rice. Mol Plant 8:958–960.

Hudson RR. 2002. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18:337–338.

Hudson RR, Turelli M. 2003. Stochasticity overrules the "three-times rule": genetic drift, genetic draft, and coalescence times for nuclear loci versus mitochondrial DNA. Evolution 57:182–190.

Jaillon O, Bouhouche K, Gout JF, Aury JM, Noel B, Saudemont B, Nowacki M, Serrano V, Porcel BM, Segurens B, et al. 2008. Translational control of intron splicing in eukaryotes. Nature 451:359–362.

James JE, Piganeau G, Eyre-Walker A. 2016. The rate of adaptive evolution in animal mitochondria. Mol Ecol. 25:67–78.

Joy DA, Feng X, Mu J, Furuya T, Chotivanich K, Krettli AU, Ho M, Wang A, White NJ, Suh E, et al. 2003. Early origin and recent expansion of Plasmodium falciparum. Science 300:318–321.

Kaplan NL, Hudson RR, Langley CH. 1989. The hitchhiking effect revisited. Genetics 123:887–899.

Katz LA, McManus GB, Snoeyenbos-West OLO, Griffin A, Pirog K, Costas B, Foissner W. 2005. Reframing the 'Everything is everywhere' debate: evidence for high gene flow and diversity in ciliate morphospecies. Aquat Microb Ecol. 41:55–65.

Katz LA, Snoeyenbos-West O, Doerder FP. 2006. Patterns of protein evolution in Tetrahymena thermophila: implications for estimates of effective population size. Mol Biol Evol. 23:608–614.

Krenek S, Berendonk TU, Fokin SI. 2015. New Paramecium (Ciliophora, Oligohymenophorea) congeners shape our view on its biodiversity. Org Divers Evol. 15:215–233.

Krenek S, Berendonk TU, Petzoldt T. 2011. Thermal performance curves of Paramecium caudatum: a model selection approach. Eur J Protistol. 47:124–137.

Krenek S, Petzoldt T, Berendonk TU. 2012. Coping with temperature at the warm edge – patterns of thermal adaptation in the microbial eukaryote Paramecium caudatum. PLoS One 7:e30598.

Kudla G, Murray AW, Tollervey D, Plotkin JB. 2009. Coding-sequence determinants of gene expression in Escherichia coli. Science 324:255–258.

Kvitek DJ, Will JL, Gasch AP. 2008. Variations in stress sensitivity and genomic expression in diverse S. cerevisiae isolates. PLoS Genet. 4:e1000223.

Lahr DJ, Laughinghouse HDt, Oliverio AM, Gao F, Katz LA. 2014. How discordant morphological and molecular evolution among microorganisms can revise our notions of biodiversity on Earth. Bioessays 36:950–959.

Langley CH, Stevens K, Cardeno C, Lee YC, Schrider DR, Pool JE, Langley SA, Suarez C, Corbett-Detig RB, Kolaczkowski B, et al. 2012. Genomic variation in natural populations of Drosophila melanogaster. Genetics 192:533–598.

Leffler EM, Bullaughey K, Matute DR, Meyer WK, Segurel L, Venkat A, Andolfatto P, Przeworski M. 2012. Revisiting an old riddle: what determines genetic diversity levels within species? PLoS Biol. 10:e1001388.

Li H. 2011. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27:2987–2993.

Li H, Durbin R. 2010. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26:589–595.

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Genome Project Data Processing S. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25:2078–2079.

Lim LP, Burge CB. 2001. A computational analysis of sequence features involved in recognition of short introns. Proc Natl Acad Sci U S A. 98:11193–11198.

Liti G, Carter DM, Moses AM, Warringer J, Parts L, James SA, Davey RP, Roberts IN, Burt A, Koufopanou V, et al. 2009. Population genomics of domestic and wild yeasts. Nature 458:337–341.

Loveless MD, Hamrick JL. 1984. Ecological determinants of genetic-structure in plant-populations. Annu Rev Ecol Syst. 15:65–95.

Lynch M. 2007. The origins of genome architecture. Sunderland (MA): Sinauer Associates.

Lynch M. 2009. Estimation of allele frequencies from high-coverage genome-sequencing projects. Genetics 182:295–301.


Lynch M, Field MC, Goodson HV, Malik HS, Pereira-Leal JB, Roos DS, Turkewitz AP, Sazer S. 2014. Evolutionary cell biology: two origins, one objective. Proc Natl Acad Sci U S A. 111:16990–16994.

Lynch M, Sung W, Morris K, Coffey N, Landry CR, Dopman EB, Dickinson WJ, Okamoto K, Kulkarni S, Hartl DL, et al. 2008. A genome-wide view of the spectrum of spontaneous mutations in yeast. Proc Natl Acad Sci U S A. 105:9272–9277.

Maere S, Heymans K, Kuiper M. 2005. BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics 21:3448–3449.

Maguire B, Belk D. 1967. Paramecium transport by land snails. J Protozool. 14:445–447.

Maruki T, Lynch M. 2015. Genotype-frequency estimation from high-throughput sequencing data. Genetics 201:473–486.

McGrath CL, Gout JF, Doak TG, Yanagi A, Lynch M. 2014a. Insights into three whole-genome duplications gleaned from the Paramecium caudatum genome sequence. Genetics 197:1417–1428.

McGrath CL, Gout JF, Johri P, Doak TG, Lynch M. 2014b. Differential retention and divergent resolution of duplicate genes following whole-genome duplication. Genome Res. 24:1665–1675.

Miller JR, Delcher AL, Koren S, Venter E, Walenz BP, Brownley A, Johnson J, Li K, Mobarry C, Sutton G. 2008. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics 24:2818–2824.

Moradian MM, Beglaryan D, Skozylas JM, Kerikorian V. 2007. Complete mitochondrial genome sequence of three Tetrahymena species reveals mutation hot spots and accelerated nonsynonymous substitutions in Ymf genes. PLoS ONE 2:e650.

Nabholz B, Glemin S, Galtier N. 2009. The erratic mitochondrial clock: variations of mutation rate, not population size, affect mtDNA diversity across birds and mammals. BMC Evol Biol. 9:54.

Nabholz B, Mauffrey JF, Bazin E, Galtier N, Glemin S. 2008. Determination of mitochondrial genetic diversity in mammals. Genetics 178:351–361.

Neafsey DE, Galinsky K, Jiang RH, Young L, Sykes SM, Saif S, Gujja S, Goldberg JM, Young S, Zeng Q, et al. 2012. The malaria parasite Plasmodium vivax exhibits greater genetic diversity than Plasmodium falciparum. Nat Genet. 44:1046–1050.

Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, Indap A, King KS, Bergmann S, Nelson MR, et al. 2008. Genes mirror geography within Europe. Nature 456:98–101.

Ohta T. 1974. Mutational pressure as main cause of molecular evolution and polymorphism. Nature 252:351–354.

Ohta T. 1992. The nearly neutral theory of molecular evolution. Annu Rev Ecol Syst. 23:263–286.

Pawlowski J, Audic S, Adl S, Bass D, Belbahri L, Berney C, Bowser SS, Cepicka I, Decelle J, Dunthorn M, et al. 2012. CBOL protist working group: barcoding eukaryotic richness beyond the animal, plant, and fungal kingdoms. PLoS Biol. 10:e1001419.

Piganeau G, Eyre-Walker A. 2009. Evidence for variation in the effective population size of animal mitochondrial DNA. PLoS One 4:e4396.

Pool JE, Nielsen R. 2007. Population size changes reshape genomic patterns of diversity. Evolution 61:3001–3006.

Pophaly SD, Tellier A. 2015. Population level purifying selection and gene expression shape subgenome evolution in maize. Mol Biol Evol. 32:3226–3235.

Prescott DM. 1994. The DNA of ciliated protozoa. Microbiol Rev. 58:233–267.

Pritchard AE, Seilhamer JJ, Mahalingam R, Sable CL, Venuti SE, Cummings DJ. 1990. Nucleotide sequence of the mitochondrial genome of Paramecium. Nucleic Acids Res. 18:173–180.

Przybos E. 1975. Genetic studies of Paramecium jenningsi strains (Diller, Earl, 1958). Folia Biol (Krakow). 23:425–471.

Przybos E. 2005. Recent data on the occurrence of species of the Paramecium aurelia complex in Europe. Folia Biol-Krakow. 53:61–63.

Przybos E, Barth D, Berendonk TU. 2008. The Paramecium aurelia species complex, frequency and co-occurrence across Europe. Folia Biol (Krakow). 56:77–81.

Przybos E, Rautian M, Greczek-Stachura M, Potekhin A. 2007. Polymorphism within Paramecium sexaurelia (Ciliophora, Oligohymenophorea) and description of a new stand of the species in China. Folia Biol (Krakow). 55:121–125.

Przybos E, Surmacz M. 2010. New world-wide data on the distribution of species of the Paramecium aurelia complex (Ciliophora, Protozoa). Folia Biol-Krakow. 58:185–188.

R-Core-Team. 2014. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

Redden H, Alper HS. 2015. The development and characterization of synthetic minimal yeast promoters. Nat Commun. 6:7810.

Ritter DI, Dong Z, Guo S, Chuang JH. 2012. Transcriptional enhancers in protein-coding exons of vertebrate developmental genes. PLoS One 7:e35202.

Russell CB, Fraga D, Hinrichsen RD. 1994. Extremely short 20-33 nucleotide introns are the standard length in Paramecium tetraurelia. Nucleic Acids Res. 22:1221–1225.

Salim HMW, Ring KL, Cavalcanti ARO. 2008. Patterns of codon usage in two ciliates that reassign the genetic code: Tetrahymena thermophila and Paramecium tetraurelia. Protist 159:283–298.

Schacherer J, Shapiro JA, Ruderfer DM, Kruglyak L. 2009. Comprehensive polymorphism survey elucidates population structure of Saccharomyces cerevisiae. Nature 458:342–345.

Schattner P, Brooks AN, Lowe TM. 2005. The tRNAscan-SE, snoscan and snoGPS web servers for the detection of tRNAs and snoRNAs. Nucleic Acids Res. 33:W686–W689.

Schrider DR, Kern AD. 2015. Inferring selective constraint from population genomic data suggests recent regulatory turnover in the human brain. Genome Biol Evol. 7:3511–3528.

Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. 2003. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13:2498–2504.

Simonsen KL Churchill GA Aquadro CF 1995 Properties of statisticaltests of neutrality for DNA polymorphism data Genetics141413ndash429

Smith DR Lee RW 2008 Nucleotide diversity in the mitochondrialand nuclear compartments of Chlamydomonas reinhardtii in-vestigating the origins of genome architecture BMC Evol Biol8156

Smith JM Haigh J 1974 The hitch-hiking effect of a favourable geneGenetics Res 89391ndash403

Snoke MS Berendonk TU Barth D Lynch M 2006 Large global effectivepopulation sizes in Paramecium Mol Biol Evol 232474ndash2479

Sonneborn TM 1970 Methods in Paramecium research In MethodsCell Biol New York and London Academic Press p 241ndash339

Sonneborn TM 1975 Paramecium aurelia complex of fourteen siblingspecies T Am Microsc Soc 94155ndash178

Stamatakis A 2014 RAxML version 8 a tool for phylogenetic anal-ysis and post-analysis of large phylogenies Bioinformatics301312ndash1313

Stoeck T Welter H Seitz-Bender D Kusch J Schmidt HJ 2000 ARDRAand RAPD-fingerprinting reject the sibling species concept for theciliate Paramecium caudatum (Ciliophora Protoctista) Zool Scr2975ndash82

Sun M 1999 Cleistogamy in Scutellaria indica (Labiatae) effectivemating system and population genetic structure Mol Ecol81285ndash1295

Swart EC Nowacki M Shum J Stiles H Higgins BP Doak TG SchotanusK Magrini VJ Minx P Mardis ER et al 2012 The Oxytricha trifallaxmitochondrial genome Genome Biol Evol 4136ndash154

Tajima F 1989 Statistical method for testing the neutral mutation hy-pothesis by DNA polymorphism Genetics 123585ndash595

Tishkoff SA Reed FA Ranciaro A Voight BF Babbitt CC Silverman JSPowell K Mortensen HM Hirbo JB Osman M et al 2007Convergent adaptation of human lactase persistence in Africa andEurope Nat Genet 3931ndash40

Tsukii Y Hiwatashi K 1983 Genes controlling mating-type specificity inParamecium caudatum three loci revealed by intersyngenic crossesGenetics 10441ndash62

Population Genomics of Paramecium Species . doi:10.1093/molbev/msx074 MBE


Tsukii Y, Hiwatashi K. 1985. Meiotic nondisjunction and aneuploids in intersyngenic hybrids of Paramecium caudatum. Genetics 111:779–794.

Tuller T, Zur H. 2015. Multiple roles of the coding sequence 5′ end in gene expression regulation. Nucleic Acids Res. 43:13–28.

Voight BF, Kudaravalli S, Wen X, Pritchard JK. 2006. A map of recent positive selection in the human genome. PLoS Biol. 4:e72.

Volkman SK, Sabeti PC, DeCaprio D, Neafsey DE, Schaffner SF, Milner DA Jr, Daily JP, Sarr O, Ndiaye D, Ndir O, et al. 2007. A genome-wide map of diversity in Plasmodium falciparum. Nat Genet. 39:113–119.

Weir W, Capewell P, Foth B, Clucas C, Pountain A, Steketee P, Veitch N, Koffi M, De Meeus T, Kabore J, et al. 2016. Population genomics reveals the origin and asexual evolution of human infective trypanosomes. eLife 5:e11473.

Weisse T. 2008. Distribution and diversity of aquatic protists: an evolutionary and ecological perspective. Biodivers Conserv. 17:243–259.

Wickham H. 2010. ggplot2: Elegant graphics for data analysis. New York: Springer-Verlag.

Zhang JZ, Yang JR. 2015. Determinants of the rate of protein sequence evolution. Nat Rev Genet. 16:409–420.


FIG. 5. Patterns of average divergence (per site) within and flanking intergenic regions (a, c) and introns (e, g) are shown in the left panels. Patterns of average expected heterozygosity per site, normalized by average divergence per site, within and flanking intergenic regions (b, d) and introns (f, h) are shown on the right. The divergence values are calculated between P. tetraurelia and P. biaurelia, separately for 4-fold degenerate sites (red), 2-fold degenerate sites (blue), 0-fold degenerate sites (green), and non-coding sites (black).


coding sequence can be explained by a number of different hypotheses. The 5′ ends of CDSs might be required to maintain low thermodynamic stability of mRNA secondary structure near the start codon for efficient translation initiation. Alternatively, CDS 5′ ends might possess fewer tRNA-adapted codons to slow down translation initially, which has been proposed to reduce ribosome collision-induced premature translation terminations and facilitate protein production (see Tuller and Zur 2015). Consistent with this, the 5′ and 3′ ends of protein-coding genes in bacteria (3235 codons) have been shown to be important for efficient translation (Kudla et al. 2009; Boel et al. 2016), and their mRNA folding energies can explain observed variation in expression levels (Kudla et al. 2009). In fact, it has been shown previously that the first few codons in proteins of P. tetraurelia and T. thermophila are, on average, comprised of significantly different numbers of effective codons than the rest of the protein (Salim et al. 2008). Thus, the role played by the first and last few bases of all coding sequences in translation might contribute to additional sequence constraints, resulting in reduced nucleotide diversity and divergence.

Another factor that might contribute to constraining the CDS ends is the presence of regulatory modules at the beginning and end of genes. Exonic enhancers have been observed in multiple species, including teleosts (Dong et al. 2010; Eichenlaub and Ettwiller 2011), humans (Birnbaum et al. 2012), and other vertebrates (Ritter et al. 2012), where they have been shown to regulate nearby genes. In the highly compact genomes of Paramecium, it is even more likely that regulatory regions extend beyond non-coding regions into the CDS. For instance, in P. caudatum, the median length of 5′ intergenic regions is 43 bp, which increases to 56 bp when both genes share the same 5′ intergenic region, suggesting the importance of upstream intergenic length in such compact genomes. Interestingly, in P. caudatum 965 genes (7%) have 5′ intergenic regions shorter than 10 bp and 3417 genes (25%) have 5′ intergenic regions shorter than 20 bp. Similarly, in P. multimicronucleatum, the median length of the 5′ intergenic region is 90 bp, which increases to 127 bp when both genes share the same 5′ intergenic region. Again, there are 456 genes (3.5%) with less than 10 bp and 1760 genes (13%) with less than 20 bp comprising their upstream intergenic region. In fact, in both species the 5′ intergenic region can be as small as 3 bp (found in 38 genes in P. caudatum and 18 genes in P. multimicronucleatum). An interesting question this raises is what might be the minimum number of bases required to encode regulatory information about transcription of genes in eukaryotes. A recent study in yeast has created a high-strength minimal constitutive promoter that is only 116 bp in length (Redden and Alper 2015). It might be possible to observe even smaller promoter regions needed to drive expression patterns in Paramecium, especially in P. caudatum.

Spatial Patterns of Nucleotide Diversity across Non-Coding Regions are Not Driven by Mutation Alone

Although we have assumed that the diversity-normalized divergence patterns indicate the effect of negative selection, it is possible that the observed spatial patterns of nucleotide diversity across and flanking non-coding regions are driven entirely by mutation. One way to potentially distinguish between mutation and direct selection as the cause of these patterns is to evaluate the allele frequency of the variants observed at different site types. If the reduction in nucleotide diversity near the start and end of coding sequences is mostly a consequence of negative selection, we would expect a higher fraction of rare alleles at 4-fold degenerate sites under stronger constraints relative to that seen at 4-fold degenerate sites not under constraints. If, instead, mutation is the predominant driver of the extent of nucleotide diversity at these sites, then there should be no difference in the fraction of rare alleles between the 4-fold degenerate sites under stronger negative selection and those not under constraints. Because Tajima's D (Tajima 1989) is a good statistic to summarize such a shift in rare alleles, we compared average Tajima's D across site types to evaluate this.
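As a concrete illustration of the statistic used here, the following minimal Python sketch computes Tajima's D (Tajima 1989) from a small alignment. It assumes haploid, equal-length sequences with no missing data; the function name `tajimas_d` and the toy inputs are ours for illustration and are not the pipeline used in this study.

```python
import math
from itertools import combinations


def tajimas_d(sequences):
    """Tajima's D for equal-length haploid sequences (illustrative sketch).

    Returns None when there are no segregating sites, since D is then undefined.
    """
    n = len(sequences)
    if n < 4:
        # The variance term is zero for n < 4, so D cannot be computed.
        raise ValueError("need at least 4 sequences")

    # S: number of segregating sites (alignment columns with more than one allele).
    seg_sites = sum(1 for column in zip(*sequences) if len(set(column)) > 1)
    if seg_sites == 0:
        return None

    # pi: mean number of pairwise differences between sequences.
    n_pairs = n * (n - 1) / 2
    total_diffs = sum(
        sum(a != b for a, b in zip(x, y)) for x, y in combinations(sequences, 2)
    )
    pi = total_diffs / n_pairs

    # Constants from Tajima (1989).
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / math.sqrt(variance)


# Toy data: every variant is a singleton (excess of rare alleles).
rare = ["TAAAA", "ATAAA", "AATAA", "AAATA", "AAAAT", "AAAAA"]
# Toy data: every variant is at frequency 3/6 (excess of intermediate alleles).
balanced = ["AAAAA"] * 3 + ["TTTTT"] * 3
```

In this sketch, the singleton-only sample yields a negative D and the intermediate-frequency sample a positive D, which is the contrast the site-category comparison relies on.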

In each Paramecium species, we first characterized sites genome-wide into five different categories: (1) splice sites (first and last 3 sites) in introns, (2) intergenic sites within 50 bp of start/stop sites, (3) intergenic sites at least 150 bp away from start/stop sites (which should exclude most UTR regions), (4) 4-fold degenerate sites within 50 bp of start/stop sites, and (5) 4-fold degenerate sites at least 150 bp away from start/stop sites and 30 bp away from splice sites. We expected category 5 sites to be the most neutral. Interestingly, in most species, Tajima's D at splice sites, intergenic sites close to start/stop sites, and 4-fold degenerate sites close to start/stop sites is significantly smaller than that of 4-fold degenerate and intergenic sites far away from any functional elements (see supplementary table S7, Supplementary Material online, for P-values). This suggests that sites in categories 1, 2, and 4 are under relatively stronger purifying selection. Category 2 sites seem to possess significantly more rare alleles than category 3 sites, consistently in all species, suggesting that intergenic regions very close to start/stop sites are under stronger purifying selection than intergenic sites further away from gene ends. We also find Tajima's D at intergenic sites (category 3) to be significantly higher than that for 4-fold degenerate sites (category 5) in most species. This could suggest either that intergenic sites are under some form of positive/balancing selection, or that silent sites are not completely neutral and in fact experience some purifying selection.

Gene-Based Analysis of Patterns of Synonymous and Non-Synonymous Diversity

Selective constraint can vary among genes within the same species, and it could also be species-specific. To examine such variation in our dataset, we evaluated the distribution of πn/πs across all genes (fig. 6a). If we assume that synonymous polymorphisms are fitness neutral, then πn/πs is a measure of the relative functional constraints on non-synonymous polymorphisms and thus an indication of functional constraints on the protein. The majority of genes have πn/πs < 1.0, consistent with constraints on non-synonymous changes. About 98% of genes in P. tetraurelia, 96% in P. biaurelia, 96% in P. sexaurelia, 99% in both P. caudatum and its subset, and 99% in both P. multimicronucleatum and its subset have πn/πs < 1. Upon comparing the average πn/πs for the same set of orthologous genes across all species, we find the mean to be 0.10 for P. tetraurelia, 0.17 for P. biaurelia, 0.24 for P. sexaurelia, 0.12 [0.12] for P. caudatum [subset], and 0.17 [0.14] for P. multimicronucleatum [subset]. Among these species, P. sexaurelia has the highest average πn/πs, which suggests either relaxed selection because of a relatively small effective population size or more local adaptation. Local adaptation might result in fixation of more non-synonymous than synonymous differences between subpopulations, thus increasing πn/πs. Differences in πn/πs between species could also reflect genome-wide variation in the fraction of gene duplicates that are subfunctionalized, neofunctionalized, or nonfunctionalized. Alternatively, an elevation in average levels of πn/πs could be a consequence of depression in the average levels of πs, which could be caused by stronger selection on synonymous sites or a recent bottleneck (Brandvain and Wright 2016) experienced by the species. Evidence that genes with higher πn/πs ratios are under more relaxed constraints is provided by a comparison of the average number of frame-shift causing indels occurring in the four sets of genes with (1) πn/πs ≤ 0.25, (2) 0.25 < πn/πs ≤ 0.5, (3) 0.5 < πn/πs ≤ 1.0, and (4) πn/πs > 1.0. As expected, we find an increase in the number of indels per site with increase in πn/πs (fig. 6b).
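The binning of genes by πn/πs described above can be sketched as follows. This is a hedged illustration only: the record fields (`pi_n`, `pi_s`, `n_syn_polymorphisms`, `frameshift_indels`, `length`) and the function names are hypothetical, the bin edges are those quoted in the text, and the filter of more than five synonymous polymorphisms follows the figure legends.

```python
from collections import defaultdict

# Upper bin edges quoted in the text: <=0.25, (0.25, 0.5], (0.5, 1.0], >1.0.
BIN_EDGES = [(0.25, "<=0.25"), (0.5, "0.25-0.5"), (1.0, "0.5-1.0")]


def pn_ps_bin(pi_n, pi_s):
    """Return the label of the pi_n/pi_s bin a gene falls into."""
    ratio = pi_n / pi_s
    for upper, label in BIN_EDGES:
        if ratio <= upper:
            return label
    return ">1.0"


def mean_indels_per_site_by_bin(genes, min_syn_polymorphisms=5):
    """Average frame-shift indels per site within each pi_n/pi_s bin.

    Genes with min_syn_polymorphisms or fewer synonymous polymorphisms
    (or with pi_s of zero) are skipped.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for gene in genes:
        if gene["n_syn_polymorphisms"] <= min_syn_polymorphisms or gene["pi_s"] <= 0:
            continue
        label = pn_ps_bin(gene["pi_n"], gene["pi_s"])
        sums[label] += gene["frameshift_indels"] / gene["length"]
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}


# Hypothetical toy records, not data from this study.
toy_genes = [
    {"pi_n": 0.001, "pi_s": 0.01, "n_syn_polymorphisms": 12,
     "frameshift_indels": 0, "length": 1000},
    {"pi_n": 0.02, "pi_s": 0.01, "n_syn_polymorphisms": 8,
     "frameshift_indels": 2, "length": 1000},
    {"pi_n": 0.02, "pi_s": 0.01, "n_syn_polymorphisms": 3,  # filtered out
     "frameshift_indels": 5, "length": 1000},
]
```

Calling `mean_indels_per_site_by_bin(toy_genes)` returns one average per occupied bin, which is the quantity compared across bins in the text.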

Paralogs and Orthologs Evolve under Similar Selective Constraints within and across Species

We evaluated the correlation between πn/πs across orthologs and within-species paralogs, which were created by the most recent WGD (fig. 7). Both paralogs and orthologs show a significant positive correlation within and across all species, suggesting that most WGD duplicates are under similar selective constraints. We also confirmed that these correlations are not being driven simply by selection on synonymous sites, as πs is not necessarily correlated between orthologs and within-species paralogs across all species (supplementary fig. S8, Supplementary Material online). Similar findings in the maize genome also suggested correlations between nucleotide diversity and divergence between WGD paralogs (Pophaly and Tellier 2015). Our results further suggest that most orthologs are under very similar evolutionary constraints across the Paramecium species. This is in agreement with a recent study in Drosophila, where πn and πs were found to be strongly correlated between D. melanogaster and D. simulans (Langley et al. 2012). Such a correlation was also observed between orthologs in S. cerevisiae and S. paradoxus (Elyashiv et al. 2010).

FIG. 6. (A) πn versus πs of each gene. The red line represents πn = πs. (B) The distribution of πn/πs of each gene across the genome. Only genes with more than five synonymous polymorphisms were used in this analysis. (C) Number of frame-shift causing indels per site in genes belonging to three gene categories: (1) πn/πs ≤ 0.5 (light grey), (2) 0.5 < πn/πs ≤ 1.0 (grey), and (3) πn/πs > 1.0 (black). Only genes with more than five synonymous polymorphisms were used in this analysis.

A possible mechanism governing the degree of selective constraint acting on proteins is their level of expression (Zhang and Yang 2015). Non-synonymous mutations are thought to have more severe fitness consequences in highly expressed genes than in weakly expressed ones (Drummond et al. 2005; Gout et al. 2010). Indeed, we find a strong negative correlation between expression and πn/πs in all species (fig. 8). However, only 6–13% of the variance is explained by expression, implying that additional factors may play a greater role in determining the degree of selective constraint acting on a gene. Simple biochemical properties of proteins, like the protein's function, tertiary structure, and the number of interacting partners, might determine the proportion of functionally important residues in the protein and in turn govern the degree of selection.

Our observation that orthologous genes across such deeply diverged species experience similar degrees of selective constraint (fig. 7) might imply a low rate of turnover of expression, protein structure, function, and number of interacting partners. This finding may also be in part due to similar ecologies, and therefore similar selective pressures, experienced by the Paramecium species examined here. It may thus be informative to compare closely related species that live in, or have recently experienced, different environments. For instance, Elyashiv et al. (2010) found significant differences in levels of gene-specific purifying selection (πn/πs) in the sake population of S. cerevisiae in comparison to other populations of S. cerevisiae, as well as S. paradoxus. In fact, this deviation in selective constraints experienced by the sake population is complemented by differences in response to different stresses (Kvitek et al. 2008), suggesting that changes in environment might be reflected in gene-specific constraints.

FIG. 7. Correlation of logarithm of πn/πs for paralogs within species (three diagonal plots for P. tetraurelia, P. biaurelia, P. sexaurelia) and for orthologs between species pairwise [non-diagonal plots, with the plots below the diagonal (outlined red) representing the first set of WGD paralogs (T1, B1, S1) and the upper half of the matrix (outlined blue) representing the second set of paralogs (T2, B2, S2)]. The number in the top left corner of each plot is the slope of the linear regression in cases where the regression is significant (P < 0.01). Only genes with more than five synonymous polymorphisms were used in this analysis. The schematic tree in the bottom right corner shows the phylogenetic relationship between the species P. tetraurelia (T), P. biaurelia (B), P. sexaurelia (S), P. caudatum (C), and P. multimicronucleatum (M).

FIG. 8. Negative correlation of πn/πs with expression levels of genes in four Paramecium species. All correlations are significant with P < 2.2 × 10^-16. Only genes with more than five synonymous polymorphisms were used in this analysis.

Distribution of Tajima's D across the Genome

Contrary to genes under purifying selection, genes under balancing or positive selection might often be species-specific. We tested this hypothesis using Tajima's D, a statistic that uses the difference between nucleotide diversity and the number of segregating sites to detect deviations from neutrality. Because Tajima's D is sensitive to population size changes and population structure (Simonsen et al. 1995), we used an outlier approach. We calculated Tajima's D for non-overlapping 1 kb windows across the genome (supplementary fig. S9, Supplementary Material online) and looked for the presence of candidate genes in the outlier regions (0.25% of the two tails). The outlier regions with negative values of Tajima's D might represent genes under either weak purifying or positive selection (i.e., experiencing a recent sweep), whereas the ones with positive Tajima's D might be under balancing selection. Interestingly, both sets of outliers show an over-representation of genes involved in glycosylation, glycoprotein production and metabolism, and carbohydrate metabolism (supplementary table S8, Supplementary Material online) for all Paramecium species. For each species, the large functional overlap of genes in both tails suggests the important role of these processes in Paramecium growth. More importantly, this further suggests similar selective pressures acting genome-wide on all Paramecium species examined in this study. These conclusions can, however, be biased by gene length, as shorter genes might not have sufficient power for an accurate calculation of Tajima's D. Indeed, the outliers are all significantly longer than the rest of the genes in P. tetraurelia (outliers: 1567 bp, non-outliers: 1321 bp, P = 4.7 × 10^-13), P. biaurelia (outliers: 1681 bp, non-outliers: 1372 bp, P < 2.2 × 10^-16), and P. sexaurelia (outliers: 1860 bp, non-outliers: 1379 bp, P < 2.2 × 10^-16). This bias is, however, much lower or not significant for P. caudatum (outliers: 1463 bp, non-outliers: 1383 bp, P = 7.3 × 10^-3) and P. multimicronucleatum (outliers: 1509 bp, non-outliers: 1445 bp, P = 0.19), possibly because of higher genome-wide nucleotide diversity, indicating that despite some bias, the results have true biological significance.
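The outlier approach described above can be illustrated with a short Python sketch. It assumes that Tajima's D has already been computed per non-overlapping 1 kb window and that the 0.25% cutoff is applied to each tail separately; that reading of the cutoff, together with the names `window_start` and `tail_outliers`, is our assumption for illustration.

```python
def window_start(position, window_size=1000):
    """Start coordinate of the non-overlapping window containing a 0-based position."""
    return (position // window_size) * window_size


def tail_outliers(window_d, tail_fraction=0.0025):
    """Return (lowest, highest) window starts ranked by Tajima's D.

    window_d maps a window start coordinate to that window's Tajima's D.
    Each tail holds tail_fraction of the windows (at least one window).
    """
    ranked = sorted(window_d, key=window_d.get)
    k = max(1, int(len(ranked) * tail_fraction))
    return ranked[:k], ranked[-k:]


# Hypothetical toy input: 1000 windows with steadily increasing D values.
toy_windows = {i * 1000: i / 100 - 5 for i in range(1000)}
negative_tail, positive_tail = tail_outliers(toy_windows)
```

Genes overlapping windows in `negative_tail` or `positive_tail` would then be the candidates examined for functional enrichment. Because the ranking is relative to the genome-wide distribution, a demographic shift that affects all windows alike does not by itself change which windows are flagged.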

Tajima's D, as a summary statistic of the site-frequency spectrum, can also indicate overall patterns of selection or recent changes in demography. Overall, in most species, the genome-wide distribution of Tajima's D is shifted towards positive values, with means of 0.862 for P. tetraurelia, 0.504 for P. biaurelia, 0.274 for P. sexaurelia, -0.148 [0.459] for P. caudatum [subset], and 1.325 [0.855] for P. multimicronucleatum [subset]. This could suggest pervasive local adaptation (or balancing selection), a recent contraction in population size, or a fairly ancient population subdivision in all Paramecium species. A previous study that sampled variation at 10 nuclear loci found that strains from different continents shared the same alleles and interpreted that as evidence of regular global dispersal of Paramecium from relatively few permanent locations (Catania et al. 2009). This could also be a potential explanation for the observed increase in nucleotide diversity in comparison to the number of segregating sites.

Fraction of Heterozygous Sites across the Genome

On average, about 0.1% of sites in an individual were heterozygous within the P. aurelia species, whereas 0.7% of sites in individuals belonging to P. caudatum and 1.1% of sites in isolates of P. multimicronucleatum were found to be heterozygous (supplementary table S1, Supplementary Material online). This is consistent with the observation that the P. aurelia species undergo autogamy, whereas the outgroup species do not. On evaluating the fraction of heterozygous sites in windows of 10 kb across the genome, we find large blocks of homozygosity in the P. aurelia genomes (supplementary fig. S10a–e, Supplementary Material online), consistent with loss of heterozygosity in the lab, possibly through multiple episodes of autogamy. It should, however, be noted that peaks of heterozygosity in the P. aurelia genomes are highly correlated between isolates, unlike in P. caudatum and P. multimicronucleatum, and might reflect mapping or assembly errors. These heterozygous regions (summarized in supplementary Files S1–S3, Supplementary Material online) comprise a very small part (0.82–0.97%) of the genome and, most importantly, their exclusion does not affect any of our analyses qualitatively.
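The windowed heterozygosity scan can be sketched as below. This is a minimal illustration assuming one record per called site of a single individual, given as a 0-based position and a flag for a heterozygous genotype; the input layout and the function name `het_fraction_by_window` are hypothetical.

```python
from collections import defaultdict


def het_fraction_by_window(calls, window_size=10_000):
    """Fraction of heterozygous sites per non-overlapping window.

    calls: iterable of (position, is_heterozygous) for called sites only.
    Returns a dict mapping window start coordinate to the heterozygous fraction.
    """
    called = defaultdict(int)
    heterozygous = defaultdict(int)
    for position, is_het in calls:
        window = position // window_size
        called[window] += 1
        heterozygous[window] += 1 if is_het else 0
    return {
        window * window_size: heterozygous[window] / called[window]
        for window in sorted(called)
    }


# Hypothetical toy scaffold: 20 kb, with ten heterozygous sites in the second window.
toy_calls = [(pos, 10_000 <= pos < 10_010) for pos in range(20_000)]
toy_fractions = het_fraction_by_window(toy_calls)
```

Runs of consecutive windows with a fraction of zero correspond to the blocks of homozygosity discussed above. Windows whose elevated fraction recurs at the same coordinates in many isolates would be the ones to inspect for mapping or assembly artifacts.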

Variation across the Mitochondrial Genome

Few ciliate mitochondrial genomes have been sequenced until now, most belonging to the genus Tetrahymena (Brunk et al. 2003; Burger et al. 2000; Moradian et al. 2007). Others include Paramecium tetraurelia (Pritchard et al. 1990), Paramecium caudatum (Barth and Berendonk 2011), and Oxytricha trifallax (Swart et al. 2012). These studies have illustrated that ciliate mitochondrial genomes are linear and range between 40 and 47 kb, with 50 intronless genes. However, there has been no understanding of the levels of sequence diversity across the entire mitochondrial genome.

To obtain mitochondrial genomes for all species included in this study, we carried out de novo genome assembly using the whole-genome resequencing data for each strain and extracted mitochondrial contigs using the published Paramecium mitochondrial genomes as reference (see Methods for more details). Using this approach, we derived complete genome assemblies for multiple strains, including from species without a previously published mitochondrial genome (P. biaurelia, P. sexaurelia, and P. multimicronucleatum). A comprehensive analysis of these sequences will be published elsewhere (Johri et al., in preparation). The mitochondrial genome assembly from each isolate was annotated separately, and the eight genes that were most ubiquitously present in all individuals were concatenated and used to construct a phylogeny (see Methods). The mitochondrial phylogeny closely tracked the one based on nuclear sequences (supplementary fig. S11, Supplementary Material online).


One strain from each species was chosen as a reference genome (the genome with the longest assembly), and reads from other individuals were mapped to the reference to call SNPs. During the original DNA preparation, efforts were made to remove non-MAC DNA, resulting in varying amounts of mitochondrial DNA present in the extracts from each isolate. Hence, some individuals mapped with extremely low coverage (4× to zero read depth) to the reference genome, especially after applying stringent filters to call SNPs (supplementary table S9, Supplementary Material online). Thus, some individuals were excluded from this analysis, including all P. biaurelia isolates. The subsequent analysis was therefore restricted to four species.

All of the mitochondrial genomes are about 40 kb in length. We identified 626 SNPs in P. tetraurelia, 5307 in P. sexaurelia, 4771 [3041] in P. caudatum [subset], and 4600 [1478] in P. multimicronucleatum [subset]. Within all species, very few indels were found, with just 1 in P. tetraurelia, 9 in P. sexaurelia, and 15 [13] and 23 [9] in P. caudatum [subset] and P. multimicronucleatum [subset], respectively. The nucleotide diversity at 4-fold degenerate sites was found to be 0.009 in P. tetraurelia, 0.169 in P. sexaurelia, 0.127 [0.093] in P. caudatum [subset], and 0.172 [0.073] in P. multimicronucleatum [subset]. Nucleotide diversity across different sites shows that rRNAs are the most conserved sequences in the ciliate mitochondrial genomes. Transfer RNAs are the next most highly conserved, with their diversity levels only slightly above those of replacement sites (table 2). Our estimates of diversity in the Paramecium mitochondria are higher than those of a number of other unicellular eukaryotes studied so far: P. falciparum (πsyn = 0.0012) (Joy et al. 2003) and Chlamydomonas reinhardtii (πsilent = 0.0085) (Smith and Lee 2008).

We also recalculated nuclear nucleotide diversity (table 2) at 4-fold degenerate sites for the same set of individuals used in the mitochondrial analysis and observed that the mitochondrial to nuclear diversity ratio is close to 1.0 in all species (table 2) except P. sexaurelia, where the mitochondrial diversity is 5.7-fold higher than that in the nucleus. On excluding P. sexaurelia, we obtain a strong correlation between 4-fold degenerate nucleotide diversity in the mitochondria and nucleus (slope = 0.95, R^2 = 0.96). It should be noted that there are not enough data points to make any inferences about statistical significance. However, such a positive correlation between the nuclear and mitochondrial diversity suggests that mitochondrial diversity might scale with population size in ciliates, which bears on previous studies in animals that have yielded conflicting findings about whether mitochondrial diversity correlates with population size (Bazin et al. 2006; Nabholz et al. 2008; Nabholz et al. 2009; Piganeau and Eyre-Walker 2009). The observed correlation also suggests that mitochondrial and nuclear spontaneous mutation rates may be correlated. Direct estimates of mutation rates in C. elegans, D. melanogaster, S. cerevisiae, and H. sapiens suggest a moderate correlation (slope = 0.05, R^2 = 0.48) between mitochondrial and nuclear base substitution rates (using data from Lynch et al. 2008). Discordance between the nuclear and mitochondrial diversity, for instance in P. sexaurelia, can be attributed to various factors such as positive selection (Hudson and Turelli 2003; Galtier et al. 2009; James et al. 2016), demographic events (Pool and Nielsen 2007), or simply differences in nuclear and/or mitochondrial mutation rates (Lynch 2007). Future studies are needed to dissect the factors responsible for the remarkably different ratio of mitochondrial to nuclear diversity in P. sexaurelia.

Table 2. Nucleotide Diversity (per site) of Different Sites across the Mitochondrial Genomes.

Species                          4-fold (πm) [SE]  2-fold [SE]       0-fold [SE]       tRNA [SE]         rRNA [SE]         intergenic [SE]   4-fold nuclear (πn) [SE]  πm/πn
P. tetraurelia                   0.0091 [0.0010]   0.0107 [0.0009]   0.0019 [0.0002]   0.0000 [0.0000]   0.0006 [0.0002]   0.0041 [0.0003]   0.0058 [2.26 × 10^-5]     1.57
P. sexaurelia                    0.1692 [0.0033]   0.1427 [0.0024]   0.0097 [0.00043]  0.0000 [0.0000]   0.0059 [0.0007]   0.0417 [0.0027]   0.02945 [4.73 × 10^-5]    5.75
P. caudatum                      0.1272 [0.0037]   0.0942 [0.0019]   0.0179 [0.0006]   0.0455 [0.0023]   0.0136 [0.0012]   0.0333 [0.0027]   0.1267 [1.08 × 10^-4]     1.00
P. caudatum (subset)             0.0929 [0.0039]   0.0859 [0.0021]   0.0149 [0.0006]   0.0375 [0.0023]   0.0093 [0.0011]   0.0292 [0.0028]   0.0658 [9.63 × 10^-5]     1.41
P. multimicronucleatum           0.1727 [0.0047]   0.0831 [0.0019]   0.0227 [0.0008]   0.0263 [0.0011]   0.0269 [0.0016]   0.0329 [0.0022]   0.1599 [1.34 × 10^-4]     1.08
P. multimicronucleatum (subset)  0.0730 [0.0034]   0.0383 [0.0015]   0.0121 [0.0006]   0.0126 [0.0009]   0.0135 [0.0012]   0.0312 [0.0022]   0.0742 [9.56 × 10^-5]     0.98
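The ratio and regression quoted above can be checked with a few lines of Python. The sketch below uses the 4-fold mitochondrial (πm) and nuclear (πn) values as read from table 2. It assumes, as our own guess about the procedure, that the fit used the five rows other than P. sexaurelia (full samples and subsets) with nuclear diversity regressed on mitochondrial diversity. The names `TABLE2`, `diversity_ratios`, and `least_squares` are illustrative.

```python
# (pi_m, pi_n) at 4-fold degenerate sites, as read from table 2.
TABLE2 = {
    "P. tetraurelia": (0.0091, 0.0058),
    "P. sexaurelia": (0.1692, 0.02945),
    "P. caudatum": (0.1272, 0.1267),
    "P. caudatum (subset)": (0.0929, 0.0658),
    "P. multimicronucleatum": (0.1727, 0.1599),
    "P. multimicronucleatum (subset)": (0.0730, 0.0742),
}


def diversity_ratios(table):
    """Mitochondrial to nuclear diversity ratio per row, rounded to 2 decimals."""
    return {name: round(pi_m / pi_n, 2) for name, (pi_m, pi_n) in table.items()}


def least_squares(xs, ys):
    """Ordinary least-squares slope and R^2 of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx, sxy**2 / (sxx * syy)


ratios = diversity_ratios(TABLE2)

# Assumption: exclude P. sexaurelia and regress nuclear on mitochondrial diversity.
kept = [values for name, values in TABLE2.items() if name != "P. sexaurelia"]
slope, r_squared = least_squares([m for m, _ in kept], [n for _, n in kept])
```

Under these assumptions the fit comes out close to the slope and R^2 reported in the text, and the per-row ratios match the last column of table 2. With only five points, two of which are subsets of other rows, the fit is descriptive rather than a test of significance.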

Conclusions

This is the first study where multiple individuals of several species of free-living unicellular eukaryotes have been sequenced and used to reveal genome-wide patterns of variation. This study thus provides an invaluable resource for future research in Paramecium, a major model organism in cell biology. Sequencing of multiple individuals from the same lake or pond is needed to further resolve the global versus local population structure of these cosmopolitan unicellular species. Overall, our results suggest that most of the intergenic regions across the highly compact genomes of Paramecium species are strongly constrained and probably harbor functionally important regulatory modules, which might extend into coding regions. We show that patterns of nucleotide diversity across the genome are driven primarily by a combination of purifying selection and mutation rate variation. In future, such analyses will benefit greatly from incorporating variation in recombination rates across the genomic landscape. We also find that duplicates created by the WGD appear to be evolving under similar constraints. Synonymous to non-synonymous diversity ratios need to be supplemented with divergence levels to distinguish between the various types of selective forces acting on the duplicates. Thus, sequencing of much closer outgroup species would allow us to further understand the role of selection in the evolution of different genomic elements, including gene duplicates that resulted from ancient WGDs.

Methods

Sampling, DNA Extraction, and Sequencing

We examined a total of 56 Paramecium strains originally isolated from environmental sources, including 11 isolates of P. biaurelia, 13 isolates of P. sexaurelia, 12 isolates of P. tetraurelia, 10 isolates of P. caudatum, and 10 isolates of P. multimicronucleatum. A total of 14 strains from P. biaurelia and P. sexaurelia were kindly provided by E. Przybos, whereas the remaining strains were cultured in the Paramecium collections of the Institute of Hydrobiology, TU Dresden, Germany, or the Department of Biology, Indiana University, Bloomington, USA. Prior to our analyses, P. tetraurelia strain Moz 13CXI as well as P. sexaurelia strains Moz 13BIII and Indo 17I were kindly identified by E. Przybos by performing inter- and intrastrain crosses (Przybos 1975). Paramecia were maintained in 0.25% Cerophyl medium (modified according to Sonneborn 1970; Krenek et al. 2011) inoculated with Enterobacter aerogenes. Since P. aurelia species can undergo autogamy (self-fertilization) induced by starvation, and all strains have been cultured under lab conditions for several years, all P. aurelia strains in this study are expected to be more homozygous than wild isolates.

Prior to DNA extraction, all cultures were initiated from a single cell, ensuring clonality. A modified sucrose density separation (Aury et al. 2006) was performed to extract primarily macronuclei (MACs) and to reduce the amount of non-MAC DNA potentially compromising subsequent analyses. Briefly, Paramecium cells were starved for at least two days and filtered through several layers of sterile cheesecloth prior to cell harvesting via gentle centrifugation. Pelleted cells were gently lysed in 0.25 M sucrose medium (pH 6.8) containing 10 mM Tris-HCl, 8 mM MgCl2, 3 mM CaCl2, 0.5–1% Nonidet-P40, and 0.1–0.2% sodium deoxycholate. Macronuclei were subsequently isolated away from cellular debris, micronuclei, and mitochondria by centrifugation at 100 rcf in 0.25 M sucrose wash buffer (see above, w/o Nonidet-P40 and Na-deoxycholate). Genomic DNA extraction from MAC isolations was performed by a modified CTAB procedure (Gustincich et al. 1991).

All genomes were sequenced using the Illumina HiSeq 2000, and paired-end reads with insert sizes between 200 and 400 bp and a mean length of 151 bp were obtained. The reads mapped with coverage ranging between 40× and 80× (details given in supplementary table S1, Supplementary Material online).

Paramecium Reference Genomes

Complete genomes of P. tetraurelia, P. biaurelia, P. sexaurelia, and P. caudatum were downloaded from ParameciumDB (Arnaiz et al. 2007; Arnaiz and Sperling 2011) along with the annotations. Ortho-paralog relationships as well as functional annotations were retrieved from the supplementary information provided in McGrath et al. (2014a, 2014b). RNAseq data for all four species were downloaded and processed as in Gout and Lynch (2015).

The genome of P. multimicronucleatum was assembled using Celera (Miller et al. 2008) and annotated using EuGene (Foissac et al. 2008) as previously described (McGrath et al. 2014a, 2014b). A detailed description of the sequencing, assembly, and annotation of P. multimicronucleatum can be found in Marinov et al. (in preparation). OMA (Orthology MAtrix) (Altenhoff et al. 2015) was used to find orthologs between P. caudatum and P. multimicronucleatum. A total of 11,002 genes of P. caudatum were found to have 1-1 orthologs in P. multimicronucleatum, and only 1-1 orthologs were used for the current analysis.

Mapping of Reads and Detection of SNPs and Indels

Sequenced paired-end reads were mapped to the reference genomes for each species using BWA (bwa mem) (Li and Durbin 2010) with default parameters, and the mpileup function in Samtools (Li et al. 2009) was used to generate mpileup files for each individual. Two distinct pipelines were employed to identify SNPs to ensure detection accuracy: Bcftools-Vcftools (Li 2011) and the maximum likelihood Genotype Frequency Estimator (referred to as GFE) (Lynch 2009; Maruki and Lynch 2015). Both pipelines produced highly overlapping sets of SNPs (supplementary table S2, Supplementary Material online). For both pipelines, sites were only considered for SNP detection and further analysis if the mapping quality was above 30, base quality was above 20, per-base alignment quality was above 15, and the sum of the depth of coverage for all individuals was at least 5 times the number of individuals and less than twice the average population coverage. SNPs were called by bcftools if the SNP quality score was above 30 and by GFE if the significance score was > 5.91 (P < 0.05). The exact parameters of the Bcftools-Vcftools pipeline to call SNPs and indels are given below.

samtools mpileup -q 30 -Q 20 -t DP -uvf reference_genome.fasta sorted.bam > species.vcf

bcftools call -f GQ -o species_snps.vcf -O v -V indels -cv species.vcf

bcftools call -f GQ -o species_indels.vcf -O v -V snps -cv species.vcf

# filtering snp and indel files for depth and biases

perl vcfutils.pl varFilter -d 50 -D 800 -1 -2 -3 -4 species_snps.vcf > species_snps.filtered1.vcf

perl vcfutils.pl varFilter -d 50 -D 800 -1 -2 -3 -4 species_indels.vcf > species_indels.filtered1.vcf

vcftools --vcf species_snps.filtered1.vcf --minQ 20 --minGQ 30 --minDP 4 --recode --recode-INFO-all --out species_snps.filtered2

vcftools --vcf species_indels.filtered1.vcf --minQ 20 --minGQ 30 --minDP 4 --recode --recode-INFO-all --out species_indels.filtered2

# filter out the recoded sites that are not polymorphic anymore, that is, AF1 = 1

bcftools view -q 0.001 species_snps.filtered2.recode.vcf -o species_snps.final.vcf

bcftools view -q 0.001 species_indels.filtered2.recode.vcf -o species_indels.final.vcf

Mitochondrial SNPs were called using the Bcftools-Vcftools pipeline with the same parameters as noted above.
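As an illustration of the population-level depth criterion used for site filtering, the following minimal Python sketch checks whether a site's summed depth falls within the accepted bounds. It is not part of the published pipeline; the function name, argument names, and example numbers are hypothetical, and the lower bound is interpreted here as a minimum.

```python
def passes_depth_filter(total_depth, n_individuals, mean_population_depth,
                        min_fold=5, max_fold=2):
    """Return True if a site's summed depth lies within the accepted bounds.

    total_depth: depth of coverage summed over all individuals at the site.
    mean_population_depth: genome-wide average of that summed depth.
    """
    lower = min_fold * n_individuals            # 5 reads per individual
    upper = max_fold * mean_population_depth    # less than twice the average
    return lower <= total_depth < upper


# Hypothetical example: 10 individuals, average summed depth of 400 reads.
print(passes_depth_filter(120, 10, 400))   # within bounds
print(passes_depth_filter(30, 10, 400))    # too shallow
print(passes_depth_filter(900, 10, 400))   # too deep
```

With these hypothetical numbers the bounds are 50 and 800 reads, which matches the -d 50 -D 800 arguments passed to varFilter above.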

Simulation of SNP Detection with Varying Levels of Divergence from the Reference

We simulated reads using wgsim (https://github.com/lh3/wgsim) for a diploid genome with varying levels of divergence from the reference genome of P. caudatum and P. multimicronucleatum, using average values of read length, total number of reads, and insert sizes.

For the P. caudatum reference genome:

wgsim -d 287 -s 113 -N 10201087 -1 151 -2 151 -r 0.10 -R 0.013 caudatum_assembly.fasta caud_0.10_1.fastq caud_0.10_2.fastq >& caud_0.10.info

For the P. multimicronucleatum reference genome:

wgsim -d 261 -s 102 -N 12154044 -1 151 -2 151 -r 0.10 -R 0.011 multimicronucleatum_assembly.fasta multi_0.10_1.fastq multi_0.10_2.fastq >& multi_0.10.info

Reads were simulated for r = 0.0, 0.05, 0.10, 0.15, 0.20, 0.25, and 0.30, where r represents the fraction of the genome with SNPs. The SNP detection pipeline described above was used to call SNPs, which were compared to the simulated set of SNPs (supplementary table S4, Supplementary Material online).
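The comparison between called and simulated SNPs can be summarised with simple set operations. The sketch below is only one possible way to do this and is not taken from the original analysis; the function name, the (scaffold, position) representation, and the example coordinates are all hypothetical.

```python
def compare_snp_calls(simulated, called):
    """Compare two collections of (scaffold, position) SNP coordinates."""
    simulated, called = set(simulated), set(called)
    true_pos = len(simulated & called)
    return {
        "true_positives": true_pos,
        "false_positives": len(called - simulated),
        "false_negatives": len(simulated - called),
        # fraction of simulated SNPs that were recovered
        "sensitivity": true_pos / len(simulated) if simulated else float("nan"),
        # fraction of called SNPs that were actually simulated
        "precision": true_pos / len(called) if called else float("nan"),
    }


# Hypothetical toy data.
truth = [("scaffold_1", 10), ("scaffold_1", 250),
         ("scaffold_2", 42), ("scaffold_2", 900)]
calls = [("scaffold_1", 10), ("scaffold_2", 42), ("scaffold_2", 77)]
summary = compare_snp_calls(truth, calls)
print(summary)
```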

Building Phylogenies

A consensus genome (bcftools view species_filtered.recode.vcf | vcfutils.pl vcf2fq > species.cns.fq) was obtained for each individual using bcftools, which represented heterozygous sites using DNA ambiguity codes. Gene nucleotide sequences that are present in single copy in the P. aurelia species and are also present in both outgroup species (P. caudatum and P. multimicronucleatum) were extracted, and multiple sequence alignments were generated using MUSCLE (Edgar 2004a, 2004b). This amounted to about 55 sequences (including the reference genomes) each in 883 sets of orthologous genes. RAxML 8.0.26 (Stamatakis 2014) was used to build a tree using the concatenation of all 883 aligned gene sequences (raxml -s concatenated_sequences.fasta -n gene -m GTRGAMMA -T 4 -p 7). In addition, individual gene trees (raxml -s gene.fasta -n gene -m GTRGAMMA -T 4 -p 7) were built, which were then used to create a consensus tree (raxml -J MR -z trees.newick -T 4 -m GTRGAMMA) (Aberer et al. 2010).
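To make the use of ambiguity codes concrete, the following Python sketch shows how a diploid genotype maps to a single IUPAC nucleotide code. It only illustrates the encoding and is not the vcf2fq implementation; the function names and the toy genotype dictionary are hypothetical.

```python
# Standard IUPAC codes for single bases and two-base ambiguities.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def consensus_base(allele1, allele2):
    """Return the IUPAC code representing a diploid genotype."""
    return IUPAC[frozenset((allele1.upper(), allele2.upper()))]


def consensus_sequence(reference, genotypes):
    """Apply genotypes ({0-based position: (allele1, allele2)}) to a reference."""
    seq = list(reference)
    for pos, (a1, a2) in genotypes.items():
        seq[pos] = consensus_base(a1, a2)
    return "".join(seq)


# Hypothetical example: a C/T heterozygote and a G/G homozygote.
print(consensus_sequence("ACGTAC", {1: ("C", "T"), 4: ("G", "G")}))
```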

The mitochondrial phylogeny was built using the same parameters as above by concatenating eight sets of orthologous genes (NADH_dehydrogenase_subunit_1_a, NADH_dehydrogenase_subunit_5, Ymf56orf78, Ymf67orf265_1, Ymf68orf393, apocytochrome_b, cytochrome_c_oxidase_subunit_1, cytochrome_c_oxidase_subunit_2, ribosomal_protein_S14) that were ubiquitously present across all sampled species.

Principal Component Analysis

Genotypes for all individuals were obtained from the vcf files generated by vcftools for all scaffolds. Only the sites that had a genotype available for all individuals were used in the analysis, i.e., if even a single individual had missing data, the site was excluded from analysis. A total of 10,000 markers were used for each species. The first marker of each scaffold was picked randomly within the first 50 SNP positions, and the rest were selected so that all markers were uniformly distributed across all scaffolds. PCA was carried out on these markers by using the R (R-Core-Team 2014) command prcomp (https://stat.ethz.ch/R-manual/R-devel/library/stats/html/prcomp.html) and visualized using the first two components with the R graphing package ggplot2 (Wickham 2010).
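The marker-selection step can be sketched as follows. This is an illustrative Python example rather than the script used in the study: the function names are hypothetical, missing genotypes are assumed to be encoded as None, and taking every `step`-th SNP is only one assumed way of spacing markers uniformly along a scaffold.

```python
import random


def complete_sites(genotypes):
    """Keep sites where no individual has a missing genotype (None).

    genotypes: list of (site, [genotype per individual]) tuples.
    """
    return [site for site, calls in genotypes
            if all(call is not None for call in calls)]


def pick_markers(snp_positions, step, rng, first_window=50):
    """Pick evenly spaced markers on one scaffold.

    The first marker is drawn at random from the first `first_window` SNPs;
    every `step`-th SNP after it is then taken.
    """
    if not snp_positions:
        return []
    start = rng.randrange(min(first_window, len(snp_positions)))
    return snp_positions[start::step]


# Hypothetical scaffold with 200 SNPs, one every 5 bp.
positions = list(range(0, 1000, 5))
print(pick_markers(positions, step=20, rng=random.Random(42)))
```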

Calculation of Population Genetic Statistics

Nucleotide Diversity and Tajima's D

Nucleotide diversity was calculated as 2pqn/(n − 1), where p and q are the major and minor allele frequencies and n is the number of chromosomes. The major and minor allele frequencies were given for each site by GFE. Diversity at coding sites was partitioned into synonymous and non-synonymous components (e.g., 2-fold degenerate sites are one-third synonymous and two-thirds non-synonymous). Major allele frequencies were also extracted from vcftools and returned very similar results to those from GFE (supplementary table S3, Supplementary Material online).
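The diversity estimator given above translates directly into code. The following Python sketch is for illustration only; the function names and the example frequencies are hypothetical, and monomorphic sites are assumed to contribute zero to the average.

```python
def site_diversity(major_freq, n_chromosomes):
    """Per-site nucleotide diversity, 2pq * n / (n - 1)."""
    p = major_freq
    q = 1.0 - p
    n = n_chromosomes
    return 2.0 * p * q * n / (n - 1)


def mean_diversity(major_freqs, n_chromosomes, n_sites):
    """Average diversity over n_sites, given major allele frequencies at the
    polymorphic sites only (monomorphic sites contribute zero)."""
    total = sum(site_diversity(p, n_chromosomes) for p in major_freqs)
    return total / n_sites


# Hypothetical example: two polymorphic sites among 100 sites, 20 chromosomes.
print(site_diversity(0.9, 20))
print(mean_diversity([0.9, 0.5], 20, 100))
```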

A sliding window approach was used to calculate Tajima's D across the genome. Vcftools was used to obtain these values for non-overlapping windows of 1 kb in length. The program ms (Hudson 2002) was used to perform 100,000 simulations of 10 individuals with 1 kb locus length, and niceStats (https://github.com/kern-lab/shIC/blob/master/niceStats.c) was used to calculate Tajima's D from the ms output file. To obtain Tajima's D for different categories of sites, SNPs belonging to a particular category of sites (specified in Results) were selected, and vcftools was used to calculate Tajima's D for non-overlapping windows of 10 kb. Windows with 2 or more SNPs were used to calculate the mean, and only alternate windows were used to compare distributions of Tajima's D across different categories by Student's t-test, to maximise independence of sampling.
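For reference, the standard Tajima's D statistic for a single window can be computed as in the Python sketch below. This is a generic implementation of the textbook formula, offered as an illustration and not the code used by vcftools or niceStats; the function name and arguments are hypothetical.

```python
import math


def tajimas_d(n, seg_sites, pi):
    """Tajima's D for n sequences, seg_sites segregating sites, and pairwise
    diversity pi (summed over the window, not per site)."""
    if seg_sites == 0:
        return float("nan")          # undefined without segregating sites
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / math.sqrt(variance)


# Hypothetical window: 10 sequences, 5 segregating sites, all singletons
# (each singleton contributes 2/n to pi), giving a negative D.
print(tajimas_d(10, 5, 1.0))
```

Given a list of per-window values ordered along a scaffold, taking alternate windows corresponds to slicing with `values[::2]`.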

Plotting Average Expected Heterozygosity across Introns and Intergenic Regions

All introns (< 500 bp in length) were split midway and aligned according to the 5′ and 3′ splice sites. All intergenic regions (< 500 bp in the outgroup species and < 1,000 bp in the P. aurelia species) were split midway and aligned according to the start and stop codons. Note that these were not sequence alignments. At each site, the mean expected heterozygosity was calculated. Sites with an insufficient number of data points (< 200 data points per site for introns; for intergenic regions, < 100 data points per site in the outgroups and < 1,000 data points per site in the P. aurelia species) were discarded.
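The positional averaging described here can be sketched in Python as follows. This is an assumed reading of the procedure, not the authors' script: each region is split at its midpoint, values are indexed by distance from the nearer end, and offsets supported by too few regions are dropped. All names and the toy values are hypothetical.

```python
from collections import defaultdict


def positional_means(regions, min_points):
    """Average per-site values by distance from each end of a region.

    regions: list of lists of per-site heterozygosity values.
    Returns (from_start, from_end): dicts mapping offset -> mean, keeping
    only offsets supported by at least `min_points` regions.
    """
    start_vals, end_vals = defaultdict(list), defaultdict(list)
    for values in regions:
        half = len(values) // 2
        for offset, value in enumerate(values[:half]):
            start_vals[offset].append(value)
        for offset, value in enumerate(reversed(values[half:])):
            end_vals[offset].append(value)

    def summarise(table):
        return {offset: sum(vals) / len(vals)
                for offset, vals in sorted(table.items())
                if len(vals) >= min_points}

    return summarise(start_vals), summarise(end_vals)


# Hypothetical toy regions of different lengths.
toy = [[0.1, 0.2, 0.3, 0.4], [0.3, 0.4], [0.5, 0.0, 0.0, 0.0, 0.1, 0.3]]
from_start, from_end = positional_means(toy, min_points=2)
print(from_start)
print(from_end)
```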

Calculating and Plotting Divergence across Introns and Intergenic Regions

Orthologous coding sequences along with introns were aligned using MUSCLE (Edgar 2004a). Intergenic regions flanked by orthologous genes on both sides were also aligned along with the flanking coding exon using MUSCLE. For each species separately, divergence for different positions (with respect to the start and stop sites) was averaged across all introns and intergenic regions. The divergence values were corrected for multiple substitutions by the Jukes-Cantor sequence substitution model and plotted in the same manner as described above.
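The Jukes-Cantor correction converts an observed proportion of differing sites, p, into an estimated number of substitutions per site, d = −(3/4) ln(1 − 4p/3). A minimal Python sketch of this standard formula is given below; the function name is hypothetical.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor corrected distance from the observed proportion p of
    differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# The correction is small at low divergence and grows as p approaches 0.75.
for observed in (0.01, 0.10, 0.30):
    print(observed, jukes_cantor(observed))
```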

GO Analysis

All gene ontology enrichment analyses were performed using the BiNGO plugin for Cytoscape 3.2.1 (Shannon et al. 2003; Maere et al. 2005), using the hypergeometric test to assess significance with Benjamini and Hochberg FDR correction for multiple testing (Benjamini and Hochberg 1995). The latest ontology files (.obo) were downloaded from www.geneontology.org, and custom annotation files were prepared as required using Python scripts.
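The statistical test behind this kind of enrichment analysis can be illustrated with a short Python sketch. It implements a generic one-sided hypergeometric test and the Benjamini-Hochberg adjustment; it is not BiNGO's code, and the function names and example counts are hypothetical.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the GO term."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical example: 3 of 5 selected genes carry a term found in 10 of 50 genes.
print(hypergeom_pvalue(3, 5, 10, 50))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```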

Identification of Heterozygous Regions in the P. aurelia Species

The fraction of heterozygous sites was calculated for 10 bp non-overlapping windows across the genome for all three P. aurelia species, and the top 1% of such regions with the highest fraction of heterozygous sites were identified as heterozygous regions.
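A minimal Python sketch of this windowing step is shown below. It is illustrative only: the function name, the 0-based coordinates, and the handling of ties (resolved in favour of the window nearest the scaffold start) are assumptions, not details taken from the original analysis.

```python
def heterozygous_windows(het_sites, scaffold_length, window=10, top_fraction=0.01):
    """Return start coordinates of the windows with the most heterozygous sites.

    het_sites: 0-based positions of heterozygous sites on one scaffold.
    Because all windows have equal size, ranking by count is equivalent to
    ranking by the fraction of heterozygous sites.
    """
    n_windows = scaffold_length // window
    counts = [0] * n_windows
    for pos in het_sites:
        idx = pos // window
        if idx < n_windows:              # ignore a trailing partial window
            counts[idx] += 1
    ranked = sorted(range(n_windows), key=lambda i: counts[i], reverse=True)
    n_top = max(1, int(n_windows * top_fraction))
    return sorted(i * window for i in ranked[:n_top])


# Hypothetical scaffold of 1,000 bp (100 windows, so the top 1% is one window).
print(heterozygous_windows([5, 120, 121, 125, 700], 1000))
```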

Mitochondrial Genome Assembly and Annotation

Mitochondrial genomes were assembled as follows. Sequence reads were assembled using SPAdes (version 3.5.0) (Bankevich et al. 2012) after removing potential adapter sequence with Trimmomatic (version 0.33) (Bolger et al. 2014). Mitochondrial contigs were identified from the resulting assemblies by BLAST searches against the published P. caudatum and P. tetraurelia mitochondrial genomes (Pritchard et al. 1990; Barth and Berendonk 2011).

Protein coding genes were identified by generating all ORFs longer than 60 aa in all six frames using the Mold, Protozoan, and Coelenterate Mitochondrial Code (i.e., UGA codes for W instead of being a stop codon) and all alternative start codons specific to Paramecium (AUU, AUA, AUG, AUC, GUG, and GUA), and retaining the longest ORFs associated with each stop codon. BLASTP (Altschul et al. 1997) was then used to identify homologs of annotated mitochondrial proteins in P. tetraurelia and P. caudatum. Additional ORFs were identified by imposing the requirement that their length exceeds 100 aa, and were subsequently annotated using BLASTP against the Non-redundant protein sequences (nr) database and HMMER 3.0 (Eddy 2011) scans against the PFAM 27.0 database (Finn et al. 2014).
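The ORF-generation step can be illustrated with the following Python sketch. It is a simplified stand-in for whatever tool was actually used: sequences are handled as DNA (so U is written as T), TGA is treated as a sense codon, and for each stop codon the most upstream in-frame start codon after the previous stop defines the longest ORF. Function names and coordinates (0-based, and relative to the reverse complement for the minus strand) are hypothetical conventions.

```python
STOPS = {"TAA", "TAG"}                                 # TGA encodes Trp here
STARTS = {"ATT", "ATA", "ATG", "ATC", "GTG", "GTA"}    # alternative start codons
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def orfs_in_frame(seq, frame, min_codons):
    """Yield (start, end) of the longest ORF ending at each stop codon."""
    first_start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in STOPS:
            # ORF length in codons excludes the stop codon itself
            if first_start is not None and (i - first_start) // 3 > min_codons:
                yield first_start, i + 3
            first_start = None
        elif first_start is None and codon in STARTS:
            first_start = i


def six_frame_orfs(seq, min_codons=60):
    """Return (strand, frame, start, end) for ORFs longer than min_codons."""
    seq = seq.upper()
    reverse = seq.translate(COMPLEMENT)[::-1]
    found = []
    for strand, strand_seq in (("+", seq), ("-", reverse)):
        for frame in range(3):
            for start, end in orfs_in_frame(strand_seq, frame, min_codons):
                found.append((strand, frame, start, end))
    return found


# Hypothetical sequence: ATT start, 61 internal TGA codons (read as Trp), TAA stop.
print(six_frame_orfs("ATT" + "TGA" * 61 + "TAA"))
```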

tRNA genes were annotated with tRNAscan-SE 1.21 (Schattner et al. 2005) using the "Mito/Chloroplast" source.

Supplementary Material

Supplementary data are available at Molecular Biology and Evolution online.

Acknowledgments

We thank Lydia Bright and Jean-Francois Gout for helpful comments on the manuscript, Matthew S. Ackerman for insightful discussions related to the project, and the anonymous reviewers and editor for their valuable suggestions. We would also like to thank Kelley Thomas for providing help with the sequencing, and E. Przybos for kindly providing several Paramecium strains and the species identification of some so far unspecified strains. This work was financially supported by the National Science Foundation (MCB-1050161 and DEB-1257806), the European Commission FP7-PEOPLE-2009-IRSES project CINAR PATHOBACTER (project number 247658), and the CNRS research network GDRE "Paramecium Genome Dynamics and Evolution". T.U.B. also acknowledges funding of the "support the best" of the ZUK, TU Dresden, and the Deutsche Forschungsgemeinschaft (DFG).

References

Abbott RJ, Gomes MF. 1989. Population genetic-structure and outcrossing rate of Arabidopsis-Thaliana (L.) Heynh. Heredity 62:411–418.

Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, McVean GA. 2010. A map of human genome variation from population-scale sequencing. Nature 467:1061–1073.

Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA. 2012. An integrated map of genetic variation from 1,092 human genomes. Nature 491:56–65.

Aberer AJ, Pattengale ND, Stamatakis A. 2010. Parallelized phylogenetic post-analysis on multi-core architectures. J Comput Sci-Neth. 1:107–114.

Ackermann AA, Panunzi LG, Cosentino RO, Sanchez DO, Aguero F. 2012. A genomic scale map of genetic diversity in Trypanosoma cruzi. BMC Genomics 13:736.

Altenhoff AM, Skunca N, Glover N, Train CM, Sueki A, Pilizota I, Gori K, Tomiczek B, Muller S, Redestig H, et al. 2015. The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Res. 43:D240–D249.

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389–3402.

Arnaiz O, Cain S, Cohen J, Sperling L. 2007. ParameciumDB: a community resource that integrates the Paramecium tetraurelia genome sequence with genetic data. Nucleic Acids Res. 35:D439–D444.

Arnaiz O, Mathy N, Baudry C, Malinsky S, Aury JM, Wilkes CD, Garnier O, Labadie K, Lauderdale BE, Le Mouel A, et al. 2012. The Paramecium germline genome provides a niche for intragenic parasitic DNA: evolutionary dynamics of internal eliminated sequences. PLoS Genet. 8:e1002984.

Arnaiz O, Sperling L. 2011. ParameciumDB in 2011: new tools and new data for functional and comparative genomics of the model ciliate Paramecium tetraurelia. Nucleic Acids Res. 39:D632–D636.

Assefa S, Lim C, Preston MD, Duffy CW, Nair MB, Adroub SA, Kadir KA, Goldberg JM, Neafsey DE, Divis P, et al. 2015. Population genomic structure and adaptation in the zoonotic malaria parasite Plasmodium knowlesi. Proc Natl Acad Sci U S A. 112:13027–13032.

Aury JM, Jaillon O, Duret L, Noel B, Jubin C, Porcel BM, Segurens B, Daubin V, Anthouard V, Aiach N, et al. 2006. Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature 444:171–178.

Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR. 2015. A global reference for human genetic variation. Nature 526:68–74.

Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, et al. 2012. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 19:455–477.

Barth D, Berendonk TU. 2011. The mitochondrial genome sequence of the ciliate Paramecium caudatum reveals a shift in nucleotide composition and codon usage within the genus Paramecium. BMC Genomics 12:272.

Barth D, Krenek S, Fokin SI, Berendonk TU. 2006. Intraspecific genetic variation in Paramecium revealed by mitochondrial cytochrome C oxidase I sequences. J Eukaryot Microbiol. 53:20–25.

Bazin E, Glemin S, Galtier N. 2006. Population size does not influence mitochondrial genetic diversity in animals. Science 312:570–572.

Beale G, Preer JR. 2008. Paramecium: genetics and epigenetics. Boca Raton (FL): CRC Press.

Begun DJ, Aquadro CF. 1992. Levels of naturally-occurring DNA polymorphism correlate with recombination rates in Drosophila melanogaster. Nature 356:519–520.

Begun DJ, Holloway AK, Stevens K, Hillier LW, Poh YP, Hahn MW, Nista PM, Jones CD, Kern AD, Dewey CN, et al. 2007. Population genomics: whole-genome analysis of polymorphism and divergence in Drosophila simulans. PLoS Biol. 5:e310.

Benjamini Y, Hochberg Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B Met. 57:289–300.

Betermier M. 2004. Large-scale genome remodelling by the developmentally programmed elimination of germ line sequences in the ciliate Paramecium. Res Microbiol. 155:399–408.

Birky CW, Walsh JB. 1988. Effects of linkage on rates of molecular evolution. Proc Natl Acad Sci U S A. 85:6414–6418.

Birnbaum RY, Clowney EJ, Agamy O, Kim MJ, Zhao J, Yamanaka T, Pappalardo Z, Clarke SL, Wenger AM, Nguyen L, et al. 2012. Coding exons function as tissue-specific enhancers of nearby genes. Genome Res. 22:1059–1068.

Boel G, Letso R, Neely H, Price WN, Wong KH, Su M, Luff JD, Valecha M, Everett JK, Acton TB, et al. 2016. Codon influence on protein expression in E. coli correlates with mRNA levels. Nature 529:358–363.

Bolger AM, Lohse M, Usadel B. 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30:2114–2120.

Brandvain Y, Wright SI. 2016. The limits of natural selection in a nonequilibrium world. Trends Genet. 32:201–210.

Brunk CF, Lee LC, Tran AB, Li J. 2003. Complete sequence of the mitochondrial genome of Tetrahymena thermophila and comparative methods for identifying highly divergent genes. Nucleic Acids Res. 31:1673–1682.

Burger G, Zhu Y, Littlejohn TG, Greenwood SJ, Schnare MN, Lang BF, Gray MW. 2000. Complete sequence of the mitochondrial genome of Tetrahymena pyriformis and comparison with Paramecium aurelia mitochondrial DNA. J Mol Biol. 297:365–380.

Burki F. 2014. The eukaryotic tree of life from a global phylogenomic perspective. Cold Spring Harb Perspect Biol. 6:a016147.

Cao J, Schneeberger K, Ossowski S, Gunther T, Bender S, Fitz J, Koenig D, Lanz C, Stegle O, Lippert C, et al. 2011. Whole-genome sequencing of multiple Arabidopsis thaliana populations. Nat Genet. 43:956–963.

Catania F, Wurmser F, Potekhin AA, Przybos E, Lynch M. 2009. Genetic diversity in the Paramecium aurelia species complex. Mol Biol Evol. 26:421–431.

Chalker DL, Yao MC. 2011. DNA elimination in ciliates: transposon domestication and genome surveillance. Annu Rev Genet. 45:227–246.

Charlesworth D. 2003. Effects of inbreeding on the genetic diversity of populations. Philos T Roy Soc B. 358:1051–1070.

Charlesworth B, Morgan MT, Charlesworth D. 1993. The effect of deleterious mutations on neutral molecular variation. Genetics 134:1289–1303.

Chia JM, Song C, Bradbury PJ, Costich D, de Leon N, Doebley J, Elshire RJ, Gaut B, Geller L, Glaubitz JC, et al. 2012. Maize HapMap2 identifies extant variation from a genome in flux. Nat Genet. 44:803–807.

Comeron JM, Williford A, Kliman RM. 2008. The Hill-Robertson effect: evolutionary consequences of weak selection and linkage in finite populations. Heredity 100:19–31.

Dey A, Chan CK, Thomas CG, Cutter AD. 2013. Molecular hyperdiversity defines populations of the nematode Caenorhabditis brenneri. Proc Natl Acad Sci U S A. 110:11056–11060.

Dong X, Navratilova P, Fredman D, Drivenes O, Becker TS, Lenhard B. 2010. Exonic remnants of whole-genome duplication reveal cis-regulatory function of coding exons. Nucleic Acids Res. 38:1071–1085.

Downing T, Imamura H, Decuypere S, Clark TG, Coombs GH, Cotton JA, Hilley JD, de Doncker S, Maes I, Mottram JC, et al. 2011. Whole genome sequencing of multiple Leishmania donovani clinical isolates provides insights into population structure and mechanisms of drug resistance. Genome Res. 21:2143–2156.

Drummond DA, Bloom JD, Adami C, Wilke CO, Arnold FH. 2005. Why highly expressed proteins evolve slowly. Proc Natl Acad Sci U S A. 102:14338–14343.

Eddy SR. 2011. Accelerated profile HMM searches. PLoS Comput Biol. 7:e1002195.

Edgar RC. 2004a. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5:113.

Edgar RC. 2004b. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32:1792–1797.

Eichenlaub MP, Ettwiller L. 2011. De novo genesis of enhancers in vertebrates. PLoS Biol. 9:e1001188.

Ellison CE, Hall C, Kowbel D, Welch J, Brem RB, Glass NL, Taylor JW. 2011. Population genomics and local adaptation in wild isolates of a model microbial eukaryote. Proc Natl Acad Sci U S A. 108:2831–2836.

Elyashiv E, Bullaughey K, Sattath S, Rinott Y, Przeworski M, Sella G. 2010. Shifts in the intensity of purifying selection: an analysis of genome-wide polymorphism data from two closely related yeast species. Genome Res. 20:1558–1573.

Engelthaler DM, Hicks ND, Gillece JD, Roe CC, Schupp JM, Driebe EM, Gilgado F, Carriconde F, Trilles L, Firacative C, et al. 2014. Cryptococcus gattii in North American Pacific Northwest: whole-population genome analysis provides insights into species evolution and dispersal. mBio 5:e01464–e01414.

Fawcett JA, Iida T, Takuno S, Sugino RP, Kado T, Kugou K, Mura S, Kobayashi T, Ohta K, Nakayama J, et al. 2014. Population genomics of the fission yeast Schizosaccharomyces pombe. PLoS ONE 9:e104241.

Fenchel T, Finlay BJ. 2004. The ubiquity of small species: patterns of local and global diversity. Bioscience 54:777–784.

Finlay BJ. 2002. Global dispersal of free-living microbial eukaryote species. Science 296:1061–1063.

Finlay BJ, Esteban GF, Fenchel T. 2004. Protist diversity is different? Protist 155:15–22.

Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J, et al. 2014. Pfam: the protein families database. Nucleic Acids Res. 42:D222–D230.

Foissac S, Gouzy J, Rombauts S, Mathe C, Amselem J, Sterck L, Van de Peer Y, Rouze P, Schiex T. 2008. Genome annotation in plants and fungi: EuGene as a model platform. Curr Bioinform. 3:87–97.

Foissner W. 1999. Protist diversity: estimates of the near-imponderable. Protist 150:363–368.

Foissner W. 2006. Biogeography and dispersal of micro-organisms: a review emphasizing protists. Acta Protozool. 45:111–136.

Foissner W. 2008. Protist diversity and distribution: some basic considerations. Biodivers Conserv. 17:235–242.

Foissner W. 2011. Dispersal of protists: the role of cysts and human introductions. In: Fontaneto D, editor. Biogeography of microscopic organisms: is everything small everywhere? Cambridge University Press. p. 61–87.

Foissner W, Chao A, Katz LA. 2008. Diversity and geographic distribution of ciliates (Protista: Ciliophora). Biodivers Conserv. 17:345–363.

Fournier-Level A, Korte A, Cooper MD, Nordborg M, Schmitt J, Wilczek AM. 2011. A map of local adaptation in Arabidopsis thaliana. Science 334:86–89.

Galtier N, Nabholz B, Glemin S, Hurst GDD. 2009. Mitochondrial DNA as a marker of molecular diversity: a reappraisal. Mol Ecol. 18:4541–4550.

Gout JF, Kahn D, Duret L, Paramecium Post-Genomics C. 2010. The relationship among gene expression, the evolution of gene dosage, and the rate of protein evolution. PLoS Genet. 6:e1000944.

Gout JF, Lynch M. 2015. Maintenance and loss of duplicated genes by dosage subfunctionalization. Mol Biol Evol. 32:2141–2148.

Gruchy DF. 1955. The breeding system and distribution of Tetrahymena pyriformis. J Protozool. 2:178–185.

Gustincich S, Manfioletti G, Del Sal G, Schneider C, Carninci P. 1991. A fast method for high-quality genomic DNA extraction from whole human blood. BioTechniques 11:298–300, 302.

Hancock AM, Witonsky DB, Alkorta-Aranburu G, Beall CM, Gebremedhin A, Sukernik R, Utermann G, Pritchard JK, Coop G, Di Rienzo A. 2011. Adaptations to climate-mediated selective pressures in humans. PLoS Genet. 7:e1001375.

Hiwatashi K. 1968. Determination and inheritance of mating type in Paramecium caudatum. Genetics 58:373–386.

Hori M, Tomikawa I, Przybos E, Fujishima M. 2006. Comparison of the evolutionary distances among syngens and sibling species of Paramecium. Mol Phylogenet Evol. 38:697–704.

Huang X, Kurata N, Wei X, Wang ZX, Wang A, Zhao Q, Zhao Y, Liu K, Lu H, Li W, et al. 2012. A map of rice genome variation reveals the origin of cultivated rice. Nature 490:497–501.

Huang X, Zhao Q, Han B. 2015. Comparative population genomics reveals strong divergence and infrequent introgression between Asian and African rice. Mol Plant. 8:958–960.

Hudson RR. 2002. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18:337–338.

Hudson RR, Turelli M. 2003. Stochasticity overrules the "three-times rule": genetic drift, genetic draft, and coalescence times for nuclear loci versus mitochondrial DNA. Evolution 57:182–190.

Jaillon O, Bouhouche K, Gout JF, Aury JM, Noel B, Saudemont B, Nowacki M, Serrano V, Porcel BM, Segurens B, et al. 2008. Translational control of intron splicing in eukaryotes. Nature 451:359–362.

James JE, Piganeau G, Eyre-Walker A. 2016. The rate of adaptive evolution in animal mitochondria. Mol Ecol. 25:67–78.

Joy DA, Feng X, Mu J, Furuya T, Chotivanich K, Krettli AU, Ho M, Wang A, White NJ, Suh E, et al. 2003. Early origin and recent expansion of Plasmodium falciparum. Science 300:318–321.

Kaplan NL, Hudson RR, Langley CH. 1989. The hitchhiking effect revisited. Genetics 123:887–899.

Katz LA, McManus GB, Snoeyenbos-West OLO, Griffin A, Pirog K, Costas B, Foissner W. 2005. Reframing the 'Everything is everywhere' debate: evidence for high gene flow and diversity in ciliate morphospecies. Aquat Microb Ecol. 41:55–65.

Katz LA, Snoeyenbos-West O, Doerder FP. 2006. Patterns of protein evolution in Tetrahymena thermophila: implications for estimates of effective population size. Mol Biol Evol. 23:608–614.

Krenek S, Berendonk TU, Fokin SI. 2015. New Paramecium (Ciliophora, Oligohymenophorea) congeners shape our view on its biodiversity. Org Divers Evol. 15:215–233.

Krenek S, Berendonk TU, Petzoldt T. 2011. Thermal performance curves of Paramecium caudatum: a model selection approach. Eur J Protistol. 47:124–137.

Krenek S, Petzoldt T, Berendonk TU. 2012. Coping with temperature at the warm edge–patterns of thermal adaptation in the microbial eukaryote Paramecium caudatum. PLoS One 7:e30598.

Kudla G, Murray AW, Tollervey D, Plotkin JB. 2009. Coding-sequence determinants of gene expression in Escherichia coli. Science 324:255–258.

Kvitek DJ, Will JL, Gasch AP. 2008. Variations in stress sensitivity and genomic expression in diverse S. cerevisiae isolates. PLoS Genet. 4:e1000223.

Lahr DJ, Laughinghouse HDt, Oliverio AM, Gao F, Katz LA. 2014. How discordant morphological and molecular evolution among microorganisms can revise our notions of biodiversity on Earth. Bioessays 36:950–959.

Langley CH, Stevens K, Cardeno C, Lee YC, Schrider DR, Pool JE, Langley SA, Suarez C, Corbett-Detig RB, Kolaczkowski B, et al. 2012. Genomic variation in natural populations of Drosophila melanogaster. Genetics 192:533–598.

Leffler EM, Bullaughey K, Matute DR, Meyer WK, Segurel L, Venkat A, Andolfatto P, Przeworski M. 2012. Revisiting an old riddle: what determines genetic diversity levels within species? PLoS Biol. 10:e1001388.

Li H. 2011. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27:2987–2993.

Li H, Durbin R. 2010. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26:589–595.

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Genome Project Data Processing S. 2009. The sequence alignment/map format and SAMtools. Bioinformatics 25:2078–2079.

Lim LP, Burge CB. 2001. A computational analysis of sequence features involved in recognition of short introns. Proc Natl Acad Sci U S A. 98:11193–11198.

Liti G, Carter DM, Moses AM, Warringer J, Parts L, James SA, Davey RP, Roberts IN, Burt A, Koufopanou V, et al. 2009. Population genomics of domestic and wild yeasts. Nature 458:337–341.

Loveless MD, Hamrick JL. 1984. Ecological determinants of genetic-structure in plant-populations. Annu Rev Ecol Syst. 15:65–95.

Lynch M. 2007. The origins of genome architecture. Sunderland (MA): Sinauer Associates.

Lynch M. 2009. Estimation of allele frequencies from high-coverage genome-sequencing projects. Genetics 182:295–301.

Lynch M, Field MC, Goodson HV, Malik HS, Pereira-Leal JB, Roos DS, Turkewitz AP, Sazer S. 2014. Evolutionary cell biology: two origins, one objective. Proc Natl Acad Sci U S A. 111:16990–16994.

Lynch M, Sung W, Morris K, Coffey N, Landry CR, Dopman EB, Dickinson WJ, Okamoto K, Kulkarni S, Hartl DL, et al. 2008. A genome-wide view of the spectrum of spontaneous mutations in yeast. Proc Natl Acad Sci U S A. 105:9272–9277.

Maere S, Heymans K, Kuiper M. 2005. BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics 21:3448–3449.

Maguire B, Belk D. 1967. Paramecium transport by land snails. J Protozool. 14:445–447.

Maruki T, Lynch M. 2015. Genotype-frequency estimation from high-throughput sequencing data. Genetics 201:473–486.

McGrath CL, Gout JF, Doak TG, Yanagi A, Lynch M. 2014a. Insights into three whole-genome duplications gleaned from the Paramecium caudatum genome sequence. Genetics 197:1417–1428.

McGrath CL, Gout JF, Johri P, Doak TG, Lynch M. 2014b. Differential retention and divergent resolution of duplicate genes following whole-genome duplication. Genome Res. 24:1665–1675.

Miller JR, Delcher AL, Koren S, Venter E, Walenz BP, Brownley A, Johnson J, Li K, Mobarry C, Sutton G. 2008. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics 24:2818–2824.

Moradian MM, Beglaryan D, Skozylas JM, Kerikorian V. 2007. Complete mitochondrial genome sequence of three Tetrahymena species reveals mutation hot spots and accelerated nonsynonymous substitutions in Ymf genes. PLoS ONE 2:e650.

Nabholz B, Glemin S, Galtier N. 2009. The erratic mitochondrial clock: variations of mutation rate, not population size, affect mtDNA diversity across birds and mammals. BMC Evol Biol. 9:54.

Nabholz B, Mauffrey JF, Bazin E, Galtier N, Glemin S. 2008. Determination of mitochondrial genetic diversity in mammals. Genetics 178:351–361.

Neafsey DE, Galinsky K, Jiang RH, Young L, Sykes SM, Saif S, Gujja S, Goldberg JM, Young S, Zeng Q, et al. 2012. The malaria parasite Plasmodium vivax exhibits greater genetic diversity than Plasmodium falciparum. Nat Genet. 44:1046–1050.

Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, Indap A, King KS, Bergmann S, Nelson MR, et al. 2008. Genes mirror geography within Europe. Nature 456:98–101.

Ohta T. 1974. Mutational pressure as main cause of molecular evolution and polymorphism. Nature 252:351–354.

Ohta T. 1992. The nearly neutral theory of molecular evolution. Annu Rev Ecol Syst. 23:263–286.

Pawlowski J Audic S Adl S Bass D Belbahri L Berney C Bowser SSCepicka I Decelle J Dunthorn M et al 2012 CBOL protist workinggroup barcoding eukaryotic richness beyond the animal plant andfungal kingdoms PLoS Biol 10e1001419

Piganeau G Eyre-Walker A 2009 Evidence for variation in the effectivepopulation size of animal mitochondrial DNA PLoS One 4e4396

Pool JE Nielsen R 2007 Population size changes reshape genomic pat-terns of diversity Evolution 613001ndash3006

Pophaly SD Tellier A 2015 Population level purifying selection and geneexpression shape subgenome evolution in maize Mol Biol Evol323226ndash3235

Prescott DM 1994 The DNA of ciliated protozoa Microbiol Rev58233ndash267

Pritchard AE Seilhamer JJ Mahalingam R Sable CL Venuti SECummings DJ 1990 Nucleotide sequence of the mitochondrial ge-nome of Paramecium Nucleic Acids Res 18173ndash180

Przybos E 1975 Genetic studies of Paramecium jenningsi strains (DillerEarl 1958) Folia Biol (Krakow) 23425ndash471

Przybos E 2005 Recent data on the occurrence of species of theParamecium aurelia complex in Europe Folia Biol-Krakow 5361ndash63

Przybos E Barth D Berendonk TU 2008 The Paramecium aurelia speciescomplex frequency and co-occurrence across Europe Folia Biol(Krakow) 5677ndash81

Przybos E Rautian M Greczek-Stachura M Potekhin A 2007Polymorphism within Paramecium sexaurelia (Ciliophora

Oligohymenophorea) and description of a new stand of the speciesin China Folia Biol (Krakow) 55121ndash125

Przybos E, Surmacz M. 2010. New world-wide data on the distribution of species of the Paramecium aurelia complex (Ciliophora, Protozoa). Folia Biol-Krakow 58:185–188.

R-Core-Team. 2014. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

Redden H, Alper HS. 2015. The development and characterization of synthetic minimal yeast promoters. Nat Commun 6:7810.

Ritter DI, Dong Z, Guo S, Chuang JH. 2012. Transcriptional enhancers in protein-coding exons of vertebrate developmental genes. PLoS One 7:e35202.

Russell CB, Fraga D, Hinrichsen RD. 1994. Extremely short 20-33 nucleotide introns are the standard length in Paramecium tetraurelia. Nucleic Acids Res 22:1221–1225.

Salim HMW, Ring KL, Cavalcanti ARO. 2008. Patterns of codon usage in two ciliates that reassign the genetic code: Tetrahymena thermophila and Paramecium tetraurelia. Protist 159:283–298.

Schacherer J, Shapiro JA, Ruderfer DM, Kruglyak L. 2009. Comprehensive polymorphism survey elucidates population structure of Saccharomyces cerevisiae. Nature 458:342–345.

Schattner P, Brooks AN, Lowe TM. 2005. The tRNAscan-SE, snoscan and snoGPS web servers for the detection of tRNAs and snoRNAs. Nucleic Acids Res 33:W686–W689.

Schrider DR, Kern AD. 2015. Inferring selective constraint from population genomic data suggests recent regulatory turnover in the human brain. Genome Biol Evol 7:3511–3528.

Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. 2003. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13:2498–2504.

Simonsen KL, Churchill GA, Aquadro CF. 1995. Properties of statistical tests of neutrality for DNA polymorphism data. Genetics 141:413–429.

Smith DR, Lee RW. 2008. Nucleotide diversity in the mitochondrial and nuclear compartments of Chlamydomonas reinhardtii: investigating the origins of genome architecture. BMC Evol Biol 8:156.

Smith JM, Haigh J. 1974. The hitch-hiking effect of a favourable gene. Genetics Res 89:391–403.

Snoke MS, Berendonk TU, Barth D, Lynch M. 2006. Large global effective population sizes in Paramecium. Mol Biol Evol 23:2474–2479.

Sonneborn TM. 1970. Methods in Paramecium research. In: Methods Cell Biol. New York and London: Academic Press. p. 241–339.

Sonneborn TM. 1975. Paramecium aurelia complex of fourteen sibling species. T Am Microsc Soc 94:155–178.

Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30:1312–1313.

Stoeck T, Welter H, Seitz-Bender D, Kusch J, Schmidt HJ. 2000. ARDRA and RAPD-fingerprinting reject the sibling species concept for the ciliate Paramecium caudatum (Ciliophora, Protoctista). Zool Scr 29:75–82.

Sun M. 1999. Cleistogamy in Scutellaria indica (Labiatae): effective mating system and population genetic structure. Mol Ecol 8:1285–1295.

Swart EC, Nowacki M, Shum J, Stiles H, Higgins BP, Doak TG, Schotanus K, Magrini VJ, Minx P, Mardis ER, et al. 2012. The Oxytricha trifallax mitochondrial genome. Genome Biol Evol 4:136–154.

Tajima F. 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123:585–595.

Tishkoff SA, Reed FA, Ranciaro A, Voight BF, Babbitt CC, Silverman JS, Powell K, Mortensen HM, Hirbo JB, Osman M, et al. 2007. Convergent adaptation of human lactase persistence in Africa and Europe. Nat Genet 39:31–40.

Tsukii Y, Hiwatashi K. 1983. Genes controlling mating-type specificity in Paramecium caudatum: three loci revealed by intersyngenic crosses. Genetics 104:41–62.

Population Genomics of Paramecium Species. doi:10.1093/molbev/msx074. MBE

Tsukii Y, Hiwatashi K. 1985. Meiotic nondisjunction and aneuploids in intersyngenic hybrids of Paramecium caudatum. Genetics 111:779–794.

Tuller T, Zur H. 2015. Multiple roles of the coding sequence 5′ end in gene expression regulation. Nucleic Acids Res 43:13–28.

Voight BF, Kudaravalli S, Wen X, Pritchard JK. 2006. A map of recent positive selection in the human genome. PLoS Biol 4:e72.

Volkman SK, Sabeti PC, DeCaprio D, Neafsey DE, Schaffner SF, Milner DA Jr, Daily JP, Sarr O, Ndiaye D, Ndir O, et al. 2007. A genome-wide map of diversity in Plasmodium falciparum. Nat Genet 39:113–119.

Weir W, Capewell P, Foth B, Clucas C, Pountain A, Steketee P, Veitch N, Koffi M, De Meeus T, Kabore J, et al. 2016. Population genomics reveals the origin and asexual evolution of human infective trypanosomes. eLife 5:e11473.

Weisse T. 2008. Distribution and diversity of aquatic protists: an evolutionary and ecological perspective. Biodivers Conserv 17:243–259.

Wickham H. 2010. ggplot2: Elegant graphics for data analysis. New York: Springer-Verlag.

Zhang JZ, Yang JR. 2015. Determinants of the rate of protein sequence evolution. Nat Rev Genet 16:409–420.


coding sequence can be explained by a number of different hypotheses. 5′ ends of CDSs might be required to maintain low thermodynamic stability of mRNA secondary structure near the start codon for efficient translation initiation. Alternatively, CDS 5′ ends might possess fewer tRNA-adapted codons to slow down translation initially, which has been proposed to reduce ribosome collision-induced premature translation terminations and facilitate protein production (see Tuller and Zur 2015). Consistent with this, the 5′ and 3′ ends of protein-coding genes in bacteria (3235 codons) have been shown to be important for efficient translation (Kudla et al. 2009; Boel et al. 2016), and their mRNA folding energies can explain observed variation in expression levels (Kudla et al. 2009). In fact, it has been shown previously that the first few codons in proteins of P. tetraurelia and T. thermophila are, on average, comprised of significantly different numbers of effective codons than the rest of the protein (Salim et al. 2008). Thus, the role played by the first and last few bases of all coding sequences in translation might contribute to additional sequence constraints, resulting in reduced nucleotide diversity and divergence.

Another factor that might contribute to constraining the CDS ends is the presence of regulatory modules at the beginning and end of genes. Exonic enhancers have been observed in multiple species, including teleosts (Dong et al. 2010; Eichenlaub and Ettwiller 2011), humans (Birnbaum et al. 2012), and other vertebrates (Ritter et al. 2012), where they have been shown to regulate nearby genes. In the highly compact genomes of Paramecium, it is even more likely that regulatory regions extend beyond non-coding regions into the CDS. For instance, in P. caudatum, the median length of 5′ intergenic regions is 43 bp, which increases to 56 bp when both genes share the same 5′ intergenic region, suggesting the importance of upstream intergenic length in such compact genomes. Interestingly, 965 genes (7%) have 5′ intergenic regions shorter than 10 bp, and 3417 genes (25%) have 5′ intergenic regions shorter than 20 bp in P. caudatum. Similarly, in P. multimicronucleatum, the median length of the 5′ intergenic region is 90 bp, which increases to 127 bp when both genes share the same 5′ intergenic region. Again, there are 456 genes (3.5%) with less than 10 bp and 1760 genes (13%) with less than 20 bp comprising their upstream intergenic region. In fact, in both species, the 5′ intergenic region can be as small as 3 bp (found in 38 genes in P. caudatum and 18 genes in P. multimicronucleatum). An interesting question this raises is what the minimum number of bases required to encode regulatory information about transcription of genes in eukaryotes might be. A recent study in yeast has now created a high-strength minimal constitutive promoter that is only 116 bp in length (Redden and Alper 2015). It might be possible to observe even smaller promoter regions needed to drive expression patterns in Paramecium, especially in P. caudatum.

Spatial Patterns of Nucleotide Diversity across Non-Coding Regions are Not Driven by Mutation Alone

Although we have assumed that the diversity-normalized divergence patterns indicate the effect of negative selection, it is possible that the observed spatial patterns of nucleotide diversity across and flanking non-coding regions are driven entirely by mutation alone. One way to potentially distinguish between mutation and direct selection as the cause of these patterns is to evaluate the allele frequency of the variants observed at different site types. If the reduction in nucleotide diversity near the start and end of coding sequences is mostly a consequence of negative selection, we would expect a higher fraction of rare alleles at 4-fold degenerate sites under stronger constraints relative to that seen at 4-fold degenerate sites not under constraints. Instead, if mutation is the predominant driver of the extent of nucleotide diversity at these sites, then there should be no difference in the fraction of rare alleles between the 4-fold degenerate sites under stronger negative selection and those not under constraints. Because Tajima's D (Tajima 1989) is a good statistic to summarize such a shift in rare alleles, we compared the average Tajima's D across site categories to evaluate this.

In each Paramecium species, we first characterized sites genome-wide into five different categories: (1) splice sites (first and last 3 sites) in introns; (2) intergenic sites within 50 bp of start/stop sites; (3) intergenic sites at least 150 bp away from start/stop sites (should exclude most UTR regions); (4) 4-fold degenerate sites within 50 bp of start/stop sites; and (5) 4-fold degenerate sites at least 150 bp away from start/stop sites and 30 bp away from splice sites. We expected category 5 sites to be the most neutral. Interestingly, in most species, Tajima's D at splice sites, intergenic sites close to start/stop sites, and 4-fold degenerate sites close to start/stop sites is significantly smaller than that of 4-fold degenerate and intergenic sites far away from any functional elements (see supplementary table S7, Supplementary Material online, for P-values). This suggests that sites in categories 1, 2, and 4 are under relatively stronger purifying selection. Category 2 sites seem to possess significantly more rare alleles than category 3 sites consistently in all species, suggesting that intergenic regions very close to start/stop sites are under stronger purifying selection than intergenic sites further away from gene ends. We also find Tajima's D at intergenic sites (category 3) to be significantly higher than that for 4-fold degenerate sites (category 5) in most species. This could suggest either that intergenic sites are under some form of positive/balancing selection or that silent sites are not completely neutral and in fact experience some purifying selection.
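As an illustration of the statistic compared across site categories, the following Python sketch computes Tajima's D from the sample size, the number of segregating sites, and the mean number of pairwise differences, following the standard formulation of Tajima (1989). It is a minimal sketch rather than the pipeline used in this study: the function names and the toy allele counts are assumptions made for illustration, and it presumes that every site is called in all n sequences.

```python
import math


def pairwise_differences(minor_allele_counts, n):
    """Mean number of pairwise differences, summed over biallelic sites.

    minor_allele_counts: count of one allele at each segregating site
    n: number of sampled sequences (assumed identical at every site)
    """
    return sum(2.0 * k * (n - k) / (n * (n - 1)) for k in minor_allele_counts)


def tajimas_d(n, seg_sites, pi):
    """Tajima's D from n sequences, S segregating sites, and pairwise diversity pi."""
    if n < 4:
        raise ValueError("the variance term is zero for fewer than 4 sequences")
    if seg_sites < 1:
        raise ValueError("at least one segregating site is required")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)
    # Difference between the two estimators of theta, scaled by its standard deviation.
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / math.sqrt(variance)


if __name__ == "__main__":
    n = 10
    singletons = [1] * 10      # toy data: every variant is rare
    intermediate = [5] * 10    # toy data: every variant is at 50% frequency
    print(tajimas_d(n, 10, pairwise_differences(singletons, n)))
    print(tajimas_d(n, 10, pairwise_differences(intermediate, n)))
```

An excess of rare alleles (here, all singletons) gives a negative D, whereas an excess of intermediate-frequency alleles gives a positive D, which is the direction of the contrasts drawn between the site categories.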

Gene-Based Analysis of Patterns of Synonymous and Non-Synonymous Diversity

Selective constraint can vary among genes within the same species, and it could also be species-specific. To examine such variation in our dataset, we evaluated the distribution of πn/πs across all genes (fig. 6a). If we assume that synonymous polymorphisms are fitness neutral, then πn/πs is a measure of the relative functional constraints on non-synonymous polymorphisms and thus an indication of functional constraints on the protein. The majority of genes have πn/πs < 1.0, consistent with constraints on non-synonymous changes. About 98% of genes in P. tetraurelia, 96% in P. biaurelia, 96% in P. sexaurelia, 99% in both P. caudatum and its subset, and 99% in both P. multimicronucleatum and its subset have πn/πs < 1. Upon comparing the average πn/πs for the same set of orthologous genes across all species, we find the mean to be 0.10 for P. tetraurelia, 0.17 for P. biaurelia, 0.24 for P. sexaurelia, 0.12 [0.12] for P. caudatum [subset], and 0.17 [0.14] for P. multimicronucleatum [subset]. Among these species, P. sexaurelia has the highest average πn/πs, which suggests either relaxed selection because of a relatively small effective population size or more local adaptation. Local adaptation might result in fixation of more non-synonymous than synonymous differences between subpopulations, thus increasing πn/πs. Differences in πn/πs between species could also reflect genome-wide variation in the fraction of gene duplicates that are subfunctionalized, neofunctionalized, or nonfunctionalized. Alternatively, an elevation in average levels of πn/πs could be a consequence of a depression in the average levels of πs, which could be caused by stronger selection on synonymous sites or a recent bottleneck (Brandvain and Wright 2016) experienced by the species. Evidence that genes with higher πn/πs ratios are under more relaxed constraints is provided by a comparison of the average number of frame-shift causing indels occurring in the four sets of genes with (1) πn/πs ≤ 0.25, (2) 0.25 < πn/πs ≤ 0.5, (3) 0.5 < πn/πs ≤ 1.0, and (4) πn/πs > 1.0. As expected, we find an increase in the number of indels per site with increasing πn/πs (fig. 6b).
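The binning of genes by πn/πs described above can be sketched as follows. This is only an illustrative Python sketch: the per-gene field names (`pi_n`, `pi_s`, `syn_polymorphisms`, `frameshift_indels`, `length_bp`) are hypothetical, and pooling indels and sites within each bin is an assumption about how "indels per site" might be computed, not a description of the authors' code. The filter on synonymous polymorphisms mirrors the "more than five" threshold stated in the figure legends.

```python
from collections import defaultdict

# Upper bounds of the first three pi_n/pi_s bins; a fourth bin holds ratios > 1.0.
BIN_EDGES = (0.25, 0.5, 1.0)


def constraint_bin(ratio):
    """Return the 1-based bin index for a pi_n/pi_s ratio."""
    for index, upper in enumerate(BIN_EDGES, start=1):
        if ratio <= upper:
            return index
    return len(BIN_EDGES) + 1


def indels_per_site_by_bin(genes, min_syn_polymorphisms=6):
    """Pool frame-shift indels and gene lengths within each pi_n/pi_s bin."""
    totals = defaultdict(lambda: [0, 0])  # bin -> [indel count, total sites]
    for gene in genes:
        # Skip genes with too few synonymous polymorphisms or undefined ratios.
        if gene["syn_polymorphisms"] < min_syn_polymorphisms or gene["pi_s"] <= 0:
            continue
        bin_index = constraint_bin(gene["pi_n"] / gene["pi_s"])
        totals[bin_index][0] += gene["frameshift_indels"]
        totals[bin_index][1] += gene["length_bp"]
    return {b: indels / sites for b, (indels, sites) in sorted(totals.items()) if sites > 0}


if __name__ == "__main__":
    toy_genes = [  # invented values, for illustration only
        {"pi_n": 0.001, "pi_s": 0.01, "syn_polymorphisms": 10, "frameshift_indels": 0, "length_bp": 1000},
        {"pi_n": 0.02, "pi_s": 0.01, "syn_polymorphisms": 8, "frameshift_indels": 2, "length_bp": 1000},
    ]
    print(indels_per_site_by_bin(toy_genes))
```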

FIG. 6. (A) πn versus πs of each gene. The red line represents πn = πs. (B) The distribution of πn/πs of each gene across the genome. Only genes with more than five synonymous polymorphisms were used in this analysis. (C) Number of frame-shift causing indels per site in genes belonging to three gene categories: (1) πn/πs ≤ 0.5 (light grey), (2) 0.5 < πn/πs ≤ 1.0 (grey), and (3) πn/πs > 1.0 (black). Only genes with more than five synonymous polymorphisms were used in this analysis.

Paralogs and Orthologs Evolve under Similar Selective Constraints within and across Species

We evaluated the correlation between πn/πs across orthologs and within-species paralogs, which were created by the most recent WGD (fig. 7). Both paralogs and orthologs show a significant positive correlation within and across all species, suggesting that most WGD duplicates are under similar selective constraints. We also confirmed that these correlations are not being driven simply by selection on synonymous sites, as πs is not necessarily correlated between orthologs and within-species paralogs across all species (supplementary fig. S8, Supplementary Material online). Similar findings in the maize genome also suggested correlations between nucleotide diversity and divergence between WGD paralogs (Pophaly and Tellier 2015). Our results further suggest that most orthologs are under very similar evolutionary constraints across the Paramecium species. This is in agreement with a recent study in Drosophila, where πn and πs were found to be strongly correlated between D. melanogaster and D. simulans (Langley et al. 2012). Such a correlation was also observed between orthologs in S. cerevisiae and S. paradoxus (Elyashiv et al. 2010).

A possible mechanism governing the degree of selective constraint acting on proteins is their level of expression (Zhang and Yang 2015). Non-synonymous mutations are thought to have more severe fitness consequences in highly expressed genes than in weakly expressed ones (Drummond et al. 2005; Gout et al. 2010). Indeed, we find a highly significant negative correlation between expression and πn/πs in all species (fig. 8). However, only 6–13% of the variance is explained by expression, implying that additional factors may play a greater role in determining the degree of selective constraint acting on a gene. Simple biochemical properties of proteins, like the protein's function, tertiary structure, and the number of interacting partners, might determine the proportion of functionally important residues in the protein and in turn govern the degree of selection.
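To make the notion of "variance explained" concrete, the sketch below computes a correlation coefficient r between expression and πn/πs and reports r² as the fraction of variance explained. It assumes a Pearson correlation on log10-transformed values; the text does not state which correlation measure or transformation was used, so both are assumptions, as are the function names and the toy values.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def variance_explained(expression, pn_ps):
    """Return (r, r squared) for log10(expression) versus log10(pi_n/pi_s)."""
    log_expression = [math.log10(value) for value in expression]
    log_ratio = [math.log10(value) for value in pn_ps]
    r = pearson_r(log_expression, log_ratio)
    return r, r * r


if __name__ == "__main__":
    # Invented values in which pi_n/pi_s falls as expression rises.
    r, r_squared = variance_explained([1, 10, 100, 1000], [0.8, 0.4, 0.2, 0.1])
    print(r, r_squared)
```

Under this reading, an r² of 0.06 to 0.13 leaves most of the variation in πn/πs unaccounted for by expression, even when the correlation itself is highly significant.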

Our observation that orthologous genes across such deeply diverged species experience similar degrees of selective constraint (fig. 7) might imply a low rate of turnover of expression, protein structure, function, and number of interacting partners. This finding may also be in part due to similar ecologies, and therefore similar selective pressures, experienced by the Paramecium species examined here. It may thus be informative to compare closely related species that live in, or have recently experienced, different environments. For instance, Elyashiv et al. (2010) found significant differences in levels of gene-specific purifying selection (πn/πs) in the sake population of S. cerevisiae in comparison to other populations of S. cerevisiae as well as S. paradoxus. In fact, this deviation in selective constraints experienced by the sake population is complemented by differences in response to different stresses (Kvitek et al. 2008), suggesting that changes in environment might be reflected in gene-specific constraints.

FIG. 7. Correlation of the logarithm of πn/πs for paralogs within species (three diagonal plots for P. tetraurelia, P. biaurelia, and P. sexaurelia) and for orthologs between species pairwise [non-diagonal plots, with the plots below the diagonal (outlined red) representing the first set of WGD paralogs (T1, B1, S1) and the upper half of the matrix (outlined blue) representing the second set of paralogs (T2, B2, S2)]. The number in the top left corner of each plot is the slope of the linear regression in cases where the regression is significant (P < 0.01). Only genes with more than five synonymous polymorphisms were used in this analysis. The schematic tree in the bottom right corner shows the phylogenetic relationship between the species: P. tetraurelia (T), P. biaurelia (B), P. sexaurelia (S), P. caudatum (C), and P. multimicronucleatum (M).

FIG. 8. Negative correlation of πn/πs with expression levels of genes in four Paramecium species. All correlations are significant with P < 2.2 × 10⁻¹⁶. Only genes with more than five synonymous polymorphisms were used in this analysis.

Distribution of Tajima's D across the Genome

Contrary to genes under purifying selection, genes under balancing or positive selection might often be species-specific. We tested this hypothesis using Tajima's D, a statistic that uses the difference between nucleotide diversity and the number of segregating sites to detect deviations from neutrality. Because Tajima's D is sensitive to population size changes and population structure (Simonsen et al. 1995), we used an outlier approach. We calculated Tajima's D for non-overlapping 1-kb windows across the genome (supplementary fig. S9, Supplementary Material online) and looked for the presence of candidate genes in the outlier regions (0.25% of the two tails). The outlier regions with negative values of Tajima's D might represent genes under either weak purifying or positive selection (i.e., experiencing a recent sweep), whereas the ones with positive Tajima's D might be under balancing selection. Interestingly, both sets of outliers show an over-representation of genes involved in glycosylation, glycoprotein production and metabolism, and carbohydrate metabolism (supplementary table S8, Supplementary Material online) for all Paramecium species. For each species, the large functional overlap of genes in both tails suggests the important role of these processes in Paramecium growth. More importantly, this further suggests similar selective pressures acting genome-wide on all Paramecium species examined in this study. These conclusions can, however, be biased by gene length, as shorter genes might not have sufficient power for an accurate calculation of Tajima's D. Indeed, the outliers are all significantly longer than the rest of the genes in P. tetraurelia (outliers: 1567 bp; non-outliers: 1321 bp; P = 4.7 × 10⁻¹³), P. biaurelia (outliers: 1681 bp; non-outliers: 1372 bp; P < 2.2 × 10⁻¹⁶), and P. sexaurelia (outliers: 1860 bp; non-outliers: 1379 bp; P < 2.2 × 10⁻¹⁶). This bias is, however, much lower or not significant for P. caudatum (outliers: 1463 bp; non-outliers: 1383 bp; P = 7.3 × 10⁻³) and P. multimicronucleatum (outliers: 1509 bp; non-outliers: 1445 bp; P = 0.19), possibly because of higher genome-wide nucleotide diversity, indicating that, despite some bias, the results have true biological significance.
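The outlier approach described above can be sketched as a simple ranking of windows by their Tajima's D values, keeping a fixed fraction from each tail. The snippet is a hedged illustration only: the function name, the window identifiers, and the default `tail_fraction` (here read as 0.25% per tail) are assumptions, and the original analysis may have defined its cutoffs differently.

```python
def tail_outliers(window_values, tail_fraction=0.0025):
    """Split windows into the lowest and highest tails of a statistic.

    window_values: dict mapping a window identifier to its Tajima's D
    tail_fraction: fraction of windows taken from each tail (assumed value)
    Returns (low_tail_ids, high_tail_ids).
    """
    ranked = sorted(window_values.items(), key=lambda item: item[1])
    # Always keep at least one window per tail for very small inputs.
    k = max(1, int(len(ranked) * tail_fraction))
    low_tail = [window for window, _ in ranked[:k]]
    high_tail = [window for window, _ in ranked[-k:]]
    return low_tail, high_tail


if __name__ == "__main__":
    # Invented example: 1000 windows whose statistic equals the window index.
    toy = {index: float(index) for index in range(1000)}
    low, high = tail_outliers(toy)
    print(low, high)
```

Ranking within each species, rather than applying a fixed threshold on D, is what makes the approach less sensitive to demographic shifts that move the whole genome-wide distribution.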

Tajima's D, as a summary statistic of the site-frequency spectrum, can also indicate overall patterns of selection or recent changes in demography. Overall, in most species, the genome-wide distribution of Tajima's D is shifted towards positive values, with means of 0.862 for P. tetraurelia, 0.504 for P. biaurelia, 0.274 for P. sexaurelia, −0.148 [0.459] for P. caudatum [subset], and 1.325 [0.855] for P. multimicronucleatum [subset]. This could suggest pervasive local adaptation (or balancing selection), a recent contraction in population size, or a fairly ancient population subdivision in all Paramecium species. A previous study that sampled variation at 10 nuclear loci found that strains from different continents shared the same alleles and interpreted that as evidence of regular global dispersal of Paramecium from relatively few permanent locations (Catania et al. 2009). This could also be a potential explanation for the observed increase in nucleotide diversity in comparison to the number of segregating sites.

Fraction of Heterozygous Sites across the Genome

On average, about 0.1% of sites in an individual were heterozygous within the P. aurelia species, whereas 0.7% of sites in individuals belonging to P. caudatum and 1.1% of sites in isolates of P. multimicronucleatum were found to be heterozygous (supplementary table S1, Supplementary Material online). This is consistent with the observation that the P. aurelia species undergo autogamy, whereas the outgroup species do not. On evaluating the fraction of heterozygous sites in windows of 10 kb across the genome, we find large blocks of homozygosity in the P. aurelia genomes (supplementary fig. S10a–e, Supplementary Material online), consistent with loss of heterozygosity in the lab, possibly through multiple episodes of autogamy. It should, however, be noted that peaks in heterozygous regions in the P. aurelia genomes are highly correlated between isolates, unlike in P. caudatum and P. multimicronucleatum, and might potentially reflect mapping or assembly errors. These heterozygous regions (summarized in supplementary Files S1–S3, Supplementary Material online) comprise a very small part (0.82–0.97%) of the genome and, most importantly, their exclusion does not affect any of our analyses qualitatively.
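A windowed heterozygosity scan of the kind described above could be sketched as follows. The input format (an iterable of 1-based positions paired with a heterozygous flag for each callable site on one scaffold) and the function name are assumptions for illustration; only the 10-kb window size comes from the text.

```python
def heterozygous_fraction_by_window(sites, window_size=10_000):
    """Fraction of called sites that are heterozygous in each window.

    sites: iterable of (position, is_heterozygous) with 1-based positions
    Returns a dict keyed by the 1-based start coordinate of each window.
    """
    counts = {}  # window index -> (called sites, heterozygous sites)
    for position, is_heterozygous in sites:
        window = (position - 1) // window_size
        called, het = counts.get(window, (0, 0))
        counts[window] = (called + 1, het + int(is_heterozygous))
    return {
        window * window_size + 1: het / called
        for window, (called, het) in sorted(counts.items())
    }


if __name__ == "__main__":
    # Invented scaffold of 20 kb with ten heterozygous sites in the first window.
    toy_sites = [(position, position <= 10) for position in range(1, 20_001)]
    print(heterozygous_fraction_by_window(toy_sites))
```

Windows with a fraction near zero across long stretches would correspond to the blocks of homozygosity noted for the P. aurelia genomes, whereas windows that peak at the same coordinates in every isolate are the ones flagged as possible mapping or assembly artifacts.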

Variation across the Mitochondrial Genome

Few ciliate mitochondrial genomes have been sequenced until now, most belonging to the genus Tetrahymena (Brunk et al. 2003; Burger et al. 2000; Moradian et al. 2007). Others include Paramecium tetraurelia (Pritchard et al. 1990), Paramecium caudatum (Barth and Berendonk 2011), and Oxytricha trifallax (Swart et al. 2012). These studies have illustrated that ciliate mitochondrial genomes are linear and range between 40 and 47 kb, with 50 intronless genes. However, levels of sequence diversity across the entire mitochondrial genome have not been characterized.

To obtain mitochondrial genomes for all species included in this study, we carried out de novo genome assembly using the whole-genome resequencing data for each strain and extracted mitochondrial contigs using the published Paramecium mitochondrial genomes as reference (see Methods for more details). Using this approach, we derived complete genome assemblies for multiple strains, including from species without a previously published mitochondrial genome (P. biaurelia, P. sexaurelia, and P. multimicronucleatum). A comprehensive analysis of these sequences will be published elsewhere (Johri et al., in preparation). The mitochondrial genome assembly from each isolate was annotated separately, and the eight genes that were most ubiquitously present in all individuals were concatenated and used to construct a phylogeny (see Methods). The mitochondrial phylogeny closely tracked the one based on nuclear sequences (supplementary fig. S11, Supplementary Material online).


One strain from each species was chosen as a reference genome (the genome with the longest assembly), and reads from other individuals were mapped to the reference to call SNPs. During the original DNA preparation, efforts were made to remove non-MAC DNA, resulting in varying amounts of mitochondrial DNA present in the extracts from each isolate. Hence, some individuals mapped with extremely low coverage (4 to zero read depth) to the reference genome, especially after applying stringent filters to call SNPs (supplementary table S9, Supplementary Material online). Thus, some individuals were excluded from this analysis, including all P. biaurelia isolates. The subsequent analysis was therefore restricted to four species.

All of the mitochondrial genomes are about 40 kb in length. We identified 626 SNPs in P. tetraurelia, 5307 in P. sexaurelia, 4771 [3041] in P. caudatum [subset], and 4600 [1478] in P. multimicronucleatum [subset]. Within all species, very few indels were found, with just 1 in P. tetraurelia, 9 in P. sexaurelia, and 15 [13] and 23 [9] in P. caudatum [subset] and P. multimicronucleatum [subset], respectively. The nucleotide diversity at 4-fold degenerate sites was found to be 0.009 in P. tetraurelia, 0.169 in P. sexaurelia, 0.127 [0.093] in P. caudatum [subset], and 0.172 [0.073] in P. multimicronucleatum [subset]. Nucleotide diversity across different sites shows that rRNAs are the most conserved sequences in the ciliate mitochondrial genomes. Transfer RNAs are the next most highly conserved, with their diversity levels only slightly above those of replacement sites (table 2). Our estimates of diversity in the Paramecium mitochondria are higher than those of a number of other unicellular eukaryotes studied so far: Plasmodium falciparum (πsyn = 0.0012) (Joy et al. 2003) and Chlamydomonas reinhardtii (πsilent = 0.0085) (Smith and Lee 2008).

We also recalculated nuclear nucleotide diversity (table 2) at 4-fold degenerate sites for the same set of individuals used in the mitochondrial analysis and observed that the mitochondrial to nuclear diversity ratio is close to 1.0 in all species (table 2) except P. sexaurelia, where the mitochondrial diversity is 5.7-fold higher than that in the nucleus. On excluding P. sexaurelia, we obtain a strong correlation between 4-fold degenerate nucleotide diversity in the mitochondria and nucleus (slope = 0.95, R² = 0.96). It should be noted that there are not enough data points to make any inferences about statistical significance. However, such a positive correlation between nuclear and mitochondrial diversity suggests that mitochondrial diversity might scale with population size in ciliates, which bears on previous studies in animals that have yielded conflicting findings about whether mitochondrial diversity correlates with population size (Bazin et al. 2006; Nabholz et al. 2008; Nabholz et al. 2009; Piganeau and Eyre-Walker 2009). The observed correlation also suggests that mitochondrial and nuclear spontaneous mutation rates may be correlated. Direct estimates of mutation rates in C. elegans, D. melanogaster, S. cerevisiae, and H. sapiens obtained via mutation accumulation experiments suggest a moderate correlation (slope = 0.05, R² = 0.48) between mitochondrial and nuclear base substitution rates (using data from Lynch et al. 2008). Discordance between the nuclear and mitochondrial diversity, for instance in P. sexaurelia, can be attributed to various factors such as positive selection (Hudson and Turelli 2003; Galtier et al. 2009; James et al. 2016), demographic events (Pool and Nielsen 2007), or simply differences in nuclear and/or mitochondrial mutation rates (Lynch 2007). Future studies are needed to dissect the factors responsible for the remarkably different ratio of mitochondrial to nuclear diversity in P. sexaurelia.

Table 2. Nucleotide Diversity (per site) of Different Sites across the Mitochondrial Genomes.

| Species | 4-fold (πm) [SE] | 2-fold [SE] | 0-fold [SE] | tRNA [SE] | rRNA [SE] | intergenic [SE] | 4-fold nuclear (πn) [SE] | πm/πn |
| P. tetraurelia | 0.0091 [0.0010] | 0.0107 [0.0009] | 0.0019 [0.0002] | 0.0000 [0.0000] | 0.0006 [0.0002] | 0.0041 [0.0003] | 0.0058 [2.26 × 10⁻⁵] | 1.57 |
| P. sexaurelia | 0.1692 [0.0033] | 0.1427 [0.0024] | 0.0097 [0.00043] | 0.0000 [0.0000] | 0.0059 [0.0007] | 0.0417 [0.0027] | 0.02945 [4.73 × 10⁻⁵] | 5.75 |
| P. caudatum | 0.1272 [0.0037] | 0.0942 [0.0019] | 0.0179 [0.0006] | 0.0455 [0.0023] | 0.0136 [0.0012] | 0.0333 [0.0027] | 0.1267 [1.08 × 10⁻⁴] | 1.00 |
| P. caudatum (subset) | 0.0929 [0.0039] | 0.0859 [0.0021] | 0.0149 [0.0006] | 0.0375 [0.0023] | 0.0093 [0.0011] | 0.0292 [0.0028] | 0.0658 [9.63 × 10⁻⁵] | 1.41 |
| P. multimicronucleatum | 0.1727 [0.0047] | 0.0831 [0.0019] | 0.0227 [0.0008] | 0.0263 [0.0011] | 0.0269 [0.0016] | 0.0329 [0.0022] | 0.1599 [1.34 × 10⁻⁴] | 1.08 |
| P. multimicronucleatum (subset) | 0.0730 [0.0034] | 0.0383 [0.0015] | 0.0121 [0.0006] | 0.0126 [0.0009] | 0.0135 [0.0012] | 0.0312 [0.0022] | 0.0742 [9.56 × 10⁻⁵] | 0.98 |
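The last column of table 2 is the ratio of mitochondrial to nuclear diversity at 4-fold degenerate sites. As a quick arithmetic check, the sketch below recomputes that ratio from the 4-fold values as read from table 2; the dictionary layout and the function name are illustrative assumptions rather than part of the original analysis.

```python
# 4-fold degenerate diversity as read from table 2: (mitochondrial, nuclear).
TABLE2_FOURFOLD = {
    "P. tetraurelia": (0.0091, 0.0058),
    "P. sexaurelia": (0.1692, 0.02945),
    "P. caudatum": (0.1272, 0.1267),
    "P. caudatum (subset)": (0.0929, 0.0658),
    "P. multimicronucleatum": (0.1727, 0.1599),
    "P. multimicronucleatum (subset)": (0.0730, 0.0742),
}


def diversity_ratio(pi_mito, pi_nuclear):
    """Ratio of mitochondrial to nuclear nucleotide diversity."""
    if pi_nuclear <= 0:
        raise ValueError("nuclear diversity must be positive")
    return pi_mito / pi_nuclear


# Ratios rounded to two decimals, the precision used in the table.
ratios = {
    species: round(diversity_ratio(pi_mito, pi_nuclear), 2)
    for species, (pi_mito, pi_nuclear) in TABLE2_FOURFOLD.items()
}

if __name__ == "__main__":
    for species, ratio in ratios.items():
        print(f"{species}: {ratio}")
```

The recomputed ratios stay close to 1 for every entry except P. sexaurelia, which is the discordance discussed in the text.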

Conclusions

This is the first study where multiple individuals of several species of free-living unicellular eukaryotes have been sequenced and used to reveal genome-wide patterns of variation. This study thus provides an invaluable resource for future research in Paramecium, a major model organism in cell biology. Sequencing of multiple individuals from the same lake or pond is needed to further resolve the global versus local population structure of these cosmopolitan unicellular species. Overall, our results suggest that most of the intergenic regions across the highly compact genomes of Paramecium species are strongly constrained and probably harbor functionally important regulatory modules, which might extend into coding regions. We show that patterns of nucleotide diversity across the genome are driven primarily by a combination of purifying selection and mutation rate variation. In future, such analyses will benefit greatly from incorporating variation in recombination rates across the genomic landscape. We also find that duplicates created by the WGD appear to be evolving under similar constraints. Synonymous to non-synonymous diversity ratios need to be supplemented with divergence levels to distinguish between the various types of selective forces acting on the duplicates. Thus, sequencing of much closer outgroup species would allow us to further understand the role of selection in the evolution of different genomic elements, including gene duplicates that resulted from ancient WGDs.

Methods

Sampling, DNA Extraction, and Sequencing
We examined a total of 56 Paramecium strains originally isolated from environmental sources, including 11 isolates of P. biaurelia, 13 isolates of P. sexaurelia, 12 isolates of P. tetraurelia, 10 isolates of P. caudatum, and 10 isolates of P. multimicronucleatum. A total of 14 strains from P. biaurelia and P. sexaurelia were kindly provided by E. Przybos, whereas the remaining strains were cultured in the Paramecium collections of the Institute of Hydrobiology, TU Dresden, Germany, or the Department of Biology, Indiana University, Bloomington, USA. Prior to our analyses, P. tetraurelia strain Moz 13CXI as well as P. sexaurelia strains Moz 13BIII and Indo17I were kindly identified by E. Przybos by performing inter- and intrastrain crosses (Przybos 1975). Paramecia were maintained in 0.25% Cerophyl medium (modified according to Sonneborn 1970; Krenek et al. 2011) inoculated with Enterobacter aerogenes. Since P. aurelia species can undergo autogamy (self-fertilization) induced by starvation, and all strains have been cultured under lab conditions for several years, all P. aurelia strains in this study are expected to be more homozygous than wild isolates.

Prior to DNA extraction, all cultures were initiated from a single cell, ensuring clonality. A modified sucrose density separation (Aury et al. 2006) was performed to extract primarily macronuclei (MACs) and to reduce the amount of non-MAC DNA potentially compromising subsequent analyses. Briefly, Paramecium cells were starved for at least two days and filtered through several layers of sterile cheesecloth prior to cell harvesting via gentle centrifugation. Pelleted cells were gently lysed in 0.25 M sucrose medium (pH 6.8) containing 10 mM Tris-HCl, 8 mM MgCl2, 3 mM CaCl2, 0.5–1% Nonidet-P40, and 0.1–0.2% sodium deoxycholate. Macronuclei were subsequently isolated away from cellular debris, micronuclei, and mitochondria by centrifugation at 100 rcf in 0.25 M sucrose wash buffer (see above, w/o Nonidet-P40 and Na-deoxycholate). Genomic DNA extraction from MAC isolations was performed by a modified CTAB procedure (Gustincich et al. 1991).

All genomes were sequenced using the Illumina HiSeq 2000, and paired-end reads with insert sizes between 200 and 400 bp and mean length of 151 bp were obtained. The reads mapped with coverage ranging between 40× and 80× (details given in supplementary table S1, Supplementary Material online).

Paramecium Reference Genomes
Complete genomes of P. tetraurelia, P. biaurelia, P. sexaurelia, and P. caudatum were downloaded from ParameciumDB (Arnaiz et al. 2007; Arnaiz and Sperling 2011), along with the annotations. Ortho-paralog relationships as well as functional annotations were retrieved from the supplementary information provided in McGrath et al. (2014a, 2014b). RNAseq data for all four species was downloaded and processed as in Gout and Lynch (2015).

The genome of P. multimicronucleatum was assembled using Celera (Miller et al. 2008) and annotated using EuGene (Foissac et al. 2008), as previously described (McGrath et al. 2014a, 2014b). A detailed description of the sequencing, assembly, and annotation of P. multimicronucleatum can be found in Marinov et al. (in preparation). OMA (Orthology MAtrix) (Altenhoff et al. 2015) was used to find orthologs between P. caudatum and P. multimicronucleatum. A total of 11,002 genes of P. caudatum were found to have 1-1 orthologs in P. multimicronucleatum, and only 1-1 orthologs were used for the current analysis.

Mapping of Reads and Detection of SNPs and Indels
Sequenced paired-end reads were mapped to the reference genomes for each species using BWA (bwa mem) (Li and Durbin 2010) with default parameters, and the mpileup function in Samtools (Li et al. 2009) was used to generate mpileup files for each individual. Two distinct pipelines were employed to identify SNPs to ensure detection accuracy: Bcftools-Vcftools (Li 2011) and the maximum likelihood Genotype Frequency Estimator (referred to as GFE) (Lynch 2009; Maruki and Lynch 2015). Both pipelines produced highly overlapping sets of SNPs (supplementary table S2, Supplementary Material online). For both pipelines, sites were only considered for SNP detection and further analysis if the mapping quality was above 30, base quality was above 20, per-base alignment quality was above 15, and the sum of the depth of coverage for all individuals was above 5 times the number of individuals and less than twice the average population coverage. SNPs were called by bcftools if the SNP quality score was above 30, and by GFE if the significance score was > 5.91 (P < 0.05). The exact parameters of the Bcftools-Vcftools pipeline to call SNPs and indels are given below.

samtools mpileup -q 30 -Q 20 -t DP -uvf reference_genome.fasta sorted.bam > species.vcf

bcftools call -f GQ -o species_snps.vcf -O v -V indels -cv species.vcf

bcftools call -f GQ -o species_indels.vcf -O v -V snps -cv species.vcf

# filtering snp and indel files for depth and biases
perl vcfutils.pl varFilter -d 50 -D 800 -1 -2 -3 -4 species_snps.vcf > species_snps.filtered1.vcf

perl vcfutils.pl varFilter -d 50 -D 800 -1 -2 -3 -4 species_indels.vcf > species_indels.filtered1.vcf

vcftools --vcf species_snps.filtered1.vcf --minQ 20 --minGQ 30 --minDP 4 --recode --recode-INFO-all --out species_snps.filtered2

vcftools --vcf species_indels.filtered1.vcf --minQ 20 --minGQ 30 --minDP 4 --recode --recode-INFO-all --out species_indels.filtered2

# filter out the recoded sites that are not polymorphic anymore, that is, AF1 = 1
bcftools view -q 0.001 species_snps.filtered2.recode.vcf -o species_snps.final.vcf

bcftools view -q 0.001 species_indels.filtered2.recode.vcf -o species_indels.final.vcf

Mitochondrial SNPs were called using the Bcftools-Vcftools pipeline, with the same parameters as noted above.
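
The site-level criteria applied in both pipelines can be restated as a small Python sketch. This is only an illustration of the thresholds described in the text: the function names are hypothetical, strict inequalities are assumed, and the example depth value is not taken from the study.

```python
def depth_bounds(n_individuals, mean_population_depth):
    """Bounds on the depth summed across individuals, as described in the text:
    above 5x the number of individuals and below 2x the average population coverage."""
    return 5 * n_individuals, 2 * mean_population_depth


def site_passes(mapq, baseq, baq, summed_depth, n_individuals, mean_population_depth):
    """True if a site meets all four criteria (strict inequalities are an assumption)."""
    low, high = depth_bounds(n_individuals, mean_population_depth)
    return mapq > 30 and baseq > 20 and baq > 15 and low < summed_depth < high


# Hypothetical example: 10 individuals with an average summed depth of 400.
print(depth_bounds(10, 400))
```

Under these assumed inputs the bounds come out as 50 and 800, which is the same pair of values passed to varFilter as -d and -D in the commands above.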

Simulation of SNP Detection with Varying Levels of Divergence from the Reference
We simulated reads using wgsim (https://github.com/lh3/wgsim) for a diploid genome with varying levels of divergence from the reference genome of P. caudatum and P. multimicronucleatum, using average values of read length, total number of reads, and insert sizes.

For the P. caudatum reference genome:
wgsim -d 287 -s 113 -N 10201087 -1 151 -2 151 -r 0.10 -R 0.013 caudatum_assembly.fasta caud_0.10_1.fastq caud_0.10_2.fastq >& caud_0.10.info

For the P. multimicronucleatum reference genome:
wgsim -d 261 -s 102 -N 12154044 -1 151 -2 151 -r 0.10 -R 0.011 multimicronucleatum_assembly.fasta multi_0.10_1.fastq multi_0.10_2.fastq >& multi_0.10.info

Reads were simulated for r = 0.0, 0.05, 0.10, 0.15, 0.20, 0.25, and 0.30, which represents the fraction of the genome with SNPs. The SNP detection pipeline described above was used to call SNPs, which were compared to the simulated set of SNPs (supplementary table S4, Supplementary Material online).
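
The comparison between called and simulated SNPs can be summarised for each simulation along the lines of the following sketch. It assumes SNPs are represented as (scaffold, position) tuples; the function name and the two metrics shown (recall and false discovery rate) are illustrative choices and are not necessarily the quantities reported in supplementary table S4.

```python
def snp_detection_summary(simulated, called):
    """Compare a called SNP set against the simulated (true) SNP set."""
    simulated, called = set(simulated), set(called)
    true_positives = simulated & called
    recall = len(true_positives) / len(simulated) if simulated else float("nan")
    fdr = len(called - simulated) / len(called) if called else float("nan")
    return {
        "true_positives": len(true_positives),
        "recall": recall,
        "false_discovery_rate": fdr,
    }


# Toy input with made-up coordinates.
truth = {("scaffold_1", 10), ("scaffold_1", 20), ("scaffold_1", 30), ("scaffold_2", 5)}
calls = {("scaffold_1", 10), ("scaffold_1", 20), ("scaffold_2", 7)}
print(snp_detection_summary(truth, calls))
```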

Building Phylogenies
A consensus genome (bcftools view species_filtered.recode.vcf | vcfutils.pl vcf2fq > species.cns.fq) was obtained for each individual using bcftools that represented heterozygous sites using DNA ambiguity codes. Gene nucleotide sequences that are present in single-copy in the P. aurelia species and are also present in both the outgroup species (P. caudatum and P. multimicronucleatum) were extracted, and multiple sequence alignments were generated using MUSCLE (Edgar 2004a, 2004b). This amounted to about 55 sequences (including the reference genomes) each in 883 sets of orthologous genes. RAxML 8.0.26 (Stamatakis 2014) was used to build a tree using the concatenation of all 883 aligned gene sequences (raxml -s concatenated_sequences.fasta -n gene -m GTRGAMMA -T 4 -p 7). In addition, individual gene trees (raxml -s gene.fasta -n gene -m GTRGAMMA -T 4 -p 7) were built, which were then used to create a consensus tree (raxml -J MR -z trees.newick -T 4 -m GTRGAMMA) (Aberer et al. 2010).
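
For reference, the two-base DNA ambiguity codes used to represent heterozygous sites follow the IUPAC convention. The mapping is sketched below; the helper name consensus_base is hypothetical, and the consensus in this study was produced by vcfutils.pl vcf2fq rather than by this code.

```python
# IUPAC codes for the six possible two-base heterozygous genotypes.
IUPAC_TWO_BASE = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def consensus_base(allele_1, allele_2):
    """Return the base itself for homozygotes, or the IUPAC code for heterozygotes."""
    if allele_1 == allele_2:
        return allele_1
    return IUPAC_TWO_BASE[frozenset((allele_1, allele_2))]
```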

Mitochondrial phylogeny was built using the same parameters as above by concatenating eight sets of orthologous genes (NADH_dehydrogenase_subunit_1_a, NADH_dehydrogenase_subunit_5, Ymf56orf78, Ymf67orf265_1, Ymf68orf393, apocytochrome_b, cytochrome_c_oxidase_subunit_1, cytochrome_c_oxidase_subunit_2, ribosomal_protein_S14) that were ubiquitously present across all sampled species.

Principal Component Analysis
Genotypes for all individuals were obtained from the vcf files generated by vcftools for all scaffolds. Only the sites that had a genotype available for all individuals were used in the analysis, i.e., if even a single individual had missing data, the site was excluded from analysis. A total of 10,000 markers were used for each species. The first marker of each scaffold was picked randomly within the first 50 SNP positions, and the rest were selected so that all markers were uniformly distributed across all scaffolds. PCA was carried out on these markers by using the R (R-Core-Team 2014) command prcomp (https://stat.ethz.ch/R-manual/R-devel/library/stats/html/prcomp.html) and visualized using the first two components with the R graphing package ggplot2 (Wickham 2010).
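
The marker selection described above can be sketched as follows. This is one possible reading, not the original script: "uniformly distributed" is interpreted here as a fixed stride in SNP index, missing genotypes are assumed to be encoded as None, and all function and parameter names are hypothetical.

```python
import random


def complete_sites(genotype_rows):
    """Keep only sites where every individual has a genotype (None marks missing data)."""
    return [row for row in genotype_rows if None not in row]


def pick_markers(snps_by_scaffold, target=10000, first_window=50, seed=0):
    """Pick about `target` markers: a random start among the first `first_window`
    SNPs of each scaffold, then every `step`-th SNP after it."""
    rng = random.Random(seed)
    total = sum(len(positions) for positions in snps_by_scaffold.values())
    step = max(1, total // target)
    chosen = {}
    for scaffold, positions in snps_by_scaffold.items():
        if positions:
            start = rng.randrange(min(first_window, len(positions)))
            chosen[scaffold] = positions[start::step]
    return chosen
```

The resulting genotype matrix (individuals by markers) would then be passed to a PCA routine such as prcomp in R.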

Calculation of Population Genetic Statistics
Nucleotide Diversity and Tajima's D
Nucleotide diversity was calculated as 2pqn/(n − 1), where p and q are the major and minor allele frequency and n is the number of chromosomes. The major and minor allele frequency was given for each site by GFE. Diversity at coding sites was partitioned into synonymous and non-synonymous components (e.g., 2-fold degenerate sites are one-third synonymous and two-thirds non-synonymous). Major allele frequencies were also extracted from vcftools and returned very similar results as those from GFE (supplementary table S3, Supplementary Material online).
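
As a worked illustration of the diversity formula, the following sketch computes per-site diversity from the major allele frequency and assigns fractional synonymous and non-synonymous weights by degeneracy class. Only the 2-fold weights are stated in the text; the 0-, 3-, and 4-fold weights are the usual convention and are assumed here, and the function names are hypothetical.

```python
def site_diversity(major_allele_freq, n_chromosomes):
    """Per-site nucleotide diversity, 2pq * n / (n - 1)."""
    p = major_allele_freq
    q = 1.0 - p
    return 2.0 * p * q * n_chromosomes / (n_chromosomes - 1)


# Fraction of a site counted as synonymous, by degeneracy class.
# Only the 2-fold value comes from the text; the others are assumed.
SYNONYMOUS_FRACTION = {0: 0.0, 2: 1.0 / 3.0, 3: 2.0 / 3.0, 4: 1.0}


def site_weights(degeneracy):
    """Return (synonymous, non-synonymous) weights for one coding site."""
    syn = SYNONYMOUS_FRACTION[degeneracy]
    return syn, 1.0 - syn


# Example: a site with major allele frequency 0.5 sampled in 10 chromosomes.
print(site_diversity(0.5, 10))
```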


A sliding window approach was used to calculate Tajima's D across the genome. Vcftools was used to obtain these values for non-overlapping windows of 1 kb in length. The program ms (Hudson 2002) was used to perform 100,000 simulations of 10 individuals with 1 kb locus length, and niceStats (https://github.com/kern-lab/shIC/blob/master/niceStats.c) was used to calculate Tajima's D from the ms output file. To obtain Tajima's D for different categories of sites, SNPs belonging to a particular category of sites (specified in Results) were selected, and vcftools was used to calculate Tajima's D for non-overlapping windows of 10 kb. Windows with 2 or more SNPs were used to calculate the mean, and only alternate windows were used to compare distributions of Tajima's D across different categories by Student's t-test, to maximise independence of sampling.
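
For readers unfamiliar with the statistic, the standard Tajima's D calculation for a single window is sketched below. The values in this study were obtained with vcftools and niceStats, not with this code; the function name and input format (minor allele counts per segregating site) are assumptions made for illustration.

```python
import math


def tajimas_d(minor_allele_counts, n):
    """Tajima's D for one window.

    minor_allele_counts: one entry per segregating site, giving how many of the
    n sampled chromosomes carry the minor allele.
    """
    s = len(minor_allele_counts)
    if s == 0 or n < 4:
        return float("nan")  # the variance term is zero for n < 4
    # Mean number of pairwise differences, summed over sites.
    pi = sum(2.0 * k * (n - k) / (n * (n - 1)) for k in minor_allele_counts)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))
```

An excess of rare variants (for example, a window containing only singletons) gives a negative D, whereas an excess of intermediate-frequency variants gives a positive D.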

Plotting Average Expected Heterozygosity across Introns and Intergenic Regions
All introns (< 500 bp in length) were split midway and aligned according to the 5′ and 3′ splice sites. All intergenic regions (< 500 bp in the outgroup species and < 1000 bp in the P. aurelia species) were split midway and aligned according to the start and stop codons. Note that these were not sequence alignments. At each site, the mean expected heterozygosity was calculated. Sites with an insufficient number of data points (< 200 data points per site for introns; for intergenic regions, < 100 data points per site for outgroups and < 1000 data points per site in the P. aurelia species) were discarded.
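
The positional averaging described above can be sketched as follows. The sketch assumes each region is supplied as a list of per-site expected heterozygosity values ordered from its 5′ end to its 3′ end; the function name and the default thresholds (taken from the intron case) are illustrative.

```python
from collections import defaultdict


def mean_heterozygosity_profile(regions, max_len=500, min_points=200):
    """Split each region midway, anchor the halves at their outer ends, and
    average heterozygosity by distance from each end."""
    left = defaultdict(list)   # offset from the 5' end
    right = defaultdict(list)  # offset from the 3' end
    for het in regions:
        if len(het) >= max_len:
            continue  # only regions shorter than max_len are used
        half = len(het) // 2
        for offset, value in enumerate(het[:half]):
            left[offset].append(value)
        for offset, value in enumerate(reversed(het[half:])):
            right[offset].append(value)

    def summarise(side):
        # Discard positions supported by fewer than min_points regions.
        return {
            offset: sum(values) / len(values)
            for offset, values in sorted(side.items())
            if len(values) >= min_points
        }

    return summarise(left), summarise(right)
```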

Calculating and Plotting Divergence across Introns and Intergenic Regions
Orthologous coding sequences along with introns were aligned using MUSCLE (Edgar 2004a). Intergenic regions flanked by orthologous genes on both sides were also aligned along with the flanking coding exon using MUSCLE. For each species separately, divergence for different positions (with respect to the start and stop sites) was averaged across all introns and intergenic regions. The divergence values were corrected for multiple substitutions by the Jukes-Cantor sequence substitution model and plotted in the same manner as described above.
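
The Jukes-Cantor correction converts an observed proportion of differing sites p into an estimated number of substitutions per site, d = −(3/4) ln(1 − 4p/3). A minimal sketch (the function name is hypothetical):

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor corrected divergence for an observed proportion p of differing sites."""
    if p >= 0.75:
        return float("inf")  # the correction is undefined at or beyond saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# The corrected distance is always at least as large as the observed one.
print(jukes_cantor(0.1))
```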

GO Analysis
All gene ontology enrichment analyses were performed using the BiNGO plugin for Cytoscape 3.2.1 (Shannon et al. 2003; Maere et al. 2005), using the hypergeometric test to assess significance, with Benjamini and Hochberg FDR correction for multiple testing (Benjamini and Hochberg 1995). The latest ontology files (.obo) were downloaded from www.geneontology.org, and custom annotation files were prepared as required using Python scripts.
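
The two statistical steps named above can be illustrated with a short standard-library sketch. This is not BiNGO's implementation; it only shows a one-sided hypergeometric over-representation test and the Benjamini-Hochberg adjustment, with hypothetical function names.

```python
from math import comb


def hypergeom_sf(k, population, annotated, sample):
    """P(X >= k): probability of seeing at least k annotated genes in a sample of
    `sample` genes drawn from `population` genes, `annotated` of which carry the term."""
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(population, sample)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted
```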

Identification of Heterozygous Regions in the P. aurelia Species
The fraction of heterozygous sites was calculated for 10 bp non-overlapping windows across the genome for all three P. aurelia species, and the top 1% of such regions that have the highest fraction of heterozygous sites were identified as heterozygous regions.
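
A minimal sketch of this windowing step is given below, assuming one scaffold is represented as a list of booleans marking heterozygous sites. The function name, the handling of ties, and the choice to always return at least one window are assumptions made for illustration.

```python
def top_heterozygous_windows(is_het, window=10, top_fraction=0.01):
    """Return (start, fraction) for the non-overlapping windows with the highest
    fraction of heterozygous sites."""
    fractions = []
    for start in range(0, len(is_het) - window + 1, window):
        chunk = is_het[start:start + window]
        fractions.append((start, sum(chunk) / window))
    n_top = max(1, int(len(fractions) * top_fraction))
    ranked = sorted(fractions, key=lambda item: item[1], reverse=True)
    return ranked[:n_top]
```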

Mitochondrial Genome Assembly and Annotation
Mitochondrial genomes were assembled as follows. Sequence reads were assembled using SPAdes (version 3.5.0) (Bankevich et al. 2012) after removing potential adapter sequence with Trimmomatic (version 0.33) (Bolger et al. 2014). Mitochondrial contigs were identified from the resulting assemblies by BLAST searches against the published P. caudatum and P. tetraurelia mitochondrial genomes (Pritchard et al. 1990; Barth and Berendonk 2011).

Protein coding genes were identified by generating all ORFs longer than 60 aa in all six frames using the Mold, Protozoan, and Coelenterate Mitochondrial Code (i.e., UGA codes for W instead of being a stop codon) and all alternative start codons specific to Paramecium (AUU, AUA, AUG, AUC, GUG, and GUA), and retaining the longest ORFs associated with each stop codon. BLASTP (Altschul et al. 1997) was then used to identify homologs of annotated mitochondrial proteins in P. tetraurelia and P. caudatum. Additional ORFs were identified by imposing the requirement that their length exceeds 100 aa, and subsequently annotated using BLASTP against the Non-redundant protein sequences (nr) database and HMMER 3.0 (Eddy 2011) scans against the PFAM 27.0 database (Finn et al. 2014).
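
The ORF-generation step can be sketched as below. This is an illustrative reimplementation under stated assumptions, not the script used in the study: the input is an uppercase DNA string, TAA and TAG are the only stop codons (TGA encodes W under this genetic code), and coordinates on the minus strand refer to the reverse-complemented sequence. All names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG"}  # TGA is read as W under this mitochondrial code
START_CODONS = {"ATT", "ATA", "ATG", "ATC", "GTG", "GTA"}


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def longest_orfs(seq, min_aa=60):
    """Return (strand, frame, start, end, length_aa) for the longest ORF ending at
    each stop codon, keeping ORFs longer than min_aa."""
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            first_start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if codon in STOP_CODONS:
                    if first_start is not None:
                        length_aa = (i - first_start) // 3
                        if length_aa > min_aa:
                            orfs.append((strand, frame, first_start, i + 3, length_aa))
                    first_start = None
                elif first_start is None and codon in START_CODONS:
                    # The most upstream start after the previous stop gives the longest ORF.
                    first_start = i
    return orfs
```

Using the most upstream start codon after the previous in-frame stop is what makes each reported ORF the longest one associated with its stop codon.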

tRNA genes were annotated with tRNAscan-SE 1.21 (Schattner et al. 2005), using the "Mito/Chloroplast" source.

Supplementary Material
Supplementary data are available at Molecular Biology and Evolution online.

Acknowledgments
We thank Lydia Bright and Jean-Francois Gout for helpful comments on the manuscript, Matthew S. Ackerman for insightful discussions related to the project, and the anonymous reviewers and editor for their valuable suggestions. We would also like to thank Kelley Thomas for providing help with the sequencing, and E. Przybos for kindly providing several Paramecium strains and the species identification of some so far unspecified strains. This work was financially supported by the National Science Foundation (MCB-1050161 and DEB-1257806), the European Commission FP7-PEOPLE-2009-IRSES project CINAR PATHOBACTER (project number 247658), and the CNRS research network GDRE "Paramecium Genome Dynamics and Evolution." T.U.B. also acknowledges funding of the "support the best" of the ZUK TU Dresden and the Deutsche Forschungsgemeinschaft (DFG).

References
Abbott RJ, Gomes MF. 1989. Population genetic-structure and outcrossing rate of Arabidopsis thaliana (L.) Heynh. Heredity 62:411–418.

Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, McVean GA. 2010. A map of human genome variation from population-scale sequencing. Nature 467:1061–1073.


Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA. 2012. An integrated map of genetic variation from 1,092 human genomes. Nature 491:56–65.

Aberer AJ, Pattengale ND, Stamatakis A. 2010. Parallelized phylogenetic post-analysis on multi-core architectures. J Comput Sci-Neth. 1:107–114.

Ackermann AA, Panunzi LG, Cosentino RO, Sanchez DO, Aguero F. 2012. A genomic scale map of genetic diversity in Trypanosoma cruzi. BMC Genomics 13:736.

Altenhoff AM, Skunca N, Glover N, Train CM, Sueki A, Pilizota I, Gori K, Tomiczek B, Muller S, Redestig H, et al. 2015. The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Res. 43:D240–D249.

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389–3402.

Arnaiz O, Cain S, Cohen J, Sperling L. 2007. ParameciumDB: a community resource that integrates the Paramecium tetraurelia genome sequence with genetic data. Nucleic Acids Res. 35:D439–D444.

Arnaiz O, Mathy N, Baudry C, Malinsky S, Aury JM, Wilkes CD, Garnier O, Labadie K, Lauderdale BE, Le Mouel A, et al. 2012. The Paramecium germline genome provides a niche for intragenic parasitic DNA: evolutionary dynamics of internal eliminated sequences. PLoS Genet. 8:e1002984.

Arnaiz O, Sperling L. 2011. ParameciumDB in 2011: new tools and new data for functional and comparative genomics of the model ciliate Paramecium tetraurelia. Nucleic Acids Res. 39:D632–D636.

Assefa S, Lim C, Preston MD, Duffy CW, Nair MB, Adroub SA, Kadir KA, Goldberg JM, Neafsey DE, Divis P, et al. 2015. Population genomic structure and adaptation in the zoonotic malaria parasite Plasmodium knowlesi. Proc Natl Acad Sci U S A. 112:13027–13032.

Aury JM, Jaillon O, Duret L, Noel B, Jubin C, Porcel BM, Segurens B, Daubin V, Anthouard V, Aiach N, et al. 2006. Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature 444:171–178.

Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR. 2015. A global reference for human genetic variation. Nature 526:68–74.

Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, et al. 2012. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 19:455–477.

Barth D, Berendonk TU. 2011. The mitochondrial genome sequence of the ciliate Paramecium caudatum reveals a shift in nucleotide composition and codon usage within the genus Paramecium. BMC Genomics 12:272.

Barth D, Krenek S, Fokin SI, Berendonk TU. 2006. Intraspecific genetic variation in Paramecium revealed by mitochondrial cytochrome C oxidase I sequences. J Eukaryot Microbiol. 53:20–25.

Bazin E, Glemin S, Galtier N. 2006. Population size does not influence mitochondrial genetic diversity in animals. Science 312:570–572.

Beale G, Preer JR. 2008. Paramecium: genetics and epigenetics. Boca Raton (FL): CRC Press.

Begun DJ, Aquadro CF. 1992. Levels of naturally-occurring DNA polymorphism correlate with recombination rates in Drosophila melanogaster. Nature 356:519–520.

Begun DJ, Holloway AK, Stevens K, Hillier LW, Poh YP, Hahn MW, Nista PM, Jones CD, Kern AD, Dewey CN, et al. 2007. Population genomics: whole-genome analysis of polymorphism and divergence in Drosophila simulans. PLoS Biol. 5:e310.

Benjamini Y, Hochberg Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B Met. 57:289–300.

Betermier M. 2004. Large-scale genome remodelling by the developmentally programmed elimination of germ line sequences in the ciliate Paramecium. Res Microbiol. 155:399–408.

Birky CW, Walsh JB. 1988. Effects of linkage on rates of molecular evolution. Proc Natl Acad Sci U S A. 85:6414–6418.

Birnbaum RY, Clowney EJ, Agamy O, Kim MJ, Zhao J, Yamanaka T, Pappalardo Z, Clarke SL, Wenger AM, Nguyen L, et al. 2012. Coding exons function as tissue-specific enhancers of nearby genes. Genome Res. 22:1059–1068.

Boel G, Letso R, Neely H, Price WN, Wong KH, Su M, Luff JD, Valecha M, Everett JK, Acton TB, et al. 2016. Codon influence on protein expression in E. coli correlates with mRNA levels. Nature 529:358–363.

Bolger AM, Lohse M, Usadel B. 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30:2114–2120.

Brandvain Y, Wright SI. 2016. The limits of natural selection in a nonequilibrium world. Trends Genet. 32:201–210.

Brunk CF, Lee LC, Tran AB, Li J. 2003. Complete sequence of the mitochondrial genome of Tetrahymena thermophila and comparative methods for identifying highly divergent genes. Nucleic Acids Res. 31:1673–1682.

Burger G, Zhu Y, Littlejohn TG, Greenwood SJ, Schnare MN, Lang BF, Gray MW. 2000. Complete sequence of the mitochondrial genome of Tetrahymena pyriformis and comparison with Paramecium aurelia mitochondrial DNA. J Mol Biol. 297:365–380.

Burki F. 2014. The eukaryotic tree of life from a global phylogenomic perspective. Cold Spring Harb Perspect Biol. 6:a016147.

Cao J, Schneeberger K, Ossowski S, Gunther T, Bender S, Fitz J, Koenig D, Lanz C, Stegle O, Lippert C, et al. 2011. Whole-genome sequencing of multiple Arabidopsis thaliana populations. Nat Genet. 43:956–963.

Catania F, Wurmser F, Potekhin AA, Przybos E, Lynch M. 2009. Genetic diversity in the Paramecium aurelia species complex. Mol Biol Evol. 26:421–431.

Chalker DL, Yao MC. 2011. DNA elimination in ciliates: transposon domestication and genome surveillance. Annu Rev Genet. 45:227–246.

Charlesworth D. 2003. Effects of inbreeding on the genetic diversity of populations. Philos T Roy Soc B. 358:1051–1070.

Charlesworth B, Morgan MT, Charlesworth D. 1993. The effect of deleterious mutations on neutral molecular variation. Genetics 134:1289–1303.

Chia JM, Song C, Bradbury PJ, Costich D, de Leon N, Doebley J, Elshire RJ, Gaut B, Geller L, Glaubitz JC, et al. 2012. Maize HapMap2 identifies extant variation from a genome in flux. Nat Genet. 44:803–807.

Comeron JM, Williford A, Kliman RM. 2008. The Hill-Robertson effect: evolutionary consequences of weak selection and linkage in finite populations. Heredity 100:19–31.

Dey A, Chan CK, Thomas CG, Cutter AD. 2013. Molecular hyperdiversity defines populations of the nematode Caenorhabditis brenneri. Proc Natl Acad Sci U S A. 110:11056–11060.

Dong X, Navratilova P, Fredman D, Drivenes O, Becker TS, Lenhard B. 2010. Exonic remnants of whole-genome duplication reveal cis-regulatory function of coding exons. Nucleic Acids Res. 38:1071–1085.

Downing T, Imamura H, Decuypere S, Clark TG, Coombs GH, Cotton JA, Hilley JD, de Doncker S, Maes I, Mottram JC, et al. 2011. Whole genome sequencing of multiple Leishmania donovani clinical isolates provides insights into population structure and mechanisms of drug resistance. Genome Res. 21:2143–2156.

Drummond DA, Bloom JD, Adami C, Wilke CO, Arnold FH. 2005. Why highly expressed proteins evolve slowly. Proc Natl Acad Sci U S A. 102:14338–14343.

Eddy SR. 2011. Accelerated profile HMM searches. PLoS Comput Biol. 7:e1002195.

Edgar RC. 2004a. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5:113.

Edgar RC. 2004b. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32:1792–1797.

Eichenlaub MP, Ettwiller L. 2011. De novo genesis of enhancers in vertebrates. PLoS Biol. 9:e1001188.

Ellison CE, Hall C, Kowbel D, Welch J, Brem RB, Glass NL, Taylor JW. 2011. Population genomics and local adaptation in wild isolates of a model microbial eukaryote. Proc Natl Acad Sci U S A. 108:2831–2836.

Elyashiv E, Bullaughey K, Sattath S, Rinott Y, Przeworski M, Sella G. 2010. Shifts in the intensity of purifying selection: an analysis of genome-wide polymorphism data from two closely related yeast species. Genome Res. 20:1558–1573.

Engelthaler DM, Hicks ND, Gillece JD, Roe CC, Schupp JM, Driebe EM, Gilgado F, Carriconde F, Trilles L, Firacative C, et al. 2014. Cryptococcus gattii in North American Pacific Northwest: whole-population genome analysis provides insights into species evolution and dispersal. mBio 5:e01464–e01414.

Fawcett JA, Iida T, Takuno S, Sugino RP, Kado T, Kugou K, Mura S, Kobayashi T, Ohta K, Nakayama J, et al. 2014. Population genomics of the fission yeast Schizosaccharomyces pombe. PLoS ONE 9:e104241.

Fenchel T, Finlay BJ. 2004. The ubiquity of small species: patterns of local and global diversity. Bioscience 54:777–784.

Finlay BJ. 2002. Global dispersal of free-living microbial eukaryote species. Science 296:1061–1063.

Finlay BJ, Esteban GF, Fenchel T. 2004. Protist diversity is different? Protist 155:15–22.

Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J, et al. 2014. Pfam: the protein families database. Nucleic Acids Res. 42:D222–D230.

Foissac S, Gouzy J, Rombauts S, Mathe C, Amselem J, Sterck L, Van de Peer Y, Rouze P, Schiex T. 2008. Genome annotation in plants and fungi: EuGene as a model platform. Curr Bioinform. 3:87–97.

Foissner W. 1999. Protist diversity: estimates of the near-imponderable. Protist 150:363–368.

Foissner W. 2006. Biogeography and dispersal of micro-organisms: a review emphasizing protists. Acta Protozool. 45:111–136.

Foissner W. 2008. Protist diversity and distribution: some basic considerations. Biodivers Conserv. 17:235–242.

Foissner W. 2011. Dispersal of protists: the role of cysts and human introductions. In: Fontaneto D, editor. Biogeography of microscopic organisms: is everything small everywhere? Cambridge University Press. p. 61–87.

Foissner W, Chao A, Katz LA. 2008. Diversity and geographic distribution of ciliates (Protista: Ciliophora). Biodivers Conserv. 17:345–363.

Fournier-Level A, Korte A, Cooper MD, Nordborg M, Schmitt J, Wilczek AM. 2011. A map of local adaptation in Arabidopsis thaliana. Science 334:86–89.

Galtier N, Nabholz B, Glemin S, Hurst GDD. 2009. Mitochondrial DNA as a marker of molecular diversity: a reappraisal. Mol Ecol. 18:4541–4550.

Gout JF, Kahn D, Duret L, Paramecium Post-Genomics C. 2010. The relationship among gene expression, the evolution of gene dosage, and the rate of protein evolution. PLoS Genet. 6:e1000944.

Gout JF, Lynch M. 2015. Maintenance and loss of duplicated genes by dosage subfunctionalization. Mol Biol Evol. 32:2141–2148.

Gruchy DF. 1955. The breeding system and distribution of Tetrahymena pyriformis. J Protozool. 2:178–185.

Gustincich S, Manfioletti G, Del Sal G, Schneider C, Carninci P. 1991. A fast method for high-quality genomic DNA extraction from whole human blood. BioTechniques 11:298–300, 302.

Hancock AM, Witonsky DB, Alkorta-Aranburu G, Beall CM, Gebremedhin A, Sukernik R, Utermann G, Pritchard JK, Coop G, Di Rienzo A. 2011. Adaptations to climate-mediated selective pressures in humans. PLoS Genet. 7:e1001375.

Hiwatashi K. 1968. Determination and inheritance of mating type in Paramecium caudatum. Genetics 58:373–386.

Hori M, Tomikawa I, Przybos E, Fujishima M. 2006. Comparison of the evolutionary distances among syngens and sibling species of Paramecium. Mol Phylogenet Evol. 38:697–704.

Huang X, Kurata N, Wei X, Wang ZX, Wang A, Zhao Q, Zhao Y, Liu K, Lu H, Li W, et al. 2012. A map of rice genome variation reveals the origin of cultivated rice. Nature 490:497–501.

Huang X, Zhao Q, Han B. 2015. Comparative population genomics reveals strong divergence and infrequent introgression between Asian and African rice. Mol Plant 8:958–960.

Hudson RR. 2002. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18:337–338.

Hudson RR, Turelli M. 2003. Stochasticity overrules the "three-times rule": genetic drift, genetic draft, and coalescence times for nuclear loci versus mitochondrial DNA. Evolution 57:182–190.

Jaillon O, Bouhouche K, Gout JF, Aury JM, Noel B, Saudemont B, Nowacki M, Serrano V, Porcel BM, Segurens B, et al. 2008. Translational control of intron splicing in eukaryotes. Nature 451:359–362.

James JE, Piganeau G, Eyre-Walker A. 2016. The rate of adaptive evolution in animal mitochondria. Mol Ecol. 25:67–78.

Joy DA, Feng X, Mu J, Furuya T, Chotivanich K, Krettli AU, Ho M, Wang A, White NJ, Suh E, et al. 2003. Early origin and recent expansion of Plasmodium falciparum. Science 300:318–321.

Kaplan NL, Hudson RR, Langley CH. 1989. The hitchhiking effect revisited. Genetics 123:887–899.

Katz LA, McManus GB, Snoeyenbos-West OLO, Griffin A, Pirog K, Costas B, Foissner W. 2005. Reframing the 'Everything is everywhere' debate: evidence for high gene flow and diversity in ciliate morphospecies. Aquat Microb Ecol. 41:55–65.

Katz LA, Snoeyenbos-West O, Doerder FP. 2006. Patterns of protein evolution in Tetrahymena thermophila: implications for estimates of effective population size. Mol Biol Evol. 23:608–614.

Krenek S, Berendonk TU, Fokin SI. 2015. New Paramecium (Ciliophora, Oligohymenophorea) congeners shape our view on its biodiversity. Org Divers Evol. 15:215–233.

Krenek S, Berendonk TU, Petzoldt T. 2011. Thermal performance curves of Paramecium caudatum: a model selection approach. Eur J Protistol. 47:124–137.

Krenek S, Petzoldt T, Berendonk TU. 2012. Coping with temperature at the warm edge – patterns of thermal adaptation in the microbial eukaryote Paramecium caudatum. PLoS One 7:e30598.

Kudla G, Murray AW, Tollervey D, Plotkin JB. 2009. Coding-sequence determinants of gene expression in Escherichia coli. Science 324:255–258.

Kvitek DJ, Will JL, Gasch AP. 2008. Variations in stress sensitivity and genomic expression in diverse S. cerevisiae isolates. PLoS Genet. 4:e1000223.

Lahr DJ, Laughinghouse HDt, Oliverio AM, Gao F, Katz LA. 2014. How discordant morphological and molecular evolution among microorganisms can revise our notions of biodiversity on Earth. Bioessays 36:950–959.

Langley CH, Stevens K, Cardeno C, Lee YC, Schrider DR, Pool JE, Langley SA, Suarez C, Corbett-Detig RB, Kolaczkowski B, et al. 2012. Genomic variation in natural populations of Drosophila melanogaster. Genetics 192:533–598.

Leffler EM, Bullaughey K, Matute DR, Meyer WK, Segurel L, Venkat A, Andolfatto P, Przeworski M. 2012. Revisiting an old riddle: what determines genetic diversity levels within species? PLoS Biol. 10:e1001388.

Li H. 2011. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27:2987–2993.

Li H, Durbin R. 2010. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26:589–595.

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Genome Project Data Processing S. 2009. The sequence alignment/map format and SAMtools. Bioinformatics 25:2078–2079.

Lim LP, Burge CB. 2001. A computational analysis of sequence features involved in recognition of short introns. Proc Natl Acad Sci U S A. 98:11193–11198.

Liti G, Carter DM, Moses AM, Warringer J, Parts L, James SA, Davey RP, Roberts IN, Burt A, Koufopanou V, et al. 2009. Population genomics of domestic and wild yeasts. Nature 458:337–341.

Loveless MD, Hamrick JL. 1984. Ecological determinants of genetic-structure in plant-populations. Annu Rev Ecol Syst. 15:65–95.

Lynch M. 2007. The origins of genome architecture. Sunderland (MA): Sinauer Associates.

Lynch M. 2009. Estimation of allele frequencies from high-coverage genome-sequencing projects. Genetics 182:295–301.


Lynch M, Field MC, Goodson HV, Malik HS, Pereira-Leal JB, Roos DS, Turkewitz AP, Sazer S. 2014. Evolutionary cell biology: two origins, one objective. Proc Natl Acad Sci U S A. 111:16990–16994.

Lynch M, Sung W, Morris K, Coffey N, Landry CR, Dopman EB, Dickinson WJ, Okamoto K, Kulkarni S, Hartl DL, et al. 2008. A genome-wide view of the spectrum of spontaneous mutations in yeast. Proc Natl Acad Sci U S A. 105:9272–9277.

Maere S, Heymans K, Kuiper M. 2005. BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics 21:3448–3449.

Maguire B, Belk D. 1967. Paramecium transport by land snails. J Protozool. 14:445–447.

Maruki T, Lynch M. 2015. Genotype-frequency estimation from high-throughput sequencing data. Genetics 201:473–486.

McGrath CL, Gout JF, Doak TG, Yanagi A, Lynch M. 2014a. Insights into three whole-genome duplications gleaned from the Paramecium caudatum genome sequence. Genetics 197:1417–1428.

McGrath CL, Gout JF, Johri P, Doak TG, Lynch M. 2014b. Differential retention and divergent resolution of duplicate genes following whole-genome duplication. Genome Res. 24:1665–1675.

Miller JR, Delcher AL, Koren S, Venter E, Walenz BP, Brownley A, Johnson J, Li K, Mobarry C, Sutton G. 2008. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics 24:2818–2824.

Moradian MM, Beglaryan D, Skozylas JM, Kerikorian V. 2007. Complete mitochondrial genome sequence of three Tetrahymena species reveals mutation hot spots and accelerated nonsynonymous substitutions in Ymf genes. PLoS ONE 2:e650.

Nabholz B, Glemin S, Galtier N. 2009. The erratic mitochondrial clock: variations of mutation rate, not population size, affect mtDNA diversity across birds and mammals. BMC Evol Biol. 9:54.

Nabholz B, Mauffrey JF, Bazin E, Galtier N, Glemin S. 2008. Determination of mitochondrial genetic diversity in mammals. Genetics 178:351–361.

Neafsey DE, Galinsky K, Jiang RH, Young L, Sykes SM, Saif S, Gujja S, Goldberg JM, Young S, Zeng Q, et al. 2012. The malaria parasite Plasmodium vivax exhibits greater genetic diversity than Plasmodium falciparum. Nat Genet 44:1046–1050.

Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, Indap A, King KS, Bergmann S, Nelson MR, et al. 2008. Genes mirror geography within Europe. Nature 456:98–101.

Ohta T. 1974. Mutational pressure as main cause of molecular evolution and polymorphism. Nature 252:351–354.

Ohta T. 1992. The nearly neutral theory of molecular evolution. Annu Rev Ecol Syst 23:263–286.

Pawlowski J, Audic S, Adl S, Bass D, Belbahri L, Berney C, Bowser SS, Cepicka I, Decelle J, Dunthorn M, et al. 2012. CBOL protist working group: barcoding eukaryotic richness beyond the animal, plant, and fungal kingdoms. PLoS Biol 10:e1001419.

Piganeau G, Eyre-Walker A. 2009. Evidence for variation in the effective population size of animal mitochondrial DNA. PLoS One 4:e4396.

Pool JE, Nielsen R. 2007. Population size changes reshape genomic patterns of diversity. Evolution 61:3001–3006.

Pophaly SD, Tellier A. 2015. Population level purifying selection and gene expression shape subgenome evolution in maize. Mol Biol Evol 32:3226–3235.

Prescott DM. 1994. The DNA of ciliated protozoa. Microbiol Rev 58:233–267.

Pritchard AE, Seilhamer JJ, Mahalingam R, Sable CL, Venuti SE, Cummings DJ. 1990. Nucleotide sequence of the mitochondrial genome of Paramecium. Nucleic Acids Res 18:173–180.

Przybos E. 1975. Genetic studies of Paramecium jenningsi strains (Diller, Earl, 1958). Folia Biol (Krakow) 23:425–471.

Przybos E. 2005. Recent data on the occurrence of species of the Paramecium aurelia complex in Europe. Folia Biol-Krakow 53:61–63.

Przybos E, Barth D, Berendonk TU. 2008. The Paramecium aurelia species complex, frequency and co-occurrence across Europe. Folia Biol (Krakow) 56:77–81.

Przybos E, Rautian M, Greczek-Stachura M, Potekhin A. 2007. Polymorphism within Paramecium sexaurelia (Ciliophora, Oligohymenophorea) and description of a new stand of the species in China. Folia Biol (Krakow) 55:121–125.

Przybos E, Surmacz M. 2010. New world-wide data on the distribution of species of the Paramecium aurelia complex (Ciliophora, Protozoa). Folia Biol-Krakow 58:185–188.

R-Core-Team. 2014. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

Redden H, Alper HS. 2015. The development and characterization of synthetic minimal yeast promoters. Nat Commun 6:7810.

Ritter DI, Dong Z, Guo S, Chuang JH. 2012. Transcriptional enhancers in protein-coding exons of vertebrate developmental genes. PLoS One 7:e35202.

Russell CB, Fraga D, Hinrichsen RD. 1994. Extremely short 20-33 nucleotide introns are the standard length in Paramecium tetraurelia. Nucleic Acids Res 22:1221–1225.

Salim HMW, Ring KL, Cavalcanti ARO. 2008. Patterns of codon usage in two ciliates that reassign the genetic code: Tetrahymena thermophila and Paramecium tetraurelia. Protist 159:283–298.

Schacherer J, Shapiro JA, Ruderfer DM, Kruglyak L. 2009. Comprehensive polymorphism survey elucidates population structure of Saccharomyces cerevisiae. Nature 458:342–345.

Schattner P, Brooks AN, Lowe TM. 2005. The tRNAscan-SE, snoscan and snoGPS web servers for the detection of tRNAs and snoRNAs. Nucleic Acids Res 33:W686–W689.

Schrider DR, Kern AD. 2015. Inferring selective constraint from population genomic data suggests recent regulatory turnover in the human brain. Genome Biol Evol 7:3511–3528.

Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. 2003. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13:2498–2504.

Simonsen KL, Churchill GA, Aquadro CF. 1995. Properties of statistical tests of neutrality for DNA polymorphism data. Genetics 141:413–429.

Smith DR, Lee RW. 2008. Nucleotide diversity in the mitochondrial and nuclear compartments of Chlamydomonas reinhardtii: investigating the origins of genome architecture. BMC Evol Biol 8:156.

Smith JM, Haigh J. 1974. The hitch-hiking effect of a favourable gene. Genetics Res 89:391–403.

Snoke MS, Berendonk TU, Barth D, Lynch M. 2006. Large global effective population sizes in Paramecium. Mol Biol Evol 23:2474–2479.

Sonneborn TM. 1970. Methods in Paramecium research. In: Methods Cell Biol. New York and London: Academic Press. p. 241–339.

Sonneborn TM. 1975. Paramecium aurelia complex of fourteen sibling species. T Am Microsc Soc 94:155–178.

Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30:1312–1313.

Stoeck T, Welter H, Seitz-Bender D, Kusch J, Schmidt HJ. 2000. ARDRA and RAPD-fingerprinting reject the sibling species concept for the ciliate Paramecium caudatum (Ciliophora, Protoctista). Zool Scr 29:75–82.

Sun M. 1999. Cleistogamy in Scutellaria indica (Labiatae): effective mating system and population genetic structure. Mol Ecol 8:1285–1295.

Swart EC, Nowacki M, Shum J, Stiles H, Higgins BP, Doak TG, Schotanus K, Magrini VJ, Minx P, Mardis ER, et al. 2012. The Oxytricha trifallax mitochondrial genome. Genome Biol Evol 4:136–154.

Tajima F. 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123:585–595.

Tishkoff SA, Reed FA, Ranciaro A, Voight BF, Babbitt CC, Silverman JS, Powell K, Mortensen HM, Hirbo JB, Osman M, et al. 2007. Convergent adaptation of human lactase persistence in Africa and Europe. Nat Genet 39:31–40.

Tsukii Y, Hiwatashi K. 1983. Genes controlling mating-type specificity in Paramecium caudatum: three loci revealed by intersyngenic crosses. Genetics 104:41–62.


Tsukii Y, Hiwatashi K. 1985. Meiotic nondisjunction and aneuploids in intersyngenic hybrids of Paramecium caudatum. Genetics 111:779–794.

Tuller T, Zur H. 2015. Multiple roles of the coding sequence 5' end in gene expression regulation. Nucleic Acids Res 43:13–28.

Voight BF, Kudaravalli S, Wen X, Pritchard JK. 2006. A map of recent positive selection in the human genome. PLoS Biol 4:e72.

Volkman SK, Sabeti PC, DeCaprio D, Neafsey DE, Schaffner SF, Milner DA Jr, Daily JP, Sarr O, Ndiaye D, Ndir O, et al. 2007. A genome-wide map of diversity in Plasmodium falciparum. Nat Genet 39:113–119.

Weir W, Capewell P, Foth B, Clucas C, Pountain A, Steketee P, Veitch N, Koffi M, De Meeus T, Kabore J, et al. 2016. Population genomics reveals the origin and asexual evolution of human infective trypanosomes. eLife 5:e11473.

Weisse T. 2008. Distribution and diversity of aquatic protists: an evolutionary and ecological perspective. Biodivers Conserv 17:243–259.

Wickham H. 2010. ggplot2: Elegant graphics for data analysis. New York: Springer-Verlag.

Zhang JZ, Yang JR. 2015. Determinants of the rate of protein sequence evolution. Nat Rev Genet 16:409–420.


sexaurelia, 99% in both P. caudatum and its subset, and 99% in both P. multimicronucleatum and its subset have πn/πs < 1. Upon comparing the average πn/πs for the same set of orthologous genes across all species, we find the mean to be 0.10 for P. tetraurelia, 0.17 for P. biaurelia, 0.24 for P. sexaurelia, 0.12 [0.12] for P. caudatum [subset], and 0.17 [0.14] for P. multimicronucleatum [subset]. Among these species, P. sexaurelia has the highest average πn/πs, which suggests either relaxed selection because of a relatively small effective population size or more local adaptation. Local adaptation might result in fixation of more non-synonymous than synonymous differences between subpopulations, thus increasing πn/πs. Differences in πn/πs between species could also reflect genome-wide variation in the fraction of gene duplicates that are subfunctionalized, neofunctionalized, or nonfunctionalized. Alternatively, an elevation in average levels of πn/πs could be a consequence of a depression in the average levels of πs, which could be caused by stronger selection on synonymous sites or a recent bottleneck (Brandvain and Wright 2016) experienced by the species. Evidence that genes with higher πn/πs ratios are under more relaxed constraints is provided by a comparison between the average number of frame-shift causing indels occurring in the four sets of genes with (1) πn/πs ≤ 0.25, (2) 0.25 < πn/πs ≤ 0.5, (3) 0.5 < πn/πs ≤ 1.0, and (4) πn/πs > 1.0. As expected, we find an increase in the number of indels per site with increase in πn/πs (fig. 6b).
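The binning described above can be sketched as follows. This is a minimal illustration, not the authors' code: the dictionary keys (`pn`, `ps`, `indels`, `length`, `n_syn_snps`) are hypothetical names, and the filter of more than five synonymous polymorphisms is taken from the figure captions.

```python
def pn_ps_bin(pn, ps):
    """Return the pn/ps category (1-4) used in the text, or None if ps is 0."""
    if ps <= 0:
        return None  # ratio undefined without synonymous diversity
    ratio = pn / ps
    if ratio <= 0.25:
        return 1
    if ratio <= 0.5:
        return 2
    if ratio <= 1.0:
        return 3
    return 4


def mean_indels_per_site_by_bin(genes, min_syn_snps=5):
    """Average frame-shift indels per site for each pn/ps category.

    genes: iterable of dicts with the hypothetical keys
           pn, ps, indels, length, n_syn_snps.
    Genes with min_syn_snps or fewer synonymous polymorphisms are skipped.
    """
    totals = {}
    for gene in genes:
        if gene["n_syn_snps"] <= min_syn_snps:
            continue
        category = pn_ps_bin(gene["pn"], gene["ps"])
        if category is None:
            continue
        running = totals.setdefault(category, [0.0, 0])
        running[0] += gene["indels"] / gene["length"]
        running[1] += 1
    return {cat: total / count for cat, (total, count) in sorted(totals.items())}
```

Applied to real data, the returned dictionary would be expected to increase from category 1 to category 4 if genes with higher πn/πs are indeed under more relaxed constraint.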

FIG. 6. (A) πn versus πs of each gene. The red line represents πn = πs. (B) The distribution of πn/πs of each gene across the genome. Only genes with more than five synonymous polymorphisms were used in this analysis. (C) Number of frame-shift causing indels per site in genes belonging to three gene categories: (1) πn/πs ≤ 0.5 (light grey), (2) 0.5 < πn/πs ≤ 1.0 (grey), and (3) πn/πs > 1.0 (black). Only genes with more than five synonymous polymorphisms were used in this analysis.

Paralogs and Orthologs Evolve under Similar Selective Constraints within and across Species

We evaluated the correlation between πn/πs across orthologs and within-species paralogs, which were created by the most recent WGD (fig. 7). Both paralogs and orthologs show a significant positive correlation within and across all species, suggesting that most WGD duplicates are under similar selective constraints. We also confirmed that these correlations are not being driven simply by selection on synonymous sites, as πs is not necessarily correlated between orthologs and within-species paralogs across all species (supplementary fig. S8, Supplementary Material online). Similar findings in the maize genome also suggested correlations between nucleotide diversity and divergence between WGD paralogs (Pophaly and Tellier 2015). Our results further suggest that most orthologs are under very similar evolutionary constraints across the Paramecium species. This is in agreement with a recent study in Drosophila, where πn and πs were found to be strongly correlated between D. melanogaster and D. simulans (Langley et al. 2012). Such a correlation was also observed between orthologs in S. cerevisiae and S. paradoxus (Elyashiv et al. 2010).
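A correlation of the logarithm of πn/πs between gene pairs, with the slope of a linear regression, could be computed along these lines. This is a sketch under the assumption of ordinary least squares on log10-transformed ratios; the text does not state which regression method or log base was used, and the function name is hypothetical.

```python
import math


def log_log_regression(pairs):
    """Least-squares fit of log10(y) on log10(x).

    pairs: iterable of (x, y) pn/ps values for two members of a gene pair
           (two paralogs, or two orthologs). Pairs with a non-positive
           value are dropped because their logarithm is undefined.
    Returns (slope, intercept, r_squared).
    """
    points = [(math.log10(x), math.log10(y)) for x, y in pairs if x > 0 and y > 0]
    n = len(points)
    if n < 3:
        raise ValueError("need at least three usable pairs")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0 or syy == 0:
        raise ValueError("no variance in one of the variables")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared
```

The returned `r_squared` is the fraction of variance explained, the same quantity the text reports when relating πn/πs to expression level.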

A possible mechanism governing the degree of selective constraint acting on proteins is their level of expression (Zhang and Yang 2015). Non-synonymous mutations are thought to have more severe fitness consequences in highly expressed genes than in weakly expressed ones (Drummond et al. 2005; Gout et al. 2010). Indeed, we find a significant negative correlation between expression and πn/πs in all species (fig. 8). However, only 6–13% of the variance is explained by expression, implying that additional factors may play a greater role in determining the degree of selective constraint acting on a gene. Simple biochemical properties of proteins, like the protein's function, tertiary structure, and the number of interacting partners, might determine the proportion of functionally important residues in the protein and, in turn, govern the degree of selection.

FIG. 7. Correlation of logarithm of πn/πs for paralogs within species (three diagonal plots for P. tetraurelia, P. biaurelia, P. sexaurelia) and for orthologs between species pairwise [non-diagonal plots, with the plots below the diagonal (outlined red) representing the first set of WGD paralogs (T1, B1, S1) and the upper half of the matrix (outlined blue) representing the second set of paralogs (T2, B2, S2)]. The number in the top left corner of each plot is the slope of the linear regression in cases where the regression is significant (P < 0.01). Only genes with more than five synonymous polymorphisms were used in this analysis. The schematic tree on the bottom right corner shows the phylogenetic relationship between the species P. tetraurelia (T), P. biaurelia (B), P. sexaurelia (S), P. caudatum (C), and P. multimicronucleatum (M).

FIG. 8. Negative correlation of πn/πs with expression levels of genes in four Paramecium species. All correlations are significant with P < 2.2 × 10^-16. Only genes with more than five synonymous polymorphisms were used in this analysis.

Our observation that orthologous genes across such deeply diverged species experience similar degrees of selective constraint (fig. 7) might imply a low rate of turnover of expression, protein structure, function, and number of interacting partners. This finding may also be in part due to similar ecologies and, therefore, similar selective pressures experienced by the Paramecium species examined here. It may thus be informative to compare closely related species that live in, or have recently experienced, different environments. For instance, Elyashiv et al. (2010) found significant differences in levels of gene-specific purifying selection (πn/πs) in the sake population of S. cerevisiae in comparison to other populations of S. cerevisiae as well as S. paradoxus. In fact, this deviation in selective constraints experienced by the sake population is complemented by differences in response to different stresses (Kvitek et al. 2008), suggesting that changes in environment might be reflected in gene-specific constraints.

Distribution of Tajima's D across the Genome

Contrary to genes under purifying selection, genes under balancing or positive selection might often be species-specific. We tested this hypothesis using Tajima's D as a statistic, which uses the difference between nucleotide diversity and the number of segregating sites to detect deviations from neutrality. Because Tajima's D is sensitive to population size changes and population structure (Simonsen et al. 1995), we used an outlier approach. We calculated Tajima's D for non-overlapping 1 kb windows across the genome (supplementary fig. S9, Supplementary Material online) and looked for the presence of candidate genes in the outlier regions (0.25% of the two tails). The outlier regions with negative values of Tajima's D might represent genes either under weak purifying or positive selection (i.e., experiencing a recent sweep), whereas the ones with positive Tajima's D might be under balancing selection. Interestingly, both sets of outliers show an over-representation of genes involved in glycosylation, glycoprotein production and metabolism, and carbohydrate metabolism (supplementary table S8, Supplementary Material online) for all Paramecium species. For each species, the huge functional overlap of genes in both tails suggests the important role of these processes in Paramecium growth. But more importantly, this further suggests similar selective pressures genome-wide acting on all Paramecium species examined in this study. These conclusions can, however, be biased by gene length, as shorter genes might not have sufficient power for an accurate calculation of Tajima's D. Indeed, the outlier genes are all significantly longer than the rest of the genes in P. tetraurelia (outliers: 1567 bp, non-outliers: 1321 bp; P = 4.7 × 10^-13), P. biaurelia (outliers: 1681 bp, non-outliers: 1372 bp; P < 2.2 × 10^-16), and P. sexaurelia (outliers: 1860 bp, non-outliers: 1379 bp; P < 2.2 × 10^-16). This bias is, however, much lower or not significant for P. caudatum (outliers: 1463 bp, non-outliers: 1383 bp; P = 7.3 × 10^-3) and P. multimicronucleatum (outliers: 1509 bp, non-outliers: 1445 bp; P = 0.19), possibly because of higher genome-wide nucleotide diversity, indicating that despite some bias, the results have true biological significance.
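For readers unfamiliar with the statistic, the sketch below computes Tajima's D for a single window using the standard formulas of Tajima (1989), and then selects windows in the two tails. It is an illustration only: the function names, the input layout, and the rank-based tail selection are assumptions, since the text does not describe how the outlier cutoff was implemented.

```python
import math


def tajimas_d(n, seg_sites, pi_total):
    """Tajima's D (Tajima 1989) for one window.

    n         -- number of sampled sequences (at least 4 in this sketch)
    seg_sites -- number of segregating sites S in the window
    pi_total  -- mean number of pairwise differences in the window
                 (summed over sites, not divided by window length)
    Returns None when S is 0, because D is undefined without variation.
    """
    if n < 4:
        raise ValueError("need at least 4 sequences")
    if seg_sites == 0:
        return None
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    # Positive D: excess of intermediate-frequency variants;
    # negative D: excess of rare variants.
    return (pi_total - seg_sites / a1) / math.sqrt(variance)


def tail_outliers(d_by_window, tail=0.0025):
    """Return (lowest, highest) window ids, each tail holding `tail` of all windows."""
    ranked = sorted(d_by_window.items(), key=lambda item: item[1])
    k = int(len(ranked) * tail)
    low = [window for window, _ in ranked[:k]]
    high = [window for window, _ in ranked[len(ranked) - k:]] if k else []
    return low, high
```

Because D is normalized by its expected standard deviation, windows with few segregating sites give noisy values, which is one way to see the gene-length bias discussed in the text.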

Tajima's D, as a summary statistic of the site-frequency spectrum, can also indicate overall patterns of selection or recent changes in demography. Overall, in most species, the genome-wide distribution of Tajima's D is shifted towards positive values, with means of 0.862 for P. tetraurelia, 0.504 for P. biaurelia, 0.274 for P. sexaurelia, -0.148 [0.459] for P. caudatum [subset], and 1.325 [0.855] for P. multimicronucleatum [subset]. This could suggest pervasive local adaptation (or balancing selection), a recent contraction in population size, or a fairly ancient population subdivision in all Paramecium species. A previous study that sampled variation at 10 nuclear loci found that strains from different continents shared the same alleles and interpreted that as evidence of regular global dispersal of Paramecium from relatively few permanent locations (Catania et al. 2009). This could also be a potential explanation for the observed increase in nucleotide diversity in comparison to the number of segregating sites.

Fraction of Heterozygous Sites across the Genome

On average, about 0.1% of sites in an individual were heterozygous within the P. aurelia species, whereas 0.7% of sites in individuals belonging to P. caudatum and 1.1% of sites in isolates of P. multimicronucleatum were found to be heterozygous (supplementary table S1, Supplementary Material online). This is consistent with the observation that the P. aurelia species undergo autogamy, whereas the outgroup species do not. On evaluating the fraction of heterozygous sites in windows of 10 kb across the genome, we find large blocks of homozygosity in the P. aurelia genomes (supplementary fig. S10a–e, Supplementary Material online), consistent with a loss of heterozygosity in the lab, possibly through multiple episodes of autogamy. It should, however, be noted that peaks in heterozygous regions in the P. aurelia genomes are highly correlated between isolates, unlike in P. caudatum and P. multimicronucleatum, and might potentially reflect mapping or assembly errors. These heterozygous regions (summarized in supplementary Files S1–S3, Supplementary Material online) comprise a very small part, 0.82–0.97%, of the genome and, most importantly, on exclusion do not affect any of our analyses qualitatively.
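The windowed heterozygosity scan can be illustrated as below. This is a sketch under stated assumptions: positions are 0-based coordinates on one scaffold, every site in a window counts toward the denominator (the text does not say whether only callable sites were used), and both function names are hypothetical.

```python
def heterozygous_fraction_by_window(het_positions, scaffold_length, window=10_000):
    """Fraction of heterozygous sites in consecutive non-overlapping windows."""
    n_windows = -(-scaffold_length // window)  # ceiling division
    counts = [0] * n_windows
    for pos in het_positions:
        if not 0 <= pos < scaffold_length:
            raise ValueError(f"position {pos} lies outside the scaffold")
        counts[pos // window] += 1
    fractions = []
    for index, count in enumerate(counts):
        # The last window may be shorter than the others.
        size = min(window, scaffold_length - index * window)
        fractions.append(count / size)
    return fractions


def homozygous_blocks(fractions, max_fraction=0.0):
    """Runs of adjacent windows at or below max_fraction, as (first, last) indices."""
    blocks, start = [], None
    for index, fraction in enumerate(fractions):
        if fraction <= max_fraction:
            if start is None:
                start = index
        elif start is not None:
            blocks.append((start, index - 1))
            start = None
    if start is not None:
        blocks.append((start, len(fractions) - 1))
    return blocks
```

Comparing the windows with elevated heterozygosity across isolates is then a matter of intersecting index sets, which is how shared peaks (possible mapping or assembly artifacts) would show up.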

Variation across the Mitochondrial Genome

Few ciliate mitochondrial genomes have been sequenced until now, most belonging to the genus Tetrahymena (Brunk et al. 2003; Burger et al. 2000; Moradian et al. 2007). Others include Paramecium tetraurelia (Pritchard et al. 1990), Paramecium caudatum (Barth and Berendonk 2011), and Oxytricha trifallax (Swart et al. 2012). These studies have illustrated that ciliate mitochondrial genomes are linear and range between 40 and 47 kb, with 50 intronless genes. However, levels of sequence diversity across the entire mitochondrial genome have not yet been characterized.

To obtain mitochondrial genomes for all species included in this study, we carried out de novo genome assembly using the whole-genome resequencing data for each strain and extracted mitochondrial contigs using the published Paramecium mitochondrial genomes as reference (see Methods for more details). Using this approach, we derived complete genome assemblies for multiple strains, including from species without a previously published mitochondrial genome (P. biaurelia, P. sexaurelia, and P. multimicronucleatum). A comprehensive analysis of these sequences will be published elsewhere (Johri et al., in preparation). The mitochondrial genome assembly from each isolate was annotated separately, and the eight genes that were most ubiquitously present in all individuals were concatenated and used to construct a phylogeny (see Methods). The mitochondrial phylogeny closely tracked the one based on nuclear sequences (supplementary fig. S11, Supplementary Material online).


One strain from each species was chosen as a reference genome (the genome with the longest assembly), and reads from other individuals were mapped to the reference to call SNPs. During the original DNA preparation, efforts were made to remove non-MAC DNA, resulting in varying amounts of mitochondrial DNA present in the extracts from each isolate. Hence, some individuals mapped with extremely low coverage (read depth from 4 down to zero) to the reference genome, especially after applying stringent filters to call SNPs (supplementary table S9, Supplementary Material online). Thus, some individuals were excluded from this analysis, including all P. biaurelia isolates. The subsequent analysis was therefore restricted to four species.

All of the mitochondrial genomes are about 40 kb in length. We identified 626 SNPs in P. tetraurelia, 5307 in P. sexaurelia, 4771 [3041] in P. caudatum [subset], and 4600 [1478] in P. multimicronucleatum [subset]. Within all species, very few indels were found, with just 1 in P. tetraurelia, 9 in P. sexaurelia, and 15 [13] and 23 [9] in P. caudatum [subset] and P. multimicronucleatum [subset], respectively. The nucleotide diversity at 4-fold degenerate sites was found to be 0.009 in P. tetraurelia, 0.169 in P. sexaurelia, 0.127 [0.093] in P. caudatum [subset], and 0.172 [0.073] in P. multimicronucleatum [subset]. Nucleotide diversity across different sites shows that rRNAs are the most conserved sequences in the ciliate mitochondrial genomes. Transfer RNAs are the next most highly conserved, with their diversity levels only slightly above those of replacement sites (table 2). Our estimates of diversity in the Paramecium mitochondria are higher than those of a number of other unicellular eukaryotes studied so far: P. falciparum (πsyn = 0.0012) (Joy et al. 2003) and Chlamydomonas reinhardtii (πsilent = 0.0085) (Smith and Lee 2008).
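As a reminder of what per-site nucleotide diversity measures, the sketch below computes π from alternate-allele counts at biallelic sites. It is a textbook estimator under simplifying assumptions (biallelic SNPs, no missing data, one haploid mitochondrial sequence per isolate) and is not claimed to be the estimator used by the authors; the function and parameter names are hypothetical.

```python
def nucleotide_diversity(alt_counts, n_sequences, n_sites):
    """Average pairwise differences per site.

    alt_counts  -- alternate-allele count at each polymorphic site
    n_sequences -- number of sampled sequences
    n_sites     -- total number of sites considered, including invariant
                   ones (for example, all 4-fold degenerate sites)
    """
    if n_sequences < 2:
        raise ValueError("need at least two sequences")
    if n_sites <= 0:
        raise ValueError("n_sites must be positive")
    n_pairs = n_sequences * (n_sequences - 1) / 2
    total = 0.0
    for k in alt_counts:
        if not 0 <= k <= n_sequences:
            raise ValueError("allele count out of range")
        # k * (n - k) pairs of sequences differ at this site.
        total += k * (n_sequences - k) / n_pairs
    return total / n_sites
```

Restricting `alt_counts` and `n_sites` to one site class at a time (4-fold, 2-fold, 0-fold, tRNA, rRNA, intergenic) gives one value per class, which is the layout of table 2.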

We also recalculated nuclear nucleotide diversity (table 2) at 4-fold degenerate sites for the same set of individuals used in the mitochondrial analysis and observed that the mitochondrial to nuclear diversity ratio is close to 1.0 in all species (table 2) except P. sexaurelia, where the mitochondrial diversity is 5.7-fold higher than that in the nucleus. On excluding P. sexaurelia, we obtain a strong correlation between 4-fold degenerate nucleotide diversity in the mitochondria and nucleus (slope = 0.95, R² = 0.96). It should be noted that there are not enough data points to make any inferences about statistical significance. However, such a positive correlation between the nuclear and mitochondrial diversity suggests that mitochondrial diversity might scale with population size in ciliates, a question on which previous studies in animals have yielded conflicting findings (Bazin et al. 2006; Nabholz et al. 2008; Nabholz et al. 2009; Piganeau and Eyre-Walker 2009). The observed correlation also suggests that mitochondrial and nuclear spontaneous mutation rates may be correlated. Direct estimates of mutation rates in the metazoans C. elegans, D. melanogaster, and H. sapiens and in the yeast S. cerevisiae, obtained via mutation accumulation experiments, suggest a moderate correlation (slope = 0.05, R² = 0.48) between mitochondrial and nuclear base substitution rates (using data from Lynch et al. 2008). Discordance between the nuclear and mitochondrial diversity, for instance in P. sexaurelia, can be attributed to various factors, such as positive selection (Hudson and Turelli 2003; Galtier et al. 2009; James et al. 2016), demographic events (Pool and Nielsen 2007), or simply differences in nuclear and/or mitochondrial mutation rates (Lynch 2007). Future studies are needed to dissect the factors responsible for the remarkably different ratio of mitochondrial to nuclear diversity in P. sexaurelia.

Table 2. Nucleotide Diversity (per site) of Different Sites across the Mitochondrial Genomes.

Species                            4-fold (πm) [SE]    2-fold [SE]         0-fold [SE]         tRNA [SE]           rRNA [SE]           Intergenic [SE]     4-fold nuclear (πn) [SE]    πm/πn
P. tetraurelia                     0.0091 [0.0010]     0.0107 [0.0009]     0.0019 [0.0002]     0.0000 [0.0000]     0.0006 [0.0002]     0.0041 [0.0003]     0.0058 [2.26 × 10^-5]       1.57
P. sexaurelia                      0.1692 [0.0033]     0.1427 [0.0024]     0.0097 [0.00043]    0.0000 [0.0000]     0.0059 [0.0007]     0.0417 [0.0027]     0.02945 [4.73 × 10^-5]      5.75
P. caudatum                        0.1272 [0.0037]     0.0942 [0.0019]     0.0179 [0.0006]     0.0455 [0.0023]     0.0136 [0.0012]     0.0333 [0.0027]     0.1267 [1.08 × 10^-4]       1.00
P. caudatum (subset)               0.0929 [0.0039]     0.0859 [0.0021]     0.0149 [0.0006]     0.0375 [0.0023]     0.0093 [0.0011]     0.0292 [0.0028]     0.0658 [9.63 × 10^-5]       1.41
P. multimicronucleatum             0.1727 [0.0047]     0.0831 [0.0019]     0.0227 [0.0008]     0.0263 [0.0011]     0.0269 [0.0016]     0.0329 [0.0022]     0.1599 [1.34 × 10^-4]       1.08
P. multimicronucleatum (subset)    0.0730 [0.0034]     0.0383 [0.0015]     0.0121 [0.0006]     0.0126 [0.0009]     0.0135 [0.0012]     0.0312 [0.0022]     0.0742 [9.56 × 10^-5]       0.98
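The mitochondrial to nuclear diversity ratios quoted in the text follow directly from the 4-fold degenerate values in table 2. The short sketch below reproduces that arithmetic for the full sample sets; the dictionary and function names are hypothetical, and the input values are transcribed from table 2.

```python
# 4-fold degenerate diversity per species as (mitochondrial, nuclear),
# transcribed from table 2 (full sample sets, not the subsets).
TABLE2_FOURFOLD = {
    "P. tetraurelia": (0.0091, 0.0058),
    "P. sexaurelia": (0.1692, 0.02945),
    "P. caudatum": (0.1272, 0.1267),
    "P. multimicronucleatum": (0.1727, 0.1599),
}


def mito_to_nuclear_ratio(pi_mito, pi_nuclear):
    """Ratio of mitochondrial to nuclear nucleotide diversity."""
    if pi_nuclear <= 0:
        raise ValueError("nuclear diversity must be positive")
    return pi_mito / pi_nuclear


ratios = {
    species: round(mito_to_nuclear_ratio(pi_m, pi_n), 2)
    for species, (pi_m, pi_n) in TABLE2_FOURFOLD.items()
}
```

P. sexaurelia stands out with a ratio several times larger than that of any other species, which is the discordance the text discusses.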

Conclusions

This is the first study where multiple individuals of several species of free-living unicellular eukaryotes have been sequenced and used to reveal genome-wide patterns of variation. This study thus provides an invaluable resource for future research in Paramecium, a major model organism in cell biology. Sequencing of multiple individuals from the same lake or pond is needed to further resolve the global versus local population structure of these cosmopolitan unicellular species. Overall, our results suggest that most of the intergenic regions across the highly compact genomes of Paramecium species are strongly constrained and probably harbor functionally important regulatory modules, which might extend into coding regions. We show that patterns of nucleotide diversity across the genome are driven primarily by a combination of purifying selection and mutation rate variation. In the future, such analyses will benefit greatly from incorporating variation in recombination rates across the genomic landscape. We also find that duplicates created by the WGD appear to be evolving under similar constraints. Ratios of non-synonymous to synonymous diversity need to be supplemented with divergence levels to distinguish between the various types of selective forces acting on the duplicates. Thus, sequencing of much closer outgroup species would allow us to further understand the role of selection in the evolution of different genomic elements, including gene duplicates that resulted from ancient WGDs.

Methods

Sampling, DNA Extraction, and Sequencing

We examined a total of 56 Paramecium strains originally isolated from environmental sources, including 11 isolates of P. biaurelia, 13 isolates of P. sexaurelia, 12 isolates of P. tetraurelia, 10 isolates of P. caudatum, and 10 isolates of P. multimicronucleatum. A total of 14 strains from P. biaurelia and P. sexaurelia were kindly provided by E. Przybos, whereas the remaining strains were cultured in the Paramecium collections of the Institute of Hydrobiology, TU Dresden, Germany, or the Department of Biology, Indiana University, Bloomington, USA. Prior to our analyses, P. tetraurelia strain Moz 13CXI as well as P. sexaurelia strains Moz 13BIII and Indo 17I were kindly identified by E. Przybos by performing inter- and intrastrain crosses (Przybos 1975). Paramecia were maintained in 0.25% Cerophyl medium (modified according to Sonneborn 1970; Krenek et al. 2011) inoculated with Enterobacter aerogenes. Since P. aurelia species can undergo autogamy (self-fertilization) induced by starvation, and all strains have been cultured under lab conditions for several years, all P. aurelia strains in this study are expected to be more homozygous than wild isolates.

Prior to DNA extraction, all cultures were initiated from a single cell, ensuring clonality. A modified sucrose density separation (Aury et al. 2006) was performed to extract primarily macronuclei (MACs) and to reduce the amount of non-MAC DNA potentially compromising subsequent analyses. Briefly, Paramecium cells were starved for at least two days and filtered through several layers of sterile cheesecloth prior to cell harvesting via gentle centrifugation. Pelleted cells were gently lysed in 0.25 M sucrose medium (pH 6.8) containing 10 mM Tris-HCl, 8 mM MgCl2, 3 mM CaCl2, 0.5–1% Nonidet-P40, and 0.1–0.2% sodium deoxycholate. Macronuclei were subsequently isolated away from cellular debris, micronuclei, and mitochondria by centrifugation at 100 rcf in 0.25 M sucrose wash buffer (see above, without Nonidet-P40 and Na-deoxycholate). Genomic DNA extraction from MAC isolations was performed by a modified CTAB procedure (Gustincich et al. 1991).

All genomes were sequenced using the Illumina HiSeq 2000, and paired-end reads with insert sizes between 200 and 400 bp and a mean length of 151 bp were obtained. The reads mapped with coverage ranging between 40× and 80× (details given in supplementary table S1, Supplementary Material online).

Paramecium Reference Genomes

Complete genomes of P. tetraurelia, P. biaurelia, P. sexaurelia, and P. caudatum were downloaded from ParameciumDB (Arnaiz et al. 2007; Arnaiz and Sperling 2011), along with the annotations. Ortho-paralog relationships as well as functional annotations were retrieved from the supplementary information provided in McGrath et al. (2014a, 2014b). RNAseq data for all four species were downloaded and processed as in Gout and Lynch (2015).

The genome of P. multimicronucleatum was assembled using Celera (Miller et al. 2008) and annotated using Eugene (Foissac et al. 2008), as previously described (McGrath et al. 2014a, 2014b). A detailed description of the sequencing, assembly, and annotation of P. multimicronucleatum can be found in Marinov et al. (in preparation). OMA (Orthology MAtrix) (Altenhoff et al. 2015) was used to find orthologs between P. caudatum and P. multimicronucleatum. A total of 11002 genes of P. caudatum were found to have 1-1 orthologs in P. multimicronucleatum, and only 1-1 orthologs were used for the current analysis.
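Restricting an ortholog list to 1-1 relationships can be illustrated with a generic filter over gene pairs. This is only a sketch: it assumes the ortholog calls are available as plain (gene in species A, gene in species B) tuples, it does not parse any particular OMA output format, and the function name is hypothetical.

```python
from collections import Counter


def one_to_one_orthologs(pairs):
    """Keep ortholog pairs in which each gene appears in exactly one pair.

    pairs: iterable of (gene_a, gene_b) tuples, one gene from each species.
    Duplicate tuples are collapsed before counting.
    """
    unique_pairs = set(pairs)
    seen_a = Counter(gene_a for gene_a, _ in unique_pairs)
    seen_b = Counter(gene_b for _, gene_b in unique_pairs)
    return sorted(
        (gene_a, gene_b)
        for gene_a, gene_b in unique_pairs
        if seen_a[gene_a] == 1 and seen_b[gene_b] == 1
    )
```

Genes involved in one-to-many or many-to-one relationships are dropped entirely, so the downstream πn/πs comparisons pair each gene with a single counterpart.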

Mapping of Reads and Detection of SNPs and Indels

Sequenced paired-end reads were mapped to the reference genomes for each species using BWA (bwa mem) (Li and Durbin 2010) with default parameters, and the mpileup function in Samtools (Li et al. 2009) was used to generate mpileup files for each individual. Two distinct pipelines were employed to identify SNPs to ensure detection accuracy: Bcftools-Vcftools (Li 2011) and the maximum likelihood Genotype Frequency Estimator (referred to as GFE) (Lynch 2009; Maruki and Lynch 2015). Both pipelines produced highly overlapping sets of SNPs (supplementary table S2, Supplementary Material online). For both pipelines, sites were only considered for SNP detection and further analysis if the mapping quality was above 30, base quality was above 20, per-base alignment quality was above 15, and the sum of the depth of coverage for all individuals was above 5 times the number of individuals and less than twice the average population coverage. SNPs were called by bcftools if the SNP quality score was above 30 and by GFE if the significance score was > 5.91 (P < 0.05). The exact parameters of the Bcftools-Vcftools pipeline to call SNPs and indels are given below.

samtools mpileup -q 30 -Q 20 -t DP -uvf reference_genome.fasta sorted.bam > species.vcf

bcftools call -f GQ -o species_snps.vcf -O v -V indels -cv species.vcf

bcftools call -f GQ -o species_indels.vcf -O v -V snps -cv species.vcf

#filtering snp and indel files for depth and biases
perl vcfutils.pl varFilter -d 50 -D 800 -1 -2 -3 -4 species_snps.vcf > species_snps.filtered1.vcf

perl vcfutils.pl varFilter -d 50 -D 800 -1 -2 -3 -4 species_indels.vcf > species_indels.filtered1.vcf

vcftools --vcf species_snps.filtered1.vcf --minQ 20 --minGQ 30 --minDP 4 --recode --recode-INFO-all --out species_snps.filtered2

vcftools --vcf species_indels.filtered1.vcf --minQ 20 --minGQ 30 --minDP 4 --recode --recode-INFO-all --out species_indels.filtered2

#filter out the recoded sites that are not polymorphic anymore, that is, AF1 = 1
bcftools view -q 0.001 species_snps.filtered2.recode.vcf -o species_snps.final.vcf

bcftools view -q 0.001 species_indels.filtered2.recode.vcf -o species_indels.final.vcf

Mitochondrial SNPs were called using the Bcftools-Vcftools pipeline with the same parameters as noted above.
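The site-level depth criterion stated in this section can be illustrated with a short Python sketch. The function name `passes_depth_filter` and its arguments are hypothetical and are not part of the published pipeline; the lower bound is read here as a minimum of 5 reads per individual, which is an assumption that matches `-d 50` for about ten isolates.

```python
def passes_depth_filter(total_depth, n_individuals, mean_population_coverage):
    """Return True if the summed depth at a site lies in the accepted range.

    Assumed reading of the text: the summed depth must be at least
    5 x (number of individuals) and below 2 x (average population coverage).
    """
    lower = 5 * n_individuals
    upper = 2 * mean_population_coverage
    return lower <= total_depth < upper


if __name__ == "__main__":
    # With 10 individuals and an average summed coverage of 400x,
    # the accepted range is [50, 800), mirroring -d 50 -D 800.
    for depth in (40, 60, 800):
        print(depth, passes_depth_filter(depth, 10, 400))
```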

Simulation of SNP Detection with Varying Levels of Divergence from the Reference
We simulated reads using wgsim (https://github.com/lh3/wgsim) for a diploid genome with varying levels of divergence from the reference genome of P. caudatum and P. multimicronucleatum, using average values of read length, total number of reads, and insert sizes.

#For the P. caudatum reference genome
wgsim -d 287 -s 113 -N 10201087 -1 151 -2 151 -r 0.10 -R 0.013 caudatum_assembly.fasta caud_0.10_1.fastq caud_0.10_2.fastq >& caud_0.10.info

#For the P. multimicronucleatum reference genome
wgsim -d 261 -s 102 -N 12154044 -1 151 -2 151 -r 0.10 -R 0.011 multimicronucleatum_assembly.fasta multi_0.10_1.fastq multi_0.10_2.fastq >& multi_0.10.info

Reads were simulated for r = 0.0, 0.05, 0.10, 0.15, 0.20, 0.25, and 0.30, which represents the fraction of the genome with SNPs. The SNP detection pipeline described above was used to call SNPs, which were compared to the simulated set of SNPs (supplementary table S4, Supplementary Material online).
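One way to compare called SNPs against the simulated set is to compute sensitivity and the false discovery rate. The Python sketch below is illustrative only: the function name `snp_call_accuracy` and the representation of a SNP as a (scaffold, position) tuple are assumptions, and the source does not state which summary measures were reported in the supplementary table.

```python
def snp_call_accuracy(simulated, called):
    """Compare called SNP sites with the simulated (true) SNP sites.

    Each SNP is a hashable key such as (scaffold, position).
    Returns (sensitivity, false_discovery_rate).
    """
    simulated, called = set(simulated), set(called)
    true_positives = len(simulated & called)
    sensitivity = true_positives / len(simulated) if simulated else float("nan")
    fdr = (len(called) - true_positives) / len(called) if called else float("nan")
    return sensitivity, fdr


if __name__ == "__main__":
    truth = {("s1", 10), ("s1", 20), ("s2", 5), ("s2", 9)}
    calls = {("s1", 10), ("s2", 5), ("s2", 7)}
    print(snp_call_accuracy(truth, calls))
```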

Building Phylogenies
A consensus genome (bcftools view species_filtered.recode.vcf | vcfutils.pl vcf2fq > species.cns.fq) was obtained for each individual using bcftools that represented heterozygous sites using DNA ambiguity codes. Gene nucleotide sequences that are present in single copy in the P. aurelia species and are also present in both the outgroup species (P. caudatum and P. multimicronucleatum) were extracted, and multiple sequence alignments were generated using MUSCLE (Edgar 2004a, 2004b). This amounted to about 55 sequences (including the reference genomes) each in 883 sets of orthologous genes. RAxML 8.0.26 (Stamatakis 2014) was used to build a tree using the concatenation of all 883 aligned gene sequences (raxml -s concatenated_sequences.fasta -n gene -m GTRGAMMA -T 4 -p 7). In addition, individual gene trees (raxml -s gene.fasta -n gene -m GTRGAMMA -T 4 -p 7) were built, which were then used to create a consensus tree (raxml -J MR -z trees.newick -T 4 -m GTRGAMMA) (Aberer et al. 2010).
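The representation of heterozygous sites by DNA ambiguity codes can be illustrated as follows. This is a minimal Python sketch using the standard IUPAC two-base codes; the helper `consensus_base` is hypothetical and only mimics the idea, since the actual conversion in this study was performed by vcfutils.pl.

```python
# Standard IUPAC codes for the six possible two-base combinations.
IUPAC = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def consensus_base(allele1, allele2):
    """Return the base itself for a homozygote, else its IUPAC ambiguity code."""
    if allele1 == allele2:
        return allele1
    return IUPAC[frozenset((allele1, allele2))]


if __name__ == "__main__":
    print(consensus_base("A", "G"), consensus_base("T", "C"), consensus_base("A", "A"))
```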

Mitochondrial phylogeny was built using the same parameters as above by concatenating eight sets of orthologous genes (NADH_dehydrogenase_subunit_1_a, NADH_dehydrogenase_subunit_5, Ymf56/orf78, Ymf67/orf265_1, Ymf68/orf393, apocytochrome_b, cytochrome_c_oxidase_subunit_1, cytochrome_c_oxidase_subunit_2, ribosomal_protein_S14) that were ubiquitously present across all sampled species.

Principal Component Analysis
Genotypes for all individuals were obtained from the vcf files generated by vcftools for all scaffolds. Only the sites that had a genotype available for all individuals were used in the analysis, i.e., if even a single individual had missing data, the site was excluded from analysis. A total of ~10,000 markers were used for each species. The first marker of each scaffold was picked randomly within the first 50 SNP positions, and the rest were selected so that all markers were uniformly distributed across all scaffolds. PCA was carried out on these markers by using the R (R-Core-Team 2014) command prcomp (https://stat.ethz.ch/R-manual/R-devel/library/stats/html/prcomp.html) and visualized using the first two components with the R graphing package ggplot2 (Wickham 2010).
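The marker selection described here can be sketched in Python. The helpers `complete_sites` and `pick_markers` and the `step` parameter are hypothetical; in particular, "uniformly distributed" is interpreted as even spacing in SNP index along a scaffold, which is an assumption rather than the documented procedure.

```python
import random


def complete_sites(genotypes):
    """Keep only sites where every individual has a genotype (None = missing)."""
    return [row for row in genotypes if all(g is not None for g in row)]


def pick_markers(snp_positions, step, rng=None, first_window=50):
    """Pick a random first marker among the first `first_window` SNPs of a
    scaffold, then every `step`-th SNP after it."""
    rng = rng or random.Random()
    if not snp_positions:
        return []
    start = rng.randrange(min(first_window, len(snp_positions)))
    return snp_positions[start::step]


if __name__ == "__main__":
    positions = list(range(1000))          # SNP indices on one scaffold
    markers = pick_markers(positions, 100, random.Random(1))
    print(len(markers), markers[:3])
```

The thinned genotype matrix would then be passed to a PCA routine such as R's prcomp, as in the text.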

Calculation of Population Genetic Statistics
Nucleotide Diversity and Tajima's D
Nucleotide diversity was calculated as 2pqn/(n-1), where p and q are the major and minor allele frequencies and n is the number of chromosomes. The major and minor allele frequencies were given for each site by GFE. Diversity at coding sites was partitioned into synonymous and non-synonymous components (e.g., 2-fold degenerate sites are one-third synonymous and two-thirds non-synonymous). Major allele frequencies were also extracted from vcftools and returned very similar results to those from GFE (supplementary table S3, Supplementary Material online).
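A worked sketch of the per-site diversity calculation is given below in Python. It reads the expression in the text as 2pq·n/(n-1), the usual sample-size-corrected heterozygosity; this reading, the function name `site_diversity`, and the 0-fold and 4-fold entries of the weighting table are assumptions (only the 2-fold case is stated in the text).

```python
def site_diversity(p, n):
    """Sample-size-corrected heterozygosity at a biallelic site.

    p is the major allele frequency and n the number of sampled chromosomes.
    """
    q = 1.0 - p
    return 2.0 * p * q * n / (n - 1)


# Fraction of a site counted as synonymous, by degeneracy class.
# Only the 2-fold value (one-third) comes from the text; the others
# follow the common convention and are assumed here.
SYNONYMOUS_FRACTION = {0: 0.0, 2: 1.0 / 3.0, 4: 1.0}


if __name__ == "__main__":
    pi = site_diversity(0.8, 20)
    print(round(pi, 4))
    print(SYNONYMOUS_FRACTION[2], 1.0 - SYNONYMOUS_FRACTION[2])
```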

A sliding window approach was used to calculate Tajima's D across the genome. Vcftools was used to obtain these values for non-overlapping windows of 1 kb in length. The program ms (Hudson 2002) was used to perform 100,000 simulations of 10 individuals with 1 kb locus length, and niceStats (https://github.com/kern-lab/shIC/blob/master/niceStats.c) was used to calculate Tajima's D from the ms output file. To obtain Tajima's D for different categories of sites, SNPs belonging to a particular category of sites (specified in Results) were selected, and vcftools was used to calculate Tajima's D for non-overlapping windows of 10 kb. Windows with 2 or more SNPs were used to calculate the mean, and only alternate windows were used to compare distributions of Tajima's D across different categories by Student's t-test to maximise independence of sampling.
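For reference, the statistic computed per window can be written out explicitly. The Python sketch below implements the standard formula of Tajima (1989); it is not the code used in this study (vcftools and niceStats were), and the function name `tajimas_d` and its arguments are illustrative.

```python
import math


def tajimas_d(n, num_segregating, pi):
    """Tajima's D for one window.

    n: number of sampled chromosomes
    num_segregating: number of segregating sites (S) in the window
    pi: mean number of pairwise differences in the window
    Returns NaN when the statistic is undefined (no SNPs or zero variance).
    """
    s = num_segregating
    if s == 0 or n < 2:
        return float("nan")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * s + e2 * s * (s - 1)
    if variance <= 0:
        return float("nan")
    return (pi - s / a1) / math.sqrt(variance)


if __name__ == "__main__":
    # pi below S/a1 gives a negative D (excess of rare variants).
    print(tajimas_d(10, 10, 1.0) < 0, tajimas_d(10, 10, 5.0) > 0)
```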

Plotting Average Expected Heterozygosity across Introns and Intergenic Regions
All introns (< 500 bp in length) were split midway and aligned according to the 5′ and 3′ splice sites. All intergenic regions (< 500 bp in the outgroup species and < 1000 bp in the P. aurelia species) were split midway and aligned according to the start and stop codons. Note that these were not sequence alignments. At each site, the mean expected heterozygosity was calculated. Sites with an insufficient number of data points (< 200 data points per site for introns; for intergenic regions, < 100 data points per site for outgroups and < 1000 data points per site in the P. aurelia species) were discarded.
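The positional averaging described above can be sketched as follows. This Python example assumes each region is supplied as a list of per-site heterozygosity values ordered from its 5′ end to its 3′ end; the function name `positional_means` and the choice to assign the middle site of odd-length regions to the 3′ half are assumptions.

```python
from collections import defaultdict


def positional_means(regions, min_points):
    """Average per-site values by distance from each end of a region.

    Each region is split midway; the first half is indexed by offset from
    the 5' boundary and the second half by offset from the 3' boundary.
    Offsets with fewer than `min_points` observations are discarded.
    """
    from_5p = defaultdict(list)
    from_3p = defaultdict(list)
    for values in regions:
        half = len(values) // 2
        for offset, v in enumerate(values[:half]):
            from_5p[offset].append(v)
        for offset, v in enumerate(reversed(values[half:])):
            from_3p[offset].append(v)

    def summarise(bins):
        return {k: sum(v) / len(v) for k, v in bins.items() if len(v) >= min_points}

    return summarise(from_5p), summarise(from_3p)


if __name__ == "__main__":
    regions = [[0.1, 0.2, 0.3, 0.4], [0.3, 0.4]]
    print(positional_means(regions, min_points=2))
```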

Calculating and Plotting Divergence across Introns and Intergenic Regions
Orthologous coding sequences along with introns were aligned using MUSCLE (Edgar 2004a). Intergenic regions flanked by orthologous genes on both sides were also aligned along with the flanking coding exon using MUSCLE. For each species separately, divergence for different positions (with respect to the start and stop sites) was averaged across all introns and intergenic regions. The divergence values were corrected for multiple substitutions by the Jukes-Cantor sequence substitution model and plotted in the same manner as described above.
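The Jukes-Cantor correction converts an observed proportion of differing sites p into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). A minimal Python sketch follows; the function name `jukes_cantor` is illustrative, and returning NaN for saturated values (p ≥ 0.75) is a design choice made here, not something stated in the text.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor distance from the observed proportion of differences p."""
    if p < 0 or p >= 0.75:
        return float("nan")  # correction is undefined at or beyond saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    for p in (0.0, 0.1, 0.3):
        print(p, round(jukes_cantor(p), 4))
```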

GO Analysis
All gene ontology enrichment analyses were performed using the BiNGO plugin for Cytoscape 3.2.1 (Shannon et al. 2003; Maere et al. 2005), using the hypergeometric test to assess significance, with Benjamini and Hochberg FDR correction for multiple testing (Benjamini and Hochberg 1995). Latest ontology files (.obo) were downloaded from www.geneontology.org, and custom annotation files were prepared as required using python scripts.
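The two statistical steps named here, the hypergeometric test and the Benjamini and Hochberg correction, can be written out in a few lines. The Python sketch below is a generic illustration and not BiNGO's implementation; function and argument names are assumptions.

```python
from math import comb


def hypergeom_upper_tail(k, annotated, sample_size, population):
    """P(X >= k): chance of seeing at least k annotated genes in a sample of
    `sample_size` genes drawn from `population` genes, of which `annotated`
    carry the GO term."""
    total = comb(population, sample_size)
    upper = min(annotated, sample_size)
    hits = sum(
        comb(annotated, i) * comb(population - annotated, sample_size - i)
        for i in range(k, upper + 1)
    )
    return hits / total


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):           # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    print(hypergeom_upper_tail(3, 5, 3, 10))
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))
```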

Identification of Heterozygous Regions in the P. aurelia Species
The fraction of heterozygous sites was calculated for 10 bp non-overlapping windows across the genome for all three P. aurelia species, and the top 1% of such regions, those with the highest fraction of heterozygous sites, were identified as heterozygous regions.
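The window scan can be sketched in Python as below. The function name `top_heterozygous_windows`, the 0-based coordinates, and the tie-breaking by window index are assumptions, as is the reading of the threshold as the top 1% of windows.

```python
def top_heterozygous_windows(het_positions, scaffold_length, window=10, top_fraction=0.01):
    """Return indices of the windows with the highest fraction of heterozygous sites.

    het_positions: 0-based positions of heterozygous sites on one scaffold.
    Windows are non-overlapping; the last one may be shorter than `window`.
    """
    n_windows = (scaffold_length + window - 1) // window
    counts = [0] * n_windows
    for pos in het_positions:
        counts[pos // window] += 1
    fractions = []
    for w, count in enumerate(counts):
        length = min(window, scaffold_length - w * window)
        fractions.append(count / length)
    n_top = max(1, int(n_windows * top_fraction))
    ranked = sorted(range(n_windows), key=lambda w: (-fractions[w], w))
    return sorted(ranked[:n_top])


if __name__ == "__main__":
    het = [5, 6, 7, 250, 251, 900]
    print(top_heterozygous_windows(het, 1000))
```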

Mitochondrial Genome Assembly and Annotation
Mitochondrial genomes were assembled as follows. Sequence reads were assembled using SPAdes (version 3.5.0) (Bankevich et al. 2012) after removing potential adapter sequence with Trimmomatic (version 0.33) (Bolger et al. 2014). Mitochondrial contigs were identified from the resulting assemblies by BLAST searches against the published P. caudatum and P. tetraurelia mitochondrial genomes (Pritchard et al. 1990; Barth and Berendonk 2011).

Protein coding genes were identified by generating all ORFs longer than 60 aa in all six frames using the Mold, Protozoan, and Coelenterate Mitochondrial Code (i.e., UGA codes for W instead of being a stop codon) and all alternative start codons specific to Paramecium (AUU, AUA, AUG, AUC, GUG, and GUA), and retaining the longest ORFs associated with each stop codon. BLASTP (Altschul et al. 1997) was then used to identify homologs of annotated mitochondrial proteins in P. tetraurelia and P. caudatum. Additional ORFs were identified by imposing the requirement that their length exceeds 100 aa, and subsequently annotated using BLASTP against the Non-redundant protein sequences (nr) database and HMMER 3.0 (Eddy 2011) scans against the PFAM 27.0 database (Finn et al. 2014).
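The ORF enumeration described above can be illustrated with a small Python sketch. It scans all six frames, treats only TAA and TAG as stops (TGA encodes W under this code), accepts the listed alternative start codons, and keeps the longest ORF per stop codon by taking the first start after the previous in-frame stop. The function name `longest_orfs` and the coordinate convention (positions relative to the scanned strand) are assumptions; this is not the script used in the study.

```python
STOPS = {"TAA", "TAG"}  # TGA is read as tryptophan in this genetic code
STARTS = {"ATT", "ATA", "ATG", "ATC", "GTG", "GTA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def longest_orfs(seq, min_codons=60):
    """Return (strand, frame, start, end) for ORFs longer than `min_codons`.

    `start` is the first base of the start codon and `end` is one past the
    stop codon, both on the scanned strand.
    """
    seq = seq.upper()
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if codon in STOPS:
                    if start is not None and (i - start) // 3 > min_codons:
                        found.append((strand, frame, start, i + 3))
                    start = None
                elif start is None and codon in STARTS:
                    start = i  # earliest start gives the longest ORF for this stop
    return found


if __name__ == "__main__":
    toy = "ATG" + "GCT" * 30 + "TGA" + "GCT" * 30 + "TAA"
    print(longest_orfs(toy))
```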

tRNA genes were annotated with tRNAscan-SE 1.21 (Schattner et al. 2005) using the "Mito/Chloroplast" source.

Supplementary Material
Supplementary data are available at Molecular Biology and Evolution online.

Acknowledgments
We thank Lydia Bright and Jean-Francois Gout for helpful comments on the manuscript, Matthew S. Ackerman for insightful discussions related to the project, and the anonymous reviewers and editor for their valuable suggestions. We would also like to thank Kelley Thomas for providing help with the sequencing and E. Przybos for kindly providing several Paramecium strains and the species identification of some so far unspecified strains. This work was financially supported by the National Science Foundation (MCB-1050161 and DEB-1257806), the European Commission FP7-PEOPLE-2009-IRSES project CINAR PATHOBACTER (project number 247658), and the CNRS research network GDRE "Paramecium Genome Dynamics and Evolution". T.U.B. also acknowledges funding of the "support the best" of the ZUK TU Dresden and the Deutsche Forschungsgemeinschaft (DFG).

References
Abbott RJ, Gomes MF. 1989. Population genetic-structure and outcrossing rate of Arabidopsis thaliana (L.) Heynh. Heredity 62:411–418.

Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, McVean GA. 2010. A map of human genome variation from population-scale sequencing. Nature 467:1061–1073.

Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA. 2012. An integrated map of genetic variation from 1,092 human genomes. Nature 491:56–65.

Aberer AJ, Pattengale ND, Stamatakis A. 2010. Parallelized phylogenetic post-analysis on multi-core architectures. J Comput Sci-Neth. 1:107–114.

Ackermann AA, Panunzi LG, Cosentino RO, Sanchez DO, Aguero F. 2012. A genomic scale map of genetic diversity in Trypanosoma cruzi. BMC Genomics 13:736.

Altenhoff AM, Skunca N, Glover N, Train CM, Sueki A, Pilizota I, Gori K, Tomiczek B, Muller S, Redestig H, et al. 2015. The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Res. 43:D240–D249.

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389–3402.

Arnaiz O, Cain S, Cohen J, Sperling L. 2007. ParameciumDB: a community resource that integrates the Paramecium tetraurelia genome sequence with genetic data. Nucleic Acids Res. 35:D439–D444.

Arnaiz O, Mathy N, Baudry C, Malinsky S, Aury JM, Wilkes CD, Garnier O, Labadie K, Lauderdale BE, Le Mouel A, et al. 2012. The Paramecium germline genome provides a niche for intragenic parasitic DNA: evolutionary dynamics of internal eliminated sequences. PLoS Genet. 8:e1002984.

Arnaiz O, Sperling L. 2011. ParameciumDB in 2011: new tools and new data for functional and comparative genomics of the model ciliate Paramecium tetraurelia. Nucleic Acids Res. 39:D632–D636.

Assefa S, Lim C, Preston MD, Duffy CW, Nair MB, Adroub SA, Kadir KA, Goldberg JM, Neafsey DE, Divis P, et al. 2015. Population genomic structure and adaptation in the zoonotic malaria parasite Plasmodium knowlesi. Proc Natl Acad Sci U S A. 112:13027–13032.

Aury JM, Jaillon O, Duret L, Noel B, Jubin C, Porcel BM, Segurens B, Daubin V, Anthouard V, Aiach N, et al. 2006. Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature 444:171–178.

Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR. 2015. A global reference for human genetic variation. Nature 526:68–74.

Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, et al. 2012. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 19:455–477.

Barth D, Berendonk TU. 2011. The mitochondrial genome sequence of the ciliate Paramecium caudatum reveals a shift in nucleotide composition and codon usage within the genus Paramecium. BMC Genomics 12:272.

Barth D, Krenek S, Fokin SI, Berendonk TU. 2006. Intraspecific genetic variation in Paramecium revealed by mitochondrial cytochrome C oxidase I sequences. J Eukaryot Microbiol. 53:20–25.

Bazin E, Glemin S, Galtier N. 2006. Population size does not influence mitochondrial genetic diversity in animals. Science 312:570–572.

Beale G, Preer JR. 2008. Paramecium: genetics and epigenetics. Boca Raton (FL): CRC Press.

Begun DJ, Aquadro CF. 1992. Levels of naturally-occurring DNA polymorphism correlate with recombination rates in Drosophila melanogaster. Nature 356:519–520.

Begun DJ, Holloway AK, Stevens K, Hillier LW, Poh YP, Hahn MW, Nista PM, Jones CD, Kern AD, Dewey CN, et al. 2007. Population genomics: whole-genome analysis of polymorphism and divergence in Drosophila simulans. PLoS Biol. 5:e310.

Benjamini Y, Hochberg Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B Met. 57:289–300.

Betermier M. 2004. Large-scale genome remodelling by the developmentally programmed elimination of germ line sequences in the ciliate Paramecium. Res Microbiol. 155:399–408.

Birky CW, Walsh JB. 1988. Effects of linkage on rates of molecular evolution. Proc Natl Acad Sci U S A. 85:6414–6418.

Birnbaum RY, Clowney EJ, Agamy O, Kim MJ, Zhao J, Yamanaka T, Pappalardo Z, Clarke SL, Wenger AM, Nguyen L, et al. 2012. Coding exons function as tissue-specific enhancers of nearby genes. Genome Res. 22:1059–1068.

Boel G, Letso R, Neely H, Price WN, Wong KH, Su M, Luff JD, Valecha M, Everett JK, Acton TB, et al. 2016. Codon influence on protein expression in E. coli correlates with mRNA levels. Nature 529:358–363.

Bolger AM, Lohse M, Usadel B. 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30:2114–2120.

Brandvain Y, Wright SI. 2016. The limits of natural selection in a nonequilibrium world. Trends Genet. 32:201–210.

Brunk CF, Lee LC, Tran AB, Li J. 2003. Complete sequence of the mitochondrial genome of Tetrahymena thermophila and comparative methods for identifying highly divergent genes. Nucleic Acids Res. 31:1673–1682.

Burger G, Zhu Y, Littlejohn TG, Greenwood SJ, Schnare MN, Lang BF, Gray MW. 2000. Complete sequence of the mitochondrial genome of Tetrahymena pyriformis and comparison with Paramecium aurelia mitochondrial DNA. J Mol Biol. 297:365–380.

Burki F. 2014. The eukaryotic tree of life from a global phylogenomic perspective. Cold Spring Harb Perspect Biol. 6:a016147.

Cao J, Schneeberger K, Ossowski S, Gunther T, Bender S, Fitz J, Koenig D, Lanz C, Stegle O, Lippert C, et al. 2011. Whole-genome sequencing of multiple Arabidopsis thaliana populations. Nat Genet. 43:956–963.

Catania F, Wurmser F, Potekhin AA, Przybos E, Lynch M. 2009. Genetic diversity in the Paramecium aurelia species complex. Mol Biol Evol. 26:421–431.

Chalker DL, Yao MC. 2011. DNA elimination in ciliates: transposon domestication and genome surveillance. Annu Rev Genet. 45:227–246.

Charlesworth D. 2003. Effects of inbreeding on the genetic diversity of populations. Philos T Roy Soc B. 358:1051–1070.

Charlesworth B, Morgan MT, Charlesworth D. 1993. The effect of deleterious mutations on neutral molecular variation. Genetics 134:1289–1303.

Chia JM, Song C, Bradbury PJ, Costich D, de Leon N, Doebley J, Elshire RJ, Gaut B, Geller L, Glaubitz JC, et al. 2012. Maize HapMap2 identifies extant variation from a genome in flux. Nat Genet. 44:803–807.

Comeron JM, Williford A, Kliman RM. 2008. The Hill-Robertson effect: evolutionary consequences of weak selection and linkage in finite populations. Heredity 100:19–31.

Dey A, Chan CK, Thomas CG, Cutter AD. 2013. Molecular hyperdiversity defines populations of the nematode Caenorhabditis brenneri. Proc Natl Acad Sci U S A. 110:11056–11060.

Dong X, Navratilova P, Fredman D, Drivenes O, Becker TS, Lenhard B. 2010. Exonic remnants of whole-genome duplication reveal cis-regulatory function of coding exons. Nucleic Acids Res. 38:1071–1085.

Downing T, Imamura H, Decuypere S, Clark TG, Coombs GH, Cotton JA, Hilley JD, de Doncker S, Maes I, Mottram JC, et al. 2011. Whole genome sequencing of multiple Leishmania donovani clinical isolates provides insights into population structure and mechanisms of drug resistance. Genome Res. 21:2143–2156.

Drummond DA, Bloom JD, Adami C, Wilke CO, Arnold FH. 2005. Why highly expressed proteins evolve slowly. Proc Natl Acad Sci U S A. 102:14338–14343.

Eddy SR. 2011. Accelerated profile HMM searches. PLoS Comput Biol. 7:e1002195.

Edgar RC. 2004a. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5:113.

Edgar RC. 2004b. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32:1792–1797.

Eichenlaub MP, Ettwiller L. 2011. De novo genesis of enhancers in vertebrates. PLoS Biol. 9:e1001188.

Ellison CE, Hall C, Kowbel D, Welch J, Brem RB, Glass NL, Taylor JW. 2011. Population genomics and local adaptation in wild isolates of a model microbial eukaryote. Proc Natl Acad Sci U S A. 108:2831–2836.

Elyashiv E, Bullaughey K, Sattath S, Rinott Y, Przeworski M, Sella G. 2010. Shifts in the intensity of purifying selection: an analysis of genome-wide polymorphism data from two closely related yeast species. Genome Res. 20:1558–1573.

Engelthaler DM, Hicks ND, Gillece JD, Roe CC, Schupp JM, Driebe EM, Gilgado F, Carriconde F, Trilles L, Firacative C, et al. 2014. Cryptococcus gattii in North American Pacific Northwest: whole-population genome analysis provides insights into species evolution and dispersal. mBio 5:e01464–e01414.

Fawcett JA, Iida T, Takuno S, Sugino RP, Kado T, Kugou K, Mura S, Kobayashi T, Ohta K, Nakayama J, et al. 2014. Population genomics of the fission yeast Schizosaccharomyces pombe. PLoS ONE 9:e104241.

Fenchel T, Finlay BJ. 2004. The ubiquity of small species: patterns of local and global diversity. Bioscience 54:777–784.

Finlay BJ. 2002. Global dispersal of free-living microbial eukaryote species. Science 296:1061–1063.

Finlay BJ, Esteban GF, Fenchel T. 2004. Protist diversity is different? Protist 155:15–22.

Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J, et al. 2014. Pfam: the protein families database. Nucleic Acids Res. 42:D222–D230.

Foissac S, Gouzy J, Rombauts S, Mathe C, Amselem J, Sterck L, Van de Peer Y, Rouze P, Schiex T. 2008. Genome annotation in plants and fungi: EuGene as a model platform. Curr Bioinform. 3:87–97.

Foissner W. 1999. Protist diversity: estimates of the near-imponderable. Protist 150:363–368.

Foissner W. 2006. Biogeography and dispersal of micro-organisms: a review emphasizing protists. Acta Protozool. 45:111–136.

Foissner W. 2008. Protist diversity and distribution: some basic considerations. Biodivers Conserv. 17:235–242.

Foissner W. 2011. Dispersal of protists: the role of cysts and human introductions. In: Fontaneto D, editor. Biogeography of microscopic organisms: is everything small everywhere? Cambridge University Press. p. 61–87.

Foissner W, Chao A, Katz LA. 2008. Diversity and geographic distribution of ciliates (Protista: Ciliophora). Biodivers Conserv. 17:345–363.

Fournier-Level A, Korte A, Cooper MD, Nordborg M, Schmitt J, Wilczek AM. 2011. A map of local adaptation in Arabidopsis thaliana. Science 334:86–89.

Galtier N, Nabholz B, Glemin S, Hurst GDD. 2009. Mitochondrial DNA as a marker of molecular diversity: a reappraisal. Mol Ecol. 18:4541–4550.

Gout JF, Kahn D, Duret L, Paramecium Post-Genomics C. 2010. The relationship among gene expression, the evolution of gene dosage, and the rate of protein evolution. PLoS Genet. 6:e1000944.

Gout JF, Lynch M. 2015. Maintenance and loss of duplicated genes by dosage subfunctionalization. Mol Biol Evol. 32:2141–2148.

Gruchy DF. 1955. The breeding system and distribution of Tetrahymena pyriformis. J Protozool. 2:178–185.

Gustincich S, Manfioletti G, Del Sal G, Schneider C, Carninci P. 1991. A fast method for high-quality genomic DNA extraction from whole human blood. BioTechniques 11:298–300, 302.

Hancock AM, Witonsky DB, Alkorta-Aranburu G, Beall CM, Gebremedhin A, Sukernik R, Utermann G, Pritchard JK, Coop G, Di Rienzo A. 2011. Adaptations to climate-mediated selective pressures in humans. PLoS Genet. 7:e1001375.

Hiwatashi K. 1968. Determination and inheritance of mating type in Paramecium caudatum. Genetics 58:373–386.

Hori M, Tomikawa I, Przybos E, Fujishima M. 2006. Comparison of the evolutionary distances among syngens and sibling species of Paramecium. Mol Phylogenet Evol. 38:697–704.

Huang X, Kurata N, Wei X, Wang ZX, Wang A, Zhao Q, Zhao Y, Liu K, Lu H, Li W, et al. 2012. A map of rice genome variation reveals the origin of cultivated rice. Nature 490:497–501.

Huang X, Zhao Q, Han B. 2015. Comparative population genomics reveals strong divergence and infrequent introgression between Asian and African rice. Mol Plant 8:958–960.

Hudson RR. 2002. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18:337–338.

Hudson RR, Turelli M. 2003. Stochasticity overrules the "three-times rule": genetic drift, genetic draft, and coalescence times for nuclear loci versus mitochondrial DNA. Evolution 57:182–190.

Jaillon O, Bouhouche K, Gout JF, Aury JM, Noel B, Saudemont B, Nowacki M, Serrano V, Porcel BM, Segurens B, et al. 2008. Translational control of intron splicing in eukaryotes. Nature 451:359–362.

James JE, Piganeau G, Eyre-Walker A. 2016. The rate of adaptive evolution in animal mitochondria. Mol Ecol. 25:67–78.

Joy DA, Feng X, Mu J, Furuya T, Chotivanich K, Krettli AU, Ho M, Wang A, White NJ, Suh E, et al. 2003. Early origin and recent expansion of Plasmodium falciparum. Science 300:318–321.

Kaplan NL, Hudson RR, Langley CH. 1989. The hitchhiking effect revisited. Genetics 123:887–899.

Katz LA, McManus GB, Snoeyenbos-West OLO, Griffin A, Pirog K, Costas B, Foissner W. 2005. Reframing the 'Everything is everywhere' debate: evidence for high gene flow and diversity in ciliate morphospecies. Aquat Microb Ecol. 41:55–65.

Katz LA, Snoeyenbos-West O, Doerder FP. 2006. Patterns of protein evolution in Tetrahymena thermophila: implications for estimates of effective population size. Mol Biol Evol. 23:608–614.

Krenek S, Berendonk TU, Fokin SI. 2015. New Paramecium (Ciliophora, Oligohymenophorea) congeners shape our view on its biodiversity. Org Divers Evol. 15:215–233.

Krenek S, Berendonk TU, Petzoldt T. 2011. Thermal performance curves of Paramecium caudatum: a model selection approach. Eur J Protistol. 47:124–137.

Krenek S, Petzoldt T, Berendonk TU. 2012. Coping with temperature at the warm edge–patterns of thermal adaptation in the microbial eukaryote Paramecium caudatum. PLoS One 7:e30598.

Kudla G, Murray AW, Tollervey D, Plotkin JB. 2009. Coding-sequence determinants of gene expression in Escherichia coli. Science 324:255–258.

Kvitek DJ, Will JL, Gasch AP. 2008. Variations in stress sensitivity and genomic expression in diverse S. cerevisiae isolates. PLoS Genet. 4:e1000223.

Lahr DJ, Laughinghouse HDt, Oliverio AM, Gao F, Katz LA. 2014. How discordant morphological and molecular evolution among microorganisms can revise our notions of biodiversity on Earth. Bioessays 36:950–959.

Langley CH, Stevens K, Cardeno C, Lee YC, Schrider DR, Pool JE, Langley SA, Suarez C, Corbett-Detig RB, Kolaczkowski B, et al. 2012. Genomic variation in natural populations of Drosophila melanogaster. Genetics 192:533–598.

Leffler EM, Bullaughey K, Matute DR, Meyer WK, Segurel L, Venkat A, Andolfatto P, Przeworski M. 2012. Revisiting an old riddle: what determines genetic diversity levels within species? PLoS Biol. 10:e1001388.

Li H. 2011. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27:2987–2993.

Li H, Durbin R. 2010. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26:589–595.

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Genome Project Data Processing S. 2009. The sequence alignment/map format and SAMtools. Bioinformatics 25:2078–2079.

Lim LP, Burge CB. 2001. A computational analysis of sequence features involved in recognition of short introns. Proc Natl Acad Sci U S A. 98:11193–11198.

Liti G, Carter DM, Moses AM, Warringer J, Parts L, James SA, Davey RP, Roberts IN, Burt A, Koufopanou V, et al. 2009. Population genomics of domestic and wild yeasts. Nature 458:337–341.

Loveless MD, Hamrick JL. 1984. Ecological determinants of genetic-structure in plant-populations. Annu Rev Ecol Syst. 15:65–95.

Lynch M. 2007. The origins of genome architecture. Sunderland (MA): Sinauer Associates.

Lynch M. 2009. Estimation of allele frequencies from high-coverage genome-sequencing projects. Genetics 182:295–301.

Lynch M, Field MC, Goodson HV, Malik HS, Pereira-Leal JB, Roos DS, Turkewitz AP, Sazer S. 2014. Evolutionary cell biology: two origins, one objective. Proc Natl Acad Sci U S A. 111:16990–16994.

Lynch M, Sung W, Morris K, Coffey N, Landry CR, Dopman EB, Dickinson WJ, Okamoto K, Kulkarni S, Hartl DL, et al. 2008. A genome-wide view of the spectrum of spontaneous mutations in yeast. Proc Natl Acad Sci U S A. 105:9272–9277.

Maere S, Heymans K, Kuiper M. 2005. BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics 21:3448–3449.

Maguire B, Belk D. 1967. Paramecium transport by land snails. J Protozool. 14:445–447.

Maruki T, Lynch M. 2015. Genotype-frequency estimation from high-throughput sequencing data. Genetics 201:473–486.

McGrath CL, Gout JF, Doak TG, Yanagi A, Lynch M. 2014a. Insights into three whole-genome duplications gleaned from the Paramecium caudatum genome sequence. Genetics 197:1417–1428.

McGrath CL, Gout JF, Johri P, Doak TG, Lynch M. 2014b. Differential retention and divergent resolution of duplicate genes following whole-genome duplication. Genome Res. 24:1665–1675.

Miller JR, Delcher AL, Koren S, Venter E, Walenz BP, Brownley A, Johnson J, Li K, Mobarry C, Sutton G. 2008. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics 24:2818–2824.

Moradian MM, Beglaryan D, Skozylas JM, Kerikorian V. 2007. Complete mitochondrial genome sequence of three Tetrahymena species reveals mutation hot spots and accelerated nonsynonymous substitutions in Ymf genes. PLoS ONE 2:e650.

Nabholz B, Glemin S, Galtier N. 2009. The erratic mitochondrial clock: variations of mutation rate, not population size, affect mtDNA diversity across birds and mammals. BMC Evol Biol. 9:54.

Nabholz B, Mauffrey JF, Bazin E, Galtier N, Glemin S. 2008. Determination of mitochondrial genetic diversity in mammals. Genetics 178:351–361.

Neafsey DE, Galinsky K, Jiang RH, Young L, Sykes SM, Saif S, Gujja S, Goldberg JM, Young S, Zeng Q, et al. 2012. The malaria parasite Plasmodium vivax exhibits greater genetic diversity than Plasmodium falciparum. Nat Genet. 44:1046–1050.

Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, Indap A, King KS, Bergmann S, Nelson MR, et al. 2008. Genes mirror geography within Europe. Nature 456:98–101.

Ohta T. 1974. Mutational pressure as main cause of molecular evolution and polymorphism. Nature 252:351–354.

Ohta T. 1992. The nearly neutral theory of molecular evolution. Annu Rev Ecol Syst. 23:263–286.

Pawlowski J, Audic S, Adl S, Bass D, Belbahri L, Berney C, Bowser SS, Cepicka I, Decelle J, Dunthorn M, et al. 2012. CBOL protist working group: barcoding eukaryotic richness beyond the animal, plant, and fungal kingdoms. PLoS Biol. 10:e1001419.

Piganeau G, Eyre-Walker A. 2009. Evidence for variation in the effective population size of animal mitochondrial DNA. PLoS One 4:e4396.

Pool JE, Nielsen R. 2007. Population size changes reshape genomic patterns of diversity. Evolution 61:3001–3006.

Pophaly SD, Tellier A. 2015. Population level purifying selection and gene expression shape subgenome evolution in maize. Mol Biol Evol. 32:3226–3235.

Prescott DM. 1994. The DNA of ciliated protozoa. Microbiol Rev. 58:233–267.

Pritchard AE, Seilhamer JJ, Mahalingam R, Sable CL, Venuti SE, Cummings DJ. 1990. Nucleotide sequence of the mitochondrial genome of Paramecium. Nucleic Acids Res. 18:173–180.

Przybos E. 1975. Genetic studies of Paramecium jenningsi strains (Diller, Earl, 1958). Folia Biol (Krakow). 23:425–471.

Przybos E. 2005. Recent data on the occurrence of species of the Paramecium aurelia complex in Europe. Folia Biol-Krakow. 53:61–63.

Przybos E, Barth D, Berendonk TU. 2008. The Paramecium aurelia species complex, frequency and co-occurrence across Europe. Folia Biol (Krakow). 56:77–81.

Przybos E, Rautian M, Greczek-Stachura M, Potekhin A. 2007. Polymorphism within Paramecium sexaurelia (Ciliophora, Oligohymenophorea) and description of a new stand of the species in China. Folia Biol (Krakow). 55:121–125.

Przybos E, Surmacz M. 2010. New, world-wide data on the distribution of species of the Paramecium aurelia complex (Ciliophora, Protozoa). Folia Biol-Krakow. 58:185–188.

R-Core-Team. 2014. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

Redden H, Alper HS. 2015. The development and characterization of synthetic minimal yeast promoters. Nat Commun. 6:7810.

Ritter DI, Dong Z, Guo S, Chuang JH. 2012. Transcriptional enhancers in protein-coding exons of vertebrate developmental genes. PLoS One 7:e35202.

Russell CB, Fraga D, Hinrichsen RD. 1994. Extremely short 20-33 nucleotide introns are the standard length in Paramecium tetraurelia. Nucleic Acids Res. 22:1221–1225.

Salim HMW, Ring KL, Cavalcanti ARO. 2008. Patterns of codon usage in two ciliates that reassign the genetic code: Tetrahymena thermophila and Paramecium tetraurelia. Protist 159:283–298.

Schacherer J, Shapiro JA, Ruderfer DM, Kruglyak L. 2009. Comprehensive polymorphism survey elucidates population structure of Saccharomyces cerevisiae. Nature 458:342–345.

Schattner P, Brooks AN, Lowe TM. 2005. The tRNAscan-SE, snoscan and snoGPS web servers for the detection of tRNAs and snoRNAs. Nucleic Acids Res. 33:W686–W689.

Schrider DR, Kern AD. 2015. Inferring selective constraint from population genomic data suggests recent regulatory turnover in the human brain. Genome Biol Evol. 7:3511–3528.

Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. 2003. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13:2498–2504.

Simonsen KL, Churchill GA, Aquadro CF. 1995. Properties of statistical tests of neutrality for DNA polymorphism data. Genetics 141:413–429.

Smith DR, Lee RW. 2008. Nucleotide diversity in the mitochondrial and nuclear compartments of Chlamydomonas reinhardtii: investigating the origins of genome architecture. BMC Evol Biol. 8:156.

Smith JM, Haigh J. 1974. The hitch-hiking effect of a favourable gene. Genetics Res. 89:391–403.

Snoke MS, Berendonk TU, Barth D, Lynch M. 2006. Large global effective population sizes in Paramecium. Mol Biol Evol. 23:2474–2479.

Sonneborn TM. 1970. Methods in Paramecium research. In: Methods Cell Biol. New York and London: Academic Press. p. 241–339.

Sonneborn TM. 1975. Paramecium aurelia complex of fourteen sibling species. T Am Microsc Soc. 94:155–178.

Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30:1312–1313.

Stoeck T, Welter H, Seitz-Bender D, Kusch J, Schmidt HJ. 2000. ARDRA and RAPD-fingerprinting reject the sibling species concept for the ciliate Paramecium caudatum (Ciliophora, Protoctista). Zool Scr. 29:75–82.

Sun M. 1999. Cleistogamy in Scutellaria indica (Labiatae): effective mating system and population genetic structure. Mol Ecol. 8:1285–1295.

Swart EC, Nowacki M, Shum J, Stiles H, Higgins BP, Doak TG, Schotanus K, Magrini VJ, Minx P, Mardis ER, et al. 2012. The Oxytricha trifallax mitochondrial genome. Genome Biol Evol. 4:136–154.

Tajima F. 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123:585–595.

Tishkoff SA, Reed FA, Ranciaro A, Voight BF, Babbitt CC, Silverman JS, Powell K, Mortensen HM, Hirbo JB, Osman M, et al. 2007. Convergent adaptation of human lactase persistence in Africa and Europe. Nat Genet. 39:31–40.

Tsukii Y, Hiwatashi K. 1983. Genes controlling mating-type specificity in Paramecium caudatum: three loci revealed by intersyngenic crosses. Genetics 104:41–62.

Population Genomics of Paramecium Species. doi:10.1093/molbev/msx074. MBE

Tsukii Y, Hiwatashi K. 1985. Meiotic nondisjunction and aneuploids in intersyngenic hybrids of Paramecium caudatum. Genetics 111:779–794.

Tuller T, Zur H. 2015. Multiple roles of the coding sequence 5′ end in gene expression regulation. Nucleic Acids Res 43:13–28.

Voight BF, Kudaravalli S, Wen X, Pritchard JK. 2006. A map of recent positive selection in the human genome. PLoS Biol 4:e72.

Volkman SK, Sabeti PC, DeCaprio D, Neafsey DE, Schaffner SF, Milner DA Jr, Daily JP, Sarr O, Ndiaye D, Ndir O, et al. 2007. A genome-wide map of diversity in Plasmodium falciparum. Nat Genet 39:113–119.

Weir W, Capewell P, Foth B, Clucas C, Pountain A, Steketee P, Veitch N, Koffi M, De Meeus T, Kabore J, et al. 2016. Population genomics reveals the origin and asexual evolution of human infective trypanosomes. eLife 5:e11473.

Weisse T. 2008. Distribution and diversity of aquatic protists: an evolutionary and ecological perspective. Biodivers Conserv 17:243–259.

Wickham H. 2010. ggplot2: Elegant graphics for data analysis. New York: Springer-Verlag.

Zhang JZ, Yang JR. 2015. Determinants of the rate of protein sequence evolution. Nat Rev Genet 16:409–420.

with a recent study in Drosophila where πn and πs were found to be strongly correlated between D. melanogaster and D. simulans (Langley et al. 2012). Such a correlation was also observed between orthologs in S. cerevisiae and S. paradoxus (Elyashiv et al. 2010).

A possible mechanism governing the degree of selective constraint acting on proteins is their level of expression (Zhang and Yang 2015). Non-synonymous mutations are thought to have more severe fitness consequences in highly expressed genes than in weakly expressed ones (Drummond et al. 2005; Gout et al. 2010). Indeed, we find a highly significant negative correlation between expression and πn/πs in all species (fig. 8). However, only 6–13% of the variance is explained by expression, implying that additional factors may play a greater role in determining the degree of selective constraint acting on a gene. Simple biochemical properties of proteins, like the protein's function, tertiary structure, and the number of interacting partners, might determine the proportion of functionally important residues in the protein and, in turn, govern the degree of selection.

Our observation that orthologous genes across such deeply diverged species experience similar degrees of selective constraint (fig. 7) might imply a low rate of turnover of expression, protein structure, function, and number of interacting partners. This finding may also be in part due to similar ecologies, and therefore similar selective pressures, experienced by the Paramecium species examined here. It may thus be informative to compare closely related species that live in, or have recently experienced, different environments. For instance, Elyashiv et al. (2010) found significant differences in levels of gene-specific purifying selection (πn/πs) in the sake population of S. cerevisiae in comparison to other populations of S. cerevisiae as well as S. paradoxus. In fact, this deviation in selective constraints experienced by the sake population is complemented by differences in response to different stresses (Kvitek et al. 2008), suggesting that changes in environment might be reflected in gene-specific constraints.

FIG. 7. Correlation of the logarithm of πn/πs for paralogs within species (three diagonal plots for P. tetraurelia, P. biaurelia, and P. sexaurelia) and for orthologs between species pairwise [non-diagonal plots, with the plots below the diagonal (outlined red) representing the first set of WGD paralogs (T1, B1, S1) and the upper half of the matrix (outlined blue) representing the second set of paralogs (T2, B2, S2)]. The number in the top left corner of each plot is the slope of the linear regression in cases where the regression is significant (P < 0.01). Only genes with more than five synonymous polymorphisms were used in this analysis. The schematic tree in the bottom right corner shows the phylogenetic relationship between the species P. tetraurelia (T), P. biaurelia (B), P. sexaurelia (S), P. caudatum (C), and P. multimicronucleatum (M).

FIG. 8. Negative correlation of πn/πs with expression levels of genes in four Paramecium species. All correlations are significant with P < 2.2 × 10⁻¹⁶. Only genes with more than five synonymous polymorphisms were used in this analysis.

Distribution of Tajima's D across the Genome

Contrary to genes under purifying selection, genes under balancing or positive selection might often be species-specific. We tested this hypothesis using Tajima's D, a statistic that uses the difference between nucleotide diversity and the number of segregating sites to detect deviations from neutrality. Because Tajima's D is sensitive to population size changes and population structure (Simonsen et al. 1995), we used an outlier approach. We calculated Tajima's D for non-overlapping 1-kb windows across the genome (supplementary fig. S9, Supplementary Material online) and looked for the presence of candidate genes in the outlier regions (0.25% of the two tails). The outlier regions with negative values of Tajima's D might represent genes either under weak purifying or positive selection (i.e., experiencing a recent sweep), whereas the ones with positive Tajima's D might be under balancing selection. Interestingly, both sets of outliers show an over-representation of genes involved in glycosylation, glycoprotein production and metabolism, and carbohydrate metabolism (supplementary table S8, Supplementary Material online) for all Paramecium species. For each species, the large functional overlap of genes in both tails suggests an important role of these processes in Paramecium growth. More importantly, it further suggests similar genome-wide selective pressures acting on all Paramecium species examined in this study. These conclusions can, however, be biased by gene length, as shorter genes might not have sufficient power for an accurate calculation of Tajima's D. Indeed, the outliers are all significantly longer than the rest of the genes in P. tetraurelia (outliers: 1567 bp; non-outliers: 1321 bp; P = 4.7 × 10⁻¹³), P. biaurelia (outliers: 1681 bp; non-outliers: 1372 bp; P < 2.2 × 10⁻¹⁶), and P. sexaurelia (outliers: 1860 bp; non-outliers: 1379 bp; P < 2.2 × 10⁻¹⁶). This bias is, however, much lower or not significant for P. caudatum (outliers: 1463 bp; non-outliers: 1383 bp; P = 7.3 × 10⁻³) and P. multimicronucleatum (outliers: 1509 bp; non-outliers: 1445 bp; P = 0.19), possibly because of higher genome-wide nucleotide diversity, indicating that, despite some bias, the results are likely to be biologically meaningful.
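As an illustration of the statistic used above, the following is a minimal sketch of Tajima's D for a single window, following the standard formulation of Tajima (1989). The function names and the input layout (one minor-allele count per biallelic site) are assumptions made for this example; the values reported in this study were obtained with vcftools, not with this code.

```python
from math import comb, sqrt

def pi_and_segregating_sites(minor_counts, n):
    """Return (pi, S) for one window.

    minor_counts: minor-allele count at each biallelic site (hypothetical input).
    n: number of sampled chromosomes.
    pi is the mean number of pairwise differences summed over the window.
    """
    pairs = comb(n, 2)
    segregating = [k for k in minor_counts if 0 < k < n]
    pi = sum(k * (n - k) / pairs for k in segregating)
    return pi, len(segregating)

def tajimas_d(n, seg_sites, pi):
    """Tajima's D; returns None when it is undefined (n < 4 or no segregating sites)."""
    if n < 4 or seg_sites == 0:
        return None
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    # Difference between pi and Watterson's estimator, scaled by its standard deviation
    return (pi - seg_sites / a1) / sqrt(variance)
```

A window made up only of singletons yields a negative D (an excess of rare variants), whereas a window of intermediate-frequency variants yields a positive D, which matches the interpretation of the two tails given above.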

Tajima's D, as a summary statistic of the site-frequency spectrum, can also indicate overall patterns of selection or recent changes in demography. In most species, the genome-wide distribution of Tajima's D is shifted towards positive values, with means of 0.862 for P. tetraurelia, 0.504 for P. biaurelia, 0.274 for P. sexaurelia, -0.148 [0.459] for P. caudatum [subset], and 1.325 [0.855] for P. multimicronucleatum [subset]. This could suggest pervasive local adaptation (or balancing selection), a recent contraction in population size, or a fairly ancient population subdivision in all Paramecium species. A previous study that sampled variation at 10 nuclear loci found that strains from different continents shared the same alleles and interpreted that as evidence of regular global dispersal of Paramecium from relatively few permanent locations (Catania et al. 2009). This could also be a potential explanation for the observed increase in nucleotide diversity in comparison to the number of segregating sites.

Fraction of Heterozygous Sites across the Genome

On average, about 0.1% of sites in an individual were heterozygous within the P. aurelia species, whereas 0.7% of sites in individuals belonging to P. caudatum and 1.1% of sites in isolates of P. multimicronucleatum were found to be heterozygous (supplementary table S1, Supplementary Material online). This is consistent with the observation that the P. aurelia species undergo autogamy, whereas the outgroup species do not. On evaluating the fraction of heterozygous sites in windows of 10 kb across the genome, we find large blocks of homozygosity in the P. aurelia genomes (supplementary fig. S10a–e, Supplementary Material online), consistent with a loss of heterozygosity in the lab, possibly through multiple episodes of autogamy. It should be noted, however, that peaks in heterozygous regions in the P. aurelia genomes are highly correlated between isolates, unlike in P. caudatum and P. multimicronucleatum, and might reflect mapping or assembly errors. These heterozygous regions (summarized in supplementary Files S1–S3, Supplementary Material online) comprise a very small part (0.82–0.97%) of the genome and, most importantly, their exclusion does not affect any of our analyses qualitatively.

Variation across the Mitochondrial Genome

Few ciliate mitochondrial genomes have been sequenced until now, most belonging to the genus Tetrahymena (Brunk et al. 2003; Burger et al. 2000; Moradian et al. 2007). Others include Paramecium tetraurelia (Pritchard et al. 1990), Paramecium caudatum (Barth and Berendonk 2011), and Oxytricha trifallax (Swart et al. 2012). These studies have shown that ciliate mitochondrial genomes are linear and range between 40 and 47 kb, with 50 intronless genes. However, levels of sequence diversity across the entire mitochondrial genome have not previously been characterized.

To obtain mitochondrial genomes for all species included in this study, we carried out de novo genome assembly using the whole-genome resequencing data for each strain and extracted mitochondrial contigs using the published Paramecium mitochondrial genomes as reference (see Methods for more details). Using this approach, we derived complete genome assemblies for multiple strains, including from species without a previously published mitochondrial genome (P. biaurelia, P. sexaurelia, and P. multimicronucleatum). A comprehensive analysis of these sequences will be published elsewhere (Johri et al., in preparation). The mitochondrial genome assembly from each isolate was annotated separately, and the eight genes that were most ubiquitously present in all individuals were concatenated and used to construct a phylogeny (see Methods). The mitochondrial phylogeny closely tracked the one based on nuclear sequences (supplementary fig. S11, Supplementary Material online).

One strain from each species was chosen as a reference genome (the genome with the longest assembly), and reads from other individuals were mapped to the reference to call SNPs. During the original DNA preparation, efforts were made to remove non-MAC DNA, resulting in varying amounts of mitochondrial DNA present in the extracts from each isolate. Hence, some individuals mapped with extremely low coverage (4 to zero read depth) to the reference genome, especially after applying stringent filters to call SNPs (supplementary table S9, Supplementary Material online). Thus, some individuals were excluded from this analysis, including all P. biaurelia isolates. The subsequent analysis was therefore restricted to four species.

All of the mitochondrial genomes are about 40 kb in length. We identified 626 SNPs in P. tetraurelia, 5307 in P. sexaurelia, 4771 [3041] in P. caudatum [subset], and 4600 [1478] in P. multimicronucleatum [subset]. Within all species, very few indels were found, with just 1 in P. tetraurelia, 9 in P. sexaurelia, and 15 [13] and 23 [9] in P. caudatum [subset] and P. multimicronucleatum [subset], respectively. The nucleotide diversity at 4-fold degenerate sites was found to be 0.009 in P. tetraurelia, 0.169 in P. sexaurelia, 0.127 [0.093] in P. caudatum [subset], and 0.172 [0.073] in P. multimicronucleatum [subset]. Nucleotide diversity across different sites shows that rRNAs are the most conserved sequences in the ciliate mitochondrial genomes. Transfer RNAs are the next most highly conserved, with their diversity levels only slightly above those of replacement sites (table 2). Our estimates of diversity in the Paramecium mitochondria are higher than those of a number of other unicellular eukaryotes studied so far: P. falciparum (πsyn = 0.0012) (Joy et al. 2003) and Chlamydomonas reinhardtii (πsilent = 0.0085) (Smith and Lee 2008).

We also recalculated nuclear nucleotide diversity (table 2) at 4-fold degenerate sites for the same set of individuals used in the mitochondrial analysis and observed that the mitochondrial to nuclear diversity ratio is close to 1.0 in all species (table 2) except P. sexaurelia, where the mitochondrial diversity is 5.7-fold higher than that in the nucleus. On excluding P. sexaurelia, we obtain a strong correlation between 4-fold degenerate nucleotide diversity in the mitochondria and nucleus (slope = 0.95, R² = 0.96). It should be noted that there are not enough data points to make any inferences about statistical significance. However, such a positive correlation between the nuclear and mitochondrial diversity suggests that mitochondrial diversity might scale with population size in ciliates, which bears on previous studies in animals that have yielded conflicting findings about whether mitochondrial diversity correlates with population size (Bazin et al. 2006; Nabholz et al. 2008; Nabholz et al. 2009; Piganeau and Eyre-Walker 2009). The observed correlation also suggests that mitochondrial and nuclear spontaneous mutation rates may be correlated. Direct estimates of mutation rates in several eukaryotes (C. elegans, D. melanogaster, S. cerevisiae, and H. sapiens), obtained via mutation accumulation experiments, suggest a moderate correlation (slope = 0.05, R² = 0.48) between mitochondrial and nuclear base substitution rates (using data from Lynch et al. 2008). Discordance between the nuclear and mitochondrial diversity, for instance in P. sexaurelia, can be attributed to various factors such as positive selection (Hudson and Turelli 2003; Galtier et al. 2009; James et al. 2016), demographic events (Pool and Nielsen 2007), or simply differences in nuclear and/or mitochondrial mutation rates (Lynch 2007). Future studies are needed to dissect the factors responsible for the remarkably different ratio of mitochondrial to nuclear diversity in P. sexaurelia.

Table 2. Nucleotide Diversity (per site) of Different Sites across the Mitochondrial Genomes.

Species | 4-fold (πm) [SE] | 2-fold [SE] | 0-fold [SE] | tRNA [SE] | rRNA [SE] | intergenic [SE] | 4-fold nuclear (πn) [SE] | πm/πn
P. tetraurelia | 0.0091 [0.0010] | 0.0107 [0.0009] | 0.0019 [0.0002] | 0.0000 [0.0000] | 0.0006 [0.0002] | 0.0041 [0.0003] | 0.0058 [2.26 × 10⁻⁵] | 1.57
P. sexaurelia | 0.1692 [0.0033] | 0.1427 [0.0024] | 0.0097 [0.00043] | 0.0000 [0.0000] | 0.0059 [0.0007] | 0.0417 [0.0027] | 0.02945 [4.73 × 10⁻⁵] | 5.75
P. caudatum | 0.1272 [0.0037] | 0.0942 [0.0019] | 0.0179 [0.0006] | 0.0455 [0.0023] | 0.0136 [0.0012] | 0.0333 [0.0027] | 0.1267 [1.08 × 10⁻⁴] | 1.00
P. caudatum (subset) | 0.0929 [0.0039] | 0.0859 [0.0021] | 0.0149 [0.0006] | 0.0375 [0.0023] | 0.0093 [0.0011] | 0.0292 [0.0028] | 0.0658 [9.63 × 10⁻⁵] | 1.41
P. multimicronucleatum | 0.1727 [0.0047] | 0.0831 [0.0019] | 0.0227 [0.0008] | 0.0263 [0.0011] | 0.0269 [0.0016] | 0.0329 [0.0022] | 0.1599 [1.34 × 10⁻⁴] | 1.08
P. multimicronucleatum (subset) | 0.0730 [0.0034] | 0.0383 [0.0015] | 0.0121 [0.0006] | 0.0126 [0.0009] | 0.0135 [0.0012] | 0.0312 [0.0022] | 0.0742 [9.56 × 10⁻⁵] | 0.98
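The mitochondrial to nuclear diversity ratios discussed above can be checked with a few lines of arithmetic. The sketch below uses the 4-fold degenerate diversity values as read from table 2; the dictionary layout and function name are illustrative assumptions and not part of the original analysis.

```python
# 4-fold degenerate diversity as read from table 2: (mitochondrial, nuclear)
TABLE2_4FOLD = {
    "P. tetraurelia": (0.0091, 0.0058),
    "P. sexaurelia": (0.1692, 0.02945),
    "P. caudatum": (0.1272, 0.1267),
    "P. caudatum (subset)": (0.0929, 0.0658),
    "P. multimicronucleatum": (0.1727, 0.1599),
    "P. multimicronucleatum (subset)": (0.0730, 0.0742),
}

def diversity_ratios(table):
    """Return the mitochondrial/nuclear diversity ratio for each species."""
    return {species: mito / nuclear for species, (mito, nuclear) in table.items()}

if __name__ == "__main__":
    for species, ratio in diversity_ratios(TABLE2_4FOLD).items():
        print(f"{species}: {ratio:.2f}")
```

Under these inputs, P. sexaurelia stands out with a ratio well above that of the other species, in line with the text.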

Conclusions

This is the first study in which multiple individuals of several species of free-living unicellular eukaryotes have been sequenced and used to reveal genome-wide patterns of variation. This study thus provides an invaluable resource for future research in Paramecium, a major model organism in cell biology. Sequencing of multiple individuals from the same lake or pond is needed to further resolve the global versus local population structure of these cosmopolitan unicellular species. Overall, our results suggest that most of the intergenic regions across the highly compact genomes of Paramecium species are strongly constrained and probably harbor functionally important regulatory modules, which might extend into coding regions. We show that patterns of nucleotide diversity across the genome are driven primarily by a combination of purifying selection and mutation rate variation. In future, such analyses will benefit greatly from incorporating variation in recombination rates across the genomic landscape. We also find that duplicates created by the WGD appear to be evolving under similar constraints. Synonymous to non-synonymous diversity ratios need to be supplemented with divergence levels to distinguish between the various types of selective forces acting on the duplicates. Thus, sequencing of much closer outgroup species would allow us to further understand the role of selection in the evolution of different genomic elements, including gene duplicates that resulted from ancient WGDs.

Methods

Sampling, DNA Extraction, and Sequencing

We examined a total of 56 Paramecium strains originally isolated from environmental sources, including 11 isolates of P. biaurelia, 13 isolates of P. sexaurelia, 12 isolates of P. tetraurelia, 10 isolates of P. caudatum, and 10 isolates of P. multimicronucleatum. A total of 14 strains from P. biaurelia and P. sexaurelia were kindly provided by E. Przybos, whereas the remaining strains were cultured in the Paramecium collections of the Institute of Hydrobiology, TU Dresden, Germany, or the Department of Biology, Indiana University, Bloomington, USA. Prior to our analyses, P. tetraurelia strain Moz 13CXI as well as P. sexaurelia strains Moz 13BIII and Indo 17I were kindly identified by E. Przybos by performing inter- and intrastrain crosses (Przybos 1975). Paramecia were maintained in 0.25% Cerophyl medium (modified according to Sonneborn 1970; Krenek et al. 2011) inoculated with Enterobacter aerogenes. Since P. aurelia species can undergo autogamy (self-fertilization) induced by starvation, and all strains have been cultured under lab conditions for several years, all P. aurelia strains in this study are expected to be more homozygous than wild isolates.

Prior to DNA extraction, all cultures were initiated from a single cell, ensuring clonality. A modified sucrose density separation (Aury et al. 2006) was performed to extract primarily macronuclei (MACs) and to reduce the amount of non-MAC DNA potentially compromising subsequent analyses. Briefly, Paramecium cells were starved for at least two days and filtered through several layers of sterile cheesecloth prior to cell harvesting via gentle centrifugation. Pelleted cells were gently lysed in 0.25 M sucrose medium (pH 6.8) containing 10 mM Tris-HCl, 8 mM MgCl2, 3 mM CaCl2, 0.5–1% Nonidet-P40, and 0.1–0.2% sodium deoxycholate. Macronuclei were subsequently isolated away from cellular debris, micronuclei, and mitochondria by centrifugation at 100 rcf in 0.25 M sucrose wash buffer (see above, without Nonidet-P40 and Na-deoxycholate). Genomic DNA extraction from MAC isolations was performed by a modified CTAB procedure (Gustincich et al. 1991).

All genomes were sequenced using the Illumina HiSeq 2000, and paired-end reads with insert sizes between 200 and 400 bp and a mean length of 151 bp were obtained. The reads mapped with coverage ranging between 40× and 80× (details given in supplementary table S1, Supplementary Material online).

Paramecium Reference Genomes

Complete genomes of P. tetraurelia, P. biaurelia, P. sexaurelia, and P. caudatum were downloaded from ParameciumDB (Arnaiz et al. 2007; Arnaiz and Sperling 2011) along with the annotations. Ortho-paralog relationships as well as functional annotations were retrieved from the supplementary information provided in McGrath et al. (2014a, 2014b). RNAseq data for all four species were downloaded and processed as in Gout and Lynch (2015).

The genome of P. multimicronucleatum was assembled using Celera (Miller et al. 2008) and annotated using Eugene (Foissac et al. 2008) as previously described (McGrath et al. 2014a, 2014b). A detailed description of the sequencing, assembly, and annotation of P. multimicronucleatum can be found in Marinov et al. (in preparation). OMA (Orthology MAtrix) (Altenhoff et al. 2015) was used to find orthologs between P. caudatum and P. multimicronucleatum. A total of 11,002 genes of P. caudatum were found to have 1-1 orthologs in P. multimicronucleatum, and only 1-1 orthologs were used for the current analysis.

Mapping of Reads and Detection of SNPs and Indels

Sequenced paired-end reads were mapped to the reference genomes for each species using BWA (bwa mem) (Li and Durbin 2010) with default parameters, and the mpileup function in Samtools (Li et al. 2009) was used to generate mpileup files for each individual. Two distinct pipelines were employed to identify SNPs to ensure detection accuracy: Bcftools-Vcftools (Li 2011) and the maximum likelihood Genotype Frequency Estimator (referred to as GFE) (Lynch 2009; Maruki and Lynch 2015). Both pipelines produced highly overlapping sets of SNPs (supplementary table S2, Supplementary Material online). For both pipelines, sites were only considered for SNP detection and further analysis if the mapping quality was above 30, base quality was above 20, per-base alignment quality was above 15, and the sum of the depth of coverage for all individuals was about 5 times the number of individuals and less than twice the average population coverage. SNPs were called by bcftools if the SNP quality score was above 30 and by GFE if the significance score was > 5.91 (P < 0.05). The exact parameters of the Bcftools-Vcftools pipeline to call SNPs and indels are given below.

samtools mpileup -q 30 -Q 20 -t DP -uvf reference_genome.fasta sorted.bam > species.vcf

bcftools call -f GQ -o species_snps.vcf -O v -V indels -cv species.vcf

bcftools call -f GQ -o species_indels.vcf -O v -V snps -cv species.vcf

# filtering snp and indel files for depth and biases
perl vcfutils.pl varFilter -d 50 -D 800 -1 -2 -3 -4 species_snps.vcf > species_snps.filtered1.vcf

perl vcfutils.pl varFilter -d 50 -D 800 -1 -2 -3 -4 species_indels.vcf > species_indels.filtered1.vcf

vcftools --vcf species_snps.filtered1.vcf --minQ 20 --minGQ 30 --minDP 4 --recode --recode-INFO-all --out species_snps.filtered2

vcftools --vcf species_indels.filtered1.vcf --minQ 20 --minGQ 30 --minDP 4 --recode --recode-INFO-all --out species_indels.filtered2

# filter out the recoded sites that are not polymorphic anymore, that is, AF1 = 1
bcftools view -q 0.001 species_snps.filtered2.recode.vcf -o species_snps.final.vcf

bcftools view -q 0.001 species_indels.filtered2.recode.vcf -o species_indels.final.vcf

Mitochondrial SNPs were called using the Bcftools-Vcftools pipeline with the same parameters as noted above.

Simulation of SNP Detection with Varying Levels of Divergence from the Reference

We simulated reads using wgsim (https://github.com/lh3/wgsim) for a diploid genome with varying levels of divergence from the reference genome of P. caudatum and P. multimicronucleatum, using average values of read length, total number of reads, and insert sizes.

For the P. caudatum reference genome:

wgsim -d 287 -s 113 -N 10201087 -1 151 -2 151 -r 0.10 -R 0.013 caudatum_assembly.fasta caud_010_1.fastq caud_010_2.fastq >& caud_010.info

For the P. multimicronucleatum reference genome:

wgsim -d 261 -s 102 -N 12154044 -1 151 -2 151 -r 0.10 -R 0.011 multimicronucleatum_assembly.fasta multi_010_1.fastq multi_010_2.fastq >& multi_010.info

Reads were simulated for r = 0.0, 0.05, 0.10, 0.15, 0.20, 0.25, and 0.30, where r represents the fraction of the genome with SNPs. The SNP detection pipeline described above was used to call SNPs, which were compared to the simulated set of SNPs (supplementary table S4, Supplementary Material online).
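The comparison between called and simulated SNPs can be summarized with simple set arithmetic. The sketch below is one possible way to do this; the function name, the (scaffold, position) key format, and the choice of summary measures (sensitivity and false discovery rate) are assumptions for illustration and are not taken from the original pipeline.

```python
def evaluate_calls(simulated, called):
    """Compare called SNP positions against the simulated (true) set.

    simulated, called: iterables of hashable site keys, e.g. (scaffold, position).
    Returns counts plus sensitivity (recovered fraction of simulated SNPs)
    and false discovery rate (fraction of calls that were not simulated).
    """
    simulated, called = set(simulated), set(called)
    true_positives = len(simulated & called)
    sensitivity = true_positives / len(simulated) if simulated else float("nan")
    fdr = (len(called) - true_positives) / len(called) if called else 0.0
    return {
        "true_positives": true_positives,
        "false_negatives": len(simulated) - true_positives,
        "false_positives": len(called) - true_positives,
        "sensitivity": sensitivity,
        "false_discovery_rate": fdr,
    }
```

Running this once per simulated value of r would give a picture of how detection changes as divergence from the reference increases.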

Building Phylogenies

A consensus genome (bcftools view species_filtered.recode.vcf | vcfutils.pl vcf2fq > species.cns.fq) was obtained for each individual using bcftools, with heterozygous sites represented by DNA ambiguity codes. Gene nucleotide sequences that are present in single copy in the P. aurelia species and are also present in both outgroup species (P. caudatum and P. multimicronucleatum) were extracted, and multiple sequence alignments were generated using MUSCLE (Edgar 2004a, 2004b). This amounted to about 55 sequences (including the reference genomes) in each of 883 sets of orthologous genes. RAxML 8.0.26 (Stamatakis 2014) was used to build a tree using the concatenation of all 883 aligned gene sequences (raxml -s concatenated_sequences.fasta -n gene -m GTRGAMMA -T 4 -p 7). In addition, individual gene trees (raxml -s gene.fasta -n gene -m GTRGAMMA -T 4 -p 7) were built, which were then used to create a consensus tree (raxml -J MR -z trees.newick -T 4 -m GTRGAMMA) (Aberer et al. 2010).

The mitochondrial phylogeny was built using the same parameters as above by concatenating eight sets of orthologous genes (NADH_dehydrogenase_subunit_1_a, NADH_dehydrogenase_subunit_5, Ymf56orf78, Ymf67orf265_1, Ymf68orf393, apocytochrome_b, cytochrome_c_oxidase_subunit_1, cytochrome_c_oxidase_subunit_2, ribosomal_protein_S14) that were ubiquitously present across all sampled species.

Principal Component Analysis

Genotypes for all individuals were obtained from the vcf files generated by vcftools for all scaffolds. Only the sites that had a genotype available for all individuals were used in the analysis, i.e., if even a single individual had missing data, the site was excluded from analysis. A total of 10000 markers were used for each species. The first marker of each scaffold was picked randomly within the first 50 SNP positions, and the rest were selected so that all markers were uniformly distributed across all scaffolds. PCA was carried out on these markers by using the R (R-Core-Team 2014) command prcomp (https://stat.ethz.ch/R-manual/R-devel/library/stats/html/prcomp.html) and visualized using the first two components with the R graphing package ggplot2 (Wickham 2010).
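The marker selection described above might be sketched as follows. This is an illustration only: the data layout, the function names, and the fixed `step` between markers are assumptions, since the text specifies only a random first marker within the first 50 SNPs of each scaffold and a uniform spread thereafter. The PCA itself was run in R and is not reproduced here.

```python
import random

def complete_sites(genotypes_by_site):
    """Keep only sites with a genotype for every individual (None marks missing data)."""
    return [site for site, genotypes in genotypes_by_site.items()
            if all(g is not None for g in genotypes)]

def pick_markers(snps_by_scaffold, step, rng):
    """Pick evenly spaced markers on each scaffold.

    snps_by_scaffold: dict mapping scaffold name -> sorted list of SNP positions.
    step: number of SNPs between consecutive markers (hypothetical parameter,
          chosen so that the total approaches the desired marker count).
    rng: random.Random instance, so that the selection is reproducible.
    """
    markers = []
    for scaffold in sorted(snps_by_scaffold):
        positions = snps_by_scaffold[scaffold]
        if not positions:
            continue
        # First marker: random choice among the first 50 SNPs of the scaffold
        start = rng.randrange(min(50, len(positions)))
        markers.extend((scaffold, pos) for pos in positions[start::step])
    return markers
```

Passing a seeded `random.Random` makes the marker set repeatable across runs, which is convenient when comparing PCA results between species.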

Calculation of Population Genetic Statistics

Nucleotide Diversity and Tajima's D

Nucleotide diversity was calculated as 2pqn/(n − 1), where p and q are the major and minor allele frequencies and n is the number of chromosomes. The major and minor allele frequencies were given for each site by GFE. Diversity at coding sites was partitioned into synonymous and non-synonymous components (e.g., 2-fold degenerate sites are one-third synonymous and two-thirds non-synonymous). Major allele frequencies were also extracted from vcftools and returned results very similar to those from GFE (supplementary table S3, Supplementary Material online).
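The per-site estimator above is straightforward to express in code. The following sketch assumes biallelic sites and a hypothetical input of major-allele frequencies; the synonymous weights follow the one-third rule stated in the text for 2-fold sites, with 4-fold sites counted as fully synonymous and 0-fold sites as fully non-synonymous. Using these weights to count synonymous sites is one reading of the partitioning and is an assumption of this example.

```python
# Fraction of a site counted as synonymous, by degeneracy class
SYNONYMOUS_WEIGHT = {0: 0.0, 2: 1.0 / 3.0, 4: 1.0}

def site_diversity(major_freq, n):
    """Unbiased per-site diversity: 2pq * n / (n - 1), with n sampled chromosomes."""
    p = major_freq
    q = 1.0 - p
    return 2.0 * p * q * n / (n - 1)

def mean_diversity(major_freqs, n, total_sites):
    """Mean per-site diversity over a region.

    major_freqs: major-allele frequency at each polymorphic site.
    total_sites: number of callable sites in the region, including monomorphic ones.
    """
    return sum(site_diversity(p, n) for p in major_freqs) / total_sites

def synonymous_site_count(degeneracies):
    """Effective number of synonymous sites given each site's degeneracy (0, 2, or 4)."""
    return sum(SYNONYMOUS_WEIGHT[d] for d in degeneracies)
```

For a singleton in a sample of 10 chromosomes (p = 0.9), `site_diversity` gives 0.2, which equals the fraction of chromosome pairs that differ at that site (9 of 45).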

A sliding window approach was used to calculate Tajima's D across the genome. Vcftools was used to obtain these values for non-overlapping windows of 1 kb in length. The program ms (Hudson 2002) was used to perform 100000 simulations of 10 individuals with 1 kb locus length, and niceStats (https://github.com/kern-lab/shIC/blob/master/niceStats.c) was used to calculate Tajima's D from the ms output file. To obtain Tajima's D for different categories of sites, SNPs belonging to a particular category of sites (specified in Results) were selected, and vcftools was used to calculate Tajima's D for non-overlapping windows of 10 kb. Windows with 2 or more SNPs were used to calculate the mean, and only alternate windows were used to compare distributions of Tajima's D across different categories by Student's t-test, to maximise independence of sampling.

Plotting Average Expected Heterozygosity across Introns and Intergenic Regions

All introns (< 500 bp in length) were split midway and aligned according to the 5′ and 3′ splice sites. All intergenic regions (< 500 bp in the outgroup species and < 1000 bp in the P. aurelia species) were split midway and aligned according to the start and stop codons. Note that these were not sequence alignments. At each site, the mean expected heterozygosity was calculated. Sites with an insufficient number of data points (< 200 data points per site for introns; for intergenic regions, < 100 data points per site in the outgroups and < 1000 data points per site in the P. aurelia species) were discarded.
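The split-and-anchor procedure described above can be sketched as follows. The function name, the input layout (one list of per-site heterozygosity values per region, ordered 5′ to 3′), and the offset convention are assumptions made for this illustration; the thresholds are passed in as parameters rather than fixed.

```python
def anchored_profile(regions, max_len, min_points):
    """Average per-site values across regions anchored at both ends.

    regions: list of lists of per-site heterozygosity, ordered 5' to 3'.
    max_len: regions of this length or longer are skipped.
    min_points: offsets supported by fewer regions than this are discarded.
    Returns {offset: mean}; offsets 0, 1, 2, ... count from the 5' end,
    offsets -1, -2, ... count back from the 3' end.
    """
    sums, counts = {}, {}
    for values in regions:
        if len(values) >= max_len:
            continue
        half = len(values) // 2  # split midway; an odd middle site joins the 3' half
        for i, value in enumerate(values):
            offset = i if i < half else i - len(values)
            sums[offset] = sums.get(offset, 0.0) + value
            counts[offset] = counts.get(offset, 0) + 1
    return {offset: sums[offset] / counts[offset]
            for offset in sums if counts[offset] >= min_points}
```

Because each region contributes only to the offsets it actually spans, positions far from the anchors are supported by fewer regions, which is why a minimum number of data points per offset is needed.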

Calculating and Plotting Divergence across Introns and Intergenic Regions

Orthologous coding sequences along with introns were aligned using MUSCLE (Edgar 2004a). Intergenic regions flanked by orthologous genes on both sides were also aligned along with the flanking coding exon using MUSCLE. For each species separately, divergence for different positions (with respect to the start and stop sites) was averaged across all introns and intergenic regions. The divergence values were corrected for multiple substitutions by the Jukes-Cantor sequence substitution model and plotted in the same manner as described above.
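For reference, the Jukes-Cantor correction converts an observed proportion of differing sites p into an estimated number of substitutions per site, d = −(3/4) ln(1 − 4p/3). A minimal sketch is given below; the function name and the handling of saturated values are choices made for this example.

```python
from math import log

def jukes_cantor(p):
    """Jukes-Cantor distance from the observed proportion of differing sites.

    Returns infinity when p >= 0.75, where the correction is undefined
    because the sequences are saturated under this model.
    """
    if p < 0.0:
        raise ValueError("proportion of differences must be non-negative")
    if p >= 0.75:
        return float("inf")
    return -0.75 * log(1.0 - 4.0 * p / 3.0)
```

The corrected distance is always at least as large as the observed proportion, and the gap widens as divergence grows.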

GO Analysis

All gene ontology enrichment analyses were performed using the BiNGO plugin for Cytoscape 3.2.1 (Shannon et al. 2003; Maere et al. 2005), using the hypergeometric test to assess significance with Benjamini and Hochberg FDR correction for multiple testing (Benjamini and Hochberg 1995). The latest ontology files (.obo) were downloaded from www.geneontology.org, and custom annotation files were prepared as required using python scripts.
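The two statistical steps named above, a one-sided hypergeometric test followed by Benjamini and Hochberg correction, can be sketched in plain Python. This is an illustration of the general technique and not BiNGO's implementation; the function names and argument order are assumptions.

```python
from math import comb

def hypergeom_upper_tail(k, population, annotated, drawn):
    """P(X >= k) when `drawn` genes are sampled without replacement from
    `population` genes, of which `annotated` carry the GO term."""
    total = comb(population, drawn)
    upper = min(annotated, drawn)
    return sum(comb(annotated, i) * comb(population - annotated, drawn - i)
               for i in range(k, upper + 1)) / total

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In an enrichment analysis, `hypergeom_upper_tail` would be evaluated once per GO term for the gene set of interest, and the resulting p-values passed together to `benjamini_hochberg`.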

Identification of Heterozygous Regions in the P. aurelia Species

The fraction of heterozygous sites was calculated for 10 bp non-overlapping windows across the genome for all three P. aurelia species, and the top 1% of such regions, i.e., those with the highest fraction of heterozygous sites, were identified as heterozygous regions.
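A windowed scan of this kind might look like the sketch below. The function names, the 0-based position convention, and the representation of a scaffold as a list of heterozygous positions are assumptions for illustration; the window size and top fraction are left as parameters.

```python
def het_fraction_windows(het_positions, scaffold_length, window):
    """Fraction of heterozygous sites in each non-overlapping window.

    het_positions: 0-based positions of heterozygous sites on one scaffold.
    The final window may be shorter than `window` and is scaled by its true length.
    """
    n_windows = (scaffold_length + window - 1) // window
    counts = [0] * n_windows
    for pos in het_positions:
        counts[pos // window] += 1
    return [count / min(window, scaffold_length - i * window)
            for i, count in enumerate(counts)]

def top_windows(fractions, top_fraction):
    """Indices of the windows with the highest heterozygous fraction."""
    k = max(1, int(len(fractions) * top_fraction))
    ranked = sorted(range(len(fractions)), key=lambda i: fractions[i], reverse=True)
    return sorted(ranked[:k])
```

Comparing the selected window indices between isolates would show whether the same regions recur, which is the pattern interpreted in the Results as a possible sign of mapping or assembly errors.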

Mitochondrial Genome Assembly and Annotation
Mitochondrial genomes were assembled as follows. Sequence reads were assembled using SPAdes (version 3.5.0) (Bankevich et al. 2012) after removing potential adapter sequence with Trimmomatic (version 0.33) (Bolger et al. 2014). Mitochondrial contigs were identified from the resulting assemblies by BLAST searches against the published P. caudatum and P. tetraurelia mitochondrial genomes (Pritchard et al. 1990; Barth and Berendonk 2011).

Protein-coding genes were identified by generating all ORFs longer than 60 aa in all six frames using the Mold, Protozoan, and Coelenterate Mitochondrial Code (i.e., UGA codes for W instead of being a stop codon) and all alternative start codons specific to Paramecium (AUU, AUA, AUG, AUC, GUG, and GUA), and retaining the longest ORFs associated with each stop codon. BLASTP (Altschul et al. 1997) was then used to identify homologs of annotated mitochondrial proteins in P. tetraurelia and P. caudatum. Additional ORFs were identified by imposing the requirement that their length exceed 100 aa, and were subsequently annotated using BLASTP against the non-redundant protein sequences (nr) database and HMMER 3.0 (Eddy 2011) scans against the PFAM 27.0 database (Finn et al. 2014).
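The ORF-calling rule described above can be sketched as a six-frame scan. This is an illustrative reconstruction, not the authors' pipeline: the function names are hypothetical, ORFs that run off the end of a contig without a stop codon are simply ignored, and coordinates are reported on the scanned strand rather than converted back to the forward strand.

```python
STOPS = {"TAA", "TAG"}  # TGA encodes W under this mitochondrial code
STARTS = {"ATT", "ATA", "ATG", "ATC", "GTG", "GTA"}  # DNA form of the start codons
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def longest_orfs(seq, min_aa=60):
    """Return (strand, start, end, length_aa) for the longest ORF ending at
    each in-frame stop codon, keeping ORFs longer than min_aa codons.

    Coordinates are 0-based on the scanned strand; `end` excludes the stop.
    """
    orfs = []
    forward = seq.upper()
    for strand, s in (("+", forward), ("-", reverse_complement(forward))):
        for frame in range(3):
            first_start = None  # most upstream start codon since the last stop
            for pos in range(frame, len(s) - 2, 3):
                codon = s[pos:pos + 3]
                if codon in STOPS:
                    if first_start is not None:
                        length_aa = (pos - first_start) // 3
                        if length_aa > min_aa:
                            orfs.append((strand, first_start, pos, length_aa))
                    first_start = None
                elif first_start is None and codon in STARTS:
                    first_start = pos
    return orfs


# Toy contig: an ORF that reads through two TGA (W) codons before a TAA stop.
toy_contig = "ATG" + "TGA" * 2 + "GCT" * 70 + "TAA"
toy_orfs = longest_orfs(toy_contig)
```

Keeping only the most upstream start codon for each stop is what makes the reported ORF the longest one associated with that stop.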

tRNA genes were annotated with tRNAscan-SE 1.21 (Schattner et al. 2005) using the “Mito/Chloroplast” source.

Supplementary Material
Supplementary data are available at Molecular Biology and Evolution online.

Acknowledgments
We thank Lydia Bright and Jean-Francois Gout for helpful comments on the manuscript, Matthew S. Ackerman for insightful discussions related to the project, and the anonymous reviewers and editor for their valuable suggestions. We would also like to thank Kelley Thomas for providing help with the sequencing and E. Przybos for kindly providing several Paramecium strains and the species identification of some so far unspecified strains. This work was financially supported by the National Science Foundation (MCB-1050161 and DEB-1257806), the European Commission FP7-PEOPLE-2009-IRSES project CINAR PATHOBACTER (project number 247658), and the CNRS research network GDRE “Paramecium Genome Dynamics and Evolution.” T.U.B. also acknowledges funding of the “support the best” of the ZUK TU Dresden and the Deutsche Forschungsgemeinschaft (DFG).

References
Abbott RJ, Gomes MF. 1989. Population genetic-structure and outcrossing rate of Arabidopsis thaliana (L.) Heynh. Heredity 62:411–418.

Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, McVean GA. 2010. A map of human genome variation from population-scale sequencing. Nature 467:1061–1073.

Johri et al. doi:10.1093/molbev/msx074 MBE


Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA. 2012. An integrated map of genetic variation from 1,092 human genomes. Nature 491:56–65.

Aberer AJ, Pattengale ND, Stamatakis A. 2010. Parallelized phylogenetic post-analysis on multi-core architectures. J Comput Sci-Neth. 1:107–114.

Ackermann AA, Panunzi LG, Cosentino RO, Sanchez DO, Aguero F. 2012. A genomic scale map of genetic diversity in Trypanosoma cruzi. BMC Genomics 13:736.

Altenhoff AM, Skunca N, Glover N, Train CM, Sueki A, Pilizota I, Gori K, Tomiczek B, Muller S, Redestig H, et al. 2015. The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Res. 43:D240–D249.

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389–3402.

Arnaiz O, Cain S, Cohen J, Sperling L. 2007. ParameciumDB: a community resource that integrates the Paramecium tetraurelia genome sequence with genetic data. Nucleic Acids Res. 35:D439–D444.

Arnaiz O, Mathy N, Baudry C, Malinsky S, Aury JM, Wilkes CD, Garnier O, Labadie K, Lauderdale BE, Le Mouel A, et al. 2012. The Paramecium germline genome provides a niche for intragenic parasitic DNA: evolutionary dynamics of internal eliminated sequences. PLoS Genet. 8:e1002984.

Arnaiz O, Sperling L. 2011. ParameciumDB in 2011: new tools and new data for functional and comparative genomics of the model ciliate Paramecium tetraurelia. Nucleic Acids Res. 39:D632–D636.

Assefa S, Lim C, Preston MD, Duffy CW, Nair MB, Adroub SA, Kadir KA, Goldberg JM, Neafsey DE, Divis P, et al. 2015. Population genomic structure and adaptation in the zoonotic malaria parasite Plasmodium knowlesi. Proc Natl Acad Sci U S A. 112:13027–13032.

Aury JM, Jaillon O, Duret L, Noel B, Jubin C, Porcel BM, Segurens B, Daubin V, Anthouard V, Aiach N, et al. 2006. Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature 444:171–178.

Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR. 2015. A global reference for human genetic variation. Nature 526:68–74.

Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, et al. 2012. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 19:455–477.

Barth D, Berendonk TU. 2011. The mitochondrial genome sequence of the ciliate Paramecium caudatum reveals a shift in nucleotide composition and codon usage within the genus Paramecium. BMC Genomics 12:272.

Barth D, Krenek S, Fokin SI, Berendonk TU. 2006. Intraspecific genetic variation in Paramecium revealed by mitochondrial cytochrome C oxidase I sequences. J Eukaryot Microbiol. 53:20–25.

Bazin E, Glemin S, Galtier N. 2006. Population size does not influence mitochondrial genetic diversity in animals. Science 312:570–572.

Beale G, Preer JR. 2008. Paramecium: genetics and epigenetics. Boca Raton (FL): CRC Press.

Begun DJ, Aquadro CF. 1992. Levels of naturally-occurring DNA polymorphism correlate with recombination rates in Drosophila melanogaster. Nature 356:519–520.

Begun DJ, Holloway AK, Stevens K, Hillier LW, Poh YP, Hahn MW, Nista PM, Jones CD, Kern AD, Dewey CN, et al. 2007. Population genomics: whole-genome analysis of polymorphism and divergence in Drosophila simulans. PLoS Biol. 5:e310.

Benjamini Y, Hochberg Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B Met. 57:289–300.

Betermier M. 2004. Large-scale genome remodelling by the developmentally programmed elimination of germ line sequences in the ciliate Paramecium. Res Microbiol. 155:399–408.

Birky CW, Walsh JB. 1988. Effects of linkage on rates of molecular evolution. Proc Natl Acad Sci U S A. 85:6414–6418.

Birnbaum RY, Clowney EJ, Agamy O, Kim MJ, Zhao J, Yamanaka T, Pappalardo Z, Clarke SL, Wenger AM, Nguyen L, et al. 2012. Coding exons function as tissue-specific enhancers of nearby genes. Genome Res. 22:1059–1068.

Boel G, Letso R, Neely H, Price WN, Wong KH, Su M, Luff JD, Valecha M, Everett JK, Acton TB, et al. 2016. Codon influence on protein expression in E. coli correlates with mRNA levels. Nature 529:358–363.

Bolger AM, Lohse M, Usadel B. 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30:2114–2120.

Brandvain Y, Wright SI. 2016. The limits of natural selection in a nonequilibrium world. Trends Genet. 32:201–210.

Brunk CF, Lee LC, Tran AB, Li J. 2003. Complete sequence of the mitochondrial genome of Tetrahymena thermophila and comparative methods for identifying highly divergent genes. Nucleic Acids Res. 31:1673–1682.

Burger G, Zhu Y, Littlejohn TG, Greenwood SJ, Schnare MN, Lang BF, Gray MW. 2000. Complete sequence of the mitochondrial genome of Tetrahymena pyriformis and comparison with Paramecium aurelia mitochondrial DNA. J Mol Biol. 297:365–380.

Burki F. 2014. The eukaryotic tree of life from a global phylogenomic perspective. Cold Spring Harb Perspect Biol. 6:a016147.

Cao J, Schneeberger K, Ossowski S, Gunther T, Bender S, Fitz J, Koenig D, Lanz C, Stegle O, Lippert C, et al. 2011. Whole-genome sequencing of multiple Arabidopsis thaliana populations. Nat Genet. 43:956–963.

Catania F, Wurmser F, Potekhin AA, Przybos E, Lynch M. 2009. Genetic diversity in the Paramecium aurelia species complex. Mol Biol Evol. 26:421–431.

Chalker DL, Yao MC. 2011. DNA elimination in ciliates: transposon domestication and genome surveillance. Annu Rev Genet. 45:227–246.

Charlesworth D. 2003. Effects of inbreeding on the genetic diversity of populations. Philos T Roy Soc B. 358:1051–1070.

Charlesworth B, Morgan MT, Charlesworth D. 1993. The effect of deleterious mutations on neutral molecular variation. Genetics 134:1289–1303.

Chia JM, Song C, Bradbury PJ, Costich D, de Leon N, Doebley J, Elshire RJ, Gaut B, Geller L, Glaubitz JC, et al. 2012. Maize HapMap2 identifies extant variation from a genome in flux. Nat Genet. 44:803–807.

Comeron JM, Williford A, Kliman RM. 2008. The Hill-Robertson effect: evolutionary consequences of weak selection and linkage in finite populations. Heredity 100:19–31.

Dey A, Chan CK, Thomas CG, Cutter AD. 2013. Molecular hyperdiversity defines populations of the nematode Caenorhabditis brenneri. Proc Natl Acad Sci U S A. 110:11056–11060.

Dong X, Navratilova P, Fredman D, Drivenes O, Becker TS, Lenhard B. 2010. Exonic remnants of whole-genome duplication reveal cis-regulatory function of coding exons. Nucleic Acids Res. 38:1071–1085.

Downing T, Imamura H, Decuypere S, Clark TG, Coombs GH, Cotton JA, Hilley JD, de Doncker S, Maes I, Mottram JC, et al. 2011. Whole genome sequencing of multiple Leishmania donovani clinical isolates provides insights into population structure and mechanisms of drug resistance. Genome Res. 21:2143–2156.

Drummond DA, Bloom JD, Adami C, Wilke CO, Arnold FH. 2005. Why highly expressed proteins evolve slowly. Proc Natl Acad Sci U S A. 102:14338–14343.

Eddy SR. 2011. Accelerated profile HMM searches. PLoS Comput Biol. 7:e1002195.

Edgar RC. 2004a. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5:113.

Edgar RC. 2004b. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32:1792–1797.

Eichenlaub MP, Ettwiller L. 2011. De novo genesis of enhancers in vertebrates. PLoS Biol. 9:e1001188.

Ellison CE, Hall C, Kowbel D, Welch J, Brem RB, Glass NL, Taylor JW. 2011. Population genomics and local adaptation in wild isolates of a model microbial eukaryote. Proc Natl Acad Sci U S A. 108:2831–2836.

Elyashiv E, Bullaughey K, Sattath S, Rinott Y, Przeworski M, Sella G. 2010. Shifts in the intensity of purifying selection: an analysis of genome-wide polymorphism data from two closely related yeast species. Genome Res. 20:1558–1573.

Engelthaler DM, Hicks ND, Gillece JD, Roe CC, Schupp JM, Driebe EM, Gilgado F, Carriconde F, Trilles L, Firacative C, et al. 2014. Cryptococcus gattii in North American Pacific Northwest: whole-population genome analysis provides insights into species evolution and dispersal. mBio 5:e01464–e01414.

Fawcett JA, Iida T, Takuno S, Sugino RP, Kado T, Kugou K, Mura S, Kobayashi T, Ohta K, Nakayama J, et al. 2014. Population genomics of the fission yeast Schizosaccharomyces pombe. PLoS One 9:e104241.

Fenchel T, Finlay BJ. 2004. The ubiquity of small species: patterns of local and global diversity. Bioscience 54:777–784.

Finlay BJ. 2002. Global dispersal of free-living microbial eukaryote species. Science 296:1061–1063.

Finlay BJ, Esteban GF, Fenchel T. 2004. Protist diversity is different? Protist 155:15–22.

Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J, et al. 2014. Pfam: the protein families database. Nucleic Acids Res. 42:D222–D230.

Foissac S, Gouzy J, Rombauts S, Mathe C, Amselem J, Sterck L, Van de Peer Y, Rouze P, Schiex T. 2008. Genome annotation in plants and fungi: EuGene as a model platform. Curr Bioinform. 3:87–97.

Foissner W. 1999. Protist diversity: estimates of the near-imponderable. Protist 150:363–368.

Foissner W. 2006. Biogeography and dispersal of micro-organisms: a review emphasizing protists. Acta Protozool. 45:111–136.

Foissner W. 2008. Protist diversity and distribution: some basic considerations. Biodivers Conserv. 17:235–242.

Foissner W. 2011. Dispersal of protists: the role of cysts and human introductions. In: Fontaneto D, editor. Biogeography of microscopic organisms: is everything small everywhere? Cambridge University Press. p. 61–87.

Foissner W, Chao A, Katz LA. 2008. Diversity and geographic distribution of ciliates (Protista: Ciliophora). Biodivers Conserv. 17:345–363.

Fournier-Level A, Korte A, Cooper MD, Nordborg M, Schmitt J, Wilczek AM. 2011. A map of local adaptation in Arabidopsis thaliana. Science 334:86–89.

Galtier N, Nabholz B, Glemin S, Hurst GDD. 2009. Mitochondrial DNA as a marker of molecular diversity: a reappraisal. Mol Ecol. 18:4541–4550.

Gout JF, Kahn D, Duret L, Paramecium Post-Genomics C. 2010. The relationship among gene expression, the evolution of gene dosage, and the rate of protein evolution. PLoS Genet. 6:e1000944.

Gout JF, Lynch M. 2015. Maintenance and loss of duplicated genes by dosage subfunctionalization. Mol Biol Evol. 32:2141–2148.

Gruchy DF. 1955. The breeding system and distribution of Tetrahymena pyriformis. J Protozool. 2:178–185.

Gustincich S, Manfioletti G, Del Sal G, Schneider C, Carninci P. 1991. A fast method for high-quality genomic DNA extraction from whole human blood. BioTechniques 11:298–300, 302.

Hancock AM, Witonsky DB, Alkorta-Aranburu G, Beall CM, Gebremedhin A, Sukernik R, Utermann G, Pritchard JK, Coop G, Di Rienzo A. 2011. Adaptations to climate-mediated selective pressures in humans. PLoS Genet. 7:e1001375.

Hiwatashi K. 1968. Determination and inheritance of mating type in Paramecium caudatum. Genetics 58:373–386.

Hori M, Tomikawa I, Przybos E, Fujishima M. 2006. Comparison of the evolutionary distances among syngens and sibling species of Paramecium. Mol Phylogenet Evol. 38:697–704.

Huang X, Kurata N, Wei X, Wang ZX, Wang A, Zhao Q, Zhao Y, Liu K, Lu H, Li W, et al. 2012. A map of rice genome variation reveals the origin of cultivated rice. Nature 490:497–501.

Huang X, Zhao Q, Han B. 2015. Comparative population genomics reveals strong divergence and infrequent introgression between Asian and African rice. Mol Plant 8:958–960.

Hudson RR. 2002. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18:337–338.

Hudson RR, Turelli M. 2003. Stochasticity overrules the “three-times rule”: genetic drift, genetic draft, and coalescence times for nuclear loci versus mitochondrial DNA. Evolution 57:182–190.

Jaillon O, Bouhouche K, Gout JF, Aury JM, Noel B, Saudemont B, Nowacki M, Serrano V, Porcel BM, Segurens B, et al. 2008. Translational control of intron splicing in eukaryotes. Nature 451:359–362.

James JE, Piganeau G, Eyre-Walker A. 2016. The rate of adaptive evolution in animal mitochondria. Mol Ecol. 25:67–78.

Joy DA, Feng X, Mu J, Furuya T, Chotivanich K, Krettli AU, Ho M, Wang A, White NJ, Suh E, et al. 2003. Early origin and recent expansion of Plasmodium falciparum. Science 300:318–321.

Kaplan NL, Hudson RR, Langley CH. 1989. The hitchhiking effect revisited. Genetics 123:887–899.

Katz LA, McManus GB, Snoeyenbos-West OLO, Griffin A, Pirog K, Costas B, Foissner W. 2005. Reframing the ‘Everything is everywhere’ debate: evidence for high gene flow and diversity in ciliate morphospecies. Aquat Microb Ecol. 41:55–65.

Katz LA, Snoeyenbos-West O, Doerder FP. 2006. Patterns of protein evolution in Tetrahymena thermophila: implications for estimates of effective population size. Mol Biol Evol. 23:608–614.

Krenek S, Berendonk TU, Fokin SI. 2015. New Paramecium (Ciliophora, Oligohymenophorea) congeners shape our view on its biodiversity. Org Divers Evol. 15:215–233.

Krenek S, Berendonk TU, Petzoldt T. 2011. Thermal performance curves of Paramecium caudatum: a model selection approach. Eur J Protistol. 47:124–137.

Krenek S, Petzoldt T, Berendonk TU. 2012. Coping with temperature at the warm edge–patterns of thermal adaptation in the microbial eukaryote Paramecium caudatum. PLoS One 7:e30598.

Kudla G, Murray AW, Tollervey D, Plotkin JB. 2009. Coding-sequence determinants of gene expression in Escherichia coli. Science 324:255–258.

Kvitek DJ, Will JL, Gasch AP. 2008. Variations in stress sensitivity and genomic expression in diverse S. cerevisiae isolates. PLoS Genet. 4:e1000223.

Lahr DJ, Laughinghouse HDt, Oliverio AM, Gao F, Katz LA. 2014. How discordant morphological and molecular evolution among microorganisms can revise our notions of biodiversity on Earth. Bioessays 36:950–959.

Langley CH, Stevens K, Cardeno C, Lee YC, Schrider DR, Pool JE, Langley SA, Suarez C, Corbett-Detig RB, Kolaczkowski B, et al. 2012. Genomic variation in natural populations of Drosophila melanogaster. Genetics 192:533–598.

Leffler EM, Bullaughey K, Matute DR, Meyer WK, Segurel L, Venkat A, Andolfatto P, Przeworski M. 2012. Revisiting an old riddle: what determines genetic diversity levels within species? PLoS Biol. 10:e1001388.

Li H. 2011. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27:2987–2993.

Li H, Durbin R. 2010. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26:589–595.

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Genome Project Data Processing S. 2009. The sequence alignment/map format and SAMtools. Bioinformatics 25:2078–2079.

Lim LP, Burge CB. 2001. A computational analysis of sequence features involved in recognition of short introns. Proc Natl Acad Sci U S A. 98:11193–11198.

Liti G, Carter DM, Moses AM, Warringer J, Parts L, James SA, Davey RP, Roberts IN, Burt A, Koufopanou V, et al. 2009. Population genomics of domestic and wild yeasts. Nature 458:337–341.

Loveless MD, Hamrick JL. 1984. Ecological determinants of genetic-structure in plant-populations. Annu Rev Ecol Syst. 15:65–95.

Lynch M. 2007. The origins of genome architecture. Sunderland (MA): Sinauer Associates.

Lynch M. 2009. Estimation of allele frequencies from high-coverage genome-sequencing projects. Genetics 182:295–301.


Lynch M, Field MC, Goodson HV, Malik HS, Pereira-Leal JB, Roos DS, Turkewitz AP, Sazer S. 2014. Evolutionary cell biology: two origins, one objective. Proc Natl Acad Sci U S A. 111:16990–16994.

Lynch M, Sung W, Morris K, Coffey N, Landry CR, Dopman EB, Dickinson WJ, Okamoto K, Kulkarni S, Hartl DL, et al. 2008. A genome-wide view of the spectrum of spontaneous mutations in yeast. Proc Natl Acad Sci U S A. 105:9272–9277.

Maere S, Heymans K, Kuiper M. 2005. BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics 21:3448–3449.

Maguire B, Belk D. 1967. Paramecium transport by land snails. J Protozool. 14:445–447.

Maruki T, Lynch M. 2015. Genotype-frequency estimation from high-throughput sequencing data. Genetics 201:473–486.

McGrath CL, Gout JF, Doak TG, Yanagi A, Lynch M. 2014a. Insights into three whole-genome duplications gleaned from the Paramecium caudatum genome sequence. Genetics 197:1417–1428.

McGrath CL, Gout JF, Johri P, Doak TG, Lynch M. 2014b. Differential retention and divergent resolution of duplicate genes following whole-genome duplication. Genome Res. 24:1665–1675.

Miller JR, Delcher AL, Koren S, Venter E, Walenz BP, Brownley A, Johnson J, Li K, Mobarry C, Sutton G. 2008. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics 24:2818–2824.

Moradian MM, Beglaryan D, Skozylas JM, Kerikorian V. 2007. Complete mitochondrial genome sequence of three Tetrahymena species reveals mutation hot spots and accelerated nonsynonymous substitutions in Ymf genes. PLoS One 2:e650.

Nabholz B, Glemin S, Galtier N. 2009. The erratic mitochondrial clock: variations of mutation rate, not population size, affect mtDNA diversity across birds and mammals. BMC Evol Biol. 9:54.

Nabholz B, Mauffrey JF, Bazin E, Galtier N, Glemin S. 2008. Determination of mitochondrial genetic diversity in mammals. Genetics 178:351–361.

Neafsey DE, Galinsky K, Jiang RH, Young L, Sykes SM, Saif S, Gujja S, Goldberg JM, Young S, Zeng Q, et al. 2012. The malaria parasite Plasmodium vivax exhibits greater genetic diversity than Plasmodium falciparum. Nat Genet. 44:1046–1050.

Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, Indap A, King KS, Bergmann S, Nelson MR, et al. 2008. Genes mirror geography within Europe. Nature 456:98–101.

Ohta T. 1974. Mutational pressure as main cause of molecular evolution and polymorphism. Nature 252:351–354.

Ohta T. 1992. The nearly neutral theory of molecular evolution. Annu Rev Ecol Syst. 23:263–286.

Pawlowski J, Audic S, Adl S, Bass D, Belbahri L, Berney C, Bowser SS, Cepicka I, Decelle J, Dunthorn M, et al. 2012. CBOL protist working group: barcoding eukaryotic richness beyond the animal, plant, and fungal kingdoms. PLoS Biol. 10:e1001419.

Piganeau G, Eyre-Walker A. 2009. Evidence for variation in the effective population size of animal mitochondrial DNA. PLoS One 4:e4396.

Pool JE, Nielsen R. 2007. Population size changes reshape genomic patterns of diversity. Evolution 61:3001–3006.

Pophaly SD, Tellier A. 2015. Population level purifying selection and gene expression shape subgenome evolution in maize. Mol Biol Evol. 32:3226–3235.

Prescott DM. 1994. The DNA of ciliated protozoa. Microbiol Rev. 58:233–267.

Pritchard AE, Seilhamer JJ, Mahalingam R, Sable CL, Venuti SE, Cummings DJ. 1990. Nucleotide sequence of the mitochondrial genome of Paramecium. Nucleic Acids Res. 18:173–180.

Przybos E. 1975. Genetic studies of Paramecium jenningsi strains (Diller, Earl, 1958). Folia Biol (Krakow). 23:425–471.

Przybos E. 2005. Recent data on the occurrence of species of the Paramecium aurelia complex in Europe. Folia Biol-Krakow. 53:61–63.

Przybos E, Barth D, Berendonk TU. 2008. The Paramecium aurelia species complex, frequency and co-occurrence across Europe. Folia Biol (Krakow). 56:77–81.

Przybos E, Rautian M, Greczek-Stachura M, Potekhin A. 2007. Polymorphism within Paramecium sexaurelia (Ciliophora, Oligohymenophorea) and description of a new stand of the species in China. Folia Biol (Krakow). 55:121–125.

Przybos E, Surmacz M. 2010. New, world-wide data on the distribution of species of the Paramecium aurelia complex (Ciliophora, Protozoa). Folia Biol-Krakow. 58:185–188.

R-Core-Team. 2014. R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

Redden H, Alper HS. 2015. The development and characterization of synthetic minimal yeast promoters. Nat Commun. 6:7810.

Ritter DI, Dong Z, Guo S, Chuang JH. 2012. Transcriptional enhancers in protein-coding exons of vertebrate developmental genes. PLoS One 7:e35202.

Russell CB, Fraga D, Hinrichsen RD. 1994. Extremely short 20-33 nucleotide introns are the standard length in Paramecium tetraurelia. Nucleic Acids Res. 22:1221–1225.

Salim HMW, Ring KL, Cavalcanti ARO. 2008. Patterns of codon usage in two ciliates that reassign the genetic code: Tetrahymena thermophila and Paramecium tetraurelia. Protist 159:283–298.

Schacherer J, Shapiro JA, Ruderfer DM, Kruglyak L. 2009. Comprehensive polymorphism survey elucidates population structure of Saccharomyces cerevisiae. Nature 458:342–345.

Schattner P, Brooks AN, Lowe TM. 2005. The tRNAscan-SE, snoscan and snoGPS web servers for the detection of tRNAs and snoRNAs. Nucleic Acids Res. 33:W686–W689.

Schrider DR, Kern AD. 2015. Inferring selective constraint from population genomic data suggests recent regulatory turnover in the human brain. Genome Biol Evol. 7:3511–3528.

Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. 2003. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13:2498–2504.

Simonsen KL, Churchill GA, Aquadro CF. 1995. Properties of statistical tests of neutrality for DNA polymorphism data. Genetics 141:413–429.

Smith DR, Lee RW. 2008. Nucleotide diversity in the mitochondrial and nuclear compartments of Chlamydomonas reinhardtii: investigating the origins of genome architecture. BMC Evol Biol. 8:156.

Smith JM, Haigh J. 1974. The hitch-hiking effect of a favourable gene. Genetics Res. 89:391–403.

Snoke MS, Berendonk TU, Barth D, Lynch M. 2006. Large global effective population sizes in Paramecium. Mol Biol Evol. 23:2474–2479.

Sonneborn TM. 1970. Methods in Paramecium research. In: Methods Cell Biol. New York and London: Academic Press. p. 241–339.

Sonneborn TM. 1975. Paramecium aurelia complex of fourteen sibling species. T Am Microsc Soc. 94:155–178.

Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30:1312–1313.

Stoeck T, Welter H, Seitz-Bender D, Kusch J, Schmidt HJ. 2000. ARDRA and RAPD-fingerprinting reject the sibling species concept for the ciliate Paramecium caudatum (Ciliophora, Protoctista). Zool Scr. 29:75–82.

Sun M. 1999. Cleistogamy in Scutellaria indica (Labiatae): effective mating system and population genetic structure. Mol Ecol. 8:1285–1295.

Swart EC, Nowacki M, Shum J, Stiles H, Higgins BP, Doak TG, Schotanus K, Magrini VJ, Minx P, Mardis ER, et al. 2012. The Oxytricha trifallax mitochondrial genome. Genome Biol Evol. 4:136–154.

Tajima F. 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123:585–595.

Tishkoff SA, Reed FA, Ranciaro A, Voight BF, Babbitt CC, Silverman JS, Powell K, Mortensen HM, Hirbo JB, Osman M, et al. 2007. Convergent adaptation of human lactase persistence in Africa and Europe. Nat Genet. 39:31–40.

Tsukii Y, Hiwatashi K. 1983. Genes controlling mating-type specificity in Paramecium caudatum: three loci revealed by intersyngenic crosses. Genetics 104:41–62.


Tsukii Y, Hiwatashi K. 1985. Meiotic nondisjunction and aneuploids in intersyngenic hybrids of Paramecium caudatum. Genetics 111:779–794.

Tuller T, Zur H. 2015. Multiple roles of the coding sequence 5′ end in gene expression regulation. Nucleic Acids Res. 43:13–28.

Voight BF, Kudaravalli S, Wen X, Pritchard JK. 2006. A map of recent positive selection in the human genome. PLoS Biol. 4:e72.

Volkman SK, Sabeti PC, DeCaprio D, Neafsey DE, Schaffner SF, Milner DA Jr, Daily JP, Sarr O, Ndiaye D, Ndir O, et al. 2007. A genome-wide map of diversity in Plasmodium falciparum. Nat Genet. 39:113–119.

Weir W, Capewell P, Foth B, Clucas C, Pountain A, Steketee P, Veitch N, Koffi M, De Meeus T, Kabore J, et al. 2016. Population genomics reveals the origin and asexual evolution of human infective trypanosomes. eLife 5:e11473.

Weisse T. 2008. Distribution and diversity of aquatic protists: an evolutionary and ecological perspective. Biodivers Conserv. 17:243–259.

Wickham H. 2010. ggplot2: elegant graphics for data analysis. New York: Springer-Verlag.

Zhang JZ, Yang JR. 2015. Determinants of the rate of protein sequence evolution. Nat Rev Genet. 16:409–420.


constraint (fig. 7) might imply a low rate of turnover of expression, protein structure, function, and number of interacting partners. This finding may also be in part due to similar ecologies, and therefore similar selective pressures, experienced by the Paramecium species examined here. It may thus be informative to compare closely related species that live in, or have recently experienced, different environments. For instance, Elyashiv et al. (2010) found significant differences in levels of gene-specific purifying selection (pn/ps) in the sake population of S. cerevisiae in comparison to other populations of S. cerevisiae, as well as S. paradoxus. In fact, this deviation in selective constraints experienced by the sake population is complemented by differences in response to different stresses (Kvitek et al. 2008), suggesting that changes in environment might be reflected in gene-specific constraints.

FIG. 8. Negative correlation of pn/ps with expression levels of genes in four Paramecium species. All correlations are significant with P < 2.2 × 10⁻¹⁶. Only genes with more than five synonymous polymorphisms were used in this analysis.

Distribution of Tajima’s D across the Genome
Contrary to genes under purifying selection, genes under balancing or positive selection might often be species-specific. We tested this hypothesis using Tajima’s D as a statistic, which uses the difference between nucleotide diversity and the number of segregating sites to detect deviations from neutrality. Because Tajima’s D is sensitive to population size changes and population structure (Simonsen et al. 1995), we used an outlier approach. We calculated Tajima’s D for non-overlapping 1 kb windows across the genome (supplementary fig. S9, Supplementary Material online) and looked for the presence of candidate genes in the outlier regions (0.25% of the two tails). The outlier regions with negative values of Tajima’s D might represent genes under either weak purifying or positive selection (i.e., experiencing a recent sweep), whereas the ones with positive Tajima’s D might be under balancing selection. Interestingly, both sets of outliers show an over-representation of genes involved in glycosylation, glycoprotein production and metabolism, and carbohydrate metabolism (supplementary table S8, Supplementary Material online) for all Paramecium species. For each species, the huge functional overlap of genes in both tails suggests the important role of these processes in Paramecium growth. But more importantly, this further suggests similar selective pressures genome-wide acting on all Paramecium species examined in this study. These conclusions can, however, be biased by gene length, as shorter genes might not have sufficient power for an accurate calculation of Tajima’s D. Indeed, the outliers are all significantly longer than the rest of the genes in P. tetraurelia (outliers: 1567 bp; non-outliers: 1321 bp; P = 4.7 × 10⁻¹³), P. biaurelia (outliers: 1681 bp; non-outliers: 1372 bp; P < 2.2 × 10⁻¹⁶), and P. sexaurelia (outliers: 1860 bp; non-outliers: 1379 bp; P < 2.2 × 10⁻¹⁶). This bias is, however, much lower or not significant for P. caudatum (outliers: 1463 bp; non-outliers: 1383 bp; P = 7.3 × 10⁻³) and P. multimicronucleatum (outliers: 1509 bp; non-outliers: 1445 bp; P = 0.19), possibly because of higher genome-wide nucleotide diversity, indicating that despite some bias, the results have true biological significance.
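For readers unfamiliar with the statistic, the sketch below computes Tajima’s D for a single window following the standard formulation of Tajima (1989). It is illustrative only: the input (a list of equal-length, fully phased haplotype strings) and the function name are assumptions, and the study’s own estimates were derived from population sequencing data rather than from code like this.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(haplotypes):
    """Tajima's D for one window of aligned haplotypes (equal-length strings).

    Returns None when D is undefined (fewer than four sequences, or no
    segregating sites in the window).
    """
    n = len(haplotypes)
    if n < 4:
        return None
    length = len(haplotypes[0])
    # S: number of segregating sites in the window.
    S = sum(1 for i in range(length) if len({h[i] for h in haplotypes}) > 1)
    if S == 0:
        return None
    # pi: mean number of pairwise differences across all pairs of sequences.
    pairs = list(combinations(haplotypes, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)

    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)
    return (pi - S / a1) / sqrt(e1 * S + e2 * S * (S - 1))


# Toy windows: rare variants only (negative D) versus two
# intermediate-frequency variants (positive D).
d_rare = tajimas_d(["AA", "TA", "AT", "AA"])
d_balanced = tajimas_d(["AA", "AA", "TT", "TT"])
```

An excess of rare variants lowers nucleotide diversity relative to the number of segregating sites and drives D negative, whereas an excess of intermediate-frequency variants does the opposite, which is the logic behind reading the two tails of the distribution differently.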

Tajima’s D, as a summary statistic of the site-frequency spectrum, can also indicate overall patterns of selection or recent changes in demography. Overall, in most species, the genome-wide distribution of Tajima’s D is shifted towards positive values, with means of 0.862 for P. tetraurelia, 0.504 for P. biaurelia, 0.274 for P. sexaurelia, −0.148 [0.459] for P. caudatum [subset], and 1.325 [0.855] for P. multimicronucleatum [subset]. This could suggest pervasive local adaptation (or balancing selection), a recent contraction in population size, or a fairly ancient population subdivision in all Paramecium species. A previous study that sampled variation at 10 nuclear loci found that strains from different continents shared the same alleles and interpreted that as evidence of regular global dispersal of Paramecium from relatively few permanent locations (Catania et al. 2009). This could also be a potential explanation for the observed increase in nucleotide diversity in comparison to the number of segregating sites.

Fraction of Heterozygous Sites across the Genome
On average, about 0.1% of sites in an individual were heterozygous within the P. aurelia species, whereas 0.7% of sites in individuals belonging to P. caudatum and 1.1% of sites in isolates of P. multimicronucleatum were found to be heterozygous (supplementary table S1, Supplementary Material online). This is consistent with the observation that the P. aurelia species undergo autogamy, whereas the outgroup species do not. On evaluating the fraction of heterozygous sites in windows of 10 kb across the genome, we find large blocks of homozygosity in the P. aurelia genomes (supplementary fig. S10a–e, Supplementary Material online), consistent with losing heterozygosity in the lab while possibly undergoing multiple episodes of autogamy. It can, however, be noted that peaks in heterozygous regions in the P. aurelia genomes are highly correlated between isolates, unlike in P. caudatum and P. multimicronucleatum, and might potentially reflect mapping or assembly errors. These heterozygous regions (summarized in supplementary files S1–S3, Supplementary Material online) comprise a very small part (0.82–0.97%) of the genome and, most importantly, on exclusion do not affect any of our analyses qualitatively.

Variation across the Mitochondrial Genome

Few ciliate mitochondrial genomes have been sequenced until now, most belonging to the genus Tetrahymena (Brunk et al. 2003; Burger et al. 2000; Moradian et al. 2007). Others include Paramecium tetraurelia (Pritchard et al. 1990), Paramecium caudatum (Barth and Berendonk 2011), and Oxytricha trifallax (Swart et al. 2012). These studies have illustrated that ciliate mitochondrial genomes are linear and range between 40 and 47 kb, with 50 intronless genes. However, the levels of sequence diversity across the entire mitochondrial genome have not been characterized.

To obtain mitochondrial genomes for all species included in this study, we carried out de novo genome assembly using the whole-genome resequencing data for each strain and extracted mitochondrial contigs using the published Paramecium mitochondrial genomes as reference (see Methods for more details). Using this approach, we derived complete genome assemblies for multiple strains, including from species without a previously published mitochondrial genome (P. biaurelia, P. sexaurelia, and P. multimicronucleatum). A comprehensive analysis of these sequences will be published elsewhere (Johri et al., in preparation). The mitochondrial genome assembly from each isolate was annotated separately, and the eight genes that were most ubiquitously present in all individuals were concatenated and used to construct a phylogeny (see Methods). The mitochondrial phylogeny closely tracked the one based on nuclear sequences (supplementary fig. S11, Supplementary Material online).


One strain from each species was chosen as a reference genome (the genome with the longest assembly), and reads from other individuals were mapped to the reference to call SNPs. During the original DNA preparation, efforts were made to remove non-MAC DNA, resulting in varying amounts of mitochondrial DNA present in the extracts from each isolate. Hence, some individuals mapped with extremely low coverage (4 to zero read depth) to the reference genome, especially after applying stringent filters to call SNPs (supplementary table S9, Supplementary Material online). Thus, some individuals were excluded from this analysis, including all P. biaurelia isolates. The subsequent analysis was therefore restricted to four species.

All of the mitochondrial genomes are about 40 kb in length. We identified 626 SNPs in P. tetraurelia, 5307 in P. sexaurelia, 4771 [3041] in P. caudatum [subset], and 4600 [1478] in P. multimicronucleatum [subset]. Within all species, very few indels were found, with just 1 in P. tetraurelia, 9 in P. sexaurelia, and 15 [13] and 23 [9] in P. caudatum [subset] and P. multimicronucleatum [subset], respectively. The nucleotide diversity at 4-fold degenerate sites was found to be 0.009 in P. tetraurelia, 0.169 in P. sexaurelia, 0.127 [0.093] in P. caudatum [subset], and 0.172 [0.073] in P. multimicronucleatum [subset]. Nucleotide diversity across different sites shows that rRNAs are the most conserved sequences in the ciliate mitochondrial genomes. Transfer RNAs are the next most highly conserved, with their diversity levels only slightly above those of replacement sites (table 2). Our estimates of diversity in the Paramecium mitochondria are higher than those of a number of other unicellular eukaryotes studied so far: P. falciparum (πsyn = 0.0012) (Joy et al. 2003) and Chlamydomonas reinhardtii (πsilent = 0.0085) (Smith and Lee 2008).

We also recalculated nuclear nucleotide diversity (table 2) at 4-fold degenerate sites for the same set of individuals used in the mitochondrial analysis and observed that the mitochondrial to nuclear diversity ratio is close to 1.0 in all species (table 2) except P. sexaurelia, where the mitochondrial diversity is 5.7-fold higher than that in the nucleus. On excluding P. sexaurelia, we obtain a strong correlation between 4-fold degenerate nucleotide diversity in the mitochondria and nucleus (slope = 0.95, R² = 0.96). It should be noted that there are not enough data points to make any inferences about statistical significance. However, such a positive correlation between the nuclear and mitochondrial diversity suggests that mitochondrial diversity might scale with population size in ciliates, which bears on previous studies in animals that have yielded conflicting findings about whether mitochondrial diversity correlates with population size (Bazin et al. 2006; Nabholz et al. 2008; Nabholz et al. 2009; Piganeau and Eyre-Walker 2009). The observed correlation also suggests that mitochondrial and nuclear spontaneous mutation rates may be correlated. Direct estimates of mutation rates in the metazoans C. elegans, D. melanogaster, and H. sapiens and in the yeast S. cerevisiae, obtained via mutation accumulation experiments, suggest a moderate correlation (slope = 0.05, R² = 0.48) between mitochondrial and nuclear base substitution rates (using data from Lynch et al. 2008). Discordance between the nuclear and mitochondrial diversity, for instance in P. sexaurelia, can be attributed to various factors, such as positive selection (Hudson and Turelli 2003; Galtier et al. 2009; James et al. 2016), demographic events (Pool and Nielsen 2007), or simply differences in nuclear and/or mitochondrial mutation rates (Lynch 2007). Future studies are needed to dissect the factors responsible for the remarkably different ratio of mitochondrial to nuclear diversity in P. sexaurelia.

Table 2. Nucleotide Diversity (per site) of Different Sites across the Mitochondrial Genomes.

| Species | 4-fold (πm) [SE] | 2-fold [SE] | 0-fold [SE] | tRNA [SE] | rRNA [SE] | Intergenic [SE] | 4-fold nuclear (πn) [SE] | πm/πn |
|---|---|---|---|---|---|---|---|---|
| P. tetraurelia | 0.0091 [0.0010] | 0.0107 [0.0009] | 0.0019 [0.0002] | 0.0000 [0.0000] | 0.0006 [0.0002] | 0.0041 [0.0003] | 0.0058 [2.26 × 10^-5] | 1.57 |
| P. sexaurelia | 0.1692 [0.0033] | 0.1427 [0.0024] | 0.0097 [0.00043] | 0.0000 [0.0000] | 0.0059 [0.0007] | 0.0417 [0.0027] | 0.02945 [4.73 × 10^-5] | 5.75 |
| P. caudatum | 0.1272 [0.0037] | 0.0942 [0.0019] | 0.0179 [0.0006] | 0.0455 [0.0023] | 0.0136 [0.0012] | 0.0333 [0.0027] | 0.1267 [1.08 × 10^-4] | 1.00 |
| P. caudatum (subset) | 0.0929 [0.0039] | 0.0859 [0.0021] | 0.0149 [0.0006] | 0.0375 [0.0023] | 0.0093 [0.0011] | 0.0292 [0.0028] | 0.0658 [9.63 × 10^-5] | 1.41 |
| P. multimicronucleatum | 0.1727 [0.0047] | 0.0831 [0.0019] | 0.0227 [0.0008] | 0.0263 [0.0011] | 0.0269 [0.0016] | 0.0329 [0.0022] | 0.1599 [1.34 × 10^-4] | 1.08 |
| P. multimicronucleatum (subset) | 0.0730 [0.0034] | 0.0383 [0.0015] | 0.0121 [0.0006] | 0.0126 [0.0009] | 0.0135 [0.0012] | 0.0312 [0.0022] | 0.0742 [9.56 × 10^-5] | 0.98 |
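The mitochondrial to nuclear diversity ratios discussed above can be recomputed directly from the 4-fold degenerate estimates in table 2. The following is a minimal sketch for the full (non-subset) samples; the dictionary layout and the function name `diversity_ratio` are illustrative assumptions, not part of the original analysis.

```python
# 4-fold degenerate nucleotide diversity taken from table 2,
# stored as (mitochondrial, nuclear) for the full samples.
table2 = {
    "P. tetraurelia": (0.0091, 0.0058),
    "P. sexaurelia": (0.1692, 0.02945),
    "P. caudatum": (0.1272, 0.1267),
    "P. multimicronucleatum": (0.1727, 0.1599),
}


def diversity_ratio(pi_mito, pi_nuclear):
    """Ratio of mitochondrial to nuclear diversity at 4-fold degenerate sites."""
    return pi_mito / pi_nuclear


ratios = {sp: diversity_ratio(m, n) for sp, (m, n) in table2.items()}

for species, ratio in ratios.items():
    print(f"{species}: {ratio:.2f}")
```

Only P. sexaurelia departs strongly from a ratio near 1, which is the discordance the text attributes to selection, demography, or mutation-rate differences.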

Conclusions

This is the first study where multiple individuals of several species of free-living unicellular eukaryotes have been sequenced and used to reveal genome-wide patterns of variation. This study thus provides an invaluable resource for future research in Paramecium, a major model organism in cell biology. Sequencing of multiple individuals from the same lake or pond is needed to further resolve the global versus local population structure of these cosmopolitan unicellular species. Overall, our results suggest that most of the intergenic regions across the highly compact genomes of Paramecium species are strongly constrained and probably harbor functionally important regulatory modules, which might extend into coding regions. We show that patterns of nucleotide diversity across the genome are driven primarily by a combination of purifying selection and mutation rate variation. In the future, such analyses will benefit greatly from incorporating variation in recombination rates across the genomic landscape. We also find that duplicates created by the WGD appear to be evolving under similar constraints. Synonymous to non-synonymous diversity ratios need to be supplemented with divergence levels to distinguish between the various types of selective forces acting on the duplicates. Thus, sequencing of much closer outgroup species would allow us to further understand the role of selection in the evolution of different genomic elements, including gene duplicates that resulted from ancient WGDs.

Methods

Sampling, DNA Extraction, and Sequencing

We examined a total of 56 Paramecium strains originally isolated from environmental sources, including 11 isolates of P. biaurelia, 13 isolates of P. sexaurelia, 12 isolates of P. tetraurelia, 10 isolates of P. caudatum, and 10 isolates of P. multimicronucleatum. A total of 14 strains from P. biaurelia and P. sexaurelia were kindly provided by E. Przybos, whereas the remaining strains were cultured in the Paramecium collections of the Institute of Hydrobiology, TU Dresden, Germany, or the Department of Biology, Indiana University, Bloomington, USA. Prior to our analyses, P. tetraurelia strain Moz 13CXI as well as P. sexaurelia strains Moz 13BIII and Indo 17I were kindly identified by E. Przybos by performing inter- and intrastrain crosses (Przybos 1975). Paramecia were maintained in 0.25% Cerophyl medium (modified according to Sonneborn 1970; Krenek et al. 2011) inoculated with Enterobacter aerogenes. Since P. aurelia species can undergo autogamy (self-fertilization) induced by starvation, and all strains have been cultured under lab conditions for several years, all P. aurelia strains in this study are expected to be more homozygous than wild isolates.

Prior to DNA extraction, all cultures were initiated from a single cell, ensuring clonality. A modified sucrose density separation (Aury et al. 2006) was performed to extract primarily macronuclei (MACs) and to reduce the amount of non-MAC DNA potentially compromising subsequent analyses. Briefly, Paramecium cells were starved for at least two days and filtered through several layers of sterile cheesecloth prior to cell harvesting via gentle centrifugation. Pelleted cells were gently lysed in 0.25 M sucrose medium (pH 6.8) containing 10 mM Tris-HCl, 8 mM MgCl2, 3 mM CaCl2, 0.5–1% Nonidet-P40, and 0.1–0.2% sodium deoxycholate. Macronuclei were subsequently isolated away from cellular debris, micronuclei, and mitochondria by centrifugation at 100 rcf in 0.25 M sucrose wash buffer (see above, without Nonidet-P40 and Na-deoxycholate). Genomic DNA extraction from MAC isolations was performed by a modified CTAB procedure (Gustincich et al. 1991).

All genomes were sequenced using the Illumina HiSeq 2000, and paired-end reads with insert sizes between 200 and 400 bp and a mean length of 151 bp were obtained. The reads mapped with coverage ranging between 40× and 80× (details given in supplementary table S1, Supplementary Material online).

Paramecium Reference Genomes

Complete genomes of P. tetraurelia, P. biaurelia, P. sexaurelia, and P. caudatum were downloaded from ParameciumDB (Arnaiz et al. 2007; Arnaiz and Sperling 2011), along with the annotations. Ortho-paralog relationships as well as functional annotations were retrieved from the supplementary information provided in McGrath et al. (2014a, 2014b). RNAseq data for all four species were downloaded and processed as in Gout and Lynch (2015).

The genome of P. multimicronucleatum was assembled using Celera (Miller et al. 2008) and annotated using Eugene (Foissac et al. 2008), as previously described (McGrath et al. 2014a, 2014b). A detailed description of the sequencing, assembly, and annotation of P. multimicronucleatum can be found in Marinov et al. (in preparation). OMA (Orthology MAtrix) (Altenhoff et al. 2015) was used to find orthologs between P. caudatum and P. multimicronucleatum. A total of 11,002 genes of P. caudatum were found to have 1-1 orthologs in P. multimicronucleatum, and only 1-1 orthologs were used for the current analysis.

Mapping of Reads and Detection of SNPs and Indels

Sequenced paired-end reads were mapped to the reference genomes for each species using BWA (bwa mem) (Li and Durbin 2010) with default parameters, and the mpileup function in Samtools (Li et al. 2009) was used to generate mpileup files for each individual. Two distinct pipelines were employed to identify SNPs to ensure detection accuracy: Bcftools-Vcftools (Li 2011) as well as the maximum likelihood Genotype Frequency Estimator (referred to as GFE) (Lynch 2009; Maruki and Lynch 2015). Both pipelines produced highly overlapping sets of SNPs (supplementary table S2, Supplementary Material online). For both pipelines, sites were only considered for SNP detection and further analysis if the mapping quality was above 30, base quality was above 20, per-base alignment quality was above 15, and the sum of the depth of coverage for all individuals was above 5 times the number of individuals and less than twice the average population coverage. SNPs were called by bcftools if the SNP quality score was above 30 and by GFE if the significance score > 5.91 (P > 0.05). The exact parameters of the Bcftools-Vcftools pipeline to call SNPs and indels are given below.

samtools mpileup -q 30 -Q 20 -t DP -uvf reference_genome.fasta sorted.bam > species.vcf

bcftools call -f GQ -o species_snps.vcf -O v -V indels -cv species.vcf

bcftools call -f GQ -o species_indels.vcf -O v -V snps -cv species.vcf

# filtering snp and indel files for depth and biases
perl vcfutils.pl varFilter -d 50 -D 800 -1 -2 -3 -4 species_snps.vcf > species_snps.filtered1.vcf

perl vcfutils.pl varFilter -d 50 -D 800 -1 -2 -3 -4 species_indels.vcf > species_indels.filtered1.vcf

vcftools --vcf species_snps.filtered1.vcf --minQ 20 --minGQ 30 --minDP 4 --recode --recode-INFO-all --out species_snps.filtered2

vcftools --vcf species_indels.filtered1.vcf --minQ 20 --minGQ 30 --minDP 4 --recode --recode-INFO-all --out species_indels.filtered2

# filter out the recoded sites that are not polymorphic anymore, that is, AF1 = 1
bcftools view -q 0.001 species_snps.filtered2.recode.vcf -o species_snps.final.vcf

bcftools view -q 0.001 species_indels.filtered2.recode.vcf -o species_indels.final.vcf

Mitochondrial SNPs were called using the Bcftools-Vcftools pipeline with the same parameters as noted above.
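The population-level depth criterion described in this section (summed depth above 5 times the number of individuals and below twice the average population coverage) can be sketched as a simple predicate. This is an illustration only, not the code used in the study; the function name `passes_depth_filter` is hypothetical, and "average population coverage" is assumed here to mean the mean summed depth across all individuals.

```python
def passes_depth_filter(site_depths, mean_population_depth):
    """Return True if a site passes the summed-depth criterion.

    site_depths: read depth of each individual at this site.
    mean_population_depth: assumed to be the genome-wide mean of the
        summed depth across individuals (an interpretation, see lead-in).
    """
    n_individuals = len(site_depths)
    total_depth = sum(site_depths)
    lower_bound = 5 * n_individuals          # minimum summed depth
    upper_bound = 2 * mean_population_depth  # guards against collapsed repeats
    return lower_bound < total_depth < upper_bound
```

For ten individuals the lower bound is 50, which agrees with the `-d 50` passed to `varFilter` above.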

Simulation of SNP Detection with Varying Levels of Divergence from the Reference

We simulated reads using wgsim (https://github.com/lh3/wgsim) for a diploid genome with varying levels of divergence from the reference genome of P. caudatum and P. multimicronucleatum, using average values of read length, total number of reads, and insert sizes.

For the P. caudatum reference genome:

wgsim -d 287 -s 113 -N 10201087 -1 151 -2 151 -r 0.10 -R 0.013 caudatum_assembly.fasta caud_0.10_1.fastq caud_0.10_2.fastq >& caud_0.10.info

For the P. multimicronucleatum reference genome:

wgsim -d 261 -s 102 -N 12154044 -1 151 -2 151 -r 0.10 -R 0.011 multimicronucleatum_assembly.fasta multi_0.10_1.fastq multi_0.10_2.fastq >& multi_0.10.info

Reads were simulated for r = 0.0, 0.05, 0.10, 0.15, 0.20, 0.25, and 0.30, where r represents the fraction of the genome with SNPs. The SNP detection pipeline described above was used to call SNPs, which were compared to the simulated set of SNPs (supplementary table S4, Supplementary Material online).
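One simple way to compare called SNPs with the simulated set is to treat both as sets of genomic positions and report the fraction recovered and the fraction of calls that are spurious. The sketch below assumes SNPs are represented as hypothetical `(scaffold, position)` tuples; the function name and the two summary measures are illustrative choices, not necessarily those reported in supplementary table S4.

```python
def snp_detection_summary(simulated, called):
    """Compare called SNP positions against the simulated truth set."""
    simulated, called = set(simulated), set(called)
    true_positives = simulated & called
    false_positives = called - simulated
    nan = float("nan")
    return {
        # fraction of simulated SNPs that the pipeline recovered
        "recall": len(true_positives) / len(simulated) if simulated else nan,
        # fraction of called SNPs that were never simulated
        "false_discovery": len(false_positives) / len(called) if called else nan,
    }


truth = [("scaffold_1", 10), ("scaffold_1", 250), ("scaffold_2", 40), ("scaffold_2", 99)]
calls = [("scaffold_1", 10), ("scaffold_2", 40), ("scaffold_2", 99), ("scaffold_3", 7)]
summary = snp_detection_summary(truth, calls)
```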

Building Phylogenies

A consensus genome (bcftools view species_filtered.recode.vcf | vcfutils.pl vcf2fq > species.cns.fq) was obtained for each individual using bcftools, with heterozygous sites represented by DNA ambiguity codes. Gene nucleotide sequences that are present in single copy in the P. aurelia species and are also present in both the outgroup species (P. caudatum and P. multimicronucleatum) were extracted, and multiple sequence alignments were generated using MUSCLE (Edgar 2004a, 2004b). This amounted to about 55 sequences (including the reference genomes) each in 883 sets of orthologous genes. RAxML 8.0.26 (Stamatakis 2014) was used to build a tree using the concatenation of all 883 aligned gene sequences (raxml -s concatenated_sequences.fasta -n gene -m GTRGAMMA -T 4 -p 7). In addition, individual gene trees (raxml -s gene.fasta -n gene -m GTRGAMMA -T 4 -p 7) were built, which were then used to create a consensus tree (raxml -J MR -z trees.newick -T 4 -m GTRGAMMA) (Aberer et al. 2010).

The mitochondrial phylogeny was built using the same parameters as above by concatenating eight sets of orthologous genes (NADH_dehydrogenase_subunit_1_a, NADH_dehydrogenase_subunit_5, Ymf56orf78, Ymf67orf265_1, Ymf68orf393, apocytochrome_b, cytochrome_c_oxidase_subunit_1, cytochrome_c_oxidase_subunit_2, ribosomal_protein_S14) that were ubiquitously present across all sampled species.

Principal Component Analysis

Genotypes for all individuals were obtained from the vcf files generated by vcftools for all scaffolds. Only the sites that had a genotype available for all individuals were used in the analysis, i.e., if even a single individual had missing data, the site was excluded from analysis. A total of 10,000 markers were used for each species. The first marker of each scaffold was picked randomly within the first 50 SNP positions, and the rest were selected so that all markers were uniformly distributed across all scaffolds. PCA was carried out on these markers by using the R (R-Core-Team 2014) command prcomp (https://stat.ethz.ch/R-manual/R-devel/library/stats/html/prcomp.html) and visualized using the first two components with the R graphing package ggplot2 (Wickham 2010).
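The marker-selection step can be sketched as follows for a single scaffold. This is a hedged illustration: the function name `pick_markers` is hypothetical, and "uniformly distributed" is interpreted here as evenly spaced along the list of SNPs after a random start among the first 50, which may differ from the spacing rule actually used.

```python
import random


def pick_markers(snp_positions, n_markers, rng=random):
    """Pick evenly spaced markers from one scaffold's sorted SNP positions.

    The first marker is drawn at random from the first 50 SNPs; the rest
    are spaced evenly (by SNP index) over the remaining SNPs.
    """
    if not snp_positions or n_markers <= 0:
        return []
    start = rng.randrange(min(50, len(snp_positions)))
    remaining = len(snp_positions) - start
    n = min(n_markers, remaining)
    step = remaining / n  # >= 1, so selected indices never repeat
    return [snp_positions[start + int(i * step)] for i in range(n)]


positions = list(range(0, 10000, 10))  # toy scaffold with 1000 SNPs
markers = pick_markers(positions, 20, random.Random(1))
```

Passing a seeded `random.Random` instance makes the random start reproducible.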

Calculation of Population Genetic Statistics

Nucleotide Diversity and Tajima's D

Nucleotide diversity was calculated as $2pq\,n/(n-1)$, where p and q are the major and minor allele frequencies and n is the number of chromosomes. The major and minor allele frequencies were given for each site by GFE. Diversity at coding sites was partitioned into synonymous and non-synonymous components (e.g., 2-fold degenerate sites are one-third synonymous and two-thirds non-synonymous). Major allele frequencies were also extracted from vcftools and returned very similar results to those from GFE (supplementary table S3, Supplementary Material online).
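As a worked illustration of the per-site estimator, the sketch below reads the expression above as 2pq multiplied by n/(n − 1), the usual small-sample correction, and assumes biallelic sites so that q = 1 − p. The function names are hypothetical.

```python
def site_diversity(major_freq, n_chromosomes):
    """Per-site nucleotide diversity, 2pq * n / (n - 1), for a biallelic site."""
    if n_chromosomes < 2:
        raise ValueError("need at least two chromosomes")
    p = major_freq
    q = 1.0 - p  # minor allele frequency at a biallelic site
    n = n_chromosomes
    return 2.0 * p * q * n / (n - 1)


def mean_diversity(major_freqs, n_chromosomes):
    """Average per-site diversity over a set of sites (monomorphic sites have p = 1)."""
    values = [site_diversity(p, n_chromosomes) for p in major_freqs]
    return sum(values) / len(values)
```

For example, a site with p = 0.5 sampled in 20 chromosomes gives 2 × 0.25 × 20/19, and monomorphic sites (p = 1) contribute zero to the average.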


A sliding window approach was used to calculate Tajima's D across the genome. Vcftools was used to obtain these values for non-overlapping windows of 1 kb in length. The program ms (Hudson 2002) was used to perform 100,000 simulations of 10 individuals with 1 kb locus length, and niceStats (https://github.com/kern-lab/shIC/blob/master/niceStats.c) was used to calculate Tajima's D from the ms output file. To obtain Tajima's D for different categories of sites, SNPs belonging to a particular category of sites (specified in Results) were selected, and vcftools was used to calculate Tajima's D for non-overlapping windows of 10 kb. Windows with 2 or more SNPs were used to calculate the mean, and only alternate windows were used to compare distributions of Tajima's D across different categories by Student's t-test to maximize independence of sampling.
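For reference, Tajima's D for a single window can be computed from the sample size, the number of segregating sites, and the mean number of pairwise differences using the standard formulas of Tajima (1989). The sketch below is a stand-alone illustration and is not the vcftools or niceStats implementation; the function name `tajimas_d` is hypothetical.

```python
import math


def tajimas_d(n, seg_sites, pi):
    """Tajima's D for one window.

    n: number of sampled chromosomes.
    seg_sites: number of segregating sites (S) in the window.
    pi: mean number of pairwise differences in the window (not per site).
    """
    if n < 2 or seg_sites == 0:
        return float("nan")  # D is undefined without segregating sites
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / math.sqrt(variance)
```

Positive values arise when pi exceeds S/a1, that is, when intermediate-frequency variants are in excess, which is the direction of the genome-wide shift reported in Results.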

Plotting Average Expected Heterozygosity across Introns and Intergenic Regions

All introns (<500 bp in length) were split midway and aligned according to the 5′ and 3′ splice sites. All intergenic regions (<500 bp in the outgroup species and <1,000 bp in the P. aurelia species) were split midway and aligned according to the start and stop codons. Note that these were not sequence alignments. At each site, the mean expected heterozygosity was calculated. Sites with an insufficient number of data points were discarded: <200 data points per site for introns and, for intergenic regions, <100 data points per site in the outgroups and <1,000 data points per site in the P. aurelia species.
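The positional averaging described above can be sketched for one side of the split regions. The example assumes each region is supplied as a list of per-site expected heterozygosity values ordered outward from its anchoring site (for example, the 5′ splice site); only the half nearest the anchor is used, and positions supported by too few regions are dropped. The function name and input layout are illustrative assumptions.

```python
def mean_profile_from_anchor(regions, min_points):
    """Mean heterozygosity at each offset from the anchor across regions.

    regions: list of lists of per-site heterozygosity, ordered from the anchor.
    min_points: minimum number of regions that must cover an offset.
    """
    sums, counts = {}, {}
    for values in regions:
        half = values[: len(values) // 2]  # region split midway; keep anchor side
        for offset, het in enumerate(half):
            sums[offset] = sums.get(offset, 0.0) + het
            counts[offset] = counts.get(offset, 0) + 1
    return {
        offset: sums[offset] / counts[offset]
        for offset in sorted(sums)
        if counts[offset] >= min_points
    }


profile = mean_profile_from_anchor([[0.1, 0.2, 0.3, 0.4], [0.3, 0.4]], min_points=2)
```

The other half of each region would be handled the same way after reversing it, so that offsets count inward from the opposite anchor.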

Calculating and Plotting Divergence across Introns and Intergenic Regions

Orthologous coding sequences along with introns were aligned using MUSCLE (Edgar 2004a). Intergenic regions flanked by orthologous genes on both sides were also aligned along with the flanking coding exon using MUSCLE. For each species separately, divergence for different positions (with respect to the start and stop sites) was averaged across all introns and intergenic regions. The divergence values were corrected for multiple substitutions by the Jukes-Cantor sequence substitution model and plotted in the same manner as described above.
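The Jukes-Cantor correction converts an observed proportion of differing sites p into an estimated number of substitutions per site, d = −(3/4) ln(1 − 4p/3). A minimal sketch (the function name is hypothetical):

```python
import math


def jukes_cantor(p_observed):
    """Jukes-Cantor corrected divergence from the observed proportion of differences."""
    if not 0.0 <= p_observed < 0.75:
        # the correction is undefined once differences reach saturation (p >= 3/4)
        raise ValueError("observed divergence must be in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p_observed / 3.0)
```

The corrected value is always at least as large as the observed proportion, and the gap widens as divergence increases.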

GO Analysis

All gene ontology enrichment analyses were performed using the BiNGO plugin for Cytoscape 3.2.1 (Shannon et al. 2003; Maere et al. 2005), using the hypergeometric test to assess significance with Benjamini and Hochberg FDR correction for multiple testing (Benjamini and Hochberg 1995). The latest ontology files (.obo) were downloaded from www.geneontology.org, and custom annotation files were prepared as required using python scripts.
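The Benjamini and Hochberg step-up adjustment applied to the enrichment P values can be sketched in a few lines. This is a generic illustration of the procedure, not BiNGO's code; the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending P
    adjusted = [0.0] * m
    running_min = 1.0
    # walk from the largest P value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

A GO term is then reported as enriched when its adjusted P value falls below the chosen FDR threshold.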

Identification of Heterozygous Regions in the P. aurelia Species

The fraction of heterozygous sites was calculated for 10 bp non-overlapping windows across the genome for all three P. aurelia species, and the top 1% of regions with the highest fraction of heterozygous sites were identified as heterozygous regions.
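A minimal sketch of this windowing step is given below, assuming a hypothetical per-site list of 0/1 flags marking heterozygous sites along one scaffold; the function names and the tie-handling are illustrative choices rather than the procedure used in the study.

```python
def window_het_fractions(het_flags, window):
    """Fraction of heterozygous sites in consecutive non-overlapping windows."""
    fractions = []
    for start in range(0, len(het_flags), window):
        chunk = het_flags[start:start + window]  # last window may be shorter
        fractions.append(sum(chunk) / len(chunk))
    return fractions


def top_windows(fractions, top_fraction=0.01):
    """Indices of the windows with the highest heterozygous fraction."""
    k = max(1, int(len(fractions) * top_fraction))
    ranked = sorted(range(len(fractions)), key=lambda i: fractions[i], reverse=True)
    return sorted(ranked[:k])


flags = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0]
fractions = window_het_fractions(flags, window=4)
```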

Mitochondrial Genome Assembly and Annotation

Mitochondrial genomes were assembled as follows. Sequence reads were assembled using SPAdes (version 3.5.0) (Bankevich et al. 2012) after removing potential adapter sequence with Trimmomatic (version 0.33) (Bolger et al. 2014). Mitochondrial contigs were identified from the resulting assemblies by BLAST searches against the published P. caudatum and P. tetraurelia mitochondrial genomes (Pritchard et al. 1990; Barth and Berendonk 2011).

Protein-coding genes were identified by generating all ORFs longer than 60 aa in all six frames using the Mold, Protozoan, and Coelenterate Mitochondrial Code (i.e., UGA codes for W instead of being a stop codon) and all alternative start codons specific to Paramecium (AUU, AUA, AUG, AUC, GUG, and GUA), and retaining the longest ORFs associated with each stop codon. BLASTP (Altschul et al. 1997) was then used to identify homologs of annotated mitochondrial proteins in P. tetraurelia and P. caudatum. Additional ORFs were identified by imposing the requirement that their length exceed 100 aa and subsequently annotated using BLASTP against the Non-redundant protein sequences (nr) database and HMMER 3.0 (Eddy 2011) scans against the PFAM 27.0 database (Finn et al. 2014).

tRNA genes were annotated with tRNAscan-SE 1.21 (Schattner et al. 2005) using the "Mito/Chloroplast" source.

Supplementary Material

Supplementary data are available at Molecular Biology and Evolution online.

Acknowledgments

We thank Lydia Bright and Jean-Francois Gout for helpful comments on the manuscript, Matthew S. Ackerman for insightful discussions related to the project, and the anonymous reviewers and editor for their valuable suggestions. We would also like to thank Kelley Thomas for providing help with the sequencing and E. Przybos for kindly providing several Paramecium strains and the species identification of some so far unspecified strains. This work was financially supported by the National Science Foundation (MCB-1050161 and DEB-1257806), the European Commission FP7-PEOPLE-2009-IRSES project CINAR PATHOBACTER (project number 247658), and the CNRS research network GDRE "Paramecium Genome Dynamics and Evolution". T.U.B. also acknowledges funding of the "support the best" of the ZUK TU Dresden and the Deutsche Forschungsgemeinschaft (DFG).

References

Abbott RJ, Gomes MF. 1989. Population genetic-structure and outcrossing rate of Arabidopsis thaliana (L.) Heynh. Heredity 62:411–418.

Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, McVean GA. 2010. A map of human genome variation from population-scale sequencing. Nature 467:1061–1073.


Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA. 2012. An integrated map of genetic variation from 1,092 human genomes. Nature 491:56–65.

Aberer AJ, Pattengale ND, Stamatakis A. 2010. Parallelized phylogenetic post-analysis on multi-core architectures. J Comput Sci-Neth. 1:107–114.

Ackermann AA, Panunzi LG, Cosentino RO, Sanchez DO, Aguero F. 2012. A genomic scale map of genetic diversity in Trypanosoma cruzi. BMC Genomics 13:736.

Altenhoff AM, Skunca N, Glover N, Train CM, Sueki A, Pilizota I, Gori K, Tomiczek B, Muller S, Redestig H, et al. 2015. The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Res. 43:D240–D249.

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389–3402.

Arnaiz O, Cain S, Cohen J, Sperling L. 2007. ParameciumDB: a community resource that integrates the Paramecium tetraurelia genome sequence with genetic data. Nucleic Acids Res. 35:D439–D444.

Arnaiz O, Mathy N, Baudry C, Malinsky S, Aury JM, Wilkes CD, Garnier O, Labadie K, Lauderdale BE, Le Mouel A, et al. 2012. The Paramecium germline genome provides a niche for intragenic parasitic DNA: evolutionary dynamics of internal eliminated sequences. PLoS Genet. 8:e1002984.

Arnaiz O, Sperling L. 2011. ParameciumDB in 2011: new tools and new data for functional and comparative genomics of the model ciliate Paramecium tetraurelia. Nucleic Acids Res. 39:D632–D636.

Assefa S, Lim C, Preston MD, Duffy CW, Nair MB, Adroub SA, Kadir KA, Goldberg JM, Neafsey DE, Divis P, et al. 2015. Population genomic structure and adaptation in the zoonotic malaria parasite Plasmodium knowlesi. Proc Natl Acad Sci U S A. 112:13027–13032.

Aury JM, Jaillon O, Duret L, Noel B, Jubin C, Porcel BM, Segurens B, Daubin V, Anthouard V, Aiach N, et al. 2006. Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature 444:171–178.

Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR. 2015. A global reference for human genetic variation. Nature 526:68–74.

Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, et al. 2012. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 19:455–477.

Barth D, Berendonk TU. 2011. The mitochondrial genome sequence of the ciliate Paramecium caudatum reveals a shift in nucleotide composition and codon usage within the genus Paramecium. BMC Genomics 12:272.

Barth D, Krenek S, Fokin SI, Berendonk TU. 2006. Intraspecific genetic variation in Paramecium revealed by mitochondrial cytochrome C oxidase I sequences. J Eukaryot Microbiol. 53:20–25.

Bazin E, Glemin S, Galtier N. 2006. Population size does not influence mitochondrial genetic diversity in animals. Science 312:570–572.

Beale G, Preer JR. 2008. Paramecium: genetics and epigenetics. Boca Raton (FL): CRC Press.

Begun DJ, Aquadro CF. 1992. Levels of naturally-occurring DNA polymorphism correlate with recombination rates in Drosophila melanogaster. Nature 356:519–520.

Begun DJ, Holloway AK, Stevens K, Hillier LW, Poh YP, Hahn MW, Nista PM, Jones CD, Kern AD, Dewey CN, et al. 2007. Population genomics: whole-genome analysis of polymorphism and divergence in Drosophila simulans. PLoS Biol. 5:e310.

Benjamini Y, Hochberg Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B Met. 57:289–300.

Betermier M. 2004. Large-scale genome remodelling by the developmentally programmed elimination of germ line sequences in the ciliate Paramecium. Res Microbiol. 155:399–408.

Birky CW, Walsh JB. 1988. Effects of linkage on rates of molecular evolution. Proc Natl Acad Sci U S A. 85:6414–6418.

Birnbaum RY, Clowney EJ, Agamy O, Kim MJ, Zhao J, Yamanaka T, Pappalardo Z, Clarke SL, Wenger AM, Nguyen L, et al. 2012. Coding exons function as tissue-specific enhancers of nearby genes. Genome Res. 22:1059–1068.

Boel G, Letso R, Neely H, Price WN, Wong KH, Su M, Luff JD, Valecha M, Everett JK, Acton TB, et al. 2016. Codon influence on protein expression in E. coli correlates with mRNA levels. Nature 529:358–363.

Bolger AM, Lohse M, Usadel B. 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30:2114–2120.

Brandvain Y, Wright SI. 2016. The limits of natural selection in a nonequilibrium world. Trends Genet. 32:201–210.

Brunk CF, Lee LC, Tran AB, Li J. 2003. Complete sequence of the mitochondrial genome of Tetrahymena thermophila and comparative methods for identifying highly divergent genes. Nucleic Acids Res. 31:1673–1682.

Burger G, Zhu Y, Littlejohn TG, Greenwood SJ, Schnare MN, Lang BF, Gray MW. 2000. Complete sequence of the mitochondrial genome of Tetrahymena pyriformis and comparison with Paramecium aurelia mitochondrial DNA. J Mol Biol. 297:365–380.

Burki F. 2014. The eukaryotic tree of life from a global phylogenomic perspective. Cold Spring Harb Perspect Biol. 6:a016147.

Cao J, Schneeberger K, Ossowski S, Gunther T, Bender S, Fitz J, Koenig D, Lanz C, Stegle O, Lippert C, et al. 2011. Whole-genome sequencing of multiple Arabidopsis thaliana populations. Nat Genet. 43:956–963.

Catania F, Wurmser F, Potekhin AA, Przybos E, Lynch M. 2009. Genetic diversity in the Paramecium aurelia species complex. Mol Biol Evol. 26:421–431.

Chalker DL, Yao MC. 2011. DNA elimination in ciliates: transposon domestication and genome surveillance. Annu Rev Genet. 45:227–246.

Charlesworth D. 2003. Effects of inbreeding on the genetic diversity of populations. Philos T Roy Soc B. 358:1051–1070.

Charlesworth B, Morgan MT, Charlesworth D. 1993. The effect of deleterious mutations on neutral molecular variation. Genetics 134:1289–1303.

Chia JM, Song C, Bradbury PJ, Costich D, de Leon N, Doebley J, Elshire RJ, Gaut B, Geller L, Glaubitz JC, et al. 2012. Maize HapMap2 identifies extant variation from a genome in flux. Nat Genet. 44:803–807.

Comeron JM, Williford A, Kliman RM. 2008. The Hill-Robertson effect: evolutionary consequences of weak selection and linkage in finite populations. Heredity 100:19–31.

Dey A, Chan CK, Thomas CG, Cutter AD. 2013. Molecular hyperdiversity defines populations of the nematode Caenorhabditis brenneri. Proc Natl Acad Sci U S A. 110:11056–11060.

Dong X, Navratilova P, Fredman D, Drivenes O, Becker TS, Lenhard B. 2010. Exonic remnants of whole-genome duplication reveal cis-regulatory function of coding exons. Nucleic Acids Res. 38:1071–1085.

Downing T, Imamura H, Decuypere S, Clark TG, Coombs GH, Cotton JA, Hilley JD, de Doncker S, Maes I, Mottram JC, et al. 2011. Whole genome sequencing of multiple Leishmania donovani clinical isolates provides insights into population structure and mechanisms of drug resistance. Genome Res. 21:2143–2156.

Drummond DA, Bloom JD, Adami C, Wilke CO, Arnold FH. 2005. Why highly expressed proteins evolve slowly. Proc Natl Acad Sci U S A. 102:14338–14343.

Eddy SR. 2011. Accelerated profile HMM searches. PLoS Comput Biol. 7:e1002195.

Edgar RC. 2004a. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5:113.

Edgar RC. 2004b. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32:1792–1797.

Eichenlaub MP, Ettwiller L. 2011. De novo genesis of enhancers in vertebrates. PLoS Biol. 9:e1001188.

Ellison CE, Hall C, Kowbel D, Welch J, Brem RB, Glass NL, Taylor JW. 2011. Population genomics and local adaptation in wild isolates of a model microbial eukaryote. Proc Natl Acad Sci U S A. 108:2831–2836.

Elyashiv E, Bullaughey K, Sattath S, Rinott Y, Przeworski M, Sella G. 2010. Shifts in the intensity of purifying selection: an analysis of genome-wide polymorphism data from two closely related yeast species. Genome Res. 20:1558–1573.

Engelthaler DM, Hicks ND, Gillece JD, Roe CC, Schupp JM, Driebe EM, Gilgado F, Carriconde F, Trilles L, Firacative C, et al. 2014. Cryptococcus gattii in North American Pacific Northwest: whole-population genome analysis provides insights into species evolution and dispersal. mBio 5:e01464–e01414.

Fawcett JA, Iida T, Takuno S, Sugino RP, Kado T, Kugou K, Mura S, Kobayashi T, Ohta K, Nakayama J, et al. 2014. Population genomics of the fission yeast Schizosaccharomyces pombe. PLoS ONE 9:e104241.

Fenchel T, Finlay BJ. 2004. The ubiquity of small species: patterns of local and global diversity. Bioscience 54:777–784.

Finlay BJ. 2002. Global dispersal of free-living microbial eukaryote species. Science 296:1061–1063.

Finlay BJ, Esteban GF, Fenchel T. 2004. Protist diversity is different? Protist 155:15–22.

Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J, et al. 2014. Pfam: the protein families database. Nucleic Acids Res. 42:D222–D230.

Foissac S, Gouzy J, Rombauts S, Mathe C, Amselem J, Sterck L, Van de Peer Y, Rouze P, Schiex T. 2008. Genome annotation in plants and fungi: EuGene as a model platform. Curr Bioinform. 3:87–97.

Foissner W. 1999. Protist diversity: estimates of the near-imponderable. Protist 150:363–368.

Foissner W. 2006. Biogeography and dispersal of micro-organisms: a review emphasizing protists. Acta Protozool. 45:111–136.

Foissner W. 2008. Protist diversity and distribution: some basic considerations. Biodivers Conserv. 17:235–242.

Foissner W. 2011. Dispersal of protists: the role of cysts and human introductions. In: Fontaneto D, editor. Biogeography of microscopic organisms: is everything small everywhere? Cambridge University Press. p. 61–87.

Foissner W, Chao A, Katz LA. 2008. Diversity and geographic distribution of ciliates (Protista: Ciliophora). Biodivers Conserv. 17:345–363.

Fournier-Level A, Korte A, Cooper MD, Nordborg M, Schmitt J, Wilczek AM. 2011. A map of local adaptation in Arabidopsis thaliana. Science 334:86–89.

Galtier N, Nabholz B, Glemin S, Hurst GDD. 2009. Mitochondrial DNA as a marker of molecular diversity: a reappraisal. Mol Ecol. 18:4541–4550.

Gout JF, Kahn D, Duret L, Paramecium Post-Genomics Consortium. 2010. The relationship among gene expression, the evolution of gene dosage, and the rate of protein evolution. PLoS Genet. 6:e1000944.

Gout JF, Lynch M. 2015. Maintenance and loss of duplicated genes by dosage subfunctionalization. Mol Biol Evol. 32:2141–2148.

Gruchy DF. 1955. The breeding system and distribution of Tetrahymena pyriformis. J Protozool. 2:178–185.

Gustincich S, Manfioletti G, Del Sal G, Schneider C, Carninci P. 1991. A fast method for high-quality genomic DNA extraction from whole human blood. BioTechniques 11:298–300, 302.

Hancock AM, Witonsky DB, Alkorta-Aranburu G, Beall CM, Gebremedhin A, Sukernik R, Utermann G, Pritchard JK, Coop G, Di Rienzo A. 2011. Adaptations to climate-mediated selective pressures in humans. PLoS Genet. 7:e1001375.

Hiwatashi K. 1968. Determination and inheritance of mating type in Paramecium caudatum. Genetics 58:373–386.

Hori M, Tomikawa I, Przybos E, Fujishima M. 2006. Comparison of the evolutionary distances among syngens and sibling species of Paramecium. Mol Phylogenet Evol. 38:697–704.

Huang X, Kurata N, Wei X, Wang ZX, Wang A, Zhao Q, Zhao Y, Liu K, Lu H, Li W, et al. 2012. A map of rice genome variation reveals the origin of cultivated rice. Nature 490:497–501.

Huang X, Zhao Q, Han B. 2015. Comparative population genomics reveals strong divergence and infrequent introgression between Asian and African rice. Mol Plant 8:958–960.

Hudson RR. 2002. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18:337–338.

Hudson RR, Turelli M. 2003. Stochasticity overrules the "three-times rule": genetic drift, genetic draft, and coalescence times for nuclear loci versus mitochondrial DNA. Evolution 57:182–190.

Jaillon O, Bouhouche K, Gout JF, Aury JM, Noel B, Saudemont B, Nowacki M, Serrano V, Porcel BM, Segurens B, et al. 2008. Translational control of intron splicing in eukaryotes. Nature 451:359–362.

James JE, Piganeau G, Eyre-Walker A. 2016. The rate of adaptive evolution in animal mitochondria. Mol Ecol. 25:67–78.

Joy DA, Feng X, Mu J, Furuya T, Chotivanich K, Krettli AU, Ho M, Wang A, White NJ, Suh E, et al. 2003. Early origin and recent expansion of Plasmodium falciparum. Science 300:318–321.

Kaplan NL, Hudson RR, Langley CH. 1989. The hitchhiking effect revisited. Genetics 123:887–899.

Katz LA, McManus GB, Snoeyenbos-West OLO, Griffin A, Pirog K, Costas B, Foissner W. 2005. Reframing the 'Everything is everywhere' debate: evidence for high gene flow and diversity in ciliate morphospecies. Aquat Microb Ecol. 41:55–65.

Katz LA, Snoeyenbos-West O, Doerder FP. 2006. Patterns of protein evolution in Tetrahymena thermophila: implications for estimates of effective population size. Mol Biol Evol. 23:608–614.

Krenek S, Berendonk TU, Fokin SI. 2015. New Paramecium (Ciliophora, Oligohymenophorea) congeners shape our view on its biodiversity. Org Divers Evol. 15:215–233.

Krenek S, Berendonk TU, Petzoldt T. 2011. Thermal performance curves of Paramecium caudatum: a model selection approach. Eur J Protistol. 47:124–137.

Krenek S, Petzoldt T, Berendonk TU. 2012. Coping with temperature at the warm edge–patterns of thermal adaptation in the microbial eukaryote Paramecium caudatum. PLoS One 7:e30598.

Kudla G, Murray AW, Tollervey D, Plotkin JB. 2009. Coding-sequence determinants of gene expression in Escherichia coli. Science 324:255–258.

Kvitek DJ, Will JL, Gasch AP. 2008. Variations in stress sensitivity and genomic expression in diverse S. cerevisiae isolates. PLoS Genet. 4:e1000223.

Lahr DJ, Laughinghouse HDt, Oliverio AM, Gao F, Katz LA. 2014. How discordant morphological and molecular evolution among microorganisms can revise our notions of biodiversity on Earth. Bioessays 36:950–959.

Langley CH, Stevens K, Cardeno C, Lee YC, Schrider DR, Pool JE, Langley SA, Suarez C, Corbett-Detig RB, Kolaczkowski B, et al. 2012. Genomic variation in natural populations of Drosophila melanogaster. Genetics 192:533–598.

Leffler EM, Bullaughey K, Matute DR, Meyer WK, Segurel L, Venkat A, Andolfatto P, Przeworski M. 2012. Revisiting an old riddle: what determines genetic diversity levels within species? PLoS Biol. 10:e1001388.

Li H. 2011. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27:2987–2993.

Li H, Durbin R. 2010. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26:589–595.

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup. 2009. The sequence alignment/map format and SAMtools. Bioinformatics 25:2078–2079.

Lim LP, Burge CB. 2001. A computational analysis of sequence features involved in recognition of short introns. Proc Natl Acad Sci U S A. 98:11193–11198.

Liti G, Carter DM, Moses AM, Warringer J, Parts L, James SA, Davey RP, Roberts IN, Burt A, Koufopanou V, et al. 2009. Population genomics of domestic and wild yeasts. Nature 458:337–341.

Loveless MD, Hamrick JL. 1984. Ecological determinants of genetic structure in plant populations. Annu Rev Ecol Syst. 15:65–95.

Lynch M. 2007. The origins of genome architecture. Sunderland (MA): Sinauer Associates.

Lynch M. 2009. Estimation of allele frequencies from high-coverage genome-sequencing projects. Genetics 182:295–301.

Lynch M, Field MC, Goodson HV, Malik HS, Pereira-Leal JB, Roos DS, Turkewitz AP, Sazer S. 2014. Evolutionary cell biology: two origins, one objective. Proc Natl Acad Sci U S A. 111:16990–16994.

Lynch M, Sung W, Morris K, Coffey N, Landry CR, Dopman EB, Dickinson WJ, Okamoto K, Kulkarni S, Hartl DL, et al. 2008. A genome-wide view of the spectrum of spontaneous mutations in yeast. Proc Natl Acad Sci U S A. 105:9272–9277.

Maere S, Heymans K, Kuiper M. 2005. BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics 21:3448–3449.

Maguire B, Belk D. 1967. Paramecium transport by land snails. J Protozool. 14:445–447.

Maruki T, Lynch M. 2015. Genotype-frequency estimation from high-throughput sequencing data. Genetics 201:473–486.

McGrath CL, Gout JF, Doak TG, Yanagi A, Lynch M. 2014a. Insights into three whole-genome duplications gleaned from the Paramecium caudatum genome sequence. Genetics 197:1417–1428.

McGrath CL, Gout JF, Johri P, Doak TG, Lynch M. 2014b. Differential retention and divergent resolution of duplicate genes following whole-genome duplication. Genome Res. 24:1665–1675.

Miller JR, Delcher AL, Koren S, Venter E, Walenz BP, Brownley A, Johnson J, Li K, Mobarry C, Sutton G. 2008. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics 24:2818–2824.

Moradian MM, Beglaryan D, Skozylas JM, Kerikorian V. 2007. Complete mitochondrial genome sequence of three Tetrahymena species reveals mutation hot spots and accelerated nonsynonymous substitutions in Ymf genes. PLoS ONE 2:e650.

Nabholz B, Glemin S, Galtier N. 2009. The erratic mitochondrial clock: variations of mutation rate, not population size, affect mtDNA diversity across birds and mammals. BMC Evol Biol. 9:54.

Nabholz B, Mauffrey JF, Bazin E, Galtier N, Glemin S. 2008. Determination of mitochondrial genetic diversity in mammals. Genetics 178:351–361.

Neafsey DE, Galinsky K, Jiang RH, Young L, Sykes SM, Saif S, Gujja S, Goldberg JM, Young S, Zeng Q, et al. 2012. The malaria parasite Plasmodium vivax exhibits greater genetic diversity than Plasmodium falciparum. Nat Genet. 44:1046–1050.

Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, Indap A, King KS, Bergmann S, Nelson MR, et al. 2008. Genes mirror geography within Europe. Nature 456:98–101.

Ohta T. 1974. Mutational pressure as main cause of molecular evolution and polymorphism. Nature 252:351–354.

Ohta T. 1992. The nearly neutral theory of molecular evolution. Annu Rev Ecol Syst. 23:263–286.

Pawlowski J, Audic S, Adl S, Bass D, Belbahri L, Berney C, Bowser SS, Cepicka I, Decelle J, Dunthorn M, et al. 2012. CBOL protist working group: barcoding eukaryotic richness beyond the animal, plant, and fungal kingdoms. PLoS Biol. 10:e1001419.

Piganeau G, Eyre-Walker A. 2009. Evidence for variation in the effective population size of animal mitochondrial DNA. PLoS One 4:e4396.

Pool JE, Nielsen R. 2007. Population size changes reshape genomic patterns of diversity. Evolution 61:3001–3006.

Pophaly SD, Tellier A. 2015. Population level purifying selection and gene expression shape subgenome evolution in maize. Mol Biol Evol. 32:3226–3235.

Prescott DM. 1994. The DNA of ciliated protozoa. Microbiol Rev. 58:233–267.

Pritchard AE, Seilhamer JJ, Mahalingam R, Sable CL, Venuti SE, Cummings DJ. 1990. Nucleotide sequence of the mitochondrial genome of Paramecium. Nucleic Acids Res. 18:173–180.

Przybos E. 1975. Genetic studies of Paramecium jenningsi strains (Diller, Earl, 1958). Folia Biol (Krakow). 23:425–471.

Przybos E. 2005. Recent data on the occurrence of species of the Paramecium aurelia complex in Europe. Folia Biol-Krakow. 53:61–63.

Przybos E, Barth D, Berendonk TU. 2008. The Paramecium aurelia species complex, frequency and co-occurrence across Europe. Folia Biol (Krakow). 56:77–81.

Przybos E, Rautian M, Greczek-Stachura M, Potekhin A. 2007. Polymorphism within Paramecium sexaurelia (Ciliophora, Oligohymenophorea) and description of a new stand of the species in China. Folia Biol (Krakow). 55:121–125.

Przybos E, Surmacz M. 2010. New, world-wide data on the distribution of species of the Paramecium aurelia complex (Ciliophora, Protozoa). Folia Biol-Krakow. 58:185–188.

R-Core-Team. 2014. R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

Redden H, Alper HS. 2015. The development and characterization of synthetic minimal yeast promoters. Nat Commun. 6:7810.

Ritter DI, Dong Z, Guo S, Chuang JH. 2012. Transcriptional enhancers in protein-coding exons of vertebrate developmental genes. PLoS One 7:e35202.

Russell CB, Fraga D, Hinrichsen RD. 1994. Extremely short 20-33 nucleotide introns are the standard length in Paramecium tetraurelia. Nucleic Acids Res. 22:1221–1225.

Salim HMW, Ring KL, Cavalcanti ARO. 2008. Patterns of codon usage in two ciliates that reassign the genetic code: Tetrahymena thermophila and Paramecium tetraurelia. Protist 159:283–298.

Schacherer J, Shapiro JA, Ruderfer DM, Kruglyak L. 2009. Comprehensive polymorphism survey elucidates population structure of Saccharomyces cerevisiae. Nature 458:342–345.

Schattner P, Brooks AN, Lowe TM. 2005. The tRNAscan-SE, snoscan and snoGPS web servers for the detection of tRNAs and snoRNAs. Nucleic Acids Res. 33:W686–W689.

Schrider DR, Kern AD. 2015. Inferring selective constraint from population genomic data suggests recent regulatory turnover in the human brain. Genome Biol Evol. 7:3511–3528.

Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. 2003. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13:2498–2504.

Simonsen KL, Churchill GA, Aquadro CF. 1995. Properties of statistical tests of neutrality for DNA polymorphism data. Genetics 141:413–429.

Smith DR, Lee RW. 2008. Nucleotide diversity in the mitochondrial and nuclear compartments of Chlamydomonas reinhardtii: investigating the origins of genome architecture. BMC Evol Biol. 8:156.

Smith JM, Haigh J. 1974. The hitch-hiking effect of a favourable gene. Genetics Res. 89:391–403.

Snoke MS, Berendonk TU, Barth D, Lynch M. 2006. Large global effective population sizes in Paramecium. Mol Biol Evol. 23:2474–2479.

Sonneborn TM. 1970. Methods in Paramecium research. In: Methods Cell Biol. New York and London: Academic Press. p. 241–339.

Sonneborn TM. 1975. Paramecium aurelia complex of fourteen sibling species. T Am Microsc Soc. 94:155–178.

Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30:1312–1313.

Stoeck T, Welter H, Seitz-Bender D, Kusch J, Schmidt HJ. 2000. ARDRA and RAPD-fingerprinting reject the sibling species concept for the ciliate Paramecium caudatum (Ciliophora, Protoctista). Zool Scr. 29:75–82.

Sun M. 1999. Cleistogamy in Scutellaria indica (Labiatae): effective mating system and population genetic structure. Mol Ecol. 8:1285–1295.

Swart EC, Nowacki M, Shum J, Stiles H, Higgins BP, Doak TG, Schotanus K, Magrini VJ, Minx P, Mardis ER, et al. 2012. The Oxytricha trifallax mitochondrial genome. Genome Biol Evol. 4:136–154.

Tajima F. 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123:585–595.

Tishkoff SA, Reed FA, Ranciaro A, Voight BF, Babbitt CC, Silverman JS, Powell K, Mortensen HM, Hirbo JB, Osman M, et al. 2007. Convergent adaptation of human lactase persistence in Africa and Europe. Nat Genet. 39:31–40.

Tsukii Y, Hiwatashi K. 1983. Genes controlling mating-type specificity in Paramecium caudatum: three loci revealed by intersyngenic crosses. Genetics 104:41–62.

Tsukii Y, Hiwatashi K. 1985. Meiotic nondisjunction and aneuploids in intersyngenic hybrids of Paramecium caudatum. Genetics 111:779–794.

Tuller T, Zur H. 2015. Multiple roles of the coding sequence 5' end in gene expression regulation. Nucleic Acids Res. 43:13–28.

Voight BF, Kudaravalli S, Wen X, Pritchard JK. 2006. A map of recent positive selection in the human genome. PLoS Biol. 4:e72.

Volkman SK, Sabeti PC, DeCaprio D, Neafsey DE, Schaffner SF, Milner DA Jr, Daily JP, Sarr O, Ndiaye D, Ndir O, et al. 2007. A genome-wide map of diversity in Plasmodium falciparum. Nat Genet. 39:113–119.

Weir W, Capewell P, Foth B, Clucas C, Pountain A, Steketee P, Veitch N, Koffi M, De Meeus T, Kabore J, et al. 2016. Population genomics reveals the origin and asexual evolution of human infective trypanosomes. eLife 5:e11473.

Weisse T. 2008. Distribution and diversity of aquatic protists: an evolutionary and ecological perspective. Biodivers Conserv. 17:243–259.

Wickham H. 2010. ggplot2: Elegant graphics for data analysis. New York: Springer-Verlag.

Zhang JZ, Yang JR. 2015. Determinants of the rate of protein sequence evolution. Nat Rev Genet. 16:409–420.

of S. cerevisiae as well as S. paradoxus. In fact, this deviation in selective constraints experienced by the sake population is complemented by differences in response to different stresses (Kvitek et al. 2008), suggesting that changes in environment might be reflected in gene-specific constraints.

Distribution of Tajima's D across the Genome
Contrary to genes under purifying selection, genes under balancing or positive selection might often be species-specific. We tested this hypothesis using Tajima's D, a statistic that uses the difference between nucleotide diversity and the number of segregating sites to detect deviations from neutrality. Because Tajima's D is sensitive to population size changes and population structure (Simonsen et al. 1995), we used an outlier approach. We calculated Tajima's D for non-overlapping 1 kb windows across the genome (supplementary fig. S9, Supplementary Material online) and looked for the presence of candidate genes in the outlier regions (0.25% of the two tails). The outlier regions with negative values of Tajima's D might represent genes either under weak purifying or positive selection (i.e., experiencing a recent sweep), whereas the ones with positive Tajima's D might be under balancing selection. Interestingly, both sets of outliers show an over-representation of genes involved in glycosylation, glycoprotein production and metabolism, and carbohydrate metabolism (supplementary table S8, Supplementary Material online) for all Paramecium species. For each species, the large functional overlap of genes in both tails suggests an important role of these processes in Paramecium growth. More importantly, it further suggests that similar genome-wide selective pressures act on all Paramecium species examined in this study. These conclusions can, however, be biased by gene length, as shorter genes might not have sufficient power for an accurate calculation of Tajima's D. Indeed, the outliers are all significantly longer than the rest of the genes in P. tetraurelia (outliers: 1567 bp; non-outliers: 1321 bp; P = 4.7 × 10^-13), P. biaurelia (outliers: 1681 bp; non-outliers: 1372 bp; P < 2.2 × 10^-16), and P. sexaurelia (outliers: 1860 bp; non-outliers: 1379 bp; P < 2.2 × 10^-16). This bias is, however, much lower or not significant for P. caudatum (outliers: 1463 bp; non-outliers: 1383 bp; P = 7.3 × 10^-3) and P. multimicronucleatum (outliers: 1509 bp; non-outliers: 1445 bp; P = 0.19), possibly because of higher genome-wide nucleotide diversity, indicating that, despite some bias, the results are biologically meaningful.
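The window-based outlier scan described above can be sketched in a few lines. The block below is an illustration only, not the pipeline used in this study: it implements the standard Tajima (1989) estimator for a single window and then picks the windows in the two tails of the genome-wide distribution. The function names (`tajimas_d`, `tail_outliers`) and the `tail_fraction` argument are hypothetical names introduced for this sketch.

```python
import math


def tajimas_d(n, seg_sites, mean_pairwise_diff):
    """Tajima's D (Tajima 1989) for one window.

    n: number of sampled sequences
    seg_sites: number of segregating sites (S) in the window
    mean_pairwise_diff: average number of pairwise differences (k)
    """
    # D is undefined without polymorphism, and the variance term of this
    # formula vanishes for n < 4.
    if n < 4 or seg_sites == 0:
        return float("nan")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (mean_pairwise_diff - seg_sites / a1) / math.sqrt(variance)


def tail_outliers(values, tail_fraction):
    """Indices of windows in the lowest and highest tail_fraction of values.

    Windows with an undefined D (NaN) are ignored. tail_fraction is a
    hypothetical parameter; the cutoff is set by the caller.
    """
    ranked = sorted((v, i) for i, v in enumerate(values) if not math.isnan(v))
    k = max(1, int(len(ranked) * tail_fraction))
    low = [i for _, i in ranked[:k]]
    high = [i for _, i in ranked[-k:]]
    return low, high
```

Negative-tail windows would then be the candidates for weak purifying or positive selection, and positive-tail windows the candidates for balancing selection; genes overlapping either set can be passed to an enrichment test.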

Tajima's D, as a summary statistic of the site-frequency spectrum, can also indicate overall patterns of selection or recent changes in demography. Overall, in most species, the genome-wide distribution of Tajima's D is shifted towards positive values, with means of 0.862 for P. tetraurelia, 0.504 for P. biaurelia, 0.274 for P. sexaurelia, -0.148 [0.459] for P. caudatum [subset], and 1.325 [0.855] for P. multimicronucleatum [subset]. This could suggest pervasive local adaptation (or balancing selection), a recent contraction in population size, or a fairly ancient population subdivision in all Paramecium species. A previous study that sampled variation at 10 nuclear loci found that strains from different continents shared the same alleles and interpreted that as evidence of regular global dispersal of Paramecium from relatively few permanent locations (Catania et al. 2009). This could also be a potential explanation for the observed increase in nucleotide diversity in comparison to the number of segregating sites.

Fraction of Heterozygous Sites across the Genome
On average, about 0.1% of sites in an individual were heterozygous within the P. aurelia species, whereas 0.7% of sites in individuals belonging to P. caudatum and 1.1% of sites in isolates of P. multimicronucleatum were found to be heterozygous (supplementary table S1, Supplementary Material online). This is consistent with the observation that the P. aurelia species undergo autogamy, whereas the outgroup species do not. On evaluating the fraction of heterozygous sites in windows of 10 kb across the genome, we find large blocks of homozygosity in the P. aurelia genomes (supplementary fig. S10a–e, Supplementary Material online), consistent with a loss of heterozygosity in the lab, possibly through multiple episodes of autogamy. It should be noted, however, that peaks of heterozygosity in the P. aurelia genomes are highly correlated between isolates, unlike in P. caudatum and P. multimicronucleatum, and might reflect mapping or assembly errors. These heterozygous regions (summarized in supplementary Files S1–S3, Supplementary Material online) comprise a very small part (0.82–0.97%) of the genome and, most importantly, their exclusion does not affect any of our analyses qualitatively.
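A windowed heterozygosity scan of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the analysis: positions are taken to be 0-based coordinates on a single scaffold, `het_positions` and `callable_positions` are hypothetical inputs listing the heterozygous and the confidently genotyped sites of one isolate, and the `threshold` used to call a window homozygous is left to the caller.

```python
from collections import Counter


def heterozygosity_by_window(het_positions, callable_positions, window=10_000):
    """Fraction of callable sites that are heterozygous, per non-overlapping window."""
    het = Counter(p // window for p in het_positions)
    total = Counter(p // window for p in callable_positions)
    # Only windows with at least one callable site are reported.
    return {w: het.get(w, 0) / n for w, n in sorted(total.items())}


def homozygous_blocks(fractions, threshold):
    """Runs of consecutive windows whose heterozygous fraction is below threshold.

    Returns a list of (first_window, last_window) index pairs.
    """
    blocks, current = [], []
    for w in sorted(fractions):
        below = fractions[w] < threshold
        if below and (not current or w == current[-1] + 1):
            current.append(w)
        else:
            if current:
                blocks.append((current[0], current[-1]))
            current = [w] if below else []
    if current:
        blocks.append((current[0], current[-1]))
    return blocks
```

Comparing the window indices of heterozygosity peaks between isolates is then a simple set intersection, which is one way to flag the recurrent peaks that may reflect mapping or assembly problems.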

Variation across the Mitochondrial Genome
Few ciliate mitochondrial genomes have been sequenced until now, most belonging to the genus Tetrahymena (Brunk et al. 2003; Burger et al. 2000; Moradian et al. 2007). Others include Paramecium tetraurelia (Pritchard et al. 1990), Paramecium caudatum (Barth and Berendonk 2011), and Oxytricha trifallax (Swart et al. 2012). These studies have illustrated that ciliate mitochondrial genomes are linear and range between 40 and 47 kb, with 50 intronless genes. However, levels of sequence diversity across the entire mitochondrial genome have not previously been characterized.

To obtain mitochondrial genomes for all species included in this study, we carried out de novo genome assembly using the whole-genome resequencing data for each strain and extracted mitochondrial contigs using the published Paramecium mitochondrial genomes as reference (see Methods for more details). Using this approach, we derived complete genome assemblies for multiple strains, including from species without a previously published mitochondrial genome (P. biaurelia, P. sexaurelia, and P. multimicronucleatum). A comprehensive analysis of these sequences will be published elsewhere (Johri et al., in preparation). The mitochondrial genome assembly from each isolate was annotated separately, and the eight genes that were most ubiquitously present in all individuals were concatenated and used to construct a phylogeny (see Methods). The mitochondrial phylogeny closely tracked the one based on nuclear sequences (supplementary fig. S11, Supplementary Material online).

One strain from each species was chosen as a reference genome (the genome with the longest assembly), and reads from other individuals were mapped to the reference to call SNPs. During the original DNA preparation, efforts were made to remove non-MAC DNA, resulting in varying amounts of mitochondrial DNA present in the extracts from each isolate. Hence, some individuals mapped with extremely low coverage (4 to zero read depth) to the reference genome, especially after applying stringent filters to call SNPs (supplementary table S9, Supplementary Material online). Thus, some individuals were excluded from this analysis, including all P. biaurelia isolates. The subsequent analysis was therefore restricted to four species.

All of the mitochondrial genomes are about 40 kb in length. We identified 626 SNPs in P. tetraurelia, 5307 in P. sexaurelia, 4771 [3041] in P. caudatum [subset], and 4600 [1478] in P. multimicronucleatum [subset]. Within all species, very few indels were found, with just 1 in P. tetraurelia, 9 in P. sexaurelia, and 15 [13] and 23 [9] in P. caudatum [subset] and P. multimicronucleatum [subset], respectively. The nucleotide diversity at 4-fold degenerate sites was found to be 0.009 in P. tetraurelia, 0.169 in P. sexaurelia, 0.127 [0.093] in P. caudatum [subset], and 0.172 [0.073] in P. multimicronucleatum [subset]. Nucleotide diversity across different sites shows that rRNAs are the most conserved sequences in the ciliate mitochondrial genomes. Transfer RNAs are the next most highly conserved, with their diversity levels only slightly above those of replacement sites (table 2). Our estimates of diversity in the Paramecium mitochondria are higher than those of a number of other unicellular eukaryotes studied so far: P. falciparum (π_syn = 0.0012) (Joy et al. 2003) and Chlamydomonas reinhardtii (π_silent = 0.0085) (Smith and Lee 2008).
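For readers who want to reproduce per-site diversity values of this kind, the sketch below shows one common way to compute nucleotide diversity for a class of sites (for example, all 4-fold degenerate sites) from per-site allele counts. It is an illustration under assumptions, not the estimator used in this study: the unbiased per-site heterozygosity is averaged over every site in the class, monomorphic sites included, and the function names and the dictionary input format are hypothetical.

```python
import math


def site_pi(allele_counts):
    """Unbiased per-site nucleotide diversity from allele counts.

    allele_counts: mapping of allele to number of sampled sequences
    carrying it, e.g. {"A": 7, "G": 3}.
    """
    n = sum(allele_counts.values())
    if n < 2:
        return float("nan")  # diversity is undefined for fewer than two sequences
    # Probability that two sequences drawn without replacement are identical.
    identical = sum(c * (c - 1) for c in allele_counts.values()) / (n * (n - 1))
    return 1.0 - identical


def mean_pi(sites):
    """Average per-site diversity over all sites of one class.

    Monomorphic sites contribute zero and must be included, otherwise
    the per-site value is inflated.
    """
    values = [site_pi(s) for s in sites]
    values = [v for v in values if not math.isnan(v)]
    return sum(values) / len(values) if values else float("nan")
```

For a biallelic site this reduces to 2j(n - j) / (n(n - 1)), where j is the count of one allele among n sampled sequences.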

We also recalculated nuclear nucleotide diversity (table 2) at 4-fold degenerate sites for the same set of individuals used in the mitochondrial analysis and observed that the mitochondrial to nuclear diversity ratio is close to 1.0 in all species (table 2) except P. sexaurelia, where the mitochondrial diversity is 5.7-fold higher than that in the nucleus. On excluding P. sexaurelia, we obtain a strong correlation between 4-fold degenerate nucleotide diversity in the mitochondria and nucleus (slope = 0.95, R² = 0.96). It should be noted that there are not enough data points to make any inferences about statistical significance. However, such a positive correlation between the nuclear and mitochondrial diversity suggests that mitochondrial diversity might scale with population size in ciliates, which bears on previous studies in animals that have yielded conflicting findings about whether mitochondrial diversity correlates with population size (Bazin et al. 2006; Nabholz et al. 2008; Nabholz et al. 2009; Piganeau and Eyre-Walker 2009). The observed correlation also suggests that mitochondrial and nuclear spontaneous mutation rates may be correlated. Direct estimates of mutation rates in C. elegans, D. melanogaster, S. cerevisiae, and H. sapiens, obtained via mutation accumulation experiments, suggest a moderate correlation (slope = 0.05, R² = 0.48) between mitochondrial and nuclear base substitution rates (using data from Lynch et al. 2008). Discordance between the nuclear and mitochondrial diversity, for instance in P. sexaurelia, can be attributed to various factors such as positive selection (Hudson and Turelli 2003; Galtier et al. 2009; James et al. 2016), demographic events (Pool and Nielsen 2007), or simply differences in nuclear and/or mitochondrial mutation rates (Lynch 2007). Future studies are needed to dissect the factors responsible for the remarkably different ratio of mitochondrial to nuclear diversity in P. sexaurelia.

Table 2. Nucleotide Diversity (per site) of Different Sites across the Mitochondrial Genomes.

Species                            4-fold (π_m) [SE]   2-fold [SE]       0-fold [SE]        tRNA [SE]         rRNA [SE]         intergenic [SE]   4-fold nuclear (π_n) [SE]   π_m/π_n
P. tetraurelia                     0.0091 [0.0010]     0.0107 [0.0009]   0.0019 [0.0002]    0.0000 [0.0000]   0.0006 [0.0002]   0.0041 [0.0003]   0.0058 [2.26 × 10^-5]       1.57
P. sexaurelia                      0.1692 [0.0033]     0.1427 [0.0024]   0.0097 [0.00043]   0.0000 [0.0000]   0.0059 [0.0007]   0.0417 [0.0027]   0.02945 [4.73 × 10^-5]      5.75
P. caudatum                        0.1272 [0.0037]     0.0942 [0.0019]   0.0179 [0.0006]    0.0455 [0.0023]   0.0136 [0.0012]   0.0333 [0.0027]   0.1267 [1.08 × 10^-4]       1.00
P. caudatum (subset)               0.0929 [0.0039]     0.0859 [0.0021]   0.0149 [0.0006]    0.0375 [0.0023]   0.0093 [0.0011]   0.0292 [0.0028]   0.0658 [9.63 × 10^-5]       1.41
P. multimicronucleatum             0.1727 [0.0047]     0.0831 [0.0019]   0.0227 [0.0008]    0.0263 [0.0011]   0.0269 [0.0016]   0.0329 [0.0022]   0.1599 [1.34 × 10^-4]       1.08
P. multimicronucleatum (subset)    0.0730 [0.0034]     0.0383 [0.0015]   0.0121 [0.0006]    0.0126 [0.0009]   0.0135 [0.0012]   0.0312 [0.0022]   0.0742 [9.56 × 10^-5]       0.98
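The mitochondrial-to-nuclear comparison can be re-derived from the 4-fold degenerate diversity values in table 2. The sketch below is illustrative and rests on two assumptions that the text does not state: that the full-sample and subset rows of P. caudatum and P. multimicronucleatum were all used as data points, and that nuclear diversity was regressed on mitochondrial diversity. Under these assumptions an ordinary least-squares fit gives a slope and R² close to the reported values, but this agreement should not be read as confirmation of the authors' procedure. The names `TABLE2` and `linear_fit` are hypothetical.

```python
# (species, 4-fold mitochondrial diversity, 4-fold nuclear diversity), from table 2
TABLE2 = [
    ("P. tetraurelia", 0.0091, 0.0058),
    ("P. sexaurelia", 0.1692, 0.02945),
    ("P. caudatum", 0.1272, 0.1267),
    ("P. caudatum (subset)", 0.0929, 0.0658),
    ("P. multimicronucleatum", 0.1727, 0.1599),
    ("P. multimicronucleatum (subset)", 0.0730, 0.0742),
]


def linear_fit(x, y):
    """Ordinary least squares of y on x; returns (slope, intercept, r_squared)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x, sxy ** 2 / (sxx * syy)


# Ratio of mitochondrial to nuclear diversity for every row of the table.
ratios = {name: mito / nuc for name, mito, nuc in TABLE2}

# Assumption: P. sexaurelia is excluded and all remaining rows are kept.
kept = [(mito, nuc) for name, mito, nuc in TABLE2 if name != "P. sexaurelia"]
mito_pi = [m for m, _ in kept]
nuclear_pi = [v for _, v in kept]

# Assumption: nuclear diversity is the response variable.
slope, intercept, r_squared = linear_fit(mito_pi, nuclear_pi)
```

With only five points, and two of them subsets of other rows, the fit is descriptive at best, which is consistent with the caution above about statistical significance.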

Conclusions
This is the first study where multiple individuals of several species of free-living unicellular eukaryotes have been sequenced and used to reveal genome-wide patterns of variation. This study thus provides an invaluable resource for future research in Paramecium, a major model organism in cell biology. Sequencing of multiple individuals from the same lake or pond is needed to further resolve the global versus local population structure of these cosmopolitan unicellular species. Overall, our results suggest that most of the intergenic regions across the highly compact genomes of Paramecium species are strongly constrained and probably harbor functionally important regulatory modules, which might extend into coding regions. We show that patterns of nucleotide diversity across the genome are driven primarily by a combination of purifying selection and mutation rate variation. In the future, such analyses will benefit greatly from incorporating variation in recombination rates across the genomic landscape. We also find that duplicates created by the WGD appear to be evolving under similar constraints. Synonymous to non-synonymous diversity ratios need to be supplemented with divergence levels to distinguish between the various types of selective forces acting on the duplicates. Thus, sequencing of much closer outgroup species would allow us to further understand the role of selection in the evolution of different genomic elements, including gene duplicates that resulted from ancient WGDs.

Methods

Sampling, DNA Extraction, and Sequencing
We examined a total of 56 Paramecium strains originally isolated from environmental sources, including 11 isolates of P. biaurelia, 13 isolates of P. sexaurelia, 12 isolates of P. tetraurelia, 10 isolates of P. caudatum, and 10 isolates of P. multimicronucleatum. A total of 14 strains from P. biaurelia and P. sexaurelia were kindly provided by E. Przybos, whereas the remaining strains were cultured in the Paramecium collections of the Institute of Hydrobiology, TU Dresden, Germany, or the Department of Biology, Indiana University, Bloomington, USA. Prior to our analyses, P. tetraurelia strain Moz 13CXI as well as P. sexaurelia strains Moz 13BIII and Indo 17I were kindly identified by E. Przybos by performing inter- and intrastrain crosses (Przybos 1975). Paramecia were maintained in 0.25% Cerophyl medium (modified according to Sonneborn 1970; Krenek et al. 2011) inoculated with Enterobacter aerogenes. Since P. aurelia species can undergo autogamy (self-fertilization) induced by starvation, and all strains have been cultured under lab conditions for several years, all P. aurelia strains in this study are expected to be more homozygous than wild isolates.

Prior to DNA extraction, all cultures were initiated from a single cell, ensuring clonality. A modified sucrose density separation (Aury et al. 2006) was performed to extract primarily macronuclei (MACs) and to reduce the amount of non-MAC DNA potentially compromising subsequent analyses. Briefly, Paramecium cells were starved for at least two days and filtered through several layers of sterile cheesecloth prior to cell harvesting via gentle centrifugation. Pelleted cells were gently lysed in 0.25 M sucrose medium (pH 6.8) containing 10 mM Tris-HCl, 8 mM MgCl2, 3 mM CaCl2, 0.5–1% Nonidet-P40, and 0.1–0.2% sodium deoxycholate. Macronuclei were subsequently isolated away from cellular debris, micronuclei, and mitochondria by centrifugation at 100 rcf in 0.25 M sucrose wash buffer (see above; w/o Nonidet-P40 and Na-deoxycholate). Genomic DNA extraction from MAC isolations was performed by a modified CTAB procedure (Gustincich et al. 1991).

All genomes were sequenced using the Illumina HiSeq 2000, and paired-end reads with insert sizes between 200 and 400 bp and a mean length of 151 bp were obtained. The reads mapped with coverage ranging between 40× and 80× (details given in supplementary table S1, Supplementary Material online).

Paramecium Reference Genomes
Complete genomes of P. tetraurelia, P. biaurelia, P. sexaurelia, and P. caudatum were downloaded from ParameciumDB (Arnaiz et al. 2007; Arnaiz and Sperling 2011) along with the annotations. Ortho-paralog relationships as well as functional annotations were retrieved from the supplementary information provided in McGrath et al. (2014a, 2014b). RNAseq data for all four species were downloaded and processed as in Gout and Lynch (2015).

The genome of P. multimicronucleatum was assembled using Celera (Miller et al. 2008) and annotated using EuGene (Foissac et al. 2008), as previously described (McGrath et al. 2014a, 2014b). A detailed description of the sequencing, assembly, and annotation of P. multimicronucleatum can be found in Marinov et al. (in preparation). OMA (Orthology MAtrix) (Altenhoff et al. 2015) was used to find orthologs between P. caudatum and P. multimicronucleatum. A total of 11,002 genes of P. caudatum were found to have 1-1 orthologs in P. multimicronucleatum, and only 1-1 orthologs were used for the current analysis.

Mapping of Reads and Detection of SNPs and Indels

Sequenced paired-end reads were mapped to the reference genomes for each species using BWA (bwa mem) (Li and Durbin 2010) with default parameters, and the mpileup function in Samtools (Li et al. 2009) was used to generate mpileup files for each individual. Two distinct pipelines were employed to identify SNPs to ensure detection accuracy: Bcftools-Vcftools (Li 2011) as well as the maximum likelihood Genotype Frequency Estimator (referred to as GFE) (Lynch 2009; Maruki and Lynch 2015). Both pipelines produced highly overlapping sets of SNPs (supplementary table S2,

Johri et al doi101093molbevmsx074 MBE

1210Downloaded from httpsacademicoupcommbearticle-abstract34511942992914by Arizona State University useron 02 July 2018

Supplementary Material online). For both pipelines, sites were only considered for SNP detection and further analysis if the mapping quality was above 30, base quality was above 20, per-base alignment quality was above 15, and the sum of the depth of coverage for all individuals was at least about 5 times the number of individuals and less than twice the average population coverage. SNPs were called by bcftools if the SNP quality score was above 30, and by GFE if the significance score was > 5.91 (P < 0.05). The exact parameters of the Bcftools-Vcftools pipeline to call SNPs and indels are given below:

samtools mpileup -q 30 -Q 20 -t DP -uvf reference_genome.fasta sorted.bam > species.vcf

bcftools call -f GQ -o species_snps.vcf -O v -V indels -cv species.vcf

bcftools call -f GQ -o species_indels.vcf -O v -V snps -cv species.vcf

#filtering SNP and indel files for depth and biases
perl vcfutils.pl varFilter -d 50 -D 800 -1 -2 -3 -4 species_snps.vcf > species_snps.filtered1.vcf

perl vcfutils.pl varFilter -d 50 -D 800 -1 -2 -3 -4 species_indels.vcf > species_indels.filtered1.vcf

vcftools --vcf species_snps.filtered1.vcf --minQ 20 --minGQ 30 --minDP 4 --recode --recode-INFO-all --out species_snps.filtered2

vcftools --vcf species_indels.filtered1.vcf --minQ 20 --minGQ 30 --minDP 4 --recode --recode-INFO-all --out species_indels.filtered2

#filter out the recoded sites that are not polymorphic anymore, that is, AF1 = 1
bcftools view -q 0.001 species_snps.filtered2.recode.vcf -o species_snps.final.vcf

bcftools view -q 0.001 species_indels.filtered2.recode.vcf -o species_indels.final.vcf

Mitochondrial SNPs were called using the Bcftools-Vcftools pipeline with the same parameters as noted above.
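
The site-level depth criterion described above (summed depth across individuals bounded below by roughly five reads per individual and above by twice the average population coverage) can be illustrated with a minimal Python sketch. The function name, argument names, and the exact inequality directions are assumptions made for illustration; this is not the code used in the study.

```python
def passes_depth_filter(depths, mean_pop_coverage, min_per_individual=5, max_fold=2):
    """Return True if a site passes the summed-depth filter.

    depths            : per-individual read depths at one site (hypothetical input)
    mean_pop_coverage : average summed depth across sites for the population
    """
    total = sum(depths)
    lower = min_per_individual * len(depths)   # ~5 reads per individual
    upper = max_fold * mean_pop_coverage       # twice the average population coverage
    return lower <= total < upper
```

With ten individuals and an average population coverage of 400, these bounds are 50 and 800, which matches the `-d 50 -D 800` values passed to `vcfutils.pl varFilter` above.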

Simulation of SNP Detection with Varying Levels of Divergence from the Reference

We simulated reads using wgsim (https://github.com/lh3/wgsim) for a diploid genome with varying levels of divergence from the reference genome of P. caudatum and P. multimicronucleatum, using average values of read length, total number of reads, and insert sizes.

For the P. caudatum reference genome:
wgsim -d 287 -s 113 -N 10201087 -1 151 -2 151 -r 0.10 -R 0.013 caudatum_assembly.fasta caud_0.10_1.fastq caud_0.10_2.fastq >& caud_0.10.info

For the P. multimicronucleatum reference genome:
wgsim -d 261 -s 102 -N 12154044 -1 151 -2 151 -r 0.10 -R 0.011 multimicronucleatum_assembly.fasta multi_0.10_1.fastq multi_0.10_2.fastq >& multi_0.10.info

Reads were simulated for r = 0.0, 0.05, 0.10, 0.15, 0.20, 0.25, and 0.30, which represents the fraction of the genome with SNPs. The SNP detection pipeline described above was used to call SNPs, which were compared to the simulated set of SNPs (supplementary table S4, Supplementary Material online).
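
One simple way to compare a called SNP set with the simulated truth set is to compute sensitivity and the false discovery rate over site coordinates. The sketch below assumes SNPs are represented as (scaffold, position) tuples; the function name and the choice of these two summary measures are illustrative assumptions, not necessarily what supplementary table S4 reports.

```python
def snp_call_accuracy(simulated, called):
    """Compare called SNP sites with simulated ones.

    Both arguments are iterables of (scaffold, position) tuples.
    Returns (sensitivity, false_discovery_rate).
    """
    simulated, called = set(simulated), set(called)
    true_pos = len(simulated & called)
    sensitivity = true_pos / len(simulated) if simulated else float("nan")
    false_discovery = (len(called) - true_pos) / len(called) if called else float("nan")
    return sensitivity, false_discovery
```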

Building Phylogenies

A consensus genome (bcftools view species_filtered.recode.vcf | vcfutils.pl vcf2fq > species.cns.fq) was obtained for each individual using bcftools that represented heterozygous sites using DNA ambiguity codes. Gene nucleotide sequences that are present in single copy in the P. aurelia species and are also present in both the outgroup species (P. caudatum and P. multimicronucleatum) were extracted, and multiple sequence alignments were generated using MUSCLE (Edgar 2004a, 2004b). This amounted to about 55 sequences (including the reference genomes) each in 883 sets of orthologous genes. RAxML 8.0.26 (Stamatakis 2014) was used to build a tree using the concatenation of all 883 aligned gene sequences (raxml -s concatenated_sequences.fasta -n gene -m GTRGAMMA -T 4 -p 7). In addition, individual gene trees (raxml -s gene.fasta -n gene -m GTRGAMMA -T 4 -p 7) were built, which were then used to create a consensus tree (raxml -J MR -z trees.newick -T 4 -m GTRGAMMA) (Aberer et al. 2010).
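
To make the use of DNA ambiguity codes for heterozygous sites concrete, the following Python sketch maps a diploid genotype to a single consensus character using the standard IUPAC two-base codes. The function name is hypothetical, and this is only an illustration of the encoding, not a reimplementation of `vcfutils.pl vcf2fq`.

```python
# Standard IUPAC codes for two-base ambiguities.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}

def consensus_base(allele1, allele2):
    """Return one character for a diploid genotype: the base itself if
    homozygous, otherwise the IUPAC ambiguity code for the two alleles."""
    a, b = allele1.upper(), allele2.upper()
    if a == b:
        return a
    return IUPAC[frozenset((a, b))]
```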

Mitochondrial phylogeny was built using the same parameters as above by concatenating eight sets of orthologous genes (NADH_dehydrogenase_subunit_1_a, NADH_dehydrogenase_subunit_5, Ymf56orf78, Ymf67orf265_1, Ymf68orf393, apocytochrome_b, cytochrome_c_oxidase_subunit_1, cytochrome_c_oxidase_subunit_2, ribosomal_protein_S14) that were ubiquitously present across all sampled species.

Principal Component Analysis

Genotypes for all individuals were obtained from the vcf files generated by vcftools for all scaffolds. Only the sites that had a genotype available for all individuals were used in the analysis, i.e., if even a single individual had missing data, the site was excluded from analysis. A total of ~10,000 markers were used for each species. The first marker of each scaffold was picked randomly within the first 50 SNP positions, and the rest were selected so that all markers were uniformly distributed across all scaffolds. PCA was carried out on these markers by using the R (R-Core-Team 2014) command prcomp (https://stat.ethz.ch/R-manual/R-devel/library/stats/html/prcomp.html) and visualized using the first two components with the R graphing package ggplot2 (Wickham 2010).
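
The marker-thinning step can be sketched as follows. This is one possible reading of the procedure, in which "uniformly distributed" is taken to mean a fixed stride in SNP index after a random start within the first 50 SNPs of each scaffold. The function name, the input layout (a dict of sorted SNP positions per scaffold), and the stride rule are all assumptions for illustration.

```python
import random

def pick_markers(snps_by_scaffold, target=10000, seed=0):
    """Thin SNPs to roughly `target` markers.

    snps_by_scaffold : dict mapping scaffold name -> sorted list of SNP positions
    """
    rng = random.Random(seed)
    total = sum(len(pos) for pos in snps_by_scaffold.values())
    step = max(1, total // target)  # stride (in SNPs) giving ~target markers overall
    markers = []
    for scaffold, positions in sorted(snps_by_scaffold.items()):
        if not positions:
            continue
        # random first marker among the first 50 SNPs of the scaffold
        start = rng.randrange(min(50, len(positions)))
        markers.extend((scaffold, pos) for pos in positions[start::step])
    return markers
```

The resulting genotype matrix would then be passed to a PCA routine such as the R command prcomp mentioned above.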

Calculation of Population Genetic Statistics

Nucleotide Diversity and Tajima's D

Nucleotide diversity was calculated as $2pq \cdot n/(n-1)$, where p and q are the major and minor allele frequencies and n is the number of chromosomes. The major and minor allele frequencies were given for each site by GFE. Diversity at coding sites was partitioned into synonymous and non-synonymous components (e.g., 2-fold degenerate sites are one-third synonymous and two-thirds non-synonymous). Major allele frequencies were also extracted from vcftools and returned very similar results as those from GFE (supplementary table S3, Supplementary Material online).
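
As a small worked illustration of the diversity formula, the sketch below computes the per-site estimate from a major allele frequency and averages it over a region. The function names and the convention of dividing by the total number of sites considered (including monomorphic ones) are assumptions for illustration.

```python
def site_diversity(p, n):
    """Per-site nucleotide diversity 2pq * n/(n-1).

    p : major allele frequency at the site
    n : number of sampled chromosomes
    """
    q = 1.0 - p
    return 2.0 * p * q * n / (n - 1)

def mean_diversity(major_freqs, n, n_sites):
    """Average diversity over n_sites sites; major_freqs lists only the
    polymorphic sites, since monomorphic sites contribute zero."""
    return sum(site_diversity(p, n) for p in major_freqs) / n_sites
```

For example, a site with p = 0.5 in a sample of n = 10 chromosomes contributes 2 × 0.5 × 0.5 × 10/9 ≈ 0.556.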


A sliding window approach was used to calculate Tajima's D across the genome. Vcftools was used to obtain these values for non-overlapping windows of 1 kb in length. The program ms (Hudson 2002) was used to perform 100,000 simulations of 10 individuals with 1 kb locus length, and niceStats (https://github.com/kern-lab/shIC/blob/master/niceStats.c) was used to calculate Tajima's D from the ms output file. To obtain Tajima's D for different categories of sites, SNPs belonging to a particular category of sites (specified in Results) were selected, and vcftools was used to calculate Tajima's D for non-overlapping windows of 10 kb. Windows with 2 or more SNPs were used to calculate the mean, and only alternate windows were used to compare distributions of Tajima's D across different categories by Student's t-test to maximise independence of sampling.
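
For readers unfamiliar with the statistic, the following sketch computes Tajima's D for a single window from the sample size, the number of segregating sites, and the summed pairwise diversity, using the standard constants from Tajima (1989). It is an independent illustration with a hypothetical function name, not the vcftools or niceStats implementation.

```python
from math import sqrt

def tajimas_d(n, seg_sites, pi):
    """Tajima's D for one window.

    n         : number of sampled chromosomes
    seg_sites : number of segregating sites (S) in the window
    pi        : sum over sites of pairwise diversity in the window
    Returns NaN when D is undefined (no SNPs or zero variance).
    """
    if seg_sites == 0 or n < 2:
        return float("nan")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    if variance <= 0:
        return float("nan")
    # difference between the two estimators of theta, scaled by its standard deviation
    return (pi - seg_sites / a1) / sqrt(variance)
```

An excess of rare variants (pi smaller than S/a1) gives negative values, and an excess of intermediate-frequency variants gives positive values.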

Plotting Average Expected Heterozygosity across Introns and Intergenic Regions

All introns (< 500 bp in length) were split midway and aligned according to the 5′ and 3′ splice sites. All intergenic regions (< 500 bp in the outgroup species and < 1000 bp in the P. aurelia species) were split midway and aligned according to the start and stop codons. Note that these were not sequence alignments. At each site, the mean expected heterozygosity was calculated. Sites with an insufficient number of data points (< 200 data points per site for introns, and for intergenic regions < 100 data points per site for outgroups and < 1000 data points per site in the P. aurelia species) were discarded.
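
The positional averaging described above can be sketched in Python as follows: each region is split at its midpoint, sites in each half are indexed by their distance from the nearer end, and per-offset means are kept only where enough regions contribute. The function name, the input layout, and the handling of the middle site of odd-length regions (assigned here to the 3′ half) are assumptions for illustration.

```python
from collections import defaultdict

def mean_by_offset(regions, min_points):
    """Average per-site values by distance from each end of a region.

    regions    : list of lists of per-site expected heterozygosity (5'->3')
    min_points : minimum number of regions contributing to an offset
    Returns {"5p": {offset: mean}, "3p": {offset: mean}}.
    """
    sums = {"5p": defaultdict(float), "3p": defaultdict(float)}
    counts = {"5p": defaultdict(int), "3p": defaultdict(int)}
    for het in regions:
        half = len(het) // 2
        for i, value in enumerate(het):
            if i < half:
                end, offset = "5p", i                   # distance from 5' end
            else:
                end, offset = "3p", len(het) - 1 - i    # distance from 3' end
            sums[end][offset] += value
            counts[end][offset] += 1
    return {
        end: {off: sums[end][off] / c for off, c in counts[end].items() if c >= min_points}
        for end in sums
    }
```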

Calculating and Plotting Divergence across Introns and Intergenic Regions

Orthologous coding sequences along with introns were aligned using MUSCLE (Edgar 2004a). Intergenic regions flanked by orthologous genes on both sides were also aligned, along with the flanking coding exon, using MUSCLE. For each species separately, divergence for different positions (with respect to the start and stop sites) was averaged across all introns and intergenic regions. The divergence values were corrected for multiple substitutions by the Jukes-Cantor sequence substitution model and plotted in the same manner as described above.
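
The Jukes-Cantor correction converts an observed proportion of differing sites p into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). A minimal Python sketch is given below; the function name and the choice to return infinity at saturation (p ≥ 0.75) are illustrative assumptions.

```python
from math import log

def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed proportion p of
    differing sites. Returns infinity when p >= 0.75 (saturation)."""
    if p >= 0.75:
        return float("inf")
    return -0.75 * log(1.0 - 4.0 * p / 3.0)
```

The correction is small for low divergence (p = 0.10 gives d ≈ 0.107) and grows rapidly as p approaches 0.75.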

GO Analysis

All gene ontology enrichment analyses were performed using the BiNGO plugin for Cytoscape 3.2.1 (Shannon et al. 2003; Maere et al. 2005), using the hypergeometric test to assess significance with Benjamini and Hochberg FDR correction for multiple testing (Benjamini and Hochberg 1995). Latest ontology files (.obo) were downloaded from www.geneontology.org, and custom annotation files were prepared as required using Python scripts.
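
The two statistical ingredients named above can be illustrated with a short standard-library Python sketch: an upper-tail hypergeometric test for over-representation of one GO term, and the Benjamini-Hochberg step-up adjustment across terms. The function and argument names are hypothetical, and this is not BiNGO's code.

```python
from math import comb

def hypergeom_enrichment_p(k, n, K, N):
    """P(X >= k): probability of seeing at least k annotated genes in a
    test set of n genes, when K of the N background genes carry the term."""
    upper = min(n, K)
    numerator = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return numerator / comb(N, n)

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```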

Identification of Heterozygous Regions in the P. aurelia Species

The fraction of heterozygous sites was calculated for 10 bp non-overlapping windows across the genome for all three P. aurelia species, and the top 1% of such regions with the highest fraction of heterozygous sites were identified as heterozygous regions.
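
A minimal sketch of this windowed scan is shown below, under the assumption that the cutoff refers to the top 1% of windows ranked by their fraction of heterozygous sites. The function name, the 0/1 per-site input, and the tie-breaking rule are illustrative assumptions.

```python
def heterozygous_windows(het_flags, window=10, top_fraction=0.01):
    """Return start coordinates of the windows with the highest fraction
    of heterozygous sites.

    het_flags : list of 0/1 values, one per site (1 = heterozygous)
    """
    fractions = []
    for start in range(0, len(het_flags) - window + 1, window):
        chunk = het_flags[start:start + window]
        fractions.append((sum(chunk) / window, start))
    # highest fraction first; ties broken by genomic position
    fractions.sort(key=lambda t: (-t[0], t[1]))
    keep = max(1, int(len(fractions) * top_fraction))
    return sorted(start for _, start in fractions[:keep])
```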

Mitochondrial Genome Assembly and Annotation

Mitochondrial genomes were assembled as follows. Sequence reads were assembled using SPAdes (version 3.5.0) (Bankevich et al. 2012) after removing potential adapter sequence with Trimmomatic (version 0.33) (Bolger et al. 2014). Mitochondrial contigs were identified from the resulting assemblies by BLAST searches against the published P. caudatum and P. tetraurelia mitochondrial genomes (Pritchard et al. 1990; Barth and Berendonk 2011).

Protein coding genes were identified by generating all ORFs longer than 60 aa in all six frames using the Mold, Protozoan, and Coelenterate Mitochondrial Code (i.e., UGA codes for W instead of being a stop codon) and all alternative start codons specific to Paramecium (AUU, AUA, AUG, AUC, GUG, and GUA), and retaining the longest ORFs associated with each stop codon. BLASTP (Altschul et al. 1997) was then used to identify homologs of annotated mitochondrial proteins in P. tetraurelia and P. caudatum. Additional ORFs were identified by imposing the requirement that their length exceeds 100 aa, and subsequently annotated using BLASTP against the Non-redundant protein sequences (nr) database and HMMER 3.0 (Eddy 2011) scans against the PFAM 27.0 database (Finn et al. 2014).
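
The ORF-generation step can be illustrated with a compact six-frame scan in Python. The sketch treats only TAA and TAG as stop codons (TGA encodes W under this code), accepts the six start codons listed above in their DNA form, and keeps, for each stop codon, the ORF beginning at the most upstream in-frame start. The function names, the coordinate convention (0-based, relative to the scanned strand), and the output tuple are assumptions for illustration, not the pipeline used in the study.

```python
STOPS = {"TAA", "TAG"}  # TGA is read as W, so it is not a stop here
STARTS = {"ATT", "ATA", "ATG", "ATC", "GTG", "GTA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def longest_orfs(seq, min_aa=60):
    """Return (strand, frame, start, end, aa_length) for the longest ORF
    ending at each stop codon, keeping ORFs longer than min_aa codons.
    Coordinates are 0-based on the scanned strand; end includes the stop."""
    orfs = []
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            first_start = None  # most upstream start since the previous stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if codon in STOPS:
                    if first_start is not None:
                        aa = (i - first_start) // 3
                        if aa > min_aa:
                            orfs.append((strand, frame, first_start, i + 3, aa))
                    first_start = None
                elif first_start is None and codon in STARTS:
                    first_start = i
    return orfs
```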

tRNA genes were annotated with tRNAscan-SE 1.21 (Schattner et al. 2005) using the "Mito/Chloroplast" source.

Supplementary Material

Supplementary data are available at Molecular Biology and Evolution online.

Acknowledgments

We thank Lydia Bright and Jean-Francois Gout for helpful comments on the manuscript, Matthew S. Ackerman for insightful discussions related to the project, and the anonymous reviewers and editor for their valuable suggestions. We would also like to thank Kelley Thomas for providing help with the sequencing and E. Przybos for kindly providing several Paramecium strains and the species identification of some so far unspecified strains. This work was financially supported by the National Science Foundation (MCB-1050161 and DEB-1257806), the European Commission FP7-PEOPLE-2009-IRSES project CINAR PATHOBACTER (project number 247658), and the CNRS research network GDRE "Paramecium Genome Dynamics and Evolution". T.U.B. also acknowledges funding of the "support the best" of the ZUK, TU Dresden, and the Deutsche Forschungsgemeinschaft (DFG).

References

Abbott RJ, Gomes MF. 1989. Population genetic-structure and outcrossing rate of Arabidopsis-Thaliana (L.) Heynh. Heredity 62:411–418.

Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, McVean GA. 2010. A map of human genome variation from population-scale sequencing. Nature 467:1061–1073.


Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA. 2012. An integrated map of genetic variation from 1,092 human genomes. Nature 491:56–65.

Aberer AJ, Pattengale ND, Stamatakis A. 2010. Parallelized phylogenetic post-analysis on multi-core architectures. J Comput Sci-Neth. 1:107–114.

Ackermann AA, Panunzi LG, Cosentino RO, Sanchez DO, Aguero F. 2012. A genomic scale map of genetic diversity in Trypanosoma cruzi. BMC Genomics 13:736.

Altenhoff AM, Skunca N, Glover N, Train CM, Sueki A, Pilizota I, Gori K, Tomiczek B, Muller S, Redestig H, et al. 2015. The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Res. 43:D240–D249.

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389–3402.

Arnaiz O, Cain S, Cohen J, Sperling L. 2007. ParameciumDB: a community resource that integrates the Paramecium tetraurelia genome sequence with genetic data. Nucleic Acids Res. 35:D439–D444.

Arnaiz O, Mathy N, Baudry C, Malinsky S, Aury JM, Wilkes CD, Garnier O, Labadie K, Lauderdale BE, Le Mouel A, et al. 2012. The Paramecium germline genome provides a niche for intragenic parasitic DNA: evolutionary dynamics of internal eliminated sequences. PLoS Genet. 8:e1002984.

Arnaiz O, Sperling L. 2011. ParameciumDB in 2011: new tools and new data for functional and comparative genomics of the model ciliate Paramecium tetraurelia. Nucleic Acids Res. 39:D632–D636.

Assefa S, Lim C, Preston MD, Duffy CW, Nair MB, Adroub SA, Kadir KA, Goldberg JM, Neafsey DE, Divis P, et al. 2015. Population genomic structure and adaptation in the zoonotic malaria parasite Plasmodium knowlesi. Proc Natl Acad Sci U S A. 112:13027–13032.

Aury JM, Jaillon O, Duret L, Noel B, Jubin C, Porcel BM, Segurens B, Daubin V, Anthouard V, Aiach N, et al. 2006. Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature 444:171–178.

Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR. 2015. A global reference for human genetic variation. Nature 526:68–74.

Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, et al. 2012. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 19:455–477.

Barth D, Berendonk TU. 2011. The mitochondrial genome sequence of the ciliate Paramecium caudatum reveals a shift in nucleotide composition and codon usage within the genus Paramecium. BMC Genomics 12:272.

Barth D, Krenek S, Fokin SI, Berendonk TU. 2006. Intraspecific genetic variation in Paramecium revealed by mitochondrial cytochrome C oxidase I sequences. J Eukaryot Microbiol. 53:20–25.

Bazin E, Glemin S, Galtier N. 2006. Population size does not influence mitochondrial genetic diversity in animals. Science 312:570–572.

Beale G, Preer JR. 2008. Paramecium: genetics and epigenetics. Boca Raton (FL): CRC Press.

Begun DJ, Aquadro CF. 1992. Levels of naturally-occurring DNA polymorphism correlate with recombination rates in Drosophila melanogaster. Nature 356:519–520.

Begun DJ, Holloway AK, Stevens K, Hillier LW, Poh YP, Hahn MW, Nista PM, Jones CD, Kern AD, Dewey CN, et al. 2007. Population genomics: whole-genome analysis of polymorphism and divergence in Drosophila simulans. PLoS Biol. 5:e310.

Benjamini Y, Hochberg Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B Met. 57:289–300.

Betermier M. 2004. Large-scale genome remodelling by the developmentally programmed elimination of germ line sequences in the ciliate Paramecium. Res Microbiol. 155:399–408.

Birky CW, Walsh JB. 1988. Effects of linkage on rates of molecular evolution. Proc Natl Acad Sci U S A. 85:6414–6418.

Birnbaum RY, Clowney EJ, Agamy O, Kim MJ, Zhao J, Yamanaka T, Pappalardo Z, Clarke SL, Wenger AM, Nguyen L, et al. 2012. Coding exons function as tissue-specific enhancers of nearby genes. Genome Res. 22:1059–1068.

Boel G, Letso R, Neely H, Price WN, Wong KH, Su M, Luff JD, Valecha M, Everett JK, Acton TB, et al. 2016. Codon influence on protein expression in E. coli correlates with mRNA levels. Nature 529:358–363.

Bolger AM, Lohse M, Usadel B. 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30:2114–2120.

Brandvain Y, Wright SI. 2016. The limits of natural selection in a nonequilibrium world. Trends Genet. 32:201–210.

Brunk CF, Lee LC, Tran AB, Li J. 2003. Complete sequence of the mitochondrial genome of Tetrahymena thermophila and comparative methods for identifying highly divergent genes. Nucleic Acids Res. 31:1673–1682.

Burger G, Zhu Y, Littlejohn TG, Greenwood SJ, Schnare MN, Lang BF, Gray MW. 2000. Complete sequence of the mitochondrial genome of Tetrahymena pyriformis and comparison with Paramecium aurelia mitochondrial DNA. J Mol Biol. 297:365–380.

Burki F. 2014. The eukaryotic tree of life from a global phylogenomic perspective. Cold Spring Harb Perspect Biol. 6:a016147.

Cao J, Schneeberger K, Ossowski S, Gunther T, Bender S, Fitz J, Koenig D, Lanz C, Stegle O, Lippert C, et al. 2011. Whole-genome sequencing of multiple Arabidopsis thaliana populations. Nat Genet. 43:956–963.

Catania F, Wurmser F, Potekhin AA, Przybos E, Lynch M. 2009. Genetic diversity in the Paramecium aurelia species complex. Mol Biol Evol. 26:421–431.

Chalker DL, Yao MC. 2011. DNA elimination in ciliates: transposon domestication and genome surveillance. Annu Rev Genet. 45:227–246.

Charlesworth D. 2003. Effects of inbreeding on the genetic diversity of populations. Philos T Roy Soc B. 358:1051–1070.

Charlesworth B, Morgan MT, Charlesworth D. 1993. The effect of deleterious mutations on neutral molecular variation. Genetics 134:1289–1303.

Chia JM, Song C, Bradbury PJ, Costich D, de Leon N, Doebley J, Elshire RJ, Gaut B, Geller L, Glaubitz JC, et al. 2012. Maize HapMap2 identifies extant variation from a genome in flux. Nat Genet. 44:803–807.

Comeron JM, Williford A, Kliman RM. 2008. The Hill-Robertson effect: evolutionary consequences of weak selection and linkage in finite populations. Heredity 100:19–31.

Dey A, Chan CK, Thomas CG, Cutter AD. 2013. Molecular hyperdiversity defines populations of the nematode Caenorhabditis brenneri. Proc Natl Acad Sci U S A. 110:11056–11060.

Dong X, Navratilova P, Fredman D, Drivenes O, Becker TS, Lenhard B. 2010. Exonic remnants of whole-genome duplication reveal cis-regulatory function of coding exons. Nucleic Acids Res. 38:1071–1085.

Downing T, Imamura H, Decuypere S, Clark TG, Coombs GH, Cotton JA, Hilley JD, de Doncker S, Maes I, Mottram JC, et al. 2011. Whole genome sequencing of multiple Leishmania donovani clinical isolates provides insights into population structure and mechanisms of drug resistance. Genome Res. 21:2143–2156.

Drummond DA, Bloom JD, Adami C, Wilke CO, Arnold FH. 2005. Why highly expressed proteins evolve slowly. Proc Natl Acad Sci U S A. 102:14338–14343.

Eddy SR. 2011. Accelerated profile HMM searches. PLoS Comput Biol. 7:e1002195.

Edgar RC. 2004a. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5:113.

Edgar RC. 2004b. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32:1792–1797.

Eichenlaub MP, Ettwiller L. 2011. De novo genesis of enhancers in vertebrates. PLoS Biol. 9:e1001188.

Ellison CE, Hall C, Kowbel D, Welch J, Brem RB, Glass NL, Taylor JW. 2011. Population genomics and local adaptation in wild isolates of a model microbial eukaryote. Proc Natl Acad Sci U S A. 108:2831–2836.

Elyashiv E, Bullaughey K, Sattath S, Rinott Y, Przeworski M, Sella G. 2010. Shifts in the intensity of purifying selection: an analysis of genome-wide polymorphism data from two closely related yeast species. Genome Res. 20:1558–1573.

Engelthaler DM, Hicks ND, Gillece JD, Roe CC, Schupp JM, Driebe EM, Gilgado F, Carriconde F, Trilles L, Firacative C, et al. 2014. Cryptococcus gattii in North American Pacific Northwest: whole-population genome analysis provides insights into species evolution and dispersal. mBio 5:e01464-14.

Fawcett JA, Iida T, Takuno S, Sugino RP, Kado T, Kugou K, Mura S, Kobayashi T, Ohta K, Nakayama J, et al. 2014. Population genomics of the fission yeast Schizosaccharomyces pombe. PLoS One 9:e104241.

Fenchel T, Finlay BJ. 2004. The ubiquity of small species: patterns of local and global diversity. Bioscience 54:777–784.

Finlay BJ. 2002. Global dispersal of free-living microbial eukaryote species. Science 296:1061–1063.

Finlay BJ, Esteban GF, Fenchel T. 2004. Protist diversity is different? Protist 155:15–22.

Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J, et al. 2014. Pfam: the protein families database. Nucleic Acids Res. 42:D222–D230.

Foissac S, Gouzy J, Rombauts S, Mathe C, Amselem J, Sterck L, Van de Peer Y, Rouze P, Schiex T. 2008. Genome annotation in plants and fungi: EuGene as a model platform. Curr Bioinform. 3:87–97.

Foissner W. 1999. Protist diversity: estimates of the near-imponderable. Protist 150:363–368.

Foissner W. 2006. Biogeography and dispersal of micro-organisms: a review emphasizing protists. Acta Protozool. 45:111–136.

Foissner W. 2008. Protist diversity and distribution: some basic considerations. Biodivers Conserv. 17:235–242.

Foissner W. 2011. Dispersal of protists: the role of cysts and human introductions. In: Fontaneto D, editor. Biogeography of microscopic organisms: is everything small everywhere? Cambridge University Press. p. 61–87.

Foissner W, Chao A, Katz LA. 2008. Diversity and geographic distribution of ciliates (Protista: Ciliophora). Biodivers Conserv. 17:345–363.

Fournier-Level A, Korte A, Cooper MD, Nordborg M, Schmitt J, Wilczek AM. 2011. A map of local adaptation in Arabidopsis thaliana. Science 334:86–89.

Galtier N, Nabholz B, Glemin S, Hurst GDD. 2009. Mitochondrial DNA as a marker of molecular diversity: a reappraisal. Mol Ecol. 18:4541–4550.

Gout JF, Kahn D, Duret L, Paramecium Post-Genomics Consortium. 2010. The relationship among gene expression, the evolution of gene dosage, and the rate of protein evolution. PLoS Genet. 6:e1000944.

Gout JF, Lynch M. 2015. Maintenance and loss of duplicated genes by dosage subfunctionalization. Mol Biol Evol. 32:2141–2148.

Gruchy DF. 1955. The breeding system and distribution of Tetrahymena pyriformis. J Protozool. 2:178–185.

Gustincich S, Manfioletti G, Del Sal G, Schneider C, Carninci P. 1991. A fast method for high-quality genomic DNA extraction from whole human blood. BioTechniques 11:298–300, 302.

Hancock AM, Witonsky DB, Alkorta-Aranburu G, Beall CM, Gebremedhin A, Sukernik R, Utermann G, Pritchard JK, Coop G, Di Rienzo A. 2011. Adaptations to climate-mediated selective pressures in humans. PLoS Genet. 7:e1001375.

Hiwatashi K. 1968. Determination and inheritance of mating type in Paramecium caudatum. Genetics 58:373–386.

Hori M, Tomikawa I, Przybos E, Fujishima M. 2006. Comparison of the evolutionary distances among syngens and sibling species of Paramecium. Mol Phylogenet Evol. 38:697–704.

Huang X, Kurata N, Wei X, Wang ZX, Wang A, Zhao Q, Zhao Y, Liu K, Lu H, Li W, et al. 2012. A map of rice genome variation reveals the origin of cultivated rice. Nature 490:497–501.

Huang X, Zhao Q, Han B. 2015. Comparative population genomics reveals strong divergence and infrequent introgression between Asian and African rice. Mol Plant 8:958–960.

Hudson RR. 2002. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18:337–338.

Hudson RR, Turelli M. 2003. Stochasticity overrules the "three-times rule": genetic drift, genetic draft, and coalescence times for nuclear loci versus mitochondrial DNA. Evolution 57:182–190.

Jaillon O, Bouhouche K, Gout JF, Aury JM, Noel B, Saudemont B, Nowacki M, Serrano V, Porcel BM, Segurens B, et al. 2008. Translational control of intron splicing in eukaryotes. Nature 451:359–362.

James JE, Piganeau G, Eyre-Walker A. 2016. The rate of adaptive evolution in animal mitochondria. Mol Ecol. 25:67–78.

Joy DA, Feng X, Mu J, Furuya T, Chotivanich K, Krettli AU, Ho M, Wang A, White NJ, Suh E, et al. 2003. Early origin and recent expansion of Plasmodium falciparum. Science 300:318–321.

Kaplan NL, Hudson RR, Langley CH. 1989. The hitchhiking effect revisited. Genetics 123:887–899.

Katz LA, McManus GB, Snoeyenbos-West OLO, Griffin A, Pirog K, Costas B, Foissner W. 2005. Reframing the 'Everything is everywhere' debate: evidence for high gene flow and diversity in ciliate morphospecies. Aquat Microb Ecol. 41:55–65.

Katz LA, Snoeyenbos-West O, Doerder FP. 2006. Patterns of protein evolution in Tetrahymena thermophila: implications for estimates of effective population size. Mol Biol Evol. 23:608–614.

Krenek S, Berendonk TU, Fokin SI. 2015. New Paramecium (Ciliophora, Oligohymenophorea) congeners shape our view on its biodiversity. Org Divers Evol. 15:215–233.

Krenek S, Berendonk TU, Petzoldt T. 2011. Thermal performance curves of Paramecium caudatum: a model selection approach. Eur J Protistol. 47:124–137.

Krenek S, Petzoldt T, Berendonk TU. 2012. Coping with temperature at the warm edge – patterns of thermal adaptation in the microbial eukaryote Paramecium caudatum. PLoS One 7:e30598.

Kudla G, Murray AW, Tollervey D, Plotkin JB. 2009. Coding-sequence determinants of gene expression in Escherichia coli. Science 324:255–258.

Kvitek DJ, Will JL, Gasch AP. 2008. Variations in stress sensitivity and genomic expression in diverse S. cerevisiae isolates. PLoS Genet. 4:e1000223.

Lahr DJ, Laughinghouse HD 4th, Oliverio AM, Gao F, Katz LA. 2014. How discordant morphological and molecular evolution among microorganisms can revise our notions of biodiversity on Earth. Bioessays 36:950–959.

Langley CH, Stevens K, Cardeno C, Lee YC, Schrider DR, Pool JE, Langley SA, Suarez C, Corbett-Detig RB, Kolaczkowski B, et al. 2012. Genomic variation in natural populations of Drosophila melanogaster. Genetics 192:533–598.

Leffler EM, Bullaughey K, Matute DR, Meyer WK, Segurel L, Venkat A, Andolfatto P, Przeworski M. 2012. Revisiting an old riddle: what determines genetic diversity levels within species? PLoS Biol. 10:e1001388.

Li H. 2011. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27:2987–2993.

Li H, Durbin R. 2010. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26:589–595.

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25:2078–2079.

Lim LP, Burge CB. 2001. A computational analysis of sequence features involved in recognition of short introns. Proc Natl Acad Sci U S A. 98:11193–11198.

Liti G, Carter DM, Moses AM, Warringer J, Parts L, James SA, Davey RP, Roberts IN, Burt A, Koufopanou V, et al. 2009. Population genomics of domestic and wild yeasts. Nature 458:337–341.

Loveless MD, Hamrick JL. 1984. Ecological determinants of genetic-structure in plant-populations. Annu Rev Ecol Syst. 15:65–95.

Lynch M. 2007. The origins of genome architecture. Sunderland (MA): Sinauer Associates.

Lynch M. 2009. Estimation of allele frequencies from high-coverage genome-sequencing projects. Genetics 182:295–301.


Lynch M, Field MC, Goodson HV, Malik HS, Pereira-Leal JB, Roos DS, Turkewitz AP, Sazer S. 2014. Evolutionary cell biology: two origins, one objective. Proc Natl Acad Sci U S A. 111:16990–16994.

Lynch M, Sung W, Morris K, Coffey N, Landry CR, Dopman EB, Dickinson WJ, Okamoto K, Kulkarni S, Hartl DL, et al. 2008. A genome-wide view of the spectrum of spontaneous mutations in yeast. Proc Natl Acad Sci U S A. 105:9272–9277.

Maere S, Heymans K, Kuiper M. 2005. BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics 21:3448–3449.

Maguire B, Belk D. 1967. Paramecium transport by land snails. J Protozool. 14:445–447.

Maruki T, Lynch M. 2015. Genotype-frequency estimation from high-throughput sequencing data. Genetics 201:473–486.

McGrath CL, Gout JF, Doak TG, Yanagi A, Lynch M. 2014a. Insights into three whole-genome duplications gleaned from the Paramecium caudatum genome sequence. Genetics 197:1417–1428.

McGrath CL, Gout JF, Johri P, Doak TG, Lynch M. 2014b. Differential retention and divergent resolution of duplicate genes following whole-genome duplication. Genome Res. 24:1665–1675.

Miller JR, Delcher AL, Koren S, Venter E, Walenz BP, Brownley A, Johnson J, Li K, Mobarry C, Sutton G. 2008. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics 24:2818–2824.

Moradian MM, Beglaryan D, Skozylas JM, Kerikorian V. 2007. Complete mitochondrial genome sequence of three Tetrahymena species reveals mutation hot spots and accelerated nonsynonymous substitutions in Ymf genes. PLoS One 2:e650.

Nabholz B, Glemin S, Galtier N. 2009. The erratic mitochondrial clock: variations of mutation rate, not population size, affect mtDNA diversity across birds and mammals. BMC Evol Biol. 9:54.

Nabholz B, Mauffrey JF, Bazin E, Galtier N, Glemin S. 2008. Determination of mitochondrial genetic diversity in mammals. Genetics 178:351–361.

Neafsey DE, Galinsky K, Jiang RH, Young L, Sykes SM, Saif S, Gujja S, Goldberg JM, Young S, Zeng Q, et al. 2012. The malaria parasite Plasmodium vivax exhibits greater genetic diversity than Plasmodium falciparum. Nat Genet. 44:1046–1050.

Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, Indap A, King KS, Bergmann S, Nelson MR, et al. 2008. Genes mirror geography within Europe. Nature 456:98–101.

Ohta T. 1974. Mutational pressure as main cause of molecular evolution and polymorphism. Nature 252:351–354.

Ohta T. 1992. The nearly neutral theory of molecular evolution. Annu Rev Ecol Syst. 23:263–286.

Pawlowski J, Audic S, Adl S, Bass D, Belbahri L, Berney C, Bowser SS, Cepicka I, Decelle J, Dunthorn M, et al. 2012. CBOL protist working group: barcoding eukaryotic richness beyond the animal, plant, and fungal kingdoms. PLoS Biol. 10:e1001419.

Piganeau G, Eyre-Walker A. 2009. Evidence for variation in the effective population size of animal mitochondrial DNA. PLoS One 4:e4396.

Pool JE, Nielsen R. 2007. Population size changes reshape genomic patterns of diversity. Evolution 61:3001–3006.

Pophaly SD, Tellier A. 2015. Population level purifying selection and gene expression shape subgenome evolution in maize. Mol Biol Evol. 32:3226–3235.

Prescott DM. 1994. The DNA of ciliated protozoa. Microbiol Rev. 58:233–267.

Pritchard AE, Seilhamer JJ, Mahalingam R, Sable CL, Venuti SE, Cummings DJ. 1990. Nucleotide sequence of the mitochondrial genome of Paramecium. Nucleic Acids Res. 18:173–180.

Przybos E. 1975. Genetic studies of Paramecium jenningsi strains (Diller, Earl, 1958). Folia Biol (Krakow). 23:425–471.

Przybos E. 2005. Recent data on the occurrence of species of the Paramecium aurelia complex in Europe. Folia Biol-Krakow. 53:61–63.

Przybos E, Barth D, Berendonk TU. 2008. The Paramecium aurelia species complex, frequency and co-occurrence across Europe. Folia Biol (Krakow). 56:77–81.

Przybos E, Rautian M, Greczek-Stachura M, Potekhin A. 2007. Polymorphism within Paramecium sexaurelia (Ciliophora, Oligohymenophorea) and description of a new stand of the species in China. Folia Biol (Krakow). 55:121–125.

Przybos E, Surmacz M. 2010. New, world-wide data on the distribution of species of the Paramecium aurelia complex (Ciliophora, Protozoa). Folia Biol-Krakow. 58:185–188.

R-Core-Team 2014 R A language and environment for statisticalcomputing R Foundation for Statistical Computing ViennaAustria

Redden H Alper HS 2015 The development and characterization ofsynthetic minimal yeast promoters Nat Commun 67810

Ritter DI Dong Z Guo S Chuang JH 2012 Transcriptional enhancers inprotein-coding exons of vertebrate developmental genes PLoS One7e35202

Russell CB Fraga D Hinrichsen RD 1994 Extremely short 20-33 nucle-otide introns are the standard length in Paramecium tetraureliaNucleic Acids Res 221221ndash1225

Salim HMW, Ring KL, Cavalcanti ARO. 2008. Patterns of codon usage in two ciliates that reassign the genetic code: Tetrahymena thermophila and Paramecium tetraurelia. Protist 159:283–298.

Schacherer J, Shapiro JA, Ruderfer DM, Kruglyak L. 2009. Comprehensive polymorphism survey elucidates population structure of Saccharomyces cerevisiae. Nature 458:342–345.

Schattner P, Brooks AN, Lowe TM. 2005. The tRNAscan-SE, snoscan and snoGPS web servers for the detection of tRNAs and snoRNAs. Nucleic Acids Res 33:W686–W689.

Schrider DR, Kern AD. 2015. Inferring selective constraint from population genomic data suggests recent regulatory turnover in the human brain. Genome Biol Evol 7:3511–3528.

Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. 2003. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13:2498–2504.

Simonsen KL, Churchill GA, Aquadro CF. 1995. Properties of statistical tests of neutrality for DNA polymorphism data. Genetics 141:413–429.

Smith DR, Lee RW. 2008. Nucleotide diversity in the mitochondrial and nuclear compartments of Chlamydomonas reinhardtii: investigating the origins of genome architecture. BMC Evol Biol 8:156.

Smith JM, Haigh J. 1974. The hitch-hiking effect of a favourable gene. Genetics Res 89:391–403.

Snoke MS, Berendonk TU, Barth D, Lynch M. 2006. Large global effective population sizes in Paramecium. Mol Biol Evol 23:2474–2479.

Sonneborn TM. 1970. Methods in Paramecium research. In: Methods Cell Biol. New York and London: Academic Press. p. 241–339.

Sonneborn TM. 1975. Paramecium aurelia complex of fourteen sibling species. T Am Microsc Soc 94:155–178.

Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30:1312–1313.

Stoeck T, Welter H, Seitz-Bender D, Kusch J, Schmidt HJ. 2000. ARDRA and RAPD-fingerprinting reject the sibling species concept for the ciliate Paramecium caudatum (Ciliophora, Protoctista). Zool Scr 29:75–82.

Sun M. 1999. Cleistogamy in Scutellaria indica (Labiatae): effective mating system and population genetic structure. Mol Ecol 8:1285–1295.

Swart EC, Nowacki M, Shum J, Stiles H, Higgins BP, Doak TG, Schotanus K, Magrini VJ, Minx P, Mardis ER, et al. 2012. The Oxytricha trifallax mitochondrial genome. Genome Biol Evol 4:136–154.

Tajima F. 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123:585–595.

Tishkoff SA, Reed FA, Ranciaro A, Voight BF, Babbitt CC, Silverman JS, Powell K, Mortensen HM, Hirbo JB, Osman M, et al. 2007. Convergent adaptation of human lactase persistence in Africa and Europe. Nat Genet 39:31–40.

Tsukii Y, Hiwatashi K. 1983. Genes controlling mating-type specificity in Paramecium caudatum: three loci revealed by intersyngenic crosses. Genetics 104:41–62.


Tsukii Y, Hiwatashi K. 1985. Meiotic nondisjunction and aneuploids in intersyngenic hybrids of Paramecium caudatum. Genetics 111:779–794.

Tuller T, Zur H. 2015. Multiple roles of the coding sequence 5′ end in gene expression regulation. Nucleic Acids Res 43:13–28.

Voight BF, Kudaravalli S, Wen X, Pritchard JK. 2006. A map of recent positive selection in the human genome. PLoS Biol 4:e72.

Volkman SK, Sabeti PC, DeCaprio D, Neafsey DE, Schaffner SF, Milner DA Jr, Daily JP, Sarr O, Ndiaye D, Ndir O, et al. 2007. A genome-wide map of diversity in Plasmodium falciparum. Nat Genet 39:113–119.

Weir W, Capewell P, Foth B, Clucas C, Pountain A, Steketee P, Veitch N, Koffi M, De Meeus T, Kabore J, et al. 2016. Population genomics reveals the origin and asexual evolution of human infective trypanosomes. eLife 5:e11473.

Weisse T. 2008. Distribution and diversity of aquatic protists: an evolutionary and ecological perspective. Biodivers Conserv 17:243–259.

Wickham H. 2010. ggplot2: Elegant graphics for data analysis. New York: Springer-Verlag.

Zhang JZ, Yang JR. 2015. Determinants of the rate of protein sequence evolution. Nat Rev Genet 16:409–420.


One strain from each species was chosen as a reference genome (the genome with the longest assembly), and reads from other individuals were mapped to the reference to call SNPs. During the original DNA preparation, efforts were made to remove non-MAC DNA, resulting in varying amounts of mitochondrial DNA present in the extracts from each isolate. Hence, some individuals mapped with extremely low coverage (4 to zero read depth) to the reference genome, especially after applying stringent filters to call SNPs (supplementary table S9, Supplementary Material online). Thus, some individuals were excluded from this analysis, including all P. biaurelia isolates. The subsequent analysis was therefore restricted to four species.

All of the mitochondrial genomes are about 40 kb in length. We identified 626 SNPs in P. tetraurelia, 5307 in P. sexaurelia, 4771 [3041] in P. caudatum [subset], and 4600 [1478] in P. multimicronucleatum [subset]. Within all species, very few indels were found, with just 1 in P. tetraurelia, 9 in P. sexaurelia, and 15 [13] and 23 [9] in P. caudatum [subset] and P. multimicronucleatum [subset], respectively. The nucleotide diversity at 4-fold degenerate sites was found to be 0.009 in P. tetraurelia, 0.169 in P. sexaurelia, 0.127 [0.093] in P. caudatum [subset], and 0.172 [0.073] in P. multimicronucleatum [subset]. Nucleotide diversity across different sites shows that rRNAs are the most conserved sequences in the ciliate mitochondrial genomes. Transfer RNAs are the next most highly conserved, with their diversity levels only slightly above those of replacement sites (table 2). Our estimates of diversity in the Paramecium mitochondria are higher than those of a number of other unicellular eukaryotes studied so far: P. falciparum (π_syn = 0.0012) (Joy et al. 2003) and Chlamydomonas reinhardtii (π_silent = 0.0085) (Smith and Lee 2008).

We also recalculated nuclear nucleotide diversity (table 2) at 4-fold degenerate sites for the same set of individuals used in the mitochondrial analysis and observed that the mitochondrial to nuclear diversity ratio is close to 1.0 in all species (table 2) except P. sexaurelia, where the mitochondrial diversity is 5.7-fold higher than that in the nucleus. On excluding P. sexaurelia, we obtain a strong correlation between 4-fold degenerate nucleotide diversity in the mitochondria and nucleus (slope = 0.95, R² = 0.96). It should be noted that there are not enough data points to make any inferences about statistical significance. However, such a positive correlation between the nuclear and mitochondrial diversity suggests that mitochondrial diversity might scale with population size in ciliates, bringing to light previous studies in animals that have yielded conflicting findings about whether mitochondrial diversity correlates with population size (Bazin et al. 2006; Nabholz et al. 2008; Nabholz et al. 2009; Piganeau and Eyre-Walker 2009). The observed correlation also suggests that mitochondrial and nuclear spontaneous mutation rates may be correlated. Direct estimation of mutation rates in metazoans, C. elegans, D. melanogaster, S. cerevisiae, and H. sapiens, obtained via mutation accumulation experiments, suggests a moderate correlation (slope = 0.05, R² = 0.48) between mitochondrial and nuclear base substitution rates (using data from Lynch et al. 2008). Discordance between the nuclear and mitochondrial diversity, for instance in

Table 2. Nucleotide Diversity (per site) of Different Sites across the Mitochondrial Genomes.

| Species | 4-fold (πm) [SE] | 2-fold [SE] | 0-fold [SE] | tRNA [SE] | rRNA [SE] | intergenic [SE] | 4-fold nuclear (πn) [SE] | πm/πn |
| P. tetraurelia | 0.0091 [0.0010] | 0.0107 [0.0009] | 0.0019 [0.0002] | 0.0000 [0.0000] | 0.0006 [0.0002] | 0.0041 [0.0003] | 0.0058 [2.26 × 10^-5] | 1.57 |
| P. sexaurelia | 0.1692 [0.0033] | 0.1427 [0.0024] | 0.0097 [0.00043] | 0.0000 [0.0000] | 0.0059 [0.0007] | 0.0417 [0.0027] | 0.02945 [4.73 × 10^-5] | 5.75 |
| P. caudatum | 0.1272 [0.0037] | 0.0942 [0.0019] | 0.0179 [0.0006] | 0.0455 [0.0023] | 0.0136 [0.0012] | 0.0333 [0.0027] | 0.1267 [1.08 × 10^-4] | 1.00 |
| P. caudatum (subset) | 0.0929 [0.0039] | 0.0859 [0.0021] | 0.0149 [0.0006] | 0.0375 [0.0023] | 0.0093 [0.0011] | 0.0292 [0.0028] | 0.0658 [9.63 × 10^-5] | 1.41 |
| P. multimicronucleatum | 0.1727 [0.0047] | 0.0831 [0.0019] | 0.0227 [0.0008] | 0.0263 [0.0011] | 0.0269 [0.0016] | 0.0329 [0.0022] | 0.1599 [1.34 × 10^-4] | 1.08 |
| P. multimicronucleatum (subset) | 0.0730 [0.0034] | 0.0383 [0.0015] | 0.0121 [0.0006] | 0.0126 [0.0009] | 0.0135 [0.0012] | 0.0312 [0.0022] | 0.0742 [9.56 × 10^-5] | 0.98 |


P. sexaurelia, can be attributed to various factors such as positive selection (Hudson and Turelli 2003; Galtier et al. 2009; James et al. 2016), demographic events (Pool and Nielsen 2007), or simply differences in nuclear and/or mitochondrial mutation rates (Lynch 2007). Future studies are needed to dissect the factors responsible for the remarkably different ratio of mitochondrial to nuclear diversity in P. sexaurelia.

Conclusions

This is the first study where multiple individuals of several species of free-living unicellular eukaryotes have been sequenced and used to reveal genome-wide patterns of variation. This study thus provides an invaluable resource for future research in Paramecium, a major model organism in cell biology. Sequencing of multiple individuals from the same lake or pond is needed to further resolve the global versus local population structure of these cosmopolitan unicellular species. Overall, our results suggest that most of the intergenic regions across the highly compact genomes of Paramecium species are strongly constrained and probably harbor functionally important regulatory modules, which might extend into coding regions. We show that patterns of nucleotide diversity across the genome are driven primarily by a combination of purifying selection and mutation rate variation. In future, such analyses will benefit greatly from incorporating variation in recombination rates across the genomic landscape. We also find that duplicates created by the WGD appear to be evolving under similar constraints. Synonymous to non-synonymous diversity ratios need to be supplemented with divergence levels to distinguish between the various types of selective forces acting on the duplicates. Thus, sequencing of much closer outgroup species would allow us to further understand the role of selection in the evolution of different genomic elements, including gene duplicates that resulted from ancient WGDs.

Methods

Sampling, DNA Extraction, and Sequencing

We examined a total of 56 Paramecium strains originally isolated from environmental sources, including 11 isolates of P. biaurelia, 13 isolates of P. sexaurelia, 12 isolates of P. tetraurelia, 10 isolates of P. caudatum, and 10 isolates of P. multimicronucleatum. A total of 14 strains from P. biaurelia and P. sexaurelia were kindly provided by E. Przybos, whereas the remaining strains were cultured in the Paramecium collections of the Institute of Hydrobiology, TU Dresden, Germany, or the Department of Biology, Indiana University, Bloomington, USA. Prior to our analyses, P. tetraurelia strain Moz 13CXI as well as P. sexaurelia strains Moz 13BIII and Indo17I were kindly identified by E. Przybos by performing inter- and intrastrain crosses (Przybos 1975). Paramecia were maintained in 0.25% Cerophyl medium (modified according to Sonneborn 1970; Krenek et al. 2011) inoculated with Enterobacter aerogenes. Since P. aurelia species can undergo autogamy (self-fertilization) induced by starvation, and all strains have been cultured under lab conditions for several years, all P. aurelia strains in this study are expected to be more homozygous than wild isolates.

Prior to DNA extraction, all cultures were initiated from a single cell, ensuring clonality. A modified sucrose density separation (Aury et al. 2006) was performed to extract primarily macronuclei (MACs) and to reduce the amount of non-MAC DNA potentially compromising subsequent analyses. Briefly, Paramecium cells were starved for at least two days and filtered through several layers of sterile cheesecloth prior to cell harvesting via gentle centrifugation. Pelleted cells were gently lysed in 0.25 M sucrose medium (pH 6.8) containing 10 mM Tris-HCl, 8 mM MgCl2, 3 mM CaCl2, 0.5–1% Nonidet-P40, and 0.1–0.2% sodium deoxycholate. Macronuclei were subsequently isolated away from cellular debris, micronuclei, and mitochondria by centrifugation at 100 rcf in 0.25 M sucrose wash buffer (see above, without Nonidet-P40 and Na-deoxycholate). Genomic DNA extraction from MAC isolations was performed by a modified CTAB procedure (Gustincich et al. 1991).

All genomes were sequenced using the Illumina HiSeq 2000, and paired-end reads with insert sizes between 200 and 400 bp and a mean length of 151 bp were obtained. The reads mapped with coverage ranging between 40× and 80× (details given in supplementary table S1, Supplementary Material online).

Paramecium Reference Genomes

Complete genomes of P. tetraurelia, P. biaurelia, P. sexaurelia, and P. caudatum were downloaded from ParameciumDB (Arnaiz et al. 2007; Arnaiz and Sperling 2011), along with the annotations. Ortho-paralog relationships as well as functional annotations were retrieved from the supplementary information provided in McGrath et al. (2014a, 2014b). RNAseq data for all four species were downloaded and processed as in Gout and Lynch (2015).

The genome of P. multimicronucleatum was assembled using Celera (Miller et al. 2008) and annotated using Eugene (Foissac et al. 2008), as previously described (McGrath et al. 2014a, 2014b). A detailed description of the sequencing, assembly, and annotation of P. multimicronucleatum can be found in Marinov et al. (in preparation). OMA (Orthology MAtrix) (Altenhoff et al. 2015) was used to find orthologs between P. caudatum and P. multimicronucleatum. A total of 11002 genes of P. caudatum were found to have 1-1 orthologs in P. multimicronucleatum, and only 1-1 orthologs were used for the current analysis.

Mapping of Reads and Detection of SNPs and Indels

Sequenced paired-end reads were mapped to the reference genomes for each species using BWA (bwa mem) (Li and Durbin 2010) with default parameters, and the mpileup function in Samtools (Li et al. 2009) was used to generate mpileup files for each individual. Two distinct pipelines were employed to identify SNPs to ensure detection accuracy: Bcftools-Vcftools (Li 2011) as well as the maximum likelihood Genotype Frequency Estimator (referred to as GFE) (Lynch 2009; Maruki and Lynch 2015). Both pipelines produced highly overlapping sets of SNPs (supplementary table S2,


Supplementary Material online). For both pipelines, sites were only considered for SNP detection and further analysis if the mapping quality was above 30, base quality was above 20, per-base alignment quality was above 15, and the sum of the depth of coverage for all individuals was at least 5 times the number of individuals and less than twice the average population coverage. SNPs were called by bcftools if the SNP quality score was above 30 and by GFE if the significance score > 5.91 (P > 0.05). The exact parameters of the Bcftools-Vcftools pipeline to call SNPs and indels are given below.

samtools mpileup -q 30 -Q 20 -t DP -uvf reference_genome.fasta sorted.bam > species.vcf

bcftools call -f GQ -o species_snps.vcf -O v -V indels -cv species.vcf

bcftools call -f GQ -o species_indels.vcf -O v -V snps -cv species.vcf

# filtering SNP and indel files for depth and biases
perl vcfutils.pl varFilter -d 50 -D 800 -1 -2 -3 -4 species_snps.vcf > species_snps.filtered1.vcf

perl vcfutils.pl varFilter -d 50 -D 800 -1 -2 -3 -4 species_indels.vcf > species_indels.filtered1.vcf

vcftools --vcf species_snps.filtered1.vcf --minQ 20 --minGQ 30 --minDP 4 --recode --recode-INFO-all --out species_snps.filtered2

vcftools --vcf species_indels.filtered1.vcf --minQ 20 --minGQ 30 --minDP 4 --recode --recode-INFO-all --out species_indels.filtered2

# filter out the recoded sites that are not polymorphic anymore, that is, AF1 = 1
bcftools view -q 0.001 species_snps.filtered2.recode.vcf -o species_snps.final.vcf

bcftools view -q 0.001 species_indels.filtered2.recode.vcf -o species_indels.final.vcf

Mitochondrial SNPs were called using the Bcftools-Vcftools pipeline with the same parameters as noted above.

Simulation of SNP Detection with Varying Levels of Divergence from the Reference

We simulated reads using wgsim (https://github.com/lh3/wgsim) for a diploid genome with varying levels of divergence from the reference genome of P. caudatum and P. multimicronucleatum, using average values of read length, total number of reads, and insert sizes.

For the P. caudatum reference genome:
wgsim -d 287 -s 113 -N 10201087 -1 151 -2 151 -r 0.10 -R 0.013 caudatum_assembly.fasta caud_010_1.fastq caud_010_2.fastq >& caud_010.info

For the P. multimicronucleatum reference genome:
wgsim -d 261 -s 102 -N 12154044 -1 151 -2 151 -r 0.10 -R 0.011 multimicronucleatum_assembly.fasta multi_010_1.fastq multi_010_2.fastq >& multi_010.info

Reads were simulated for r = 0.0, 0.05, 0.10, 0.15, 0.20, 0.25, and 0.30, where r represents the fraction of the genome with SNPs. The SNP detection pipeline described above was used to call SNPs, which were compared to the simulated set of SNPs (supplementary table S4, Supplementary Material online).
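The comparison between called and simulated SNPs can be illustrated with a minimal sketch. The source does not state which summary statistics were reported, so the recall and precision measures below, the function name `compare_calls`, and the `(scaffold, position)` site representation are assumptions made for illustration only.

```python
import math


def compare_calls(simulated, called):
    """Compare called SNP sites against the simulated truth set.

    Both arguments are iterables of hashable site identifiers,
    here assumed to be (scaffold, position) tuples.
    """
    simulated, called = set(simulated), set(called)
    true_pos = len(simulated & called)
    return {
        "true_positives": true_pos,
        "false_negatives": len(simulated - called),  # simulated but missed
        "false_positives": len(called - simulated),  # called but never simulated
        "recall": true_pos / len(simulated) if simulated else math.nan,
        "precision": true_pos / len(called) if called else math.nan,
    }


if __name__ == "__main__":
    truth = {("scaffold_1", 10), ("scaffold_1", 20), ("scaffold_2", 5)}
    calls = {("scaffold_1", 10), ("scaffold_2", 5), ("scaffold_2", 9)}
    print(compare_calls(truth, calls))
```

Running the same comparison for each simulated value of r would show how detection degrades as divergence from the reference increases.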

Building Phylogenies

A consensus genome (bcftools view species_filtered.recode.vcf | vcfutils.pl vcf2fq > species.cns.fq) was obtained for each individual using bcftools; the consensus represented heterozygous sites using DNA ambiguity codes. Gene nucleotide sequences that are present in single copy in the P. aurelia species and are also present in both the outgroup species (P. caudatum and P. multimicronucleatum) were extracted, and multiple sequence alignments were generated using MUSCLE (Edgar 2004a, 2004b). This amounted to about 55 sequences (including the reference genomes) each in 883 sets of orthologous genes. RAxML 8.0.26 (Stamatakis 2014) was used to build a tree using the concatenation of all 883 aligned gene sequences (raxml -s concatenated_sequences.fasta -n gene -m GTRGAMMA -T 4 -p 7). In addition, individual gene trees (raxml -s gene.fasta -n gene -m GTRGAMMA -T 4 -p 7) were built, which were then used to create a consensus tree (raxml -J MR -z trees.newick -T 4 -m GTRGAMMA) (Aberer et al. 2010).

Mitochondrial phylogeny was built using the same parameters as above by concatenating eight sets of orthologous genes (NADH_dehydrogenase_subunit_1_a, NADH_dehydrogenase_subunit_5, Ymf56orf78, Ymf67orf265_1, Ymf68orf393, apocytochrome_b, cytochrome_c_oxidase_subunit_1, cytochrome_c_oxidase_subunit_2, ribosomal_protein_S14) that were ubiquitously present across all sampled species.

Principal Component Analysis

Genotypes for all individuals were obtained from the vcf files generated by vcftools for all scaffolds. Only the sites that had a genotype available for all individuals were used in the analysis, i.e., if even a single individual had missing data, the site was excluded from analysis. A total of 10000 markers were used for each species. The first marker of each scaffold was picked randomly within the first 50 SNP positions, and the rest were selected so that all markers were uniformly distributed across all scaffolds. PCA was carried out on these markers by using the R (R-Core-Team 2014) command prcomp (https://stat.ethz.ch/R-manual/R-devel/library/stats/html/prcomp.html) and visualized using the first two components with the R graphing package ggplot2 (Wickham 2010).
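One way to read the marker-selection step is sketched below for a single scaffold: a random starting SNP among the first 50, followed by evenly spaced SNPs. This is a hypothetical interpretation, not the authors' script; the function name `select_markers`, the per-scaffold marker budget, and the use of SNP rank (rather than physical position) for spacing are all assumptions.

```python
import random


def select_markers(n_snps, n_markers, max_offset=50, rng=None):
    """Return indices of SNPs chosen as PCA markers on one scaffold.

    n_snps     -- number of fully genotyped SNPs on the scaffold
    n_markers  -- markers wanted from this scaffold (assumed budget)
    max_offset -- the first marker is drawn from the first `max_offset` SNPs
    """
    if n_snps <= 0 or n_markers <= 0:
        return []
    rng = rng or random.Random()
    start = rng.randrange(min(max_offset, n_snps))
    # Even spacing (in SNP rank) across the remainder of the scaffold.
    step = max(1, (n_snps - start) // n_markers)
    return list(range(start, n_snps, step))[:n_markers]


if __name__ == "__main__":
    picks = select_markers(n_snps=1000, n_markers=10, rng=random.Random(1))
    print(picks)
```

The selected genotype columns could then be passed to any PCA routine; the source used prcomp in R.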

Calculation of Population Genetic Statistics

Nucleotide Diversity and Tajima's D

Nucleotide diversity was calculated as 2pqn/(n − 1), where p and q are the major and minor allele frequencies and n is the number of chromosomes. The major and minor allele frequencies were given for each site by GFE. Diversity at coding sites was partitioned into synonymous and non-synonymous components (e.g., 2-fold degenerate sites are one-third synonymous and two-thirds non-synonymous). Major allele frequencies were also extracted from vcftools and returned very similar results to those from GFE (supplementary table S3, Supplementary Material online).
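The per-site diversity estimator described above can be written out as a short sketch. The function names and the assumption that monomorphic sites enter the average with a major allele frequency of 1.0 are illustrative choices, not details given in the source.

```python
def nucleotide_diversity(major_freq, n_chromosomes):
    """Per-site diversity 2pq * n / (n - 1) for a biallelic site."""
    if n_chromosomes < 2:
        raise ValueError("need at least two chromosomes")
    if not 0.0 <= major_freq <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    p = major_freq
    q = 1.0 - p
    return 2.0 * p * q * n_chromosomes / (n_chromosomes - 1)


def mean_diversity(major_freqs, n_chromosomes):
    """Average per-site diversity over a class of sites (e.g., 4-fold sites)."""
    freqs = list(major_freqs)
    if not freqs:
        raise ValueError("no sites supplied")
    return sum(nucleotide_diversity(p, n_chromosomes) for p in freqs) / len(freqs)


if __name__ == "__main__":
    # Ten diploid isolates give n = 20 chromosomes.
    print(nucleotide_diversity(0.9, 20))
    print(mean_diversity([1.0, 1.0, 0.9, 0.5], 20))
```

The factor n/(n − 1) corrects the expected heterozygosity 2pq for finite sample size, which matters with only 10 to 13 isolates per species.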


A sliding window approach was used to calculate Tajima's D across the genome. Vcftools was used to obtain these values for non-overlapping windows of 1 kb in length. The program ms (Hudson 2002) was used to perform 100000 simulations of 10 individuals with 1 kb locus length, and niceStats (https://github.com/kern-lab/shIC/blob/master/niceStats.c) was used to calculate Tajima's D from the ms output file. To obtain Tajima's D for different categories of sites, SNPs belonging to a particular category of sites (specified in Results) were selected, and vcftools was used to calculate Tajima's D for non-overlapping windows of 10 kb. Windows with 2 or more SNPs were used to calculate the mean, and only alternate windows were used to compare distributions of Tajima's D across different categories by Student's t-test to maximise independence of sampling.
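For readers unfamiliar with the statistic, the following sketch computes Tajima's D for a single window using the standard formula of Tajima (1989). It is not the vcftools or niceStats implementation; the function name and the input format (one minor-allele count per segregating site) are assumptions for illustration.

```python
import math


def tajimas_d(minor_allele_counts, n):
    """Tajima's D for one window.

    minor_allele_counts -- minor-allele count at each SNP in the window
    n                   -- number of sampled chromosomes
    Returns NaN when the statistic is undefined (no SNPs, or n < 4).
    """
    counts = [k for k in minor_allele_counts if 0 < k < n]
    s = len(counts)  # number of segregating sites
    if n < 4 or s == 0:
        return math.nan
    # Mean number of pairwise differences, summed over sites.
    pi = sum(2.0 * k * (n - k) / (n * (n - 1)) for k in counts)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * s + e2 * s * (s - 1)
    if variance <= 0.0:
        return math.nan
    return (pi - s / a1) / math.sqrt(variance)


if __name__ == "__main__":
    print(tajimas_d([1, 1, 1, 1, 1], n=10))  # excess of rare alleles
    print(tajimas_d([5, 5, 5, 5, 5], n=10))  # excess of intermediate alleles
```

An excess of rare variants pushes D below zero, while an excess of intermediate-frequency variants pushes it above zero, which is why the distribution of D across site categories is informative about selection and demography.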

Plotting Average Expected Heterozygosity across Introns and Intergenic Regions

All introns (< 500 bp in length) were split midway and aligned according to the 5′ and 3′ splice sites. All intergenic regions (< 500 bp in the outgroup species and < 1000 bp in the P. aurelia species) were split midway and aligned according to the start and stop codons. Note that these were not sequence alignments. At each site, the mean expected heterozygosity was calculated. Sites with an insufficient number of data points (< 200 data points per site for introns; for intergenic regions, < 100 data points per site for the outgroups and < 1000 data points per site in the P. aurelia species) were discarded.
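The split-and-anchor averaging described above can be sketched as follows. The data layout (one list of per-site heterozygosity values per region, oriented 5′ to 3′), the function name, and the choice to assign the middle site of odd-length regions to the 3′ half are assumptions, not details from the source.

```python
from collections import defaultdict


def positional_means(regions, min_points):
    """Average per-site values by distance from each end of a region.

    regions    -- iterable of lists of per-site expected heterozygosity
    min_points -- positions supported by fewer regions are discarded
    Returns two dicts keyed by distance from the 5' end and from the 3' end.
    """
    from_5prime = defaultdict(list)
    from_3prime = defaultdict(list)
    for values in regions:
        half = len(values) // 2  # split each region midway
        for offset, h in enumerate(values[:half]):
            from_5prime[offset].append(h)
        for offset, h in enumerate(reversed(values[half:])):
            from_3prime[offset].append(h)

    def summarize(bins):
        return {
            offset: sum(vals) / len(vals)
            for offset, vals in sorted(bins.items())
            if len(vals) >= min_points
        }

    return summarize(from_5prime), summarize(from_3prime)


if __name__ == "__main__":
    introns = [[0.1, 0.2, 0.3, 0.4], [0.3, 0.4]]
    print(positional_means(introns, min_points=2))
```

Because regions differ in length, positions far from either anchor are supported by fewer regions, which is why a minimum number of data points per position is enforced.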

Calculating and Plotting Divergence across Introns and Intergenic Regions

Orthologous coding sequences along with introns were aligned using MUSCLE (Edgar 2004a). Intergenic regions flanked by orthologous genes on both sides were also aligned along with the flanking coding exon using MUSCLE. For each species separately, divergence for different positions (with respect to the start and stop sites) was averaged across all introns and intergenic regions. The divergence values were corrected for multiple substitutions by the Jukes-Cantor sequence substitution model and plotted in the same manner as described above.
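The Jukes-Cantor correction converts an observed proportion of differing sites p into an estimated number of substitutions per site, d = −(3/4) ln(1 − 4p/3). A minimal sketch follows; the function name and the handling of saturated values (returning infinity when p ≥ 0.75) are illustrative choices.

```python
import math


def jukes_cantor(p_observed):
    """Jukes-Cantor corrected divergence from an observed proportion of differences."""
    if not 0.0 <= p_observed <= 1.0:
        raise ValueError("proportion of differences must lie in [0, 1]")
    if p_observed >= 0.75:
        return math.inf  # saturated: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p_observed / 3.0)


if __name__ == "__main__":
    for p in (0.0, 0.05, 0.10, 0.30):
        print(p, round(jukes_cantor(p), 4))
```

The correction is small at low divergence and grows rapidly as p approaches 0.75, so it matters most for the comparisons between deeply diverged species.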

GO Analysis

All gene ontology enrichment analyses were performed using the BiNGO plugin for Cytoscape 3.2.1 (Shannon et al. 2003; Maere et al. 2005), using the hypergeometric test to assess significance with Benjamini and Hochberg FDR correction for multiple testing (Benjamini and Hochberg 1995). The latest ontology files (.obo) were downloaded from www.geneontology.org, and custom annotation files were prepared as required using python scripts.

Identification of Heterozygous Regions in the P. aurelia Species

The fraction of heterozygous sites was calculated for 10 bp non-overlapping windows across the genome for all three P. aurelia species, and the top 1% of such regions, those with the highest fraction of heterozygous sites, were identified as heterozygous regions.
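The window-ranking step can be sketched as below for one scaffold. The function name, the 0-based coordinates, and the tie-breaking rule (earlier windows first) are assumptions; the window size is left as a parameter rather than fixed.

```python
import math


def heterozygous_windows(het_positions, scaffold_length, window_size, top_fraction=0.01):
    """Rank non-overlapping windows by their fraction of heterozygous sites.

    het_positions -- 0-based positions of heterozygous sites on the scaffold
    Returns (start, end, fraction) for the top `top_fraction` of windows.
    """
    n_windows = math.ceil(scaffold_length / window_size)
    counts = [0] * n_windows
    for pos in het_positions:
        counts[pos // window_size] += 1
    windows = []
    for i, count in enumerate(counts):
        start = i * window_size
        end = min(start + window_size, scaffold_length)  # last window may be short
        windows.append((start, end, count / (end - start)))
    n_top = max(1, int(round(top_fraction * n_windows)))
    # sorted() is stable, so ties keep their left-to-right order.
    ranked = sorted(windows, key=lambda w: w[2], reverse=True)
    return ranked[:n_top]


if __name__ == "__main__":
    print(heterozygous_windows([0, 1, 2, 55], scaffold_length=1000, window_size=10))
```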

Mitochondrial Genome Assembly and Annotation

Mitochondrial genomes were assembled as follows. Sequence reads were assembled using SPAdes (version 3.5.0) (Bankevich et al. 2012) after removing potential adapter sequence with Trimmomatic (version 0.33) (Bolger et al. 2014). Mitochondrial contigs were identified from the resulting assemblies by BLAST searches against the published P. caudatum and P. tetraurelia mitochondrial genomes (Pritchard et al. 1990; Barth and Berendonk 2011).

Protein coding genes were identified by generating all ORFs longer than 60 aa in all six frames using the Mold, Protozoan, and Coelenterate Mitochondrial Code (i.e., UGA codes for W instead of being a stop codon) and all alternative start codons specific to Paramecium (AUU, AUA, AUG, AUC, GUG, and GUA), and retaining the longest ORFs associated with each stop codon. BLASTP (Altschul et al. 1997) was then used to identify homologs of annotated mitochondrial proteins in P. tetraurelia and P. caudatum. Additional ORFs were identified by imposing the requirement that their length exceeds 100 aa and were subsequently annotated using BLASTP against the Non-redundant protein sequences (nr) database and HMMER 3.0 (Eddy 2011) scans against the PFAM 27.0 database (Finn et al. 2014).
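A minimal sketch of the six-frame ORF scan is given below, assuming DNA input (T in place of U), UAA and UAG as the only stop codons, and the six start codons listed above. It is an illustration of the idea rather than the authors' pipeline; the function names and the reported coordinate convention (0-based, on the scanned strand) are hypothetical.

```python
STARTS = {"ATT", "ATA", "ATG", "ATC", "GTG", "GTA"}
STOPS = {"TAA", "TAG"}  # TGA is read as Trp under this mitochondrial code
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def longest_orfs(seq, min_codons=60):
    """Longest ORF per stop codon in all six frames.

    Returns tuples (strand, frame, start, end, n_codons); `end` includes the
    stop codon and `n_codons` excludes it. Only ORFs with more than
    `min_codons` codons are kept.
    """
    seq = seq.upper()
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            first_start = None  # earliest start since the previous stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if codon in STOPS:
                    if first_start is not None:
                        n_codons = (i - first_start) // 3
                        if n_codons > min_codons:
                            found.append((strand, frame, first_start, i + 3, n_codons))
                    first_start = None
                elif first_start is None and codon in STARTS:
                    first_start = i
    return found


if __name__ == "__main__":
    toy = "ATG" + "GCT" * 30 + "TGA" + "GCT" * 31 + "TAG"
    print(longest_orfs(toy))
```

Keeping only the earliest in-frame start after the previous stop yields the longest ORF for each stop codon; candidate ORFs would then be screened by homology searches as described above.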

tRNA genes were annotated with tRNAscan-SE 1.21 (Schattner et al. 2005) using the “Mito/Chloroplast” source.

Supplementary Material

Supplementary data are available at Molecular Biology and Evolution online.

Acknowledgments

We thank Lydia Bright and Jean-Francois Gout for helpful comments on the manuscript, Matthew S. Ackerman for insightful discussions related to the project, and the anonymous reviewers and editor for their valuable suggestions. We would also like to thank Kelley Thomas for providing help with the sequencing and E. Przybos for kindly providing several Paramecium strains and the species identification of some so far unspecified strains. This work was financially supported by the National Science Foundation (MCB-1050161 and DEB-1257806), the European Commission FP7-PEOPLE-2009-IRSES project CINAR PATHOBACTER (project number 247658), and the CNRS research network GDRE “Paramecium Genome Dynamics and Evolution”. T.U.B. also acknowledges funding of the “support the best” of the ZUK TU Dresden and the Deutsche Forschungsgemeinschaft (DFG).

References

Abbott RJ, Gomes MF. 1989. Population genetic-structure and outcrossing rate of Arabidopsis thaliana (L.) Heynh. Heredity 62:411–418.

Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, McVean GA. 2010. A map of human genome variation from population-scale sequencing. Nature 467:1061–1073.


Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA. 2012. An integrated map of genetic variation from 1092 human genomes. Nature 491:56–65.

Aberer AJ, Pattengale ND, Stamatakis A. 2010. Parallelized phylogenetic post-analysis on multi-core architectures. J Comput Sci-Neth 1:107–114.

Ackermann AA, Panunzi LG, Cosentino RO, Sanchez DO, Aguero F. 2012. A genomic scale map of genetic diversity in Trypanosoma cruzi. BMC Genomics 13:736.

Altenhoff AM, Skunca N, Glover N, Train CM, Sueki A, Pilizota I, Gori K, Tomiczek B, Muller S, Redestig H, et al. 2015. The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Res 43:D240–D249.

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402.

Arnaiz O, Cain S, Cohen J, Sperling L. 2007. ParameciumDB: a community resource that integrates the Paramecium tetraurelia genome sequence with genetic data. Nucleic Acids Res 35:D439–D444.

Arnaiz O, Mathy N, Baudry C, Malinsky S, Aury JM, Wilkes CD, Garnier O, Labadie K, Lauderdale BE, Le Mouel A, et al. 2012. The Paramecium germline genome provides a niche for intragenic parasitic DNA: evolutionary dynamics of internal eliminated sequences. PLoS Genet 8:e1002984.

Arnaiz O, Sperling L. 2011. ParameciumDB in 2011: new tools and new data for functional and comparative genomics of the model ciliate Paramecium tetraurelia. Nucleic Acids Res 39:D632–D636.

Assefa S, Lim C, Preston MD, Duffy CW, Nair MB, Adroub SA, Kadir KA, Goldberg JM, Neafsey DE, Divis P, et al. 2015. Population genomic structure and adaptation in the zoonotic malaria parasite Plasmodium knowlesi. Proc Natl Acad Sci U S A 112:13027–13032.

Aury JM, Jaillon O, Duret L, Noel B, Jubin C, Porcel BM, Segurens B, Daubin V, Anthouard V, Aiach N, et al. 2006. Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature 444:171–178.

Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR. 2015. A global reference for human genetic variation. Nature 526:68–74.

Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, et al. 2012. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol 19:455–477.

Barth D, Berendonk TU. 2011. The mitochondrial genome sequence of the ciliate Paramecium caudatum reveals a shift in nucleotide composition and codon usage within the genus Paramecium. BMC Genomics 12:272.

Barth D, Krenek S, Fokin SI, Berendonk TU. 2006. Intraspecific genetic variation in Paramecium revealed by mitochondrial cytochrome C oxidase I sequences. J Eukaryot Microbiol 53:20–25.

Bazin E, Glemin S, Galtier N. 2006. Population size does not influence mitochondrial genetic diversity in animals. Science 312:570–572.

Beale G, Preer JR. 2008. Paramecium: genetics and epigenetics. Boca Raton (FL): CRC Press.

Begun DJ, Aquadro CF. 1992. Levels of naturally-occurring DNA polymorphism correlate with recombination rates in Drosophila melanogaster. Nature 356:519–520.

Begun DJ, Holloway AK, Stevens K, Hillier LW, Poh YP, Hahn MW, Nista PM, Jones CD, Kern AD, Dewey CN, et al. 2007. Population genomics: whole-genome analysis of polymorphism and divergence in Drosophila simulans. PLoS Biol 5:e310.

Benjamini Y, Hochberg Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B Met 57:289–300.

Betermier M. 2004. Large-scale genome remodelling by the developmentally programmed elimination of germ line sequences in the ciliate Paramecium. Res Microbiol 155:399–408.

Birky CW Walsh JB 1988 Effects of linkage on rates of molecular evo-lution Proc Natl Acad Sci U S A 856414ndash6418

Birnbaum RY Clowney EJ Agamy O Kim MJ Zhao J Yamanaka TPappalardo Z Clarke SL Wenger AM Nguyen L et al 2012Coding exons function as tissue-specific enhancers of nearby genesGenome Res 221059ndash1068

Boel G Letso R Neely H Price WN Wong KH Su M Luff JD Valecha MEverett JK Acton TB et al 2016 Codon influence on protein ex-pression in E coli correlates with mRNA levels Nature 529358ndash363

Bolger AM Lohse M Usadel B 2014 Trimmomatic a flexible trimmer forIllumina sequence data Bioinformatics 302114ndash2120

Brandvain Y Wright SI 2016 The limits of natural selection in a non-equilibrium world Trends Genet 32201ndash210

Brunk CF Lee LC Tran AB Li J 2003 Complete sequence of the mito-chondrial genome of Tetrahymena thermophila and comparativemethods for identifying highly divergent genes Nucleic Acids Res311673ndash1682

Burger G Zhu Y Littlejohn TG Greenwood SJ Schnare MN Lang BFGray MW 2000 Complete sequence of the mitochondrial genomeof Tetrahymena pyriformis and comparison with Paramecium aur-elia mitochondrial DNA J Mol Biol 297365ndash380

Burki F 2014 The eukaryotic tree of life from a global phylogenomicperspective Cold Spring Harb Perspect Biol 6a016147

Cao J, Schneeberger K, Ossowski S, Gunther T, Bender S, Fitz J, Koenig D, Lanz C, Stegle O, Lippert C, et al. 2011. Whole-genome sequencing of multiple Arabidopsis thaliana populations. Nat Genet. 43:956–963.

Catania F, Wurmser F, Potekhin AA, Przybos E, Lynch M. 2009. Genetic diversity in the Paramecium aurelia species complex. Mol Biol Evol. 26:421–431.

Chalker DL, Yao MC. 2011. DNA elimination in ciliates: transposon domestication and genome surveillance. Annu Rev Genet. 45:227–246.

Charlesworth D. 2003. Effects of inbreeding on the genetic diversity of populations. Philos T Roy Soc B. 358:1051–1070.

Charlesworth B, Morgan MT, Charlesworth D. 1993. The effect of deleterious mutations on neutral molecular variation. Genetics 134:1289–1303.

Chia JM, Song C, Bradbury PJ, Costich D, de Leon N, Doebley J, Elshire RJ, Gaut B, Geller L, Glaubitz JC, et al. 2012. Maize HapMap2 identifies extant variation from a genome in flux. Nat Genet. 44:803–807.

Comeron JM, Williford A, Kliman RM. 2008. The Hill-Robertson effect: evolutionary consequences of weak selection and linkage in finite populations. Heredity 100:19–31.

Dey A, Chan CK, Thomas CG, Cutter AD. 2013. Molecular hyperdiversity defines populations of the nematode Caenorhabditis brenneri. Proc Natl Acad Sci U S A. 110:11056–11060.

Dong X, Navratilova P, Fredman D, Drivenes O, Becker TS, Lenhard B. 2010. Exonic remnants of whole-genome duplication reveal cis-regulatory function of coding exons. Nucleic Acids Res. 38:1071–1085.

Downing T, Imamura H, Decuypere S, Clark TG, Coombs GH, Cotton JA, Hilley JD, de Doncker S, Maes I, Mottram JC, et al. 2011. Whole genome sequencing of multiple Leishmania donovani clinical isolates provides insights into population structure and mechanisms of drug resistance. Genome Res. 21:2143–2156.

Drummond DA, Bloom JD, Adami C, Wilke CO, Arnold FH. 2005. Why highly expressed proteins evolve slowly. Proc Natl Acad Sci U S A. 102:14338–14343.

Eddy SR. 2011. Accelerated profile HMM searches. PLoS Comput Biol. 7:e1002195.

Edgar RC. 2004a. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5:113.

Edgar RC. 2004b. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32:1792–1797.

Eichenlaub MP, Ettwiller L. 2011. De novo genesis of enhancers in vertebrates. PLoS Biol. 9:e1001188.

Ellison CE, Hall C, Kowbel D, Welch J, Brem RB, Glass NL, Taylor JW. 2011. Population genomics and local adaptation in wild isolates of a model microbial eukaryote. Proc Natl Acad Sci U S A. 108:2831–2836.

Elyashiv E, Bullaughey K, Sattath S, Rinott Y, Przeworski M, Sella G. 2010. Shifts in the intensity of purifying selection: an analysis of genome-wide polymorphism data from two closely related yeast species. Genome Res. 20:1558–1573.

Population Genomics of Paramecium Species. doi:10.1093/molbev/msx074. MBE.

Downloaded from https://academic.oup.com/mbe/article-abstract/34/5/1194/2992914 by Arizona State University user on 02 July 2018.

Engelthaler DM, Hicks ND, Gillece JD, Roe CC, Schupp JM, Driebe EM, Gilgado F, Carriconde F, Trilles L, Firacative C, et al. 2014. Cryptococcus gattii in North American Pacific Northwest: whole-population genome analysis provides insights into species evolution and dispersal. mBio 5:e01464–e01414.

Fawcett JA, Iida T, Takuno S, Sugino RP, Kado T, Kugou K, Mura S, Kobayashi T, Ohta K, Nakayama J, et al. 2014. Population genomics of the fission yeast Schizosaccharomyces pombe. PLoS One 9:e104241.

Fenchel T, Finlay BJ. 2004. The ubiquity of small species: patterns of local and global diversity. Bioscience 54:777–784.

Finlay BJ. 2002. Global dispersal of free-living microbial eukaryote species. Science 296:1061–1063.

Finlay BJ, Esteban GF, Fenchel T. 2004. Protist diversity is different? Protist 155:15–22.

Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J, et al. 2014. Pfam: the protein families database. Nucleic Acids Res. 42:D222–D230.

Foissac S, Gouzy J, Rombauts S, Mathe C, Amselem J, Sterck L, Van de Peer Y, Rouze P, Schiex T. 2008. Genome annotation in plants and fungi: EuGene as a model platform. Curr Bioinform. 3:87–97.

Foissner W. 1999. Protist diversity: estimates of the near-imponderable. Protist 150:363–368.

Foissner W. 2006. Biogeography and dispersal of micro-organisms: a review emphasizing protists. Acta Protozool. 45:111–136.

Foissner W. 2008. Protist diversity and distribution: some basic considerations. Biodivers Conserv. 17:235–242.

Foissner W. 2011. Dispersal of protists: the role of cysts and human introductions. In: Fontaneto D, editor. Biogeography of microscopic organisms: is everything small everywhere? Cambridge University Press. p. 61–87.

Foissner W, Chao A, Katz LA. 2008. Diversity and geographic distribution of ciliates (Protista: Ciliophora). Biodivers Conserv. 17:345–363.

Fournier-Level A, Korte A, Cooper MD, Nordborg M, Schmitt J, Wilczek AM. 2011. A map of local adaptation in Arabidopsis thaliana. Science 334:86–89.

Galtier N, Nabholz B, Glemin S, Hurst GDD. 2009. Mitochondrial DNA as a marker of molecular diversity: a reappraisal. Mol Ecol. 18:4541–4550.

Gout JF, Kahn D, Duret L, Paramecium Post-Genomics Consortium. 2010. The relationship among gene expression, the evolution of gene dosage, and the rate of protein evolution. PLoS Genet. 6:e1000944.

Gout JF, Lynch M. 2015. Maintenance and loss of duplicated genes by dosage subfunctionalization. Mol Biol Evol. 32:2141–2148.

Gruchy DF. 1955. The breeding system and distribution of Tetrahymena pyriformis. J Protozool. 2:178–185.

Gustincich S, Manfioletti G, Del Sal G, Schneider C, Carninci P. 1991. A fast method for high-quality genomic DNA extraction from whole human blood. BioTechniques 11:298–300, 302.

Hancock AM, Witonsky DB, Alkorta-Aranburu G, Beall CM, Gebremedhin A, Sukernik R, Utermann G, Pritchard JK, Coop G, Di Rienzo A. 2011. Adaptations to climate-mediated selective pressures in humans. PLoS Genet. 7:e1001375.

Hiwatashi K. 1968. Determination and inheritance of mating type in Paramecium caudatum. Genetics 58:373–386.

Hori M, Tomikawa I, Przybos E, Fujishima M. 2006. Comparison of the evolutionary distances among syngens and sibling species of Paramecium. Mol Phylogenet Evol. 38:697–704.

Huang X, Kurata N, Wei X, Wang ZX, Wang A, Zhao Q, Zhao Y, Liu K, Lu H, Li W, et al. 2012. A map of rice genome variation reveals the origin of cultivated rice. Nature 490:497–501.

Huang X, Zhao Q, Han B. 2015. Comparative population genomics reveals strong divergence and infrequent introgression between Asian and African rice. Mol Plant 8:958–960.

Hudson RR. 2002. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18:337–338.

Hudson RR, Turelli M. 2003. Stochasticity overrules the "three-times rule": genetic drift, genetic draft, and coalescence times for nuclear loci versus mitochondrial DNA. Evolution 57:182–190.

Jaillon O, Bouhouche K, Gout JF, Aury JM, Noel B, Saudemont B, Nowacki M, Serrano V, Porcel BM, Segurens B, et al. 2008. Translational control of intron splicing in eukaryotes. Nature 451:359–362.

James JE, Piganeau G, Eyre-Walker A. 2016. The rate of adaptive evolution in animal mitochondria. Mol Ecol. 25:67–78.

Joy DA, Feng X, Mu J, Furuya T, Chotivanich K, Krettli AU, Ho M, Wang A, White NJ, Suh E, et al. 2003. Early origin and recent expansion of Plasmodium falciparum. Science 300:318–321.

Kaplan NL, Hudson RR, Langley CH. 1989. The hitchhiking effect revisited. Genetics 123:887–899.

Katz LA, McManus GB, Snoeyenbos-West OLO, Griffin A, Pirog K, Costas B, Foissner W. 2005. Reframing the 'Everything is everywhere' debate: evidence for high gene flow and diversity in ciliate morphospecies. Aquat Microb Ecol. 41:55–65.

Katz LA, Snoeyenbos-West O, Doerder FP. 2006. Patterns of protein evolution in Tetrahymena thermophila: implications for estimates of effective population size. Mol Biol Evol. 23:608–614.

Krenek S, Berendonk TU, Fokin SI. 2015. New Paramecium (Ciliophora, Oligohymenophorea) congeners shape our view on its biodiversity. Org Divers Evol. 15:215–233.

Krenek S, Berendonk TU, Petzoldt T. 2011. Thermal performance curves of Paramecium caudatum: a model selection approach. Eur J Protistol. 47:124–137.

Krenek S, Petzoldt T, Berendonk TU. 2012. Coping with temperature at the warm edge–patterns of thermal adaptation in the microbial eukaryote Paramecium caudatum. PLoS One 7:e30598.

Kudla G, Murray AW, Tollervey D, Plotkin JB. 2009. Coding-sequence determinants of gene expression in Escherichia coli. Science 324:255–258.

Kvitek DJ, Will JL, Gasch AP. 2008. Variations in stress sensitivity and genomic expression in diverse S. cerevisiae isolates. PLoS Genet. 4:e1000223.

Lahr DJ, Laughinghouse HD IV, Oliverio AM, Gao F, Katz LA. 2014. How discordant morphological and molecular evolution among microorganisms can revise our notions of biodiversity on Earth. Bioessays 36:950–959.

Langley CH, Stevens K, Cardeno C, Lee YC, Schrider DR, Pool JE, Langley SA, Suarez C, Corbett-Detig RB, Kolaczkowski B, et al. 2012. Genomic variation in natural populations of Drosophila melanogaster. Genetics 192:533–598.

Leffler EM, Bullaughey K, Matute DR, Meyer WK, Segurel L, Venkat A, Andolfatto P, Przeworski M. 2012. Revisiting an old riddle: what determines genetic diversity levels within species? PLoS Biol. 10:e1001388.

Li H. 2011. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27:2987–2993.

Li H, Durbin R. 2010. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26:589–595.

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25:2078–2079.

Lim LP, Burge CB. 2001. A computational analysis of sequence features involved in recognition of short introns. Proc Natl Acad Sci U S A. 98:11193–11198.

Liti G, Carter DM, Moses AM, Warringer J, Parts L, James SA, Davey RP, Roberts IN, Burt A, Koufopanou V, et al. 2009. Population genomics of domestic and wild yeasts. Nature 458:337–341.

Loveless MD, Hamrick JL. 1984. Ecological determinants of genetic structure in plant populations. Annu Rev Ecol Syst. 15:65–95.

Lynch M. 2007. The origins of genome architecture. Sunderland (MA): Sinauer Associates.

Lynch M. 2009. Estimation of allele frequencies from high-coverage genome-sequencing projects. Genetics 182:295–301.


Lynch M, Field MC, Goodson HV, Malik HS, Pereira-Leal JB, Roos DS, Turkewitz AP, Sazer S. 2014. Evolutionary cell biology: two origins, one objective. Proc Natl Acad Sci U S A. 111:16990–16994.

Lynch M, Sung W, Morris K, Coffey N, Landry CR, Dopman EB, Dickinson WJ, Okamoto K, Kulkarni S, Hartl DL, et al. 2008. A genome-wide view of the spectrum of spontaneous mutations in yeast. Proc Natl Acad Sci U S A. 105:9272–9277.

Maere S, Heymans K, Kuiper M. 2005. BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics 21:3448–3449.

Maguire B, Belk D. 1967. Paramecium transport by land snails. J Protozool. 14:445–447.

Maruki T, Lynch M. 2015. Genotype-frequency estimation from high-throughput sequencing data. Genetics 201:473–486.

McGrath CL, Gout JF, Doak TG, Yanagi A, Lynch M. 2014a. Insights into three whole-genome duplications gleaned from the Paramecium caudatum genome sequence. Genetics 197:1417–1428.

McGrath CL, Gout JF, Johri P, Doak TG, Lynch M. 2014b. Differential retention and divergent resolution of duplicate genes following whole-genome duplication. Genome Res. 24:1665–1675.

Miller JR, Delcher AL, Koren S, Venter E, Walenz BP, Brownley A, Johnson J, Li K, Mobarry C, Sutton G. 2008. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics 24:2818–2824.

Moradian MM, Beglaryan D, Skozylas JM, Kerikorian V. 2007. Complete mitochondrial genome sequence of three Tetrahymena species reveals mutation hot spots and accelerated nonsynonymous substitutions in Ymf genes. PLoS One 2:e650.

Nabholz B, Glemin S, Galtier N. 2009. The erratic mitochondrial clock: variations of mutation rate, not population size, affect mtDNA diversity across birds and mammals. BMC Evol Biol. 9:54.

Nabholz B, Mauffrey JF, Bazin E, Galtier N, Glemin S. 2008. Determination of mitochondrial genetic diversity in mammals. Genetics 178:351–361.

Neafsey DE, Galinsky K, Jiang RH, Young L, Sykes SM, Saif S, Gujja S, Goldberg JM, Young S, Zeng Q, et al. 2012. The malaria parasite Plasmodium vivax exhibits greater genetic diversity than Plasmodium falciparum. Nat Genet. 44:1046–1050.

Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, Indap A, King KS, Bergmann S, Nelson MR, et al. 2008. Genes mirror geography within Europe. Nature 456:98–101.

Ohta T. 1974. Mutational pressure as main cause of molecular evolution and polymorphism. Nature 252:351–354.

Ohta T. 1992. The nearly neutral theory of molecular evolution. Annu Rev Ecol Syst. 23:263–286.

Pawlowski J, Audic S, Adl S, Bass D, Belbahri L, Berney C, Bowser SS, Cepicka I, Decelle J, Dunthorn M, et al. 2012. CBOL protist working group: barcoding eukaryotic richness beyond the animal, plant, and fungal kingdoms. PLoS Biol. 10:e1001419.

Piganeau G, Eyre-Walker A. 2009. Evidence for variation in the effective population size of animal mitochondrial DNA. PLoS One 4:e4396.

Pool JE, Nielsen R. 2007. Population size changes reshape genomic patterns of diversity. Evolution 61:3001–3006.

Pophaly SD, Tellier A. 2015. Population level purifying selection and gene expression shape subgenome evolution in maize. Mol Biol Evol. 32:3226–3235.

Prescott DM. 1994. The DNA of ciliated protozoa. Microbiol Rev. 58:233–267.

Pritchard AE, Seilhamer JJ, Mahalingam R, Sable CL, Venuti SE, Cummings DJ. 1990. Nucleotide sequence of the mitochondrial genome of Paramecium. Nucleic Acids Res. 18:173–180.

Przybos E. 1975. Genetic studies of Paramecium jenningsi strains (Diller, Earl, 1958). Folia Biol (Krakow). 23:425–471.

Przybos E. 2005. Recent data on the occurrence of species of the Paramecium aurelia complex in Europe. Folia Biol (Krakow). 53:61–63.

Przybos E, Barth D, Berendonk TU. 2008. The Paramecium aurelia species complex, frequency and co-occurrence across Europe. Folia Biol (Krakow). 56:77–81.

Przybos E, Rautian M, Greczek-Stachura M, Potekhin A. 2007. Polymorphism within Paramecium sexaurelia (Ciliophora, Oligohymenophorea) and description of a new stand of the species in China. Folia Biol (Krakow). 55:121–125.

Przybos E, Surmacz M. 2010. New world-wide data on the distribution of species of the Paramecium aurelia complex (Ciliophora, Protozoa). Folia Biol (Krakow). 58:185–188.

R-Core-Team. 2014. R: a language and environment for statistical computing. Vienna (Austria): R Foundation for Statistical Computing.

Redden H, Alper HS. 2015. The development and characterization of synthetic minimal yeast promoters. Nat Commun. 6:7810.

Ritter DI, Dong Z, Guo S, Chuang JH. 2012. Transcriptional enhancers in protein-coding exons of vertebrate developmental genes. PLoS One 7:e35202.

Russell CB, Fraga D, Hinrichsen RD. 1994. Extremely short 20-33 nucleotide introns are the standard length in Paramecium tetraurelia. Nucleic Acids Res. 22:1221–1225.

Salim HMW, Ring KL, Cavalcanti ARO. 2008. Patterns of codon usage in two ciliates that reassign the genetic code: Tetrahymena thermophila and Paramecium tetraurelia. Protist 159:283–298.

Schacherer J, Shapiro JA, Ruderfer DM, Kruglyak L. 2009. Comprehensive polymorphism survey elucidates population structure of Saccharomyces cerevisiae. Nature 458:342–345.

Schattner P, Brooks AN, Lowe TM. 2005. The tRNAscan-SE, snoscan and snoGPS web servers for the detection of tRNAs and snoRNAs. Nucleic Acids Res. 33:W686–W689.

Schrider DR, Kern AD. 2015. Inferring selective constraint from population genomic data suggests recent regulatory turnover in the human brain. Genome Biol Evol. 7:3511–3528.

Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. 2003. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13:2498–2504.

Simonsen KL, Churchill GA, Aquadro CF. 1995. Properties of statistical tests of neutrality for DNA polymorphism data. Genetics 141:413–429.

Smith DR, Lee RW. 2008. Nucleotide diversity in the mitochondrial and nuclear compartments of Chlamydomonas reinhardtii: investigating the origins of genome architecture. BMC Evol Biol. 8:156.

Smith JM, Haigh J. 1974. The hitch-hiking effect of a favourable gene. Genet Res. 89:391–403.

Snoke MS, Berendonk TU, Barth D, Lynch M. 2006. Large global effective population sizes in Paramecium. Mol Biol Evol. 23:2474–2479.

Sonneborn TM. 1970. Methods in Paramecium research. In: Methods Cell Biol. New York and London: Academic Press. p. 241–339.

Sonneborn TM. 1975. Paramecium aurelia complex of fourteen sibling species. T Am Microsc Soc. 94:155–178.

Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30:1312–1313.

Stoeck T, Welter H, Seitz-Bender D, Kusch J, Schmidt HJ. 2000. ARDRA and RAPD-fingerprinting reject the sibling species concept for the ciliate Paramecium caudatum (Ciliophora, Protoctista). Zool Scr. 29:75–82.

Sun M. 1999. Cleistogamy in Scutellaria indica (Labiatae): effective mating system and population genetic structure. Mol Ecol. 8:1285–1295.

Swart EC, Nowacki M, Shum J, Stiles H, Higgins BP, Doak TG, Schotanus K, Magrini VJ, Minx P, Mardis ER, et al. 2012. The Oxytricha trifallax mitochondrial genome. Genome Biol Evol. 4:136–154.

Tajima F. 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123:585–595.

Tishkoff SA, Reed FA, Ranciaro A, Voight BF, Babbitt CC, Silverman JS, Powell K, Mortensen HM, Hirbo JB, Osman M, et al. 2007. Convergent adaptation of human lactase persistence in Africa and Europe. Nat Genet. 39:31–40.

Tsukii Y, Hiwatashi K. 1983. Genes controlling mating-type specificity in Paramecium caudatum: three loci revealed by intersyngenic crosses. Genetics 104:41–62.


Tsukii Y, Hiwatashi K. 1985. Meiotic nondisjunction and aneuploids in intersyngenic hybrids of Paramecium caudatum. Genetics 111:779–794.

Tuller T, Zur H. 2015. Multiple roles of the coding sequence 5′ end in gene expression regulation. Nucleic Acids Res. 43:13–28.

Voight BF, Kudaravalli S, Wen X, Pritchard JK. 2006. A map of recent positive selection in the human genome. PLoS Biol. 4:e72.

Volkman SK, Sabeti PC, DeCaprio D, Neafsey DE, Schaffner SF, Milner DA Jr, Daily JP, Sarr O, Ndiaye D, Ndir O, et al. 2007. A genome-wide map of diversity in Plasmodium falciparum. Nat Genet. 39:113–119.

Weir W, Capewell P, Foth B, Clucas C, Pountain A, Steketee P, Veitch N, Koffi M, De Meeus T, Kabore J, et al. 2016. Population genomics reveals the origin and asexual evolution of human infective trypanosomes. eLife 5:e11473.

Weisse T. 2008. Distribution and diversity of aquatic protists: an evolutionary and ecological perspective. Biodivers Conserv. 17:243–259.

Wickham H. 2010. ggplot2: elegant graphics for data analysis. New York: Springer-Verlag.

Zhang JZ, Yang JR. 2015. Determinants of the rate of protein sequence evolution. Nat Rev Genet. 16:409–420.


P. sexaurelia can be attributed to various factors such as positive selection (Hudson and Turelli 2003; Galtier et al. 2009; James et al. 2016), demographic events (Pool and Nielsen 2007), or simply differences in nuclear and/or mitochondrial mutation rates (Lynch 2007). Future studies are needed to dissect the factors responsible for the remarkably different ratio of mitochondrial to nuclear diversity in P. sexaurelia.

Conclusions
This is the first study where multiple individuals of several species of free-living unicellular eukaryotes have been sequenced and used to reveal genome-wide patterns of variation. This study thus provides an invaluable resource for future research in Paramecium, a major model organism in cell biology. Sequencing of multiple individuals from the same lake or pond is needed to further resolve the global versus local population structure of these cosmopolitan unicellular species. Overall, our results suggest that most of the intergenic regions across the highly compact genomes of Paramecium species are strongly constrained and probably harbor functionally important regulatory modules, which might extend into coding regions. We show that patterns of nucleotide diversity across the genome are driven primarily by a combination of purifying selection and mutation rate variation. In future, such analyses will benefit greatly from incorporating variation in recombination rates across the genomic landscape. We also find that duplicates created by the WGD appear to be evolving under similar constraints. Synonymous to non-synonymous diversity ratios need to be supplemented with divergence levels to distinguish between the various types of selective forces acting on the duplicates. Thus, sequencing of much closer outgroup species would allow us to further understand the role of selection in the evolution of different genomic elements, including gene duplicates that resulted from ancient WGDs.

Methods

Sampling, DNA Extraction, and Sequencing
We examined a total of 56 Paramecium strains originally isolated from environmental sources, including 11 isolates of P. biaurelia, 13 isolates of P. sexaurelia, 12 isolates of P. tetraurelia, 10 isolates of P. caudatum, and 10 isolates of P. multimicronucleatum. A total of 14 strains from P. biaurelia and P. sexaurelia were kindly provided by E. Przybos, whereas the remaining strains were cultured in the Paramecium collections of the Institute of Hydrobiology, TU Dresden, Germany, or the Department of Biology, Indiana University, Bloomington, USA. Prior to our analyses, P. tetraurelia strain Moz 13CXI as well as P. sexaurelia strains Moz 13BIII and Indo 17I were kindly identified by E. Przybos by performing inter- and intrastrain crosses (Przybos 1975). Paramecia were maintained in 0.25% Cerophyl medium (modified according to Sonneborn 1970; Krenek et al. 2011) inoculated with Enterobacter aerogenes. Since P. aurelia species can undergo autogamy (self-fertilization) induced by starvation, and all strains have been cultured under lab conditions for several years, all P. aurelia strains in this study are expected to be more homozygous than wild isolates.

Prior to DNA extraction, all cultures were initiated from a single cell, ensuring clonality. A modified sucrose density separation (Aury et al. 2006) was performed to extract primarily macronuclei (MACs) and to reduce the amount of non-MAC DNA potentially compromising subsequent analyses. Briefly, Paramecium cells were starved for at least two days and filtered through several layers of sterile cheesecloth prior to cell harvesting via gentle centrifugation. Pelleted cells were gently lysed in 0.25 M sucrose medium (pH 6.8) containing 10 mM Tris-HCl, 8 mM MgCl2, 3 mM CaCl2, 0.5–1% Nonidet-P40, and 0.1–0.2% sodium deoxycholate. Macronuclei were subsequently isolated away from cellular debris, micronuclei, and mitochondria by centrifugation at 100 rcf in 0.25 M sucrose wash buffer (see above, without Nonidet-P40 and Na-deoxycholate). Genomic DNA extraction from MAC isolations was performed by a modified CTAB procedure (Gustincich et al. 1991).

All genomes were sequenced using the Illumina HiSeq 2000, and paired-end reads with insert sizes between 200 and 400 bp and mean length of 151 bp were obtained. The reads mapped with coverage ranging between 40× and 80× (details given in supplementary table S1, Supplementary Material online).

Paramecium Reference Genomes
Complete genomes of P. tetraurelia, P. biaurelia, P. sexaurelia, and P. caudatum were downloaded from ParameciumDB (Arnaiz et al. 2007; Arnaiz and Sperling 2011) along with the annotations. Ortho-paralog relationships as well as functional annotations were retrieved from the supplementary information provided in McGrath et al. (2014a, 2014b). RNA-seq data for all four species were downloaded and processed as in Gout and Lynch (2015).

The genome of P. multimicronucleatum was assembled using Celera (Miller et al. 2008) and annotated using EuGene (Foissac et al. 2008) as previously described (McGrath et al. 2014a, 2014b). A detailed description of the sequencing, assembly, and annotation of P. multimicronucleatum can be found in Marinov et al. (in preparation). OMA (Orthology MAtrix) (Altenhoff et al. 2015) was used to find orthologs between P. caudatum and P. multimicronucleatum. A total of 11,002 genes of P. caudatum were found to have 1-1 orthologs in P. multimicronucleatum, and only 1-1 orthologs were used for the current analysis.

Mapping of Reads and Detection of SNPs and Indels
Sequenced paired-end reads were mapped to the reference genomes for each species using BWA (bwa mem) (Li and Durbin 2010) with default parameters, and the mpileup function in Samtools (Li et al. 2009) was used to generate mpileup files for each individual. Two distinct pipelines were employed to identify SNPs to ensure detection accuracy: Bcftools-Vcftools (Li 2011) as well as the maximum likelihood Genotype Frequency Estimator (referred to as GFE) (Lynch 2009; Maruki and Lynch 2015). Both pipelines produced highly overlapping sets of SNPs (supplementary table S2, Supplementary Material online). For both pipelines, sites were only considered for SNP detection and further analysis if the mapping quality was above 30, base quality was above 20, per-base alignment quality was above 15, and the sum of the depth of coverage for all individuals was at least 5 times the number of individuals and less than twice the average population coverage. SNPs were called by bcftools if the SNP quality score was above 30 and by GFE if the significance score was > 5.91 (P < 0.05). The exact parameters of the Bcftools-Vcftools pipeline to call SNPs and indels are given below.

samtools mpileup -q 30 -Q 20 -t DP -uvf reference_genome.fasta sorted.bam > species.vcf

bcftools call -f GQ -o species_snps.vcf -O v -V indels -cv species.vcf

bcftools call -f GQ -o species_indels.vcf -O v -V snps -cv species.vcf

# filtering snp and indel files for depth and biases
perl vcfutils.pl varFilter -d 50 -D 800 -1 -2 -3 -4 species_snps.vcf > species_snps.filtered1.vcf

perl vcfutils.pl varFilter -d 50 -D 800 -1 -2 -3 -4 species_indels.vcf > species_indels.filtered1.vcf

vcftools --vcf species_snps.filtered1.vcf --minQ 20 --minGQ 30 --minDP 4 --recode --recode-INFO-all --out species_snps.filtered2

vcftools --vcf species_indels.filtered1.vcf --minQ 20 --minGQ 30 --minDP 4 --recode --recode-INFO-all --out species_indels.filtered2

# filter out the recoded sites that are not polymorphic anymore, that is, AF1 = 1
bcftools view -q 0.001 species_snps.filtered2.recode.vcf -o species_snps.final.vcf

bcftools view -q 0.001 species_indels.filtered2.recode.vcf -o species_indels.final.vcf

Mitochondrial SNPs were called using the Bcftools-Vcftools pipeline with the same parameters as noted above.

Simulation of SNP Detection with Varying Levels of Divergence from the Reference
We simulated reads using wgsim (https://github.com/lh3/wgsim) for a diploid genome with varying levels of divergence from the reference genome of P. caudatum and P. multimicronucleatum, using average values of read length, total number of reads, and insert sizes.

For the P. caudatum reference genome:
wgsim -d 287 -s 113 -N 10201087 -1 151 -2 151 -r 0.10 -R 0.013 caudatum_assembly.fasta caud_010_1.fastq caud_010_2.fastq >& caud_010.info

For the P. multimicronucleatum reference genome:
wgsim -d 261 -s 102 -N 12154044 -1 151 -2 151 -r 0.10 -R 0.011 multimicronucleatum_assembly.fasta multi_010_1.fastq multi_010_2.fastq >& multi_010.info

Reads were simulated for r = 0.0, 0.05, 0.10, 0.15, 0.20, 0.25, and 0.30, which represents the fraction of the genome with SNPs. The SNP detection pipeline described above was used to call SNPs, which were compared to the simulated set of SNPs (supplementary table S4, Supplementary Material online).
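The comparison between simulated and called SNPs can be sketched as a set comparison of variant positions. This is only an illustration, not the authors' script: the function name `snp_call_accuracy` and the representation of a SNP as a `(scaffold, position)` tuple are assumptions made here.

```python
def snp_call_accuracy(simulated, called):
    """Compare called SNPs to the simulated truth set.

    Both arguments are iterables of (scaffold, position) tuples
    (a hypothetical representation chosen for this sketch).
    Returns (sensitivity, false_discovery_rate).
    """
    simulated, called = set(simulated), set(called)
    true_pos = len(simulated & called)
    # Fraction of simulated SNPs that the pipeline recovered.
    sensitivity = true_pos / len(simulated) if simulated else float("nan")
    # Fraction of called SNPs that were never simulated.
    fdr = (len(called) - true_pos) / len(called) if called else float("nan")
    return sensitivity, fdr
```

Running such a comparison once per simulated value of r would show how detection accuracy changes as divergence from the reference grows.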

Building Phylogenies
A consensus genome (bcftools view species_filtered.recode.vcf | vcfutils.pl vcf2fq > species.cns.fq) was obtained for each individual using bcftools that represented heterozygous sites using DNA ambiguity codes. Gene nucleotide sequences that are present in single copy in the P. aurelia species and are also present in both the outgroup species (P. caudatum and P. multimicronucleatum) were extracted, and multiple sequence alignments were generated using MUSCLE (Edgar 2004a, 2004b). This amounted to about 55 sequences (including the reference genomes) each in 883 sets of orthologous genes. RAxML 8.0.26 (Stamatakis 2014) was used to build a tree using the concatenation of all 883 aligned gene sequences (raxml -s concatenated_sequences.fasta -n gene -m GTRGAMMA -T 4 -p 7). In addition, individual gene trees (raxml -s gene.fasta -n gene -m GTRGAMMA -T 4 -p 7) were built, which were then used to create a consensus tree (raxml -J MR -z trees.newick -T 4 -m GTRGAMMA) (Aberer et al. 2010).

Mitochondrial phylogeny was built using the same parameters as above by concatenating eight sets of orthologous genes (NADH_dehydrogenase_subunit_1_a, NADH_dehydrogenase_subunit_5, Ymf56orf78, Ymf67orf265_1, Ymf68orf393, apocytochrome_b, cytochrome_c_oxidase_subunit_1, cytochrome_c_oxidase_subunit_2, ribosomal_protein_S14) that were ubiquitously present across all sampled species.

Principal Component Analysis
Genotypes for all individuals were obtained from the vcf files generated by vcftools for all scaffolds. Only the sites that had a genotype available for all individuals were used in the analysis, i.e., if even a single individual had missing data, the site was excluded from analysis. A total of 10,000 markers were used for each species. The first marker of each scaffold was picked randomly within the first 50 SNP positions, and the rest were selected so that all markers were uniformly distributed across all scaffolds. PCA was carried out on these markers by using the R (R-Core-Team 2014) command prcomp (https://stat.ethz.ch/R-manual/R-devel/library/stats/html/prcomp.html) and visualized using the first two components with the R graphing package ggplot2 (Wickham 2010).

Calculation of Population Genetic Statistics
Nucleotide Diversity and Tajima's D
Nucleotide diversity was calculated as 2pqn/(n − 1), where p and q are the major and minor allele frequency and n is the number of chromosomes. The major and minor allele frequency was given for each site by GFE. Diversity at coding sites was partitioned into synonymous and non-synonymous components (e.g., 2-fold degenerate sites are one-third synonymous and two-thirds non-synonymous). Major allele frequencies were also extracted from vcftools and returned very similar results as those from GFE (supplementary table S3, Supplementary Material online).
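A minimal sketch of the per-site calculation described above, assuming major allele frequencies are already available (for example from GFE output). The function names and the `SYN_FRACTION` table are illustrative; the weights for 0-fold and 4-fold degenerate sites are the conventional values and go beyond the 2-fold example given in the text.

```python
def site_diversity(p, n):
    """Per-site nucleotide diversity 2pq * n/(n - 1).

    p: major allele frequency; n: number of sampled chromosomes.
    """
    q = 1.0 - p
    return 2.0 * p * q * n / (n - 1)

# Fraction of each site class counted as synonymous.
# The 2-fold value follows the text; 0-fold and 4-fold are assumed here.
SYN_FRACTION = {0: 0.0, 2: 1.0 / 3.0, 4: 1.0}

def partition_diversity(pi, degeneracy):
    """Split a site's diversity into (synonymous, non-synonymous) parts."""
    syn = SYN_FRACTION[degeneracy]
    return pi * syn, pi * (1.0 - syn)
```

For instance, a 2-fold degenerate site with p = 0.5 in n = 20 chromosomes has diversity 10/19, of which one-third is assigned to the synonymous component.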

Population Genomics of Paramecium Species doi101093molbevmsx074 MBE

1211Downloaded from httpsacademicoupcommbearticle-abstract34511942992914by Arizona State University useron 02 July 2018

A sliding window approach was used to calculate Tajima's D across the genome. Vcftools was used to obtain these values for non-overlapping windows of 1 kb in length. The program ms (Hudson 2002) was used to perform 100,000 simulations of 10 individuals with 1 kb locus length, and niceStats (https://github.com/kern-lab/shIC/blob/master/niceStats.c) was used to calculate Tajima's D from the ms output file. To obtain Tajima's D for different categories of sites, SNPs belonging to a particular category of sites (specified in Results) were selected, and vcftools was used to calculate Tajima's D for non-overlapping windows of 10 kb. Windows with 2 or more SNPs were used to calculate the mean, and only alternate windows were used to compare distributions of Tajima's D across different categories by Student's t-test to maximise independence of sampling.
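For reference, the statistic computed per window can be written out directly from the definition in Tajima (1989). The sketch below is not the vcftools or niceStats implementation; the function name and argument layout are chosen here for illustration, and `pi` is the summed pairwise diversity over the window (not a per-site average).

```python
from math import sqrt

def tajimas_d(n, seg_sites, pi):
    """Tajima's D for one window.

    n: number of sampled chromosomes
    seg_sites: number of segregating sites S in the window
    pi: mean number of pairwise differences, summed over the window
    Returns NaN when the window has no segregating sites.
    """
    if seg_sites == 0:
        return float("nan")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    # Difference between the two estimators of theta, scaled by its SD.
    return (pi - seg_sites / a1) / sqrt(variance)
```

Negative values indicate an excess of rare variants relative to the neutral equilibrium expectation, and positive values an excess of intermediate-frequency variants.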

Plotting Average Expected Heterozygosity across Introns and Intergenic Regions
All introns (< 500 bp in length) were split midway and aligned according to the 5′ and 3′ splice sites. All intergenic regions (< 500 bp in the outgroup species and < 1,000 bp in the P. aurelia species) were split midway and aligned according to the start and stop codons. Note that these were not sequence alignments. At each site, the mean expected heterozygosity was calculated. Sites with an insufficient number of data points (< 200 data points per site for introns; for intergenic regions, < 100 data points per site in the outgroups and < 1,000 data points per site in the P. aurelia species) were discarded.
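The split-and-anchor averaging can be sketched as follows, under the assumption that each region is supplied as a list of per-site heterozygosity values ordered 5′ to 3′. The function name `end_anchored_means`, the choice to assign the middle site of odd-length regions to the 3′ half, and the `min_points` parameter are all illustrative rather than taken from the authors' code.

```python
from collections import defaultdict

def end_anchored_means(regions, min_points=1):
    """Average per-site values by distance from each end of a region.

    regions: list of lists of per-site values, each ordered 5' -> 3'.
    Each region is split at its midpoint; the 5' half is indexed by
    distance from the start, the 3' half by distance from the end.
    Positions with fewer than min_points observations are dropped.
    Returns (five_prime_means, three_prime_means) as dicts.
    """
    left, right = defaultdict(list), defaultdict(list)
    for values in regions:
        half = len(values) // 2
        for dist, v in enumerate(values[:half]):
            left[dist].append(v)
        for dist, v in enumerate(reversed(values[half:])):
            right[dist].append(v)

    def summarize(bins):
        return {d: sum(v) / len(v) for d, v in bins.items()
                if len(v) >= min_points}

    return summarize(left), summarize(right)
```

Because regions differ in length, positions far from either end are covered by fewer regions, which is why a minimum number of data points per site is needed.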

Calculating and Plotting Divergence across Introns and Intergenic Regions
Orthologous coding sequences, along with introns, were aligned using MUSCLE (Edgar 2004a). Intergenic regions flanked by orthologous genes on both sides were also aligned, along with the flanking coding exon, using MUSCLE. For each species separately, divergence at different positions (with respect to the start and stop sites) was averaged across all introns and intergenic regions. The divergence values were corrected for multiple substitutions by the Jukes-Cantor sequence substitution model and plotted in the same manner as described above.
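The Jukes-Cantor correction converts an observed proportion of differing sites p into an estimated number of substitutions per site, d = −(3/4) ln(1 − 4p/3). A minimal Python sketch is given below; the function name is illustrative and does not refer to the scripts used in this study.

```python
import math


def jukes_cantor_distance(p_diff):
    """Jukes-Cantor corrected divergence for an observed proportion of
    differing sites p_diff. The correction is undefined for p_diff >= 0.75."""
    if not 0.0 <= p_diff < 0.75:
        raise ValueError("observed divergence must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p_diff / 3.0)
```

The corrected value is always at least as large as the observed proportion, and the gap widens as sequences approach saturation.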

GO Analysis
All gene ontology enrichment analyses were performed using the BiNGO plugin for Cytoscape 3.2.1 (Shannon et al. 2003; Maere et al. 2005), using the hypergeometric test to assess significance, with Benjamini and Hochberg FDR correction for multiple testing (Benjamini and Hochberg 1995). The latest ontology files (.obo) were downloaded from www.geneontology.org, and custom annotation files were prepared as required using Python scripts.
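The two statistical steps named above can be sketched in plain Python. This is an illustration of the general technique (a one-sided hypergeometric test followed by Benjamini-Hochberg adjustment), not BiNGO's implementation; the function and argument names are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, term_genes, sample_size, total_genes):
    """P(X >= k) when sample_size genes are drawn without replacement from
    total_genes genes, of which term_genes are annotated with the GO term."""
    upper = min(term_genes, sample_size)
    favourable = sum(
        comb(term_genes, i) * comb(total_genes - term_genes, sample_size - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(total_genes, sample_size)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

A term would be reported as enriched when its adjusted p-value falls below the chosen FDR threshold.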

Identification of Heterozygous Regions in the P. aurelia Species
The fraction of heterozygous sites was calculated for 10 bp non-overlapping windows across the genome for all three P. aurelia species, and the top 1% of such regions, i.e., those with the highest fraction of heterozygous sites, were identified as heterozygous regions.

Mitochondrial Genome Assembly and Annotation
Mitochondrial genomes were assembled as follows. Sequence reads were assembled using SPAdes (version 3.5.0) (Bankevich et al. 2012) after removing potential adapter sequence with Trimmomatic (version 0.33) (Bolger et al. 2014). Mitochondrial contigs were identified from the resulting assemblies by BLAST searches against the published P. caudatum and P. tetraurelia mitochondrial genomes (Pritchard et al. 1990; Barth and Berendonk 2011).

Protein-coding genes were identified by generating all ORFs longer than 60 aa in all six frames using the Mold, Protozoan, and Coelenterate Mitochondrial Code (i.e., UGA codes for W instead of being a stop codon) and all alternative start codons specific to Paramecium (AUU, AUA, AUG, AUC, GUG, and GUA), and retaining the longest ORFs associated with each stop codon. BLASTP (Altschul et al. 1997) was then used to identify homologs of annotated mitochondrial proteins in P. tetraurelia and P. caudatum. Additional ORFs were identified by imposing the requirement that their length exceed 100 aa, and were subsequently annotated using BLASTP against the non-redundant protein sequences (nr) database and HMMER 3.0 (Eddy 2011) scans against the Pfam 27.0 database (Finn et al. 2014).
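The ORF-generation step can be sketched as below. This is a simplified illustration, not the pipeline used in the study: it scans DNA rather than RNA (so the start codons are written with T), treats only TAA and TAG as stops because UGA encodes tryptophan in this code, and reports the longest ORF per stop codon by starting from the first start codon after the previous stop. All names are hypothetical.

```python
STOPS = {"TAA", "TAG"}  # TGA is read as Trp under this mitochondrial code
STARTS = {"ATT", "ATA", "ATG", "ATC", "GTG", "GTA"}  # DNA forms of the listed starts
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def longest_orfs(seq, min_aa=60):
    """Yield (strand, start, end, n_codons) for the longest ORF ending at
    each stop codon, in all six frames.

    Coordinates are 0-based and end-exclusive on the scanned strand; the
    interval includes the stop codon, while n_codons excludes it.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            first_start = None
            for pos in range(frame, len(s) - 2, 3):
                codon = s[pos:pos + 3]
                if codon in STOPS:
                    if first_start is not None:
                        n_codons = (pos - first_start) // 3
                        if n_codons > min_aa:
                            yield strand, first_start, pos + 3, n_codons
                    first_start = None
                elif first_start is None and codon in STARTS:
                    # Earliest start since the last stop gives the longest ORF.
                    first_start = pos
```

Raising `min_aa` to 100 would mimic the stricter length requirement applied to the additional ORFs.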

tRNA genes were annotated with tRNAscan-SE 1.21 (Schattner et al. 2005) using the "Mito/Chloroplast" source.

Supplementary Material
Supplementary data are available at Molecular Biology and Evolution online.

Acknowledgments
We thank Lydia Bright and Jean-Francois Gout for helpful comments on the manuscript, Matthew S. Ackerman for insightful discussions related to the project, and the anonymous reviewers and editor for their valuable suggestions. We would also like to thank Kelley Thomas for providing help with the sequencing, and E. Przybos for kindly providing several Paramecium strains and the species identification of some so far unspecified strains. This work was financially supported by the National Science Foundation (MCB-1050161 and DEB-1257806), the European Commission FP7-PEOPLE-2009-IRSES project CINAR PATHOBACTER (project number 247658), and the CNRS research network GDRE "Paramecium Genome Dynamics and Evolution". T.U.B. also acknowledges funding of the "support the best" of the ZUK TU Dresden and the Deutsche Forschungsgemeinschaft (DFG).

References
Abbott RJ, Gomes MF. 1989. Population genetic-structure and outcrossing rate of Arabidopsis-Thaliana (L.) Heynh. Heredity 62:411–418.

Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, McVean GA. 2010. A map of human genome variation from population-scale sequencing. Nature 467:1061–1073.


Abecasis GR Auton A Brooks LD DePristo MA Durbin RM HandsakerRE Kang HM Marth GT McVean GA 2012 An integrated map ofgenetic variation from 1092 human genomes Nature 49156ndash65

Aberer AJ Pattengale ND Stamatakis A 2010 Parallelized phylogeneticpost-analysis on multi-core architectures J Comput Sci-Neth1107ndash114

Ackermann AA Panunzi LG Cosentino RO Sanchez DO Aguero F2012 A genomic scale map of genetic diversity in Trypanosomacruzi BMC Genomics 13736

Altenhoff AM Skunca N Glover N Train CM Sueki A Pilizota I Gori KTomiczek B Muller S Redestig H et al 2015 The OMA orthologydatabase in 2015 function predictions better plant support syntenyview and other improvements Nucleic Acids Res 43D240ndashD249

Altschul SF Madden TL Schaffer AA Zhang J Zhang Z Miller WLipman DJ 1997 Gapped BLAST and PSI-BLAST a new generationof protein database search programs Nucleic Acids Res253389ndash3402

Arnaiz O Cain S Cohen J Sperling L 2007 ParameciumDB a communityresource that integrates the Paramecium tetraurelia genome se-quence with genetic data Nucleic Acids Res 35D439ndashD444 4

Arnaiz O Mathy N Baudry C Malinsky S Aury JM Wilkes CD Garnier OLabadie K Lauderdale BE Le Mouel A et al 2012 The Parameciumgermline genome provides a niche for intragenic parasitic DNAevolutionary dynamics of internal eliminated sequences PLoSGenet 8e1002984

Arnaiz O Sperling L 2011 ParameciumDB in 2011 new tools and newdata for functional and comparative genomics of the model ciliateParamecium tetraurelia Nucleic Acids Res 39D632ndashD636

Assefa S Lim C Preston MD Duffy CW Nair MB Adroub SA Kadir KAGoldberg JM Neafsey DE Divis P et al 2015 Population genomicstructure and adaptation in the zoonotic malaria parasitePlasmodium knowlesi Proc Natl Acad Sci U S A 11213027ndash13032

Aury JM Jaillon O Duret L Noel B Jubin C Porcel BM Segurens BDaubin V Anthouard V Aiach N et al 2006 Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraureliaNature 444171ndash178

Auton A Brooks LD Durbin RM Garrison EP Kang HM Korbel JOMarchini JL McCarthy S McVean GA Abecasis GR 2015 A globalreference for human genetic variation Nature 52668ndash74

Bankevich A Nurk S Antipov D Gurevich AA Dvorkin M Kulikov ASLesin VM Nikolenko SI Pham S Prjibelski AD et al 2012 SPAdes anew genome assembly algorithm and its applications to single-cellsequencing J Comput Biol 19455ndash477

Barth D Berendonk TU 2011 The mitochondrial genome sequence ofthe ciliate Paramecium caudatum reveals a shift in nucleotide com-position and codon usage within the genus Paramecium BMCGenomics 12272

Barth D Krenek S Fokin SI Berendonk TU 2006 Intraspecific geneticvariation in Paramecium revealed by mitochondrial cytochrome Coxidase I sequences J Eukaryot Microbiol 5320ndash25

Bazin E Glemin S Galtier N 2006 Population size does not influencemitochondrial genetic diversity in animals Science 312570ndash572

Beale G Preer JR 2008 Paramecium genetics and epigenetics BocaRaton (FL) CRC Press

Begun DJ Aquadro CF 1992 Levels of naturally-occurring DNA poly-morphism correlate with recombination rates in Drosophila mela-nogaster Nature 356519ndash520

Begun DJ Holloway AK Stevens K Hillier LW Poh YP Hahn MW NistaPM Jones CD Kern AD Dewey CN et al 2007 Population genomicswhole-genome analysis of polymorphism and divergence inDrosophila simulans PLoS Biol 5e310

Benjamini Y Hochberg Y 1995 Controlling the false discovery ratemdashapractical and powerful approach to multiple testing J Roy Stat Soc BMet 57289ndash300

Betermier M 2004 Large-scale genome remodelling by the developmen-tally programmed elimination of germ line sequences in the ciliateParamecium Res Microbiol 155399ndash408

Birky CW Walsh JB 1988 Effects of linkage on rates of molecular evo-lution Proc Natl Acad Sci U S A 856414ndash6418

Birnbaum RY Clowney EJ Agamy O Kim MJ Zhao J Yamanaka TPappalardo Z Clarke SL Wenger AM Nguyen L et al 2012Coding exons function as tissue-specific enhancers of nearby genesGenome Res 221059ndash1068

Boel G Letso R Neely H Price WN Wong KH Su M Luff JD Valecha MEverett JK Acton TB et al 2016 Codon influence on protein ex-pression in E coli correlates with mRNA levels Nature 529358ndash363

Bolger AM Lohse M Usadel B 2014 Trimmomatic a flexible trimmer forIllumina sequence data Bioinformatics 302114ndash2120

Brandvain Y Wright SI 2016 The limits of natural selection in a non-equilibrium world Trends Genet 32201ndash210

Brunk CF Lee LC Tran AB Li J 2003 Complete sequence of the mito-chondrial genome of Tetrahymena thermophila and comparativemethods for identifying highly divergent genes Nucleic Acids Res311673ndash1682

Burger G Zhu Y Littlejohn TG Greenwood SJ Schnare MN Lang BFGray MW 2000 Complete sequence of the mitochondrial genomeof Tetrahymena pyriformis and comparison with Paramecium aur-elia mitochondrial DNA J Mol Biol 297365ndash380

Burki F 2014 The eukaryotic tree of life from a global phylogenomicperspective Cold Spring Harb Perspect Biol 6a016147

Cao J Schneeberger K Ossowski S Gunther T Bender S Fitz JKoenig D Lanz C Stegle O Lippert C et al 2011 Whole-ge-nome sequencing of multiple Arabidopsis thaliana populationsNat Genet 43956ndash963

Catania F Wurmser F Potekhin AA Przybos E Lynch M 2009 Geneticdiversity in the Paramecium aurelia species complex Mol Biol Evol26421ndash431

Chalker DL Yao MC 2011 DNA elimination in ciliates transposon do-mestication and genome surveillance Annu Rev Genet 45227ndash246

Charlesworth D 2003 Effects of inbreeding on the genetic diversity ofpopulations Philos T Roy Soc B 3581051ndash1070

Charlesworth B Morgan MT Charlesworth D 1993 The effect of dele-terious mutations on neutral molecular variation Genetics1341289ndash1303

Chia JM Song C Bradbury PJ Costich D de Leon N Doebley J Elshire RJGaut B Geller L Glaubitz JC et al 2012 Maize HapMap2 identifiesextant variation from a genome in flux Nat Genet 44803ndash807

Comeron JM Williford A Kliman RM 2008 The Hill-Robertson effectevolutionary consequences of weak selection and linkage in finitepopulations Heredity 10019ndash31

Dey A Chan CK Thomas CG Cutter AD 2013 Molecular hyperdiversitydefines populations of the nematode Caenorhabditis brenneri ProcNatl Acad Sci U S A 11011056ndash11060

Dong X Navratilova P Fredman D Drivenes O Becker TS Lenhard B2010 Exonic remnants of whole-genome duplication reveal cis-regulatory function of coding exons Nucleic Acids Res381071ndash1085

Downing T Imamura H Decuypere S Clark TG Coombs GH Cotton JAHilley JD de Doncker S Maes I Mottram JC et al 2011 Wholegenome sequencing of multiple Leishmania donovani clinical isolatesprovides insights into population structure and mechanisms of drugresistance Genome Res 212143ndash2156

Drummond DA Bloom JD Adami C Wilke CO Arnold FH 2005 Whyhighly expressed proteins evolve slowly Proc Natl Acad Sci U S A10214338ndash14343

Eddy SR 2011 Accelerated profile HMM searches PLoS Comput Biol7e1002195

Edgar RC 2004a MUSCLE a multiple sequence alignment method withreduced time and space complexity BMC Bioinformatics 5113

Edgar RC 2004b MUSCLE multiple sequence alignment with high ac-curacy and high throughput Nucleic Acids Res 321792ndash1797

Eichenlaub MP Ettwiller L 2011 De novo genesis of enhancers in ver-tebrates PLoS Biol 9e1001188

Ellison CE Hall C Kowbel D Welch J Brem RB Glass NL Taylor JW 2011Population genomics and local adaptation in wild isolates of a modelmicrobial eukaryote Proc Natl Acad Sci U S A 1082831ndash2836

Elyashiv E Bullaughey K Sattath S Rinott Y Przeworski M Sella G 2010Shifts in the intensity of purifying selection an analysis of genome-


wide polymorphism data from two closely related yeast speciesGenome Res 201558ndash1573

Engelthaler DM Hicks ND Gillece JD Roe CC Schupp JM Driebe EMGilgado F Carriconde F Trilles L Firacative C et al 2014Cryptococcus gattii in North American Pacific Northwest whole-population genome analysis provides insights into species evolutionand dispersal mBio 5e01464ndashe01414

Fawcett JA Iida T Takuno S Sugino RP Kado T Kugou K Mura SKobayashi T Ohta K Nakayama J et al 2014 Population genomicsof the fission yeast Schizosaccharomyces pombe PLoS ONE9e104241

Fenchel T Finlay BJ 2004 The ubiquity of small species patterns of localand global diversity Bioscience 54777ndash784

Finlay BJ 2002 Global dispersal of free-living microbial eukaryote speciesScience 2961061ndash1063

Finlay BJ Esteban GF Fenchel T 2004 Protist diversity is different Protist15515ndash22

Finn RD Bateman A Clements J Coggill P Eberhardt RY Eddy SR HegerA Hetherington K Holm L Mistry J et al 2014 Pfam the proteinfamilies database Nucleic Acids Res 42D222ndashD230

Foissac S Gouzy J Rombauts S Mathe C Amselem J Sterck L Van dePeer Y Rouze P Schiex T 2008 Genome annotation in plants andfungi EuGene as a model platform Curr Bioinform 387ndash97

Foissner W 1999 Protist diversity estimates of the near-imponderableProtist 150363ndash368

Foissner W 2006 Biogeography and dispersal of micro-organisms areview emphasizing protists Acta Protozool 45111ndash136

Foissner W 2008 Protist diversity and distribution some basic consid-erations Biodivers Conserv 17235ndash242

Foissner W 2011 Dispersal of protists the role of cysts and humanintroductions In Fontaneto D editor Biogeography of microscopicorganisms is everything small everywhere Cambridge UniversityPress p 61ndash87

Foissner W Chao A Katz LA 2008 Diversity and geographic distributionof ciliates (Protista Ciliophora) Biodivers Conserv 17345ndash363

Fournier-Level A Korte A Cooper MD Nordborg M Schmitt J WilczekAM 2011 A map of local adaptation in Arabidopsis thaliana Science33486ndash89

Galtier N Nabholz B Glemin S Hurst GDD 2009 Mitochondrial DNA asa marker of molecular diversity a reappraisal Molr Ecol184541ndash4550

Gout JF Kahn D Duret L Paramecium Post-Genomics C 2010 Therelationship among gene expression the evolution of genedosage and the rate of protein evolution PLoS Genet6e1000944

Gout JF Lynch M 2015 Maintenance and loss of duplicated genes bydosage subfunctionalization Mol Biol Evol 322141ndash2148

Gruchy DF 1955 The breeding system and distribution of Tetrahymenapyriformis J Protozool 2178ndash185

Gustincich S Manfioletti G Del Sal G Schneider C Carninci P 1991 Afast method for high-quality genomic DNA extraction from wholehuman blood BioTechniques 11298ndash300 302

Hancock AM Witonsky DB Alkorta-Aranburu G Beall CMGebremedhin A Sukernik R Utermann G Pritchard JK Coop GDi Rienzo A 2011 Adaptations to climate-mediated selective pres-sures in humans PLoS Genet 7e1001375

Hiwatashi K 1968 Determination and inheritance of mating type inParamecium caudatum Genetics 58373ndash386

Hori M Tomikawa I Przybos E Fujishima M 2006 Comparison of theevolutionary distances among syngens and sibling species ofParamecium Mol Phylogenet Evol 38697ndash704

Huang X Kurata N Wei X Wang ZX Wang A Zhao Q Zhao Y Liu K LuH Li W et al 2012 A map of rice genome variation reveals the originof cultivated rice Nature 490497ndash501

Huang X Zhao Q Han B 2015 Comparative population genomics re-veals strong divergence and infrequent introgression between Asianand African rice Mol Plant 8958ndash960

Hudson RR 2002 Generating samples under a Wright-Fisher neutralmodel of genetic variation Bioinformatics 18337ndash338

Hudson RR Turelli M 2003 Stochasticity overrules the ldquothree-timesrulerdquo genetic drift genetic draft and coalescence times for nuclearloci versus mitochondrial DNA Evolution 57182ndash190

Jaillon O Bouhouche K Gout JF Aury JM Noel B Saudemont BNowacki M Serrano V Porcel BM Segurens B et al 2008Translational control of intron splicing in eukaryotes Nature451359ndash362

James JE Piganeau G Eyre-Walker A 2016 The rate of adaptive evolutionin animal mitochondria Mol Ecol 2567ndash78

Joy DA Feng X Mu J Furuya T Chotivanich K Krettli AU Ho M WangA White NJ Suh E et al 2003 Early origin and recent expansion ofPlasmodium falciparum Science 300318ndash321

Kaplan NL Hudson RR Langley CH 1989 The hitchhiking effect re-visited Genetics 123887ndash899

Katz LA McManus GB Snoeyenbos-West OLO Griffin A Pirog K CostasB Foissner W 2005 Reframing the 0Everything is everywhere0 debateevidence for high gene flow and diversity in ciliate morphospeciesAquat Microb Ecol 4155ndash65

Katz LA Snoeyenbos-West O Doerder FP 2006 Patterns of proteinevolution in Tetrahymena thermophila implications for estimatesof effective population size Mol Biol Evol 23608ndash614

Krenek S Berendonk TU Fokin SI 2015 New Paramecium (CiliophoraOligohymenophorea) congeners shape our view on its biodiversityOrg Divers Evol 15215ndash233

Krenek S Berendonk TU Petzoldt T 2011 Thermal performance curvesof Paramecium caudatum a model selection approach Eur JProtistol 47124ndash137

Krenek S Petzoldt T Berendonk TU 2012 Coping with temperature atthe warm edgendashpatterns of thermal adaptation in the microbialeukaryote Paramecium caudatum PLoS One 7e30598

Kudla G Murray AW Tollervey D Plotkin JB 2009 Coding-sequencedeterminants of gene expression in Escherichia coli Science324255ndash258

Kvitek DJ Will JL Gasch AP 2008 Variations in stress sensitivity andgenomic expression in diverse S cerevisiae isolates PLoS Genet4e1000223

Lahr DJ Laughinghouse HDt Oliverio AM Gao F Katz LA 2014 Howdiscordant morphological and molecular evolution among micro-organisms can revise our notions of biodiversity on Earth Bioessays36950ndash959

Langley CH Stevens K Cardeno C Lee YC Schrider DR Pool JE LangleySA Suarez C Corbett-Detig RB Kolaczkowski B et al 2012 Genomicvariation in natural populations of Drosophila melanogaster Genetics192533ndash598

Leffler EM Bullaughey K Matute DR Meyer WK Segurel L Venkat AAndolfatto P Przeworski M 2012 Revisiting an old riddle whatdetermines genetic diversity levels within species PLoS Biol10e1001388

Li H 2011 A statistical framework for SNP calling mutation discoveryassociation mapping and population genetical parameter estimationfrom sequencing data Bioinformatics 272987ndash2993

Li H Durbin R 2010 Fast and accurate long-read alignment withBurrows-Wheeler transform Bioinformatics 26589ndash595

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R Genome Project Data Processing S 2009 Thesequence alignmentMap format and SAMtools Bioinformatics252078ndash2079

Lim LP Burge CB 2001 A computational analysis of sequence featuresinvolved in recognition of short introns Proc Natl Acad Sci U S A9811193ndash11198

Liti G Carter DM Moses AM Warringer J Parts L James SA Davey RPRoberts IN Burt A Koufopanou V et al 2009 Population genomicsof domestic and wild yeasts Nature 458337ndash341

Loveless MD Hamrick JL 1984 Ecological determinants of genetic-structure in plant-populations Annu Rev Ecol Syst 1565ndash95

Lynch M 2007 The origins of genome architecture Sunderland (MA)Sinauer Associates

Lynch M 2009 Estimation of allele frequencies from high-coverage ge-nome-sequencing projects Genetics 182295ndash301


Lynch M Field MC Goodson HV Malik HS Pereira-Leal JB Roos DSTurkewitz AP Sazer S 2014 Evolutionary cell biology two originsone objective Proc Natl Acad Sci U S A 11116990ndash16994

Lynch M Sung W Morris K Coffey N Landry CR Dopman EB DickinsonWJ Okamoto K Kulkarni S Hartl DL et al 2008 A genome-wideview of the spectrum of spontaneous mutations in yeast Proc NatlAcad Sci U S A 1059272ndash9277

Maere S Heymans K Kuiper M 2005 BiNGO a Cytoscape plugin toassess overrepresentation of gene ontology categories in biologicalnetworks Bioinformatics 213448ndash3449

Maguire B Belk D 1967 Paramecium transport by land snails J Protozool14445ndash447

Maruki T Lynch M 2015 Genotype-frequency estimation from high-throughput sequencing data Genetics 201473ndash486

McGrath CL Gout JF Doak TG Yanagi A Lynch M 2014a Insights intothree whole-genome duplications gleaned from the Parameciumcaudatum genome sequence Genetics 1971417ndash1428

McGrath CL Gout JF Johri P Doak TG Lynch M 2014b Differentialretention and divergent resolution of duplicate genes followingwhole-genome duplication Genome Res 241665ndash1675

Miller JR Delcher AL Koren S Venter E Walenz BP Brownley A JohnsonJ Li K Mobarry C Sutton G 2008 Aggressive assembly of pyrose-quencing reads with mates Bioinformatics 242818ndash2824

Moradian MM Beglaryan D Skozylas JM Kerikorian V 2007 Completemitochondrial genome sequence of three Tetrahymena species re-veals mutation hot spots and accelerated nonsynonymous substi-tutions in Ymf genes PLoS ONE 2e650

Nabholz B Glemin S Galtier N 2009 The erratic mitochondrial clockvariations of mutation rate not population size affect mtDNA di-versity across birds and mammals BMC Evol Biol 954

Nabholz B Mauffrey JF Bazin E Galtier N Glemin S 2008 Determinationof mitochondrial genetic diversity in mammals Genetics178351ndash361

Neafsey DE Galinsky K Jiang RH Young L Sykes SM Saif S Gujja SGoldberg JM Young S Zeng Q et al 2012 The malaria parasitePlasmodium vivax exhibits greater genetic diversity thanPlasmodium falciparum Nat Genet 441046ndash1050

Novembre J Johnson T Bryc K Kutalik Z Boyko AR Auton A Indap AKing KS Bergmann S Nelson MR et al 2008 Genes mirror geogra-phy within Europe Nature 45698ndash101

Ohta T 1974 Mutational pressure as main cause of molecular evolutionand polymorphism Nature 252351ndash354

Ohta T 1992 The nearly neutral theory of molecular evolution AnnuRev Ecol Syst 23263ndash286

Pawlowski J Audic S Adl S Bass D Belbahri L Berney C Bowser SSCepicka I Decelle J Dunthorn M et al 2012 CBOL protist workinggroup barcoding eukaryotic richness beyond the animal plant andfungal kingdoms PLoS Biol 10e1001419

Piganeau G Eyre-Walker A 2009 Evidence for variation in the effectivepopulation size of animal mitochondrial DNA PLoS One 4e4396

Pool JE Nielsen R 2007 Population size changes reshape genomic pat-terns of diversity Evolution 613001ndash3006

Pophaly SD Tellier A 2015 Population level purifying selection and geneexpression shape subgenome evolution in maize Mol Biol Evol323226ndash3235

Prescott DM 1994 The DNA of ciliated protozoa Microbiol Rev58233ndash267

Pritchard AE Seilhamer JJ Mahalingam R Sable CL Venuti SECummings DJ 1990 Nucleotide sequence of the mitochondrial ge-nome of Paramecium Nucleic Acids Res 18173ndash180

Przybos E 1975 Genetic studies of Paramecium jenningsi strains (DillerEarl 1958) Folia Biol (Krakow) 23425ndash471

Przybos E 2005 Recent data on the occurrence of species of theParamecium aurelia complex in Europe Folia Biol-Krakow 5361ndash63

Przybos E Barth D Berendonk TU 2008 The Paramecium aurelia speciescomplex frequency and co-occurrence across Europe Folia Biol(Krakow) 5677ndash81

Przybos E Rautian M Greczek-Stachura M Potekhin A 2007Polymorphism within Paramecium sexaurelia (Ciliophora

Oligohymenophorea) and description of a new stand of the speciesin China Folia Biol (Krakow) 55121ndash125

Przybos E Surmacz M 2010 New world-wide data on the distribution ofspecies of the Paramecium aurelia complex (Ciliophora Protozoa)Folia Biol-Krakow 58185ndash188

R-Core-Team 2014 R A language and environment for statisticalcomputing R Foundation for Statistical Computing ViennaAustria

Redden H Alper HS 2015 The development and characterization ofsynthetic minimal yeast promoters Nat Commun 67810

Ritter DI Dong Z Guo S Chuang JH 2012 Transcriptional enhancers inprotein-coding exons of vertebrate developmental genes PLoS One7e35202

Russell CB Fraga D Hinrichsen RD 1994 Extremely short 20-33 nucle-otide introns are the standard length in Paramecium tetraureliaNucleic Acids Res 221221ndash1225

Salim HMW Ring KL Cavalcanti ARO 2008 Patterns of codon usage intwo ciliates that reassign the genetic code Tetrahymena thermo-phila and Paramecium tetraurelia Protist 159283ndash298

Schacherer J Shapiro JA Ruderfer DM Kruglyak L 2009 Comprehensivepolymorphism survey elucidates population structure ofSaccharomyces cerevisiae Nature 458342ndash345

Schattner P Brooks AN Lowe TM 2005 The tRNAscan-SE snoscan andsnoGPS web servers for the detection of tRNAs and snoRNAsNucleic Acids Res 33W686ndashW689

Schrider DR Kern AD 2015 Inferring selective constraint from popula-tion genomic data suggests recent regulatory turnover in the humanbrain Genome Biol Evol 73511ndash3528

Shannon P Markiel A Ozier O Baliga NS Wang JT Ramage D Amin NSchwikowski B Ideker T 2003 Cytoscape a software environmentfor integrated models of biomolecular interaction networks GenomeRes 132498ndash2504

Simonsen KL Churchill GA Aquadro CF 1995 Properties of statisticaltests of neutrality for DNA polymorphism data Genetics141413ndash429

Smith DR Lee RW 2008 Nucleotide diversity in the mitochondrialand nuclear compartments of Chlamydomonas reinhardtii in-vestigating the origins of genome architecture BMC Evol Biol8156

Smith JM Haigh J 1974 The hitch-hiking effect of a favourable geneGenetics Res 89391ndash403

Snoke MS Berendonk TU Barth D Lynch M 2006 Large global effectivepopulation sizes in Paramecium Mol Biol Evol 232474ndash2479

Sonneborn TM 1970 Methods in Paramecium research In MethodsCell Biol New York and London Academic Press p 241ndash339

Sonneborn TM 1975 Paramecium aurelia complex of fourteen siblingspecies T Am Microsc Soc 94155ndash178

Stamatakis A 2014 RAxML version 8 a tool for phylogenetic anal-ysis and post-analysis of large phylogenies Bioinformatics301312ndash1313

Stoeck T Welter H Seitz-Bender D Kusch J Schmidt HJ 2000 ARDRAand RAPD-fingerprinting reject the sibling species concept for theciliate Paramecium caudatum (Ciliophora Protoctista) Zool Scr2975ndash82

Sun M 1999 Cleistogamy in Scutellaria indica (Labiatae) effectivemating system and population genetic structure Mol Ecol81285ndash1295

Swart EC Nowacki M Shum J Stiles H Higgins BP Doak TG SchotanusK Magrini VJ Minx P Mardis ER et al 2012 The Oxytricha trifallaxmitochondrial genome Genome Biol Evol 4136ndash154

Tajima F 1989 Statistical method for testing the neutral mutation hy-pothesis by DNA polymorphism Genetics 123585ndash595

Tishkoff SA Reed FA Ranciaro A Voight BF Babbitt CC Silverman JSPowell K Mortensen HM Hirbo JB Osman M et al 2007Convergent adaptation of human lactase persistence in Africa andEurope Nat Genet 3931ndash40

Tsukii Y Hiwatashi K 1983 Genes controlling mating-type specificity inParamecium caudatum three loci revealed by intersyngenic crossesGenetics 10441ndash62


Tsukii Y Hiwatashi K 1985 Meiotic nondisjunction and aneuploids inintersyngenic hybrids of Paramecium caudatum Genetics111779ndash794

Tuller T Zur H 2015 Multiple roles of the coding sequence 5rsquo end ingene expression regulation Nucleic Acids Res 43 13ndash28

Voight BF Kudaravalli S Wen X Pritchard JK 2006 A map of recentpositive selection in the human genome PLoS Biol 4e72

Volkman SK Sabeti PC DeCaprio D Neafsey DE Schaffner SFMilner DA Jr Daily JP Sarr O Ndiaye D Ndir O et al 2007A genome-wide map of diversity in Plasmodium falciparum NatGenet 39113ndash119

Weir W Capewell P Foth B Clucas C Pountain A Steketee P Veitch NKoffi M De Meeus T Kabore J et al 2016 Population genomicsreveals the origin and asexual evolution of human infective trypano-somes eLife 5e11473

Weisse T 2008 Distribution and diversity of aquatic protists anevolutionary and ecological perspective Biodivers Conserv17243ndash259

Wickham H 2010 ggplot2 Elegant graphics for data analysis New YorkSpringer-Verlag

Zhang JZ Yang JR 2015 Determinants of the rate of protein sequenceevolution Nat Rev Genet 16409ndash420


Supplementary Material online). For both pipelines, sites were only considered for SNP detection and further analysis if the mapping quality was above 30, base quality was above 20, per-base alignment quality was above 15, and the sum of the depth of coverage for all individuals was about 5 times the number of individuals and less than twice the average population coverage. SNPs were called by bcftools if the SNP quality score was above 30, and by GFE if the significance score > 5.91 (P > 0.05). The exact parameters of the Bcftools-Vcftools pipeline to call SNPs and indels are given below:

samtools mpileup -q 30 -Q 20 -t DP -uvf reference_genome.fasta sorted.bam > species.vcf

bcftools call -f GQ -o species_snps.vcf -O v -V indels -cv species.vcf

bcftools call -f GQ -o species_indels.vcf -O v -V snps -cv species.vcf

# filtering snp and indel files for depth and biases
perl vcfutils.pl varFilter -d 50 -D 800 -1 -2 -3 -4 species_snps.vcf > species_snps.filtered1.vcf

perl vcfutils.pl varFilter -d 50 -D 800 -1 -2 -3 -4 species_indels.vcf > species_indels.filtered1.vcf

vcftools --vcf species_snps.filtered1.vcf --minQ 20 --minGQ 30 --minDP 4 --recode --recode-INFO-all --out species_snps.filtered2

vcftools --vcf species_indels.filtered1.vcf --minQ 20 --minGQ 30 --minDP 4 --recode --recode-INFO-all --out species_indels.filtered2

# filter out the recoded sites that are not polymorphic anymore, that is, AF1 = 1
bcftools view -q 0.001 species_snps.filtered2.recode.vcf -o species_snps.final.vcf

bcftools view -q 0.001 species_indels.filtered2.recode.vcf -o species_indels.final.vcf

Mitochondrial SNPs were called using the Bcftools-Vcftools pipeline, with the same parameters as noted above.

Simulation of SNP Detection with Varying Levels of Divergence from the Reference
We simulated reads using wgsim (https://github.com/lh3/wgsim) for a diploid genome with varying levels of divergence from the reference genomes of P. caudatum and P. multimicronucleatum, using average values of read length, total number of reads, and insert sizes.

For the P. caudatum reference genome:
wgsim -d 287 -s 113 -N 10201087 -1 151 -2 151 -r 0.10 -R 0.013 caudatum_assembly.fasta caud_010_1.fastq caud_010_2.fastq >& caud_010.info

For the P. multimicronucleatum reference genome:
wgsim -d 261 -s 102 -N 12154044 -1 151 -2 151 -r 0.10 -R 0.011 multimicronucleatum_assembly.fasta multi_010_1.fastq multi_010_2.fastq >& multi_010.info

Reads were simulated for r = 0.0, 0.05, 0.10, 0.15, 0.20, 0.25, and 0.30, which represents the fraction of the genome with SNPs. The SNP detection pipeline described above was used to call SNPs, which were compared to the simulated set of SNPs (supplementary table S4, Supplementary Material online).
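One way to compare called SNPs with the simulated set is to treat both as sets of genomic positions and count the overlap. The sketch below is an assumption about how such a comparison could be scored (sensitivity and precision); the text does not specify the metrics used, and the function name and input format are hypothetical.

```python
def snp_call_accuracy(simulated, called):
    """Compare two collections of (scaffold, position) tuples.

    simulated: SNP positions introduced by the read simulator.
    called: SNP positions reported by the calling pipeline.
    """
    sim, obs = set(simulated), set(called)
    true_pos = len(sim & obs)
    return {
        "true_positives": true_pos,
        "false_negatives": len(sim - obs),
        "false_positives": len(obs - sim),
        # Fraction of simulated SNPs that were recovered.
        "sensitivity": true_pos / len(sim) if sim else float("nan"),
        # Fraction of called SNPs that were actually simulated.
        "precision": true_pos / len(obs) if obs else float("nan"),
    }
```

Running this for each simulated value of r would show how detection degrades as divergence from the reference increases.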

Building Phylogenies
A consensus genome (bcftools view species_filtered.recode.vcf | vcfutils.pl vcf2fq > species.cns.fq) was obtained for each individual using bcftools, with heterozygous sites represented by DNA ambiguity codes. Gene nucleotide sequences that are present in single copy in the P. aurelia species and are also present in both outgroup species (P. caudatum and P. multimicronucleatum) were extracted, and multiple sequence alignments were generated using MUSCLE (Edgar 2004a, 2004b). This amounted to about 55 sequences (including the reference genomes) in each of 883 sets of orthologous genes. RAxML 8.0.26 (Stamatakis 2014) was used to build a tree using the concatenation of all 883 aligned gene sequences (raxml -s concatenated_sequences.fasta -n gene -m GTRGAMMA -T 4 -p 7). In addition, individual gene trees (raxml -s gene.fasta -n gene -m GTRGAMMA -T 4 -p 7) were built, which were then used to create a consensus tree (raxml -J MR -z trees.newick -T 4 -m GTRGAMMA) (Aberer et al. 2010).
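To illustrate how a heterozygous site is represented in such a consensus sequence, the following Python sketch maps a pair of alleles to the standard IUPAC nucleotide ambiguity code. It is an illustration only and does not reproduce the internals of vcfutils.pl; the names are hypothetical.

```python
# IUPAC nucleotide codes for unordered pairs of bases.
IUPAC_CODES = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def consensus_base(allele_a, allele_b):
    """Return the consensus symbol for a diploid genotype: the base itself
    for homozygotes, the two-base ambiguity code for heterozygotes."""
    return IUPAC_CODES[frozenset((allele_a.upper(), allele_b.upper()))]
```

For example, an A/G heterozygote is written as R, so downstream alignment and tree-building tools see a single sequence per individual.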

The mitochondrial phylogeny was built using the same parameters as above by concatenating eight sets of orthologous genes (NADH_dehydrogenase_subunit_1_a, NADH_dehydrogenase_subunit_5, Ymf56orf78, Ymf67orf265_1, Ymf68orf393, apocytochrome_b, cytochrome_c_oxidase_subunit_1, cytochrome_c_oxidase_subunit_2, ribosomal_protein_S14) that were ubiquitously present across all sampled species.

Principal Component Analysis

Genotypes for all individuals were obtained from the vcf files generated by vcftools for all scaffolds. Only sites that had a genotype available for all individuals were used in the analysis, i.e., if even a single individual had missing data, the site was excluded. A total of 10,000 markers were used for each species. The first marker of each scaffold was picked randomly within the first 50 SNP positions, and the rest were selected so that all markers were uniformly distributed across all scaffolds. PCA was carried out on these markers using the R (R-Core-Team 2014) command prcomp (https://stat.ethz.ch/R-manual/R-devel/library/stats/html/prcomp.html) and visualized using the first two components with the R graphing package ggplot2 (Wickham 2010).
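The marker selection described above could be implemented along the following lines. This is a sketch under stated assumptions: "uniformly distributed" is interpreted here as taking every k-th complete SNP after a random start, with k chosen so that the total is close to the target number of markers. The function names, the input layout, and the use of None for missing genotypes are hypothetical.

```python
import random


def complete_sites(genotypes_by_site):
    """Keep only sites where every individual has a genotype (None = missing)."""
    return [site for site, calls in genotypes_by_site.items()
            if all(call is not None for call in calls)]


def select_markers(snps_by_scaffold, n_markers=10000, first_window=50, seed=None):
    """Pick roughly n_markers evenly spaced SNPs across all scaffolds.

    snps_by_scaffold maps a scaffold name to a sorted list of SNP positions
    that passed the completeness filter.
    """
    rng = random.Random(seed)
    total = sum(len(p) for p in snps_by_scaffold.values())
    step = max(1, total // n_markers)  # assumed spacing, in SNP-index units
    selected = {}
    for scaffold, positions in snps_by_scaffold.items():
        if not positions:
            continue
        # The first marker is drawn at random from the first 50 SNPs.
        start = rng.randrange(min(first_window, len(positions)))
        selected[scaffold] = positions[start::step]
    return selected
```

The resulting genotype matrix (individuals by markers) would then be passed to a PCA routine such as prcomp in R.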

Calculation of Population Genetic Statistics

Nucleotide Diversity and Tajima’s D

Nucleotide diversity was calculated as 2pqn/(n − 1), where p and q are the major and minor allele frequencies and n is the number of chromosomes. The major and minor allele frequencies were given for each site by GFE. Diversity at coding sites was partitioned into synonymous and non-synonymous components (e.g., 2-fold degenerate sites are one-third synonymous and two-thirds non-synonymous). Major allele frequencies were also extracted from vcftools and returned very similar results to those from GFE (supplementary table S3, Supplementary Material online).
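A minimal sketch of the per-site diversity calculation and its partitioning is given below. The weights for 0-fold and 4-fold degenerate sites (fully non-synonymous and fully synonymous, respectively) and the way per-site values are averaged over the effective number of sites are assumptions made for illustration; only the 2-fold weighting is stated in the text.

```python
def site_diversity(p, n):
    """Per-site diversity 2pq * n/(n - 1) for major-allele frequency p
    sampled from n chromosomes."""
    if n < 2:
        raise ValueError("need at least two chromosomes")
    q = 1.0 - p
    return 2.0 * p * q * n / (n - 1)


# Fraction of a site counted as synonymous, keyed by degeneracy.
# Only the 2-fold value comes from the text; 0 and 4 are assumed.
SYN_FRACTION = {0: 0.0, 2: 1.0 / 3.0, 4: 1.0}


def partitioned_diversity(sites, n):
    """Return (pi_synonymous, pi_nonsynonymous).

    sites is an iterable of (major_allele_frequency, degeneracy) pairs.
    """
    syn_sum = syn_sites = non_sum = non_sites = 0.0
    for p, degeneracy in sites:
        f = SYN_FRACTION[degeneracy]
        pi = site_diversity(p, n)
        syn_sum += f * pi
        syn_sites += f
        non_sum += (1.0 - f) * pi
        non_sites += 1.0 - f
    pi_s = syn_sum / syn_sites if syn_sites else None
    pi_n = non_sum / non_sites if non_sites else None
    return pi_s, pi_n
```

With n = 10 chromosomes and p = 0.5, the per-site value is 2 × 0.25 × 10/9, i.e., slightly above the uncorrected 2pq of 0.5.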

Population Genomics of Paramecium Species. doi:10.1093/molbev/msx074. MBE


A sliding window approach was used to calculate Tajima’s D across the genome. Vcftools was used to obtain these values for non-overlapping windows of 1 kb in length. The program ms (Hudson 2002) was used to perform 100,000 simulations of 10 individuals with 1 kb locus length, and niceStats (https://github.com/kern-lab/shIC/blob/master/niceStats.c) was used to calculate Tajima’s D from the ms output file. To obtain Tajima’s D for different categories of sites, SNPs belonging to a particular category of sites (specified in Results) were selected and vcftools was used to calculate Tajima’s D for non-overlapping windows of 10 kb. Windows with 2 or more SNPs were used to calculate the mean, and only alternate windows were used to compare distributions of Tajima’s D across different categories by Student’s t-test, to maximise independence of sampling.
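For reference, the statistic computed per window can be written out explicitly. The sketch below implements the standard Tajima (1989) formula and a simple windowing step; it is not the vcftools or niceStats implementation, and the function names and the (position, minor allele count) input format are assumptions.

```python
import math


def tajimas_d(n, seg_sites, pairwise_diff):
    """Tajima's D for n sampled chromosomes, S segregating sites and
    mean pairwise difference pi. Returns None when undefined."""
    if n < 4 or seg_sites < 1:
        return None
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pairwise_diff - seg_sites / a1) / math.sqrt(variance)


def windowed_tajimas_d(snps, n, window=1000, min_snps=2):
    """Tajima's D in non-overlapping windows.

    snps is an iterable of (1-based position, minor allele count) pairs;
    windows with fewer than min_snps SNPs are skipped.
    """
    by_window = {}
    for pos, count in snps:
        by_window.setdefault((pos - 1) // window, []).append(count)
    result = {}
    for idx, counts in sorted(by_window.items()):
        if len(counts) < min_snps:
            continue
        # Per-site pairwise difference: 2c(n - c) / (n(n - 1)).
        pi = sum(2.0 * c * (n - c) / (n * (n - 1)) for c in counts)
        result[idx] = tajimas_d(n, len(counts), pi)
    return result
```

Keeping only every other window, for example those with an even index, gives the alternate windows used when comparing distributions between site categories.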

Plotting Average Expected Heterozygosity across Introns and Intergenic Regions

All introns (<500 bp in length) were split midway and aligned according to the 5′ and 3′ splice sites. All intergenic regions (<500 bp in the outgroup species and <1,000 bp in the P. aurelia species) were split midway and aligned according to the start and stop codons. Note that these were not sequence alignments. At each site, the mean expected heterozygosity was calculated. Sites with an insufficient number of data points (<200 data points per site for introns; for intergenic regions, <100 data points per site for the outgroups and <1,000 data points per site in the P. aurelia species) were discarded.
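The positional averaging can be sketched as follows. Each region is split at its midpoint, the first half is indexed by distance from the 5′ boundary and the second half by distance from the 3′ boundary, and means are taken per offset. The function name, the input layout, and the assignment of the middle site of odd-length regions to the 3′ half are assumptions of this example.

```python
from collections import defaultdict


def end_anchored_means(regions, max_len=500, min_points=200):
    """Mean per-site heterozygosity by distance from each end of a region.

    regions is a list of lists of per-site expected heterozygosity in
    5'-to-3' order; None marks a site without data.
    """
    sums = {"5p": defaultdict(float), "3p": defaultdict(float)}
    counts = {"5p": defaultdict(int), "3p": defaultdict(int)}
    for values in regions:
        length = len(values)
        if length == 0 or length >= max_len:
            continue  # only regions shorter than max_len are used
        half = length // 2
        for i, h in enumerate(values):
            if h is None:
                continue
            if i < half:
                end, offset = "5p", i               # distance from the 5' boundary
            else:
                end, offset = "3p", length - 1 - i  # distance from the 3' boundary
            sums[end][offset] += h
            counts[end][offset] += 1
    # Offsets supported by too few data points are discarded.
    return {
        end: {off: sums[end][off] / counts[end][off]
              for off in sorted(counts[end]) if counts[end][off] >= min_points}
        for end in ("5p", "3p")
    }
```

Because regions differ in length, offsets far from the boundaries are covered by fewer regions, which is why a minimum number of data points per offset is required.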

Calculating and Plotting Divergence across Introns and Intergenic Regions

Orthologous coding sequences along with introns were aligned using MUSCLE (Edgar 2004a). Intergenic regions flanked by orthologous genes on both sides were also aligned along with the flanking coding exon using MUSCLE. For each species separately, divergence at different positions (with respect to the start and stop sites) was averaged across all introns and intergenic regions. The divergence values were corrected for multiple substitutions using the Jukes-Cantor sequence substitution model and plotted in the same manner as described above.
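The Jukes-Cantor correction converts an observed proportion of differing sites, p, into an estimated number of substitutions per site, d = −(3/4) ln(1 − 4p/3). A small sketch follows; the function name and the choice to return None when the correction is undefined are illustrative.

```python
import math


def jukes_cantor_distance(p):
    """Jukes-Cantor corrected divergence for an observed proportion p of
    differing sites. Undefined (None) for p >= 0.75 or p < 0."""
    if not 0.0 <= p < 0.75:
        return None
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

For small p the corrected value is close to p itself, and it grows without bound as p approaches 0.75, the level expected for saturated sequences.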

GO Analysis

All gene ontology enrichment analyses were performed using the BiNGO plugin for Cytoscape 3.2.1 (Shannon et al. 2003; Maere et al. 2005), using the hypergeometric test to assess significance with Benjamini and Hochberg FDR correction for multiple testing (Benjamini and Hochberg 1995). The latest ontology files (.obo) were downloaded from www.geneontology.org, and custom annotation files were prepared as required using Python scripts.
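The two statistical steps named above, a one-sided hypergeometric test followed by Benjamini and Hochberg adjustment, can be illustrated without BiNGO. The sketch below is a plain reimplementation for exposition; the function names and argument order are assumptions and do not mirror BiNGO's interface.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """P(X >= k): probability of seeing at least k annotated genes in a
    test set of n genes, when K of the N background genes carry the term."""
    upper = min(n, K)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In practice, one p-value is computed per GO term and the full vector is adjusted together before applying a significance threshold.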

Identification of Heterozygous Regions in the P. aurelia Species

The fraction of heterozygous sites was calculated for 10 bp non-overlapping windows across the genome for all three P. aurelia species, and the top 1% of such regions, i.e., those with the highest fraction of heterozygous sites, were identified as heterozygous regions.
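A minimal sketch of this windowing step is shown below for a single scaffold. It assumes that the threshold is the top 1% of windows and that heterozygosity has been reduced to a 0/1 flag per site; the function name and both parameters are hypothetical and exposed so that other window sizes or cut-offs can be tried.

```python
def heterozygous_windows(het_flags, window=10, top_fraction=0.01):
    """Return the (start, fraction) pairs for the most heterozygous windows.

    het_flags is a sequence of 0/1 values, one per site along a scaffold;
    start is the 0-based coordinate of the first site in the window.
    """
    fractions = []
    for start in range(0, len(het_flags) - window + 1, window):
        chunk = het_flags[start:start + window]
        fractions.append((start, sum(chunk) / window))
    if not fractions:
        return []
    # Keep at least one window so short scaffolds still report a result.
    n_top = max(1, int(len(fractions) * top_fraction))
    ranked = sorted(fractions, key=lambda w: w[1], reverse=True)
    return ranked[:n_top]
```

For a genome-wide ranking, the per-window fractions from all scaffolds would be pooled before taking the top fraction.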

Mitochondrial Genome Assembly and Annotation

Mitochondrial genomes were assembled as follows. Sequence reads were assembled using SPAdes (version 3.5.0) (Bankevich et al. 2012) after removing potential adapter sequence with Trimmomatic (version 0.33) (Bolger et al. 2014). Mitochondrial contigs were identified from the resulting assemblies by BLAST searches against the published P. caudatum and P. tetraurelia mitochondrial genomes (Pritchard et al. 1990; Barth and Berendonk 2011).

Protein-coding genes were identified by generating all ORFs longer than 60 aa in all six frames using the Mold, Protozoan, and Coelenterate Mitochondrial Code (i.e., UGA codes for W instead of being a stop codon) and all alternative start codons specific to Paramecium (AUU, AUA, AUG, AUC, GUG, and GUA), and retaining the longest ORF associated with each stop codon. BLASTP (Altschul et al. 1997) was then used to identify homologs of annotated mitochondrial proteins in P. tetraurelia and P. caudatum. Additional ORFs were identified by imposing the requirement that their length exceeds 100 aa and were subsequently annotated using BLASTP against the non-redundant protein sequences (nr) database and HMMER 3.0 (Eddy 2011) scans against the PFAM 27.0 database (Finn et al. 2014).
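The ORF-generation step can be sketched as a six-frame scan in which TGA is read through (it encodes tryptophan under this code) and the first permitted start codon after the previous stop is retained, which yields the longest ORF for each stop codon. This is an illustrative reimplementation on a DNA alphabet, not the script used in the study; the function names and the coordinate convention are assumptions.

```python
STOPS = {"TAA", "TAG"}  # TGA is not a stop under the mitochondrial code used here
STARTS = {"ATT", "ATA", "ATG", "ATC", "GTG", "GTA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def longest_orfs(seq, min_aa=60):
    """Return (strand, start, end, n_codons) for the longest ORF ending at
    each stop codon, over all six frames.

    Coordinates are 0-based, end-exclusive and include the stop codon; for
    the "-" strand they refer to the reverse-complemented sequence.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            first_start = None  # earliest start codon since the last stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if codon in STOPS:
                    if first_start is not None:
                        n_codons = (i - first_start) // 3
                        if n_codons > min_aa:
                            orfs.append((strand, first_start, i + 3, n_codons))
                    first_start = None
                elif first_start is None and codon in STARTS:
                    first_start = i
    return orfs
```

The resulting ORFs would then be translated and searched against known mitochondrial proteins, as described above.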

tRNA genes were annotated with tRNAscan-SE 1.21 (Schattner et al. 2005) using the “Mito/Chloroplast” source.

Supplementary Material

Supplementary data are available at Molecular Biology and Evolution online.

Acknowledgments

We thank Lydia Bright and Jean-Francois Gout for helpful comments on the manuscript, Matthew S. Ackerman for insightful discussions related to the project, and the anonymous reviewers and editor for their valuable suggestions. We would also like to thank Kelley Thomas for providing help with the sequencing and E. Przybos for kindly providing several Paramecium strains and the species identification of some so far unspecified strains. This work was financially supported by the National Science Foundation (MCB-1050161 and DEB-1257806), the European Commission FP7-PEOPLE-2009-IRSES project CINAR PATHOBACTER (project number 247658), and the CNRS research network GDRE “Paramecium Genome Dynamics and Evolution”. T.U.B. also acknowledges funding of the “support the best” of the ZUK TU Dresden and the Deutsche Forschungsgemeinschaft (DFG).

References

Abbott RJ, Gomes MF. 1989. Population genetic-structure and outcrossing rate of Arabidopsis-Thaliana (L.) Heynh. Heredity 62:411–418.

Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, McVean GA. 2010. A map of human genome variation from population-scale sequencing. Nature 467:1061–1073.


Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA. 2012. An integrated map of genetic variation from 1,092 human genomes. Nature 491:56–65.

Aberer AJ, Pattengale ND, Stamatakis A. 2010. Parallelized phylogenetic post-analysis on multi-core architectures. J Comput Sci-Neth. 1:107–114.

Ackermann AA, Panunzi LG, Cosentino RO, Sanchez DO, Aguero F. 2012. A genomic scale map of genetic diversity in Trypanosoma cruzi. BMC Genomics 13:736.

Altenhoff AM, Skunca N, Glover N, Train CM, Sueki A, Pilizota I, Gori K, Tomiczek B, Muller S, Redestig H, et al. 2015. The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Res. 43:D240–D249.

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389–3402.

Arnaiz O, Cain S, Cohen J, Sperling L. 2007. ParameciumDB: a community resource that integrates the Paramecium tetraurelia genome sequence with genetic data. Nucleic Acids Res. 35:D439–D444.

Arnaiz O, Mathy N, Baudry C, Malinsky S, Aury JM, Wilkes CD, Garnier O, Labadie K, Lauderdale BE, Le Mouel A, et al. 2012. The Paramecium germline genome provides a niche for intragenic parasitic DNA: evolutionary dynamics of internal eliminated sequences. PLoS Genet. 8:e1002984.

Arnaiz O, Sperling L. 2011. ParameciumDB in 2011: new tools and new data for functional and comparative genomics of the model ciliate Paramecium tetraurelia. Nucleic Acids Res. 39:D632–D636.

Assefa S, Lim C, Preston MD, Duffy CW, Nair MB, Adroub SA, Kadir KA, Goldberg JM, Neafsey DE, Divis P, et al. 2015. Population genomic structure and adaptation in the zoonotic malaria parasite Plasmodium knowlesi. Proc Natl Acad Sci U S A. 112:13027–13032.

Aury JM, Jaillon O, Duret L, Noel B, Jubin C, Porcel BM, Segurens B, Daubin V, Anthouard V, Aiach N, et al. 2006. Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature 444:171–178.

Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR. 2015. A global reference for human genetic variation. Nature 526:68–74.

Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, et al. 2012. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 19:455–477.

Barth D, Berendonk TU. 2011. The mitochondrial genome sequence of the ciliate Paramecium caudatum reveals a shift in nucleotide composition and codon usage within the genus Paramecium. BMC Genomics 12:272.

Barth D, Krenek S, Fokin SI, Berendonk TU. 2006. Intraspecific genetic variation in Paramecium revealed by mitochondrial cytochrome C oxidase I sequences. J Eukaryot Microbiol. 53:20–25.

Bazin E, Glemin S, Galtier N. 2006. Population size does not influence mitochondrial genetic diversity in animals. Science 312:570–572.

Beale G, Preer JR. 2008. Paramecium: genetics and epigenetics. Boca Raton (FL): CRC Press.

Begun DJ, Aquadro CF. 1992. Levels of naturally-occurring DNA polymorphism correlate with recombination rates in Drosophila melanogaster. Nature 356:519–520.

Begun DJ, Holloway AK, Stevens K, Hillier LW, Poh YP, Hahn MW, Nista PM, Jones CD, Kern AD, Dewey CN, et al. 2007. Population genomics: whole-genome analysis of polymorphism and divergence in Drosophila simulans. PLoS Biol. 5:e310.

Benjamini Y, Hochberg Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B Met. 57:289–300.

Betermier M. 2004. Large-scale genome remodelling by the developmentally programmed elimination of germ line sequences in the ciliate Paramecium. Res Microbiol. 155:399–408.

Birky CW, Walsh JB. 1988. Effects of linkage on rates of molecular evolution. Proc Natl Acad Sci U S A. 85:6414–6418.

Birnbaum RY, Clowney EJ, Agamy O, Kim MJ, Zhao J, Yamanaka T, Pappalardo Z, Clarke SL, Wenger AM, Nguyen L, et al. 2012. Coding exons function as tissue-specific enhancers of nearby genes. Genome Res. 22:1059–1068.

Boel G, Letso R, Neely H, Price WN, Wong KH, Su M, Luff JD, Valecha M, Everett JK, Acton TB, et al. 2016. Codon influence on protein expression in E. coli correlates with mRNA levels. Nature 529:358–363.

Bolger AM, Lohse M, Usadel B. 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30:2114–2120.

Brandvain Y, Wright SI. 2016. The limits of natural selection in a nonequilibrium world. Trends Genet. 32:201–210.

Brunk CF, Lee LC, Tran AB, Li J. 2003. Complete sequence of the mitochondrial genome of Tetrahymena thermophila and comparative methods for identifying highly divergent genes. Nucleic Acids Res. 31:1673–1682.

Burger G, Zhu Y, Littlejohn TG, Greenwood SJ, Schnare MN, Lang BF, Gray MW. 2000. Complete sequence of the mitochondrial genome of Tetrahymena pyriformis and comparison with Paramecium aurelia mitochondrial DNA. J Mol Biol. 297:365–380.

Burki F. 2014. The eukaryotic tree of life from a global phylogenomic perspective. Cold Spring Harb Perspect Biol. 6:a016147.

Cao J, Schneeberger K, Ossowski S, Gunther T, Bender S, Fitz J, Koenig D, Lanz C, Stegle O, Lippert C, et al. 2011. Whole-genome sequencing of multiple Arabidopsis thaliana populations. Nat Genet. 43:956–963.

Catania F, Wurmser F, Potekhin AA, Przybos E, Lynch M. 2009. Genetic diversity in the Paramecium aurelia species complex. Mol Biol Evol. 26:421–431.

Chalker DL, Yao MC. 2011. DNA elimination in ciliates: transposon domestication and genome surveillance. Annu Rev Genet. 45:227–246.

Charlesworth D. 2003. Effects of inbreeding on the genetic diversity of populations. Philos T Roy Soc B. 358:1051–1070.

Charlesworth B, Morgan MT, Charlesworth D. 1993. The effect of deleterious mutations on neutral molecular variation. Genetics 134:1289–1303.

Chia JM, Song C, Bradbury PJ, Costich D, de Leon N, Doebley J, Elshire RJ, Gaut B, Geller L, Glaubitz JC, et al. 2012. Maize HapMap2 identifies extant variation from a genome in flux. Nat Genet. 44:803–807.

Comeron JM, Williford A, Kliman RM. 2008. The Hill-Robertson effect: evolutionary consequences of weak selection and linkage in finite populations. Heredity 100:19–31.

Dey A, Chan CK, Thomas CG, Cutter AD. 2013. Molecular hyperdiversity defines populations of the nematode Caenorhabditis brenneri. Proc Natl Acad Sci U S A. 110:11056–11060.

Dong X, Navratilova P, Fredman D, Drivenes O, Becker TS, Lenhard B. 2010. Exonic remnants of whole-genome duplication reveal cis-regulatory function of coding exons. Nucleic Acids Res. 38:1071–1085.

Downing T, Imamura H, Decuypere S, Clark TG, Coombs GH, Cotton JA, Hilley JD, de Doncker S, Maes I, Mottram JC, et al. 2011. Whole genome sequencing of multiple Leishmania donovani clinical isolates provides insights into population structure and mechanisms of drug resistance. Genome Res. 21:2143–2156.

Drummond DA, Bloom JD, Adami C, Wilke CO, Arnold FH. 2005. Why highly expressed proteins evolve slowly. Proc Natl Acad Sci U S A. 102:14338–14343.

Eddy SR. 2011. Accelerated profile HMM searches. PLoS Comput Biol. 7:e1002195.

Edgar RC. 2004a. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5:113.

Edgar RC. 2004b. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32:1792–1797.

Eichenlaub MP, Ettwiller L. 2011. De novo genesis of enhancers in vertebrates. PLoS Biol. 9:e1001188.

Ellison CE, Hall C, Kowbel D, Welch J, Brem RB, Glass NL, Taylor JW. 2011. Population genomics and local adaptation in wild isolates of a model microbial eukaryote. Proc Natl Acad Sci U S A. 108:2831–2836.

Elyashiv E, Bullaughey K, Sattath S, Rinott Y, Przeworski M, Sella G. 2010. Shifts in the intensity of purifying selection: an analysis of genome-wide polymorphism data from two closely related yeast species. Genome Res. 20:1558–1573.

Engelthaler DM, Hicks ND, Gillece JD, Roe CC, Schupp JM, Driebe EM, Gilgado F, Carriconde F, Trilles L, Firacative C, et al. 2014. Cryptococcus gattii in North American Pacific Northwest: whole-population genome analysis provides insights into species evolution and dispersal. mBio 5:e01464–e01414.

Fawcett JA, Iida T, Takuno S, Sugino RP, Kado T, Kugou K, Mura S, Kobayashi T, Ohta K, Nakayama J, et al. 2014. Population genomics of the fission yeast Schizosaccharomyces pombe. PLoS ONE 9:e104241.

Fenchel T, Finlay BJ. 2004. The ubiquity of small species: patterns of local and global diversity. Bioscience 54:777–784.

Finlay BJ. 2002. Global dispersal of free-living microbial eukaryote species. Science 296:1061–1063.

Finlay BJ, Esteban GF, Fenchel T. 2004. Protist diversity is different? Protist 155:15–22.

Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J, et al. 2014. Pfam: the protein families database. Nucleic Acids Res. 42:D222–D230.

Foissac S, Gouzy J, Rombauts S, Mathe C, Amselem J, Sterck L, Van de Peer Y, Rouze P, Schiex T. 2008. Genome annotation in plants and fungi: EuGene as a model platform. Curr Bioinform. 3:87–97.

Foissner W. 1999. Protist diversity: estimates of the near-imponderable. Protist 150:363–368.

Foissner W. 2006. Biogeography and dispersal of micro-organisms: a review emphasizing protists. Acta Protozool. 45:111–136.

Foissner W. 2008. Protist diversity and distribution: some basic considerations. Biodivers Conserv. 17:235–242.

Foissner W. 2011. Dispersal of protists: the role of cysts and human introductions. In: Fontaneto D, editor. Biogeography of microscopic organisms: is everything small everywhere? Cambridge University Press. p. 61–87.

Foissner W, Chao A, Katz LA. 2008. Diversity and geographic distribution of ciliates (Protista: Ciliophora). Biodivers Conserv. 17:345–363.

Fournier-Level A, Korte A, Cooper MD, Nordborg M, Schmitt J, Wilczek AM. 2011. A map of local adaptation in Arabidopsis thaliana. Science 334:86–89.

Galtier N, Nabholz B, Glemin S, Hurst GDD. 2009. Mitochondrial DNA as a marker of molecular diversity: a reappraisal. Mol Ecol. 18:4541–4550.

Gout JF, Kahn D, Duret L, Paramecium Post-Genomics C. 2010. The relationship among gene expression, the evolution of gene dosage, and the rate of protein evolution. PLoS Genet. 6:e1000944.

Gout JF, Lynch M. 2015. Maintenance and loss of duplicated genes by dosage subfunctionalization. Mol Biol Evol. 32:2141–2148.

Gruchy DF. 1955. The breeding system and distribution of Tetrahymena pyriformis. J Protozool. 2:178–185.

Gustincich S, Manfioletti G, Del Sal G, Schneider C, Carninci P. 1991. A fast method for high-quality genomic DNA extraction from whole human blood. BioTechniques 11:298–300, 302.

Hancock AM, Witonsky DB, Alkorta-Aranburu G, Beall CM, Gebremedhin A, Sukernik R, Utermann G, Pritchard JK, Coop G, Di Rienzo A. 2011. Adaptations to climate-mediated selective pressures in humans. PLoS Genet. 7:e1001375.

Hiwatashi K. 1968. Determination and inheritance of mating type in Paramecium caudatum. Genetics 58:373–386.

Hori M, Tomikawa I, Przybos E, Fujishima M. 2006. Comparison of the evolutionary distances among syngens and sibling species of Paramecium. Mol Phylogenet Evol. 38:697–704.

Huang X, Kurata N, Wei X, Wang ZX, Wang A, Zhao Q, Zhao Y, Liu K, Lu H, Li W, et al. 2012. A map of rice genome variation reveals the origin of cultivated rice. Nature 490:497–501.

Huang X, Zhao Q, Han B. 2015. Comparative population genomics reveals strong divergence and infrequent introgression between Asian and African rice. Mol Plant 8:958–960.

Hudson RR. 2002. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18:337–338.

Hudson RR, Turelli M. 2003. Stochasticity overrules the “three-times rule”: genetic drift, genetic draft, and coalescence times for nuclear loci versus mitochondrial DNA. Evolution 57:182–190.

Jaillon O, Bouhouche K, Gout JF, Aury JM, Noel B, Saudemont B, Nowacki M, Serrano V, Porcel BM, Segurens B, et al. 2008. Translational control of intron splicing in eukaryotes. Nature 451:359–362.

James JE, Piganeau G, Eyre-Walker A. 2016. The rate of adaptive evolution in animal mitochondria. Mol Ecol. 25:67–78.

Joy DA, Feng X, Mu J, Furuya T, Chotivanich K, Krettli AU, Ho M, Wang A, White NJ, Suh E, et al. 2003. Early origin and recent expansion of Plasmodium falciparum. Science 300:318–321.

Kaplan NL, Hudson RR, Langley CH. 1989. The hitchhiking effect revisited. Genetics 123:887–899.

Katz LA, McManus GB, Snoeyenbos-West OLO, Griffin A, Pirog K, Costas B, Foissner W. 2005. Reframing the ‘Everything is everywhere’ debate: evidence for high gene flow and diversity in ciliate morphospecies. Aquat Microb Ecol. 41:55–65.

Katz LA, Snoeyenbos-West O, Doerder FP. 2006. Patterns of protein evolution in Tetrahymena thermophila: implications for estimates of effective population size. Mol Biol Evol. 23:608–614.

Krenek S, Berendonk TU, Fokin SI. 2015. New Paramecium (Ciliophora, Oligohymenophorea) congeners shape our view on its biodiversity. Org Divers Evol. 15:215–233.

Krenek S, Berendonk TU, Petzoldt T. 2011. Thermal performance curves of Paramecium caudatum: a model selection approach. Eur J Protistol. 47:124–137.

Krenek S, Petzoldt T, Berendonk TU. 2012. Coping with temperature at the warm edge–patterns of thermal adaptation in the microbial eukaryote Paramecium caudatum. PLoS One 7:e30598.

Kudla G, Murray AW, Tollervey D, Plotkin JB. 2009. Coding-sequence determinants of gene expression in Escherichia coli. Science 324:255–258.

Kvitek DJ, Will JL, Gasch AP. 2008. Variations in stress sensitivity and genomic expression in diverse S. cerevisiae isolates. PLoS Genet. 4:e1000223.

Lahr DJ, Laughinghouse HDt, Oliverio AM, Gao F, Katz LA. 2014. How discordant morphological and molecular evolution among microorganisms can revise our notions of biodiversity on Earth. Bioessays 36:950–959.

Langley CH, Stevens K, Cardeno C, Lee YC, Schrider DR, Pool JE, Langley SA, Suarez C, Corbett-Detig RB, Kolaczkowski B, et al. 2012. Genomic variation in natural populations of Drosophila melanogaster. Genetics 192:533–598.

Leffler EM, Bullaughey K, Matute DR, Meyer WK, Segurel L, Venkat A, Andolfatto P, Przeworski M. 2012. Revisiting an old riddle: what determines genetic diversity levels within species? PLoS Biol. 10:e1001388.

Li H. 2011. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27:2987–2993.

Li H, Durbin R. 2010. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26:589–595.

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Genome Project Data Processing S. 2009. The sequence alignment/map format and SAMtools. Bioinformatics 25:2078–2079.

Lim LP, Burge CB. 2001. A computational analysis of sequence features involved in recognition of short introns. Proc Natl Acad Sci U S A. 98:11193–11198.

Liti G, Carter DM, Moses AM, Warringer J, Parts L, James SA, Davey RP, Roberts IN, Burt A, Koufopanou V, et al. 2009. Population genomics of domestic and wild yeasts. Nature 458:337–341.

Loveless MD, Hamrick JL. 1984. Ecological determinants of genetic-structure in plant-populations. Annu Rev Ecol Syst. 15:65–95.

Lynch M. 2007. The origins of genome architecture. Sunderland (MA): Sinauer Associates.

Lynch M. 2009. Estimation of allele frequencies from high-coverage genome-sequencing projects. Genetics 182:295–301.


Lynch M, Field MC, Goodson HV, Malik HS, Pereira-Leal JB, Roos DS, Turkewitz AP, Sazer S. 2014. Evolutionary cell biology: two origins, one objective. Proc Natl Acad Sci U S A. 111:16990–16994.

Lynch M, Sung W, Morris K, Coffey N, Landry CR, Dopman EB, Dickinson WJ, Okamoto K, Kulkarni S, Hartl DL, et al. 2008. A genome-wide view of the spectrum of spontaneous mutations in yeast. Proc Natl Acad Sci U S A. 105:9272–9277.

Maere S, Heymans K, Kuiper M. 2005. BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics 21:3448–3449.

Maguire B, Belk D. 1967. Paramecium transport by land snails. J Protozool. 14:445–447.

Maruki T, Lynch M. 2015. Genotype-frequency estimation from high-throughput sequencing data. Genetics 201:473–486.

McGrath CL, Gout JF, Doak TG, Yanagi A, Lynch M. 2014a. Insights into three whole-genome duplications gleaned from the Paramecium caudatum genome sequence. Genetics 197:1417–1428.

McGrath CL, Gout JF, Johri P, Doak TG, Lynch M. 2014b. Differential retention and divergent resolution of duplicate genes following whole-genome duplication. Genome Res. 24:1665–1675.

Miller JR, Delcher AL, Koren S, Venter E, Walenz BP, Brownley A, Johnson J, Li K, Mobarry C, Sutton G. 2008. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics 24:2818–2824.

Moradian MM, Beglaryan D, Skozylas JM, Kerikorian V. 2007. Complete mitochondrial genome sequence of three Tetrahymena species reveals mutation hot spots and accelerated nonsynonymous substitutions in Ymf genes. PLoS ONE 2:e650.

Nabholz B, Glemin S, Galtier N. 2009. The erratic mitochondrial clock: variations of mutation rate, not population size, affect mtDNA diversity across birds and mammals. BMC Evol Biol. 9:54.

Nabholz B, Mauffrey JF, Bazin E, Galtier N, Glemin S. 2008. Determination of mitochondrial genetic diversity in mammals. Genetics 178:351–361.

Neafsey DE, Galinsky K, Jiang RH, Young L, Sykes SM, Saif S, Gujja S, Goldberg JM, Young S, Zeng Q, et al. 2012. The malaria parasite Plasmodium vivax exhibits greater genetic diversity than Plasmodium falciparum. Nat Genet. 44:1046–1050.

Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, Indap A, King KS, Bergmann S, Nelson MR, et al. 2008. Genes mirror geography within Europe. Nature 456:98–101.

Ohta T. 1974. Mutational pressure as main cause of molecular evolution and polymorphism. Nature 252:351–354.

Ohta T. 1992. The nearly neutral theory of molecular evolution. Annu Rev Ecol Syst. 23:263–286.

Pawlowski J, Audic S, Adl S, Bass D, Belbahri L, Berney C, Bowser SS, Cepicka I, Decelle J, Dunthorn M, et al. 2012. CBOL protist working group: barcoding eukaryotic richness beyond the animal, plant, and fungal kingdoms. PLoS Biol. 10:e1001419.

Piganeau G, Eyre-Walker A. 2009. Evidence for variation in the effective population size of animal mitochondrial DNA. PLoS One 4:e4396.

Pool JE, Nielsen R. 2007. Population size changes reshape genomic patterns of diversity. Evolution 61:3001–3006.

Pophaly SD, Tellier A. 2015. Population level purifying selection and gene expression shape subgenome evolution in maize. Mol Biol Evol. 32:3226–3235.

Prescott DM. 1994. The DNA of ciliated protozoa. Microbiol Rev. 58:233–267.

Pritchard AE, Seilhamer JJ, Mahalingam R, Sable CL, Venuti SE, Cummings DJ. 1990. Nucleotide sequence of the mitochondrial genome of Paramecium. Nucleic Acids Res. 18:173–180.

Przybos E. 1975. Genetic studies of Paramecium jenningsi strains (Diller, Earl, 1958). Folia Biol (Krakow). 23:425–471.

Przybos E. 2005. Recent data on the occurrence of species of the Paramecium aurelia complex in Europe. Folia Biol-Krakow. 53:61–63.

Przybos E, Barth D, Berendonk TU. 2008. The Paramecium aurelia species complex, frequency and co-occurrence across Europe. Folia Biol (Krakow). 56:77–81.

Przybos E, Rautian M, Greczek-Stachura M, Potekhin A. 2007. Polymorphism within Paramecium sexaurelia (Ciliophora, Oligohymenophorea) and description of a new stand of the species in China. Folia Biol (Krakow). 55:121–125.

Przybos E, Surmacz M. 2010. New world-wide data on the distribution of species of the Paramecium aurelia complex (Ciliophora, Protozoa). Folia Biol-Krakow. 58:185–188.

R-Core-Team. 2014. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

Redden H, Alper HS. 2015. The development and characterization of synthetic minimal yeast promoters. Nat Commun. 6:7810.

Ritter DI, Dong Z, Guo S, Chuang JH. 2012. Transcriptional enhancers in protein-coding exons of vertebrate developmental genes. PLoS One 7:e35202.

Russell CB, Fraga D, Hinrichsen RD. 1994. Extremely short 20-33 nucleotide introns are the standard length in Paramecium tetraurelia. Nucleic Acids Res. 22:1221–1225.

Salim HMW, Ring KL, Cavalcanti ARO. 2008. Patterns of codon usage in two ciliates that reassign the genetic code: Tetrahymena thermophila and Paramecium tetraurelia. Protist 159:283–298.

Schacherer J, Shapiro JA, Ruderfer DM, Kruglyak L. 2009. Comprehensive polymorphism survey elucidates population structure of Saccharomyces cerevisiae. Nature 458:342–345.

Schattner P, Brooks AN, Lowe TM. 2005. The tRNAscan-SE, snoscan and snoGPS web servers for the detection of tRNAs and snoRNAs. Nucleic Acids Res. 33:W686–W689.

Schrider DR, Kern AD. 2015. Inferring selective constraint from population genomic data suggests recent regulatory turnover in the human brain. Genome Biol Evol. 7:3511–3528.

Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. 2003. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13:2498–2504.

Simonsen KL, Churchill GA, Aquadro CF. 1995. Properties of statistical tests of neutrality for DNA polymorphism data. Genetics 141:413–429.

Smith DR, Lee RW. 2008. Nucleotide diversity in the mitochondrial and nuclear compartments of Chlamydomonas reinhardtii: investigating the origins of genome architecture. BMC Evol Biol. 8:156.

Smith JM, Haigh J. 1974. The hitch-hiking effect of a favourable gene. Genetics Res. 89:391–403.

Snoke MS, Berendonk TU, Barth D, Lynch M. 2006. Large global effective population sizes in Paramecium. Mol Biol Evol. 23:2474–2479.

Sonneborn TM. 1970. Methods in Paramecium research. In: Methods Cell Biol. New York and London: Academic Press. p. 241–339.

Sonneborn TM. 1975. Paramecium aurelia complex of fourteen sibling species. T Am Microsc Soc. 94:155–178.

Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30:1312–1313.

Stoeck T, Welter H, Seitz-Bender D, Kusch J, Schmidt HJ. 2000. ARDRA and RAPD-fingerprinting reject the sibling species concept for the ciliate Paramecium caudatum (Ciliophora, Protoctista). Zool Scr. 29:75–82.

Sun M. 1999. Cleistogamy in Scutellaria indica (Labiatae): effective mating system and population genetic structure. Mol Ecol. 8:1285–1295.

Swart EC, Nowacki M, Shum J, Stiles H, Higgins BP, Doak TG, Schotanus K, Magrini VJ, Minx P, Mardis ER, et al. 2012. The Oxytricha trifallax mitochondrial genome. Genome Biol Evol. 4:136–154.

Tajima F. 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123:585–595.

Tishkoff SA, Reed FA, Ranciaro A, Voight BF, Babbitt CC, Silverman JS, Powell K, Mortensen HM, Hirbo JB, Osman M, et al. 2007. Convergent adaptation of human lactase persistence in Africa and Europe. Nat Genet. 39:31–40.

Tsukii Y, Hiwatashi K. 1983. Genes controlling mating-type specificity in Paramecium caudatum: three loci revealed by intersyngenic crosses. Genetics 104:41–62.


Tsukii Y, Hiwatashi K. 1985. Meiotic nondisjunction and aneuploids in intersyngenic hybrids of Paramecium caudatum. Genetics 111:779–794.

Tuller T, Zur H. 2015. Multiple roles of the coding sequence 5′ end in gene expression regulation. Nucleic Acids Res. 43:13–28.

Voight BF, Kudaravalli S, Wen X, Pritchard JK. 2006. A map of recent positive selection in the human genome. PLoS Biol. 4:e72.

Volkman SK, Sabeti PC, DeCaprio D, Neafsey DE, Schaffner SF, Milner DA Jr, Daily JP, Sarr O, Ndiaye D, Ndir O, et al. 2007. A genome-wide map of diversity in Plasmodium falciparum. Nat Genet. 39:113–119.

Weir W, Capewell P, Foth B, Clucas C, Pountain A, Steketee P, Veitch N, Koffi M, De Meeus T, Kabore J, et al. 2016. Population genomics reveals the origin and asexual evolution of human infective trypanosomes. eLife 5:e11473.

Weisse T. 2008. Distribution and diversity of aquatic protists: an evolutionary and ecological perspective. Biodivers Conserv. 17:243–259.

Wickham H. 2010. ggplot2: Elegant graphics for data analysis. New York: Springer-Verlag.

Zhang JZ, Yang JR. 2015. Determinants of the rate of protein sequence evolution. Nat Rev Genet. 16:409–420.

Johri et al doi101093molbevmsx074 MBE

1216Downloaded from httpsacademicoupcommbearticle-abstract34511942992914by Arizona State University useron 02 July 2018

Page 19: Population Genomics of Paramecium Species...Population Genomics of Paramecium Species Parul Johri,*,1 Sascha Krenek,2 Georgi K. Marinov,1 Thomas G. Doak,1,3 Thomas U. Berendonk,2 and

A sliding window approach was used to calculate TajimarsquosD across the genome Vcftools was used to obtain thesevalues for non-overlapping windows of 1 kb in length Theprogram ms (Hudson 2002) was used to perform 100000simulations of 10 individuals with 1 kb locus length andniceStats (httpsgithubcomkern-labshICblobmasterniceStatsc) was used to calculate Tajimarsquos D from the msoutput file To obtain Tajimarsquos D for different categories ofsites SNPs belonging to particular category of sites (specifiedin Results) were selected and vcftools was used to calculateTajimarsquos D for non-overlapping windows of 10 kb Windowswith 2 or more SNPs were used to calculate the mean andonly alternate windows were used to compare distributionsof Tajimarsquos D across different categories by Studentrsquos t-test tomaximise independence of sampling

Plotting Average Expected Heterozygosity across Introns and Intergenic Regions
All introns (< 500 bp in length) were split midway and aligned according to the 5′ and 3′ splice sites. All intergenic regions (< 500 bp in the outgroup species and < 1000 bp in the P. aurelia species) were split midway and aligned according to the start and stop codons. Note that these were not sequence alignments. At each site, the mean expected heterozygosity was calculated. Sites with an insufficient number of data points (< 200 data points per site for introns; for intergenic regions, < 100 data points per site for the outgroups and < 1000 data points per site in the P. aurelia species) were discarded.
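The positional averaging described above can be sketched in Python as follows. This is an illustrative reading of the procedure, not the authors' script: each region is represented as a list of per-site expected heterozygosity values, the first half is indexed by distance from the left boundary and the second half by distance from the right boundary, and positions with too few data points are dropped. The function name `positional_means` and its arguments are hypothetical.

```python
from collections import defaultdict

def positional_means(regions, min_points):
    """Average per-site values across regions anchored at both boundaries.

    regions: list of lists; each inner list holds one value per site,
             ordered from the left boundary to the right boundary.
    min_points: positions supported by fewer regions than this are discarded.
    Returns (left_means, right_means), dicts keyed by distance (0-based)
    from the left and right boundary respectively.
    """
    left = defaultdict(list)
    right = defaultdict(list)
    for values in regions:
        mid = len(values) // 2  # split each region midway
        for dist, v in enumerate(values[:mid]):
            left[dist].append(v)
        for dist, v in enumerate(reversed(values[mid:])):
            right[dist].append(v)

    def summarise(by_pos):
        return {pos: sum(vals) / len(vals)
                for pos, vals in by_pos.items()
                if len(vals) >= min_points}

    return summarise(left), summarise(right)
```

Because regions differ in length, positions far from a boundary are covered by fewer regions, which is why a minimum number of data points per site is needed.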

Calculating and Plotting Divergence across Introns and Intergenic Regions
Orthologous coding sequences along with introns were aligned using MUSCLE (Edgar 2004a). Intergenic regions flanked by orthologous genes on both sides were also aligned, along with the flanking coding exon, using MUSCLE. For each species separately, divergence for different positions (with respect to the start and stop sites) was averaged across all introns and intergenic regions. The divergence values were corrected for multiple substitutions by the Jukes-Cantor sequence substitution model and plotted in the same manner as described above.
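For reference, the Jukes-Cantor correction converts an observed proportion of differing sites p into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). The Python sketch below shows the correction together with a simple gap-aware proportion of differences; the function names are hypothetical, and how gaps and ambiguous bases were treated in the original analysis is not stated, so skipping gapped columns is an assumption.

```python
import math

def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap are skipped (an assumption).
    Returns None if no comparable column remains.
    """
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x != gap and y != gap]
    if not pairs:
        return None
    return sum(x != y for x, y in pairs) / len(pairs)

def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed proportion p.

    The correction is undefined for p >= 0.75 (saturation); infinity is
    returned in that case.
    """
    if p < 0:
        raise ValueError("p must be non-negative")
    if p >= 0.75:
        return float("inf")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

The corrected value always exceeds the raw proportion for p > 0, and the gap between the two widens as p approaches saturation.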

GO Analysis
All gene ontology enrichment analyses were performed using the BiNGO plugin for Cytoscape 3.2.1 (Shannon et al. 2003; Maere et al. 2005), using the hypergeometric test to assess significance, with Benjamini and Hochberg FDR correction for multiple testing (Benjamini and Hochberg 1995). Latest ontology files (.obo) were downloaded from www.geneontology.org, and custom annotation files were prepared as required using python scripts.
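The two statistical ingredients named above can be sketched in plain Python. This is not BiNGO's code; it is a minimal illustration of an upper-tail hypergeometric test for one GO term and of the Benjamini-Hochberg adjustment across terms, with hypothetical function and argument names.

```python
from math import comb

def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) for a hypergeometric draw.

    N: genes in the background, K: background genes annotated with the term,
    n: genes in the test set, k: test-set genes annotated with the term.
    """
    upper = min(n, K)
    numerator = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return numerator / comb(N, n)

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

A term would then be called enriched when its adjusted p-value falls below the chosen FDR threshold.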

Identification of Heterozygous Regions in the P. aurelia Species
The fraction of heterozygous sites was calculated for 10 bp non-overlapping windows across the genome for all three P. aurelia species, and the top 1% of windows with the highest fraction of heterozygous sites were identified as heterozygous regions.
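A minimal Python sketch of this window scan is given below. It assumes that heterozygosity has already been called per site and is supplied as a list of 0/1 flags, and it assumes the cutoff is the top 1% of windows; the function name `top_heterozygous_windows` and its parameters are hypothetical.

```python
def top_heterozygous_windows(het_flags, window=10, top_fraction=0.01):
    """Rank non-overlapping windows by their fraction of heterozygous sites.

    het_flags: list of 0/1 values, one per genomic site (1 = heterozygous).
    window: window length in sites; a trailing partial window is ignored.
    top_fraction: share of windows to report (assumed to be 1% here).
    Returns (indices of the top windows in genomic order, all fractions).
    """
    n_windows = len(het_flags) // window
    fractions = [
        sum(het_flags[w * window:(w + 1) * window]) / window
        for w in range(n_windows)
    ]
    if n_windows == 0:
        return [], fractions
    n_top = max(1, int(n_windows * top_fraction))
    ranked = sorted(range(n_windows), key=lambda w: fractions[w], reverse=True)
    return sorted(ranked[:n_top]), fractions
```

Ties at the cutoff are broken by genomic order in this sketch, because Python's sort is stable.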

Mitochondrial Genome Assembly and Annotation
Mitochondrial genomes were assembled as follows. Sequence reads were assembled using SPAdes (version 3.5.0) (Bankevich et al. 2012) after removing potential adapter sequence with Trimmomatic (version 0.33) (Bolger et al. 2014). Mitochondrial contigs were identified from the resulting assemblies by BLAST searches against the published P. caudatum and P. tetraurelia mitochondrial genomes (Pritchard et al. 1990; Barth and Berendonk 2011).

Protein coding genes were identified by generating all ORFs longer than 60 aa in all six frames using the Mold, Protozoan, and Coelenterate Mitochondrial Code (i.e., UGA codes for W instead of being a stop codon) and all alternative start codons specific to Paramecium (AUU, AUA, AUG, AUC, GUG, and GUA), and retaining the longest ORFs associated with each stop codon. BLASTP (Altschul et al. 1997) was then used to identify homologs of annotated mitochondrial proteins in P. tetraurelia and P. caudatum. Additional ORFs were identified by imposing the requirement that their length exceeds 100 aa, and were subsequently annotated using BLASTP against the Non-redundant protein sequences (nr) database and HMMER 3.0 (Eddy 2011) scans against the PFAM 27.0 database (Finn et al. 2014).
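The ORF-generation step can be illustrated with a short Python sketch. It is not the authors' pipeline: it scans a DNA-alphabet contig in all six frames, treats only TAA and TAG as stops (so TGA is read through, as in the Mold, Protozoan, and Coelenterate Mitochondrial Code), accepts the six start codons listed above, and keeps the longest ORF per stop codon by taking the first start after the previous in-frame stop. The name `longest_orfs` and the returned tuple layout are hypothetical.

```python
STOPS = {"TAA", "TAG"}  # TGA encodes W under this code, so it is not a stop
STARTS = {"ATT", "ATA", "ATG", "ATC", "GTG", "GTA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of an ACGT string."""
    return seq.translate(_COMPLEMENT)[::-1]

def longest_orfs(seq, min_aa=60):
    """Longest ORF per stop codon in all six frames.

    Returns tuples (strand, frame, start, end, n_aa). Coordinates are 0-based,
    end-exclusive, include the stop codon, and for the "-" strand refer to the
    reverse-complemented sequence. n_aa counts codons from the start codon up
    to, but not including, the stop. Only ORFs with n_aa > min_aa are kept.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        for frame in range(3):
            first_start = None  # earliest start since the last in-frame stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if codon in STOPS:
                    if first_start is not None:
                        n_aa = (i - first_start) // 3
                        if n_aa > min_aa:
                            orfs.append((strand, frame, first_start, i + 3, n_aa))
                    first_start = None
                elif first_start is None and codon in STARTS:
                    first_start = i
    return orfs
```

With the standard genetic code, an in-frame TGA would truncate such an ORF, which is why the choice of translation table matters for ciliate mitochondrial annotation.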

tRNA genes were annotated with tRNAscan-SE 1.21 (Schattner et al. 2005) using the "Mito/Chloroplast" source.

Supplementary Material
Supplementary data are available at Molecular Biology and Evolution online.

Acknowledgments
We thank Lydia Bright and Jean-Francois Gout for helpful comments on the manuscript, Matthew S. Ackerman for insightful discussions related to the project, and the anonymous reviewers and editor for their valuable suggestions. We would also like to thank Kelley Thomas for providing help with the sequencing, and E. Przybos for kindly providing several Paramecium strains and the species identification of some so far unspecified strains. This work was financially supported by the National Science Foundation (MCB-1050161 and DEB-1257806), the European Commission FP7-PEOPLE-2009-IRSES project CINAR PATHOBACTER (project number 247658), and the CNRS research network GDRE "Paramecium Genome Dynamics and Evolution". T.U.B. also acknowledges funding of the "support the best" of the ZUK TU Dresden and the Deutsche Forschungsgemeinschaft (DFG).

References

Abbott RJ, Gomes MF. 1989. Population genetic-structure and outcrossing rate of Arabidopsis-Thaliana (L) Heynh. Heredity 62:411–418.

Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, McVean GA. 2010. A map of human genome variation from population-scale sequencing. Nature 467:1061–1073.


Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA. 2012. An integrated map of genetic variation from 1,092 human genomes. Nature 491:56–65.

Aberer AJ, Pattengale ND, Stamatakis A. 2010. Parallelized phylogenetic post-analysis on multi-core architectures. J Comput Sci-Neth 1:107–114.

Ackermann AA, Panunzi LG, Cosentino RO, Sanchez DO, Aguero F. 2012. A genomic scale map of genetic diversity in Trypanosoma cruzi. BMC Genomics 13:736.

Altenhoff AM, Skunca N, Glover N, Train CM, Sueki A, Pilizota I, Gori K, Tomiczek B, Muller S, Redestig H, et al. 2015. The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Res 43:D240–D249.

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402.

Arnaiz O, Cain S, Cohen J, Sperling L. 2007. ParameciumDB: a community resource that integrates the Paramecium tetraurelia genome sequence with genetic data. Nucleic Acids Res 35:D439–D444.

Arnaiz O, Mathy N, Baudry C, Malinsky S, Aury JM, Wilkes CD, Garnier O, Labadie K, Lauderdale BE, Le Mouel A, et al. 2012. The Paramecium germline genome provides a niche for intragenic parasitic DNA: evolutionary dynamics of internal eliminated sequences. PLoS Genet 8:e1002984.

Arnaiz O, Sperling L. 2011. ParameciumDB in 2011: new tools and new data for functional and comparative genomics of the model ciliate Paramecium tetraurelia. Nucleic Acids Res 39:D632–D636.

Assefa S, Lim C, Preston MD, Duffy CW, Nair MB, Adroub SA, Kadir KA, Goldberg JM, Neafsey DE, Divis P, et al. 2015. Population genomic structure and adaptation in the zoonotic malaria parasite Plasmodium knowlesi. Proc Natl Acad Sci U S A 112:13027–13032.

Aury JM, Jaillon O, Duret L, Noel B, Jubin C, Porcel BM, Segurens B, Daubin V, Anthouard V, Aiach N, et al. 2006. Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature 444:171–178.

Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR. 2015. A global reference for human genetic variation. Nature 526:68–74.

Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, et al. 2012. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol 19:455–477.

Barth D, Berendonk TU. 2011. The mitochondrial genome sequence of the ciliate Paramecium caudatum reveals a shift in nucleotide composition and codon usage within the genus Paramecium. BMC Genomics 12:272.

Barth D, Krenek S, Fokin SI, Berendonk TU. 2006. Intraspecific genetic variation in Paramecium revealed by mitochondrial cytochrome C oxidase I sequences. J Eukaryot Microbiol 53:20–25.

Bazin E, Glemin S, Galtier N. 2006. Population size does not influence mitochondrial genetic diversity in animals. Science 312:570–572.

Beale G, Preer JR. 2008. Paramecium: genetics and epigenetics. Boca Raton (FL): CRC Press.

Begun DJ, Aquadro CF. 1992. Levels of naturally-occurring DNA polymorphism correlate with recombination rates in Drosophila melanogaster. Nature 356:519–520.

Begun DJ, Holloway AK, Stevens K, Hillier LW, Poh YP, Hahn MW, Nista PM, Jones CD, Kern AD, Dewey CN, et al. 2007. Population genomics: whole-genome analysis of polymorphism and divergence in Drosophila simulans. PLoS Biol 5:e310.

Benjamini Y, Hochberg Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B Met 57:289–300.

Betermier M. 2004. Large-scale genome remodelling by the developmentally programmed elimination of germ line sequences in the ciliate Paramecium. Res Microbiol 155:399–408.

Birky CW, Walsh JB. 1988. Effects of linkage on rates of molecular evolution. Proc Natl Acad Sci U S A 85:6414–6418.

Birnbaum RY, Clowney EJ, Agamy O, Kim MJ, Zhao J, Yamanaka T, Pappalardo Z, Clarke SL, Wenger AM, Nguyen L, et al. 2012. Coding exons function as tissue-specific enhancers of nearby genes. Genome Res 22:1059–1068.

Boel G, Letso R, Neely H, Price WN, Wong KH, Su M, Luff JD, Valecha M, Everett JK, Acton TB, et al. 2016. Codon influence on protein expression in E. coli correlates with mRNA levels. Nature 529:358–363.

Bolger AM, Lohse M, Usadel B. 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30:2114–2120.

Brandvain Y, Wright SI. 2016. The limits of natural selection in a nonequilibrium world. Trends Genet 32:201–210.

Brunk CF, Lee LC, Tran AB, Li J. 2003. Complete sequence of the mitochondrial genome of Tetrahymena thermophila and comparative methods for identifying highly divergent genes. Nucleic Acids Res 31:1673–1682.

Burger G, Zhu Y, Littlejohn TG, Greenwood SJ, Schnare MN, Lang BF, Gray MW. 2000. Complete sequence of the mitochondrial genome of Tetrahymena pyriformis and comparison with Paramecium aurelia mitochondrial DNA. J Mol Biol 297:365–380.

Burki F. 2014. The eukaryotic tree of life from a global phylogenomic perspective. Cold Spring Harb Perspect Biol 6:a016147.

Cao J, Schneeberger K, Ossowski S, Gunther T, Bender S, Fitz J, Koenig D, Lanz C, Stegle O, Lippert C, et al. 2011. Whole-genome sequencing of multiple Arabidopsis thaliana populations. Nat Genet 43:956–963.

Catania F, Wurmser F, Potekhin AA, Przybos E, Lynch M. 2009. Genetic diversity in the Paramecium aurelia species complex. Mol Biol Evol 26:421–431.

Chalker DL, Yao MC. 2011. DNA elimination in ciliates: transposon domestication and genome surveillance. Annu Rev Genet 45:227–246.

Charlesworth D. 2003. Effects of inbreeding on the genetic diversity of populations. Philos T Roy Soc B 358:1051–1070.

Charlesworth B, Morgan MT, Charlesworth D. 1993. The effect of deleterious mutations on neutral molecular variation. Genetics 134:1289–1303.

Chia JM, Song C, Bradbury PJ, Costich D, de Leon N, Doebley J, Elshire RJ, Gaut B, Geller L, Glaubitz JC, et al. 2012. Maize HapMap2 identifies extant variation from a genome in flux. Nat Genet 44:803–807.

Comeron JM, Williford A, Kliman RM. 2008. The Hill-Robertson effect: evolutionary consequences of weak selection and linkage in finite populations. Heredity 100:19–31.

Dey A, Chan CK, Thomas CG, Cutter AD. 2013. Molecular hyperdiversity defines populations of the nematode Caenorhabditis brenneri. Proc Natl Acad Sci U S A 110:11056–11060.

Dong X, Navratilova P, Fredman D, Drivenes O, Becker TS, Lenhard B. 2010. Exonic remnants of whole-genome duplication reveal cis-regulatory function of coding exons. Nucleic Acids Res 38:1071–1085.

Downing T, Imamura H, Decuypere S, Clark TG, Coombs GH, Cotton JA, Hilley JD, de Doncker S, Maes I, Mottram JC, et al. 2011. Whole genome sequencing of multiple Leishmania donovani clinical isolates provides insights into population structure and mechanisms of drug resistance. Genome Res 21:2143–2156.

Drummond DA, Bloom JD, Adami C, Wilke CO, Arnold FH. 2005. Why highly expressed proteins evolve slowly. Proc Natl Acad Sci U S A 102:14338–14343.

Eddy SR. 2011. Accelerated profile HMM searches. PLoS Comput Biol 7:e1002195.

Edgar RC. 2004a. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5:113.

Edgar RC. 2004b. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32:1792–1797.

Eichenlaub MP, Ettwiller L. 2011. De novo genesis of enhancers in vertebrates. PLoS Biol 9:e1001188.

Ellison CE, Hall C, Kowbel D, Welch J, Brem RB, Glass NL, Taylor JW. 2011. Population genomics and local adaptation in wild isolates of a model microbial eukaryote. Proc Natl Acad Sci U S A 108:2831–2836.

Elyashiv E, Bullaughey K, Sattath S, Rinott Y, Przeworski M, Sella G. 2010. Shifts in the intensity of purifying selection: an analysis of genome-wide polymorphism data from two closely related yeast species. Genome Res 20:1558–1573.

Engelthaler DM, Hicks ND, Gillece JD, Roe CC, Schupp JM, Driebe EM, Gilgado F, Carriconde F, Trilles L, Firacative C, et al. 2014. Cryptococcus gattii in North American Pacific Northwest: whole-population genome analysis provides insights into species evolution and dispersal. mBio 5:e01464–e01414.

Fawcett JA, Iida T, Takuno S, Sugino RP, Kado T, Kugou K, Mura S, Kobayashi T, Ohta K, Nakayama J, et al. 2014. Population genomics of the fission yeast Schizosaccharomyces pombe. PLoS ONE 9:e104241.

Fenchel T, Finlay BJ. 2004. The ubiquity of small species: patterns of local and global diversity. Bioscience 54:777–784.

Finlay BJ. 2002. Global dispersal of free-living microbial eukaryote species. Science 296:1061–1063.

Finlay BJ, Esteban GF, Fenchel T. 2004. Protist diversity is different? Protist 155:15–22.

Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J, et al. 2014. Pfam: the protein families database. Nucleic Acids Res 42:D222–D230.

Foissac S, Gouzy J, Rombauts S, Mathe C, Amselem J, Sterck L, Van de Peer Y, Rouze P, Schiex T. 2008. Genome annotation in plants and fungi: EuGene as a model platform. Curr Bioinform 3:87–97.

Foissner W. 1999. Protist diversity: estimates of the near-imponderable. Protist 150:363–368.

Foissner W. 2006. Biogeography and dispersal of micro-organisms: a review emphasizing protists. Acta Protozool 45:111–136.

Foissner W. 2008. Protist diversity and distribution: some basic considerations. Biodivers Conserv 17:235–242.

Foissner W. 2011. Dispersal of protists: the role of cysts and human introductions. In: Fontaneto D, editor. Biogeography of microscopic organisms: is everything small everywhere? Cambridge University Press. p. 61–87.

Foissner W, Chao A, Katz LA. 2008. Diversity and geographic distribution of ciliates (Protista: Ciliophora). Biodivers Conserv 17:345–363.

Fournier-Level A, Korte A, Cooper MD, Nordborg M, Schmitt J, Wilczek AM. 2011. A map of local adaptation in Arabidopsis thaliana. Science 334:86–89.

Galtier N, Nabholz B, Glemin S, Hurst GDD. 2009. Mitochondrial DNA as a marker of molecular diversity: a reappraisal. Mol Ecol 18:4541–4550.

Gout JF, Kahn D, Duret L, Paramecium Post-Genomics C. 2010. The relationship among gene expression, the evolution of gene dosage, and the rate of protein evolution. PLoS Genet 6:e1000944.

Gout JF, Lynch M. 2015. Maintenance and loss of duplicated genes by dosage subfunctionalization. Mol Biol Evol 32:2141–2148.

Gruchy DF. 1955. The breeding system and distribution of Tetrahymena pyriformis. J Protozool 2:178–185.

Gustincich S, Manfioletti G, Del Sal G, Schneider C, Carninci P. 1991. A fast method for high-quality genomic DNA extraction from whole human blood. BioTechniques 11:298–300, 302.

Hancock AM, Witonsky DB, Alkorta-Aranburu G, Beall CM, Gebremedhin A, Sukernik R, Utermann G, Pritchard JK, Coop G, Di Rienzo A. 2011. Adaptations to climate-mediated selective pressures in humans. PLoS Genet 7:e1001375.

Hiwatashi K. 1968. Determination and inheritance of mating type in Paramecium caudatum. Genetics 58:373–386.

Hori M, Tomikawa I, Przybos E, Fujishima M. 2006. Comparison of the evolutionary distances among syngens and sibling species of Paramecium. Mol Phylogenet Evol 38:697–704.

Huang X, Kurata N, Wei X, Wang ZX, Wang A, Zhao Q, Zhao Y, Liu K, Lu H, Li W, et al. 2012. A map of rice genome variation reveals the origin of cultivated rice. Nature 490:497–501.

Huang X, Zhao Q, Han B. 2015. Comparative population genomics reveals strong divergence and infrequent introgression between Asian and African rice. Mol Plant 8:958–960.

Hudson RR. 2002. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18:337–338.

Hudson RR, Turelli M. 2003. Stochasticity overrules the "three-times rule": genetic drift, genetic draft, and coalescence times for nuclear loci versus mitochondrial DNA. Evolution 57:182–190.

Jaillon O, Bouhouche K, Gout JF, Aury JM, Noel B, Saudemont B, Nowacki M, Serrano V, Porcel BM, Segurens B, et al. 2008. Translational control of intron splicing in eukaryotes. Nature 451:359–362.

James JE, Piganeau G, Eyre-Walker A. 2016. The rate of adaptive evolution in animal mitochondria. Mol Ecol 25:67–78.

Joy DA, Feng X, Mu J, Furuya T, Chotivanich K, Krettli AU, Ho M, Wang A, White NJ, Suh E, et al. 2003. Early origin and recent expansion of Plasmodium falciparum. Science 300:318–321.

Kaplan NL, Hudson RR, Langley CH. 1989. The hitchhiking effect revisited. Genetics 123:887–899.

Katz LA, McManus GB, Snoeyenbos-West OLO, Griffin A, Pirog K, Costas B, Foissner W. 2005. Reframing the 'Everything is everywhere' debate: evidence for high gene flow and diversity in ciliate morphospecies. Aquat Microb Ecol 41:55–65.

Katz LA, Snoeyenbos-West O, Doerder FP. 2006. Patterns of protein evolution in Tetrahymena thermophila: implications for estimates of effective population size. Mol Biol Evol 23:608–614.

Krenek S, Berendonk TU, Fokin SI. 2015. New Paramecium (Ciliophora, Oligohymenophorea) congeners shape our view on its biodiversity. Org Divers Evol 15:215–233.

Krenek S, Berendonk TU, Petzoldt T. 2011. Thermal performance curves of Paramecium caudatum: a model selection approach. Eur J Protistol 47:124–137.

Krenek S, Petzoldt T, Berendonk TU. 2012. Coping with temperature at the warm edge–patterns of thermal adaptation in the microbial eukaryote Paramecium caudatum. PLoS One 7:e30598.

Kudla G, Murray AW, Tollervey D, Plotkin JB. 2009. Coding-sequence determinants of gene expression in Escherichia coli. Science 324:255–258.

Kvitek DJ, Will JL, Gasch AP. 2008. Variations in stress sensitivity and genomic expression in diverse S. cerevisiae isolates. PLoS Genet 4:e1000223.

Lahr DJ, Laughinghouse HDt, Oliverio AM, Gao F, Katz LA. 2014. How discordant morphological and molecular evolution among microorganisms can revise our notions of biodiversity on Earth. Bioessays 36:950–959.

Langley CH, Stevens K, Cardeno C, Lee YC, Schrider DR, Pool JE, Langley SA, Suarez C, Corbett-Detig RB, Kolaczkowski B, et al. 2012. Genomic variation in natural populations of Drosophila melanogaster. Genetics 192:533–598.

Leffler EM, Bullaughey K, Matute DR, Meyer WK, Segurel L, Venkat A, Andolfatto P, Przeworski M. 2012. Revisiting an old riddle: what determines genetic diversity levels within species? PLoS Biol 10:e1001388.

Li H. 2011. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27:2987–2993.

Li H, Durbin R. 2010. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26:589–595.

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Genome Project Data Processing S. 2009. The sequence alignment/map format and SAMtools. Bioinformatics 25:2078–2079.

Lim LP, Burge CB. 2001. A computational analysis of sequence features involved in recognition of short introns. Proc Natl Acad Sci U S A 98:11193–11198.

Liti G, Carter DM, Moses AM, Warringer J, Parts L, James SA, Davey RP, Roberts IN, Burt A, Koufopanou V, et al. 2009. Population genomics of domestic and wild yeasts. Nature 458:337–341.

Loveless MD, Hamrick JL. 1984. Ecological determinants of genetic-structure in plant-populations. Annu Rev Ecol Syst 15:65–95.

Lynch M. 2007. The origins of genome architecture. Sunderland (MA): Sinauer Associates.

Lynch M. 2009. Estimation of allele frequencies from high-coverage genome-sequencing projects. Genetics 182:295–301.


Lynch M, Field MC, Goodson HV, Malik HS, Pereira-Leal JB, Roos DS, Turkewitz AP, Sazer S. 2014. Evolutionary cell biology: two origins, one objective. Proc Natl Acad Sci U S A 111:16990–16994.

Lynch M, Sung W, Morris K, Coffey N, Landry CR, Dopman EB, Dickinson WJ, Okamoto K, Kulkarni S, Hartl DL, et al. 2008. A genome-wide view of the spectrum of spontaneous mutations in yeast. Proc Natl Acad Sci U S A 105:9272–9277.

Maere S, Heymans K, Kuiper M. 2005. BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics 21:3448–3449.

Maguire B, Belk D. 1967. Paramecium transport by land snails. J Protozool 14:445–447.

Maruki T, Lynch M. 2015. Genotype-frequency estimation from high-throughput sequencing data. Genetics 201:473–486.

McGrath CL, Gout JF, Doak TG, Yanagi A, Lynch M. 2014a. Insights into three whole-genome duplications gleaned from the Paramecium caudatum genome sequence. Genetics 197:1417–1428.

McGrath CL, Gout JF, Johri P, Doak TG, Lynch M. 2014b. Differential retention and divergent resolution of duplicate genes following whole-genome duplication. Genome Res 24:1665–1675.

Miller JR, Delcher AL, Koren S, Venter E, Walenz BP, Brownley A, Johnson J, Li K, Mobarry C, Sutton G. 2008. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics 24:2818–2824.

Moradian MM, Beglaryan D, Skozylas JM, Kerikorian V. 2007. Complete mitochondrial genome sequence of three Tetrahymena species reveals mutation hot spots and accelerated nonsynonymous substitutions in Ymf genes. PLoS ONE 2:e650.

Nabholz B, Glemin S, Galtier N. 2009. The erratic mitochondrial clock: variations of mutation rate, not population size, affect mtDNA diversity across birds and mammals. BMC Evol Biol 9:54.

Nabholz B, Mauffrey JF, Bazin E, Galtier N, Glemin S. 2008. Determination of mitochondrial genetic diversity in mammals. Genetics 178:351–361.

Neafsey DE, Galinsky K, Jiang RH, Young L, Sykes SM, Saif S, Gujja S, Goldberg JM, Young S, Zeng Q, et al. 2012. The malaria parasite Plasmodium vivax exhibits greater genetic diversity than Plasmodium falciparum. Nat Genet 44:1046–1050.

Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, Indap A, King KS, Bergmann S, Nelson MR, et al. 2008. Genes mirror geography within Europe. Nature 456:98–101.

Ohta T. 1974. Mutational pressure as main cause of molecular evolution and polymorphism. Nature 252:351–354.

Ohta T. 1992. The nearly neutral theory of molecular evolution. Annu Rev Ecol Syst 23:263–286.

Pawlowski J, Audic S, Adl S, Bass D, Belbahri L, Berney C, Bowser SS, Cepicka I, Decelle J, Dunthorn M, et al. 2012. CBOL protist working group: barcoding eukaryotic richness beyond the animal, plant, and fungal kingdoms. PLoS Biol 10:e1001419.

Piganeau G, Eyre-Walker A. 2009. Evidence for variation in the effective population size of animal mitochondrial DNA. PLoS One 4:e4396.

Pool JE, Nielsen R. 2007. Population size changes reshape genomic patterns of diversity. Evolution 61:3001–3006.

Pophaly SD, Tellier A. 2015. Population level purifying selection and gene expression shape subgenome evolution in maize. Mol Biol Evol 32:3226–3235.

Prescott DM. 1994. The DNA of ciliated protozoa. Microbiol Rev 58:233–267.

Pritchard AE, Seilhamer JJ, Mahalingam R, Sable CL, Venuti SE, Cummings DJ. 1990. Nucleotide sequence of the mitochondrial genome of Paramecium. Nucleic Acids Res 18:173–180.

Przybos E. 1975. Genetic studies of Paramecium jenningsi strains (Diller, Earl, 1958). Folia Biol (Krakow) 23:425–471.

Przybos E. 2005. Recent data on the occurrence of species of the Paramecium aurelia complex in Europe. Folia Biol-Krakow 53:61–63.

Przybos E, Barth D, Berendonk TU. 2008. The Paramecium aurelia species complex, frequency and co-occurrence across Europe. Folia Biol (Krakow) 56:77–81.

Przybos E, Rautian M, Greczek-Stachura M, Potekhin A. 2007. Polymorphism within Paramecium sexaurelia (Ciliophora, Oligohymenophorea) and description of a new stand of the species in China. Folia Biol (Krakow) 55:121–125.

Przybos E, Surmacz M. 2010. New world-wide data on the distribution of species of the Paramecium aurelia complex (Ciliophora, Protozoa). Folia Biol-Krakow 58:185–188.

R-Core-Team. 2014. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

Redden H, Alper HS. 2015. The development and characterization of synthetic minimal yeast promoters. Nat Commun 6:7810.

Ritter DI, Dong Z, Guo S, Chuang JH. 2012. Transcriptional enhancers in protein-coding exons of vertebrate developmental genes. PLoS One 7:e35202.

Russell CB, Fraga D, Hinrichsen RD. 1994. Extremely short 20–33 nucleotide introns are the standard length in Paramecium tetraurelia. Nucleic Acids Res 22:1221–1225.

Salim HMW, Ring KL, Cavalcanti ARO. 2008. Patterns of codon usage in two ciliates that reassign the genetic code: Tetrahymena thermophila and Paramecium tetraurelia. Protist 159:283–298.

Schacherer J, Shapiro JA, Ruderfer DM, Kruglyak L. 2009. Comprehensive polymorphism survey elucidates population structure of Saccharomyces cerevisiae. Nature 458:342–345.

Schattner P, Brooks AN, Lowe TM. 2005. The tRNAscan-SE, snoscan and snoGPS web servers for the detection of tRNAs and snoRNAs. Nucleic Acids Res 33:W686–W689.

Schrider DR, Kern AD. 2015. Inferring selective constraint from population genomic data suggests recent regulatory turnover in the human brain. Genome Biol Evol 7:3511–3528.

Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. 2003. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13:2498–2504.

Simonsen KL, Churchill GA, Aquadro CF. 1995. Properties of statistical tests of neutrality for DNA polymorphism data. Genetics 141:413–429.

Smith DR, Lee RW. 2008. Nucleotide diversity in the mitochondrial and nuclear compartments of Chlamydomonas reinhardtii: investigating the origins of genome architecture. BMC Evol Biol 8:156.

Smith JM, Haigh J. 1974. The hitch-hiking effect of a favourable gene. Genetics Res 89:391–403.

Snoke MS, Berendonk TU, Barth D, Lynch M. 2006. Large global effective population sizes in Paramecium. Mol Biol Evol 23:2474–2479.

Sonneborn TM. 1970. Methods in Paramecium research. In: Methods Cell Biol. New York and London: Academic Press. p. 241–339.

Sonneborn TM. 1975. Paramecium aurelia complex of fourteen sibling species. T Am Microsc Soc 94:155–178.

Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30:1312–1313.

Stoeck T, Welter H, Seitz-Bender D, Kusch J, Schmidt HJ. 2000. ARDRA and RAPD-fingerprinting reject the sibling species concept for the ciliate Paramecium caudatum (Ciliophora, Protoctista). Zool Scr 29:75–82.

Sun M. 1999. Cleistogamy in Scutellaria indica (Labiatae): effective mating system and population genetic structure. Mol Ecol 8:1285–1295.

Swart EC, Nowacki M, Shum J, Stiles H, Higgins BP, Doak TG, Schotanus K, Magrini VJ, Minx P, Mardis ER, et al. 2012. The Oxytricha trifallax mitochondrial genome. Genome Biol Evol 4:136–154.

Tajima F. 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123:585–595.

Tishkoff SA, Reed FA, Ranciaro A, Voight BF, Babbitt CC, Silverman JS, Powell K, Mortensen HM, Hirbo JB, Osman M, et al. 2007. Convergent adaptation of human lactase persistence in Africa and Europe. Nat Genet 39:31–40.

Tsukii Y, Hiwatashi K. 1983. Genes controlling mating-type specificity in Paramecium caudatum: three loci revealed by intersyngenic crosses. Genetics 104:41–62.


Tsukii Y, Hiwatashi K. 1985. Meiotic nondisjunction and aneuploids in intersyngenic hybrids of Paramecium caudatum. Genetics 111:779–794.

Tuller T, Zur H. 2015. Multiple roles of the coding sequence 5′ end in gene expression regulation. Nucleic Acids Res 43:13–28.

Voight BF, Kudaravalli S, Wen X, Pritchard JK. 2006. A map of recent positive selection in the human genome. PLoS Biol 4:e72.

Volkman SK, Sabeti PC, DeCaprio D, Neafsey DE, Schaffner SF, Milner DA Jr, Daily JP, Sarr O, Ndiaye D, Ndir O, et al. 2007. A genome-wide map of diversity in Plasmodium falciparum. Nat Genet 39:113–119.

Weir W, Capewell P, Foth B, Clucas C, Pountain A, Steketee P, Veitch N, Koffi M, De Meeus T, Kabore J, et al. 2016. Population genomics reveals the origin and asexual evolution of human infective trypanosomes. eLife 5:e11473.

Weisse T. 2008. Distribution and diversity of aquatic protists: an evolutionary and ecological perspective. Biodivers Conserv 17:243–259.

Wickham H. 2010. ggplot2: Elegant graphics for data analysis. New York: Springer-Verlag.

Zhang JZ, Yang JR. 2015. Determinants of the rate of protein sequence evolution. Nat Rev Genet 16:409–420.

Johri et al doi101093molbevmsx074 MBE

1216Downloaded from httpsacademicoupcommbearticle-abstract34511942992914by Arizona State University useron 02 July 2018

Page 20: Population Genomics of Paramecium Species...Population Genomics of Paramecium Species Parul Johri,*,1 Sascha Krenek,2 Georgi K. Marinov,1 Thomas G. Doak,1,3 Thomas U. Berendonk,2 and

Abecasis GR Auton A Brooks LD DePristo MA Durbin RM HandsakerRE Kang HM Marth GT McVean GA 2012 An integrated map ofgenetic variation from 1092 human genomes Nature 49156ndash65

Aberer AJ Pattengale ND Stamatakis A 2010 Parallelized phylogeneticpost-analysis on multi-core architectures J Comput Sci-Neth1107ndash114

Ackermann AA Panunzi LG Cosentino RO Sanchez DO Aguero F2012 A genomic scale map of genetic diversity in Trypanosomacruzi BMC Genomics 13736

Altenhoff AM Skunca N Glover N Train CM Sueki A Pilizota I Gori KTomiczek B Muller S Redestig H et al 2015 The OMA orthologydatabase in 2015 function predictions better plant support syntenyview and other improvements Nucleic Acids Res 43D240ndashD249

Altschul SF Madden TL Schaffer AA Zhang J Zhang Z Miller WLipman DJ 1997 Gapped BLAST and PSI-BLAST a new generationof protein database search programs Nucleic Acids Res253389ndash3402

Arnaiz O Cain S Cohen J Sperling L 2007 ParameciumDB a communityresource that integrates the Paramecium tetraurelia genome se-quence with genetic data Nucleic Acids Res 35D439ndashD444 4

Arnaiz O Mathy N Baudry C Malinsky S Aury JM Wilkes CD Garnier OLabadie K Lauderdale BE Le Mouel A et al 2012 The Parameciumgermline genome provides a niche for intragenic parasitic DNAevolutionary dynamics of internal eliminated sequences PLoSGenet 8e1002984

Arnaiz O Sperling L 2011 ParameciumDB in 2011 new tools and newdata for functional and comparative genomics of the model ciliateParamecium tetraurelia Nucleic Acids Res 39D632ndashD636

Assefa S Lim C Preston MD Duffy CW Nair MB Adroub SA Kadir KAGoldberg JM Neafsey DE Divis P et al 2015 Population genomicstructure and adaptation in the zoonotic malaria parasitePlasmodium knowlesi Proc Natl Acad Sci U S A 11213027ndash13032

Aury JM Jaillon O Duret L Noel B Jubin C Porcel BM Segurens BDaubin V Anthouard V Aiach N et al 2006 Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraureliaNature 444171ndash178

Auton A Brooks LD Durbin RM Garrison EP Kang HM Korbel JOMarchini JL McCarthy S McVean GA Abecasis GR 2015 A globalreference for human genetic variation Nature 52668ndash74

Bankevich A Nurk S Antipov D Gurevich AA Dvorkin M Kulikov ASLesin VM Nikolenko SI Pham S Prjibelski AD et al 2012 SPAdes anew genome assembly algorithm and its applications to single-cellsequencing J Comput Biol 19455ndash477

Barth D, Berendonk TU. 2011. The mitochondrial genome sequence of the ciliate Paramecium caudatum reveals a shift in nucleotide composition and codon usage within the genus Paramecium. BMC Genomics 12:272.

Barth D, Krenek S, Fokin SI, Berendonk TU. 2006. Intraspecific genetic variation in Paramecium revealed by mitochondrial cytochrome C oxidase I sequences. J Eukaryot Microbiol. 53:20–25.

Bazin E, Glemin S, Galtier N. 2006. Population size does not influence mitochondrial genetic diversity in animals. Science 312:570–572.

Beale G, Preer JR. 2008. Paramecium: genetics and epigenetics. Boca Raton (FL): CRC Press.

Begun DJ, Aquadro CF. 1992. Levels of naturally-occurring DNA polymorphism correlate with recombination rates in Drosophila melanogaster. Nature 356:519–520.

Begun DJ, Holloway AK, Stevens K, Hillier LW, Poh YP, Hahn MW, Nista PM, Jones CD, Kern AD, Dewey CN, et al. 2007. Population genomics: whole-genome analysis of polymorphism and divergence in Drosophila simulans. PLoS Biol. 5:e310.

Benjamini Y, Hochberg Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B Met. 57:289–300.

Betermier M. 2004. Large-scale genome remodelling by the developmentally programmed elimination of germ line sequences in the ciliate Paramecium. Res Microbiol. 155:399–408.

Birky CW, Walsh JB. 1988. Effects of linkage on rates of molecular evolution. Proc Natl Acad Sci U S A. 85:6414–6418.

Birnbaum RY, Clowney EJ, Agamy O, Kim MJ, Zhao J, Yamanaka T, Pappalardo Z, Clarke SL, Wenger AM, Nguyen L, et al. 2012. Coding exons function as tissue-specific enhancers of nearby genes. Genome Res. 22:1059–1068.

Boel G, Letso R, Neely H, Price WN, Wong KH, Su M, Luff JD, Valecha M, Everett JK, Acton TB, et al. 2016. Codon influence on protein expression in E. coli correlates with mRNA levels. Nature 529:358–363.

Bolger AM, Lohse M, Usadel B. 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30:2114–2120.

Brandvain Y, Wright SI. 2016. The limits of natural selection in a nonequilibrium world. Trends Genet. 32:201–210.

Brunk CF, Lee LC, Tran AB, Li J. 2003. Complete sequence of the mitochondrial genome of Tetrahymena thermophila and comparative methods for identifying highly divergent genes. Nucleic Acids Res. 31:1673–1682.

Burger G, Zhu Y, Littlejohn TG, Greenwood SJ, Schnare MN, Lang BF, Gray MW. 2000. Complete sequence of the mitochondrial genome of Tetrahymena pyriformis and comparison with Paramecium aurelia mitochondrial DNA. J Mol Biol. 297:365–380.

Burki F. 2014. The eukaryotic tree of life from a global phylogenomic perspective. Cold Spring Harb Perspect Biol. 6:a016147.

Cao J, Schneeberger K, Ossowski S, Gunther T, Bender S, Fitz J, Koenig D, Lanz C, Stegle O, Lippert C, et al. 2011. Whole-genome sequencing of multiple Arabidopsis thaliana populations. Nat Genet. 43:956–963.

Catania F, Wurmser F, Potekhin AA, Przybos E, Lynch M. 2009. Genetic diversity in the Paramecium aurelia species complex. Mol Biol Evol. 26:421–431.

Chalker DL, Yao MC. 2011. DNA elimination in ciliates: transposon domestication and genome surveillance. Annu Rev Genet. 45:227–246.

Charlesworth D. 2003. Effects of inbreeding on the genetic diversity of populations. Philos T Roy Soc B. 358:1051–1070.

Charlesworth B, Morgan MT, Charlesworth D. 1993. The effect of deleterious mutations on neutral molecular variation. Genetics 134:1289–1303.

Chia JM, Song C, Bradbury PJ, Costich D, de Leon N, Doebley J, Elshire RJ, Gaut B, Geller L, Glaubitz JC, et al. 2012. Maize HapMap2 identifies extant variation from a genome in flux. Nat Genet. 44:803–807.

Comeron JM, Williford A, Kliman RM. 2008. The Hill-Robertson effect: evolutionary consequences of weak selection and linkage in finite populations. Heredity 100:19–31.

Dey A, Chan CK, Thomas CG, Cutter AD. 2013. Molecular hyperdiversity defines populations of the nematode Caenorhabditis brenneri. Proc Natl Acad Sci U S A. 110:11056–11060.

Dong X, Navratilova P, Fredman D, Drivenes O, Becker TS, Lenhard B. 2010. Exonic remnants of whole-genome duplication reveal cis-regulatory function of coding exons. Nucleic Acids Res. 38:1071–1085.

Downing T, Imamura H, Decuypere S, Clark TG, Coombs GH, Cotton JA, Hilley JD, de Doncker S, Maes I, Mottram JC, et al. 2011. Whole genome sequencing of multiple Leishmania donovani clinical isolates provides insights into population structure and mechanisms of drug resistance. Genome Res. 21:2143–2156.

Drummond DA, Bloom JD, Adami C, Wilke CO, Arnold FH. 2005. Why highly expressed proteins evolve slowly. Proc Natl Acad Sci U S A. 102:14338–14343.

Eddy SR. 2011. Accelerated profile HMM searches. PLoS Comput Biol. 7:e1002195.

Edgar RC. 2004a. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5:113.

Edgar RC. 2004b. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32:1792–1797.

Eichenlaub MP, Ettwiller L. 2011. De novo genesis of enhancers in vertebrates. PLoS Biol. 9:e1001188.

Ellison CE, Hall C, Kowbel D, Welch J, Brem RB, Glass NL, Taylor JW. 2011. Population genomics and local adaptation in wild isolates of a model microbial eukaryote. Proc Natl Acad Sci U S A. 108:2831–2836.

Elyashiv E, Bullaughey K, Sattath S, Rinott Y, Przeworski M, Sella G. 2010. Shifts in the intensity of purifying selection: an analysis of genome-wide polymorphism data from two closely related yeast species. Genome Res. 20:1558–1573.

Engelthaler DM, Hicks ND, Gillece JD, Roe CC, Schupp JM, Driebe EM, Gilgado F, Carriconde F, Trilles L, Firacative C, et al. 2014. Cryptococcus gattii in North American Pacific Northwest: whole-population genome analysis provides insights into species evolution and dispersal. mBio 5:e01464–e01414.

Fawcett JA, Iida T, Takuno S, Sugino RP, Kado T, Kugou K, Mura S, Kobayashi T, Ohta K, Nakayama J, et al. 2014. Population genomics of the fission yeast Schizosaccharomyces pombe. PLoS One 9:e104241.

Fenchel T, Finlay BJ. 2004. The ubiquity of small species: patterns of local and global diversity. Bioscience 54:777–784.

Finlay BJ. 2002. Global dispersal of free-living microbial eukaryote species. Science 296:1061–1063.

Finlay BJ, Esteban GF, Fenchel T. 2004. Protist diversity is different? Protist 155:15–22.

Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J, et al. 2014. Pfam: the protein families database. Nucleic Acids Res. 42:D222–D230.

Foissac S, Gouzy J, Rombauts S, Mathe C, Amselem J, Sterck L, Van de Peer Y, Rouze P, Schiex T. 2008. Genome annotation in plants and fungi: EuGene as a model platform. Curr Bioinform. 3:87–97.

Foissner W. 1999. Protist diversity: estimates of the near-imponderable. Protist 150:363–368.

Foissner W. 2006. Biogeography and dispersal of micro-organisms: a review emphasizing protists. Acta Protozool. 45:111–136.

Foissner W. 2008. Protist diversity and distribution: some basic considerations. Biodivers Conserv. 17:235–242.

Foissner W. 2011. Dispersal of protists: the role of cysts and human introductions. In: Fontaneto D, editor. Biogeography of microscopic organisms: is everything small everywhere? Cambridge University Press. p. 61–87.

Foissner W, Chao A, Katz LA. 2008. Diversity and geographic distribution of ciliates (Protista: Ciliophora). Biodivers Conserv. 17:345–363.

Fournier-Level A, Korte A, Cooper MD, Nordborg M, Schmitt J, Wilczek AM. 2011. A map of local adaptation in Arabidopsis thaliana. Science 334:86–89.

Galtier N, Nabholz B, Glemin S, Hurst GDD. 2009. Mitochondrial DNA as a marker of molecular diversity: a reappraisal. Mol Ecol. 18:4541–4550.

Gout JF, Kahn D, Duret L, Paramecium Post-Genomics C. 2010. The relationship among gene expression, the evolution of gene dosage, and the rate of protein evolution. PLoS Genet. 6:e1000944.

Gout JF, Lynch M. 2015. Maintenance and loss of duplicated genes by dosage subfunctionalization. Mol Biol Evol. 32:2141–2148.

Gruchy DF. 1955. The breeding system and distribution of Tetrahymena pyriformis. J Protozool. 2:178–185.

Gustincich S, Manfioletti G, Del Sal G, Schneider C, Carninci P. 1991. A fast method for high-quality genomic DNA extraction from whole human blood. BioTechniques 11:298–300, 302.

Hancock AM, Witonsky DB, Alkorta-Aranburu G, Beall CM, Gebremedhin A, Sukernik R, Utermann G, Pritchard JK, Coop G, Di Rienzo A. 2011. Adaptations to climate-mediated selective pressures in humans. PLoS Genet. 7:e1001375.

Hiwatashi K. 1968. Determination and inheritance of mating type in Paramecium caudatum. Genetics 58:373–386.

Hori M, Tomikawa I, Przybos E, Fujishima M. 2006. Comparison of the evolutionary distances among syngens and sibling species of Paramecium. Mol Phylogenet Evol. 38:697–704.

Huang X, Kurata N, Wei X, Wang ZX, Wang A, Zhao Q, Zhao Y, Liu K, Lu H, Li W, et al. 2012. A map of rice genome variation reveals the origin of cultivated rice. Nature 490:497–501.

Huang X, Zhao Q, Han B. 2015. Comparative population genomics reveals strong divergence and infrequent introgression between Asian and African rice. Mol Plant. 8:958–960.

Hudson RR. 2002. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18:337–338.

Hudson RR, Turelli M. 2003. Stochasticity overrules the “three-times rule”: genetic drift, genetic draft, and coalescence times for nuclear loci versus mitochondrial DNA. Evolution 57:182–190.

Jaillon O, Bouhouche K, Gout JF, Aury JM, Noel B, Saudemont B, Nowacki M, Serrano V, Porcel BM, Segurens B, et al. 2008. Translational control of intron splicing in eukaryotes. Nature 451:359–362.

James JE, Piganeau G, Eyre-Walker A. 2016. The rate of adaptive evolution in animal mitochondria. Mol Ecol. 25:67–78.

Joy DA, Feng X, Mu J, Furuya T, Chotivanich K, Krettli AU, Ho M, Wang A, White NJ, Suh E, et al. 2003. Early origin and recent expansion of Plasmodium falciparum. Science 300:318–321.

Kaplan NL, Hudson RR, Langley CH. 1989. The hitchhiking effect revisited. Genetics 123:887–899.

Katz LA, McManus GB, Snoeyenbos-West OLO, Griffin A, Pirog K, Costas B, Foissner W. 2005. Reframing the 'Everything is everywhere' debate: evidence for high gene flow and diversity in ciliate morphospecies. Aquat Microb Ecol. 41:55–65.

Katz LA, Snoeyenbos-West O, Doerder FP. 2006. Patterns of protein evolution in Tetrahymena thermophila: implications for estimates of effective population size. Mol Biol Evol. 23:608–614.

Krenek S, Berendonk TU, Fokin SI. 2015. New Paramecium (Ciliophora, Oligohymenophorea) congeners shape our view on its biodiversity. Org Divers Evol. 15:215–233.

Krenek S, Berendonk TU, Petzoldt T. 2011. Thermal performance curves of Paramecium caudatum: a model selection approach. Eur J Protistol. 47:124–137.

Krenek S, Petzoldt T, Berendonk TU. 2012. Coping with temperature at the warm edge – patterns of thermal adaptation in the microbial eukaryote Paramecium caudatum. PLoS One 7:e30598.

Kudla G, Murray AW, Tollervey D, Plotkin JB. 2009. Coding-sequence determinants of gene expression in Escherichia coli. Science 324:255–258.

Kvitek DJ, Will JL, Gasch AP. 2008. Variations in stress sensitivity and genomic expression in diverse S. cerevisiae isolates. PLoS Genet. 4:e1000223.

Lahr DJ, Laughinghouse HDt, Oliverio AM, Gao F, Katz LA. 2014. How discordant morphological and molecular evolution among microorganisms can revise our notions of biodiversity on Earth. Bioessays 36:950–959.

Langley CH, Stevens K, Cardeno C, Lee YC, Schrider DR, Pool JE, Langley SA, Suarez C, Corbett-Detig RB, Kolaczkowski B, et al. 2012. Genomic variation in natural populations of Drosophila melanogaster. Genetics 192:533–598.

Leffler EM, Bullaughey K, Matute DR, Meyer WK, Segurel L, Venkat A, Andolfatto P, Przeworski M. 2012. Revisiting an old riddle: what determines genetic diversity levels within species? PLoS Biol. 10:e1001388.

Li H. 2011. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27:2987–2993.

Li H, Durbin R. 2010. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26:589–595.

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Genome Project Data Processing S. 2009. The sequence alignment/Map format and SAMtools. Bioinformatics 25:2078–2079.

Lim LP, Burge CB. 2001. A computational analysis of sequence features involved in recognition of short introns. Proc Natl Acad Sci U S A. 98:11193–11198.

Liti G, Carter DM, Moses AM, Warringer J, Parts L, James SA, Davey RP, Roberts IN, Burt A, Koufopanou V, et al. 2009. Population genomics of domestic and wild yeasts. Nature 458:337–341.

Loveless MD, Hamrick JL. 1984. Ecological determinants of genetic-structure in plant-populations. Annu Rev Ecol Syst. 15:65–95.

Lynch M. 2007. The origins of genome architecture. Sunderland (MA): Sinauer Associates.

Lynch M. 2009. Estimation of allele frequencies from high-coverage genome-sequencing projects. Genetics 182:295–301.

Lynch M, Field MC, Goodson HV, Malik HS, Pereira-Leal JB, Roos DS, Turkewitz AP, Sazer S. 2014. Evolutionary cell biology: two origins, one objective. Proc Natl Acad Sci U S A. 111:16990–16994.

Lynch M, Sung W, Morris K, Coffey N, Landry CR, Dopman EB, Dickinson WJ, Okamoto K, Kulkarni S, Hartl DL, et al. 2008. A genome-wide view of the spectrum of spontaneous mutations in yeast. Proc Natl Acad Sci U S A. 105:9272–9277.

Maere S, Heymans K, Kuiper M. 2005. BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics 21:3448–3449.

Maguire B, Belk D. 1967. Paramecium transport by land snails. J Protozool. 14:445–447.

Maruki T, Lynch M. 2015. Genotype-frequency estimation from high-throughput sequencing data. Genetics 201:473–486.

McGrath CL, Gout JF, Doak TG, Yanagi A, Lynch M. 2014a. Insights into three whole-genome duplications gleaned from the Paramecium caudatum genome sequence. Genetics 197:1417–1428.

McGrath CL, Gout JF, Johri P, Doak TG, Lynch M. 2014b. Differential retention and divergent resolution of duplicate genes following whole-genome duplication. Genome Res. 24:1665–1675.

Miller JR, Delcher AL, Koren S, Venter E, Walenz BP, Brownley A, Johnson J, Li K, Mobarry C, Sutton G. 2008. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics 24:2818–2824.

Moradian MM, Beglaryan D, Skozylas JM, Kerikorian V. 2007. Complete mitochondrial genome sequence of three Tetrahymena species reveals mutation hot spots and accelerated nonsynonymous substitutions in Ymf genes. PLoS One 2:e650.

Nabholz B, Glemin S, Galtier N. 2009. The erratic mitochondrial clock: variations of mutation rate, not population size, affect mtDNA diversity across birds and mammals. BMC Evol Biol. 9:54.

Nabholz B, Mauffrey JF, Bazin E, Galtier N, Glemin S. 2008. Determination of mitochondrial genetic diversity in mammals. Genetics 178:351–361.

Neafsey DE, Galinsky K, Jiang RH, Young L, Sykes SM, Saif S, Gujja S, Goldberg JM, Young S, Zeng Q, et al. 2012. The malaria parasite Plasmodium vivax exhibits greater genetic diversity than Plasmodium falciparum. Nat Genet. 44:1046–1050.

Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, Indap A, King KS, Bergmann S, Nelson MR, et al. 2008. Genes mirror geography within Europe. Nature 456:98–101.

Ohta T. 1974. Mutational pressure as main cause of molecular evolution and polymorphism. Nature 252:351–354.

Ohta T. 1992. The nearly neutral theory of molecular evolution. Annu Rev Ecol Syst. 23:263–286.

Pawlowski J, Audic S, Adl S, Bass D, Belbahri L, Berney C, Bowser SS, Cepicka I, Decelle J, Dunthorn M, et al. 2012. CBOL protist working group: barcoding eukaryotic richness beyond the animal, plant, and fungal kingdoms. PLoS Biol. 10:e1001419.

Piganeau G, Eyre-Walker A. 2009. Evidence for variation in the effective population size of animal mitochondrial DNA. PLoS One 4:e4396.

Pool JE, Nielsen R. 2007. Population size changes reshape genomic patterns of diversity. Evolution 61:3001–3006.

Pophaly SD, Tellier A. 2015. Population level purifying selection and gene expression shape subgenome evolution in maize. Mol Biol Evol. 32:3226–3235.

Prescott DM. 1994. The DNA of ciliated protozoa. Microbiol Rev. 58:233–267.

Pritchard AE, Seilhamer JJ, Mahalingam R, Sable CL, Venuti SE, Cummings DJ. 1990. Nucleotide sequence of the mitochondrial genome of Paramecium. Nucleic Acids Res. 18:173–180.

Przybos E. 1975. Genetic studies of Paramecium jenningsi strains (Diller, Earl, 1958). Folia Biol (Krakow). 23:425–471.

Przybos E. 2005. Recent data on the occurrence of species of the Paramecium aurelia complex in Europe. Folia Biol (Krakow). 53:61–63.

Przybos E, Barth D, Berendonk TU. 2008. The Paramecium aurelia species complex, frequency and co-occurrence across Europe. Folia Biol (Krakow). 56:77–81.

Przybos E, Rautian M, Greczek-Stachura M, Potekhin A. 2007. Polymorphism within Paramecium sexaurelia (Ciliophora, Oligohymenophorea) and description of a new stand of the species in China. Folia Biol (Krakow). 55:121–125.

Przybos E, Surmacz M. 2010. New world-wide data on the distribution of species of the Paramecium aurelia complex (Ciliophora, Protozoa). Folia Biol (Krakow). 58:185–188.

R-Core-Team. 2014. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

Redden H, Alper HS. 2015. The development and characterization of synthetic minimal yeast promoters. Nat Commun. 6:7810.

Ritter DI, Dong Z, Guo S, Chuang JH. 2012. Transcriptional enhancers in protein-coding exons of vertebrate developmental genes. PLoS One 7:e35202.

Russell CB, Fraga D, Hinrichsen RD. 1994. Extremely short 20-33 nucleotide introns are the standard length in Paramecium tetraurelia. Nucleic Acids Res. 22:1221–1225.

Salim HMW, Ring KL, Cavalcanti ARO. 2008. Patterns of codon usage in two ciliates that reassign the genetic code: Tetrahymena thermophila and Paramecium tetraurelia. Protist 159:283–298.

Schacherer J, Shapiro JA, Ruderfer DM, Kruglyak L. 2009. Comprehensive polymorphism survey elucidates population structure of Saccharomyces cerevisiae. Nature 458:342–345.

Schattner P, Brooks AN, Lowe TM. 2005. The tRNAscan-SE, snoscan and snoGPS web servers for the detection of tRNAs and snoRNAs. Nucleic Acids Res. 33:W686–W689.

Schrider DR, Kern AD. 2015. Inferring selective constraint from population genomic data suggests recent regulatory turnover in the human brain. Genome Biol Evol. 7:3511–3528.

Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. 2003. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13:2498–2504.

Simonsen KL, Churchill GA, Aquadro CF. 1995. Properties of statistical tests of neutrality for DNA polymorphism data. Genetics 141:413–429.

Smith DR, Lee RW. 2008. Nucleotide diversity in the mitochondrial and nuclear compartments of Chlamydomonas reinhardtii: investigating the origins of genome architecture. BMC Evol Biol. 8:156.

Smith JM, Haigh J. 1974. The hitch-hiking effect of a favourable gene. Genetics Res. 89:391–403.

Snoke MS, Berendonk TU, Barth D, Lynch M. 2006. Large global effective population sizes in Paramecium. Mol Biol Evol. 23:2474–2479.

Sonneborn TM. 1970. Methods in Paramecium research. In: Methods Cell Biol. New York and London: Academic Press. p. 241–339.

Sonneborn TM. 1975. Paramecium aurelia complex of fourteen sibling species. T Am Microsc Soc. 94:155–178.

Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30:1312–1313.

Stoeck T, Welter H, Seitz-Bender D, Kusch J, Schmidt HJ. 2000. ARDRA and RAPD-fingerprinting reject the sibling species concept for the ciliate Paramecium caudatum (Ciliophora, Protoctista). Zool Scr. 29:75–82.

Sun M. 1999. Cleistogamy in Scutellaria indica (Labiatae): effective mating system and population genetic structure. Mol Ecol. 8:1285–1295.

Swart EC, Nowacki M, Shum J, Stiles H, Higgins BP, Doak TG, Schotanus K, Magrini VJ, Minx P, Mardis ER, et al. 2012. The Oxytricha trifallax mitochondrial genome. Genome Biol Evol. 4:136–154.

Tajima F. 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123:585–595.

Tishkoff SA, Reed FA, Ranciaro A, Voight BF, Babbitt CC, Silverman JS, Powell K, Mortensen HM, Hirbo JB, Osman M, et al. 2007. Convergent adaptation of human lactase persistence in Africa and Europe. Nat Genet. 39:31–40.

Tsukii Y, Hiwatashi K. 1983. Genes controlling mating-type specificity in Paramecium caudatum: three loci revealed by intersyngenic crosses. Genetics 104:41–62.

Tsukii Y, Hiwatashi K. 1985. Meiotic nondisjunction and aneuploids in intersyngenic hybrids of Paramecium caudatum. Genetics 111:779–794.

Tuller T, Zur H. 2015. Multiple roles of the coding sequence 5' end in gene expression regulation. Nucleic Acids Res. 43:13–28.

Voight BF, Kudaravalli S, Wen X, Pritchard JK. 2006. A map of recent positive selection in the human genome. PLoS Biol. 4:e72.

Volkman SK, Sabeti PC, DeCaprio D, Neafsey DE, Schaffner SF, Milner DA Jr, Daily JP, Sarr O, Ndiaye D, Ndir O, et al. 2007. A genome-wide map of diversity in Plasmodium falciparum. Nat Genet. 39:113–119.

Weir W, Capewell P, Foth B, Clucas C, Pountain A, Steketee P, Veitch N, Koffi M, De Meeus T, Kabore J, et al. 2016. Population genomics reveals the origin and asexual evolution of human infective trypanosomes. eLife 5:e11473.

Weisse T. 2008. Distribution and diversity of aquatic protists: an evolutionary and ecological perspective. Biodivers Conserv. 17:243–259.

Wickham H. 2010. ggplot2: Elegant graphics for data analysis. New York: Springer-Verlag.

Zhang JZ, Yang JR. 2015. Determinants of the rate of protein sequence evolution. Nat Rev Genet. 16:409–420.

Johri et al doi101093molbevmsx074 MBE

1216Downloaded from httpsacademicoupcommbearticle-abstract34511942992914by Arizona State University useron 02 July 2018

Page 21: Population Genomics of Paramecium Species...Population Genomics of Paramecium Species Parul Johri,*,1 Sascha Krenek,2 Georgi K. Marinov,1 Thomas G. Doak,1,3 Thomas U. Berendonk,2 and

wide polymorphism data from two closely related yeast speciesGenome Res 201558ndash1573

Engelthaler DM Hicks ND Gillece JD Roe CC Schupp JM Driebe EMGilgado F Carriconde F Trilles L Firacative C et al 2014Cryptococcus gattii in North American Pacific Northwest whole-population genome analysis provides insights into species evolutionand dispersal mBio 5e01464ndashe01414

Fawcett JA Iida T Takuno S Sugino RP Kado T Kugou K Mura SKobayashi T Ohta K Nakayama J et al 2014 Population genomicsof the fission yeast Schizosaccharomyces pombe PLoS ONE9e104241

Fenchel T Finlay BJ 2004 The ubiquity of small species patterns of localand global diversity Bioscience 54777ndash784

Finlay BJ 2002 Global dispersal of free-living microbial eukaryote speciesScience 2961061ndash1063

Finlay BJ Esteban GF Fenchel T 2004 Protist diversity is different Protist15515ndash22

Finn RD Bateman A Clements J Coggill P Eberhardt RY Eddy SR HegerA Hetherington K Holm L Mistry J et al 2014 Pfam the proteinfamilies database Nucleic Acids Res 42D222ndashD230

Foissac S Gouzy J Rombauts S Mathe C Amselem J Sterck L Van dePeer Y Rouze P Schiex T 2008 Genome annotation in plants andfungi EuGene as a model platform Curr Bioinform 387ndash97

Foissner W 1999 Protist diversity estimates of the near-imponderableProtist 150363ndash368

Foissner W 2006 Biogeography and dispersal of micro-organisms areview emphasizing protists Acta Protozool 45111ndash136

Foissner W 2008 Protist diversity and distribution some basic consid-erations Biodivers Conserv 17235ndash242

Foissner W 2011 Dispersal of protists the role of cysts and humanintroductions In Fontaneto D editor Biogeography of microscopicorganisms is everything small everywhere Cambridge UniversityPress p 61ndash87

Foissner W Chao A Katz LA 2008 Diversity and geographic distributionof ciliates (Protista Ciliophora) Biodivers Conserv 17345ndash363

Fournier-Level A Korte A Cooper MD Nordborg M Schmitt J WilczekAM 2011 A map of local adaptation in Arabidopsis thaliana Science33486ndash89

Galtier N Nabholz B Glemin S Hurst GDD 2009 Mitochondrial DNA asa marker of molecular diversity a reappraisal Molr Ecol184541ndash4550

Gout JF Kahn D Duret L Paramecium Post-Genomics C 2010 Therelationship among gene expression the evolution of genedosage and the rate of protein evolution PLoS Genet6e1000944

Gout JF Lynch M 2015 Maintenance and loss of duplicated genes bydosage subfunctionalization Mol Biol Evol 322141ndash2148

Gruchy DF 1955 The breeding system and distribution of Tetrahymenapyriformis J Protozool 2178ndash185

Gustincich S Manfioletti G Del Sal G Schneider C Carninci P 1991 Afast method for high-quality genomic DNA extraction from wholehuman blood BioTechniques 11298ndash300 302

Hancock AM Witonsky DB Alkorta-Aranburu G Beall CMGebremedhin A Sukernik R Utermann G Pritchard JK Coop GDi Rienzo A 2011 Adaptations to climate-mediated selective pres-sures in humans PLoS Genet 7e1001375

Hiwatashi K 1968 Determination and inheritance of mating type inParamecium caudatum Genetics 58373ndash386

Hori M Tomikawa I Przybos E Fujishima M 2006 Comparison of theevolutionary distances among syngens and sibling species ofParamecium Mol Phylogenet Evol 38697ndash704

Huang X Kurata N Wei X Wang ZX Wang A Zhao Q Zhao Y Liu K LuH Li W et al 2012 A map of rice genome variation reveals the originof cultivated rice Nature 490497ndash501

Huang X Zhao Q Han B 2015 Comparative population genomics re-veals strong divergence and infrequent introgression between Asianand African rice Mol Plant 8958ndash960

Hudson RR 2002 Generating samples under a Wright-Fisher neutralmodel of genetic variation Bioinformatics 18337ndash338

Hudson RR Turelli M 2003 Stochasticity overrules the ldquothree-timesrulerdquo genetic drift genetic draft and coalescence times for nuclearloci versus mitochondrial DNA Evolution 57182ndash190

Jaillon O Bouhouche K Gout JF Aury JM Noel B Saudemont BNowacki M Serrano V Porcel BM Segurens B et al 2008Translational control of intron splicing in eukaryotes Nature451359ndash362

James JE Piganeau G Eyre-Walker A 2016 The rate of adaptive evolutionin animal mitochondria Mol Ecol 2567ndash78

Joy DA Feng X Mu J Furuya T Chotivanich K Krettli AU Ho M WangA White NJ Suh E et al 2003 Early origin and recent expansion ofPlasmodium falciparum Science 300318ndash321

Kaplan NL Hudson RR Langley CH 1989 The hitchhiking effect re-visited Genetics 123887ndash899

Katz LA McManus GB Snoeyenbos-West OLO Griffin A Pirog K CostasB Foissner W 2005 Reframing the 0Everything is everywhere0 debateevidence for high gene flow and diversity in ciliate morphospeciesAquat Microb Ecol 4155ndash65

Katz LA Snoeyenbos-West O Doerder FP 2006 Patterns of proteinevolution in Tetrahymena thermophila implications for estimatesof effective population size Mol Biol Evol 23608ndash614

Krenek S Berendonk TU Fokin SI 2015 New Paramecium (CiliophoraOligohymenophorea) congeners shape our view on its biodiversityOrg Divers Evol 15215ndash233

Krenek S Berendonk TU Petzoldt T 2011 Thermal performance curvesof Paramecium caudatum a model selection approach Eur JProtistol 47124ndash137

Krenek S Petzoldt T Berendonk TU 2012 Coping with temperature atthe warm edgendashpatterns of thermal adaptation in the microbialeukaryote Paramecium caudatum PLoS One 7e30598

Kudla G Murray AW Tollervey D Plotkin JB 2009 Coding-sequencedeterminants of gene expression in Escherichia coli Science324255ndash258

Kvitek DJ Will JL Gasch AP 2008 Variations in stress sensitivity andgenomic expression in diverse S cerevisiae isolates PLoS Genet4e1000223

Lahr DJ Laughinghouse HDt Oliverio AM Gao F Katz LA 2014 Howdiscordant morphological and molecular evolution among micro-organisms can revise our notions of biodiversity on Earth Bioessays36950ndash959

Langley CH Stevens K Cardeno C Lee YC Schrider DR Pool JE LangleySA Suarez C Corbett-Detig RB Kolaczkowski B et al 2012 Genomicvariation in natural populations of Drosophila melanogaster Genetics192533ndash598

Leffler EM Bullaughey K Matute DR Meyer WK Segurel L Venkat AAndolfatto P Przeworski M 2012 Revisiting an old riddle whatdetermines genetic diversity levels within species PLoS Biol10e1001388

Li H 2011 A statistical framework for SNP calling mutation discoveryassociation mapping and population genetical parameter estimationfrom sequencing data Bioinformatics 272987ndash2993

Li H Durbin R 2010 Fast and accurate long-read alignment withBurrows-Wheeler transform Bioinformatics 26589ndash595

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R Genome Project Data Processing S 2009 Thesequence alignmentMap format and SAMtools Bioinformatics252078ndash2079

Lim LP Burge CB 2001 A computational analysis of sequence featuresinvolved in recognition of short introns Proc Natl Acad Sci U S A9811193ndash11198

Liti G Carter DM Moses AM Warringer J Parts L James SA Davey RPRoberts IN Burt A Koufopanou V et al 2009 Population genomicsof domestic and wild yeasts Nature 458337ndash341

Loveless MD Hamrick JL 1984 Ecological determinants of genetic-structure in plant-populations Annu Rev Ecol Syst 1565ndash95

Lynch M 2007 The origins of genome architecture Sunderland (MA)Sinauer Associates

Lynch M 2009 Estimation of allele frequencies from high-coverage ge-nome-sequencing projects Genetics 182295ndash301

Johri et al doi101093molbevmsx074 MBE

1214Downloaded from httpsacademicoupcommbearticle-abstract34511942992914by Arizona State University useron 02 July 2018

Lynch M Field MC Goodson HV Malik HS Pereira-Leal JB Roos DSTurkewitz AP Sazer S 2014 Evolutionary cell biology two originsone objective Proc Natl Acad Sci U S A 11116990ndash16994

Lynch M Sung W Morris K Coffey N Landry CR Dopman EB DickinsonWJ Okamoto K Kulkarni S Hartl DL et al 2008 A genome-wideview of the spectrum of spontaneous mutations in yeast Proc NatlAcad Sci U S A 1059272ndash9277

Maere S Heymans K Kuiper M 2005 BiNGO a Cytoscape plugin toassess overrepresentation of gene ontology categories in biologicalnetworks Bioinformatics 213448ndash3449

Maguire B Belk D 1967 Paramecium transport by land snails J Protozool14445ndash447

Maruki T Lynch M 2015 Genotype-frequency estimation from high-throughput sequencing data Genetics 201473ndash486

McGrath CL Gout JF Doak TG Yanagi A Lynch M 2014a Insights intothree whole-genome duplications gleaned from the Parameciumcaudatum genome sequence Genetics 1971417ndash1428

McGrath CL Gout JF Johri P Doak TG Lynch M 2014b Differentialretention and divergent resolution of duplicate genes followingwhole-genome duplication Genome Res 241665ndash1675

Miller JR Delcher AL Koren S Venter E Walenz BP Brownley A JohnsonJ Li K Mobarry C Sutton G 2008 Aggressive assembly of pyrose-quencing reads with mates Bioinformatics 242818ndash2824

Moradian MM Beglaryan D Skozylas JM Kerikorian V 2007 Completemitochondrial genome sequence of three Tetrahymena species re-veals mutation hot spots and accelerated nonsynonymous substi-tutions in Ymf genes PLoS ONE 2e650

Nabholz B Glemin S Galtier N 2009 The erratic mitochondrial clockvariations of mutation rate not population size affect mtDNA di-versity across birds and mammals BMC Evol Biol 954

Nabholz B Mauffrey JF Bazin E Galtier N Glemin S 2008 Determinationof mitochondrial genetic diversity in mammals Genetics178351ndash361

Neafsey DE Galinsky K Jiang RH Young L Sykes SM Saif S Gujja SGoldberg JM Young S Zeng Q et al 2012 The malaria parasitePlasmodium vivax exhibits greater genetic diversity thanPlasmodium falciparum Nat Genet 441046ndash1050

Novembre J Johnson T Bryc K Kutalik Z Boyko AR Auton A Indap AKing KS Bergmann S Nelson MR et al 2008 Genes mirror geogra-phy within Europe Nature 45698ndash101

Ohta T 1974 Mutational pressure as main cause of molecular evolutionand polymorphism Nature 252351ndash354

Ohta T 1992 The nearly neutral theory of molecular evolution AnnuRev Ecol Syst 23263ndash286

Pawlowski J Audic S Adl S Bass D Belbahri L Berney C Bowser SSCepicka I Decelle J Dunthorn M et al 2012 CBOL protist workinggroup barcoding eukaryotic richness beyond the animal plant andfungal kingdoms PLoS Biol 10e1001419

Piganeau G Eyre-Walker A 2009 Evidence for variation in the effectivepopulation size of animal mitochondrial DNA PLoS One 4e4396

Pool JE Nielsen R 2007 Population size changes reshape genomic pat-terns of diversity Evolution 613001ndash3006

Pophaly SD Tellier A 2015 Population level purifying selection and geneexpression shape subgenome evolution in maize Mol Biol Evol323226ndash3235

Prescott DM. 1994. The DNA of ciliated protozoa. Microbiol Rev. 58:233–267.

Pritchard AE, Seilhamer JJ, Mahalingam R, Sable CL, Venuti SE, Cummings DJ. 1990. Nucleotide sequence of the mitochondrial genome of Paramecium. Nucleic Acids Res. 18:173–180.

Przybos E. 1975. Genetic studies of Paramecium jenningsi strains (Diller, Earl, 1958). Folia Biol (Krakow) 23:425–471.

Przybos E. 2005. Recent data on the occurrence of species of the Paramecium aurelia complex in Europe. Folia Biol-Krakow 53:61–63.

Przybos E, Barth D, Berendonk TU. 2008. The Paramecium aurelia species complex, frequency and co-occurrence across Europe. Folia Biol (Krakow) 56:77–81.

Przybos E, Rautian M, Greczek-Stachura M, Potekhin A. 2007. Polymorphism within Paramecium sexaurelia (Ciliophora, Oligohymenophorea) and description of a new stand of the species in China. Folia Biol (Krakow) 55:121–125.

Przybos E, Surmacz M. 2010. New world-wide data on the distribution of species of the Paramecium aurelia complex (Ciliophora, Protozoa). Folia Biol-Krakow 58:185–188.

R-Core-Team. 2014. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

Redden H, Alper HS. 2015. The development and characterization of synthetic minimal yeast promoters. Nat Commun. 6:7810.

Ritter DI, Dong Z, Guo S, Chuang JH. 2012. Transcriptional enhancers in protein-coding exons of vertebrate developmental genes. PLoS One 7:e35202.

Russell CB, Fraga D, Hinrichsen RD. 1994. Extremely short 20-33 nucleotide introns are the standard length in Paramecium tetraurelia. Nucleic Acids Res. 22:1221–1225.

Salim HMW, Ring KL, Cavalcanti ARO. 2008. Patterns of codon usage in two ciliates that reassign the genetic code: Tetrahymena thermophila and Paramecium tetraurelia. Protist 159:283–298.

Schacherer J, Shapiro JA, Ruderfer DM, Kruglyak L. 2009. Comprehensive polymorphism survey elucidates population structure of Saccharomyces cerevisiae. Nature 458:342–345.

Schattner P, Brooks AN, Lowe TM. 2005. The tRNAscan-SE, snoscan and snoGPS web servers for the detection of tRNAs and snoRNAs. Nucleic Acids Res. 33:W686–W689.

Schrider DR, Kern AD. 2015. Inferring selective constraint from population genomic data suggests recent regulatory turnover in the human brain. Genome Biol Evol. 7:3511–3528.

Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. 2003. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13:2498–2504.

Simonsen KL, Churchill GA, Aquadro CF. 1995. Properties of statistical tests of neutrality for DNA polymorphism data. Genetics 141:413–429.

Smith DR, Lee RW. 2008. Nucleotide diversity in the mitochondrial and nuclear compartments of Chlamydomonas reinhardtii: investigating the origins of genome architecture. BMC Evol Biol. 8:156.

Smith JM, Haigh J. 1974. The hitch-hiking effect of a favourable gene. Genetics Res. 89:391–403.

Snoke MS, Berendonk TU, Barth D, Lynch M. 2006. Large global effective population sizes in Paramecium. Mol Biol Evol. 23:2474–2479.

Sonneborn TM. 1970. Methods in Paramecium research. In: Methods Cell Biol. New York and London: Academic Press. p. 241–339.

Sonneborn TM. 1975. Paramecium aurelia complex of fourteen sibling species. T Am Microsc Soc. 94:155–178.

Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30:1312–1313.

Stoeck T, Welter H, Seitz-Bender D, Kusch J, Schmidt HJ. 2000. ARDRA and RAPD-fingerprinting reject the sibling species concept for the ciliate Paramecium caudatum (Ciliophora, Protoctista). Zool Scr. 29:75–82.

Sun M. 1999. Cleistogamy in Scutellaria indica (Labiatae): effective mating system and population genetic structure. Mol Ecol. 8:1285–1295.

Swart EC, Nowacki M, Shum J, Stiles H, Higgins BP, Doak TG, Schotanus K, Magrini VJ, Minx P, Mardis ER, et al. 2012. The Oxytricha trifallax mitochondrial genome. Genome Biol Evol. 4:136–154.

Tajima F. 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123:585–595.

Tishkoff SA, Reed FA, Ranciaro A, Voight BF, Babbitt CC, Silverman JS, Powell K, Mortensen HM, Hirbo JB, Osman M, et al. 2007. Convergent adaptation of human lactase persistence in Africa and Europe. Nat Genet. 39:31–40.

Tsukii Y, Hiwatashi K. 1983. Genes controlling mating-type specificity in Paramecium caudatum: three loci revealed by intersyngenic crosses. Genetics 104:41–62.

Population Genomics of Paramecium Species. doi:10.1093/molbev/msx074. MBE

Tsukii Y, Hiwatashi K. 1985. Meiotic nondisjunction and aneuploids in intersyngenic hybrids of Paramecium caudatum. Genetics 111:779–794.

Tuller T, Zur H. 2015. Multiple roles of the coding sequence 5' end in gene expression regulation. Nucleic Acids Res. 43:13–28.

Voight BF, Kudaravalli S, Wen X, Pritchard JK. 2006. A map of recent positive selection in the human genome. PLoS Biol. 4:e72.

Volkman SK, Sabeti PC, DeCaprio D, Neafsey DE, Schaffner SF, Milner DA Jr, Daily JP, Sarr O, Ndiaye D, Ndir O, et al. 2007. A genome-wide map of diversity in Plasmodium falciparum. Nat Genet. 39:113–119.

Weir W, Capewell P, Foth B, Clucas C, Pountain A, Steketee P, Veitch N, Koffi M, De Meeus T, Kabore J, et al. 2016. Population genomics reveals the origin and asexual evolution of human infective trypanosomes. eLife 5:e11473.

Weisse T. 2008. Distribution and diversity of aquatic protists: an evolutionary and ecological perspective. Biodivers Conserv. 17:243–259.

Wickham H. 2010. ggplot2: Elegant graphics for data analysis. New York: Springer-Verlag.

Zhang JZ, Yang JR. 2015. Determinants of the rate of protein sequence evolution. Nat Rev Genet. 16:409–420.
