13
Characterization of genetic diversity in the nematode Pristionchus pacificus from population-scale resequencing data Christian R¨ odelsperger 1 , Richard A. Neher 2 , Andreas M. Weller 1 , Gabi Eberhardt 1 , Hanh Witte 1 , Werner E. Mayer 1 , Christoph Dieterich 3 , Ralf J. Sommer 1 1 Department for Evolutionary Biology, Max Planck Institute for Developmental Biology,72076, T¨ ubingen, Germany, 2 Evolutionary Dynamics and Biophysics Group, Max Planck Institute for Developmental Biology, 72076, T¨ ubingen, Germany, 3 Berlin Institute for Medical Systems Biology (BIMSB) at the Max Delbr¨ uck Center for Molecular Medicine Berlin, Robert-R¨ oßle-Str. 10, 13125 Berlin, Germany (Dated: January 10, 2014) The hermaphroditic nematode Pristionchus pacificus is an established model system for com- parative studies with Caenorhabditis elegans in developmental biology, ecology and population genetics. In this study, we present whole genome sequencing data of 104 P. pacificus strains and the draft assembly of the obligate outcrossing sister species P. exspectatus. We character- ize genetic diversity within P. pacificus and investigate the population genetic processes shaping this diversity. P. pacificus is ten times more diverse than C. elegans and exhibits substantial population structure that allows us to probe its evolution on multiple time scales. Consistent with reduced effective recombination in this self-fertilizing species, we find haplotype blocks that span several megabases. Using the P. exspectatus genome as outgroup, we polarized variation in P. pacificus and found a site frequency spectrum (SFS) that decays more rapidly than expected in neutral models. The SFS at putatively neutral sites is U-shaped, which is a characteristic feature of pervasive linked selection. Based on the additional findings i) that the majority of nonsynony- mous variation is eliminated over time scales on the order of the separation between clades, ii) that diversity is reduced in gene rich regions, and iii) that highly differentiated clades show very similar patterns of diversity, we conclude that purifying selection on many mutations with weak effects is a major force shaping genetic diversity in P. pacificus. Genetic variation originates from mutation and re- combination and is removed by selection and drift. Many hermaphroditic nematode species reproduce by self-fertilization with occasional outcrossing events via males, rather than obligatory outcrossing such as in most animals. Similarly, many plant species self-fertilize or reproduce vegetatively. Self-fertilization reduces ef- fective recombination and can result in a decrease of neutral diversity through selection against linked delete- rious variants (background selection) (Charlesworth et al., 1993) or “hitchhiking” with linked beneficial vari- ants (Gillespie, 2000; Maynard Smith and Haigh, 1974). In addition, linkage between loci with non-neutral variation leads to Hill-Robertson interference (Hill and Robertson, 1966) and more generally to a reduced effi- cacy of selection (Barton and Charlesworth, 1998). Such random associations of variants with genetic back- grounds of different fitness generate a stochastic force – genetic draft – distinct from common genetic drift (Gillespie, 2000). In addition to reducing neutral ge- netic diversity, draft results in different patterns of vari- ation that can be used to disentangle the roles of drift and draft. Even though the term draft was introduced in the context of recurrent hitch-hiking, we use it here more generally to refer to the effect of linked selection of any kind as it has been shown that the nature of the selected variation is immaterial as long as many loci contribute to fitness diversity (Neher and Hallatschek, 2013). The prediction that neutral diversity should be negatively correlated with the local recombination rate has been verified in populations of Drosophila and Caenorhabdi- tis briggsae (Begun and Aquadro, 1992; Begun et al., 2007; Cutter and Choi, 2010). A comprehensive re- view of the theoretical concepts of linked selection and the empirical evidence for different scenarios can be found in (Cutter and Payseur, 2013; Neher, 2013). The hermaphroditic nematode Pristionchus pacificus has a global distribution and a clearly defined ecological niche, i.e. it is found with scarab beetles in a necromenic association: nematodes infest beetles as arrested dauer larvae, wait for the insect’s death to resume develop- ment and feed on the growing microbes on the carcass of the beetle. The P. pacificus system was introduced as a satellite model to the well established C. elegans multi- ple comparative studies have revealed divergent patterns in vulva formation, dauer development, and feeding be- havior (Bumbarger et al., 2013; Kienle and Sommer, 2013; Sinha et al., 2012). Similar to C. elegans, P. pacificus has six chromo- somes, which were shown to be largely macrosyntenic to their C. elegans counterparts (Lee et al., 2003; Srini- vasan et al., 2002). Nonetheless, the genomes exhibit substantial differences in terms of size, gene content, and repeat patterns (Dieterich et al., 2008; Molnar et al., 2012). Furthermore, the generation of a genetic linkage map (Srinivasan et al., 2002) and a combined physical map (Srinivasan et al., 2003) revealed large differences to C. elegans with regard to the observed recombination patterns. There is limited interference in P. pacificus and multiple crossover events were observed in all 48 in- Genetics: Early Online, published on January 17, 2014 as 10.1534/genetics.113.159855 Copyright 2014.

Characterization of genetic diversity in the nematode ... · 5.2 million substitutions and 1.2 million indels indicating a sequence divergence of approximately 10% that is dis-tributed

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Characterization of genetic diversity in the nematode ... · 5.2 million substitutions and 1.2 million indels indicating a sequence divergence of approximately 10% that is dis-tributed

Characterization of genetic diversity in the nematode Pristionchus pacificusfrom population-scale resequencing data

Christian Rodelsperger1, Richard A. Neher2, Andreas M. Weller1, Gabi Eberhardt1, Hanh Witte1, Werner E.Mayer1, Christoph Dieterich3, Ralf J. Sommer1

1Department for Evolutionary Biology, Max Planck Institute for Developmental Biology,72076, Tubingen, Germany,2Evolutionary Dynamics and Biophysics Group, Max Planck Institute for Developmental Biology, 72076, Tubingen,Germany, 3Berlin Institute for Medical Systems Biology (BIMSB) at the Max Delbruck Center for Molecular MedicineBerlin, Robert-Roßle-Str. 10, 13125 Berlin, Germany

(Dated: January 10, 2014)

The hermaphroditic nematode Pristionchus pacificus is an established model system for com-parative studies with Caenorhabditis elegans in developmental biology, ecology and populationgenetics. In this study, we present whole genome sequencing data of 104 P. pacificus strainsand the draft assembly of the obligate outcrossing sister species P. exspectatus. We character-ize genetic diversity within P. pacificus and investigate the population genetic processes shapingthis diversity. P. pacificus is ten times more diverse than C. elegans and exhibits substantialpopulation structure that allows us to probe its evolution on multiple time scales. Consistentwith reduced effective recombination in this self-fertilizing species, we find haplotype blocks thatspan several megabases. Using the P. exspectatus genome as outgroup, we polarized variation inP. pacificus and found a site frequency spectrum (SFS) that decays more rapidly than expected inneutral models. The SFS at putatively neutral sites is U-shaped, which is a characteristic featureof pervasive linked selection. Based on the additional findings i) that the majority of nonsynony-mous variation is eliminated over time scales on the order of the separation between clades, ii)that diversity is reduced in gene rich regions, and iii) that highly differentiated clades show verysimilar patterns of diversity, we conclude that purifying selection on many mutations with weakeffects is a major force shaping genetic diversity in P. pacificus.

Genetic variation originates from mutation and re-combination and is removed by selection and drift.Many hermaphroditic nematode species reproduce byself-fertilization with occasional outcrossing events viamales, rather than obligatory outcrossing such as inmost animals. Similarly, many plant species self-fertilizeor reproduce vegetatively. Self-fertilization reduces ef-fective recombination and can result in a decrease ofneutral diversity through selection against linked delete-rious variants (background selection) (Charlesworthet al., 1993) or “hitchhiking” with linked beneficial vari-ants (Gillespie, 2000; Maynard Smith and Haigh,1974). In addition, linkage between loci with non-neutralvariation leads to Hill-Robertson interference (Hill andRobertson, 1966) and more generally to a reduced effi-cacy of selection (Barton and Charlesworth, 1998).Such random associations of variants with genetic back-grounds of different fitness generate a stochastic force– genetic draft – distinct from common genetic drift(Gillespie, 2000). In addition to reducing neutral ge-netic diversity, draft results in different patterns of vari-ation that can be used to disentangle the roles of driftand draft. Even though the term draft was introduced inthe context of recurrent hitch-hiking, we use it here moregenerally to refer to the effect of linked selection of anykind as it has been shown that the nature of the selectedvariation is immaterial as long as many loci contribute tofitness diversity (Neher and Hallatschek, 2013). Theprediction that neutral diversity should be negativelycorrelated with the local recombination rate has been

verified in populations of Drosophila and Caenorhabdi-tis briggsae (Begun and Aquadro, 1992; Begun et al.,2007; Cutter and Choi, 2010). A comprehensive re-view of the theoretical concepts of linked selection andthe empirical evidence for different scenarios can be foundin (Cutter and Payseur, 2013; Neher, 2013).

The hermaphroditic nematode Pristionchus pacificushas a global distribution and a clearly defined ecologicalniche, i.e. it is found with scarab beetles in a necromenicassociation: nematodes infest beetles as arrested dauerlarvae, wait for the insect’s death to resume develop-ment and feed on the growing microbes on the carcassof the beetle. The P. pacificus system was introduced asa satellite model to the well established C. elegans multi-ple comparative studies have revealed divergent patternsin vulva formation, dauer development, and feeding be-havior (Bumbarger et al., 2013; Kienle and Sommer,2013; Sinha et al., 2012).

Similar to C. elegans, P. pacificus has six chromo-somes, which were shown to be largely macrosyntenicto their C. elegans counterparts (Lee et al., 2003; Srini-vasan et al., 2002). Nonetheless, the genomes exhibitsubstantial differences in terms of size, gene content, andrepeat patterns (Dieterich et al., 2008; Molnar et al.,2012). Furthermore, the generation of a genetic linkagemap (Srinivasan et al., 2002) and a combined physicalmap (Srinivasan et al., 2003) revealed large differencesto C. elegans with regard to the observed recombinationpatterns. There is limited interference in P. pacificusand multiple crossover events were observed in all 48 in-

Genetics: Early Online, published on January 17, 2014 as 10.1534/genetics.113.159855

Copyright 2014.

Page 2: Characterization of genetic diversity in the nematode ... · 5.2 million substitutions and 1.2 million indels indicating a sequence divergence of approximately 10% that is dis-tributed

2

dividuals of the meiotic mapping panel that the geneticlinkage map was built upon. Also, in contrast to C. el-egans, the P. pacificus recombination patterns along thechromosomes do not show an obvious distinction betweencentral regions with high gene density and low recombi-nation rates as opposed to chromosomal arms with lowgene density and high recombination rate (Srinivasanet al., 2003).

Phylogenetic studies have shown thathermaphroditism has evolved multiple times withinthe genera Caenorhabditis and Pristionchus (Denveret al., 2011). Recently, a very closely related outcrossingsister species, P. exspectatus, has been isolated fromstag beetles in Japan that still forms viable but sterileF1 hybrids with P. pacificus (Kanzaki et al., 2012).Similar to the comparative studies of selfing and out-crossing species in plants (Foxe et al., 2009; Wrightet al., 2002), the close phylogenetic relationship betweenP. pacificus and P. exspectatus provides a powerfulframework for studying genome evolution and asso-ciated population genetic processes in predominantlyself-fertilizing nematodes.

In this study, we present the draft genome of P. exspec-tatus and whole-genome sequencing data of 104 P. paci-ficus strains. Using P. exspectatus as outgroup, we in-vestigated population structure and genetic diversity ofP. pacificus. Our findings suggest that genetic diver-sity in P. pacificus is shaped by linked selection such asbackground selection on many weakly deleterious poly-morphisms.

I. RESULTS

A. The draft genome of P. exspectatus

We sequenced an inbred strain of P. exspectatus to anapproximate coverage of 97× on the Illumina platformusing libraries with different insert sizes (see Methodsfor details). The genome draft contains 167Mb of assem-bled sequence in 4,412 scaffolds spanning a total lengthof 177.6Mb with an N50 value of 142kb (Table 1). Re-alignment of the sequencing data placed 96-97% of readsonto the assembly, 95% of mapped pairs were in correctorientation.

Despite full-sibling inbreeding for ten generations weobserved ∼120,000 positions with ambiguous base callsfrom the realignments of sequencing data of the in-bred P. exspectatus strain. In order to test, whetherthese positions represent remaining heterozygosity noteliminated by inbreeding, we sequenced the non-inbreedP. exspectatus strain and identified ∼330,000 ambigu-ous positions indicating that the degree of ambiguousbase calls was reduced to 38% during the inbreeding pro-cess. The theoretical reduction of heterozygosity duringa full-sibling inbreeding process can be approximated byht ≈ 1.17h0 × (0.809)t, where h0 denotes the initial het-erozygosity and t is the number of inbred generations

(Naglyaki, 1992). After ten generations, the expectedheterozygosity h10 should be 14% – much lower then theobserved 38%. Although we cannot rule out a decreasedefficacy of inbreeding due to recessive lethal alleles or in-versions, these results suggest that up to 57% of the ob-served ambiguous positions do not represent remainingheterozygosity. We hypothesized that these ambiguouspositions may be due to overcompressed repetitive se-quences or recently duplicated regions that could not beresolved by the genome assembler. In accordance withthis, we found a consistently higher read coverage in re-gions that cover roughly 60% of all ambiguous positionsand spanning around 15% of the genome assembly (Sup-plemental Fig. 1). In further support for unresolvedrepetitive and recently duplicated regions as a potentialsource for the ambiguous positions, we even found tri-allelic positions (Supplemental Fig. 1). In contrast to aprevious analysis of assemblies of outcrossing nematodes(Barriere et al., 2009), we found ambiguous regionsnot only on autosomes but also on the sex chromosome.Thus, we estimate an actual genome size of P. exspecta-tus of around 200Mb. Whole genome alignments betweenP. pacificus and P. exspectatus covered 79.2 and 78.9Mbof uniquely alignable sequence, respectively, and revealed5.2 million substitutions and 1.2 million indels indicatinga sequence divergence of approximately 10% that is dis-tributed uniformly across the chromosomes.

Using RNA-seq derived gene models as training set,we predicted 24,642 complete gene models (28,236 in-cluding partial models). Excluding genes with ambigu-ous gene structures due to alternative isoforms, evalu-ation of single-transcript RNA-seq gene models showedthat 80% of expressed exons overlapped predicted ex-ons on the same strand. For 58% of those exons, startand end positions were predicted correctly at nucleotideresolution. However, 44% of predicted exons showedno evidence for expression indicating either false pre-dictions or constitutively low and/or highly spatiotem-porally restricted expression. Protein domain annota-tion using PFAM showed strong correlations between thetwo species (Spearman’s ρ = 0.77, P < 10−15). Onlytwo domain families (PF01498 and PF01359) showed amore than 20-fold enrichment in P. pacificus relative toP. exspectatus (P < 10−15, Fisher’s exact test); thesedomains correspond to DNA transposons of the marinerfamily suggesting that this family has been active in theP. pacificus lineage following speciation. However, basedon inferred structural variations such as duplication anddeletions (see below), we found no evidence that thistransposon family is still active in P. pacificus.

In this study, we use the P. exspectatus genome se-quence as outgroup for population genetic analysis ofP. pacificus. A detailed comparison to P. pacificus andanother closely related species, P. arcanus, will be pre-sented elsewhere.

Page 3: Characterization of genetic diversity in the nematode ... · 5.2 million substitutions and 1.2 million indels indicating a sequence divergence of approximately 10% that is dis-tributed

3

B. Sequencing of 104 natural isolates

To investigate genetic diversity and the underlyingpopulation genetic processes, we selected 104 strainsof P. pacificus, including the reference strain PS312for second-generation sequencing. Strains were selectedbased on biogeography, beetle association, and mi-crosatellite patterning (Morgan et al., 2012) (Fig. 1A).Sixty-one of the 104 sequenced strains are from the Is-land of La Reunion in the Indian Ocean, which representsa hot-spot of P. pacificus biodiversity and has been thefocus of recent population genetic studies on P. pacificus(Morgan et al., 2012). All strains were inbred for atleast ten generations and known to be largely homozy-gous at microsatellite markers. We sequenced genomicDNA of these strains on the Illumina platform with meancoverage ranging from 6 to 37×. Most strains were se-quenced to a coverage of 9× (median, interquartile range(IQR): 8 – 12), while a few strains were sequenced muchdeeper.

Individual strains showed between 23,000 and 1.4 mil-lion single nucleotide variants (SNVs) relative to the ref-erence genome (Supplementary Table 1). In total, weidentified 7.1 million SNVs and 2.1 million indel (< 100bp) positions. We assessed the quality of variant calls bycomparing the SNVs derived from short read alignmentswith the SNVs derived from whole genome alignments ofthe Sanger sequenced P. pacificus mapping strain PS1843from Washington (Dieterich et al., 2008). Out of470,000 SNV calls that are covered by both platforms,98% were in agreement, indel calls showed an agreementof 79-96% (see Methods).

Despite extensive inbreeding (at least 10 generations),around 6% (median, IQR: 5.2 – 7.3%) of SNVs for allstrains were called as heterozygous (Supplemental Ta-ble 1). As in the case of P. exspectatus, several lines ofevidence suggest that a large portion of the observed het-erozygous signal is due to recent duplications. First, 29%(median, IQR: 26 – 31%) of these heterozygous SNVs fallinto duplications that are larger than two kilobases (Sup-

Feature Unit P. pacificus P. exspectatus

Assembly size [Mb] 172.5 177.6

Assembled sequence [Mb] 153.2 167.0

N50 scaffold size [Mb] 1.25 0.14

GC-content 42.8% 42.8%

Predicted genes 28,666 24,642

Coding sequence [Mb] 27.5 27.1

Gene length [kb] 1.9 (1.0-3.5) 2.5 (1.3–4.5)

Transcript length [kb] 0.7 (0.4-1.2) 0.8 (0.4–1.4)

Exons per gene 7 (4-12) 8 (5–14)

Exon length [bp] 87 (65-115) 86 (63–112)

Intron length [bp] 119 (58-242) 128 (63–253)

Table I Features of the P. pacificus and P. exspectatus genomesequences. Gene features are given as median and IQR.

Figure 1 (A) Worldwide sampling of P. pacificus strains. Cir-cle sizes indicate the numbers of sequenced strains. Colors de-note the predominant mitochondrial and microsatellite cladeper region (Morgan et al., 2012). (B) Population structureof P. pacificus natural isolates visualized as a split-networkbased on average number of substitutions within one mil-lion variables sites that were genotyped in all strains. Thethree most deeply sampled clades are used for further anal-ysis. Clade C (N=44) is almost exclusively sampled from laReunion, A1 (N=15) contains strains from northern Americaand Asia, and A2 (N=16) consists of south and central Amer-ican strains as well as strains from the Indian ocean. Col-ors indicate the predominant mitochondrial and microsatelliteclade per branch (Morgan et al., 2012).

plemental Table 1), which are the smallest duplicatedregions that can be reliably detected based on read cov-erage. Second, a subset of 20 strains that was inbred formore than 30 generations shows similar levels of heterozy-gosity than the other strains. Third, the isolate RS5410from La Reunion (Fig.1B), which was previously iden-tified as an admixed strain (Morgan et al., 2012) andshould therefore show a higher level of heterozygosity,only exhibits 4.4% of ambiguous variant calls, of which27% fall into large duplicated regions. In order to mini-mize the effect of coverage fluctuations and apparent het-erozygosity, most population genetic analysis presentedbelow is restricted to positions that were sufficiently cov-ered in all analyzed strains and did not show any signalof heterozygosity (see Methods).

As mentioned above, we detected structural variantsand copy number variations between two kilobase up toa megabase using a method (cnv-seq, Xie and Tammi(2009)) that compares differences in read depth relative

Page 4: Characterization of genetic diversity in the nematode ... · 5.2 million substitutions and 1.2 million indels indicating a sequence divergence of approximately 10% that is dis-tributed

4

10−3

10−2

10−1

ChrI1 Mb

10−3

10−2

10−1

ChrII1 Mb

10−3

10−2

10−1

ChrIII1 Mb

π (A2 ) π (C) π (C, A2 ) d10

−310

−210

−1

ChrIV1 Mb

10−3

10−2

10−1

ChrV1 Mb

10−3

10−2

10−1

ChrX1 Mb

Figure 2 Genome-wide divergence (d) to the sister species P. exspectatus and nucleotide diversity π within and between cladesC and A2 are plotted along the chromosomes. π values were calculated in 100kb windows and smoothed with a runningaverage of 1Mb for ease of presentation. Supercontigs (> 1Mb) with markers on the genetic map (Srinivasan et al., 2003)were concatenated to visualize the chromosomal distribution. The dashed lines denote supercontig boundaries with unknownphysical distance. Diversity within different clades is very well correlated suggesting that similar population genetic factorsshape this variation.

to resequencing data of the reference strain (Supplemen-tary Fig. 2). Compared to the reference genome, 7.2%(median, IQR: 6.8 – 7.4%) and 2.4% (median, IQR: 2.3 –2.6%) of the genome was predicted as deleted and dupli-cated, respectively. We verified predicted deletions withPCR amplification experiments for three cellulase genesin 24 strains and found perfect agreement (Mayer et al.,2011).

C. Genome-wide levels of nucleotide diversity

Figure 1B shows the phylogenetic relationship of the104 sequenced strains as a split-network. The popula-tions fall into a number of clearly separated clades andwe denote the most deeply sampled clades with at least15 strains each as A1, A2, and C. The clades suggestedby diversity in the nuclear genome largely agree with theprevious designation based on the mitochondrial genomeand microsatellite data indicated by different colors inthe figure (Morgan et al., 2012). The strong popula-tion subdivision is in contrast to the lack of populationstructure in C. elegans (Phillips, 2006) but mirrors find-ings in C. briggsae (Cutter et al., 2006). Consistentwith the strong population structure, the first two prin-cipal components (PCs) of the SNV data explain 22.9%and 16.3% of the global variability, respectively (Supple-mental Fig. 3). The clades A1, A2, and C are clearlyseparated along the first two PCs. Further analysis ofclade C, the most deeply sampled clade with a total of44 strains, did not reveal additional clusters but only aslight signal reflecting the local geography on La Reunion

Island. We focus our population genetic analyses belowon this homogeneous clade C to minimize the influence ofpopulation structure. Nevertheless, the highly structuredpopulation of P. pacificus and its outgroup P. exspecta-tus allows us to investigate patterns of sequence evolu-tion over a wide range of time scales. The differencesthat accumulated between P. pacificus and P. exspec-tatus clearly contribute to divergence between species,while mutations that segregate within each of the cladesare relatively young genetic diversity. Comparisons be-tween clades have features of diversity and divergence.

Figure 2 shows the nucleotide diversity within cladesA2 and C (see Supplemental Fig. 4 for comparisons in-cluding clade A1), the average distances between thesetwo clades, and the distance between P. pacificus andP. exspectatus for each of the six chromosomes. The aver-age genome-wide intra-clade diversity in 100kb windowsis 1.9 and 2.2 ×10−3 for clade A2 and C, respectively.Strains from clade A2 and C differ at an average fractionof 7.5 ×10−3 of all sites. The genome-wide patterns ofdiversity within clades are well correlated between clades(Spearman’s ρ = 0.75, P < 10−15 for clades A2 and C in100kb windows). Similarly, distance between the clades iscorrelated with diversity within clades C and A2 (Spear-man’s ρ = 0.63 and 0.66, P < 10−15, respectively). How-ever, the distances between clades fluctuate less in rela-tive terms than diversity within clades (standard devia-tion of log10 distance values in 100kb windows is 0.21 asopposed to approx. 0.30 for fluctuations of within cladediversity). The average divergence between P. pacificusand P. exspectatus is 9.3% with a typical fluctuation ofless than 20% (windows of 100kb, standard deviation of

Page 5: Characterization of genetic diversity in the nematode ... · 5.2 million substitutions and 1.2 million indels indicating a sequence divergence of approximately 10% that is dis-tributed

5

log10 d = 0.07, Fig. 2).This homogeneous divergence between the sister

species suggests a constant accumulation of mutationsand the absence of large-scale mutation rate variation.Hence the strongly fluctuating diversity patterns withineach clade are likely due to population genetic processesthat differentially affect genetic diversity in different re-gions of the genome such as selective sweeps or back-ground selection (Hudson and Kaplan, 1995; May-nard Smith and Haigh, 1974; Nordborg et al., 1996).Even though the diversity patterns in different clades arevery similar, most intra-clade diversity is due to privatevariation, rather than diversity shared in the common an-cestral population (3% and 7% of polymorphic positionsin clade C and A2 are polymorphic in both, the jointSFS of different clades are shown in Supplementary Fig-ure 5). This suggests that genetic diversity is shaped byprocesses that act similarly but independently in differentclades. The most plausible explanation is background se-lection, which depends primarily on the local deleteriousmutation rate and recombination(Hudson and Kaplan,1995; Nordborg et al., 1996). These two processes arenot expected to be strongly influenced by the environ-ment and should affect diversity in similar matter inde-pendent of location. Local adaptation and the associ-ated selective sweeps, in contrast, should result in differ-ent diversity patterns within different clades (Kaweckiand Ebert, 2004). The alternative explanation of globalselective sweeps seems unlikely, as we do not find longhaplotypes that are shared among clades (Supplemen-tary Fig. 6). The correlation between across-clade andwithin-clade diversity likely stems from ancestral diver-sity. This ancestral diversity is expected to have had achromosomal profile similar to the present day within-clade diversity. Together with homogeneous accumula-tion of divergence, we expect the inter-clade distance tobe correlated with the intra-clade diversity, but with asmaller relative variability. Both of these expectationsare consistent with our observations.

D. Deleterious variants at high frequencies

Restricting the analysis of the global P. pacificus di-versity to coding regions, we find a mean silent site di-versity πsi = 1.4 × 10−2 and a nonsynonymous site di-versity of πns = 4.5× 10−3. The resulting pN/pS ≈ 0.32suggests widespread purifying selection that results inpruning of two out of three nonsynonymous mutations.Note, however, that these genome-wide measures of syn-onymous and nonsynonymous diversity are dominated byinter-clade comparisons. Restricted to clade C, we findan average pN/pS ratio of 0.39, which indicates that asubstantial fraction of nonsynonymous mutations wereonly pruned at the inter-clade level and segregate withinclades.

To quantify this dependence of purifying selection onthe time scale of separation, we calculated the density of

0.000 0.005 0.010 0.015 0.020 0.025

0.25

0.30

0.35

0.40

0.45

0.50

0.55

δsi

δ ns

δ si

within clade across clade

Figure 3 Individual dots represent δns/δsi ratios of all pair-wise comparisons of the 104 P. pacificus strains and were plot-ted across different time scales as measured in δsi. The linerepresents a smoothed average δns/δsi ratio of all measure-ments at a given δsi. δns/δsi ratios decrease with distance(δsi) from 0.5 to 0.3, suggesting that around 50% of nonsyn-onymous mutations have been selected against at very shorttime periods and were not observed as variable. In addition,around 20% of nonsynonymous variations are pruned with in-creasing distance. Assuming that this drop is due to purifyingselection, we conclude that around 20% of nonsynonymous areweakly deleterious and segregate as high frequency polymor-phisms.

synonymous differences, δsi, and nonsynonymous differ-ences, δns, for all pairwise comparisons of the 104 strains.As comparisons across clades have features of diversityand divergence we have chosen to represent the ratiobetween nonsynonymous and synonymous differences asδns/δsi rather than pN/pS or dN/dS. Fig. 3 shows thedependence of this ratio δns/δsi on the separation mea-sured by the synonymous distance δsi (stochastic fluctu-ations in δsi are negligible since we are comparing entiregenomes and the total number of differences in each com-parison is large). The most closely related strains withinone clade show δns/δsi of up to 0.5. With increasingdistance, δns/δsi drops from 0.5 to 0.3 when compar-ing strains between clades. In other words, we find thatnonsynonymous differences accumulate with a decreasingrate as we go to larger degrees of separation. Finally, wefind dN/dS=0.14 for the comparison of 1:1 orthologs be-tweeen P. pacificus and P. exspectatus. Note that as wego from comparisons of strains within one clade to inter-clade comparisons and eventually to inter-species com-parison, the interpretation of δns/δsi gradually changesfrom pN/pS to dN/dS. For a more indepth discussion of

Page 6: Characterization of genetic diversity in the nematode ... · 5.2 million substitutions and 1.2 million indels indicating a sequence divergence of approximately 10% that is dis-tributed

6

time dependent dN/dS, see (Mugal et al., 2014; Rochaet al., 2006).

From the observation δns/δsi ≈ 0.5 for closely relatedstrains, we conclude that 50% of nonsynonymous substi-tutions are so deleterious that they are rarely seen in wildisolates and are eliminated quickly. In contrast, another20% of nonsynonymous substitutions are weakly deleteri-ous such that they are pruned only on longer time scalesand segregates as high frequency polymorphisms. Be-low, we will complement these observations for genomewide ratios of nonsynonymous and synonymous differ-ences with an analysis of the site frequency spectra (SFS)of synonymous and nonsynonymous mutations.

The most immediate consequence of abundant dele-terious mutations is a reduction of diversity by back-ground selection (Charlesworth et al., 1993). In-deed, we find a strong negative correlation between thefraction of coding sequences with diversity (Spearman’sρ = −0.35, P < 10−15 for 100 kb windows in clade C),a finding that has also been observed in the selfing plantArabidospsis thaliana (Nordborg et al., 2005). Simi-larly, we expect a positive correlation between geneticdiversity and recombination rates (Begun et al., 2007;Cutter and Choi, 2010), but the only available geneticmap (Srinivasan et al., 2003) has insufficient resolutionto be useful for this purpose. While the observed cor-relation on its own could also be the result of recurrenthitch-hiking in gene rich regions, the strong correlationof the diversity patterns across clades suggest purifyingselection as a likely cause, see discussion above.

E. Site frequency spectra and linked selection

Rare SNVs tend to be young while common SNVs aretypically old (Kimura and Ohta, 1973). Hence the his-togram of SNVs present in k out of n strains providesa rich summary of genetic diversity that is informativeabout the demographic and evolutionary history of thepopulation. Deciding on whether an allele is common orrare requires polarization, i.e., an inference of the ances-tral state at the locus. To infer the ancestral state ofclade C, we broke the genome into 20kb intervals, builtmaximum likelihood trees on these blocks using fasttree(Price et al., 2009) and rooted these trees with thegenome of P. exspectatus. Blocks of 20kb are in substan-tial LD and contain sufficient SNVs to resolve most ofthe genealogical relationships between strains. Next, weused a probabilistic model to infer the ancestral state ateach internal node of the tree (see Methods). Variationsof this strategy (different block length, different substitu-tion models, different tree building algorithms, differentancestral reconstruction software) yielded similar results.

There are many more rare derived mutations thancommon ones. Hence if even a small fraction of raremutation are incorrectly polarized, the number of highfrequency derived mutations is overestimated substan-tially. For this reason, many authors “fold” the SFS and

consider only the minor allele frequency irrespective ofpolarization. However, valuable information is lost by“folding”. We therefore decided to present the unfoldedSFS and we provide independent evidence that our polar-ization is sufficiently accurate to allow a faithful charac-terization of the SFS at high frequencies. Fig. 4A showsthe ratio of nonsynonymous to synonymous mutations atfrequencies above a threshold ν. The ratio of nonsynony-mous to synonymous SNVs is monotonically decreasingas expected if purifying selection prunes the majority ofamino-acid substitutions. If most of the SNVs inferred tobe at high frequency were wrongly polarized rare alleles,this curve should rise again with ν → 1.

Fig. 4B shows the polarized SFS of clade C on a doublelogarithmic scale such that power-laws show as straightlines. At frequencies below 20%, the observed SFS iscompatible with a 1/k decay as expected in neutrallyevolving populations of constant size (Wakeley, 2008).At higher frequencies, however, the SFS decreases muchmore rapidly, before increasing again for alleles close tofixation. Similar U-shaped SFS have been observed inplants (Cao et al., 2011). At intermediate derived allelefrequencies between 20-40%, the slope of the SFS is com-patible with 1/k2 indicated as dashed line. The 1/k2

behavior and the non-monotonicity is expected if thedominant force changing allele frequencies is selection atlinked (genetic draft) loci regardless of whether this vari-ation is positively or negatively selected (Bravermanet al., 1995; Neher and Hallatschek, 2013; Neherand Shraiman, 2011).

Compared to the almost 100 fold variation in the SFSbetween singletons and alleles at intermediate frequen-cies, synonymous, nonsynonymous and noncoding poly-morphisms all follow very similar distributions (Fig. 4C).The relative abundance of synonymous polymorphismsincreases only by a factor of two as we go from low tohigh frequency (see Fig. 4A). This similarity suggeststhat the dynamics of polymorphisms is quasi-neutral inthe sense that their fate is largely determined by theirgenetic background rather than their own effect on fit-ness. We observe a systematic excess of synonymous overnonsynonymous mutations at frequencies above 0.6 con-sistent with the gradual manifestation of purifying selec-tion on old and frequent alleles (P < 10−6, Fisher’s exacttest, Fig. 4C). While the SFS is quite insensitive to thetype of mutation, stratifying the SFS by the overall nu-cleotide diversity in surrounding regions reveals a strongdependence of the shape of the SFS on coalescence time(Fig. 4C) as predicted by models of abundant linked se-lection in recombining populations (Neher et al., 2013).

To complement our inference of mutational effectsbased on pairwise comparisons of strains (see above andFig. 3), we used the web-server DFE-α (Keightley andEyre-Walker, 2007) to infer parameters of the DFEusing the synonymous and nonsynonymous SFS withinclade C. DFE-α preferred a model with a 4.5 fold popu-lation size increase (N2 = 4.5N1) roughly N2 generationsin the past over a model with a single population size

Page 7: Characterization of genetic diversity in the nematode ... · 5.2 million substitutions and 1.2 million indels indicating a sequence divergence of approximately 10% that is dis-tributed

7

0.0 0.2 0.4 0.6 0.8 1.0

0.70.80.91.01.11.21.3

thresholdν

NnsNsiofSNVswithk/n>ν

derived allele frequency (k/n)

density

10−3

10−2

0.1

1

0.01 0.1 1

1/k2

1/k

0.0 0.2 0.4 0.6 0.8 1.0

derived allele frequency (k/n)

density

10−3

10−2

0.1

1 synonymousnonsynonymnoncodingπ<5x10−4

π>5x10−3

A B C

Figure 4 (A) Monotonic decrease of the ratio of the numbers of nonsynonymous (Nns) and synonymous (Nsi) SNVs at derivedallele frequencies (k/n) above a threshold ν. The ratios represent unnormalized ratios of raw counts of nonsynonymous andsynonymous SNVs in the site frequency spectrum (SFS). As these sites are called in all 104 P. pacificus strains and are uniquelyalignable to the outgroup P. exspectatus, this subset of sites is likely under stronger purifying selection than the genome-wideaverage. The monotonic decrease with increasing ν is consistent with the effect of purifying selection. There is no signal ofpolarization errors which would manifest as an uptick at values of ν → 1. (B) SFS of clade C is shown on a double logarithmicplot. If neutral diversity is shaped by constant genetic drift, one expects that the number of derived alleles present in k out ofn strains decays as 1/k, indicated as straight solid line. The dashed line indicates the corresponding expectation, 1/k2, undera genetic draft model (for rare alleles). At frequencies below 10%, the SFS is proportional to 1/k, at intermediate frequenciesbetween 10% and 50%, the SFS is steeper and compatible with 1/k2. (C) The SFS decays less rapidly in regions of high thanof low diversity (100 kb windows). In contrast, distinct functional categories have almost indistinguishable SFSs. Only atfrequencies above k/n = 0.6, do we see a systematic excess of synonymous over nonsynonymous mutations (P < 10−6, Fisher’sexact test).

(log-likelihood difference 325). For the two-epoch model,DFE-α estimated that 45% of nonsynonymous mutationshave deleterious effects in excess of N2|s| = 100, consis-tent with our previous results. Another 37% of nonsyn-onymous mutations are estimated to be approximatelyneutral with N2|s| < 1, which is compatible with theobserved δns/δsi at larger time scales (Fig. 3).

The shape of the SFS is sensitive to past demography,direct selection, linked selection, and genetic drift. Dis-entangling the effects of these process is challenging asthey have overlapping characteristics. We argue that thedominant features of the SFS observed in clade C are theresult of linked selection. If direct selection was a domi-nant force, the SFS of synonymous, nonsynonymous, andnon-coding sites would have qualitatively different shape.We observe the effect of purifying selection against non-synonymous mutations at high frequencies, but the dif-ference is small. Effects of demography are harder todiscount. Both exponential expansion and linked selec-tion result in a 1/k2 decay of the SFS at low and in-termediate frequencies (Neher and Shraiman, 2011).The SFS we observe, however, are non-monotonous andit is impossible to obtain non-monotonous SFS in anyneutral model of one population irrespective of past pop-ulation size changes (see appendix B in Sargsyan andWakeley (2008)). Models of linked selection or geneticdraft predict precisely this type of non-monotonous SFS(Neher and Hallatschek, 2013). Furthermore, theshape of the SFS differs in regions of high and low di-

versity. Signatures of draft are more pronounced in lowdiversity regions (Fig. 4C). This observation is consis-tent with the expectation that draft reduces coalescencetime and hence diversity and that this can differ alongthe genome depending on the local mutational input ofselected diversity and recombination rate (Neher et al.,2013). Young alleles at the rare end of the frequencyspectrum are compatible with genetic drift and the bumpin the SFS around 15% might reflect residual populationstructure within clade C. Overall, however, P. pacificusis not compatible with a neutral model. Strong distor-tions of genealogies by purifying selection on many weakeffect alleles are a likely explanation for the observedSFS (Neher and Hallatschek, 2013; Walczak et al.,2012). However, our understanding of migration betweenthe subpopulations of P. pacificus remains limited andto what extent reintroduction of ancestral alleles by mi-grations from different subpopulations contributes to theobserved patterns remains subject of future investigation.

F. Linkage disequilibrium and haplotype structure

Finally, we investigated the pattern of linkage disequi-librium (LD) in P. pacificus. Figures 5A and B showthe average LD between pairs of loci measured as r2 asa function of the physical separation between the locifor the global sample of 104 strains and clades C, A1,and A2. In all cases, LD drops over a distance of about

Page 8: Characterization of genetic diversity in the nematode ... · 5.2 million substitutions and 1.2 million indels indicating a sequence divergence of approximately 10% that is dis-tributed

8

<0.1

<0.2

<0.3

<0.4

<0.6

>0.8

I II III IV V X

BA

C

Figure 5 (A) Average r2 values between pairs of loci for all104 strains and clade C. LD drops over the first 20kb but staysat a relatively high level across longer distances (solid lines)and even across chromosomes (dashed lines). We attributethis high level of background LD to the strong differentiationbetween subclades as a result of geographic separation. Re-stricted to clade C, local LD is higher, while inter-chromosomeLD is low. (B) Average r2 values for clades A1 and A2. Un-like clade C, both clades show substantial background LD,which can also be explained by geographic separation withinthese clades, i.e. individual strains within the same cladewere collected on different continents. (C) Average LD be-tween all 100kb windows for clade C across the genome wascalculated as average r2 for 10 pairs of biallelic SNVs with5-95% allele frequency. LD across chromosomes is virtuallyabsent indicating the effective reshuffling of chromosomes inoutcrossing events. However, clade C shows several blocksspanning megabases in strong LD on most chromosomes.

20 kb (which is the scale used to infer the ancestral stateabove). With exception of clade C, all data sets show sub-stantial LD even across chromosomes (r2 = 0.12− 0.18),which we attribute to the strong geographic separationof strains even within clades (see Supplemental Table1). For the more closely related and geographically re-stricted clade C strains, we find very little LD acrosschromosomes (r2 = 0.04), but blocks of local LD, thelargest of which spans several Mb (Fig. 5C). The lackof LD across chromosomes supports our assertion thatclade C is the only well mixed population suitable fordetailed population genetic analysis. A complementaryanalysis of haplotype blocks (see Methods) also revealedshared haplotypes blocks of megabase size between cladeC strains (Supplemental Fig. 6). While long haplotypeblocks conserved across continents were found in C. ele-

gans (Andersen et al., 2012), most of shared haplotypeblocks in P. pacificus are clade-specific. Consistent witha reduction of neutral diversity through linked selection,we find a moderate negative correlation between LD anddiversity (Spearman’s ρ = −0.25, P < 10−15 for 100 kbwindows).

Interpretation of the observed haplotype patterns ishampered by the absence of a high resolution geneticmap for P. pacificus. Whether the long-range LD simplyreflects absence of recombination or whether other popu-lation genetic forces are responsible for maintaining thisLD needs to be addressed in future work. However, wewould like to point out that the previous low-resolutiongenetic map did not find an obvious absence of crossoversin the central regions of the chromosomes (Srinivasanet al., 2003).

II. DISCUSSION

By sequencing 104 strains of P. pacificus and a closelyrelated outgroup, P. exspectatus, we have shown thatP. pacificus contains extensive population structure andoverall levels of diversity are one order of magnitudehigher than in C. elegans (Andersen et al., 2012). How-ever, individual clades show similar levels of nucleotidediversity with a genome-wide profile that is highly cor-related between clades, suggesting that background se-lection is a major force shaping genetic diversity. Theaccumulation of divergence between P. pacificus and itssister species P. exspectatus is homogenous across thegenome. The latter observation is consistent with resultsof the accompanying manuscript presenting a mutationaccumulation experiment that shows that the P. pacificusgenome does not exhibit strong mutation rate variationon large physical scales and that synonymous and non-synonymous mutations occur at their expected frequency(Weller et al. 2014). By analyzing the ratio of synony-mous and nonsynonymous differences between strains atvarious distances and mutations at different frequencies,we characterized the strength of purifying selection oncoding regions. 50% of nonsynonymous mutations are sodeleterious that they are not found in a typical popula-tion sample. Of those nonsynonymous mutations that areobserved, again roughly half are weakly selected againstand pruned over time scales on the order of the separa-tion between clades, which corresponds to roughly 1-2%divergence at silent sites. Combined with our finding thatnucleotide diversity in genomic windows anticorrelatesstrongly with the fraction of coding sequence in 100kbwindows, we conclude that background selection playsan important role in shaping P. pacificus diversity.

Recent theoretical work has shown that backgroundselection cannot be fully described by a reduced effec-tive population size but results in substantial distortionsof genealogies in particular when many weakly deleteri-ous mutations segregate at high frequency (Seger et al.,2010; Walczak et al., 2012). This distortion of genealo-

Page 9: Characterization of genetic diversity in the nematode ... · 5.2 million substitutions and 1.2 million indels indicating a sequence divergence of approximately 10% that is dis-tributed

9

gies manifests itself in an SFS with a steeper decay and anup-tick at high derived allele frequencies, both of whichare characteristic signatures of genetic draft or linked se-lection (Neher and Hallatschek, 2013; Neher andShraiman, 2011). The signature of genetic draft in theSFS of P. pacificus is consistent with previous studiesthat proposed selection at linked sites as one importantfactor shaping genomic diversity of self-fertilizing nema-todes (Andersen et al., 2012; Cutter and Choi, 2010;Rockman and Kruglyak, 2009). In addition to dis-torted SFS, we also find megabase scale haplotype blocksin strong linkage disequilibrium, which provide a strongopportunity for linked selection.

Whether on top of purifying selection adaptive sub-stitutions or fluctuating selection play a prominent roleremains currently unclear. McDonald-Kreitman typetests (McDonald and Kreitman, 1991) that have beenused to quantify adaptive evolution in Drosophila (Sellaet al., 2009) are very vulnerable to segregating deleteriousmutations (Charlesworth and Eyre-Walker, 2008;Messer and Petrov, 2012) and hence not suitable forP. pacificus. Methods used in Drosophila to detect morerecent adaptations based on signatures of hitch-hiking(Andolfatto, 2007; Macpherson et al., 2007) are alsoinapplicable in organisms with long range LD.

In summary, the combination of high-diversity in astructured populations and a close outgroup has allowedus to study evolution on a variety of times scales spanningdistances from 0.1% to 10% and revealed a consistentdecreasing trend in the ratio of nonsynonymous to syn-onymous mutations (Mugal et al., 2014; Rocha et al.,2006). This trend implies an abundance of weakly dele-terious mutations – at least when their effect is averagedover larger and larger time scales. With respect to thecomparison to C. elegans, our genome-wide analysis ofP. pacificus populations describes a complementary pic-ture that may reflect the substantially different naturalhistories of both nematodes. P. pacificus and C. elegansare both predominantly self-fertilizing species, but theyoccupy distinct ecological niches and it has been specu-lated that the reduction of diversity observed in C. ele-gans which was caused by recent migration patterns andstrong selective sweeps, might be linked to human dis-persal within the few last centuries (Andersen et al.,2012; Phillips, 2006). This scenario is unlikely for thebeetle associated P. pacificus. Despite the differences,both species reveal strong evidence for linked selectionshaping genetic diversity within their genomes. Thus,in addition with studies of self-fertilizing nematodes andplants (Cutter and Choi, 2010; Cutter et al., 2006;Kim et al., 2007; Nordborg et al., 2005), our presenta-tion of diversity within P. pacificus highlights the needto study multiple satellite model systems in order to bet-ter understand common features and distinct patterns ofgenome evolution of self-fertilizing species. Finally, ourcatalogue of natural variation will form the basis for fur-ther studies associating phenotypic variability to geneticvariation within P. pacificus.

III. MATERIALS&METHODS

Genomic library preparation

For preparation of genomic DNA, the MasterPureDNA purification Kit from Epicentre was used, result-ing in high yields of clean DNA. DNA was quantified byQubit measurement and diluted with TE to 20 ng/µl in atotal volume of 55 µl. Genomic libraries were generatedusing the TruSeq DNA Sample Preparation Kit / v2 fromIllumina. DNA was sheared with the default settings us-ing the Covaris S2 System. Following the protocol, endrepair, adenylation and index adapter ligation were per-formed. After running samples on a 2% agarose gel for 90minutes, gel slices ranging in size from 400-500 bp wereexcised resulting in an insert size of approximately 300-400 bp. After fragments were amplified by PCR, librarieswere validated on a Bioanalyzer DNA 1000 chip. All li-braries were diluted to a concentration of 10 nM in 0.1%EB-Tween and pooled to 1, 4, 8, and 12-plexes. For matepair libraries, clean genomic DNA was prepared using theGenomic-tip 100/G Kit from Qiagen. DNA was quanti-fied by Qubit measurement, diluted with Tris-HCl to 150ng/µl in a total volume of 70 µl and was sheared with thegiven settings (using SC 13) with the Hydroshear. ForP. exspectatus, 3 kb and 5 kb Mate Pair libraries weregenerated using the Mate Pair Library v2 Kit for 2-5 kbLibraries from Illumina.

Genome assembly

We sequenced five 150bp paired-end libraries with in-sert sizes between 400-600bp of an inbred P. exspectatusstrain (10 generations of full-sibling inbreeding) on theIllumina Genome Analyzer II. In order to quantify thedegree of ambiguous base calls vs. remaining heterozy-gosity after the 10 rounds of full sibling inbreeding, wealso sequenced one library of the non-inbred wild-typestrain of P. exspectatus, which was not used for the denovo assembly. Reads were error corrected using quakewith k-mer size 19 (version 0.2.2)(Kelley et al., 2010).The same quality masking and trimming procedure as forthe genome sequencing data of 104 strains was applied.The assembly process was carried out in several steps. Apreliminary assembly (k=49, N50=6.8 kb, assembled se-quence=170.3Mb) using only paired end libraries was cre-ated with the SOAP de novo assembler (version 1.05)(Liet al., 2010b). The preliminary assembly was used to re-move ligation sites within 150bp mate pair library readsby iterative alignment (bwa version 0.5.9-r16) (Li andDurbin, 2009) and 3’ truncation. From two libraries(3kb and 5kb), only pairs for which both ends couldunambiguously be mapped to the preliminary assemblywere kept. This method yielded 13,019,927 3kb matepairs from a set of 25,685,920 raw pairs and 15,919,3815kb mate pairs from a set of 31,112,885 raw pairs. Weran the SOAP assembler with paired-end libraries in the

Page 10: Characterization of genetic diversity in the nematode ... · 5.2 million substitutions and 1.2 million indels indicating a sequence divergence of approximately 10% that is dis-tributed

10

assembly step and the mate-pair libraries as additionalscaffolding information resulting in a second assembly(k=63). To further improve the contiguity of the genomeassembly we screened a single end library of 100bp readsfor spliced reads that showed non-overlapping align-ments to one scaffold spanning potential introns or non-overlapping alignments to different scaffolds by blat (ver-sion 34)(Kent, 2002) which contain further scaffoldinginformation. In the next iteration we used the allpath-LG assembler (version 42069)(Gnerre et al., 2011) withpaired-end reads, mate pairs, spliced reads, and one over-lapping library of 20 million read pairs resampled fromthe previous assembly. A final scaffolding step includ-ing information from 2937 commercially obtained BACsspanning approximately 100kb and an iterative local as-sembly procedure to close intra-scaffold gaps was applied(Boetzer et al., 2011; Li et al., 2010a). A contaminationcheck against 37 E. coli genomes as downloaded from En-sembl Bacteria (release 12) identified 6 scaffolds spanning4.7Mb as contamination (blastn e-value< 10−30). The fi-nal assembly contained 167Mb of assembled sequence in4,412 scaffolds spanning 177.6 Mb with an N50 value of142kb. The largest scaffold encompasses 1,285.6 kb andgenome-wide GC content is 43%. The final assembly ofP. exspectatus was aligned against the Hybrid1 assemblyof P. pacificus using the whole genome alignment toolmugsy (Angiuoli and Salzberg, 2011).

Gene annotation and orthology assignment

Three mixed stage RNA libraries from P. pacificuswere sequenced on an Illumina Genome Analyzer IIx(76bp paired-end): one conventional mRNA library, oneSL1 enriched mRNA library and one SL2 enriched mRNAlibrary (Sinha and Dieterich; manuscript in prepara-tion). The three libraries were used to build a referencetranscriptome with cufflinks/cuffmerge (version 1.3.0)(Trapnell et al., 2010). Open Reading Frames (ORFs)were derived from the inferred transcript sequences withFrameDP (Gouzy et al., 2009). We collected all com-plete coding sequences without frameshifts and usedthem as a training set for gene finding in P. pacifi-cus. We mapped the assembled transcripts of P. paci-ficus to the P. exspectatus genome using BLAT(Kent,2002). The resulting gene structures were used to guidethe spliced alignment of single-end 100nt RNA-seq readsfrom a mixed stage P. exspectatus library with cufflinks.The splice donor and acceptor positions were trainedfor iterative gene finder training using the SNAP genefinder (Korf, 2004). The new model predicts both,trans-spliced genes as well as normal genes. All availablesplice junction data were provided as external evidencein the gene prediction step. For P. exspectatus, the newlytrained gene model of P. pacificus was used to create aninitial set of candidate genes with the help of splice siteinformation from the P. exspectatus genome annotation.This gene set was used to retrain the gene model to adjust

it to the new target species. We predicted the final geneset by rerunning the gene finder with the new gene modeland all available splice site data. This procedure yielded28,666 P. pacificus gene models and 24,642 P. exspecta-tus gene models. One-to-one orthologous pairs betweenP. pacificus and P. exspectatus were defined by first run-ning the program CYNTENATOR (Rodelsperger andDieterich, 2010) to compute a set of unambiguouslyassignable conserved gene orders that include at least twoconserved genes per species and are unique with respectto the criteria that every gene can occur at most once in agene order alignment. Subsequently, best-reciprocal hitswere identified locally within individual gene order align-ments, and were complemented by global best-reciprocalhits, computed without synteny information.

Alignment and variant calling

Low quality bases in the first 36bp of raw reads witha quality below 20 (error probability=1%) were maskedand reads were trimmed at the first occurrence of a lowquality (< 20) base in the rest of the read. Readswere aligned to the Hybrid1 genome assembly of theP. pacificus PS312 strain (California) using stampy (ver-sion 1.0.12) (Lunter and Goodson, 2011). Duplicatereads were removed and reads were locally realigned us-ing GATK (version 2.1-13). SNVs and small indels werecalled using samtools (version 0.1.18)(Li et al., 2009) ex-cluding positions with > 100x coverage. We excluded allvariant calls with a quality score below 20, coverage belowtwo, short indels, and heterozygous or ambiguous posi-tions with a consensus quality score FQ > 0 as definedby the samtools mpileup command. Large structuralvariations were called using cnv-seq (Xie and Tammi,2009). The previously published low coverage Sanger as-sembly of the P. pacificus PS1843 (Washington) strain(Dieterich et al., 2008) was used to compare variantcalls between the Sanger and Illumina platform. TheSanger assembly was aligned to the PS312 Hybrid1 ref-erence genome using mugsy (version v1r2.2)(Angiuoliand Salzberg, 2011). Based on homozygous 474,285SNV positions that were covered by Sanger data, thegenotyping accuracy was defined as the fraction of SNVscalled by both platforms among all Illumina variant calls.For 938,544 SNVs obtained from the whole genome align-ment of the PS1843 Sanger assembly, only 53% of vari-ant calls agreed with the Illumina data. We attributethis high number of putative false positive SNVs to thelow coverage (1x) of the Sanger data. 79% of Illumina-based indel calls were found to be in agreement with theSanger data based on comparison of the equivalent indelregion(Krawitz et al., 2010); however, 96% of Illumina-based indel calls overlapped with an Sanger-based indelin up to 10bp distance.

Page 11: Characterization of genetic diversity in the nematode ... · 5.2 million substitutions and 1.2 million indels indicating a sequence divergence of approximately 10% that is dis-tributed

11

Population genetic analysis

With exception of the calculation of genetic diversity,all population genetic analysis was restricted to positionssufficiently covered in all analyzed strains such that confi-dent variant calls could be made, i.e. coverage ≥ 2, sam-tools quality score ≥ 20 and no signal of heterozygosity(samtools consensus quality score FQ < 0). For visualiza-tion of phylogenetic relationships one million SNVs withgenotypes in all strains were randomly selected and con-catenated as input for SplitsTree4 (version 4.12.6) (Hu-son and Bryant, 2006). Principal component analysiswas done with EIGENSOFT (version 3.0) (Pattersonet al., 2006). Shared haplotype blocks were identified bythe program GERMLINE (version 1-5-1)(Gusev et al.,2009). As the majority of variable sites was not cov-ered in all strains (Supplemental Fig. 7), for pairwisecomparisons, we used all variant calls against the refer-ence genome assembly (Hybrid1) to calculate the averagenumber of substitutions. Genetic diversity π was calcu-lated as the average number of substitutions between allpairwise comparisons for the strains of interest. π val-ues were calculated in non-overlapping windows of 100kbsuch that each window contains between 100 and 1000differences. Chromosome-wide plots of diversity weregenerated by concatenating Contigs (> 1Mb) with re-spect to marker positions on the genetic map (Srini-vasan et al., 2003). Contigs that lack any genetic markeror that were shorter than 1Mb were excluded from theseplots (a total of ∼40% of the total assembly is excludedby these criteria). The signal was further smoothed witha running average with window size of 1Mb for ease ofpresentation.

Linkage disequilibrium

For analysis of linkage disequilibrium across all 104strains and within clades A1, A2, and C, we used biallelicSNVs that could be reliably genotyped in all analyzedstrains and with a minor allele frequency ≥ 5% in orderto calculate r2 values. With exception of clade C, the setof all strains as well as clade A1 and A2 strains show ex-tensive background LD across chromosomes (Figure 5Aand B), which reflects the strong geographic separationacross most strains. To visualize patterns of LD withinclade C in at distances larger than 100kb, we calculatedmean r2 values of ten randomly chosen SNV pairs for ev-ery pair of 100kb windows across the whole P. pacificusgenome (Figure 5B).

Analysis of coding regions

The quality filtered variant calls from samtools werecompared to the P. pacificus gene models using a customC++ program designed for classification of indels andnucleotide substitutions in mutant resequencing projects

(Rae et al., 2012). This allowed for the classificationof SNV as synonymous, nonsynonymous, or noncoding,and for calculation of numbers of synonymous and non-synonymous substitutions for all pairwise comparisonsof strains. In order to calculate δsi,δns for any pair ofstrains, the total number of substitutions across all geneswere normalized by the total number of synonymous andnonsynonymous sites of all genes as calculated from theP. pacificus gene models.

Ancestral state inference and SFS

For the inference of ancestral states, homozygous po-sitions sufficiently covered in all strains were implantedinto the reference genome (Hybrid1) to generate hypo-thetical haplotypes with well supported SNVs. As out-group, we used an implanted genome that contained all5.2 million single nucleotide substitutions between thegenome sequences of P. pacificus and P. exspectatus. Inregions that did not align between the two species, noancestral state was inferred. The inference was done bybuilding a maximum likelihood tree using fasttree (Priceet al., 2009) of all clade C strains and the P. exspectatussequence in non-overlapping windows of 20kb (10kb and50kb yield similar results). We inferred the most likelyancestral sequences of all nodes of the tree using a cus-tom python script implementing a variant of the dynamicprogramming algorithm for probabilistic ancestral infer-ence (Pupko et al., 2000). For each of the resulting trees,the most likely sequence at the root of clade C was usedas the ancestral sequence to polarize variation in cladeC. Positions at which the ancestral state was inferredwith less than 95% confidence were excluded. SFS werecalculated separately for synonymous, nonsynonymous,and noncoding SNVs as well as for high and low diver-sity regions, defined by a non-overlapping 100kb windowsacross the P. pacificus genome.

A. Data availability

All reads were submitted to the NCBI Sequence ReadArchive. Variant calls, the P. exspectatus genome assem-bly, and the gene models (version SNAP2012) are avail-able at http://www.pristionchus.org/variome/.

References

Andersen, E. C., J. P. Gerke, J. A. Shapiro, J. R. Criss-man, R. Ghosh, et al., 2012 Chromosome-scale selectivesweeps shape Caenorhabditis elegans genomic diversity. NatGenet 44: 285–290.

Andolfatto, P., 2007 Hitchhiking effects of recurrentbeneficial amino acid substitutions in the Drosophilamelanogaster genome. Genome Res 17: 1755–62.

Angiuoli, S. V., and S. L. Salzberg, 2011 Mugsy: fastmultiple alignment of closely related whole genomes. Bioin-formatics 27: 334–342.

Page 12: Characterization of genetic diversity in the nematode ... · 5.2 million substitutions and 1.2 million indels indicating a sequence divergence of approximately 10% that is dis-tributed

12

Barriere, A., S.-P. Yang, E. Pekarek, C. G. Thomas,E. S. Haag, et al., 2009 Detecting heterozygosity in shot-gun genome assemblies: Lessons from obligately outcross-ing nematodes. Genome Res 19: 470–480.

Barton, N. H., and B. Charlesworth, 1998 Why sex andrecombination? Science 281: 1986–90.

Begun, D. J., and C. F. Aquadro, 1992 Levels of naturallyoccurring dna polymorphism correlate with recombinationrates in D. melanogaster. Nature 356: 519–20.

Begun, D. J., A. K. Holloway, K. Stevens, L. W.Hillier, Y.-P. Poh, et al., 2007 Population genomics:whole-genome analysis of polymorphism and divergence inDrosophila simulans. PLoS Biol 5: 2534–2559.

Boetzer, M., C. V. Henkel, H. J. Jansen, D. Butler,and W. Pirovano, 2011 Scaffolding pre-assembled contigsusing sspace. Bioinformatics 27: 578–579.

Braverman, J. M., R. R. Hudson, N. L. Kaplan, C. H.Langley, and W. Stephan, 1995 The hitchhiking effecton the site frequency spectrum of DNA polymorphisms.Genetics 140: 783–96.

Bumbarger, D. J., M. Riebesell, C. Rodelsperger, andR. J. Sommer, 2013 System-wide rewiring underlies be-havioral differences in predatory and bacterial-feeding ne-matodes. Cell 152: 109–119.

Cao, J., K. Schneeberger, S. Ossowski, T. Gunther,S. Bender, et al., 2011 Whole-genome sequencing of mul-tiple Arabidopsis thaliana populations. Nature genetics .

Charlesworth, B., M. T. Morgan, andD. Charlesworth, 1993 The effect of deleteriousmutations on neutral molecular variation. Genetics 134:1289–303.

Charlesworth, J., and A. Eyre-Walker, 2008 The Mc-DonaldKreitman test and slightly deleterious mutations.Mol Biol Evol 25: 1007–1015. PMID: 18195052.

Cutter, A. D., and J. Y. Choi, 2010 Natural selectionshapes nucleotide polymorphism across the genome of thenematode Caenorhabditis briggsae. Genome Research 20:1103–1111. PMID: 20508143.

Cutter, A. D., M.-A. Felix, A. Barriere, andD. Charlesworth, 2006 Patterns of nucleotide polymor-phism distinguish temperate and tropical wild isolates ofCaenorhabditis briggsae. Genetics 173: 2021–2031. PMID:16783011.

Cutter, A. D., and B. A. Payseur, 2013 Genomic sig-natures of selection at linked sites: unifying the disparityamong species. Nature Reviews Genetics 14: 262–274.

Denver, D. R., K. A. Clark, and M. J. Raboin, 2011Reproductive mode evolution in nematodes: insights frommolecular phylogenies and recently discovered species. MolPhylogenet Evol 61: 584–592.

Dieterich, C., S. W. Clifton, L. N. Schuster, A. Chin-walla, K. Delehaunty, et al., 2008 The Pristionchuspacificus genome provides a unique perspective on nema-tode lifestyle and parasitism. Nat Genet 40: 1193–1198.

Foxe, J. P., T. Slotte, E. A. Stahl, B. Neuffer,H. Hurka, et al., 2009 Recent speciation associated withthe evolution of selfing in Capsella. Proc Natl Acad Sci US A 106: 5241–5245.

Gillespie, J. H., 2000 Genetic drift in an infinite population.The pseudohitchhiking model. Genetics 155: 909–19.

Gnerre, S., I. Maccallum, D. Przybylski, F. J. Ribeiro,J. N. Burton, et al., 2011 High-quality draft assembliesof mammalian genomes from massively parallel sequencedata. Proc Natl Acad Sci U S A 108: 1513–1518.

Gouzy, J., S. Carrere, and T. Schiex, 2009 FrameDP: sen-sitive peptide detection on noisy matured sequences. Bioin-formatics 25: 670–671.

Gusev, A., J. K. Lowe, M. Stoffel, M. J. Daly, D. Alt-shuler, et al., 2009 Whole population, genome-wide map-ping of hidden relatedness. Genome Res 19: 318–326.

Hill, W. G., and A. Robertson, 1966 The effect of linkageon limits to artificial selection. Genet Res 8: 269–94.

Hudson, R. R., and N. L. Kaplan, 1995 Deleterious back-ground selection with recombination. Genetics 141: 1605–17.

Huson, D. H., and D. Bryant, 2006 Application of phyloge-netic networks in evolutionary studies. Mol Biol Evol 23:254–267.

Kanzaki, N., E. J. Ragsdale, M. Herrmann, W. E.Mayer, and R. J. Sommer, 2012 Description of three Pris-tionchus species (nematoda: Diplogastridae) from Japanthat form a cryptic species complex with the model organ-ism P. pacificus. Zoolog Sci 29: 403–417.

Kawecki, T. J., and D. Ebert, 2004 Conceptual issues inlocal adaptation. Ecology Letters 7: 12251241.

Keightley, P. D., and A. Eyre-Walker, 2007 Joint infer-ence of the distribution of fitness effects of deleterious muta-tions and population demography based on nucleotide poly-morphism frequencies. Genetics 177: 2251–2261. PMID:18073430.

Kelley, D. R., M. C. Schatz, and S. L. Salzberg, 2010Quake: quality-aware detection and correction of sequenc-ing errors. Genome Biol 11: R116.

Kent, W. J., 2002 Blat–the blast-like alignment tool.Genome Res 12: 656–664.

Kienle, S., and R. J. Sommer, 2013 Cryptic variation invulva development by cis-regulatory evolution of a hairy-binding site. Nat Commun 4: 1714.

Kim, S., V. Plagnol, T. T. Hu, C. Toomajian, R. M.Clark, et al., 2007 Recombination and linkage disequilib-rium in Arabidopsis thaliana. Nat Genet 39: 1151–1155.

Kimura, M., and T. Ohta, 1973 The age of a neutral mutantpersisting in a finite population. Genetics 75: 199–212.PMID: 4762875.

Korf, I., 2004 Gene finding in novel genomes. BMC Bioin-formatics 5: 59.

Krawitz, P., C. Rodelsperger, M. Jger, L. Jostins,S. Bauer, et al., 2010 Microindel detection in short-readsequence data. Bioinformatics 26: 722–729.

Lee, K.-Z., A. Eizinger, R. Nandakumar, S. C. Schus-ter, and R. J. Sommer, 2003 Limited microsyn-teny between the genomes of Pristionchus pacificus andCaenorhabditis elegans. Nucleic Acids Res 31: 2553–2560.

Li, H., and R. Durbin, 2009 Fast and accurate short readalignment with burrows-wheeler transform. Bioinformatics25: 1754–1760.

Li, H., B. Handsaker, A. Wysoker, T. Fennell, J. Ruan,et al., 2009 The sequence alignment/map format and sam-tools. Bioinformatics 25: 2078–2079.

Li, R., W. Fan, G. Tian, H. Zhu, L. He, et al., 2010a Thesequence and de novo assembly of the giant panda genome.Nature 463: 311–317.

Li, R., H. Zhu, J. Ruan, W. Qian, X. Fang, et al., 2010b Denovo assembly of human genomes with massively parallelshort read sequencing. Genome Res 20: 265–272.

Lunter, G., and M. Goodson, 2011 Stampy: a statistical al-gorithm for sensitive and fast mapping of illumina sequencereads. Genome Res 21: 936–939.

Page 13: Characterization of genetic diversity in the nematode ... · 5.2 million substitutions and 1.2 million indels indicating a sequence divergence of approximately 10% that is dis-tributed

13

Macpherson, J. M., G. Sella, J. C. Davis, and D. A.Petrov, 2007 Genomewide spatial correspondence be-tween nonsynonymous divergence and neutral polymor-phism reveals extensive adaptation in Drosophila. Genetics177: 2083–99.

Mayer, W. E., L. N. Schuster, G. Bartelmes, C. Di-eterich, and R. J. Sommer, 2011 Horizontal gene transferof microbial cellulases into nematode genomes is associatedwith functional assimilation and gene turnover. BMC EvolBiol 11: 13.

Maynard Smith, J., and J. Haigh, 1974 The hitch-hikingeffect of a favourable gene. Genet Res 23: 23–35.

McDonald, J. H., and M. Kreitman, 1991 Adaptive pro-tein evolution at the Adh locus in Drosophila. Nature 351:652–4.

Messer, P. W., and D. A. Petrov, 2012 The McDonald-Kreitman test and its extensions under frequent adapta-tion: Problems and solutions. arXiv:1211.0060 .

Molnar, R. I., H. Witte, I. Dinkelacker, L. Villate,and R. J. Sommer, 2012 Tandem-repeat patterns and mu-tation rates in microsatellites of the nematode model organ-ism Pristionchus pacificus. G3 (Bethesda) 2: 1027–1034.

Morgan, K., A. McGaughran, L. Villate, M. Her-rmann, H. Witte, et al., 2012 Multi locus analysis ofPristionchus pacificus on La Reunion island reveals an evo-lutionary history shaped by multiple introductions, con-strained dispersal events and rare out-crossing. Mol Ecol21: 250–266.

Mugal, C. F., J. B. W. Wolf, and I. Kaj, 2014 Why timematters: Codon evolution and the temporal dynamics ofdN/dS. Mol Biol Evol 31: 212–231. PMID: 24129904.

Naglyaki, T., 1992 Introduction to Theoretical PopulationGenetics. Springer.

Neher, R. A., 2013 Genetic draft, selective interference, andpopulation genetics of rapid adaptation. Annual Review ofEcology, Evolution, and Systematics 44: 195–215.

Neher, R. A., and O. Hallatschek, 2013 Genealogies ofrapidly adapting populations. PNAS 110: 437–442.

Neher, R. A., T. A. Kessinger, and B. I. Shraiman, 2013Coalescence and genetic diversity in sexual populations un-der selection. PNAS 110: 15836–15841. PMID: 24019480.

Neher, R. A., and B. I. Shraiman, 2011 Genetic draft andquasi-neutrality in large facultatively sexual populations.Genetics 188: 975–996.

Nordborg, M., B. Charlesworth, andD. Charlesworth, 1996 The effect of recombination onbackground selection. Genet Res 67: 159–74.

Nordborg, M., T. T. Hu, Y. Ishino, J. Jhaveri,C. Toomajian, et al., 2005 The pattern of polymorphismin Arabidopsis thaliana. PLoS Biol 3: e196.

Patterson, N., A. L. Price, and D. Reich, 2006 Popula-tion structure and eigenanalysis. PLoS Genet 2: e190.

Phillips, P. C., 2006 One perfect worm. Trends in Genetics22: 405–407.

Price, M. N., P. S. Dehal, and A. P. Arkin, 2009 Fasttree:computing large minimum evolution trees with profiles in-stead of a distance matrix. Mol Biol Evol 26: 1641–50.

Pupko, T., I. Pe, R. Shamir, and D. Graur, 2000 A fast

algorithm for joint reconstruction of ancestral amino acidsequences. Mol Biol Evol 17: 890–896. PMID: 10833195.

Rae, R., H. Witte, C. Rodelsperger, and R. J. Sommer,2012 The importance of being regular: Caenorhabditis ele-gans and Pristionchus pacificus defecation mutants are hy-persusceptible to bacterial pathogens. Int J Parasitol 42:

747–753.Rocha, E. P., J. M. Smith, L. D. Hurst, M. T. Holden,

J. E. Cooper, et al., 2006 Comparisons of dN/dS are timedependent for closely related bacterial genomes. Journal ofTheoretical Biology 239: 226–235.

Rockman, M. V., and L. Kruglyak, 2009 Recombinationallandscape and population genomics of Caenorhabditis ele-gans. PLoS Genet 5: e1000419.

Rodelsperger, C., and C. Dieterich, 2010 Cyntenator:progressive gene order alignment of 17 vertebrate genomes.PLoS One 5: e8861.

Sargsyan, O., and J. Wakeley, 2008 A coalescent processwith simultaneous multiple mergers for approximating thegene genealogies of many marine organisms. TheoreticalPopulation Biology 74: 104–114.

Seger, J., W. Smith, J. Perry, J. Hunn, Z. Kaliszewska,et al., 2010 Gene genealogies strongly distorted by weaklyinterfering mutations in constant environments. Genetics184: 529.

Sella, G., D. A. Petrov, M. Przeworski, and P. Andol-fatto, 2009 Pervasive natural selection in the Drosophilagenome? PLoS Genet 5: e1000495.

Sinha, A., R. J. Sommer, and C. Dieterich, 2012 Diver-gent gene expression in the conserved dauer stage of thenematodes Pristionchus pacificus and Caenorhabditis ele-gans. BMC Genomics 13: 254.

Srinivasan, J., W. Sinz, T. Jesse, L. Wiggers-Perebolte, K. Jansen, et al., 2003 An integrated physicaland genetic map of the nematode Pristionchus pacificus.Mol Genet Genomics 269: 715–722.

Srinivasan, J., W. Sinz, C. Lanz, A. Brand, R. Nandaku-mar, et al., 2002 A bacterial artificial chromosome-basedgenetic linkage map of the nematode Pristionchus pacificus.Genetics 162: 129–134.

Trapnell, C., B. A. Williams, G. Pertea, A. Mortazavi,G. Kwan, et al., 2010 Transcript assembly and quantifica-tion by rna-seq reveals unannotated transcripts and iso-form switching during cell differentiation. Nat Biotechnol28: 511–515.

Wakeley, J., 2008 Coalescent theory. Roberts & Company.Walczak, A. M., L. E. Nicolaisen, J. B. Plotkin, and

M. M. Desai, 2012 The structure of genealogies in thepresence of purifying selection: A ”fitness-class coalescent”.Genetics 190: 753–779.

Wright, S. I., B. Lauga, and D. Charlesworth, 2002Rates and patterns of molecular evolution in inbred andoutbred Arabidopsis. Mol Biol Evol 19: 1407–1420.

Xie, C., and M. T. Tammi, 2009 Cnv-seq, a new methodto detect copy number variation using high-throughput se-quencing. BMC Bioinformatics 10: 80.