Genomic diversity and population structure in switchgrass, Panicum virgatum:

Genotyping-by-sequencing and population genomics

Geoff Morris*, Paul Grabowski, Justin Borevitz

Dept. of Ecology and Evolution

University of Chicago

Genomic diversity and population structure

• Geographic patterns of genomic diversity reflect: drift, migration, and adaptation

• Genomic diversity: nucleotide variation and insertions/deletions across many loci in the nuclear and organellar genomes.

• Leads to design of mapping populations for quantitative genetics and molecular breeding

Genomic diversity and natural history

Emerson et al. PNAS 2010

Example: Pitcher plant mosquito (Wyeomyia smithii)

Ecotypic diversity in switchgrass

• Switchgrass and other wide-ranging grassland species have many ecotypes

• Great variability in size, shape, color, and habitat preference

• Example: Upland/lowland divergence

Upland (Michigan). Adapted to: shorter growing season, drier climates

Lowland (Oklahoma). Adapted to: long growing season, wet climates

Effects of ecotype diversity on productivity

• Three-year plot (6 m²) experiment at Fermilab

• ~20% overyield in switchgrass mixtures compared to monocultures

“Genomic diversity and population structure in switchgrass, Panicum virgatum: from the continental scale to a dune landscape”

Morris, Grabowski, and Borevitz

Accepted, Molecular Ecology

Biogeography of Indiana Dunes flora

Coastal Plain flora: e.g. Seaside spurge, Marramgrass

Boreal flora: e.g. Jack Pine, Bearberry

Great Plains flora: e.g. Sandreed, Little Bluestem

Eastern deciduous flora: e.g. Tulip tree

Recolonized post-glaciation: ~10,000 years ago

Switchgrass gene pools

Zhang et al. 2011

?

Landscapes in Indiana Dunes

Landscape features are dynamic and can be dated:

• 100s – 1000s of years for dunes

• 10s – 100s of years for blowouts

Big blowout ~ 150 years old

Study questions

• Can switchgrass population structure be confirmed with a genome-wide sample of non-ascertained markers?

• In a hierarchical sample of switchgrass, how much diversity is there on a landscape, regional, and continental scale?

• Did multiple switchgrass gene pools contribute to the Indiana Dunes populations?

• Is there genomic diversity in a single landscape feature (blowout)?

• Is there local (private) genetic diversity in the Indiana Dunes?

Switchgrass plant samples

• Switchgrass cultivated varieties (cultivars)

– Kanlow (Oklahoma - lowland)

– Blackwell (Oklahoma - upland)

– High Tide (Maryland - Coastal)

– Forestburg and Sunburst (South Dakota)

– Dacotah (North Dakota)

– Cave-in-Rock (Illinois)

– Southlow (Southern Michigan “ecopool”)

• Indiana Dunes switchgrass

– Big Blowout

– Jack pine savanna

– Interdune

Problems with traditional marker systems

• Locus sampling:

– Typically only a few kb are sequenced in a few loci (rDNA, cp introns)

– Large stochastic error and loci-specific bias

– e.g. Plant chloroplast has 100X lower rate of evolution than animal mitochondria

• Ascertainment bias:

– Occurs whenever markers are discovered and typed separately

– Worst when the ascertainment panel is a geographically restricted subpopulation

– e.g. Inferred genetic diversity in Africans is spuriously low when European markers are used

= restriction site

1) PstI digest of genomic DNA

2) End-polish, blunt-end ligation; Illumina barcodes

3) PCR amplify and pool fragments from multiple samples

4) Assemble and map reads to “stacks” and call SNPs
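Step 1 of the protocol above can be illustrated with an in silico digest. This is a minimal sketch, not the authors' pipeline: it assumes the standard PstI recognition sequence CTGCAG (cut after CTGCA), and the function name `psti_fragments` and the toy sequence are hypothetical.

```python
import re

# PstI recognition sequence; the enzyme cuts between the final A and G (CTGCA^G).
PSTI_SITE = "CTGCAG"


def psti_fragments(sequence):
    """Return the fragments produced by cutting `sequence` at every PstI site."""
    sequence = sequence.upper()
    # The cut falls 5 bases into each site: after CTGCA, before the last G.
    cuts = [m.start() + 5 for m in re.finditer(PSTI_SITE, sequence)]
    bounds = [0] + cuts + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:])]


# Hypothetical toy sequence with two PstI sites.
toy = "AAACTGCAGTTTTCTGCAGGG"
fragments = psti_fragments(toy)
print(fragments)
```

Only the fragment ends next to a restriction site are sequenced in a reduced representation library, which is why the same loci are sampled across individuals.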

Genomic diversity from de novo sequencing

• Reduced representation + multiplexing = more samples

• 10,000+ candidate SNPs

• No reference genome needed

• Data here from 76 or 100 bp paired end reads

• 40 billion base pair data set

Plastome sequence in RRLs

• Nuclear whole genome shotgun sequence is too light (<<1X) for assembly

• Plastome WGS is very high (>>1X)

1) PstI digest of genomic DNA, with star activity and random shearing

2) End-polish, blunt-end ligation

Analysis of chloroplast data

• Chloroplast genome sequence (plastome) included in data

• Random (shotgun) sequence + 20 PstI sites

• Switchgrass chloroplast reference available (Upland and Lowland)

• Mapped reads to both ~140,000 base pair chloroplast genomes

• Coverage (# of times each position is read): 1X – 786X
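Coverage as defined above (the number of times each position is read) can be computed from mapped read intervals. The sketch below assumes reads are given as 0-based, half-open `(start, end)` intervals on the reference; the function name and the toy reads are hypothetical and not taken from the slides.

```python
def coverage_per_position(genome_length, read_intervals):
    """Count how many mapped reads overlap each reference position.

    `read_intervals` holds (start, end) pairs, 0-based and half-open.
    """
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (genome_length + 1)
    for start, end in read_intervals:
        diff[start] += 1
        diff[end] -= 1

    coverage = []
    depth = 0
    for delta in diff[:genome_length]:
        depth += delta
        coverage.append(depth)
    return coverage


# Hypothetical toy example: three reads on a 10 bp reference.
reads = [(0, 4), (2, 6), (3, 8)]
cov = coverage_per_position(10, reads)
print(cov, "min:", min(cov), "max:", max(cov))
```

On a real ~140,000 bp plastome the same per-position vector gives the coverage range quoted on the slide.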

Chloroplast coverage and polymorphisms

[Figure: chloroplast genome coverage by position (kb)]

Chloroplast phylogeny

• Neighbor joining tree based on 140kb

• Named haplogroups have >50% bootstrap

• Unfilled lines indicate low-coverage sample

Population analysis of nuclear loci

• Create “pseudoreference” of RRL loci with de novo assembly

• Map reads to pseudoreference to create stacks (150-1500 reads)

• Map reads to switchgrass chloroplast and sorghum mitochondria, and drop stacks that match organelles

• Select single-nucleotide variants that:

– Have high sequence quality (base-call error probability < 0.001, i.e. Phred score > 30, for both alleles)

– Vary in frequency across samples (chi-square test, p < 0.01)

– Are nearest to restriction site, closest to beginning of read

• Randomly select one allele per sample (weighted by observed frequency)
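Two of the steps above lend themselves to a short illustration: the quality threshold and the random allele draw. This is a sketch under stated assumptions rather than the authors' code. It uses the standard Phred definition Q = -10 log10(p), and it assumes "weighted by observed frequency" means weighting by per-sample read counts. Sample names and counts are hypothetical.

```python
import math
import random


def phred_to_error_prob(q):
    """Convert a Phred quality score to a base-call error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert a base-call error probability to a Phred quality score."""
    return -10 * math.log10(p)


def sample_allele(allele_counts, rng):
    """Draw one allele for a sample, weighted by its observed read count."""
    alleles = list(allele_counts)
    weights = [allele_counts[a] for a in alleles]
    return rng.choices(alleles, weights=weights, k=1)[0]


# Hypothetical read counts at one SNP for two samples.
counts_by_sample = {
    "dune_1": {"A": 30, "G": 10},  # mixed reads: A is drawn about 75% of the time
    "kanlow_1": {"G": 25},         # only G observed: always G
}
rng = random.Random(0)  # fixed seed so the draw is repeatable
calls = {name: sample_allele(c, rng) for name, c in counts_by_sample.items()}
print(calls)
```

Drawing a single allele per sample sidesteps genotype calling, which is difficult in a polyploid without a reference genome.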

Coding sequence variation in the chloroplast

• 77 coding genes in chloroplast (including Rubisco, ribosome, etc)

– 60kb of coding sequence

• Constraint on non-synonymous (NS) relative to synonymous (S) variation provides biological validation for SNPs

• Upland vs. Lowland (~1 million years):

– 23 NS : 16 S (ratio = 1.4)

• Within upland (< 0.5 million years):

– 16 NS : 3 S (ratio = 5.3)
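The two ratios above are plain count ratios, and the arithmetic can be checked directly. Note that this is a ratio of raw NS to S counts, not a per-site normalized dN/dS estimate. The function name is hypothetical.

```python
def ns_to_s_ratio(nonsynonymous, synonymous):
    """Ratio of non-synonymous to synonymous variant counts."""
    return nonsynonymous / synonymous


between_ecotypes = ns_to_s_ratio(23, 16)  # upland vs. lowland
within_upland = ns_to_s_ratio(16, 3)      # within upland
print(round(between_ecotypes, 1), round(within_upland, 1))
```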

Nuclear genome: Multidimensional scaling

~11000 nuclear loci, mean of 100 random allele samples
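Multidimensional scaling takes a pairwise distance matrix as input. The sketch below shows one way such a matrix could be built by averaging over repeated random allele draws. The slides do not state the distance metric, so two things here are assumptions: the metric is the proportion of mismatching loci, and "mean of 100 random allele samples" means averaging distances over draws. All sample names and counts are hypothetical.

```python
import random


def mismatch_distance(a, b):
    """Proportion of loci at which two sampled allele lists differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def mean_distance_matrix(read_counts, n_draws, rng):
    """Average pairwise distances over repeated random allele draws.

    `read_counts[sample]` is a list with one {allele: count} dict per locus.
    """
    names = sorted(read_counts)
    n = len(names)
    total = [[0.0] * n for _ in range(n)]
    for _ in range(n_draws):
        # One allele per sample per locus, weighted by read count.
        draws = {
            name: [
                rng.choices(list(c), weights=list(c.values()))[0]
                for c in read_counts[name]
            ]
            for name in names
        }
        for i in range(n):
            for j in range(n):
                total[i][j] += mismatch_distance(draws[names[i]], draws[names[j]])
    return names, [[v / n_draws for v in row] for row in total]


# Hypothetical toy data: three samples, three loci.
toy_counts = {
    "up1": [{"A": 10}, {"C": 8}, {"G": 5, "T": 5}],
    "up2": [{"A": 9}, {"C": 7}, {"G": 6}],
    "low1": [{"G": 10}, {"T": 8}, {"T": 9}],
}
names, dist = mean_distance_matrix(toy_counts, n_draws=100, rng=random.Random(1))
for name, row in zip(names, dist):
    print(name, [round(v, 2) for v in row])
```

The resulting matrix could then be passed to any classical MDS routine, for example `cmdscale` in R, which the slides list as part of the custom pipeline.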

Nuclear loci: Structure analysis

Bayesian clustering algorithm; ~11000 nuclear loci, random allele sample, burn-in 10K, run 10K

Conclusions

• Confirmed upland vs. lowland differentiation and differentiated a local population using non-ascertained markers

• Lake Michigan switchgrass is distinct from the broader upland population in the Midwest and Great Plains.

• Post-glacial gene flow into the Indiana Dunes included genotypes from across the Great Plains and Midwest

• The chloroplast diversity in the Indiana Dunes did not evolve in the current Midwestern population, but originated one or more glacial cycles ago

• A single blowout in the dunes can have as much chloroplast diversity as the Midwest

New GBS methods for population genomics

• For true population analysis we need 10+ individuals in multiple populations

• Illumina multiplexing is too expensive – separate prep cost for each library adds $100s/sample

• Read count overdispersion (up to ~200X more than Poisson) requires technical replicates to even out counts
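Overdispersion relative to Poisson can be summarized by the variance-to-mean ratio (index of dispersion), which is about 1 for Poisson counts. The sketch below is illustrative only: the read counts are hypothetical, and the slides do not say which statistic was used to arrive at the ~200X figure.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Variance-to-mean ratio of read counts; about 1 for Poisson data."""
    return variance(counts) / mean(counts)


# Hypothetical per-sample read counts at one locus.
even_counts = [9, 10, 11, 10, 10]  # tighter than Poisson
skewed_counts = [1, 2, 50, 3, 200]  # strongly overdispersed

print(dispersion_index(even_counts))
print(dispersion_index(skewed_counts))
```

A ratio far above 1 means a few libraries dominate the read pool, which is the situation the slide says technical replicates are meant to even out.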

• Sticky-end ligation increases specificity and removes random sequence (including plastome)

Genotyping-by-Sequencing (GBS)

Based on Elshire et al. 2011, PLoS ONE

GBS on continental + dunes switchgrass

New population genomic studies with GBS

1. Continental population structure (126 individuals)

– 50/50 deep diversity and shallow diversity based on chloroplast markers and SSRs

2. Tetraploid cultivars (24 each for TX, OK, NE, ND cultivars)

– Ploidy differences may be confounded with genetic diversity

– High sample size should allow traditional pop gen analyses (Fst etc.)

3. Dune half-sibs (4 mothers and 10 offspring each)

– True SNPs will segregate in the offspring while homeologous substitutions will not
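The half-sib logic in item 3 can be sketched as a simple check on offspring genotype calls: a true SNP should yield more than one genotype class among a mother's offspring, while a fixed difference between homeologous subgenomes shows the same apparent "heterozygote" in every offspring. The genotype encoding (two-letter strings, `None` for missing calls) and the function name are assumptions for illustration.

```python
def segregates(offspring_genotypes):
    """Return True if the called offspring show more than one genotype class.

    Genotypes are two-letter strings such as "AG"; allele order is ignored
    and missing calls (None) are skipped.
    """
    called = {"".join(sorted(g)) for g in offspring_genotypes if g is not None}
    return len(called) > 1


# Hypothetical calls for the 10 offspring of one mother.
true_snp = ["AG", "AA", "AG", "AA", "AG", "AG", "AA", "AA", "AG", "AA"]
homeolog = ["AG"] * 10  # every offspring looks "heterozygous"

print(segregates(true_snp), segregates(homeolog))
```

A locus that is fixed and homozygous also fails this check, so in practice the test would be applied only to loci that appear variable within a family.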

Bioinformatics overview

• No software package for population genomic analysis on GBS

• Stacks (U. Oregon) comes closest but multinomial sampling model expects high frequency SNPs (e.g. mapping population)

• Buckler lab TASSEL package (Java) may be appropriate

• We’ve been using custom pipeline (CLC, MySQL, R) for analysis

– http://create.ly/gefxsub43
