17
Frequency Spectra and Inference in Population Genetics Although coalescent models have come to play a central role in population genetics, there are some situations where genealogies may not lead to efficient inference. Unphased SNP data: with short reads, it may not be possible to accurately infer haplotypes. Mixtures: with short reads, it may not be possible to assemble sequences corresponding to different individuals. Selection: although the coalescent can be extended to include the effects of selection, in general the resulting processes are difficult to simulate. Complex models: likelihood surfaces may be difficult to reconstruct in models with many parameters using inherently noisy stochastic simulations of the coalescent. Jay Taylor (ASU) The Frequency Spectrum 23 Feb 2017 1 / 17

Frequency Spectra and Inference in Population Genetics

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Frequency Spectra and Inference in Population Genetics

Frequency Spectra and Inference in Population Genetics

Although coalescent models have come to play a central role in population genetics,there are some situations where genealogies may not lead to efficient inference.

Unphased SNP data: with short reads, it may not be possible to accurately inferhaplotypes.

Mixtures: with short reads, it may not be possible to assemble sequencescorresponding to different individuals.

Selection: although the coalescent can be extended to include the effects ofselection, in general the resulting processes are difficult to simulate.

Complex models: likelihood surfaces may be difficult to reconstruct in models withmany parameters using inherently noisy stochastic simulations of the coalescent.

Jay Taylor (ASU) The Frequency Spectrum 23 Feb 2017 1 / 17

Page 2: Frequency Spectra and Inference in Population Genetics

An alternative approach is to use the distribution of allele frequencies in sampledpopulations to make inferences about the evolutionary forces operating in thosepopulations.

The allele frequency spectrum (AFS) is the distribution of the frequencies orcounts of derived alleles in a sample of individuals calculated over all segregatingsites.

The folded allele frequency spectrum (folded AFS) is the distribution of thefrequencies or counts of minor alleles in a sample calculated over all segregatingsites.

The AFS and folded AFS will be similar when the derived alleles are also the minoralleles, but this will not always be the case.

However, the folded AFS can always be calculated, whereas the AFS can only becalculated if we can reliably distinguish derived from ancestral alleles.

When it can be calculated, the AFS will be more informative about the populationhistory than the folded AFS.

Jay Taylor (ASU) The Frequency Spectrum 23 Feb 2017 2 / 17

Page 3: Frequency Spectra and Inference in Population Genetics

Allele frequency spectra of autosomal and X-linked genes in M. musculus

The allele frequency spectrum for a sampleof n chromosomes can be represented by avector

a = (a0, · · · , an)

where ai denotes the number of segregatingsites at which the derived allele is carried byexactly i chromosomes.

For example, a1 is the number of singletonsin the sample.

Athanosios et al. (2014)

Jay Taylor (ASU) The Frequency Spectrum 23 Feb 2017 3 / 17

Page 4: Frequency Spectra and Inference in Population Genetics

Folded and Unfolded AFS in D. melanogaster

Cooper et al. (2015)

Jay Taylor (ASU) The Frequency Spectrum 23 Feb 2017 4 / 17

Page 5: Frequency Spectra and Inference in Population Genetics

To use the allele frequency spectrum for inference, we need to be able to calculate itslikelihood under models that account for the evolutionary processes thought to beoperating on the population.

The probability density of the frequency x of the derived allele will depend on thechoice of the model and its parameter through some function f (x |Θ).

If n chromosomes are sampled at random from a population in which thefrequency of the derived allele is x , the probability that it will be sampled k timesis given by the binomial distribution:

P(k|x) =

(n

k

)xk(1− x)n−k .

The unconditional probability of sampling the derived allele k times can becalculated by integrating the preceding conditional probability with respect to thedensity of x :

P(k|Θ) =

∫ 1

0

f (x |Θ) ·

(n

k

)xk(1− x)n−kdx

Jay Taylor (ASU) The Frequency Spectrum 23 Feb 2017 5 / 17

Page 6: Frequency Spectra and Inference in Population Genetics

If we assume that the segregating sites are independent of one another, then the totalprobability of the allele frequency spectrum A = (a1, a2, · · · , an−1) can be calculated asfollows:

P(A|Θ) =n−1∏k=1

e−P(k|Θ) P(k|Θ)ak

ak !.

This assumes an infinite sites model, with new mutations produced by a Poissonprocess.

It also assumes that the segregating sites in the sample are unlinked (and hencestatistically uncorrelated conditional on Θ).

With linked sites, we can treat this as a composite likelihood, which is known togive consistent estimates of parameter values under many neutral models, eventhough it underestimates the variance in these estimates.

Bootstraps (conventional and parametric) can be used to obtain confidenceintervals when working with composite likelihoods.

Jay Taylor (ASU) The Frequency Spectrum 23 Feb 2017 6 / 17

Page 7: Frequency Spectra and Inference in Population Genetics

Joint Allele Frequency Spectrum

If individuals are sampled from different populations, the distribution of allelefrequencies cross populations can be characterized using the joint allele frequencyspectrum, which specifies the joint distribution of the counts of derived alleles indifferent populations.

For example, with two populations, we need to keep track of the number ofsegregating sites at which the derived allele was sampled i times in the firstpopulation and j times in the second population.

To evaluate the likelihood of the joint AFS, we then need a model that will allowus to calculate the joint probability distribution f (x1, x2) of the derived allelefrequencies in the two populations.

Dependence between the allele frequencies in the different populations will arisebecause of shared ancestry and migration.

Jay Taylor (ASU) The Frequency Spectrum 23 Feb 2017 7 / 17

Page 8: Frequency Spectra and Inference in Population Genetics

Joint Allele Frequency Spectra of Two Populations (Simulated)

Jay Taylor (ASU) The Frequency Spectrum 23 Feb 2017 8 / 17

Page 9: Frequency Spectra and Inference in Population Genetics

In principal, the distribution of the derived allele frequencies in a population could becalculated under either the Wright-Fisher model or some other suitable Markov chain(e.g., the Moran model).

In practice, this is difficult because usually analytical expressions are not availablefor these models, which instead must be solved using calculations involving largematrices.

In addition, changing the effective population size changes the state space of theWright-Fisher model, which will increase computational costs.

Fortunately, these complications can be sidestepped to a degree by working with anapproximation to the Wright-Fisher model which is accurate when the effectivepopulation size is sufficiently large, say Ne > 100.

Jay Taylor (ASU) The Frequency Spectrum 23 Feb 2017 9 / 17

Page 10: Frequency Spectra and Inference in Population Genetics

The following figure shows a series of simulations of the Wright-Fisher model for 100generations for N = 10 (blue), 100 (red), 1000 (orange), and 10, 000 (green).

Notice that both the size of the fluctuations and the total change in p over 100generations decrease as the population size is increased.

Jay Taylor (ASU) The Frequency Spectrum 23 Feb 2017 10 / 17

Page 11: Frequency Spectra and Inference in Population Genetics

A different picture emerges if we plot each sample path against time measured in unitsof N generations. Here the total rescaled time is t = 1.

In this case, the jumps become smaller, but the total change in p over N generationsdoes not tend to 0.

Jay Taylor (ASU) The Frequency Spectrum 23 Feb 2017 11 / 17

Page 12: Frequency Spectra and Inference in Population Genetics

Diffusion Processes in Population Genetics

In fact, it can be shown that as N →∞, the rescaled processes (p(N)(Nt) : t ≥ 0)converge to a limiting process known as a diffusion approximation.

Diffusion processes are Markov processes with continuous sample paths, i.e., thereare no jumps.

Diffusion approximations can be derived for the Wright-Fisher model withmutation and selection provided that these other processes are sufficiently weak.

The transition densities f (p; t) satisfy a partial differential equation known as theKolmogorov forward equation:

∂tφ(x ; t) =

∂2

∂x2

(x(1− x)

4Neφ(x ; t)

)− ∂

∂x

(sx(1− x)φ(x ; t)

)This equation can be solved numerically using sophisticiated techniques availablefor PDE’s that do not require stochastic simulations.

Jay Taylor (ASU) The Frequency Spectrum 23 Feb 2017 12 / 17

Page 13: Frequency Spectra and Inference in Population Genetics

Diffusion approximations can also be derived for models with more than one population.In this case, the joint density of the allele frequencies in the different populations can befound by solving the following PDE:

∂tφ =

1

2

P∑i=1

∂2

∂x2i

{xi (1− xi )

4νiφ

}

−P∑i=1

∂xi

[{γixi (1− xi ) +

P∑j=1

Mi←j(xj − xi )

}

where

νi is the relative effective population size of population i ;

γi = 2Nsi is the scaled selective coefficient of the derived allele in population i ;

Mi←j = 2Nmi←j is the scaled migration rate from j to i .

This must be solved numerically, but again this can be done without stochasticsimulations.

Jay Taylor (ASU) The Frequency Spectrum 23 Feb 2017 13 / 17

Page 14: Frequency Spectra and Inference in Population Genetics

Example: Out of Africa analysis

Jay Taylor (ASU) The Frequency Spectrum 23 Feb 2017 14 / 17

Page 15: Frequency Spectra and Inference in Population Genetics

Example: Out of Africa analysis

Jay Taylor (ASU) The Frequency Spectrum 23 Feb 2017 15 / 17

Page 16: Frequency Spectra and Inference in Population Genetics

Example: Settlement of the New World analysis

Jay Taylor (ASU) The Frequency Spectrum 23 Feb 2017 16 / 17

Page 17: Frequency Spectra and Inference in Population Genetics

Example: Settlement of the New World analysis

Jay Taylor (ASU) The Frequency Spectrum 23 Feb 2017 17 / 17