Genealogical trees, coalescent theory, and the analysis of genetic polymorphisms Magnus Nordborg University of Southern California


The importance of history

• Genetic polymorphism data represent the outcome of a single, highly complex, non-repeatable evolutionary history

• Traditional analysis methods cannot take this into account

• The stochastic process known as “the coalescent” presents a coherent statistical framework for analyzing genetic polymorphism data


The importance of history: mutations are random

[Figure: genealogy descending from the MRCA, with tips carrying G or T alleles]


The importance of history: trees are random


Modeling genetic polymorphism

At a minimum, models must include:

• coalescence (who begat whom, and when)

• mutation

• recombination


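Recombination makes it possible for linked sites to have different genealogies

[Figure: genealogy with a recombination event and a coalescence event; the breakpoint splits the sequence into segments with different induced trees]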


What is the coalescent?

• The coalescent is a stochastic process that is well-suited for modeling polymorphism data

• It is a natural extension to classical population genetics models


Coalescence: picking parents

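[Figure: a population of N = 10 traced back into the past; a sample of n = 3 lineages picks parents until it coalesces, with waiting times T(3) and T(2)]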


The rate of coalescence

The rate at which lineages find each other depends on:

• The population size: the per-generation probability of coalescence is ∝ 1/N

• The number of lineages: the rate of coalescence when there are k lineages is $\binom{k}{2}$

• A number of other demographic factors, such as inbreeding, age structure, and the variance in reproductive success

Because the per-generation probability of coalescence is on the order of 1/N, we use a continuous-time approximation where time is measured in units of N generations
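
The waiting times implied by these rates can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard neutral coalescent in which the time T(k) spent with k lineages is exponential with rate k(k − 1)/2, measured in units of N generations. The function names are hypothetical and not from the slides.

```python
import random

def simulate_coalescence_times(n, rng):
    """Return {k: T(k)} for k = n, ..., 2, in units of N generations."""
    times = {}
    for k in range(n, 1, -1):
        rate = k * (k - 1) / 2          # number of lineage pairs, "k choose 2"
        times[k] = rng.expovariate(rate)
    return times

def expected_tmrca(n):
    # E[T(k)] = 2 / (k (k - 1)); summing over k = 2..n telescopes to 2 (1 - 1/n).
    return 2.0 * (1.0 - 1.0 / n)

rng = random.Random(1)
n = 3
reps = 20000
mean_tmrca = sum(
    sum(simulate_coalescence_times(n, rng).values()) for _ in range(reps)
) / reps
print(mean_tmrca, expected_tmrca(n))    # simulated mean vs. analytical expectation
```

With n = 3 the sample spends on average 1/3 of a time unit with three lineages and 1 unit with two, so most of the tree height comes from the final pair.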


Mutation

• Selectively neutral mutations are added to the branches of the tree afterwards according to a rate that depends on the per-generation probability of mutation

• The expected number of mutations on a branch depends on its length; the expected number of mutations on the tree depends on the total branch length of the tree

• Any mutation model can be used
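
As a rough sketch of "adding mutations afterwards", the code below draws a tree length from the standard coalescent and then scatters mutations on it as a Poisson process. It assumes an infinite-sites model and a scaled mutation parameter `theta` such that mutations fall at rate theta/2 per unit of branch length. Both the parameter and the helper names are assumptions made for illustration, not taken from the slides.

```python
import random

def poisson(mean, rng):
    """Draw a Poisson variate by counting unit-rate arrivals in [0, mean]."""
    count, t = 0, rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count

def total_branch_length(n, rng):
    # While k lineages exist (for a time T(k) ~ Exp(k choose 2)),
    # the tree grows by k * T(k).
    return sum(k * rng.expovariate(k * (k - 1) / 2) for k in range(2, n + 1))

def simulate_segregating_sites(n, theta, rng):
    # Under infinite sites every mutation creates a new segregating site.
    return poisson(theta / 2 * total_branch_length(n, rng), rng)

def expected_segregating_sites(n, theta):
    # E[total length] = 2 * sum_{i=1}^{n-1} 1/i, so E[S] = theta * sum 1/i.
    return theta * sum(1.0 / i for i in range(1, n))

rng = random.Random(2)
sims = [simulate_segregating_sites(10, 5.0, rng) for _ in range(5000)]
print(sum(sims) / len(sims), expected_segregating_sites(10, 5.0))
```

The two printed numbers should be close, which illustrates that the expected number of mutations depends only on the total branch length of the tree.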


Recombination

• Recombination breaks up lineages according to a rate that depends on the per-generation probability of recombination

• There will be more recombination in the genealogy of a longer chromosomal segment

• Any recombination model can be used

• The coalescent with recombination generates a random graph, or a forest of trees


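A graph or a forest...

[Figure: the same genealogy drawn as a single graph with recombination and coalescence events, and as the trees induced on either side of the breakpoint]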


A walk through tree space

[Figure: time to MRCA (y-axis, about 1.5 to 4) plotted against chromosomal position (x-axis, 0 to 1)]


The trees are correlated

[Figure: time to MRCA against chromosomal position (0 to 1), with two example trees on tips 1 to 6 drawn at nearby positions]


The trees are correlated

[Figure: time to MRCA against chromosomal position (0 to 1), with three example trees on tips 1 to 6 drawn at nearby positions]


Recombination is common

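[Figure: time to MRCA against chromosomal position (0 to 1), with junctions and mutations marked along the segment; annotations: "this may be 10 kb!", "these are junctions", "these are mutations"]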


Recombination is as common as mutation

• If 1 cM ∼ 1 Mb, then the probability of recombination per bp per generation is ∼ $10^{-8}$

• The probability of mutation per bp per generation is estimated to be at most $10^{-8}$

• It follows that a sample of sequences will contain as many junctions as polymorphisms
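
The arithmetic behind this comparison is short enough to spell out. The sketch below uses only the figures quoted above (1 cM per Mb, and a mutation probability of at most 10^-8 per bp per generation). The variable names are illustrative.

```python
# 1 centimorgan corresponds to an expected 0.01 crossovers per generation.
MORGANS_PER_CM = 0.01
BP_PER_MB = 1_000_000

# If 1 cM spans roughly 1 Mb, spread that probability evenly over the bases.
recomb_per_bp = MORGANS_PER_CM / BP_PER_MB

# Upper bound on the per-bp, per-generation mutation probability quoted above.
mutation_per_bp = 1e-8

# Ratio of recombination to mutation events per bp per generation.
ratio = recomb_per_bp / mutation_per_bp
print(recomb_per_bp, ratio)
```

A ratio of about one means that each base is roughly as likely to be a recombination breakpoint as a mutation site in any generation, which is why junctions are expected to be about as numerous as polymorphisms.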


Genealogical graphs cannot in general be reconstructed

• Even with infinitely many polymorphisms, a substantial fraction of all junctions would not be detected

• In reality, there are clearly too few polymorphisms per junction to estimate the graph

• Remember: a phylogenetic algorithm will always reconstruct a tree, regardless of whether there exists a tree to be reconstructed...


We do not in general wish to reconstruct genealogical graphs

• Population genetics is not phylogenetics!

• Gene genealogies are of no interest per se: they are random outcomes of an underlying evolutionary process, and are of interest only insofar as they contain information about this process


Gene trees and species trees

Phylogenetic methods estimate species trees by estimating gene trees; they are appropriate if and only if the latter are strongly correlated with the former


Phylogenetic methods are not applicable to within-species data

[Figure: three models of human origins over the last million years, each relating Africa, Europe, and Asia, with migration indicated: (a) Out-of-Africa model, (b) Multiregional model, (c) Candelabra model. Alongside, a schematic version of the human mtDNA tree separating Africans and non-Africans]

• We must consider the likelihood of the data under alternative models


A likelihood framework

Phylogenetics: $L = P(D \mid G, \mu)$

Population genetics: $L = \sum_G P(D \mid G, \mu)\, P(G, \alpha)$

Here D is the data, G the genealogy, µ the mutation model, and α the demographic model

Note that G is a nuisance parameter in population genetics
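
One way to see what summing over G means in practice is a naive Monte Carlo average: draw genealogies from the demographic model, compute the probability of the data on each, and take the mean. The sketch below is a deliberately simplified illustration, not the method used in the talk. It assumes the standard neutral coalescent as the demographic model, an infinite-sites mutation model with scaled rate `theta` (theta/2 mutations per unit branch length), and data summarized only by the number of segregating sites `s_obs`. All names are hypothetical.

```python
import math
import random

def total_branch_length(n, rng):
    # Sum of k * T(k), with T(k) ~ Exp(k choose 2) in units of N generations.
    return sum(k * rng.expovariate(k * (k - 1) / 2) for k in range(2, n + 1))

def poisson_pmf(s, mean):
    # Computed on the log scale to avoid overflow in s! for large s.
    return math.exp(-mean + s * math.log(mean) - math.lgamma(s + 1))

def mc_likelihood(s_obs, n, theta, reps, rng):
    """Average P(S = s_obs | G, theta) over genealogies G drawn from the coalescent."""
    total = 0.0
    for _ in range(reps):
        length = total_branch_length(n, rng)      # the only feature of G that matters here
        total += poisson_pmf(s_obs, theta / 2 * length)
    return total / reps

# Likelihood curve for a sample of 10 sequences with 14 segregating sites.
for theta in (0.5, 2.0, 5.0, 10.0, 50.0):
    print(theta, mc_likelihood(14, 10, theta, 2000, random.Random(3)))
```

Because the genealogy is integrated out, the result is a function of the parameters only, which is the sense in which G is a nuisance parameter. For full sequence data the space of genealogies is far too large for this naive averaging to be efficient.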


Uses of the coalescent

• A mathematical modeling tool

• A simulation tool for hypothesis testing and exploratory data analysis

• The basis for full likelihood inference


The simplicity and elegance of the coalescent process make it a powerful modeling tool

At least for the standard coalescent, it is often possible to derive results analytically

• Estimators and tests, e.g., Tajima’s D statistic

• Illuminating theoretical results, e.g., the probability that a sample of size n contains the MRCA of the entire population is $\frac{n - 1}{n + 1}$
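
For readers who have not met Tajima's D, the sketch below computes it from three summary numbers. The slides only name the statistic. The formula used here is the standard definition from Tajima (1989), and the inputs (sample size `n`, number of segregating sites `seg_sites`, mean pairwise difference `pi`) are assumed to have been computed from an alignment beforehand.

```python
import math

def tajimas_d(n, seg_sites, pi):
    """Tajima's D from sample size, segregating sites, and mean pairwise differences."""
    if n < 2 or seg_sites < 1:
        raise ValueError("need at least two sequences and one segregating site")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # Difference between two estimators of theta, scaled by its estimated standard deviation.
    diff = pi - seg_sites / a1
    return diff / math.sqrt(e1 * seg_sites + e2 * seg_sites * (seg_sites - 1))

print(tajimas_d(10, 16, 3.888889))
```

Negative values indicate an excess of rare variants relative to the standard coalescent, and positive values an excess of intermediate-frequency variants.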


Almost any scenario can be simulated using the coalescent

• Coalescent simulations are enormously more efficient than classical methods

• Simulated data can be compared with real data, or used to evaluate the feasibility of a study before it is carried out


Example: ancient Neanderthal mtDNA

[Figure: mtDNA tree joining 986 modern humans and a Neanderthal, with times labeled ts, Te, and Tr]

• Modern humans monophyletic

• Tr > 4Te

Does this prove that Neanderthals and modern humans did not interbreed?


Example: ancient Neanderthal mtDNA

[Figure: mtDNA tree joining 986 modern humans and a Neanderthal, with times labeled ts, Te, and Tr; the Mediterranean is marked]

Assuming that they did interbreed, what is the probability of getting a tree like the one observed just by chance?

Coalescent simulations showed that this probability is high even for large amounts of interbreeding


Full likelihood analysis

• In principle possible

• In practice difficult

• Unless major breakthroughs are made, not likely to be applicable to genomic polymorphism data


What is the main insight from coalescent theory?

That very large numbers of loci are required to answer most questions!


Population Genomics is upon us!

• Data sets containing hundreds or thousands of loci already exist

• Within 10 years, it seems likely that whole-genome comparisons between species will be common, and that we will have whole-genome sequences from thousands of humans


Fewer assumptions, more data

We will be able to use empirically estimated distributions of test statistics rather than theoretically predicted ones

[Figure: D (y-axis, −2 to 2) plotted against position in kb (x-axis, 200 to 600)]


Selective sweeps

• Fixation of new alleles leaves a footprint in the pattern of genomic variation

• Can we find the genes that “make us human”?

[Figure: selection acting on an advantageous variant]


How many genes?


Teosinte to corn: < 10,000 years; five genes?

[Figure: teosinte, maize, and maize with tb1 mutation]


What’s the use of polymorphism data?

• Whole-genome properties

– demographic (sensu lato) history

– molecular evolution

– genetic mechanisms

• The history of individual loci — selection

– divergence between humans and other primates

– traces of selection within the last million years


The history and future of multi-locus methods
