
Population Genetics Models · Coalescent Theory for Phylogenetic Inference

Coalescent Theory: An Introduction for Phylogenetics

Laura Salter Kubatko
Departments of Statistics and Evolution, Ecology, and Organismal Biology
The Ohio State University

[email protected]

May 11, 2010

Stat 882: Statistical Phylogenetics - Lecture 6


Why study population genetics? · Wright-Fisher Model · The Coalescent

Relationship between population genetics and phylogenetics

• Population genetics: the study of genetic variation within a population

• Phylogenetics: the use of genetic variation between taxa (species, populations) to infer evolutionary relationships

• So far, we’ve assumed:
    • Each taxon is represented by a single sequence (this is often called “exemplar sampling”)
    • We have data for a single gene and wish to estimate the evolutionary history for that gene (the gene tree or gene phylogeny)


Relationship between population genetics and phylogenetics

• Given current technology, we could do much more:
    • Sample many individuals within each taxon (species, population, etc.)
    • Sequence many genes for all individuals

• Need models at two levels:
    • Model what happens within each population (standard population genetics)
    • Apply within-population models to each population represented on a phylogeny (more recent work)


Wright-Fisher Model

• Assumptions:
    • Population of $2N$ gene copies
    • Discrete, non-overlapping generations of equal size
    • Parents of the next generation of $2N$ genes are picked randomly with replacement from the preceding generation (genetic differences have no fitness consequences)
    • The probability of a specific parent for a gene in the next generation is $\frac{1}{2N}$
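The sampling assumption above can be sketched in a few lines of Python. This is only an illustration, not part of the lecture: the function names, the value `two_n = 20`, and the number of trials are arbitrary choices made here. It draws parents uniformly with replacement and estimates how often two given gene copies share a parent, which should be close to $\frac{1}{2N}$.

```python
import random

def next_generation_parents(two_n, rng):
    """Return, for each of the 2N gene copies in the next generation,
    the index of its parent copy, drawn uniformly with replacement."""
    return [rng.randrange(two_n) for _ in range(two_n)]

def same_parent_frequency(two_n, trials, rng):
    """Estimate P(two given gene copies share a parent) by simulation."""
    hits = 0
    for _ in range(trials):
        # Only the parents of the two focal copies matter for this estimate.
        if rng.randrange(two_n) == rng.randrange(two_n):
            hits += 1
    return hits / trials

rng = random.Random(1)          # fixed seed so the sketch is repeatable
two_n = 20                      # hypothetical population of 2N = 20 gene copies
parents = next_generation_parents(two_n, rng)
estimate = same_parent_frequency(two_n, 200_000, rng)
print(len(parents), estimate, 1 / two_n)
```

The printed estimate should sit near `1 / two_n`, up to simulation noise.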


The Coalescent Model

• Discrete Time Coalescent
    • P(two genes have the same parent in the previous generation) is $\frac{1}{2N}$
    • Number of generations since two genes first shared a common ancestor $\sim \text{Geometric}\left(\frac{1}{2N}\right)$
    • Number of generations since at least two genes in a sample of $k$ shared a common ancestor $\sim \text{Geometric}\left(\frac{k(k-1)}{4N}\right)$


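Number of generations since at least two genes in a sample of $k$ shared a common ancestor $\sim \text{Geometric}\left(\frac{k(k-1)}{4N}\right)$

• Define $G_{k,k}$ to be the probability that $k$ genes have $k$ distinct ancestors in the previous generation. Then

$$
\begin{aligned}
G_{k,k} &= \left(\frac{2N-1}{2N}\right)\left(\frac{2N-2}{2N}\right)\cdots\left(\frac{2N-(k-1)}{2N}\right) \\
&= \left(1-\frac{1}{2N}\right)\left(1-\frac{2}{2N}\right)\cdots\left(1-\frac{k-1}{2N}\right) \\
&= 1 - \frac{1+2+3+\cdots+(k-1)}{2N} + O\left(\frac{1}{N^2}\right) \\
&= 1 - \frac{k(k-1)}{4N} + O\left(\frac{1}{N^2}\right)
\end{aligned}
$$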
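A quick numerical check can make the size of the dropped $O(1/N^2)$ terms concrete. The sketch below is illustrative only: the function names and the values $k = 5$ and $N = 100, 1000, 10000$ are assumptions chosen here. It compares the exact product for $G_{k,k}$ with the first-order approximation.

```python
def g_kk_exact(k, n):
    """Exact probability that k genes have k distinct parents among 2N copies."""
    two_n = 2 * n
    prob = 1.0
    for i in range(1, k):
        prob *= (two_n - i) / two_n
    return prob

def g_kk_first_order(k, n):
    """First-order approximation 1 - k(k-1)/(4N), dropping O(1/N^2) terms."""
    return 1.0 - k * (k - 1) / (4 * n)

k = 5
for n in (100, 1_000, 10_000):
    exact = g_kk_exact(k, n)
    approx = g_kk_first_order(k, n)
    # The gap should shrink roughly 100-fold for each 10-fold increase in N.
    print(n, exact, approx, abs(exact - approx))
```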


Number of generations since at least two genes in a sample of $k$ shared a common ancestor $\sim \text{Geometric}\left(\frac{k(k-1)}{4N}\right)$

• Therefore, the probability that at least two genes share a common ancestor in the previous generation is

$$
1 - G_{k,k} = \frac{k(k-1)}{4N} + O\left(\frac{1}{N^2}\right)
$$

• Since this is the same in each generation, we have that the number of generations until at least two genes in a sample of $k$ shared a common ancestor $\sim \text{Geometric}\left(\frac{k(k-1)}{4N}\right)$


Continuous-time Coalescent – Kingman’s Approximation

• Kingman (1982a, b, c) considered the case where $N$ (population size) is very large relative to $k$ (sample size).

• Then we can ignore the terms that are $O(1/N^2)$. This amounts to assuming that three or more genes coalescing in the same generation happens rarely in comparison to two genes coalescing in one generation.

• We have
    • Time since two gene copies had a common ancestor $\sim \text{Exponential}(\mu = 2N)$
    • Time to coalescence of $k$ gene copies into $k-1$ $\sim \text{Exponential}\left(\mu = \frac{4N}{k(k-1)}\right)$

  where time, $T$, is measured in number of generations and $\mu$ denotes the mean.

• This is generally a very good approximation, provided $N$ is large enough.
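To see how close the exponential approximation is to the geometric distribution it replaces, one can compare the two tail probabilities $P(T > t)$ directly. This is a minimal sketch, assuming the illustrative values $N = 10000$ and $k = 4$; the function names are introduced here and are not from the lecture.

```python
import math

def geometric_tail(p, t):
    """P(T > t) when T ~ Geometric(p) counts generations (t a whole number)."""
    return (1.0 - p) ** t

def exponential_tail(mean, t):
    """P(T > t) when T ~ Exponential with the given mean."""
    return math.exp(-t / mean)

n, k = 10_000, 4
p = k * (k - 1) / (4 * n)   # per-generation coalescence probability
mean = 1.0 / p              # matches mu = 4N / (k(k-1))
for t in (100, 1_000, 5_000):
    print(t, geometric_tail(p, t), exponential_tail(mean, t))
```

With these values the two columns should agree to several decimal places, which is the sense in which the approximation is good when $N$ is large relative to $k$.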


Continuous-time Coalescent – Kingman’s Approximation

• To generate a genealogy of $k$ genes under Kingman’s coalescent:
    • Draw an observation from an exponential distribution with mean $\mu = 4N/(k(k-1))$. This is the waiting time until the next coalescent event, looking from the present backwards in time (on the first pass, it is the time of the first coalescent event).
    • Pick two lineages at random to coalesce.
    • Decrease $k$ by 1.
    • If $k = 1$, stop. Otherwise, repeat these steps.
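The steps above can be sketched as a short simulation. This is a minimal illustration under stated assumptions, not the lecture's own code: the function name `simulate_genealogy`, the integer labelling of lineages, and the example values $k = 5$, $N = 1000$ are all choices made here. Times are in generations.

```python
import random

def simulate_genealogy(k, n, rng):
    """Simulate one genealogy of k gene copies under Kingman's coalescent.

    Returns (events, t_mrca, t_total), with times in generations.
    Each event is (time, child_a, child_b, parent), where lineages are
    integer labels: 0..k-1 are the sampled copies, k.. are ancestors.
    """
    lineages = list(range(k))
    next_label = k
    time = 0.0
    t_total = 0.0
    events = []
    while len(lineages) > 1:
        i = len(lineages)
        mean_wait = 4 * n / (i * (i - 1))
        wait = rng.expovariate(1.0 / mean_wait)  # expovariate takes a rate
        time += wait
        t_total += i * wait                      # i lineages persist for `wait`
        a, b = rng.sample(lineages, 2)           # every pair is equally likely
        lineages.remove(a)
        lineages.remove(b)
        lineages.append(next_label)
        events.append((time, a, b, next_label))
        next_label += 1
    return events, time, t_total

rng = random.Random(42)
events, t_mrca, t_total = simulate_genealogy(k=5, n=1_000, rng=rng)
for event in events:
    print(event)
```

Each printed tuple records one coalescent event; the time of the last event is the time to the most recent common ancestor of the sample.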


Example Genealogies Under Kingman’s Coalescent

[Figure: six example genealogies, each with 20 tips labeled t1 to t20.]


Properties of Genealogies

• Two measures of the size of a genealogy are commonly defined:
    • $T_{MRCA}$ = the time of the most recent common ancestor of all lineages sampled
    • $T_{total}$ = the total time represented by the genealogy (the sum of all branch lengths)

• Of interest are the mean, variance, and probability distribution of these.


Properties of Genealogies

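• Define $T_i$ to be the time in the history of the sample during which there were exactly $i$ ancestral lineages.

• Note that $T_{MRCA} = \sum_{i=2}^{k} T_i$ and $T_{total} = \sum_{i=2}^{k} i\,T_i$ (the sums start at $i = 2$ because the genealogy ends once a single lineage remains).

[Figure: example genealogy with the intervals $T_2$, $T_3$, $T_4$, and $T_5$ marked.]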


Properties of Genealogies - TMRCA

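• Note that $T_{MRCA} = \sum_{i=2}^{k} T_i$ and $T_i \sim \text{Exp}\left(\mu = \frac{4N}{i(i-1)}\right)$

• Therefore, the mean is

$$
\begin{aligned}
E(T_{MRCA}) &= \sum_{i=2}^{k} E(T_i) = \sum_{i=2}^{k} \frac{4N}{i(i-1)} \\
&= 4N \sum_{i=2}^{k} \left(\frac{1}{i-1} - \frac{1}{i}\right) = 4N\left(1 - \frac{1}{k}\right)
\end{aligned}
$$

• If time is measured in units of $2N$ generations (coalescent units), then the mean is $2\left(1 - \frac{1}{k}\right)$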


Properties of Genealogies - TMRCA

• Mean time to coalescence of all lineages is

$$
4N\left(1 - \frac{1}{k}\right)
$$

• Notes:
    • When $k$ is large, it takes $\approx 4N$ generations on average to reach the MRCA
    • When $k = 2$, it takes $2N$ generations on average to reach the MRCA
    • For a large sample, much of the time to the MRCA is spent waiting for the last coalescence to occur: $E(T_2) = 2N$, about half of $E(T_{MRCA})$.
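The notes above can be checked numerically. The following sketch is an illustration only (the function names and $N = 1000$ are assumptions made here): it sums $E(T_i)$ term by term, compares the result with the closed form $4N(1 - 1/k)$, and reports the share of the expected time to the MRCA that belongs to the final interval $T_2$.

```python
def expected_interval(i, n):
    """E(T_i) = 4N / (i(i-1)), in generations."""
    return 4 * n / (i * (i - 1))

def expected_tmrca(k, n):
    """Sum E(T_i) for i = 2..k; should equal 4N(1 - 1/k)."""
    return sum(expected_interval(i, n) for i in range(2, k + 1))

n = 1_000
for k in (2, 5, 20, 100):
    total = expected_tmrca(k, n)
    share_last = expected_interval(2, n) / total  # fraction spent with 2 lineages
    print(k, total, 4 * n * (1 - 1 / k), round(share_last, 3))
```

As $k$ grows, the share of the last interval should fall from 1 toward one half.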


Properties of Genealogies - TMRCA

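• We can also show that $\text{Var}(T_{MRCA}) = (4N)^2 \sum_{i=2}^{k} \frac{1}{i^2(i-1)^2}$

• We can show that as the sample size $k \to \infty$, $\text{Var}(T_{MRCA})$ converges to $4\pi^2/3 - 12 \approx 1.16$ (in coalescent units).

• Since $T_{MRCA}$ is the sum of $k-1$ independent exponential random variables $T_i$, we have the following distribution for $T_{MRCA}$ (with $t$ in coalescent units):

$$
f_{T_{MRCA}}(t) = \sum_{i=2}^{k} \binom{i}{2} e^{-\binom{i}{2} t} \prod_{j=2,\, j \neq i}^{k} \frac{\binom{j}{2}}{\binom{j}{2} - \binom{i}{2}}
$$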
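As a sanity check on the variance limit and the density, the sketch below evaluates both in coalescent units. It is illustrative only: the function names, the trapezoid-rule helper, the choice $k = 5$, and the integration range are assumptions made here. The density should integrate to about 1 and have mean about $2(1 - 1/k)$.

```python
import math

def var_tmrca_coalescent(k):
    """Var(T_MRCA) in coalescent units: 4 * sum_{i=2}^{k} 1 / (i^2 (i-1)^2)."""
    return 4.0 * sum(1.0 / (i * i * (i - 1) ** 2) for i in range(2, k + 1))

def density_tmrca(t, k):
    """Density of T_MRCA at t (coalescent units) from the sum-of-exponentials formula."""
    total = 0.0
    for i in range(2, k + 1):
        rate_i = math.comb(i, 2)
        weight = 1.0
        for j in range(2, k + 1):
            if j != i:
                rate_j = math.comb(j, 2)
                weight *= rate_j / (rate_j - rate_i)
        total += rate_i * math.exp(-rate_i * t) * weight
    return total

def integrate(f, upper, steps):
    """Trapezoid rule on [0, upper]."""
    h = upper / steps
    inner = sum(f(m * h) for m in range(1, steps))
    return h * (0.5 * f(0.0) + inner + 0.5 * f(upper))

k = 5
area = integrate(lambda t: density_tmrca(t, k), 40.0, 40_000)
mean = integrate(lambda t: t * density_tmrca(t, k), 40.0, 40_000)
print(area, mean, 2 * (1 - 1 / k))
print(var_tmrca_coalescent(1_000), 4 * math.pi ** 2 / 3 - 12)
```

For much larger $k$ the alternating weights in the density become numerically delicate in floating point, so this direct evaluation is best kept to small samples.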


Properties of Genealogies - Ttotal

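• Note that $T_{total} = \sum_{i=2}^{k} i\,T_i$ and $T_i \sim \text{Exp}\left(\mu = \frac{4N}{i(i-1)}\right)$

• Therefore, the mean is

$$
E(T_{total}) = \sum_{i=2}^{k} i\,E(T_i) = \sum_{i=2}^{k} i\,\frac{4N}{i(i-1)} = 4N \sum_{i=1}^{k-1} \frac{1}{i}
$$

• If time is measured in units of $2N$ generations (coalescent units), then the mean is $2\sum_{i=1}^{k-1} \frac{1}{i}$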
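To compare how the two means behave as the sample grows, the sketch below tabulates both in coalescent units. The function names and the sample sizes are illustrative assumptions. $E(T_{MRCA})$ stays below 2, while $E(T_{total})$ keeps growing with the harmonic sum.

```python
def expected_ttotal_coalescent(k):
    """E(T_total) in coalescent units: 2 * sum_{i=1}^{k-1} 1/i."""
    return 2.0 * sum(1.0 / i for i in range(1, k))

def expected_tmrca_coalescent(k):
    """E(T_MRCA) in coalescent units: 2 * (1 - 1/k)."""
    return 2.0 * (1.0 - 1.0 / k)

for k in (2, 5, 10, 20, 50, 100):
    print(k, round(expected_tmrca_coalescent(k), 3), round(expected_ttotal_coalescent(k), 3))
```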


Properties of Genealogies - Ttotal

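• We can also show that $\text{Var}(T_{total}) = (2N)^2 \left[4 \sum_{i=1}^{k-1} \frac{1}{i^2}\right]$

• Note that as the sample size $k \to \infty$, $\text{Var}(T_{total})$ converges to $2\pi^2/3 \approx 6.58$ (in coalescent units).

• Since $T_{total}$ is the sum of $k-1$ independent exponential random variables $iT_i$, we have the following distribution for $T_{total}$ (with $t$ in coalescent units):

$$
f_{T_{total}}(t) = \sum_{i=2}^{k} \frac{i-1}{2}\, e^{-\frac{i-1}{2} t} \prod_{j=2,\, j \neq i}^{k} \frac{j-1}{j-i}
$$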
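As a small worked check of the density of $T_{total}$ (an illustration added here, with $t$ in coalescent units), the cases $k = 2$ and $k = 3$ can be written out by hand. For $k = 3$ the result integrates to 1 and has mean $3 = 2\left(1 + \frac{1}{2}\right)$, which agrees with the formula for $E(T_{total})$.

```latex
% k = 2: only the i = 2 term, with an empty product.
f_{T_{total}}(t) = \tfrac{1}{2}\, e^{-t/2}

% k = 3: terms i = 2 (with j = 3) and i = 3 (with j = 2).
f_{T_{total}}(t)
  = \tfrac{1}{2}\, e^{-t/2} \cdot \frac{3-1}{3-2}
  + \tfrac{2}{2}\, e^{-t} \cdot \frac{2-1}{2-3}
  = e^{-t/2} - e^{-t}

% Checks for k = 3:
\int_0^\infty \left(e^{-t/2} - e^{-t}\right) dt = 2 - 1 = 1,
\qquad
\int_0^\infty t \left(e^{-t/2} - e^{-t}\right) dt = 4 - 1 = 3
```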


Properties of Genealogies - TMRCA and Ttotal

[Figure: two panels showing the densities $f_{T_{MRCA}}(t)$ (for $t$ from 0 to 4) and $f_{T_{total}}(t)$ (for $t$ from 0 to 14), each plotted for k = 2, 5, 10, 20, 50, 100.]


Properties of Genealogies

• We need one more quantity to be able to link our population genetics model to our phylogenetic model: the probability that a specified number of coalescent events have occurred in a fixed amount of time, $t$.

• The probability that $u$ lineages coalesce into $v$ lineages in time $t$ is given by (Tavaré, 1984; Watterson, 1984; Takahata and Nei, 1985; Rosenberg, 2002)

$$
P_{uv}(t) = \sum_{j=v}^{u} e^{-j(j-1)t/2}\, \frac{(2j-1)(-1)^{j-v}}{v!\,(j-v)!\,(v+j-1)} \prod_{y=0}^{j-1} \frac{(v+y)(u-y)}{u+y}
$$
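The formula can be transcribed directly into code. The sketch below is a plain, unoptimized transcription under one assumption worth stating loudly: $t$ is taken to be in coalescent units (units of $2N$ generations), which is what the factor $e^{-j(j-1)t/2}$ implies given the rates used earlier. The function name `p_uv` and the test values are choices made here.

```python
import math

def p_uv(u, v, t):
    """Probability that u lineages coalesce into v lineages in time t.

    Assumes t is in coalescent units (1 unit = 2N generations).
    """
    total = 0.0
    for j in range(v, u + 1):
        prod = 1.0
        for y in range(j):                      # y = 0, ..., j-1
            prod *= (v + y) * (u - y) / (u + y)
        coef = (2 * j - 1) * (-1) ** (j - v) / (
            math.factorial(v) * math.factorial(j - v) * (v + j - 1)
        )
        total += math.exp(-j * (j - 1) * t / 2) * coef * prod
    return total

t = 0.7
row = [p_uv(5, v, t) for v in range(1, 6)]
print(row, sum(row))                             # the row should sum to about 1
print(p_uv(2, 1, t), 1 - math.exp(-t))           # two-lineage special case
```

For large $u$ the alternating sum can lose precision in floating point, so this direct form is best suited to small numbers of lineages.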


Properties of Genealogies

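• When $u$ and $v$ are small, these are easy to compute. For example, with $t$ measured in generations,

$$
\begin{aligned}
P_{21}(t) &= \text{probability that 2 lineages coalesce to 1 lineage in time } t \\
&= \text{probability of 1 coalescent event in time } t \text{ when } k = 2 \\
&= P(T \le t), \quad \text{where } T \sim \text{Exp}\left(\mu = \frac{4N}{2(2-1)}\right) \\
&= \int_0^t \frac{1}{2N}\, e^{-x/(2N)}\, dx = 1 - e^{-t/(2N)}
\end{aligned}
$$

• Similarly,

$$
\begin{aligned}
P_{22}(t) &= \text{probability of no coalescence in time } t \text{ when } k = 2 \\
&= P(T > t) \\
&= \int_t^\infty \frac{1}{2N}\, e^{-x/(2N)}\, dx = e^{-t/(2N)}
\end{aligned}
$$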
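These two results can be cross-checked against the general formula for $P_{uv}(t)$. The worked substitution below is an added illustration and assumes that $t$ in the general formula is in coalescent units (units of $2N$ generations), so that replacing $t$ by $t/(2N)$ recovers the expressions in generations.

```latex
% Assumption: t here is in coalescent units (1 unit = 2N generations).
\begin{aligned}
P_{22}(t) &= e^{-t}\,\frac{(3)(-1)^{0}}{2!\,0!\,3}\cdot\frac{(2)(2)}{2}\cdot\frac{(3)(1)}{3} = e^{-t}, \\
P_{21}(t) &= \underbrace{1}_{j=1}
  + \underbrace{e^{-t}\,\frac{(3)(-1)^{1}}{1!\,1!\,2}\cdot\frac{(1)(2)}{2}\cdot\frac{(2)(1)}{3}}_{j=2}
  = 1 - e^{-t}.
\end{aligned}
```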


Gene Tree Topology Distributions · Applications of the Gene Tree Topology Distribution · Gene Tree Branch Length Distributions

The Coalescent Model Along a Species Tree

• So far, we’ve considered the coalescent process within a single population.

• A phylogenetic tree consists of many populations followed throughout evolutionary time.


The Coalescent Model Along a Species Tree

I Goal is to apply the coalescent model across the phylogeny. The basic assumption is that events that occur in one population are independent of what happens in other populations within the phylogeny.

I More specifically, given the number of lineages entering and leaving a population, coalescent events within populations are independent of one another.

I It is also important to recall an assumption we “inherit” from our population genetics model: all pairs of lineages are equally likely to coalesce within a population.


I When talking about gene tree distributions, there are two cases of interest:

I The gene tree topology distribution

I The joint distribution of topologies and branch lengths

I Start with the simple case of 3 species with 1 lineage sampled in each and look at the gene tree topology distribution


Example: Computation of Gene Tree Topology Probabilities for the 3-taxon Case

Example of gene tree probability computation (for simplicity, let’s use coalescent units for our time scale):

(a) Prob = $1 - e^{-t}$; (b), (c), (d) Prob = $\frac{1}{3}e^{-t}$


I Thus, we have the following probabilities:

I Gene tree (A,(B,C)): prob = $1 - e^{-t} + \frac{1}{3}e^{-t} = 1 - \frac{2}{3}e^{-t}$

I Gene tree (B,(A,C)): prob = $\frac{1}{3}e^{-t}$

I Gene tree (C,(A,B)): prob = $\frac{1}{3}e^{-t}$

I Note: There are two ways to get the first gene tree. We call these histories.

I The probability associated with a gene tree topology will be the sum over all histories that have that topology.
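The three probabilities above can be tabulated as a function of t. This is a minimal sketch, assuming the species tree is (A,(B,C)) with internal branch length t in coalescent units; the function name is hypothetical.

```python
import math


def three_taxon_gene_tree_probs(t: float) -> dict:
    """Gene tree topology probabilities for species tree (A,(B,C)).

    t is the internal branch length in coalescent units (assumed).
    """
    no_coal = math.exp(-t)  # B and C fail to coalesce along the internal branch
    return {
        # Two histories: coalescence in the branch, or above the root (1 of 3 pairs)
        "(A,(B,C))": (1.0 - no_coal) + no_coal / 3.0,
        "(B,(A,C))": no_coal / 3.0,
        "(C,(A,B))": no_coal / 3.0,
    }
```

At t = 0 the three topologies are equally likely, and as t grows the matching topology takes almost all of the probability.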


I What are these probabilities like as a function of t, the length of time between speciation events?

[Figure: the four coalescent histories, one with prob = $1 - e^{-t}$ and three with prob = $\frac{1}{3}e^{-t}$ each, next to a plot of Topology Probability (0 to 1) against t (Coalescent Units, 0 to 2).]


Example: A Slightly Larger Case

I Consider 4 taxa – the human-chimp-gorilla problem


Coalescent Histories for the 4-taxon Example

I There are 5 possibilities for this example:


Computing the Topology Distribution by Enumerating Histories

I In the general case, the probability of gene tree g given species tree S is given by

\[
P\{G = g \mid S\} = \sum_{\text{histories}} P\{G = g, \text{history} \mid S\} = \sum_{\text{histories}} \prod_{b} w_b \, P_{u(b),v(b)}(t_b)
\]

Degnan and Salter, Evolution, 2005

I The pieces of this formula:

I $\sum_{\text{histories}}$: number of terms only known in special cases (Rosenberg, 2007)

I $\prod_b$: multiply probabilities associated with a history over internal branches (once the number of lineages entering and leaving a branch is known, which is what is given by the histories, coalescence happens independently along branches)

I $w_b$: probability of getting a sequence of coalescent events that is consistent with g

I $P_{u(b),v(b)}(t_b)$: probability that u lineages coalesce into v in time $t_b$

I $t_b$: length of branch b
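To make the sum over histories concrete, here is a sketch for four taxa. It assumes the species tree (A,(B,(C,D))), with y the length of the branch ancestral to (C,D) and x the length of the branch ancestral to (B,(C,D)), both in coalescent units. That labeling, the function names, and the use of closed forms for P_uv with u at most 3 are choices made for this illustration. Each list entry is one history, written as a product of w_b and P_uv factors over the two internal branches and the root.

```python
import math


def p(u: int, v: int, t: float) -> float:
    """Closed forms of P_uv(t) for u <= 3 (t in coalescent units)."""
    table = {
        (2, 1): 1 - math.exp(-t),
        (2, 2): math.exp(-t),
        (3, 1): 1 - 1.5 * math.exp(-t) + 0.5 * math.exp(-3 * t),
        (3, 2): 1.5 * (math.exp(-t) - math.exp(-3 * t)),
        (3, 3): math.exp(-3 * t),
    }
    return table[(u, v)]


def prob_matching(x: float, y: float) -> float:
    """P(gene tree (A,(B,(C,D)))) under species tree (A,(B,(C,D)))."""
    histories = [
        # CD coalesces in branch y, then B joins CD in branch x
        p(2, 1, y) * p(2, 1, x),
        # CD in branch y; B joins CD above the root (1 of 3 pairs)
        p(2, 1, y) * p(2, 2, x) * (1 / 3),
        # CD, then B with CD, both in branch x (CD is 1 of 3 first pairs)
        p(2, 2, y) * p(3, 1, x) * (1 / 3),
        # CD in branch x (1 of 3 pairs); B joins CD above the root (1 of 3)
        p(2, 2, y) * p(3, 2, x) * (1 / 3) * (1 / 3),
        # Both events above the root: CD first (1/6), then B with CD (1/3)
        p(2, 2, y) * p(3, 3, x) * (1 / 6) * (1 / 3),
    ]
    return sum(histories)


def prob_symmetric(x: float, y: float) -> float:
    """P(gene tree ((A,B),(C,D))) under the same species tree."""
    histories = [
        # CD in branch y; A with B is the first event above the root (1/3)
        p(2, 1, y) * p(2, 2, x) * (1 / 3),
        # CD in branch x (1/3); A with B first above the root (1/3)
        p(2, 2, y) * p(3, 2, x) * (1 / 3) * (1 / 3),
        # Both above the root: AB then CD, or CD then AB (2 * 1/6 * 1/3)
        p(2, 2, y) * p(3, 3, x) * (1 / 9),
    ]
    return sum(histories)
```

The matching gene tree has five histories and the symmetric one has three. Comparing `prob_matching` with `prob_symmetric` for short branches shows how a non-matching topology can end up with the larger probability.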


Excerpt from J. H. Degnan and L. A. Salter, p. 34:

TABLE 2. The minimum number of gene trees needed to capture 90% of the gene tree distribution as a function of the type of symmetry of the species tree (a, maximally asymmetric; s, maximally symmetric), the number of taxa (n), and branch lengths. In the first three branch length columns, all branches have the indicated length. The fourth and fifth columns have all branches with length 1.0 except the indicated branch. Note that the minimum number of gene trees listed grows more slowly than the number of tree topologies based on the number of taxa (see Table 3).

                    Branch lengths
Symmetry   n     1.0    0.5    0.2    $\lambda_1 = 0.01$   $\lambda_{n-2} = 0.01$
a          4     4      7      10     7                    9
a          5     13     27     58     19                   21
a          6     33     118    345    51                   61
a          7     96     512    2239   140                  155
s          4     4      10     12     10                   10
s          5     15     35     62     21                   26
s          6     38     144    441    63                   87
s          7     140    869    3452   207                  363

TABLE 3. The number of valid coalescent histories when the gene tree and species tree have the same topology. The number of histories is also the number of terms in the outer sum in equation (12).

         Number of histories
Taxa     Asymmetric trees    Symmetric trees    Number of topologies
4        5                   4                  15
5        14                  10                 105
6        42                  25                 945
7        132                 65                 10,395
8        429                 169                135,135
9        1430                481                2,027,025
10       4862                1369               34,459,425
12       58,786              11,236             13,749,310,575
16       9,694,845           1,020,100          $6.190 \times 10^{15}$
20       1,767,263,190       100,360,324        $8.201 \times 10^{21}$
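Two columns of Table 3 can be reproduced with short formulas. The number of rooted binary labeled topologies on n taxa is the double factorial (2n - 3)!!, and the asymmetric-tree entries listed in the table coincide with the Catalan numbers C(n-1). The sketch below uses function names chosen here for illustration; no closed form is offered for the symmetric-tree column.

```python
import math


def num_rooted_topologies(n: int) -> int:
    """Number of rooted binary labeled tree topologies on n taxa: (2n - 3)!!."""
    result = 1
    for k in range(3, 2 * n - 2, 2):  # 3 * 5 * ... * (2n - 3)
        result *= k
    return result


def catalan(m: int) -> int:
    """m-th Catalan number, C(2m, m) / (m + 1)."""
    return math.comb(2 * m, m) // (m + 1)


def histories_asymmetric(n: int) -> int:
    """Asymmetric-tree column of Table 3, reproduced as catalan(n - 1)."""
    return catalan(n - 1)
```

The number of histories grows far more slowly than the number of topologies, though both quickly become large.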

FIG. 7. The exact probability of topological equivalence between species and gene trees as a function of branch lengths and number of taxa. Probabilities were computed for branch lengths between 0.01 and 5.00 in increments of 0.01. Only asymmetric trees were used for this example. Symmetric trees show a very similar pattern (results not shown).

APPLICATIONS

Probability of Topological Equivalence of Gene Trees and Species Trees

Because the complete distribution of gene trees for a given species tree is available, the probability that the gene tree has the same topology as the species tree can be computed directly. Figure 7 shows the probability that the gene tree is topologically equivalent to the species tree when branch lengths vary continuously from 0.01 to 5.00 (assuming all branches have the same length) for different numbers of taxa. This figure can also be used to determine the branch lengths that would be necessary to have any desired probability that the gene tree and species tree are topologically equivalent. Note that even for moderately long branches, the probability of topological equivalence quickly decreases with the number of taxa.

Pamilo and Nei (1988) give a conservative upper bound for this probability,

\[
P_A = \prod_{i=1}^{n-2} \left(1 - \frac{2}{3} e^{-\lambda_i}\right). \tag{14}
\]

From equation (12), the probability of any three-taxon gene tree matching its species tree is $1 - \frac{2}{3}e^{-\lambda}$, and the bound is based on decomposing an n-taxon species tree into n − 2 three-taxon trees, one for each internal branch, and treating these trees as independent. Here each three-taxon tree consists of an internal branch, its two descendent branches, and its sister branch. For example, in the seven-taxon tree example, the three-taxon tree corresponding to branch 5 has the branches 2, 3, and 4, and could be represented as (2,(3,4)).
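The bound in equation (14) is a product with one factor per internal branch, so it is simple to compute. The following is a minimal sketch with a hypothetical function name, taking the n − 2 internal branch lengths in coalescent units.

```python
import math
from typing import Sequence


def pamilo_nei_bound(branch_lengths: Sequence[float]) -> float:
    """Upper bound P_A = prod_i (1 - (2/3) exp(-lambda_i)).

    branch_lengths holds the n - 2 internal branch lengths (coalescent units).
    """
    bound = 1.0
    for lam in branch_lengths:
        bound *= 1.0 - (2.0 / 3.0) * math.exp(-lam)
    return bound
```

With a single internal branch the bound reduces to the three-taxon matching probability. With more branches it can be compared with an exact value obtained by enumerating histories.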

The closeness of this bound to the exact probability can be evaluated for different tree shapes and sizes as well as branch lengths using equation (12). Because the assumption of independence is more nearly met, as Pamilo and Nei (1988) note, when the branch lengths are larger, the bound is tighter for trees with longer branches. The bound is also tighter for trees that are more nearly symmetric (Fig. 8), because for asymmetric trees lineages are more constrained in their order of coalescence and are therefore less independent. Although the bound is fairly close when the branch lengths are moderately large, as the number of taxa increases and branch lengths are held constant, the ratio of the bound to the exact probability increases (Fig. 8). This indicates that the bound is not asymptotically approaching the exact probability.

Notice that $P_{\psi,\lambda}(G = \psi)$ and $P_A$ only refer to the probability that a random gene tree has the same topology as the fixed species tree. For a given observed gene tree, the coalescent model does not provide a method for determining the probability that the species tree has the same topology as the gene tree. Because the coalescent model treats the species tree as a parameter, one could adopt a Bayesian point of view to assign probabilities to species trees given gene trees. This would require assigning a prior distribution on the space of species trees, where the space would include branch lengths as well as topologies.

Degnan and Salter, Evolution, 2005


Applications of the Topology Distribution - Example 1

I Motivation: Paper by Ebersberger et al. 2007. Mol. Biol. Evol. 24:2266-2276

I Examined 23,210 distinct alignments for 5 primate taxa: Human, Chimp, Gorilla, Orangutan, Rhesus

I Looked at distribution of gene trees among these taxa and observed strongly supported incongruence only among the Human-Chimp-Gorilla clade.


Observed proportions of each gene tree among ML phylogenies: 76.6%, 11.4%, 11.5%

Predicted proportions using parameters from Rannala & Yang, 2003: 79.1%, 9.9%, 9.9%
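As a rough illustration of how such proportions relate to branch length, the three-taxon formula $1 - \frac{2}{3}e^{-t}$ can be inverted to give the internal branch length implied by a given proportion of matching gene trees. This sketch treats Human-Chimp-Gorilla as a three-taxon problem and ignores gene tree estimation error and the remaining topologies. It is not how the predicted values above were obtained, and the function names are hypothetical.

```python
import math


def internal_branch_from_concordance(p_match: float) -> float:
    """Solve p_match = 1 - (2/3) exp(-t) for t (coalescent units)."""
    if not 1 / 3 < p_match < 1:
        raise ValueError("p_match must lie strictly between 1/3 and 1")
    return -math.log(1.5 * (1.0 - p_match))


def three_taxon_proportions(t: float) -> tuple:
    """(matching, discordant 1, discordant 2) proportions for branch length t."""
    minor = math.exp(-t) / 3.0
    return (1.0 - 2.0 * minor, minor, minor)


# Branch length implied by the observed 76.6% of matching gene trees
t_hat = internal_branch_from_concordance(0.766)
```

Under this simplified model the two discordant topologies are predicted to be equally frequent, which agrees qualitatively with the nearly equal minor proportions reported above.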


Applications of the Topology Distribution - Example 2

I In the previous example, one topology is clearly preferred

I Must the distribution always look this way?

I Examine entire distribution when the number of taxa is small


I Consider 4 taxa: A, B, C, and D

I Species tree:

[Figure, panel A: a species phylogeny on A, B, C, D with branch lengths x, y, and z. Panel B: gene trees, labeled by leaf order: Matching Tree (MT), A B C D; Swapped Tree (ST), B A C D; Symmetric Tree 1 (S1), A B C D; Symmetric Tree 2 (S2), A C B D; Symmetric Tree 3 (S3), A D B C.]

I Look at probabilities of all 15 tree topologies for values of x, y, and z
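The distribution over all 15 topologies can also be approximated by simulation. The sketch below assumes the species tree (A,(B,(C,D))), with y the length of the branch above (C,D) and x the length of the branch above (B,(C,D)), both in coalescent units. That reading of the figure labels is an assumption, and all function names are hypothetical. A topology is stored as nested frozensets so that the left/right order of children does not matter.

```python
import math
import random
from collections import Counter


def coalesce_within(lineages, duration, rng):
    """Run the coalescent on `lineages` for `duration` coalescent units."""
    lineages = list(lineages)
    elapsed = 0.0
    while len(lineages) > 1:
        k = len(lineages)
        elapsed += rng.expovariate(k * (k - 1) / 2)  # rate = number of pairs
        if elapsed > duration:
            break  # the remaining lineages leave the population uncoalesced
        i, j = rng.sample(range(k), 2)  # all pairs equally likely
        merged = frozenset((lineages[i], lineages[j]))
        lineages = [l for idx, l in enumerate(lineages) if idx not in (i, j)]
        lineages.append(merged)
    return lineages


def simulate_gene_tree(x, y, rng):
    """One gene tree topology under species tree (A,(B,(C,D)))."""
    cd = coalesce_within(["C", "D"], y, rng)             # branch above (C,D)
    bcd = coalesce_within(["B"] + cd, x, rng)            # branch above (B,(C,D))
    root = coalesce_within(["A"] + bcd, math.inf, rng)   # root population
    return root[0]


def topology_frequencies(x, y, n_trees, seed=0):
    """Relative frequency of each simulated gene tree topology."""
    rng = random.Random(seed)
    counts = Counter(simulate_gene_tree(x, y, rng) for _ in range(n_trees))
    return {topo: c / n_trees for topo, c in counts.items()}
```

For example, `topology_frequencies(0.05, 0.05, 50000)` returns estimated frequencies keyed by topology. With both branches this short, the symmetric tree ((A,B),(C,D)) should come out more frequent than the matching tree.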


[Figure: three bar charts of gene tree topology probability (vertical axis 0 to 1) for the parameter settings y = 1, x = 1; y = 1, x = 0.001; and y = 0.01, x = 0.01, shown with a tree on A B C D. Each chart has one bar per topology, in the order (1,(2,(3,4))), ((1,2),(3,4)), (2,(1,(3,4))), ((1,3),(2,4)), ((1,4),(2,3)), (1,(3,(2,4))), (1,(4,(2,3))), (2,(3,(1,4))), (2,(4,(1,3))), (3,(4,(1,2))), (3,(1,(2,4))), (3,(2,(1,4))), (4,(1,(2,3))), (4,(2,(1,3))), (4,(3,(1,2))).]


[Figure: the three bar charts of gene tree topology probabilities for y = 1, x = 1; y = 1, x = 0.001; and y = 0.01, x = 0.01, shown with a tree on B A C D.]


Degnan and Rosenberg, PLoS Genetics, 2006

Rosenberg and Tao, Systematic Biology, 2008

I The existence of anomalous gene trees (gene tree topologies that are more probable than the topology matching the species tree) has implications for the inference of species trees


Applications of the Topology Distribution - Example 3

I What about mutation? How does this affect data analysis?

I The coalescent gives a model for determining gene tree probabilities for each gene.

I View DNA sequence data as the result of a two-stage process:

I Coalescent process generates a gene tree topology.

I Given this gene tree topology, DNA sequences evolve along the tree.


I Given this model, how should inference be carried out?

I Hypothesis: As more data (genes) are added, the process of estimating species trees from concatenated data can be statistically inconsistent

I May fail to converge to any single tree topology if there are many equally likely trees.

I May converge to the wrong tree when a gene tree that is topologically incongruent with the species tree has the highest probability.


[Figure: flowchart of the simulation design, shown for x = 0.2, y = 1.0. Steps shown for each of several replicates: generate 100 gene trees in COAL; generate sequence data using Seq-Gen; estimate gene tree using PAUP*; estimate species tree.]


Simulation Study 1

[Figure: eight panels (A to H), each plotting Relative Frequency (0 to 1) against Number of Genes (0 to 6000). The branch lengths and gene tree probabilities p given in each panel are:]

Panel   x        y        MT       S1       ST       S2       S3
A       0.01     2.0      0.3017   0.3004   0.2931
B       0.05     1.0      0.2548   0.2412   0.2236
C       0.1      1.0      0.2776   0.2310   0.2159
D       0.1568   0.1568   0.1334   0.1335   0.1038
E       0.01     1.0      0.2360   0.2495   0.2296
F       0.05     0.05     0.0788   0.1208   0.0753   0.1053   0.1053
G       0.1      0.05     0.0887   0.1190   0.0799   0.1043   0.1043
H       0.25     0.01     0.1038   0.1051   0.0791


Simulation Study 2

[Figure: Relative Frequency of Inferring MT (0 to 1) plotted against Branch Length (x) (0 to 1), with curves for 100 genes, 50 genes, 20 genes, and 10 genes, plus a reference curve labeled in terms of P(MT) and P(S1).]


I Performance of the Concatenation Approach:

I Can be statistically inconsistent when branch lengths in the species phylogeny are sufficiently small

I May perform poorly even when branch lengths are only moderately short

I What should we do? Need to design inference methods that incorporate the coalescent process. (Dennis’s lecture next week)


Joint Density of Gene Tree Topology and Branch Lengths – An Example

Rannala and Yang 2003

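Factors of the joint density, as placed on the slide's species tree figure:

$2\exp\{-2\cdot\tfrac{3(2)}{2}\,t_{ABCE}\}\,\exp\{-2(t_{ABCE} - t_{ABCDE})\}$

$2\exp\{-2\cdot\tfrac{3(2)}{2}\,t_{ABC}\}\,\exp\{-2(\tau_1 - t_{ABC})\}$

$2\exp\{-2\cdot\tfrac{3(2)}{2}\,t_{AB}\}\,\exp\{-2(\tau_2 - t_{AB})\}$

$\exp\{-2\tau_3\}$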
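The internal-branch factors above follow one pattern: each coalescent event contributes a density term, and the lineages that survive to the top of the branch contribute a "no further coalescence" term. The sketch below writes that pattern as a function. It is an illustration only: the name `population_factor`, its arguments, and the choice theta = 1 (made so the rates agree with the factors on the slide) are assumptions, and the waiting times are taken to be measured within each population.

```python
import math

def population_factor(n_in, waits, tau, theta=1.0):
    """Density contribution of one population (branch) of the species tree.

    n_in  : number of gene lineages entering the branch at its recent end
    waits : waiting times between successive coalescent events in the branch
    tau   : branch length (use math.inf for the root population)
    theta : population size parameter; theta = 1 is an assumption chosen
            so that the rates agree with the factors shown on the slide
    """
    density = 1.0
    lineages = n_in
    for t in waits:
        # With j lineages the total coalescence rate is j(j-1)/theta,
        # and the density of a specific pair coalescing carries 2/theta.
        density *= (2.0 / theta) * math.exp(-lineages * (lineages - 1) / theta * t)
        lineages -= 1
    if math.isfinite(tau):
        # Remaining lineages must not coalesce before the branch ends.
        remaining = tau - sum(waits)
        density *= math.exp(-lineages * (lineages - 1) / theta * remaining)
    return density

# Three lineages enter, one coalescence after waiting t, branch length tau:
t, tau = 0.3, 1.0
print(population_factor(3, [t], tau))   # equals 2*exp(-6t)*exp(-2(tau - t))
# Two lineages enter and neither coalesces in a branch of length 0.7:
print(population_factor(2, [], 0.7))    # equals exp(-2 * 0.7)
```

Multiplying such factors over all populations gives the joint density of the gene tree topology and coalescent times for a given species tree.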

Gene Tree Branch Length Distributions

- We now have the following distributions:
  - $p(g \mid S)$
  - $f(g, t \mid S)$

- We can thus, in theory, get the distributions of gene tree branch lengths by simply manipulating these quantities:

$f(t \mid g, S) = \dfrac{f(g, t \mid S)}{p(g \mid S)}$

- Integrating out branches which aren’t of interest gives joint or marginal distributions

- Can even examine correlations between branch lengths
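As a rough illustration of this ratio (a sketch, not taken from the lecture), consider the three-taxon species tree ((A,B):τ, C) with one lineage sampled per species. Assume time is in coalescent units, so that each pair of lineages coalesces at rate 1, and let t be the A-B coalescence time measured back from the A-B speciation event, with the root coalescence time already integrated out. For the matching gene tree g = ((A,B),C):

```latex
\begin{align*}
f(g, t \mid S) &=
  \begin{cases}
    e^{-t}, & 0 \le t < \tau \quad \text{(A and B coalesce within their ancestral branch)} \\
    e^{-\tau}\, e^{-3(t-\tau)}, & t \ge \tau \quad \text{(A and B coalesce first in the root population)}
  \end{cases} \\
p(g \mid S) &= \int_0^\infty f(g, t \mid S)\, dt
             = (1 - e^{-\tau}) + \tfrac{1}{3} e^{-\tau}
             = 1 - \tfrac{2}{3} e^{-\tau} \\
f(t \mid g, S) &= \frac{f(g, t \mid S)}{1 - \tfrac{2}{3} e^{-\tau}}
\end{align*}
```

In the second case, the factor $e^{-\tau}$ is the probability of no coalescence in the branch, $3e^{-3(t-\tau)}$ is the density of the first coalescence among three lineages, and $1/3$ is the probability that this first pair is (A,B).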

Gene Tree Branch Length Distributions

- Complication: Region of integration will change for each history within a given gene tree

- Branch length densities are then a mixture over histories

- For the case of four taxa, James Degnan and I have worked out all joint and marginal distributions

- Simulate data and compare theoretical distribution to observed distribution

- Correlations are also well-approximated by simulation
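The "mixture over histories" idea and the simulate-and-compare check can be sketched on a case smaller than the four-taxon one discussed above. The example assumes a three-taxon species tree ((A,B):tau, C), one lineage per species, and time in coalescent units (pair coalescence rate 1). All function names are illustrative and not from the lecture. Conditional on the gene tree ((A,B),C), the A-B coalescence time is a mixture of two histories: coalescence within the ancestral branch, or coalescence in the root population.

```python
import math
import random

def p_topology(tau):
    """P(gene tree = ((A,B),C)) for the species tree ((A,B):tau, C)."""
    return 1.0 - (2.0 / 3.0) * math.exp(-tau)

def history_weights(tau):
    """Mixture weights of the two histories, given topology ((A,B),C)."""
    in_branch = 1.0 - math.exp(-tau)   # A,B coalesce below the root
    in_root = math.exp(-tau) / 3.0     # A,B are the first pair to join in the root
    total = in_branch + in_root
    return in_branch / total, in_root / total

def conditional_mean(tau):
    """E[t_AB | ((A,B),C)], with t_AB measured back from the A-B speciation."""
    part_branch = 1.0 - (1.0 + tau) * math.exp(-tau)      # integral of t e^{-t} over [0, tau]
    part_root = math.exp(-tau) * (tau / 3.0 + 1.0 / 9.0)  # integral of t e^{-tau} e^{-3(t-tau)} over [tau, inf)
    return (part_branch + part_root) / p_topology(tau)

def simulate(tau, n_reps, seed=1):
    """Return simulated A-B coalescence times for gene trees matching ((A,B),C)."""
    rng = random.Random(seed)
    times = []
    for _ in range(n_reps):
        t = rng.expovariate(1.0)
        if t < tau:                      # history 1: coalescence within the branch
            times.append(t)
            continue
        wait = rng.expovariate(3.0)      # three lineages in the root: total rate 3
        if rng.randrange(3) == 0:        # one of three equally likely pairs is (A,B)
            times.append(tau + wait)     # history 2: coalescence above the root
    return times

tau, n_reps = 1.0, 200_000
times = simulate(tau, n_reps)
print("P(topology): simulated", len(times) / n_reps, "theory", p_topology(tau))
print("E[t_AB | g]: simulated", sum(times) / len(times), "theory", conditional_mean(tau))
print("history weights:", history_weights(tau))
```

With more taxa the same comparison applies, but the number of histories per gene tree, and hence the number of mixture components, grows.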

Applications of Branch Length Distributions - Example 1

- Simulate 1,000,000 gene trees from species tree ((A:1.0,B:1.0):1.0,(C:1.5,D:1.5):0.5);

- Of these, 449,599 had the same topology as the species tree

- Compare observed distribution of branch length connecting (A,B) to root node to true distribution

[Figure: Density of $T_3 - T_1$ given gene tree ((AB)(CD)). x-axis: Number of generations (0e+00 to 1e+05); y-axis: Density (0e+00 to 5e−05).]

- Good fit between observed and true distributions
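The count of matching topologies can be checked against a closed-form probability. The sketch below assumes that the branch lengths in the Newick string are in coalescent units (pair coalescence rate 1), that one lineage is sampled per species, and that "same topology" means the rooted topology ((A,B),(C,D)). The function name is illustrative.

```python
import math

def prob_matches_species_tree(tau_ab, tau_cd):
    """P(gene tree topology = ((A,B),(C,D))) for species tree ((A,B),(C,D)).

    tau_ab, tau_cd: lengths of the internal branches ancestral to (A,B)
    and (C,D), assumed to be in coalescent units.
    """
    a = 1.0 - math.exp(-tau_ab)   # A,B coalesce within their ancestral branch
    b = 1.0 - math.exp(-tau_cd)   # C,D coalesce within their ancestral branch
    both = a * b                          # two lineages reach the root: always a match
    only_ab = a * (1.0 - b) / 3.0         # three at the root: C,D must join first (1 of 3 pairs)
    only_cd = (1.0 - a) * b / 3.0         # three at the root: A,B must join first
    neither = (1.0 - a) * (1.0 - b) / 9.0 # four at the root: (2/6) * (1/3)
    return both + only_ab + only_cd + neither

p = prob_matches_species_tree(1.0, 0.5)
observed = 449_599 / 1_000_000
print(p, observed)
```

Under these assumptions the probability is approximately 0.4496, in line with the simulated proportion 449,599 / 1,000,000.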

Applications of Branch Length Distributions - Example 2

- Estimation of speciation times using information in gene trees is often desirable

- Under the coalescent model (with no gene flow following speciation), it must be the case that gene divergence times pre-date speciation times

Figure 1: Example phylogeny showing both the species tree (bold lines) and the gene tree (thin lines). Speciation events are marked by dotted horizontal lines, and species tree branch lengths are the time intervals between these lines (the $\tau_i$). Coalescent events in the gene tree are marked with closed circles. In this paper, we derive the distribution of $T$, the time interval between the last coalescent event (reading the tree from bottom to top) and the first speciation event (reading the tree from top to bottom). Note that three gene lineages exist prior to the first speciation event in this example.

[…] estimated using DNA sequence data for individual genes (Tavaré et al., 1997; Fu, 1997; Tang et al., 2002). It is therefore of interest to understand the relationship between speciation times and gene divergence times (Rosenberg and Feldman, 2002).

In this paper, we consider the distribution of the time from the speciation event at the root of the species tree, $T_D$, to the gene coalescent time for the most recent common ancestor (MRCA) of all genes in the sample, denoted by $T_{MRCA}$. We define $T = T_{MRCA} - T_D$, where time is measured from the present to the past, and refer to this difference as the SG (species-gene) coalescent time (see Figure 1). We extend previous work concerning the distribution of the SG coalescent time (Takahata and Nei, 1985; Rosenberg and Feldman, 2002) to include population trees with an arbitrary number of tips, and we examine the effects of species tree shape, species tree depth, and speciation […]

- What is the distribution of this difference?

- How does it depend on species tree shape (e.g., symmetry) and species tree branch lengths?

Applications of Branch Length Distributions - Example 2

- When the distribution of the time to the MRCA of a sample of lineages is of interest, we can simplify the computation

- Let $T$ be the difference between the speciation time and the time of the MRCA of all lineages. Note that

$f_{T \mid S}(t) = \sum_{n=2}^{k} \Pr(L = n \mid S)\, P_{n1}(t)$

where $L$ is the random number of lineages available to coalesce above the root of the species tree

- $\Pr(L = n \mid S)$ can be computed recursively in a peeling-type algorithm

Efromovich and Kubatko, SAGMB, 2008
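The mixture formula can be sketched in code for one simple case. Everything below is an illustration under stated assumptions rather than the algorithm of the cited paper: the species tree is taken to be a caterpillar (fully asymmetric) tree with one lineage per species, time is in coalescent units (pair coalescence rate 1), and the probability that i lineages have j ancestors after time T is taken from the standard closed-form expression for the coalescent (Tavaré, 1984). The names `g`, `lineages_at_root`, `density_to_mrca`, `sg_density`, and `sg_mean` are hypothetical.

```python
import math

def rising(a, k):
    """Rising factorial a(a+1)...(a+k-1)."""
    out = 1
    for m in range(k):
        out *= a + m
    return out

def falling(a, k):
    """Falling factorial a(a-1)...(a-k+1)."""
    out = 1
    for m in range(k):
        out *= a - m
    return out

def g(i, j, T):
    """P(i lineages have j ancestors after time T), pair coalescence rate 1."""
    total = 0.0
    for k in range(j, i + 1):
        coef = (2 * k - 1) * (-1) ** (k - j) * rising(j, k - 1) * falling(i, k)
        coef /= math.factorial(j) * math.factorial(k - j) * rising(i, k)
        total += coef * math.exp(-k * (k - 1) * T / 2.0)
    return total

def lineages_at_root(taus):
    """Pr(L = n | S) for a caterpillar tree.

    taus: internal branch lengths ordered from the tips toward the root.
    One new species lineage joins at the bottom of each internal branch,
    and a final one joins at the root.
    """
    dist = {1: 1.0}                       # a single lineage before the first join
    for tau in taus:
        new = {}
        for n, p in dist.items():
            for m in range(1, n + 2):     # n + 1 lineages enter, m leave
                new[m] = new.get(m, 0.0) + p * g(n + 1, m, tau)
        dist = new
    return {n + 1: p for n, p in dist.items()}

def density_to_mrca(n, t):
    """P_{n1}(t): density of the time for n lineages to reach one ancestor."""
    total = 0.0
    for k in range(2, n + 1):
        coef = (2 * k - 1) * (-1) ** k * falling(n, k) / rising(n, k)
        total += coef * (k * (k - 1) / 2.0) * math.exp(-k * (k - 1) * t / 2.0)
    return total

def sg_density(t, taus):
    """Mixture density f_{T|S}(t) = sum_n Pr(L = n | S) P_{n1}(t)."""
    return sum(p * density_to_mrca(n, t) for n, p in lineages_at_root(taus).items())

def sg_mean(taus):
    """Mean of T, using E[T | L = n] = 2(1 - 1/n)."""
    return sum(p * 2.0 * (1.0 - 1.0 / n) for n, p in lineages_at_root(taus).items())

probs = lineages_at_root((1.0, 1.0, 1.0))
print({n: round(p, 3) for n, p in probs.items()}, round(sg_mean((1.0, 1.0, 1.0)), 3))
```

For a five-taxon caterpillar tree with all three internal branch lengths equal to 1, this gives probabilities of roughly 0.56 and 0.42 for two and three lineages at the root. The dictionary-passing loop plays the role of the peeling step: each branch only needs the distribution of lineage counts leaving the branch below it.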

Applications of Branch Length Distributions - Example 2

- A particular example

Figure 3: The three five-taxon species trees, (a), (b), and (c), studied by Pamilo and Nei (1988). For tree (a), horizontal dashed lines delimit time intervals (measured in units of 2N generations) whose length is denoted by $\tau_i$ for branch $i$. In (b) and (c), the $\tau_i$ are assumed to represent the appropriate intervals, although the dashed lines are not shown. Although the total depths of the trees depicted here are identical, we comparatively analyze several collections of branch lengths both with and without the assumption of constant depth across topologies.

[…] lineages at the root (and as a result for $T$ having the exponential distribution) is 0.56, while the probability of having three lineages is 0.42. Another interesting observation is that the first branch alone may change the mean time up to 40%, while, according to the above discussion, the maximal possible change over all possible triplets $(\tau_1, \tau_2, \tau_3)$ is 60%. This observation emphasizes that for an asymmetric tree the first branch is the most influential factor for the SG coalescent time.

Figure 5 allows us to analyze the probability mass function $P_n$ as a function in $\tau_1$. As expected, $P_2$ rapidly increases as $\tau_1$ increases from 0 to 2. At the same time, $P_3$ has a mode as a function in $\tau_1$ and its range is smaller. Also the calculation supports our preliminary conjecture that it is unlikely to have five gene lineages at the root of the asymmetric tree (a). The latter explains why the mean SG coalescent time changes only up to 40% as $\tau_1$ changes from 0.01 to infinity; recall (4).

The length $\tau_2$ of the second branch affects the density in a similar manner for fixed $\tau_1$ and $\tau_3$, although the effect is dramatically less pronounced. The interaction between the effects of these two branches on the probability mass function $P_n$ is exhibited in the top four diagrams of Figure 5. The top-left diagram indicates that the probability of having just two gene lineages becomes relatively large only for larger values of $\tau_1$ and, at the same time, […]

- When $\tau_2 = \tau_3 = 1.0$, we have the following:

[Figure 4 plot. x-axis: SG Coalescent Time (0 to 7); y-axis: Density (0.0 to 1.0). Legend values:]

τ1     P2     P3     P4     P5     μT
0.01   0.01   0.58   0.40   0.02   1.40
0.10   0.06   0.62   0.31   0.01   1.37
0.50   0.31   0.59   0.10   0.00   1.25
1.00   0.56   0.42   0.02   0.00   1.15
2.00   0.84   0.16   0.00   0.00   1.05
3.00   0.94   0.06   0.00   0.00   1.02

Figure 4: Densities $f_{T \mid S_a}(t)$, calculated according to Equation (3) for the species tree (a) shown in Figure 3. Different lines correspond to different lengths of the first branch $\tau_1$, with $\tau_2 = \tau_3 = 1$ and all [illegible]$_i = 1$, $i = 1, 2, 3$. The legend also shows corresponding probabilities $P_n = \Pr(L = n \mid S)$ and mean times $\mu_T = E(T \mid S)$ calculated by Equation (4). For instance, for the case $\tau_1 = 0.01$ the probability of having two lineages at the root is 0.01 and the mean time to coalescence is 1.4.

[…] the second branch has little influence on this probability. The situation changes dramatically for the probability of having three gene lineages (see the top-right diagram). Here $\tau_2$ plays an important role whenever $\tau_1$ is small, but its role diminishes for larger values of $\tau_1$. The outcome is similar for the probability of having four gene lineages, although any pronounced relationship holds for even smaller values of $\tau_1$. Finally, the probability of having five lineages is only around 0.1 even when both $\tau_2$ and $\tau_1$ are very small. The influence of $\tau_3$ on all these probabilities is small, and thus we skip discussion of the effect of $\tau_3$.

Now let us visualize effects of $\tau_1$ and $\tau_2$ on the mean SG coalescent time as well as on the total time, shown in the two bottom diagrams in Figure 5. The expected SG coalescent time decreases as either $\tau_1$ or $\tau_2$ increases, but the effect of $\tau_1$ is dominant. At the same time, the maximal decrease is […]
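The mean times reported with Figure 4 can be sanity-checked by hand. The calculation below is a sketch that assumes time is scaled so that each pair of lineages coalesces at rate 1; it uses the standard expected time for $n$ lineages to reach their MRCA, which may or may not be the form of the Equation (4) that the caption cites. The values 0.56 and 0.42 are the probabilities of two and three lineages at the root quoted in the excerpt for $\tau_1 = \tau_2 = \tau_3 = 1$, and $P_4 \approx 0.02$ is read from the figure legend.

```latex
\begin{align*}
E(T \mid L = n) &= \sum_{k=2}^{n} \frac{2}{k(k-1)} = 2\left(1 - \frac{1}{n}\right) \\
\mu_T = E(T \mid S) &= \sum_{n} \Pr(L = n \mid S)\; 2\left(1 - \frac{1}{n}\right) \\
&\approx 0.56\,(1) + 0.42\left(\tfrac{4}{3}\right) + 0.02\left(\tfrac{3}{2}\right)
 = 0.56 + 0.56 + 0.03 = 1.15
\end{align*}
```

Because $E(T \mid L = n)$ only ranges from 1 (two lineages) to 1.6 (five lineages), the mean SG coalescent time for a five-taxon tree can change by at most 60%, which matches the bound mentioned in the excerpt.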

Applications of Branch Length Distributions - Example 2

- What can we conclude from this?

  - Shorter branches lead to more potential for incomplete lineage sorting, which results in longer times to the MRCA

  - This effect will be most pronounced for branches that are close to the root of the tree

- What about symmetry of the tree?

Applications of Branch Length Distributions - Example 2

- Consider a more symmetric tree:

Figure 3: The three five-taxon species trees, (a), (b), and (c), studied by Pamilo and Nei (1988). For tree (a), horizontal dashed lines delimit time intervals (measured in units of 2N generations) whose length is denoted by $\tau_i$ for branch $i$. In (b) and (c), the $\tau_i$ are assumed to represent the appropriate intervals, although the dashed lines are not shown. Although the total depths of the trees depicted here are identical, we comparatively analyze several collections of branch lengths both with and without the assumption of constant depth across topologies.

- Two internal branches adjacent to the root node

- Suggests more possibility of incomplete lineage sorting ⟹ longer time to MRCA

Coalescent Theory

- We now have the main ideas of the coalescent model and how to apply it to a phylogeny.

- But there are many things we haven’t discussed: migration, recombination, etc.

- Next week: How can we take this model and use it to infer a species-level phylogeny?

- Thursday’s lab: Using the program COAL to compute gene tree topology probabilities.
