14
Exact Computation of Coalescent Likelihood under the Infinite Sites Model Yufeng Wu University of Connecticut ISBRA 2009 1

Exact Computation of Coalescent Likelihood under the Infinite Sites Model Yufeng Wu University of Connecticut ISBRA 2009 1

  • View
    218

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Exact Computation of Coalescent Likelihood under the Infinite Sites Model Yufeng Wu University of Connecticut ISBRA 2009 1

Exact Computation of Coalescent Likelihood under the Infinite Sites

Model

Yufeng Wu

University of Connecticut

ISBRA 2009 1

Page 2: Exact Computation of Coalescent Likelihood under the Infinite Sites Model Yufeng Wu University of Connecticut ISBRA 2009 1

Coalescent Likelihood

• D: a set of binary sequences.• Coalescent genealogy: history with

coalescent and mutation events.• Coalescent likelihood P(D):

probability of observing D on coalescent model given mutation rate

• Assume no recombination. 00000 00010 00010 00010 01100 10000 10001

1

5

2

3

4

Coalescent Mutation

Page 3: Exact Computation of Coalescent Likelihood under the Infinite Sites Model Yufeng Wu University of Connecticut ISBRA 2009 1

Perfect Phylogeny• Infinite many sites model of

mutations: one mutation per site in history

• Perfect phylogeny– Site labels tree branches– Each site appears exactly once.– Sequence: list of mutations from

root to leaf.– Unique topologically, except root is

unknown and order of mutations on the same branch is not fixed.

00000

01100

3

Page 4: Exact Computation of Coalescent Likelihood under the Infinite Sites Model Yufeng Wu University of Connecticut ISBRA 2009 1

Genealogy and Perfect Phylogeny

• Perfect phylogeny: not enough timing information

– Exists many coalescent genealogy for a fixed perfect phylogeny.– Each genealogy: different probability (depending on )

• Coalescent likelihood: sum over all compatible genealogy.

1

5

2

3

4

1

5

2

3

4

1

5

2

3

4

Page 5: Exact Computation of Coalescent Likelihood under the Infinite Sites Model Yufeng Wu University of Connecticut ISBRA 2009 1

Computing Coalescent Likelihood• Computation of P(D): classic population genetics problem.

Statistical (inexact) approaches:– Importance sampling (IS): Griffiths and Tavare (1994), Stephens and

Donnelly (2000), Hobolth, Uyenoyama and Wiuf (2008).– MCMC: Kuhner,Yamato and Felsenstein (1995).

• Genetree: IS-based, widely used but (sometimes large) variance still exists.

• How feasible of computing exact P(D)?– Considered to be difficult for even medium-sized data (Song, Lyngso and

Hein, 2006).

• This talk: exact computation of P(D) is feasible for data significantly larger than previously believed.– A simple algorithmic trick: dynamic programming

5

Page 6: Exact Computation of Coalescent Likelihood under the Infinite Sites Model Yufeng Wu University of Connecticut ISBRA 2009 1

Ethier-Griffiths Recursion• Build a perfect phylogeny for D.• Ancestral configuration (AC): pairs of

sequence multiplicity and list of mutations for each sequence type at some time

• Transition probability between ACs: depends on AC and .

• Genealogy: path of ACs (from present to root)

• P(D): sum of probability of all paths.• EG: faster summation, backwards in time.

(1, 0), (3, 4 0), (1, 3 2 0), (1, 1 0), (1, 5 1 0)

(1, 0), (1, 0), (1, 3 2 0), (1, 1 0), (1, 5 1 0)

(1, 0), (2, 4 0), (1, 3 2 0), (1, 1 0), (1, 5 1 0)

(1, 0), (1, 4 0), (1, 3 2 0), (1, 1 0), (1, 5 1 0)

(3, 4 0)6

Page 7: Exact Computation of Coalescent Likelihood under the Infinite Sites Model Yufeng Wu University of Connecticut ISBRA 2009 1

Computing Exact Likelihood• Key idea: forward instead of backwards

– Create all possible ACs reachable from the current AC (start from root). Update probability.

– Intuition of AC: growing coverage of the phylogeny, starting from root

• Possible events at root: three branching (b1, b2, b3), three mutations (m1, m2, m4).

• Branching: cover new branch

• Covered branch can mutate

• Each event: a new AC

b1

m2

b2Start from root AC

7

Page 8: Exact Computation of Coalescent Likelihood under the Infinite Sites Model Yufeng Wu University of Connecticut ISBRA 2009 1

Forward Finding of ACs • Finding all ACs by forward looking.

• Maintain a list of active events in each AC. Update in the new ACs.

• Rule: at a node in phylogeny, mutated branches covered branches (unless all branches are covered)

Mut branch = covered branch

8

Page 9: Exact Computation of Coalescent Likelihood under the Infinite Sites Model Yufeng Wu University of Connecticut ISBRA 2009 1

Why Forward?• Bottleneck: memory • Layer of ACs: ACs

with k mutation or branching events from root AC, k= 1,2,3…

• Key: only the current layer needs to be kept. Memory efficient.

• A single forward pass is enough to compute P(D).

9

Coalescent

Mutation

Page 10: Exact Computation of Coalescent Likelihood under the Infinite Sites Model Yufeng Wu University of Connecticut ISBRA 2009 1

Results on Simulated Data• Use Hudson’s program ms: 20, 30 , 40 and 50

sequences with = 1, 3 and 5. Each settings: 100 datasets. How many allow exact computation of P(D) within reasonable amount of time?

Number of sequences

% of feasible data

Number of sequences

Ave. run time (sec.) for feasible data

10

Page 11: Exact Computation of Coalescent Likelihood under the Infinite Sites Model Yufeng Wu University of Connecticut ISBRA 2009 1

Application: MLE of Mutation Rate• Given a set of sequences D, what is the

maximum likelihood estimate of the mutation rate ?

• Issue: Need to compute for many possible and root of genealogy is not known.

• Use exact likelihood– Compute P(D | ) for on a grid.– MLE of : maximize P(D | ).– Full likelihood: sum P(D | ) over all possible roots.– Faster computation: compute P(D) for multiple in

one pass. Quicker than computing P(D) for one each time.

11

P(D)

MLE()

Page 12: Exact Computation of Coalescent Likelihood under the Infinite Sites Model Yufeng Wu University of Connecticut ISBRA 2009 1

A Mitochondrial Data• Mitochondrial data from Ward, et al. (1991). Previously analyzed

by Griffiths and Tavare (1994) and others.– 55 sequences and 18 polymorphic sites.

– Believed to fit the infinite sites model.

12

Page 13: Exact Computation of Coalescent Likelihood under the Infinite Sites Model Yufeng Wu University of Connecticut ISBRA 2009 1

MLE of Mutation Rate for the Mitochondrial Data

• MLE of : 4.8 Griffiths and Tavare (1994)

• IS methods can have variance– Is 4.8 really the MLE?

• Compute the exact full likelihood for a grid of between 4.0 to 6.0.

13

Page 14: Exact Computation of Coalescent Likelihood under the Infinite Sites Model Yufeng Wu University of Connecticut ISBRA 2009 1

Conclusion• IS seems to work well for the Mitochondrial

data– However, IS can still have large variance for some

data.– Thus, exact computation may help when data is

not very large and/or relatively low mutation rate.– Can also help to evaluate different statistical

methods.

• Research supported by National Science Foundation.

14