Tree Searching Methods

• Exhaustive search (exact)

• Branch-and-bound search (exact)

• Heuristic search methods (approximate)

– Stepwise addition

– Branch swapping

– Star decomposition

Exhaustive Search

[Figure: exhaustive search diagram (numeric labels 11, 12 and 13).]

Searching for trees

• Generation of all possible trees

1. Generate all 3 trees for the first 4 taxa:

2. Generate all 15 trees for the first 5 taxa:

(likewise for each of the other two 4-taxon trees)

3. Full search tree:
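The counts above (3 trees for 4 taxa, 15 for 5) follow because an unrooted bifurcating tree with k taxa has 2k − 3 branches, and the next taxon can be attached to any one of them. A minimal Python sketch of that count; the function name is illustrative and not from the slides:

```python
def num_unrooted_trees(n_taxa):
    """Number of distinct unrooted, fully bifurcating trees for n_taxa labelled taxa."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1  # exactly one tree for 3 taxa
    for k in range(3, n_taxa):
        # a tree with k taxa has 2k - 3 branches; taxon k + 1 can join any of them
        count *= 2 * k - 3
    return count


for n in (4, 5, 6, 10, 20):
    print(n, num_unrooted_trees(n))
```

The rapid growth of this number is why exhaustive search is only practical for small numbers of taxa.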

Searching for trees

Branch and bound algorithm:

The search tree is the same as for exhaustive search, with tree lengths for a hypothetical data set shown in boldface type. If a tree lying at a node of this search tree has a length that exceeds the current upper bound on the optimal tree length, this path of the search tree is terminated (indicated by a cross-bar), and the algorithm backtracks and takes the next available path. When a tip of the search tree is reached (i.e., when we arrive at a tree containing the full set of taxa), the tree is either optimal (and hence retained) or suboptimal (and rejected). When all paths leading from the initial 3-taxon tree have been explored, the algorithm terminates, and all most-parsimonious trees will have been identified. Asterisks indicate points at which the current upper bound is reduced. Circled numbers represent the order in which phylogenetic trees are visited in the search tree.
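The procedure can be sketched in Python as follows. This is only an illustration under stated assumptions: tree length is scored with Fitch parsimony, trees are nested tuples hanging off the first taxon, the bound is the length of the shortest complete tree found so far, and the six sequences are made up for the example. Setting `use_bound=False` gives the exhaustive search for comparison.

```python
def fitch_length(tree, seqs):
    """Fitch parsimony: return (state sets per site, number of changes) for a rooted binary tree."""
    if not isinstance(tree, tuple):            # a tip: its states are observed
        return [{c} for c in seqs[tree]], 0
    left_sets, left_len = fitch_length(tree[0], seqs)
    right_sets, right_len = fitch_length(tree[1], seqs)
    sets, length = [], left_len + right_len
    for a, b in zip(left_sets, right_sets):
        if a & b:
            sets.append(a & b)
        else:                                  # disjoint state sets: one change needed
            sets.append(a | b)
            length += 1
    return sets, length


def add_taxon(tree, taxon):
    """Yield every tree made by attaching `taxon` to one branch of `tree`."""
    yield (tree, taxon)                        # the branch leading to this subtree
    if isinstance(tree, tuple):
        left, right = tree
        for t in add_taxon(left, taxon):
            yield (t, right)
        for t in add_taxon(right, taxon):
            yield (left, t)


def search(seqs, use_bound=True):
    taxa = sorted(seqs)
    best = {"length": float("inf"), "trees": [], "evaluated": 0}

    def visit(tree, next_index):
        length = fitch_length((taxa[0], tree), seqs)[1]
        best["evaluated"] += 1
        if use_bound and length > best["length"]:
            return                             # adding taxa can never shorten a tree
        if next_index == len(taxa):            # tip of the search tree
            if length < best["length"]:
                best["length"], best["trees"] = length, []
            if length == best["length"]:
                best["trees"].append((taxa[0], tree))
            return
        for bigger in add_taxon(tree, taxa[next_index]):
            visit(bigger, next_index + 1)

    visit((taxa[1], taxa[2]), 3)               # the single 3-taxon tree
    return best


# Hypothetical data, for illustration only.
seqs = {
    "A": "AAGCTTCATA", "B": "AAGCTTCACC", "C": "AAGCTTCACC",
    "D": "AAGCTTTACA", "E": "AAGCTTTTCC", "F": "GAGCATTTCC",
}
bounded = search(seqs)
exhaustive = search(seqs, use_bound=False)
print("shortest length:", bounded["length"])
print("trees evaluated:", bounded["evaluated"], "with bound,", exhaustive["evaluated"], "without")
```

Because a path is abandoned only when its length strictly exceeds the bound, every tree tied for the optimum is still found.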

Stepwise Addition (in a nutshell)

[Figure: the 3-taxon tree (1, 2, 3) and the three 4-taxon trees obtained by adding taxon 4.]

Searching for trees

Stepwise addition

A greedy stepwise-addition search applied to the example used for branch-and-bound. The best 4-taxon tree is determined by evaluating the lengths of the three trees obtained by joining taxon D to tree 1 containing only the first three taxa. Taxa E and F are then connected to the five and seven possible locations, respectively, on trees 4 and 9, with only the shortest trees found during each step being used for the next step. In this example, the 233-step tree obtained is not a global optimum. Circled numbers indicate the order in which phylogenetic trees are evaluated in the stepwise-addition search.
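A minimal Python sketch of greedy stepwise addition, assuming Fitch parsimony as the length criterion and keeping a single shortest tree at each step (ties are broken arbitrarily, which is a simplification). The sequences and function names are hypothetical.

```python
def fitch_length(tree, seqs):
    """Fitch parsimony: return (state sets per site, number of changes) for a rooted binary tree."""
    if not isinstance(tree, tuple):
        return [{c} for c in seqs[tree]], 0
    left_sets, left_len = fitch_length(tree[0], seqs)
    right_sets, right_len = fitch_length(tree[1], seqs)
    sets, length = [], left_len + right_len
    for a, b in zip(left_sets, right_sets):
        if a & b:
            sets.append(a & b)
        else:
            sets.append(a | b)
            length += 1
    return sets, length


def add_taxon(tree, taxon):
    """Yield every tree made by attaching `taxon` to one branch of `tree`."""
    yield (tree, taxon)
    if isinstance(tree, tuple):
        left, right = tree
        for t in add_taxon(left, taxon):
            yield (t, right)
        for t in add_taxon(right, taxon):
            yield (left, t)


def stepwise_addition(seqs, order):
    """Add taxa in the given order, keeping only one shortest tree after each addition."""
    tree = (order[1], order[2])                # order[0] acts as the root taxon
    for taxon in order[3:]:
        tree = min(add_taxon(tree, taxon),
                   key=lambda t: fitch_length((order[0], t), seqs)[1])
    full = (order[0], tree)
    return full, fitch_length(full, seqs)[1]


# Hypothetical data, for illustration only.
seqs = {
    "A": "AAGCTTCATA", "B": "AAGCTTCACC", "C": "AAGCTTCACC",
    "D": "AAGCTTTACA", "E": "AAGCTTTTCC", "F": "GAGCATTTCC",
}
for order in (sorted(seqs), sorted(seqs, reverse=True)):
    tree, length = stepwise_addition(seqs, order)
    print(order, length, tree)
```

Only 3 + 5 + 7 trees are evaluated for six taxa, but nothing guarantees that the result is globally optimal, and the outcome can depend on the addition order.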

Stepwise Addition Variants

• As Is
– add taxa in the order found in the data matrix

• Closest
– add the unplaced taxon that requires the smallest increase in tree length

• Furthest
– add the unplaced taxon that requires the largest increase in tree length

• Simple
– Farris’s (1970) “simple algorithm” uses a set of pairwise reference distances

• Random
– a random permutation of the taxa is used to select the order

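Branch swapping: Nearest Neighbor Interchange (NNI)

[Figure: a five-taxon tree (A to E) and the two rearrangements obtained by interchanging subtrees across an internal branch.]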

Branch swapping: Subtree Pruning and Regrafting (SPR)

[Figure: a seven-taxon tree (A to G) from which a subtree is pruned and then regrafted elsewhere on the remaining tree.]

Branch swapping: Tree Bisection and Reconnection (TBR)

[Figure: a seven-taxon tree (A to G) bisected into two subtrees, followed by several alternative reconnections.]

Reconnection limits in TBR

[Figure: TBR rearrangements of a six-taxon tree (taxa 1 to 6), with reconnection distances of 0, 1 and 2 marked on the candidate branches.]

In PAUP*, use “ReconLim” to set maximum reconnection distance

Star-decomposition search

Overview of maximum likelihood as used in phylogenetics

• Overall goal: Find a tree topology (and associated parameter estimates) that maximizes the probability of obtaining the observed data, given a model of evolution

Likelihood(hypothesis) ∝ Prob(data|hypothesis)

Likelihood(tree,model) = k Prob(observed sequences|tree,model)

[not Prob(tree|data,model)]

Computing the likelihood of a single tree

     1        j        N
(1)  C…GGACA…C…GTTTA…C
(2)  C…AGACA…C…CTCTA…C
(3)  C…GGATA…A…GTTAA…C
(4)  C…GGATA…G…CCTAG…C

[Figure: tree with tips (1) to (4), carrying states C, C, A, G at site j, and internal nodes (5) and (6).]

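Likelihood at site j = Prob(C, C, A, G; A, A) + Prob(C, C, A, G; A, C) + … + Prob(C, C, A, G; T, T)

Each term fixes the tip states C, C, A, G and one assignment of states to the two internal nodes (5) and (6); the sum runs over all 16 such assignments.

But use Felsenstein (1981) pruning algorithm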

$$L = L_1 L_2 \cdots L_N = \prod_{j=1}^{N} L_j$$

$$\ln L = \ln L_1 + \ln L_2 + \cdots + \ln L_N = \sum_{j=1}^{N} \ln L_j$$

Note: PAUP* reports -ln L, so lower -ln L implies higher likelihood
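As a rough illustration of the site-likelihood calculation, the Python sketch below computes the likelihood of the pattern C, C, A, G on a four-taxon tree in two ways: by the 16-term sum over internal-node states and by Felsenstein's pruning recursion. The Jukes-Cantor model, the branch lengths (0.1 for tips, 0.05 for the internal branch) and all names are assumptions chosen for the example.

```python
import math
from itertools import product

BASES = "ACGT"


def jc_prob(i, j, v):
    """Jukes-Cantor probability that base i is replaced by base j along a branch of length v."""
    e = math.exp(-4.0 * v / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def partial(node, site):
    """Conditional likelihoods of the data below `node`, for each possible state at `node`.
    A tip is a name (str); an internal node is a tuple of (child, branch_length) pairs."""
    if isinstance(node, str):
        return {x: 1.0 if x == site[node] else 0.0 for x in BASES}
    result = {x: 1.0 for x in BASES}
    for child, v in node:
        below = partial(child, site)
        for x in BASES:
            result[x] *= sum(jc_prob(x, y, v) * below[y] for y in BASES)
    return result


def site_likelihood(root, site):
    """Pruning algorithm: weight the root's conditional likelihoods by the base frequencies (1/4)."""
    return sum(0.25 * value for value in partial(root, site).values())


def brute_force(site, v_tip=0.1, v_int=0.05):
    """The explicit sum over all 16 state assignments to internal nodes (5) and (6)."""
    total = 0.0
    for x5, x6 in product(BASES, repeat=2):
        total += (0.25 * jc_prob(x5, site["1"], v_tip) * jc_prob(x5, site["2"], v_tip)
                  * jc_prob(x5, x6, v_int)
                  * jc_prob(x6, site["3"], v_tip) * jc_prob(x6, site["4"], v_tip))
    return total


node6 = (("3", 0.1), ("4", 0.1))
root = (("1", 0.1), ("2", 0.1), (node6, 0.05))   # node (5) serves as the root
site_j = {"1": "C", "2": "C", "3": "A", "4": "G"}

print("pruning    :", site_likelihood(root, site_j))
print("brute force:", brute_force(site_j))

# The log-likelihood of several sites is the sum of the per-site logs.
sites = [site_j, {"1": "C", "2": "C", "3": "C", "4": "C"}]
print("-ln L      :", -sum(math.log(site_likelihood(root, s)) for s in sites))
```

Pruning gives the same value as the explicit sum but scales linearly with the number of taxa, whereas the number of terms in the explicit sum grows as 4 to the power of the number of internal nodes.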

Finding the maximum-likelihood tree (in principle)

• Evaluate the likelihood of each possible tree for a given collection of taxa.

• Choose the tree topology which maximizes the likelihood over all possible trees.

Probability calculations require…

• An explicit model of substitution that specifies change probabilities for a given branch length

“Instantaneous rate matrix”

– Jukes-Cantor
– Kimura 2-parameter
– Hasegawa-Kishino-Yano (HKY)
– Felsenstein 1981, 1984
– General time-reversible

$$Q = \begin{pmatrix}
\pi_A r_{AA} & \pi_C r_{AC} & \pi_G r_{AG} & \pi_T r_{AT} \\
\pi_A r_{CA} & \pi_C r_{CC} & \pi_G r_{CG} & \pi_T r_{CT} \\
\pi_A r_{GA} & \pi_C r_{GC} & \pi_G r_{GG} & \pi_T r_{GT} \\
\pi_A r_{TA} & \pi_C r_{TC} & \pi_G r_{TG} & \pi_T r_{TT}
\end{pmatrix}$$

$$P(\nu) = e^{Q\nu}$$

• An estimate of optimal branch lengths in units of expected amount of change ($\nu$ = rate × time)

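For example:

Jukes-Cantor (1969):

$$Q = \begin{pmatrix}
- & \alpha & \alpha & \alpha \\
\alpha & - & \alpha & \alpha \\
\alpha & \alpha & - & \alpha \\
\alpha & \alpha & \alpha & -
\end{pmatrix}$$

Kimura (1980) “2-parameter”:

$$Q = \begin{pmatrix}
- & \beta & \alpha & \beta \\
\beta & - & \beta & \alpha \\
\alpha & \beta & - & \beta \\
\beta & \alpha & \beta & -
\end{pmatrix}$$

Hasegawa-Kishino-Yano (1985):

$$Q = \begin{pmatrix}
- & \pi_C \beta & \pi_G \alpha & \pi_T \beta \\
\pi_A \beta & - & \pi_G \beta & \pi_T \alpha \\
\pi_A \alpha & \pi_C \beta & - & \pi_T \beta \\
\pi_A \beta & \pi_C \alpha & \pi_G \beta & -
\end{pmatrix}$$

General Time-Reversible:

$$Q = \begin{pmatrix}
\pi_A r_{AA} & \pi_C r_{AC} & \pi_G r_{AG} & \pi_T r_{AT} \\
\pi_A r_{CA} & \pi_C r_{CC} & \pi_G r_{CG} & \pi_T r_{CT} \\
\pi_A r_{GA} & \pi_C r_{GC} & \pi_G r_{GG} & \pi_T r_{GT} \\
\pi_A r_{TA} & \pi_C r_{TC} & \pi_G r_{TG} & \pi_T r_{TT}
\end{pmatrix}$$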
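To make the step from a rate matrix Q to transition probabilities P(ν) = e^{Qν} concrete, here is a small pure-Python sketch. It makes several assumptions not stated on the slides: each diagonal entry is set so that its row sums to zero, Q is rescaled so that branch lengths are in expected substitutions per site, and the matrix exponential is computed with a simple scaling-and-squaring Taylor series. The GTR rates and base frequencies are those of the simulation example used elsewhere in these slides; function names are illustrative.

```python
import math


def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mat_exp(m, squarings=10, terms=12):
    """exp(m) by scaling and squaring with a truncated Taylor series (fine for small 4x4 matrices)."""
    n = len(m)
    scaled = [[x / 2 ** squarings for x in row] for row in m]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def rate_matrix(freqs, rates):
    """Q[i][j] = pi_j * r_ij off the diagonal; rows sum to zero; mean rate rescaled to 1."""
    n = len(freqs)
    q = [[freqs[j] * rates[i][j] if i != j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        q[i][i] = -sum(q[i])
    mu = -sum(freqs[i] * q[i][i] for i in range(n))
    return [[x / mu for x in row] for row in q]


def transition_probs(q, v):
    """P(v) = exp(Q v) for a branch of length v."""
    return mat_exp([[x * v for x in row] for row in q])


# Jukes-Cantor: equal frequencies, a single substitution type.
jc = rate_matrix([0.25] * 4, [[1.0] * 4 for _ in range(4)])
p_jc = transition_probs(jc, 0.3)
print("JC  P(A->A):", p_jc[0][0], "closed form:", 0.25 + 0.75 * math.exp(-4 * 0.3 / 3))

# GTR with the rates and base frequencies from the simulation example (order A, C, G, T).
gtr = rate_matrix([0.1, 0.2, 0.3, 0.4],
                  [[0, 5, 6, 2], [5, 0, 3, 8], [6, 3, 0, 1], [2, 8, 1, 0]])
for row in transition_probs(gtr, 0.1):
    print([round(x, 4) for x in row])
```

For Jukes-Cantor the numerical result can be checked against the closed-form probability; for richer models the same exponential is usually obtained from an eigendecomposition of Q.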

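E.g., transition probabilities for HKY and F84:

$$P_{ij}(t) = \begin{cases}
\pi_j + \pi_j \left( \dfrac{1}{\Pi_j} - 1 \right) e^{-\mu\nu} + \left( \dfrac{\Pi_j - \pi_j}{\Pi_j} \right) e^{-\mu\nu A} & (i = j) \\[2ex]
\pi_j + \pi_j \left( \dfrac{1}{\Pi_j} - 1 \right) e^{-\mu\nu} - \left( \dfrac{\pi_j}{\Pi_j} \right) e^{-\mu\nu A} & (i \neq j,\ \text{transition}) \\[2ex]
\pi_j \left( 1 - e^{-\mu\nu} \right) & (i \neq j,\ \text{transversion})
\end{cases}$$

Here $\Pi_j$ is the combined frequency of the purines or of the pyrimidines, whichever class contains base j, and A is a model-specific constant that differs between HKY and F84.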

A Family of Reversible Substitution Models

• GTR (general time-reversible)
– equal base frequencies → SYM
– 3 substitution types (transversions, 2 transition classes) → TrN (Tamura-Nei)

• SYM
– 3 substitution types (transitions, 2 transversion classes) → K3ST (Kimura 3-subst. type)

• TrN
– 2 substitution types (transitions vs. transversions) → HKY85 (Hasegawa-Kishino-Yano), F84 (Felsenstein)

• K3ST
– 2 substitution types (transitions vs. transversions) → K2P (Kimura 2-parameter)

• HKY85, F84
– single substitution type → F81 (Felsenstein)
– equal base frequencies → K2P

• F81
– equal base frequencies → JC (Jukes-Cantor)

• K2P
– single substitution type → JC

The Relevance of Branch Lengths

[Figure: two trees with the tip states C C A A A A A A A A.]

When does maximum likelihood work better than parsimony?

• When you’re in the “Felsenstein Zone”

[Figure: four-taxon tree of A, B, C and D.]

(Felsenstein, 1978)

In the Felsenstein Zone

Substitution rates:

     A  C  G  T
A    -  5  6  2
C    5  -  3  8
G    6  3  -  1
T    2  8  1  -

Base frequencies: A=0.1 C=0.2 G=0.3 T=0.4

[Figure: four-taxon tree (A, B, C, D) with branch lengths 0.1, 0.1, 0.1, 0.8 and 0.8.]

[Figure: proportion correct (0 to 1) vs. sequence length (0 to 10000) for parsimony and ML-GTR.]

The long-branch attraction (LBA) problem

[Figure: the true phylogeny of 1, 2, 3 and 4.]

Pattern type                      1  2  3  4   Changes required
I = Uninformative (constant)      A  A  A  A   zero on any tree
II = Uninformative                A  A  A  G   one on any tree
III = Uninformative               C  A  A  G   two on any tree
IV = Misinformative               G  A  A  G   two on true tree

[Figure: tree grouping 1 (G) with 4 (G) and 2 (A) with 3 (A).]

… but this tree needs only one step
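The step counts for the four pattern types can be checked with a few lines of Python using Fitch's counting rule. The sketch assumes that the true tree pairs taxon 1 with 2 and taxon 3 with 4, which is consistent with the counts given above; the tree encodings and names are illustrative.

```python
def fitch_steps(tree, states):
    """Return (state set, number of changes) for one site on a rooted binary tree of nested tuples."""
    if not isinstance(tree, tuple):
        return {states[tree]}, 0
    (a, steps_a), (b, steps_b) = fitch_steps(tree[0], states), fitch_steps(tree[1], states)
    if a & b:
        return a & b, steps_a + steps_b
    return a | b, steps_a + steps_b + 1        # disjoint sets cost one change


patterns = {
    "I":   {1: "A", 2: "A", 3: "A", 4: "A"},
    "II":  {1: "A", 2: "A", 3: "A", 4: "G"},
    "III": {1: "C", 2: "A", 3: "A", 4: "G"},
    "IV":  {1: "G", 2: "A", 3: "A", 4: "G"},
}
true_tree = ((1, 2), (3, 4))   # assumed true tree: the long branches 1 and 4 are not sisters
lba_tree = ((1, 4), (2, 3))    # tree that groups the two long branches

for name, states in patterns.items():
    print(name,
          "true tree:", fitch_steps(true_tree, states)[1],
          "long-branch tree:", fitch_steps(lba_tree, states)[1])
```

Only pattern IV distinguishes the trees, and it favors the tree that joins the two long branches.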

Concerns about statistical properties and suitability of models (assumptions)

Consistency

If an estimator converges to the true value of a parameter as the amount of data increases toward infinity, the estimator is consistent.

When do both methods fail?

• When there is insufficient phylogenetic signal...

[Figure: four-taxon tree of taxa 1, 2, 3 and 4.]

When does parsimony work “better” than maximum likelihood?

• When you’re in the Inverse-Felsenstein (“Farris”) zone

[Figure: four-taxon tree of A, B, C and D.]

(Siddall, 1998)

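Siddall (1998) parameter space

[Figure: parameter space with axes pa and pb, each from 0 to 0.75, for a four-taxon tree with branches labelled a, a, b, b, b; regions marked “Both methods do poorly”, “Parsimony has higher accuracy than likelihood” and “Both methods do well”.]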

Parsimony vs. likelihood in the Inverse-Felsenstein Zone

[Figure: accuracy (0 to 1) vs. sequence length (20 to 100,000) for Parsimony and ML/JC; tree branch lengths of 15%, 67.5% and 67.5% (expected differences/site).]

Why does parsimony do so well in the Inverse-Felsenstein zone?

[Figure: example trees with tip states A and C, contrasting a “True synapomorphy” with “Apparent synapomorphies actually due to misinterpreted homoplasy”.]

Parsimony vs. likelihood in the Felsenstein Zone

[Figure: accuracy (0 to 1) vs. sequence length (20 to 100,000) for Parsimony and ML/JC; tree branch lengths of 15%, 67.5% and 67.5% (expected differences/site).]

From the Farris Zone to the Felsenstein Zone

External branches = 0.5 or 0.05 substitutions/site, Jukes-Cantor model of nucleotide substitution

Simulation results:

[Figure: two panels, Parsimony and Likelihood (ML/JC), plotting accuracy (0 to 1.0) against the length of the internal branch (d), from 0.05 in the Farris zone through 0 to 0.05 in the Felsenstein zone, for 100, 1,000 and 10,000 sites.]

Maximum likelihood models are oversimplifications of reality. If I assume the wrong model, won’t my results be meaningless?

• Not necessarily (maximum likelihood is pretty robust)

Model used for simulation...

Substitution rates:

     A  C  G  T
A    -  5  6  2
C    5  -  3  8
G    6  3  -  1
T    2  8  1  -

Base frequencies: A=0.1 C=0.2 G=0.3 T=0.4

[Figure: four-taxon tree (A, B, C, D) with branch lengths 0.1, 0.1, 0.1, 0.8 and 0.8.]

Performance of ML when its model is violated (one example)

[Figure: proportion correct (0 to 1) vs. sequence length (100 to 10000) for parsimony, ML-JC, ML-K2P, ML-HKY and ML-GTR.]

Among site rate heterogeneity

• Proportion of invariable sites
– Some sites don’t change due to strong functional or structural constraint (Hasegawa et al., 1985)

• Site-specific rates
– Different relative rates assumed for pre-assigned subsets of sites

• Gamma-distributed rates
– Rate variation assumed to follow a gamma distribution with shape parameter α

Lemur  AAGCTTCATAG TTGCATCATCCA …TTACATCATCCA
Homo   AAGCTTCACCG TTGCATCATCCA …TTACATCCTCAT
Pan    AAGCTTCACCG TTACGCCATCCA …TTACATCCTCAT
Goril  AAGCTTCACCG TTACGCCATCCA …CCCACGGACTTA
Pongo  AAGCTTCACCG TTACGCCATCCT …GCAACCACCCTC
Hylo   AAGCTTTACAG TTACATTATCCG …TGCAACCGTCCT
Maca   AAGCTTTTCCG TTACATTATCCG …CGCAACCATCCT

equal rates?

Performance of ML when its model is violated (another example)

Modeling among-site rate variation with a gamma distribution...

[Figure: gamma densities (frequency vs. rate, for rates from 0 to 2) with shape parameters α = 0.5, 2, 50 and 200.]

…can also estimate a proportion of “invariable” sites (pinv)
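The shapes in the gamma plot can be reproduced numerically. The Python sketch below evaluates a gamma density with shape α, assuming the usual convention that the scale is 1/α so that the mean relative rate is 1 (the variance is then 1/α). The function name is illustrative.

```python
import math


def gamma_rate_density(rate, alpha):
    """Density at `rate` (> 0) of a gamma distribution with shape alpha and mean 1."""
    # computed on the log scale so that large alpha (e.g. 200) does not overflow
    log_density = (alpha * math.log(alpha) + (alpha - 1.0) * math.log(rate)
                   - alpha * rate - math.lgamma(alpha))
    return math.exp(log_density)


for alpha in (0.5, 2, 50, 200):
    row = [round(gamma_rate_density(r, alpha), 3) for r in (0.25, 0.5, 1.0, 1.5, 2.0)]
    print(f"alpha={alpha}: {row}")
```

Small α puts most sites at very low rates with a long tail of fast sites, while large α concentrates the density near rate 1, approaching the equal-rates case.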

Performance of ML when its model is violated (another example)

[Figure: panels of proportion correct (0 to 1) vs. sequence length (100 to 100000) for three settings (α = 0.5, pinv = 0.5; α = 1.0, pinv = 0.5; α = 1.0, pinv = 0.2), comparing GTRig, HKYig, GTRg, HKYg, GTRi, HKYi, GTRer, HKYer and parsimony.]

“MODERATE”–Felsenstein zone

α = 1.0, pinv = 0.5

[Figure: proportion correct (0 to 1) vs. sequence length (100 to 100000) for JCer, JC+G, JC+I, JC+I+G, GTRer, GTR+G, GTR+I, GTR+I+G and parsimony.]

“MODERATE”–Inverse-Felsenstein zone

[Figure: proportion correct (0 to 1) vs. sequence length (100 to 100000) for JCer, JC+G, JC+I, JC+I+G, GTRer, GTR+G, GTR+I, GTR+I+G and parsimony.]

Bayesian Inference in Phylogenetics

• Uses Bayes formula:

$$\Pr(\theta \mid D) = \frac{\Pr(D \mid \theta)\,\Pr(\theta)}{\Pr(D)} \propto \Pr(D \mid \theta)\,\Pr(\theta) \propto L(\theta)\,\Pr(\theta)$$

($\theta$ = tree topology, branch lengths, and substitution-model parameters)

• Calculation involves integrating over all tree topologies and model-parameter values, subject to assumed prior distribution on parameters

• To approximate this posterior density (complicated multidimensional integral) we use Markov chain Monte Carlo (MCMC)

– Simulated Markov chain in which transition probabilities are assigned such that the stationary distribution of the chain is the posterior density of interest

– E.g., Metropolis-Hastings algorithm: Accept a proposed move from one state $\theta$ to another state $\theta^*$ with probability min(r,1) where

$$r = \frac{\Pr(\theta^* \mid D)\,\Pr(\theta \mid \theta^*)}{\Pr(\theta \mid D)\,\Pr(\theta^* \mid \theta)}$$

– Sample chain at regular intervals to approximate posterior distribution

• MrBayes (by John Huelsenbeck and Fredrik Ronquist) is most popular Bayesian inference program

[Figure: likelihood vs. iterations trace, with example four-taxon trees (A, B, C, D).]

A brief intro to Markov chain Monte Carlo (MCMC)

If the chain is run “long enough”, the stationary distribution of states in the chain will represent a good approximation to the target distribution (in this case, the Bayesian posterior)

1. Initialize the chain, e.g., by picking a random state X0 (topology, branch lengths, substitution-model parameters) from the assumed prior distribution

2. For each time t, sample a new candidate state Y from some proposal distribution q(.|Xt) (e.g., change branch lengths or topology plus branch lengths)

Calculate acceptance probability

$$a(X,Y) = \min\left(1, \frac{\Pr(Y \mid D)\, q(X \mid Y)}{\Pr(X \mid D)\, q(Y \mid X)}\right) = \min\left(1, \frac{\pi(Y)}{\pi(X)} \times \frac{\Pr(D \mid Y)}{\Pr(D \mid X)} \times \frac{q(X \mid Y)}{q(Y \mid X)}\right)$$

3. If Y is accepted, let Xt+1 = Y; otherwise let Xt+1 = Xt

[Figure: the chain moving among the topologies AB|CD, AC|BD and BC|AD, with the initial “burn in” portion of the run marked.]
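The three steps can be illustrated with a toy Metropolis-Hastings sampler in Python. This is a sketch under strong simplifications: the state is only one of the three four-taxon topologies, the unnormalized posterior weights are invented for the example, and the proposal is symmetric, so the proposal ratio cancels. It is not how MrBayes is implemented.

```python
import random
from collections import Counter


def metropolis_hastings(weights, n_steps, burn_in, seed=1):
    """Sample states in proportion to `weights` (unnormalized posterior values)."""
    rng = random.Random(seed)
    states = list(weights)
    x = rng.choice(states)                               # step 1: arbitrary starting state
    samples = []
    for t in range(n_steps):
        # step 2: propose one of the other states uniformly, so q(Y|X) = q(X|Y)
        y = rng.choice([s for s in states if s != x])
        accept = min(1.0, weights[y] / weights[x])       # proposal ratio cancels
        if rng.random() < accept:                        # step 3: accept or stay put
            x = y
        if t >= burn_in:                                 # discard the burn-in portion
            samples.append(x)
    return Counter(samples)


# Hypothetical unnormalized posterior weights for the three topologies.
weights = {"AB|CD": 0.6, "AC|BD": 0.1, "BC|AD": 0.3}
counts = metropolis_hastings(weights, n_steps=60000, burn_in=1000)
total = sum(counts.values())
for topology in weights:
    print(topology, round(counts[topology] / total, 3))
```

Only ratios of posterior values are needed, which is why the normalizing constant Pr(D) never has to be computed.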

Model-based distances

• Can also calculate pairwise distances based on these models

• These distances estimate the number of substitutions per site that have accumulated since the two sequences shared a common ancestor, allowing for superimposed substitutions (“multiple hits”)

• E.g.:

– Jukes-Cantor distance

– Kimura 2-parameter distance

– General maximum-likelihood distances available for other models
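The first two distances have well-known closed forms: with p the proportion of differing sites, the Jukes-Cantor distance is −(3/4) ln(1 − 4p/3); with P and Q the proportions of transitions and transversions, the Kimura 2-parameter distance is −(1/2) ln(1 − 2P − Q) − (1/4) ln(1 − 2Q). A minimal Python sketch, using two short made-up sequences and illustrative function names:

```python
import math

PURINES = {"A", "G"}


def count_differences(s1, s2):
    """Return (transition proportion P, transversion proportion Q) for two aligned sequences."""
    if len(s1) != len(s2) or not s1:
        raise ValueError("sequences must be aligned and non-empty")
    transitions = transversions = 0
    for x, y in zip(s1, s2):
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1          # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            transversions += 1
    n = len(s1)
    return transitions / n, transversions / n


def jc_distance(s1, s2):
    p, q = count_differences(s1, s2)
    return -0.75 * math.log(1.0 - 4.0 * (p + q) / 3.0)


def k2p_distance(s1, s2):
    p, q = count_differences(s1, s2)
    return -0.5 * math.log(1.0 - 2.0 * p - q) - 0.25 * math.log(1.0 - 2.0 * q)


# Hypothetical aligned sequences: one transition and one transversion in 20 sites.
s1 = "AAGCTTCATAGTTGCATCAT"
s2 = "AAGCTTCACCGTTGCATCAT"
print("observed proportion:", sum(count_differences(s1, s2)))
print("Jukes-Cantor       :", jc_distance(s1, s2))
print("Kimura 2-parameter :", k2p_distance(s1, s2))
```

Both corrected distances exceed the observed proportion of differences, reflecting the allowance for multiple hits. Very divergent sequences can make the logarithm's argument non-positive, in which case the distance is undefined.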

Distance-based optimality criteria: “Additive trees”

[Figure: four-taxon tree with terminal branches a (taxon 1), b (taxon 2), d (taxon 3) and e (taxon 4), and internal branch c.]

     1    2    3    4
1    -
2    d12  -
3    d13  d23  -
4    d14  d24  d34  -

p12 = a+b
p13 = a+c+d
p14 = a+c+e
p23 = b+c+d
p24 = b+c+e
p34 = d+e

pij = dij for all i and j if the tree topology is correct and distances are additive

Distances in general will not be additive, so choose optimal tree according to one of the following criteria (objective functions):

"Goodness-of-fit": minimize

$$\sum_{i<j} w_{ij} \left| p_{ij} - d_{ij} \right|^{r}$$

Typically, r = 2 (least-squares) and $w_{ij} = 1/d_{ij}^{2}$ ("Fitch-Margoliash" method)

"Minimum-evolution": minimize

$$\sum_{k=1}^{\#\text{branches}} \left| v_k \right| \quad \text{or} \quad \sum_{k=1}^{\#\text{branches}} v_k$$
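Both objective functions are easy to evaluate once branch lengths are given. The Python sketch below uses the four-taxon tree with branches a to e, in which taxa 1 and 2 are joined by a + b and taxa 3 and 4 by d + e. The branch lengths and the perturbed "observed" distances are invented for illustration, and the function names are not from the slides.

```python
def path_lengths(a, b, c, d, e):
    """Path-length distances p_ij on the tree ((1,2),(3,4)) with internal branch c."""
    return {(1, 2): a + b, (1, 3): a + c + d, (1, 4): a + c + e,
            (2, 3): b + c + d, (2, 4): b + c + e, (3, 4): d + e}


def goodness_of_fit(p, dist, power=2, weighted=True):
    """Sum over pairs of w_ij * |p_ij - d_ij|**power, with w_ij = 1/d_ij**2 if weighted, else 1."""
    total = 0.0
    for pair, d_ij in dist.items():
        w = 1.0 / d_ij ** 2 if weighted else 1.0
        total += w * abs(p[pair] - d_ij) ** power
    return total


def tree_length(branches):
    """Minimum-evolution score: the sum of absolute branch lengths."""
    return sum(abs(v) for v in branches)


branches = (0.10, 0.20, 0.05, 0.30, 0.15)            # hypothetical a, b, c, d, e
p = path_lengths(*branches)

additive = dict(p)                                   # perfectly additive distances
noisy = dict(p)
noisy[(1, 3)] += 0.02                                # perturb one observed distance

print("fit to additive distances :", goodness_of_fit(p, additive))
print("Fitch-Margoliash fit      :", goodness_of_fit(p, noisy))
print("unweighted least squares  :", goodness_of_fit(p, noisy, weighted=False))
print("minimum-evolution length  :", tree_length(branches))
```

In practice the branch lengths are themselves estimated for each candidate topology, and the topology with the best score is chosen.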

Distance-based optimality criteria: Minimum evolution and least-squares

[Figure: tree of Lemur catta, Homo sapiens, Pan, Gorilla and Pongo with branch lengths 0.286, 0.044, 0.015, 0.045, 0.050, 0.050 and 0.085.]

Least-Squares

dij      pij      SS
0.39646  0.39021  0.000039
0.39838  0.39602  0.000006
0.09506  0.09507  0.000000
0.37222  0.38084  0.000074
0.11172  0.11011  0.000003
0.11431  0.11592  0.000003
0.37096  0.37096  0.000000
0.18107  0.18894  0.000062
0.19399  0.19475  0.000001
0.18820  0.17958  0.000074
Total             0.000261

Minimum evolution (ME): LS branch lengths

0.28611
0.04436
0.01511
0.04463
0.05044
0.05038
0.08485
Total  0.57588

Recommended