Tree Searching Methods •Exhaustive search (exact) …predrag/classes/2004falli400/swafford.pdfTree...

Tree Searching Methods

• Exhaustive search (exact)

• Branch-and-bound search (exact)

• Heuristic search methods (approximate)

– Stepwise addition

– Branch swapping

– Star decomposition

Exhaustive Search

Searching for trees

• Generation of all possible trees

1. Generate all 3 trees for first 4 taxa:

Searching for trees

2. Generate all 15 trees for first 5 taxa:

(likewise for each of the other two 4-taxon trees)

Searching for trees

3. Full search tree:
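The counts quoted in steps 1 and 2 (3 trees for 4 taxa, 15 for 5) follow from the fact that an unrooted bifurcating tree with k taxa has 2k - 3 branches, and the next taxon can be attached to any one of them. A minimal Python sketch of that count (the function name is illustrative, not taken from any package):

```python
def num_unrooted_trees(n_taxa):
    """Number of distinct unrooted, fully bifurcating trees for n_taxa >= 3."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1  # exactly one unrooted tree exists for 3 taxa
    for k in range(3, n_taxa):
        # A tree with k taxa has 2k - 3 branches; taxon k + 1 can join any of them.
        count *= 2 * k - 3
    return count


for n in (4, 5, 6, 10):
    print(n, num_unrooted_trees(n))
```

The count grows so quickly with the number of taxa that exhaustive search is only practical for small data sets.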

Searching for trees

Branch and bound algorithm:

The search tree is the same as for exhaustive search, with tree lengths for a hypothetical data set shown in boldface type. If a tree lying at a node of this search tree has a length that exceeds the current upper bound on the optimal tree length, this path of the search tree is terminated (indicated by a cross-bar), and the algorithm backtracks and takes the next available path. When a tip of the search tree is reached (i.e., when we arrive at a tree containing the full set of taxa), the tree is either optimal (and hence retained) or suboptimal (and rejected). When all paths leading from the initial 3-taxon tree have been explored, the algorithm terminates, and all most-parsimonious trees will have been identified. Asterisks indicate points at which the current upper bound is reduced. Circled numbers represent the order in which phylogenetic trees are visited in the search tree.
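The procedure described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not PAUP*'s implementation: tree length is computed with Fitch parsimony, trees are nested tuples rooted at taxon 0, and the sequences are hypothetical. Pruning is safe because adding a taxon can never shorten a tree.

```python
def fitch(node, column):
    """Return (state set, number of changes) for one site below this node."""
    if isinstance(node, int):  # a tip: index into the alignment column
        return {column[node]}, 0
    (ls, lc), (rs, rc) = fitch(node[0], column), fitch(node[1], column)
    common = ls & rs
    return (common, lc + rc) if common else (ls | rs, lc + rc + 1)


def tree_length(tree, seqs):
    return sum(fitch(tree, column)[1] for column in zip(*seqs))


def insertions(node, taxon):
    """Yield every subtree obtained by attaching taxon to a branch in or above node."""
    yield (node, taxon)
    if not isinstance(node, int):
        left, right = node
        for new_left in insertions(left, taxon):
            yield (new_left, right)
        for new_right in insertions(right, taxon):
            yield (left, new_right)


def branch_and_bound(seqs):
    best = {"length": float("inf"), "trees": []}

    def extend(tree, next_taxon):
        length = tree_length(tree, seqs)
        if length > best["length"]:  # exceeds the bound: abandon this path
            return
        if next_taxon == len(seqs):  # tip of the search tree: all taxa placed
            if length < best["length"]:  # bound is reduced
                best["length"], best["trees"] = length, []
            best["trees"].append(tree)
            return
        root_taxon, rest = tree
        for new_rest in insertions(rest, next_taxon):
            extend((root_taxon, new_rest), next_taxon + 1)

    extend((0, (1, 2)), 3)  # the single 3-taxon tree
    return best["length"], best["trees"]


seqs = ["AAA", "AAA", "GGA", "GGC", "GGC"]  # hypothetical 5-taxon alignment
print(branch_and_bound(seqs))
```

Starting from an infinite bound is the simplest choice; a bound taken from a quick heuristic search would let the algorithm prune earlier.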

Stepwise Addition (in a nutshell)

Searching for trees

Stepwise addition

A greedy stepwise-addition search applied to the example used for branch-and-bound. The best 4-taxon tree is determined by evaluating the lengths of the three trees obtained by joining taxon D to tree 1 containing only the first three taxa. Taxa E and F are then connected to the five and seven possible locations, respectively, on trees 4 and 9, with only the shortest trees found during each step being used for the next step. In this example, the 233-step tree obtained is not a global optimum. Circled numbers indicate the order in which phylogenetic trees are evaluated in the stepwise-addition search.
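A minimal Python sketch of the greedy search, assuming Fitch parsimony as the length function and nested tuples as the tree representation (both are choices made for this illustration; the sequences are hypothetical). Only the shortest tree at each step is kept, which is why the result is not guaranteed to be a global optimum.

```python
def fitch(node, column):
    """Return (state set, number of changes) for one site below this node."""
    if isinstance(node, int):
        return {column[node]}, 0
    (ls, lc), (rs, rc) = fitch(node[0], column), fitch(node[1], column)
    common = ls & rs
    return (common, lc + rc) if common else (ls | rs, lc + rc + 1)


def tree_length(tree, seqs):
    return sum(fitch(tree, column)[1] for column in zip(*seqs))


def insertions(node, taxon):
    """Yield every subtree obtained by attaching taxon to a branch in or above node."""
    yield (node, taxon)
    if not isinstance(node, int):
        left, right = node
        for new_left in insertions(left, taxon):
            yield (new_left, right)
        for new_right in insertions(right, taxon):
            yield (left, new_right)


def stepwise_addition(seqs, order=None):
    """Add taxa one at a time, keeping only the shortest tree after each step."""
    order = list(range(len(seqs))) if order is None else list(order)
    tree = (order[0], (order[1], order[2]))
    for taxon in order[3:]:
        root_taxon, rest = tree
        candidates = [(root_taxon, r) for r in insertions(rest, taxon)]
        # Ties are broken by taking the first shortest candidate found.
        tree = min(candidates, key=lambda t: tree_length(t, seqs))
    return tree, tree_length(tree, seqs)


seqs = ["AAA", "AAA", "GGA", "GGC", "GGC"]  # hypothetical 5-taxon alignment
print(stepwise_addition(seqs))
```

The default `order` adds taxa as they appear in the matrix; passing a shuffled list gives a random addition sequence.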

Stepwise Addition Variants

• As Is

– add in the order found in the matrix

• Closest

– add the unplaced taxon that requires the smallest increase

• Furthest

– add the unplaced taxon that requires the largest increase

• Simple

– Farris’s (1970) “simple algorithm” uses a set of pairwise reference distances

• Random

– a random permutation of the taxa is used to select the order

Branch swapping: Nearest Neighbor Interchange (NNI)

Branch swapping: Subtree Pruning and Regrafting (SPR)

Branch swapping: Tree Bisection and Reconnection (TBR)

Reconnection limits in TBR

[Figure: example TBR reconnections labeled with their reconnection distances]

In PAUP*, use “ReconLim” to set the maximum reconnection distance

Star-decomposition search

Overview of maximum likelihood as used in phylogenetics

• Overall goal: Find a tree topology (and associated parameter estimates) that maximizes the probability of obtaining the observed data, given a model of evolution

Likelihood(hypothesis) ∝ Prob(data | hypothesis)

Likelihood(tree, model) = k Prob(observed sequences | tree, model)

[not Prob(tree | data, model)]

Computing the likelihood of a single tree

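    1       j       N
(1) C…GGACA…C…GTTTA…C
(2) C…AGACA…C…CTCTA…C
(3) C…GGATA…A…GTTAA…C
(4) C…GGATA…G…CCTAG…C

Likelihood at site j (tip states C, C, A, G), summed over every assignment of states x and y to the two internal nodes, from (A, A) through (T, T):

$L_j = \sum_{x} \sum_{y} \mathrm{Prob}(C, C, A, G, x, y \mid \text{tree, model}), \qquad x, y \in \{A, C, G, T\}$

But use Felsenstein (1981) pruning algorithm

$L = L_1 L_2 \cdots L_N = \prod_{j=1}^{N} L_j$

$\ln L = \ln L_1 + \ln L_2 + \cdots + \ln L_N = \sum_{j=1}^{N} \ln L_j$

Note: PAUP* reports -ln L, so lower -ln L implies higher likelihood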
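The pruning algorithm avoids the explicit sum over internal states by passing conditional likelihoods from the tips toward the root. The sketch below is a minimal illustration under assumptions the slides do not fix: the Jukes-Cantor model, an arbitrary root placement (which does not change the likelihood under a reversible model), hypothetical branch lengths, and a made-up two-site alignment whose first column is the C, C, A, G pattern of site j.

```python
import math

BASES = "ACGT"


def jc_prob(i, j, v):
    """Jukes-Cantor probability of ending in state j after a branch of
    length v (expected substitutions per site) that starts in state i."""
    e = math.exp(-4.0 * v / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def conditional(node, column):
    """For each state x at this node, Prob(tip data below the node | x)."""
    if isinstance(node, int):  # a tip: all weight on its observed base
        return {x: 1.0 if x == column[node] else 0.0 for x in BASES}
    result = {x: 1.0 for x in BASES}
    for child, v in node:  # internal node: sequence of (child, branch length)
        below = conditional(child, column)
        for x in BASES:
            result[x] *= sum(jc_prob(x, y, v) * below[y] for y in BASES)
    return result


def site_likelihood(tree, column):
    cond = conditional(tree, column)
    return sum(0.25 * cond[x] for x in BASES)  # equal base frequencies under JC


def log_likelihood(tree, seqs):
    return sum(math.log(site_likelihood(tree, col)) for col in zip(*seqs))


# Tree ((1,2),(3,4)) with hypothetical branch lengths; taxa are indexed 0..3.
tree = ((((0, 0.1), (1, 0.1)), 0.05), (((2, 0.1), (3, 0.1)), 0.05))
seqs = ["CG", "CG", "AT", "GT"]  # hypothetical two-site alignment
print(-log_likelihood(tree, seqs))  # reported as -ln L, as in PAUP*
```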

Finding the maximum-likelihood tree (in principle)

• Evaluate the likelihood of each possible tree for a given collection of taxa.

• Choose the tree topology which maximizes the likelihood over all possible trees.

Probability calculations require…

• An explicit model of substitution that specifies change probabilities for a given branch length

“Instantaneous rate matrix”

$Q = \begin{pmatrix} \pi_A r_{AA} & \pi_C r_{AC} & \pi_G r_{AG} & \pi_T r_{AT} \\ \pi_A r_{CA} & \pi_C r_{CC} & \pi_G r_{CG} & \pi_T r_{CT} \\ \pi_A r_{GA} & \pi_C r_{GC} & \pi_G r_{GG} & \pi_T r_{GT} \\ \pi_A r_{TA} & \pi_C r_{TC} & \pi_G r_{TG} & \pi_T r_{TT} \end{pmatrix}$

Models: Jukes-Cantor; Kimura 2-parameter; Hasegawa-Kishino-Yano (HKY); Felsenstein 1981, 1984; General time-reversible

$P(\nu) = e^{Q\nu}$

• An estimate of optimal branch lengths in units of expected amount of change ($\nu$ = rate × time)
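To make the relation between the rate matrix and the change probabilities concrete, the sketch below exponentiates a Jukes-Cantor rate matrix with a truncated Taylor series. It is only an illustration: the helper names are invented here, the matrix is scaled (by assumption) so that one unit of branch length equals one expected substitution per site, and production code would call a library matrix-exponential routine instead.

```python
import math


def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(m, terms=40):
    """exp(m) as I + m + m^2/2! + ..., adequate for small matrices with modest entries."""
    n = len(m)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, m)]  # m^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result


def jc_rate_matrix():
    """Jukes-Cantor rate matrix with every off-diagonal rate equal to 1/3."""
    return [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]


def transition_matrix(q, v):
    """Change probabilities for branch length v: the matrix exponential of q * v."""
    return mat_exp([[x * v for x in row] for row in q])


P = transition_matrix(jc_rate_matrix(), 0.5)
closed_form = 0.25 + 0.75 * math.exp(-4.0 * 0.5 / 3.0)  # known JC result for i = j
print(P[0][0], closed_form)
```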

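For example (rows and columns in the order A, C, G, T):

Jukes-Cantor (1969)

$Q = \begin{pmatrix} - & \alpha & \alpha & \alpha \\ \alpha & - & \alpha & \alpha \\ \alpha & \alpha & - & \alpha \\ \alpha & \alpha & \alpha & - \end{pmatrix}$

Kimura (1980) “2-parameter”

$Q = \begin{pmatrix} - & \beta & \alpha & \beta \\ \beta & - & \beta & \alpha \\ \alpha & \beta & - & \beta \\ \beta & \alpha & \beta & - \end{pmatrix}$

Hasegawa-Kishino-Yano (1985)

$Q = \begin{pmatrix} - & \pi_C \beta & \pi_G \alpha & \pi_T \beta \\ \pi_A \beta & - & \pi_G \beta & \pi_T \alpha \\ \pi_A \alpha & \pi_C \beta & - & \pi_T \beta \\ \pi_A \beta & \pi_C \alpha & \pi_G \beta & - \end{pmatrix}$

General Time-Reversible

$Q = \begin{pmatrix} \pi_A r_{AA} & \pi_C r_{AC} & \pi_G r_{AG} & \pi_T r_{AT} \\ \pi_A r_{CA} & \pi_C r_{CC} & \pi_G r_{CG} & \pi_T r_{CT} \\ \pi_A r_{GA} & \pi_C r_{GC} & \pi_G r_{GG} & \pi_T r_{GT} \\ \pi_A r_{TA} & \pi_C r_{TC} & \pi_G r_{TG} & \pi_T r_{TT} \end{pmatrix}$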

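E.g., transition probabilities for HKY and F84:

$$
P_{ij}(t) =
\begin{cases}
\pi_j + \pi_j \left( \dfrac{1}{\Pi_j} - 1 \right) e^{-\mu\nu} + \left( \dfrac{\Pi_j - \pi_j}{\Pi_j} \right) e^{-\mu\nu A} & (i = j) \\[2ex]
\pi_j + \pi_j \left( \dfrac{1}{\Pi_j} - 1 \right) e^{-\mu\nu} - \left( \dfrac{\pi_j}{\Pi_j} \right) e^{-\mu\nu A} & (i \neq j, \text{ transition}) \\[2ex]
\pi_j \left( 1 - e^{-\mu\nu} \right) & (i \neq j, \text{ transversion})
\end{cases}
$$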

A Family of Reversible Substitution Models

[Figure: hierarchy of reversible substitution models, from the most general to the most restricted]

GTR (general time-reversible)

TrN (Tamura-Nei): 3 substitution types (transversions, 2 transition classes)

SYM: equal base frequencies

HKY85 (Hasegawa-Kishino-Yano) and F84 (Felsenstein): 2 substitution types (transitions vs. transversions)

Kimura 3-subst. type: 3 substitution types (transitions, 2 transversion classes)

Felsenstein (1981): single substitution type

Kimura 2-parameter: 2 substitution types (transitions vs. transversions), equal base frequencies

Jukes-Cantor: single substitution type, equal base frequencies

The Relevance of Branch Lengths

[Figure: trees with tip states C and A]

When does maximum likelihood work better than parsimony?

• When you’re in the “Felsenstein Zone”

(Felsenstein, 1978)

In the Felsenstein Zone

Substitution rates:

    A  C  G  T
A   -  5  6  2
C   5  -  3  8
G   6  3  -  1
T   2  8  1  -

Base frequencies: A=0.1 C=0.2 G=0.3 T=0.4

[Figure: simulation tree with branch lengths 0.1, 0.1, 0.8, and 0.8]

In the Felsenstein Zone

[Figure: parsimony and ML-GTR plotted against sequence length (0 to 10000)]

The long-branch attraction (LBA) problem

[Figure: the true phylogeny of 1, 2, 3 and 4]

Pattern type                    Taxon 1  Taxon 4  Taxa 2, 3  Changes required
I = Uninformative (constant)    A        A        A, A       zero on any tree
II = Uninformative              A        G        A, A       one on any tree
III = Uninformative             C        G        A, A       two on any tree
IV = Misinformative             G        G        A, A       two on the true tree

… but the tree that groups taxa 1 and 4 needs only one step
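The step counts for the four pattern types can be checked with a short Fitch-parsimony calculation. The sketch assumes the standard long-branch-attraction setup, in which the true tree groups taxon 1 with 2 and taxon 3 with 4 while taxa 1 and 4 sit on the long branches; the tree shapes and names are written out here for illustration.

```python
def fitch_steps(tree, states):
    """Minimum number of changes for one site on a binary tree (Fitch parsimony)."""
    def visit(node):
        if isinstance(node, str):  # a tip, named by its taxon label
            return {states[node]}, 0
        (ls, lc), (rs, rc) = visit(node[0]), visit(node[1])
        common = ls & rs
        return (common, lc + rc) if common else (ls | rs, lc + rc + 1)
    return visit(tree)[1]


patterns = {
    "I": {"1": "A", "2": "A", "3": "A", "4": "A"},
    "II": {"1": "A", "2": "A", "3": "A", "4": "G"},
    "III": {"1": "C", "2": "A", "3": "A", "4": "G"},
    "IV": {"1": "G", "2": "A", "3": "A", "4": "G"},
}

true_tree = (("1", "2"), ("3", "4"))   # assumed true phylogeny
wrong_tree = (("1", "4"), ("2", "3"))  # groups the two long branches

for name, states in patterns.items():
    print(name, fitch_steps(true_tree, states), fitch_steps(wrong_tree, states))
```

Only pattern IV distinguishes the two trees, and it favors the wrong one, which is why parsimony can be misled when such patterns are common.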

Concerns about statistical properties and suitability of models

(assumptions)

Consistency

If an estimator converges to the true value of a parameter as the amount of data increases toward infinity, the estimator is consistent.

When do both methods fail?

• When there is insufficient phylogenetic signal...

When does parsimony work “better” than maximum likelihood?

• When you’re in the Inverse-Felsenstein (“Farris”) zone

(Siddall, 1998)

Siddall (1998) parameter space

[Figure: parameter space (axis p_b from 0 to 0.75) divided into three regions: both methods do poorly; parsimony has higher accuracy than likelihood; both methods do well]

Parsimony vs. likelihood in the Inverse-Felsenstein Zone

[Figure: parsimony and ML/JC plotted against sequence length (20 to 100,000 sites); branches labeled 15% and 67.5% (expected differences/site)]

Why does parsimony do so well in the Inverse-Felsenstein zone?

True synapomorphy

Apparent synapomorphies actually due to misinterpreted homoplasy

Parsimony vs. likelihood in the Felsenstein Zone

[Figure: parsimony and ML/JC plotted against sequence length (20 to 100,000 sites); two branches labeled 67.5% (expected differences/site)]

From the Farris Zone to the Felsenstein Zone

External branches = 0.5 or 0.05 substitutions/site, Jukes-Cantor model of nucleotide substitution

[Figure: simulation results in two panels, parsimony and likelihood (ML/JC), for 100, 1,000, and 10,000 sites, plotted against the length of the internal branch (d) from 0.05 in the Farris zone through 0 to 0.05 in the Felsenstein zone]

Maximum likelihood models are oversimplifications of reality. If I assume the wrong model, won’t my results be meaningless?

• Not necessarily (maximum likelihood is pretty robust)

Model used for simulation...

Substitution rates:

    A  C  G  T
A   -  5  6  2
C   5  -  3  8
G   6  3  -  1
T   2  8  1  -

Base frequencies: A=0.1 C=0.2 G=0.3 T=0.4

[Figure: simulation tree with branch lengths 0.1, 0.1, 0.8, and 0.8]

Performance of ML when its model is violated (one example)

[Figure: parsimony, ML-JC, ML-K2P, ML-HKY, and ML-GTR plotted against sequence length (100 to 10000)]

Among-site rate heterogeneity

• Proportion of invariable sites

– Some sites don’t change due to strong functional or structural constraint (Hasegawa et al., 1985)

• Site-specific rates

– Different relative rates assumed for pre-assigned subsets of sites

• Gamma-distributed rates

– Rate variation assumed to follow a gamma distribution with shape parameter α

Lemur  AAGCTTCATAG TTGCATCATCCA …TTACATCATCCA
Homo   AAGCTTCACCG TTGCATCATCCA …TTACATCCTCAT
Pan    AAGCTTCACCG TTACGCCATCCA …TTACATCCTCAT
Goril  AAGCTTCACCG TTACGCCATCCA …CCCACGGACTTA
Pongo  AAGCTTCACCG TTACGCCATCCT …GCAACCACCCTC
Hylo   AAGCTTTACAG TTACATTATCCG …TGCAACCGTCCT
Maca   AAGCTTTTCCG TTACATTATCCG …CGCAACCATCCT

equal rates?

Performance of ML when its model is violated (another example)

Modeling among-site rate variation with a gamma distribution...

…can also estimate a proportion of “invariable” sites (pinv)

[Figure: panels for α = 0.5, pinv = 0.5; α = 1.0, pinv = 0.5; and α = 1.0, pinv = 0.2, comparing GTRig, HKYig, GTRg, HKYg, GTRi, HKYi, GTRer, HKYer, and parsimony against sequence length (100 to 100000)]

“MODERATE” Felsenstein zone (α = 1.0, pinv = 0.5)

[Figure: JCer, JC+G, JC+I, JC+I+G, GTRer, GTR+G, GTR+I, GTR+I+G, and parsimony plotted against sequence length (100 to 100000)]

“MODERATE” Inverse-Felsenstein zone

[Figure: JCer, JC+G, JC+I, JC+I+G, GTRer, GTR+G, GTR+I, GTR+I+G, and parsimony plotted against sequence length (100 to 100000)]

Bayesian Inference in Phylogenetics

• Uses Bayes formula:

$\Pr(\theta \mid D) = \dfrac{\Pr(D \mid \theta)\,\Pr(\theta)}{\Pr(D)} \propto \Pr(D \mid \theta)\,\Pr(\theta) \propto L(\theta)\,\Pr(\theta)$

($\theta$ = tree topology, branch lengths, and substitution-model parameters)

• Calculation involves integrating over all tree topologies and model-parameter values, subject to assumed prior distribution on parameters
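For a finite set of hypotheses, Bayes formula reduces to normalizing likelihood times prior. The sketch below applies it to three competing four-taxon topologies with hypothetical log-likelihoods and a flat prior. It assumes each likelihood has already been integrated over branch lengths and model parameters, which is the hard part in practice.

```python
import math


def posterior(log_likelihoods, priors):
    """Posterior probabilities from log-likelihoods and priors over a finite set."""
    peak = max(log_likelihoods.values())  # subtract the maximum to avoid underflow
    weights = {h: math.exp(ll - peak) * priors[h]
               for h, ll in log_likelihoods.items()}
    total = sum(weights.values())  # proportional to Pr(D)
    return {h: w / total for h, w in weights.items()}


# Hypothetical values, for illustration only.
log_l = {"AB|CD": -1000.0, "AC|BD": -1002.0, "AD|BC": -1005.0}
prior = {h: 1.0 / 3.0 for h in log_l}

post = posterior(log_l, prior)
for topology, p in post.items():
    print(topology, round(p, 4))
```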

Bayesian Inference in Phylogenetics

• To approximate this posterior density (complicated multidimensional integral) we use Markov chain Monte Carlo (MCMC)

– Simulated Markov chain in which transition probabilities are assigned such that the stationary distribution of the chain is the posterior density of interest

– E.g., Metropolis-Hastings algorithm: Accept a proposed move from one state θ to another state θ* with probability min(r, 1), where

$r = \dfrac{\Pr(\theta^{*} \mid D)\,\Pr(\theta \mid \theta^{*})}{\Pr(\theta \mid D)\,\Pr(\theta^{*} \mid \theta)}$

– Sample chain at regular intervals to approximate posterior distribution

• MrBayes (by John Huelsenbeck and Fredrik Ronquist) is the most popular Bayesian inference program

A brief intro to Markov chain Monte Carlo (MCMC)

1. Initialize the chain, e.g., by picking a random state X0 (topology, branch lengths, substitution-model parameters) from the assumed prior distribution

2. For each time t, sample a new candidate state Y from some proposal distribution q(.|Xt) (e.g., change branch lengths or topology plus branch lengths)

Calculate acceptance probability

$\alpha(X, Y) = \min\left\{ 1, \dfrac{\Pr(Y \mid D)\, q(X \mid Y)}{\Pr(X \mid D)\, q(Y \mid X)} \right\} = \min\left\{ 1, \dfrac{\pi(Y)}{\pi(X)} \times \dfrac{\Pr(D \mid Y)}{\Pr(D \mid X)} \times \dfrac{q(X \mid Y)}{q(Y \mid X)} \right\}$

3. If Y is accepted, let Xt+1 = Y; otherwise let Xt+1 = Xt

If the chain is run “long enough”, the stationary distribution of states in the chain will represent a good approximation to the target distribution (in this case, the Bayesian posterior)

[Figure: trace of the chain over iterations, with the initial “burn in” period marked and example topologies AC|BD and AB|CD]
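The three steps can be sketched for a toy problem in which the states are three topologies and the target is known only up to a constant. The weights below are hypothetical, the function name is invented for this illustration, and the proposal is symmetric (pick one of the other states uniformly), so the q terms in the acceptance ratio cancel.

```python
import random


def metropolis_hastings(target, states, n_steps, burn_in, seed=0):
    """Estimate a discrete target distribution from unnormalized weights."""
    rng = random.Random(seed)
    current = rng.choice(states)  # step 1: initialize the chain
    counts = {s: 0 for s in states}
    for t in range(n_steps):
        # Step 2: propose a different state; the proposal is symmetric.
        proposal = rng.choice([s for s in states if s != current])
        accept = min(1.0, target[proposal] / target[current])
        # Step 3: move to the proposal if accepted, otherwise stay put.
        if rng.random() < accept:
            current = proposal
        if t >= burn_in:  # discard the initial burn-in samples
            counts[current] += 1
    kept = n_steps - burn_in
    return {s: c / kept for s, c in counts.items()}


weights = {"AB|CD": 6.0, "AC|BD": 3.0, "AD|BC": 1.0}  # hypothetical, unnormalized
freqs = metropolis_hastings(weights, list(weights), n_steps=50000, burn_in=1000)
print(freqs)
```

With these weights the sampled frequencies should settle near 0.6, 0.3, and 0.1. Real phylogenetic samplers also propose changes to branch lengths and model parameters.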

Model-based distances

• Can also calculate pairwise distances based on these models

• These distances estimate the number of substitutions per site that have accumulated since the two sequences shared a common ancestor, allowing for superimposed substitutions (“multiple hits”)

• E.g.:

– Jukes-Cantor distance

– Kimura 2-parameter distance

– General maximum-likelihood distances available for other models
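The first two distances have closed forms. The sketch below uses the standard Jukes-Cantor and Kimura 2-parameter formulas; the function names are illustrative, and the two short sequences are the first blocks of the Lemur and Homo rows from the alignment shown earlier.

```python
import math

PURINES = {"A", "G"}


def proportions(seq1, seq2):
    """Observed proportions of transitions and transversions between aligned sequences."""
    ts = tv = 0
    for x, y in zip(seq1, seq2):
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            ts += 1  # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            tv += 1
    return ts / len(seq1), tv / len(seq1)


def jc_distance(p):
    """Jukes-Cantor distance from the observed proportion p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("JC distance is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def k2p_distance(ts, tv):
    """Kimura 2-parameter distance from transition and transversion proportions."""
    a = 1.0 - 2.0 * ts - tv
    b = 1.0 - 2.0 * tv
    if a <= 0.0 or b <= 0.0:
        raise ValueError("K2P distance is undefined for these proportions")
    return -0.5 * math.log(a) - 0.25 * math.log(b)


ts, tv = proportions("AAGCTTCATAG", "AAGCTTCACCG")
print(jc_distance(ts + tv), k2p_distance(ts, tv))
```

Both corrections return a value larger than the raw proportion of differences, which is the allowance for multiple hits.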

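Observed pairwise distances:

     1    2    3    4
1    -
2    d12  -
3    d13  d23  -
4    d14  d24  d34  -

[Figure: tree for taxa 1, 2, 3, and 4 with terminal branches a, b, d, and e and internal branch c]

Path-length distances on the tree:

p12 = a + b
p13 = a + c + d
p14 = a + c + e
p23 = b + c + d
p24 = b + c + e
p34 = d + e

pij = dij for all i and j if the tree topology is correct and distances are additive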

Distance-based optimality criteria: “Additive trees”

Distances in general will not be additive, so choose optimal tree according to one of the following criteria (objective functions):

“Goodness-of-fit”: minimize $\sum_{i<j} w_{ij} \left| p_{ij} - d_{ij} \right|^{r}$

Typically, r = 2 (least-squares) and $w_{ij} = 1/d_{ij}^{2}$ (“Fitch-Margoliash” method)

“Minimum-evolution”: minimize $\sum_{k=1}^{\#\text{branches}} v_k$ or $\sum_{k=1}^{\#\text{branches}} \left| v_k \right|$

Distance-based optimality criteria: Minimum evolution and least-squares

[Figure: tree including Lemur catta, Homo sapiens, and Gorilla, with branch lengths 0.044, 0.085, 0.286, 0.050, and 0.045]

Least-Squares

pij      dij      SS
0.39646  0.39021  0.000039
0.39838  0.39602  0.000006
0.09506  0.09507  0.000000
0.37222  0.38084  0.000074
0.11172  0.11011  0.000003
0.11431  0.11592  0.000003
0.37096  0.37096  0.000000
0.18107  0.18894  0.000062
0.19399  0.19475  0.000001
0.18820  0.17958  0.000074
Sum               0.000261

Minimum evolution (ME)

LS branch lengths:

0.28611
0.04436
0.01511
0.04463
0.05044
0.05038
0.08485
Sum: 0.57588
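Both scores in this example can be recomputed from the listed numbers. The sketch below uses the ten distance pairs and seven least-squares branch lengths from the slide, with unit weights and r = 2 for the goodness-of-fit sum; the function names are illustrative.

```python
# The ten pairs of distances listed in the least-squares table.
pairs = [
    (0.39646, 0.39021), (0.39838, 0.39602), (0.09506, 0.09507),
    (0.37222, 0.38084), (0.11172, 0.11011), (0.11431, 0.11592),
    (0.37096, 0.37096), (0.18107, 0.18894), (0.19399, 0.19475),
    (0.18820, 0.17958),
]

# Least-squares branch lengths listed under minimum evolution.
branch_lengths = [0.28611, 0.04436, 0.01511, 0.04463, 0.05044, 0.05038, 0.08485]


def goodness_of_fit(pairs, r=2, weights=None):
    """Sum over pairs of w * |p - d|**r; weights default to 1 for every pair."""
    if weights is None:
        weights = [1.0] * len(pairs)
    return sum(w * abs(p - d) ** r for w, (p, d) in zip(weights, pairs))


def minimum_evolution(lengths):
    """Sum of absolute branch lengths."""
    return sum(abs(v) for v in lengths)


print(round(goodness_of_fit(pairs), 6))
print(round(minimum_evolution(branch_lengths), 5))
```

Passing `weights=[1 / d ** 2 for _, d in pairs]` would give the Fitch-Margoliash weighting instead of the unweighted sum.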
