Guide Trees and Progressive Multiple Sequence Alignment

James A. Foster and Luke Sheneman

1 October 2008

INITIATIVE FOR BIOINFORMATICS AND EVOLUTIONARY STUDIES (IBEST)


Multiple Sequence Alignment

Abstract representation of sequence homology

• Homologous molecular characters (nucleotides/residues) organized in columns

• Gaps (-) represent sequence indels

Multiple Sequence Alignment

Many bioinformatics analyses depend on MSA.

First step in inferring phylogenetic trees

• MSA technique is at least as important as inference method and model parameters (Morrison & Ellis, 1997)

Structural and functional sequence analyses

Progressive Alignment

Idea: align “closely related” sequences first, two at a time, with “optimal” subalignments (dynamic programming)

Problem: once a gap, always a gap

Advantage: fast
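The pairwise dynamic programming step can be sketched as a plain Needleman-Wunsch global alignment. This is only an illustration under assumed scoring values (match = 1, mismatch = -1, linear gap = -1); real pMSA tools align profiles rather than single sequences and use substitution matrices and affine gap penalties. The function name and parameters are hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment; returns (score, aligned_a, aligned_b).

    Scoring values are illustrative assumptions, not those of any tool.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = n, m
    out_a, out_b = [], []
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = needleman_wunsch("ACGT", "AGT")
print(score, top, bottom)
```

In a progressive aligner, the gap placed here would be frozen into the subalignment and carried up the guide tree, which is the “once a gap, always a gap” problem.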

Guide Trees and Alignment Quality

How important is it to find “good” guide trees?

How much time should be spent looking for “better” guide trees?

Hypothesis

Guide trees that are closer to the true phylogeny lead to better sequence alignments

Guide trees that are further from the true tree produce less accurate alignments.

The effect is measurable.

The correlation is significant.

Previous Work

Folk wisdom, intuition: it matters, a lot

• Basis for Clustal and most other pMSA implementations

Nelesen et al. (PSB ’08): doesn’t matter, much

• No strong correlation

• No large effect

Edgar (2004): bad trees are sometimes better

• UPGMA guide trees are ultrametric but outperform NJ

Experimental Design: strategy

For both natural data and simulation data, with reliable alignments and phylogenies:

Explore the space of possible guide trees, moving outward from the “true tree”

Use each tree as a guide tree, perform pMSA

Compare quality of resulting alignment with known optimal value

Experimental Design: Naturally Evolved Case

Experimental Design: Degrading Guide Trees

Random Nearest Neighbor Interchange (NNI)

• Swaps two neighboring internal branches

Random Tree Bisect/Reconnect (TBR)

• Randomly bisect tree

• Randomly reconnect the two resulting subtrees

Images: hyphy.org
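A minimal sketch of degrading a guide tree by random NNI steps, assuming a rooted binary tree stored as nested 2-tuples with string leaf labels. The representation and the function names (`nni_neighbors`, `degrade`) are hypothetical and chosen for illustration; they are not the code used in the experiments.

```python
import random


def is_leaf(t):
    return not isinstance(t, tuple)


def nni_neighbors(tree):
    """All rooted binary trees one NNI move away from `tree`."""
    if is_leaf(tree):
        return []
    left, right = tree
    out = []
    # Across the edge to an internal child, swap one of that child's
    # subtrees with the child's sibling.
    if not is_leaf(left):
        a, b = left
        out.append(((a, right), b))
        out.append(((b, right), a))
    if not is_leaf(right):
        c, d = right
        out.append((c, (left, d)))
        out.append((d, (left, c)))
    # Moves located deeper inside either subtree.
    out += [(l, right) for l in nni_neighbors(left)]
    out += [(left, r) for r in nni_neighbors(right)]
    return out


def degrade(tree, steps, rng=random):
    """Apply `steps` random NNI moves (tree needs at least three leaves)."""
    for _ in range(steps):
        tree = rng.choice(nni_neighbors(tree))
    return tree


def leaves(t):
    return [t] if is_leaf(t) else leaves(t[0]) + leaves(t[1])


true_tree = (("A", "B"), ("C", "D"))
print(degrade(true_tree, 5, random.Random(0)))
```

Because random moves can undo one another, the number of steps applied is only an upper bound on the true edit distance from the starting tree.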

TreeBASE (“natural”) Input Datasets

Experimental Design: Simulated Evolution Case

Conclusions

Statistically significant correlation between guide tree quality and alignment quality

• Independent of tree transformation operator

• Independent of alignment distance metric

But very small absolute change in quality

• Non-linear / logarithmic

Largest alignment quality effect 5-10 steps from phylogeny

The lesson: improving an already good guide tree helps; improving any other guide tree helps too, but only a little

Acknowledgements

Dr. Luke Sheneman (mostly his slides!)

Faculty, staff, and students of BCB

• Jason Evans

• Darin Rokyta

Funding sources:

• NIH P20 RR16454

• NIH NCRR 1P20 RR16448

• NSF EPS 00809035

Experimental Design: metrics

Â = pmsa(S, T)

• where S is the set of input sequences

• where T is the guide tree

• (hidden parameters: pairwise algorithm, tie-breaking strategy)

AQ = CompareAlignments(A*, Â)

• QSCORE(A*, Â) -> TC-error, SP-error

• Nelesen had a nicer metric: error of estimated phylogeny

Tdist = TreeDistance(T*, T)

• Upper bound estimate of edit distance via NNI or TBR
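A simplified sketch of the two alignment errors named above, using the usual definitions: SP is the fraction of residue pairs aligned in the reference A* that are also aligned in the estimate Â, and TC is the fraction of reference columns reproduced exactly. This is not QSCORE's implementation; the function names are hypothetical, and it assumes both alignments list the same sequences in the same order and that A* contains at least one aligned pair.

```python
from itertools import combinations


def columns(alignment):
    """Turn gapped rows into columns of (sequence index, residue index) cells."""
    counters = [0] * len(alignment)
    cols = []
    for c in range(len(alignment[0])):
        col = []
        for s, row in enumerate(alignment):
            if row[c] != "-":
                col.append((s, counters[s]))
                counters[s] += 1
        cols.append(frozenset(col))
    return cols


def aligned_pairs(cols):
    """All residue pairs that share a column."""
    return {p for col in cols for p in combinations(sorted(col), 2)}


def sp_tc_error(reference, estimate):
    """Return (SP-error, TC-error) of `estimate` relative to `reference`."""
    ref_cols, est_cols = columns(reference), columns(estimate)
    ref_pairs, est_pairs = aligned_pairs(ref_cols), aligned_pairs(est_cols)
    sp = len(ref_pairs & est_pairs) / len(ref_pairs)
    # Only reference columns with at least two residues are scored for TC.
    scored = [c for c in ref_cols if len(c) > 1]
    est_set = set(est_cols)
    tc = sum(c in est_set for c in scored) / len(scored)
    return 1.0 - sp, 1.0 - tc


a_star = ["ACGT", "A-GT"]
a_hat = ["ACGT", "AG-T"]
print(sp_tc_error(a_star, a_hat))
```

In this toy case the misplaced gap breaks one of three reference pairs and one of three scored reference columns, so both errors are one third.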

Alternative Scoring metric

Idea: “quality” of an alignment is distance from the phylogeny it produces to the “true” phylogeny

AQ = KTreeDist(ML_est(A*), ML_est(Â))

• ML_est(A): maximum likelihood estimate of the phylogeny behind MSA A (we used RAxML)

• KTreeDist(T1, T2): scales T2 to T1, then measures Branch Length Distance (Soria-Carrasco et al. 07; Kuhner & Felsenstein 94)

Data sets: from L1 sequences in mammals, bats, humans, hand aligned A*
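A rough sketch of the K tree distance idea, assuming each tree has already been reduced to a mapping from bipartition (any hashable label) to branch length. The comparison tree is rescaled by the factor K that minimizes the Branch Length Distance to the reference, and the distance is then computed on the rescaled lengths. The function name and input format are hypothetical; this is not the Ktreedist program itself, and extracting bipartitions from tree files is left out.

```python
import math


def k_tree_score(reference, comparison):
    """Return (K, distance) for two {bipartition: branch_length} mappings.

    A bipartition missing from one tree counts as a branch of length 0.
    Assumes the comparison tree has at least one nonzero branch length.
    """
    keys = sorted(set(reference) | set(comparison))
    a = [reference.get(k, 0.0) for k in keys]
    b = [comparison.get(k, 0.0) for k in keys]
    # Least-squares scale factor applied to the comparison tree.
    k = sum(x * y for x, y in zip(a, b)) / sum(y * y for y in b)
    # Branch Length Distance after rescaling.
    dist = math.sqrt(sum((x - k * y) ** 2 for x, y in zip(a, b)))
    return k, dist


ref = {"AB|CD": 0.2, "A|BCD": 0.1}
same_shape_doubled = {"AB|CD": 0.4, "A|BCD": 0.2}
print(k_tree_score(ref, same_shape_doubled))
```

Two trees that differ only by a uniform rate change score close to zero, so the measure reflects differences in topology and relative branch lengths rather than overall scale.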

All methods are pretty good

[Figure: K tree distance (ML to true), scale 0.000 to 1.000, for data sets hard, Megabat, XIST, and Misc; methods: Clustal, Mafft (f), Mafft (s), Muscle (f), Muscle (s)]
