Practical multiple sequence algorithms. Sushmita Roy, BMI/CS 576, www.biostat.wisc.edu/bmi576/, sroy@biostat.wisc.edu, Sep 23rd, 2014


Page 1: Practical multiple sequence algorithms Sushmita Roy BMI/CS 576  Sushmita Roy sroy@biostat.wisc.edu Sep 23rd, 2014

Practical multiple sequence algorithms

Sushmita Roy
BMI/CS 576
www.biostat.wisc.edu/bmi576/
sroy@biostat.wisc.edu
Sep 23rd, 2014


RECAP

• Scores for multiple sequence alignment
  – Sum of pairs
  – Minimum entropy based

• Heuristic algorithms for performing multiple sequence alignment
  – Progressive
    • Star alignment
    • Guide tree-based
      – ClustalW
  – Iterative
    • MUSCLE


Goals for today

• General description of iterative algorithms
• A practical implementation
  – MUSCLE


Iterative algorithms for multiple sequence alignment

• Key idea: revisit the alignments
• Algorithms vary depending on exactly how the alignments change between iterations


Simple iterative algorithm (also called the Barton-Sternberg alignment algorithm)

1. Align the two sequences with the highest alignment score using standard dynamic programming techniques for pairwise alignment

2. Repeat until all sequences are in the alignment
  – Find the sequence most similar to the current alignment
  – Add it to the alignment

3. For each sequence xi
  – Remove xi from the alignment and re-align it to the partial alignment of {x1, ..., xn} \ {xi}

• Repeat step 3 until the score does not improve OR we have executed a fixed number of steps
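Step 3 and its stopping rule can be sketched as a short Python loop. This is a minimal sketch, not Barton and Sternberg's implementation: `realign_row` and `score` are hypothetical callables supplied by the caller (for example a sequence-to-profile aligner and a sum-of-pairs score), `max_rounds` is an assumed name for the fixed number of steps, and the two toy helpers exist only to make the loop runnable.

```python
def refine(alignment, realign_row, score, max_rounds=10):
    """Repeat step 3 until the score stops improving or max_rounds is reached.

    alignment   -- list of equal-length gapped rows
    realign_row -- hypothetical callable (alignment, i) -> new alignment in
                   which row i was removed and re-aligned to the other rows
    score       -- hypothetical callable alignment -> number (higher is better)
    """
    best_score = score(alignment)
    for _ in range(max_rounds):
        candidate = list(alignment)
        for i in range(len(candidate)):   # step 3: visit every sequence x_i
            candidate = realign_row(candidate, i)
        new_score = score(candidate)
        if new_score <= best_score:       # no improvement: stop
            break
        alignment, best_score = candidate, new_score
    return alignment, best_score


# Toy stand-ins (assumptions, not real aligners or scoring functions).
def left_justify(rows, i):
    """'Re-align' row i by stripping its gaps and padding on the right."""
    width = len(rows[i])
    new_rows = list(rows)
    new_rows[i] = rows[i].replace("-", "").ljust(width, "-")
    return new_rows


def identical_columns(rows):
    """Count columns in which every row carries the same character."""
    return sum(1 for col in zip(*rows) if len(set(col)) == 1)


result, final_score = refine(["ACT-", "-ACT", "ACT-"], left_justify, identical_columns)
```

With the toy helpers, the first pass lines up all three rows and the second pass brings no improvement, so the loop stops after two rounds.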


MUSCLE: Multiple Sequence Comparison by log-expectation

• Progressive + iterative
• Has three main stages
• Stage 1: Draft progressive
• Stage 2: Improved progressive
• Stage 3: Refinement
  – Select pairs of subtrees and re-align the alignments for the subtrees
  – Keep the result if it improves the alignment
• Each stage returns an alignment
  – Could be terminated anywhere


Steps in MUSCLE

Stage 1: Draft progressive

Stage 2: Improved progressive

Stage 3: Refinement


MUSCLE Stage 1

1.1 Compute k-mer distance matrix

1.2 Use UPGMA to make tree (TREE1) (We will see this in a bit)

1.3 Use guide tree to make first MSA
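Step 1.2 can be illustrated with a compact UPGMA (average-linkage) sketch in Python. It is a minimal sketch under stated assumptions, not MUSCLE's code: the tree is returned as nested tuples, distances are passed in a dict keyed by `frozenset` pairs, and the function name `upgma` and the three-sequence example are made up for illustration.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a rooted tree as nested tuples by average-linkage clustering.

    names -- list of sequence labels
    dist  -- dict mapping frozenset({a, b}) to the distance between a and b
    """
    sizes = {name: 1 for name in names}   # cluster -> number of leaves in it
    d = dict(dist)
    while len(sizes) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        # Distance to every other cluster is the size-weighted average.
        for c in sizes:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes))


example = {
    frozenset(("A", "B")): 0.2,
    frozenset(("A", "C")): 0.6,
    frozenset(("B", "C")): 0.8,
}
tree = upgma(["A", "B", "C"], example)
```

In the example, A and B are merged first because they are closest, and C joins them at the root.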


K-mer distance D

• K-mer distance is defined from common fractional k-mer count (F)

• For two sequences x and y

• D=1-F


K-mer distance example

Sequence      k=2-mers
x = AKFLA     AK, KF, FL, LA
y = LKFLFL    LK, KF, FL, LF, FL

K-mer (τ)   nx(τ)   ny(τ)   min(nx(τ), ny(τ))
AK          1       0       0
KF          1       1       1
FL          1       2       1
LA          1       0       0
LK          0       1       0
LF          0       1       0
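The example can be checked with a few lines of Python. The slide's formula for F did not survive extraction, so this sketch assumes the definition from Edgar (2004): F = Σ_τ min(nx(τ), ny(τ)) / (min(Lx, Ly) − k + 1), where Lx and Ly are the sequence lengths. The function names are illustrative.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(x, y, k):
    """Return D = 1 - F, with F as assumed in the lead-in (Edgar, 2004)."""
    nx, ny = kmer_counts(x, k), kmer_counts(y, k)
    shared = sum(min(nx[t], ny[t]) for t in nx)   # k-mers absent from y add 0
    denominator = min(len(x), len(y)) - k + 1     # k-mers in the shorter sequence
    return 1 - shared / denominator


d = kmer_distance("AKFLA", "LKFLFL", 2)
```

For the two sequences above, the minima sum to 2 (KF and FL) and the shorter sequence has 4 2-mers, so F = 0.5 and D = 0.5.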


Stage 2: Improved progressive

2.1 Recompute the similarity of each pair of sequences from their mutual alignment in the MSA

2.2 Construct a phylogenetic tree (TREE2) using an alignment-based distance

2.3 Build a new progressive alignment only for subtrees where branching order has changed between TREE1 and TREE2

2.4 Repeat 2.3 until number of “reordered nodes” does not decrease.


Stage 2.1. Recomputing pairwise sequence similarity from a multiple alignment

An MSA

-TGTTAAC
-TGT-AAC
-TGT--AC
ATGT---C
ATGT-GGC

Derived pairwise alignment    Fraction identity

TGTTAAC
TGT-AAC                       6/7

TGTTAAC
TGT--AC                       5/7

-TGTTAAC
ATGT---C                      4/8

-TGTTAAC
ATGT-GGC                      4/8

Exclude columns that are gaps in both sequences
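The fractions above can be reproduced with a short Python sketch. It assumes, as the example suggests, that a column counts toward the denominator unless both rows have a gap there; `fraction_identity` is an illustrative name.

```python
from itertools import combinations


def fraction_identity(row_a, row_b, gap="-"):
    """Return (matches, columns) for two rows taken from the same MSA."""
    # Drop columns where both rows have a gap; keep everything else.
    columns = [(a, b) for a, b in zip(row_a, row_b) if not (a == gap and b == gap)]
    matches = sum(1 for a, b in columns if a == b)
    return matches, len(columns)


msa = ["-TGTTAAC", "-TGT-AAC", "-TGT--AC", "ATGT---C", "ATGT-GGC"]
pairs = {
    (i, j): fraction_identity(msa[i], msa[j])
    for i, j in combinations(range(len(msa)), 2)
}
```

The four pairs involving the first row give 6/7, 5/7, 4/8 and 4/8, matching the slide.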


Stage 2.2: Phylogenetic tree creation

Construct a phylogenetic tree using a Kimura distance

D: fractional identity of sequences
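The slide's formula did not survive extraction. The sketch below assumes the standard Kimura correction d = −ln(1 − p − p²/5), where p is the fraction of differing aligned positions, taken here as 1 minus the fractional identity from Stage 2.1. That reading of D, and the function name, are assumptions rather than something the slide states.

```python
import math


def kimura_distance(identity):
    """Kimura-corrected distance from a fractional identity in [0, 1]."""
    p = 1.0 - identity                 # assumed: fraction of differing positions
    inner = 1.0 - p - p * p / 5.0
    if inner <= 0.0:                   # correction is undefined for very divergent pairs
        return math.inf
    return -math.log(inner)


d_close = kimura_distance(6 / 7)       # the 6/7 pair from Stage 2.1
```

The correction leaves identical sequences at distance 0 and stretches larger differences, to account for multiple substitutions at the same site.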


Stage 2.3 Re-align only when branching order is changed

Branching order same

Branching order different: x branches before v

Recompute alignment for these nodes
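One way to find the nodes to recompute is to compare subtrees of the two guide trees. The sketch below is a simplification, not MUSCLE's actual bookkeeping: trees are nested tuples, an internal node of TREE2 is treated as unchanged only if a subtree with the same topology (ignoring left/right order) exists in TREE1, and all function names are hypothetical.

```python
def canonical(tree):
    """Order-independent form: a leaf is its label, an internal node a frozenset."""
    if isinstance(tree, tuple):
        return frozenset(canonical(child) for child in tree)
    return tree


def internal_nodes(tree):
    """Yield every internal node (tuple) of a nested-tuple tree, root included."""
    if isinstance(tree, tuple):
        yield tree
        for child in tree:
            yield from internal_nodes(child)


def nodes_to_realign(tree1, tree2):
    """Internal nodes of tree2 whose subtree does not occur in tree1."""
    kept = {canonical(node) for node in internal_nodes(tree1)}
    return [node for node in internal_nodes(tree2) if canonical(node) not in kept]


tree1 = (("A", "B"), ("C", "D"))
tree2 = ((("A", "B"), "C"), "D")
changed = nodes_to_realign(tree1, tree2)
```

Here the (A, B) subtree is shared by both trees and keeps its alignment, while the two nodes above it in TREE2 are new and would be re-aligned.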


Stage 3: Iterative Refinement

3.1 Delete an edge
3.2 Extract profiles from subtrees
3.3 Re-align profiles
3.4 Update MSA if its score is better than current MSA


3.1 Selecting a branch

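• Select a branch in order of decreasing distance from the root

[Figure: guide tree with branches numbered 1 to 6. Leaves: MQTIF, LHIW, LQSW, LSF. Internal-node alignments: MQTIF / LH-IW and LQSW / L-SW. Root alignment: MQTIF / LH-IW / LQS-W / L-S-W]

Branch selection order: 1,2,3,4,5,6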
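The visiting order can be sketched in Python for a guide tree stored as nested tuples. This sketch assumes that "distance from the root" means the number of edges rather than branch lengths, identifies each branch by the node below it, and does not reproduce the figure's numbering. The function name and the four-leaf tree are illustrative.

```python
def branches_by_depth(tree):
    """List branches (named by their child node), deepest first."""
    branches = []

    def walk(node, depth):
        if isinstance(node, tuple):
            for child in node:
                branches.append((depth + 1, child))
                walk(child, depth + 1)

    walk(tree, 0)
    # Sort on depth only; the sort is stable, so ties keep traversal order.
    branches.sort(key=lambda item: item[0], reverse=True)
    return [child for _, child in branches]


order = branches_by_depth(((("A", "B"), "C"), "D"))
```

A rooted binary tree over four sequences has six branches, which is why the slide numbers them 1 to 6.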


3.2 Extracting a profile

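[Figure: the guide tree from 3.1 with branch 2 deleted, which separates LHIW from the other three sequences. Current MSA: MQTIF / LH-IW / LQS-W / L-S-W]

Delete branch 2

Extract profiles for the two subtrees: LHIW and MQTIF / LQS-W / L-S-W

Re-align profiles for subtrees: LHI-W / MQTIF / LQS-W / L-S-W

Is score better?
  – yes: Keep new alignment
  – otherwise: Discard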
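Profile extraction can be sketched as selecting rows from the current MSA and dropping columns that are gaps in every selected row. This is a minimal sketch: the dict layout, the label `LSW` and the function name `extract_profile` are assumptions made for illustration.

```python
def extract_profile(msa, names, gap="-"):
    """Rows of `names` taken from `msa`, with all-gap columns removed."""
    rows = [msa[name] for name in names]
    keep = [j for j, col in enumerate(zip(*rows)) if any(c != gap for c in col)]
    return {name: "".join(msa[name][j] for j in keep) for name in names}


# Current MSA from the slide, keyed by hypothetical sequence labels.
msa = {"MQTIF": "MQTIF", "LHIW": "LH-IW", "LQSW": "LQS-W", "LSW": "L-S-W"}

single = extract_profile(msa, ["LHIW"])
rest = extract_profile(msa, ["MQTIF", "LQSW", "LSW"])
```

Deleting branch 2 gives the single-sequence profile LHIW, whose gap column disappears, and a three-row profile that keeps all five columns. The two profiles are then re-aligned and the result is kept only if it scores better.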


Summary of MUSCLE

• Three-stage algorithm
• Stage 1: Draft progressive
  – k-mer distance
  – UPGMA tree (TREE1)
  – Guide tree based alignment (MSA1)
• Stage 2: Improved progressive
  – Distance derived from MSA1
  – UPGMA tree (TREE2)
  – Redo alignment for nodes with changed orderings
  – Repeat until number of re-ordered nodes does not change
• Stage 3: Iterative refinement
  – Generate subtree profiles
  – Realign profiles
  – Keep realignment if of higher score
  – Repeat until no more improvement or fixed number of steps
• MUSCLE-fast: Stage 1
• MUSCLE-p: Stage 1 and 2

Note different convergence criteria in Stages 2 and 3


Accuracy scores of different MSA algorithms on benchmark datasets

Edgar, 2004, BMC Bioinformatics

Accuracy measures the fraction of residues correctly aligned with the reference alignment


Run time of different MSA algorithms


Summary of algorithms

• ClustalW
  – Lots of heuristics for gaps
  – One guide tree and then alignment
  – Weights sequences
  – Dynamically selects scoring matrix depending upon sequence identity

• MUSCLE
  – Three-stage algorithm: Draft, Improved, Iterative refinement
  – Two guide trees
  – Uses k-mer distance for first tree
  – Selectively re-aligns using second tree
  – Refines iteratively by working on subtree-associated alignments
  – Fast and has as good or better quality alignments


How do MUSCLE and CLUSTALW work in practice?

• Consider coding sequences of 15 yeast species
• Consider promoter sequences of 15 yeast species
• Align with MUSCLE and CLUSTALW


Protein sequence alignment

MUSCLE

CLUSTALW


Promoter sequence alignment

MUSCLE

CLUSTALW


Comparing alignment of promoters to shuffled sequences in CLUSTALW

Original sequences

Shuffled sequences


Comparing alignment of promoters to shuffled sequences in MUSCLE

Original sequences

Shuffled sequences


Conclusion

• Algorithms seemed similar for protein/coding sequences

• Algorithms gave different alignments for DNA sequences
  – Possibly DNA sequences are harder to align
  – DNA sequences in non-coding regions are even harder to align


Summary of sequence alignment

• Pairwise alignment
  – Algorithms
    • Global: Needleman-Wunsch
    • Local: Smith-Waterman
    • Heuristic search to align a large number of sequences
      – BLAST

• Multiple sequence alignment
  – Star alignment
  – Progressive alignment with guide tree: CLUSTALW
  – Progressive + Iterative alignment with guide tree: MUSCLE