February 4 – 8, 2008 Centers for Disease Control, Atlanta, GA Workshop on Molecular Evolution: Special session on Phylogenetics

More data yields stronger analyses, if done carefully! The patterns of conservation become ever clearer by comparing the conserved portions of sequences amongst a larger and larger dataset. Mosaic ideas and evolutionary ‘importance.’

Multiple Sequence Alignment & Analysis with SeaView and MAFFT

Steven M. Thompson

Florida State University School of Computational Science (SCS)

Tuesday, February 5, 2008, 2 to 6 PM

My lecture’s outline

The Why. Applications: molecular phylogenetics; primer design and graphics; homology based inference.

The How. Dynamic programming with just two sequences: the recursion and scoring matrices.

The When. Significance and the extreme value distribution: the Expectation value and homology.

The How again. Multiple sequence dynamic programming; the algorithm and some of the variants: Clustal, Muscle, ProbCons, T-Coffee, and MAFFT. Do it on the Web, your own computer, a server.

Issues. Coding DNA versus protein sequences. Reliability and all the complications involved.

How to cope. SeaView: editing, visualization, and analysis.

Before proceeding I need to remind you that the manuscript that we’ll be using for the tutorial this afternoon after I finish ‘yacking’ has most of this talk in greater detail and with all references (versus the PDF from the slides).

First off, why even bother? Applicability:

Molecular evolutionary analysis; plus

Probe/primer, and motif/profile design;

Graphical illustrations; and

Comparative ‘homology’ inference.

OK, here are some examples.

Molecular evolution and phylogenetics

We all know multiple sequence alignments are necessary for phylogenetic inference, but does everybody here truly realize that the absolute positional homology of every column in a data matrix passed on to these programs is the most critical assumption that all the algorithms make (but see Bayesian coestimation)!

And what about this other stuff?

Multiple sequence alignments can be indispensable for primer design when you don’t have data on a particular taxon, yet data are available in related taxa. The conservation and variability within an alignment can help guide the design of universal or species-specific primers.

Here’s an HPV L1 example

The ellipses show areas where PCR primers could differentiate the Type 16 clade from its closest relatives: areas of high L1 conservation in the Type 16 clade (red line) that correspond to areas of much weaker conservation in the others (blue line).

Motif and profile definition

An alignment of human SRY/SOX proteins illustrates the conservation of the HMG box. Conserved regions can be visualized with a sliding window approach and appear as peaks. Motifs and (better yet) HMM profiles can be created of the region to be used as a search tool to find other HMG box proteins.
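The sliding window idea can be sketched in a few lines of Python. This is only a minimal illustration, assuming a simple "fraction of sequences sharing the most common residue" measure of conservation; the function names and the toy sequences are made up for the example and are not taken from any particular plotting program.

```python
from collections import Counter

def column_conservation(alignment):
    """Fraction of sequences sharing the most common non-gap residue, per column."""
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r not in "-."]
        if not residues:
            scores.append(0.0)
            continue
        most_common_count = Counter(residues).most_common(1)[0][1]
        scores.append(most_common_count / len(column))
    return scores

def sliding_window(scores, window=3):
    """Average the per-column scores over a window; peaks mark conserved regions."""
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]

# Made-up, already aligned toy sequences (hypothetical data).
toy = ["MKRPLNA", "MKRPMNG", "AKRPLTS"]
profile = sliding_window(column_conservation(toy), window=3)
for start, value in enumerate(profile, start=1):
    print(start, round(value, 3))
```

The window centered on the fully conserved "KRP" columns gives the highest value, which is the kind of peak the plot shows for the HMG box.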

HMG box

One picture’s worth . . .

The HMG-box domain is strikingly conserved amongst the otherwise nearly unalignable human DNA regulatory paralogous protein family.

Structure/function homology inference

A Swiss-Model homology based model of Giardia EF1α superimposed over its eight most similar sequences with solved structure. Amazingly accurate structure/function inferences are often possible using comparative methods.

OK, so alignment is worthwhile. One way to ‘see’ an alignment between two sequences is a dot plot, but how do we calculate the ‘best’ alignment?

So, first, let’s review pairwise alignment

Brute force just won’t work, complexity $O(\sim N^{4N})$.

Dynamic programming reduces the complexity of this to $O(\sim N^{2})$.

An optimal alignment is defined as an arrangement of two sequences, 1 of length i and 2 of length j, such that:

$$
S_{ij} = s_{ij} + \max
\begin{cases}
S_{i-1,\,j-1} & \text{or} \\
\max\limits_{2 < x < i} \left( S_{i-x,\,j-1} + w_{x-1} \right) & \text{or} \\
\max\limits_{2 < y < j} \left( S_{i-1,\,j-y} + w_{y-1} \right)
\end{cases}
$$

where $S_{ij}$ is the score for the alignment ending at i in sequence 1 and j in sequence 2,

$s_{ij}$ is the score for aligning i with j,

$w_x$ is the score for making an x long gap in sequence 1,

$w_y$ is the score for making a y long gap in sequence 2,

allowing gaps to be any length in either sequence.

Usually an affine penalty is used: total = ([length of gap] * [gap extension penalty]) + gap opening penalty, i.e. y = mx + b.
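As a small worked example of the affine penalty (the y = mx + b form above), here is a hedged Python sketch. The function name is hypothetical, and the opening and extension values of 10 and 1 in the second call are illustrative numbers only, not defaults from any program.

```python
def affine_gap_penalty(gap_length, gap_open, gap_extend):
    """total = (length of gap * gap extension penalty) + gap opening penalty."""
    if gap_length <= 0:
        return 0  # no gap, no penalty
    return gap_length * gap_extend + gap_open

# Opening penalty of zero and extension penalty of one:
print(affine_gap_penalty(3, gap_open=0, gap_extend=1))   # 3

# Illustrative values only: opening a gap costs much more than extending it.
print(affine_gap_penalty(3, gap_open=10, gap_extend=1))  # 13
```

With a large opening penalty one long gap is cheaper than several short ones, which is usually the more believable biological outcome.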

An illustration of a simplified DP alignment example

total penalty = gap opening penalty {zero here} + ([length of gap] * [gap extension penalty {one here}])

      c  T  A  T  A  t  A  a  g  g
   c  1  0  0  0  0  0  0  0  0  0
   g  0  0  0  0  0  0  0  0  1  1
   T  0  1  0  1  0  1  0  0  0  0
   A  0  0  1  0  1  0  1  1  0  0
   t  0  1  0  1  0  1  0  0  0  0
   A  0  0  1  0  1  0  1  1  0  0
   a  0  0  1  0  1  0  1  1  0  0
   T  0  1  0  1  0  1  0  0  0  0
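The match matrix for this example can be rebuilt with a short Python sketch. It assumes what the table itself suggests: upper and lower case letters count as the same base, a match scores 1 and a mismatch 0. The function name is hypothetical.

```python
def match_matrix(top, side):
    """1 where the residues match (ignoring case), 0 otherwise."""
    return [[1 if s.upper() == t.upper() else 0 for t in top] for s in side]

top = "cTATAtAagg"   # sequence written across the top of the matrix
side = "cgTAtAaT"    # sequence written down the side

matrix = match_matrix(top, side)
print("  " + " ".join(top))
for residue, row in zip(side, matrix):
    print(residue + " " + " ".join(str(value) for value in row))
```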

Optimum Alignments

There may be more than one best path through the matrix (and optimum doesn’t guarantee biologically correct). Starting at the top and working down, then tracing back, the two best trace-back routes define the following two alignments:

   cTATAtAagg          cTATAtAagg
   |  |||||     and       |||||
   cg.TAtAaT.          .cgTAtAaT.

With the example’s scoring scheme these alignments have a score of 5, the highest bottom-right score in the trace-back path graph, and the sum of six matches minus one interior gap. This is the number optimized by the algorithm, not any type of a similarity or identity percentage, here 75% and 62% respectively! Software will report only one optimal solution.

This was a Needleman Wunsch global solution. Smith Waterman style local solutions use negative numbers in the match matrix and pick the best diagonal within the overall graph.
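Here is a minimal Python sketch of the dynamic programming pass for this example, scoring only (no trace-back). It rests on a few assumptions read off the example rather than stated by any program: match = 1, mismatch = 0, gap opening = 0, gap extension = 1, case is ignored, and gaps hanging off either end of either sequence are not penalized (which is why the second alignment, with five matches and only end gaps, also scores 5). The function name is hypothetical.

```python
def end_gap_free_score(seq1, seq2, match=1, mismatch=0, gap_extend=1):
    """Best alignment score with unpenalized end gaps (opening penalty of zero)."""
    a, b = seq1.upper(), seq2.upper()
    rows, cols = len(a) + 1, len(b) + 1
    # First row and column stay at 0: leading end gaps are free (assumption).
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,    # align the two residues
                score[i - 1][j] - gap_extend,  # gap in sequence 2
                score[i][j - 1] - gap_extend,  # gap in sequence 1
            )
    # Trailing end gaps are free too: take the best cell on the last row or column.
    return max(max(score[-1]), max(row[-1] for row in score))

best = end_gap_free_score("cTATAtAagg", "cgTAtAaT")
print(best)  # 5
```

Note that the function returns one number, not the two alignments: recovering them needs the trace-back step, and any tie between paths has to be broken somehow, which is why software reports only one solution.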

So, significance: when is any alignment worth anything biologically?

An old statistics trick: Monte Carlo simulations:

$$
Z\ \text{score} = \frac{(\text{actual score}) - (\text{mean of randomized scores})}{\text{standard deviation of randomized score distribution}}
$$

So, the previous solutions only get a Z score of 1.1 in spite of their seemingly high percent identities!
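The Monte Carlo trick can be sketched in Python as follows. This is an illustration under stated assumptions: the scoring function is a toy (count of identical positions in an ungapped comparison), the sample standard deviation is used, and the number of shuffles and the random seed are arbitrary. It will not reproduce the 1.1 quoted above, which came from a real alignment program; all function names are hypothetical.

```python
import random
import statistics

def z_score(actual_score, randomized_scores):
    """(actual - mean of randomized) / standard deviation of randomized."""
    mean = statistics.mean(randomized_scores)
    spread = statistics.stdev(randomized_scores)  # sample std dev (assumption)
    return (actual_score - mean) / spread

def identity_score(a, b):
    """Toy score: identical positions in an ungapped, case-blind comparison."""
    return sum(1 for x, y in zip(a.upper(), b.upper()) if x == y)

def monte_carlo_z(a, b, score=identity_score, trials=200, seed=0):
    """Score a against b, then against shuffled copies of b, and report Z."""
    rng = random.Random(seed)
    letters = list(b)
    randomized = []
    for _ in range(trials):
        rng.shuffle(letters)  # keeps composition, destroys order
        randomized.append(score(a, "".join(letters)))
    return z_score(score(a, b), randomized)

print(z_score(5, [1, 2, 3]))  # 3.0
print(monte_carlo_z("cTATAtAagg", "cgTAtAaT"))
```

Shuffling keeps the length and composition of the sequence and only destroys the order, so the randomized scores show what that composition gives you by chance alone.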

Independent of optimum, what is a ‘good’ alignment?

And initially ‘we’ thought this was a Normal (Gaussian) distribution. Now we know that it is actually an Extreme Value distribution, the distribution of maximum scores, not the distribution of mean scores.

Based on this known statistical distribution, and robust statistical methodology, a realistic Expectation function, the E Value, can be calculated from database searches.

The ‘take-home’ message is . . .

‘Sequence-space’ follows the Extreme Value distribution (http://mathworld.wolfram.com/ExtremeValueDistribution.html).

The Expectation Value!

The higher the E value is, the more probable that the observed match is due to chance in a search of the same size database, and the lower its Z score will be, i.e. it is NOT significant. Therefore, the smaller the E value, i.e. the closer it is to zero, the more significant it is and the higher its Z score will be! The E value is the number that really matters.

Also see http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
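For a feel of how an E value behaves, here is a hedged Python sketch of the BLAST-style Karlin-Altschul form, E = K m n exp(-lambda S), the kind of formula the statistics tutorial linked above walks through. The K and lambda values, the query length, and the database size below are illustrative placeholders only; real programs estimate the parameters for the scoring system in use, and other search programs calculate their E values differently.

```python
import math

def e_value(score, query_length, database_length, K=0.134, lam=0.318):
    """Expected number of chance hits scoring at least `score`.

    K and lam are illustrative placeholders, not values to rely on.
    """
    return K * query_length * database_length * math.exp(-lam * score)

def p_value(e):
    """Probability of at least one chance hit, given an E value."""
    return 1.0 - math.exp(-e)

# Hypothetical search: a 300 residue query against 40 million database residues.
for raw_score in (40, 60, 80, 100):
    e = e_value(raw_score, 300, 40_000_000)
    print(raw_score, e, p_value(e))
```

The E value drops exponentially as the score rises and grows linearly with database size, which is why the same alignment gets a worse E value in a bigger database.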

And how does this relate to homology?

Significant ‘enough’ similarity implies homology, insignificant similarity does not negate homology.

And remember what “homology” really means (W. Fitch joke)!

The Z score represents the number of standard deviations some particular alignment is from a distribution of random alignments (often the Normal distribution).

They very roughly correspond to the listed E Values (based on the Extreme Value distribution) for a typical protein sequence similarity search through a database with ~125,000 protein entries.

What about proteins? Conservative replacements and similarity as opposed to identity. The nitrogenous bases, A, C, T, G, are either the same or they’re not, but amino acids can be similar, genetically, evolutionarily, and structurally! BLOSUM62 table:

Positive identity values range from 4 to 11 and negative values for those substitutions that rarely occur go as low as –4. The most conserved residue is tryptophan with a score of 11; cysteine is next with a score of 9; both proline and tyrosine get scores of 7 for identity.

    A  B  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  X  Y  Z
A   4 -2  0 -2 -1 -2  0 -2 -1 -1 -1 -1 -2 -1 -1 -1  1  0  0 -3 -1 -2 -1
B  -2  6 -3  6  2 -3 -1 -1 -3 -1 -4 -3  1 -1  0 -2  0 -1 -3 -4 -1 -3  2
C   0 -3  9 -3 -4 -2 -3 -3 -1 -3 -1 -1 -3 -3 -3 -3 -1 -1 -1 -2 -1 -2 -4
D  -2  6 -3  6  2 -3 -1 -1 -3 -1 -4 -3  1 -1  0 -2  0 -1 -3 -4 -1 -3  2
E  -1  2 -4  2  5 -3 -2  0 -3  1 -3 -2  0 -1  2  0  0 -1 -2 -3 -1 -2  5
F  -2 -3 -2 -3 -3  6 -3 -1  0 -3  0  0 -3 -4 -3 -3 -2 -2 -1  1 -1  3 -3
G   0 -1 -3 -1 -2 -3  6 -2 -4 -2 -4 -3  0 -2 -2 -2  0 -2 -3 -2 -1 -3 -2
H  -2 -1 -3 -1  0 -1 -2  8 -3 -1 -3 -2  1 -2  0  0 -1 -2 -3 -2 -1  2  0
I  -1 -3 -1 -3 -3  0 -4 -3  4 -3  2  1 -3 -3 -3 -3 -2 -1  3 -3 -1 -1 -3
K  -1 -1 -3 -1  1 -3 -2 -1 -3  5 -2 -1  0 -1  1  2  0 -1 -2 -3 -1 -2  1
L  -1 -4 -1 -4 -3  0 -4 -3  2 -2  4  2 -3 -3 -2 -2 -2 -1  1 -2 -1 -1 -3
M  -1 -3 -1 -3 -2  0 -3 -2  1 -1  2  5 -2 -2  0 -1 -1 -1  1 -1 -1 -1 -2
N  -2  1 -3  1  0 -3  0  1 -3  0 -3 -2  6 -2  0  0  1  0 -3 -4 -1 -2  0
P  -1 -1 -3 -1 -1 -4 -2 -2 -3 -1 -3 -2 -2  7 -1 -2 -1 -1 -2 -4 -1 -3 -1
Q  -1  0 -3  0  2 -3 -2  0 -3  1 -2  0  0 -1  5  1  0 -1 -2 -2 -1 -1  2
R  -1 -2 -3 -2  0 -3 -2  0 -3  2 -2 -1  0 -2  1  5 -1 -1 -3 -3 -1 -2  0
S   1  0 -1  0  0 -2  0 -1 -2  0 -2 -1  1 -1  0 -1  4  1 -2 -3 -1 -2  0
T   0 -1 -1 -1 -1 -2 -2 -2 -1 -1 -1 -1  0 -1 -1 -1  1  5  0 -2 -1 -2 -1
V   0 -3 -1 -3 -2 -1 -3 -3  3 -2  1  1 -3 -2 -2 -3 -2  0  4 -3 -1 -1 -2
W  -3 -4 -2 -4 -3  1 -2 -2 -3 -3 -2 -1 -4 -4 -2 -3 -3 -2 -3 11 -1  2 -3
X  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
Y  -2 -3 -2 -3 -2  3 -3  2 -1 -2 -1 -1 -2 -3 -1 -2 -2 -2 -1  2 -1  7 -2
Z  -1  2 -4  2  5 -3 -2  0 -3  1 -3 -2  0 -1  2  0  0 -1 -2 -3 -1 -2  5
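To see the difference between identity and similarity in numbers, here is a small Python sketch that scores two short peptides with a handful of entries copied from the table above. Only those few entries are included, so this is a sketch and not a usable matrix; the peptides and function names are made up for the example.

```python
# A few BLOSUM62 entries copied from the table above (not the whole matrix).
BLOSUM62_SUBSET = {
    ("W", "W"): 11, ("C", "C"): 9, ("P", "P"): 7, ("Y", "Y"): 7,
    ("I", "I"): 4, ("V", "V"): 4,
    ("I", "V"): 3, ("F", "Y"): 3, ("I", "L"): 2, ("A", "W"): -3,
}

def pair_score(x, y):
    """Look up a pair in either order; the matrix is symmetric."""
    if (x, y) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(x, y)]
    return BLOSUM62_SUBSET[(y, x)]  # KeyError if the pair is not in the subset

def ungapped_score(a, b):
    """Sum of substitution scores over an ungapped comparison."""
    return sum(pair_score(x, y) for x, y in zip(a, b))

def percent_identity(a, b):
    """Percentage of positions with exactly the same residue."""
    return 100.0 * sum(1 for x, y in zip(a, b) if x == y) / len(a)

# Made-up peptides differing only by a conservative I/V replacement.
print(ungapped_score("WCPI", "WCPV"))    # 11 + 9 + 7 + 3 = 30
print(percent_identity("WCPI", "WCPV"))  # 75.0
```

The I/V position costs identity but still adds 3 to the score, and matching a tryptophan is worth almost three times as much as matching an isoleucine.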

On to multiple sequences: dynamic programming’s complexity increases exponentially with the number of sequences being compared:

N-dimensional matrix . . . . complexity $O([\text{sequence length}]^{\text{number of sequences}})$

See MSA (‘global’ within ‘bounding box’) and PIMA (‘local’ portions only). Both are available on the multiple alignment page at the Baylor College of Medicine’s Search Launcher (http://searchlauncher.bcm.tmc.edu/), but with severely limiting restrictions!
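A quick bit of arithmetic shows why the N-dimensional matrix gets out of hand. The sketch below simply counts matrix cells for a hypothetical sequence length of 300 residues; the length and the function name are assumptions for illustration.

```python
def dp_cells(sequence_length, number_of_sequences):
    """Cells in a full N-dimensional dynamic programming matrix."""
    return sequence_length ** number_of_sequences

# Hypothetical sequences of 300 residues each.
for n in (2, 3, 4, 5, 10):
    print(n, "sequences:", dp_cells(300, n), "cells")
```

Two sequences need 90,000 cells and three need 27,000,000; by ten sequences the count has 25 digits, which is why exact methods come with such severe size limits.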

A couple ‘global’ solutions using heuristic tricks

Therefore: pairwise, progressive dynamic programming . . . restricts the solution to the neighborhood of only two sequences at a time.

All sequences are compared, pairwise, and then each is aligned to its most similar partner or group of partners represented as a consensus. Each group of partners is then aligned to finish the complete multiple sequence alignment.

This was pretty much the original ClustalV and GCG PileUp program . . . then . . .

Enhancements on the theme

First enhancements came from ClustalW: variable sequence weighting, dynamically varying gap penalties and substitution matrices, and a neighbor-joining guide-tree.

Since the year 2000 a slew of new programs have tried other heuristic variations, all in attempts to build faster, more accurate multiple sequence alignments. The devil’s in the details: Muscle, ProbCons, T-Coffee, MAFFT and many, many more.

Muscle

An iterative method that uses weighted log-expectation profile scoring along with a slew of optimizations. It proceeds in three stages: draft progressive using k-mer counting, improved progressive using a revised guide-tree from the previous iteration, and refinement by sequential deletion of each tree edge with subsequent profile realignment.

ProbCons

Uses Hidden Markov Model (HMM) techniques and posterior probability matrices that compare random pairwise alignments to expected pairwise alignments. Probability consistency transformation is used to reestimate the scores, and a guide-tree is then constructed, which is used to compute the alignment, which is then iteratively refined. Incredibly accurate.

T-Coffee

Uses a preprocessed, weighted library of all pairwise global alignments between your sequences, plus the ten best local alignments associated with each pair. This helps build the NJ guide-tree and from this the alignment. The library is also used to assure consistency and help prevent errors, by allowing ‘forward-thinking’ to see whether the overall alignment will be better one way or another after particular segments are aligned one way or another. The institutional schedule analogy . . . .

T-Coffee can even tie together multiple methods as external modules, making consistency libraries from the results of each, as long as all the specified methods are installed on your system. T-Coffee is one of the most accurate methods available because of this consistency based rationale, but it is not the fastest. Regardless, I encourage you to check it out! Also see my manuscript.

MAFFT (today’s example) has many modes, among them: a couple of progressive, approximate modes, using a fast Fourier transformation (FFT); a couple of iteratively refined methods that add in weighted-sum-of-pairs (WSP) scoring; and several iterative methods that use WSP scoring combined with a T-Coffee-like consistency based scoring scheme. Speed and accuracy trade off against each other across these modes, which run from fast and rough to slow and accurate.

MAFFT provides command aliases for all of these, from fast to slow: FFTNS with or without retree, FFTNSI with or without maxiterate, and the three combined approaches EINSI, LINSI, and GINSI.

See command line help with “mafft --help” and the complete ‘man’ page style manual at http://align.bmr.kyushu-u.ac.jp/mafft/software/manual/manual.html.

MAFFT’s basic algorithm

MAFFT’s fast Fourier transform provides a huge speedup over previous methods. Homologous regions are quickly identified by converting amino acid residues to vectors of volume and polarity, thus changing a twenty-character alphabet to six, rather than by using an amino acid similarity matrix. Similarly, nucleotide bases are converted to vectors of imaginary and complex numbers. The FFT trick then reduces the complexity of the subsequent comparison to O(N log N). FFT identifies potential similarities though, without localizing them; a sliding window step using the BLOSUM62 matrix is used for this.

Then MAFFT constructs a distance matrix, and hence a progressive guide tree, on the number of shared six-tuples from this Fourier transform, rather than on a ranking based on full-length, pairwise sequence similarity. The user can specify how many times a new guide tree is subsequently recalculated from a previous alignment; the alignment is reconstructed using the Needleman Wunsch algorithm each time.

Some of MAFFT’s many modes

And each mode has a bunch of additional options!

1) Most basic, fastest modes — just progressive.

a) FFTNS1 (fftns --retree 1)

b) FFTNS2 (fftns) (same as mafft --retree 2)

Suitable for 1,000’s of easily aligned sequences.

A rough distance matrix is built from the sequences using FFT and the shared number of six-mers.

A modified UPGMA guide tree is built from this matrix.

The sequences are aligned according to the rough, initial guide tree (as in ‘traditional’ methods).

FFTNS2 adds a recomputation of the guide tree (retree 2) from the original alignment, from which a new progressive alignment is built.
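The "shared number of six-mers" idea behind the rough distance matrix can be sketched in Python. This is a deliberately simplified stand-in, not MAFFT’s actual calculation: it uses the raw residues rather than a reduced alphabet, and the normalization by the smaller six-mer set is an assumption made for the example. The sequences and function names are made up.

```python
def kmers(sequence, k=6):
    """The set of overlapping k-long words in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def shared_kmer_distance(a, b, k=6):
    """Rough distance: 1 minus the fraction of shared k-mers (simplified)."""
    words_a, words_b = kmers(a, k), kmers(b, k)
    if not words_a or not words_b:
        return 1.0  # too short to compare
    shared = len(words_a & words_b)
    return 1.0 - shared / min(len(words_a), len(words_b))

# Made-up toy sequences.
sequences = {
    "seq1": "MKRPLNAGTWQ",
    "seq2": "MKRPLNAGSWQ",
    "seq3": "GGGGGGGGGGG",
}
names = sorted(sequences)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        print(first, second,
              round(shared_kmer_distance(sequences[first], sequences[second]), 3))
```

No alignment is needed to fill in this matrix, which is what makes the first guide tree so cheap to build for thousands of sequences.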

MAFFT’s iterative refinements

2) Intermediate modes: progressive + iterations to maximize the WSP objective function.

a) FFTNSI (fftnsi) default two cycles, or e.g.

fftnsi --maxiterate 1000

b) NWNSI (nwnsi) same as FFTNSI, but no FFT, Needleman Wunsch only.

Progressive alignment and retree as before, with or without FFT, and then . . . .

Iterative refinement is cycled twice (default), or repeatedly until there is no further improvement, or until you reach your specified limit number.

Suitable for 100’s through 1000’s of sequences.
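To make the objective function a little more concrete, here is a Python sketch of a plain, unweighted sum-of-pairs score for an alignment. MAFFT’s WSP function adds sequence weights and its own gap treatment; the match, mismatch, and gap values below and the function name are assumptions for illustration only.

```python
from itertools import combinations

def sum_of_pairs(alignment, match=1, mismatch=0, gap=-1):
    """Unweighted sum-of-pairs score over all columns and all sequence pairs."""
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue          # two gaps facing each other score nothing
            if x == "-" or y == "-":
                total += gap      # residue facing a gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total

# Made-up toy alignment.
toy_alignment = ["AC-T", "ACGT", "A-GT"]
print(sum_of_pairs(toy_alignment))  # 4
```

Iterative refinement keeps a change to the alignment only when a score like this improves, and stops when no change helps or the cycle limit is reached.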

MAFFT’s most accurate modes

3) Advanced modes: progressive + iterations to maximize the objective WSP and T-Coffee-like consistency functions. Options differ according to the way the pairwise alignments are calculated.

a) EINSI (einsi) most general of these.

Uses a Smith Waterman style local algorithm with generalized affine gap costs for the pairwise step. Most appropriate for sequences with multiple, shared, similarly ordered domains, in an otherwise nearly unalignable ‘mess,’ e.g.:

ooooooXXX------XXXX-----------------------XXXXXXXXXXX-XXXXXXXXXXXXXXXoooooooooo
------XXXXXXXXXXXXXooo--------------------XXXXXXXXXXXXXXXXXX-XXXXXXXX----------
--ooooXXXXXX---XXXXooooooooooo------------XXXXX----XXXXXXXXXXXXXXXXXXoooooooooo
------XXXXX----XXXXoooooooooooooooooooooooXXXXX-XXXXXXXXXXXX--XXXXXXX----------
------XXXXX----XXXX-----------------------XXXXX---XXXXXXXXXX--XXXXXXXooooo-----


b) LINSI (linsi) strictly local.

Uses a Smith Waterman style local algorithm with affine gap costs for the pairwise step. Most appropriate for sequences with only one single, shared domain, in an otherwise nearly unalignable ‘mess,’ e.g.:

--------------XXXXXXXXXXX-XXXXXXXXXXXXXXXoooooooooo
--------------XXXXXXXXXXXXXXXXXX-XXXXXXXX----------
--------------XXXXX----XXXXXXXXXXXXXXXXXXoooooooooo
ooooooooooooooXXXXX-XXXXXXXXXXXX--XXXXXXX----------
--------------XXXXX---XXXXXXXXXX--XXXXXXXooooo-----


c) GINSI (ginsi) strictly global.

Uses a Needleman Wunsch style global algorithm with affine gap costs for the pairwise step. Most appropriate for sequences where only one single, shared domain extends the full length of all of the sequences, e.g.:

XXXXXXXXXXXXXXX-XXXXXXXXXXXXXXXooooXXXooXXX
-XXXXXXXXXXXXXXXXXX-XXXXXXXX--XXXXXXX---XXX
XX--XXXXX---XXXXXXXXXXXXXXXXXXXoooooXXoooXX
ooooXXXXXoooooXXXXX-XXXXXXXXXXXX--XXXXXXXX-
XXXXX---XXXXXXXXXX--XXXXXXXooooXXXXXXXXXX--
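If you would rather drive these modes from a script than type them, a thin Python wrapper is enough. The sketch below assumes the command aliases named in these slides (fftns, fftnsi, nwnsi, einsi, linsi, ginsi) are installed under exactly those names and found on your PATH, and that the alignment is written to standard output; check your own installation, since the names can differ. The function names and file names are hypothetical.

```python
import subprocess

# Alias names as given in the slides; your installation may name them differently.
MAFFT_ALIASES = {"fftns", "fftnsi", "nwnsi", "einsi", "linsi", "ginsi"}

def build_mafft_command(alias, fasta_path, extra_options=()):
    """Assemble the argument list for one MAFFT mode."""
    if alias not in MAFFT_ALIASES:
        raise ValueError(f"unknown alias: {alias}")
    return [alias, *extra_options, fasta_path]

def run_mafft(alias, fasta_path, output_path, extra_options=()):
    """Run the chosen mode and save whatever it writes to standard output."""
    command = build_mafft_command(alias, fasta_path, extra_options)
    with open(output_path, "w") as handle:
        subprocess.run(command, stdout=handle, check=True)

# Fast and rough, then slow and accurate (hypothetical file name).
print(build_mafft_command("fftns", "my_seqs.fasta", ["--retree", "1"]))
print(build_mafft_command("fftnsi", "my_seqs.fasta", ["--maxiterate", "1000"]))
print(build_mafft_command("linsi", "my_seqs.fasta"))
```

Only `build_mafft_command` is exercised here; `run_mafft` does nothing useful unless MAFFT is actually installed.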

How to know when to use what

For MAFFT: see the algorithm and tips, tips3, and tips4 pages.

For all of them, the take home message:

For simple cases it doesn’t really matter what program you use. For complicated situations it may, and what you use will depend on the size of your dataset, personal preferences, time allotted, and how much hand editing you want to do.

Really nice, recent review: Edgar, R.C. and Batzoglou, S. (2006) Multiple sequence alignment. Current Opinion in Structural Biology 16, 368–373.

The rest of my references can be found in my tutorial manuscript for this workshop.

You can do a lot of this stuff on the Web, if you need to. Some resources for multiple sequence alignment:

http://www.techfak.uni-bielefeld.de/bcd/Curric/MulAli/welcome.html

http://pbil.univ-lyon1.fr/alignment.html

http://www.ebi.ac.uk/Tools/sequence.html

http://searchlauncher.bcm.tmc.edu/

However, problems with very large datasets and huge multiple alignments make doing multiple sequence alignment on the Web impractical after your dataset has reached a certain size. You’ll know it when you’re there!

If large datasets become intractable for analysis on the Web, what other resources are available? Soapbox detour . . .

Desktop software solutions: all of these programs are available as public domain/open source, but . . . they can be complicated to install, configure, and maintain. The user must be pretty computer savvy.

So, commercial software packages are available, e.g. MacVector, DS Gene, DNAsis, DNAStar, etc., but . . . license hassles, big expense per machine, lack of most recent programs, underperformance, and Internet and/or CD database access all complicate matters!

Therefore, I argue for UNIX server-based solutions . . .

UNIX servers — pros and cons

Free/public domain solutions are still available, but now a very cooperative systems manager needs to maintain everything for users. If you have such a person, then:

You end up with a more powerful, and usually faster, computer with larger storage capabilities. Plus, connections can be made from any networked terminal or workstation anywhere!

Operating system: UNIX command line operation hassles; communications software (ssh and terminal emulation); X graphics; file transfer (scp/sftp); and editors (vi, emacs, pico/nano, or desktop word processing followed by file transfer [save as "text only"!]). See my supplement pdf file.

Getting off my soapbox . . .

Coding DNA issues

Work with proteins! If at all possible.

Twenty match symbols versus four, plus similarity versus identity!

Way better signal to noise.

Also guarantees no indels are placed within codons. So translate, then align. SeaView can do this for you!

Nucleotide sequences will only reliably align if they are very similar to each other. And they will likely require extensive and carefully considered hand editing with an editor like SeaView.
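To make the "translate, then align" idea concrete, here is a minimal sketch of the last step: laying the original codons back onto an aligned protein sequence so that every gap spans a whole codon. This is not SeaView's own code; the function name `thread_codons` and the toy sequences are hypothetical, and the sketch assumes the DNA starts in frame with the first residue.

```python
def thread_codons(aligned_protein, dna, gap="-"):
    """Lay the codons of `dna` onto a gapped protein sequence.

    Each residue takes the next three nucleotides; each protein gap becomes
    three gap characters, so no indel can fall inside a codon.
    """
    residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(dna) < 3 * residues:
        raise ValueError("DNA is too short for the aligned protein")
    codons = []
    pos = 0
    for aa in aligned_protein:
        if aa == gap:
            codons.append(gap * 3)
        else:
            codons.append(dna[pos:pos + 3])
            pos += 3
    return "".join(codons)


if __name__ == "__main__":
    # Toy data: two short proteins already aligned, plus their coding DNA.
    protein_alignment = {"seq1": "MK-L", "seq2": "M-RL"}
    coding_dna = {"seq1": "ATGAAACTG", "seq2": "ATGCGTCTG"}
    for name, aligned in protein_alignment.items():
        print(name, thread_codons(aligned, coding_dna[name]))
```

Any trailing nucleotides beyond the last residue (a stop codon, for example) are simply ignored by this sketch.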

Reliability and the comparative approach . . .

explicit homologous correspondence of residues within every column of your alignment;

manual adjustments should be encouraged, based on knowledge, especially of structural, regulatory, and functional sites.

Therefore, editors like SeaView and databases like the Ribosomal Database Project: http://rdp.cme.msu.edu/index.jsp

SeaView

SeaView is a really good multiple sequence editor graphical user interface (GUI) with the ability to manually adjust alignments, create dot plots between any two sequences, and run external multiple sequence alignment programs on portions of or all of your data.

Some of its very powerful features are its ability to let you work on DNA sequences based on their translations, to create “Sites” and “Species” sets that delineate subsets of your data matrix, and to annotate your data. It is available for all major operating systems.

SeaView’s view

The HPV E2/E4 gene reading frame overlap after EINSI refinement.

‘Mask’ out uncertain areas; SeaView’s “Sites sets” allow you to do this. Annotate known regions; SeaView’s “Footers” do this.

X’s delineate sites that will be exported with “Save selection” or specified as a CHARSET by “Save as” “NEXUS.”
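The effect of such a mask can be sketched in a few lines: keep only the alignment columns flagged with an X. This is an illustration of the idea, not how SeaView stores or applies its Sites sets; the function name `apply_site_mask`, the mask-as-string representation, and the toy alignment are assumptions.

```python
def apply_site_mask(alignment, mask, keep="X"):
    """Keep only the alignment columns flagged with `keep` in the mask."""
    lengths = {len(seq) for seq in alignment.values()}
    if lengths != {len(mask)}:
        raise ValueError("mask and all sequences must have the same length")
    columns = [i for i, flag in enumerate(mask) if flag == keep]
    return {
        name: "".join(seq[i] for i in columns)
        for name, seq in alignment.items()
    }


if __name__ == "__main__":
    # Toy alignment; columns 3 and 4 are treated as uncertain and masked out.
    toy = {"a": "ACG-T", "b": "AC-TT"}
    for name, seq in apply_site_mask(toy, "XX--X").items():
        print(name, seq)
```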

Complications: beware of aligning apples and oranges [and grapefruit]!

For example: receptors and/or activators with their namesake proteins;

paralogous versus orthologous homologs;

genomic versus cDNA;

mature versus precursor proteins . . . .

Complications, cont.

Order dependence.

Not that big of a deal, makes biological sense.

Substitution matrices and gap penalties.

Can be a very big deal!

Regional ‘realignment’ becomes incredibly important, especially with sequences that have areas of high and low similarity. SeaView lets you do this!

Complications still, format hassles!

Specialized format conversion tools such as GCG’s SeqConv+ program and Don Gilbert’s public domain ReadSeq program.

And many editors accept various input formats, such as SeaView (inputs and outputs NEXUS, MSF, Clustal, FastA, PHYLIP, and MASE formats).
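For a sense of what such a conversion involves, here is a minimal sketch that turns an aligned FastA file into a simple sequential PHYLIP layout. It is not a substitute for ReadSeq, SeqConv+, or SeaView; the function names are hypothetical, and it assumes strict PHYLIP's ten-character name field, so longer names are truncated and could collide.

```python
def parse_fasta(text):
    """Read FastA text into a dict of name -> sequence (first word of header)."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def to_phylip(records):
    """Write aligned sequences as sequential PHYLIP with 10-character names."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to the same length")
    lines = [f"{len(records)} {lengths.pop()}"]
    for name, seq in records.items():
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    fasta = ">a first toy sequence\nAC-G\n>b\nAC\nTG\n"
    print(to_phylip(parse_fasta(fasta)), end="")
```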

Yet more complications

Indels and missing data symbols (i.e. gaps) designation discrepancy headaches:

., -, ~, ?, N, or X . . . . . Help!
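One way to tame part of this is to normalize gap symbols before handing data to the next program. The sketch below rewrites `.` and `~` as `-` and deliberately leaves `?`, `N`, and `X` alone, since those usually mean missing or ambiguous data rather than an indel. Which symbols count as gaps is an assumption you must check for your own files: in some formats `.` means "same as the first sequence" rather than a gap. The function name `normalize_gaps` is hypothetical.

```python
def normalize_gaps(seq, gap="-", gap_like=(".", "~")):
    """Rewrite every character listed in `gap_like` as the single `gap` symbol."""
    table = str.maketrans({ch: gap for ch in gap_like})
    return seq.translate(table)


if __name__ == "__main__":
    # Toy sequence mixing several gap conventions with missing-data symbols.
    print(normalize_gaps("~~AC..G-T?N"))
```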

FOR MORE INFO...

Explore my Web Home: http://bio.fsu.edu/~stevet/cv.html.

Contact me (stevet@[email protected]) for specific long-distance bioinformatics assistance and collaboration.

Gunnar von Heijne, in his very old but quite readable treatise, Sequence Analysis in Molecular Biology: Treasure Trove or Trivial Pursuit, provides a very appropriate conclusion:

“Think about what you’re doing; use your knowledge of the molecular system involved to guide both your interpretation of results and your direction of inquiry; use as much information as possible; and do not blindly accept everything the computer offers you.”

He continues:

“. . . if any lesson is to be drawn . . . it surely is that to be able to make a useful contribution one must first and foremost be a biologist, and only second a theoretician . . . . We have to develop better algorithms, we have to find ways to cope with the massive amounts of data, and above all we have to become better biologists. But that’s all it takes.”

Conclusions

On to a demonstration of some of SeaView’s multiple sequence dataset capabilities: the HPV L1 gene and its complete genome . . . the tutorial: How to use SeaView with MAFFT.
