Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD


Dot Plot (Matrix) for Sequence Comparison

Reminders from Previous Lectures

DOTPLOTS

[Figure: dot plot comparing the sequences DOROTHYCROWFOOTHODGKIN and DOROTHYHODGKIN.]

Dot Matrix: Self Comparison

[Figure: dot plot of ATGCAGTT against itself. Label: Identity diagonal.]

[Figure: dot plot of ATGCATGC against itself. Labels: Identity diagonal, Direct Repeat.]

Dot Matrix: Point Mutation

[Figure: dot plot of ATGGAGTT against ATGCAGTT. Labels: Main diagonal, Point mutation.]

Dot Matrix: Gap

[Figure: dot plot of ATGAACAGTT against ATGCAGTT. Labels: Main diagonal, Deletion/Insertion.]

Dot Matrix: Rearrangement

[Figure: dot plot of AGTTATGC against ATGCAGTT. Label: Main diagonal.]

Dot Plot Analysis

Advantages
Simple and fast.
Can detect DNA rearrangement.

Disadvantages
No numerical values produced.
Subjective interpretation.
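A dot plot can be sketched in a few lines of Python. This is a minimal illustration rather than the tool used in the lecture: the function name `dot_plot` and the text rendering ('*' for a match, '.' otherwise) are assumptions made for the example.

```python
def dot_plot(seq_x: str, seq_y: str) -> list[str]:
    """Return one text row per residue of seq_y; '*' marks an identical pair."""
    return ["".join("*" if a == b else "." for a in seq_x) for b in seq_y]


# Self comparison of a sequence containing the direct repeat ATGC.
rows = dot_plot("ATGCATGC", "ATGCATGC")
for residue, row in zip("ATGCATGC", rows):
    print(residue, row)
```

In the printed grid the identity diagonal runs from top left to bottom right, and the repeat shows up as shorter parallel diagonals. No filtering window is applied here, so every single-residue match produces a dot.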

Problems of Sequence Alignment

How to score? Match, mismatch and gap.

Example (similarity score): +1 for each match, 0 for a mismatch, -2 for each internal gap, and 0 for a terminal gap.

Computational measures

Distance measure
0 for a match
1 for a mismatch or gap
Lowest is best

Another measure
2 for a match
-1 for a mismatch, -2 for a gap
Highest is best
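The two measures above can be compared on the same alignment with a short Python sketch. The helper `score_alignment` and the example alignment are illustrative assumptions; only the score values come from the slide.

```python
def score_alignment(row_a: str, row_b: str, match: int, mismatch: int, gap: int) -> int:
    """Sum column scores of an existing alignment; '-' denotes a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


top = "ATGAACAGTT"
bottom = "ATG--CAGTT"

# Distance measure: 0 per match, 1 per mismatch or gap (lowest is best).
distance = score_alignment(top, bottom, match=0, mismatch=1, gap=1)
# Similarity measure: 2 per match, -1 per mismatch, -2 per gap (highest is best).
similarity = score_alignment(top, bottom, match=2, mismatch=-1, gap=-2)
print(distance, similarity)
```

This alignment has eight matching columns and two gap columns, so the distance is 2 and the similarity is 12.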

Gap Penalties

Linear score: f(g) = -gd
Affine score: f(g) = -d - (g-1)e

d = gap open penalty
e = gap extend penalty
g = gap length

Example gap penalty values used: d = 500, e = 50
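The two formulas can be evaluated directly. The following sketch uses the slide's example values d = 500 and e = 50; the function names are assumptions made for illustration.

```python
def linear_gap(g: int, d: int) -> int:
    """Linear score f(g) = -g*d for a gap of length g."""
    return -g * d


def affine_gap(g: int, d: int, e: int) -> int:
    """Affine score f(g) = -d - (g-1)*e: open once, then extend."""
    return -d - (g - 1) * e


for length in (1, 2, 3, 10):
    print(length, linear_gap(length, d=500), affine_gap(length, d=500, e=50))
```

With these values a gap of length 3 scores -1500 under the linear model and -600 under the affine model, so the affine model favors one long gap over several short ones.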

Example from Lab-Feb20:

-1 for a terminal gap, -2 for each internal gap (gap penalty)
Blosum(A,A) = 4; Blosum(A,P) = -1; Blosum(A,W) = -3; Blosum(P,P) = 7; Blosum(P,W) = -4

AWAP
-APP
-1 - 3 - 1 + 7 = 2 (one terminal gap, 2 mismatches)

AWAP-
--APP
-3 + 4 + 7 = 8 (3 terminal gaps, no mismatches); best if the gap penalty (inside) is high

AWAP
A-PP
-2 + 4 - 1 + 7 = 8 (one internal gap, 1 mismatch); best if the terminal gap penalty is high
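The three lab scores can be checked with a small Python sketch. The substitution values and gap penalties are the ones quoted above; the helper names (`pair_score`, `is_terminal`, `score`) and the rule used to decide whether a gap is terminal (no residue of the gapped row on one side of it) are assumptions made for this example.

```python
# Only the BLOSUM entries quoted on the slide are included.
BLOSUM = {("A", "A"): 4, ("A", "P"): -1, ("A", "W"): -3, ("P", "P"): 7, ("P", "W"): -4}


def pair_score(a: str, b: str) -> int:
    """Look up a symmetric substitution score."""
    return BLOSUM[tuple(sorted((a, b)))]


def is_terminal(row: str, i: int) -> bool:
    """A gap is terminal if its row has no residue before it or none after it."""
    return row[:i].strip("-") == "" or row[i + 1:].strip("-") == ""


def score(top: str, bottom: str, terminal_gap: int = -1, internal_gap: int = -2) -> int:
    total = 0
    for i, (a, b) in enumerate(zip(top, bottom)):
        if a == "-" or b == "-":
            gapped_row = top if a == "-" else bottom
            total += terminal_gap if is_terminal(gapped_row, i) else internal_gap
        else:
            total += pair_score(a, b)
    return total


print(score("AWAP", "-APP"), score("AWAP-", "--APP"), score("AWAP", "A-PP"))
```

Raising `internal_gap` (for example to -5) leaves the second alignment ahead, while raising `terminal_gap` favors the third, which matches the remarks on the slide.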

How to find the alignment with the best score?

Finding the alignment with the best score

Brute force approach = calculating the scores of all possible alignments and selecting the best ones.

For two 1000 bp DNA sequences, the number of possible alignments is 10^600. The brute force approach is impossible.
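The order of magnitude can be reproduced with one common simplification, which counts alignments of two sequences of length n as the binomial coefficient C(2n, n). Whether the slide's figure was derived this way is an assumption; the sketch only shows that this count lands near 10^600 for n = 1000.

```python
import math

n = 1000
count = math.comb(2 * n, n)   # exact big-integer binomial coefficient
digits = len(str(count))
print(f"C({2 * n}, {n}) has {digits} digits, i.e. roughly 10^{digits - 1}")
```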

Dynamic Programming Methods

Finding the best alignment without calculating all possible alignments.
The method is EXACT.
The original method by Needleman & Wunsch performs global alignment.
The modification by Smith & Waterman performs local alignment.

Needleman & Wunsch Method (match = 1, mismatch = 0, gap = -2)

        A    T    G    A    A    C    A    G    T    T
   A    1   -1   -3   -5   -7   -9  -11  -13  -15  -17
   T   -1    2    0   -2   -4   -6   -8  -10  -12  -14
   G   -3    0    3    1   -1   -3   -5   -7   -9  -11
   C   -5   -2    1    3    1    0   -2   -4   -6   -8
   A   -7   -4   -1    2    4    2    1   -1   -3   -5
   G   -9   -6   -3    0    2    4    2    2    0   -2
   T  -11   -8   -5   -2    0    2    4    2    3    1
   T  -13  -10   -7   -4   -2    0    2    4    2    4
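The score matrix for this example can be recomputed with a short Python sketch of the Needleman-Wunsch recurrence. The function name and the boundary convention (row 0 and column 0 filled with multiples of the gap penalty) are assumptions; the scoring values are the ones given above. Traceback is omitted to keep the example short.

```python
def needleman_wunsch(cols: str, rows: str, match: int = 1, mismatch: int = 0, gap: int = -2):
    """Fill the global alignment score matrix F; F[i][j] scores rows[:i] vs cols[:j]."""
    n, m = len(rows), len(cols)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap                      # leading gaps in cols
    for j in range(1, m + 1):
        F[0][j] = j * gap                      # leading gaps in rows
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if rows[i - 1] == cols[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,  # align the two residues
                          F[i - 1][j] + gap,    # gap in cols
                          F[i][j - 1] + gap)    # gap in rows
    return F


F = needleman_wunsch("ATGAACAGTT", "ATGCAGTT")
for residue, row in zip("ATGCAGTT", F[1:]):
    print(residue, row[1:])
print("global score:", F[-1][-1])
```

The bottom-right cell holds the global alignment score, 4 for these two sequences (eight matches and two gap positions).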

Local Alignment with the Smith-Waterman Algorithm

Adding one modification: any negative score is changed to 0. That is, alignment will not be done unless the score is positive.

Smith-Waterman Method (match = 1, mismatch = 0, gap = -2)

        A    T    G    A    A    C    A    G    T    T
   A    1    0    0    1    1    0    1    0    0    0
   T    0    2    0    0    1    1    0    1    1    1
   G    0    0    3    1    0    1    1    1    1    1
   C    0    0    1    3    1    1    1    1    1    1
   A    1    0    0    2    4    2    2    1    1    1
   G    0    1    0    0    2    4    2    3    1    1
   T    0    1    1    0    0    2    4    2    4    2
   T    0    1    1    1    0    0    2    4    3    5
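The local alignment matrix can be recomputed with the same recurrence as the global method plus the floor at zero. This is a minimal sketch with an assumed function name; it fills the matrix and reports the best cell, and leaves out the traceback.

```python
def smith_waterman(cols: str, rows: str, match: int = 1, mismatch: int = 0, gap: int = -2):
    """Fill the local alignment score matrix; negative values are replaced by 0."""
    n, m = len(rows), len(cols)
    F = [[0] * (m + 1) for _ in range(n + 1)]   # boundaries stay at 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if rows[i - 1] == cols[j - 1] else mismatch
            F[i][j] = max(0,
                          F[i - 1][j - 1] + s,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
    return F


F = smith_waterman("ATGAACAGTT", "ATGCAGTT")
best = max(max(row) for row in F)
print("best local score:", best)
```

The best local score for this pair is 5, found in the bottom-right cell of the matrix above.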


Scoring schemes

Dynamic programming guarantees correct results for each scoring scheme, but the biological basis of scoring schemes is weak, except for the fact that insertions/deletions are rarer than substitutions and are scored accordingly.

Match-Mismatch score

DNA
Transition is more frequent than transversion (e.g., for M. tuberculosis SNP ~ 2:1) and can be scored accordingly.
In practice base transitions and transversions are usually scored equally.

Proteins
Substitution matrix such as PAM or BLOSUM

Transitions & Transversions

Transition: a nucleotide substitution from one purine to another purine (e.g., A->G), or from one pyrimidine to another pyrimidine (e.g., T->C).

Transversion: a nucleotide substitution from a purine to a pyrimidine (e.g., A->C), or vice versa (e.g., T->G).

[Figure: purines and pyrimidines.]
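The two definitions translate directly into a small classifier. The function name `classify` is an assumption for illustration; the purine (A, G) and pyrimidine (C, T) sets are standard.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify(ref: str, alt: str) -> str:
    """Label a single-nucleotide substitution as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        raise ValueError("not a substitution")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


for ref, alt in [("A", "G"), ("T", "C"), ("A", "C"), ("T", "G")]:
    print(f"{ref}->{alt}: {classify(ref, alt)}")
```

A scoring scheme that treats the two classes differently could use this label to choose the mismatch score.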

Gap penalty

Linear model: the penalty is proportional to the gap length k.
Affine model: a gap opening penalty plus a gap extension penalty applied over the gap length k.
More biologically realistic models need gap penalty functions whose per-position cost decreases with gap length, such as an opening penalty plus a term in log k. Computational complexity prohibits their common use.

More advanced scoring systems

Position-dependent scores: use a different matrix (and penalty) at different positions in proteins. The functional importance of protein regions affects divergence.
Structure-dependent scores.

Software providing ALIGNMENT tools

MATLAB: Bioinformatics Toolbox

[GlobalScore, GlobalAlignment] = nwalign(humanProtein, ...
    mouseProtein)
… swalign
showalignment(GlobalAlignment)

ORACLE 10g BLAST functions: blastn, blastp, blastx, etc.

Types of Algorithms

Heuristic
A heuristic is an algorithm that will yield reasonable results, even if it is not provably optimal or lacks even a performance guarantee.
In most cases, heuristic methods can be very fast, but they make additional assumptions and will miss the best match for some sequence pairs.

Dynamic Programming
The algorithm for finding optimal alignments given an additive alignment score.
These types of algorithms are guaranteed to find the optimal-scoring alignment or set of alignments.

HMM
Based on probability theory; very versatile.
http://www.soe.ucsc.edu/research/compbio/HMM-apps/HMM-applications.html

Hidden Markov Model (HMM)

Markov chain
A chain of events in which the probability of each event depends only on the preceding event.

Assumption: DNA can be viewed as a Markov chain. The probability of A, T, G, or C appearing in each position depends on the kind of nucleotide in the preceding position.

A Markov chain is defined by

P(A|A) = probability of a base being A if the preceding base is A.
P(T|G) = probability of a base being T if the preceding base is G.
And so on. So a DNA Markov chain is defined by 16 probabilities.

[Figure: Markov chain model of DNA with the four states A, G, T and C. Each arrow is defined by a transition probability.]
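The 16 probabilities of a first-order DNA Markov chain can be estimated by counting adjacent base pairs. This is a minimal sketch: the function name, the dictionary layout keyed as (base, preceding base) to mirror the P(T|G) notation, and the toy input sequence are all assumptions made for illustration.

```python
from collections import Counter

BASES = "ACGT"


def transition_probabilities(seq: str) -> dict:
    """Estimate P(base | preceding base) from the adjacent pairs in one sequence."""
    seq = seq.upper()
    pair_counts = Counter(zip(seq, seq[1:]))    # (preceding, base) -> count
    probs = {}
    for prev in BASES:
        total = sum(pair_counts[(prev, nxt)] for nxt in BASES)
        for nxt in BASES:
            # Key order (base, preceding) mirrors the notation P(base | preceding).
            probs[(nxt, prev)] = pair_counts[(prev, nxt)] / total if total else 0.0
    return probs


probs = transition_probabilities("AATG")
print(len(probs), probs[("A", "A")], probs[("T", "A")], probs[("G", "T")])
```

For a real model the counts would come from a large training set rather than a single short sequence, and rows with no observations would need pseudocounts instead of zeros.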

Hidden Markov Model

Hidden: state path, e.g., NNNNNNNNCCCCCCCCCCCNNNNN
Not hidden: DNA sequence, e.g., attactggcggccgcgtcgatctg

The question is to find the most probable (hidden) state path when the (non-hidden) sequence is known.

Algorithm to find the Most Probable State Path (Decoding)

If parameters are known:
Viterbi algorithm
Posterior decoding
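The Viterbi algorithm can be sketched for a two-state model. Every number below (start, transition and emission probabilities) is a made-up assumption chosen only so the example runs; the state labels N and C follow the state-path example in these slides, and the input must be a non-empty lowercase DNA string.

```python
import math

STATES = ("N", "C")
# Hypothetical parameters, not taken from the lecture.
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {"N": {"a": 0.4, "t": 0.4, "c": 0.1, "g": 0.1},
        "C": {"a": 0.1, "t": 0.1, "c": 0.4, "g": 0.4}}


def viterbi(seq: str) -> str:
    """Return the most probable state path for seq, computed in log space."""
    v = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for symbol in seq[1:]:
        new_v, ptr = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: v[p] + math.log(TRANS[p][s]))
            new_v[s] = v[best_prev] + math.log(TRANS[best_prev][s]) + math.log(EMIT[s][symbol])
            ptr[s] = best_prev
        backpointers.append(ptr)
        v = new_v
    state = max(STATES, key=lambda s: v[s])     # best final state
    path = [state]
    for ptr in reversed(backpointers):          # follow pointers back to the start
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))


sequence = "attactggcggccgcgtcgatctg"
print(sequence)
print(viterbi(sequence))
```

Working with log probabilities avoids numerical underflow on long sequences. The decoded path depends entirely on the chosen parameters, so it should not be read as the path shown on the slide.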

Estimation of parameters

Usually a “training set” of sequences is required.

The “training set” may be
Sequences of known state
Sequences of unknown state. Parameters are arbitrarily set and reiterated until state changes are minimal.

HMM for identifying coding DNA sequences

[Figure: two four-state Markov chains (A, G, T, C), one for coding (exon) and one for non-coding (intron) sequence, connected by transitions.]

Hidden Markov Model for Coding Sequence predictions

Hidden: state path (I = intron, X = exon), e.g., IIII…XXXX…IIII…XXXX
Not hidden: DNA sequence, e.g., attactggcggccgcgtcgatctgggtcttaggtadtgtacggcccctcgtaggca

The question is to find the most probable (hidden) state path when the (non-hidden) sequence is known.

Training sets for HMM coding sequence prediction

Best come from experimental works
Best come from the same species

HMM for Spliced Alignment (between genomic and EST sequences)

[Figure: paired (exon) states A/A, G/G, T/T, C/C and unpaired (intron) states A, G, T, C.]

Selection of Alignment Programs

Global vs local
Pairwise (1-1), database searching (1-many), module searching (1-1 many loci), multiple
Distance between query and database
Number of queries, size of databases
Exact vs heuristic

Multiple sequence alignment

Multiple sequence alignment
Dynamic programming: restricted to 3-4 sequences at most.
Progressive sequence alignment: ClustalW, X.
Divide and conquer methodology
HMM
Others

Constructing common patterns
Consensus: TATAAT
Weight matrix
Input (from training set) for HMM methods
Input for PSI-BLAST

Multiple Sequence Alignments: Creation and Analysis

Chapter 12, B&O – Protein Alignment
What is a Multiple Alignment?
Structural or Evolutionary? (these do not necessarily correspond, not really possible)
How to multiply align?
How to generate alignments?
Tools

Significance of an Alignment Score

Statistical methods are used to evaluate the significance of an alignment score: Z-score, P-value and E-value.

Significance of Score
Z-score = (score – mean)/std. dev
Measures how unusual our original match is. Z > 5 are significant.
P-value measures the probability that the alignment is no better than random. (Z and P depend on the distribution of the scores.)
P < 10^-100: exact match.
E-value is the expected number of sequences that give the same Z-score or better. (E = P x size of the database)
E < 0.02: sequences probably homologous.
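The two formulas above are simple enough to compute directly. In this sketch the background scores are a made-up list standing in for scores of shuffled or random alignments, the function names are assumptions, and the population standard deviation is used as one possible reading of "std. dev".

```python
import statistics


def z_score(score: float, background_scores: list) -> float:
    """Z = (score - mean) / std. dev of the background score distribution."""
    mean = statistics.mean(background_scores)
    std = statistics.pstdev(background_scores)
    return (score - mean) / std


def e_value(p_value: float, database_size: int) -> float:
    """E = P x size of the database."""
    return p_value * database_size


background = [2, 4, 4, 4, 5, 5, 7, 9]      # hypothetical scores of random alignments
print(z_score(10, background))              # mean 5, std. dev 2
print(e_value(1e-6, 1_000_000))
```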

Aligning more than 2 sequences

Sequences should not be very different in length.
Sequences should be edited down to the regions that are most similar (PSI-BLAST does it automatically, but not all tools do).
Random alignment of pairs of sequences helps in assessing similarities.

Multiple Sequence Alignment

The N-W or S-W algorithms can be generalized to >2 sequences. Their computational complexity precludes their use for >3 sequences.
Heuristic approaches, e.g. the progressive alignment method, are required.
These methods cannot guarantee the best multiple alignment but in most cases give biologically meaningful results.

Progressive Alignment Method

Each pair of sequences is aligned (e.g. by the N-W method).
The similarity in each pair is used for constructing a dendrogram relating the sequences.
The most similar sequences are aligned first.
Then the next most similar sequences or clusters of sequences are sequentially aligned.
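The first two steps (pairwise scoring, then a clustering order) can be sketched as follows. This is not how Clustal builds its guide tree: the greedy average-linkage merge, the function names, and the reuse of the slides' short DNA examples as input are all assumptions for illustration, and the final profile alignment step is left out.

```python
from itertools import combinations


def nw_score(a: str, b: str, match: int = 1, mismatch: int = 0, gap: int = -2) -> int:
    """Global alignment score (Needleman-Wunsch), keeping only one matrix row."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * gap]
        for j, cb in enumerate(b, 1):
            s = match if ca == cb else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def guide_order(seqs: dict) -> list:
    """Repeatedly merge the two clusters with the highest average pairwise score."""
    scores = {frozenset(pair): nw_score(seqs[pair[0]], seqs[pair[1]])
              for pair in combinations(seqs, 2)}

    def average(c1, c2):
        return sum(scores[frozenset((x, y))] for x in c1 for y in c2) / (len(c1) * len(c2))

    clusters = [(name,) for name in seqs]
    merges = []
    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2), key=lambda pair: average(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return merges


seqs = {"s1": "ATGCAGTT", "s2": "ATGGAGTT", "s3": "ATGAACAGTT", "s4": "AGTTATGC"}
merges = guide_order(seqs)
for step, (left, right) in enumerate(merges, 1):
    print(step, left, "+", right)
```

The list of merges plays the role of the dendrogram: s1 and s2 differ by a single point mutation, so they are joined first, and the remaining sequences are added in order of decreasing similarity.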

Progressive Alignment Method

A popular program is the Clustal series.
ClustalV aligns up to 30 sequences; it penalizes the left terminal gap but not the right terminal gap.
ClustalW aligns up to 100 sequences; it does not penalize terminal gaps.
ClustalX is Windows-based.

Comparing a sequence with a profile of a group of sequences

Testing arm of HMM.
Searching arm of PSI-BLAST: for a more sensitive search of homologous sequences.
With a profile of protein sequences for comparative molecular modeling.

PSI-BLAST

A profile search method.
A query sequence is used to search for similar sequences.
The sequences found are used to generate a sequence profile.
The profile is then searched against the databases again.
The method increases the sensitivity of the search over normal BLAST. False positives can be a problem.

BLAST

Basic Local Alignment Search Tool
Altschul et al., 1990
Heuristic

Makes a list of words
  Fixed-length subsequences
  Default: 3 for proteins, 11 for nucleotides
Keeps words that match the query with a score above some threshold
  See file “triples.ss” for some discussion of thresholds
Searches the database for words in this set

When it finds a sequence containing a word in the set
  Uses that as a “seed” for hit extension
  In both directions
  Extending the possible match as an ungapped alignment
After version 2.0, BLAST can handle gaps

FASTA Idea

Idea: a good alignment probably matches some identical ‘words’ (ktups)

Example:

Database record:
ACTTGTAGATACAAAATGTG

Aligned query sequence:
A-TTGTCG-TACAA-ATCTGT

Matching words of size 4

Dictionaries of Words

ACTTGTAGATAC is translated to the dictionary:
ACTT,
CTTG,
TTGT,
TGTA, …

Dictionaries of well-aligned sequences share words.
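Building such a dictionary and finding shared words takes only a few lines. The helper name `words` is an assumption; the two sequences are the database record and the ungapped query from the FASTA example in these slides.

```python
def words(seq: str, k: int = 4) -> list:
    """All overlapping words (k-tuples) of seq, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


db_record = "ACTTGTAGATACAAAATGTG"
query = "A-TTGTCG-TACAA-ATCTGT".replace("-", "")   # remove alignment gaps

print(words("ACTTGTAGATAC")[:4])
shared = set(words(db_record)) & set(words(query))
print(sorted(shared))
```

The five shared 4-letter words are what the later FASTA stages use as anchors when looking for diagonal runs.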

FASTA Stage I

Prepare a dictionary for the db sequence (in advance)
Upon query:
  Prepare a dictionary for the query sequence
  For each DB record:
    Find matching words
    Search for long diagonal runs of matching words
    Init-1 score: longest run
    Discard record if low score

[Figure: matching words (*) plotted by position in query against position in DB record; runs of matches lie along diagonals.]
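The search for diagonal runs can be sketched by grouping word hits by their diagonal, that is, the DB position minus the query position. This is a simplification of the real FASTA stage: it counts hits per diagonal instead of scoring the longest run, and the function names are assumptions. The sequences are the database record and ungapped query from the FASTA example.

```python
from collections import Counter, defaultdict


def word_positions(seq: str, k: int) -> dict:
    """Dictionary mapping each word of length k to its start positions in seq."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_hits(query: str, db_record: str, k: int = 4) -> Counter:
    """Count matching words on each diagonal (db position minus query position)."""
    db_index = word_positions(db_record, k)
    hits = Counter()
    for q_pos in range(len(query) - k + 1):
        for db_pos in db_index.get(query[q_pos:q_pos + k], []):
            hits[db_pos - q_pos] += 1
    return hits


hits = diagonal_hits("ATTGTCGTACAAATCTGT", "ACTTGTAGATACAAAATGTG")
best_diagonal, best_count = hits.most_common(1)[0]
print(dict(hits), best_diagonal, best_count)
```

Three of the five matching words fall on diagonal 2, so that diagonal would be the candidate for the Init-1 run in this toy example.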

FASTA Stage II

Good alignment: a path through many runs, with short connections
Assign weights to runs (+) and connections (-)
Find a path of max weight
Init-n score: total path weight
Discard record if low score

FASTA Stage III

Improve Init-1. Apply an exact algorithm around the Init-1 diagonal within a band of given width.
Opt-score: new weight
Discard record if low score

FASTA final stage

Apply an exact algorithm to surviving records, computing the final alignment score.

BLAST (Basic Local Alignment Search Tool): Approximate Matches

BLAST:
Words are allowed to contain inexact matching.

Example:
In the polypeptide sequence IHAVEADREAM
the 4-long word HAVE starting at position 2 may match
HAVE, RAVE, HIVE, HALE, …

Approximate Matches

For each word of length w from a Data Base, generate all similar words.
‘Similar’ means: score(word, word’) > T
Store all similar words in a look-up table.

DB search

1) For each word of length w from a query sequence, generate all similar words.
2) Access DB.
3) Extend each hit as much as possible -> High-scoring Segment Pair (HSP), with score(HSP) > V.
4) Around the HSP, perform DP. At each step the alignment score should be > T.

[Figure: a seed pair (starting point) between s-query and s-db, illustrated with the strings THEFIRSTLINIHAVEADREAMESIRPATRICKREAD and INVIEIAMDEADMEATTNAMHEWASNINETEEN.]

B&O, chapter 12: HIERARCHICAL METHODS

Some of the most accurate practical methods
Work by finding the guide tree to build the alignment
ClustalW is a hierarchical multiple alignment program. It uses a series of different pair-score matrices, biases the location of gaps and allows realignment of aligned sequences.
T-Coffee builds a library of pairwise alignments
PSI-BLAST: a “profile-based” method

Why do we do multiple alignments?

Multiple nucleotide or amino acid sequence alignments are usually performed for one of the following purposes:

– To characterize protein families by identifying shared regions of homology in a multiple sequence alignment (this generally happens when a sequence search has revealed homologies to several sequences)
– To determine the consensus sequence of several aligned sequences
– To help predict the secondary and tertiary structures of new sequences
– As a preliminary step in molecular evolution analysis using phylogenetic methods for constructing phylogenetic trees

An example of Multiple Alignment

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--
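A consensus sequence, one of the purposes of multiple alignment named in these slides, can be derived from such an alignment by taking the most common residue in each column. The sketch below uses the eight rows of the example alignment as read from the slide, assuming each row is 27 columns wide; the majority rule, the handling of gaps, and the tie-breaking (first residue seen among equally common ones) are choices made for illustration.

```python
from collections import Counter

ALIGNMENT = [
    "VTISCTGSSSNIGAG-NHVKWYQQLPG",
    "VTISCTGTSSNIGS--ITVNWYQQLPG",
    "LRLSCSSSGFIFSS--YAMYWVRQAPG",
    "LSLTCTVSGTSFDD--YYSTWVRQPPG",
    "PEVTCVVVDVSHEDPQVKFNWYVDG--",
    "ATLVCLISDFYPGA--VTVAWKADS--",
    "AALGCLVKDYFPEP--VTVSWNSG---",
    "VSLTCLVKGFYPSD--IAVEWWSNG--",
]


def consensus(rows: list, gap: str = "-") -> str:
    """Most common non-gap residue per column; a gap if the column is all gaps."""
    result = []
    for column in zip(*rows):
        residues = [c for c in column if c != gap]
        result.append(Counter(residues).most_common(1)[0][0] if residues else gap)
    return "".join(result)


def conserved_columns(rows: list) -> list:
    """Zero-based indices of columns where every row has the same residue."""
    return [i for i, column in enumerate(zip(*rows)) if len(set(column)) == 1 and column[0] != "-"]


print(consensus(ALIGNMENT))
print(conserved_columns(ALIGNMENT))
```

The fully conserved columns in this example include the cysteine at index 4 and the tryptophan at index 20.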

Multiple Alignment Method

The most practical and widely used method in multiple sequence alignment is the hierarchical extension of pairwise alignment methods.

The principle is that multiple alignment is achieved by successive application of pairwise methods.

Multiple Alignment Method

The steps are summarized as follows:
Compare all sequences pairwise.
Perform cluster analysis on the pairwise data to generate a hierarchy for alignment. This may be in the form of a binary tree or a simple ordering.
Build the multiple alignment by first aligning the most similar pair of sequences, then the next most similar pair, and so on. Once an alignment of two sequences has been made, it is fixed. Thus, for a set of sequences A, B, C, D, having aligned A with C and B with D, the alignment of A, B, C, D is obtained by comparing the alignment of A and C with that of B and D, using averaged scores at each aligned position.

Choosing sequences for alignment

General considerations

The more sequences to align the better.
Don’t include similar (>80%) sequences.
Sub-groups should be pre-aligned separately, and one member of each subgroup should be included in the final multiple alignment.

Multiple alignment in GCG

The program available in GCG for multiple alignment is Pileup.

The input file for Pileup is a list of sequence file names or sequence codes in the database, created with a text editor.

Pileup creates a multiple sequence alignment from a group of related sequences using progressive, pairwise alignments. It can also plot a tree showing the clustering relationships used to create the alignment.

Please note that there is no one absolute alignment, even for a limited number of sequences.

Output of Pileup

//

           1
OATNFA1    ~~~~~~~~~~ ~~~~~~~~~~ ~GGCCAAGAG
OATNFAR    ~~~~~GGGAC ACCAGGGGAC CAGCCAAGAG
BSPTNFA    ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
CEU14683   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
HSTNFR     ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~GCAGA
SYNTNFTRP  AGCAGACGCT CCCTCAGCAA GGACAGCAGA
CATTNFAA   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
CFTNFA     ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
RABTNFM    ~~~~AAGCTC CCTCAGTGAG GACACGGGCA
RNTNFAA    ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~

ShadyBox Output

Multiple alignment programs

ClustalW / ClustalX

pileup

multalign

multal

saga

hmmt

DIALIGN

SBpima

MLpima

T-Coffee

...


Global methods (e.g., ClustalX) get into trouble when data is not globally related!

[Figures: two ClustalX alignments.]

Possible solutions:
(1) Cut out conserved regions of interest and THEN align them
(2) Use a method that deals with local similarity (e.g. DIALIGN)

ClustalW for multiple alignment

ClustalW is a general-purpose multiple alignment program for DNA or proteins.

ClustalW is produced by Julie D. Thompson and Toby Gibson of the European Molecular Biology Laboratory, Germany, and Desmond Higgins of the European Bioinformatics Institute, Cambridge, UK.

ClustalW is cited as: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22:4673-4680.

ClustalW for multiple alignment

ClustalW can create multiple alignments, manipulate existing alignments, do profile analysis and create phylogenetic trees.

Alignment can be done by 2 methods:
- slow/accurate
- fast/approximate

Running ClustalW

[~]% clustalw

 **************************************************************
 ******** CLUSTAL W (1.7) Multiple Sequence Alignments ********
 **************************************************************

     1. Sequence Input From Disc
     2. Multiple Alignments
     3. Profile / Structure Alignments
     4. Phylogenetic trees

     S. Execute a system command
     H. HELP
     X. EXIT (leave program)

Your choice:

Running ClustalW

The input file for ClustalW is a single file containing all sequences in one of the following formats: NBRF/PIR, EMBL/SwissProt, Pearson (FASTA), GDE, Clustal, GCG/MSF, RSF.
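Of the input formats listed above, Pearson (FASTA) is the simplest to inspect by hand. The following is a minimal Python sketch of a reader, useful for checking that a file holds all the sequences before handing it to ClustalW. The function name `parse_fasta` and the two example records are hypothetical, and real files may need more careful handling (duplicate names, lowercase residues, comment lines).

```python
def parse_fasta(text):
    """Parse Pearson (FASTA) text into a dict of name -> sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the first word after '>' is taken as the name.
            fields = line[1:].split()
            name = fields[0] if fields else "seq%d" % (len(records) + 1)
            records[name] = []
        elif name is not None:
            # Sequence data may be wrapped over several lines.
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


# Hypothetical two-record input, for illustration only.
example = """>seqA made-up record
ATGCAGTT
>seqB
ATGG
AGTT
"""
sequences = parse_fasta(example)
for seq_name, seq in sequences.items():
    print(seq_name, len(seq), seq)
```

Block Maker, described later, also expects FASTA input with at least two sequences, so a check such as `len(sequences) >= 2` applies there as well.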

Using ClustalW

****** MULTIPLE ALIGNMENT MENU ******

    1. Do complete multiple alignment now (Slow/Accurate)
    2. Produce guide tree file only
    3. Do alignment using old guide tree file

    4. Toggle Slow/Fast pairwise alignments = SLOW

    5. Pairwise alignment parameters
    6. Multiple alignment parameters

    7. Reset gaps between alignments? = OFF
    8. Toggle screen display          = ON
    9. Output format options

    S. Execute a system command
    H. HELP
    or press [RETURN] to go back to main menu

Your choice:

Output of ClustalW

CLUSTAL W (1.7) multiple sequence alignment

HSTNFR    GGGAAGAG---TTCCCCAGGGACCTCTCTCTAATCAGCCCTCTGGCCCAG------GCAG
SYNTNFTRP GGGAAGAG---TTCCCCAGGGACCTCTCTCTAATCAGCCCTCTGGCCCAG------GCAG
CFTNFA    -------------------------------------------TGTCCAG------ACAG
CATTNFAA  GGGAAGAG---CTCCCACATGGCCTGCAACTAATCAACCCTCTGCCCCAG------ACAC
RABTNFM   AGGAGGAAGAGTCCCCAAACAACCTCCATCTAGTCAACCCTGTGGCCCAGATGGTCACCC
RNTNFAA   AGGAGGAGAAGTTCCCAAATGGGCTCCCTCTCATCAGTTCCATGGCCCAGACCCTCACAC
OATNFA1   GGGAAGAGCAGTCCCCAGCTGGCCCCTCCTTCAACAGGCCTCTGGTTCAG------ACAC
OATNFAR   GGGAAGAGCAGTCCCCAGCTGGCCCCTCCTTCAACAGGCCTCTGGTTCAG------ACAC
BSPTNFA   GGGAAGAGCAGTCCCCAGGTGGCCCCTCCATCAACAGCCCTCTGGTTCAA------ACAC
CEU14683  GGGAAGAGCAATCCCCAACTGGCCTCTCCATCAACAGCCCTCTGGTTCAG------ACCC
                                                         **        *
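The row of asterisks under a ClustalW alignment marks the columns in which every sequence carries the same residue. A short Python sketch of that rule follows, applied to the last 13 columns of the ten rows in the output above. The function name `conservation_line` is hypothetical, and the sketch only reproduces the asterisk marks, not any other symbols a given ClustalW version may print.

```python
def conservation_line(rows):
    """Return '*' under each column where all rows hold the same residue."""
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all aligned rows must have the same length")
    marks = []
    for column in zip(*rows):
        first = column[0]
        # A column of gaps is not counted as conserved.
        identical = first != "-" and all(c == first for c in column)
        marks.append("*" if identical else " ")
    return "".join(marks)


# Last 13 alignment columns of each sequence in the output shown above.
rows = [
    "CAG------GCAG",  # HSTNFR
    "CAG------GCAG",  # SYNTNFTRP
    "CAG------ACAG",  # CFTNFA
    "CAG------ACAC",  # CATTNFAA
    "CAGATGGTCACCC",  # RABTNFM
    "CAGACCCTCACAC",  # RNTNFAA
    "CAG------ACAC",  # OATNFA1
    "CAG------ACAC",  # OATNFAR
    "CAA------ACAC",  # BSPTNFA
    "CAG------ACCC",  # CEU14683
]
marks = conservation_line(rows)
for row in rows:
    print(row)
print(marks)
```

Only three of these columns are identical in all ten sequences, which is why the consensus line in the output carries just three asterisks.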

ClustalW options

Your choice: 5

********* PAIRWISE ALIGNMENT PARAMETERS *********

Slow/Accurate alignments:

    1. Gap Open Penalty       :15.00
    2. Gap Extension Penalty  :6.66
    3. Protein weight matrix  :BLOSUM30
    4. DNA weight matrix      :IUB

Fast/Approximate alignments:

    5. Gap penalty            :5
    6. K-tuple (word) size    :2
    7. No. of top diagonals   :4
    8. Window size            :4

    9. Toggle Slow/Fast pairwise alignments = SLOW

    H. HELP

Enter number (or [RETURN] to exit):
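The separate gap open and gap extension penalties in this menu describe an affine gap model: starting a gap is expensive, lengthening it is cheaper. The Python sketch below uses the menu defaults (15.00 and 6.66) and assumes the common form "open + extension x length". This is an illustration of the model only; ClustalW also adjusts gap penalties by position and by sequence, which is not modelled here, and the function name `affine_gap_cost` is hypothetical.

```python
def affine_gap_cost(length, gap_open=15.00, gap_extend=6.66):
    """Cost of one gap of the given length under a simple affine model.

    Assumed form: gap_open + gap_extend * length (zero for no gap).
    """
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0.0
    return gap_open + gap_extend * length


# One gap of six positions versus six separate gaps of one position each.
one_long_gap = affine_gap_cost(6)
six_short_gaps = 6 * affine_gap_cost(1)
print("one gap of length 6 :", round(one_long_gap, 2))
print("six gaps of length 1:", round(six_short_gaps, 2))
```

Under this model a single long gap costs less than the same number of gapped positions scattered along the alignment, so raising the open penalty relative to the extension penalty favours fewer, longer gaps.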

ClustalW options

Your choice: 6

********* MULTIPLE ALIGNMENT PARAMETERS *********

    1. Gap Opening Penalty        :15.00
    2. Gap Extension Penalty      :6.66
    3. Delay divergent sequences  :40 %

    4. DNA Transitions Weight     :0.50

    5. Protein weight matrix      :BLOSUM series
    6. DNA weight matrix          :IUB
    7. Use negative matrix        :OFF

    8. Protein Gap Parameters

    H. HELP

Enter number (or [RETURN] to exit):
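The "DNA Transitions Weight" entry controls how purine-purine (A/G) and pyrimidine-pyrimidine (C/T) substitutions are scored relative to other mismatches. The Python sketch below illustrates the idea on the assumption that a weight of 0 scores a transition as a mismatch and a weight of 1 scores it as a match, with values in between interpolating. The match and mismatch scores of 1.0 and 0.0 and the function names are hypothetical choices for illustration, not ClustalW's internal values.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(a, b):
    """True for A<->G or C<->T substitutions (same chemical class)."""
    pair = {a, b}
    return a != b and (pair <= PURINES or pair <= PYRIMIDINES)


def dna_pair_score(a, b, transition_weight=0.50, match=1.0, mismatch=0.0):
    """Score one aligned pair of nucleotides (illustrative values only)."""
    a, b = a.upper(), b.upper()
    if a == b:
        return match
    if is_transition(a, b):
        # Interpolate between the mismatch and match scores.
        return mismatch + transition_weight * (match - mismatch)
    return mismatch  # transversion


for x, y in [("A", "A"), ("A", "G"), ("C", "T"), ("A", "C")]:
    print(x, y, dna_pair_score(x, y))
```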

ClustalX - Multiple Sequence Alignment Program

ClustalX provides a new window-based user interface to the ClustalW program.

It uses the Vibrant multi-platform user interface development library, developed by the National Center for Biotechnology Information (Bldg 38A, NIH, 8600 Rockville Pike, Bethesda, MD 20894) as part of their NCBI SOFTWARE DEVELOPMENT TOOLKIT.

ClustalX [six screenshot slides; images not reproduced]

Blocks database and tools

Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins.

The Blocks web server tools are: Block Searcher, Get Blocks and Block Maker. These are aids to detection and verification of protein sequence homology.

They compare a protein or DNA sequence to a database of protein blocks, retrieve blocks, and create new blocks, respectively.

The BLOCKS web server

At URL: http://blocks.fhcrc.org/

The BLOCKS WWW server can be used to create blocks of a group of sequences, or to compare a protein sequence to a database of blocks.

The Blocks Searcher tool should be used for multiple alignment of distantly related protein sequences.

The Blocks Searcher tool

For searching a database of blocks, the first position of the sequence is aligned with the first position of the first block, and a score for that amino acid is obtained from the profile column corresponding to that position. Scores are summed over the width of the alignment, and then the block is aligned with the next position.

This procedure is carried out exhaustively for all positions of the sequence for all blocks in the database, and the best alignments between a sequence and entries in the BLOCKS database are noted. If a particular block scores highly, it is possible that the sequence is related to the group of sequences the block represents.
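The search procedure just described can be sketched in a few lines of Python: slide each block along the sequence, sum the profile-column scores at every offset, and keep the best offset per block. This is a minimal sketch under stated assumptions, not the Block Searcher implementation. The names `best_block_hit` and `search_blocks`, the toy three-column profiles and their scores are all hypothetical, and calibration of scores against chance matches is left out.

```python
def best_block_hit(sequence, profile, default=0.0):
    """Slide one block along a sequence; return (best_start, best_score).

    profile is a list with one dict per block column, mapping
    residue -> score. Residues absent from a column score `default`.
    """
    width = len(profile)
    if width == 0 or len(sequence) < width:
        return None
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Sum the column scores over the width of the block.
        score = sum(col.get(res, default) for col, res in zip(profile, window))
        if best is None or score > best[1]:
            best = (start, score)
    return best


def search_blocks(sequence, blocks):
    """Score a sequence against every block; highest-scoring block first."""
    hits = []
    for name, profile in blocks.items():
        hit = best_block_hit(sequence, profile)
        if hit is not None:
            hits.append((name, hit[0], hit[1]))
    return sorted(hits, key=lambda h: h[2], reverse=True)


# Toy database with two made-up blocks (scores are illustrative only).
toy_blocks = {
    "blockA": [{"G": 2.0, "A": 0.5}, {"K": 3.0, "R": 1.0}, {"S": 2.0, "T": 1.5}],
    "blockB": [{"W": 4.0}, {"C": 4.0}, {"H": 4.0}],
}
hit = best_block_hit("MAGKSTL", toy_blocks["blockA"])
ranked = search_blocks("MAGKSTL", toy_blocks)
print(hit)
print(ranked)
```

In the toy run the first block matches best at offset 2 (residues GKS), while the second block finds no support anywhere in the sequence, mirroring the idea that a high-scoring block suggests membership in the group it represents.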

The Blocks Searcher tool

Typically, a group of proteins has more than one region in common, and their relationship is represented as a series of blocks separated by unaligned regions. If a second block for a group also scores highly in the search, the evidence that the sequence is related to the group is strengthened, and it is further strengthened if a third block also scores highly, and so on.

The BLOCKS Database

The blocks for the BLOCKS database are made automatically by looking for the most highly conserved regions in groups of proteins represented in the PROSITE database. These blocks are then calibrated against the SWISS-PROT database to obtain a measure of the chance distribution of matches. It is these calibrated blocks that make up the BLOCKS database.

The Block Maker Tool

Block Maker finds conserved blocks in a group of two or more unaligned protein sequences, which are assumed to be related, using two different algorithms.

Input file must contain at least 2 sequences.

Input sequences must be in FASTA format.

Results are returned by e-mail.

vs Clustal – room for improvement

Gaps are consistent with the phylogenetic tree

CLUSTALW artifacts?

Gaps are largely inconsistent with the phylogenetic tree

Misc Links

http://hits.isb-sib.ch/cgi-bin/PFSCAN

http://www.soe.ucsc.edu/research/compbio/HMM-apps/HMM-applications.html

http://server1-kimlab.stanford.edu/cgi-bin/index.cgi?BigFigures+ZahnFig4

CAMDA competition: http://www.camda.duke.edu/

Baylor College Sequencing Center:
http://www.hgsc.bcm.tmc.edu/projects/rmacaque/
http://www.hgsc.bcm.tmc.edu/projects/chimpanzee/
