Design and creation of multiple sequence alignments
Unit 13
BIOL221T: Advanced Bioinformatics for Biotechnology
Irene Gabashvili, PhD
Dot Plot (Matrix) for Sequence Comparison
Reminders from Previous Lectures

DOTPLOTS
[Figure: dot plot of DOROTHYHODGKIN against DOROTHYCROWFOOTHODGKIN]
Dot Matrix: Self Comparison
[Figure: dot matrix of ATGCAGTT against itself; the identity diagonal is marked]

Dot Matrix: Self Comparison
[Figure: dot matrix of ATGCATGC against itself; the identity diagonal and a direct repeat are marked]

Dot Matrix: Point Mutation
[Figure: dot matrix of ATGGAGTT against ATGCAGTT; the main diagonal and the point mutation are marked]

Dot Matrix: Gap
[Figure: dot matrix of ATGAACAGTT against ATGCAGTT; the main diagonal and the deletion/insertion are marked]

Dot Matrix: Rearrangement
[Figure: dot matrix of AGTTATGC against ATGCAGTT; the main diagonal is marked]
Dot Plot Analysis
Advantages
  Simple and fast.
  Can detect DNA rearrangement.
Disadvantages
  No numerical values produced.
  Subjective interpretation.
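A dot matrix like the ones pictured on the preceding slides can be produced in a few lines. The following is only a minimal sketch: the function names `dot_matrix` and `render` are illustrative, and a real dot plot tool would usually add a window size and a match threshold.

```python
def dot_matrix(seq_x, seq_y):
    """Cell [i][j] is True when seq_y[i] equals seq_x[j]."""
    return [[a == b for b in seq_x] for a in seq_y]


def render(seq_x, seq_y):
    """Draw the matrix as text: '*' for a match, '.' otherwise."""
    lines = ["  " + " ".join(seq_x)]
    for base, row in zip(seq_y, dot_matrix(seq_x, seq_y)):
        lines.append(base + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


# Self comparison of a sequence with a direct repeat (ATGC twice).
print(render("ATGCATGC", "ATGCATGC"))
```

The identity diagonal appears as an unbroken line of `*`, and the direct repeat shows up as shorter parallel diagonals.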
Problems of Sequence Alignment
How to score? Match, mismatch and gap.
Example: +1 for each match, 0 for each mismatch, -2 for each internal gap and 0 for each terminal gap (similarity score).
Computational measures
Distance measure
  0 for a match
  1 for a mismatch or gap
  Lowest is best
Another measure
  2 for a match
  -1 for a mismatch, -2 for a gap
  Highest is best
Gap Penalties
Linear score: f(g) = -gd
Affine score: f(g) = -d - (g-1)e
  d = gap open penalty
  e = gap extend penalty
  g = gap length
Example gap penalty values used: d = 500, e = 50
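The two gap penalty formulas can be compared directly in code. This is a small sketch using the slide's symbols (g, d, e) and its example values; the function names are illustrative.

```python
def linear_gap(g, d):
    """Linear score f(g) = -g*d for a gap of length g."""
    return -g * d


def affine_gap(g, d, e):
    """Affine score f(g) = -d - (g-1)*e: open once, then extend."""
    return -d - (g - 1) * e


# With d = 500 and e = 50, a gap of length 3 is much cheaper under the
# affine model than three separately opened gaps would be.
print(linear_gap(3, 500), affine_gap(3, 500, 50))
```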
Example from Lab-Feb20:
-1 for each terminal gap, -2 for each internal gap (gap penalty)
Blosum(A,A) = 4; Blosum(A,P) = -1; Blosum(A,W) = -3; Blosum(P,P) = 7; Blosum(P,W) = -4

AWAP     -1-3-1+7 = 2 (one terminal gap, 2 mismatches)
-APP

AWAP-    -3+4+7 = 8 (3 terminal gaps, no mismatches); best if the internal gap penalty is high
--APP

AWAP     -2+4-1+7 = 8 (one internal gap, 1 mismatch); best if the terminal gap penalty is high
A-PP
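The three scores in the lab example can be checked mechanically. The sketch below hard-codes only the five Blosum values quoted on the slide; the helper names and the rule used to decide whether a gap is terminal (only gaps lie between it and one end of its sequence) are assumptions made for illustration.

```python
# Substitution scores quoted in the lab example (symmetric lookup).
BLOSUM = {("A", "A"): 4, ("A", "P"): -1, ("A", "W"): -3,
          ("P", "P"): 7, ("P", "W"): -4}


def pair_score(a, b):
    return BLOSUM[(a, b)] if (a, b) in BLOSUM else BLOSUM[(b, a)]


def is_terminal(seq, i):
    """A gap is terminal if only gaps lie to its left or to its right."""
    return set(seq[:i]) <= {"-"} or set(seq[i + 1:]) <= {"-"}


def alignment_score(top, bottom, terminal=-1, internal=-2):
    total = 0
    for i, (a, b) in enumerate(zip(top, bottom)):
        if a == "-":
            total += terminal if is_terminal(top, i) else internal
        elif b == "-":
            total += terminal if is_terminal(bottom, i) else internal
        else:
            total += pair_score(a, b)
    return total


for top, bottom in [("AWAP", "-APP"), ("AWAP-", "--APP"), ("AWAP", "A-PP")]:
    print(top, bottom, alignment_score(top, bottom))
```

Changing the `terminal` and `internal` arguments shows which of the two score-8 alignments wins when one kind of gap is penalized more heavily.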
How to find the alignment with the best score?

Finding the alignment with the best score
Brute force approach: calculate the scores of all possible alignments and select the best ones.
For two 1000 bp DNA sequences, the number of possible alignments is about 10^600. The brute force approach is impossible.
Dynamic Programming Methods
Find the best alignment without calculating all possible alignments.
The method is EXACT.
The original method by Needleman & Wunsch performs global alignment.
The modification by Smith & Waterman performs local alignment.
Needleman & Wunsch method (match = 1, mismatch = 0, gap = -2)

       A    T    G    A    A    C    A    G    T    T
A      1   -1   -3   -5   -7   -9  -11  -13  -15  -17
T     -1    2    0   -2   -4   -6   -8  -10  -12  -14
G     -3    0    3    1   -1   -3   -5   -7   -9  -11
C     -5   -2    1    3    1    0   -2   -4   -6   -8
A     -7   -4   -1    2    4    2    1   -1   -3   -5
G     -9   -6   -3    0    2    4    2    2    0   -2
T    -11   -8   -5   -2    0    2    4    2    3    1
T    -13  -10   -7   -4   -2    0    2    4    2    4
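The matrix fill behind a table like this one can be sketched as follows. This is a minimal illustration of the Needleman & Wunsch recurrence with the slide's parameters; it assumes an extra border row and column initialised with cumulative gap penalties, and it returns scores only (no traceback).

```python
def needleman_wunsch_matrix(rows, cols, match=1, mismatch=0, gap=-2):
    """Fill the global alignment score matrix F, border included."""
    n, m = len(rows), len(cols)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap          # leading gaps in the column sequence
    for j in range(1, m + 1):
        F[0][j] = j * gap          # leading gaps in the row sequence
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if rows[i - 1] == cols[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align the two bases
                          F[i - 1][j] + gap,     # gap in the column sequence
                          F[i][j - 1] + gap)     # gap in the row sequence
    return F


F = needleman_wunsch_matrix("ATGCAGTT", "ATGAACAGTT")
print("global alignment score:", F[-1][-1])
```

The global score is read from the bottom-right cell.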
Local Alignment with the Smith-Waterman Algorithm
One modification is added: any negative score is changed to 0. That is, alignment will not be done unless the score is positive.

Smith-Waterman method (match = 1, mismatch = 0, gap = -2)

     A  T  G  A  A  C  A  G  T  T
A    1  0  0  1  1  0  1  0  0  0
T    0  2  0  0  1  1  0  1  1  1
G    0  0  3  1  0  1  1  1  1  1
C    0  0  1  3  1  1  1  1  1  1
A    1  0  0  2  4  2  2  1  1  1
G    0  1  0  0  2  4  2  3  1  1
T    0  1  1  0  0  2  4  2  4  2
T    0  1  1  1  0  0  2  4  3  5
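The single modification described above (negative scores are reset to 0) is easy to see in code. This is a sketch of the Smith-Waterman recurrence with the slide's parameters; the function name is illustrative and no traceback is performed.

```python
def smith_waterman_matrix(rows, cols, match=1, mismatch=0, gap=-2):
    """Fill the local alignment score matrix H; borders stay at 0."""
    H = [[0] * (len(cols) + 1) for _ in range(len(rows) + 1)]
    for i in range(1, len(rows) + 1):
        for j in range(1, len(cols) + 1):
            s = match if rows[i - 1] == cols[j - 1] else mismatch
            H[i][j] = max(0,                      # never go negative
                          H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
    return H


H = smith_waterman_matrix("ATGCAGTT", "ATGAACAGTT")
best = max(max(row) for row in H)
print("best local alignment score:", best)
```

Unlike the global case, the best local score is the maximum over the whole matrix, not necessarily the bottom-right cell.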
Scoring schemes
Dynamic programming guarantees correct results for any given scoring scheme, but the biological basis of the scoring schemes is weak, except for the fact that insertions/deletions are rarer than substitutions and are scored accordingly.
Match-Mismatch score
DNA
  Transitions are more frequent than transversions (e.g., for M. tuberculosis SNPs ~2:1) and can be scored accordingly.
  In practice, base transitions and transversions are usually scored equally.
Proteins
  Substitution matrix such as PAM or BLOSUM
Transitions & Transversions
Transition: a nucleotide substitution from one purine to another purine (e.g., A->G), or from one pyrimidine to another pyrimidine (e.g., T->C).
Transversion: a nucleotide substitution from a purine to a pyrimidine (e.g., A->C), or vice versa (e.g., T->G).
[Figure: purines and pyrimidines]
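The two definitions translate into a short classifier. This is a sketch only; the function name is illustrative and the input is assumed to be two different upper-case DNA bases.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(a, b):
    """Return 'transition' or 'transversion' for a substitution a -> b."""
    if a == b:
        raise ValueError("not a substitution: both bases are " + a)
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


for a, b in [("A", "G"), ("T", "C"), ("A", "C"), ("T", "G")]:
    print(a, "->", b, classify_substitution(a, b))
```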
Gap penalty
Linear model = k
Affine model = 0 + k (0 = gap opening penalty, k = gap extension penalty)
More biologically realistic models need exponentially decreasing gap penalty functions such as 0 + log k. Computational complexity prohibits their common use.
More advanced scoring systems
Position-dependent scores: use a different matrix (and penalty) at different positions in proteins. The functional importance of protein regions affects divergence.
Structure-dependent scores.
Software providing ALIGNMENT tools
MATLAB: Bioinformatics toolbox
  [GlobalScore, GlobalAlignment] = nwalign(humanProtein,...
      mouseProtein)
  … swalign
  showalignment(GlobalAlignment)
ORACLE 10g BLAST functions: blastn, blastp, blastx, etc.
Types of Algorithms
Heuristic
  A heuristic is an algorithm that will yield reasonable results, even if it is not provably optimal or lacks even a performance guarantee.
  In most cases, heuristic methods can be very fast, but they make additional assumptions and will miss the best match for some sequence pairs.
Dynamic Programming
  The algorithm for finding optimal alignments given an additive alignment score.
  These types of algorithms are guaranteed to find the optimal scoring alignment or set of alignments.
HMM
  Based on probability theory; very versatile.
  http://www.soe.ucsc.edu/research/compbio/HMM-apps/HMM-applications.html
Hidden Markov Model (HMM)
Markov chain
  A chain of events in which the probability of each event depends only on the preceding event.
Assumption: DNA can be viewed as a Markov chain. The probability of A, T, G, or C appearing in each position depends on the kind of nucleotide in the preceding position.
A Markov chain is defined by
  P(A|A) = probability of a base being A if the preceding base is A.
  P(T|G) = probability of a base being T if the preceding base is G.
  And so on.
So a DNA Markov chain is defined by 16 probabilities.
[Figure: Markov chain model of DNA with states A, G, T, C. Each arrow is defined by a transition probability.]
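One way to obtain the 16 probabilities is to count adjacent base pairs in a known sequence. The sketch below does that. The function name, the toy input sequence and the pseudocount (added so that unseen pairs do not get probability zero) are assumptions for illustration and are not part of the lecture.

```python
from collections import Counter

BASES = "ACGT"


def transition_probabilities(seq, pseudocount=1.0):
    """Estimate P(next | prev) for all 16 base pairs from one sequence.

    The result is keyed as (next, prev) to mirror the notation P(T|G).
    """
    counts = Counter(zip(seq, seq[1:]))          # (prev, next) pair counts
    table = {}
    for prev in BASES:
        total = sum(counts[(prev, nxt)] + pseudocount for nxt in BASES)
        for nxt in BASES:
            table[(nxt, prev)] = (counts[(prev, nxt)] + pseudocount) / total
    return table


probs = transition_probabilities("ACACAC")
print(len(probs), "probabilities; P(C|A) =", round(probs[("C", "A")], 3))
```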
Hidden Markov Model
Hidden: state path, e.g., NNNNNNNNCCCCCCCCCCCNNNNN
Not hidden: DNA sequence, e.g., attactggcggccgcgtcgatctg
The question is to find the most probable (hidden) state path when the (non-hidden) sequence is known.
Algorithm to find the Most Probable State Path (Decoding)
If the parameters are known:
  Viterbi algorithm
  Posterior decoding
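A compact Viterbi decoder illustrates how the most probable state path is recovered once the parameters are known. This is a sketch in log space. The two states N and C follow the slide's example state path, but every probability below is a made-up toy value chosen only to make the example readable.

```python
import math


def viterbi(seq, states, start, trans, emit):
    """Return the most probable state path for seq as a string."""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}
    back = []
    for symbol in seq[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            best = max(states, key=lambda p: prev[p] + math.log(trans[p][s]))
            pointers[s] = best
            score[s] = (prev[best] + math.log(trans[best][s])
                        + math.log(emit[s][symbol]))
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))


# Hypothetical toy parameters: state C favours c/g, state N favours a/t.
STATES = "NC"
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {"N": {"a": 0.45, "t": 0.45, "c": 0.05, "g": 0.05},
        "C": {"a": 0.05, "t": 0.05, "c": 0.45, "g": 0.45}}

print(viterbi("aaaacccccaaaa", STATES, START, TRANS, EMIT))
```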
Estimation of parameters
Usually a “training set” of sequences is required.
The “training set” may be
  Sequences of known state
  Sequences of unknown state. Parameters are arbitrarily set and reiterated until state changes are minimal.
HMM for identifying coding DNA sequences
[Figure: two four-state (A, G, T, C) Markov chains, one labelled Coding (exon) and one labelled Non-Coding (intron)]
Hidden Markov Model for Coding Sequence predictions
Hidden: state path (I=intron, X=exon), e.g., IIIIIIIIXXXXXXXXXXXXIIIIIIIIIIIIIIIIIIIIIIIIXXXXXXXXXXXX
Not hidden: DNA sequence, e.g., attactggcggccgcgtcgatctgggtcttaggtadtgtacggcccctcgtaggca
The question is to find the most probable (hidden) state path when the (non-hidden) sequence is known.
Training sets for coding sequence prediction
Best come from experimental work
Best come from the same species
HMM for Spliced Alignment (between genomic and EST sequences)
[Figure: Paired (exon) states A/A, G/G, T/T, C/C and Unpaired (intron) states A, G, T, C]
Selection of Alignment Programs
Global vs Local
Pairwise (1-1), database searching (1-many), module searching (1-1 many loci), multiple
Distance between query and database
Number of queries, size of databases
Exact vs Heuristic
Multiple sequence alignment
Multiple sequence alignment
  Dynamic programming: restricted to 3-4 sequences at most.
  Progressive sequence alignment: ClustalW, X.
  Divide and conquer methodology
  HMM
  Others
Constructing common patterns
  Consensus: TATAAT
  Weight matrix
  Input (from training set) for HMM methods
  Input for PSI-BLAST
Multiple Sequence Alignments: Creation and Analysis
Chapter 12, B&O – Protein Alignment
What is a Multiple Alignment?
  Structural or Evolutionary? (do not necessarily correspond, not really possible)
How to multiply align?
How to generate alignments?
Tools
Significance of an Alignment Score
Statistical methods used to evaluate the significance of an alignment score: Z-score, P-value and E-value

Significance of Score
Z-score = (score – mean)/std. dev
  Measures how unusual our original match is. Z > 5 are significant.
P-value measures the probability that the alignment is no better than random. (Z and P depend on the distribution of the scores.)
  P < 10^-100: exact match.
E-value is the expected number of sequences that give the same Z-score or better. (E = P x size of the database)
  E < 0.02: sequences probably homologous.
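The Z-score and E-value formulas are simple enough to write out. In this sketch the background scores are assumed to come from aligning randomised sequences, the sample standard deviation is used, and all numbers are invented for illustration.

```python
from statistics import mean, stdev


def z_score(score, random_scores):
    """Z = (score - mean) / std. dev of the background scores."""
    return (score - mean(random_scores)) / stdev(random_scores)


def e_value(p_value, database_size):
    """E = P x size of the database."""
    return p_value * database_size


# Hypothetical numbers: one real score against three shuffled-sequence scores.
print(z_score(50, [10, 20, 30]))
print(e_value(1e-6, 2_000_000))
```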
Aligning more than 2 sequences
Sequences should not be very different in length
Should be edited down to regions that are most similar (PSI-BLAST does it automatically, but not all tools do)
Random alignment of pairs of sequences helps in assessing similarities
Multiple Sequence Alignment
N-W or S-W algorithms can be generalized to >2 sequences, but their computational complexity precludes their use for >3 sequences.
Heuristic approaches, e.g. the progressive alignment method, are required.
These methods cannot guarantee the best multiple alignment but in most cases give biologically meaningful results.
Progressive Alignment Method
Each pair of sequences is aligned (e.g. by the N-W method).
The similarity in each pair is used for constructing a dendrogram relating the sequences.
The most similar sequences are aligned first.
Then the next most similar sequences or clusters of sequences are sequentially aligned.
Progressive Alignment Method
A popular program is the Clustal series.
ClustalV aligns up to 30 sequences; it penalizes left terminal gaps but not right terminal gaps.
ClustalW aligns up to 100 sequences; it does not penalize terminal gaps.
ClustalX: Windows-based.
Comparing a sequence with a profile of a group of sequences
Testing arm of HMM.
Searching arm of PSI-BLAST: for more sensitive search of homologous sequences
With a profile of protein sequences for comparative molecular modeling.
PSI-BLAST
A profile search method
A query sequence is used to search for similar sequences.
The sequences are used to generate a sequence profile.
The profile is again searched against the databases.
The method increases the sensitivity of the search over normal BLAST. False positives can be a problem.
BLAST
Basic Local Alignment Search Tool
Altschul et al, 1990
Heuristic
Makes list of words
  Fixed-length subsequences
  Default 3 protein, or 11 nucleotides
Keeps words that match the query with score above some threshold
  See file “triples.ss” for some discussion of thresholds
Searches database for words in this set
When it finds a sequence containing a word in the set
  Uses that as a “seed” for hit extension
  In both directions
  Extending the possible match as an ungapped alignment
After version 2.0 BLAST can handle gaps
FASTA Idea
Idea: a good alignment probably matches some identical ‘words’ (ktups)
Example:
  Database record:
    ACTTGTAGATACAAAATGTG
  Aligned query sequence:
    A-TTGTCG-TACAA-ATCTGT
  Matching words of size 4
Dictionaries of Words
ACTTGTAGATAC is translated to the dictionary:
  ACTT,
  CTTG,
  TTGT,
  TGTA…
Dictionaries of well aligned sequences share words.
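Building such a dictionary amounts to sliding a window of size k along the sequence. A minimal sketch follows; the function name `ktup_words` is illustrative.

```python
def ktup_words(seq, k=4):
    """List every overlapping word of length k, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


words = ktup_words("ACTTGTAGATAC", 4)
print(words[:4])

# Two sequences that align well share words:
shared = set(words) & set(ktup_words("ATTGTCGTACAA", 4))
print(sorted(shared))
```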
FASTA Stage I
Prepare dictionary for db sequence (in advance)
Upon query:
  Prepare dictionary for query sequence
  For each DB record:
    Find matching words
    Search for long diagonal runs of matching words
    Init-1 score: longest run
    Discard record if low score
[Figure: matching words (*) plotted by position in query against position in DB record]
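Matching words that lie on the same diagonal share the same offset between their position in the DB record and their position in the query. The sketch below counts hits per diagonal. This is a simplification of Stage I: it counts all hits on a diagonal rather than scoring the longest contiguous run, and the function names are illustrative.

```python
from collections import Counter, defaultdict


def word_positions(seq, k):
    """Dictionary: word -> list of start positions in seq."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_hits(query, record, k=4):
    """Count matching words per diagonal (record position - query position)."""
    record_index = word_positions(record, k)
    hits = Counter()
    for q_pos in range(len(query) - k + 1):
        for r_pos in record_index.get(query[q_pos:q_pos + k], []):
            hits[r_pos - q_pos] += 1
    return hits


print(diagonal_hits("TTGTAG", "ACTTGTAGATAC"))
```

A record whose best diagonal has few hits would be discarded at this stage.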
FASTA Stage II
A good alignment is a path through many runs, with short connections
Assign weights to runs (+) and connections (-)
Find a path of max weight
Init-n score: total path weight
Discard record if low score
FASTA Stage III
Improve Init-1: apply an exact algorithm around the Init-1 diagonal within a band of given width.
Opt-score: new weight
Discard record if low score

FASTA final stage
Apply an exact algorithm to surviving records, computing the final alignment score.
BLAST (Basic Local Alignment Search Tool): Approximate Matches
BLAST:
  Words are allowed to contain inexact matching.
Example:
  In the polypeptide sequence IHAVEADREAM
  The 4-long word HAVE starting at position 2 may match
  HAVE, RAVE, HIVE, HALE, …

Approximate Matches
For each word of length w from a database, generate all similar words.
‘Similar’ means: score(word, word’) > T
Store all similar words in a look-up table.
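The set of similar words can be generated by brute force for short words. The sketch below uses a deliberately simple identity score (1 per matching position) as a stand-in for a substitution matrix such as BLOSUM, so the threshold and the resulting neighbourhood are illustrative only.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def identity_score(word, other):
    """Toy score: number of positions at which the two words agree."""
    return sum(a == b for a, b in zip(word, other))


def similar_words(word, threshold, score=identity_score):
    """All words of the same length with score(word, word') > threshold."""
    candidates = ("".join(c) for c in product(AMINO_ACIDS, repeat=len(word)))
    return {w for w in candidates if score(word, w) > threshold}


neighbours = similar_words("HAVE", threshold=2)
print(len(neighbours), sorted(neighbours)[:5])
```

With a real substitution matrix passed as `score`, conservative replacements would be admitted while unlikely ones would fall below T.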
DB search
1) For each word of length w from a query sequence, generate all similar words.
2) Access DB.
3) Extend each hit as much as possible -> High-scoring Segment Pair (HSP)
   score(HSP) > V
[Figure: query THEFIRSTLINIHAVEADREAMESIRPATRICKREAD and database sequence INVIEIAMDEADMEATTNAMHEWASNINETEEN]

DB search
4) Around the HSP perform DP.
   At each step the alignment score should be > T
[Figure: s-query against s-db, with the starting point (seed pair) marked]
B&O, chapter 12: HIERARCHICAL METHODS
Some of the most accurate practical methods
Work by finding the guide tree to build the alignment
ClustalW is a hierarchical multiple alignment program. It uses a series of different pair-score matrices, biases the location of gaps and allows realigning already aligned sequences
T-coffee builds a library of pairwise alignments
Psi-Blast – “profile-based” method
Why do we do multiple alignments?
Multiple nucleotide or amino acid sequence alignment techniques are usually performed for one of the following purposes:
– To characterize protein families by identifying shared regions of homology in a multiple sequence alignment (this generally happens when a sequence search reveals homologies to several sequences)
– Determination of the consensus sequence of several aligned sequences.
– Help prediction of the secondary and tertiary structures of new sequences;
– Preliminary step in molecular evolution analysis using phylogenetic methods for constructing phylogenetic trees.
An example of Multiple Alignment
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--
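A simple majority-rule consensus can be read off an alignment column by column. The sketch below applies it to the eight rows of the example. The function name is illustrative, gaps are counted like any other character, and ties are resolved arbitrarily, so only the fully conserved columns are meaningful here.

```python
from collections import Counter

ROWS = [
    "VTISCTGSSSNIGAG-NHVKWYQQLPG",
    "VTISCTGTSSNIGS--ITVNWYQQLPG",
    "LRLSCSSSGFIFSS--YAMYWVRQAPG",
    "LSLTCTVSGTSFDD--YYSTWVRQPPG",
    "PEVTCVVVDVSHEDPQVKFNWYVDG--",
    "ATLVCLISDFYPGA--VTVAWKADS--",
    "AALGCLVKDYFPEP--VTVSWNSG---",
    "VSLTCLVKGFYPSD--IAVEWWSNG--",
]


def consensus(rows):
    """Most common character in each alignment column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def conserved_columns(rows):
    """Indices of columns where every row carries the same residue."""
    return [i for i, col in enumerate(zip(*rows)) if len(set(col)) == 1]


print(consensus(ROWS))
print(conserved_columns(ROWS))
```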
Multiple Alignment Method
The most practical and widely used method in multiple sequence alignment is the hierarchical extension of pairwise alignment methods.
The principle is that multiple alignment is achieved by successive application of pairwise methods.

Multiple Alignment Method
The steps are summarized as follows:
1) Compare all sequences pairwise.
2) Perform cluster analysis on the pairwise data to generate a hierarchy for alignment. This may be in the form of a binary tree or a simple ordering.
3) Build the multiple alignment by first aligning the most similar pair of sequences, then the next most similar pair and so on. Once an alignment of two sequences has been made, it is fixed. Thus for a set of sequences A, B, C, D, having aligned A with C and B with D, the alignment of A, B, C, D is obtained by comparing the alignment of A and C with that of B and D using averaged scores at each aligned position.
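The clustering step that decides the order of alignment can be sketched with average-linkage merging. The pairwise distances below are invented so that A is closest to C and B is closest to D, as in the example above. The function name and the choice of average linkage are assumptions, not necessarily what any particular program uses.

```python
from itertools import combinations


def merge_order(names, distance):
    """Return the order in which clusters would be aligned and merged."""
    def average(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(distance[frozenset(p)] for p in pairs) / len(pairs)

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of clusters and fix their alignment.
        left, right = min(combinations(clusters, 2),
                          key=lambda pair: average(*pair))
        merges.append((left, right))
        clusters = [c for c in clusters if c not in (left, right)]
        clusters.append(left + right)
    return merges


# Hypothetical pairwise distances (smaller = more similar).
DIST = {frozenset(pair): d for pair, d in {
    ("A", "C"): 0.1, ("B", "D"): 0.2, ("A", "B"): 0.7,
    ("A", "D"): 0.8, ("B", "C"): 0.6, ("C", "D"): 0.7}.items()}

for left, right in merge_order("ABCD", DIST):
    print("align", "".join(left), "with", "".join(right))
```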
Choosing sequences for alignment
General considerations
The more sequences to align the better.
Don’t include similar (>80%) sequences.
Sub-groups should be pre-aligned separately, and one member of each subgroup should be included in the final multiple alignment.
Multiple alignment in GCG
The program available in GCG for multiple alignment is Pileup.
The input file for Pileup is a list of sequence file_names or sequence codes in the database, created by a text editor.
Pileup creates a multiple sequence alignment from a group of related sequences using progressive, pairwise alignments. It can also plot a tree showing the clustering relationships used to create the alignment.
Please note that there is no one absolute alignment, even for a limited number of sequences.

Output of Pileup
//
           1
OATNFA1    ~~~~~~~~~~ ~~~~~~~~~~ ~GGCCAAGAG
OATNFAR    ~~~~~GGGAC ACCAGGGGAC CAGCCAAGAG
BSPTNFA    ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
CEU14683   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
HSTNFR     ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~GCAGA
SYNTNFTRP  AGCAGACGCT CCCTCAGCAA GGACAGCAGA
CATTNFAA   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
CFTNFA     ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
RABTNFM    ~~~~AAGCTC CCTCAGTGAG GACACGGGCA
RNTNFAA    ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~

[Figure: ShadyBox Output]
Multiple alignment programs
ClustalW / ClustalX
pileup
multalign
multal
saga
hmmt
DIALIGN
SBpima
MLpima
T-Coffee
...
Global methods (e.g., ClustalX) get into trouble when data is not globally related!!!
[Figure: ClustalX alignment]
Possible solutions:
(1) Cut out conserved regions of interest and THEN align them
(2) Use a method that deals with local similarity (e.g. DIALIGN)
ClustalW- for multiple ClustalW- for multiple alignmentalignment
ClustaW is a general purpose multiple alignment ClustaW is a general purpose multiple alignment program for DNA or proteins.program for DNA or proteins.
ClustalW is produced by Julie D. Thompson, ClustalW is produced by Julie D. Thompson, Toby Gibson of European Molecular Biology Toby Gibson of European Molecular Biology Laboratory, Germany and Desmond Higgins of Laboratory, Germany and Desmond Higgins of European Bioinformatics Institute, Cambridge, European Bioinformatics Institute, Cambridge, UK. AlgorithmicUK. Algorithmic
ClustalW is cited: improving the sensitivity of ClustalW is cited: improving the sensitivity of progressive multiple sequence alignment through progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Research, and weight matrix choice. Nucleic Acids Research, 22:4673-4680.22:4673-4680.
ClustalW can create multiple alignments, manipulate existing alignments, do profile analysis and create phylogenetic trees.
Alignment can be done by 2 methods:
- slow/accurate
- fast/approximate
Running ClustalW
[~]% clustalw

 **************************************************************
 ******** CLUSTAL W (1.7) Multiple Sequence Alignments ********
 **************************************************************

     1. Sequence Input From Disc
     2. Multiple Alignments
     3. Profile / Structure Alignments
     4. Phylogenetic trees

     S. Execute a system command
     H. HELP
     X. EXIT (leave program)

Your choice:
The input file for ClustalW is a file containing all sequences in one of the following formats: NBRF/PIR, EMBL/SwissProt, Pearson (Fasta), GDE, Clustal, GCG/MSF, RSF.
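Of the input formats listed, Pearson (Fasta) is the simplest: a header line starting with ">" followed by one or more lines of sequence. The sketch below is only an illustration of that layout, not ClustalW's own reader; the function name `read_fasta` and the two toy records are made up for the example.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse Pearson (Fasta) text into {sequence name: sequence}."""
    records: Dict[str, List[str]] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines carry no information
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("header line without a sequence name")
            name = fields[0]  # the first word after '>' is used as the name
            if name in records:
                raise ValueError(f"duplicate sequence name: {name}")
            records[name] = []
        elif name is None:
            raise ValueError("sequence data before the first '>' header")
        else:
            records[name].append(line)  # sequences may span several lines
    return {n: "".join(parts) for n, parts in records.items()}


# Hypothetical two-record input, invented for this example.
example = """>seq1 toy record
ATGCAG
TT
>seq2
AGTTATGC
""".splitlines()

sequences = read_fasta(example)
```

A multiple alignment needs at least two records, so a caller would normally check `len(sequences) >= 2` before going further.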
Using ClustalW

 ****** MULTIPLE ALIGNMENT MENU ******

    1. Do complete multiple alignment now (Slow/Accurate)
    2. Produce guide tree file only
    3. Do alignment using old guide tree file

    4. Toggle Slow/Fast pairwise alignments = SLOW

    5. Pairwise alignment parameters
    6. Multiple alignment parameters

    7. Reset gaps between alignments? = OFF
    8. Toggle screen display          = ON
    9. Output format options

    S. Execute a system command
    H. HELP
    or press [RETURN] to go back to main menu

Your choice:
Output of ClustalW
CLUSTAL W (1.7) multiple sequence alignment

HSTNFR      GGGAAGAG---TTCCCCAGGGACCTCTCTCTAATCAGCCCTCTGGCCCAG------GCAG
SYNTNFTRP   GGGAAGAG---TTCCCCAGGGACCTCTCTCTAATCAGCCCTCTGGCCCAG------GCAG
CFTNFA      -------------------------------------------TGTCCAG------ACAG
CATTNFAA    GGGAAGAG---CTCCCACATGGCCTGCAACTAATCAACCCTCTGCCCCAG------ACAC
RABTNFM     AGGAGGAAGAGTCCCCAAACAACCTCCATCTAGTCAACCCTGTGGCCCAGATGGTCACCC
RNTNFAA     AGGAGGAGAAGTTCCCAAATGGGCTCCCTCTCATCAGTTCCATGGCCCAGACCCTCACAC
OATNFA1     GGGAAGAGCAGTCCCCAGCTGGCCCCTCCTTCAACAGGCCTCTGGTTCAG------ACAC
OATNFAR     GGGAAGAGCAGTCCCCAGCTGGCCCCTCCTTCAACAGGCCTCTGGTTCAG------ACAC
BSPTNFA     GGGAAGAGCAGTCCCCAGGTGGCCCCTCCATCAACAGCCCTCTGGTTCAA------ACAC
CEU14683    GGGAAGAGCAATCCCCAACTGGCCTCTCCATCAACAGCCCTCTGGTTCAG------ACCC
                                                           **        *
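The asterisks under a Clustal alignment mark columns where every sequence carries the same residue. A minimal sketch of that rule is shown here, assuming a column with any gap is never marked; the function name `conservation_line` is made up, and the rows are only the last 20 columns of the ten sequences in the output above.

```python
from typing import Sequence


def conservation_line(rows: Sequence[str]) -> str:
    """Return '*' under fully conserved, gap-free columns and ' ' elsewhere."""
    if not rows:
        raise ValueError("need at least one aligned row")
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("aligned rows must all have the same length")
    marks = []
    for column in zip(*rows):  # walk the alignment column by column
        residues = set(column)
        conserved = len(residues) == 1 and "-" not in residues
        marks.append("*" if conserved else " ")
    return "".join(marks)


# Columns 41-60 of the ten rows shown in the ClustalW output.
tail = [
    "TCTGGCCCAG------GCAG",  # HSTNFR
    "TCTGGCCCAG------GCAG",  # SYNTNFTRP
    "---TGTCCAG------ACAG",  # CFTNFA
    "TCTGCCCCAG------ACAC",  # CATTNFAA
    "TGTGGCCCAGATGGTCACCC",  # RABTNFM
    "CATGGCCCAGACCCTCACAC",  # RNTNFAA
    "TCTGGTTCAG------ACAC",  # OATNFA1
    "TCTGGTTCAG------ACAC",  # OATNFAR
    "TCTGGTTCAA------ACAC",  # BSPTNFA
    "TCTGGTTCAG------ACCC",  # CEU14683
]

marks = conservation_line(tail)
```

Only three columns come out conserved, which matches the three asterisks in the output: the short CFTNFA fragment is all gaps over the first 43 columns, so nothing there can be marked.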
ClustalW options
Your choice: 5

 ********* PAIRWISE ALIGNMENT PARAMETERS *********

     Slow/Accurate alignments:

     1. Gap Open Penalty       :15.00
     2. Gap Extension Penalty  :6.66
     3. Protein weight matrix  :BLOSUM30
     4. DNA weight matrix      :IUB

     Fast/Approximate alignments:

     5. Gap penalty            :5
     6. K-tuple (word) size    :2
     7. No. of top diagonals   :4
     8. Window size            :4

     9. Toggle Slow/Fast pairwise alignments = SLOW

     H. HELP

Enter number (or [RETURN] to exit):
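The fast/approximate parameters refer to k-tuples (short words) and to diagonals of the dot matrix. The sketch below illustrates the idea only and is not ClustalW's implementation: it counts shared k-tuples per diagonal and keeps the best few. The function name `top_diagonals` is made up; the defaults mirror the menu values (k-tuple size 2, top diagonals 4), and the two sequences are the rearrangement example from the dot-plot slides.

```python
from collections import Counter
from typing import Dict, List, Tuple


def top_diagonals(seq_a: str, seq_b: str, k: int = 2,
                  n_top: int = 4) -> List[Tuple[int, int]]:
    """Return up to n_top (diagonal, count) pairs, diagonal = j - i."""
    # Index every k-tuple of seq_a by the positions where it starts.
    positions: Dict[str, List[int]] = {}
    for i in range(len(seq_a) - k + 1):
        positions.setdefault(seq_a[i:i + k], []).append(i)
    # Each shared k-tuple is one dot on diagonal j - i of the dot matrix.
    hits: Counter = Counter()
    for j in range(len(seq_b) - k + 1):
        for i in positions.get(seq_b[j:j + k], []):
            hits[j - i] += 1
    return hits.most_common(n_top)


diagonals = top_diagonals("ATGCAGTT", "AGTTATGC")
```

For this pair the hits fall on two off-centre diagonals (offsets -4 and +4) and none on the main diagonal, which is the same picture the rearrangement dot plot gives.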
ClustalW options
Your choice: 6

 ********* MULTIPLE ALIGNMENT PARAMETERS *********

     1. Gap Opening Penalty              :15.00
     2. Gap Extension Penalty            :6.66
     3. Delay divergent sequences        :40 %

     4. DNA Transitions Weight           :0.50

     5. Protein weight matrix            :BLOSUM series
     6. DNA weight matrix                :IUB
     7. Use negative matrix              :OFF

     8. Protein Gap Parameters

     H. HELP

Enter number (or [RETURN] to exit):
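Separate opening and extension penalties mean that one long gap costs less than several short ones. A rough sketch, assuming the common affine form (opening penalty charged once, extension penalty charged per gap position); ClustalW also adjusts its penalties by position and sequence, which is not modelled here, and the function name `affine_gap_cost` is made up.

```python
def affine_gap_cost(length: int, gap_open: float = 15.00,
                    gap_extend: float = 6.66) -> float:
    """Cost of one gap of `length` columns under an assumed affine model."""
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0.0  # no gap, no penalty
    # Assumption: open is paid once, extend is paid for every gap column.
    return gap_open + gap_extend * length


one_long_gap = affine_gap_cost(6)          # a single 6-column gap
six_short_gaps = 6 * affine_gap_cost(1)    # six separate 1-column gaps
```

With the menu defaults, the single 6-column gap costs well under half of six isolated gaps, so the alignment prefers to group gap positions together.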
ClustalX - Multiple Sequence Alignment Program
ClustalX provides a new window-based user interface to the ClustalW program.
It uses the Vibrant multi-platform user interface development library, developed by the National Center for Biotechnology Information (Bldg 38A, NIH, 8600 Rockville Pike, Bethesda, MD 20894) as part of their NCBI Software Development Toolkit.
Blocks database and tools
Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins.
The Blocks web server tools are Block Searcher, Get Blocks and Block Maker. These are aids to detection and verification of protein sequence homology.
They compare a protein or DNA sequence to a database of protein blocks, retrieve blocks, and create new blocks, respectively.
The BLOCKS web server
At URL: http://blocks.fhcrc.org/
The BLOCKS WWW server can be used to create blocks of a group of sequences, or to compare a protein sequence to a database of blocks.
The Blocks Searcher tool should be used for multiple alignment of distantly related protein sequences.
The Blocks Searcher tool
For searching a database of blocks, the first position of the sequence is aligned with the first position of the first block, and a score for that amino acid is obtained from the profile column corresponding to that position. Scores are summed over the width of the alignment, and then the block is aligned with the next position.
This procedure is carried out exhaustively for all positions of the sequence for all blocks in the database, and the best alignments between a sequence and entries in the BLOCKS database are noted. If a particular block scores highly, it is possible that the sequence is related to the group of sequences the block represents.
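The search procedure described here can be sketched as a sliding, ungapped profile scan. This is an illustration of the description, not the Blocks server's code: the names `scan_block` and `search_blocks`, the toy profiles and their scores are all invented, residues missing from a column are assumed to score 0, and translation of DNA queries is not handled.

```python
from typing import Dict, List, Tuple

# One {residue: score} dictionary per column of the block.
Profile = List[Dict[str, float]]


def scan_block(sequence: str, profile: Profile,
               default: float = 0.0) -> Tuple[int, float]:
    """Slide a block along the sequence; return (best start, best score)."""
    width = len(profile)
    if width == 0 or len(sequence) < width:
        raise ValueError("sequence must be at least as long as the block")
    best_start, best_score = 0, float("-inf")
    for start in range(len(sequence) - width + 1):
        # Sum the column scores over the width of the block.
        score = sum(column.get(sequence[start + k], default)
                    for k, column in enumerate(profile))
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score


def search_blocks(sequence: str,
                  blocks: Dict[str, Profile]) -> List[Tuple[str, int, float]]:
    """Scan every block that fits; return hits sorted by best score."""
    hits = [(name, *scan_block(sequence, profile))
            for name, profile in blocks.items()
            if 0 < len(profile) <= len(sequence)]
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


# Toy blocks with made-up scores.
toy_blocks = {
    "toy_block": [{"G": 2, "A": -1}, {"K": 3}, {"S": 1, "T": 1}],
    "other_block": [{"L": 1}],
}
hits = search_blocks("MAGKTL", toy_blocks)
```

Here `hits[0]` is the best-scoring block with the position where it fits the query; a real search would then judge whether that score is high compared with chance matches.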
Typically, a group of proteins has more than one region in common and their relationship is represented as a series of blocks separated by unaligned regions. If a second block for a group also scores highly in the search, the evidence that the sequence is related to the group is strengthened, and is further strengthened if a third block also scores highly, and so on.
The BLOCKS Database
The blocks for the BLOCKS database are made automatically by looking for the most highly conserved regions in groups of proteins represented in the PROSITE database. These blocks are then calibrated against the SWISS-PROT database to obtain a measure of the chance distribution of matches. It is these calibrated blocks that make up the BLOCKS database.
The Block Maker Tool
Block Maker finds conserved blocks in a group of two or more unaligned protein sequences, which are assumed to be related, using two different algorithms.
Input file must contain at least 2 sequences.
Input sequences must be in FastA format.
Results are returned by e-mail.
vs Clustal – room for improvement
Gaps are consistent with the phylogenetic tree
CLUSTALW artifacts?
Gaps are largely inconsistent with the phylogenetic tree
Misc Links
http://hits.isb-sib.ch/cgi-bin/PFSCAN
http://www.soe.ucsc.edu/research/compbio/HMM-apps/HMM-applications.html
http://server1-kimlab.stanford.edu/cgi-bin/index.cgi?BigFigures+ZahnFig4
CAMDA competition: http://www.camda.duke.edu/
Baylor College Sequencing Center:
http://www.hgsc.bcm.tmc.edu/projects/rmacaque/
http://www.hgsc.bcm.tmc.edu/projects/chimpanzee/