Upload
ferdinand-norton
View
214
Download
0
Tags:
Embed Size (px)
Citation preview
04/21/23 1
Pairwise sequence Alignment
04/21/23 2
Many of the images in this power point presentationare from Bioinformatics and Functional Genomicsby Jonathan Pevsner (ISBN 0-471-21004-8). Copyright © 2003 by John Wiley & Sons, Inc.
Many slides of this power point presentation Are from slides of Dr. Jonathon Pevsner and other people. The Copyright belong to the original authors. Thanks!
Copyright notice
04/21/23 3
• It is used to decide if two proteins (or genes) are related structurally or functionally
• It is used to identify domains or motifs that are shared between proteins
• It is the basis of BLAST searching
• It is used in the analysis of genomes
Pairwise sequence alignment is the most fundamental operation of bioinformatics
04/21/23 4
04/21/23 5
Pairwise alignment: protein sequencescan be more informative than DNA
• protein is more informative (20 vs 4 characters); many amino acids share related biophysical properties
• codons are degenerate: changes in the third position often do not alter the amino acid that is specified
• DNA sequences can be translated into protein, and then used in pairwise alignments
04/21/23 6
• DNA can be translated into six potential proteins
5’ CAT CAA 5’ ATC AAC 5’ TCA ACT
5’ GTG GGT 5’ TGG GTA 5’ GGG TAG
Pairwise alignment: protein sequencescan be more informative than DNA
5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’
04/21/23 7
Pairwise alignment The process of lining up two or more sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology.
Definitions
04/21/23 8
HomologySimilarity attributed to descent from a common ancestor.
Definitions
Page 44retinol-binding protein (rbp)
(NP_006735)-lactoglobulin
(P02754)
04/21/23 9
HomologySimilarity attributed to descent from a common ancestor.
Definitions
Identity
The extent to which two (nucleotide or amino acid) sequences are invariant.
Page 44
RBP: 26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD- 84 + K ++ + + +GTW++MA+ L+ A V T + +L+ W+ glycodelin: 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN 81
04/21/23 10
Orthologs Homologous sequences in different species that arose from a common ancestral gene during speciation; may or may not be responsible for a similar function.
Paralogs Homologous sequences within a single species that arose by gene duplication.
Definitions: two types of homology
04/21/23 11
04/21/23 12
Definitions• Similarity: The extent to which nucleotide or protein
sequences are related.
• Identity: The extent to which two sequences are invariant.
• Conservation: Changes at a specific position of an amino acid or (less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue.
• Percent Similarity of two protein sequence: Identity + Conservation.
04/21/23 13
04/21/23 14
Conservation Substitutions
• Basic Amino Acid (K, R, H)
• Acidic Amino Acid (D, E)
• Hydroxylated Amino Acid (S, T)
• Hydrophobic Amino Acid (W, F, Y, L, I, V, M, A)
Pairwise alignment of retinol-binding protein and -lactoglobulin
04/21/23 15
1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50 RBP . ||| | . |. . . | : .||||.:| : 1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin
51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97 RBP : | | | | :: | .| . || |: || |. 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93 lactoglobulin
98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP || ||. | :.|||| | . .| 94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin
137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP . | | | : || . | || | 136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 lactoglobulin
Identity(bar)
Pairwise alignment of retinol-binding protein and -lactoglobulin
04/21/23 16
1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50 RBP . ||| | . |. . . | : .||||.:| : 1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin
51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97 RBP : | | | | :: | .| . || |: || |. 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93 lactoglobulin
98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP || ||. | :.|||| | . .| 94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin
137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP . | | | : || . | || | 136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 lactoglobulin
Somewhatsimilar
(one dot)
Verysimilar
(two dots)
04/21/23 17
Gaps
• Common Mutations: substitutions, insertions, deletions.
• Insertions and deletion lead to a Gap in the alignment: Positions at which a letter is paired with a null are called gaps. – Gap scores are typically negative. – Since a single mutational event may cause the
insertion or deletion of more than one residue, the presence of a gap is ascribed more significance than the length of the gap.
– In BLAST, it is rarely necessary to change gap values from the default.
RBP: 26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD- 84 + K ++ + + +GTW++MA+ L+ A V T + +L+ W+ glycodelin: 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN 81
04/21/23 18
1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50 RBP . ||| | . |. . . | : .||||.:| : 1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin
51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97 RBP : | | | | :: | .| . || |: || |. 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93 lactoglobulin
98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP || ||. | :.|||| | . .| 94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin
137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP . | | | : || . | || | 136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 lactoglobulin
Pairwise alignment of retinol-binding protein and -lactoglobulin
Internalgap
Terminalgap
04/21/23 19
General approach to pairwise alignment
• Choose two sequences• Select an algorithm that generates a score• Allow gaps (insertions, deletions)• Score reflects degree of similarity• Alignments can be global or local• Estimate probability that the alignment occurred by chance
04/21/23 20
Key Issues in Pairwise Alignment
• The scoring system
• Types of alignments (local vs. global)
• The alignment algorithm
• Measuring alignment significance
04/21/23 21
An alignment scoring system is required to evaluate how good an alignment is
• positive and negative values assigned
• gap creation and extension penalties
• positive score for identities
• some partial positive score for conservative substitutions
• use of a substitution matrix
04/21/23 22
Scoring systems
• Match scores: s(a,b) assigns a score to each combination of aligned letters. Examples: PAM, Blosum.
• Gap score: f(g) assigns a score to a gap of length g. Examples: linear, affine.
• Scoring systems are usually additive: the total score is the sum of the substitution scores and all the gap scores.
04/21/23 23
Gap Scores
• Gap scores are negative (they are “costs”)
• Linear: gap score is proportional to length of gap. f(g) = -g · d
• Affine: gap score is a constant plus a linear score. With affine scores, many small gaps cost more than one large gap. f(g) = -d -(g-1)e, for g>0
04/21/23 24
Calculation of an alignment score
Match Score Tables
• Match scores are computed using a substitution matrix
Log-odds Match Scores
• Match scores are usually log-odds scores.
s(a,b) = log(Pr(a, b | M) / Pr(a, b | R))• Pr(a, b | M) = pab is the probability of the residues
a and b appearing assuming the correct positions are aligned and the sequences descended from a common ancestor.
• Pr(a, b | R) is the probability of a and b if we pick two random sequence positions to align. Assuming independence (null model) gives Pr(a, b | R)=qaqb.
Adding log-odds substitution scores gives the log-odds of the alignment• For ungapped
alignments, the log-odds alignment score of two sequence segments x and y is equal to the log-odds ratio of the segments.
• Let x = x1 x2 ...xn and y = y1 y2 ...yn. )|,Pr(
)|,Pr(log
log
log
),(
Ryx
Myx
p
p
yxsS
i yx
yx
yx
yx
i
ii
i
ii
ii
ii
ii
A substitution matrix contains values proportional to the probability that amino acid i mutates into amino acid j for all pairs of amino acids.
Substitution matrices are constructed by assembling a large and diverse sample of verified pairwise alignments(or multiple sequence alignments) of amino acids. Substitution matrices should reflect the true probabilities of mutations occurring through a period of evolution.
The two major types of substitution matrices arePAM and BLOSUM.
Substitution Matrix
04/21/23 29
Dayhoff Model: Accepted Point mutations
• Accepted Point Mutations (PAM): a replacement of one amino acid in protein by another residue that has been accepted by natural selection.
• When PAM occurs:– A gene undergoes a DNA mutation such that
it encodes a different amino acid– The entire species adopts that changes as the
predominant form.
04/21/23 30
Dayhoff’s 34 protein superfamilies
Protein PAMs per 100 million yearsIg kappa chain 37Kappa casein 33Lactalbumin 27Hemoglobin 12Myoglobin 8.9Insulin 4.4Histone H4 0.10Ubiquitin 0.00
04/21/23 31
A Ala
R Arg
N Asn
D Asp
C Cys
Q Gln
E Glu
G Gly
A
R 30
N 109 17
D 154 0 532
C 33 10 0 0
Q 93 120 50 76 0
E 266 0 94 831 0 422
G 579 10 156 162 10 30 112
H 21 103 226 43 10 243 23 10
Dayhoff’s numbers of “accepted point mutations”:what amino acid substitutions occur in proteins?
Number of Accepted point mutations, multiplied by 10, in 1572 cases
04/21/23 32
Dayhoff et al. described the “relative mutability” of each amino acid as the probability that amino acid will change over a small evolutionary time period. The total number of changes are counted (on all branches of all protein trees considered), and the total number of occurrences of each amino acid is also considered. A ratio is determined.
Relative mutability [changes] / [occurrences]
Example:
sequence 1 ala his val alasequence 2 ala arg ser val
For ala, relative mutability = [1] / [3] = 0.33For val, relative mutability = [2] / [2] = 1.0
The relative mutability of amino acids
04/21/23 33
The relative mutability of amino acids
Asn 134 His 66Ser 120 Arg 65Asp 106 Lys 56Glu 102 Pro 56Ala 100 Gly 49Thr 97 Tyr 41Ile 96 Phe 41Met 94 Leu 40Gln 93 Cys 20Val 74 Trp 18
04/21/23 34
Normalized frequencies of amino acids
Gly 8.9% Arg 4.1%Ala 8.7% Asn 4.0%Leu 8.5% Phe 4.0%Lys 8.1% Gln 3.8%Ser 7.0% Ile 3.7%Val 6.5% His 3.4%Thr 5.8% Cys 3.3%Pro 5.1% Tyr 3.0%Glu 5.0% Met 1.5%Asp 4.7% Trp 1.0%
blue=6 codons; red=1 codon
These values sum to 1.0.
04/21/23 35
PAM matrices are based on global alignments of closely related proteins.
Other PAM matrices are extrapolated from PAM1.
All the PAM data come from closely related proteins(>85% amino acid identity)
PAM matrices:Accepted point mutations
04/21/23 36
Dayhoff’s PAM1
• PAM1: calculated from comparisons of sequences with no more than 1% divergence. – Defined as evolutionary interval.– 1 PAM = PAM1 = 1% average change of all amino
acid positions• Value in PAM1: for one amino acid
– number of “accepted point mutations” × relative mutability × fraction of change from one amino acid to another amino acid over all changes of one amino acid to any other amino acid.
– Normalization over all 20 changes.
04/21/23 37
Dayhoff’s PAM1 mutation probability matrix
A Ala
R Arg
N Asn
D Asp
C Cys
Q Gln
E Glu
G Gly
H His
I Ile
A 9867 2 9 10 3 8 17 21 2 6
R 1 9913 1 0 1 10 0 0 10 3
N 4 1 9822 36 0 4 6 6 21 3
D 6 0 42 9859 0 6 53 6 4 1
C 1 1 0 0 9973 0 0 0 1 1
Q 3 9 4 5 0 9876 27 1 23 1
E 10 0 7 56 0 35 9865 4 2 3
G 21 1 12 11 1 3 7 9935 1 0
H 1 8 18 3 1 20 1 0 9912 0
I 2 2 3 1 2 1 2 0 0 9872
Each element of the matrix shows the probability that an originalamino acid (top) will be replaced by another amino acid (side)
PAM
• After 100 PAMs of evolution, not every residue will have changed– some residues may have mutated several
times– some residues may have returned to their
original state– some residues may not changed at all
04/21/23 39
Dayhoff’s PAM0 mutation probability matrix:the rules for extremely slowly evolving proteins
PAM0 AAla
RArg
NAsn
DAsp
CCys
QGln
EGlu
GGly
A 100% 0% 0% 0% 0% 0% 0% 0%R 0% 100% 0% 0% 0% 0% 0% 0%N 0% 0% 100% 0% 0% 0% 0% 0%D 0% 0% 0% 100% 0% 0% 0% 0%C 0% 0% 0% 0% 100% 0% 0% 0%Q 0% 0% 0% 0% 0% 100% 0% 0%E 0% 0% 0% 0% 0% 0% 100% 0%G 0% 0% 0% 0% 0% 0% 0% 100%
Top: original amino acidSide: replacement amino acid
04/21/23 40
Dayhoff’s PAM2000 mutation probability matrix:the rules for very distantly related proteins
PAM AAla
RArg
NAsn
DAsp
CCys
QGln
EGlu
GGly
A 8.7% 8.7% 8.7% 8.7% 8.7% 8.7% 8.7% 8.7%R 4.1% 4.1% 4.1% 4.1% 4.1% 4.1% 4.1% 4.1%N 4.0% 4.0% 4.0% 4.0% 4.0% 4.0% 4.0% 4.0%D 4.7% 4.7% 4.7% 4.7% 4.7% 4.7% 4.7% 4.7%C 3.3% 3.3% 3.3% 3.3% 3.3% 3.3% 3.3% 3.3%Q 3.8% 3.8% 3.8% 3.8% 3.8% 3.8% 3.8% 3.8%E 5.0% 5.0% 5.0% 5.0% 5.0% 5.0% 5.0% 5.0%G 8.9% 8.9% 8.9% 8.9% 8.9% 8.9% 8.9% 8.9%
Top: original amino acidSide: replacement amino acid
04/21/23 41
PAM250 Matrix
• Commonly used
• Describes the frequency of amino acid replacement between distantly related proteins
• An evolutionary distance where proteins share about 20% amino acid identity
04/21/23 42
PAM250 mutation probability matrix A R N D C Q E G H I L K M F P S T W Y V A 13 6 9 9 5 8 9 12 6 8 6 7 7 4 11 11 11 2 4 9
R 3 17 4 3 2 5 3 2 6 3 2 9 4 1 4 4 3 7 2 2
N 4 4 6 7 2 5 6 4 6 3 2 5 3 2 4 5 4 2 3 3
D 5 4 8 11 1 7 10 5 6 3 2 5 3 1 4 5 5 1 2 3
C 2 1 1 1 52 1 1 2 2 2 1 1 1 1 2 3 2 1 4 2
Q 3 5 5 6 1 10 7 3 7 2 3 5 3 1 4 3 3 1 2 3
E 5 4 7 11 1 9 12 5 6 3 2 5 3 1 4 5 5 1 2 3
G 12 5 10 10 4 7 9 27 5 5 4 6 5 3 8 11 9 2 3 7
H 2 5 5 4 2 7 4 2 15 2 2 3 2 2 3 3 2 2 3 2
I 3 2 2 2 2 2 2 2 2 10 6 2 6 5 2 3 4 1 3 9
L 6 4 4 3 2 6 4 3 5 15 34 4 20 13 5 4 6 6 7 13
K 6 18 10 8 2 10 8 5 8 5 4 24 9 2 6 8 8 4 3 5
M 1 1 1 1 0 1 1 1 1 2 3 2 6 2 1 1 1 1 1 2
F 2 1 2 1 1 1 1 1 3 5 6 1 4 32 1 2 2 4 20 3
P 7 5 5 4 3 5 4 5 5 3 3 4 3 2 20 6 5 1 2 4
S 9 6 8 7 7 6 7 9 6 5 4 7 5 3 9 10 9 4 4 6
T 8 5 6 6 4 5 5 6 4 6 4 6 5 3 6 8 11 2 3 6
W 0 2 0 0 0 0 0 0 1 0 1 0 0 1 0 1 0 55 1 0
Y 1 1 2 1 3 1 1 1 3 2 2 1 2 15 1 2 2 3 31 2
V 7 4 4 4 4 4 4 5 4 15 10 4 10 5 5 5 7 2 4 17
Top: original amino acidSide: replacement amino acid
04/21/23 43
A 2 R -2 6 N 0 0 2 D 0 -1 2 4 C -2 -4 -4 -5 12 Q 0 1 1 2 -5 4 E 0 -1 1 3 -5 2 4 G 1 -3 0 1 -3 -1 0 5 H -1 2 2 1 -3 3 1 -2 6 I -1 -2 -2 -2 -2 -2 -2 -3 -2 5 L -2 -3 -3 -4 -6 -2 -3 -4 -2 -2 6 K -1 3 1 0 -5 1 0 -2 0 -2 -3 5 M -1 0 -2 -3 -5 -1 -2 -3 -2 2 4 0 6 F -3 -4 -3 -6 -4 -5 -5 -5 -2 1 2 -5 0 9 P 1 0 0 -1 -3 0 -1 0 0 -2 -3 -1 -2 -5 6 S 1 0 1 0 0 -1 0 1 -1 -1 -3 0 -2 -3 1 2 T 1 -1 0 0 -2 -1 0 0 -1 0 -2 0 -1 -3 0 1 3 W -6 2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4 0 -6 -2 -5 17 Y -3 -4 -2 -4 0 -4 -4 -5 0 -1 -1 -4 -2 7 -5 -3 -3 0 10 V 0 -2 -2 -2 -2 -2 -2 -1 -2 4 2 -2 2 -1 -1 -1 0 -6 -2 4 A R N D C Q E G H I L K M F P S T W Y V
PAM250 log oddsscoring matrix
04/21/23 44
How do we go from a mutation probabilitymatrix to a log odds matrix?
• The cells in a log odds matrix consist of an “odds ratio”:
the probability that an alignment is authenticthe probability that the alignment was random
The score S for an alignment of residues a,b is given by:
S(a,b) = 10 log10 (Mab/pb)
As an example, for tryptophan,
S(a,tryptophan) = 10 log10 (0.55/0.010) = 17.4
04/21/23 45
What do the numbers meanin a log odds matrix?
S(a,tryptophan) = 10 log10 (0.55/0.010) = 17.4
A score of +17 for tryptophan means that this alignmentis 50 times more likely than a chance alignment of twoTrp residues.
S(a,b) = 17Probability of replacement (Mab/pb) = xThen17 = 10 log10 x1.7 = log10 x101.7 = x = 50
04/21/23 46
What do the numbers meanin a log odds matrix?
A score of +2 indicates that the amino acid replacementoccurs 1.6 times as frequently as expected by chance.
A score of 0 is neutral.
A score of –10 indicates that the correspondence of two amino acids in an alignment that accurately representshomology (evolutionary descent) is one tenth as frequentas the chance alignment of these amino acids.
04/21/23 47
Comparing two proteins with a PAM1 matrixgives completely different results than PAM250!
Consider two distantly related proteins. A PAM40 matrixis not forgiving of mismatches, and penalizes themseverely. Using this matrix you can find almost no match.
A PAM250 matrix is very tolerant of mismatches.
hsrbp, 136 CRLLNLDGTC btlact, 3 CLLLALALTC * ** * **
24.7% identity in 81 residues overlap; Score: 77.0; Gap frequency: 3.7% hsrbp, 26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDV btlact, 21 QTMKGLDIQKVAGTWYSLAMAASD-ISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWEN * **** * * * * ** *
hsrbp, 86 --CADMVGTFTDTEDPAKFKM btlact, 80 GECAQKKIIAEKTKIPAVFKI ** * ** **
04/21/23 48
04/21/23 49
PAM: “Accepted point mutation”
• Two proteins with 50% identity may have 80 changesper 100 residues. (Why? Because any residue can besubject to back mutations.)
• Proteins with 20% to 25% identity are in the “twilight zone”and may be statistically significantly related.
• PAM or “accepted point mutation” refers to the “hits” or matches between two sequences (Dayhoff & Eck, 1968)
04/21/23 50
Ancestral sequence
Sequence 1ACCGATC
Sequence 2AATAATC
A no change AC single substitution C --> AC multiple substitutions C --> A --> TC --> G coincidental substitutions C --> AT --> A parallel substitutions T --> AA --> C --> T convergent substitutions A --> TC back substitution C --> T --> C
ACCCTAC
Li (1997) p.70
04/21/23 51
Pe
rce
nt
ide
nti
ty
Evolutionary distance in PAMs
Two randomly diverging protein sequences changein a negatively exponential fashion
“twilight zone”
04/21/23 52
Pe
rce
nt
ide
nti
ty
Differences per 100 residues
At PAM1, two proteins are 99% identicalAt PAM10.7, there are 10 differences per 100 residuesAt PAM80, there are 50 differences per 100 residues
At PAM250, there are 80 differences per 100 residues
“twilight zone”
04/21/23 53
PAM matrices reflect different degrees of divergence
PAM250
04/21/23 54
PAM250
• In the “twilight zone”
• At this level of divergence, it is difficulty to assess whether the two proteins are homologous.
• Other techniques may use– Multiple sequence alignment– Structural alignment.
04/21/23 55
Comments
• Dayhoff's methodology of comparing closely related species turned out not to work very well for aligning evolutionarily divergent sequences.
• Sequence changes over long evolutionary time scales are not well approximated by compounding small changes that occur over short time scales.
04/21/23 56
Henikoff and Henikoff: BLOSUM
• BLOSUM: Block Substitution Matrices– constructed these matrices using multiple alignments
of evolutionarily divergent proteins. – The probabilities used in the matrix calculation are
computed by looking at "blocks" of conserved sequences found in multiple protein alignments.
– These conserved sequences are assumed to be of functional importance within related proteins.
04/21/23 57
BLOSUM Matrices
• The BLOCKS database contains thousands of groups of multiple sequence alignments.
• The BLOSUM62 matrix is calculated from observed substitutions between proteins that share 62% sequence identity or more – the BLOSUM100 matrix is calculated from alignments between proteins
showing 100% identity the proteins in the BLOCKS database – One would use a higher numbered BLOSUM matrix for aligning two
closely related sequences and a lower number for more divergent sequences.
• BLOSUM62 is the default matrix in BLAST 2.0. Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships. A search for distant relatives may be more sensitive with a different matrix.
04/21/23 58
BLOSUM Matrices
100
62
30
Per
cent
am
ino
acid
iden
tity
BLOSUM62
colla
pse
04/21/23 59
BLOSUM Matrices
100
62
30
Per
cent
am
ino
acid
iden
tity
BLOSUM62
100
62
30
BLOSUM30
100
62
30
BLOSUM80
colla
pse
colla
pse
colla
pse
04/21/23 60
A 4R -1 5N -2 0 6D -2 -2 1 6C 0 -3 -3 -3 9Q -1 1 0 0 -3 5E -1 0 0 2 -4 2 5G 0 -2 0 -1 -3 -2 -2 6H -2 0 1 -1 -3 0 0 -2 8I -1 -3 -3 -3 -1 -3 -3 -4 -3 4L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4K -1 2 0 -1 -1 1 1 -2 -1 -3 -2 5M -1 -2 -2 -3 -1 0 -2 -3 -2 1 2 -1 5F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4
A R N D C Q E G H I L K M F P S T W Y V
Blosum62 scoring matrix
04/21/23 61
Differences between PAM and BLOSUM
• PAM matrices are based on an explicit evolutionary model (i.e. replacements are counted on the branches of a phylogenetic tree), whereas the BLOSUM matrices are based on an implicit model of evolution.
• The PAM matrices are based on mutations observed throughout a global alignment, this includes both highly conserved and highly mutable regions. The BLOSUM matrices are based only on highly conserved regions in series of alignments forbidden to contain gaps.
• The method used to count the replacements is different, unlike the PAM matrix, the BLOSUM procedure uses groups of sequences within which not all mutations are counted the same.
• Higher numbers in the PAM matrix naming scheme denote larger evolutionary distance, while larger numbers in the BLOSUM matrix naming scheme denote higher sequence similarity and therefore smaller evolutionary distance. Example: PAM150 is used for more distant sequences than PAM100; BLOSUM62 is used for closer sequences than Blosum50.
http://en.wikipedia.org/wiki/Substitution_matrix
04/21/23 62
Rat versus mouse RBP
Rat versus bacteriallipocalin
04/21/23 63
Measuring Alignment Significance
• The statistical significance of a an alignment score is used to try to determine if an alignment is the result of homology or just random chance.
04/21/23 64
True positives False positives
False negatives
Sequences reportedas related
Sequences reportedas unrelated
True negatives
04/21/23 65
True positives False positives
False negatives
Sequences reportedas related
Sequences reportedas unrelated
True negatives
homologoussequences
non-homologoussequences
04/21/23 66
True positives False positives
False negatives
Sequences reportedas related
Sequences reportedas unrelated
True negatives
homologoussequences
non-homologoussequences
Sensitivity:ability to findtrue positives
Specificity:ability to minimize
false positives
04/21/23 67
RBP: 26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD- 84 + K++ + + +GTW++MA + L + A V T + +L+ W+ glycodelin: 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN 81
Randomization test: scramble a sequence
• First compare two proteins and obtain a score
• Next scramble the bottom sequence 100 times,and obtain 100 “randomized” scores (+/- S.D. )
• Composition and length are maintained
• If the comparison is “real” we expect the authentic score to be several standard deviations above the mean of the “randomized” scores
04/21/23 68
0
2
4
6
8
10
12
14
16
1 10 19 28 37
100 random shufflesMean score = 8.7Std. dev. = 4.2
Quality score
Num
ber
of in
stan
ces
A randomization test shows that RBP is significantly related to -lactoglobulin
Real comparisonScore = 37
But this test assumes a normal distribution of scores!
04/21/23 69
For align RBP to b-lactoglobulin:
scorerandomizedrealS
Z
7.62.4
7.837
Z
04/21/23 70
The PRSS program performsa scramble test for you(http://fasta.bioch.virginia.edu/fasta/prss.htm)
Badscores
Good scores
But these scores are notnormally distributed!
04/21/23 71
We will first consider the global alignment algorithmof Needleman and Wunsch (1970).
We will then explore the local alignment algorithmof Smith and Waterman (1981).
Finally, we will consider BLAST, a heuristic versionof Smith-Waterman. We will cover BLAST in detaillater.
Two kinds of sequence alignment: global and local
04/21/23 72
• Two sequences can be compared in a matrix along x- and y-axes.
• If they are identical, a path along a diagonal can be drawn
• Find the optimal subpaths, and add them up to achieve the best score. This involves
--adding gaps when needed--allowing for conservative substitutions--choosing a scoring system (simple or complicated)
• N-W is guaranteed to find optimal alignment(s)
Global alignment with the algorithmof Needleman and Wunsch (1970)
04/21/23 73
The Needleman and Wunsch Algorithm: Dynamic Programming
• S. Needleman and C. Wunsch were the first to apply a dynamic programming approach to the problem of sequence alignment.
• The key to understanding the dynamic programming approach to sequence alignment lies in observing how the alignment problem is broken down into sub-problems.
04/21/23 74
[1] set up a matrix
[2] score the matrix
[3] identify the optimal alignment(s)
Three steps to global alignment with the Needleman-Wunsch algorithm
04/21/23 75
Global alignment with the algorithm of Needleman and Wunsch (1970)
• Build a matrix F, index by i and j (the positions in the two sequences), where the F(i,j) is the score of the best alignment of x1..i and y1..j
04/21/23 76
Example: forward approach
• Align sequence x and y.
• F is the DP matrix; s is the substitution matrix; d is the linear gap penalty.
djiF
djiF
yxsjiF
jiF
F
ji
1,
,1
,1,1
max,
00,0
04/21/23 77
DP in equation form
1,1 jiF
jiF , jiF ,1
1, jiF
d
d ji yxs ,
04/21/23 78
A simple example
A C G T
A 2 -7 -5 -7
C -7 2 -7 -5
G -5 -7 2 -7
T -7 -5 -7 2
A A G
A
G
C 1,1 jiF
jiF , jiF ,1
1, jiF
d
d ji yxs ,
Find the optimal alignment of AAG and AGC.Use a gap penalty of d=-5.
04/21/23 79
A simple example
A C G T
A 2 -7 -5 -7
C -7 2 -7 -5
G -5 -7 2 -7
T -7 -5 -7 2
A A G
0
A
G
C 1,1 jiF
jiF , jiF ,1
1, jiF
d
d ji yxs ,
Find the optimal alignment of AAG and AGC.Use a gap penalty of d=-5.
04/21/23 80
A simple example
A C G T
A 2 -7 -5 -7
C -7 2 -7 -5
G -5 -7 2 -7
T -7 -5 -7 2
A A G
0 -5 -10 -15
A -5
G -10
C -15 1,1 jiF
jiF , jiF ,1
1, jiF
d
d ji yxs ,
Find the optimal alignment of AAG and AGC.Use a gap penalty of d=-5.
04/21/23 81
A simple example
A C G T
A 2 -7 -5 -7
C -7 2 -7 -5
G -5 -7 2 -7
T -7 -5 -7 2
A A G
0 -5 -10 -15
A -5 2 -3 -8
G -10 -3 -3 -1
C -15 -8 -8 -6 1,1 jiF
jiF , jiF ,1
1, jiF
d
d ji yxs ,
Find the optimal alignment of AAG and AGC.Use a gap penalty of d=-5.
04/21/23 82
Trace back
• Start from the lower right corner and trace back to the upper left.
• Each arrow introduces one character at the end of each aligned sequence.
• A horizontal move puts a gap in the left sequence.
• A vertical move puts a gap in the top sequence.• A diagonal move uses one character from each
sequence.
04/21/23 83
• Start from the lower right corner and trace back to the upper left.
• Each arrow introduces one character at the end of each aligned sequence.
• A horizontal move puts a gap in the left sequence.
• A vertical move puts a gap in the top sequence.
• A diagonal move uses one character from each sequence.
A simple example
A A G
0 -5
A 2 -3
G -1
C -6
Find the optimal alignment of AAG and AGC.Use a gap penalty of d=-5.
04/21/23 84
• Start from the lower right corner and trace back to the upper left.
• Each arrow introduces one character at the end of each aligned sequence.
• A horizontal move puts a gap in the left sequence.
• A vertical move puts a gap in the top sequence.
• A diagonal move uses one character from each sequence.
A simple example
A A G
0 -5
A 2 -3
G -1
C -6
Find the optimal alignment of AAG and AGC.Use a gap penalty of d=-5.
AAG- AAG--AGC A-GC
04/21/23 85
N-W is guaranteed to find optimal alignments,although the algorithm does not search all possiblealignments.
an optimal path (alignment) is identified byincrementally extending optimal subpaths.Thus, a series of decisions is made at each step of thealignment to find the pair of residues with the best score.
Needleman-Wunsch: dynamic programming
04/21/23 86
Commercial Tools for Sequence Alignment: Accelrys Genetic Computer Group (GCG)
• Formally known as the GCG Wisconsin Package
• GCG contains over 140 programs and utilities covering the cross-disciplinary needs of today’s research environment.
• http://www.accelrys.com/products/gcg/
04/21/23 87
Global alignment versus local alignment
• Global alignment (Needleman-Wunsch) extends from one end of each sequence to the other
• Local alignment finds optimally matching regions within two sequences (“subsequences”)
• Local alignment is almost always used for database searches such as BLAST. It is useful to find domains (or limited regions of homology) within sequences
• Smith and Waterman (1981) solved the problem of performing optimal local sequence alignment. Other methods (BLAST, FASTA) are faster but less thorough.
04/21/23 88
Smith-Waterman Algorithm
• Similar to Needleman-Wunch, but:– Start a new alignment when encountering a
negative score– The alignment can end anywhere in the
matrix– Trace back starts from entry with highest
score in the matrix, and terminates at entry with score 0
04/21/23 89
How the Smith-Waterman algorithm works
Set up a matrix between two proteins (size m+1, n+1)
No values in the scoring matrix can be negative! S > 0
The score in each cell is the maximum of four values:[1] s(i-1, j-1) + the new score at [i,j] (a match or mismatch)
[2] s(i,j-1) – gap penalty[3] s(i-1,j) – gap penalty[4] zero
04/21/23 90
Local alignment: example
A A G
0 0 0 0
G 0 0 0 2
A 0 2 2 0
A 0 2 4 0
G 0 0 0 6
G 0 0 0 2
C 0 0 0 0
1,1 jiF
jiF , jiF ,1
1, jiF
d
d ji yxs ,
Find the optimal local alignment of AAG and GAAGGC.Use a gap penalty of d=-5.
0
A C G T
A 2 -7 -5 -7
C -7 2 -7 -5
G -5 -7 2 -7
T -7 -5 -7 2
04/21/23 91
Match: 1Mismatch: -1/3Gap: -4/3
04/21/23 92
Extended Smith & Waterman
To get multiple local alignments:• delete regions around best path
• repeat backtracking
04/21/23 93
Heuristic versions of Smith-Waterman: FASTA and BLAST
• Smith-Waterman is very rigorous and it is guaranteed to find an optimal alignment.
• But Smith-Waterman is slow. It requires computer space and time proportional to the product of the two sequences being aligned (or the product of a query against an entire database).
• Gotoh (1982) and Myers and Miller (1988) improved the algorithms so both global and local alignment require less time and space.
• FASTA and BLAST provide rapid alternatives to S-W
04/21/23 94
FASTA Idea
• Idea: a good alignment probably matches some identical ‘words’ (ktups)
• Example:
Database record:
ACTTGTAGATACAAAATGTG
Aligned query sequence:
A-TTGTCG-TACAA-ATCTGT
Matching words of size 4
04/21/23 95
FASTA Stage I
• A “lookup table” (Database word) is created. It consists of short stretches of amino acids. The length of a stretch is called a k-tuple.
• Upon query:– For each DB record:
• Find matching words• Search for long diagonal runs
of matching words• Finds the ten highest
scoring segments that align to the query.
*= matching word
Position in query
Position in DB record
* * * *
* * *
* * * * *
04/21/23 96
FASTA stage II, III
• II, These ten aligned regions are re-scored with a PAM or BLOSUM matrix.
• III, High-scoring segments are joined
04/21/23 97
FASTA final stage
• Apply an exact algorithm to surviving records, computing the final alignment score.– Usually, Needleman-Wunsch or Smith-
Waterman is then performed.
04/21/23 98
Advantage:
• The FASTA program can search the NBRF protein sequence library (2.5 million residues) in less than 20 min on an IBM-PC microcomputer
Rapid and sensitive sequence comparison with FASTP and FASTA.Pearson WR. Methods Enzymol. 1990;183:63-98.
04/21/23 99
Limits
• Local similarity might be missed because only 10 regions saved at init1 stage.
• Non-identical conserved stretches may be overlooked