Dr. R. Sankar, BSE 633 (2020)
Global vs Local alignment
Global:
• Matched regions are long
• Cover most of the two aligning sequences
• Depend on the presence of many gaps
• Negative mismatch scores and gap penalties are deliberately chosen to be small in comparison with match score
• Score will be proportional to the length of the sequence
Local:
• Tend to be shorter; do not include many gaps
• A negative mismatch score and gap penalties are chosen to balance the positive score of a match
• This prevents the alignments from growing into regions that do not match well
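The contrast between the two modes can be made concrete with a small dynamic-programming sketch. The Python function below is an illustration only: the scoring values (match +1, mismatch -1, gap -1) and the name `alignment_score` are assumptions, not values taken from these slides. In local mode each cell is floored at zero, which is what stops an alignment from growing into poorly matching regions.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Best alignment score of a and b with a linear gap penalty.

    local=False: global alignment (every residue of both sequences is aligned).
    local=True:  local alignment (best-scoring pair of subsequences).
    The default scores are hypothetical and chosen only for illustration.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    if not local:
        # Global mode: leading gaps are penalised along the first row and column.
        for i in range(1, rows):
            H[i][0] = i * gap
        for j in range(1, cols):
            H[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if local:
                score = max(score, 0)  # never extend into a net-negative region
            H[i][j] = score
            best = max(best, score)
    return best if local else H[-1][-1]


local_score = alignment_score("ACGT", "TTACGTTT", local=True)
global_score = alignment_score("ACGT", "TTACGTTT")
```

With these toy scores the local score is 4 (the exact ACGT match), while the global score is 0 because the global alignment must also pay for the four unmatched flanking residues.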
Limitations
Computationally expensive
Sequences with two or more matching regions separated by intervening regions that do not match well (Smith-Waterman)
Margaret Dayhoff (1925-1983)
Amino acid substitution matrices
Certain amino acid substitutions commonly occur in related proteins from different species
These substituted amino acids are compatible with protein structure and function
Types of amino acid changes that are most and least common in a large number of proteins can assist with predicting alignments for any set of protein sequences
Amino acid substitution matrices are used for such purposes
Seq #1: Y C D A
Seq #2: F M E G
3 -1 2 0
Total score = 3-1+2+0 = 4
Alignment with gaps
Each value represents an odds score
Odds scores are ratios of probabilities; they are multiplied to give an overall odds score
For convenience, odds scores are converted to log odds scores so that the values can be summed
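The equivalence between multiplying odds and summing log odds can be checked numerically. The sketch below uses the four per-position scores from the example alignment (3, -1, 2, 0) and assumes, purely for illustration, that they are base-10 log odds scaled by 10; the slide does not state the scaling, and `to_odds` is a hypothetical helper.

```python
import math

# Per-position scores from the example alignment (Y/F, C/M, D/E, A/G).
log_odds_scores = [3, -1, 2, 0]

# Assumed scaling: score = 10 * log10(odds). Not stated on the slide.
SCALE = 10.0


def to_odds(score, scale=SCALE):
    """Convert a scaled log-odds score back to an odds ratio."""
    return 10 ** (score / scale)


# Route 1: sum the log-odds scores directly.
total_log_odds = sum(log_odds_scores)

# Route 2: multiply the odds ratios, then take the scaled logarithm.
overall_odds = math.prod(to_odds(s) for s in log_odds_scores)
recovered = SCALE * math.log10(overall_odds)
```

Both routes give the total score of 4, up to floating-point error.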
Amino acid substitution matrix: An example
Amino acid substitution matrices
• Each matrix position is filled with a score
• Scores reflect how often one AA would have been paired with the other in an alignment of related protein sequences
• Probability (A → B) = Probability (B → A)
• Likelihood of replacement depends on
1. The product of the frequency of occurrence of two AAs
2. Their chemical and physical properties
Scoring matrices
PAM (Percent Accepted Mutation) Matrices
BLOSUM (Blocks Amino Acid Substitution Matrices)
Percent Accepted Mutation (PAM) Matrices
Based on evolutionary principles; family of matrices
One matrix gives the AA changes expected in homologous proteins that have diverged only a small amount from each other in a relatively short period of time; still 50% or more similar
Another gives the AA changes expected in proteins that have diverged over a much longer period; < 20% similarity
Predicted changes are used to score the alignment and produce an optimal alignment
PAM Units in PAM matrices
PAM1: A PAM unit is a time period over which 1% of amino acids in a sequence are expected to undergo accepted mutations, some of which may occur in the same position
Two sequences that have diverged by 100 PAM units are therefore not necessarily different in every position
Percent Accepted Mutation (PAM) Matrices
Assumptions
Each change in the current AA at a particular site is assumed to be independent of previous mutational events at that site
AA substitutions are viewed as a Markov model
A series of changes of state in a system such that a change from one state to another does not depend on the previous history of the state
AA substitutions observed over short periods of evolutionary history can be extrapolated to longer distances
PAM Matrices (Dayhoff Matrices)
71 groups of protein sequences; at least 85% similar
1572 changes
Changes were observed in closely related proteins
They do not significantly change the function
“accepted” by natural selection and hence “accepted mutations”
Number of changes of each AA
Relative amount of change in each AA
Normalizes the data for variations in AA composition, mutation rate and sequence length
PAM Matrix Construction
• Construct 71 phylogenetic trees of protein families
• Observe amino acid substitutions on each branch of tree
• Also need probability of occurrence for each amino acid (pa)
Slide courtesy: Chris Bailey
PAM Matrix construction
Step 1: Construct a multiple sequence alignment
Step 2: Create a phylogenetic tree from the alignment
PAM Matrix Construction
• Using substitution data, calculate fab, the observed frequency of the mutation a ↔ b
• Also note that fab = fba
• Using this information, calculate fa, the total number of mutations in which a is involved:
$$f_a = \sum_{b \neq a} f_{ab}$$
PAM Matrix construction
Step 3
For each amino acid type, the frequency with which it is substituted by each other amino acid
A → G = G → A
F(G,A) = 3
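The counting step can be sketched in a few lines of Python. In the list below, only the three Ala/Gly changes (F(G,A) = 3) come from the slides; the remaining pairs, and the name `count_substitutions`, are hypothetical and were chosen so that Ala takes part in four substitutions out of six in total.

```python
from collections import Counter

# Substitutions read off the branches of a tree, as unordered pairs.
# Only the three A/G changes are taken from the slides; the rest are invented.
observed = [("A", "G"), ("A", "G"), ("A", "G"), ("A", "L"), ("G", "L"), ("L", "I")]


def count_substitutions(pairs):
    """Return symmetric pair counts f_ab (f_ab == f_ba) and per-residue totals f_a."""
    f_ab = Counter()
    for a, b in pairs:
        f_ab[(a, b)] += 1
        f_ab[(b, a)] += 1  # each substitution is counted in both directions
    f_a = Counter()
    for (a, _b), n in f_ab.items():
        f_a[a] += n  # f_a is the sum of f_ab over all b != a
    return f_ab, f_a


f_ab, f_a = count_substitutions(observed)
f_total = sum(f_a.values())  # every substitution contributes twice
```

Here `f_ab[("G", "A")]` is 3, `f_a["A"]` is 4, and `f_total` is 12, that is, six substitutions each counted in both directions.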
PAM Matrix Construction
• And also calculate f, the total occurrences of amino acid substitutions:
$$f = \sum_{a} f_a$$
• From here we go on to calculate relative mutability:
$$m_a = \frac{f_a}{100 \, f \, p_a}$$
Step 4: Relative mutability of Ala
Number of mutations in which Ala is substituted = 4
Total number of mutations = 6 x 2 = 12
Relative frequency of Ala = 10/63 = 0.159
Relative mutability = 4/(12 x 0.159 x 100) = 0.0209
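The arithmetic of this step can be reproduced with a one-line function. This is a minimal sketch of m_a = f_a / (100 · f · p_a) using the numbers above; the function name `relative_mutability` is an assumption made for illustration.

```python
def relative_mutability(f_a, f_total, p_a):
    """m_a = f_a / (100 * f * p_a).

    f_a:     number of mutations in which residue a is substituted
    f_total: total number of mutations, counted in both directions
    p_a:     relative frequency of residue a
    """
    return f_a / (100.0 * f_total * p_a)


# Ala in the worked example: 4 mutations, 12 in total, frequency 10/63.
m_ala = relative_mutability(f_a=4, f_total=12, p_a=10 / 63)
```

With the unrounded frequency 10/63 the result is 0.0210, which agrees with the 0.0209 quoted above up to rounding of the intermediate value 0.159.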
Relative mutability of residue a
• How much the residue a likes to change compared to other residues
• Relative mutability: Probability that a given amino acid will change in the evolutionary period of interest
PAM Matrix Construction
• 20 x 20 matrix where Mab is the probability of amino acid a changing into amino acid b
• Maa = 1 – ma
• Mab is more complicated and requires conditional probability
– E.g. P(A and B) = P(A)∙P(B|A)
PAM Matrix Construction
• In this case:
$$M_{ab} = P(a \to b \mid a \text{ changed}) \, P(a \text{ changed})$$
• Or:
$$M_{ab} = \frac{f_{ab}}{f_a} \, m_a$$
Step 5: Mutation probability for each pair of a.a
Relative mutability of Ala = 0.0209
Frequency of Ala → Gly = 3
Frequency of all a.a pairs in which Ala is substituted = 4
Mutation probability of Ala → Gly:
MA,G = (0.0209 * 3 )/4 = 0.0156
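The same calculation extends to a whole matrix. The sketch below fills the off-diagonal entries with m_a · f_ab / f_a and the diagonal with 1 − m_a for a toy three-residue alphabet. Only the Ala values (m = 0.0209, f_AG = 3, f_A = 4) come from the worked example; the other counts, the other mutabilities, and the function name are hypothetical.

```python
def mutation_probabilities(f_ab, f_a, m):
    """Build M[a][b]: off-diagonal m_a * f_ab / f_a, diagonal 1 - m_a."""
    residues = sorted(f_a)
    M = {}
    for a in residues:
        M[a] = {}
        for b in residues:
            if a == b:
                M[a][b] = 1.0 - m[a]  # probability that a stays unchanged
            else:
                M[a][b] = m[a] * f_ab.get((a, b), 0) / f_a[a]
    return M


# Ala values follow the worked example; Gly and Leu values are invented.
f_ab = {("A", "G"): 3, ("G", "A"): 3, ("A", "L"): 1,
        ("L", "A"): 1, ("G", "L"): 1, ("L", "G"): 1}
f_a = {"A": 4, "G": 4, "L": 2}
m = {"A": 0.0209, "G": 0.0100, "L": 0.0050}

M = mutation_probabilities(f_ab, f_a, m)
row_sums = {a: sum(M[a].values()) for a in M}
```

Because the off-diagonal entries of each row add up to m_a, every row of M sums to 1, which is a quick sanity check on the counts.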
PAM Matrix Construction
• The final scores in a PAM matrix are expressed as a lod (logarithm of odds) score
• Compare probability of mutation vs probability of random occurrence
• Gives odds ratio:
$$\frac{M_{ab}}{p_b}$$
• Scoring Matrix S is calculated by:
$$S_{ab} = 10 \log_{10} \frac{M_{ab}}{p_b}$$
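A minimal sketch of the log-odds conversion, S_ab = 10 · log10(M_ab / p_b), is given below. The function name and the input values are assumptions chosen to show the sign of the score; they are not taken from a published matrix.

```python
import math


def log_odds_score(M_ab, p_b):
    """S_ab = 10 * log10(M_ab / p_b).

    M_ab: probability that residue a is replaced by residue b
    p_b:  background frequency of residue b (chance of seeing b at random)
    """
    return 10.0 * math.log10(M_ab / p_b)


# Hypothetical inputs illustrating the three cases.
as_expected = log_odds_score(0.10, 0.10)  # replacement as frequent as chance
rarer = log_odds_score(0.01, 0.10)        # ten times rarer than chance
commoner = log_odds_score(0.20, 0.10)     # twice as frequent as chance
```

A score of zero means the replacement occurs as often as expected by chance, negative scores mark replacements rarer than chance, and positive scores mark replacements more frequent than chance.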
PAM Matrix Construction
• These equations allow us to calculate a PAM1 matrix
• The number after PAM is the number of amino acid substitutions per 100 residues:
– PAM40 – 40 substitutions per 100 residues
– PAM250 – 250 substitutions per 100 residues
• All other matrices are calculated by repeated multiplication of the PAM1 matrix by itself
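Raising a mutation probability matrix to the n-th power can be sketched in plain Python. The two-state matrix below is a toy stand-in for the 20 x 20 PAM1 matrix: its values are hypothetical, and the helpers `mat_mul` and `mat_pow` are illustrative names that assume square matrices.

```python
def mat_mul(X, Y):
    """Multiply two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(M, n):
    """Return M raised to the n-th power (n >= 1) by repeated squaring."""
    result = None
    base = M
    while n > 0:
        if n & 1:
            result = base if result is None else mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


# Toy two-state "PAM1-like" matrix (hypothetical values); each row sums to 1.
pam1 = [[0.99, 0.01],
        [0.02, 0.98]]

pam2 = mat_pow(pam1, 2)
pam250 = mat_pow(pam1, 250)
```

In this toy case the diagonal of `pam250` is well below that of `pam1` but far from zero, since repeated and reverse changes at the same site are included. This is the same reason sequences separated by many PAM units still share identical positions.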
PAM Matrix construction
1. For each amino acid type, the frequency with which it is substituted by each other amino acid
2. Relative mutability, mi, of each amino acid
3. Mutation probability Mij
4. Divide Mij by frequency of occurrence, fi, of residue i
5. Take the log of these values
6. Fill the matrix Rij (off-diagonal entries)
7. Diagonal entries: Mjj = 1 – mj
8. Follow steps 4 to 6 to fill diagonal entries
PAM1 evolutionary distance
top row shows original amino acid; left column shows replacement amino acid
(Adapted from Figure 82. Atlas of Protein Sequence and Structure, Suppl 3, 1978, M.O. Dayhoff, ed. National Biomedical Research Foundation, 1979.)
• Mutation probability matrix for the evolutionary distance of 1 PAM (i.e., one Accepted Point Mutation per 100 amino acids).
• An element of this matrix, Mi,j, gives the probability that the amino acid in column j will be replaced by the amino acid in row i after a given evolutionary interval, in this case 1 PAM. Thus, there is a 0.56% probability that Asp will be replaced by Glu.
• To simplify the appearance, the elements are shown multiplied by 10,000.
PAM250 evolutionary distance
(Adapted from Figure 83. Atlas of Protein Sequence and Structure, Suppl 3, 1978, M.O. Dayhoff, ed. National Biomedical Research Foundation, 1979.)
Mutation probability matrix for the evolutionary distance of 250 PAMs. To simplify the appearance, the elements are shown multiplied by 100.
In comparing two sequences of average amino acid frequency at this evolutionary distance, there is a 13% probability that a position containing Ala in the first sequence will contain Ala in the second.
There is a 3% chance that it will contain Arg, and so forth. The relationship of two sequences at a distance of 250 PAMs can be demonstrated by statistical methods.
http://biomath.geneseo.edu/symposium/talks/darling.pdf
http://www.cse.ucsd.edu/classes/sp04/cse182/slides/L6.pdf
Protein A: amino acid sequence
a1a2a3a4…(time t)
Assume that 1% of all amino acids in protein A have undergone substitutions by time t + t’
New amino acid sequence at t + t’
b1b2b3b4… (call this sequence Protein A’)
PAM1: The probability that a residue of type i in protein A will be replaced by j in the protein A’
The relative mutability of amino acids
Asn 134    His 66
Ser 120    Arg 65
Asp 106    Lys 56
Glu 102    Pro 56
Ala 100    Gly 49
Thr 97     Tyr 41
Ile 96     Phe 41
Met 94     Leu 40
Gln 93     Cys 20
Val 74     Trp 18
Note that alanine is normalized to a value of 100. Trp and Cys are least mutable. Asn and Ser are most mutable.
Normalized frequencies of amino acids: variations in frequency of occurrence
Gly 8.9%    Arg 4.1%
Ala 8.7%    Asn 4.0%
Leu 8.5%    Phe 4.0%
Lys 8.1%    Gln 3.8%
Ser 7.0%    Ile 3.7%
Val 6.5%    His 3.4%
Thr 5.8%    Cys 3.3%
Pro 5.1%    Tyr 3.0%
Glu 5.0%    Met 1.5%
Asp 4.7%    Trp 1.0%
Color key on the original slide: blue = 6 codons; red = 1 codon
note: should be 5% for each if equally distributed
Comparing two proteins with a PAM1 matrix gives completely different results than PAM250!
Consider two distantly related proteins. A PAM40 matrix is not forgiving of mismatches, and penalizes them severely. Using this matrix you can find almost no match.
A PAM250 matrix is very tolerant of mismatches.
hsrbp, 136  CRLLNLDGTC
btlact, 3   CLLLALALTC
            * ** *  **

24.7% identity in 81 residues overlap; Score: 77.0; Gap frequency: 3.7%
rbp4   26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDV
btlact 21 QTMKGLDIQKVAGTWYSLAMAASD-ISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWEN
                *     ****  *        * *   *   *               **  *
rbp4   86 --CADMVGTFTDTEDPAKFKM
btlact 80 GECAQKKIIAEKTKIPAVFKI
            **        *  ** **
PAM 250 matrix
Amino acid frequencies
AA  1978   1991
L   0.085  0.091
A   0.087  0.077
G   0.089  0.074
S   0.070  0.069
V   0.065  0.066
E   0.050  0.062
T   0.058  0.059
K   0.081  0.059
I   0.037  0.053
D   0.047  0.052
R   0.041  0.051
P   0.051  0.051
N   0.040  0.043
Q   0.038  0.041
F   0.040  0.040
Y   0.030  0.032
M   0.015  0.024
H   0.034  0.023
C   0.033  0.020
W   0.010  0.014
bioinfo.mbb.yale.edu/course/classes/c10/AAS-matrices3.html
Relative mutabilities of amino acids:
AA  1978  1991
A   100   100
C   20    44
D   106   86
E   102   77
F   41    51
G   49    50
H   66    91
I   96    103
K   56    72
L   40    54
M   94    93
N   134   104
P   56    58
Q   93    84
R   65    83
S   120   117
T   97    107
V   74    98
W   18    25
Y   41    50
http://bioinfo.mbb.yale.edu/course/classes/c10/AAS-matrices3.html
Implicit assumptions in PAM matrix calculations
- The probability of a mutation is independent of the history of the sequence
- The probability of a mutation is independent of the position within the sequence
- The probability of a mutation at any position is independent of the rest of the sequence
- All positions within the sequence mutate at the same rate
- Long-term substitution patterns can be extrapolated from short-term ones
- AA distributions in the set of protein families used to make the scoring matrix are representative of all the families that are likely to be encountered
Sources of Error in PAM
Exercise
Go to the UniProt database (www.uniprot.org)
Extract the following hemoglobin sequences with accession ids:
(a) P69905, (b) P68871, (c) O04985
Use the pairwise alignment tools Needle and Water available at the URL http://www.ebi.ac.uk/Tools/psa/ with all default parameters.
• Align sequences (a) and (b) using both Needle and Water
• What do you observe in the pairwise alignments?
• What is your conclusion?
• Align sequences (a) and (c) using both Needle and Water
• What do you observe in the pairwise alignments?
• What is your conclusion?
Repeat the above exercise with different PAM matrices (PAM50, PAM100, PAM250)
Compare the output you get for lower and higher PAM matrices