Sequence comparison: Score matrices
Genome 559: Introduction to Statistical and Computational Genomics
Prof. James H. Thomas
http://faculty.washington.edu/jht/GS559_2012/
FYI - informal inductive proof of best alignment path
Consider the last step in the best alignment path to node a below. This path must come from one of the three nodes shown, where X, Y, and Z are the cumulative scores of the best alignments up to those nodes. We can reach node a by three possible paths: an A-B match, a gap in sequence A, or a gap in sequence B:
[Figure: node a and its three upstream nodes X, Y, and Z, with seq A and seq B along the axes; the step from X is labeled "match" and the steps from Y and Z are labeled "gap"]
The best-scoring path to a is the maximum of:
X + match
Y + gap
Z + gap
BUT the best paths to X, Y, and Z are analogously the max of their three upstream possibilities, etc. Inductively QED.
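The max-of-three argument above can be written as a dynamic-programming recurrence. The following is a minimal sketch in Python, not the course's own code: the function name and the scoring values (match = +1, mismatch = -1, gap = -2) are illustrative assumptions.

```python
def global_alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score using the max-of-three recurrence.

    The scoring values are illustrative assumptions, not from the lecture.
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    # F[i][j] holds the best score for aligning seq_a[:i] with seq_b[:j].
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = i * gap          # seq_a prefix aligned against gaps only
    for j in range(1, cols):
        F[0][j] = j * gap          # seq_b prefix aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + s,   # X + match (diagonal step)
                F[i - 1][j] + gap,     # Y + gap
                F[i][j - 1] + gap,     # Z + gap
            )
    return F[-1][-1]
```

Each cell only looks at its three upstream neighbors, which is exactly the inductive step: if those three values are already optimal, the maximum over them is optimal too.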
Local alignment - review
Score matrix:

      A   C   G   T
  A   2  -7  -5  -7
  C  -7   2  -7  -5
  G  -5  -7   2  -7
  T  -7  -5  -7   2

d = -5

DP matrix:

          A   A   G
      0   0   0   0
  A   0   2   2   0
  G   0   0   0   4
  C   0   0   0   0

(no arrow means no preceding alignment)

$$F(i,j) = \max \begin{cases} 0 \\ F(i-1,j-1) + s(x_i,y_j) \\ F(i-1,j) + d \\ F(i,j-1) + d \end{cases}$$
Local alignment
• Two differences from global alignment:
  – If a score is negative, replace with 0.
  – Traceback from the highest score in the matrix and continue until you reach 0.
• Global alignment algorithm: Needleman-Wunsch.
• Local alignment algorithm: Smith-Waterman.
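The two differences from global alignment can be sketched in a few lines of Python. This is an illustrative sketch rather than a reference implementation: it uses the DNA score matrix and d = -5 from the review slide, while the function names and the tie-breaking order in the traceback (diagonal first) are assumptions.

```python
# DNA score matrix and linear gap penalty taken from the review slide.
BASES = "ACGT"
SCORES = [
    [ 2, -7, -5, -7],
    [-7,  2, -7, -5],
    [-5, -7,  2, -7],
    [-7, -5, -7,  2],
]
D = -5


def s(x, y):
    """Substitution score for aligning base x with base y."""
    return SCORES[BASES.index(x)][BASES.index(y)]


def smith_waterman(seq_x, seq_y, d=D):
    """Return (best local score, aligned piece of seq_x, aligned piece of seq_y)."""
    n, m = len(seq_x), len(seq_y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Difference 1: negative scores are replaced with 0.
            F[i][j] = max(
                0,
                F[i - 1][j - 1] + s(seq_x[i - 1], seq_y[j - 1]),
                F[i - 1][j] + d,
                F[i][j - 1] + d,
            )
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Difference 2: trace back from the highest score until a 0 is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + s(seq_x[i - 1], seq_y[j - 1]):
            top.append(seq_x[i - 1])
            bottom.append(seq_y[j - 1])
            i, j = i - 1, j - 1
        elif F[i][j] == F[i - 1][j] + d:
            top.append(seq_x[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(seq_y[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))
```

Calling `smith_waterman("AGC", "AAG")` on the review-slide sequences gives a best score of 4 for the local alignment of AG with AG.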
[Figure: dot plot of two DNA sequences, with an overlay of the global DP alignment path]
Protein score matrices
• Quantitatively represent the degree of conservation of typical amino acid residues over evolutionary time.
• All possible amino acid changes are represented (matrix of size at least 20 x 20).
• Most commonly used are several different BLOSUM matrices derived for different degrees of evolutionary divergence.
• DNA score matrices are simpler (and conceptually similar).
BLOSUM62 Score Matrix
[Figure: the BLOSUM62 matrix, with the regular 20 amino acids and the ambiguity codes and stop marked. Matrix file header: "# BLOSUM Clustered Scoring Matrix in 1/2 Bit Units", "# Cluster Percentage: >= 62"]
Amino acid structures
[Figure: side-chain structures of the 20 amino acids, grouped as listed below]
Hydrophobic: alanine A, valine V, glycine G, leucine L, isoleucine I, methionine M, proline P, tryptophan W, phenylalanine F
Polar: threonine T, tyrosine Y, serine S, asparagine N, glutamine Q, cysteine C
Charged: lysine K, arginine R, histidine H, aspartate D, glutamate E
BLOSUM62 Score Matrix
Good scores – chemically similar
Bad scores – chemically dissimilar
Deriving BLOSUM scores
• Find sets of sequences whose alignment is thought to be correct (this is partly bootstrapped by alignment).
• Measure how often various amino acid pairs occur in the alignments.
• Normalize this to the expected frequency of such pairs randomly in the same set of alignments.
• Derive a log-odds score for aligned vs. random.
Example of alignment block (the BLO part of BLOSUM)
[Figure: an alignment block with 31 positions (columns) and 61 sequences (rows)]
• Thousands of such blocks go into computing a single BLOSUM matrix.
• Represent full diversity of sequences.
• Results are summed over all columns of all blocks.
Pair frequency vs. expectation
Sample column from an alignment block: D E D N D D
6 D-D pairs, 4 D-E pairs, 4 D-N pairs, 1 E-N pair
Actual aligned pair frequency:
$$q_{ij} = \frac{c_{ij}}{T}$$
where $c_{ij}$ is the count of $ij$ pairs and $T$ is the total pair count.
Randomly expected pair frequency:
$$e_{aa} = p_a p_a = p_a^2$$
$$e_{ab} = p_a p_b + p_b p_a = 2 p_a p_b$$
where $p_a$ and $p_b$ are the overall probabilities (frequencies) of specific residues $a$ and $b$.
(a multiple alignment of N sequences is the equivalent of all the pairwise alignments, which number (N)(N-1)/2.)
etc.
this is called the sum of pairs (the SUM part of BLOSUM)
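The pair counting for a single column can be checked with a short Python sketch. It covers one column only (a real BLOSUM computation sums counts over all columns of all blocks), and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def column_pair_counts(column):
    """Count unordered residue pairs over all pairwise comparisons in a column."""
    counts = Counter()
    for a, b in combinations(column, 2):      # N(N-1)/2 pairs for N rows
        counts[tuple(sorted((a, b)))] += 1    # ('E','D') and ('D','E') are the same pair
    return counts


def pair_frequencies(counts):
    """Observed frequency q_ij = c_ij / T, where T is the total pair count."""
    total = sum(counts.values())
    return {pair: c / total for pair, c in counts.items()}


def expected_frequency(p, a, b):
    """Random expectation: p_a^2 for identical residues, 2 * p_a * p_b otherwise."""
    return p[a] * p[a] if a == b else 2 * p[a] * p[b]


counts = column_pair_counts("DEDNDD")
q = pair_frequencies(counts)
```

For the sample column this reproduces the counts on the slide: 6 D-D, 4 D-E, 4 D-N, and 1 E-N, out of 15 pairs in total.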
Log-odds score calculation (so adding scores == multiplying probabilities)
$$s_{ij} = \log_2 \frac{q_{ij}}{e_{ij}}$$
For computational speed, scores are often rounded to the nearest integer; to reduce round-off error, they are often multiplied by 2 (or more) first, giving a "half-bit" score:
$$\text{matrixScore}_{ij}\ \text{(rounded)} = 2 \log_2 \frac{q_{ij}}{e_{ij}}$$
(computers can add integers faster than floats)
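The point that adding scores is the same as multiplying probabilities can be checked numerically. In this minimal Python sketch the function names and the two odds ratios (4 and 8) are made-up illustrative values, not data from the lecture.

```python
import math


def log_odds_bits(q, e):
    """Log-odds score in bits: log2 of observed over expected frequency."""
    return math.log2(q / e)


def half_bit_score(q, e):
    """Score in half-bit units, rounded to the nearest integer."""
    return round(2 * math.log2(q / e))


# Hypothetical odds ratios (q/e) for two independent aligned positions.
ratios = [4.0, 8.0]
total_bits = sum(log_odds_bits(r, 1.0) for r in ratios)   # 2 bits + 3 bits
combined_ratio = math.prod(ratios)                         # 4 * 8 = 32
```

Here `total_bits` equals `math.log2(combined_ratio)`: summing the per-position scores gives the log of the product of the per-position odds ratios.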
BLOSUM62 matrix (half-bit scores)
Frequency of C residue over all proteins: 0.0162 (you have to look this up)
C-C score: 9 half-bits = 4.5 bits
Reverse calculation of aligned C-C pair frequency in BLOSUM data set:
$$e_{cc} = 0.0162 \times 0.0162 \approx 0.000262$$
$$\frac{q_{cc}}{e_{cc}} = 2^{4.5} \approx 22.63$$
thus
$$q_{cc} \approx 22.63 \times 0.000262 \approx 0.00594$$
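The reverse calculation is easy to reproduce. This is a small Python sketch with a hypothetical function name; the result is approximate because the matrix score was rounded to an integer before being inverted.

```python
def observed_from_score(score_half_bits, expected):
    """Invert score = 2 * log2(q / e) to recover the aligned pair frequency q.

    Approximate, since the published score was rounded to an integer.
    """
    return expected * 2 ** (score_half_bits / 2)


p_c = 0.0162                         # overall frequency of C (from the slide)
e_cc = p_c * p_c                     # random expectation for a C-C pair
q_cc = observed_from_score(9, e_cc)  # C-C score in BLOSUM62 is 9 half-bits
```

With these inputs `e_cc` is about 0.000262 and `q_cc` is about 0.00594, so aligned C-C pairs are seen roughly 22 times more often than chance would predict.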
Constructing Blocks
• Blocks are ungapped alignments of multiple sequences, usually 20 to 100 amino acids long.
• Cluster the members of each block according to their percent identity.
• Make pair counts and score matrix from a large collection of similarly clustered blocks.
• Each BLOSUM matrix is named for the percent identity cutoff used in the clustering step (e.g. BLOSUM70 for 70% identity).
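The clustering step can be illustrated with a short Python sketch. It assumes single-linkage clustering (a row joins a cluster if it reaches the cutoff against any member), which is an assumption made here for illustration; the function names and toy rows are hypothetical, and the later step of down-weighting clustered sequences when counting pairs is not shown.

```python
def percent_identity(a, b):
    """Percent of identical positions between two equal-length (ungapped) rows."""
    if len(a) != len(b):
        raise ValueError("rows of an ungapped block must have equal length")
    same = sum(x == y for x, y in zip(a, b))
    return 100.0 * same / len(a)


def cluster_block(rows, cutoff):
    """Group block rows by single linkage at the given percent identity cutoff."""
    clusters = []
    for row in rows:
        linked, unlinked = [], []
        for cluster in clusters:
            if any(percent_identity(row, member) >= cutoff for member in cluster):
                linked.append(cluster)
            else:
                unlinked.append(cluster)
        # Merge the new row with every cluster it links to.
        merged = [row] + [member for cluster in linked for member in cluster]
        clusters = unlinked + [merged]
    return clusters


toy_block = ["AAAA", "AAAT", "CCCC"]   # hypothetical 3-row block
toy_clusters = cluster_block(toy_block, cutoff=70)
```

In the toy block, AAAA and AAAT are 75% identical and fall into one cluster at a 70% cutoff, while CCCC stays on its own.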
Probabilistic Interpretation of Scores (ungapped)
• By converting scores back to probabilities, we can give a probabilistic interpretation to an alignment score.
VHRDLKPENLLLASK
VHRDLKPENLLLASK
(4+8+5+6+4+5+7+5+6+4+4+4+4+4+5)
• This 15 amino acid alignment has a score of 75 (75 half-bits = 37.5 bits), meaning that it is ~$10^{11}$ times more likely to be seen in a real alignment than in a random alignment(!!).
FIAP
FLSP
• This alignment has a score of 16 (6+2+1+7) by BLOSUM62, meaning an alignment with this score or more is $2^8$ (256) times more likely than expected from a random alignment.
(BLOSUM62)
$$\text{matrixScore}_{ij}\ \text{(rounded)} = 2 \log_2 \frac{q_{ij}}{e_{ij}}$$
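Both examples can be reproduced with a few lines of Python. The dictionary below holds only the BLOSUM62 entries needed for these two alignments (the same values that appear in the sums above); the names `BLOSUM62_SUBSET`, `ungapped_score`, and `odds_ratio` are hypothetical.

```python
# Only the BLOSUM62 entries (half-bit units) used by the two examples.
BLOSUM62_SUBSET = {
    ("A", "A"): 4, ("D", "D"): 6, ("E", "E"): 5, ("F", "F"): 6,
    ("H", "H"): 8, ("K", "K"): 5, ("L", "L"): 4, ("N", "N"): 6,
    ("P", "P"): 7, ("R", "R"): 5, ("S", "S"): 4, ("V", "V"): 4,
    ("I", "L"): 2, ("A", "S"): 1,
}


def ungapped_score(seq_x, seq_y):
    """Sum of matrix scores over the aligned positions of two equal-length sequences."""
    if len(seq_x) != len(seq_y):
        raise ValueError("ungapped alignment needs equal-length sequences")
    # Keys are stored in sorted order, so sort each pair before the lookup.
    return sum(BLOSUM62_SUBSET[tuple(sorted(pair))] for pair in zip(seq_x, seq_y))


def odds_ratio(score_half_bits):
    """Convert a half-bit score back to a real-vs-random odds ratio."""
    return 2 ** (score_half_bits / 2)


short_score = ungapped_score("FIAP", "FLSP")
long_score = ungapped_score("VHRDLKPENLLLASK", "VHRDLKPENLLLASK")
```

`odds_ratio(short_score)` is 256, and `odds_ratio(long_score)` is about 1.9 x 10^11, matching the ~10^11 figure quoted for the 15-residue alignment.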
Randomly Distributed Gaps
if
$$p_g = k$$
(probability of a gap at each position in the sequence)
then
$$P(g_1) = k,\quad P(g_2) = k^2,\quad \ldots,\quad P(g_n) = k^n$$
[Figure: log-linear plot]
[note - the slope of the line on a log-linear plot will vary according to the frequency of gaps, but it will always be linear]
Distribution of real alignment gap lengths in large set of structurally-aligned proteins
Nowhere near linear - hence the use of affine gap penalties (there ideally would be several levels of decreasing affine penalties)
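The contrast between the random-gap model and an affine penalty can be sketched in Python. The gap probability k = 0.1 and the penalty values (open = -11, extend = -1) are illustrative assumptions, as is the convention that the opening cost covers the first gap position; conventions differ between programs.

```python
import math


def gap_length_log_probs(k, max_len):
    """log10 P(gap of length n) = n * log10(k) when gaps occur independently."""
    return [n * math.log10(k) for n in range(1, max_len + 1)]


def affine_gap_penalty(length, gap_open=-11, gap_extend=-1):
    """Affine penalty: an opening cost for the first position, then a smaller
    cost for each additional position (one common convention; values assumed)."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


log_probs = gap_length_log_probs(0.1, 4)
steps = [b - a for a, b in zip(log_probs, log_probs[1:])]
```

Under the random model every entry of `steps` is the same, which is the straight line on the log-linear plot. The affine penalty instead charges a large first step and small later steps, a piecewise approximation to the curved distribution seen in real alignments.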
What you should know
• How a score matrix is derived
• What the scores mean probabilistically
• Why gap penalties should be affine