Sequence comparison: Score matrices Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Sequence comparison: Score matrices

Genome 559: Introduction to Statistical and Computational Genomics

Prof. James H. Thomas

http://faculty.washington.edu/jht/GS559_2012/

BUT the best paths to X, Y, and Z are analogously the max of their three upstream possibilities, etc. Inductively QED.

Consider the last step in the best alignment path to node a below. This path must come from one of the three nodes shown, where X, Y, and Z are the cumulative scores of the best alignments up to those nodes. We can reach node a by three possible paths: an A-B match, a gap in sequence A or a gap in sequence B:

seq A

seq B

X Y

Z

match gap

gap

The best-scoring path to a is the maximum of:

X + matchY + gapZ + gap

FYI - informal inductive proof of best alignment path

Local alignment - review

A C G T

A 2 -7 -5 -7

C -7 2 -7 -5

G -5 -7 2 -7

T -7 -5 -7 2

A A G

0 0 0 0

A 0 2 0

G 0 0 0 4

C 0 0 0 0 1,1 jiF

jiF , jiF ,1

1, jiF

d

d ji yxs ,

0

2

(no arrow means no preceding alignment)

d = -5

Local alignment

• Two differences from global alignment:– If a score is negative, replace with 0.– Traceback from the highest score in the

matrix and continue until you reach 0.• Global alignment algorithm: Needleman-Wunsch.

• Local alignment algorithm: Smith-Waterman.

dot plot of two DNAsequences

overlay of the globalDP alignment path

• Quantitatively represent the degree of conservation of typical amino acid residues over evolutionary time.

• All possible amino acid changes are represented (matrix of size at least 20 x 20).

• Most commonly used are several different BLOSUM matrices derived for different degrees of evolutionary divergence.

• DNA score matrices are simpler (and conceptually similar).

Protein score matrices

regular 20 amino acidsBLOSUM62 Score Matrix

ambiguity codes and stop

# B

LOS

UM

Clu

ster

ed S

corin

g M

atrix

in 1

/2 B

it U

nits

# C

lust

er P

erce

ntag

e:

>=

62

Amino acid structures

alanine A

valine V

glycine G

leucine

isoleucine

methionine M

proline P

L

I

CH CH3

C

N

.

CH H

C

N

.

CH

C

N

CH3

CH3

.

CH C

C

N

C

CH3

CH3

.

CH

C

N CH3

CH3

.

CH C

C

N

C S CH3

.

CH

N

C

.

tryptophan W CH

C

N

.

HN

.

threonine T

tyrosine Y

serine S

asparagine

glutamine

N

Q

cysteine C

CH

C

N

.OH

.

CH

C

N

.SH

.

CH

C

N

.OH

.

CH

C

N

C OH

.

CH

C

N

.NH2

O.

CH

C

N

.

.NH2

O

lysine K

arginine R

histidine H

aspartate

glutamate

D

E

CH

C

N

.

NN +

.

CH

C

N

NH3+

.

CH

C

N

.NH

NH2+

H2N

.

CH

C

N

C

.O-

O.

CH

C

N

.

.O-

O

.

Hydrophobic

Polar Charged

phenylalanine F

BLOSUM62 Score Matrix

Good scores – chemically similar

Bad scores – chemically dissimilar

Amino acid structures

alanine A

valine V

glycine G

leucine

isoleucine

methionine M

proline P

L

I

CH CH3

C

N

.

CH H

C

N

.

CH

C

N

CH3

CH3

.

CH C

C

N

C

CH3

CH3

.

CH

C

N CH3

CH3

.

CH C

C

N

C S CH3

.

CH

N

C

.

tryptophan W CH

C

N

.

HN

.

threonine T

tyrosine Y

serine S

asparagine

glutamine

N

Q

cysteine C

CH

C

N

.OH

.

CH

C

N

.SH

.

CH

C

N

.OH

.

CH

C

N

C OH

.

CH

C

N

.NH2

O.

CH

C

N

.

.NH2

O

lysine K

arginine R

histidine H

aspartate

glutamate

D

E

CH

C

N

.

NN +

.

CH

C

N

NH3+

.

CH

C

N

.NH

NH2+

H2N

.

CH

C

N

C

.O-

O.

CH

C

N

.

.O-

O

.

Hydrophobic Polar Charged

• Find sets of sequences whose alignment is thought to be correct (this is partly bootstrapped by alignment).

• Measure how often various amino acid pairs occur in the alignments.

• Normalize this to the expected frequency of such pairs randomly in the same set of alignments.

• Derive a log-odds score for aligned vs. random.

Deriving BLOSUM scores

Example of alignment block (the BLO part of

BLOSUM)

31 positions (columns)61 sequences (rows)

• Thousands of such blocks go into computing a single BLOSUM matrix.

• Represent full diversity of sequences.

• Results are summed over all columns of all blocks.

Pair frequency vs. expectation

DEDNDD

6 D-D pairs4 D-E pairs4 D-N pairs1 E-N pair

Sample column from an alignment block:

where is the count of pairs and is the total pair count.

1

ij

ij ij

c ijT

q cT

Actual aligned pair frequency:

where and are the overall probabilities(frequencies) of specific residues and .

2

a b

aa a a

ab a b b a a b

p pa b

e p p

e p p p p p p

Randomly expected pair frequency:

(a multiple alignment of N sequences is the equivalent of all the pairwise alignments, which number (N)(N-1)/2.)

etc.

this is called the sum of pairs (the

SUM part of BLOSUM)

Log-odds score calculation (so adding scores == multiplying probabilities)

2log ijij

ij

qs

e

For computational speed often rounded to nearest integer and (to reduce round-off error) they are often multiplied by 2 (or more) first, giving a “half-bit” score:

2matrixScore (rounded) 2log ij

ij

qe

(computers can add integers faster than floats)

BLOSUM62 matrix (half-bit scores)

Frequency of C residue over all proteins: 0.0162(you have to look this up)

C-C

Reverse calculation of aligned C-C pair frequency in BLOSUM data set:

63.222 5.4 cc

cc

e

q

00594.0000262.063.22 ccq

000262.00162.00162.0 cce

thus

( 9 half-bits = 4.5 bits )

Constructing Blocks

• Blocks are ungapped alignments of multiple sequences, usually 20 to 100 amino acids long.

• Cluster the members of each block according to their percent identity.

• Make pair counts and score matrix from a large collection of similarly clustered blocks.

• Each BLOSUM matrix is named for the percent identity cutoff in step 2 (e.g. BLOSUM70 for 70% identity).

Probabilistic Interpretation of Scores (ungapped)

• By converting scores back to probabilities, we can give a probabilistic interpretation to an alignment score.

VHRDLKPENLLLASKVHRDLKPENLLLASK(4+8+5+6+4+5+7+5+6+4+4+4+4+4+5)

• this 15 amino acid alignment has a score of 75, meaning that it is ~1011 times more likely to be seen in a real alignment than in a random alignment(!!).

FIAPFLSP

• this alignment has a score of 16 (6+2+1+7) by BLOSUM 62, meaning an alignment with this score or more is 28 (256) times more likely than expected from a random alignment.

(BLOSUM62)2matrixScore (rounded) 2log ij

ij

qe

Randomly Distributed Gaps

(probability of a gap at each position in the sequence)

[note - the slope of the line on a log-linear plot will vary according to the frequency of gaps, but it will always be linear]

nn

g

kgPkgPkgP

kp

)(,...,)(,)( 221

if

then

log-linear plot

Distribution of real alignment gap lengths in large set of structurally-aligned proteins

Nowhere near linear - hence the use of affine gap penalties (there ideally would be several levels of decreasing affine penalties)

What you should know

• How a score matrix is derived

• What the scores mean probablistically

• Why gap penalties should be affine

Documents

Sequence comparison: Score matrices Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas