Transcript
Page 1: Compositionally Adjusted Substitution Matrices for Protein Database Searches

Compositionally Adjusted Substitution Matrices for Protein Database Searches

Stephen Altschul

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health

Page 2: Compositionally Adjusted Substitution Matrices for Protein Database Searches

Collaborators

Yi-Kuo Yu, Alejandro Schäffer, John Wootton, Richa Agarwala, Mike Gertz, Aleksandr Morgulis

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health

See: Yu, Wootton & Altschul (2003) PNAS 100:15688-15693; Yu & Altschul (2005) Bioinformatics 21:902-911; Altschul et al. (2005) FEBS J. 272:5101-5109.

Page 3: Compositionally Adjusted Substitution Matrices for Protein Database Searches

Log-odds scores

The scores of any local-alignment substitution matrix can be written in the form

$$ s_{ij} = \frac{1}{\lambda} \ln \frac{q_{ij}}{p_i p_j} $$

where the p_i are background amino acid frequencies, the q_ij are target frequencies, and λ is an arbitrary scale factor (PNAS 87:2264-2268).
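As a small numerical illustration of the log-odds form, the sketch below builds scores from target and background frequencies and then recovers λ from the scores alone, using the fact that the target frequencies must sum to one: Σ p_i p_j exp(λ s_ij) = 1. The two-letter alphabet, the frequencies, and the function names are invented for illustration, and the solver assumes a negative expected score with at least one positive score.

```python
import math

def log_odds_scores(q, p, lam):
    """s_ij = ln(q_ij / (p_i * p_j)) / lam for a symmetric target matrix q."""
    n = len(p)
    return [[math.log(q[i][j] / (p[i] * p[j])) / lam for j in range(n)]
            for i in range(n)]

def solve_lambda(s, p, tol=1e-12):
    """Positive root of sum_ij p_i p_j exp(lam * s_ij) = 1, found by bisection."""
    n = len(p)

    def f(lam):
        return sum(p[i] * p[j] * math.exp(lam * s[i][j])
                   for i in range(n) for j in range(n)) - 1.0

    hi = 1.0
    while f(hi) <= 0.0:      # grow until we are past the root
        hi *= 2.0
    lo = hi
    while f(lo) >= 0.0:      # shrink until we are between 0 and the root
        lo /= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0

# Toy two-letter example (hypothetical numbers; rows of q sum to p).
p = [0.6, 0.4]
q = [[0.45, 0.15], [0.15, 0.25]]
s = log_odds_scores(q, p, lam=0.3)
lam = solve_lambda(s, p)
# Target frequencies implied by the scores and the recovered scale.
q_back = [[p[i] * p[j] * math.exp(lam * s[i][j]) for j in range(2)]
          for i in range(2)]
```

Here the background frequencies are taken as known; recovering them from the scores alone is the harder problem addressed by the validity theorem later in the talk.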

Page 4: Compositionally Adjusted Substitution Matrices for Protein Database Searches

The BLOSUM-62 matrix

A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V

PNAS 89:10915-10919

Page 5: Compositionally Adjusted Substitution Matrices for Protein Database Searches

Amino acid compositional bias

Some sources of bias:

Organismal bias
• AT-rich genomes tend to have more of the amino acids FLINKYM.
• GC-rich genomes tend to have more of the amino acids PRAWG.

Protein family bias
• Transmembrane proteins: more hydrophobic residues.
• Cysteine-rich proteins: more cysteines than usual.

Page 6: Compositionally Adjusted Substitution Matrices for Protein Database Searches

Construction of an asymmetric log-odds substitution matrix

Given a (not necessarily symmetric) set of target frequencies q_ij, define two sets of background frequencies p_i and p'_j as the marginal sums of the q_ij:

$$ p_i = \sum_j q_{ij}; \qquad p'_j = \sum_i q_{ij} $$

The substitution scores are then defined as

$$ s_{ij} = \frac{1}{\lambda} \ln \frac{q_{ij}}{p_i \, p'_j} $$

We call this matrix valid in the context of the p_i and p'_j.
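The construction can be sketched in a few lines: take the row and column sums of the target frequencies as the two background distributions, then form the log-odds scores. The function name, the default scale of 1, and the two-letter toy frequencies are assumptions made for illustration.

```python
import math

def asymmetric_log_odds(q, lam=1.0):
    """Return (p, p_prime, s) for target frequencies q[i][j]."""
    p = [sum(row) for row in q]                # p_i  = sum over j of q_ij
    p_prime = [sum(col) for col in zip(*q)]    # p'_j = sum over i of q_ij
    s = [[math.log(q_ij / (p[i] * p_prime[j])) / lam
          for j, q_ij in enumerate(row)]
         for i, row in enumerate(q)]
    return p, p_prime, s

# Hypothetical asymmetric target frequencies (entries sum to 1).
q = [[0.40, 0.10],
     [0.20, 0.30]]
p, p_prime, s = asymmetric_log_odds(q)
```

Because the row and column sums differ in this example, the resulting scores satisfy s[0][1] != s[1][0]: the matrix is asymmetric.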

Page 7: Compositionally Adjusted Substitution Matrices for Protein Database Searches

Substitution matrix validity theorem

A substitution matrix can be valid for only a unique set of target and background frequencies, except in certain degenerate cases. (Proof omitted)

One can determine efficiently whether an arbitrary substitution matrix can be valid in some context and, if so, one can extract its unique target and background frequencies, and scale. (Proof and algorithms omitted)

Page 8: Compositionally Adjusted Substitution Matrices for Protein Database Searches

Choosing new target frequencies

Given new sets of background frequencies P_i and P'_j, how should one choose appropriate target frequencies Q_ij?

Consistency constraints:

Close to original q_ij:

Sometimes, it is desirable to constrain the relative entropy H
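The formulas on this slide did not survive extraction. Based on the definitions on the earlier slides and on the approach described in Yu, Wootton & Altschul (2003), they are presumably of the following form. The consistency constraints follow directly from the definition of background frequencies as marginal sums; the closeness measure (labelled D here, a name chosen for this sketch) and the expression for H are stated as assumptions.

```latex
\begin{align*}
\text{Consistency:}\quad
  & \sum_j Q_{ij} = P_i, \qquad \sum_i Q_{ij} = P'_j \\
\text{Closeness to } q_{ij}:\quad
  & \text{minimize } D = \sum_{i,j} Q_{ij} \ln \frac{Q_{ij}}{q_{ij}} \\
\text{Relative entropy:}\quad
  & H = \sum_{i,j} Q_{ij} \ln \frac{Q_{ij}}{P_i \, P'_j}
\end{align*}
```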

Page 9: Compositionally Adjusted Substitution Matrices for Protein Database Searches

Substitution matrices compared

Mode A: Standard BLOSUM-62 matrix.

Mode B: Composition-adjusted matrix; no constraint on relative entropy (H).

Mode C: Composition-adjusted matrix; H constrained to equal a constant (0.44 nats).

Mode D: Composition-adjusted matrix; H constrained to equal that of the standard matrix in the new compositional context.
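For mode B, where only the consistency constraints on the marginals apply, one simple way to obtain new target frequencies that stay close to the originals is iterative proportional fitting: alternately rescale rows and columns until both sets of marginals match. This is a stand-in chosen for illustration, not the solver used in the cited papers; the function names and toy frequencies are assumptions. Modes C and D additionally constrain H, which this sketch only reports.

```python
import math

def adjust_targets(q, P, P_prime, iters=500):
    """Rescale q so its row sums match P and its column sums match P_prime."""
    Q = [row[:] for row in q]
    for _ in range(iters):
        for i, row in enumerate(Q):                    # match row sums
            r = P[i] / sum(row)
            Q[i] = [x * r for x in row]
        for j in range(len(P_prime)):                  # match column sums
            c = P_prime[j] / sum(row[j] for row in Q)
            for row in Q:
                row[j] *= c
    return Q

def relative_entropy(Q, P, P_prime):
    """H = sum_ij Q_ij ln(Q_ij / (P_i P'_j)), in nats."""
    return sum(x * math.log(x / (P[i] * P_prime[j]))
               for i, row in enumerate(Q) for j, x in enumerate(row))

# Hypothetical two-letter example: original targets and new backgrounds.
q = [[0.45, 0.15], [0.15, 0.25]]
P = [0.7, 0.3]
P_prime = [0.7, 0.3]
Q = adjust_targets(q, P, P_prime)
H = relative_entropy(Q, P, P_prime)
```

Row and column rescaling leaves the odds ratio Q_00 Q_11 / (Q_01 Q_10) unchanged, which is one sense in which the adjusted frequencies remain close to the originals.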

Page 10: Compositionally Adjusted Substitution Matrices for Protein Database Searches

Performance evaluation (mode D vs. mode A)

Page 11: Compositionally Adjusted Substitution Matrices for Protein Database Searches

Amino               P. falciparum   M. tuberculosis
Acid    BLOSUM62    #16805184       #15607948
-----   --------    -------------   ---------------
A          7.4           4.8             13.9
R          5.2           4.1              7.4
N          4.5           8.9              2.8
D          5.3           5.6              5.9
C          2.5           2.1              1.9
Q          3.4           3.0              3.6
E          5.4           7.0              6.1
G          7.4           6.2              9.5
H          2.6           3.1              1.7
I          6.8           9.0              4.4
L          9.9           8.2              9.3
K          5.8           8.2              1.9
M          2.5           1.3              1.5
F          4.7           5.1              2.5
P          3.9           3.8              5.3
S          5.7           7.4              4.4
T          5.1           2.3              5.7
W          1.3           1.0              0.8
Y          3.2           4.6              2.8
V          7.3           4.4              8.7

BLOSUM-62 and sequence-specific background frequencies (percent)

Page 12: Compositionally Adjusted Substitution Matrices for Protein Database Searches

A  -15  -55 -116  -76  -45  -60  -73  -23  -98  -54  -45  -77  -39  -92  -34  -52  -31  -79 -102  -34
R    4   -9  -83  -43  -20  -26  -40    2  -66  -27  -16  -40   -9  -61   -5  -25   -3  -49  -71   -8
N   48   22  -26    6   24   16    3   50  -20   15   25   -2   33  -19   38   21   44   -8  -28   34
D   19   -8  -62  -12   -5  -12  -20   20  -51  -11   -3  -30    4  -47   12   -8   13  -37  -58    6
C   21  -14  -74  -34   19  -20  -35   15  -57  -10   -1  -37    5  -47    6  -11   12  -35  -59    9
Q   23   -2  -65  -24   -3    1  -19   21  -46   -9    1  -24   11  -45   14   -6   15  -30  -54    8
E   22   -5  -66  -20   -6   -7  -14   19  -48  -12   -1  -27    6  -46   12   -8   14  -34  -56    7
G  -25  -59 -115  -77  -52  -64  -77  -14 -102  -61  -51  -80  -45  -94  -38  -57  -37  -82 -106  -42
H   54   27  -31    6   30   23    9   52    2   21   33    4   40   -9   44   25   46    1  -14   39
I   26   -6  -67  -25    5  -11  -26   21  -50    9   13  -28   17  -35   14   -7   20  -28  -50   23
L   23   -7  -70  -29    2  -13  -27   18  -51    1   15  -31   17  -35   12  -10   16  -29  -51   16
K   43   20  -45   -5   17   13   -2   41  -29   11   20    2   28  -25   33   14   36  -13  -34   29
M   30    1  -62  -23    8   -4  -20   25  -44    5   18  -23   31  -31   18   -2   23  -22  -46   22
F   62   34  -29   12   41   26   13   61   -7   39   51    9   55   17   52   32   56   19   -1   55
P  -31  -62 -123  -80  -56  -66  -80  -34 -106  -64  -54  -84  -48  -99  -23  -61  -39  -88 -110  -44
S   19  -14  -72  -32   -6  -18  -32   15  -57  -17   -8  -35    0  -51    7   -7   12  -41  -62    2
T   11  -21  -78  -41  -12  -26  -39    6  -65  -19  -11  -42   -3  -57    0  -17   12  -46  -67   -1
W   60   31  -32    7   39   26   10   59  -12   31   42    6   49    4   48   28   52   37   -5   47
Y   52   23  -38    0   29   17    2   48  -13   23   34   -1   39   -1   40   20   44    9   -5   41
V   13  -21  -83  -42  -10  -28  -41    6  -67  -11   -6  -44    0  -52   -1  -22    4  -46  -65    9
     A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V

Difference between a scaled, standard BLOSUM-62 and a compositionally adjusted BLOSUM-62

Entries shown: score of standard matrix subtracted from the adjusted one

P. falciparum

Page 13: Compositionally Adjusted Substitution Matrices for Protein Database Searches

Optimal alignments implied by modes A and D

Mode A: 29.7 bits (H = 0.51 nats)

Mode D: 31.8 bits (H = 0.51 nats)

Mode C: 33.1 bits (H = 0.44 nats)

Page 15: Compositionally Adjusted Substitution Matrices for Protein Database Searches

Performance of various matrices on 143 pairs of related sequences (FEBS J. 272:5101-5109)

Page 16: Compositionally Adjusted Substitution Matrices for Protein Database Searches

Empirical rules for invoking compositional adjustment when comparing two sequences

1: The length ratio of the longer to the shorter sequence is less than 3.

Page 17: Compositionally Adjusted Substitution Matrices for Protein Database Searches

One metric definition of distance between two composition vectors

(IEEE Trans. Info. Theo. 49:1858-1860)
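The formula on this slide was lost in extraction. The cited reference is taken here to describe a metric built on the Jensen-Shannon divergence, so a plausible reading is the square root of that divergence between the two composition vectors. The sketch below assumes that form with natural logarithms; the function name and the exact normalization are assumptions.

```python
import math

def composition_distance(p, q):
    """Square root of the Jensen-Shannon divergence between p and q (nats)."""
    total = 0.0
    for a, b in zip(p, q):
        m = (a + b) / 2.0          # midpoint distribution
        if a > 0.0:
            total += 0.5 * a * math.log(a / m)
        if b > 0.0:
            total += 0.5 * b * math.log(b / m)
    return math.sqrt(max(total, 0.0))  # guard against tiny negative rounding

# Hypothetical three-letter compositions.
x = [0.5, 0.3, 0.2]
y = [0.2, 0.3, 0.5]
z = [0.4, 0.4, 0.2]
d_xy = composition_distance(x, y)
```

Under this definition the distance is symmetric, zero for identical compositions, and bounded above by sqrt(ln 2) for compositions with no letters in common.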

Page 18: Compositionally Adjusted Substitution Matrices for Protein Database Searches

Empirical rules for invoking compositional adjustment when comparing two sequences

1: The length ratio of the longer to the shorter sequence is less than 3.

2: The distance d between the compositions of the two sequences is less than 0.16.

Page 19: Compositionally Adjusted Substitution Matrices for Protein Database Searches

Law of cosines

In a triangle with sides of length a, b, and c, the angle θ opposite the side of length c is

$$ \theta = \arccos \frac{a^2 + b^2 - c^2}{2ab} $$

Page 20: Compositionally Adjusted Substitution Matrices for Protein Database Searches

Empirical rules for invoking compositional adjustment when comparing two sequences

1: The length ratio of the longer to the shorter sequence is less than 3.

2: The distance d between the compositions of the two sequences is less than 0.16.

3: The angle θ made by the compositions of the two sequences with the standard composition is less than 70°.
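The three rules can be sketched as a single check. The thresholds (3, 0.16, 70°) come from the slides; everything else is an assumption made for illustration: the distance is taken to be the square root of the Jensen-Shannon divergence, θ is read as the angle at the standard composition in the triangle whose side lengths are the pairwise distances, all three rules are assumed to be required together, and the function names and toy compositions are hypothetical.

```python
import math

def js_distance(p, q):
    """Assumed metric: square root of the Jensen-Shannon divergence (nats)."""
    total = 0.0
    for a, b in zip(p, q):
        m = (a + b) / 2.0
        if a > 0.0:
            total += 0.5 * a * math.log(a / m)
        if b > 0.0:
            total += 0.5 * b * math.log(b / m)
    return math.sqrt(max(total, 0.0))

def angle_at_standard(p, q, standard):
    """Angle in degrees opposite the side p-q, via the law of cosines."""
    a = js_distance(p, standard)
    b = js_distance(q, standard)
    c = js_distance(p, q)
    if a == 0.0 or b == 0.0:
        return 0.0  # degenerate triangle; treated as zero angle (assumption)
    cos_theta = (a * a + b * b - c * c) / (2.0 * a * b)
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))

def should_adjust(len1, len2, p, q, standard):
    """True if all three empirical rules hold."""
    ratio = max(len1, len2) / min(len1, len2)
    return (ratio < 3.0
            and js_distance(p, q) < 0.16
            and angle_at_standard(p, q, standard) < 70.0)

# Hypothetical three-letter compositions.
standard = [0.34, 0.33, 0.33]
p = [0.50, 0.30, 0.20]
q = [0.48, 0.30, 0.22]
similar = should_adjust(300, 250, p, q, standard)
too_long = should_adjust(1000, 200, p, q, standard)
too_far = should_adjust(300, 250, [0.8, 0.1, 0.1], [0.1, 0.1, 0.8], standard)
```

In this toy case the two compositions are biased in the same direction relative to the standard one, so the distance and angle are both small and only the length-ratio or a large compositional distance rules the pair out.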

Page 21: Compositionally Adjusted Substitution Matrices for Protein Database Searches

ROCn curves for Aravind set (NAR 29: 2994-3005)


Page 22: Compositionally Adjusted Substitution Matrices for Protein Database Searches

ROCn curves for SCOP set (Proc IEEE 9: 1834-1847)

Page 23: Compositionally Adjusted Substitution Matrices for Protein Database Searches

Future directions

• Possible less extensive use of SEG when compositional adjustment is invoked.

• Application to PSI-BLAST.

