Compositionally Adjusted Substitution Matrices for Protein Database Searches

Stephen Altschul

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health

Collaborators

Yi-Kuo Yu, Alejandro Schäffer, John Wootton, Richa Agarwala, Mike Gertz, Aleksandr Morgulis

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health

See: Yu, Wootton & Altschul (2003) PNAS 100:15688-15693; Yu & Altschul (2005) Bioinformatics 21:902-911; Altschul et al. (2005) FEBS J. 272:5101-5109.

Log-odds scores

The scores of any local-alignment substitution matrix can be written in the form

$$ s_{ij} = \frac{1}{\lambda} \ln \frac{q_{ij}}{p_i p_j} $$

where the $p_i$ are background amino acid frequencies, the $q_{ij}$ are target frequencies, and $\lambda$ is an arbitrary scale factor (PNAS 87:2264-2268).

The BLOSUM-62 matrix

A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V

PNAS 89:10915-10919

Amino acid compositional bias

Some sources of bias:

Organismal bias

• AT-rich genomes tend to have more of the amino acids FLINKYM.

• GC-rich genomes tend to have more of the amino acids PRAWG.

Protein family bias

• Transmembrane proteins: more hydrophobic residues.

• Cysteine-rich proteins: more cysteines than usual.

Construction of an asymmetric log-odds substitution matrix

Given a (not necessarily symmetric) set of target frequencies $q_{ij}$, define two sets of background frequencies $p_i$ and $p'_j$ as the marginal sums of the $q_{ij}$:

$$ p_i = \sum_j q_{ij}; \qquad p'_j = \sum_i q_{ij} $$

The substitution scores are then defined as

$$ s_{ij} = \frac{1}{\lambda} \ln \frac{q_{ij}}{p_i p'_j} $$

We call this matrix valid in the context of the $p_i$ and $p'_j$.
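The construction above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `log_odds_matrix`, the toy 3-letter alphabet, and the numbers in `q` are all made up for the example.

```python
import math

def log_odds_matrix(q, lam=1.0):
    """Return (p, p_prime, s) for a table of target frequencies q.

    p and p_prime are the row and column marginal sums of q, and
    s[i][j] = ln(q[i][j] / (p[i] * p_prime[j])) / lam.
    """
    total = sum(sum(row) for row in q)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("target frequencies must sum to 1")
    n_rows, n_cols = len(q), len(q[0])
    p = [sum(q[i][j] for j in range(n_cols)) for i in range(n_rows)]
    p_prime = [sum(q[i][j] for i in range(n_rows)) for j in range(n_cols)]
    s = [[math.log(q[i][j] / (p[i] * p_prime[j])) / lam
          for j in range(n_cols)]
         for i in range(n_rows)]
    return p, p_prime, s

# Hypothetical asymmetric target frequencies on a toy 3-letter alphabet.
q = [[0.20, 0.05, 0.05],
     [0.02, 0.30, 0.03],
     [0.08, 0.02, 0.25]]

# lam = ln(2)/2 expresses the scores in half-bit units (an arbitrary choice).
p, p_prime, s = log_odds_matrix(q, lam=math.log(2) / 2)
```

Because `q` is not symmetric, the two sets of background frequencies differ and so do `s[0][1]` and `s[1][0]`.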

Substitution matrix validity theorem

A substitution matrix can be valid for only a unique set of target and background frequencies, except in certain degenerate cases. (Proof omitted)

One can determine efficiently whether an arbitrary substitution matrix can be valid in some context and, if so, one can extract its unique target and background frequencies, and scale. (Proof and algorithms omitted)
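One ingredient of such an extraction follows directly from the definitions and can be sketched as below. If the scale λ is already known, then with M_ij = exp(λ s_ij) the marginal-sum conditions become the linear systems Σ_j M_ij p'_j = 1 and Σ_i p_i M_ij = 1. The sketch assumes a square, non-singular M and a known λ; it does not search for λ (the remaining condition is that the recovered frequencies are positive and sum to 1), and it is not claimed to be the omitted algorithm. The helper names are hypothetical.

```python
import math

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (a degenerate case)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def implied_frequencies(s, lam):
    """Background and target frequencies implied by scores s at scale lam."""
    n = len(s)
    m = [[math.exp(lam * s[i][j]) for j in range(n)] for i in range(n)]
    m_t = [[m[j][i] for j in range(n)] for i in range(n)]
    ones = [1.0] * n
    p_prime = solve_linear(m, ones)  # sum_j M_ij p'_j = 1 for every i
    p = solve_linear(m_t, ones)      # sum_i p_i M_ij = 1 for every j
    q = [[p[i] * m[i][j] * p_prime[j] for j in range(n)] for i in range(n)]
    return p, p_prime, q

# Demo on made-up numbers: scores built from known frequencies map back to them.
q_true = [[0.4, 0.1], [0.2, 0.3]]
p_true, p_prime_true, lam = [0.5, 0.5], [0.6, 0.4], 1.0
s = [[math.log(q_true[i][j] / (p_true[i] * p_prime_true[j])) / lam
      for j in range(2)] for i in range(2)]
p, p_prime, q = implied_frequencies(s, lam)
```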

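Choosing new target frequencies

Given new sets of background frequencies $P_i$ and $P'_j$, how should one choose appropriate target frequencies $Q_{ij}$?

Consistency constraints:

$$ \sum_j Q_{ij} = P_i; \qquad \sum_i Q_{ij} = P'_j $$

Close to original $q_{ij}$: minimize the relative entropy of the $Q_{ij}$ with respect to the $q_{ij}$,

$$ D = \sum_{ij} Q_{ij} \ln \frac{Q_{ij}}{q_{ij}} $$

Sometimes, it is desirable to constrain the relative entropy H:

$$ H = \sum_{ij} Q_{ij} \ln \frac{Q_{ij}}{P_i P'_j} $$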
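For the case with no constraint on H, and assuming that closeness to the original frequencies is measured by relative entropy, the solution has the form Q_ij = a_i q_ij b_j and can be found by iterative proportional fitting (alternately rescaling rows and columns). The sketch below uses that swapped-in technique for illustration; it is not presented as the authors' numerical method, and the function names and toy numbers are hypothetical.

```python
import math

def fit_target_frequencies(q, P, P_prime, tol=1e-12, max_iter=10000):
    """Rescale q so its row sums equal P and its column sums equal P_prime."""
    n, m = len(P), len(P_prime)
    Q = [row[:] for row in q]
    for _ in range(max_iter):
        for i in range(n):                      # match the row marginals
            row_sum = sum(Q[i])
            Q[i] = [x * P[i] / row_sum for x in Q[i]]
        for j in range(m):                      # match the column marginals
            col_sum = sum(Q[i][j] for i in range(n))
            for i in range(n):
                Q[i][j] *= P_prime[j] / col_sum
        # Columns are now exact; stop once the rows agree as well.
        if max(abs(sum(Q[i]) - P[i]) for i in range(n)) < tol:
            return Q
    raise RuntimeError("iterative proportional fitting did not converge")

def relative_entropy(Q, P, P_prime):
    """H = sum_ij Q_ij ln(Q_ij / (P_i P'_j)), in nats."""
    return sum(Q[i][j] * math.log(Q[i][j] / (P[i] * P_prime[j]))
               for i in range(len(P)) for j in range(len(P_prime)))

# Hypothetical original target frequencies and new background frequencies.
q = [[0.20, 0.05, 0.05],
     [0.02, 0.30, 0.03],
     [0.08, 0.02, 0.25]]
P = [0.5, 0.3, 0.2]
P_prime = [0.4, 0.4, 0.2]

Q = fit_target_frequencies(q, P, P_prime)
H = relative_entropy(Q, P, P_prime)
```

Fixing H at a chosen value, as the constrained variants require, adds a further constraint that this simple rescaling does not handle.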

Substitution matrices compared

Mode A: Standard BLOSUM-62 matrix.

Mode B: Composition-adjusted matrix; no constraint on relative entropy (H).

Mode C: Composition-adjusted matrix; H constrained to equal a constant (0.44 nats).

Mode D: Composition-adjusted matrix; H constrained to equal that of the standard matrix in the new compositional context.

Performance evaluation (mode D vs. mode A)

Amino                 P. falciparum    M. tuberculosis
acid      BLOSUM62    #16805184        #15607948
-----     --------    -------------    ---------------
A            7.4           4.8              13.9
R            5.2           4.1               7.4
N            4.5           8.9               2.8
D            5.3           5.6               5.9
C            2.5           2.1               1.9
Q            3.4           3.0               3.6
E            5.4           7.0               6.1
G            7.4           6.2               9.5
H            2.6           3.1               1.7
I            6.8           9.0               4.4
L            9.9           8.2               9.3
K            5.8           8.2               1.9
M            2.5           1.3               1.5
F            4.7           5.1               2.5
P            3.9           3.8               5.3
S            5.7           7.4               4.4
T            5.1           2.3               5.7
W            1.3           1.0               0.8
Y            3.2           4.6               2.8
V            7.3           4.4               8.7

BLOSUM-62 and sequence-specific background frequencies (%)
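Sequence-specific background frequencies like those tabulated above can be estimated by counting residues. The sketch below is a plain count over the 20 standard amino acids, expressed in percent; the function name and the example peptide are made up, and a production implementation may well smooth the counts (for instance with pseudocounts), which is not shown here.

```python
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def composition_percent(sequence):
    """Percentage of each standard amino acid in a protein sequence.

    Characters outside the 20-letter alphabet (such as X or gaps) are ignored.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}

# Made-up peptide, used only to exercise the function.
composition = composition_percent("MKNNIKLYNKXAAV")
```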

A  -15  -55 -116  -76  -45  -60  -73  -23  -98  -54  -45  -77  -39  -92  -34  -52  -31  -79 -102  -34
R    4   -9  -83  -43  -20  -26  -40    2  -66  -27  -16  -40   -9  -61   -5  -25   -3  -49  -71   -8
N   48   22  -26    6   24   16    3   50  -20   15   25   -2   33  -19   38   21   44   -8  -28   34
D   19   -8  -62  -12   -5  -12  -20   20  -51  -11   -3  -30    4  -47   12   -8   13  -37  -58    6
C   21  -14  -74  -34   19  -20  -35   15  -57  -10   -1  -37    5  -47    6  -11   12  -35  -59    9
Q   23   -2  -65  -24   -3    1  -19   21  -46   -9    1  -24   11  -45   14   -6   15  -30  -54    8
E   22   -5  -66  -20   -6   -7  -14   19  -48  -12   -1  -27    6  -46   12   -8   14  -34  -56    7
G  -25  -59 -115  -77  -52  -64  -77  -14 -102  -61  -51  -80  -45  -94  -38  -57  -37  -82 -106  -42
H   54   27  -31    6   30   23    9   52    2   21   33    4   40   -9   44   25   46    1  -14   39
I   26   -6  -67  -25    5  -11  -26   21  -50    9   13  -28   17  -35   14   -7   20  -28  -50   23
L   23   -7  -70  -29    2  -13  -27   18  -51    1   15  -31   17  -35   12  -10   16  -29  -51   16
K   43   20  -45   -5   17   13   -2   41  -29   11   20    2   28  -25   33   14   36  -13  -34   29
M   30    1  -62  -23    8   -4  -20   25  -44    5   18  -23   31  -31   18   -2   23  -22  -46   22
F   62   34  -29   12   41   26   13   61   -7   39   51    9   55   17   52   32   56   19   -1   55
P  -31  -62 -123  -80  -56  -66  -80  -34 -106  -64  -54  -84  -48  -99  -23  -61  -39  -88 -110  -44
S   19  -14  -72  -32   -6  -18  -32   15  -57  -17   -8  -35    0  -51    7   -7   12  -41  -62    2
T   11  -21  -78  -41  -12  -26  -39    6  -65  -19  -11  -42   -3  -57    0  -17   12  -46  -67   -1
W   60   31  -32    7   39   26   10   59  -12   31   42    6   49    4   48   28   52   37   -5   47
Y   52   23  -38    0   29   17    2   48  -13   23   34   -1   39   -1   40   20   44    9   -5   41
V   13  -21  -83  -42  -10  -28  -41    6  -67  -11   -6  -44    0  -52   -1  -22    4  -46  -65    9
     A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V

Difference between a scaled, standard BLOSUM-62 and a compositionally adjusted BLOSUM-62

Entries shown: score of standard matrix subtracted from the adjusted one

P. falciparum

Optimal alignments implied by modes A and D

Mode A: 29.7 bits (H = 0.51 nats)

Mode D: 31.8 bits (H = 0.51 nats)

Mode C: 33.1 bits (H = 0.44 nats)

Performance of various matrices on 143 pairs of related sequences (FEBS J. 272:5101-5109)

Empirical rules for invoking compositional adjustment when comparing two sequences

1: The length ratio of the longer to the shorter sequence is less than 3.

One metric definition of distance between two composition vectors

(IEEE Trans. Info. Theo. 49:1858-1860)

2: The distance d between the compositions of the two sequences is less than 0.16.

Law of cosines

In a triangle with sides of length a, b and c, the angle opposite the side of length c is

$$ \theta = \arccos \frac{a^2 + b^2 - c^2}{2ab} $$

3: The angle θ made by the compositions of the two sequences with the standard composition is less than 70°.
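The three rules can be evaluated as in the sketch below. It makes several assumptions that the slides do not state. First, θ is taken to be the angle at the standard composition in the triangle whose sides are the three pairwise distances, so the law of cosines applies with c equal to the distance between the two sequences. Second, the distances are assumed to have been computed already with the cited metric, whose formula is not reproduced here. Third, the rules are reported one by one, because the slides do not say how they are combined. All names are hypothetical.

```python
import math

def angle_degrees(a, b, c):
    """Angle opposite side c in a triangle with sides a, b, c (law of cosines)."""
    cos_theta = (a * a + b * b - c * c) / (2.0 * a * b)
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding error
    return math.degrees(math.acos(cos_theta))

def check_rules(len_1, len_2, d_12, d_1s, d_2s):
    """Evaluate each empirical rule separately.

    d_12 is the distance between the two sequence compositions; d_1s and d_2s
    are the distances from each composition to the standard composition.
    """
    length_ratio = max(len_1, len_2) / min(len_1, len_2)
    theta = angle_degrees(d_1s, d_2s, d_12)
    return {
        "length_ratio_below_3": length_ratio < 3.0,
        "distance_below_0.16": d_12 < 0.16,
        "angle_below_70_degrees": theta < 70.0,
    }

# Made-up lengths and distances, used only to exercise the function.
rules = check_rules(len_1=300, len_2=250, d_12=0.10, d_1s=0.12, d_2s=0.12)
```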

ROCn curves for Aravind set (NAR 29: 2994-3005)


ROCn curves for SCOP set (Proc IEEE 9: 1834-1847)

Future directions

• Possible less extensive use of SEG when compositional adjustment is invoked.

• Application to PSI-BLAST.
