36
Substitution matrices Substitution matrices

Substitution matrices. Outline Why do we need scoring matrices? Calculation of alignment costs Pairwise Multiple Database searches Phylogenetic distance

Embed Size (px)

Citation preview

Page 1: Substitution matrices. Outline Why do we need scoring matrices? Calculation of alignment costs Pairwise Multiple Database searches Phylogenetic distance

Substitution matricesSubstitution matrices

Page 2: Substitution matrices. Outline Why do we need scoring matrices? Calculation of alignment costs Pairwise Multiple Database searches Phylogenetic distance

Outline

• Why do we need scoring matrices?• Calculation of alignment costs

• Pairwise• Multiple• Database searches

• Phylogenetic distance calculations

Page 3: Substitution matrices. Outline Why do we need scoring matrices? Calculation of alignment costs Pairwise Multiple Database searches Phylogenetic distance

What is a Sequence What is a Sequence Alignment?Alignment?

Given two nucleotide or amino acid sequences, determine if they are related (descended from a common ancestor)

Technically, we can align any two sequences, but not always in a meaningful way

Page 4: Substitution matrices. Outline Why do we need scoring matrices? Calculation of alignment costs Pairwise Multiple Database searches Phylogenetic distance

What is a Sequence What is a Sequence Alignment? (cont.)Alignment? (cont.)

HIGHLY RELATED:HBA_HUMAN GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL

G+ +VK+HGKKV A+++++AH+D++ +++++LS+LH KLHBB_HUMAN GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL

RELATED:HBA_HUMAN GSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL

++ ++++H+ KV + +A ++ +L L+++H+ KLGB2_LUPLU NNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG

SPURIOUS ALIGNMENT:HBA_HUMAN GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL

GS+ + G + +D L ++ H+ D+ A +AL D ++AH+F11G11.2 GSGYLVGDSLTFVDLL--VAQHTADLLAANAALLDEFPQFKAHQE

How to filter out the last one & pick up the second?

Page 5: Substitution matrices. Outline Why do we need scoring matrices? Calculation of alignment costs Pairwise Multiple Database searches Phylogenetic distance

Why Should We Care?Why Should We Care?

Fragment assembly in DNA sequencing– Experimental determination of nucleotide

sequences is only reliable up to about 500-800 base pairs (bp) at a time

– But a genome can be millions of bp long!– If fragments overlap, they can be assembled:...AAGTACAATCA CAATTACTCGGA...– Need to align to detect overlap

Page 6: Substitution matrices. Outline Why do we need scoring matrices? Calculation of alignment costs Pairwise Multiple Database searches Phylogenetic distance

Why Should We Care? Why Should We Care? (cont.)(cont.)

Finding homologous proteins and genes

– i.e. evolutionarily related (common ancestor)

– Thus want to avoid the spurious alignment

Page 7: Substitution matrices. Outline Why do we need scoring matrices? Calculation of alignment costs Pairwise Multiple Database searches Phylogenetic distance

What do we do?What do we do?

Model evolutionary process-mutational changes to DNA

Can you name some?

Page 8: Substitution matrices. Outline Why do we need scoring matrices? Calculation of alignment costs Pairwise Multiple Database searches Phylogenetic distance

MutationsMutations

Nothing happensSubstitutions

– Transitions (A-G; C-T)– Transversions (A-C, A-T, C-T, T-G)

InsertionsDeletions

Page 9: Substitution matrices. Outline Why do we need scoring matrices? Calculation of alignment costs Pairwise Multiple Database searches Phylogenetic distance

How do we do it?How do we do it?

Choose a scoring schemeChoose an algorithm to find optimal

alignment with respect to a scoring schemeStatistically validate alignment

Page 10: Substitution matrices. Outline Why do we need scoring matrices? Calculation of alignment costs Pairwise Multiple Database searches Phylogenetic distance

Scoring SchemesScoring Schemes

Since goal is to find related sequences, want evolution-based scoring scheme– Mutations occur often at the genomic level, but their

rates of acceptance by natural selection vary depending on the mutation

– I.e. changing an AA to one with similar properties is more likely to be accepted

Assume that all changes occur independently of each other and are Markovian (makes working with probabilities easier)

Page 11: Substitution matrices. Outline Why do we need scoring matrices? Calculation of alignment costs Pairwise Multiple Database searches Phylogenetic distance

If AA ai is aligned with aj, then aj was substituted for ai

...KALM...

...KVLM... Was this due to an accepted mutation or simply by

chance?– If A or V is likely in general, then there is less evidence that

this is a mutation

Want the score sij to be higher if mutations more likely– Take ratio of mutation prob. to prob. of AA appearing at

random

Generally, if aj is similar to ai in property, then accepted mutation more likely and sij higher

Page 12: Substitution matrices. Outline Why do we need scoring matrices? Calculation of alignment costs Pairwise Multiple Database searches Phylogenetic distance

MutationsMutations

Only consider immediate mutations ai -> aj, not ai -> ak -> aj

Mutations are undirected

=> scoring matrix is symmetric

Markov assumptions

Page 13: Substitution matrices. Outline Why do we need scoring matrices? Calculation of alignment costs Pairwise Multiple Database searches Phylogenetic distance

Nucleic acid substitution Nucleic acid substitution matricesmatrices

Uniform mutation ratesHigher transitions () than transversions ()

C T

A G

Pyrimidines

Purines

Page 14: Substitution matrices. Outline Why do we need scoring matrices? Calculation of alignment costs Pairwise Multiple Database searches Phylogenetic distance

Mutation matricesMutation matrices

A G T C

A 0.99

G 0.00333 0.99

T 0.00333 0.00333 0.99

C 0.00333 0.00333 0.00333 0.99

A G T C

A 0.99

G 0.006 0.99

T 0.002 0.002 0.99

C 0.002 0.002 0.006 0.99

•3 fold higher transitions () than transversions ()

•Uniform mutation rates

1% probability of change at each position (Mij values)

Page 15: Substitution matrices. Outline Why do we need scoring matrices? Calculation of alignment costs Pairwise Multiple Database searches Phylogenetic distance

log-odds ratiolog-odds ratio

Alignment: ATC

AGT

Probability under match hypothesis:

P(alignment | match) = P(A,A | match)P(T,G | match)P(C,T | match) Probability under random hypothesis:

P(alignment | random ) = P(A,A | random )P(T,G | random)P(C,T | random) Relative probability (odds ratio) R(alignment) = P(alignment | match) / P(alignment | random )

= R(A,A)R(T,C)R(C,T) For computational reasons we take logarithms

log R(alignment) = log R(A,A) + log R(T,G) + log R(C,T)

Page 16: Substitution matrices. Outline Why do we need scoring matrices? Calculation of alignment costs Pairwise Multiple Database searches Phylogenetic distance

log-odds ratio (cont.)log-odds ratio (cont.)

Use mutation matrix to derive log odds matrix

The probability (sij) of obtaining a match between nucleotides i and j, divided by the random probability of aligning i and j is given by

sij = log (piMij / pipj)

= log (Mij / pj)

where Mij is the mutation matrix value and pi and pj are the fractional compositions of each nucleotide.

Page 17: Substitution matrices. Outline Why do we need scoring matrices? Calculation of alignment costs Pairwise Multiple Database searches Phylogenetic distance

Substitution matricesSubstitution matrices

A G T C

A 2

G -6 2

T -6 -6 2

C -6 -6 -6 2

A G T C

A 2

G -5 2

T -7 -7 2

C -7 -7 -5 2

•3 fold higher transitions () than transversions ()

•Uniform mutation rates

1% probability of change at each position, log base 2

Page 18: Substitution matrices. Outline Why do we need scoring matrices? Calculation of alignment costs Pairwise Multiple Database searches Phylogenetic distance

Gap penaltiesGap penalties

Why?

Page 19: Substitution matrices. Outline Why do we need scoring matrices? Calculation of alignment costs Pairwise Multiple Database searches Phylogenetic distance

Gap penaltiesGap penalties

Modeling biologyInsertions / deletions:

– Single nucleotide– Unequal crossover– Introns, transposons etc.

Gap score wx for gap of length x consists of Gap opening (g) and gap extension (r) penalties.

wx = g + rx

Page 20: Substitution matrices. Outline Why do we need scoring matrices? Calculation of alignment costs Pairwise Multiple Database searches Phylogenetic distance

Protein matricesProtein matrices

PAMBLOSUM

Page 21: Substitution matrices. Outline Why do we need scoring matrices? Calculation of alignment costs Pairwise Multiple Database searches Phylogenetic distance

The PAM Transition MatricesThe PAM Transition Matrices

Dayhoff et al. started with several hundred alignments between very closely related proteins

Compute the frequency with which each AA is changed into each other

AA changes over a short evolutionary distance (short enough where only 1% AAs change)

1 PAM = 1% point accepted mutation (accepted by natural selection, i. e. not altering the function of the protein)

Page 22: Substitution matrices. Outline Why do we need scoring matrices? Calculation of alignment costs Pairwise Multiple Database searches Phylogenetic distance

Phylogenetic treePhylogenetic tree

ACGH DBGH ADIJ CBIJ

B-C A-D B-D A–C ABGH ABIJ

I-G J-H ACGH DBGHADIJ CBIJ

Page 23: Substitution matrices. Outline Why do we need scoring matrices? Calculation of alignment costs Pairwise Multiple Database searches Phylogenetic distance

PAMs (cont.)PAMs (cont.)

Estimate pi with the frequency of AA ai over both sequences, i.e. # of ai’s/number of AAs

Let fij = fji = # of ai <-> aj changes in data set,

fi = ji fij and f = i fi

Define the scale to be the amount of evolution to change 1 in 100 AAs (on average) [1 PAM dist]

Relative mutability of ai is the ratio of number of mutations to total exposure to mutation:

mi = fi / (100fpi)

Page 24: Substitution matrices. Outline Why do we need scoring matrices? Calculation of alignment costs Pairwise Multiple Database searches Phylogenetic distance

PAMs (cont.)PAMs (cont.)

If mi is probability of a mutation for ai, then Mii = 1 - mi is prob. of no change

ai -> aj if and only if ai changes and ai -> aj given that ai changes,

so Mij = Pr(ai -> aj)

= Pr(ai -> aj | ai changed)Pr(ai changed)

= (fij/fi)mi = fij/(100fpi)

The 1 PAM transition matrix consists of the Mii’s and gives the probabilities of mutations from ai to aj

Page 25: Substitution matrices. Outline Why do we need scoring matrices? Calculation of alignment costs Pairwise Multiple Database searches Phylogenetic distance

What About What About 22 PAM? PAM?How about the probability that ai -> aj in

two evolutionary steps?It’s the prob. that ai -> ak (for any k) in

step 1, and ak -> aj in step 2.

This is kMikMkj = Mij2

Page 26: Substitution matrices. Outline Why do we need scoring matrices? Calculation of alignment costs Pairwise Multiple Database searches Phylogenetic distance

k k PAM Transition MatrixPAM Transition Matrix

In general, the probability that ai -> aj in k evolutionary steps is Mij

k

As k -> , the rows of Mk tend to be identical with the ith entry of each row equal to pi

– A result of our Markovian assumption of mutation

Page 27: Substitution matrices. Outline Why do we need scoring matrices? Calculation of alignment costs Pairwise Multiple Database searches Phylogenetic distance

Building a Scoring MatrixBuilding a Scoring Matrix

When aligning different AAs in two sequences, we want to differentiate mutations and random events

We are thus interested in the ratio of transition probability to prob. of randomly seeing new AA

Ratio > 1 if and only if mutation more likely than random event

Page 28: Substitution matrices. Outline Why do we need scoring matrices? Calculation of alignment costs Pairwise Multiple Database searches Phylogenetic distance

Building a Scoring Matrix Building a Scoring Matrix (cont’d)(cont’d)

When aligning multiple AAs, ratio of probs for multiple alignment = product of ratios:

Taking logs will let us use sums rather than products

Page 29: Substitution matrices. Outline Why do we need scoring matrices? Calculation of alignment costs Pairwise Multiple Database searches Phylogenetic distance

Building a Scoring Matrix Building a Scoring Matrix (cont’d)(cont’d)

Final step: computation faster with integers than with reals, so scale up (to increase precision) and round:

sij = C log2(Mij/pj)

C is a scaling constantFor k PAM, use Mij

k

Page 30: Substitution matrices. Outline Why do we need scoring matrices? Calculation of alignment costs Pairwise Multiple Database searches Phylogenetic distance
Page 31: Substitution matrices. Outline Why do we need scoring matrices? Calculation of alignment costs Pairwise Multiple Database searches Phylogenetic distance

PAM Scoring Matrix PAM Scoring Matrix MiscellanyMiscellany

Pairs of AAs with similar properties (e.g. hydrophobicity) have high pairwise scores, since similar AAs are more likely to be accepted mutations.

In general, low PAM numbers find short, strong local similarities and high PAM numbers find long, weak ones.

Often multiple searches will be run, using e.g. 40 PAM, 120 PAM, 250 PAM.

Page 32: Substitution matrices. Outline Why do we need scoring matrices? Calculation of alignment costs Pairwise Multiple Database searches Phylogenetic distance

BLOSUM Scoring MatricesBLOSUM Scoring Matrices

Based on multiple alignments, not pairwise.

Direct derivation of scores for more distantly related proteins.

Only possible because of new data: multiple alignments of known related proteins.

Page 33: Substitution matrices. Outline Why do we need scoring matrices? Calculation of alignment costs Pairwise Multiple Database searches Phylogenetic distance

BLOCKS database entryBLOCKS database entry

Page 34: Substitution matrices. Outline Why do we need scoring matrices? Calculation of alignment costs Pairwise Multiple Database searches Phylogenetic distance

BLOSUM Scoring Matrices BLOSUM Scoring Matrices (cont’d)(cont’d)

Started with ungapped alignments from BLOCKS database

Sequences clustered at L% sequence identity This time fij = # of ai <-> aj changes between

pairs of sequences from different clusters, normalizing by dividing by (n1n2) = product of sizes of clusters 1 and 2

fi = j fij and f = i fi (different from PAM i=j) Then the scoring matrix entry is

Page 35: Substitution matrices. Outline Why do we need scoring matrices? Calculation of alignment costs Pairwise Multiple Database searches Phylogenetic distance

BLOSUM BLOSUM 5050 Scoring Matrix Scoring Matrix

Page 36: Substitution matrices. Outline Why do we need scoring matrices? Calculation of alignment costs Pairwise Multiple Database searches Phylogenetic distance

Substitution matrices: Substitution matrices: summarysummary

Nucleic acid matrices are model derivedAmino acid matrices are derived from

observed data– PAM: closely related sequences– BLOSUM: BLOCKS domain database

ratio of odds: pr. of real / pr. of random match

log-odds: additive scoring system