Substitution Numbers and Scoring Matrices


Substitution Numbers

The number of observed substitutions K is an important quantity in molecular evolutionary analysis.

A simple count may be misleading (several substitutions at the same site show up as at most one difference), so statistical models are developed to estimate the number of substitutions:
- Jukes-Cantor model
- Kimura model
(Both are for nucleotides, but the ideas can extend to amino acids.)


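Jukes-Cantor Model

Assumes that each nucleotide is equally likely to change into any other nucleotide, with probability α per time step.

One-step substitution matrix (rows: from, columns: to), with φ = 1 - 3α:

      A   T   C   G
  A   φ   α   α   α
  T   α   φ   α   α
  C   α   α   φ   α
  G   α   α   α   φ

[Diagram: nucleotides drawn as nodes, with every substitution arrow labeled α]

What is the probability that if we start with C we end up with C after 2 time steps? Sum over the four possible paths:

  C -> C -> C    φ∙φ
  C -> A -> C    α∙α
  C -> T -> C    α∙α
  C -> G -> C    α∙α

$P_{CC}(2) = \phi^2 + 3\alpha^2$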


Jukes-Cantor Model

The entry M(a,b) in the matrix M1 represents the probability of substitution from nucleotide a to b in one time step:

M1 =
      A   T   C   G
  A   φ   α   α   α
  T   α   φ   α   α
  C   α   α   φ   α
  G   α   α   α   φ

φ = 1 - 3α

What is the matrix M2, i.e. the matrix whose entries M(a,b) represent the probability of substitution from a to b in two time steps?

Essentially what we did on the previous slide, but for all pairs of bases, summing over the intermediate base X:

  A->X->A   A->X->T   A->X->C   A->X->G
  T->X->A   T->X->T   T->X->C   T->X->G
  C->X->A   C->X->T   C->X->C   C->X->G
  G->X->A   G->X->T   G->X->C   G->X->G

For example, with X running over A, T, C, G in that order:

  C->X->C = α∙α + α∙α + φ∙φ + α∙α   (previous slide)
  C->X->A = α∙φ + α∙α + φ∙α + α∙α
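The two-step sums above are exactly the entries of the matrix product M1∙M1. A minimal Python sketch that checks this numerically is given here; the value alpha = 0.1 and the helper names `jukes_cantor_matrix` and `mat_mul` are illustrative assumptions, not part of the slides.

```python
# Sketch only: alpha = 0.1 is an arbitrary illustrative value.
BASES = "ATCG"  # same row/column order as the matrix M1


def jukes_cantor_matrix(alpha):
    """One-step matrix M1: alpha off the diagonal, phi = 1 - 3*alpha on it."""
    phi = 1.0 - 3.0 * alpha
    return [[phi if i == j else alpha for j in range(4)] for i in range(4)]


def mat_mul(x, y):
    """Plain matrix product; the inner sum runs over the intermediate base X."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


alpha = 0.1
phi = 1.0 - 3.0 * alpha
m1 = jukes_cantor_matrix(alpha)
m2 = mat_mul(m1, m1)  # two-step matrix M2

c, a = BASES.index("C"), BASES.index("A")
p_cc_2 = m2[c][c]  # C->X->C, should equal phi*phi + 3*alpha*alpha
p_ca_2 = m2[c][a]  # C->X->A, should equal 2*alpha*phi + 2*alpha*alpha
print(p_cc_2, p_ca_2)
```

Each row of M2 still sums to 1, which is a quick sanity check that the product is again a matrix of probabilities.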


Jukes-Cantor Model

It turns out that $M_n = (M_1)^n$, i.e. the n-th power of M1 is the matrix whose entries M(a,b) represent the probability of substitution from a to b in n time steps.

In general, under the J.C. model the probability that a site that starts as C will contain a C after t time steps is given by:

$P_C(t) = \frac{1}{4} + \frac{3}{4} e^{-4\alpha t}$

This model can be used to derive an estimate of the number of substitutions (per site) that have occurred between the sequences:

$K = -\frac{3}{4} \ln\left[ 1 - \frac{4}{3} p \right]$

where p is the fraction of nucleotides that are mismatched between the two sequences.
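As a rough illustration of the Jukes-Cantor estimate, the sketch below computes p from two equal-length sequences and then K. The function names and the two example sequences are made up for this illustration.

```python
import math


def mismatch_fraction(seq_a, seq_b):
    """Fraction p of aligned positions at which the two sequences differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x != y for x, y in zip(seq_a, seq_b)) / len(seq_a)


def jukes_cantor_distance(p):
    """K = -(3/4) ln(1 - (4/3) p); only defined for 0 <= p < 3/4."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


# Made-up example: 2 mismatches in 10 aligned positions.
p = mismatch_fraction("ACGTACGTAC", "ACGTTCGTAA")
k = jukes_cantor_distance(p)
print(p, k)
```

The estimate K is always at least as large as the raw mismatch fraction p, because it also accounts for substitutions hidden by later changes at the same site.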


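Kimura Model

Addresses the unrealistic assumption in the J.C. model that all substitutions are equally likely.

Two types of substitutions:
- transitions: purine<=>purine or pyrimidine<=>pyrimidine exchange (probability α per time step)
- transversions: purine<=>pyrimidine exchange (probability β per time step)

One-step substitution matrix (rows: from, columns: to), with φ = 1 - α - 2β:

      A   T   C   G
  A   φ   β   β   α
  T   β   φ   α   β
  C   β   α   φ   β
  G   α   β   β   φ

[Diagram: nucleotides drawn as nodes, with transition arrows labeled α and transversion arrows labeled β]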


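Kimura Model

What is the probability that if we start with C we end up with C after 2 time steps? Sum over the four possible paths:

  C -> C -> C    φ∙φ
  C -> A -> C    β∙β
  C -> T -> C    α∙α
  C -> G -> C    β∙β

$P_{CC}(2) = \phi^2 + \alpha^2 + 2\beta^2$

In general, under the Kimura model the probability that a site that starts as C will contain a C after t time steps is given by:

$P_C(t) = \frac{1}{4} + \frac{1}{4} e^{-4\beta t} + \frac{1}{2} e^{-2(\alpha+\beta) t}$

Estimated number of substitutions (TR: fraction of sites showing a transition, TV: fraction of sites showing a transversion):

$K = \frac{1}{2} \ln\left[ \frac{1}{1 - 2\,TR - TV} \right] + \frac{1}{4} \ln\left[ \frac{1}{1 - 2\,TV} \right]$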
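A possible sketch of the Kimura estimate in Python follows. It assumes uppercase A, C, G, T sequences of equal length; the function names and the example sequences are invented for illustration.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def transition_transversion_fractions(seq_a, seq_b):
    """Return (TR, TV): fractions of sites showing a transition / a transversion."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    transitions = transversions = 0
    for x, y in zip(seq_a, seq_b):
        if x == y:
            continue
        # Both bases in the same chemical class means a transition.
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    n = len(seq_a)
    return transitions / n, transversions / n


def kimura_distance(tr, tv):
    """K = 1/2 ln[1/(1 - 2 TR - TV)] + 1/4 ln[1/(1 - 2 TV)]."""
    a = 1.0 - 2.0 * tr - tv
    b = 1.0 - 2.0 * tv
    if a <= 0.0 or b <= 0.0:
        raise ValueError("sequences too divergent for this estimate")
    return 0.5 * math.log(1.0 / a) + 0.25 * math.log(1.0 / b)


# Made-up example: one transition (A/G) and one transversion (A/T) in 10 sites.
tr, tv = transition_transversion_fractions("ACGTACGTAC", "GCGTTCGTAC")
k = kimura_distance(tr, tv)
print(tr, tv, k)
```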


Scoring Matrices


Alignment Score

An alignment score attempts to measure the likelihood of a common evolutionary ancestor.

Two possible ways to explain a given pairwise alignment:
- random model: the alignment could be produced purely by chance
- evolutionary (non-random) model: there is high correlation between aligned pairs

Under the random model:
- each position is independent of the others
- the probability of amino acid a occurring at each position is p_a

Under the non-random model:
- the probability of amino acid a depends on the matched residue b; the pair occurs with probability q_ab


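Substitution Matrices

Given a (non-gapped) pairwise alignment of sequences

  A = a1 a2 a3 a4 … an
  B = b1 b2 b3 b4 … bn

under the non-random model the probability of the alignment is

$P_{\text{non-random}} = q_{a_1 b_1} q_{a_2 b_2} q_{a_3 b_3} q_{a_4 b_4} \cdots q_{a_n b_n}$

and under the random model the probability of the alignment is

$P_{\text{random}} = p_{a_1} p_{a_2} p_{a_3} p_{a_4} \cdots p_{a_n} \; p_{b_1} p_{b_2} p_{b_3} p_{b_4} \cdots p_{b_n} = p_{a_1} p_{b_1} \, p_{a_2} p_{b_2} \, p_{a_3} p_{b_3} \, p_{a_4} p_{b_4} \cdots p_{a_n} p_{b_n}$

Use the ratio of probabilities (odds ratio) to compare the models:

$r = \frac{P_{\text{non-random}}}{P_{\text{random}}}$

If r > 1, the non-random model is more likely.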


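Substitution Matrices

Ratio of probabilities (odds ratio):

$r = \frac{P_{\text{non-random}}}{P_{\text{random}}} = \frac{q_{a_1 b_1} q_{a_2 b_2} q_{a_3 b_3} q_{a_4 b_4} \cdots q_{a_n b_n}}{p_{a_1} p_{b_1} \, p_{a_2} p_{b_2} \, p_{a_3} p_{b_3} \, p_{a_4} p_{b_4} \cdots p_{a_n} p_{b_n}} = \frac{q_{a_1 b_1}}{p_{a_1} p_{b_1}} \cdot \frac{q_{a_2 b_2}}{p_{a_2} p_{b_2}} \cdot \frac{q_{a_3 b_3}}{p_{a_3} p_{b_3}} \cdots \frac{q_{a_n b_n}}{p_{a_n} p_{b_n}}$

Typically the log-odds ratio is used:

$\log(r) = \log\left( \frac{q_{a_1 b_1}}{p_{a_1} p_{b_1}} \cdot \frac{q_{a_2 b_2}}{p_{a_2} p_{b_2}} \cdots \frac{q_{a_n b_n}}{p_{a_n} p_{b_n}} \right) = \log\left(\frac{q_{a_1 b_1}}{p_{a_1} p_{b_1}}\right) + \log\left(\frac{q_{a_2 b_2}}{p_{a_2} p_{b_2}}\right) + \log\left(\frac{q_{a_3 b_3}}{p_{a_3} p_{b_3}}\right) + \cdots + \log\left(\frac{q_{a_n b_n}}{p_{a_n} p_{b_n}}\right)$

Each term is an entry in the substitution matrix; for example, $\log\left(\frac{q_{a_1 b_1}}{p_{a_1} p_{b_1}}\right)$ is entry (a1, b1).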
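The point of the log-odds form is that the score of a whole alignment becomes a sum of per-position matrix entries. The sketch below checks this on a toy two-letter alphabet; the probabilities in `p` and `q` are invented purely for illustration and do not come from any real substitution matrix.

```python
import math

# Toy model (assumed values): background probabilities p and
# aligned-pair probabilities q over a two-letter alphabet {A, B}.
p = {"A": 0.5, "B": 0.5}
q = {("A", "A"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1, ("B", "B"): 0.4}


def odds_ratio(seq_a, seq_b):
    """r = P_non-random / P_random for an ungapped alignment."""
    p_nonrandom = 1.0
    p_random = 1.0
    for a, b in zip(seq_a, seq_b):
        p_nonrandom *= q[(a, b)]
        p_random *= p[a] * p[b]
    return p_nonrandom / p_random


def log_odds_score(seq_a, seq_b):
    """Sum of per-position log-odds terms log(q_ab / (p_a * p_b))."""
    return sum(math.log(q[(a, b)] / (p[a] * p[b])) for a, b in zip(seq_a, seq_b))


r = odds_ratio("AABA", "AABB")
s = log_odds_score("AABA", "AABB")
print(r, math.log(r), s)  # log(r) and s agree up to rounding
```

Here three matching positions outweigh one mismatch, so r > 1 and the non-random model is favored.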


Substitution Matrices

Provide the “likelihood” that two amino acids (or nucleotides) will occur as an aligned pair.

Common substitution matrices for protein alignment:
- PAM family: derived from alignments of high sequence identity
  (Dayhoff, Schwartz, and Orcutt. “A model of evolutionary change in proteins”. In Atlas of Protein Sequence and Structure, Volume 5. 1978: 345-352)
- BLOSUM family: derived from alignments of low sequence identity
  (Henikoff and Henikoff. “Amino acid substitution matrices from protein blocks”. Proc. Natl. Acad. Sci. 1992. 89(22): 10915–10919.)

BLOSUM62

   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4
B -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1 -4
Z -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1
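To show how such a matrix is used, the sketch below scores a short ungapped alignment by summing matrix entries. Only the handful of BLOSUM62 entries needed for the example are copied from the table above; the dictionary layout and the function name `alignment_score` are illustrative choices.

```python
# A small subset of BLOSUM62 entries, copied from the table above.
# Keys are unordered residue pairs, since the matrix is symmetric.
BLOSUM62_SUBSET = {
    frozenset(["A"]): 4,        # A/A
    frozenset(["T"]): 5,        # T/T
    frozenset(["C"]): 9,        # C/C
    frozenset(["K", "R"]): 2,   # K/R
    frozenset(["Q", "N"]): 0,   # Q/N
}


def alignment_score(seq_a, seq_b, matrix):
    """Sum the substitution scores of all aligned residue pairs (no gaps)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("ungapped alignment needs equal-length sequences")
    return sum(matrix[frozenset([a, b])] for a, b in zip(seq_a, seq_b))


score = alignment_score("ATCKQ", "ATCRN", BLOSUM62_SUBSET)
print(score)  # 4 + 5 + 9 + 2 + 0 = 20
```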


BLOSUM Matrices

Based on ungapped multiple local alignments of conserved regions of proteins with low sequence identity.

These alignments are used to derive q_ab, p_a and p_b, which give the substitution score for amino acids a and b:

$\text{score}(a, b) = \log\left( \frac{q_{ab}}{p_a p_b} \right)$

Procedure:
- obtain known ungapped multiple local alignments
- split the sequences into clusters, so that sequences sharing ≥ C% identity with a member of a cluster are grouped into that cluster
- for each pair of amino acids a and b calculate

  q_ab = frequency of the a,b pair / total # pairs

  (sequences within a cluster are given weight 1 / size_of_cluster)


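BLOSUM Matrices

Calculating q_QN for BLOSUM62: within a cluster, each sequence has ≥ 62% identity with another sequence in the cluster.

  ATCKQ
  ATCRN
  ASCKN
  SSCRN
  SDCEQ
  SECEN
  TECRQ

No two of these sequences share more than 3 of 5 residues (60% identity), so each sequence forms its own cluster:

- 7 clusters, 21 pairs of clusters
- 5 * 21 = 105 total # of aligned pairs
- QN matched in 12 pairs of clusters (the last column holds 3 Q's and 4 N's, giving 3 * 4 = 12)

q_QN = frequency of QN pair / total # aligned pairs = 12 / 105 = 0.114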
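The count of 12 QN pairs out of 105 can be reproduced by enumerating every pair of sequences and every column. This is a small sketch, with an illustrative function name, that treats each of the seven sequences as its own cluster as in the BLOSUM62 example.

```python
from itertools import combinations

# The seven sequences from the slide, one per cluster.
sequences = ["ATCKQ", "ATCRN", "ASCKN", "SSCRN", "SDCEQ", "SECEN", "TECRQ"]


def count_pairs(seqs, a, b):
    """Return (# aligned a/b pairs, total # aligned pairs) over all sequence pairs."""
    ab_pairs = 0
    total_pairs = 0
    for s, t in combinations(seqs, 2):
        for x, y in zip(s, t):
            total_pairs += 1
            if {x, y} == {a, b}:
                ab_pairs += 1
    return ab_pairs, total_pairs


qn_pairs, total = count_pairs(sequences, "Q", "N")
q_qn = qn_pairs / total
print(qn_pairs, total, round(q_qn, 3))  # 12 105 0.114
```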


BLOSUM Matrices

Calculating q_QN for BLOSUM50: within a cluster, each sequence has ≥ 50% identity with another sequence in the cluster.

  top:  ATCKQ  ATCRN  ASCKN  SSCRN    (last column: Q with weight 1/4, N with weight 3/4)
  mid:  SDCEQ  SECEN                  (last column: Q with weight 1/2, N with weight 1/2)
  bot:  TECRQ                         (last column: Q with weight 1)

- 3 clusters, 3 pairs of clusters
- 5 columns * 3 pairs of clusters = 15 total # of aligned pairs

QN match frequency (between clusters):
  top, mid: ¼*½ + ¾*½
  top, bot: ¾*1
  mid, bot: ½*1
  total:    1/8 + 3/8 + 3/4 + 1/2 = 14/8

q_QN = frequency of QN pair / total # aligned pairs = (14/8) / 15 = 0.1167
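The weighted count can be checked with exact fractions. In the sketch below the assignment of sequences to the top, mid and bot clusters is inferred from the weights on the slide (sizes 4, 2 and 1, in the order the sequences are listed) and should be read as an assumption; the function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

# Assumed cluster membership, inferred from the 1/4, 3/4, 1/2, 1 weights.
clusters = [
    ["ATCKQ", "ATCRN", "ASCKN", "SSCRN"],  # top
    ["SDCEQ", "SECEN"],                    # mid
    ["TECRQ"],                             # bot
]


def column_weights(cluster, col):
    """Weighted residue frequencies in one column; each sequence weighs 1/size."""
    weight = Fraction(1, len(cluster))
    freq = {}
    for seq in cluster:
        freq[seq[col]] = freq.get(seq[col], Fraction(0)) + weight
    return freq


def weighted_pair_count(clusters, a, b):
    """Return (weighted # of a/b pairs between clusters, total # aligned pairs), a != b."""
    length = len(clusters[0][0])
    ab = Fraction(0)
    total = 0
    for c1, c2 in combinations(clusters, 2):
        for col in range(length):
            f1 = column_weights(c1, col)
            f2 = column_weights(c2, col)
            ab += f1.get(a, 0) * f2.get(b, 0) + f1.get(b, 0) * f2.get(a, 0)
            total += 1
    return ab, total


qn_weight, total_pairs = weighted_pair_count(clusters, "Q", "N")
q_qn = qn_weight / total_pairs
print(qn_weight, total_pairs, float(q_qn))  # 7/4 (= 14/8), 15, about 0.1167
```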


BLOSUM Matrices

So far we have calculated q_ab (i.e. the probability that a and b will be paired up under the non-random model).

To compute the substitution score we also need p_a and p_b (i.e. the probabilities that a and b occur by chance):

$p_a = q_{aa} + \frac{1}{2} \sum_{b \neq a} q_{ab}$ ≈ fraction of all amino acids that are of type a

The entry computed in the substitution matrix is:

$\text{score}(a, b) = \log\left( \frac{q_{ab}}{p_a p_b} \right)$
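A small sketch of this last step, deriving p_a from the pair frequencies and then the score, is shown below. The two-letter alphabet and the q values are invented for illustration, natural logarithm is assumed because the slide does not state a base, and the score follows the formula exactly as written on the slide.

```python
import math

# Toy pair frequencies (assumed values) over unordered pairs; they sum to 1.
q = {
    frozenset(["Q"]): 0.3,       # q_QQ
    frozenset(["N"]): 0.5,       # q_NN
    frozenset(["Q", "N"]): 0.2,  # q_QN
}


def background_frequency(q, a):
    """p_a = q_aa + 1/2 * sum of q_ab over all b != a."""
    p_a = q.get(frozenset([a]), 0.0)
    for pair, value in q.items():
        if len(pair) == 2 and a in pair:
            p_a += 0.5 * value
    return p_a


def substitution_score(q, a, b):
    """score(a, b) = log(q_ab / (p_a * p_b)), natural log assumed."""
    q_ab = q[frozenset([a, b])]
    return math.log(q_ab / (background_frequency(q, a) * background_frequency(q, b)))


p_q = background_frequency(q, "Q")   # 0.3 + 0.2 / 2
p_n = background_frequency(q, "N")   # 0.5 + 0.2 / 2
score_qn = substitution_score(q, "Q", "N")
score_qq = substitution_score(q, "Q", "Q")
print(p_q, p_n, score_qn, score_qq)
```

With these toy numbers the Q/Q score is positive and the Q/N score is negative: identical pairs occur more often than chance predicts, mixed pairs less often.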


PAM Matrices

Based on ungapped multiple local alignments of conserved regions of proteins with high sequence identity (> 85%).

Uses phylogenetic trees to compute the entries in the substitution matrix.

Procedure:
- build a phylogenetic tree for sequences of high identity
- compute the relative mutability, m_a, of each amino acid (frequency of substitutions involving a in the phylogenetic tree)
- compute F_ab (number of substitutions between a and b)
- compute M_ab (mutation probability that b will be replaced by a)

  $M_{ab} = \frac{m_b F_{ab}}{\sum_c F_{cb}}$

- compute the entry in the scoring matrix

  $\text{score}(a, b) = \log\left( \frac{M_{ab}}{\text{frequency of } a} \right)$


PAM Matrices

Constructing a PAM matrix

Phylogenetic tree (the substitution on each branch is shown in parentheses):

  ACGCTAFKI
    GCGCTAFKI    (A->G)
      GCGCTGFKI    (A->G)
      GCGCTLFKI    (A->L)
    ACGCTAFKL    (I->L)
      ASGCTAFKL    (C->S)
      ACACTAFKL    (G->A)

Compute score(G, A): need m_A, F_GA, Σ_c F_cA

1) m_A = 4 / (2 * 6)    (4 of the 6 substitutions involve A)
2) F_GA = 3    (A->G, A->G, G->A)
3) Σ_c F_cA = 4
4) M_GA = m_A F_GA / Σ_c F_cA
5) score(G, A) = log(M_GA / frequency_of_G) = log(M_GA / (10/63))    (G occurs 10 times among the 7 * 9 = 63 residues)
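The five steps can be followed numerically with a short sketch. It reads "4 / 2*6" as 4 / (2 * 6) and uses the natural logarithm, both of which are assumptions since the slide leaves them open; the helper names are illustrative.

```python
import math

# The seven sequences in the tree and the six substitutions on its branches.
sequences = ["ACGCTAFKI", "GCGCTAFKI", "ACGCTAFKL",
             "GCGCTGFKI", "GCGCTLFKI", "ASGCTAFKL", "ACACTAFKL"]
substitutions = [("A", "G"), ("I", "L"),
                 ("A", "G"), ("A", "L"), ("C", "S"), ("G", "A")]


def involving(subs, a):
    """Number of substitutions in which amino acid a takes part."""
    return sum(1 for s in subs if a in s)


def between(subs, a, b):
    """Number of substitutions between a and b, in either direction."""
    return sum(1 for s in subs if set(s) == {a, b})


total_subs = len(substitutions)                          # 6
m_a = involving(substitutions, "A") / (2 * total_subs)   # step 1 (assumed reading)
f_ga = between(substitutions, "G", "A")                  # step 2
sum_f_ca = involving(substitutions, "A")                 # step 3
m_ga = m_a * f_ga / sum_f_ca                             # step 4

g_count = sum(seq.count("G") for seq in sequences)
residue_count = sum(len(seq) for seq in sequences)
freq_g = g_count / residue_count                         # 10 / 63
score_ga = math.log(m_ga / freq_g)                       # step 5 (natural log assumed)
print(m_a, f_ga, sum_f_ca, m_ga, freq_g, score_ga)
```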