1 Additive Distances Between DNA Sequences MPI, June 2012


Page 1

Additive Distances Between DNA Sequences

MPI, June 2012

Page 2

Additive evolutionary distance: the number of substitutions that occurred during the sequence evolution.

[Figure: substitutions at individual sites (site 1, site 2, ..., site 30) of an evolving sequence, with the substitutions at each site numbered.]

Some substitutions are hidden due to overwriting. Therefore, the exact number of substitutions is usually larger than the number of observed changes.

Page 3

Edge weight = expected number of substitutions per site

u: A A C A … G T C T T C G A G G C C C
v: A G C A … G C C T A T G C G A C C T
substitutions at each site: 0 1 0 0 … 0 2 0 0 1 1 0 1 2 1 0 0 1

Number of substitutions per site: 0.321

Page 4

When the exact number of substitutions between any two sequences is known, NJ (and any other algorithm which reconstructs trees from the exact distances) returns the correct evolutionary tree.

Interleaf distances: sum of edge weights

[Figure: tree with leaves u and v; visible edge labels 0.5, 0.42 and 0.3; d(u,v) = 1.12]
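The "sum of edge weights" rule can be sketched in a few lines of Python. This is only an illustration: the tree, its node names and its weights below are hypothetical and are not the ones in the slide's figure, and `leaf_distance` is a name chosen here.

```python
from collections import defaultdict


def leaf_distance(edges, u, v):
    """Sum the edge weights on the unique path between u and v in a tree."""
    adjacency = defaultdict(list)
    for a, b, w in edges:
        adjacency[a].append((b, w))
        adjacency[b].append((a, w))
    # Depth-first search; in a tree the first path found is the only path.
    stack = [(u, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == v:
            return dist
        for neighbour, w in adjacency[node]:
            if neighbour != parent:
                stack.append((neighbour, node, dist + w))
    raise ValueError("u and v are not connected")


# Hypothetical weighted tree: leaves u, v, a, b and internal nodes x, y.
example_edges = [
    ("u", "x", 0.2), ("x", "y", 0.3), ("y", "v", 0.1),
    ("x", "a", 0.4), ("y", "b", 0.25),
]
print(leaf_distance(example_edges, "u", "v"))  # path u-x-y-v
```

Distances computed this way are additive by construction, which is the property NJ relies on when it is given exact distances.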

Page 5

Estimating the number of substitutions from observed substitutions requires a substitution model:

JC [Jukes Cantor 1969]
Kimura 2 Parameter (K2P) [Kimura 1980]
HKY [Hasegawa, Kishino and Yano 1985]
TN [Tamura and Nei 1993]
GTR: Generalised time-reversible [Tavaré 1986]
…and more…

Page 6

Distance estimation in the Jukes Cantor model

Page 7

Jukes Cantor model: all substitutions are equally likely.

[Figure: edge between u and v with weight t_uv]

JC generic rate matrix (rows and columns ordered A, G, C, T); t = t_{uv} is the expected number of substitutions per site:

$$
R_{uv} =
\begin{pmatrix}
-t & t/3 & t/3 & t/3 \\
t/3 & -t & t/3 & t/3 \\
t/3 & t/3 & -t & t/3 \\
t/3 & t/3 & t/3 & -t
\end{pmatrix}
$$

Page 8

Rate matrix R (rows and columns ordered A, G, C, T):

$$
R =
\begin{pmatrix}
-t & t/3 & t/3 & t/3 \\
t/3 & -t & t/3 & t/3 \\
t/3 & t/3 & -t & t/3 \\
t/3 & t/3 & t/3 & -t
\end{pmatrix}
$$

Substitution matrix P (Theory of Markov Processes):

$$
P =
\begin{pmatrix}
1-3p(t) & p(t) & p(t) & p(t) \\
p(t) & 1-3p(t) & p(t) & p(t) \\
p(t) & p(t) & 1-3p(t) & p(t) \\
p(t) & p(t) & p(t) & 1-3p(t)
\end{pmatrix}
$$

where t is the expected number of substitutions per site and p(t) is the substitution probability:

$$
p(t) = \frac{1}{4}\left(1 - e^{-\frac{4}{3}t}\right)
$$
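As a numerical sanity check, the closed form p(t) = (1/4)(1 - e^{-4t/3}) can be compared with the matrix exponential of the JC rate matrix. The sketch below uses only the standard library and a truncated Taylor series for exp(R); the function names and the value t = 0.3 are illustrative choices, not taken from the slides.

```python
import math


def jc_p(t):
    """JC probability of one specific substitution; t = expected substitutions per site."""
    return 0.25 * (1.0 - math.exp(-4.0 * t / 3.0))


def jc_rate_matrix(t):
    """4x4 JC rate matrix: -t on the diagonal, t/3 elsewhere."""
    return [[-t if i == j else t / 3.0 for j in range(4)] for i in range(4)]


def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(m, terms=30):
    """Truncated Taylor series exp(M) = sum_k M^k / k!, adequate for small entries."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]  # holds M^k / k!, starting with k = 0
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, m)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result


t = 0.3  # illustrative value
P = mat_exp(jc_rate_matrix(t))
print("off-diagonal:", P[0][1], "closed form:", jc_p(t))
print("diagonal:    ", P[0][0], "closed form:", 1.0 - 3.0 * jc_p(t))
```

The two columns of output should agree to floating-point precision, and each row of P should sum to 1.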

Page 9

JC distance estimation: first estimate the substitution matrix

u: A A C A … G T C T T C G A G G C C C
v: A G C A … G C C T A T G C G A C C T

An estimate of P_uv from the observed substitutions (rows and columns ordered A, G, C, T):

$$
\hat{P}_{uv} =
\begin{pmatrix}
1-3\hat{p}(t) & \hat{p}(t) & \hat{p}(t) & \hat{p}(t) \\
\hat{p}(t) & 1-3\hat{p}(t) & \hat{p}(t) & \hat{p}(t) \\
\hat{p}(t) & \hat{p}(t) & 1-3\hat{p}(t) & \hat{p}(t) \\
\hat{p}(t) & \hat{p}(t) & \hat{p}(t) & 1-3\hat{p}(t)
\end{pmatrix}
$$

$$
\hat{p}(t) = \frac{1}{3} \cdot \frac{\text{number of observed substitutions}}{\text{total number of sites}}
$$
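A minimal sketch of this estimator, assuming two aligned, gap-free sequences of equal length. The short sequences below are hypothetical (the slide's own sequences are elided in the middle), and `estimate_p` is a name chosen for this example.

```python
def estimate_p(u, v):
    """Estimate p(t) as (1/3) * (observed substitutions / total sites)."""
    if len(u) != len(v) or len(u) == 0:
        raise ValueError("sequences must be aligned and non-empty")
    observed = sum(1 for a, b in zip(u, v) if a != b)
    return observed / (3.0 * len(u))


# Hypothetical aligned sequences: they differ at 2 of 8 sites.
u_seq = "AACAGTCT"
v_seq = "AGCAGCCT"
print(estimate_p(u_seq, v_seq))  # 2 / (3 * 8)
```

The factor 1/3 reflects the JC assumption that each of the three possible substitutions from a given nucleotide is equally likely.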

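Page 10

Estimate t from the estimate of p(t) by “reverse engineering”: solve the formula for p(t) for t.

$$
\hat{t} = -\frac{3}{4}\ln\bigl(1 - 4\hat{p}(t)\bigr)
$$

This maps the estimated substitution matrix to an estimated rate matrix (rows and columns ordered A, G, C, T):

$$
\hat{P}_{uv} =
\begin{pmatrix}
1-3\hat{p}(t) & \hat{p}(t) & \hat{p}(t) & \hat{p}(t) \\
\hat{p}(t) & 1-3\hat{p}(t) & \hat{p}(t) & \hat{p}(t) \\
\hat{p}(t) & \hat{p}(t) & 1-3\hat{p}(t) & \hat{p}(t) \\
\hat{p}(t) & \hat{p}(t) & \hat{p}(t) & 1-3\hat{p}(t)
\end{pmatrix}
\;\longrightarrow\;
\hat{R}_{uv} =
\begin{pmatrix}
-\hat{t} & \hat{t}/3 & \hat{t}/3 & \hat{t}/3 \\
\hat{t}/3 & -\hat{t} & \hat{t}/3 & \hat{t}/3 \\
\hat{t}/3 & \hat{t}/3 & -\hat{t} & \hat{t}/3 \\
\hat{t}/3 & \hat{t}/3 & \hat{t}/3 & -\hat{t}
\end{pmatrix}
$$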
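The inversion step can be sketched as follows. Returning infinity when 1 - 4p̂ is not positive is a convention chosen for this example (the logarithm is undefined there); the slides do not say how such cases are handled. The function names are illustrative.

```python
import math


def jc_p(t):
    """Forward JC formula: p(t) = (1/4) * (1 - exp(-4t/3))."""
    return 0.25 * (1.0 - math.exp(-4.0 * t / 3.0))


def jc_t_from_p(p_hat):
    """Invert the JC formula: t = -(3/4) * ln(1 - 4 * p_hat)."""
    arg = 1.0 - 4.0 * p_hat
    if arg <= 0.0:
        # Assumption of this sketch: report an undefined distance as infinity.
        return math.inf
    return -0.75 * math.log(arg)


# Round trip: starting from a known t, the estimate recovers it.
t_true = 0.3
print(jc_t_from_p(jc_p(t_true)))

# The corrected distance exceeds the observed fraction of differing sites.
p_hat = 2.0 / 24.0          # e.g. 2 differences in 8 sites
print(3.0 * p_hat, jc_t_from_p(p_hat))
```

The second print illustrates the point made earlier in the deck: the estimated number of substitutions per site is larger than the observed fraction of changed sites.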

Page 11

Checking the effect of estimation errors in reconstructing quartets

Page 12

Quartets Reconstruction = Finding the correct split

Quartets are trees with four leaves. They have three possible (fully resolved) topologies, called splits:

AB|CD    AC|BD    AD|BC

Distance methods resolve splits by the 4-point method.

[Speaker note: say that the degrees are 3, and point out the relation between edge length and robustness to noise. Remove the animation, except for the removal of the two quartets.]
Page 13

The 4 points method

[Figure: quartet with split AB|CD; the separating (internal) edge has weight w_sep]

The 4-point condition:

$$
d_{K2P}(A,B) + d_{K2P}(C,D) + 2w_{sep} = d_{K2P}(A,C) + d_{K2P}(B,D) = d_{K2P}(A,D) + d_{K2P}(B,C)
$$

The 4-point condition for estimated distances:

$$
d_{K2P}(A,B) + d_{K2P}(C,D) < \min\bigl\{ d_{K2P}(A,C) + d_{K2P}(B,D),\; d_{K2P}(A,D) + d_{K2P}(B,C) \bigr\}
$$
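The 4-point method can be sketched as choosing the split whose pairwise sum is smallest. The distance values below are hypothetical additive distances (all four leaf edges 0.3, separating edge 0.1), and `four_point_split` is a name introduced for this example; any distance estimate, K2P or otherwise, could be plugged in.

```python
def four_point_split(dist):
    """Return the split with the smallest pairwise sum, plus all three sums.

    dist maps unordered leaf pairs, given as tuples such as ("A", "B"),
    to distances between the leaves A, B, C, D.
    """
    def d(x, y):
        return dist[(x, y)] if (x, y) in dist else dist[(y, x)]

    sums = {
        "AB|CD": d("A", "B") + d("C", "D"),
        "AC|BD": d("A", "C") + d("B", "D"),
        "AD|BC": d("A", "D") + d("B", "C"),
    }
    return min(sums, key=sums.get), sums


# Hypothetical additive distances for the split AB|CD.
example_dist = {
    ("A", "B"): 0.6, ("C", "D"): 0.6,
    ("A", "C"): 0.7, ("B", "D"): 0.7,
    ("A", "D"): 0.7, ("B", "C"): 0.7,
}
split, sums = four_point_split(example_dist)
print(split)
# With exact distances the gap between the sums equals 2 * w_sep.
print((sums["AC|BD"] - sums["AB|CD"]) / 2.0)
```

With estimated distances the two larger sums are no longer exactly equal, so only the strict minimum is used; a short separating edge leaves little margin for estimation error.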

Page 14

Evaluate the accuracy of reconstructing quartets using evolutionary distances

[Figure: model quartet with a root and leaves A, B, C, D; two edges are labeled t and four edges are labeled 10t; the K2P rate matrix below is attached to every edge.]

K2P rate matrix (rows and columns ordered A, G, T, C; the diagonal entries, shown as -, make each row sum to zero):

$$
\begin{pmatrix}
- & \alpha & \beta & \beta \\
\alpha & - & \beta & \beta \\
\beta & \beta & - & \alpha \\
\beta & \beta & \alpha & -
\end{pmatrix}
$$

t is “evolutionary time”

The diameter of the quartet is 22t

Page 15

Phase A: simulate evolution

[Figure: copies of the sequence CCCGGAGCTTCTG…ACAA placed on the model quartet with leaves A, B, C, D]

Page 16

Phase B: reconstruct the split by the 4p condition

[Figure: the four leaf sequences, labeled A, B, C, D, each shown as CCCGGAGCTTCTG…ACAA]

Compute distances between sequences:

$$
\hat{D}(i,j) = \hat{d}(s_i, s_j)
$$

Apply the 4p condition. Is the reconstruction correct?

Repeat this process 10,000 times and count the number of failures.
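The two phases can be sketched end to end as below. Several things here are assumptions of this example rather than the experiment in the slides: it simulates under the JC model instead of K2P (to keep the code short), it assumes the true split is AB|CD with the root on the internal edge, and the sequence length, trial count and seed are arbitrary. The template shape (two edges of length t, four of length 10t, diameter 22t) follows the deck.

```python
import math
import random

NUCLEOTIDES = "ACGT"


def evolve(seq, t, rng):
    """Evolve seq along one edge with t expected substitutions per site (JC)."""
    # A site ends in a different state with probability 3 * p(t).
    p_change = 0.75 * (1.0 - math.exp(-4.0 * t / 3.0))
    out = []
    for base in seq:
        if rng.random() < p_change:
            base = rng.choice([b for b in NUCLEOTIDES if b != base])
        out.append(base)
    return "".join(out)


def jc_distance(u, v):
    """JC distance estimate; infinity when the sequences look saturated."""
    diff = sum(1 for a, b in zip(u, v) if a != b) / len(u)
    if diff >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * diff / 3.0)


def simulate_failure_rate(diameter, n_sites, n_trials, seed=0):
    """Fraction of trials in which the 4-point method misses the split AB|CD."""
    rng = random.Random(seed)
    t = diameter / 22.0  # template quartet: diameter = 10t + t + t + 10t
    failures = 0
    for _ in range(n_trials):
        # Phase A: simulate evolution from the root to the four leaves.
        root = "".join(rng.choice(NUCLEOTIDES) for _ in range(n_sites))
        left, right = evolve(root, t, rng), evolve(root, t, rng)
        a, b = evolve(left, 10 * t, rng), evolve(left, 10 * t, rng)
        c, d = evolve(right, 10 * t, rng), evolve(right, 10 * t, rng)
        # Phase B: estimate distances and apply the 4-point condition.
        ab_cd = jc_distance(a, b) + jc_distance(c, d)
        ac_bd = jc_distance(a, c) + jc_distance(b, d)
        ad_bc = jc_distance(a, d) + jc_distance(b, c)
        if not ab_cd < min(ac_bd, ad_bc):
            failures += 1
    return failures / n_trials


# Small illustrative run (far fewer trials than the 10,000 in the slides).
print(simulate_failure_rate(diameter=0.5, n_sites=500, n_trials=200, seed=1))
```

Ties and undefined (infinite) distances are counted as failures here, which is one possible convention. Sweeping `diameter` over a range of values gives the kind of failure curve described on the following slides.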

Page 17

This test was applied on the model quartet with various diameters.

For each diameter, mark the fraction (percentage) of the simulations in which the reconstruction failed (next slide).

[Figure: copies of the model quartet (root, leaves A, B, C, D, edge labels t and 10t) drawn at various diameters]

Page 18

Performance of K2P distances in resolving quartets, small diameters: 0.01-0.2

[Plot: performance of K2P standard distance method in resolving quartets, R=10. x-axis: quartet diameter (total rate between furthest leaves), 0 to 0.2; y-axis: fraction of failures out of 10000 experiments. Inset: template quartet with edge labels t and 10t.]

Page 19

Performance for larger diameters

[Plot: performance of K2P standard distance method in resolving quartets, for quartet ratio 0.1, R=10. x-axis: quartet diameter (= mutation rate between furthest leaves), 0.2 to 2; y-axis: fraction of failures out of 10000 simulations. Annotation: “site saturation”.]

Page 20

Repeat this experiment on the Hasegawa tree

• Assume the JC model.
• Reconstruct by the NJ algorithm (use any variant of NJ available in MATLAB).
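For readers without MATLAB, the textbook form of neighbor joining can be sketched in plain Python. This is an illustration of the general algorithm, not any MATLAB variant, and the four-leaf distance matrix below is a hypothetical additive example rather than the Hasegawa tree. The function name and the `internalN` node labels are choices made here.

```python
def neighbor_joining(names, dist):
    """Textbook NJ. dist[a][b] is the distance between taxa a and b (a != b).

    Returns a list of (node, node, branch_length) edges of an unrooted tree.
    """
    d = {a: dict(dist[a]) for a in names}
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Pick the pair minimising Q(a, b) = (n - 2) d(a, b) - r(a) - r(b).
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                a, b = nodes[i], nodes[j]
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        new = "internal%d" % next_id
        next_id += 1
        # Branch lengths from the joined pair to the new internal node.
        len_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2.0 * (n - 2))
        len_b = d[a][b] - len_a
        edges.append((a, new, len_a))
        edges.append((b, new, len_b))
        # Distances from the new node to every remaining node.
        d[new] = {}
        for c in nodes:
            if c not in (a, b):
                d[new][c] = d[c][new] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [new]
    a, b = nodes
    edges.append((a, b, d[a][b]))
    return edges


# Hypothetical additive distances: leaf edges A=1, B=2, C=3, D=4, internal edge 5.
pairs = {("A", "B"): 3, ("A", "C"): 9, ("A", "D"): 10,
         ("B", "C"): 10, ("B", "D"): 11, ("C", "D"): 7}
taxa = ["A", "B", "C", "D"]
matrix = {x: {} for x in taxa}
for (x, y), value in pairs.items():
    matrix[x][y] = matrix[y][x] = float(value)

for edge in neighbor_joining(taxa, matrix):
    print(edge)
```

Because the example distances are exactly additive, the recovered branch lengths match the tree they were generated from; with JC estimates from simulated sequences they would only approximate it.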

Page 21

Hasegawa Tree