Distance-based methods

Xuhua Xia

xxia@uottawa.ca

http://dambe.bio.uottawa.ca

Xuhua Xia Slide 2

Lecture Outline

• Objectives in this lecture
– Grasp the basic concepts of distance-based tree-building algorithms
– Learn the least-squares criterion and the minimum evolution criterion, and how to use them to construct a tree

• Distance-based methods
– Genetic distance: generally defined as the number of substitutions per site
  • JC69 distance
  • K80 distance
  • TN84 distance
  • F84 distance
  • TN93 distance
  • LogDet distance
– Tree-building algorithms
  • UPGMA
  • Neighbor-joining
  • Fitch-Margoliash
  • FastME


Xuhua Xia Slide 3

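Genetic Distances

• Genetic distances: Assuming a substitution model, we can obtain the genetic distance (i.e., difference) between two nucleotide or amino acid sequences, e.g.,

• JC69:

$$K_{JC69} = -\frac{3}{4}\ln\left(1 - \frac{4p}{3}\right)$$

• K80:

$$K_{K80} = \frac{1}{2}\ln\left(\frac{1}{1 - 2P - Q}\right) + \frac{1}{4}\ln\left(\frac{1}{1 - 2Q}\right)$$

• TN93:

$$D_{TN93} = 4\pi_C\pi_T\alpha_1 + 4\pi_A\pi_G\alpha_2 + 4\pi_R\pi_Y\beta$$

$$\alpha_1 = \frac{-\ln\left(1 - \dfrac{\pi_Y P_1}{2\pi_T\pi_C} - \dfrac{Q}{2\pi_Y}\right) + \pi_R\ln\left(1 - \dfrac{Q}{2\pi_R\pi_Y}\right)}{2\pi_Y}$$

$$\alpha_2 = \frac{-\ln\left(1 - \dfrac{\pi_R P_2}{2\pi_A\pi_G} - \dfrac{Q}{2\pi_R}\right) + \pi_Y\ln\left(1 - \dfrac{Q}{2\pi_R\pi_Y}\right)}{2\pi_R}$$

$$\beta = -\frac{1}{2}\ln\left(1 - \frac{Q}{2\pi_R\pi_Y}\right)$$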
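The TN93 expressions are easier to check when written as code. The sketch below assumes the standard Tamura-Nei parameterization: `freqs` holds the nucleotide frequencies, `p1` and `p2` are the proportions of sites with T↔C and A↔G transitional differences, and `q` is the proportion with transversional differences. The function name `tn93_distance` and its argument names are illustrative and do not come from the slides.

```python
import math

def tn93_distance(p1, p2, q, freqs):
    """TN93 distance from difference proportions and nucleotide frequencies.

    p1: proportion of T<->C transitional differences (assumed definition)
    p2: proportion of A<->G transitional differences (assumed definition)
    q:  proportion of transversional differences
    freqs: dict with keys "A", "C", "G", "T" summing to 1
    """
    pa, pc, pg, pt = (freqs[base] for base in "ACGT")
    py = pc + pt  # pyrimidine frequency
    pr = pa + pg  # purine frequency

    log_y = math.log(1 - py * p1 / (2 * pt * pc) - q / (2 * py))
    log_r = math.log(1 - pr * p2 / (2 * pa * pg) - q / (2 * pr))
    log_q = math.log(1 - q / (2 * pr * py))

    # alpha1, alpha2 and beta are rate-by-time products, not raw rates.
    alpha1 = (-log_y + pr * log_q) / (2 * py)
    alpha2 = (-log_r + py * log_q) / (2 * pr)
    beta = -log_q / 2

    return 4 * pc * pt * alpha1 + 4 * pa * pg * alpha2 + 4 * pr * py * beta

# With equal frequencies and p1 == p2, TN93 should collapse to the K80 distance.
equal_freqs = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
d_equal = tn93_distance(1 / 12, 1 / 12, 1 / 12, equal_freqs)
print(round(d_equal, 6))
```

The check at the end uses P = 4/24 split evenly between the two transition types and Q = 2/24, so the result can be compared with a K80 distance computed from the same P and Q.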


Xuhua Xia Slide 4

Calculation of K_JC69

$$K_{JC69} = -\frac{3}{4}\ln\left(1 - \frac{4p}{3}\right)$$

[Figure: an ancestral sequence AACGACGATCG diverges into Species 1 and Species 2, each along a branch of length t]

The total divergence time between Species 1 and Species 2 is 2t.

Sp1: AAG CCT CGG GGC CCT TAT TTT TTG
     ||  |   ||| ||| |   ||| ||| ||
Sp2: AAT CTC CGG GGC CTC TAT TTT TTT

p = 6/24 = 0.25

K_JC69 = 0.304099

Genetic distances are scaled to be the number of substitutions per site.
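The JC69 calculation above can be reproduced with a few lines of Python. This is a minimal sketch; the helper names `p_distance` and `jc69_distance` are illustrative, and spaces in the sequences are treated as codon separators only.

```python
import math

def p_distance(seq1, seq2):
    """Proportion of aligned sites that differ (spaces are ignored)."""
    s1 = seq1.replace(" ", "").upper()
    s2 = seq2.replace(" ", "").upper()
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(a != b for a, b in zip(s1, s2))
    return differences / len(s1)

def jc69_distance(p):
    """JC69 distance: K = -(3/4) ln(1 - 4p/3)."""
    return -0.75 * math.log(1 - 4 * p / 3)

sp1 = "AAG CCT CGG GGC CCT TAT TTT TTG"
sp2 = "AAT CTC CGG GGC CTC TAT TTT TTT"

p = p_distance(sp1, sp2)
k_jc69 = jc69_distance(p)
print(f"p = {p}, K_JC69 = {k_jc69:.6f}")
```

The logarithm is undefined once p reaches 0.75, so a practical implementation would need to handle saturated sequence pairs separately.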


Xuhua Xia Slide 5

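Numerical Illustration

Sp1: AAG CCT CGG GGC CCT TAT TTT TTG
     ||  |   ||| ||| |   ||| ||| ||
Sp2: AAT CTC CGG GGC CTC TAT TTT TTT

What are P and Q?

P = 4/24 (proportion of sites with transitional differences), Q = 2/24 (proportion of sites with transversional differences)

$$K_{K80} = -\frac{\ln(1 - 2P - Q)}{2} - \frac{\ln(1 - 2Q)}{4} = 0.315079$$

Comparison of distances:

p-distance = 0.25

Poisson-corrected distance = -ln(1 - p) = 0.288

K_JC69 = 0.304099

K_K80 = 0.315079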
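To make the counting of P and Q explicit, the following Python sketch classifies each differing site as a transition or a transversion and then applies the K80 formula. The function names are illustrative, and the code assumes an ungapped alignment containing only A, C, G and T.

```python
import math

PURINES = {"A", "G"}

def transition_transversion_proportions(seq1, seq2):
    """Return (P, Q): proportions of transitional and transversional differences."""
    s1 = seq1.replace(" ", "").upper()
    s2 = seq2.replace(" ", "").upper()
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for a, b in zip(s1, s2):
        if a == b:
            continue
        # A transition stays within purines or within pyrimidines.
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    n_sites = len(s1)
    return transitions / n_sites, transversions / n_sites

def k80_distance(P, Q):
    """K80 distance: -ln(1 - 2P - Q)/2 - ln(1 - 2Q)/4."""
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)

sp1 = "AAG CCT CGG GGC CCT TAT TTT TTG"
sp2 = "AAT CTC CGG GGC CTC TAT TTT TTT"

P, Q = transition_transversion_proportions(sp1, sp2)
k_k80 = k80_distance(P, Q)
poisson = -math.log(1 - (P + Q))
print(f"P = {P:.4f}, Q = {Q:.4f}, Poisson = {poisson:.3f}, K80 = {k_k80:.6f}")
```

Because P + Q equals the p-distance, the same two counts are enough to recompute every distance in the comparison.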


Xuhua Xia Slide 6

Distance-based phylogenetic algorithms

Algorithm           Optimization   Assumes a molecular clock
UPGMA               Local          Yes
Neighbor-joining    Local          No
Minimum Evolution   Global         No
Fitch-Margoliash    Global         No
FastME              Global         No


Xuhua Xia Slide 7

A Star Tree (Completely Unresolved Tree)

[Figure: star tree in which Human, Chimpanzee, Gorilla, Orangutan, and Gibbon all join at a single internal node]


Xuhua Xia Slide 8

Genetic Distance Matrix

Matrix of genetic distances (Dij):

         Human  Chimp  Gorilla  Orang  Gibbon
Human           0.015  0.045    0.143  0.198
Chimp                  0.030    0.126  0.179
Gorilla                         0.092  0.179
Orang                                  0.179
Gibbon


Xuhua Xia Slide 9

UPGMA

• Original distances:

         Human  Chimp  Gorilla  Orang  Gibbon
Human           0.015  0.045    0.143  0.198
Chimp                  0.030    0.126  0.179
Gorilla                         0.092  0.179
Orang                                  0.179
Gibbon

• Human and Chimp have the smallest distance (0.015) and are clustered first:
D(hu-ch),go = (Dhu,go + Dch,go)/2 = 0.038
D(hu-ch),or = (Dhu,or + Dch,or)/2 = 0.135
D(hu-ch),gi = (Dhu,gi + Dch,gi)/2 = 0.189

• Reduced matrix:

         hu-ch  Gorilla  Orang  Gibbon
hu-ch           0.038    0.135  0.189
Gorilla                  0.092  0.179
Orang                           0.179
Gibbon

[Figure: trees after the first and second clustering steps]

((hu,ch),(go,or,gi))

(((hu,ch),go),(or,gi))


Xuhua Xia Slide 10

UPGMA

• Original distances:

         Human  Chimp  Gorilla  Orang  Gibbon
Human           0.015  0.045    0.143  0.198
Chimp                  0.030    0.126  0.179
Gorilla                         0.092  0.179
Orang                                  0.179
Gibbon

• Distances from the hu-ch-go cluster are averaged over the original pairwise distances:
D(hu-ch-go),or = (Dhu,or + Dch,or + Dgo,or)/3 = 0.120
D(hu-ch-go),gi = (Dhu,gi + Dch,gi + Dgo,gi)/3 = 0.185

• Reduced matrix:

          hu-ch-go  Orang  Gibbon
hu-ch-go            0.120  0.185
Orang                      0.179
Gibbon

• D(hu-ch-go-or),gi = (Dhu,gi + Dch,gi + Dgo,gi + Dor,gi)/4 = 0.184

[Figure: trees after the third and fourth clustering steps]

((((hu,ch),go),or),gi)
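The clustering steps above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function name `upgma` and the data layout (a dictionary keyed by unordered pairs of OTU names) are assumptions made for the example, and ties between equally close clusters are broken arbitrarily.

```python
from itertools import combinations

def upgma(names, dist):
    """Return the list of merges as (cluster1, cluster2, distance, node_height).

    dist maps frozenset({a, b}) to the distance between original OTUs a and b.
    Cluster-to-cluster distance is the mean over all original OTU pairs.
    """
    clusters = [(name,) for name in names]  # each cluster is a tuple of OTU names
    merges = []

    def average_distance(c1, c2):
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Join the two closest clusters.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(*pair))
        d = average_distance(c1, c2)
        merges.append((c1, c2, d, d / 2))  # the new node sits at half the distance
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

names = ["Human", "Chimp", "Gorilla", "Orang", "Gibbon"]
raw = {
    ("Human", "Chimp"): 0.015, ("Human", "Gorilla"): 0.045,
    ("Human", "Orang"): 0.143, ("Human", "Gibbon"): 0.198,
    ("Chimp", "Gorilla"): 0.030, ("Chimp", "Orang"): 0.126,
    ("Chimp", "Gibbon"): 0.179, ("Gorilla", "Orang"): 0.092,
    ("Gorilla", "Gibbon"): 0.179, ("Orang", "Gibbon"): 0.179,
}
dist = {frozenset(pair): value for pair, value in raw.items()}

merges = upgma(names, dist)
for c1, c2, d, height in merges:
    print(c1, c2, round(d, 3), round(height, 4))
```

Averaging over the original OTU pairs, rather than over the already reduced matrix, is what keeps the method "unweighted"; it matches the divisors 2, 3 and 4 used in the slide's calculations.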


Xuhua Xia Slide 11

Phylogenetic Relationship from UPGMA

• Original distances:

         Human  Chimp  Gorilla  Orang  Gibbon
Human           0.015  0.045    0.143  0.198
Chimp                  0.030    0.126  0.179
Gorilla                         0.092  0.179
Orang                                  0.179
Gibbon

• After clustering Human and Chimp:

         hu-ch  Gorilla  Orang  Gibbon
hu-ch           0.038    0.135  0.189
Gorilla                  0.092  0.179
Orang                           0.179
Gibbon

• After adding Gorilla to the cluster:

          hu-ch-go  Orang  Gibbon
hu-ch-go            0.120  0.185
Orang                      0.179
Gibbon


Xuhua Xia Slide 12

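Branch Lengths

Topologies at successive clustering steps:

((hu,ch),(go,or,gi))

(((hu,ch),go),(or,gi))

((((hu,ch),go),or),gi)

Cluster distances:

Dhu,ch = 0.015
D(hu-ch),go = (Dhu,go + Dch,go)/2 = 0.038
D(hu-ch),or = (Dhu,or + Dch,or)/2 = 0.135
D(hu-ch),gi = (Dhu,gi + Dch,gi)/2 = 0.189

D(hu-ch-go),or = (Dhu,or + Dch,or + Dgo,or)/3 = 0.120
D(hu-ch-go),gi = (Dhu,gi + Dch,gi + Dgo,gi)/3 = 0.185

D(hu-ch-go-or),gi = (Dhu,gi + Dch,gi + Dgo,gi + Dor,gi)/4 = 0.184

Each new node is placed at half the distance between the two clusters it joins, and an internal branch is the difference between the heights of the nodes at its two ends (e.g., 0.019 - 0.0075 = 0.0115):

((hu:0.0075,ch:0.0075),(go,or,gi))

(((hu:0.0075,ch:0.0075):0.0115,go:0.019),(or,gi))

((((hu:0.0075,ch:0.0075):0.0115,go:0.019):0.041,or:0.06):0.032,gi:0.092)

[Figure: UPGMA tree of Human, Chimp, Gorilla, Orang, and Gibbon with node heights 0.0075, 0.019, 0.06, and 0.092]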


Xuhua Xia Slide 13

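Final UPGMA Tree

[Figure: final UPGMA tree of Human, Chimp, Gorilla, Orang, and Gibbon, drawn against a scale of node heights and the corresponding times]

Node height   0.092   0.060   0.019   0.0075
Time (MY)     19      13      8       6

((((hu:0.0075,ch:0.0075):0.0115,go:0.019):0.041,or:0.06):0.032,gi:0.092);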


Xuhua Xia Slide 14

Distance-based method

• Distance matrix
• Tree-building algorithms
– UPGMA
– Neighbor-joining
– FastME
– Fitch-Margoliash
• Criterion-based methods
– Branch-length estimation
– Tree-selection criterion


Xuhua Xia Slide 15

Branch Length Estimation

• For three OTUs, the branch lengths can be estimated directly
• For more than three OTUs, there are two commonly used methods for estimating branch lengths
– The least-squares method
– Fitch-Margoliash method
• Don’t confuse the Fitch-Margoliash method of branch length estimation with the Fitch-Margoliash criterion of tree selection
• Illustration of the least-squares method of branch length estimation


Xuhua Xia Slide 16

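For three OTUs

Observed distances:

     1      2      3
1           0.092  0.179
2                  0.179
3

In general:

     1      2      3
1           d12    d13
2                  d23
3

d12 = x1 + x2

d13 = x1 + x3

d23 = x2 + x3

[Figure: unrooted tree in which OTUs 1, 2, and 3 join a single internal node through branches x1, x2, and x3]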
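With three OTUs there are three equations and three unknowns, so the branch lengths follow directly by adding two equations and subtracting the third. A small Python sketch, with an illustrative function name, applied to the three-OTU matrix above:

```python
def three_otu_branch_lengths(d12, d13, d23):
    """Solve d12 = x1 + x2, d13 = x1 + x3, d23 = x2 + x3 for x1, x2, x3."""
    x1 = (d12 + d13 - d23) / 2
    x2 = (d12 + d23 - d13) / 2
    x3 = (d13 + d23 - d12) / 2
    return x1, x2, x3

x1, x2, x3 = three_otu_branch_lengths(0.092, 0.179, 0.179)
print(round(x1, 3), round(x2, 3), round(x3, 3))

# The solution reproduces the observed distances exactly.
print(round(x1 + x2, 3), round(x1 + x3, 3), round(x2 + x3, 3))
```

No fitting criterion is needed here because the system is exactly determined; the least-squares machinery only becomes necessary when there are more pairwise distances than branches.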


Xuhua Xia Slide 17

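Least-squares method

[Figure: unrooted four-OTU tree; OTUs 1 and 2 attach through branches x1 and x2, OTUs 3 and 4 attach through branches x3 and x4, and x5 is the internal branch]

Observed distances (lower-triangle matrix):

4
Sp1
Sp2  0.3
Sp3  0.4  0.5
Sp4  0.4  0.6  0.6

In general:

4
Sp1
Sp2  d12
Sp3  d13  d23
Sp4  d14  d24  d34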


Xuhua Xia Slide 18

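Least-squares method

[Figure: unrooted four-OTU tree with terminal branches x1 to x4 and internal branch x5]

Distances implied by the tree (path lengths):

d’12 = x1 + x2

d’13 = x1 + x5 + x3

d’14 = x1 + x5 + x4

d’23 = x2 + x5 + x3

d’24 = x2 + x5 + x4

d’34 = x3 + x4

Squared deviations between observed and implied distances:

$$\begin{aligned}
(d_{12} - d'_{12})^2 &= [d_{12} - (x_1 + x_2)]^2 \\
(d_{13} - d'_{13})^2 &= [d_{13} - (x_1 + x_5 + x_3)]^2 \\
(d_{14} - d'_{14})^2 &= [d_{14} - (x_1 + x_5 + x_4)]^2 \\
(d_{23} - d'_{23})^2 &= [d_{23} - (x_2 + x_5 + x_3)]^2 \\
(d_{24} - d'_{24})^2 &= [d_{24} - (x_2 + x_5 + x_4)]^2 \\
(d_{34} - d'_{34})^2 &= [d_{34} - (x_3 + x_4)]^2
\end{aligned}$$

$$SS = \sum_{i<j}^{n} (d_{ij} - d'_{ij})^2$$

Least-squares method: Find the xi values that minimize SS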


Xuhua Xia Slide 19

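Least-squares method

$$\begin{aligned}
SS = {} & [d_{12} - (x_1 + x_2)]^2 + [d_{13} - (x_1 + x_5 + x_3)]^2 + [d_{14} - (x_1 + x_5 + x_4)]^2 \\
& + [d_{23} - (x_2 + x_5 + x_3)]^2 + [d_{24} - (x_2 + x_5 + x_4)]^2 + [d_{34} - (x_3 + x_4)]^2
\end{aligned}$$

Taking the partial derivative of SS with respect to each xi, we have

$$\begin{aligned}
\frac{\partial SS}{\partial x_1} &= -2d_{12} + 6x_1 + 2x_2 - 2d_{13} + 4x_5 + 2x_3 - 2d_{14} + 2x_4 \\
\frac{\partial SS}{\partial x_2} &= -2d_{12} + 2x_1 + 6x_2 - 2d_{23} + 4x_5 + 2x_3 - 2d_{24} + 2x_4 \\
\frac{\partial SS}{\partial x_3} &= -2d_{13} + 2x_1 + 4x_5 + 6x_3 - 2d_{23} + 2x_2 - 2d_{34} + 2x_4 \\
\frac{\partial SS}{\partial x_4} &= -2d_{14} + 2x_1 + 4x_5 + 6x_4 - 2d_{24} + 2x_2 - 2d_{34} + 2x_3 \\
\frac{\partial SS}{\partial x_5} &= -2d_{13} + 4x_1 + 8x_5 + 4x_3 - 2d_{14} + 4x_4 - 2d_{23} + 4x_2 - 2d_{24}
\end{aligned}$$

Setting these partial derivatives to 0 and solving for xi, we have

$$\begin{aligned}
x_1 &= d_{13}/4 + d_{12}/2 - d_{23}/4 + d_{14}/4 - d_{24}/4 \\
x_2 &= d_{12}/2 - d_{13}/4 + d_{23}/4 - d_{14}/4 + d_{24}/4 \\
x_3 &= d_{13}/4 + d_{23}/4 + d_{34}/2 - d_{14}/4 - d_{24}/4 \\
x_4 &= d_{14}/4 - d_{13}/4 - d_{23}/4 + d_{34}/2 + d_{24}/4 \\
x_5 &= -d_{12}/2 + d_{23}/4 - d_{34}/2 + d_{14}/4 + d_{24}/4 + d_{13}/4
\end{aligned}$$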


Xuhua Xia Slide 20

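Least-squares method

$$\begin{aligned}
x_1 &= d_{13}/4 + d_{12}/2 - d_{23}/4 + d_{14}/4 - d_{24}/4 \\
x_2 &= d_{12}/2 - d_{13}/4 + d_{23}/4 - d_{14}/4 + d_{24}/4 \\
x_3 &= d_{13}/4 + d_{23}/4 + d_{34}/2 - d_{14}/4 - d_{24}/4 \\
x_4 &= d_{14}/4 - d_{13}/4 - d_{23}/4 + d_{34}/2 + d_{24}/4 \\
x_5 &= -d_{12}/2 + d_{23}/4 - d_{34}/2 + d_{14}/4 + d_{24}/4 + d_{13}/4
\end{aligned}$$

Observed distances:

4
Sp1
Sp2  0.3
Sp3  0.4  0.5
Sp4  0.4  0.6  0.6

Estimated branch lengths:

x1 = 0.075
x2 = 0.225
x3 = 0.275
x4 = 0.325
x5 = 0.025

[Figure: unrooted four-OTU tree with terminal branches x1 to x4 and internal branch x5]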
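Setting the partial derivatives of SS to zero amounts to solving the normal equations AᵀA x = Aᵀd, where each row of A marks the branches on the path between one pair of OTUs. The Python sketch below builds that system for the topology ((1,2),(3,4)) and solves it by Gauss-Jordan elimination. The names `paths`, `solve_linear` and `least_squares_branches` are illustrative; a numerical library would normally replace the hand-written solver.

```python
# Observed pairwise distances d12 ... d34 from the example matrix.
observed = {(1, 2): 0.3, (1, 3): 0.4, (1, 4): 0.4,
            (2, 3): 0.5, (2, 4): 0.6, (3, 4): 0.6}

# Branches on the path between each pair for topology ((1,2),(3,4)).
# Indices 0..4 stand for x1..x5; x5 (index 4) is the internal branch.
paths = {(1, 2): [0, 1], (1, 3): [0, 4, 2], (1, 4): [0, 4, 3],
         (2, 3): [1, 4, 2], (2, 4): [1, 4, 3], (3, 4): [2, 3]}

def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def least_squares_branches(observed, paths, n_branches=5):
    """Minimize SS = sum (d_ij - d'_ij)^2 via the normal equations."""
    pairs = sorted(observed)
    design = [[1.0 if k in paths[pair] else 0.0 for k in range(n_branches)]
              for pair in pairs]
    d = [observed[pair] for pair in pairs]
    ata = [[sum(row[i] * row[j] for row in design) for j in range(n_branches)]
           for i in range(n_branches)]
    atd = [sum(row[i] * di for row, di in zip(design, d))
           for i in range(n_branches)]
    return solve_linear(ata, atd)

x = least_squares_branches(observed, paths)
print([round(v, 3) for v in x])
```

The solution can be compared with the closed-form estimates x1 = 0.075, x2 = 0.225, x3 = 0.275, x4 = 0.325, x5 = 0.025. Changing `paths` is all that is needed to evaluate a different topology.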


Xuhua Xia Slide 21

Minimum Evolution Criterion

[Figure: the three possible unrooted topologies for four OTUs, each with terminal branches x1 to x4 and internal branch x5]

The minimum evolution (ME) criterion: The tree with the shortest TreeLen is the best tree.

$$TreeLen = \sum_{i=1}^{2n-3} x_i, \quad \text{where } n \text{ is the number of OTUs}$$
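As a sketch of how the ME criterion could be applied to four OTUs, the Python code below estimates the least-squares branch lengths for each of the three unrooted topologies and sums them to get TreeLen. It reuses the closed-form least-squares solution for a four-OTU tree with the OTU labels permuted; the function name `quartet_tree_length` and the example matrix (Sp1 to Sp4 from the least-squares illustration) are choices made for this example.

```python
def quartet_tree_length(dist, a, b, c, e):
    """TreeLen for the unrooted topology ((a,b),(c,e)).

    Branch lengths come from the closed-form least-squares solution
    for four OTUs, with labels 1, 2, 3, 4 replaced by a, b, c, e.
    """
    def d(i, j):
        return dist[frozenset((i, j))]

    xa = d(a, b) / 2 + d(a, c) / 4 + d(a, e) / 4 - d(b, c) / 4 - d(b, e) / 4
    xb = d(a, b) / 2 - d(a, c) / 4 - d(a, e) / 4 + d(b, c) / 4 + d(b, e) / 4
    xc = d(c, e) / 2 + d(a, c) / 4 + d(b, c) / 4 - d(a, e) / 4 - d(b, e) / 4
    xe = d(c, e) / 2 - d(a, c) / 4 - d(b, c) / 4 + d(a, e) / 4 + d(b, e) / 4
    # Internal branch; unconstrained least squares may make it negative.
    x5 = -d(a, b) / 2 - d(c, e) / 2 + (d(a, c) + d(a, e) + d(b, c) + d(b, e)) / 4
    return xa + xb + xc + xe + x5

raw = {(1, 2): 0.3, (1, 3): 0.4, (1, 4): 0.4,
       (2, 3): 0.5, (2, 4): 0.6, (3, 4): 0.6}
dist = {frozenset(pair): value for pair, value in raw.items()}

topologies = {
    "((1,2),(3,4))": (1, 2, 3, 4),
    "((1,3),(2,4))": (1, 3, 2, 4),
    "((1,4),(2,3))": (1, 4, 2, 3),
}
tree_lengths = {name: quartet_tree_length(dist, *otus)
                for name, otus in topologies.items()}

for name, length in tree_lengths.items():
    print(name, round(length, 3))
```

Under the ME criterion the topology with the smallest printed value would be preferred. For more than four OTUs the number of topologies grows very quickly, which is why programs such as FastME search tree space heuristically rather than enumerating every tree.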