


Description: discussion of the article "Bayes Estimators for Phylogenetic Reconstruction" (Syst. Biol. 60(4), 528–540, 2011, doi: 10.1093/sysbio/syr021), presented by Leo Martins to the Phylogenomics Lab of the University of Vigo.

Page 1: Journal Club @ UVigo 2011.07.22

Journal Club – Bayes Estimators for Phylogenetic Reconstruction

Syst. Biol. 60(4), 528–540, 2011, doi: 10.1093/sysbio/syr021

Leonardo de O. Martins

University of Vigo

July 22, 2011

Leo Martins (Univ. Vigo) Journal Club 22/7 1 / 12

Page 2

Outline

1 Distance as a penalty

2 Distances, everywhere

3 No phylogenetics, yet...

4 Trees as points in space

5 To the paper, then

Page 3

Statistical Risk

The risk ρ associated with a decision θ̂ is the expected loss of this decision (θ̂ can be, for instance, an estimate of θ).

ρ(θ̂) = ∫ L(θ, θ̂) P(θ | data) dθ

(promptly called the posterior expected loss)

The loss function L(θ, θ̂) is a penalty we pay for "deciding" away from the parameter. Examples are the squared loss and the absolute loss.

For some loss functions, we can calculate the best decision (i.e., the one that minimizes the risk, for any data).
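As an illustrative sketch (not from the paper), the posterior expected loss can be approximated from posterior samples; under the squared loss, the risk minimizer turns out to be the posterior mean. The Normal posterior below is a made-up stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(loc=2.0, scale=1.0, size=100_000)  # stand-in posterior sample

def risk(est, loss):
    # Monte Carlo estimate of the posterior expected loss rho(est)
    return loss(theta, est).mean()

sq = lambda t, e: (t - e) ** 2
grid = np.linspace(0.0, 4.0, 401)
best = grid[np.argmin([risk(e, sq) for e in grid])]
# the minimizer of the squared-loss risk is (close to) the posterior mean
print(best, theta.mean())
```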

Page 8

How to summarise a collection of objects?

scattered points

library(MASS)

x <- mvrnorm(n = 1000, mu = c(0, 0), Sigma = matrix(c(1, 0.9, 0.9, 1), 2, 2, byrow = TRUE))  # Sigma must be symmetric

plot(x[, 1], x[, 2], pch = ".", cex = 2, xlab = "x", ylab = "y")

Page 9

How to summarise a collection of objects?

centroid: minimizes the total squared distance to all points

Page 10

How to summarise a collection of objects?

regression line: minimizes the total squared vertical distance to all points
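A quick numerical check of the centroid claim (an illustration in Python, separate from the slides' R plotting code): the centroid is the point that minimizes the sum of squared Euclidean distances to a scatter, so nudging away from it can only increase the objective.

```python
import numpy as np

rng = np.random.default_rng(1)
pts = rng.normal(size=(1000, 2))  # a made-up 2-D scatter

def total_sq_dist(c):
    # sum of squared Euclidean distances from point c to every sample
    return ((pts - c) ** 2).sum()

centroid = pts.mean(axis=0)
# any displacement away from the centroid increases the objective
for delta in [(0.1, 0.0), (0.0, -0.1), (0.05, 0.05)]:
    assert total_sq_dist(centroid + np.array(delta)) > total_sq_dist(centroid)
```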

Page 12

How to summarise the posterior distribution P(X)?

Page 13

How to summarise the posterior distribution P(X)?

Posterior mean

Minimize the expected loss under a squared loss function

L(θ, θ̂) = (θ − θ̂)²

(squared Euclidean distance)

Page 14

How to summarise the posterior distribution P(X)?

Posterior median

Minimize the expected loss under a linear loss function

L(θ, θ̂) = |θ − θ̂|

(Manhattan distance)

Page 15

How to summarise the posterior distribution P(X)?

Posterior mode

a.k.a. the Maximum A Posteriori (MAP) estimate. Minimize the expected loss under a 0–1 (delta) loss function

L(θ, θ̂) = 0 if θ = θ̂, and 1 if θ ≠ θ̂
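These three summaries can be checked empirically (a sketch with a made-up skewed posterior, not data from the paper): minimizing the Monte Carlo risk over a grid recovers the posterior mean under squared loss and the posterior median under absolute loss.

```python
import numpy as np

rng = np.random.default_rng(2)
post = rng.gamma(shape=2.0, scale=1.0, size=50_000)  # skewed stand-in posterior

grid = np.linspace(0.0, 8.0, 801)
sq_risk = [((post - e) ** 2).mean() for e in grid]    # squared-loss risk
abs_risk = [np.abs(post - e).mean() for e in grid]    # absolute-loss risk

best_sq = grid[np.argmin(sq_risk)]
best_abs = grid[np.argmin(abs_risk)]
print(best_sq, post.mean())        # squared loss -> posterior mean
print(best_abs, np.median(post))   # absolute loss -> posterior median
```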

Page 17

Distances between trees

[Figure: two unrooted five-taxon trees, ((A,B),(C,D),E) and ((A,B),(D,E),C)]

Trees from the article

Page 18

Distances between trees


RF distance: the splits DE|ABC and CD|ABE each appear in only one of the trees, for a total of 2 branches.
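A minimal sketch of this computation: represent each unrooted tree by its set of nontrivial splits (here written as the smaller side of each bipartition), and the RF distance is the size of the symmetric difference.

```python
# RF distance as the symmetric difference of the trees' nontrivial splits.
# Splits are frozensets of the smaller side; taxa are A..E as in the figure.
tree1 = {frozenset("AB"), frozenset("CD")}   # ((A,B),(C,D),E)
tree2 = {frozenset("AB"), frozenset("DE")}   # ((A,B),(D,E),C)

rf = len(tree1 ^ tree2)   # splits present in exactly one tree
print(rf)  # CD|ABE and DE|ABC differ -> 2
```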

Page 19

Distances between trees


Quartet distance: the quartet {A,C,D,E} is AE|CD in one tree but AC|DE in the other, and {B,C,D,E} is BE|CD vs. BC|DE, so 2 of the 5 quartets are resolved differently.

Page 21

Distances between trees


Path difference (compares the number of speciation nodes along the path between each pair of leaves)

the path from A to E is one edge longer in one tree than in the other

(...)

summing over all leaf pairs, the overall difference is 6
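A sketch of the path-difference computation for the two example trees, counting edges between every pair of leaves (the internal node names u, v, w are illustrative, not from the paper):

```python
from collections import deque

# Unrooted 5-taxon trees as adjacency lists; u, v, w are internal nodes.
tree1 = {"u": ["A", "B", "w"], "v": ["C", "D", "w"], "w": ["u", "v", "E"],
         "A": ["u"], "B": ["u"], "C": ["v"], "D": ["v"], "E": ["w"]}
tree2 = {"u": ["A", "B", "w"], "v": ["D", "E", "w"], "w": ["u", "v", "C"],
         "A": ["u"], "B": ["u"], "C": ["w"], "D": ["v"], "E": ["v"]}

def path_lengths(adj, leaves):
    # BFS from each leaf gives the number of edges between every leaf pair
    d = {}
    for s in leaves:
        dist = {s: 0}
        q = deque([s])
        while q:
            n = q.popleft()
            for m in adj[n]:
                if m not in dist:
                    dist[m] = dist[n] + 1
                    q.append(m)
        for t in leaves:
            if s < t:
                d[(s, t)] = dist[t]
    return d

leaves = list("ABCDE")
d1, d2 = path_lengths(tree1, leaves), path_lengths(tree2, leaves)
total = sum(abs(d1[p] - d2[p]) for p in d1)
print(total)  # overall difference between the two example trees -> 6
```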

Page 23

If there is a distance, there is a Bayes estimator

For points in Rⁿ, we know that the mean minimizes the expected squared Euclidean distance, etc.

For phylogenies:

there are several Euclidean distances

the mean does not work, since a tree has restrictions

But some distances between trees also lead to "analytical" solutions:

the consensus tree minimizes the Robinson–Foulds distance to the samples

quartet puzzling minimizes the quartet distance

the Buneman tree minimizes (I think) the dissimilarity map distance

some of these are hard to solve as well
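A minimal sketch of the consensus-tree case (a toy five-taxon sample, not data from the paper): keeping every split that occurs in more than half of the sampled trees gives the majority-rule consensus, which minimizes the total RF distance to the sample.

```python
from collections import Counter

# Posterior sample of topologies, each given by its set of nontrivial splits
# (frozensets of the smaller side of each bipartition; toy example).
sample = [
    {frozenset("AB"), frozenset("CD")},
    {frozenset("AB"), frozenset("CD")},
    {frozenset("AB"), frozenset("DE")},
]

counts = Counter(s for tree in sample for s in tree)
# majority-rule consensus: keep splits occurring in > 50% of the samples;
# this choice minimizes the total RF distance to the sample
consensus = {s for s, c in counts.items() if c > len(sample) / 2}
print(sorted("".join(sorted(s)) for s in consensus))
```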

Page 29

How do they find, then, the Bayes estimates?

like much other software: hill-climbing in the space of possible topologies

the input data is the posterior sample of trees from MrBayes

the starting tree can be NJ, the MAP tree, ML...

apply branch swaps (NNI) to the current optimal tree, then verify the distance to all samples

the distance used is the path difference (matrix subtraction)

no need to recalculate the distance to all samples, just to a matrix with the average values
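The bookkeeping trick in the last bullet can be checked numerically (a sketch with made-up matrices, assuming a squared path-difference loss): the total squared distance from a candidate tree's path-length matrix to all sampled matrices equals, up to a constant, the distance to the average matrix, so the search loop only ever compares against one matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
# path-length matrices of n sampled trees, flattened to vectors (toy numbers)
samples = rng.integers(2, 5, size=(100, 10)).astype(float)
d_bar = samples.mean(axis=0)

def total_sq(d):
    # objective the search minimizes: sum of squared path differences
    return ((samples - d) ** 2).sum()

def via_mean(d):
    # same objective via the average matrix:
    # sum_i ||d - d_i||^2 = n * ||d - d_bar||^2 + sum_i ||d_i - d_bar||^2
    n = len(samples)
    const = ((samples - d_bar) ** 2).sum()
    return n * ((d - d_bar) ** 2).sum() + const

cand = rng.integers(2, 5, size=10).astype(float)  # a candidate tree's matrix
print(np.isclose(total_sq(cand), via_mean(cand)))  # True
```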
