Week 7: Bayesian inference, Testing trees, Bootstraps
Genome 570
February, 2012
Bayes’ Theorem
Conditional probability of hypothesis given data is:

$$\mathrm{Prob}(H \mid D) = \frac{\mathrm{Prob}(H \,\&\, D)}{\mathrm{Prob}(D)}$$

Since Prob(H & D) = Prob(H) Prob(D | H), substituting this in:

$$\mathrm{Prob}(H \mid D) = \frac{\mathrm{Prob}(H)\,\mathrm{Prob}(D \mid H)}{\mathrm{Prob}(D)}$$

The denominator Prob(D) is the sum of the numerators over all possible hypotheses H:

$$\mathrm{Prob}(H \mid D) = \frac{\mathrm{Prob}(H)\,\mathrm{Prob}(D \mid H)}{\sum_H \mathrm{Prob}(H)\,\mathrm{Prob}(D \mid H)}$$
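A minimal numeric sketch of this normalization, with three made-up hypotheses and likelihoods (the numbers are invented for illustration):

```python
# Bayes' Theorem over a discrete set of hypotheses.
priors = {"H1": 0.5, "H2": 0.3, "H3": 0.2}
likelihoods = {"H1": 0.10, "H2": 0.40, "H3": 0.25}   # Prob(D | H)

numerators = {h: priors[h] * likelihoods[h] for h in priors}
prob_D = sum(numerators.values())                    # Prob(D), the denominator
posterior = {h: numerators[h] / prob_D for h in priors}

print(posterior)   # the posteriors sum to 1
```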
A visual example of Bayes’ Theorem
[Figure: bars for the prior probabilities P(H1), P(H2), P(H3) are each scaled by the corresponding likelihood P(D | H1), P(D | H2), P(D | H3); the posterior probability, e.g. P(H2 | D), is that term divided by the sum of all three.]
A dramatic, if not real, example
Example: “Space probe photos show no Little Green Men on Mars!”
[Figure: prior, likelihood, and posterior bars for the hypotheses "Little Green Men: yes" and "no", shown both for an optimist's prior and for a pessimist's prior.]
Calculations for that example
Using the Odds Ratio form of Bayes' Theorem:

$$\underbrace{\frac{\mathrm{Prob}(H_1 \mid D)}{\mathrm{Prob}(H_2 \mid D)}}_{\text{posterior odds ratio}} \;=\; \underbrace{\frac{\mathrm{Prob}(D \mid H_1)}{\mathrm{Prob}(D \mid H_2)}}_{\text{likelihood ratio}} \;\times\; \underbrace{\frac{\mathrm{Prob}(H_1)}{\mathrm{Prob}(H_2)}}_{\text{prior odds ratio}}$$

For the odds favoring their existence, the calculation is, for the optimist about Little Green Men (prior odds 4 : 1):

$$\frac{4}{1} \times \frac{1/3}{1} \;=\; \frac{4/3}{1} \;=\; 4 : 3$$

While for the pessimist (prior odds 1 : 4) it is

$$\frac{1}{4} \times \frac{1/3}{1} \;=\; \frac{1/12}{1} \;=\; 1 : 12$$
With repeated observation the prior matters less
If we send 5 space probes, and all fail to see LGMs, then since the probability of this observation is (1/3)^5 if there are LGMs, and 1 if there aren't, we get for the optimist about Little Green Men:

$$\frac{4}{1} \times \frac{(1/3)^5}{1} \;=\; \frac{4/243}{1} \;=\; 4 : 243$$

while for the pessimist about Little Green Men:

$$\frac{1}{4} \times \frac{(1/3)^5}{1} \;=\; \frac{1/972}{1} \;=\; 1 : 972$$
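Both slides' arithmetic can be checked with a small sketch of the odds-update rule:

```python
from fractions import Fraction

def posterior_odds(prior_odds, likelihood_ratio, n_observations=1):
    """Odds-ratio form of Bayes' Theorem, for repeated independent observations."""
    return prior_odds * likelihood_ratio ** n_observations

# Likelihood ratio for "a probe sees no LGMs": (1/3 if they exist) / (1 if not).
lr = Fraction(1, 3)
print(posterior_odds(Fraction(4, 1), lr))      # optimist, one probe: 4/3, i.e. 4 : 3
print(posterior_odds(Fraction(1, 4), lr))      # pessimist, one probe: 1/12
print(posterior_odds(Fraction(4, 1), lr, 5))   # optimist, five probes: 4/243
print(posterior_odds(Fraction(1, 4), lr, 5))   # pessimist, five probes: 1/972
```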
A coin tossing example
[Figure: the prior density on p (flat on 0 ≤ p ≤ 1), the likelihood function, and the resulting posterior density, shown side by side for 11 tosses with 5 heads and for 44 tosses with 20 heads.]
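A sketch of the corresponding computation: with a flat prior, the posterior for p after h heads in n tosses is the Beta(h + 1, n − h + 1) distribution (this Beta form is a standard fact, not stated on the slide):

```python
from scipy.stats import beta

# Posterior for a coin's heads probability p under a flat (Beta(1, 1)) prior.
for n, heads in [(11, 5), (44, 20)]:
    post = beta(heads + 1, n - heads + 1)
    lo, hi = post.interval(0.95)
    print(f"{n} tosses, {heads} heads: posterior mean {post.mean():.3f}, "
          f"95% interval ({lo:.3f}, {hi:.3f})")   # the interval narrows with more data
```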
Markov Chain Monte Carlo sampling
To draw trees from a distribution whose probabilities are proportional to f(t), we can use the Metropolis algorithm:

1. Start at some tree. Call this Ti.
2. Pick a tree that is a neighbor of this tree in the graph of trees. Call this the proposal Tj.
3. Compute the ratio of the probabilities (or probability density functions) of the proposed new tree and the old tree: R = f(Tj) / f(Ti)
4. If R ≥ 1, accept the new tree as the current tree.
5. If R < 1, draw a uniform random number (a random fraction between 0 and 1). If it is less than R, accept the new tree as the current tree.
6. Otherwise reject the new tree and continue with tree Ti as the current tree.
7. Return to step 2.
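A generic sketch of steps 1-7, with a symmetric `neighbors` function standing in for the graph of trees (here a toy ring of 11 states, not trees):

```python
import random

def metropolis(f, neighbors, state, n_steps):
    """Sample states with probability proportional to f (steps 1-7 above).
    `neighbors(state)` must propose symmetrically for plain Metropolis."""
    samples = []
    for _ in range(n_steps):
        proposal = random.choice(neighbors(state))     # step 2
        R = f(proposal) / f(state)                     # step 3
        if R >= 1 or random.random() < R:              # steps 4 and 5
            state = proposal                           # accept
        samples.append(state)                          # step 6: else keep state
    return samples

# Toy usage: states 0..10 on a ring, target probabilities proportional to i + 1.
chain = metropolis(lambda i: i + 1,
                   lambda i: [(i - 1) % 11, (i + 1) % 11],
                   state=5, n_steps=100_000)
# The long-run frequency of state i approaches (i + 1) / 66.
```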
Reversibility (again)
[Figure: two states i and j, with equilibrium probabilities πi and πj and transition probabilities Pij and Pji; reversibility is the condition πi Pij = πj Pji.]
Does it achieve the desired equilibrium distribution?
If f(Ti) > f(Tj), then Prob(Ti | Tj) = 1 and Prob(Tj | Ti) = f(Tj)/f(Ti), so that

$$\frac{\mathrm{Prob}(T_j \mid T_i)}{\mathrm{Prob}(T_i \mid T_j)} = \frac{f(T_j)}{f(T_i)}$$

(the same formula can be shown to hold when f(Ti) < f(Tj)). Then in both cases

$$f(T_i)\,\mathrm{Prob}(T_j \mid T_i) \;=\; \mathrm{Prob}(T_i \mid T_j)\,f(T_j)$$

Summing over all Ti, the right-hand side sums up to f(Tj), so

$$\sum_{T_i} f(T_i)\,\mathrm{Prob}(T_j \mid T_i) \;=\; f(T_j)$$

This shows that, applying this algorithm, if we start in the desired distribution we stay in it. It is also true under most conditions that if you start in any other distribution, you converge to this one.
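This stationarity can be checked numerically on a toy chain (a three-state target with a symmetric proposal; the numbers are arbitrary):

```python
import numpy as np

f = np.array([1.0, 2.0, 3.0])        # unnormalized target
n = len(f)
P = np.zeros((n, n))                 # Metropolis transition matrix
for i in range(n):
    for j in range(n):
        if i != j:
            # propose j with probability 1/(n-1), accept with min(1, f_j/f_i)
            P[i, j] = (1.0 / (n - 1)) * min(1.0, f[j] / f[i])
    P[i, i] = 1.0 - P[i].sum()       # rejected proposals stay put

pi = f / f.sum()
print(np.allclose(pi @ P, pi))       # True: pi is preserved, as the sum shows
```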
Bayesian MCMC
We try to achieve the posterior

$$\mathrm{Prob}(T \mid D) \;=\; \mathrm{Prob}(T)\,\mathrm{Prob}(D \mid T)\,/\,(\text{denominator})$$

and this turns out to need the acceptance ratio

$$R \;=\; \frac{\mathrm{Prob}(T_{\text{new}})\,\mathrm{Prob}(D \mid T_{\text{new}})}{\mathrm{Prob}(T_{\text{old}})\,\mathrm{Prob}(D \mid T_{\text{old}})}$$

(The denominators are the same and cancel out. This is a great convenience, as we often cannot evaluate the denominator, but we can usually evaluate the numerators.)

Note that we could also have a prior on model parameters too, and as we move through tree space we could also be moving through parameter space.
Mau and Newton’s proposal mechanism
[Figure: six redrawings of a clocklike tree on species A through F, illustrating Mau and Newton's proposal mechanism.]
Using MrBayes on the primates data
[Figure: majority-rule consensus tree from MrBayes for the 14-species primate mitochondrial DNA data (Bovine, Mouse, Lemur, Tarsier, Squirrel Monkey, the four macaques, Gibbon, Orang, Gorilla, Chimp, Human), with posterior clade probabilities from 0.416 to 1.00 on the branches.]

Frequencies of partitions (posterior clade probabilities).
Issues to think about with Bayesian inference
Where do you get your prior from?

Are you assuming each branch has a length drawn independently from a distribution? How wide a distribution?

Or is the tree drawn from a birth-death process? If so, what are the rates of birth and death? Or is there also a stage where the species studied are selected from all extant species of the group? How do you model that? Are you modelling biologists' decision-making processes?

Is your prior the same as your reader's prior? If not, what do you do? Use several priors? Just give the reader the likelihood surface and let them provide their own prior?
An example
Suppose we have two species with a Jukes-Cantor model, so that the estimation of the (unrooted) tree is simply the estimation of the branch length between the two species.

We can express the result either as branch length t, or as the net probability of base change

$$p \;=\; \frac{3}{4}\left(1 - e^{-\frac{4}{3}t}\right)$$
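A small sketch of the two parameterizations, with the inverse transform (the inverse and the Jacobian below are derived from the formula, not stated on the slide):

```python
import math

def p_from_t(t):
    """Net probability of base change for branch length t (Jukes-Cantor)."""
    return 0.75 * (1.0 - math.exp(-4.0 * t / 3.0))

def t_from_p(p):
    """Inverse: branch length from the net probability of change (p < 3/4)."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# A flat prior on p corresponds to a density on t proportional to the
# Jacobian dp/dt = exp(-4 t / 3), which is why the next slides' two
# "flat" priors disagree.
print(p_from_t(0.383112))   # ~0.3
print(t_from_p(0.3))        # ~0.383112
```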
A flat prior on p
[Figure: a flat prior density on p, on the interval from 0 to 3/4.]
The corresponding prior on t
[Figure: the corresponding prior density on t, over t from 0 to 5; it is proportional to exp(−4t/3) and so decreases with t.]

So which is the "flat prior"?
Flat prior for t between 0 and 5
[Figure: a flat prior density on t, on the interval from 0 to 5.]
The corresponding prior on p
[Figure: the corresponding prior density on p, which rises steeply as p approaches 3/4.]
The invariance of the ML estimator to scale change
[Figure: likelihood curve for t when 3 sites differ out of 10 (ln L from −14 to −6, t from 0 to 5); the peak is at t̂ = 0.383112, corresponding to p̂ = 0.3.]
The invariance of the ML estimator to scale change
[Figure: likelihood curve for p when 3 sites differ out of 10 (ln L from −14 to −6, p from 0 to 0.8); the peak is at p̂ = 0.3, corresponding to t̂ = 0.383112.]
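A short check of this invariance for the binomial likelihood sketched above (3 differing sites out of 10):

```python
import math
from scipy.optimize import minimize_scalar

def negloglik_p(p, k=3, n=10):
    """Binomial negative log-likelihood for p when k of n sites differ."""
    return -(k * math.log(p) + (n - k) * math.log(1.0 - p))

res = minimize_scalar(negloglik_p, bounds=(1e-9, 0.75 - 1e-9), method="bounded")
p_hat = res.x                                        # ~0.3
t_hat = -0.75 * math.log(1.0 - 4.0 * p_hat / 3.0)    # ~0.383112
print(p_hat, t_hat)   # maximizing over p or over t gives the same peak
```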
When t has a wide flat prior
[Figure: the 95% two-tailed credible interval for t, and the MLE, as the truncation point T of a flat prior on (0, T) is varied from 1 to 1000 (log scale); as T grows, the interval moves away from the MLE.]
What is going on in that case is ...
[Figure: the posterior of t under a flat prior truncated at T; T is so large that the region near the likelihood peak contains less than 2.5% of the area.]
The likelihood curve is nearly a normal distribution for large amounts of data

[Figure: likelihood curves for θ estimated from t(x), the "sufficient statistic", with the value for our data set marked.]

If we have large amounts of data, the values of parameters we need to try are all very similar, and the shape of the distribution (which is nearly normal) will not be too different for these values.
Curvatures and covariances of ML estimates
ML estimates have covariances computable from curvatures of the expected log-likelihood:

$$\mathrm{Var}\left[\hat\theta\right] \;\simeq\; -1 \Big/ \left(\frac{d^2\, E[\ln L]}{d\theta^2}\right)$$

The same is true when there are multiple parameters:

$$\mathrm{Var}\left[\hat{\boldsymbol\theta}\right] \;\simeq\; V \;\simeq\; -\,C^{-1}$$

where

$$C_{ij} \;=\; E\left(\frac{\partial^2 \ln L}{\partial\theta_i\,\partial\theta_j}\right)$$
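A numeric sketch for the binomial example above, with a finite-difference second derivative standing in for the curvature:

```python
import math

k, n = 3, 10                    # 3 of 10 sites differ
p_hat = k / n

def loglik(p):
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

h = 1e-5                        # central second difference approximates d2 lnL / dp2
curv = (loglik(p_hat + h) - 2 * loglik(p_hat) + loglik(p_hat - h)) / h**2
print(-1.0 / curv)              # ~0.021
print(p_hat * (1 - p_hat) / n)  # the analytic value p(1-p)/n, for comparison
```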
With large amounts of data, asymptotically
When the true value of θ is θ0,

$$\frac{\hat\theta - \theta_0}{\sqrt{v}} \;\sim\; N(0, 1)$$

Since 1/v is the negative of the curvature of the log-likelihood:

$$\ln L(\theta_0) \;=\; \ln L(\hat\theta) \;-\; \frac{1}{2}\,\frac{(\theta_0 - \hat\theta)^2}{v}$$

so that twice the difference of log-likelihoods is the square of a normal:

$$2\left(\ln L(\hat\theta) - \ln L(\theta_0)\right) \;\sim\; \chi^2_1$$
Corresponding results for multiple parameters
$$\ln L(\theta_0) \;\simeq\; \ln L(\hat\theta) \;-\; \frac{1}{2}(\theta_0 - \hat\theta)^T\, C\, (\theta_0 - \hat\theta)$$

$$(\hat\theta - \theta_0)^T\, C\, (\hat\theta - \theta_0) \;\sim\; \chi^2_p$$

so that the log-likelihood difference is:

$$2\left(\ln L(\hat\theta) - \ln L(\theta_0)\right) \;\sim\; \chi^2_p$$

When q of the p parameters are constrained:

$$2\left(\ln L(\hat\theta) - \ln L(\theta_0)\right) \;\sim\; \chi^2_q$$
Likelihood ratio interval for a parameter
[Figure: ln L (from −2640 to −2620) plotted against the transition/transversion ratio, on a log scale from 5 to 200.]

Inferring the transition/transversion ratio for an F84 model with the 14-species primate mitochondria data set.
LRT of a molecular clock – how many parameters?
[Figure: a five-species tree for A, B, C, D, E with branch lengths v1 through v8.]

Constraints for a clock:

$$v_1 = v_2 \qquad v_4 = v_5 \qquad v_3 + v_7 = v_4 + v_8 \qquad v_1 + v_6 = v_3$$

How does each equation constrain the branch lengths in the unrooted tree? What about the red equation?
Likelihood Ratio Test for a molecular clock
Using the 7-species mitochondrial DNA data set (the great apes plus Bovine and Mouse), we get with Ts/Tn = 30 and an F84 model:

Tree         ln L
No clock     −1372.77620
Clock        −1414.45053
Difference      41.67473

Chi-square statistic: 2 × 41.675 = 83.35, with n − 2 = 5 degrees of freedom: highly significant.
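The significance computation, as a short check with scipy:

```python
from scipy.stats import chi2

lnL_noclock, lnL_clock = -1372.77620, -1414.45053
stat = 2.0 * (lnL_noclock - lnL_clock)   # 83.35
df = 7 - 2                               # n - 2 degrees of freedom for n = 7 species
print(stat, chi2.sf(stat, df))           # p is about 2e-16: highly significant
```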
Model selection using the LRT
[Figure: a lattice of models (Jukes-Cantor; K2P with T = 2; K2P with T estimated; F81; F84 with T = 2; F84 with T estimated), arranged by their numbers of parameters, 25 through 29, with lines showing the possible likelihood ratio tests between nested models.]

The problem with using likelihood ratio tests is the multiplicity of tests and the multiple routes to the same hypotheses.
The Akaike Information Criterion
Compare between hypotheses using −2 ln L + 2p (the same as reducing the log-likelihood by the number of parameters):

Model             ln L           Number of parameters   AIC
Jukes-Cantor      −3068.29186    25                     6186.58
K2P, R = 2.0      −2953.15830    25                     5956.32
K2P, R̂ = 1.889    −2952.94264    26                     5957.89
F81               −2935.25430    28                     5926.51
F84, R = 2.0      −2680.32982    28                     5416.66
F84, R̂ = 28.95    −2616.39810    29                     5290.80
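The same comparison as a short computation:

```python
# AIC = -2 ln L + 2 p for each model in the table.
models = {
    "Jukes-Cantor":   (-3068.29186, 25),
    "K2P, R = 2.0":   (-2953.15830, 25),
    "K2P, R = 1.889": (-2952.94264, 26),
    "F81":            (-2935.25430, 28),
    "F84, R = 2.0":   (-2680.32982, 28),
    "F84, R = 28.95": (-2616.39810, 29),
}
for name, (lnL, p) in models.items():
    print(f"{name:15s} AIC = {-2 * lnL + 2 * p:8.2f}")
# F84 with R estimated has the smallest AIC (5290.80) and is preferred.
```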
Can we test trees using the LRT?
If so, how many degrees of freedom for the comparison of the two peaks?

[Figure: profile log-likelihood curves (ln L from −200 to −192) for the two alternative three-species clocklike trees, each plotted against its interior branch length t.]

These are three-species clocklike trees (shown here plotted in a "profile log-likelihood plot", plotting the highest likelihood for each value of the interior branch length).
The bootstrap

[Figure: the (unknown) true value of θ and the (unknown) true distribution; a sample of 150 data points gives an estimate of θ and the empirical distribution of the sample; bootstrap replicates (each 150 draws from the empirical distribution) give the distribution of estimates of parameters.]

An example with mixed normal distributions. Draw from the empirical distribution 150 times if there are 150 data points. With replacement!
The bootstrap for phylogenies
[Figure: the original data matrix (sequences × sites) yields an estimate of the tree; bootstrap samples #1, #2, and so on are formed by sampling the same number of sites, with replacement, and each yields its own bootstrap estimate of the tree.]

Drawing columns of the data matrix, with replacement.
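A generic sketch of the column resampling, with a hypothetical `estimate_tree` function standing in for the tree-inference step:

```python
import random

def bootstrap_trees(alignment, estimate_tree, n_replicates=100):
    """Bootstrap an alignment (a list of equal-length sequence strings).
    `estimate_tree` is a hypothetical function from an alignment to a tree."""
    n_sites = len(alignment[0])
    replicates = []
    for _ in range(n_replicates):
        # Sample site indices with replacement: columns of the data matrix.
        cols = [random.randrange(n_sites) for _ in range(n_sites)]
        resampled = ["".join(seq[c] for c in cols) for seq in alignment]
        replicates.append(estimate_tree(resampled))
    return replicates
```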
A partition defined by a branch in the first tree
[Figure: the five bootstrap trees on species A through F; the branch of the first tree defining the partition ACEF | BD is highlighted, and its count in the table of partitions is set to 1.]
Another partition from the first tree
[Figure: a second branch of the first tree is highlighted, and its partition is tallied.]
The third partition from that tree
[Figure: the third branch of the first tree is tallied; each of the first tree's three partitions now has a count of 1.]
Partitions from the second tree
[Figure: the three partitions of the second tree (AC | BDEF, ACE | BDF, ABCE | DF) are added to the tally; ACE | BDF has now been seen twice.]
Partitions from the third tree
[Figure: the three partitions of the third tree (AE | BCDF, AEF | BCD, ADEF | BC) are added to the tally.]
Partitions from the fourth tree
[Figure: the three partitions of the fourth tree (AE | BCDF, ADEF | BC, ABCE | DF) are added; AE | BCDF has now been seen three times.]
Partitions from the fifth tree
[Figure: the three partitions of the fifth tree (ACE | BDF, ABDF | EC, ABCE | DF) complete the tally.]
The table of partitions from all trees
[Figure: the five bootstrap trees on species A through F.]

How many times each partition of species is found:

AE | BCDF     3
ACE | BDF     3
ACEF | BD     1
AC | BDEF     1
AEF | BCD     1
ADEF | BC     2
ABDF | EC     1
ABCE | DF     3
The majority-rule consensus tree
[Figure: the five bootstrap trees and the table of partition counts above; the majority-rule consensus tree contains the partitions found in more than half of the trees (AE | BCDF, ACE | BDF, and ABCE | DF, each in 3 of the 5 trees), with 60 (percent) marked on each of those branches.]
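A sketch of the tally and the majority rule; each tree is encoded by the three partitions read off the running tallies on the preceding slides (a partition is named by the species on one side of the branch):

```python
from collections import Counter

SPECIES = set("ABCDEF")

trees = [
    [{"A","E"}, {"A","C","E"}, {"A","C","E","F"}],            # tree 1
    [{"A","C"}, {"A","C","E"}, {"A","B","C","E"}],            # tree 2
    [{"A","E"}, {"A","E","F"}, {"A","D","E","F"}],            # tree 3
    [{"A","E"}, {"A","D","E","F"}, {"A","B","C","E"}],        # tree 4
    [{"A","C","E"}, {"A","B","D","F"}, {"A","B","C","E"}],    # tree 5
]

counts = Counter()
for tree in trees:
    for side in tree:
        name = "".join(sorted(side)) + " | " + "".join(sorted(SPECIES - side))
        counts[name] += 1

for partition, k in counts.most_common():
    majority = "<- in the majority-rule tree" if k > len(trees) / 2 else ""
    print(partition, k, majority)
```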
How do we know that the MR consensus tree will be a tree?
Suppose that for each partition in a tree we construct a (fake) morphological character with 0 for one set in the partition, 1 for the other.

Such a character is compatible with a tree if (and only if) the tree contains that partition.

If two of these characters both occur in more than 50% of the trees, they must co-occur in at least one tree.

Thus the set of these "characters" that occur in more than 50% of the trees are all pairwise compatible.

By the Pairwise Compatibility Theorem (remember that?) they must then be jointly compatible.

So there must be a tree that contains them all.
The MR tree with 14-species primate mtDNA data
[Figure: the majority-rule consensus tree of the bootstrap estimates for the 14-species primate mtDNA data, with the percentage of bootstrap samples containing each branch (values from 35 to 100) marked on it.]
Potential problems with the bootstrap
1. Sites may not evolve independently.
2. Sites may not come from a common distribution (but we can consider them sampled from a mixture of possible distributions).
3. If we do not know at the outset which branch is of interest, a "multiple-tests" problem means P values are overstated.
4. P values are biased (too conservative).
5. Bootstrapping does not correct biases in phylogeny methods.
Other resampling methods
Delete-half jackknife: sample a random 50% of the sites, without replacement.

Delete-1/e jackknife (Farris et al. 1996) (too little deletion from a statistical viewpoint).

Reweighting characters by choosing weights from an exponential distribution.

In fact, reweighting them by any exchangeable weights having coefficient of variation of 1.

Parametric bootstrap: simulate data sets of this size assuming the estimate of the tree is the truth.

Block-bootstrapping: sample n/b blocks of b adjacent sites (to correct for correlation among adjacent sites) (Künsch, 1989).
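The delete-half jackknife resampling step, in the same style as the bootstrap sketch above:

```python
import random

def delete_half_jackknife(alignment):
    """Keep a random 50% of the sites, sampled without replacement."""
    n_sites = len(alignment[0])
    keep = sorted(random.sample(range(n_sites), n_sites // 2))
    return ["".join(seq[c] for c in keep) for seq in alignment]
```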
With the delete-half jackknife
[Figure: the corresponding majority-rule consensus tree for the 14-species primate mtDNA data under the delete-half jackknife, with support values from 32 to 100 on the branches.]
Bootstrap versus jackknife in a simple case
Exact computation of the effects of deletion fraction for the jackknife.

[Figure: n characters, of which n1 support group 1 and n2 support group 2 (two conflicting groups), are resampled to give m1 and m2 supporting characters; for the jackknife, n(1 − δ) characters are kept.]

We can compute for various n's the probabilities of getting more evidence for group 1 than for group 2. A typical result is for n1 = 10, n2 = 8, n = 100:

                                   Bootstrap   Jackknife, δ = 1/2   Jackknife, δ = 1/e
Prob(m1 > m2)                      0.6384      0.5923               0.6441
Prob(m1 ≥ m2)                      0.7230      0.7587               0.8040
Prob(m1 > m2) + ½ Prob(m1 = m2)    0.6807      0.6755               0.7240
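A quick simulation that approximates the bootstrap column (the slide's values are exact computations; this is only a Monte Carlo check):

```python
import random

def one_bootstrap(n=100, n1=10, n2=8):
    """Resample n characters with replacement; the first n1 support group 1,
    the next n2 support group 2. Returns (m1, m2)."""
    draws = [random.randrange(n) for _ in range(n)]
    m1 = sum(d < n1 for d in draws)
    m2 = sum(n1 <= d < n1 + n2 for d in draws)
    return m1, m2

reps, gt, eq = 200_000, 0, 0
for _ in range(reps):
    m1, m2 = one_bootstrap()
    gt += m1 > m2
    eq += m1 == m2
print(gt / reps)                    # approximately 0.6384
print(gt / reps + 0.5 * eq / reps)  # approximately 0.6807
```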
Probability of a character being omitted from a bootstrap
N    (1 − 1/N)^N      N    (1 − 1/N)^N      N    (1 − 1/N)^N      N     (1 − 1/N)^N
1    0                11   0.35049          25   0.36040          100   0.36603
2    0.25             12   0.35200          30   0.36166          150   0.36665
3    0.29630          13   0.35326          35   0.36256          200   0.36696
4    0.31641          14   0.35434          40   0.36323          250   0.36714
5    0.32768          15   0.35526          45   0.36375          300   0.36727
6    0.33490          16   0.35607          50   0.36417          500   0.36751
7    0.33992          17   0.35679          60   0.36479          1000  0.36770
8    0.34361          18   0.35742          70   0.36524          ∞     0.36788
9    0.34644          19   0.35798          80   0.36557
10   0.34868          20   0.35849          90   0.36583
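The entries and their limit of 1/e, as a quick check:

```python
import math

# Probability that a given site is never drawn in a bootstrap sample of N sites.
for N in (2, 10, 100, 1000):
    print(N, (1 - 1 / N) ** N)
print("limit 1/e =", math.exp(-1))   # 0.36788
```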
A toy example to examine bias of P values
[Figure: the distribution of individual values of x, the true distribution of sample means, and estimated distributions of sample means; the true value of the mean lies above 0, and the regions below and above 0 are labelled "Topology" II and "Topology" I.]

Assuming a normal distribution, trying to infer whether the mean is above 0, when the mean is unknown and the variance is known to be 1.
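A simulation sketch of this toy example; the details below (estimating P as the probability that the mean is above 0, under a normal centered at the sample mean) are our reading of the figure:

```python
import math
import random

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, n, reps = 0.1, 100, 20_000
true_P = phi(mu * math.sqrt(n))   # Prob(sample mean > 0) under the truth

# Each replicate observes a sample mean, then estimates P the same way.
est = [phi(random.gauss(mu, 1 / math.sqrt(n)) * math.sqrt(n)) for _ in range(reps)]
print(true_P, sum(est) / reps)    # the average estimated P is less extreme
```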
Bias in the P values
[Figure: distributions of the estimate of the "phylogeny", split at 0 into topology II and topology I, with the true mean marked; P is the tail area beyond 0. Note that the true P is more extreme than the average of the P's.]
How much bias in the P values?
[Figure: average P plotted against true P, both from 0 to 1; the curve is pulled toward 1/2, so extreme true P's correspond to less extreme average P's.]
Bias in the P values with different priors
[Figure: probability of the correct topology plotted against the P for the expectation of μ, for priors with n σ² = 0.1, 1.0, and 2.0.]
The parametric bootstrap
[Figure: the original data give an estimate of the tree; computer simulation on that estimate produces data sets #1, #2, #3, ..., #100, and estimation of the tree from each yields trees T1, T2, T3, ..., T100.]
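A sketch of that loop, with hypothetical `simulate_data` and `estimate_tree` functions standing in for the simulation and inference steps:

```python
def parametric_bootstrap(data, estimate_tree, simulate_data, n_replicates=100):
    """Simulate data sets of the same size under the estimated tree (taken
    as the truth), then re-estimate a tree from each. `estimate_tree` and
    `simulate_data` are hypothetical stand-ins for real implementations."""
    tree_hat = estimate_tree(data)
    n_sites = len(data[0])
    return [estimate_tree(simulate_data(tree_hat, n_sites))
            for _ in range(n_replicates)]
```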
The parametric bootstrap with the primates data
[Figure: the tree for the 14-species primate mtDNA data with parametric bootstrap support values from 78 to 98 on the branches.]
Goldman’s test using simulation
(related to the "parametric bootstrap")

[Figure: from the data we estimate a clocklike tree Tc with log-likelihood ℓc and a nonclocklike tree T with log-likelihood ℓ, and compute 2(ℓ − ℓc); simulating data sets under the clocklike estimate, and estimating clocklike and nonclocklike trees from each simulated data set, gives the null distribution of 2(ℓ − ℓc).]
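A sketch of the whole test, again with hypothetical fitting and simulation functions:

```python
def goldman_test(data, fit_clock, fit_noclock, simulate_data, n_replicates=100):
    """Simulation-based likelihood ratio test of the clock, as sketched above.
    fit_clock / fit_noclock return (tree, log-likelihood); simulate_data
    generates a data set of a given size under a tree. All are hypothetical."""
    tree_c, lnL_c = fit_clock(data)
    _, lnL = fit_noclock(data)
    observed = 2.0 * (lnL - lnL_c)
    null = []
    for _ in range(n_replicates):
        sim = simulate_data(tree_c, len(data[0]))
        _, l_c = fit_clock(sim)
        _, l = fit_noclock(sim)
        null.append(2.0 * (l - l_c))
    # P value: fraction of simulated statistics at least as large as observed.
    return observed, sum(s >= observed for s in null) / n_replicates
```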