View
3
Download
0
Category
Preview:
Citation preview
Week 3: Parsimony variants, compatibility, statistics andparsimony
Genome 570
January, 2012
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.1/46
Camin-Sokal parsimony
1 0 1
0
0 1 0
0 1
1
1
4 changes
Unidirectional change. Easy to reconstruct ancestors and count states.
This scheme is of importance in comparative genomics, with 1 = presenceof a deletion.
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.2/46
Dollo parsimony
1 0 1
0
0 1 0
0 1
1
1
4 changes
(3 losses)
Assumes that it is much more difficult to gain state 1 than to lose it. Has
been used to model gain and loss of restriction sites.
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.3/46
Polymorphism parsimony
1 0 1
0
0 1 0
0 1
1
1
5 retentions ofpolymorphism
01
Note that in this case the retention (not the loss) of the polymorphic state(01) is considered unlikely and is penalized.
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.4/46
nonadditive binary coding
0
1
2
3
4
original
new characters
0
0
0
0
0
a b
0
1
1
1
1
c
−1 1 0 0 0 0
d e
0 0 0
0
1
1
0
0
0
1
0
0
0
0
1
−1
0
1
2 3
4
a b
c d
e
Using multiple 0/1 characters to construct a case with the sameparsimony score on all trees as a single multistate character with a
“character state tree”
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.5/46
Dollo parsimony – a paradox
−1
0
1
2 3
4 A B C D
3 32 4
y
x
With a character-state tree, can you always find a Dollo parsimonyreconstruction that has only one origin of state 2 ?
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.6/46
Weighting using a function
Suppose that each step is to be weighted 1/(n + 2) if the character has a
total of n steps.
Then the total weight of steps when there are n steps in that character is
n × 1/(n + 2) which is n/(n + 2)
Total steps in Weight of Total weightthe character each step of all steps
0 0 01 0.3333 0.33332 0.25 0.53 0.2 0.64 0.1667 0.66675 0.142857 0.71428
The rationale for doing this is to somewhat discount the information frommore rapidly changing characters.
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.7/46
Successive weighting
J.S. Farris suggested (1969):
1. Infer a tree with unweighted parsimony
2. Calculate weights for each character (or site) based on the number
of changes in that character on that tree
3. Use those weights to infer a new tree
4. Unless the tree hasn’t changed, go back to step 2.
This method, which was put forward in the first modern quantitative paperon weighting, was intended to discount rapidly-evolving characters.
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.8/46
Successive weighting
There are 15 possible unrooted trees, which fall into 5 types according tohow many changes they have in each character. The table shows the totalweighted number of changes when each tree type is evaluated using the
weights implied by the 5 different tree types.
number have pattern type tree type used for weights
of trees of changes: of tree I II III IV V
1 (1,1,1,2,2,1) I 2.333 2.250 2.167 2.417 2.083
2 (1,2,1,2,2,1) II 2.667 2.500 2.500 2.667 2.333
2 (2,1,2,2,2,1) III 3.000 2.917 2.667 2.917 2.583
3 (2,2,2,1,1,1) IV 2.833 2.667 2.500 2.500 2.333
7 (2,2,2,2,2,1) V 3.333 3.167 3.000 3.167 2.833
From this table you can figure out what the sequence of trees will be if you
start from one of these types. Do you get different ultimate outcomes?
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.9/46
Ties in successive weighting
An example of successive weighting that would show the difficulty it has in
detecting ties. The table shows the total weighted number of changeswhen each tree type is evaluated using the weights implied by the differenttree types.
number have pattern type tree type used for weights
of trees of changes: of tree I II III
1 (1,1,2,2) I 1.667 1.833 1.5
1 (2,2,1,1) II 1.833 1.667 1.5
1 (2,2,2,2) III 2.333 2.333 2
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.10/46
Nonsuccessive weighting
We can use a different algorithm to avoid the issue of dependence onstarting point which is seen in successive weighting:
Search over tree space in the usual way
For each tree, evaluate the number of steps in each character (orsite)
Calculate the weight for the character from its number of steps onthe current tree in the search.
Add up the weighted steps across characters to evaluate that tree.
This is equivalent to looking only at the diagonals in the preceding tablesof trees. Weights are never based on a different tree.
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.11/46
Evaluating compatibility with 0/1 characters
E. O. Wilson (Systematic Zoology, 1965) pointed out that for two characters,if all four combinations 00, 01, 10 and 11 exist in one or another species,the two characters must be incompatible with each other, in the sense thatthere can be no tree on which they each change only once (so the derived
state is uniquely derived).
Compatiblecharacters
0 1
0 X X1 X
Incompatiblecharacters
0 1
0 X X1 X X
The logic is straightforward: to originate a new combination requires a
step in one character (or both). With four combinations there are 3 stepsrequired, one more than the number that would be needed if bothcharacters were incompatible. (In genomics this is called the
“three-gamete condition” – out of the four possibilities there must be three
or fewer for there to have been no recombination and no recurrentmutations).
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.12/46
Data example for compatibility
The data set Table 1.1 with an added species all of whose characters are0.
Characters
Species 1 2 3 4 5 6
Alpha 1 0 0 1 1 0
Beta 0 0 1 0 0 0
Gamma 1 1 0 0 0 0
Delta 1 1 0 1 1 1
Epsilon 0 0 1 1 1 0
Omega 0 0 0 0 0 0
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.13/46
The compatibility matrix for this data set
1 2 3 4 5 61 2 3 4 5 6
123456
The darkly-shaded boxes are the combinations that are compatible.
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.14/46
The Pairwise Compatibility Theorem
A set S of characters has all pairs of characterscompatible with each other if and only if all ofthe characters in the set are jointly compatible (inthat there exists a tree with which all of them arecompatible).
This theorem is true, given the right conditions, which we will explain.
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.15/46
The compatibility graph
Maximal cliques? Largest?
1
2 3
4
56
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.16/46
One of the cliques it contains
1
2 3
4
56
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.17/46
A maximal clique
1
2 3
4
56
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.18/46
The largest maximal clique
1
2 3
4
56
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.19/46
Making the tree by “tree popping”
Alpha
Beta
Gamma
Delta
Epsilon
AlphaBeta
Gamma
DeltaEpsilon
4 5
AlphaBetaGammaDeltaEpsilon
0011
10011
1
Character 1Character 3
1 2 3 6
101
01
11
00
0
1
1
0
00
00010
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.20/46
Making the tree by “tree popping”
Gamma
Delta
Alpha
Beta
Gamma
Delta
Epsilon
AlphaBeta
Gamma
DeltaEpsilon
Beta
EpsilonAlpha
4 5
AlphaBetaGammaDeltaEpsilon
0011
10011
1
Character 1
Character 2
Character 3
1 2 3 6
101
01
11
00
0
1
1
0
00
00010
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.21/46
Making the tree by “tree popping”
Gamma
DeltaGamma
Alpha
Beta
Gamma
Delta
Epsilon
AlphaBeta
Gamma
DeltaEpsilon
Beta
EpsilonAlpha
Delta
Beta
EpsilonAlpha
4 5
AlphaBetaGammaDeltaEpsilon
0011
10011
1
Character 1
Character 2
Character 3
Character 6
1 2 3 6
101
01
11
00
0
1
1
0
00
00010
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.22/46
Making the tree by “tree popping”
Alpha
Beta
Gamma
Delta
Epsilon
AlphaBeta
Gamma
DeltaEpsilon
Beta
EpsilonAlpha
Alpha
Gamma
Delta
Beta
Epsilon
Tree is:
Gamma
Delta
Gamma
Delta
Beta
EpsilonAlpha
4 5
AlphaBetaGammaDeltaEpsilon
0011
10011
1
Character 1
Character 2
Character 3
Character 6
1 2 3 6
101
01
11
00
0
1
1
0
00
00010
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.23/46
Unknown states and compatibility
A data set that has all pairs of characters compatible, but which cannothave all characters compatible with the same tree. This violates thePairwise Compatibility Theorem, owing to the unknown ("?") states.
Alpha 0 0 0
Beta ? 0 1
Gamma 1 ? 0
Delta 0 1 ?
Epsilon 1 1 1
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.24/46
Multiple character states and compatibility
Walter Fitch’s set of nucleotide sequences that have each pair of sitescompatible, but which are not all compatible with the same tree.
Alpha A A A
Beta A C C
Gamma C G C
Delta C C G
Epsilon G A G
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.25/46
Transition probabilities in a two-state model
In a symmetric two-state model with rate of change r per unit time, where
0 1r
r we have Prob (1 | 0, t, r) = 12(1 − e
−2 r t)
0.0 0.5 1.0 1.5 2.00.0
0.2
0.4
0.6
0.8
1.0
2 (1 1 − e
r t
r tP
rob
ab
ilit
y o
f c
ha
ng
e
)− 2rt
When rt is small then to very good approximation the probability of theevents in the branch is rt or 1 − rt . The latter is nearly 1.
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.26/46
Likelihood and parsimony
L = Prob (Data|Tree)
=chars∏
i=1
∑
recon−structions
(
12
B∏
j=1
{
ritj if this character changes1 − ritj if it does not change
}
)
The sum is over all reconstructions of changes in that character that result
in the observed states at the tips of the tree. It is not just over all mostparsimonious reconstructions of changes.
We will see (later in the course) that maximizing the likelihood is a Good
Thing. We want to show that in the case where changes are infrequent,
there is a relationship between that and minimizing the (weighted)parsimony score, and what the proper weights turn out to be.
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.27/46
approximating ...
Since the probabilities of change are (nearly) rt, and the probabilities of
non-change are (nearly) 1, for each character (i) the likelihood isapproximately the product over all branches (the jth of which has length
tj)
1
2
B∏
i=1
(ritj)nij .
For them all it is approximately
L ≃chars∏
i=1
1
2
branches∏
j=1
(ritj)nij
.
Tossing out the 1/2 terms and taking logarithms, if the number of
changes in character i in branch j is nij,
− ln L ≃
chars∑
i=1
branches∑
j=1
nij (− ln (ritj)) .
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.28/46
Weights and likelihood
Probabilities of change and resulting weights for an imaginary case.
Character ri tj changes total weight
1 0.01 1 4.6052 0.01 2 9.2103 0.00001 1 11.519
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.29/46
with one-tenth as much change ...
Character ri tj changes total weight
1 0.001 1 6.9082 0.001 2 13.8163 0.000001 1 13.816
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.30/46
... and with one-tenth less again
Character ri tj changes total weight
1 0.0001 1 9.2102 0.0001 2 18.4213 0.0000001 1 16.118
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.31/46
Trees, steps, and patterns
0 0 0
1 1 1
1 1 1
1 1 1
1 1 1
0000
0001
0010
0100
0111
1000
1011
1101
1110
1111
1 1 1
1 1 1
1 2 2
1 2 2
0011
0101
0110
1001
1010
22 122 1
22 122 1
1100
1 1 1
111
0 0 0
a c
b d
a b
c d
a b
d c
This table shows how many changes of state (steps) are needed for each
of the 16 possible character patterns on each of the three unrooted treetopologies, under a parsimony criterion.
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.32/46
Trees, steps, and patterns with DNA
AAAA
AAAC
AAAG
AAAT
AACA
AACG
AACT
AAGA
AAGC
AAGT
AATA
AATC
AATG
ACAA
ACAG
ACAT
ACCC
TTTT
...
0 0 0
1 1 1
1 1 1
1 1 1
1 1 11 1 1
2 2 2
2 2 2
1 1 1
2 2 2
2 2 2
1 1 1
2 2 2
2 2 2
1 1 1
2 2 2
2 2 2
AACC
AAGG
AATT
ACAC
ACCA
1 2 2
1 2 2
1 2 2
2 1 2
2 2 11 1 1
0 0 0
...
a c a b a b
b d c d d c
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.33/46
Parsimony and patterns
Ignoring all the patterns that have the same number of steps on all
topologies, the ones that matter to a parsimony method have one of thesethree sums:
nxxyy + 2nxyxy + 2nxyyx = 2(nxxyy + nxyxy + nxyyx) − nxxyy
2nxxyy + nxyxy + 2nxyyx = 2(nxxyy + nxyxy + nxyyx) − nxyxy
2nxxyy + 2nxyxy + nxyyx = 2(nxxyy + nxyxy + nxyyx) − nxyyx.
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.34/46
The counterexample
p pq
q q
A
B
C
D
The long branches have probabilities of change p, the short branches
have probabilities of change q. This is the canonical case of “long branchattraction”.
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.35/46
Pattern probabilities
If tip pattern is 1100, and internal nodes are 1, 1 the probability is
1
2(1 − p)(1 − q)(1 − q)pq
and in general summing over all four possibilities:
P1100 = 12
(
(1 − p)(1 − q)2pq + (1 − p)2(1 − q)2q
+ p2q3 + pq(1 − p)(1 − q)2)
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.36/46
Pattern probabilities
Pxxyy = (1 − p)(1 − q)[q(1 − q)(1 − p) + q(1 − q)p]
+ pq[(1 − q)2(1 − p) + q2p]
Pxyxy = (1 − p)q[q(1 − q)p + q(1 − q)(1 − p)]
+ p(1 − q)[p(1 − q)2 + (1 − p)q2]
Pxyyx = (1 − p)q[(1 − p)q2 + p(1 − q)2]
+ p(1 − q)[q(1 − q)p + q(1 − q)(1 − p)]
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.37/46
Taking differences
Pxyxy − Pxyyx = (1 − 2q)[
q2(1 − p)2 + (1 − q)2p2]
Which is always positive as long as q < 1/2 and either p or q is
positive. Thus Pxyxy > Pxyyx so we don’t need to concern ourselves with
Pxyyx.
To have Pxxyy be the largest of the three, we only need to know that
Pxxyy − Pxyxy > 0
and after a struggle that turns out to require
(1 − 2q)[
q(1 − q) − p2]
> 0
which (provided q < 1/2) is true if and only if
q(1 − q) > p2
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.38/46
Conditions for inconsistency
p
q0.1 0.2 0.3 0.4 0.50
0.1
0.2
0.3
0.4
0.5
0
consistent
inconsistent
p2< q (1−q)
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.39/46
Example for patterns with DNA
root
z
p
w
p
q
1(C) 3(A)
4(A)2(C)
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.40/46
Calculating a pattern frequency
Prob [CCAA] = 118
(1 − p)(1 − q)2pq + 127
pq2(1 − p)(1 − q)
+ 1162
p2q2(1 − q) + 7972
p2q3
+ 112
(1 − p)2(1 − q)2q.
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.41/46
Pattern frequencies
Prob [xxyy] = (1 − p)2q(1 − q)2 + 23p(1 − p)q(1 − q)2
+ 49p(1 − p)q2(1 − q) + 2
27p2q2(1 − q)
+ 781
p2q3
Prob [xyxy] = 13(1 − p)2q2(1 − q) + 2
9p(1 − p)q2(1 − q)
+ 427
p(1 − p)q3 + 13p2(1 − q)3
+ 29p2q2(1 − q) + 2
81p2q3
Prob [xyyx] = 181
(1 − p)2q3 + 23p(1 − p)q(1 − q)2
+ 49p(1 − p)q3 + 1
9p2q(1 − q)2
+ 627
p2q2(1 − q) + 281
p2q3.
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.42/46
Conditions for inconsistency with DNA
p <−18 q + 24 q2 +
√
243 q − 567 q2 + 648 q3 − 288 q4
9 − 24 q + 32 q2
(This will not be on the test).
For small p and q this is approximately
1
3p2 < q
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.43/46
Conditions for inconsistency with DNA
0.7
0.6
0.5
0.4
0.3
0.2
0.1
00.2 0.4 0.60
q
p
inconsistent
consistent
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.44/46
Inconsistency with a clock
A B C D E A B C D E
x x
Cases like this were discovered by Michael Hendy and David Penny in1989.
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.45/46
Example showing inconsistency with a clock
200
100
0100 200 500 1000 2000
sitestr
ees c
orr
ect out of 1000
b
b
E
A
B
C
D
0.50
0.30
0.20−b
b = 0.03
b = 0.02
Week 3: Parsimony variants, compatibility, statistics and parsimony – p.46/46
Recommended