Sequence Analysis '18, lecture 12
• T-matrix
• Least squares
• Maximum likelihood
• Bootstrapping
The T-matrix
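[Figure: unrooted four-taxon tree ((A,B),(C,D)). Terminal branches: b1 to A, b2 to B, b3 to C, b4 to D. Internal branch: b5.]

Tb = d:

\[
\begin{pmatrix} d_{AB}\\ d_{AC}\\ d_{AD}\\ d_{BC}\\ d_{BD}\\ d_{CD} \end{pmatrix}
=
\begin{pmatrix}
1 & 1 & 0 & 0 & 0\\
1 & 0 & 1 & 0 & 1\\
1 & 0 & 0 & 1 & 1\\
0 & 1 & 1 & 0 & 1\\
0 & 1 & 0 & 1 & 1\\
0 & 0 & 1 & 1 & 0
\end{pmatrix}
\begin{pmatrix} b_1\\ b_2\\ b_3\\ b_4\\ b_5 \end{pmatrix}
\]

Each row of the T-matrix marks the branches that are summed to give the tree distance between one pair of taxa. For example:

\[ d_{AB} = b_1 + b_2, \qquad d_{AC} = b_1 + b_3 + b_5, \quad \text{etc.} \]

Branch distances b are found by "squaring" the T-matrix (forming $T^{T}T$), inverting the result, and multiplying by $T^{T}d$, where d holds the sequence distances. New tree distances Δ are found by summing b, using T (Δ = Tb).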
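As an illustration of where the rows of T come from, here is a minimal Python sketch that builds the T-matrix by walking the path between every pair of taxa. The edge-list representation and the internal node names "x" and "y" are assumptions made for this example only; they are not part of the lecture.

```python
from itertools import combinations

# Unrooted tree ((A,B),(C,D)) as an edge list; each edge maps to a branch index.
# "x" and "y" are hypothetical names for the two internal nodes.
EDGES = {
    ("A", "x"): 0,  # b1
    ("B", "x"): 1,  # b2
    ("C", "y"): 2,  # b3
    ("D", "y"): 3,  # b4
    ("x", "y"): 4,  # b5
}

def neighbors(node):
    """Yield (neighbor, branch_index) for every edge touching node."""
    for (u, v), k in EDGES.items():
        if u == node:
            yield v, k
        elif v == node:
            yield u, k

def path_branches(start, goal, came_from=None):
    """Return the branch indices on the unique path from start to goal."""
    if start == goal:
        return []
    for nxt, k in neighbors(start):
        if nxt == came_from:
            continue  # do not walk straight back along the branch we arrived on
        rest = path_branches(nxt, goal, came_from=start)
        if rest is not None:
            return [k] + rest
    return None  # dead end: goal is not reachable in this direction

def t_matrix(taxa, n_branches):
    """One row per taxon pair, with a 1 for every branch on the path between them."""
    rows = []
    for a, b in combinations(taxa, 2):
        row = [0] * n_branches
        for k in path_branches(a, b):
            row[k] = 1
        rows.append(row)
    return rows

T = t_matrix(["A", "B", "C", "D"], 5)
for (a, b), row in zip(combinations("ABCD", 2), T):
    print(f"d{a}{b}", row)
```

The rows come out in the order AB, AC, AD, BC, BD, CD, matching the distance vector used on the slides.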
Least squares branch lengths using the T-matrix
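We want to minimize S, the weighted sum of squared differences between the empirical (sequence) distances d and the patristic (tree) distances Δ, taken over all pairs of taxa i > j:

\[ S = \sum_{i>j} w_{ij}\,(d_{ij} - \Delta_{ij})^{2} \]

[Figure: the four-taxon tree ((A,B),(C,D)) with branches b1 to b5.]

The patristic distances are Δ = Tb, and we want them to match d as closely as possible, with T, b, and the ordering of d (AB, AC, AD, BC, BD, CD) as on the T-matrix slide:

\[ Tb = \Delta \approx d \]

With all weights $w_{ij} = 1$, solving for b we get:

\[ b = (T^{T}T)^{-1}(T^{T}d) \]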
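The reason this expression minimizes S can be sketched in a few lines. The derivation below assumes the unweighted case (all $w_{ij} = 1$) and that $T^{T}T$ is invertible; both are assumptions of this sketch and are not stated on the slide.

```latex
\begin{aligned}
S(b) &= \sum_{i>j} (d_{ij} - \Delta_{ij})^{2} = (d - Tb)^{T}(d - Tb) \\
\frac{\partial S}{\partial b} &= -2\,T^{T}(d - Tb) = 0 \\
T^{T}T\,b &= T^{T}d \\
b &= (T^{T}T)^{-1}(T^{T}d)
\end{aligned}
```

The third line is the set of "normal equations"; "squaring" the T-matrix refers to forming $T^{T}T$ on its left-hand side.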
Least squares step 1: squaring the T-matrix
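[Figure: the four-taxon tree ((A,B),(C,D)) with branches b1 to b5.]

Tb = d, with T and b as before and

\[ d = (d_{AB}, d_{AC}, d_{AD}, d_{BC}, d_{BD}, d_{CD})^{T} = (3, 5, 7, 4, 6, 3)^{T} \]

\[ b = (T^{T}T)^{-1}(T^{T}d) \]

Squaring the T-matrix means multiplying it by its own transpose:

\[
T^{T}T =
\begin{pmatrix}
1 & 1 & 1 & 0 & 0 & 0\\
1 & 0 & 0 & 1 & 1 & 0\\
0 & 1 & 0 & 1 & 0 & 1\\
0 & 0 & 1 & 0 & 1 & 1\\
0 & 1 & 1 & 1 & 1 & 0
\end{pmatrix}
\begin{pmatrix}
1 & 1 & 0 & 0 & 0\\
1 & 0 & 1 & 0 & 1\\
1 & 0 & 0 & 1 & 1\\
0 & 1 & 1 & 0 & 1\\
0 & 1 & 0 & 1 & 1\\
0 & 0 & 1 & 1 & 0
\end{pmatrix}
=
\begin{pmatrix}
3 & 1 & 1 & 1 & 2\\
1 & 3 & 1 & 1 & 2\\
1 & 1 & 3 & 1 & 2\\
1 & 1 & 1 & 3 & 2\\
2 & 2 & 2 & 2 & 4
\end{pmatrix}
\]

The columns of $T^{T}$ correspond to the distances d_AB, d_AC, d_AD, d_BC, d_BD, d_CD.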
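The squaring step can be checked with a few lines of plain Python. This is only a sketch: the helper names `transpose`, `matmul`, and `matvec` are made up for this example.

```python
# T-matrix for the tree ((A,B),(C,D)); rows are AB, AC, AD, BC, BD, CD.
T = [
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 0, 1],
    [0, 1, 0, 1, 1],
    [0, 0, 1, 1, 0],
]
d = [3, 5, 7, 4, 6, 3]  # sequence distances, same row order as T

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Matrix product a x b for lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def matvec(a, v):
    """Matrix-vector product a x v."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]

Tt = transpose(T)
TtT = matmul(Tt, T)   # the "squared" T-matrix, 5 x 5
Ttd = matvec(Tt, d)   # right-hand side of the normal equations

for row in TtT:
    print(row)
print(Ttd)
```

Each diagonal entry of `TtT` counts how many pairwise paths use that branch: three for every terminal branch and four for the internal branch b5.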
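Least squares step 2: inverting, solving for b

T-matrix • Branches = Distances (Tb = d). Solving for branches:

\[ b = (T^{T}T)^{-1}(T^{T}d) \]

\[
T^{T}T =
\begin{pmatrix}
3 & 1 & 1 & 1 & 2\\
1 & 3 & 1 & 1 & 2\\
1 & 1 & 3 & 1 & 2\\
1 & 1 & 1 & 3 & 2\\
2 & 2 & 2 & 2 & 4
\end{pmatrix},
\qquad
(T^{T}T)^{-1} =
\begin{pmatrix}
1/2 & 0 & 0 & 0 & -1/4\\
0 & 1/2 & 0 & 0 & -1/4\\
0 & 0 & 1/2 & 0 & -1/4\\
0 & 0 & 0 & 1/2 & -1/4\\
-1/4 & -1/4 & -1/4 & -1/4 & 3/4
\end{pmatrix}^{*}
\]

\[
T^{T}d =
\begin{pmatrix}
1 & 1 & 1 & 0 & 0 & 0\\
1 & 0 & 0 & 1 & 1 & 0\\
0 & 1 & 0 & 1 & 0 & 1\\
0 & 0 & 1 & 0 & 1 & 1\\
0 & 1 & 1 & 1 & 1 & 0
\end{pmatrix}
\begin{pmatrix} 3\\ 5\\ 7\\ 4\\ 6\\ 3 \end{pmatrix}
=
\begin{pmatrix} 15\\ 13\\ 12\\ 16\\ 22 \end{pmatrix}
\]

\[
b = \begin{pmatrix} b_1\\ b_2\\ b_3\\ b_4\\ b_5 \end{pmatrix}
= (T^{T}T)^{-1}(T^{T}d)
= \begin{pmatrix} 2.0\\ 1.0\\ 0.5\\ 2.5\\ 2.5 \end{pmatrix}
\]

[Figure: the tree ((A,B),(C,D)) drawn with the fitted branch lengths: A 2.0, B 1.0, C 0.5, D 2.5, internal branch 2.5.]

*To get the inverse, I used this server: http://wims.unice.fr/wims/en_home.html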
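The inversion does not need a web server; a small Gauss-Jordan elimination with exact fractions gives the same result. The function below is an illustrative sketch (the name `invert` is hypothetical) and assumes the matrix is invertible.

```python
from fractions import Fraction

TtT = [
    [3, 1, 1, 1, 2],
    [1, 3, 1, 1, 2],
    [1, 1, 3, 1, 2],
    [1, 1, 1, 3, 2],
    [2, 2, 2, 2, 4],
]
Ttd = [15, 13, 12, 16, 22]

def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with exact fractions.

    Assumes the matrix is invertible; a singular matrix raises StopIteration
    when no nonzero pivot can be found.
    """
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    aug = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Bring a row with a nonzero entry in this column into pivot position.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so the pivot becomes 1.
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        # Clear this column in every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    # The right half of the augmented matrix now holds the inverse.
    return [row[n:] for row in aug]

inv = invert(TtT)
b = [sum(x * y for x, y in zip(row, Ttd)) for row in inv]
print([float(x) for x in b])  # [2.0, 1.0, 0.5, 2.5, 2.5]
```

Using `Fraction` keeps entries such as 1/2 and -1/4 exact, so the result can be compared directly with the inverse shown on the slide.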
How similar are sequence distances and least squares distances?

Least squares patristic distances, Δ = Tb with b = (2.0, 1.0, 0.5, 2.5, 2.5):

\[ (\Delta_{AB}, \Delta_{AC}, \Delta_{AD}, \Delta_{BC}, \Delta_{BD}, \Delta_{CD}) = (3.0, 5.0, 7.0, 4.0, 6.0, 3.0) \]

Sequence distances:

\[ (d_{AB}, d_{AC}, d_{AD}, d_{BC}, d_{BD}, d_{CD}) = (3, 5, 7, 4, 6, 3) \]

\[ S = \sum_{i>j} (d_{ij} - \Delta_{ij})^{2} = 0.0 \]
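The comparison amounts to recomputing Δ = Tb and scoring it against d. A minimal sketch follows; the function name `score` and the optional weight argument (defaulting to 1 for every pair) are assumptions of this example.

```python
T = [
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 0, 1],
    [0, 1, 0, 1, 1],
    [0, 0, 1, 1, 0],
]
b = [2.0, 1.0, 0.5, 2.5, 2.5]   # fitted branch lengths b1..b5
d = [3, 5, 7, 4, 6, 3]           # sequence distances AB, AC, AD, BC, BD, CD

def score(d, delta, w=None):
    """Weighted sum of squared differences between d and delta (weights default to 1)."""
    if w is None:
        w = [1.0] * len(d)
    return sum(wi * (di - ti) ** 2 for wi, di, ti in zip(w, d, delta))

# Patristic distance for each pair: sum of the branches marked in its row of T.
delta = [sum(t * x for t, x in zip(row, b)) for row in T]
S = score(d, delta)
print(delta, S)
```

A score of zero means the distances are perfectly additive on this tree.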
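How good are least squares distances for the wrong tree?

[Figure: the wrong tree ((A,C),(B,D)). Terminal branches: b1 to A, b2 to C, b3 to B, b4 to D. Internal branch: b5.]

The T-matrix is unchanged, but its rows now correspond to the pairs AC, AB, AD, BC, CD, BD:

\[
\begin{pmatrix} d_{AC}\\ d_{AB}\\ d_{AD}\\ d_{BC}\\ d_{CD}\\ d_{BD} \end{pmatrix}
=
\begin{pmatrix} 5\\ 3\\ 7\\ 4\\ 3\\ 6 \end{pmatrix}
=
\begin{pmatrix}
1 & 1 & 0 & 0 & 0\\
1 & 0 & 1 & 0 & 1\\
1 & 0 & 0 & 1 & 1\\
0 & 1 & 1 & 0 & 1\\
0 & 1 & 0 & 1 & 1\\
0 & 0 & 1 & 1 & 0
\end{pmatrix}
\begin{pmatrix} b_1\\ b_2\\ b_3\\ b_4\\ b_5 \end{pmatrix}
\]

\[
T^{T}d = \begin{pmatrix} 15\\ 12\\ 13\\ 16\\ 17 \end{pmatrix},
\qquad
b = (T^{T}T)^{-1}(T^{T}d) =
\begin{pmatrix}
1/2 & 0 & 0 & 0 & -1/4\\
0 & 1/2 & 0 & 0 & -1/4\\
0 & 0 & 1/2 & 0 & -1/4\\
0 & 0 & 0 & 1/2 & -1/4\\
-1/4 & -1/4 & -1/4 & -1/4 & 3/4
\end{pmatrix}
\begin{pmatrix} 15\\ 12\\ 13\\ 16\\ 17 \end{pmatrix}
=
\begin{pmatrix} 3.25\\ 1.75\\ 2.25\\ 3.75\\ -1.25 \end{pmatrix}
\]

Least-squares produces a negative distance for b5. Set the negative distance to zero, b = (3.25, 1.75, 2.25, 3.75, 0.0), and recompute the patristic distances Δ = Tb:

\[ (\Delta_{AC}, \Delta_{AB}, \Delta_{AD}, \Delta_{BC}, \Delta_{CD}, \Delta_{BD}) = (5.0, 5.5, 7.0, 4.0, 5.5, 6.0) \]

\[ S = \sum_{i>j} (d_{ij} - \Delta_{ij})^{2} = 12.5 \]

Very bad, compared with S = 0.0 for the correct tree.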
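The wrong-tree fit can be reproduced with the same machinery. The sketch below reuses the inverse $(T^{T}T)^{-1}$ given on the slides instead of recomputing it, and only relabels the rows of d for the topology ((A,C),(B,D)). The helper name `matvec` is made up for this example.

```python
from fractions import Fraction as F

# Same T as before; for the wrong tree ((A,C),(B,D)) the rows are read as
# d_AC, d_AB, d_AD, d_BC, d_CD, d_BD.
T = [
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 0, 1],
    [0, 1, 0, 1, 1],
    [0, 0, 1, 1, 0],
]
d = [5, 3, 7, 4, 3, 6]

# (T^T T)^-1 as given on the slides.
h, q = F(1, 2), F(-1, 4)
INV = [
    [h, 0, 0, 0, q],
    [0, h, 0, 0, q],
    [0, 0, h, 0, q],
    [0, 0, 0, h, q],
    [q, q, q, q, F(3, 4)],
]

def matvec(m, v):
    """Matrix-vector product for a list of rows."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]

Ttd = matvec([list(col) for col in zip(*T)], d)       # T^T d
b = matvec(INV, Ttd)                                   # least-squares branch lengths
b_clamped = [max(x, 0) for x in b]                     # set negative lengths to zero
delta = matvec(T, b_clamped)                           # patristic distances
S = sum((di - ti) ** 2 for di, ti in zip(d, delta))    # unweighted score

print("b =", [float(x) for x in b])
print("S =", float(S))
```

The internal branch b5 comes out negative, which is the signal that the data do not support this topology.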
Remember Max Unweighted Parsimony?
  ....|....0....|....0
A ATGGCTATTCTTATAGTACG
B ATGGCTAGTCTTATATTACA
C TTCACTAGACCTGTGGTCCA
D TTGACCAGACCTGTGGTCCG
E TTGACCAGTTCTCTAGTTCG

[Figure: per-column mutation counts (TOTALS row) for this alignment, evaluated over all trees.]
Maximum likelihood
Maximum likelihood is maximum weighted parsimony.

Given a tree with branch lengths and an MSA, sum the probabilities of the branches over all possible ancestor bases (or amino acids).
Probabilities may come from the Jukes-Cantor (J-C) or Kimura models for nucleotides, or from PAM or BLOSUM matrices for proteins.
May be used in combination with NNI, SPR, TBR to evaluate tree topology, with or without branch lengths (if there are no branch lengths, assume all branch lengths are equal).
May be used to predict ancestral sequences.

[Figure: four-leaf tree with observed bases A, C, A, C at the leaves and two unknown ancestor nodes, each marked "?".]

Sum the total likelihood of all branches over all possible bases at the two ? positions.
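To make "sum over all possible bases at the two ? positions" concrete, here is a hedged sketch for a single alignment column on a quartet. It uses the standard formulation in which the branch probabilities for one choice of ancestors are multiplied together and the products are then summed over all ancestor choices. Giving all five branches the same length `t`, the uniform prior of 1/4 on the first ancestor, and the function names are assumptions of this example.

```python
import math
from itertools import product

BASES = "ACGT"

def jc_prob(a, b, t):
    """Jukes-Cantor probability that base a is replaced by base b along a branch
    of length t (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def site_likelihood(leaves, t=0.1):
    """Likelihood of one alignment column on the quartet ((l1,l2),(l3,l4)).

    Sums over every pair of bases (x, y) at the two internal nodes. Each term is
    the prior of x (1/4) times the product of the five branch probabilities.
    """
    l1, l2, l3, l4 = leaves
    total = 0.0
    for x, y in product(BASES, repeat=2):
        total += (0.25
                  * jc_prob(x, l1, t) * jc_prob(x, l2, t)   # x to its two leaves
                  * jc_prob(x, y, t)                          # internal branch
                  * jc_prob(y, l3, t) * jc_prob(y, l4, t))   # y to its two leaves
    return total

print(site_likelihood("AACC"))  # column that agrees with the split (1,2)|(3,4)
print(site_likelihood("ACAC"))  # column that conflicts with it
```

With short branches, a column that agrees with the tree's split needs fewer substitutions and therefore gets the higher likelihood. Multiplying the per-column values (or adding their logarithms) over the whole alignment gives the likelihood of the tree.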
Scoring quartets using Parsimony

Find quartet AHJP in this tree. Score each tree for AHJP.

[Figure: a tree of taxa A through T, followed by the three possible unrooted trees for the quartet A, H, J, P.]
Given this sequence data, which quartet is most likely?

Sequence position:
taxon   1  2  3
A       T  T  C
J       T  T  T
H       G  T  C
P       G  G  T

[Figure: the three quartet trees for A, H, J, P, each with ancestor nodes x and y.]

Ancestor sequences and scores for the three trees:
x      y      mutations   Prob*
TTC    GTC    4
GTC    GTT    5
TTC    TTT    4

The Prob column uses the Jukes-Cantor P(mutation).

*Maximum likelihood is parsimony taken to the next level. Instead of picking the most parsimonious ancestor sequence, we sum the probabilities of all options. We choose the tree that maximizes this sum.
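The mutation counts for the three quartet trees can be reproduced with Fitch's small-parsimony rule: intersect the child state sets if possible, otherwise take the union and count one mutation. This is a sketch under the assumption that the three trees are the three ways of pairing A, J, H, and P; the function names are hypothetical.

```python
SEQS = {"A": "TTC", "J": "TTT", "H": "GTC", "P": "GGT"}

def fitch_join(left, right):
    """Combine two child state sets; return (parent state set, mutations added)."""
    common = left & right
    if common:
        return common, 0
    return left | right, 1

def quartet_score(pair1, pair2, seqs):
    """Minimum number of mutations for the unrooted quartet pair1 | pair2."""
    total = 0
    n_sites = len(next(iter(seqs.values())))
    for i in range(n_sites):
        # Ancestor x joins the first pair, ancestor y joins the second pair.
        x, cost_x = fitch_join({seqs[pair1[0]][i]}, {seqs[pair1[1]][i]})
        y, cost_y = fitch_join({seqs[pair2[0]][i]}, {seqs[pair2[1]][i]})
        # The internal branch connects x and y.
        _, cost_xy = fitch_join(x, y)
        total += cost_x + cost_y + cost_xy
    return total

QUARTETS = [
    (("A", "J"), ("H", "P")),
    (("A", "H"), ("J", "P")),
    (("A", "P"), ("J", "H")),
]
scores = {q: quartet_score(q[0], q[1], SEQS) for q in QUARTETS}
for (p1, p2), s in scores.items():
    print("".join(p1), "|", "".join(p2), s)
```

Two of the three pairings tie at four mutations, so parsimony alone cannot choose between them here; that tie is where a probability-weighted score becomes useful.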
Maximum likelihood
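\[ L = \sum_{\text{all branches } x,y} P(\mathrm{distr}_x, \mathrm{distr}_y) \]

[Figure: quartet tree with leaves J, H, P, A and ancestor nodes x and y.]

Max Likelihood Algorithm:
For all candidate trees:
    Iterate until convergence:
        For each ancestor node x:
            Calculate new distr_x
        Check for convergence
    Calculate L (see above)

\[ ML = \max_{\text{all trees}}(L) \]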
Quartet puzzling. Tree based on ML quartets
1. Identify all choose-4 subsets of the MSA => "quartets".
2. For each quartet, find the best 4-taxa tree (3 possibilities) using least-squares, Fitch-Margoliash, Parsimony, or Maximum Likelihood.
3. Start with a randomly selected quartet, ((A,B),(C,D)).
4. Choose one new taxon, E.
5. For every branch, try adding E to the branch.
    1. For all quartets containing E and three of the other taxa, ask whether the split is correct. If not, add 1 to the penalty.
6. Add E to the branch with the lowest penalty.
7. Continue at step 4 with a new taxon.

[Figure: the tree ((A,B),(C,D)) with branches b1 to b5 and the new taxon E to be placed; a second tree is labeled "true tree".]

Penalty for adding E to branch:
quartet   b1  b2  b3  b4  b5
AB,CD     -   -   -   -   -
AE,BC     0   1   1   1   1
AE,BD     0   1   1   1   1
AE,CD     0   0   1   1   0
BE,CD     0   0   1   1   0

QP can be repeated, starting with a different quartet, to generate any number of trees.
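Steps 5 and 6 can be sketched as follows for the example in the penalty table. The tree is stored as an edge list, and a quartet split ab|cd counts as correct when the path from a to b and the path from c to d share no node. The node names "x", "y", and "n", and all function names, are assumptions of this example.

```python
# Starting quartet ((A,B),(C,D)); "x" and "y" are hypothetical internal node names.
BRANCHES = {"b1": ("A", "x"), "b2": ("B", "x"), "b3": ("C", "y"),
            "b4": ("D", "y"), "b5": ("x", "y")}

# Best topology for each quartet containing E, as listed in the penalty table.
BEST = {frozenset("ABCE"): ("AE", "BC"), frozenset("ABDE"): ("AE", "BD"),
        frozenset("ACDE"): ("AE", "CD"), frozenset("BCDE"): ("BE", "CD")}

def add_taxon(branches, target, taxon):
    """Return an edge list in which taxon hangs off a new node "n" on branch target."""
    edges = [e for name, e in branches.items() if name != target]
    u, v = branches[target]
    return edges + [(u, "n"), ("n", v), ("n", taxon)]

def path(edges, start, goal, came_from=None):
    """Nodes on the unique path from start to goal in a tree."""
    if start == goal:
        return [goal]
    for u, v in edges:
        if start in (u, v):
            nxt = v if u == start else u
            if nxt != came_from:
                rest = path(edges, nxt, goal, start)
                if rest:
                    return [start] + rest
    return None

def displays(edges, pair1, pair2):
    """True if the tree shows the split pair1 | pair2 (the two paths share no node)."""
    p1 = set(path(edges, pair1[0], pair1[1]))
    p2 = set(path(edges, pair2[0], pair2[1]))
    return not (p1 & p2)

penalty = {}
for name in BRANCHES:
    tree = add_taxon(BRANCHES, name, "E")
    # One penalty point for every quartet whose best split the tree fails to show.
    penalty[name] = sum(not displays(tree, p1, p2) for p1, p2 in BEST.values())

best_branch = min(penalty, key=penalty.get)
print(penalty, best_branch)
```

The column sums of the penalty table are 0, 2, 4, 4, 2 for b1 to b5, so E is added to branch b1, next to A.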
Expert tree strategy
• Get sequence distances.
• QP with ML -- Use ML to choose topology for each quartet. Use QP to build trees from quartets.
• NNI, SPR, TBR with L-S, F-M -- Modify the tree, calculate branch lengths, calculate patristic distances, calculate difference distance score S.
• Consensus tree -- If more than one tree has the same best score, merge branches.
• Confidence -- For each branch, assign a "bootstrap" value.
"Bootstrap analysis"
• A method to validate the significance of a phylogenetic tree, branchpoint by branchpoint.
• Requires a means to generate independent trees. (For example, using distances generated from different regions of the mitochondrial genome, or from random subsets of sequences, or random subsets of columns.)
• Choose the representative tree as the 'parent'. Calculate the following:
    For each branchpoint in the parent tree,
        For each tree, ask:
            Is there a branchpoint having the same subclade contents (i.e. same taxa, any order)?
        Bootstrap value = number of trees having the branchpoint / total trees.
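The counting rule above can be sketched in a few lines. Trees are written as nested tuples, and each branchpoint is identified by the set of taxa below it, so the order of the taxa does not matter. The four trees used here are hypothetical examples, not the eight trees drawn on the slides, and the function names are made up for this sketch.

```python
def clades(tree):
    """Return (leaf set of tree, set of leaf sets of all its branchpoints)."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves, found = frozenset(), set()
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found |= child_found
    found.add(leaves)  # this branchpoint, identified by the taxa below it
    return leaves, found

def bootstrap_values(parent, trees):
    """Fraction of trees that contain each branchpoint of the parent tree."""
    all_taxa, parent_clades = clades(parent)
    values = {}
    for clade in parent_clades:
        if clade == all_taxa:
            continue  # the root holds every taxon in every tree
        hits = sum(clade in clades(t)[1] for t in trees)
        values[clade] = hits / len(trees)
    return values

# Hypothetical rooted trees for illustration only.
PARENT = ((("A", "B"), "C"), ("D", "E"))
TREES = [
    ((("A", "B"), "C"), ("D", "E")),
    ((("B", "A"), "C"), ("E", "D")),
    ((("A", "C"), "B"), ("D", "E")),
    (("A", "B"), ("C", ("D", "E"))),
]
support = bootstrap_values(PARENT, TREES)
for clade, value in sorted(support.items(), key=lambda kv: sorted(kv[0])):
    print("".join(sorted(clade)), value)
```

Using `frozenset` for the subclade contents is what makes "same taxa, any order" a simple equality test.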
Comparing branchpoints

[Figure: eight trees on the taxa A, B, C, D, E.]

Bootstrap value = P((A,B),C) = 5/8 = 0.625
Comparing branchpoints

[Figure: the same eight trees on the taxa A, B, C, D, E.]

Bootstrap value = P((A,B,C),(D,E)) = 6/8
Bootstrap values for this data

[Figure: the same eight trees, with the bootstrap values 0.75, 0.75, 0.625, and 0.875 labeled at the branchpoints of the parent tree.]
Compare lineages instead of branchpoints.

[Figure: the same eight trees on the taxa A, B, C, D, E.]

Bootstrap value = P(A,B) = 6/8

For unrooted trees or rooted trees:
• Treat any lineage as the root.
• Ask how often the root branching is conserved.
Review
• What are patristic distances?
• How can you tell if a tree is non-additive? What does non-additivity imply?
• How do you compare two trees that have different topology?
• How do you score a set of distances against a tree?
• Assuming an initial tree, what ways exist for exploring other tree topologies?
• Describe nearest neighbor interchange.
• What kind of program is quartet puzzling?
• ... Fitch-Margoliash?
• Compare and contrast three ways to get branch lengths from distances: UPGMA, Fitch-Margoliash, Least-squares.
Exercise 9

[Figure: two trees on the taxa A, B, C, D, E, F, G, H, J, K.]

1: What is the split distance?
2: Draw these quartets, both trees: ACDK, JFKH, GABC
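3: Write the T-matrix

[Figure: unrooted five-taxon tree with leaves A, B, C, D, E and branches b1 to b7.]

\[
\begin{pmatrix} d_{AB}\\ d_{AC}\\ d_{AD}\\ d_{AE}\\ d_{BC}\\ d_{BD}\\ d_{BE}\\ d_{CD}\\ d_{CE}\\ d_{DE} \end{pmatrix}
= T
\begin{pmatrix} b_1\\ b_2\\ b_3\\ b_4\\ b_5\\ b_6\\ b_7 \end{pmatrix},
\qquad T = {?}
\]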
Exercise 10 (continuation of Exercise 9)

4: Square the T-matrix

\[ T^{T}T = {?} \]

5: Invert the squared T-matrix

\[ (T^{T}T)^{-1} = {?} \]

6: Convert distances to $T^{T}D$

\[ D = (3, 9, 11, 12, 10, 12, 13, 12, 13, 7)^{T}, \qquad T^{T}D = {?} \]
7: Solve for branches b.

\[
\begin{pmatrix} b_1\\ b_2\\ b_3\\ b_4\\ b_5\\ b_6\\ b_7 \end{pmatrix}
= (T^{T}T)^{-1}\,T^{T}D = {?}
\]

8: Draw unrooted phylogram.
9: Write Newick format, with distances.