1/27/09
1
CSCI1950-Z Computational Methods for Biology
Lecture 2
Ben Raphael January 26, 2009
http://cs.brown.edu/courses/csci1950-z/
Outline
• Review of trees. Counting features.
• Character-based phylogeny
– Maximum parsimony
– Maximum likelihood
Tree Definitions

tree: A connected acyclic graph G = (V, E).

graph: A set V of vertices (nodes) and a set E of edges, where each edge (vi, vj) connects a pair of vertices.

A path in G is a sequence (v1, v2, …, vn) of vertices in V such that each (vi, vi+1) is an edge in E.

A graph is connected provided that for every pair of vertices vi, vj there is a path between vi and vj.
A cycle is a path with the same starting and ending vertices.
A graph is acyclic provided it has no cycles.
Tree Definitions

The degree of a vertex v is the number of edges incident to v.
A phylogenetic tree is a tree with a label for each leaf (vertex of degree one).
A binary phylogenetic tree is a phylogenetic tree in which every interior (non-leaf) vertex has degree 3 (one parent and two children).
A rooted (*binary) phylogenetic tree is a phylogenetic tree with a single designated vertex r, the root (* of degree 2).
w is a parent (ancestor) of v provided (v, w) is on the path from v to the root. In this case v is a child (descendant) of w.
Tree Definitions

tree: A connected acyclic graph G = (V, E). The degree of a vertex v is the number of edges incident to v.

A phylogenetic tree is a tree with a label for each leaf (vertex of degree one).

• Leaves represent existing species.
• Other vertices represent most recent common ancestors.
• Lengths of branches represent evolutionary time.
• The root (if present) represents the oldest evolutionary ancestor.
Counting and Trees

• A tree with n vertices has n-1 edges. (Proof?)
• A rooted binary phylogenetic tree with n leaves has n-1 internal vertices, and thus 2n-1 total vertices.
• How many rooted binary phylogenetic trees with n leaves?
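The answer is the double factorial (2n-3)!! = 1 · 3 · 5 ··· (2n-3) for n ≥ 2: each new leaf can be attached to any of the existing edges. A quick sketch (not from the slides) to tabulate the count:

```python
def num_rooted_binary_trees(n):
    """Number of rooted binary phylogenetic trees with n labeled leaves:
    (2n-3)!! = 1 * 3 * 5 * ... * (2n-3), for n >= 2."""
    count = 1
    for k in range(3, 2 * n - 2, 2):  # odd factors 3, 5, ..., 2n-3
        count *= k
    return count

for n in range(2, 7):
    print(n, num_rooted_binary_trees(n))  # 1, 3, 15, 105, 945
```

The count grows super-exponentially, which is why the Large Parsimony Problem discussed below cannot be solved by enumerating trees.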
Character-based Phylogenetic Tree Reconstruction

1. What is character data?
2. What is the criterion for evaluating a tree?
3. How do we optimize this criterion:
   1. Over all possible trees?
   2. Over a restricted class of trees?
Input: characters (molecular or morphological) → Algorithm → Output: optimal phylogenetic tree
Character-Based Tree Reconstruction

• Characters may be nucleotides of DNA (A, G, C, T) or amino acids (20-letter alphabet).
• Values are called states of the character.
• Characters may be morphological features: the number of eyes or legs, or the shape of a beak or a fin.
Gorilla: CCTGTGACGTAACAAACGA Chimpanzee: CCTGTGACGTAGCAAACGA Human: CCTGTGACGTAGCAAACGA
2-state character
Non-informative character
Character‐Based Tree Reconstruc4on
GOAL: determine the character strings at internal nodes that would best explain the character strings of the n observed species.
An Example
Character   Value 1   Value 2
Mouth       Smile     Frown
Eyebrows    Normal    Pointed
Character‐Based Tree Reconstruc4on
Which tree is better?
Character‐Based Tree Reconstruc4on
Count changes on tree
Character‐Based Tree Reconstruc4on
Maximum Parsimony: minimize number of changes on edges of tree
Maximum Parsimony
• Ockham's razor: the "simplest" explanation for the data
• Assumes that observed character differences resulted from the fewest possible mutations
• Seeks the tree with the lowest possible parsimony score, defined as the sum of the costs of all mutations found in the tree
Character Matrix
Given n species, each labeled by m characters. Each character has k possible states.
n x m character matrix
Assume that characters in character string are independent.
Gorilla: CCTGTGACGTAACAAACGA Chimpanzee: CCTGTGACGTAGCAAACGA Human: CCTGTGACGTAGCAAACGA
Parsimony Score
Assume that characters in character string are independent.
Given character strings S = s1…sm and T = t1…tm:

#changes(S, T) = Σi dH(si, ti)

where dH is the Hamming distance: dH(v, w) = 0 if v = w, and dH(v, w) = 1 otherwise.

The parsimony score of a tree is the sum of the lengths (weights) of its edges.
Gorilla: CCTGTGACGTAACAAACGA Chimpanzee: CCTGTGACGTAGCAAACGA Human: CCTGTGACGTAGCAAACGA
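With the 0/1 Hamming cost, the number of changes between two aligned strings is straightforward to compute; a minimal sketch (the function name is illustrative) using the three sequences above:

```python
def hamming(s, t):
    """Number of positions where the aligned strings s and t differ."""
    assert len(s) == len(t), "strings must be aligned"
    return sum(1 for a, b in zip(s, t) if a != b)

gorilla = "CCTGTGACGTAACAAACGA"
chimp   = "CCTGTGACGTAGCAAACGA"
human   = "CCTGTGACGTAGCAAACGA"
print(hamming(gorilla, chimp))  # 1 (A vs. G at index 11)
print(hamming(chimp, human))    # 0
```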
Parsimony and Tree Reconstruc4on
Maximum Parsimony
Two computational sub-problems:
1. Find the parsimony score for a fixed tree.
   – Small Parsimony Problem (easy)
2. Find the lowest parsimony score over all trees with n leaves.
   – Large Parsimony Problem (hard)
Small Parsimony Problem
Input: Tree T with each leaf labeled by an m-character string.

Output: Labeling of the internal vertices of the tree T minimizing the parsimony score.

Since characters are independent, we can solve the problem one character at a time, so assume every leaf is labeled by a single character.
Small Parsimony Problem

Input: T: tree with each leaf labeled by an m-character string.
Output: Labeling of the internal vertices of the tree T minimizing the parsimony score.

Large Parsimony Problem

Input: M: an n x m character matrix.
Output: A tree T with:
• n leaves labeled by the n rows of matrix M
• a labeling of the internal vertices of T minimizing the parsimony score over all possible trees and all possible labelings of internal vertices
Small Parsimony Problem
Input: Binary tree T with each leaf labeled by an m-character string.

Output: Labeling of the internal vertices of the tree T minimizing the parsimony score.

Since characters are independent, every leaf is labeled by a single character.
Weighted Small Parsimony Problem
More general version of the Small Parsimony Problem:

• The input includes a k x k scoring matrix δ describing the cost of transforming each of the k states into any other state.
• The Small Parsimony Problem is the special case δij = 0 if i = j, and 1 otherwise.
Scoring Matrices

Small Parsimony Problem:

    A  T  G  C
A   0  1  1  1
T   1  0  1  1
G   1  1  0  1
C   1  1  1  0

Weighted Small Parsimony Problem:

    A  T  G  C
A   0  3  4  9
T   3  0  2  4
G   4  2  0  4
C   9  4  4  0
Unweighted vs. Weighted
Small Parsimony Scoring Matrix:

    A  T  G  C
A   0  1  1  1
T   1  0  1  1
G   1  1  0  1
C   1  1  1  0
Small Parsimony Score: 5
Unweighted vs. Weighted
Weighted Parsimony Scoring Matrix:

    A  T  G  C
A   0  3  4  9
T   3  0  2  4
G   4  2  0  4
C   9  4  4  0
Weighted Parsimony Score: 22
Weighted Small Parsimony Problem
Input: T: tree with each leaf labeled by an m-character string from a k-letter alphabet.
δ: a k x k scoring matrix.

Output: Labeling of the internal vertices of the tree T minimizing the weighted parsimony score.
Sankoff Algorithm
Calculate and keep track of a score for every possible label at each vertex: st(v) = minimum parsimony score of the subtree rooted at vertex v if v has character t
Sankoff Algorithm
st(v) = minimum parsimony score of the subtree rooted at vertex v if v has character t
The score st(v) is based only on the scores of v's children:

st(parent) = mini { si(left child) + δi,t } + minj { sj(right child) + δj,t }
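This upward pass can be sketched in Python for a single character; the dict-based tree encoding and all names are illustrative assumptions, while the scoring matrix is the weighted matrix from these slides:

```python
import math

STATES = "ATGC"
# Weighted scoring matrix from the slides; DELTA[(i, t)] = cost of i -> t.
M = [[0, 3, 4, 9],
     [3, 0, 2, 4],
     [4, 2, 0, 4],
     [9, 4, 4, 0]]
DELTA = {(a, b): M[i][j] for i, a in enumerate(STATES) for j, b in enumerate(STATES)}

def sankoff_scores(tree, leaf_label):
    """Upward pass of Sankoff's algorithm for one character.
    tree: internal node -> (left child, right child);
    leaf_label: leaf -> observed state.
    Returns (s, root) with s[v][t] = minimum parsimony score of the
    subtree rooted at v, given that v has state t."""
    s = {}
    def visit(v):
        if v in leaf_label:  # leaf: 0 for its observed state, infinity otherwise
            s[v] = {t: (0 if t == leaf_label[v] else math.inf) for t in STATES}
            return
        u, w = tree[v]
        visit(u)
        visit(w)
        # s_t(v) = min_i {s_i(u) + delta(i, t)} + min_j {s_j(w) + delta(j, t)}
        s[v] = {t: min(s[u][i] + DELTA[(i, t)] for i in STATES)
                 + min(s[w][j] + DELTA[(j, t)] for j in STATES)
                for t in STATES}
    children = {c for pair in tree.values() for c in pair}
    root = next(v for v in tree if v not in children)  # node that is nobody's child
    visit(root)
    return s, root

# Four leaves A, C, T, G, matching the worked example in the slides.
tree = {"root": ("u", "w"), "u": ("l1", "l2"), "w": ("l3", "l4")}
s, root = sankoff_scores(tree, {"l1": "A", "l2": "C", "l3": "T", "l4": "G"})
print(min(s[root].values()))  # 9, achieved by state T
```

The minimum over the root's scores (9 here, for state T) is the weighted parsimony score of the tree, as the worked example below confirms.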
Sankoff Algorithm (cont.)
• Begin at the leaves:
  – If the leaf has the character in question, the score is 0
  – Else, the score is ∞
Sankoff Algorithm (cont.)
st(v) = mini {si(u) + δi, t} + minj{sj(w) + δj, t}
sA(v) = mini{si(u) + δi, A} + minj{sj(w) + δj, A}
Left child u (a leaf labeled A):

i    si(u)   δi,A   sum
A    0       0      0
T    ∞       3      ∞
G    ∞       4      ∞
C    ∞       9      ∞

mini { si(u) + δi,A } = 0
Sankoff Algorithm (cont.)
st(v) = mini {si(u) + δi, t} + minj{sj(w) + δj, t}
sA(v) = mini{si(u) + δi, A} + minj{sj(w) + δj, A}
Right child w (a leaf labeled C):

j    sj(w)   δj,A   sum
A    ∞       0      ∞
T    ∞       3      ∞
G    ∞       4      ∞
C    0       9      9

minj { sj(w) + δj,A } = 9, so sA(v) = 0 + 9 = 9
Sankoff Algorithm (cont.)
st(v) = mini {si(u) + δi, t} + minj{sj(w) + δj, t}
Repeat for T, G, and C
Sankoff Algorithm (cont.)
Repeat for right subtree
Sankoff Algorithm (cont.)
Repeat for root
Sankoff Algorithm (cont.)
The smallest score at the root is the minimum weighted parsimony score. In this case it is 9, achieved by T, so label the root with T.
Sankoff Algorithm: Traveling down the Tree
• The scores at the root vertex have been computed by going up the tree.
• After the scores at the root vertex are computed, the Sankoff algorithm moves down the tree and assigns each vertex its optimal character.
Sankoff Algorithm (cont.)
The root's score 9 is derived from 7 + 2, so the left child is labeled T and the right child is labeled T.
Sankoff Algorithm (cont.)
And the tree is thus labeled…
Analysis of Sankoff's Algorithm

A dynamic programming algorithm.

Optimal substructure: the solution is obtained by solving smaller problems of the same type:

st(parent) = mini { si(left child) + δi,t } + minj { sj(right child) + δj,t }

The recurrence terminates at the leaves, where the solution is known.
Analysis of Sankoff's Algorithm

How many computations do we perform for n species, m characters, and k states per character?

Forward step. At each internal node of the tree:
st(parent) = mini { si(left child) + δi,t } + minj { sj(right child) + δj,t }
• 2k sums and 2(k-1) comparisons = 4k - 2 operations
• n-1 internal nodes, so (4k - 2)(n - 1) sums

Traceback: one "lookup" per internal node: (n - 1) operations.

For each character: (4k - 2)(n - 1) + (n - 1) operations ≤ C n k.
The calculation above is performed once for each character:
• ≤ C m n k operations, i.e., O(mnk) time. ["big-O"]
• Increases linearly with the number of species or the number of characters.
Fitch's Algorithm

• Solves the Small Parsimony Problem
  – Published in 1971, four years before Sankoff's algorithm
• Makes two passes through the tree:
  – Leaves → root
  – Root → leaves
Fitch Algorithm: Step 1
Assign a set S(v) of letters to every vertex v in the tree, traversing the tree from the leaves to the root.

• S(l) = observed character for each leaf l
• For a vertex v with children u and w:

S(v) = S(u) ∩ S(w) if the intersection is non-empty,
S(v) = S(u) ∪ S(w) otherwise.

• E.g., if the node we are looking at has a left child labeled {A, C} and a right child labeled {A, T}, the intersection is the non-empty set {A}, so the node is given the set {A}; with children {A, C} and {G, T} the intersection is empty, and the node is given the union {A, C, G, T}.
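Step 1 can be sketched in Python; the dict-based tree encoding and all names are illustrative assumptions. Counting the union events also gives the parsimony score:

```python
def fitch_sets(tree, leaf_char):
    """Upward pass of Fitch's algorithm for one character.
    tree: internal node -> (left child, right child);
    leaf_char: leaf -> observed character.
    Returns (S, changes): the candidate set S[v] for every vertex, and
    the number of union events, which equals the parsimony score."""
    S, changes = {}, 0
    def visit(v):
        nonlocal changes
        if v in leaf_char:
            S[v] = {leaf_char[v]}    # leaf: its observed character
            return
        u, w = tree[v]
        visit(u)
        visit(w)
        if S[u] & S[w]:
            S[v] = S[u] & S[w]       # non-empty intersection
        else:
            S[v] = S[u] | S[w]       # union: costs one change
            changes += 1
    children = {c for pair in tree.values() for c in pair}
    root = next(v for v in tree if v not in children)  # node that is nobody's child
    visit(root)
    return S, changes

# Tiny example: leaves x = a, z = c under vertex u; leaf y = a under the root.
tree = {"r": ("u", "y"), "u": ("x", "z")}
S, changes = fitch_sets(tree, {"x": "a", "z": "c", "y": "a"})
print(S["r"], changes)  # {'a'} 1
```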
Fitch’s Algorithm: Example
[Figure: example tree for Fitch's algorithm. The leaves are labeled with single characters a, c, and t; the upward pass assigns internal vertices candidate sets such as {a,c} and {t,a}.]
Fitch Algorithm: Step 2
Assign labels to each vertex, traversing the tree from the root to the leaves.

• Assign the root r a label chosen arbitrarily from its set S(r)
• For all other vertices v:
  – If its parent's label is in its set S(v), assign it its parent's label
  – Else, choose an arbitrary letter from its set S(v) as its label
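Step 2 can be sketched as follows. The candidate sets S are taken as given (here written out by hand for a tiny tree), the "arbitrary" choice is made deterministic by picking the alphabetically smallest letter, and all names are illustrative:

```python
def fitch_traceback(tree, S, root):
    """Downward pass of Fitch's algorithm: assign one character per vertex,
    given the candidate sets S computed by the upward pass.
    tree: internal node -> (left child, right child)."""
    label = {}
    def visit(v, parent_char):
        if parent_char is not None and parent_char in S[v]:
            label[v] = parent_char     # keep the parent's character: no change
        else:
            label[v] = min(S[v])       # arbitrary (here: alphabetical) choice
        if v in tree:
            u, w = tree[v]
            visit(u, label[v])
            visit(w, label[v])
    visit(root, None)
    return label

# Hand-computed sets for a tiny tree: leaves x = a, z = c, y = a.
tree = {"r": ("u", "y"), "u": ("x", "z")}
S = {"x": {"a"}, "z": {"c"}, "y": {"a"}, "u": {"a", "c"}, "r": {"a"}}
label = fitch_traceback(tree, S, "r")
print(label)  # {'r': 'a', 'u': 'a', 'x': 'a', 'z': 'c', 'y': 'a'}
```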
Fitch’s Algorithm: Example
[Figure: the same example tree after the downward pass; starting from the root, each vertex has been assigned a single character (a, t, or c) from its candidate set.]
Fitch vs. Sankoff
• Both have an O(nk) runtime
• Are they actually different?
• Let’s compare …
Comparison of Fitch and Sankoff
• As seen earlier, the scoring matrix for the Fitch algorithm is simply:

    A  T  G  C
A   0  1  1  1
T   1  0  1  1
G   1  1  0  1
C   1  1  1  0

• So let's do the same problem using the Sankoff algorithm and this scoring matrix
Sankoff vs. Fitch
• The Sankoff algorithm gives the same set of optimal labels as the Fitch algorithm
• For the Sankoff algorithm, character t is optimal for vertex v if st(v) = min1≤i≤k si(v)
• Let Sv = the set of optimal letters for vertex v, with children u and w. Then:

Sv = Su ∩ Sw if Su ∩ Sw ≠ ∅,
Sv = Su ∪ Sw otherwise.

• This is also the Fitch recurrence
• The two algorithms are identical
A Problem with Parsimony
Parsimony ignores branch lengths on trees.

[Figure: two trees, each with leaves labeled A and C and a single character change; in one tree the change falls on a short branch, in the other on a long branch.]

Same parsimony score. The mutation is "more likely" on the longer branch.
Probabilistic Model

Given a tree T with leaves labeled by present-day characters, what is the probability of a labeling of the ancestral nodes?

Assume:
1. Characters evolve independently.
2. Constant rate of mutation on each branch.
3. The state of a vertex depends only on its parent and the branch length:
   i.e., Pr[x | y, t] depends only on y and t. (Markov process)

Pr[x | y, t] = probability that y mutates to x in time t
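The slides leave Pr[x | y, t] abstract. As one concrete instance (an illustration, not from the lecture), the Jukes-Cantor model assigns all substitutions a single rate mu and has a closed form:

```python
import math

def jc_prob(x, y, t, mu=1.0):
    """Jukes-Cantor transition probability Pr[x | y, t]: the probability
    that state y becomes x after time t at substitution rate mu."""
    e = math.exp(-4.0 * mu * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

# Sanity checks: each row sums to 1, and every entry tends to 1/4 as t grows.
row = sum(jc_prob(x, "A", 0.5) for x in "ACGT")
print(round(row, 10))                       # 1.0
print(round(jc_prob("A", "A", 100.0), 4))   # 0.25
```

At t = 0 the matrix is the identity (no time for mutation), satisfying assumption 3's dependence on y and t only.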
Probabilistic Model

Two species. Let T = the tree topology; x1, x2 = the characters of the two species; a = the character of their ancestor; t1, t2 = the branch lengths; and qa = Pr[ancestor has character a]. Then:

Pr[x1, x2, a | T, t1, t2] = qa Pr[x1 | a, t1] Pr[x2 | a, t2]

where Pr[x | y, t] = probability that y mutates to x in time t.
Probabilistic Model

n species: x1, x2, …, xn. Let α(i) = ancestor of node i. Let an+1, an+2, …, a2n-1 be the characters on the internal nodes, where the nodes are numbered from the internal vertices up to the root (node 2n-1). Then:

Pr[x1, …, xn | T, t1, …, t2n-2] = Σ over an+1, …, a2n-1 of  qa2n-1 · Πi=n+1…2n-2 Pr[ai | aα(i), ti] · Πi=1…n Pr[xi | aα(i), ti]

This follows from the Law of Total Probability: P(X) = Σi P(X | Yi) P(Yi).
Felsenstein’s Algorithm
Let Pr[Tk | a] = probability of the leaf nodes "below" node k, given ak = a.

Compute via dynamic programming: for an internal node k with children i and j,

Pr[Tk | a] = ( Σb Pr[b | a, ti] Pr[Ti | b] ) · ( Σc Pr[c | a, tj] Pr[Tj | c] )

Initial conditions: for k = 1, …, n (the leaf nodes),
Pr[Tk | a] = 1 if a = xk, and 0 otherwise.
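A sketch of the recursion for one character, assuming the informal dict-based tree encoding used here and an illustrative Jukes-Cantor transition function (neither is specified in the slides):

```python
import math

STATES = "ACGT"

def jc(x, y, t):
    """Illustrative Jukes-Cantor transition probability Pr[x | y, t]."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def felsenstein(tree, blen, leaf_char, prob, root):
    """Felsenstein's pruning algorithm for one character.
    tree: internal node -> (left child, right child); blen: child -> branch length;
    leaf_char: leaf -> observed state; prob(x, y, t) = Pr[x | y, t].
    Returns L with L[v][a] = Pr[leaves below v | state at v is a]."""
    L = {}
    def visit(v):
        if v in leaf_char:   # initial conditions at the leaves
            L[v] = {a: (1.0 if a == leaf_char[v] else 0.0) for a in STATES}
            return
        u, w = tree[v]
        visit(u)
        visit(w)
        # Pr[T_v | a] = (sum_b P(b|a,t_u) L[u][b]) * (sum_c P(c|a,t_w) L[w][c])
        L[v] = {a: sum(prob(b, a, blen[u]) * L[u][b] for b in STATES)
                 * sum(prob(c, a, blen[w]) * L[w][c] for c in STATES)
                for a in STATES}
    visit(root)
    return L

# Two leaves, both observed as A, each at distance 0.1 from the root.
tree = {"r": ("u", "w")}
L = felsenstein(tree, {"u": 0.1, "w": 0.1}, {"u": "A", "w": "A"}, jc, "r")
likelihood = sum(0.25 * L["r"][a] for a in STATES)  # uniform q_a = 1/4
print(0.0 < likelihood < 1.0, L["r"]["A"] > L["r"]["C"])  # True True
```

The final line computes Σa qa Pr[Troot | a], the overall likelihood from the next slide.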
Computing the Likelihood

Let Pr[Tk | a] = probability of the leaf nodes "below" node k, given ak = a. Then:

Pr[x1, …, xn | T, t1, …, t2n-2] = Σa Pr[T2n-1 | a] qa

Note: the root is node 2n-1.
Maximum Likelihood
Let Pr[Tk | a] = probability of the leaf nodes "below" node k, given ak = a. Replace the sums in the recursion with maximizations:

Pr[Tk | a] = ( maxb Pr[b | a, ti] Pr[Ti | b] ) · ( maxc Pr[c | a, tj] Pr[Tj | c] )

Then traceback as before with Sankoff's algorithm.
Max. Parsimony vs. Max. Likelihood
• Set δij = -log P(j | i) in weighted parsimony (Sankoff algorithm)
• Weighted parsimony then produces "maximum probability" assignments, while ignoring branch lengths
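The correspondence can be sketched numerically: with δij = -log P(j | i), minimizing a sum of δ costs over edges is the same as maximizing the product of edge probabilities, since the logarithm turns products into sums. The probabilities below are made-up illustrative values, not from the slides:

```python
import math

# Hypothetical per-edge substitution probabilities P(child | parent).
P = {("A", "A"): 0.90, ("A", "G"): 0.06, ("A", "C"): 0.02, ("A", "T"): 0.02}

# Weighted-parsimony costs: low cost for likely events, high cost for rare ones.
delta = {pair: -math.log(p) for pair, p in P.items()}

# Minimizing the sum of deltas == maximizing the product of probabilities.
edges = [("A", "A"), ("A", "G")]
cost = sum(delta[e] for e in edges)
prob = math.prod(P[e] for e in edges)
print(math.isclose(cost, -math.log(prob)))  # True
```

Note the caveat from the slide: δ here does not depend on the branch length t, which is exactly what "ignoring branch lengths" means.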