


CSCI1950-Z Computational Methods for Biology

Lecture 2 

Ben Raphael January 26, 2009 

http://cs.brown.edu/courses/csci1950-z/

Outline 

•  Review of trees. Counting features.
•  Character-based phylogeny

– Maximum parsimony 

– Maximum likelihood 


Tree Definitions

tree: A connected acyclic graph G = (V, E).

graph: A set V of vertices (nodes) and a set E of edges, where each edge (v_i, v_j) connects a pair of vertices.

A path in G is a sequence (v_1, v_2, …, v_n) of vertices in V such that each (v_i, v_{i+1}) is an edge in E.

A graph is connected provided that for every pair v_i, v_j of vertices, there is a path between v_i and v_j.

A cycle is a path with the same starting and ending vertices.

A graph is acyclic provided it has no cycles.

Tree Definitions

The degree of a vertex v is the number of edges incident to v.

A phylogenetic tree is a tree with a label for each leaf (vertex of degree one).

A binary phylogenetic tree is a phylogenetic tree in which every interior (non-leaf) vertex has degree 3 (one parent and two children).

A rooted (*binary) phylogenetic tree is a phylogenetic tree with a single designated root vertex r (* of degree 2).

w is a parent (ancestor) of v provided (v, w) is on the path from v to the root. In this case v is a child (descendant) of w.


Tree Definitions

tree: A connected acyclic graph G = (V, E). The degree of a vertex v is the number of edges incident to v.

A phylogenetic tree is a tree with a label for each leaf (vertex of degree one).

•  Leaves represent existing species.
•  Other vertices represent most recent common ancestors.
•  Branch lengths represent evolutionary time.
•  The root (if present) represents the oldest evolutionary ancestor.

Counting and Trees

•  A tree with n vertices has n-1 edges. (Proof?)
•  A rooted binary phylogenetic tree with n leaves has n-1 internal vertices, and thus 2n-1 total vertices.

•  How many rooted binary phylogenetic trees with n leaves?
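For reference (the count is not derived on this slide), the standard answer is the double factorial (2n-3)!! for n ≥ 2. A minimal Python sketch that evaluates it; the function name is illustrative:

    def num_rooted_binary_trees(n):
        """Number of rooted binary phylogenetic trees on n labeled leaves: (2n-3)!!."""
        count = 1
        for k in range(3, 2 * n - 2, 2):   # odd factors 3, 5, ..., 2n-3
            count *= k
        return count

    # 2 leaves -> 1, 3 -> 3, 4 -> 15, 5 -> 105, 10 -> 34,459,425
    for n in (2, 3, 4, 5, 10):
        print(n, num_rooted_binary_trees(n))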


Character-Based Phylogenetic Tree Reconstruction

1.  What is character data?
2.  What is the criterion for evaluating a tree?
3.  How do we optimize this criterion:
    1.  Over all possible trees?
    2.  Over a restricted class of trees?

(Diagram: Input (characters: molecular or morphological) → Algorithm → Output (optimal phylogenetic tree).)

Character-Based Tree Reconstruction

•  Characters may be nucleotides of DNA (A, G, C, T) or amino acids (20-letter alphabet).

•  Values are called states of the character.

•  Characters may be morphological features: the number of eyes or legs, or the shape of a beak or fin.

Gorilla: CCTGTGACGTAACAAACGA Chimpanzee: CCTGTGACGTAGCAAACGA Human: CCTGTGACGTAGCAAACGA

2‐state character 

Non-informative character


Character-Based Tree Reconstruction

GOAL: determine what character strings at internal nodes would best explain the character strings for the n observed species 

An Example

Character    Value 1    Value 2
Mouth        Smile      Frown
Eyebrows     Normal     Pointed


Character-Based Tree Reconstruction

Which tree is better?

Character-Based Tree Reconstruction

Count changes on tree 


Character-Based Tree Reconstruction

Maximum Parsimony: minimize number of changes on edges of tree 

Maximum Parsimony 

•  Ockham's razor: "simplest" explanation for the data
•  Assumes that observed character differences resulted from the fewest possible mutations
•  Seeks the tree with the lowest possible parsimony score, defined as the sum of the costs of all mutations found in the tree


Character Matrix 

Given n species, each labeled by m characters. Each character has k possible states. 

n x m character matrix 

Assume that the characters in a character string are independent.

Gorilla: CCTGTGACGTAACAAACGA Chimpanzee: CCTGTGACGTAGCAAACGA Human: CCTGTGACGTAGCAAACGA

Parsimony Score 

Assume that the characters in a character string are independent.

Given character strings S = s_1…s_m and T = t_1…t_m:

    #changes(S → T) = Σ_i d_H(s_i, t_i)

where d_H is the Hamming distance:
    d_H(v, w) = 0 if v = w
    d_H(v, w) = 1 otherwise

The parsimony score of the tree is defined as the sum of the lengths (weights) of its edges.

Gorilla: CCTGTGACGTAACAAACGA Chimpanzee: CCTGTGACGTAGCAAACGA Human: CCTGTGACGTAGCAAACGA
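A minimal Python sketch of this per-site change count, using the sequences above (the function name is illustrative):

    def hamming(s, t):
        """Summed d_H: the number of positions at which s and t differ."""
        assert len(s) == len(t)
        return sum(1 for a, b in zip(s, t) if a != b)

    gorilla = "CCTGTGACGTAACAAACGA"
    chimp   = "CCTGTGACGTAGCAAACGA"
    human   = "CCTGTGACGTAGCAAACGA"
    print(hamming(gorilla, chimp))   # 1
    print(hamming(chimp, human))     # 0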


Parsimony and Tree Reconstruction

Maximum Parsimony 

Two computational sub-problems:

1.  Find the parsimony score for a fixed tree.
    –  Small Parsimony Problem (easy)

2.  Find the lowest parsimony score over all trees with n leaves.
    –  Large Parsimony Problem (hard)


Small Parsimony Problem 

Input: Tree T with each leaf labeled by an m‐character string. 

Output: Labeling of internal vertices of the tree T minimizing the parsimony score.

Since characters are independent, we can solve the problem one character at a time; so assume every leaf is labeled by a single character.

Small Parsimony Problem

Input: T: tree with each leaf labeled by an m-character string.
Output: Labeling of internal vertices of the tree T minimizing the parsimony score.

Large Parsimony Problem

Input: M: an n x m character matrix.
Output: A tree T with:
•  n leaves labeled by the n rows of matrix M
•  a labeling of the internal vertices of T minimizing the parsimony score over all possible trees and all possible labelings of internal vertices


Small Parsimony Problem 

Input: Binary tree T with each leaf labeled by an m‐character string. 

Output: Labeling of internal vertices of the tree T minimizing the parsimony score.

Since characters are independent, we can solve the problem one character at a time; so assume every leaf is labeled by a single character.

Weighted Small Parsimony Problem 

More general version of Small Parsimony Problem 

•  Input includes a k x k scoring matrix δ describing the cost of transforming each of k states into another state.  

•  The Small Parsimony Problem is the special case:
       δ_{i,j} = 0, if i = j
       δ_{i,j} = 1, otherwise


Scoring Matrices 

Small Parsimony Problem (unweighted):

        A  T  G  C
    A   0  1  1  1
    T   1  0  1  1
    G   1  1  0  1
    C   1  1  1  0

Weighted Small Parsimony Problem:

        A  T  G  C
    A   0  3  4  9
    T   3  0  2  4
    G   4  2  0  4
    C   9  4  4  0
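One possible in-code representation of these two matrices, as nested Python dicts keyed by state (the variable names are illustrative); the same structure is reused in the Sankoff sketches later in these notes:

    UNWEIGHTED = {a: {b: (0 if a == b else 1) for b in "ATGC"} for a in "ATGC"}

    WEIGHTED = {
        "A": {"A": 0, "T": 3, "G": 4, "C": 9},
        "T": {"A": 3, "T": 0, "G": 2, "C": 4},
        "G": {"A": 4, "T": 2, "G": 0, "C": 4},
        "C": {"A": 9, "T": 4, "G": 4, "C": 0},
    }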

Unweighted vs. Weighted 

Small Parsimony Scoring Matrix:

        A  T  G  C
    A   0  1  1  1
    T   1  0  1  1
    G   1  1  0  1
    C   1  1  1  0

Small Parsimony Score: 5


Unweighted vs. Weighted 

Weighted Parsimony Scoring Matrix:

        A  T  G  C
    A   0  3  4  9
    T   3  0  2  4
    G   4  2  0  4
    C   9  4  4  0

Weighted Parsimony Score: 22

Weighted Small Parsimony Problem 

Input: T: tree with each leaf labeled by an m-character string from a k-letter alphabet.

δ: k x k scoring matrix

Output: Labeling of internal vertices of the tree T minimizing the weighted parsimony score.


Sankoff Algorithm 

Calculate and keep track of a score for every possible label at each vertex:

    s_t(v) = minimum parsimony score of the subtree rooted at vertex v if v has character t


Sankoff Algorithm 

s_t(v) = minimum parsimony score of the subtree rooted at vertex v if v has character t

The score s_t(v) is based only on the scores of its children:

    s_t(parent) = min_i {s_i(left child) + δ_{i,t}} + min_j {s_j(right child) + δ_{j,t}}



Sankoff Algorithm (cont.) 

•  Begin at the leaves:
   –  If the leaf has the character in question, its score is 0
   –  Else, its score is ∞
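Putting the leaf initialization and the recurrence together, a minimal Python sketch of the forward (leaves-to-root) pass for a single character. The tree encoding, node names, and the tables dictionary are illustrative assumptions, not notation from the slides:

    import math

    def sankoff_scores(node, children, leaf_char, delta, states, tables):
        """Fill tables[node] = {t: s_t(node)} for the subtree rooted at node."""
        if children[node] is None:                # leaf: 0 for its own character, infinity otherwise
            tables[node] = {t: (0 if leaf_char[node] == t else math.inf)
                            for t in states}
        else:
            left, right = children[node]
            sankoff_scores(left, children, leaf_char, delta, states, tables)
            sankoff_scores(right, children, leaf_char, delta, states, tables)
            tables[node] = {t: min(tables[left][i] + delta[i][t] for i in states) +
                               min(tables[right][j] + delta[j][t] for j in states)
                            for t in states}
        return tables[node]

    # Toy example: tree ((A, C), (T, T)) scored with the unit-cost matrix.
    states = "ATGC"
    delta = {a: {b: (0 if a == b else 1) for b in states} for a in states}
    children = {"root": ("u", "w"), "u": ("x1", "x2"), "w": ("x3", "x4"),
                "x1": None, "x2": None, "x3": None, "x4": None}
    leaf_char = {"x1": "A", "x2": "C", "x3": "T", "x4": "T"}
    tables = {}
    root_scores = sankoff_scores("root", children, leaf_char, delta, states, tables)
    print(min(root_scores.values()))   # minimum parsimony score for this character: 2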

Sankoff Algorithm (cont.) 

s_t(v) = min_i {s_i(u) + δ_{i,t}} + min_j {s_j(w) + δ_{j,t}}

s_A(v) = min_i {s_i(u) + δ_{i,A}} + min_j {s_j(w) + δ_{j,A}}

Left child u:

        s_i(u)   δ_{i,A}   sum
    A   0        0         0
    T   ∞        3         ∞
    G   ∞        4         ∞
    C   ∞        9         ∞

min_i {s_i(u) + δ_{i,A}} = 0 (the left-child contribution to s_A(v))


Sankoff Algorithm (cont.) 

s_t(v) = min_i {s_i(u) + δ_{i,t}} + min_j {s_j(w) + δ_{j,t}}

s_A(v) = min_i {s_i(u) + δ_{i,A}} + min_j {s_j(w) + δ_{j,A}}

Right child w:

        s_j(w)   δ_{j,A}   sum
    A   ∞        0         ∞
    T   ∞        3         ∞
    G   ∞        4         ∞
    C   0        9         9

min_j {s_j(w) + δ_{j,A}} = 9, so s_A(v) = 0 + 9 = 9

Sankoff Algorithm (cont.) 

s_t(v) = min_i {s_i(u) + δ_{i,t}} + min_j {s_j(w) + δ_{j,t}}

Repeat for T, G, and C


Sankoff Algorithm (cont.) 

Repeat for right subtree

Sankoff Algorithm (cont.) 

Repeat for root


Sankoff Algorithm (cont.) 

Smallest score at root is minimum weighted parsimony score

In this case, 9 – so label with T

Sankoff Algorithm:  Traveling down the Tree 

•  The scores at the root vertex have been computed by going up the tree  

•  After the scores at the root vertex are computed, the Sankoff algorithm moves down the tree and assigns each vertex an optimal character.
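A matching sketch of the downward pass, reusing the tables dictionary filled by the forward-pass sketch above (again, the encoding and names are illustrative assumptions):

    def sankoff_traceback(node, state, children, tables, delta, states, labels):
        """Assign state to node, then pick an optimal state for each child."""
        labels[node] = state
        if children[node] is None:        # leaf: its observed character is the only finite option
            return
        for child in children[node]:
            best = min(states, key=lambda i: tables[child][i] + delta[i][state])
            sankoff_traceback(child, best, children, tables, delta, states, labels)

    # Continue the toy example: start from a best state at the root.
    labels = {}
    root_state = min(states, key=lambda t: tables["root"][t])
    sankoff_traceback("root", root_state, children, tables, delta, states, labels)
    print(labels)   # one optimal labeling of every vertex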


Sankoff Algorithm (cont.) 

9 is derived from 7 + 2, so the left child is labeled T and the right child is labeled T.

Sankoff Algorithm (cont.) 

And the tree is thus labeled…


Analysis of Sankoff’s Algorithm  

A dynamic programming algorithm. Optimal substructure: the solution is obtained by solving smaller problems of the same type.

    s_t(parent) = min_i {s_i(left child) + δ_{i,t}} + min_j {s_j(right child) + δ_{j,t}}

The recurrence terminates at the leaves, where the solution is known.


Analysis of Sankoff's Algorithm

How many computations do we perform for n species, m characters, and k states per character?

Forward step:
•  At each internal node of the tree:
   s_t(parent) = min_i {s_i(left child) + δ_{i,t}} + min_j {s_j(right child) + δ_{j,t}}
•  2k sums and 2(k-1) comparisons = 4k - 2 operations
•  n-1 internal nodes
•  (4k - 2)(n-1) sums

Traceback: one "lookup" per internal node: (n-1) operations

•  For each character: (4k - 2)(n-1) + (n-1) operations ≤ C n k
•  The above calculation is performed once for each character: ≤ C m n k operations
•  O(mnk) time ["big-O"]
•  Increases linearly with the number of species or the number of characters.


Analysis of Sankoff’s Algorithm 

How many computations do we perform for n species, m characters, and k states per character?

Traceback: 2k sums
•  The above calculation is performed once for each character
•  O(mnk) time ["big-O"]
•  Increases linearly with the number of species or the number of characters.

Fitch’s Algorithm 

•  Solves the Small Parsimony Problem
   –  Published in 1971, four years before Sankoff's algorithm
•  Makes two passes through the tree:
   –  Leaves → root.
   –  Root → leaves.


Fitch Algorithm: Step 1 

Assign a set S(v) of letters to every vertex v in the tree, traversing the tree from leaves to root.

•  S(l) = observed character for each leaf l
•  For an internal vertex v with children u and w:
       S(v) = S(u) ∩ S(w), if the intersection is non-empty
       S(v) = S(u) ∪ S(w), otherwise
•  E.g., if the vertex has a left child labeled {A, C} and a right child labeled {A, T}, it is given the intersection {A}; if the children were labeled {A, C} and {G, T}, it would be given the union {A, C, G, T}.

Fitch’s Algorithm: Example 

(Tree figure: vertex sets {t, a}, {t, a}, a, {a, c}, {a, c} computed in the upward pass.)


Fitch Algorithm: Step 2 

Assign labels to each vertex, traversing the tree from root to leaves. 

•  Assign root r arbitrarily from its set S(r) 

•  For all other vertices v:
   –  If its parent's label is in its set S(v), assign it its parent's label
   –  Else, choose an arbitrary letter from its set S(v) as its label
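A compact Python sketch of both passes for a single character on a rooted binary tree. The tree encoding and node names are illustrative assumptions; the exact tree used in the slide's example figures is not recoverable from the text:

    def fitch_up(node, children, leaf_char, sets):
        """Step 1: compute S(v) for every vertex, leaves to root."""
        if children[node] is None:
            sets[node] = {leaf_char[node]}
        else:
            u, w = children[node]
            fitch_up(u, children, leaf_char, sets)
            fitch_up(w, children, leaf_char, sets)
            common = sets[u] & sets[w]
            sets[node] = common if common else (sets[u] | sets[w])
        return sets[node]

    def fitch_down(node, parent_label, children, sets, labels):
        """Step 2: assign labels, root to leaves."""
        if parent_label is not None and parent_label in sets[node]:
            labels[node] = parent_label
        else:
            labels[node] = next(iter(sets[node]))    # arbitrary member of S(v)
        if children[node] is not None:
            for child in children[node]:
                fitch_down(child, labels[node], children, sets, labels)

    # Toy tree ((t, a), (c, a)):
    children = {"r": ("v1", "v2"), "v1": ("l1", "l2"), "v2": ("l3", "l4"),
                "l1": None, "l2": None, "l3": None, "l4": None}
    leaf_char = {"l1": "t", "l2": "a", "l3": "c", "l4": "a"}
    sets, labels = {}, {}
    fitch_up("r", children, leaf_char, sets)
    fitch_down("r", None, children, sets, labels)
    print(sets["r"], labels)    # root set {'a'}; every vertex gets a concrete label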

Fitch’s Algorithm: Example 

(Tree figure: the Step 1 example with a label assigned to each vertex in the downward pass, e.g. the {a, c} vertex is labeled a.)


Fitch Algorithm (cont.) 

Fitch vs. Sankoff 

•  Both have an O(nk) runtime

•  Are they actually different? 

•  Let’s compare … 


Fitch 

As seen previously:

Comparison of Fitch and Sankoff 

•  As seen earlier, the scoring matrix for the Fitch algorithm is merely: 

•  So let's do the same problem using the Sankoff algorithm and this scoring matrix

        A  T  G  C
    A   0  1  1  1
    T   1  0  1  1
    G   1  1  0  1
    C   1  1  1  0


Sankoff 

Sankoff vs. Fitch 

•  The Sankoff algorithm gives the same set of optimal labels as the Fitch algorithm
•  For the Sankoff algorithm, character t is optimal for vertex v if s_t(v) = min_{1≤i≤k} s_i(v)
•  Let S_v = the set of optimal letters for v. Then, for a vertex v with children u and w:
       S_v = S_u ∩ S_w, if S_u ∩ S_w ≠ ∅
       S_v = S_u ∪ S_w, otherwise
•  This is also the Fitch recurrence
•  The two algorithms are identical


A Problem with Parsimony 

Ignores branch lengths on trees 

(Figure: two trees with the same leaf labels, using characters A and C, but different branch lengths.)

Same parsimony score. Mutation "more likely" on longer branch.

Probabilistic Model

Given a tree T with leaves labeled by present characters, what is the probability of a labeling of ancestral nodes? 

Assume:
1.  Characters evolve independently.
2.  Constant rate of mutation on each branch.
3.  The state of a vertex depends only on its parent and the branch length:

i.e. Pr[ x | y, t] depends only on y and t. (Markov process) 

Pr[x | y, t] = probability that y mutates to x in time t


Probabilistic Model

Two species

Pr[x_1, x_2, a | T, t_1, t_2] = q_a Pr[x_1 | a, t_1] Pr[x_2 | a, t_2]

Pr[x | y, t] = probability that y mutates to x in time t

T = tree topology
x_1, x_2: characters for each species
a: character for the ancestor

(Figure: ancestor a with branches of lengths t_1 and t_2 leading to leaves x_1 and x_2.)

q_a = Pr[ancestor has character a]
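As code, the two-species formula is a direct product. The helper trans_prob(x, a, t), standing for Pr[x | a, t], and the prior q are assumptions about whatever substitution model is in use:

    def joint_prob_two_species(x1, x2, a, t1, t2, q, trans_prob):
        """Pr[x1, x2, a | T, t1, t2] = q_a * Pr[x1 | a, t1] * Pr[x2 | a, t2]."""
        return q[a] * trans_prob(x1, a, t1) * trans_prob(x2, a, t2)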

Probabilistic Model

n species: x_1, x_2, …, x_n

Let α(i) = ancestor of node i. Let a_{n+1}, a_{n+2}, …, a_{2n-1} = characters on the internal nodes, where nodes are numbered from the internal vertices up to the root.

Pr[x_1, …, x_n | T, t_1, …, t_{2n-2}] = Σ_{a_{n+1}, …, a_{2n-1}} q_{a_{2n-1}} ∏_{i=n+1}^{2n-2} Pr[a_i | a_{α(i)}, t_i] ∏_{i=1}^{n} Pr[x_i | a_{α(i)}, t_i]

Follows from the Law of Total Probability: P(X) = Σ_i P(X | Y_i) P(Y_i).


Felsenstein’s Algorithm 

Let Pr[T_k | a] = probability of the leaf nodes "below" node k, given a_k = a.

Compute via dynamic programming 

Pr[T_k | a] = ( Σ_b Pr[b | a, t_i] Pr[T_i | b] ) · ( Σ_c Pr[c | a, t_j] Pr[T_j | c] ),

where i and j are the children of node k.

Initial conditions. For k = 1, …, n (leaf nodes):
    Pr[T_k | a] = 1, if a = x_k
    Pr[T_k | a] = 0, otherwise.


Computing the Likelihood

Let Pr[T_k | a] = probability of the leaf nodes "below" node k, given a_k = a.

Pr[x_1, …, x_n | T, t_1, …, t_{2n-2}] = Σ_a Pr[T_{2n-1} | a] q_a

Note: Root is node 2n‐1 
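A minimal Python sketch of the pruning recursion and the root sum above, for a single site. The tree encoding, node names, and the trans_prob(b, a, t) function (returning Pr[b | a, t] under some substitution model, e.g. Jukes-Cantor) are illustrative assumptions:

    def cond_likelihood(k, children, branch_len, leaf_char, trans_prob, states, L):
        """Fill L[k] = {a: Pr[T_k | a]} for the subtree below node k."""
        if children[k] is None:                 # leaf: indicator of the observed state
            L[k] = {a: (1.0 if leaf_char[k] == a else 0.0) for a in states}
        else:
            i, j = children[k]
            cond_likelihood(i, children, branch_len, leaf_char, trans_prob, states, L)
            cond_likelihood(j, children, branch_len, leaf_char, trans_prob, states, L)
            L[k] = {a: sum(trans_prob(b, a, branch_len[i]) * L[i][b] for b in states) *
                       sum(trans_prob(c, a, branch_len[j]) * L[j][c] for c in states)
                    for a in states}
        return L[k]

    def site_likelihood(root, children, branch_len, leaf_char, trans_prob, states, q):
        """Pr[x_1, ..., x_n | T, t] = sum over a of q[a] * Pr[T_root | a]."""
        L = {}
        cond_likelihood(root, children, branch_len, leaf_char, trans_prob, states, L)
        return sum(q[a] * L[root][a] for a in states)

Any Markov substitution model can be plugged in through trans_prob, together with the root frequencies q.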


Maximum Likelihood 

Let Pr[T_k | a] = probability of the leaf nodes "below" node k, given a_k = a.

Traceback as before with Sankoff’s algorithm. 

Pr[T_k | a] = ( max_b Pr[b | a, t_i] Pr[T_i | b] ) · ( max_c Pr[c | a, t_j] Pr[T_j | c] )
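The same sketch with max in place of sum implements the recurrence above; storing the argmax choices then allows a Sankoff-style traceback to label the internal vertices. The assumed inputs are the same as in the pruning sketch under "Computing the Likelihood":

    def max_cond_likelihood(k, children, branch_len, leaf_char, trans_prob, states, L):
        """Fill L[k] = {a: max probability of the subtree below k, given a_k = a}."""
        if children[k] is None:
            L[k] = {a: (1.0 if leaf_char[k] == a else 0.0) for a in states}
        else:
            i, j = children[k]
            max_cond_likelihood(i, children, branch_len, leaf_char, trans_prob, states, L)
            max_cond_likelihood(j, children, branch_len, leaf_char, trans_prob, states, L)
            L[k] = {a: max(trans_prob(b, a, branch_len[i]) * L[i][b] for b in states) *
                       max(trans_prob(c, a, branch_len[j]) * L[j][c] for c in states)
                    for a in states}
        return L[k]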

Max. Parsimony vs. Max. Likelihood 

•  Set δ_{i,j} = -log P(j | i) in weighted parsimony (Sankoff algorithm)

•  Weighted parsimony produces “maximum probability” assignments, ignoring branch lengths
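A small sketch of the connection stated above: build a weighted-parsimony scoring matrix from substitution probabilities P(j | i), here a hypothetical nested dict, so that minimizing the summed costs maximizes the product of the per-edge probabilities while ignoring branch lengths:

    import math

    def neg_log_scoring_matrix(P, states):
        """delta[i][j] = -log P(j | i), usable with the Sankoff sketches above."""
        return {i: {j: -math.log(P[i][j]) for j in states} for i in states}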