


CSCI1950-Z Computational Methods for Biology

Lecture 2 

Ben Raphael January 26, 2009 

http://cs.brown.edu/courses/csci1950-z/

Outline 

•  Review of trees. Counting features.
•  Character-based phylogeny

– Maximum parsimony 

– Maximum likelihood 


Tree Definitions

tree: A connected acyclic graph G = (V, E).

graph: A set V of vertices (nodes) and a set E of edges, where each edge (v_i, v_j) connects a pair of vertices.

A path in G is a sequence (v_1, v_2, …, v_n) of vertices in V such that each (v_i, v_{i+1}) is an edge in E.

A graph is connected provided that for every pair v_i, v_j of vertices, there is a path between v_i and v_j.

A cycle is a path with the same starting and ending vertices.

A graph is acyclic provided it has no cycles.

Tree Definitions

The degree of a vertex v is the number of edges incident to v.

A phylogenetic tree is a tree with a label for each leaf (vertex of degree one).

A binary phylogenetic tree is a phylogenetic tree in which every interior (non-leaf) vertex has degree 3 (one parent and two children).

A rooted (*binary) phylogenetic tree is a phylogenetic tree with a single designated root vertex r (* of degree 2).

w is a parent (ancestor) of v provided (v, w) is on the path from v to the root. In this case v is a child (descendant) of w.


Tree Definitions

tree: A connected acyclic graph G = (V, E). The degree of a vertex v is the number of edges incident to v.

A phylogenetic tree is a tree with a label for each leaf (vertex of degree one).

•  Leaves represent existing species.
•  Other vertices represent most recent common ancestors.
•  Branch lengths represent evolutionary time.
•  The root (if present) represents the oldest evolutionary ancestor.

Counting and Trees

•  A tree with n vertices has n-1 edges. (Proof?)
•  A rooted binary phylogenetic tree with n leaves has n-1 internal vertices, and thus 2n-1 total vertices.

•  How many rooted binary phylogenetic trees with n leaves?
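For reference (the count is not derived on this slide), the standard answer is the double factorial (2n-3)!! for n ≥ 2. A minimal Python sketch that evaluates it; the function name is illustrative:

    def num_rooted_binary_trees(n):
        """Number of rooted binary phylogenetic trees on n labeled leaves: (2n-3)!!."""
        count = 1
        for k in range(3, 2 * n - 2, 2):   # odd factors 3, 5, ..., 2n-3
            count *= k
        return count

    # 2 leaves -> 1, 3 -> 3, 4 -> 15, 5 -> 105, 10 -> 34,459,425
    for n in (2, 3, 4, 5, 10):
        print(n, num_rooted_binary_trees(n))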


Character-Based Phylogenetic Tree Reconstruction

1.  What is character data?
2.  What is the criterion for evaluating a tree?
3.  How do we optimize this criterion:
    1.  Over all possible trees?
    2.  Over a restricted class of trees?

(Diagram: Input (characters: molecular or morphological) → Algorithm → Output (optimal phylogenetic tree).)

Character-Based Tree Reconstruction

•  Characters may be nucleotides of DNA (A, G, C, T) or amino acids (20-letter alphabet).

•  Values are called states of the character.

•  Characters may be morphological features: the number of eyes or legs, or the shape of a beak or fin.

Gorilla: CCTGTGACGTAACAAACGA Chimpanzee: CCTGTGACGTAGCAAACGA Human: CCTGTGACGTAGCAAACGA

2‐state character 

Non-informative character


Character-Based Tree Reconstruction

GOAL: determine what character strings at internal nodes would best explain the character strings for the n observed species 

An Example

Character    Value 1    Value 2
Mouth        Smile      Frown
Eyebrows     Normal     Pointed


Character-Based Tree Reconstruction

Which tree is better?

Character-Based Tree Reconstruction

Count changes on tree 


Character-Based Tree Reconstruction

Maximum Parsimony: minimize number of changes on edges of tree 

Maximum Parsimony 

•  Ockham's razor: "simplest" explanation for the data
•  Assumes that observed character differences resulted from the fewest possible mutations
•  Seeks the tree with the lowest possible parsimony score, defined as the sum of the costs of all mutations found in the tree


Character Matrix 

Given n species, each labeled by m characters. Each character has k possible states. 

n x m character matrix 

Assume that the characters in a character string are independent.

Gorilla: CCTGTGACGTAACAAACGA Chimpanzee: CCTGTGACGTAGCAAACGA Human: CCTGTGACGTAGCAAACGA

Parsimony Score 

Assume that the characters in a character string are independent.

Given character strings S = s_1…s_m and T = t_1…t_m:

    #changes(S → T) = Σ_i d_H(s_i, t_i)

where d_H is the Hamming distance:
    d_H(v, w) = 0 if v = w
    d_H(v, w) = 1 otherwise

The parsimony score of the tree is defined as the sum of the lengths (weights) of its edges.

Gorilla: CCTGTGACGTAACAAACGA Chimpanzee: CCTGTGACGTAGCAAACGA Human: CCTGTGACGTAGCAAACGA
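A minimal Python sketch of this per-site change count, using the sequences above (the function name is illustrative):

    def hamming(s, t):
        """Summed d_H: the number of positions at which s and t differ."""
        assert len(s) == len(t)
        return sum(1 for a, b in zip(s, t) if a != b)

    gorilla = "CCTGTGACGTAACAAACGA"
    chimp   = "CCTGTGACGTAGCAAACGA"
    human   = "CCTGTGACGTAGCAAACGA"
    print(hamming(gorilla, chimp))   # 1
    print(hamming(chimp, human))     # 0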


Parsimony and Tree Reconstruction

Maximum Parsimony 

Two computational sub-problems:

1.  Find the parsimony score for a fixed tree.
    –  Small Parsimony Problem (easy)

2.  Find the lowest parsimony score over all trees with n leaves.
    –  Large Parsimony Problem (hard)


Small Parsimony Problem 

Input: Tree T with each leaf labeled by an m‐character string. 

Output: Labeling of internal vertices of the tree T minimizing the parsimony score.

Since characters are independent, we can solve the problem one character at a time; so assume every leaf is labeled by a single character.

Small Parsimony Problem

Input: T: tree with each leaf labeled by an m-character string.
Output: Labeling of internal vertices of the tree T minimizing the parsimony score.

Large Parsimony Problem

Input: M: an n x m character matrix.
Output: A tree T with:
•  n leaves labeled by the n rows of matrix M
•  a labeling of the internal vertices of T minimizing the parsimony score over all possible trees and all possible labelings of internal vertices


Small Parsimony Problem 

Input: Binary tree T with each leaf labeled by an m‐character string. 

Output: Labeling of internal vertices of the tree T minimizing the parsimony score.

Since characters are independent, we can solve the problem one character at a time; so assume every leaf is labeled by a single character.

Weighted Small Parsimony Problem 

More general version of Small Parsimony Problem 

•  Input includes a k x k scoring matrix δ describing the cost of transforming each of k states into another state.  

•  The Small Parsimony Problem is the special case:
       δ_{i,j} = 0, if i = j
       δ_{i,j} = 1, otherwise


Scoring Matrices 

Small Parsimony Problem (unweighted):

        A  T  G  C
    A   0  1  1  1
    T   1  0  1  1
    G   1  1  0  1
    C   1  1  1  0

Weighted Small Parsimony Problem:

        A  T  G  C
    A   0  3  4  9
    T   3  0  2  4
    G   4  2  0  4
    C   9  4  4  0
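One possible in-code representation of these two matrices, as nested Python dicts keyed by state (the variable names are illustrative); the same structure is reused in the Sankoff sketches later in these notes:

    UNWEIGHTED = {a: {b: (0 if a == b else 1) for b in "ATGC"} for a in "ATGC"}

    WEIGHTED = {
        "A": {"A": 0, "T": 3, "G": 4, "C": 9},
        "T": {"A": 3, "T": 0, "G": 2, "C": 4},
        "G": {"A": 4, "T": 2, "G": 0, "C": 4},
        "C": {"A": 9, "T": 4, "G": 4, "C": 0},
    }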

Unweighted vs. Weighted 

Small Parsimony Scoring Matrix:

        A  T  G  C
    A   0  1  1  1
    T   1  0  1  1
    G   1  1  0  1
    C   1  1  1  0

Small Parsimony Score: 5


Unweighted vs. Weighted 

Weighted Parsimony Scoring Matrix:

        A  T  G  C
    A   0  3  4  9
    T   3  0  2  4
    G   4  2  0  4
    C   9  4  4  0

Weighted Parsimony Score: 22

Weighted Small Parsimony Problem 

Input: T: tree with each leaf labeled by an m-character string from a k-letter alphabet.

δ: k x k scoring matrix

Output: Labeling of internal vertices of the tree T minimizing the weighted parsimony score.


Sankoff Algorithm 

Calculate and keep track of a score for every possible label at each vertex:

    s_t(v) = minimum parsimony score of the subtree rooted at vertex v if v has character t


Sankoff Algorithm 

s_t(v) = minimum parsimony score of the subtree rooted at vertex v if v has character t

The score s_t(v) is based only on the scores of its children:

    s_t(parent) = min_i {s_i(left child) + δ_{i,t}} + min_j {s_j(right child) + δ_{j,t}}



Sankoff Algorithm (cont.) 

•  Begin at the leaves:
   –  If the leaf has the character in question, its score is 0
   –  Else, its score is ∞
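Putting the leaf initialization and the recurrence together, a minimal Python sketch of the forward (leaves-to-root) pass for a single character. The tree encoding, node names, and the tables dictionary are illustrative assumptions, not notation from the slides:

    import math

    def sankoff_scores(node, children, leaf_char, delta, states, tables):
        """Fill tables[node] = {t: s_t(node)} for the subtree rooted at node."""
        if children[node] is None:                # leaf: 0 for its own character, infinity otherwise
            tables[node] = {t: (0 if leaf_char[node] == t else math.inf)
                            for t in states}
        else:
            left, right = children[node]
            sankoff_scores(left, children, leaf_char, delta, states, tables)
            sankoff_scores(right, children, leaf_char, delta, states, tables)
            tables[node] = {t: min(tables[left][i] + delta[i][t] for i in states) +
                               min(tables[right][j] + delta[j][t] for j in states)
                            for t in states}
        return tables[node]

    # Toy example: tree ((A, C), (T, T)) scored with the unit-cost matrix.
    states = "ATGC"
    delta = {a: {b: (0 if a == b else 1) for b in states} for a in states}
    children = {"root": ("u", "w"), "u": ("x1", "x2"), "w": ("x3", "x4"),
                "x1": None, "x2": None, "x3": None, "x4": None}
    leaf_char = {"x1": "A", "x2": "C", "x3": "T", "x4": "T"}
    tables = {}
    root_scores = sankoff_scores("root", children, leaf_char, delta, states, tables)
    print(min(root_scores.values()))   # minimum parsimony score for this character: 2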

Sankoff Algorithm (cont.) 

s_t(v) = min_i {s_i(u) + δ_{i,t}} + min_j {s_j(w) + δ_{j,t}}

s_A(v) = min_i {s_i(u) + δ_{i,A}} + min_j {s_j(w) + δ_{j,A}}

Left child u:

        s_i(u)   δ_{i,A}   sum
    A   0        0         0
    T   ∞        3         ∞
    G   ∞        4         ∞
    C   ∞        9         ∞

min_i {s_i(u) + δ_{i,A}} = 0 (the left-child contribution to s_A(v))


Sankoff Algorithm (cont.) 

s_t(v) = min_i {s_i(u) + δ_{i,t}} + min_j {s_j(w) + δ_{j,t}}

s_A(v) = min_i {s_i(u) + δ_{i,A}} + min_j {s_j(w) + δ_{j,A}}

Right child w:

        s_j(w)   δ_{j,A}   sum
    A   ∞        0         ∞
    T   ∞        3         ∞
    G   ∞        4         ∞
    C   0        9         9

min_j {s_j(w) + δ_{j,A}} = 9, so s_A(v) = 0 + 9 = 9

Sankoff Algorithm (cont.) 

s_t(v) = min_i {s_i(u) + δ_{i,t}} + min_j {s_j(w) + δ_{j,t}}

Repeat for T, G, and C


Sankoff Algorithm (cont.) 

Repeat for right subtree

Sankoff Algorithm (cont.) 

Repeat for root


Sankoff Algorithm (cont.) 

Smallest score at root is minimum weighted parsimony score

In this case, 9 – so label with T

Sankoff Algorithm:  Traveling down the Tree 

•  The scores at the root vertex have been computed by going up the tree  

•  After the scores at the root vertex are computed, the Sankoff algorithm moves down the tree and assigns each vertex an optimal character.
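A matching sketch of the downward pass, reusing the tables dictionary filled by the forward-pass sketch above (again, the encoding and names are illustrative assumptions):

    def sankoff_traceback(node, state, children, tables, delta, states, labels):
        """Assign state to node, then pick an optimal state for each child."""
        labels[node] = state
        if children[node] is None:        # leaf: its observed character is the only finite option
            return
        for child in children[node]:
            best = min(states, key=lambda i: tables[child][i] + delta[i][state])
            sankoff_traceback(child, best, children, tables, delta, states, labels)

    # Continue the toy example: start from a best state at the root.
    labels = {}
    root_state = min(states, key=lambda t: tables["root"][t])
    sankoff_traceback("root", root_state, children, tables, delta, states, labels)
    print(labels)   # one optimal labeling of every vertex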


Sankoff Algorithm (cont.) 

9 is derived from 7 + 2, so the left child is labeled T and the right child is labeled T.

Sankoff Algorithm (cont.) 

And the tree is thus labeled…


Analysis of Sankoff’s Algorithm  

A dynamic programming algorithm. Optimal substructure: the solution is obtained by solving smaller problems of the same type.

    s_t(parent) = min_i {s_i(left child) + δ_{i,t}} + min_j {s_j(right child) + δ_{j,t}}

The recurrence terminates at the leaves, where the solution is known.


Analysis of Sankoff's Algorithm

How many computations do we perform for n species, m characters, and k states per character?

Forward step:
•  At each internal node of the tree:
   s_t(parent) = min_i {s_i(left child) + δ_{i,t}} + min_j {s_j(right child) + δ_{j,t}}
•  2k sums and 2(k-1) comparisons = 4k - 2 operations
•  n-1 internal nodes
•  (4k - 2)(n-1) sums

Traceback: one "lookup" per internal node: (n-1) operations

•  For each character: (4k - 2)(n-1) + (n-1) operations ≤ C n k
•  The above calculation is performed once for each character: ≤ C m n k operations
•  O(mnk) time ["big-O"]
•  Increases linearly with the number of species or the number of characters.


Analysis of Sankoff’s Algorithm 

How many computations do we perform for n species, m characters, and k states per character?

Traceback: 2k sums
•  The above calculation is performed once for each character
•  O(mnk) time ["big-O"]
•  Increases linearly with the number of species or the number of characters.

Fitch’s Algorithm 

•  Solves the Small Parsimony Problem
   –  Published in 1971, four years before Sankoff's algorithm
•  Makes two passes through the tree:
   –  Leaves → root.
   –  Root → leaves.


Fitch Algorithm: Step 1 

Assign a set S(v) of letters to every vertex v in the tree, traversing the tree from leaves to root.

•  S(l) = observed character for each leaf l
•  For an internal vertex v with children u and w:
       S(v) = S(u) ∩ S(w), if the intersection is non-empty
       S(v) = S(u) ∪ S(w), otherwise
•  E.g., if the vertex has a left child labeled {A, C} and a right child labeled {A, T}, it is given the intersection {A}; if the children were labeled {A, C} and {G, T}, it would be given the union {A, C, G, T}.

Fitch’s Algorithm: Example 

(Tree figure: vertex sets {t, a}, {t, a}, a, {a, c}, {a, c} computed in the upward pass.)


Fitch Algorithm: Step 2 

Assign labels to each vertex, traversing the tree from root to leaves. 

•  Assign root r arbitrarily from its set S(r) 

•  For all other vertices v:
   –  If its parent's label is in its set S(v), assign it its parent's label
   –  Else, choose an arbitrary letter from its set S(v) as its label
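A compact Python sketch of both passes for a single character on a rooted binary tree. The tree encoding and node names are illustrative assumptions; the exact tree used in the slide's example figures is not recoverable from the text:

    def fitch_up(node, children, leaf_char, sets):
        """Step 1: compute S(v) for every vertex, leaves to root."""
        if children[node] is None:
            sets[node] = {leaf_char[node]}
        else:
            u, w = children[node]
            fitch_up(u, children, leaf_char, sets)
            fitch_up(w, children, leaf_char, sets)
            common = sets[u] & sets[w]
            sets[node] = common if common else (sets[u] | sets[w])
        return sets[node]

    def fitch_down(node, parent_label, children, sets, labels):
        """Step 2: assign labels, root to leaves."""
        if parent_label is not None and parent_label in sets[node]:
            labels[node] = parent_label
        else:
            labels[node] = next(iter(sets[node]))    # arbitrary member of S(v)
        if children[node] is not None:
            for child in children[node]:
                fitch_down(child, labels[node], children, sets, labels)

    # Toy tree ((t, a), (c, a)):
    children = {"r": ("v1", "v2"), "v1": ("l1", "l2"), "v2": ("l3", "l4"),
                "l1": None, "l2": None, "l3": None, "l4": None}
    leaf_char = {"l1": "t", "l2": "a", "l3": "c", "l4": "a"}
    sets, labels = {}, {}
    fitch_up("r", children, leaf_char, sets)
    fitch_down("r", None, children, sets, labels)
    print(sets["r"], labels)    # root set {'a'}; every vertex gets a concrete label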

Fitch’s Algorithm: Example 

(Tree figure: the Step 1 example with a label assigned to each vertex in the downward pass, e.g. the {a, c} vertex is labeled a.)


Fitch Algorithm (cont.) 

Fitch vs. Sankoff 

•  Both have an O(nk) runtime

•  Are they actually different? 

•  Let’s compare … 


Fitch 

As seen previously:

Comparison of Fitch and Sankoff 

•  As seen earlier, the scoring matrix for the Fitch algorithm is merely: 

•  So let's do the same problem using the Sankoff algorithm and this scoring matrix

        A  T  G  C
    A   0  1  1  1
    T   1  0  1  1
    G   1  1  0  1
    C   1  1  1  0


Sankoff 

Sankoff vs. Fitch 

•  The Sankoff algorithm gives the same set of optimal labels as the Fitch algorithm
•  For the Sankoff algorithm, character t is optimal for vertex v if s_t(v) = min_{1≤i≤k} s_i(v)
•  Let S_v = the set of optimal letters for v. Then, for a vertex v with children u and w:
       S_v = S_u ∩ S_w, if S_u ∩ S_w ≠ ∅
       S_v = S_u ∪ S_w, otherwise
•  This is also the Fitch recurrence
•  The two algorithms are identical


A Problem with Parsimony 

Ignores branch lengths on trees 

(Figure: two trees with the same leaf labels, using characters A and C, but different branch lengths.)

Same parsimony score. Mutation "more likely" on longer branch.

Probabilistic Model

Given a tree T with leaves labeled by present characters, what is the probability of a labeling of ancestral nodes? 

Assume:
1.  Characters evolve independently.
2.  Constant rate of mutation on each branch.
3.  The state of a vertex depends only on its parent and the branch length:

i.e. Pr[ x | y, t] depends only on y and t. (Markov process) 

Pr[x | y, t] = probability that y mutates to x in time t


Probabilistic Model

Two species

Pr[x_1, x_2, a | T, t_1, t_2] = q_a Pr[x_1 | a, t_1] Pr[x_2 | a, t_2]

Pr[x | y, t] = probability that y mutates to x in time t

T = tree topology
x_1, x_2: characters for each species
a: character for the ancestor

(Figure: ancestor a with branches of lengths t_1 and t_2 leading to leaves x_1 and x_2.)

q_a = Pr[ancestor has character a]
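As code, the two-species formula is a direct product. The helper trans_prob(x, a, t), standing for Pr[x | a, t], and the prior q are assumptions about whatever substitution model is in use:

    def joint_prob_two_species(x1, x2, a, t1, t2, q, trans_prob):
        """Pr[x1, x2, a | T, t1, t2] = q_a * Pr[x1 | a, t1] * Pr[x2 | a, t2]."""
        return q[a] * trans_prob(x1, a, t1) * trans_prob(x2, a, t2)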

Probabilistic Model

n species: x_1, x_2, …, x_n

Let α(i) = ancestor of node i. Let a_{n+1}, a_{n+2}, …, a_{2n-1} = characters on the internal nodes, where nodes are numbered from the internal vertices up to the root.

Pr[x_1, …, x_n | T, t_1, …, t_{2n-2}] = Σ_{a_{n+1}, …, a_{2n-1}} q_{a_{2n-1}} ∏_{i=n+1}^{2n-2} Pr[a_i | a_{α(i)}, t_i] ∏_{i=1}^{n} Pr[x_i | a_{α(i)}, t_i]

Follows from the Law of Total Probability: P(X) = Σ_i P(X | Y_i) P(Y_i).


Felsenstein’s Algorithm 

Let Pr[T_k | a] = probability of the leaf nodes "below" node k, given a_k = a.

Compute via dynamic programming 

Pr[T_k | a] = ( Σ_b Pr[b | a, t_i] Pr[T_i | b] ) · ( Σ_c Pr[c | a, t_j] Pr[T_j | c] ),

where i and j are the children of node k.

Initial conditions. For k = 1, …, n (leaf nodes):
    Pr[T_k | a] = 1, if a = x_k
    Pr[T_k | a] = 0, otherwise.


Computing the Likelihood

Let Pr[T_k | a] = probability of the leaf nodes "below" node k, given a_k = a.

Pr[x_1, …, x_n | T, t_1, …, t_{2n-2}] = Σ_a Pr[T_{2n-1} | a] q_a

Note: Root is node 2n‐1 
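A minimal Python sketch of the pruning recursion and the root sum above, for a single site. The tree encoding, node names, and the trans_prob(b, a, t) function (returning Pr[b | a, t] under some substitution model, e.g. Jukes-Cantor) are illustrative assumptions:

    def cond_likelihood(k, children, branch_len, leaf_char, trans_prob, states, L):
        """Fill L[k] = {a: Pr[T_k | a]} for the subtree below node k."""
        if children[k] is None:                 # leaf: indicator of the observed state
            L[k] = {a: (1.0 if leaf_char[k] == a else 0.0) for a in states}
        else:
            i, j = children[k]
            cond_likelihood(i, children, branch_len, leaf_char, trans_prob, states, L)
            cond_likelihood(j, children, branch_len, leaf_char, trans_prob, states, L)
            L[k] = {a: sum(trans_prob(b, a, branch_len[i]) * L[i][b] for b in states) *
                       sum(trans_prob(c, a, branch_len[j]) * L[j][c] for c in states)
                    for a in states}
        return L[k]

    def site_likelihood(root, children, branch_len, leaf_char, trans_prob, states, q):
        """Pr[x_1, ..., x_n | T, t] = sum over a of q[a] * Pr[T_root | a]."""
        L = {}
        cond_likelihood(root, children, branch_len, leaf_char, trans_prob, states, L)
        return sum(q[a] * L[root][a] for a in states)

Any Markov substitution model can be plugged in through trans_prob, together with the root frequencies q.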


Maximum Likelihood 

Let Pr[T_k | a] = probability of the leaf nodes "below" node k, given a_k = a.

Traceback as before with Sankoff’s algorithm. 

Pr[T_k | a] = ( max_b Pr[b | a, t_i] Pr[T_i | b] ) · ( max_c Pr[c | a, t_j] Pr[T_j | c] )
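The same sketch with max in place of sum implements the recurrence above; storing the argmax choices then allows a Sankoff-style traceback to label the internal vertices. The assumed inputs are the same as in the pruning sketch under "Computing the Likelihood":

    def max_cond_likelihood(k, children, branch_len, leaf_char, trans_prob, states, L):
        """Fill L[k] = {a: max probability of the subtree below k, given a_k = a}."""
        if children[k] is None:
            L[k] = {a: (1.0 if leaf_char[k] == a else 0.0) for a in states}
        else:
            i, j = children[k]
            max_cond_likelihood(i, children, branch_len, leaf_char, trans_prob, states, L)
            max_cond_likelihood(j, children, branch_len, leaf_char, trans_prob, states, L)
            L[k] = {a: max(trans_prob(b, a, branch_len[i]) * L[i][b] for b in states) *
                       max(trans_prob(c, a, branch_len[j]) * L[j][c] for c in states)
                    for a in states}
        return L[k]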

Max. Parsimony vs. Max. Likelihood 

•  Set δ_{i,j} = -log P(j | i) in weighted parsimony (Sankoff algorithm)

•  Weighted parsimony produces “maximum probability” assignments, ignoring branch lengths
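A small sketch of the connection stated above: build a weighted-parsimony scoring matrix from substitution probabilities P(j | i), here a hypothetical nested dict, so that minimizing the summed costs maximizes the product of the per-edge probabilities while ignoring branch lengths:

    import math

    def neg_log_scoring_matrix(P, states):
        """delta[i][j] = -log P(j | i), usable with the Sankoff sketches above."""
        return {i: {j: -math.log(P[i][j]) for j in states} for i in states}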