
Phylogenetic Trees
Lecture 11

Sections 6.1, 6.2 in Setubal et al.; 7.1, 7.1 in Durbin et al.

© Shlomo Moran, based on Nir Friedman, Danny Geiger, Ilan Gronau


Evolution

Evolution of new organisms is driven by:

• Diversity: different individuals carry different variants of the same basic blueprint.

• Mutations: the DNA sequence can be changed by single-base changes, deletion/insertion of DNA segments, etc.

• Selection bias


Theory of Evolution

Basic idea: speciation events lead to the creation of different species (speciation: physical separation into groups in which different genetic variants become dominant).

Any two species share a (possibly distant) common ancestor.

This is described by a rooted tree: the tree of life.


Tree of Life

Any two species share a (possibly distant) common ancestor.

The process of evolution consists of:

• speciation events

• mutations along evolutionary branches

[Figure: the tree of life. Source: Alberts et al.]


Often only a subtree is studied

Definition: A phylogeny (also called a phylogenetic tree) is a tree that describes the sequence of speciation events that led to the formation of a set of present-day species.


Components of Phylogenetic Trees

• Leaves: current-day species (or taxa, the plural of taxon)

• Internal vertices: hypothetical common ancestors

• Edge lengths: “time” from one speciation to the next

• The tree topology: the tree structure, ignoring edge lengths. Usually the goal is to find the topology.

[Figure: example tree with leaves Aardvark, Bison, Chimp, Dog, Elephant]


Historical Note

• Until the mid-1950s, phylogenies were constructed by experts based on their opinion (subjective criteria).

• Since then, the focus has been on objective criteria for constructing phylogenetic trees.

• Thousands of articles have been published in the last decades.

• Important for many aspects of biology: classification, understanding biological mechanisms.


Outline

A. Introduction (this lecture)

1. The phylogenetic reconstruction problem: from sequences to trees

2. Morphological vs. molecular sequences

3. Possible pitfalls

4. Directed and undirected trees

5. The “big” problem, the “small” problem


Outline (cont.)

B. Character-based methods (this + next lectures)

1. Perfect Phylogeny

2. Maximum Parsimony

3. Maximum Likelihood (not studied in this course)

• These methods consider the evolution of each character separately.

• They try to find the tree which gives the “best” evolutionary explanation: the least number of observed mutations (1 and 2), or the most probable tree (3).

• These optimization problems are typically NP-hard.

• We’ll discuss ways of solving simplified versions of the problems.


Outline (cont.)

C. Distance-based methods (last 1-2 lectures)

- Run in polynomial time

- Compute distances between all taxon pairs

- Find an edge-weighted tree that best describes the distances

Example distance matrix D (lower triangle):

 0
 3  0
 9  8  0
15 14 18  0
17 16 20 22  0
16 15 19 21  9  0

[Figure: an edge-weighted tree realizing D]
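The link between a distance matrix and an edge-weighted tree can be checked mechanically. The sketch below uses the four-point condition, a standard test that is not stated on the slide: a matrix is exactly realizable by an edge-weighted tree iff, for every four taxa, the two largest of the three pairwise sums are equal. The matrix values are the example matrix D as read off the slide (the row split is reconstructed from the extracted digits), and the names `LOWER`, `dist` and `is_additive` are illustrative.

```python
from itertools import combinations

# Lower-triangular example matrix D (taxa numbered 0..5); values as
# reconstructed from the slide, so treat them as an assumption.
LOWER = [
    [0],
    [3, 0],
    [9, 8, 0],
    [15, 14, 18, 0],
    [17, 16, 20, 22, 0],
    [16, 15, 19, 21, 9, 0],
]


def dist(i, j):
    """Symmetric lookup into the lower-triangular matrix."""
    return LOWER[i][j] if i >= j else LOWER[j][i]


def is_additive(n):
    """Four-point condition: for every quadruple of taxa, the two largest
    of the three pairwise distance sums must be equal."""
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted([
            dist(i, j) + dist(k, l),
            dist(i, k) + dist(j, l),
            dist(i, l) + dist(j, k),
        ])
        if sums[1] != sums[2]:
            return False
    return True


print(is_additive(6))
```

The check only says whether some tree fits the distances exactly; it does not build the tree.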


Outline (end)

Distance Methods (cont.)

1. Efficient reconstruction (O(n^2) time) from accurate distances

2. Reconstruction from noisy distances: can we reconstruct accurate trees from approximate distances?

• Worst-case noise model

• More realistic noise models: inter-species distances derived from probabilistic models of mutations.


Evolution as a Tree

[Figure: a tree whose vertices are labeled by 7-letter DNA sequences, e.g. AATCCTG, ATAGCTG, AATGGGC, GAACGTA, AAACCGA]


Phylogenetic Reconstruction

[Figure: only ten sequences are shown: AATCCTG, ATAGCTG, AATGGGC, GAACGTA, AAACCGA, ACCGTTG, TCTGGGA, TCCGGAA, AGCCGTG, GGGGATT]


Phylogenetic Reconstruction

A : AATGGGC
B : AATCCTG
C : ATAGCTG
D : GAACGTA
E : AAACCGA
F : GGGGATT
G : TCTGGGA
H : TCCGGAA
I : AGCCGTG
J : ACCGTTG

Goal: reconstruct the ‘true’ tree as accurately as possible.

[Figure: reconstructed tree with leaves A through J]


What are the sequences? Morphological vs. Molecular

• Classical methods use morphological features: number of legs, lengths of legs, etc.

• Modern methods use molecular features: gene (DNA) sequences, protein sequences.

• Analysis is based on homologous sequences (e.g., globins) in different species.


Possible pitfall in reconstruction: Misleading selection of sequences

Gene/protein sequences can be homologous for several different reasons:

Orthologs -- sequences diverged after a speciation event

Paralogs -- sequences diverged after a duplication event (next slides)

Xenologs -- sequences diverged after a horizontal transfer (e.g., by virus)

Note (Esti Yeger-Lotem, 31.12.02): Paralogs are genes that were duplicated within the same organism and acquired different properties. Example: hemoglobin alpha and beta exist in both mouse and human; comparing human alpha with mouse beta gives misleading data about the distance.

Misleading selection of sequences: Using paralogs instead of orthologs

Consider an evolutionary tree of three taxa (1, 2, 3), and assume that at some point in the past a gene duplication event occurred.

[Figure: tree with leaves 1, 2, 3 and a marked gene duplication event]


Paralogs instead of Orthologs

The gene evolution is described by this tree (1, 2, 3 are species; A, B are the copies of the same gene).

[Figure: gene tree with leaves 1A, 2A, 3A, 3B, 2B, 1B. The gene duplication separates Copy A from Copy B; the speciation events occur within each copy.]


Paralogs instead of Orthologs

[Figure: the gene tree with leaves 1A, 2A, 3A, 3B, 2B, 1B, showing the gene duplication and the speciation events (S)]

If we happen to consider genes 1A, 2B, and 3A of species 1, 2, 3, we get a wrong tree.

In the sequel we assume that all given sequences are orthologs: created from a common ancestor by speciation events.


Rooted vs. Undirected Trees

A natural representation of a phylogeny is a rooted tree.

[Figure: rooted tree whose root is labeled “Common Ancestor”]


Types of trees

An unrooted tree represents the same phylogeny without the root node.

Most known tree-reconstruction techniques do not distinguish between different placements of the root.


Rooted versus unrooted trees

[Figure: an unrooted tree with three candidate root positions a, b, c. It represents the three rooted trees Tree a, Tree b, Tree c.]


Positioning Roots in Unrooted Trees

We can estimate the position of the root by introducing an outgroup: a set of species that are definitely distant from all the species of interest.

[Figure: tree with leaves Aardvark, Bison, Chimp, Dog, Elephant and the outgroup Falcon, with the proposed root marked]


Two phylogenetic trees of the same species: do these trees represent the same evolutionary history?

[Figure: two differently drawn trees, each with leaves Aardvark, Bison, Chimp, Dog, Elephant]


When are two unrooted phylogenetic trees considered different?

Trees T1 and T2 on the same set of species are considered identical if they represent the same evolutionary history, i.e., they have the same topology.

Formally, this is equivalent to:

There is a tree isomorphism h: T1 → T2 s.t. for each species x, h(x) = x.
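One practical way to test this definition, sketched here under stated assumptions, is to compare the leaf bipartitions (splits) induced by the internal edges. It relies on a standard fact not proved in these slides: two trees on the same leaf set, with no internal vertex of degree 2, admit a leaf-preserving isomorphism iff they induce the same set of splits. The function names and the internal vertex names (u, v, w, p, q, r) are illustrative only.

```python
def build(edges):
    """Build an adjacency dict from a list of undirected edges."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    return adj


def leaf_splits(adj, leaves):
    """Set of nontrivial leaf bipartitions induced by the tree's edges."""
    leaves = frozenset(leaves)
    result = set()

    def leaves_below(node, parent):
        # Leaves reachable from `node` without passing through `parent`.
        if node in leaves:
            return frozenset([node])
        found = frozenset()
        for nxt in adj[node]:
            if nxt != parent:
                found |= leaves_below(nxt, node)
        return found

    for u in adj:
        for v in adj[u]:
            side = leaves_below(v, u)
            other = leaves - side
            if len(side) > 1 and len(other) > 1:
                result.add(frozenset([side, other]))
    return result


def same_topology(adj1, adj2, leaves):
    """True iff both trees induce exactly the same splits of the leaf set."""
    return leaf_splits(adj1, leaves) == leaf_splits(adj2, leaves)


LEAVES = ["Aardvark", "Bison", "Chimp", "Dog", "Elephant"]
T1 = build([("Aardvark", "u"), ("Bison", "u"), ("u", "v"), ("Chimp", "v"),
            ("v", "w"), ("Dog", "w"), ("Elephant", "w")])
# Same history, different internal names and edge order.
T2 = build([("Elephant", "p"), ("Dog", "p"), ("p", "q"), ("q", "Chimp"),
            ("q", "r"), ("r", "Bison"), ("r", "Aardvark")])
# Different history: Aardvark is grouped with Chimp instead of Bison.
T3 = build([("Aardvark", "u"), ("Chimp", "u"), ("u", "v"), ("Bison", "v"),
            ("v", "w"), ("Dog", "w"), ("Elephant", "w")])

print(same_topology(T1, T2, LEAVES))
print(same_topology(T1, T3, LEAVES))
```

Comparing splits avoids searching for the map h explicitly, since the internal vertices carry no labels of their own.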


The two trees represent the same evolution

[Figure: the two trees with leaves Aardvark, Bison, Chimp, Dog, Elephant. The internal vertices u, v, w of one tree are mapped by the isomorphism to h(u), h(v), h(w) in the other.]


The “Big” reconstruction problem, the “Small” problem

The “big” problem: compute the whole phylogenetic tree from the n input sequences.

The “small” problem: assume the tree topology and the identities of the leaf species are known. Reconstruct the sequences at the internal vertices, and give a score to the resulting phylogeny.

Connection between the problems: in order to solve the big problem, solve the small problem on all possible trees with n leaves, and output the tree(s) with the best “score”.

This is impossible in practice for more than a few taxa.
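To get a feel for why exhaustive search fails, the following sketch counts the unrooted binary trees on n labeled leaves using the standard formula (2n-5)!! = 1·3·5·…·(2n-5), valid for n ≥ 3. The formula is a well-known fact that is not stated on the slide, and the function name is illustrative.

```python
def num_unrooted_binary_trees(n):
    """Number of unrooted binary trees with n >= 3 labeled leaves:
    (2n-5)!! = 1 * 3 * 5 * ... * (2n-5)."""
    if n < 3:
        raise ValueError("formula is stated for n >= 3")
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= k
    return count


for n in (4, 5, 10, 20):
    print(n, num_unrooted_binary_trees(n))
```

Already for 10 taxa there are about two million candidate topologies, and each additional taxon multiplies the count by a growing odd factor.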


Input for the “big” problem

A: CAGGTA
B: CAGACA
C: CGGGTA
D: TGCACT
E: TGCGTA

Our task: find an evolutionary tree with leaves corresponding to the 5 sequences, which best explains the evolution of the strings.


Input for the “small” problem

[Figure: tree with leaves Aardvark, Bison, Chimp, Dog, Elephant]

A: CAGGTA
B: CAGACA
C: CGGGTA
D: TGCACT
E: TGCGTA

The tree and the assignment of strings to the leaves are given, and we need only to assign strings to the internal vertices.


Character-based methods for constructing phylogenies

In this approach, trees are constructed by comparing the characters of the corresponding sequences. Characters may be morphological (teeth structures) or molecular (nucleotides in homologous DNA sequences). We will present two methods: “Perfect Phylogeny” and “Maximum Parsimony”.

Basic assumption in these methods:

The best tree is one with a minimal number of observed mutations (character changes along the edges, aka substitutions).


Character-based methods: Input data

species  C1 C2 C3 C4 … Cm
dog      A A C A G G T C T T C G A G G C C C
horse    A A C A G G C C T A T G A G A C C C
frog     A A C A G G T C T T T G A G T C C C
human    A A C A G G T C T T T G A T G A C C
pig      A A C A G T T C T T C G A T G G C C
         * * * * *     * *     * *       * *

(* marks a column in which all five species agree.)

• Each character (column) is processed independently.

• The green character (column 14: T in human and pig, G elsewhere) separates human and pig from frog, horse and dog.

• The red character (column 11: C in dog and pig, T elsewhere) separates dog and pig from frog, horse and human.
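The way a single character splits the species can be sketched in a few lines. The sequences are the five rows of the table; the column numbers are an assumption, since the slide's colors did not survive extraction: the column in which human and pig carry T is taken to be the green character, and the column in which dog and pig carry C the red one. The function name is illustrative.

```python
from collections import defaultdict

# The five aligned sequences from the table, with spaces removed.
ALIGNMENT = {
    "dog":   "AACAGGTCTTCGAGGCCC",
    "horse": "AACAGGCCTATGAGACCC",
    "frog":  "AACAGGTCTTTGAGTCCC",
    "human": "AACAGGTCTTTGATGACC",
    "pig":   "AACAGTTCTTCGATGGCC",
}


def partition_by_character(alignment, column):
    """Group species by their state in one character (1-based column)."""
    groups = defaultdict(set)
    for species, seq in alignment.items():
        groups[seq[column - 1]].add(species)
    return dict(groups)


print(partition_by_character(ALIGNMENT, 14))  # assumed "green" column
print(partition_by_character(ALIGNMENT, 11))  # assumed "red" column
```

Each character is examined on its own here, which mirrors the independence assumption stated on the slide.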


The perfect phylogeny problem

A character is assumed to be a significant property that distinguishes between species (e.g., dental structure, number of legs/limbs).

A character state is a value of the character (e.g., human dental structure).

Assumption: it is unlikely that a given state will be created twice in the evolutionary tree. Characters satisfying this are called “homoplasy free”, and are detailed next.


Homoplasy-free characters 1

Homoplasy-free characters should avoid reversal transitions:

A species regains a state its direct ancestor has lost.

Well-known exceptions: teeth in birds, legs in snakes.


Homoplasy-free characters 2

…and also avoid convergence transitions:

Two species possess the same state while their least common ancestor possesses a different state.

Well-known exceptions: the marsupials.


The Perfect Phylogeny Problem

Input:

1. A set of species

2. A set of characters

3. For each character, an assignment of states to the species

Problem: Is there a phylogenetic tree T=(V,E) s.t. the evolution of all characters is “homoplasy free” (no reversal, no convergence)?

First, we define the problem using graph-theoretic terms.


Characters = Colorings

A coloring of a tree T=(V,E) is a mapping C: V → [set of colors].

A partial coloring of T is a coloring of a subset of the vertices U ⊆ V:

C: U → [set of colors]

[Figure: a tree with the vertices of U marked]


Characters as Colorings

Each character defines a (partial) coloring of the corresponding phylogenetic tree:

Species ≡ Vertices

States ≡ Colors


Convex Colorings (and Characters)

Let T=(V,E) be a partially colored tree, and d be a color. The d-carrier is the minimal subtree of T containing all vertices colored d.

Definition: A (partial/total) coloring of a tree is convex iff the d-carriers of distinct colors d are pairwise disjoint.
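The definition can be turned directly into a small check. This is a minimal sketch, assuming the tree is given as an adjacency dict and the partial coloring as a dict from colored vertices to colors; the carrier is found by repeatedly pruning leaves that are not colored d. The function names and the example trees are illustrative.

```python
def carrier(adj, coloring, color):
    """Vertex set of the minimal subtree containing all vertices of `color`.
    Found by repeatedly pruning leaves that do not carry `color`."""
    keep = set(adj)
    colored = {v for v, c in coloring.items() if c == color}
    changed = True
    while changed:
        changed = False
        for v in list(keep):
            degree = sum(1 for u in adj[v] if u in keep)
            if v not in colored and degree <= 1:
                keep.remove(v)
                changed = True
    return keep


def is_convex(adj, coloring):
    """True iff the carriers of distinct colors are pairwise disjoint."""
    seen = set()
    for color in set(coloring.values()):
        vertices = carrier(adj, coloring, color)
        if seen & vertices:
            return False
        seen |= vertices
    return True


# Path a - b - c - d.
PATH = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
# Star: uncolored center x with four leaves.
STAR = {"x": ["1", "2", "3", "4"], "1": ["x"], "2": ["x"], "3": ["x"], "4": ["x"]}

print(is_convex(PATH, {"a": "R", "b": "R", "c": "B", "d": "B"}))
print(is_convex(PATH, {"a": "R", "b": "B", "c": "R"}))
print(is_convex(STAR, {"1": "R", "2": "R", "3": "B", "4": "B"}))
```

In the star example both carriers pass through the uncolored center x, so the partial coloring is not convex even though no colored vertex is shared.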


Convexity ⇔ Homoplasy Freedom

A character is homoplasy free (avoids reversal and convergence transitions)

⇔

The corresponding (partial) coloring is convex


The Perfect Phylogeny Problem (pure graph-theoretic setting)

Input: Partial colorings (C1,…,Ck) of a set of vertices U (in the example: 3 total colorings, left, center and right, each by two colors).

Problem: Is there a tree T=(V,E) s.t. U ⊆ V and, for i=1,…,k, Ci is a convex (partial) coloring of T?

[Figure: example vertices, each carrying three colors (left, center, right)]

PP is NP-hard in general. In the tutorial you will see a special case solvable in polynomial time.


Maximum Parsimony

Perfect Phylogeny is not only hard to compute, but in many cases it doesn’t exist.

Next we discuss a more common approach, called “Maximum Parsimony”, which looks for a tree that minimizes the number of mutations.


Maximum Parsimony

A character-based method

Input: h sequences (one per species), all of length k.

Goal: Find a tree whose leaves are labeled by the input sequences, and an assignment of sequences to internal nodes, such that the total number of substitutions is minimized.


Example

Input: four nucleotide sequences, AAG, AAA, GGA, AGA, taken from four species.

By the parsimony principle, we seek a tree whose leaves are labeled by the input sequences, and an assignment of sequences to internal vertices, with the minimum total number of mutations (i.e., letter changes) along the tree edges. Here is one possible tree + sequence assignment.

[Figure: tree with leaves AGA, AAA, GGA, AAG; all three internal vertices are labeled AAA; the edges with substitutions carry the counts 2, 1, 1]

Total #substitutions = 4
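Once every vertex carries a sequence, counting the substitutions is a sum of Hamming distances over the edges. The sketch below reproduces the total of 4 for this example. The exact shape of the tree is an assumption, because the figure did not survive extraction: the vertex names (r, x, y, l1..l4) and the pairing of leaves under x and y are hypothetical, although the total does not depend on the pairing here since every internal vertex is labeled AAA.

```python
def hamming(s, t):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(s, t))


def total_substitutions(edges, labels):
    """Sum of letter changes over all edges of a fully labeled tree."""
    return sum(hamming(labels[u], labels[v]) for u, v in edges)


# Hypothetical vertex names; internal vertices r, x, y are all labeled AAA.
LABELS = {
    "r": "AAA", "x": "AAA", "y": "AAA",
    "l1": "AGA", "l2": "AAA", "l3": "GGA", "l4": "AAG",
}
EDGES = [("r", "x"), ("r", "y"),
         ("x", "l1"), ("x", "l2"),
         ("y", "l3"), ("y", "l4")]

print(total_substitutions(EDGES, LABELS))
```

This only scores one given assignment; finding the best assignment for a fixed tree is the “small” problem.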


Example Continued

Here are two other trees + sequence assignments:

[Figure, left: leaves AGA, GGA, AAA, AAG; internal vertices AAA, AGA, AAA; three edges with 1 substitution each]

Total #substitutions = 3

[Figure, right: leaves GGA, AAA, AGA, AAG; all internal vertices AAA; edge counts 1, 1, 2]

Total #substitutions = 4

The left solution is preferred over the right one.

A solution has two parts: first, select a tree and label its leaves by the input sequences; then, assign sequences to the internal vertices.


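Parsimony score

[Figure, left: leaves AGA, GGA, AAA, AAG; internal vertices AAA, AGA, AAA; three edges with 1 substitution each]

Parsimony score = 3

[Figure, right: leaves GGA, AAA, AGA, AAG; all internal vertices AAA; edge counts 1, 1, 2]

Parsimony score = 4

The parsimony score of a leaf-labeled tree T is the minimum possible number of mutations over all assignments of sequences to internal vertices of T.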


Parsimony Based Reconstruction

We have here both the small and big problems:

1. The small problem: find the parsimony score for a given leaf-labeled tree.

2. The big problem: find a tree whose leaves are labeled by the input sequences, with the minimum possible parsimony score.

We will see efficient algorithms for (1); (2) is hard.
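As a preview of the small problem, here is a minimal sketch of Fitch's algorithm, one well-known efficient method for it; whether it matches the presentation in the following lecture is an assumption. It processes each character separately on a rooted binary tree: a vertex gets the intersection of its children's state sets if that is nonempty, and otherwise their union at a cost of one substitution. The nested-tuple tree encoding and the leaf names s1..s4 are hypothetical.

```python
def fitch_site(tree, leaf_state):
    """Fitch bottom-up pass for one character.
    `tree` is a leaf name (str) or a pair (left, right).
    Returns (candidate state set, number of substitutions)."""
    if isinstance(tree, str):
        return {leaf_state[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, leaf_state)
    right_set, right_cost = fitch_site(right, leaf_state)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # Disjoint child sets: one substitution is unavoidable here.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, sequences):
    """Sum the per-character Fitch costs over all alignment columns."""
    length = len(next(iter(sequences.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in sequences.items()})[1]
        for i in range(length)
    )


# The four input sequences of the example; leaf names are hypothetical.
SEQS = {"s1": "AAG", "s2": "AAA", "s3": "GGA", "s4": "AGA"}

print(parsimony_score((("s4", "s3"), ("s2", "s1")), SEQS))  # AGA,GGA | AAA,AAG
print(parsimony_score((("s3", "s2"), ("s4", "s1")), SEQS))  # GGA,AAA | AGA,AAG
```

The first tree groups AGA with GGA and scores 3; the second groups GGA with AAA and scores 4, in line with the scores given in the example. Solving the big problem by calling this on every topology is exactly the exhaustive search that becomes infeasible beyond a few taxa.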