Phylogenetic Trees Lecture 11 Sections 6.1, 6.2, in Setubal et. al., 7.1, 7.1 Durbin et. al. ©...


Phylogenetic Trees: Lecture 11

Sections 6.1, 6.2 in Setubal et al.; 7.1, 7.1 in Durbin et al.

© Shlomo Moran, based on Nir Friedman, Danny Geiger, Ilan Gronau


2

Evolution

Evolution of new organisms is driven by:

• Diversity: different individuals carry different variants of the same basic blueprint.

• Mutations: the DNA sequence can be changed by single-base changes, deletion/insertion of DNA segments, etc.

• Selection bias.

3

Theory of Evolution

Basic idea: speciation events lead to the creation of different species (speciation: physical separation into groups in which different genetic variants become dominant).

Any two species share a (possibly distant) common ancestor

This is described by a rooted tree – the tree of life.


4

Any two species share a (possibly distant) common ancestor

The process of evolution consists of:

• speciation events;

• mutations along evolutionary branches.

Tree of Life

Source: Alberts et al

5

Often only a subtree is studied

Definition: A phylogeny (also called a phylogenetic tree) is a tree that describes the sequence of speciation events that led to the formation of a set of present-day species.

6

Components of Phylogenetic Trees

• Leaves: current-day species (or taxa, the plural of taxon).

• Internal vertices: hypothetical common ancestors.

• Edge lengths: “time” from one speciation to the next.

• The tree topology: the tree structure, ignoring edge lengths. Usually the goal is to find the topology.

[Figure: a tree with leaves Aardvark, Bison, Chimp, Dog, Elephant]

7

Historical Note

• Until the mid-1950s, phylogenies were constructed by experts based on their opinion (subjective criteria).

• Since then, the focus has been on objective criteria for constructing phylogenetic trees.

• Thousands of articles in the last decades.

• Important for many aspects of biology:

- Classification

- Understanding biological mechanisms

Outline

A. Introduction (this lecture)

1. The phylogenetic reconstruction problem: from sequences to trees

2. Morphological vs. molecular sequences

3. Possible pitfalls

4. Directed and undirected trees

5. The “big” problem, the “small” problem.

Outline (cont.)

B. Character-based methods (this + next lectures)

1. Perfect Phylogeny

2. Maximum Parsimony

3. Maximum Likelihood (not studied in this course)

• These methods consider the evolution of each character separately.

• They try to find the tree which gives the “best” evolutionary explanation:

- least number of observed mutations (1 & 2), or most probable tree (3).

• These optimization problems are typically NP-hard.

• We’ll discuss ways of solving simplified versions of the problems.

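Outline (cont.)

C. Distance-based methods (last 1-2 lectures)

- Run in polynomial time

- Compute distances between all taxon pairs

- Find an edge-weighted tree that best describes the distances

[Figure: a distance matrix D on six taxa (lower triangle below) next to an edge-weighted tree with edge weights 4, 5, 7, 2, 1, 2, 10, 6, 1]

D =
0
3 0
9 8 0
15 14 18 0
17 16 20 22 0
16 15 19 21 9 0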
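As a sanity check on this example, the sketch below reads D as the lower triangle of a symmetric 6×6 matrix (rows 0; 3 0; 9 8 0; 15 14 18 0; 17 16 20 22 0; 16 15 19 21 9 0) and compares it with path lengths in an edge-weighted tree. The tree topology, the taxon numbering 1..6 and the internal vertex names u, v, w, x are assumptions: the drawing did not survive, and this is simply one tree that uses exactly the nine listed edge weights.

```python
# Lower triangle of D; taxa are numbered 1..6 (hypothetical numbering).
D_ROWS = [
    [0],
    [3, 0],
    [9, 8, 0],
    [15, 14, 18, 0],
    [17, 16, 20, 22, 0],
    [16, 15, 19, 21, 9, 0],
]

# Assumed tree: leaves 1..6, internal vertices "u", "v", "w", "x".
EDGES = {
    (1, "u"): 2, (2, "u"): 1, ("u", "v"): 1, (3, "v"): 6,
    ("v", "w"): 2, (4, "w"): 10, ("w", "x"): 7, (5, "x"): 5, (6, "x"): 4,
}


def tree_distances(edges, leaves):
    """Return {(s, t): path length} for all pairs of leaves in the tree."""
    adj = {}
    for (a, b), w in edges.items():
        adj.setdefault(a, []).append((b, w))
        adj.setdefault(b, []).append((a, w))
    dist = {}
    for s in leaves:
        # Depth-first traversal from s, accumulating edge weights.
        seen = {s: 0}
        stack = [s]
        while stack:
            v = stack.pop()
            for nb, w in adj[v]:
                if nb not in seen:
                    seen[nb] = seen[v] + w
                    stack.append(nb)
        for t in leaves:
            dist[s, t] = seen[t]
    return dist


def realizes(edges, d_rows):
    """True if the tree's leaf-to-leaf path lengths equal the matrix entries."""
    n = len(d_rows)
    dist = tree_distances(edges, range(1, n + 1))
    return all(dist[i, j] == d_rows[i - 1][j - 1]
               for i in range(1, n + 1) for j in range(1, i + 1))


print(realizes(EDGES, D_ROWS))
```

Under these assumptions every pairwise path length matches the corresponding matrix entry, which is what "a tree describing the distances" means when the distances are exact.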

Outline (end)

Distance Methods (cont.)

1. Efficient reconstruction (O(n²) time) from accurate distances

2. Reconstruction from noisy distances: can we reconstruct accurate trees from approximate distances?

• Worst-case noise model

• More realistic noise models: inter-species distances derived from probabilistic models of mutations.

12

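Evolution as a Tree

[Figure: an evolutionary tree whose vertices are labeled by DNA sequences. Leaves: AATCCTG, ATAGCTG, AATGGGC, GAACGTA, AAACCGA, ACCGTTG, TCTGGGA, TCCGGAA, AGCCGTG, GGGGATT. Ancestral vertices: ACGGTCA, ACGGATA, ACGGGTA, ACCCGTG, TCTGGTA, AAAGTCA, AAAGGCG, AAACACA, AAAGCTG]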

13

Phylogenetic Reconstruction

[Figure: the ten leaf sequences alone, without the tree: AATCCTG, ATAGCTG, AATGGGC, GAACGTA, AAACCGA, ACCGTTG, TCTGGGA, TCCGGAA, AGCCGTG, GGGGATT]

14

Phylogenetic Reconstruction

A : AATGGGC
B : AATCCTG
C : ATAGCTG
D : GAACGTA
E : AAACCGA
F : GGGGATT
G : TCTGGGA
H : TCCGGAA
I : AGCCGTG
J : ACCGTTG

Goal: reconstruct the ‘true’ tree as accurately as possible.

[Figure: the labeled sequences pass through a “reconstruct” step that outputs a tree with leaves A to J]

16

What are the sequences? Morphological vs. Molecular

• Classical methods use morphological features: number of legs, lengths of legs, etc.

• Modern methods use molecular features:

- Gene (DNA) sequences

- Protein sequences

• Analysis is based on homologous sequences (e.g., globins) in different species.

17

Possible pitfall in reconstruction: Misleading selection of sequences

Gene/protein sequences can be homologous for several different reasons:

Orthologs -- sequences diverged after a speciation event

Paralogs -- sequences diverged after a duplication event (next slides)

Xenologs -- sequences diverged after a horizontal transfer (e.g., by virus)

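Note (Esti Yeger Lotem, 31.12.02): paralogs are genes that were duplicated within the same organism and acquired different functions. Comparing such genes across species can mislead. For example, hemoglobin alpha and beta exist in both mouse and human; comparing human alpha with mouse beta gives misleading data about the evolutionary distance.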

18

Misleading selection of sequences: Using paralogs instead of orthologs

Consider an evolutionary tree of three taxa:

[Figure: a tree with leaves 1, 2, 3, with a point marked “Gene Duplication”]

…and assume that at some point in the past a gene duplication event occurred.

19

Paralogs instead of Orthologs

[Figure: a gene tree with leaves 1A, 2A, 3A, 3B, 2B, 1B; labels: Gene Duplication, Speciation events, Copy A, Copy B]

The gene evolution is described by this tree (1, 2, 3 are species; A, B are the copies of the same gene).

20

Paralogs instead of Orthologs

[Figure: the same gene tree with leaves 1A, 2A, 3A, 3B, 2B, 1B; labels: Gene Duplication, Speciation events; three vertices marked S]

If we happen to consider genes 1A, 2B, and 3A of species 1, 2, 3, we get a wrong tree.

In the sequel we assume all given sequences are orthologs, i.e., created from a common ancestor by speciation events.

21

Rooted vs. Undirected Trees

A natural representation of a phylogeny is a rooted tree.

[Figure: a rooted tree whose root is labeled “Common Ancestor”]

22

Types of trees

An unrooted tree represents the same phylogeny without the root node.

Most known tree-reconstruction techniques do not distinguish between different placements of the root.

23

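Rooted versus unrooted trees

[Figure: three rooted trees (Tree a, Tree b, Tree c) and one unrooted tree with positions a, b, c marked]

The unrooted tree represents the three rooted trees.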

24

Positioning Roots in Unrooted Trees

We can estimate the position of the root by introducing an outgroup: a set of species that are definitely distant from all the species of interest.

[Figure: a tree with leaves Aardvark, Bison, Chimp, Dog, Elephant and the outgroup Falcon; a “Proposed root” position is marked]

25

Two phylogenetic trees of the same species: Do these trees represent the same evolutionary history?

[Figure: two differently drawn trees, each with leaves Aardvark, Bison, Chimp, Dog, Elephant]

26

When are two unrooted phylogenetic trees considered different?

Trees T1 and T2 on the same set of species are considered identical if they represent the same evolutionary history, i.e., they have the same topology.

Formally, this is equivalent to:

There is a tree isomorphism h: T1 → T2 such that for each species x, h(x) = x.
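One practical way to test this condition, sketched below, compares the sets of leaf bipartitions (splits) induced by the edges of each tree. It relies on the standard fact that two unrooted trees without degree-2 internal vertices are identical in the above sense exactly when they induce the same splits. The three topologies and the internal vertex names are hypothetical, chosen only for illustration.

```python
def splits(edges, leaves):
    """Return the nontrivial leaf bipartitions induced by the edges of an unrooted tree."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    leaves = frozenset(leaves)
    result = set()
    for a, b in edges:
        # Collect the leaves reachable from b without crossing the edge (a, b).
        side, stack, seen = set(), [b], {a, b}
        while stack:
            v = stack.pop()
            if v in leaves:
                side.add(v)
            for nb in adj[v]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        side = frozenset(side)
        other = leaves - side
        # Edges touching a leaf give trivial splits shared by every tree; skip them.
        if len(side) > 1 and len(other) > 1:
            result.add(frozenset({side, other}))
    return result


def same_topology(edges1, edges2, leaves):
    """True if both trees induce the same set of splits on the leaves."""
    return splits(edges1, leaves) == splits(edges2, leaves)


LEAVES = ["Aardvark", "Bison", "Chimp", "Dog", "Elephant"]

# Hypothetical tree ((Aardvark, Bison), Chimp, (Dog, Elephant)).
T1 = [("Aardvark", "u"), ("Bison", "u"), ("u", "v"), ("Chimp", "v"),
      ("v", "w"), ("Dog", "w"), ("Elephant", "w")]
# The same topology, drawn with different internal names and edge order.
T2 = [("Dog", "p"), ("Elephant", "p"), ("p", "q"), ("q", "Chimp"),
      ("q", "r"), ("r", "Bison"), ("r", "Aardvark")]
# A different topology: ((Aardvark, Chimp), Bison, (Dog, Elephant)).
T3 = [("Aardvark", "a"), ("Chimp", "a"), ("a", "b"), ("Bison", "b"),
      ("b", "c"), ("Dog", "c"), ("Elephant", "c")]

print(same_topology(T1, T2, LEAVES))
print(same_topology(T1, T3, LEAVES))
```

Comparing splits avoids searching for the isomorphism h explicitly: internal vertex names never enter the comparison, only the leaf sets on each side of an edge.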

27

The two trees represent the same evolution

[Figure: the two trees on leaves Aardvark, Bison, Chimp, Dog, Elephant; internal vertices u, v, w of one tree are matched with h(u), h(v), h(w) in the other]

28

The “Big” reconstruction problem, the “Small” problem

The “big” problem: compute the whole phylogenetic tree from the n input sequences.

The “small” problem: Assume the tree topology and the identities of the leaf species are known. Reconstruct the sequences at the internal vertices, and give a score to the resulting phylogeny.

Connection between the problems: In order to solve the big problem, solve the small problem on all possible trees with n leaves, and output the tree(s) with the highest “score”.

This is impossible in practice for more than a few taxa.
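To see why exhaustive search fails, it helps to count the candidate trees. The sketch below uses the standard count of unrooted binary trees with n labeled leaves, (2n-5)!! = 3 · 5 · … · (2n-5). Restricting attention to unrooted binary trees is an assumption made here; the slide speaks only of “all possible trees”.

```python
def num_unrooted_binary_trees(n):
    """Number of unrooted binary tree topologies on n labeled leaves, for n >= 3."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    # Adding the k-th leaf can split any of the 2k - 5 edges of the current tree.
    for k in range(4, n + 1):
        count *= 2 * k - 5
    return count


for n in (5, 10, 20):
    print(n, num_unrooted_binary_trees(n))
```

Already for 10 taxa there are 2,027,025 topologies, and the count grows faster than exponentially, so solving the small problem on every tree is feasible only for very small inputs.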

29

Input for the “big” problem

A: CAGGTA
B: CAGACA
C: CGGGTA
D: TGCACT
E: TGCGTA

Our task: Find an evolutionary tree with leaves corresponding to the 5 sequences, which best explains the evolution of the strings.

30

Input for the “small” problem

[Figure: a tree with leaves Aardvark, Bison, Chimp, Dog, Elephant, labeled by the sequences below]

A: CAGGTA
B: CAGACA
C: CGGGTA
D: TGCACT
E: TGCGTA

The tree and the assignment of strings to the leaves are given, and we need only to assign strings to internal vertices.

31

Character-based methods for constructing phylogenies

In this approach, trees are constructed by comparing the characters of the corresponding sequences. Characters may be morphological (teeth structures) or molecular (nucleotides in homologous DNA sequences). We will present two methods: “Perfect Phylogeny” and “Maximum Parsimony”.

Basic assumption in these methods:

The best tree is one with the minimal number of observed mutations (character changes along the edges, a.k.a. substitutions).

32

Character based methods: Input data

species  C1 C2 C3 C4 … Cm
dog      A A C A G G T C T T C G A G G C C C
horse    A A C A G G C C T A T G A G A C C C
frog     A A C A G G T C T T T G A G T C C C
human    A A C A G G T C T T T G A T G A C C
pig      A A C A G T T C T T C G A T G G C C
         * * * * *     * *     * *       * *

• Each character (column) is processed independently.

• A * marks a column in which all five species agree.

• The green character (column 14) separates human and pig from frog, horse and dog.

• The red character (column 11) separates dog and pig from frog, horse and human.
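The way a single character splits the species can be read off column by column. The sketch below groups the species by their state in a given column of the table above; the function names are hypothetical, and columns are numbered from 1, so the character separating human and pig is column 14 and the one separating dog and pig is column 11.

```python
from collections import defaultdict

# The five aligned sequences from the table, with spaces removed.
ALIGNMENT = {
    "dog":   "AACAGGTCTTCGAGGCCC",
    "horse": "AACAGGCCTATGAGACCC",
    "frog":  "AACAGGTCTTTGAGTCCC",
    "human": "AACAGGTCTTTGATGACC",
    "pig":   "AACAGTTCTTCGATGGCC",
}


def column_partition(alignment, col):
    """Group species by their state in column `col` (1-based)."""
    groups = defaultdict(set)
    for species, seq in alignment.items():
        groups[seq[col - 1]].add(species)
    return dict(groups)


def conserved_columns(alignment):
    """Columns (1-based) in which all species share the same state."""
    length = len(next(iter(alignment.values())))
    return [c for c in range(1, length + 1)
            if len(column_partition(alignment, c)) == 1]


print(column_partition(ALIGNMENT, 14))
print(column_partition(ALIGNMENT, 11))
print(conserved_columns(ALIGNMENT))
```

Columns 14 and 11 induce conflicting groupings (pig sides with human in one and with dog in the other), which is the kind of disagreement character-based methods have to resolve.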

33

The perfect phylogeny problem

A character is assumed to be a significant property, which distinguishes between species (e.g. dental structure, number of legs/limbs).

A character state is a value of the character (e.g., human dental structure).

Assumption: It is unlikely that a given state will be created twice in the evolution tree. Such characters are called “Homoplasy free”, and are detailed next.

34

Homoplasy-free characters 1

Homoplasy-free characters should avoid reversal transitions:

A species regains a state its direct ancestor has lost.

Well-known exceptions: teeth in birds, legs in snakes.

35

Homoplasy-free characters 2

…and also avoid convergence transitions

Two species possess the same state while their least common ancestor possesses a different state.

Well-known exceptions: the marsupials.

36

The Perfect Phylogeny Problem

Input:

1. A set of species

2. A set of characters

3. For each character, an assignment of states to the species

Problem: Is there a phylogenetic tree T=(V,E) such that the evolution of all characters is “homoplasy free” (no reversal, no convergence)?

First, we define the problem using graph-theoretic terms.

37

Characters = Colorings

A coloring of a tree T=(V,E) is a mapping C: V → [set of colors].

A partial coloring of T is a coloring of a subset of the vertices U ⊆ V:

C: U → [set of colors]

[Figure: a partially colored tree, with the colored subset U indicated]

38

Characters as Colorings

Each character defines a (partial) coloring of the corresponding phylogenetic tree:

Species ≡ Vertices

States ≡ Colors

39

Convex Colorings (and Characters)

Let T=(V,E) be a partially colored tree, and let d be a color. The d-carrier is the minimal subtree of T containing all vertices colored d.

Definition: A (partial/total) coloring of a tree is convex iff the d-carriers of distinct colors are pairwise vertex-disjoint.
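The definition can be checked mechanically. The sketch below computes each d-carrier by repeatedly pruning leaves that are not colored d, and then tests that no vertex belongs to two carriers. The example tree, its vertex names and the color names are hypothetical.

```python
def build_adjacency(edges):
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def carrier(adj, coloring, d):
    """Vertices of the minimal subtree containing all vertices colored d."""
    keep = set(adj)
    degree = {v: len(adj[v]) for v in adj}
    # Start from current leaves that are not colored d and prune inward.
    stack = [v for v in adj if degree[v] <= 1 and coloring.get(v) != d]
    while stack:
        v = stack.pop()
        if v not in keep:
            continue
        keep.discard(v)
        for nb in adj[v]:
            if nb in keep:
                degree[nb] -= 1
                if degree[nb] <= 1 and coloring.get(nb) != d:
                    stack.append(nb)
    return keep


def is_convex(edges, coloring):
    """True if the carriers of distinct colors share no vertex."""
    adj = build_adjacency(edges)
    owner = {}
    for d in set(coloring.values()):
        for v in carrier(adj, coloring, d):
            if v in owner:
                return False
            owner[v] = d
    return True


# Hypothetical tree: leaves a, b hang off u; leaves c, d hang off v.
EDGES = [("a", "u"), ("b", "u"), ("u", "v"), ("c", "v"), ("d", "v")]

print(is_convex(EDGES, {"a": "red", "b": "red", "c": "blue", "d": "blue"}))
print(is_convex(EDGES, {"a": "red", "c": "red", "b": "blue", "d": "blue"}))
```

In the second coloring both carriers must pass through the internal edge between u and v, so they overlap and the coloring is not convex.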

40

Convexity ⇔ Homoplasy Freedom

A character is homoplasy free (avoids reversal and convergence transitions)

⇔

the corresponding (partial) coloring is convex.

41

The Perfect Phylogeny Problem (pure graph-theoretic setting)

Input: Partial colorings (C1,…,Ck) of a set of vertices U (in the example: 3 total colorings: left, center, right, each by two colors).

Problem: Is there a tree T=(V,E) such that U ⊆ V and, for i=1,…,k, Ci is a convex (partial) coloring of T?

[Figure: four vertices, each labeled by its (left, center, right) colors: (R, B, P), (R, G, P), (B, B, P), (R, G, A)]

PP is NP-hard in general. In the tutorial you will see a special case solvable in polynomial time.

42

Maximum Parsimony

Perfect Phylogeny is not only hard to compute, but in many cases it doesn’t exist.

Next we discuss a more common approach, called “Maximum Parsimony”, which looks for a tree which minimizes the number of mutations.

43

Maximum Parsimony

A Character-based method

Input: h sequences (one per species), all of length k.

Goal: Find a tree whose leaves are labeled by the input sequences, and an assignment of sequences to internal nodes, such that the total number of substitutions is minimized.

44

Example

Input: four nucleotide sequences: AAG, AAA, GGA, AGA taken from four species.

By the parsimony principle, we seek a tree whose leaves are labeled by the input sequences, and an assignment of sequences to internal vertices, with the minimum total number of mutations (i.e., letter changes) along the tree edges. Here is one possible tree + sequence assignment.

[Figure: a tree with leaves AGA, AAA, GGA, AAG; all three internal vertices are assigned AAA; edge labels 2, 1, 1]

Total #substitutions = 4

45

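Example Continued

Here are two other trees + sequence assignments:

[Figure, left: leaves AGA, GGA, AAA, AAG; internal vertices AGA, AAA, AAA; edge labels 1, 1, 1]

Total #substitutions = 3

[Figure, right: leaves GGA, AAA, AGA, AAG; all internal vertices AAA; edge labels 1, 1, 2]

Total #substitutions = 4

The left solution is preferred over the right one.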

A solution has two parts: First, select a tree and label its leaves by the input sequences; then, assign sequences to the internal vertices.
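Once both parts of a solution are fixed, its cost is the sum of letter differences over all edges. The sketch below scores the two solutions of this slide. The vertex names are hypothetical, and the topologies are reconstructed from the figure labels: in the left tree AGA and GGA hang off the internal AGA, and AAA and AAG hang off an internal AAA; in the right tree the pairing of leaves is an assumption, which does not affect the total because every internal vertex carries AAA.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def total_substitutions(edges, seq):
    """Sum of substitutions over all tree edges, given a sequence for every vertex."""
    return sum(hamming(seq[u], seq[v]) for u, v in edges)


# Shared shape: root r with internal children x and y, two leaves under each.
EDGES = [("r", "x"), ("r", "y"),
         ("x", "l1"), ("x", "l2"), ("y", "l3"), ("y", "l4")]

LEFT = {"r": "AAA", "x": "AGA", "y": "AAA",
        "l1": "AGA", "l2": "GGA", "l3": "AAA", "l4": "AAG"}

RIGHT = {"r": "AAA", "x": "AAA", "y": "AAA",
         "l1": "GGA", "l2": "AAA", "l3": "AGA", "l4": "AAG"}

print(total_substitutions(EDGES, LEFT))
print(total_substitutions(EDGES, RIGHT))
```

Under these assumptions the left solution costs 3 and the right one costs 4, matching the totals on the slide.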

47

Parsimony score

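[Figure, left: the tree with leaves AGA, GGA, AAA, AAG and internal vertices AGA, AAA, AAA; edge labels 1, 1, 1]

Parsimony score = 3

[Figure, right: the tree with leaves GGA, AAA, AGA, AAG and all internal vertices AAA; edge labels 1, 1, 2]

Parsimony score = 4

The parsimony score of a leaf-labeled tree T is the minimum possible number of mutations over all assignments of sequences to internal vertices of T.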
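The parsimony score can be computed without trying every assignment. The sketch below uses Fitch's algorithm, a standard bottom-up method named here as an illustration rather than as the lecture's own presentation. It assumes a rooted binary tree written as nested pairs, unit cost per substitution, and independent sites. The three trees are the three ways of pairing the four example sequences.

```python
def fitch_site(tree, site):
    """Return (candidate states, minimum substitutions) for one site of a subtree."""
    if isinstance(tree, str):
        # A leaf: its state is fixed and costs nothing.
        return {tree[site]}, 0
    left, right = tree
    left_states, left_cost = fitch_site(left, site)
    right_states, right_cost = fitch_site(right, site)
    common = left_states & right_states
    if common:
        # The children can agree on a state: no new substitution is needed.
        return common, left_cost + right_cost
    # The children disagree: one substitution is unavoidable at this vertex.
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_score(tree, length):
    """Sum the per-site minimum substitution counts over all sites."""
    return sum(fitch_site(tree, i)[1] for i in range(length))


T1 = (("AGA", "GGA"), ("AAA", "AAG"))
T2 = (("AGA", "AAA"), ("GGA", "AAG"))
T3 = (("AGA", "AAG"), ("GGA", "AAA"))

for name, tree in (("T1", T1), ("T2", T2), ("T3", T3)):
    print(name, parsimony_score(tree, 3))
```

Under these assumptions the first pairing scores 3 and the other two score 4, in agreement with the scores shown above. Where the root is placed does not change the score under unit costs, so the rooted form is only a convenience.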

48

Parsimony Based Reconstruction

We have here both the small and big problems:

1. The small problem: Find the parsimony score for a given leaf-labeled tree.

2. The big problem: Find a tree whose leaves are labeled by the input sequences, with the minimum possible parsimony score.

3. We will see efficient algorithms for (1). (2) is hard.
