CSCE555 Bioinformatics

Lecture 13: Phylogenetics II

Meeting: MW 4:00PM-5:15PM, SWGN2A21
Instructor: Dr. Jianjun Hu
Course page: http://www.scigen.org/csce555

University of South Carolina
Department of Computer Science and Engineering
2008
www.cse.sc.edu

HAPPY CHINESE NEW YEAR


Outline

• Review for exams
• Data for phylogenetic tree inference
• Classification of tree inference approaches
• Neighbor-joining algorithm
• Parsimony-based tree reconstruction
• Least-squares best-fit reconstruction

Midterm, Midterm

• How to review: read the slides and textbooks, especially the CG book.
• Format of problems, with examples:
◦ Brief questions: what is the difference between global alignment and local alignment?
◦ Calculation: build an HMM for a multiple sequence alignment
◦ Definition: BLASTing, motif, ORF

Covered Topics

Understand: concepts, algorithm ideas, tools
◦ Sequencing/BLASTing
◦ Gene finding
◦ Alignment algorithms and applications
◦ DNA motif search
◦ HMM profiles
◦ Gene prediction algorithms
◦ Promoter predictions
◦ Comparative genomics
◦ ……

[Figure: a character array (taxa A-E by characters 1-10, with binary entries) and a distance matrix (taxa A-E by taxa A-E).]

Phylogenetic Reconstruction

There are essentially two types of data for phylogenetic tree estimation:
◦ Distance data, usually stored in a distance matrix; e.g. DNA×DNA hybridisation data, morphometric differences, immunological data, pairwise genetic distances.
◦ Character data, usually stored in a character array; e.g. a multiple sequence alignment of DNA sequences, morphological characters.

Phylogenetic Reconstruction

Given the huge number of possible trees even for small data sets, we have two options:
◦ Build one according to some clustering algorithm
◦ Assign a “goodness of fit” criterion (an objective function) and find the tree(s) which optimise(s) this criterion
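To make "huge number of possible trees" concrete, here is a minimal sketch that counts unrooted binary tree topologies on n labelled leaves, using the standard count (2n-5)!! = 3 × 5 × ... × (2n-5). The function name is illustrative and not part of the lecture.

```python
def num_unrooted_trees(n: int) -> int:
    """Count unrooted binary tree topologies on n labelled leaves: (2n-5)!!."""
    if n < 3:
        return 1  # a single leaf or a single edge: only one shape
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= k
    return count


for n in (4, 5, 10, 20):
    print(n, num_unrooted_trees(n))
```

Even at 10 taxa the count exceeds two million, which is why exhaustive search is replaced by clustering or by heuristic optimisation.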


Phylogenetic Reconstruction

[Figure: tree inference methods (UPGMA, Neighbor-Joining, Minimum Evolution, Maximum Parsimony, Maximum Likelihood) classified by type of data (distances vs. nucleotide sites) and by tree-building method (clustering algorithm vs. optimality criterion).]

Phylogenetic Methods

Many different procedures exist. Three of the most popular:

Neighbor-joining
• Minimizes distance between nearest neighbors

Maximum parsimony
• Minimizes total evolutionary change

Maximum likelihood
• Maximizes likelihood of observed data

Distance-Based Tree Construction

Given a set of species (leaves in a supposed tree) and distances between them, construct a phylogeny which best “fits” the distances.

Orc: ACAGTGACGCCCCAAACGT
Elf: ACAGTGACGCTACAAACGT
Dwarf: CCTGTGACGTAACAAACGA
Hobbit: CCTGTGACGTAGCAAACGA
Human: CCTGTGACGTAGCAAACGA

[Figure: tree with leaves Orc, Elf, Dwarf, Hobbit, Human.]

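Note (slide comment by USER, translated from Hebrew): Before the construction, the four-point theorem should be inserted (from a separate file), replacing its previous proof in Lecture 12. It may also be worth dropping UPGMA. This note of course affects Lecture 12 as well. Shlomo, 12.3.03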

Distance Matrix

Given n species, we can compute the n x n distance matrix Dij.

Dij may be defined as the edit distance between a gene in species i and the same gene in species j, where the gene of interest is sequenced for all n species.

Dij can also be any other feature-based distance.
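As an illustration of building Dij, the sketch below computes a pairwise distance matrix for the five Orc/Elf/Dwarf/Hobbit/Human sequences from the earlier slide. It is a simplification: the slide defines Dij as an edit distance, whereas this sketch assumes the sequences are already aligned and of equal length and counts mismatched positions (Hamming distance). The helper name `hamming` is illustrative.

```python
seqs = {
    "Orc":    "ACAGTGACGCCCCAAACGT",
    "Elf":    "ACAGTGACGCTACAAACGT",
    "Dwarf":  "CCTGTGACGTAACAAACGA",
    "Hobbit": "CCTGTGACGTAGCAAACGA",
    "Human":  "CCTGTGACGTAGCAAACGA",
}


def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


names = list(seqs)
# D[i][j] is the distance between species names[i] and names[j]
D = [[hamming(seqs[a], seqs[b]) for b in names] for a in names]

for name, row in zip(names, D):
    print(f"{name:>6}", row)
```

The resulting matrix is symmetric with a zero diagonal, which is what the distance-based methods in the following slides expect as input.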

Distances in Trees

Edges may have weights reflecting:
◦ Number of mutations on the evolutionary path from one species to another
◦ Time estimate for evolution of one species into another

In a tree T, we often compute dij(T), the length of the path between leaves i and j.


Distance in Trees: an Example

d1,4 = 12 + 13 + 14 + 17 + 12 = 68

[Figure: weighted tree with leaves i and j marked.]
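The path length dij(T) can be computed by walking the tree from leaf i until leaf j is reached. The sketch below does this on a hypothetical tree: only the path 1 → 4 with edge weights 12, 13, 14, 17, 12 comes from the example above; the internal node names (a, b, c, d) and the other leaves and weights are made up for illustration.

```python
from collections import defaultdict


def build_tree(edges):
    """Build an undirected weighted adjacency map from (u, v, weight) triples."""
    tree = defaultdict(dict)
    for u, v, w in edges:
        tree[u][v] = w
        tree[v][u] = w
    return tree


def path_length(tree, i, j):
    """Length of the unique path between nodes i and j in a tree."""
    stack = [(i, None, 0)]  # (current node, node we came from, distance so far)
    while stack:
        node, parent, dist = stack.pop()
        if node == j:
            return dist
        for nbr, w in tree[node].items():
            if nbr != parent:  # never walk back along the edge just used
                stack.append((nbr, node, dist + w))
    raise ValueError(f"no path between {i} and {j}")


# Hypothetical tree; a, b, c, d are internal nodes.
edges = [
    ("1", "a", 12), ("a", "b", 13), ("b", "c", 14), ("c", "d", 17), ("d", "4", 12),
    ("2", "a", 5), ("3", "b", 7), ("5", "c", 3), ("6", "d", 9),
]
tree = build_tree(edges)
print(path_length(tree, "1", "4"))  # 12 + 13 + 14 + 17 + 12
```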

Fitting Distance Matrix

Given n species, we can compute the n x n distance matrix Dij.

Evolution of these genes is described by a tree that we don’t know.

We need an algorithm to construct a tree that best fits the distance matrix Dij.

Reconstructing a 3-Leaved Tree

Tree reconstruction for any 3x3 matrix is straightforward

We have 3 leaves i, j, k and a center vertex c

Observe:

dic + djc = Dij

dic + dkc = Dik

djc + dkc = Djk

Reconstructing a 3-Leaved Tree (continued)

Adding the first two equations:

  dic + djc = Dij
+ dic + dkc = Dik
-----------------
2dic + djc + dkc = Dij + Dik

Since djc + dkc = Djk:

2dic + Djk = Dij + Dik

dic = (Dij + Dik – Djk)/2

Similarly,

djc = (Dij + Djk – Dik)/2
dkc = (Dik + Djk – Dij)/2
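The three closed-form expressions translate directly into code. This is a minimal sketch; the function name and the sample distances (5, 6, 7) are illustrative values, not from the lecture.

```python
def three_leaf_edges(D_ij, D_ik, D_jk):
    """Edge lengths from leaves i, j, k to the center vertex c of a 3-leaf tree."""
    d_ic = (D_ij + D_ik - D_jk) / 2
    d_jc = (D_ij + D_jk - D_ik) / 2
    d_kc = (D_ik + D_jk - D_ij) / 2
    return d_ic, d_jc, d_kc


d_ic, d_jc, d_kc = three_leaf_edges(D_ij=5, D_ik=6, D_jk=7)
print(d_ic, d_jc, d_kc)

# The fitted edges reproduce the input distances exactly.
print(d_ic + d_jc, d_ic + d_kc, d_jc + d_kc)
```

Note that an edge length comes out negative whenever the input violates the triangle inequality, so a sensible tree exists only for metric input.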

Trees with > 3 Leaves

An unrooted binary tree with n leaves has 2n-3 edges.

This means fitting a given tree to a distance matrix D requires solving a system of “n choose 2” equations with 2n-3 variables.

Such a system does not always have a solution for n > 3.
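A quick count shows why the system becomes overdetermined: the number of pairwise distances grows quadratically in n, while the number of edges grows only linearly. The helper name below is illustrative.

```python
from math import comb


def equations_and_unknowns(n):
    """(number of pairwise distances, number of edge lengths) for n leaves."""
    return comb(n, 2), 2 * n - 3


for n in (3, 4, 5, 10):
    eqs, unknowns = equations_and_unknowns(n)
    print(f"n={n}: {eqs} equations, {unknowns} unknowns")
```

For n = 3 the counts match (3 and 3), which is why the 3-leaf case always has a unique solution; from n = 4 onward there are more equations than unknowns.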

Additive Distance Matrices

Matrix D is ADDITIVE if there exists a tree T with dij(T) = Dij for all i, j

NON-ADDITIVE otherwise
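One way to test additivity without constructing a tree is the four-point condition, which this slide does not state, so treat the sketch below as a supplementary assumption: for every four leaves i, j, k, l, among the three sums Dij + Dkl, Dik + Djl and Dil + Djk, the two largest must be equal. The sketch assumes D is symmetric, has a zero diagonal and satisfies the triangle inequality. Both matrices are hypothetical.

```python
from itertools import combinations


def is_additive(D, tol=1e-9):
    """Four-point check on a symmetric matrix D (list of lists) that is already a metric."""
    for i, j, k, l in combinations(range(len(D)), 4):
        sums = sorted([D[i][j] + D[k][l], D[i][k] + D[j][l], D[i][l] + D[j][k]])
        if abs(sums[2] - sums[1]) > tol:  # the two largest sums must coincide
            return False
    return True


# Hypothetical matrix generated by a tree (leaves 0,1 on one side, 2,3 on the other).
additive = [
    [0, 5, 3, 6],
    [5, 0, 6, 9],
    [3, 6, 0, 5],
    [6, 9, 5, 0],
]
# Same matrix with one distance perturbed (9 changed to 7).
perturbed = [
    [0, 5, 3, 6],
    [5, 0, 6, 7],
    [3, 6, 0, 5],
    [6, 7, 5, 0],
]
print(is_additive(additive), is_additive(perturbed))
```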

Distance-Based Phylogeny Problem

Goal: Reconstruct an evolutionary tree from a distance matrix
Input: n x n distance matrix Dij
Output: weighted tree T with n leaves fitting D

If D is additive, this problem has a solution and there is a simple algorithm to solve it.

Using Neighboring Leaves to Construct the Tree

• Find neighboring leaves i and j with parent k
• Remove the rows and columns of i and j
• Add a new row and column corresponding to k, where the distance from k to any other leaf m can be computed as:

Dkm = (Dim + Djm – Dij)/2

• Compress i and j into k, iterate algorithm for rest of tree
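The compression step can be sketched as a function that takes the current distances and returns a smaller table in which i and j are replaced by their parent k. The nested-dict layout, the function name and the example matrix (an additive matrix in which leaves 1 and 2 are assumed to be neighbors) are illustrative choices.

```python
def collapse(D, i, j, k):
    """Return new distances with neighboring leaves i and j replaced by parent k.

    D is a nested dict: D[a][b] is the distance between distinct leaves a and b.
    """
    others = [m for m in D if m not in (i, j)]
    new_D = {m: {p: D[m][p] for p in others if p != m} for m in others}
    new_D[k] = {}
    for m in others:
        d_km = (D[i][m] + D[j][m] - D[i][j]) / 2  # formula from the slide
        new_D[k][m] = d_km
        new_D[m][k] = d_km
    return new_D


D = {
    1: {2: 5, 3: 3, 4: 6},
    2: {1: 5, 3: 6, 4: 9},
    3: {1: 3, 2: 6, 4: 5},
    4: {1: 6, 2: 9, 3: 5},
}
reduced = collapse(D, 1, 2, "k")
print(reduced)
```

Each call shrinks the table by one leaf, so repeating the step until three leaves remain reduces the problem to the 3-leaf case.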

Finding Neighboring Leaves

• To find neighboring leaves we simply select a pair of closest leaves. WRONG
• Closest leaves aren’t necessarily neighbors
• i and j are neighbors, but (dij = 13) > (djk = 12)
• Finding a pair of neighboring leaves is a nontrivial problem!

Neighbor Finding: Saitou & Nei algorithm (1987)

Definitions
Let L be the number of leaves.
For a leaf i, let $r_i = \sum_{u} d(i,u)$, where u ranges over the leaves.
For 2 leaves i, j: $D(i,j) = (L-2)\,d(i,j) - (r_i + r_j)$

Theorem (Saitou & Nei) Assume all edge weights are positive. If D(i,j) is minimal (among all pairs of leaves), then i and j are neighboring leaves in the tree.

Dividing by L – 2 gives the equivalent criterion $d(i,j) - (r_i + r_j)/(L-2)$, which has the same minimizing pair.

Neighbor Joining Algorithm

In 1987 Naruya Saitou and Masatoshi Nei developed a neighbor joining algorithm for phylogenetic tree reconstruction.

It finds a pair of leaves that are close to each other but far from other leaves: this implicitly finds a pair of neighboring leaves.

Advantages: it works well for additive and also for non-additive matrices, and it does not rely on the flawed molecular clock assumption.

Neighbor-joining

• Guaranteed to produce the correct tree if distance is additive
• May produce a good tree even when distance is not additive

Step 1: Finding neighboring leaves

Define

$D_{ij} = d_{ij} - (r_i + r_j)$

where

$r_i = \frac{1}{|L| - 2} \sum_{k \in L} d_{ik}$

[Figure: tree with four leaves 1, 2, 3, 4 and edge lengths 0.1, 0.1, 0.1, 0.4, 0.4.]
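The following sketch evaluates the corrected distance Dij = dij – (ri + rj) on a four-leaf tree with edge lengths 0.1, 0.1, 0.1, 0.4, 0.4. The topology is an assumption, since the figure itself did not survive extraction: leaves 1 and 2 share a parent (edges 0.1 and 0.4), leaves 3 and 4 share a parent (edges 0.1 and 0.4), and the internal edge has length 0.1.

```python
from itertools import combinations

leaves = [1, 2, 3, 4]
# Path lengths in the assumed tree, e.g. d(1,3) = 0.1 + 0.1 + 0.1 = 0.3.
d = {(1, 2): 0.5, (1, 3): 0.3, (1, 4): 0.6,
     (2, 3): 0.6, (2, 4): 0.9, (3, 4): 0.5}


def dist(a, b):
    """Symmetric lookup into d, with zero distance from a leaf to itself."""
    return 0.0 if a == b else d[(min(a, b), max(a, b))]


# r_i = (1 / (|L| - 2)) * sum_k d_ik
r = {i: sum(dist(i, k) for k in leaves) / (len(leaves) - 2) for i in leaves}

# Corrected distances D_ij = d_ij - (r_i + r_j)
D = {(i, j): dist(i, j) - (r[i] + r[j]) for i, j in combinations(leaves, 2)}

closest = min(d, key=d.get)   # pair with the smallest raw distance
nj_pick = min(D, key=D.get)   # pair with the smallest corrected distance
print("closest pair:", closest)
print("neighbor-joining pick:", nj_pick)
```

Under this assumed topology the closest pair is (1, 3), which are not neighbors, while the corrected distance is smallest for the true neighbor pairs (1, 2) and (3, 4), which tie up to rounding.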

Algorithm: Neighbor-joining

Initialization:
• Define T to be the set of leaf nodes, one per sequence
• Let L = T

Iteration:
• Pick i, j s.t. Dij is minimal
• Define a new node k, and set dkm = ½ (dim + djm – dij) for all m ∈ L
• Add k to T, with edges of lengths dik = ½ (dij + ri – rj) and djk = dij – dik joining k to i and j
• Remove i, j from L; add k to L

Termination:
• When L consists of two nodes i, j, add the edge between them of length dij
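The steps above can be sketched in Python as follows. This is a compact illustration under stated assumptions, not a reference implementation: distances are stored in a dict keyed by `frozenset({a, b})`, new internal nodes are named `node0`, `node1`, ... (so leaf labels must not use those names), and ties are broken by whichever pair `min` sees first. The example input is the hypothetical four-leaf additive matrix with edge lengths 0.1, 0.1, 0.1, 0.4, 0.4.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Return the tree as a list of (node_a, node_b, edge_length) triples."""
    d = dict(dist)      # working copy; grows as internal nodes are added
    L = list(names)     # active nodes
    edges = []
    next_id = 0
    while len(L) > 2:
        # r_i = (1 / (|L| - 2)) * sum of distances from i to the other active nodes
        r = {i: sum(d[frozenset((i, k))] for k in L if k != i) / (len(L) - 2)
             for i in L}
        # Pick the pair minimising D_ij = d_ij - (r_i + r_j)
        i, j = min(combinations(L, 2),
                   key=lambda p: d[frozenset(p)] - r[p[0]] - r[p[1]])
        k = f"node{next_id}"
        next_id += 1
        d_ij = d[frozenset((i, j))]
        d_ik = 0.5 * (d_ij + r[i] - r[j])
        edges.append((i, k, d_ik))
        edges.append((j, k, d_ij - d_ik))
        # Distances from the new node k to every remaining node m
        for m in L:
            if m not in (i, j):
                d[frozenset((k, m))] = 0.5 * (
                    d[frozenset((i, m))] + d[frozenset((j, m))] - d_ij)
        L = [m for m in L if m not in (i, j)] + [k]
    i, j = L
    edges.append((i, j, d[frozenset((i, j))]))  # termination: join the last two
    return edges


raw = {("1", "2"): 0.5, ("1", "3"): 0.3, ("1", "4"): 0.6,
       ("2", "3"): 0.6, ("2", "4"): 0.9, ("3", "4"): 0.5}
dist = {frozenset(pair): value for pair, value in raw.items()}
edges = neighbor_joining(["1", "2", "3", "4"], dist)
for a, b, w in edges:
    print(a, b, round(w, 6))
```

Because the example matrix is additive, the recovered edge lengths match the tree that generated it, regardless of how ties are broken.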

Rooting a tree, and definition of outgroup

• Neighbor-joining produces an unrooted tree
• How do we root a tree between N species using neighbor-joining?
• An outgroup is a species that we know to be more distantly related to all remaining species than they are to one another
• Example: Human, mouse, rat, pig, dog, chicken, whale
• Which one is an outgroup?
• Outgroup can act as a root

[Figure: tree with leaves 1, 2, 3, 4.]

Neighbor Joining Algorithm: Widely Used

• Applicable to matrices which are not additive
• Known to work well in practice
• The algorithm and its variants are the most widely used distance-based algorithms today.

Maximum Parsimony Method for Tree Inference

A character-based method

Input: h sequences (one per species), all of length k.

Goal: Find a tree with the input sequences at its leaves, and an assignment of sequences to internal nodes, such that the total number of substitutions is minimized.

Two sub-problems:
1. Find the parsimony cost of a given tree (easy)
2. Search through all tree topologies (hard)

Example

Input: four nucleotide sequences: AAG, AAA, GGA, AGA taken from four species.

By the parsimony principle, we seek a tree that has a minimum total number of substitutions of symbols between species and their originator in the phylogenetic tree. Here is one possible tree.

[Figure: tree with leaves AGA, AAA, GGA, AAG and internal nodes all labeled AAA; the edges carrying substitutions are labeled 2, 1, 1.]

Total #substitutions = 4
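The parsimony cost of a fixed tree can be computed site by site. The sketch below uses Fitch's algorithm, a standard choice that the slides do not name, so treat it as an assumption. Trees are written as nested 2-tuples with sequences at the leaves. The first topology, pairing (AGA, AAA) and (GGA, AAG), is my reading of the example figure; the second is an alternative grouping for comparison.

```python
def fitch_site(node, s):
    """Return (candidate state set, substitution count) for site s below node."""
    if isinstance(node, str):            # leaf: its observed character
        return {node[s]}, 0
    (left_set, left_cost) = fitch_site(node[0], s)
    (right_set, right_cost) = fitch_site(node[1], s)
    common = left_set & right_set
    if common:                           # children agree: no new substitution
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_cost(tree, length):
    """Minimum total number of substitutions over all sites for a fixed tree."""
    return sum(fitch_site(tree, s)[1] for s in range(length))


slide_tree = (("AGA", "AAA"), ("GGA", "AAG"))   # grouping assumed from the figure
other_tree = (("AAG", "AAA"), ("GGA", "AGA"))   # alternative grouping

print(parsimony_cost(slide_tree, 3))
print(parsimony_cost(other_tree, 3))
```

With these inputs the first tree costs 4 substitutions, matching the total on the slide, while the alternative grouping costs 3. This shows why the second sub-problem, searching over topologies, matters.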

Least Squares Distance Phylogeny Problem

If the distance matrix D is NOT additive, then we look for a tree T that approximates D the best:

Squared Error: $\sum_{i,j} (d_{ij}(T) - D_{ij})^2$

Squared Error is a measure of the quality of the fit between distance matrix and the tree: we want to minimize it.

Least Squares Distance Phylogeny Problem: finding the best approximation tree T for a non-additive matrix D (NP-hard).
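Evaluating the objective for one candidate tree is straightforward once the tree's path lengths are known. The sketch below sums over unordered pairs i < j, which is an assumption about the intended index range; summing over all ordered pairs would simply double the value. Both matrices are hypothetical.

```python
def squared_error(tree_dist, D):
    """Sum of (d_ij(T) - D_ij)^2 over all unordered pairs of leaves."""
    n = len(D)
    return sum((tree_dist[i][j] - D[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))


# Path lengths d_ij(T) in a candidate tree T (hypothetical values).
tree_dist = [
    [0, 5, 3, 6],
    [5, 0, 6, 9],
    [3, 6, 0, 5],
    [6, 9, 5, 0],
]
# Observed, non-additive distance matrix D (hypothetical values).
D = [
    [0, 5, 3, 6],
    [5, 0, 6, 7],
    [3, 6, 0, 5],
    [6, 7, 5, 0],
]
print(squared_error(tree_dist, D))
```

The hard part of the problem is not this evaluation but the search over tree topologies and edge lengths that minimizes it.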

Search through tree topologies: Branch and Bound

Observation: adding an edge to an existing tree can only increase the parsimony cost

Enumerate all unrooted trees with at most n leaves:

[i3][i5][i7]……[i2N–5]

where each ik can take values from 0 (no edge) to k

At each point keep C = smallest cost so far for a complete tree

Start B&B with tree [1][0][0]……[0]

Whenever cost of current tree T is > C, then:
◦ T is not optimal
◦ Any tree with more edges containing T is not optimal: increment by 1 the rightmost nonzero counter

Comparison of Methods

Neighbor-joining
• Very fast
• Easily trapped in local optima
• Good for generating tentative tree, or choosing among multiple trees

Maximum parsimony
• Slow
• Assumptions fail when evolution is rapid
• Best option when tractable (<30 taxa, strong conservation)

Maximum likelihood
• Very slow
• Highly dependent on assumed evolution model
• Good for very small data sets and for testing trees built using other methods

Summary

• Category of phylogenetic inference algorithms
• Neighbor-joining algorithm

Acknowledgement

Anonymous authors
