CSCE555 Bioinformatics, Lecture 13: Phylogenetics II. Meeting: MW 4:00PM-5:15PM, SWGN2A21. Instructor: Dr. Jianjun Hu. Course page: http://www.scigen.org/csce555. University of South Carolina, Department of Computer Science and Engineering, 2008, www.cse.sc.edu. HAPPY CHINESE NEW YEAR


Page 1: CSCE555 Bioinformatics

CSCE555 Bioinformatics
Lecture 13 Phylogenetics II

Meeting: MW 4:00PM-5:15PM SWGN2A21
Instructor: Dr. Jianjun Hu
Course page: http://www.scigen.org/csce555

University of South Carolina
Department of Computer Science and Engineering
2008 www.cse.sc.edu

HAPPY CHINESE NEW YEAR

Page 2: CSCE555 Bioinformatics

Outline
• Review for exams
• Data for phylogenetic tree inference
• Classification of tree inference approaches
• Neighbor-joining algorithm
• Parsimony-based tree reconstruction
• Least-squares best-fit reconstruction

Page 3: CSCE555 Bioinformatics

Midterm
• How to review: read slides and textbooks, especially CG book.
• Format of problems: examples
◦ Brief questions: what is the difference between global alignment and local alignment?
◦ Calculation: build an HMM model for a multiple sequence alignment
◦ Definition: blasting, motif, ORF

Page 4: CSCE555 Bioinformatics

Covered Topics
Understand: concepts, algorithm ideas, tools
◦ Sequencing/blasting
◦ Gene finding
◦ Alignment algorithms and applications
◦ DNA motif search
◦ HMM profiles
◦ Gene prediction algorithms
◦ Promoter predictions
◦ Comparative genomics
◦ ……

Page 5: CSCE555 Bioinformatics

Phylogenetic Reconstruction
There are essentially two types of data for phylogenetic tree estimation:
◦ Distance data, usually stored in a distance matrix; e.g. DNA×DNA hybridisation data, morphometric differences, immunological data, pairwise genetic distances
◦ Character data, usually stored in a character array; e.g. multiple sequence alignment of DNA sequences, morphological characters.

[Figure: a character array (taxa A-E by characters 1-10, binary states) and a distance matrix (taxa A-E by taxa A-E)]

Page 6: CSCE555 Bioinformatics

Phylogenetic Reconstruction
Given the huge number of possible trees even for small data sets, we have two options:
◦ Build one according to some clustering algorithm
◦ Assign a “goodness of fit” criterion (an objective function) and find the tree(s) which optimise(s) this criterion
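To put a number on "huge", here is a minimal sketch that counts unrooted binary tree topologies on n labeled leaves. The slide does not give a formula; the code relies on the standard result that the count is 3 × 5 × … × (2n-5), because the k-th leaf can be attached to any of the 2k-5 edges of the tree built so far. The function name is made up for illustration.

```python
def count_unrooted_trees(n_leaves):
    """Number of distinct unrooted binary tree topologies on n labeled leaves."""
    if n_leaves < 3:
        return 1  # one or two leaves admit a single (trivial) topology
    count = 1
    # A tree on k-1 leaves has 2(k-1)-3 = 2k-5 edges, and leaf number k
    # can be attached to any one of them.
    for k in range(4, n_leaves + 1):
        count *= 2 * k - 5
    return count


for n in (4, 5, 10, 20):
    print(n, count_unrooted_trees(n))
```

Ten taxa already give about two million topologies, which is why exhaustive search is replaced either by a clustering algorithm or by a heuristic search over an optimality criterion.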

Page 7: CSCE555 Bioinformatics

Phylogenetic Reconstruction
Tree-building methods classified by type of data and by approach:
Clustering algorithm
◦ Distances: UPGMA, Neighbor-Joining
Optimality criterion
◦ Distances: Minimum Evolution
◦ Nucleotide sites: Maximum Parsimony, Maximum Likelihood

Page 8: CSCE555 Bioinformatics

Phylogenetic Methods
Many different procedures exist. Three of the most popular:
Neighbor-joining
• Minimizes distance between nearest neighbors
Maximum parsimony
• Minimizes total evolutionary change
Maximum likelihood
• Maximizes likelihood of observed data

Page 9: CSCE555 Bioinformatics

Distance-Based Tree Construction
Given a set of species (leaves in a supposed tree) and distances between them, construct a phylogeny which best “fits” the distances.

Orc:    ACAGTGACGCCCCAAACGT
Elf:    ACAGTGACGCTACAAACGT
Dwarf:  CCTGTGACGTAACAAACGA
Hobbit: CCTGTGACGTAGCAAACGA
Human:  CCTGTGACGTAGCAAACGA

[Figure: tree with leaves Orc, Elf, Dwarf, Hobbit, Human]

Page 10: CSCE555 Bioinformatics

Distance Matrix
• Given n species, we can compute the n x n distance matrix Dij
• Dij may be defined as the edit distance between a gene in species i and species j, where the gene of interest is sequenced for all n species.
• Dij can also be any other feature-based distance
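As a minimal sketch of the edit-distance definition, the code below builds Dij for the five sequences shown on the distance-based tree construction slide. It assumes unit-cost insertions, deletions and substitutions (plain Levenshtein distance); the slides do not fix a cost scheme, and the function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


def distance_matrix(seqs):
    """Map every ordered pair of species names to the edit distance."""
    return {(x, y): edit_distance(seqs[x], seqs[y]) for x in seqs for y in seqs}


species = {
    "Orc":    "ACAGTGACGCCCCAAACGT",
    "Elf":    "ACAGTGACGCTACAAACGT",
    "Dwarf":  "CCTGTGACGTAACAAACGA",
    "Hobbit": "CCTGTGACGTAGCAAACGA",
    "Human":  "CCTGTGACGTAGCAAACGA",
}
D = distance_matrix(species)
print(D[("Orc", "Elf")], D[("Dwarf", "Hobbit")], D[("Hobbit", "Human")])
```

Hobbit and Human are identical in this toy gene, so their distance is 0, while Orc and Elf differ at two positions.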

Page 11: CSCE555 Bioinformatics

Distances in Trees
Edges may have weights reflecting:
◦ Number of mutations on evolutionary path from one species to another
◦ Time estimate for evolution of one species into another
In a tree T, we often compute dij(T), the length of the path between leaves i and j
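A small sketch of how dij(T) can be computed: store the weighted tree as an adjacency map and walk from leaf i to leaf j, summing edge weights. The tree, its node names (A, B, C, D, u, v) and its weights are invented for illustration and are not taken from the slides.

```python
from collections import defaultdict


def build_tree(edges):
    """Turn (node, node, weight) triples into an undirected adjacency map."""
    tree = defaultdict(dict)
    for u, v, w in edges:
        tree[u][v] = w
        tree[v][u] = w
    return tree


def path_length(tree, start, goal):
    """Sum of edge weights on the unique path between two nodes of a tree."""
    stack = [(start, None, 0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == goal:
            return dist
        for nxt, w in tree[node].items():
            if nxt != parent:  # never walk back along the edge we came from
                stack.append((nxt, node, dist + w))
    raise ValueError("goal is not in the same tree as start")


# Hypothetical tree: leaves A, B hang off internal node u; C, D hang off v.
toy = build_tree([("A", "u", 1), ("B", "u", 4), ("u", "v", 1),
                  ("C", "v", 1), ("D", "v", 4)])
print(path_length(toy, "A", "B"), path_length(toy, "A", "C"))
```

In this toy tree A and B share a parent yet are 5 apart, while A and C are only 3 apart.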


Page 13: CSCE555 Bioinformatics

Distance in Trees: an Example

d1,4 = 12 + 13 + 14 + 17 + 12 = 68

[Figure: weighted tree with two leaves marked i and j]

Page 14: CSCE555 Bioinformatics

Fitting Distance Matrix
• Given n species, we can compute the n x n distance matrix Dij
• Evolution of these genes is described by a tree that we don’t know.
• We need an algorithm to construct a tree that best fits the distance matrix Dij

Page 15: CSCE555 Bioinformatics

Reconstructing a 3 Leaved Tree
• Tree reconstruction for any 3x3 matrix is straightforward
• We have 3 leaves i, j, k and a center vertex c

Observe:
dic + djc = Dij
dic + dkc = Dik
djc + dkc = Djk

Page 16: CSCE555 Bioinformatics

Reconstructing a 3 Leaved Tree

  dic + djc = Dij
+ dic + dkc = Dik
  2dic + djc + dkc = Dij + Dik
  2dic + Djk = Dij + Dik    (using djc + dkc = Djk)
  dic = (Dij + Dik - Djk)/2

Similarly,
  djc = (Dij + Djk - Dik)/2
  dkc = (Dki + Dkj - Dij)/2
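The three closed-form expressions translate directly into code. This is a minimal sketch with made-up input distances; the function name is illustrative.

```python
def three_leaf_tree(D_ij, D_ik, D_jk):
    """Edge lengths from leaves i, j, k to the center vertex c."""
    d_ic = (D_ij + D_ik - D_jk) / 2
    d_jc = (D_ij + D_jk - D_ik) / 2
    d_kc = (D_ik + D_jk - D_ij) / 2
    return d_ic, d_jc, d_kc


# Hypothetical distances: Dij = 5, Dik = 7, Djk = 8
d_ic, d_jc, d_kc = three_leaf_tree(5, 7, 8)
print(d_ic, d_jc, d_kc)
```

Adding the edge lengths back in pairs recovers the three input distances, which is a quick check that the fit is exact.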

Page 17: CSCE555 Bioinformatics

Trees with > 3 Leaves
• An unrooted binary tree with n leaves has 2n-3 edges
• This means fitting a given tree to a distance matrix D requires solving a system of “n choose 2” equations with 2n-3 variables
• This is not always possible to solve for n > 3
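A quick count shows how fast the system becomes overdetermined. The sketch below simply tabulates the two quantities named above; the function name is illustrative.

```python
from math import comb


def equations_and_unknowns(n):
    """(number of pairwise distances, number of edge lengths) for n leaves."""
    return comb(n, 2), 2 * n - 3


for n in (3, 4, 5, 10):
    print(n, equations_and_unknowns(n))
```

For n = 3 the counts match (3 equations, 3 unknowns), which is why the 3-leaf case always has a solution; from n = 4 on there are more equations than unknowns.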

Page 18: CSCE555 Bioinformatics

Additive Distance Matrices

Matrix D is ADDITIVE if there exists a tree T with dij(T) = Dij for all pairs of leaves i, j

NON-ADDITIVE otherwise
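One standard way to test additivity without building a tree is the four-point condition, which this slide does not state: for every four leaves, of the three pairings Dij + Dkl, Dik + Djl, Dil + Djk, the two largest sums must be equal. The sketch below assumes D is a symmetric list-of-lists with a zero diagonal that already satisfies the triangle inequality; the matrices are invented examples.

```python
from itertools import combinations


def is_additive(D, tol=1e-9):
    """Four-point condition check on a symmetric distance matrix."""
    for i, j, k, l in combinations(range(len(D)), 4):
        sums = sorted((D[i][j] + D[k][l],
                       D[i][k] + D[j][l],
                       D[i][l] + D[j][k]))
        if abs(sums[2] - sums[1]) > tol:  # the two largest sums must tie
            return False
    return True


# Hypothetical matrix read off a tree (leaves A, B, C, D in this order)
tree_like = [[0, 5, 3, 6],
             [5, 0, 6, 9],
             [3, 6, 0, 5],
             [6, 9, 5, 0]]
# Same matrix with the B-D entry perturbed from 9 to 12
perturbed = [[0, 5, 3, 6],
             [5, 0, 6, 12],
             [3, 6, 0, 5],
             [6, 12, 5, 0]]
print(is_additive(tree_like), is_additive(perturbed))
```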

Page 19: CSCE555 Bioinformatics

Distance Based Phylogeny Problem
• Goal: Reconstruct an evolutionary tree from a distance matrix
• Input: n x n distance matrix Dij
• Output: weighted tree T with n leaves fitting D
• If D is additive, this problem has a solution and there is a simple algorithm to solve it

Page 20: CSCE555 Bioinformatics

Using Neighboring Leaves to Construct the Tree
• Find neighboring leaves i and j with parent k
• Remove the rows and columns of i and j
• Add a new row and column corresponding to k, where the distance from k to any other leaf m can be computed as:

Dkm = (Dim + Djm - Dij)/2

• Compress i and j into k, iterate algorithm for rest of tree
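The compression step can be sketched as a function that takes the current matrix and returns the reduced one. The dict-of-dicts layout, the function name and the example matrix are assumptions made for illustration; how the neighbors i and j are found is left open here.

```python
def collapse_neighbors(D, i, j, k):
    """Replace neighboring leaves i and j by their parent k.

    D is a dict of dicts with D[a][b] the distance between a and b.
    """
    others = [m for m in D if m not in (i, j)]
    # Copy the rows and columns that survive the removal of i and j.
    reduced = {m: {p: D[m][p] for p in others} for m in others}
    reduced[k] = {k: 0.0}
    for m in others:
        d_km = (D[i][m] + D[j][m] - D[i][j]) / 2
        reduced[k][m] = d_km
        reduced[m][k] = d_km
    return reduced


# Hypothetical additive matrix on leaves A, B, C, D where A and B are neighbors
D = {
    "A": {"A": 0, "B": 5, "C": 3, "D": 6},
    "B": {"A": 5, "B": 0, "C": 6, "D": 9},
    "C": {"A": 3, "B": 6, "C": 0, "D": 5},
    "D": {"A": 6, "B": 9, "C": 5, "D": 0},
}
smaller = collapse_neighbors(D, "A", "B", "k")
print(smaller["k"])
```

Each call shrinks the matrix by one row and column, so repeating it n-3 times leaves the 3-leaf case, which has a closed-form solution.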


Page 22: CSCE555 Bioinformatics

Finding Neighboring Leaves

• To find neighboring leaves we simply select a pair of closest leaves.

WRONG

Page 23: CSCE555 Bioinformatics

Finding Neighboring Leaves

• Closest leaves aren’t necessarily neighbors
• i and j are neighbors, but (dij = 13) > (djk = 12)
• Finding a pair of neighboring leaves is a nontrivial problem!

Page 24: CSCE555 Bioinformatics

Neighbor Finding: Saitou & Nei algorithm (1987)

Definitions
For a leaf i, let $r_i = \sum_{u \text{ is a leaf}} d(i,u)$.
For 2 leaves i, j: $D(i,j) = (L-2)\,d(i,j) - (r_i + r_j)$, where L is the number of leaves.

Theorem (Saitou & Nei) Assume all edge weights are positive. If D(i,j) is minimal (among all pairs of leaves), then i and j are neighboring leaves in the tree.

Page 25: CSCE555 Bioinformatics

Neighbor Joining Algorithm
• In 1987 Naruya Saitou and Masatoshi Nei developed a neighbor joining algorithm for phylogenetic tree reconstruction
• Finds a pair of leaves that are close to each other but far from other leaves: implicitly finds a pair of neighboring leaves
• Advantages: works well for additive matrices and also for non-additive ones; it does not rely on the flawed molecular clock assumption

Page 26: CSCE555 Bioinformatics

Neighbor-joining
• Guaranteed to produce the correct tree if distance is additive
• May produce a good tree even when distance is not additive

Step 1: Finding neighboring leaves
Define
$D_{ij} = d_{ij} - (r_i + r_j)$
where
$r_i = \frac{1}{|L| - 2} \sum_{k} d_{ik}$

[Figure: unrooted tree on leaves 1-4 with edge lengths 0.1, 0.1, 0.1, 0.4, 0.4]

Page 27: CSCE555 Bioinformatics

Algorithm: Neighbor-joining
Initialization:
◦ Define T to be the set of leaf nodes, one per sequence
◦ Let L = T
Iteration:
◦ Pick i, j s.t. Dij is minimal
◦ Define a new node k, and set dkm = ½ (dim + djm - dij) for all m ∈ L
◦ Add k to T, with edges of lengths dik = ½ (dij + ri - rj) and djk = dij - dik
◦ Remove i, j from L; Add k to L
Termination:
◦ When L consists of two nodes i, j, add the edge between them with length dij
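The steps above can be sketched as follows. This is a minimal illustration rather than a tuned implementation: distances live in a dict keyed by frozenset pairs, internal nodes get invented names k0, k1, ..., ties are broken by iteration order, and the second new edge uses djk = dij - dik. The four-leaf input is a made-up additive matrix in which the closest pair (A, C) is not a pair of neighbors.

```python
from itertools import combinations


def neighbor_joining(leaves, dist):
    """Return the NJ tree as a list of (node, node, length) edges.

    dist maps frozenset({a, b}) to the distance between a and b.
    """
    d = dict(dist)         # working copy; grows as internal nodes are added
    active = list(leaves)  # the set L
    edges = []             # the tree T
    new_id = 0
    while len(active) > 2:
        size = len(active)
        # r_i = 1/(|L|-2) * sum over k of d_ik
        r = {i: sum(d[frozenset((i, k))] for k in active if k != i) / (size - 2)
             for i in active}
        # Pick the pair minimising D_ij = d_ij - (r_i + r_j)
        i, j = min(combinations(active, 2),
                   key=lambda p: d[frozenset(p)] - (r[p[0]] + r[p[1]]))
        k = f"k{new_id}"
        new_id += 1
        d_ij = d[frozenset((i, j))]
        d_ik = 0.5 * (d_ij + r[i] - r[j])
        edges.append((i, k, d_ik))
        edges.append((j, k, d_ij - d_ik))
        for m in active:
            if m not in (i, j):
                d[frozenset((k, m))] = 0.5 * (
                    d[frozenset((i, m))] + d[frozenset((j, m))] - d_ij)
        active = [m for m in active if m not in (i, j)] + [k]
    i, j = active
    edges.append((i, j, d[frozenset((i, j))]))
    return edges


# Hypothetical distances: A-B and C-D are the true neighbor pairs
example_dist = {frozenset(p): w for p, w in [
    (("A", "B"), 5), (("A", "C"), 3), (("A", "D"), 6),
    (("B", "C"), 6), (("B", "D"), 9), (("C", "D"), 5)]}
tree_edges = neighbor_joining(["A", "B", "C", "D"], example_dist)
for edge in tree_edges:
    print(edge)
```

On this input the raw minimum distance is between A and C, yet the corrected criterion joins A with B first, which is the behaviour the r terms are there to produce.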

Page 28: CSCE555 Bioinformatics

Rooting a tree, and definition of outgroup
• Neighbor-joining produces an unrooted tree
• How do we root a tree between N species using n-j?
• An outgroup is a species that we know to be more distantly related to all remaining species, than they are to one another
◦ Example: Human, mouse, rat, pig, dog, chicken, whale
◦ Which one is an outgroup?
• Outgroup can act as a root

[Figure: unrooted tree with nodes labeled 1-4]

Page 29: CSCE555 Bioinformatics

Neighbor Joining Algorithm: Widely Used
• Applicable to matrices which are not additive
• Known to work well in practice
• The algorithm and its variants are the most widely used distance-based algorithms today.

Page 30: CSCE555 Bioinformatics

Maximum Parsimony Method for Tree Inference
• A character-based method
• Input: h sequences (one per species), all of length k.
• Goal: Find a tree with the input sequences at its leaves, and an assignment of sequences to internal nodes, such that the total number of substitutions is minimized.

Two sub-problems:
1. Find the parsimony cost of a given tree (easy)
2. Search through all tree topologies (hard)

Page 31: CSCE555 Bioinformatics

Example
Input: four nucleotide sequences: AAG, AAA, GGA, AGA taken from four species.

By the parsimony principle, we seek a tree that has a minimum total number of substitutions of symbols between species and their originator in the phylogenetic tree. Here is one possible tree.

[Figure: tree with leaves AGA, AAA, GGA, AAG, internal nodes all labeled AAA, and edge costs 2, 1, 1]

Total #substitutions = 4
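The "easy" sub-problem, scoring a fixed tree, can be sketched with Fitch's algorithm, a standard small-parsimony method that the slides do not name. Trees are written as nested pairs with sequences at the leaves. The exact leaf arrangement of the tree in the figure is not recoverable from the text, so both topologies below are assumptions chosen for illustration.

```python
def fitch_site(tree, site):
    """Return (candidate state set, substitution count) for one column."""
    if isinstance(tree, str):  # leaf: its observed character
        return {tree[site]}, 0
    left, right = tree
    left_states, left_cost = fitch_site(left, site)
    right_states, right_cost = fitch_site(right, site)
    common = left_states & right_states
    if common:  # children can agree: no extra substitution
        return common, left_cost + right_cost
    # children disagree: keep the union and pay one substitution
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_cost(tree, length):
    """Total substitutions over all columns of the alignment."""
    return sum(fitch_site(tree, s)[1] for s in range(length))


# Two hypothetical topologies over the four input sequences
tree_a = (("AAA", "GGA"), ("AAG", "AGA"))
tree_b = (("GGA", "AGA"), ("AAG", "AAA"))
print(parsimony_cost(tree_a, 3), parsimony_cost(tree_b, 3))
```

Under these assumed topologies the first tree costs 4 substitutions and the second costs 3, which illustrates why the search over topologies matters.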

Page 32: CSCE555 Bioinformatics

Least Squares Distance Phylogeny Problem

• If the distance matrix D is NOT additive, then we look for a tree T that approximates D the best:

Squared Error: $\sum_{i,j} (d_{ij}(T) - D_{ij})^2$

• Squared Error is a measure of the quality of the fit between distance matrix and the tree: we want to minimize it.
• Least Squares Distance Phylogeny Problem: finding the best approximation tree T for a non-additive matrix D (NP-hard).
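Evaluating the objective for one candidate tree is simple once the tree's path lengths are tabulated. The sketch below assumes both dij(T) and Dij are given as symmetric lists of lists and counts each unordered pair once; the slide leaves open whether pairs are counted once or twice, which only rescales the error. The matrices are invented.

```python
def squared_error(tree_dist, D):
    """Sum over leaf pairs i < j of (d_ij(T) - D_ij)^2."""
    n = len(D)
    return sum((tree_dist[i][j] - D[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))


# Hypothetical path lengths of a candidate tree on four leaves
tree_dist = [[0, 5, 3, 6],
             [5, 0, 6, 9],
             [3, 6, 0, 5],
             [6, 9, 5, 0]]
# Hypothetical observed matrix that differs in a single entry (9 vs 12)
observed = [[0, 5, 3, 6],
            [5, 0, 6, 12],
            [3, 6, 0, 5],
            [6, 12, 5, 0]]
print(squared_error(tree_dist, observed))
```

The hard part of the problem is not this evaluation but the search over tree topologies and edge lengths that minimise it.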

Page 33: CSCE555 Bioinformatics

Search through tree topologies: Branch and Bound
Observation: adding an edge to an existing tree can only increase the parsimony cost

Enumerate all unrooted trees with at most N leaves:

[i3][i5][i7]……[i2N–5]

where each ik can take values from 0 (no edge) to k

At each point keep C = smallest cost so far for a complete tree

Start B&B with tree [1][0][0]……[0]

Whenever cost of current tree T is > C, then:
◦ T is not optimal
◦ Any tree with more edges containing T, is not optimal:
  Increment by 1 the rightmost nonzero counter

Page 34: CSCE555 Bioinformatics

Comparison of Methods
Neighbor-joining
◦ Very fast
◦ Easily trapped in local optima
◦ Good for generating tentative tree, or choosing among multiple trees
Maximum parsimony
◦ Slow
◦ Assumptions fail when evolution is rapid
◦ Best option when tractable (<30 taxa, strong conservation)
Maximum likelihood
◦ Very slow
◦ Highly dependent on assumed evolution model
◦ Good for very small data sets and for testing trees built using other methods

Page 35: CSCE555 Bioinformatics

Summary
• Category of phylogenetic inference algorithms
• Neighbor-joining algorithm

Page 36: CSCE555 Bioinformatics

Acknowledgement
Anonymous authors