CSCE555 Bioinformatics
Lecture 13: Phylogenetics II
Meeting: MW 4:00PM-5:15PM, SWGN2A21
Instructor: Dr. Jianjun Hu
Course page: http://www.scigen.org/csce555
University of South Carolina, Department of Computer Science and Engineering, 2008, www.cse.sc.edu
HAPPY CHINESE NEW YEAR
Outline
• Review for exams
• Data for phylogenetic tree inference
• Classification of tree inference approaches
• Neighbor-joining algorithm
• Parsimony-based tree reconstruction
• Least-squares best-fit reconstruction
Midterm, Midterm
• How to review: read slides and textbooks, especially the CG book.
• Format of problems (examples):
◦ Brief questions: what is the difference between global alignment and local alignment?
◦ Calculation: build an HMM model for a multiple sequence alignment
◦ Definition: blasting, motif, ORF
Covered Topics
Understand: concepts, algorithm ideas, tools
◦ Sequencing/blasting
◦ Gene finding
◦ Alignment algorithms and applications
◦ DNA motif search
◦ HMM profiles
◦ Gene prediction algorithms
◦ Promoter predictions
◦ Comparative genomics
◦ ……
[Figure: a character array (taxa A-E by characters 1-10, with 0/1 states) and a distance matrix (taxa A-E by taxa A-E)]
Phylogenetic Reconstruction
There are essentially two types of data for phylogenetic tree estimation:
◦ Distance data, usually stored in a distance matrix; e.g. DNA×DNA hybridisation data, morphometric differences, immunological data, pairwise genetic distances
◦ Character data, usually stored in a character array; e.g. multiple sequence alignment of DNA sequences, morphological characters
Phylogenetic Reconstruction
Given the huge number of possible trees even for small data sets, we have two options:
◦ Build one according to some clustering algorithm
◦ Assign a “goodness of fit” criterion (an objective function) and find the tree(s) which optimise(s) this criterion
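To make the number of possible trees concrete, here is a minimal sketch that counts unrooted binary tree topologies on n labelled leaves. It uses the standard count (2n-5)!! = 3 × 5 × ... × (2n-5), which is not stated on the slide; the function name is illustrative.

```python
def count_unrooted_trees(n: int) -> int:
    """Number of unrooted binary tree topologies on n labelled leaves, (2n-5)!!."""
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5 (empty product for n <= 3).
    for odd in range(3, 2 * n - 4, 2):
        count *= odd
    return count


for n in (4, 5, 10, 20):
    print(n, count_unrooted_trees(n))
```

Already at 10 taxa there are about two million topologies, which is why exhaustive search is only practical for very small data sets.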
Phylogenetic Reconstruction: methods by type of data and tree-building method
Tree-building method | Type of data: distances | Type of data: nucleotide sites
Clustering algorithm | UPGMA, Neighbor-Joining | -
Optimality criterion | Minimum Evolution | Maximum Parsimony, Maximum Likelihood
Phylogenetic Methods
Many different procedures exist. Three of the most popular:
• Neighbor-joining: minimizes distance between nearest neighbors
• Maximum parsimony: minimizes total evolutionary change
• Maximum likelihood: maximizes likelihood of observed data
Distance-Based Tree Construction
Given a set of species (leaves in a supposed tree) and distances between them, construct a phylogeny which best “fits” the distances.
Orc: ACAGTGACGCCCCAAACGT
Elf: ACAGTGACGCTACAAACGT
Dwarf: CCTGTGACGTAACAAACGA
Hobbit: CCTGTGACGTAGCAAACGA
Human: CCTGTGACGTAGCAAACGA
[Figure labels: Orc, Elf, Dwarf, Hobbit, Human]
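As an illustration of how distances between these species could be obtained, the sketch below builds a pairwise distance table from the five sequences. The slide does not say which distance it uses; counting mismatched positions between the aligned sequences is an assumption made here, and `mismatch_count` is an illustrative name.

```python
from itertools import combinations

# Sequences copied from the slide.
seqs = {
    "Orc":    "ACAGTGACGCCCCAAACGT",
    "Elf":    "ACAGTGACGCTACAAACGT",
    "Dwarf":  "CCTGTGACGTAACAAACGA",
    "Hobbit": "CCTGTGACGTAGCAAACGA",
    "Human":  "CCTGTGACGTAGCAAACGA",
}


def mismatch_count(a: str, b: str) -> int:
    """Number of aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


names = list(seqs)
D = {p: {q: 0 for q in names} for p in names}
for p, q in combinations(names, 2):
    D[p][q] = D[q][p] = mismatch_count(seqs[p], seqs[q])

# Print the symmetric distance matrix, one row per species.
for p in names:
    print(f"{p:>6}", *(f"{D[p][q]:2d}" for q in names))
```

Under this assumed distance, Hobbit and Human are identical (distance 0) and Orc and Elf differ at two positions.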
Distance Matrix
• Given n species, we can compute the n x n distance matrix Dij
• Dij may be defined as the edit distance between a gene in species i and species j, where the gene of interest is sequenced for all n species.
• Dij can also be any other feature-based distance
Distances in Trees
• Edges may have weights reflecting:
◦ Number of mutations on the evolutionary path from one species to another
◦ Time estimate for evolution of one species into another
• In a tree T, we often compute dij(T), the length of the path between leaves i and j
Distance in Trees: an Example
d1,4 = 12 + 13 + 14 + 17 + 12 = 68
[Figure: weighted tree with leaves i and j marked]
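The path-length computation in this example can be sketched as follows. The figure's topology is not recoverable from the text, so the tree below is hypothetical: it only reproduces the five edge weights on the path from leaf 1 to leaf 4, and the internal node names and side branches are invented for illustration.

```python
def path_length(tree, start, goal):
    """Sum of edge weights on the unique path between two nodes of a tree."""
    stack = [(start, None, 0)]  # (node, node we came from, distance so far)
    while stack:
        node, parent, dist = stack.pop()
        if node == goal:
            return dist
        for nxt, weight in tree[node].items():
            if nxt != parent:  # never walk back along the edge just used
                stack.append((nxt, node, dist + weight))
    raise ValueError("goal is not reachable from start")


# Hypothetical tree: path 1-a-b-c-d-4 carries the weights from the slide,
# the branches to leaves 2 and 3 are made up.
edges = [("1", "a", 12), ("a", "b", 13), ("b", "c", 14), ("c", "d", 17),
         ("d", "4", 12), ("a", "2", 6), ("c", "3", 4)]
tree = {}
for u, v, w in edges:
    tree.setdefault(u, {})[v] = w
    tree.setdefault(v, {})[u] = w

print(path_length(tree, "1", "4"))  # 12 + 13 + 14 + 17 + 12 = 68
```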
Fitting Distance Matrix
• Given n species, we can compute the n x n distance matrix Dij
• Evolution of these genes is described by a tree that we don’t know.
• We need an algorithm to construct a tree that best fits the distance matrix Dij
Reconstructing a 3-Leaved Tree
• Tree reconstruction for any 3x3 matrix is straightforward
• We have 3 leaves i, j, k and a center vertex c
Observe:
dic + djc = Dij
dic + dkc = Dik
djc + dkc = Djk
Adding the first two equations:
2dic + djc + dkc = Dij + Dik
Substituting djc + dkc = Djk:
2dic + Djk = Dij + Dik
dic = (Dij + Dik – Djk)/2
Similarly,
djc = (Dij + Djk – Dik)/2
dkc = (Dki + Dkj – Dij)/2
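The three formulas translate directly into code. This is a minimal sketch; the function name and the example distances (5, 7, 8) are made up for illustration.

```python
def three_leaf_edges(D_ij, D_ik, D_jk):
    """Edge lengths from leaves i, j, k to the center vertex c."""
    d_ic = (D_ij + D_ik - D_jk) / 2
    d_jc = (D_ij + D_jk - D_ik) / 2
    d_kc = (D_ik + D_jk - D_ij) / 2
    return d_ic, d_jc, d_kc


d_ic, d_jc, d_kc = three_leaf_edges(D_ij=5, D_ik=7, D_jk=8)
print(d_ic, d_jc, d_kc)  # 2.0 3.0 5.0

# The edge lengths reproduce the input distances.
assert d_ic + d_jc == 5 and d_ic + d_kc == 7 and d_jc + d_kc == 8
```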
Trees with > 3 Leaves
• An unrooted binary tree with n leaves has 2n-3 edges
• This means fitting a given tree to a distance matrix D requires solving a system of “n choose 2” equations with 2n-3 variables
• For n > 3 there are more equations than variables, so the system cannot always be solved
Additive Distance Matrices
• Matrix D is ADDITIVE if there exists a tree T with dij(T) = Dij for all pairs of leaves i, j
• D is NON-ADDITIVE otherwise
Distance-Based Phylogeny Problem
• Goal: Reconstruct an evolutionary tree from a distance matrix
• Input: n x n distance matrix Dij
• Output: weighted tree T with n leaves fitting D
• If D is additive, this problem has a solution and there is a simple algorithm to solve it
Using Neighboring Leaves to Construct the Tree
• Find neighboring leaves i and j with parent k
• Remove the rows and columns of i and j
• Add a new row and column corresponding to k, where the distance from k to any other leaf m can be computed as:
Dkm = (Dim + Djm – Dij)/2
• Compress i and j into k, iterate algorithm for rest of tree
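One compression step can be sketched as below. The distance table is hypothetical: it was derived from a made-up tree in which i and j hang off parent k with edges 2 and 3, m and n hang off another internal node with edges 1 and 5, and the internal edge has length 4. The function name `merge_neighbors` is also illustrative.

```python
def merge_neighbors(D, i, j, k):
    """Return a new distance table in which neighbors i and j are replaced by parent k."""
    rest = [m for m in D if m not in (i, j)]
    new = {a: {b: D[a][b] for b in rest} for a in rest}
    new[k] = {k: 0.0}
    for m in rest:
        # D_km = (D_im + D_jm - D_ij) / 2
        new[k][m] = new[m][k] = (D[i][m] + D[j][m] - D[i][j]) / 2
    return new


# Hypothetical additive distances for leaves i, j, m, n.
pairs = {("i", "j"): 5, ("i", "m"): 7, ("i", "n"): 11,
         ("j", "m"): 8, ("j", "n"): 12, ("m", "n"): 6}
leaves = ["i", "j", "m", "n"]
D = {a: {b: 0.0 for b in leaves} for a in leaves}
for (a, b), dist in pairs.items():
    D[a][b] = D[b][a] = float(dist)

reduced = merge_neighbors(D, "i", "j", "k")
print(reduced["k"]["m"], reduced["k"]["n"])  # 5.0 9.0
```

In the made-up tree, k is at distance 4 + 1 = 5 from m and 4 + 5 = 9 from n, which matches the values produced by the formula.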
Finding Neighboring Leaves
• Tempting idea: to find neighboring leaves, simply select a pair of closest leaves. This is WRONG.
• Closest leaves aren’t necessarily neighbors: i and j are neighbors, but (dij = 13) > (djk = 12)
• Finding a pair of neighboring leaves is a nontrivial problem!
Neighbor Finding: Saitou & Nei algorithm (1987)
Definitions:
• For a leaf i, let $r_i = \sum_{u \text{ is a leaf}} d(i,u)$
• For 2 leaves i, j: $D(i,j) = (L-2)\,d(i,j) - (r_i + r_j)$, with L the number of leaves
Theorem (Saitou & Nei): Assume all edge weights are positive. If D(i,j) is minimal (among all pairs of leaves), then i and j are neighboring leaves in the tree.
Neighbor Joining Algorithm
• In 1987 Naruya Saitou and Masatoshi Nei developed a neighbor joining algorithm for phylogenetic tree reconstruction
• Finds a pair of leaves that are close to each other but far from other leaves: implicitly finds a pair of neighboring leaves
• Advantages: works well for additive and also for non-additive matrices, and it does not rely on the flawed molecular clock assumption
Neighbor-joining
• Guaranteed to produce the correct tree if distance is additive
• May produce a good tree even when distance is not additive
Step 1: Finding neighboring leaves
Define $D_{ij} = d_{ij} - (r_i + r_j)$, where $r_i = \frac{1}{|L|-2} \sum_{k \in L} d_{ik}$
[Figure: unrooted tree with leaves 1, 2, 3, 4 and edge lengths 0.1, 0.1, 0.1, 0.4, 0.4]
Algorithm: Neighbor-joining
Initialization:
• Define T to be the set of leaf nodes, one per sequence
• Let L = T
Iteration:
• Pick i, j s.t. Dij is minimal
• Define a new node k, and set dkm = ½ (dim + djm – dij) for all m ∈ L
• Add k to T, with edges of lengths dik = ½ (dij + ri – rj), djk = dij – dik
• Remove i, j from L; add k to L
Termination:
• When L consists of two nodes i, j, add the edge between them, of length dij
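The steps above can be sketched in Python as follows. This is a minimal sketch rather than a reference implementation: the names `neighbor_joining` and `node0`, `node1`, ... are illustrative, and the input distances are hypothetical, derived from a four-leaf tree assumed here to have leaf edges 0.1 (leaf 1), 0.4 (leaf 2), 0.1 (leaf 3), 0.4 (leaf 4) and an internal edge of 0.1, with 1,2 and 3,4 as neighbor pairs.

```python
from itertools import combinations


def neighbor_joining(d):
    """d: symmetric dict of dicts of leaf distances. Returns (node, node, length) edges."""
    d = {a: dict(row) for a, row in d.items()}  # work on a copy
    L = list(d)
    edges, next_id = [], 0
    while len(L) > 2:
        # r_i = (1 / (|L| - 2)) * sum over k in L of d_ik
        r = {i: sum(d[i][k] for k in L if k != i) / (len(L) - 2) for i in L}
        # Pick the pair minimising D_ij = d_ij - (r_i + r_j).
        i, j = min(combinations(L, 2),
                   key=lambda p: d[p[0]][p[1]] - r[p[0]] - r[p[1]])
        k = f"node{next_id}"
        next_id += 1
        d_ik = 0.5 * (d[i][j] + r[i] - r[j])
        edges.append((i, k, d_ik))
        edges.append((j, k, d[i][j] - d_ik))
        # d_km = 1/2 (d_im + d_jm - d_ij) for every remaining node m.
        d[k] = {}
        for m in L:
            if m not in (i, j):
                d[k][m] = d[m][k] = 0.5 * (d[i][m] + d[j][m] - d[i][j])
        L = [m for m in L if m not in (i, j)] + [k]
    i, j = L
    edges.append((i, j, d[i][j]))  # termination: join the last two nodes
    return edges


# Hypothetical additive distances (see the assumptions above).
pairs = {("1", "2"): 0.5, ("1", "3"): 0.3, ("1", "4"): 0.6,
         ("2", "3"): 0.6, ("2", "4"): 0.9, ("3", "4"): 0.5}
d = {x: {} for x in "1234"}
for (a, b), w in pairs.items():
    d[a][b] = d[b][a] = w

for a, b, w in neighbor_joining(d):
    print(a, b, round(w, 3))
```

In this matrix the closest pair is leaves 1 and 3 (distance 0.3), which are not neighbors in the assumed tree, yet the corrected criterion still joins 1 with 2 and 3 with 4.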
Rooting a tree, and definition of outgroup
• Neighbor-joining produces an unrooted tree
• How do we root a tree between N species using n-j?
• An outgroup is a species that we know to be more distantly related to all remaining species than they are to one another
• Example: Human, mouse, rat, pig, dog, chicken, whale. Which one is an outgroup?
• The outgroup can act as a root
[Figure: tree with nodes 1, 2, 3, 4]
Neighbor Joining Algorithm: Widely Used
• Applicable to matrices which are not additive
• Known to work well in practice
• The algorithm and its variants are the most widely used distance-based algorithms today.
Maximum Parsimony Method for Tree Inference
• A character-based method
• Input: h sequences (one per species), all of length k.
• Goal: Find a tree with the input sequences at its leaves, and an assignment of sequences to internal nodes, such that the total number of substitutions is minimized.
Two sub-problems:
1. Find the parsimony cost of a given tree (easy)
2. Search through all tree topologies (hard)
Example
Input: four nucleotide sequences: AAG, AAA, GGA, AGA taken from four species.
By the parsimony principle, we seek a tree that has a minimum total number of substitutions of symbols between species and their originator in the phylogenetic tree. Here is one possible tree.
[Figure: tree with leaves AGA, AAA, GGA, AAG, all three internal nodes labelled AAA, and edge costs 2, 1, 1]
Total #substitutions = 4
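The parsimony cost of a fixed tree (the easy sub-problem) can be computed with Fitch's algorithm, a standard method that the slides do not name. The sketch below makes two assumptions: the species names s1 to s4 are invented, and the grouping (AGA, AAA) with (GGA, AAG) is a guess at the figure's topology, chosen because it reproduces the total of 4.

```python
def fitch_cost(tree, seqs):
    """Minimum number of substitutions for a fixed rooted binary tree.

    tree: nested 2-tuples whose leaves are names in seqs (equal-length strings).
    """
    length = len(next(iter(seqs.values())))
    total = 0
    for site in range(length):
        def states(node):
            nonlocal total
            if isinstance(node, str):  # leaf: its observed character
                return {seqs[node][site]}
            left, right = (states(child) for child in node)
            common = left & right
            if common:                 # children agree: no substitution needed
                return common
            total += 1                 # children disagree: one substitution
            return left | right
        states(tree)
    return total


# Sequences from the slide; the names s1..s4 are hypothetical.
seqs = {"s1": "AAG", "s2": "AAA", "s3": "GGA", "s4": "AGA"}

assumed_slide_tree = (("s4", "s2"), ("s3", "s1"))  # (AGA, AAA) with (GGA, AAG)
other_tree = (("s1", "s2"), ("s3", "s4"))          # (AAG, AAA) with (GGA, AGA)

print(fitch_cost(assumed_slide_tree, seqs))  # 4
print(fitch_cost(other_tree, seqs))          # 3
```

The second grouping needs only 3 substitutions, which shows why the search over topologies matters: the cost of one possible tree is not necessarily the minimum over all trees.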
Least Squares Distance Phylogeny Problem
• If the distance matrix D is NOT additive, then we look for a tree T that approximates D the best:
Squared Error: $\sum_{i,j} (d_{ij}(T) - D_{ij})^2$
• Squared Error is a measure of the quality of the fit between distance matrix and the tree: we want to minimize it.
• Least Squares Distance Phylogeny Problem: finding the best approximation tree T for a non-additive matrix D (NP-hard).
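A minimal sketch of the squared-error criterion follows. It assumes the sum runs over unordered pairs of leaves (summing over ordered pairs would simply double the value); the helper `table`, the observed matrix D and the candidate tree distances are all hypothetical.

```python
from itertools import combinations


def squared_error(tree_dist, D):
    """Sum over unordered leaf pairs of (d_ij(T) - D_ij)^2."""
    leaves = sorted(D)
    return sum((tree_dist[i][j] - D[i][j]) ** 2
               for i, j in combinations(leaves, 2))


def table(pairs):
    """Build a symmetric nested dict from {(a, b): distance}."""
    t = {}
    for (a, b), w in pairs.items():
        t.setdefault(a, {})[b] = w
        t.setdefault(b, {})[a] = w
    return t


# Hypothetical non-additive observed distances.
D = table({("a", "b"): 5, ("a", "c"): 8, ("a", "d"): 11,
           ("b", "c"): 8, ("b", "d"): 10, ("c", "d"): 6})
# Path lengths in one hypothetical candidate tree T.
tree_dist = table({("a", "b"): 5, ("a", "c"): 7, ("a", "d"): 11,
                   ("b", "c"): 8, ("b", "d"): 12, ("c", "d"): 6})

print(squared_error(tree_dist, D))  # (7-8)^2 + (12-10)^2 = 5
```

Evaluating the error of one candidate tree is cheap; the hard part of the problem is searching over tree topologies and edge lengths for the minimum.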
Search through tree topologies: Branch and Bound
• Observation: adding an edge to an existing tree can only increase the parsimony cost
• Enumerate all unrooted trees with at most N leaves: [i3][i5][i7]……[i2N–5], where each ik can take values from 0 (no edge) to k
• At each point keep C = smallest cost so far for a complete tree
• Start B&B with tree [1][0][0]……[0]
• Whenever cost of current tree T is > C, then:
◦ T is not optimal
◦ Any tree with more edges containing T is not optimal
◦ Increment by 1 the rightmost nonzero counter
Comparison of Methods
Neighbor-joining | Maximum parsimony | Maximum likelihood
Very fast | Slow | Very slow
Easily trapped in local optima | Assumptions fail when evolution is rapid | Highly dependent on assumed evolution model
Good for generating tentative tree, or choosing among multiple trees | Best option when tractable (fewer than 30 taxa, strong conservation) | Good for very small data sets and for testing trees built using other methods
Summary
• Category of phylogenetic inference algorithms
• Neighbor-joining algorithm
Acknowledgement
• Anonymous authors