Patterns
Some are easy, some are not:
• Knitting patterns
• Cooking recipes
• Pictures (dot plots)
• Colour patterns
• Maps
• Protein structures
• Protein sequences
• Protein interaction maps
Example of algorithm reuse: Data clustering
• Many biological data analysis problems can be formulated as clustering problems
  – microarray gene expression data analysis
  – identification of regulatory binding sites (similarly, splice junction sites, translation start sites, ...)
  – (yeast) two-hybrid data analysis (for inference of protein complexes)
  – phylogenetic tree clustering (for inference of horizontally transferred genes)
  – protein domain identification
  – identification of structural motifs
  – prediction reliability assessment of protein structures
  – NMR peak assignments
  – ...
Data Clustering Problems
• Clustering: partition a data set into clusters so that data points of the same cluster are “similar” and points of different clusters are “dissimilar”
• Cluster identification: identifying clusters whose features differ significantly from the background
Application Examples
• Regulatory binding site identification: CRP (CAP) binding site
• Two-hybrid data analysis
• Gene expression data analysis
Are all solvable by the same algorithm!
Other Application Examples
• Phylogenetic tree clustering analysis (Evolutionary trees)
• Protein sidechain packing prediction
• Assessment of prediction reliability of protein structures
• Protein secondary structures
• Protein domain prediction
• NMR peak assignments
• ……
Multivariate statistics – Cluster analysis
[Diagram: raw table (5 objects × characters C1, C2, C3, …) → similarity criterion → 5×5 similarity matrix of scores → cluster criterion → dendrogram]
Comparing sequences – Similarity score
Many properties can be used:
• Nucleotide or amino acid composition
• Isoelectric point
• Molecular weight
• Morphological characters
• But: molecular evolution is best measured through sequence alignment
Multivariate statistics – Cluster analysis
Now for sequences
[Diagram: multiple sequence alignment (5 sequences) → similarity criterion → 5×5 similarity matrix of scores → phylogenetic tree]
Human         -KITVVGVGAVGMACAISILMKDLADELALVDVIEDKLKGEMMDLQHGSLFLRTPKIVSGKDYNVTANSKLVIITAGARQ
Chicken       -KISVVGVGAVGMACAISILMKDLADELTLVDVVEDKLKGEMMDLQHGSLFLKTPKITSGKDYSVTAHSKLVIVTAGARQ
Dogfish       -KITVVGVGAVGMACAISILMKDLADEVALVDVMEDKLKGEMMDLQHGSLFLHTAKIVSGKDYSVSAGSKLVVITAGARQ
Lamprey       SKVTIVGVGQVGMAAAISVLLRDLADELALVDVVEDRLKGEMMDLLHGSLFLKTAKIVADKDYSVTAGSRLVVVTAGARQ
Barley        TKISVIGAGNVGMAIAQTILTQNLADEIALVDALPDKLRGEALDLQHAAAFLPRVRI-SGTDAAVTKNSDLVIVTAGARQ
Maizey        TKVSVIGAGNVGMAIAQTILTRDLADEIALVDAVPDKLRGEMLDLQHAAAFLPRTRLVSGTDMSVTRGSDLVIVTAGARQ
Lacto_casei   -KVILVGDGAVGSSYAYAMVLQGIAQEIGIVDIFKDKTKGDAIDLSNALPFTSPKKIYSA-EYSDAKDADLVVITAGAPQ
Bacillus_stea -RVVVIGAGFVGASYVFALMNQGIADEIVLIDANESKAIGDAMDFNHGKVFAPKPVDIWHGDYDDCRDADLVVICAGANQ
Lacto_plant   QKVVLVGDGAVGSSYAFAMAQQGIAEEFVIVDVVKDRTKGDALDLEDAQAFTAPKKIYSG-EYSDCKDADLVVITAGAPQ
Therma_mari   MKIGIVGLGRVGSSTAFALLMKGFAREMVLIDVDKKRAEGDALDLIHGTPFTRRANIYAG-DYADLKGSDVVIVAAGVPQ
Bifido        -KLAVIGAGAVGSTLAFAAAQRGIAREIVLEDIAKERVEAEVLDMQHGSSFYPTVSIDGSDDPEICRDADMVVITAGPRQ
Thermus_aqua  MKVGIVGSGFVGSATAYALVLQGVAREVVLVDLDRKLAQAHAEDILHATPFAHPVWVRSGW-YEDLEGARVVIVAAGVAQ
Mycoplasma    -KIALIGAGNVGNSFLYAAMNQGLASEYGIIDINPDFADGNAFDFEDASASLPFPISVSRYEYKDLKDADFIVITAGRPQ
Lactate dehydrogenase multiple alignment
Distance Matrix
                   1     2     3     4     5     6     7     8     9    10    11    12    13
 1 Human         0.000 0.112 0.128 0.202 0.378 0.346 0.530 0.551 0.512 0.524 0.528 0.635 0.637
 2 Chicken       0.112 0.000 0.155 0.214 0.382 0.348 0.538 0.569 0.516 0.524 0.524 0.631 0.651
 3 Dogfish       0.128 0.155 0.000 0.196 0.389 0.337 0.522 0.567 0.516 0.512 0.524 0.600 0.655
 4 Lamprey       0.202 0.214 0.196 0.000 0.426 0.356 0.553 0.589 0.544 0.503 0.544 0.616 0.669
 5 Barley        0.378 0.382 0.389 0.426 0.000 0.171 0.536 0.565 0.526 0.547 0.516 0.629 0.575
 6 Maizey        0.346 0.348 0.337 0.356 0.171 0.000 0.557 0.563 0.538 0.555 0.518 0.643 0.587
 7 Lacto_casei   0.530 0.538 0.522 0.553 0.536 0.557 0.000 0.518 0.208 0.445 0.561 0.526 0.501
 8 Bacillus_stea 0.551 0.569 0.567 0.589 0.565 0.563 0.518 0.000 0.477 0.536 0.536 0.598 0.495
 9 Lacto_plant   0.512 0.516 0.516 0.544 0.526 0.538 0.208 0.477 0.000 0.433 0.489 0.563 0.485
10 Therma_mari   0.524 0.524 0.512 0.503 0.547 0.555 0.445 0.536 0.433 0.000 0.532 0.405 0.598
11 Bifido        0.528 0.524 0.524 0.544 0.516 0.518 0.561 0.536 0.489 0.532 0.000 0.604 0.614
12 Thermus_aqua  0.635 0.631 0.600 0.616 0.629 0.643 0.526 0.598 0.563 0.405 0.604 0.000 0.641
13 Mycoplasma    0.637 0.651 0.655 0.669 0.575 0.587 0.501 0.495 0.485 0.598 0.614 0.641 0.000
Multivariate statistics – Cluster analysis
[Diagram: data table (5 objects × characters C1, C2, …) → similarity criterion → 5×5 similarity matrix of scores → cluster criterion → dendrogram/tree]
Multivariate statistics – Cluster analysis
Why do it?
• Finding a true typology
• Model fitting
• Prediction based on groups
• Hypothesis testing
• Data exploration
• Data reduction
• Hypothesis generation
But you can never prove a classification/typology!
Cluster analysis – data normalisation/weighting
[Diagram: raw table (5 objects × characters C1, C2, …) → normalisation criterion → normalised table]

Column normalisation: x/max
Column range normalisation: (x − min)/(max − min)
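The two column normalisations can be sketched in plain Python (function names are illustrative, not from the slides):

```python
def normalise_max(column):
    """Column normalisation: x / max."""
    m = max(column)
    return [x / m for x in column]

def normalise_range(column):
    """Column range normalisation: (x - min) / (max - min)."""
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]

col = [2.0, 4.0, 6.0, 10.0]
print(normalise_max(col))    # [0.2, 0.4, 0.6, 1.0]
print(normalise_range(col))  # [0.0, 0.25, 0.5, 1.0]
```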
Cluster analysis – (dis)similarity matrix
[Diagram: raw table (5 objects × characters C1, C2, …) → similarity criterion → 5×5 similarity matrix of scores]

D_i,j = (Σ_k |x_ik − x_jk|^r)^(1/r)   (Minkowski metrics)
r = 2: Euclidean distance
r = 1: City block distance
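A minimal sketch of the Minkowski metric, with the two special cases above:

```python
def minkowski(x, y, r):
    """D = (sum_k |x_k - y_k|^r)^(1/r) between two equal-length vectors."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)

p, q = [0.0, 0.0], [3.0, 4.0]
print(minkowski(p, q, 2))  # Euclidean distance: 5.0
print(minkowski(p, q, 1))  # City block distance: 7.0
```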
Cluster analysis – Clustering criteria
[Diagram: 5×5 similarity matrix of scores → cluster criterion → dendrogram (tree)]
Single linkage - Nearest neighbour
Complete linkage – Furthest neighbour
Group averaging – UPGMA
Ward
Neighbour joining – global measure
Cluster analysis – Clustering criteria
1. Start with N clusters of 1 object each
2. Apply the clustering distance criterion iteratively until you have 1 cluster of N objects
3. The most interesting clustering is usually somewhere in between

[Diagram: dendrogram (tree), with distance on the vertical axis running from N clusters down to 1 cluster]
Single linkage clustering (nearest neighbour)
[Scatter plot of points in two dimensions (Char 1 vs Char 2), showing clusters]

The distance from a point to a cluster is defined as the smallest distance between that point and any point in the cluster.
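A minimal sketch of that definition on hypothetical 2-D points:

```python
def euclid(a, b):
    """Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_linkage(point, cluster):
    # smallest distance between the point and any member of the cluster
    return min(euclid(point, member) for member in cluster)

cluster = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
print(single_linkage((1.0, 1.0), cluster))  # 1.0 (nearest member is (1.0, 0.0))
```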
Cluster analysis – Ward’s clustering criterion
Per cluster: calculate Error Sum of Squares (ESS)
ESS = Σx² − (Σx)²/n
calculate minimum increase of ESS
Suppose five objects with values 1, 2, 7, 9 and 12. Merging step by step, choosing the merge with the minimum increase of ESS:

Clusters                ESS
{1} {2} {7} {9} {12}    0
{1,2} {7} {9} {12}      0.5
{1,2} {7,9} {12}        2.5
{1,2} {7,9,12}          13.1
{1,2,7,9,12}            86.8
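A plain-Python sketch of Ward's criterion for this example: ESS per cluster is Σx² − (Σx)²/n, summed over all clusters. (At the fourth stage 0.5 + 12.67 gives about 13.2, reported as 13.1 in the slide.)

```python
def ess(cluster):
    """Error sum of squares for one cluster of values."""
    return sum(x * x for x in cluster) - sum(cluster) ** 2 / len(cluster)

def total_ess(clusters):
    return sum(ess(c) for c in clusters)

print(round(total_ess([[1], [2], [7], [9], [12]]), 1))  # 0.0
print(round(total_ess([[1, 2], [7], [9], [12]]), 1))    # 0.5
print(round(total_ess([[1, 2], [7, 9], [12]]), 1))      # 2.5
print(round(total_ess([[1, 2, 7, 9, 12]]), 1))          # 86.8
```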
Partitional Clustering
• divide instances into disjoint clusters
– flat vs. tree structure
• key issues
– how many clusters should there be?
– how should clusters be represented?
Partitional Clustering from a Hierarchical Clustering

We can always generate a partitional clustering from a hierarchical clustering by “cutting” the tree at some level.
K-Means Clustering
• assume our instances are represented by vectors of real values
• put k cluster centers in the same space as the instances
• now iteratively move the cluster centers
• each iteration involves two steps:
  – assignment of instances to clusters
  – re-computation of the means
• in k-means clustering, instances are assigned to one and only one cluster
• can do “soft” k-means clustering via the Expectation Maximization (EM) algorithm
  – each cluster is represented by a normal distribution
  – E step: determine how likely it is that each cluster “generated” each instance
  – M step: move cluster centers to maximize the likelihood of the instances
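The two k-means steps can be sketched in plain Python (function names and the small 2-D data set are illustrative assumptions):

```python
def kmeans(points, centres, iterations=10):
    for _ in range(iterations):
        # assignment step: each instance goes to its nearest centre
        clusters = [[] for _ in centres]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centres]
            clusters[dists.index(min(dists))].append(p)
        # re-computation step: move each centre to the mean of its cluster
        centres = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centres)
        ]
    return centres, clusters

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centres, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centres)  # [(1.25, 1.5), (8.5, 8.75)]
```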
Ecogenomics

[Diagram: a sample is scored against condition 1 (contaminant 1), condition 2 (contaminant 2), …, condition n (contaminant n), yielding compatibility scores]

Algorithm that maps the observed clustering behaviour of sampled gene expression data onto the clustering behaviour of contaminant-labelled gene expression patterns in the knowledge base.
Genome-Wide Cluster Analysis
Eisen dataset
• Eisen et al., PNAS 1998
• S. cerevisiae (baker’s yeast)
  – all genes (~6200) on a single array
  – measured during several processes
• human fibroblasts
  – 8600 human transcripts on array
  – measured at 12 time points during serum stimulation
The Eisen Data
• 79 measurements for yeast data
• collected at various time points during
  – diauxic shift (shutting down genes for metabolizing sugars, activating those for metabolizing ethanol)
  – mitotic cell division cycle
  – sporulation
  – temperature shock
  – reducing shock
The Data
• each measurement represents

log(Red_i / Green_i)

where Red_i is the test expression level and Green_i is the reference expression level for gene G in the i-th experiment
• the expression profile of a gene is the vector of its measurements across all experiments

[G_1 .. G_n]
The Data
• m genes measured in n experiments:

g_1,1 … g_1,n
g_2,1 … g_2,n
  ⋮       ⋮
g_m,1 … g_m,n

Each row is the vector for one gene.
Eisen et al. Results
• redundant representations of genes cluster together
  – but individual genes can be distinguished from related genes by subtle differences in expression
• genes of similar function cluster together
  – e.g. 126 genes strongly down-regulated in response to stress
Eisen et al. Results
• 126 genes down-regulated in response to stress
  – 112 of the genes encode ribosomal and other proteins related to translation
  – agrees with the previously known result that yeast responds to favorable growth conditions by increasing the production of ribosomes
[Figure: a weighted graph and its adjacency matrix; first rows of the matrix:]

0    1    1.5  2    5    6    7    9
1    0    2    1    6.5  6    8    8
1.5  2    0    1    4    4    6    5.5
⋮
Graphs - definition
An undirected graph has a symmetric adjacency matrix
A digraph typically has a non-symmetric adjacency matrix
A Theoretical Framework
• Representation of a set of n-dimensional (n-D) points as a graph
  – each data point is represented as a node
  – each pair of points is represented as an edge with a weight defined by the “distance” between the two points
[Figure: n-D data points → distance matrix → graph representation]
A Theoretical Framework
• Spanning tree: a sub-graph that has all nodes connected and has no cycles
• Minimum spanning tree: a spanning tree with the minimum total distance
[Figure: (a) a weighted graph, (b) a spanning tree, (c) a minimum spanning tree]
Spanning tree• Prim’s algorithm (graph, tree)
– step 1: select an arbitrary node as the current tree – step 2: find an external node that is closest to the tree, and add it with its
corresponding edge into tree– step 3: continue steps 1 and 2 till all nodes are connected in tree.
[Figure: panels (a)–(e) showing Prim’s algorithm growing the minimum spanning tree on a weighted graph with edge weights 3, 4, 5, 6, 7, 8 and 10]
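A minimal sketch of Prim's algorithm following those steps (the edge-list representation and node names are assumptions, not from the slides):

```python
def prim(nodes, edges):
    """edges: list of (u, v, weight) for an undirected graph."""
    in_tree = {nodes[0]}                  # step 1: arbitrary start node
    mst = []
    while len(in_tree) < len(nodes):
        # step 2: cheapest edge joining the tree to an external node
        w, u, v = min((w, u, v) for u, v, w in edges
                      if (u in in_tree) != (v in in_tree))
        in_tree.add(u if v in in_tree else v)
        mst.append((u, v, w))             # step 3: repeat until all connected
    return mst

nodes = ["a", "b", "c", "d"]
edges = [("a", "b", 4), ("a", "c", 8), ("b", "c", 3),
         ("b", "d", 7), ("c", "d", 10)]
mst = prim(nodes, edges)
print(sum(w for _, _, w in mst))  # total MST weight: 14
```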
• Kruskal’s algorithm– step 1: consider edges in non-decreasing order – step 2: if edge selected does not form cycle, then add it into tree; otherwise reject– step 3: continue steps 1 and 2 till all nodes are connected in tree.
[Figure: panels (a)–(f) showing Kruskal’s algorithm adding edges in order of weight (3, 3, 4, 5, …) and rejecting an edge that would form a cycle]
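A minimal sketch of Kruskal's algorithm, using a simple union-find structure to detect cycles (the example graph is an assumption, not from the slides):

```python
def kruskal(nodes, edges):
    """edges: list of (u, v, weight); returns the MST edge list."""
    parent = {n: n for n in nodes}
    def find(n):                          # follow parents up to the root
        while parent[n] != n:
            n = parent[n]
        return n
    mst = []
    for u, v, w in sorted(edges, key=lambda e: e[2]):   # step 1: by weight
        ru, rv = find(u), find(v)
        if ru != rv:                      # step 2: reject cycle-forming edges
            parent[ru] = rv
            mst.append((u, v, w))
    return mst                            # step 3: all nodes connected

nodes = ["a", "b", "c", "d"]
edges = [("a", "b", 4), ("a", "c", 8), ("b", "c", 3),
         ("b", "d", 7), ("c", "d", 10)]
print(kruskal(nodes, edges))  # [('b', 'c', 3), ('a', 'b', 4), ('b', 'd', 7)]
```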
Multivariate statistics – Cluster analysis
[Diagram: data table (5 objects × characters C1, C2, …) → similarity criterion → 5×5 similarity matrix of scores → cluster criterion → phylogenetic tree]
Multivariate statistics – Cluster analysis
[Diagram: cluster both the rows (via a 5×5 similarity matrix) and the columns (via a 6×6 similarity matrix) of the data table, then use the two dendrograms to make a two-way ordered table]
Multivariate statistics – Principal Component Analysis (PCA)
Principal component analysis (PCA) involves a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible.
Traditionally, principal component analysis is performed on a square symmetric matrix of type SSCP (pure sums of squares and cross products), Covariance (scaled sums of squares and cross products), or Correlation (sums of squares and cross products from standardized data). The analysis results for objects of type SSCP and Covariance do not differ, since these objects differ only by a global scaling factor. A Correlation object has to be used if the variances of the individual variates differ greatly, or if the units of measurement of the individual variates differ.

The result of a principal component analysis on such objects will be a new object of type PCA.
Multivariate statistics – Principal Component Analysis (PCA)
Objectives of principal component analysis
• To discover or reduce the dimensionality of the data set.
• To identify new, meaningful underlying variables.
Multivariate statistics – Principal Component Analysis (PCA)
How to start
We assume that the multi-dimensional data have been collected in a TableOfReal object. If the variances of the individual columns in the TableOfReal differ greatly, or the measurement units of the columns differ, then you should first standardize the data.
Performing a principal component analysis on a standardized data matrix has the same effect as performing the analysis on the correlation matrix (the covariance matrix from standardized data is equal to the correlation matrix of these data).
Calculate Eigenvectors and Eigenvalues
We can now make a plot of the eigenvalues to get an indication of the importance of each eigenvalue. The exact contribution of each eigenvalue (or a range of eigenvalues) to the "explained variance" can also be queried: You might also check for the equality of a number of eigenvalues.
Multivariate statistics – Principal Component Analysis (PCA)
Determining the number of components
There are two methods to help you to choose the number of components. Both methods are based on relations between the eigenvalues.
Plot the eigenvalues: If the points on the graph tend to level out (show an "elbow"), these eigenvalues are usually close enough to zero that they can be ignored.
Limit the total variance accounted for, and take the associated number of components
Multivariate statistics – Principal Component Analysis (PCA)
Getting the principal components
Principal components are obtained by projecting the multivariate data vectors onto the space spanned by the eigenvectors. This can be done in two ways:
1. Directly from the TableOfReal without first forming a PCA object: you can then draw the Configuration or display its numbers.
2. The standard way: project the TableOfReal onto the PCA’s eigenspace.
Multivariate statistics – Principal Component Analysis (PCA)
Mathematical background on principal component analysis
The mathematical technique used in PCA is called eigen analysis: we solve for the eigenvalues and eigenvectors of a square symmetric matrix with sums of squares and cross products. The eigenvector associated with the largest eigenvalue has the same direction as the first principal component. The eigenvector associated with the second largest eigenvalue determines the direction of the second principal component. The sum of the eigenvalues equals the trace of the square matrix and the maximum number of eigenvectors equals the number of rows (or columns) of this matrix.
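The eigen-analysis described above can be sketched with NumPy (assuming NumPy is available; the small data matrix and variable names are illustrative):

```python
import numpy as np

data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
                 [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])
centred = data - data.mean(axis=0)
cov = np.cov(centred, rowvar=False)        # square symmetric matrix

eigvals, eigvecs = np.linalg.eigh(cov)     # ascending for symmetric matrices
order = np.argsort(eigvals)[::-1]          # largest eigenvalue = first PC
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# the sum of the eigenvalues equals the trace of the covariance matrix
print(eigvals.sum(), np.trace(cov))

scores = centred @ eigvecs                 # project onto the eigenvectors
```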
Multivariate statistics – Principal Component Analysis (PCA)
[Diagram: data table (5 objects × C1–C6) → correlations as the similarity criterion → 6×6 correlation matrix → calculate the eigenvectors with the greatest eigenvalues (linear combinations, orthogonal) → project the data points onto the new axes (eigenvectors 1 and 2)]
“Nothing in Biology makes sense except in the light of evolution” (Theodosius Dobzhansky (1900-1975))
“Nothing in bioinformatics makes sense except in the light of Biology”
Bioinformatics
Evolution
• Most of bioinformatics is comparative biology
• Comparative biology is based upon evolutionary relationships between compared entities
• Evolutionary relationships are normally depicted in a phylogenetic tree
Where can phylogeny be used?
• For example, finding out about orthology versus paralogy
• Predicting secondary structure of RNA
• Studying host-parasite relationships
• Mapping cell-bound receptors onto their binding ligands
• Multiple sequence alignment (e.g. Clustal)
Phylogenetic tree (unrooted)

[Figure: unrooted tree with leaves human, mouse, fugu and Drosophila; labels mark an edge, an internal node and a leaf, and the position where a root could be placed]

OTU – Operational taxonomic unit

Phylogenetic tree (rooted)

[Figure: the same tree rooted; labels mark the root, an edge, an internal node (ancestor) and a leaf; the vertical axis represents time]

OTU – Operational taxonomic unit
How to root a tree
• Outgroup – place root between distant sequence and rest group
• Midpoint – place root at midpoint of longest path (sum of branches between any two OTUs)
• Gene duplication – place root between paralogous gene copies
[Figure: alternative rootings of a tree with leaves D (Drosophila), f (fugu), m (mouse) and h (human), with example branch lengths]
Combinatoric explosion
# sequences   # unrooted trees   # rooted trees
     2                1                  1
     3                1                  3
     4                3                 15
     5               15                105
     6              105                945
     7              945             10,395
     8           10,395            135,135
     9          135,135          2,027,025
    10        2,027,025         34,459,425
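These counts follow double-factorial formulas: an unrooted binary tree on n sequences has (2n−5)!! topologies and a rooted one has (2n−3)!!. A short sketch:

```python
def double_factorial(k):
    """Product k * (k-2) * (k-4) * ... down to 1 (returns 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def unrooted(n):
    return double_factorial(2 * n - 5)

def rooted(n):
    return double_factorial(2 * n - 3)

for n in range(2, 11):
    print(n, unrooted(n), rooted(n))
```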
Tree distances
             human  mouse  fugu  Drosophila
human          x
mouse          6      x
fugu           7      3     x
Drosophila    14     10     9      x
[Figure: unrooted tree for human, mouse, fugu and Drosophila, with branch lengths 5, 1, 1, 2 and 6]
Evolutionary (sequence) distance = sequence dissimilarity
Phylogeny methods
• Parsimony – fewest number of evolutionary events (mutations) – relatively often fails to reconstruct correct phylogeny
• Distance based – pairwise distances
• Maximum likelihood – L = Pr[Data|Tree]
Parsimony & Distance

Sequences    1 2 3 4 5 6 7
Drosophila   t t a t t a a
fugu         a a t t t a a
mouse        a a a a a t a
human        a a a a a a t
             human  mouse  fugu  Drosophila
human          x
mouse          2      x
fugu           3      4     x
Drosophila     5      5     3      x
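The distance entries count differing alignment columns between each pair of sequences, which can be sketched as a Hamming distance:

```python
seqs = {
    "Drosophila": "ttattaa",
    "fugu":       "aatttaa",
    "mouse":      "aaaaata",
    "human":      "aaaaaat",
}

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

print(hamming(seqs["human"], seqs["mouse"]))       # 2
print(hamming(seqs["Drosophila"], seqs["fugu"]))   # 3
print(hamming(seqs["Drosophila"], seqs["human"]))  # 5
```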
[Figure: the tree relating Drosophila, fugu, mouse and human, once with site changes mapped onto the branches (parsimony) and once with branch lengths (distance)]
Maximum likelihood
• If data = alignment and hypothesis = tree, then under a given evolutionary model maximum likelihood selects the hypothesis (tree) that maximises the probability of the observed data
• Extremely time consuming method
• We also can test the relative fit to the tree of different models (Huelsenbeck & Rannala, 1997)
Bayesian methods
• Calculates the posterior probability of a tree (Huelsenbeck et al., 2001) – the probability that the tree is the true tree given the evolutionary model
• Most computer-intensive technique
• Feasible thanks to the Markov chain Monte Carlo (MCMC) numerical technique for integrating over probability distributions
• Gives a confidence number (posterior probability) per node
Distance methods: fastest
• Clustering criterion using a distance matrix
• Distance matrix filled with pairwise scores (sequence identity, alignment scores, E-values, etc.)
• Cluster criterion
Phylogenetic tree by Distance methods (Clustering)
[Diagram: multiple alignment (5 sequences) → similarity criterion → 5×5 similarity matrix of scores → phylogenetic tree]
Lactate dehydrogenase multiple alignment and distance matrix (repeated from earlier).
Cluster analysis – (dis)similarity matrix
[Diagram: raw table (5 objects × characters C1, C2, …) → similarity criterion → 5×5 similarity matrix of scores]

D_i,j = (Σ_k |x_ik − x_jk|^r)^(1/r)   (Minkowski metrics)
r = 2: Euclidean distance
r = 1: City block distance
Cluster analysis – Clustering criteria
[Diagram: 5×5 similarity matrix of scores → cluster criterion → phylogenetic tree]
Single linkage - Nearest neighbour
Complete linkage – Furthest neighbour
Group averaging – UPGMA
Ward
Neighbour joining – global measure
Neighbour joining
• Global measure – tends to produce a tree with minimal total branch length
• At each step, join two nodes such that distances are minimal (criterion of minimal evolution)
• Agglomerative algorithm
• Leads to unrooted tree
Neighbour joining
The neighbor-joining method is a special case of the star decomposition method. In contrast to cluster analysis, neighbor-joining keeps track of nodes on a tree rather than taxa or clusters of taxa. The raw data are provided as a distance matrix, and the initial tree is a star tree. A modified distance matrix is then constructed in which the separation between each pair of nodes is adjusted on the basis of their average divergence from all other nodes. The tree is constructed by linking the least-distant pair of nodes in this modified matrix. When two nodes are linked, their common ancestral node is added to the tree and the terminal nodes with their respective branches are removed from the tree. This pruning process converts the newly added common ancestor into a terminal node on a tree of reduced size. At each stage in the process two terminal nodes are replaced by one new node. The process is complete when two nodes remain, separated by a single branch.
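The first joining step can be sketched by building the adjusted ("Q") matrix from a distance matrix and picking the pair with the smallest adjusted separation. The 4-taxon matrix below reuses the tree-distance example from earlier in the slides; the function name is hypothetical.

```python
def nj_pair(names, d):
    """Return the pair of taxa that neighbour joining would join first."""
    n = len(names)
    totals = [sum(row) for row in d]          # divergence of each node
    best, pair = None, None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - totals[i] - totals[j]
            if best is None or q < best:
                best, pair = q, (names[i], names[j])
    return pair

names = ["human", "mouse", "fugu", "Drosophila"]
d = [[0, 6, 7, 14],
     [6, 0, 3, 10],
     [7, 3, 0, 9],
     [14, 10, 9, 0]]
print(nj_pair(names, d))  # ('human', 'mouse'), tied with ('fugu', 'Drosophila')
```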
Neighbour joining
Negative branch lengths
As the neighbor-joining algorithm seeks to represent the data in the form of an additive tree, it can assign a negative length to a branch. Here the interpretation of branch lengths as an estimated number of substitutions runs into difficulties. When this occurs it is advised to set the branch length to zero and transfer the difference to the adjacent branch length, so that the total distance between an adjacent pair of terminal nodes remains unaffected. This does not alter the overall topology of the tree (Kuhner and Felsenstein, 1994).
Neighbour joining
A good description (and example) can be found at:
http://www.icp.ucl.ac.be/~opperd/private/neighbor.html
http://www.math.tau.ac.il/~rshamir/algmb/00/scribe00/html/lec08/node22.html(with calculation steps)
Neighbour joining
[Figure: panels (a)–(f) showing neighbour joining, starting from a star tree and successively joining node pairs x and y]
At each step all possible ‘neighbour joinings’ are checked and the one corresponding to the minimal total tree length (calculated by adding all branch lengths) is taken.
How to assess confidence in tree
• Bayesian method – time consuming
  – The Bayesian posterior probabilities (BPP) are assigned to internal branches in the consensus tree
  – Bayesian Markov chain Monte Carlo (MCMC) analytical software such as MrBayes (Huelsenbeck and Ronquist, 2001) and BAMBE (Simon and Larget, 1998) is now commonly used
  – Uses all the data
• Distance method – bootstrap:
  – Select multiple alignment columns with replacement
  – Recalculate the tree
  – Compare branches with the original (target) tree
  – Repeat 100–1000 times, i.e. calculate 100–1000 different trees
  – How often is the branching (point between 3 nodes) preserved for each internal node?
  – Uses samples of the data
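The column-resampling step of the bootstrap can be sketched as follows (the toy alignment reuses the parsimony example; tree building itself is omitted):

```python
import random

def bootstrap_columns(alignment):
    """alignment: dict of name -> aligned sequence (equal lengths).
    Returns one pseudo-alignment built from columns sampled with replacement."""
    length = len(next(iter(alignment.values())))
    cols = [random.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in cols)
            for name, seq in alignment.items()}

alignment = {"human": "aaaaaat", "mouse": "aaaaata",
             "fugu": "aatttaa", "Drosophila": "ttattaa"}
random.seed(0)
replicate = bootstrap_columns(alignment)
print(replicate["human"])  # a resampled sequence of the same length
```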