
Lecture 9

Pattern recognition

Bioinformatics Master Course

Bioinformatics Data Analysis and Tools

Patterns
Some are easy, some are not

• Knitting patterns
• Cooking recipes
• Pictures (dot plots)
• Colour patterns
• Maps
• Protein structures
• Protein sequences
• Protein interaction maps

Example of algorithm reuse: Data clustering

• Many biological data analysis problems can be formulated as clustering problems:
– microarray gene expression data analysis
– identification of regulatory binding sites (similarly, splice junction sites, translation start sites, ...)
– (yeast) two-hybrid data analysis (for inference of protein complexes)
– phylogenetic tree clustering (for inference of horizontally transferred genes)
– protein domain identification
– identification of structural motifs
– prediction reliability assessment of protein structures
– NMR peak assignments
– ...

Data Clustering Problems

• Clustering: partition a data set into clusters so that data points of the same cluster are “similar” and points of different clusters are “dissimilar”

• Cluster identification – identifying clusters whose features differ significantly from the background

Application Examples

• Regulatory binding site identification: CRP (CAP) binding site

• Two-hybrid data analysis
• Gene expression data analysis

All are solvable by the same algorithm!

Other Application Examples

• Phylogenetic tree clustering analysis (Evolutionary trees)

• Protein sidechain packing prediction

• Assessment of prediction reliability of protein structures

• Protein secondary structures

• Protein domain prediction

• NMR peak assignments

• ……

Multivariate statistics – Cluster analysis

[Workflow diagram: raw data table (objects 1–5, characters C1–C6 ...) → similarity criterion → 5×5 similarity matrix of scores → cluster criterion → dendrogram]

Human Evolution
Gaps in the knowledge domain…

Comparing sequences – Similarity Score

Many properties can be used:

• Nucleotide or amino acid composition

• Isoelectric point

• Molecular weight

• Morphological characters

• But: molecular evolution is studied through sequence alignment

Multivariate statistics – Cluster analysis
Now for sequences

[Workflow diagram: multiple sequence alignment (sequences 1–5) → similarity criterion → 5×5 similarity matrix of scores → phylogenetic tree]

Human        -KITVVGVGAVGMACAISILMKDLADELALVDVIEDKLKGEMMDLQHGSLFLRTPKIVSGKDYNVTANSKLVIITAGARQ
Chicken      -KISVVGVGAVGMACAISILMKDLADELTLVDVVEDKLKGEMMDLQHGSLFLKTPKITSGKDYSVTAHSKLVIVTAGARQ
Dogfish      -KITVVGVGAVGMACAISILMKDLADEVALVDVMEDKLKGEMMDLQHGSLFLHTAKIVSGKDYSVSAGSKLVVITAGARQ
Lamprey      SKVTIVGVGQVGMAAAISVLLRDLADELALVDVVEDRLKGEMMDLLHGSLFLKTAKIVADKDYSVTAGSRLVVVTAGARQ
Barley       TKISVIGAGNVGMAIAQTILTQNLADEIALVDALPDKLRGEALDLQHAAAFLPRVRI-SGTDAAVTKNSDLVIVTAGARQ
Maizey casei -KVILVGDGAVGSSYAYAMVLQGIAQEIGIVDIFKDKTKGDAIDLSNALPFTSPKKIYSA-EYSDAKDADLVVITAGAPQ
Bacillus     TKVSVIGAGNVGMAIAQTILTRDLADEIALVDAVPDKLRGEMLDLQHAAAFLPRTRLVSGTDMSVTRGSDLVIVTAGARQ
Lacto__ste   -RVVVIGAGFVGASYVFALMNQGIADEIVLIDANESKAIGDAMDFNHGKVFAPKPVDIWHGDYDDCRDADLVVICAGANQ
Lacto_plant  QKVVLVGDGAVGSSYAFAMAQQGIAEEFVIVDVVKDRTKGDALDLEDAQAFTAPKKIYSG-EYSDCKDADLVVITAGAPQ
Therma_mari  MKIGIVGLGRVGSSTAFALLMKGFAREMVLIDVDKKRAEGDALDLIHGTPFTRRANIYAG-DYADLKGSDVVIVAAGVPQ
Bifido       -KLAVIGAGAVGSTLAFAAAQRGIAREIVLEDIAKERVEAEVLDMQHGSSFYPTVSIDGSDDPEICRDADMVVITAGPRQ
Thermus_aqua MKVGIVGSGFVGSATAYALVLQGVAREVVLVDLDRKLAQAHAEDILHATPFAHPVWVRSGW-YEDLEGARVVIVAAGVAQ
Mycoplasma   -KIALIGAGNVGNSFLYAAMNQGLASEYGIIDINPDFADGNAFDFEDASASLPFPISVSRYEYKDLKDADFIVITAGRPQ

Lactate dehydrogenase multiple alignment

Distance Matrix
                   1     2     3     4     5     6     7     8     9    10    11    12    13
 1 Human         0.000 0.112 0.128 0.202 0.378 0.346 0.530 0.551 0.512 0.524 0.528 0.635 0.637
 2 Chicken       0.112 0.000 0.155 0.214 0.382 0.348 0.538 0.569 0.516 0.524 0.524 0.631 0.651
 3 Dogfish       0.128 0.155 0.000 0.196 0.389 0.337 0.522 0.567 0.516 0.512 0.524 0.600 0.655
 4 Lamprey       0.202 0.214 0.196 0.000 0.426 0.356 0.553 0.589 0.544 0.503 0.544 0.616 0.669
 5 Barley        0.378 0.382 0.389 0.426 0.000 0.171 0.536 0.565 0.526 0.547 0.516 0.629 0.575
 6 Maizey        0.346 0.348 0.337 0.356 0.171 0.000 0.557 0.563 0.538 0.555 0.518 0.643 0.587
 7 Lacto_casei   0.530 0.538 0.522 0.553 0.536 0.557 0.000 0.518 0.208 0.445 0.561 0.526 0.501
 8 Bacillus_stea 0.551 0.569 0.567 0.589 0.565 0.563 0.518 0.000 0.477 0.536 0.536 0.598 0.495
 9 Lacto_plant   0.512 0.516 0.516 0.544 0.526 0.538 0.208 0.477 0.000 0.433 0.489 0.563 0.485
10 Therma_mari   0.524 0.524 0.512 0.503 0.547 0.555 0.445 0.536 0.433 0.000 0.532 0.405 0.598
11 Bifido        0.528 0.524 0.524 0.544 0.516 0.518 0.561 0.536 0.489 0.532 0.000 0.604 0.614
12 Thermus_aqua  0.635 0.631 0.600 0.616 0.629 0.643 0.526 0.598 0.563 0.405 0.604 0.000 0.641
13 Mycoplasma    0.637 0.651 0.655 0.669 0.575 0.587 0.501 0.495 0.485 0.598 0.614 0.641 0.000

Multivariate statistics – Cluster analysis

[Workflow diagram: data table (objects 1–5, characters C1–C6 ...) → similarity criterion → 5×5 similarity matrix of scores → cluster criterion → dendrogram/tree]

Multivariate statistics – Cluster analysis

Why do it?
• Finding a true typology
• Model fitting
• Prediction based on groups
• Hypothesis testing
• Data exploration
• Data reduction
• Hypothesis generation

But you can never prove a classification/typology!

Cluster analysis – data normalisation/weighting

[Diagram: raw table (objects 1–5, characters C1–C6 ...) → normalisation criterion → normalised table]

Column normalisation: x/max

Column range normalisation: (x – min)/(max – min)
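As a minimal sketch (toy data, not from the slides), the two column normalisations above can be written with NumPy as:

```python
import numpy as np

raw = np.array([[1.0, 200.0],
                [3.0, 400.0],
                [5.0, 800.0]])          # rows = objects, columns = characters

col_max = raw.max(axis=0)
col_min = raw.min(axis=0)

norm_max   = raw / col_max                          # x / max
norm_range = (raw - col_min) / (col_max - col_min)  # (x - min) / (max - min)

print(norm_max)
print(norm_range)
```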

Cluster analysis – (dis)similarity matrix

[Diagram: raw table (objects 1–5, characters C1–C6 ...) → similarity criterion → 5×5 similarity matrix of scores]

Minkowski metrics: D(i,j) = ( Σ_k | x_ik – x_jk |^r )^(1/r)

r = 2: Euclidean distance
r = 1: City block distance
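A small illustrative function for the Minkowski metric (not from the slides; the example vectors are made up):

```python
import numpy as np

def minkowski(x, y, r=2):
    """r = 2 gives Euclidean distance, r = 1 gives city-block distance."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return (np.abs(x - y) ** r).sum() ** (1.0 / r)

a, b = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
print(minkowski(a, b, r=2))   # Euclidean distance: 5.0
print(minkowski(a, b, r=1))   # city-block distance: 7.0
```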

Cluster analysis – Clustering criteria

[Diagram: 5×5 similarity matrix of scores → cluster criterion → dendrogram (tree)]

Single linkage - Nearest neighbour

Complete linkage – Furthest neighbour

Group averaging – UPGMA

Ward

Neighbour joining – global measure
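For illustration, most of these linkage criteria (neighbour joining is not part of SciPy's linkage) are available in scipy.cluster.hierarchy; the data below are random and only meant as a sketch:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

points = np.random.rand(5, 6)            # 5 objects scored on 6 characters

for method in ('single', 'complete', 'average', 'ward'):
    Z = linkage(points, method=method)   # 'average' corresponds to UPGMA
    print(method, Z[-1, 2])              # merge distance of the final (root) join
```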

Cluster analysis – Clustering criteria

1. Start with N clusters of 1 object each

2. Apply clustering distance criterion iteratively until you have 1 cluster of N objects

3. The most interesting clustering usually lies somewhere in between

[Diagram: dendrogram (tree) with a distance axis; cutting it at the bottom gives N clusters, cutting it at the top gives 1 cluster]

Single linkage clustering (nearest neighbour)

[Sequence of scatter-plot slides (axes: Char 1 and Char 2) showing clusters growing step by step under the nearest-neighbour rule]

Distance from point to cluster is defined as the smallest distance between that point and any point in the cluster
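A tiny sketch of this rule with assumed toy coordinates:

```python
import numpy as np

def single_linkage_distance(point, cluster):
    """Smallest Euclidean distance from `point` to any member of `cluster`."""
    point, cluster = np.asarray(point, float), np.asarray(cluster, float)
    return np.linalg.norm(cluster - point, axis=1).min()

print(single_linkage_distance([0.0, 0.0], [[1.0, 1.0], [3.0, 4.0]]))  # ~1.414
```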

Cluster analysis – Ward’s clustering criterion

Per cluster: calculate Error Sum of Squares (ESS)

ESS = Σ x² – (Σ x)² / n

calculate minimum increase of ESS

Suppose:

Obj   Val   Clustering (objects grouped so far)   Total ESS
 1     1    {1} {2} {7} {9} {12}                    0
 2     2    {1,2} {7} {9} {12}                      0.5
 3     7    {1,2} {7,9} {12}                        2.5
 4     9    {1,2} {7,9,12}                         13.1
 5    12    {1,2,7,9,12}                           86.8
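A sketch that reproduces the ESS column above, assuming the merge order shown (the slide's 13.1 appears as ≈13.2 without rounding):

```python
import numpy as np

def ess(values):
    x = np.asarray(values, float)
    return (x ** 2).sum() - x.sum() ** 2 / len(x)   # ESS = sum(x^2) - (sum x)^2 / n

steps = [
    [[1], [2], [7], [9], [12]],      # all singletons            -> 0
    [[1, 2], [7], [9], [12]],        # merge 1 and 2             -> 0.5
    [[1, 2], [7, 9], [12]],          # merge 7 and 9             -> 2.5
    [[1, 2], [7, 9, 12]],            # merge 12 into {7, 9}      -> 13.2 (13.1 on the slide)
    [[1, 2, 7, 9, 12]],              # one cluster               -> 86.8
]
for clusters in steps:
    print(round(sum(ess(c) for c in clusters), 1))
```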

Partitional Clustering

• divide instances into disjoint clusters

– flat vs. tree structure

• key issues

– how many clusters should there be?

– how should clusters be represented?

Partitional Clustering from a Hierarchical Clustering

We can always generate a partitional clustering from a hierarchical clustering by “cutting” the tree at some level
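For example, with SciPy one could cut a hierarchical clustering into a fixed number of flat clusters (random toy data, cluster count chosen arbitrarily):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.random.rand(10, 4)
Z = linkage(points, method='average')             # UPGMA-style tree
labels = fcluster(Z, t=3, criterion='maxclust')   # cut so that at most 3 clusters remain
print(labels)                                     # one cluster label per instance
```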

K-Means Clustering

• assume our instances are represented by vectors of real values
• put k cluster centers in same space as instances
• now iteratively move cluster centers

K-Means Clustering

• each iteration involves two steps:
– assignment of instances to clusters
– re-computation of the means

K-Means Clustering

• in k-means clustering, instances are assigned to one and only one cluster

• can do “soft” k-means clustering via the Expectation Maximization (EM) algorithm
– each cluster represented by a normal distribution
– E step: determine how likely it is that each cluster “generated” each instance
– M step: move cluster centers to maximize the likelihood of the instances
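A minimal k-means sketch in NumPy following the two steps above (toy data; k and the initialisation are arbitrary):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), k, replace=False)]   # initial cluster centres
    for _ in range(n_iter):
        # assignment step: nearest centre for every instance
        labels = np.argmin(((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1), axis=1)
        # re-computation step: move each centre to the mean of its instances
        centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centres[j] for j in range(k)])
    return labels, centres

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5])
labels, centres = kmeans(X, k=2)
print(centres)
```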

Ecogenomics

[Diagram: a sample's gene expression data is given compatibility scores against conditions 1…n (contaminants 1…n)]

Algorithm that maps the observed clustering behaviour of sampled gene expression data onto the clustering behaviour of contaminant-labelled gene expression patterns in the knowledge base.

Genome-Wide Cluster Analysis
Eisen dataset

• Eisen et al., PNAS 1998
• S. cerevisiae (baker's yeast)
– all genes (~ 6200) on a single array
– measured during several processes
• human fibroblasts
– 8600 human transcripts on array
– measured at 12 time points during serum stimulation

The Eisen Data

• 79 measurements for yeast data
• collected at various time points during:
– diauxic shift (shutting down genes for metabolizing sugars, activating those for metabolizing ethanol)
– mitotic cell division cycle
– sporulation
– temperature shock
– reducing shock

The Data

• each measurement represents

log(Red_i / Green_i)

where red is the test expression level and green is the reference level for gene G in the i-th experiment

• the expression profile of a gene is the vector of measurements across all experiments

(G_1, ..., G_n)

The Data

• m genes measured in n experiments:

g_1,1  …  g_1,n
g_2,1  …  g_2,n
  ⋮           ⋮
g_m,1  …  g_m,n

(each row is the expression vector for one gene)
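A small sketch of building such a matrix from made-up red/green intensities (log base 2 is assumed here; the slides just say "Log"):

```python
import numpy as np

red   = np.array([[1200.0,  300.0], [ 800.0, 1600.0]])   # test intensities (genes x experiments)
green = np.array([[ 600.0,  600.0], [ 800.0,  400.0]])   # reference intensities

expression = np.log2(red / green)   # one expression profile (row) per gene
print(expression)
```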

Eisen et al. Results

• redundant representations of genes cluster together
– but individual genes can be distinguished from related genes by subtle differences in expression
• genes of similar function cluster together
– e.g. 126 genes strongly down-regulated in response to stress

Eisen et al. Results

• 126 genes down-regulated in response to stress

– 112 of the genes encode ribosomal and other proteins related to translation

– agrees with previously known result that yeast responds to favorable growth conditions by increasing the production of ribosomes

Graph → adjacency matrix (first rows shown):

0    1    1.5  2    5    6    7    9
1    0    2    1    6.5  6    8    8
1.5  2    0    1    4    4    6    5.5
⋮

Graphs - definition

An undirected graph has a symmetric adjacency matrix

A digraph typically has a non-symmetric adjacency matrix

A Theoretical Framework

• Representation of a set of n-dimensional (n-D) points as a graph
– each data point represented as a node
– each pair of points represented as an edge with a weight defined by the “distance” between the two points

[Figure: n-D data points → graph representation → distance matrix (first rows as above)]

A Theoretical Framework

• Spanning tree: a sub-graph that has all nodes connected and has no cycles

• Minimum spanning tree: a spanning tree with the minimum total distance


Spanning tree

• Prim’s algorithm (graph, tree)
– step 1: select an arbitrary node as the current tree
– step 2: find an external node that is closest to the tree, and add it with its corresponding edge into the tree
– step 3: repeat step 2 until all nodes are connected in the tree

[Figure panels (a)–(e): Prim’s algorithm growing the minimum spanning tree edge by edge on a small weighted example graph]
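A sketch of Prim's algorithm on a small made-up graph (not the one in the figure):

```python
def prim(graph, start):
    """graph: {node: {neighbour: weight}}; returns the tree edges as (u, v, w) tuples."""
    in_tree = {start}
    tree_edges = []
    while len(in_tree) < len(graph):
        # find the cheapest edge leaving the current tree
        u, v, w = min(((u, v, w) for u in in_tree
                       for v, w in graph[u].items() if v not in in_tree),
                      key=lambda e: e[2])
        in_tree.add(v)
        tree_edges.append((u, v, w))
    return tree_edges

g = {'A': {'B': 4, 'C': 3}, 'B': {'A': 4, 'C': 5, 'D': 7},
     'C': {'A': 3, 'B': 5, 'D': 6}, 'D': {'B': 7, 'C': 6}}
print(prim(g, 'A'))   # [('A', 'C', 3), ('A', 'B', 4), ('C', 'D', 6)]
```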

• Kruskal’s algorithm
– step 1: consider edges in non-decreasing order of weight
– step 2: if the selected edge does not form a cycle, add it to the tree; otherwise reject it
– step 3: continue steps 1 and 2 until all nodes are connected in the tree

[Figure panels (a)–(f): Kruskal’s algorithm adding edges in order of increasing weight and rejecting edges that would form a cycle]
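A corresponding sketch of Kruskal's algorithm with a minimal union-find structure, again on made-up edges:

```python
def kruskal(nodes, edges):
    parent = {n: n for n in nodes}
    def find(n):                       # find the representative of n's component
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n
    tree = []
    for u, v, w in sorted(edges, key=lambda e: e[2]):   # non-decreasing weight
        ru, rv = find(u), find(v)
        if ru != rv:                   # edge joins two components: accept it
            parent[ru] = rv
            tree.append((u, v, w))
        # otherwise the edge would form a cycle: reject it
    return tree

edges = [('A', 'B', 4), ('A', 'C', 3), ('B', 'C', 5), ('B', 'D', 7), ('C', 'D', 6)]
print(kruskal('ABCD', edges))   # [('A', 'C', 3), ('A', 'B', 4), ('C', 'D', 6)]
```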

Multivariate statistics – Cluster analysis

[Workflow diagram: data table (objects 1–5, characters C1–C6 ...) → similarity criterion → 5×5 similarity matrix of scores → cluster criterion → phylogenetic tree]

Multivariate statistics – Cluster analysis

[Diagram: the data table (objects 1–5 × characters C1–C6) is clustered along both dimensions – a 5×5 similarity matrix over the objects and a 6×6 similarity matrix over the characters – and the two dendrograms are used to make a two-way ordered table]

Multivariate statistics – Principal Component Analysis (PCA)

Principal component analysis (PCA) involves a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible.

Traditionally, principal component analysis is performed on a square symmetric matrix of type SSCP (pure sums of squares and cross products), Covariance (scaled sums of squares and cross products), or Correlation (sums of squares and cross products from standardized data).

The analysis results for objects of type SSCP and Covariance do not differ, since these objects only differ in a global scaling factor. A Correlation object has to be used if the variances of individual variates differ much, or if the units of measurement of the individual variates differ.

The result of a principal component analysis on such objects will be a new object of type PCA.

Multivariate statistics – Principal Component Analysis (PCA)

Objectives of principal component analysis
• To discover or to reduce the dimensionality of the data set.
• To identify new meaningful underlying variables.

Multivariate statistics – Principal Component Analysis (PCA)

How to start

We assume that the multi-dimensional data have been collected in a TableOfReal object. If the variances of the individual columns in the TableOfReal differ much, or the measurement units of the columns differ, then you should first standardize the data.

Performing a principal component analysis on a standardized data matrix has the same effect as performing the analysis on the correlation matrix (the covariance matrix from standardized data is equal to the correlation matrix of these data).

Calculate Eigenvectors and Eigenvalues

We can now make a plot of the eigenvalues to get an indication of the importance of each eigenvalue. The exact contribution of each eigenvalue (or a range of eigenvalues) to the "explained variance" can also be queried: You might also check for the equality of a number of eigenvalues.

Multivariate statistics – Principal Component Analysis (PCA)

Determining the number of components

There are two methods to help you to choose the number of components. Both methods are based on relations between the eigenvalues.

Plot the eigenvalues: If the points on the graph tend to level out (show an "elbow"), these eigenvalues are usually close enough to zero that they can be ignored.

Set a limit on the variance accounted for and get the associated number of components
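A sketch of this second method, with assumed eigenvalues and an arbitrary 95% threshold:

```python
import numpy as np

eigenvalues = np.array([4.2, 2.1, 0.6, 0.3, 0.1])          # assumed, sorted in decreasing order
explained = np.cumsum(eigenvalues) / eigenvalues.sum()      # cumulative explained variance
n_components = int(np.searchsorted(explained, 0.95) + 1)    # first point where 95% is reached
print(explained, n_components)
```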

Multivariate statistics – Principal Component Analysis (PCA)

Getting the principal components

Principal components are obtained by projecting the multivariate data vectors onto the space spanned by the eigenvectors. This can be done in two ways:

1. Directly from the TableOfReal without first forming a PCA object: You can then draw the Configuration or display its numbers.

2. Standard way: project the TableOfReal onto the PCA's eigenspace.

Multivariate statistics – Principal Component Analysis (PCA)

Mathematical background on principal component analysis

The mathematical technique used in PCA is called eigen analysis: we solve for the eigenvalues and eigenvectors of a square symmetric matrix with sums of squares and cross products. The eigenvector associated with the largest eigenvalue has the same direction as the first principal component. The eigenvector associated with the second largest eigenvalue determines the direction of the second principal component. The sum of the eigenvalues equals the trace of the square matrix and the maximum number of eigenvectors equals the number of rows (or columns) of this matrix.
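A compact NumPy sketch of this eigen analysis on random toy data (PCA on the correlation matrix, as discussed above):

```python
import numpy as np

X = np.random.rand(50, 6)                         # 50 objects, 6 variables
Xs = (X - X.mean(axis=0)) / X.std(axis=0)         # standardized columns
corr = np.corrcoef(X, rowvar=False)               # 6x6 correlation matrix
eigenvalues, eigenvectors = np.linalg.eigh(corr)  # eigen decomposition of a symmetric matrix
order = np.argsort(eigenvalues)[::-1]             # sort by decreasing eigenvalue
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

scores = Xs @ eigenvectors[:, :2]                 # project the data onto the first two PCs
print(eigenvalues.sum(), np.trace(corr))          # sum of eigenvalues equals the trace
```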

Multivariate statistics – Principal Component Analysis (PCA)

[Diagram: data table (objects 1–5, characters C1–C6) → similarity criterion: correlations → 6×6 correlation matrix → calculate the eigenvectors with the greatest eigenvalues (linear combinations of the original variables, mutually orthogonal) → project the data points onto the new axes (eigenvectors 1 and 2)]

Multivariate statistics – Principal Component Analysis (PCA)

[Figure-only slides]

“Nothing in Biology makes sense except in the light of evolution” (Theodosius Dobzhansky (1900-1975))

“Nothing in bioinformatics makes sense except in the light of Biology”

Bioinformatics

Evolution

• Most of bioinformatics is comparative biology

• Comparative biology is based upon evolutionary relationships between compared entities

• Evolutionary relationships are normally depicted in a phylogenetic tree

Where can phylogeny be used?

• For example, finding out about orthology versus paralogy

• Predicting secondary structure of RNA

• Studying host-parasite relationships

• Mapping cell-bound receptors onto their binding ligands

• Multiple sequence alignment (e.g. Clustal)

Phylogenetic tree (unrooted)

[Figure: unrooted tree with leaves human, mouse, fugu and Drosophila; labels mark an edge, an internal node and a leaf (OTU – observed taxonomic unit)]

Phylogenetic tree (unrooted)

[Figure: the same unrooted tree with a root position indicated in addition to the edge, internal node and leaf (OTU – observed taxonomic unit) labels]

Phylogenetic tree (rooted)

[Figure: rooted tree with leaves human, mouse, fugu and Drosophila; labels mark the root, an edge, an internal node (ancestor) and a leaf (OTU – observed taxonomic unit); the vertical axis represents time]

How to root a tree

• Outgroup – place root between distant sequence and rest group

• Midpoint – place root at midpoint of longest path (sum of branches between any two OTUs)

• Gene duplication – place root between paralogous gene copies

[Figures: example trees over fugu (f), Drosophila (D), mouse (m) and human (h) with branch lengths, illustrating outgroup, midpoint and gene-duplication rooting]

Combinatoric explosion

# sequences   # unrooted trees   # rooted trees
 2                     1                  1
 3                     1                  3
 4                     3                 15
 5                    15                105
 6                   105                945
 7                   945             10,395
 8                10,395            135,135
 9               135,135          2,027,025
10             2,027,025         34,459,425
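These counts follow the double-factorial formulas (2n−5)!! for unrooted and (2n−3)!! for rooted binary trees; a short sketch reproducing the table:

```python
def double_factorial(k):
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

for n in range(2, 11):
    unrooted = double_factorial(2 * n - 5)   # (2n-5)!!  (equals 1 for n = 2)
    rooted   = double_factorial(2 * n - 3)   # (2n-3)!!
    print(n, unrooted, rooted)
```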

Tree distances

            human  mouse  fugu  Drosophila
human         x
mouse         6      x
fugu          7      3     x
Drosophila   14     10     9       x

[Figure: trees for human, mouse, fugu and Drosophila with branch lengths chosen so that the path lengths between leaves reproduce the distances in the matrix]

Evolutionary distance (sequence distance) = sequence dissimilarity

Phylogeny methods

• Parsimony – fewest number of evolutionary events (mutations) – relatively often fails to reconstruct correct phylogeny

• Distance based – pairwise distances

• Maximum likelihood – L = Pr[Data|Tree]

Parsimony & DistanceSequences 1 2 3 4 5 6 7Drosophila t t a t t a a fugu a a t t t a a mouse a a a a a t a human a a a a a a t

human x

mouse 2 x

fugu 3 4 x

Drosophila 5 5 3 x

human

mouse

fuguDrosophila

Drosophila

fugu

mouse

human

12

3 7

64 5

Drosophila

fugu

mouse

human

2

11

12

parsimony

distance

Maximum likelihood

• If the data is an alignment and the hypothesis is a tree, then under a given evolutionary model maximum likelihood selects the hypothesis (tree) that maximises the probability of the observed data

• Extremely time consuming method

• We also can test the relative fit to the tree of different models (Huelsenbeck & Rannala, 1997)

Bayesian methods

• Calculates the posterior probability of a tree (Huelsenbeck et al., 2001) – the probability that the tree is the true tree given the evolutionary model
• Most computer-intensive technique
• Feasible thanks to the Markov chain Monte Carlo (MCMC) numerical technique for integrating over probability distributions
• Gives a confidence number (posterior probability) per node

Distance methods: fastest

• Clustering criterion using a distance matrix

• Distance matrix filled with alignment scores (sequence identity, alignment scores, E-values, etc.)

• Cluster criterion

Phylogenetic tree by Distance methods (Clustering)

[Workflow diagram: multiple alignment (sequences 1–5) → similarity criterion → 5×5 similarity matrix of scores → phylogenetic tree]

Human        -KITVVGVGAVGMACAISILMKDLADELALVDVIEDKLKGEMMDLQHGSLFLRTPKIVSGKDYNVTANSKLVIITAGARQ
Chicken      -KISVVGVGAVGMACAISILMKDLADELTLVDVVEDKLKGEMMDLQHGSLFLKTPKITSGKDYSVTAHSKLVIVTAGARQ
Dogfish      -KITVVGVGAVGMACAISILMKDLADEVALVDVMEDKLKGEMMDLQHGSLFLHTAKIVSGKDYSVSAGSKLVVITAGARQ
Lamprey      SKVTIVGVGQVGMAAAISVLLRDLADELALVDVVEDRLKGEMMDLLHGSLFLKTAKIVADKDYSVTAGSRLVVVTAGARQ
Barley       TKISVIGAGNVGMAIAQTILTQNLADEIALVDALPDKLRGEALDLQHAAAFLPRVRI-SGTDAAVTKNSDLVIVTAGARQ
Maizey casei -KVILVGDGAVGSSYAYAMVLQGIAQEIGIVDIFKDKTKGDAIDLSNALPFTSPKKIYSA-EYSDAKDADLVVITAGAPQ
Bacillus     TKVSVIGAGNVGMAIAQTILTRDLADEIALVDAVPDKLRGEMLDLQHAAAFLPRTRLVSGTDMSVTRGSDLVIVTAGARQ
Lacto__ste   -RVVVIGAGFVGASYVFALMNQGIADEIVLIDANESKAIGDAMDFNHGKVFAPKPVDIWHGDYDDCRDADLVVICAGANQ
Lacto_plant  QKVVLVGDGAVGSSYAFAMAQQGIAEEFVIVDVVKDRTKGDALDLEDAQAFTAPKKIYSG-EYSDCKDADLVVITAGAPQ
Therma_mari  MKIGIVGLGRVGSSTAFALLMKGFAREMVLIDVDKKRAEGDALDLIHGTPFTRRANIYAG-DYADLKGSDVVIVAAGVPQ
Bifido       -KLAVIGAGAVGSTLAFAAAQRGIAREIVLEDIAKERVEAEVLDMQHGSSFYPTVSIDGSDDPEICRDADMVVITAGPRQ
Thermus_aqua MKVGIVGSGFVGSATAYALVLQGVAREVVLVDLDRKLAQAHAEDILHATPFAHPVWVRSGW-YEDLEGARVVIVAAGVAQ
Mycoplasma   -KIALIGAGNVGNSFLYAAMNQGLASEYGIIDINPDFADGNAFDFEDASASLPFPISVSRYEYKDLKDADFIVITAGRPQ

Lactate dehydrogenase multiple alignment

Distance Matrix
                   1     2     3     4     5     6     7     8     9    10    11    12    13
 1 Human         0.000 0.112 0.128 0.202 0.378 0.346 0.530 0.551 0.512 0.524 0.528 0.635 0.637
 2 Chicken       0.112 0.000 0.155 0.214 0.382 0.348 0.538 0.569 0.516 0.524 0.524 0.631 0.651
 3 Dogfish       0.128 0.155 0.000 0.196 0.389 0.337 0.522 0.567 0.516 0.512 0.524 0.600 0.655
 4 Lamprey       0.202 0.214 0.196 0.000 0.426 0.356 0.553 0.589 0.544 0.503 0.544 0.616 0.669
 5 Barley        0.378 0.382 0.389 0.426 0.000 0.171 0.536 0.565 0.526 0.547 0.516 0.629 0.575
 6 Maizey        0.346 0.348 0.337 0.356 0.171 0.000 0.557 0.563 0.538 0.555 0.518 0.643 0.587
 7 Lacto_casei   0.530 0.538 0.522 0.553 0.536 0.557 0.000 0.518 0.208 0.445 0.561 0.526 0.501
 8 Bacillus_stea 0.551 0.569 0.567 0.589 0.565 0.563 0.518 0.000 0.477 0.536 0.536 0.598 0.495
 9 Lacto_plant   0.512 0.516 0.516 0.544 0.526 0.538 0.208 0.477 0.000 0.433 0.489 0.563 0.485
10 Therma_mari   0.524 0.524 0.512 0.503 0.547 0.555 0.445 0.536 0.433 0.000 0.532 0.405 0.598
11 Bifido        0.528 0.524 0.524 0.544 0.516 0.518 0.561 0.536 0.489 0.532 0.000 0.604 0.614
12 Thermus_aqua  0.635 0.631 0.600 0.616 0.629 0.643 0.526 0.598 0.563 0.405 0.604 0.000 0.641
13 Mycoplasma    0.637 0.651 0.655 0.669 0.575 0.587 0.501 0.495 0.485 0.598 0.614 0.641 0.000

Cluster analysis – (dis)similarity matrix

[Diagram: raw table (objects 1–5, characters C1–C6 ...) → similarity criterion → 5×5 similarity matrix of scores]

Minkowski metrics: D(i,j) = ( Σ_k | x_ik – x_jk |^r )^(1/r)

r = 2: Euclidean distance
r = 1: City block distance

Cluster analysis – Clustering criteria

[Diagram: 5×5 similarity matrix of scores → cluster criterion → phylogenetic tree]

Single linkage - Nearest neighbour

Complete linkage – Furthest neighbour

Group averaging – UPGMA

Ward

Neighbour joining – global measure

Neighbour joining

• Global measure – tends to produce a tree with minimal total branch length

• At each step, join two nodes such that distances are minimal (criterion of minimal evolution)

• Agglomerative algorithm

• Leads to unrooted tree

Neighbour joining

• The neighbor-joining method is a special case of the star decomposition method. In contrast to cluster analysis, neighbor-joining keeps track of nodes on a tree rather than taxa or clusters of taxa. The raw data are provided as a distance matrix and the initial tree is a star tree. Then a modified distance matrix is constructed in which the separation between each pair of nodes is adjusted on the basis of their average divergence from all other nodes. The tree is constructed by linking the least-distant pair of nodes in this modified matrix. When two nodes are linked, their common ancestral node is added to the tree and the terminal nodes with their respective branches are removed from the tree. This pruning process converts the newly added common ancestor into a terminal node on a tree of reduced size. At each stage in the process two terminal nodes are replaced by one new node. The process is complete when two nodes remain, separated by a single branch.
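A compact illustrative implementation of these steps, using the small human/mouse/fugu/Drosophila distance matrix from the earlier tree-distances slide; it records only the order in which nodes are joined, not branch lengths:

```python
import numpy as np

def neighbour_joining(names, D):
    D = np.asarray(D, float)
    nodes = list(names)
    joins = []
    while len(nodes) > 2:
        n = len(nodes)
        r = D.sum(axis=1)                                # average divergence of each node
        Q = (n - 2) * D - r[:, None] - r[None, :]        # modified (adjusted) distance matrix
        np.fill_diagonal(Q, np.inf)
        i, j = np.unravel_index(np.argmin(Q), Q.shape)   # least-distant pair of nodes
        joins.append((nodes[i], nodes[j]))
        # distances from the new common ancestor to all remaining nodes
        new_d = 0.5 * (D[i] + D[j] - D[i, j])
        keep = [k for k in range(n) if k not in (i, j)]
        D = np.vstack([np.hstack([D[np.ix_(keep, keep)], new_d[keep][:, None]]),
                       np.hstack([new_d[keep], [0.0]])])
        nodes = [nodes[k] for k in keep] + [f"({nodes[i]},{nodes[j]})"]
    joins.append((nodes[0], nodes[1]))
    return joins

names = ["human", "mouse", "fugu", "Drosophila"]
D = [[0, 6, 7, 14], [6, 0, 3, 10], [7, 3, 0, 9], [14, 10, 9, 0]]
print(neighbour_joining(names, D))
```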

Neighbour joining

Negative branch lengths

As the neighbor-joining algorithm seeks to represent the data in the form of an additive tree, it can assign a negative length to a branch. Here the interpretation of branch lengths as an estimated number of substitutions gets into difficulties. When this occurs it is advised to set the branch length to zero and transfer the difference to the adjacent branch length so that the total distance between an adjacent pair of terminal nodes remains unaffected. This does not alter the overall topology of the tree (Kuhner and Felsenstein, 1994).

Neighbour joining

A good description (and example) can be found at:

http://www.icp.ucl.ac.be/~opperd/private/neighbor.html

http://www.math.tau.ac.il/~rshamir/algmb/00/scribe00/html/lec08/node22.html (with calculation steps)

Neighbour joining

[Figure panels (a)–(f): successive neighbour joinings, collapsing the star tree step by step]

At each step all possible ‘neighbour joinings’ are checked and the one corresponding to the minimal total tree length (calculated by adding all branch lengths) is taken.

How to assess confidence in a tree

• Bayesian method – time consuming
– The Bayesian posterior probabilities (BPP) are assigned to internal branches in the consensus tree
– Bayesian Markov chain Monte Carlo (MCMC) analytical software such as MrBayes (Huelsenbeck and Ronquist, 2001) and BAMBE (Simon and Larget, 1998) is now commonly used
– Uses all the data

• Distance method – bootstrap:
– Select multiple alignment columns with replacement
– Recalculate tree
– Compare branches with original (target) tree
– Repeat 100-1000 times, so calculate 100-1000 different trees
– How often is branching (point between 3 nodes) preserved for each internal node?
– Uses samples of the data
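A sketch of the column-resampling step, using the three short sequences from the bootstrap slide below; tree building itself is only indicated as a comment:

```python
import numpy as np

alignment = np.array([list("CCVKVIYS"),
                      list("MAVRLIFS"),
                      list("MCLRLLFT")])           # 3 sequences, 8 alignment columns

rng = np.random.default_rng(0)
n_cols = alignment.shape[1]
for replicate in range(3):                          # typically 100-1000 replicates
    cols = rng.integers(0, n_cols, size=n_cols)     # columns sampled with replacement
    pseudo = alignment[:, cols]
    print(replicate, ["".join(row) for row in pseudo])
    # ...rebuild the tree from `pseudo` and compare its branches with the original tree
```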

The Bootstrap

Original alignment (columns 1–8):
1 2 3 4 5 6 7 8
C C V K V I Y S
M A V R L I F S
M C L R L L F T

Scrambled (columns resampled with replacement: 3 4 3 8 6 6 8 6):
V K V S I I S I
V R V S I I S I
L R L T L L T L

[Figure: trees built from the original and from the scrambled (bootstrap) alignments; branches recovered in only a few of the bootstrap trees (e.g. 2×, 3×) are non-supportive]