Phylogenetic Analysis Introduction to bioinformatics Stinus Lindgreen [email protected] Bioinformatics Centre, University of Copenhagen

Phylogenetic Analysis


Page 1: Phylogenetic Analysis

Phylogenetic Analysis

Introduction to bioinformatics

Stinus Lindgreen [email protected]

Bioinformatics Centre, University of Copenhagen

Page 2: Phylogenetic Analysis

Outline of the lecture
• What is a phylogeny?
• Why and how to interpret them
• Programs: PHYLIP, PAUP* and BioEdit
• Building a tree 1: Multiple alignment
• Building a tree 2: The model
• Building a tree 3: Construction
• Building a tree 4: Evaluation

Page 3: Phylogenetic Analysis

Nothing in Biology Makes Sense Except in the Light of Evolution

Theodosius Dobzhansky (1900-1975)

Page 4: Phylogenetic Analysis

Phylogeny
• Phylogenetic inference predicts a tree based on characters (of some sort)
  • Some variation needed
  • Group together similar species/genes
  • Connect to most recent common ancestor
• Unrooted tree: Just show connections
• Rooted tree: Direction of evolution
• Branch lengths can show divergence

Page 5: Phylogenetic Analysis

Before sequences
• Phylogenetic trees show evolutionary relationships
• Existed longer than sequencing methods
• Previously based on morphological characters
  • Still partly today – at least for checking
• Mainly based on biological sequences
  • DNA or protein
  • Base phylogeny on mutations

Page 6: Phylogenetic Analysis

Morphological tree

Page 7: Phylogenetic Analysis

Modern tree

[Figure: sequence-based tree with leaves A, A, G, C, G; two mutation events marked X]

Page 8: Phylogenetic Analysis

Some pitfalls
• Determining phylogeny is important for understanding biology
• But also a very difficult problem
• Beware of incorrect trees
• Important to understand models and methods
• The programs are helpful tools
  • The result is only as good as the alignment

Page 9: Phylogenetic Analysis

Assumptions
• Basic concepts of evolutionary theory
  • Relation to common ancestor
  • Phylogeny represented by a bifurcating tree
  • Mutations occur over evolutionary time
• Necessary to make phylogenetic inference possible

Page 10: Phylogenetic Analysis

Tree of Life

Page 11: Phylogenetic Analysis

Interpretation
• Know your model
  • Both evolutionary and for tree construction
• Know the assumptions of the model
  • Evolution independent? Identical between sites? The same for all sequences?
• Are the sequences correct?
  • And are they representative?
  • And are they homologous?
• Is the multiple alignment correct?
• What you get out is no better than what you put in

Page 12: Phylogenetic Analysis

Some biological pitfalls
• Don’t make hasty conclusions!
• Does your tree contradict common sense?
  • Then it’s probably wrong!
• Differentiate between the homologs
  • Orthologs: Speciation, common ancestor, similar function
  • Paralogs: Gene duplication, within one organism, differing functions
  • Xenologs: Horizontal gene transfer – hard to tell, similar function

Page 13: Phylogenetic Analysis

Software
• Today we’ll look at the programs before the methods
• Some programs for phylogenetic analysis
  • A multiple alignment program: Clustal, T-Coffee, MAFFT, Muscle…
  • A phylogenetic program: Phylip, PAUP*, MacClade, BioEdit…
  • Visualizing the tree: TreeView, NJplot

Page 14: Phylogenetic Analysis

PAUP*
• Commercial package
• Apparently good
• Many different methods and analysis methods
• But since we don’t own a copy…
• Similarly: MacClade only works on Macintosh…

Page 15: Phylogenetic Analysis

PHYLIP
• Free package
• Many programs
  • Both distance and character based
  • Bootstrapping possible
• But: It can be a little difficult
  • No graphical user interface
  • And you will need to run many programs

Page 16: Phylogenetic Analysis

BioEdit
• Has phylogeny methods built in
• Can call Phylip routines
• No need for you to learn the command line
• But no bootstrapping… (as far as I know)
• Point and click:
  • Select the sequences in the alignment
  • Choose the wanted phylogeny
  • Voila!

Page 17: Phylogenetic Analysis

PhyloWin
• Another free program
• Simple, not many possibilities
• But it can do bootstrapping

Page 18: Phylogenetic Analysis

Getting the software
• Install BioEdit, PHYLIP, PhyloWin and NJplot
• Links on the wiki

Page 19: Phylogenetic Analysis

Constructing a tree
To make a phylogenetic tree, four steps are needed:
1. Perform multiple alignment
2. Choose your model
3. Build the tree
4. Evaluate the quality

A brief note:
• Ideally: Parallel alignment and phylogenetic inference
• Very difficult – but it has been pursued

Page 20: Phylogenetic Analysis

1) The multiple alignment

• Already discussed
• Some notes:
  • Recall that MA programs are not exact
    • Some manual editing often necessary
  • Consider the algorithm used
    • Does it consider the phylogeny of the data?
    • Clustal’s guide tree: Not correct phylogeny
  • What parameters are used?
  • Solve ambiguities, remove near-identical sequences
    • Gappy regions, identical sequences can bias the result

Page 21: Phylogenetic Analysis

2) The model

• The model describes the data
  • Evolutionary events
  • Overall mutability
• Evolutionary model?
  • Crucial – both for alignment and tree building
• Are you looking at nucleotides or amino acids?
  • Where do we get most information?
• Know the basis for the chosen model

Page 22: Phylogenetic Analysis

Nucleotide models
• Create 4×4 matrix
• Either fixed cost
  • Character state
• Or rate matrices
  • Probabilities
• Used for different kinds of tree estimations
• Include site specific information
  • Third codon position more variable

Page 23: Phylogenetic Analysis

Nucleotide model 1
• Fixed cost for transitions and transversions
  • E.g. transversions are twice as costly as transitions
• For a tree: Count the number of transitions/transversions
  • Calculate cost
• Tends to minimize number of transversions
  • Cluster transitions

      A   C   G   T
  A   -   2   1   2
  C   2   -   2   1
  G   1   2   -   2
  T   2   1   2   -
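The fixed-cost idea can be sketched in a few lines of Python. This is only an illustration under the costs shown in the matrix (transition 1, transversion 2); the function names and the two example sequences are made up for this sketch and do not come from the lecture.

```python
# Purines and pyrimidines; a change within one class is a transition,
# a change between classes is a transversion.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_cost(x, y, transition=1, transversion=2):
    """Fixed cost of changing base x into base y (0 if unchanged)."""
    if x == y:
        return 0
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition if same_class else transversion


def total_cost(seq1, seq2):
    """Sum of per-site costs between two aligned, gap-free sequences."""
    return sum(substitution_cost(a, b) for a, b in zip(seq1, seq2))


# Hypothetical pair of aligned sequences: two transitions and one transversion.
print(total_cost("AACGT", "GATGA"))
```

Because a transversion costs twice as much as a transition, a tree search that minimizes this total will prefer explanations built from transitions.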

Page 24: Phylogenetic Analysis

Nucleotide model 2
• Simple substitution rate matrix
• Assume same rates A→B and B→A
• Assume all mutations equally likely: Rate α
• The Jukes-Cantor model

       A     C     G     T
  A   -3α    α     α     α
  C    α    -3α    α     α
  G    α     α    -3α    α
  T    α     α     α    -3α

Page 25: Phylogenetic Analysis

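Nucleotide model 3
• More advanced rate matrix
• Include transitions/transversions
  • Rates α1 (transitions) and α2 (transversions)
• The Kimura 2-parameter model

          A            C            G            T
  A   -(α1+2α2)       α2           α1           α2
  C      α2        -(α1+2α2)       α2           α1
  G      α1           α2        -(α1+2α2)       α2
  T      α2           α1           α2        -(α1+2α2)

Each diagonal entry is minus the sum of the other entries in its row, so every row sums to zero (as in the Jukes-Cantor matrix).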
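As a rough sketch, both rate matrices can be built by one small Python function. It assumes α1 is the transition rate and α2 the transversion rate (as read off the off-diagonal entries), and it sets each diagonal entry to minus the sum of the rest of its row, which is the usual convention for rate matrices. The function names and the numeric rates are illustrative only.

```python
BASES = "ACGT"
PURINES = {"A", "G"}


def is_transition(x, y):
    """True if x and y (different bases) are both purines or both pyrimidines."""
    return (x in PURINES) == (y in PURINES)


def rate_matrix(alpha1, alpha2):
    """Kimura 2-parameter rate matrix as a dict of dicts.

    alpha1: transition rate, alpha2: transversion rate.
    With alpha1 == alpha2 this reduces to the Jukes-Cantor matrix.
    """
    q = {}
    for x in BASES:
        row = {y: (alpha1 if is_transition(x, y) else alpha2)
               for y in BASES if y != x}
        row[x] = -sum(row.values())  # rows of a rate matrix sum to zero
        q[x] = row
    return q


jukes_cantor = rate_matrix(0.1, 0.1)   # diagonal is -3 * 0.1
kimura = rate_matrix(0.2, 0.05)        # diagonal is -(0.2 + 2 * 0.05)
for base in BASES:
    print(base, [round(kimura[base][other], 2) for other in BASES])
```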

Page 26: Phylogenetic Analysis

Amino acid models
• A 20×20 substitution matrix
• The BLOSUM matrices
  • Fixed cost matrices
• Or the PAM matrices
  • Rate matrices
• Described last week

Page 27: Phylogenetic Analysis

3) Building the tree

• We have the sequences, the alignment and the model
• Find the best tree
  • What is the best tree?
• Two main strategies:
  • Distance based: Look at dissimilarities (= distances)
  • Character based: Look at the data

Page 28: Phylogenetic Analysis

Problems with trees
• The number of possible trees grows faster than exponentially with the number of taxa
  • For 15 taxa: 2.13·10^14 possible rooted trees…
• How to search?
  • Branch and Bound
  • Branch swapping
• Rooting the tree
  • Not a simple problem
  • All the following methods produce unrooted trees
  • Use an outgroup
  • Midpoint of longest branch
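The size of the search space can be checked with a short Python sketch. It uses the standard counting formulas for fully bifurcating, labelled trees: (2n-3)!! rooted trees and (2n-5)!! unrooted trees for n taxa. The function names are chosen for this example only.

```python
def double_factorial(k):
    """k!! = k * (k - 2) * (k - 4) * ... down to 1 or 2."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def num_rooted_trees(n_taxa):
    """Number of rooted bifurcating trees on n_taxa labelled leaves (n_taxa >= 2)."""
    return double_factorial(2 * n_taxa - 3)


def num_unrooted_trees(n_taxa):
    """Number of unrooted bifurcating trees on n_taxa labelled leaves (n_taxa >= 3)."""
    return double_factorial(2 * n_taxa - 5)


for n in (4, 5, 10, 15):
    print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

For 15 taxa the rooted count is about 2.13·10^14, which is why exhaustive search is replaced by branch and bound or by heuristic branch swapping.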

Page 29: Phylogenetic Analysis

Distance methods
• Some sequences more similar than others
• Closely related sequences should be close in the tree
• Abstract view on the data
  • Loss of information is usually a bad sign
  • Only use the distances between sequences
• Recall Clustal
• All methods start with a distance matrix

Page 30: Phylogenetic Analysis

Distance methods
• Can we get the correct answer?
• Yes, if all mutation events were present
• But: After one mutation, the site is ”saturated”
  • Additional mutations do not give additional info
  • A → B → C: Distance 2, but we only observe A vs. C: Distance 1
• And mutations back will fool the method
  • A → B → A: Distance 2, but we only observe A vs. A: Distance 0

Page 31: Phylogenetic Analysis

UPGMA

• Unweighted Pair Group Method with Arithmetic Mean
  • Unweighted: The distances are used as they are
  • Pair: Find the two closest elements
  • Group: Put them together in a new group
  • Arithmetic Mean: Gives distances from the new group
• Correct tree assuming a molecular clock
  • Evolutionary divergence time can be found from mutations
  • Mutation rates are constant

Page 32: Phylogenetic Analysis

UPGMA illustrated
• Find two closest: A and D
• Create a new group [A+D]
• Update distances:

  $d_{[A+D],B} = \frac{d_{AB} + d_{DB}}{2} = \frac{8 + 6}{2} = 7$

Original distances:

      A   B   C   D   E
  A   -   8   3   2   5
  B   -   -   5   6   6
  C   -   -   -   7   5
  D   -   -   -   -   3
  E   -   -   -   -   -

After merging A and D:

        A+D   B   C   E
  A+D    -    7   5   4
  B      -    -   5   6
  C      -    -   -   5
  E      -    -   -   -

• Repeat for all sequences
• Next time: Connect [A+D] with E
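The update rule can be sketched in Python as follows. This is a minimal illustration, not the PHYLIP implementation: it averages distances weighted by cluster size (the usual UPGMA rule, which equals the simple mean for the first merge), breaks ties by iteration order, and only records the order of the merges. The function name and data layout are assumptions of this sketch; the distances are the ones from the table.

```python
from itertools import combinations


def upgma(names, dist):
    """Return the list of merges (cluster_a, cluster_b, distance) in order.

    dist maps frozenset({x, y}) to the distance between x and y.
    """
    clusters = {name: 1 for name in names}  # cluster label -> number of leaves
    d = dict(dist)
    merges = []
    while len(clusters) > 1:
        # Pair: find the two closest clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        merges.append((a, b, d[frozenset((a, b))]))
        # Group: replace them by a new cluster.
        new = f"[{a}+{b}]"
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        # Arithmetic mean: distance from the new cluster to every other one.
        for c in clusters:
            d[frozenset((new, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[new] = size_a + size_b
    return merges


table = {"A": {"B": 8, "C": 3, "D": 2, "E": 5},
         "B": {"C": 5, "D": 6, "E": 6},
         "C": {"D": 7, "E": 5},
         "D": {"E": 3}}
dist = {frozenset((x, y)): v for x, row in table.items() for y, v in row.items()}

for step in upgma("ABCDE", dist)[:2]:
    print(step)
```

The first two merges are A with D at distance 2, then [A+D] with E at distance 4. The third step is a tie (both remaining candidate pairs are at distance 5), so its outcome depends on the tie-breaking rule.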

Page 33: Phylogenetic Analysis

Trying UPGMA Go to the wiki and do the UPGMA exercise

Page 34: Phylogenetic Analysis

Neighbour joining
• A little like UPGMA
• Difference: NJ does not assume a molecular clock
• But it assumes an additive tree
  • Distance between two leaves is the sum of the edges
• Find the closest pair that is most apart from the rest of the tree
• Connect pair and update distances
  • A little advanced: Take the overall distance to the rest of the tree into account
• Corrects for varying mutation rates
• Fast and can give good results
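A small Python sketch of the pair-selection step may help here. It uses the standard neighbour-joining criterion Q(i,j) = (n-2)·d(i,j) - Σ_k d(i,k) - Σ_k d(j,k), which is not spelled out in the lecture, and picks the pair with the smallest Q. The four-taxon distances are hypothetical: they are additive on a tree ((A,B),(C,D)) in which B and D sit on long branches.

```python
from itertools import combinations


def nj_pick_pair(names, d):
    """Return (Q, i, j) for the pair minimizing the neighbour-joining criterion.

    d maps frozenset({x, y}) to the distance between x and y.
    """
    n = len(names)
    # Overall distance from each taxon to all the others.
    total = {i: sum(d[frozenset((i, k))] for k in names if k != i) for i in names}
    best = None
    for i, j in combinations(names, 2):
        q = (n - 2) * d[frozenset((i, j))] - total[i] - total[j]
        if best is None or q < best[0]:
            best = (q, i, j)
    return best


# Hypothetical additive distances: branches A=1, B=4, C=1, D=4, internal branch 1.
pairs = {"AB": 5, "AC": 3, "AD": 6, "BC": 6, "BD": 9, "CD": 5}
d = {frozenset(k): v for k, v in pairs.items()}

closest = min(pairs, key=pairs.get)
print("closest pair:", closest)
print("NJ joins:", nj_pick_pair("ABCD", d))
```

In this example the closest pair is A and C, so a method that simply joins the nearest pair would group them, while the Q criterion selects A and B, which are true neighbours in the generating tree.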

Page 35: Phylogenetic Analysis

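Fitch-Margoliash
• FM method
• We have the pairwise distances
• Each branch in the tree has a length
• The length of all paths can be found
• Optimize tree by moving internal nodes around
• The best fit minimizes the overall error
  • The minimum squared deviation

  $\sum_{i,j} (d_{ij} - p_{ij})^2$

  where $d_{ij}$ is the pairwise distance between sequences i and j and $p_{ij}$ is the length of the path between them in the tree.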
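The squared-deviation criterion can be written out as a short Python sketch. It assumes a fixed four-taxon topology ((A,B),(C,D)) with one internal branch and compares two hypothetical sets of branch lengths against hypothetical observed distances; the unweighted sum is used, as in the formula above. All names and numbers are made up for illustration.

```python
from itertools import combinations


def path_lengths(branch):
    """Path lengths p_ij on the quartet ((A,B),(C,D)).

    branch holds the four leaf branch lengths plus the internal branch 'mid'.
    """
    same_side = {frozenset("AB"), frozenset("CD")}
    p = {}
    for i, j in combinations("ABCD", 2):
        pair = frozenset((i, j))
        crosses_middle = pair not in same_side
        p[pair] = branch[i] + branch[j] + (branch["mid"] if crosses_middle else 0)
    return p


def squared_error(d, p):
    """Sum over all pairs of (observed distance - path length)^2."""
    return sum((d[pair] - p[pair]) ** 2 for pair in d)


# Hypothetical observed distances.
observed = {frozenset(k): v for k, v in
            {"AB": 5, "AC": 3, "AD": 6, "BC": 6, "BD": 9, "CD": 5}.items()}

good_fit = {"A": 1, "B": 4, "C": 1, "D": 4, "mid": 1}
worse_fit = {"A": 1, "B": 4, "C": 1, "D": 4, "mid": 2}

print(squared_error(observed, path_lengths(good_fit)))
print(squared_error(observed, path_lengths(worse_fit)))
```

A full method would search over branch lengths and topologies for the smallest error; this sketch only evaluates the criterion for two candidates.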

Page 36: Phylogenetic Analysis

Minimum Evolution
• The ME method
• Find the shortest tree
  • Count number of changes
• Similar to FM but only looks at branches

  $\sum_{i,j} (d_{ij} - l_{ij})^2$

[Figure: FM and ME compared on a tree with leaves A and B]

Page 37: Phylogenetic Analysis

Trying NJ Go to the wiki and do the NJ exercise

Page 38: Phylogenetic Analysis

Character methods
• Use the data (the actual characters)
  • All information at hand
• More advanced, slower, but also more accurate
• Maximum Parsimony (MP)
  • Occam’s razor: Simplest explanation
• Maximum Likelihood (ML)
  • Advanced statistical method
  • The tree under which the data are most probable, given the model

Page 39: Phylogenetic Analysis

Maximum parsimony
• How does evolution work?
  • Assumption: Path of least resistance
  • True evolution gives rise to fewest changes
• The tree we want:
  • Describe the given sequences by fewest changes
  • The ancestral nodes must be as similar as possible
• Predict a tree
  • Count the number of changes needed

Page 40: Phylogenetic Analysis

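MP illustrated

[Figure: tree with leaves A, C, G, G, C; internal nodes labelled with the candidate state sets {A,C}, {G}, {A,C,G} and, at the root, {C}]

Page 41: Phylogenetic Analysis

MP illustrated

[Figure: the same tree with the two required changes marked X]

Cost: 2 changes

Page 42: Phylogenetic Analysis

MP illustrated

[Figure: the same tree with the internal nodes resolved to C, G, C and C; the two changes are C→G and C→A]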
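The set labels in this example match what Fitch's small-parsimony algorithm produces, so a short Python sketch of that algorithm is given here. The nested-tuple tree below is an assumption: it is one topology consistent with the sets {A,C}, {G}, {A,C,G} and {C} shown on the slides, and the function name is chosen for this example.

```python
def fitch(tree):
    """Bottom-up pass of Fitch's algorithm for a single alignment column.

    A leaf is a one-letter string; an internal node is a (left, right) tuple.
    Returns (candidate state set for this node, number of changes below it).
    """
    if isinstance(tree, str):
        return {tree}, 0
    left_set, left_cost = fitch(tree[0])
    right_set, right_cost = fitch(tree[1])
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no change needed here.
        return common, left_cost + right_cost
    # No shared state: take the union and count one change.
    return left_set | right_set, left_cost + right_cost + 1


# Assumed topology for the leaves A, C, G, G, C.
tree = ((("A", "C"), ("G", "G")), "C")
root_states, changes = fitch(tree)
print(root_states, changes)
```

Running this gives the root set {C} and a cost of 2 changes, in agreement with the slides. For a whole alignment the cost is summed over all columns, and the search then looks for the topology with the smallest total.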

Page 43: Phylogenetic Analysis

Maximum Likelihood
• Given the data, find the model under which the data are most probable
  • Can optimize both tree and substitution model
• We know the sequences
• What are the most likely substitution rates?
  • Estimate from the alignment (and the phylogeny)
• And what is the most likely tree?
  • Estimate from alignment and substitution rates
• Computationally heavy and rather slow
• Normally good results

Page 44: Phylogenetic Analysis

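Maximum Likelihood
• General practice: Optimize model then tree
• Calculate probability for each alignment column
• Combine to probability for entire alignment
  • Averages over low and high probability sites
• Likelihood of column given tree

[Figure: three copies of a tree with leaves A, A, C; the two internal nodes are assigned (A, A), (C, A) and (G, A) respectively]

  L = P(assignment 1) + P(assignment 2) + P(assignment 3) + …

  i.e. the column likelihood is the sum of the probabilities over all assignments of bases to the internal nodes.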
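The sum over internal-node assignments can be written as a brute-force Python sketch. Several things here are assumptions made for illustration: the Jukes-Cantor model with rate α as in the rate matrix earlier, equal base frequencies of 1/4 at the root, a fixed tree ((leaf1, leaf2), leaf3) with two internal nodes, and the same length t for every branch. The transition probabilities are the standard Jukes-Cantor ones for that rate matrix.

```python
import math
from itertools import product

BASES = "ACGT"


def jc_prob(x, y, alpha, t):
    """Jukes-Cantor probability that base x has become base y after time t."""
    e = math.exp(-4 * alpha * t)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e


def column_likelihood(column, alpha=0.1, t=1.0):
    """Likelihood of one alignment column on the tree ((leaf1, leaf2), leaf3).

    Sums over all 16 assignments of bases to the root and the inner node.
    """
    leaf1, leaf2, leaf3 = column
    total = 0.0
    for root, inner in product(BASES, repeat=2):
        total += (0.25                              # base frequency at the root
                  * jc_prob(root, inner, alpha, t)  # root -> inner node
                  * jc_prob(inner, leaf1, alpha, t)
                  * jc_prob(inner, leaf2, alpha, t)
                  * jc_prob(root, leaf3, alpha, t))
    return total


def alignment_log_likelihood(columns, alpha=0.1, t=1.0):
    """Columns are treated as independent, so their log-likelihoods add up."""
    return sum(math.log(column_likelihood(c, alpha, t)) for c in columns)


print(column_likelihood("AAC"))
print(alignment_log_likelihood(["AAC", "AAA", "GGG"]))
```

Real programs avoid the explicit enumeration with a dynamic-programming pass over the tree (Felsenstein's pruning algorithm), but the quantity computed is the same.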

Page 45: Phylogenetic Analysis

Maximum likelihood
• Then repeat this for all possible tree topologies
• And all possible assignments to internal nodes
• And then choose the combination that gives the highest probability…
• Clearly very difficult

Page 46: Phylogenetic Analysis

MP and ML exercise Go to the wiki and do the MP and ML exercises

Page 47: Phylogenetic Analysis

Summary of methods

                         Distance              Character based

  Clustering             UPGMA                 -
                         Neighbour Joining

  Optimality criterion   Least Squares         Maximum parsimony
                         Minimum evolution     Maximum likelihood
                                               (Bayesian statistics)

Page 48: Phylogenetic Analysis

The differences
• Sometimes the differences can seem minimal
• They affect the tree – but the same result is possible

• UPGMA and NJ: Minimize the overall length of the tree
• Maximum parsimony: Finds tree with fewest changes
• Maximum likelihood: Maximizes the probability of the data given the tree and the model (the likelihood)

Page 49: Phylogenetic Analysis

4) Evaluating trees

• How good is the predicted tree?
• Some sequence variation needed
  • Is the signal strong enough?
• There are so many possible trees
  • Are there many trees similar to the prediction?
  • Which one to choose?
• Is the tree robust?
  • Does it change much when e.g. removing a sequence?

Page 50: Phylogenetic Analysis

Randomization
• Is it possible that the tree is just random?
• Permute the columns of the alignment
  • i.e. shuffle the characters in a column
• Build a new tree
  • Is it (partly) identical?
• If the tree is just as likely to be random, then don’t put too much faith in it
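One way to read the permutation step is sketched below in Python: within each column the characters are shuffled among the sequences, so every column keeps its composition while the association between sequences is destroyed. The function name, the dictionary layout and the fixed seed are assumptions of this sketch; the four sequences are the example alignment used later in the lecture.

```python
import random


def shuffle_within_columns(alignment, seed=None):
    """Return a copy of the alignment with each column shuffled independently.

    alignment maps sequence names to equal-length aligned strings.
    """
    rng = random.Random(seed)
    names = list(alignment)
    columns = [list(col) for col in zip(*(alignment[n] for n in names))]
    for col in columns:
        rng.shuffle(col)  # which sequence gets which character is randomized
    rows = zip(*columns)
    return {name: "".join(row) for name, row in zip(names, rows)}


alignment = {"A": "AGGCUCCAAA",
             "B": "AGGUUCGAAA",
             "C": "AGCCCCGAAA",
             "D": "AUUUCCGAAC"}

randomized = shuffle_within_columns(alignment, seed=1)
for name, seq in randomized.items():
    print(name, seq)
```

A tree built from the randomized alignment can then be compared with the tree from the real data; if the two are about equally good, the real tree carries little signal.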

Page 51: Phylogenetic Analysis

Bootstrapping
• The story of Baron von Münchausen
  • He pulled himself out of a swamp by his bootstraps
• The idea: Evaluate the quality of the result using the same data all over again
  • Make a large number of new datasets
  • Create a phylogenetic tree for each
  • Observe the number of times clades are made

Page 52: Phylogenetic Analysis

Bootstrapping
• The datasets should be similar
  • Thereby: The trees are comparable
  • Alignments of same size (length and sequences)
• Non-parametric: Sample with replacement
  • Choose a random column and add it to the new alignment
• Parametric: Simulate new datasets
  • Use a model that looks like your data
• Characteristics are preserved (unlike randomization)

Page 53: Phylogenetic Analysis

Bootstrap example
• Non-parametric bootstrapping
• We have an alignment (# = number of times each column is sampled):

  A: A G G C U C C A A A
  B: A G G U U C G A A A
  C: A G C C C C G A A A
  D: A U U U C C G A A C
  #: 0 1 2 0 3 0 1 2 0 1

• Sample columns:

  A: G G G U U U C A A A
  B: G G G U U U G A A A
  C: G C C C C C G A A A
  D: U U U C C C G A A C

• Distances (number of differing positions):

      A   B   C   D
  A   -   -   -   -
  B   1   -   -   -
  C   6   5   -   -
  D   8   7   4   -

[Figure: tree built from this sample, with leaves A, B, C, D]
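The sampling step and the distance matrix can be reproduced with a short Python sketch. The function names are made up for this example. `resample_from_counts` rebuilds the sample from the per-column counts given in the `#` row, while `resample_columns` draws a fresh random sample with replacement. Distances are plain mismatch counts between the sampled sequences.

```python
import random


def resample_from_counts(alignment, counts):
    """Build a bootstrap alignment in which column i appears counts[i] times."""
    picks = [i for i, c in enumerate(counts) for _ in range(c)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}


def resample_columns(alignment, rng):
    """Draw as many columns as the alignment has, with replacement."""
    length = len(next(iter(alignment.values())))
    picks = sorted(rng.randrange(length) for _ in range(length))
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}


def hamming(s, t):
    """Number of positions at which two aligned sequences differ."""
    return sum(a != b for a, b in zip(s, t))


alignment = {"A": "AGGCUCCAAA",
             "B": "AGGUUCGAAA",
             "C": "AGCCCCGAAA",
             "D": "AUUUCCGAAC"}

sample = resample_from_counts(alignment, [0, 1, 2, 0, 3, 0, 1, 2, 0, 1])
for name, seq in sample.items():
    print(name, seq)
for x, y in [("A", "B"), ("A", "C"), ("B", "C"), ("A", "D"), ("B", "D"), ("C", "D")]:
    print(x, y, hamming(sample[x], sample[y]))

another = resample_columns(alignment, random.Random(0))  # a fresh random replicate
```

Counting mismatches directly gives 1 for A-B, 6 for A-C, 5 for B-C, 8 for A-D, 7 for B-D and 4 for C-D. Each replicate's distance matrix is then fed to the same tree-building method as the original data.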

Page 54: Phylogenetic Analysis

Bootstrap example
• Sample 2:

  A: A U U C C C C A A A
  B: A U U C C G G A A A
  C: A C C C C G G A A A
  D: A C C C C G G C C C

      A   B   C   D
  A   -   -   -   -
  B   2   -   -   -
  C   4   2   -   -
  D   7   5   3   -

[Figure: tree built from this sample, with leaves A, B, C, D]

Page 55: Phylogenetic Analysis

Bootstrap example
• Sample 3:

  A: A C C C A A G G C C
  B: A C C G A A G G U U
  C: A C C G A A C C C C
  D: A C C G C C U U U U

      A   B   C   D
  A   -   -   -   -
  B   3   -   -   -
  C   3   4   -   -
  D   7   4   6   -

[Figure: tree built from this sample, with leaves A, B, C, D]

Page 56: Phylogenetic Analysis

Bootstrap example
• Calculate consensus tree
  • Can be done in many ways
• Put the bootstrap number at each branch point
  • The proportion of times this branch is observed
• Of course, more than three samples needed

[Figure: consensus tree with leaves A, B, C, D and bootstrap values 1.0 and 0.66 at the two branch points]
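Counting how often a clade appears can be sketched in Python as follows. Each bootstrap tree is represented simply as the set of clades (groups of leaves) it contains. The three trees below are hypothetical, since the individual bootstrap trees are not recoverable from the text; they are chosen only so that one clade appears in all three trees and another in two of three.

```python
from collections import Counter


def clade_support(trees):
    """Proportion of trees in which each clade occurs.

    Each tree is an iterable of clades; a clade is a frozenset of leaf names.
    """
    counts = Counter(clade for tree in trees for clade in set(tree))
    return {clade: n / len(trees) for clade, n in counts.items()}


# Hypothetical bootstrap trees on the leaves A, B, C, D.
bootstrap_trees = [
    [frozenset("AB"), frozenset("ABC")],   # ((A,B),C),D
    [frozenset("AB"), frozenset("ABC")],   # ((A,B),C),D
    [frozenset("AB"), frozenset("CD")],    # (A,B),(C,D)
]

support = clade_support(bootstrap_trees)
for clade, value in sorted(support.items(), key=lambda kv: -kv[1]):
    print("".join(sorted(clade)), round(value, 2))
```

With these made-up trees the clade {A,B} has support 1.0 and {A,B,C} has support 2/3. In practice hundreds or thousands of replicates are used, and PHYLIP's consense program performs the counting.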

Page 57: Phylogenetic Analysis

Bootstrapping exercise Do the bootstrapping exercise on the wiki

Page 58: Phylogenetic Analysis

Summary
• What is phylogenetic inference?
• What can a phylogenetic tree be used for?
• Be aware of the multiple alignment
• The different models
• Tree building methods: NJ, UPGMA, ML and MP
• Evaluating trees: Bootstrapping
• Programs: Phylip, PAUP*, PhyloWin and BioEdit

Next time: Gene finding (with Anders Krogh)
Then RNA structure prediction with me again