Phylogenetics Reconstructing the Tree of Life Carol E. Lee University of Wisconsin Copyright©2020; do not upload without permission

Phylogenetics - University of Wisconsin–Madison


Page 1: Phylogenetics - University of Wisconsin–Madison

Phylogenetics: Reconstructing the Tree of Life

Carol E. Lee, University of Wisconsin. Copyright ©2020; do not upload without permission

Page 2: Phylogenetics - University of Wisconsin–Madison

• In the Speciation lecture, I talked about a “Phylogenetic Species Concept”

– What is a “Phylogeny”?
– How do you construct one?
– Why on earth should I care?

Page 3: Phylogenetics - University of Wisconsin–Madison

Why you should care:

• All biological relationships can be determined by constructing phylogenies: even if phylogenies are not always the best way to define species boundaries, they do tell you the genetic and evolutionary relationships among groups and individuals

– Your ancestry

– Diseases—figure out evolutionary origins and evolutionary pathways of disease, like HIV, Ebola, SARS, etc.

– Crops and livestock (food security)—rescue from inbreeding, create new varieties

– Endangered species: figure out how endangered populations are related and how to perform genetic rescue

Page 4: Phylogenetics - University of Wisconsin–Madison

Tree of Life Web Project: http://www.tolweb.org/tree/

Page 5: Phylogenetics - University of Wisconsin–Madison

Tree of Life 2016 (Hug et al. 2016, Nature Microbiology)

Bacteria

Archaea

Eukarya

Page 6: Phylogenetics - University of Wisconsin–Madison

Outline

1. What is a phylogeny?

2. How do you construct a phylogeny?
   – The Molecular Clock
   – Statistical Methods

Page 7: Phylogenetics - University of Wisconsin–Madison

Are genetic distances and the fossil record roughly congruent?

Think about relationships among the major lineages of life and when they appeared in the fossil record

Page 8: Phylogenetics - University of Wisconsin–Madison

Fossil Record vs Molecular Clock

• Molecular clock and fossil record are not always congruent
– Fossil record is incomplete, and soft-bodied species are usually not preserved
– Mutation rates can vary among species (depending on generation time, replication error, mismatch repair)

• But they provide complementary information
– Fossil record contains extinct species, while molecular data are based on extant taxa
– Major events in the fossil record can be used to calibrate the molecular clock

Page 9: Phylogenetics - University of Wisconsin–Madison

Evolutionary History of HIV

Evolutionary Analysis, Freeman & Herron, 2004 (figure axis: Time)

HIV evolved multiple times from SIV (Simian Immunodeficiency Virus)

Page 10: Phylogenetics - University of Wisconsin–Madison

Charles Darwin (1809-1882)

On the Origin of Species (1859)

– Living species are related by common ancestry

– Change through time occurs at the population not the organism level

– The main cause of adaptive evolution is natural selection

Page 11: Phylogenetics - University of Wisconsin–Madison

Darwin envisaged evolution as a tree

The affinities of all the beings of the same class have sometimes been represented by a great tree. I believe this simile largely speaks the truth ... The green and budding twigs may represent existing species; and those produced during former years may represent the long succession of extinct species ... the great Tree of Life ... covers the earth with ever-branching and beautiful ramifications

Charles Darwin, On the Origin of Species; pages 131-132

Page 12: Phylogenetics - University of Wisconsin–Madison

Reconstructing the Tree of Life

Page 13: Phylogenetics - University of Wisconsin–Madison

The only figure in The Origin of Species

Page 14: Phylogenetics - University of Wisconsin–Madison

Lamarck proposed a ladder of life

What did people believe before Darwin?

Past Future

Page 15: Phylogenetics - University of Wisconsin–Madison

Jean-Baptiste Lamarck

• French naturalist (1744-1829)

• “Professor of Worms and Insects” in Paris

• The first scientific theory of evolution (inheritance of acquired traits)

Page 16: Phylogenetics - University of Wisconsin–Madison

Lamarck’s View of Evolution

• Continuum between physical and biological world (followed Aristotle)

• Scala Naturae (“Ladder of Life” or “Great Chain of Being”)

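(Figure: the Great Chain of Being, listed from top to bottom as God, Angels, Demons, Man, Animals, Plants, Minerals, with the levels labeled Being, Realm of Being, Realm of Becoming, and Non-Being)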

Page 17: Phylogenetics - University of Wisconsin–Madison

What is wrong with a ladder?

• Evolution is not linear but branching

• Living organisms are not ancestors of one another

• The ladder implies progress

Page 18: Phylogenetics - University of Wisconsin–Madison

What is right with the tree?

• Evolution is a branching process

• If a mutation occurs, one species is not turning into another, but there is a split, and both lineages continue to evolve

• So, evolution is not progressive: all living taxa are equally “successful”

• Phylogenies (Trees) reflect the hierarchical structuring of relationships

Page 19: Phylogenetics - University of Wisconsin–Madison

The only figure in The Origin of Species

Page 20: Phylogenetics - University of Wisconsin–Madison

The Tree of Life is a Fractal

Page 21: Phylogenetics - University of Wisconsin–Madison

Genealogical structures

• Phylogeny
– A depiction of the ancestry relations between species (it includes speciation events)
– Tree-like (divergent)

• Pedigree
– A depiction of the ancestry relations within populations
– Net-like (reticulating)

Page 22: Phylogenetics - University of Wisconsin–Madison

Four butterflies connected to their parents

offspring

parents

Page 23: Phylogenetics - University of Wisconsin–Madison

(Figure: individuals within a population, with time running from past to future)

Page 24: Phylogenetics - University of Wisconsin–Madison

(Figure: populations within a lineage/species, forming a phylogeny)

What happened here?

Lineage-branching: Speciation

Page 25: Phylogenetics - University of Wisconsin–Madison

What happened here?

Extinction

Page 26: Phylogenetics - University of Wisconsin–Madison

Representation of phylogenies?

(Figure: the true history of taxa A, B, C, shown next to a simplified representation of the same tree)

Page 27: Phylogenetics - University of Wisconsin–Madison

Some terms used to describe a phylogenetic tree

• Taxon (taxa)
• Tip
• Internal branch
• Internode
• Node (speciation event)
• Root

Page 28: Phylogenetics - University of Wisconsin–Madison

Outline

1. What is a phylogeny?

2. How do you construct a phylogeny?
   – The Molecular Clock
   – Statistical Methods

Page 29: Phylogenetics - University of Wisconsin–Madison

• A phylogenetic tree represents a hypothesis about evolutionary relationships

• Each branch point represents the divergence of two taxa (e.g. species)

• Sister taxa are groups that share an immediate common ancestor

What is a Phylogeny?

Page 30: Phylogenetics - University of Wisconsin–Madison

(Figure: a tree of Taxon A through Taxon F with these labels: branch point (node); sister taxa; ancestral lineage; polytomy (unresolved branching point); common ancestor of taxa A–F)

Page 31: Phylogenetics - University of Wisconsin–Madison

Molecular Clock

• Phylogenies rely on the “Molecular Clock,” namely the assumption that mutations, on average, accumulate at a roughly constant rate

• So, on average, more mutational differences between taxa mean that they branched from a common ancestor longer ago

• So longer branches on a phylogeny often → greater evolutionary distance

Example: mitochondrial DNA: ~2.2% sequence divergence per million years
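
The clock logic on this slide can be sketched numerically. This is a minimal illustration, not a dating method: it assumes the ~2.2% figure is a pairwise divergence rate between two lineages and that the rate is constant. The function name and the 4.4% input are hypothetical.

```python
def divergence_time_my(percent_divergence, percent_per_my=2.2):
    """Estimate time since two lineages split, in millions of years.

    Assumes a strict clock: sequence divergence accumulates at a constant
    pairwise rate (percent divergence per million years).
    """
    if percent_per_my <= 0:
        raise ValueError("rate must be positive")
    return percent_divergence / percent_per_my


# Hypothetical pair of taxa whose mtDNA differs at 4.4% of sites
estimate = divergence_time_my(4.4)
print(f"{estimate:.1f} million years")
```

If the rate differs among lineages, a single fixed rate like this one gives biased dates.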

Page 32: Phylogenetics - University of Wisconsin–Madison
Page 33: Phylogenetics - University of Wisconsin–Madison

Phylogeny of 53 humans (Homo sapiens) just based on mtDNA

A different locus might yield a different tree

The horizontal branch lengths reflect genetic distance ≈ # of mutations

Page 34: Phylogenetics - University of Wisconsin–Madison

Cladogram of mitochondrial cytochrome oxidase II alleles in humans and the African Great Apes (Ruvolo et al. 1994)

A cladogram shows the hierarchical relationships among the taxa, but the branch lengths do not reflect evolutionary time.

This is not a phylogeny, but a cladogram.

Page 35: Phylogenetics - University of Wisconsin–Madison

Problem: mutation rate can vary among species

• Mutation rate is faster with:
– Shorter generation time (greater number of meiosis or mitosis events in a given time)
– Replication error (e.g. sloppy DNA or RNA polymerase; poor mismatch repair mechanisms)

Molecular Clock


Page 36: Phylogenetics - University of Wisconsin–Madison

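(Figure: hierarchical classification of five species in the order Carnivora)

Order       Family       Genus      Species
Carnivora   Felidae      Panthera   Panthera pardus
Carnivora   Mustelidae   Taxidea    Taxidea taxus
Carnivora   Mustelidae   Lutra      Lutra lutra
Carnivora   Canidae      Canis      Canis latrans
Carnivora   Canidae      Canis      Canis lupus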

Page 37: Phylogenetics - University of Wisconsin–Madison

(Figure: three copies of a tree of taxa A-G, highlighting Group I, Group II, and Group III)

(a) Monophyletic group (clade): Group I
(b) Paraphyletic group: Group II
(c) Polyphyletic group: Group III

A monophyletic group (clade) consists of an ancestral taxon and all of its descendants
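
The definition can be turned into a small check. This is a sketch under assumptions: the tree is written as nested Python tuples, and the topology for taxa A-G is hypothetical (the topology drawn on the slide is not recoverable here). A group counts as monophyletic when its members are exactly the tips below some node.

```python
def leaves(tree):
    """Return the set of tip names below a node (a tip is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clades(tree):
    """Yield the tip set of every node in the tree, tips included."""
    yield leaves(tree)
    if not isinstance(tree, str):
        for child in tree:
            yield from clades(child)


def is_monophyletic(tree, group):
    """True if the group is exactly the set of tips descending from one node."""
    return frozenset(group) in set(clades(tree))


# Hypothetical topology, for illustration only
tree = ((("A", "B"), "C"), (("D", "E"), ("F", "G")))
print(is_monophyletic(tree, {"A", "B", "C"}))  # ancestor plus all descendants
print(is_monophyletic(tree, {"D", "E", "F"}))  # leaves out descendant G
```

The check only separates monophyletic groups from everything else; telling paraphyletic from polyphyletic groups would need extra logic.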

Page 38: Phylogenetics - University of Wisconsin–Madison

(Figure: tree of taxa A-G with Group I highlighted)

(a) Monophyletic group (clade)

(In the lecture on species concepts we discussed that the “smallest” monophyletic group is a “phylogenetic species”)

Page 39: Phylogenetics - University of Wisconsin–Madison

Examples of Paraphyletic Groups (not recognized as legitimate groups in the Phylogenetic Species Concept, which only recognizes monophyletic groups)

Page 40: Phylogenetics - University of Wisconsin–Madison
Page 41: Phylogenetics - University of Wisconsin–Madison
Page 42: Phylogenetics - University of Wisconsin–Madison

Synapomorphies

• Synapomorphies are shared derived homologous traits

• They can be DNA nucleotides or other heritable traits

• They are used to group taxa that are more closely related to one another

Page 43: Phylogenetics - University of Wisconsin–Madison
Page 44: Phylogenetics - University of Wisconsin–Madison
Page 45: Phylogenetics - University of Wisconsin–Madison
Page 46: Phylogenetics - University of Wisconsin–Madison

synapomorphies

Page 47: Phylogenetics - University of Wisconsin–Madison
Page 48: Phylogenetics - University of Wisconsin–Madison
Page 49: Phylogenetics - University of Wisconsin–Madison
Page 50: Phylogenetics - University of Wisconsin–Madison

Sometimes similar looking traits are not homologous, and are not synapomorphies, but are the result of convergent evolution

Page 51: Phylogenetics - University of Wisconsin–Madison
Page 52: Phylogenetics - University of Wisconsin–Madison
Page 53: Phylogenetics - University of Wisconsin–Madison

How do we construct Phylogenies?

Page 54: Phylogenetics - University of Wisconsin–Madison

Phylogenetic Methods

• Parsimony: Minimize # steps

• Distance Matrix: minimize pairwise genetic distances

• Maximum Likelihood: Probability of the data given the tree

• Bayesian: Probability of the tree given the data

Page 55: Phylogenetics - University of Wisconsin–Madison

Parsimony Uses Discrete Characters (like mutations, or some heritable trait)

Select the tree with the minimum number of character-state transitions summed across all characters

Page 56: Phylogenetics - University of Wisconsin–Madison

Parsimony: Example 1

(Fig. 26-15-1: three phylogenetic hypotheses for Species I, Species II, and Species III)

Page 57: Phylogenetics - University of Wisconsin–Madison

(Fig. 26-15-2: a table of the nucleotides at sites 1-4 for Species I, Species II, Species III, and the ancestral sequence; the change at site 1 (1/C) is mapped onto each of the three candidate trees)

Page 58: Phylogenetics - University of Wisconsin–Madison

(Fig. 26-15-3: the same site table for Species I, Species II, Species III, and the ancestral sequence; the changes at sites 1 (1/C), 2 (2/T), 3 (3/A), and 4 (4/C) are mapped onto each of the three candidate trees)

Page 59: Phylogenetics - University of Wisconsin–Madison

(Fig. 26-15-4: the same site table and the three candidate trees with the changes at sites 1 (1/C), 2 (2/T), 3 (3/A), and 4 (4/C) mapped onto them)

Event counts for the three trees: 7 events, 7 events, 6 events. The tree that requires 6 events is the most parsimonious.

Page 60: Phylogenetics - University of Wisconsin–Madison

Parsimony: Example 2

Three possible trees for taxa A, B, C and outgroup O (tip labels as drawn):

Tree 1: C B A O
Tree 2: A B C O
Tree 3: B A C O

Page 61: Phylogenetics - University of Wisconsin–Madison

Map the characters (mutations) onto tree 1

Taxon   1 2 3 4 5
A       G C G A A
B       G C A A A
C       G C A C T
O       T G G A A

(Figure: tree 1, tips C B A O, with characters 1 and 2 mapped onto it)

Page 62: Phylogenetics - University of Wisconsin–Madison

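Map the characters (mutations) onto tree 1

Taxon   1 2 3 4 5
A       G C G A A
B       G C A A A
C       G C A C T
O       T G G A A

(Figure: tree 1, tips C B A O, with characters 1, 2, 4, and 5 each mapped once and character 3 mapped twice)

Total number of steps = 6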

Page 63: Phylogenetics - University of Wisconsin–Madison

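Actually, there is more than one way to map character 3

Character 3: A = G, B = A, C = A, O = G

(Figure: two copies of tree 1, tips C B A O, each showing a different placement of the two changes in character 3)

Either way the character contributes 2 steps to the overall tree length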

Page 64: Phylogenetics - University of Wisconsin–Madison

Map the characters onto tree 2

Taxon   1 2 3 4 5
A       G C G A A
B       G C A A A
C       G C A C T
O       T G G A A

(Figure: tree 2, tips A B C O, with characters 1, 2, 3, 4, and 5 each mapped once)

Number of steps = 5

Page 65: Phylogenetics - University of Wisconsin–Madison

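Tree 3

Taxon   1 2 3 4 5
A       G C G A A
B       G C A A A
C       G C A C T
O       T G G A A

(Figure: tree 3, tips B A C O, with characters 1, 2, 4, and 5 each mapped once and character 3 mapped twice)

Length = 6 steps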

Page 66: Phylogenetics - University of Wisconsin–Madison

Most parsimonious tree

Which tree has the shortest length (fewest steps, and so is the most parsimonious)?

Tree 1: length = 6
Tree 2: length = 5
Tree 3: length = 6
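
The step counts in Example 2 can be reproduced with a short script. This is a sketch, assuming the character matrix as read from the slides (rows A, B, C, O; characters 1-5) and using Fitch's small-parsimony counting, which the slides do not name. The three rooted topologies are labeled by their sister pair, because the mapping from the drawn Trees 1-3 to these topologies is my reading of the reported lengths, not something the slides state.

```python
def fitch_steps(tree, states):
    """Minimum number of state changes for one character on a rooted binary tree."""
    steps = 0

    def visit(node):
        nonlocal steps
        if isinstance(node, str):          # tip: its observed state
            return {states[node]}
        left, right = (visit(child) for child in node)
        if left & right:                   # children agree: no change needed
            return left & right
        steps += 1                         # children disagree: one change
        return left | right

    visit(tree)
    return steps


def tree_length(tree, matrix):
    """Sum the Fitch step counts over all characters (columns)."""
    n_chars = len(next(iter(matrix.values())))
    return sum(
        fitch_steps(tree, {taxon: seq[i] for taxon, seq in matrix.items()})
        for i in range(n_chars)
    )


# Character matrix as read from the slides (an assumption about the layout)
matrix = {"A": "GCGAA", "B": "GCAAA", "C": "GCACT", "O": "TGGAA"}

# Rooted topologies with O as the outgroup, keyed by the ingroup sister pair
trees = {
    "A+B": ((("A", "B"), "C"), "O"),
    "B+C": (("A", ("B", "C")), "O"),
    "A+C": ((("A", "C"), "B"), "O"),
}
lengths = {name: tree_length(tree, matrix) for name, tree in trees.items()}
print(lengths)
```

Under these assumptions the topology that pairs B with C needs 5 steps and the other two need 6, which agrees with the lengths of 6, 5, and 6 reported for Trees 1, 2, and 3.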

Page 67: Phylogenetics - University of Wisconsin–Madison

Example from Freeman & Herron, Fig. 4.8

Where do the Whales belong?

Page 68: Phylogenetics - University of Wisconsin–Madison

Freeman & Herron, Fig. 4.9: Using maximum parsimony, looks like the whales cluster with the hippos (and cows)

Page 69: Phylogenetics - University of Wisconsin–Madison

Parsimony

• Simplest and fastest method of phylogenetic reconstruction

• Can give misleading results if rates of evolution (the rates at which mutations occur) differ among lineages

• Tends to become less accurate as genetic distances get greater

• Can be misled by reversals and homoplasy: with only 4 nucleotides, after a while the same mutations occur repeatedly at a given site (called “saturation,” or “multiple hits (mutations) per site”)
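
Saturation can be illustrated with a simple substitution model. The sketch below uses the Jukes-Cantor (JC69) model, which is my choice for illustration and is not named in the slides: all substitutions are assumed equally likely, so the observed proportion of differing sites levels off at 0.75 however many substitutions have actually occurred.

```python
import math


def observed_difference(d):
    """Expected proportion of differing sites after d substitutions per site (JC69)."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


def corrected_distance(p):
    """Invert the JC69 relationship: substitutions per site from observed proportion p."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


for d in (0.1, 0.5, 1.0, 3.0):
    print(f"actual {d:.1f} substitutions/site -> observed {observed_difference(d):.3f}")
```

The gap between actual and observed change grows with distance, which is one reason counting raw differences becomes less reliable for distantly related taxa.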

Page 70: Phylogenetics - University of Wisconsin–Madison

Distance Matrix

Continuous or Discrete Characters

Page 71: Phylogenetics - University of Wisconsin–Madison

Distance Matrix

• Calculate pairwise distances between taxa
• Choose the tree that minimizes overall distances between taxa

Proportion sequence distance at 2 genes (hypothetical data)

          Mouse   Cat    Dog    Dolphin
Cat       0.05
Dog       0.03    0.02
Dolphin   0.08    0.15   0.03
Seal      0.09    0.23   0.01   0.02
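
One way to turn a distance matrix like this into a tree is sketched below. It uses UPGMA (repeatedly join the two closest clusters and average their distances to the rest), which is one common distance method and an assumption on my part; the slide does not say which algorithm it has in mind. The input is the hypothetical matrix from the slide.

```python
from itertools import combinations


def upgma(distances):
    """Cluster taxa by UPGMA and return the clusters in the order they are joined.

    distances maps (taxon1, taxon2) tuples to pairwise distances.
    Clusters are named with Newick-like strings such as "(dog,seal)".
    """
    d = {frozenset(pair): value for pair, value in distances.items()}
    size = {taxon: 1 for pair in distances for taxon in pair}
    merges = []
    while len(size) > 1:
        # Join the two closest clusters
        a, b = min(combinations(sorted(size), 2), key=lambda p: d[frozenset(p)])
        merged = f"({a},{b})"
        # Distance to every other cluster is the size-weighted average
        for other in size:
            if other not in (a, b):
                d[frozenset((merged, other))] = (
                    size[a] * d[frozenset((a, other))]
                    + size[b] * d[frozenset((b, other))]
                ) / (size[a] + size[b])
        size[merged] = size.pop(a) + size.pop(b)
        merges.append(merged)
    return merges


# Hypothetical data from the slide
distances = {
    ("mouse", "cat"): 0.05,
    ("mouse", "dog"): 0.03, ("cat", "dog"): 0.02,
    ("mouse", "dolphin"): 0.08, ("cat", "dolphin"): 0.15, ("dog", "dolphin"): 0.03,
    ("mouse", "seal"): 0.09, ("cat", "seal"): 0.23, ("dog", "seal"): 0.01,
    ("dolphin", "seal"): 0.02,
}
merges = upgma(distances)
print(merges[-1])
```

UPGMA assumes a constant rate across lineages, so it inherits the molecular clock caveats discussed earlier in the lecture.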

Page 72: Phylogenetics - University of Wisconsin–Madison

Freeman & Herron, Fig. 4.10: Using genetic distances, looks like the whales again cluster with the hippos (and cows)

Page 73: Phylogenetics - University of Wisconsin–Madison

Distance Matrix

• Generally more accurate than parsimony

• Like parsimony, it tends to be computationally fast

Page 74: Phylogenetics - University of Wisconsin–Madison

Maximum Likelihood (R.A. Fisher)

• Probability of the data given the tree
• This is a “Frequentist” method: one true answer (one true tree)
• Draw from the data (probability distribution of DNA sequence data) to find the true tree
• Choose the tree (x, y axes) that maximizes the probability of the observed data (z axis)

(Figure: likelihood surface; x, y: tree space; z: probability of the data)

Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution 17(6):368-76.

Page 75: Phylogenetics - University of Wisconsin–Madison

Maximum Likelihood (R.A. Fisher)

• Probability of the data given the tree
• The aim of maximum likelihood estimation is to find the parameter value(s) that make the observed data most likely
• For example: finding a mean. If you want a single number that describes data such as human height, the sample mean is the maximum likelihood estimate of the mean (for normally distributed data)

$$P(\text{data} \mid \text{tree}) = L(\text{tree} \mid \text{data})$$

Tree = hypothesis

(Figure: likelihood surface; x, y: tree space; z: probability of the data)

Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution 17(6):368-76.
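
The “finding a mean” example can be made concrete. This is a minimal sketch with hypothetical heights: it assumes the data are normally distributed with a fixed standard deviation, scores each candidate mean by its log-likelihood, and keeps the best one. In phylogenetics the candidates would be trees (with branch lengths) rather than numbers, but the principle is the same.

```python
import statistics


def log_likelihood(mu, data, sigma=10.0):
    """Normal log-likelihood of the data for mean mu, up to a constant.

    The dropped constant does not depend on mu, so it does not affect
    which candidate scores highest.
    """
    return -sum((x - mu) ** 2 for x in data) / (2 * sigma ** 2)


# Hypothetical human heights in centimeters
heights_cm = [160, 165, 170, 175, 180]

# Try every whole-number candidate and keep the most likely one
candidates = range(150, 191)
best_mu = max(candidates, key=lambda mu: log_likelihood(mu, heights_cm))

print(best_mu, statistics.mean(heights_cm))
```

The grid search lands on the sample mean, which is what the analytical maximum likelihood solution predicts for this model.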

Page 76: Phylogenetics - University of Wisconsin–Madison

Maximum Likelihood (R.A. Fisher)

• Often yields a more accurate tree than parsimony or distance

• Relies on an accurate model of molecular evolution, i.e. an accurate assumption of which mutations are more probable (does A->G occur more often than A->T or A->C?)

• Computationally intensive

Page 77: Phylogenetics - University of Wisconsin–Madison

Bayesian Inference
Reverend Thomas Bayes (c. 1701-1761)

• Probability of a tree given the data
• Uses prior information on the tree
• Does not assume that there is one correct tree
• Will modify estimate based on additional information

• Uses Bayes’ Theorem

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

Page 78: Phylogenetics - University of Wisconsin–Madison

Bayesian Inference
Reverend Thomas Bayes (c. 1701-1761)

• Uses Bayes’ Theorem

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}, \quad\text{that is,}\quad P(\text{tree} \mid \text{data}) = \frac{P(\text{data} \mid \text{tree})\,P(\text{tree})}{P(\text{data})}$$

• $P(A)$ = prior probability: the probability of a tree
• $P(A \mid B)$ = posterior probability: the probability of the tree given the data
• $P(B \mid A)$ = the probability of observing B (the data) given A (the tree), also known as the likelihood. It indicates the compatibility of the evidence with the given hypothesis.
• $P(B)$ = probability of the data
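
Bayes' Theorem can be applied to a toy set of candidate trees. The numbers below are hypothetical: three trees with equal prior probability, and made-up likelihoods for the data under each tree. Exact fractions are used so the arithmetic is easy to check by hand.

```python
from fractions import Fraction


def posterior(prior, likelihood):
    """Apply Bayes' Theorem to a discrete set of hypotheses (here, trees)."""
    # P(data): sum of P(data | tree) * P(tree) over all candidate trees
    evidence = sum(prior[tree] * likelihood[tree] for tree in prior)
    return {tree: prior[tree] * likelihood[tree] / evidence for tree in prior}


# Hypothetical values, for illustration only
prior = {"tree1": Fraction(1, 3), "tree2": Fraction(1, 3), "tree3": Fraction(1, 3)}
likelihood = {"tree1": Fraction(1, 10), "tree2": Fraction(3, 10), "tree3": Fraction(1, 10)}

post = posterior(prior, likelihood)
for tree, probability in post.items():
    print(tree, probability)
```

Real Bayesian phylogenetics cannot enumerate every tree this way; the number of possible trees is too large, so the posterior is approximated by sampling.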

Page 79: Phylogenetics - University of Wisconsin–Madison

Bayesian Inference
Reverend Thomas Bayes (c. 1701-1761)

• Probability of a tree given the data:

• Will modify estimate based on additional information: so as you get more data, you update your hypothesis for the tree

• Uses prior information on the tree: this is where you start

• The sequential use of the Bayes' formula (recursive): when more data become available after calculating a posterior distribution, the posterior becomes the next prior

• Does not assume that there is one correct tree
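
The recursive use of Bayes' formula can be checked with a small example. This sketch uses hypothetical numbers and assumes the two batches of data are independent given the tree. Updating on the first batch and then using that posterior as the prior for the second batch gives the same answer as updating once on both batches together.

```python
from fractions import Fraction


def update(prior, likelihood):
    """One Bayesian update over a discrete set of candidate trees."""
    evidence = sum(prior[tree] * likelihood[tree] for tree in prior)
    return {tree: prior[tree] * likelihood[tree] / evidence for tree in prior}


# Hypothetical prior and likelihoods for two batches of data
prior = {"tree1": Fraction(1, 3), "tree2": Fraction(1, 3), "tree3": Fraction(1, 3)}
batch1 = {"tree1": Fraction(1, 10), "tree2": Fraction(3, 10), "tree3": Fraction(1, 10)}
batch2 = {"tree1": Fraction(2, 10), "tree2": Fraction(4, 10), "tree3": Fraction(1, 10)}

# Sequential: the posterior after batch 1 becomes the prior for batch 2
sequential = update(update(prior, batch1), batch2)

# All at once: multiply the likelihoods of the two batches
combined = {tree: batch1[tree] * batch2[tree] for tree in prior}
at_once = update(prior, combined)

print(sequential == at_once, sequential["tree2"])
```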

Page 80: Phylogenetics - University of Wisconsin–Madison

Bayesian Inference

• Like Likelihood, often yields more accurate tree than parsimony or distance

• Computationally more intensive than parsimony or distance matrix, but less intensive than likelihood

• Needs a prior probability for the tree and a model of evolution

Page 81: Phylogenetics - University of Wisconsin–Madison

Potential problems of Phylogenetic Reconstruction

• Sufficient amount of data:
– With enough data most statistical methods usually yield the same tree (but not always; sometimes there is no single resolved tree)
– Insufficient data would yield a tree that lacks resolution (lacks statistical power)

• Gene trees vs species trees
– Evolutionary histories of individual genes are not necessarily the same
– Should try to get data from many genes, or the whole genome

Page 82: Phylogenetics - University of Wisconsin–Madison

Challenges of Phylogenetic Reconstructions

• Different parts of the genome might have different evolutionary histories (different gene genealogies, horizontal gene transfers, allopolyploidy, etc)

• So, there might not be one true tree for a group of taxa, and relationships might be difficult to resolve because they are inherently complex

Page 83: Phylogenetics - University of Wisconsin–Madison

• Current trend is to use whole genome data to reconstruct phylogenies

• Gain a comprehensive picture of the evolutionary relationships among taxa for the whole genome

Page 84: Phylogenetics - University of Wisconsin–Madison

Phylogenetic Reconstructions

• Typically, evolutionary biologists will use a variety of methods to reconstruct a phylogeny.

• Maximum likelihood and Bayesian methods are considered more robust.

• A tree is only as good as the data. Having many homoplastic characters (due to convergent evolution, reversals, etc.) will make the reconstruction less robust.

• It is standard to use bootstrapping to assess the validity of the tree.

• Understanding statistics is fundamental to understanding evolution.

• Much of statistics (such as ANOVA, analysis of variance) was in fact developed in order to model evolutionary processes.
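
The resampling step behind bootstrapping can be sketched as follows. In the usual phylogenetic bootstrap, the columns (sites) of the alignment are resampled with replacement, a tree is rebuilt from each replicate, and the support for a clade is the fraction of replicates in which it appears. Only the resampling is shown here; the alignment and the function name are hypothetical.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement.

    alignment maps taxon name -> sequence; all sequences have equal length.
    The same column indices are used for every taxon, so sites stay aligned.
    """
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        taxon: "".join(sequence[i] for i in columns)
        for taxon, sequence in alignment.items()
    }


# Hypothetical five-site alignment
alignment = {"A": "GCGAA", "B": "GCAAA", "C": "GCACT", "O": "TGGAA"}

replicate = bootstrap_alignment(alignment, random.Random(0))
for taxon, sequence in replicate.items():
    print(taxon, sequence)
```

Each replicate would then be passed to the tree-building method of choice, typically for hundreds of replicates.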

Page 85: Phylogenetics - University of Wisconsin–Madison

1. Sometimes the Molecular Clock (based on genetic data) conflicts with the Geological Record. Why would this happen?

(A) Sometimes there are gaps in the geological record, because fossils do not form everywhere, and mutation rate might vary between different species

(B) Radiometric dating relies on chance events in the preservation of isotopes, making the timing of events in the geological time scale less accurate than the molecular clock

(C) Mutation rates slow down as you go back in time, making estimation of timing of events less accurate as you go back in time

(D) The molecular clock is calculated from radioisotopes, while the geological record is obtained from fossil data. The two can conflict when fossils end up displaced from their original sedimentary layer

Page 86: Phylogenetics - University of Wisconsin–Madison

2. You are a medical researcher working on HIV. A novel strain has appeared in Madison, Wisconsin. To determine which drugs would be most effective in treating this new strain (because different strains are resistant to different drugs), you need to determine its recent evolutionary history. You decide to reconstruct the evolutionary history of HIV by using a phylogenetic approach. Thus, you collect samples from patients in various geographic locations and sequence a fragment of RNA. Using parsimony, which is the correct phylogeny for HIV-1 based on the data below?

HIV-1, Uganda, Africa                 ACAUG
HIV-1, San Francisco, USA             UGAUG
HIV-1, Madison, USA                   UAAGG
HIV-1, New York, USA                  UAAAG
HIV-1, Paris                          ACAUC
HIV-2, Africa (ancestral outgroup)    ACCUG

Page 87: Phylogenetics - University of Wisconsin–Madison

3. Which of the following is most TRUE regarding phylogenetic reconstructions?

(A) Phylogenetic reconstruction based on any gene would yield the same tree

(B) Parsimony is the most accurate method for reconstructing phylogenies

(C) Phylogenetic reconstructions based on different genes could yield different phylogenetic trees

(D) Maximum likelihood relies on maximizing distances among taxa

(E) There is always one true tree, and having enough genetic data will inevitably result in one tree

Page 88: Phylogenetics - University of Wisconsin–Madison

Answers

• 1A
• 2C
• 3C