David Penny - Loss of information at deeper divergences, and what we can do about it

Loss of information at deeper times (and the origin of proteins?)

David Penny Brisbane July 2014

The mathematicos caused the problem!!! Now they should solve it!

Okay, maybe we could help them. Here are some ideas

And the origins of protein synthesis

the comfort zone

[Figure: methods in use across the comfort zone. Columns: popn, classic phylogeny, deep phylogeny. Cell labels: ML, Int, Rel, Mlav, MLan, MLep, MP]

can we go further back in time?

Markov models - Loss of information

Mossel and Steel 2004-5

damned eukaryotes!

[Figure: One Eukaryotic Tree, rooted at "Fred or LECA"]

fungamals: Animals, Fungi, Microsporidia

Plantae: Plants, Green Algae, Red Algae

Amoebozoa

Excavates: Diplomonads, Parabasalids, Euglenozoa, Heterolobosea

Rhizaria: Radiolaria, Cercozoa

Chromalveolates: Alveolates, Stramenopiles

What is common to all groups of modern (extant) eukaryotes? We have pretty good data. We can get solid evidence.

crown group

Calculated results, Δ ≤ 1/4 + n e^(-qt)

[Figure: calculated results. x-axis ticks 1 to 10000 (log scale); y-axis ticks -0.2 to 1; series 0.01, 0.005, 0.002, 0.001]

[Figure: percentage of trees correct (0% to 120%); x-axis ticks 0.1, 1, 10; series d=0.001, d=0.100, d=0.500, d=1.000, d=2.000, d=5.000, infinite]

idea 1. simulations (covarion model)

number of internal edges correct, out of 6 (neighbor joining, 9 taxa, 1000 columns, i.i.d.)

[Figure: y-axis number of internal edges correct, 0 to 6; x-axis millions of years (log scale), 0 to 2000]

simulation results with standard model
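A minimal sketch of why signal fades with depth, not a reproduction of the simulations on these slides. It assumes the Jukes-Cantor model for simplicity and evolves one sequence along a single branch, then compares the observed proportion of differing sites with the expected value 3/4 (1 - e^(-4d/3)). All function names are hypothetical.

```python
import math
import random

BASES = "ACGT"

def evolve_jc(seq, d, rng):
    """Evolve seq along a branch of d expected substitutions per site (Jukes-Cantor)."""
    # Under JC a site ends in a different base with probability 3/4 * (1 - exp(-4d/3)).
    p_diff = 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))
    out = []
    for base in seq:
        if rng.random() < p_diff:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def p_distance(a, b):
    """Proportion of sites at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def saturation_curve(depths, n_sites=1000, seed=1):
    """Observed p-distance between a random root sequence and its descendant, per depth."""
    rng = random.Random(seed)
    root = "".join(rng.choice(BASES) for _ in range(n_sites))
    return [(d, p_distance(root, evolve_jc(root, d, rng))) for d in depths]

if __name__ == "__main__":
    for d, p in saturation_curve([0.01, 0.1, 0.5, 1.0, 2.0, 5.0]):
        expected = 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))
        print(f"d={d:<5} observed p={p:.3f} expected p={expected:.3f}")
```

As d grows the p-distance flattens out near 3/4, so ever larger differences in depth produce ever smaller differences in the data.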

idea 2, delete fast sites

If there were a mixture of sites, and a) some evolve faster, b) we could identify them, and c) we could remove them, would that help us go further back in time?
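One way to make the question concrete is sketched below. It ranks alignment columns by Shannon entropy and drops the most variable fraction. Entropy as a proxy for evolutionary rate is an assumption of this sketch, not the procedure used in the talk, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (bits) of the states in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def drop_fastest_sites(alignment, fraction):
    """Return the alignment with the most variable `fraction` of columns removed.

    Entropy is only a crude stand-in for evolutionary rate (assumption of this sketch).
    """
    n_cols = len(alignment[0])
    columns = ["".join(seq[i] for seq in alignment) for i in range(n_cols)]
    n_drop = int(n_cols * fraction)
    # Indices sorted from most to least variable; ties keep original column order.
    ranked = sorted(range(n_cols), key=lambda i: -column_entropy(columns[i]))
    dropped = set(ranked[:n_drop])
    keep = [i for i in range(n_cols) if i not in dropped]
    return ["".join(seq[i] for i in keep) for seq in alignment]

if __name__ == "__main__":
    toy = ["ACGTA", "ACTTC", "ACATG", "ACCTT"]  # toy data, not from the talk
    print(drop_fastest_sites(toy, 0.4))
```

Repeating the tree search on progressively trimmed alignments shows whether deep edges become more or less stable as fast sites are removed.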

deleting faster sites

Presentation notes:

Pearson correlation results. The blue line indicates the Pearson correlation coefficient (r) of the ML distance calculated from “A” (more conserved) and “B” (less conserved) partitions. The red line indicates the r value of uncorrected p-distances and ML distances for B partitions. The r values begin to increase significantly at 31,136 sites remaining and this is taken to indicate that the assumed model of nucleotide evolution is beginning to fit the data well.

Ancestral Sequence Reconstruction

Giardia animals plants

(idea 3)

3,4 testing

Ancestral Sequence Reconstruction

vaults 3-D info

subgroups X and Y

a b c d e k l m n o

ax ay

subgroup X subgroup Y
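The slides do not say how the ancestors ax and ay are reconstructed. As a rough illustration only, the sketch below uses the bottom-up pass of Fitch parsimony (a swapped-in technique) on one alignment column. The tree shapes for subgroups X and Y and the column states are hypothetical.

```python
def fitch_root_states(tree, states):
    """Fitch bottom-up pass: return (candidate state set at the root, change count)."""
    if isinstance(tree, str):  # leaf: a taxon name
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_root_states(left, states)
    right_set, right_cost = fitch_root_states(right, states)
    common = left_set & right_set
    if common:
        # Children agree on at least one state: no extra change needed here.
        return common, left_cost + right_cost
    # Children disagree: keep the union and count one change.
    return left_set | right_set, left_cost + right_cost + 1

# Hypothetical tree shapes for subgroup X (a..e) and subgroup Y (k..o).
subgroup_x = ((("a", "b"), "c"), ("d", "e"))
subgroup_y = ((("k", "l"), "m"), ("n", "o"))

# Hypothetical states for one alignment column.
column = {"a": "L", "b": "L", "c": "I", "d": "L", "e": "L",
          "k": "V", "l": "V", "m": "V", "n": "I", "o": "V"}

ax, cost_x = fitch_root_states(subgroup_x, column)
ay, cost_y = fitch_root_states(subgroup_y, column)
print("ax:", sorted(ax), "changes:", cost_x)
print("ay:", sorted(ay), "changes:", cost_y)
```

Comparing the inferred ax and ay, rather than the tips themselves, is the point of the exercise: the ancestors may be more similar to each other than any pair of modern sequences drawn from X and Y.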

chloroplast vs nuclear

Data Type    Group X       Group Y        Divergence Time  X²       d.f.  p(X²)     X² (control)  p(X²) (control)
Chloroplast  Eudicot       Monocot        125 mya          289.058  102   1.94E-19  93.690        7.09E-01
Chloroplast  Angiosperm    Gymnosperm     305 mya          363.527  104   1.23E-29  85.647        9.05E-01
Chloroplast  Seed plant    Fern           390 mya          457.118  102   1.69E-44  100.451       5.25E-01
Chloroplast  Streptophyta  Chlorophyta    700 mya          300.162  94    2.23E-23  90.982        5.69E-01
Chloroplast  Red Algae     Green Algae    ~1000 mya        341.014  82    2.60E-34  70.928        8.03E-01
Chloroplast  Algae         Cyanobacteria  ~1500 mya        231.079  56    3.90E-23  62.718        2.50E-01
Nuclear      Algae         Cyanobacteria  ~1500 mya        70.479   34    2.30E-04  39.342        2.43E-01

4. gene length vs similarity

5. should we emphasize conserved residues?

(or surface ones)

     i j 3 4 5 6 7 8 9 0 1 2
f1   a . . . . a . . . . . .
f2   . . . . . a . . . . a .
f3   . . . . . a . . . . . a
f4   . . . . . a . . . . . .
g1   . a a a a . a a . . . .
g2   a a a a a . . a . . a .
h1   . . a a a . . . a a . .
h2   . . a . a . . . a a . .
h3   a . . a a . . . a a . .

i j i j . . . a

a . a a

upper bound = 17

lower bound = 12

? 13

Would weighting by incompatibilities help?

6. Weighting

7. information from sequence order not used

Alignment                  Reordered Alignment
(original sequence order)  (shuffled/reordered)
AIIFLNSALGPSPELFPIILATKVL  ASAGPSPPATPLLIIIILLFFNEKV
AIMFLNSALGPPTELFPVILATKVL  ASAGPPTPATPLLIMVILLFFNEKV
SIMFLNHTLNPTPELFPIILATETL  SHTNPTPPATPLLIMIILLFFNEET
TILFLNSSLGLQPEVTPTVLATKTL  TSSGLQPPATPLLILTVLVTFNEKT
TLLFLNSMLKPPSELFPIILATKTL  TSMKPPSPATPLLLLIILLFFNEKT
ALLFLNSTLNPPTELFPLILATKTL  ASTNPPTPATPLLLLLILLFFNEKT
AILFLNSFLNPPKEFFPIILATKIL  ASFNPPKPATPLLILIILFFFNEKI

c! ways to reorder alignment shuffle by columns & by taxa
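The point that site-by-site methods ignore sequence order can be checked directly. The sketch below applies one random column permutation (one of the c! possible orders for c columns) to the seven original sequences from the slide and confirms that the pairwise difference counts do not change. Function names are hypothetical, and the permutation is generated here rather than taken from the slide's reordered alignment.

```python
import random
from itertools import combinations

# The seven sequences in original order, as shown on the slide.
ORIGINAL = [
    "AIIFLNSALGPSPELFPIILATKVL",
    "AIMFLNSALGPPTELFPVILATKVL",
    "SIMFLNHTLNPTPELFPIILATETL",
    "TILFLNSSLGLQPEVTPTVLATKTL",
    "TLLFLNSMLKPPSELFPIILATKTL",
    "ALLFLNSTLNPPTELFPLILATKTL",
    "AILFLNSFLNPPKEFFPIILATKIL",
]

def shuffle_columns(alignment, seed=0):
    """Apply one random permutation of column positions to every sequence."""
    order = list(range(len(alignment[0])))
    random.Random(seed).shuffle(order)
    return ["".join(seq[i] for i in order) for seq in alignment]

def pairwise_differences(alignment):
    """Count differing sites for every pair of sequences."""
    return [sum(x != y for x, y in zip(a, b))
            for a, b in combinations(alignment, 2)]

shuffled = shuffle_columns(ORIGINAL)
# Column order does not affect site-wise distances, so this prints True.
print(pairwise_differences(ORIGINAL) == pairwise_differences(shuffled))
```

Any method whose input is only these distances, or only per-column likelihoods under i.i.d. sites, returns the same tree for both alignments, so whatever information the original order carries goes unused.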

8. could we use ‘words’ of 2, 3, 4, 5, … letters
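A small sketch of what counting "words" could look like, assuming overlapping words of fixed length k and a simple shared-word count as the similarity measure. Both choices are assumptions, the function names are hypothetical, and the two sequences are the first two from the alignment under idea 7.

```python
from collections import Counter

def word_counts(seq, k):
    """Count overlapping words (k-mers) of length k in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def shared_words(seq_a, seq_b, k):
    """Number of word occurrences the two sequences have in common (multiset overlap)."""
    return sum((word_counts(seq_a, k) & word_counts(seq_b, k)).values())

if __name__ == "__main__":
    seq_a = "AIIFLNSALGPSPELFPIILATKVL"
    seq_b = "AIMFLNSALGPPTELFPVILATKVL"
    for k in (2, 3, 4, 5):
        print(k, shared_words(seq_a, seq_b, k))
```

Unlike site-wise distances, word counts depend on the order of neighbouring residues, so they change when columns are shuffled.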

9. Alphabet reduction?
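As an illustration of alphabet reduction, the sketch below recodes amino acids into the six Dayhoff classes (AGPST, DENQ, HRK, ILMV, FWY, C). The talk does not say which reduced alphabet is intended, so this grouping is an assumption, as are the function name and the single-letter class labels.

```python
# Dayhoff six-class grouping; each class is labelled by one of its members.
DAYHOFF_CLASSES = {
    "AGPST": "A",
    "DENQ": "D",
    "HRK": "H",
    "ILMV": "I",
    "FWY": "F",
    "C": "C",
}

# Flatten to a per-residue lookup table.
RECODE = {aa: label for group, label in DAYHOFF_CLASSES.items() for aa in group}

def reduce_alphabet(seq, table=RECODE, unknown="X"):
    """Recode a protein sequence into the reduced alphabet; unknown symbols become X."""
    return "".join(table.get(aa, unknown) for aa in seq)

if __name__ == "__main__":
    print(reduce_alphabet("AIIFLNSALGPSPELFPIILATKVL"))
```

Substitutions within a class become invisible after recoding, which discards the fastest, most saturated changes at the cost of some resolution.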

damned eukaryotes!

limits of evolutionary mechanisms - no miracles

continuity: all intermediates ‘functional’; can’t evolve “for” what doesn’t exist

Protein synthesis? mRNA, tRNAs, rRNAs, triplet code - why 3?

the origin of protein synthesis?

Eigen limit 1

master sequence

Eigen limit 2

mutation

selection 0

Eigen limit 3

mutation

selection 0

Eigen limit 4

mutation

selection 0

~1 error per replication,

error catastrophe,

mutational meltdown

error rates

The error rate limits the length that can be copied (Manfred Eigen, 1971).

2 errors in 37 copies of hammerhead ribozyme, 20-fold improvement
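The link between error rate and copyable length can be made concrete with the usual approximation of Eigen's error threshold: maximum length ≈ ln(σ) / (per-base error rate), where σ is the selective superiority of the master sequence. The sketch below evaluates it. The function name and the value σ = 10 are illustrative assumptions, not figures from the talk.

```python
import math

def max_copy_length(error_rate_per_base, superiority):
    """Approximate error-threshold length: ln(superiority) / per-base error rate."""
    return math.log(superiority) / error_rate_per_base

if __name__ == "__main__":
    for eps in (1e-2, 1e-3, 1e-4):
        # superiority=10.0 is an illustrative value only
        print(f"error rate {eps:g}: max length ~{max_copy_length(eps, 10.0):.0f} bases")
```

Under this approximation the maximum length scales inversely with the error rate, so any fold improvement in copying fidelity allows a sequence that many times longer.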

ribavirin and polio viruses

Crotty et al. PNAS 98, 6895 2001

synthesis of RNA

from cyclic GTPs

hydrolysis ↔ polymerisation

[Figure: dipeptide +H3N-CH(R1)-CO-NH-CH(R2)-COO- + H2O ↔ +H3N-CH(R1)-COO- + +H3N-CH(R2)-COO-; α marks the α-carbons]

two monomers ↔ one dimer (+H2O).

heat amino acids dry, or drying cycles (fluctuating clay environment) or frozen in ice?


RNA evolution - in vitro

enzyme efficiency RNA ⇒ RNP ⇒ protein

Class    Catalyst                  kcat (min^-1)  kcat/Km (M^-1 min^-1)
RNA      Tetrahymena L-21(SacI)    0.1            9.0 x 10^7
RNA      polynucleotide kinase     0.3            6.0 x 10^3
RNA      RNase P RNA               1              2.0 x 10^6
RNP      RNase P RNA + protein     2              4.0 x 10^6
protein  RNase T1                  5,700          1.1 x 10^8
protein  T4 polynucleotide kinase  25,000         6.0 x 10^8
protein  triose-P isomerase        258,000        1.4 x 10^10
protein  carbonic anhydrase        600,000,000    7.2 x 10^9

RNA copied by protein, error rate is high

origin of protein synthesis??

[Figure: panels A, B, C; labels NNN, N, N-TP]

How long would a G=C pairing last?

origin of protein synthesis?

[Figure: panel D. A new RNA strand is built on a template strand by ribozyme-catalysed decoding, cleavage and ligation functions. Labels: activated tRNA (ACC, aa+), inactivated tRNA (ACC), NNN; steps 1, 2, 3]

predictions

Theoretical/computational: time of (say) GC binding; is it increased by heavy RNAs, di and trinucleotides?

Experimental: length of RNA copied (dinucleotides); amino acid codes???

Science
