David Penny - Loss of information at deeper divergences, and what we can do about it

DESCRIPTION

Mossel and Steel (2004) showed that simple Markov models lose information at the deepest divergences (say, greater than 400 million years ago), and that the fall-off is exponential at deeper times. That does not mean that no information is left; for example, the three-dimensional structure of proteins should still retain information about deeper divergences, although we may not yet know how to use it. Biologists still want to estimate the deeper divergences, so finding additional sources of information is a significant question. Several suggestions are offered that require a more formal analysis:

1. Where there is a real Gamma distribution of rates, information is probably retained for longer.
2. If there is really a bimodal distribution of rates, then identifying and eliminating the faster-evolving sites should help.
3. The inference of ancestral sequences at deeper divergences appears quite robust, and there is some evidence that this may help recover deeper divergences.
4. It is increasingly possible to infer three-dimensional structures, and these should retain information longer.
5. There may be differences between the loop regions of Akaryote and Eukaryote proteins, and taking only the regions crossing the central 3D region might help.
6. Weighting, not of characters, but of the partitions they are consistent with, might help.
7. Gene order information might be helpful.

Several examples of such approaches will be presented, and a challenge issued to theoreticians to solve some of these fundamental issues. There is still a lot to learn about protein evolution.

First presented at the 2014 Winter School in Mathematical and Computational Biology http://bioinformatics.org.au/ws14/program/

Page 1: David Penny - Loss of information at deeper divergences, and what we can do about it

Loss of information at deeper times (and the origin of proteins?)

David Penny Brisbane July 2014

The mathematicos caused the problem!!! Now they should solve it!

Okay, maybe we could help them, Here are some ideas

And the origins of protein synthesis

Page 2: David Penny - Loss of information at deeper divergences, and what we can do about it

the comfort zone

[Figure: grid of method labels (ML Int, ML Rel, MLav, MLan, MLep, ML, MP) set against three scales: popn, classic phylogeny, deep phylogeny]

Page 3: David Penny - Loss of information at deeper divergences, and what we can do about it

can we go further back in time?

Markov models - Loss of information

Mossel and Steel 2004-5

Page 4: David Penny - Loss of information at deeper divergences, and what we can do about it

damned eukaryotes!

Page 5: David Penny - Loss of information at deeper divergences, and what we can do about it

One Eukaryotic Tree

Fred or LECA
fungamals: Animals, Fungi, Microsporidia
Plantae: Plants, Green Algae, Red Algae
Amoebozoa
Excavates: Diplomonads, Parabasalids, Euglenozoa, Heterolobosea
Rhizaria: Radiolaria, Cercozoa
Chromalveolates: Alveolates, Stramenopiles

What is common to all groups of modern (extant) eukaryotes? We have pretty good data. We can get solid evidence.

crown group

Page 6: David Penny - Loss of information at deeper divergences, and what we can do about it

Calculated results, Δ ≤ ¼ + n e^{-qt}

[Plot: y axis from -0.2 to 1; x axis from 1 to 10000 (log scale); four curves labelled 0.01, 0.005, 0.002, 0.001]
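A minimal sketch of how the bound on this slide behaves. It assumes that t is time, q is a rate, n is a constant prefactor (set to 1 here), and that the four curve labels are values of q; none of these roles is stated on the slide. The function name `information_bound` is made up for illustration.

```python
import math

def information_bound(t, q, n=1.0):
    """Evaluate 1/4 + n * exp(-q * t); the roles of n, q and t are assumed."""
    return 0.25 + n * math.exp(-q * t)

rates = [0.01, 0.005, 0.002, 0.001]   # curve labels on the plot, read here as q
times = [1, 10, 100, 1000, 10000]     # x-axis ticks on the plot

for q in rates:
    row = ", ".join(f"{information_bound(t, q):.3f}" for t in times)
    print(f"q={q}: {row}")
```

Whatever the prefactor, the exponential term vanishes at large t, so the bound settles at ¼ and settles sooner for larger q.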

Page 7: David Penny - Loss of information at deeper divergences, and what we can do about it

idea 1. simulations (covarion model)

[Plot: percentage of trees correct (y axis, 0% to 120%) against an x axis with ticks 0.1, 1, 10; curves for d=0.001, d=0.100, d=0.500, d=1.000, d=2.000, d=5.000 and infinite]

Page 8: David Penny - Loss of information at deeper divergences, and what we can do about it

simulation results with standard model

[Plot: number of internal edges correct, out of 6 (y axis, 0 to 6), against millions of years (log scale; ticks 0, 0.5, 1, 5, 8, 13, 20, 32, 50, 80, 125, 200, 320, 500, 790, 1250, 2000); neighbor joining, 9 taxa, 1000 columns, i.i.d.]
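One way to see why a standard i.i.d. model runs out of signal is the Jukes-Cantor model. This is a swapped-in textbook model, not necessarily the one used for the simulations on the slide. The sketch below computes the expected proportion of differing sites for a true distance d, and the approximate standard error of the corrected distance estimated from 1000 columns.

```python
import math

def expected_p_distance(d):
    """Expected proportion of differing sites after d substitutions per site (Jukes-Cantor)."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

def jc_distance(p):
    """Jukes-Cantor corrected distance from a proportion of differences p < 0.75."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

n_sites = 1000  # the 1000 columns quoted on the slide
for d in (0.1, 0.5, 1.0, 2.0, 5.0):
    p = expected_p_distance(d)
    # Binomial standard error of p, propagated to the distance (delta method).
    se_p = math.sqrt(p * (1.0 - p) / n_sites)
    se_d = se_p / (1.0 - 4.0 * p / 3.0)
    print(f"d={d:>4}: expected p={p:.3f}, approx. s.e. of estimated d={se_d:.3f}")
```

As p approaches 0.75 the denominator goes to zero, so the uncertainty of the distance estimate grows without limit even though the sequences are still different from random in principle.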

Page 9: David Penny - Loss of information at deeper divergences, and what we can do about it

idea 2, delete fast sites

If there were a mixture of rates, with (a) faster-evolving sites that (b) we could identify and (c) remove, would that help us go further back in time?
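A minimal sketch of the identify-and-remove step. It uses the number of distinct residues in a column as a crude stand-in for evolutionary rate; that proxy, the function names and the toy alignment are all assumptions for illustration, not the method used in the talk.

```python
def site_variability(column):
    """Crude rate proxy (an assumption): number of distinct residues in the column."""
    return len(set(column))

def drop_fastest_sites(alignment, n_drop):
    """Remove the n_drop most variable columns; return the trimmed alignment and kept indices."""
    n_cols = len(alignment[0])
    columns = ["".join(seq[i] for seq in alignment) for i in range(n_cols)]
    # Rank column indices from most to least variable; ties keep their original order.
    ranked = sorted(range(n_cols), key=lambda i: -site_variability(columns[i]))
    dropped = set(ranked[:n_drop])
    kept = [i for i in range(n_cols) if i not in dropped]
    return ["".join(seq[i] for i in kept) for seq in alignment], kept

toy = ["ACGTA", "ACGAA", "ACCCA", "ACTGA"]  # hypothetical alignment
trimmed, kept = drop_fastest_sites(toy, 2)
print(trimmed, kept)
```

In practice the rate of each site would be estimated under a model rather than counted, but the bookkeeping of ranking and dropping columns is the same.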

Page 10: David Penny - Loss of information at deeper divergences, and what we can do about it

deleting faster sites

Presentation Notes:
Pearson correlation results. The blue line indicates the Pearson correlation coefficient (r) of the ML distance calculated from “A” (more conserved) and “B” (less conserved) partitions. The red line indicates the r value of uncorrected p-distances and ML distances for B partitions. The r values begin to increase significantly at 31,136 sites remaining and this is taken to indicate that the assumed model of nucleotide evolution is beginning to fit the data well.
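For readers who want to reproduce this kind of check, the Pearson coefficient between two sets of pairwise distances can be computed as below. The distance values are hypothetical placeholders, not data from the talk.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical distances for the same taxon pairs from partitions "A" and "B".
dist_a = [0.10, 0.20, 0.30, 0.40]
dist_b = [0.15, 0.32, 0.41, 0.62]
print(round(pearson_r(dist_a, dist_b), 3))
```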
Page 11: David Penny - Loss of information at deeper divergences, and what we can do about it

Ancestral Sequence Reconstruction

Giardia animals plants

(idea 3)

Page 12: David Penny - Loss of information at deeper divergences, and what we can do about it

3,4 testing

Ancestral Sequence Reconstruction

vaults 3-D info

Page 13: David Penny - Loss of information at deeper divergences, and what we can do about it

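subgroups X and Y

[Figure: tree with taxa a, b, c, d, e forming subgroup X and taxa k, l, m, n, o forming subgroup Y; nodes ax and ay are marked on the X and Y sides]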

Page 14: David Penny - Loss of information at deeper divergences, and what we can do about it

chloroplast vs nuclear

Data Type    Group X       Group Y        Divergence Time  X²       d.f.  p(X²)     X²(control)  p(X²)(control)
Chloroplast  Eudicot       Monocot        125mya           289.058  102   1.94E-19  93.690       7.09E-01
Chloroplast  Angiosperm    Gymnosperm     305mya           363.527  104   1.23E-29  85.647       9.05E-01
Chloroplast  Seed plant    Fern           390mya           457.118  102   1.69E-44  100.451      5.25E-01
Chloroplast  Streptophyta  Chlorophyta    700mya           300.162  94    2.23E-23  90.982       5.69E-01
Chloroplast  Red Algae     Green Algae    ~1000mya         341.014  82    2.60E-34  70.928       8.03E-01
Chloroplast  Algae         Cyanobacteria  ~1500mya         231.079  56    3.90E-23  62.718       2.50E-01
Nuclear      Algae         Cyanobacteria  ~1500mya         70.479   34    2.30E-04  39.342       2.43E-01

Page 15: David Penny - Loss of information at deeper divergences, and what we can do about it

4. gene length vs similarity

Page 16: David Penny - Loss of information at deeper divergences, and what we can do about it

5. should we emphasize conserved residues?

(or surface ones)

Page 17: David Penny - Loss of information at deeper divergences, and what we can do about it

6, Weighting

Would weighting by incompatibilities help?

     i j 3 4 5 6 7 8 9 0 1 2
f1   a . . . . a . . . . . .
f2   . . . . . a . . . . a .
f3   . . . . . a . . . . . a
f4   . . . . . a . . . . . .
g1   . a a a a . a a . . . .
g2   a a a a a . . a . . a .
h1   . . a a a . . . a a . .
h2   . . a . a . . . a a . .
h3   a . . a a . . . a a . .

Characters i and j together show all four state combinations:

i j    i j
. .    . a
a .    a a

upper bound = 17
lower bound = 12
? 13
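A sketch of one way to count pairwise incompatibilities in the character matrix on this slide. It applies the usual test for two-state characters: a pair conflicts when all four state combinations occur. The weighting in the last lines (weight = 1 / (1 + number of conflicts)) is a made-up illustration, not a scheme proposed in the talk.

```python
from itertools import combinations

# Character matrix transcribed from the slide: rows are taxa, columns are
# the characters labelled i, j, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2.
LABELS = ["i", "j", "3", "4", "5", "6", "7", "8", "9", "0", "1", "2"]
MATRIX = {
    "f1": "a....a......",
    "f2": ".....a....a.",
    "f3": ".....a.....a",
    "f4": ".....a......",
    "g1": ".aaaa.aa....",
    "g2": "aaaaa..a..a.",
    "h1": "..aaa...aa..",
    "h2": "..a.a...aa..",
    "h3": "a..aa...aa..",
}

def column(k):
    """States of character k across all taxa, in row order."""
    return [row[k] for row in MATRIX.values()]

def incompatible(col_a, col_b):
    """Two binary characters conflict when all four state pairs occur."""
    return len(set(zip(col_a, col_b))) == 4

conflicts = [
    (LABELS[a], LABELS[b])
    for a, b in combinations(range(len(LABELS)), 2)
    if incompatible(column(a), column(b))
]

# Hypothetical weighting: characters in fewer conflicts get a higher weight.
conflict_count = {lab: sum(lab in pair for pair in conflicts) for lab in LABELS}
weights = {lab: 1.0 / (1 + n) for lab, n in conflict_count.items()}
print(conflicts)
print(weights)
```

Characters i and j come out as a conflicting pair, which matches the four combinations drawn on the slide.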

Page 18: David Penny - Loss of information at deeper divergences, and what we can do about it

7, information from sequence order not used

Alignment                    Reordered Alignment
(original sequence order)    (shuffled/reordered)
AIIFLNSALGPSPELFPIILATKVL    ASAGPSPPATPLLIIIILLFFNEKV
AIMFLNSALGPPTELFPVILATKVL    ASAGPPTPATPLLIMVILLFFNEKV
SIMFLNHTLNPTPELFPIILATETL    SHTNPTPPATPLLIMIILLFFNEET
TILFLNSSLGLQPEVTPTVLATKTL    TSSGLQPPATPLLILTVLVTFNEKT
TLLFLNSMLKPPSELFPIILATKTL    TSMKPPSPATPLLLLIILLFFNEKT
ALLFLNSTLNPPTELFPLILATKTL    ASTNPPTPATPLLLLLILLFFNEKT
AILFLNSFLNPPKEFFPIILATKIL    ASFNPPKPATPLLILIILFFFNEKI

c! ways to reorder alignment
shuffle by columns & by taxa

8. could we use ‘words’ of 2, 3, 4, 5, … letters

9. Alphabet reduction?
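Ideas 7 to 9 can be tried together on the original-order alignment from this slide. The sketch shuffles columns and checks that the set of site patterns is unchanged (so column-by-column methods cannot see order), counts words of k letters (which do depend on order), and recodes residues into a smaller alphabet. The six-class grouping is a commonly used Dayhoff-style choice and is an assumption here; the slide names no particular reduction. All function names are illustrative.

```python
import random
from collections import Counter

ALIGNMENT = [  # original-order sequences from the slide
    "AIIFLNSALGPSPELFPIILATKVL",
    "AIMFLNSALGPPTELFPVILATKVL",
    "SIMFLNHTLNPTPELFPIILATETL",
    "TILFLNSSLGLQPEVTPTVLATKTL",
    "TLLFLNSMLKPPSELFPIILATKTL",
    "ALLFLNSTLNPPTELFPLILATKTL",
    "AILFLNSFLNPPKEFFPIILATKIL",
]

# Dayhoff-style six-class reduction (assumed; one common choice among several).
CLASSES = ["AGPST", "DENQ", "HKR", "ILMV", "FWY", "C"]
REDUCE = {aa: str(k) for k, group in enumerate(CLASSES) for aa in group}

def site_patterns(alignment):
    """Count each alignment column (site pattern), ignoring column order."""
    return Counter("".join(col) for col in zip(*alignment))

def shuffle_columns(alignment, seed=0):
    """Apply one random column permutation to every sequence."""
    order = list(range(len(alignment[0])))
    random.Random(seed).shuffle(order)
    return ["".join(seq[i] for i in order) for seq in alignment]

def words(seq, k):
    """Count overlapping words of k letters in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def reduce_alphabet(seq):
    """Recode each residue as the index of its class."""
    return "".join(REDUCE[aa] for aa in seq)

shuffled = shuffle_columns(ALIGNMENT)
print(site_patterns(ALIGNMENT) == site_patterns(shuffled))  # column content is unchanged
print(words(ALIGNMENT[0], 3) == words(shuffled[0], 3))      # word counts usually differ
print(reduce_alphabet(ALIGNMENT[0]))
```

The contrast between the first two comparisons is the point of idea 7: any signal carried by the order of sites is invisible to site-pattern counts but not to words.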

Page 19: David Penny - Loss of information at deeper divergences, and what we can do about it

damned eukaryotes!

Page 20: David Penny - Loss of information at deeper divergences, and what we can do about it

limits of evolutionary mechanisms - no miracles

continuity
all intermediates ‘functional’
can’t evolve “for” what doesn’t exist

Protein synthesis? mRNA, tRNAs, rRNAs, triplet code - why 3?

the origin of protein synthesis?

Page 21: David Penny - Loss of information at deeper divergences, and what we can do about it

Eigen limit 1

master sequence

Page 22: David Penny - Loss of information at deeper divergences, and what we can do about it

Eigen limit 2

mutation

selection 0

Page 23: David Penny - Loss of information at deeper divergences, and what we can do about it

Eigen limit 3

mutation

selection 0

Page 24: David Penny - Loss of information at deeper divergences, and what we can do about it

Eigen limit 4

mutation

selection 0

~1 error per replication,

error catastrophe,

mutational meltdown

Page 25: David Penny - Loss of information at deeper divergences, and what we can do about it

error rates

the error rate limits the length that can be copied, Manfred Eigen (1971)
2 errors in 37 copies of hammerhead ribozyme
20-fold improvement
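The link between error rate and copyable length can be sketched with Eigen's approximate threshold, L_max ≈ ln(σ) / u, where u is the per-base error rate and σ is the selective superiority of the master sequence. The error rates and the value σ = e used below are assumptions chosen for illustration, not numbers from the talk.

```python
import math

def max_length(error_rate, superiority):
    """Approximate error threshold: L_max ~ ln(sigma) / u.

    error_rate (u) is the per-base copying error; superiority (sigma) is the
    advantage of the master sequence. Both are assumed inputs.
    """
    return math.log(superiority) / error_rate

def error_free_fraction(error_rate, length):
    """Probability that a copy of the given length contains no errors."""
    return (1.0 - error_rate) ** length

for u in (1e-2, 1e-3, 1e-4):
    L = max_length(u, superiority=math.e)  # sigma = e gives L_max = 1/u
    print(f"u={u:g}: L_max ~ {L:.0f}, error-free copies at L_max: "
          f"{error_free_fraction(u, L):.2f}")
```

With σ = e the threshold sits at about one error per replication, and on this approximation a 20-fold lower error rate allows a 20-fold longer sequence.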

Page 26: David Penny - Loss of information at deeper divergences, and what we can do about it

ribavirin and polio viruses

Crotty et al. PNAS 98, 6895 2001

Page 27: David Penny - Loss of information at deeper divergences, and what we can do about it

synthesis of RNA

from cyclic GTPs

Page 28: David Penny - Loss of information at deeper divergences, and what we can do about it

hydrolysis ↔ polymerisation

+H3N-CH(R1)-CO-NH-CH(R2)-COO- + H2O ↔ +H3N-CH(R1)-COO- + +H3N-CH(R2)-COO-

two monomers ↔ one dimer (+H2O).

heat amino acids dry, or drying cycles (fluctuating clay environment) or frozen in ice?

Page 29: David Penny - Loss of information at deeper divergences, and what we can do about it

RNA evolution - in vitro

Page 30: David Penny - Loss of information at deeper divergences, and what we can do about it

enzyme efficiency RNA ⇒ RNP ⇒ protein

CATALYST                      Kcat (min-1)   Kcat/Km (M-1 min-1)
RNA
  Tetrahymena L-21(SacI)      0.1            9.0 x 10^7
  polynucleotide kinase       0.3            6.0 x 10^3
  RNase P RNA                 1              2.0 x 10^6
RNP
  RNase P RNA + protein       2              4.0 x 10^6
protein
  RNase T1                    5,700          1.1 x 10^8
  T4 polynucleotide kinase    25,000         6.0 x 10^8
  triose-P isomerase          258,000        1.4 x 10^10
  carbonic anhydrase          600,000,000    7.2 x 10^9

RNA copied by protein, error rate is high

Page 31: David Penny - Loss of information at deeper divergences, and what we can do about it

origin of protein synthesis??

[Figure: panels A, B and C, with labels NNN, N and N-TP]

How long would a G=C pairing last?

Page 32: David Penny - Loss of information at deeper divergences, and what we can do about it

origin of protein synthesis?

[Figure: panel D. Labels: template strand; new RNA strand; three activated tRNAs, each with aa+; one inactivated tRNA; A C C; NNN; steps 1, 2, 3; "ribozyme-catalysed decoding, cleavage and ligation functions"]

Page 33: David Penny - Loss of information at deeper divergences, and what we can do about it

predictions

Theoretical/computational
time of (say) GC binding
is it increased by heavy RNAs
di and trinucleotides

Experimental
length of RNA copied (dinucleotides)
amino acid codes???