Mossel and Steel (2004) showed that simple Markov models lose information at the deepest divergences (say, greater than 400 million years ago), and that the fall‐off is exponential at deeper times. However, that does not mean that there is no information left; for example, the three‐dimensional structure of proteins should still retain information about deeper divergences, although we may not yet know how to use that information. Biologists still want to estimate the deeper divergences, so finding additional sources of information is a significant question. Several suggestions are offered that require a more formal analysis. Firstly, where there is a real Gamma distribution of rates, we expect that information may be retained for longer. Secondly, if there is really a bimodal distribution of rates, then identifying and eliminating the faster‐evolving sites should help. Thirdly, the inference of ancestral sequences at deeper divergences appears quite robust, and there is some evidence that this may help recover deeper divergences. Fourthly, it is increasingly possible to infer three‐dimensional structures, and these should retain information longer. Fifthly, there may be differences between the loop regions of Akaryote and Eukaryote proteins, and taking only the regions crossing the central 3D region might help. Sixthly, weighting not the characters but the partitions they are consistent with might help. Seventhly, gene order information might be helpful. Several examples of such approaches will be presented, and a challenge issued to theoreticians to solve some of these fundamental issues. There is still a lot to learn about protein evolution. First presented at the 2014 Winter School in Mathematical and Computational Biology http://bioinformatics.org.au/ws14/program/
Loss of information at deeper times (and the origin of proteins?)
David Penny, Brisbane, July 2014
The mathematicos caused the problem!!! Now they should solve it!
Okay, maybe we could help them. Here are some ideas
And the origins of protein synthesis
the comfort zone
[Figure labels: ML Int; ML Rel; Mlav; MLan; MLep; ML; MP]
popn classic phylogeny deep phylogeny
can we go further back in time?
Markov models - Loss of information
Mossel and Steel 2004-5
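To make the exponential fall-off concrete, here is a minimal sketch for the simplest case: one site under a two-state symmetric Markov model. It is not the Mossel and Steel result itself, only an illustration of the same effect. The rate value and the function names are assumptions chosen for the example.

```python
import math


def p_differ(rate, t):
    """Probability that a two-state symmetric site differs from its ancestor after time t."""
    return 0.5 * (1.0 - math.exp(-2.0 * rate * t))


def binary_entropy(p):
    """Shannon entropy (bits) of a coin with bias p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def mutual_information(rate, t):
    """Bits that a descendant site retains about its ancestor (uniform ancestral state)."""
    return 1.0 - binary_entropy(p_differ(rate, t))


if __name__ == "__main__":
    rate = 0.001  # assumed: substitutions per site per million years
    for t in (1, 10, 100, 400, 1000, 2000):
        print(t, round(mutual_information(rate, t), 6))
```

For large t the retained information shrinks roughly like exp(-4 * rate * t), so each extra stretch of time removes a constant fraction of what was left.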
damned eukaryotes!
One Eukaryotic Tree
Fred or LECA
fungamals: Animals, Fungi, Microsporidia
Plantae: Plants, Green Algae, Red Algae
Amoebozoa
Excavates: Diplomonads, Parabasalids, Euglenozoa, Heterolobosea
Rhizaria: Radiolaria, Cercozoa
Chromalveolates: Alveolates, Stramenopiles
What is common to all groups of modern (extant) eukaryotes? We have pretty good data. We can get solid evidence.
crown group
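Calculated results, $\Delta \le \tfrac{1}{4} + n e^{-qt}$
[Figure: calculated results; x-axis ticks 1, 10, 100, 1000, 10000; y-axis from -0.2 to 1; curves labelled 0.01, 0.005, 0.002, 0.001]
[Figure: percentage of trees correct (0% to 120%); x-axis ticks 0.1, 1, 10; curves for d = 0.001, 0.100, 0.500, 1.000, 2.000, 5.000 and infinite]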
idea 1. simulations (covarion model)
number of internal edges correct, out of 6 (neighbor joining, 9 taxa, 1000 columns, i.i.d.)
[Figure: number of internal edges correct (0 to 6) against millions of years (log scale, 0.5 to 2000)]
simulation results with standard model
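The simulations on the slides used neighbor joining on 9 taxa. The sketch below is much smaller and is not that experiment: it evolves a single pair of sequences under Jukes-Cantor (an assumed stand-in for the "standard model") and shows how the corrected distance becomes noisy and then undefined as divergence grows. Sequence length, depths and seeds are illustrative assumptions.

```python
import math
import random

BASES = "ACGT"


def evolve(seq, d, rng):
    """Evolve seq under Jukes-Cantor for d expected substitutions per site."""
    p_change = 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))
    out = []
    for base in seq:
        if rng.random() < p_change:
            # an observed change lands on one of the other three bases
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def jc_distance(a, b):
    """Jukes-Cantor corrected distance; infinite once the sequences look saturated."""
    p = sum(x != y for x, y in zip(a, b)) / len(a)
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    rng = random.Random(2014)
    for d in (0.1, 0.5, 1.0, 2.0, 5.0):
        estimates = []
        for _ in range(20):
            root = "".join(rng.choice(BASES) for _ in range(1000))
            estimates.append(jc_distance(root, evolve(root, d, rng)))
        finite = [e for e in estimates if math.isfinite(e)]
        spread = max(finite) - min(finite) if finite else float("nan")
        print(d, "saturated:", len(estimates) - len(finite), "spread:", round(spread, 3))
```

At small d the estimates cluster tightly around the true value; at large d the spread widens and some replicates saturate, which is the pairwise version of losing internal edges.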
idea 2, delete fast sites
If there were a mixture of rates, and we could (a) identify the faster-evolving sites and (b) remove them, would that help us go further back in time?
deleting faster sites
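A minimal sketch of the idea, assuming we score each alignment column by the number of distinct residues it contains. That score is only a crude stand-in for evolutionary rate, and the function names and the fraction to drop are hypothetical.

```python
def column_variability(alignment):
    """Number of distinct residues in each column (a crude proxy for rate)."""
    return [len(set(col)) for col in zip(*alignment)]


def drop_fastest_sites(alignment, fraction):
    """Remove the given fraction of columns, most variable first.

    Returns the trimmed alignment and the indices of the columns kept.
    """
    n_cols = len(alignment[0])
    n_drop = int(n_cols * fraction)
    scores = column_variability(alignment)
    # rank columns from most to least variable; ties broken by position
    ranked = sorted(range(n_cols), key=lambda i: (-scores[i], i))
    dropped = set(ranked[:n_drop])
    keep = [i for i in range(n_cols) if i not in dropped]
    trimmed = ["".join(seq[i] for i in keep) for seq in alignment]
    return trimmed, keep


if __name__ == "__main__":
    toy = ["AAAA", "AACA", "AGTA"]
    print(drop_fastest_sites(toy, 0.25))
```

A real analysis would estimate site rates on a tree rather than count residues, but the trimming step would look much the same.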
Ancestral Sequence Reconstruction
Giardia animals plants
(idea 3)
3,4 testing
Ancestral Sequence Reconstruction
vaults 3-D info
subgroups X and Y
a b c d e k l m n o
ax ay
subgroup X subgroup Y
chloroplast vs nuclear
Data Type    Group X       Group Y        Divergence Time  X²       d.f.  p(X²)     X² (control)  p(X²) (control)
Chloroplast  Eudicot       Monocot        125 mya          289.058  102   1.94E-19  93.690        7.09E-01
Chloroplast  Angiosperm    Gymnosperm     305 mya          363.527  104   1.23E-29  85.647        9.05E-01
Chloroplast  Seed plant    Fern           390 mya          457.118  102   1.69E-44  100.451       5.25E-01
Chloroplast  Streptophyta  Chlorophyta    700 mya          300.162  94    2.23E-23  90.982        5.69E-01
Chloroplast  Red Algae     Green Algae    ~1000 mya        341.014  82    2.60E-34  70.928        8.03E-01
Chloroplast  Algae         Cyanobacteria  ~1500 mya        231.079  56    3.90E-23  62.718        2.50E-01
Nuclear      Algae         Cyanobacteria  ~1500 mya        70.479   34    2.30E-04  39.342        2.43E-01
4. gene length vs similarity
5. should we emphasize conserved residues?
(or surface ones)
    i j 3 4 5 6 7 8 9 0 1 2
f1  a . . . . a . . . . . .
f2  . . . . . a . . . . a .
f3  . . . . . a . . . . . a
f4  . . . . . a . . . . . .
g1  . a a a a . a a . . . .
g2  a a a a a . . a . . a .
h1  . . a a a . . . a a . .
h2  . . a . a . . . a a . .
h3  a . . a a . . . a a . .

i j    i j
. .    . a
a .    a a
upper bound = 17
lower bound = 12
? 13
Would weighting by incompatibilities help?
6, Weighting
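One way to read the incompatibility idea is with the four-gamete test: two binary characters are incompatible when all four state combinations occur. The sketch below applies it to the character matrix from the slide and counts, for each column, how many other columns it conflicts with. The down-weighting rule at the end is a hypothetical choice, not something the slides specify.

```python
from itertools import combinations

# Rows f1..h3 of the slide's character matrix; columns i, j, 3, 4, ..., 2.
ROWS = [
    "a....a......",  # f1
    ".....a....a.",  # f2
    ".....a.....a",  # f3
    ".....a......",  # f4
    ".aaaa.aa....",  # g1
    "aaaaa..a..a.",  # g2
    "..aaa...aa..",  # h1
    "..a.a...aa..",  # h2
    "a..aa...aa..",  # h3
]


def columns(rows):
    """Transpose the matrix so each string is one character (column)."""
    return ["".join(col) for col in zip(*rows)]


def compatible(col_a, col_b):
    """Four-gamete test: incompatible only if all four state pairs occur."""
    return len(set(zip(col_a, col_b))) < 4


def incompatibility_counts(cols):
    """For each column, the number of other columns it is incompatible with."""
    counts = [0] * len(cols)
    for i, j in combinations(range(len(cols)), 2):
        if not compatible(cols[i], cols[j]):
            counts[i] += 1
            counts[j] += 1
    return counts


def weights(cols):
    """Hypothetical weighting: columns with more conflicts count for less."""
    return [1.0 / (1 + c) for c in incompatibility_counts(cols)]


if __name__ == "__main__":
    cols = columns(ROWS)
    print(incompatibility_counts(cols))
    print([round(w, 2) for w in weights(cols)])
```

Columns i and j show all four patterns across the taxa, so the test flags them as incompatible, matching the small i/j panels on the slide.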
7, information from sequence order not used
Alignment (original sequence order)    Reordered Alignment (shuffled/reordered)
AIIFLNSALGPSPELFPIILATKVL              ASAGPSPPATPLLIIIILLFFNEKV
AIMFLNSALGPPTELFPVILATKVL              ASAGPPTPATPLLIMVILLFFNEKV
SIMFLNHTLNPTPELFPIILATETL              SHTNPTPPATPLLIMIILLFFNEET
TILFLNSSLGLQPEVTPTVLATKTL              TSSGLQPPATPLLILTVLVTFNEKT
TLLFLNSMLKPPSELFPIILATKTL              TSMKPPSPATPLLLLIILLFFNEKT
ALLFLNSTLNPPTELFPLILATKTL              ASTNPPTPATPLLLLLILLFFNEKT
AILFLNSFLNPPKEFFPIILATKIL              ASFNPPKPATPLLILIILFFFNEKI
c! ways to reorder alignment shuffle by columns & by taxa
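A small check of why sequence order carries unused information: any method that treats columns as i.i.d. sees the same data after the columns are shuffled. The sketch below applies one random column permutation (the seed is arbitrary) to the original alignment from the slide and confirms that pairwise Hamming distances are unchanged. Hamming distance is an assumed stand-in for whatever column-wise method is used.

```python
import math
import random

ALIGNMENT = [
    "AIIFLNSALGPSPELFPIILATKVL",
    "AIMFLNSALGPPTELFPVILATKVL",
    "SIMFLNHTLNPTPELFPIILATETL",
    "TILFLNSSLGLQPEVTPTVLATKTL",
    "TLLFLNSMLKPPSELFPIILATKTL",
    "ALLFLNSTLNPPTELFPLILATKTL",
    "AILFLNSFLNPPKEFFPIILATKIL",
]


def hamming(a, b):
    """Number of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def pairwise(alignment):
    """Hamming distance matrix over all pairs of sequences."""
    return [[hamming(a, b) for b in alignment] for a in alignment]


def permute_columns(alignment, order):
    """Apply the same column order to every sequence."""
    return ["".join(seq[i] for i in order) for seq in alignment]


if __name__ == "__main__":
    n_cols = len(ALIGNMENT[0])
    order = list(range(n_cols))
    random.Random(0).shuffle(order)
    shuffled = permute_columns(ALIGNMENT, order)
    print("distances unchanged:", pairwise(shuffled) == pairwise(ALIGNMENT))
    print("column orderings (c!):", math.factorial(n_cols))
```

The same holds for shuffling taxa, which only relabels the rows and columns of the distance matrix.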
8. could we use ‘words’ of 2, 3, 4, 5, … letters
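A minimal sketch of what using "words" could mean: count overlapping k-letter words in each sequence and compare the counts. Unlike single columns, words depend on the order of residues. The similarity measure here (number of shared words) is an assumption picked for illustration.

```python
from collections import Counter


def words(seq, k):
    """All overlapping words of length k, in order of appearance."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def word_profile(seq, k):
    """Multiset of k-letter words in the sequence."""
    return Counter(words(seq, k))


def shared_words(a, b, k):
    """Number of k-letter words the two sequences have in common (with multiplicity)."""
    return sum((word_profile(a, k) & word_profile(b, k)).values())


if __name__ == "__main__":
    a = "AIIFLNSALGPSPELFPIILATKVL"
    b = "AIMFLNSALGPPTELFPVILATKVL"
    for k in (2, 3, 4, 5):
        print(k, shared_words(a, b, k))
```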
9. Alphabet reduction?
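A sketch of alphabet reduction, assuming the commonly used six-class Dayhoff-style grouping of amino acids. The slides do not say which grouping is meant, so the classes and the digit labels below are assumptions.

```python
# Assumed grouping (Dayhoff-style six classes); class labels are arbitrary digits.
CLASSES = {
    "AGPST": "1",
    "C": "2",
    "DENQ": "3",
    "FWY": "4",
    "HKR": "5",
    "ILMV": "6",
}

# Flatten to a per-residue lookup table.
RESIDUE_TO_CLASS = {aa: label for group, label in CLASSES.items() for aa in group}


def reduce_alphabet(seq):
    """Recode a protein sequence into class labels; unknown symbols become '?'."""
    return "".join(RESIDUE_TO_CLASS.get(aa, "?") for aa in seq.upper())


if __name__ == "__main__":
    for seq in ("AIIFLNSALGPSPELFPIILATKVL", "AIMFLNSALGPPTELFPVILATKVL"):
        print(reduce_alphabet(seq))
```

Substitutions within a class (for example I to M) disappear after recoding, so only changes between classes remain.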
damned eukaryotes!
limits of evolutionary mechanisms - no miracles
continuity all intermediates ‘functional’ can’t evolve “for” what doesn’t exist
Protein synthesis? mRNA, tRNAs, rRNAs, triplet code - why 3?
the origin of protein synthesis?
Eigen limit 1: master sequence
Eigen limit 2-4: [Figure labels: mutation, selection, 0]
~1 error per replication,
error catastrophe,
mutational meltdown
error rates
the error rate limits the length that can be copied (Manfred Eigen, 1971)
2 errors in 37 copies of hammerhead ribozyme, 20-fold improvement
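The error-threshold argument can be put into numbers with a short sketch. It assumes the standard form of Eigen's condition: a master sequence of length L, copied with per-base error rate u, persists only if (1 - u)^L times its selective superiority s exceeds 1. The parameter values are illustrative, not taken from the slides.

```python
import math


def max_length_exact(error_rate, superiority):
    """Largest L with (1 - error_rate)**L * superiority > 1."""
    return math.log(superiority) / -math.log(1.0 - error_rate)


def max_length_approx(error_rate, superiority):
    """Small-error approximation: L_max ~ ln(s) / u."""
    return math.log(superiority) / error_rate


if __name__ == "__main__":
    s = math.e  # assumed superiority, chosen so that ln(s) = 1
    for u in (0.1, 0.01, 0.001, 0.0001):
        print(u, round(max_length_exact(u, s), 1), round(max_length_approx(u, s), 1))
```

With ln(s) near 1 the limit is roughly one error per replication, which is the rule of thumb on the earlier slide.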
ribavirin and polio viruses
Crotty et al. PNAS 98, 6895 2001
synthesis of RNA
from cyclic GTPs
hydrolysis ↔ polymerisation
+H3N-CαH(R1)-CO-NH-CαH(R2)-COO- + H2O ↔ +H3N-CαH(R1)-COO- + +H3N-CαH(R2)-COO-
two monomers ↔ one dimer (+H2O).
heat amino acids dry, or drying cycles (fluctuating clay environment) or frozen in ice?
RNA evolution - in vitro
enzyme efficiency RNA ⇒ RNP ⇒ protein
CATALYST                           Kcat (min^-1)   Kcat/Km (M^-1 min^-1)
RNA
  Tetrahymena L-21(SacI)           0.1             9.0 x 10^7
  polynucleotide kinase            0.3             6.0 x 10^3
  RNase P RNA                      1               2.0 x 10^6
RNP
  RNase P RNA + protein            2               4.0 x 10^6
protein
  RNase T1                         5,700           1.1 x 10^8
  T4 polynucleotide kinase         25,000          6.0 x 10^8
  triose-P isomerase               258,000         1.4 x 10^10
  carbonic anhydrase               600,000,000     7.2 x 10^9
RNA copied by protein, error rate is high
origin of protein synthesis??
[Figure, panels A, B, C: labels NNN, N, N-TP]
How long would a G=C pairing last?
origin of protein synthesis?
[Figure, panel D, steps 1 to 3: labels new RNA strand; template strand; ribozyme-catalysed decoding, cleavage and ligation functions; activated tRNA (aa+, A C C); inactivated tRNA (A C C); NNN]
predictions
Theoretical/computational: time of (say) GC binding; is it increased by heavy RNAs, di- and trinucleotides?
Experimental: length of RNA copied (dinucleotides); amino acid codes???