
Evolutionary Principles of Genomic Compression

David C. Krakauer
Santa Fe Institute, Santa Fe, New Mexico, USA

Genomic compression describes those mechanisms employed by organisms to reduce the total number of nucleotides required to encode functional polypeptides, or those mechanisms facilitating an increase in information transfer rate for a fixed length of message. We can speak therefore of replicatory and regulatory genomic compression. Compression can be achieved through redundancy in the translational apparatus, through overlapping messages in the DNA or RNA sequence, and through a reduction in the length of translated sequences. Compression can also lead to greater coordination in protein production by coupling the translation of functionally related genes. Mathematical models are presented which describe the mechanical and functional principles of genomic compression.

Keywords: genome size, compression, gene regulation, evolution, model, overlapping gene, population genetics

THE SELECTION FOR COMPRESSION

The Saturnian stretched out his hand, seized with great dexterity the ship which carried those gentlemen, and placed it in the hollow of his hand without squeezing it too much, for fear of crushing it . . . . It was not until both Sirian and Saturnian examined the “turds” with microscopes that they realized the amazing truth. When Leeuwenhoek and Hartsoeker first saw, or thought they saw, the minute speck out of which we are formed, they did not make nearly so surprising a discovery. What pleasure Micromegas and the dwarf felt in watching the movements of those little machines, in examining their feats, in following their operations! How they shouted with joy! (Voltaire, Micromegas)

The author is supported by core grants to the Santa Fe Institute from the John D. and Catherine T. MacArthur Foundation, the National Science Foundation, and the U.S. Department of Energy. Thanks to Akina Sasaki for his many insights into the structure and function of adaptive landscapes.

Address correspondence to David C. Krakauer, Santa Fe Institute, Hyde Park Road, Santa Fe, NM 87501, USA. Email: [email protected]

Comments on Theoretical Biology, 7: 215–236, 2002
Copyright © 2002 Taylor & Francis
0894-8550/02 $12.00 + .00

One of the most remarkable features of life on earth is its diversity. This stands in stark contrast to our typical representation of extraterrestrial life forms, typically conceived of as anatomically extended primates, that are moreover the sole representatives of life within their civilization. In our depiction of aliens, we are decidedly pre-Darwinian. We have imagined a world in which there is but a single species of a single form. Perhaps telescopic distance collapses diversity to such an extent that we are incapable of conceiving of differences, like Voltaire’s gigantic Sirian and Saturnian, who found it difficult to imagine minute humanity, merely homogeneous specks (or worse, “turds”), capable of reason. Darwin provided us with a theory with which we might understand the origins of diversity at all scales of organization, a theory that I shall argue is particularly suited to the smaller scales of organization. The theory only requires some means of generating diversity, a mechanism for transmitting information, and a phenotype–environment correlation producing selective differences among variants (Lewontin 1970). Those variants best able to survive (high viability), best able to replicate (high fertility), and best able to reproduce will come to dominate in a population. Darwin’s explanation for diversity is summarized in his much-cited paragraph on the tangled bank:

It is interesting to contemplate an entangled bank, clothed with many plants of many kinds, with birds singing on the bushes, with various insects flitting about, and with worms crawling through the damp earth, and to reflect that these elaborately constructed forms, so different from each other, and dependent on each other in so complex a manner, have all been produced by laws acting around us. These laws, taken in the largest sense, being Growth with Reproduction; inheritance which is almost implied by reproduction; Variability from the indirect and direct action of the external conditions of life, and from use and disuse; a Ratio of Increase so high as to lead to a Struggle for Life, and as a consequence to Natural Selection, entailing Divergence of Character and the Extinction of less-improved forms (Darwin 1859).

A notable feature of this excerpt, and of Darwin’s theory in general, is that it nowhere provides any mention of the expected range of diversity, or of those biological laws that might operate to constrain natural diversity. Are there such principles whereby we might predict the diversity of life on earth? Rather than throwing up our hands at the formidable difficulty of this problem, we should consider rephrasing it, all the while ensuring that its character remains intact, but where the degrees of freedom are sufficiently reduced to make the problem tractable. One way in which to do this is to substitute diversity in life form with diversity in genome size.

Every naturally occurring, living entity on the planet Earth relies on some form of nucleic acid for inheritance. For all but a few laboratory strains, this means RNA or DNA. Furthermore, all living organisms make use of a triplet code (one in which three nucleotides are required to specify an amino acid) and employ amino acids as the building blocks of proteins. Even upon a more detailed microscopic inspection of the genetic apparatus there are observed common principles of replication, transcription, and translation. But when we turn to the size of the primary RNA or DNA sequence, measured in the total number of nucleotides (the C-value), similarities across species all but disappear. In a virus such as φX there are around $0.0000054 \times 10^{9}$ base pairs, in the bacterium Escherichia coli about $0.004 \times 10^{9}$ nucleotide base pairs, in yeast $0.004 \times 10^{9}$, in flies $0.18 \times 10^{9}$, and in humans $3.5 \times 10^{9}$. Lest genome size be mistaken for some measure of phenotypic or evolved complexity, the primitive lungfish possesses on the order of $140 \times 10^{9}$ base pairs and the common fritillary $130 \times 10^{9}$. This is a range of genome sizes spanning over 8 orders of magnitude.

The questions that I address in this article all relate to those pressures and constraints favoring small genome sizes, in particular adaptive theories that seek to explain how a given quantity of information can come to be represented by a small message. In other words, I seek to discuss mechanisms for compressing genetic information. One thing that we can be certain of is that, all else being equal, larger genomes provide the opportunity for more genes and consequently more proteins. More genes and proteins allow for potentially greater control over replication, metabolism, and the environment. If as much functional information could be packed into a virus genome with around $10^{4}$ base pairs as into a human genome with $10^{9}$ base pairs, it is likely that there would be far less diversity in genome size.

In this review I consider genome size only from an adaptive compression perspective. There are many alternative hypotheses that seek to explain variation in genome size. These include theories of neutral drift, self-replicating parasitic sequences, the contribution of nucleoskeleton structure, and a symbiotic or parasitic mode of life.

Neutral drift theories contend that the selective differences brought about through changes in nucleotide content are so small as to be unimportant to organismal fitness. Increases or decreases in genome size reflect mechanistic biases favoring nucleotide accumulation or nucleotide deletion (Ohno 1976; Charlesworth 1996; Pagel and Johnstone 1992).

Parasitic sequences, such as self-splicing introns, are able to propagate themselves horizontally within the host genome. Variation in genome size reflects a mutation–selection balance in which parasites strive to increase their numbers, whereas the host seeks to purge unwanted nucleotides (Doolittle and Sapienza 1980; Orgel and Crick 1980). Variation in genome size reflects parasite exposure, parasite tropism, and parasite tolerance.

The total number of nucleotides within a single copy of the sequence of an organism’s genome is positively correlated with cell volume (Cavalier-Smith 1982). Cell volume bears significantly on the rates of cell division, cellular metabolism, and development. Once selection has led to large cell volumes, the bulk properties of DNA, rather than the coding properties of its sequence, can play an important role as a support structure or nucleoskeleton (Cavalier-Smith 1982).

Numerous pathogenic and mutualistic intracellular bacterial species possess genomes smaller than their free-living relatives. The reduction in genome size is thought to reflect the elimination of those genes encoding peptides that are readily available from the host species (Anderson and Kurland 1998). The causes for genome reduction are hypothesized to be a reduction in mutation load and/or a reduction in the cost of carrying a large genome.

All of these hypotheses are supported by data from various taxonomic groups. None deal explicitly with the selective consequences of compressing information through reduced redundancy, overlapping genes, or translational coupling. These issues are discussed in this article.

We can organize our thoughts on this subject by recognizing three very general principles favoring genomic compression:

* The stable propagation of information.
* The rapid propagation of information.
* The efficient processing of information.

Each of these very general biological principles can be coupled to more mathematical concepts: stability to the concept of the error threshold, propagation to strategies for rapid replication, and processing to the efficient exploitation of finite resources. In the following review I adapt ideas from Krakauer (2000), Krakauer and Plotkin (2002), Krakauer and Jansen (2002), and several other works cited throughout.

BIOLOGICAL PRINCIPLES OF COMPRESSION

A single stretch of DNA or RNA encodes multiple messages, most straightforwardly as nonoverlapping linear arrays of genes, encoding proteins by means of a one-to-one mapping of codon to amino acid. At a higher level, the choice of nucleotide within a chromosome can also reflect selection for chromatin binding motifs or for stabilizing sequences of GC-rich DNA. These give rise to the supragenic structures known as isochores. Within a gene there are further constraints associated with the choice of codon for a given amino acid. There are codes within codes (Trifonov 1989). Selection operates on all of these codes at once, and it is unlikely that they can be concurrently optimized. The genomes we see today are the historical products of each of these countervailing requirements. In this section I explore how information is compressed in the genetic code and in the translational apparatus and what implications this has for organismal fitness.

Redundancy and Degeneracy in Translation

Each amino acid is associated with one or more codons of three nucleotides of DNA or RNA. Each codon encodes an amino acid through an adaptor molecule carrying an amino acid, the transfer RNA (tRNA). The tRNA binds to the codon with a complementary anticodon according to the Watson–Crick base pairing rules. These describe the purines adenine and guanine binding with the pyrimidines thymine and cytosine. As there are 64 possible codons and only around 20 amino acids, the mapping from codon to amino acid is many to one, giving rise to synonym redundancy in the genetic code. The vast majority of organisms employ the same set of codons to encode amino acids. This led to the code being referred to as universal. More detailed comparative analysis of translation has revealed that there are variant assignment rules, particularly in mitochondria, bacteria, and protists. The universal code has been demoted to the canonical code. If we examine translation in still greater detail, we find an even larger increase in diversity. This diversity arises in the mapping of codon to anticodon and, by extension, anticodon to amino acid. The genetic anticode is not even canonical.

We can quantify the diversity of the genetic anticode in terms of redundancy and degeneracy. Anticodon redundancy refers to those instances in which single anticodons process several codons in order to encode an amino acid, whereas anticodon degeneracy involves several anticodons binding several codons, to encode a single amino acid. Within the standard code, the total number of tRNAs can vary theoretically between 22, in which case each anticodon recognizes on average 3 codons, to 61, in which case each anticodon binds strictly to its complementary codon (setting aside the 3 stop codons). This leads to two or more tRNAs carrying the same amino acid (isoaccepting tRNAs). Using the above terminology, employing a large number of specific anticodons (61, for example) to encode 20 amino acids is a strategy of degeneracy, as nonidentical anticodons must map onto a common set of amino acids. By employing a smaller number of anticodons than codons, the system becomes redundant, as the same anticodon is required to bind to more than one codon.
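The two extremes quoted above can be checked with simple arithmetic. The following is only a sketch: it assumes the 61 sense codons of the standard code, and the function name is illustrative rather than taken from the source.

```python
SENSE_CODONS = 64 - 3  # 64 possible triplets minus the 3 stop codons


def mean_codons_per_anticodon(n_anticodons: int) -> float:
    """Average number of sense codons that each anticodon must read."""
    if not 1 <= n_anticodons <= SENSE_CODONS:
        raise ValueError("anticodon count must lie between 1 and 61")
    return SENSE_CODONS / n_anticodons


# 22 anticodons: the most redundant repertoire mentioned in the text.
# 61 anticodons: the fully degenerate repertoire (one anticodon per codon).
for n in (22, 61):
    print(n, round(mean_codons_per_anticodon(n), 2))
```

With 22 anticodons the mean is a little under 3 codons per anticodon, in line with the "on average 3" figure in the text.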

A redundant strategy allows genomes to carry fewer tRNA molecules and acts as a simple mechanism for error buffering by not distinguishing among similar codons. However, redundancy involves a potential cost in terms of reduced binding specificity and more frequent mismatch errors. Degeneracy provides greater specificity of binding, reducing translation errors by reducing the incidence of near-cognate codon readings. It also has the virtue of greater evolutionary flexibility as the number of elements available for potential modification is increased. With a degenerate strategy, the set of bound codons can be modified by simply expanding or contracting the set of anticodons. However, a degenerate strategy requires more tRNAs and an attendant increase in the genome size.

Kinetics of Translation and Replication

Following Krakauer and Jansen (2002) we assume that there are $m$ different codons, $m$ anticodons, and $n$ amino acids. The $m \times 1$ vector $\vec{c}$ specifies the abundances of each of the $m$ codons in the RNA strain. The process of translation involves matching these codons with tRNA anticodons. The abundance of anticodon $i$ in the cell is given by elements $v_i$ of the anticodon vector $\vec{v}$. The binding rate of anticodon $i$ with codon $j$ is given by element $w_{ij}$ of the binding matrix $W$.

The total rate of binding to codon $j$ is given by $\sum_{i}^{m} v_i w_{ij}$; hence, the average time it takes for a codon to be matched is given by

$$f_j = \left( \sum_{i}^{m} v_i w_{ij} \right)^{-1} \tag{1}$$

It also follows that codon $j$ is matched with anticodon $i$ with probability

$$u_{ij} = v_i w_{ij} f_j \tag{2}$$

Because each codon will be matched with an anticodon, we have $\sum_{i=1}^{m} u_{ij} = 1$. The $m \times m$ matrix $U$ has the probabilities $u_{ij}$ as elements.

Each tRNA anticodon is associated with an amino acid. The association between anticodon and amino acid is described by the $n \times m$ binary (i.e., containing only zeros and ones) matrix $A$. The elements of $A$ are denoted $a_{ij}$. Each column of $A$ consists of $n - 1$ zeros and only a single one, as there is a unique association between tRNA and amino acid. In case there is degeneracy, rows of $A$ can contain many ones. Matrices in which any element $u_{ij} < 1$ have redundancy in the $i$th anticodon. The abundance in amino acids after translation is given by $AU\vec{c}$.

The total translation time is given by the sum of the time it takes to initiate the translation process, $E$ (binding of the message to the ribosome), and the total time to match all codons

$$t = E + \sum_{j}^{m} (c_j f_j) \tag{3}$$

Assuming that the extension of the protein is the rate-limiting step, and not the removal of improperly bound anticodons, the rate of translation is given by $q = 1/t$.
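Equations (1)–(3) and the product $AU\vec{c}$ can be evaluated directly for a small toy system. The sketch below is illustrative only: the four-codon example, the binding rates in `W`, and the function names are assumptions made for the sake of the example, not values from Krakauer and Jansen (2002).

```python
def translation_kinetics(c, v, W, E):
    """Return (f, U, t, q) for codon counts c, anticodon abundances v,
    binding rates W (W[i][j]: anticodon i with codon j) and initiation time E."""
    m = len(c)
    # Eq. (1): mean waiting time for codon j is the inverse total binding rate.
    f = []
    for j in range(m):
        rate = sum(v[i] * W[i][j] for i in range(m))
        if rate <= 0:
            raise ValueError(f"codon {j} cannot be read by any anticodon")
        f.append(1.0 / rate)
    # Eq. (2): probability that codon j is matched with anticodon i.
    U = [[v[i] * W[i][j] * f[j] for j in range(m)] for i in range(m)]
    # Eq. (3): initiation time plus the time needed to match every codon.
    t = E + sum(c[j] * f[j] for j in range(m))
    return f, U, t, 1.0 / t


def amino_acid_abundance(A, U, c):
    """Compute A U c, the amino acid abundances after translation."""
    m = len(c)
    Uc = [sum(U[i][j] * c[j] for j in range(m)) for i in range(m)]
    return [sum(row[i] * Uc[i] for i in range(m)) for row in A]


# Hypothetical toy system: 4 codons, 4 possible anticodons, 2 amino acids.
c = [10, 10, 10, 10]
v = [1.0, 0.0, 1.0, 0.0]        # only anticodons 0 and 2 are present
W = [[1.0, 0.5, 0.0, 0.0],      # anticodon 0 also reads codon 1, more slowly
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.5],      # anticodon 2 also reads codon 3, more slowly
     [0.0, 0.0, 0.0, 1.0]]
A = [[1, 1, 0, 0],              # amino acid 0 is carried by anticodons 0 and 1
     [0, 0, 1, 1]]              # amino acid 1 is carried by anticodons 2 and 3

f, U, t, q = translation_kinetics(c, v, W, E=1.0)
print("matching times:", f)
print("translation time:", t, "rate:", q)
print("amino acids:", amino_acid_abundance(A, U, c))
```

In this toy case every codon is still translated, but the weakly bound codons take twice as long to match, which is the cost of redundancy that Eq. (3) captures.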

A similar reasoning might be applied to replication of DNA or RNA. For the sake of conciseness we might assume that replication rate is reduced by a constant amount for each additional codon and tRNA gene carried by the genome. The total number of codons is fixed by functional constraints. The number of anticodons, as we have already established, can vary. The replication time is then a function of total genome length to include $k \sum_i v_i$. The constant $k$ is the average time taken to replicate codons of a tRNA molecule. In other words, replication rate is proportional to

$$\frac{1}{G + k \sum_i v_i}$$

where $G$ is the replication time of all non–tRNA-related codons. The fitness of an asexual organism (ignoring for the present rates of deleterious mutation) is a product of its viability (maximum gene expression rate) and replication rates,

$$w = \frac{q}{G + k \sum_i v_i} \tag{4}$$

It will be observed from Eqs. (1)–(3) that an increase in redundancy, for a fixed number of anticodons, reduces the rate of translation, whereas we see in Eq. (4) that an increase in degeneracy, which requires more tRNA genes, reduces the rate of replication. Maximizing fitness involves a trade-off between replication and viability. In organisms with small genomes the term $k \sum_i v_i$ in the denominator of Eq. (4) will be relatively large in relation to $G$. Thus we expect there to be greater pressure toward redundancy. In larger genomes the term $G$ will dominate and maximum degeneracy would seem to be desirable. This leads us to the discovery of an important design principle: Genomic compression through anticodon redundancy becomes more important as the genome becomes smaller. This is the general pattern that we observe in nature (Krakauer and Jansen 2002).
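The trade-off in Eq. (4) can be illustrated numerically. This is a minimal sketch under stated assumptions: the binding matrix `W`, the codon usage `c`, and the values of `E`, `k`, and `G` are hypothetical, chosen only to make the two regimes visible.

```python
def fitness(c, v, W, E, G, k):
    """Eq. (4): translation rate q = 1/t divided by replication time G + k*sum(v)."""
    m = len(c)
    # Eqs. (1) and (3): total time to initiate and to match every codon.
    t = E + sum(c[j] / sum(v[i] * W[i][j] for i in range(m)) for j in range(m))
    return (1.0 / t) / (G + k * sum(v))


# Hypothetical toy system: 4 codons; off-diagonal entries are weaker,
# wobble-like binding of an anticodon to a second codon.
c = [10, 10, 10, 10]
W = [[1.0, 0.5, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.5],
     [0.0, 0.0, 0.0, 1.0]]
redundant = [1.0, 0.0, 1.0, 0.0]   # two anticodons read all four codons
degenerate = [1.0, 1.0, 1.0, 1.0]  # one anticodon per codon

for G in (1.0, 1000.0):            # small versus large non-tRNA genome
    w_red = fitness(c, redundant, W, E=1.0, G=G, k=5.0)
    w_deg = fitness(c, degenerate, W, E=1.0, G=G, k=5.0)
    better = "redundant" if w_red > w_deg else "degenerate"
    print(f"G={G}: redundant={w_red:.3e} degenerate={w_deg:.3e} -> {better}")
```

With these particular numbers the redundant repertoire has the higher fitness when $G$ is small and the degenerate repertoire wins when $G$ is large, which is the qualitative pattern described above; other parameter choices can shift the crossover point.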

Genomic Organization and Error Thresholds

Following the groundbreaking work of Beadle and Tatum (1941) on mutant lineages of Neurospora, the dominant view was that one gene led to the production of one polypeptide. The assumption of collinearity of genetic message and protein product played a pivotal role in helping to explain the mechanism of genetic translation (Fox Keller 2000). However, an examination of the genomes of viruses, bacteria, and protists shows significant violations of this assumption.

By far the majority of viruses and bacteria contain stretches of DNA or RNA in which constituent nucleotides are translated into two or more different polypeptides. Regions of the genome in which there is this one-to-many translational mapping are termed overlapping reading frames (Normark et al. 1983; Miyata and Yasunaga 1978). Overlapping reading frames allow a larger number of proteins to be encoded by a single stretch of nucleotides than would be feasible with one gene and one polypeptide. The information for two or more proteins is compressed into the information space typically occupied by a single protein. This raises a number of mechanical and functional questions: (1) How do two or more genes come to occupy overlapping sequences? (2) What are the implications of overlapping reading frames for the robustness of the underlying genetic message? (3) What are the advantages of overlapping reading frames in terms of replication? (4) Are certain kinds of genes more likely to be found overlapping?

Before exploring some of these questions as they relate to compression, we need to consider certain evolutionary limits placed on any replicating strand of DNA or RNA, in the absence of overlap. The theory of mutation–selection balance on genetic sequences was first developed by Eigen (1971) to explain the dynamics of RNA replication in a flow reactor. The standard presentation of this concept involves establishing a sequence space, imposing fitness differentials upon this metric (a fitness landscape), and writing down a dynamical system determining the time evolution of replicators moving through this space. The sequence space is constructed by assuming that there is a set of sequences of uniform length comprising $N$ monomeric units, each of which can be drawn from a class of size $C$. The dimension of the sequence space is $N$ and the total number of unique sequences $C^N$. We can calculate the distance between any two sequences in this sequence space using the Hamming distance $d(i, j)$, which tabulates the number of positions in a sequence of length $N$ at which monomers are different. Assuming a wild-type sequence $S_w$, we can generate Hamming classes by grouping together all of those sequences equidistant from the wild-type, $S_d$, where $d = d(w, j) \in (1, 2, \ldots, N)$. These Hamming classes are important as we shall impose radial symmetry constraints on our adaptive landscape whereby we assume that all members of a given Hamming class have identical fitness. Moreover, adjacent classes can be reached through a single monomer/base mutation.
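The construction of the sequence space can be made concrete with a few lines of code. The sketch below computes the Hamming distance between two sequences and the number of sequences in each Hamming class around a reference sequence; the example sequences and the choice $N = 6$, $C = 4$ are arbitrary, and class $d = 0$ (the wild-type itself) is included only so that the class sizes sum to $C^N$.

```python
from math import comb


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def class_size(N: int, C: int, d: int) -> int:
    """Number of sequences at Hamming distance d from a reference sequence:
    choose the d mutated positions, then one of C - 1 alternatives at each."""
    return comb(N, d) * (C - 1) ** d


N, C = 6, 4  # six monomers drawn from a four-letter alphabet
sizes = [class_size(N, C, d) for d in range(N + 1)]
print(hamming("ACGUAC", "ACGAAU"))
print(sizes, sum(sizes) == C ** N)
```

Because every sequence belongs to exactly one class, the class sizes add up to the total number of sequences, $C^N$.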

During the replication of sequences there are small probabilities of mutational errors. Each site is replicated accurately with a probability $q$, and hence is mutated with a probability $1 - q$. The mutation probabilities between all sequence pairs $(S_i, S_j)$ of the $C^N$ sequences are given by a mutation matrix

$$Q = \{ Q_{ij};\ i, j = 1, 2, \ldots, C^N \} \tag{5}$$

where

$$Q_{ij} = q^{N - d(i, j)} \left( \frac{1 - q}{C - 1} \right)^{d(i, j)} \tag{6}$$

In the standard quasi-species equation it is assumed that the rate constant for replication of sequence $S_i$ is given by $a_i$, and that for sequence decay by $d_i$. The differential equation is

$$\dot{x}_i = [a_i Q_{ii} - d_i - f(t)]\, x_i + \sum_{j \neq i} a_j Q_{ji} x_j, \qquad i, j = 1, \ldots, T \tag{7}$$

where the total population size $(T)$ is kept constant by matching the productivity of the system with an outflow or “flux” term, $f(t) = \sum_{i=1} (a_i - d_i) x_i(t)$. The abundance $x_i$ of sequence $S_i$ increases through replication or through mutation of a sequence $S_j$ into $S_i$ with a probability $Q_{ji}$. This equation can be substantially simplified by assuming that the decay rates of all sequences are equivalent, and that the back mutation is negligible in comparison with forward mutation: $\sum_{j \neq i} a_j Q_{ji} x_j \ll a_i Q_{ii}$. This gives us a linear system

$$(AQ - f)\, x = 0 \tag{8}$$

where the matrix $A$ is the diagonal matrix of replication rates, $f$ the diagonal matrix of outflows, and $x$ the vector of $x_i$ values. This is now a standard eigenvalue problem, in which the equilibrium distribution of $x$ is given by the eigenvector associated with the dominant eigenvalue of $AQ$.
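For a very small sequence space the mutation matrix of Eq. (6) and the equilibrium of the quasi-species equation can be computed explicitly. The following sketch is one possible numerical approach, not the method of the source: it uses plain power iteration, assumes equal decay rates, and multiplies by the matrix with entries $a_j Q_{ji}$, which is how the production terms enter Eq. (7). The binary alphabet, $N = 4$, and the replication rates are hypothetical.

```python
from itertools import product


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def mutation_matrix(seqs, C, q):
    """Eq. (6): Q[i][j] is the probability that copying seqs[i] yields seqs[j]."""
    N = len(seqs[0])
    return [[q ** (N - hamming(a, b)) * ((1 - q) / (C - 1)) ** hamming(a, b)
             for b in seqs] for a in seqs]


def equilibrium(a, Q, tol=1e-13, max_iter=100_000):
    """Power iteration for the dominant eigenvector of the matrix with
    entries a[j] * Q[j][i]; returns (relative frequencies, eigenvalue)."""
    n = len(a)
    x = [1.0 / n] * n
    for _ in range(max_iter):
        y = [sum(a[j] * Q[j][i] * x[j] for j in range(n)) for i in range(n)]
        lam = sum(y)                      # growth factor of the whole population
        y = [yi / lam for yi in y]        # renormalise to relative frequencies
        if max(abs(yi - xi) for yi, xi in zip(y, x)) < tol:
            return y, lam
        x = y
    raise RuntimeError("power iteration did not converge")


N, C, q = 4, 2, 0.95
seqs = list(product(range(C), repeat=N))   # seqs[0] = (0, 0, 0, 0) is the wild type
Q = mutation_matrix(seqs, C, q)
a = [1.0] + [0.5] * (len(seqs) - 1)        # hypothetical replication rates
x, lam = equilibrium(a, Q)
print("wild-type frequency:", round(x[0], 3))
print("dominant eigenvalue:", round(lam, 3))
```

Each row of `Q` sums to one, and at equilibrium the dominant eigenvalue equals the mean replication rate of the population, which plays the role of the flux term when decay is ignored.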

Plateau Landscapes

To determine the whereabouts and stability of the equilibria, we need to establish a geometry of the adaptive landscape. The simplest assumption is that of a single peak landscape, in which the wild-type sequence has replication rate $a_w$ and all other Hamming classes a fitness $a_m$, where $a_w > a_m$. It is common to assume that $a_w = 1$ and $a_m = 1 - s$, where $s$ is the selective cost of mutation. The selective advantage of the wild-type is $a = 1/a_m$. The equilibrium frequency of the wild-type sequence in the plateau landscape is then given as

$$\bar{x}_w = \frac{Q_{ww}\, a - 1}{a - 1} \tag{9}$$

A stability analysis of this equilibrium shows us that this equilibrium remains stable whenever $Q_{ww} > 1/a$, or, written differently, when $Q_{ww} > a_m$ or $Q_{ww} > 1 - s$. In order to determine the relationship between stability and sequence length we solve $Q_{ww} = 1/a$, which can be rewritten as $q^N = 1/a$, to give us (using $\ln q \approx -(1 - q)$ for $q$ close to 1) $N_{crit} = \ln(a)/(1 - q)$. For reasonable values of $a$ $(1 < a < 10^{9})$, we find that $1 - q \approx 1/N_{crit}$. The error threshold for plateau landscapes is given by a mutation rate equal to the reciprocal of the genome length. From the perspective of genome compression, the error threshold concept illustrates why smaller genomes are more stable than larger genomes: They require lower replication fidelity in order to persist. To ensure that the wild-type is not lost as a result of an increase in genome size, there must be an unrealistically large increase in the fitness of the wild-type. Thus the error threshold imposes a maximum on the amount of information that can be transmitted through a linear genome. An increase in genome length that brings about new adaptive functions, without an increase in replication fidelity, is very unlikely to persist. This has profound implications for the trajectory of the evolution of replicators across plateau landscapes. It states that in order for new gene functions to evolve, a replicator must first have a sufficiently low rate of error to accommodate further contributions to genome length. Stated differently, innovations that increase replication fidelity will occur before innovations that increase replication rate.
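The error threshold on the plateau landscape is easy to explore numerically. The sketch below evaluates Eq. (9) and compares the exact solution of $q^N = 1/a$ with the approximation $N_{crit} = \ln(a)/(1 - q)$. Flooring the frequency at zero beyond the threshold is a convenience of this example, and the values $q = 0.999$ and $s = 0.5$ are hypothetical.

```python
from math import log


def wild_type_equilibrium(q: float, N: int, a: float) -> float:
    """Eq. (9) with Q_ww = q**N; floored at zero once the threshold is passed."""
    Qww = q ** N
    return max(0.0, (Qww * a - 1.0) / (a - 1.0))


def n_crit_exact(q: float, a: float) -> float:
    """Genome length solving q**N = 1/a exactly."""
    return log(a) / -log(q)


def n_crit_approx(q: float, a: float) -> float:
    """Approximation ln(a) / (1 - q), valid when q is close to 1."""
    return log(a) / (1.0 - q)


q, s = 0.999, 0.5
a = 1.0 / (1.0 - s)  # selective advantage of the wild type, a = 1/a_m
print("exact N_crit :", round(n_crit_exact(q, a), 1))
print("approx N_crit:", round(n_crit_approx(q, a), 1))
for N in (500, 800):
    print(N, round(wild_type_equilibrium(q, N, a), 4))
```

Below the critical length the wild-type persists at a positive frequency; above it the expression in Eq. (9) is no longer positive and the wild-type is lost.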

Multiplicative Landscapes

A more realistic landscape is the symmetric, multiplicative landscape of the form $a_i = (1 - s)^{i}$, for which $i$ is the number of mutations and $s$ the selective cost of each mutation. There are $N + 1$ equilibrium solutions to this system, in which each equilibrium varies in the number of wild-type monomers maintained, from $N$ down to 0. This system does not strictly have a “threshold” or phase transition at a critical value of the mutation rate, as is observed in the plateau landscape. The abundance of the wild-type sequence gradually decreases for increasing rates of mutation. However, it remains true to say that when the wild-type $S_0$ is eventually outcompeted, then the population will be driven to extinction. It is convenient to use the notation $m = 1 - q$ for the per-site mutation probability.

The equilibrium frequency distribution in the single peak, multiplicative function landscape, indexed by the number $i$ of functional sites, is given as

$$x_i = \binom{N}{i} \left( 1 - \frac{m}{s} \right)^{i} \left( \frac{m}{s} \right)^{N - i} \qquad (i = 0, 1, \ldots, N) \tag{10}$$

in which the mean number of functional sites is $\bar{i} = N(1 - m/s)$, and the population mean fitness is $\bar{w} = (1 - m)^{N}$. As with the plateau landscape, the stabilities of the equilibria are calculated by linearizing the dynamics. After a rather involved set of calculations the condition for stability of the wild-type is given by

$$m < s \tag{11}$$

This result shows that in multiplicative landscapes the error threshold is independent of genome length. It is, however, misleading, as increasing the genome length leads to very large reductions in the abundance of the wild-type. We can show this by writing down the ratio of the nearest one-monomer mutant class to the wild-type, and requiring that the wild-type remain the more abundant:

$$\frac{x_{N-1}}{x_N} = \frac{N m/s}{1 - m/s} < 1 \tag{12}$$

or

$$N + 1 < \frac{s}{m} \tag{13}$$

and hence for large genomes,

$$N_{crit} \approx \frac{s}{m} \tag{14}$$

The multiplicative adaptive landscape in the limit of $s \to 1$ converges to the plateau landscape and yields the error threshold $N_{crit} \approx 1/m$. The important difference from the plateau landscape for the evolution of compression is that genome size can increase through genetic innovations that provide adaptive benefits without prior increases in replication fidelity. In other words, genome size is not as important a factor in genomic evolution across multiplicative fitness landscapes as it is across plateau landscapes.
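The binomial equilibrium of Eq. (10) and the dominance condition of Eqs. (12)–(14) can be checked numerically. This sketch assumes that $m$ is the per-site mutation probability and that $i$ counts functional sites, so that $x_N$ is the wild-type; the parameter values are hypothetical.

```python
from math import comb


def equilibrium_distribution(N: int, m: float, s: float) -> list:
    """Eq. (10): frequency of genomes carrying i functional sites, i = 0..N."""
    if not 0 < m < s <= 1:
        raise ValueError("requires 0 < m < s <= 1 (stability condition, Eq. 11)")
    r = m / s
    return [comb(N, i) * (1 - r) ** i * r ** (N - i) for i in range(N + 1)]


def mutant_to_wild_type_ratio(N: int, m: float, s: float) -> float:
    """Eq. (12): abundance of the one-mutant class relative to the wild type."""
    x = equilibrium_distribution(N, m, s)
    return x[N - 1] / x[N]


m, s = 0.001, 0.1
x = equilibrium_distribution(50, m, s)
mean_sites = sum(i * xi for i, xi in enumerate(x))
print("mean functional sites:", round(mean_sites, 3))
print("N_crit ~ s/m =", round(s / m))
for N in (50, 150):
    print(N, round(mutant_to_wild_type_ratio(N, m, s), 3))
```

For genome lengths well below $s/m$ the wild-type remains the most abundant class, whereas above it the one-mutant class overtakes the wild-type even though the stability condition $m < s$ still holds.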

Introducing Structure: Overlapping Reading Frames

The standard quasi-species model does not distinguish among sites in the genome, and all sequences are of equivalent length. In the simplest plateau landscape, a single mutation at any site leads to the maximum fitness decrement. In continuously differentiable landscapes, a larger number of mutations are associated with a larger decline in fitness. Genomes with overlapping genes experience different fitness reductions according to the region in which the mutation occurs. In overlapping regions, more than one gene can be damaged; in nonoverlapping regions only a single gene can be damaged. Within the overlapping regions there is variation in mutation load according to the direction and phase of overlap. Furthermore, sequences are no longer of uniform length and vary according to their degree of overlap, $M$. This fact can lead to variation in replication rates. We need to consider each of these properties of overlapping genes in order to derive an appropriate adaptive landscape.


Mutation Incidence

Consider a sequence comprising two reading frames each of length $N$ and overlapping reading frame length $M$. The total genome length is then given by $2N - M$. If mutation rates are homogeneous across the genome, the probability of a single mutation falling within an overlapping region is proportional to

$$p = \frac{M}{2N - M} \tag{15}$$

and in a nonoverlapping region is

$$v = \frac{2(N - M)}{2N - M} = 1 - p \tag{16}$$

The values $p$ and $v$ determine the probable whereabouts of mutations within the genome.
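As a quick numerical illustration of Eqs. (15) and (16), the following minimal sketch computes the two incidence probabilities for a given gene length and overlap. The function name `mutation_incidence` and the example values are illustrative choices, not part of the original model description.

```python
def mutation_incidence(n, overlap):
    """Return (p, v) for two genes of length n sharing `overlap` nucleotides.

    p: probability that a single mutation lands in the overlapping region, Eq. (15)
    v: probability that it lands in a nonoverlapping region, Eq. (16)
    """
    if not 0 <= overlap <= n:
        raise ValueError("overlap must lie between 0 and n")
    genome_length = 2 * n - overlap
    p = overlap / genome_length
    v = 2 * (n - overlap) / genome_length
    return p, v


if __name__ == "__main__":
    # Hypothetical example: two 300-nucleotide genes overlapping by 100.
    p, v = mutation_incidence(300, 100)
    print(p, v, p + v)
```

With no overlap all mutations fall in nonoverlapping sequence, and with complete overlap ($M = N$) every mutation falls in the shared region.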

Direction and Phase of Overlap

Once a mutation arises in an overlapping sequence, the mutation load will depend upon the reading frame within which overlap is observed. This is because the degree to which polypeptides encoded by overlapping genes achieve selective independence depends upon the direction and phase of overlap. Overlap can increase the mutational load in a genome by reducing the effective redundancy of a sequence. Consider a pair of overlapping genes, in which gene I is read +1 out of phase of gene II. Translation is initiated one nucleotide closer to the 5′ end of the genome. The first codon position of the +1-phase ($F_{+1}$) gene will correspond to the second codon position of the 0-phase gene ($F_{+0}$), whereas the first codon position of the 0-phase gene corresponds to the third codon position of the +1-phase gene. The two phases of parallel overlap and three of anti-parallel overlap are constructed by shifting each nucleotide along a codon by $F_{+1}, F_{+2}$, or $F_{-0}, F_{-1}, F_{-2}$, counting mod 3:

1 2 3 1 2 3 1 2 3 1 2 3 1 2 3

. . . 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3

. . . 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3

. . . 3 2 1 3 2 1 3 2 1 3 2 1 3 2 1

. . . 3 2 1 3 2 1 3 2 1 3 2 1 3 2 1

3 2 1 3 2 1 3 2 1 3 2 1 3 2 1

We can provide an estimate of the mean mutational load associated with each of these five overlapping sequence configurations. By examining a table of the canonical genetic code, we determine the probability of a single missense or nonsense mutation in translated polypeptides (nonsynonymous mutations) as a result of mutations to the first, second, or third position of a codon. The probability of an amino acid mutation to $k$ polypeptides translated from an overlapping reading frame $i$ nucleotides out of phase is denoted by $p_k(F_{\pm i})$, where $\pm$ takes the values $+, -$. At the first position the probability of a nonsynonymous substitution is 0.65, at the second position 0.75, and at the third position 0.3. The corresponding synonymous mutation probabilities are 0.35, 0.25, and 0.7. The average probability of mutation to two polypeptides derived from an overlapping gene is given by the probabilities of nonsynonymous mutations at any one of the three codon positions and the probabilities of simultaneous mutations to each of the polypeptides. For a gene in +1 or +2 phase, the probabilities, denoted by $p_2(F_{+1})$ and $p_2(F_{+2})$, are given by $\frac{1}{3}(0.65 \times 0.75 + 0.75 \times 0.3 + 0.65 \times 0.3) = 0.30$. Through identical calculations, the probabilities for the remaining configurations are: $p_2(F_{-0}) = 0.32$; $p_2(F_{-1}) = 0.35$; $p_2(F_{-2}) = 0.29$. The probability of a single mutation to a translated polypeptide, $p_1(F_{\pm i})$, is derived from the synonymous mutation probability and the simultaneous nonsynonymous mutation probability: $p_1(F_{\pm i}) = 1 - p_0(F_{\pm i}) - p_2(F_{\pm i})$.
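The averaging over codon positions can be sketched in a few lines. The per-position probabilities are those quoted in the text; everything else is an assumption made for illustration. In particular, the codon-position pairings for the three anti-parallel configurations are not listed in the text and are chosen here because they reproduce the quoted values, and $p_0$ is taken to be the mean product of the two synonymous probabilities.

```python
# Nonsynonymous substitution probabilities at codon positions 1, 2, 3 (from the text).
NONSYN = {1: 0.65, 2: 0.75, 3: 0.30}
SYN = {pos: 1.0 - q for pos, q in NONSYN.items()}

# Codon positions that share a nucleotide in each configuration.
# The F+1 pairing follows the text; the others are hypothetical reconstructions.
PAIRINGS = {
    "F+1": [(1, 2), (2, 3), (3, 1)],
    "F+2": [(1, 3), (2, 1), (3, 2)],
    "F-0": [(1, 3), (2, 2), (3, 1)],
    "F-1": [(1, 2), (2, 1), (3, 3)],
    "F-2": [(1, 1), (2, 3), (3, 2)],
}


def overlap_probabilities(pairs):
    """Return (p0, p1, p2): chance that a mutation alters 0, 1, or 2 polypeptides."""
    n = len(pairs)
    p2 = sum(NONSYN[a] * NONSYN[b] for a, b in pairs) / n
    p0 = sum(SYN[a] * SYN[b] for a, b in pairs) / n  # assumption: both changes synonymous
    p1 = 1.0 - p0 - p2
    return p0, p1, p2


if __name__ == "__main__":
    for name, pairs in PAIRINGS.items():
        p0, p1, p2 = overlap_probabilities(pairs)
        print(f"{name}: p0={p0:.3f} p1={p1:.3f} p2={p2:.3f}")
```

Under these assumptions the two-polypeptide probabilities agree with the quoted figures to within rounding.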

Replication Rate

Fitness is a product of both viability and fertility or replication rate. In the absence of overlap, fitness is synonymous with viability, as all sequences are of uniform length. When the degree of overlap is free to vary, then the fitness function should reflect the replication costs of carrying a larger number of monomers. Replication time increases linearly with increasing sequence length when nucleotides are not rate limiting and there is a single replication initiation site. We allow heterogeneity in genome size to influence replication rate through a weighting function,

$$w(M) = \frac{N}{2N - M} \tag{17}$$

When overlap is at its maximum for two genes of equal length, $M = N$, the replication rate weighting factor will be equal to 1. When there is no overlap ($M = 0$) the weighting term will be equal to $\frac{1}{2}$. Assuming a genome comprising two genes of equal length there is a twofold advantage in replication rate for complete overlap. There is an $n$-fold advantage for complete overlap in a genome comprising $n$ identical length genes.
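The twofold advantage can be checked directly from Eq. (17). This is a minimal sketch; the function name `replication_weight` is an illustrative choice.

```python
def replication_weight(n, overlap):
    """Replication-rate weighting w(M) = N / (2N - M) of Eq. (17)."""
    if not 0 <= overlap <= n:
        raise ValueError("overlap must lie between 0 and n")
    return n / (2 * n - overlap)


if __name__ == "__main__":
    n = 300  # hypothetical gene length
    no_overlap = replication_weight(n, 0)
    full_overlap = replication_weight(n, n)
    print(no_overlap, full_overlap, full_overlap / no_overlap)
```

The ratio of the two weights is independent of the gene length chosen, since only $M/N$ enters.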

Adaptive Landscapes with Overlap

In order to specify the adaptive landscape for overlapping genes, we take into account mutation incidence, the nature of the overlap, and the replication rate. We also need to choose the geometry of the landscape. We can either (a) consider the average effect of mutations falling into nonoverlapping and overlapping regions of a sequence with a prescribed degree of overlap or (b) split the sequence population into two subpopulations in which one population describes all those sequences with mutations to nonoverlapping regions, and the second population, sequences with mutations to overlapping regions. The first approach is more approximate but allows us to describe all members of the population with a single fitness function. The second approach is more accurate but requires that we describe each overlapping mutant class independently. For plateau landscapes, the second approach is relatively simple, as the dimensions of the system are kept small from grouping together sequences of identical fitness. I concentrate on the plateau landscape and only touch on the multiplicative landscape.

Plateau Landscapes with Overlap

Denote the wild-type sequence $x_0$, those genomes with mutations falling within a nonoverlapping sequence $x_1$, and genomes with mutations in overlapping sequences $x_2$. Mutations to nonoverlapping sequences can be either synonymous or nonsynonymous. Mutations falling within overlapping sequences can be synonymous, nonsynonymous within a single translated polypeptide, or nonsynonymous within two translated polypeptides. The per genome per generation mutation rate is given by $m = q^{2N-M}$ and we neglect the structure of the mutation matrix $Q_{ij}$. For a genome with an overlapping reading frame $M$, the replication dynamics are given by

$$\dot{x}_0 = [a_0(1 - m) - d - f(t)]\,x_0 \tag{18}$$

$$\dot{x}_1 = [a_1 - d - f(t)]\,x_1 + a_0 m x_0 v \tag{19}$$

$$\dot{x}_2 = [a_2 - d - f(t)]\,x_2 + a_0 m x_0 p \tag{20}$$

The replication rates $a_i$ for each mutant class with direction and phase of overlap $F_{\pm j}$ are given by:

$$a_0 = w(M) \tag{21}$$

and

$$a_i = w(M)\,\frac{1}{i + 1} \sum_{k=0}^{i} p_k(F_{\pm j})(1 - ks) \qquad (i = 1, 2) \tag{22}$$

with a flux term,

$$f(t) = \sum_{i=0} x_i(t)\,(a_i - d) \tag{23}$$
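The three-class dynamics of Eqs. (18) to (20), with the replication rates of Eqs. (21) and (22) and the flux of Eq. (23), can be explored numerically. The sketch below uses a plain forward-Euler step; the function names, the step size, and all parameter values are illustrative assumptions rather than values taken from the text.

```python
def replication_rates(w, probs, s):
    """Eqs. (21)-(22): a_0 = w and a_i = w/(i+1) * sum_{k=0}^{i} p_k (1 - k s)."""
    rates = [w]
    for i in (1, 2):
        total = sum(probs[k] * (1.0 - k * s) for k in range(i + 1))
        rates.append(w * total / (i + 1))
    return rates


def euler_step(x, a, m, d, p, v, dt):
    """Advance (x0, x1, x2) by one forward-Euler step of Eqs. (18)-(20)."""
    flux = sum(xi * (ai - d) for xi, ai in zip(x, a))  # Eq. (23)
    dx0 = (a[0] * (1.0 - m) - d - flux) * x[0]
    dx1 = (a[1] - d - flux) * x[1] + a[0] * m * x[0] * v
    dx2 = (a[2] - d - flux) * x[2] + a[0] * m * x[0] * p
    return [x[0] + dt * dx0, x[1] + dt * dx1, x[2] + dt * dx2]


def simulate(x, a, m, d, p, v, dt=0.01, steps=1000):
    """Iterate euler_step and return the final class frequencies."""
    for _ in range(steps):
        x = euler_step(x, a, m, d, p, v, dt)
    return x


if __name__ == "__main__":
    # Hypothetical parameters: w(M) = 0.6, p = 0.2, v = 0.8, s = 0.5.
    rates = replication_rates(0.6, (0.17, 0.53, 0.30), 0.5)
    print(rates)
    print(simulate([1.0, 0.0, 0.0], rates, m=0.05, d=0.1, p=0.2, v=0.8))
```

Because $p + v = 1$, the flux term keeps the class frequencies summing to one, which is a convenient sanity check on any implementation.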


The stability of the wild-type is determined by the dominant eigenvalue of the linearized system. From the preceding definitions, stability is achieved when

$$1 - m > \frac{a_0}{a_1} \tag{24}$$

which, upon substitution from Eq. (22) gives

$$1 - m > \frac{2}{p_0(F_{\pm j}) + p_1(F_{\pm j})(1 - s)} \tag{25}$$

From the left-hand side (LHS) of Eq. (25) the stability of the wild-type increases as $m$ decreases, which is true for increasing amounts of overlap $M$. Compressed genomes can tolerate larger error rates, as the total number of mutations per genome is reduced. Stability also depends on the phase and direction of overlap, represented by the values of the $p_0(F_{\pm j})$ and $p_1(F_{\pm j})$ terms. Increases in the probabilities of nonsynonymous substitutions increase the stability of the sequence.

An important consequence of compression is a reduction in sequence redundancy. Sequence redundancy can be defined approximately as

$$r = \frac{p_0(F_{\pm j})}{\sum_{i=0}^{2} p_i(F_{\pm j})} \tag{26}$$

which is simply the relative probability of a synonymous substitution in a sequence with overlap phase $F_{\pm j}$. High values of redundancy threaten the stability of the wild-type. Low values of redundancy bolster the stability of the wild-type. This is because redundancy reduces the competitive differences among fitness classes in a population and thereby threatens the long-term persistence of mutation-free genomes. A reduction in redundancy increases the mutation load and thereby more effectively purges mutant genomes from the population (Krakauer and Plotkin 2002; Krakauer and Nowak 1999). Genomic compression can thereby increase stability in two ways: (1) by reducing mutation incidence and (2) by reducing sequence redundancy.

This is the opposite result from those described in information theory in which redundancy increases the fidelity of information transfer across a noisy channel. The explanation for this important difference derives from the population dynamical transmission of information in evolutionary systems. Here we have multiaccess information but where the channel remains undivided and a large number of transmitters compete for their message to be processed by the receiver. The transmission of information is noncooperative. With small populations, redundancy in the message is once again favored, as it can increase the probability that not all individuals within the population harbor defective genomes (Krakauer and Plotkin 2002).

Multiplicative Landscapes with Overlap

With multiplicative landscapes we are required to track each mutant class. With two nonoverlapping genes of identical length $N$ there are $\binom{2N}{i}$ ways of experiencing $i$ mutations. With an overlap of size $M$ there are $\binom{2N-M}{i-k}$ ways of experiencing $i - k$ mutations to a nonoverlapping sequence and $\binom{M}{k}$ ways of experiencing $k$ mutations in an overlapping sequence. To calculate the mean fitness of a genome with a total of $i$ mutations we need to consider probable mutations to both overlapping and nonoverlapping sequences, and the direction and phase of overlap within these sequences:

$$a_i = w(M) \sum_{k} \binom{i}{k} p^k v^{i-k} \left[1 - s p_1(F_{\pm j}) - 2 s p_2(F_{\pm j})\right]^k \left[1 - s p_1(F_{\pm j})\right]^{i-k} \tag{27}$$

$$= w(M)\, v^i \left[1 - s p_1(F_{\pm j})\right]^i \left[1 + p\left(\frac{1}{v} + \frac{2 s p_2(F_{\pm j})}{v s p_1(F_{\pm j}) - v}\right)\right]^i \tag{28}$$

This function represents the expected fitness after having experienced $k$ out of $i$ mutations to an overlapping sequence and $i - k$ mutations to a nonoverlapping sequence. Mutations falling within an overlapping sequence can lead to 0, 1, or 2 amino acid substitutions in a protein with a fitness cost of 0, $s$, or $2s$. Mutations falling within a nonoverlapping sequence can lead to 0 or 1 amino acid substitution with a fitness cost 0 or $s$. Setting $p = 0$, the preceding equation reduces to the simple multiplicative fitness function for a nonoverlapping sequence of monomers.
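Equation (27) is a binomial sum, so it collapses to $w(M)\,(pA + vB)^i$ with $A = 1 - sp_1 - 2sp_2$ and $B = 1 - sp_1$, which is algebraically the factored form of Eq. (28). The following sketch checks this numerically; the function names and parameter values are illustrative assumptions, and the closed form requires $v \neq 0$ and $sp_1 \neq 1$.

```python
from math import comb


def fitness_sum(i, w, p, v, s, p1, p2):
    """Eq. (27): explicit sum over k mutations in the overlapping region."""
    a = 1.0 - s * p1 - 2.0 * s * p2  # expected factor per overlapping-region mutation
    b = 1.0 - s * p1                 # expected factor per nonoverlapping-region mutation
    return w * sum(
        comb(i, k) * p**k * v ** (i - k) * a**k * b ** (i - k) for k in range(i + 1)
    )


def fitness_closed(i, w, p, v, s, p1, p2):
    """Eq. (28): the same quantity after applying the binomial theorem."""
    b = 1.0 - s * p1
    bracket = 1.0 + p * (1.0 / v + 2.0 * s * p2 / (v * s * p1 - v))
    return w * v**i * b**i * bracket**i


if __name__ == "__main__":
    params = dict(w=0.6, p=0.2, v=0.8, s=0.5, p1=0.53, p2=0.30)  # hypothetical values
    for i in range(5):
        print(i, fitness_sum(i, **params), fitness_closed(i, **params))
```

Setting `p=0` and `v=1` in either function recovers the plain multiplicative landscape $w(M)(1 - sp_1)^i$ described in the text.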

The analysis of the stability properties of this model is beyond the scope of this review. However, a few general properties of this landscape can be seen immediately. First, overlapping-gene landscapes are steeper than the nonoverlapping landscapes, as $1 - s p_1(F_{\pm j}) - 2 s p_2(F_{\pm j}) < 1 - s p_1(F_{\pm j})$. Second, increasing the sequence redundancy ($r$) through changes in the direction and phase of overlap leads to a reduction in the steepness of the landscape, in other words, a reduction in $\sum_{i=0}^{2} p_i(F_{\pm j})$. As landscape steepness determines the stability of the wild-type, we see directly how redundancy influences heritability.

Overlap and Limits to Evolvability

In the previous section I showed how a compressed sequence encoding more than one protein is more vulnerable to mutations than a nonoverlapping gene. There is another potential fitness constraint on overlapping genes. It is perhaps unrealistic to assume that the same amount of information can be encoded in a sequence of length $2N$ as in a compressed sequence of length $N$. Standard information-theoretic compression algorithms, used to source encode messages, work by removing all redundancies. It is assumed therefore that the underlying message remains the same. In evolutionary contexts, we worry about not only current performance, but potential performance in alternative environments. Genomes, particularly those of bacteria or viruses, need to remain adaptable. Overlapping reading frames impose a limit to adaptability by introducing correlations, described by geneticists as epistasis, into gene functions. Adaptability, or evolvability, is related to mutability in rather complex ways. On the one hand, in rapidly changing environments or in very large population sizes, mutability might promote evolvability by generating a sufficient number of novel variants to spawn an adaptive lineage. On the other hand, in slowly varying environments, or in small populations, redundancy will often be favored as it can promote efficient exploration of sequence space. The choice of penalty for overlapping genes should therefore consider the evolutionary context. If we only consider the case where redundancy would be favored, then the cost of overlap, defined in terms of the average degree of mutability, will be negatively correlated with redundancy. Approximately, redundancy is the inverse of mutability, which we might define as $R_{\pm i} = 1 - p(F_{\pm i})$.

Efficient Processing Through Translational Coupling

Overlapping reading frames can also serve to expedite efficient translation. Rather than thinking about overlap exclusively as a means of increasing rates of replication and minimizing mutation load, we can think about overlap in terms of bringing neighboring genes into contact with translational machinery to ensure some form of coordinated or regulated expression. This is often achieved by fusing the termination codon of one gene with the initiation codon of another; this strategy, known as translational coupling, is exploited both by bacteria and phages (Oppenheim and Yanofsky 1980; Madison-Antenucci and Steege 1998; Schumperli et al. 1982).

The bacterial trp operon is a repressible operon in which tryptophan acts as a corepressor. With low levels of tryptophan there is no repression and the operon is transcribed. With high levels of tryptophan the operon is in the off state and there is no transcription. The trp operon consists of a set of structural genes that are expressed as a polycistronic mRNA translated into five enzymes responsible for the synthesis of tryptophan. The genes (trpE through trpA) are for the five enzymes that catalyze the synthesis of the amino acid tryptophan from chorismic acid. The coding sequences of all but two genes, trpC and trpE, overlap by several nucleotides. The termination codon of trpE overlaps with the initiation codon of trpD by 29 nucleotides. Premature termination of trpE reduces the rate of expression of trpD. The putative roles of translational coupling are to (a) reduce the initiation time delay attendant upon tRNA and ribosome binding to the mRNA transcript and (b) modulate the difference in the concentration of trpE and trpD. Overlap can therefore serve two purposes: decrease the total genome size and improve the regulation of coordinately expressed genes.

This idea can be clarified by analyzing a simple kinetic model. I shall not model the negative regulation of the trp operon, but instead assume that it is constitutively expressed. Suppose that $[R]$ is the concentration of the full polycistronic trp operon RNA, and $C$ the joint ribosome tRNA complex. The translation of the mRNA of genes trpD and trpE, contained in the polycistronic transcript $R$, begins after an initiation delay $d_i$ during which the ribosome binds to the mRNA. Once the complex $[RC]$ is formed, there is an elongation delay $d_e = N/k$, describing the time required to synthesize the polypeptides D and E. The variable $N$ is the gene length, and $k$ is the rate constant of chain elongation. With overlapping genes and complete translational coupling ($z = 1$), there is translation of the E gene only once D has been translated. Without translational coupling ($z = 0$) both D and E are translated at the same rate. The kinetics of translation can be described by:

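$$R + C \xrightarrow{\;d_i\;} RC \tag{29}$$

$$RC \xrightarrow{\;d_e\;} \left(\frac{D}{2 - z} + (1 - z)\frac{E}{2 - z} + R + C\right) \tag{30}$$

$$RC \xrightarrow{\;2d_e\;} \left(\frac{zE}{2 - z} + R + C\right) \tag{31}$$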

The kinetic relations describe how $[R]$ binds to $[C]$ to form the translation complex $[RC]$. After an elongation delay $d_e$ the proteins D and E are produced through independent ribosome binding with a probability $1 - z$. After a further delay $2d_e$, the protein E is produced through translational coupling of the trpD and trpE genes with a probability $z$. The differential equations describing these kinetics are:

$$[\dot{R}] = \frac{1}{2d_e}[RC]\,z + \frac{1}{d_e}[RC](1 - z) - \frac{1}{d_i}[R][C] \tag{32}$$

$$[\dot{RC}] = -\frac{1}{2d_e}[RC]\,z - \frac{1}{d_e}[RC](1 - z) + \frac{1}{d_i}[R][C] \tag{33}$$

$$[\dot{C}] = \frac{1}{2d_e}[RC]\,z + \frac{1}{d_e}[RC](1 - z) - \frac{1}{d_i}[R][C] \tag{34}$$


$$[\dot{D}] = \frac{1}{d_e}[RC]\frac{1}{2 - z} - g_1[D] \tag{35}$$

$$[\dot{E}] = \frac{1}{2d_e}[RC]\,z\,\frac{1}{2 - z} + (1 - z)\frac{1}{d_e}[RC]\frac{1}{2 - z} - g_2[E] \tag{36}$$

After solving for the equilibrium concentrations of D and E, which we denote as $\hat{D}$ and $\hat{E}$, we find that we can express $\hat{E}$ in terms of $\hat{D}$ as

$$\hat{E} = \frac{\hat{D} g_1 (2 - z)}{2 g_2} \tag{37}$$

with obligate translational coupling ($z = 1$),

$$\hat{E} = \frac{\hat{D} g_1}{2 g_2} \tag{38}$$

and with independent translation ($z = 0$),

$$\hat{E} = \frac{\hat{D} g_1}{g_2} \tag{39}$$

Thus, translational coupling adds a degree of control over the relative concentrations of coupled proteins by harnessing the translational delay as a mechanism for promoting variation in gene products. Assuming that D and E decay at the same rate, D can be more abundant as a result of a smaller delay in production.
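The equilibrium relation between the two proteins can be checked by integrating the kinetic equations numerically. The sketch below is a forward-Euler integration under stated assumptions: signs follow the reaction scheme of Eqs. (29) to (31), so that the complex RC is consumed by both release steps and E is produced by both routes. The function name, initial concentrations, step size, and rate values are all illustrative.

```python
def simulate_translation(z, d_i=1.0, d_e=1.0, g1=0.4, g2=0.2,
                         r0=1.0, c0=1.0, dt=0.01, steps=20000):
    """Integrate the translation kinetics and return the final (D, E) levels."""
    r, c, rc = r0, c0, 0.0
    prot_d, prot_e = 0.0, 0.0
    for _ in range(steps):
        bind = r * c / d_i             # R + C -> RC after initiation delay d_i
        fast = rc * (1.0 - z) / d_e    # uncoupled release after delay d_e
        slow = rc * z / (2.0 * d_e)    # coupled release after delay 2 d_e
        release = fast + slow
        d_rate = rc / (d_e * (2.0 - z)) - g1 * prot_d
        e_rate = release / (2.0 - z) - g2 * prot_e
        r += dt * (release - bind)
        c += dt * (release - bind)
        rc += dt * (bind - release)
        prot_d += dt * d_rate
        prot_e += dt * e_rate
    return prot_d, prot_e


if __name__ == "__main__":
    for z in (0.0, 0.5, 1.0):
        d_eq, e_eq = simulate_translation(z)
        # Compare the simulated ratio with g1 (2 - z) / (2 g2).
        print(z, e_eq / d_eq, 0.4 * (2.0 - z) / (2.0 * 0.2))
```

The ratio of E to D at equilibrium does not depend on the amount of complex formed, only on the coupling strength and the two decay rates.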

A mechanistic derivation for the delay term ($d_e$) was provided in the first section of this article as

$$d_e = \sum_{j}^{m} c_j f_j \tag{40}$$

where the $m \times 1$ vector $\vec{c}$ specifies the abundances of each of the $m$ codons in the mRNA, whereas the average time it takes for a codon $j$ to be matched is described by the vector $\vec{f}$. Thus the degree of redundancy or degeneracy in the translational apparatus, or in the codon composition of the translated genes, further modifies the rates of protein production through translational coupling. If the two genes differ in their codon compositions, then we write the translational delay terms as $d_e^{(E)}$ and $d_e^{(D)}$, and the equilibrium concentration of E in terms of the equilibrium concentration of D is given by

$$\hat{E} = \frac{\hat{D} g_1 \left(d_e^{(E)} + d_e^{(D)} - d_e^{(E)} z\right)}{g_2 \left(d_e^{(D)} + d_e^{(E)}\right)} \tag{41}$$
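A short sketch shows how codon composition feeds into the equilibrium ratio. The delay is computed as the sum of codon abundance times matching time, Eq. (40), and the ratio of E to D follows Eq. (41). The function names and the codon counts and matching times in the example are hypothetical.

```python
def elongation_delay(codon_counts, match_times):
    """Eq. (40): total delay as the sum over codons of abundance times matching time."""
    if len(codon_counts) != len(match_times):
        raise ValueError("codon_counts and match_times must have the same length")
    return sum(c * f for c, f in zip(codon_counts, match_times))


def equilibrium_ratio(delay_e, delay_d, z, g1, g2):
    """Eq. (41): equilibrium concentration of E divided by that of D."""
    return g1 * (delay_e + delay_d - delay_e * z) / (g2 * (delay_d + delay_e))


if __name__ == "__main__":
    # Hypothetical genes built from two codon types with different matching times.
    times = [0.5, 1.0]
    delay_d = elongation_delay([40, 10], times)  # gene D, rich in the fast codon
    delay_e = elongation_delay([10, 40], times)  # gene E, rich in the slow codon
    for z in (0.0, 0.5, 1.0):
        print(z, equilibrium_ratio(delay_e, delay_d, z, g1=0.3, g2=0.3))
```

When the two delays are equal the expression reduces to $g_1(2 - z)/(2g_2)$, the equal-composition case; with no coupling ($z = 0$) the delays cancel and only the decay rates matter.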


The introduction of coupling brings with it a variety of mechanisms for regulating protein expression. Thus, provided there is some degree of translational coupling ($z > 0$), codon composition, relative gene length, and the decay rates of the respective protein products all play an important role in equilibrium protein concentrations. Compression of the genome serves to both minimize the nucleotide content and increase the regulatory possibilities.

Translation Compression in Ciliates

In the previous subsections I introduced compression as a means of (1) reducing the genome size in order to facilitate rapid replication, and (2) increasing the regulatory efficiency of translated genes. In other words, compression works to reduce the size of heritable messages and to reduce the distance between interpreted messages. The former provides an advantage in gene replication and the latter in gene expression. With the example of translational coupling, both forms of compression are realized in the same system. We now introduce another example in which translational efficiency and replicative efficiency are correlated. This is the case of gene scrambling in hypotrichous ciliates (Prescott 1997).

Hypotrichous ciliates possess two nuclei: a micronucleus, which acts as the germ line, and a macronucleus, which acts as a somatic nucleus. From one generation to the next only the micronucleus is transmitted, whereas only the macronucleus is transcribed. This effectively decouples replicatory compression from regulatory compression. Following mating, ciliates are diploid with respect to the macronucleus. During the formation of a macronucleus from one of the micronuclei a number of events take place: (1) noncoding regions of DNA present in the micronucleus are excised, (2) spacer DNA is removed, (3) gene-coding regions of the micronucleus are spliced to form complete genes, and (4) DNA encoding genes in the micronucleus is amplified by up to three orders of magnitude to increase the gene copy number in the macronucleus. The developmental process from micronucleus to macronucleus thus unscrambles genes, removes all noncoding regions, and increases gene numbers. Redundancies are introduced into the macronucleus in order to increase the net rate of translation.

From the perspective of genomic compression, hypotrichous ciliates manage to achieve an almost ideal balance: gene numbers are kept low in the replicating micronucleus, whereas gene numbers are multiplied in the macronucleus to allow for translation in parallel.

SUMMARY

Differences in genome size represent one of the most varied measures of diversity among organisms. Small genomes are often favored in order to (1) promote the stable propagation of information through a reduction in mutation load, (2) promote the rapid propagation of information through a reduction in nucleotide content, and (3) promote an efficient processing of information by minimizing transcriptional and translational delays. All of these pressures favoring smaller genomes are more pronounced in the smallest genomes. This is because in organisms with small genomes, genome size leads to significant kinetic bottlenecks influencing viability. Thus we expect small genomes to get smaller. Because the pressure for compression is less severe on larger genomes, other factors such as nucleoskeleton and drift will dominate. This predicts a bimodal distribution of genome sizes.

Compression can be achieved through greater redundancy in the translational apparatus, through overlapping messages in the DNA or RNA sequence, and through a reduction in the length of translated sequences. Compression can also lead to greater coordination in protein production by coupling the translation of functionally related genes.

As with compression in information theory, compression in biological systems is often the result of the elimination of redundancies. Unlike in information theory, where redundancy increases the reliability of messages, the heritable stability of biological messages can be increased by eliminating redundancies. This is a result of increasing the competitive superiority of wild-type sequences. This is a strategy that is only favored in large populations. In small populations, concordant with information theory, message heritability is increased through increases in redundancy.

REFERENCES

Andersson, S. G. E., and C. G. Kurland. 1998. Reductive evolution of resident genomes. Trends Microbiol. 6:263–268.

Beadle, G. W., and E. L. Tatum. 1941. Genetic control of biochemical reactions in Neurospora. Proc. Natl. Acad. Sci. USA 27:499–506.

Cavalier-Smith, T. 1982. Skeletal DNA and the evolution of genome size. Annu. Rev. Biophys. Bioeng. 11:273–302.

Charlesworth, B. 1996. The changing sizes of genes. Nature 384:315–316.

Darwin, C. 1859. Recapitulation and Conclusion. Chap. 14 in The Origin of Species. London: John Murray.

Doolittle, W. F., and C. Sapienza. 1980. Selfish genes, the phenotype paradigm and genome evolution. Nature 284:601–603.

Eigen, M. 1971. Self-organization of matter and the evolution of biological macromolecules. Naturwissenschaften 58:465–523.

Fox Keller, E. 2000. The century of the gene. Cambridge, MA: Harvard University Press.

Krakauer, D. C. 2000. Stability and evolution of overlapping genes. Evolution 54:731–739.

Krakauer, D. C., and V. A. A. Jansen. 2002. Red queen dynamics of protein translation. J. Theor. Biol. 218:97–109.

Krakauer, D. C., and M. A. Nowak. 1999. Evolutionary preservation of redundant duplicated genes. Semin. Cell. Dev. Biol. 10:555–559.


Krakauer, D. C., and J. Plotkin. 2002. Redundancy, antiredundancy and the robustness of genomes. Proc. Natl. Acad. Sci. USA 99:1405–1409.

Lewontin, R. C. 1970. The units of selection. Annu. Rev. Ecol. Syst. 1:1–18.

Madison-Antenucci, S., and D. A. Steege. 1998. Translation limits synthesis of an assembly-initiating coat protein of filamentous phage IKe. J. Bacteriol. 180:464–472.

Miyata, T., and T. Yasunaga. 1978. Evolution of overlapping genes. Nature 272:532–535.

Normark, S., S. Bergstrom, T. Edlund, T. Grundstrom, B. Jaurin et al. 1983. Overlapping genes. Annu. Rev. Genet. 17:499–525.

Ohno, S. 1976. So much "junk" DNA in our genome. In Evolution of genetic systems, ed. H. H. Smith, 366–370. New York: Gordon and Breach.

Oppenheim, D. S., and C. Yanofsky. 1980. Translational coupling during expression of the tryptophan operon of E. coli. Genetics 95:785–795.

Orgel, L. E., and F. H. C. Crick. 1980. Selfish DNA: The ultimate parasite. Nature 284:604–607.

Pagel, M. D., and R. A. Johnstone. 1992. Variation across species in the size of the nuclear genome supports the junk DNA explanation for the C-value paradox. Proc. R. Soc. Lond. B Biol. Sci. 249:119–124.

Prescott, D. M. 1997. Origin, evolution, and excision of internal eliminated segments in germline genes of ciliates. Curr. Opin. Genet. Dev. 7:807–813.

Schumperli, D., K. McKenney, D. A. Sobieski, and M. Rosenberg. 1982. Translational coupling at an intercistronic boundary of the Escherichia coli galactose operon. Cell 30:865–871.

Trifonov, E. N. 1989. The multiple codes of nucleotide sequences. Bull. Math. Biol. 51:417–432.
