On the robustness of maximum entropy relationships for complexity distributions of nucleotide sequences

Compurers Chem. Vol. 17. No. 2, pp, 135-148, 1993 Printed in Great Britain

0097-8485193 %6.00 + 0.00 Pergamon Press Ltd

ON THE ROBUSTNESS OF MAXIMUM ENTROPY RELATIONSHIPS FOR COMPLEXITY

DISTRIBUTIONS OF NUCLEOTTDE SEQUENCES*

PETER SALAMON,‘~ JOHN C. WOOTTON,* ANDRZEJ K. KONOPKA~ and LARS K. HANSEN

‘Department of Mathematical Sciences, San Diego State University, San Diego, CA 92182, *National Center for EGotechnology Information, National Library of Medicine, National Institutes of Health,

Building 38A, Room 8N805, Bethesda, MD 20894, “NCI/DCBRC, National Institutes of Health, Building 1052. Room 257, Frederick, MD 21702. U.S.A. and ‘CONNECT, Electronics Institute,

Technical University of Denmark, DK-2800 Lyngby, Denmark

(Received 7 December 1992; in revised form 16 April 1993)

Abstract--Given a functionally equivalent set of natural nucleotide sequences, the distribution of local compositional complexity among all subsequences of this set appears to be as random as possible consistent with the mean complexity of such subsequences. The robustness of this relationship and its possible causes have been explored by means of (1) dynamic simulations based on models of biased substitution mutations, (2) equilibrium models incorporating known mononucleotide probabilities, and (3) extension of the analyses, previously carried out on short oligonucleotides, to much larger subsequences. The maximum entropy effect evidently follows from almost any mechanism for substitution mutation dynamics that incorporates a systematic bias toward lowcomplexity. The effect is only partially explained by unequal mononucleotide probabilities. The complexity distributions for larger subsequences of length range 40-120 nucleotides show novel regularity of structure (‘feathering’) that is not yet explained.

INTRODUCTION

The subject of this report is the local compositional complexity of nucleotide sequences. ‘Complexity’ as used here has been defined precisely for short oligonucleotides in similar ways by Konopka & Owens (1990a,b) and Salamon & Konopka (1992) and also, for amino acid sequences, by Wootton & Federhen (1993). These formal definitions (see Sec- tion 2 below) correspond to an intuitive concept of the simplicity or complexity of local segments of sequence, based on the number of ways the nucleotide composition of each segment can be rearranged. Different functional domains within genomes (coding sequences and introns from different species, other non-coding regions, repetitive DNA) display different mean values of local compositional complexity, as defined by these methods (Konopka, 1990; Konopka & Owens, 1990a,b; Salamon & Konopka, 1992). Natural amino acid sequences contain a high abundance of local segments of low compositional complexity that are statistically distinct from the typical high-complexity sequences of globular proteins (Wootton & Federhen, 1993).

* The preliminary version of this work was presented during the Second International Workshoo on Ooen Problems in Computarionnl Molecular Biol~gy,~ Tell&de Summer Re- search Center, Telluride, Colo., 19 Juiy-2 August 1992.

t Author for correspondence.

Quantitative concepts of complexity have a long history and the term ‘complexity’ has been used in several senses different from each other and from the one used in this report. Within coding theory, cryp tology and computer science the complexity of string of symbols has been defined as the length of the shortest program to generate this string [algorithmic complexity (Solomonoff, 1964; Kolmogorov, 1965; Chaitin, 1966)] or as combined amount of computer memory and computation time required to generate the string [computational complexity (Garey & John- son, 1979; Aho et al., 1974; Wagner & Wechsung, 1986)]. Other definitions of complexity take into account the intrinsic difficulty of the problem of computing the string [logical depth (Bennet & Landauer, 1985)] or even ‘inexactness’ of data used for string’s description [information-based complexity (Traub et al., 1988)]. It seems that our definition of complexity (Salamon & Konopka, 1992; Wootton & Federhen, 1993, Section 2) is closer to the concept of algorithmic complexity than to the other above mentioned notions of complexity. However, exact circumstances in which local compositional complexity can be equated with algorithmic complexity remain unknown and will require separate studies (see Wootton & Fcderhen, 1993 for a brief discussion).

As far as molecular biology is concerned, fractions of repetitious DNA are sometimes called ‘simple sequence DNA’ thereby suggesting that repetitiveness and non-repetitiveness are related to ‘simplicity’ and

135

136 PETEKSALAMON et al.

‘complexity’ respectively. Accordingly, the ‘sequence complexity’ is defined in molecular biology as the total length of different DNA sequences (i.e. non- repetitive sequences) present in a given preparation. Classical reassociation kinetics experiments provide evidence (B&ten & Kohne, 1968; Wetmur & David- son, 1968) that ‘simple sequence DNA’ covers as much as 95-97% of multicellular eukaryotic genomes, and that genes tend to reside in ‘complex sequence’ regions of these genomes. There is no obvious relation between ‘simple sequence DNA’ in the reassociation kinetics sense and the DNA regions that display low local compositional complexity in terms of definitions given in this report. Low complexity DNA sequences do not necessarily contain tandem repeats of fragments shorter than themselves. Conversely, regularly repetitive DNA may be of high local compositional complexity. Accordingly, the definitions used in this report (see also Konopka & Owens, 1990a; Salamon & Konopka, 1992), should not be confused with the above mentioned terminol- ogy from reassociation kinetics.

Salamon & Konopka (1992) analyzed surprisal (log-odds ratio) of tetra-through octanucleotides as a function of their local compositonal complexity. Each of the 35 large collections of sequences studied, displayed a strong linear trend (with negative slope values) of surprisal as a function of complexity. This trend expresses the tendency toward low complexity in a precise quantitative form. The linear trend is also predicted from a maximum entropy principle which states that the distribution of complexity among oligonucleotides of length L (referred to as L-nucleotides) is as random as possible consistent with a mean complexity. In other words, as far as revealed in the distribution of the values of the complexity, all naturally occurring distributions are well approxi- mated by a one parameter family of the Boltz- mann-Gibbs form (McQuarrie, 1976) in which the complexity plays the role of energy. Specifically, the predicted likelihood of seeing an oligonucleotide with complexity value C is proportional to exp( -PC), where the parameter B of this family is simply related to the slope of the line of surprisal complexity. This slope, or equivalently the mean complexity, provides a sensitive measure of the degree of overall bias in favor of lower complexity sequences.

In the present report, we explore this linear relationship between surprisal and complexity further. Although our main interest is the quality of the match between observed distributions and ones predicted from a maximum entropy principle, we use the convenient and standard device of comparing the distributions by plotting surprisal complexity. The goodness of fit to a linear relationship in such a plot is equivalent to a goodness of fit between the observed distribution and one predicted from a maximum entropy principle (Salamon & Konopka, 1992; Levine & Bernstein, 1974). In Section 3 we show that such a relationship follows in a robust fashion from

random dynamics models in which the transition probabilities are perturbed to favor decreasing complexity. In Section 4 we show that such relationships also follow from equilibrium models in which oligonucleotide probabilities are estimated from mononucleotide frequencies. The magnitude of the effect however falls far short of explaining the observed linear trends. Finally in Section 5 we look briefly at the distribution of complexity in much larger oligonucleotides where a host of new and interesting phenomena appear.

2. REVIEW AND DEFINITIONS

Our approach, as in Salamon & Konopka (1992) and Wootton & Federhen (1993), is to look at local complexity states by using information measures. Given an oligonucleotide of length L, we define its local compositional complexity state by the vector

v = [n, f nz, “3 I %I, (1) such that n, 2 n2 > n3 2 n4 are all non-negative integers whose sum equals L. n, is the number of occurrences of the most frequent nucleotide, n2 the second most frequent and so on. Thus (4,3,3,0) stands for 50,400 different lo-nucleotides, among them ACACACTTTT and GTAGATAGTG (see Woot- ton & Federhen, 1993; Salamon & Konopka, 1992 for more details).

From standard combinatoral arguments, the number of distinct sequences, n, that can be made by rearranging the nucleotides in a sequence with complexity state v is given by the multinomial coefficient

L!

In this report, we use the measure of complexity defined as:

K, = ; log R. (3)

We will call K, ‘local compositional complexity’ (abbreviated as ‘complexity’ in the rest of this report).

This definition is related to the one analogous to Shannon entropy:

K,= - i ;log; (4) i= I

K, and K2 converge asymptotically to the same limit for large n; (Kullback, 1959: Boulton & Wallace, 1969; see also Wootton & Federhen, 1993). All the relationships we discuss hold for both measures albeit with slightly altered constants of proportionality.

The relationship we wish to review here is between the complexity and the surprisal (log-odds ratio). The latter term is defined as log (observed frequency/expected frequency), with expected frequencies determined from some prior probability distribution. This method is a standard technique of comparison between a distribution and a prior

Robustness of maximum entropy reIationships 137

expectation. It has turned out to he a powerfu1 method for discovering systematic deviations (laws) in many contexts, e.g. for nascent product distributions in chemical reactions (Levine & Bernstein, 1974).

The standard recipe calls for a two step approach. In the first (analytic) step, surprisal is plotted as a function of some appropriate property. Observations made from such plots are then used in the second (synthetic) step to derive the most random distribution consistent with all the information available. Since explanation is rarely perfect, one can also think of the synthesis step as a means to provide a better prior distribution for the next round of comparisons.

Salamon & Konopka (1992) followed this tra- ditional recipe to examine the distribution of complexity states in 35 databases of different functional classes of naturally occuring nucleotides from different groups of organisms. The prior distribution they used was as random as possible consistent with a four-letter alphabet to make strings of length L. Thus the prior probability of seeing a lo-nucfeotide with complexity state (4,3,3,0) is 50,400/4” or about 0.05. This distribution P” is given by

P” (n,, n2, n,, n4) = CJ * F/4L (5) where the factor F counts the number of ways of assigning the four bases A, C, T, G to the ni categories.

4! F=T (6)

,GQ rk! where the r, values are the counts of the number of occurrences of each n, value in the complexity state vector (see also Wootton & Federhen, 1993; Salamon & Konopka, 1992). For the four-letter alphabet, F can take the values 24 (each rzi value occurs once), 12 (two Occurrences of the same value), 6 (two pairs of the same values), 4 (three occurrences of the same value) and 1 (four occurrences of the same value).

Then plotting the surprisal, i.e. the log-odds ratio

S(PStid IPO) = log (Pa”a’IPo), (7)

for the observed distribution as compared with the prior distribution PO, vs the complexity K, gives the linear relationship with negative slope espoused in Salamon & Konopka (1992). The linear relationship is then reinterpreted in the synthesis step as stating that Padual is the ‘most random distribution consistent with a mean value of the complexity’. Such mean values fall well below prior expectation based on PD.

3. DYNAMIC MODELS

We begin with a random dynamic model of substitution mutations and view the L-nucleotide at time step n + 1 as obtained from the L-nucleotide at time step n by a change in one nucleotide. We model this as a two step process. In the first step the nucleotide at which substitution is to take place is removed,

yielding an (L - l)- nucleotide, and then a new one is added yielding once again an L-nucleotide. For our random model we will assume that any of the nucleotides in the sequence is equally likely to be the site of the substitution, i.e. is removed with probability l/L. We further assume that any of the four possible nucleotides wil1 take its place each with probability 0.25. The resultant, highly symmetric dynamics merely gives a yardstick against which we will compare perturbations of the model. In this report, we use the word ‘transition’ to mean ‘change of state+, as in transition probabilities and associated transition probability matrices, and not to refer to transitions as opposed to transversions in the jargon of substitution mutations.

We view the process at the level of complexity states, i.e. in a lumped model whose states are exactly the compositional complexity states. More specifically, we start from a Markov process at the level of complexity states by combining into one ‘lumped state’ all sequences with the same complexity state vector. A transition into a lumped state occurs whenever the underlying process makes a transition to a sequence belonging to that lumped state and a transition from the lumped state occurs with a probabiltiy averaged over the initial state (Kemeny & Snell, 1959). While in general, lumped models do not accurately describe the dynamics far from equilibrium, they do predict the correct stationary distribution at the lumped level. They also provide a good approximation for dynamics near equilibrium (Andresen et al., 1988). Our present use for these lumped transition probabilities will only involve the predicted equilibrium distributions. One advantage of our random dynamics model is that the transition probabilities are highly symmetric. This ensures that the sequence to sequence transition matrix is exactly lumpable to the level of transitions between complexity states. While a less symmetric set of transition probabilities at the sequence to sequence level will no longer result in a process which is exactly lumpable, we stress that the resulting lumped matrices will still predict the correct equilibrium distributions for the process. This follows from the general proofs (Kemeny & Snell, 1959) demonstrating that lumped versions of Markov chains have distorted dynamics but have stationary distributions that are exactly lumped versions of the stationary distributions of the unlumped chains.

We will model real substitution mutation probabilities as perturbations on our random dynamics model. Our present use of the perturbed models is exclusively for the purpose of studying their predicted equilibrium distributions. The model does not assume any prior knowledge of mutation probabilities and thereby can he used to study the effects of simulated mutational bias on the predicted equilibrium.

We now illustrate the calculation of transition probabilities between complexity states for our random dynamics model. Each of the two steps requires

138 F%mz SALAM~N er al.

Table I. Transition probability matrices at the complexity state level for substitution mutations at the sequence level according to random

dynamics for window size L = 5

the “sly-substih&n-site.” matrix

wum 1 0 0 0 0 (4.1 ,W) l/5 4/5 0 0 0 (3,2,0.0)

i I

0 2I5 3n 0 0 (3,LLO) 01/303/50 (2.2.m 0 0 l/s 4l5 0 (2.1.m 0 0 0 3l3 20

the “select-identity-of-tbesubstituent” matrix

(4,ww)

[

114 3f4 0 0 0 0 (3,Ltw 0 114 l/4 2f4 0 0 GGu.LO) 0 0 VI 0 2/4 0 (2.1.1.0) 0 0 0 114 2/4 l/4 t1,1,1.11 0 0 0 0 0 1 1

Rows of the matrices correspond to initial complexity states of the transition (labeled). The two matrices represent halves of the process: the overall transition probability matrix is given by their product.

a separate transition probability matrix. The example matrices for L = 5 are shown in Table 1. Consider removing a random nucleotide from a 5-nucleotide of type (3,2,0,0). Such removal corresponds in biological terms to selection of the nucleotide at which substitution will take place. On removal of the nucleotide to be substituted, the S-nucleotide of type (3,2,0,0) can only change to a 4-nucleotide which belongs to one of the two types: (2,2,0,0) or (3,1,0,0). Within the random dynamics model, the probability that the removal results in (2,2,0,0) is 5, while the probability that the removal results in (3,1,0,0) is 3. The actual probabilities of getting these two 4-nucleotides may be affected by bias in our choice of the substitution site. Equivalently, the actual probabilities can be affected by different sequences belonging to the initial (3,2,0,0) complexity class making the two possible transitions with different probabilities. Such biases cannot however change the fact that these are the only two possible results on removal of one nucleotide. This is perhaps more evident by examining a representative sequence of the state (3,2,0,0) such as AAACC. When the selection of a substitution site is made, the remaining sequence can only have compo- sitions AACC or AAAC. The example also illustrates our reference to the transition probabilities within the random dynamics model being sufficiently symmetric to make lumping exact. We see that regardless of which representative (say) AAACC or TTTGG we pick, the likelihoods of changing to a state of type (2,2,0,0) or (3,1,0,0) within the random dynamics model are 3 and 3 respectively. We also note that this is no longer the case if the removal of an A from AAACC does not have the same probability as removal of a T from TTTGG in which case the dynamics is no longer lumpable.

The second step is then to add a nucleotide, i.e. to

select the nucleotide to substitute. This time we have four equally likely choices. Starting from (2,2,0,0) we go to (3,2,0,0) with probability $ if we pick one of the two nucleotides already doubly represented in the 4-nucleotide and we move to (Z&2,1,0) with probability $ if we pick one of the two nucleotides not previously represented in the 4-nucleotide. Similarly, (3,1,0,0) moves to (4,1,0,0) with probability a, to (3,2,0,0) with probability f and to (3,1,1,0) with probability f. In this manner, we compute the two transition probability matrices shown in Table 1. The reader is encouraged to try this and obtain some of the other entries shown in the table. The product of the ‘select-substitution-site’ matrix and the ‘selcct- identity-of-the-substituent’ matrix is the overall transition probability matrix at the complexity state level for the substitution mutation at the sequence level.

We begin our study of dynamics by perturbing the transition matrices form this random dynamics process whose stationary distribution is exactly the distribution PO. The question we ask is: how special is the linear relationship between S(PIP”) and K,? Does such a relationship hold for the stationary distribution of a perturbed version of the Markov chain given by the above random dynamics model? We explore the question by altering the transition probabilities in the matrix slightly and plotting the log-odds ratio, S(PwlPo), complexity K,.

The perturbations we use are slightly special in two important ways. First of all, we restrict ourselves to perturbations which preserve the zero/nonzero structure of the transiton matrix. As argued above, this structure reflects the possible moves among complexity states which result by removing exactly one nucleotide and replacing it by another, i.e. by substitution mutations at the sequence level. While it is our intent to study the effects of bias in this operation, such bias will still keep zero entries equal to zero and only affect the values of the nonzero entries. The second restriciton is that our perturbations must result in a stochastic matrix, i.e. must conserve probability. Thus if we increase one entry in a row by an amount x, then some other entry must be decreased by the same amount to keep the row sums equal to one. Also, all entries must remain nonnegative. To illustrate, these restrictions, consider again the addition of a nucleotide to one of type (2,2,0,0). If we are biased toward decreasing complexity, we may be more likely to add a nucleotide already represented in the sequence and the probability of a transition to (3,2,0,0) would be ri. But we cannot make this probability more than 50% without making the probability of picking one of the non-represented nucleotides 4 50%. This illustrates the restriction that we conserve probability. The restriction that all entries remain non-negative is also reasonable. De- creasing an entry means choosing the move that it represents less often. Since a substitution cannot he made more seldom than never, i.e. have probability

Robustness of maximum entropy relationships

Table 2. The ‘select-substitution-site’ matrix for L = 8 showing the sign pattern of perturbations biased toward decreasing complexity

0 0 0 0 0

y Q 0 0 0

z+xz 8 q 0 0 0

&$% 0 y 0 0

0 y 0 Z% 8

0

0 l+nJ 2 8 8

0 JZS 8

0 3+xg 8

Q 0

0 0 1 0

0 0 F ;

0 0 0 4+x1 8

0 0 0 2+x9 8

0 0 0

0 0 0

0 0 0

0 0 0

0 0 0

0 0 0

i & 0

0 $ & 4

1+x170 4 0

0 I+xlsrJ 4

0 0 1+x19 4

0 0 0

0 0 0

0 0 0

0 0 0

0

0

0

0

0

0

0

0

f

0

0

0

0

0

0

0 0 0 0 0

0 0 0 0 0

0 0 0 0 0

0 0 0 0 0

0 0 0 0 0

0 0 0 0 0

?zL Q 8

0 0 0

0 0 0 0 0

0 4-x? 0 0 8

0

0 0 !I!5 0 8 0

s o 0 * 0

0 2+xlo6_xlo 0 8 8

0

0 2+x110 L?ZU 8 8

0

0 0 l+xlt4 8 8

S$u

0 0 0 0 0

0 0 0 0 0

0 0 0 0 0

0 0 0 0 0

0 0 0 0 0

9 a 0 0 0

: *To 0

0 0 3-x19 o 4 0

0 0

0 0

0 0

0 0

0 0

0 0

0 0

2+x200 4

0 1 4

LY.T?%2 0 4

0

0 LQLt.0 4 s O

+L 0

0 0 I+x??O a 4 $ 4

0

0 0 0 0 0 3+xua 4 4

139

The entries for x = 0 reduce to the L = 8 versions of the matrices shown in Table I. A positive value for any of the perturbation parameters x, results in enhanced transitions to states of lower complexity.

zero, non-negativity of the entries is a reasonable constraint.

To illustrate the resulting linear trends better, we present the effects of perturbation studies on the random dynamics model for L = 8. This corresponds to 15 complexity states and hence 15 points to fit or not fit a straight line. We have parametrized the perturbations with an eye to seeing an effect of bias based on change in complexity. Accordingly, we have added a term +xi to transitions which decrease the complexity the most among possible transitions and a compensating term --xi to terms which increase complexity the most. We have left unperturbed the

rows with only a single entry (necessarily equal to 1). This is again appropriate. Taking the first row of the first L = 8 matrix in Table 2 as an example, removing a nucleotide from a sequence of eight identical nucleotides leaves seven identical nucleotides regardless of any bias which may be present. This is reflected in the I,1 entry being equal to one and all other entries in the first row equalling zero.

Taking into account the provisos above, we exam- ined thousands of randomly and systematically generated perturbations. Provided only that the perturbations respect the sign pattern of a bias in complexity (see Table 2) we find an approximate

140 PEI-ER SALAM~N el al.

linear relationship between the surprisal of the it without risking negative entries. A similar exper- stationary distribution relative to P* and the corn- iment with x,s in the range [- RandScale, Rand- plexity. Figure l(a) shows a histogram of the R2 Scale], i.e. not respecting the sign pattern imposed in values which result from 1000 line fits of S(P’91P’) vs Table 2, yields the histogram in Fig. l(b). K,. Each of the 23 xis in the table were treated as The range of perturbations simulated in these uniform random variables in an interval experiments is very broad and encompasses the span [O,RandScale]. The experiment was repeated 100 of probable effects of some known mutational bias times for each value of RandScale = 0.1, 0.2, . . . , 1 .O. mechanisms in addition to many biologically unreal- Note that RandScale = 1 is as large as we can make istic mechanisms. An example of a mechanism that

BIASED

PERTURBATIONS

(4 250,

i 0.9 I

0.95

Corellation coefficient. Log-odds ratio vs Complexity

(b) 900

800=-

700-

600-

UNBIASED

PERTURBATIONS

500-

400-

300- _

200- - ‘-LI_

lOO- II I rtrrrr

0’ IIIIIIIIIIIIIIIIII1IIIII~ 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Corcllation co&?cicnt, Log-odds ratio vs Compl&ty

Fig. 1. Histogram of the R2 values resulting from the line fits of the log-odds ratio surprisal .S(P~““‘b”“mlPO) vs complexity K, from perturbations of the random dynamics model for L = 8. (a) The results from 1000 perturbations whose sign pattern obeys the bias toward lower complexity; (b) shows

the analogous results for perturbations of random sign.

Robustness of maximum entropy relationships 141

would be relatively closely modeled by this approach is neighbor-dependent bias in substitution mutations, as observed in mammalian pseudogenes (Li et al., 1984; Blake et OZ., 1992). This bias generates a much higher frequency of G-A and C-+T than other substitutions, but this frequency is strongly influ- enced by the base pairs neighboring the substitution sites. Further work is required to develop perturbation models that are constrained exactly to represent the dynamics of known mutational biases such as these. Even though the transition matrices used for this report do not have a formal one-to-one correspondence with particular sequence-specific mutational mechanisms, we are confident that the domain of perturbations explored in these simulations has more than covered the foreseeable effects of real biases. Thus we draw the tentative conclusion that almost any mechanism for the dynamics which incorporates a systematic bias toward simple sequences will lead to distributions of the sort observed in naturally occurring genome data.

4. STATIC MODELS GIVEN FURTHER INFORMATION

In the present section, we examine a modification of our random dynamics model which incorporates additional information. Specifically, we now compare our distribution P” to the multinomial distribution P’ which is the most random distribution consistent with given overall frequencies of the four nucleotides which we interpret as probabilities. Thus in a database with values of the individual nucleotide probabilities given by PA, PC, PG, and PT, the prior expectation of seeing the complexity state vector (4,3,3,0) becomes

P’(4,3,3,0) = R(4,3,3,0) [Pi PC P,$ + P”A PC P:

+ Pi P: P:, + P&Pi P:

+P;‘,P:P;+P:P:P;+P$P:P:

+P~P:P:+P~P:,P:.+P~P:Pb

+ P: P: P; + P’: Pi P&l, (8)

Of course, for P, = PC = PG = PT = a, P’ = PO. We make two observations regarding the distribution PI. The first is that for reasonable values of the prior probabilities PA, PC, PG, and PT, plotting the surprisal S(P’IP’) vs complexity again results in points which fit a straight line with RZ values in the 0.9-l range. A sample plot for PA = PC = 0.3 and PC = Pr = 0.2 is shown in Fig. 2(a). Even for less reasonable values of the prior probabilities the trend is quite good. Figure 2(b) shows the plot with PA =0.7 and PC = PC = PT = 0.1 for which the R2 value was 0.846. While this would at first glance explain the findings in Salamon & Konopka (1992), on closer examin- ation we see that the slopes of these relationships are much too small to account for the actual slopes

observed by a factor which ranges from 2 to 10 for the 35 databases considered in that paper.

The fact that S(P’jPo) is approximately linear in the complexity is not completely unexpected. The mean complexity K, of the complexity state distribution for actual nucleotide probabilities must be lower than for the case of uniform distribution over nucleotides due to the higher frequency of some of the bases. The surprising part is that this additional information is ‘balanced’ with regard to keeping the distribution of complexity states as random as possible. In other words, this distribution is approximately the distribution with maximum entropy consistent with the decreased value of the average complexity.

Our second observation is that plotting the surprisal of actual distributions of complexity states relative to P’, that is, S(PE’Ua’IP‘) complexity, again results in excellent fits to a straight line but with altered slopes. As pointed out by Salamon & Konopka (1992), the slopes of such relationships can be used for sequence classification.

We conclude this section by remarking that one could construct dynamic models based on prior distributions of P’ in the manner ofthe previous section. The simplest modification of our random dynamics model which would achieve this is to keep completely random the first step which chooses the nucleotide at which substitution is to take place but to choose the replacement nucleotide according to the distribution PA, PC, PO, and PT, rather than the distribution f $, $, f. While for an accurate representation of the dynamics, we can only lump to the level of compositional states (nn, nc, no. nT), we can lump further to the level of complexity states for purposes of comparing the lumped matrices to the ones given by the random dynamics model in the previous section. In fact, this allows us to view a deviation of the mononucleotide probabilities PA, PC, Pti, and PT, from the distribution $, $, f. $ as a perturbation of the random dynamics process described above. On gener- ating these lumped matrices (so far only for small window sizes), we find that the differences in the transition probabilities have complete agreement with the sign pattern of the perturbations biased toward lower complexity. (For L = 8, this sign pattern is shown in Tables 2 and 3.) In other words, changing the mononucleotide probabilities is a possible ‘mechanism’ for bias toward lower complexity states in the sense of the previous section, i.e. insofar as transitions toward lower complexity states are enhanced and transitions toward higher complexity states are reduced relative to the random dynamics.

We note also that the above analysis already guarantees that perturbations of dynamic models based on individual nucleotide frequencies must again result in similar outcomes. More specifically, perturbations which represent a coherent bias toward lower complexity will tend to appear close to linear in a surprisal vs complexity plot. The similarity to the

142 Pm-m SAL.~MON ef nl.

0.6

R squared = 0.9874 Slope = -1.021 Probs = 0.3. 0.3.0.2,0.2

-v. _

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Complexity

(b)

R squared = 0.8461 Slope= -10.42 prObsrO.7, 0.1,O.l.O.l

2-

O-

-2 -

0 -4

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Compkxity

Fig. 2. Surprisal complexity plot of S(Pmu”inom’dJP’) vs complexity where P m”“inemia’ is the distribution of complexity states given the individual mononucleotide probabilities. (a) The results for PA = PC = 0.3.

PC= PT = 0.2. @) The results for P, = 0.7, PC= PO = Pr = 0.1.

behavior in Section 3 follows since perturbations of the modified dynamics may be viewed as perturbations of perturbations of the random dynamics model. There is a slight difference since the modified dynamics already represents a perturbation ‘in the right direction’ and thus could cancel in part other perturbing influences which do not show a bias toward decreasing complexity.

5. LARGER WINDOWS To explore further the distribution of complexity

states and the robustness of the linear relationship between log-odds ratio S(P”‘““‘IP”) and complexity, K,, we have computed these relationships for much larger window lengths, 40 (632 complexity states), 80(4,263 states) and 120(13,561 states). The motiv- ation was that the very much larger numbers of


(b)

0.4 0.6 Complexity (L=l20, N=4)

0.8

0.84 0.86 0.88 0.90 0.92 0.94 0.96 Complexity (L=120, N=4)

Fig. 3(a,b). Caption overbzf

F%ER SALAMON et al.

“.,, _.. I . . * . . . . - . . _ - . - - - - - - - - - - - - - - - - -

I

0.0 0.2 0.4 0.6 0.8 Complexity (L=l20, N=4)

Fig. 3. Surprisal and log(frequency) plots for the GenBank Update data set of nucleotide sequences at window length L = 120. Each point on these plots represents one of the 13,561 complexity states of the four letter alphabet at L = 120. (a) Log-odds ratio (surprisal S(frequency observed[P)) vs complexity K,. (b) enlarged part of the plot in (a) showing only the high-complexity range. (c) Log(frequency observed)

complexity K,

states, compared with those for short oligonucleotides, might reveal something about the nature and magnitude of the deviations from linearity.

Using the methods defined in Section 2 and in Salamon & Konopka (1992) and Wootton & Feder- hen (1993), we computed the theoretical prior probabilities, PO, for uniform nucleotide frequencies of all complexity states for windows 40, 80 and 120. The observed frequencies of complexity states were computed for a large cross-section of recently determined nucleotide sequences, and for a first order random shuffle of this sequence set. This sample comprised all updates to the GenBank database entered between 16 June and 12 August 1992. Sequences containing letters other than A, C, G and T werre removed before this analysis. The resulting database, denoted ‘GenBank Update’ in this paper, contained 7318 sequences, 11,549,695 residues, from a wide range of species, lengths and functional classes. ‘Shuffled Gen- Bank Update’ is a first-order random shuffle of this database, that preserves the mononucleotide frequencies and exact set of sequence lengths of GenBank

Update, constructed as described for shuffled protein databases (Wootton & Federhen, 1993). A randomiz- ation of the total 11,549,695 nucleotides was first generated, then this string was segmented into the exact lengths of the 7318 original sequences, preserv- ing the same order of sequence lengths. This pro- cedure ensures that sequence end-effects are the same for analyses of both GenBank and Shuffled GenBank.

The plots (Figs 3 and 4) show remarkable, unexpected properties. For GenBank Update, the log- odds ratio plot has a strong central linear tendency [Fig. 3(a)], as for small windows, but shows additional structure in the off-central points in the high-complexity range. This structure, which we call ‘feathering’, consists of regular parallel curves that spread laterally on both sides of the central linear region [seen clearly in the enlarged part of the plot, Fig. 3(b)]. Similar feathering is seen in corresponding plots for windows 40 and 80 (not shown), the main difference being that the number of parallel ‘barbs’ in the feathers decreases with decreasing window length.


(W

0.0 I

0.2 I

0.4 0.6 Complexity (L=120, N=4)

0.8

0.88 0.88 0.90 0.92 Complexity (L=120, N=4)

0.94 0.96

Fig. 4(a,b). Cqrion overlearf.

Pmm SALAMON et al.

0.2 0.4 0.6 Complexity (L=120, N=4)

Fig. 4. Surprisal and log(frequency) plots for the Shuflled GenBank Update sequences at window length L = 120. The plots parallel those in Fig. 3. (a) Log-odds ratio (surprisal S(frequency observedIP)) vs complexity K,. (b) Enlarged part of the plot in (a) showing only the high complexity range, which for the Shuffled data set includes all the points for which observed frequency is non-zero. (c) Log(frequency

observed) complexity K,.

The plots for Shuffled GenBank Update [Fig. 4(a, b)] do not show a clear feathering pattern, although some indistinct strcuture, in the form of clustering of points rather than regular striation, is possibly present in the complexity range 0.91 to 0.94 [Fig. 4W.

The structures seen in Fig. 3(a,b) are attributes of the GenBank Update sequence set rather than the priors PO. This conclusion is supported by Fig. 3(c) (and the corresponding plots for window lengths 40 and 80, not shown), which plots the distribution of the log (base 4) of the frequencies of the complexity states in GenBank Update. This plot shows three types of structural patterns, of which the first two follow from simple combinatorial considerations and the third is the non-trivial equivalent of the feathering in Fig. 3(a,b).

1. Rare complexity states. Horizontal lines at the bottom of the distribution in Fig. 3(c) are given by the states whose number of occurrences are distinct small integers (1,2,3 etc. from bottom up). This effect also shows clearly with Shuffled GenBank Update

[Fig. 4(b,c)] which contains the same number of residues as GenBank Update.

2. Coloring lines. The central linear tendency of the log-odds ratio plots [Fig. 3(a, b)] is split into four distinct linear regions in the log (frequency) plot of Fig. 3(c). The four (approximate) lines show as parallel concentrations of points rising steeply towards the top right of the plot. These four lines correspond to states with four different values of F (number of colorings): 4, 6, 12 and 24. We call these lines ‘coloring lines’ in this paper. There is also a single point of highest complexity [extreme right of the Fig. 3(c) plot] for which F = 1. These five values of F are the only possible numbers given by the compositional combinations of the four letter alphabet. The corresponding pattern of four coloring lines shows very clearly in the plot for randomly-shuffled GenBank Update [Fig. 4(c)] and in plots of log (P”) complexity (not shown).

Coloring lines are explained as follows: the expected probabilities split into four regions simply because F is a term in PO: P” = F52/NL, and because


complexity states that have close values of K1 have similar values of R since K, = log(f2)/L. There- fore, complexity states with different F values (1,4,6,12,24) tend to be clearly distinct from each other in the absence of perturbing effects, as seen in Fig. 4(c). In the case of the log-odds ratio, the observed frequencies and expected probabilities are both split into corresponding coloring lines, so that the effect is cancelled out in the (observed/expected) division. Therefore, only a single central linear zone is seen in the log-odds ratio plots [Figs 3(a,b) and 4(a,b)]. Distributions derived from amino acid sequences, corresponding to those in Figs 3(c) and 4(c) (Wootton & Federhen, 1993), do not show discrete coloring lines. This is because the 20-letter alphabet generates very large numbers of possible values of F and the different zones merge into each other in the plotted distributions.

3. Striation and feathering. In Fig. 3(c), parallel regular striations, which tend away from the central, approximately-linear zones, occur at complexity values between 0.78 and 0.94. These correspond to the feathering on the log-odds ratio plots [Fig. 3(a,b)]. In Fig. 3(c) they are more conspicuous on the upper left of the central density of points than the lower right. The detailed, close correspondence of these striations to the feathering also shows clearly in the plots for window lengths 40 and 80 (not shown). This effect is characteristic of the natural sequence database (Gen- Bank Update) and does not show, except possibly for some irregular, vague clustering of points, in the plot for the shuffled database [Fig. 4(b,c)]. Also, striation is completely absent from plots of log(Pp complexity (not shown).

We do not yet have any satisfactory formal or empirical explanation for the feathering effect. To what extent does this structure reflect the abundance of particular functional classes of natural nucleotide sequences, and to what extent does it arise from underlying combinatorial relationships of the probabilities of different states? Possibly the upper and lower lines of feathering [Fig. 3(b)] represent sets of related local complexity states that have mutually-compensating frequencies. In other words, the states in the upper lines of feathering, which are present at higher than average frequency for their range of complexity, might have some special numerical relationship to the states in the lower lines of feathering that have lower than average frequency. Such systematic perturbations might arise from some types of mutational bias or high abundance of certain classes of low-complexity or periodic sequences. The GenBank Update database is certainly a complex mixture of several different sequence classes with different statistical properties. Further experiments using different classes of natural and simulated sequences are required to understand these novel aspects of sequence complexity.

6. CONCLUSIONS

We have explored the robustness of the maximum entropy principle found in Salamon & Konopka (1992) three ways. We saw that the relationship follows from biased substitution mutations in a random dyanmics model. We also saw that it holds for a multinomial distribution incorporating mononucleotide probabilities. Finally, we explored the extension of the analyses, previously carried out only for short oligonucleotides, to much larger subsequences.

Our findings indicate that the maximum entropy effect evidently follows from almost any mechanism for substitution mutation dynamics that incorporates a systematic bias toward low-complexity. The fact that incorporating mononucleotide probabilities can serve as an example of a ‘mechanism’ for such bias comes as a mild surprise. While accounting for some of the decrease in mean complexity, this mechanism only partially explains the maximum entropy distributions in naturally occurring functionally equivalent sequences predicting a bias toward low complexity nearly an order of magnitude too small.

The complexity distributions for more hetero- geneous samples of larger subsequences in the length range 6120 nucleotides display the linear tendency but in addition reveal a novel regularity of structure (‘feathering’) that is not yet explained.

Acknowledgements-FS gratefully acknowledges helpful conversations with James N&on and Benjamin Felts.

REFERENCES

Aho J. D., Hopcroft J. E. & Ulhnan J. D. (1974) The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading. Mass.

Anderseni., Hoffman K. H., Mosegaard K., Nulton J. D., Pedersen J. M. & Salamon P. (1988) J. Phvs. France 49, . _ _ 1485.

Bennet C. H. & Landauer R. (1985) Sci. Am. 253,48. Blake R. D., Hess S. T. & Nicholson-Tuell J. (X992) J. MO/.

Evol. 34, 189. Moulton D. M. & Wallace C. S. (1969) J. Theor. Biol. 23,

269. Britten R. J. & Kohne D. E. (1968) Science 161, 529. Chaitin G. J. (1964) JACM 13, 547. Garey M. & Johnson D. (1979) Computers and Intractabil-

ity: a Guide to the Theory of NP-Completness. Freeman, San Francisco, Calif.

Kemeny J. 0. & Snell J. L. (1959) Finite Mark00 Chains. Van Nostrand. Princeton. N.J.

Kolmogorov A. i1965) Proi. inform. Transmission 1, 3. Konopka A. K. (1990) In Human Genome Initiative and

DNA Recombination (Edited by Sarma R. & Sarma M.) p. 113. Adenine Press, New York.

Konopka A. K. & Owens J. (1990a) In Computers and DNA (Edited by Bell G. & Marr T.) p. 147. Addison-Wesley, Reading, Mass.

Konopka A. K. & Owens J. (1990b) Gene Anal. Tech. Appl. 7, 35.

Kullback S. (1959) Information Theory and Statistics. Wiley, New York.

Le W-H., Wu C.I. & Luo C. C. (1984) J. Mol. Evof. 21, 58.

148 PETER SALAMON et al.

Levine R. D. & Bemstein R. B. (1974) Ace. Chem. Res. 7, Traub J., Wasilkowski G. & Wozniakowski H. (1988) 393. Ittf~rmd~n -Based Complexity. Academic Press, Orlando,

MoQuarrie D. A. (1976) Statbrical Mechanics. Harper Br Fla. Row, New York. Wagner K. & Weohsung G. (1986) Computurional Complex -

Salamon P. & Konopka A. K. (1992) Compur. Chem. 16. 117.

iry. Reidel, Dordrocht. Wetmur J. G. & Davidson N. (1968) J. Mol. Biol. 31, 349.

Shannon C. E. (1948) Bell Cyst. Tech. J. 27, 379. Wootton J. C. & Federhen S. (t993) Comput. Chem. 17, Solomonoff R. (1964) Inform. Control 7, 224. 149.

Documents

On the robustness of maximum entropy relationships for complexity distributions of nucleotide sequences