




Physica A 327 (2003) 535–553
www.elsevier.com/locate/physa

A discrete autoregressive process as a model for short-range correlations in DNA sequences

M. Dehnert^b, W.E. Helm^a, M.-Th. Hütt^b,*

^a Mathematics and Science Faculty, University of Applied Sciences, D-64295 Darmstadt, Germany
^b Bioinformatics Group, Darmstadt University of Technology, D-64287 Darmstadt, Germany

Received 12 February 2003

Abstract

We present a direct way to model short- and medium-range correlations in DNA sequences and to separate them from long-range correlations. To do so, we discuss symbol sequences generated by a discrete autoregressive process of order p, DAR(p). These sequences display higher-order Markov properties but are based on very few parameters. The aim of our investigation is (1) to introduce with such DAR(p) processes a parameter-efficient tool for generating higher-order Markov processes on a discrete alphabet, (2) to study how the parameters of the process determine the statistical properties of the sequence and (3) to provide the mathematical tools for estimating the parameters from a given experimental sequence. The statistical properties of the generated sequences, expressed in terms of the parameters of the DAR(p) process, are monitored with methods from information theory. The implications of our findings for DNA sequences are discussed and an application is given. In particular, it is shown how short-range correlations in DNA sequences can be parameterised by such a process.
© 2003 Elsevier B.V. All rights reserved.

Keywords: Long-range correlation; DNA analysis; Entropy; Mutual information; Markov process of higher order; Discrete autoregressive process

* Corresponding author. Tel.: +49-6151-16-3202; fax: +49-6151-16-4630. E-mail address: [email protected] (M.-Th. Hütt).

0378-4371/03/$ - see front matter © 2003 Elsevier B.V. All rights reserved. doi:10.1016/S0378-4371(03)00399-6

1. Introduction

The systematic study of the correlation structure of DNA sequences is now already several decades old (see, e.g., Ref. [1] for a critical review). As one of the main results, it is by now well established that long-range correlations in DNA sequences exist [2–4]. The precise form of such correlations, as well as their biological origin and consequences, are still open problems. By comparing differences in the correlation structure of biologically different DNA sequences one may then relate certain forms of correlation to specific biological function. In general, long-range correlations are related to the large-scale organisation of a DNA sequence (e.g., its mosaic structure of patches of different composition, like coding regions or repetitive elements; see, e.g., Refs. [5,6]), while correlations on a shorter scale (between a few and several hundred basepairs) contain some information on the codon structure of coding DNA sequences, as well as on the tertiary structure of the DNA molecule. Such aspects of the correlation structure are directly related to regulatory function. In particular, the 3-basepair (bp) periodicity provides a useful distinction between coding and non-coding DNA sequences [7]. Higher periodicities (e.g., 10–11 bp and several hundred bp) can similarly be related to DNA structure, namely to the alpha-helical coiling and aspects of DNA evolution [8]. Certain observations on the scale of 100–200 bp in the correlation structure of the DNA sequence are related to DNA bending sites [9].

The capability of translating properties of the correlation structure into biological function is by no means fully developed so far. A distinction and quantitative representation of short-, intermediate- and long-range correlations is an important prerequisite for this task. A convenient tool for parameterising short- and intermediate-range correlations in DNA sequences is given by higher-order Markov processes, where, basically, the Markov order p can be thought of as the memory of the sequence. However, the usual way of mathematically representing higher-order Markov processes, namely by specifying all transition matrix elements, contains far more parameters, even at comparatively low Markov order, than can be estimated from the sequence at hand (for the four-letter DNA alphabet, an order-p transition matrix already has 3 × 4^p independent entries). Other methods, for example generating higher-order Markov sequences by applying a coarse-graining algorithm to a discrete nonlinear dynamical process [10,11], lack the flexibility required for describing correlations in DNA sequences. Older approaches to this problem (see, e.g., Refs. [12,13]) essentially reduce the number of parameters by making additional assumptions relating matrix elements. The remaining number of parameters, however, is still too large for efficient estimation from biological data.

In the present paper we adapt a mathematical model [14,15] for higher-order Markov processes with very few parameters to the parameterisation of short- and intermediate-range correlations in DNA sequences. The model is an autoregressive process operating on a discrete alphabet. We introduce this discrete autoregressive process of order p (DAR(p) process) in four steps. First, we formulate it mathematically in terms of an algorithm which allows the simulation of symbol sequences. Second, the role of the different parameters is studied using tools from information theory; for some special cases we have been able to derive analytical expressions for these relations. Third, an estimation theory for such discrete autoregressive processes is formulated. Fourth, we give an explicit example of how a parameterisation of a real DNA sequence is capable of accounting for short- and intermediate-range correlations. In particular, it is seen that with this method it is possible to fit a process of a Markov order as high as 30 to a given sequence. Being able to handle such high Markov orders constitutes a notable advantage compared to other, more conventional formulations of Markov processes.


2. Definitions

2.1. Stochastic processes and the Markov property

We discuss the statistical properties of symbol sequences, where each symbol is taken from an alphabet A. In the case of DNA sequences the alphabet consists of the four nucleotides A, C, G and T. Such symbol sequences can also be generated by stochastic processes. In this case statistical properties of the sequences reflect algorithmic properties of the process. Stochastic processes are families of random variables generated by probabilistic laws. The stochastic process {X_t, t ∈ N}, where we refer to the parameter t as time, is called a discrete-time process with the state space A [16]. In the following, the state space refers to an alphabet consisting of λ letters, A = {a_1, a_2, ..., a_λ} with λ ∈ N, λ > 1.

Let c_1, ..., c_n denote a substring of length n of an infinite symbol sequence given by a discrete-time process with the alphabet A. Here we assume a stationary stochastic process, where the probability of finding the block (c_1, c_2, ..., c_n) ∈ A^n in the full (infinite) sequence does not depend on the position at which the symbol segment appears in the sequence, i.e., P(X_{t+1} = c_1, ..., X_{t+n} = c_n) = P(X_1 = c_1, ..., X_n = c_n). In the following we will use p(c_1, ..., c_n) as a shorthand notation for P(X_1 = c_1, ..., X_n = c_n), except when the detailed notation is needed for clarity. For DNA sequences stationarity is a controversial assumption. Frequently one observes trends in local symbol densities along a given sequence. Recent investigations even show how deviations from stationarity can be exploited to identify some biologically relevant features of DNA sequences [17,18]. Here we checked explicitly that most of the observables discussed here are robust with respect to small deviations from stationarity.

A first-order Markov sequence is defined by the following property: for each n ∈ N and every segment (c_1, ..., c_{n+1}) ∈ A^{n+1} of length n + 1, the conditional probability p(c_{n+1} | c_1, ..., c_n) of finding c_{n+1} after observing (c_1, ..., c_n) as preceding symbols reduces to

p(c_{n+1} | c_1, ..., c_n) = p(c_{n+1} | c_n) .   (1)

Here p(c_{n+1} | c_n) denotes the conditional probability, defined for non-zero p(c_n) as

p(c_{n+1} | c_n) = p(c_{n+1}, c_n) / p(c_n) .   (2)

In this sense, a first-order Markov sequence is related to a one-step memory (see also Ref. [10]). One can extend this definition to pth-order Markov sequences by allowing the p preceding symbols to influence c_{n+1}, i.e., for each n, p ∈ N, every (c_1, ..., c_{n+1}) ∈ A^{n+1} and p(c_1, ..., c_n) > 0 one has

p(c_{n+1} | c_1, ..., c_n) = p(c_{n+1} | c_{n-p+1}, ..., c_n) .   (3)

Hence, a memory of p steps can be attributed to a sequence following Eq. (3).


2.2. Discrete autoregressive processes

An efficient method for generating a sequence of stationary discrete random variables with Markov properties is the discrete autoregressive process of order p, DAR(p). The Markov properties are reflected by the fact that the distribution of X_n only depends on X_{n-1}, ..., X_{n-p}. The process is specified by the stationary marginal distribution of X_n and several other parameters which, independently of the marginal distribution, determine the correlation structure of the sequence. Their respective roles will be discussed in Section 3.

The DAR(p) process is defined as follows [14,15]: Let A = {a_1, ..., a_λ} be an alphabet of λ letters, λ ∈ N, λ > 1. Let, furthermore, {Y_n} be a sequence of independent and identically distributed (IID) random variables with a marginal distribution π, where

P(Y_n = a_i) = π(a_i),  a_i ∈ A .   (4)

Let {V_n} be an independent sequence of random variables following a Bernoulli distribution, for which one has

P(V_n = 1) = 1 - P(V_n = 0) = ρ  with 0 ≤ ρ < 1 .   (5)

Let, furthermore, {A_n} be a sequence of independent random variables taking values in {1, 2, ..., p} with

P(A_n = i) = ϑ_i ≥ 0,  i = 1, 2, ..., p ,   (6)

where the parameter vector ϑ̃ = (ϑ_1, ϑ_2, ..., ϑ_p) is normalised to unity, Σ_{i=1}^{p} ϑ_i = 1. Then {X_n} given by

X_n = V_n X_{n-A_n} + (1 - V_n) Y_n  for n = 1, 2, ...   (7)

is called a DAR(p) process.

In a less formal way the DAR(p) process can be explained as follows: the value X_n is either taken out of the history of {X_n} (with probability ρ) or drawn randomly out of the alphabet A (with probability 1 - ρ). In this way V_n works like a switch between the two cases. In the case of V_n = 1, X_n is determined by going A_n steps back in the history of {X_n}, with A_n assuming values from {1, ..., p}, where the components of the parameter vector ϑ̃ provide the corresponding probabilities. Thus, with probability ρϑ_j, X_n = X_{n-j} for j = 1, ..., p. In the case of V_n = 0, X_n = Y_n is drawn randomly out of the alphabet A (following the marginal distribution π).

It has been shown in Ref. [14] that it is possible to select an initial distribution which yields a stationary sequence {X_n} with marginal distribution π. This initial distribution coincides with the marginal distribution π. DAR(1) is a class of ordinary (first-order) Markov chains whose behaviour is determined by the stationary distribution π and a single correlation parameter ρ. When appropriate, its use is straightforward, cf. Ref. [19].

This procedure differs essentially from representing a Markov process by a matrix of transition probabilities. Here a small number of effective parameters allows specific regulation of certain statistical properties of the sequence, while in the case of a transition probability matrix one has many independent parameters, each of which regulates only a minor aspect of the process. Note that by construction this process is ergodic and the generated sequences are stationary. As we will use the DAR(p) process as a model to fit data (e.g., DNA sequences), all these parameters need to be determined from the data. A detailed explanation of parameter estimation for the DAR(p) model is given in Appendix A.

It is generally agreed that the DAR(p) process is the most parsimonious Markov model of order p presently in use. The class of mixture transition distributions (MTD) of Raftery is more comprehensive but requires a larger number of parameters (see Refs. [12,13]). Starting with Ref. [19], DAR(p) processes have developed into one of several standard tools for modelling Internet traffic, a source which is also known to display long-range correlations.

3. Results

In the following we will discuss the different roles of the parameters entering the DAR(p) process, which we study by explicit simulation of finite sequences on the basis of Eq. (7). A convenient environment for studying such dependences is given by methods from information theory.

In the last few years such methods have emerged as efficient tools for revealing the correlation structure of a DNA sequence [7,20–27] and are now established enough to enter textbooks on the interpretation of biological data (see, e.g., Ref. [28]). In the first part (Section 3.1) we will focus on the conditional entropies h_n, as the (h_n, n)-plane provides a convenient platform for studying the parameter dependence of the DAR(p) process. In addition we discuss the parameter dependence of the mutual information function. The corresponding definitions are summarised in Appendix B.

Certain features of sequences generated with a DAR(p) process can be compared with statistical properties of DNA sequences. This is described in Section 3.2. We use the standard frequency estimators for h_n and I(k), but we omit the hat (ˆ) in our notation. We do not apply bias corrections and intentionally choose a range of block lengths n that displays the bias, e.g., the underestimation of h_n. However, our conclusions do not depend on the actual estimators being used, i.e., Bayesian estimators would work as well.

3.1. Dependence of conditional entropies and mutual information on the parameters of the DAR(p) process

A convenient way of studying the role of the parameter ρ in the DAR(p) process is to keep the Markov order constant. We first discuss the simplest case of a DAR(1) process. Fig. 1 shows the conditional entropies h_n for different values of ρ ∈ [0, 1). The sequence length L has been chosen to be of the order λ^n, where λ is the size of the alphabet. In all cases a clear kink in each curve at n = 1 allows characterising the corresponding sequences as generated by first-order Markov processes. We will see that a clearly visible kink at n = p in the series of h_n is the most important signature of a Markov sequence of order p, independent of the precise parameter values used to generate the sequence. The position of the curve h_n in the (h_n, n)-plane, however, is not directly related to the Markov order. In Fig. 1 it is seen that varying ρ allows sweeping the whole (h_n, n)-plane, the Markov order being fixed. As the position of the plateau (i.e., the asymptotic behaviour of h_n for large n) reflects the entropy of the source, we also see from Fig. 1 that the parameter ρ determines the stochasticity of the DAR(1) process, with larger ρ corresponding to a more deterministic sequence, while lowering ρ increases the stochasticity. Furthermore, it is seen that the systematic underestimation of the conditional entropies at fixed L (as given by the deviation from a horizontal line) for large n strongly depends on ρ. This is due to the fact that for small ρ (high stochasticity) the number of possible n-words actually appearing in the sequence is much higher than for larger ρ. In other words, the total number of different blocks of length n in a symbol sequence of length L based on an alphabet of λ letters increases with decreasing ρ. Consequently, a finite L has a much stronger effect.

Fig. 1. Conditional entropies h_n as a function of n for different values of ρ ∈ [0, 1) of a DAR(1) process. Parameter values for all curves: p = 1, ϑ̃ = (1), λ = 4, π(a_i) = 1/λ for i = 1, ..., λ, L = 1.05 × 10^6, η = 2. For the different curves ρ takes the values 0.1, 0.3, 0.5, 0.7 and 0.9.
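The experiment of Fig. 1 can be reproduced in outline by combining the simulate_dar sketch from Section 2.2 with the conditional-entropy estimator sketched in Appendix B (a minimal illustration of our own, with the parameter values of the Fig. 1 caption):

```python
# h_n for a DAR(1) process at several rho; the kink at n = 1 marks the
# first-order Markov property, the plateau height the entropy of the source.
for rho in (0.1, 0.3, 0.5, 0.7, 0.9):
    seq = simulate_dar(p=1, rho=rho, theta=[1.0], pi=[0.25] * 4, length=1_050_000)
    h = [conditional_entropy(seq, n) for n in range(8)]
    print(rho, [round(v, 3) for v in h])
```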

For this case some of the quantities can also be obtained analytically. This point is further discussed in Appendix C.

As briefly mentioned above, the most important signature of a Markov sequence of order p in the (h_n, n)-plane is the kink at n = p. This is seen in Fig. 2, where the conditional entropies h_n are shown for different p at fixed ρ = 0.9. The conditional entropy h_n decays monotonically for n ≤ p. Upon reaching the order p, the series of h_n stays constant until it begins to decrease again due to the systematic underestimation of entropies for finite sample size L. Note that in this case (i.e., at fixed ρ) the entropy of the source increases with the Markov order p.


Fig. 2. Conditional entropies h_n as a function of n for different orders p of a DAR(p) process with ρ = 0.9. Parameter values for all curves: ρ = 0.9, λ = 4, π(a_i) = 1/λ for i = 1, ..., λ, L = 3 × 10^7, η = 2. For the different curves p takes the values 1, 2, 3, 4, 5 and 6. In each case the vector ϑ̃ has p components, each given by 1/p.

The properties of ρ can to some extent be exploited to depict the series h_n for different Markov orders p more clearly, by allowing ρ to depend on p. An example of such a parameterisation ρ(p) is shown in Fig. 3, namely ρ = 3p/20. This choice reverses the order of the curves in the (h_n, n)-plane with respect to Fig. 2. In addition, for high p the kink at n = p is more clearly visible than in Fig. 2.

In Fig. 4 the separation of the curves h_n at fixed n = 4 for different p as a function of ρ is studied more systematically. There, Δh_4 = h_4|_{p=4} - h_4|_{p=3} is shown as a function of ρ. This curve has a pronounced peak around ρ = 0.8, corresponding to an optimal separation of the curves h_n in the (h_n, n)-plane. In general, the position of the peak depends on p. This dependence can, in principle, be used to construct an optimal parameterisation ρ(p). However, here we will not pursue this point any further.

In Figs. 2 and 3 the parameter vector ϑ̃ has been chosen to have uniform components (e.g., ϑ̃^(3) = (1/3, 1/3, 1/3); all parameter values are given in the figure captions). In general, the vector ϑ̃ of the DAR(p) process determines both the memory time and the memory strength of the sequence. The memory time is given by the dimension p of the vector ϑ̃, while the memory strength is determined by the components of ϑ̃.

Fig. 5 shows the conditional entropies h_n at fixed Markov order p = 6 for different choices of ϑ̃^(6), which are all given in the figure caption. The parameter ρ = 0.9 is kept constant. The vectors ϑ̃^(6), ϑ̃_A^(6), ϑ̃_B^(6), ϑ̃_C^(6) correspond to our previous choice, to a uniform increase in the components, to high correlations of each third symbol and to a uniform decrease (in the components), respectively. In the case of ϑ̃_C^(6) it is no longer possible to extract the Markov order p = 6 from the curve h_n. In all other cases the signature of p is still clearly distinguishable (e.g., the last plateau for ϑ̃_B^(6)).


Fig. 3. Conditional entropies h_n as a function of n with a parameterisation ρ(p) for different orders p of a DAR(p) process. Parameter values for all curves: λ = 4, π(a_i) = 1/λ for i = 1, ..., λ, L = 3 × 10^7, η = 2. For the different curves ρ takes the values 0.15, 0.3, 0.45, 0.6, 0.75 and 0.9. The parameter p takes the values 1–6. In each case the vector ϑ̃ has p components, each given by 1/p.

Fig. 4. Conditional entropies Δh_4 = h_4|_{p=4} - h_4|_{p=3} as a function of ρ. Parameter values for all curves: λ = 4, π(a_i) = 1/λ for i = 1, ..., λ, L = 3 × 10^7, η = 2. For the different curves ρ takes the values 0.0, 0.05, ..., 0.95. The parameter p takes the values 3 and 4. In each case the vector ϑ̃ has p components, each given by 1/p.

In all cases considered above each symbol in the alphabet had the same probability (marginal distribution of a Laplacian form). Any other choice for the marginal distribution π yields a global reduction of the uncertainty in the process, thus resulting in a global shift of all curves h_n to lower values (decrease of the Shannon entropy H).


Fig. 5. Conditional entropies h_n as a function of n at fixed Markov order p = 6 for different choices of ϑ̃^(6). Parameter values for all curves: p = 6, ρ = 0.9, λ = 4, π(a_i) = 1/λ for i = 1, ..., λ, L = 3 × 10^7, η = 2. For the different curves ϑ̃ is chosen as (1/6, ..., 1/6), (1/21, 2/21, ..., 6/21), (0.02, 0.06, 0.42, 0.02, 0.06, 0.42) and (6/21, 5/21, ..., 1/21).

Next, we turn to the second observable introduced in Appendix B, the mutual information. Fig. 6 shows the mutual information function I(k) at fixed Markov order p = 6 with our previous choices of the parameter vector ϑ̃, namely ϑ̃^(6), ϑ̃_A^(6), ϑ̃_B^(6) and ϑ̃_C^(6). Again, the parameter ρ = 0.9 is kept constant. Some properties of ϑ̃ are clearly seen in the mutual information function. For increasing components of ϑ̃^(p), the series I(k) increases until reaching the order p, after which it declines. The corresponding kink in the series of I(k) at k = 6 is a good visual indicator of the Markov order. A dominance of correlations at a distance of three symbols, as, e.g., given by ϑ̃_B^(6), leads to a sawtooth-like series of I(k). After reaching the Markov order p, the height of the peaks decreases visibly. For decreasing components of ϑ̃^(p), as in ϑ̃_C^(6), the series I(k) declines monotonically; a visual identification of the order p is no longer possible.

Fig. 6. Mutual information I(k) as a function of k at fixed Markov order p = 6 for different choices of ϑ̃^(6). Parameter values for all curves: p = 6, ρ = 0.9, λ = 4, π(a_i) = 1/λ for i = 1, ..., λ, L = 3 × 10^7, η = 2. For the different curves ϑ̃ is chosen as (1/6, ..., 1/6), (1/21, 2/21, ..., 6/21), (0.02, 0.06, 0.42, 0.02, 0.06, 0.42) and (6/21, 5/21, ..., 1/21).

3.2. Application to DNA sequences

On the basis of the definitions given in Appendix B it is straightforward to compute the conditional entropies and the mutual information function from DNA sequences. In Fig. 7 the series of conditional entropies h_n for a nucleotide sequence segment (with L = 28.5 × 10^6) of chromosome 21 of the human DNA is compared with the result for a Markov sequence of order 2. The result for the simulated sequence has been inserted to provide a reference point, clarifying to what extent the behaviour observed for the DNA sequence can in principle be accounted for by a simple DAR(p) process. Any deviation from this simple null hypothesis can then be further interpreted in terms of a more complex correlation structure. The series of conditional entropies h_n shown in Fig. 7 illustrates this point: some short-range, finite-size correlations give rise to a kink at n = 1 or 2, while on the larger scale in n the saturation found in the simulated result is replaced by a pronounced decrease much larger than statistically expected from the length L of the DNA sequence. This decrease is related to the distribution of repetitive sequences [22].

Recently, Holste et al. [27] applied such methods to analyse the full DNA sequence of the single human chromosome 22. In particular, they found power-law correlations in the mutual information function over four orders of magnitude with respect to the base pair (bp) distance k. Fig. 8 reproduces parts of their findings by showing the series I(k) up to k = 400. As before, our numerical studies presented in the previous section can be thought of as a convenient background for interpreting the details of such a curve. In particular, in Fig. 8 one can identify intermediate-range (from about 10 to 200 bp) correlations. For larger k the strong oscillations of I(k) vanish, which may be a signal of different length scales contributing to the full correlation structure. This is in agreement with Audit et al. [9], who performed a wavelet transform analysis of DNA segments characterised by specific structural properties (e.g., bending sites). They find two correlation regimes, where the smaller one has a characteristic scale of 100–200 bp.

As before, the curve for the DNA sequence is complemented by a simulated one. In this case, however, we are applying the estimation theory outlined in Appendix A.

We chose p = 30 in order to have the same order of magnitude for the first regime of correlations as found experimentally. This example illustrates how a DAR(p) process can be used for separating Markov and non-Markov contributions to the observed correlation structure. More precisely, it can be seen which aspects of the function I(k) can be accounted for by a DAR(p) process and which aspects have to be attributed to other mechanisms (e.g., to truly long-range correlations). Comparing the two curves in Fig. 9 one can see that an adequate approximation of the correlation structure of a DNA sequence by a DAR(p) process is possible in the regime of finite-range correlations.

Fig. 7. Conditional entropies h_n as a function of n for a segment of the human chromosome 21 (taken from http://www.ncbi.nlm.nih.gov/genome/guide/human/, unknown nucleotides have been omitted) and a sequence of Markov order p = 2. Parameter values for the curve: p = 2, ϑ̃ = (0.66, 0.34), ρ = 0.2, λ = 4, π(a_1) = 0.35, π(a_i) = 0.65/3 for i = 2, ..., 4, L = 2.5 × 10^7, η = 2.

Fig. 8. Mutual information I(k) as a function of k of the human chromosome 22, taken from http://www.sanger.ac.uk/HGP/Chr22/; unknown nucleotides have been omitted.

Fig. 9. Mutual information I(k) as a function of k of the human chromosome 22 and a sequence of Markov order p = 30, with parameters obtained by the estimation theory given in Appendix A and an estimated marginal distribution (computed from the chromosome 22). Parameter values for the curve: p = 30, ϑ̃ = (0.093859, 0.081318, 0.042895, 0.05427, 0.043915, 0.073231, 0.039743, 0.061165, 0.052678, 0.043788, 0.010774, 0.041955, 0.030217, 0.0266, 0.01033, 0.027503, 0.01058, 0.022885, 0.017494, 0.014203, 0.0077855, 0.02214, 0.0069881, 0.044579, 0.028879, 0.015222, 0.021023, 0.0085445, 0.016354, 0.029082), ρ = 0.5, λ = 4, π(a_1) = 0.26155, π(a_2) = 0.23904, π(a_3) = 0.23897, π(a_4) = 0.26044, L = 3.35 × 10^7, η = 2.

It is interesting to note that in most cases an attempt to exactly reproduce the properties of the mutual information function (e.g., the period-3 oscillation) leads to strong discrepancies for the sequence of conditional entropies, and vice versa. The two quantities thus address very different properties of the sequence and, in an analysis of real data, complement each other efficiently.

The simple mathematical model of higher-order Markov processes, which we represent in terms of a DAR(p) process, allows us to distinguish between finite-range correlations (of length p) and truly long-range correlations (as, e.g., given by correlations decreasing as a power law ∝ d^{-ω} with distance d).

4. Conclusions

With the help of tools from information theory we studied how the parameters of a discrete autoregressive process of order p influence the statistical properties of simulated symbol sequences. In particular, we analysed how the conditional entropies h_n for word length n and the mutual information I(k) as a function of the inter-symbol distance k depend on the parameters of a DAR(p) process. Such a process may serve as a model for studying short-range and intermediate-range (but finite) correlations in symbol sequences. In particular, we found (1) how the entropy of the source changes with the stochasticity in the sequence, (2) how robust the main signature of the Markov order p in the (h_n, n)-plane, namely the clear kink at n = p, is with respect to variations of the local correlation structure and stochasticity, and (3) what local correlation structure can reproduce the fast oscillations in the mutual information function found when analysing real DNA sequences.

In an attempt to relate these findings to experiment, we reproduced parts of a recent result from Ref. [27] and studied it with the help of the statistics-signature relations established above. In particular, we used a parameter estimation procedure to derive some of the parameters of our DAR(p) process and compared I(k) obtained from DNA sequences with the results from such simulated sequences. By this method, we can distinguish between contributions from finite (Markov-like) correlations and long-range correlations present in the experimental sequence. We find that some aspects of the mutual information function can be interpreted as additional short- and intermediate-range contributions to the correlation structure.

In general, the simple mathematical model of higher-order Markov processes, which we represent in terms of a DAR(p) process, allows us to approximately distinguish between finite-range correlations (of length p) and truly long-range correlations.

Methods from information theory are on the verge of becoming an important way of looking at the correlation structure of DNA sequences. The main task is to find sound biological interpretations for the various aspects of the mutual information function and the sequence of conditional entropies. In the long run the aim of such methods is to relate results from information theory to specific biological and structural properties of the sequences. Generally, we see a development towards multi-step model building, where each step takes care of specific features of DNA sequences. Our present paper describes a step that concentrates on short- and medium-range correlations. One of the questions on the way is: what signature do certain statistical properties of a symbol sequence leave in the information-theoretical observables? With the present paper we have attempted to provide some steps in this direction. We expect that, particularly for the class of hidden Markov models (HMM), see Refs. [13,29,30], such DAR(p) processes can turn out to be a useful ingredient in the process of parameter estimation.

Acknowledgements

We would like to thank Annette Hurst for an introduction to gene databases, and Jan Freund and Rolf Schassberger for fruitful discussions on the first draft of our paper. One of us (M.H.) acknowledges support from the Deutsche Forschungsgemeinschaft (contract HU 937/1).

Appendix A. Parameter estimation

In this appendix we introduce a parameter estimation procedure for DAR(p) models based on the Yule–Walker equations.

Page 14: A discrete autoregressive process as a model for short ... · processes. In this case statistical properties of the sequences reect algorithmic proper-ties of the process. Stochastic

548 M. Dehnert et al. / Physica A 327 (2003) 535–553

Let {X_n} be a stationary DAR(p) process with marginal distribution π, parameter ρ and parameter vector ϑ̃ = (ϑ_1, ϑ_2, ..., ϑ_p) as defined in Eq. (7). Let m = E(X_n) = E(Y_n). In order to use the established estimation theory for DAR(p) processes we map our symbolic alphabet onto a set of numerical values, e.g. {1, ..., 4}, without change of notation for X_n and Y_n. We have checked explicitly that the following results do not depend on the particular choice of the mapping.

Centring the X_n's as X'_n = X_n - m, then multiplying Eq. (7) by X'_{n-k}, k > 0, and taking expectations, we obtain [14]

E(X'_{n-k} X'_n) = ρϑ_1 E(X'_{n-k} X'_{n-1}) + ρϑ_2 E(X'_{n-k} X'_{n-2}) + ... + ρϑ_p E(X'_{n-k} X'_{n-p}) + (1 - ρ) E(X'_{n-k} (Y_n - m)) .   (A.1)

Eq. (A.1) refers to the autocovariances; dividing both sides by the variance of X_n, we obtain the corresponding version for the autocorrelations r(k):

r(k) = ρϑ_1 r(k-1) + ρϑ_2 r(k-2) + ... + ρϑ_p r(k-p),  k ≥ 1 .   (A.2)

In this way we obtain a set of Yule–Walker equations

r(1) = ρϑ_1 r(0) + ρϑ_2 r(1) + ... + ρϑ_p r(p-1) ,   (A.3)
r(2) = ρϑ_1 r(1) + ρϑ_2 r(0) + ... + ρϑ_p r(p-2) ,   (A.4)
⋮
r(p) = ρϑ_1 r(p-1) + ρϑ_2 r(p-2) + ... + ρϑ_p r(0)   (A.5)

and, for l ≥ 1,

r(p+l) = ρϑ_1 r(p+l-1) + ρϑ_2 r(p+l-2) + ... + ρϑ_p r(l) ,   (A.6)

where r(0) = 1. Given r(1), r(2), ..., r(p), the first p equations can be solved for the p parameters ϑ_1, ..., ϑ_{p-1} and ρ. The parameter ϑ_p is given by 1 - ϑ_1 - ... - ϑ_{p-1}.

In order to estimate the process parameters, we will first estimate the r(k) and then solve (A.3)–(A.5) to get the required estimates for ϑ_1, ..., ϑ_{p-1}, ρ. There are different methods to estimate the r(k). We will use the ad hoc estimator, which is shown to be strongly consistent in Ref. [15] and performed very well in that study. It can be evaluated directly on the basis of the symbolic alphabet.

Let {x_n} be a realisation of length m of a stationary DAR(p) process (7) with alphabet A = {a_1, a_2, ..., a_λ}, λ ∈ N, λ > 1. The ad hoc estimator is defined as follows [15]:

r̂(k) = 1 - Σ_{a_i ∈ A} B_m(k, a_i) · 1/(1 - π(a_i))   (A.7)

for k = 1, 2, ... with

B_m(k, a_i) = (1/(m-k)) Σ_{a_j ∈ A, a_j ≠ a_i} Σ_{l=1}^{m-k} δ_{a_i}(x_l) δ_{a_j}(x_{l+k}) ,   (A.8)

where δ_y(x) = 1 if x = y and 0 otherwise.
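To make the estimation procedure concrete, here is a minimal Python sketch (our own illustration; the names adhoc_autocorr and fit_dar are our choices). It implements the ad hoc estimator (A.7)/(A.8) and solves the Toeplitz system (A.3)–(A.5) for the products ρϑ_i, from which ρ = Σ_i ρϑ_i follows because Σ_i ϑ_i = 1:

```python
import numpy as np

def adhoc_autocorr(x, k, pi):
    """Ad hoc estimator r_hat(k) of Eqs. (A.7)/(A.8).

    x  -- symbol sequence as an integer array
    pi -- estimated marginal probabilities pi[a] of the symbols
    """
    m = len(x)
    r = 1.0
    for a in range(len(pi)):
        # B_m(k, a): relative frequency of pairs with x_l = a and x_{l+k} != a
        b = np.mean((x[: m - k] == a) & (x[k:] != a))
        r -= b / (1.0 - pi[a])
    return r

def fit_dar(x, p):
    """Estimate (rho, theta, pi) of a DAR(p) model from a symbol sequence x
    by solving the Yule-Walker equations (A.3)-(A.5)."""
    pi = np.bincount(x) / len(x)
    r = np.array([adhoc_autocorr(x, k, pi) for k in range(1, p + 1)])
    full = np.concatenate(([1.0], r))  # r(0), r(1), ..., r(p) with r(0) = 1
    # Equation k (k = 1..p) has coefficients r(|k - i|) for the unknowns
    # b_i = rho * theta_i, i = 1..p.
    R = np.array([[full[abs(k - i)] for i in range(p)] for k in range(p)])
    b = np.linalg.solve(R, r)
    rho = b.sum()          # since theta_1 + ... + theta_p = 1
    theta = b / rho
    return rho, theta, pi

# Example: recover the parameters of a simulated DAR(3) sequence
# (simulate_dar from the sketch in Section 2.2)
seq = simulate_dar(p=3, rho=0.8, theta=[0.5, 0.3, 0.2], pi=[0.25] * 4,
                   length=500_000)
print(fit_dar(seq, p=3))
```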

Appendix B. Conditional entropies and mutual information

Methods from information theory are well established for analysing the correlation properties of DNA sequences. The mutual information function, e.g., has been employed to detect long-range correlations as well as short periodicities, as in protein-coding DNA [23,24,27]. It has been shown that the mutual information function for coding sequences differs strongly from that for non-coding segments [7]. The Shannon entropy and its generalisations are applied, e.g., to find borders between coding and non-coding DNA [26] or repetitive segments [27].

The following definitions, Eqs. (B.1)–(B.5), are all adapted from Shannon's seminal work on information theory [31].

Let p(a_i) be the probability of finding the symbol a_i ∈ A in the full sequence. Then the (Shannon) entropy is defined as

H = -Σ_{i=1}^{λ} p(a_i) log_η p(a_i) .   (B.1)

The quantity H establishes a measure for the mean uncertainty of a process. A frequent choice of the base η of the logarithm, which is also found in Shannon's original work [31], is η = 2, giving all information in binary units (bits). Alternatively, choosing η equal to the size λ of the alphabet underlying a given sequence allows comparison of symbol sequences based on alphabets with different sizes.

Let p(c_1, ..., c_n) be the probability of finding the symbol segment (c_1, ..., c_n) ∈ A^n of length n in the full sequence. The entropy per block of length n is defined by

H_n := -Σ_{(c_1,...,c_n) ∈ A^n} p(c_1, ..., c_n) log_η p(c_1, ..., c_n) .   (B.2)

The quantity H_n expresses the mean uncertainty for the prediction of blocks of length n. The quantity

h_n := H_{n+1} - H_n   (B.3)

for n = 1, 2, ... and h_0 := H_1 is termed conditional entropy, where the name derives from the fact that h_n can be expressed by the conditional probabilities of the symbol c_{n+1}:

h_n = -Σ_{(c_1,...,c_n) ∈ A^n} p(c_1, ..., c_n) Σ_{c_{n+1} ∈ A} p(c_{n+1} | c_1, ..., c_n) log_η p(c_{n+1} | c_1, ..., c_n) .   (B.4)

The quantity h_n expresses the mean uncertainty for the prediction of the symbol c_{n+1}, provided the preceding n symbols c_1, ..., c_n are known. Thus, the conditional entropy h_n quantifies the predictability of a process [10]. The quantity h := lim_{n→∞} h_n is termed the entropy of the source [32]. The conditional entropy for a random uncorrelated sequence (Bernoulli sequence) obeys h_1 = h_2 = ... = h. Correlations within a sequence yield a decreasing conditional entropy. A pth-order Markov sequence gives h_p = h_{p+1} = ... = h, and periodic sequences of period q yield h_q = h_{q+1} = ... = h = 0.
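A minimal frequency-based estimator of H_n and h_n (the standard estimator used in Section 3, without bias correction; our own illustration in Python, with symbols encoded as integers) can be sketched as follows:

```python
import numpy as np
from collections import Counter

def block_entropy(x, n, base=2):
    """Frequency estimate of the block entropy H_n of Eq. (B.2)."""
    if n == 0:
        return 0.0
    counts = Counter(tuple(x[i:i + n]) for i in range(len(x) - n + 1))
    freq = np.array(list(counts.values()), dtype=float)
    prob = freq / freq.sum()           # relative frequencies of the n-blocks
    return -np.sum(prob * np.log(prob)) / np.log(base)

def conditional_entropy(x, n, base=2):
    """h_n = H_{n+1} - H_n of Eq. (B.3); h_0 reduces to H_1."""
    return block_entropy(x, n + 1, base) - block_entropy(x, n, base)
```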

Another measure quantifying statistical dependences within a symbol sequence is the mutual information function. Let p^{(k)}(c_1, c_2) denote the joint probability of finding the symbol c_1 ∈ A and the symbol c_2 ∈ A at a distance k (i.e., with k - 1 symbols between them in a given sequence). Under the assumption of stationarity, the joint probability p^{(k)}(c_1, c_2) only depends on the distance k of the symbols (c_1, c_2) ∈ A^2 within the sequence. The mutual information function is defined as

I(k) := Σ_{(c_1,c_2) ∈ A^2} p^{(k)}(c_1, c_2) log_η [ p^{(k)}(c_1, c_2) / (p(c_1) p(c_2)) ] .   (B.5)

The quantity I(k) quantifies the amount of information which can be obtained from the symbol c_1 about a later symbol c_2 that is located k symbols away from c_1 within the sequence.
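The corresponding frequency estimator for I(k) (again without bias correction; a sketch under the same integer-alphabet convention as above):

```python
import numpy as np

def mutual_information(x, k, alphabet=4, base=2):
    """Frequency estimate of the mutual information I(k) of Eq. (B.5)."""
    x = np.asarray(x)
    joint = np.zeros((alphabet, alphabet))
    np.add.at(joint, (x[:-k], x[k:]), 1.0)   # counts of symbol pairs at distance k
    joint /= joint.sum()                     # joint probabilities p^(k)(c1, c2)
    p = np.bincount(x, minlength=alphabet) / len(x)
    nz = joint > 0                           # convention 0 log 0 := 0
    ratio = joint[nz] / np.outer(p, p)[nz]
    return np.sum(joint[nz] * np.log(ratio)) / np.log(base)
```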

The mutual information function is related to the Kullback information [33] and can be expressed as a difference of entropies [34].

A finite sample size yields a systematic error in the estimates of the entropy H_n. This systematic underestimation of H_n has already been extensively investigated, e.g., in Refs. [35,36]. It increases with decreasing sample size L, increasing alphabet size λ and increasing word length n [35,37].

Let M_λ^{(n)} < λ^n be the total number of different blocks of length n in a symbol sequence of length L based on an alphabet of λ letters. The bias of the estimate of H_n is of the order (M_λ^{(n)} - 1)/(2L ln η) (see Ref. [23] and references therein). A more refined view on word count distributions for Markov sequences can be found in Refs. [38,39]. The systematic underestimation of H_n yields a systematic underestimation of h_n. The bias of the estimate of h_n is of the order (M_λ^{(n+1)} - M_λ^{(n)})/(2L ln η). The systematic underestimation of H_n yields a systematic overestimation of I(k) [23], with a bias of the order (λ - 1)^2/(2L ln η).

Appendix C. Analytical results

In this appendix we present some analytical aspects of the DAR(p) process. For a DAR(2) process we derive the joint probabilities P(X_{n+1} = c_{n+1}, X_n = c_n) analytically.


Let {X_n} be a stationary DAR(p) process as defined in Eq. (7). The DAR(p) process is specified by the marginal distribution π of X_n, where there is no limitation on π. Independently of the marginal distribution, the correlation structure is determined by several other parameters. This yields P(X_n = a_i) = π(a_i), a_i ∈ A, and the particular formula for the conditional probabilities with exactly p steps in the condition (cf. (1.5) in Ref. [15])

P(X_{n+1} = c_{n+1} | X_{n-p+1} = c_{n-p+1}, ..., X_n = c_n) = (1 - ρ) π(c_{n+1}) + Σ_{k=1}^{p} ρϑ_k δ_{c_{n+1}}(c_{n+1-k}) ,   (C.1)

where (c_1, ..., c_{n+1}) ∈ A^{n+1} and δ_y(x) = 1 if x = y and 0 otherwise. The one-step conditional probabilities are given by

P(X_{n+1} = c_{n+1} | X_n = c_n) = (1 - ρ) π(c_{n+1}) + ρϑ_1 δ_{c_{n+1}}(c_n) + Σ_{k=2}^{p} ρϑ_k P(X_n = c_n, X_{n-k+1} = c_{n+1}) / P(X_n = c_n) .   (C.2)

For p = 2 we obtain a system of linear equations for the joint probabilities P(X_{n+1} = c_{n+1}, X_n = c_n) by multiplying Eq. (C.2) with P(X_n = c_n) = π(c_n):

P(X_{n+1} = c_{n+1}, X_n = c_n) = (1 - ρ) π(c_{n+1}) π(c_n) + ρϑ_1 δ_{c_{n+1}}(c_n) π(c_n) + ρϑ_2 P(X_n = c_{n+1}, X_{n+1} = c_n) ,   (C.3)

where c_n, c_{n+1} ∈ A. The solution of the system for P(X_{n+1} = c_{n+1}, X_n = c_n) is given by

P(X_{n+1} = c_{n+1}, X_n = c_n) = (1 - ρ) π(c_{n+1}) π(c_n) · 1/(1 - ρϑ_2) ,  c_{n+1} ≠ c_n ,   (C.4)

P(X_{n+1} = c_{n+1}, X_n = c_n) = (1 - ρ) π(c_{n+1}) π(c_n) · 1/(1 - ρϑ_2) + π(c_n) ρϑ_1/(1 - ρϑ_2) ,  c_{n+1} = c_n .   (C.5)

These results can be compared with our simulations, where we compute estimates of these joint probabilities. The agreement is excellent, showing that the estimates indeed converge towards the values in (C.4) and (C.5).
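Eqs. (C.4) and (C.5) can be checked directly against a simulation; a sketch with illustrative parameter values of our own choosing (simulate_dar from the sketch in Section 2.2):

```python
import numpy as np

rho, theta, pi = 0.6, [0.7, 0.3], [0.25] * 4   # illustrative DAR(2) parameters

def joint_analytic(c1, c0):
    """P(X_{n+1} = c1, X_n = c0) from Eqs. (C.4)/(C.5)."""
    val = (1 - rho) * pi[c1] * pi[c0] / (1 - rho * theta[1])
    if c1 == c0:
        val += pi[c0] * rho * theta[0] / (1 - rho * theta[1])
    return val

seq = simulate_dar(p=2, rho=rho, theta=theta, pi=pi, length=1_000_000)
emp = np.zeros((4, 4))
np.add.at(emp, (seq[1:], seq[:-1]), 1.0)   # empirical pair counts
emp /= emp.sum()
print(emp[0, 0], joint_analytic(0, 0))     # diagonal: Eq. (C.5)
print(emp[0, 1], joint_analytic(0, 1))     # off-diagonal: Eq. (C.4)
```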

References

[1] W. Li, The study of correlation structures of DNA sequences: a critical review, Computers Chem. 21 (4) (1997) 257–271.
[2] W. Li, K. Kaneko, Long-range correlation and partial 1/f^α spectrum in a noncoding DNA sequence, Europhys. Lett. 17 (7) (1992) 655–660.
[3] C.-K. Peng, S.V. Buldyrev, A.L. Goldberger, S. Havlin, F. Sciortino, M. Simons, H.E. Stanley, Long-range correlations in nucleotide sequences, Nature 356 (1992) 168–170.
[4] R.F. Voss, Evolution of long-range fractal correlations and 1/f noise in DNA base sequences, Phys. Rev. Lett. 68 (25) (1992) 3805–3808.
[5] G. Bernardi, B. Olofsson, J. Filipski, M. Zerial, J. Salinas, G. Cuny, M. Meunier-Rotival, F. Rodier, The mosaic genome of warm-blooded vertebrates, Science 228 (1985) 953–958.
[6] S. Karlin, V. Brendel, Patchiness and correlations in DNA sequences, Science 259 (1993) 677–680.
[7] I. Grosse, H. Herzel, S.V. Buldyrev, H.E. Stanley, Species independence of mutual information in coding and noncoding DNA, Phys. Rev. E 61 (5) (2000) 5624–5629.
[8] E.N. Trifonov, 3-, 10.5-, 200- and 400-base periodicities in genome sequences, Physica A 249 (1998) 511–516.
[9] B. Audit, C. Vaillant, A. Arneodo, Y. d'Aubenton-Carafa, C. Thermes, Long-range correlations between DNA bending sites: relation to the structure and dynamics of nucleosomes, J. Mol. Biol. 316 (2002) 903–918.
[10] W. Ebeling, J. Freund, K. Rateitschak, Entropy and extended memory in discrete chaotic dynamics, Int. J. Bifurc. Chaos 6 (4) (1996) 611–625.
[11] J. Freund, Symbolic dynamics approach to stochastic processes, in: L. Schimansky-Geier, T. Pöschel (Eds.), Stochastic Dynamics, Springer, Berlin, 1997.
[12] A. Raftery, S. Tavaré, Estimation and modelling repeated patterns in high order Markov chains with the Mixture Transition Distribution model, Appl. Stat. 43 (1) (1994) 179–199.
[13] I.L. MacDonald, W. Zucchini, Hidden Markov and Other Models for Discrete-valued Time Series, Chapman & Hall, London, 1997.
[14] P.A. Jacobs, P.A.W. Lewis, Discrete time series generated by mixtures III: autoregressive processes (DAR(p)), Naval Postgraduate School, Monterey, CA, 1978.
[15] P.A. Jacobs, P.A.W. Lewis, Stationary discrete autoregressive-moving average time series generated by mixtures, J. Time Ser. Anal. 4 (1) (1983) 19–36.
[16] M. Kijima, Markov Processes for Stochastic Modeling, Chapman & Hall, London, 1997.
[17] I. Grosse, P. Bernaola-Galván, P. Carpena, R. Román-Roldán, J. Oliver, H.E. Stanley, Analysis of symbolic sequences using the Jensen–Shannon divergence, Phys. Rev. E 65 (2002) 041905.
[18] W. Li, P. Bernaola-Galván, F. Haghighi, I. Grosse, Applications of recursive segmentation to the analysis of DNA sequences, Comput. Chem. (informatics and the genome special issue) 26 (5) (2002) 491–510.
[19] D.P. Heyman, A. Tabatabai, T.V. Lakshman, Statistical analysis and simulation study of video teleconference traffic in ATM networks, IEEE Trans. Circuits Syst. Video Technol. 2 (1) (1992) 49–59.
[20] L. Gatlin, Information Theory and the Living System, Columbia University Press, New York, 1972.
[21] W. Li, Mutual information functions versus correlation functions, J. Stat. Phys. 60 (1990) 823–837.
[22] H. Herzel, W. Ebeling, A.O. Schmitt, Entropies of biosequences: the role of repeats, Phys. Rev. E 50 (6) (1994) 5061–5071.
[23] H. Herzel, I. Grosse, Measuring correlations in symbol sequences, Physica A 216 (1995) 518–542.
[24] H. Herzel, I. Grosse, Correlations in DNA sequences: the role of protein coding segments, Phys. Rev. E 55 (1) (1997) 800–810.
[25] T.G. Dewey, Statistical mechanics of protein sequences, Phys. Rev. E 60 (4) (1999) 4652–4658.
[26] P. Bernaola-Galván, I. Grosse, P. Carpena, J. Oliver, R. Román-Roldán, H.E. Stanley, Finding borders between coding and noncoding DNA regions by an entropic segmentation method, Phys. Rev. Lett. 85 (6) (2000) 1342–1345.
[27] D. Holste, I. Grosse, H. Herzel, Statistical analysis of the DNA sequence of human chromosome 22, Phys. Rev. E 64 (4) (2001) 041917.
[28] M.-Th. Hütt, Datenanalyse in der Biologie, Springer, Berlin, 2001.
[29] G. Churchill, Hidden Markov chains and the analysis of genome structure, Computers Chem. 16 (1992) 107–115.
[30] P. Baldi, S. Brunak, Bioinformatics: the Machine Learning Approach, 2nd Edition, MIT Press, Cambridge, MA, 2001.
[31] C.E. Shannon, A mathematical theory of communication, Bell System Tech. J. 27 (1948) 379–423, 623–656.
[32] A.I. Khinchin, Mathematical Foundations of Information Theory, Dover, New York, 1957.
[33] W. Ebeling, J. Freund, F. Schweitzer, Komplexe Strukturen: Entropie und Transinformation, Teubner Verlag, Leipzig, 1998.
[34] H. Herzel, W. Ebeling, The decay of correlations in chaotic maps, Phys. Lett. A 111 (1–2) (1985) 1–4.
[35] H. Herzel, A.O. Schmitt, W. Ebeling, Finite sample effects in sequence analysis, Chaos Solitons Fractals 4 (1) (1994) 97–113.
[36] I. Grosse, Estimating entropies from finite samples, in: J. Freund (Ed.), Dynamik–Evolution–Strukturen, Verlag Dr. Köster, Berlin, 1996.
[37] A.O. Schmitt, H. Herzel, W. Ebeling, A new method to calculate higher-order entropies from finite samples, Europhys. Lett. 23 (5) (1993) 303–309.
[38] G. Reinert, S. Schbath, M.S. Waterman, Probabilistic and statistical properties of words: an overview, J. Comput. Biol. 7 (1–2) (2000) 1–46.
[39] S. Schbath, An overview on the distribution of word counts in Markov chains, J. Comput. Biol. 7 (1–2) (2000) 193–201.