
Electr Eng (2007) 90:161–173
DOI 10.1007/s00202-007-0062-6

ORIGINAL PAPER

Information and communication theory in molecular biology

Pavol Hanus · Bernhard Goebel · Janis Dingel · Johanna Weindl · Juergen Zech · Zaher Dawy · Joachim Hagenauer · Jakob C. Mueller

Received: 5 January 2007 / Accepted: 13 January 2007 / Published online: 4 April 2007
© Springer-Verlag 2007

Abstract The DNA sequencing efforts of the past years together with rapid progress in sequencing technology have generated a huge amount of sequence data available in public molecular databases. This recent development makes it statistically feasible to apply universal concepts from Shannon’s information theory to problems in molecular biology, e.g. to use mutual information for gene mapping and phylogenetic classification. Additionally, the genetic information in the cell is continuously subject to mutations. However, it has to be passed from generation to generation with high fidelity, raising the question of the existence of error protection and correction mechanisms similar to those used in technical communication systems. Finally, a better understanding of genetic information processing on the molecular level in the cell can be acquired by looking for parallels to well-established models in communication theory, e.g. there exist analogies between gene expression and frame synchronization.

P. Hanus (✉) · B. Goebel · J. Dingel · J. Weindl · J. Hagenauer
Institute for Communications Engineering, TU-München, Arcisstr. 21, 80290 Munich, Germany
e-mail: [email protected]

J. Zech
Institute for Statistical Medicine and Epidemiology, TU-München, Clinic “Rechts der Isar”, Ismaninger Str. 22, 81675 Munich, Germany

Z. Dawy
Department of Electrical and Computer Engineering, American University of Beirut, P.O. Box 11-0236, Riad El-Solh, Beirut 1107 2020, Lebanon

J. C. Mueller
Max Planck Institute for Ornithology, Postfach 1564, Haus Nr. 5, 82319 Seewiesen, Germany

Keywords Information theory · Communication theory · Molecular genetics · Classification · Gene mapping · Error correction · Frame synchronization

1 Introduction

Communications engineering and genetics both experienced a major breakthrough in the mid twentieth century. In 1953, the double helix structure of DNA was deciphered by Watson and Crick. From this point on it was clear that genetic information is stored in the form of two complementary directed strands composed of letters from a four-symbol alphabet. Until the discovery of the molecular basis of genetics, research concentrated on classical genetics, based on the rules of Mendelian inheritance of traits. Shannon [21] himself used mathematics to study how different trait combinations propagate through several generations of breeding in his Ph.D. thesis, completed in 1940. He devised a general expression for the distribution of several linked traits in a population after multiple generations under a random mating system, which was original at that time but went largely unnoticed, since he did not publish his work. After completing his Ph.D. thesis, Shannon shifted his focus towards digital communications and cryptography.

In 1948, Shannon [22] established the theoretical fundamentals of digital communication systems. He introduced a concept of information based solely on the statistical characteristics of the information source. He defined information in an abstract way, independent of semantics, that does not differentiate between text, video or audio, as was generally being done when studying communication systems at that time. Using this definition of information, Shannon proved that a message generated by an information source can be losslessly compressed to the entropy of the source (source coding theorem) and that it is possible to code the information in such a way that it can be transmitted virtually error-free at rates up to the maximum rate the channel allows, its capacity (channel coding theorem). Ever since, communications engineers have been devising algorithms to approach the limits of these two theorems. The definition of information based solely on statistical characteristics of the information source also applies to genetic data. Recent advances in DNA sequencing technology supply enough data to apply Shannon’s general information concept to molecular biology. Section 2 gives a short introduction to the basic principles of molecular biology required for a better understanding of the following sections. In Sect. 3 we show how mutual information and compression can be used for phylogenetic classification. Section 4 describes the application of mutual information to gene mapping. The question whether an error correcting code has evolved on the genome sequence level is addressed in Sect. 5. Finally, in Sect. 6 we model transcription initiation (one step in protein synthesis) as frame synchronization in a communication system.

The original involvement of information theorists with molecular genetics goes back to the discovery of the genetic code. In the period between the discovery of the DNA structure in 1953 and the decipherment of the genetic code in 1961–1969, when no actual DNA sequences and only very few amino acid sequences were known, several different coding schemes describing the mapping of the DNA sequence (four-letter alphabet) to a protein (amino acid sequence from a 20-letter alphabet) were proposed by coding theory experts. Some of them had high information density, while others provided for error correction capabilities. The experimental discovery of the actual genetic code (the mapping rule of the $4^3 = 64$ DNA sequence triplets to the 20 amino acids and a stop symbol) was a disappointment for the coding community, since it does not seem to implement either of the two. A review of the proposed codes can be found in [12]. From this point, there was little interaction between the two communities until recently. We believe that, with all the newly available sequence data, further interactions could be fruitful, as our research suggests. The question why the genetic code has evolved the way it has remains open. There seems to be evidence for the optimality of the code in terms of error minimization using metrics based on physicochemical properties of the resulting amino acids, such as their hydrophobicity [10]. Apparently, evolution imposes additional constraints on the optimization of how genetic information is stored, which makes the modeling rather peculiar. This has to be accounted for by communications engineers modeling evolution and the molecular processing of genetic information in the cell as a communication system.

2 Biological background

2.1 DNA

In 1944, deoxyribonucleic acid (DNA) was identified as the primary carrier of genetic information. The discovery of the geometric arrangement of the DNA building blocks in a double helix by Watson and Crick followed in 1953. DNA consists of two complementary directed strands of nucleotides. Each nucleotide is composed of a backbone unit (sugar and phosphate) and one of the four bases adenine (A), guanine (G), cytosine (C) or thymine (T). The sugar phosphate backbone determines the direction of each strand, which is referred to as 5′ to 3′ by convention. The two strands are held together by electrostatic interaction via weak hydrogen bonds between the complementary bases A–T and C–G, see DNA in Fig. 1. Here, nature has implemented a simple complementary repetition code, which is very advantageous for DNA replication, a process that has to take place every time a cell divides. Each of the two complementary strands is used as a template for the DNA copy of one of the two daughter cells.

2.2 Mutations

The process of copying is prone to errors leading to point mutations, insertions, deletions and duplications. According to evolutionary theory, a certain degree of mutation is necessary to allow for adaptation of different species to changing environmental conditions. Propagation of evolutionarily disadvantageous mutations is hindered by natural selection, in contrast to neutral and the rare advantageous mutations. Assuming a common ancestor, the degree of dissimilarity in the genomes of existing species can be used to reconstruct their phylogenetic relationships, as shown in Sect. 3. Mutational variations observed across the human population are the origin of genetically influenced diseases. The main objective of gene mapping is to determine which of the varying positions in the genome, also referred to as single nucleotide polymorphisms (SNPs) [1], are related to the disease under investigation. Section 4 describes an information theoretical method to identify the SNPs which are statistically related to the investigated disease. It relies on population based data from clinical studies. Since a high rate of mutation would lead to too many evolutionarily disadvantageous mutations per generation cycle, it is crucial that the genome copying process takes place with high fidelity. Nature has implemented mechanisms to minimize the error susceptibility of the copying machinery. However, error protecting measures on the sequence level similar to error correcting codes in communication systems are currently not known. We believe that, especially in the case of complex multicellular (eukaryotic) organisms, which have long generation cycles and a limited number of offspring, nature might have developed sequence level error correcting measures to ensure the necessary high replication fidelity. The primary and best understood function of the genome is to carry information for the synthesis of proteins, see Sect. 2.3. However, in complex eukaryotes like vertebrates the proportion of the genome actually coding for proteins is less than 10%, as opposed to simple, fast evolving single cell organisms (prokaryotes), where almost all of the genome codes for proteins. The non-coding part was largely neglected by the research community for a long time, until comparative genomics recently identified regions in the genomes of vertebrate species that do not code for proteins but show a high degree of evolutionary conservation [26], labeled conserved non-genic region (CNG) in Fig. 2. This implies some unknown evolutionarily important function. The proportion of such conserved non-coding regions in the human genome is comparable to that of protein coding regions. Currently, our search for error protecting means on the sequence level concentrates on these regions, see Sect. 5. They might be carrying parity information to protect the coding regions.

Fig. 1 Protein synthesis

2.3 Protein synthesis

Fig. 2 Genome organization of multicellular organisms

The protein coding part of the genome is converted to proteins in a process called gene expression. It takes place in two basic steps, see Fig. 1. First, during transcription the genomic DNA region coding for a protein is copied into messenger RNA (mRNA) by the RNA polymerase molecule. The resulting mRNA corresponds to a complementary copy of the template strand, except that the base T (thymine) is substituted by U (uracil). In the second step, the ribosome translates the mRNA into a sequence of amino acids, i.e. a protein. Here, triplets of bases are converted to amino acids according to the mapping rule described by the genetic code [19].
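The two steps of gene expression can be illustrated with a minimal sketch. It assumes the standard genetic code, written in the compact convention in which the codon bases are ordered T, C, A, G; the function names and the nine-base template strand are hypothetical and chosen only for illustration.

```python
# Standard genetic code; codon bases ordered T, C, A, G, stop codons marked "*".
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def transcribe(template_5to3: str) -> str:
    """Return the mRNA (5'->3'): the complement of the template strand with T replaced by U."""
    complementary = "".join(COMPLEMENT[b] for b in reversed(template_5to3))
    return complementary.replace("T", "U")


def translate(mrna: str) -> str:
    """Translate from the first AUG up to (not including) the first stop codon."""
    start = mrna.find("AUG")
    if start < 0:
        return ""
    protein = []
    for pos in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[pos:pos + 3].replace("U", "T")]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


template = "TTAGGCCAT"        # hypothetical template strand, written 5'->3'
mrna = transcribe(template)   # "AUGGCCUAA"
protein = translate(mrna)     # "MA": methionine, alanine, then a stop codon
```

The table maps all 64 triplets to 20 amino acids plus the stop symbol, which is the redundancy discussed in the introduction.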

2.4 Genome structure

The protein coding portion of the genome is arranged in genes. The genes vary in size and are randomly distributed across the genome. The beginning of a gene is characterized by a promoter sequence in front of it. The end is signalled by a terminator. During transcription initiation, the first step in protein synthesis, the promoter sequence has to be detected. This resembles frame synchronization in digital communication systems. Further investigation of this analogy is presented in Sect. 6. In eukaryotes the mRNA produced during transcription contains non-coding regions called introns. These are spliced out (removed from the mRNA) before translation occurs. Only the coding exons are finally translated to protein. The described genome structure is depicted in Fig. 2. The content recognition method described in Sect. 3 can be used to distinguish between the coding exons, the non-coding but transcribed introns and the non-genic regions not taking part in gene expression.

3 DNA classification using compression distance measures based on mutual information

The possibility of using mutual information for classification and content recognition of genetic sequences is explored in this section. Two different mutual information based distance measures are proposed, one for classification and one for content recognition. The measure proposed for classification is a metric. The influence of compression based entropy estimation on the proposed measures is investigated. Examples of successful applications in the field of genetics are presented.

Mutual information describes the amount of information shared by stochastic processes. It can be used to derive distance measures quantifying the similarity of the processes. Mutual information based distance measures can be used to compare texts written by different authors or to build phylogenies of different species.

3.1 Compression based entropy approximation

The definition of mutual information is based on the entropies of the compared sources, which will be approximated using compression. The idea of using compression for phylogenetic classification of whole genomes was first introduced in [14]. Shannon’s fundamental theorem on data compression states that every source $S$ can be losslessly compressed down to its entropy rate $H(S)$. Thus, the compression ratio achieved by an optimal compression algorithm designed for a given source $S$ when compressing a message $s$ generated by this source is a good approximation of the source’s actual entropy rate

$$H(S) \approx \frac{|\mathrm{comp}(s)|}{|s|}, \tag{1}$$

where $|\cdot|$ denotes the size in bits or symbols. The entropy of DNA sequences is at most two bits per symbol due to the use of a four-symbol alphabet (A, C, G, T).

In general, a universal compressor for a whole class of sources (e.g. DNA sequences, natural texts) is available. Such universal compressors gradually adjust their underlying general statistical model describing the whole class of sources to the individual statistics of the particular message being compressed. For example, genomic DNA sources contain approximate repeats and palindromes (reverse complements) due to duplications and point mutations that occur during evolution. DNACompress uses this general property of genomic DNA and compresses the specific repeats occurring in the particular sequence being compressed. Such universal compressors are particularly suited to comparing sources of a given class, as they should be able to compress a concatenation of messages generated by similar sources well, as opposed to dissimilar ones. Consequently, the conditional entropy $H(S_i|S_j)$ of two different sources $S_i$ and $S_j$ will be approximated as the compression ratio achieved for the message $s_i$ when the compressor’s model is trained on the message $s_j$. The compression size of the concatenated sequences $|\mathrm{comp}(s_j, s_i)|$ can be used for this purpose

$$H(S_i|S_j) \approx \frac{|\mathrm{comp}(s_j, s_i)| - |\mathrm{comp}(s_j)|}{|s_i|}. \tag{2}$$
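The approximations (1) and (2) can be sketched in a few lines. This is only an illustration under a loud assumption: the general-purpose `zlib` compressor from the Python standard library stands in for a DNA-specific compressor such as DNACompress, so the absolute values are much cruder than those a specialized compressor would give. The helper names and the random test sequences are hypothetical.

```python
import random
import zlib


def comp_bits(seq: str) -> int:
    """Compressed size of seq in bits (zlib is an assumed stand-in compressor)."""
    return 8 * len(zlib.compress(seq.encode("ascii"), 9))


def entropy_rate(s: str) -> float:
    """Eq. (1): compressed size divided by message length, in bits per symbol."""
    return comp_bits(s) / len(s)


def conditional_entropy(s_i: str, s_j: str) -> float:
    """Eq. (2): extra bits needed for s_i once the compressor has seen s_j."""
    return (comp_bits(s_j + s_i) - comp_bits(s_j)) / len(s_i)


rng = random.Random(0)
s = "".join(rng.choice("ACGT") for _ in range(2000))  # synthetic "genome"
t = "".join(rng.choice("ACGT") for _ in range(2000))  # unrelated synthetic sequence

h_s = entropy_rate(s)
h_s_given_s = conditional_entropy(s, s)   # training on s itself: few extra bits
h_s_given_t = conditional_entropy(s, t)   # training on unrelated t: little help
```

Because zlib only looks back over a 32 kB window, such a stand-in is meaningful only for short sequences; this is one reason why the choice of compressor matters in practice.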

3.2 Mutual information based distance measures

The aim of unsupervised classification is to build clusters of all sources $S_i$ based on chosen criteria. A distance metric $d(S_i, S_j)$ quantifying the similarity of the sources is required for such clustering.

Content recognition serves a different purpose. Here, a set $C$ of known content sources $S^C_i$, $i \in \{1 \ldots |C|\}$, is provided together with a set $U$ of unknown sources $S^U_j$, $j \in \{1 \ldots |U|\}$. The goal is to find, for each unknown source $S^U_j$, the best matching content source $S^C_b$ with the smallest distance, $b = \arg\min_i d(S^C_i, S^U_j)$. In contrast to classification, the distance measure for content recognition does not have to satisfy the axioms of a metric.

Information theory describes the relatedness of sources $S_i$ and $S_j$ as the mutual information $I(S_i; S_j)$ shared by these sources

$$I(S_i; S_j) = H(S_i) - H(S_i|S_j) = I(S_j; S_i). \tag{3}$$

Mutual information is an absolute measure of the information common to both sources. It can be transformed into a bounded distance through normalization in two different ways. One way, to be used for content recognition, is to normalize by the maximum possible mutual information the two sources can share, resulting in

$$d_{CR}(S_i, S_j) = 1 - \frac{I(S_i; S_j)}{\min(H(S_i), H(S_j))} \leq 1. \tag{4}$$

The lower bound $d_{CR} = 0$ is reached for sources that share the maximum possible mutual information given their entropies. The distance can be reformulated using conditional entropies

$$d_{CR}(S_i, S_j) = \frac{\min(H(S_i|S_j), H(S_j|S_i))}{\min(H(S_i), H(S_j))}. \tag{5}$$

Using the compression based approximations in (1) and (2) it can be written as

$$d_{CR} = \frac{|\mathrm{comp}(s_j, s_i)| - |\mathrm{comp}(s_j)|}{|\mathrm{comp}(s_i)|}, \tag{6}$$

for $|\mathrm{comp}(s_i)| < |\mathrm{comp}(s_j)|$. Since the triangle inequality is not satisfied for $d_{CR}$, this measure is not a metric distance. Thus, for classification we normalize $I(S_i; S_j)$ by the maximum entropy of both sources, resulting in the following distance metric

$$d_{CL}(S_i, S_j) = 1 - \frac{I(S_i; S_j)}{\max(H(S_i), H(S_j))} \leq 1. \tag{7}$$


Compared to $d_{CR}$ in (4), the two sources must not only share the maximum possible mutual information, but also need to have identical entropies in order to achieve $d_{CL} = 0$.

The advantage of the compression based approximation of the derived distances is that no prior alignment of the compared sequences $s_i$ and $s_j$ is necessary.
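A minimal sketch of both distances, computed without any alignment, is given below. Several assumptions apply: `zlib` again replaces a DNA-specific compressor; $d_{CR}$ follows the conditional-entropy form (5); and $d_{CL}$ is evaluated as $\max(H(S_i|S_j), H(S_j|S_i)) / \max(H(S_i), H(S_j))$, a form that is not stated in the text but follows from (3) and (7) in the same way that (5) follows from (4). The sequences and the 1% mutation rate are synthetic.

```python
import random
import zlib


def comp_bits(seq: str) -> int:
    """Compressed size in bits; zlib is an assumed stand-in compressor."""
    return 8 * len(zlib.compress(seq.encode("ascii"), 9))


def h(s: str) -> float:
    return comp_bits(s) / len(s)                              # Eq. (1)


def h_cond(s_i: str, s_j: str) -> float:
    return (comp_bits(s_j + s_i) - comp_bits(s_j)) / len(s_i)  # Eq. (2)


def d_cr(s_i: str, s_j: str) -> float:
    """Content recognition distance, Eq. (5)."""
    return min(h_cond(s_i, s_j), h_cond(s_j, s_i)) / min(h(s_i), h(s_j))


def d_cl(s_i: str, s_j: str) -> float:
    """Classification distance, Eq. (7) rewritten with conditional entropies."""
    return max(h_cond(s_i, s_j), h_cond(s_j, s_i)) / max(h(s_i), h(s_j))


def mutate(seq: str, rate: float, rng: random.Random) -> str:
    """Apply independent point mutations with the given per-site probability."""
    out = []
    for base in seq:
        if rng.random() < rate:
            base = rng.choice([b for b in "ACGT" if b != base])
        out.append(base)
    return "".join(out)


rng = random.Random(1)
ancestor = "".join(rng.choice("ACGT") for _ in range(3000))
relative = mutate(ancestor, 0.01, rng)                         # close relative
stranger = "".join(rng.choice("ACGT") for _ in range(3000))    # unrelated sequence

close = d_cl(ancestor, relative)
far = d_cl(ancestor, stranger)
```

In this toy setting the mutated copy ends up much closer to the ancestor than the unrelated sequence does, although the two sequences were never aligned.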

3.3 Results

Different types of compression algorithms were tested with respect to their classification and content recognition performance: Lempel–Ziv, Context Tree Weighting, Burrows–Wheeler Transform, Prediction by Partial Matching (PPM) and DNACompress. In general, PPM and DNACompress performed best for genetic sequences. A set of properties making a compression algorithm suitable for classification and content recognition was derived in [7].

A typical classification problem in molecular genetics is the reconstruction of phylogenetic relationships between different populations (e.g. human populations, different mammalian species) in the form of a binary tree, where the nodes represent the separation events and the root the common ancestor of all the investigated populations according to evolutionary theory. Figure 3 shows a phylogenetic tree of the human population constructed using $d_{CL}$ with DNACompress and the quartet tree generation method described in [4]. Mitochondrial DNA (mtDNA) was used for this study. It is about 16,000 bases long and particularly suited for phylogenetic studies, since it is inherited only maternally and shows a high rate of mutation because it resides in the mitochondria, outside of the cell’s protective nucleus. The migration pattern observed in the tree corresponds to the currently accepted theory of African human origin and the results presented in [27]. An interesting highlight is the close relationship between North American Navajo descendants and the European Finnish population, indicating that North America might have been populated not only from north-eastern Asia by crossing the Bering land bridge, but possibly also through the Arctic.

To demonstrate the content recognition performance of the derived measure, we present the results for content recognition of non-genic regions (ng), exons (ex) and introns (in). As content sequences, the first 50,000 nucleotides (50 kb) of concatenated sequences of each type were taken from human chromosome 19 (c19). Sequences of different sizes of each type taken from the beginning of chromosome 1 (c1) were used as unknown sequences. For each unknown sequence $j$ the distance $d_{CR}(S^C_i, S^U_j)$ to every content sequence $i$ was calculated. Using DNACompress and $d_{CR}$, all unknown sequences were recognized correctly, as shown in Table 1. Some distances are greater than 1 due to the concatenation in the compression based approximation of conditional entropy in (2), leading to high compression ratios if a dissimilar sequence is used for training.

Fig. 3 Human phylogeny based on mtDNA [tree rooted at "Eve"; clade labels: Africa; India, Aust.; Asia, Amer.; Asia, Amer.; Euro, Amer.; Asia; Euro; leaves: Afri-Mbuti, Afri-San, Afri-Effik, Afri-SouthAfrican, Afri-Lisongo, Afri-Bamileke, Indi-Kannada, Aust-Aborigine, Asia-Eskimo, AmeS-Warao, AmeS-Guarani, Asia-Japanese, Asia-SiberianInuit, AmeN-Native, AmeN-Navajo, Euro-Finnish, Asia-Chinese, Asia-Thai, Euro-Spanish, Afri-Moroccan, Euro-English, Euro-Dutch, Euro-German, Euro-Caucasian]

Table 1 Content recognition of non-genic regions (ng), introns (in) and exons (ex)

| $S^U_j$ \ $S^C_i$ | c19ng-50kb | c19in-50kb | c19ex-50kb |
|---|---|---|---|
| c1ng-300kb | 0.04 (best) | 0.84 | 1.02 |
| c1ng-13kb | 0.65 (best) | 1.01 | 1.01 |
| c1in-300kb | 0.93 | 0.58 (best) | 1.01 |
| c1in-13kb | 1.00 | 0.05 (best) | 1.07 |
| c1ex-300kb | 1.02 | 1.01 | 0.96 (best) |
| c1ex-13kb | 0.98 | 0.94 | 0.83 (best) |

The obtained results demonstrate how the derived distance measures approximated using compression can successfully be applied to phylogenetics and recognition of sequence type. In Sect. 4 the $d_{CL}$ distance measure will be used for pairwise SNP comparison in gene mapping.

4 Gene mapping and marker clustering using Shannon’s mutual information

This section discusses the application of Shannon’s information theory to population-based gene mapping. In addition, a mutual information based distance measure is used in conjunction with multidimensional scaling to build and visualize clusters of genetic markers. The presented approaches are applied to clinical data on autoimmune Graves’ disease.

Mutual information, defined as

$$I(X; Y) = \sum_x \sum_y p(x, y) \log_2 \frac{p(x, y)}{p(x)\,p(y)}, \tag{8}$$

where $X$ and $Y$ are random variables, can be interpreted as the reduction in entropy (or uncertainty) of one random variable given another. In the following, it will be used as a measure of dependence between the physical manifestation of a trait (phenotype) and the underlying genetic make-up (genotype). Connecting particular phenotypes with the causal genotypes is the main aim of gene mapping.

4.1 Gene mapping

About 90% of deviations between the genomes of two individuals from a population are single point mutations. Such variations in the genomes of a population occurring with a relative frequency ≥1% are referred to as single nucleotide polymorphisms (SNPs). It is estimated that only about 0.3% of the positions in the human genome are SNPs. The term allele refers to the nucleotide observed at a particular SNP locus (position) in an individual. At most one mutation per genome position is assumed to have occurred during the short human evolution. This assumption results in biallelic SNPs: exactly two different alleles are observable per SNP in a population. Assume that a particular region of the genome was sequenced across the population, resulting in ACCGTA in 76% of the cases and ATCGTA in 24%. The second position would thus be a SNP with major allele A = C and minor allele a = T. In a simplistic view, sexually reproducing organisms possess two homologous copies of their genome, each inherited from one of the parents. Thus, per SNP locus we observe two alleles, one from each parental side. An individual will have inherited either the same allele from both parents (homozygous, AA or aa) or a different allele from each parent (heterozygous, Aa or aA). Modeled as a discrete random variable, each SNP locus would thus have four possible realizations. However, genotyping does not allow the parental origin of the alleles to be distinguished (Aa is indistinguishable from aA), reducing the number of observable realizations of a SNP to three.
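The worked example above can be reproduced with a short sketch. The function names are hypothetical, the 1% frequency threshold is taken from the SNP definition in the text, and coding the unordered genotype as the number of minor alleles (0, 1 or 2) is an assumed convention for the three observable realizations.

```python
from collections import Counter


def find_snps(haplotypes, min_freq=0.01):
    """Return {position: (major_allele, minor_allele)} for biallelic sites (0-based positions)."""
    snps = {}
    for pos in range(len(haplotypes[0])):
        counts = Counter(hap[pos] for hap in haplotypes)
        if len(counts) == 2:
            (major, _), (minor, n_minor) = counts.most_common(2)
            if n_minor / len(haplotypes) >= min_freq:
                snps[pos] = (major, minor)
    return snps


def genotype(allele_1: str, allele_2: str, minor: str) -> int:
    """Unordered genotype coded as the number of minor alleles: 0 = AA, 1 = Aa or aA, 2 = aa."""
    return (allele_1 == minor) + (allele_2 == minor)


# Population from the example: ACCGTA in 76% and ATCGTA in 24% of the sequenced copies.
population = ["ACCGTA"] * 76 + ["ATCGTA"] * 24
snps = find_snps(population)   # only the second position (index 1) varies
major, minor = snps[1]         # major allele "C", minor allele "T"
```

Since `genotype("C", "T", minor)` and `genotype("T", "C", minor)` both evaluate to 1, the four ordered allele pairs collapse to the three observable realizations.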

Fig. 4 Genotype–phenotype transition diagram for a two-locus model [channel diagram: genotype inputs AA, aA/Aa, aa; phenotype outputs case, control]

In a typical clinical population-based gene-mapping study, a small subset of $L$ suspect SNP markers $S_1, S_2, \ldots, S_L$ from the overall estimated 10 million human SNP loci is genotyped in $N$ individuals. Preferably, in a population-based disease study half of the individuals (the cases) carry the disease under investigation, while the other half (the controls) are healthy. In such case-control studies the phenotype $P$ is a binary variable (healthy/diseased) and the genotype a set of ternary random variables $S_1, S_2, \ldots, S_L$. Figure 4 depicts a simple channel diagram describing the information transfer from a ternary SNP $S_i$ to the binary phenotype $P$. The probabilities of the random variables’ realizations and the transition probabilities can be estimated from relative frequencies, i.e. observed counts divided by $N$. These probability estimates exhibit a variance that depends on the sample size $N$. From these probabilities, the mutual information $I(S_i; P)$, $i = 1 \ldots L$, between each SNP $S_i$ and the phenotype $P$ can be estimated to investigate each SNP’s causality both in absolute (through the unit bits) and relative terms [18].
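As a hedged illustration of this estimation step, the sketch below evaluates (8) for one SNP from a 3 × 2 table of genotype and phenotype counts, using relative frequencies as probability estimates. The count tables are invented for illustration and do not come from any study.

```python
from math import log2


def mutual_information(counts):
    """Plug-in estimate of I(S; P) in bits; counts[g][k] is the number of
    individuals with genotype g (rows) and phenotype k (columns)."""
    n = sum(sum(row) for row in counts)
    p_g = [sum(row) / n for row in counts]
    p_k = [sum(row[k] for row in counts) / n for k in range(len(counts[0]))]
    mi = 0.0
    for g, row in enumerate(counts):
        for k, n_gk in enumerate(row):
            if n_gk > 0:                      # terms with zero count contribute nothing
                p = n_gk / n
                mi += p * log2(p / (p_g[g] * p_k[k]))
    return mi


# Hypothetical counts; rows: AA, Aa/aA, aa; columns: (cases, controls).
associated = [[60, 120], [90, 70], [50, 10]]
independent = [[100, 100], [80, 80], [20, 20]]

mi_associated = mutual_information(associated)     # clearly positive
mi_independent = mutual_information(independent)   # zero up to rounding
```

With a balanced binary phenotype the estimate cannot exceed 1 bit, which is reached only if the genotype determines the phenotype completely.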

The approach presented here for single SNPs and binary phenotypes can easily be extended to the joint analysis of multiple SNPs and/or higher order and continuous phenotypes [6]. A detailed comparison of the proposed method with other statistical and signal processing based methods can be found in [20].

4.2 Marker clustering

So far, we have used the mutual information between phenotype and genotype. In this section, we use the mutual information between SNPs to find groups or clusters of correlated genetic markers, which are likely to form evolutionary entities. This is an important tool for gene mapping, as it can provide additional hints about which markers should be interpreted jointly. As the distance measure between two SNPs, the metric $d_{CL}(S_i, S_j)$ presented in (7) is applied. In this case $S_i$ and $S_j$ represent SNPs. In order to avoid biased results, only the data from the controls should be used to compute the distances between all markers. Subsequently, classical multidimensional scaling can be used to cluster and visualize the SNPs in two- or three-dimensional space for further analysis [5].
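For SNPs the entropies in (7) can be estimated directly from genotype frequencies, so no compressor is needed. The following sketch assumes genotypes coded as 0, 1 or 2 for the same individuals (for example the controls) and uses tiny invented genotype vectors. The multidimensional scaling step itself is not shown.

```python
from collections import Counter
from math import log2


def entropy(values) -> float:
    """Empirical entropy in bits of a list of observations."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())


def d_cl(snp_i, snp_j) -> float:
    """Eq. (7) with I(Si; Sj) = H(Si) + H(Sj) - H(Si, Sj).
    Assumes at least one of the two SNPs is polymorphic in the sample."""
    h_i, h_j = entropy(snp_i), entropy(snp_j)
    h_ij = entropy(list(zip(snp_i, snp_j)))
    return 1.0 - (h_i + h_j - h_ij) / max(h_i, h_j)


# Hypothetical genotypes of six individuals at three SNP loci.
snp_a = [0, 0, 1, 1, 2, 2]
snp_b = [2, 2, 1, 1, 0, 0]   # fully determined by snp_a (alleles relabelled)
snp_c = [0, 1, 0, 1, 0, 1]   # statistically independent of snp_a in this sample

d_ab = d_cl(snp_a, snp_b)    # approximately 0
d_ac = d_cl(snp_a, snp_c)    # approximately 1
```

The matrix of all pairwise distances is what a multidimensional scaling routine would take as input.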

4.3 Results

Fig. 5 Mutual information in bit between Graves’ disease and each SNP in a region suspected of being related to the autoimmune disease [plot: x-axis position (kb), 0 to 400; y-axis mutual information $I(S_i; P)$, 0 to 0.02]

The proposed methods were successfully tested on simulated and real data sets. The clinical data set described in [25] was used to generate the results presented in this section. The study suspects a 317 kb long region across the genes CD28, CTLA4 and ICOS to be related to the autoimmune Graves’ disease. The region comprises 108 dispersed SNP loci, which were genotyped in 384 cases and 652 controls. Figure 5 shows the mutual information estimate for all 108 SNPs [6]. It should be noted that the effects measured are relatively weak (≈0.01 bit as compared to the theoretical maximum of 1 bit). To determine the significance of the results, the permutation-based critical values of the total study (global null hypothesis based on a 5% significance level) have been determined and plotted. Our analysis of the autoimmune disease data set also revealed two study-wise significantly associated regions, which are identical to the most promising regions found by the logistic regression analyses reported in [25]. The multidimensional scaling clustering analysis (described in Sect. 4.2) of the same data set in two-dimensional space is depicted in Fig. 6. It can be seen that the SNP loci identified as significantly related to Graves’ disease tend to cluster. The resulting cluster, indicated by the ellipse in Fig. 6, points to similar evolutionary histories and ages of these markers. This implies that, with high probability, there is only a single causal marker among these SNP loci.
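The text does not spell out how the permutation-based critical values were computed. One common way to obtain a study-wise threshold, sketched here purely as an assumption, is to permute the phenotype labels many times, record the maximum mutual information over all SNPs for each permutation, and take the upper 5% quantile of these maxima. All data below are synthetic.

```python
import random
from collections import Counter
from math import log2


def mutual_information(xs, ys) -> float:
    """Plug-in estimate of I(X; Y) in bits from paired observations."""
    n = len(xs)
    n_x, n_y, n_xy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * log2(c * n / (n_x[x] * n_y[y])) for (x, y), c in n_xy.items())


def critical_value(snps, phenotype, n_perm=200, alpha=0.05, seed=0) -> float:
    """Upper alpha-quantile of the maximum MI over all SNPs under permuted labels."""
    rng = random.Random(seed)
    labels = list(phenotype)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(labels)
        maxima.append(max(mutual_information(s, labels) for s in snps))
    maxima.sort()
    k = round(alpha * n_perm)      # number of permutation maxima allowed above the threshold
    return maxima[n_perm - k - 1]


# Synthetic study: 200 cases, 200 controls, 19 unrelated SNPs and 1 associated SNP.
rng = random.Random(42)
phenotype = [1] * 200 + [0] * 200
snps = [[rng.choice((0, 1, 2)) for _ in phenotype] for _ in range(19)]
snps.append([rng.choice((0, 1, 2, 2, 2) if p else (0, 0, 0, 1, 2)) for p in phenotype])

threshold = critical_value(snps, phenotype)
observed = [mutual_information(s, phenotype) for s in snps]
significant = [i for i, mi in enumerate(observed) if mi > threshold]
```

Taking the maximum over all SNPs within each permutation is what makes the threshold hold for the study as a whole rather than for a single marker.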

In comparison to other statistical gene-mapping methods, applying the simple yet theoretically well-defined concept of mutual information to the representation of SNP–phenotype and SNP–SNP relationships does not require any assumptions to be made and thus lays out a consistent framework for a first screen in gene mapping approaches.

5 Conserved non-genic elements: implementations of error correcting codes?

Fig. 6 Marker clustering in a two-dimensional scaling; autoimmune disease data [scatter plot of SNP loci labelled by index; horizontal axis −0.4 to 0.3, vertical axis −0.3 to 0.3]

DNA is the primary carrier of genetic information. This information must be “transmitted” to various destinations. During cell replication the genomic information must be copied and passed on to the two daughter cells, as each cell carries a copy of the whole genome. A further example is the transmission of genetic information from genes to proteins. The genetic transmission channels introduce noise, and one might ask whether nature has developed error protecting means similar to those that we use in digital data transmission over noisy channels in order to make reliable communication possible. Consider the transmission of genetic information over generations in evolutionary time. The DNA is subjected to mutations, making this transmission channel noisy. Assuming a simple model of nucleotide mutations, Battail [3] showed that the capacity of this channel decreases exponentially over time. He concludes that, for any reasonable instantaneous mutation rate, genome conservation over large geological timescales can only be explained by genome regeneration. He hypothesizes that there exists an error correcting code implemented on the genome sequence level for this purpose and that genome regeneration must occur before the capacity of the channel falls below the error correction ability of this code [3]. Further evidence for this hypothesis is given by a recently published discovery about the error correcting ability of the plant Arabidopsis [15]. The experiment shows that mutations that are present in the genomes of the parents are corrected in the genomes of their offspring with a certain probability. Assuming an error correcting code on the genome level, we need to find out where in the genome it is implemented. This amounts to the detection of functional elements in the genome, i.e. separating evolutionary noise from meaningful biological information.
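The decay of capacity with time can be made tangible with a small numerical sketch. It does not reproduce the model of [3]. Instead it assumes a Jukes–Cantor substitution process with a hypothetical total substitution rate `mu` per site and year, and treats a nucleotide site after time `t` as a 4-ary symmetric channel.

```python
from math import exp, log2


def substitution_probability(t: float, mu: float) -> float:
    """Jukes-Cantor probability that a site shows a different base after time t."""
    return 0.75 * (1.0 - exp(-4.0 * mu * t / 3.0))


def capacity(p: float, q: int = 4) -> float:
    """Capacity in bits per symbol of a q-ary symmetric channel with error probability p,
    i.e. log2(q) - H2(p) - p * log2(q - 1)."""
    if p == 0.0:
        return log2(q)
    return log2(q) + (1.0 - p) * log2(1.0 - p) + p * log2(p / (q - 1))


mu = 1e-9                                   # hypothetical rate per site and year
times = [0.0, 1e7, 1e8, 1e9, 1e10]          # years
capacities = [capacity(substitution_probability(t, mu)) for t in times]
```

Under these assumptions the capacity starts at 2 bits per nucleotide and approaches zero as the substitution probability tends to 3/4, at which point the observed base is independent of the original one.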

5.1 Detection of functional DNA sequences and conserved non-genic elements

Comprehensive identification of biologically functional elements in the DNA represents a central and ambitious goal in modern genetics. The reliable detection and analysis of functional elements are crucial steps towards a deep understanding of how complex organisms work. Early approaches to this problem were limited to the use of information from one species. Today, with high quality genome sequences of several species at hand, a comparative approach, taking into account multiple sources of information, is often used to infer regions in the genome subjected to evolutionary pressure. The evolutionary relationship of multiple organisms can be described in the form of a phylogenetic tree. The common ancestor is represented by the root of the tree. The passage of DNA along the organismal lineages is described by the branches of the tree. During the process of evolution, the passed genetic information (DNA) is subjected to mutations that cause variations. Natural selection decides on the success of the transmitted DNA. Altered information in regions whose variation negatively influences the fitness of the organism will most likely diminish the organism’s capability to reproduce and prevent it from passing its DNA to the next generation, whereas mutations in regions not under selective pressure will be passed on to further generations without restrictions. Thus, those elements within the genome carrying information for important basic functions are less likely to successfully mutate during evolution due to natural selection. Consequently, by identifying conserved elements in the assembly of the genomes of several species, we find candidates that are very likely to be functional. Nowadays, with access to the complete sequences of a number of vertebrate genomes, this approach provides a powerful tool for the systematic discovery of functional elements in the genome [9,17,24].

5.2 Evolution in a communication theoretic framework

In terms of communications engineering, evolution can be regarded as a single input multiple output (SIMO) system. In the biological transmission system (evolution), we may think of the common ancestor as the transmitter. Its sequence of bases is the output of the information source. In Fig. 7, a single input multiple output communication scenario and an evolution scenario are depicted. The divergence of lineages, indicated by the inner nodes of the phylogenetic tree, is equivalent to the scattering of the dispersing electromagnetic wave on obstacles. The leaves of the tree correspond to the receiver antennas in the SIMO system. They receive the sequences that we are able to observe in the species today. The information is transmitted over the branches of the phylogenetic tree, equivalent to the signal paths in terms of communications theory. Errors (mutations), erasures and insertions occur during transmission.

5.3 Modeling evolution

Commonly, evolution can be described by a set of parameters [28]. We abstract evolution by a phylogenetic tree $T = \{\tau, t'\}$ that we specify by a topology $\tau$ and the respective branch lengths $t'$ accounting for the phylogenetic relationships and the evolutionary distances among the species. A continuous time stationary Markov process with state space $X = \{A, C, G, T\}$ describes the mutation process. A rate matrix $R$ defines this Markov process and is related to the matrix of transition probabilities between two nodes in the phylogenetic tree by

$$P(t_{u \to v}) = e^{R\, t_{u \to v}}, \qquad (9)$$

where $t_{u \to v}$ denotes the evolutionary distance between the nodes $u$ and $v$ in the tree. The rate of substitutions at a site is strongly dependent on its position along the DNA sequence as some regions are under purifying selection and thus evolve more slowly than neutral regions. In terms of the phylogenetic description, we model rate heterogeneity as a site dependent scaling parameter $\theta_i$, where $i$ denotes the nucleotide position, working on the lengths of the branches of the tree:

$$t_i = \theta_i t'.$$

Fig. 7 Left: Phylogenetic tree relating three species (Species 1, Species 2, Species 3) as they evolved from a common ancestor. Right: A single input multiple output scenario

The evolutionary distances scaled in this way lead to higher or lower substitution probabilities according to (9) and thus to less or more conserved regions, respectively. In the following, we parameterize evolution by the set $\psi_i$ containing the parameters described above

$$\psi_i = \{R, \tau, \theta_i t'\}. \qquad (10)$$

Note that evolution is site dependent; theoretically, each site $i$ could evolve differently. However, in practice, constant values for $R$ and $T$ are assumed over large regions of the genome.
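As a minimal sketch of how (9) and the site dependent scaling interact, the following Python fragment evaluates the transition probabilities in closed form for the Jukes-Cantor model. The choice of this particular rate matrix (off-diagonal rates 1/3, diagonal entries −1), the function names and the branch length 0.5 are assumptions made for illustration only; the text does not fix a specific $R$.

```python
import math

BASES = "ACGT"


def jc_transition_matrix(t):
    """P(t) = exp(R t) for the Jukes-Cantor rate matrix (assumed here):
    off-diagonal entries 1/3, diagonal entries -1. Closed form, no matrix
    exponential needed because R has eigenvalues 0 and -4/3."""
    decay = math.exp(-4.0 * t / 3.0)
    same = 0.25 + 0.75 * decay   # probability that the base is unchanged
    diff = 0.25 - 0.25 * decay   # probability of each specific substitution
    return [[same if a == b else diff for b in BASES] for a in BASES]


def site_transition_matrix(theta_i, t_branch):
    """Transition matrix at site i after scaling the branch: t_i = theta_i * t'."""
    return jc_transition_matrix(theta_i * t_branch)


if __name__ == "__main__":
    # Small theta_i (strong conservation) keeps P close to the identity matrix.
    for theta in (0.1, 1.0, 3.0):
        P = site_transition_matrix(theta, 0.5)  # 0.5 is a hypothetical branch length
        print(f"theta={theta}: P(A->A)={P[0][0]:.3f}, P(A->C)={P[0][1]:.3f}")
```

A smaller $\theta_i$ shortens the effective branch and moves the matrix towards the identity, which is the sense in which the scaling parameter expresses conservation.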

5.4 Estimation algorithm

Figure 8 shows the transmission model for evolution. The single sequence $\{x_i\}$ is transmitted over the multipath channel evolution. At the receiver, we observe the receive vector sequence $\{y_i\}$ consisting of the ancestral sequence as we observe it today in the genomes of the considered species. The channel is characterized by the transition probabilities $p_y(y_i|x_i;\psi_i)$ conditional on $x_i$ and parameterized over $\psi_i$. The channel is not constant for all input sequences. Different genome regions have been subjected to different substitution rates because they are subjected to different natural selection pressure depending on the biological importance of the information they carry. From this point of view, estimating the conservation of a particular DNA region is equivalent to estimating how good the transmission channel was in this region. We will introduce a detection method which, in contrast to earlier approaches [17,24], is independent of the assumption about neutral evolutionary rates and which does not require a priori tuning parameters. We propose a definition of conservation that relies on the Kullback–Leibler distance to the well defined maximum possible conservation that does not allow for any mutations to occur [11].

Fig. 8 Information transmission in evolution

From a communication theoretic viewpoint, the maximum conservation is equivalent to the case of noiseless transmission, i.e. the base $x_i$ is observed unchanged in all components of the receive vector $y_i$. In this situation, the channel shall be specified by $p_y(y_i|x_i;\psi^0)$ and the receive vector $y_i$ is distributed according to $p_y(y_i;\psi^0)$. For the comparison with the maximum conservation case, we estimate the evolutionary model that maximizes the likelihood of an ensemble of received vectors. In a sliding window over the observed data $Y_i = [y_{i-\delta}, \ldots, y_{i+\delta}]$, $\delta$ fixed, we determine the evolutionary model $\hat{\psi}_i$ that most likely led to the observed data. Assuming statistical independence among the columns of $Y_i$

$$\hat{\psi}_i = \arg\max_{\psi_i} \left\{ \sum_{j=i-\delta}^{i+\delta} \log\big(p_y(y_j;\psi_i)\big) \right\}. \qquad (11)$$
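The windowed maximization in (11) can be illustrated with a deliberately simplified Python sketch. It is not the authors' implementation: it assumes a star shaped tree (every species attached directly to the ancestor), a Jukes-Cantor substitution model with a uniform distribution at the root, a gap-free alignment, and a plain grid search over a single scaling parameter $\theta$. All function names and the numeric values are hypothetical.

```python
import math

BASES = "ACGT"


def jc_prob(x, y, t):
    """Jukes-Cantor probability of observing base y given ancestral base x after distance t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay


def column_likelihood(column, branch_lengths, theta):
    """p_y(y; psi) for one alignment column: sum over the unknown ancestral base x
    (uniform prior 1/4), with every branch length scaled by theta."""
    total = 0.0
    for x in BASES:
        prod = 0.25
        for y, t in zip(column, branch_lengths):
            prod *= jc_prob(x, y, theta * t)
        total += prod
    return total


def window_log_likelihood(columns, branch_lengths, theta):
    """Objective of (11): sum of column log-likelihoods inside the window."""
    return sum(math.log(column_likelihood(c, branch_lengths, theta)) for c in columns)


def estimate_theta(alignment, i, delta, branch_lengths, grid):
    """Grid-search ML estimate of theta for the window [i - delta, i + delta].
    The grid must contain only positive values so that no likelihood is zero."""
    lo = max(0, i - delta)
    hi = min(len(alignment[0]), i + delta + 1)
    columns = [tuple(row[j] for row in alignment) for j in range(lo, hi)]
    return max(grid, key=lambda th: window_log_likelihood(columns, branch_lengths, th))


if __name__ == "__main__":
    grid = [0.05 * k for k in range(1, 101)]        # hypothetical search grid
    branches = [0.1, 0.1, 0.3]                      # hypothetical branch lengths t'
    alignment = ["ACGTACGTAC", "ACGAACTTAC", "ATGTCCGTAA"]
    print(estimate_theta(alignment, i=5, delta=3, branch_lengths=branches, grid=grid))
```

In this reduced setting a window without any substitutions drives the estimate to the smallest value on the grid, whereas windows containing substitutions yield larger estimates.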

We calculate the probability mass function $p_y(y_i;\hat{\psi}_i)$ for a column parameterized by $\hat{\psi}_i$ and compare the estimated distribution with the one corresponding to the maximum conservation process using the Kullback–Leibler distance

$$s_i = D\big(p_y(y_i;\hat{\psi}_i)\,\|\,p_y(y_i;\psi^0)\big). \qquad (12)$$

$s_i$ is the score assigned to the column in the middle of the sliding window. Note that a low score corresponds to a good channel and thus a highly conserved region. A score of zero is best explained (in the ML sense) by the process of maximum conservation. Gaps are treated as missing data, causing the algorithm to consider only the subtree of species where data is available. A comparison of the results that we obtained with our method is presented in the next Section. Figure 9 shows our estimate of conservation and the underlying genomic data, an alignment of the genomes of five species. Mutations are highlighted by a colored background. Our distance based score signal reflects the different degrees of conservation, as one can observe by comparing the signal course with the data. Results on synthetic data suggest that our method exceeds the performance of established tools from bioinformatics [11].

5.5 Conserved non-genic sequences

Two to three years ago, when high quality genomes from multiple species became available, the comparative methods revealed an unexpected feature of the DNA. It has been discovered that many of the conserved genome regions are non-genic, i.e. not coding for proteins [8,24]. These regions are believed to have important functions and are still poorly understood. If an error correcting code exists on the genome sequence level, we expect the conserved non-genic regions to play a fundamental role in its implementation. Using our algorithm to identify conserved regions in the genome, future work will concentrate on the analysis of these conserved regions with respect to our hypothesis of an error correcting code on the genome level.

Fig. 9 Top: the conservation scores indicating conserved regions (conservation score in relative entropy versus nucleotide position, with highly conserved regions marked). Bottom: visualization of the respective genomic data, a small section of an alignment of the genomes of human, mouse, rat, chicken and fugu

6 Analogy between digital data transmission and transcription initiation

In digital data transmission the data is often divided into frames, whose header contains special patterns that indicate the beginning of the message in order to maintain synchronization. These patterns, the "sync words", need to be detected reliably by the receiver. Similarly, during transcription initiation, the first step of gene expression, the RNA polymerase has to recognize the promoter that indicates the beginning of a gene, see Sect. 2.4. In bacteria the RNA polymerase is directed to the promoter by the so called sigma factor. This sigma factor recognizes two short (six basepairs long) sequences separated by a spacer and positioned 35 and 10 basepairs (bp) before the transcription start site (TSS). Therefore they are called the −35 and −10 regions. Hence, this process corresponds to a synchronization with two sync words in digital data transmission, see Fig. 10.

Fig. 10 Promoter detection by the sigma factor (labels in the figure: sigma factor, −35, −10, TSS)

6.1 Choice of the sync words in binary and quaternary digital transmission

The sync words in digital data transmission have to be chosen such that they satisfy the following two conditions [2]: firstly, the probability of a random occurrence of the pattern in the data stream is to be minimized; secondly, the structure of the pattern should be such that the neighboring symbols cannot yield a shifted sync word. For example, if the (binary) pattern is +1+1+1+1+1+1, there is a probability of 0.5 (assuming equally probable symbols) that it is followed by a +1, which may lead to a shifted synchronization. While the probability of a random occurrence does not depend on the sequence in case of independent symbols, the second condition is to be analyzed using the aperiodic autocorrelation function $\varphi_{ss}(\tau)$ of the sync word. $\varphi_{ss}(\tau)$ describes the similarity of a sequence $s = \{s_1, s_2, \ldots, s_l\}$ to itself for every shift $\tau \in [-(l-1); +(l-1)]$

$$\varphi_{ss}(\tau) = \sum_{m=1}^{l-|\tau|} s_m \cdot s^*_{m+|\tau|}, \qquad (13)$$

where $s^*_m$ denotes the complex conjugate of $s_m$. In order to minimize the probability of shifted synchronizations, the autocorrelation function of the sync word should have a narrow maximum at $\tau = 0$ and the smallest possible values for $\tau \neq 0$ [16]. In general, the autocorrelation properties of a sequence are evaluated using the peak sidelobe (PSL)

$$\mathrm{PSL} = \max_{\tau \neq 0}\,[\varphi_{ss}(\tau)], \qquad (14)$$

which should be as small as possible to minimize the probability of false synchronizations.
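The definitions (13) and (14) can be checked on small binary examples with the following Python sketch. It assumes real valued symbols (so the complex conjugate in (13) has no effect); the function names are hypothetical. The second example uses the well known length 7 Barker sequence as a sync word with small sidelobes.

```python
def aperiodic_autocorrelation(s, tau):
    """phi_ss(tau) as in (13) for real-valued symbols (conjugation is the identity)."""
    shift = abs(tau)
    return sum(s[m] * s[m + shift] for m in range(len(s) - shift))


def peak_sidelobe(s):
    """PSL as in (14): largest autocorrelation value over all non-zero shifts.
    phi_ss depends only on |tau|, so positive shifts are sufficient."""
    return max(aperiodic_autocorrelation(s, tau) for tau in range(1, len(s)))


if __name__ == "__main__":
    all_ones = [+1] * 6                        # the pattern discussed in the text
    barker7 = [+1, +1, +1, -1, -1, +1, -1]     # Barker sequence of length 7
    for name, word in (("all ones", all_ones), ("Barker-7", barker7)):
        sidelobes = [aperiodic_autocorrelation(word, tau) for tau in range(1, len(word))]
        print(name, "sidelobes:", sidelobes, "PSL:", peak_sidelobe(word))
```

The constant pattern has sidelobes that decay only linearly from the main peak, which is why a shifted version of it is easily mistaken for the sync word, whereas the Barker sequence keeps all sidelobes at or below zero.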

6.2 Autocorrelation properties of E. coli promoter sequences

As mentioned before, transcription initiation corresponds to the process of synchronization used in digital data transmission, since two sync words (the promoter regions) need to be detected by the sigma factor. In order to gain more insights into promoter detection, we determine the autocorrelation properties of the −35 and the −10 promoter region in the bacterium Escherichia coli (E. coli) by adapting the autocorrelation function to the quaternary alphabet of nucleotides. To do so, we have to redefine the product in (13) with respect to its biological meaning, i.e. such that it rates the effect of nucleotide matches and mismatches on the synchronization quality of the sequence. We rate an agreement of nucleotides by 1 and a divergence of nucleotides by the negative value $-\tfrac{1}{3}$ (i.e. punishing mismatches with an overall weight of −1). This is done by introducing a mismatch score matrix $D$

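$$D = \begin{pmatrix} 1 & -\tfrac{1}{3} & -\tfrac{1}{3} & -\tfrac{1}{3} \\ -\tfrac{1}{3} & 1 & -\tfrac{1}{3} & -\tfrac{1}{3} \\ -\tfrac{1}{3} & -\tfrac{1}{3} & 1 & -\tfrac{1}{3} \\ -\tfrac{1}{3} & -\tfrac{1}{3} & -\tfrac{1}{3} & 1 \end{pmatrix}, \qquad (15)$$

with rows and columns ordered A, C, G, T,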

and by replacing the product in (13) by the respective matrix values

$$\varphi_{ss}(\tau) = \sum_{m=1}^{l-|\tau|} D(s_m, s_{m+|\tau|}). \qquad (16)$$
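A short Python sketch of the quaternary autocorrelation (16) and the resulting peak sidelobe (14) is given below. The function names are hypothetical, and exact fractions are used so that the score $-\tfrac{1}{3}$ does not introduce rounding errors. The two example words are the E. coli consensus hexamers TTGACA and TATAAT [13].

```python
from fractions import Fraction

MATCH = Fraction(1)          # diagonal entries of D in (15)
MISMATCH = Fraction(-1, 3)   # off-diagonal entries of D in (15)


def score(a, b):
    """Entry D(a, b) of the mismatch score matrix."""
    return MATCH if a == b else MISMATCH


def autocorrelation(seq, tau):
    """Quaternary autocorrelation phi_ss(tau) as in (16)."""
    shift = abs(tau)
    return sum((score(seq[m], seq[m + shift]) for m in range(len(seq) - shift)),
               Fraction(0))


def peak_sidelobe(seq):
    """PSL as in (14), taken over the non-zero shifts."""
    return max(autocorrelation(seq, tau) for tau in range(1, len(seq)))


if __name__ == "__main__":
    for word in ("TTGACA", "TATAAT"):
        values = [float(autocorrelation(word, tau)) for tau in range(-5, 6)]
        print(word, [round(v, 2) for v in values], "PSL:", float(peak_sidelobe(word)))
```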

6.3 Results

The consensus (i.e. most frequently detected) sequences are TTGACA for the −35 region and TATAAT for the −10 region, respectively (see e.g. [13]). Figure 11 shows the autocorrelation functions of the two sequences. Calculation of the peak sidelobe for both promoter regions according to (14) results in

$$\mathrm{PSL}_{-35} = \varphi_{ss}(|\tau| = 2) = 0,$$

$$\mathrm{PSL}_{-10} = \varphi_{ss}(|\tau| = 3) = 1.67.$$

To rate the autocorrelation properties of the promoter sequences, we calculated the values of PSL for all $4^6 = 4{,}096$ possible nucleotide sequences of length 6. The mean value and the standard deviation of the resulting values are listed in Table 2.
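The exhaustive evaluation over all sequences of length 6 can be sketched in Python as follows. The helper names are hypothetical; the population standard deviation is an assumption (the text does not say which estimator was used), and the fraction of sequences with an equal or lower PSL is computed over all sequences, the reference sequence included, which is likewise an assumption.

```python
from fractions import Fraction
from itertools import product
from statistics import mean, pstdev


def peak_sidelobe(seq):
    """PSL per (14) with match score 1 and mismatch score -1/3, as in (15) and (16)."""
    def phi(shift):
        return sum((Fraction(1) if seq[m] == seq[m + shift] else Fraction(-1, 3)
                    for m in range(len(seq) - shift)), Fraction(0))
    return max(phi(shift) for shift in range(1, len(seq)))


def all_psl_values(length=6, alphabet="ACGT"):
    """PSL of every word of the given length over the nucleotide alphabet."""
    return [peak_sidelobe(word) for word in product(alphabet, repeat=length)]


def fraction_at_most(values, threshold):
    """Share of sequences whose PSL is equal to or lower than the threshold."""
    return sum(v <= threshold for v in values) / len(values)


if __name__ == "__main__":
    values = all_psl_values()
    as_float = [float(v) for v in values]
    print("number of sequences:", len(values))
    print("mean PSL:", round(mean(as_float), 2), "SD:", round(pstdev(as_float), 2))
    for word in ("TTGACA", "TATAAT"):
        share = fraction_at_most(values, peak_sidelobe(word))
        print(word, "share with equal or lower PSL:", round(100 * share, 2), "%")
```

The printed mean, standard deviation and shares can be compared with the values reported in the text and in Table 2.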

It can be seen that the PSL of the −35 promoter sequence is far below average, whereas that of the −10 promoter sequence lies above the mean value. In fact, only 1.15% of all possible sequences of length 6 have a better or equal PSL than the −35 region. In contrast, 79.37% of all sequences have a better or equal value of PSL compared to the −10 region. This fact suggests that nature employs a synchronization in two steps: firstly, the −35 region has to be detected out of all possible sequences with high accuracy to enable a reliable localization of the close-by transcription start site, see Fig. 12. In the second step, both regions are detected simultaneously, see Fig. 10; however, due to the synchronization conducted before, the sigma factor only needs to detect the −10 region out of around seven sequences, based on the shape and limited deformability of the sigma factor that yield a variable spacing of 15 to 21 bp between the two promoter regions. Therefore, the sequence of the −10 promoter region is less important for synchronization. This leads to the conclusion that the two promoter regions might have evolved to serve different tasks: while the −35 region is indispensable for indicating the close-by transcription start site and, thus, needs to have excellent synchronization properties, the sequence and structure of the −10 region seem to play a more important role during later steps of transcription initiation like DNA unwinding and opening, which require AT-richness (i.e. a high content of the nucleotides A and T) [23].

Fig. 11 Autocorrelation functions $\varphi_{ss}(\tau)$ of the −35 (TTGACA) and −10 (TATAAT) consensus promoter sequences, plotted for $\tau = -5, \ldots, 5$

Table 2 Mean and SD of PSL for all possible sequences of length 6

        PSL ratio
Mean    1.30
SD      0.76

Fig. 12 Pre-synchronization during promoter detection
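The two step detection described above can be mimicked by a small Python sketch: locate the −35 sync word first, then look for the −10 sync word only at the roughly seven positions allowed by a spacer of 15 to 21 bp. Exact matching of the consensus hexamers is a simplifying assumption (real promoters deviate from the consensus), and the function name, parameters and test sequence are hypothetical.

```python
def find_promoter_candidates(dna, box35="TTGACA", box10="TATAAT", spacer_range=(15, 21)):
    """Return (start of -35 box, start of -10 box, spacer length) for every exact
    occurrence of box35 followed by box10 at an allowed spacer distance."""
    hits = []
    n35, n10 = len(box35), len(box10)
    for start35 in range(len(dna) - n35 + 1):
        # Step 1: synchronize on the -35 region.
        if dna[start35:start35 + n35] != box35:
            continue
        # Step 2: test only the few positions permitted by the spacer length.
        for spacer in range(spacer_range[0], spacer_range[1] + 1):
            start10 = start35 + n35 + spacer
            if dna[start10:start10 + n10] == box10:
                hits.append((start35, start10, spacer))
    return hits


if __name__ == "__main__":
    # Synthetic example: -35 box, 17 bp spacer, -10 box, embedded in filler bases.
    example = "GG" + "TTGACA" + "C" * 17 + "TATAAT" + "GGGGGGG"
    print(find_promoter_candidates(example))
```

Restricting the second search to the positions allowed by the spacer is what makes the autocorrelation properties of the −10 word less critical in this picture.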

7 Conclusions

The newly available sequence data makes the application of information theory to molecular genetics statistically feasible. Concepts like mutual information based distance measures combined with source coding can be applied to phylogenetic classification. Mutual information can be used for gene mapping of complex diseases. Additionally, communication theoretic models of information transmission can be used to search for error correcting codes in the genome or to gain a better understanding of molecular processes in the cell such as transcription initiation.

References

1. The international HapMap project (2003) Nature 426(6968):789–796

2. Barker R (1953) Group synchronization of binary digital systems. In: Jackson W (ed) Communication theory. Butterworths, London, pp 273–287

3. Battail G (2006) Introduction to Biosemiotics: information theory and error-correcting codes in genetics and biological evolution. Springer, Heidelberg

4. Cilibrasi R, Vitányi PMB (2005) Clustering by compression. IEEE Trans Inf Theory 51(4):1523–1545

5. Cox T, Cox M (1994) Multidimensional scaling. Chapman & Hall, London

6. Dawy Z, Goebel B, Hagenauer J, Andreoli C, Meitinger T, Mueller JC (2006) Gene mapping and marker clustering using Shannon's mutual information. IEEE/ACM Trans Comput Biol Bioinform 3(1):47–56

7. Dawy Z, Hagenauer J, Hanus P, Mueller JC (2005) Mutual information based distance measures for classification and content recognition with applications to genetics. In: Proceedings of the ICC 2005

8. Dermitzakis ET, Reymond A, Antonarakis SE (2005) Conserved non-genic sequences - an unexpected feature of mammalian genomes. Nat Rev Genet 6:151–157. URL http://www.dx.doi.org/10.1038/nrg1527

9. Dermitzakis ET, Reymond A, Scamuffa N, Ucla C, Kirkness E, Rossier C, Antonarakis SE (2003) Evolutionary discrimination of mammalian conserved non-genic sequences (CNGs). Science 302:1033–1035

10. Freeland S, Wu T, Keulmann N (2003) The case for an error minimizing standard genetic code. Orig Life Evol Biosph 33(4–5):457–477

11. Hanus P, Dingel J, Hagenauer J, Mueller J (2005) An alternative method for detecting conserved regions in multiple species. German conference on bioinformatics, Hamburg, p 64

12. Hayes B (1998) The invention of the genetic code. Am Sci 86(1):8–14

13. Lewin B (2004) Genes VIII. Pearson Prentice Hall, Upper Saddle River, NJ

14. Li M, Badger JH, Chen X, Kwong S, Kearney P, Zhang H (2001) An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics 17(2):149–154. doi: 10.1093/bioinformatics/17.2.149

15. Lolle SJ, Victor JL, Young JM, Pruitt RE (2005) Genome-wide non-mendelian inheritance of extra-genomic information in Arabidopsis. Nature 434(7032):505–509

16. Lueke H (1992) Korrelationssignale. Springer, Berlin

17. Margulies EH, Blanchette M, Program NCS, Haussler D, Green ED (2003) Identification and characterization of multi-species conserved sequences. Genome Res 13(12):2507–2518. URL http://www.genome.org/cgi/content/abstract/13/12/2507

18. Mueller J, Bresch E, Dawy Z, Bettecken T, Meitinger T, Hagenauer J (2003) Shannon's mutual information applied to population-based gene mapping. Am J Hum Genet 73(5 suppl):610

19. Nirenberg MW, Matthaei JH (1961) The dependence of cell-free protein synthesis in E. coli upon naturally occurring or synthetic polyribonucleotides. Proc Natl Acad Sci USA 47:1588–1602

20. Sarkis M, Goebel B, Dawy Z, Hagenauer J, Hanus P, Mueller JC (2007) Gene mapping of complex diseases - a comparison of methods from statistics, information theory, and signal processing. IEEE Sign Proc Magaz 24(1):83–90

21. Shannon CE (1940) An algebra for theoretical genetics. Ph.D. thesis, Massachusetts Institute of Technology, Dept. of Mathematics

22. Shannon CE (1948) A mathematical theory of communication. Bell Syst Techn J 27:379–423

23. Shomer B, Yagil G (1999) Long W tracts are over-represented in the Escherichia coli and Haemophilus influenzae genomes. Nucleic Acids Res 27(22):4491–4500

24. Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, Weinstock GM, Wilson RK, Gibbs RA, Kent WJ, Miller W, Haussler D (2005) Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 15(8):1034–1050. URL http://www.genome.org/cgi/content/abstract/15/8/1034


25. Ueda H, Howson J, Esposito L, Heward J, Snook H, Chamberlain G, Rainbow D, Hunter K, Smith A, Genova GD, Herr M, Dahlmand I, Payne F, Smyth D, Lowe C, Twells R, Howlett S, Healy B, Nutland S, Rance H, Everett V, Smink L, Lam A, Cordell H, Walker N, Bordin C, Hulme J, Motzo C, Cucca F, Hess J, Metzker M, Rogers J, Gregory S, Allahabadia A, Nithiyananthan R, Tuomilehto-Wolf E, Tuomilehto J, Bingley P, Gillespie K, Undlien D, Ronningen K, Guja C, Ionescu-Tirgoviste C, Savage D, Maxwell A, Carson D, Patterson C, Franklyn J, Clayton D, Peterson L, Wicker L, Todd J, Gough S (2003) Association of the T-cell regulatory gene CTLA4 with susceptibility to autoimmune disease. Nature 423(6939):506–511

26. Ureta-Vidal A, Ettwiller L, Birney E (2003) Comparative genomics: genome-wide analysis in metazoan eukaryotes. Nat Rev Genet 4(4):251–262

27. Wallace DC, Lott MT, Kogelnik AM, Brown MD, Navathe SB (1999) MITOMAP: a human mitochondrial genome database. URL http://www.dhgp.de/

28. Whelan S, Li P, Goldman N (2001) Molecular phylogenetics: state-of-the-art methods for looking into the past. Trends Genet 17(5):262–272
