Algebraic properties of DNA operations

BioSystems 52 (1999) 55–61


Zhuo Li

Department of Computer Science, University of Western Ontario, London, Ont., N6A 5B7, Canada

Abstract

Any DNA strand can be identified with a word in the language $X^*$, where $X = \{A, C, G, T\}$. By encoding A as 000, C as 010, G as 101, and T as 111, we treat the DNA operations concatenation, union, reverse, complement, annealing and melting from the algebraic point of view. Concatenation and union play the roles of multiplication and addition, respectively, over some algebraic structures. The rest of the operations then turn out to be homomorphisms or anti-homomorphisms of these algebraic structures. Using this technique, we find the relationships among these DNA operations. © 1999 Elsevier Science Ireland Ltd. All rights reserved.

Keywords: Insertion; Deletion; Code; Generator


1. Introduction

This paper is a first attempt to treat the DNA operations from an algebraic point of view. It offers a new formalization of DNA operations, based on the notation defined in (Boneh et al., 1997).

In 1994, Leonard Adleman (Adleman, 1994) successfully solved an instance of the directed Hamiltonian path problem solely by manipulating DNA strands. Following (Adleman, 1994), DNA algorithms have been found to solve other problems: expansion of symbolic determinants (Leete et al., 1998), matrix multiplication (Oliver, 1998), addition (Guarnieri and Bancroft, 1998), exascale computer algebra (Williams and Wood, 1998), and so on. The basic idea behind DNA computing is to use DNA strands to encode information and to employ enzymes to simulate simple computations. This shows that the tools of biology could be widely used in solving mathematical problems. Conversely, it is natural and necessary to study DNA strands and the operations on DNA strands in a mathematically precise way. This is the motivation of this paper.

Let us consider the DNA strings, which are the DNA strands with polarity ignored. It is then natural to relate these strings to formal languages and power series. More precisely, let $X = \{A, C, G, T\}$, where A, C, G, T represent the building blocks of a DNA strand, called nucleotides: adenine, cytosine, guanine, and thymine. The single nucleotides are linked together end-to-end to form DNA strands. A DNA sequence has a polarity: a sequence of DNA is distinct from its reverse. The two ends are known as the 5′ end and the 3′ end, respectively. Taken as pairs, the nucleotides C and G, and the nucleotides A and T, are said to be complementary. Two complementary single strands with opposite orientation will join together to form a double helix in a process called annealing. The reverse process is called melting (refer to Kari, 1997 for more details).

This paper has been prepared for the Proceedings of the Fourth International Conference on DNA-Based Computers, Philadelphia, June 1998.

E-mail address: [email protected] (Z. Li)

PII: S0303-2647(99)00032-5

We encode A by 000, C by 010, G by 101, and T by 111. The reason for this choice of encoding is the following. It would seem natural to encode the nucleotides A, C, G, and T as 00, 01, 10 and 11, respectively. However, this encoding, which satisfies the conditions $\bar{A} = \overline{00} = 11 = T$ and $\bar{C} = \overline{01} = 10 = G$, does not satisfy the natural requirement that the reverse of a nucleotide is the nucleotide itself. Indeed, $C^R = (01)^R = 10 = G$, which is not what we expect. In contrast, our encoding satisfies both natural constraints mentioned above: every codeword is a palindrome, and flipping its bits yields the codeword of the complementary nucleotide.
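The two constraints on the encoding can be checked mechanically. The following is a minimal sketch, not part of the original paper; the helper names `flip` and `check` are hypothetical, and the encodings are the two discussed in the text.

```python
# Candidate encodings of the four nucleotides as bit strings.
TWO_BIT = {"A": "00", "C": "01", "G": "10", "T": "11"}
THREE_BIT = {"A": "000", "C": "010", "G": "101", "T": "111"}

# Watson-Crick pairing used as the reference for the complement constraint.
PAIRS = {"A": "T", "C": "G", "G": "C", "T": "A"}


def flip(bits: str) -> str:
    """Bitwise complement of a bit string."""
    return "".join("1" if b == "0" else "0" for b in bits)


def check(encoding: dict) -> dict:
    """Report which of the two constraints an encoding satisfies."""
    decode = {code: base for base, code in encoding.items()}
    # Constraint 1: flipping the bits of a codeword gives the complementary base.
    complement_ok = all(
        decode.get(flip(code)) == PAIRS[base] for base, code in encoding.items()
    )
    # Constraint 2: reversing a codeword leaves it unchanged.
    reverse_ok = all(code[::-1] == code for code in encoding.values())
    return {"complement": complement_ok, "reverse": reverse_ok}


two_bit_result = check(TWO_BIT)
three_bit_result = check(THREE_BIT)
print("2-bit:", two_bit_result)
print("3-bit:", three_bit_result)
```

Under these assumptions the 2-bit encoding passes the complement test but fails the reverse test, while the 3-bit encoding passes both.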

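Let $\Sigma = \{0, 1\}$. Let $S$ be any commutative semiring (Definition 2.2). Let $S\langle\langle \Sigma^* \rangle\rangle$ and $S\langle\langle X^* \rangle\rangle$ be the corresponding sets of power series (Definition 2.3) over $\Sigma^*$ and $X^*$, respectively. Then $S\langle\langle \Sigma^* \rangle\rangle$ and $S\langle\langle X^* \rangle\rangle$ are semirings (Example 2 in Section 2), where the multiplication on both semirings is the generalization of the concatenation of two words over $\Sigma^*$ and $X^*$, respectively, and the addition on both semirings is the generalization of the union of two words over $\Sigma^*$ and $X^*$, respectively. Under our encoding scheme, $S\langle\langle X^* \rangle\rangle$ is a subsemiring of $S\langle\langle \Sigma^* \rangle\rangle$. Next, we generalize the operations on DNA strings, namely reverse, denoted by $R$, and complement, denoted by $\overline{\,\cdot\,}$, to an anti-automorphism and an automorphism (Definition 2.6) of the semiring $S\langle\langle \Sigma^* \rangle\rangle$, respectively. The subsemiring $S\langle\langle X^* \rangle\rangle$ of $S\langle\langle \Sigma^* \rangle\rangle$ is invariant under these two morphisms. That is, the automorphism $\overline{\,\cdot\,}$ is still an automorphism of $S\langle\langle X^* \rangle\rangle$ and the anti-automorphism $R$ is still an anti-automorphism of $S\langle\langle X^* \rangle\rangle$.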

To consider the polarity of the DNA strands, we introduce semimodules (Definition 2.4) and the direct sum of two semimodules (Definition 2.5). To get a rough idea of what a semimodule and a direct sum are, the reader could think of a semimodule as a vector space and of the direct sum as the Cartesian product of two vector spaces. Each element of the direct sum has two components. In our interpretation, the first component can be thought of as the DNA strand which starts with 5′ and ends with 3′, while the second component can be thought of as the DNA strand which starts with 3′ and ends with 5′. This provides a mathematical model of double DNA strands. The operations on DNA strands denoted by $\uparrow$, $\downarrow$, and $\updownarrow$ in (Boneh et al., 1997) can then be generalized to homomorphisms from the semimodule $S\langle\langle X^* \rangle\rangle$ to the direct sum $S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$. This brings out the relationships among these operators.

2. Background of linear algebra

We introduce some mathematical background. The interested reader should refer to (Kuich and Salomaa, 1986) for more detail.

Definition 2.1. A monoid $\langle M, \cdot, 1 \rangle$ consists of a set $M$, an associative operation $\cdot$ on $M$, and an identity $1$ such that $1 \cdot a = a \cdot 1 = a$ for every $a$ in $M$. A monoid is commutative if and only if $a \cdot b = b \cdot a$ for every $a$ and $b$ in $M$.

Definition 2.2. A semiring is a set $S$ together with two binary operations $+$ and $\cdot$ and two elements $0$ and $1$ such that

1. $\langle S, +, 0 \rangle$ is a commutative monoid,
2. $\langle S, \cdot, 1 \rangle$ is a monoid,
3. the distribution laws $a \cdot (b + c) = a \cdot b + a \cdot c$ and $(b + c) \cdot a = b \cdot a + c \cdot a$ hold for every $a$, $b$, and $c$ in $S$,
4. $0 \cdot a = a \cdot 0 = 0$ for every $a$.

If $a \cdot b = b \cdot a$ for any $a, b \in S$, then $S$ is called a commutative semiring.

Example 1.

(i) $\langle B, +, \cdot \rangle$ is a semiring, where '$+$' is 'OR', '$\cdot$' is 'AND', and $B = \{0, 1\}$.

(ii) $\langle X^*, \cdot \rangle$ is the free monoid generated by a nonempty countable set $X$. It has all the finite strings, also referred to as words,

\[ x_1 x_2 \cdots x_n, \quad x_i \in X, \]

as its elements, and the product $w_1 \cdot w_2$ is formed by writing the string $w_2$ immediately after the string $w_1$. The identity element is the empty word, denoted by $\varepsilon$.

Definition 2.3. Let $S$ be a semiring. Let $X$ be an alphabet, i.e. a finite nonempty set. A map $r : X^* \to S$ is called a formal power series. $r$ itself is written as a formal sum

\[ r = \sum_{w \in X^*} r(w)\, w. \]

The values $r(w)$ are referred to as the coefficients of the series. The collection of all power series over $X^*$ on $S$ is denoted by $S\langle\langle X^* \rangle\rangle$. Given $r \in S\langle\langle X^* \rangle\rangle$, the subset of $X^*$ defined by

\[ \{ w \mid r(w) \neq 0 \} \]

is termed the support of $r$ and denoted by $\mathrm{supp}(r)$. The subset of $S\langle\langle X^* \rangle\rangle$ consisting of all series with finite support is denoted by $S\langle X^* \rangle$. Series of $S\langle X^* \rangle$ are referred to as polynomials.

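Example 2. For $r_1, r_2 \in S\langle\langle X^* \rangle\rangle$, define $r_1 + r_2$ by $(r_1 + r_2)(w) = r_1(w) + r_2(w)$ and $r_1 r_2$ by $(r_1 r_2)(w) = \sum_{w_1 w_2 = w} r_1(w_1)\, r_2(w_2)$ for all $w \in X^*$. Then $S\langle\langle X^* \rangle\rangle$ is a semiring whose zero element is the series with every coefficient equal to the zero element of $S$ and whose identity element is the empty word $\varepsilon$ (with coefficient $1$).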
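The sum and product of Example 2 can be sketched for polynomials (series with finite support). This is an illustrative sketch only: it assumes $S$ is the semiring of natural numbers, represents a polynomial as a Python dict from words to nonzero coefficients, and uses the hypothetical names `add` and `mul`.

```python
from collections import defaultdict


def add(r1: dict, r2: dict) -> dict:
    """Coefficient-wise sum: (r1 + r2)(w) = r1(w) + r2(w)."""
    out = defaultdict(int)
    for r in (r1, r2):
        for word, coeff in r.items():
            out[word] += coeff
    return {w: c for w, c in out.items() if c != 0}


def mul(r1: dict, r2: dict) -> dict:
    """Product: (r1 r2)(w) is the sum of r1(w1) * r2(w2) over all w1 w2 = w."""
    out = defaultdict(int)
    for w1, c1 in r1.items():
        for w2, c2 in r2.items():
            out[w1 + w2] += c1 * c2
    return {w: c for w, c in out.items() if c != 0}


ZERO = {}        # the zero series: empty support
ONE = {"": 1}    # the identity: the empty word with coefficient 1

r1 = {"A": 2, "AC": 1}
r2 = {"": 1, "C": 3}
r_sum = add(r1, r2)
r_prod = mul(r1, r2)
print(r_sum)
print(r_prod)
```

The word AC arises in the product in two ways (A followed by C, and AC followed by the empty word), so its coefficient is 2·3 + 1·1 = 7.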

Definition 2.4. A left $S$-semimodule $M$ is a commutative monoid $\langle M, +, 0 \rangle$ together with a left scalar multiplication satisfying, for $a, b, 1 \in S$ and $x, y \in M$:

(i) $a(x + y) = ax + ay$,

(ii) $(a + b)x = ax + bx$,

(iii) $(ab)x = a(bx)$,

(iv) $1x = x$, $0x = 0$, and

(v) $a0 = 0$.

One can define a right $S$-semimodule in a similar way.

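Example 3. For $a \in S$ and $r \in S\langle\langle X^* \rangle\rangle$, if we define the left scalar product $ar$ by $ar = \sum_{w \in X^*} a\, r(w)\, w$, then $S\langle\langle X^* \rangle\rangle$ is a left $S$-semimodule. For $a \in S$ and $r \in S\langle\langle X^* \rangle\rangle$, if we define the right scalar product $ra$ by $ra = \sum_{w \in X^*} r(w)\, a\, w$, then $S\langle\langle X^* \rangle\rangle$ is a right $S$-semimodule.

If $S$ is commutative, then $S\langle\langle X^* \rangle\rangle$ is both a left $S$-semimodule and a right $S$-semimodule. From now on, we assume that the semiring $S$ is commutative.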

Definition 2.5. Let $M$, $N$ be two left $S$-semimodules. Then the direct sum of $M$ and $N$, denoted by $M \oplus N$, is the Cartesian product set $M \times N$ of tuples $(m, n)$, where $m \in M$ and $n \in N$, together with an addition, a $0$ element, and a left scalar multiplication satisfying the following for $a \in S$, $m_i \in M$, and $n_i \in N$ for $i = 1, 2$:

(i) $(m_1, n_1) + (m_2, n_2) = (m_1 + m_2, n_1 + n_2)$,

(ii) $0 = (0, 0)$, and

(iii) $a(m_1, n_1) = (am_1, an_1)$.

Any element in the direct sum $M \oplus N$ is denoted by $(m, n)$ for $m \in M$ and $n \in N$.

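Example 4. The direct sum $S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$ is a left $S$-semimodule. If we define the multiplication on the direct sum by

\[ (r_1, r_2)(r_3, r_4) = (r_1 r_3, r_2 r_4) \]

and the addition on the direct sum by

\[ (r_1, r_2) + (r_3, r_4) = (r_1 + r_3, r_2 + r_4), \]

then $S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$ is a semiring whose $0$ element is $(0, 0)$ and whose identity element is $(\varepsilon, \varepsilon)$, where $\varepsilon$ is the empty word of $X^*$. Similarly, since $S$ itself is a left $S$-semimodule, $S \oplus S$ is also a semiring whose $0$ element is $(0, 0)$ and whose identity element is $(1, 1)$. Note that we can think of elements of $S\langle\langle X^* \rangle\rangle$ as single DNA strands, and of elements of $S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$ as double DNA strands. Then the multiplication on $S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$ is the ligation of two double DNA strands, while the addition is just the union of two double DNA strands.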

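A map $r : X^* \times X^* \to S \oplus S$ is a power series. $r$ itself can be written as a formal sum

\[ \sum_{w_1 \in X^*,\, w_2 \in X^*} r(w_1, w_2)\,(w_1, w_2). \]

The collection of all power series over $X^* \times X^*$ on the semiring $S \oplus S$ is denoted by $(S \oplus S)\langle\langle X^* \times X^* \rangle\rangle$. One can define the support of the power series by

\[ \mathrm{supp}(r) = \{ (w_1, w_2) \mid r(w_1, w_2) \neq (0, 0) \} \subseteq X^* \times X^*. \]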


Definition 2.6. A map $f$ from a semiring $R$ to a semiring $S$ is a homomorphism if it satisfies, for any $a, b \in R$,

(i) $f(a + b) = f(a) + f(b)$,

(ii) $f(ab) = f(a) f(b)$.

If a homomorphism is both injective and surjective, then it is called an isomorphism. A homomorphism from $S$ to itself is called an endomorphism. An isomorphism from $S$ to itself is called an automorphism.

A map $f$ from a semiring $R$ to a semiring $S$ is an anti-homomorphism if it satisfies, for any $a, b \in R$,

(i) $f(a + b) = f(a) + f(b)$,

(ii) $f(ab) = f(b) f(a)$.

Similarly, we can define anti-isomorphism and anti-automorphism.

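Proposition 2.1. $S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$ is a subsemiring of $(S \oplus S)\langle\langle X^* \times X^* \rangle\rangle$ up to an isomorphism.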

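Proof. Given two maps (or power series) $r_1, r_2 : X^* \to S$, define a map $f$ from $S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$ to $(S \oplus S)\langle\langle X^* \times X^* \rangle\rangle$ by

\[ f((r_1, r_2)) = r, \quad \text{where } r(w_1, w_2) = (r_1(w_1), r_2(w_2)). \]

It is easy to check that the map $f$ is an injection. We now need to prove that $f$ is a homomorphism. In fact, for any $(r_i, s_i) \in S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$, $i = 1, 2$,

\[ f((r_1, s_1)(r_2, s_2)) = f(r_1 r_2, s_1 s_2) = \sum_{w_1, w_2 \in X^*} \big( (r_1 r_2)(w_1), (s_1 s_2)(w_2) \big)\,(w_1, w_2), \]

and

\[
\begin{aligned}
f((r_1, s_1))\, f((r_2, s_2))
&= \Big( \sum_{w_{11}, w_{12} \in X^*} \big( r_1(w_{11}), s_1(w_{12}) \big)\,(w_{11}, w_{12}) \Big) \cdot \Big( \sum_{w_{21}, w_{22} \in X^*} \big( r_2(w_{21}), s_2(w_{22}) \big)\,(w_{21}, w_{22}) \Big) \\
&= \sum_{w_1, w_2 \in X^*} \; \sum_{w_{11} w_{21} = w_1,\; w_{12} w_{22} = w_2} \big( r_1(w_{11})\, r_2(w_{21}),\; s_1(w_{12})\, s_2(w_{22}) \big)\,(w_1, w_2) \\
&= \sum_{w_1, w_2 \in X^*} \big( (r_1 r_2)(w_1), (s_1 s_2)(w_2) \big)\,(w_1, w_2).
\end{aligned}
\]

Therefore, $f(r_1 r_2, s_1 s_2) = f(r_1, s_1)\, f(r_2, s_2)$. It is easy to show that

\[ f(r_1 + r_2, s_1 + s_2) = f(r_1, s_1) + f(r_2, s_2). \]

Hence, $f$ is a homomorphism. (Q.E.D.)

By Proposition 2.1, we can define the support of a power series $r \in S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$. Let $r = (r_1, r_2)$, where $r_1, r_2 \in S\langle\langle X^* \rangle\rangle$. Then

\[ \mathrm{supp}(r) = \{ (w_1, w_2) \in X^* \times X^* \mid r(w_1, w_2) = (r_1(w_1), r_2(w_2)) \neq (0, 0) \}. \]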

3. DNA Strings and regular languages

DNA strings are words in $\{A, C, G, T\}^*$, the free monoid generated by the alphabet $\{A, C, G, T\}$ under the concatenation operation (e.g. $x$ = ACCTGAC). DNA strands are DNA strings with a polarity (e.g. 3′-ACCTGAC-5′). Now, let us ignore the polarity for the moment and consider all the DNA strings. We encode A by 000, C by 010, G by 101 and T by 111. Then the following proposition is easy to show.

Proposition 3.1. Let $x$, $y$ be two DNA string encodings. Then $x$ and $y$ are complementary if and only if $|x| = |y|$ and $x \text{ XOR } y = 1 \cdots 1$. Here, XOR represents the bitwise exclusive OR.
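As an illustration of Proposition 3.1, the following sketch encodes DNA strings with the 3-bit code and tests complementarity position by position, which is equivalent to requiring that the bitwise XOR is all ones. The function names `encode` and `are_complementary` are hypothetical.

```python
# The 3-bit code used throughout the paper.
CODE = {"A": "000", "C": "010", "G": "101", "T": "111"}


def encode(dna: str) -> str:
    """Encode a DNA string over {A, C, G, T} as a bit string."""
    return "".join(CODE[base] for base in dna)


def are_complementary(x_bits: str, y_bits: str) -> bool:
    """Equal length, and every bit of x differs from the matching bit of y.

    Two bits differ exactly when their XOR is 1, so this is the
    condition x XOR y = 1...1 of Proposition 3.1.
    """
    if len(x_bits) != len(y_bits):
        return False
    return all(a != b for a, b in zip(x_bits, y_bits))


x = encode("ACCTGAC")
print(are_complementary(x, encode("TGGACTG")))  # the complement of ACCTGAC
print(are_complementary(x, encode("CAGTCCA")))  # the reverse of ACCTGAC
```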

Let $S$ be an arbitrary semiring and let $\Sigma = \{0, 1\}$. Now we are ready to build a regular grammar $G = (N, \Sigma, R, Q)$ such that the set of all the DNA strings is the regular language $L(G)$, where $N = \{Q\}$ consists of only one nonterminal symbol $Q$, and the rewriting rules of $R$ are:

\[ Q \to 000 \mid 010 \mid 101 \mid 111 \mid 000Q \mid 010Q \mid 101Q \mid 111Q. \]

Theorem 3.1. Let $G$ be defined as above. Then the set of all the single DNA string encodings is the regular language $L(G)$. Given the $S\langle\langle \Sigma^* \rangle\rangle$-left linear system

\[ Q = 000 + 010 + 101 + 111 + (000 + 010 + 101 + 111)\,Q \tag{1} \]

corresponding to the above regular grammar, there exists a unique quasiregular solution (Theorem 14.11 in Kuich and Salomaa, 1986), where a solution of Eq. (1) is a power series $r \in S\langle\langle \Sigma^* \rangle\rangle$ such that $r = 000 + 010 + 101 + 111 + (000 + 010 + 101 + 111)\,r$, and $r$ is quasiregular if $r(\varepsilon) = 0$.

Theorem 3.2. Let $r = \sum_{w \in L(G)} w$. Then the power series $r$ is the only quasiregular solution.

Proof. Replacing $Q$ in Eq. (1) by $r$, we can easily show that Eq. (1) holds. (Q.E.D.)
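The substitution in the proof can be spot-checked on a truncation. The sketch below assumes $S$ is the Boolean semiring, so that a series is determined by its support and can be stored as a Python set of words; it compares $r$ with the right-hand side of Eq. (1) on all words of at most a fixed length. The names `language_up_to` and `rhs` and the bound `MAX_BLOCKS` are hypothetical.

```python
from itertools import product

BLOCKS = ("000", "010", "101", "111")  # codewords of A, C, G, T


def language_up_to(max_blocks: int) -> set:
    """All words of L(G) consisting of 1..max_blocks codewords."""
    words = set()
    for k in range(1, max_blocks + 1):
        for combo in product(BLOCKS, repeat=k):
            words.add("".join(combo))
    return words


def rhs(r: set, max_len: int) -> set:
    """Support of a + a*r from Eq. (1), keeping words of length <= max_len."""
    a = set(BLOCKS)
    concatenated = {u + w for u in a for w in r}
    return {w for w in a | concatenated if len(w) <= max_len}


MAX_BLOCKS = 4
r = language_up_to(MAX_BLOCKS)
fixed_point = rhs(r, 3 * MAX_BLOCKS) == r  # r = a + a r up to the length bound
quasiregular = "" not in r                 # the coefficient of the empty word is 0
print(len(r), fixed_point, quasiregular)
```

This only checks the equation up to the chosen length and over one semiring; it is not a substitute for the uniqueness statement cited from Kuich and Salomaa.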

4. DNA Operations

The notations $\overline{\,\cdot\,}$, $R$, $\downarrow$, $\uparrow$, and $\updownarrow$ for biological operations have been defined in (Boneh et al., 1997). Let $x$ be any DNA string. Then

(i) $\bar{x}$ is the string that results when each character of $x$ has been replaced by its complement. By Proposition 3.1, $\bar{x}$ is obtained by flipping the bits of $x$ if the string $x$ is thought of as a string in $\Sigma^*$ under our encoding scheme.

(ii) $x^R$ is the reverse of the string $x$.

(iii) $\uparrow x$ denotes the DNA strand whose corresponding DNA string is $x$ and whose polarity is 5′→3′.

(iv) $\downarrow x$ denotes the 3′–5′ DNA strand complementary to $x$.

(v) $\updownarrow x$ denotes the double strand that results when $\uparrow x$ and $\downarrow x$ anneal in solution.

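Example 1. The table below summarizes these operators on a DNA string:

\[
\begin{array}{lcl}
x & = & \text{ACCTGAC} \\
\bar{x} & = & \text{TGGACTG} \\
x^R & = & \text{CAGTCCA} \\
x^{\bar{R}} & = & \text{GTCAGGT} \quad (= \overline{x^R} = \bar{x}^R) \\
\uparrow x & = & 5'\text{-ACCTGAC-}3' \\
\downarrow x & = & 3'\text{-TGGACTG-}5' \\
\updownarrow x & = & \left\{ \begin{array}{l} 5'\text{-ACCTGAC-}3' \\ 3'\text{-TGGACTG-}5' \end{array} \right.
\end{array}
\]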
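The entries of this table can be reproduced with a few string functions. This is a sketch at the level of plain DNA strings; the function names are hypothetical, and strands are rendered as text with explicit 5' and 3' labels.

```python
PAIR = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement(x: str) -> str:
    """Replace each nucleotide by its complement."""
    return "".join(PAIR[base] for base in x)


def reverse(x: str) -> str:
    """Reverse of the string x."""
    return x[::-1]


def reverse_complement(x: str) -> str:
    """Complement of the reverse (equal to the reverse of the complement)."""
    return complement(reverse(x))


def up(x: str) -> str:
    """Single strand with string x and polarity 5' -> 3'."""
    return f"5'-{x}-3'"


def down(x: str) -> str:
    """The 3'-5' strand complementary to x."""
    return f"3'-{complement(x)}-5'"


def double(x: str) -> tuple:
    """Double strand formed when up(x) and down(x) anneal."""
    return (up(x), down(x))


x = "ACCTGAC"
for name, value in [
    ("complement", complement(x)),
    ("reverse", reverse(x)),
    ("reverse complement", reverse_complement(x)),
    ("up", up(x)),
    ("down", down(x)),
    ("double", double(x)),
]:
    print(name, value)
```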

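We generalize these operators to homomorphisms of semirings. For every $r \in S\langle\langle \Sigma^* \rangle\rangle$, we define

\[ \bar{r} = \sum_{w \in \Sigma^*} r(w)\, \bar{w}, \]

\[ r^R = \sum_{w \in \Sigma^*} r(w)\, w^R, \]

\[ r^{\bar{R}} = \sum_{w \in \Sigma^*} r(w)\, w^{\bar{R}}. \]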

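Theorem 4.1. The operators $R$ and $\bar{R}$ are anti-automorphisms of $S\langle\langle \Sigma^* \rangle\rangle$. The operator $\overline{\,\cdot\,}$ is an automorphism of $S\langle\langle \Sigma^* \rangle\rangle$.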

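Proof. For any two power series $r_1, r_2 \in S\langle\langle \Sigma^* \rangle\rangle$,

\[ \overline{r_1 + r_2} = \sum_{w \in \Sigma^*} (r_1(w) + r_2(w))\, \bar{w} = \sum_{w \in \Sigma^*} r_1(w)\, \bar{w} + \sum_{w \in \Sigma^*} r_2(w)\, \bar{w} = \bar{r}_1 + \bar{r}_2, \]

and, since $\overline{w_1 w_2} = \bar{w}_1 \bar{w}_2$,

\[ \overline{r_1 r_2} = \sum_{w \in \Sigma^*} \sum_{w_1 w_2 = w} (r_1(w_1)\, r_2(w_2))\, \bar{w} = \Big( \sum_{w \in \Sigma^*} r_1(w)\, \bar{w} \Big) \Big( \sum_{w \in \Sigma^*} r_2(w)\, \bar{w} \Big) = \bar{r}_1 \bar{r}_2. \]

Since $\overline{\,\cdot\,}$ is equivalent to the bitwise exclusive OR with $1 \cdots 1$, it is one to one. So $\overline{\,\cdot\,}$ is an automorphism of $S\langle\langle \Sigma^* \rangle\rangle$.

Similarly, we can show that $(r_1 + r_2)^R = r_1^R + r_2^R$, and, since $S$ is commutative,

\[
\begin{aligned}
(r_1 r_2)^R &= \sum_{w \in \Sigma^*} \sum_{w_1 w_2 = w} (r_1(w_1)\, r_2(w_2))\, w^R \\
&= \sum_{w \in \Sigma^*} \sum_{w_1 w_2 = w} (r_1(w_1)\, r_2(w_2))\, w_2^R w_1^R \\
&= \Big( \sum_{w \in \Sigma^*} r_2(w)\, w^R \Big) \Big( \sum_{w \in \Sigma^*} r_1(w)\, w^R \Big) = r_2^R\, r_1^R.
\end{aligned}
\]

So $R$ is an anti-automorphism of $S\langle\langle \Sigma^* \rangle\rangle$. Since

\[ \bar{R} = R \circ \overline{\,\cdot\,} = \overline{\,\cdot\,} \circ R, \]

$\bar{R}$ is an anti-automorphism of $S\langle\langle \Sigma^* \rangle\rangle$. (Q.E.D.)

Let $X = \{A, C, G, T\}$. Under our encoding scheme, $X^*$ is a subset of $\Sigma^*$. More precisely, $X^*$ is a submonoid of $\Sigma^*$, that is, $X^*$ is closed under the multiplication of $\Sigma^*$. From now on we consider $X^*$ as a submonoid of $\Sigma^*$.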
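Theorem 4.1 can be spot-checked on polynomials over $\Sigma^*$. The sketch below assumes $S$ is the (commutative) semiring of natural numbers and stores a polynomial as a dict from bit strings to coefficients; `mul`, `lift`, `comp` and `rev` are hypothetical helper names.

```python
from collections import defaultdict

FLIP = str.maketrans("01", "10")


def mul(r1: dict, r2: dict) -> dict:
    """Product of polynomials: concatenate words, multiply coefficients."""
    out = defaultdict(int)
    for w1, c1 in r1.items():
        for w2, c2 in r2.items():
            out[w1 + w2] += c1 * c2
    return {w: c for w, c in out.items() if c != 0}


def lift(word_op, r: dict) -> dict:
    """Apply a word operation to every word of r, keeping the coefficients."""
    out = defaultdict(int)
    for word, coeff in r.items():
        out[word_op(word)] += coeff
    return dict(out)


def comp(r: dict) -> dict:
    """Complement: flip every bit of every word."""
    return lift(lambda w: w.translate(FLIP), r)


def rev(r: dict) -> dict:
    """Reverse every word."""
    return lift(lambda w: w[::-1], r)


r1 = {"0": 2, "01": 1}
r2 = {"1": 1, "110": 3}

comp_is_hom = comp(mul(r1, r2)) == mul(comp(r1), comp(r2))
rev_is_anti = rev(mul(r1, r2)) == mul(rev(r2), rev(r1))
rev_not_hom = rev(mul(r1, r2)) != mul(rev(r1), rev(r2))
commute = comp(rev(r1)) == rev(comp(r1))
print(comp_is_hom, rev_is_anti, rev_not_hom, commute)
```

For this particular pair the reverse does not preserve the order of the factors, which is why $R$ is an anti-automorphism rather than an automorphism.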

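Theorem 4.2. The semiring $S\langle\langle X^* \rangle\rangle$ is a subsemiring of $S\langle\langle \Sigma^* \rangle\rangle$. It is invariant under the automorphism $\overline{\,\cdot\,}$ and the anti-automorphism $R$ of $S\langle\langle \Sigma^* \rangle\rangle$, that is, $\overline{S\langle\langle X^* \rangle\rangle} = S\langle\langle X^* \rangle\rangle$ and $R(S\langle\langle X^* \rangle\rangle) = S\langle\langle X^* \rangle\rangle$.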

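Proof. First, we show that $S\langle\langle X^* \rangle\rangle$ is a subsemiring of $S\langle\langle \Sigma^* \rangle\rangle$. It suffices to prove that for any two power series $r_1, r_2 \in S\langle\langle X^* \rangle\rangle$, we have $r_1 + r_2,\ r_1 r_2 \in S\langle\langle X^* \rangle\rangle$. The fact that $r_1 + r_2 \in S\langle\langle X^* \rangle\rangle$ is clearly true. Assume that $r_i = \sum_{w \in X^*} r_i(w)\, w$ for $i = 1, 2$. Then

\[ r_1 r_2 = \sum_{w, w_1, w_2 \in X^*,\; w_1 w_2 = w} (r_1(w_1)\, r_2(w_2))\, w. \]

Since $w_1, w_2$ are in $X^*$ and $X^*$ is a monoid, $w_1 w_2$ belongs to $X^*$. Therefore, $r_1 r_2$ is contained in $S\langle\langle X^* \rangle\rangle$.

Secondly, we show that $S\langle\langle X^* \rangle\rangle$ is invariant under the morphisms $\overline{\,\cdot\,}$ and $R$. That is, for any power series $r \in S\langle\langle X^* \rangle\rangle$, $r^R$ and $\bar{r}$ are contained in $S\langle\langle X^* \rangle\rangle$. Since the morphisms have nothing to do with the coefficients of the power series, it suffices to prove that $w^R$ and $\bar{w}$ are contained in $X^*$ for any $w \in X^*$. By our encoding scheme, $(000)^R = 000$, $(010)^R = 010$, $(101)^R = 101$, and $(111)^R = 111$. Moreover, $\bar{A} = \overline{000} = 111 = T$, $\bar{C} = \overline{010} = 101 = G$, $\bar{G} = \overline{101} = 010 = C$, and $\bar{T} = \overline{111} = 000 = A$. Therefore, $w^R$ and $\bar{w}$ are in $X^*$ for any $w \in X^*$, since $w$ is a word formed by A (= 000), C (= 010), G (= 101) and T (= 111).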

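Corollary 4.1. The compositions $R \circ \overline{\,\cdot\,}$ and $\overline{\,\cdot\,} \circ R$ are anti-automorphisms of $S\langle\langle X^* \rangle\rangle$. Moreover, $R \circ \overline{\,\cdot\,} = \overline{\,\cdot\,} \circ R$.

Since $R \circ \overline{\,\cdot\,} = \overline{\,\cdot\,} \circ R$, we use the notation $\bar{R}$ for both.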

5. Double strands and annealing

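By Theorem 4.2, the semiring $S\langle\langle X^* \rangle\rangle$ is a subsemiring of $S\langle\langle \Sigma^* \rangle\rangle$. Hence it is both a left and a right $S$-semimodule (recall that we assume that $S$ is commutative). Therefore, we can define $S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$, which is also a subsemiring of $S\langle\langle \Sigma^* \rangle\rangle \oplus S\langle\langle \Sigma^* \rangle\rangle$.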

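Now we consider the polarity of the DNA strings. Suppose that $r$ is a power series in $S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$. An element $w \in \mathrm{supp}(r)$ is of the form $(w_1, w_2)$. We make the convention that $w_1$ stands for $\uparrow w_1$ = 5′-$w_1$-3′, and $w_2$ stands for 3′-$w_2$-5′. For example, if $(w_1, w_2) = (\text{ACTG}, \text{AG}) \in \mathrm{supp}(r)$, then what we really mean is that $w_1$ = 5′-ACTG-3′ and $w_2$ = 3′-AG-5′.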

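Now we define $i_j : S\langle\langle X^* \rangle\rangle \to S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$ for $j = 1, 2$ by $i_1(r) = r \oplus 0$ and $i_2(r) = 0 \oplus r$. Note that the supports of the images of $i_1$ and $i_2$ are empty. Both maps $i_1$ and $i_2$ are homomorphisms of semirings from $S\langle\langle X^* \rangle\rangle$ to $S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$. Conversely, we define $p_j : S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle \to S\langle\langle X^* \rangle\rangle$ for $j = 1, 2$ by

\[ p_1(r_1 \oplus r_2) = r_1, \quad \text{and} \quad p_2(r_1 \oplus r_2) = r_2. \]

Then $p_1$ and $p_2$ are homomorphisms of semirings from $S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$ to $S\langle\langle X^* \rangle\rangle$.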

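We identify the operators $\uparrow$, $\downarrow$, and $\updownarrow$ with homomorphisms from $S\langle\langle X^* \rangle\rangle$ to $S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$ as follows:

\[ \uparrow = i_1, \qquad \downarrow = i_2 \circ \overline{\,\cdot\,}, \text{ that is, } \downarrow(r) = 0 \oplus \bar{r}, \]

\[ \updownarrow(r) = (r \oplus \bar{r}). \]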

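Theorem 5.1. The operators $\uparrow$, $\downarrow$, and $\updownarrow$ are homomorphisms from $S\langle\langle X^* \rangle\rangle$ to $S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$.

Proof. The proof is straightforward and similar to the proof of Theorem 4.1. (Q.E.D.)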

We interpret $p_1$ and $p_2$ as melting and $\updownarrow$ as the process of annealing. Note that $\updownarrow$ describes the process of a single DNA strand looking for its matching single DNA strand to form a double DNA strand.

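Theorem 5.2. The map $\updownarrow \circ\, \overline{\,\cdot\,} \circ p_2$ is an endomorphism of $S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$. The invariant of the endomorphism is the set

\[ \updownarrow(S\langle\langle X^* \rangle\rangle) = \{ \updownarrow(r) \mid r \in S\langle\langle X^* \rangle\rangle \}, \]

which is a subsemiring of $S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$.

Proof. Let us consider the following diagram:

\[
\begin{array}{ccc}
S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle & \xrightarrow{\;\updownarrow \circ\, \overline{\,\cdot\,} \circ p_2\;} & S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle \\
\Big\downarrow {\scriptstyle p_2} & & \Big\uparrow {\scriptstyle \updownarrow} \\
S\langle\langle X^* \rangle\rangle & \xrightarrow{\;\overline{\,\cdot\,}\;} & S\langle\langle X^* \rangle\rangle
\end{array}
\]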

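Since $p_2$, $\overline{\,\cdot\,}$, and $\updownarrow$ are homomorphisms, the map $\updownarrow \circ\, \overline{\,\cdot\,} \circ p_2$ is an endomorphism of the semiring $S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$.

Now we prove that the invariant of the endomorphism $\updownarrow \circ\, \overline{\,\cdot\,} \circ p_2$ is not empty.

For any $r \in S\langle\langle X^* \rangle\rangle$, $\updownarrow(r) = (r \oplus \bar{r})$ is in the invariant since

\[ (\updownarrow \circ\, \overline{\,\cdot\,} \circ p_2)(\updownarrow(r)) = (\updownarrow \circ\, \overline{\,\cdot\,} \circ p_2)(r \oplus \bar{r}) = (\updownarrow \circ\, \overline{\,\cdot\,})(\bar{r}) = \updownarrow(\bar{\bar{r}}) = \updownarrow(r). \]

This also proves that the invariant of the endomorphism contains the set $\{ \updownarrow(r) \mid r \in S\langle\langle X^* \rangle\rangle \}$.

Conversely, assume that $r_1 \oplus r_2 \in S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$ is an element in the invariant. Then

\[ (\updownarrow \circ\, \overline{\,\cdot\,} \circ p_2)(r_1 \oplus r_2) = (\updownarrow \circ\, \overline{\,\cdot\,})(r_2) = \updownarrow(\bar{r}_2) = \bar{r}_2 \oplus \bar{\bar{r}}_2 = \bar{r}_2 \oplus r_2 = r_1 \oplus r_2. \]

So $r_1 = \bar{r}_2$. Thus $r_1 \oplus r_2 = \updownarrow(\bar{r}_2)$. This shows that the invariant is contained in the set $\{ \updownarrow(r) \mid r \in S\langle\langle X^* \rangle\rangle \}$.

Hence the invariant of the endomorphism is exactly the set $\{ \updownarrow(r) \mid r \in S\langle\langle X^* \rangle\rangle \}$. It is a subsemiring of $S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$ since it is the image of $S\langle\langle X^* \rangle\rangle$ under the homomorphism $\updownarrow$. (Q.E.D.)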

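The following theorem summarizes the relationships among the operators.

Theorem 5.3.

(i) $\overline{\,\cdot\,} \circ R = R \circ \overline{\,\cdot\,} = R \circ p_2 \circ \downarrow = p_1 \circ \uparrow \circ \bar{R} = \bar{R}$,

(ii) $\overline{\,\cdot\,} \circ p_1 \circ \updownarrow = p_2 \circ \updownarrow = \overline{\,\cdot\,}$,

(iii) $p_1 \circ \updownarrow$ is the identity map,

(iv) $\uparrow x + \downarrow x = \updownarrow x$ for $x \in S\langle\langle X^* \rangle\rangle$.

Proof. For relation (i), it suffices to prove that $R \circ p_2 \circ \downarrow = p_1 \circ \uparrow \circ \bar{R} = \bar{R}$. In fact, for any $r \in S\langle\langle X^* \rangle\rangle$,

\[ (R \circ p_2 \circ \downarrow)(r) = (R \circ p_2)(0 \oplus \bar{r}) = R(\bar{r}) = r^{\bar{R}}, \]

\[ (p_1 \circ \uparrow \circ \bar{R})(r) = p_1(r^{\bar{R}} \oplus 0) = r^{\bar{R}}. \]

Relation (ii) is true since, for any $r \in S\langle\langle X^* \rangle\rangle$,

\[ \overline{p_1(\updownarrow(r))} = \overline{p_1(r \oplus \bar{r})} = \bar{r}, \qquad p_2(\updownarrow(r)) = p_2(r \oplus \bar{r}) = \bar{r}. \]

Relation (iii) is true since, for any $r \in S\langle\langle X^* \rangle\rangle$,

\[ p_1(\updownarrow(r)) = p_1(r \oplus \bar{r}) = r. \]

Relation (iv) is true by the definitions of $\uparrow$, $\downarrow$, and $\updownarrow$. (Q.E.D.)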
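The relations of Theorems 5.2 and 5.3 can be spot-checked on a small polynomial model. The sketch below rests on several assumptions that are not stated in this form in the paper: $S$ is the semiring of natural numbers, a polynomial is a dict from DNA strings to coefficients, an element $r_1 \oplus r_2$ of the direct sum is a Python tuple, $\downarrow(r)$ is taken to be $0 \oplus \bar{r}$ as used in the proof of relation (i), and relation (iv) is read as $\uparrow r + \downarrow r = \updownarrow r$. All function names are hypothetical.

```python
from collections import defaultdict

PAIR = str.maketrans("ACGT", "TGCA")
ZERO = {}  # the zero polynomial


def add(r1: dict, r2: dict) -> dict:
    """Coefficient-wise sum of two polynomials."""
    out = defaultdict(int)
    for r in (r1, r2):
        for word, coeff in r.items():
            out[word] += coeff
    return {w: c for w, c in out.items() if c != 0}


def lift(word_op, r: dict) -> dict:
    """Apply a word operation to every word of r, keeping the coefficients."""
    out = defaultdict(int)
    for word, coeff in r.items():
        out[word_op(word)] += coeff
    return dict(out)


def bar(r): return lift(lambda w: w.translate(PAIR), r)  # complement
def rev(r): return lift(lambda w: w[::-1], r)            # reverse
def bar_rev(r): return bar(rev(r))                       # reverse complement

def up(r): return (r, ZERO)            # i_1
def down(r): return (ZERO, bar(r))     # assumed: 0 (+) complement of r
def anneal(r): return (r, bar(r))      # r (+) complement of r
def p1(pair): return pair[0]
def p2(pair): return pair[1]
def add_pairs(a, b): return (add(a[0], b[0]), add(a[1], b[1]))
def endo(pair): return anneal(bar(p2(pair)))  # the map of Theorem 5.2


r = {"ACCTGAC": 1, "AG": 2}

rel_i = bar(rev(r)) == rev(bar(r)) == rev(p2(down(r))) == p1(up(bar_rev(r)))
rel_ii = bar(p1(anneal(r))) == p2(anneal(r)) == bar(r)
rel_iii = p1(anneal(r)) == r
rel_iv = add_pairs(up(r), down(r)) == anneal(r)
fixed = endo(anneal(r)) == anneal(r)   # anneal(r) lies in the invariant
not_fixed = endo((r, r)) != (r, r)     # (r, r) does not, since r != bar(r)
print(rel_i, rel_ii, rel_iii, rel_iv, fixed, not_fixed)
```

A check on one polynomial illustrates the relations but does not prove them; the proofs above cover arbitrary power series.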

Acknowledgements

The author wants to thank Professor Lila Kari for her suggestions and comments.

References

Adleman, L., 1994. Molecular computation of solutions to combinatorial problems. Science 266, 1021–1024.

Boneh, D., Dunworth, C., Lipton, R.J., 1997. A notation for DNA operations. Unpublished manuscript.

Guarnieri, F., Bancroft, C., 1998. Use of a horizontal chain reaction for DNA-based addition. In: Landweber, L.F., Baum, E. (Eds.), DNA Based Computers II. DIMACS Series, Vol. 44. AMS Press, pp. 105–111.

Kari, L., 1997. DNA Computing: Arrival of Biological Mathematics. The Mathematical Intelligencer 2, 9–22.

Kuich, W., Salomaa, A., 1986. Semirings, Automata, Languages. Springer-Verlag.

Leete, T., Schwartz, M., Williams, R., Wood, D., Salem, J., Rubin, H., 1998. Massively parallel DNA computation: expansion of symbolic determinants. In: Landweber, L.F., Baum, E. (Eds.), DNA Based Computers II. DIMACS Series, Vol. 44. AMS Press, pp. 45–48.

Oliver, J., 1998. Computation with DNA: matrix multiplication. In: Landweber, L.F., Baum, E. (Eds.), DNA Based Computers II. DIMACS Series, Vol. 44. AMS Press, pp. 113–122.

Williams, R., Wood, D., 1998. Exascale computer algebra problems interconnect with molecular reactions and complexity theory. In: Landweber, L.F., Baum, E. (Eds.), DNA Based Computers II. DIMACS Series, Vol. 44. AMS Press, pp. 267–275.
