A condensed 3D graphical representation of RNA secondary structures
Wen Zhu a, Bo Liao a,b,*, Kequan Ding b
a School of Computer and Communication, Hunan University, Changsha, Hunan 410082, China
b Science 100 Lab, Graduate School of the Chinese Academy of Sciences, Beijing 100049, China
Received 19 December 2004; received in revised form 25 April 2005; accepted 25 April 2005
Available online 02 August 2005
Abstract
In this paper, we propose a condensed 3D graphical representation of RNA secondary structures. Based on this representation, we outline an
approach that constructs a 3-component vector whose components are the normalized leading eigenvalues of the L/L matrices associated with
an RNA secondary structure. An examination of similarities/dissimilarities among the secondary structures at the 3′-terminus of different viruses
illustrates the utility of the approach.
© 2005 Elsevier B.V. All rights reserved.
Keywords: RNA secondary structure; Similarity; Virus; 3D Graphical representation
1. Introduction
Ribonucleic acid (RNA) is an important molecule that
performs a wide range of functions in biological systems. In
particular, it is RNA (not DNA) that contains the genetic
information of viruses such as HIV and therefore regulates the
functions of such viruses. RNA has recently attracted
much attention because of its catalytic properties, leading to an
increased interest in obtaining structural information [1–2].
Using methods similar to those presented by Randic et al. [3–9,11],
one can also outline several 2D and 3D graphical representations
of RNA primary sequences and use them to compute the
similarity of RNA primary sequences. Here, we consider
the similarity of RNA secondary structures [10,12].
Previously, almost all such comparisons were based on
alignments of RNA structures: a distance function or a score
function is used to represent insertion, deletion, and substitution
of letters in the compared structures. Using the distance
function, one can compute the similarity between RNA structures.
There are many algorithms for computing the similarity
between RNA secondary structures [13–19]. However, these
approaches ignore the chemical structures and the chemical
properties, and they are subject to a non-crossing restriction.
0166-1280/$ - see front matter © 2005 Elsevier B.V. All rights reserved.
doi:10.1016/j.theochem.2005.04.042
* Corresponding author. Address: School of Computer and Communication,
Hunan University, Changsha, Hunan 410082, China. Tel.: +86 10 88256148;
fax: +86 10 88256147.
E-mail address: [email protected] (B. Liao).
In this paper, based on the classifications of bases and base
pairs, we propose a condensed 3D graphical representation
that avoids the limitation associated with non-crossing.
We compare the secondary structures belonging to nine
different viruses based on this graphical representation.
Fig. 1 lists these secondary structures, which were reported by
Reusken and Bol [20]. Based on the order of free bases A, G, U, C
and base pairs A–U, C–G, we reduce an RNA secondary structure to
three curves. We construct a 3-component vector consisting of
the normalized leading eigenvalues of the L/L matrices. The
similarities are computed by calculating the Euclidean distance
between the end points of the vectors or by calculating the
correlation angle between the two vectors.
2. 3D Graphical representation of RNA secondary
structures
The secondary structure of an RNA is a set of free bases and
base pairs forming hydrogen bonds between A–U and G–C. Let
A′, U′, G′, C′ denote A, U, G, C in the base pairs A–U and G–C,
respectively. Then we can obtain a special sequence representation
of the secondary structure, which we call the characteristic
sequence of the secondary structure. For example, the
characteristic sequence of the substructure of
AlMV-3 (Fig. 2) is G′G′G′AAUC′C′C′ (from 3′ to 5′).
Journal of Molecular Structure: THEOCHEM 757 (2005) 193–198
www.elsevier.com/locate/theochem
Fig. 1. Secondary structure at the 3′-terminus of RNA 3 of alfalfa mosaic virus (AlMV-3 [21]), citrus leaf rugose virus (CiLRV-3 [22]), tobacco streak virus (TSV-3
[23,24]), citrus variegation virus (CVV-3 [22]), apple mosaic virus (APMV-3 [25]), prune dwarf ilarvirus (PDV-3 [26]), lilac ring mottle virus (LRMV-3 [27]), elm
mottle virus (EMV-3 [28]) and asparagus virus II (AVII [29]). Numbering of nucleotides is from the 3′ end of RNA 3.
Fig. 2. Substructure of AlMV-3.
Let G = g1, g2, … be an arbitrary characteristic sequence of an
RNA secondary structure. We define a map f, which maps G into
a plot set. Explicitly, f(G) = f(g1), f(g2), …, where

\[
f(g_i) =
\begin{cases}
(1,\ 0,\ A_i + U_i) & \text{if } g_i \in \{A, U\} \\
(-1,\ 0,\ C_i + G_i) & \text{if } g_i \in \{C, G\} \\
(1,\ 1,\ A'_i + U'_i) & \text{if } g_i \in \{A', U'\} \\
(-1,\ -1,\ C'_i + G'_i) & \text{if } g_i \in \{C', G'\}
\end{cases}
\]

In other words, we reduce an RNA secondary structure
to a series of nodes P0, P1, P2, …, PN, whose coordinates $x_i$,
$y_i$, $z_i$ (i = 0, 1, 2, …, N, where N is the length of the RNA
secondary structure being studied) satisfy

\[
\begin{cases}
x_i = 1,\ y_i = 0,\ z_i = A_i + U_i & \text{if } g_i \in \{A, U\} \\
x_i = -1,\ y_i = 0,\ z_i = C_i + G_i & \text{if } g_i \in \{C, G\} \\
x_i = 1,\ y_i = 1,\ z_i = A'_i + U'_i & \text{if } g_i \in \{A', U'\} \\
x_i = -1,\ y_i = -1,\ z_i = C'_i + G'_i & \text{if } g_i \in \{C', G'\}
\end{cases}
\]

where $A_i$, $C_i$, $G_i$, $U_i$, $A'_i$, $U'_i$, $C'_i$ and $G'_i$ are the cumulative
occurrence numbers of A, C, G, U, A′, U′, C′ and G′,
respectively, in the subsequence from the first base to the i-th
base of the sequence. For example, the plot set of
the substructure of AlMV-3 (Fig. 2) is {(−1, −1, 1), (−1, −1,
2), (−1, −1, 3), (1, 0, 1), (1, 0, 2), (1, 0, 3), (−1, −1, 4), (−1,
−1, 5), (−1, −1, 6)}. Table 1 shows the (x, y, z) coordinates
of the substructure of AlMV-3 (Fig. 2).

Table 1
Cartesian 3D coordinates for the substructure of AlMV-3 (Fig. 2)
Base  Nucleotide  x    y    z
1     G′          −1   −1   1
2     G′          −1   −1   2
3     G′          −1   −1   3
4     A            1    0   1
5     A            1    0   2
6     U            1    0   3
7     C′          −1   −1   4
8     C′          −1   −1   5
9     C′          −1   −1   6
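The map f for pattern AU can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `PATTERN_AU`, `characteristic_plot_set` and `ALMV3_SUBSTRUCTURE`, and the choice of writing a paired base as a string with a trailing apostrophe, are assumptions made here for readability. Each z coordinate is obtained from a running count per group, which equals the sum of cumulative occurrence numbers used in the definition.

```python
from collections import Counter

# Pattern AU (map f). Each symbol maps to (x, y, group); the z coordinate is the
# cumulative count of the symbols in that group, e.g. A_i + U_i for free A and U.
# The group labels are hypothetical names used only as counter keys.
PATTERN_AU = {
    "A": (1, 0, "free AU"), "U": (1, 0, "free AU"),
    "C": (-1, 0, "free CG"), "G": (-1, 0, "free CG"),
    "A'": (1, 1, "paired AU"), "U'": (1, 1, "paired AU"),
    "C'": (-1, -1, "paired CG"), "G'": (-1, -1, "paired CG"),
}


def characteristic_plot_set(sequence, pattern=PATTERN_AU):
    """Return the list of (x, y, z) points for a characteristic sequence."""
    counts = Counter()
    points = []
    for symbol in sequence:
        x, y, group = pattern[symbol]
        counts[group] += 1  # cumulative occurrences of this group so far
        points.append((x, y, counts[group]))
    return points


# Characteristic sequence of the AlMV-3 substructure, read from 3' to 5'.
ALMV3_SUBSTRUCTURE = ["G'", "G'", "G'", "A", "A", "U", "C'", "C'", "C'"]

if __name__ == "__main__":
    for base, point in enumerate(characteristic_plot_set(ALMV3_SUBSTRUCTURE), start=1):
        print(base, point)
```

Run on the AlMV-3 substructure, the sketch yields the nine coordinate triples listed in Table 1.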
We call the corresponding plot set the characteristic plot set.
The curve connecting all points of the characteristic plot set in
turn is called the characteristic curve. Fig. 3 shows the
characteristic curve representing the first nine bases of the
substructure of AlMV-3 (Fig. 2).
Fig. 3. The 3D graphical representation of the substructure of AlMV-3 based on
pattern AU. The dots denote the bases making up the substructure.
The bases of RNA can be classified into groups: purine (A,
G)/pyrimidine (C, U), amino (A, C)/keto (G, U), and weak H-bond
(A, U)/strong H-bond (G, C). We therefore obtain exactly three
representations, corresponding to the three classifications. We
call them pattern AG, AC and AU, respectively. The maps f, f′
and f″ correspond to the patterns AU, AC and AG, respectively.
Figs. 4 and 5 show the curves of the substructure of
AlMV-3 based on patterns AC and AG, respectively.
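\[
f'(g_i) =
\begin{cases}
(1,\ 0,\ A_i + C_i) & \text{if } g_i \in \{A, C\} \\
(-1,\ 0,\ U_i + G_i) & \text{if } g_i \in \{U, G\} \\
(1,\ 1,\ A'_i + U'_i) & \text{if } g_i = A' \\
(-1,\ 1,\ A'_i + U'_i) & \text{if } g_i = U' \\
(-1,\ -1,\ G'_i + C'_i) & \text{if } g_i = G' \\
(1,\ -1,\ G'_i + C'_i) & \text{if } g_i = C'
\end{cases}
\]

\[
f''(g_i) =
\begin{cases}
(1,\ 0,\ A_i + G_i) & \text{if } g_i \in \{A, G\} \\
(-1,\ 0,\ U_i + C_i) & \text{if } g_i \in \{C, U\} \\
(1,\ 1,\ A'_i + U'_i) & \text{if } g_i = A' \\
(-1,\ 1,\ A'_i + U'_i) & \text{if } g_i = U' \\
(1,\ -1,\ G'_i + C'_i) & \text{if } g_i = G' \\
(-1,\ -1,\ G'_i + C'_i) & \text{if } g_i = C'
\end{cases}
\]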
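The maps for patterns AC and AG differ from the pattern AU map only in their lookup tables, which the following sketch makes explicit. It is an illustration under assumptions: the names `PATTERN_AC`, `PATTERN_AG` and `plot_set` are hypothetical, and paired bases are written as strings with a trailing apostrophe.

```python
from collections import Counter

# Each symbol maps to (x, y, group); z is the cumulative count within the group.
# The group labels are hypothetical names used only as counter keys.
PATTERN_AC = {  # map f'
    "A": (1, 0, "free AC"), "C": (1, 0, "free AC"),
    "U": (-1, 0, "free UG"), "G": (-1, 0, "free UG"),
    "A'": (1, 1, "paired AU"), "U'": (-1, 1, "paired AU"),
    "G'": (-1, -1, "paired GC"), "C'": (1, -1, "paired GC"),
}
PATTERN_AG = {  # map f''
    "A": (1, 0, "free AG"), "G": (1, 0, "free AG"),
    "U": (-1, 0, "free UC"), "C": (-1, 0, "free UC"),
    "A'": (1, 1, "paired AU"), "U'": (-1, 1, "paired AU"),
    "G'": (1, -1, "paired GC"), "C'": (-1, -1, "paired GC"),
}


def plot_set(sequence, pattern):
    """Return the (x, y, z) points of a characteristic sequence for one pattern."""
    counts = Counter()
    points = []
    for symbol in sequence:
        x, y, group = pattern[symbol]
        counts[group] += 1
        points.append((x, y, counts[group]))
    return points


# Characteristic sequence of the AlMV-3 substructure, read from 3' to 5'.
ALMV3_SUBSTRUCTURE = ["G'", "G'", "G'", "A", "A", "U", "C'", "C'", "C'"]

if __name__ == "__main__":
    print("pattern AC:", plot_set(ALMV3_SUBSTRUCTURE, PATTERN_AC))
    print("pattern AG:", plot_set(ALMV3_SUBSTRUCTURE, PATTERN_AG))
```

Unlike pattern AU, these two maps place the two members of a base pair at different (x, y) positions, so the G′ run and the C′ run of the substructure trace separate vertical lines while sharing one cumulative counter.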
Fig. 4. The 3D graphical representation of the substructure of AlMV-3 based on
pattern AC. The dots denote the bases making up the substructure.
Fig. 5. The 3D graphical representation of the substructure of AlMV-3 based on
pattern AG. The dots denote the bases making up the substructure.
3. Similarities/dissimilarities among the RNA secondary
structures of nine viruses
In order to find invariants sensitive to the RNA
secondary structure, we transform the 3D representation of
the RNA secondary structure into another mathematical object,
a matrix. Once we have a matrix representing an RNA secondary
structure, we can use some of its invariants as descriptors
of the structure. One such matrix is the L/L matrix, whose
elements $l_{i,j}$ are defined as the quotient of the Euclidean
distance between a pair of vertices (dots) of the characteristic
curve and the sum of the distances between the same pair of
vertices measured along the characteristic curve. In other
words,

\[
l_{i,j} = \frac{d_{i,j}}{\sum_{k=i}^{j-1} d_{k,k+1}}
\]

where $d_{i,j}$ is the Euclidean distance between vertices i and j.
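The definition above can be turned into a short computation. The sketch below is one possible reading rather than the authors' code: it assumes that the matrix is symmetric with a zero diagonal (the text only defines the quotient for a pair of vertices), that consecutive vertices never coincide, and it estimates the leading eigenvalue with a shifted power iteration in plain Python. The function names `ll_matrix` and `leading_eigenvalue` are hypothetical.

```python
import math


def ll_matrix(points):
    """L/L matrix: straight-line distance divided by distance along the curve."""
    size = len(points)
    # prefix[k] is the length of the curve from vertex 0 to vertex k.
    prefix = [0.0]
    for k in range(size - 1):
        prefix.append(prefix[-1] + math.dist(points[k], points[k + 1]))
    matrix = [[0.0] * size for _ in range(size)]  # assumed zero diagonal
    for i in range(size):
        for j in range(i + 1, size):
            along_curve = prefix[j] - prefix[i]
            value = math.dist(points[i], points[j]) / along_curve
            matrix[i][j] = matrix[j][i] = value  # assumed symmetric
    return matrix


def leading_eigenvalue(matrix, max_iterations=10000, tolerance=1e-12):
    """Estimate the largest eigenvalue of a non-empty symmetric matrix."""
    size = len(matrix)
    vector = [1.0 / math.sqrt(size)] * size
    estimate = 0.0
    for _ in range(max_iterations):
        # Iterate with (M + I); the shift keeps the dominant eigenvalue unique.
        shifted = [vector[i] + sum(matrix[i][j] * vector[j] for j in range(size))
                   for i in range(size)]
        norm = math.sqrt(sum(value * value for value in shifted))
        vector = [value / norm for value in shifted]
        # Rayleigh quotient of the unshifted matrix M.
        image = [sum(matrix[i][j] * vector[j] for j in range(size))
                 for i in range(size)]
        new_estimate = sum(vector[i] * image[i] for i in range(size))
        if abs(new_estimate - estimate) < tolerance:
            return new_estimate
        estimate = new_estimate
    return estimate


# Pattern AU plot set of the AlMV-3 substructure (Table 1).
ALMV3_PLOT_SET = [(-1, -1, 1), (-1, -1, 2), (-1, -1, 3), (1, 0, 1), (1, 0, 2),
                  (1, 0, 3), (-1, -1, 4), (-1, -1, 5), (-1, -1, 6)]

if __name__ == "__main__":
    print(leading_eigenvalue(ll_matrix(ALMV3_PLOT_SET)))
```

Two quick checks follow from the definition: every entry for adjacent vertices equals 1, and for vertices lying on a straight line all off-diagonal entries equal 1. For the AlMV-3 substructure, the entry for bases 2 and 4 is √6/4, since the straight-line distance is √6 and the path along the curve has length 1 + 3.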
Its eigenvalues, and in particular its leading eigenvalue, can be
used as descriptors of an RNA secondary structure. Among all
eigenvalues, the leading eigenvalue of a matrix, λ1, often plays
a special role. In the case of the adjacency matrix of trees,
Lovasz and Pelikan [30] suggested the leading eigenvalue λ1 as
an index of molecular branching. More recently it was shown
that the leading eigenvalue of a substituted path matrix, λλ1,
gives an even better characterization of molecular branching [31–34].
We characterize the coding sequences of the RNA
secondary structures of nine species by means of the leading
eigenvalue of the L/L matrix. Table 2 gives the leading
eigenvalues of the L/L matrices associated with the three
essentially different patterns of the characteristic curves
representing each of the coding sequences.

Table 2
The leading eigenvalues of the L/L matrices associated with three essentially different patterns of the characteristic curves for the coding sequences of Fig. 1
Patterns  AlMV-3   CiLRV-3  TSV-3    CVV-3    APMV-3   LRMV-3   PDV-3    EMV-3    AVII
AU        12.9727  17.5897  15.4061  15.3904  13.5551  12.4008  16.0753  11.1649  13.5015
AC        13.0909  15.0653  14.7057  15.8327  15.3412  11.8365  16.1528  10.4996  12.6272
AG        12.1478  17.1360  13.4306  16.3975  13.3476  13.8402  15.6737  12.1459  14.3504

Next, we illustrate the use of the 3D quantitative
characterization of RNA secondary structure with an examination
of similarities/dissimilarities among the nine coding
sequences shown in Fig. 1. We construct a 3-component vector
consisting of the normalized leading eigenvalues λ1/n, where λ1
is the leading eigenvalue of the L/L matrix and n is the length of the
coding sequence. All the vectors are rooted at the origin (0, 0, 0).
The underlying assumption is that if two vectors point
in a similar direction in three-dimensional space, then the two
RNA secondary structures represented by the 3-component
vectors are similar.
The similarities among such vectors can be computed in two
ways: (1) calculating the Euclidean distance between the end
points of the vectors; (2) calculating the correlation angle between
two vectors. The smaller the Euclidean distance between the
end points of two vectors, the more similar the RNA
secondary structures are. Likewise, the smaller the correlation angle
between two vectors, the more similar the RNA secondary
structures are.
The Euclidean distance between vector end points
and the correlation angle between vectors are different
measures of the similarity of RNA secondary structures.
Observing Tables 3 and 4, we find an overall
qualitative agreement among the similarities. In general, the
correlation angle gives the best tolerance for the similarities.

Table 3
The similarity/dissimilarity matrix for the coding sequences of Fig. 1 based on the Euclidean distances between the end points of the 3-component vectors of the normalized leading eigenvalues of the L/L matrices
Species  AlMV-3  CiLRV-3  TSV-3   CVV-3   APMV-3  LRMV-3  PDV-3   EMV-3   AVII
AlMV-3   0       0.0487   0.0547  0.0483  0.0411  0.1503  0.1093  0.1725  0.1312
CiLRV-3          0        0.0692  0.0538  0.0807  0.1442  0.1187  0.1674  0.1251
TSV-3                     0       0.0454  0.0916  0.1053  0.1629  0.1247  0.0860
CVV-3                             0       0.0785  0.1078  0.1474  0.1316  0.898
APMV-3                                    0       0.1830  0.0860  0.2053  0.1650
LRMV-3                                            0       0.2544  0.0250  0.0200
PDV-3                                                     0       0.2781  0.2357
EMV-3                                                             0       0.0427
AVII                                                                      0
Table 4
The similarity/dissimilarity matrix for the coding sequences of Fig. 1 based on the angle between the 3-component vectors of the normalized leading eigenvalues of
the L/L matrices
Species  AlMV-3  CiLRV-3  TSV-3   CVV-3     APMV-3  LRMV-3  PDV-3   EMV-3   AVII
AlMV-3   0       0.0861   0.0308  0.056551  0.0481  0.0990  0.0198  0.0919  0.0833
CiLRV-3          0        0.0867  0.0733    0.1284  0.0599  0.0734  0.0495  0.0417
TSV-3                     0       0.0822    0.0731  0.1166  0.0453  0.1077  0.0978
CVV-3                             0         0.0724  0.0501  0.0375  0.0474  0.0435
APMV-3                                      0       0.1225  0.0551  0.1189  0.1130
LRMV-3                                              0       0.0794  0.0108  0.0207
PDV-3                                                       0       0.0727  0.0645
EMV-3                                                               0       0.0103
AVII                                                                        0
Observing Tables 3 and 4, we find that the most similar species
pairs are AVII∼EMV-3, EMV-3∼LRMV-3, AVII∼LRMV-3
and PDV-3∼AlMV-3.
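The two similarity measures can be sketched as follows. This is a hedged illustration: the function names are hypothetical, and the angle is taken here to be the ordinary angle between two vectors, in radians, which is an interpretation of the "correlation angle". The sequence lengths n are not stated in the text, so the distance function expects vectors that the caller has already normalized; the eigenvalue triples are copied from Table 2.

```python
import math


def normalized_vector(leading_eigenvalues, length):
    """Divide each leading eigenvalue (patterns AU, AC, AG) by the length n."""
    return tuple(value / length for value in leading_eigenvalues)


def euclidean_distance(u, v):
    """Distance between the end points of two vectors rooted at the origin."""
    return math.dist(u, v)


def correlation_angle(u, v):
    """Angle between two vectors in radians (assumed meaning of the term)."""
    dot = sum(a * b for a, b in zip(u, v))
    cosine = dot / (math.hypot(*u) * math.hypot(*v))
    # Clamp to [-1, 1] to guard against rounding just outside the domain of acos.
    return math.acos(max(-1.0, min(1.0, cosine)))


# Leading eigenvalues for patterns AU, AC, AG, copied from Table 2.
ALMV3 = (12.9727, 13.0909, 12.1478)
PDV3 = (16.0753, 16.1528, 15.6737)

if __name__ == "__main__":
    print(correlation_angle(ALMV3, PDV3))
```

Because dividing a vector by n rescales it without changing its direction, the angle can be computed from the raw eigenvalues of Table 2; for AlMV-3 and PDV-3 the result should come out close to the corresponding entry of Table 4. The Euclidean distance, by contrast, does depend on the lengths n, so Table 3 cannot be recomputed from Table 2 alone.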
4. Conclusion
We have presented a similarity measure between RNA
secondary structures. A simple graphical representation
substitutes for the complicated molecular structure. The advantage
of our approach is that it allows visual inspection of data,
which helps in recognizing major similarities among different RNA
structures, and allows one to construct a numerical
characterization. In our approach, the insertion, deletion, and substitution
of points of the curves correspond to the insertion, deletion, and
substitution of letters in the compared structures, respectively.
One difference from alignments of RNA secondary
structures is that our approach considers not only sequence
structures but also chemical structures of RNA secondary
structures. Another difference from existing algorithms for
computing the similarity between RNA secondary structures
is that our approach has no non-crossing restriction.
Acknowledgements
The authors would like to thank the anonymous referees for
many valuable suggestions that have improved this manuscript.
References
[1] Bo Liao, Tian-ming Wang, General combinatorics of RNA hairpins and
cloverleaves, J. Chem. Inf. Comput. Sci. 43 (4) (2003) 1138–1142.
[2] Bo Liao, Tianming Wang, General combinatorics of RNA secondary
structure, Math. Biosci. 191 (2004) 69–81.
[3] Milan Randic, Marjan Vracko, Nella Lers, Dejan Plavsic, Analysis of
similarity/dissimilarity of DNA sequences based on novel 2-D graphical
representation, Chem. Phys. Lett. 371
(2003) 202–207.
[4] Bo Liao, Tianming Wang, New 2D Graphical representation of DNA
sequences, J. Comput. Chem. 25 (11) (2004) 1364–1368.
[5] M. Randic, Alexandru T. Balaban, On a four-dimensional representation
of DNA primary sequences, J. Chem. Inf. Comput. Sci. 40 (2000) 50–56.
[6] Chunxin Yuan, Bo Liao, Tianming Wang, New 3-D graphical
representation of DNA sequences and their numerical characterization,
Chem. Phys. Lett. 379 (2003) 412–417.
[7] Bo Liao, Tianming Wang, Analysis of similarity/dissimilarity of DNA
sequences based on 3-D graphical representation, Chem. Phys. Lett. 388
(2004) 195–200.
[8] Bo Liao, Tianming Wang, Analysis of similarity of DNA sequences based
on triplets, J. Chem. Inf. Comput. Sci. 44 (2004) 1666–1670.
[9] Bo Liao, Tianming Wang, 3-D graphical representation of DNA
sequences and their numerical characterization, J. Mol. Struct.
THEOCHEM 681 (2004) 209–212.
[10] Bo Liao, Kequan Ding, Tianming Wang, On a six-dimensional
representation of RNA secondary structures, J. Biomol. Struct. Dyn. 22
(2005) 455–464.
[11] Bo Liao, On a 2D graphical representation of DNA sequence, Chem.
Phys. Lett. 401 (2005) 196–199.
[12] Bo Liao, Tianming Wang, A 3D graphical representation of
RNA secondary structures, J. Biomol. Struct. Dyn. 21 (6) (2004) 827–832.
[13] V. Bafna, S. Muthukrishnan, R. Ravi, Comparing similarity between RNA
strings, Comput. Sci. 937 (1995) 1–4.
[14] F. Corpet, B. Michot, RNAlign program: alignment of RNA sequences
using both primary and secondary structures, Comput. Appl. Biosci. 10
(4) (1995) 389–399.
[15] S.Y. Le, R. Nussinov, J.V. Maizel, Tree graphs of RNA secondary structures
and their comparison, Comput. Biomed. Res. 22 (1989) 461–473.
[16] S.Y. Le, J. Owens, R. Nussinov, J.H. Chen, B. Shapiro, J.V. Maizel, RNA
secondary structures: comparison and determination of frequently recurring
substructures by consensus, Comput. Biomed. 5 (1989) 205–210.
[17] B. Shapiro, An algorithm for comparing multiple RNA secondary
structures, Comput. Appl. Biosci. 4 (3) (1998) 387–393.
[18] B. Shapiro, K. Zhang, Comparing multiple RNA secondary structures using
tree comparisons, Comput. Appl. Biosci. 6 (4) (1990) 309–318.
[19] K. Zhang, Computing similarity between RNA secondary structures,
Proceedings of the IEEE International Joint Symposium on Intelligence
and Systems, Rockville, MD, May 1998, 126–132.
[20] Chantal B.E.M. Reusken, John F. Bol, Structural elements of the 3′-
terminal coat protein binding site in alfalfa mosaic virus RNAs, Nucleic
Acids Res. 14 (1996) 2660–2665.
[21] E.C. Koper-Zwarthoff, F.Th. Brederode, P. Walstra, J.F. Bol, Nucleic
Acids Res. 7 (1979) 1887–1900.
[22] S.W. Scott, X. Ge, J. Gen. Virol. 76 (1995) 957–963.
[23] E.C. Koper-Zwarthoff, F.Th. Brederode, P. Walstra, J.F. Bol, Nucleic
Acids Res. 8 (1980) 3307–3318.
[24] B.J. Cornelissen, H. Janssen, D. Zuidema, J.F. Bol, Nucleic Acids Res. 12
(1984) 2427–2437.
[25] R.H. Alrefai, P. Shiel, L.L. Domier, C.J. D’Arcy, P.H. Berger,
S.S. Korban, J. Gen. Virol. 75 (1994) 2847–2850.
[26] S.W. Scott, X. Ge, J. Gen. Virol. 76 (1995) 1801–1806.
[27] E.J. Bachman, S.W. Scott, G. Xin, V. Bowman Vance, Virology 201
(1994) 127–131.
[28] F. Houser-Scott, M.L. Baer, K.F. Liem, J.M. Cai, L. Gehrke, J. Virol. 68
(1994) 2194–2205.
[29] EMBL/GenBank/DDBJ databases. Accession no. X86352.
[30] L. Lovasz, J.I. Pelikan, On the eigenvalues of trees, Period. Math. Hung. 3
(1973) 175–182.
[31] M. Randic, On structural ordering and branching of acyclic saturated
hydrocarbons, J. Math. Chem. 24 (1998) 345–358.
[32] M. Randic, D. Plavsic, M. Razinger, Double invariants, MATCH 35
(1997) 243–259.
[33] M. Randic, X. Guo, S. Bobst, Use of matrices for characterization of
molecular structures, in: P. Hansen, P. Fowler, M. Zheng (Eds.), Discrete
Mathematical Chemistry, DIMACS Workshop on Discrete Mathematical
Chemistry, American Mathematical Society, Providence, RI, 2000,
pp. 305–322.
[34] M. Randic, On molecular branching, Acta Chim. Sloven. 44 (1997) 57–77.