
Approximation Algorithms for Predicting RNA Secondary Structures with Arbitrary Pseudoknots


Minghui Jiang

Abstract: We study three closely related problems motivated by the prediction of RNA secondary structures with arbitrary pseudoknots: the problem 2-Interval Pattern proposed by Vialette [36], the problem Maximum Base Pair Stackings proposed by Ieong et al. [16], and the problem Maximum Stacking Base Pairs proposed by Lyngsø [21]. For 2-Interval Pattern, we present polynomial-time approximation algorithms for the problem over the preceding-and-crossing model and on input with the unitary restriction. For Maximum Base Pair Stackings and Maximum Stacking Base Pairs, we present polynomial-time approximation algorithms for the two problems on explicit input of candidate base pairs. We also propose a new problem called Length-Weighted Balanced 2-Interval Pattern, which is natural in the context of RNA secondary structure prediction.

Index Terms: RNA secondary structure prediction, stacking pairs, 2-intervals.

1 INTRODUCTION

RNAs are versatile molecules. In the central dogma of biology, three roles for RNA are well known: mRNA as the genetic information carrier, tRNA as the genetic code interpreter, and rRNA as the structural component of the protein synthesis complex. In recent years, a number of other important roles for RNA have been determined. To understand the myriad functions of RNAs in biological processes, we need to first understand their structures.

The primary structure of an RNA is the sequence of nucleotides (that is, the four different bases A, C, G, and U) in its single-stranded polymer. An RNA folds into a three-dimensional structure by forming hydrogen bonds between nonconsecutive bases that are complementary, such as the Watson-Crick pairs G-C and A-U and the wobble pair G-U. The three-dimensional arrangement of the atoms in the folded RNA molecule is the tertiary structure; the collection of base pairs in the tertiary structure is the secondary structure.

The secondary structure of an RNA is the scaffolding of its tertiary structure. It is well known that RNA folding is hierarchical: “the primary sequence determines the secondary structure which, in turn, determines its tertiary folding, whose formation alters only minimally the secondary structure” [34]. The prediction of RNA secondary structures given only the primary sequences is one of the most important problems in structural bioinformatics. Although the algorithmic study of RNA secondary structure prediction dates back to the 1970s [26], the problem is still not solved. In addition to the nested stem-loop base pairing interaction that is typical in RNA secondary structures, nonnested base pairs can be formed between the loop of one stem and the bases outside that stem, thus forming a pseudoknot. We refer to Fig. 1 for an example. The accurate prediction of RNA secondary structures with pseudoknots remains challenging.

Most early research on RNA secondary structure prediction adopts the thermodynamic approach: each possible secondary structure is associated with a global free energy, which depends on the recursive decomposition of the structure and a set of experimentally determined energy parameters; the native secondary structure is assumed to have the minimum global free energy and is, thus, optimal. When pseudoknots are excluded, the optimal secondary structure of an RNA can be computed by dynamic programming algorithms in $O(n^3)$ time and $O(n^2)$ space [26], [25], [42], [41], [30], [23]. A recent algorithm by Wexler et al. [37] breaks the $O(n^3)$ time barrier by an accessibility criterion that filters out substrings of mRNA that are inaccessible to protein and microRNA binding. Considerable effort has been made to extend the dynamic programming algorithms to include pseudoknots [29], [35], [22], [1], [12], [28]. However, these extended algorithms can handle only limited types of pseudoknots [22], [16]; if arbitrary pseudoknots may be included, then the prediction problem becomes exceedingly difficult, typically NP-hard [22], [1], [16], [21]. Even for those limited types of pseudoknots that can be handled, the complexities of the prediction algorithms range from $O(n^4)$ to $O(n^6)$ in time and from $O(n^2)$ to $O(n^4)$ in space, which render them impractical even for sequences of only a few hundred bases.

IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 7, NO. 2, APRIL-JUNE 2010

The author is with the Department of Computer Science, Utah State University, 4205 Old Main Hill, Logan, UT 84322-4205. E-mail: [email protected].

Manuscript received 16 Feb. 2008; revised 26 Sept. 2008; accepted 21 Oct. 2008; published online 24 Oct. 2008. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TCBB-2008-02-0033. Digital Object Identifier no. 10.1109/TCBB.2008.109.

1545-5963/10/$26.00 © 2010 IEEE. Published by the IEEE CS, CI, and EMB Societies & the ACM.

The intractability of the pseudoknot prediction problem with the thermodynamic approach has prompted researchers to explore other approaches. Many researchers believe that RNA structures are not determined by free energy alone but are also influenced by the kinetics of the folding process. Following a kinetic approach, these researchers experiment with various heuristics that simulate the RNA folding kinetics, such as Kinefold [40], P-RnaPredict [38], and Delta [20], to name just a few recent ones. Unlike approximation algorithms, these heuristics do not offer any theoretical guarantees: it is impossible to bound how far a given prediction is from the optimal solution. Another major approach is comparative. Given an alignment of homologous RNA sequences, if the bases in two columns of the alignment table are always either a G-C pair or an A-U pair in each sequence, then it is highly probable that a base pair exists at the two aligned locations in the secondary structures of all these sequences. The comparative approach [27], [9], [31], [32], [13], [39] makes effective use of such covariance information and can often achieve better prediction accuracy than the thermodynamic and kinetic approaches. Of course, the comparative approach cannot be applied to individual sequences without known homologs, nor does it explain how or why an RNA folds to a particular structure. Nevertheless, in designing algorithms for RNA secondary structure prediction, it is important to make the algorithms flexible enough to incorporate the additional information from comparative analysis whenever available.
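The column criterion described above for the comparative approach can be illustrated with a small sketch. It is only an illustration under stated assumptions: the function name `covarying_columns`, the gap-free equal-length alignment, and the toy sequences are all hypothetical, and practical comparative methods use richer covariance statistics than this all-or-nothing test.

```python
# Complementary pairs named in the text (the wobble pair G-U is deliberately excluded).
WATSON_CRICK = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A")}


def covarying_columns(alignment):
    """Return 0-based column pairs (i, j), i + 1 < j, that hold a G-C or A-U
    pair in every row of a gap-free alignment of equal-length sequences."""
    n = len(alignment[0])
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 2, n)  # paired bases must be nonconsecutive
        if all((row[i], row[j]) in WATSON_CRICK for row in alignment)
    ]


# Hypothetical toy alignment of three homologous sequences.
alignment = ["GAAAUC", "CGAACG", "AUAAAU"]
print(covarying_columns(alignment))  # [(0, 5), (1, 4)]
```

The resulting column pairs are the kind of explicit candidate base pairs that the problems studied in this paper accept as input.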

In this paper, we study three closely related problems motivated by the prediction of RNA secondary structures with arbitrary pseudoknots and present improved approximation algorithms for these problems. The main approach of the three problems is thermodynamic, since they are formulated as combinatorial optimization problems to model the free energy minimization of RNA secondary structures. On the other hand, these problems are also compatible with the comparative approach: explicit candidates of base pairs and helices, instead of a sequence of bases, are taken as input.

1.1 2-Interval Pattern

Most dynamic programming algorithms for RNA secondary structure prediction consider the individual base pairs as the structural unit. At a coarser resolution, a secondary structure may be built from helices formed by the stacking of consecutive base pairs. We refer back to Fig. 1. Given an RNA sequence, a subsequence of consecutive bases can be represented as an interval on the real line, and the stacked pairing of two disjoint subsequences (a helix) can be represented as a 2-interval, the union of two disjoint intervals. A set of pairwise-disjoint 2-intervals satisfying certain geometric constraints then gives a compact representation of an RNA secondary structure.

The 2-interval representation of RNA secondary structures was proposed by Vialette [36]. We review some definitions. A 2-interval $D = (I, J)$ is the union of two disjoint intervals $I$ and $J$ on the real line such that $I < J$, that is, $I$ is completely to the left of $J$. Consider two 2-intervals $D_1 = (I_1, J_1)$ and $D_2 = (I_2, J_2)$. $D_1$ and $D_2$ are disjoint if the four intervals $I_1$, $J_1$, $I_2$, and $J_2$ are pairwise disjoint. Define three binary relations for disjoint pairs of 2-intervals:

• Precede: $D_1 < D_2 \iff I_1 < J_1 < I_2 < J_2$.

• Nest: $D_1 \sqsubset D_2 \iff I_2 < I_1 < J_1 < J_2$.

• Cross: $D_1 \between D_2 \iff I_1 < I_2 < J_1 < J_2$.

Note that the two relations $<$ and $\sqsubset$ are both transitive, but the relation $\between$ is not.

A model is a nonempty subset of the three binary relations $\{<, \sqsubset, \between\}$. There are exactly seven models. Two 2-intervals $D_1$ and $D_2$ are $R$-comparable for some relation $R \in \{<, \sqsubset, \between\}$ if either $(D_1, D_2) \in R$ or $(D_2, D_1) \in R$. A set $\mathcal{D}$ of 2-intervals is $\mathcal{R}$-structured over some model $\mathcal{R}$ if any two 2-intervals in $\mathcal{D}$ are $R$-comparable for some relation $R \in \mathcal{R}$. We refer to Fig. 2 for an example.

Given a set $\mathcal{D}$ of 2-intervals and a model $\mathcal{R}$, the problem 2-Interval Pattern [36] is to find a maximum size subset of $\mathcal{R}$-structured 2-intervals in $\mathcal{D}$. If each 2-interval $D \in \mathcal{D}$ is associated with a nonnegative weight $w(D)$, then we also have the problem Weighted 2-Interval Pattern [11], which is to find a maximum weight subset of $\mathcal{R}$-structured 2-intervals in $\mathcal{D}$.
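The definitions above can be made concrete with a short sketch. It encodes the three relations, tests whether a set is structured over a model, and solves tiny instances by exhaustive search. This is not one of the algorithms of this paper: the tuple encoding, the relation names `"precede"`, `"nest"`, `"cross"`, and the coordinates of `D1`, `D2`, `D3` (chosen to satisfy the relations stated for Fig. 2) are assumptions made for illustration, and the search takes exponential time.

```python
from itertools import combinations

# A 2-interval is ((a, b), (c, d)): closed intervals I = [a, b], J = [c, d], b < c.


def left_of(x, y):
    """Interval x lies completely to the left of interval y."""
    return x[1] < y[0]


RELATIONS = {
    # D1 precedes D2: I1 < J1 < I2 < J2
    "precede": lambda d, e: left_of(d[1], e[0]),
    # D1 is nested in D2: I2 < I1 < J1 < J2
    "nest": lambda d, e: left_of(e[0], d[0]) and left_of(d[1], e[1]),
    # D1 crosses D2: I1 < I2 < J1 < J2
    "cross": lambda d, e: (left_of(d[0], e[0]) and left_of(e[0], d[1])
                           and left_of(d[1], e[1])),
}


def is_structured(ds, model):
    """True if every two 2-intervals in ds are R-comparable for some R in model."""
    return all(
        any(RELATIONS[r](d, e) or RELATIONS[r](e, d) for r in model)
        for d, e in combinations(ds, 2)
    )


def brute_force_pattern(ds, model, weight=None):
    """Maximum weight model-structured subset by exhaustive search (tiny inputs only)."""
    weight = weight or {d: 1 for d in ds}
    best, best_w = (), 0
    for k in range(1, len(ds) + 1):
        for sub in combinations(ds, k):
            w = sum(weight[d] for d in sub)
            if w > best_w and is_structured(sub, model):
                best, best_w = sub, w
    return best


# Hypothetical coordinates: D2 is nested in D1, D1 crosses D3, D2 precedes D3.
D1 = ((0, 1), (8, 9))
D2 = ((2, 3), (4, 5))
D3 = ((6, 7), (10, 11))
print(is_structured([D1, D2, D3], {"precede", "nest", "cross"}))  # True
print(is_structured([D1, D2, D3], {"precede", "nest"}))           # False
```

Such an exhaustive solver is mainly useful as a reference when checking an approximation algorithm on small inputs.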

The relation $\between$ is especially important in the context of RNA secondary structure prediction because any two crossing helices (2-intervals) form a pseudoknot. The general model $\{<, \sqsubset, \between\}$ of 2-Interval Pattern corresponds to secondary structures with arbitrary pseudoknots and the reduced model $\{<, \sqsubset\}$ corresponds to pseudoknot-free secondary structures.

Besides the seven models, various restrictions may be imposed on the input 2-intervals. Define the support of a set $\mathcal{D}$ of 2-intervals as the set of intervals $\{I, J \mid (I, J) \in \mathcal{D}\}$. There are three common types of restrictions as follows:

• Point: The intervals in the support are pairwise disjoint, and therefore, can be viewed as points. (Note that each interval in the support may be shared by many 2-intervals; the 2-intervals may not be disjoint although the intervals are disjoint.)

• Unitary: All intervals in the support have the same length.

• Balanced: The left and right intervals of each 2-interval have equal length.
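The three restrictions can be checked mechanically. The sketch below is illustrative only: it assumes that each interval is a closed range of integer base positions, and the function names and sample inputs are hypothetical.

```python
def support(ds):
    """The set of intervals {I, J | (I, J) in ds}."""
    return {iv for d in ds for iv in d}


def length(iv):
    # Assumption: an interval (a, b) covers the integer positions a..b inclusive.
    return iv[1] - iv[0] + 1


def is_point(ds):
    """Point: the intervals in the support are pairwise disjoint."""
    ivs = sorted(support(ds))
    # After sorting by left endpoint it suffices to compare neighbors.
    return all(a[1] < b[0] for a, b in zip(ivs, ivs[1:]))


def is_unitary(ds):
    """Unitary: all intervals in the support have the same length."""
    return len({length(iv) for iv in support(ds)}) <= 1


def is_balanced(ds):
    """Balanced: the two intervals of each 2-interval have equal length."""
    return all(length(i) == length(j) for i, j in ds)


shared = [((1, 3), (10, 12)), ((1, 3), (20, 22))]   # two 2-intervals sharing I = [1, 3]
overlap = [((1, 3), (10, 12)), ((2, 4), (20, 22))]  # [1, 3] and [2, 4] overlap
bulged = [((1, 2), (10, 13))]                       # halves of different length

print(is_point(shared), is_point(overlap))  # True False
print(is_balanced(bulged))                  # False
```

The `shared` input shows the situation noted under the point restriction: the support is pairwise disjoint even though the two 2-intervals share an interval.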

The point and unitary restrictions were introduced by Vialette [36]; the balanced restriction was later proposed by Crochemore et al. [11]. The point restriction is simple but useful: 2-intervals with the point restriction can directly represent individual base pairs in modeling RNA secondary structures at the base pair resolution; they can also represent helices formed by a set of disjoint RNA fragments (2-intervals with a set of disjoint intervals as the support). The unitary restriction is convenient in RNA folding simulation with a uniform handling of helices; different helices may have different lengths, but each long helix can be approximated by several short helices of equal length. The balanced restriction is also natural in the biological setting; a helix formed by stacking base pairs always has an equal number of bases in each half (unless it contains bulges and small internal loops).

Fig. 1. Pseudoknot PKb of upstream pseudoknot domain (UPD) of 3’-UTR of beet soil-borne virus RNA 1, from PseudoBase [4] at http://wwwbio.leidenuniv.nl/~batenburg/PKBase/PKB00116.html. Sequence of bases follows the thick curve; base pairs are connected by thin lines.

Fig. 2. An example: $D_2 \sqsubset D_1$; $D_1 \between D_3$; $D_2 < D_3$; $\{D_1, D_2, D_3\}$ is $\{<, \sqsubset, \between\}$-structured; $\{D_1, D_2\}$ is $\{\sqsubset\}$-structured.

Since Vialette’s pioneering work [36], the problem 2-Interval Pattern has been extensively studied [36], [8], [10], [11], [17], [18], [19]. We summarize in Table 1 the complexities of the problem over the seven models and with various restrictions. Since the relation $\between$ models pseudoknots, it is not surprising that 2-Interval Pattern is NP-hard over the model $\{<, \between\}$ and APX-hard over the models $\{<, \sqsubset, \between\}$ and $\{\sqsubset, \between\}$, even with the unitary restriction on input. This is consistent with the hardness results for related problems [1], [22], [16], [21] and our practical knowledge that RNA secondary structures with pseudoknots are difficult to predict. Naturally, researchers have directed their attention to the design of approximation algorithms.

We refer to Table 2 for the best approximation ratios for Weighted 2-Interval Pattern. In this paper, we present two new approximation algorithms that improve the previous results. For Weighted 2-Interval Pattern on input with the unitary restriction, we obtain an improved constant-factor approximation. For Weighted 2-Interval Pattern over the model $\{<, \between\}$, we obtain a polynomial-time approximation scheme (PTAS), that is, a polynomial-time $(1 + \epsilon)$-approximation algorithm for any constant $\epsilon > 0$.

We note that 2-Interval Pattern over the model $\{<, \between\}$ is one of the most interesting variants of 2-Interval Pattern. The model $\{<, \between\}$ captures many important pseudoknotted structures. We refer to Fig. 3 for a special $\{<, \between\}$-structured pattern. Lyngsø and Pedersen [22] noted that this “chain of pseudoknots” structure is particularly useful in comparing existing algorithms by classifying the different types of pseudoknots they can handle. Bereg et al. [5] also studied an optimization problem for RNA secondary structures with such “loop chains,” in connection with the classic concept of the longest common subsequence. There are two intriguing open questions concerning 2-Interval Pattern over the model $\{<, \between\}$. The first open question is about its complexity on input with the point restriction. As we can see in Table 1, this is the last variant of 2-Interval Pattern whose complexity remains unknown. The second open question, posed by Crochemore et al. [11], is whether 2-Interval Pattern over the model $\{<, \between\}$ is APX-hard, as over the other two models $\{<, \sqsubset, \between\}$ and $\{\sqsubset, \between\}$. We settle the second open question in the negative, by presenting a polynomial-time approximation scheme for Weighted 2-Interval Pattern over the model $\{<, \between\}$.

1.2 Maximum Base Pair Stackings and Maximum Stacking Base Pairs

Algorithms for 2-Interval Pattern treat the helices as the structural unit; their predictions may be affected by the coarse resolution. It is sometimes desirable to obtain more precise predictions of RNA secondary structures by considering the base pairs individually. Algorithms at the base pair resolution often adopt the nearest neighbor energy model [1], [22], [16], [21] in which the energy of each base pair depends not only on its two bases but also on the other adjacent base pairs. According to the Tinoco model [33], an RNA structure can recursively be decomposed into loops with independent free energy; the energy of each loop is an affine function in the number of unpaired bases and the number of interior base pairs. The only loops without unpaired bases are those formed by two stacking base pairs; the negative energy of such stacking loops stabilizes the RNA structure.

TABLE 1. Complexities of 2-Interval Pattern. †$L = O(n^2)$ and $d = O(n)$.

TABLE 2. Best Approximation Ratios for Weighted 2-Interval Pattern. Contributions from this paper are marked by “old → new.” †$2.0 + \epsilon$ for unit weight.

Fig. 3. A chain of pseudoknots.

We review some concepts related to base pair stacking. Let $S$ be an RNA sequence. Denote by $S[i]$ the $i$th base of $S$ and by $S[i, j]$ the subsequence $S[i] S[i+1] \cdots S[j]$. A base pair $(S[i], S[j])$ of $S$, or $(i, j)$ in short, is a pair of bases $S[i]$ and $S[j]$ such that $i + 1 < j$ (the two bases must be nonconsecutive in the sequence). A stacking loop is a loop of four bases formed by two adjacent base pairs $(i, j)$ and $(i+1, j-1)$. A set of $m$ base pairs $\{(i_1, j_1), \ldots, (i_m, j_m)\}$ is a secondary structure if the $2m$ indices $i_k$ and $j_k$ are all distinct (each base can participate in at most one base pair). For a secondary structure $\mathcal{S}$, define the number of base pair stackings as

$$\mathrm{bps}(\mathcal{S}) = |\{(i, j) \in \mathcal{S} \mid (i+1, j-1) \in \mathcal{S}\}|,$$

and the number of stacking base pairs as

$$\mathrm{sbp}(\mathcal{S}) = |\{(i, j) \in \mathcal{S} \mid (i+1, j-1) \in \mathcal{S} \text{ or } (i-1, j+1) \in \mathcal{S}\}|.$$

Note that the number of base pair stackings is exactly the number of stacking loops in a secondary structure. Denote by $([i, i+k], [j-k, j])$ a helix of $k$ consecutive stacking loops formed by $k+1$ base pairs $(i, j), (i+1, j-1), \ldots, (i+k, j-k)$. For a helix $H$ of $h$ base pairs, $h \ge 2$, we have $\mathrm{bps}(H) = h - 1$ and $\mathrm{sbp}(H) = h$. For example, the secondary structure in Fig. 1 has 10 base pairs, 7 base pair stackings, and 9 stacking base pairs.
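The two counting functions translate directly into code. The following sketch is illustrative: the helper `helix` and the sample structure are hypothetical. The sample is not the structure of Fig. 1; it is only constructed to have the same three counts (two helices of 5 and 4 base pairs plus one isolated base pair).

```python
def is_secondary_structure(pairs):
    """Each pair (i, j) needs i + 1 < j, and no base may occur in two pairs."""
    idx = [k for p in pairs for k in p]
    return all(i + 1 < j for i, j in pairs) and len(idx) == len(set(idx))


def bps(structure):
    """Number of base pair stackings: pairs (i, j) with (i+1, j-1) also present."""
    s = set(structure)
    return sum((i + 1, j - 1) in s for (i, j) in s)


def sbp(structure):
    """Number of stacking base pairs: pairs stacked on an inner or outer neighbor."""
    s = set(structure)
    return sum((i + 1, j - 1) in s or (i - 1, j + 1) in s for (i, j) in s)


def helix(i, j, h):
    """The h base pairs (i, j), (i+1, j-1), ..., (i+h-1, j-h+1)."""
    return [(i + t, j - t) for t in range(h)]


H = helix(1, 30, 5)
print(bps(H), sbp(H))  # 4 5

# Hypothetical structure: two helices and one isolated base pair.
structure = helix(1, 30, 5) + helix(10, 40, 4) + [(20, 50)]
print(len(structure), bps(structure), sbp(structure))  # 10 7 9
```

The isolated pair contributes to neither count, which is why the two objective functions can rank structures differently.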

Ieong et al. [16] introduced the problem Maximum Base Pair Stackings: given an RNA sequence, find a secondary structure with the maximum number of base pair stackings. Ieong et al. showed that Maximum Base Pair Stackings is NP-hard when the output secondary structure is restricted to be planar. Their construction uses the Watson-Crick base pairs A-U and C-G. Later, Lyngsø [21] demonstrated that Maximum Base Pair Stackings without the planar restriction is still NP-hard, even for binary sequences with 0-1 base pairs. Lyngsø also showed that the related problem Maximum Stacking Base Pairs, which is to find a secondary structure with the maximum number of stacking base pairs, also becomes NP-hard when the input sequence is over an alphabet $\Sigma$ of unbounded size and the legal pair types form a subset of $\Sigma \times \Sigma$.

Several algorithms have been proposed for the two problems. For Maximum Base Pair Stackings over the canonical alphabet {A, C, G, U} with the Watson-Crick base pairs, Ieong et al. [16] presented an $O(n^3)$ time 2-approximation for the planar case and an $O(n)$ time 3-approximation for the general case. Lyngsø [21] presented a polynomial-time exact algorithm for Maximum Stacking Base Pairs and a polynomial-time approximation scheme for Maximum Base Pair Stackings, both over a fixed-size alphabet $\Sigma$ and with a subset of $\Sigma \times \Sigma$ of legal pair types.

In the original formulations of the two problems [16], [21], the candidate base pairs that may appear in a secondary structure are given implicitly by specifying the set of legal pair types (either the Watson-Crick base pairs or a subset of $\Sigma \times \Sigma$). As an alternative, the set of candidate base pairs may be given explicitly as input, in a way similar to the input set of 2-intervals for the problem 2-Interval Pattern. For example, we may incorporate the additional information from comparative analysis by composing a candidate set of base pairs with pair covariances at least some threshold value. On explicit input of candidate base pairs, the two optimization problems are still NP-hard, since they generalize the original formulations. It is natural that we investigate their approximation algorithms. Lyngsø’s algorithms [21], besides having very high complexities of $\Omega(n^{81})$ time and $\Omega(n^{80})$ space even for the canonical alphabet {A, C, G, U}, depend on a table-lookup technique assuming that the candidate base pairs are implicitly determined by the legal pair types; therefore, they cannot be adapted to explicit input. Ieong et al.’s 3-approximation for Maximum Base Pair Stackings [16], on the other hand, may be adapted to explicit input of candidate base pairs. In this paper, we improve Ieong et al.’s 3-approximation for Maximum Base Pair Stackings to a $(8/3)$-approximation, and present the first nontrivial $(5/2)$-approximation for Maximum Stacking Base Pairs, both on explicit input of candidate base pairs.

The rest of the paper is organized as follows: In Section 2, we present a polynomial-time approximation scheme for Weighted 2-Interval Pattern over the model $\{<, \between\}$. The technique is dynamic programming. In Section 3, we present three approximation algorithms: a $(8/3)$-approximation for Maximum Base Pair Stackings and a $(5/2)$-approximation for Maximum Stacking Base Pairs, both on explicit input of candidate base pairs, and a $(2.5 + \epsilon)$-approximation for Weighted 2-Interval Pattern on input with the unitary restriction. The three results are all obtained by the same technique of local improvement in 5-claw-free graphs. In Section 4, we introduce a new problem called Length-Weighted Balanced 2-Interval Pattern and pose an open question.

2 A PTAS BASED ON DYNAMIC PROGRAMMING

In this section, we present a polynomial-time approximation scheme for Weighted 2-Interval Pattern over the model $\{<, \between\}$. Our main result is the following theorem.

Theorem 1. There is an algorithm that approximates Weighted 2-Interval Pattern over the model $\{<, \between\}$ with a ratio of $1 + 1/c$ and runs in $O(n^{2c+3})$ time, for any fixed integer $c \ge 2$.

We first introduce some preliminaries. For two 2-intervals $D_1 = (I_1, J_1)$ and $D_2 = (I_2, J_2)$, define a composite binary relation $\between_<$ such that

$$D_1 \between_< D_2 \iff D_1 < D_2 \text{ or } D_1 \between D_2.$$

From the definitions of the two relations $<$ and $\between$,

$$D_1 < D_2 \iff I_1 < J_1 < I_2 < J_2$$

and

$$D_1 \between D_2 \iff I_1 < I_2 < J_1 < J_2,$$

it follows that

$$D_1 \between_< D_2 \implies I_1 < I_2 \text{ and } J_1 < J_2.$$

Therefore, just as the relation $<$ specifies a total order for disjoint intervals, the relation $\between_<$ specifies a total order for $\{<, \between\}$-structured 2-intervals. A set $S$ of $\{<, \between\}$-structured 2-intervals can be considered as a sequence ordered by the relation $\between_<$. Denote by $S[i]$ the $i$th element in $S$. Denote by $S[i, j]$ the subsequence $S[i] S[i+1] \cdots S[j]$.

Define the backbone elements of a sequence $S$ as follows:

1. $S[1]$ is a backbone element.
2. If $S[i]$ is a backbone element, and if $S[i] < S[j]$ and $S[i] \between S[k]$ for all $i < k < j$, then $S[j]$ is also a backbone element.

For two consecutive backbone elements $S[i]$ and $S[j]$, define the stripe $T(i, j)$ as the subsequence $S[i+1, j-1]$. By definition, each stripe is a set of $\{\between\}$-structured nonbackbone elements.

We refer to Fig. 4 for a sequence of eight $\{<, \between\}$-structured 2-intervals. The 2-intervals at indices 1, 4, 7, and 8 are the four backbone elements. The stripe between the two consecutive backbone elements 1 and 4 consists of two nonbackbone elements 2 and 3; the stripe between 4 and 7 consists of 5 and 6; the stripe between 7 and 8 is empty.
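The backbone and stripe decomposition can be computed by a single scan. The sketch below is illustrative, with stated assumptions: the input list is already ordered by the composite relation and is structured over the preceding-and-crossing model, each 2-interval is a pair of `(lo, hi)` tuples, and the coordinates are hypothetical values chosen to reproduce the pattern described for Fig. 4 (the figure itself is not available here).

```python
def precedes(d, e):
    """D < E: the right interval of D lies completely left of E's left interval."""
    return d[1][1] < e[0][0]


def backbone_and_stripes(seq):
    """seq: nonempty, ordered list of 2-intervals in which every two elements
    either precede or cross. Returns the 1-based backbone indices and the
    stripes between consecutive backbone elements (lists of 1-based indices)."""
    backbone = [1]
    while True:
        i = backbone[-1]
        # The next backbone element is the first later element that S[i]
        # precedes; every element in between must then cross S[i].
        j = next((k for k in range(i + 1, len(seq) + 1)
                  if precedes(seq[i - 1], seq[k - 1])), None)
        if j is None:
            break
        backbone.append(j)
    stripes = [list(range(a + 1, b)) for a, b in zip(backbone, backbone[1:])]
    return backbone, stripes


def pt(i, j):
    """Helper: a 2-interval whose two intervals are the single points i and j."""
    return ((i, i), (j, j))


# Hypothetical coordinates for eight 2-intervals.
seq = [pt(1, 4), pt(2, 6), pt(3, 7), pt(5, 10),
       pt(8, 12), pt(9, 13), pt(11, 14), pt(15, 16)]
print(backbone_and_stripes(seq))  # ([1, 4, 7, 8], [[2, 3], [5, 6], []])
```

Elements after the last backbone element, if any, belong to no stripe in this sketch; they occur only when the sequence does not end in a backbone element.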

We say that a sequence is $c$-striped if it contains at most $c$ consecutive nonempty stripes, that is, among every $c + 1$ consecutive stripes, there must be an empty stripe. For example, the sequence in Fig. 4 is 2-striped. A sequence $S$ with backbone elements at indices $i_1, i_2, \ldots, i_k$ has a pattern of alternating backbone elements and stripes:

$$S[i_1]\, T(i_1, i_2)\, S[i_2]\, T(i_2, i_3)\, S[i_3]\, T(i_3, i_4)\, S[i_4]\, T(i_4, i_5)\, S[i_5] \cdots$$

The sequence $S$ itself may not be $c$-striped, but it always contains $c$-striped subsequences. For example, the following two subsequences of $S$ are both 1-striped:

$$\begin{array}{l} S[i_1]\, T(i_1, i_2)\, S[i_2]\;\; S[i_3]\, T(i_3, i_4)\, S[i_4]\;\; S[i_5] \cdots \\ S[i_1]\;\; S[i_2]\, T(i_2, i_3)\, S[i_3]\;\; S[i_4]\, T(i_4, i_5)\, S[i_5] \cdots \end{array}$$

The two subsequences together cover the sequence $S$: each backbone element is covered twice; each nonbackbone element is covered once. Hence, one of the two subsequences has at least half the weight of the sequence $S$. This observation immediately suggests that a 2-approximation for Weighted 2-Interval Pattern over the model $\{<, \between\}$ can be obtained by finding a maximum weight 1-striped sequence. This is indeed the idea behind the previous 2-approximation [17] for 2-Interval Pattern over the model $\{<, \between\}$, which computes a maximum weight 1-striped sequence by dynamic programming. In this paper, we extend this idea further to obtain a polynomial-time approximation scheme.

Fix an integer $c \ge 2$. For each $k$, $0 \le k \le c$, construct a $c$-striped subsequence of $S$ by deleting the stripes $T(i_j, i_{j+1})$ such that $j \bmod (c + 1) = k$:

$$\begin{array}{ll} k = 0: & S[i_1]\, T(i_1, i_2)\, S[i_2] \cdots S[i_{c+1}]\;\; S[i_{c+2}] \cdots \\ k = 1: & S[i_1]\;\; S[i_2] \cdots S[i_{c+1}]\, T(i_{c+1}, i_{c+2})\, S[i_{c+2}] \cdots \\ & \vdots \end{array}$$

The $c + 1$ subsequences together cover every element of the sequence at least $c$ times: each backbone element is covered $c + 1$ times; each nonbackbone element is covered $c$ times. Therefore, the total weight of the $c + 1$ subsequences is at least $c$ times the weight of the sequence $S$. By the pigeon-hole principle, the weight of at least one of the $c + 1$ subsequences is at least $\frac{c}{c+1}$ times the weight of $S$. This implies that a $(1 + 1/c)$-approximation for Weighted 2-Interval Pattern over the model $\{<, \between\}$ can be obtained by computing a maximum weight $c$-striped sequence.

We now present a dynamic programming algorithm to find a maximum weight $c$-striped sequence. For simplicity, we only demonstrate how to compute the maximum weight of a $c$-striped sequence. The actual sequence with the maximum weight can be reconstructed in the same running time using standard dynamic programming techniques. The algorithm is described in the following at several levels of detail, from the highest to the lowest.
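Before turning to the algorithm, the covering argument behind the $(1 + 1/c)$ ratio can be checked numerically. The sketch below is only an illustration under stated assumptions: a sequence is summarized by the weights of its backbone elements and the total weight of each stripe, and the function names and sample weights are hypothetical.

```python
def subsequence_weights(backbone_w, stripe_w, c):
    """backbone_w[j-1] is the weight of S[i_j]; stripe_w[j-1] is the total
    weight of stripe T(i_j, i_{j+1}). Entry k of the result is the weight of
    the subsequence that deletes every stripe with j mod (c + 1) == k."""
    base = sum(backbone_w)
    return [
        base + sum(t for j, t in enumerate(stripe_w, start=1) if j % (c + 1) != k)
        for k in range(c + 1)
    ]


def longest_kept_run(num_stripes, c, k):
    """Longest run of consecutive stripes that survive the deletion for offset k."""
    best = run = 0
    for j in range(1, num_stripes + 1):
        run = 0 if j % (c + 1) == k else run + 1
        best = max(best, run)
    return best


backbone_w = [1, 1, 1, 1, 1]  # hypothetical weights of five backbone elements
stripe_w = [5, 3, 4, 2]       # hypothetical total weights of the four stripes
c = 2
weights = subsequence_weights(backbone_w, stripe_w, c)
total = sum(backbone_w) + sum(stripe_w)
print(weights, total)  # [15, 12, 16] 19
```

In this example the best of the three subsequences has weight 16, which is at least $\frac{2}{3} \cdot 19$, and no subsequence keeps more than $c = 2$ consecutive stripes.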

Given a set $\mathcal{D}$ of 2-intervals as input, we first construct two zero-weight dummy elements $A$ and $Z$ such that $A < D < Z$ for all elements $D \in \mathcal{D}$, then extend the input set $\mathcal{D}$ with $A$ and $Z$. Since the two dummy elements have zero weight, the maximum weight of a $c$-striped sequence remains the same after the extension, and we can conveniently assume that $A$ and $Z$, respectively, are the first and last elements of the optimal $c$-striped sequence.

The first element of a sequence of $\{<, \between\}$-structured 2-intervals, by definition, is always a backbone element. If the last element of the sequence is also a backbone element, then we say that the sequence is canonical. For example, the sequence in Fig. 4 is canonical, but the sequence in Fig. 3 is not.

For each element $D \in \mathcal{D}$, denote by $w(D)$ the weight of $D$, denote by $W[D]$ the maximum weight of a canonical $c$-striped sequence anchored between $A$ and $D$ (that is, with $A$ and $D$, respectively, as the first and last elements), and denote by $W_0[D]$ the maximum weight of such a canonical sequence satisfying an additional constraint that its last stripe (the stripe between the second-to-last and the last backbone elements) is empty. The entry $W[Z]$ gives the maximum weight of a canonical $c$-striped sequence anchored between $A$ and $Z$, and is the optimal solution.

We next show how to compute the two tables W ½D� andW0½D�. Define a chain as a canonical sequence with at mostcþ 1 backbone elements (and hence, at most c stripes). Forevery pair of elements C;D 2 D; C < D, denote by w½C;D�the maximum weight of a chain anchored between C and D.Note that a canonical c-striped sequence is a concatenation ofdisjoint chains separated by empty stripes. Therefore, thetablesW ½D� andW0½D� can be computed with the recurrence

$$
\begin{cases}
W_0[D] = \max_{C < D} \{ W[C] + w(D) \}, \\
W[D] = \max \{ W_0[D],\ \max_{C < D} \{ W_0[C] - w(C) + w[C, D] \} \},
\end{cases}
\tag{1}
$$

and the base condition $W[A] = W_0[A] = 0$. Note that the two elements $C$ and $D$ in the recurrence are the second-to-last and the last backbone elements. Now the problem reduces to computing the chain table $w[C, D]$.
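As a minimal sketch of how recurrence (1) can be evaluated once the chain table is available, the following Python function fills the two tables in a single pass. The function name `striped_weight`, the index-based representation of elements, and the toy numbers are assumptions made for illustration only; they are not part of the paper.

```python
def striped_weight(weight, precedes, chain):
    """Evaluate recurrence (1) for a toy instance.

    Elements are indexed 0..m-1 in an order compatible with <,
    with index 0 the dummy A and index m-1 the dummy Z.
    weight[d]      -- w(D)
    precedes[c][d] -- True iff C < D
    chain[c][d]    -- chain table entry w[C, D] for every pair C < D
    """
    m = len(weight)
    W = [None] * m
    W0 = [None] * m
    W[0] = W0[0] = 0  # base condition W[A] = W_0[A] = 0
    for d in range(1, m):
        # A precedes every other element, so preds is never empty
        preds = [c for c in range(d) if precedes[c][d]]
        # last stripe empty: C is the second-to-last backbone element
        W0[d] = max(W[c] + weight[d] for c in preds)
        # otherwise the sequence ends with a chain anchored between C and D;
        # w(C) is subtracted because both W0[C] and w[C, D] count it
        W[d] = max([W0[d]] + [W0[c] - weight[c] + chain[c][d] for c in preds])
    return W[m - 1]


# Toy instance: A, two real elements of weights 2 and 3, and Z.
weight = [0, 2, 3, 0]
precedes = [[c < d for d in range(4)] for c in range(4)]
chain = [[None] * 4 for _ in range(4)]
chain[0][1], chain[0][2], chain[0][3] = 2, 3, 0
chain[1][2], chain[1][3], chain[2][3] = 9, 2, 3  # 9 = 2 + 3 + a stripe of weight 4
best = striped_weight(weight, precedes, chain)
```

Each entry looks at all predecessors once, which matches the quadratic cost of this step stated in the running time analysis.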

Define an $i$-chain as a chain with exactly $i + 1$ backbone elements (hence, exactly $i$ stripes). Then every chain is an $i$-chain for some $i$ between 1 and $c$. For each pair of elements $C, D \in \mathcal{D}$, $C < D$, and for each $i$, $1 \le i \le c$, denote by $w_i[C, D]$ the maximum weight of an $i$-chain anchored between $C$ and $D$. We have

$$ w[C, D] = \max_{1 \le i \le c} w_i[C, D]. \tag{2} $$

The problem then reduces to computing the $i$-chain table $w_i[C, D]$ for each $i$.

Denote by $w_i[B_1, B_2, \ldots, B_{i+1}]$ the maximum weight of an $i$-chain with $B_1, B_2, \ldots, B_{i+1}$ as its $i + 1$ backbone elements. We have

$$ w_i[C, D] = \max_{C = B_1 < B_2 < \cdots < B_{i+1} = D} w_i[B_1, B_2, \ldots, B_{i+1}]. \tag{3} $$
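The enumeration behind (2) and (3) can be sketched as follows. This is an illustrative brute-force version, assuming that a 2-interval is stored as a pair of closed integer intervals `((l1, r1), (l2, r2))`; the names `precedes`, `backbone_sets`, `chain_weight`, and the callback `w_backbone` (standing in for $w_i[B_1, \ldots, B_{i+1}]$) are hypothetical.

```python
from itertools import combinations


def precedes(c, d):
    """C < D: the right interval of C ends before the left interval of D starts."""
    return c[1][1] < d[0][0]


def backbone_sets(elems, C, D, i):
    """Yield every tuple (B_1, ..., B_{i+1}) with C = B_1 < ... < B_{i+1} = D."""
    # sorting by left endpoint is compatible with the relation <
    between = sorted(E for E in elems if precedes(C, E) and precedes(E, D))
    for middle in combinations(between, i - 1):
        seq = (C, *middle, D)
        if all(precedes(a, b) for a, b in zip(seq, seq[1:])):
            yield seq


def chain_weight(elems, C, D, c, w_backbone):
    """Equations (2) and (3): maximize over i = 1..c and over backbone sets."""
    return max(
        (w_backbone(seq)
         for i in range(1, c + 1)
         for seq in backbone_sets(elems, C, D, i)),
        default=None,
    )


A = ((0, 0), (1, 1))
X = ((2, 3), (4, 5))
Y = ((6, 7), (8, 9))
Z = ((10, 10), (11, 11))
elems = [A, X, Y, Z]
two_chains = list(backbone_sets(elems, A, Z, 2))
```

For a fixed pair and a fixed $i$ the generator visits at most $O(n^{i-1})$ tuples, so over all pairs it agrees with the $O(n^{i+1})$ count used in the running time analysis.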

Fig. 4. Backbone elements and stripes.

The problem further reduces to computing $w_i[B_1, B_2, \ldots, B_{i+1}]$ for each set of backbone elements $B_1, B_2, \ldots, B_{i+1}$.

Each element $D$ in the stripe between two consecutive backbone elements $B_k$ and $B_{k+1}$, by definition, crosses $B_k$ and either crosses or precedes $B_{k+1}$. For each $k$, $1 \le k \le i$, define

$$ \mathcal{D}_k = \{ D \in \mathcal{D} \mid B_k \between D \text{ and } (D \between B_{k+1} \text{ or } D < B_{k+1}) \}. $$

The problem then reduces to selecting a subset $\mathcal{D}'_k \subseteq \mathcal{D}_k$ for each $k$, $1 \le k \le i$, such that the set of elements

$$ \{B_1\} \cup \mathcal{D}'_1 \cup \{B_2\} \cup \cdots \cup \mathcal{D}'_i \cup \{B_{i+1}\} $$

forms an $i$-chain with the maximum weight. With specified backbone elements $B_1, \ldots, B_{i+1}$, the weight of an $i$-chain is maximized when the weight of its nonbackbone elements, $\mathcal{D}'_1 \cup \cdots \cup \mathcal{D}'_i$, is maximized.

We next show how to compute the maximum weight of nonbackbone elements in an $i$-chain with specified backbone elements. The crucial observation here, from the definition of backbone elements and stripes, is that the elements from two nonconsecutive stripes always satisfy the relation $<$ and are, hence, “independent.” Therefore, when selecting nonbackbone elements, it is sufficient to consider only their interactions between consecutive stripes. To be precise, we only need to ensure that the subsets $\mathcal{D}'_k$ satisfy the following two conditions:

1. The elements in $\mathcal{D}'_k$ are $\{\between\}$-structured, $1 \le k \le i$.
2. The elements in $\mathcal{D}'_k$ are disjoint from the elements in $\mathcal{D}'_{k+1}$, $1 \le k \le i - 1$.

We introduce some more notation. For a 2-interval $D$, denote by $L(D)$ and $R(D)$, respectively, the left and right intervals of $D$. For an interval $I$, denote by $l(I)$ and $r(I)$, respectively, the coordinates of the left and right endpoints of $I$. Assume without loss of generality that all coordinates are integers between 1 and $4n$.

To compute the maximum weight of nonbackbone elements, we again use dynamic programming. We use $i + 1$ coordinates $x_1, x_2, \ldots, x_{i+1}$ as parameters. Associated with each coordinate $x_k$ is a valid range $[x'_k, x''_k]$, defined as follows:

. $x'_1 = r(L(B_1))$ and $x'_k = r(R(B_{k-1}))$ for $2 \le k \le i + 1$;

. $x''_k = l(R(B_k))$ for $1 \le k \le i + 1$.

We refer to Fig. 5 for an example: the three solid vertical lines mark the three coordinates $x_1$, $x_2$, and $x_3$; the three shaded areas illustrate their ranges. We note two facts: 1) the ranges $[x'_k, x''_k]$ are nonoverlapping; 2) each element $D$ in $\mathcal{D}_k$ satisfies $x'_k \le r(L(D)) \le x''_k$ and $x'_{k+1} \le r(R(D)) \le x''_{k+1}$.

Denote by $w[x_1, x_2, \ldots, x_{i+1}]$ the maximum weight of the subsets $\mathcal{D}'_k$ that satisfy, in addition to conditions 1 and 2, the following condition 3 that further limits the choice of candidate elements from $\mathcal{D}_k$:

3. Each element $D$ in $\mathcal{D}'_k$ satisfies $x'_k \le r(L(D)) \le x_k$ and $x'_{k+1} \le r(R(D)) \le x_{k+1}$, $1 \le k \le i$.

The table $w[x_1, \ldots, x_{i+1}]$ can be computed with the recurrence

$$
w[x_1, \ldots, x_{i+1}] = \max
\begin{cases}
\max_{1 \le k \le i+1} w[x_1, \ldots, x_{k-1}, x_k - 1, x_{k+1}, \ldots, x_{i+1}], \\
\max_{1 \le k \le i} \ \max_{([a, x_k], [b, x_{k+1}]) = D \in \mathcal{D}_k} \{ w[x_1, \ldots, x_{k-1}, a - 1, b - 1, x_{k+2}, \ldots, x_{i+1}] + w(D) \}
\end{cases}
\tag{4}
$$

and the base condition $w[x'_1, \ldots, x'_{i+1}] = 0$. The recurrence considers two cases: 1) increment $x_k$ without adding any element; 2) increase both $x_k$ and $x_{k+1}$ after adding an element $D \in \mathcal{D}_k$ to $\mathcal{D}'_k$. We again refer to Fig. 5 for an example: the transition of the two vertical lines for $x_1$ and $x_2$, from dashed to solid, illustrates the recurrence step that adds the element 3 to $\mathcal{D}'_1$.
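The recurrence (4) can be sketched in Python as a bottom-up table over all coordinate tuples. This is an illustrative implementation under stated assumptions: a 2-interval is a pair of closed integer intervals `((l1, r1), (l2, r2))`, the caller supplies for each stripe only candidates that belong to $\mathcal{D}_k$, and the name `stripe_weight` is hypothetical. Tuples are visited in lexicographic order, which guarantees that every entry referenced by the recurrence has already been filled.

```python
from itertools import product


def stripe_weight(backbone, stripes):
    """Maximum weight of nonbackbone elements for fixed backbone elements.

    backbone   -- list of 2-intervals B_1 < ... < B_{i+1}
    stripes[k] -- list of (two_interval, weight) candidates for the stripe
                  between backbone[k] and backbone[k + 1] (0-based)
    """
    i = len(backbone) - 1
    # lower ends x'_k: r(L(B_1)), then r(R(B_{k-1}))
    lo = [backbone[0][0][1]] + [backbone[k - 1][1][1] for k in range(1, i + 1)]
    # upper ends x''_k: l(R(B_k))
    hi = [B[1][0] for B in backbone]
    # index candidates by (stripe, r(L(D)), r(R(D)))
    by_ends = {}
    for k, cands in enumerate(stripes):
        for ((a, ra), (b, rb)), wt in cands:
            by_ends.setdefault((k, ra, rb), []).append((a, b, wt))
    w = {}
    for x in product(*(range(l, h + 1) for l, h in zip(lo, hi))):
        best = 0  # base condition and empty selection
        for k in range(i + 1):  # case 1: move one coordinate without adding
            if x[k] > lo[k]:
                best = max(best, w[x[:k] + (x[k] - 1,) + x[k + 1:]])
        for k in range(i):  # case 2: add D = ([a, x_k], [b, x_{k+1}])
            for a, b, wt in by_ends.get((k, x[k], x[k + 1]), []):
                if a - 1 >= lo[k] and b - 1 >= lo[k + 1]:
                    y = x[:k] + (a - 1, b - 1) + x[k + 2:]
                    best = max(best, w[y] + wt)
        w[x] = best
    return w[tuple(hi)]


B1, B2, B3 = ((1, 2), (10, 11)), ((20, 21), (30, 31)), ((40, 41), (50, 51))
one_stripe = stripe_weight(
    [B1, B2],
    [[(((3, 4), (12, 13)), 5), (((5, 6), (14, 15)), 4), (((4, 7), (16, 17)), 7)]],
)
two_stripes = stripe_weight(
    [B1, B2, B3],
    [[(((3, 4), (22, 25)), 3)],
     [(((24, 26), (32, 33)), 4), (((27, 28), (34, 35)), 2)]],
)
```

In the first toy instance the two crossing candidates of weights 5 and 4 beat the single candidate of weight 7. In the second, the stripe-1 candidate of weight 3 overlaps the stripe-2 candidate of weight 4, so the best selection takes both stripe-2 candidates.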

The entry $w[x''_1, \ldots, x''_{i+1}]$ gives the maximum weight of nonbackbone elements. Adding the weight of the backbone elements, we have

$$ w_i[B_1, \ldots, B_{i+1}] = w[x''_1, \ldots, x''_{i+1}] + \sum_{k=1}^{i+1} w(B_k). \tag{5} $$

This completes the description of our algorithm.

We now give an analysis of the running time. Given $n$ 2-intervals as input, we can assume that the coordinates of their endpoints are between 1 and $4n$. For each set of backbone elements $B_1, \ldots, B_{i+1}$, it takes $O(n^{i+2})$ time to compute $w[x_1, \ldots, x_{i+1}]$ and $w_i[B_1, \ldots, B_{i+1}]$ by (4) and (5). For each $i$, $O(n^{i+1})$ sets of backbone elements are enumerated by (3). It thus takes $O(n^{i+1}) \cdot O(n^{i+2}) = O(n^{2i+3})$ time to compute the $i$-chain table $w_i[C, D]$. The time to compute the chain table $w[C, D]$ by (2) is then

$$ \sum_{1 \le i \le c} O(n^{2i+3}) = O(n^{2c+3}). $$

With $w[C, D]$ computed, it takes $O(n^2)$ additional time to compute $W[D]$ and $W_0[D]$ by (1). The total time of the algorithm is $O(n^{2c+3})$. This completes the proof of Theorem 1.

Our $(1 + 1/c)$-approximation is achieved by solving a special $c$-striped sequence optimally. From this result, we gain the insight that the most difficult case of the problem happens when the optimal structure contains very long chains of pseudoknots. This is intuitively consistent with the result by Lyngsø and Pedersen [22], who showed that the “chain of pseudoknots” structure (Fig. 3) becomes increasingly difficult to handle as the chain length increases.

3 APPROXIMATION ALGORITHMS BASED ON 5-CLAW-FREE GRAPHS

In this section, we present three approximation algorithms, for 2-Interval Pattern, Maximum Base Pair Stackings, and Maximum Stacking Base Pairs.

Fig. 5. Maximizing weight of nonbackbone elements with specified backbone elements.

We first introduce some preliminaries. Let $\mathcal{D}$ be a set of 2-intervals. Without loss of generality, assume that each interval in the support of $\mathcal{D}$ is a closed segment $[a, b]$ between two integer coordinates $a$ and $b$. Define the length of an interval as the number of integer coordinates (which correspond to the individual bases) it contains; the length of $[a, b]$ is $b - a + 1$. Define the 2-interval graph $G(\mathcal{D})$ as the (undirected) graph with a vertex for each 2-interval in $\mathcal{D}$ and with an edge between a pair of vertices if and only if the corresponding 2-intervals intersect. A $d$-claw $C$ is an induced subgraph $K_{1,d}$ of an undirected graph; it consists of an independent set $T_C$ of $d$ vertices, called talons, and a center vertex $z_C$ that is connected to all the talons. A graph is $d$-claw free if it has no $d$-claws. We prove an important property of 2-interval graphs that will be used in all three approximation algorithms in this section.

Lemma 1. For a set $\mathcal{D}$ of 2-intervals with minimum interval length $\alpha$ and maximum interval length $\beta$ ($\beta \ge 2$) in the support, the 2-interval graph $G(\mathcal{D})$ is $\left(5 + 2\left\lfloor \frac{\beta - 2}{\alpha} \right\rfloor\right)$-claw free.

Proof. Let $I$ be an arbitrary interval in the support of $\mathcal{D}$. Let $\mathcal{I}_I$ be an arbitrary set of disjoint intervals in the support of $\mathcal{D}$ that intersect the interval $I$. All intervals in $\mathcal{I}_I$, except maybe the leftmost and the rightmost intervals, are completely contained in $I$. The leftmost and rightmost intervals (if two such intervals exist) occupy at least two integer coordinates in $I$. The total number of intervals in $\mathcal{I}_I$ is at most 2 (for the leftmost and rightmost intervals) plus $\left\lfloor \frac{\beta - 2}{\alpha} \right\rfloor$ (for the intervals in the middle). Each 2-interval in $\mathcal{D}$ consists of two intervals in the support of $\mathcal{D}$, and hence, has at most

$$ 2\left(2 + \left\lfloor \frac{\beta - 2}{\alpha} \right\rfloor\right) = 4 + 2\left\lfloor \frac{\beta - 2}{\alpha} \right\rfloor $$

independent neighbors in the 2-interval graph $G(\mathcal{D})$. □
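To make the definitions concrete, the following sketch builds the 2-interval graph for a small input and evaluates the claw bound of Lemma 1. It assumes that a 2-interval is a pair of closed integer intervals `((l1, r1), (l2, r2))`; the function names `intervals_meet`, `two_interval_graph`, and `claw_bound` are illustrative and do not come from the paper.

```python
from itertools import combinations


def intervals_meet(p, q):
    """Closed integer intervals [p0, p1] and [q0, q1] share a coordinate."""
    return p[0] <= q[1] and q[0] <= p[1]


def two_interval_graph(ds):
    """Adjacency sets of G(D): an edge whenever two 2-intervals intersect."""
    adj = {v: set() for v in range(len(ds))}
    for u, v in combinations(range(len(ds)), 2):
        if any(intervals_meet(p, q) for p in ds[u] for q in ds[v]):
            adj[u].add(v)
            adj[v].add(u)
    return adj


def claw_bound(ds):
    """Return d such that G(D) is d-claw free according to Lemma 1."""
    lengths = [b - a + 1 for d in ds for (a, b) in d]
    alpha, beta = min(lengths), max(lengths)
    if beta < 2:
        raise ValueError("Lemma 1 assumes maximum interval length at least 2")
    return 5 + 2 * ((beta - 2) // alpha)


ds = [((1, 2), (9, 10)), ((2, 3), (5, 6)), ((6, 8), (12, 13))]
graph = two_interval_graph(ds)
bound = claw_bound(ds)
```

With interval lengths between 2 and 3, as in this toy input, the floor term vanishes and the bound is 5.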

3.1 2-Interval Pattern

The problem 2-Interval Pattern over the general model $\{<, \sqsubset, \between\}$ is essentially the problem Maximum Independent Set in 2-interval graphs. Similarly, Weighted 2-Interval Pattern over the model $\{<, \sqsubset, \between\}$ is Maximum Weight Independent Set in 2-interval graphs. From Lemma 1, each 2-interval graph is a $d$-claw-free graph for some $d$. We will use two results for $d$-claw-free graphs, both based on a local improvement technique, as follows:

1. For Maximum Independent Set in $(k + 1)$-claw-free graphs, $k \ge 4$, Halldorsson [14] (see footnote 1) presented an $O(n^{\log_k 1/\epsilon})$ time $(k/2 + \epsilon)$-approximation algorithm for any constant $\epsilon > 0$.

2. For Maximum Weight Independent Set in $d$-claw-free graphs, Berman [6] presented a polynomial-time $(d/2 + \epsilon)$-approximation algorithm for any constant $\epsilon > 0$.
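The flavor of local improvement can be illustrated with a deliberately simplified sketch. The code below is not the algorithm of [14] or [6]; it only performs the two smallest kinds of improvement for the unweighted problem (adding a free vertex, and swapping one chosen vertex for two nonadjacent outside vertices), and it carries no approximation guarantee from those papers. The name `local_search_independent_set` and the adjacency-set input format are assumptions.

```python
from itertools import combinations


def local_search_independent_set(adj):
    """adj: dict mapping each vertex to the set of its neighbours."""
    indep = set()
    improved = True
    while improved:
        improved = False
        # improvement of size (0 out, 1 in): add any vertex with no chosen neighbour
        for v in adj:
            if v not in indep and not (adj[v] & indep):
                indep.add(v)
                improved = True
        # improvement of size (1 out, 2 in)
        outside = [v for v in adj if v not in indep]
        for u, v in combinations(outside, 2):
            if v in adj[u]:
                continue
            blocked = (adj[u] | adj[v]) & indep
            if len(blocked) <= 1:
                indep -= blocked
                indep |= {u, v}
                improved = True
                break
    return indep


# A 3-claw: the greedy pass picks the center first, a swap then repairs it.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
result = local_search_independent_set(star)
```

Every improvement strictly increases the size of the set, so the loop stops after at most as many improvements as there are vertices.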

Our algorithms for 2-Interval Pattern and Weighted 2-Interval Pattern over the general model $\{<, \sqsubset, \between\}$ are very simple: on an input set $\mathcal{D}$ of 2-intervals, construct the 2-interval graph $G(\mathcal{D})$, then obtain an approximation using one of the two algorithms [14], [6]. Crochemore et al. [11] noted that, on an input set of $n$ 2-intervals, 2-Interval Pattern and Weighted 2-Interval Pattern over the reduced model $\{\sqsubset, \between\}$ reduce to $O(n)$ subproblems over the general model $\{<, \sqsubset, \between\}$. Therefore, with an extra $O(n)$ multiplicative factor in the running time, any approximation for the general model $\{<, \sqsubset, \between\}$ also extends to the reduced model $\{\sqsubset, \between\}$. From Lemma 1, the following theorem immediately follows.

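Theorem 2. Given a set of 2-intervals with minimum interval length $\alpha$ and maximum interval length $\beta$ ($\beta \ge 2$) in the support, 2-Interval Pattern over the models $\{<, \sqsubset, \between\}$ and $\{\sqsubset, \between\}$ can be approximated with a ratio of $2 + \left\lfloor \frac{\beta - 2}{\alpha} \right\rfloor + \epsilon$ in polynomial time, and Weighted 2-Interval Pattern over the models $\{<, \sqsubset, \between\}$ and $\{\sqsubset, \between\}$ can be approximated with a ratio of $2.5 + \left\lfloor \frac{\beta - 2}{\alpha} \right\rfloor + \epsilon$ in polynomial time, for any constant $\epsilon > 0$.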

Now consider 2-Interval Pattern and Weighted 2-Interval Pattern on input with the unitary restriction. Given a set of 2-intervals with the unitary restriction, the minimum length $\alpha$ and the maximum length $\beta$ of the intervals in the support are the same. If $\alpha = \beta \ge 2$, then we can directly apply Theorem 2. If $\alpha = \beta = 1$, then the 2-intervals indeed satisfy the point restriction. We refer back to Table 1. It is known that, on input with the point restriction, 2-Interval Pattern (indeed Weighted 2-Interval Pattern too) over the models $\{<, \sqsubset, \between\}$ and $\{\sqsubset, \between\}$ can be solved exactly in polynomial time [24], [36], [10]. We have proved the following corollary.

Corollary 1. Given a set of 2-intervals with the unitary restriction, 2-Interval Pattern over the models $\{<, \sqsubset, \between\}$ and $\{\sqsubset, \between\}$ can be approximated with a ratio of $2 + \epsilon$ in polynomial time, and Weighted 2-Interval Pattern over the models $\{<, \sqsubset, \between\}$ and $\{\sqsubset, \between\}$ can be approximated with a ratio of $2.5 + \epsilon$ in polynomial time, for any constant $\epsilon > 0$.
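The step from Theorem 2 to the corollary rests on the floor term vanishing when all intervals have the same length of at least 2. Writing $\alpha$ and $\beta$ for the minimum and maximum interval lengths as in Theorem 2, a short check of this step reads:

```latex
\[
\alpha = \beta \ge 2
\;\Longrightarrow\;
0 \le \frac{\beta - 2}{\alpha} = 1 - \frac{2}{\beta} < 1
\;\Longrightarrow\;
\left\lfloor \frac{\beta - 2}{\alpha} \right\rfloor = 0 ,
\]
\[
\text{so the ratios of Theorem 2 become }
2 + 0 + \epsilon = 2 + \epsilon
\quad\text{and}\quad
2.5 + 0 + \epsilon = 2.5 + \epsilon .
\]
```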

3.2 Maximum Base Pair Stackings and Maximum Stacking Base Pairs

We introduce some more preliminaries. Two stacking loops are consecutive if they share a base pair. For example, let $L_1$ be a stacking loop with base pairs $(i, j)$ and $(i + 1, j - 1)$, and $L_2$ be a stacking loop with base pairs $(i + 1, j - 1)$ and $(i + 2, j - 2)$. Then, the two stacking loops $L_1$ and $L_2$ are consecutive and together form a helix $([i, i + 2], [j - 2, j])$ with three stacking base pairs $(i, j)$, $(i + 1, j - 1)$, and $(i + 2, j - 2)$. We now prove the following lemma for stacking loops.

Lemma 2. An interval of $k$ bases can intersect at most $k + 1$ stacking loops in a secondary structure. If an interval of $k$ bases intersects exactly $k + 1$ stacking loops in a secondary structure, then the $k + 1$ stacking loops must be consecutive.

Proof. Each base can participate in at most one base pair. If two stacking loops in a secondary structure share a base, then they must also share the base pair that contains the shared base. Then the two stacking loops must be consecutive, formed by three stacking base pairs, with the shared base pair in the middle. Similarly, if two stacking loops in a secondary structure share two consecutive bases (note that two consecutive bases cannot form a base pair), then the two stacking loops must share the two base pairs that contain the two shared bases, that is, the two stacking loops are, in fact, the same stacking loop. Therefore, two (distinct) stacking loops cannot share two consecutive bases.

A stacking loop that intersects an interval $[a, b]$ of $k$ bases always contains two consecutive bases of the extended interval $[a - 1, b + 1]$ of $k + 2$ bases. The extended interval has $k + 1$ pairs of consecutive bases $[a - 1, a], [a, a + 1], \ldots, [b, b + 1]$. Since two stacking loops do not share two consecutive bases, the interval $[a, b]$ intersects at most $k + 1$ stacking loops. If the interval $[a, b]$ intersects exactly $k + 1$ stacking loops, then each stacking loop contains a distinct pair of consecutive bases in the extended interval $[a - 1, b + 1]$. The first two stacking loops, which contain the two pairs $[a - 1, a]$ and $[a, a + 1]$, share the base $a$ and hence are consecutive. By induction, these $k + 1$ stacking loops must be consecutive. □

Footnote 1: Halldorsson [14] credited Hurkens and Schrijver [15] for similar results.

We now present a $(8/3)$-approximation algorithm for Maximum Base Pair Stackings that, given a sequence $S$ of $n$ bases and a candidate set $C$ of $m$ base pairs as input, outputs a set $\mathcal{S}$ of base pair stackings (stacking loops) whose size is at least $3/8$ of the maximum number of base pair stackings. The algorithm initializes $\mathcal{S}$ to empty, unmarks all bases in $S$, then performs the following four steps:

1. Repeat whenever possible: Find the leftmost 5 consecutive stacking loops, that is, a helix $([a, a + 5], [b - 5, b])$ such that $a$ is as small as possible, formed by base pairs in $C$ with unmarked bases in $S$, add these stacking loops to $\mathcal{S}$, then mark their bases.

2. Repeat whenever possible: Find any 4 consecutive stacking loops formed by base pairs with unmarked bases, add them to $\mathcal{S}$, then mark their bases.

3. Repeat whenever possible: Find any 3 consecutive stacking loops formed by base pairs with unmarked bases, add them to $\mathcal{S}$, then mark their bases.

4. Construct a set $\mathcal{D}$ of weighted 2-intervals for the remaining stacking loops formed by base pairs with unmarked bases: a 2-interval of weight 1 for each single stacking loop and a 2-interval of weight 2 for two consecutive stacking loops. Find an independent set $\mathcal{I}$ in the 2-interval graph $G(\mathcal{D})$ using Berman’s $(5/2)$-approximation algorithm for Maximum Weight Independent Set in 5-claw-free graphs [6]. For each 2-interval in $\mathcal{I}$, add the corresponding stacking loops to $\mathcal{S}$.
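The first three steps can be sketched as a straightforward greedy scan. This is an unoptimized illustration, not the $O(n^2)$ implementation discussed in the complexity analysis: bases are assumed to be numbered 1 to `n`, candidate base pairs are tuples `(i, j)` with `i < j`, and the name `greedy_helices` is hypothetical. The paper asks for the leftmost helix only in step 1 and allows any helix in steps 2 and 3; the sketch uses the leftmost one in all three, which is one valid choice. Step 4 is not covered.

```python
def greedy_helices(n, pairs, loop_counts=(5, 4, 3)):
    """Return helices as (a, b, k): base pairs (a + t, b - t) for t = 0..k,
    i.e. k consecutive stacking loops, chosen greedily for k = 5, 4, 3."""
    cand = set(pairs)
    marked = [False] * (n + 1)  # index 0 unused
    chosen = []

    def find(k):
        for a in range(1, n + 1):  # smallest a first
            # the two strands must not overlap: a + k < b - k
            for b in range(n, a + 2 * k, -1):
                if all((a + t, b - t) in cand
                       and not marked[a + t] and not marked[b - t]
                       for t in range(k + 1)):
                    return a, b
        return None

    for k in loop_counts:
        while (hit := find(k)) is not None:
            a, b = hit
            for t in range(k + 1):
                marked[a + t] = marked[b - t] = True
            chosen.append((a, b, k))
    return chosen


# Seven stacked candidate pairs: step 1 takes five loops, one pair is left over.
long_helix = greedy_helices(20, [(1 + t, 20 - t) for t in range(7)])
short_helix = greedy_helices(10, [(1, 10), (2, 9), (3, 8), (4, 7)])
```

The leftover pair in the first example forms no stacking loop on its own, so it would contribute nothing in the fourth step either.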

We note that the first three steps of our algorithm are similar to the first two steps of the GreedySP algorithm by Ieong et al. [16]: our first step corresponds to the first step of GreedySP with parameter $i = 5$; our second and third steps correspond to the second step of GreedySP with parameters $k = 4$ and $k = 3$ (both $i$ and $k$ are parameters of the GreedySP algorithm).

We also note that Berman’s algorithm for Maximum Weight Independent Set in $d$-claw-free graphs [6] achieves only a $(d/2 + \epsilon)$-approximation in general: when the weights of the vertices are superpolynomial in the number of vertices, a rescaling procedure, at the price of an additional $\epsilon$ in the approximation ratio, is necessary to ensure a polynomial number of iterations. However, in the fourth step of our algorithm, each vertex in $G(\mathcal{D})$ has an integer weight of either 1 or 2, so the rescaling procedure is unnecessary and a $(5/2)$-approximation can be obtained.

We next show that our algorithm indeed achieves a $(8/3)$-approximation for Maximum Base Pair Stackings. Let $\mathcal{S}^*$ be an optimal set of stacking loops. We will show that $|\mathcal{S}^*| / |\mathcal{S}| \le 8/3$.

Let $s_1$, $s_2$, $s_3$, and $s_4$, respectively, be the numbers of stacking loops in $\mathcal{S}$ found by the first, second, third, and fourth steps of our algorithm. Let $s_1^*$, $s_2^*$, and $s_3^*$, respectively, be the numbers of stacking loops in $\mathcal{S}^*$ that intersect the stacking loops found by the first, second, and third steps of our algorithm. Let $s_4^*$ be the number of remaining stacking loops in $\mathcal{S}^*$. Note that the remaining stacking loops in $\mathcal{S}^*$, which do not intersect the stacking loops found by the first three steps of our algorithm, must correspond to a subset of 2-intervals in $\mathcal{D}$ in the fourth step of our algorithm.

In the first step, our algorithm repeatedly selects the leftmost five consecutive stacking loops, a balanced 2-interval $D_5$. Since the interval length of $D_5$ is 6, each interval of $D_5$ intersects at most $6 + 1 = 7$ stacking loops in $\mathcal{S}^*$. We show that a tighter estimate than 7 is possible for the left interval of $D_5$. Suppose that the left interval intersects 7 stacking loops. Then it follows from Lemma 2 that these 7 stacking loops are consecutive; 5 of the 7 stacking loops should have been chosen as the leftmost 5 consecutive stacking loops instead of $D_5$, a contradiction. Therefore, $D_5$ intersects at most $6 + 7 = 13$ stacking loops in $\mathcal{S}^*$. We have the following inequality:

$$ \frac{s_1^*}{s_1} \le \frac{6 + 7}{5} = \frac{13}{5} = 2.6. \tag{6} $$

In the second step, our algorithm repeatedly selects any four consecutive stacking loops, a balanced 2-interval $D_4$ of length 5. We show that each interval of $D_4$ intersects at most five stacking loops in $\mathcal{S}^*$. Suppose, on the contrary, that an interval intersects six stacking loops in $\mathcal{S}^*$. Then, again from Lemma 2, it follows that the six stacking loops must be consecutive and, consequently, contain five consecutive stacking loops. This is a contradiction since no five consecutive stacking loops remain after the first step. We have the following inequality:

$$ \frac{s_2^*}{s_2} \le \frac{5 + 5}{4} = \frac{5}{2} = 2.5. \tag{7} $$

A similar analysis for the third step shows the following inequality:

$$ \frac{s_3^*}{s_3} \le \frac{4 + 4}{3} = \frac{8}{3} \approx 2.67. \tag{8} $$

In the fourth step, each 2-interval in $\mathcal{D}$ is balanced and corresponds to either a single stacking loop with interval length 2 or two consecutive stacking loops with interval length 3. It follows from Lemma 1 that the 2-interval graph $G(\mathcal{D})$ is 5-claw-free. Berman’s $(5/2)$-approximation algorithm for Maximum Weight Independent Set in 5-claw-free graphs [6] guarantees that

$$ \frac{s_4^*}{s_4} \le \frac{5}{2} = 2.5. \tag{9} $$


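Combining inequalities (6), (7), (8), and (9), we have

$$ \frac{|\mathcal{S}^*|}{|\mathcal{S}|} = \frac{s_1^* + s_2^* + s_3^* + s_4^*}{s_1 + s_2 + s_3 + s_4} \le \max_{1 \le k \le 4} \frac{s_k^*}{s_k} \le \frac{8}{3}. $$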
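The last step uses the fact that a ratio of sums cannot exceed the largest of the individual ratios. Stated without division, so that a step contributing no stacking loops causes no difficulty, the argument can be written as follows, with $s_k$ and $s_k^*$ the per-step counts from inequalities (6) to (9) and $\mathcal{S}$, $\mathcal{S}^*$ the output and optimal sets:

```latex
\[
s_1^* \le \tfrac{13}{5}\, s_1, \qquad
s_2^* \le \tfrac{5}{2}\, s_2, \qquad
s_3^* \le \tfrac{8}{3}\, s_3, \qquad
s_4^* \le \tfrac{5}{2}\, s_4 ,
\]
\[
\text{and since } \tfrac{5}{2} < \tfrac{13}{5} < \tfrac{8}{3}, \quad
|\mathcal{S}^*| = \sum_{k=1}^{4} s_k^*
\le \tfrac{8}{3} \sum_{k=1}^{4} s_k
= \tfrac{8}{3}\, |\mathcal{S}| .
\]
```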

We now give an analysis of the complexities of our algorithm. Using an adjacency matrix representation of the candidate set of base pairs, the first three steps of our algorithm can be implemented in $O(n^2)$ time and space. The fourth step of our algorithm is the dominating step. The 2-interval graph $G(\mathcal{D})$ has $O(m)$ vertices and $O(m^2)$ edges; the construction of the graph takes $O(n^2 + m^2)$ time and space. Berman’s algorithm on $G(\mathcal{D})$ takes $O(m)$ iterations, where each iteration takes $O(m^4)$ time to find an improving 4-claw. The overall complexities of our algorithm are, therefore, $O(n^2 + m^2 + T_w(m))$ time and $O(n^2 + m^2)$ space, where $T_w(m) = O(m^5)$ is the time to obtain a $(5/2)$-approximation for Maximum Weight Independent Set in a 5-claw-free graph of $m$ vertices with bounded integer weights.

We next present a $(5/2)$-approximation algorithm for Maximum Stacking Base Pairs. This algorithm is almost identical to our algorithm for Maximum Base Pair Stackings except that the first three steps are omitted and the fourth step is modified to assign a weight of 2 to a single stacking loop (with two stacking base pairs) and 3 to two consecutive stacking loops (with three stacking base pairs). The crucial observation here, as noted by Lyngsø [21], is that we only need to consider helices with two or three stacking base pairs, since a long helix with more than three base pairs can be decomposed into several short helices, each with two or three base pairs, such that the total weight of the short helices is the same as that of the long helix. We have proved the following theorem.
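The decomposition of a long helix into short ones can be illustrated with a few lines of Python. The function name `decompose_helix` is an assumption for illustration; it returns one possible split of a helix with `p` stacking base pairs into pieces of two or three base pairs.

```python
def decompose_helix(p):
    """Split a helix of p >= 2 stacking base pairs into pieces of size 2 or 3."""
    if p < 2:
        raise ValueError("a helix needs at least two stacking base pairs")
    parts = [3] if p % 2 else []          # one piece of 3 absorbs an odd count
    parts += [2] * ((p - sum(parts)) // 2)
    return parts


pieces = decompose_helix(7)
```

Because the pieces partition the base pairs of the original helix, their weights (numbers of stacking base pairs) add up to the weight of the long helix.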

Theorem 3. Given a sequence of $n$ bases and a candidate set of $m = O(n^2)$ base pairs, Maximum Base Pair Stackings can be approximated with a ratio of $8/3$ and Maximum Stacking Base Pairs can be approximated with a ratio of $5/2$ in $O(n^2 + m^2 + T_w(m))$ time and $O(n^2 + m^2)$ space, where $T_w(m) = O(m^5)$ is the time to obtain a $(5/2)$-approximation for Maximum Weight Independent Set in a 5-claw-free graph of $m$ vertices with bounded integer weights.

We note that the 5-claw-free graph technique that we have used to obtain improved approximation algorithms in this section has been used earlier in another interesting problem in computational biology called Nonoverlapping Local Alignment [2], [6], [7], which is essentially the problem Maximum Weight Independent Set in proper 2-union graphs [3] and is closely related to our problem Weighted 2-Interval Pattern over the model $\{\sqsubset, \between\}$.

4 CONCLUDING REMARKS

A helix $H$ of $k$ consecutive stacking loops consists of $k + 1$ stacking base pairs and can be represented compactly by a balanced 2-interval $D$ of interval length $k + 1$. In a sense, the problems Maximum Base Pair Stackings, Maximum Stacking Base Pairs, 2-Interval Pattern, and Weighted 2-Interval Pattern differ only on the weight of the 2-intervals. The above-mentioned balanced 2-interval $D$ has a weight of $k$ in Maximum Base Pair Stackings, $k + 1$ in Maximum Stacking Base Pairs, 1 in 2-Interval Pattern, and an arbitrary nonnegative weight in Weighted 2-Interval Pattern. In the context of RNA secondary structure prediction, a long helix clearly has a more significant contribution to the global free energy than a short one. The connection between helices and balanced 2-intervals suggests the following natural variant of 2-Interval Pattern.

Length-weighted balanced 2-interval pattern: Given a set of 2-intervals with the balanced restriction and with weights equal to interval lengths, find a subset of disjoint 2-intervals of maximum total weight.

Both the length-weighted and the unweighted (unit-weight) variants of 2-Interval Pattern on input with the balanced restriction sit between two other variants of 2-Interval Pattern: 1) they are both special cases of Weighted 2-Interval Pattern on input with the balanced restriction, which admits a 4-approximation [11]; 2) they both generalize (unweighted) 2-Interval Pattern on input with the unitary restriction, which admits a $(2 + \epsilon)$-approximation (Corollary 1). It is not at all clear which one of the two variants (length-weighted and unweighted) is easier. Therefore, although the current best approximation ratio is still 4 for unweighted 2-Interval Pattern on input with the balanced restriction [11], we nevertheless pose the following open question.

Is there a polynomial-time algorithm for Length-Weighted Balanced 2-Interval Pattern with a constant approximation ratio strictly less than 4?

The approximation algorithms presented in this paper have relatively high time complexities. It would be interesting to obtain faster algorithms (even with slightly worse approximation ratios). On the other hand, we note that the problem size for RNA secondary structure prediction can be drastically reduced by the 2-interval model, which uses long helices instead of individual base pairs as input. Furthermore, the ideas of our algorithms are simple. So it is quite possible that, through careful algorithmic engineering, our algorithms can be turned into effective heuristics for RNA secondary structure prediction. This will be our future work.

ACKNOWLEDGMENTS

The author thanks the anonymous referees for valuable comments and references. This research was partially supported by the US National Science Foundation grant DBI-0743670 and Utah State University grant A13501. Results in this paper have appeared in preliminary form in the Third International Conference on Algorithmic Aspects in Information and Management (AAIM ’07) [18] and First Annual International Conference on Combinatorial Optimization and Applications (COCOA ’07) [19].

REFERENCES

[1] T. Akutsu, “Dynamic Programming Algorithms for RNA Secondary Structure Prediction with Pseudoknots,” Discrete Applied Math., vol. 104, pp. 45-62, 2000.

[2] V. Bafna, B. Narayanan, and R. Ravi, “Nonoverlapping Local Alignments (Weighted Independent Sets of Axis-Parallel Rectangles),” Discrete Applied Math., vol. 71, pp. 41-53, 1996.


[3] R. Bar-Yehuda, M.M. Halldorsson, J.S. Naor, H. Shachnai, and I. Shapira, “Scheduling Split Intervals,” SIAM J. Computing, vol. 36, pp. 1-15, 2006.

[4] F.H.D. van Batenburg, A.P. Gultyaev, C.W.A. Pleij, J. Ng, and J. Oliehoek, “Pseudobase: A Database with RNA Pseudoknots,” Nucleic Acids Research, vol. 28, pp. 201-204, 2000.

[5] S. Bereg, M. Kubica, T. Waleń, and B. Zhu, “RNA Multiple Structural Alignment with Longest Common Subsequences,” J. Combinatorial Optimization, vol. 13, pp. 179-188, 2007.

[6] P. Berman, “A $d/2$ Approximation for Maximum Weight Independent Set in $d$-Claw Free Graphs,” Nordic J. Computing, vol. 7, pp. 178-184, 2000.

[7] P. Berman, B. DasGupta, and S. Muthukrishnan, “Simple Approximation Algorithm for Nonoverlapping Local Alignments,” Proc. 13th Ann. ACM-SIAM Symp. Discrete Algorithms (SODA ’02), pp. 677-678, 2002.

[8] G. Blin, G. Fertin, and S. Vialette, “Extracting Constrained 2-Interval Subsets in 2-Interval Sets,” Theoretical Computer Science, vol. 385, pp. 241-263, 2007.

[9] R.B. Cary and G.D. Stormo, “Graph-Theoretic Approach to RNA Modeling Using Comparative Data,” Proc. Third Int’l Conf. Intelligent Systems for Molecular Biology (ISMB ’95), pp. 75-80, 1995.

[10] E. Chen, L. Yang, and H. Yuan, “Improved Algorithms for Largest Cardinality 2-Interval Pattern Problem,” J. Combinatorial Optimization, special issue on bioinformatics, vol. 13, pp. 263-275, 2007.

[11] M. Crochemore, D. Hermelin, G.M. Landau, D. Rawitz, and S. Vialette, “Approximating the 2-Interval Pattern Problem,” Theoretical Computer Science, vol. 395, pp. 283-297, 2008.

[12] R.M. Dirks and N.A. Pierce, “A Partition Function Algorithm for Nucleic Acid Secondary Structure Including Pseudoknots,” J. Computational Chemistry, vol. 24, pp. 1664-1677, 2003.

[13] P.P. Gardner and R. Giegerich, “A Comprehensive Comparison of Comparative RNA Structure Prediction Approaches,” BMC Bioinformatics, vol. 5, article no. 140, 2004.

[14] M.M. Halldorsson, “Approximating Discrete Collections via Local Improvements,” Proc. Sixth Ann. ACM-SIAM Symp. Discrete Algorithms (SODA ’95), pp. 160-169, 1995.

[15] C.A.J. Hurkens and A. Schrijver, “On the Size of Systems of Sets Every t of Which Have an SDR, with an Application to the Worst-Case Ratio of Heuristics for Packing Problems,” SIAM J. Discrete Math., vol. 2, pp. 68-72, 1989.

[16] S. Ieong, M.-Y. Kao, T.-W. Lam, W.-K. Sung, and S.-M. Yiu, “Predicting RNA Secondary Structure with Arbitrary Pseudoknots by Maximizing the Number of Stacking Pairs,” J. Computational Biology, vol. 10, pp. 981-995, 2003.

[17] M. Jiang, “A 2-Approximation for the Preceding-and-Crossing Structured 2-Interval Pattern Problem,” J. Combinatorial Optimization, special issue on bioinformatics, vol. 13, pp. 217-221, 2007.

[18] M. Jiang, “Improved Approximation Algorithms for Predicting RNA Secondary Structures with Arbitrary Pseudoknots,” Proc. Third Int’l Conf. Algorithmic Aspects in Information and Management (AAIM ’07), pp. 399-410, 2007.

[19] M. Jiang, “A PTAS for the Weighted 2-Interval Pattern Problem over the Preceding-and-Crossing Model,” Proc. First Ann. Int’l Conf. Combinatorial Optimization and Applications (COCOA ’07), pp. 378-387, 2007.

[20] M. Jiang, M. Mayne, and J. Gillespie, “Delta: A Toolset for the Structural Analysis of Biological Sequences on a 3D Triangular Lattice,” Proc. 2007 Int’l Symp. Bioinformatics Research and Applications (ISBRA ’07), pp. 518-529, 2007.

[21] R.B. Lyngsø, “Complexity of Pseudoknot Prediction in Simple Models,” Proc. 31st Int’l Colloquium Automata, Languages, and Programming (ICALP ’04), pp. 919-931, 2004.

[22] R.B. Lyngsø and C.N.S. Pedersen, “RNA Pseudoknot Prediction in Energy-Based Models,” J. Computational Biology, vol. 7, pp. 409-427, 2000.

[23] R.B. Lyngsø, M. Zuker, and C.N.S. Pedersen, “Fast Evaluation of Internal Loops in RNA Secondary Structure Prediction,” Bioinformatics, vol. 15, pp. 440-445, 1999.

[24] S. Micali and V.V. Vazirani, “An $O(\sqrt{|V|}\,|E|)$ Algorithm for Finding Maximum Matching in General Graphs,” Proc. 21st Ann. Symp. Foundations of Computer Science (FOCS ’80), pp. 17-27, 1980.

[25] R. Nussinov and A.B. Jacobson, “Fast Algorithm for Predicting the Secondary Structure of Single-Stranded RNA,” Proc. Nat’l Academy of Sciences USA, vol. 77, pp. 6309-6313, 1980.

[26] R. Nussinov, G. Pieczenik, J.R. Griggs, and D.J. Kleitman, “Algorithms for Loop Matching,” SIAM J. Applied Math., vol. 35, pp. 68-82, 1978.

[27] N.R. Pace, B.C. Thomas, and C.R. Woese, “Probing RNA Structure, Function, and History by Comparative Analysis,” The RNA World, second ed., pp. 113-141, Cold Spring Harbor Laboratory Press, 1999.

[28] J. Reeder and R. Giegerich, “Design, Implementation and Evaluation of a Practical Pseudoknot Folding Algorithm Based on Thermodynamics,” BMC Bioinformatics, vol. 5, article no. 104, 2004.

[29] E. Rivas and S.R. Eddy, “A Dynamic Programming Algorithm for RNA Structure Prediction Including Pseudoknots,” J. Molecular Biology, vol. 285, pp. 2053-2068, 1999.

[30] D. Sankoff, “Simultaneous Solution of the RNA Folding, Alignment and Protosequence Problems,” SIAM J. Applied Math., vol. 45, pp. 810-825, 1985.

[31] J.E. Tabaska, R.B. Cary, H.N. Gabow, and G.D. Stormo, “An RNA Folding Method Capable of Identifying Pseudoknots and Base Triples,” Bioinformatics, vol. 14, pp. 691-699, 1998.

[32] F. Tahi, M. Gouy, and M. Regnier, “Automatic RNA Secondary Structure Prediction with a Comparative Approach,” Computers and Chemistry, vol. 26, pp. 521-530, 2002.

[33] I. Tinoco Jr., P.N. Borer, B. Dengler, M.D. Levine, O.C. Uhlenbeck, D.M. Crothers, and J. Gralla, “Improved Estimation of Secondary Structure in Ribonucleic Acids,” Nature New Biology, vol. 246, pp. 40-42, 1973.

[34] I. Tinoco Jr. and C. Bustamante, “How RNA Folds,” J. Molecular Biology, vol. 293, pp. 271-281, 1999.

[35] Y. Uemura, A. Hasegawa, S. Kobayashi, and T. Yokomori, “Tree Adjoining Grammars for RNA Structure Prediction,” Theoretical Computer Science, vol. 210, pp. 277-303, 1999.

[36] S. Vialette, “On the Computational Complexity of 2-Interval Pattern Matching Problems,” Theoretical Computer Science, vol. 312, pp. 223-249, 2004.

[37] Y. Wexler, C. Zilberstein, and M. Ziv-Ukelson, “A Study of Accessible Motifs and RNA Folding Complexity,” J. Computational Biology, vol. 14, pp. 856-872, 2007.

[38] K.C. Wiese and A. Hendriks, “Comparison of P-RnaPredict and mfold - Algorithms for RNA Secondary Structure Prediction,” Bioinformatics, vol. 22, pp. 934-942, 2006.

[39] C. Witwer, I.L. Hofacker, and P.F. Stadler, “Prediction of Consensus RNA Secondary Structures Including Pseudoknots,” IEEE/ACM Trans. Computational Biology and Bioinformatics, vol. 1, no. 2, pp. 66-77, Apr. 2004.

[40] A. Xayaphoummine, T. Bucher, and H. Isambert, “Kinefold Web Server for RNA/DNA Folding Path and Structure Prediction Including Pseudoknots and Knots,” Nucleic Acids Research, vol. 33, pp. W605-W610, 2005.

[41] M. Zuker and D. Sankoff, “RNA Secondary Structures and Their Prediction,” Bull. of Math. Biology, vol. 46, pp. 591-621, 1984.

[42] M. Zuker and P. Stiegler, “Optimal Computer Folding of Large RNA Sequences Using Thermodynamics and Auxiliary Information,” Nucleic Acids Research, vol. 9, pp. 133-148, 1981.

Minghui Jiang received the BS degree in physics from Peking University in 1997, two MS degrees in computer science and physics from Purdue University in 1999, and the PhD degree in computer science from Montana State University in 2005. He is an assistant professor of computer science at Utah State University. His research interests are in the design and analysis of algorithms, discrete and computational geometry, bioinformatics and computational biology, and combinatorial optimization.

332 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 7, NO. 2, APRIL-JUNE 2010