

Information Processing Letters 112 (2012) 562–565

Contents lists available at SciVerse ScienceDirect

Information Processing Letters

www.elsevier.com/locate/ipl

Doubly-Constrained LCS and Hybrid-Constrained LCS problems revisited

Effat Farhana a,b, M. Sohel Rahman a,∗
a AℓEDA Group, Department of CSE, BUET, Dhaka-1000, Bangladesh
b Department of CSE, AUST, Dhaka-1208, Bangladesh

Article info

Article history: Received 23 September 2011. Received in revised form 14 April 2012. Accepted 18 April 2012. Available online 21 April 2012. Communicated by J. Torán.

Keywords: Finite automata; Longest common subsequence; Algorithms; Combinatorial problems

Abstract

We revisit two recently studied variants of the classic Longest Common Subsequence (LCS) problem, namely, the Doubly-Constrained LCS (DC-LCS) and Hybrid-Constrained LCS (HC-LCS) problems. We present finite automata based algorithms for both problems.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

A subsequence of a given sequence $s$ is obtained by deleting zero or more symbols from $s$. Given two sequences, the longest common subsequence (LCS) problem is to find a common subsequence of maximum length. The classic dynamic programming algorithm to compute an LCS of two input strings was invented by Wagner and Fischer [13]. A constrained variant of the longest common subsequence (CLCS) problem was first proposed by Tsai [12], where the computed LCS must contain a specific constraint (input) string as a subsequence. Subsequently, this problem was addressed in [4,9,6,10]. Among other interesting constrained variants of LCS, repetition-free LCS [1] and exemplar LCS [2] may be cited.

In this paper, we study two new variants of the CLCS problem, namely, the “Doubly-Constrained LCS (DC-LCS)” and the “Hybrid-Constrained LCS (HC-LCS)” problems. These two problems were very recently introduced and studied in [3] and [5], respectively. The problems are formally defined below.

* Corresponding author.
E-mail addresses: [email protected] (E. Farhana), [email protected] (M.S. Rahman).

0020-0190/$ – see front matter © 2012 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.ipl.2012.04.007

Problem 1 (DC-LCS). Given two input strings $s_1$, $s_2$, a set of constraint patterns $C_s$ and an occurrence constraint function $C_o : \Sigma \to \mathbb{N}$, assigning an upper bound on the number of occurrences of each symbol $\sigma \in \Sigma$, the goal of DC-LCS is to find an LCS $s$ of $s_1$, $s_2$ such that $s$ contains at most $C_o(\sigma)$ occurrences of each symbol $\sigma \in \Sigma$ and contains each pattern in $C_s$ as a subsequence.

Problem 2 (HC-LCS). Given two input strings $s_1$, $s_2$ and two constraint patterns $P$ and $Q$, the goal of HC-LCS is to compute an LCS $s$ of $s_1$ and $s_2$ such that $s$ is a supersequence of $P$ but not of $Q$.

DC-LCS is NP-hard for an arbitrary number of constraint strings. Bonizzoni et al. [3] presented a fixed-parameter algorithm where the parameter $k$ is the length of the solution. Their algorithm runs in $k^k\,T(k, |s_1|, |s_2|)$ time, where $T(k, |s_1|, |s_2|) = O(|s_1| \log |s_1| \, 2^{O(k)}) + O(|s_1||s_2||s_c| \, 2^{O(k)} \log |\tilde{\Sigma}|)$. Here, $s_c$ is one of the constraint patterns in $C_s$ and $\tilde{\Sigma}$ is the set containing the pairs $(\sigma, i)$ for each $\sigma \in \Sigma$ and $i \in \{1, \ldots, C_o(\sigma)\}$. On the other hand, for HC-LCS, two algorithms were presented in [5], of time complexity $O(n^2|P||Q|)$ and $O(|P||Q| r \log\log n + n \log n)$, respectively, where $n = \max(|s_1|, |s_2|)$ and $r$ is the total number of matches between $s_1$ and $s_2$. Note that, in the worst case, $r = O(n^2)$, hence the latter algorithm is slightly worse than the former in the worst case. Notably, a finite alphabet was assumed in [5]. In this paper, we devise efficient finite automata based algorithms for both the DC-LCS and HC-LCS problems.

2. Preliminaries

To formally describe our algorithms, the following definitions are necessary.

Definition 1 (DFA). A Deterministic Finite Automaton is represented by the 5-tuple $A = (Q, \Sigma, \delta, q_0, F)$, where $A$ is the name of the DFA, $Q$ its set of states, $\Sigma$ its set of input symbols, $\delta$ its transition function, $q_0$ its start state and $F$ its set of accepting states.

Definition 2 (DASG). A Directed Acyclic Subsequence Graph (DASG) for a string $s$ of length $n$ is a DFA that accepts the language of all $2^n$ possible subsequences of $s$. The DFA is partial, that is, a state need not have a transition defined for every symbol [8].
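To make Definition 2 concrete, the following Python sketch builds a subsequence automaton for a single string using the standard next-occurrence construction. It is an illustration only and is not the online algorithm of [8]; the names `build_dasg` and `accepts` are hypothetical.

```python
def build_dasg(s):
    """Return the transition table of a subsequence automaton for s.

    Assumed encoding (not from the paper): state i, 0 <= i <= len(s), means
    "the last matched symbol is s[i-1]"; state 0 is the start state.
    delta[i][c] is the smallest j > i with s[j-1] == c. A missing key means
    the transition is undefined, so the DFA is partial. Every state accepts.
    """
    n = len(s)
    delta = [None] * (n + 1)
    nxt = {}  # nxt[c] = nearest position j > i holding symbol c
    for i in range(n, -1, -1):
        delta[i] = dict(nxt)
        if i > 0:
            nxt[s[i - 1]] = i
    return delta


def accepts(delta, w):
    """Return True iff w is accepted, i.e. w is a subsequence of s."""
    state = 0
    for c in w:
        if c not in delta[state]:
            return False
        state = delta[state][c]
    return True
```

For example, with `delta = build_dasg("abcab")`, the call `accepts(delta, "acb")` succeeds while `accepts(delta, "cba")` fails, because no `a` follows the last `b`.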

Definition 3 (DASG for multiple texts). Let $S$ be a set of strings $T_1, T_2, \ldots, T_k$. We say that $P$ is a subsequence of $S$ if and only if there exists $i \in [1, k]$ such that $P$ is a subsequence of $T_i$. The DASG of $S$ is a DFA $A$ which accepts the language $L(A) = \{w : \exists\, i \in [1, k] \text{ such that } w \text{ is a subsequence of } T_i\}$ [8].

Definition 4 (Common Subsequence Automaton). Given a set of strings, a Common Subsequence Automaton (CSA) accepts all common subsequences of the given strings. The language accepted by the CSA is a subset of the language accepted by the DASG for the same set of strings.

Definition 5 (Supersequence Automaton). A Supersequence Automaton is a finite automaton which accepts the set of all supersequences of a given string.
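As an illustration of Definition 5, a supersequence automaton for a pattern can be sketched as a chain of states that advances on the next pattern symbol and otherwise stays in place. This is a minimal sketch under an assumed encoding, not necessarily the construction of [11]; `build_supersequence_dfa` and `run` are hypothetical names.

```python
def build_supersequence_dfa(p, alphabet):
    """Return (delta, accepting) for a complete DFA accepting all supersequences of p.

    State i means "the first i symbols of p have been matched". On symbol c
    the automaton moves to i + 1 if c == p[i], and loops on i otherwise.
    Only state len(p) accepts.
    """
    m = len(p)
    delta = []
    for i in range(m + 1):
        row = {}
        for c in alphabet:
            row[c] = i + 1 if i < m and c == p[i] else i
        delta.append(row)
    return delta, {m}


def run(delta, accepting, w, start=0):
    """Return True iff the DFA accepts w (all symbols of w must be in the alphabet)."""
    state = start
    for c in w:
        state = delta[state][c]
    return state in accepting
```

Because this DFA is complete, swapping accepting and non-accepting states yields an automaton for the strings that are not supersequences of `p`.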

3. A fixed-parameter algorithm for DC-LCS

We present a fixed-parameter algorithm for the DC-LCS problem where the parameter $k$ is the size of a solution of DC-LCS. The algorithm consists of five main stages.

Stage 1: In the first stage, we build a CSA which accepts all common subsequences of the two input strings. This is done as follows. The DASG $M_1$ for the two input texts is constructed by using the online algorithm of [8]. Then the common subsequence automaton (CSA) of the two strings is obtained from the DASG $M_1$ by considering the ‘match’ values computed for each state. The value of ‘match’ corresponds to the number of input strings that contain a given string $s$ as a subsequence [8]. So, we prune all states whose ‘match’ value is less than two to get the desired CSA $M'_1$. In the worst case, $R = O(n^2)$ states are generated for two input strings, where $n = \max(|s_1|, |s_2|)$.

Stage 2: In the second stage, we build $|C_s|$ supersequence automata $M_2^i$, $1 \le i \le |C_s|$, one for each constraint pattern $s_c^i \in C_s$, using the algorithm in [11]. Clearly, $M_2^i$ accepts all the strings containing the pattern $s_c^i$ as a subsequence.

Stage 3: In the third stage, we intersect all $|C_s|$ automata $M_2^i$, $1 \le i \le |C_s|$, with $M'_1$ using the algorithm in [11]. The resulting automaton accepts all common subsequences of $s_1$, $s_2$ that include each pattern of $C_s$ as a subsequence. We call the resulting automaton $M_3$.

Stage 4: We consider the character constraint in the fourth stage. As the size of a solution of DC-LCS is $k$, each $\sigma \in \Sigma$ cannot occur more than $k$ times. For each $\sigma \in \Sigma$, we can construct a DFA which accepts all strings having at most $\sigma_{occ}$ occurrences of $\sigma$, where $\sigma_{occ} = \min(k, C_o(\sigma))$. We intersect all these $|\Sigma|$ automata with $M_3$ and denote the result by $M_4$. The automaton $M_4$ accepts the sequences that are accepted by $M_3$ and do not violate the constraint function $C_o : \Sigma \to \mathbb{N}$.

Stage 5: In the final stage, we have to select a DC-LCS of length $k$ from $M_4$. This can be done by a modification of the maximum length automata (MaxLen automata) algorithm [11]. In brief, the algorithm is a modification of the longest path algorithm for DAGs (Directed Acyclic Graphs) that works in $O(E)$ time, where $E$ is the number of edges in the input DAG. It can be easily modified to accept the strings of length $k$ in a DAG. We call the resulting automaton $M_5$.
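The longest-path idea behind Stage 5 can be sketched in Python as a memoized depth-first search over an acyclic DFA. This is an illustration only, not the MaxLen algorithm of [11]: it returns a single longest accepted string rather than an automaton, and building the strings adds a cost beyond $O(E)$. The function name and the dictionary encoding of the DFA are assumptions.

```python
def longest_accepted(delta, start, accepting):
    """Return a longest string accepted by an acyclic DFA, or None if none exists.

    delta maps each state to a dict {symbol: next_state}; missing symbols
    are undefined transitions. Each state is solved once and memoized, so
    every edge is examined once.
    """
    memo = {}

    def best(q):
        if q in memo:
            return memo[q]
        # The empty suffix is available only if q itself accepts.
        result = "" if q in accepting else None
        for c, r in delta.get(q, {}).items():
            tail = best(r)
            if tail is not None and (result is None or len(tail) + 1 > len(result)):
                result = c + tail
        memo[q] = result
        return result

    return best(start)
```

The recursion depth equals the longest path in the DAG, so very long inputs would call for an iterative topological-order version instead.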

3.1. Time complexity

In our algorithm, we have used the online algorithm of [8] for DASG construction and the algorithms in [11] for supersequence and intersection automata construction. For the sake of convenience, we assume that the length of each input string is $n$. The length of each constraint pattern may safely be assumed to be $k$ since our solution size is bounded by $k$. To analyze our algorithm, we need the following result.

Lemma 1. (See [11].) Given DFA $M_1$ and $M_2$ having $R$ and $n$ states respectively, a DFA $M$ accepting the language $L(M) = L(M_1) \cap L(M_2)$ can be constructed in $O(|\Sigma| R n)$ time. $M$ has at most $Rn$ states and at most $|\Sigma| R n$ transitions. Moreover, if $M_1$ (or $M_2$) is acyclic, then $M$ is also acyclic.
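The bound in Lemma 1 comes from the usual product construction, which the following Python sketch illustrates. It explores only the state pairs reachable from the start pair, so it never creates more than $Rn$ states. The function name and the dictionary encoding are assumptions for illustration, not the implementation of [11].

```python
from collections import deque


def intersect(d1, s1, f1, d2, s2, f2):
    """Product automaton of two (possibly partial) DFAs.

    d1 and d2 map states to {symbol: next_state}; s1, s2 are start states
    and f1, f2 are sets of accepting states. Returns (delta, start,
    accepting), where the product states are pairs (p, q).
    """
    start = (s1, s2)
    delta = {start: {}}
    accepting = set()
    queue = deque([start])
    while queue:
        state = queue.popleft()
        p, q = state
        if p in f1 and q in f2:
            accepting.add(state)
        for c, p_next in d1.get(p, {}).items():
            # A transition exists only if both automata define it.
            if c in d2.get(q, {}):
                target = (p_next, d2[q][c])
                if target not in delta:
                    delta[target] = {}
                    queue.append(target)
                delta[state][c] = target
    return delta, start, accepting
```

Applied to the subsequence automata of two strings, the product accepts exactly their common subsequences. This is one way to obtain an automaton for the CSA language, although Stage 1 obtains it differently, by pruning the DASG of [8].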

We also state the following easy lemma.

Lemma 2. Given an integer $N$ denoting the upper bound on occurrences of letter $\sigma \in \Sigma$, a DFA $M$ accepting all the strings containing at most $N$ occurrences of $\sigma$ can be built in $O(N)$ time and $M$ has $O(N)$ states.
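A counting automaton of the kind described in Lemma 2 can be sketched as follows. State $i$ records that $\sigma$ has been read $i$ times, and the transition on $\sigma$ is left undefined once the bound is reached. The names are hypothetical, and filling a row for every alphabet symbol makes this sketch run in $O(N|\Sigma|)$ time, which matches $O(N)$ only for a fixed alphabet.

```python
def build_bounded_occurrence_dfa(sigma, bound, alphabet):
    """Return (delta, start, accepting) for strings with at most `bound` copies of sigma.

    States are 0..bound and all of them accept. Reading sigma in state
    `bound` has no transition, so any string with too many copies of
    sigma is rejected.
    """
    delta = {}
    for i in range(bound + 1):
        row = {}
        for c in alphabet:
            if c != sigma:
                row[c] = i          # other symbols do not change the count
            elif i < bound:
                row[c] = i + 1      # one more occurrence of sigma
        delta[i] = row
    return delta, 0, set(range(bound + 1))


def run(delta, start, accepting, w):
    """Return True iff the partial DFA accepts w."""
    state = start
    for c in w:
        if c not in delta[state]:
            return False
        state = delta[state][c]
    return state in accepting
```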

Construction of the CSA from the DASG $M_1$ requires $O(|\Sigma| (R + 2) + 2n)$ time [8], as the number of states of the DASG is $R = O(n^2)$ in the worst case. Building the supersequence automaton $M_2^i$ for each $i$, $1 \le i \le |C_s|$, needs $O(|\Sigma| k)$ time.


Algorithm 1 DC-LCS(DAG M3 = (Q3, Σ, δ, q03, F3), Co : Σ → N)
    {Global variables}
    path      /* path is a list or an array */
    DcLCS     /* the required DC-LCS */
    len       /* length of the corresponding DcLCS */
    countσ    /* array counting the occurrences of each letter in path */

    findPath(currentNode, path)
        if currentNode ∈ F3 and path.length > len then
            DcLCS ← path
            len ← path.length
        end if
        for all s ∈ Σ do
            if countσ[s] + 1 ≤ Co(s) and δ(currentNode, s) ≠ null then
                concatenate s to path and update countσ
                findPath(δ(currentNode, s), path)
                remove s from path and update countσ
            end if
        end for
        return

Fig. 1. Pseudocode for DC-LCS.

The time complexity to intersect the $|C_s|$ automata with $M'_1$ is $O(|\Sigma| R k^{|C_s|})$. The resulting DFA $M_3$ has $O(R k^{|C_s|})$ states. We can safely assume $C_o(\sigma) = k$ for all $\sigma \in \Sigma$. In Stage 4, the intersection of $M_3$ with $|\Sigma|$ automata, each having $k$ states, requires $O(|\Sigma| R k^{|C_s| + |\Sigma|})$ time. A DC-LCS of length $k$ can be found in $O(E) = O(|\Sigma| R k^{|C_s| + |\Sigma|})$ time, where $E$ is the number of transitions of $M_4$. So, the overall time complexity is $O(|\Sigma| (R + 2) + 2n + |\Sigma| R k^{|C_s| + |\Sigma|})$. Simplifying the asymptotic complexity expression, we can write $O(|\Sigma| n^2 k^{|C_s| + |\Sigma|})$, assuming $R = n^2$.

At this point a brief discussion on our running time and that of the previous algorithms is in order. The time complexity of our fixed-parameter algorithm is certainly better than that of [3] for $|C_s| > 1$, which needs $k^k \left( O(n \log n \, 2^{O(k)}) + O(n^2 |s_c| \, 2^{O(k)} \log |\tilde{\Sigma}|) \right)$ time, where $|\tilde{\Sigma}| \le n$ and $|s_1| = |s_2| = n$. For a single constraint pattern, the simplified time complexity of our algorithm is $O(|\Sigma| n^2 k^{|\Sigma| + 1})$ compared to $O(n \log n \, 2^{O(k)}) + O(n^2 |s_c| \, 2^{O(k)} \log |\tilde{\Sigma}|)$ of [3].

4. An exact algorithm for DC-LCS

In this section, we give an exact algorithm for DC-LCS. First, we follow our algorithm up to Stage 3 to build $M_3$. Afterwards, we apply Algorithm 1 on $M_3$. Note that a sequence corresponding to an accepting state of $M_3$ is a common subsequence of $s_1$, $s_2$ that includes each of the patterns of $C_s$ as a subsequence. Starting from the initial state, we recursively generate all paths of the DAG $M_3$ satisfying the function $C_o : \Sigma \to \mathbb{N}$ and keep track of the longest such path. Algorithm 1 formally presents the idea. Since all nodes in $M_3$ have at most $|\Sigma|$ outgoing arcs, the time complexity of Algorithm 1 is $O(|\Sigma|^d)$, where $d$ is the maximum recursion depth. The overall time complexity of the exact algorithm can be derived by summing up the time complexity to build $M_3$ and that of Algorithm 1 and hence is $O(|\Sigma| (R + 2) + 2n + |\Sigma| R k^{|C_s|} + |\Sigma|^d)$. Simplifying the expression, we get $O(|\Sigma| R k^{|C_s|} + |\Sigma|^d)$.
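The recursive search of Algorithm 1 can be sketched in Python as below. The dictionary encoding of $M_3$, the function name and the parameter `bound` (standing in for $C_o$) are assumptions made for illustration. The sketch admits a symbol whenever its count stays within the bound, which is how Problem 1 states the constraint.

```python
def dc_lcs_exact(delta, start, accepting, bound):
    """Exhaustively search an acyclic DFA for a longest accepted string
    that uses each symbol s at most bound[s] times.

    delta maps states to {symbol: next_state}; bound must contain every
    symbol that labels a transition. Returns None if nothing qualifies.
    """
    best = None
    path = []    # symbols on the current root-to-node path
    count = {}   # occurrences of each symbol in path

    def find_path(node):
        nonlocal best
        if node in accepting and (best is None or len(path) > len(best)):
            best = "".join(path)
        for s, target in delta.get(node, {}).items():
            if count.get(s, 0) + 1 <= bound[s]:
                path.append(s)
                count[s] = count.get(s, 0) + 1
                find_path(target)
                path.pop()          # undo the move before trying the next arc
                count[s] -= 1

    find_path(start)
    return best
```

For instance, on the subsequence automaton of `aab` with every state accepting, the bound `{"a": 1, "b": 1}` yields `ab`, while `{"a": 2, "b": 1}` yields `aab`.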

5. An algorithm for HC-LCS

Solving HC-LCS follows a similar approach. We construct the CSA $M_1$ of $s_1$ and $s_2$ as before. We also build a supersequence automaton $M_2^P$ for $P$. The complement (i.e., making the final states non-accepting and vice versa [7]) of the supersequence automaton is built for $Q$. This is denoted by $M_2^Q$. Then we construct $M_3$ by intersecting all three automata, namely, $M_1$, $M_2^P$ and $M_2^Q$. Now, we need to select the sequences of maximum length from $M_3$. We apply the MaxLen automata algorithm [11] on $M_3$ and get the final automaton $M_4$.
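The whole HC-LCS pipeline can be sketched compactly in Python by exploring the product of the three automata lazily instead of materializing it. A state $(i, j, p, q)$ combines positions in the two subsequence automata with the number of matched symbols of $P$ and of $Q$, and it accepts when $p = |P|$ and $q < |Q|$. This is a simplified illustration under those assumptions, with hypothetical names; it returns one longest string rather than the automaton $M_4$.

```python
from functools import lru_cache


def next_table(s):
    """table[i][c] = smallest j > i with s[j-1] == c (subsequence automaton of s)."""
    table = [None] * (len(s) + 1)
    nxt = {}
    for i in range(len(s), -1, -1):
        table[i] = dict(nxt)
        if i > 0:
            nxt[s[i - 1]] = i
    return table


def hc_lcs(s1, s2, P, Q):
    """Return a longest common subsequence of s1 and s2 that is a
    supersequence of P but not of Q, or None if no such string exists."""
    n1, n2 = next_table(s1), next_table(s2)

    @lru_cache(maxsize=None)
    def best(i, j, p, q):
        # Accepting: all of P matched, and Q not yet fully matched.
        result = "" if p == len(P) and q < len(Q) else None
        for c, i2 in n1[i].items():
            j2 = n2[j].get(c)
            if j2 is None:
                continue  # c cannot extend a common subsequence here
            p2 = p + 1 if p < len(P) and c == P[p] else p
            q2 = q + 1 if q < len(Q) and c == Q[q] else q
            tail = best(i2, j2, p2, q2)
            if tail is not None and (result is None or len(tail) + 1 > len(result)):
                result = c + tail
        return result

    return best(0, 0, 0, 0)
```

For example, `hc_lcs("abcbdab", "bdcaba", "ba", "c")` returns `bdab`, the only common subsequence of length four that avoids `c`.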

5.1. Time complexity

The time complexity to build the CSA is the same as before. Here, we have to compute the intersection of automata twice, so the intersection requires $O(|\Sigma| R |P||Q|)$ time. The MaxLen algorithm works in time linear in the number of transitions of the input DAG [11]. Hence, the overall time complexity is $O(|\Sigma| (R + 2) + 2n + |\Sigma| R |P||Q|)$. For a fixed-size alphabet, the time complexity becomes $O((R + 2) + 2n + R|P||Q|)$. Now, recall that, assuming a fixed-size alphabet, two algorithms for HC-LCS with running times $O(n^2|P||Q|)$ and $O(|P||Q| r \log\log n + n \log n)$ were presented in [5]. Clearly, the running time of our algorithm is comparable to that of [5]. In fact, our algorithm, in some sense, possesses the best features of the two algorithms of [5]: since $R = O(n^2)$ in the worst case, our algorithm matches the running time of the former; on the other hand, it may show far better behaviour since $R$ is an input-sensitive parameter, like $r$ in the running time of the latter. Finally, note that our algorithm is capable of handling an arbitrary number of constraint patterns as well. Here, we just need to construct supersequence automata and complements of supersequence automata for the patterns in the second stage and intersect them with $M_1$ in the third stage. Then the MaxLen algorithm is applied to the result.

References

[1] Said Sadique Adi, Marília D.V. Braga, Cristina G. Fernandes, Carlos Eduardo Ferreira, Fábio Viduani Martinez, Marie-France Sagot, Marco A. Stefanes, Christian Tjandraatmadja, Yoshiko Wakabayashi, Repetition-free longest common subsequence, Discrete Applied Mathematics 158 (12) (2010) 1315–1324.

[2] Paola Bonizzoni, Gianluca Della Vedova, Riccardo Dondi, Guillaume Fertin, Raffaella Rizzi, Stéphane Vialette, Exemplar longest common subsequence, IEEE/ACM Transactions on Computational Biology and Bioinformatics 4 (4) (2007) 535–543.

[3] Paola Bonizzoni, Gianluca Della Vedova, Riccardo Dondi, Yuri Pirola, Variants of constrained longest common subsequence, Information Processing Letters 110 (20) (2010) 877–881.

[4] Y.C. Chen, K.M. Chao, On the generalized constrained longest common subsequence problem, Journal of Combinatorial Optimization 21 (3) (2011) 383–392.

[5] Y.C. Chen, Algorithms for the hybrid constrained longest common subsequence problem, in: The 27th Workshop on Combinatorial Mathematics and Computation Theory, 2010.

[6] E. Farhana, J. Ferdous, T. Moosa, M.S. Rahman, Finite automata based algorithm for the generalized constrained longest common subsequence problems, in: String Processing and Information Retrieval (SPIRE), in: LNCS, vol. 6393, 2010, pp. 250–257.


[7] John E. Hopcroft, Rajeev Motwani, Jeffrey D. Ullman, Introduction to Automata Theory, Languages, and Computation, 2nd edition, Pearson Education, 2001.

[8] H. Hoshino, A. Shinohara, M. Takeda, S. Arikawa, Online construction of subsequence automata for multiple texts, in: String Processing and Information Retrieval (SPIRE), 2000.

[9] C. Iliopoulos, M.S. Rahman, New efficient algorithms for the lcs and constrained lcs problem, Information Processing Letters 106 (2008) 13–18.

[10] Costas S. Iliopoulos, M. Sohel Rahman, Wojciech Rytter, Algorithms for two versions of lcs problem for indeterminate strings, Journal of Combinatorial Mathematics and Combinatorial Computing 71 (2009) 155–172.

[11] Costas S. Iliopoulos, M. Sohel Rahman, Michal Vorácek, Ladislav Vagner, Finite automata based algorithms on subsequences and supersequences of degenerate strings, Journal of Discrete Algorithms 8 (2) (2010) 117–130.

[12] Yin-Te Tsai, The constrained longest common subsequence problem, Information Processing Letters 88 (4) (2003) 173–176.

[13] R.A. Wagner, M.J. Fischer, The string to string correction problem, Journal of the ACM 21 (1) (1974) 168–173.