

Efficient Algorithms to Compute Compressed Longest Common Substrings and Compressed Palindromes

Wataru Matsubara a, Shunsuke Inenaga b, Akira Ishino a,1, Ayumi Shinohara a, Tomoyuki Nakamura a, Kazuo Hashimoto a

a Graduate School of Information Sciences, Tohoku University, Japan
b Graduate School of Information Science and Electrical Engineering, Kyushu University, Japan

1 Presently at Google Japan Inc.

Preprint submitted to Elsevier Science 12 August 2009

Abstract

This paper studies two problems on compressed strings described in terms of straight line programs (SLPs). One is to compute the length of the longest common substring of two given SLP-compressed strings, and the other is to compute all palindromes of a given SLP-compressed string. In order to solve these problems efficiently (in polynomial time w.r.t. the compressed size) decompression is never feasible, since the decompressed size can be exponentially large. We develop combinatorial algorithms that solve these problems in $O(n^4 \log n)$ time with $O(n^3)$ space, and in $O(n^4)$ time with $O(n^2)$ space, respectively, where $n$ is the size of the input SLP-compressed strings.

1 Introduction

The importance of algorithms for compressed texts has recently been growing due to the massive increase of data that are treated in compressed form. Of the various text compression schemes introduced so far, the straight line program (SLP) is one of the most powerful and general. An SLP is a context-free grammar whose rules are of either of the forms $X \to YZ$ or $X \to a$, where $a$ is a constant. SLPs allow exponential compression, i.e., the original (uncompressed) string length $N$ can be exponentially large w.r.t. the corresponding SLP size $n$. In addition, the encodings produced by most grammar- and dictionary-based text compression methods, such as the LZ-family [13,14], run-length encoding, multi-level pattern matching code [5], Sequitur [10] and so on, can quickly be transformed into SLPs [2,12,3]. Therefore, it is of great interest to analyze what kind of problems on SLP-compressed strings can be solved in polynomial time w.r.t. $n$. Moreover, for those that are polynomial-time solvable, it is of great importance to design efficient algorithms. In so doing, one has to keep in mind that decompression is never feasible, since it can require exponential time and space w.r.t. $n$.

The first polynomial time algorithm for SLP-compressed strings was given by Plandowski [11], which tests the equality of two SLP-compressed strings in $O(n^4)$ time. Later on Karpinski et al. [4] presented an $O(n^4 \log n)$-time algorithm for the substring pattern matching problem for two SLP-compressed strings. Then it was improved to $O(n^4)$ time by Miyazaki et al. [9] and recently to $O(n^3)$ time by Lifshits [6]. The problem of computing the minimum period of a given SLP-compressed string was shown to be solvable in $O(n^4 \log n)$ time [4], and lately in $O(n^3 \log N)$ time [6]. Gąsieniec et al. [2] claimed that all squares of a given SLP-compressed string can be computed in $O(n^6 \log^5 N)$ time.

On the other hand, there are some hardness results on SLP-compressed string processing. Lifshits and Lohrey [7] showed that the subsequence pattern matching problem for SLP-compressed strings is NP-hard, and that computing the length of the longest common subsequence of two SLP-compressed strings is also NP-hard. Lifshits [6] showed that computing the Hamming distance between two SLP-compressed strings is #P-complete.

In this paper we tackle the following two problems: one is to compute the length of the longest common substring of two SLP-compressed strings, and the other is to find all maximal palindromes of an SLP-compressed string. The first problem was listed as an open problem in [6]. This paper closes the problem by giving an algorithm that runs in $O(n^4 \log n)$ time with $O(n^3)$ space. For the second problem of computing all maximal palindromes, we give an algorithm that runs in $O(n^4)$ time with $O(n^2)$ space.

Comparison to previous work. A composition system is a generalization of an SLP which also allows “truncations” in the production rules. Namely, a rule of a composition system is of one of the following forms: $X \to Y[i]\,Z[j]$, $X \to YZ$, or $X \to a$, where $Y[i]$ and $Z[j]$ denote the prefix of length $i$ of $Y$ and the suffix of length $j$ of $Z$, respectively. Gąsieniec et al. [2] presented an algorithm that computes all maximal palindromes from a given composition system in $O(n \log^2 N \times \mathit{Eq}(n))$ time, where $\mathit{Eq}(n)$ denotes the time needed for the equality test of composition systems. Since $\mathit{Eq}(n) = O(n^4 \log^2 N)$ in [2], the overall time cost is $O(n^5 \log^4 N)$.

Limited to SLPs, $\mathit{Eq}(n) = O(n^3)$ due to the recent work by Lifshits [6]. Still, computing all maximal palindromes takes $O(n^4 \log^2 N)$ time in total, and therefore our solution with $O(n^4)$ time is faster than the previously known ones (recall that $N = O(2^n)$). The space requirement of the algorithm by Gąsieniec et al. [2] is unclear. However, since the equality test algorithm of [6] takes $O(n^2)$ space, the above-mentioned $O(n^4 \log^2 N)$-time solution takes at least as much space as ours.

A preliminary version of this work appeared in [8].

2 Preliminaries

2.1 Notations on Strings

For any set $U$ of pairs of integers, we denote $U \oplus k = \{(i+k, j+k) \mid (i, j) \in U\}$. We denote by $\langle a, d, t \rangle$ the arithmetic progression with the minimal element $a$, the common difference $d$ and the number of elements $t$, that is, $\langle a, d, t \rangle = \{a + (i-1)d \mid 1 \le i \le t\}$. When $t = 0$, let $\langle a, d, t \rangle = \emptyset$.
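The two notations above can be sketched in a few lines of Python. The class name `Progression` and the helper `shift` are hypothetical names chosen for this illustration only, and the sketch assumes a non-negative common difference. It shows that a progression $\langle a, d, t \rangle$ is stored in constant space regardless of $t$ and that membership can be tested without enumerating its elements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Progression:
    """<a, d, t>: t integers starting at a with common difference d (hypothetical helper)."""
    a: int
    d: int
    t: int

    def elements(self):
        # {a + (i - 1) d | 1 <= i <= t}; the empty list when t == 0
        return [self.a + i * self.d for i in range(self.t)]

    def contains(self, x):
        # Constant-time membership test, with no enumeration of the elements.
        if self.t == 0:
            return False
        if self.d == 0:
            return x == self.a
        steps, remainder = divmod(x - self.a, self.d)
        return remainder == 0 and 0 <= steps < self.t


def shift(pairs, k):
    """U (+) k: add k to both components of every pair in U."""
    return {(i + k, j + k) for (i, j) in pairs}
```

For instance, `Progression(1, 5, 3)` stands for the set {1, 6, 11}, and `shift({(1, 3)}, 3)` yields `{(4, 6)}`.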

Let $\Sigma$ be a finite alphabet. An element of $\Sigma^*$ is called a string. The length of a string $T$ is denoted by $|T|$. The empty string $\varepsilon$ is a string of length 0, namely, $|\varepsilon| = 0$. For a string $T = XYZ$, $X$, $Y$ and $Z$ are called a prefix, substring, and suffix of $T$, respectively. The $i$-th character of a string $T$ is denoted by $T[i]$ for $1 \le i \le |T|$, and the substring of a string $T$ that begins at position $i$ and ends at position $j$ is denoted by $T[i : j]$ for $1 \le i \le j \le |T|$. For any string $T$, let $T^R$ denote the reversed string of $T$, namely, $T^R = T[|T|] \cdots T[2]T[1]$.

For any two strings $T, S$, let $\mathit{LCPref}(T, S)$, $\mathit{LCStr}(T, S)$, and $\mathit{LCSuf}(T, S)$ denote the length of the longest common prefix, substring and suffix of $T$ and $S$, respectively.
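To make the three quantities concrete, here is a minimal sketch that computes them on plain (uncompressed) Python strings. The function names are hypothetical, and the quadratic dynamic program for `lcstr` is the textbook method, not the compressed algorithm developed in this paper.

```python
def lcpref(t, s):
    """Length of the longest common prefix of t and s."""
    n = 0
    for x, y in zip(t, s):
        if x != y:
            break
        n += 1
    return n


def lcsuf(t, s):
    """Length of the longest common suffix: the common prefix of the reversals."""
    return lcpref(t[::-1], s[::-1])


def lcstr(t, s):
    """Length of the longest common substring, by an O(|t||s|) dynamic program."""
    best = 0
    prev = [0] * (len(s) + 1)
    for i in range(1, len(t) + 1):
        cur = [0] * (len(s) + 1)
        for j in range(1, len(s) + 1):
            if t[i - 1] == s[j - 1]:
                # cur[j] = length of the longest common suffix of t[:i] and s[:j]
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best
```

Note that `lcsuf` is defined through reversal, which mirrors how the later sections reduce suffix comparisons to prefix comparisons on reversed strings.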

A period of a string $T$ is an integer $p$ ($1 \le p \le |T|$) such that $T[i] = T[i + p]$ for any $i = 1, 2, \ldots, |T| - p$.

A non-empty string $T$ such that $T = T^R$ is said to be a palindrome. When $|T|$ is even, $T$ is said to be an even palindrome, that is, $T = SS^R$ for some $S \in \Sigma^+$. Similarly, when $|T|$ is odd, $T$ is said to be an odd palindrome, that is, $T = ScS^R$ for some $S \in \Sigma^*$ and $c \in \Sigma$. For any string $T$ and its substring $T[i : j]$ such that $T[i : j] = T[i : j]^R$, $T[i : j]$ is said to be the maximal palindrome w.r.t. the center $\lfloor \frac{i+j}{2} \rfloor$, if either $T[i-1] \ne T[j+1]$, $i = 1$, or $j = |T|$. In particular, $T[1 : j]$ is said to be a prefix palindrome of $T$, and $T[i : |T|]$ a suffix palindrome of $T$.

Fig. 1. The derivation tree of SLP $\mathcal{T}$ of Example 1 that generates the string $T = \mathtt{aababaababaab}$.

2.2 Text Compression by Straight Line Programs

In this paper, we treat strings described in terms of straight line programs (SLPs). A straight line program $\mathcal{T}$ is a sequence of assignments such that

$$X_1 = expr_1,\ X_2 = expr_2,\ \ldots,\ X_n = expr_n,$$

where each $X_i$ is a variable and each $expr_i$ is an expression in either of the following forms:

• $expr_i = a$ ($a \in \Sigma$), or
• $expr_i = X_\ell X_r$ ($\ell, r < i$).

Denote by $T$ the string derived from the last variable $X_n$ of the program $\mathcal{T}$. The size of the program $\mathcal{T}$ is the number $n$ of assignments in $\mathcal{T}$. We remark that $|T| = O(2^n)$.

Example 1 SLP $\mathcal{T} = \{X_i\}_{i=1}^{7}$ with $X_1 = \mathtt{a}$, $X_2 = \mathtt{b}$, $X_3 = X_1X_2$, $X_4 = X_1X_3$, $X_5 = X_3X_4$, $X_6 = X_4X_5$, and $X_7 = X_6X_5$ generates string $T = \mathtt{aababaababaab}$. The derivation tree of SLP $\mathcal{T}$ is shown in Fig. 1.

When it is not confusing, we identify a variable $X_i$ with the string derived from $X_i$. Then, $|X_i|$ denotes the length of the string derived from $X_i$.

For any variable $X_i$ of $\mathcal{T}$ with $1 \le i \le n$, we define $X_i^R$ as follows:

$$X_i^R = \begin{cases} a & \text{if } X_i = a\ (a \in \Sigma), \\ X_r^R X_\ell^R & \text{if } X_i = X_\ell X_r\ (\ell, r < i). \end{cases}$$

Let $\mathcal{T}^R$ be the SLP consisting of variables $X_i^R$ for $1 \le i \le n$. The following lemma is important for our algorithms which will be given later on.

Lemma 1 SLP $\mathcal{T}^R$ derives string $T^R$.

Proof. By induction on the variables $X_i^R$. Let $\Sigma_T$ be the set of characters appearing in $T$. For any $1 \le i \le |\Sigma_T|$, we have $X_i = a$ for some $a \in \Sigma_T$, thus $X_i^R = a$ and $a = a^R$. Let $T_i$ denote the string derived from $X_i$. For the induction hypothesis, assume that $X_j^R$ derives $T_j^R$ for any $1 \le j \le i$. Now consider variable $X_{i+1} = X_\ell X_r$. Note $T_{i+1} = T_\ell T_r$, which implies $T_{i+1}^R = T_r^R T_\ell^R$. By definition, we have $X_{i+1}^R = X_r^R X_\ell^R$. Since $\ell, r < i + 1$, by the induction hypothesis $X_{i+1}^R$ derives $T_r^R T_\ell^R = T_{i+1}^R$. Thus, $\mathcal{T}^R = X_n^R$ derives $T_n^R = T^R$. □

Example 2 For SLP $\mathcal{T} = \{X_i\}_{i=1}^{7}$ of Example 1, its reversed SLP $\mathcal{T}^R = \{X_i^R\}_{i=1}^{7}$ consists of $X_1^R = \mathtt{a}$, $X_2^R = \mathtt{b}$, $X_3^R = X_2^R X_1^R$, $X_4^R = X_3^R X_1^R$, $X_5^R = X_4^R X_3^R$, $X_6^R = X_5^R X_4^R$, and $X_7^R = X_5^R X_6^R$. SLP $\mathcal{T}^R$ generates the reversed string $T^R = (\mathtt{aababaababaab})^R = \mathtt{baababaababaa}$.

Note that SLP $\mathcal{T}^R$ can be easily computed from SLP $\mathcal{T}$ in $O(n)$ time.
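The $O(n)$-time construction of $\mathcal{T}^R$ amounts to swapping the two children of every non-constant rule. The sketch below assumes the same hypothetical dictionary encoding of an SLP as before (index to character, or index to a pair of earlier indices) and checks Lemma 1 on Example 1 by decompression, which is done here for illustration only.

```python
EXAMPLE1 = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (3, 4), 6: (4, 5), 7: (6, 5)}


def reverse_slp(rules):
    """Build T^R: X_i^R = a for constants, X_i^R = X_r^R X_l^R otherwise."""
    return {
        i: rule if isinstance(rule, str) else (rule[1], rule[0])
        for i, rule in rules.items()
    }


def expand(rules, i):
    """Decompress variable X_i (for checking only)."""
    rule = rules[i]
    if isinstance(rule, str):
        return rule
    return expand(rules, rule[0]) + expand(rules, rule[1])
```

Because each rule is touched once and variable indices are preserved, $X_i^R$ in the reversed program is simply variable $i$ of `reverse_slp(rules)`.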

3 Computing Longest Common Substring of Two SLP Compressed Strings

Let $\mathcal{T}$ and $\mathcal{S}$ be the SLPs of sizes $n$ and $m$, which describe strings $T$ and $S$, respectively. Without loss of generality we assume that $n \ge m$.

In this section we tackle the following problem:

Problem 1 Given two SLPs $\mathcal{T}$ and $\mathcal{S}$, compute $\mathit{LCStr}(T, S)$.

In what follows we present an algorithm that solves Problem 1 in $O(n^4 \log n)$ time and $O(n^3)$ space. Let $X_i$ and $Y_j$ denote any variable of $\mathcal{T}$ and $\mathcal{S}$ for $1 \le i \le n$ and $1 \le j \le m$.

3.1 Overlaps between Two Strings

For any two strings $X$ and $Y$, we define the set $\mathit{OL}(X, Y)$ as follows:

$$\mathit{OL}(X, Y) = \{k > 0 \mid X[|X| - k + 1 : |X|] = Y[1 : k]\}.$$

Namely, $\mathit{OL}(X, Y)$ is the set of lengths of overlaps of suffixes of $X$ and prefixes of $Y$.

Example 3 For strings $X = \mathtt{ababbab}$ and $Y = \mathtt{babbabb}$, $\mathit{OL}(X, Y) = \{1, 3, 6\}$ since $\mathtt{b}$, $\mathtt{bab}$ and $\mathtt{babbab}$ are both suffixes of $X$ and prefixes of $Y$.
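The definition of $\mathit{OL}$ translates directly into a brute-force check on uncompressed strings. The function name `overlaps` is hypothetical, and this quadratic-time sketch only serves to reproduce Example 3; it is unrelated to the compressed computation of [4].

```python
def overlaps(x, y):
    """OL(x, y): every k > 0 such that the length-k suffix of x equals the length-k prefix of y."""
    return {
        k
        for k in range(1, min(len(x), len(y)) + 1)
        if x[len(x) - k:] == y[:k]
    }
```

On the strings of Example 3 this returns `{1, 3, 6}`. Note that $\mathit{OL}$ is not symmetric in its arguments: `overlaps(y, x)` compares suffixes of $Y$ with prefixes of $X$.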

Karpinski et al. [4] gave the following results for computation of $\mathit{OL}$ for strings described by SLPs.

Lemma 2 ([4]) For any variables $X_i$ and $X_j$ of an SLP $\mathcal{T}$, $\mathit{OL}(X_i, X_j)$ can be represented by $O(n)$ arithmetic progressions.

Theorem 1 ([4]) For any SLP $\mathcal{T}$, $\mathit{OL}(X_i, X_j)$ can be computed in a total of $O(n^4 \log n)$ time and $O(n^3)$ space for all $1 \le i \le n$ and $1 \le j \le n$.

In order to solve Problem 1 it is useful to compute $\mathit{OL}(X_i, Y_j)$ and $\mathit{OL}(Y_j, X_i)$ for each $1 \le i \le n$ and $1 \le j \le m$. In so doing, we produce a new variable $V = X_n Y_m$, that is, $V$ is a concatenation of SLPs $\mathcal{T}$ and $\mathcal{S}$. Then we compute $\mathit{OL}$ for each pair of variables in the new SLP of size $n + m$. On the assumption that $n \ge m$, it takes $O(n^4 \log n)$ time and $O(n^3)$ space in total.

3.2 The FM function

For any two SLP variables $X_i$, $Y_j$ and any integer $k$ with $1 \le k \le |X_i|$, we define the function $\mathit{FM}(X_i, Y_j, k)$, which returns the number of characters that match before the first mismatch when we compare $Y_j$ with $X_i$ starting at position $k$ of $X_i$. Namely, $\mathit{FM}(X_i, Y_j, k)$ equals the length of the longest common prefix of $X_i[k : |X_i|]$ and $Y_j$:

$$\mathit{FM}(X_i, Y_j, k) = \mathit{LCPref}(X_i[k : |X_i|], Y_j).$$

Example 4 Consider variables $X_6 = \mathtt{aababaab}$ and $X_5 = \mathtt{abaab}$ of Example 1. Then $\mathit{FM}(X_6, X_5, 2) = 3$ as $\mathit{LCPref}(X_6[2 : |X_6|], X_5) = \mathit{LCPref}(\mathtt{ababaab}, \mathtt{abaab}) = 3$.
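On uncompressed strings, $\mathit{FM}$ is a simple character-by-character scan. The following sketch (with a hypothetical function name and 1-based position $k$, as in the text) reproduces Example 4; the point of Lemma 3 is that the same value can be obtained on SLPs without performing such a scan over the decompressed text.

```python
def fm(x, y, k):
    """FM(x, y, k) = LCPref(x[k:|x|], y), where k is a 1-based position of x."""
    matched = 0
    while (k - 1 + matched < len(x)
           and matched < len(y)
           and x[k - 1 + matched] == y[matched]):
        matched += 1
    return matched
```

For example, `fm("aababaab", "abaab", 2)` stops at the fourth compared character and returns 3.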

Lemma 3 ([4]) For any variables $X_i$, $Y_j$ and integer $k$, $\mathit{FM}(X_i, Y_j, k)$ can be computed in $O(n \log n)$ time, provided that $\mathit{OL}(X_{i'}, Y_{j'})$ is already computed for any $1 \le i' \le i$ and $1 \le j' \le j$.

3.3 Efficient Computation of Longest Common Substrings

The main idea of our algorithm for computing $\mathit{LCStr}(T, S)$ is based on the following observation.

Observation 1 For any substring $Z$ of string $T$, there always exists a variable $X_i = X_{\ell_i} X_{r_i}$ of SLP $\mathcal{T}$ such that:

• $Z$ is a substring of $X_i$, and
• $Z$ touches or covers the boundary between $X_{\ell_i}$ and $X_{r_i}$.

Example 5 Consider SLP $\mathcal{T}$ of Example 1 generating $T = \mathtt{aababaababaab}$. Substring $\mathtt{baababaab}$ of $T$ is a substring of $X_7 = X_6X_5$ and covers the boundary between $X_6$ and $X_5$. Substring $\mathtt{baab}$ of $T$ is a substring of $X_5 = X_3X_4$ and covers the boundary between $X_3$ and $X_4$. Substring $T[7] = \mathtt{a}$ of $T$ is a substring of $X_3 = X_1X_2$ and touches the boundary between $X_1$ and $X_2$. (See also Fig. 1.)

It directly follows from Observation 1 that any common substring of strings $T, S$ touches or covers both of the boundaries in $X_i$ and $Y_j$ for some $1 \le i \le n$ and $1 \le j \le m$.

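For any SLP variables $X_i = X_{\ell_i} X_{r_i}$, $Y_j = Y_{\ell_j} Y_{r_j}$ and any non-negative integer $k$, let $h_1$ and $h_2$ be the maximum values such that

$$X_i[|X_{\ell_i}| - k - h_1 + 1 : |X_{\ell_i}| + h_2] = Y_j[|Y_{\ell_j}| - h_1 + 1 : |Y_{\ell_j}| + k + h_2].$$

That is,

$$h_1 = \mathit{LCSuf}(X_{\ell_i}[1 : |X_{\ell_i}| - k],\ Y_{\ell_j}) \quad \text{and} \quad h_2 = \mathit{LCPref}(X_{r_i},\ Y_{r_j}[k + 1 : |Y_{r_j}|]).$$

Then let

$$\mathit{Ext}_{X_i,Y_j}(k) = \begin{cases} k + h_1 + h_2 & \text{if } X_i = X_{\ell_i} X_{r_i} \text{ and } Y_j = Y_{\ell_j} Y_{r_j}, \\ k & \text{if } X_i \text{ or } Y_j \text{ is constant.} \end{cases}$$

For a set $S$ of integers, we define $\mathit{Ext}_{X_i,Y_j}(S) = \{\mathit{Ext}_{X_i,Y_j}(k) \mid k \in S\}$. $\mathit{Ext}_{Y_j,X_i}(k)$ and $\mathit{Ext}_{Y_j,X_i}(S)$ are defined similarly.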

The next observation follows from the above arguments (see also Fig. 2):

Observation 2 For any strings $T$ and $S$, $\mathit{LCStr}(T, S)$ equals the maximum element of the set

$$\bigcup_{1 \le i \le n,\ 1 \le j \le m} \left( \mathit{Ext}_{X_i,Y_j}(\mathit{OL}(X_{\ell_i}, Y_{r_j})) \cup \mathit{Ext}_{Y_j,X_i}(\mathit{OL}(Y_{\ell_j}, X_{r_i})) \cup \mathit{Ext}_{X_i,Y_j}(0) \right).$$

Fig. 2. Illustration of Observation 2. Each candidate for $\mathit{LCStr}(T, S)$ can be computed by extending either some overlap between $X_{\ell_i}$ and $Y_{r_j}$ or some overlap between $Y_{\ell_j}$ and $X_{r_i}$, or by concatenating $\mathit{LCSuf}(X_{\ell_i}, Y_{\ell_j})$ and $\mathit{LCPref}(X_{r_i}, Y_{r_j})$.
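Observation 2 can be checked by brute force on small inputs. The sketch below is not the algorithm of this paper: it decompresses every variable (exponential in general) and evaluates $\mathit{OL}$ and $\mathit{Ext}$ naively, purely to illustrate which candidates the observation enumerates. The dictionary encoding of SLPs, all function names, and the second SLP `S_RULES` (deriving `bbabaabbb`) are assumptions made for this example; the sketch also assumes that every variable is used in the derivation of the final string and that both strings have length at least 2.

```python
T_RULES = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (3, 4), 6: (4, 5), 7: (6, 5)}
# Hypothetical second SLP: Y3 = bb, Y4 = ab, Y5 = aba, Y6 = abaab, Y7 = bbabaab, Y8 = bbabaabbb.
S_RULES = {1: "a", 2: "b", 3: (2, 2), 4: (1, 2), 5: (4, 1), 6: (5, 4), 7: (3, 6), 8: (7, 3)}


def expand_all(rules):
    """Decompress every variable (for illustration only)."""
    out = {}
    for i in sorted(rules):
        r = rules[i]
        out[i] = r if isinstance(r, str) else out[r[0]] + out[r[1]]
    return out


def lcpref(t, s):
    n = 0
    while n < len(t) and n < len(s) and t[n] == s[n]:
        n += 1
    return n


def lcsuf(t, s):
    return lcpref(t[::-1], s[::-1])


def overlaps(x, y):
    return {k for k in range(1, min(len(x), len(y)) + 1) if x[len(x) - k:] == y[:k]}


def ext(xl, xr, yl, yr, k):
    """k + h1 + h2 for an overlap of length k between a suffix of xl and a prefix of yr."""
    h1 = lcsuf(xl[:len(xl) - k], yl)
    h2 = lcpref(xr, yr[k:])
    return k + h1 + h2


def lcstr_by_observation2(rules_t, rules_s):
    xs, ys = expand_all(rules_t), expand_all(rules_s)
    best = 0
    for ri in rules_t.values():
        for rj in rules_s.values():
            if isinstance(ri, str) or isinstance(rj, str):
                continue  # constant variables contribute only Ext(0) = 0
            xl, xr, yl, yr = xs[ri[0]], xs[ri[1]], ys[rj[0]], ys[rj[1]]
            candidates = [ext(xl, xr, yl, yr, 0)]
            candidates += [ext(xl, xr, yl, yr, k) for k in overlaps(xl, yr)]
            candidates += [ext(yl, yr, xl, xr, k) for k in overlaps(yl, xr)]
            best = max(best, max(candidates))
    return best


def lcstr_naive(t, s):
    """Reference answer: the longest substring of t that also occurs in s."""
    return max((b - a for a in range(len(t)) for b in range(a + 1, len(t) + 1)
                if t[a:b] in s), default=0)
```

For these two SLPs both methods agree on the value 6, attained by `babaab`, whose occurrences cover the boundary of $X_6 = X_4X_5$ and of the hypothetical $Y_7 = Y_3Y_6$ at the same relative position, so the candidate $\mathit{Ext}_{X_6,Y_7}(0)$ realizes it.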

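Based on Observation 2, our strategy for computing $\mathit{LCStr}(T, S)$ is to compute $\max(\mathit{Ext}_{X_i,Y_j}(\mathit{OL}(X_{\ell_i}, Y_{r_j})))$, $\max(\mathit{Ext}_{Y_j,X_i}(\mathit{OL}(Y_{\ell_j}, X_{r_i})))$, and $\mathit{Ext}_{X_i,Y_j}(0)$ for each pair of $X_i$ and $Y_j$. Notice that $\mathit{Ext}_{X_i,Y_j}(0)$ can be computed in $O(n \log n)$ time due to Lemma 3, provided that the reversed SLP $\mathcal{T}^R$ and $\mathit{Occ}^{\triangle}(X_i^R, X_j^R)$ are already computed for each pair of variables $X_i^R$ and $X_j^R$ in $\mathcal{T}^R$. Lemma 4 below shows how to compute $\max(\mathit{Ext}_{X_i,Y_j}(\mathit{OL}(X_{\ell_i}, Y_{r_j})))$ and $\max(\mathit{Ext}_{Y_j,X_i}(\mathit{OL}(Y_{\ell_j}, X_{r_i})))$ using $\mathit{FM}$.

Lemma 4 For any variables $X_i = X_{\ell_i} X_{r_i}$ and $Y_j = Y_{\ell_j} Y_{r_j}$, we can compute $\max(\mathit{Ext}_{X_i,Y_j}(\mathit{OL}(X_{\ell_i}, Y_{r_j})))$ and $\max(\mathit{Ext}_{Y_j,X_i}(\mathit{OL}(Y_{\ell_j}, X_{r_i})))$ in $O(n^2 \log n)$ time.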

Proof. Here we concentrate on computing $\max(\mathit{Ext}_{X_i,Y_j}(\mathit{OL}(X_{\ell_i}, Y_{r_j})))$, as the case of $\max(\mathit{Ext}_{Y_j,X_i}(\mathit{OL}(Y_{\ell_j}, X_{r_i})))$ is just symmetric. Let $\langle a, d, t \rangle$ be any of the $O(n)$ arithmetic progressions of $\mathit{OL}(X_{\ell_i}, Y_{r_j})$.

Assume that $t > 1$ and $a < d$. The cases where $t = 1$ or $a = d$ are easier to show. Let $u = Y_{r_j}[1 : a]$ and $v = Y_{r_j}[a + 1 : d]$. For any string $w$, let $w^*$ denote an infinite repetition of $w$, that is, $w^* = www \cdots$.

Fig. 3. Illustration for the proof of Lemma 4. The dark rectangles represent the overlaps between $X_{\ell_i}$ and $Y_{r_j}$. Case 6 is the special case where cases 4 and 5 happen at the same time and case 3 does not exist.

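Let $e_1, e_2$ be the largest integers such that $X_i[|X_{\ell_i}| - e_2 + 1 : |X_{\ell_i}| + e_1]$ is the longest substring of $X_i$ that contains $X_i[|X_{\ell_i}| - d + 1 : |X_{\ell_i}|]$ and has a period $d$. Similarly, let $e_3, e_4$ be the largest integers such that $Y_j[|Y_{\ell_j}| - e_4 + 1 : |Y_{\ell_j}| + e_3]$ is the longest substring of $Y_j$ that contains $Y_j[|Y_{\ell_j}| + 1 : |Y_{\ell_j}| + d]$ and has a period $d$. More formally,

$$e_1 = \mathit{LCPref}(X_{r_i}, (vu)^*) = \begin{cases} \mathit{FM}(Y_{r_j}, X_{r_i}, a+1) & \text{if } \mathit{FM}(Y_{r_j}, X_{r_i}, a+1) < d, \\ \mathit{FM}(X_{r_i}, X_{r_i}, d+1) + d & \text{otherwise,} \end{cases}$$

$$e_2 = \mathit{LCSuf}(X_{\ell_i}, (vu)^*) = \mathit{FM}(X_{\ell_i}^R, X_{\ell_i}^R, d+1) + d,$$

$$e_3 = \mathit{LCPref}(Y_{r_j}, (uv)^*) = \mathit{FM}(Y_{r_j}, Y_{r_j}, d+1) + d,$$

$$e_4 = \mathit{LCSuf}(Y_{\ell_j}, (uv)^*) = \begin{cases} \mathit{FM}(X_{\ell_i}^R, Y_{\ell_j}^R, a+1) & \text{if } \mathit{FM}(X_{\ell_i}^R, Y_{\ell_j}^R, a+1) < d, \\ \mathit{FM}(Y_{\ell_j}^R, Y_{\ell_j}^R, d+1) + d & \text{otherwise.} \end{cases}$$

(See also Fig. 3.) As above, we can compute $e_1, e_2, e_3, e_4$ by at most 6 calls of $\mathit{FM}$.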

Let $k \in \langle a, d, t \rangle$. We categorize $\mathit{Ext}_{X_i,Y_j}(k)$ depending on the value of $k$, as follows.

case 1: When $k < \min\{e_3 - e_1, e_2 - e_4\}$. If $k - d \in \langle a, d, t \rangle$, it is not difficult to see $\mathit{Ext}_{X_i,Y_j}(k) = \mathit{Ext}_{X_i,Y_j}(k - d) + d$. Therefore, we have

$$A = \max\{\mathit{Ext}_{X_i,Y_j}(k) \mid k < \min\{e_3 - e_1, e_2 - e_4\}\} = \mathit{Ext}_{X_i,Y_j}(k'),$$

where $k' = \max\{k \mid k < \min\{e_3 - e_1, e_2 - e_4\}\}$.

case 2: When $k > \max\{e_3 - e_1, e_2 - e_4\}$. If $k + d \in \langle a, d, t \rangle$, it is not difficult to see $\mathit{Ext}_{X_i,Y_j}(k) = \mathit{Ext}_{X_i,Y_j}(k + d) + d$. Therefore, we have

$$B = \max\{\mathit{Ext}_{X_i,Y_j}(k) \mid k > \max\{e_3 - e_1, e_2 - e_4\}\} = \mathit{Ext}_{X_i,Y_j}(k''),$$

where $k'' = \min\{k \mid k > \max\{e_3 - e_1, e_2 - e_4\}\}$.

case 3: When $\min\{e_3 - e_1, e_2 - e_4\} < k < \max\{e_3 - e_1, e_2 - e_4\}$. In this case we have $\mathit{Ext}_{X_i,Y_j}(k) = \min\{e_1 + e_2, e_3 + e_4\}$ for any $k$ with $\min\{e_3 - e_1, e_2 - e_4\} < k < \max\{e_3 - e_1, e_2 - e_4\}$. Thus

$$C = \max\{\mathit{Ext}_{X_i,Y_j}(k) \mid \min\{e_3 - e_1, e_2 - e_4\} < k < \max\{e_3 - e_1, e_2 - e_4\}\} = \min\{e_1 + e_2, e_3 + e_4\}.$$

case 4: When $k = e_3 - e_1$. In this case we have

$$D = \mathit{Ext}_{X_i,Y_j}(k) = k + \min\{e_2 - k, e_4\} + \mathit{LCPref}(Y_{r_j}[k + 1 : |Y_{r_j}|], X_{r_i}) = k + \min\{e_2 - k, e_4\} + \mathit{FM}(Y_{r_j}, X_{r_i}, k + 1).$$

case 5: When $k = e_2 - e_4$. In this case we have

$$E = \mathit{Ext}_{X_i,Y_j}(k) = k + \mathit{LCSuf}(X_{\ell_i}[1 : |X_{\ell_i}| - k], Y_{\ell_j}) + \min\{e_1, e_3 - k\} = k + \mathit{FM}(X_{\ell_i}^R, Y_{\ell_j}^R, k + 1) + \min\{e_1, e_3 - k\}.$$

case 6: When $k = e_3 - e_1 = e_2 - e_4$. In this case we have

$$F = \mathit{Ext}_{X_i,Y_j}(k) = k + \mathit{LCSuf}(X_{\ell_i}[1 : |X_{\ell_i}| - k], Y_{\ell_j}) + \mathit{LCPref}(Y_{r_j}[k + 1 : |Y_{r_j}|], X_{r_i}) = k + \mathit{FM}(X_{\ell_i}^R, Y_{\ell_j}^R, k + 1) + \mathit{FM}(Y_{r_j}, X_{r_i}, k + 1).$$

Then clearly the following inequality holds (see also Fig. 3):

$$F \ge \max\{D, E\} \ge C \ge \max\{A, B\}. \qquad (1)$$

A membership query to the arithmetic progression $\langle a, d, t \rangle$ can be answered in constant time. Also, an element $k \in \langle a, d, t \rangle$ such that $\min\{e_3 - e_1, e_2 - e_4\} < k < \max\{e_3 - e_1, e_2 - e_4\}$ of case 3 can be found in constant time, if one exists. $k'$ and $k''$ of case 1 and case 2, respectively, can be computed in constant time as well. Therefore, based on inequality (1), we can compute $\max(\mathit{Ext}_{X_i,Y_j}(\langle a, d, t \rangle))$ by at most 2 calls of $\mathit{FM}$, provided that $e_1, e_2, e_3, e_4$ are already computed.

Since $\mathit{OL}(X_{\ell_i}, Y_{r_j})$ contains $O(n)$ arithmetic progressions by Lemma 2, and each call of $\mathit{FM}$ takes $O(n \log n)$ time by Lemma 3, $\max(\mathit{Ext}_{X_i,Y_j}(\mathit{OL}(X_{\ell_i}, Y_{r_j})))$ can be computed in $O(n^2 \log n)$ time. □

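A pseudo-code of our algorithm is given in Algorithm 1.

Algorithm 1: Computing $\mathit{LCStr}(T, S)$.

Input: SLPs $\mathcal{T} = \{X_i\}_{i=1}^{n}$, $\mathcal{S} = \{Y_j\}_{j=1}^{m}$
Output: Length of longest common substring of strings $T$ and $S$

 1  for $i = 1$ to $n$ do
 2      for $j = 1$ to $m$ do
 3          compute $\mathit{OL}(X_i, Y_j)$ and $\mathit{OL}(Y_j, X_i)$;
 4
 5  $L = \emptyset$;
 6  for $i = 1$ to $n$ do
 7      for $j = 1$ to $m$ do
 8          $L = L \cup \max(\mathit{Ext}_{X_i,Y_j}(\mathit{OL}(X_{\ell_i}, Y_{r_j}))) \cup \max(\mathit{Ext}_{Y_j,X_i}(\mathit{OL}(Y_{\ell_j}, X_{r_i}))) \cup \mathit{Ext}_{X_i,Y_j}(0)$;
 9
10  return $\max(L)$;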

Now we obtain the main result of this section.

Theorem 2 Algorithm 1 solves Problem 1 in $O(n^4 \log n)$ time with $O(n^3)$ space.

Proof. The correctness of the algorithm is clear from lines 6-10 which correspond to Observation 2.

It follows from Theorem 1 that it takes $O(n^4 \log n)$ time and $O(n^3)$ space in lines 1-4.

For any variables $X_i = X_{\ell_i} X_{r_i}$ and $Y_j = Y_{\ell_j} Y_{r_j}$, $\max(\mathit{Ext}_{X_i,Y_j}(\mathit{OL}(X_{\ell_i}, Y_{r_j})))$ and $\max(\mathit{Ext}_{Y_j,X_i}(\mathit{OL}(Y_{\ell_j}, X_{r_i})))$ can be computed in $O(n^2 \log n)$ time by Lemma 4. Since each of $\max(\mathit{Ext}_{X_i,Y_j}(\mathit{OL}(X_{\ell_i}, Y_{r_j})))$ and $\max(\mathit{Ext}_{Y_j,X_i}(\mathit{OL}(Y_{\ell_j}, X_{r_i})))$ is a singleton, we have $|L| = O(n^2)$. Hence it takes $O(n^4 \log n)$ time in lines 6-10.

Overall, the algorithm works in $O(n^4 \log n)$ time with $O(n^3)$ space. □

The following corollary is immediate from Theorem 2.

Corollary 1 Given two SLPs $\mathcal{T}$ and $\mathcal{S}$ describing strings $T$ and $S$ respectively, the beginning and ending positions of a longest common substring of $T$ and $S$ can be computed in $O(n^4 \log n)$ time with $O(n^3)$ space.

4 Computing Palindromes from SLP Compressed Strings

In this section we present an efficient algorithm that computes a succinct representation of all maximal palindromes of string $T$, when its corresponding SLP $\mathcal{T}$ is given as input. The algorithm runs in $O(n^4)$ time and $O(n^2)$ space, where $n$ is the size of the input SLP $\mathcal{T}$.

4.1 The Problem

For any string $T$, let $\mathit{Pals}(T)$ denote the set of pairs of the beginning and ending positions of all maximal palindromes in $T$, namely,

$$\mathit{Pals}(T) = \{(p, q) \mid T[p : q] \text{ is the maximal palindrome centered at } \lfloor \tfrac{p+q}{2} \rfloor\}.$$

Note that $|\mathit{Pals}(T)| = O(|T|) = O(2^n)$. Thus we consider a succinct representation of $\mathit{Pals}(T)$ in the sequel.

Let $\mathit{PPals}(T)$ and $\mathit{SPals}(T)$ denote the set of pairs of the beginning and ending positions of the prefix and suffix palindromes of $T$, respectively, that is,

$$\mathit{PPals}(T) = \{(1, q) \in \mathit{Pals}(T) \mid 1 \le q \le |T|\}, \text{ and}$$

$$\mathit{SPals}(T) = \{(p, |T|) \in \mathit{Pals}(T) \mid 1 \le p \le |T|\}.$$

Example 6 For string $T = \mathtt{aababaababaab}$, $\mathit{PPals}(T) = \{(1, q) \mid q \in \{1, 2, 7, 12\}\}$, since $\mathtt{a}$, $\mathtt{aa}$, $\mathtt{aababaa}$, and $\mathtt{aababaababaa}$ are prefix palindromes. Also, $\mathit{SPals}(T) = \{(p, 13) \mid p \in \{5, 10, 13\}\}$, since $\mathtt{baababaab}$, $\mathtt{baab}$ and $\mathtt{b}$ are suffix palindromes.
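The sets $\mathit{Pals}$, $\mathit{PPals}$ and $\mathit{SPals}$ can be computed naively on an uncompressed string by expanding around every center. The following quadratic-time sketch uses hypothetical function names and 1-based inclusive positions as in the text; it reproduces Example 6 and is meant only as a reference implementation of the definitions, not as the compressed algorithm.

```python
def maximal_palindromes(t):
    """Pals(t): all (p, q), 1-based inclusive, such that t[p:q] is a maximal palindrome."""
    n = len(t)
    # Odd palindromes grow from (c, c); even palindromes grow from (c, c + 1).
    seeds = [(c, c) for c in range(1, n + 1)] + [(c, c + 1) for c in range(1, n)]
    result = set()
    for lo, hi in seeds:
        if t[lo - 1] != t[hi - 1]:
            continue  # no even palindrome at this center
        # Widen while both neighbours exist and match.
        while lo > 1 and hi < n and t[lo - 2] == t[hi]:
            lo, hi = lo - 1, hi + 1
        result.add((lo, hi))
    return result


def prefix_palindromes(t):
    """PPals(t): the maximal palindromes that start at position 1."""
    return {(p, q) for (p, q) in maximal_palindromes(t) if p == 1}


def suffix_palindromes(t):
    """SPals(t): the maximal palindromes that end at position |t|."""
    return {(p, q) for (p, q) in maximal_palindromes(t) if q == len(t)}
```

Every palindrome that starts at position 1 or ends at position $|T|$ is maximal by definition, so filtering $\mathit{Pals}(T)$ on $p = 1$ or $q = |T|$ yields all prefix and suffix palindromes.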

It is easy to see that for any non-empty string $T$, $\mathit{PPals}(T)$, $\mathit{SPals}(T)$ and $\mathit{Pals}(T)$ are non-empty sets.

Let $X_i$ denote a variable in $\mathcal{T}$ for $1 \le i \le n$. For any variable $X_i = X_\ell X_r$, let $\mathit{Pals}^{\triangle}(X_i)$ be the set of pairs of beginning and ending positions of maximal palindromes of $X_i$ that cover or touch the boundary between $X_\ell$ and $X_r$, namely,

$$\mathit{Pals}^{\triangle}(X_i) = \{(p, q) \in \mathit{Pals}(X_i) \mid 1 \le p \le |X_\ell| + 1,\ |X_\ell| \le q \le |X_i|,\ p \le q\}.$$

Example 7 Consider variable $X_6 = X_4X_5 = \mathtt{aababaab}$ of Example 1, where $X_4 = \mathtt{aab}$ and $X_5 = \mathtt{abaab}$. $\mathit{Pals}^{\triangle}(X_6) = \{(2, 4), (1, 7), (4, 6)\}$ since $X_6[2 : 4] = \mathtt{aba}$, $X_6[1 : 7] = \mathtt{aababaa}$, and $X_6[4 : 6] = \mathtt{aba}$ are the maximal palindromes that touch or cover the boundary of $X_4$ and $X_5$.

We have the following observation for decomposition of $\mathit{Pals}(X_i)$ (see Fig. 4).

Fig. 4. Illustration of Observation 3. Any maximal palindrome of $X_i$ is a non-suffix maximal palindrome of $X_\ell$ (like $p_1$), a maximal palindrome of $X_i$ covering or touching the boundary of $X_i$ (like $p_2$), or a non-prefix maximal palindrome of $X_r$ (like $p_3$).

Observation 3 For any variable $X_i = X_\ell X_r$,

$$\mathit{Pals}(X_i) = (\mathit{Pals}(X_\ell) - \mathit{SPals}(X_\ell)) \cup \mathit{Pals}^{\triangle}(X_i) \cup ((\mathit{Pals}(X_r) - \mathit{PPals}(X_r)) \oplus |X_\ell|).$$
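Observation 3 can be verified by brute force on uncompressed strings. The sketch below uses hypothetical function names and 1-based inclusive positions; it recomputes each of the three parts of the decomposition from the definitions and compares their union with $\mathit{Pals}(X_i)$ for the variable $X_6 = X_4X_5$ of Example 7.

```python
def maximal_palindromes(t):
    """Pals(t): all (p, q), 1-based inclusive, such that t[p:q] is a maximal palindrome."""
    n = len(t)
    seeds = [(c, c) for c in range(1, n + 1)] + [(c, c + 1) for c in range(1, n)]
    result = set()
    for lo, hi in seeds:
        if t[lo - 1] != t[hi - 1]:
            continue
        while lo > 1 and hi < n and t[lo - 2] == t[hi]:
            lo, hi = lo - 1, hi + 1
        result.add((lo, hi))
    return result


def pals_triangle(xl, xr):
    """Maximal palindromes of xl + xr that cover or touch the boundary between xl and xr."""
    m = len(xl)
    return {(p, q) for (p, q) in maximal_palindromes(xl + xr) if p <= m + 1 and q >= m}


def observation3_rhs(xl, xr):
    """(Pals(xl) - SPals(xl)) | Pals^triangle(xl xr) | ((Pals(xr) - PPals(xr)) (+) |xl|)."""
    m = len(xl)
    non_suffix = {(p, q) for (p, q) in maximal_palindromes(xl) if q != m}
    non_prefix = {(p + m, q + m) for (p, q) in maximal_palindromes(xr) if p != 1}
    return non_suffix | pals_triangle(xl, xr) | non_prefix
```

The decomposition is what allows $\mathit{Pals}(X_n)$ to be described recursively: only the boundary part has to be computed afresh for each variable, while the other two parts are inherited from the children.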

Thus, the desired output $\mathit{Pals}(T) = \mathit{Pals}(X_n)$ can be represented as a combination of $\{\mathit{Pals}^{\triangle}(X_i)\}_{i=1}^{n}$, $\{\mathit{PPals}(X_i)\}_{i=1}^{n}$ and $\{\mathit{SPals}(X_i)\}_{i=1}^{n}$. Therefore, computing $\mathit{Pals}(T)$ is reduced to computing $\mathit{Pals}^{\triangle}(X_i)$, $\mathit{PPals}(X_i)$ and $\mathit{SPals}(X_i)$, for every $i = 1, 2, \ldots, n$. The problem to be tackled in this section follows:

Problem 2 Given an SLP $\mathcal{T}$ of size $n$, compute succinct representations of $\{\mathit{Pals}^{\triangle}(X_i)\}_{i=1}^{n}$, $\{\mathit{PPals}(X_i)\}_{i=1}^{n}$ and $\{\mathit{SPals}(X_i)\}_{i=1}^{n}$.

Note that the sizes of $\{\mathit{Pals}^{\triangle}(X_i)\}_{i=1}^{n}$, $\{\mathit{PPals}(X_i)\}_{i=1}^{n}$ and $\{\mathit{SPals}(X_i)\}_{i=1}^{n}$ can be $O(2^n)$. Thus we output succinct representations of these sets whose sizes are polynomial in $n$. In the following sections we show how to succinctly represent and compute these sets.

4.2 Succinct Representations of $\mathit{PPals}(X)$ and $\mathit{SPals}(X)$

Gąsieniec et al. [2] claimed that $\mathit{PPals}(X)$ and $\mathit{SPals}(X)$ can be represented by $O(\log |X|)$ arithmetic progressions for any string $X$. However, they gave no proof of this claim. Although they stated that a proof would be given in a full version of the paper, unfortunately it has never appeared. This section supplies a full proof that $\mathit{PPals}(X)$ and $\mathit{SPals}(X)$ can be represented by $O(\log |X|)$ arithmetic progressions.

Let us focus on the space requirement of $\mathit{PPals}(X)$, as that of $\mathit{SPals}(X)$ can be shown similarly. Recall that $\mathit{PPals}(X)$ is the set of pairs of the beginning and ending positions of all prefix palindromes of $X$.

Fig. 5. $(1, q) \in \mathit{PPals}(X)$ implies $X[i : j] = X[q-j+1 : q-i+1]^R$.

The following lemma is obvious but is quite helpful to prove Lemma 6.

Lemma 5 For any integer $q$ such that $(1, q) \in \mathit{PPals}(X)$ and any integers $i, j$ with $1 \le i < j \le q$, we have $X[i : j] = X[q - j + 1 : q - i + 1]^R$.

Proof. Since $X[1 : q]$ is a prefix palindrome of $X$, we have $X[i] = X[q - i + 1]$ for any $i$ with $1 \le i \le q$, which implies that:

$$\begin{aligned} X[i : j] &= X[i]\, X[i+1] \cdots X[j-1]\, X[j] \\ &= X[q-i+1]\, X[q-i] \cdots X[q-j+2]\, X[q-j+1] \\ &= (X[q-j+1]\, X[q-j+2] \cdots X[q-i]\, X[q-i+1])^R \\ &= X[q - j + 1 : q - i + 1]^R. \end{aligned}$$

(See also Fig. 5.) □

Lemma 6 For any positive integers $a$ and $d$, if $(1, a), (1, a + d) \in \mathit{PPals}(X)$ and $a - d \ge 0$, then $(1, a - d) \in \mathit{PPals}(X)$.

Proof. We show $X[1 : a - d] = X[1 : a - d]^R$, which yields that $a - d$ is the length of a prefix palindrome in $X$. By applying Lemma 5, we have

$$\begin{aligned} X[1 : a - d] &= X[a - (a - d) + 1 : a - 1 + 1]^R && (2) \\ &= X[d + 1 : a]^R \\ &= (X[(a + d) - a + 1 : (a + d) - (d + 1) + 1]^R)^R && (3) \\ &= X[d + 1 : a] \\ &= X[1 : a - d]^R, \end{aligned}$$

where Equation (2) comes from $(1, a) \in \mathit{PPals}(X)$, whereas Equation (3) comes from $(1, a + d) \in \mathit{PPals}(X)$ (see also Fig. 6). □

Let $a_1, a_2, \ldots, a_k$ be the sequence of integers in increasing order, such that $\mathit{PPals}(X) = \{(1, a_1), (1, a_2), \ldots, (1, a_k)\}$. We define $d_i$ as the progression differences for $a_i$, that is, $d_i = a_{i+1} - a_i$ for $1 \le i < k$. The next lemma states that the sequence $\{d_i\}_{i=1}^{k-1}$ is monotonically non-decreasing.

Fig. 6. $(1, a) \in \mathit{PPals}(X)$ and $(1, a + d) \in \mathit{PPals}(X)$ implies $(1, a - d) \in \mathit{PPals}(X)$.

Fig. 7. $d_i > d_{i+1}$ contradicts the definition of $\{a_i\}_{i=1}^{k}$.

Lemma 7 $d_i \le d_{i+1}$ for any $1 \le i < k - 1$.

Proof. Suppose $d_i > d_{i+1}$ holds for some $1 \le i < k - 1$. Since $(1, a_{i+1}) \in \mathit{PPals}(X)$ and $(1, a_{i+2}) = (1, a_{i+1} + d_{i+1}) \in \mathit{PPals}(X)$, Lemma 6 claims that $(1, a_{i+1} - d_{i+1}) \in \mathit{PPals}(X)$. However, $a_i = a_{i+1} - d_i < a_{i+1} - d_{i+1} < a_{i+1}$, which contradicts the definition that $(1, a_{i+1})$ is the next element to $(1, a_i)$ in $\mathit{PPals}(X)$ in increasing order (see also Fig. 7). □

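Lemma 8 If $d_{i+1} \ne d_i$, then $d_{i+1} \ge d_i + d_{i-1}$.

Proof. By Lemma 6, we have $(1, a_{i+1} - d_{i+1}) \in \mathit{PPals}(X)$ since $(1, a_{i+1}) \in \mathit{PPals}(X)$ and $(1, a_{i+2}) = (1, a_{i+1} + d_{i+1}) \in \mathit{PPals}(X)$. Therefore, $a_{i+1} - d_{i+1} = a_j$ for some $1 \le j \le i$, so that

$$d_{i+1} = a_{i+1} - a_j = \sum_{\ell=j}^{i} (a_{\ell+1} - a_\ell) = \sum_{\ell=j}^{i} d_\ell.$$

If $d_{i+1} \ne d_i$, we have $j < i$, which implies $d_{i+1} = \sum_{\ell=j}^{i} d_\ell \ge d_i + d_{i-1}$. □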

The following is a key lemma of this subsection:

Lemma 9 For any variable $X$, $\mathit{PPals}(X)$ and $\mathit{SPals}(X)$ can be represented by $O(\log |X|)$ arithmetic progressions.

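Proof. We show that $\mathit{PPals}(X)$ can be represented by $O(\log |X|)$ arithmetic progressions. The case of $\mathit{SPals}(X)$ can be proved similarly.

It follows from Lemma 6 that, for any positive integer $r$ such that $a_i - r d_i > 0$, we have $(1, a_i - r d_i) \in \mathit{PPals}(X)$. For any $a_i$ and $d_i$, let $t_i = \max\{y \mid a_i - (y - 1) d_i > 0\}$ and $a'_i = a_i - (t_i - 1) d_i$. That is, $a'_i$ is the smallest element of arithmetic progression $\langle a'_i, d_i, t_i \rangle$. Then, if $d_i = d_{i+1}$, it holds that $\langle a'_i, d_i, t_i \rangle \cup \{a_{i+1}\} = \langle a'_{i+1}, d_{i+1}, t_{i+1} \rangle$. For any integers $p, q$ and any arithmetic progression $\langle a, d, t \rangle$ such that $p \le a$ and $q \ge a + (t - 1)d$, let

$$(p, \langle a, d, t \rangle) = \{(p, a + (i - 1)d) \mid 1 \le i \le t\}, \text{ and}$$

$$(\langle a, d, t \rangle, q) = \{(a + (i - 1)d, q) \mid 1 \le i \le t\}.$$

Then we have

$$\mathit{PPals}(X) = \bigcup_{1 \le i \le n} (1, \langle a'_i, d_i, t_i \rangle) = \bigcup_{i \in \{i \mid d_i \ne d_{i+1}\}} (1, \langle a'_i, d_i, t_i \rangle).$$

The worst case scenario in terms of the number of arithmetic progressions in $\mathit{PPals}(X)$ is that $d_i \ne d_{i+1}$ for each $i$. By Lemma 8, the actual worst case is given by the following sequence $\{d_i\}_{i=1}^{k-1}$:

$$d_i = \begin{cases} 2 & \text{for } i = 1, \\ 3 & \text{for } i = 2, \\ d_{i-1} + d_{i-2} & \text{for } i > 2. \end{cases}$$

Now, let $F_j$ denote the $j$-th Fibonacci number, namely,

$$F_j = \begin{cases} 1 & \text{for } j = 1, 2, \\ F_{j-1} + F_{j-2} & \text{for } j > 2. \end{cases}$$

It is a well-known fact that $F_i = \frac{\varphi^i - (1 - \varphi)^i}{\sqrt{5}} = \left\lfloor \frac{\varphi^i}{\sqrt{5}} + \frac{1}{2} \right\rfloor$, where $\varphi = \frac{\sqrt{5} + 1}{2}$.

Clearly $d_i = F_{i+2}$. Therefore, the general term of $\{a_i\}$ can be represented as follows:

$$\begin{aligned} a_i &= a_{i-1} + d_{i-1} = a_{i-2} + d_{i-2} + d_{i-1} = \cdots = a_1 + \sum_{k=1}^{i-1} d_k = a_1 + \sum_{k=3}^{i+1} F_k \\ &= a_1 + \sum_{k=1}^{i+1} F_k - F_1 - F_2 = 1 + F_{i+1+2} - 1 - 1 - 1 = F_{i+3} - 2. \end{aligned}$$

Now we have the following formula for the largest element $a_k$ of $\{a_i\}_{i=1}^{k}$:

$$a_k = F_{k+3} - 2 = \left\lfloor \frac{\varphi^{k+3}}{\sqrt{5}} + \frac{1}{2} \right\rfloor - 2 > \frac{\varphi^{k+3}}{\sqrt{5}} + \frac{1}{2} - 1 - 2.$$

Since $a_k \le |X|$ and $\varphi > 1$, we have that $k = O(\log_\varphi |X|) = O(\log |X|)$. □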
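The combinatorial facts used in this proof can be sanity-checked numerically. The sketch below, with hypothetical function names, computes the prefix-palindrome lengths $a_1 < \cdots < a_k$ of an uncompressed string by brute force, tests Lemmas 7 and 8 exhaustively on short binary strings (Lemma 8 is checked for $i \ge 2$, where $d_{i-1}$ is defined), and confirms the identity $a_i = F_{i+3} - 2$ for the worst-case gap sequence under the assumption $a_1 = 1$ made in the derivation above.

```python
from itertools import product


def prefix_palindrome_lengths(x):
    """The increasing sequence a_1 < a_2 < ... < a_k of prefix-palindrome lengths of x."""
    return [q for q in range(1, len(x) + 1) if x[:q] == x[:q][::-1]]


def gaps(lengths):
    """d_i = a_{i+1} - a_i."""
    return [b - a for a, b in zip(lengths, lengths[1:])]


def lemmas_7_and_8_hold(x):
    d = gaps(prefix_palindrome_lengths(x))  # d[0] corresponds to d_1
    lemma7 = all(d[i] <= d[i + 1] for i in range(len(d) - 1))
    lemma8 = all(d[i + 1] >= d[i] + d[i - 1]
                 for i in range(1, len(d) - 1) if d[i + 1] != d[i])
    return lemma7 and lemma8


def fib(j):
    """j-th Fibonacci number with F_1 = F_2 = 1."""
    a, b = 1, 1
    for _ in range(j - 1):
        a, b = b, a + b
    return a


def worst_case_a(i):
    """a_i for the gaps d_1 = 2, d_2 = 3, d_i = d_{i-1} + d_{i-2}, starting from a_1 = 1."""
    d = [2, 3]
    while len(d) < i - 1:
        d.append(d[-1] + d[-2])
    return 1 + sum(d[:i - 1])
```

For the string of Example 6 the lengths are 1, 2, 7, 12 with gaps 1, 5, 5, so only two distinct gaps occur, in line with the logarithmic bound of Lemma 9.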

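4.3 Efficient Computation of $\mathit{Pals}^{\triangle}(X_i)$, $\mathit{PPals}(X_i)$ and $\mathit{SPals}(X_i)$

In this section we show how to efficiently compute $\mathit{Pals}^{\triangle}(X_i)$, $\mathit{PPals}(X_i)$ and $\mathit{SPals}(X_i)$.

The next lemma points out that $\mathit{SPals}(X_\ell)$ and $\mathit{PPals}(X_r)$ are useful to compute $\mathit{Pals}^{\triangle}(X_i)$.

Lemma 10 For any variable $X_i = X_\ell X_r$ and any $(p, q) \in \mathit{Pals}^{\triangle}(X_i)$, there exists an integer $l \ge 0$ such that $(p + l, q - l) \in \mathit{SPals}(X_\ell) \cup (\mathit{PPals}(X_r) \oplus |X_\ell|) \cup \{(|X_\ell|, |X_\ell| + 1)\}$.

Proof. Since $X_i[p : q]$ is a palindrome, $X_i[p + l : q - l]$ is also a palindrome for any $0 \le l < \lfloor \frac{p+q}{2} \rfloor$. Then we have the following three cases:

(1) When $\lfloor \frac{p+q}{2} \rfloor < |X_\ell|$, for $l = q - |X_\ell|$, we have $(p + l, q - l) \in \mathit{SPals}(X_\ell)$.
(2) When $\lfloor \frac{p+q}{2} \rfloor > |X_\ell|$, for $l = |X_\ell| - p + 1$, we have $(p + l, q - l) \in \mathit{PPals}(X_r) \oplus |X_\ell|$.
(3) When $\lfloor \frac{p+q}{2} \rfloor = |X_\ell|$, if $q - p + 1$ is odd, then the same arguments as in case 1 apply, since $X_\ell[|X_\ell|] = X_\ell[|X_\ell|]^R$ and $(|X_\ell|, |X_\ell|) \in \mathit{SPals}(X_\ell)$. If $q - p + 1$ is even, let $l = |X_\ell| - p$. In this case, we have $p + q = 2|X_\ell| + 1$. Thus, $p + l = |X_\ell|$ and $q - l = |X_\ell| + 1$.

□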

By Lemma 10, $\mathit{Pals}^{\triangle}(X_i)$ can be computed by “extending” all palindromes in $\mathit{SPals}(X_\ell)$ and $\mathit{PPals}(X_r)$ to the maximal within $X_i$, and finding the maximal even palindromes centered at $|X_\ell|$ in $X_i$. In so doing, for any (maximal or non-maximal) palindrome $P = X_i[p : q]$, we define function $\mathit{Ext}_{X_i}$ as

$$\mathit{Ext}_{X_i}(p, q) = (p - h, q + h),$$

where $h \ge 0$ and $X_i[p - h : q + h]$ is the maximal palindrome centered at position $\lfloor \frac{p+q}{2} \rfloor$ in $X_i$. For any $p, q$ with $X_i[p : q]$ not being a palindrome, we leave $\mathit{Ext}_{X_i}(p, q)$ undefined. There are the following natural properties on function $\mathit{Ext}_{X_i}$:

• the input and output palindromes are centered at the same position;
• if $|P| = q - p + 1$ is odd, then $\mathit{Ext}_{X_i}(p, q)$ is also an odd palindrome;
• if $|P| = q - p + 1$ is even, then $\mathit{Ext}_{X_i}(p, q)$ is also an even palindrome.

For a set $S$ of pairs of integers, let $\mathit{Ext}_{X_i}(S) = \{\mathit{Ext}_{X_i}(p, q) \mid (p, q) \in S\}$.

Let

$$\mathit{Pals}^{*}(X_i) = \{(|X_\ell| - k + 1, |X_\ell| + k) \in \mathit{Pals}(X_i) \mid k \ge 1\}.$$

The next observation gives us a procedure to compute $\mathit{Pals}^{\triangle}(X_i)$.

Fig. 8. Illustration of Observation 4. Any element of $\mathit{Pals}^{\triangle}(X_i)$ can be computed by extending either some suffix palindrome in $\mathit{SPals}(X_\ell)$ or some prefix palindrome in $\mathit{PPals}(X_r)$, or it is the maximal even palindrome centered at $|X_\ell|$ in $X_i$.

Observation 4 For any variable $X_i = X_\ell X_r$,

$$\mathit{Pals}^{\triangle}(X_i) = \mathit{Ext}_{X_i}(\mathit{SPals}(X_\ell)) \cup \mathit{Ext}_{X_i}(\mathit{PPals}(X_r) \oplus |X_\ell|) \cup \mathit{Pals}^{*}(X_i). \qquad (4)$$

See also Fig. 8 that illustrates Observation 4.

In what follows we show how to efficiently execute the $\mathit{Ext}$ functions in Equation (4). Let us first briefly recall the work of [9,6]. For any variables $X_i = X_\ell X_r$ and $X_j$, we define the set $\mathit{Occ}^{\triangle}(X_i, X_j)$ of all occurrences of $X_j$ that cover or touch the boundary between $X_\ell$ and $X_r$, namely,

$$\mathit{Occ}^{\triangle}(X_i, X_j) = \{s > 0 \mid X_i[s : s + |X_j| - 1] = X_j,\ |X_\ell| - |X_j| + 1 \leq s \leq |X_\ell|\}.$$
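As a small illustration of this definition, the following sketch enumerates $\mathit{Occ}^{\triangle}$ on explicit strings. The function name `occ_triangle` and the three string arguments are assumptions for the example; the algorithms of [9,6] compute the same set on the compressed representation without expanding any variable.

```python
def occ_triangle(left, right, pattern):
    """Start positions s (1-based) of occurrences of `pattern` in left + right
    with |left| - |pattern| + 1 <= s <= |left|, as in the definition above."""
    x, m, k = left + right, len(left), len(pattern)
    lo = max(1, m - k + 1)           # s > 0
    hi = min(m, len(x) - k + 1)      # the occurrence must fit inside x
    return [s for s in range(lo, hi + 1) if x[s - 1:s - 1 + k] == pattern]


# In "abab" + "abb", the pattern "bab" starts at positions 2 and 4:
# the first occurrence ends exactly at the boundary, the second crosses it.
positions = occ_triangle("abab", "abb", "bab")
```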

Theorem 3 ([6]) For any variables $X_i$ and $X_j$, $\mathit{Occ}^{\triangle}(X_i, X_j)$ can be computed in a total of $O(n^3)$ time and $O(n^2)$ space.

Lemma 11 ([9]) For any variables $X_i$, $X_j$ and integer $k$, $\mathit{FM}(X_i, X_j, k)$ can be computed in $O(n^2)$ time, provided that $\mathit{Occ}^{\triangle}(X_{i'}, X_{j'})$ is already computed for any $1 \leq i' \leq i$ and $1 \leq j' \leq j$.

Lemma 12 For any variable $X_i = X_\ell X_r$ and any arithmetic progression $\langle a, d, t \rangle$ with $(1, \langle a, d, t \rangle) \subseteq \mathit{PPals}(X_r)$, $\mathit{Ext}_{X_i}((1, \langle a, d, t \rangle))$ can be represented by at most 2 arithmetic progressions and a pair of the beginning and ending positions of a maximal palindrome, and can be computed by at most 4 calls of $\mathit{FM}$. The same holds for any arithmetic progression $\langle a', d', t' \rangle$ with $(\langle a', d', t' \rangle, |X_\ell|) \subseteq \mathit{SPals}(X_\ell)$.

The above lemma essentially follows from Lemma 3.4 of [1]. However, for the sake of completeness we supply a full proof of the lemma in the Appendix.

We are now ready to prove the following lemma:

Lemma 13 For any variable $X_i = X_\ell X_r$, $\mathit{Pals}^{\triangle}(X_i)$ can be represented by $O(\log |X_i|)$ arithmetic progressions and can be computed in $O(n^2 \log |X_i|)$ time.

Proof. Recall Observation 4. It is clear from the definition that $\mathit{Pals}^{*}(X_i)$ is either a singleton or empty. When it is a singleton, it consists of the maximal even palindrome centered at $|X_\ell|$. Let $k = \mathit{FM}(X_r, X_\ell^R, 1)$. Then we have

$$\mathit{Pals}^{*}(X_i) = \begin{cases} \emptyset & \text{if } k = 0, \\ \{(|X_\ell| - k + 1, |X_\ell| + k)\} & \text{otherwise.} \end{cases}$$

Due to Lemma 11, $\mathit{Pals}^{*}(X_i)$ can be computed in $O(n^2)$ time.
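The case distinction above amounts to one longest-common-prefix computation between $X_r$ and $X_\ell^R$. A minimal sketch on explicit strings follows; here `lcp` is a hypothetical stand-in for the value that $\mathit{FM}(X_r, X_\ell^R, 1)$ yields in the formula above, and the quadratic-time compressed computation of Lemma 11 is not reproduced.

```python
def lcp(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def pals_star(left, right):
    """Maximal even palindrome of left + right centered at the boundary,
    as a set holding at most one 1-based (begin, end) pair."""
    m = len(left)
    k = lcp(right, left[::-1])  # plays the role of FM(X_r, X_l^R, 1)
    return set() if k == 0 else {(m - k + 1, m + k)}


# "xab" + "bay" = "xabbay": the palindrome "abba" occupies positions 2..5.
example = pals_star("xab", "bay")
```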

Now we consider $\mathit{Ext}_{X_i}(\mathit{PPals}(X_r))$. It follows from Lemma 12 that each subset $\mathit{Ext}_{X_i}((1, \langle a, d, t \rangle)) \subseteq \mathit{Ext}_{X_i}(\mathit{PPals}(X_r))$ can be represented by $O(1)$ arithmetic progressions. Also, $\mathit{Ext}_{X_i}((1, \langle a, d, t \rangle))$ can be computed in $O(n^2)$ time due to Lemma 11 and Lemma 12. It follows from Lemma 9 that $\mathit{PPals}(X_r)$ consists of $O(\log |X_r|)$ arithmetic progressions. Thus $\mathit{Ext}_{X_i}(\mathit{PPals}(X_r))$ can be computed in $O(n^2 \log |X_r|)$ time. Similar arguments hold for $\mathit{Ext}_{X_i}(\mathit{SPals}(X_\ell))$.

Hence, by Observation 4, $\mathit{Pals}^{\triangle}(X_i)$ can be represented by $O(\log |X_i|)$ arithmetic progressions and can be computed in $O(n^2 \log |X_i|)$ time. □

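On the other hand, $\mathit{PPals}(X_i)$ and $\mathit{SPals}(X_i)$ can be computed using $\mathit{Pals}^{\triangle}(X_i)$, as follows:

Observation 5 For any variable $X_i = X_\ell X_r$,

$$\mathit{PPals}(X_i) = (\mathit{PPals}(X_\ell) - (1, |X_\ell|)) \cup \{(1, q) \in \mathit{Pals}^{\triangle}(X_i)\} \quad \text{and}$$

$$\mathit{SPals}(X_i) = ((\mathit{SPals}(X_r) - (1, |X_r|)) \oplus |X_\ell|) \cup \{(p, |X_i|) \in \mathit{Pals}^{\triangle}(X_i)\}.$$

See also Fig. 9, which illustrates Observation 5.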
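Observation 5 can likewise be checked by brute force on explicit strings. The sketch below assumes, as is not restated in this section, that $\mathit{PPals}$ and $\mathit{SPals}$ are the sets of all prefix and suffix palindromes, and that $\mathit{Pals}^{\triangle}(X_i)$ consists of the maximal palindromes $(p, q)$ with $p - 1 \leq |X_\ell| \leq q$. The helper names are hypothetical.

```python
def is_pal(s):
    return s == s[::-1]


def prefix_pals(x):
    return {(1, q) for q in range(1, len(x) + 1) if is_pal(x[:q])}


def suffix_pals(x):
    return {(p, len(x)) for p in range(1, len(x) + 1) if is_pal(x[p - 1:])}


def boundary_pals(x, m):
    # Maximal palindromes (p, q) of x with p - 1 <= m <= q (assumed definition).
    n, out = len(x), set()
    for p in range(1, n + 1):
        for q in range(p, n + 1):
            maximal = p == 1 or q == n or x[p - 2] != x[q]
            if maximal and p - 1 <= m <= q and is_pal(x[p - 1:q]):
                out.add((p, q))
    return out


def observation5(left, right):
    """Return (PPals, SPals) of left + right built as in Observation 5."""
    x, m = left + right, len(left)
    tri = boundary_pals(x, m)
    ppals = (prefix_pals(left) - {(1, m)}) | {(p, q) for (p, q) in tri if p == 1}
    shifted = {(p + m, q + m) for (p, q) in suffix_pals(right) - {(1, len(right))}}
    spals = shifted | {(p, q) for (p, q) in tri if q == len(x)}
    return ppals, spals


ppals, spals = observation5("aba", "ab")
```

For "aba" · "ab", the prefix palindromes of "abaab" are "a" and "aba", and the suffix palindromes are "b" and "baab".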

Lemma 14 For any SLP variable $X_i = X_\ell X_r$, $\mathit{PPals}(X_i)$ and $\mathit{SPals}(X_i)$ can be computed in $O(\log |X_i|)$ time, provided that $\mathit{PPals}(X_\ell)$, $\mathit{SPals}(X_r)$ and $\mathit{Pals}^{\triangle}(X_i)$ are already computed.

Fig. 9. Illustration of Observation 5. Any element of $\mathit{PPals}(X_i)$ is either an element of $\mathit{PPals}(X_\ell)$ or an element of $\mathit{Pals}^{\triangle}(X_i)$ whose beginning position is 1. Similar arguments hold for $\mathit{SPals}(X_i)$.

Proof. Clear from Lemma 9 and Lemma 13. □

4.4 Results

Algorithm 2 shows pseudo-code for our algorithm, which computes a succinct representation of all maximal palindromes of a given SLP-compressed string.

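Algorithm 2: Computing succinct representation of $\mathit{Pals}(T)$.

Input: SLP $\mathcal{T} = \{X_i\}_{i=1}^{n}$
Output: Succinct representation of $\mathit{Pals}(T)$ for string $T$

 1  for $i = 1$ to $n$ do
 2      for $j = 1$ to $n$ do
 3          compute $\mathit{Occ}^{\triangle}(X_i, X_j)$;
 4
 5  for $i = 1$ to $n$ do
 6      $\mathit{SPals}(X_i) = \emptyset$; $\mathit{PPals}(X_i) = \emptyset$; $\mathit{Pals}^{\triangle}(X_i) = \emptyset$;
 7  for $i = 1$ to $n$ do
 8      if $X_i = a$ then /* $X_i$ is constant */
 9          $\mathit{SPals}(X_i) = \langle 1, 1, 1 \rangle$; $\mathit{PPals}(X_i) = \langle 1, 1, 1 \rangle$; $\mathit{Pals}^{\triangle}(X_i) = \langle 1, 1, 1 \rangle$;
10      else /* $X_i = X_\ell X_r$ */
11          $\mathit{Pals}^{\triangle}(X_i) = \mathit{Ext}_{X_i}(\mathit{SPals}(X_\ell)) \cup \mathit{Ext}_{X_i}(\mathit{PPals}(X_r) \oplus |X_\ell|) \cup \mathit{Pals}^{*}(X_i)$;
12          $\mathit{PPals}(X_i) = \mathit{PPals}(X_\ell) \cup \{(1, q) \in \mathit{Pals}^{\triangle}(X_i)\}$;
13          $\mathit{SPals}(X_i) = (\mathit{SPals}(X_r) \oplus |X_\ell|) \cup \{(p, |X_i|) \in \mathit{Pals}^{\triangle}(X_i)\}$;
14
15  return $\{\mathit{Pals}^{\triangle}(X_i)\}_{i=1}^{n}$, $\{\mathit{SPals}(X_i)\}_{i=1}^{n}$, $\{\mathit{PPals}(X_i)\}_{i=1}^{n}$;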

The main result of this section is the following theorem.

Theorem 4 Algorithm 2 solves Problem 2 in $O(n^4)$ time with $O(n^2)$ space.


Proof. The correctness of the algorithm follows from lines 11-13, which correspond to Observations 4 and 5.

Now we analyze the time complexity. It follows from Theorem 3 that lines 1-4 take $O(n^3)$ time in total. By Lemma 13, line 11 takes $O(n^2 \log |X_i|)$ time. Also, by Lemma 14, lines 12-13 take $O(\log |X_i|)$ time. Since $|X_i| \leq 2^n$, we have $\log |X_i| = O(n)$, and therefore the time complexity for the for loop of line 7 is $O(n^4)$. Hence the overall time cost is $O(n^4)$.

The total space complexity is as follows. It follows from Theorem 3 that lines 1-4 take $O(n^2)$ space. By Lemma 13, line 11 takes $O(\log |X_i|)$ space. Also, by Lemma 9, lines 12-13 take $O(\log |X_i|)$ space. Summing $O(\log |X_i|) = O(n)$ over the $n$ variables, the space complexity for the for loop of line 7 is $O(n^2)$. Hence the overall space requirement is $O(n^2)$. □

The following two theorems are obtained by slightly modifying the algorithm of the previous subsections.

Theorem 5 Given an SLP $\mathcal{T}$ that describes string $T$, whether $T$ is a palindrome or not can be determined with extra $O(1)$ space and without increasing the asymptotic time complexity of the algorithm.

Proof. It suffices to see if $(1, |T|) \in \mathit{PPals}(T) = \mathit{PPals}(X_n)$. By Lemma 9, $\mathit{PPals}(X_n)$ can be represented by $O(n)$ arithmetic progressions. It is not difficult to see that $T$ is a palindrome if and only if $a + (t - 1)d = |T|$ for the arithmetic progression $\langle a, d, t \rangle$ of the largest common difference among those in $\mathit{PPals}(X_n)$. Such an arithmetic progression can easily be found during the computation of $\mathit{PPals}(X_n)$ without increasing the asymptotic time complexity of the algorithm. □

Theorem 6 Given an SLP $\mathcal{T}$ that describes string $T$, the position pair $(p, q)$ of the longest palindrome in $T$ can be found with extra $O(1)$ space and without increasing the asymptotic time complexity of the algorithm.

Proof. We compute the beginning and ending positions of the longest palindrome in $\mathit{Pals}^{\triangle}(X_i)$ for $i = 1, 2, \ldots, n$. It takes $O(n)$ time for each $X_i$. If its length exceeds the length of the currently kept palindrome, we update the beginning and ending positions. □

Provided that $\{\mathit{PPals}(X_i)\}_{i=1}^{n}$, $\{\mathit{SPals}(X_i)\}_{i=1}^{n}$, and $\{\mathit{Pals}^{\triangle}(X_i)\}_{i=1}^{n}$ are already computed, we have the following result:

Theorem 7 Given a pair $(p, q)$ of integers, it can be answered in $O(n)$ time whether or not substring $T[p : q]$ is a maximal palindrome of $T$.

Proof. We binary search the derivation tree of SLP $\mathcal{T}$ until finding the variable $X_i = X_\ell X_r$ such that $1 + \mathit{offset} \leq p \leq |X_\ell| + \mathit{offset}$ and $1 + \mathit{offset} + |X_\ell| \leq q \leq |X_i| + \mathit{offset}$. This takes $O(n)$ time. Due to Observation 4, for each variable $X_i$, $\mathit{Pals}^{\triangle}(X_i)$ can be represented by $O(n)$ arithmetic progressions plus a pair of the beginning and ending positions of a maximal palindrome. Thus, we can check if $(p, q) \in \mathit{Pals}^{\triangle}(X_i)$ in $O(n)$ time. □
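The descent used in this proof can be sketched as follows. The encoding of the SLP (a list in which each rule is either a one-character string or a pair of indices of earlier rules, with 0-based variable indices) and the names `lengths` and `locate` are assumptions made for this example only. The sketch uses only the stored lengths $|X_i|$ and never expands a variable; the final membership test against the arithmetic progressions of $\mathit{Pals}^{\triangle}(X_i)$ is omitted.

```python
def lengths(rules):
    """|X_i| for every rule; a rule is a 1-character string or a pair (l, r)."""
    size = []
    for rule in rules:
        size.append(1 if isinstance(rule, str) else size[rule[0]] + size[rule[1]])
    return size


def locate(rules, size, p, q):
    """Descend from the last variable to the variable X_i = X_l X_r with
    1 + offset <= p <= |X_l| + offset and
    1 + offset + |X_l| <= q <= |X_i| + offset.
    Returns (i, offset); if p == q, the descent ends at a terminal rule."""
    i, offset = len(rules) - 1, 0
    while not isinstance(rules[i], str):
        l, r = rules[i]
        if q <= offset + size[l]:        # T[p:q] lies entirely in X_l
            i = l
        elif p > offset + size[l]:       # T[p:q] lies entirely in X_r
            i, offset = r, offset + size[l]
        else:                            # T[p:q] crosses the boundary
            return i, offset
    return i, offset


# X0 = a, X1 = b, X2 = X0 X1 = ab, X3 = X2 X0 = aba, X4 = X3 X2 = abaab.
rules = ["a", "b", (0, 1), (2, 0), (3, 2)]
size = lengths(rules)
where = locate(rules, size, 2, 5)  # T[2:5] = "baab"
```

Each iteration moves one level down the derivation tree, so the number of iterations is bounded by the height of the tree, which is at most $n$.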

5 Conclusions and Further Work

In this paper we considered strings compressed by straight line programs (SLPs). Since SLP-compressed strings can be exponentially small w.r.t. the uncompressed (original) strings, it is important to process SLP-compressed strings without decompression and in time polynomial in the compressed size $n$. In this paper, we showed the first polynomial time algorithm to compute the longest common substring of two given SLP-compressed strings, which runs in $O(n^4 \log n)$ time and $O(n^3)$ space. In addition, we presented an $O(n^4)$-time $O(n^2)$-space algorithm to compute all maximal palindromes of a given SLP-compressed string. This is faster than the $O(n^4 \log N)$-time solution obtained by combining the results of Gąsieniec et al. [2] and Lifshits [6].

Our future work includes extending our results to computing all squares from a given SLP-compressed string. Gąsieniec et al. [2] claimed that all squares can be found in $O(n^6 \log^5 N)$ time from strings compressed by composition systems, which are a generalization of SLPs. The time complexity would be improved to $O(n^5 \log^3 N)$ in combination with the algorithm by Lifshits [6]. Still, it might be possible to produce a faster solution using the techniques presented in this paper.

References

[1] A. Apostolico, D. Breslauer, and Z. Galil. Parallel detection of all palindromes in a string. Theoretical Computer Science, 141:163–173, 1995.

[2] L. Gąsieniec, M. Karpinski, W. Plandowski, and W. Rytter. Efficient algorithms for Lempel-Ziv encoding. In Proc. 5th Scandinavian Workshop on Algorithm Theory (SWAT’96), volume 1097 of Lecture Notes in Computer Science, pages 392–403. Springer-Verlag, 1996.

[3] S. Inenaga, A. Shinohara, and M. Takeda. An efficient pattern matching algorithm on a subclass of context free grammars. In Proc. Eighth International Conference on Developments in Language Theory (DLT’04), volume 3340 of Lecture Notes in Computer Science, pages 225–236. Springer-Verlag, 2004.

[4] M. Karpinski, W. Rytter, and A. Shinohara. An efficient pattern-matching algorithm for strings with short descriptions. Nordic Journal of Computing, 4:172–186, 1997.

[5] J. Kieffer, E. Yang, G. Nelson, and P. Cosman. Universal lossless compression via multilevel pattern matching. IEEE Transactions on Information Theory, 46(4):1227–1245, 2000.

[6] Y. Lifshits. Processing compressed texts: A tractability border. In Proc. 18th Annual Symposium on Combinatorial Pattern Matching (CPM’07), volume 4580 of Lecture Notes in Computer Science, pages 228–240. Springer-Verlag, 2007.

[7] Y. Lifshits and M. Lohrey. Querying and embedding compressed texts. In Proc. 31st International Symposium on Mathematical Foundations of Computer Science (MFCS’06), volume 4162 of Lecture Notes in Computer Science, pages 681–692. Springer-Verlag, 2006.

[8] W. Matsubara, S. Inenaga, A. Ishino, A. Shinohara, T. Nakamura, and K. Hashimoto. Computing longest common substring and all palindromes from compressed strings. In Proc. 34th International Conference on Current Trends in Theory and Practice of Computer Science (SOFSEM’08), volume 4910 of Lecture Notes in Computer Science, pages 364–375. Springer-Verlag, 2008.

[9] M. Miyazaki, A. Shinohara, and M. Takeda. An improved pattern matching algorithm for strings in terms of straight-line programs. In Proc. 8th Annual Symposium on Combinatorial Pattern Matching (CPM’97), volume 1264 of Lecture Notes in Computer Science, pages 1–11. Springer-Verlag, 1997.

[10] C. G. Nevill-Manning, I. H. Witten, and D. L. Maulsby. Compression by induction of hierarchical grammars. In Data Compression Conference ’94, pages 244–253. IEEE Computer Society, 1994.

[11] W. Plandowski. Testing equivalence of morphisms on context-free languages. In Proc. Second Annual European Symposium on Algorithms (ESA’94), volume 855 of Lecture Notes in Computer Science, pages 460–470. Springer-Verlag, 1994.

[12] W. Rytter. Grammar compression, LZ-encodings, and string algorithms with implicit input. In Proc. 31st International Colloquium on Automata, Languages and Programming (ICALP’04), volume 3142 of Lecture Notes in Computer Science, pages 15–27. Springer-Verlag, 2004.

[13] J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, IT-23(3):337–349, 1977.

[14] J. Ziv and A. Lempel. Compression of individual sequences via variable-length coding. IEEE Transactions on Information Theory, 24(5):530–536, 1978.

Appendix

This appendix gives a complete proof of Lemma 12. To prove it, we first show the following lemma:

Lemma 15 For any variable $X_i$ and $\{(1, q) \mid q \in \langle a, d, t \rangle\} \subseteq \mathit{PPals}(X_i)$, there exist palindromes $u, v$ and a non-negative integer $k$, such that $(uv)^{t+k-1}u$ is a prefix of $X_i$, $|uv| = d$ and $|(uv)^k u| = a$.

Proof. Let $k = \max\{h \mid a - hd > 0\}$ and $a' = a - kd$. It is not difficult to see that $\langle a', d, t + k \rangle \subseteq \mathit{PPals}(X_i)$. Let $w = X_i[1 : d]$, $u = X_i[1 : a']$, and $v = X_i[a' + 1 : d]$. Then, $a = a' + kd = |u| + k|uv| = |(uv)^k u|$.

Since $(1, a' + d) \in \mathit{PPals}(X_i)$, $X_i[d + 1 : a' + d] = u^R$. Also, for any $1 \leq j \leq t + k - 2$, since $(1, a' + (j + 1)d) \in \mathit{PPals}(X_i)$, we have

$$X_i[a' + jd + 1 : a' + (j + 1)d] = w^R.$$

Thus $uvu^R(w^R)^{t+k-2}$ is a prefix of $X_i$.

Since $(1, a') \in \mathit{PPals}(X_i)$, $u$ is a palindrome. Since $(1, a' + d) \in \mathit{PPals}(X_i)$, $uvu^R$ is a palindrome, which implies that $v$ is also a palindrome. Consequently,

$$\begin{aligned} uvu^R(w^R)^{t+k-2} &= uvu((uv)^R)^{t+k-2} = uvu(v^R u^R)^{t+k-2} \\ &= uvu(vu)^{t+k-2} = u(vu)^{t+k-1} = (uv)^{t+k-1}u. \end{aligned}$$

Therefore, $(uv)^{t+k-1}u$ is a prefix of $X_i$. □

In the above lemma, clearly $|uv| = d$ is a period of string $(uv)^t u$.
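The construction of $u$, $v$ and $k$ in the proof of Lemma 15 can be replayed on an explicit string. The following sketch is only an illustration: the function name `lemma15_parts` and the example strings are assumptions, and the caller is expected to supply a progression $\langle a, d, t \rangle$ whose elements are all ending positions of prefix palindromes.

```python
def is_pal(s):
    return s == s[::-1]


def lemma15_parts(x, a, d):
    """Return (u, v, k) as constructed in the proof, for 1-based prefix
    palindrome ends a, a + d, a + 2d, ... of the explicit string x."""
    k = (a - 1) // d      # largest h with a - h*d > 0
    a0 = a - k * d        # a' in the proof, so 0 < a0 <= d
    u = x[:a0]            # X[1 : a']
    v = x[a0:d]           # X[a' + 1 : d]
    return u, v, k


# "abaccabaccaba" has prefix palindromes ending at 3, 8 and 13, i.e. <3, 5, 3>.
x, (a, d, t) = "abaccabaccaba", (3, 5, 3)
u, v, k = lemma15_parts(x, a, d)
claimed_prefix = (u + v) * (t + k - 1) + u
```

In this example $u$ = "aba", $v$ = "cc" and $k = 0$, so the claimed prefix $(uv)^{2}u$ is the whole string.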

We are now ready to prove Lemma 12. (See also Fig. 10.)

Proof. Let us consider $\mathit{Ext}_{X_i}((1, \langle a, d, t \rangle))$. By Lemma 15, $X_r[1 : a + (t - 1)d] = (uv)^{t+k-1}u$, where $|uv| = d$ and $|(uv)^k u| = a$. Let $x$ be the maximum integer such that $X_r[1 : x]$ has a period $|uv|$. Namely, $X_r[1 : x]$ is the longest prefix of $X_r$ that has a period $|uv|$. Then $x$ can be computed by using $\mathit{FM}$ as follows:

$$x = \mathit{FM}(X_r, X_r, d + 1) + d.$$

Let $y$ be the largest integer such that $(uv)^y$ is a prefix of $X_\ell^R$. Then $y$ can be computed by at most 2 calls of $\mathit{FM}$, as follows. First, we call $\mathit{FM}$ to check whether or not the string $uv$ is a prefix of $X_\ell^R$. If $\mathit{FM}(X_r, X_\ell^R, 1) < d$, then $y = \mathit{FM}(X_r, X_\ell^R, 1)$. Otherwise, by Lemma 1 we can compute $y$ by:

$$y = \mathit{FM}(X_\ell^R, X_\ell^R, d + 1) + d.$$


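Let $e_\ell = |X_\ell| - y + 1$ and $e_r = |X_\ell| + x$. Then, clearly string $X_i[e_\ell : e_r]$ has a period $d$. Let

$$\begin{aligned} \langle a, d, t \rangle &= \langle a_1, d, t_1 \rangle \cup \langle a_2, d, t_2 \rangle \cup \langle a_3, d, t_3 \rangle \\ &= \langle a, d, t_1 \rangle \cup \langle a + t_1 d, d, t_2 \rangle \cup \langle a + (t_1 + t_2)d, d, t_3 \rangle, \end{aligned}$$

such that

$$\begin{aligned} |X_\ell| - e_\ell + 1 &< e_r - q_1 \quad \text{for any } q_1 \in \langle a_1, d, t_1 \rangle, \\ |X_\ell| - e_\ell + 1 &= e_r - q_2 \quad \text{for any } q_2 \in \langle a_2, d, t_2 \rangle, \\ |X_\ell| - e_\ell + 1 &> e_r - q_3 \quad \text{for any } q_3 \in \langle a_3, d, t_3 \rangle, \end{aligned}$$

and $t_1 + t_2 + t_3 = t$. For the first and the last arithmetic progressions, we have:

$$\begin{aligned} \mathit{Ext}_{X_i}((1, \langle a_1, d, t_1 \rangle)) &= \{(e_\ell, q_1 + |X_\ell| - e_\ell + 1) \mid q_1 \in \langle a_1, d, t_1 \rangle\} \\ &= \{(e_\ell, \langle a + |X_\ell| - e_\ell + 1, d, t_1 \rangle)\} \quad \text{and} \\ \mathit{Ext}_{X_i}((1, \langle a_3, d, t_3 \rangle)) &= \{(|X_\ell| + e_r - q_3, |X_\ell| + e_r) \mid q_3 \in \langle a_3, d, t_3 \rangle\} \\ &= \{(\langle |X_\ell| + e_r - a - (t - 1)d, d, t_3 \rangle, |X_\ell| + e_r)\}. \end{aligned}$$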

Now let us consider $\langle a + t_1 d, d, t_2 \rangle$. It is easy to see that $t_2 \leq 1$. We consider the case where $t_2 = 1$ and $a_2 = a + t_1 d = q_2$. Notice that the palindrome $(1, a_2)$ can be expanded beyond the periodicity w.r.t. $d$. Thus,

$$\mathit{Ext}_{X_i}((1, a_2)) = \{(|X_\ell| - z + 1, |X_\ell| + a_2 + z)\} = \{(|X_\ell| - z + 1, |X_\ell| + a + t_1 d + z)\},$$

where $z = \mathit{FM}(X_\ell^R, X_r, a_2 + 1) + a_2$. Therefore, the set of expanded palindromes can be represented as follows:

$$\begin{aligned} \mathit{Ext}_{X_i}((1, \langle a, d, t \rangle) \oplus |X_\ell|) = {} & \{(e_\ell, \langle a + |X_\ell| - e_\ell + 1, d, t_1 \rangle)\} \\ & \cup \{(\langle |X_\ell| + e_r - a - (t - 1)d, d, t_3 \rangle, |X_\ell| + e_r)\} \\ & \cup \{(|X_\ell| - z + 1, |X_\ell| + a + t_1 d + z)\}. \end{aligned}$$

Hence $\mathit{Ext}_{X_i}((1, \langle a, d, t \rangle))$ can be represented by at most 2 arithmetic progressions and a palindrome, which in total require constant space. We remark that similar arguments hold for $\mathit{Ext}_{X_i}((\langle a', d', t' \rangle, |X_\ell|))$. □
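The computation of $x$ in the proof above rests on a standard fact: the longest prefix of a string that has period $d$ has length $d$ plus the length of the longest common prefix of the string and its suffix starting at position $d + 1$. The sketch below checks this on explicit strings. Here `lcp` is a hypothetical stand-in for the value that $\mathit{FM}(X_r, X_r, d + 1)$ contributes to the formula for $x$; we do not reproduce the argument convention of $\mathit{FM}$ or its compressed computation.

```python
def lcp(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def longest_periodic_prefix(s, d):
    """Length x of the longest prefix of s with period d, for 1 <= d <= len(s)."""
    return lcp(s, s[d:]) + d


def has_period(s, d):
    """Direct check that s[i] == s[i + d] wherever both positions exist."""
    return all(s[i] == s[i + d] for i in range(len(s) - d))


# "abaabaabc" has period 3 on its first 8 characters only.
x = longest_periodic_prefix("abaabaabc", 3)
```

The same identity underlies the formula for $y$, applied to $X_\ell^R$ instead of $X_r$.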

Fig. 10. Illustration for the proof of Lemma 12.