Computing longest common substring and all palindromes from compressed strings

Wataru Matsubara (1), Shunsuke Inenaga (2), Akira Ishino (1), Ayumi Shinohara (1), Tomoyuki Nakamura (1), Kazuo Hashimoto (1)

(1) Graduate School of Information Sciences, Tohoku University, Japan

(2) Department of Computer Science and Communication Engineering, Kyushu University, Japan

Background and motivations

What is a compressed string algorithm?

compressed text: e)%eARY)(ReJD)OIHOIFEnkkdiwe02kfo)J”LPEPJ9wEOW*# eO…

(decompress)

decompressed text (input text): A palindrome is a symmetric string. Palindromes are interesting in their own right as word puzzles. For example, "I prefer pi", "Borrow or rob?", and "Was it a bar or a bat I saw?" and so on.

(find palindromes)

output: mmisizziprefrepiborroworrobwasitabarorabatisowoo …

One solution would be to decompress the compressed text. However, the decompressed size can be exponentially large with respect to the compressed size.

Goal of algorithms for Compressed strings

• Process the compressed text without decompression.

• Processing time should be polynomial in n.
– The decompressed size can be exponentially large with respect to n.

n : the size of the compressed text

Compression schemes

• run-length encoding
• Lempel-Ziv
• grammar-based compression: Straight Line Program (SLP)

[Rytter2003] The resulting archives of most practical compression methods can be transformed into an SLP generating the same original text.

Definition of Straight Line Program (SLP)

SLP T : a sequence of assignments X1 = expr1; X2 = expr2; …; Xn = exprn;

Xk : variable

exprk : either a (a ∈ Σ) or Xi Xj (i, j < k).

An SLP T for string w is a CFG in Chomsky normal form s.t. L(T) = {w}.

Straight Line Program (SLP): Example

SLP T =

X1 = a

X2 = b

X3 = X1X2

X4 = X3X1

X5 = X3X4

X6 = X5X5

X7 = X4X6

X8 = X7X5

n : the size of the SLP; N : the length of the derived text, N = O(2^n)

[Figure: derivation tree of T, with root X8 and children X7 and X5.]
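As an illustration of the example above, the following Python sketch stores the SLP as a dictionary, computes the length of each variable without expanding it, and decompresses X8 only as a check. The dictionary representation and the names `slp`, `lengths`, and `expand` are assumptions of this sketch, not taken from the slides.

```python
# The example SLP: each variable is either a single character
# or a pair (i, j) of earlier variables, meaning Xk = Xi Xj.
slp = {
    1: "a",
    2: "b",
    3: (1, 2),
    4: (3, 1),
    5: (3, 4),
    6: (5, 5),
    7: (4, 6),
    8: (7, 5),
}

def lengths(slp):
    """Length of the string derived by each variable, without expansion."""
    size = {}
    for k in sorted(slp):  # rules only refer to smaller indices
        rule = slp[k]
        size[k] = 1 if isinstance(rule, str) else size[rule[0]] + size[rule[1]]
    return size

def expand(slp, k):
    """Decompress variable k (can be exponential in n; for checking only)."""
    rule = slp[k]
    if isinstance(rule, str):
        return rule
    return expand(slp, rule[0]) + expand(slp, rule[1])

print(lengths(slp)[8])  # 18
print(expand(slp, 8))   # abaababaababaababa
```

A rule of the form Xk = Xk-1Xk-1 doubles the length at every step, which is why the derived text can reach length 2^(n-1) and decompression is avoided.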

Efficient algorithms for compressed strings

• substring matching
– Karpinski et al (1996) O(n^4 log n) time
– Miyazaki et al (1997) O(n^4) time
– Lifshits (2006) O(n^3) time

• minimum period
– Karpinski et al (1996) O(n^4 log n) time
– Lifshits (2006) O(n^3 log N) time

• all squares
– Gasieniec et al (1994) O(n^6 log^5 N) time

Hardness results

• Subsequence pattern matching
– Lifshits and Lohrey (2006) NP-hard

• Longest common subsequence
– Lifshits and Lohrey (2006) NP-hard

• Hamming distance
– Lifshits (2007) #P-complete

Is there any reasonable comparison measure for compressed strings?

String comparison measures

Example strings:
a b a a b a
a a b b a a

Measure                      Uncompressed text   Compressed text
Hamming distance             O(N)                #P-complete [Lifshits 07]
Longest common subsequence   O(N^2 / log N)      NP-hard [Lifshits and Lohrey 06]
Longest common substring     O(N)                ?? (we solve this problem)
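To make the three measures concrete, here is a minimal Python sketch that evaluates them on the two uncompressed example strings. It uses textbook quadratic dynamic programming rather than the faster methods behind the quoted bounds, and the function names are hypothetical.

```python
def hamming_distance(s, t):
    """Number of mismatching positions; requires equal lengths."""
    if len(s) != len(t):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(s, t))

def lcs_length(s, t):
    """Length of the longest common subsequence (O(N^2) DP)."""
    prev = [0] * (len(t) + 1)
    for a in s:
        cur = [0]
        for j, b in enumerate(t, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def lcstr_length(s, t):
    """Length of the longest common substring (O(N^2) DP)."""
    best = 0
    prev = [0] * (len(t) + 1)
    for a in s:
        cur = [0]
        for j, b in enumerate(t, start=1):
            # cur[j] = length of the longest common suffix of s[..a] and t[..b]
            cur.append(prev[j - 1] + 1 if a == b else 0)
            best = max(best, cur[j])
        prev = cur
    return best

s, t = "abaaba", "aabbaa"
print(hamming_distance(s, t))  # 4
print(lcs_length(s, t))        # 4  (e.g. "abaa")
print(lcstr_length(s, t))      # 3  (e.g. "aab")
```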

Our results

Our Result 1: Longest Common Substring

Problem: Given two SLPs T and S that are descriptions of texts T and S respectively, compute LCStr(T, S).

LCStr(T, S) : the length of the longest common substring of T and S

n : the total size of the input SLPs

Theorem: LCStr(T, S) can be computed in O(n^4 log n) time using O(n^3) space.

Our Result 2: Palindromes

Problem: Given an SLP T, compute (a compressed representation of) the set of all palindromes of T.

n : the size of SLP T

N : the length of the original text T (note that N = O(2^n))

Previous best result: O(n^5 log^4 N) time [Gasieniec et al 1996]

Theorem: The problem can be solved in O(n^4) time using O(n^2) space.
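For comparison, on an uncompressed text the palindromes can be listed with a simple expand-around-center scan. The sketch below assumes that "all palindromes" means the maximal palindrome at each of the 2N-1 centers; that reading, the quadratic method, and the name `maximal_palindromes` are assumptions of this sketch and not the compressed algorithm of the talk.

```python
def maximal_palindromes(text):
    """Return (start, length) of the maximal palindrome at each center.

    Centers 0, 2, 4, ... sit on characters (odd lengths); centers
    1, 3, 5, ... sit between characters (even lengths). Centers whose
    maximal palindrome is empty are skipped.
    """
    result = []
    n = len(text)
    for center in range(2 * n - 1):
        left = center // 2
        right = left + center % 2
        # Grow outward while the two ends agree.
        while left >= 0 and right < n and text[left] == text[right]:
            left -= 1
            right += 1
        length = right - left - 1
        if length > 0:
            result.append((left + 1, length))
    return result

text = "abaab"
start, length = max(maximal_palindromes(text), key=lambda p: p[1])
print(text[start:start + length])  # baab
```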

Details of our algorithm

• Computing longest common substring
• Computing palindromes (omitted in this talk)

Property of common substrings (1/3)

• For each common substring Z of strings S and T, there always exist variables Xi = XlXr and Yj = YLYR such that:
– Z is a common substring of Xi and Yj
– Z contains an overlap between Xl and YR

[Figure: Xi = XlXr and Yj = YLYR aligned so that the common substring Z contains the overlap w between Xl and YR.]

Property of common substrings (2/3)

• For each common substring Z of strings S and T, there always exists a string w such that:
– w is a substring of Z
– w is an overlap of variables of S and T

[Figure: the overlap w between Xl and YR, where Xi = XlXr and Yj = YLYR.]

Property of common substrings (3/3)

• For each common substring Z of strings S and T, there always exists a string w such that:
– Z can be computed by extending w

[Figure: the extend process. Starting from the overlap w between Xl and YR, the match is extended within Xi = XlXr and Yj = YLYR to obtain the common substring Z.]

Overlaps (OL)

For any strings X, Y, OL(X, Y) is the set of the lengths of overlaps of X and Y, where an overlap is a nonempty suffix of X that is also a prefix of Y.

Overlaps: Example

OL("aabaaba", "abaababb") = {1, 3, 6}

[Figure: X = a a b a a b a aligned against Y at each overlap length.]
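The overlap example can be checked on plain strings with a naive quadratic scan. This sketch takes an overlap to be a nonempty suffix of X that equals a prefix of Y, which is the reading consistent with the example; the function name `overlaps` is hypothetical and this is not the compressed method of Karpinski et al.

```python
def overlaps(x, y):
    """Set of lengths k >= 1 such that the last k characters of x
    equal the first k characters of y (naive check on plain strings)."""
    return {
        k
        for k in range(1, min(len(x), len(y)) + 1)
        if x[-k:] == y[:k]
    }

print(sorted(overlaps("aabaaba", "abaababb")))  # [1, 3, 6]
```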

Computing Overlaps [Karpinski et al 1996]

Lemma: For any variables Xi and Xj of SLP T, OL(Xi, Xj) can be represented by O(n) arithmetic progressions.

Theorem: For any SLP T, OL(Xi, Xj) can be computed in a total of O(n^4 log n) time and O(n^3) space.
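To give a feel for the arithmetic-progression representation, the sketch below greedily splits a set of overlap lengths into progressions stored as (first, step, count). The greedy split and the name `to_progressions` are assumptions made for illustration; the lemma itself does not prescribe how the progressions are found.

```python
def to_progressions(values):
    """Greedily split a set of integers into arithmetic progressions,
    each stored as (first, step, count). A single value gets step 0."""
    ordered = sorted(values)
    progressions = []
    i = 0
    while i < len(ordered):
        if i + 1 == len(ordered):
            progressions.append((ordered[i], 0, 1))
            break
        step = ordered[i + 1] - ordered[i]
        j = i + 1
        # Extend the run while consecutive differences stay equal to step.
        while j + 1 < len(ordered) and ordered[j + 1] - ordered[j] == step:
            j += 1
        progressions.append((ordered[i], step, j - i + 1))
        i = j + 1
    return progressions

# Overlap lengths of "ababab" with itself: one progression.
print(to_progressions({2, 4, 6}))  # [(2, 2, 3)]
# Overlap lengths from the OL("aabaaba", "abaababb") example.
print(to_progressions({1, 3, 6}))  # [(1, 2, 2), (6, 0, 1)]
```

Storing three integers per progression keeps the representation small even when a periodic string has very many overlaps.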

How to extend overlaps

a b a ∈ OL(Xl, YR)

a a a b a b a b a a b a b a b b

a a b a a b a a b a b a b a a b a

[Figure: Xi = XlXr and Yj = YLYR aligned on the overlap a b a. Starting from the overlap, the match is extended character by character until a mismatch is found.]

We are not allowed to process the text character by character.
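On uncompressed strings, the extension step would be a plain character-by-character scan, which is exactly what the compressed setting rules out. The sketch below shows that uncompressed version only to make the step concrete; the function names and the small example strings are hypothetical and not taken from the slides.

```python
def common_prefix_len(a, b):
    """Number of leading characters on which a and b agree."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n

def extend_overlap(xl, xr, yl, yr, k):
    """Length of the common substring of xl+xr and yl+yr obtained by
    extending an overlap of length k (suffix of xl == prefix of yr)."""
    assert k > 0 and xl[-k:] == yr[:k], "k must be an overlap of xl and yr"
    # To the right of the overlap: xr versus the rest of yr.
    right = common_prefix_len(xr, yr[k:])
    # To the left of the overlap: the rest of xl versus yl, read backwards.
    left = common_prefix_len(xl[:-k][::-1], yl[::-1])
    return left + k + right

# X = "ccaba" + "bad", Y = "ec" + "ababaf", overlap "aba" (k = 3).
print(extend_overlap("ccaba", "bad", "ec", "ababaf", 3))  # 6  ("cababa")
```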

First-mismatch function [Karpinski et al 1996]

input : SLP variables Xi and Yj, integer k

output : position of first mismatch

[Figure: Xi and Yj aligned at offset k, with the first mismatch marked.]

Lemma: Provided that the sets of overlaps are already computed, FM(Xi, Yj, k) can be computed in O(n log n) time.
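As a point of reference, a naive first-mismatch routine on plain strings might look like the sketch below. The alignment convention (y starts under x at 0-based offset k) and the return value (number of characters matched before the first mismatch) are assumptions of this sketch, and the slides' exact definition of FM may differ. The compressed FM reaches the same answer without reading the strings character by character.

```python
def first_mismatch(x, y, k):
    """Naive stand-in for FM on plain strings.

    Assumption: y[0] is aligned with x[k] (0-based). Returns how many
    characters match before the first mismatch, or before either
    string runs out.
    """
    if not 0 <= k <= len(x):
        raise ValueError("k out of range")
    matched = 0
    while (k + matched < len(x) and matched < len(y)
           and x[k + matched] == y[matched]):
        matched += 1
    return matched

print(first_mismatch("ababab", "abaa", 2))  # 3
```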

Extending overlaps using FM function

Lemma: Extending overlaps can be done by O(n) calls of the FM function.

Computing longest common substring

[Figure: pseudo-code, annotated as follows.]
• O(n^2) items
• O(n) calls of the FM function per item
• O(n log n) time per call

In total, LCStr(S, T) can be computed in O(n^2 × n × n log n) = O(n^4 log n) time.

Conclusions

• Computing longest common substring from compressed strings
– O(n^4 log n) time and O(n^3) space

• Computing all palindromes from compressed strings
– O(n^4) time and O(n^2) space

Thank you for your attention.
