36
Computing longest common substring and all palindromes from compressed strings Wataru Matsubara 1 , Shunsuke Inenaga 2 , Akira Ishino 1 , Ayumi Shinohara 1 , Tomoyuki Nakamura 1 , Kazuo Hashimoto 1 1 Graduate School of Information Sciences Tohoku University, Japan 2 Department of Computer Science and Communication Engineering, Kyushu University, Japan

Computing longest common substring and all palindromes from compressed strings

  • Upload
    tress

  • View
    21

  • Download
    3

Embed Size (px)

DESCRIPTION

Computing longest common substring and all palindromes from compressed strings. Wataru Matsubara 1 , Shunsuke Inenaga 2 , Akira Ishino 1 , Ayumi Shinohara 1 , Tomoyuki Nakamura 1 , Kazuo Hashimoto 1 1 Graduate School of Information Sciences Tohoku University, Japan - PowerPoint PPT Presentation

Citation preview

Page 1: Computing longest common substring  and all palindromes  from compressed strings

Computing longest common substring and all palindromes

from compressed stringsWataru Matsubara1, Shunsuke Inenaga2,

Akira Ishino1, Ayumi Shinohara1, Tomoyuki Nakamura1, Kazuo Hashimoto1

1Graduate School of Information SciencesTohoku University, Japan

2Department of Computer Science and Communication Engineering,

Kyushu University, Japan

Page 2: Computing longest common substring  and all palindromes  from compressed strings

Background and motivations

Page 3: Computing longest common substring  and all palindromes  from compressed strings

What is compressed string algorithm?

A palindrome is a symmetric string. It is interesting on their own as word puzzles.For example, “I prefer pi“, ”Borrow or rob?“, and “Was it a bar or a bat I saw?“ and so on. :

A palindrome is a symmetric string. It is interesting on their own as word puzzles.For example, “I prefer pi“, ”Borrow or rob?“, and “Was it a bar or a bat I saw?“ and so on. :

input text

Page 4: Computing longest common substring  and all palindromes  from compressed strings

What is compressed string algorithm?

A palindrome is a symmetric string. It is interesting on their own as word puzzles.For example, “I prefer pi“, ”Borrow or rob?“, and “Was it a bar or a bat I saw?“ and so on. :

A palindrome is a symmetric string. It is interesting on their own as word puzzles.For example, “I prefer pi“, ”Borrow or rob?“, and “Was it a bar or a bat I saw?“ and so on. :

input text

find palindromes

output

mmisizziprefrepiborroworrobwasitabarorabatisowoo :

Page 5: Computing longest common substring  and all palindromes  from compressed strings

What is compressed string algorithm?

A palindrome is a symmetric string. It is interesting on their own as word puzzles.For example, “I prefer pi“, ”Borrow or rob?“, and “Was it a bar or a bat I saw?“ and so on. :

A palindrome is a symmetric string. It is interesting on their own as word puzzles.For example, “I prefer pi“, ”Borrow or rob?“, and “Was it a bar or a bat I saw?“ and so on. :

find palindromes

output

mmisizziprefrepiborroworrobwasitabarorabatisowoo :

decompress

e)%eARY)(ReJD)OIHOIFEnkkdiwe02kfo)J”LPEPJ9wEOW*# eO…

e)%eARY)(ReJD)OIHOIFEnkkdiwe02kfo)J”LPEPJ9wEOW*# eO…

compressed text One solution would be to decompress the compressed text.

The decompressed size can be exponentially large with respect to the compressed size.

decompressed text

Page 6: Computing longest common substring  and all palindromes  from compressed strings

Goal of algorithms for Compressed strings

• Process the compressed text without decompression.

• Processing time should be polynomial in n.– Decompressed size can be exponentially large

with respect to n.

n : the size of compressed text

Page 7: Computing longest common substring  and all palindromes  from compressed strings

Compressed schemes

• run-length encoding• Lempel-Ziv• grammar based compression :

Straight Line Program

[Rytter2003] Resulting achieve of most practical

compression methods can be transformed into SLP generating the same original text.

[Rytter2003] Resulting achieve of most practical

compression methods can be transformed into SLP generating the same original text.

Page 8: Computing longest common substring  and all palindromes  from compressed strings

SLP TSLP T

T : sequence of assignments     X1 = expr1 ; X2 = expr2; … ; Xn = exprn;

Xk : variable,a ( a   Xi Xj ( i, j < k ).

exprk :

Definition of Straight Line Program (SLP)

SLP T for string w is a CFG in Chomsky normal form s.t. L(T) = {w}.

Page 9: Computing longest common substring  and all palindromes  from compressed strings

Straight Line Program (SLP)Example

X1 = a

X2 = b

X3 = X1X2

X4 = X3X1

X5 = X3X4

X6 = X5X5

X7 = X4X6

X8 = X7X5

n

NN = O(2n)

T =

SLP

Page 10: Computing longest common substring  and all palindromes  from compressed strings

Straight Line Program (SLP)Example

X1 = a

X2 = b

X3 = X1X2

X4 = X3X1

X5 = X3X4

X6 = X5X5

X7 = X4X6

X8 = X7X5

n

NN = O(2n)

T =

SLP

X8

X7 X5

Page 11: Computing longest common substring  and all palindromes  from compressed strings

Efficient algorithms for compressed strings

• substring matching– Karpinski et al (1996) O(n4logn) time

– Miyazaki et al (1997) O(n4) time

– Lifshits (2006) O(n3) time

• minimum period– Karpinski et al (1996) O(n4logn) time

– Lifshits (2006) O(n3logN) time

• all squares– Gasieniec et al (1994) O(n6log5N) time

Page 12: Computing longest common substring  and all palindromes  from compressed strings

Hardness results

• Subsequence pattern matching– Lifshits and Lohrey (2006) NP-hard  

• Longest common subsequence– Lifshits and Lohrey (2006)     NP-hard

 • Hamming distance

– Lifshits (2007)     #P-complete

Is there any reasonable comparison measurement for compressed strings?

Page 13: Computing longest common substring  and all palindromes  from compressed strings

a b a a b a

a a b b a a

String comparison measures

a b a a b a

a a b b a a

Hamming distance

Longest common subsequence

Longest common substring

#P-comprete[Lifshits 07]

NP-hard[Lifshits and Lohrey06]

??

O(N)uncompressed text

compressed text

O(N2 / logN) O(N)

we solve this problemwe solve this problem

a b a a b a

a a b b a a

Page 14: Computing longest common substring  and all palindromes  from compressed strings

Our results

Page 15: Computing longest common substring  and all palindromes  from compressed strings

ProblemGiven two SLP T and S that are descriptions of text T and S respectively, compute LCStr(T, S).

LCStr(T, S) : the length of longest common substring of T and Sn : the total size of the input SLP

Our Result1: Longest Common Substring

TheoremLCStr(T, S) can be computed in OO((nn44loglognn)) timeusing OO((nn33)) space.

TheoremLCStr(T, S) can be computed in OO((nn44loglognn)) timeusing OO((nn33)) space.

Page 16: Computing longest common substring  and all palindromes  from compressed strings

ProblemGiven SLP T, compute (compressed representations) the set of all palindromes of T.

n : the size of SLP TN : the length of original text T (note that N = O(2n)

Previous best result: O(n5log4N) time

Our Result2: palindromes

[Gasienec et al 1996]

TheoremThe problem can be solved in OO((nn44)) time using OO((nn22)) space.

TheoremThe problem can be solved in OO((nn44)) time using OO((nn22)) space.

Page 17: Computing longest common substring  and all palindromes  from compressed strings

Details of our algorithm

Computing longest common substringComputing palindromes (omitted in this talk)

Page 18: Computing longest common substring  and all palindromes  from compressed strings

Property of common substrings (1/3)

• For each common substring Z of string S and T, there always exists a variable Xi = XlXr and Yj = YLYR

such that:– Z is a common substring of Xi and Yj

– Z contains an overlap between Xl and YR

common substringcommon substring

ZZ

ZZ

XiXl Xr

Yj

YL YR

ww

OverlapOverlap

Page 19: Computing longest common substring  and all palindromes  from compressed strings

Property of common substrings (2/3)

• For each common substring Z of string S and T, there always exists a string w such that:– w is a substring of Z– w is an overlap of variables of S and T

ww

XiXl Xr

Yj

YL YR

OverlapOverlap

Page 20: Computing longest common substring  and all palindromes  from compressed strings

Property of common substrings (3/3)

• For each common substring Z of string S and T, there always exists a string w such that:– Z can be calculate by expanding w

common substringcommon substring

wwZZ

ZZ

XiXl Xr

Yj

YL YR Extend ProcessExtend Process

OverlapOverlap

Page 21: Computing longest common substring  and all palindromes  from compressed strings

For any strings X, Y,

Overlaps (OL)

the set of the lengths of overlaps of X and Y.

the set of the lengths of overlaps of X and Y.

X

Y

Page 22: Computing longest common substring  and all palindromes  from compressed strings

a a b a a b a

OverlapsExample

OL(“aabaaba”, “abaababb”) = {1, 3, 6} Xl

a b a a b a a b a bYR

a b a a b a a b a bYR

a b a a b a a b a bYR

Page 23: Computing longest common substring  and all palindromes  from compressed strings

Computing Overlaps[Karpinski et al 1996]

LemmaFor any variables Xi and Xj of SLP T, OL(Xi, Xj) can be represented by O(n) arithmetic progressions.

Xi

Yj

Theorem For any SLP T, OL(Xi, Xj) can be computed in total of O(n4logn) time and O(n3) space.

Page 24: Computing longest common substring  and all palindromes  from compressed strings

a b a ∈ OL(Xl, YR)

How to extend overlaps

a a a b a b a b a a b a b a b b

a a b a a b a a b a b a b a a b a

Xl Xr

Xi

Yj

YRYL

Page 25: Computing longest common substring  and all palindromes  from compressed strings

a b a ∈ OL(Xl, YR)

How to extend overlaps

a a a b a b a b a a b a b a b b

a a b a a b a a b a b a b a a b a

Xl Xr

Xi

Yj

YRYL

matchmatch

Page 26: Computing longest common substring  and all palindromes  from compressed strings

a b a ∈ OL(Xl, YR)

How to extend overlaps

a a a b a b a b a a b a b a b b

a a b a a b a a b a b a b a a b a

Xl Xr

Xi

Yj

YRYL

matchmatch

Page 27: Computing longest common substring  and all palindromes  from compressed strings

a b a ∈ OL(Xl, YR)

How to extend overlaps

a a a b a b a b a a b a b a b b

a a b a a b a a b a b a b a a b a

Xl Xr

Xi

Yj

YRYL

matchmatch

Page 28: Computing longest common substring  and all palindromes  from compressed strings

a b a ∈ OL(Xl, YR)

How to extend overlaps

a a a b a b a b a a b a b a b b

a a b a a b a a b a b a b a a b a

Xl Xr

Xi

Yj

YRYL

mismatchmismatch

Page 29: Computing longest common substring  and all palindromes  from compressed strings

How to extend overlaps

a a a b a b a b a a b a b a b b

a a b a a b a a b a b a b a a b a

Xl Xr

Xi

Yj

YrYl

a b a ∈ OL(Xl, YR)

mismatchmismatch

Page 30: Computing longest common substring  and all palindromes  from compressed strings

How to extend overlaps

a a a b a b a b a a b a b a b b

a a b a a b a a b a b a b a a b a

Xl Xr

Xi

Yj

YrYl

a b a ∈ OL(Xl, YR)

We are not allowed to process character by character.

We are not allowed to process character by character.

Page 31: Computing longest common substring  and all palindromes  from compressed strings

First-mismatch function[Karpinski et al 1996]

input : SLP variables Xi and Yj , integer k

output : position of first mismatch

Mismatch

Mismatch

k Yj

a b a b a a b a b a a b

Xi

a b a b a b a a b a

p p [p]}-1

Page 32: Computing longest common substring  and all palindromes  from compressed strings

First-mismatch function[Karpinski et al 1996]

Lemma Provided that the sets of overlaps are already computed, FM(Xi, Yj, k) can be computed in O(nlogn) time.

Page 33: Computing longest common substring  and all palindromes  from compressed strings

Extending overlaps using FM function

Lemma Extending overlaps can be done by O(n) calls of FM function.

Page 34: Computing longest common substring  and all palindromes  from compressed strings

O(n2) items O(n2) items

pseudo-code Computing longest common substring

O(n) calls of FM function.O(n) calls of FM function.

O(nlogn) timesO(nlogn) times

Totally, LCStr (S, T) can be computed inO(n2×n×nlogn ) = O ( n4logn ) time.

Page 35: Computing longest common substring  and all palindromes  from compressed strings

Conclusions

• Computing longest common substring from compressed string– O(n4logn) time and O(n3) space

• Computing all palindromes from compressed string– O(n4) time and O(n2) space

Page 36: Computing longest common substring  and all palindromes  from compressed strings

Thank you for your attention.