Upload
tress
View
21
Download
3
Tags:
Embed Size (px)
DESCRIPTION
Computing longest common substring and all palindromes from compressed strings. Wataru Matsubara 1 , Shunsuke Inenaga 2 , Akira Ishino 1 , Ayumi Shinohara 1 , Tomoyuki Nakamura 1 , Kazuo Hashimoto 1 1 Graduate School of Information Sciences Tohoku University, Japan - PowerPoint PPT Presentation
Citation preview
Computing longest common substring and all palindromes
from compressed stringsWataru Matsubara1, Shunsuke Inenaga2,
Akira Ishino1, Ayumi Shinohara1, Tomoyuki Nakamura1, Kazuo Hashimoto1
1Graduate School of Information SciencesTohoku University, Japan
2Department of Computer Science and Communication Engineering,
Kyushu University, Japan
Background and motivations
What is compressed string algorithm?
A palindrome is a symmetric string. It is interesting on their own as word puzzles.For example, “I prefer pi“, ”Borrow or rob?“, and “Was it a bar or a bat I saw?“ and so on. :
A palindrome is a symmetric string. It is interesting on their own as word puzzles.For example, “I prefer pi“, ”Borrow or rob?“, and “Was it a bar or a bat I saw?“ and so on. :
input text
What is compressed string algorithm?
A palindrome is a symmetric string. It is interesting on their own as word puzzles.For example, “I prefer pi“, ”Borrow or rob?“, and “Was it a bar or a bat I saw?“ and so on. :
A palindrome is a symmetric string. It is interesting on their own as word puzzles.For example, “I prefer pi“, ”Borrow or rob?“, and “Was it a bar or a bat I saw?“ and so on. :
input text
find palindromes
output
mmisizziprefrepiborroworrobwasitabarorabatisowoo :
What is compressed string algorithm?
A palindrome is a symmetric string. It is interesting on their own as word puzzles.For example, “I prefer pi“, ”Borrow or rob?“, and “Was it a bar or a bat I saw?“ and so on. :
A palindrome is a symmetric string. It is interesting on their own as word puzzles.For example, “I prefer pi“, ”Borrow or rob?“, and “Was it a bar or a bat I saw?“ and so on. :
find palindromes
output
mmisizziprefrepiborroworrobwasitabarorabatisowoo :
decompress
e)%eARY)(ReJD)OIHOIFEnkkdiwe02kfo)J”LPEPJ9wEOW*# eO…
e)%eARY)(ReJD)OIHOIFEnkkdiwe02kfo)J”LPEPJ9wEOW*# eO…
compressed text One solution would be to decompress the compressed text.
The decompressed size can be exponentially large with respect to the compressed size.
decompressed text
Goal of algorithms for Compressed strings
• Process the compressed text without decompression.
• Processing time should be polynomial in n.– Decompressed size can be exponentially large
with respect to n.
n : the size of compressed text
Compressed schemes
• run-length encoding• Lempel-Ziv• grammar based compression :
Straight Line Program
[Rytter2003] Resulting achieve of most practical
compression methods can be transformed into SLP generating the same original text.
[Rytter2003] Resulting achieve of most practical
compression methods can be transformed into SLP generating the same original text.
SLP TSLP T
T : sequence of assignments X1 = expr1 ; X2 = expr2; … ; Xn = exprn;
Xk : variable,a ( a Xi Xj ( i, j < k ).
exprk :
Definition of Straight Line Program (SLP)
SLP T for string w is a CFG in Chomsky normal form s.t. L(T) = {w}.
Straight Line Program (SLP)Example
X1 = a
X2 = b
X3 = X1X2
X4 = X3X1
X5 = X3X4
X6 = X5X5
X7 = X4X6
X8 = X7X5
n
NN = O(2n)
T =
SLP
Straight Line Program (SLP)Example
X1 = a
X2 = b
X3 = X1X2
X4 = X3X1
X5 = X3X4
X6 = X5X5
X7 = X4X6
X8 = X7X5
n
NN = O(2n)
T =
SLP
X8
X7 X5
Efficient algorithms for compressed strings
• substring matching– Karpinski et al (1996) O(n4logn) time
– Miyazaki et al (1997) O(n4) time
– Lifshits (2006) O(n3) time
• minimum period– Karpinski et al (1996) O(n4logn) time
– Lifshits (2006) O(n3logN) time
• all squares– Gasieniec et al (1994) O(n6log5N) time
Hardness results
• Subsequence pattern matching– Lifshits and Lohrey (2006) NP-hard
• Longest common subsequence– Lifshits and Lohrey (2006) NP-hard
• Hamming distance
– Lifshits (2007) #P-complete
Is there any reasonable comparison measurement for compressed strings?
a b a a b a
a a b b a a
String comparison measures
a b a a b a
a a b b a a
Hamming distance
Longest common subsequence
Longest common substring
#P-comprete[Lifshits 07]
NP-hard[Lifshits and Lohrey06]
??
O(N)uncompressed text
compressed text
O(N2 / logN) O(N)
we solve this problemwe solve this problem
a b a a b a
a a b b a a
Our results
ProblemGiven two SLP T and S that are descriptions of text T and S respectively, compute LCStr(T, S).
LCStr(T, S) : the length of longest common substring of T and Sn : the total size of the input SLP
Our Result1: Longest Common Substring
TheoremLCStr(T, S) can be computed in OO((nn44loglognn)) timeusing OO((nn33)) space.
TheoremLCStr(T, S) can be computed in OO((nn44loglognn)) timeusing OO((nn33)) space.
ProblemGiven SLP T, compute (compressed representations) the set of all palindromes of T.
n : the size of SLP TN : the length of original text T (note that N = O(2n)
Previous best result: O(n5log4N) time
Our Result2: palindromes
[Gasienec et al 1996]
TheoremThe problem can be solved in OO((nn44)) time using OO((nn22)) space.
TheoremThe problem can be solved in OO((nn44)) time using OO((nn22)) space.
Details of our algorithm
Computing longest common substringComputing palindromes (omitted in this talk)
Property of common substrings (1/3)
• For each common substring Z of string S and T, there always exists a variable Xi = XlXr and Yj = YLYR
such that:– Z is a common substring of Xi and Yj
– Z contains an overlap between Xl and YR
common substringcommon substring
ZZ
ZZ
XiXl Xr
Yj
YL YR
ww
OverlapOverlap
Property of common substrings (2/3)
• For each common substring Z of string S and T, there always exists a string w such that:– w is a substring of Z– w is an overlap of variables of S and T
ww
XiXl Xr
Yj
YL YR
OverlapOverlap
Property of common substrings (3/3)
• For each common substring Z of string S and T, there always exists a string w such that:– Z can be calculate by expanding w
common substringcommon substring
wwZZ
ZZ
XiXl Xr
Yj
YL YR Extend ProcessExtend Process
OverlapOverlap
For any strings X, Y,
Overlaps (OL)
the set of the lengths of overlaps of X and Y.
the set of the lengths of overlaps of X and Y.
X
Y
a a b a a b a
OverlapsExample
OL(“aabaaba”, “abaababb”) = {1, 3, 6} Xl
a b a a b a a b a bYR
a b a a b a a b a bYR
a b a a b a a b a bYR
Computing Overlaps[Karpinski et al 1996]
LemmaFor any variables Xi and Xj of SLP T, OL(Xi, Xj) can be represented by O(n) arithmetic progressions.
Xi
Yj
Theorem For any SLP T, OL(Xi, Xj) can be computed in total of O(n4logn) time and O(n3) space.
a b a ∈ OL(Xl, YR)
How to extend overlaps
a a a b a b a b a a b a b a b b
a a b a a b a a b a b a b a a b a
Xl Xr
Xi
Yj
YRYL
a b a ∈ OL(Xl, YR)
How to extend overlaps
a a a b a b a b a a b a b a b b
a a b a a b a a b a b a b a a b a
Xl Xr
Xi
Yj
YRYL
matchmatch
a b a ∈ OL(Xl, YR)
How to extend overlaps
a a a b a b a b a a b a b a b b
a a b a a b a a b a b a b a a b a
Xl Xr
Xi
Yj
YRYL
matchmatch
a b a ∈ OL(Xl, YR)
How to extend overlaps
a a a b a b a b a a b a b a b b
a a b a a b a a b a b a b a a b a
Xl Xr
Xi
Yj
YRYL
matchmatch
a b a ∈ OL(Xl, YR)
How to extend overlaps
a a a b a b a b a a b a b a b b
a a b a a b a a b a b a b a a b a
Xl Xr
Xi
Yj
YRYL
mismatchmismatch
How to extend overlaps
a a a b a b a b a a b a b a b b
a a b a a b a a b a b a b a a b a
Xl Xr
Xi
Yj
YrYl
a b a ∈ OL(Xl, YR)
mismatchmismatch
How to extend overlaps
a a a b a b a b a a b a b a b b
a a b a a b a a b a b a b a a b a
Xl Xr
Xi
Yj
YrYl
a b a ∈ OL(Xl, YR)
We are not allowed to process character by character.
We are not allowed to process character by character.
First-mismatch function[Karpinski et al 1996]
input : SLP variables Xi and Yj , integer k
output : position of first mismatch
Mismatch
Mismatch
k Yj
a b a b a a b a b a a b
Xi
a b a b a b a a b a
p p [p]}-1
First-mismatch function[Karpinski et al 1996]
Lemma Provided that the sets of overlaps are already computed, FM(Xi, Yj, k) can be computed in O(nlogn) time.
Extending overlaps using FM function
Lemma Extending overlaps can be done by O(n) calls of FM function.
O(n2) items O(n2) items
pseudo-code Computing longest common substring
O(n) calls of FM function.O(n) calls of FM function.
O(nlogn) timesO(nlogn) times
Totally, LCStr (S, T) can be computed inO(n2×n×nlogn ) = O ( n4logn ) time.
Conclusions
• Computing longest common substring from compressed string– O(n4logn) time and O(n3) space
• Computing all palindromes from compressed string– O(n4) time and O(n2) space
Thank you for your attention.