DICTIONARY MATCHING WITH ONE GAP
Amihood Amir, Avivit Levy, Ely Porat and B. Riva Shalom
CPM 2014
OUTLINE
The DMG (Dictionary Matching with one Gap) Problem
Motivation
Previous Work
Bidirectional Suffix Trees Solution
Lookup Table addition
Open Problems
THE DMG PROBLEM
A gapped pattern is a pattern P of the form: P_1{α_1,β_1} P_2{α_2,β_2} … P_{k-1}{α_{k-1},β_{k-1}} P_k
Each P_j is a string over the alphabet Σ, and {α_j,β_j} stands for a sequence of at least α_j and at most β_j don't cares (written @).
Example: aba{3,6}cbb stands for aba@@@cbb, aba@@@@cbb, aba@@@@@cbb and aba@@@@@@cbb.
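As a small illustration of the definition, the sketch below lists every fixed-length form of a gapped pattern, writing each don't care as @. The function name `expand` and its argument layout (a list of subpatterns plus a list of (min, max) gap bounds) are assumptions made for this example, not notation from the talk.

```python
def expand(subpatterns, gaps, dont_care="@"):
    """List every fixed-length form of P_1{a_1,b_1}P_2...P_k, shortest gaps first."""
    forms = [subpatterns[0]]
    for (lo, hi), sub in zip(gaps, subpatterns[1:]):
        # Append each allowed number of don't cares, then the next subpattern.
        forms = [f + dont_care * g + sub for f in forms for g in range(lo, hi + 1)]
    return forms


for form in expand(["aba", "cbb"], [(3, 6)]):
    print(form)
```

For aba{3,6}cbb this prints the four forms shown on the slide.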
THE DMG PROBLEM
The DMG problem is:
Preprocess: A dictionary D of d gapped patterns P1, …, Pd over alphabet Σ.
Query: A text T of length n over alphabet Σ.
Output: All locations in T where a dictionary gapped pattern ends.
We focus on DMG with a single gap.
EXAMPLE
Dictionary: P1 = aba {3,6} cbb
P2 = ab {3,6} bbac
P3 = aa {3,6} ac
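Query text:
position: 1 2 3 4 5 6 7 8 9 10 11
text:     a b a a b a c b b a c
[Figure: occurrences of the subpatterns P_{1,1}, P_{1,2}, P_{2,1}, P_{2,2}, P_{3,1} and P_{3,2} marked on the text]
First = ∪_{1≤i≤d} {P_{i,1}}    Second = ∪_{1≤i≤d} {P_{i,2}}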
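To make the expected output concrete, here is a brute-force sketch run on the dictionary and text of this example. It is not the algorithm of the talk, only a reference check; the name `dmg_naive` and the tuple layout (first subpattern, minimum gap, maximum gap, second subpattern) are assumptions made for the illustration.

```python
def dmg_naive(dictionary, text):
    """Return sorted (end_position, pattern_index) pairs, both 1-based."""
    hits = set()
    for i, (first, lo, hi, second) in enumerate(dictionary, start=1):
        for s in range(len(text) - len(first) + 1):
            if not text.startswith(first, s):
                continue
            # Try every allowed gap length after this occurrence of the first subpattern.
            for gap in range(lo, hi + 1):
                start2 = s + len(first) + gap
                if text.startswith(second, start2):
                    hits.add((start2 + len(second), i))
    return sorted(hits)


DICTIONARY = [("aba", 3, 6, "cbb"), ("ab", 3, 6, "bbac"), ("aa", 3, 6, "ac")]
TEXT = "abaabacbbac"
print(dmg_naive(DICTIONARY, TEXT))  # [(9, 1), (11, 2), (11, 3)]
```

On this text P1 ends at location 9 (gap of 3), while P2 and P3 both end at location 11 (gap of 5 each).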
MOTIVATION
Computational biology.
Renewed interest due to cyber security: network intrusion detection systems perform protocol analysis, content searching and content matching to detect harmful software.
Malware may appear in several packets!
PREVIOUS WORK
The gapped pattern matching problem has been studied for a few decades, e.g. [Myers, JACM 1992], [Navarro & Raffinot, Algorithmica 2004], [Bille & Thorup, ICALP 2009], [Bille & Thorup, SODA 2010], [Morgante et al., JCB 2005], [Rahman et al., COCOON 2006], [Bille et al., TCS 2012].
The DMG problem has not been studied enough! [Kucherov & Rusinowitch, TCS 1997] and [Zhang et al., IPL 2010]: no bounds on the length of the gap.
BI-DIRECTIONAL SUFFIX TREES ALGORITHM
Gapped pattern: a b{3,6}b b a c
Query: a b a a b a c b b a c
BI-DIRECTIONAL SUFFIX TREES ALGORITHM
Idea: view as [Amir et al., JAL 2000]
Gapped patterns:
P1 = a b a {3,6} a b a c
P2 = a b a {3,6} b b a
P3 = a b {3,6} b a a
Query: a b a a b a c b b a c
Use suffix tree T_S of Second.
Use suffix tree T_F^R of First^R (First reversed).
BI-DIRECTIONAL SUFFIX TREES ALGORITHM
For each text location l:
  Insert t_l t_{l+1} … t_n into T_S (reaching the node h) to find the labels on the path to h. This finds every P_{i,2} starting at location l.
  For f = l-α-1 down to l-β-1:
    Insert t_f t_{f-1} … t_1 into T_F^R (reaching the node g) to find the labels on the path to g. This finds every P_{i,1} ending at location f.
    Output the intersection (for end locations).
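The scan above can be sketched as follows. This is a simplified illustration, not the data structure of the talk: plain dictionary tries stand in for the suffix trees T_S and T_F^R, a set intersection stands in for the range queries, and all patterns are assumed to share the same gap bounds alpha and beta. All function names are hypothetical.

```python
def build_trie(words):
    """Plain dict trie; node["ids"] lists the words that end at that node."""
    root = {"ids": [], "next": {}}
    for idx, word in enumerate(words, start=1):
        node = root
        for ch in word:
            node = node["next"].setdefault(ch, {"ids": [], "next": {}})
        node["ids"].append(idx)
    return root


def ids_on_path(trie, chars):
    """Walk the trie along chars and collect the ids of every word passed."""
    found, node = set(), trie
    for ch in chars:
        node = node["next"].get(ch)
        if node is None:
            break
        found.update(node["ids"])
    return found


def dmg_bidirectional(firsts, seconds, alpha, beta, text):
    """Return sorted (end_location, pattern_index) pairs, both 1-based."""
    t_s = build_trie(seconds)                     # stand-in for T_S
    t_fr = build_trie([p[::-1] for p in firsts])  # stand-in for T_F^R
    hits = set()
    for l in range(1, len(text) + 1):             # l: start of P_{i,2}
        second_ids = ids_on_path(t_s, text[l - 1:])
        if not second_ids:
            continue
        for f in range(l - alpha - 1, l - beta - 2, -1):  # f: end of P_{i,1}
            if f < 1:
                break
            first_ids = ids_on_path(t_fr, reversed(text[:f]))
            for i in first_ids & second_ids:
                hits.add((l + len(seconds[i - 1]) - 1, i))
    return sorted(hits)


print(dmg_bidirectional(["aba", "ab", "aa"], ["cbb", "bbac", "ac"], 3, 6, "abaabacbbac"))
```

On the dictionary P1 = aba{3,6}cbb, P2 = ab{3,6}bbac, P3 = aa{3,6}ac this reports P1 ending at location 9 and P2, P3 ending at location 11.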
BI-DIRECTIONAL SUFFIX TREES ALGORITHM - INTERSECTION
Patterns: {(1,4),(2,9),(3,7),…,(6,5),…}
Range: [1,9]
Range: [2,7]
[Figure: the trees T_S and T_F^R with the nodes h and g marked; node labels 1, 3, 6, 9 and 2, 5, 7]
BI-DIRECTIONAL SUFFIX TREES ALGORITHM (CONTINUED)
Intersection via range queries:
Range: [2,7]
Range: [1,9]
Points: (1,4), (3,7), (6,5), (8,8), (2,9)
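The intersection step amounts to reporting the grid points that fall inside a rectangle. The sketch below uses a linear scan in place of an orthogonal range-reporting structure, so it shows only what is reported, not how fast. Which range belongs to which coordinate is not recoverable from the slide; the code assumes the first coordinate is restricted to [1,9] and the second to [2,7].

```python
def range_report(points, x_range, y_range):
    """Naive scan: keep the points whose coordinates fall in both closed ranges."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    return [(x, y) for x, y in points if x_lo <= x <= x_hi and y_lo <= y <= y_hi]


POINTS = [(1, 4), (3, 7), (6, 5), (8, 8), (2, 9)]
# Assumption: first coordinate in [1, 9], second coordinate in [2, 7].
print(range_report(POINTS, (1, 9), (2, 7)))  # [(1, 4), (3, 7), (6, 5)]
```

Under that assumption the points (8,8) and (2,9) are excluded because their second coordinate lies outside [2,7].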
TIME & SPACE
Preprocessing Time:
Dictionary segments suffix tree and reverse suffix tree: O(|D|)
Preprocessing grid for range queries: O(d log d). [Chan et al., SoCG 2011]
Preprocessing Space:
Dictionary segments suffix tree and reverse suffix tree: O(|D|)
Space for grid: O(d log d). [Chan et al., SoCG 2011]
TIME & SPACE
Query Time:
For each end text location, we try every gap size: a factor of (β-α).
The number of range queries is the number of vertical paths in a given path: O(log^2 min{d, log |D|}).
A range query costs: O(log log d + occ). [Chan et al., SoCG 2011]
Total: O(n(β-α) log log d log^2 min{d, log |D|} + occ).
LOOKUP TABLE ALGORITHM
Idea: Instead of using range queries in a grid to compute the intersection, we use a pre-computed lookup table.
Enables intersection in O(occ) time.
Total query time becomes: O(n(β-α) + occ).
LOOKUP TABLE ALGORITHM
Inter[g,h] = all i s.t. P_{i,1}^R appears on the path from the root of T_F^R to node g and P_{i,2} appears on the path from the root of T_S to node h.
P1=(1,4), P2=(2,9), P3=(3,7), P4=(3,2), …, P6=(6,5), P7=(9,6)
Inter[3,5] = {4}
Inter[3,7] = {3,4}
Inter[6,7] = {3,4,6}
Inter[9,7] = {3,4,6}
[Figure: the trees T_F^R and T_S with the nodes g and h marked; node labels 1, 3, 6, 9 and 2, 5, 7]
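One way to fill such a table is a recurrence over parent pointers: Inter[g,h] is the union of Inter[parent(g),h], Inter[g,parent(h)] and the patterns located exactly at (g,h). The sketch below applies it to the slide's patterns. The tree shapes are assumptions chosen to be consistent with the Inter values shown (path root, 3, 6, 9 in T_F^R and path root, 2, 5, 7 in T_S, all other nodes hanging off the root), the elided pattern P5 is left out, and the root id 0 is hypothetical. It is not claimed to be the DP of the talk.

```python
from itertools import product

ROOT = 0  # hypothetical id shared by the roots of both trees


def build_inter(parent_f, parent_s, patterns):
    """Inter[g, h] = ids i whose node pair (g_i, h_i) lies on both root paths."""
    at = {}
    for i, point in patterns.items():
        at.setdefault(point, set()).add(i)
    inter = {}

    def fill(g, h):
        if g == ROOT or h == ROOT:
            return frozenset()  # no subpattern ends at a root
        if (g, h) not in inter:
            inter[g, h] = (fill(parent_f[g], h) | fill(g, parent_s[h])
                           | frozenset(at.get((g, h), ())))
        return inter[g, h]

    for g, h in product(parent_f, parent_s):
        fill(g, h)
    return inter


# Assumed tree shapes (child -> parent), consistent with the values on the slide.
PARENT_F = {1: 0, 2: 0, 3: 0, 6: 3, 9: 6}        # T_F^R
PARENT_S = {2: 0, 4: 0, 6: 0, 9: 0, 5: 2, 7: 5}  # T_S
PATTERNS = {1: (1, 4), 2: (2, 9), 3: (3, 7), 4: (3, 2), 6: (6, 5), 7: (9, 6)}

INTER = build_inter(PARENT_F, PARENT_S, PATTERNS)
for key in [(3, 5), (3, 7), (6, 7), (9, 7)]:
    print(key, sorted(INTER[key]))
```

With these assumed shapes the four entries come out as {4}, {3,4}, {3,4,6} and {3,4,6}, matching the slide. A query then only needs to read Inter[g,h] for the nodes reached in the two trees.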
LOOKUP TABLE ALG.
P1=(1,4), P2=(2,9), P3=(3,7), P4=(3,2), …, P6=(6,5), P7=(9,6)
Inter[3,5] = {4}
Inter[3,7] = {3,4}
Inter[6,7] = {3,4,6}
[Figure: the Inter lookup table drawn as a two-dimensional array indexed by the nodes g and h]
LOOKUP TABLE ALGORITHM
Preprocessing:
Time: The table can be computed using DP in time O(d^2 ovr + |D|), where ovr is the number of subpatterns that include another subpattern as a prefix or suffix.
Space: O(d^2 + |D|).
Query time: O(n(β-α) + occ).
OUR RESULTS
Bi-directional suffix trees & range queries:
Preprocessing time: O(d log d + |D|).
Space: O(d log d + |D|).
Query time: O(n(β-α) log log d log^2 min{d, log |D|} + occ).
Bi-directional suffix trees & lookup table:
Preprocessing time: O(d^2 ovr + |D|).
Space: O(d^2 + |D|).
Query time: O(n(β-α) + occ).
OPEN PROBLEMS
Generalizing to k gaps
Reducing the dependency on the size
Scalability to different gap bounds in the dictionary
Online algorithm