Orgad Keller - Algorithms 2 - Recitation 12 2
Less Than Matching
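Input: A text $T = t_0 t_1 \dots t_{n-1}$ and a pattern $P = p_0 p_1 \dots p_{m-1}$ over an alphabet $\Sigma$ with order relation $\le$.
Output: All locations $i$ where $\forall j,\ 0 \le j \le m-1:\ p_j \le t_{i+j}$.
Can we use the regular methods?
[Figure: $P$ aligned under $T$ at location $i$, with $p_j$ below $t_{i+j}$.]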
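As a baseline for the definition above, here is a minimal sketch of the naive $O(nm)$ check. The function name `less_than_matches_naive` is hypothetical, and symbols are assumed to be any Python values comparable with `<=`.

```python
def less_than_matches_naive(text, pattern):
    """Return every location i where pattern[j] <= text[i + j] for all j."""
    n, m = len(text), len(pattern)
    return [
        i
        for i in range(n - m + 1)
        # A location matches only if no pattern symbol exceeds the text symbol above it.
        if all(pattern[j] <= text[i + j] for j in range(m))
    ]


print(less_than_matches_naive([3, 1, 4, 1, 5, 9, 2, 6], [1, 2]))  # [1, 3, 4, 5, 6]
```

Location 0 fails because $p_1 = 2 > t_1 = 1$; location 2 fails for the same reason against $t_3 = 1$.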
Transitivity
Less Than Matching is in fact transitive, but that is not enough for us:
$a \le c,\ b \le c$ does not imply anything about the relation between $a$ and $b$.
Approach
A good approach for solving Pattern Matching problems is sometimes solving:
1. The problem for a binary alphabet $\Sigma = \{0, 1\}$.
2. The problem for a bounded alphabet.
3. The problem for an unbounded alphabet.
In that order.
Binary Alphabet
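The only case that prevents a match at location $i$ is the case where:
$\exists j,\ 0 \le j \le m-1:\ p_j = 1 \wedge t_{i+j} = 0$
This is equivalent to (writing $\bar{t} = 1 - t$ for the complemented bit):
$\exists j,\ 0 \le j \le m-1:\ p_j = 1 \wedge \bar{t}_{i+j} = 1$
So how can we solve this case?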
Binary Alphabet
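So if $\sum_{j=0}^{m-1} p_j \bar{t}_{i+j} > 0$, there is no match at $i$.
We can calculate $\bar{T} = \bar{t}_0 \bar{t}_1 \dots \bar{t}_{n-1}$. Then we’ll calculate the convolution $\bar{T} \otimes P^R$ using FFT, where $P^R$ is the reversed pattern.
We’ll return all locations $i$ where $(\bar{T} \otimes P^R)[i + m - 1] = 0$.
Time: $O(n \log m)$.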
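The binary case can be sketched as follows: complement the text, convolve with the reversed pattern, and keep the locations whose convolution entry is zero. This is only an illustration under stated assumptions. The helper names (`fft`, `convolve`, `binary_less_than_matches`) are hypothetical, the FFT is a plain recursive radix-2 version written with the standard library, and a single convolution over the whole text is used, which costs $O(n \log n)$; reaching $O(n \log m)$ would additionally need the text to be cut into segments of length $2m$.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


def convolve(x, y):
    """Convolution of two integer sequences via FFT, rounded back to integers."""
    size = 1
    while size < len(x) + len(y) - 1:
        size *= 2
    fx = fft([complex(v) for v in x] + [0j] * (size - len(x)))
    fy = fft([complex(v) for v in y] + [0j] * (size - len(y)))
    prod = fft([u * v for u, v in zip(fx, fy)], invert=True)
    # The unnormalized inverse transform needs a division by the transform size.
    return [round(v.real / size) for v in prod[: len(x) + len(y) - 1]]


def binary_less_than_matches(text, pattern):
    """Locations i with no j such that pattern[j] == 1 and text[i + j] == 0."""
    n, m = len(text), len(pattern)
    t_bar = [1 - t for t in text]            # complemented text
    conv = convolve(t_bar, pattern[::-1])    # T-bar convolved with P reversed
    # conv[i + m - 1] equals sum_j pattern[j] * t_bar[i + j].
    return [i for i in range(n - m + 1) if conv[i + m - 1] == 0]


print(binary_less_than_matches([1, 0, 1, 1, 0, 1], [1, 0, 1]))  # [0, 3]
```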
Bounded Alphabet
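We need reductions to binary alphabet. For each $\sigma \in \Sigma$ we’ll define:
$T_\sigma = t^\sigma_0 t^\sigma_1 \dots t^\sigma_{n-1}$, where $t^\sigma_i = 1$ if $t_i \ge \sigma$ and $t^\sigma_i = 0$ if $t_i < \sigma$.
$P_\sigma = p^\sigma_0 p^\sigma_1 \dots p^\sigma_{m-1}$, where $p^\sigma_i = 1$ if $p_i \ge \sigma$ and $p^\sigma_i = 0$ if $p_i < \sigma$.
We notice $T_\sigma, P_\sigma$ are binary.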
Bounded Alphabet
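Theorem: $P$ (less than) matches $T$ at location $i$ if and only if $\forall \sigma \in \Sigma$, $P_\sigma$ (less than) matches $T_\sigma$ at location $i$.
Proof: $P_\sigma$ does not match $T_\sigma$ at $i$ iff $\exists j:\ p^\sigma_j = 1 \wedge t^\sigma_{i+j} = 0$, where $p^\sigma_j$ and $t^\sigma_{i+j}$ are the entries of $P_\sigma$ and $T_\sigma$.
That is true iff $\exists j:\ p_j \ge \sigma > t_{i+j}$, meaning that $P$ does not (less than) match $T$ at location $i$.
Conversely, if $p_j > t_{i+j}$ for some $j$, then taking $\sigma = p_j$ gives $p^\sigma_j = 1$ and $t^\sigma_{i+j} = 0$, so $P_\sigma$ does not match $T_\sigma$ at $i$.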
Bounded Alphabet
So for each $\sigma \in \Sigma$, we’ll run the binary alphabet algorithm on $T_\sigma, P_\sigma$.
We’ll return only the locations that matched in all iterations.
Time: $O(|\Sigma| \cdot n \log m)$.
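The reduction can be sketched as below, assuming the binary strings are threshold indicators (an entry is 1 exactly when the original symbol is at least $\sigma$). The function names are hypothetical, and `binary_matches` is a naive stand-in for the FFT-based binary algorithm so that the block stays self-contained; it does not achieve the stated running time.

```python
def binary_matches(text, pattern):
    """Naive stand-in for the binary algorithm: no j with pattern[j]=1, text[i+j]=0."""
    n, m = len(text), len(pattern)
    return {
        i
        for i in range(n - m + 1)
        if not any(pattern[j] == 1 and text[i + j] == 0 for j in range(m))
    }


def bounded_less_than_matches(text, pattern):
    """Intersect the binary matches over every threshold symbol sigma."""
    n, m = len(text), len(pattern)
    result = set(range(n - m + 1))
    # Only symbols that occur in the pattern can witness a mismatch.
    for sigma in sorted(set(pattern)):
        t_sigma = [1 if t >= sigma else 0 for t in text]
        p_sigma = [1 if p >= sigma else 0 for p in pattern]
        result &= binary_matches(t_sigma, p_sigma)
    return sorted(result)


print(bounded_less_than_matches([3, 1, 4, 1, 5, 9, 2, 6], [1, 2]))  # [1, 3, 4, 5, 6]
```

Iterating only over pattern symbols is enough here because a violation $p_j > t_{i+j}$ is always exposed by the threshold $\sigma = p_j$.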
Unbounded Alphabet
Running the bounded alphabet algorithm could result in an $O(\min\{m, |\Sigma|\} \cdot n \log m)$ time algorithm (we’ll run it only for alphabet symbols which actually appear in the pattern).
This can be worse than the naïve $O(nm)$ algorithm. We present an improvement on the next slides.
Abrahamson-Kosaraju Method
First, use the segment splitting trick. Therefore we can assume $|T| = 2m$.
For each location $i$ in text, we’ll produce a triplet: (a, 'T', i), where $t_i = a$.
For each location $i$ in pattern, we’ll produce a triplet: (b, 'P', i), where $p_i = b$.
We now have $3m$ triplets all together.
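A minimal sketch of this setup step follows. The slides do not spell out the segment splitting trick; the version assumed here cuts the text into segments of length at most $2m$ that start every $m$ positions, so each candidate location falls entirely inside exactly one segment as one of its first $m$ positions. The function names are hypothetical.

```python
def split_into_segments(text, m):
    """Yield (offset, segment): segments of length <= 2m starting every m positions."""
    for offset in range(0, len(text) - m + 1, m):
        yield offset, text[offset: offset + 2 * m]


def make_triplets(segment, pattern):
    """Build (symbol, source, position) triplets for one segment and the pattern."""
    triplets = [(a, 'T', i) for i, a in enumerate(segment)]
    triplets += [(b, 'P', i) for i, b in enumerate(pattern)]
    return triplets


for offset, segment in split_into_segments(list(range(10)), 3):
    print(offset, segment)
```

A match found at local position $i' < m$ of the segment with offset `offset` corresponds to location `offset + i'` in the full text.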
Abrahamson-Kosaraju Method
We’ll hold all triplets together.
Sort all triplets according to symbol.
We’ll define a symbol that has more than $\sqrt{m}$ triplets as a “frequent symbol”. There are $O(\sqrt{m})$ frequent symbols, since there are only $3m$ triplets.
Put all frequent symbols’ triplets aside.
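The sorting and frequent-symbol step might look like the sketch below. The function name `split_frequent` is hypothetical, triplets are assumed to be `(symbol, source, position)` tuples, and rounding the threshold down with `isqrt` is an assumption.

```python
from collections import Counter
from math import isqrt


def split_frequent(triplets, m):
    """Sort triplets by symbol and set the frequent symbols' triplets aside."""
    ordered = sorted(triplets, key=lambda t: t[0])   # stable sort by symbol only
    counts = Counter(sym for sym, _, _ in ordered)
    threshold = isqrt(m)                             # assumption: floor of sqrt(m)
    frequent = {sym for sym, c in counts.items() if c > threshold}
    aside = [t for t in ordered if t[0] in frequent]
    rest = [t for t in ordered if t[0] not in frequent]
    return frequent, aside, rest


triplets = [(2, 'T', 0), (1, 'T', 1), (2, 'T', 2), (2, 'P', 0), (1, 'P', 1), (3, 'T', 3)]
frequent, aside, rest = split_frequent(triplets, 4)
print(frequent)  # {2}
```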
Abrahamson-Kosaraju Method
Split non-frequent symbols’ triplets to groups of size $\sqrt{m} \le S \le 2\sqrt{m}$ in the following manner:
Group 1: (a, 'T', 4), (a, 'T', 7), ..., (a, 'P', 300) [$\frac{2}{3}\sqrt{m}$ triplets], (b, 'T', 3), ..., (b, 'T', 200) [$\frac{1}{2}\sqrt{m}$ triplets]
Group 2: (d, 'P', 5), ..., (d, 'T', 1000) [$\frac{1}{2}\sqrt{m}$ triplets], (g, 'P', 5), ..., (g, 'T', 150) [$\frac{3}{4}\sqrt{m}$ triplets]
...
Abrahamson-Kosaraju Method
The rule is that there can’t be two triplets of the same symbol in different groups.
Abrahamson-Kosaraju Method
For each such group, choose the symbol of the first triplet in group as the group’s representative.
For instance, on previous example, group 1’s representative is $a$ and group 2’s representative is $d$.
There are $O(\sqrt{m})$ representatives all together.
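One way to form the groups and pick representatives is the greedy pass sketched below: walk over the non-frequent symbols in sorted order, add all triplets of a symbol to the current group, and close the group once it holds at least $\sqrt{m}$ triplets. Because a non-frequent symbol has at most $\sqrt{m}$ triplets, a closed group has fewer than $2\sqrt{m}$. The greedy rule and the function names are assumptions, and the last group may end up smaller than $\sqrt{m}$.

```python
from itertools import groupby
from math import sqrt


def group_non_frequent(rest, m):
    """rest: non-frequent triplets sorted by symbol. Returns a list of groups."""
    groups, current = [], []
    for _, block in groupby(rest, key=lambda t: t[0]):
        current.extend(block)            # a symbol's triplets never straddle two groups
        if len(current) >= sqrt(m):
            groups.append(current)
            current = []
    if current:                          # leftover group, possibly smaller than sqrt(m)
        groups.append(current)
    return groups


def representatives(groups):
    """Map every symbol to the symbol of the first triplet in its group."""
    return {sym: group[0][0] for group in groups for sym, _, _ in group}


rest = [(1, 'T', 0), (2, 'T', 1), (2, 'P', 0), (5, 'T', 2), (7, 'P', 1)]
groups = group_non_frequent(rest, 4)
print(representatives(groups))  # {1: 1, 2: 1, 5: 5, 7: 5}
```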
Abrahamson-Kosaraju Method
To sum up:
$O(\sqrt{m})$ frequent symbols.
$O(\sqrt{m})$ representatives of non-frequent symbols.
We’ll swap each non-frequent symbol in pattern and text with its representative.
Now our text and pattern are over an $O(\sqrt{m})$ sized alphabet.
Abrahamson-Kosaraju Method
We want to run our algorithm over the new text and pattern to count the mismatches between symbols of different groups.
But we have a problem: Let’s say $f$ is a frequent symbol, but:
Group 2: ..., (d, 'P', 5), ..., (d, 'T', 1000) [$\frac{1}{2}\sqrt{m}$ triplets], (g, 'P', 5), ..., (g, 'T', 150) [$\frac{3}{4}\sqrt{m}$ triplets], ...
Abrahamson-Kosaraju Method
The representative of group 2 is $d$, which is smaller than $f$, but the group also contains $g$ which is greater than $f$.
Abrahamson-Kosaraju Method
In that case we’ll split group 2 to two groups with their own representatives:
Group 2.1: (d, 'P', 5), ..., (d, 'T', 1000)
Group 2.2: (g, 'P', 5), ..., (g, 'T', 150)
Since we performed at most $O(\sqrt{m})$ such splits, we still have $O(\sqrt{m})$ representatives.
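The splitting step can be sketched as follows: inside each group, cut between two consecutive triplets whenever some frequent symbol lies strictly between their symbols. The function name is hypothetical, and the linear scan over the frequent symbols is kept for readability rather than speed.

```python
def split_groups_at_frequent(groups, frequent):
    """Split each (symbol-sorted) group wherever a frequent symbol falls inside it."""
    result = []
    for group in groups:
        current = [group[0]]
        for prev, trip in zip(group, group[1:]):
            # A frequent symbol strictly between two neighbours breaks the group.
            if any(prev[0] < f < trip[0] for f in frequent):
                result.append(current)
                current = []
            current.append(trip)
        result.append(current)
    return result


group2 = [(4, 'P', 5), (4, 'T', 9), (7, 'P', 5), (7, 'T', 1)]
print(split_groups_at_frequent([group2], {6}))
# [[(4, 'P', 5), (4, 'T', 9)], [(7, 'P', 5), (7, 'T', 1)]]
```

Each frequent symbol can fall inside at most one group, so the number of splits is bounded by the number of frequent symbols.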
Abrahamson-Kosaraju Method
We can now run our algorithm over the new text and pattern in $O(\sqrt{m} \cdot m \log m)$.
But we still haven’t handled comparisons between two non-frequent symbols that are in the same group.
Abrahamson-Kosaraju Method
We’ll do so naively in each group:
For each triplet (b, 'P', j) in the group:
  For each triplet of the form (a, 'T', k) in the group, if $p_j > t_k$, then add an error at location $i = k - j$.
Time: $O(m\sqrt{m})$.
[Figure: $P$ aligned under $T$ at location $i$, with $p_j$ below $t_k$.]
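The in-group correction can be sketched as the double loop below. The function name is hypothetical, `errors` is assumed to be a list with one counter per candidate location, and triplets are assumed to be `(symbol, source, position)` tuples.

```python
def add_in_group_errors(groups, errors):
    """Add one error at location k - j for every in-group pair with p_j > t_k."""
    for group in groups:
        pattern_trips = [(b, j) for b, kind, j in group if kind == 'P']
        text_trips = [(a, k) for a, kind, k in group if kind == 'T']
        for b, j in pattern_trips:
            for a, k in text_trips:
                i = k - j                      # alignment that puts p_j under t_k
                if b > a and 0 <= i < len(errors):
                    errors[i] += 1
    return errors


print(add_in_group_errors([[(4, 'P', 0), (3, 'T', 2), (5, 'T', 1)]], [0, 0, 0]))  # [0, 0, 1]
```

Each of the $3m$ triplets is compared with at most $2\sqrt{m}$ triplets of its own group, which gives the $O(m\sqrt{m})$ bound.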
Running Time
For one segment:
Sorting the triplets and representatives: $O(m \log m)$.
Running the algorithm: $O(\sqrt{m} \cdot m \log m)$.
Correcting results (adding in-group errors): $O(m\sqrt{m})$.
Overall for one segment: $O(m\sqrt{m} \log m)$.
Overall for all segments: $O(n\sqrt{m} \log m)$.
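The totals can be checked term by term. The derivation below assumes that segment splitting produces $O(n/m)$ segments of length $2m$, which the slides do not state explicitly.

```latex
\begin{align*}
\text{one segment:}\quad
  & O(m \log m) + O(\sqrt{m} \cdot m \log m) + O(m\sqrt{m})
    = O(m\sqrt{m} \log m) \\
\text{all segments:}\quad
  & O\!\left(\frac{n}{m}\right) \cdot O(m\sqrt{m} \log m)
    = O(n\sqrt{m} \log m)
\end{align*}
```

The middle term dominates: it is the bounded alphabet algorithm run once per symbol of the $O(\sqrt{m})$ sized alphabet, each run costing $O(m \log m)$ on a segment.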