Tamanna Chhabra, Sukhpal Singh Ghuman, Jorma Tarhio
Tuning Algorithms for Jumbled Matching
Jumbled matching
An interesting variation of string matching.
The task is to find the substrings of T that are permutations of P.
For example, P = abcb occurs in T = aababcaabc as the substring babc.
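As a baseline for the problem statement above, here is a minimal Python sketch that slides a window of length |P| over T and compares character counts. The function name `jumbled_occurrences` is chosen for illustration only; this is not one of the algorithms presented in the talk.

```python
from collections import Counter

def jumbled_occurrences(pattern, text):
    """Return the start indices of windows of text that are permutations of pattern."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    target = Counter(pattern)
    window = Counter(text[:m])
    hits = [0] if window == target else []
    for i in range(m, n):
        window[text[i]] += 1          # character entering the window
        old = text[i - m]             # character leaving the window
        window[old] -= 1
        if window[old] == 0:
            del window[old]           # keep the counter free of zero entries
        if window == target:
            hits.append(i - m + 1)
    return hits

print(jumbled_occurrences("abcb", "aababcaabc"))  # [2], the window "babc"
```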
2
Jumbled matching
Parikh vector: the pattern can be described by its Parikh vector,
the vector of multiplicities of its characters.
p(S) = (1,2,1,0) for S = abcb over the alphabet {a,b,c,d}.
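The Parikh vector can be computed directly from the definition. A small Python sketch, where the helper name `parikh_vector` and the explicit `alphabet` argument are assumptions made for the example:

```python
def parikh_vector(s, alphabet):
    """Multiplicity of each alphabet character in s, in alphabet order."""
    return tuple(s.count(c) for c in alphabet)

print(parikh_vector("abcb", "abcd"))  # (1, 2, 1, 0)
```

Two strings of equal length are permutations of each other exactly when their Parikh vectors are equal.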
3
Approximate Permutation Matching
A string P′ is a k-approximate permutation of P, 0 ≤ k < m, if |P′| = |P| = m and the character counts of P′ deviate from those of P by at most k.
Here set(P′) is the set of characters in P′ and cc(u,c) is the number of occurrences of a character c in a string u.
4
Motivation
Alignment of strings
SNP discovery
Discovery of repeated patterns
Interpretation of mass spectrometry data
5
Previous Algorithms
Key idea: scan the text forward while maintaining counts of characters.
These algorithms work in linear time.
These algorithms were developed as filtration methods for online approximate string matching.
6
Previous Algorithms
Grossi & Luccio’s (Information Processing Letters 1989) and Navarro’s (Proc. WSP 1997) solutions are based on the frequency of characters.
Navarro’s counting algorithm uses a sliding-window approach.
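A counting filter of this kind can be sketched in Python as follows. The names `counting_filter`, `balance`, and `shared` are illustrative, and the acceptance threshold m − k (at least m − k characters shared with the pattern, counted as multisets) is an assumption about how k is applied, not a detail taken from the slides.

```python
from collections import Counter

def counting_filter(pattern, text, k=0):
    """Report windows sharing at least m - k characters (as multisets) with pattern."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    balance = Counter(pattern)   # balance[c] = count in pattern - count in window
    shared = 0                   # number of window characters matched to the pattern
    hits = []
    for i, ch in enumerate(text):
        if balance[ch] > 0:      # entering character is still needed
            shared += 1
        balance[ch] -= 1
        if i >= m:               # character leaving the window
            old = text[i - m]
            balance[old] += 1
            if balance[old] > 0:
                shared -= 1
        if i >= m - 1 and shared >= m - k:
            hits.append(i - m + 1)
    return hits

print(counting_filter("abcb", "aababcaabc", 0))  # [2]
```

Each text character is handled in constant time, so the scan is linear in the text length.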
7
Previous Algorithms
Grossi and Luccio’s (Information Processing Letters 1989) solution maintains a queue of characters.
The queue grows as acceptable characters are read.
Navarro also presented Mcount, a variant for multiple patterns (Proc. WSP 1997).
8
Previous Algorithms
Cantone and Faro (Proc. PSC 2014) presented the BAM algorithm (Bit-parallel Abelian Matcher).
BAM associates a counter (bin) with each distinct character of P.
A single 1-bit counter serves the remaining characters of the alphabet.
9
Previous Algorithms
At the start of processing a window, every overflow bit is zero.
The 1-bit counter reserved for the characters not occurring in P is initially zero.
It is set as soon as any character not in P is encountered in the text window.
Once it is set, the text window cannot be a permutation of the pattern P.
10
Bit Parallel simulation
P = abbccc
Fields: c, b, a, other characters
11
Initialization for state vector
P = abbccc
Fields: c, b, a, all other characters
12
Forward Processing
13
Backward Processing
14
New solutions
Solutions for both exact and approximate jumbled matching.
We present two algorithms that are modifications of BAM.
ABAM (approximate BAM).
BAM2 (enhanced BAM with 2-grams).
15
Key Idea: Counters
We use bit fields to store counters:
• One for each character that appears in the pattern.
• One for all other characters.
The highest bit of each field is an overflow indicator.
Each field needs enough space to represent the number of times the character appears in the pattern plus the maximum error count k.
16
State Vector D
Counters are stored in state vector D.
If they do not fit in one word:
• We can put several different characters in one field.
• But then we must verify matches.
Initial values of D are fetched from a precomputed word.
Each text character tj is processed using the array entry M[tj], which has a 1 in the field for tj.
The value of D is updated by D ← D + M[tj].
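The counter scheme can be illustrated with a simplified Python sketch for exact matching. The field layout, the initial values, and the names `build_tables` and `bam_like_search` are assumptions made for this example. The sketch also rescans every window from scratch, which is simpler than the window handling of BAM itself, and it relies on Python's unbounded integers instead of a machine word.

```python
from collections import Counter

def build_tables(pattern):
    """One bit field per pattern character plus a 1-bit field for all other characters."""
    init = 0       # precomputed initial value of the state vector D
    overflow = 0   # mask with the highest (overflow) bit of every field
    M = {}         # M[c] has a 1 in the lowest bit of the field of c
    shift = 0
    for c, cnt in sorted(Counter(pattern).items()):
        width = cnt.bit_length() + 1      # counter bits plus one overflow bit
        top = 1 << (width - 1)
        # Start so that the (cnt + 1)-th occurrence of c sets the overflow bit.
        init |= (top - 1 - cnt) << shift
        overflow |= top << shift
        M[c] = 1 << shift
        shift += width
    other = 1 << shift                    # shared 1-bit field, set by any other character
    overflow |= other
    return init, overflow, M, other

def bam_like_search(pattern, text):
    m = len(pattern)
    if m == 0:
        return []
    init, overflow, M, other = build_tables(pattern)
    hits = []
    for i in range(len(text) - m + 1):
        d = init
        for ch in text[i:i + m]:
            d += M.get(ch, other)         # D <- D + M[t_j]
            if d & overflow:              # some counter exceeded its pattern count
                break
        else:
            hits.append(i)                # m characters read, no overflow: a permutation
    return hits

print(bam_like_search("abbccc", "xcbcacbx"))  # [1]
```

Because the window has exactly m characters, the absence of any overflow implies that every counter reached exactly its pattern count.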
17
Initialization for state vector D and M[ ] for pattern P = abbccc
0 0 0 0 0 0 1 0 0
Fields: a, b, c, and x (all other characters)
0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 1 0 0
0 0 0 1 0 0 0 0 0
1 0 0 0 0 0 0 0 0
M[a]
M[b]
M[c]
M[x]
I
18
Variations of BAM
BAMs
• Some bins are shared if necessary.
• If bins are shared, each match candidate needs to be verified.
BAM2
• Handles 2 text characters (a 2-gram) at a time.
• Separate loops for patterns of even and odd length.
• Reads four characters before testing D for the first time.
• Hence the minimum width of a field is four bits instead of two.
19
ABAM
ABAM: Approximate BAM.
C is the error counter.
F[tj] is a mask for testing overflow bits.
20
EBL (Exact Backward for Large alphabets)
EBL is based on SBNDM2.
Instead of representing occurrence vectors, array B states whether a character is present in the pattern.
When the alignment window contains only acceptable characters, the window is a match candidate.
• Acceptable: characters that appear in the pattern.
• The update step is simply D = D & B[ti+j-1].
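The filtering idea can be sketched in Python as below. This is a simplification under stated assumptions: D is reduced to a single flag, characters are read one at a time rather than as 2-grams, and candidates are verified by comparing character counts. The function name `ebl_like_search` is illustrative.

```python
from collections import Counter

def ebl_like_search(pattern, text):
    """Backward scan: a window is a candidate only if all its characters occur in pattern."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    target = Counter(pattern)
    B = {c: 1 for c in target}            # B[c] = 1 if c is acceptable, else 0 (via .get)
    hits = []
    i = 0
    while i <= n - m:
        d = 1
        j = m
        while j > 0:
            d &= B.get(text[i + j - 1], 0)    # D = D & B[t_{i+j-1}]
            if d == 0:
                break
            j -= 1
        if d:
            # Match candidate: verify the actual character counts.
            if Counter(text[i:i + m]) == target:
                hits.append(i)
            i += 1
        else:
            i += j    # text[i+j-1] is unacceptable, so skip every window containing it
    return hits

print(ebl_like_search("ab", "abxbaxxab"))  # [0, 3, 7]
```

With a large alphabet, unacceptable characters are frequent, so most windows are rejected after a few reads and the shift tends to be long.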
21
EFS (Exact Forward for Small alphabets) and AFL (Approximate Forward for Large alphabets)
EFS: the update step is D ← D + M[ti] − M[ti−m].
AFL is a modification of Mcount tuned for a single pattern.
It uses a different initial value of the counter.
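One plausible reading of the EFS update step, sketched in Python: every alphabet character gets a packed counter field, the window state D is updated by adding the entering character and subtracting the leaving one, and a match is detected by comparing D with the packed Parikh vector of the pattern. The field width, the explicit `alphabet` parameter, and the name `efs_like_search` are assumptions for the example; all text characters are assumed to belong to the given alphabet.

```python
def efs_like_search(pattern, text, alphabet="acgt"):
    """Forward scan with packed per-character counters (small alphabets only)."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    width = m.bit_length()                # each field can hold any count from 0 to m
    M = {c: 1 << (idx * width) for idx, c in enumerate(alphabet)}
    target = sum(M[c] for c in pattern)   # packed Parikh vector of the pattern
    d = sum(M[c] for c in text[:m])       # packed counts of the first window
    hits = [0] if d == target else []
    for i in range(m, n):
        d += M[text[i]] - M[text[i - m]]  # D <- D + M[t_i] - M[t_{i-m}]
        if d == target:
            hits.append(i - m + 1)
    return hits

print(efs_like_search("abcb", "aababcaabc", alphabet="abc"))  # [2]
```

Since a window never holds more than m copies of a character, no field can carry into its neighbour.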
22
ABS (Approximate Backward for Small Alphabets)
Error count C is updated without conditional code by shifting the corresponding overflow bit to the lowest bit and then masking it.
The shift uses the array o[ ], which contains the positions of the overflow bits.
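The branch-free error update can be sketched in Python as follows. Only the update of C via a shift and a mask is meant to mirror the description above. The forward rescan of each window, the field layout, the name `abs_like_search`, and the interpretation of an error as a window character in excess of its pattern count are assumptions of this example.

```python
from collections import Counter

def abs_like_search(pattern, text, k):
    """Report windows with at most k characters in excess of their pattern counts."""
    m = len(pattern)
    if m == 0:
        return []
    init, M, o = 0, {}, {}
    shift = 0
    # The key None stands for the shared field of all characters not in the pattern.
    fields = sorted(Counter(pattern).items()) + [(None, 0)]
    for c, cnt in fields:
        width = (cnt + k).bit_length() + 1    # room for cnt + k plus the overflow bit
        top = 1 << (width - 1)
        init |= (top - 1 - cnt) << shift      # the (cnt + 1)-th occurrence sets the overflow bit
        M[c] = 1 << shift
        o[c] = shift + width - 1              # position of the overflow bit of this field
        shift += width
    hits = []
    for i in range(len(text) - m + 1):
        d, errors = init, 0
        for ch in text[i:i + m]:
            key = ch if ch in M else None
            d += M[key]
            errors += (d >> o[key]) & 1       # shift the overflow bit down and mask it
            if errors > k:
                break
        else:
            hits.append(i)
    return hits

print(abs_like_search("abcb", "aababcaabc", 0))  # [2]
```

Stopping as soon as the error count exceeds k keeps every field below twice its overflow value, so no carry reaches a neighbouring field.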
23
Execution times of algorithms (in seconds) for English data
24
Execution times of algorithms (in seconds) for DNA data
25
Execution times of algorithms (in seconds) for protein data
26
Experimental Results
English data
• BAM2a is more than twice as fast as the previous algorithms.
DNA data
• EFS runs at double speed compared to the previous algorithms.
Protein data
• BAM2a is the fastest and takes less than half the time of the previous algorithms.
27
Concluding remarks
We introduced new variations of jumbled matching algorithms.
All the forward algorithms are clearly linear.
The speed of AFL does not depend on the value of k.
The technique of shared bins proved useful for jumbled matching.
28
THANK YOU
29