Tamanna Chhabra, Sukhpal Singh Ghuman, Jorma Tarhio: Tuning Algorithms for Jumbled Matching


Jumbled matching

Jumbled matching is a variation of string matching.

The task is to find the substrings of T that are permutations of P.

For example, P = abcb occurs in T = aababcaabc as the substring babc (positions 3 to 6, counting from 1).


Parikh vector: the pattern can be described by its Parikh vector.

It is the vector of multiplicities of the characters.

p(S) = (1,2,1,0) for S = abcb over the alphabet {a,b,c,d}.
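The Parikh vector can be computed directly from the definition. The sketch below is only an illustration; the function names and the explicit alphabet argument are assumptions, not part of the slides.

```python
def parikh_vector(s, alphabet):
    """Return the multiplicity of each alphabet character in s, in alphabet order."""
    return tuple(s.count(ch) for ch in alphabet)


def is_permutation_of(window, pattern, alphabet):
    """Two strings are permutations of each other iff their Parikh vectors agree."""
    return parikh_vector(window, alphabet) == parikh_vector(pattern, alphabet)


print(parikh_vector("abcb", "abcd"))
print(is_permutation_of("babc", "abcb", "abcd"))
```

Comparing vectors this way costs time proportional to the alphabet size per window, which is why the algorithms discussed later maintain the counts incrementally.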


Approximate Permutation Matching

A string P′ is a k-approximate permutation of P, 0 ≤ k < m, if |P′| = |P| = m holds and P′ contains at most k wrong or superfluous characters relative to P.

Here set(P′) is the set of characters in P′, and cc(u,c) is the number of occurrences of a character c in a string u.
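One way to make this definition concrete is to count the surplus characters of P′. The sketch assumes that the condition is "the sum over c in set(P′) of max(cc(P′,c) − cc(P,c), 0) is at most k"; this is an interpretation of the notation above, not a formula taken from the slide, and the function names are hypothetical.

```python
from collections import Counter


def surplus(candidate, pattern):
    """Number of characters in candidate that exceed their count in pattern."""
    cand, pat = Counter(candidate), Counter(pattern)
    # Iterating over cand corresponds to summing over set(P').
    return sum(max(cand[c] - pat[c], 0) for c in cand)


def is_k_approx_permutation(candidate, pattern, k):
    """Assumed test: equal lengths and at most k surplus characters."""
    return len(candidate) == len(pattern) and surplus(candidate, pattern) <= k


print(surplus("abca", "abcb"))
print(is_k_approx_permutation("abca", "abcb", 1))
```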


Motivation

Alignment of strings

SNP discovery

Discovery of repeated patterns

Interpretation of mass spectrometry data


Previous Algorithms

Key idea: scan the text forward while maintaining counts of characters.

These algorithms work in linear time.

They were developed as filtration methods for online approximate string matching.


Grossi & Luccio’s (Information Processing Letters 1989) and Navarro’s (Proc. WSP 1997) solutions are based on the frequency of characters.

Navarro’s counting algorithm - sliding window approach.
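A sliding-window counting scheme of this kind can be sketched as follows. It is a generic illustration of the idea, not a transcription of any of the cited algorithms; the names `counting_search`, `need`, and `matched` are assumptions, and the threshold `m - k` assumes that an error is a surplus character in the window.

```python
from collections import Counter


def counting_search(pattern, text, k=0):
    """Report start positions of windows with at most k surplus characters."""
    m = len(pattern)
    need = Counter(pattern)   # pattern count minus window count; may go negative
    matched = 0               # window characters that are covered by the pattern
    hits = []
    for i, ch in enumerate(text):
        if need[ch] > 0:      # the entering character is still wanted
            matched += 1
        need[ch] -= 1
        if i >= m:            # the character leaving the window
            old = text[i - m]
            need[old] += 1
            if need[old] > 0: # it had been counted as matched
                matched -= 1
        if i >= m - 1 and matched >= m - k:
            hits.append(i - m + 1)
    return hits


print(counting_search("abcb", "aababcaabc"))
```

Each text character is touched a constant number of times, so the scan is linear in the length of the text regardless of k.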


Grossi and Luccio’s (Information Processing Letters 1989) solution maintains a queue of characters.

The queue grows with the acceptable characters.

Navarro presented Mcount for multiple patterns (Proc. WSP 1997).


Cantone and Faro (Proc. PSC 2014) presented the BAM algorithm (Bit-parallel Abelian Matcher).

It associates a counter (bin) with each distinct character in P.

A single 1-bit counter covers the remaining characters of the alphabet.


At the start of processing a window, every overflow bit is zero.

The 1-bit counter reserved for all characters not occurring in P is initially zero.

It gets set as soon as any character not in P is encountered in the text window.

At that point it is clear that the text window cannot be a permutation of the pattern P.


Bit-parallel simulation

[Figure: bit fields for P = abbccc, labelled c, b, a, and other characters.]


Initialization for state vector

[Figure: initial state vector for P = abbccc, with fields for c, b, a, and all other characters.]


Forward Processing


Backward Processing


New solutions

Solutions for both exact and approximate jumbled matching.

We present two algorithms that are modifications of BAM.

ABAM (approximate BAM).

BAM2 (enhanced BAM with 2-grams).


Key Idea: Counters

We use bit fields to store counters:

• One for each character that appears in the pattern.

• One for all other characters.

The highest bit of each field is an overflow indicator.

Each field has space to represent the number of times the character appears in the pattern plus the maximum error count k.


State Vector D

Counters are stored in state vector D.

If they do not fit in one word:

• We can put several different characters in one field.

• But then we must verify matches.

Initial values of D are fetched from a precomputed word.

Each text character tj is processed using the array entry M[tj], which has a one in the field for tj.

The value of D is updated by D ← D + M[tj].
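The packing of counters into D and the update D ← D + M[tj] can be sketched for an exact test of a single window. The field widths, the initial values (chosen here so that a field's top bit turns on at the first surplus occurrence), and the names `build_tables` and `window_matches` are assumptions made for illustration; scanning direction and window shifting are left out.

```python
def build_tables(pattern):
    """Return (initial, masks, overflow, other) for packed per-character counters."""
    counts = {}
    for ch in pattern:
        counts[ch] = counts.get(ch, 0) + 1
    initial, overflow, masks, shift = 0, 0, {}, 0
    for ch in sorted(counts):
        n = counts[ch]
        width = n.bit_length() + 1          # counter bits plus one overflow bit
        top = 1 << (width - 1)
        initial |= (top - 1 - n) << shift   # top bit flips at occurrence n + 1
        masks[ch] = 1 << shift              # a one in the lowest bit of the field
        overflow |= top << shift
        shift += width
    other = 1 << shift                      # 1-bit field for all other characters
    overflow |= other
    return initial, masks, overflow, other


def window_matches(window, pattern):
    """True if window (same length as pattern) is a permutation of pattern."""
    initial, masks, overflow, other = build_tables(pattern)
    d = initial
    for ch in window:
        d += masks.get(ch, other)           # D <- D + M[t_j]
        if d & overflow:                    # some character occurred too often
            return False
    return True


print(format(build_tables("abbccc")[0], "09b"))
print(window_matches("cbcabc", "abbccc"))
```

Because the window has the same length as the pattern, reaching its end without any overflow implies that every count matches exactly.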


Initialization for state vector D and M[ ] for pattern P = abbccc

Field   x  c    b    a
I       0  000  001  00
M[a]    0  000  000  01
M[b]    0  000  001  00
M[c]    0  001  000  00
M[x]    1  000  000  00

x stands for all other characters.


Variations of BAM

BAMs

• Some bins are shared if necessary.

• If bins are shared, each match candidate needs to be verified.

BAM2

• Handles two text characters (a 2-gram) at a time.

• Separate loops for patterns of even and odd length.

• Reads four characters before testing D for the first time.

• Hence the minimum width of a field is four bits instead of two.


ABAM

ABAM: Approximate BAM.

C is the error counter.

F[tj] is a mask for testing overflow bits.


EBL (Exact Backward for Large alphabets)

EBL is based on SBNDM2.

Instead of representing occurrence vectors:

• Array B states whether a character is present in the pattern.

When the alignment window contains only acceptable characters, the window is a match candidate.

• Acceptable: characters that appear in the pattern.

• Update step is simply D = D & B[ti+j-1].
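The filtering idea (skip windows that contain an unacceptable character, verify the rest) can be sketched without the bit-parallel machinery. This is a deliberately simplified stand-in: it does not reproduce SBNDM2 or the state vector D, and the function name and the verification by character counts are assumptions.

```python
from collections import Counter


def acceptable_filter_search(pattern, text):
    """Backward scan per window; jump past any character absent from the pattern."""
    m, n = len(pattern), len(text)
    acceptable = set(pattern)          # plays the role of array B
    target = Counter(pattern)
    hits = []
    i = 0
    while i + m <= n:
        j = m - 1
        while j >= 0 and text[i + j] in acceptable:
            j -= 1                     # scan the window from right to left
        if j >= 0:
            i += j + 1                 # no window can contain text[i + j]
        else:
            if Counter(text[i:i + m]) == target:   # verify the candidate
                hits.append(i)
            i += 1
    return hits


print(acceptable_filter_search("ab", "baxab"))
```

With a large alphabet most windows contain an unacceptable character, so long jumps are frequent and few candidates reach verification.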


EFS (Exact Forward for Small alphabets)

AFL (Approximate Forward for Large alphabets)

EFS: Update step is D ← D + M[ti] − M[ti-m].

AFL is a modification of Mcount tuned for a single pattern.

It uses a different initial value of the counter.
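The EFS update step can be illustrated with counters packed into a single integer. The sketch assumes that every text character belongs to a small, known alphabet and that a match is detected by comparing D with the packed counts of the pattern; the field width and the names `efs_like_search`, `unit`, and `target` are likewise assumptions.

```python
def efs_like_search(pattern, text, alphabet):
    """Forward scan with packed counts: D <- D + M[t_i] - M[t_(i-m)]."""
    m = len(pattern)
    if len(text) < m:
        return []
    width = m.bit_length()                       # each field can hold counts up to m
    unit = {ch: 1 << (i * width) for i, ch in enumerate(alphabet)}   # M[ ]
    target = sum(unit[ch] for ch in pattern)     # packed Parikh vector of the pattern
    d = sum(unit[ch] for ch in text[:m])         # counts of the first window
    hits = [0] if d == target else []
    for i in range(m, len(text)):
        d += unit[text[i]] - unit[text[i - m]]   # one character in, one out
        if d == target:
            hits.append(i - m + 1)
    return hits


print(efs_like_search("abcb", "aababcaabc", "abc"))
```

No field can underflow, because a character is only subtracted after it has been added, and no field can overflow, because a window holds at most m copies of any character.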


ABS (Approximate Backward for Small Alphabets)

The error count C is updated without conditional code by shifting the corresponding overflow bit to the lowest bit and then masking it.

The shift uses the array o[ ], which contains the positions of the overflow bits.
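The branch-free update of C can be sketched for a single window. The field layout follows the earlier description (room for the pattern count plus k, with the highest bit as overflow indicator), but the exact widths, the initial values, the left-to-right scan, and the names `build_counters` and `window_errors` are assumptions; the backward scanning and shifting of ABS are omitted.

```python
def build_counters(pattern, k):
    """Return (d0, inc, o): initial state, per-character increments, overflow positions."""
    counts = {}
    for ch in pattern:
        counts[ch] = counts.get(ch, 0) + 1
    d0, shift, inc, o = 0, 0, {}, {}
    # None is a stand-in key for "all other characters".
    for key in sorted(counts) + [None]:
        n = counts.get(key, 0)
        width = (n + k).bit_length() + 1   # room for n + k, plus the overflow bit
        top = 1 << (width - 1)
        d0 |= (top - 1 - n) << shift       # top bit turns on at occurrence n + 1
        inc[key] = 1 << shift
        o[key] = shift + width - 1         # position of this field's overflow bit
        shift += width
    return d0, inc, o


def window_errors(window, pattern, k):
    """Count surplus characters in window, stopping once the count exceeds k."""
    d, inc, o = build_counters(pattern, k)
    c = 0
    for ch in window:
        key = ch if ch in inc else None    # a table indexed by character avoids this test
        d += inc[key]
        c += (d >> o[key]) & 1             # shift the overflow bit down and mask it
        if c > k:
            break                          # more than k errors: reject the window
    return c


print(window_errors("abbbcc", "abbccc", 1))
```

Stopping as soon as C exceeds k keeps every counter inside its field, since no character can then occur more than its pattern count plus k + 1 times.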


Execution times of algorithms (in seconds) for English data


Execution times of algorithms (in seconds) for DNA data


Execution times of algorithms (in seconds) for protein data


Experimental Results

English data

• BAM2a is more than twice as fast as the previous algorithms.

DNA data

• EFS runs at double speed compared to the previous algorithms.

Protein data

• BAM2a is the fastest and takes less than half the time of the previous algorithms.


Concluding remarks

We introduced new variations of jumbled matching algorithms.

All the forward algorithms are clearly linear.

The speed of AFL does not depend on the value of k.

The technique of shared bins proved useful for jumbled matching.


THANK YOU
