Swaps + Mismatches Based on Estrella Eizenberg M.Sc. Thesis Supervised by Ely Porat

Swaps + Mismatches

A paper on this subject by Amihood Amir, Estrella Eizenberg, Ohad Lipsky and Ely Porat was submitted to ESA 2004

Problem definition

T: a d b d a c b d a b c a b

d a b b a a b c

Mismatches:

Abrahamson 87

K-mismatches: Landau Vishkin 86, Amir Lewenstein Porat 00



d c a b d b a c

Swaps:

Amir Aumann Landau M.Lewenstein N.Lewenstein 97

Cole Hariharan 00

Amir Cole Hariharan Lewenstein Porat 2001

Amir Lewenstein Porat 2000



d c a b b b a c

Minimum distance:

Counting all as mismatches: 5 err

Minimum distance: 3 err
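The two counts above can be reproduced with a small brute-force sketch. It assumes that the pattern d c a b b b a c is aligned at 0-based offset 3 of T; the slide layout is lost, so this alignment is a guess that happens to reproduce the stated 5 and 3. The function names are illustrative, and the dynamic program below is only a checker, not the algorithm of the talk.

```python
def hamming(t, p, off):
    """Number of positions where p differs from t[off:off+len(p)]."""
    return sum(1 for j, c in enumerate(p) if t[off + j] != c)


def swap_mismatch_distance(t, p, off):
    """Minimum number of edits (disjoint adjacent swaps in p, or single
    character mismatches) that turn p into t[off:off+len(p)]."""
    m = len(p)
    w = t[off:off + m]
    # dp[j] = cheapest way to fix the first j characters of p
    dp = [0] + [m + 1] * m
    for j in range(m):
        # option 1: leave p[j] in place, pay 1 if it mismatches
        dp[j + 1] = min(dp[j + 1], dp[j] + (p[j] != w[j]))
        # option 2: swap p[j] and p[j+1] if that makes both positions match
        if j + 1 < m and p[j] != p[j + 1] and p[j] == w[j + 1] and p[j + 1] == w[j]:
            dp[j + 2] = min(dp[j + 2], dp[j] + 1)
    return dp[m]


T = "adbdacbdabcab"
P = "dcabbbac"
print(hamming(T, P, 3))                 # 5
print(swap_mismatch_distance(T, P, 3))  # 3
```

Here two swaps (c a against a c, and b a against a b) absorb four of the five mismatches, leaving one plain mismatch.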

Starting with simpler problem

Σ = {0,1}

T: 0 1 0 1 0 1 1 0 1 1 0 1 0 0 1

1 0 0 1 0 1 0

We wish to count only the mismatches that cannot be paired into swaps (we will leave the swaps for later).

We call them non-swap mismatches (NSM).


Σ = {0,1}

T: 0 1 0 1 0 1 1 0 1 1 0 1 0 0 1

1 0 0 1 0 1 0

NSM[6]=2

Mismatches[6]=4

Minimum-distance[6]=(Mismatches[6]+NSM[6])/2

3-err

Mismatches: O(n log m)

NSM: ????

Minimum distance: O(???? + n log m)
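The formula follows because every swap accounts for two mismatches: Mismatches = 2·swaps + NSM, so swaps + NSM = (Mismatches + NSM)/2. The sketch below checks the numbers for text position 6 (0-based offset 5). It assumes a binary alphabet and counts NSM directly from runs of mismatches; the function names are illustrative and the scan is a naive check, not the fast method the slides go on to describe.

```python
def mismatches(t, p, off):
    return sum(1 for j, c in enumerate(p) if t[off + j] != c)


def nsm(t, p, off):
    """Non-swap mismatches for binary strings: every maximal run of
    mismatches over which p alternates is paired up by swaps, and a run
    of odd length leaves exactly one mismatch unpaired."""
    count = 0
    run = 0
    for j, c in enumerate(p):
        bad = t[off + j] != c
        if bad and run > 0 and p[j - 1] != c:
            run += 1              # the run continues
        else:
            count += run % 2      # close the previous run
            run = 1 if bad else 0
    return count + run % 2


T = "010101101101001"
P = "1001010"
off = 5                            # text position 6, counting from 1
mm, ns = mismatches(T, P, off), nsm(T, P, off)
print(mm, ns, (mm + ns) // 2)      # 4 2 3
```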


T: 0 1 0 1 0 1 1 0 1 1 0 1 0 0 1

T1: 0 1 0 1 0 1 * * * 1 0 1 0 * *

T2: * * * * * * 1 0 1 * * * * 0 1

We do the same for the pattern

P: 1 0 0 1 0 1 0

P1: * * 0 1 0 1 0

P2: 1 0 * * * * *

We will give a solution only for the odd places (NSM[i] where i is odd)
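The split can be sketched as follows. The rule used here is inferred from the rows shown on the slide rather than stated there: the first string keeps the positions that hold 0 at odd places and 1 at even places (places counted from 1), and the second string keeps the rest. The name `split_by_phase` is illustrative.

```python
def split_by_phase(s):
    """Return (s1, s2): s1 keeps the positions holding 0 at odd places and
    1 at even places (places counted from 1); s2 keeps the rest."""
    s1, s2 = [], []
    for k, c in enumerate(s):          # k is 0-based, so place = k + 1
        in_first = int(c) == k % 2
        s1.append(c if in_first else "*")
        s2.append("*" if in_first else c)
    return "".join(s1), "".join(s2)


print(split_by_phase("010101101101001"))  # ('010101***1010**', '******101****01')
print(split_by_phase("1001010"))          # ('**01010', '10*****')
```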


P2: 1 0 * * * * *

P1: * * 0 1 0 1 0

T1: 0 1 0 1 0 1 * * * 1 0 1 0 * *

Comparing T1 with P1 gives no errors, neither swaps nor mismatches (the same holds for T2 against P2).

Without loss of generality we look only at T1 against P2.


T1: 0 1 0 1 0 1 * * * 1 0 1 0 * *

P2: 1 0 * * * * *

An even overlap is covered entirely by swaps; an odd overlap leaves one NSM err.

We need to count how many odd overlaps we have
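Counting odd overlaps can be sketched directly, without any convolution. The code below is a quadratic illustration under stated assumptions: segments are maximal alternating runs, class 0 corresponds to T1/P1 and class 1 to T2/P2, and the names are hypothetical. The slides treat only the odd places (even shifts); the handling of odd shifts, where equal classes are the ones that disagree, is an extension added here so that NSM[6] from the earlier example can be checked.

```python
def phase_segments(s):
    """Maximal alternating runs, as (start, end, phase) with 0-based
    half-open bounds. phase 0 is the T1/P1 class, phase 1 the T2/P2 class."""
    segs, start = [], 0
    for k in range(1, len(s) + 1):
        if k == len(s) or s[k] == s[k - 1]:
            segs.append((start, k, (int(s[start]) + start) % 2))
            start = k
    return segs


def nsm_by_overlaps(t, p, shift):
    """One NSM error per odd overlap between a text segment and a
    pattern segment whose symbols disagree at this shift."""
    total = 0
    for ts, te, tph in phase_segments(t):
        for ps, pe, pph in phase_segments(p):
            if (tph ^ pph ^ (shift % 2)) == 0:
                continue                      # these segments agree: no errors
            overlap = min(te, pe + shift) - max(ts, ps + shift)
            if overlap > 0 and overlap % 2 == 1:
                total += 1
    return total


T = "010101101101001"
P = "1001010"
print(nsm_by_overlaps(T, P, 0))   # 1  (place 1, an odd place)
print(nsm_by_overlaps(T, P, 5))   # 2  (place 6, matching NSM[6]=2)
```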

Simpler problem

We separate the segments into 4 categories:

1. Starting at odd position and ending at odd position (called OO)

2. Starting at odd position and ending at even position (called OE)

3. Starting at even position and ending at odd position (called EO)

4. Starting at even position and ending at even position (called EE)
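As a small illustration, the category of a segment depends only on the parity of its first and last positions. The segment boundaries below are those of the alternating runs of the example text T, counted from 1; the helper name `classify` is illustrative.

```python
def classify(start, end):
    """Category of a segment from the parity of its first and last
    positions (positions counted from 1)."""
    return ("O" if start % 2 else "E") + ("O" if end % 2 else "E")


# (start, end) of the alternating segments of T = 0 1 0 1 0 1 1 0 1 1 0 1 0 0 1
segments = [(1, 6), (7, 9), (10, 13), (14, 15)]
print([classify(s, e) for s, e in segments])  # ['OE', 'OO', 'EO', 'EO']
```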


T: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

P: 1 -1 1 -1 1 -1 1 -1

The overlap must start with 1

O(n log m): one convolution
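A minimal sketch of this counting trick, assuming text segments are filled with 1s and pattern segments with 1, -1, 1, ... from their first position, and that all segments start at odd positions. At an even shift every overlap then begins on a +1, so its alternating sum is 1 for an odd overlap and 0 for an even one. The correlation below is written as a plain O(nm) loop for clarity; an FFT-based version would be needed for the bound quoted on the slide. The example segments are made up for illustration.

```python
def odd_overlaps_by_correlation(text_segs, pat_segs, n, m):
    """text_segs / pat_segs: (start, end) pairs, 1-based inclusive, all
    starting at odd positions. Returns one sum per shift; at even shifts
    the sum equals the number of odd overlaps."""
    a = [0] * n
    for start, end in text_segs:
        for k in range(start - 1, end):
            a[k] = 1                               # text segment: all 1s
    b = [0] * m
    for start, end in pat_segs:
        for j in range(start - 1, end):
            b[j] = 1 if (j - start + 1) % 2 == 0 else -1   # 1, -1, 1, ...
    # Plain quadratic correlation, one inner product per shift.
    return [sum(a[s + j] * b[j] for j in range(m)) for s in range(n - m + 1)]


sums = odd_overlaps_by_correlation([(3, 8)], [(1, 5)], n=12, m=6)
print(sums[0::2])   # [1, 1, 0, 0]
```

At shifts 0 and 2 the overlaps have lengths 3 and 5 (odd), and at shifts 4 and 6 they have lengths 4 and 2 (even).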


T: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

P: -1 1 -1 1 -1 1 -1 1

The overlap must start with 1

O(n log m): one convolution


We deal with: O? against O? and with ?O against ?O

The same method works for E? against E? and ?E against ?E

We are left to deal with:
- OE against EO
- EO against OE
- OO against EE
- EE against OO

OO against EE

[Figure: segments of T and P with O and E endpoints; labels: Even overlap, Odd overlap]

We need to recognize when one segment contains the other

T: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

P: 1 -1 1 -1 1 -1 1 -1 1

Contribution of an overlap to the convolution: 0, 1 or -1
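Why containment matters here can be checked exhaustively on small cases. When one segment starts and ends at odd positions and the other starts and ends at even positions, a partial overlap runs from an endpoint of one parity to an endpoint of the other and so has even length; only containment gives an odd overlap. The sketch below verifies that equivalence by brute force; the function names and the `limit` parameter are illustrative.

```python
def overlap_len(a, b, c, d):
    """Length of the intersection of [a, b] and [c, d] (inclusive)."""
    return max(0, min(b, d) - max(a, c) + 1)


def one_contains_other(a, b, c, d):
    return (a <= c and d <= b) or (c <= a and b <= d)


def check(limit=20):
    """Exhaustively compare the two properties for all OO segments [a, b]
    against all EE segments [c, d] with positions below `limit`."""
    for a in range(1, limit, 2):
        for b in range(a, limit, 2):          # a, b odd
            for c in range(2, limit, 2):
                for d in range(c, limit, 2):  # c, d even
                    odd = overlap_len(a, b, c, d) % 2 == 1
                    if odd != one_contains_other(a, b, c, d):
                        return False
    return True


print(check())   # True
```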


We can easily tell whether a segment is contained in, or contains, another segment if we know the segment sizes.

Smaller segments can’t contain larger segments.


Then, for each segment, we split the computation into two parts: against bigger segments and against smaller segments.

We do this by computing the answer for all segments of size ‘x’ at once, one size at a time.


The number of different sizes is at most O(\sqrt{m}), since k different sizes need total length at least 1 + 2 + ... + k.
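The bound on the number of sizes follows from a counting argument: segments are disjoint pieces of a pattern of length m, so k distinct sizes need total length at least 1 + 2 + ... + k = k(k+1)/2, which forces k below \sqrt{2m}. A minimal sketch of that arithmetic, with illustrative names and a made-up example:

```python
import math


def distinct_sizes(segment_lengths):
    return sorted(set(segment_lengths))


def max_distinct_sizes(m):
    """Largest k with 1 + 2 + ... + k <= m."""
    return (math.isqrt(8 * m + 1) - 1) // 2


# Pattern of length m = 15 cut into segments of lengths 1, 2, 3, 4, 5
lengths = [1, 2, 3, 4, 5]
m = sum(lengths)
print(len(distinct_sizes(lengths)), max_distinct_sizes(m))  # 5 5
```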

What we have

We have an algorithm for the simpler problem that runs in time O(n\sqrt{m}\log m)

We have an algorithm for the binary alphabet that runs in O(n\sqrt{m}\log m)

With several more techniques we develop an algorithm solving the original problem in O(n\sqrt{m}\log m)

Open problem

It is easy to see that our algorithm is at most a factor of O(\sqrt{\log m}) from the optimal algorithm (due to a reduction to counting mismatches)

But one can try to improve the small alphabet case
