Ayat A.Dawood 1
Fine Tuning the Enhanced Suffix Arrays
Ayat A. Dawood
CIS, Nile University
Joint work with: Mohamed AbouelHoda
Table of Contents
  Suffix array
  The enhanced suffix array
  Our accomplishment:
    Minimal Perfect Hashing Function
    The exact pattern matching problem
    Improving the bucket table representation
Suffix array
An array of integers in the range 0 to n that specifies the lexicographic order of the n+1 suffixes of S$, where n = |S| and the sentinel $ is treated as larger than every other character.
e.g., S = acaaacatat (n = 10), S$ = acaaacatat$
i    suftab[i]   S(suftab[i])
0    2           aaacatat$
1    3           aacatat$
2    0           acaaacatat$
3    4           acatat$
4    6           atat$
5    8           at$
6    1           caaacatat$
7    5           catat$
8    7           tat$
9    9           t$
10   10          $
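The table can be reproduced with a minimal sketch that simply sorts the suffixes. This is a naive construction chosen for readability, not the construction used in the talk; the function name `build_suffix_array` is an assumption, and `$` is mapped to a very large code point so that it sorts last, as it does in the table.

```python
def build_suffix_array(text):
    """Return suftab for `text`, which must end with the sentinel '$'.

    Naive construction: sort all suffix start positions by the suffix text.
    '$' is replaced by the largest code point so that it sorts after every
    other character, as in the example table.
    """
    def sort_key(start):
        return text[start:].replace("$", "\U0010ffff")

    return sorted(range(len(text)), key=sort_key)


S = "acaaacatat$"
suftab = build_suffix_array(S)
for i, start in enumerate(suftab):
    print(i, start, S[start:])
```

Sorting whole suffixes costs far more than a dedicated suffix array construction algorithm, so this is only suitable for small examples.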
Enhanced suffix array
Basically, it is the suffix array enhanced with a set of additional tables.
With these tables, algorithms on the suffix array achieve better performance and time complexity.
lcptab[i] stores the length of the longest common prefix of the suffixes at suftab[i] and suftab[i-1]; lcptab[0] = 0.
i    lcptab[i]   suftab[i]   S(suftab[i])
0    0           2           aaacatat$
1    2           3           aacatat$
2    1           0           acaaacatat$
3    3           4           acatat$
4    1           6           atat$
5    2           8           at$
6    0           1           caaacatat$
7    2           5           catat$
8    0           7           tat$
9    1           9           t$
10   0           10          $
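As a small illustration of the lcptab definition, the following sketch compares each suffix with its predecessor in the suffix array, character by character. The function name `build_lcptab` is an assumption, and the direct comparison is chosen for clarity rather than speed.

```python
def build_lcptab(text, suftab):
    """lcptab[i] = length of the longest common prefix of the suffixes
    starting at suftab[i-1] and suftab[i]; lcptab[0] is defined as 0."""
    def lcp(a, b):
        k = 0
        while (a + k < len(text) and b + k < len(text)
               and text[a + k] == text[b + k]):
            k += 1
        return k

    return [0] + [lcp(suftab[i - 1], suftab[i])
                  for i in range(1, len(suftab))]


S = "acaaacatat$"
suftab = [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]  # copied from the suffix array table
lcptab = build_lcptab(S, suftab)
print(lcptab)
```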
Enhanced suffix array: l-interval
L-interval: an interval [i..j] of the suffix array whose suffixes share a longest common prefix of length l, written l-[i..j].
lcp-interval tree for S$ = acaaacatat$ (each edge is labeled with the character that leads to the child interval):
0-[0..10]
  a  1-[0..5]
       a  2-[0..1]
       c  3-[2..3]
       t  2-[4..5]
  c  2-[6..7]
  t  1-[8..9]
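The intervals in the tree can be recomputed from the lcp-table with one left-to-right pass and a stack. The sketch below is a simplified version of that idea under the lcptab convention used in this deck; the function name `lcp_intervals` and the tuple layout `(l, left, right)` are assumptions made for illustration.

```python
def lcp_intervals(lcptab):
    """Return all l-intervals as tuples (l, left, right)."""
    n = len(lcptab)
    intervals = []
    stack = [(0, 0)]  # (lcp value, left boundary); the root stays at the bottom
    for i in range(1, n):
        left = i - 1
        # Every open interval with a larger lcp value ends at index i - 1.
        while lcptab[i] < stack[-1][0]:
            l, left = stack.pop()
            intervals.append((l, left, i - 1))
        # A larger lcp value opens a new interval starting at `left`.
        if lcptab[i] > stack[-1][0]:
            stack.append((lcptab[i], left))
    # Whatever is still open ends at the last index.
    while stack:
        l, left = stack.pop()
        intervals.append((l, left, n - 1))
    return intervals


lcptab = [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]  # copied from the lcp-table
for l, left, right in lcp_intervals(lcptab):
    print(f"{l}-[{left}..{right}]")
```

Intervals are reported children first, so the root 0-[0..10] comes last.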
Our accomplishment
Improvements (fine tuning):
  Alphabet-independent exact pattern matching.
  Improving the bucket table representation.
  Improving access to the lcp-table.
The improvements are achieved using minimal perfect hashing techniques.
Minimal perfect hash function (MPHF)
Stores n static keys from a universe U in O(n) space with O(1) access time [Botelho et al.].
A plain lookup table, by contrast, requires O(|U|) space to achieve constant access time.
Exact pattern matching problem
e.g., pattern = aca
Starting at the root 0-[0..10] of the lcp-interval tree, follow the edge a to 1-[0..5] and then the edge c to 3-[2..3]. All suffixes in this interval start with aca, so the pattern occurs in S at positions suftab[2] = 0 and suftab[3] = 4.
Exact pattern matching problem
Naive method: O(nm) time for a text of length n and a pattern of length m.
Using the enhanced suffix array, it can be achieved in O(|Σ|m) [AbouelHoda et al.].
Other modifications of the enhanced suffix array allow it to be done in O(m log |Σ|) [Kim et al.], [Fischer et al.].
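For reference, the O(nm) baseline can be sketched as a direct scan that tries every start position of the text. The function name `naive_occurrences` is an assumption; the example reuses the string and pattern from the earlier slides.

```python
def naive_occurrences(text, pattern):
    """Return every start position of `pattern` in `text`.

    Up to n - m + 1 start positions are tried, and each may compare up to
    m characters, hence O(nm) in the worst case.
    """
    n, m = len(text), len(pattern)
    hits = []
    for start in range(n - m + 1):
        k = 0
        while k < m and text[start + k] == pattern[k]:
            k += 1
        if k == m:
            hits.append(start)
    return hits


print(naive_occurrences("acaaacatat", "aca"))
```

The two positions found are the entries suftab[2] and suftab[3] of the interval 3-[2..3].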
Exact pattern matching problem
Our work: using a minimal perfect hashing technique, it can be achieved in O(m), removing the alphabet factor.
(Figure: the lcp-interval tree for S$ = acaaacatat$, annotated with two MPHF tables.)
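One way to picture the alphabet-independent search is a table keyed by (parent interval, next character) that returns the child interval in a single lookup. The sketch below is an illustration under assumptions, not the authors' implementation: a Python dict stands in for the MPHF table, the tree is built by a simple recursive grouping, and the names `build_child_table` and `find_interval` are hypothetical.

```python
def build_child_table(text, suftab):
    """Return (table, lcp_value): table maps (i, j, c) to the child interval
    of [i..j] whose suffixes continue with character c."""
    table, lcp_value = {}, {}

    def expand(i, j):
        # lcp of the whole interval = lcp of its first and last suffix.
        a, b, l = suftab[i], suftab[j], 0
        while text[a + l] == text[b + l]:
            l += 1
        lcp_value[(i, j)] = l
        k = i
        while k <= j:
            c = text[suftab[k] + l]
            end = k
            while end < j and text[suftab[end + 1] + l] == c:
                end += 1
            table[(i, j, c)] = (k, end)
            if k < end:
                expand(k, end)  # singleton intervals have no children
            k = end + 1

    expand(0, len(suftab) - 1)
    return table, lcp_value


def find_interval(text, suftab, table, lcp_value, pattern):
    """Return the interval (i, j) of suffixes starting with pattern, or None."""
    i, j, depth, m = 0, len(suftab) - 1, 0, len(pattern)
    while True:
        # Number of characters shared by every suffix in [i..j].
        shared = lcp_value[(i, j)] if i < j else len(text) - suftab[i]
        end = min(shared, m)
        start = suftab[i]
        if text[start + depth:start + end] != pattern[depth:end]:
            return None
        depth = end
        if depth == m:
            return (i, j)
        child = table.get((i, j, pattern[depth]))  # one lookup per tree node
        if child is None:
            return None
        i, j = child


S = "acaaacatat$"
suftab = [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]  # copied from the suffix array table
table, lcp_value = build_child_table(S, suftab)
print(find_interval(S, suftab, table, lcp_value, "aca"))
```

Each pattern character is compared once and each visited node costs one table lookup, so the search no longer scans the children of a node one by one.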
Improving the bucket table representation
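Bucket table: a table used to jump directly to a certain lcp-interval in the suffix array. Each entry maps a prefix of length d (here d = 2) to the suffix array index at which the suffixes starting with that prefix begin; prefixes that do not occur in S have no entry.
Prefix   Index
aa       0
ac       2
at       4
ag       -
ca       6
ct       -
cc       -
cg       -
ta       8
tc       -
tg       -
tt       -
ga       -
gt       -
gc       -
gg       -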
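The bucket table for the running example can be rebuilt with a short sketch: enumerate every string of length d over the alphabet and record the first suffix array index whose suffix starts with it. The function name `build_bucket_table` and the use of `None` for missing prefixes are assumptions.

```python
from itertools import product


def build_bucket_table(text, suftab, alphabet, d):
    """Map each length-d string over `alphabet` to the first suffix array
    index whose suffix starts with it, or None if there is no such suffix."""
    table = {"".join(p): None for p in product(alphabet, repeat=d)}
    for i, start in enumerate(suftab):
        prefix = text[start:start + d]
        if prefix in table and table[prefix] is None:
            table[prefix] = i
    return table


S = "acaaacatat$"
suftab = [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]  # copied from the suffix array table
bucket = build_bucket_table(S, suftab, "acgt", 2)
for prefix, index in bucket.items():
    print(prefix, "-" if index is None else index)
```

Of the 16 slots only 5 are filled for this string, which is the waste the next slide addresses.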
Improving the bucket table representation cont’
Problem: the space consumption of the lookup table is prohibitive for large d and Σ, since the table has |Σ|^d entries.
Solution: use minimal perfect hashing techniques to store the lookup table.
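A quick calculation shows how fast the lookup table grows, assuming it holds one slot for every string of length d over the alphabet, that is |Σ|^d slots. The helper names `lookup_table_slots` and `distinct_prefixes` are assumptions; the second helper counts how many of those slots a given text actually needs.

```python
def lookup_table_slots(alphabet_size, d):
    """Number of slots in a direct lookup table over all length-d strings."""
    return alphabet_size ** d


def distinct_prefixes(text, d):
    """Set of length-d substrings that actually occur in `text`."""
    return {text[i:i + d] for i in range(len(text) - d + 1)}


# Toy example from the bucket table slide: d = 2 over {a, c, g, t}.
used = distinct_prefixes("acaaacatat", 2)
print(lookup_table_slots(4, 2), "slots,", len(used), "keys in use")

# Growth for d = 12 with a 4-letter and a 5-letter alphabet.
for sigma in (4, 5):
    print(sigma, lookup_table_slots(sigma, 12))
```

An MPHF only has to cover the keys that occur, so its size follows the number of keys rather than |Σ|^d.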
Improving the bucket table representation cont’
Results: for the bacterial E. coli genome (size = 5400 bp) and for d = 12
Alphabet size        No. of keys   Lookup table size in bits   MPHF size in bits   Reduction compared to lookup table
4 (A, T, C, G)       3474814       1677216                     7231956.638         46%
5 (A, T, C, G, N*)   8451811       244140625                   17590331.64         93%
* N stands for an undefined nucleotide or a dummy character.
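Dividing the reported MPHF size by the number of keys gives the cost per key, which is a convenient way to read the table. The sketch below only reuses the figures reported above; the variable names are assumptions.

```python
# (alphabet size, number of keys, MPHF size in bits), as reported in the table
results = [
    (4, 3474814, 7231956.638),
    (5, 8451811, 17590331.64),
]

bits_per_key = {sigma: mphf_bits / keys for sigma, keys, mphf_bits in results}
for sigma, value in bits_per_key.items():
    print(f"alphabet size {sigma}: {value:.5f} bits per key")
```

Both rows come out at about 2.08 bits per key, so the MPHF cost depends on the number of keys and not on the alphabet size.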
Conclusion
  Exact pattern matching problem
  Improving the bucket table representation
  Improving access to the lcp-table
Questions???
Improving access to the lcp-table
To reduce space, each lcp-table entry is stored in 1 byte.
If a common prefix is longer than 255, its length is stored in another (extra) table.
This extra table is accessed sequentially or using binary search.
Our enhancement: use an MPHF to store the extra table so that it can be accessed in constant time.
(Figure: an lcp-table with the entries 0, 2, 3, 2, 0 and an extra lcp-table with the entries 257, 279, 300, 260.)
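The two-level layout can be sketched as follows. Several details are assumptions made for illustration: the byte value 255 is used as a marker meaning "look in the extra table", the extra table is keyed by position, a Python dict stands in for the MPHF, and the order in which small and large values are interleaved is invented for the example.

```python
SENTINEL = 255  # assumed marker: the real value lives in the extra table


def pack_lcp(lcp_values):
    """Split lcp values into a 1-byte-per-entry table and an extra table."""
    small = bytearray()
    extra = {}  # position -> full lcp value; a dict stands in for the MPHF
    for i, value in enumerate(lcp_values):
        if value >= SENTINEL:
            small.append(SENTINEL)
            extra[i] = value
        else:
            small.append(value)
    return small, extra


def get_lcp(small, extra, i):
    """Constant-time access: one byte read plus at most one table lookup."""
    value = small[i]
    return extra[i] if value == SENTINEL else value


# Hypothetical arrangement of the values shown in the figure.
lcp_values = [0, 2, 3, 257, 2, 279, 0, 300, 260]
small, extra = pack_lcp(lcp_values)
print(list(small), extra)
```

With a sorted list instead of the keyed table, each large value would need a sequential scan or a binary search, which is the cost the enhancement removes.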