Full-Text Indexing via Burrows-Wheeler Transform
Wing-Kai Hon
Oct 18, 2006
Outline
The Text Searching Problem
What is Full-Text Indexing?
Burrows-Wheeler Transform (BWT)
BWT as a Full-Text Index
Related work
Text Searching
Text: acacaaccagtcacactagac……
Pattern: acac
Where does the pattern occur in the text?
How fast can we search?
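Let n be the length of the text and m be the length of the pattern
We can find all positions where the pattern appears in O(n + m) time
  (Knuth-Morris-Pratt, Boyer-Moore)
Is O(n + m) time good?
Yes, because it is optimal when the text is not known in advance: every character of the input may have to be read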
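As an illustration of the O(n + m) bound, here is a minimal sketch of Knuth-Morris-Pratt matching. The function name kmp_search and the 1-based reporting of positions are choices made for this example, and the text used is only the visible prefix of the slide's text.

```python
def kmp_search(text, pattern):
    """Return the 1-based start positions of pattern in text (KMP)."""
    m = len(pattern)
    if m == 0:
        return []
    # fail[i] = length of the longest proper prefix of pattern[:i+1]
    # that is also a suffix of it.
    fail = [0] * m
    k = 0
    for i in range(1, m):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # Scan the text once; k is the number of pattern chars matched so far.
    matches = []
    k = 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == m:
            matches.append(i - m + 2)  # convert end index to 1-based start
            k = fail[k - 1]
    return matches


print(kmp_search("acacaaccagtcacactagac", "acac"))  # [1, 13]
```

Each text character is examined a constant number of times on average, which is where the O(n + m) total comes from.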
Text Searching (take 2)
Text: acacaaccagtcacactagac……
  (we know the text in advance and can preprocess it)
Pattern: acac
Where does the pattern occur in the text?
Can we do better?
Yes, there is a data structure for the text, and by creating that, pattern search only takes O(m + occ) time, where occ = number of times the pattern appears in the text
Such a data structure is called an index
Is O(m + occ) time useful?
Yes, if the text is very long and it is searched many times for different patterns
Full-Text Index
Full-Text Index
  Deals with creating an index for a text
  Also, each position in the text corresponds to an appearance of at least one pattern (full)
Word-Level Index
  Text is a sequence of words
  The positions within a word do not correspond to an appearance of any pattern
  E.g., Text: Was it a cat I saw? (Pattern: “at” does not have an appearance)
Suffix Tree: An Optimal Full-Text Index
As mentioned, we can create an index for the text such that pattern searching can be done in O(m + occ) time, where occ is the number of times the pattern appears
This time is optimal
One such index is the Suffix Tree
  Introduced by P. Weiner in 1973; E. McCreight gave a more space-economical construction in 1976
Suffix and Suffix Tree
Given a string S, a substring of S that ends at the last position is called a suffix of S
If S consists of n chars, S has exactly n suffixes
Theorem: If a pattern P appears at position j in S, P appears at the beginning of the suffix of S that starts at position j
E.g., S: acacaac#
Suffixes of S:
  acacaac#  (start at pos 1)
  cacaac#   (start at pos 2)
  acaac#    (start at pos 3)
  caac#     (start at pos 4)
  aac#      (start at pos 5)
  ac#       (start at pos 6)
  c#        (start at pos 7)
  #         (start at pos 8)
Suppose P = ac is a pattern. Then, P appears at pos 1, pos 3 and pos 6 in S.
  pos 1: P begins the suffix acacaac#
  pos 3: P begins the suffix acaac#
  pos 6: P begins the suffix ac#
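The theorem can be checked directly on the example. The sketch below lists the suffixes of S with their 1-based start positions and keeps those that begin with P; the helper names suffixes and occurrences are made up for this illustration.

```python
def suffixes(s):
    """Return (1-based start position, suffix) for every suffix of s."""
    return [(j + 1, s[j:]) for j in range(len(s))]


def occurrences(s, p):
    """Positions where p appears in s: the suffixes that p is a prefix of."""
    return [pos for pos, suffix in suffixes(s) if suffix.startswith(p)]


print(len(suffixes("acacaac#")))        # 8
print(occurrences("acacaac#", "ac"))    # [1, 3, 6]
```

This scan takes time proportional to the text length for every query, so it only illustrates the theorem; it is not an index.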
Suffix and Suffix Tree (2)
The suffix tree is an edge-labeled compact tree (no degree-1 nodes) with n leaves such that each leaf corresponds to a suffix
Concatenating edge labels along the path from root to leaf gives the corresponding suffix
The edge labels to the children of a node start with different characters
Example (next slide)
[Figure: The Suffix Tree of acacaac#. Edges are labeled with substrings of the text; the eight leaves are labeled with the starting positions 1 to 8 of the corresponding suffixes.]
Searching with Suffix Tree
To search P, we match P starting from the root
If we can match P successfully in the tree, the leaves under the stop point are all the suffixes that correspond to an appearance of P in the text
Then, we traverse the tree under the stop point to report where P appears
So, searching is done in O(m + occ) time, where occ is the number of appearances of P
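The search procedure can be sketched with a simplified structure. The code below builds an uncompacted suffix trie (one edge per character) using nested dicts rather than a true compact suffix tree, so it uses O(n^2) space and does not achieve the O(m + occ) bound; it only shows the "match from the root, then collect the leaves below" idea. The names build_suffix_trie and search, and the use of the key None to mark a leaf, are assumptions of this example.

```python
def build_suffix_trie(text):
    """Insert every suffix of text, one character per edge.

    The text is assumed to end with a unique marker such as '#',
    so every suffix ends at its own leaf.
    """
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node[None] = start + 1  # leaf label: 1-based start of the suffix
    return root


def search(root, pattern):
    """Match pattern from the root, then report all leaves below."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []  # the pattern does not appear in the text
        node = node[ch]
    positions = []
    stack = [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key is None:
                positions.append(child)
            else:
                stack.append(child)
    return sorted(positions)


trie = build_suffix_trie("acacaac#")
print(search(trie, "ac"))   # [1, 3, 6]
print(search(trie, "aca"))  # [1, 3]
```

A real suffix tree merges each chain of single-child nodes into one edge, which brings the number of nodes down to O(n).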
Is Suffix Tree good?
Yes, because of its optimal search time
No, because of its space requirement…
The space can be much larger than the text
E.g., Text = DNA of Human
  To store the text, we need 0.8 Gbyte
  To store the suffix tree, we need 64 Gbyte!
Something Wrong??
Both the suffix tree and the text have n things, so they both need O(n) space…
How come there is a big difference??
Let us have a better analysis
Let A be the alphabet (i.e., the set of distinct characters) of a text T
  E.g., in DNA, A = {a,c,g,t}
Something Wrong?? (2)
To store T, we need only n log |A| bits
But to store the suffix tree, we will need n log n bits
When n is very large compared to |A|, there is a huge difference
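A rough numeric check of the two bounds. The value n = 3.2 × 10^9 is an assumption: it is the text length implied by 0.8 Gbyte at 2 bits per DNA character, and is not stated in the slides. The second figure counts a single log n-bit pointer per text position.

```python
import math

n = 3_200_000_000   # assumed text length (implied by 0.8 Gbyte at 2 bits/char)
alphabet_size = 4   # A = {a, c, g, t}

bits_per_char = math.ceil(math.log2(alphabet_size))   # 2
bits_per_pointer = math.ceil(math.log2(n))            # 32

text_bits = n * bits_per_char         # n log |A|
pointer_bits = n * bits_per_pointer   # n log n

print(text_bits / 8 / 1e9)     # 0.8   (Gbyte for the text)
print(pointer_bits / 8 / 1e9)  # 12.8  (Gbyte for one pointer per position)
```

A suffix tree typically keeps several such pointer-sized fields per text position, so its real size is a small multiple of the second figure.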
Question: Is there an index that supports fast searching, but occupies O( n log |A| ) bits only??
Burrows-Wheeler Transform
Arrange the suffixes of the text in ‘sorted’ order; the Burrows-Wheeler Transform is the array storing the ‘preceding char’ of each suffix, in that order
Example (next slide)
Text = acacaac#
BWT   Suffix in sorted order
c     #
c     aac#
a     ac#
c     acaac#
#     acacaac#
a     c#
a     caac#
a     cacaac#
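The table can be reproduced with a few lines of code. This is a naive sketch that sorts the suffixes directly (quadratic or worse in time and space), not a practical construction. It assumes the text ends with a unique marker '#' that sorts before every other character, and the function name bwt is chosen for this example.

```python
def bwt(text):
    """Burrows-Wheeler Transform by explicitly sorting all suffixes.

    For each suffix in sorted order, output the character that
    precedes it in the text (the whole text is preceded by its
    last character, the end marker).
    """
    order = sorted(range(len(text)), key=lambda i: text[i:])
    return "".join(text[i - 1] for i in order)


print(bwt("acacaac#"))  # ccac#aaa
```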
BWT is useful
The BWT has been shown to compress more easily than the original text
Also, given the position in the BWT array where the last character appears, we can get back the original text
How?
Text = acacaac#
Sorted BWT   BWT   Suffix in sorted order
#            c     #
a            c     aac#
a            a     ac#
a            c     acaac#
a            #     acacaac#
c            a     c#
c            a     caac#
c            a     cacaac#
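One way to answer "How?" is the standard last-to-first mapping: the k-th occurrence of a character in the BWT column and the k-th occurrence of that character in the Sorted BWT column refer to the same character of the text. The sketch below starts at the row whose BWT entry is the end marker and walks backwards through the text. The function name inverse_bwt and the end_marker parameter are assumptions of this example.

```python
def inverse_bwt(bwt, end_marker="#"):
    """Recover the text from its BWT using the last-to-first mapping."""
    n = len(bwt)
    # first_row[c] = first row of the Sorted BWT column that holds c.
    first_row = {}
    for row, ch in enumerate(sorted(bwt)):
        first_row.setdefault(ch, row)
    # lf[i] = row of the Sorted BWT column holding the same text
    # character as bwt[i] (k-th c in BWT <-> k-th c in Sorted BWT).
    seen = {}
    lf = []
    for ch in bwt:
        lf.append(first_row[ch] + seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1
    # The row whose BWT entry is the end marker belongs to the whole
    # text; following lf visits the text's characters from last to first.
    row = bwt.index(end_marker)
    chars = []
    for _ in range(n):
        chars.append(bwt[row])
        row = lf[row]
    return "".join(reversed(chars))


print(inverse_bwt("ccac#aaa"))  # acacaac#
```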
BWT Index
Ferragina and Manzini (2000) observed that we can use the BWT to support pattern searching by storing some additional O(n)-bit arrays
Precisely, let B[1..n] be the BWT. With the additional arrays, for any x, we can count the number of any char in B[1..x] in constant time
Then, we can count the number of times that a pattern appears in the text in O(m) time (How?)
Text = acacaac#, Pattern = aca
Sorted BWT   BWT   Suffix in sorted order
#            c     #
a            c     aac#
a            a     ac#
a            c     acaac#
a            #     acacaac#
c            a     c#
c            a     caac#
c            a     cacaac#
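The counting step is usually done by backward search: process the pattern from its last character to its first, keeping the range of rows whose suffixes start with the part of the pattern read so far. The sketch below is an illustration under simplifying assumptions, not the paper's data structure: the constant-time character counts are stored as plain prefix-count tables (far more than O(n) bits), and the BWT is built by naive suffix sorting. All function names are chosen for this example.

```python
def build_count_index(text):
    """Build the BWT, the first-row table and prefix counts of each char."""
    n = len(text)
    order = sorted(range(n), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in order)
    # first_row[c] = number of characters in the text smaller than c.
    first_row = {}
    for row, ch in enumerate(sorted(bwt)):
        first_row.setdefault(ch, row)
    # rank[c][x] = number of occurrences of c in the first x BWT entries.
    rank = {c: [0] * (n + 1) for c in first_row}
    for i, ch in enumerate(bwt):
        for c in rank:
            rank[c][i + 1] = rank[c][i] + (c == ch)
    return bwt, first_row, rank


def count(index, pattern):
    """Number of times pattern appears, using O(m) rank lookups."""
    bwt, first_row, rank = index
    lo, hi = 0, len(bwt)  # rows lo..hi-1 match the suffix of pattern read so far
    for ch in reversed(pattern):
        if ch not in first_row:
            return 0
        lo = first_row[ch] + rank[ch][lo]
        hi = first_row[ch] + rank[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_count_index("acacaac#")
print(count(index, "aca"))  # 2
```

For Pattern = aca the row range narrows from the four suffixes starting with a, to the two starting with ca, to the two starting with aca.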
BWT Index
They also show that, by storing another O(n)-bit array, we can report each position where the pattern appears in O(log n) time
So, searching is done in O(m + occ log n) time, where occ is the number of times the pattern appears
What is the space? O( n log |A| ) bits
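A common way to report positions is to keep the text position for only a sample of the rows and to walk the last-to-first mapping from an unsampled row until a sampled one is reached. The sketch below illustrates that idea only; it is an assumption that this mirrors the extra array mentioned above, and the sampling rule (every text position divisible by sample_rate), the dict-based storage and all names are choices made for this example.

```python
def build_locate_index(text, sample_rate=2):
    """BWT, first-row table, prefix counts, and sampled text positions."""
    n = len(text)
    order = sorted(range(n), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in order)
    first_row = {}
    for row, ch in enumerate(sorted(bwt)):
        first_row.setdefault(ch, row)
    rank = {c: [0] * (n + 1) for c in first_row}
    for i, ch in enumerate(bwt):
        for c in rank:
            rank[c][i + 1] = rank[c][i] + (c == ch)
    # Keep the 0-based text position only for rows whose suffix starts
    # at a multiple of sample_rate (position 0 is always kept).
    samples = {row: pos for row, pos in enumerate(order)
               if pos % sample_rate == 0}
    return bwt, first_row, rank, samples


def locate(index, pattern):
    """1-based positions where pattern appears in the text."""
    bwt, first_row, rank, samples = index
    lo, hi = 0, len(bwt)
    for ch in reversed(pattern):  # backward search for the row range
        if ch not in first_row:
            return []
        lo = first_row[ch] + rank[ch][lo]
        hi = first_row[ch] + rank[ch][hi]
        if lo >= hi:
            return []
    positions = []
    for row in range(lo, hi):
        steps = 0
        while row not in samples:
            # Move to the row of the suffix starting one position earlier.
            ch = bwt[row]
            row = first_row[ch] + rank[ch][row]
            steps += 1
        positions.append(samples[row] + steps + 1)
    return sorted(positions)


index = build_locate_index("acacaac#")
print(locate(index, "aca"))  # [1, 3]
print(locate(index, "ac"))   # [1, 3, 6]
```

A larger sample_rate stores fewer positions but makes each reported occurrence take more steps, which is the space/time trade-off behind the per-occurrence cost.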
Related Work
Further compress the index
  Space is now measured in terms of the entropy (or the randomness) of a text
Support text with large alphabet
Efficient Construction
  Challenge is in minimizing working space
More complex queries and operations
  Library problem, Dictionary problem
Pointers for Further Study
The Pizza & Chili website
  http://pizzachili.di.unipi.it
The FM-index paper by P. Ferragina and G. Manzini, FOCS 2000
The CSA paper by R. Grossi and J.S. Vitter, STOC 2000
Discuss with me ^_^ (email: wkhon@)