Full-Text Indexing via Burrows-Wheeler Transform Wing-Kai Hon Oct 18, 2006


Outline

The Text Searching Problem
What is Full-Text Indexing?
Burrows-Wheeler Transform (BWT)
BWT as a Full-Text Index
Related work

Text Searching

Text: acacaaccagtcacactagac……

Pattern: acac

Where does the pattern occur in the text?

How fast can we search?

Let n be the length of the text and m be the length of the pattern

We can find all positions where the pattern appears in O(n + m) time (e.g., Knuth-Morris-Pratt, Boyer-Moore)

Is O(n + m) time good?

Yes, because it is optimal: in the worst case we must read the whole text and the whole pattern
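As an illustration of the O(n + m) bound, here is a minimal sketch of Knuth-Morris-Pratt matching. The function name kmp_search is hypothetical, and positions are 0-based here, whereas the slides count from 1.

```python
def kmp_search(text, pattern):
    """Return all 0-based start positions of pattern in text, in O(n + m) time."""
    m = len(pattern)
    if m == 0:
        return []
    # fail[i] = length of the longest proper prefix of pattern[:i+1]
    # that is also a suffix of it.
    fail = [0] * m
    k = 0
    for i in range(1, m):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # Scan the text once; k is the number of pattern characters matched so far.
    hits = []
    k = 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == m:
            hits.append(i - m + 1)
            k = fail[k - 1]
    return hits


print(kmp_search("acacaaccagtcacactagac", "acac"))
```

Each text character is read once and the failure table is built in O(m) time, which is where the O(n + m) total comes from.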

Text Searching (take 2)

Pattern: acac

Where does the pattern occur in the text?

Text: acacaaccagtcacactagac……

Now suppose we know the text in advance and can preprocess it

Can we do better?

Yes, there is a data structure for the text; once it is built, a pattern search takes only O(m + occ) time, where occ = number of times the pattern appears in the text

Such a data structure is called an index

Is O(m + occ) time useful?

Yes, if the text is very long and it is searched many times for different patterns

Full-Text Index

Full-Text Index
Deals with creating an index for a text
Also, each position in the text corresponds to an appearance of at least one pattern (full)

Word-Level Index
Text is a sequence of words
The positions within a word do not correspond to an appearance of any pattern
E.g., Text: Was it a cat I saw? (Pattern: “at” does not have an appearance)

Suffix Tree:An Optimal Full-Text Index

As mentioned, we can create an index for the text such that pattern searching can be done in O(m + occ) time, where occ is the number of appearances of the pattern

This time is optimal

One such index is the Suffix Tree

Introduced by P. Weiner in 1973; E. McCreight gave a more space-efficient construction in 1976

Suffix and Suffix Tree

Given a string S, a substring of S that ends at the last position is called a suffix of S

If S consists of n chars, S has exactly n suffixes

Theorem: If a pattern P appears at position j in S, P appears at the beginning of the suffix of S that starts at position j

E.g., S: acacaac#

Suffixes of S:
acacaac# (start at pos 1)
cacaac# (start at pos 2)
acaac# (start at pos 3)
caac# (start at pos 4)
aac# (start at pos 5)
ac# (start at pos 6)
c# (start at pos 7)
# (start at pos 8)

Suppose P = ac is a pattern. Then, P appears at pos 1, pos 3 and pos 6 in S.

acacaac# (suffix at pos 1 begins with ac)
acaac# (suffix at pos 3 begins with ac)
ac# (suffix at pos 6 begins with ac)
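The theorem can be checked directly on this example. The sketch below lists every suffix with its 1-based starting position and keeps those that begin with the pattern; the function names are hypothetical, and the quadratic cost is only acceptable for a toy input.

```python
def suffixes(s):
    """Return (1-based start position, suffix) for every suffix of s."""
    return [(j + 1, s[j:]) for j in range(len(s))]


def occurrences_via_suffixes(s, p):
    """Positions where p appears = positions of suffixes that start with p."""
    return [pos for pos, suf in suffixes(s) if suf.startswith(p)]


print(occurrences_via_suffixes("acacaac#", "ac"))
```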

Suffix and Suffix Tree (2)

The suffix tree is an edge-labeled compact tree (no degree-1 nodes) with n leaves such that each leaf corresponds to a suffix

Concatenating edge labels along the path from root to leaf gives the corresponding suffix

The edge labels to the children of a node start with different characters

Example (next slide)

[Figure: The Suffix Tree of acacaac#. Each of the 8 leaves is labeled with the starting position (1 to 8) of its suffix.]

Searching with Suffix Tree

To search P, we match P starting from the root

If we can match P successfully in the tree, the leaves under the stop point are all the suffixes that correspond to an appearance of P in the text

Then, we traverse the tree under the stop point to report where P appears

So, searching is done in O(m + occ) time: O(m) to match P, plus O(occ) to report its occ appearances
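The search procedure can be sketched on a simplified structure. The code below builds a naive suffix trie (one node per character, so it is not the compact suffix tree of the slides and needs O(n^2) space), then matches P from the root and collects the leaves under the stop point. All names are hypothetical, and the reporting step here is not O(occ) because paths are not compacted.

```python
LEAF = "leaf"  # marker key; never collides with single-character edge labels


def build_suffix_trie(text):
    """Insert every suffix of text into a nested-dict trie (toy sizes only)."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node[LEAF] = start + 1  # 1-based starting position of this suffix
    return root


def search(root, pattern):
    """Match pattern from the root, then report all leaves below the stop point."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []  # the pattern does not appear in the text
        node = node[ch]
    found = []
    stack = [node]
    while stack:
        cur = stack.pop()
        for key, child in cur.items():
            if key == LEAF:
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


trie = build_suffix_trie("acacaac#")
print(search(trie, "ac"))
```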

Is Suffix Tree good?

Yes, because of its optimal search time

No, because of its space requirement…

The space can be much larger than the text

E.g., Text = human DNA
To store the text, we need 0.8 Gbyte
To store the suffix tree, we need 64 Gbyte!

Something Wrong??

Both the suffix tree and the text have n things, so they both need O(n) space…

How come there is a big difference??

Let us have a better analysis

Let A be the alphabet (i.e., the set of distinct characters) of a text T
E.g., in DNA, A = {a,c,g,t}

Something Wrong?? (2)

To store T, we need only n log |A| bits

But to store the suffix tree, we will need n log n bits

When n is very large compared to |A|, there is a huge difference
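A rough calculation makes the gap concrete. The sketch below assumes n = 3,000,000,000 characters for human DNA (a round figure not given in the slides) and rounds logarithms up to whole bits; the function names are hypothetical.

```python
import math


def text_bits(n, alphabet_size):
    """Bits to store n characters at ceil(log2 |A|) bits per character."""
    return n * math.ceil(math.log2(alphabet_size))


def pointer_array_bits(n):
    """Bits to store n text positions at ceil(log2 n) bits per position."""
    return n * math.ceil(math.log2(n))


n = 3_000_000_000  # assumed length of human DNA, for illustration only
print(text_bits(n, 4) / 8 / 1e9, "Gbyte for the text")
print(pointer_array_bits(n) / 8 / 1e9, "Gbyte for one array of n positions")
```

Under these assumptions the text needs 2 bits per character, about 0.75 Gbyte, in line with the 0.8 Gbyte quoted earlier. A single array of n positions already needs 32 bits per entry, about 12 Gbyte, and a suffix tree stores several such fields per suffix.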

Question: Is there an index that supports fast searching, but occupies O( n log |A| ) bits only??

Burrows-Wheeler Transform

Arrange the suffixes of the text in sorted order; the Burrows-Wheeler Transform is the array storing the character that precedes each suffix

Example (next slide)

BWT   Suffix in sorted order
c     #
c     aac#
a     ac#
c     acaac#
#     acacaac#
a     c#
a     caac#
a     cacaac#

Text = acacaac#
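The definition translates almost directly into code. This is a minimal sketch, assuming the text ends with a unique terminator # that sorts before every other character (true for ASCII letters); the function name bwt is hypothetical. Sorting whole suffixes is simple but slow, so it is only meant for small examples.

```python
def bwt(text):
    """Return the BWT of text: the character preceding each suffix, in sorted suffix order."""
    n = len(text)
    # Sort the suffix start positions by the suffix they denote.
    order = sorted(range(n), key=lambda j: text[j:])
    # text[j - 1] wraps around for j == 0, so the full text is preceded by its last character.
    return "".join(text[j - 1] for j in order)


print(bwt("acacaac#"))
```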

BWT is useful

The BWT has been shown to compress more easily than the original text

Also, given the position in the BWT array where the last character of the text appears, we can get back the original text

How?

BWT   Sorted BWT
c     #
c     a
a     a
c     a
#     a
a     c
a     c
a     c

Text = acacaac#
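One way to answer "How?" is the following sketch. It relies on the fact that the i-th occurrence of a character in the BWT column and the i-th occurrence of that character in the sorted column are the same character of the text. The code assumes a unique terminator # that sorts first, so row 0 is the suffix "#"; the function name inverse_bwt is hypothetical.

```python
def inverse_bwt(b, terminator="#"):
    """Rebuild the text from its BWT b, assuming a unique, smallest terminator."""
    # first_row[c] = first row of the sorted column that holds character c
    counts = {}
    for ch in b:
        counts[ch] = counts.get(ch, 0) + 1
    first_row = {}
    total = 0
    for ch in sorted(counts):
        first_row[ch] = total
        total += counts[ch]
    # rank[i] = number of occurrences of b[i] in b[:i]
    seen = {}
    rank = []
    for ch in b:
        rank.append(seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1
    # Start at row 0 (the suffix "#") and repeatedly step to the row whose
    # suffix starts one position earlier, collecting the text backwards.
    row = 0
    out = [terminator]
    for _ in range(len(b) - 1):
        ch = b[row]
        out.append(ch)
        row = first_row[ch] + rank[row]
    return "".join(reversed(out))


print(inverse_bwt("ccac#aaa"))
```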

BWT Index

Ferragina and Manzini (2000) observe that we can use the BWT to support pattern searching by storing some additional O(n)-bit arrays

Precisely, let B[1..n] be the BWT. With the additional arrays, for any x and any character, we can count the number of occurrences of that character in B[1..x] in constant time

Then, we can count the number of times that a pattern appears in the text in O(m) time (How?)

BWT   Sorted BWT
c     #
c     a
a     a
c     a
#     a
a     c
a     c
a     c

Text = acacaac#, Pattern = aca
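The counting step can be sketched as a backward search: process the pattern from its last character to its first, keeping the range of sorted rows whose suffixes begin with the part matched so far. In this sketch a plain prefix-count table stands in for the constant-time counting arrays (it uses far more than O(n) bits), and all names are hypothetical.

```python
def build_counts(b):
    """Return (first_row, occ): occ[c][x] = number of c in b[:x]."""
    n = len(b)
    alphabet = sorted(set(b))
    occ = {ch: [0] * (n + 1) for ch in alphabet}
    for i, cur in enumerate(b):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (1 if ch == cur else 0)
    first_row = {}
    total = 0
    for ch in alphabet:
        first_row[ch] = total  # rows of the sorted column starting with ch begin here
        total += occ[ch][n]
    return first_row, occ


def count_occurrences(b, pattern):
    """Count appearances of pattern in the text whose BWT is b, in O(m) steps."""
    first_row, occ = build_counts(b)
    lo, hi = 0, len(b)  # half-open row range [lo, hi)
    for ch in reversed(pattern):
        if ch not in first_row:
            return 0
        lo = first_row[ch] + occ[ch][lo]
        hi = first_row[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


print(count_occurrences("ccac#aaa", "aca"))
```

For Pattern = aca the row range (0-based, half-open) shrinks from [1, 5) for "a" to [6, 8) for "ca" to [3, 5) for "aca", so the pattern appears twice.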

BWT Index (2)

They also show that, by storing another O(n)-bit array, we can report where the pattern appears in O(occ log n) time, where occ is the number of appearances

So, searching is done in O(m + occ log n) time

What is the space? O( n log |A| ) bits
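One common way to support reporting is to remember the text position of only some rows and to walk backwards through the text from an unsampled row until a sampled one is reached. The sketch below is an assumption about how this can be done, not the construction from the paper; the names and the sampling parameter step are hypothetical, and the plain tables used here are not space-efficient.

```python
def build_index(text, step=4):
    """Toy BWT index; text must end with a unique, smallest terminator."""
    n = len(text)
    sa = sorted(range(n), key=lambda j: text[j:])  # suffix start per sorted row
    b = "".join(text[j - 1] for j in sa)
    alphabet = sorted(set(b))
    occ = {ch: [0] * (n + 1) for ch in alphabet}  # occ[c][x] = count of c in b[:x]
    for i, cur in enumerate(b):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (1 if ch == cur else 0)
    first_row, total = {}, 0
    for ch in alphabet:
        first_row[ch] = total
        total += occ[ch][n]
    # Keep the text position only for rows whose position is a multiple of step.
    samples = {row: pos for row, pos in enumerate(sa) if pos % step == 0}
    return b, first_row, occ, samples


def locate(index, pattern):
    """Return the 1-based positions where pattern appears."""
    b, first_row, occ, samples = index
    lo, hi = 0, len(b)
    for ch in reversed(pattern):  # backward search for the row range
        if ch not in first_row:
            return []
        lo = first_row[ch] + occ[ch][lo]
        hi = first_row[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    positions = []
    for row in range(lo, hi):
        steps = 0
        while row not in samples:  # move to the suffix starting one position earlier
            ch = b[row]
            row = first_row[ch] + occ[ch][row]
            steps += 1
        positions.append(samples[row] + steps + 1)
    return sorted(positions)


print(locate(build_index("acacaac#"), "ac"))
```

Because position 0 is always sampled, every walk stops after fewer than step moves; choosing step close to log n is what gives about log n work per reported appearance in this sketch.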

Related Work

Further compress the index
Space is now measured in terms of the entropy (or the randomness) of a text

Support text with large alphabet

Efficient Construction
Challenge is in minimizing working space

More complex queries and operations
Library problem, Dictionary problem

Pointers for Further Study

The Pizza & Chili website: http://pizzachili.di.unipi.it

The FM-index paper by P. Ferragina and G. Manzini, FOCS 2000

The CSA paper by R. Grossi and J.S. Vitter, STOC 2000

Discuss with me ^_^ (email: wkhon@)
