Tries. (Compacted) Trie 1 2 2 0 4 5 6 7 2 3 y s 1 z stile zyg 5 etic ial ygy aibelyite czecin omo systile syzygetic syzygial syzygy szaibelyite szczecin

Tries

(Compacted) Trie

1

2 2

0

4

5

6

7

2 3

y

s

1z

stile zyg

5

etic

ialygy

aibelyite

czecin

omo

systile syzygetic syzygial syzygy szaibelyite szczecin szomo

[Fredkin, CACM 1960]

(2; 3,5)

Performance:• Search ≈ O(|P|) time

• Space ≈ O(K + N)

(Compacted) Trie

1

2 2

0

4

5

6

7

2 3

y

s

1z

stile zyg

5

etic

ialygy

aibelyite

czecin

omo

systile syzygetic syzygial syzygy szaibelyite szczecin szomo

[Fredkin, CACM 1960]

(2; 3,5)

... But in practice…• Search: random memory accesses

• Space: len + pointers + strings

Performance:• Search ≈ O(|P|) time

• Space ≈ O(K + N)

….0systile 2zygetic 5ial 5y 0szaibelyite 2czecin 2omo….

systile szaielyite

CTon a sample

2-level indexing

Disk

InternalMemory 2 limitations:

• Sampling rate ≈ lengths of sampled strings

• Trade-off ≈ speed vs space (because of bucket size)

2 advantages:• Search ≈ typically 1 I/O

• Space ≈ Front-coding over buckets

An old idea: Patricia Trie

1

2 2

0

4

5

6

7

2 3

y

s1

z

stile zyg

5

etic

ial

ygy

aibelyite

czecin

omo

[Morrison, J.ACM 1968]

A new search

….systile syzygetic syzygial syzygy szaibelyite szczecin szomo….

2 2

0

y

s

1 z

sz

5

e

i

y

a

c

o

Search(P):• Phase 1: tree navigation• Phase 2: Compute LCP• Phase 3: tree navigation

Three-phase search:P = syzyyea

01

2 5 g < y

P’s positionOnly 1 string is checked

Trie Space ≈ #strings, NOT their

length

[Ferragina-Grossi, J.ACM 1999]

….Locality Preserving Front Coding….

PTon all strings

2-level indexing

Disk

InternalMemory

A limitation is n < M

Typically 1 I/O

What about n > M

The String B-tree

29 1 9 5 2 26 10 4 7 13 20 16 28 8 25 6 12 15 22 18 3 27 24 11 14 21 17 23

29 2 26 13 20 25 6 18 3 14 21 23

29 13 20 18 3 23

PT PT PT

PT PT PT PT PT PT

PTSearch(P) •O((p/B) logB n) I/Os•O(occ/B) I/Os

It is dynamic...

1 string checked : O(p/B)

O(logB n) levels

+

Lexicographic position of P

[Ferragina-Grossi, J.ACM 1999]

Knuth, vol 3°, pag. 489: “elegant”

GA

1

65

3

6

4

0

4 5 6 72 3

AGA GCGC

GGAGAG C

A G A G A

On Front-Coding…

AGAAGA5 G3 C0 GCGCAGA6 G4 GGA6 GA

Knuth

In-order visit+

Path covering

Front Coding

3

0

Compacted Trie =

FC + tree structureWhat about other traversals ?

FC + ... is searchable

Why pre-order visit

In Front-coding theLcp information is encoded many times

GA

1

65

3

6

4

0

4 5 6 72 3

AGA GCGC

GGAGAG C

A G A G A

3

0

AGAAGA1 G3 C4 GCGCAGA1 G3 GGA1 GA

RearCoding

Text Indexing

What do we mean by “Indexing” ?

Word-based indexes, here a notion of “word” must be devised !» Inverted lists, Signature files, Bitmaps.

Full-text indexes, no constraint on text and queries !» Suffix Array, Suffix tree, String B-tree,...

Basic notation and facts

Occurrences of P in T = All suffixes of T having P as a prefix

SUF(T) = Sorted set of suffixes of T

T = mississippi mississippi 4,7P = si

T[i,N]

iff P is a prefix of the i-th suffix of T (ie. T[i,N])

TPi

Pattern P occurs at position i of T

From substring searchTo prefix search

Reduction

The Suffix Tree

T# = mississippi# 2 4 6 8 10

12

11 8

5 2 1 10 9

7 4

6 3

0

4

#

i

ppi#

ssi

mississip

pi#

1

p

i# pi#

2

1

s

i

ppi#

ssippi#

3

si

ssippi#

ppi#

1

#

ppi#ssip

pi#

Label = <pos,len>

Space: #nodes

Search pattern P

Maximal repeatedsubstring = node

The Suffix Array

Prop 1. All suffixes in SUF(T) with prefix P are contiguous.

P=si

T = mississippi#

#i#ippi#issippi#ississippi#mississippi#pi#ppi#sippi#sissippi#ssippi#ssissippi#

SUF(T)

Suffix Array• SA: N log2 N) bits

• Text T: N chars In practice, a total of 5N bytes

SA

121185211097463

T = mississippi#

suffix pointer

5

Prop 2. Starting position is the lexicographic one of P.

Searching a pattern

Indirected binary search on SA: O(p) time per suffix cmp

T = mississippi#SA

121185211097463

P = si

P is larger

2 accesses per step

Searching a patternIndirected binary search on SA: O(p) time per suffix cmp

T = mississippi#SA

121185211097463

P = si

P is smaller

Suffix Array search• O(log2 N) binary-search steps

• Each step takes O(p) char cmp

overall, O(p log2 N) time

+[Manber-Myers, ’90]

Locating the occurrences

T = mississippi# 4 7SA

121185211097463

si#

occ=2

121185211097463

121185211097463 si$

Suffix Array search• O (p + log2 N + occ) time

where # < < $

sissippisippi

Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Text mining

• How long is the common prefix between T[i,...] and T[j,...] ?

• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=jLcp

01140010213

#i#ippi#issippi#ississippi#mississippipi#ppi#sippi#sissippi#ssippi#ssissippi#

12

1185211097463

SA

Lcp(7,3) = 1= min{2,1,3}


Text mining

• Does it exist a repeated substring of length ≥ L ?

• Maximal Lcp of a suffix is with its adjacent

• Search for Lcp[i] ≥ L

Lcp

01140010213


12

1185211097463

SA

Lcp

01140010213


Text mining

Does exist a substring of length ≥ L occurring ≥ C

times ?

• Exist ≥ C equal substrings of length ≥ L chars

• Exist ≥ C suffixes sharing a prefix of ≥ L chars

• These suffixes may be not contiguous, but...

• Their “block” has a common prefix of ≥ L

chars

• Search for Lcp[i,i+C-2] whose entries are

≥ L


12

1185211097463

SA

L = 1, C = 4

How to construct SA from T ?


12

1185211097463

SA

Elegant but inefficient

Obvious inefficiencies:• (n2 log n) time in the worst-case• (n log n) cache misses or I/O faults

Input: T = mississippi#

The skew algorithm The key problem: Compare efficiently two

suffixes Brute-force = (n) time per cmp, (n2 log n)

total

In order to sort the suffixes of S1. Divide the suffixes of S in two groups

S0,2 = suffixes starting at positions 0 mod 3 or 2 mod 3

S1 = suffixes starting at positions 1 mod 3

2a. Sort recursively S0,2 (they are 2n/3)

2b. Sort S1: suffix(3i+1) = S[3i+1] suff(3i+2)

3. Merge the sorted S0,2 with the sorted S1

T(n) = O(split) + T(2n/3) + O(|S1|) + O(merge) = O(n)

Sort recursively S0,2

We turn this problem into the SA-construction of a shorter string of length (2/3)n.

S=AAT GTG AGA TGA $$$

RadixSort all triplets that start at positions 0,2 mod

3 T = {ATG, TGT, TGA, GAG, GAT, ATG, GA$, A$$} Sort(T) = (A$$, ATG, GA$, GAG, GAT, TGA, TGT)

Assign lexicographic names (log n bits) A$$=1, ATG=2, GA$=3,…

Build s0,2 and encode it: ATG TGA GAT GA$ TGT GAG ATG A$$ 2 6 5 3 7 4 2 1

1 2 3 4 5 6 7 8 9 10 11 12 13

Sort recursively S0,2

Given


We have built:

s0,2 = ATG TGA GAT GA$ TGT GAG ATG A$$ enc(s0,2) = 2 6 5 3 7 4 2 1

It is SA0,2 = [12, 9, 2, 11, 6, 8, 5, 3]

1 2 3 4 5 6 7 8 9 10 11 12 13

A suffix of s0,2

A suffix of enc(s0,2)

SA(enc(s0,2)) gives SA0,2

Lex-order ispreserved

Sort S1

We turn this problem into the sort of pairs


Key observation: suff(1) = <A, pos(2)> = <A,3>

suff(7) = <A, pos(8)> = <A,6>

SA0,2 = [12, 9, 2, 11, 6, 8, 5, 3]

Suffix of S1

1 2 3 4 5 6 7 8 9 10 11 12 13

SA1 = [1, 7, 4, 10]

The merge step

To merge suffix si in S0,2 with suffix sk in S1, note that

If (i mod 3) = 2 si+1 and sk+1 belong to S0,2

If (i mod 3) = 0 si+2 and sk+2 belong to S0,2

their order can be derived from SA0,2 in O(1) time

SA1 SA0,2

SA

T(n) = T(2n/3) + O(n) + O(merge) = O(n)

S=AAT GTG AGA TGA $$$1 2 3 4 5 6 7 8 9 10 11 12 13

Documents

Tries. (Compacted) Trie 1 2 2 0 4 5 6 7 2 3 y s 1 z stile zyg 5 etic ial ygy aibelyite czecin omo systile syzygetic syzygial syzygy szaibelyite szczecin