Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1...

Substring statistics : Algorithm

0.0001

1e-006 1e-005 0.0001 0.001 0.01 0.1 1

NTCIR2G(any ngrams)

There are N(N-1)/2 substrings in a corpus of lenth N. 　Substring statistics are useful to find word boundary in Japanese

１９

Kyoji Umemura and Kenneth W. Church CICLing 2009

April　１４,　２００９

How it works

0.0001

1e-006 1e-005 0.0001 0.001 0.01 0.1 1

NTCIR2G(any ngrams)

First prepare the table of all statistics. Then obtain each statistics using binary search

１９

Result: PreparationO(N), Access: O(log(N))

Size of Table is O(N), not O(N^2) There are N(N-1)/2 substrings in a corpus of lenth N. But many of them have same statistics.

１９

N(N-1)/2 is too large for memory.

Strings with same statistics are grouped into a class

Saving Preparation Time If each table may required O(N) computation to get value, the preparation time would be O(N^2) because table size (N).

１９

obtaining statistics of a shorter substring using statistics of longer substring which share the begining part.

definition of cf(x)

abc abc ab

ab abc ab

ab ab abc ab

Corpus cf(x) : number of occurrences of string X in a corpus

cf (abc) =5 #4 #5 #6

#1 #2 #3

abc abc ab

ab abc ab

ab ab abc ab

Corpus cf(x) : number of occurrences of string (not word) x. The word "abc" contains a string "ab".

cf (abc) =5 #4 #5 #6

#1 #2 #3

cf (ab) =11

definition of cf(x)：x is string

definition of tf(d,x)

abc abc ab

ab abc ab

ab ab abc ab

Corpus tf(d,x) : number of occurrence of string x in document # d

tf (1, abc) =2

#4 #5 #6

#1 #2 #3

definition of tf(d,x):x is string

abc abc ab

ab abc ab

ab ab abc ab

Corpus tf(d,x) : number of occurrence of string x in document # d

tf (1, abc) =2

#4 #5 #6

#1 #2 #3

ex. tf (6, ab) =4

definition of df(x)

abc abc ab

ab abc ab

ab ab abc ab

Corpus df(x) : number of documents which contain the string x, at least once.

df (abc) =4 #4 #5 #6

#1 #2 #3

definition of df(x): x is string

abc abc ab

ab abc ab

ab ab abc ab

テキストコーパス df(x) : number of documents which contain the string x . The word "abc" contains a string "ab". df (abc) =4 #4 #5 #6

#1 #2 #3

ex. df (ab) =5

definition of df2(x)

abc abc ab

ab abc ab

ab ab abc ab

Corpus df2 (x) : the number of documents which contain string x twice or more.

df2 (abc) =1

#4 #5 #6

#1 #2 #3

definition of df2(x)　

abc abc ab

ab abc ab

ab ab abc ab

Corpus df2 (x) : the number of documents which contain string x twice or more.

#4 #5 #6

#1 #2 #3

ex. df2 (ab) =3

definition of dfk(x)　

abc abc ab

ab abc ab

ab ab abc ab

Corpus dfk(x) : the number of documents which contain string x k times or more.

#4 #5 #6

#1 #2 #3

ex. df3 (ab) =2

Suffix Array １

a c a b d b b

a c b a c b b a c a b b a c a b b b a c a b d b b

a c b b

a c a b b

a c a b b b

a c a b d b b

suffix[0] suffix[1] suffix[2] suffix[3] suffix[4] suffix[5] suffix[6]

sorted as dictionary

Suffix Array

Suffix Array １

Suffix Array ２

a c a b d b b

a c b b

a c a b b

a c a b b b

a c a b d b b

suffix[0] suffix[1] suffix[2] suffix[3] suffix[4] suffix[5] suffix[6]

A suffix is expressed by one integer

suffix[6] suffix[5]

suffix[2]

suffix[0] suffix[1]

suffix[3] suffix[4]

Space 　O(N )

Suffix Array ２

common[i ] : suffix[i ]におけるsuffix[i+1]との共通の先頭部分 outgoing : outgoing(i,j)=max(common[i-1],common[j]) inner : inner (i,j) =min (common[k])

k=1 j -1

Classes: Set of strings which starts with same suffixes

c d b a

a c b a a

b c d b a a c e b c d

a a b e e d a a b e e e e

suffix[0]

suffix[1]

suffix[2]

suffix[3]

suffix[4]

suffix[5]

The set of string, "aabb", "aabbc", and "aabbcc" is a class because they start with from suffix1 to suffix2 and not other suffix.

"aabb", "aabbc", and "aabbcc"

"a" and "aa"

Members of a class shares statistics

For class C, x, y in C satifies following, because they start with same suffixes.

A df (x) = df (y) df2 (x) = df2 (y) dfn(x) = dfn(y)

cf (x) = cf (y)

tf (d,x) = tf (d,y)

c d b a

a c b a a

suffix[0]

suffix[1]

suffix[2]

suffix[3]

suffix[4]

suffix[5]

lcp[0] = 2, since suffix[0] and suffix[1] have common prefix "aa", and the length of "aa" is 2.

LCP(length of Common Prefix)

common[i ] : suffix[i ]におけるsuffix[i+1]との共通の先頭部分 outgoing : outgoing(i,j)=max(common[i-1],common[j]) inner : inner (i,j) =min (common[k])

k=1 j -1

Classes on Suffix Array

c d b a

a c b a a

suffix[0]

suffix[1]

suffix[2]

suffix[3]

suffix[4]

suffix[5]

A class corresponds to an suffix interval which satisfies ... outgoing(1,4)=max(lcp[0],lcp[4])=2

　　　　 inner(1,4) =min (lcp[k]) =3 k=1 3

Upper and Lower Class

Lower class

Upper class

if C1 contains all C2 suffixes, C1 is upper class of C2 C2 is lower class of C1

suffix array

Property of Class

Lower class

Upper class

Classes form one tree. Therefore total number of classes are less than 2N-1 where N is size of Corpus

suffix array

Corpus

#4 #5 #6

#1 #2 #3

cf (abc) cf (abx) cf (ab) =3+6

=9 abx

number of occurrence of a string are obtained by longer strings

Corpus

#4 #5 #6

#1 #2 #3

number of documents are not

cf (abc) =3 cf (abx) =6 cf (ab) =3+6

=9 df (abc) =3 df (abx) df (ab) =3+4？

numbering string "ab" in a document by suffix order.

Document　#１

suffix[11] suffix[23] suffix[29] suffix[89] suffix[94]

a d b c a a

a y b a x b

abx suffixes of doc #1 abe

abx aac

for string "ab", the numbering of suffix[89]

(abx..)　is 3.

numbering string "a" in a document by suffix order.

Document　#１

suffix[11] suffix[23] suffix[29] suffix[89] suffix[94]

a d b c a a

a y b a x b

abx suffixes of doc #1 abe

abx aac

for string "a", the numbering of suffix[89]

(abx..)　is 4.

k-frequency of string x

cfk (x)

tfk (d,x)

:sum of tfk (d,x) in all d of corpus

: number of string x in document d, where numbering of x is k or larger.

For class C, x, y in C satifies following, because they start with same suffixes. tfk (d,x) = tfk (d,y)

Property

k-frequency and document frequency

document　#1 document　#2 document　#3

abc abc

dfk(x)=cfk(x)-cfk+1(x) property

cf8(ab)=1

ab ab ab

1 1 1 2 2

2 3 3 3 4 4

6 7 7 8

cf7(ab)=3

7 7 ab ab

cf6(ab)=6

ab ab ab

df7(ab)=cf7(ab)-cf8(ab)=3-1=2 df6(ab)=cf6(ab)-cf7(ab)=6-3=3

enumerating classes scanning suffix array and LCP

suffix array

suffix[0] suffix[1] suffix[2] suffix[3]

LCP for each suffix from begin to end

1. LCP increase : begining of class

2. LCP decrease: end of class

assing start point , and lcp of class push class record to stack

pop class record assign end point of class

suffix[1] suffix[2]

detecting class

suffix[2]

detecting class

class record in stack incomplete class

suffix position

Note: needs more care for the classes sharing start or end

Data structure for k-frequency -document link-

abcx abcy

abcz abcy abcy

abcx abcx

abcx abcz

abcz abcz

abcx abcx

12301 12308

12345 12357

12388 12405

12415 12536

12444 12456

12465 12476 12481

12500 12494

abcx abcy

abcz abcy abcy

abcx abcx

abcx abcz

12301 12345 12357 12379 12388 12415 12465 12481 12308

12331 12336

12313 12305

12456 12494

abcz abcz

abcz abcy

abcy abcx

Marking the numbering is 3 or more for "abcx", "abcy" and "abcz"

abcx abcy

abcz abcy abcy

abcx abcx

abcx abcz

abcz abcz

abcx abcx

12301 12308

12345 12357

12388 12405

12415 12536

12444 12456

12465 12476 12481

12500 12494

abcx abcy

abcz abcy abcy

abcx abcx

abcx abcz

12301 12345 12357 12379 12388 12415 12465 12481 12308

12331 12336

12313 12305

12456 12494

abcz abcz

abcz abcy

abcy abcx

Marking the numbering is 3 or more for "abc"

abcx abcy

abcz abcy abcy

abcx abcx

abcx abcz

abcz abcz

abcx abcx

12301 12308

12345 12357

12388 12405

12415 12536

12444 12456

12465 12476 12481

12500 12494

abcx abcy

abcz abcy abcy

abcx abcx

abcx abcz

12301 12345 12357 12379 12388 12415 12465 12481 12308

12331 12336

12313 12305

12456 12494

abcz abcz

abcz abcy

abcy abcx

Counting k-frequency n  Deside Range by Document Link, n  Count up variables in class in stack where start is smaller than range

Suffix Position

number of record in class

Reducing computation by Class*

simple method　:　counting up variables in all incomplete classes where the class contains specified range. 　　　　　　　

→multiple counting for one suffix

There is a way to get same result by single counting with propagation.

Suffixes marked for "abcx" "abcy" and "abcz", are also marked in "abc"

abcx abcy

abcz abcy abcy

abcx abcx

abcx abcz

abcz abcz

abcx abcx

12301 12308

12345 12357

12388 12405

12415 12536

12444 12456

12465 12476 12481

12500 12494

abc -1

abcx abcy

abcz abcy abcy

abcx abcx

abcx abcz

12301 12345 12357 12379 12388 12415 12465 12481 12308

12331 12336

12313 12305

12456 12494

abcz abcz

abcz abcy

abcy abcx

Ranges by document link show that they belongs to "abc", and not "abcx", "abcy"...

abcx abcy

abcz abcy abcy

abcx abcx

abcx abcz

abcz abcz

abcx abcx

12301 12308

12345 12357

12388 12405

12415 12536

12444 12456

12465 12476 12481

12500 12494

abc -1

abcx abcy

abcz abcy abcy

abcx abcx

abcx abcz

12301 12345 12357 12379 12388 12415 12465 12481 12308

12331 12336

12313 12305

12456 12494

abcz abcz

abcz abcy

abcy abcx

definition Class* of interval n  The class which has smallest interval among

the classes which contains the interval. n  i.e. The lowest class among the class which

contains the interval.

suffix position

number record instack

Reducing computation by Class*

simple method　:　counting up variables in all incomplete classes where the class contains specified range. 　　　　　　　

noting the following property

Class forms a tree structure. if a suffix has number K for one class, the suffix has K or more number in upper class.

Property

counting up variables of class* of range, then propergate count to upper class

→multiple counting for one suffix

How to find Class* n  Class* is a class on stack n  where its start is smaller than range start

and closest one

suffix position

number of stack in a class

How to find Class* n  using binary search, search class of

appropriate start position. Note: class in stack are ordered by start.

suffix position

number of stack in a class

Preparation

abcabcabc abcd abcde bcde

Corpus class[4,8] 　 df=3 df2=1 S=“abc” class[5,6] 　df=1 df2=1 S=“abcabc” class[7,8] 　df=2 df2=0 S=“abcd” class[9,14] 　 df=4 df2=1 S=“bc” class[10,11] 　df=1 df2=1 S=“bcabc” ・・・

Tables

df=3 df2=1 S=“abc”

abc, ab, a class[4,8]

class[5,6] df=1 df2=1 S=“abcabc”

abca,abcab, abcabc

class[7,8] 　df=2 df2=0 S=“abcd”

class[9,14] df=4 df2=1 S=“bc”

bca,bcab, bcabc class[10,11]

df=1 df2=1 S=“bcabc”

Sort the class table by dictionary order of the longest string of Class

Access to table

“abc” ??

abcabcabc abcd abcde bcde

コーパス class[4,8] 　 df=3 df2=1 S=“abc” class[5,6] 　df=1 df2=1 S=“abcabc” class[7,8] 　df=2 df2=0 S=“abcd” class[9,14] 　 df=4 df2=1 S=“bc” class[10,11] 　df=1 df2=1 S=“bcabc” ・・・

df=3 df2=1 S=“abc”

abc, ab, a class[4,8]

class[5,6] df=1 df2=1 S=“abcabc”

abca,abcab, abcabc

class[7,8] 　df=2 df2=0 S=“abcd”

class[9,14] df=4 df2=1 S=“bc”

bca,bcab, bcabc class[10,11]

df=1 df2=1 S=“bcabc”

df=3 df2=1

bBinary search to find first class that maches query. note: other ordering is also possible

Conclusion

n  Various document frequencies can be computed by K-frequency of string

n  K-frequecy of short string is sum of K-frequency of longer strings, that is not the case for document frequency.

n  K-frequency is efficiently computed using the suffix array and document link.

Download & Try

n  http://www.ss.cs.tut.ac.jp/umemura/cicling2009/

n  Try ./show-class for your favorite string. n  suggesting short string with patterns n  verify the various frequencies

Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1...

Documents

Time-Space Trade-Offs for the Longest Common Substring Problem · 2013-07-09 · Time-Space Trade-Offs for the Longest Common Substring Problem Tatiana Starikovskaya1 and Hjalte Wedel

On the k -Closest Substring and k -Consensus Pattern Problems

B A N Q U E T M E N U · Coconut, cacao and date energy balls (v,gf,df) Organic nuts and seeds with sundried goji berries (v,gf,df) Bam! smoothies, with berry, acai, mango (v,gf,df)

Introduction Df blDeformable MiMirror FEM · Introduction Df blDeformable MiMirror FEM calculations Load force: 10 N Annular Zernike coefficients (rms ) Pi t ( ) 782 (rms.) Piston

The Formwork Experts. Attachable drive unit DF - Doka · The Formwork Experts. Attachable drive unit DF Art. n°: 586062000 | 1999 models onward ... Push the Shifting trolley DF in

Scanned by CamScanner · DER - DF DETRAN - DF DER - DF DER - DF DER - DF DER - DF DETRAN - DF DETRAN - DF DE-TRAN - DF DER - DF ... Seguro Obrígatório e Multas. Segue extrato financeiro

Computing longest common substring and all palindromes from compressed strings

n (8 September, Wednesday, 2021) n k[i7 * n k|b]z @ df

New Implementation of vdW-DF functional in ABINIT · 2017. 5. 18. · vdW-DF, Deﬁnition and implementation Exc[n(r)] = EGGA x [n(r)] + ELDA c [n(r)] + Enl c [n(r)] Where Enl c =

Machine Translation without Words through Substring …27 Machine Translation without Words through Substring Alignment Experimental Setup EuroParl KFTT de fi fr en ja en TM 2.56M

Multi-column Substring Matching for Database Schema Translation

Substring Search

Features · 2019. 9. 24. · DF FLUID COOLING | Mobile DF Series AIR COOLED DF Features n Same as DH with DC Fan n 3/4” Tube Size n Low AMP Draw 12 or 24 Volt DC Motors n Heavy

Efï¬cient Top-k Algorithms for Approximate Substring Matching

Slender PUF Protocol Authentication by Substring Matching

Substring-based Machine Translation - Graham Neubig

Longest Palindromic Substring

String Sorts Tries Substring Search: KMP, BM, RK

Multi-column Substring Matching For Database Schema ...lenzerin/DASI-School/materiale... · Multi-column Substring Matching For Database Schema Translation (And other wild thoughts

Longest Common Substring - · PDF file2 1. Background The longest common substring problem is a classic problem in string analysis. In 1970 the famous computer scientist Donald E