Upload
others
View
11
Download
0
Embed Size (px)
Citation preview
Substring statistics : Algorithm
0.0001
0.001
0.01
0.1
1
1e-006 1e-005 0.0001 0.001 0.01 0.1 1
df2/
df
df/N
NTCIR2G(any ngrams)
There are N(N-1)/2 substrings in a corpus of lenth N. Substring statistics are useful to find word boundary in Japanese
19
Kyoji Umemura and Kenneth W. Church CICLing 2009
April 14, 2009
How it works
0.0001
0.001
0.01
0.1
1
1e-006 1e-005 0.0001 0.001 0.01 0.1 1
df2/
df
df/N
NTCIR2G(any ngrams)
First prepare the table of all statistics. Then obtain each statistics using binary search
19
Result: PreparationO(N), Access: O(log(N))
Size of Table is O(N), not O(N^2) There are N(N-1)/2 substrings in a corpus of lenth N. But many of them have same statistics.
19
N(N-1)/2 is too large for memory.
Strings with same statistics are grouped into a class
Saving Preparation Time If each table may required O(N) computation to get value, the preparation time would be O(N^2) because table size (N).
19
obtaining statistics of a shorter substring using statistics of longer substring which share the begining part.
definition of cf(x)
abc abc ab
ab abc ab
abc
ab ab abc ab
Corpus cf(x) : number of occurrences of string X in a corpus
cf (abc) =5 #4 #5 #6
#1 #2 #3
Ex.
abc abc ab
ab abc ab
abc
ab ab abc ab
Corpus cf(x) : number of occurrences of string (not word) x. The word "abc" contains a string "ab".
cf (abc) =5 #4 #5 #6
#1 #2 #3
Ex.
cf (ab) =11
definition of cf(x):x is string
definition of tf(d,x)
abc abc ab
ab abc ab
abc
ab ab abc ab
Corpus tf(d,x) : number of occurrence of string x in document # d
tf (1, abc) =2
#4 #5 #6
#1 #2 #3
ex.
definition of tf(d,x):x is string
abc abc ab
ab abc ab
abc
ab ab abc ab
Corpus tf(d,x) : number of occurrence of string x in document # d
tf (1, abc) =2
#4 #5 #6
#1 #2 #3
ex. tf (6, ab) =4
definition of df(x)
abc abc ab
ab abc ab
abc
ab ab abc ab
Corpus df(x) : number of documents which contain the string x, at least once.
df (abc) =4 #4 #5 #6
#1 #2 #3
ex.
definition of df(x): x is string
abc abc ab
ab abc ab
abc
ab ab abc ab
テキストコーパス df(x) : number of documents which contain the string x . The word "abc" contains a string "ab". df (abc) =4 #4 #5 #6
#1 #2 #3
ex. df (ab) =5
definition of df2(x)
abc abc ab
ab abc ab
abc
ab ab abc ab
Corpus df2 (x) : the number of documents which contain string x twice or more.
df2 (abc) =1
#4 #5 #6
#1 #2 #3
ex.
definition of df2(x)
abc abc ab
ab abc ab
abc
ab ab abc ab
Corpus df2 (x) : the number of documents which contain string x twice or more.
#4 #5 #6
#1 #2 #3
ex. df2 (ab) =3
definition of dfk(x)
abc abc ab
ab abc ab
abc
ab ab abc ab
Corpus dfk(x) : the number of documents which contain string x k times or more.
#4 #5 #6
#1 #2 #3
ex. df3 (ab) =2
Suffix Array 1
a c a b d b b
c c b
a c b a c b b a c a b b a c a b b b a c a b d b b
c c b
a c b
a c b b
a c a b b
a c a b b b
a c a b d b b
suffix[0] suffix[1] suffix[2] suffix[3] suffix[4] suffix[5] suffix[6]
sorted as dictionary
Suffix Array
Suffix Array 1
Suffix Array 2
a c a b d b b
c c b
a c b
a c b b
a c a b b
a c a b b b
a c a b d b b
suffix[0] suffix[1] suffix[2] suffix[3] suffix[4] suffix[5] suffix[6]
A suffix is expressed by one integer
suffix[6] suffix[5]
suffix[2]
suffix[0] suffix[1]
suffix[3] suffix[4]
Space O(N )
Suffix Array 2
common[i ] : suffix[i ]におけるsuffix[i+1]との共通の先頭部分 outgoing : outgoing(i,j)=max(common[i-1],common[j]) inner : inner (i,j) =min (common[k])
k=1 j -1
Classes: Set of strings which starts with same suffixes
c d b a
c b
a c b a a
b c
a b
a
c a d
d d
b c d b a a c e b c d
a a b e e d a a b e e e e
d d d
suffix[0]
suffix[1]
suffix[2]
suffix[3]
suffix[4]
suffix[5]
The set of string, "aabb", "aabbc", and "aabbcc" is a class because they start with from suffix1 to suffix2 and not other suffix.
"aabb", "aabbc", and "aabbcc"
"aab"
"a" and "aa"
Members of a class shares statistics
For class C, x, y in C satifies following, because they start with same suffixes.
A df (x) = df (y) df2 (x) = df2 (y) dfn(x) = dfn(y)
cf (x) = cf (y)
tf (d,x) = tf (d,y)
A
c d b a
c b
a c b a a
b c
a b
a
c a d
d d
b c d b a a c e b c d
a a b e e d a a b e e e e
d d d
suffix[0]
suffix[1]
suffix[2]
suffix[3]
suffix[4]
suffix[5]
lcp[0] = 2, since suffix[0] and suffix[1] have common prefix "aa", and the length of "aa" is 2.
LCP(length of Common Prefix)
common[i ] : suffix[i ]におけるsuffix[i+1]との共通の先頭部分 outgoing : outgoing(i,j)=max(common[i-1],common[j]) inner : inner (i,j) =min (common[k])
k=1 j -1
Classes on Suffix Array
c d b a
c b
a c b a a
b c
a b
a
c a d
d d
b c d b a a c e b c d
a a b e e d a a b e e e e
d d d
suffix[0]
suffix[1]
suffix[2]
suffix[3]
suffix[4]
suffix[5]
A class corresponds to an suffix interval which satisfies ... outgoing(1,4)=max(lcp[0],lcp[4])=2
inner(1,4) =min (lcp[k]) =3 k=1 3
[1,4]
[0,0]
[1,1]
[2,2]
[3,3]
[4,4]
[5,5]
[0,4]
[1,2]
[3,4]
Upper and Lower Class
Lower class
Upper class
if C1 contains all C2 suffixes, C1 is upper class of C2 C2 is lower class of C1
suffix array
Property of Class
Lower class
Upper class
Classes form one tree. Therefore total number of classes are less than 2N-1 where N is size of Corpus
suffix array
Corpus
#4 #5 #6
#1 #2 #3
cf (abc) cf (abx) cf (ab) =3+6
=3 =6
=9 abx
abc
abc
abc
abx
abx
abx
abx
abx
number of occurrence of a string are obtained by longer strings
Corpus
#4 #5 #6
#1 #2 #3
number of documents are not
cf (abc) =3 cf (abx) =6 cf (ab) =3+6
=9 df (abc) =3 df (abx) df (ab) =3+4?
=4
=5
abx
abc
abc
abc
abx
abx
abx
abx
abx
numbering string "ab" in a document by suffix order.
Document #1
abc
abd
abx
aby
suffix[11] suffix[23] suffix[29] suffix[89] suffix[94]
a d b c a a
b a e
a y b a x b
abx suffixes of doc #1 abe
abd
abx aac
for string "ab", the numbering of suffix[89]
(abx..) is 3.
1
3
4
2
numbering string "a" in a document by suffix order.
Document #1
abc
abd
abx
aby
suffix[11] suffix[23] suffix[29] suffix[89] suffix[94]
a d b c a a
b a e
a y b a x b
abx suffixes of doc #1 abe
abd
abx aac
for string "a", the numbering of suffix[89]
(abx..) is 4.
2
4
5
3
1
k-frequency of string x
cfk (x)
tfk (d,x)
:sum of tfk (d,x) in all d of corpus
: number of string x in document d, where numbering of x is k or larger.
For class C, x, y in C satifies following, because they start with same suffixes. tfk (d,x) = tfk (d,y)
Property
k-frequency and document frequency
document #1 document #2 document #3
ab
ab ab
ab
abc
abc
abc
ab
ab
ab
abc abc
abc abc
abc
abc abc
abc
ab
ab ab
dfk(x)=cfk(x)-cfk+1(x) property
cf8(ab)=1
ab ab
ab ab ab
ab ab
ab ab ab
ab ab
ab ab
ab
ab
ab ab
ab ab
ab
1 1 1 2 2
2 3 3 3 4 4
4 5 5
5 6 6
6 7 7 8
cf7(ab)=3
7 7 ab ab
cf6(ab)=6
ab ab ab
6
6
6
df7(ab)=cf7(ab)-cf8(ab)=3-1=2 df6(ab)=cf6(ab)-cf7(ab)=6-3=3
enumerating classes scanning suffix array and LCP
suffix array
suffix[0] suffix[1] suffix[2] suffix[3]
LCP for each suffix from begin to end
1. LCP increase : begining of class
…
stack
2. LCP decrease: end of class
assing start point , and lcp of class push class record to stack
pop class record assign end point of class
suffix[1] suffix[2]
detecting class
suffix[2]
detecting class
class record in stack incomplete class
suffix position
Note: needs more care for the classes sharing start or end
Data structure for k-frequency -document link-
Doc
umen
t #
234
Doc
umen
t #
235
abcx abcy
abcz abcy abcy
abcx abcx
abcx
abcx abcz
abcz abcz
abcz
abcx
abcx abcx
abcy
abcy
abcy
12301 12308
12311
12331
12313
12345 12357
12379
12388 12405
12415 12536
12444 12456
12465 12476 12481
12500 12494
abcx
abcy
abcz
-1
-1
abcx abcy
abcz abcy abcy
abcx abcx
abcx
abcx abcz
12301 12345 12357 12379 12388 12415 12465 12481 12308
12331 12336
12500
12444
12311
12476
12313 12305
12456 12494
-1
-1
Doc
umen
t #
234
abcz abcz
abcz abcy
abcy
abcy abcx
abcx
abcx
Doc
umen
t #
235
Marking the numbering is 3 or more for "abcx", "abcy" and "abcz"
Doc
umen
t #
234
Doc
umen
t #
235
abcx abcy
abcz abcy abcy
abcx abcx
abcx
abcx abcz
abcz abcz
abcz
abcx
abcx abcx
abcy
abcy
abcy
12301 12308
12311
12331
12313
12345 12357
12379
12388 12405
12415 12536
12444 12456
12465 12476 12481
12500 12494
abcx
abcy
abcz
-1
-1
abcx abcy
abcz abcy abcy
abcx abcx
abcx
abcx abcz
12301 12345 12357 12379 12388 12415 12465 12481 12308
12331 12336
12500
12444
12311
12476
12313 12305
12456 12494
-1
-1
Doc
umen
t #
234
abcz abcz
abcz abcy
abcy
abcy abcx
abcx
abcx
Doc
umen
t #
235
Marking the numbering is 3 or more for "abc"
Doc
umen
t #
234
Doc
umen
t #
235
abcx abcy
abcz abcy abcy
abcx abcx
abcx
abcx abcz
abcz abcz
abcz
abcx
abcx abcx
abcy
abcy
abcy
12301 12308
12311
12331
12313
12345 12357
12379
12388 12405
12415 12536
12444 12456
12465 12476 12481
12500 12494
abc
-1
-1
abcx abcy
abcz abcy abcy
abcx abcx
abcx
abcx abcz
12301 12345 12357 12379 12388 12415 12465 12481 12308
12331 12336
12500
12444
12311
12476
12313 12305
12456 12494
-1
-1
Doc
umen
t #
234
abcz abcz
abcz abcy
abcy
abcy abcx
abcx
abcx
Doc
umen
t #
235
Counting k-frequency n Deside Range by Document Link, n Count up variables in class in stack where start is smaller than range
Suffix Position
number of record in class
Range
Reducing computation by Class*
simple method : counting up variables in all incomplete classes where the class contains specified range.
→multiple counting for one suffix
There is a way to get same result by single counting with propagation.
Suffixes marked for "abcx" "abcy" and "abcz", are also marked in "abc"
Doc
umen
t #
234
Doc
umen
t #
235
abcx abcy
abcz abcy abcy
abcx abcx
abcx
abcx abcz
abcz abcz
abcz
abcx
abcx abcx
abcy
abcy
abcy
12301 12308
12311
12331
12313
12345 12357
12379
12388 12405
12415 12536
12444 12456
12465 12476 12481
12500 12494
abcx
abcy
abcz
abc -1
-1
abcx abcy
abcz abcy abcy
abcx abcx
abcx
abcx abcz
12301 12345 12357 12379 12388 12415 12465 12481 12308
12331 12336
12500
12444
12311
12476
12313 12305
12456 12494
-1
-1
Doc
umen
t #
234
abcz abcz
abcz abcy
abcy
abcy abcx
abcx
abcx
Doc
umen
t #
235
Ranges by document link show that they belongs to "abc", and not "abcx", "abcy"...
Doc
umen
t #
234
Doc
umen
t #
235
abcx abcy
abcz abcy abcy
abcx abcx
abcx
abcx abcz
abcz abcz
abcz
abcx
abcx abcx
abcy
abcy
abcy
12301 12308
12311
12331
12313
12345 12357
12379
12388 12405
12415 12536
12444 12456
12465 12476 12481
12500 12494
abcx
abcy
abcz
abc -1
-1
abcx abcy
abcz abcy abcy
abcx abcx
abcx
abcx abcz
12301 12345 12357 12379 12388 12415 12465 12481 12308
12331 12336
12500
12444
12311
12476
12313 12305
12456 12494
-1
-1
Doc
umen
t #
234
abcz abcz
abcz abcy
abcy
abcy abcx
abcx
abcx
Doc
umen
t #
235
definition Class* of interval n The class which has smallest interval among
the classes which contains the interval. n i.e. The lowest class among the class which
contains the interval.
suffix position
number record instack
Range
Reducing computation by Class*
simple method : counting up variables in all incomplete classes where the class contains specified range.
noting the following property
Class forms a tree structure. if a suffix has number K for one class, the suffix has K or more number in upper class.
Property
counting up variables of class* of range, then propergate count to upper class
→multiple counting for one suffix
How to find Class* n Class* is a class on stack n where its start is smaller than range start
and closest one
suffix position
number of stack in a class
Range
How to find Class* n using binary search, search class of
appropriate start position. Note: class in stack are ordered by start.
suffix position
number of stack in a class
Range
Preparation
abcabcabc abcd abcde bcde
Corpus class[4,8] df=3 df2=1 S=“abc” class[5,6] df=1 df2=1 S=“abcabc” class[7,8] df=2 df2=0 S=“abcd” class[9,14] df=4 df2=1 S=“bc” class[10,11] df=1 df2=1 S=“bcabc” ・・・
Tables
df=3 df2=1 S=“abc”
abc, ab, a class[4,8]
class[5,6] df=1 df2=1 S=“abcabc”
abca,abcab, abcabc
class[7,8] df=2 df2=0 S=“abcd”
abcd
class[9,14] df=4 df2=1 S=“bc”
b, bc
bca,bcab, bcabc class[10,11]
df=1 df2=1 S=“bcabc”
Sort the class table by dictionary order of the longest string of Class
Access to table
“abc” ??
abcabcabc abcd abcde bcde
コーパス class[4,8] df=3 df2=1 S=“abc” class[5,6] df=1 df2=1 S=“abcabc” class[7,8] df=2 df2=0 S=“abcd” class[9,14] df=4 df2=1 S=“bc” class[10,11] df=1 df2=1 S=“bcabc” ・・・
df=3 df2=1 S=“abc”
abc, ab, a class[4,8]
class[5,6] df=1 df2=1 S=“abcabc”
abca,abcab, abcabc
class[7,8] df=2 df2=0 S=“abcd”
abcd
class[9,14] df=4 df2=1 S=“bc”
b, bc
bca,bcab, bcabc class[10,11]
df=1 df2=1 S=“bcabc”
df=3 df2=1
bBinary search to find first class that maches query. note: other ordering is also possible
Conclusion
n Various document frequencies can be computed by K-frequency of string
n K-frequecy of short string is sum of K-frequency of longer strings, that is not the case for document frequency.
n K-frequency is efficiently computed using the suffix array and document link.
Download & Try
n http://www.ss.cs.tut.ac.jp/umemura/cicling2009/
n Try ./show-class for your favorite string. n suggesting short string with patterns n verify the various frequencies