A Chinese Information Retrieval System Using SDD

Introduction
SDD algorithm
Compute SDD
Term Weighting
Computing the Similarity
Term Extracting
System Implementation
Experiments and Results
Web Applications
Conclusion

Introduction

People need to find information quickly and accurately today.

Search engines help people find the information they need in huge data collections.

Search engines are based on information retrieval models.

Traditional search engines cannot achieve good precision and recall.

The Vector Space Model (VSM) was proposed to improve precision and recall in searching.

VSM represents the individual documents and queries in a collection as vectors in a multidimensional space.

Latent Semantic Indexing (LSI) is an improved model based on VSM.

Singular Value Decomposition (SVD) is widely used in LSI.

SVD has been used quite effectively for information retrieval.

However, SVD is expensive to compute for a large database collection.

We adopt a different matrix approximation called the Semi-Discrete Decomposition (SDD).

SDD algorithm

Let $A$ be an $m \times n$ matrix, as shown below, where $m$ is the number of terms and $n$ is the number of documents. The number of rows is greater than or equal to the number of columns, $m \ge n$.

$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}$$

The SDD of the matrix $A$ of dimension $k$ is:

$$A_k = X_k D_k Y_k^T$$

The equation can also be expanded as follows:

$$A_k = \begin{pmatrix} x_1 & x_2 & \cdots & x_k \end{pmatrix} \begin{pmatrix} d_1 & 0 & \cdots & 0 \\ 0 & d_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & d_k \end{pmatrix} \begin{pmatrix} y_1^T \\ y_2^T \\ \vdots \\ y_k^T \end{pmatrix} = \sum_{i=1}^{k} d_i x_i y_i^T$$

where $x_i$ is an m-vector and $y_i$ is an n-vector. The entries of $x_i$ and $y_i$ are from the set $S = \{-1, 0, 1\}$, and $D_k$ is a diagonal matrix with diagonal entries $d_i$. This equation is called a k-term SDD.

Since a k-term SDD needs only k floating point numbers plus k(m+n) entries from S for storage, it is inexpensive to compute quite a large number of terms.
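The two forms of the k-term SDD and the storage count can be checked with a minimal sketch. The function names `sdd_reconstruct` and `sdd_storage` and the small matrices are hypothetical and chosen only for illustration; they are not taken from the system described here.

```python
import numpy as np

def sdd_reconstruct(X, d, Y):
    """Form A_k = X_k D_k Y_k^T from the SDD factors.

    X is m x k and Y is n x k with entries in {-1, 0, 1};
    d holds the k diagonal entries of D_k.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    d = np.asarray(d, dtype=float)
    return (X * d) @ Y.T  # X * d scales column i of X by d_i

def sdd_storage(m, n, k):
    """Return (floating point numbers, entries from S) needed for a k-term SDD."""
    return k, k * (m + n)

# Illustrative values only.
X = np.array([[1, 0], [0, 1], [-1, 1]])
Y = np.array([[1, 1], [1, -1]])
d = np.array([0.5, 0.25])

A_k = sdd_reconstruct(X, d, Y)
# The same matrix written as a sum of rank-one terms d_i x_i y_i^T.
sum_form = sum(d[i] * np.outer(X[:, i], Y[:, i]) for i in range(len(d)))
```

Each entry from S can take only three values, so it fits in two bits, which is where the storage saving over a dense factorization comes from.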

Compute SDD

There are three steps for computing an SDD approximation:

1. Let $A_k$ be the k-term approximation and $R_k = A - A_{k-1}$ be the residual at the $k$-th step.

2. As the subproblem, find the triplet $(d_k, x_k, y_k)$ that solves

$$\min_{d,\,x,\,y} F_k(d, x, y) \equiv \| R_k - d\,x\,y^T \|_F^2 \quad \text{subject to } x \in S^m,\ y \in S^n,\ d > 0.$$

This is a mixed integer programming problem; it can be solved as follows:

(a) Fix y.
(b) Solve the equation above for x and d using this y.
(c) Solve the equation above for y and d using the x from step (b).
(d) Repeat steps (b) and (c) until the convergence criterion is satisfied.

3. Repeat step 2 until the desired number of terms is reached.
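The three steps can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the SDDPACK implementation: the names `best_discrete` and `sdd`, the starting vector (a unit vector at the residual column with the largest norm), and the relative-improvement stopping rule are choices made for this example. It relies on the fact that, for a fixed y, the best x keeps the signs of the J largest-magnitude entries of $R_k y$ for some J, and the best d is then $x^T R_k y / (\|x\|^2 \|y\|^2)$.

```python
import numpy as np

def best_discrete(s):
    """Return z in {-1, 0, 1}^len(s) maximizing (z . s)^2 / (z . z)."""
    mag = np.abs(s)
    order = np.argsort(-mag)  # largest magnitudes first
    gains = np.cumsum(mag[order]) ** 2 / np.arange(1, len(s) + 1)
    keep = order[: int(np.argmax(gains)) + 1]
    z = np.zeros(len(s))
    z[keep] = np.sign(s[keep])
    return z

def sdd(A, k, inner_its=100, tol=1e-2):
    """Greedy k-term SDD: returns X (m x k), d (k,), Y (n x k)."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    X, Y, d = np.zeros((m, k)), np.zeros((n, k)), np.zeros(k)
    R = A.copy()  # step 1: residual
    for t in range(k):  # step 3: repeat for each term
        if not R.any():
            break  # residual is exactly zero, nothing left to fit
        # (a) fix y; assumed start: unit vector at the heaviest column of R
        y = np.zeros(n)
        y[np.argmax((R ** 2).sum(axis=0))] = 1.0
        prev = 0.0
        for _ in range(inner_its):
            x = best_discrete(R @ y)    # (b) best x for this y
            y = best_discrete(R.T @ x)  # (c) best y for this x
            gain = (x @ R @ y) ** 2 / ((x @ x) * (y @ y))
            if gain - prev <= tol * gain:  # (d) assumed convergence test
                break
            prev = gain
        d[t] = (x @ R @ y) / ((x @ x) * (y @ y))
        X[:, t], Y[:, t] = x, y
        R -= d[t] * np.outer(x, y)
    return X, d, Y
```

The quantity `gain` is the reduction in the squared Frobenius norm of the residual obtained by the current pair (x, y), so each accepted term makes the approximation strictly better while the residual is nonzero.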

Term Weighting

In the vector space model, term weighting is very important and has a great influence on the success of the retrieval system.

Let $A = [a_{ij}]$ be the $m \times n$ term-document matrix. We define $a_{ij}$, the weight of term $i$ in document $j$, as follows:

$$a_{ij} = g_i \, t_{ij} \, d_j$$

It consists of three components: $g_i$ is the global weight of term $i$, $t_{ij}$ is the local weight of term $i$ in document $j$, and $d_j$ is a normalization factor for document $j$.

The weighting scheme is usually specified by a six-letter combination that indicates the local, global, and normalization components: the first three letters apply to the term-document matrix and the last three to the query.

We specify the weighting scheme as lxn.afx, and the weights are calculated as follows:

$$a_{ij} = \begin{cases} \dfrac{\log(f_{ij} + 1)}{\sqrt{\sum_{k=1}^{m} \left(\log(f_{kj} + 1)\right)^2}} & \text{if } f_{ij} \neq 0 \\ 0 & \text{otherwise} \end{cases}$$

$$q_i = \begin{cases} (1 + \log f_i)\,\log\!\left(\dfrac{n}{\sum_{k=1}^{n} \chi(f_{ik})}\right) & \text{if } f_i \neq 0 \\ 0 & \text{otherwise} \end{cases}$$

Here $f_{ij}$ is the frequency of term $i$ in document $j$, $f_i$ is the frequency of term $i$ in the query, and $\chi(f) = 1$ if $f > 0$ and $0$ otherwise.
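A minimal sketch of the lxn.afx weighting follows. It assumes that "l" is the log local weight log(f + 1), "x" means no weight, "n" is column normalization to unit length, "a" is the alternate log 1 + log f, and "f" is the inverse document frequency; the function names are hypothetical.

```python
import numpy as np

def lxn_document_weights(F):
    """Weight an m x n matrix of raw term counts with the 'lxn' scheme.

    Local weight: log(f_ij + 1); no global weight;
    each document (column) is normalized to unit Euclidean length.
    """
    F = np.asarray(F, dtype=float)
    L = np.log(F + 1.0)
    norms = np.sqrt((L ** 2).sum(axis=0))
    norms[norms == 0] = 1.0  # leave empty documents as all zeros
    return L / norms

def afx_query_weights(f_query, F):
    """Weight a query count vector with the 'afx' scheme (assumed forms).

    Local weight: 1 + log(f_i) for f_i > 0; global weight: log(n / df_i),
    where df_i is the number of documents containing term i; no normalization.
    """
    f_query = np.asarray(f_query, dtype=float)
    F = np.asarray(F, dtype=float)
    n = F.shape[1]
    df = (F > 0).sum(axis=1)
    q = np.zeros(len(f_query))
    mask = (f_query > 0) & (df > 0)
    q[mask] = (1.0 + np.log(f_query[mask])) * np.log(n / df[mask])
    return q
```

For a document with raw counts 3, 3, and 1, the log weights are in the ratio 2 : 2 : 1, so the normalized column is approximately 0.667, 0.667, 0.333.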

Computing the Similarity

The similarity between the document vector and the query vector is calculated by the cosine coefficient. Below is the formula used to compute the similarity:

$$\cos(q, d) = \frac{\sum_{i=1}^{n} q_i d_i}{\sqrt{\sum_{i=1}^{n} q_i^2}\,\sqrt{\sum_{i=1}^{n} d_i^2}}$$

The documents can be arranged in descending order of similarity, and the number of documents retrieved can be limited.
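Ranking by the cosine coefficient can be sketched as below. The function name `rank_by_cosine` and the convention that each column of `D` is one document vector are assumptions made for this example.

```python
import numpy as np

def rank_by_cosine(q, D, top=10):
    """Return up to `top` (document index, cosine) pairs, best match first.

    q is a query vector and each column of D is a document vector
    in the same space. Zero vectors get similarity 0.
    """
    q = np.asarray(q, dtype=float)
    D = np.asarray(D, dtype=float)
    denom = np.linalg.norm(q) * np.linalg.norm(D, axis=0)
    sims = np.zeros(D.shape[1])
    ok = denom > 0
    sims[ok] = (q @ D)[ok] / denom[ok]
    order = np.argsort(-sims, kind="stable")[:top]  # descending similarity
    return [(int(j), float(sims[j])) for j in order]
```

The `top` parameter plays the role of the limit on the number of documents retrieved.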

Term Extracting

With the LSI model, it is easy to use corpora in different languages to accomplish cross-language retrieval.

We chose a morphological analysis system called ChaSen (茶筅) to extract Chinese words from the documents in our system.

We need a good dictionary to segment Chinese words correctly.
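To illustrate why the dictionary matters, here is a small sketch of forward maximum matching, a simple dictionary-based segmentation technique. It is not how ChaSen works internally; the function name `segment_fmm`, the `max_len` parameter, and the sample dictionary are hypothetical and used only for illustration.

```python
def segment_fmm(text, dictionary, max_len=4):
    """Segment text by forward maximum matching against a word set.

    At each position the longest dictionary word (up to max_len characters)
    is taken; a character with no match becomes a one-character word.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words

# The same sentence segmented with a richer and a poorer dictionary.
rich = {"信息", "检索", "系统", "信息检索"}
poor = {"信息", "系统"}
with_rich = segment_fmm("信息检索系统", rich)
with_poor = segment_fmm("信息检索系统", poor)
```

With the richer dictionary the compound term is kept whole, while the poorer dictionary breaks the unknown word into single characters, which changes the rows of the term-document matrix.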

Figure 1: Chinese morphological analyzer

System Implementation

Below is an illustration showing the working mechanism of our SDD information retrieval system:

[Diagram boxes: Documents Collection; Document Vectors; Dictionary Vectors; Query String; Query Vector; SDD Computation; Rank relevant documents in descending order of similarity]

The SDD information retrieval system is implemented as follows:

1. Segment the terms from the document collection.

2. Create the term-document matrix in MatrixMarket coordinate format.

3. Use SDDPACK to compute the decomposition of the term-document matrix. The command is as follows:

$ decomp -k 200 -y -b 4 term-doc.mtx term-doc.sdd

4. Rank the relevant documents.
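Step 2 can be sketched with the standard library alone. The function name `write_matrix_market` and the dictionary-of-entries input are assumptions made for this example; the layout follows the MatrixMarket coordinate format: a header line, a size line with rows, columns, and nonzero count, then one 1-based "row column value" line per nonzero entry.

```python
import io

def write_matrix_market(entries, n_rows, n_cols, out):
    """Write a sparse real matrix in MatrixMarket coordinate format.

    entries maps zero-based (row, column) pairs to nonzero values;
    out is any text file object. Entries are written column by column.
    """
    out.write("%%MatrixMarket matrix coordinate real general\n")
    out.write(f"{n_rows} {n_cols} {len(entries)}\n")
    for (i, j), value in sorted(entries.items(), key=lambda kv: (kv[0][1], kv[0][0])):
        out.write(f"{i + 1} {j + 1} {value:.6e}\n")  # indices are 1-based in the file

# Usage with an in-memory buffer; a real run would open term-doc.mtx for writing.
buf = io.StringIO()
write_matrix_market({(0, 0): 0.5, (2, 1): 1.0}, 3, 2, buf)
```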

    %%MatrixMarket matrix coordinate real general
    8 6 20
    1 1 4.110885e-01
    2 1 3.692579e-01
    3 1 4.464557e-01
    4 1 3.692579e-01
    5 1 4.110885e-01
    6 1 1.590307e-01
    7 1 3.180615e-01
    8 1 2.520578e-01
    3 2 1.000000e+00
    3 3 7.263057e-01
    5 3 6.873719e-01
    3 4 3.162278e-01
    5 4 9.486833e-01
    3 5 6.666667e-01
    5 5 6.666667e-01
    8 5 3.333333e-01
    1 6 5.000000e-01
    3 6 5.000000e-01
    ...

Figure 2. 8 x 6 matrix output

    %% Semidiscrete Decomposition (SDD)
    %% Matrix: sdddata/matrix Terms: 5 Accr: 0.00e+00 Tol: 1.00e-02 InnIts: 100 Init: 1
    5 8 6
    5.7245558500289916992187500e-01
    2.7197235822677612304687500e-01
    4.0811389684677124023437500e-01
    2.5439447164535522460937500e-01
    1.3001415133476257324218750e-01
    0 0 1 0 1 0 0 0
    1 1 0 1 0 0 1 1
    0 0 1 0 -1 0 0 0
    1 -1 0 -1 0 0 -1 1
    1 1 -1 1 -1 1 0 0
    1 1 1 1 1 1
    1 0 0 0 0 1
    0 1 0 -1 0 0
    0 0 0 0 0 1
    1 0 0 0 0 0

Figure 3. SDD output

Experiments and Results

We selected a small data set of only 100 documents for testing. The data are Chinese text documents taken from the web pages of Chinese Agricultural University.

To compare the performance of SDD and SVD, we computed the matrix decomposition using both SDD and SVD.

We then created the query vector, computed the similarity, and ranked the documents in descending order of similarity.

    Rank  Document  Similarity
    1     22        0.845110
    2     3         0.186189
    3     49        0.164403
    4     58        0.157444
    5     56        0.148811
    6     1         0.139891
    7     9         0.105001
    8     69        0.067863
    9     31        0.057763
    10    23        0.056919

Figure 4. Top ten entries in SVD

    Rank  Document  Similarity
    1     22        0.741998
    2     1         0.401568
    3     49        0.399059
    4     3         0.398177
    5     58        0.397571
    6     23        0.396199
    7     9         0.394032
    8     12        0.391085
    9     14        0.389590
    10    31        0.380483

Figure 5. Top ten entries in SDD

Web Applications

We developed a web-based application to present this Chinese information retrieval system.

The site can be visited at http://pc110.narc.affrc.go.jp/Chinese/.

We also developed a Japanese system using the SDD-based VSM. Its web interface is available at http://pc110.narc.affrc.go.jp/AgrInfo/.

Conclusion

We presented a Chinese information retrieval system using SDD.

SDD has a clear advantage in saving computer storage.

SDD will be easy to implement for a large data collection.

SDD will make it easy to accomplish cross-language retrieval.

SDD has almost the same retrieval performance as SVD.

Thank you!