Classification and clustering methods development and implementation for unstructured documents collections

Classification and clustering methods development and implementation for unstructured documents collections

byOsipova Nataly

St.Petesburg State University Faculty of Applied Mathematics and Control Processes Department of Programming Technology

Contents

IntroductionMethods descriptionInformation Retrieval SystemExperiments

Contextual Document Clustering

was developed in joined project ofApplied Mathematics and Control Processes Faculty, St. Petersburg State University and Northern Ireland Knowledge Engineering Laboratory (NIKEL), University of Ulster.

Definitions

DocumentTerms dictionaryDictionaryClusterWord contextContext or document conditional

probability distributionEntropy

Document conditional probability distribution

Document x

yword1 word2 word3 …wordn

tf(y)5106

16

p(y|x)5/m10/m6/m

16/m

y – wordstf(y) – y frequencyp(y|x) – y conditional probability in document xm – document x size

(5/m, 10/m,6/m,…,16/m ) – document conditional probability distribution

Word context

Word wDocument x1 Document x2 Document xk

yword1 word2 …wordn1

tf(y)510

16

p(y|x1)5/m110/m1

16/m1

yword1 word3 …wordn2

tf(y)712

4

p(y|x1)7/m112/m1

4/m1

yword1 word4 …wordnk

tf(y)209

3

p(y|x1)20/mk9/mk

3/mk

…

yword1 word2 word3 …wordnk

tf(y)5+7+20=321012

3

p(y|w)32/m10/m12/m

3/m

…

Context conditional probability distribution

Contents


Methods

document clustering methoddictionary build methodsdocument classification method using

training set

Information retrieval methods:keyword search methodcluster based search methodsimilar documents search method

Contextual Documents Clustering

Documents Dictionary Narrow context words

Clusters

Distances calculation

Entropy

(HH

n

i

pipi1

)log(*)

p1 pnp2

y context conditional probability distribution

p1+p2+…+pn=1

p1 pnp2

Uncertainly measure, here it is used to characterize commonness (narrowness) of the word context.

Contextual Document Clustering

maxH(y)=H (

)

Entropy

α0 10.5

)2(log2 1, 21 pp

)loglog(]),([ 221121 ppppppH

H( ) H( ) H( )

Word Context - Document Distance

1p

2p

21 21

21 ppp

y context conditional probability distribution

Document x conditional probability distribution

Average conditional probability distribution

Word Context - Document Distance

JS[p1,p2]=H( )

- 0.5H( )

- 0.5H( )

Jensen-Shannon divergence

210]2,1[

0]2,1[

},{

},{

21

21

21

21

ppppJS

ppJS

Dictionary construction

Why:- big volumes: 60,000 documents, 50,000 words => 15,000

words in a context- narrow context words importance

Dictionary construction

Delete words with1. High or low frequency2. High or low document frequency3. 1. and 2.

Retrieval algorithms

keyword search methodcluster based search methodsearch by example method

Keyword search method

Document 1word 1word 2word 3…word n1




Request: word 2 Result set: document 1document3

Cluster based search method

Documents

Cluster 3word 1word 23…word n3

Documents Documents



Cluster context words

Request: word 1 Result set: Cluster 1Cluster 3

Similar documents search

document 1Cluster name

Cluster

Minimal Spanning Tree

document 2

document 3

document 4

document 5

document 6

document 7

Request: document 3

Result set: document 6document 7

Document classification: method 1

Clusters List of topics Training set

Topics contexts

Distances between topics and clusters contexts

Classification result:cluster1 – topic 10cluster 2 – topic 3

…cluster n – topic 30

Test documents

Clusters

Topics listTraining set

Classification result:cluster1 – topic 10cluster 2 – topic 3

…cluster n – topic 30

Document classification: method 2

Test documents

All documents set

Contents


Information Retrieval System

ArchitectureFeaturesUse

Information Retrieval System architecture.

data base serverclient

IRS architecture

Data Base

Data Base ServerMS SQL Server 2000

Local AreaNetwork

“thick” clientC#

IRS architecture

DBMS MS SQL Server 2000:High-performanceScalableSecureHuge volumes of data treatT/SQLStored procedures

IRS features

In the IRS the following problems are solved:document clusteringkeyword search methodcluster based search methodsimilar documents search methoddocument classification with the use of

training set

DB structure

The Data Base of the IRS consists of the following tables: documents all words dictionary dictionary table of relations between documents and words: document-word words contexts words with narrow contexts clusters intermediate tables for main tables build and for retrieve realization

DictionaryDocuments

Table “document-word”

Words contexts

Clusters CentroidCluster based search

Keyword search

Words with narrow contexts

All words dictionary


Algorithms implementation

document1document2

document5 document3

document4

Cluster

0,16285

0,98154

0,57231

0,23851

0,26967

0,211

0,87310,7231

0,1011


Minimal Spanning Tree

document 1

Cluster name

Cluster

document 2

document 3

document 4

document 5


Clusterstable Tree tableDistances

table

Similar documents

search

IRS use

IRS use

IRS use

IRS use

IRS use

IRS use

Contents


Experiments

Test goals were:algorithm accuracy testdifferent classification methods

comparisonalgorithm efficiency evaluation

Experiments

60,000 documents100 topicsTraining set volume = 5% of the

collection size

Experiments

1000)(,2)( ydfydf

1000)(,5)( ytfytf

Result analysis

- Russian Information Retrieval Evaluation Seminar

- Such measures as macro-average recallprecision F-measure were calculated.

Recall

textan

xxxxxxxxxxxx

xxxx

xxxx

xxxxxxxx

xxxx

xxxx

0

0.1

0.2

0.3

0.4

0.5

0.6

Systems

textan

xxxx

xxxx

xxxx

xxxx

xxxx

xxxx

Recall

Precision

xxxx

xxxxxxxx

xxxx xxxx

xxxx

textanxxxxxxxx

xxxx

00.10.20.30.40.50.60.7

Systems

textan

xxxx

xxxx

xxxx

xxxx

xxxx

xxxx

Precision

F-measure

textan

xxxx

xxxxxxxx

xxxxxxxx

xxxxxxxx

xxxx

xxxx

00.050.1

0.150.2

0.250.3

0.35

Systems

textan

xxxx

xxxx

xxxx

xxxx

xxxx

xxxx

F-measure

Result analysis

List of some topicstest documents were classified in

№ Category

1 Family law

2 Inheritance law

3 Water industry

4 Catering

5 Inhabitants’ consumer services

6 Rent truck

7 International law of the space

8 Territory in international law

9 Off-economic relations fellows

10 Off-economic dealerships

11 Economy free trade zones. Customs unions.

Result analysis

Recall results for every category.Results which were the best for the category are selected with bold type.All results are set in percents.

СV 1 2 3 4 5 6 7 8 9 10 11

textan 33 34 35 60 46 26 27 98 75 25 100

xxxx 1 0 0.2 3 4 0 0.9 0 3 0 2

xxxx 0 0 4.3 2.3 0 5 0.9 8 3 0 0.8

xxxx 55 86 75 19 59 51 80 0 41 82 0

xxxx 21 39 2 22 15 6 0 1.4 0 5 0

xxxx 40 43 16 11 25 23 10 1.4 1.2 5 0

xxxx 23 4 2.5 1.1 18 7 0.9 0 1.2 10 0

xxxx 2.7 0 0 0 1.5 0 0 0 0 0 0

xxxx 2.2 0 0 0 1.5 0 0 0 0 0 0

xxxx 37 21 12 22 18 27 51 0 0 0 0

Thank you for your attention!