51
Classification and clustering methods development and implementation for unstructured documents collections by Osipova Nataly St.Petesburg State University Faculty of Applied Mathematics and Control Processes Department of Programming Technology

Classification and clustering methods development and implementation for unstructured documents collections

  • Upload
    ivo

  • View
    33

  • Download
    1

Embed Size (px)

DESCRIPTION

Classification and clustering methods development and implementation for unstructured documents collections. by Osipova Nataly St.Petesburg State University Faculty of Applied Mathematics and Control Processes Department of Programming Technology. Contents. Introduction - PowerPoint PPT Presentation

Citation preview

Page 1: Classification and clustering methods development and implementation for unstructured documents collections

Classification and clustering methods development and implementation for unstructured documents collections

byOsipova Nataly

St.Petesburg State University Faculty of Applied Mathematics and Control Processes Department of Programming Technology

Page 2: Classification and clustering methods development and implementation for unstructured documents collections

Contents

IntroductionMethods descriptionInformation Retrieval SystemExperiments

Page 3: Classification and clustering methods development and implementation for unstructured documents collections

Contextual Document Clustering

was developed in joined project ofApplied Mathematics and Control Processes Faculty, St. Petersburg State University and Northern Ireland Knowledge Engineering Laboratory (NIKEL), University of Ulster.

Page 4: Classification and clustering methods development and implementation for unstructured documents collections

Definitions

DocumentTerms dictionaryDictionaryClusterWord contextContext or document conditional

probability distributionEntropy

Page 5: Classification and clustering methods development and implementation for unstructured documents collections

Document conditional probability distribution

Document x

yword1 word2 word3 …wordn

tf(y)5106

16

p(y|x)5/m10/m6/m

16/m

y – wordstf(y) – y frequencyp(y|x) – y conditional probability in document xm – document x size

(5/m, 10/m,6/m,…,16/m ) – document conditional probability distribution

Page 6: Classification and clustering methods development and implementation for unstructured documents collections

Word context

Word wDocument x1 Document x2 Document xk

yword1 word2 …wordn1

tf(y)510

16

p(y|x1)5/m110/m1

16/m1

yword1 word3 …wordn2

tf(y)712

4

p(y|x1)7/m112/m1

4/m1

yword1 word4 …wordnk

tf(y)209

3

p(y|x1)20/mk9/mk

3/mk

yword1 word2 word3 …wordnk

tf(y)5+7+20=321012

3

p(y|w)32/m10/m12/m

3/m

Context conditional probability distribution

Page 7: Classification and clustering methods development and implementation for unstructured documents collections

Contents

IntroductionMethods descriptionInformation Retrieval SystemExperiments

Page 8: Classification and clustering methods development and implementation for unstructured documents collections

Methods

document clustering methoddictionary build methodsdocument classification method using

training set

Information retrieval methods:keyword search methodcluster based search methodsimilar documents search method

Page 9: Classification and clustering methods development and implementation for unstructured documents collections

Contextual Documents Clustering

Documents Dictionary Narrow context words

Clusters

Distances calculation

Page 10: Classification and clustering methods development and implementation for unstructured documents collections

Entropy

(HH

n

i

pipi1

)log(*)

p1 pnp2

y context conditional probability distribution

p1+p2+…+pn=1

p1 pnp2

Uncertainly measure, here it is used to characterize commonness (narrowness) of the word context.

Page 11: Classification and clustering methods development and implementation for unstructured documents collections

Contextual Document Clustering

maxH(y)=H (

)

Page 12: Classification and clustering methods development and implementation for unstructured documents collections

Entropy

α0 10.5

)2(log2 1, 21 pp

)loglog(]),([ 221121 ppppppH

H( ) H( ) H( )

Page 13: Classification and clustering methods development and implementation for unstructured documents collections

Word Context - Document Distance

1p

2p

21 21

21 ppp

y context conditional probability distribution

Document x conditional probability distribution

Average conditional probability distribution

Page 14: Classification and clustering methods development and implementation for unstructured documents collections

Word Context - Document Distance

JS[p1,p2]=H( )

- 0.5H( )

- 0.5H( )

Page 15: Classification and clustering methods development and implementation for unstructured documents collections

Jensen-Shannon divergence

210]2,1[

0]2,1[

},{

},{

21

21

21

21

ppppJS

ppJS

Page 16: Classification and clustering methods development and implementation for unstructured documents collections

Dictionary construction

Why:- big volumes: 60,000 documents, 50,000 words => 15,000

words in a context- narrow context words importance

Page 17: Classification and clustering methods development and implementation for unstructured documents collections

Dictionary construction

Delete words with1. High or low frequency2. High or low document frequency3. 1. and 2.

Page 18: Classification and clustering methods development and implementation for unstructured documents collections

Retrieval algorithms

keyword search methodcluster based search methodsearch by example method

Page 19: Classification and clustering methods development and implementation for unstructured documents collections

Keyword search method

Document 1word 1word 2word 3…word n1

Document 2word 10word 25word 30…word n2

Document 3word 15word 2word 32…word n3

Document 4word 11word 21word 3…word n4

Request: word 2 Result set: document 1document3

Page 20: Classification and clustering methods development and implementation for unstructured documents collections

Cluster based search method

Documents

Cluster 3word 1word 23…word n3

Documents Documents

Cluster 2word 12word 26…word n2

Cluster 1word 1word 2…word n1

Cluster context words

Request: word 1 Result set: Cluster 1Cluster 3

Page 21: Classification and clustering methods development and implementation for unstructured documents collections

Similar documents search

document 1Cluster name

Cluster

Minimal Spanning Tree

document 2

document 3

document 4

document 5

document 6

document 7

Request: document 3

Result set: document 6document 7

Page 22: Classification and clustering methods development and implementation for unstructured documents collections

Document classification: method 1

Clusters List of topics Training set

Topics contexts

Distances between topics and clusters contexts

Classification result:cluster1 – topic 10cluster 2 – topic 3

…cluster n – topic 30

Test documents

Page 23: Classification and clustering methods development and implementation for unstructured documents collections

Clusters

Topics listTraining set

Classification result:cluster1 – topic 10cluster 2 – topic 3

…cluster n – topic 30

Document classification: method 2

Test documents

All documents set

Page 24: Classification and clustering methods development and implementation for unstructured documents collections

Contents

IntroductionMethods descriptionInformation Retrieval SystemExperiments

Page 25: Classification and clustering methods development and implementation for unstructured documents collections

Information Retrieval System

ArchitectureFeaturesUse

Page 26: Classification and clustering methods development and implementation for unstructured documents collections

Information Retrieval System architecture.

data base serverclient

Page 27: Classification and clustering methods development and implementation for unstructured documents collections

IRS architecture

Data Base

Data Base ServerMS SQL Server 2000

Local AreaNetwork

“thick” clientC#

Page 28: Classification and clustering methods development and implementation for unstructured documents collections

IRS architecture

DBMS MS SQL Server 2000:High-performanceScalableSecureHuge volumes of data treatT/SQLStored procedures

Page 29: Classification and clustering methods development and implementation for unstructured documents collections

IRS features

In the IRS the following problems are solved:document clusteringkeyword search methodcluster based search methodsimilar documents search methoddocument classification with the use of

training set

Page 30: Classification and clustering methods development and implementation for unstructured documents collections

DB structure

The Data Base of the IRS consists of the following tables: documents all words dictionary dictionary table of relations between documents and words: document-word words contexts words with narrow contexts clusters intermediate tables for main tables build and for retrieve realization

Page 31: Classification and clustering methods development and implementation for unstructured documents collections

DictionaryDocuments

Table “document-word”

Words contexts

Clusters CentroidCluster based search

Keyword search

Words with narrow contexts

All words dictionary

Similar documents search

Algorithms implementation

Page 32: Classification and clustering methods development and implementation for unstructured documents collections

document1document2

document5 document3

document4

Cluster

0,16285

0,98154

0,57231

0,23851

0,26967

0,211

0,87310,7231

0,1011

Similar documents search

Page 33: Classification and clustering methods development and implementation for unstructured documents collections

Minimal Spanning Tree

document 1

Cluster name

Cluster

document 2

document 3

document 4

document 5

Page 34: Classification and clustering methods development and implementation for unstructured documents collections

Similar documents search

Clusterstable Tree tableDistances

table

Similar documents

search

Page 35: Classification and clustering methods development and implementation for unstructured documents collections

IRS use

Page 36: Classification and clustering methods development and implementation for unstructured documents collections

IRS use

Page 37: Classification and clustering methods development and implementation for unstructured documents collections

IRS use

Page 38: Classification and clustering methods development and implementation for unstructured documents collections

IRS use

Page 39: Classification and clustering methods development and implementation for unstructured documents collections

IRS use

Page 40: Classification and clustering methods development and implementation for unstructured documents collections

IRS use

Page 41: Classification and clustering methods development and implementation for unstructured documents collections

Contents

IntroductionMethods descriptionInformation Retrieval SystemExperiments

Page 42: Classification and clustering methods development and implementation for unstructured documents collections

Experiments

Test goals were:algorithm accuracy testdifferent classification methods

comparisonalgorithm efficiency evaluation

Page 43: Classification and clustering methods development and implementation for unstructured documents collections

Experiments

60,000 documents100 topicsTraining set volume = 5% of the

collection size

Page 44: Classification and clustering methods development and implementation for unstructured documents collections

Experiments

1000)(,2)( ydfydf

1000)(,5)( ytfytf

Page 45: Classification and clustering methods development and implementation for unstructured documents collections

Result analysis

- Russian Information Retrieval Evaluation Seminar

- Such measures as macro-average recallprecision F-measure were calculated.

Page 46: Classification and clustering methods development and implementation for unstructured documents collections

Recall

textan

xxxxxxxxxxxx

xxxx

xxxx

xxxxxxxx

xxxx

xxxx

0

0.1

0.2

0.3

0.4

0.5

0.6

Systems

textan

xxxx

xxxx

xxxx

xxxx

xxxx

xxxx

Recall

Page 47: Classification and clustering methods development and implementation for unstructured documents collections

Precision

xxxx

xxxxxxxx

xxxx xxxx

xxxx

textanxxxxxxxx

xxxx

00.10.20.30.40.50.60.7

Systems

textan

xxxx

xxxx

xxxx

xxxx

xxxx

xxxx

Precision

Page 48: Classification and clustering methods development and implementation for unstructured documents collections

F-measure

textan

xxxx

xxxxxxxx

xxxxxxxx

xxxxxxxx

xxxx

xxxx

00.050.1

0.150.2

0.250.3

0.35

Systems

textan

xxxx

xxxx

xxxx

xxxx

xxxx

xxxx

F-measure

Page 49: Classification and clustering methods development and implementation for unstructured documents collections

Result analysis

List of some topicstest documents were classified in

№ Category

1 Family law

2 Inheritance law

3 Water industry

4 Catering

5 Inhabitants’ consumer services

6 Rent truck

7 International law of the space

8 Territory in international law

9 Off-economic relations fellows

10 Off-economic dealerships

11 Economy free trade zones. Customs unions.

Page 50: Classification and clustering methods development and implementation for unstructured documents collections

Result analysis

Recall results for every category.Results which were the best for the category are selected with bold type.All results are set in percents.

СV 1 2 3 4 5 6 7 8 9 10 11

textan 33 34 35 60 46 26 27 98 75 25 100

xxxx 1 0 0.2 3 4 0 0.9 0 3 0 2

xxxx 0 0 4.3 2.3 0 5 0.9 8 3 0 0.8

xxxx 55 86 75 19 59 51 80 0 41 82 0

xxxx 21 39 2 22 15 6 0 1.4 0 5 0

xxxx 40 43 16 11 25 23 10 1.4 1.2 5 0

xxxx 23 4 2.5 1.1 18 7 0.9 0 1.2 10 0

xxxx 2.7 0 0 0 1.5 0 0 0 0 0 0

xxxx 2.2 0 0 0 1.5 0 0 0 0 0 0

xxxx 37 21 12 22 18 27 51 0 0 0 0

Page 51: Classification and clustering methods development and implementation for unstructured documents collections

Thank you for your attention!