Bayesian Networks in Document Clustering
Slawomir Wierzchon, Mieczyslaw Klopotek, Michal Draminski, Krzysztof Ciesielski, Mariusz Kujawiak
Institute of Computer Science, Polish Academy of Sciences, Warsaw
Research partially supported by the KBN research project 4 T11C 026 25 "Maps and intelligent navigation in WWW using Bayesian networks and artificial immune systems"


Page 1: Bayesian Networks in Document Clustering

Bayesian Networks in Document Clustering

Slawomir Wierzchon, Mieczyslaw Klopotek, Michal Draminski, Krzysztof Ciesielski

Mariusz Kujawiak

Institute of Computer Science, Polish Academy of Sciences, Warsaw

Research partially supported by the KBN research project 4 T11C 026 25 "Maps and intelligent navigation in WWW using Bayesian networks and artificial immune systems"

Page 2: Bayesian Networks in Document Clustering

A search engine with SOM-based document set representation

Page 3: Bayesian Networks in Document Clustering

Map visualizations in 3D (BEATCA)

Page 4: Bayesian Networks in Document Clustering

Processing Flow Diagram - BEATCA

[Figure: data flow among the BEATCA components: INTERNET / DB / REGISTRY sources, Spider (downloading), HT-Base, Indexing + Optimizing, VEC-Base, Mapping / Clustering of documents, MAP-Base, DocGR-Base, Clustering of cells, CellGR-Base, and the Search Engine.]

The preparation of documents is done by an indexer, which turns a document into a vector-space model representation.

The indexer also identifies frequent phrases in the document set for clustering and labelling purposes.

Subsequently, dictionary optimization is performed: terms with extreme entropy and extremely frequent terms are excluded.

The map creator is applied, turning the vector-space representation into a form appropriate for on-the-fly map generation.

The best map (with respect to some similarity measure) is used by the query processor in response to the user's query.
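The dictionary-optimization step above can be sketched in code. The thresholds and the normalized-entropy criterion below are illustrative assumptions, not BEATCA's actual settings:

```python
import math

def prune_dictionary(term_doc_counts, max_doc_freq=0.9, entropy_bounds=(0.1, 0.95)):
    """Drop terms that occur in almost every document, or whose
    normalized entropy over documents is extreme (near 0 or near 1)."""
    n_docs = len(next(iter(term_doc_counts.values())))
    kept = {}
    for term, counts in term_doc_counts.items():
        doc_freq = sum(1 for c in counts if c > 0) / n_docs
        if doc_freq > max_doc_freq:
            continue  # ubiquitous term: carries little discriminative power
        total = sum(counts)
        probs = [c / total for c in counts if c > 0]
        entropy = -sum(p * math.log(p) for p in probs)
        max_ent = math.log(n_docs)
        norm_ent = entropy / max_ent if max_ent > 0 else 0.0
        if entropy_bounds[0] <= norm_ent <= entropy_bounds[1]:
            kept[term] = counts
    return kept
```

Terms kept this way are spread over enough documents to be informative, without being so uniform that they separate nothing.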

Page 5: Bayesian Networks in Document Clustering

Document model in search engines

In the so-called vector model a document is considered as a vector in the space spanned by the words it contains.

Example terms: dog, food, walk

Example documents: "My dog likes this food"; "When walking, I take some food"
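A minimal sketch of turning the two example sentences into count vectors over the shared term list (the prefix matching below is a crude stand-in for the stemming a real indexer would use):

```python
def to_vector(doc, vocabulary):
    """Represent a document as term counts over a fixed vocabulary."""
    tokens = [t.strip(",.").lower() for t in doc.split()]
    # crude stemming stand-in: a token matches a term if it starts with it
    return [sum(1 for t in tokens if t.startswith(term)) for term in vocabulary]

vocabulary = ["dog", "food", "walk"]
d1 = to_vector("My dog likes this food", vocabulary)          # [1, 1, 0]
d2 = to_vector("When walking, I take some food", vocabulary)  # [0, 1, 1]
```

Similarity between documents can then be measured between these vectors, e.g. by the cosine of the angle between them.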

Page 6: Bayesian Networks in Document Clustering

Clustering document vectors

Document space → 2D map (an m×r grid of cells)

[Figure; caption translated from Polish: a strong change of position is marked with a thick arrow.]

An important difference from general clustering: not only must each cluster contain similar documents, but neighboring clusters must also be similar.
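This neighborhood property comes from the SOM update rule: the winning cell and its grid neighbors are all pulled toward each document. A minimal sketch (the learning rate, radius, and Gaussian neighborhood function are illustrative choices, not BEATCA's):

```python
import math

def som_step(grid, doc_vec, lr=0.5, radius=1.5):
    """One SOM update: the best-matching cell and its grid neighbors
    move toward the document vector, which is why neighboring cells
    of the finished map end up holding similar documents."""
    # best-matching unit: the cell whose weight vector is closest
    bmu = min(grid, key=lambda pos: math.dist(grid[pos], doc_vec))
    for pos, w in grid.items():
        d = math.dist(pos, bmu)            # distance on the 2D map grid
        if d <= radius:
            h = math.exp(-d * d / 2.0)     # neighborhood strength
            grid[pos] = [wi + lr * h * (x - wi) for wi, x in zip(w, doc_vec)]
    return bmu
```

Repeating this step over all documents, with shrinking learning rate and radius, organizes the map so that nearby cells represent similar topics.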

Page 7: Bayesian Networks in Document Clustering

Our problem

– Instability
– Pre-defined major themes needed

Our approach:
– Find a coarse clustering into a few themes

Page 8: Bayesian Networks in Document Clustering

Bayesian Networks in Document Clustering

SOM document-map based search engines require initial document clustering in order to present results in a meaningful way.

Latent Semantic Indexing based methods appear to be promising for this purpose.

One of them, PLSA, has been empirically investigated.

A modification of the original algorithm is proposed, and an extension via TAN-like Bayesian networks is suggested.

Page 9: Bayesian Networks in Document Clustering

A Bayesian Network

[Figure: an example network over the variables chappi, dog, owner, food, meat, and walk.]

Represents a joint probability distribution as a product of the conditional probabilities of children given their parents in a directed acyclic graph.

High compression; simplification of reasoning.
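The factorization can be sketched directly: given a parent map and conditional probability tables, the joint probability is the product of each variable's conditional probability given its parents. The two-node network and its numbers below are hypothetical:

```python
def joint_probability(assignment, parents, cpts):
    """Joint probability in a Bayesian network:
    P(x1, ..., xn) = product over i of P(xi | parents(xi))."""
    p = 1.0
    for var, value in assignment.items():
        parent_values = tuple(assignment[q] for q in parents[var])
        p *= cpts[var][parent_values][value]
    return p

# Hypothetical two-node network: dog -> walk
parents = {"dog": (), "walk": ("dog",)}
cpts = {
    "dog":  {(): {True: 0.3, False: 0.7}},
    "walk": {(True,): {True: 0.8, False: 0.2},
             (False,): {True: 0.1, False: 0.9}},
}
p = joint_probability({"dog": True, "walk": True}, parents, cpts)  # 0.3 * 0.8
```

The compression comes from storing only small per-variable tables instead of the full joint, which grows exponentially with the number of variables.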

Page 10: Bayesian Networks in Document Clustering

BN application in text processing

– Document classification
– Document clustering
– Query expansion

Page 11: Bayesian Networks in Document Clustering

Hidden variable approaches

– PLSA (Probabilistic Latent Semantic Analysis)
– PHITS (Probabilistic Hyperlink Analysis)
– Combined PLSA/PHITS

Assumption of a hidden variable expressing the topic of the document. The topic probabilistically influences the appearance of the document (links in PHITS, terms in PLSA).

Page 12: Bayesian Networks in Document Clustering

PLSA - concept

Let N be the term-document matrix of word counts, i.e., Nij denotes how often the term (single word or phrase) ti occurs in document dj.

Probabilistic decomposition into factors zk (1 ≤ k ≤ K):

P(ti | dj) = Σk P(ti | zk) P(zk | dj),

with non-negative probabilities and two sets of normalization constraints:

Σi P(ti | zk) = 1 for all k, and Σk P(zk | dj) = 1 for all j.

[Figure: the PLSA model drawn as a network over D, the hidden variable Z, and the terms T1, T2, ..., Tn.]

Page 13: Bayesian Networks in Document Clustering

PLSA - concept

PLSA aims at maximizing

L := Σi,j Nij log Σk P(ti | zk) P(zk | dj).

Factors zk can be interpreted as states of a latent mixing variable associated with each observation (i.e., a word occurrence).

The Expectation-Maximization (EM) algorithm can be applied to find a local maximum of L.


Different factors usually capture distinct "topics" of a document collection; by clustering documents according to their dominant factors, useful topic-specific document clusters often emerge.
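The EM iteration for PLSA can be sketched as follows. This is a straightforward NumPy rendering of the update equations above, not the authors' implementation:

```python
import numpy as np

def plsa(N, K, iters=50, seed=0):
    """EM for PLSA on a term-document count matrix N (terms x docs):
    maximizes L = sum_ij N_ij log sum_k P(t_i|z_k) P(z_k|d_j)."""
    rng = np.random.default_rng(seed)
    n_terms, n_docs = N.shape
    p_t_z = rng.random((n_terms, K))
    p_t_z /= p_t_z.sum(axis=0, keepdims=True)         # P(t|z): columns sum to 1
    p_z_d = rng.random((K, n_docs))
    p_z_d /= p_z_d.sum(axis=0, keepdims=True)         # P(z|d): columns sum to 1
    for _ in range(iters):
        # E-step: P(z_k | t_i, d_j) is proportional to P(t_i|z_k) P(z_k|d_j)
        post = p_t_z[:, :, None] * p_z_d[None, :, :]  # shape (terms, K, docs)
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate both factors from the expected counts
        expected = N[:, None, :] * post               # N_ij * P(z_k|t_i,d_j)
        p_t_z = expected.sum(axis=2)
        p_t_z /= p_t_z.sum(axis=0, keepdims=True) + 1e-12
        p_z_d = expected.sum(axis=0)
        p_z_d /= p_z_d.sum(axis=0, keepdims=True) + 1e-12
    return p_t_z, p_z_d
```

Clustering by dominant factor is then `p_z_d.argmax(axis=0)`, one cluster label per document.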

Page 14: Bayesian Networks in Document Clustering

EM algorithm – step 0

Data (Z unknown):
D  Z  T1 T2 ... Tn
1  ?  1  0  ... 1
2  ?  0  0  ... 1
3  ?  1  1  ... 1
4  ?  0  1  ... 1
5  ?  1  0  ... 0
...

Data (Z randomly initialized):
D  Z  T1 T2 ... Tn
1  1  1  0  ... 1
2  2  0  0  ... 1
3  1  1  1  ... 1
4  1  0  1  ... 1
5  2  1  0  ... 0
...

Page 15: Bayesian Networks in Document Clustering

EM algorithm – step 1

Data:
D  Z  T1 T2 ... Tn
1  1  1  0  ... 1
2  2  0  0  ... 1
3  1  1  1  ... 1
4  1  0  1  ... 1
5  2  1  0  ... 0
...

BN trained on the completed data

[Figure: the network over D, the hidden variable Z, and terms T1, T2, ..., Tn.]

Page 16: Bayesian Networks in Document Clustering

EM algorithm – step 2

Data:
D  Z  T1 T2 ... Tn
1  2  1  0  ... 1
2  2  0  0  ... 1
3  1  1  1  ... 1
4  2  0  1  ... 1
5  1  1  0  ... 0
...

Z sampled from the BN

Z is sampled for each record according to the probability distribution
P(Z=1 | D=d, T1=t1, ..., Tn=tn), P(Z=2 | D=d, T1=t1, ..., Tn=tn), ...

GOTO step 1 until convergence (Z assignment "stable")

Page 17: Bayesian Networks in Document Clustering

The problem

– Too high a number of adjustable variables
– Pre-defined clusters not identified
– Long computation times
– Instability

Page 18: Bayesian Networks in Document Clustering

Solution

Our suggestion:
– Use the Naive Bayes "sharp version": each document is assigned to the most probable class

We were successful:
– Up to five classes well clustered
– High speed (with 20,000 documents)
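The "sharp version" can be sketched as hard-assignment EM with a naive Bayes model: each E-step assigns every document to its single most probable class instead of sampling or keeping soft posteriors. Smoothing and initialization below are illustrative choices:

```python
import math, random

def sharp_naive_bayes(docs, n_classes, iters=20, seed=0):
    """Hard-assignment EM with naive Bayes: each document is assigned
    to its most probable class, then class priors and per-class term
    probabilities are re-estimated from those assignments."""
    rnd = random.Random(seed)
    vocab = sorted({t for d in docs for t in d})
    z = [rnd.randrange(n_classes) for _ in docs]   # random initial classes
    for _ in range(iters):
        # M-step: Laplace-smoothed priors and class-conditional term probabilities
        prior = [(sum(1 for c in z if c == k) + 1) / (len(docs) + n_classes)
                 for k in range(n_classes)]
        term_p = []
        for k in range(n_classes):
            counts = {t: 1 for t in vocab}         # Laplace smoothing
            for d, c in zip(docs, z):
                if c == k:
                    for t in d:
                        counts[t] += 1
            total = sum(counts.values())
            term_p.append({t: counts[t] / total for t in vocab})
        # E-step ("sharp"): hard-assign each document to its most probable class
        new_z = []
        for d in docs:
            scores = [math.log(prior[k]) + sum(math.log(term_p[k][t]) for t in d)
                      for k in range(n_classes)]
            new_z.append(scores.index(max(scores)))
        if new_z == z:
            break                                  # assignments stable: converged
        z = new_z
    return z
```

Hard assignment makes each iteration cheap, which is consistent with the speed reported on 20,000 documents.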

Page 19: Bayesian Networks in Document Clustering

Next step

Naive Bayes assumes that documents and terms are independent (given the class).

What if they are in fact dependent?

Our solution: the TAN approach
– First we create a BN of terms/documents
– Then we assume there is a hidden variable

Promising results; a deeper study is needed.
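The tree skeleton that a TAN model adds on top of the naive Bayes structure is typically found as a maximum-spanning tree over pairwise mutual information (the Chow-Liu construction). A sketch over binary term-occurrence columns, using Kruskal's algorithm with union-find:

```python
import math
from itertools import combinations

def mutual_information(xs, ys):
    """Empirical mutual information between two binary sequences."""
    n = len(xs)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            pxy = sum(1 for x, y in zip(xs, ys) if x == a and y == b) / n
            px = sum(1 for x in xs if x == a) / n
            py = sum(1 for y in ys if y == b) / n
            if pxy > 0:
                mi += pxy * math.log(pxy / (px * py))
    return mi

def chow_liu_tree(columns):
    """Maximum-spanning tree over pairwise mutual information:
    the tree of dependencies a TAN model adds between variables."""
    n = len(columns)
    edges = sorted(((mutual_information(columns[i], columns[j]), i, j)
                    for i, j in combinations(range(n), 2)), reverse=True)
    parent = list(range(n))
    def find(x):                       # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    tree = []
    for mi, i, j in edges:             # greedily keep the strongest edges
        ri, rj = find(i), find(j)
        if ri != rj:                   # only if it does not close a cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree
```

Directing the tree's edges away from a chosen root, and adding the hidden class variable as an extra parent of every node, yields the TAN-like structures shown on the next two slides.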

Page 20: Bayesian Networks in Document Clustering

PLSA – a model with term TAN

[Figure: the hidden variable Z points to documents D1, D2, ..., Dk and to terms T1, ..., T6; additional tree-shaped (TAN) edges connect the terms.]

Page 21: Bayesian Networks in Document Clustering

PLSA – a model with document TAN

[Figure: the hidden variable Z points to terms T1, T2, ..., Ti and to documents D1, ..., D6; additional tree-shaped (TAN) edges connect the documents.]