Design and Evaluation of Clustering Approaches for Large Document Collections, The “BIC-Means” Method

Nikolaos Hourdakis
Technical University of Crete, Department of Electronic and Computer Engineering


Page 1: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering

22/04/23 Nikos Hourdakis, MSc Thesis 1

Design and Evaluation of Clustering Approaches for Large Document Collections, The “BIC-Means” Method

Nikolaos Hourdakis

Technical University of Crete, Department of Electronic and Computer Engineering

Page 2: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Motivation

Large document collections arise in many applications: digital libraries, the Web.

There is additional interest in methods for more effective management of information: abstraction, browsing, classification, retrieval.

Clustering is a means of achieving better organization of information: the data space is partitioned into groups of entities with similar content.

Page 3: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Outline

Background: state-of-the-art clustering approaches (partitional and hierarchical methods).

K-Means and its variants: Incremental K-Means, Bisecting Incremental K-Means.

Proposed method: BIC-Means, Bisecting Incremental K-Means using BIC as the stopping criterion.

Evaluation of the clustering methods.

Application to Information Retrieval.

Page 4: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Hierarchical Clustering (1/3)

Nested sequence of clusters. Two approaches:

A. Agglomerative: Starting from singleton clusters, recursively merges the two most similar clusters until there is only one cluster.

B. Divisive (e.g., Bisecting K-Means): Starting with all documents in the same root cluster, iteratively splits each cluster into K clusters.

Page 5: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Hierarchical Clustering – Example (2/3)

[Figure: example dendrogram — points 1–7 are successively merged into nested clusters.]

Page 6: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Hierarchical Clustering (3/3)

Organization and browsing of large document collections call for hierarchical clustering, but:

Agglomerative clustering has quadratic time complexity, which is prohibitive for large data sets.

Page 7: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Partitional Clustering

We focus on partitional clustering: K-Means, Incremental K-Means, Bisecting K-Means.

At least as good as hierarchical methods, with low complexity, O(KN): faster than hierarchical clustering for large document collections.

Page 8: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


K-Means

1. Randomly select K centroids

2. Repeat ITER times or until the centroids do not change:

a) Assign each instance to the cluster whose centroid is closest to it.

b) Re-compute the cluster centroids.

Generates a flat partition of K Clusters (K must be known in advance).

Centroid is the mean of a group of instances.
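The steps above can be sketched in plain Python (a minimal illustration using squared Euclidean distance; the function name and defaults are our own, not from the thesis):

```python
import random

def kmeans(points, k, iters=10):
    """Flat partition of `points` (tuples) into k clusters."""
    # 1. Randomly select K of the points as initial centroids.
    centroids = random.sample(points, k)
    for _ in range(iters):  # 2. Repeat ITER times or until centroids stop changing.
        clusters = [[] for _ in range(k)]
        # a) Assign each instance to the cluster whose centroid is closest.
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # b) Re-compute each centroid as the mean of its cluster (keep old one if empty).
        new = [tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centroids[j]
               for j, c in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    return centroids, clusters
```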

Page 9: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


K-Means Example

[Figure: K-Means example — data points assigned to three cluster centroids (marked C).]

Page 10: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


K-Means demo (1/7): http://www.delft-cluster.nl/textminer/theory/kmeans/kmeans.html

Page 11: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


K-Means demo (2/7)

Page 12: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


K-Means demo (3/7)

Page 13: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


K-Means demo (4/7)

Page 14: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


K-Means demo (5/7)

Page 15: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


K-Means demo (6/7)

Page 16: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


K-Means demo (7/7)

Page 17: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Comments

No guarantee of convergence to a global optimum: K-Means converges to a local minimum of the distortion measure (the average of the squared distances of the points from their nearest centroids):

Σ_i Σ_{d ∈ C_i} (d − μ_i)²

Too slow for practical databases.

K-Means is fully deterministic once the initial centroids are selected; a bad choice of initial centroids leads to poor clusters.

Page 18: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Incremental K-Means (IK)

In K-Means new centroids are computed after each iteration (after all documents have been examined).

In Incremental K-Means each cluster centroid is updated after a document is assigned to a cluster:

C′ = (S·C + d) / (S + 1)

where C is the current centroid of the cluster, S its current size, and d the newly assigned document.
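A minimal sketch of this per-document centroid update, C′ = (S·C + d)/(S + 1) (our reading of the update rule above; documents and centroids are assumed to be plain numeric vectors, and the function name is illustrative):

```python
def incremental_update(centroid, size, doc):
    """Return the new centroid after assigning `doc` to a cluster
    that currently has `size` members and mean `centroid`:
    C' = (S*C + d) / (S + 1)."""
    return tuple((size * c + d) / (size + 1) for c, d in zip(centroid, doc))
```

Applying the update once per assigned document reproduces the running mean of the cluster without recomputing it from scratch.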

Page 19: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Comments

Not as sensitive as K-Means to the selection of initial centroids.

Converges faster; in general much faster than standard K-Means.

Page 20: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Bisecting IK-Means (1/4)

A hierarchical clustering solution is produced by recursively applying Incremental K-Means to a document collection. The documents are initially partitioned into two clusters. The algorithm then iteratively selects and bisects each of the leaf clusters until singleton clusters are reached.

Page 21: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Bisecting IK-means (2/4)

Input: (d1, d2, …, dN)
Output: a hierarchy of K clusters

1. Start with all documents in one cluster C.
2. Apply IK-Means to split C into K clusters (K = 2): C1, C2, …, CK become leaf clusters.
3. Iteratively split each cluster Ci until K clusters, or singleton clusters, are produced at the leaves.
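The recursion might be sketched as follows. `split_in_two` is a crude stand-in for one IK-Means bisection (it seeds the two sub-clusters with two mutually distant documents), not the thesis's actual procedure:

```python
def split_in_two(docs):
    """Stand-in for one bisection: seed with two distant docs,
    then assign every doc to the nearer seed."""
    a = max(docs, key=lambda d: sum(x * x for x in d))
    b = max(docs, key=lambda d: sum((x - y) ** 2 for x, y in zip(d, a)))
    left, right = [], []
    for d in docs:
        da = sum((x - y) ** 2 for x, y in zip(d, a))
        db = sum((x - y) ** 2 for x, y in zip(d, b))
        (left if da <= db else right).append(d)
    return left, right

def bisecting(docs):
    """Recursively bisect leaf clusters until singletons are reached."""
    if len(docs) <= 1:
        return docs                   # singleton leaf
    left, right = split_in_two(docs)
    if not left or not right:         # degenerate split: stop here
        return docs
    return [bisecting(left), bisecting(right)]
```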

Page 22: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Bisecting IK-Means (3/4)

The algorithm is exhaustive, terminating at singleton clusters (unless K is known).

Terminating at singleton clusters is time consuming; singleton clusters are meaningless; intermediate clusters are more likely to correspond to real classes.

There is no criterion for stopping the bisections before singleton clusters are reached.

Page 23: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Bayesian Information Criterion (BIC) (1/3)

To prevent over-splitting we define a strategy to stop the Bisecting algorithm when meaningful clusters are reached.

Bayesian Information Criterion (BIC) or Schwarz Criterion [Schwarz 1978].

X-Means [Pelleg and Moore, 2000] used BIC for estimating the best K in a given range of values.

Page 24: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Bayesian Information Criterion (BIC) (2/3)

In this work, we suggest using BIC as the splitting criterion of a cluster in order to decide whether a cluster should split or not.

It measures the improvement of the cluster structure between a cluster and its two children clusters.

We compute the BIC score of a cluster and of its two children clusters.

Page 25: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Bayesian Information Criterion (BIC) (3/3)

If the BIC score of the produced children clusters is less than the BIC score of their parent cluster we do not accept the split. We keep the parent cluster as it is.

Otherwise, we accept the split and the algorithm proceeds similarly to lower levels.

Page 26: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Example

The BIC score of the parent cluster is less than the BIC score of the generated cluster structure, so we accept the bisection.

[Figure: parent cluster C, BIC(K=1) = 1980, is split into two clusters C1 and C2 with BIC(K=2) = 2245.]

Page 27: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Computing BIC

The BIC score of a data collection is defined as (Kass and Wasserman, 1995):

BIC(M_j) = l̂_j(D) − (p_j / 2) · log R

where l̂_j(D) is the log-likelihood of the data set D according to model M_j, p_j = M·K + 1 is a function of the number of independent parameters, and R is the number of points.

Page 28: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Log-likelihood

Given a cluster of points modeled by a Gaussian distribution N(μ, σ²), the log-likelihood is the probability that a neighborhood of data points follows this distribution.

The log-likelihood of the data can be considered a measure of the cohesiveness of a cluster: it estimates how close the points of the cluster are to its centroid.

Page 29: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Parameters pj

Sometimes, due to the complexity of the data (many dimensions or many data points), the data may follow other distributions.

We penalize the log-likelihood by a function of the number of independent parameters, (p_j / 2) · log R.

Page 30: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Notation

μ_j: coordinates of the j-th centroid
μ_(i): centroid nearest to the i-th data point
D: the input set of data points
D_j: the set of data points that have μ_j as their closest centroid
R = |D| and R_j = |D_j|
M: the number of dimensions
M_j: family of alternative models (different models correspond to different clustering solutions)

BIC scores the models and chooses the best among the K models.

Page 31: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Computing BIC (1/3)

To compute the log-likelihood of the data we need the parameters of the Gaussian for the data.

Maximum likelihood estimate (MLE) of the variance (under the spherical Gaussian assumption):

σ̂² = (1 / (R − K)) · Σ_i (x_i − μ_(i))²

Page 32: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Computing BIC (2/3)

Probability of a point x_i: a Gaussian with the estimated σ̂ and, as mean, the cluster centroid nearest to x_i:

P̂(x_i) = (R_(i) / R) · (1 / (√(2π) · σ̂)^M) · exp(−‖x_i − μ_(i)‖² / (2σ̂²))

Log-likelihood of the data:

l(D) = log Π_i P̂(x_i) = Σ_i ( log(R_(i)/R) − (M/2)·log(2π) − M·log σ̂ − ‖x_i − μ_(i)‖² / (2σ̂²) )

Page 33: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Computing BIC (3/3)

Focusing on the set Dn of points which belong to centroid n

l(D_n) = −(R_n/2)·log(2π) − (R_n·M/2)·log(σ̂²) − (R_n − K)/2 + R_n·log R_n − R_n·log R
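Putting the formulas of these slides together, the BIC score of a given assignment of points to clusters might be sketched as below (spherical-Gaussian assumption, pooled MLE variance, p_j = M·K + 1 as above; the function and variable names are our own, not the thesis code):

```python
import math

def bic(points, labels, k):
    """BIC = sum of per-cluster log-likelihoods l(Dn) minus (p_j/2)*log R."""
    r, m = len(points), len(points[0])
    clusters = {}
    for pt, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(pt)
    # centroid of each cluster
    cents = {lab: tuple(sum(xs) / len(c) for xs in zip(*c))
             for lab, c in clusters.items()}
    # pooled variance: (1/(R-K)) * sum of squared distances to nearest centroid
    var = sum(sum((x - mu) ** 2 for x, mu in zip(pt, cents[lab]))
              for lab, c in clusters.items() for pt in c) / (r - k)
    loglik = 0.0
    for c in clusters.values():
        rn = len(c)
        loglik += (-rn / 2 * math.log(2 * math.pi)
                   - rn * m / 2 * math.log(var)
                   - (rn - k) / 2
                   + rn * math.log(rn) - rn * math.log(r))
    p_j = m * k + 1
    return loglik - p_j / 2 * math.log(r)
```

For two well-separated tight clusters, the two-cluster model scores a higher BIC than the one-cluster model, which is exactly the split test BIC-Means relies on.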

Page 34: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Proposed Method: BIC-Means (1/2)

BIC-Means: Bisecting InCremental K-Means clustering incorporating BIC as the stopping criterion.

BIC performs a splitting test at each leaf cluster to prevent it from over-splitting.

BIC-Means doesn’t terminate at singleton clusters: it terminates when there are no separable clusters according to BIC.

Page 35: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Proposed Method: BIC-Means (2/2)

Combines the strengths of partitional and hierarchical clustering methods:

A hierarchical clustering solution
Low complexity (O(N·K))
Good clustering quality
Meaningful clusters at the leaves

Page 36: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


BIC-Means Algorithm

Input: S = (d1, d2, …, dn), all data in one cluster
Output: a hierarchy of clusters

1. Start with all documents in one cluster C.
2. Apply Incremental K-Means to split C into C1, C2.
3. Compute the BIC scores of C and of C1, C2:
   I. If BIC(C) < BIC(C1, C2), accept the split and put C1, C2 in the queue.
   II. Otherwise do not split C.
4. Repeat steps 2 and 3 until there are no separable leaf clusters in the queue according to BIC.
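The loop above can be sketched with a work queue. `bisect_once` is a toy stand-in for the Incremental K-Means bisection, and `bic_score` follows the spherical-Gaussian BIC of the earlier slides; both are illustrative, not the thesis implementation:

```python
import math
from collections import deque

def bisect_once(docs):
    """Toy bisection: split around the two mutually most distant docs."""
    a = max(docs, key=lambda d: sum(x * x for x in d))
    b = max(docs, key=lambda d: sum((x - y) ** 2 for x, y in zip(d, a)))
    halves = ([], [])
    for d in docs:
        da = sum((x - y) ** 2 for x, y in zip(d, a))
        db = sum((x - y) ** 2 for x, y in zip(d, b))
        halves[0 if da <= db else 1].append(d)
    return halves

def bic_score(docs, parts, k):
    """BIC of a k-way partition of `docs` (spherical Gaussian, p = M*k + 1)."""
    r, m = len(docs), len(docs[0])
    cents = [tuple(sum(xs) / len(p) for xs in zip(*p)) for p in parts]
    sq = sum(sum((x - mu) ** 2 for x, mu in zip(d, cents[i]))
             for i, p in enumerate(parts) for d in p)
    var = sq / (r - k) if r > k else 1e-12
    ll = sum(-len(p) / 2 * math.log(2 * math.pi)
             - len(p) * m / 2 * math.log(max(var, 1e-12))
             - (len(p) - k) / 2
             + len(p) * math.log(len(p)) - len(p) * math.log(r)
             for p in parts)
    return ll - (m * k + 1) / 2 * math.log(r)

def bic_means(docs):
    """Bisect leaf clusters only while the split improves the BIC score."""
    leaves, queue = [], deque([docs])
    while queue:
        c = queue.popleft()
        if len(c) < 3:                    # too small to test a 2-way split
            leaves.append(c)
            continue
        c1, c2 = bisect_once(c)
        if not c1 or not c2 or bic_score(c, [c], 1) >= bic_score(c, [c1, c2], 2):
            leaves.append(c)              # reject the split, keep the parent
        else:
            queue.extend([c1, c2])        # accept the split, test children later
    return leaves
```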

Page 37: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Evaluation

Evaluation of the document clustering algorithms on two data sets: OHSUMED (233,445 Medline documents) and Reuters (21,578 documents).

Application of clustering to information retrieval:
Evaluation of several cluster-based retrieval strategies.
Comparison with retrieval by exhaustive search on OHSUMED.

Page 38: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


F-Measure

How good the clusters approximate the data classes. The F-Measure for a cluster C and a class T is defined as:

F(T, C) = 2·P·R / (P + R), where P = N_TC / N_C and R = N_TC / N_T

(N_TC is the number of documents of class T in cluster C, N_C = |C|, N_T = |T|.)

The F-Measure of a class T is the maximum value it achieves over all clusters C:

F_T = max_C F(T, C)

The F-Measure of the clustering solution is the mean of F_T over all classes.
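A sketch of this measure, assuming classes and clusters are given as lists of document ids (the function name is illustrative, and the per-class scores are averaged unweighted, as the slide states):

```python
def f_measure(classes, clusters):
    """Mean over classes T of max over clusters C of
    F(T,C) = 2*P*R/(P+R), with P = |T∩C|/|C| and R = |T∩C|/|T|."""
    total = 0.0
    for t in classes:
        best = 0.0
        for c in clusters:
            n_tc = len(set(t) & set(c))   # docs of class T inside cluster C
            if n_tc:
                p, r = n_tc / len(c), n_tc / len(t)
                best = max(best, 2 * p * r / (p + r))
        total += best
    return total / len(classes)
```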

Page 39: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Comparison of Clustering Algorithms

[Figure: bar chart, "Comparison of K-Means, Incremental K-Means and Bisecting Incremental K-Means — Ohsumed1 / Reuters1 data sets"; y-axis: Avg F-Measure (10 trials).]

Page 40: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Evaluation of Incremental K-Means

[Figure: bar chart, "Incremental K-Means — Reuters1: number of iterations of centroid adjustment (1–4)"; y-axis: Avg F-Measure (10 trials).]

Page 41: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


MeSH Representation of Documents

We use MeSH terms for describing medical documents (OHSUMED).

Each document is represented by a vector of MeSH terms (multi-word terms instead of single-word terms).

This leads to a more compact representation (each vector contains fewer terms, about 20).

A sequential approach is used to extract MeSH terms from the OHSUMED documents.

Page 42: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Bisecting Incremental K-Means – Clustering Quality

[Figure: bar chart, "Bisecting Incremental K-Means — OHSUMED2: MeSH terms vs. single-word-terms representation"; y-axis: Avg F-Measure (10 trials).]

Page 43: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Speed of Clustering

[Figure: bar chart, "Bisecting Incremental K-Means — Ohsumed2: MeSH-based vs. single-word-terms representation"; avg clustering time 97.6 min for the single-word-term representation vs. 14 min for the MeSH-term representation.]

Page 44: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Evaluation of BIC-Means

[Figure: bar chart, "Comparison of BIC-Means and Bisecting Incremental K-Means — F-Measure" on the Ohsumed2, Reuters1 and Reuters2 data sets; y-axis: Avg F-Measure (10 trials).]

Page 45: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Speed of Clustering

[Figure: bar chart, "Comparison of BIC-Means and Bisecting Incremental K-Means — clustering time" on the Ohsumed2, Reuters1 and Reuters2 data sets; y-axis: Avg Clustering Time (min).]

Page 46: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Comments

BIC-Means is much faster than Bisecting Incremental K-Means: it is not an exhaustive algorithm.

It achieves approximately the same F-Measure as the exhaustive Bisecting approach.

It is better suited for clustering large document collections.

Page 47: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Application of Clustering to Information Retrieval

We demonstrate that it is possible to reduce the size of the search (and therefore the retrieval response time) on large data sets (OHSUMED).

BIC-Means is applied to the entire OHSUMED collection; each document is represented by MeSH terms.

We chose 61 queries from the original OHSUMED query set developed by Hersh et al. Each OHSUMED document has been judged as relevant to a query.

Page 48: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Query – Document Similarity

Similarity is defined as the cosine of the angle θ between the document and query vectors:

Sim(d1, d2) = (d1 · d2) / (|d1|·|d2|) = Σ_{i=1..M} w_{i,d1}·w_{i,d2} / ( √(Σ_{i=1..M} w²_{i,d1}) · √(Σ_{i=1..M} w²_{i,d2}) )
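A minimal sketch of this similarity for two dense term-weight vectors:

```python
import math

def cosine_sim(d1, d2):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm = math.sqrt(sum(a * a for a in d1)) * math.sqrt(sum(b * b for b in d2))
    return dot / norm if norm else 0.0
```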

Page 49: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Information Retrieval Methods

Method 1: Search the M clusters closest to the query; compute the similarity between each cluster centroid and the query.

Method 2: Search the M clusters closest to the query; each cluster is represented by the 20 most frequent terms of its centroid.

Method 3: Search the M clusters whose centroids contain the terms of the query.
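Method 1 might be sketched as follows (a toy version with dense vectors; names such as `search_top_m_clusters` are our own, not from the thesis):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def search_top_m_clusters(query, clusters, m):
    """Method 1: rank clusters by centroid-query cosine similarity,
    then rank the documents of the top-M clusters against the query."""
    def centroid(docs):
        # centroid = mean vector of the cluster's documents
        return [sum(xs) / len(docs) for xs in zip(*docs)]
    ranked = sorted(clusters, key=lambda c: cosine(centroid(c), query), reverse=True)
    candidates = [d for c in ranked[:m] for d in c]
    return sorted(candidates, key=lambda d: cosine(d, query), reverse=True)
```

Only the documents inside the M best-matching clusters are scored, which is how the search size (and response time) is reduced relative to exhaustive search.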

Page 50: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Method 1: Search the M clusters closest to the query (compute similarity between cluster centroid and query).

[Figure: precision–recall curves for the top 1, 3, 10, 30, 50, 100 and 150 clusters, compared with exhaustive search.]

Page 51: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Method 2: Search the M clusters closest to the query; each cluster is represented by the 20 most frequent terms of its centroid.

[Figure: precision–recall curves for the 20-term representation with the top 10, 50, 100 and 150 clusters, compared with Method 1 (top 150 clusters) and exhaustive search.]

Page 52: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Method 3: Search the M clusters whose centroids contain the terms of the query.

[Figure: precision–recall curves for the top 15, 30, 50 and all clusters whose centroids contain all query terms, compared with exhaustive search.]

Page 53: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Size of Search

[Figure: bar chart, "Avg number of documents searched over the 61 queries"; retrieval strategy: retrieve the clusters which contain all the MeSH query terms in their centroid; strategies compared: VSM (exhaustive), AllClusters, Top_50Clusters, Top_30Clusters, Top_15Clusters.]

Page 54: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Comments

Best cluster-based retrieval strategy (Method 3): retrieve only the clusters which contain all the MeSH query terms in their centroid vector, then search the documents contained in the retrieved clusters and order them by similarity to the query.

Advantages: it searches only 30% of all OHSUMED documents, as opposed to exhaustive search over all 233,445 documents, and it is almost as effective as retrieval by exhaustive search (searching without clustering).

Page 55: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Conclusions (1/2)

We implemented and evaluated various partitional clustering techniques: Incremental K-Means and Bisecting Incremental K-Means (the exhaustive approach).

BIC-Means incorporates BIC as a stopping criterion to prevent the clustering from over-splitting, and produces meaningful clusters at the leaves.

Page 56: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Conclusions (2/2)

BIC-Means is much faster than Bisecting Incremental K-Means, as effective as the exhaustive Bisecting approach, and better suited for clustering large document collections.

Cluster-based retrieval strategies reduce the size of the search; the best proposed retrieval method is as effective as exhaustive search (searching without clustering).

Page 57: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Future Work

Evaluation using more, or application-specific, data sets.

Examine additional cluster-based retrieval strategies (top-down, bottom-up).

Clustering and browsing on Medline.

Clustering dynamic document collections.

Semantic similarity methods in document clustering.

Page 58: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


References

Nikos Hourdakis, Michalis Argyriou, Euripides G.M. Petrakis, Evangelos Milios, "Hierarchical Clustering in Medical Document Collections: the BIC-Means Method", Journal of Digital Information Management (JDIM), Vol. 8, No. 2, pp. 71-77, April 2010.

Dan Pelleg, Andrew Moore, "X-means: Extending K-means with Efficient Estimation of the Number of Clusters", Proc. of the 17th Intern. Conf. on Machine Learning (ICML), 2000, pp. 727-734.

Page 59: Nikolaos Hourdakis Technical University of Crete Department of Electronic and Computer Engineering


Thank you!!!

Questions?