Multi-Objective Optimization for Clustering of Medical Publications

Asif Ekbal (1), Sriparna Saha (1), Diego Mollá (2), K Ravikumar (1)

(1) Indian Institute of Technology, Patna, Bihar, India
(2) Centre for Language Technology, Macquarie University, Sydney, Australia

ALTA 2013, Brisbane, Australia


A. Ekbal, S. Saha, D. Mollá, and K. Ravikumar. Multi-Objective Optimization for Clustering of Medical Publications (2013). Proceedings of the Australasian Language Technology Association Workshop 2013 (ALTA 2013), pp53-61, Brisbane, Australia. http://aclweb.org/anthology/U/U13/


Clustering for Evidence Based Medicine Clustering as a MOO Problem AMOSA-clus Results

Contents

Clustering for Evidence Based Medicine

Clustering as a MOO Problem

AMOSA-clus

Results

MOO for Medical Clustering Asif Ekbal, Sriparna Saha, Diego Molla, K Ravikumar 2/26


Evidence Based Medicine

http://laikaspoetnik.wordpress.com/2009/04/04/evidence-based-medicine-the-facebook-of-medicine/


The Dream


The Bottom-line Answer


A Means of Getting There

Input

QUESTION: Which treatments work best for hemorrhoids?

DOCUMENTS: [11289288] [12972967] [1442682] [15486746] [16235372] [16252313] [17054255] [17380367]

=⇒ clustering + summarisation =⇒

Output

1. Excision is the most effective treatment for thrombosed external hemorrhoids. [11289288] [12972967] [15486746]

2. For prolapsed internal hemorrhoids, the best definitive treatment is traditional hemorrhoidectomy. [17054255] [17380367]

3. Of nonoperative techniques, rubber band ligation produces the lowest rate of recurrence. [1442682] [16252313] [16235372]


This Work

Each question is formulated as an independent clustering task.

Input

QUESTION: Which treatments work best for hemorrhoids?

DOCUMENTS: [11289288] [12972967] [1442682] [15486746] [16235372] [16252313] [17054255] [17380367]

=⇒ clustering =⇒

Output

1. [11289288] [12972967] [15486746]

2. [17054255] [17380367]

3. [1442682] [16252313] [16235372]


Related Work

Uses of Document Clustering

- Web search
- Topic detection and tracking
- Training data expansion
- Multi-document summarisation

Clustering in EBM

- Cluster search results
- Cluster based on interventions
- Shash & Mollá (2013): k-means clustering on our data set


Clustering and Multi-Objective Optimization

- Most existing clustering techniques are based on a single criterion of goodness.
- Several criteria of goodness have been proposed.
- So why not try several criteria at once?

Internal Validity

- BIC-index
- CH-index
- Silhouette-index
- DB-index
- ...

External Validity

- Minkowski scores
- F-measures
- ...


Information in Internal Validity Indices

Compactness

- Measures the distance among the elements of a cluster.
- We want clusters with short distances between their elements.

Separability

- Measures the distance between clusters.
- We want relatively large distances between clusters.


I-Index (Maulik & Bandyopadhyay, 2002)

$$I(K) = \left( \frac{1}{K} \times \frac{E_1}{E_K} \times D_K \right)^p$$

$$E_K = \sum_{k=1}^{K} \sum_{j=1}^{n_k} d_e(c_k, x_j^k), \qquad D_K = \max_{i,j=1}^{K} d_e(c_i, c_j)$$

K = number of clusters
c_j = centroid of the jth cluster
x_j^k = jth point of the kth cluster
n_k = total number of points present in the kth cluster
E_1 = value of E_K for K = 1 (all points in a single cluster)

- E_1/E_K increases I as the clusters become more compact.
- D_K increases I as the separation between clusters increases.
- p is a parameter, set to 2 in this paper.
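To make the definition concrete, here is a minimal Python sketch of the I-index. It assumes Euclidean distance for d_e and takes E_1 to be E_K computed with all points in one cluster. The function names (`i_index`, `centroid`, `euclidean`) and the toy points are hypothetical, and this is not the authors' implementation.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def i_index(clusters, p=2):
    """I-index of a partition given as a list of non-empty point lists.

    Assumes at least one point lies away from its centroid (E_K > 0).
    """
    k = len(clusters)
    centroids = [centroid(pts) for pts in clusters]
    # E_K: total distance of every point to its own cluster centroid.
    e_k = sum(euclidean(c, x) for c, pts in zip(centroids, clusters) for x in pts)
    # E_1: the same quantity when all points form a single cluster.
    all_points = [x for pts in clusters for x in pts]
    c_all = centroid(all_points)
    e_1 = sum(euclidean(c_all, x) for x in all_points)
    # D_K: largest distance between any two cluster centroids.
    d_k = max(euclidean(ci, cj) for ci in centroids for cj in centroids)
    return ((1.0 / k) * (e_1 / e_k) * d_k) ** p


# Toy data: the same four points partitioned two ways.
compact = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]
spread = [[(0, 0), (10, 0)], [(0, 1), (10, 1)]]
print(i_index(compact) > i_index(spread))
```

The compact, well-separated partition scores higher, which matches the slide's reading that I is to be maximized.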


XB-Index (Xie & Beni, 1991)

$$XB(K) = \frac{\sum_{i=1}^{K} \sum_{j=1}^{n} u_{ij}^2 \, \lVert x_j - c_i \rVert^2}{n \left( \min_{i \neq k} \lVert c_i - c_k \rVert^2 \right)}$$

K = number of clusters
c_i = centroid of the ith cluster
x_j = jth point of the dataset
n = total number of points present in the dataset
[u_ij]_{K×n} = cluster membership matrix

- The numerator quantifies the compactness of the clusters.
- The denominator quantifies the separation between clusters.
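A minimal Python sketch of the XB-index follows, using squared Euclidean distance and a crisp (0/1) membership matrix for the toy data. The function names and the example points are hypothetical; the sketch only illustrates the formula and is not the authors' code.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def xb_index(points, centroids, membership):
    """XB-index; membership[i][j] is u_ij for cluster i and point j.

    Requires at least two clusters with distinct centroids.
    """
    n = len(points)
    k = len(centroids)
    # Numerator: membership-weighted squared distances (compactness).
    compactness = sum(
        membership[i][j] ** 2 * sq_dist(points[j], centroids[i])
        for i in range(k)
        for j in range(n)
    )
    # Denominator: n times the smallest squared centroid distance (separation).
    separation = min(
        sq_dist(centroids[i], centroids[m])
        for i in range(k)
        for m in range(k)
        if i != m
    )
    return compactness / (n * separation)


# Toy data: the same four points partitioned two ways.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = xb_index(points, [(0, 0.5), (10, 0.5)], [[1, 1, 0, 0], [0, 0, 1, 1]])
bad = xb_index(points, [(5, 0), (5, 1)], [[1, 0, 1, 0], [0, 1, 0, 1]])
print(good < bad)
```

Lower values indicate compact, well-separated clusters, so XB is an objective to minimize.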


MOO: The Pareto Optimal Front

[Figure: five candidate solutions, numbered 1 to 5, plotted against two objectives, f1 (maximize) and f2 (minimize).]
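The idea of a Pareto optimal front can be sketched in a few lines of Python. The example assumes two objectives, with f1 maximized and f2 minimized as on the slide; the five (f1, f2) values are made up for illustration and are not read off the figure.

```python
def dominates(a, b):
    """True if solution a dominates b; each is an (f1, f2) pair.

    f1 is maximized and f2 is minimized.
    """
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_front(solutions):
    """Keep the solutions that no other solution dominates."""
    return [s for s in solutions if not any(dominates(o, s) for o in solutions)]


# Hypothetical objective values for five candidate solutions.
candidates = [(1.0, 0.2), (2.0, 0.5), (3.0, 0.9), (1.5, 0.6), (2.5, 1.0)]
front = pareto_front(candidates)
print(front)
```

With these values the last two candidates are dominated, so only the first three remain on the front.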


String Representation

- AMOSA-clus implements simulated annealing (SA).
- Centroid-based real encoding:
  - Each member of the archive is encoded as a string that represents the centroids of the partitions.
  - Each centroid is indivisible.
- Given a fixed maximum number of clusters K_max, the initial number of clusters and their centroids are determined randomly.

< 12.3 1.4 22.1 0.01 0.0 15.3 10.2 7.5 >

represents four cluster centroids:

(12.3, 1.4), (22.1, 0.01), (0.0, 15.3), (10.2, 7.5)
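The encoding above can be illustrated with a small Python sketch that converts between the flat string and a list of centroids. The helper names `decode` and `encode` are hypothetical, and the dimensionality (2 in the slide's example) is passed in explicitly.

```python
def decode(string, dim):
    """Split a flat list of reals into centroids of `dim` coordinates each."""
    if len(string) % dim != 0:
        raise ValueError("string length must be a multiple of the dimension")
    return [tuple(string[i:i + dim]) for i in range(0, len(string), dim)]


def encode(centroids):
    """Flatten a list of centroids back into a single string of reals."""
    return [value for c in centroids for value in c]


string = [12.3, 1.4, 22.1, 0.01, 0.0, 15.3, 10.2, 7.5]
centroids = decode(string, dim=2)
print(centroids)
```

Keeping each centroid as one tuple reflects the slide's point that a centroid is indivisible: operators add, remove or change whole centroids, never single coordinates across centroid boundaries.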


Assignment of Points to the Clusters

Assignment of points and update of cluster centroids resemble an iteration of the K-means clustering algorithm.

1. A point j is assigned to the cluster k whose centroid has the minimum distance to j:

$$k = \arg\min_{i = 1, \ldots, K} d(x_j, c_i) \qquad (1)$$

2. After all points are assigned to a cluster, the cluster centroids are updated:

$$c_i = \frac{\sum_{j=1}^{n_i} x_j^i}{n_i}, \quad 1 \le i \le K \qquad (2)$$
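Equations (1) and (2) can be sketched as one Python function. The sketch assumes Euclidean distance by default and, as a labeled assumption not stated on the slide, leaves the centroid of an empty cluster unchanged. The name `assign_and_update` is hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_and_update(points, centroids, dist=euclidean):
    """One assignment step (Eq. 1) followed by one centroid update (Eq. 2)."""
    # Eq. (1): each point goes to the cluster with the nearest centroid.
    labels = [
        min(range(len(centroids)), key=lambda i: dist(x, centroids[i]))
        for x in points
    ]
    # Eq. (2): each centroid becomes the mean of its assigned points.
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [x for x, lab in zip(points, labels) if lab == i]
        if members:
            mean = tuple(sum(col) / len(members) for col in zip(*members))
            new_centroids.append(mean)
        else:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(old)
    return labels, new_centroids


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
labels, updated = assign_and_update(points, [(1, 1), (9, 1)])
print(labels, updated)
```

A different distance, such as a cosine-based one, could be passed through the `dist` parameter.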


Search Operators

Mutation 1: Perturb the centroid of a random cluster using a Laplacian distribution:

$$p(\epsilon) \propto e^{-\frac{|\epsilon - \mu|}{\delta}}$$

Mutation 2: Delete a random cluster centroid.

Mutation 3: Add a new cluster centroid.

Example, starting from the string < 3.5 1.5 2.1 4.9 1.6 1.2 >:

1. Mutation 1: if we choose centroid 2, then we update centroid (2.1, 4.9). The new string is: < 3.5 1.5 1.2 3.6 1.6 1.2 >

2. Mutation 2: if we choose centroid 3, the new string is: < 3.5 1.5 2.1 4.9 >

3. Mutation 3: the new string is: < 3.5 1.5 2.1 4.9 1.6 1.2 9.7 2.5 >
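The three operators can be sketched in Python on a list of centroids. Two things are assumptions rather than details from the slides: µ is taken to be the current coordinate value, and the new centroid for Mutation 3 is supplied by the caller. The function names are hypothetical, and indices are 0-based, so the slide's "centroid 2" is index 1.

```python
import random


def laplace(mu, delta, rng):
    """Draw from a density proportional to exp(-|eps - mu| / delta)."""
    # An exponential magnitude with mean delta, given a random sign.
    magnitude = rng.expovariate(1.0 / delta)
    return mu + magnitude if rng.random() < 0.5 else mu - magnitude


def mutate_perturb(centroids, index, delta, rng):
    """Mutation 1: perturb every coordinate of one centroid."""
    out = list(centroids)
    out[index] = tuple(laplace(v, delta, rng) for v in centroids[index])
    return out


def mutate_delete(centroids, index):
    """Mutation 2: remove one whole centroid."""
    return centroids[:index] + centroids[index + 1:]


def mutate_add(centroids, new_centroid):
    """Mutation 3: append one new centroid."""
    return centroids + [new_centroid]


rng = random.Random(0)
base = [(3.5, 1.5), (2.1, 4.9), (1.6, 1.2)]
print(mutate_perturb(base, 1, 1.0, rng))
print(mutate_delete(base, 2))
print(mutate_add(base, (9.7, 2.5)))
```

Each function returns a new list and leaves its input untouched, which makes it easy to reject a mutated candidate and keep the current one.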


Selecting a Solution

- The algorithm produces a set of alternative solutions.
- Each solution is optimal according to some criteria.

Unsupervised Setting

Choose one solution randomly.

[Figure: five candidate solutions, numbered 1 to 5, plotted against two objectives, f1 (maximize) and f2 (minimize).]

Semi-supervised Setting

- Each question has a portion of known clustering assignments.
- Select the solution with the best entropy on the known assignments.
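The semi-supervised selection can be sketched as follows. The slides do not define the entropy measure, so this sketch assumes the common size-weighted entropy of known labels within each predicted cluster, with lower values counted as better. The function names and the choice of which documents are "known" are hypothetical; the document IDs come from the hemorrhoids example.

```python
import math
from collections import Counter


def clustering_entropy(pred, gold):
    """Size-weighted entropy of gold labels inside each predicted cluster.

    pred and gold map document IDs to labels; only documents that appear
    in both are scored.
    """
    known = [d for d in gold if d in pred]
    by_cluster = {}
    for d in known:
        by_cluster.setdefault(pred[d], []).append(gold[d])
    total = len(known)
    entropy = 0.0
    for labels in by_cluster.values():
        size = len(labels)
        counts = Counter(labels)
        h = -sum((c / size) * math.log2(c / size) for c in counts.values())
        entropy += (size / total) * h
    return entropy


def select_solution(solutions, gold):
    """Pick the solution whose known assignments have the lowest entropy."""
    return min(solutions, key=lambda s: clustering_entropy(s, gold))


# Hypothetical known portion of the hemorrhoids question.
gold = {"11289288": 1, "12972967": 1, "17054255": 2, "1442682": 3}
one_cluster = {d: "a" for d in gold}
three_clusters = {"11289288": "a", "12972967": "a", "17054255": "b", "1442682": "c"}
best = select_solution([one_cluster, three_clusters], gold)
print(best is three_clusters)
```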


Data

- Clinical Inquiries from the Journal of Family Practice.
- 276 clinical questions (276 clustering tasks).
- Each question has an average of 5.89 documents.

Which treatments work best for hemorrhoids?

1. Excision is the most effective treatment for thrombosed external hemorrhoids. [11289288] [12972967] [15486746]

2. For prolapsed internal hemorrhoids, the best definitive treatment is traditional hemorrhoidectomy. [17054255] [17380367]

3. Of nonoperative techniques, rubber band ligation produces the lowest rate of recurrence. [1442682] [16252313] [16235372]


Results

Distance     AMOSA-clus1         AMOSA-clus2         K-means
measure      best     average    best     average    (baseline)
Euclidean    0.190    0.249      0.177    0.235      0.240
Cosine       0.187    0.231      0.177    0.230      0.237

Unsupervised: Average solution is slightly better than baseline (differences statistically significant).

Semi-supervised: Best solution is clearly better than baseline (differences statistically significant).


Finding the Number of Clusters


AMOSA-clus1: Number of clusters as given by the original data.

- Average 2.38 clusters.

AMOSA-clus2: Try several numbers of clusters and select the solution that optimises I-index and XB-index.

- Euclidean distance: Average 2.34 clusters.
- Cosine distance: Average 2.51 clusters.


$$\text{error} = \frac{\sum_i (\text{target}_i - \text{predicted}_i)^2}{\text{number of questions}}$$

Method                   Error
AMOSA-clus2 Cosine       1.90
AMOSA-clus2 Euclidean    1.91
k = 1                    3.91
k = 2                    2.14
k = 3                    2.38
k = 4                    4.61
Rule of Thumb            2.56
Cover                    1.98
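The error measure is a mean squared difference between the target and predicted number of clusters per question. A minimal Python sketch follows; the function name and the numbers are hypothetical and unrelated to the reported results.

```python
def cluster_count_error(targets, predictions):
    """Mean squared difference between target and predicted cluster counts."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    squared = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return sum(squared) / len(targets)


# Hypothetical example: four questions, always predicting k = 2.
targets = [2, 3, 1, 4]
always_two = [2, 2, 2, 2]
print(cluster_count_error(targets, always_two))
```

A fixed-k baseline such as the "k = 2" row corresponds to passing the same value for every question, as in the example.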


Conclusions

- Unsupervised setting: slight improvement over k-means baseline.
- Semi-supervised setting: clear improvement over k-means baseline.
- Number of clusters: better than standard methods.

Further Work

- Test on other domains.
- Test using other cluster validity indices.
- Compare with other semi-supervised methods.

Questions?
