A. Ekbal, S. Saha, D. Mollá, and K. Ravikumar. Multi-Objective Optimization for Clustering of Medical Publications (2013). Proceedings of the Australasian Language Technology Association Workshop 2013 (ALTA 2013), pp53-61, Brisbane, Australia. http://aclweb.org/anthology/U/U13/
Multi-Objective Optimization for Clustering of Medical Publications
Asif Ekbal (1), Sriparna Saha (1), Diego Molla (2), K Ravikumar (1)
(1) Indian Institute of Technology, Patna, Bihar, India
(2) Centre for Language Technology, Macquarie University, Sydney, Australia
ALTA 2013, Brisbane, Australia
Clustering for Evidence Based Medicine Clustering as a MOO Problem AMOSA-clus Results
Contents
Clustering for Evidence Based Medicine
Clustering as a MOO Problem
AMOSA-clus
Results
MOO for Medical Clustering Asif Ekbal, Sriparna Saha, Diego Molla, K Ravikumar 2/26
Evidence Based Medicine
http://laikaspoetnik.wordpress.com/2009/04/04/evidence-based-medicine-the-facebook-of-medicine/
The Dream
The Bottom-line Answer
A Means of Getting There
Input
QUESTION: Which treatments work best for hemorrhoids?
DOCUMENTS: [11289288] [12972967] [1442682] [15486746] [16235372] [16252313] [17054255] [17380367]
=⇒ (clustering, summarisation)
Output
1. Excision is the most effective treatment for thrombosed external hemorrhoids. [11289288] [12972967] [15486746]
2. For prolapsed internal hemorrhoids, the best definitive treatment is traditional hemorrhoidectomy. [17054255] [17380367]
3. Of nonoperative techniques, rubber band ligation produces the lowest rate of recurrence. [1442682] [16252313] [16235372]
This Work
Each question is formulated as an independent clustering task.
Input
QUESTION: Which treatments work best for hemorrhoids?
DOCUMENTS: [11289288] [12972967] [1442682] [15486746] [16235372] [16252313] [17054255] [17380367]
=⇒ (clustering)
Output
1. [11289288] [12972967] [15486746]
2. [17054255] [17380367]
3. [1442682] [16252313] [16235372]
Related Work
Uses of Document Clustering
- Web search
- Topic detection and tracking
- Training data expansion
- Multi-document summarisation
Clustering in EBM
- Cluster search results
- Cluster based on interventions
- Shash & Molla (2013): k-means clustering on our data set
Clustering and Multi-Objective Optimization
- Most existing clustering techniques are based on a single criterion of goodness.
- Several criteria of goodness have been proposed.
- So why not try several criteria at once?
Internal Validity
- BIC-index
- CH-index
- Silhouette-index
- DB-index
- ...
External Validity
- Minkowski scores
- F-measures
- ...
Information in Internal Validity Indices
Compactness
- Measures the distance among the elements of a cluster.
- We want clusters with short distances between their elements.
Separability
- Measures the distance between clusters.
- We want relatively large distances between clusters.
I-Index (Maulik & Bandyopadhyay, 2002)
$$I(K) = \left( \frac{1}{K} \times \frac{E_1}{E_K} \times D_K \right)^p$$
$$E_K = \sum_{k=1}^{K} \sum_{j=1}^{n_k} d_e(c_k, x_j^k)$$
$$D_K = \max_{i,j=1}^{K} d_e(c_i, c_j)$$
$K$ = number of clusters
$c_j$ = centroid of the jth cluster
$x_j^k$ = jth point of the kth cluster
$n_k$ = total number of points present in the kth cluster
- $E_1 / E_K$ increases $I$ as the clusters become more compact.
- $D_K$ increases $I$ as the separation between clusters increases.
- $p$ is a parameter set to 2 in this paper.
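The I-index definition can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes Euclidean distance for $d_e$, takes $E_1$ to be $E_K$ computed with the whole data set as a single cluster, and the function names (`centroid`, `scatter`, `i_index`) are hypothetical. It does not guard against $E_K = 0$ (for example, when every cluster holds a single point).

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def scatter(clusters):
    """E_K: sum of Euclidean distances from every point to its own cluster centroid."""
    total = 0.0
    for pts in clusters:
        c = centroid(pts)
        total += sum(math.dist(c, x) for x in pts)
    return total

def i_index(clusters, p=2):
    """I(K) = (1/K * E_1/E_K * D_K)^p for a partition given as a list of point lists."""
    k = len(clusters)
    all_points = [x for pts in clusters for x in pts]
    e_1 = scatter([all_points])      # assumption: E_1 treats the data set as one cluster
    e_k = scatter(clusters)
    cents = [centroid(pts) for pts in clusters]
    d_k = max(math.dist(a, b) for a in cents for b in cents)
    return (e_1 / e_k / k * d_k) ** p

# Toy data (made up): two tight, well-separated groups versus a poor split.
good = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
bad = [[(0, 0), (10, 0)], [(0, 2), (10, 2)]]
print(i_index(good) > i_index(bad))  # the compact, separated partition scores higher
```

Because larger values indicate better partitions, this index is the one to maximize.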
XB-Index (Xie & Beni, 1991)
$$XB(K) = \frac{\sum_{i=1}^{K} \sum_{j=1}^{n} u_{ij}^2 \, \| x_j - c_i \|^2}{n \left( \min_{i \neq k} \| c_i - c_k \|^2 \right)}$$
$K$ = number of clusters
$c_i$ = centroid of the ith cluster
$x_j$ = jth point of the dataset
$n$ = total number of points present in the dataset
$[u_{ij}]_{K \times n}$ = cluster membership matrix
- The numerator quantifies the compactness of the clusters.
- The denominator quantifies the separation between clusters.
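As a rough sketch of the XB-index, the following Python function assumes crisp (hard) membership, so that $u_{ij}$ is 1 when point $j$ belongs to cluster $i$ and 0 otherwise; the original index also admits fuzzy memberships, which this sketch does not cover. The name `xb_index` and the `labels` argument are illustrative choices, and at least two centroids are required.

```python
import math

def xb_index(points, centroids, labels):
    """XB(K) with crisp membership: labels[j] is the index of the cluster of points[j]."""
    n = len(points)
    # Numerator: squared distance from each point to the centroid of its own cluster.
    compact = sum(math.dist(x, centroids[lab]) ** 2 for x, lab in zip(points, labels))
    # Denominator term: smallest squared distance between two distinct centroids.
    sep = min(
        math.dist(centroids[i], centroids[k]) ** 2
        for i in range(len(centroids))
        for k in range(len(centroids))
        if i != k
    )
    return compact / (n * sep)

# Toy data (made up): the same four points under a good and a poor partition.
pts = [(0, 0), (0, 2), (10, 0), (10, 2)]
xb_good = xb_index(pts, [(0, 1), (10, 1)], [0, 0, 1, 1])
xb_bad = xb_index(pts, [(5, 0), (5, 2)], [0, 1, 0, 1])
print(xb_good < xb_bad)  # lower XB means more compact and better separated
```

Smaller values are better here, so this index is the one to minimize.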
MOO: The Pareto Optimal Front
[Figure: five candidate solutions, numbered 1 to 5, plotted against f1 (maximize) and f2 (minimize).]
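The idea of a Pareto optimal front can be illustrated with a small non-dominance filter. This is a sketch under stated assumptions: each solution is a pair (f1, f2) with f1 maximized and f2 minimized as in the figure, the sample values are made up and do not reproduce the plotted points, and the function names are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b on both objectives and strictly better on one.

    Each solution is a pair (f1, f2); f1 is maximized and f2 is minimized.
    """
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better

def pareto_front(solutions):
    """Keep the solutions that no other solution dominates."""
    return [s for s in solutions if not any(dominates(o, s) for o in solutions)]

# Made-up objective values for five candidate solutions.
candidates = [(1.0, 1.0), (2.0, 2.0), (3.0, 4.0), (1.5, 3.0), (2.5, 5.0)]
front = pareto_front(candidates)
print(front)
```

Every solution on the front trades one objective against the other; none can be improved on f1 without getting worse on f2.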
String Representation
- AMOSA-clus implements simulated annealing (SA).
- Centroid-based real-encoding:
  - Each member of the archive is encoded as a string that represents the centroids of the partitions.
  - Each centroid is indivisible.
- Given a fixed maximum number of clusters $K_{max}$, the initial number of clusters and their centroids are determined randomly.
< 12.3 1.4 22.1 0.01 0.0 15.3 10.2 7.5 >
Represents four cluster centroids:
(12.3, 1.4), (22.1, 0.01), (0.0, 15.3), (10.2, 7.5)
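The encoding can be made concrete with a pair of helper functions. This is only a sketch: the names `decode` and `encode` and the explicit `dim` argument (the number of features per centroid) are assumptions introduced for illustration.

```python
def decode(string, dim):
    """Split a flat list of reals into centroids of `dim` coordinates each."""
    if len(string) % dim != 0:
        raise ValueError("string length must be a multiple of the dimension")
    return [tuple(string[i:i + dim]) for i in range(0, len(string), dim)]

def encode(centroids):
    """Flatten a list of centroids back into a single string of reals."""
    return [value for c in centroids for value in c]

# The string from the slide, read as two-dimensional centroids.
string = [12.3, 1.4, 22.1, 0.01, 0.0, 15.3, 10.2, 7.5]
centroids = decode(string, dim=2)
print(centroids)
```

Treating each centroid as an indivisible block means that search operators act on whole tuples, never on part of a centroid.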
Assignment of Points to the Clusters
Assignment of points and update of cluster centroids resemble an iteration of the K-means clustering algorithm.
1. A point $j$ is assigned to the cluster $k$ whose centroid has the minimum distance to $j$:
$$k = \operatorname{argmin}_{i=1,\ldots,K} d(x_j, c_i) \quad (1)$$
2. After all points are assigned to a cluster, the cluster centroids are updated:
$$c_i = \frac{\sum_{j=1}^{n_i} x_j^i}{n_i}, \quad 1 \leq i \leq K \quad (2)$$
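Equations (1) and (2) can be sketched as two small Python functions. The sketch assumes Euclidean distance for $d$ and, as a further assumption not stated on the slide, leaves the centroid of an empty cluster unchanged; the names `assign` and `update` are illustrative.

```python
import math

def assign(points, centroids):
    """Equation (1): label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: math.dist(x, centroids[i]))
        for x in points
    ]

def update(points, labels, centroids):
    """Equation (2): move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [x for x, lab in zip(points, labels) if lab == i]
        if not members:
            new_centroids.append(old)  # assumption: an empty cluster keeps its centroid
            continue
        dims = range(len(old))
        new_centroids.append(tuple(sum(x[d] for x in members) / len(members) for d in dims))
    return new_centroids

# Toy data (made up).
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
labels = assign(points, [(1, 1), (8, 1)])
centroids = update(points, labels, [(1, 1), (8, 1)])
print(labels, centroids)
```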
Search Operators
Mutation 1: Perturb the centroid of a random cluster using a Laplacian distribution:
$$p(\epsilon) \propto e^{-\frac{|\epsilon - \mu|}{\delta}}$$
Mutation 2: Delete a random cluster centroid.
Mutation 3: Add a new cluster centroid.
< 3.5 1.5 2.1 4.9 1.6 1.2 >
1. If we choose centroid 2, then update centroid (2.1, 4.9). The new string is: < 3.5 1.5 1.2 3.6 1.6 1.2 >
2. If we choose centroid 3, the new string will be: < 3.5 1.5 2.1 4.9 >
3. New string: < 3.5 1.5 2.1 4.9 1.6 1.2 9.7 2.5 >
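The three mutations can be sketched as operations on the flat string. Several details here are assumptions rather than facts from the slides: $\mu$ is taken to be the current value of each coordinate, $\delta$ defaults to 1.0, Laplacian noise is drawn as the difference of two exponential variates, the new centroid for Mutation 3 is supplied by the caller, and no bounds on the number of clusters are enforced. All function names are hypothetical.

```python
import random

def perturb(string, dim, delta=1.0, rng=random):
    """Mutation 1: add Laplacian noise (scale delta) to every coordinate of one centroid."""
    c = rng.randrange(len(string) // dim)
    out = list(string)
    for pos in range(c * dim, (c + 1) * dim):
        # The difference of two Exp(1/delta) draws follows a Laplace(0, delta) law.
        out[pos] += rng.expovariate(1 / delta) - rng.expovariate(1 / delta)
    return out

def delete(string, dim, rng=random):
    """Mutation 2: remove one randomly chosen centroid as a whole block."""
    c = rng.randrange(len(string) // dim)
    out = list(string)
    return out[:c * dim] + out[(c + 1) * dim:]

def insert(string, new_centroid):
    """Mutation 3: append a new centroid to the end of the string."""
    return list(string) + list(new_centroid)

# The example string from the slide, with two-dimensional centroids.
s = [3.5, 1.5, 2.1, 4.9, 1.6, 1.2]
rng = random.Random(0)
perturbed = perturb(s, dim=2, rng=rng)
shorter = delete(s, dim=2, rng=rng)
longer = insert(s, (9.7, 2.5))
print(perturbed, shorter, longer)
```

Passing a seeded `random.Random` instance keeps a run reproducible, which is useful when comparing annealing schedules.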
Selecting a Solution
- The algorithm produces a set of alternative solutions.
- Each solution is optimal according to some criteria.
Unsupervised Setting
Choose one solution randomly.
[Figure: five candidate solutions, numbered 1 to 5, plotted against f1 (maximize) and f2 (minimize).]
Semi-supervised Setting
- Each question has a portion of known clustering assignments.
- Select the solution with best entropy in known assignments.
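The semi-supervised selection step might look like the sketch below. The slides do not give the entropy formula, so this assumes the common size-weighted cluster entropy (0 for a pure clustering, so lower is better) computed only over documents with known assignments. The function names, the dictionary layout, and the toy labels are all illustrative assumptions.

```python
import math
from collections import Counter

def clustering_entropy(predicted, known):
    """Size-weighted entropy of the known labels inside each predicted cluster.

    predicted: dict mapping document id -> predicted cluster id
    known:     dict mapping document id -> known cluster label (a subset of documents)
    """
    groups = {}
    for doc, label in known.items():
        groups.setdefault(predicted[doc], []).append(label)
    total = 0.0
    for labels in groups.values():
        size = len(labels)
        h = -sum((c / size) * math.log2(c / size) for c in Counter(labels).values())
        total += size / len(known) * h
    return total

def select_solution(solutions, known):
    """Pick the candidate clustering with the lowest entropy on the known documents."""
    return min(solutions, key=lambda sol: clustering_entropy(sol, known))

# Made-up known assignments for three documents and two candidate solutions.
known = {"d1": "a", "d2": "a", "d3": "b"}
sol_pure = {"d1": 0, "d2": 0, "d3": 1, "d4": 1}
sol_mixed = {"d1": 0, "d2": 1, "d3": 1, "d4": 0}
chosen = select_solution([sol_mixed, sol_pure], known)
print(chosen is sol_pure)
```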
Data
- Clinical Inquiries from the Journal of Family Practice.
- 276 clinical questions (276 clustering tasks).
- Each question has an average of 5.89 documents.
Which treatments work best for hemorrhoids?
1. Excision is the most effective treatment for thrombosed external hemorrhoids. [11289288] [12972967] [15486746]
2. For prolapsed internal hemorrhoids, the best definitive treatment is traditional hemorrhoidectomy. [17054255] [17380367]
3. Of nonoperative techniques, rubber band ligation produces the lowest rate of recurrence. [1442682] [16252313] [16235372]
Results
Distance     AMOSA-clus1         AMOSA-clus2         K-means
Measure      best     average    best     average    (baseline)
Euclidean    0.190    0.249      0.177    0.235      0.240
Cosine       0.187    0.231      0.177    0.230      0.237
Unsupervised: Average solution is slightly better than baseline (differences statistically significant).
Semi-supervised: Best solution is clearly better than baseline (differences statistically significant).
Finding the Number of Clusters
Distance     AMOSA-clus1         AMOSA-clus2         K-means
Measure      best     average    best     average    (baseline)
Euclidean    0.190    0.249      0.177    0.235      0.240
Cosine       0.187    0.231      0.177    0.230      0.237
AMOSA-clus1: Number of clusters as given by the original data.
- Average 2.38 clusters.
AMOSA-clus2: Try several numbers of clusters and select the solution that optimises I-index and XB-index.
- Euclidean distance: Average 2.34 clusters.
- Cosine distance: Average 2.51 clusters.
Finding the Number of Clusters
$$\text{error} = \frac{\sum_i (\text{target}_i - \text{predicted}_i)^2}{\text{\# of questions}}$$
Method                   Error
AMOSA-clus2 Cosine       1.90
AMOSA-clus2 Euclidean    1.91
k = 1                    3.91
k = 2                    2.14
k = 3                    2.38
k = 4                    4.61
Rule of Thumb            2.56
Cover                    1.98
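The error measure is a mean squared difference between the target and predicted number of clusters per question. A minimal sketch, with a hypothetical function name and made-up counts that are unrelated to the reported figures:

```python
def cluster_count_error(target, predicted):
    """Mean squared difference between target and predicted cluster counts."""
    if len(target) != len(predicted) or not target:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((t - p) ** 2 for t, p in zip(target, predicted)) / len(target)

# Made-up example: four questions, and a method that always predicts k = 2.
target = [2, 3, 2, 4]
always_two = [2, 2, 2, 2]
print(cluster_count_error(target, always_two))  # (0 + 1 + 0 + 4) / 4 = 1.25
```

Squaring the difference penalises large misses more heavily than small ones, which is why a fixed k far from the typical cluster count scores poorly.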
Conclusions
Conclusions
- Unsupervised setting: slight improvement over k-means baseline.
- Semi-supervised setting: clear improvement over k-means baseline.
- Number of clusters: better than standard methods.
Further Work
- Test on other domains.
- Test using other cluster validity indices.
- Compare with other semi-supervised methods.
Questions?