Presentation given by Mark M. Hall, Mark Stevenson and Paul D. Clough of the Information School / Department of Computer Science, University of Sheffield, UK, at TPDL 2012, Cyprus, 24-27 September 2012.
Evaluating the Use of Clustering for Automatically Organising Digital Library Collections
Mark M. Hall, Mark Stevenson, Paul D. Clough
TPDL 2012, Cyprus, 24-27 September 2012
Opening Up Digital Cultural Heritage
http://www.flickr.com/photos/usnationalarchives/4069633668/
Carl Collins, http://www.flickr.com/photos/carlcollins/199792939/
http://www.flickr.com/photos/brokenthoughts/122096903/
Exploring Collections
• Exploring / Browsing as an alternative to Search (where applicable)
• Requires some kind of structuring of the data
• Manual structuring ideal
  – Expensive to generate
  – Integration of collections problematic
• Alternative: Automatic structuring via clustering
Test Collection
• 28133 photographs provided by the University of St Andrews Library
  – 85% pre 1940
  – 89% black and white
  – Majority UK
  – Title and description tend to be short
Ottery St Mary Church
Tested Clustering Strategies
• Latent Dirichlet Allocation (LDA)
  – 300 & 900 topics
  – With and without Pointwise Mutual Information (PMI) filtering
• K-Means
  – 900 clusters
  – TFIDF vectors & LDA topic vectors
• OPTICS
  – 900 clusters
  – TFIDF vectors & LDA topic vectors
Processing Time
Model            Wall-clock Time (hh:mm:ss)
LDA 300          00:21:48
LDA 900          00:42:42
LDA + PMI 300    05:05:13
LDA + PMI 900    17:26:08
K-Means TFIDF    09:37:40
K-Means LDA      03:49:04
OPTICS TFIDF     12:42:13
OPTICS LDA       05:12:49
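To make the timings easier to compare, the following sketch converts each wall-clock time to a duration and prints it relative to the fastest model. It assumes the times are in hh:mm:ss format, which the slide does not state explicitly; the names `WALL_CLOCK` and `parse_hms` are illustrative only.

```python
from datetime import timedelta

# Wall-clock times copied from the Processing Time slide (assumed hh:mm:ss).
WALL_CLOCK = {
    "LDA 300": "00:21:48",
    "LDA 900": "00:42:42",
    "LDA + PMI 300": "05:05:13",
    "LDA + PMI 900": "17:26:08",
    "K-Means TFIDF": "09:37:40",
    "K-Means LDA": "03:49:04",
    "OPTICS TFIDF": "12:42:13",
    "OPTICS LDA": "05:12:49",
}


def parse_hms(text):
    """Convert an 'hh:mm:ss' string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


durations = {model: parse_hms(t) for model, t in WALL_CLOCK.items()}
fastest = min(durations, key=durations.get)

# List models from fastest to slowest with their slowdown factor.
for model, duration in sorted(durations.items(), key=lambda kv: kv[1]):
    factor = duration / durations[fastest]
    print(f"{model:<16}{str(duration):>10}  {factor:5.1f}x")
```

Under that assumption, plain LDA with 300 topics is the fastest configuration and LDA with PMI filtering at 900 topics takes roughly 48 times as long.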
Evaluation Metrics
• Cluster cohesion
  – Items in a cluster should be similar to each other
  – Items in a cluster should be different from items in other clusters
• How to test this?
  – “Intruder” test
  – If you insert an intruder into a cluster, can people find it?
Intruder Test
1. Randomly select one topic
2. Randomly select four items from the topic
3. Randomly select a second topic – the “intruder” topic
4. Randomly select one item from the second topic – the “intruder” item
5. Scramble the five items and let the user choose which one is the “intruder”
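The five steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `build_intruder_unit`, the dictionary layout of `clusters`, and the rule that the main topic needs at least four items are all assumptions, and a hard clustering (each item in exactly one cluster) is assumed.

```python
import random


def build_intruder_unit(clusters, rng=random):
    """Build one intruder-test unit.

    clusters: dict mapping a cluster/topic id to a list of item ids
    (hypothetical layout). rng: the random module or a random.Random.
    """
    # Step 1: pick the main topic; it must hold at least four items (assumption).
    eligible = [cid for cid, items in clusters.items() if len(items) >= 4]
    topic = rng.choice(eligible)
    # Step 2: pick four distinct items from the main topic.
    main_items = rng.sample(clusters[topic], 4)
    # Step 3: pick a different, non-empty topic as the intruder topic.
    others = [cid for cid, items in clusters.items() if cid != topic and items]
    intruder_topic = rng.choice(others)
    # Step 4: pick one item from the intruder topic.
    intruder = rng.choice(clusters[intruder_topic])
    # Step 5: scramble the five items before showing them to the user.
    shown = main_items + [intruder]
    rng.shuffle(shown)
    return {
        "items": shown,
        "intruder": intruder,
        "topic": topic,
        "intruder_topic": intruder_topic,
    }


if __name__ == "__main__":
    demo = {"churches": [1, 2, 3, 4, 5], "harbours": [6, 7], "golf": [8, 9, 10, 11]}
    print(build_intruder_unit(demo, random.Random(42)))
```

Passing a seeded `random.Random` instance makes a unit reproducible, which is convenient when the same unit has to be shown to many participants.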
Cluster Cohesion – Cohesive
Cluster Cohesion – Not Cohesive
Evaluation Metrics
• Cohesive
  – “Intruder” is chosen significantly more frequently than by chance
  – Choice distribution is significantly different from the uniform distribution
• Borderline cohesive
  – Two out of five items make up > 95% of the answers
  – “Intruder” is one of those two
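One way to turn these criteria into a decision rule is sketched below. The slides do not name the statistical tests or the significance level, so everything specific here is an assumption: a one-sided binomial test against a chance rate of 1/5 for the intruder, a chi-square goodness-of-fit test against the uniform distribution, a 0.05 threshold, and checking "cohesive" before "borderline". The function names are illustrative.

```python
from math import comb, exp

ALPHA = 0.05  # assumed significance level; not stated on the slides


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def chi_square_uniform_p(counts):
    """p-value of a chi-square goodness-of-fit test against uniform.

    Only valid for five categories (4 degrees of freedom), where the
    survival function has the closed form exp(-x/2) * (1 + x/2).
    """
    n = sum(counts)
    expected = n / len(counts)
    stat = sum((c - expected) ** 2 / expected for c in counts)
    return exp(-stat / 2) * (1 + stat / 2)


def classify_unit(counts, intruder_index):
    """Label one unit from the answer counts for its five items."""
    if len(counts) != 5:
        raise ValueError("expected answer counts for exactly five items")
    n = sum(counts)
    # Cohesive: intruder chosen more often than chance AND non-uniform answers.
    p_intruder = binomial_upper_tail(counts[intruder_index], n, 1 / 5)
    if p_intruder < ALPHA and chi_square_uniform_p(counts) < ALPHA:
        return "cohesive"
    # Borderline: the two most-chosen items cover > 95% of the answers
    # and the intruder is one of them.
    top_two = sorted(range(5), key=lambda i: counts[i], reverse=True)[:2]
    if intruder_index in top_two and sum(counts[i] for i in top_two) > 0.95 * n:
        return "borderline"
    return "non-cohesive"


if __name__ == "__main__":
    print(classify_unit([20, 2, 2, 2, 1], intruder_index=0))
    print(classify_unit([8, 19, 0, 0, 0], intruder_index=0))
    print(classify_unit([5, 6, 5, 6, 5], intruder_index=0))
```

With 27 ratings per unit (the median reported in the experiment), the three example calls illustrate a clearly identified intruder, a unit where participants are split between the intruder and one other item, and a unit where answers are spread almost evenly.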
Evaluation Bounds
• Upper bound
  – Manual annotation
    • 936 topics
• Lower bound
  – 3 cohesive topics
  – <5% likelihood of seeing that number of cohesive topics by chance
• Control data
  – 10 “really, totally, completely obvious” intruders used to filter participants who randomly select answers
Experiment
• Crowd-sourced using staff & students at Sheffield University
  – 700 participants
• 9 clustering strategies
  – 30 units per strategy
  – Total of 270 units
• Results
  – 8840 ratings
  – 21-30 ratings per unit (median 27 ratings)
Results
Model            Cohesive  Borderline  Non-Cohesive
Upper Bound      27        0           3
Lower Bound      3         0           27
LDA 300          15        6           9
LDA 900          20        4           6
LDA + PMI 300    16        4           10
LDA + PMI 900    21        2           7
K-Means TFIDF    24        3           3
K-Means LDA      20        0           10
OPTICS TFIDF     14        2           14
OPTICS LDA       16        0           14
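The counts in the results table can be turned into proportions to make the models easier to compare. The sketch below only restates the table's numbers; the names `RESULTS` and `cohesive_share` are illustrative, and each row is assumed to cover the 30 units per strategy mentioned on the Experiment slide.

```python
# (cohesive, borderline, non-cohesive) counts copied from the results table.
RESULTS = {
    "Upper Bound": (27, 0, 3),
    "Lower Bound": (3, 0, 27),
    "LDA 300": (15, 6, 9),
    "LDA 900": (20, 4, 6),
    "LDA + PMI 300": (16, 4, 10),
    "LDA + PMI 900": (21, 2, 7),
    "K-Means TFIDF": (24, 3, 3),
    "K-Means LDA": (20, 0, 10),
    "OPTICS TFIDF": (14, 2, 14),
    "OPTICS LDA": (16, 0, 14),
}


def cohesive_share(model, include_borderline=False):
    """Fraction of a model's units judged cohesive (optionally plus borderline)."""
    cohesive, borderline, non_cohesive = RESULTS[model]
    total = cohesive + borderline + non_cohesive
    hits = cohesive + (borderline if include_borderline else 0)
    return hits / total


# Rank models by their share of cohesive units, best first.
for model in sorted(RESULTS, key=cohesive_share, reverse=True):
    strict = cohesive_share(model)
    lenient = cohesive_share(model, include_borderline=True)
    print(f"{model:<16}{strict:6.0%} cohesive, {lenient:6.0%} incl. borderline")
```

Read this way, K-Means on TFIDF vectors reaches 80% cohesive units against 90% for the manual upper bound, and the two 900-topic LDA variants sit at about two thirds.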
Conclusions
• K-Means on TFIDF vectors (24 of 30 units cohesive) is almost as good as the human classification (27 of 30)
• LDA is very fast, and with 900 topics approximately two thirds of the topics are acceptably cohesive
• Future work:
  – Make it hierarchical
  – Create hybrid algorithms
Thank you for listening
Find out more about the project:
http://www.paths-project.eu
The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no 270082. We acknowledge the contribution of all project partners involved in PATHS (see: http://www.paths-project.eu).