Evaluating the Use of Clustering for Automatically Organising Digital Library Collections
Mark M. Hall, Mark Stevenson, Paul D. Clough
TPDL 2012, Cyprus, 24-27 September 2012



DESCRIPTION

Presentation given by Mark M. Hall, Mark Stevenson and Paul D. Clough from the Information School / Department of Computer Science, University of Sheffield, UK, at TPDL 2012, Cyprus, 24-27 September 2012.


Page 1: Evaluating the Use of Clustering for Automatically Organising Digital Library Collections

Evaluating the Use of Clustering for Automatically Organising Digital Library Collections

Mark M. Hall, Mark Stevenson, Paul D. Clough

TPDL 2012, Cyprus, 24-27 September 2012

Page 2: Evaluating the Use of Clustering for Automatically Organising Digital Library Collections

Opening Up Digital Cultural Heritage

http://www.flickr.com/photos/usnationalarchives/4069633668/

Carl Collins, http://www.flickr.com/photos/carlcollins/199792939/

http://www.flickr.com/photos/brokenthoughts/122096903/

Page 3: Evaluating the Use of Clustering for Automatically Organising Digital Library Collections

Exploring Collections

• Exploring / Browsing as an alternative to Search (where applicable)

• Requires some kind of structuring of the data

• Manual structuring ideal
  – Expensive to generate
  – Integration of collections problematic

• Alternative: Automatic structuring via clustering


Page 4: Evaluating the Use of Clustering for Automatically Organising Digital Library Collections

Test Collection

• 28133 photographs provided by the University of St Andrews Library
  – 85% pre 1940
  – 89% black and white
  – Majority UK
  – Title and description tend to be short


Ottery St Mary Church

Page 5: Evaluating the Use of Clustering for Automatically Organising Digital Library Collections

Tested Clustering Strategies

• Latent Dirichlet Allocation (LDA)
  – 300 & 900 topics
  – With and without Pointwise Mutual Information (PMI) filtering

• K-Means
  – 900 clusters
  – TFIDF vectors & LDA topic vectors

• OPTICS
  – 900 clusters
  – TFIDF vectors & LDA topic vectors
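The TFIDF vectors used as input to K-Means and OPTICS can be illustrated with a minimal sketch. The slides do not say which tokenisation or weighting variant was used, so the whitespace tokeniser, the relative term frequency, and the unsmoothed logarithmic IDF below are all assumptions, and `tfidf_vectors` is a hypothetical helper name.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (term -> weight) per document.

    Assumptions (not from the slides): lower-cased whitespace tokens,
    tf = count / document length, idf = log(N / document frequency).
    """
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Short titles, in the spirit of the test collection's brief metadata.
vectors = tfidf_vectors(["Ottery St Mary Church", "St Andrews Cathedral"])
```

A term that occurs in every document (here "st") gets weight zero, which is why short titles and descriptions leave little signal for a clustering algorithm to work with.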


Page 6: Evaluating the Use of Clustering for Automatically Organising Digital Library Collections

Processing Time

Model            Wall-clock Time
LDA 300          00:21:48
LDA 900          00:42:42
LDA + PMI 300    05:05:13
LDA + PMI 900    17:26:08
K-Means TFIDF    09:37:40
K-Means LDA      03:49:04
OPTICS TFIDF     12:42:13
OPTICS LDA       05:12:49


Page 7: Evaluating the Use of Clustering for Automatically Organising Digital Library Collections

Evaluation Metrics

• Cluster cohesion
  – Items in a cluster should be similar to each other
  – Items in a cluster should be different from items in other clusters

• How to test this?
  – “Intruder” test
  – If you insert an intruder into a cluster, can people find it?


Page 8: Evaluating the Use of Clustering for Automatically Organising Digital Library Collections

Intruder Test

1. Randomly select one topic

2. Randomly select four items from the topic

3. Randomly select a second topic – the “intruder” topic

4. Randomly select one item from the second topic – the “intruder” item

5. Scramble the five items and let the user choose which one is the “intruder”
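The five steps above can be sketched as follows. This is an illustration only: it assumes a hard clustering in which each item belongs to exactly one cluster, and the function name `build_intruder_unit` and the dictionary layout are hypothetical, not taken from the presentation.

```python
import random


def build_intruder_unit(clusters, rng=random):
    """Build one intruder-test unit from a mapping cluster id -> list of items.

    `rng` may be the `random` module or a seeded `random.Random` instance.
    """
    # Step 1: pick a topic that has enough items to draw four from.
    eligible = [cid for cid, items in clusters.items() if len(items) >= 4]
    topic = rng.choice(eligible)
    # Step 2: pick four distinct items from that topic.
    members = rng.sample(clusters[topic], 4)
    # Step 3: pick a different, non-empty topic as the "intruder" topic.
    others = [cid for cid, items in clusters.items() if cid != topic and items]
    intruder_topic = rng.choice(others)
    # Step 4: pick one item from the intruder topic.
    intruder = rng.choice(clusters[intruder_topic])
    # Step 5: scramble the five items before showing them to the user.
    items = members + [intruder]
    rng.shuffle(items)
    return {
        "topic": topic,
        "intruder_topic": intruder_topic,
        "items": items,
        "intruder": intruder,
    }


example_clusters = {"a": [1, 2, 3, 4, 5], "b": [6, 7, 8, 9], "c": [10]}
unit = build_intruder_unit(example_clusters, random.Random(0))
```

Passing a seeded `random.Random` makes the generated units reproducible, which helps when the same units have to be shown to many participants.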


Page 9: Evaluating the Use of Clustering for Automatically Organising Digital Library Collections

Cluster Cohesion – Cohesive


Page 10: Evaluating the Use of Clustering for Automatically Organising Digital Library Collections

Cluster Cohesion – Not Cohesive


Page 11: Evaluating the Use of Clustering for Automatically Organising Digital Library Collections

Evaluation Metrics

• Cohesive
  – “Intruder” is chosen significantly more frequently than by chance
  – Choice distribution is significantly different from the uniform distribution

• Borderline cohesive
  – Two out of five items make up > 95% of the answers
  – “Intruder” is one of those two
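One way to turn these criteria into a decision rule is sketched below. The slides do not name the statistical tests, so the one-sided binomial test (chance level 1/5), the chi-square goodness-of-fit test against the uniform distribution, the 0.05 significance level, and the requirement that both cohesive criteria hold are all assumptions. `classify_unit` is a hypothetical name.

```python
from math import comb, exp


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def chi_square_uniform_p(counts):
    """p-value of a chi-square goodness-of-fit test against the uniform distribution.

    Only valid for exactly five categories (4 degrees of freedom), where the
    chi-square survival function has the closed form exp(-x/2) * (1 + x/2).
    """
    if len(counts) != 5 or sum(counts) == 0:
        raise ValueError("expected five answer counts with at least one answer")
    expected = sum(counts) / 5
    stat = sum((c - expected) ** 2 / expected for c in counts)
    return exp(-stat / 2) * (1 + stat / 2)


def classify_unit(counts, intruder_index, alpha=0.05):
    """Classify one unit from the number of votes each of its five items received."""
    n = sum(counts)
    intruder_p = binomial_tail(counts[intruder_index], n, 1 / 5)
    uniform_p = chi_square_uniform_p(counts)
    if intruder_p < alpha and uniform_p < alpha:
        return "cohesive"
    # Borderline: two items attract more than 95% of the answers
    # and the intruder is one of those two.
    top_two = sorted(range(5), key=lambda i: counts[i], reverse=True)[:2]
    if sum(counts[i] for i in top_two) > 0.95 * n and intruder_index in top_two:
        return "borderline"
    return "non-cohesive"


label = classify_unit([1, 1, 23, 1, 1], intruder_index=2)
```

Under these assumptions a unit is borderline when participants narrow the choice down to two items but pick the non-intruder more often, for example 20 votes for one cluster member against 7 for the intruder.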


Page 12: Evaluating the Use of Clustering for Automatically Organising Digital Library Collections

Evaluation Bounds

• Upper bound
  – Manual annotation
    • 936 topics

• Lower bound
  – 3 cohesive topics
  – <5% likelihood of seeing that number of cohesive topics by chance

• Control data
  – 10 “really, totally, completely obvious” intruders used to filter participants who randomly select answers


Page 13: Evaluating the Use of Clustering for Automatically Organising Digital Library Collections

Experiment

• Crowd-sourced using staff & students at Sheffield University
  – 700 participants

• 9 clustering strategies
  – 30 units per strategy – total of 270 units

• Results
  – 8840 ratings
  – 21 – 30 ratings per unit (median 27 ratings)


Page 14: Evaluating the Use of Clustering for Automatically Organising Digital Library Collections

Results

Model            Cohesive  Borderline  Non-Cohesive
Upper Bound      27        0           3
Lower Bound      3         0           27
LDA 300          15        6           9
LDA 900          20        4           6
LDA + PMI 300    16        4           10
LDA + PMI 900    21        2           7
K-Means TFIDF    24        3           3
K-Means LDA      20        0           10
OPTICS TFIDF     14        2           14
OPTICS LDA       16        0           14
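To compare the models more easily, the counts in the results table can be converted into shares of the 30 units per model. The numbers below are copied from the table; the function name `cohesion_shares` and the "cohesive plus borderline" column are illustrative additions, not part of the presentation.

```python
# (cohesive, borderline, non-cohesive) counts per model, copied from the table.
RESULTS = {
    "Upper Bound": (27, 0, 3),
    "Lower Bound": (3, 0, 27),
    "LDA 300": (15, 6, 9),
    "LDA 900": (20, 4, 6),
    "LDA + PMI 300": (16, 4, 10),
    "LDA + PMI 900": (21, 2, 7),
    "K-Means TFIDF": (24, 3, 3),
    "K-Means LDA": (20, 0, 10),
    "OPTICS TFIDF": (14, 2, 14),
    "OPTICS LDA": (16, 0, 14),
}


def cohesion_shares(results):
    """Map each model to (share cohesive, share cohesive or borderline)."""
    shares = {}
    for model, (cohesive, borderline, non_cohesive) in results.items():
        total = cohesive + borderline + non_cohesive
        shares[model] = (cohesive / total, (cohesive + borderline) / total)
    return shares


for model, (strict, lenient) in cohesion_shares(RESULTS).items():
    print(f"{model:<15} cohesive {strict:.0%}  cohesive or borderline {lenient:.0%}")
```

Read this way, K-Means on TFIDF vectors reaches 24 of 30 cohesive units against 27 of 30 for the manual upper bound, and LDA 900 reaches 20 of 30.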


Page 15: Evaluating the Use of Clustering for Automatically Organising Digital Library Collections

Conclusions

• K-Means on TFIDF vectors is almost as good as the human classification (24 vs. 27 cohesive units out of 30)

• LDA is very fast and approximately two thirds of the topics are acceptably cohesive

• Future work:
  – Make it hierarchical
  – Create hybrid algorithms


Page 16: Evaluating the Use of Clustering for Automatically Organising Digital Library Collections

Thank you for listening


Find out more about the project: http://www.paths-project.eu

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no 270082. We acknowledge the contribution of all project partners involved in PATHS (see: http://www.paths-project.eu).