Classifying Tags Using Open Content Resources Simon Overell, Borkur Sigurbjornsson & Roelof van Zwol WSDM ‘09

Motivation
• Classify tags in Flickr as broad categories such as what, where, when and who
• Easier indexing and navigation
• WordNet is usually used for classification but has limited coverage

Example

The ClassTag System

Classifying Wikipedia Articles
• Using only metadata (i.e. Categories and Templates) – high scalability
• Supervised Classifier
– Articles as objects
– WordNet noun semantic categories as classification classes
– Categories and Templates as features
– Support Vector Machine (SVM) as classifier
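The setup on this slide (articles as objects, Categories and Templates as features, WordNet noun semantic categories as classes) can be sketched as a sparse feature mapping. This is a minimal illustration only: the `Article` class, the `cat:` / `tpl:` prefixes, the example article and the label `"location"` are assumptions made for the example, not details taken from the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Article:
    # Hypothetical container for the metadata of one Wikipedia article.
    title: str
    categories: List[str] = field(default_factory=list)
    templates: List[str] = field(default_factory=list)


def extract_features(article: Article) -> Dict[str, int]:
    """Build a sparse binary feature vector from metadata only (no article text)."""
    features: Dict[str, int] = {}
    for name in article.categories:
        features["cat:" + name] = 1
    for name in article.templates:
        features["tpl:" + name] = 1
    return features


# Illustrative training example: (feature vector, class label).
article = Article("London", ["Capitals in Europe"], ["Infobox settlement"])
training_example = (extract_features(article), "location")
```

Pairs of this shape are what a supervised classifier such as an SVM would be trained on; the SVM itself is omitted here.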

Categories and Templates

Supervised Classification
• Ground Truth
– All Wikipedia articles that match WordNet nouns

Data Sparsity
• WordNet categories under-represented (10 out of 25)
• Articles have very few features

Reducing Data Sparsity
• Using category and template network transclusion
• … but noise is added
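One way to picture how walking the category network adds features is a depth-limited breadth-first traversal from an article's own categories up to their parent categories. The sketch below is an assumption about the mechanics, not the authors' implementation: the function name, the toy graph, and the convention that directly assigned categories sit at layer 1 are all hypothetical.

```python
from collections import deque


def collect_features(start_categories, parent_links, max_arcs):
    """Walk up the category network breadth-first.

    Returns {category: layer}, where layer is the number of arcs between the
    article and the category (assumed: directly assigned categories are layer 1).
    Categories more than max_arcs away are ignored.
    """
    layers = {}
    queue = deque((category, 1) for category in start_categories)
    while queue:
        category, layer = queue.popleft()
        if category in layers or layer > max_arcs:
            continue
        layers[category] = layer
        for parent in parent_links.get(category, []):
            queue.append((parent, layer + 1))
    return layers


# Hypothetical category graph: child category -> parent categories.
parent_links = {
    "Capitals in Europe": ["Capitals", "Cities in Europe"],
    "Capitals": ["Cities"],
    "Cities in Europe": ["Cities"],
    "Cities": ["Populated places"],
}
features = collect_features(["Capitals in Europe"], parent_links, max_arcs=3)
```

Each extra arc gives the article more features, which reduces sparsity, but distant ancestors are less specific to the article, which is one plausible source of the added noise. The same traversal could be applied to a template network.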

System Optimization
• Number of arcs traversed in
– Category network
– Template network
• Choice of weighting function
– Term Frequency (tf)
– Term Frequency – Inverse Document Frequency (tf-idf)
– Term Frequency – Inverse Layer (tf-il)
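The three weighting functions can be compared side by side in a few lines. The tf and tf-idf forms below are the standard textbook ones. The slides do not define tf-il, so the form used here (term frequency divided by the layer, i.e. the number of arcs between the article and the feature) is an assumption chosen only to show the idea of down-weighting distant features.

```python
import math


def tf_weight(tf):
    """Term frequency: how many times the feature was reached for this article."""
    return float(tf)


def tf_idf_weight(tf, doc_freq, num_docs):
    """Standard tf-idf: features shared by many articles are down-weighted."""
    return tf * math.log(num_docs / doc_freq)


def tf_il_weight(tf, layer):
    """ASSUMED form of tf-il: features found further from the article
    (a higher layer) are down-weighted. Not taken from the slides."""
    return tf / layer


# Hypothetical feature: reached twice, at layer 2, present in 10 of 1000 articles.
weights = {
    "tf": tf_weight(2),
    "tf-idf": tf_idf_weight(2, doc_freq=10, num_docs=1000),
    "tf-il": tf_il_weight(2, layer=2),
}
```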

Example

Fine Tuning
• Partitioned the ground truth into training and test sets
• Criteria
– At least 80% precision
– Maximum possible recall
• Resulting optimal values
– Category arcs: 3, Template arcs: 3, TF-IL
– Precision: 87%, F1-Measure: 0.696
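The selection criterion (highest recall among configurations with at least 80% precision) can be sketched as a constrained maximum. The candidate configurations and their scores below are hypothetical. The last line uses the standard F1 definition to back out the recall implied by the reported precision of 87% and F1 of 0.696, assuming both figures describe the same configuration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def recall_from_f1(precision, f1):
    """Invert the F1 formula to recover recall from precision and F1."""
    return f1 * precision / (2 * precision - f1)


def pick_configuration(results, min_precision=0.80):
    """Return the configuration with maximum recall whose precision is at least
    min_precision, or None if no configuration qualifies."""
    eligible = [r for r in results if r["precision"] >= min_precision]
    if not eligible:
        return None
    return max(eligible, key=lambda r: r["recall"])


# Hypothetical tuning results, for illustration only.
results = [
    {"name": "A", "precision": 0.90, "recall": 0.40},
    {"name": "B", "precision": 0.82, "recall": 0.55},
    {"name": "C", "precision": 0.75, "recall": 0.70},
]
best = pick_configuration(results)  # "C" is excluded by the precision floor

implied_recall = recall_from_f1(0.87, 0.696)  # about 0.58
```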

SVM Threshold
• SVM outputs the confidence with which an article is correctly classified as a member of a category
• Training experiment with 250 Wikipedia articles (1 assessor)
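Thresholding on the classifier's confidence can be sketched as a simple filter: a classification is kept only when its confidence reaches the threshold. The function name, data layout and scores below are hypothetical.

```python
def accept_classifications(scored, threshold):
    """Keep only articles whose classification confidence meets the threshold.

    scored maps article title -> (category, confidence).
    """
    return {
        article: category
        for article, (category, confidence) in scored.items()
        if confidence >= threshold
    }


# Hypothetical classifier output.
scored = {
    "London": ("location", 0.9),
    "Jaguar": ("animal", 0.2),
    "Christmas": ("time", 0.6),
}
lenient = accept_classifications(scored, threshold=0.0)  # keeps everything here
strict = accept_classifications(scored, threshold=0.5)   # drops "Jaguar"
```

A higher threshold keeps fewer articles but only the more confident classifications, which is the usual recall-for-precision trade-off.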

SVM Threshold

Summary
• Optimised for Recall (ClassTag)
– 39% of Articles classified
– 664,770 Wikipedia articles
• Optimised for Precision (ClassTag+)
– 21% of Articles classified
– 338,061 Wikipedia articles

Comparison with DBpedia
• Experimental Setup
– 300 pooled articles
– 3 Assessors
– Blind Assessments
– 50 articles overlap
• Partial Agreement: 86%
• Total Agreement: 78%

Results

Classification of Flickr Tags
• Tag -> Anchor Text
– String matching
• Anchor Text -> Wikipedia Article
– Number of times an anchor refers to a Wikipedia article
• Wikipedia Article -> Category
– Output of SVM decision
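The three mapping steps can be chained in a short sketch. It assumes that string matching means comparing lower-cased text with white space removed, and that the article an anchor refers to most often is the one chosen. Both are assumptions, as are the function names and the link counts in the example data.

```python
from collections import Counter, defaultdict


def normalise(text):
    """Lower-case and drop white space, mirroring how tags are often written."""
    return "".join(text.lower().split())


def build_anchor_index(anchor_counts):
    """Index anchor texts by normalised form: key -> Counter of target articles."""
    index = defaultdict(Counter)
    for anchor, targets in anchor_counts.items():
        index[normalise(anchor)].update(targets)
    return index


def classify_tag(tag, anchor_index, article_category):
    """Tag -> anchor text -> most frequently linked article -> category."""
    targets = anchor_index.get(normalise(tag))
    if not targets:
        return None
    article, _count = targets.most_common(1)[0]
    return article_category.get(article)


# Hypothetical link counts and article classifications.
anchor_counts = {
    "George Bush": {"George W. Bush": 120, "George Bush Senior": 45},
    "London": {"London": 300},
}
article_category = {
    "George W. Bush": "person",
    "George Bush Senior": "person",
    "London": "location",
}
anchor_index = build_anchor_index(anchor_counts)
category = classify_tag("georgebush", anchor_index, article_category)
```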

Ambiguity
• Tag -> Anchor Text
– Some ambiguity because tags are often lower case with no white space
• Anchor Text -> Wikipedia Article
– 13.4% of Anchor Text -> Wikipedia Article mappings are ambiguous
– 4% of Anchor Text -> Category mappings are ambiguous
– Example: George Bush -> George W. Bush, George Bush Senior; George Bush -> Person
• Wikipedia Article -> Category
– 5.7% of classified articles result in multiple classifications

Example

Evaluation
• ClassTag extended the vocabulary coverage of WordNet classification by 115%
• Taking tag frequency into account
– ClassTag classified 69.2% of Flickr tags
– 22% more than WordNet baseline
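The two kinds of coverage reported here differ in what they count: vocabulary coverage counts each distinct tag once, while the frequency-weighted figure counts every use of a tag. A minimal sketch with hypothetical tag counts shows how the two can diverge.

```python
def coverage(tag_counts, classified):
    """Return (vocabulary coverage, frequency-weighted coverage).

    tag_counts maps tag -> number of times it was used;
    classified is the set of tags that received a category.
    """
    vocabulary = sum(1 for tag in tag_counts if tag in classified) / len(tag_counts)
    total_uses = sum(tag_counts.values())
    weighted = sum(n for tag, n in tag_counts.items() if tag in classified) / total_uses
    return vocabulary, weighted


# Hypothetical tag usage counts.
tag_counts = {"london": 50, "wedding": 30, "2008": 15, "xyzzy": 5}
classified = {"london", "wedding"}
vocabulary_cov, weighted_cov = coverage(tag_counts, classified)
```

In this toy data only half of the distinct tags are classified, but they account for 80% of all tag uses, because the classified tags are the frequent ones.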

Tag distribution

Multilanguage Classification
• 80% of tags in English, 7% in German and 6% in Dutch
• A portion of the unclassified tags may be non-English
• Possible alternate language classification
– Run ClassTag using an alternate Wikipedia language and a corresponding lexicon
– Translate the English classification using Wikipedia’s interlanguage links

Contributions
• Classifying open content resources using their structural patterns
• Presenting ClassTag, a system for classifying tags
• ClassTag extends the WordNet lexicon using the structural patterns of Wikipedia

Conclusion
• Tuneable system for classifying Wikipedia pages
– ClassTag: Nearly 40% of articles classified with a precision of 72%
– ClassTag+: 21% of articles classified with a precision of 86% (equal to assessor agreement)
• Nearly 70% of Flickr tags matched to WordNet categories
