Tagging vs. Controlled Vocabulary: Which is More Helpful for Book Search?
Toine Bogers (1) & Vivien Petras (2)
(1) Aalborg University Copenhagen, Denmark
(2) Humboldt-Universität zu Berlin, Germany
iConference 2015, Newport Beach, March 25, 2015

Outline

•  Introduction

•  Methodology

•  Results

•  Follow-up analysis

•  Conclusions & Future Work

Tagging vs. Controlled Vocabulary Indexing

Controlled Vocabulary (CV)
+ Semantic relationships
- Large development costs

Tagging
+ Use of the users’ vocabulary
- No term normalization

Previous studies

•  Mostly analyze the nature of the terms: overlapping / complementary vocabulary
•  Few and conflicting results for retrieval
•  Small samples

Study Objectives

What do tags and controlled vocabularies really bring to the table in a realistic search environment?

1.  Which (combination of) metadata elements can best contribute to retrieval success?

2.  How does the retrieval performance of tags and CVs compare using a large-scale and realistic test collection under carefully controlled circumstances?

Methodology

How do we evaluate retrieval performance?

•  Large collection of documents

•  Realistic information needs (= topics)

•  Relevance judgments (= relevant documents for topics)

Methodology

•  Book collection
   -  Controlled vocabularies in library catalogs
   -  Tags in social cataloging sites

INEX Test Collection of Book Records

•  Bibliographic Metadata (Core)
   -  Author, title, publication year, publisher

•  User-Generated Content (UGC)
   -  Reviews
   -  Tags

•  Controlled Vocabulary Content (CV)
   -  Amazon: DDC class labels, Amazon subjects, Amazon geographic names, Amazon category labels
   -  Two library catalog sources, each with: DDC class labels, LCSH topical terms, geographic names, personal names, chronological terms, genre/form terms
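One way to picture the field groups above is as a record whose groups are merged into a single searchable text per experimental condition (for example "Core + Tags"). The sketch below is illustrative only: the class `BookRecord`, the function `index_text`, and the sample values are assumptions made for this example, not the INEX schema or the authors' system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BookRecord:
    # Hypothetical container; group names mirror the slide, not the real schema.
    core: List[str] = field(default_factory=list)     # author, title, year, publisher
    reviews: List[str] = field(default_factory=list)  # user-generated content
    tags: List[str] = field(default_factory=list)     # user-generated content
    cv: List[str] = field(default_factory=list)       # controlled vocabulary terms


def index_text(record: BookRecord, groups: List[str]) -> str:
    """Concatenate the chosen field groups into one searchable string."""
    parts: List[str] = []
    for group in groups:
        parts.extend(getattr(record, group))
    return " ".join(parts)


# Made-up record, used only to show how conditions differ.
book = BookRecord(
    core=["Example Author", "An Example Title"],
    tags=["fantasy", "dragons"],
    cv=["Fantasy fiction"],
)
core_plus_tags = index_text(book, ["core", "tags"])
core_plus_cv = index_text(book, ["core", "cv"])
```

Each experimental condition then corresponds to a different `groups` list passed to the same function, so the retrieval system itself stays unchanged between runs.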

Methodology

•  Book search information needs
   -  LibraryThing fora

[Figure: annotated LibraryThing (LT) topic, with the group name, topic title, and narrative marked]

Methodology

•  Book search relevance judgements
   -  LibraryThing fora

[Figure: annotated LT topic, with the group name, topic title, narrative, and recommended books marked]

Experimental setup

•  INEX test collection of book records
   -  Any-CV = 2 million records
   -  Each-CV = 350,000 records

•  LibraryThing forum topics
   -  Query and Narrative representations
   -  640 different topics, split in half for training the IR system and for testing

•  Relevance judgements: recommendations from LT members
   -  With graded relevance scoring (highest relevance if the book is added by the searcher)

•  Evaluation metric: Normalized Discounted Cumulated Gain (NDCG@10)
   -  Evaluated for the first 10 results of search output
   -  Scores range between 0.0 and 1.0
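To make the metric concrete, here is a minimal sketch of NDCG@10. Several variants of the measure exist; this one uses linear gains with a log2(rank + 1) discount, which is a common formulation and not necessarily the exact one used in the study. The function names and the relevance grades in the usage lines are assumptions for illustration.

```python
import math


def dcg_at_k(gains, k=10):
    """Discounted cumulated gain over the first k ranks (rank 1 is undiscounted)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_gains, all_gains, k=10):
    """Normalize DCG@k by the DCG@k of an ideal ordering of all judged gains.

    ranked_gains: graded relevance of the returned books, in rank order.
    all_gains: graded relevance of every judged book for the topic.
    """
    ideal = dcg_at_k(sorted(all_gains, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant books judged for this topic
    return dcg_at_k(ranked_gains, k) / ideal


# Hypothetical grades: 4 for a book the searcher added, 1 for other recommendations.
judged = [4, 1, 1]
perfect = ndcg_at_k([4, 1, 1], judged)
swapped = ndcg_at_k([1, 4, 1], judged)
nothing = ndcg_at_k([0] * 10, judged)
```

A ranking that returns the judged books in ideal order scores 1.0, one that returns none of them in the top 10 scores 0.0, and placing the highest-graded book lower in the list gives a value in between.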

Comparing controlled vocabulary sources

•  Question
   -  Which of the three sources of controlled vocabulary provides the best performance?

•  Answer
   -  No significant differences in performance for the different providers or their combination
   -  Amazon is not better or worse than British Library or Library of Congress
   -  Subsequent experiments combine all CV sources

Comparing metadata elements

•  Questions
   -  Which of the metadata elements provides the best stand-alone performance?
   -  Which metadata element performs better: tags or CV?
   -  Which combination of metadata elements provides the best performance?

[Figure: Results of (combinations of) element sets per topic representation. Bars for Core, Controlled vocabulary, Reviews, Tags, User-generated content, Core + Controlled vocabulary, Core + Reviews, Core + Tags, Core + User-generated content, and All fields, shown for the Query and Narrative representations; value axis from 0.00 to 0.13.]

Comparing metadata elements

•  Answers
   -  Reviews are by far the best-performing metadata element
      ‣  Significantly so compared to all other individual metadata elements
   -  Combining metadata elements nearly always outperforms individual elements
      ‣  All metadata elements combined provide the best overall performance
   -  Slight advantage of tags over CV (but not a significant one)

Follow-up analysis (1)

•  Question
   -  What is the nature of the difference between tags and CV: do they complement each other or overlap?

[Figure: Per-topic differences (Tags vs. controlled vocabularies). Two panels plot Δ NDCG@10 per topic on an axis from -1.0 to 1.0, with regions labeled "Tags > CV" and "CV > tags".]

•  Answer
   -  Tags and CVs outperform each other on different topic sets, offering complementary performance
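A per-topic comparison of this kind can be sketched as follows: subtract the CV run's score from the tags run's score for each topic, then count how often each side wins. The function names and the scores are made up for illustration and are not results from the study.

```python
def per_topic_deltas(tag_scores, cv_scores):
    """Return {topic_id: tags score minus CV score} for topics present in both runs."""
    shared = tag_scores.keys() & cv_scores.keys()
    return {topic: tag_scores[topic] - cv_scores[topic] for topic in shared}


def summarize(deltas):
    """Count topics where tags win, CV wins, or both runs tie."""
    tags_win = sum(1 for d in deltas.values() if d > 0)
    cv_win = sum(1 for d in deltas.values() if d < 0)
    ties = len(deltas) - tags_win - cv_win
    return {"tags_win": tags_win, "cv_win": cv_win, "ties": ties}


# Invented NDCG@10 values for three hypothetical topics.
tag_scores = {"t1": 0.50, "t2": 0.00, "t3": 0.25}
cv_scores = {"t1": 0.00, "t2": 0.75, "t3": 0.25}

deltas = per_topic_deltas(tag_scores, cv_scores)
summary = summarize(deltas)
```

Two runs with similar averages can still win on different topics, which is the pattern described as complementary performance: the averages hide it, while the per-topic deltas expose it.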

Follow-up analysis (2)

•  Question
   -  Does the book type influence the relative performance of tags vs. CV?
      ‣  Fiction
      ‣  Non-fiction

Fiction vs. non-fiction

[Figure: bar chart comparing Controlled vocabulary and Tags for Fiction and Non-fiction requests, shown for the Query and Narrative representations; value axis from 0.00 to 0.07.]

•  Answer
   -  The advantage of tags over CV terms is most pronounced for fiction book requests (but never significantly so)
   -  Retrieving relevant non-fiction books is easier than retrieving relevant fiction books

Follow-up analysis (3)

•  Question
   -  Does the type of information need influence the relative performance of tags vs. CV?
      ‣  Search
      ‣  Recommendation
      ‣  Search + Recommendation
      ‣  Known-item

Follow-up analysis (3)

•  Answers
   -  Tags are better at satisfying known-item needs and mixes of search & recommendation aspects
   -  CV is better for pure recommendation needs
   -  Differences are indicative, but not significant

[Figure: two panels (Query, Narrative) comparing Controlled vocabulary and Tags across the request types Search, Search + Recommendation, Recommendation, and Known-item; value axis from 0.00 to 0.10.]

Conclusions

•  Tags have a slight (but not significant) advantage over CV

•  Tags and CV provide largely complementary performance

•  Future work
   -  Detailed analysis of precision/recall effect of tags vs. CV
      ‣  CV contains more unique terms, tags more repetition of terms
      ‣  Possible consequence: CVs boost recall, tags boost precision
   -  Detailed analysis of which types of tags/CV match relevant documents
   -  More detailed analysis of request types and their relation to tag/CV performance

Questions? Comments? Suggestions?