Content-based Clustering for Tag Cloud Visualization
ASONAM 2009

Arkaitz Zubiaga, Alberto P. García-Plaza, Víctor Fresno, Raquel Martínez

NLP & IR Group @ UNED

July 21st, 2009

Introduction

Index

1 Introduction

2 Dataset Generation

3 Our Method

4 Results

5 Conclusions

6 Future Work

NLP Group (UNED) Content-based Tag Clustering July 21st, 2009 2 / 25

Simple Tagging

Collaborative Tagging

Tag Cloud

No organization.

No relations between tags.

Our Work

Find relations between tags to organize them:

To ease visualization and search.

To ease subscribing to a group of related tags.

Previous works rely on tag co-occurrence to find relations.

What about considering web documents’ content?

Dataset Generation

Starting point: 140 most popular tags on Delicious (T140, tag cloud).

Tag monitoring: ∼6,000 documents/tag (∼840,000 documents, HTML and PDF).

Data retrieval:

Tag data for each document.

Document content.

Filtering: documents written in English with tag data available.

Result: 144,574 documents (unbalanced).

Our Method

Representation

Most relevant tags for each document: those reaching at least 40.7% of the top tag.

Merge documents pertaining to each T140 tag.

Stopword removal.

Stemming.

TF-IDF representation (dimensionality reduced by document frequency, DF).

1 vector/tag.
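
The representation steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the 40.7% threshold applies to per-document tag counts, uses a tiny hypothetical stopword list, skips stemming, computes DF over the merged per-tag texts, and prunes with a hypothetical `min_df` cut-off. The names `relevant_tags` and `tfidf_vectors` are illustrative only.

```python
import math
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stopword list (assumption for the sketch).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "with"}


def relevant_tags(tag_counts, ratio=0.407):
    """Keep tags whose count reaches `ratio` times the top tag's count.

    Reading the threshold as a ratio of annotation counts is an assumption.
    """
    if not tag_counts:
        return set()
    top = max(tag_counts.values())
    return {tag for tag, n in tag_counts.items() if n >= ratio * top}


def tfidf_vectors(docs, min_df=1):
    """Build one TF-IDF vector per tag.

    docs: iterable of (text, tag_counts) pairs.
    Returns {tag: {term: weight}}.
    """
    # Merge the text of every document into the bag of each relevant tag.
    merged = defaultdict(Counter)
    for text, tag_counts in docs:
        terms = [t for t in text.lower().split() if t not in STOPWORDS]
        for tag in relevant_tags(tag_counts):
            merged[tag].update(terms)

    # Document frequency over merged tag texts (assumed unit for DF).
    df = Counter()
    for counts in merged.values():
        df.update(counts.keys())

    n_tags = len(merged)
    vectors = {}
    for tag, counts in merged.items():
        vectors[tag] = {
            term: tf * math.log(n_tags / df[term])
            for term, tf in counts.items()
            if df[term] >= min_df  # DF-based pruning; cut-off is hypothetical
        }
    return vectors
```

A term that occurs in every merged tag text gets weight zero here, since its IDF is log(1).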

Clustering (SOM)

Clustering Settings

12x12 map: 144 neurons.

Vectors with 17,518 dimensions.

Learning rate: 0.1.

Neighborhood: 12.

Iterations: 50,000.
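
To make these settings concrete, here is a minimal pure-Python SOM training loop whose defaults mirror the values above. It is a sketch, not the setup used for the talk: the slide does not state the distance measure, neighborhood function, or decay schedule, so Euclidean distance, a Gaussian neighborhood, and linear decay of both learning rate and radius are assumptions. The function names are illustrative.

```python
import math
import random


def best_matching_unit(grid, x):
    """Return the grid position whose weight vector is closest to x."""
    return min(
        grid,
        key=lambda pos: sum((a - b) ** 2 for a, b in zip(grid[pos], x)),
    )


def train_som(vectors, rows=12, cols=12, learning_rate=0.1,
              radius=12.0, iterations=50_000, seed=0):
    """Train a SOM and return its weights as {(row, col): [w_1, ..., w_d]}.

    Assumed (not given on the slide): random initialization in [0, 1),
    Gaussian neighborhood, linear decay of learning rate and radius.
    """
    rng = random.Random(seed)
    dim = len(vectors[0])
    grid = {(r, c): [rng.random() for _ in range(dim)]
            for r in range(rows) for c in range(cols)}

    for step in range(iterations):
        decay = 1.0 - step / iterations
        lr = learning_rate * decay
        sigma = max(radius * decay, 0.5)  # keep the neighborhood from vanishing

        x = rng.choice(vectors)           # pick one input vector at random
        br, bc = best_matching_unit(grid, x)

        # Pull every neuron toward x, weighted by its grid distance to the BMU.
        for (r, c), w in grid.items():
            grid_dist2 = (r - br) ** 2 + (c - bc) ** 2
            influence = math.exp(-grid_dist2 / (2 * sigma ** 2))
            for i in range(dim):
                w[i] += lr * influence * (x[i] - w[i])
    return grid
```

After training, each tag vector would be assigned to the neuron returned by `best_matching_unit`. With 17,518 dimensions and 50,000 iterations this loop would be slow in pure Python, so a vectorized implementation is likely preferable in practice.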

Terminology Extraction

Merge all the documents in each neuron.

Terminology extraction for each neuron.

Representative of the neuron, but not of the rest.

Language models (KLD, Kullback-Leibler divergence).

Result: Representative terms for each neuron.
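
One way to read the KLD step is to score each term by its contribution to the divergence between the neuron's language model and that of the remaining neurons, then keep the highest-scoring terms. The sketch below follows that reading. The add-one smoothing, the `top_k` cut-off, and the function name `kld_terms` are assumptions, not details from the talk.

```python
import math
from collections import Counter


def kld_terms(neuron_terms, rest_terms, top_k=5):
    """Rank terms by their pointwise contribution to KL(neuron || rest).

    neuron_terms: tokens from the documents merged into one neuron.
    rest_terms:   tokens from the documents of all other neurons.
    Add-one smoothing over the joint vocabulary is an assumption.
    """
    p_counts, q_counts = Counter(neuron_terms), Counter(rest_terms)
    vocab = set(p_counts) | set(q_counts)
    p_total = len(neuron_terms) + len(vocab)
    q_total = len(rest_terms) + len(vocab)

    scores = {}
    for term in vocab:
        p = (p_counts[term] + 1) / p_total  # P(term | neuron)
        q = (q_counts[term] + 1) / q_total  # P(term | rest)
        scores[term] = p * math.log(p / q)  # high if frequent here, rare elsewhere
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Terms that are common in the neuron but also common elsewhere get a low or negative score, which matches the "representative of the neuron, but not of the rest" criterion.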

Results

Full map available at: http://nlp.uned.es/social-tagging/

Results: Computer Science

Results: Design

Results: Cooking

Results: Coherence

Results: Terminology

Conclusions

We analyzed tag clustering and terminology extraction relying on documents’ content.

We collected the DeliciousT140 dataset.

Unlike previous works, we considered documents’ content.

The resulting map shows encouraging results, exhibiting the potential of collaborative tagging systems.

It could allow community discovery.

It eases tag cloud visualization and improves navigation and subscription.

Future Work

To compare our content-based approach to those based on tag co-occurrence.

To carry out a quantitative evaluation.

To semantically analyze tags (polysemy, synonymy, ...).

To extend the work to multilingual tag sets.

Thank You for Your Attention

Achiu Arigato Danke Dhannvaad Dua Netjer en ek Efcharisto Gracias Gracies Gratia Grazie Guishepeli Hvala Kiitos Koszonom Merce Merci Mila esker Obrigado Shukran Shukriya Tack Tak Takk Tanan Tapadh leat Tesekkur ederim Thank you Toda
