An Integrated Approach to Discover Tag Semantics SAC 2011, Web Technologies Track, March 24 th 2011 Antonina Dattolo University of Udine Department of Mathematics and Computer Science [email protected] Davide Eynard USI - University of Lugano ITC - Institute for Communication Technologies [email protected] Luca Mazzola USI - University of Lugano ITC - Institute for Communication Technologies [email protected]

An integrated approach to discover tag semantics

DESCRIPTION

Talk presentation at SAC 2011. From the paper abstract: "Tag-based systems have become very common for online classification thanks to their intrinsic advantages such as self-organization and rapid evolution. However, they are still affected by some issues that limit their utility, mainly due to the inherent ambiguity in the semantics of tags. Synonyms, homonyms, and polysemous words, while not harmful for the casual user, strongly affect the quality of search results and the performances of tag-based recommendation systems. In this paper we rely on the concept of tag relatedness in order to study small groups of similar tags and detect relationships between them. This approach is grounded on a model that builds upon an edge-colored multigraph of users, tags, and resources. To put our thoughts in practice, we present a modular and extensible framework of analysis for discovering synonyms, homonyms and hierarchical relationships amongst sets of tags. Some initial results of its application to the delicious database are presented, showing that such an approach could be useful to solve some of the well known problems of folksonomies."

Page 1: An integrated approach to discover tag semantics

An Integrated Approach to Discover Tag Semantics

SAC 2011, Web Technologies Track, March 24th 2011

Antonina Dattolo
University of Udine
Department of Mathematics and Computer Science
[email protected]

Davide Eynard
USI - University of Lugano
ITC - Institute for Communication Technologies
[email protected]

Luca Mazzola
USI - University of Lugano
ITC - Institute for Communication Technologies
[email protected]

Page 2: An integrated approach to discover tag semantics

24/03/2011 An integrated approach to discover tag semantics 2/27

Talk outline

Properties of tags

Folksonomies as edge-colored multigraphs

Framework design and implementation

Tests and evaluations

Conclusions

Page 3: An integrated approach to discover tag semantics

Properties of tags

Tags:
  are democratic and bottom-up (vs. hierarchical)
  are inclusive and current
  follow desire lines
  are easy to use

Page 4: An integrated approach to discover tag semantics

Cons of tags

Lexical ambiguities:
  Synonyms
    game and juego, or web2.0 and web_2
  Homonyms
    check as in chess and as in "to check" (polysemous)
    sf as scifi or san_francisco
  Basic level variations
    dog and poodle

Ambiguities due to different purposes:
  blog used to tag blog software (e.g. WordPress), a blog service, a blog post, something to blog later, ...

Page 5: An integrated approach to discover tag semantics

Advantages of disambiguation

Synonym detection:
  increases recall
  allows for better recommendation systems

Homonym detection:
  makes it possible to find different contexts of use
  increases precision

Basic level variations detection:
  identifies a hierarchy
  increases recall (e.g. by automatically searching for subclasses)
  provides a means to browse search results

Page 6: An integrated approach to discover tag semantics

Approaches to tag disambiguation

Roughly two main families of approaches
  Theoretical ones, aiming at describing the system as a whole
  More practical, ad-hoc ones (often addressing one or a few issues at a time)

Our approach
  Main assumption: lexical ambiguities are not independent from each other
  Solution based on
    a theoretical framework
    a modular, extensible analysis tool

Page 7: An integrated approach to discover tag semantics

Folksonomies as edge-colored multigraphs

Def. 1: An edge-colored multigraph is a triple

ECMG = (MG, C, c)

where:
  MG = (V, E, f) is a multigraph
  C is a set of colors
  c : E → C is an assignment of colors to multigraph edges

Def. 2: A personomy related to user u is a non-directed edge-colored graph of color C_u:

P_u = (T, R, E, C_u)
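The two definitions can be pictured with a small data structure. The following is only a minimal sketch, not the authors' implementation: the `TagAssignment` type, the `personomy` helper and the toy data are hypothetical names introduced for illustration. Each edge carries its color (the user), and a personomy is the set of edges of one color.

```python
from typing import NamedTuple

class TagAssignment(NamedTuple):
    """One colored edge: `user` (the color) links `tag` to `resource`."""
    tag: str
    resource: str
    user: str

def personomy(edges, user):
    """Edges of a single color, i.e. the tag assignments made by `user`."""
    return [e for e in edges if e.user == user]

# Toy data (hypothetical): two users tag the same resource with "sf",
# which yields two parallel edges of different colors.
edges = [
    TagAssignment("sf", "r1", "alice"),
    TagAssignment("scifi", "r1", "alice"),
    TagAssignment("sf", "r1", "bob"),
    TagAssignment("sf", "r2", "bob"),
]

tags = {e.tag for e in edges}            # T
resources = {e.resource for e in edges}  # R
colors = {e.user for e in edges}         # C, one color per user
```

Keeping the user on every edge is what makes this a multigraph: the pair (sf, r1) appears once per user who made that assignment.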

Page 8: An integrated approach to discover tag semantics

Folksonomies as edge-colored multigraphs

Def. 3: Given a set of users U and the family of personomies P_u (u ∈ U), a folksonomy is defined as

[formula not recovered from the slide]

that is, an edge-colored multigraph where:
  vertices are tags and resources
  edges are tag assignments made on resources by each user
  every color is a different user

Page 9: An integrated approach to discover tag semantics

First simplification step

As we are only interested in relationships between tags, we perform two simplification steps on the edge-colored multigraph.

Step 1: colored edges are collapsed and substituted by weighted edges
  potentially, every color (user) might be assigned a different weight w_u
  the weight w of the collapsed edge is the sum of all the w_u linking the same two vertices
  when w_u = 1 for each user, w = number of times a tag is used on a resource
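Step 1 could be sketched as follows. This is an illustrative sketch under assumptions, not the prototype's code: tag assignments are taken to be `(user, tag, resource)` tuples, and `collapse_colors` and `user_weight` are hypothetical names. Users without an explicit weight default to 1, which reproduces the plain usage count.

```python
from collections import defaultdict

def collapse_colors(assignments, user_weight=None):
    """Collapse parallel colored edges into one weighted tag-resource edge.

    assignments: iterable of (user, tag, resource) tuples.
    user_weight: optional {user: w_u}; missing users get weight 1.0.
    Returns {(tag, resource): w}, where w is the sum of the w_u.
    """
    user_weight = user_weight or {}
    weights = defaultdict(float)
    for user, tag, resource in assignments:
        weights[(tag, resource)] += user_weight.get(user, 1.0)
    return dict(weights)

assignments = [
    ("alice", "sf", "r1"),
    ("bob", "sf", "r1"),
    ("bob", "sf", "r2"),
]
counts = collapse_colors(assignments)                   # w_u = 1 for everyone
damped = collapse_colors(assignments, {"bob": 0.5})     # bob counts half
```

With unit weights, `counts[("sf", "r1")]` is 2.0 because two users applied the tag to that resource; lowering one user's weight changes the sum accordingly.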

Page 10: An integrated approach to discover tag semantics

Second simplification step

Step 2: a link is created between t_a and t_b if they share a resource
  resource nodes are dropped

Edges' weights can be calculated in different ways:
  number of triples (t_i, r, t_j) where (t_i, r), (r, t_j) ∈ E => co-occurrence
  normalized co-occurrence (e.g. using the Jaccard index)
  distributional measures
  custom metrics (e.g. sum of products of connecting edges' weights)
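Three of the weighting options listed above can be sketched in a few lines. This is a hedged illustration, not the framework's code: the function names are hypothetical, the Jaccard index is taken over the sets of resources each tag is attached to, and `product_weight` is one possible reading of "sum of products of connecting edges' weights".

```python
from collections import defaultdict
from itertools import combinations

def tag_resources(tag_resource_edges):
    """Map each tag to the set of resources it is attached to."""
    res = defaultdict(set)
    for tag, resource in tag_resource_edges:
        res[tag].add(resource)
    return res

def cooccurrence(res):
    """Number of shared resources for every pair of tags sharing at least one."""
    co = {}
    for a, b in combinations(sorted(res), 2):
        shared = len(res[a] & res[b])
        if shared:
            co[(a, b)] = shared
    return co

def jaccard(res, a, b):
    """Normalized co-occurrence: shared resources over all resources of a or b."""
    union = res[a] | res[b]
    return len(res[a] & res[b]) / len(union) if union else 0.0

def product_weight(weights, a, b):
    """Sum over shared resources r of w(a, r) * w(b, r)."""
    res_a = {r: w for (t, r), w in weights.items() if t == a}
    return sum(w * res_a[r] for (t, r), w in weights.items()
               if t == b and r in res_a)

edges = [("sf", "r1"), ("scifi", "r1"), ("sf", "r2"),
         ("california", "r2"), ("scifi", "r3")]
res = tag_resources(edges)
co = cooccurrence(res)
```

Here `sf` and `scifi` share one resource out of three, so their Jaccard index is 1/3, while `california` and `scifi` share nothing and get no link at all.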

Page 11: An integrated approach to discover tag semantics

The whole process at a glance

[Figure: the process shown in four numbered panels, 1 to 4]

Page 12: An integrated approach to discover tag semantics

System architecture

Basic assumption:
  ambiguous tags should be related (either by co-occurrence or by presence in the same context)

Three main components:
  tag analysis tool
  disambiguation tool
  front-end

Page 13: An integrated approach to discover tag semantics

Synonyms detection / 1

Natural text …
  Two words are considered synonyms if they can replace each other without affecting the meaning of a sentence

… vs. tag-based systems
  It is possible to swap two tags within a "sentence" (i.e. a tagging action) without affecting its meaning when we have:
    variations of a word (e.g. blog, blogs, blogging)
    translations into other languages (e.g. game, juego, spiel)
    terms joined by non-alphabetic characters (e.g. web2, web_2)

No "one size fits all" solution

Page 14: An integrated approach to discover tag semantics

Synonyms detection / 2

A modular solution for synonyms detection:
  different heuristics, each one returning the likelihood of tags to be synonyms
  results are weighted to obtain an overall likelihood

Suggested heuristics:
  an edit distance such as Levenshtein's (normalized to account for short strings);
  synonym search in WordNet (good precision, low recall);
  online translation bases (top-down, such as dictionaries, or bottom-up, collaboratively grown vocabularies like Wikipedia);
  stemming with NLP algorithms
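The modular scheme above might look like the sketch below. It is only an illustration under assumptions: the normalization (dividing by the longer string's length), the weighted mean, and the helper names `edit_similarity`, `same_alphanumerics` and `synonym_likelihood` are choices made here, not details given in the talk. WordNet, translation and stemming heuristics would plug in as further `(function, weight)` pairs.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def edit_similarity(a, b):
    """Edit distance normalized by the longer string, mapped to [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 - levenshtein(a, b) / longest if longest else 1.0

def same_alphanumerics(a, b):
    """1.0 if the tags are equal once non-alphanumeric characters are dropped."""
    strip = lambda s: "".join(ch for ch in s.lower() if ch.isalnum())
    return 1.0 if strip(a) == strip(b) else 0.0

def synonym_likelihood(a, b, heuristics):
    """Weighted mean of heuristic scores; heuristics is [(function, weight)]."""
    total = sum(weight for _, weight in heuristics)
    return sum(weight * h(a, b) for h, weight in heuristics) / total

heuristics = [(edit_similarity, 1.0), (same_alphanumerics, 1.0)]
score = synonym_likelihood("web2", "web_2", heuristics)
```

Normalizing the distance matters for short tags: one edit between `web2` and `web_2` is a small change, whereas one edit between two three-letter tags can change the word entirely.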

Page 15: An integrated approach to discover tag semantics

Homonyms detection

Check if the tag t has been used in different contexts
  cluster tags related to t in groups
  the most frequent tags in these groups are used to name and disambiguate the contexts

Clustering algorithm:
  an overlapping one, also used in social network analysis*
  a cluster is a subgraph G identified by the maximization of a fitness property [formula not recovered from the slide], where:
    s = strength of internal (in) or external (out) links
    α = tweaking parameter

* A. Lancichinetti et al.: "Detecting the overlapping and hierarchical community structure of complex networks"
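The fitness in the cited paper by Lancichinetti et al. has the form f_G = s_in / (s_in + s_out)^α, and the symbols on the slide match it. The sketch below evaluates that expression for a candidate cluster on a weighted, undirected tag graph. It assumes the weighted variant (strengths instead of degrees) and, following the paper's convention for internal degree, counts each internal edge from both endpoints; the function name and edge representation are hypothetical.

```python
def fitness(weights, cluster, alpha=1.0):
    """Fitness of `cluster` on an undirected weighted graph.

    weights: {(a, b): w}, one entry per undirected edge.
    cluster: set of nodes forming the candidate subgraph G.
    """
    s_in = s_out = 0.0
    for (a, b), w in weights.items():
        inside = (a in cluster) + (b in cluster)
        if inside == 2:
            s_in += 2 * w   # internal edge, seen from both endpoints
        elif inside == 1:
            s_out += w      # edge crossing the cluster boundary
    total = s_in + s_out
    return s_in / total ** alpha if total else 0.0

# Toy path graph a - b - c - d with unit weights (hypothetical data).
path = {("a", "b"): 1.0, ("b", "c"): 1.0, ("c", "d"): 1.0}
f1 = fitness(path, {"a", "b"}, alpha=1.0)
f2 = fitness(path, {"a", "b"}, alpha=2.0)
```

Larger values of α penalize the total strength more heavily, which tends to favor smaller clusters; this is why a single α may not suit graphs with different connectivity.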

Page 16: An integrated approach to discover tag semantics

Hierarchy detection

Hierarchy is a specific case of basic level variation

A possible approach: Hearst patterns on the Web, such as:
  C_1 (and|or) other C_2 (e.g. "poodles and other dogs")
  C_1 such as I (e.g. "cities such as San Francisco")
  (note: C_i are concepts, I is a concept instance)

Search for the patterns, and use the number of results as an indicator of their strength

Pros: the Web is as up-to-date as folksonomies

Cons: O(n^2) complexity, not really scalable
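A rough sketch of the pattern search, showing where the quadratic cost comes from. Everything here is an assumption made for illustration: `hit_count` stands for any caller-supplied function returning a result count for a quoted query (no real search API is implied), and the naive "add an s" pluralization is a placeholder.

```python
from itertools import permutations

def hearst_queries(narrower, broader):
    """Quoted queries suggesting that `narrower` is a kind of `broader`."""
    return [
        f'"{narrower}s and other {broader}s"',
        f'"{narrower}s or other {broader}s"',
        f'"{broader}s such as {narrower}"',
    ]

def hierarchy_scores(tags, hit_count):
    """Score every ordered pair of tags: n * (n - 1) pairs, hence O(n^2)."""
    scores = {}
    for narrower, broader in permutations(tags, 2):
        queries = hearst_queries(narrower, broader)
        scores[(narrower, broader)] = sum(hit_count(q) for q in queries)
    return scores

# Stand-in for a search engine: fixed, made-up counts for two queries.
fake_counts = {
    '"poodles and other dogs"': 120,
    '"dogs such as poodle"': 30,
}
scores = hierarchy_scores(["poodle", "dog", "cat"],
                          lambda q: fake_counts.get(q, 0))
```

The asymmetry of the scores is what suggests a direction for the hierarchy: a high score for (poodle, dog) and a low one for (dog, poodle) hints that poodle is the narrower concept.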

Page 17: An integrated approach to discover tag semantics

Prototype development

Prototype
  Tag analysis tool, calculating CO (co-occurrence), NCO (normalized co-occurrence), and TCS (Tag Context Similarity); it takes time, runs as a batch job and saves matrices in the DB
  Disambiguation with homonyms plugin, implementing the overlapping clustering algorithm, and Wikipedia synonym discovery
  Front-end is currently a command-line application

Dataset
  Data from more than 30K users of http://www.delicious.com
  Ignored the system:unfiled tag
  For the calculation of Tag Context Similarity, we only took into account the top 10K tags

Page 18: An integrated approach to discover tag semantics

Experimental results / 1

System tested against three different sets of tags:
  Top 20 tags in delicious
  A group of tags known to be ambiguous (apple, cambridge, sf, stream, turkey, tube)
  A set of subjective tags, chosen among the most popular ones in delicious (cool, fun, funny, interesting, toread)

For each tag:
  we calculated the top n (with n = 50) related tags with the three metrics (CO, NCO, TCS)
  we performed synonym and homonym analyses
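Selecting the top n related tags for one metric could be done as in the sketch below. It assumes the relatedness matrix is available as a nested dictionary `{tag: {other_tag: score}}`, which is a hypothetical layout and not necessarily how the prototype stores its matrices; the scores are made up.

```python
import heapq

def top_related(relatedness, tag, n=50):
    """Return the n highest-scoring (other_tag, score) pairs for `tag`."""
    row = relatedness.get(tag, {})
    return heapq.nlargest(n, row.items(), key=lambda item: item[1])

# Made-up scores for illustration only.
relatedness = {"toread": {"read": 0.9, "to_read": 0.8, "fun": 0.1}}
best = top_related(relatedness, "toread", n=2)
```

Running the same selection once per metric gives three ranked lists per tag, which can then be passed to the synonym and homonym analyses.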

Page 19: An integrated approach to discover tag semantics

Experimental results / 2

Tag Context Similarity already tends to provide synonyms as top-related tags
  e.g. tags related to toread: read, read_later, to_read, etc.

Analyzing a less popular synonym (@readit): 9 out of the top 10 (and 17 out of the top 50) related tags are synonyms
  reason: as less popular tags are less spread across contexts, they tend to have a higher similarity with other less popular synonyms

Wikipedia results:
  analyzing the 31 tags in our three sets, we got 215 new words;
  of those 215, only 83 are valid tags in our delicious dataset;
  of those 83, only 20 belong to the 10K most-used tags;
  only 2 belong to the set of the top-related tags of their English synonym.

Page 20: An integrated approach to discover tag semantics

Experimental results / 3

Homonyms detection:
  we tested the algorithm with different values of α
  meaningful results in a relatively short time (but we are working only on the top related tags...)
  limit: the graphs of top related tags differ in connectivity, so no single value of α is good for all of them (α_sf = 1.4, α_stream = 1.74)

Page 21: An integrated approach to discover tag semantics

Conclusions

Model
  Flexible enough to support other kinds of metrics
  Multigraph can be simplified in other ways
  User-related weights still have to be taken into account

Tool
  Still in a prototype phase, but it already provided useful results and allowed us to compare:
    metrics: different metrics provide very different results, which might be more or less useful according to the user's needs
    tag behaviors: different depending on their popularity and the use that people make of them

Page 22: An integrated approach to discover tag semantics

Conclusions

Ongoing work
  Clustering evaluation metrics to find the best α
  Applications (e.g. for tag grouping and visualization*)
  User- and resource-specific projections**

Future work
  Development of other plugins and front-end
  Play with user-related weights to focus on specific communities / filter spam

* Mazzola, Eynard, Mazza: "GVIS: a framework for graphical mashups of heterogeneous sources to support data interpretation".

** Dattolo, Ferrara, Tasso: "On social semantic relations for recommending tags and resources using folksonomies"

Page 23: An integrated approach to discover tag semantics

Thank you!

Thanks for your attention!

Questions?

Page 24: An integrated approach to discover tag semantics

toread top 20 related tags

Page 25: An integrated approach to discover tag semantics

@readit top 20 related tags

Page 26: An integrated approach to discover tag semantics

sf top 20 related tags

Page 27: An integrated approach to discover tag semantics

stream top 20 related tags