
An Integrated Approach to Discover Tag Semantics

SAC 2011, Web Technologies Track, March 24th 2011

Antonina Dattolo
University of Udine
Department of Mathematics and Computer Science

[email protected]

Davide Eynard
USI - University of Lugano
ITC - Institute for Communication Technologies

[email protected]

Luca Mazzola
USI - University of Lugano
ITC - Institute for Communication Technologies

[email protected]

24/03/2011 An integrated approach to discover tag semantics 2/27

Talk outline

Properties of tags

Folksonomies as edge-colored multigraphs

Framework design and implementation

Tests and evaluations

Conclusions


Tags properties

Tags:
- are democratic and bottom-up (vs. hierarchical)
- are inclusive and current
- follow desire lines
- are easy to use


Tags cons

Lexical ambiguities:
- Synonyms: game and juego, or web2.0 and web_2
- Homonyms: check as in chess and in “to check” (polysemous); sf as scifi or san_francisco
- Basic level variations: dog and poodle

Ambiguities due to different purposes:
- blog used to tag blog software (e.g. Wordpress), a blog service, a blog post, something to blog later, ...


Advantages of disambiguation

Synonym detection:
- increases recall
- allows for better recommendation systems

Homonym detection:
- makes it possible to find different contexts of use
- increases precision

Basic level variations detection:
- identifies a hierarchy
- increases recall (e.g. by automatically searching for subclasses)
- provides a means to browse search results


Approaches to tag disambiguation

Roughly two main families of approaches:
- theoretical ones, aiming at describing the system as a whole
- more practical, ad-hoc ones (often addressing one or a few issues at a time)

Our approach:
- main assumption: lexical ambiguities are not independent from each other
- solution based on a theoretical framework and a modular, extensible analysis tool



Def.1: An edge-colored multigraph is a triple

$ECMG = (MG, C, c)$

where:
- $MG = (V, E, f)$ is a multigraph
- $C$ is a set of colors
- $c : E \to C$ is an assignment of colors to multigraph edges

Def.2: A personomy related to user $u$ is a non-directed edge-colored graph of color $C_u$:

$P_u = (T, R, E, C_u)$



Def.3: Given a set of users $U$ and the family of personomies $P_u$ ($u \in U$), a folksonomy is defined as

[formula missing in the source]

that is, an edge-colored multigraph where:
- vertices are tags and resources
- edges are tag assignments made on resources by each user
- every color is a different user



First simplification step

As we are only interested in relationships between tags, we perform two simplification steps on the edge-colored multigraph.

Step 1: colored edges are collapsed and replaced by weighted edges
- potentially, every color (user) might be assigned a different weight $w_u$
- the weight $w$ of the collapsed edge is the sum of all the $w_u$ linking the same two vertices
- when $w_u = 1$ for each user, $w$ is the number of times a tag is used on a resource
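Step 1 can be illustrated with a minimal sketch. It assumes tag assignments are available as (user, tag, resource) triples; the function name `collapse_colored_edges` and the sample data are hypothetical and not taken from the prototype.

```python
from collections import defaultdict

def collapse_colored_edges(assignments, user_weights=None):
    """Collapse per-user (colored) tag-resource edges into weighted edges.

    assignments: iterable of (user, tag, resource) triples.
    user_weights: optional dict user -> w_u; users not listed get w_u = 1.
    Returns a dict mapping (tag, resource) to the summed weight w.
    """
    user_weights = user_weights or {}
    weights = defaultdict(float)
    # A set drops duplicate triples, so each user contributes at most
    # one colored edge per (tag, resource) pair.
    for user, tag, resource in set(assignments):
        weights[(tag, resource)] += user_weights.get(user, 1.0)
    return dict(weights)

# Hypothetical sample data
assignments = [
    ("alice", "python", "r1"),
    ("bob", "python", "r1"),
    ("bob", "code", "r1"),
    ("carol", "python", "r2"),
]
edges = collapse_colored_edges(assignments)
weighted = collapse_colored_edges(assignments, {"bob": 0.5})
```

With all weights equal to 1, `edges[("python", "r1")]` is simply the number of users who tagged `r1` with `python`; passing a `user_weights` dict shows how a per-user weight would change the collapsed value.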


Second simplification step

Step 2: a link is created between $t_a$ and $t_b$ if they share a resource
- resource nodes are dropped

Edge weights can be calculated in different ways:
- number of triples $(t_i, r, t_j)$ where $(t_i, r), (r, t_j) \in E$ (co-occurrence)
- normalized co-occurrence (e.g. using the Jaccard index)
- distributional measures
- custom metrics (e.g. sum of products of connecting edges' weights)
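The first two weighting options of Step 2 can be sketched as follows. This is only an illustration under the assumption that the tag-resource edges are unweighted pairs; the function names and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def tag_cooccurrence(tag_resource_edges):
    """Project tag-resource edges onto a tag-tag co-occurrence count.

    Returns (co, resources_by_tag), where co maps an alphabetically
    ordered tag pair to the number of resources the two tags share.
    """
    tags_by_resource = defaultdict(set)
    resources_by_tag = defaultdict(set)
    for tag, resource in tag_resource_edges:
        tags_by_resource[resource].add(tag)
        resources_by_tag[tag].add(resource)
    co = defaultdict(int)
    # Every pair of tags attached to the same resource forms one
    # (t_i, r, t_j) triple; the resource node itself is then dropped.
    for tags in tags_by_resource.values():
        for a, b in combinations(sorted(tags), 2):
            co[(a, b)] += 1
    return dict(co), dict(resources_by_tag)

def jaccard(co, resources_by_tag, a, b):
    """Normalized co-occurrence: shared resources over the union of resources."""
    a, b = sorted((a, b))
    shared = co.get((a, b), 0)
    union = (len(resources_by_tag.get(a, ()))
             + len(resources_by_tag.get(b, ())) - shared)
    return shared / union if union else 0.0

# Hypothetical sample data
pairs = [("web", "r1"), ("design", "r1"), ("web", "r2"),
         ("design", "r2"), ("web", "r3"), ("css", "r3")]
co, by_tag = tag_cooccurrence(pairs)
score = jaccard(co, by_tag, "web", "design")
```

In this sample, `web` and `design` share two of the three resources tagged with either of them, which is what the Jaccard score reflects.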


The whole process at a glance

[Figure: four numbered panels (1 to 4) illustrating the process]


System architecture

Basic assumption: ambiguous tags should be related (either by co-occurrence or by presence in the same context)

Three main components:
- tag analysis tool
- disambiguation tool
- front-end


Synonyms detection / 1

Natural text …
- Two words are considered synonyms if they can be replaced by each other without affecting the meaning of a sentence

… vs. tag-based systems
- It is possible to swap two tags within a “sentence” (i.e. a tagging action) without affecting its meaning when we have:
  - variations of a word (e.g. blog, blogs, blogging)
  - translations into other languages (e.g. game, juego, spiel)
  - terms joined by non-alphabetic characters (e.g. web2, web_2)

No “one size fits all” solution


Synonyms detection / 2

A modular solution for synonyms detection:
- different heuristics, each one returning the likelihood that two tags are synonyms
- results are weighted to obtain an overall likelihood

Suggested heuristics:
- an edit distance such as Levenshtein's (normalized to account for short strings)
- synonym search in WordNet (good precision, low recall)
- online translation bases (top-down, such as dictionaries, or bottom-up, such as collaboratively grown vocabularies like Wikipedia)
- stemming with NLP algorithms
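The modular scheme can be sketched with two toy heuristics combined by a weighted average. This is an illustration only: the normalization by the longer string, the second heuristic (`same_alphanumerics`), the weights, and all function names are assumptions, not the prototype's actual plugins.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def edit_similarity(a, b):
    """Edit distance normalized by the longer string, mapped to [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def same_alphanumerics(a, b):
    """Hypothetical heuristic: 1.0 if the tags match once separators are removed."""
    def strip(s):
        return "".join(ch for ch in s.lower() if ch.isalnum())
    return 1.0 if strip(a) == strip(b) else 0.0

def synonym_likelihood(a, b, heuristics):
    """Weighted average of heuristic scores; heuristics is a list of (function, weight)."""
    total = sum(weight for _, weight in heuristics)
    return sum(weight * h(a, b) for h, weight in heuristics) / total

# Hypothetical weights
heuristics = [(edit_similarity, 0.5), (same_alphanumerics, 0.5)]
likelihood = synonym_likelihood("web2", "web_2", heuristics)
```

New heuristics (WordNet lookup, translation, stemming) would plug in as additional `(function, weight)` pairs, provided each returns a score in [0, 1].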


Homonyms detection

Check if the tag t has been used in different contexts:
- cluster tags related to t in groups
- the most frequent tags in these groups are used to name and disambiguate the contexts

Clustering algorithm:
- an overlapping one, also used in social network analysis*
- a cluster is a subgraph G identified by the maximization of a fitness property, where s = strength of internal (in) or external (out) links and α = tweaking parameter

* A. Lancichinetti et al.: “Detecting the overlapping and hierarchical community structure of complex networks”
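The fitness formula itself did not survive in this transcript. In the cited paper by Lancichinetti et al., the fitness of a subgraph G is f_G = k_in / (k_in + k_out)^α, where k_in and k_out are the total internal and external degrees of the nodes in G. The sketch below assumes that the slide's s (link strength) plays the role of k on a weighted graph; the graph representation and the function name are hypothetical.

```python
def community_fitness(graph, members, alpha=1.0):
    """Fitness of a candidate cluster: s_in / (s_in + s_out) ** alpha.

    graph: dict node -> dict neighbor -> weight (undirected, stored both ways).
    members: nodes in the candidate subgraph G.
    """
    members = set(members)
    s_in = 0.0
    s_out = 0.0
    for node in members:
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor in members:
                s_in += weight   # each internal edge is seen from both endpoints
            else:
                s_out += weight  # edge leaving the cluster
    total = s_in + s_out
    return s_in / total ** alpha if total > 0 else 0.0

# Hypothetical graph: a triangle a-b-c plus one edge from c to d
graph = {
    "a": {"b": 1.0, "c": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"a": 1.0, "b": 1.0, "d": 1.0},
    "d": {"c": 1.0},
}
fitness = community_fitness(graph, {"a", "b", "c"}, alpha=1.0)
```

Larger values of α penalize the total strength more heavily and therefore favor smaller clusters, which is why a single value rarely suits graphs with different connectivity.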


Hierarchy detection

Hierarchy is a specific case of basic level variation

A possible approach: Hearst patterns on the Web, such as:
- $C_1$ (and|or) other $C_2$ (e.g. “poodles and other dogs”)
- $C_1$ such as $I$ (e.g. “cities such as San Francisco”)

(note: $C_i$ are concepts, $I$ is a concept instance)

Search for the patterns, and use the number of results as an indicator of their strength

Pros: the Web is as up-to-date as folksonomies

Cons: $O(n^2)$ complexity, not really scalable
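The quadratic cost becomes visible when the search queries are enumerated. The following sketch only builds the query strings for every ordered pair of tags; the pattern templates insert tags verbatim (no pluralization), the names are hypothetical, and the actual search-engine call that would return hit counts is deliberately left out.

```python
from itertools import permutations

# Quoted phrase queries derived from the Hearst patterns above.
PATTERNS = (
    '"{narrow} and other {broad}"',
    '"{narrow} or other {broad}"',
    '"{broad} such as {narrow}"',
)

def hearst_queries(tags):
    """Yield (narrow, broad, query) for every ordered pair of distinct tags."""
    for narrow, broad in permutations(tags, 2):
        for pattern in PATTERNS:
            yield narrow, broad, pattern.format(narrow=narrow, broad=broad)

# Hypothetical tag list
queries = list(hearst_queries(["poodles", "dogs", "cities"]))
```

For n tags this produces n(n-1) ordered pairs per pattern, so the number of Web searches grows quadratically with the size of the tag set.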


Prototype development

Prototype:
- Tag analysis tool, calculating CO (co-occurrence), NCO (normalized co-occurrence), and TCS (Tag Context Similarity); it takes time, runs as a batch job and saves matrices in the DB
- Disambiguation with homonyms plugin, implementing the overlapping clustering algorithm, and Wikipedia synonym discovery
- Front-end is currently a command-line application

Dataset:
- Data from more than 30K users of http://www.delicious.com
- Ignored the system:unfiled tag
- For the calculation of Tag Context Similarity, we only took into account the top 10K tags
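The slides do not define Tag Context Similarity. One common distributional measure is the cosine similarity between the co-occurrence vectors of two tags, computed over a fixed context vocabulary (here, the role played by the top 10K tags). The sketch below assumes that definition; it is not confirmed by the source, and the function names and data are hypothetical.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * vec_b.get(k, 0.0) for k, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def tag_context_similarity(tag_a, tag_b, cooccurrence, context_tags):
    """Assumed TCS: cosine of co-occurrence vectors restricted to context_tags.

    cooccurrence: dict tag -> dict other_tag -> co-occurrence count.
    The two tags being compared are excluded from each other's context.
    """
    def context(tag):
        row = cooccurrence.get(tag, {})
        return {t: w for t, w in row.items()
                if t in context_tags and t not in (tag_a, tag_b)}
    return cosine_similarity(context(tag_a), context(tag_b))

# Hypothetical co-occurrence counts
cooccurrence = {
    "toread": {"article": 4, "later": 3, "to_read": 1},
    "to_read": {"article": 8, "later": 6, "toread": 1},
    "fun": {"video": 5},
}
context_tags = {"article", "later", "video", "toread", "to_read", "fun"}
similar = tag_context_similarity("toread", "to_read", cooccurrence, context_tags)
unrelated = tag_context_similarity("toread", "fun", cooccurrence, context_tags)
```

Under this assumed definition, two tags can score highly even if they rarely appear together, as long as they appear with the same other tags.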


Experimental results / 1

System tested against three different sets of tags:
- Top 20 tags in delicious
- A group of tags known to be ambiguous (apple, cambridge, sf, stream, turkey, tube)
- A set of subjective tags, chosen among the most popular ones in delicious (cool, fun, funny, interesting, toread)

For each tag:
- we calculated the top n (with n = 50) related tags with the three metrics (CO, NCO, TCS)
- we performed synonym and homonym analyses
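Selecting the top n related tags from a precomputed metric can be sketched as below. The nested-dict layout of the similarity matrix, the function name, and the scores are hypothetical; the prototype stores its matrices in a database.

```python
import heapq

def top_related(tag, similarity, n=50):
    """Return the n highest-scoring (other_tag, score) pairs for a tag.

    similarity: dict tag -> dict other_tag -> score, for one metric
    (CO, NCO or TCS). The tag itself is excluded from its own ranking.
    """
    scores = similarity.get(tag, {})
    candidates = ((other, s) for other, s in scores.items() if other != tag)
    return heapq.nlargest(n, candidates, key=lambda pair: pair[1])

# Hypothetical scores
similarity = {
    "sf": {"scifi": 0.9, "san_francisco": 0.8, "bayarea": 0.5, "sf": 1.0},
}
top_two = top_related("sf", similarity, n=2)
```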



Tag Context Similarity already tends to provide synonyms as top-related tags
- e.g. toread is related to read, read_later, to_read, etc.
- Analyzing a less popular synonym (@readit): 9 out of the top 10 (and 17 out of the top 50) related tags are synonyms
- reason: as less popular tags are less spread across contexts, they tend to have a higher similarity with other less popular synonyms

Wikipedia results:
- analyzing the 31 tags in our three sets, we got 215 new words;
- of those 215, only 83 are valid tags in our delicious dataset;
- of those 83, only 20 belong to the 10K most-used tags;
- only 2 belong to the set of the top-related tags of their English synonym.



Homonyms detection:
- we tested the algorithm with different values of α
- meaningful results in a relatively short time (but we are working only on the top related tags...)
- limit: the graphs of top related tags differ in connectivity, so there is no single value of α that is good for all of them ($\alpha_{sf} = 1.4$, $\alpha_{stream} = 1.74$)


Conclusions

Model:
- Flexible enough to support other kinds of metrics
- Multigraph can be simplified in other ways
- User-related weights still have to be taken into account

Tool:
- Still in the prototype phase, but it has already provided useful results and allowed us to compare:
  - metrics: different metrics provide very different results, which might be more or less useful according to the user's needs
  - tag behaviors: different depending on their popularity and on how people use them


Conclusions

Ongoing work:
- Clustering evaluation metrics to find the best α
- Applications (e.g. for tag grouping and visualization*)
- User- and resource-specific projections**

Future work:
- Development of other plugins and front-end
- Play with user-related weights to focus on specific communities / filter spam

* Mazzola, Eynard, Mazza: “GVIS: a framework for graphical mashups of heterogeneous sources to support data interpretation”.

** Dattolo, Ferrara, Tasso: “On social semantic relations for recommending tags and resources using folksonomies”.


Thank you!

Thanks for your attention!

Questions?


toread top 20 related tags


@readit top 20 related tags


sf top 20 related tags


stream top 20 related tags