An Integrated Approach to Discover Tag Semantics
SAC 2011, Web Technologies Track, March 24th 2011
Antonina Dattolo
University of Udine, Department of Mathematics and Computer Science
Davide Eynard
USI - University of Lugano, ITC - Institute for Communication Technologies
Luca Mazzola
USI - University of Lugano, ITC - Institute for Communication Technologies
24/03/2011 An integrated approach to discover tag semantics 2/27
Talk outline
Properties of tags
Folksonomies as edge-colored multigraphs
Framework design and implementation
Tests and evaluations
Conclusions
Properties of tags
Tags:
  are democratic and bottom-up (vs. hierarchical)
  are inclusive and current
  follow desire lines
  are easy to use
Cons of tags
Lexical ambiguities:
  Synonyms: game and juego, or web2.0 and web_2
  Homonyms: check as in chess and in “to check” (polysemous); sf as scifi or san_francisco
  Basic level variations: dog and poodle
Ambiguities due to different purposes:
  blog used to tag blog software (e.g. WordPress), a blog service, a blog post, something to blog later, ...
Advantages of disambiguation
Synonym detection:
  increases recall
  allows for better recommendation systems
Homonym detection:
  makes it possible to find different contexts of use
  increases precision
Basic level variation detection:
  identifies a hierarchy
  increases recall (e.g. by automatically searching for subclasses)
  provides a means to browse search results
Approaches to tag disambiguation
Roughly two main families of approaches:
  theoretical ones, aiming at describing the system as a whole
  more practical, ad-hoc ones (often addressing one or a few issues at a time)
Our approach:
  main assumption: lexical ambiguities are not independent from each other
  solution based on a theoretical framework and a modular, extensible analysis tool
Folksonomies as edge-colored multigraphs
Def. 1: An edge-colored multigraph is a triple
ECMG = (MG, C, c)
where:
  MG = (V, E, f) is a multigraph
  C is a set of colors
  c : E → C is an assignment of colors to multigraph edges
Def. 2: A personomy related to user u is a non-directed edge-colored graph of color C_u:
P_u = (T, R, E, C_u)
Folksonomies as edge-colored multigraphs
Def. 3: Given a set of users U and the family of personomies P_u (u ∈ U), a folksonomy is defined as the union of these personomies, that is, an edge-colored multigraph where:
  vertices are tags + resources
  edges are tag assignments made on resources by each user
  every color is a different user
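A minimal sketch of this structure in Python, assuming tag assignments arrive as (user, tag, resource) triples. The function names and the sample data are illustrative and not part of the original framework; the folksonomy is treated here simply as the collection of all per-user personomies, with the user acting as the edge color.

```python
from collections import defaultdict

def build_folksonomy(assignments):
    """Group (user, tag, resource) triples into per-user personomies.

    Returns a dict mapping each user (the edge color) to the set of
    (tag, resource) edges that user created.
    """
    personomies = defaultdict(set)
    for user, tag, resource in assignments:
        personomies[user].add((tag, resource))
    return dict(personomies)

def multigraph_edges(personomies):
    """Flatten the personomies into colored edges (tag, resource, user)."""
    return sorted((tag, resource, user)
                  for user, edges in personomies.items()
                  for tag, resource in edges)

# Hypothetical sample data: two users tagging the same resource.
assignments = [
    ("alice", "sf", "r1"),
    ("bob", "sf", "r1"),
    ("bob", "scifi", "r1"),
]
folksonomy = build_folksonomy(assignments)
edges = multigraph_edges(folksonomy)
```

In this sample, the tag sf and the resource r1 are linked by two parallel edges, one per user, which is what makes the structure a multigraph rather than a plain graph.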
First simplification step
As we are only interested in relationships between tags, we perform two simplification steps on the edge-colored multigraph.
Step 1: colored edges are collapsed and replaced by weighted edges
  potentially, every color (user) might be assigned a different weight w_u
  the weight w of the collapsed edge is the sum of all the w_u linking the same two vertices
  when w_u = 1 for each user, w is the number of times a tag is used on a resource
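Step 1 can be sketched as follows, assuming colored edges are given as (tag, resource, user) triples. The name `collapse_colors` and the default weight of 1 for unlisted users are assumptions made for this illustration.

```python
from collections import defaultdict

def collapse_colors(colored_edges, user_weight=None):
    """Collapse colored (tag, resource, user) edges into weighted edges.

    user_weight maps a user to w_u; users not listed default to 1.
    The weight of a collapsed edge is the sum of the w_u of its colors.
    """
    user_weight = user_weight or {}
    weights = defaultdict(float)
    for tag, resource, user in colored_edges:
        weights[(tag, resource)] += user_weight.get(user, 1.0)
    return dict(weights)

# Hypothetical sample data.
colored = [("sf", "r1", "alice"), ("sf", "r1", "bob"), ("scifi", "r1", "bob")]
uniform = collapse_colors(colored)                # every w_u = 1
weighted = collapse_colors(colored, {"bob": 0.5})  # bob counts half
```

With uniform weights the result is a plain usage count; lowering the weight of a single user shows how user-related weights could change the outcome.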
Second simplification step
Step 2: a link is created between t_a and t_b if they share a resource
  resource nodes are dropped
Edges' weights can be calculated in different ways:
  number of triples (t_i, r, t_j) where (t_i, r), (r, t_j) ∈ E => co-occurrence
  normalized co-occurrence (e.g. using the Jaccard index)
  distributional measures
  custom metrics (e.g. sum of products of connecting edges' weights)
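A minimal sketch of Step 2, assuming the weighted tag-resource edges are stored as a dict keyed by (tag, resource). It computes three of the weights listed above for each pair of tags that share a resource: plain co-occurrence, Jaccard-normalized co-occurrence, and the sum of products of the connecting edges' weights. All names and the sample data are illustrative.

```python
from itertools import combinations

def tag_resources(weighted_edges):
    """Map each tag to {resource: weight} from (tag, resource) -> weight."""
    index = {}
    for (tag, resource), w in weighted_edges.items():
        index.setdefault(tag, {})[resource] = w
    return index

def tag_tag_weights(weighted_edges):
    """Project the tag-resource graph onto tags, dropping resource nodes."""
    index = tag_resources(weighted_edges)
    result = {}
    for a, b in combinations(sorted(index), 2):
        shared = index[a].keys() & index[b].keys()
        if not shared:
            continue  # no shared resource, so no link between a and b
        union = index[a].keys() | index[b].keys()
        result[(a, b)] = {
            "co": len(shared),                    # co-occurrence
            "jaccard": len(shared) / len(union),  # normalized co-occurrence
            "sum_of_products": sum(index[a][r] * index[b][r] for r in shared),
        }
    return result

# Hypothetical sample data.
edges = {("sf", "r1"): 2, ("scifi", "r1"): 1, ("sf", "r2"): 3, ("scifi", "r3"): 1}
weights = tag_tag_weights(edges)
```

Here sf and scifi share only r1 out of three resources overall, so their co-occurrence is 1 and their Jaccard index is 1/3.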
The whole process at a glance
[Figure: the process shown in four numbered panels, 1 to 4]
System architecture
Basic assumption: ambiguous tags should be related (either by co-occurrence or by presence in the same context)
Three main components:
  tag analysis tool
  disambiguation tool
  front-end
Synonyms detection / 1
Natural text:
  two words are considered synonyms if they can be replaced by each other without affecting the meaning of a sentence
vs. tag-based systems:
  it is possible to swap two tags within a “sentence” (i.e. a tagging action) without affecting its meaning when we have:
    variations of a word (e.g. blog, blogs, blogging)
    translations into other languages (e.g. game, juego, spiel)
    terms joined by non-alphabetic characters (e.g. web2, web_2)
No “one size fits all” solution
Synonyms detection / 2
A modular solution for synonyms detection:
  different heuristics, each one returning the likelihood of two tags being synonyms
  results are weighted to obtain an overall likelihood
Suggested heuristics:
  an edit distance such as Levenshtein's (normalized to account for short strings)
  synonym search in WordNet (good precision, low recall)
  online translation bases (top-down, such as dictionaries, or bottom-up, such as collaboratively grown vocabularies like Wikipedia)
  stemming with NLP algorithms
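The modular scheme can be sketched as below, under stated assumptions: the edit-distance heuristic normalizes Levenshtein's distance by the length of the longer tag, and the overall likelihood is a weighted average of the heuristic scores. The heuristic names and the weights 0.7 and 0.3 are hypothetical, and the dictionary score is a stand-in for a lookup that is not implemented here.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def edit_similarity(a, b):
    """Edit distance normalized by the longer string, mapped to [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def combine(scores, weights):
    """Weighted average of the per-heuristic likelihoods."""
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total

# "dictionary" stands in for a WordNet or translation lookup (not implemented).
scores = {"edit": edit_similarity("web2", "web_2"), "dictionary": 0.0}
overall = combine(scores, {"edit": 0.7, "dictionary": 0.3})
```

Normalizing by length matters for short tags: a distance of 1 between web2 and web_2 is a small change, while the same distance between two three-letter tags would affect a third of the string.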
Homonyms detection
Check if the tag t has been used in different contexts:
  cluster the tags related to t in groups
  the most frequent tags in these groups are used to name and disambiguate the contexts
Clustering algorithm:
  an overlapping one, also used in social network analysis*
  a cluster is a subgraph G identified by the maximization of a fitness property, where
    s = strength of internal (in) or external (out) links
    α = tweaking parameter
* A. Lancichinetti et al.: “Detecting the overlapping and hierarchical community structure of complex networks”
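A minimal sketch of such a fitness, assuming the form used in the cited Lancichinetti et al. paper with link strengths in place of degrees: f_G = s_in / (s_in + s_out)^α. Here s_in sums the weights of links between members of the cluster (each internal link is counted from both endpoints) and s_out sums the weights of links leaving it. The graph representation and the sample data are illustrative.

```python
def fitness(graph, cluster, alpha=1.0):
    """Fitness of a cluster: s_in / (s_in + s_out) ** alpha.

    graph: dict mapping node -> {neighbor: weight} (undirected, symmetric).
    cluster: set of nodes forming the candidate subgraph G.
    """
    s_in = s_out = 0.0
    for node in cluster:
        for neighbor, w in graph[node].items():
            if neighbor in cluster:
                s_in += w   # internal link, seen once from each endpoint
            else:
                s_out += w  # link leaving the cluster
    total = s_in + s_out
    return 0.0 if total == 0 else s_in / total ** alpha

# Hypothetical graph of tags related to "sf".
graph = {
    "sf": {"scifi": 2.0, "california": 3.0},
    "scifi": {"sf": 2.0, "books": 1.0},
    "california": {"sf": 3.0},
    "books": {"scifi": 1.0},
}
f = fitness(graph, {"sf", "scifi"}, alpha=1.0)
```

Larger values of α penalize the total strength more heavily and so tend to favor smaller clusters, which is why the parameter needs tuning per graph.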
Hierarchy detection
Hierarchy is a specific case of basic level variation
A possible approach: Hearst patterns on the Web, such as:
  C_1 (and|or) other C_2 (e.g. “poodles and other dogs”)
  C_1 such as I (e.g. “cities such as San Francisco”)
  (note: C_i are concepts, I is a concept instance)
Search for the patterns, and use the number of results as an indicator of their strength
Pros: the Web is as up-to-date as folksonomies
Cons: O(n^2) complexity, not really scalable
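The quadratic cost can be made concrete with a small sketch that only generates the search phrases, one per pattern and ordered pair of tags. Counting results would need a web search API, which is deliberately left out; the pattern templates and the function name are assumptions made for this illustration, and no pluralization is attempted.

```python
from itertools import permutations

# Templates for the patterns above; "sub" is the candidate subclass or
# instance, "sup" the candidate superclass.
PATTERNS = (
    "{sub} and other {sup}",
    "{sub} or other {sup}",
    "{sup} such as {sub}",
)

def hearst_queries(tags):
    """Yield (sub, sup, query) for every ordered pair of tags and pattern."""
    for sub, sup in permutations(tags, 2):
        for pattern in PATTERNS:
            yield sub, sup, '"' + pattern.format(sub=sub, sup=sup) + '"'

queries = list(hearst_queries(["poodles", "dogs", "cities"]))
```

For n tags this yields 3 · n · (n − 1) phrases, so the 10K most-used tags alone would already require roughly 300 million searches.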
Prototype development
Prototype:
  tag analysis tool, calculating co-occurrence (CO), normalized co-occurrence (NCO), and Tag Context Similarity (TCS); it takes time, so it runs as a batch job and saves the matrices in the DB
  disambiguation tool with a homonyms plugin, implementing the overlapping clustering algorithm, and Wikipedia synonym discovery
  the front-end is currently a command-line application
Dataset:
  data from more than 30K users of http://www.delicious.com
  ignored the system:unfiled tag
  for the calculation of Tag Context Similarity, we only took into account the top 10K tags
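The slides do not spell out how Tag Context Similarity is computed. The sketch below assumes a common distributional formulation: each tag is described by its co-occurrence counts with a fixed set of context tags (standing in for the top 10K tags), and two tags are compared by the cosine of those vectors. This is an assumption for illustration, not necessarily the prototype's exact metric; all names and data are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return 0.0 if norm_u == 0 or norm_v == 0 else dot / (norm_u * norm_v)

def tag_context_similarity(cooc, a, b, context_tags):
    """Compare the co-occurrence profiles of tags a and b over context_tags."""
    def profile(tag, other):
        # Keep only context tags, and ignore the two tags being compared.
        return {c: w for c, w in cooc.get(tag, {}).items()
                if c in context_tags and c not in (tag, other)}
    return cosine(profile(a, b), profile(b, a))

# Hypothetical co-occurrence counts.
cooc = {
    "toread": {"article": 4, "later": 2, "to_read": 1},
    "to_read": {"article": 2, "later": 1, "toread": 1},
}
sim = tag_context_similarity(cooc, "toread", "to_read", {"article", "later"})
```

Under this assumption, two tags can score highly without ever co-occurring directly, as long as they are used alongside the same context tags in similar proportions.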
Experimental results / 1
System tested against three different sets of tags:
  the top 20 tags in delicious
  a group of tags known to be ambiguous (apple, cambridge, sf, stream, turkey, tube)
  a set of subjective tags, chosen among the most popular ones in delicious (cool, fun, funny, interesting, toread)
For each tag:
  we calculated the top n (with n = 50) related tags with the three metrics (CO, NCO, TCS)
  we performed synonym and homonym analyses
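Selecting the top n related tags from a precomputed matrix can be sketched as follows, assuming the matrix is available as a nested dict of scores. The function name and the sample scores are hypothetical.

```python
import heapq

def top_related(similarity, tag, n=50):
    """Return the n tags most related to `tag`, highest score first."""
    scores = similarity.get(tag, {})
    return heapq.nlargest(n, scores, key=scores.get)

# Hypothetical scores for a single tag.
similarity = {"toread": {"read": 0.9, "to_read": 0.8, "python": 0.1}}
top2 = top_related(similarity, "toread", n=2)
```

The same helper works for any of the three metrics, since only the scores in the matrix change.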
Experimental results / 2
Tag Context Similarity already tends to provide synonyms as top-related tags
  e.g. tags related to toread: read, read_later, to_read, etc.
Analyzing a less popular synonym (@readit): 9 out of the top 10 (and 17 out of the top 50) related tags are synonyms
  reason: as less popular tags are less spread across contexts, they tend to have a higher similarity with other less popular synonyms
Wikipedia results:
  analyzing the 31 tags in our three sets, we got 215 new words
  of those 215, only 83 are valid tags in our delicious dataset
  of those 83, only 20 belong to the 10K most-used tags
  only 2 belong to the set of the top-related tags of their English synonym
Experimental results / 3
Homonyms detection:
  we tested the algorithm with different values of α
  meaningful results in a relatively short time (but we are working only on the top related tags...)
  limit: the graphs of top related tags differ in connectivity, so there is no single value of α that is good for all of them (α_sf = 1.4, α_stream = 1.74)
Conclusions
Model:
  flexible enough to support other kinds of metrics
  the multigraph can be simplified in other ways
  user-related weights still have to be taken into account
Tool:
  still in the prototype phase, but it has already provided useful results and allowed us to compare
    metrics: different metrics provide very different results, which might be more or less useful according to the user's needs
    tag behaviors: different depending on their popularity and the use that people make of them
Conclusions
Ongoing work:
  clustering evaluation metrics to find the best α
  applications (e.g. for tag grouping and visualization*)
  user- and resource-specific projections**
Future work:
  development of other plugins and of the front-end
  play with user-related weights to focus on specific communities / filter spam
* Mazzola, Eynard, Mazza: “GVIS: a framework for graphical mashups of heterogeneous sources to support data interpretation”.
** Dattolo, Ferrara, Tasso: “On social semantic relations for recommending tags and resources using folksonomies”.
Thank you!
Thanks for your attention!
Questions?
toread top 20 related tags
@readit top 20 related tags
sf top 20 related tags
stream top 20 related tags