Towards a solution to extract knowledge from the social web
(“metadata first, ontologies second”)
Project Collaborative Ontology Building System (CollOnBus)
INTEK Nets 2005-2007
Aitor Almeida, Borja Sotomayor,
Joseba Abaitua, Diego Lopez de Ipiña
ESWC 2008 (Tenerife)
Social web: source of knowledge
Crowds share and tag resources of different types:
– pictures, music, posts, videoclips, slides, books, bookmarks, etc.
Social tagging (or crowd-tagging) is a very effective and economic way of generating knowledge
Crowdsourcing: “the trend of leveraging the mass collaboration enabled by Web2.0 technologies to achieve business goals.”
<http://en.wikipedia.org/wiki/Crowdsourcing>
Related work (since 2006)
mapping tags to ontologies
– Schmitz 2006. Inducing Ontology from Flickr tags. WWW’2006: Collaborative Web Tagging workshop
– Abbasi et al. 2007. Organizing Resources on Tagging Systems using T-ORG. ESWC2007 SemNet workshop
identifying semantic relations
– Specia, Motta. 2007. Integrating Folksonomies with the Semantic Web. ESWC2007
transforming folksonomies into formal representations
– Marlow et al. 2006. Tagging, Taxonomy, Flickr, Article, ToRead. WWW’2006: Collaborative Web Tagging workshop
– Hotho et al. 2006. Trend Detection in Folksonomies. Semantics And Digital Media Technology SAMT2006
– Maala et al. A Conversion Process From Flickr Tags to RDF Descriptions. BIS2007 workshop
Which knowledge representation model?
Extracting knowledge from data-sharing Web 2.0 sites, but into which formal representation?
Semantic networks
– Lexical networks (WordNet)
Taxonomies
– e.g. categories from Wikipedia, thesauri
Metadata
– “mapping to Dublin Core is a weak choice”
Ontologies
“metadata first, ontologies second”
Crowds tagging pictures
Aitor Almeida
Borja Sotomayor
Diego López de Ipiña
Crowds tagging posts
Crowds tagging slides
Crowds tagging books
Crowds tagging URL
Crowd-sharing of tags
Flickr, del.icio.us... group tags by social sharing (or “co-usage”)
– but the semantic information that socially shared tags acquire is poorly exploited
Mapping folksonomies into tag clusters
RawSugar <http://rawsugar.com/>
– allows users to assign hierarchies to their tags, improving the navigation and searching of folksonomies
– non-expert users will find it easier to tag resources without any restrictions
Tag clustering
TAG clustering is the main technique used to improve the wealth of social tagging
– but semantic relations are not detected
Beyond tag clusters?
Should we map them into ontologies?
Better mapping 1st into metadata
blog, japan, personal, spanish, geek
“Kirai.NET” http://kirai.bitacoras.com
Title: Kirai.NET
Format: text/html
Type: Text
Identifier: http://kirai.bitacoras.com
Subject: blog, personal, geek
Language: spanish
Coverage: japan
Metadata vs ontologies
Why are metadata structures better than ontologies (for resource classification and categorisation)?
Let’s reflect on different knowledge representations and on who uses them:
– Folksonomies (crowds)
– Taxonomies, ontologies (knowledge engineers, AI/SW practitioners)
– Metadata structures (librarians, archivists, documentalists)
What are metadata?
TAG vs metadata?
Metadata vs ontologies
Why are metadata structures better?
– Because metadata provide a wide and complete range of facets for representing knowledge about an entity or resource
– Each facet (or data type) could be part of one or several ontological structures
– Facet: “any of the definable aspects that make up a subject (as of contemplation) or an object (as of consideration)”
– “A faceted classification system allows the assignment of multiple classifications to an object, enabling the classifications to be ordered in multiple ways, rather than in a single, pre-determined, taxonomic order” (Wikipedia).
Better mapping 1st folksonomies into metadata structures
blog, japan, personal, spanish, geek
“Kirai.NET” http://kirai.bitacoras.com
Title: Kirai.NET
Format: text/html
Type: Text
Identifier: http://kirai.bitacoras.com
Subject: blog, personal, geek
Language: spanish
Coverage: japan
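A minimal sketch of the idea on this slide: flat tags are distributed over Dublin Core facets instead of all landing in one bag. The lookup tables LANGUAGES and PLACES and the function tags_to_dublin_core are hypothetical stand-ins for illustration; the tool described in these slides relies on WordNet senses and trained mappings rather than fixed lists.

```python
# Hypothetical lookup tables, used only for illustration.
LANGUAGES = {"spanish", "english", "french"}
PLACES = {"japan", "spain", "tenerife"}


def tags_to_dublin_core(title, url, tags):
    """Distribute flat tags over Dublin Core facets (illustrative only)."""
    record = {
        "Title": title,
        "Format": "text/html",
        "Type": "Text",
        "Identifier": url,
        "Subject": [],
        "Language": [],
        "Coverage": [],
    }
    for tag in tags:
        key = tag.lower()
        if key in LANGUAGES:
            record["Language"].append(key)
        elif key in PLACES:
            record["Coverage"].append(key)
        else:
            # Anything not recognised as a language or place is kept as a topic.
            record["Subject"].append(key)
    return record


record = tags_to_dublin_core(
    "Kirai.NET",
    "http://kirai.bitacoras.com",
    ["blog", "japan", "personal", "spanish", "geek"],
)
for field, value in record.items():
    print(f"{field}: {', '.join(value) if isinstance(value, list) else value}")
```

The point of the design is that "japan" and "spanish" stop being mere topics: each tag ends up in the facet that says what kind of statement it makes about the resource.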
Dublin Core Metadata Initiative
http://jodi.tamu.edu/Articles/v02/i02/Greenberg/metadataform.gif
Our mapping tool: folk2onto (? folk2meta)
[Architecture diagram of folk2onto. Components shown: Tagged resource (RSS/HTML); Tag Retriever; Folksonomy TAGs; Tag Trainer; Tag Distiller; Trained Tags DB (training); WordNet synsets; Filtered Tags (XML); Mappings Trainer; Mappings Distiller; Mapping DB (mapping); Annotated resource (RDF)]
designed by Borja Sotomayor
folk2onto: Tag Distiller
Tag Distiller:
– Downloads tags from Web 2.0 sites
– Matches each tag against WordNet (taking into account the tag’s context/cloud)
– Filters out synonyms
– Keeps the list of remaining tags
– Generates an XML file
Implemented by Aitor Almeida
TAG clouds from del.icio.us
1. Requests http://del.icio.us/url/check?url=site
2. Looks for <title> and gets its content: the hash
3. Gets the RSS in http://del.icio.us/rss/url/ + hash
4. Then tag clouds are downloaded from <rdf:li resource="http://del.icio.us/tag/">
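A minimal sketch of step 4, working on an inline RSS fragment rather than a live request. The fragment is invented for illustration, and its layout (tags listed as rdf:li elements inside an rdf:Bag) is an assumption based on the element named in step 4; extract_tags is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
TAG_PREFIX = "http://del.icio.us/tag/"

# Invented sample; the real feed layout is assumed, not confirmed by the slides.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Bag>
    <rdf:li rdf:resource="http://del.icio.us/tag/blog"/>
    <rdf:li rdf:resource="http://del.icio.us/tag/japan"/>
    <rdf:li rdf:resource="http://example.org/not-a-tag"/>
  </rdf:Bag>
</rdf:RDF>"""


def extract_tags(rss_text):
    """Collect tag names from rdf:li elements that point at a tag URL."""
    root = ET.fromstring(rss_text)
    tags = []
    for li in root.iter(f"{{{RDF_NS}}}li"):
        # Accept both the namespaced (rdf:resource) and the plain attribute.
        resource = li.get(f"{{{RDF_NS}}}resource") or li.get("resource") or ""
        if resource.startswith(TAG_PREFIX):
            tags.append(resource[len(TAG_PREFIX):])
    return tags


print(extract_tags(SAMPLE))
```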
TAG clouds from Technorati
Technorati: blog aggregator
We can get tag clouds from Technorati through:
http://api.technorati.com/blogposttags?key=[apikey]&url=[blog URL]
TAG clouds from Technorati
<?xml version="1.0" encoding="utf-8"?>
<!-- generator="Technorati API version 1.0 /blogposttags" -->
<!DOCTYPE tapi PUBLIC "-//Technorati, Inc.//DTD TAPI 0.02//EN" "http://api.technorati.com/dtd/tapi-002.xml">
<tapi version="1.0">
  <document>
    <result>
      <querycount>13</querycount>
    </result>
    <item>
      <tag>christmas cookie recipes</tag>
      <posts>274</posts>
    </item>
    ….
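A response of this shape can be read with a few lines of standard-library Python. This is a sketch under stated assumptions: the sample below repeats the single item visible on the slide and is closed by hand so that it parses, no request is made to the API, and parse_blogposttags is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-closed copy of the fragment shown on the slide (the elided part is omitted).
SAMPLE_RESPONSE = """<tapi version="1.0">
  <document>
    <result>
      <querycount>13</querycount>
    </result>
    <item>
      <tag>christmas cookie recipes</tag>
      <posts>274</posts>
    </item>
  </document>
</tapi>"""


def parse_blogposttags(xml_text):
    """Return (tag, post count) pairs for every <item> in the response."""
    root = ET.fromstring(xml_text)
    return [
        (item.findtext("tag"), int(item.findtext("posts")))
        for item in root.iter("item")
    ]


for tag, posts in parse_blogposttags(SAMPLE_RESPONSE):
    print(f"{tag}: {posts} posts")
```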
Tagged URL at Technorati
All <tag> elements are downloaded
To get the “title”: http://api.technorati.com/bloginfo?key=[apikey]&url=[blog url]
And <name> is recovered
semantic relations in WordNet
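WordNet relations for tag ‘Spanish’:
[Diagram with two senses of ‘Spanish’. Language sense “Spanish”: hypernym “Romance, Romance language, Latinian language”; hyponym “Mexican Spanish”. People sense “Spanish, Spanish people”: linked to “nation, land, country, a people” and “national, subject” through the relations labelled hyponym and meronym.]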
TAG filtering algorithm
Tags are filtered out by means of WordNet
If a TAG has only one meaning (synset), that meaning is assigned
If it has more than one, then the meaning with the highest score is chosen:

$$\max_{tag} \left( w \cdot \frac{\sum_{t \in T - \{tag\}} related(tag, t)}{|T|} + w \cdot \frac{\sum_{t \in F_{tags}} related(tag, t)}{|F|} \right)$$

– T: the resource's tag set
– related(a,b): gives 1 if a and b have some type of relation (hypernym, hyponym, holonym, meronym)
– w: weights
Several iterations are made until a meaning is found (10 iterations max.)
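A minimal sketch of the context part of this scoring: each candidate sense of an ambiguous tag is scored by the share of the resource's tag set T it is related to, and the best-scoring sense wins. The relation pairs in RELATED, the sense labels and the helper names are invented for illustration; the described tool looks relations up in WordNet and also uses further weighted terms not modelled here.

```python
# Toy relation pairs standing in for WordNet links (hypothetical data).
RELATED = {
    ("spanish#language", "language"),
    ("spanish#language", "translation"),
    ("spanish#people", "nation"),
}


def related(a, b):
    """Return 1 if a and b are linked in the toy relation set, else 0."""
    return 1 if (a, b) in RELATED or (b, a) in RELATED else 0


def context_score(sense, tag, tag_set, weight=1.0):
    """Weighted share of the tag set T that the candidate sense is related to."""
    others = [t for t in tag_set if t != tag]
    return weight * sum(related(sense, t) for t in others) / len(tag_set)


def pick_sense(tag, candidate_senses, tag_set):
    """Choose the candidate sense with the highest context score."""
    return max(candidate_senses, key=lambda s: context_score(s, tag, tag_set))


T = ["spanish", "language", "translation", "blog"]
senses = ["spanish#language", "spanish#people"]
print(pick_sense("spanish", senses, T))
```

With this tag set the language sense is related to two of the four tags and the people sense to none, so the language sense is selected.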
TAG filtering algorithm
Once senses have been discarded, synonyms are also filtered out
Words are then grouped in senses using WordNet’s relation network
The output is exported to:
– an XML file with senses
– an XML file with the tags that were discarded
– an RDF file containing WordNet’s relation network
TAG XML file
<?xml version="1.0" encoding="UTF-8"?>
<resource>
  <tittle>PostgreSQL: Perguntas Frequentes (FAQ) sobre PostgreSQL</tittle>
  <type>Text</type>
  <format>text/html</format>
  <identifier>www.postgresql.org/docs/faqs.FAQ_brazilian.html</identifier>
  <tags>
    <tag>
      <lemma>tune</lemma>
      <idlex>236726</idlex>
    </tag>
    <tag>
      <lemma>bd</lemma>
      <idlex>5604473</idlex>
    </tag>
TAG file without senses
<resource>
  <tittle>Wired News: The Virus That Ate DHS</tittle>
  <type>Text</type>
  <format>text/html</format>
  <identifier>www.wired.com/news/technology/0,72051-0.html?tw=rss.index</identifier>
  <tags>
    <tag>bit200f06</tag>
    <tag>group141</tag>
    <tag>dhs</tag>
    <tag>group35</tag>
    <tag>malware</tag>
    <tag>group91</tag>
    <tag>group17</tag>
    <tag>group53</tag>
    <tag>computer_security</tag>
  </tags>
</resource>
WordNet’s sense sets
Words are grouped in sense sets
– If related(a,b) = 1, then the words are grouped in the same set
– The relation depth has to be equal to or smaller than 3
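The grouping rule can be sketched with a depth-limited breadth-first search. The relation graph in EDGES is invented for illustration and stands in for WordNet's relation network. Merging groups transitively (a word joins any group that contains at least one related word) is an assumption; the slides do not say how chains of relations are handled.

```python
from collections import deque

# Hypothetical undirected relation edges, standing in for WordNet links.
EDGES = [
    ("spanish", "romance_language"), ("romance_language", "language"),
    ("jaguar", "big_cat"), ("big_cat", "feline"), ("feline", "cat"),
    ("cat", "carnivore"), ("carnivore", "mammal"),
]


def build_graph(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def related(graph, a, b, max_depth=3):
    """Return 1 if b is reachable from a in at most max_depth relation steps."""
    seen = {a}
    queue = deque([(a, 0)])
    while queue:
        node, depth = queue.popleft()
        if node == b:
            return 1
        if depth < max_depth:
            for nxt in graph.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))
    return 0


def sense_sets(graph, words, max_depth=3):
    """Group words; a word joins every group holding a word it is related to."""
    groups = []
    for word in words:
        linked = [g for g in groups
                  if any(related(graph, word, w, max_depth) for w in g)]
        merged = {word}
        for g in linked:
            merged |= g
            groups.remove(g)
        groups.append(merged)
    return groups


graph = build_graph(EDGES)
groups = sense_sets(graph, ["spanish", "language", "jaguar", "cat", "mammal", "blog"])
print(groups)
```

In this toy graph "jaguar" and "mammal" are five steps apart, so they are not directly related, but both are within three steps of "cat" and therefore end up in the same set.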
folk2onto: Tag Trainer
folk2onto: Map Trainer
folk2onto: Tag Mapper
The Mapper makes tag-element associations
These associations are made according to the senses assigned by the Distiller
Mapping targets into Dublin Core metadata records
folk2onto: Dublin Core
The Distiller gets 4 elements from the tag source (del.icio.us, Technorati, etc.):
– Title: URL’s title -> from the <title> XML tag
– Type: content type -> depending on the source (here both are “Text”)
– Format: MIME class -> depending on the source (here both are text/html)
– Identifier: we take the resource’s URL
folk2onto: Dublin Core
The Tag-Mapper deals with:
– Subject: the “topic”.
– Language: en, es, fr, de, ru...
– Coverage: when, where (about the topic)
– Rights: type of licence
folk2onto: mapping formulae
When a TAG has one mapping, that TAG is used
If it has more than one:

$$w_{rel} \cdot \frac{\sum_{m \in M} related(t, m)}{|M|}$$

If it has no mapping, then:
[Formula not recoverable from the slide text. Surviving terms: SameOrSon(T, M), count(t_i), count(t_o), weights w, |T|]
folk2onto: file mapping
<rdf:RDF
    xmlns:j.0="http://purl.org/dc/elements/1.1"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:nodeID="A0">
    <rdf:type rdf:resource="http://purl.org/dc/elements/1.1identifier"/>
    <j.0:identifier>www.postgresql.org/docs/faqs.FAQ_brazilian.html</j.0:identifier>
    <j.0:type>Text</j.0:type>
    <j.0:format>text/html</j.0:format>
    <j.0:tittle>PostgreSQL: Perguntas Frequentes (FAQ) sobre PostgreSQL</j.0:tittle>
    <j.0:subject>database</j.0:subject>
    <j.0:subject>performance</j.0:subject>
    <j.0:subject>bd</j.0:subject>
  </rdf:Description>
</rdf:RDF>
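A comparable record can be produced with Python's standard library. This is a sketch, not the tool's own serializer: to_rdf is a hypothetical helper, and it deliberately uses the standard Dublin Core namespace (with its trailing slash) and the element name "title", which differ from the prefix and spelling in the output shown on the slide.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"  # standard DC namespace (assumption)

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dc", DC_NS)


def to_rdf(record):
    """Serialise a dict of Dublin Core element -> list of values as RDF/XML."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(root, f"{{{RDF_NS}}}Description")
    for element, values in record.items():
        for value in values:
            ET.SubElement(description, f"{{{DC_NS}}}{element}").text = value
    return ET.tostring(root, encoding="unicode")


record = {
    "identifier": ["www.postgresql.org/docs/faqs.FAQ_brazilian.html"],
    "type": ["Text"],
    "format": ["text/html"],
    "title": ["PostgreSQL: Perguntas Frequentes (FAQ) sobre PostgreSQL"],
    "subject": ["database", "performance", "bd"],
}
rdf_xml = to_rdf(record)
print(rdf_xml)
```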
Mapping trainer
folk2onto: 6 tests (A-F)
Experiment A: Selecting random synsets for the tags.
Experiment B: Without any limit in the semantic relation depth. Only taking into account the trained synsets (frec=0, wordnet=0, trained=1).
Experiment C: Without any limit in the semantic relation depth. Only taking into account the context (frec=0, wordnet=1, trained=0).
Experiment D: Without any limit in the semantic relation depth. Taking the context and the trained synsets into account (frec=0, wordnet=0.4, trained=0.6).
Experiment E: Without any limit in the semantic relation depth. Taking all three components of the equation (familiarity, context and trained synsets) into account (frec=0.1, wordnet=0.3, trained=0.6).
Experiment F: Limiting the semantic relation depth to 3 and taking the context and the trained synsets into account (frec=0, wordnet=0.4, trained=0.6).
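A minimal sketch of how the three weights listed for experiments B to F might combine, assuming a plain weighted sum of the familiarity, context and trained-synset components. The slides do not spell out this combination, the component scores passed in below are invented, and the depth limit that distinguishes experiment F from D is not modelled.

```python
# (frec, wordnet, trained) weights as listed for experiments B to F.
EXPERIMENT_WEIGHTS = {
    "B": (0.0, 0.0, 1.0),
    "C": (0.0, 1.0, 0.0),
    "D": (0.0, 0.4, 0.6),
    "E": (0.1, 0.3, 0.6),
    "F": (0.0, 0.4, 0.6),
}


def combined_score(familiarity, context, trained, weights):
    """Weighted sum of the three components (assumed form, not from the slides)."""
    frec_w, wordnet_w, trained_w = weights
    return frec_w * familiarity + wordnet_w * context + trained_w * trained


# Invented component scores for one candidate synset.
for name, weights in EXPERIMENT_WEIGHTS.items():
    print(name, round(combined_score(0.5, 0.5, 1.0, weights), 3))
```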
folk2onto: tests output
Experiment   Correct synsets   Erroneous synsets
A            706 (32.5%)       1466 (67.5%)
B            1594 (73.4%)      578 (26.6%)
C            1199 (55.2%)      973 (44.8%)
D            1492 (68.7%)      680 (31.3%)
E            1349 (62.1%)      823 (37.9%)
F            1894 (87.2%)      278 (12.8%)
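The percentages in this table follow directly from the counts, since every experiment covers the same 2172 synset assignments. A short check (RESULTS and percentages are names chosen for this sketch):

```python
# (correct, erroneous) synset counts per experiment, copied from the table.
RESULTS = {
    "A": (706, 1466),
    "B": (1594, 578),
    "C": (1199, 973),
    "D": (1492, 680),
    "E": (1349, 823),
    "F": (1894, 278),
}


def percentages(correct, erroneous):
    """Return (correct %, erroneous %) rounded to one decimal place."""
    total = correct + erroneous
    return round(100 * correct / total, 1), round(100 * erroneous / total, 1)


for name, (correct, erroneous) in RESULTS.items():
    ok, bad = percentages(correct, erroneous)
    print(f"{name}: {ok}% correct, {bad}% erroneous")

best = max(RESULTS, key=lambda name: RESULTS[name][0])
print("Highest number of correct synsets:", best)
```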
folk2onto: tests output
[Bar chart: number of correct and erroneous synsets per experiment (Exp. A to Exp. F), vertical axis from 0 to 2500]
Open issues
Tag filtering through WordNet
– blog, wiki
– xml, rdf, rss
– wordpress, tuenti, flickr
– social, open
“tags can be about so many things – mapping to Dublin Core is a weak choice”
Mappings
– Coverage: Japan
– Language: Spanish
Learning the right synset of e.g. "jaguar"
– "vehicle", "video game console", or "cat of prey"
– "<dc:subject>Jaguar</dc:subject>"
Word-sense disambiguation
– tag-category disambiguation
That was all about CollOnBus/folk2onto
Thank you very much! Any questions?