Towards a solution to extract knowledge from the social web (“metadata first, ontologies second”) Project Collaborative Ontology Building System (CollOnBus) INTEK Nets 2005-2007 Aitor Almeida, Borja Sotomayor, Joseba Abaitua, Diego Lopez de Ipiña

Metadata first, ontologies second


Page 1: Metadata first, ontologies second

Towards a solution to extract knowledge from the social web

(“metadata first, ontologies second”)

Project Collaborative Ontology Building System (CollOnBus)

INTEK Nets 2005-2007

Aitor Almeida, Borja Sotomayor,

Joseba Abaitua, Diego Lopez de Ipiña

Page 2: Metadata first, ontologies second

ESWC 2008 (Tenerife)

Social web: source of knowledge

Crowds share and tag resources of different types:
– pictures, music, posts, video clips, slides, books, bookmarks, etc.

Social tagging (or crowd-tagging) is a very effective and economical way of generating knowledge

Crowdsourcing: “the trend of leveraging the mass collaboration enabled by Web2.0 technologies to achieve business goals.”

<http://en.wikipedia.org/wiki/Crowdsourcing>

Page 3: Metadata first, ontologies second

Related work (since 2006)

Mapping tags to ontologies
– Schmitz 2006. Inducing Ontology from Flickr tags. WWW’2006: Collaborative Web Tagging workshop
– Abbasi et al. 2007. Organizing Resources on Tagging Systems using T-ORG. ESWC2007 SemNet workshop

Identifying semantic relations
– Specia, Motta. 2007. Integrating Folksonomies with the Semantic Web. ESWC2007

Transforming folksonomies into formal representations
– Marlow et al. 2006. Tagging, Taxonomy, Flickr, Article, ToRead. WWW’2006: Collaborative Web Tagging workshop
– Hotho et al. 2006. Trend Detection in Folksonomies. Semantics And Digital Media Technology SAMT2006
– Maala et al. A Conversion Process From Flickr Tags to RDF Descriptions. BIS2007 workshop

Page 4: Metadata first, ontologies second

Which knowledge representation model?

Extracting knowledge from data-sharing Web 2.0 sites, but into which formal representation?

Semantic networks
– Lexical networks (WordNet)

Taxonomies
– e.g. categories from Wikipedia, thesauri

Metadata
– “mapping to Dublin Core is a weak choice”

Ontologies

“metadata first, ontologies second”

Page 5: Metadata first, ontologies second


Crowds tagging pictures

Page 6: Metadata first, ontologies second

Crowds tagging pictures

Aitor Almeida, Borja Sotomayor, Diego López de Ipiña

Page 7: Metadata first, ontologies second


Crowds tagging pictures

Page 8: Metadata first, ontologies second


Crowds tagging posts

Page 9: Metadata first, ontologies second

Crowds tagging slides

Page 10: Metadata first, ontologies second

Crowds tagging books

Page 11: Metadata first, ontologies second

Crowds tagging URLs

Page 12: Metadata first, ontologies second

Crowd-sharing of tags

Flickr, del.icio.us... group tags by social sharing (or “co-usage”)
– but the semantic information that socially shared tags acquire is poorly exploited

Page 13: Metadata first, ontologies second

Mapping folksonomies into tag clusters

RawSugar <http://rawsugar.com/>
– allows users to assign hierarchies to their tags, improving the navigation and searching of folksonomies
– non-expert users will find it easier to tag resources without any restrictions

Page 14: Metadata first, ontologies second

Tag clustering

Tag clustering is the main technique used to improve the wealth of social tagging
– but semantic relations are not detected

Page 15: Metadata first, ontologies second


Beyond tag clusters?

Page 16: Metadata first, ontologies second

Should we map them into ontologies?

Page 17: Metadata first, ontologies second

Better mapping 1st into metadata

blog, japan, personal, spanish, geek

“Kirai.NET” http://kirai.bitacoras.com

Title: Kirai.NET
Format: text/html
Type: Text
Identifier: http://kirai.bitacoras.com
Subject: blog, personal, geek
Language: spanish
Coverage: japan

Page 18: Metadata first, ontologies second

Metadata vs ontologies

Why are metadata structures better than ontologies (for resource classification and categorisation)?

Let’s reflect on different knowledge representations and on who uses them:
– Folksonomies (crowds)
– Taxonomies, ontologies (knowledge engineers, AI/SW practitioners)
– Metadata structures (librarians, archivists, documentalists)

Page 19: Metadata first, ontologies second

What are metadata?

Page 20: Metadata first, ontologies second


TAG vs metadata?

Page 21: Metadata first, ontologies second

Metadata vs ontologies

Why are metadata structures better?
– Because metadata provide a wide and complete range of facets for representing knowledge about an entity or resource
– Each facet (or data type) could be part of one or several ontological structures
– Facet: “any of the definable aspects that make up a subject (as of contemplation) or an object (as of consideration)”
– “A faceted classification system allows the assignment of multiple classifications to an object, enabling the classifications to be ordered in multiple ways, rather than in a single, pre-determined, taxonomic order” (Wikipedia).

Page 22: Metadata first, ontologies second

Better mapping folksonomies 1st into metadata structures

blog, japan, personal, spanish, geek

“Kirai.NET” http://kirai.bitacoras.com

Title: Kirai.NET
Format: text/html
Type: Text
Identifier: http://kirai.bitacoras.com
Subject: blog, personal, geek
Language: spanish
Coverage: japan

Page 23: Metadata first, ontologies second

Dublin Core Metadata Initiative

http://jodi.tamu.edu/Articles/v02/i02/Greenberg/metadataform.gif

Page 24: Metadata first, ontologies second

Dublin Core Metadata Initiative

Page 25: Metadata first, ontologies second

Dublin Core Metadata Initiative

Page 26: Metadata first, ontologies second

Our mapping tool: folk2onto (? folk2meta)

[Architecture diagram, designed by Borja Sotomayor. Components shown: Tagged resource (RSS/HTML), Tag Retriever, Folksonomy tags, Tag Trainer, Tag Distiller, Trained Tags DB, WordNet synsets, Filtered Tags (XML), Mappings Trainer, Mappings Distiller, Mapping DB, Annotated resource (RDF). Arrow labels: tags, training, XML, mapping.]

Page 27: Metadata first, ontologies second

folk2onto: Tag Distiller

Tag Distiller:
– Downloads tags from Web 2.0 sites
– Matches each tag against WordNet (taking into account the tag’s context/cloud)
– Filters out synonyms
– Keeps the list of remaining tags
– Generates an XML file

Implemented by Aitor Almeida

Page 28: Metadata first, ontologies second

Tag clouds from del.icio.us

1. Requests http://del.icio.us/url/check?url=site
2. Looks for <title> and gets its content: the hash
3. Gets the RSS at http://del.icio.us/rss/url/ + hash
4. Then tag clouds are downloaded from <rdf:li resource="http://del.icio.us/tag/">
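Step 4 can be illustrated with a minimal sketch that reads tag names out of `rdf:li` elements. It assumes the tag name is whatever follows `http://del.icio.us/tag/` in the `resource` attribute; the function name `tags_from_rss` and the sample feed are hypothetical and only mimic the shape described on the slide.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
TAG_PREFIX = "http://del.icio.us/tag/"

def tags_from_rss(rss_text):
    """Return the tag names found in rdf:li resource attributes."""
    root = ET.fromstring(rss_text)
    tags = []
    for li in root.iter(f"{{{RDF_NS}}}li"):
        # Accept both the namespaced (rdf:resource) and the plain attribute.
        resource = li.get(f"{{{RDF_NS}}}resource")
        if resource is None:
            resource = li.get("resource", "")
        name = resource[len(TAG_PREFIX):] if resource.startswith(TAG_PREFIX) else ""
        if name:
            tags.append(name)
    return tags

# Hypothetical sample feed fragment, not real del.icio.us output.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Bag>
    <rdf:li resource="http://del.icio.us/tag/blog"/>
    <rdf:li rdf:resource="http://del.icio.us/tag/japan"/>
  </rdf:Bag>
</rdf:RDF>"""

print(tags_from_rss(SAMPLE))
```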

Page 29: Metadata first, ontologies second

Tag clouds from Technorati

Technorati: blog aggregator

We can get tag clouds from Technorati through:
http://api.technorati.com/blogposttags?key=[apikey]&url=[blog URL]

Page 30: Metadata first, ontologies second

Tag clouds from Technorati

<?xml version="1.0" encoding="utf-8"?>
<!-- generator="Technorati API version 1.0 /blogposttags" -->
<!DOCTYPE tapi PUBLIC "-//Technorati, Inc.//DTD TAPI 0.02//EN" "http://api.technorati.com/dtd/tapi-002.xml">
<tapi version="1.0">
  <document>
    <result>
      <querycount>13</querycount>
    </result>
    <item>
      <tag>christmas cookie recipes</tag>
      <posts>274</posts>
    </item>
    ….
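A response of this shape can be reduced to (tag, posts) pairs with a few lines of standard-library code. This is a sketch under assumptions: the function name `parse_blogposttags` is hypothetical, and the sample is a closed-off copy of the fragment shown on the slide rather than a full API response.

```python
import xml.etree.ElementTree as ET

def parse_blogposttags(xml_text):
    """Return (tag, posts) pairs from a blogposttags-style response."""
    root = ET.fromstring(xml_text)
    pairs = []
    for item in root.iter("item"):
        tag = item.findtext("tag")
        posts = item.findtext("posts")
        if tag is not None:
            # A missing or empty <posts> element is treated as 0 (an assumption).
            pairs.append((tag.strip(), int(posts) if posts else 0))
    return pairs

# Closed-off version of the slide's fragment, used only for illustration.
SAMPLE = """<tapi version="1.0">
  <document>
    <result><querycount>13</querycount></result>
    <item><tag>christmas cookie recipes</tag><posts>274</posts></item>
  </document>
</tapi>"""

print(parse_blogposttags(SAMPLE))
```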

Page 31: Metadata first, ontologies second

Tagged URL at Technorati

All <tag> elements are downloaded

To get the “title”: http://api.technorati.com/bloginfo?key=[apikey]&url=[blog url]

And <name> is recovered

Page 32: Metadata first, ontologies second

Semantic relations in WordNet

WordNet relations for tag ‘Spanish’:

[Diagram of two senses of ‘Spanish’.
Language sense: “Spanish” has hypernym “Romance, Romance language, Latinian language” and hyponym “Mexican Spanish”.
People sense: “Spanish, Spanish people”, linked by hyponym and meronym edges to “nation, land, country, a people” and “national, subject”.]

Page 33: Metadata first, ontologies second

TAG filtering algorithm

Tags are filtered by means of WordNet
If a TAG has only one meaning (synset), that meaning is assigned
If it has more than one, then

tag = \max_{tag \in tags_{folk}} \left\{ w_T \frac{\sum_{t \in T} related(tag, t)}{|T|} + w_F \frac{\sum_{t \in F} related(tag, t)}{|F|} \right\}

– T: the resource’s tag set
– F: a second tag set (the formula’s subscript reads “folk tags”; not defined further on the slide)
– related(a,b): gives 1 if a and b have some type of relation (hypernym, hyponym, holonym, meronym)
– w: weights

Several iterations are made until a meaning is found (10 iterations max.)
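The context-based part of this choice can be sketched as follows. This is a simplified illustration, not the folk2onto implementation: it scores each candidate sense only against the resource's tag set T, ignores the weights and the iteration loop, and uses a hand-made relation set instead of WordNet. The names `related`, `pick_sense`, and `RELATIONS` are hypothetical.

```python
def related(a, b, relations):
    """Return 1 if a and b are directly linked in the toy relation set, else 0."""
    return 1 if (a, b) in relations or (b, a) in relations else 0

def pick_sense(candidates, resource_tags, relations):
    """Pick the candidate sense related to the largest share of resource tags."""
    if len(candidates) == 1:
        # A tag with a single meaning keeps that meaning.
        return candidates[0]

    def score(sense):
        if not resource_tags:
            return 0.0
        hits = sum(related(sense, t, relations) for t in resource_tags)
        return hits / len(resource_tags)

    # On a tie, max() keeps the first candidate in the list.
    return max(candidates, key=score)

# Toy relations (hypothetical, not real WordNet data).
RELATIONS = {("spanish.language", "language"), ("spanish.people", "country")}

print(pick_sense(["spanish.language", "spanish.people"], ["language", "blog"], RELATIONS))
```

With the tag cloud `["language", "blog"]`, the language sense is related to one of the two tags and the people sense to none, so the language sense is chosen.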

Page 34: Metadata first, ontologies second

TAG filtering algorithm

Once senses have been discarded, synonyms are also filtered out

Words are then grouped in senses using WordNet’s relation network

The output is exported to:
– an XML file with senses
– an XML file with the tags that were discarded
– an RDF file containing WordNet’s relation network

Page 35: Metadata first, ontologies second

TAG XML file

<?xml version="1.0" encoding="UTF-8"?>
<resource>
  <tittle>PostgreSQL: Perguntas Frequentes (FAQ) sobre PostgreSQL</tittle>
  <type>Text</type>
  <format>text/html</format>
  <identifier>www.postgresql.org/docs/faqs.FAQ_brazilian.html</identifier>
  <tags>
    <tag>
      <lemma>tune</lemma>
      <idlex>236726</idlex>
    </tag>
    <tag>
      <lemma>bd</lemma>
      <idlex>5604473</idlex>
    </tag>

Page 36: Metadata first, ontologies second

TAG file without senses

<resource>
  <tittle>Wired News: The Virus That Ate DHS</tittle>
  <type>Text</type>
  <format>text/html</format>
  <identifier>www.wired.com/news/technology/0,72051-0.html?tw=rss.index</identifier>
  <tags>
    <tag>bit200f06</tag>
    <tag>group141</tag>
    <tag>dhs</tag>
    <tag>group35</tag>
    <tag>malware</tag>
    <tag>group91</tag>
    <tag>group17</tag>
    <tag>group53</tag>
    <tag>computer_security</tag>
  </tags>
</resource>

Page 37: Metadata first, ontologies second

WordNet’s sense sets

Words are grouped in sense sets
– If related(a,b) = 1, then the words are grouped in the same set
– The relation depth has to be less than or equal to 3
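One way to read the depth limit is as a bound on the number of relation hops between two words. The sketch below takes that reading as an assumption: it uses a breadth-first search over a toy relation graph and a greedy grouping pass. The graph, `within_depth`, and `group_words` are hypothetical and stand in for WordNet's relation network.

```python
from collections import deque

def within_depth(graph, start, goal, max_depth=3):
    """Breadth-first search: True if goal is reachable in at most max_depth hops."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if node == goal:
            return True
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return False

def group_words(words, graph, max_depth=3):
    """Greedily place each word in the first set that contains a related word."""
    groups = []
    for word in words:
        for group in groups:
            if any(within_depth(graph, word, other, max_depth) for other in group):
                group.append(word)
                break
        else:
            groups.append([word])
    return groups

# Toy undirected relation graph (hypothetical, not real WordNet data).
GRAPH = {
    "database": ["software"],
    "software": ["database", "program"],
    "program": ["software", "sql"],
    "sql": ["program"],
    "japan": ["country"],
    "country": ["japan"],
}

print(group_words(["database", "sql", "japan"], GRAPH))
```

In this toy graph "sql" is exactly three hops from "database", so the two share a set at depth 3 but are split if the limit is lowered to 2.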

Page 38: Metadata first, ontologies second

folk2onto: Tag Trainer

Page 39: Metadata first, ontologies second

folk2onto: Map Trainer

Page 40: Metadata first, ontologies second

folk2onto: Tag Mapper

The Mapper makes tag-element associations

These associations are made according to the senses assigned by the Distiller

Tags are mapped into Dublin Core metadata records

Page 41: Metadata first, ontologies second

folk2onto: Dublin Core

The Distiller gets 4 elements from the tag source (del.icio.us, Technorati, etc.):
– Title: the URL’s title -> from the <title> XML tag
– Type: content type -> depending on the source (here both are “Text”)
– Format: MIME class -> depending on the source (here both are text/html)
– Identifier: we take the resource’s URL

Page 42: Metadata first, ontologies second

folk2onto: Dublin Core

The Tag-Mapper deals with:
– Subject: the “topic”.
– Language: en, es, fr, de, ru...
– Coverage: when, where (about the topic)
– Rights: type of licence
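The tag-to-element assignment can be pictured with a small sketch that reproduces the Kirai.NET example from the earlier slides. The lookup tables here are hypothetical stand-ins: folk2onto derives these decisions from WordNet senses and trained mappings, not from fixed word lists, and Rights is left out because the example has no licence tag.

```python
# Hypothetical lookup tables standing in for sense-based decisions.
LANGUAGE_TAGS = {"spanish", "english", "french"}
PLACE_TAGS = {"japan", "spain"}

def map_tags(tags):
    """Assign each tag to a Dublin Core element; unmatched tags become subjects."""
    record = {"Subject": [], "Language": [], "Coverage": []}
    for tag in tags:
        key = tag.lower()
        if key in LANGUAGE_TAGS:
            record["Language"].append(tag)
        elif key in PLACE_TAGS:
            record["Coverage"].append(tag)
        else:
            record["Subject"].append(tag)
    return record

print(map_tags(["blog", "japan", "personal", "spanish", "geek"]))
```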

Page 43: Metadata first, ontologies second

folk2onto: mapping formulae

When a TAG has one mapping, that mapping is used

If it has more than one:
[Formula lost in extraction. Surviving terms: a sum over m ∈ M of related(t, m), divided by |M|, with weight w_rel]

If it has no mapping, then:
[Formula lost in extraction. Surviving terms: count(t), SameOrSon(T, M), weight w_o, index i]

Page 44: Metadata first, ontologies second

folk2onto: mapping file

<rdf:RDF
    xmlns:j.0="http://purl.org/dc/elements/1.1"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:nodeID="A0">
    <rdf:type rdf:resource="http://purl.org/dc/elements/1.1identifier"/>
    <j.0:identifier>www.postgresql.org/docs/faqs.FAQ_brazilian.html</j.0:identifier>
    <j.0:type>Text</j.0:type>
    <j.0:format>text/html</j.0:format>
    <j.0:tittle>PostgreSQL: Perguntas Frequentes (FAQ) sobre PostgreSQL</j.0:tittle>
    <j.0:subject>database</j.0:subject>
    <j.0:subject>performance</j.0:subject>
    <j.0:subject>bd</j.0:subject>
  </rdf:Description>
</rdf:RDF>
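A record like this can be produced with the standard library alone. The sketch below is an illustration, not the tool's serializer: `to_rdf` is a hypothetical helper, and it uses the standard Dublin Core namespace URI (ending in a slash) and the standard element name `title`, which differ from the output shown on the slide.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def to_rdf(record):
    """Serialize a dict of Dublin Core element -> value(s) as RDF/XML text."""
    ET.register_namespace("dc", DC_NS)
    ET.register_namespace("rdf", RDF_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(root, f"{{{RDF_NS}}}Description")
    for element, values in record.items():
        if isinstance(values, str):
            values = [values]  # allow a single string as shorthand
        for value in values:
            ET.SubElement(desc, f"{{{DC_NS}}}{element}").text = value
    return ET.tostring(root, encoding="unicode")

print(to_rdf({
    "identifier": "www.postgresql.org/docs/faqs.FAQ_brazilian.html",
    "type": "Text",
    "format": "text/html",
    "subject": ["database", "performance", "bd"],
}))
```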

Page 45: Metadata first, ontologies second

Mapping trainer

Page 46: Metadata first, ontologies second

folk2onto: 6 tests (A-F)

Experiment A: Selecting random synsets for the tags.
Experiment B: Without any limit on the semantic relation depth. Only taking into account the trained synsets (frec=0, wordnet=0, trained=1).
Experiment C: Without any limit on the semantic relation depth. Only taking into account the context (frec=0, wordnet=1, trained=0).
Experiment D: Without any limit on the semantic relation depth. Taking the context and the trained synsets into account (frec=0, wordnet=0.4, trained=0.6).
Experiment E: Without any limit on the semantic relation depth. Taking all three components of the equation (familiarity, context and trained synsets) into account (frec=0.1, wordnet=0.3, trained=0.6).
Experiment F: Limiting the semantic relation depth to 3 and taking the context and the trained synsets into account (frec=0, wordnet=0.4, trained=0.6).

Page 47: Metadata first, ontologies second

folk2onto: tests output

Experiment   Correct synsets   Erroneous synsets
A            706 (32.5%)       1466 (67.5%)
B            1594 (73.4%)      578 (26.6%)
C            1199 (55.2%)      973 (44.8%)
D            1492 (68.7%)      680 (31.3%)
E            1349 (62.1%)      823 (37.9%)
F            1894 (87.2%)      278 (12.8%)
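The percentages follow directly from the counts: every experiment covers the same 2172 synset assignments. A short check, with `RESULTS` and `percentages` as illustrative names:

```python
# (correct, erroneous) synset counts per experiment, copied from the table.
RESULTS = {
    "A": (706, 1466),
    "B": (1594, 578),
    "C": (1199, 973),
    "D": (1492, 680),
    "E": (1349, 823),
    "F": (1894, 278),
}

def percentages(correct, erroneous):
    """Return the correct and erroneous shares as percentages, to one decimal."""
    total = correct + erroneous
    return round(100 * correct / total, 1), round(100 * erroneous / total, 1)

for name, (correct, erroneous) in RESULTS.items():
    print(name, correct + erroneous, percentages(correct, erroneous))
```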

Page 48: Metadata first, ontologies second

folk2onto: tests output

[Bar chart: correct vs. erroneous synsets for Exp. A to Exp. F; vertical axis from 0 to 2500]

Page 49: Metadata first, ontologies second

Open issues

Tag filtering through WordNet
– blog, wiki
– xml, rdf, rss
– wordpress, tuenti, flickr
– social, open

“tags can be about so many things – mapping to Dublin Core is a weak choice”

Mappings
– Coverage: Japan
– Language: Spanish

Learning the right synset of e.g. "jaguar"
– "vehicle", "video game console", or "cat of prey"
– "<dc:subject>Jaguar</dc:subject>"

Word-sense disambiguation
– tag-category disambiguation

Page 50: Metadata first, ontologies second

That was all about CollOnBus/folk2onto

Thank you very much! Any questions?