Deriving human readable labels from sparql queries

KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association

INSTITUTE FOR APPLIED INFORMATICS AND FORMAL DESCRIPTION METHODS

www.kit.edu

Deriving Human-Readable Labels from SPARQL Queries

Basil Ell, Denny Vrandečić, and Elena Simperl

7th International Conference on Semantic Systems, Graz

7 September 2011

KIT – Karlsruhe Institute of Technology Institute for Applied Informatics and Formal Description Methods

2 31.03.2014 Basil Ell – Deriving Human-Readable Labels from SPARQL queries

Outline

Motivation

Human-readability of the LOD cloud

Method

Evaluation

Conclusions



Introduction

Entities are identified by URIs, such as

http://de.dbpedia.org/resource/Graz

http://rdf.freebase.com/ns/m.043j22x

Human-readable names can be provided, e.g., using the property rdfs:label:

dbpedia:Austria rdfs:label "Österreich"@de


Motivation – Why are labels necessary? Scenario: linked data browsing


Is this meaningful to human users? [SIGMA]



Human-Readability of the LOD Cloud

BTC2010 Corpus [BTC2010]

3,167,799,445 N-Triples statements

159,177,123 distinct subjects

137,156,213 (86.17%) have no value for any of the properties rdfs:label, rdfs:comment, dc:title, and foaf:name.

61.8% of the analyzed non-information resources have no label (regarding 36 labeling properties) [Ell et al. 2011]
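The coverage figure above can be illustrated with a minimal sketch. It assumes the triples are already parsed into (subject, predicate, object) tuples; the function name, the sample data, and the choice of the Dublin Core elements namespace for dc:title are assumptions made for illustration, not details of the BTC2010 analysis.

```python
# Labeling properties named on the slide, written as full URIs.
# Assumption: dc:title refers to the Dublin Core elements 1.1 namespace.
LABEL_PROPERTIES = {
    "http://www.w3.org/2000/01/rdf-schema#label",
    "http://www.w3.org/2000/01/rdf-schema#comment",
    "http://purl.org/dc/elements/1.1/title",
    "http://xmlns.com/foaf/0.1/name",
}


def unlabeled_share(triples):
    """Return (unlabeled subjects, distinct subjects) for parsed triples."""
    subjects = set()
    labeled = set()
    for subject, predicate, _obj in triples:
        subjects.add(subject)
        if predicate in LABEL_PROPERTIES:
            labeled.add(subject)
    return len(subjects - labeled), len(subjects)


# Hypothetical sample data: two of four subjects carry a labeling property.
sample = [
    ("ex:a", "http://www.w3.org/2000/01/rdf-schema#label", "A"),
    ("ex:a", "ex:p", "ex:b"),
    ("ex:b", "ex:p", "ex:c"),
    ("ex:c", "http://xmlns.com/foaf/0.1/name", "C"),
    ("ex:d", "ex:p", "x"),
]
unlabeled, total = unlabeled_share(sample)
print(f"{unlabeled} of {total} subjects have no label")
```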




Main Idea

Can we automatically derive labels for entities by analyzing SPARQL queries?

Example: the variable name station can be used as a label for

http://dbpedia.org/ontology/RadioStation


PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?station WHERE {

?station rdf:type dbo:RadioStation

}



Analyzed data set

USEWOD2011 corpus [USEWOD2011]

Contains log files from DBpedia and Semantic Web Dog Food (SWDF)

Distinct parsable SPARQL SELECT queries:

1,212,932 (DBpedia)

195,641 (SWDF)



Classification of variable names

Class: Description

short: String length up to 2 characters. Common: s, p, o, x.

stop: Known non-short strings that cannot be used as labels, e.g. subject, instance, uri.

lang: A non-stop string that belongs to a natural language or that consists of separated words of a natural language, e.g. Artist and RadioStation. Checked for the languages {de, en, es, fr, it} using the [Corpex] web service. (The Corpex dataset consists of all words and their frequencies as extracted and counted from instances of Wikipedia in multiple languages. [Vrandecic et al. 2011])

nolang: Variable names that are neither short, nor stop, nor lang.
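The four classes can be sketched as a small classifier. This is an illustration under stated assumptions: the set KNOWN_WORDS is a hypothetical stand-in for the Corpex lookup (the real web service is not called here), the stop list contains only the three examples from the slide, and the separator and camel-case handling is one plausible reading of "separated words".

```python
import re

# Stop words: only the examples given on the slide.
STOP_WORDS = {"subject", "instance", "uri"}

# Hypothetical stand-in for the Corpex word lookup (assumption).
KNOWN_WORDS = {"artist", "radio", "station", "population", "place"}


def split_words(name):
    """Split a variable name at separators and camel-case boundaries."""
    spaced = re.sub(r"[_\-]+", " ", name)
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", spaced)
    return spaced.lower().split()


def classify_variable(name):
    """Return one of 'short', 'stop', 'lang', 'nolang' for a variable name."""
    if len(name) <= 2:
        return "short"
    if name.lower() in STOP_WORDS:
        return "stop"
    words = split_words(name)
    if words and all(word in KNOWN_WORDS for word in words):
        return "lang"
    return "nolang"


for variable in ["s", "uri", "Artist", "RadioStation", "xyzzy42"]:
    print(variable, classify_variable(variable))
```

The checks run in the order short, stop, lang, so that each class excludes the ones before it, as in the class descriptions.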




Classification of triple patterns

Triple pattern classes P = {RRV, RVR, VRL, ...}

Each letter names the kind of term in subject, predicate, and object position: R is a resource, V is a variable, L is a literal

Features such as UNION and OPTIONAL are ignored


SELECT ... WHERE {

...

dbpedia:Karlsruhe dbo:populationTotal ?population .

...

}

RRV pattern
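A minimal sketch of this classification, assuming each triple pattern is already available as three term strings in SPARQL surface syntax. The function names are illustrative, and the literal check is simplified to quoted literals only (numeric and boolean shorthand literals are not handled).

```python
def term_class(term):
    """Map one term of a triple pattern to R, V, or L."""
    if term.startswith(("?", "$")):
        return "V"  # SPARQL variables start with ? or $
    if term.startswith(('"', "'")):
        return "L"  # quoted literal, possibly with language tag or datatype
    return "R"      # everything else is treated as a resource (simplification)


def pattern_class(subject, predicate, obj):
    """Return the triple pattern class, e.g. 'RRV'."""
    return term_class(subject) + term_class(predicate) + term_class(obj)


print(pattern_class("dbpedia:Karlsruhe", "dbo:populationTotal", "?population"))
print(pattern_class("?station", "rdf:type", "dbo:RadioStation"))
```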



Classification of triple patterns (2)


[Figure: triple pattern classes, DBpedia]



DBpedia – top query patterns (pruned n >= 5000)


Graph pattern classes visualized as hypergraph (n: number of instances; TP: name of triple pattern).

Example: 8312 queries consist of one VVL triple and three VRV triples.



SWDF – top query patterns (pruned n >= 1000)


Graph pattern classes visualized as hypergraph (n: number of instances; TP: name of triple pattern).



Derivation pattern 1: 1 x RRV (31.75% of all DBpedia queries)

Assumption: V' is a human-readable label for property R2 iff local_name(R2) = V and lang(V).

V' can be derived from V by substituting separators and splitting camel-cased words into constituents.


<http://dbpedia.org/page/NASA> R1

<http://dbpedia.org/property/agencyName> R2

?agencyName V
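Derivation pattern 1 can be sketched as follows. This is an illustration, not the authors' implementation: the helper names are hypothetical, the lang check is passed in as a callable because the Corpex lookup is not reproduced here, and lowercasing the resulting label is an added assumption.

```python
import re


def local_name(uri):
    """Return the part of a URI after the last '/' or '#'."""
    return re.split(r"[#/]", uri.strip("<>"))[-1]


def to_label(variable):
    """Substitute separators and split camel-cased words (V -> V')."""
    spaced = re.sub(r"[_\-]+", " ", variable)
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", spaced)
    return spaced.lower().strip()  # lowercasing is an assumption


def derive_property_label(r2, variable, is_lang):
    """Return a label for property r2 from an RRV pattern, or None."""
    name = variable.lstrip("?$")
    if local_name(r2) == name and is_lang(name):
        return to_label(name)
    return None


label = derive_property_label(
    "<http://dbpedia.org/property/agencyName>",
    "?agencyName",
    is_lang=lambda name: True,  # placeholder for the lang classification
)
print(label)
```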



Derivation pattern 2: Any graph with VRR (22.32% of all DBpedia queries)

Assumption: V' is a human-readable label for class R2 iff lang(V) and R1 = rdf:type

Example:

?place rdf:type dbo:Location


?paper V

<http://data.semanticweb.org/ns/swc/ontology#isPartOf> R1

<http://data.semanticweb.org/conference/www/2009/proceedings> R2
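Derivation pattern 2 can be sketched in the same way. The sketch assumes triple patterns are given as (subject, predicate, object) strings with full URIs; the function names are hypothetical, and the lang check is again a caller-supplied placeholder rather than the Corpex-based classification.

```python
import re

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def to_label(variable):
    """Substitute separators and split camel-cased words (V -> V')."""
    spaced = re.sub(r"[_\-]+", " ", variable)
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", spaced)
    return spaced.lower().strip()  # lowercasing is an assumption


def derive_class_labels(patterns, is_lang):
    """Collect candidate labels per class URI from VRR patterns with rdf:type."""
    labels = {}
    for subject, predicate, obj in patterns:
        if not subject.startswith("?"):
            continue  # subject must be a variable (V)
        if predicate != RDF_TYPE:
            continue  # R1 must be rdf:type
        if obj.startswith(("?", '"')):
            continue  # object must be a resource (R2)
        name = subject[1:]
        if is_lang(name):
            labels.setdefault(obj, set()).add(to_label(name))
    return labels


patterns = [
    ("?place", RDF_TYPE, "http://dbpedia.org/ontology/Location"),
    ("?paper",
     "http://data.semanticweb.org/ns/swc/ontology#isPartOf",
     "http://data.semanticweb.org/conference/www/2009/proceedings"),
]
print(derive_class_labels(patterns, is_lang=lambda name: True))
```

Only the first pattern yields a label, because the second VRR pattern does not use rdf:type as its predicate.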



Evaluation – 1 x RRV


1,366,363 triples of class RRV

549,093 cases: local_name(R2) = V

817,269 cases: local_name(R2) ≠ V

226 pairs (URI, guessed label)

54.5% correct: sufficiently similar to existing labels

14% correct: manual evaluation

9.1% correct within a given context (location for dbo:residence)

22.4% wrong (contained for dbprop:creator)

68%



Evaluation – Any graph with VRR


80,455 triples of class VRR

549,093 cases: local_name(R2) = V

60 distinct URIs, 36 labels

25% correct: sufficiently similar to existing labels

39.975% correct: manual evaluation

35.025% wrong (scientist for dbo:SoccerPlayer)

64.975% correct in total (25% + 39.975%)



Conclusions

Approach for automatically deriving labels from SPARQL query logs

Acceptable precision: most derived labels matched the already existing labels (atypical datasets)

Labels derived from variable names tend to be less specific

Labels were derived for terminological entities (properties and classes), not for instances.




References & Acknowledgements

[BTC2010] http://km.aifb.kit.edu/projects/btc-2010/

[Ell et al. 2011] Labels in the Web of Data, ISWC2011, to appear.

[SIGMA] http://sig.ma/search?q=Sidney+Bechet

[USEWOD2011] http://data.semanticweb.org/usewod/2011/challenge.html

[Corpex] http://km.aifb.kit.edu/sites/corpex/

[Vrandecic et al. 2011]


Part of this work has been carried out in the framework of the German Research Foundation (DFG) project entitled: "Entwicklung einer Virtuellen Forschungsumgebung für die Historische Bildungsforschung mit Semantischer Wiki-Technologie - Semantic MediaWiki for Collaborative Corpora Analysis" (INST 5580/1-1), in the domain of "Scientific Library Services and Information Systems" (LIS).



THANK YOU FOR YOUR ATTENTION




BACKUP SLIDES




Triple pattern classes (SWDF)


