KIT University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association INSTITUTE FOR APPLIED INFORMATICS AND FORMAL DESCRIPTION METHODS www.kit.edu Deriving Human-Readable Labels from SPARQL Queries Basil Ell , Denny Vrandečić, and Elena Simperl 7th International Conference on Semantic Systems, Graz 7 September 2011

Deriving human readable labels from sparql queries

DESCRIPTION

This presentation was given at I-SEMANTICS 2011, 7th International Conference on Semantic Systems, Graz, and is related to the publication of the same title. Over 80% of entities on the Semantic Web lack a human-readable label. This hampers the ability of any tool that uses linked data to offer a meaningful interface to human users. We argue that methods for deriving human-readable labels are essential in order to allow the usage of the Web of Data. In this paper we explore, implement, and evaluate a method for deriving human-readable labels based on the variable names used in a large corpus of SPARQL queries that we built from a set of log files. We analyze the structure of the SPARQL graph patterns and offer a classification scheme for graph patterns. Based on this classification, we identify graph patterns that allow us to derive useful labels. We also provide an overview of the current usage of SPARQL in the newly built corpus. The publication is available at http://www.aifb.kit.edu/images/9/9d/Sparql_queries.pdf

Citation preview

Page 1: Deriving human readable labels from sparql queries

KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association

INSTITUTE FOR APPLIED INFORMATICS AND FORMAL DESCRIPTION METHODS

www.kit.edu

Deriving Human-Readable Labels from SPARQL Queries

Basil Ell, Denny Vrandečić, and Elena Simperl

7th International Conference on Semantic Systems, Graz

7 September 2011

Page 2: Deriving human readable labels from sparql queries

KIT – Karlsruhe Institute of Technology Institute for Applied Informatics and Formal Description Methods

2 31.03.2014 Basil Ell – Deriving Human-Readable Labels from SPARQL queries

Outline

Motivation

Human-readability of the LOD cloud

Method

Evaluation

Conclusions

Page 3: Deriving human readable labels from sparql queries

Introduction

Entities are identified by URIs, such as

http://de.dbpedia.org/resource/Graz

http://rdf.freebase.com/ns/m.043j22x

Human-readable names can be provided, e.g., using the property rdfs:label:

dbpedia:Austria rdfs:label "Österreich"@de

Page 4: Deriving human readable labels from sparql queries

Motivation – Why are labels necessary?

Scenario: linked data browsing [SIGMA]

Is this meaningful to human users?

Page 5: Deriving human readable labels from sparql queries

Human-Readability of the LOD Cloud

BTC2010 Corpus [BTC2010]

3,167,799,445 ntriples

159,177,123 distinct subjects

137,156,213 (86.17%) have no value for any of the properties rdfs:label, rdfs:comment, dc:title, and foaf:name.

61.8% of the analyzed non-information resources have no label (regarding 36 labeling properties) [Ell et al. 2011]

Page 6: Deriving human readable labels from sparql queries

Main Idea

Can we automatically derive labels for entities by analyzing SPARQL queries?

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?station WHERE {
  ?station rdf:type dbo:RadioStation
}

Here, station can be used as a label for http://dbpedia.org/ontology/RadioStation

Page 7: Deriving human readable labels from sparql queries

Analyzed data set

USEWOD2011 corpus [USEWOD2011]

Contains log files from DBpedia and Semantic Web Dog Food (SWDF)

Distinct parsable SPARQL SELECT queries:

1,212,932 (DBpedia)

195,641 (SWDF)

Page 8: Deriving human readable labels from sparql queries

Classification of variable names

Class    Description

short    String length up to 2 chars. Common: s, p, o, x.

stop     Known non-short strings that cannot be used as labels, e.g. subject, instance, uri.

lang     A non-stop string that belongs to a natural language or that consists of separated words of a natural language, e.g. Artist and RadioStation. Checked for the languages {de, en, es, fr, it} using the [Corpex] web service. (The Corpex dataset consists of all words and their frequencies as extracted and counted from instances of Wikipedia in multiple languages [Vrandecic et al. 2011].)

nolang   Variable names that are neither short, nor stop, nor lang.
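The four classes above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: STOP_WORDS holds just the three examples named on the slide, and KNOWN_WORDS is a hypothetical toy vocabulary standing in for the Corpex lookup, which is not reproduced here.

```python
import re

# Assumption: only the stop words named on the slide; the real list is longer.
STOP_WORDS = {"subject", "instance", "uri"}
# Assumption: toy vocabulary replacing the Corpex web service lookup.
KNOWN_WORDS = {"artist", "radio", "station", "place", "population"}


def split_words(name):
    """Split a variable name at separators and camel-case boundaries."""
    spaced = re.sub(r"[_\-.]+", " ", name)
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", spaced)
    return spaced.split()


def classify_variable(name):
    """Return one of 'short', 'stop', 'lang', 'nolang' for a variable name."""
    if len(name) <= 2:
        return "short"
    if name.lower() in STOP_WORDS:
        return "stop"
    words = split_words(name)
    if words and all(word.lower() in KNOWN_WORDS for word in words):
        return "lang"
    return "nolang"
```

The order of the checks mirrors the table: stop is only tested for non-short names, and lang only for non-stop names.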

Page 9: Deriving human readable labels from sparql queries

Classification of triple patterns

Triple pattern classes P = {RRV, RVR, VRL, ...}

R is a resource, V is a variable, L is a literal

Ignoring features such as UNION, OPTIONAL etc.

SELECT ... WHERE {
  ...
  dbpedia:Karlsruhe dbo:populationTotal ?population .
  ...
}

RRV pattern
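A minimal sketch of how a single triple pattern could be mapped to its class name. It assumes the three terms are already tokenized strings, that variables start with ? or $, and that literals start with a quote or a digit; other literal forms (signed numbers, booleans) are not handled. The function names are hypothetical.

```python
def term_kind(term):
    """Return 'V' for a variable, 'L' for a literal, 'R' for a resource."""
    if not term:
        raise ValueError("empty term")
    if term.startswith(("?", "$")):
        return "V"
    # Assumption: literals are quoted strings or start with a digit.
    if term.startswith(('"', "'")) or term[0].isdigit():
        return "L"
    return "R"


def triple_pattern_class(subject, predicate, obj):
    """Concatenate the kinds of subject, predicate and object, e.g. 'RRV'."""
    return term_kind(subject) + term_kind(predicate) + term_kind(obj)
```

For the example above, triple_pattern_class("dbpedia:Karlsruhe", "dbo:populationTotal", "?population") yields the class RRV.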

Page 10: Deriving human readable labels from sparql queries

Classification of triple patterns (2)

DBpedia

Page 11: Deriving human readable labels from sparql queries

DBpedia – top query patterns (pruned n >= 5000)

8312 queries consist of one VVL triple and three VRV triples

[Figure: graph pattern classes visualized as hypergraph. n: number of instances; TP: name of triple pattern]
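One way to count how often a combination such as "one VVL triple and three VRV triples" occurs is to reduce each query to an order-independent signature of its triple pattern classes. The sketch below assumes each query is already given as a list of class names; it ignores how variables are shared between triples, so it does not capture the full hypergraph structure shown on the slide. Function names are hypothetical.

```python
from collections import Counter


def graph_pattern_class(triple_classes):
    """Order-independent signature: sorted (triple class, count) pairs."""
    return tuple(sorted(Counter(triple_classes).items()))


def count_graph_pattern_classes(queries):
    """Count how many queries fall into each graph pattern class."""
    return Counter(graph_pattern_class(classes) for classes in queries)
```

Pruning as on the slide would then amount to keeping only the signatures whose count reaches the chosen threshold n.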

Page 12: Deriving human readable labels from sparql queries

SWDF – top query patterns (pruned n >= 1000)

[Figure: graph pattern classes visualized as hypergraph. n: number of instances; TP: name of triple pattern]

Page 13: Deriving human readable labels from sparql queries

Derivation pattern 1: 1 x RRV (31.75% of all DBpedia queries)

Assumption: V′ is a human-readable label for property R2 iff local_name(R2) = V and lang(V).

V′ can be derived from V by substituting separators and splitting camel-cased words into constituents.

Example RRV triple pattern:

R1: <http://dbpedia.org/page/NASA>

R2: <http://dbpedia.org/property/agencyName>

V: ?agencyName
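The assumption as stated on this slide can be sketched as follows. The helper names, the set of separators (underscore, hyphen, dot), the lowercasing of the result, and the default is_lang check that accepts every name are all assumptions made for illustration; in the slides, lang(V) is decided with Corpex.

```python
import re


def local_name(uri):
    """Part of the URI after the last '#' or '/'."""
    return re.split(r"[#/]", uri.rstrip("#/"))[-1]


def to_label(name):
    """Substitute separators and split camel-cased words (lowercased here)."""
    spaced = re.sub(r"[_\-.]+", " ", name)
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", spaced)
    return " ".join(spaced.split()).lower()


def derive_property_label(property_uri, variable, is_lang=lambda name: True):
    """Return V' if local_name(R2) = V and lang(V) holds, else None."""
    name = variable.lstrip("?$")
    if local_name(property_uri) == name and is_lang(name):
        return to_label(name)
    return None
```

Applied to the example, the variable ?agencyName and the property http://dbpedia.org/property/agencyName give the candidate label "agency name".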

Page 14: Deriving human readable labels from sparql queries

Derivation pattern 2: Any graph with VRR (22.32% of all DBpedia queries)

Assumption: V′ is a human-readable label for class R2 iff lang(V) and R1 = rdf:type.

Example:

?place rdf:type dbo:Location

Example VRR triple pattern:

V: ?paper

R1: <http://data.semanticweb.org/ns/swc/ontology#isPartOf>

R2: <http://data.semanticweb.org/conference/www/2009/proceedings>
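A minimal sketch of this second assumption, for illustration only. It assumes the predicate is given as a full URI, uses a hypothetical is_lang check that accepts every name by default (standing in for the Corpex lookup), and lowercases the derived label as a design choice not taken from the slides.

```python
import re

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def to_label(name):
    """Substitute separators and split camel-cased words (lowercased here)."""
    spaced = re.sub(r"[_\-.]+", " ", name)
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", spaced)
    return " ".join(spaced.split()).lower()


def derive_class_label(variable, predicate_uri, class_uri, is_lang=lambda name: True):
    """Return (class URI, label) if R1 = rdf:type and lang(V), else None."""
    name = variable.lstrip("?$")
    if predicate_uri != RDF_TYPE or not is_lang(name):
        return None
    return (class_uri, to_label(name))
```

The ?place example yields a label for dbo:Location, while the ?paper triple is a VRR pattern whose predicate is not rdf:type, so no class label is derived from it.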

Page 15: Deriving human readable labels from sparql queries

Evaluation – 1 x RRV

1,366,363 triples of class RRV

549,093 cases: local_name(R2) = V

817,269 cases: local_name(R2) ≠ V

226 pairs (URI, guessed label)

54.5% correct: sufficiently similar to existing labels

14% correct: manual evaluation

9.1% correct within a given context (location for dbo:residence)

22.4% wrong (contained for dbprop:creator)

68%

Page 16: Deriving human readable labels from sparql queries

Evaluation – Any graph with VRR

80,455 triples of class VRR

549,093 cases: local_name(R2) = V

60 distinct URIs, 36 labels

25% correct: sufficiently similar to existing labels

39.975% correct: manual evaluation

35.025% wrong (scientist for dbo:SoccerPlayer)

64.975%

Page 17: Deriving human readable labels from sparql queries

Conclusions

Approach for automatically deriving labels

Acceptable precision: most derived labels matched the already existing labels (atypical datasets)

Derived variable names less specific

Derived labels for terminological entities (properties and classes), not for instances.

Page 18: Deriving human readable labels from sparql queries

References & Acknowledgements

[BTC2010] http://km.aifb.kit.edu/projects/btc-2010/

[Ell et al. 2011] Labels in the Web of Data, ISWC2011, to appear.

[SIGMA] http://sig.ma/search?q=Sidney+Bechet

[USEWOD2011] http://data.semanticweb.org/usewod/2011/challenge.html

[Corpex] http://km.aifb.kit.edu/sites/corpex/

[Vrandecic et al. 2011]

Part of this work has been carried out in the framework of the German Research Foundation (DFG) project entitled: "Entwicklung einer Virtuellen Forschungsumgebung für die Historische Bildungsforschung mit Semantischer Wiki-Technologie - Semantic MediaWiki for Collaborative Corpora Analysis" (INST 5580/1-1), in the domain of "Scientific Library Services and Information Systems" (LIS).

Page 19: Deriving human readable labels from sparql queries

THANK YOU FOR YOUR ATTENTION

Page 20: Deriving human readable labels from sparql queries

BACKUP SLIDES

Page 21: Deriving human readable labels from sparql queries

Triple pattern classes (SWDF)
