Faceted Exploration of Multiple RDF Data Sources Using SPARQL


by

Hanief Bastian (2217547)

Submitted in partial fulfillment

of the requirements

for the degree of

Master of Science in Computer Engineering

Supervisors:

Prof. Dr.-Ing. Jürgen Ziegler

Dipl.-Inform. Philipp Heim

University of Duisburg-Essen

Faculty of Engineering

Department Computer and Cognitive Sciences

Institute of Interactive Systems and Interaction Design

6 May 2009



Abstract

Many applications that access the Semantic Web are structured in a Three-Tier Architecture consisting of a Client Tier, a Server Tier, and a Data Tier. With the growing number of SPARQL endpoints, parts of the data access logic have moved to the Data Tier. This allows the query building process to be shifted to the Client Tier, which reduces the resource and performance cost of accessing information contained in the Semantic Web.

In this thesis, we describe the transformation from a Three-Tier Architecture to a Two-Tier Architecture using the example of gFacet, a tool for graph-based faceted access to the Semantic Web, and we support the capabilities of gFacet by generating efficient SPARQL queries on the client side. The former Three-Tier Architecture of gFacet did not access the Semantic Web efficiently via a SPARQL endpoint, mainly because the intermediate processing in the Server Tier could increase the total execution time. This was the reason to reconstruct both the architecture and the whole query building process of gFacet: we moved all functionality of the application server into the Client Tier and improved the performance of the queries in order to support efficient client-side access to any SPARQL endpoint, and thereby to the information contained in the Semantic Web.

We provide all the queries that allow faceted exploration of a large RDF dataset; in this thesis, we use an RDF dataset released by DBpedia. All the queries that allow gFacet to search for a certain concept, to retrieve and filter information, and to change the point of view on the information are described in detail and evaluated with regard to their performance. We implement two different approaches for retrieving a large number of instances while still enabling paging through them: retrieving all instances at once to the client using standard SPARQL, and retrieving only a subset of the possible instances using SPARQL extensions.

We extend the functionality of gFacet by making it possible to explore more than one RDF source. As an additional RDF dataset, we use RDF data released by MusicBrainz. With this feature, gFacet can find more information by exploring the additional data source.

Keywords: SPARQL, gFacet, RDF, DBpedia, faceted exploration, Semantic Web,

SPARQL query optimization, multiple SPARQL endpoints access



Acknowledgements

First, I am grateful to Allah SWT for the health, the ideas, and everything else that made this thesis possible. I also offer my sincerest gratitude to Prof. Dr.-Ing. Jürgen Ziegler and Dipl.-Inform. Philipp Heim. Without their guidance, great effort, and great ideas, this thesis would not have been completed or written. I simply could not wish for more supportive and friendlier supervisors.

I wish to express my warm thanks to my brothers at home, Mas Andy, Dicky, and Evan. The encouragement they always give me means a great deal to me.

Lastly, and most importantly, I wish to thank my parents, Papa Eddy Purwanto and Mama Latifah. They bore me, raised me, supported me, and loved me. Thank you for the great opportunity you gave me, so that I can stand here even though I am far away from you both. To them I dedicate this thesis.

Hanief Bastian

May 2009

Duisburg-Germany

Contents

Chapter 1: Introduction
  1.1. Motivation
  1.2. Starting Point
    1.2.1. Semantic Web
    1.2.2. Resource Description Framework (RDF)
    1.2.3. SPARQL
    1.2.4. DBpedia Project
    1.2.5. MusicBrainz
    1.2.6. DBTune
    1.2.7. Faceted Navigation
    1.2.8. gFacet Project
  1.3. Task Description
  1.4. Related Works
  1.5. Thesis Outline

Chapter 2: The Architecture
  2.1. The Strategy
  2.2. Client-side gFacet Architecture
    2.2.1. SPARQL Query Dispatcher
    2.2.2. SPARQL Result Parser
    2.2.3. SPARQL Query Builder

Chapter 3: Browsing DBpedia
  3.1. Dispatching a Query to DBpedia
  3.2. Exploring DBpedia with gFacet
    3.2.1. Searching for Concepts
    3.2.2. Selecting the Initial Node
    3.2.3. Expanding the Graph
    3.2.4. Filtering
    3.2.5. Result Set Pivoting
  3.3. Evaluation
  3.4. Time Measurements
  3.5. Correctness Measurements

Chapter 4: Multiple Sources
  4.1. The Strategy
  4.2. The Obstacles
  4.3. Finding the Equivalent Data
  4.4. Transforming the URIs
    4.4.1. Zitgist URI to DBTune URI Conversion
    4.4.2. MusicBrainz Scheme to Music Ontology Mapping
  4.5. Implementation

Chapter 5: Conclusions & Future Works
  5.1. Conclusions
  5.2. Future Works
    5.2.1. Autocompletion Text Search
    5.2.2. Searching for Instances
    5.2.3. Automatic Data Interlinking



List of Figures

1.1. RDF graph representation
1.2. RDF graph example
1.3. Linking Open Data cloud
1.4. gFacet Architecture
1.5. gFacet user interface
2.1. gFacet Two-Tier Architecture
2.2. SPARQLQuery (the dispatcher) class diagram
2.3. RDFTerm class diagram
2.4. gFacet Data Flow Diagram
3.1. User actions flow while browsing DBpedia
3.2. Searching a concept
3.3. Opening the initial node
3.4. The Relation List
3.5. Constructing a pair of predicate and related concept
3.6. Opening a new node by selecting a relation
3.7. A chain of 4 nodes is created after user gradually expanding the nodes
3.8. Model a chain of 4 nodes describing parent-child characteristic
3.9. A constraint selected in direct child of result set triggers a basic filtering
3.10. Hierarchical filtering
3.11. Union filtering
3.12. Intersection filtering
3.13. A chain of 4 nodes after pivoting
3.14. A model of a chain of 4 nodes after pivoting
3.15. Average elapsed time for nodes with certain amount of instances
3.16. A chain of 4 nodes for evaluation
3.17. Sample dataset: A chain of 4 nodes with Britpop Musical Group as the result set
3.18. Relationships diagram for the sample dataset
4.1. Data sets that have been published and interlinked by Linking Open Data project (March 2009)
4.2. gFacet model of multi sources exploration
4.3. RDF graph of a subject with three equivalent resources
4.4. Two equivalent resources from (a) dbPedia, (b) MusicBrainz
4.5. Extracting the class and UUID of MusicBrainz to be mapped to DBTune URI scheme
4.6. Data flow diagram of gFacet multi sources exploration
4.7. Relation list with sameAs:musicBrainz element
4.8. A chain of two nodes from distinct sources



List of Listings

1.1. N3 statements
1.2. RDF/XML serialization
1.3. Simple query obtaining a name and mbox of a FOAF profile
1.4. A query obtaining a name and mbox of a FOAF profile with value constraint
1.5. A query obtaining a name and mbox of a FOAF profile with OPTIONAL clause
1.6. Simple query getting a name and mbox of a FOAF profile with group graph pattern
1.7. Example of COUNT clause
1.8. Example of subquery
1.9. SPARQL query results in XML format
2.1. HTTP trace of query dispatching
2.2. Native ActionScript datatypes representing the query results
3.1. Dispatching a SPARQL query to DBpedia
3.2. Query for concept searching
3.3. Query to retrieve all instances
3.4. Query to retrieve a subset of instances
3.5. Query for obtaining a total number of possible solutions
3.6. Query for obtaining all the pairs of predicate and concept
3.7. Query for a chain of 4 nodes; the result set is the initial node
3.8. Query for basic filtering. Constraint: footballer Thomas Hitzlsperger
3.9. Query for hierarchical filtering. Constraint: English Club C4
3.10. Query for union filtering. Constraint: English Club C2 and C4
3.11. Query for intersection filtering. Constraint: English Club D1 and C4
3.12. Query for result set pivoting. Result set: node C
4.1. Query to get all relations and count the interlinked data of a concept
4.2. Dispatching a SPARQL query to DBTune for MusicBrainz dataset
4.3. Query to get the interlinked data from DBTune for MusicBrainz dataset
5.1. Query to search for instances



List of Tables

1.1. SPARQL result for simple query
1.2. SPARQL result with OPTIONAL clause
3.1. The instances of each evaluation node
3.2. Average of elapsed time between HTTP request and response for a chain of 4 nodes
3.3. Result of OR-operation test cases
3.4. Result of AND-operation test cases
4.1. Comparison of MusicBrainz entity types and Music Ontology classes



Chapter 1

Introduction

1.1. Motivation

The Semantic Web [1] was introduced to extend the power of the current Web by making its content understandable to machines, thus allowing machines to perform automated information gathering and to obtain more meaningful results. This requires the semantics of Web content to be described in a machine-readable form using formal languages like RDF [14] and OWL [20]. These languages allow Web content to be assigned to semantically defined concepts and related to one another by semantically defined relationships. Annotated in this way, information can be found more efficiently and with more certainty.

With the steady growth of the Semantic Web, more and more annotated information is published on the Web, leading to a growing number of RDF datasets. Even though RDF data is originally meant to be read by machines, information about the meaning of Web content and its interrelations can be highly valuable for humans, too. However, there is no defined method for rendering RDF data in a way that humans can easily understand, in contrast to, for example, HTML [15], where markup is used for proper presentation. To let humans also benefit from the information contained in RDF datasets, methods are needed to access and render this information in an appropriate way.

One promising way to access information contained in RDF data is offered by the tool gFacet [11]. It combines graph-based visualization with faceted filtering functionality to build up queries and thus control what information is displayed on the screen. The queries are formulated in SPARQL [9], the W3C recommendation for accessing RDF data. SPARQL has an SQL-like syntax and can be used to express queries across diverse data sources.

With the existing SPARQL endpoints around the Web that allow RDF datasets to be queried, gFacet can be the bridge between the user and the RDF dataset. Efficient and accurate queries need to be implemented in gFacet so that it can be a powerful faceted RDF browser. In doing this, we also have to find the right architecture for gFacet. The architecture has to be simple but efficient, and it has to support a direct connection to a SPARQL endpoint in order to enable active access to any SPARQL endpoint.

In general, the main focus of this thesis is as follows:

- Building an efficient architecture for gFacet. We find that a Two-Tier Architecture is suitable for gFacet. With a Two-Tier Architecture, the gFacet tool in the Client Tier can communicate directly with a SPARQL endpoint.

- Optimizing the query performance of the current gFacet by building new efficient and accurate queries in order to support faceted exploration of a large RDF dataset.

- Allowing exploration of multiple RDF sources using gFacet. With this feature, gFacet can browse from an entity in one source to the same entity in other sources.

1.2. Starting Point

This section gives a brief description of the platforms, technologies, and projects that serve as the foundation of the thesis. We adapt this foundation to make it applicable to our work.

1.2.1. Semantic Web

Since the WWW began in the early 1990s, it has had a great impact on mankind in information, education, business, and even social life. The number of websites keeps growing. According to the Google Blog¹, the Google index reached 1 trillion unique URLs on the web by the end of July 2008; for comparison, the first Google index in 1998 had 26 million pages. However, most web pages are currently still in the form of what we call the Syntactic Web. The Syntactic Web focuses only on the visual presentation of content. Once the content has been displayed, it is up to the user to interpret its meaning.

This trend began to change when Sir Tim Berners-Lee introduced the term Semantic Web in his article Semantic Web Roadmap in 1998 [1] and the subsequent The Semantic Web in 2001 [2]. He asked what would happen if machines could talk to each other and exchange the information available around the Web. This idea can be realized by making the semantics of the information understandable to machines, which is the main goal of the Semantic Web. "The Semantic Web will enable machines to comprehend semantic documents and data, not human speech and writings" [2].

¹ http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html



However, the Semantic Web is not designed to replace the Web of today but to improve it, so that the next generation of the Web is accessible to both humans and machines.

1.2.2. Resource Description Framework (RDF)

The Resource Description Framework (RDF) is a general-purpose language for representing information about resources in the Web. It is particularly intended for representing metadata about Web resources, but it can also be used to represent information about objects that can be identified on the Web, even when they cannot be directly retrieved from the Web [3].

RDF allows semantics to be expressed in such a way that information can also be processed by applications, rather than only being displayed to users. Basically, RDF defines a data model for describing machine-processable semantics of data [4]. The basic data model consists of three object types:

- Resources. A resource may be an entire Web page, a part of a Web page, a whole collection of pages, or an object that is not accessible via the Web (e.g., a printed book). Resources are always named by URIs.

- Properties. A property is a specific aspect, characteristic, attribute, or relation used to describe a resource.

- Statements. A specific resource, together with a named property plus the value of that property for that resource, constitutes an RDF statement. These three individual parts of a statement are called, respectively, the subject, the predicate, and the object of the statement.

RDF statements can be represented in a triple notation called N3, in RDF/XML

serialization, and as a graph of triples.

Let us start with an example of statements in natural language:

Kaka plays for AC Milan.

Kaka has jersey number 22.

AC Milan has a website accessible at http://www.acmilan.com/.

In N3, the statements are presented as follows:

.

"22" .

.

Listing 1.1. N3 statements
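
As an illustration of the subject, predicate, object structure, the three statements can be sketched in Python as plain tuples and written out one triple per line. This is only a sketch: the http://example.org/ namespace and the property names playsFor, jerseyNumber, and website are assumptions made for the example, not the URIs used in the thesis.

```python
# Hypothetical namespace and property names, chosen only for illustration.
EX = "http://example.org/"

# Each statement is a (subject, predicate, object) tuple.
triples = [
    (EX + "Kaka", EX + "playsFor", EX + "AC_Milan"),
    (EX + "Kaka", EX + "jerseyNumber", '"22"'),  # literal object, already quoted
    (EX + "AC_Milan", EX + "website", "http://www.acmilan.com/"),
]


def to_triple_line(subject, predicate, obj):
    """Write one triple as a line: URIs in angle brackets, literals as given."""
    obj_term = obj if obj.startswith('"') else "<" + obj + ">"
    return "<" + subject + "> <" + predicate + "> " + obj_term + " ."


lines = [to_triple_line(*t) for t in triples]
for line in lines:
    print(line)
```

Each printed line ends with a period, mirroring the statement terminator of the triple notation.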



Listing 1.2 shows the same example in RDF/XML

22

Listing 1.2. RDF/XML serialization

Figure 1.1 presents the example as a graph.

Figure 1.1. RDF graph representation

The resource subjects and objects are drawn as ellipses, the literal object as a square, and the properties as labeled, directed arcs.

RDF Schema is a simple set of standard RDF resources and properties that enables people to create their own RDF vocabularies. The data model expressed by RDF Schema resembles the class-based data model used by object-oriented programming languages like Java: it allows you to create classes of data [5].

1.2.3. SPARQL

RDF is the foundation of the Semantic Web. It is expected that more and more open RDF datasets will be released in the future. Allowing easy access to these collections requires a query language that can be executed against RDF data.

In January 2008, the RDF Data Access Working Group (DAWG) of the World Wide Web Consortium (W3C) released a query language for RDF called SPARQL [9], a recursive acronym that stands for SPARQL Protocol and RDF Query Language.



SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware. SPARQL contains capabilities for querying required and optional graph patterns along with their conjunctions and disjunctions [9]. SPARQL is also considered a component of the Semantic Web.

Making a Simple Query

Most SPARQL queries contain a set of triple patterns called a basic graph pattern (BGP). Each triple pattern has a subject, a predicate, and an object, each of which is either an explicit resource or literal, or a variable. SPARQL variables are prefixed with either ? or $.

An example of RDF data presented as an RDF graph is shown below

Figure 1.2. RDF graph example

The following query, which finds every person with a name and an email address, is executed against the given RDF data:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?mbox
WHERE {
  ?x foaf:name ?name .
  ?x foaf:mbox ?mbox
}

Listing 1.3. Simple query obtaining a name and mbox of a FOAF profile

The PREFIX clause binds the prefix foaf to the FOAF² namespace. The SELECT clause specifies what the query should return, in this case the variables name and mbox. The WHERE clause provides the basic graph pattern to match against the data.
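
How a basic graph pattern is matched against data can be sketched with a small Python function. This is a simplified illustration of the idea, not the algorithm of any particular SPARQL engine. The blank node labels, the person without a mailbox, and all mailbox addresses are assumptions made for the example.

```python
def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def match_bgp(patterns, data):
    """Return every binding that satisfies all triple patterns."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for binding in solutions:
            for triple in data:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


# Hypothetical data: two people with a mailbox, one without.
DATA = [
    ("_:a", "foaf:name", "Bobby Iceman"),
    ("_:a", "foaf:mbox", "mailto:bobby@example.org"),
    ("_:b", "foaf:name", "Tony Ironman"),
    ("_:b", "foaf:mbox", "mailto:tony@example.org"),
    ("_:c", "foaf:name", "No Mailbox"),
]
QUERY = [("?x", "foaf:name", "?name"), ("?x", "foaf:mbox", "?mbox")]

solutions = match_bgp(QUERY, DATA)
for s in solutions:
    print(s["?name"], s["?mbox"])
```

The person without a mailbox produces no solution, because both triple patterns have to match for the same value of ?x.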

² http://xmlns.com/foaf/spec/



The query matches its graph pattern against the data model. The result of this query is shown in Table 1.1. The solution sequence consists of one or more solutions if the graph pattern matches the model, or is empty if there is no match.

name mbox

"Bobby Iceman"

"Tony Ironman"

Table 1.1. SPARQL result for simple query

Value Constraints

FILTER is an optional clause to restrict solutions to those for which the filter expression

evaluates to TRUE.

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?mbox
WHERE {
  ?x foaf:name ?name .
  ?x foaf:mbox ?mbox .
  FILTER regex(?name, "^Bobby")
}

Listing 1.4. A query obtaining a name and mbox of a FOAF profile with value constraint

The query above gives Bobby Iceman and his mailbox as the only solution.
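
The effect of the FILTER clause can be sketched in Python as a post-processing step over a solution sequence. This is a minimal sketch, assuming that Python's re.search is close enough to the SPARQL regex function for a simple anchored pattern. The mailbox addresses are hypothetical.

```python
import re

# Hypothetical solution sequence before filtering.
solutions = [
    {"name": "Bobby Iceman", "mbox": "mailto:bobby@example.org"},
    {"name": "Tony Ironman", "mbox": "mailto:tony@example.org"},
]


def regex_filter(solutions, variable, pattern):
    """Keep the solutions whose binding for `variable` matches `pattern`.

    A solution in which the variable is unbound is dropped, mirroring the
    rule that a filter that cannot be evaluated eliminates the solution.
    """
    compiled = re.compile(pattern)
    return [
        s for s in solutions
        if variable in s and compiled.search(s[variable])
    ]


filtered = regex_filter(solutions, "name", "^Bobby")
print(filtered)
```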

Including Optional Values

A SPARQL query gives a non-empty solution sequence only if every query pattern matches the data model. If even one query pattern fails to match, the entire query gives an empty solution sequence. It is therefore useful to have query patterns that still allow the query to provide bindings even if part of the pattern fails to match the data model. The OPTIONAL clause provides this feature: even if the optional part does not create any binding, it does not eliminate the solution.

SELECT ?name ?mbox
WHERE {
  ?x foaf:mbox ?mbox .
  OPTIONAL { ?x foaf:name ?name }
}

Listing 1.5. A query obtaining a name and mbox of a FOAF profile with OPTIONAL clause



The query above finds all email addresses, whether or not the owner's name is available, as shown in Table 1.2.

name mbox

"Bobby Iceman"

"Tony Ironman"

Table 1.2. SPARQL result with OPTIONAL clause
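
The behaviour of OPTIONAL corresponds to a left join between the required and the optional part of the pattern. The following Python sketch shows this for the simplified case of a single shared variable. The function name left_join, the blank node labels, and the mailbox addresses are assumptions made for the example.

```python
def left_join(required, optional, key):
    """Join two solution lists on `key`, keeping unmatched required rows."""
    joined = []
    for req in required:
        matches = [opt for opt in optional if opt[key] == req[key]]
        if matches:
            for opt in matches:
                joined.append({**req, **opt})
        else:
            # No optional match: keep the solution, leave the variable unbound.
            joined.append(dict(req))
    return joined


# Hypothetical bindings for "?x foaf:mbox ?mbox" and "?x foaf:name ?name".
mboxes = [
    {"x": "_:a", "mbox": "mailto:bobby@example.org"},
    {"x": "_:b", "mbox": "mailto:tony@example.org"},
    {"x": "_:c", "mbox": "mailto:anonymous@example.org"},
]
names = [
    {"x": "_:a", "name": "Bobby Iceman"},
    {"x": "_:b", "name": "Tony Ironman"},
]

rows = left_join(mboxes, names, "x")
for row in rows:
    print(row.get("name"), row["mbox"])
```

The third mailbox still appears in the result even though no name is bound for it, which is what distinguishes OPTIONAL from a plain conjunction of triple patterns.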

Group Graph Pattern

A group graph pattern can consist of zero, one, or multiple basic graph patterns. A group graph pattern is delimited by curly braces {}. The query in Listing 1.3 can be rewritten as the query in Listing 1.6, which groups the triple patterns into two basic graph patterns. Even though the two queries have different structures, they give the same solution sequence.

SELECT ?name ?mbox
WHERE {
  { ?x foaf:name ?name . }
  { ?x foaf:mbox ?mbox . }
}

Listing 1.6. Simple query getting a name and mbox of a FOAF profile with group graph pattern

An extensive explanation of SPARQL syntax and semantics can be found in the SPARQL Query Language for RDF document [9].

SPARQL Extensions

The current SPARQL version has a number of limitations: for example, it is read-only and cannot modify an RDF dataset, and it supports neither subqueries nor aggregate functions. However, OpenLink Virtuoso³ provides extensions to SPARQL that overcome these limitations.

In this thesis, only the SPARQL extensions for subqueries and for the aggregate function COUNT are explained, because they are used intensively in the thesis.

COUNT function: The COUNT function counts the number of solutions that satisfy the criteria specified in the WHERE clause. The argument of COUNT may be either *, which means counting all rows, or a variable name, which means counting all rows in which this variable is bound. An optional distinct keyword may precede the variable that is the argument of the aggregate. An example can be seen in Listing 1.7, which returns the number of bindings of the variable o for each distinct p.

³ http://www.openlinksw.com/virtuoso/

select ?p count (?o)from where {?s ?p ?o};

Listing 1.7. Example of COUNT clause
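
What the COUNT aggregate computes in this example can be sketched in Python by grouping triples by predicate. This is an illustration of the result only, not of how Virtuoso evaluates the query. The triples and the function name are hypothetical.

```python
from collections import Counter

# Hypothetical triples (subject, predicate, object).
triples = [
    ("s1", "p1", "o1"),
    ("s1", "p1", "o2"),
    ("s2", "p1", "o1"),
    ("s2", "p2", "o3"),
]


def count_objects_per_predicate(triples, distinct=False):
    """Count the bindings of ?o for each ?p, optionally counting distinct values."""
    if distinct:
        # Collapse repeated (predicate, object) pairs before counting.
        pairs = {(p, o) for _, p, o in triples}
        return Counter(p for p, _ in pairs)
    return Counter(p for _, p, _ in triples)


counts = count_objects_per_predicate(triples)
distinct_counts = count_objects_per_predicate(triples, distinct=True)
print(dict(counts))
print(dict(distinct_counts))
```

With the distinct option, the object o1 is counted only once for p1 although it occurs in two triples.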

Subquery extension: A subquery (also called an inner query or nested query) is a query inside a query. It is usually used for a complex computation that cannot be done with a single query. In SPARQL, a subquery is added inside the WHERE clause of the query.

For example, one use case is to take all the teams in the database and, for every team with more than 5 members, add the big_team class and a property for the member count.

construct { ?team a big_team . ?team member_count ?ct }
where {
  ?team a team .
  { select ?team2 count (*) as ?ct
    where { ?m member_of ?team2 } .
    filter (?team = ?team2 and ?ct > 5)
  }
}

Listing 1.8. Example of subquery

SPARQL Query Results XML Format

Most SPARQL processors can provide query results in various document formats, so programmers can choose the most convenient format for their application. To make results portable across applications, the W3C recommends the SPARQL Query Results XML Format [10], in which the returned result set is written as an XML document.

The SPARQL results of the query in Listing 1.3, as an XML document, are shown below:



BobbyIceman

mailto:[email protected]

TonyIronman

mailto:[email protected]

Listing 1.9. SPARQL query results in XML format

A SPARQL results document begins with the <sparql> document element and its namespace, http://www.w3.org/2005/sparql-results#, to which all of the key elements belong. Inside the <sparql> element there are two sub-elements: <head> and a results element, which is <results> for SELECT queries or <boolean> for ASK queries. The <head> element declares, as <variable> elements, all the variables returned in the result set. These variables are the same as the variables declared in the SELECT clause of the query. Looking back at Table 1.1, the variables correspond to the column headings.

The <results> section contains the solution sequence (a set of query solutions). Each query solution is stored in a <result> sub-element, so every <result> element corresponds to a row in Table 1.1. Every <result> element contains one or more <binding> elements, each with a name attribute identifying the bound variable.

The value of a query variable binding, which may be a resource/URI, a string literal, a typed literal, or a blank node, is included as the content of the <binding> element as follows:

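RDF term                                   Content of the <binding> element
RDF URI Reference U                        <binding><uri>U</uri></binding>
RDF Literal S                              <binding><literal>S</literal></binding>
RDF Literal S with language L              <binding><literal xml:lang="L">S</literal></binding>
RDF Typed Literal S with datatype URI D    <binding><literal datatype="D">S</literal></binding>
Blank Node label I                         <binding><bnode>I</bnode></binding>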

If a variable is unbound, no <binding> element for that variable is included in the <result> element.
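
Reading such a document can be sketched with Python's standard xml.etree.ElementTree module. This is an illustrative sketch, not the result parser of gFacet. The sample document and its mailbox addresses are hypothetical, and the second solution deliberately leaves the variable name unbound.

```python
import xml.etree.ElementTree as ET

NS = {"sr": "http://www.w3.org/2005/sparql-results#"}

# Hypothetical results document; the mailbox addresses are made up.
SAMPLE = """<?xml version="1.0"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head>
    <variable name="name"/>
    <variable name="mbox"/>
  </head>
  <results>
    <result>
      <binding name="name"><literal>Bobby Iceman</literal></binding>
      <binding name="mbox"><uri>mailto:bobby@example.org</uri></binding>
    </result>
    <result>
      <binding name="mbox"><uri>mailto:tony@example.org</uri></binding>
    </result>
  </results>
</sparql>"""


def parse_results(xml_text):
    """Return (variables, rows); each row maps a variable to (kind, value)."""
    root = ET.fromstring(xml_text)
    variables = [v.get("name") for v in root.findall("sr:head/sr:variable", NS)]
    rows = []
    for result in root.findall("sr:results/sr:result", NS):
        row = {}
        for binding in result.findall("sr:binding", NS):
            term = binding[0]                  # <uri>, <literal>, or <bnode>
            kind = term.tag.split("}", 1)[1]   # strip the namespace part
            row[binding.get("name")] = (kind, term.text)
        rows.append(row)
    return variables, rows


variables, rows = parse_results(SAMPLE)
print(variables)
print(rows)
```

An unbound variable simply does not appear as a key in the row, which matches the rule that no binding element is written for it.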

1.2.4. DBpedia Project

The DBpedia project [6] is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia and to link other datasets on the Web to Wikipedia data. The goals of the DBpedia project are to convert Wikipedia content into a large, multi-domain RDF dataset that can be used in Semantic Web applications; to interlink the DBpedia dataset with other open datasets, creating a large Web of open data; and to develop interfaces so that Web services can make use of the DBpedia dataset.

The DBpedia project extracts various kinds of structured information from Wikipedia, such as infobox templates, categorization information, images, geo-coordinates, and links to external websites [7]. With DBpedia 3.2, a new infobox extraction method was introduced to create the DBpedia ontology.

The DBpedia dataset currently consists of around 274 million RDF triples, which have

been extracted from Wikipedia editions in 14 languages. The DBpedia knowledge base

currently describes more than 2.6 million things, including at least 213,000 persons, 328,000 places, 57,000 music albums, 36,000 films, and 20,000 companies. It features

labels and short abstracts for these things in 14 different languages; 609,000 links

to images and 3,150,000 links to external web pages; 4,878,100 external links into other

RDF datasets, 415,000 Wikipedia categories, and 75,000 YAGO categories.

The DBpedia dataset can be accessed online by querying via a public SPARQL query

endpoint at http://dbpedia.org/sparql, hosted by Virtuoso, or by browsing as Linked

Data using Semantic Web browsers like Disco, Tabulator, or Marbles.

The DBpedia dataset is interlinked with various open datasets on the Web using RDF links. This enables DBpedia users to discover information by starting from a resource in the DBpedia dataset and moving on to related data in other sources. The use of RDF links between related resources creates a giant Web-of-Data, which as of September 2008 consisted of approximately 2 billion RDF triples. Figure 1.3 gives an overview of the Web of interlinked data.

The Web-of-Data enables users to navigate for example from a resource of a musical

band in DBpedia dataset to a list of their songs in Musicbrainz or to a list of reviews of

the band in Revyu.

The example RDF link below connects a URI of Portishead in DBpedia to a related URI

in Musicbrainz:

owl:sameAs

Figure 1.3. Linking Open Data cloud

DBpedia provides three different classification schemata for things.

1. Wikipedia Categories are represented using the SKOS vocabulary4.

2. The YAGO Classification5 is derived from the Wikipedia category system using WordNet.

3. WordNet Synset Links were generated by manually relating Wikipedia infobox templates and WordNet synsets, and adding a corresponding link to each thing that uses a specific template. In theory, this classification should be more precise than the Wikipedia category system.

4http://www.w3.org/2004/02/skos/

5http://www.mpi-inf.mpg.de/yago-naga/yago/



These classifications make it possible to select things of a specific kind in the executed SPARQL queries.

1.2.5. MusicBrainz

MusicBrainz6 is a community music metadatabase that attempts to create a

comprehensive music information site. MusicBrainz collects this information about

recordings and makes it available to the public. Users can contribute their knowledge

about music which then can be shared with others.

The music metadata covers Artists, Releases, Tracks, Labels, and the advanced relationships among them. This metadata is stored in a PostgreSQL relational database.

MusicBrainz has URI schemes to identify their entities, such as

http://musicbrainz.org/artist/UUID

http://musicbrainz.org/release/UUID

http://musicbrainz.org/track/UUID

http://musicbrainz.org/label/UUID

where UUID is a Universally Unique Identifier7 in its 36 character ASCII

representation.
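The URI scheme above can be illustrated with a minimal Python sketch. The function name musicbrainz_uri and the identifier used in the usage note are hypothetical and serve only as illustration; the sketch merely assumes the four entity types and the 36-character UUID form described above.

```python
import uuid

BASE = "http://musicbrainz.org"
ENTITY_TYPES = {"artist", "release", "track", "label"}


def musicbrainz_uri(entity_type: str, identifier: str) -> str:
    """Build an entity URI following the scheme <base>/<type>/<UUID>."""
    if entity_type not in ENTITY_TYPES:
        raise ValueError(f"unknown entity type: {entity_type}")
    # uuid.UUID raises ValueError for malformed input; str() yields the
    # canonical 36-character, lower-case, hyphenated representation.
    canonical = str(uuid.UUID(identifier))
    return f"{BASE}/{entity_type}/{canonical}"
```

For example, calling musicbrainz_uri("artist", "12345678-1234-5678-1234-567812345678") with this made-up placeholder identifier yields a URI under http://musicbrainz.org/artist/.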

1.2.6. DBTune

DBTune8 hosts a number of servers, providing access to music-related structured data,

in a Linked Data fashion. DBTune provides all the data based on open Web standards such as RDF and SPARQL.

Various datasets are provided by DBTune, including MusicBrainz data. DBTune maps this MusicBrainz data onto the Music Ontology [28]. The MusicBrainz data is thus available via a SPARQL endpoint, http://dbtune.org/musicbrainz/sparql, powered by a D2R server. The basic graph of MusicBrainz is located at http://musicbrainz.org/.

6http://musicbrainz.org/

7http://en.wikipedia.org/wiki/UUID

8http://dbtune.org/

1.2.7. Faceted Navigation

Data on the Semantic Web is semi-structured and does not follow one fixed schema [8].

Faceted navigation [12] is an exploratory interface suitable for such data. Facets refer to categories used to characterize information items in a collection [27]. By categorizing data into facets, the exploration takes place when the user selects restriction values of the facets in order to filter the result set.

There are two possible methods to map subject-predicate-object RDF triples into facets.

First, facets are computed by the predicate that connects two resources; information

elements are RDF subjects, facets are RDF predicates and restriction-values are RDF

objects [8]. Second, facets are computed from all resources that are related to resources in the result list and are grouped by their specific characteristics or concepts, for example using the predicate rdf:type or skos:subject. [13] implemented the latter method in their browser interface. An example of both methods is given below.

A result list of people related to Formula 1 racing can have predicates such as "first team", "current team", "former team", or "last team" that connect to a group of racing teams, and predicates such as "lives in", "born in", or "died in" that connect to a group of countries. With the first method, 7 facets are constructed, one per predicate. With the second method, only the two facets "racing team" and "country" appear.
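The difference between the two mapping methods can be sketched on a toy set of triples. The following Python sketch is an illustration only: the triples, the resource names, and the function names facets_by_predicate and facets_by_type are invented for this example and are not taken from any of the cited systems.

```python
from collections import defaultdict

# Toy (subject, predicate, object) triples; all names are invented.
TRIPLES = [
    ("driverA", "currentTeam", "teamX"),
    ("driverA", "bornIn", "countryP"),
    ("driverB", "formerTeam", "teamX"),
    ("driverB", "livesIn", "countryQ"),
    ("teamX", "rdf:type", "RacingTeam"),
    ("countryP", "rdf:type", "Country"),
    ("countryQ", "rdf:type", "Country"),
]


def facets_by_predicate(triples, items):
    """First method: one facet per predicate, objects are restriction values."""
    facets = defaultdict(set)
    for s, p, o in triples:
        if s in items and p != "rdf:type":
            facets[p].add(o)
    return dict(facets)


def facets_by_type(triples, items):
    """Second method: related resources are grouped by their rdf:type."""
    types = {s: o for s, p, o in triples if p == "rdf:type"}
    facets = defaultdict(set)
    for s, p, o in triples:
        if s in items and o in types:
            facets[types[o]].add(o)
    return dict(facets)
```

On this toy data with the items driverA and driverB, the first function produces four facets (one per predicate), while the second produces only the two facets RacingTeam and Country, mirroring the Formula 1 example above on a smaller scale.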

A faceted interface has several advantages over keyword search or explicit queries: it

allows exploration of an unknown dataset since the system suggests restriction values at

each step; it is a visual interface, removing the need to write explicit queries; and it prevents dead-end queries, by only offering restriction values that do not lead to empty

results [9].

1.2.8. gFacet Project

gFacet [11] is a browsing approach that supports the exploration of RDF datasets by

combining graph-based visualization with faceted filtering functionalities. With this

combination, gFacet facilitates the exploration of large and highly interrelated RDF datasets.

The major aims of the approach are:

1. Prevention of an over-cluttered graph: The facet-based visualization groups the instances of data into separate facets according to their characteristics. Rather than visualizing each relation between two single instances by an edge, the facet-based visualization represents the relationship between an instance of one facet and one or more instances of another facet with a single edge between the facets. The relations between individual instances become visible only indirectly, when the user selects an instance in a facet in order to filter the result set.



2. Representation of relations between facets: Visualizing the information as a graph, with facets as nodes and the relations between them as labeled edges, can make the hierarchy of the information more understandable.

3. Single coherent visualization: A graph-based visualization prevents users from "getting lost in hyperspace" by displaying all the information in a single visualization instead of spreading it over several screens or windows.

The Architecture

gFacet is built on a Three-Tier Architecture consisting of the Client Tier, the Server Tier, and the Data Tier.

Figure 1.4. gFacet Architecture

The first tier of the Three-Tier Architecture is the Client Tier, in which the gFacet user interface is displayed. gFacet is implemented using Adobe Flex9, a framework for creating Rich Internet Applications (RIAs) based on the Adobe Flash10 platform. RIAs created with Flex can run in every browser that has the Adobe Flash player installed. This makes gFacet an interactive RDF data browser that can run on every operating system, as long as a browser and the Adobe Flash player are installed.

The Server Tier is the application server, also called the middleware, where the application logic and server software are stored. The middleware is implemented in PHP and provides the logic of query generation: it generates queries according to user tasks and sends the query results back to the Client Tier. Because gFacet is a Flash-based user interface, AMFPHP11 is used to serialize the communication between gFacet and the PHP class objects on the server. To get the data from a relational database and transform it into an RDF model, the RAP (RDF API for PHP)12 package is used. RAP

9http://www.adobe.com/products/flex/

10http://www.adobe.com/products/flash/

11http://www.amfphp.org/

12http://www.seasr.org/wp-content/plugins/meandre/rdfapi-php/doc/



is a software package for parsing, querying, manipulating, serializing, and serving RDF models. The available RDF models are then manipulated using RAP's SPARQL package.

The Data Tier is where the physical data is served for the application. The RDF data is

stored in a relational database.

The Prototype

gFacet can be accessed via the internet using a browser with a Flash player installed. In

the prototype version, gFacet uses a sample dataset from the field of music.

Initially, a node containing a list of songs is displayed. The node can be expanded to other related nodes by selecting a pair of a relationship and a facet to which it refers, from a

dropdown menu on the bottom of the node. If a user selects a pair from the dropdown

menu, a new node is opened and gets connected to the original node by a labeled edge.

The instances of the new node can act as a filter for the instances of the connected node.

Expanding the nodes gradually can create a collection of hierarchical facets as

illustrated in Figure 1.5.

Figure 1.5. gFacet user interface



1.3. Task Description

In this thesis, we focus on accessing and querying RDF data from arbitrary RDF sources with SPARQL and on applying this in a faceted RDF browser. We use gFacet as the platform to display the information to the users after the data has been prepared in a faceted manner. To improve gFacet's performance with respect to data access and availability, we need to modify the gFacet architecture and then rebuild the SPARQL queries being used.

The Three-Tier Architecture of gFacet has drawbacks in terms of resource utilization and execution time. The idea to address these issues is to let gFacet perform all of its logic on the client side, so that gFacet can directly access any SPARQL endpoint without requiring a server in between. We will replace the functionalities of AMFPHP, the application logic, and the RAP packages by rebuilding similar functionalities in the Client Tier.

The second task is to build new queries that will improve the gFacet performance. The

queries should be efficient and accurate so that they allow gFacet to access a large RDF dataset in a faceted manner.

The main SPARQL endpoint and RDF dataset we use in this thesis are those released by the DBpedia project. DBpedia is a large, multi-domain RDF dataset extracted from Wikipedia, which makes it a good source of information. We need to build queries that allow users to browse DBpedia in an effective and efficient way. We then evaluate the performance of the queries by measuring the time required to execute them and by measuring their accuracy.

Since DBpedia is also interlinked with various RDF sources around the Web, it allows the user to jump from a resource in DBpedia to a related resource in an RDF dataset from another source by following the given RDF link. A good example is that DBpedia is interlinked with the MusicBrainz datasets. For the third task in this thesis, we adjust gFacet so that it can follow RDF links from DBpedia to MusicBrainz, so that users can explore the data from both sources as if they were browsing a single large dataset.

1.4. Related Works

SPARQL has been recommended by W3C as a standard language for querying RDF

datasets. Many approaches to querying RDF data using SPARQL have been investigated.

Erling [21] investigated sample SPARQL queries executed against large datasets using the OpenLink Virtuoso SPARQL engine. The queries make use of SPARQL extensions that work with Virtuoso at the back-end.



Many other studies try to optimize the performance of SPARQL. [23], [26], and [24] implemented different approaches to optimize queries, all of which suggest query reordering in order to obtain optimized query execution plans.

Even though more and more data is available over SPARQL endpoints, it is still difficult to integrate data from multiple data sources. RDF data integration is often done by loading all data into a single repository and querying the merged data locally [22]. Tabulator [16] uses this approach. Tabulator collects all the information by following the related resources indicated by the owl:sameAs or rdfs:seeAlso predicate and stores it in a local repository. Tabulator then allows the user to query the locally stored data.

Quilitz and Leser [22] built DARQ13, a query engine for federated SPARQL queries. It

provides transparent query access to multiple endpoints. The implementation introduces service descriptions that provide declarative descriptions of the data available from each endpoint, which are used to determine the endpoint a query should be sent to.

1.5. Thesis Outline

This thesis is organized as follows.

Chapter 2 explains how to turn gFacet into a full client-side application. It starts with the main strategy and then explains the new gFacet architecture and the new components that have been built during the thesis.

Chapter 3 explains in detail how the generated queries work, step by step according to the user's actions while browsing the DBpedia dataset using gFacet. At the end, this chapter presents the evaluation results based on time and accuracy measurements.

Chapter 4 describes how to enable gFacet to use more than one RDF source. Here, gFacet is set up to query datasets from both DBpedia and MusicBrainz.

Chapter 5 gives a short summary of the implementation and evaluation of this thesis. This chapter also introduces some ideas that can serve as a foundation for future improvements of gFacet.

13Distributed ARQ, as an extension to ARQ (http://jena.sourceforge.net/ARQ)



Chapter 2

The Architecture

The current gFacet works on a Three-Tier Architecture consisting of the Client Tier, the Server Tier, and the Data Tier. The Client Tier is where the user interface and the presentation logic reside. AMFPHP1, the application logic written in PHP2, and the RAP3 packages are located in the Server Tier. The physical relational database is placed in the Data Tier.

With this kind of architecture, there are two issues to be considered.

1. The Three-Tier Architecture is resource expensive. Using middleware means that more resources, such as more dedicated machines and more disk space or working memory, are necessary in addition to a database server.

2. It is relatively time consuming. Instead of communicating directly with the Data Tier, additional processing in the Server Tier needs to be done. This increases the total execution time.

These issues motivate us to improve the performance of gFacet by turning it into a client-side application. This chapter explains the strategy that makes this possible. The structure of this chapter is as follows. Section 2.1 describes the strategy in general. Section 2.2 explains the new architecture and its implementation.

2.1 The Strategy

The problems of the gFacet architecture are located in the Server Tier. There is too much intermediate processing before the user's commands can be processed in the Data Tier. The main focus of the strategy is the PHP application logic and the RAP packages in the Server Tier. The PHP application logic plays an important role because it is where the queries are built, and the RAP packages act as the query dispatcher

1http://www.amfphp.org/

2http://www.php.net/

3http://www.seasr.org/wp-content/plugins/meandre/rdfapi-php/doc/



and as a parser for the results from the database server. We can set AMFPHP aside because this package is only used to serialize the communication between gFacet and the middleware.

There are three main steps to make gFacet a full client-side application.

1. Rebuild the query builder in the Client Tier; this replaces the PHP application logic on the server.

2. Build a query dispatcher that sends the query to the SPARQL endpoint; this replaces the dispatching role of the RAP packages.

3. Build a parser for the result returned by the SPARQL endpoint.

Since gFacet is implemented using Adobe Flex4, all the new components are built in ActionScript5, the client-side scripting language at the core of Flex. With this implementation, the application logic runs fully on the client.

2.2 Client-side gFacet Architecture

By accomplishing the three steps mentioned in the previous section, we have simplified the architecture of gFacet. We have moved all the necessary functionalities, namely query building, query dispatching, and result parsing, into the Client Tier, so the new gFacet runs completely on the client side. This means that gFacet is now built on a Two-Tier Architecture consisting of the Client Tier and the Data Tier, as illustrated in Figure 2.1. With this architecture, a user's request can be sent to the SPARQL engine without any intermediate processing in between.

Figure 2.1. gFacet Two-Tier Architecture

4http://www.adobe.com/products/flex/

5http://www.actionscript.org/



In the next subsections, we briefly explain the three architecture components that we developed during the thesis: the query builder, the query dispatcher, and the SPARQL result parser. The development of the presentation handler, the action listener, and the SPARQL endpoint is out of the scope of this thesis, because they had been established before the thesis started.

2.2.1 SPARQL Query Dispatcher

A SPARQL endpoint allows a SPARQL query to be conveyed as an HTTP request over the Web using the GET or POST method. This HTTP request is assembled and sent to the SPARQL endpoint by the query dispatcher. The request contains some parameters that are required by the SPARQL endpoint. The parameters are as follows:

1. query, which specifies the query pattern to be executed.

2. default-graph-uri, which specifies the graph to be used to form the default graph. Specifying this parameter overrides the default graph defined in the query pattern using the FROM clause.

3. output, which specifies the result format to be returned. In this application we expect an XML document of the SPARQL query result.

The query dispatcher is derived from the HTTPService class of ActionScript, as shown in Figure 2.2. The send() method of the HTTPService object sends an HTTP request to the host specified by the url variable and receives an HTTP response in return. The method can also pass parameters to the specified url. Hence, inside the execute() method of the SPARQLQuery object, the required SPARQL parameters explained above are packaged together into one object variable called parameters, which is sent within the HTTP request by calling the send() method.

Figure 2.2. SPARQLQuery (the dispatcher) class diagram



The abstract HTTP trace example in Listing 2.1 illustrates the invocation of a SPARQL query at the SPARQL endpoint http://example.org/sparql/ using the GET method. EncodedGraphURI and EncodedQuery stand for the encoded representations of the graph URI and the query pattern.

GET /sparql?default-graph-uri=EncodedGraphURI&query=EncodedQuery&output=xml HTTP/1.0
Host: example.org
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8

Listing 2.1. HTTP trace of query dispatching
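How the encoded request of Listing 2.1 can be assembled is sketched below in Python rather than in the ActionScript of the actual dispatcher. The function name build_request_url is hypothetical; the parameter names query, default-graph-uri, and output are the ones described above, and the sketch only builds the URL without sending it.

```python
from urllib.parse import urlencode


def build_request_url(endpoint, query, default_graph_uri=None, output="xml"):
    """Assemble a GET request URL for a SPARQL endpoint.

    urlencode percent-encodes the parameter values, which corresponds to
    the EncodedGraphURI and EncodedQuery placeholders in the HTTP trace.
    """
    params = {}
    if default_graph_uri is not None:
        params["default-graph-uri"] = default_graph_uri
    params["query"] = query
    params["output"] = output
    return endpoint + "?" + urlencode(params)
```

A call such as build_request_url("http://example.org/sparql", "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1", "http://example.org/graph") returns a URL whose query string contains no raw spaces or braces, so it can be placed directly in the request line.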

2.2.2 SPARQL Result Parser

The W3C recommends a standardized XML document as a SPARQL result format that can easily be processed by any application. We built a parser for this XML document that reads the document structure and converts the elements into native ActionScript datatypes, which are required for further processing.

All variable names found inside the <head> element are stored in an array. All the results of a SELECT query are stored in a multidimensional array, in which each row represents one query solution.

One thing to consider is that each variable binding inside a solution can be a resource/URI, a literal, or a blank node. This RDF term is given in a <uri>, <literal>, or <bnode> element, and each kind behaves differently, as modeled in the ActionScript classes shown in Figure 2.3.

Figure 2.3. RDFTerm class diagram



The variable names and the query solutions of the example XML document in Listing 1.7 are thus represented as the arrays shown in Listing 2.2:

arrVariables = ["name", "mbox"];
arrResult =
[
  [ "name" => new Literal("Bobby Iceman", null,
              "http://www.w3.org/2001/XMLSchema#string"),
    "mbox" => new Resource("mailto:[email protected]")
  ],
  [ "name" => new Literal("Tony Ironman", null,
              "http://www.w3.org/2001/XMLSchema#string"),
    "mbox" => new Resource("mailto:[email protected]")
  ]
]

Listing 2.2. Native ActionScript datatypes representing the query results

The first query solution, denoted by arrResult[0], consists of the variable bindings name and mbox. The value of the variable binding name is a Literal object with the label "Bobby Iceman" and the string datatype. The value of mbox is a Resource object with the URI mailto:[email protected]. The second query solution is structured in the same way.

A <boolean> element is the result of an ASK query. The value of this element is either TRUE or FALSE. This value is stored as an ActionScript string.
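The parsing steps described in this section can be sketched as follows. This is a Python illustration under stated assumptions, not the ActionScript parser of gFacet: the class names Resource, Literal, and BlankNode follow the terms used above, the argument order of Literal (label, language, datatype) is inferred from Listing 2.2, and the function name parse_results is hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional

# Namespace of the W3C SPARQL Query Results XML Format.
NS = {"sr": "http://www.w3.org/2005/sparql-results#"}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


@dataclass(frozen=True)
class Resource:
    uri: str


@dataclass(frozen=True)
class Literal:
    label: str
    lang: Optional[str] = None
    datatype: Optional[str] = None


@dataclass(frozen=True)
class BlankNode:
    label: str


def _term(binding):
    """Convert the single child of a <binding> element into an RDF term."""
    child = binding[0]
    tag = child.tag.split("}")[-1]
    text = child.text or ""
    if tag == "uri":
        return Resource(text)
    if tag == "bnode":
        return BlankNode(text)
    if tag == "literal":
        return Literal(text, child.get(XML_LANG), child.get("datatype"))
    raise ValueError(f"unexpected element in binding: {tag}")


def parse_results(xml_text):
    """Return (variables, rows) for SELECT or (variables, string) for ASK."""
    root = ET.fromstring(xml_text)
    variables = [v.get("name") for v in root.findall("sr:head/sr:variable", NS)]
    boolean = root.find("sr:boolean", NS)
    if boolean is not None:
        # The ASK result is kept as a string, as described above.
        return variables, (boolean.text or "").strip()
    rows = [
        {b.get("name"): _term(b) for b in result.findall("sr:binding", NS)}
        for result in root.findall("sr:results/sr:result", NS)
    ]
    return variables, rows
```

Each row is a mapping from variable name to term, so an unbound variable is simply absent from the row, in line with the result format described in Section 1.2.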

2.2.3 SPARQL Query Builder

We implement the query builder mainly by rebuilding the PHP application logic, formerly located on the server, in the Client Tier. The query builder is responsible for generating SPARQL queries and sending the query patterns to the query dispatcher.

The queries are generated according to the commands received from a caller function. The queries are then sent to a SPARQL endpoint by the query dispatcher. After the execution has finished, the endpoint sends back the result as an XML document. The XML document is then parsed and the query results are converted into a multidimensional array. This conversion makes further processing easier. The array representation is then manipulated to extract the required data before it is sent back to the caller function. Finally, the presentation handler is responsible for displaying the data to the user. The data flow is illustrated in Figure 2.4.



Figure 2.4. gFacet Data Flow Diagram



Chapter 3

Browsing DBpedia

The previous chapter explained the new architecture of gFacet. The application logic now resides in the Client Tier, including the Query Builder class, where all the queries corresponding to the user's commands are built. This chapter takes a closer look at the queries that are constructed while the user interacts with gFacet. The Query Builder translates every user action into queries that are sent to the SPARQL endpoint.

Instead of describing the queries in the abstract, we explain them case by case according to the user's actions while exploring an RDF dataset with gFacet. In this chapter, we use the dataset released by the DBpedia1 project. The DBpedia dataset contains multi-domain information that is extracted from structured information in Wikipedia2. In this chapter, we also evaluate the execution time of the queries and how consistently they return the correct result.

This chapter is structured as follows. Section 3.1 gives a short explanation of dispatching a query to DBpedia. Section 3.2 explains the query implementation while browsing DBpedia. The evaluation follows in Section 3.3.

3.1. Dispatching a Query to DBpedia

DBpedia released its dataset in order to make it available over the Web. The dataset can be accessed as Linked Data or via a SPARQL query endpoint. In this thesis, we focus on accessing the DBpedia dataset by sending queries to the DBpedia SPARQL endpoint and viewing the results using gFacet.

1http://dbpedia.org/

2http://wikipedia.org/



The queries can be sent via HTTP request to the endpoint located at

http://dbpedia.org/sparql. The endpoint is provided using OpenLink Virtuoso3 virtual

database engine.

As explained in Section 2.2.1, the endpoint location and the sending method need to be defined first. The query can then be sent along with the other parameters required by the endpoint, such as the default graph and the result format. In gFacet, dispatching a query is done as follows:

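// Location of the DBpedia SPARQL endpoint
var endpoint:String = "http://dbpedia.org/sparql";
var aQuery:SPARQLQuery = new SPARQLQuery(endpoint);
// Request parameters described in Section 2.2.1
aQuery.defaultGraphURI = "http://dbpedia.org";
aQuery.method = "GET";
aQuery.resultFormat = "xml";
// Assemble and send the HTTP request
aQuery.execute();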

Listing 3.1. Dispatching a SPARQL query to DBpedia

Listing 3.1 shows how a query is dispatched to the DBpedia endpoint using the GET method. The query is executed against the DBpedia default graph, which is identified by http://dbpedia.org. We expect the result as an XML document.

The queries built in this thesis are not standard SPARQL queries as defined in the W3C recommendation, because there are some cases where standard SPARQL cannot carry out the tasks. In this thesis we use SPARQL extensions provided by OpenLink Virtuoso, such as free-text searching and the COUNT() aggregate function. The queries we built may therefore not work with SPARQL query services other than OpenLink Virtuoso.

3.2. Exploring DBpedia with gFacet

In this section we describe the detailed implementation of the queries generated by the gFacet tool. We use the DBpedia dataset as the source of our information.

The explanation in this section is given step by step according to the user's goals. We construct an example case and then explain how the queries and the gFacet user interface achieve these goals on a large interlinked dataset like DBpedia. The case is as follows:

A user is very interested in German football clubs and wants to see all the relevant clubs. He keeps exploring the dataset by looking for German footballers who play for the German clubs, then for English football clubs that have ever hired these German footballers, and for the football venues where the English clubs reside. Then he wants to find the German football club for which a German footballer named Thomas Hitzlsperger plays. After that, he is no longer interested in German football clubs.

3http://virtuoso.openlinksw.com/

He changes his point of view: he is now more interested in English football clubs and continues the exploration from there.

Let us formulate the case in a simple way:

Goal 1: The user is interested in information about German football clubs and wants to see the full list of clubs.

Goal 2: The user expands the graph by looking for

(a) the names of all German footballers who play for the German clubs,

(b) all the English clubs for which the footballers in (a) have ever played, and

(c) the venues where the English clubs in (b) reside.

Goal 3: The user wants to see the German football club that Thomas Hitzlsperger plays for. He selects the player's name in order to filter the clubs and receive the information he needs.

Goal 4: While exploring, the user becomes more interested in English football clubs than in German ones. He now changes his perspective on the information.

All the steps to achieve these goals are shown in Figure 3.1, and the details of every step are described in the next sections.

Figure 3.1. User actions flow while browsing DBpedia

3.2.1. Searching for Concepts

One of the features of gFacet is the ability to search for a concept in order to define the initial node from which the exploration begins. The user can specify a keyword string to be matched against the concept names. In return, a list of concepts is displayed together with the number of instances of each concept (see Figure 3.2).



According to Goal 1, the user is interested in information about German football clubs. The screenshot of the gFacet interface in Figure 3.2 shows the user expressing this interest by typing the keywords "german football". A list of concepts whose labels contain the text "german football" is then displayed, along with the number of instances that belong to each concept.

Figure 3.2. Searching a concept

Free-text searching within DBpedia texts can be performed using the bif:contains predicate, which OpenLink Virtuoso offers as a SPARQL extension. Since Virtuoso 5.0, it is possible to declare that the objects of triples with a given predicate or graph are indexed [25]. Using bif:contains, the triples that have been indexed can be found.

There is a more generic way of searching texts, namely the standardized SPARQL function regex(). We nevertheless prefer the bif:contains predicate over the regex() function. bif:contains looks up the objects in the index table instead of searching the whole dataset as regex() does. This makes bif:contains faster, especially if the query is executed against a large dataset. For a large dataset like DBpedia, using bif:contains is therefore more reasonable.



The query in Listing 3.2 is the generated query from the example illustrated in Figure

3.2.

1: SELECT DISTINCT ?concept ?label COUNT(?instance) AS ?numOfInstances
2: WHERE {
3:   ?concept rdf:type skos:Concept .
4:   ?instance skos:subject ?concept .
5:   ?concept rdfs:label ?label .
6:   FILTER (lang(?label) = "en")
7:   ?label bif:contains "german and football" .
8: }
9: ORDER BY DESC(?numOfInstances) LIMIT 30

Listing 3.2. Query for concept searching

In Line 3, we look for resources that are identified as a concept and bind them to the variable concept. Line 4 searches for the instances of each concept and binds them to the variable instance. In Line 5, we use the predicate rdfs:label to get a human-readable version of a resource's name. Here we want the label of the concept, and we only need the concepts whose label is given in English, as specified in Line 6. In Line 7, we apply bif:contains to the label of the concept by specifying a string of keywords. Back in Line 1, we define the variables concept, label, and numOfInstances to be returned in the solution sequence. We calculate the number of instances using the SPARQL extension function COUNT().

Because many concepts may be found, we ask the endpoint to order the result in descending order (Line 9) according to the number of instances each concept has, so that the concepts at the top of the list are likely to be the most relevant for the user.

To keep the queries simple for explanation, we assume that all resource labels are returned only in English. For the following queries, we therefore omit query patterns like those in Line 5 and Line 6.

3.2.2. Selecting the Initial Node

In the list of concepts, the user sees the concept he is looking for, German football clubs. He selects this concept, and the initial node for it is opened, as shown in Figure 3.3.



Figure 3.3. Opening the initial node

The initial node is set as the result set by default. The result set is the information space that the user is interested in; it is the perspective from which the user looks at the information. Everything that he searches for is displayed in this node. The result set node always appears in dark grey.

In gFacet, a node contains a pageable list of the instances of a certain concept, a pull-down menu of relations, and a button to set the node as the result set (see Figure 3.3). The information required by the user is displayed in the instance list, and the user can navigate from page to page to explore these instances. By opening the initial node German Football Clubs, Goal 1 has been achieved.

In the next subsections, we explain the details of the generated queries that retrieve all the instances, implement the paging mechanism, and obtain all the relations.

Retrieving the Instances

The instances are displayed in a list that consists of the label and the description of each instance. To avoid extensive scrolling when many instances would be displayed at once, gFacet provides a paging mechanism.

The paging mechanism requires the application to perform at least two tasks:

1. The application should be able to count the total number of possible instances. This value determines how many page buttons have to be created.

2. The application should be able to produce just the subset of instances to be displayed for the page selected by the user.
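The first task comes down to a ceiling division. As a minimal sketch, assuming a page size of 10 (the value used later in this section) and a hypothetical helper name page_count:

```python
def page_count(total_instances, page_size=10):
    """Number of page buttons needed to show `total_instances` items."""
    if total_instances < 0 or page_size <= 0:
        raise ValueError("total must be >= 0 and page size must be > 0")
    # Ceiling division with integers only: a partly filled last page
    # still needs its own button.
    return -(-total_instances // page_size)
```

With this helper, a concept with 302 instances needs 31 page buttons, the last of which shows only two instances.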



In this thesis, we try two approaches to retrieve the data and apply the paging mechanism: all-at-once querying and step-by-step querying.

All-at-once querying: In this first approach, we solve the tasks with a standard SPARQL query in order to keep the query generic. In simple terms, the query retrieves all possible instances from the endpoint into the Client Tier, and the client application then implements the paging mechanism by grouping the instances into pages and producing the subset of instances to be displayed. The query for this approach is given in Listing 3.3.

1 : SELECT DISTINCT ?insOfresultSet ?comment_resultSet
2 : WHERE {
3 :   <URI of the selected concept> rdf:type skos:Concept .
4 :   ?insOfresultSet skos:subject <URI of the selected concept> .
5 :   OPTIONAL {
6 :     ?insOfresultSet rdfs:comment ?comment_resultSet
7 :     FILTER (lang(?comment_resultSet) = "en")
8 :   }
9 : }

Listing 3.3. Query to retrieve all instances

The query in Listing 3.3 first checks whether the URI of the selected concept is really defined as a concept (Line 3). If so, Line 4 searches for its instances by applying the skos:subject predicate.

In Line 6 and Line 7, the query asks for an English description of each instance. We put this requirement into an OPTIONAL clause (Line 5), which means that these patterns do not necessarily have to produce a binding.

The advantage of this approach is that, once all the data has been retrieved, paging is fast because the application does not have to execute any further queries. The drawback is that the query engine tends to take a long time to return all existing instances of a concept, especially if there are many of them.
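The client-side grouping used by the all-at-once approach can be sketched as plain list slicing. The function name split_into_pages is hypothetical, and the page size of 10 is an assumption borrowed from the step-by-step approach:

```python
def split_into_pages(instances, page_size=10):
    """Group an already retrieved list of instances into pages."""
    if page_size <= 0:
        raise ValueError("page size must be > 0")
    # Each slice holds at most `page_size` items; the last page may be shorter.
    return [
        instances[start:start + page_size]
        for start in range(0, len(instances), page_size)
    ]
```

Once the pages exist in memory, switching pages is a list lookup and needs no further query, which is the advantage described above.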

The second drawback of this approach is caused by the result limit that applies to every query execution. To protect the service from overload, the SPARQL endpoint truncates the results of each query to 1000 rows [6]. As a consequence, gFacet cannot retrieve the remaining instances if the concept has more than 1000 instances.

Step-by-step querying: To overcome the drawbacks of the first approach, we introduce the step-by-step querying approach. We accomplish the tasks by executing two queries, one for each task. The first query asks the query engine to return only a subset of the instances instead of all query solutions. This is possible with the standard SPARQL clauses OFFSET and LIMIT (see Listing 3.4).



1 : SELECT DISTINCT ?insOfresultSet ?comment_resultSet
2 : WHERE {
3 :   <URI of the selected concept> rdf:type skos:Concept .
4 :   ?insOfresultSet skos:subject <URI of the selected concept> .
5 :   OPTIONAL {
6 :     ?insOfresultSet rdfs:comment ?comment_resultSet
7 :     FILTER (lang(?comment_resultSet) = "en")
8 :   }
9 : }
10: ORDER BY ASC(?label_insOfresultSet) OFFSET 0 LIMIT 10

Listing 3.4. Query to retrieve a subset of instances

In general, the query is the same as the one in Listing 3.3; the important difference is in Line 10. First, we ask for the results to be ordered by the label of the instance in ascending order using the ORDER BY ASC() clause. The main work of this approach is then done by the OFFSET and LIMIT clauses, which select a subset of the available query solutions. We predefine a limit of 10 instances to be returned at a time, starting from the position given in the OFFSET clause. All three clauses are needed for the paging mechanism: ordering the solutions before applying OFFSET gives a consistent and meaningful order. For example, if OFFSET is set to 0, the query returns instances #1 to #10; if OFFSET is set to 10, we get instances #11 to #20, and so on.
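The mapping from a page selected by the user to the OFFSET value can be sketched as follows. The helper names page_to_offset and paging_clause are hypothetical, and pages are assumed to be numbered from 1; the clause text follows Line 10 of Listing 3.4.

```python
def page_to_offset(page, page_size=10):
    """OFFSET for a 1-based page number: page 1 -> 0, page 2 -> 10, ..."""
    if page < 1 or page_size <= 0:
        raise ValueError("page must be >= 1 and page size must be > 0")
    return (page - 1) * page_size


def paging_clause(page, page_size=10, order_var="?label_insOfresultSet"):
    """Solution modifiers appended after the WHERE block of the query."""
    offset = page_to_offset(page, page_size)
    # Ordering first keeps the pages stable between two executions.
    return f"ORDER BY ASC({order_var}) OFFSET {offset} LIMIT {page_size}"
```

For page 1, paging_clause returns exactly the clause in Line 10 of Listing 3.4; for page 2, only the OFFSET value changes to 10.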

The second query in this approach asks for the number of possible solutions. Standard SPARQL provides no function for this task. We therefore use COUNT(), which is also a SPARQL extension function provided by OpenLink Virtuoso, to calculate the number of instances. We execute a query similar to the first one but change what is returned by the SELECT clause. In Line 1, the query requests only the total number of possible instances by specifying the COUNT() function inside the SELECT clause (see Listing 3.5).

1 : SELECT COUNT(DISTINCT(?insOfresultSet)) AS ?totalNumber_concept
2 : WHERE {
3 :   <URI of the selected concept> rdf:type skos:Concept .
4 :   ?insOfresultSet skos:subject <URI of the selected concept> .
5 :   OPTIONAL {
6 :     ?insOfresultSet rdfs:comment ?comment_resultSet
7 :     FILTER (lang(?comment_resultSet) = "en")
8 :   }
9 : }

Listing 3.5. Query for obtaining a total number of possible solutions
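Reading the total from the endpoint's answer can be sketched as below. This assumes that the endpoint is asked for the W3C SPARQL Query Results JSON Format; the thesis does not state which result format gFacet uses, and the function name read_total and the sample value 57 are invented for illustration.

```python
import json


def read_total(response_text, var="totalNumber_concept"):
    """Extract the count bound to `var` from a SPARQL JSON result document."""
    data = json.loads(response_text)
    bindings = data["results"]["bindings"]
    if not bindings:
        # No solution row at all: treat it as zero instances.
        return 0
    # Literal values are transported as strings and must be converted.
    return int(bindings[0][var]["value"])


# Invented sample response with a single solution row.
SAMPLE_RESPONSE = json.dumps({
    "head": {"vars": ["totalNumber_concept"]},
    "results": {"bindings": [
        {"totalNumber_concept": {"type": "literal", "value": "57"}}
    ]},
})
```

The returned number can then be used to compute how many page buttons have to be created.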

Even though we generate two independent queries for this approach, the application works significantly faster by fetching just a subset of the instances than with the all-at-once querying approach, which fetches all instances into the client. In addition, with the step-by-step approach gFacet has no problem with concepts that have more than 1000 instances; it can explore all instances from the first to the last without any limitation.

The drawback of this approach is that it still has to send a query whenever the user moves from page to page, so in this respect it is slower than the all-at-once querying approach. Despite this drawback, the two advantages mentioned above make the step-by-step querying approach more suitable for gFacet.

In the following sections, we also omit the patterns that retrieve the description of an instance, like Line 6 and Line 7, to keep the queries simple for explanation.

Retrieving the Relations

Each node in the graph has a relation list that gives all available relations to nodes related to the current one. These relations are presented in a drop-down menu (see Figure 3.4) as pairs of an RDF predicate and the name of the related concept (predicate:nextConceptName). The number of related instances of the new concept is also displayed in the list. Selecting any of these relations opens a new node related to the current one.
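How one entry of the relation list could be derived from a query result row is sketched below. The helper names, the example URIs, and the way the count is appended in parentheses are assumptions made for illustration; only the predicate:nextConceptName pattern is taken from the text.

```python
def local_name(uri):
    """Return the part of a URI after the last '#', '/' or ':'."""
    tail = uri.rsplit("#", 1)[-1].rsplit("/", 1)[-1]
    return tail.rsplit(":", 1)[-1]


def relation_label(predicate_uri, concept_uri, num_instances):
    """Format one drop-down entry as predicate:ConceptName (count)."""
    return (
        f"{local_name(predicate_uri)}:{local_name(concept_uri)}"
        f" ({num_instances})"
    )
```

For example, relation_label("http://dbpedia.org/property/name", "http://dbpedia.org/resource/Category:German_footballers", 302) yields the entry name:German_footballers (302).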

Figure 3.4. The Relation List

As an example, we take the second row of the relation list. The relation name:German_footballers means that one or more instances of the node German Football Clubs are related to an arbitrary number of resources by the RDF predicate name. Among these resources, 302 are instances of the concept German Footballers. The connection between these resources and the concept German Footballers is defined by the predicate skos:subject. The RDF graph for this case is illustrated in Figure 3.5.

Figure 3.5. Constructing a pair of predicate and related concept

In Listing 3.6, we provide the query that fills the relation list. The query searches for all predicates that semantically connect the instances of the current concept with any other resources, and then for all new concepts to which these resources belong.

The important part of the query is in Line 6 and Line 7. In Line 6, once the instances of the current concept have been found, the query looks for any resources related to these instances by some RDF predicate. In Line 7, the query searches for the new concepts to which these resources belong.

Line 2 calculates the number of instances for each new concept.

1 : SELECT DISTINCT ?prop ?newConcept
2 :   COUNT(DISTINCT ?instNewConcept) AS ?numOfInstances
3 : WHERE {
4 :   <URI of the current concept> rdf:type skos:Concept .
5 :   ?instCurrConcept skos:subject <URI of the current concept> .
6 :   ?instCurrConcept ?prop ?instNewConcept .
7 :   ?instNewConcept skos:subject ?newConcept .
8 :   ?newConcept rdf:type skos:Concept .
9 : } ORDER BY DESC(?numOfInstances) ?prop ?newConcept LIMIT 40

Listing 3.6. Query for obtaining all the pairs of predicate and concept

Using the combination of RDF predicate and concept name to construct a facet leads to a relatively time-consuming query execution in the SPARQL engine, especially if the current node has a large number of instances. Every instance can have many predicates referring to other resources, and the query then has to look up the concepts to which each of these resources belongs. The resulting number of combinations is hard to handle: the execution at the SPARQL endpoint takes a long time to complete and, in the worst case, exceeds the execution time limit.

To mitigate this problem, we limit the query to 40 returned relations. We order the relation list in descending order of the number of related instances in the new concept (Line 9). We are aware that 40 relations do not cover all possible combinations; however, because of the descending order, the relations that are most likely to be important for the user appear at the top of the list.

3.2.3. Expanding the Graph

We now move to Goal 2, which is to expand the graph by adding more nodes to it. A new node is opened when the user selects a relation from the drop-down relation list. An edge is created and labeled with the predicate selected by the user, as shown in Figure 3.6. This edge connects the current node and the new node and indicates the semantic relation between them. Gradually expanding the nodes creates a chain of nodes that represents hierarchical facets. The new nodes act, directly or indirectly, as constraints on the result set.

In our case, Goal 2a is to see all German footballers, and our starting node is the initial node German Football Clubs. In the relation list of the initial node, the user looks for a relation that fits his requirement. He selects the relation name:German_footballers, and the new node is opened as shown in Figure 3.6.

Figure 3.6. Opening a new node by selecting a relation



At this point, Goal 2a has been achieved. Goal 2b and Goal 2c are achieved similarly. To get the English football clubs that are related to the German footballers, the user selects the relation clubs:English_football_clubs from the list in the node German Footballers. After the node English Football Clubs is opened, the user can select the relation ground:Football_venues_in_England to see all the stadiums where the related English clubs reside. A screenshot of a chain of 4 nodes is presented in Figure 3.7.

Figure 3.7. A chain of 4 nodes created after the user gradually expands the nodes

When building a query for a chain of nodes, a special characteristic of the relation between a child node and its parent has to be considered: an instance of a child node is not displayed if it is not related to any visible instance of its direct parent node. Figure 3.8 demonstrates how the parent node and the child node interact.



Figure 3.8. Model of a chain of 4 nodes describing the parent-child characteristic. An instance of a child node must be related to at least one visible instance of its parent node. Objects with a dotted outline are not visible.

We can see in Figure 3.8 that no German club employs the player B4, which is why B4 is not visible in the node German Footballers. Because B4 is not visible, the instance C1 in the node English Football Clubs is not displayed either.

This characteristic is intentional: only relevant instances of a child node can be used to filter the instances of its direct parent or, indirectly, the instances of the result set.
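The visibility rule can be illustrated with a small in-memory model. This is not how gFacet computes visibility (that is done by the SPARQL query); the function name and the sample data are invented in the spirit of Figure 3.8, where B4 has no link from any visible club and C1 is linked only from B4.

```python
def visible_instances(root_instances, links_per_level):
    """Return the set of visible instances for every node in a chain.

    `links_per_level[i]` holds (parent, child) pairs between node i and
    node i + 1. A child is visible only if a visible parent links to it.
    """
    levels = [set(root_instances)]
    for links in links_per_level:
        parents = levels[-1]
        levels.append({child for parent, child in links if parent in parents})
    return levels


# Invented sample data: clubs A1 and A2, players B1 to B4, clubs C1 and C2.
ROOT = ["A1", "A2"]
LINKS = [
    {("A1", "B1"), ("A2", "B2"), ("A2", "B3")},  # no club links to B4
    {("B1", "C2"), ("B4", "C1")},                # C1 is reachable only via B4
]
```

In this sample, visible_instances(ROOT, LINKS) hides B4 in the second node and therefore also hides C1 in the third node, while C2 stays visible through B1.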

We express this characteristic with nested OPTIONAL clauses. Each child node is written inside an OPTIONAL clause. With this clause, a visible instance of the parent node does not necessarily have a related instance in its child node, but each instance that is visible in the child node must be related to at least one visible instance of its parent node. The generated query for the chain shown in Figure 3.8 is given in Listing 3.7.

1 : SELECT ?instResultSet
2 : WHERE {
3 :   <URI of concept A> rdf:type skos:Concept .
4 :   ?instResultSet skos:subject <URI of concept A> .
5 :   OPTIONAL
6 :   {
7 :     <URI of concept B> rdf:type skos:Concept .
8 :     ?instOfB skos:subject <URI of concept B> .
9 :     ?instResultSet dbpedia2:name ?instOfB .
10:     OPTIONAL
11:     {
12:       <URI of concept C> rdf:type skos:Concept .
13:       ?instOfC skos:subject <URI of concept C> .
14:       ?instOfB dbpedia-owl:clubs ?instOfC .
15:       OPTIONAL
16:       {
17:         <URI of concept D> rdf:type skos:Concept .
18:         ?instOfD skos:subject <URI of concept D> .
19:         ?instOfC dbpedia2:ground ?instOfD .
20:       }
21:     }
22:   }
23: }
24: ORDER BY ASC(?label_instResultSet) OFFSET 0 LIMIT 10

Listing 3.7. Query for a chain of 4 nodes; the result set is the initial node

We can see that every child is written inside a nested OPTIONAL clause. The indentation in Listing 3.7 shows the level of the nodes in the graph. In Lines 3 and 4, the result set is defined as node A. Node A has a child, node B, and the query patterns for B are written in an OPTIONAL clause. Lines 7 and 8 search for all resources that are instances of node B. Line 9 declares the dependency between node A and node B, which is defined by the predicate dbpedia2:name. The same holds for Lines 12 to 14 and Lines 17 to 19 for nodes C and D. Because node C is a child of node B, all query patterns for C are written inside a nested OPTIONAL clause as well.
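The nesting lends itself to a recursive generator on the client side. The following sketch assumes a hypothetical node description (a dict with the keys concept, var, and predicate) and placeholder URIs; it orders by the instance variable itself because the label patterns are omitted here, so it is an approximation of Listing 3.7 rather than gFacet's actual generator.

```python
def _node_block(nodes, index, indent):
    """Patterns for node `index` and, nested in OPTIONAL, for its children."""
    node = nodes[index]
    pad = "  " * indent
    lines = [
        f"{pad}<{node['concept']}> rdf:type skos:Concept .",
        f"{pad}{node['var']} skos:subject <{node['concept']}> .",
    ]
    if index > 0:
        # Dependency between the parent node and this node.
        parent = nodes[index - 1]
        lines.append(f"{pad}{parent['var']} {node['predicate']} {node['var']} .")
    if index + 1 < len(nodes):
        # Every child is wrapped in one further OPTIONAL level.
        lines.append(f"{pad}OPTIONAL {{")
        lines.extend(_node_block(nodes, index + 1, indent + 1))
        lines.append(f"{pad}}}")
    return lines


def build_chain_query(nodes, offset=0, limit=10):
    """Build a query for a chain whose first node is the result set."""
    if not nodes:
        raise ValueError("the chain needs at least one node")
    root_var = nodes[0]["var"]
    body = "\n".join(_node_block(nodes, 0, 1))
    return (
        f"SELECT {root_var}\nWHERE {{\n{body}\n}}\n"
        f"ORDER BY ASC({root_var}) OFFSET {offset} LIMIT {limit}"
    )
```

A chain of four nodes produces three nested OPTIONAL blocks, one per child, in the same way as the indentation levels of Listing 3.7.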

3.2.4. Filtering

The idea of exploration with gFacet is to restrict the instances in the result set by selecting arbitrary restriction values so that the user can find the relevant information. gFacet eases this exploration by constructing the selection queries automatically every time the user adds a constraint. The user can select only one filter instance at a time; gFacet then displays the intermediate results in the result set before the user applies further selections.

We now take a closer look at the filtering operations that can be performed with gFacet. There are four of them: basic filtering, hierarchical filtering, union filtering, and intersection filtering. gFacet allows the user to combine these operations as desired.

Basically, the filtering is propagated upward from the selected node until it reaches the result set.

While propagating upward, there might be intermed
