7/29/2019 Faceted Exploration of Multiple RDF Data Sources Using SPARQL
Faceted Exploration of Multiple RDF Data Sources Using SPARQL
by
Hanief Bastian (2217547)
Submitted in partial fulfillment
of the requirements
for the degree of
Master of Science in Computer Engineering
Supervisors:
Prof. Dr.-Ing. Jürgen Ziegler
Dipl.-Inform. Philipp Heim
University of Duisburg-Essen
Faculty of Engineering
Department Computer and Cognitive Sciences
Institute of Interactive Systems and Interaction Design
6 May 2009
Abstract
Many applications that access the Semantic Web are structured in a Three-Tier
Architecture consisting of a Client Tier, a Server Tier, and a Data Tier. With the growing
number of SPARQL endpoints, parts of the data access logic have moved to the Data
Tier. This allows the query building process to be shifted to the Client Tier, which
reduces the resource and performance costs of accessing information contained in
the Semantic Web.
In this thesis, we describe the transformation from a Three-Tier Architecture to a Two-Tier Architecture using the example of gFacet, a tool for graph-based faceted access to
the Semantic Web, and we support the abilities of the gFacet tool by generating efficient
SPARQL queries on the client side. The former Three-Tier Architecture of gFacet did
not access the Semantic Web efficiently via a SPARQL endpoint, mainly because the
intermediate processing in the Server Tier could increase the total execution time. For
this reason, we reconstructed the architecture as well as the whole query building
process of the gFacet tool: we moved all the functionality of the application server into
the Client Tier and improved the performance of the queries in order to support
efficient, client-side access to any SPARQL endpoint and thereby to the various
information contained in the Semantic Web.
We provide all the queries that allow faceted exploration on a large RDF dataset. In this
thesis, we use an RDF dataset released by DBpedia. All the queries that support gFacet
to search a certain concept, to retrieve and filter the information, and to change the
information point-of-view are described in detail and evaluated regarding their
performance. We implement two different approaches to retrieve a large number of
instances and to enable paging through these instances: retrieving all instances at once
to the client using standard SPARQL, and retrieving a subset of the possible instances
using SPARQL extensions.
We extend the functionality of gFacet by providing the opportunity to explore more
than one RDF source. As an additional RDF dataset, we use RDF data released by
MusicBrainz. With this feature, gFacet can search for more information by exploring
the additional data source.
Keywords: SPARQL, gFacet, RDF, DBpedia, faceted exploration, Semantic Web,
SPARQL query optimization, multiple SPARQL endpoints access
Acknowledgements
First, I am so grateful to Allah SWT for the health, ideas, and everything that made this
thesis possible. I also offer my sincerest gratitude to Prof. Dr.-Ing. Jürgen Ziegler
and Dipl.-Inform. Philipp Heim. Without the guidance, the great efforts, and the great ideas
they have given, this thesis would not have been completed or written. I simply
could not wish for more supportive and friendlier supervisors.
I wish to express my warm thanks to my brothers at home, Mas Andy, Dicky, and Evan.
The encouragement they always give me is so meaningful to me.
Lastly, and most importantly, I wish to thank my parents, Papa Eddy Purwanto and
Mama Latifah. They bore me, raised me, support me, and love me. Thank you for the
great opportunity you gave me, so that I can stand here even though I am far away from you
both. To them I dedicate this thesis.
Hanief Bastian
May 2009
Duisburg-Germany
Contents
Chapter 1: Introduction
1.1. Motivation
1.2. Starting Point
1.2.1. Semantic Web
1.2.2. Resource Description Framework (RDF)
1.2.3. SPARQL
1.2.4. DBpedia Project
1.2.5. MusicBrainz
1.2.6. DBTune
1.2.7. Faceted Navigation
1.2.8. gFacet Project
1.3. Task Description
1.4. Related Works
1.5. Thesis Outline
Chapter 2: The Architecture
2.1. The Strategy
2.2. Client-side gFacet Architecture
2.2.1. SPARQL Query Dispatcher
2.2.2. SPARQL Result Parser
2.2.3. SPARQL Query Builder
Chapter 3: Browsing DBpedia
3.1. Dispatching a Query to DBpedia
3.2. Exploring DBpedia with gFacet
3.2.1. Searching for Concepts
3.2.2. Selecting the Initial Node
3.2.3. Expanding the Graph
3.2.4. Filtering
3.2.5. Result Set Pivoting
3.3. Evaluation
3.4. Time Measurements
3.5. Correctness Measurements
Chapter 4: Multiple Sources
4.1. The Strategy
4.2. The Obstacles
4.3. Finding the Equivalent Data
4.4. Transforming the URIs
4.4.1. Zitgist URI to DBTune URI Conversion
4.4.2. MusicBrainz Scheme to Music Ontology Mapping
4.5. Implementation
Chapter 5: Conclusions & Future Works
5.1. Conclusions
5.2. Future Works
5.2.1. Autocompletion Text Search
5.2.2. Searching for Instances
5.2.3. Automatic Data Interlinking
List of Figures
1.1. RDF graph representation
1.2. RDF graph example
1.3. Linking Open Data cloud
1.4. gFacet Architecture
1.5. gFacet user interface
2.1. gFacet Two-Tier Architecture
2.2. SPARQLQuery (the dispatcher) class diagram
2.3. RDFTerm class diagram
2.4. gFacet Data Flow Diagram
3.1. User actions flow while browsing DBpedia
3.2. Searching a concept
3.3. Opening the initial node
3.4. The Relation List
3.5. Constructing a pair of predicate and related concept
3.6. Opening a new node by selecting a relation
3.7. A chain of 4 nodes is created after user gradually expanding the nodes
3.8. Model a chain of 4 nodes describing parent-child characteristic
3.9. A constraint selected in direct child of result set triggers a basic filtering
3.10. Hierarchical filtering
3.11. Union filtering
3.12. Intersection filtering
3.13. A chain of 4 nodes after pivoting
3.14. A model of a chain of 4 nodes after pivoting
3.15. Average elapsed time for nodes with certain amount of instances
3.16. A chain of 4 nodes for evaluation
3.17. Sample dataset: A chain of 4 nodes with Britpop Musical Group as the result set
3.18. Relationships diagram for the sample dataset
4.1. Data sets that have been published and interlinked by Linking Open Data project (March 2009)
4.2. gFacet model of multi sources exploration
4.3. RDF graph of a subject with three equivalent resources
4.4. Two equivalent resources from (a) DBpedia, (b) MusicBrainz
4.5. Extracting the class and UUID of MusicBrainz to be mapped to DBTune URI scheme
4.6. Data flow diagram of gFacet multi sources exploration
4.7. Relation list with sameAs:musicBrainz element
4.8. A chain of two nodes from distinct sources
List of Listings
1.1. N3 statements
1.2. RDF/XML serialization
1.3. Simple query obtaining a name and mbox of a FOAF profile
1.4. A query obtaining a name and mbox of a FOAF profile with value constraint
1.5. A query obtaining a name and mbox of a FOAF profile with OPTIONAL clause
1.6. Simple query getting a name and mbox of a FOAF profile with group graph pattern
1.7. Example of COUNT clause
1.8. Example of subquery
1.9. SPARQL query results in XML format
2.1. HTTP trace of query dispatching
2.2. Native ActionScript data types representing the query results
3.1. Dispatching a SPARQL query to DBpedia
3.2. Query for concept searching
3.3. Query to retrieve all instances
3.4. Query to retrieve a subset of instances
3.5. Query for obtaining a total number of possible solutions
3.6. Query for obtaining all the pairs of predicate and concept
3.7. Query for a chain of 4 nodes; The result set is the initial node
3.8. Query for basic filtering. Constraint: footballer Thomas Hitzlsperger
3.9. Query for hierarchical filtering. Constraint: English Club C4
3.10. Query for union filtering. Constraint: English Club C2 and C4
3.11. Query for intersection filtering. Constraint: English Club D1 and C4
3.12. Query for result set pivoting. Result set: node C
4.1. Query to get all relations and count the interlinked data of a concept
4.2. Dispatching a SPARQL query to DBTune for MusicBrainz dataset
4.3. Query to get the interlinked data from DBTune for MusicBrainz dataset
5.1. Query to search for instances
List of Tables
1.1. SPARQL result for simple query
1.2. SPARQL result with OPTIONAL clause
3.1. The instances of each evaluation node
3.2. Average of elapsed time between HTTP request and response for a chain of 4 nodes
3.3. Result of OR-operation test cases
3.4. Result of AND-operation test cases
4.1. Comparison of MusicBrainz entity's type and Music Ontology class
Chapter 1
Introduction
1.1. Motivation
The Semantic Web [1] was introduced to extend the power of the current Web by
making its content understandable to machines and thus allow machines to perform
automated information gathering and to obtain more meaningful results. This requires
the semantics of content in the Web to be described in a machine-readable form by
using formal languages like RDF [14] and OWL [20]. These languages allow web
content to be assigned to semantically defined concepts and related to one another by
semantically defined relationships. Annotated in this way, information can be found more
efficiently and with more certainty.
With the steady growth of the Semantic Web, more and more annotated information is
published on the Web, leading to a growing number of RDF datasets. Even though RDF
data is originally meant to be read by machines, information about the meaning of Web
content and its interrelations can be highly valuable for humans, too. However, there is
no defined method to render RDF data in a way that can be easily understood by
humans, in contrast to, for example, HTML [15], where markup is used for
proper presentation. To let humans also benefit from the information contained in
RDF datasets, methods are needed to access and to render this information in an
appropriate way.
One promising way to access information that is contained in RDF data is offered by the tool gFacet [11]. It combines graph-based visualization with faceted filtering
functionality to build up queries and thus control what information is displayed on the
screen. The queries are formulated in SPARQL [9], the W3C recommendation for accessing
RDF data. SPARQL has a SQL-like syntax and can be used to express queries across
diverse data sources.
With the existing SPARQL endpoints around the Web that allow RDF datasets to be
queried, gFacet can be the bridge between the user and the RDF
dataset. Efficient and accurate queries need to be implemented in gFacet so that
gFacet can be a powerful faceted RDF browser. But in doing this, we also have to find
the right architecture for gFacet. The architecture has to be simple but efficient, and it
has to support a direct connection to a SPARQL endpoint so that gFacet can actively
access any SPARQL endpoint.
In general, the main focus of this thesis is as follows:
- Building an efficient architecture for gFacet. We find that a Two-Tier Architecture is suitable for gFacet. With a Two-Tier Architecture, the gFacet tool in the Client Tier can communicate directly with a SPARQL endpoint.
- Optimizing the query performance of the current gFacet by building new, efficient, and accurate queries in order to support faceted exploration over a large RDF dataset.
- Allowing exploration of multiple RDF sources using gFacet. With this feature, gFacet can browse from an entity in one source to the same entity in other sources.
1.2. Starting Point
This section gives a brief description of the platforms, technologies, and projects that
are used as the foundation of the thesis. We adapt this foundation in order to make it
applicable to our work.
1.2.1. Semantic Web
Since the WWW began in the early 1990s, it has had a great impact on mankind
in information, education, business, and even social life. The number
of websites on the Web keeps growing. According to the Google Blog¹, the Google index
reached 1 trillion unique URLs on the Web by the end of July 2008; for comparison,
the first Google index in 1998 had 26 million pages. However, most web pages
are currently still in the form of what we call the Syntactic Web. The Syntactic Web
focuses only on the visual presentation of the content. Once the content has been displayed,
it is up to the user to interpret the meaning.
But this trend has already begun to change since Sir Tim Berners-Lee introduced the term
Semantic Web in his article Semantic Web Roadmap in 1998 [1] and the subsequent The
Semantic Web in 2001 [2]. He asked what would happen if machines could talk to each
other and exchange information available around the Web. This idea can be realized by making
the semantics of the information understandable to machines, which is the main
goal of the Semantic Web. "The Semantic Web will enable machines to comprehend
semantic documents and data, not human speech and writings" [2].
¹ http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html
However, the Semantic Web is not designed to replace the Web of today but to improve it,
so that the next generation of the Web is accessible to both humans and machines.
1.2.2. Resource Description Framework (RDF)
The Resource Description Framework (RDF) is a general-purpose language for
representing information about resources on the Web. It is particularly intended for
representing metadata about Web resources, but it can also be used to represent
information about objects that can be identified on the Web, even when they cannot be
directly retrieved from the Web [3].
RDF allows semantics to be expressed in such a way that the information can
also be processed by applications, rather than only being displayed to users. Basically,
RDF defines a data model for describing machine-processable semantics of data [4].
The basic data model consists of three object types:
- Resources. A resource may be an entire Web page, a part of a Web page, a whole collection of pages, or an object that is not accessible via the Web (e.g., a printed book). Resources are always named by URIs.
- Properties. A property is a specific aspect, characteristic, attribute, or relation used to describe a resource.
- Statements. A specific resource, together with a named property plus the value of that property for that resource, constitutes an RDF statement. These three individual parts of a statement are called, respectively, the subject, the predicate, and the object of the statement.
RDF statements can be represented in a triple notation called N3, in RDF/XML
serialization, and as a graph of triples.
Let us start with an example of statements in natural language:
Kaka plays for AC Milan.
Kaka has jersey number 22.
AC Milan has a website accessible at http://www.acmilan.com/.
In N3, the statements are presented as follows:
.
"22" .
.
Listing 1.1. N3 statements
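As a rough illustration of the triple model, the three statements can be held as plain (subject, predicate, object) tuples and printed one statement per line. This is only a sketch in Python: the example.org namespace and the property names playsFor, jerseyNumber, and website are placeholders assumed for illustration, not the identifiers used in Listing 1.1.

```python
# Hypothetical namespace: these URIs are placeholders, not those of Listing 1.1.
EX = "http://example.org/"

# Each statement is a (subject, predicate, object) triple.
# URIs are wrapped in angle brackets, literals in double quotes.
triples = [
    (f"<{EX}Kaka>", f"<{EX}playsFor>", f"<{EX}AC_Milan>"),
    (f"<{EX}Kaka>", f"<{EX}jerseyNumber>", '"22"'),
    (f"<{EX}AC_Milan>", f"<{EX}website>", "<http://www.acmilan.com/>"),
]


def serialize(statements):
    """Return one 'subject predicate object .' line per triple."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in statements)


print(serialize(triples))
```

The second triple shows the one case in which the object is a literal rather than a resource.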
Listing 1.2 shows the same example in RDF/XML
22
Listing 1.2. RDF/XML serialization
Figure 1.1 presents the example in graph
Figure 1.1. RDF graph representation
The resource subjects and objects are drawn as ellipses, the literal object as a square,
and the properties as labeled-directed arcs.
RDF Schema is a simple set of standard RDF resources and properties that enables people
to create their own RDF vocabularies. The data model expressed by RDF Schema
resembles the class-based data model of object-oriented programming languages like Java:
it allows you to create classes of data [5].
1.2.3. SPARQL
RDF is the foundation of the Semantic Web. It is expected that more and
more open RDF datasets will be released in the future. Allowing easy access to these
collections requires a query language that can be executed against RDF data.
In January 2008, the RDF Data Access Working Group (DAWG) of the World Wide
Web Consortium (W3C) released a query language for RDF called SPARQL [9], a
recursive acronym that stands for SPARQL Protocol and RDF Query Language.
SPARQL can be used to express queries across diverse data sources, whether the data is
stored natively as RDF or viewed as RDF via middleware. SPARQL contains
capabilities for querying required and optional graph patterns along with their
conjunctions and disjunctions [9]. SPARQL is also considered a component of the Semantic Web.
Making a Simple Query
Most SPARQL queries contain a set of triple patterns called a basic graph pattern (BGP). Each part of a triple pattern acts as a subject, a predicate, or an object, and may be
given explicitly as a resource or a literal, or as a variable. SPARQL variables are
prefixed with either ? or $.
An example of RDF data presented as an RDF graph is shown below
Figure 1.2. RDF graph example
A query to find persons with a name and an email address is executed against the
given RDF data:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?mbox
WHERE {
  ?x foaf:name ?name .
  ?x foaf:mbox ?mbox
}
Listing 1.3. Simple query obtaining a name and mbox of a FOAF profile
The PREFIX clause binds the prefix foaf to the FOAF² namespace. The SELECT clause
specifies what the query should return, in this case the variables name and mbox. The WHERE
clause provides the basic graph pattern to match against the data.
² http://xmlns.com/foaf/spec/
The query processor matches the graph pattern of the query against the data model. The result of this
query is shown in Table 1.1. A solution sequence consists of one or more solutions if
the graph pattern matches the model, or of zero solutions if there is no
match.
name mbox
"Bobby Iceman"
"Tony Ironman"
Table 1.1. SPARQL result for simple query
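How a basic graph pattern produces the rows of such a table can be sketched with a small matcher in Python. The sketch below is not how a SPARQL processor is implemented; it only illustrates the idea of binding variables pattern by pattern. The blank node labels and the mailbox addresses are placeholders assumed for illustration.

```python
FOAF = "http://xmlns.com/foaf/0.1/"

# Hypothetical data: node labels and mailbox addresses are placeholders.
data = [
    ("_:a", FOAF + "name", "Bobby Iceman"),
    ("_:a", FOAF + "mbox", "mailto:bobby@example.org"),
    ("_:b", FOAF + "name", "Tony Ironman"),
    ("_:b", FOAF + "mbox", "mailto:tony@example.org"),
    ("_:c", FOAF + "name", "No Mailbox"),
]


def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` equals `triple`, or return None."""
    result = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if p_term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if result.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return result


def match_bgp(patterns, triples):
    """Return every binding that satisfies all triple patterns."""
    solutions = [{}]
    for pattern in patterns:
        extended = []
        for binding in solutions:
            for triple in triples:
                candidate = match_pattern(pattern, triple, binding)
                if candidate is not None:
                    extended.append(candidate)
        solutions = extended
    return solutions


bgp = [("?x", FOAF + "name", "?name"), ("?x", FOAF + "mbox", "?mbox")]
rows = [(s["?name"], s["?mbox"]) for s in match_bgp(bgp, data)]
print(rows)
```

The node without a mailbox produces no row, because the second triple pattern cannot be matched for it.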
Value Constraints
FILTER is an optional clause to restrict solutions to those for which the filter expression
evaluates to TRUE.
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?mbox
WHERE {
  ?x foaf:name ?name .
  ?x foaf:mbox ?mbox .
  FILTER regex(?name, "^Bobby")
}
Listing 1.4. A query obtaining a name and mbox of a FOAF profile with value constraint
The query above will give Bobby Iceman and his mailbox as the solution.
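The effect of such a value constraint can be imitated in Python by filtering a list of solutions with a regular expression. This is a minimal sketch under stated assumptions: the solutions and mailbox addresses are placeholders, and re.search is used because the SPARQL regex function matches anywhere in the string unless the pattern is anchored.

```python
import re

# Hypothetical solutions of the unfiltered query; addresses are placeholders.
solutions = [
    {"name": "Bobby Iceman", "mbox": "mailto:bobby@example.org"},
    {"name": "Tony Ironman", "mbox": "mailto:tony@example.org"},
]


def filter_regex(rows, variable, pattern):
    """Keep the solutions whose value for `variable` matches `pattern`."""
    return [row for row in rows if re.search(pattern, row[variable])]


print(filter_regex(solutions, "name", "^Bobby"))
```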
Including Optional Values
A solution is included in the solution sequence only if every triple pattern of the query
matches the data model. If at least one pattern fails to match, the whole solution
is rejected, and the query may return an empty solution sequence. So, it is useful to
have query patterns that still allow the query to provide bindings even if a part of the
query pattern fails to match the data model. The OPTIONAL clause provides this feature: even
if the optional part does not create any binding, it does not eliminate the solution.
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?mbox
WHERE {
  ?x foaf:mbox ?mbox .
  OPTIONAL { ?x foaf:name ?name }
}
Listing 1.5. A query obtaining a name and mbox of a FOAF profile with OPTIONAL clause
The query above finds all the email addresses, whether or not a person's name is
associated with them, as shown in Table 1.2.
name mbox
"Bobby Iceman"
"Tony Ironman"
Table 1.2. SPARQL result with OPTIONAL clause
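The difference between a required and an optional pattern corresponds to the difference between an inner join and a left outer join. The following Python sketch illustrates this with two small dictionaries; the node labels and mailbox addresses are placeholders assumed for illustration.

```python
# Hypothetical profiles: one mailbox has no associated name.
mboxes = {
    "_:a": "mailto:bobby@example.org",
    "_:b": "mailto:tony@example.org",
    "_:c": "mailto:anon@example.org",
}
names = {"_:a": "Bobby Iceman", "_:b": "Tony Ironman"}


def required_only(mbox_by_node, name_by_node):
    """Both patterns required: nodes without a name are dropped."""
    return [
        (name_by_node[node], mbox)
        for node, mbox in mbox_by_node.items()
        if node in name_by_node
    ]


def with_optional(mbox_by_node, name_by_node):
    """Name pattern optional: every mailbox is kept, the name may be unbound."""
    return [(name_by_node.get(node), mbox) for node, mbox in mbox_by_node.items()]


print(required_only(mboxes, names))
print(with_optional(mboxes, names))
```

In the second result, the unbound name is represented by None, which corresponds to an empty cell in the result table.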
Group Graph Pattern
A group graph pattern can consist of zero, one, or multiple basic graph patterns. A group graph pattern is delimited with curly braces {}. The query in Listing 1.3 can be
rewritten into the query in Listing 1.6, which groups the triple patterns into two basic graph
patterns. Even though the two queries have different structures, they give the same solution
sequence.
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?mbox
WHERE {
  { ?x foaf:name ?name . }
  { ?x foaf:mbox ?mbox . }
}
Listing 1.6. Simple query getting a name and mbox of a FOAF profile with group graph pattern
An extensive explanation of SPARQL syntax and semantics can be found in the SPARQL
Query Language for RDF document [9].
SPARQL Extensions
The current SPARQL version has a number of limitations: for example, SPARQL is
read-only and cannot modify an RDF dataset, and it does not support subqueries or aggregate
functions. However, OpenLink Virtuoso³ provides some extensions to
SPARQL in order to overcome these limitations.
In this thesis, only the SPARQL extensions for subqueries and the aggregate function COUNT
are explained, because these extensions are used intensively in the thesis.
COUNT function: The COUNT function counts the number of
solutions satisfying the criteria specified in the WHERE clause. The argument of the count
aggregate may be either *, which means counting all rows, or a variable
name, which means counting all the rows where this variable is bound. There can be an
³ http://www.openlinksw.com/virtuoso/
optional DISTINCT keyword before the variable that is the argument of an aggregate. An
example can be seen in Listing 1.7. The example returns the number of bindings of the
variable o for each distinct p.
select ?p count (?o)from where {?s ?p ?o};
Listing 1.7. Example of COUNT clause
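What this aggregate computes can be mimicked in a few lines of Python. The sketch below groups a handful of made-up triples by predicate and counts the objects, with and without the optional DISTINCT behaviour; the data and the function name count_objects are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical triples; subjects, predicates, and objects are placeholders.
triples = [
    ("s1", "p1", "o1"),
    ("s1", "p1", "o2"),
    ("s2", "p1", "o1"),
    ("s1", "p2", "o3"),
]


def count_objects(statements, distinct=False):
    """Count the ?o bindings per ?p; with distinct=True each (p, o) pair counts once."""
    pairs = [(p, o) for _, p, o in statements]
    if distinct:
        pairs = set(pairs)
    return Counter(p for p, _ in pairs)


print(count_objects(triples))
print(count_objects(triples, distinct=True))
```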
Subquery extension: A subquery (also called an inner query or nested query) is a query inside a
query. It is usually used for a complex computation that cannot be done using only
one query. In SPARQL, a subquery is added inside the WHERE clause of the query.
For example, one use case is to take all the teams in the database and, for every team with more than
5 members, add the big_team class and a property for the member count.
construct { ?team a big_team . ?team member_count ?ct }
where {
  ?team a team .
  { select ?team2 count (*) as ?ct
    where { ?m member_of ?team2 } } .
  filter (?team = ?team2 and ?ct > 5)
}
Listing 1.8. Example of subquery
SPARQL Query Results XML Format
Most SPARQL processors provide the SPARQL query result in various document
formats, which allows programmers to choose the most convenient format for their
application. To make the result serializable for any application, the W3C recommends
the SPARQL Query Results XML Format [10], in which the returned result set is written as
an XML document.
The SPARQL results in XML for the query in Listing 1.3 are shown below:
BobbyIceman
mailto:[email protected]
TonyIronman
mailto:[email protected]
Listing 1.9. SPARQL query results in XML format
A SPARQL results document begins with the document definition and a namespace, http://www.w3.org/2005/sparql-results#, to which all of the key elements
belong. Inside the <sparql> element there are two sub-elements: <head> and a
results element, which is <results> for SELECT queries or <boolean> for
ASK queries. The <head> element declares all the variables returned in the result set. These variables are the same as the variables declared in the SELECT clause of the
query. Looking back at Table 1.1, the variables are equivalent to the column headings.
The <results> section contains the solution sequence (a set of query solutions). Each query
solution is stored in a <result> sub-element. Every <result> element
corresponds to a row in Table 1.1. Every <result> element contains one or more
<binding> elements with a name attribute defining the bound variable.
The value of a query variable binding, which may be a resource/URI, a string literal, a
typed literal, or a blank node, is included as the content of the <binding> as follows:
RDF URI Reference U: <binding><uri>U</uri></binding>
RDF Literal S: <binding><literal>S</literal></binding>
RDF Literal S with language L: <binding><literal xml:lang="L">S</literal></binding>
RDF Typed Literal S with datatype URI D: <binding><literal datatype="D">S</literal></binding>
Blank Node label I: <binding><bnode>I</bnode></binding>
If a variable is unbound, no <binding> element for that variable is included in the <result> element.
1.2.4. DBpedia Project
The DBpedia project [6] is a community effort to extract structured information from
Wikipedia and to make this information available on the Web. DBpedia allows you to
ask sophisticated queries against Wikipedia and to link other datasets on the Web to
Wikipedia data. The goals of the DBpedia project are to convert Wikipedia content into a
large, multi-domain RDF dataset that can be used in Semantic Web
applications; to interlink the DBpedia dataset with other open datasets, creating a large Web of open data; and to develop interfaces so that Web services can make use of the DBpedia
dataset.
The DBpedia project extracts various kinds of structured information from Wikipedia,
such as infobox templates, categorization information, images, geo-coordinates, and
links to external websites [7]. Since DBpedia 3.2, a new infobox extraction method
has been used to create the DBpedia ontology.
The DBpedia dataset currently consists of around 274 million RDF triples, which have
been extracted from Wikipedia editions in 14 languages. The DBpedia knowledge base
currently describes more than 2.6 million things, including at least 213,000 persons, 328,000 places, 57,000 music albums, 36,000 films, and 20,000 companies. It features
labels and short abstracts for these things in 14 different languages; 609,000 links
to images and 3,150,000 links to external web pages; 4,878,100 external links into other
RDF datasets, 415,000 Wikipedia categories, and 75,000 YAGO categories.
The DBpedia dataset can be accessed online by querying via a public SPARQL query
endpoint at http://dbpedia.org/sparql, hosted by Virtuoso, or by browsing as Linked
Data using Semantic Web browsers like Disco, Tabulator, or Marbles.
The DBpedia dataset is interlinked with various open datasets on the Web using RDF
links. This enables DBpedia users to discover information by starting from a resource in the DBpedia dataset and moving to related data in other sources. The use of RDF links between
related resources creates a giant Web-of-Data which, as of September 2008, consisted of
approximately 2 billion RDF triples. Figure 1.3 gives an overview of the Web of
interlinked data.
The Web-of-Data enables users to navigate, for example, from a resource describing a musical
band in the DBpedia dataset to a list of their songs in MusicBrainz or to a list of reviews of
the band in Revyu.
The example RDF link below connects the URI of Portishead in DBpedia to the related URI
in MusicBrainz:
owl:sameAs
Figure 1.3. Linking Open Data cloud
DBpedia provides three different classification schemata for things.
1. Wikipedia Categories are represented using the SKOS vocabulary4.
2. The YAGO Classification5 is derived from the Wikipedia category system using WordNet.
3. WordNet Synset Links were generated by manually relating Wikipedia infobox templates and WordNet synsets, and adding a corresponding link
to each thing that uses a specific template. In theory, this classification should
be more precise than the Wikipedia category system.
4 http://www.w3.org/2004/02/skos/
5 http://www.mpi-inf.mpg.de/yago-naga/yago/
These classifications enable users to select a specific thing mentioned in the executed
SPARQL queries.
1.2.5. MusicBrainz
MusicBrainz6 is a community music metadatabase that attempts to create a
comprehensive music information site. MusicBrainz collects this information about
recordings and makes it available to the public. Users can contribute their knowledge
about music which then can be shared with others.
The music metadata covers Artists, Releases, Tracks, Labels, and the
advanced relationships among them. This metadata is stored in a PostgreSQL relational
database.
MusicBrainz has URI schemes to identify their entities, such as
http://musicbrainz.org/artist/UUID
http://musicbrainz.org/release/UUID
http://musicbrainz.org/track/UUID
http://musicbrainz.org/label/UUID
where UUID is a Universally Unique Identifier7 in its 36 character ASCII
representation.
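The URI schemes above can be sketched in a few lines of Python. This is only an illustration: the function name and the identifier in the example are hypothetical, and the sketch assumes nothing beyond the four entity kinds and the 36-character UUID form described in the text.

```python
import uuid

# Entity kinds taken from the URI schemes listed above.
ENTITY_KINDS = ("artist", "release", "track", "label")


def musicbrainz_uri(kind: str, identifier: str) -> str:
    """Build a MusicBrainz entity URI after validating the UUID part."""
    if kind not in ENTITY_KINDS:
        raise ValueError(f"unknown entity kind: {kind!r}")
    # uuid.UUID raises ValueError for malformed input; str() yields the
    # canonical 36-character form (32 hex digits and 4 hyphens).
    canonical = str(uuid.UUID(identifier))
    return f"http://musicbrainz.org/{kind}/{canonical}"


# Hypothetical identifier, used only for illustration.
example = musicbrainz_uri("artist", "12345678-1234-5678-1234-567812345678")
```

Validating the identifier before building the URI keeps malformed links out of later queries.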
1.2.6. DBTune
DBTune8 hosts a number of servers that provide access to music-related structured data
in a Linked Data fashion. DBTune publishes all of its data using open Web standards
such as RDF and SPARQL.
DBTune provides various datasets, including MusicBrainz data, which it
maps to the Music Ontology [28]. The MusicBrainz data
is now available via a SPARQL endpoint, http://dbtune.org/musicbrainz/sparql, powered by D2R Server. The basic graph of MusicBrainz is located at http://musicbrainz.org/.
6 http://musicbrainz.org/
7 http://en.wikipedia.org/wiki/UUID
8 http://dbtune.org/
1.2.7. Faceted Navigation
Data on the Semantic Web is semi-structured and does not follow one fixed schema [8].
Faceted navigation [12] is an exploratory interface suitable for such data. Facets are categories used to characterize information items in a collection [27]. Once the data
is categorized into facets, the user explores it by selecting restriction values
of the facets in order to filter the result set.
There are two possible methods to map subject-predicate-object RDF triples to facets.
In the first, facets are computed from the predicates that connect two resources: information
elements are RDF subjects, facets are RDF predicates, and restriction values are RDF
objects [8]. In the second, facets are computed from all resources that are related to resources in
the result list, grouped by their specific characteristics or concepts, for example using the predicate
rdf:type or skos:subject. [13] implemented the latter method in
their browser interface. An example that applies both methods is explained below.
A result list of people related to Formula 1 racing can have predicates such as first
team, current team, former team, or last team that connect to a group of
racing teams, and predicates such as lives in, born in, or died in that connect to a group of countries. With the first method, 7 facets are constructed, one per
predicate. With the second method, only the two facets racing team and country
appear.
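The difference between the two methods can be made concrete with a small Python sketch. The triples, resource names, and function names below are invented for illustration; only the seven predicates and the two concepts come from the Formula 1 example above.

```python
from collections import defaultdict

# Toy triples (subject, predicate, object); all resource names are illustrative.
TRIPLES = [
    ("driver1", "first team", "teamA"),
    ("driver1", "current team", "teamB"),
    ("driver1", "born in", "Germany"),
    ("driver1", "lives in", "Switzerland"),
    ("driver2", "former team", "teamA"),
    ("driver2", "last team", "teamC"),
    ("driver2", "died in", "Italy"),
    # Type statements for the related resources (used by the second method).
    ("teamA", "rdf:type", "racing team"),
    ("teamB", "rdf:type", "racing team"),
    ("teamC", "rdf:type", "racing team"),
    ("Germany", "rdf:type", "country"),
    ("Switzerland", "rdf:type", "country"),
    ("Italy", "rdf:type", "country"),
]
RESULT_LIST = {"driver1", "driver2"}


def facets_by_predicate(triples, results):
    """First method: one facet per predicate, objects are restriction values."""
    facets = defaultdict(set)
    for subject, predicate, obj in triples:
        if subject in results:
            facets[predicate].add(obj)
    return dict(facets)


def facets_by_concept(triples, results):
    """Second method: group the related resources by their rdf:type."""
    types = {s: o for s, p, o in triples if p == "rdf:type"}
    facets = defaultdict(set)
    for subject, _predicate, obj in triples:
        if subject in results and obj in types:
            facets[types[obj]].add(obj)
    return dict(facets)


by_predicate = facets_by_predicate(TRIPLES, RESULT_LIST)
by_concept = facets_by_concept(TRIPLES, RESULT_LIST)
```

With this toy data the first function yields seven facets and the second yields two, which matches the counts given in the example.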
A faceted interface has several advantages over keyword search or explicit queries: it
allows exploration of an unknown dataset since the system suggests restriction values at
each step; it is a visual interface, removing the need to write explicit queries; and it prevents dead-end queries by only offering restriction values that do not lead to empty
results [9].
1.2.8. gFacet Project
gFacet [11] is a browsing approach that supports the exploration of RDF datasets by
combining graph-based visualization with faceted filtering functionalities. This
combination makes it easier to explore large and highly interrelated RDF datasets.
The major aims of the approach are:
1. Prevention of an over-cluttered graph: The facet-based visualization groups the data instances into separate facets according to their characteristics.
Rather than drawing an edge for every relation between two individual instances,
it represents all relations between the instances of one facet and the
instances of another facet by a
single edge. The relations between individual instances become visible only indirectly,
when the user selects an instance in a facet in order to filter the result set.
2. Representation of relations between facets: Visualizing the information as a graph, with facets as nodes and the relations between them as labeled edges, can
make the hierarchy of the information more understandable.
3. Single coherent visualization: A graph-based visualization prevents users from getting lost in hyperspace by displaying all the information in a single
visualization instead of spreading it over several screens or windows.
The Architecture
gFacet is built on a Three-Tier Architecture consisting of a Client Tier, a Server Tier, and a Data
Tier.
Figure 1.4. gFacet Architecture
The first tier of the Three-Tier Architecture is the Client Tier, in which the gFacet user interface
is displayed. gFacet is implemented using Adobe Flex9, a framework for creating Rich
Internet Applications (RIAs) based on the Adobe Flash10 platform. RIAs created with Flex
can run in every browser that has Adobe Flash Player installed. This makes gFacet
an interactive RDF data browser that can run on any operating system, as long as
a browser and Adobe Flash Player are installed.
The Server Tier is the application server, also called the middleware, where the
application logic and server software reside. The middleware is implemented in
PHP and provides the logic of query generation: it generates queries according to user
tasks and sends the query results back to the Client Tier. Because gFacet is a Flash-based
user interface, AMFPHP11 is used to serialize the communication between gFacet and
the PHP class objects on the server. To get the data from a relational database
and transform it into an RDF model, the RAP (RDF API for PHP)12 package is used. RAP
9 http://www.adobe.com/products/flex/
10 http://www.adobe.com/products/flash/
11 http://www.amfphp.org/
12 http://www.seasr.org/wp-content/plugins/meandre/rdfapi-php/doc/
is a software package for parsing, querying, manipulating, serializing and serving RDF
models. The available RDF models are then manipulated using RAP's SPARQL
package.
The Data Tier is where the physical data is served for the application. The RDF data is
stored in a relational database.
The Prototype
gFacet can be accessed via the Internet using a browser with Flash Player installed. In
the prototype version, gFacet uses a sample dataset from the field of music.
Initially, a node that contains a list of songs is displayed. The node can be expanded to another
related node by selecting a pair consisting of a relationship and the facet to which it refers from a
dropdown menu at the bottom of the node. When a user selects a pair from the dropdown
menu, a new node is opened and connected to the original node by a labeled edge.
The instances of the new node can act as a filter for the instances of the connected node.
Expanding the nodes step by step creates a collection of hierarchical facets, as
illustrated in Figure 1.5.
Figure 1.5. gFacet user interface
1.3. Task Description
In this thesis, we focus on accessing and querying RDF data from arbitrary
RDF sources with SPARQL and on applying this in a faceted RDF browser. We use gFacet as the platform that displays the information to the users after the data has been
processed in a faceted manner. To improve gFacet's performance in terms of
data access and availability, we need to modify the gFacet
architecture and then rebuild the SPARQL queries it uses.
The Three-Tier Architecture of gFacet has drawbacks in terms of resource utilization
and execution time. The idea for avoiding these issues is to let gFacet perform all
of its logic on the client side, so that gFacet can access any
SPARQL endpoint directly, without requiring a server in between. We will replace the
functionalities of AMFPHP, the application logic, and the RAP packages by rebuilding
similar functionalities in the Client Tier.
The second task is to build new queries that improve gFacet's performance. The
queries should be efficient and accurate, so that they support gFacet in accessing a large RDF
dataset in a faceted manner.
The main SPARQL endpoint and RDF dataset we use in this thesis are those
released by the DBpedia project. DBpedia is a good source of information because it is
a large, multi-domain RDF dataset extracted from Wikipedia.
We need to build queries that allow the user to browse DBpedia effectively
and efficiently. We then evaluate the performance of the queries by
measuring the time required to execute them and the accuracy of their results.
Since DBpedia is also interlinked with various RDF sources around the Web, the
user can jump from a resource in DBpedia to a related resource in an RDF dataset of
another source by following the given RDF link. A good example is that DBpedia is
interlinked with the MusicBrainz datasets. As the third task in this thesis, we adjust
gFacet so that it can follow RDF links from DBpedia to MusicBrainz, allowing users
to explore the data from both sources as if they were browsing a single huge
dataset.
1.4. Related Works
SPARQL has been recommended by the W3C as a standard language for querying RDF
datasets. Many approaches to querying RDF data using SPARQL have been
investigated.
Erling [21] investigated sample SPARQL queries executed against large
datasets using the OpenLink Virtuoso SPARQL engine. The queries make use of SPARQL
extensions that work with Virtuoso as the back end.
Many other studies try to optimize the performance of SPARQL. [23], [26], and [24]
implemented different optimization approaches, but all of them suggest
reordering the query in order to obtain an optimized query execution plan.
Even though more and more data is available over SPARQL endpoints, it is still
difficult to integrate data from multiple data sources. RDF data integration is often done
by loading all data into a single repository and querying the merged data locally [22].
Tabulator [16] uses this approach. Tabulator collects all the information by following
the related resources indicated by the owl:sameAs or rdfs:seeAlso predicate and stores it in a local repository. Tabulator then allows the user to query the locally stored
data.
Quilitz and Leser [22] built DARQ13, a query engine for federated SPARQL queries. It
provides transparent query access to multiple endpoints. The implementation introduces
service descriptions, which declaratively describe the data available from each endpoint and are used to determine the endpoint to which a query should be sent.
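The idea of routing by service description can be sketched as follows in Python. This is a deliberately simplified illustration and not DARQ's actual format or algorithm: it assumes that a description merely lists the predicates an endpoint can answer, and the predicate sets and the function name are hypothetical. Only the two endpoint URLs are taken from this thesis.

```python
# Hypothetical service descriptions: endpoint URL -> predicates it can answer.
SERVICE_DESCRIPTIONS = {
    "http://dbpedia.org/sparql": {"rdfs:label", "skos:subject", "owl:sameAs"},
    "http://dbtune.org/musicbrainz/sparql": {"foaf:made", "dc:title"},
}


def select_endpoints(triple_patterns, descriptions=SERVICE_DESCRIPTIONS):
    """Map each triple pattern to the endpoints that list its predicate."""
    plan = {}
    for pattern in triple_patterns:
        _subject, predicate, _obj = pattern
        plan[pattern] = sorted(
            url for url, predicates in descriptions.items()
            if predicate in predicates
        )
    return plan


plan = select_endpoints([
    ("?concept", "rdfs:label", "?label"),
    ("?artist", "foaf:made", "?record"),
    ("?x", "ex:unknown", "?y"),
])
```

A pattern whose predicate appears in no description is mapped to an empty list, so the caller can detect that no known endpoint can answer it.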
1.5. Thesis Outline
This thesis is organized as follows.
Chapter 2 explains how to turn gFacet into a fully client-side application. It starts with
the main strategy and then explains the new gFacet architecture and the new components
that have been built during the thesis.
Chapter 3 explains in detail how the generated queries work. The explanation
proceeds step by step, following the user's actions while browsing the DBpedia dataset
with gFacet. At the end, the chapter presents the evaluation results based on time
and accuracy measurements.
Chapter 4 describes how to enable gFacet to use more than one RDF source.
Here, gFacet is set up to query datasets from both DBpedia and
MusicBrainz.
Chapter 5 gives a short summary of the implementation and evaluation in this
thesis. The chapter also introduces some ideas that can serve as a foundation for future
improvements of gFacet.
13 Distributed ARQ, an extension to ARQ (http://jena.sourceforge.net/ARQ)
Chapter 2
The Architecture
The current gFacet works on a Three-Tier Architecture consisting of the Client Tier, the Server Tier, and the Data Tier. The Client Tier is where the user interface and the
presentation logic reside. AMFPHP1, the application logic written in PHP2, and the
RAP3 packages are located in the Server Tier. The physical relational database is
placed in the Data Tier.
This kind of architecture raises two issues.
1. The Three-Tier Architecture is resource-expensive. Using middleware means that more resources, such as additional dedicated machines, more disk space, or more working
memory, are necessary in addition to a database server.
2. It is relatively time-consuming. Instead of communicating directly with the Data Tier, additional processing in the Server Tier has to be done. This
increases the total execution time.
These issues motivate us to improve gFacet's performance by turning it into a
client-side application. This chapter explains the strategy that makes this possible. The
structure of this chapter is as follows. Section 2.1 describes the strategy in general.
Section 2.2 explains the new architecture and its implementation.
2.1 The Strategy
The problems of the gFacet architecture are located in the Server Tier. There is too
much intermediate processing before the user's commands can be processed in the Data
Tier. The strategy focuses on the PHP application logic and the RAP packages
in the Server Tier. The PHP application logic plays an important role because it is
where the queries are built, and the RAP packages act as the query dispatcher
1 http://www.amfphp.org/
2 http://www.php.net/
3 http://www.seasr.org/wp-content/plugins/meandre/rdfapi-php/doc/
and as a parser for the results from the database server. We can set AMFPHP aside
because this package is only used to serialize the communication between gFacet and
the middleware.
There are three main steps that have to be done to make gFacet a fully client-side
application.
1. We have to rebuild the query builder in the Client Tier; this will replace the PHP application logic on the server.
2. In order to replace the RAP packages, we have to build a query dispatcher that sends the query to a SPARQL endpoint; and
3. Build a parser for the result returned by the SPARQL endpoint.
Since gFacet is implemented using Adobe Flex4, all the new components are built in
ActionScript5, the client-side scripting language at the core of Flex. With
this implementation, the application logic runs entirely on the client.
2.2 Client-side gFacet Architecture
By completing the three steps mentioned in the previous section, we have
simplified the architecture of gFacet. We have moved all the necessary functionalities,
namely query building, query dispatching, and result parsing, into the Client Tier. The
new gFacet therefore runs completely on the client side. This means that gFacet is now built on a Two-Tier Architecture consisting of a Client Tier and a Data Tier, as illustrated in
Figure 2.1. With this architecture, the user's requests can be sent to the SPARQL engine
without any intermediate processing.
Figure 2.1. gFacet Two-Tier Architecture
4 http://www.adobe.com/products/flex/
5 http://www.actionscript.org/
In the next subsections, we briefly explain the three architecture components that
we developed during the thesis: the query builder, the query dispatcher, and the SPARQL
result parser. The development of the presentation handler, the action listener, and the
SPARQL endpoint is out of the scope of this thesis, because they had been established before the thesis started.
2.2.1 SPARQL Query Dispatcher
A SPARQL endpoint allows a SPARQL query to be conveyed as an HTTP request over
the Web using the GET or POST method. This HTTP request is assembled and sent to the
SPARQL endpoint by the query dispatcher. The request contains some
parameters that are required by the SPARQL endpoint. The parameters are as follows:
1. query, which specifies the query pattern to be executed.
2. default-graph-uri, which specifies the graph used to form the default graph. Specifying this parameter overrides any default graph defined in the
query pattern with a FROM clause.
3. output, which specifies the result format to be returned. In this application we expect an XML document containing the SPARQL query result.
The query dispatcher is derived from the HTTPService class provided by Flex, as
shown in Figure 2.2. The send() method of the HTTPService object sends
an HTTP request to the host specified by the url variable and receives the HTTP response. The method can also pass parameters to the specified url. Hence, inside
the execute() method of the SPARQLQuery object, the required SPARQL parameters
explained above are packaged together into one object variable called parameters,
which is sent with the HTTP request by calling the send() method.
Figure 2.2. SPARQLQuery (the dispatcher) class diagram
The abstract HTTP trace in Listing 2.1 illustrates the invocation of a
SPARQL query on the SPARQL endpoint http://example.org/sparql/ using the GET
method. EncodedGraphURI and EncodedQuery stand for the URL-encoded
representations of the graph URI and the query pattern.
GET /sparql?default-graph-uri=EncodedGraphURI&query=EncodedQuery&
output=xml HTTP/1.0
Host: example.org
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US;
rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8
Listing 2.1. HTTP trace of query dispatching
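The way the request URL of Listing 2.1 is assembled can be sketched in Python. This is not the ActionScript dispatcher of the thesis: the function name, the sample query, and the graph URI are hypothetical, and only the three parameter names (query, default-graph-uri, output) come from the text above.

```python
from urllib.parse import urlencode


def build_request_url(endpoint: str, query: str,
                      default_graph_uri: str = "",
                      output: str = "xml") -> str:
    """Assemble the GET URL that carries the three parameters described above."""
    params = {"query": query, "output": output}
    if default_graph_uri:
        params["default-graph-uri"] = default_graph_uri
    # urlencode percent-encodes the values, which yields the EncodedQuery
    # and EncodedGraphURI forms that appear in the trace.
    return endpoint + "?" + urlencode(params)


url = build_request_url(
    "http://example.org/sparql",
    "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1",
    default_graph_uri="http://example.org/graph",  # hypothetical graph URI
)
```

Encoding is left to the standard library so that characters such as braces and question marks in the query cannot corrupt the URL.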
2.2.2 SPARQL Result Parser
The W3C recommends a standardized XML document as one possible SPARQL result format
that can easily be processed by any application. We build a parser for this XML document
that understands the document structure and converts the elements into
native ActionScript datatypes, which are required for further processing.
All variable names found inside the <head> element are stored in an array. All
the results of a SELECT query are stored in a multidimensional array in which each
row represents one query solution.
One thing to consider is that each variable binding inside a solution can be
a resource/URI, a literal, or a blank node. This RDF term is given in a <uri>, <literal>, or <bnode> element, and each kind has different behavior, as modeled in the
ActionScript classes shown in Figure 2.3.
Figure 2.3. RDFTerm class diagram
The variable names and query solutions of the
example XML document in Listing 1.9 are thus represented by the arrays in Listing 2.2:
arrVariables = ["name", "mbox"];
arrResult =
[
  { name: new Literal("Bobby Iceman", null,
      "http://www.w3.org/2001/XMLSchema#string"),
    mbox: new Resource("mailto:[email protected]")
  },
  { name: new Literal("Tony Ironman", null,
      "http://www.w3.org/2001/XMLSchema#string"),
    mbox: new Resource("mailto:[email protected]")
  }
];
Listing 2.2. Native ActionScript datatypes representing the query results
The first query solution, arrResult[0], consists of the variable bindings name
and mbox. The value of the binding name is a Literal object with the label Bobby
Iceman and the string datatype. The value of mbox is a Resource object with the URI
mailto:[email protected]. The second query solution is structured in the same way.
A <boolean> element is the result of an ASK query. The value of this element is either
true or false. This value is stored as an ActionScript string.
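The parsing steps described in this section can be sketched in Python with the standard library. This is an illustration, not the ActionScript parser of the thesis: the function name, the tuple representation of RDF terms, and the sample document (including its names and addresses) are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Namespace of the SPARQL Query Results XML Format, in ElementTree notation.
NS = "{http://www.w3.org/2005/sparql-results#}"


def parse_results(xml_text):
    """Return (variables, rows) for SELECT results, or (None, bool) for ASK."""
    root = ET.fromstring(xml_text)
    boolean = root.find(NS + "boolean")
    if boolean is not None:
        return None, boolean.text.strip().lower() == "true"
    variables = [v.get("name") for v in root.iter(NS + "variable")]
    rows = []
    for result in root.iter(NS + "result"):
        row = {}
        for binding in result.findall(NS + "binding"):
            term = binding[0]            # the <uri>, <literal> or <bnode> child
            kind = term.tag[len(NS):]    # strip the namespace prefix
            row[binding.get("name")] = (kind, term.text)
        rows.append(row)
    return variables, rows


# Hypothetical sample document, used only for illustration.
SAMPLE = """<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head><variable name="name"/><variable name="mbox"/></head>
  <results>
    <result>
      <binding name="name"><literal>Alice</literal></binding>
      <binding name="mbox"><uri>mailto:alice@example.org</uri></binding>
    </result>
  </results>
</sparql>"""

variables, rows = parse_results(SAMPLE)
```

Each row is a dictionary keyed by variable name, which plays the role of one entry of arrResult; an unbound variable is simply absent from its row.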
2.2.3 SPARQL Query Builder
We implement the query builder mainly by rebuilding, in the Client Tier, the PHP application logic
that was located on the server. The query builder is responsible for generating SPARQL
queries and passing the query patterns to the query dispatcher.
The queries are generated according to the commands received from a caller function.
The queries are then sent to a SPARQL endpoint by the query dispatcher.
After the execution is completed, the endpoint sends back the result as an XML
document. The XML document is parsed and the query results are
converted into a multidimensional array, which makes further processing
easier. The array representation is then processed to extract the required data before it is sent back to the caller function.
Finally, the presentation handler displays the data to the user. The
data flow is illustrated in Figure 2.4.
Figure 2.4. gFacet Data Flow Diagram
Chapter 3
Browsing DBpedia
The previous chapter explained the new architecture of gFacet. The application logic now resides in the Client Tier, including the Query Builder class, where all
the queries are built according to the user's commands. This chapter looks more closely
at the queries that are constructed while the user interacts with gFacet. The Query
Builder translates every user action into queries that are sent to the SPARQL
endpoint.
Instead of describing the queries in the abstract, we explain them case by case,
following the user's actions while exploring an RDF dataset with gFacet. In this chapter,
we use the dataset released by the DBpedia1 project. The DBpedia dataset contains
multi-domain information extracted from structured information in Wikipedia2.
We also evaluate the execution time of the queries and how consistently
they return the correct result.
This chapter is structured as follows. Section 3.1 gives a short explanation of
dispatching a query to DBpedia. Section 3.2 explains the query implementation while
browsing DBpedia. We then run the evaluation in Section 3.3.
3.1. Dispatching a Query to DBpedia
DBpedia released its dataset in order to make it available over the Web. The dataset can
be accessed as Linked Data or via a SPARQL query endpoint. In this thesis, we
focus on accessing the DBpedia dataset by sending queries to the DBpedia SPARQL
endpoint and viewing the results with gFacet.
1 http://dbpedia.org/
2 http://wikipedia.org/
The queries can be sent via HTTP requests to the endpoint located at
http://dbpedia.org/sparql. The endpoint is provided by the OpenLink Virtuoso3 virtual
database engine.
As explained in Section 2.2.1, the endpoint location and the sending method need to be
defined first; the query can then be sent along with the other parameters required by the
endpoint, such as the default graph and the result format. In gFacet, dispatching a query
is done as follows:
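// Location of the DBpedia SPARQL endpoint
var endpoint:String = "http://dbpedia.org/sparql";
var aQuery:SPARQLQuery = new SPARQLQuery(endpoint);
// Graph used as the default graph of the query
aQuery.defaultGraphURI = "http://dbpedia.org";
// Send the request with HTTP GET and ask for an XML result document
aQuery.method = "GET";
aQuery.resultFormat = "xml";
aQuery.execute();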
Listing 3.1. Dispatching a SPARQL query to DBpedia
Listing 3.1 shows how a query is dispatched to the DBpedia endpoint using the GET method. The
query is executed against the DBpedia default graph, which is identified by http://dbpedia.org.
We expect the result as an XML document.
The queries built in this thesis are not standard SPARQL queries as defined by the
W3C recommendation. There are some cases in which standard SPARQL cannot carry
out the tasks. In this thesis we use SPARQL extensions proposed by OpenLink Virtuoso, such as free-text searching and
the aggregate function COUNT(). The queries we built may therefore not work on SPARQL query services other than OpenLink Virtuoso.
3.2. Exploring DBpedia with gFacet
In this section we describe the detailed implementation of the queries generated by
the gFacet tool. We use the DBpedia dataset as our source of information.
The explanation in this section proceeds step by step according to the user's goals. We
define an example case and then explain how the queries and the gFacet user
interface achieve these goals on a large interlinked dataset such as DBpedia. The case is as follows.
A user is very interested in German football clubs. He wants to see all the relevant
clubs. He continues exploring the dataset by looking for information on German footballers
who play for the German clubs, then English football clubs that have ever hired the German
footballers, and the football venues where the English clubs reside. Then he wants to
find the German football club for which a German footballer named Thomas
Hitzlsperger plays. After that, he is no longer interested in German football clubs.
3 http://virtuoso.openlinksw.com/
He changes his point of view: he is now more interested in English football clubs
and continues the exploration from there.
The case can be formulated in a simple way:
Goal 1: The user is interested in information about German football clubs and wants to
see the complete list of clubs.
Goal 2: The user expands the graph by looking for
(a) the names of all German footballers who play for the German clubs,
(b) all the English clubs for which the footballers in (a) have ever played, and
(c) the venues where the English clubs in (b) reside.
Goal 3: The user wants to see the German football club for which Thomas Hitzlsperger
plays. He selects the player's name in order to filter the clubs and receive the
information he needs.
Goal 4: While exploring, the user becomes more interested in English football clubs
than in German ones. He now changes his perspective on the information.
The steps to achieve these goals are shown in Figure 3.1, and the details of every
step are described in the next sections.
Figure 3.1. User actions flow while browsing DBpedia
3.2.1. Searching for Concepts
One of the gFacet features is the ability to search for a concept in order to define the
initial node with which the exploration begins. The user can specify a keyword string to be
matched against the concept names. A list of matching concepts, together with the number
of instances of each concept, is displayed in return (see Figure 3.2).
According to Goal 1, the user is interested in information about German football clubs.
The screenshot of the gFacet interface in Figure 3.2 shows the user specifying his idea by
typing the keywords "german football". A list of concepts whose labels contain the words
german and football is then displayed, along with the number of instances that belong to each concept.
Figure 3.2. Searching a concept
Free-text searching within DBpedia texts can be performed using the bif:contains predicate, a SPARQL extension proposed by OpenLink Virtuoso. Since
Virtuoso 5.0, it is possible to declare that the objects of triples with a given predicate or
graph get indexed [25]. Using bif:contains, the triples whose objects have been indexed can be found.
There is also a more generic way of searching texts: the standardized
SPARQL function regex(). We prefer the bif:contains predicate to
the regex() function, however. bif:contains looks up the objects in the index
table instead of scanning the whole dataset as regex() does. This makes
bif:contains faster, especially when the query is executed against a large
dataset. For a large dataset like DBpedia, using bif:contains is therefore more reasonable.
Listing 3.2 shows the query generated for the example illustrated in Figure 3.2.
1: SELECT DISTINCT ?concept ?label COUNT(?instance) AS ?numOfInstances
2: WHERE {
3:   ?concept rdf:type skos:Concept .
4:   ?instance skos:subject ?concept .
5:   ?concept rdfs:label ?label .
6:   FILTER (lang(?label) = "en")
7:   ?label bif:contains "german and football" .
8: }
9: ORDER BY DESC(?numOfInstances) LIMIT 30
Listing 3.2. Query for concept searching
In Line 3, we look for resources that are identified as concepts and bind them to the
variable concept. Line 4 searches for the instances of each concept and binds them to the
variable instance. In Line 5, we use the predicate rdfs:label to get a human-readable
version of a resource's name. Here we want the label of the concept, and we keep only the
concepts that have an English label, as specified in Line 6. In Line 7, we apply
bif:contains to the label of the concept by specifying the keyword string. Back in Line 1,
we define the variables concept, label, and numOfInstances to be returned in the solution
sequence. The number of instances is calculated with COUNT(), a SPARQL extension function.
Because many concepts may match, we ask the endpoint to order the result in descending
order of the number of instances of each concept (Line 9), so that the concept at the top
of the list is likely to be the most relevant one for the user.
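As an illustration of how a client could assemble the query of Listing 3.2 from the
user's input, the following Python sketch joins the typed words with "and" and embeds them
in the bif:contains pattern. The function name build_concept_search_query and the
restriction of keywords to word characters are assumptions made for this example; they are
not taken from the gFacet implementation. Prefix declarations are omitted, as in the
listings.

```python
import re


def build_concept_search_query(user_input: str, limit: int = 30) -> str:
    """Build a concept search query in the style of Listing 3.2 (hypothetical helper)."""
    # Keep only word characters so that the keywords cannot break out of the
    # quoted free-text expression (a simplification assumed for this sketch).
    words = re.findall(r"\w+", user_input)
    if not words:
        raise ValueError("at least one keyword is required")
    # All keywords must occur in the label, hence the "and" connective.
    text_expression = " and ".join(words)
    return "\n".join([
        "SELECT DISTINCT ?concept ?label COUNT(?instance) AS ?numOfInstances",
        "WHERE {",
        "  ?concept rdf:type skos:Concept .",
        "  ?instance skos:subject ?concept .",
        "  ?concept rdfs:label ?label .",
        '  FILTER (lang(?label) = "en")',
        f'  ?label bif:contains "{text_expression}" .',
        "}",
        f"ORDER BY DESC(?numOfInstances) LIMIT {limit}",
    ])


print(build_concept_search_query("german football"))
```

Restricting the input to word characters is a deliberately conservative choice; a real
client would need a more careful escaping strategy if phrases or wildcards should be
supported.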
To keep the queries simple for explanation, we assume from now on that all resource
labels are returned in English only. The following queries therefore omit query patterns
like those in Line 5 and Line 6.
3.2.2. Selecting the Initial Node
In the list of concepts, the user sees the concept he is looking for, German football
clubs. He selects the concept, and the initial node of this concept is opened, as shown in
Figure 3.3.
Figure 3.3. Opening the initial node
The initial node is set as the result set by default. The result set is the information
space that the user is interested in; it represents the user's perspective on the
information. Everything he searches for is displayed in this node. The result set node
always appears in dark grey.
In gFacet, a node contains a list of the instances of a certain concept that can be paged
through, a pull-down menu of relations, and a button to set the node as the result set (see
Figure 3.3). The information required by the user is displayed in the instance list, and
the user can navigate from page to page to explore these instances. By opening the initial
node of German Football Clubs, Goal 1 has been achieved.
In the next subsections we explain the generated queries that retrieve the instances, the
paging mechanism, and the queries that obtain the relations.
Retrieving the Instances
The instances are displayed in a list consisting of the label and the description of each
instance. To avoid extensive scrolling when many instances are displayed at once, gFacet
provides a paging mechanism.
A paging mechanism requires the application to perform at least two tasks:
1. The application should be able to count all possible instances. This value determines
how many page buttons have to be created.
2. The application should be able to produce just the subset of instances to be displayed
for the page selected by the user.
In this thesis we try two approaches to retrieve the data and apply the paging
mechanism: all-at-once querying and step-by-step querying.
All-at-once querying: In this first approach, we solve the tasks with a standard SPARQL
query in order to keep the query generic. Put simply, the query retrieves all possible
instances from the endpoint into the Client Tier, and the client application then
accomplishes the paging by grouping the instances into pages and producing the subset of
instances to be displayed. The query for this approach is given in Listing 3.3.
1 : SELECT DISTINCT ?insOfresultSet ?comment_resultSet
2 : WHERE {
3 :   <conceptURI> rdf:type skos:Concept .
4 :   ?insOfresultSet skos:subject <conceptURI> .
5 :   OPTIONAL {
6 :     ?insOfresultSet rdfs:comment ?comment_resultSet .
7 :     FILTER (lang(?comment_resultSet) = "en")
8 :   }
9 : }
Listing 3.3. Query to retrieve all instances
The query in Listing 3.3 first checks whether the URI of the selected concept is really
defined as a concept (Line 3). If it is, Line 4 searches for its instances by applying the
skos:subject predicate.
In Line 6 and Line 7, the query asks for an English description of each instance. This
requirement is placed inside an OPTIONAL clause (Line 5), which means that these patterns
do not necessarily have to produce a binding.
The advantage of this approach is that once all the data has been retrieved, exploring the
pages is comfortably fast because the application does not have to execute any further
queries. The first drawback is that the query engine tends to take a long time to retrieve
all existing instances of a concept, especially if there are many instances.
The second drawback is caused by the result limit imposed on every query execution. To
protect the service from overload, the SPARQL endpoint truncates query results to 1000
rows per execution [6]. As a consequence, gFacet cannot obtain the remaining instances if
the concept has more than 1000 possible instances.
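The client-side part of the all-at-once approach can be sketched in a few lines of
Python. The sketch assumes that the instances have already been fetched into a list; the
function names and the sample data are hypothetical, and only the row limit of 1000 is
taken from the text above.

```python
ENDPOINT_ROW_LIMIT = 1000  # truncation limit of the endpoint, as stated in the text


def split_into_pages(instances, page_size=10):
    """Group an already fetched list of instances into pages of page_size items."""
    return [instances[i:i + page_size] for i in range(0, len(instances), page_size)]


def possibly_truncated(instances):
    """Return True if the result may have been cut off by the endpoint's row limit."""
    return len(instances) >= ENDPOINT_ROW_LIMIT


# Hypothetical sample data standing in for the fetched query solutions.
fetched = [f"instance_{n}" for n in range(1, 26)]
pages = split_into_pages(fetched)
print(len(pages), "pages; last page holds", len(pages[-1]), "instances")
print("possibly truncated:", possibly_truncated(fetched))
```

Once the list is in memory, switching pages is a plain list lookup, which is why paging is
fast in this approach; the truncation check shows where the approach stops being reliable.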
Step-by-step querying: To overcome the drawbacks of the first approach, we introduce the
step-by-step querying approach. We accomplish the tasks by executing two queries, one for
each task. The first query asks the query engine to return only a subset of the instances
instead of the whole solution sequence. This is possible with the standard SPARQL clauses
OFFSET and LIMIT (see Listing 3.4).
1 : SELECT DISTINCT ?insOfresultSet ?comment_resultSet
2 : WHERE {
3 :   <conceptURI> rdf:type skos:Concept .
4 :   ?insOfresultSet skos:subject <conceptURI> .
5 :   OPTIONAL {
6 :     ?insOfresultSet rdfs:comment ?comment_resultSet .
7 :     FILTER (lang(?comment_resultSet) = "en")
8 :   }
9 : }
10: ORDER BY ASC(?label_insOfresultSet) OFFSET 0 LIMIT 10
Listing 3.4. Query to retrieve a subset of instances
In general, the query is the same as the one in Listing 3.3; the important difference is
located in Line 10. First, we ask for the results to be ordered by the label of the
instances in ascending order using the ORDER BY ASC() clause. The main point of this
approach is then achieved by specifying the OFFSET and LIMIT clauses to get a subset of the
available query solutions. We predefine a limit of 10 instances to be returned at a time,
starting from the value defined in the OFFSET clause. All three clauses are needed for the
paging mechanism: by ordering the solutions before applying OFFSET, we get a consistent and
meaningful order. For example, if OFFSET is set to 0, the query returns instances #1 to
#10; if OFFSET is set to 10, we get instances #11 to #20, and so on.
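The relation between the page selected by the user and the OFFSET value can be written
down as a small Python sketch. It assumes that pages are numbered from 1 and that the page
size is 10 as in Listing 3.4; the function names are chosen for this example only.

```python
def offset_for_page(page_number: int, page_size: int = 10) -> int:
    """Return the OFFSET of the first instance shown on a 1-based page number."""
    if page_number < 1:
        raise ValueError("page numbers start at 1")
    return (page_number - 1) * page_size


def paging_clause(page_number: int, page_size: int = 10,
                  order_variable: str = "label_insOfresultSet") -> str:
    """Build the solution modifiers of Line 10 in Listing 3.4 for a given page."""
    offset = offset_for_page(page_number, page_size)
    return f"ORDER BY ASC(?{order_variable}) OFFSET {offset} LIMIT {page_size}"


print(paging_clause(1))  # first page, instances #1 to #10
print(paging_clause(2))  # second page, instances #11 to #20
```

Only this last line of the query changes between pages, so a client can keep the rest of
the generated query text and replace the modifiers when the user selects another page.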
The second query in this approach asks for the number of possible solutions. The SPARQL
1.0 standard, however, offers no function for this task. That is why we use the COUNT()
function, which is also a SPARQL extension provided by OpenLink Virtuoso, to calculate the
number of instances. We execute a query similar to the first one but change what the
SELECT clause returns. In Line 1, the query requests only the overall number of possible
instances by specifying the COUNT() function inside the SELECT clause (see Listing 3.5).
1 : SELECT COUNT(DISTINCT(?insOfresultSet)) AS ?totalNumber_concept
2 : WHERE {
3 :   <conceptURI> rdf:type skos:Concept .
4 :   ?insOfresultSet skos:subject <conceptURI> .
5 :   OPTIONAL {
6 :     ?insOfresultSet rdfs:comment ?comment_resultSet .
7 :     FILTER (lang(?comment_resultSet) = "en")
8 :   }
9 : }
Listing 3.5. Query for obtaining a total number of possible solutions
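How the total returned by this count query translates into page buttons can be
illustrated with a short Python sketch. The function name and the sample totals are
assumptions for this example; the page size of 10 follows Listing 3.4.

```python
def number_of_pages(total_instances: int, page_size: int = 10) -> int:
    """Return how many page buttons are needed for the given number of instances."""
    if total_instances < 0 or page_size < 1:
        raise ValueError("total must be >= 0 and page size >= 1")
    # Ceiling division: a partially filled last page still needs its own button.
    return -(-total_instances // page_size)


for total in (0, 10, 11, 302):
    print(total, "instances ->", number_of_pages(total), "pages")
```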
Even though we generate two independent queries for this approach, the application
works significantly faster by fetching just a subset of the instances, rather than using the
All-at-once querying approach, which fetches all instances into the client. In addition,
with the Step-by-step approach gFacet has no problem with concepts that have more than
1000 instances: it can explore all the instances, from the first to the last, without any
limitation.
The drawback of this approach is that it still has to send a query whenever the user moves
from page to page, so paging is slower than with the All-at-once querying approach. Despite
this drawback, the two advantages mentioned above make the Step-by-step querying approach
more suitable for gFacet.
In the next sections we also omit the patterns that retrieve the description of an
instance, like Line 6 and Line 7, to keep the queries simple for explanation.
Retrieving the Relations
Each node in the graph has a relation list that contains all the available relations to
nodes related to the current one. These relations are given as a list in a drop-down menu
(see Figure 3.4). Each relation is presented as a pair of an RDF predicate and the name of
the related concept (predicate:nextConceptName). The number of related instances of the
new concept is also displayed in the list. By selecting one of these relations, a new node
related to the current one is opened.
Figure 3.4. The Relation List
As an example, we take the second row of the relation list. A relation
name:German_footballers means that one or more instances of the node German Football Clubs
are related to an arbitrary number of resources by the RDF predicate name. Of these
resources, 302 are instances of the concept German Footballers. The connection between
these resources and the concept German Footballers is defined by the predicate
skos:subject. The RDF graph for this case is illustrated in Figure 3.5.
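A client could derive such a predicate:nextConceptName entry from the URIs returned by
the endpoint roughly as in the following Python sketch. The two URIs, the function names,
and the way the instance count is appended are assumptions for illustration; the actual
URIs and display format used by gFacet may differ.

```python
import re


def local_name(uri: str) -> str:
    """Return the part of a URI after its last '/', '#' or ':'."""
    return re.split(r"[/#:]", uri)[-1]


def relation_label(predicate_uri: str, concept_uri: str, instance_count: int) -> str:
    """Format one entry of the relation list as predicate:conceptName (count)."""
    return f"{local_name(predicate_uri)}:{local_name(concept_uri)} ({instance_count})"


# Assumed example URIs; only the resulting pair and the count appear in the text.
print(relation_label(
    "http://dbpedia.org/property/name",
    "http://dbpedia.org/resource/Category:German_footballers",
    302,
))
```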
Figure 3.5. Constructing a pair of predicate and related concept
In Listing 3.6, we provide a query that fills the relation list. The query searches for
all predicates that semantically connect the instances of the current concept with any
other resources. It then searches for all the new concepts to which these resources
belong.
The important part of the query is in Line 6 and Line 7. In Line 6, once the instances of
the current concept have been found, the query looks for any resources related to these
instances through some RDF predicate. In Line 7, the query searches for the new concepts
to which these resources belong.
Line 2 calculates the number of instances for each new concept.
1 : SELECT DISTINCT ?prop ?newConcept
2 :   COUNT(DISTINCT ?instNewConcept) AS ?numOfInstances
3 : WHERE {
4 :   <conceptURI> rdf:type skos:Concept .
5 :   ?instCurrConcept skos:subject <conceptURI> .
6 :   ?instCurrConcept ?prop ?instNewConcept .
7 :   ?instNewConcept skos:subject ?newConcept .
8 :   ?newConcept rdf:type skos:Concept .
9 : } ORDER BY DESC(?numOfInstances) ?prop ?newConcept LIMIT 40
Listing 3.6. Query for obtaining all the pairs of predicate and concept
Using the combination of RDF predicate and concept name to construct a facet leads to a
relatively time-consuming query execution in the SPARQL engine, especially if the current
node has a large number of possible instances. Every instance can have many predicates
referring to other resources, and the query then has to look up the concepts to which each
of these resources belongs. Handling this huge number of combinations is expensive: the
execution in the SPARQL endpoint takes a long time to complete and, in the worst case,
exceeds the execution time limit.
To mitigate this problem, we limit the number of relations returned by this query to 40.
We order the relation list by the number of related instances in the new concept in
descending order (Line 9). We realize that displaying only 40 relations does not cover all
possible combinations of relations; however, because of the descending order, the relations
that are most likely to be important for the user appear at the top of the list.
3.2.3. Expanding the Graph
We now move to Goal 2, which is to expand the graph by adding more nodes to it. A new
node is opened when the user selects a relation from the drop-down relation list. An edge
is created and labeled with the predicate selected by the user, as shown in Figure 3.6.
This edge connects the current node and the new node and indicates the semantic relation
between them. Gradually expanding the nodes creates a chain of nodes that represents
hierarchical facets. These new nodes act as direct or indirect constraints on the result
set.
In our case, Goal 2a is to see all the German footballers, and our starting node is the
initial node German Football Clubs. In the relation list of the initial node, the user
looks for a relation that fits his requirement. He selects the relation
name:German_footballers, and the new node is opened as shown in Figure 3.6.
Figure 3.6. Opening a new node by selecting a relation
At this point, Goal 2a has been achieved. Goal 2b and Goal 2c are achieved similarly. To
get the English football clubs that are related to the German footballers, the user
selects the relation clubs:English_football_clubs from the list in the node German
Footballers. After the node English Football Clubs has been opened, the user can select the
relation ground:Football_venues_in_England to see all the stadiums where the related
English clubs reside. A screenshot of the resulting chain of 4 nodes is presented in
Figure 3.7.
Figure 3.7. A chain of 4 nodes is created after the user gradually expands the nodes
When building a query for a chain of nodes, a special characteristic of the relationship
between a child node and its parent has to be considered. An instance of a child node is
not displayed if it is not related to any visible instance of its direct parent node.
Figure 3.8 demonstrates how the parent node and the child node interact.
Figure 3.8. Model of a chain of 4 nodes illustrating the parent-child characteristic. An
instance of a child node must be related to at least one visible instance of its parent
node. Objects with a dotted outline are not visible.
We can see in Figure 3.8 that no German club employs a player named B4, which is why
player B4 is not visible in the node German Footballers. Because B4 is not visible,
instance C1 in the node English Football Clubs is not displayed either.
This characteristic is intentional: only relevant instances of a child node can be used to
filter the instances of its direct parent or, indirectly, the instances of the result set.
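The parent-child rule can be made concrete with a small Python sketch that computes, level
by level, which instances of a chain are visible. The instance names and links below are
made-up sample data, loosely modeled on the situation described for B4 and C1; they do not
reproduce Figure 3.8, and the function name is hypothetical.

```python
def visible_instances(root_instances, links_per_level):
    """Compute the visible instances for every node of a chain.

    root_instances: instances of the first node (all of them are visible).
    links_per_level: one collection of (parent, child) pairs per edge of the chain.
    """
    visible = [set(root_instances)]
    for links in links_per_level:
        parent_visible = visible[-1]
        # A child instance is visible only if at least one visible parent links to it.
        visible.append({child for parent, child in links if parent in parent_visible})
    return visible


# Made-up sample data: B4 has no link from any instance of node A.
node_a = {"A1", "A2"}
a_to_b = [("A1", "B1"), ("A2", "B2")]
b_to_c = [("B1", "C2"), ("B4", "C1")]

levels = visible_instances(node_a, [a_to_b, b_to_c])
print(sorted(levels[1]))  # visible instances of node B
print(sorted(levels[2]))  # visible instances of node C
```

In this sample, B4 is hidden because no instance of node A points to it, and C1 is hidden
in turn because its only parent is the hidden B4.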
Our approach to expressing this characteristic is to use nested OPTIONAL clauses. Each
child node is written inside an OPTIONAL clause. With this clause, a visible instance of a
parent node does not necessarily have a related instance in its child node, but each
instance that is visible in the child node must be related to at least one visible
instance of its parent node. The generated query for the chain shown in Figure 3.8 is
given in Listing 3.7.
1 : SELECT ?instResultSet
2 : WHERE {
3 :   <conceptURI_A> rdf:type skos:Concept .
4 :   ?instResultSet skos:subject <conceptURI_A> .
5 :   OPTIONAL
6 :   {
7 :     <conceptURI_B> rdf:type skos:Concept .
8 :     ?instOfB skos:subject <conceptURI_B> .
9 :     ?instResultSet dbpedia2:name ?instOfB .
10:     OPTIONAL
11:     {
12:       <conceptURI_C> rdf:type skos:Concept .
13:       ?instOfC skos:subject <conceptURI_C> .
14:       ?instOfB dbpedia-owl:clubs ?instOfC .
15:       OPTIONAL
16:       {
17:         <conceptURI_D> rdf:type skos:Concept .
18:         ?instOfD skos:subject <conceptURI_D> .
19:         ?instOfC dbpedia2:ground ?instOfD .
20:       }
21:     }
22:   }
23: }
24: ORDER BY ASC(?label_instResultSet) OFFSET 0 LIMIT 10
Listing 3.7. Query for a chain of 4 nodes; the result set is the initial node
We can see that every child is written inside a nested OPTIONAL clause. The indentation
in Listing 3.7 shows the level of the nodes in the graph. In lines 3 and 4, the result set
is defined as node A. Node A has a child, node B, and the query patterns for B are written
inside an OPTIONAL clause. In lines 7 and 8, the query searches for all resources that are
instances of node B. Line 9 declares the dependency between node A and node B, which is
defined by the predicate dbpedia2:name. The same applies to lines 12 to 14 and lines 17 to
19 for nodes C and D. Because node C is a child of node B, all query patterns for C are in
turn written inside an OPTIONAL clause within the patterns for B.
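The nesting scheme lends itself to a recursive query generator. The following Python
sketch produces a query with the same structure as Listing 3.7 for a chain of any length.
It is a simplified illustration under stated assumptions, not the gFacet code: the function
name, the urn:example concept URIs, and ordering by the result-set variable itself
(instead of by a label) are choices made for this example.

```python
def build_chain_query(nodes, predicates, page=1, page_size=10):
    """Build a nested-OPTIONAL query for a chain of nodes.

    nodes: list of (variable name, concept URI), starting with the result set.
    predicates: predicates[i] connects node i with its child node i + 1.
    """
    if not nodes or len(predicates) != len(nodes) - 1:
        raise ValueError("a chain of n nodes needs exactly n - 1 predicates")

    def block(index, depth):
        pad = "  " * depth
        variable, concept = nodes[index]
        lines = [
            f"{pad}<{concept}> rdf:type skos:Concept .",
            f"{pad}?{variable} skos:subject <{concept}> .",
        ]
        if index > 0:
            # Dependency between the parent node and this child node.
            parent_variable = nodes[index - 1][0]
            lines.append(f"{pad}?{parent_variable} {predicates[index - 1]} ?{variable} .")
        if index + 1 < len(nodes):
            # Each child is wrapped in an OPTIONAL nested inside its parent.
            lines.append(f"{pad}OPTIONAL {{")
            lines.extend(block(index + 1, depth + 1))
            lines.append(f"{pad}}}")
        return lines

    result_variable = nodes[0][0]
    offset = (page - 1) * page_size
    return "\n".join(
        [f"SELECT ?{result_variable}", "WHERE {"]
        + block(0, 1)
        + ["}", f"ORDER BY ASC(?{result_variable}) OFFSET {offset} LIMIT {page_size}"]
    )


print(build_chain_query(
    [("instResultSet", "urn:example:A"), ("instOfB", "urn:example:B"),
     ("instOfC", "urn:example:C"), ("instOfD", "urn:example:D")],
    ["dbpedia2:name", "dbpedia-owl:clubs", "dbpedia2:ground"],
))
```

Because every child block is generated inside the block of its parent, adding a node to
the chain only appends one more nesting level and leaves the outer patterns unchanged.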
3.2.4. Filtering
The idea of exploration with gFacet is to restrict the available instances in the result
set by selecting arbitrary restriction values so that the user can find the relevant
information. gFacet eases the exploration by constructing the selection queries
automatically every time the user adds a constraint. The user can select only one filter
instance at a time, and gFacet displays the intermediate results in the result set before
the user applies further selections.
We now take a closer look at the filtering operations that can be performed with gFacet.
There are four of them: basic filtering, hierarchical filtering, union filtering, and
intersection filtering. gFacet allows these operations to be combined as desired by the
user.
Basically, the filtering is propagated upward from the selected node until the result set.
While propagating upward, there might be intermed