Rdf, Sw, Sparql Final

  • 8/3/2019 Rdf, Sw, Sparql Final


    RDF TRIPLE STORES,

    SPARQL AND THE SEMANTIC

    WEB.

    Muntazir Mehdi

    Department of Computer Science

    Technical University Kaiserslautern

    67653, Kaiserslautern, Germany

    [email protected]

    Abstract. Current research on the World Wide Web has focused largely on technology that enables machines to understand data. The result is a new kind of Web that contains meaningful metadata in addition to linked documents and their relationships, enabling roaming agents to extract useful information automatically. This new kind of Web is called the Semantic Web. For the sake of unified standards, all developments in the area of the Semantic Web are handled by the World Wide Web Consortium (W3C). This paper briefly discusses some of the existing standards: first we give a brief introduction to the Resource Description Framework (RDF) and to SPARQL, the query language for RDF, together with their syntax; then we look into data management techniques for RDF triples; finally we conclude by summarizing the individual parts of the paper.

    Keywords: Semantic Web, Resource Description Framework (RDF), SPARQL,

    RDF triple stores, RDF Data Management.

    1 Introduction

    Today the Web focuses only on the syntactic representation of information. This information is nothing more than a network of documents linked together in the form of web pages. It is readily understandable to humans but not to machines, and that is its biggest drawback. Given the rapid advancement of the internet, one can clearly see that what was once created as a communication infrastructure between parties has evolved into an information infrastructure from which users expect to extract information.

    This itself has some very important issues to be addressed e.g. What is the proper

    information? Where is the proper information? And when is the proper information needed? The word "semantics" literally means the meaning of a linguistic term. The Semantic Web is essentially a web of content in which web pages are linked through the semantic relations among them, so that machines as well as humans can process the information. Many authors regard this as its most important improvement.

    The Semantic Web will bring structure to the meaningful content of web pages,

    creating an environment where software agents roaming from page to page can

    readily carry out sophisticated tasks for users. [1]

    Having sketched the traditional Web and the Semantic Web, we can answer the question of why semantics is needed. This is not enough, however: the Semantic Web is still built on the traditional Web, so the issue of representing information remains. Knowledge and information representation is the most important part of the Semantic Web. In the traditional Web, information is represented with HTML and scripting languages, which capture only its syntactic features; the Semantic Web demands a language that can also capture semantic features, so that information can be inferred. The Resource Description Framework (RDF) is one of the most popular Semantic Web languages, and it has its roots in XML. XML by itself is powerful enough to represent information, but it allows a single domain of knowledge to be encoded in many different styles, which greatly increases complexity when a wide range of unknown parties must communicate. RDF is a framework and standard data interchange model specified by the World Wide Web Consortium (W3C) for modeling and representing information [2]. Its basic aim is to move "from machine readable to machine understandable", and it has two important parts, the RDF Model and the RDF Syntax, which are discussed in Section 2.1. Figure 1 shows the well-known Semantic Web layer cake, in which RDF is a significant block.

    When thinking about the Semantic Web one has to consider data storage and management. Data management for the Semantic Web was not a popular research topic at first, but as the area has matured many researchers have proposed ideas for storing and managing the Semantic Web data model, i.e. the Resource Description Framework (RDF) [2]. The diverse data models of the Semantic Web demand a new way of storing data. In RDF, information is captured in the form of statements, and each statement is represented as a (subject, predicate, object) or (subject, property, value) triple. For example, the simple statement "Technical University Kaiserslautern is located in Kaiserslautern, Germany" can be represented by the directed graph in Figure 2 and by the triple (subject: Technical University Kaiserslautern, predicate: isLocatedIn, object: Kaiserslautern). More triples about the same resource/subject, Technical University Kaiserslautern in our example, can be created, resulting in a complete set of information. The information is first broken into statements and then translated into triples. These triples can then be stored in different ways. In this paper,

    in Section 3 we will discuss some of the known data management techniques

    including their architecture, effects of querying them and their performance.

    Fig. 1. Semantic Web layer cake

    Fig. 2. Directed Graph for RDF statement.

    Another significant block in the Semantic Web layer cake (Figure 1) is the Rules/Query block. Once information representation and data management are in place, fetching exactly the information we need is still a laborious task. There are multiple ways to query RDF; one well-known query language for RDF is SPARQL, a recursive acronym for SPARQL Protocol and RDF Query Language. SPARQL is not the only option: other languages and notations such as RQL, SeRQL, TRIPLE, RDQL, N3 and Versa also exist, and many data management engines and stores have their own query support. SPARQL is considered a key Semantic Web technology and is a W3C Recommendation because it can query a variety of data sources, whether the data is stored natively as RDF or is viewed as RDF through additional middleware [3]. The detailed syntax and structure of both RDF and SPARQL are explained in Sections 2.1

    and 2.2 respectively. SPARQL is also briefly discussed in Section 3 where we talk

    about the management of RDF data.

    2 RDF & SPARQL: Concepts, Syntax and Structure

    2.1 RDF

    The Resource Description Framework (RDF) was briefly introduced above; this section looks at it in detail. Since RDF is composed of two important parts, the RDF Model and the RDF Syntax, the following subsections explain each of them in a little more detail.

    RDF Basic Concepts

    For the sake of understanding let us once again consider a simple example where

    we try to state some information about something:

    TU Kaiserslautern is located in Kaiserslautern

    For human readers, this statement about TU Kaiserslautern is expressed in plain English. Looking at it, one can see that the statement can be broken down into parts, and that each part must be identified in order to understand it. In our example the statement is made about TU Kaiserslautern, which is a university; it is located in some place; and the place is Kaiserslautern. For the sake of identification, let us rewrite the statement so that TU Kaiserslautern is uniquely identified as a standalone entity, since there may be several universities located in Kaiserslautern.

    http://www.uni-kl.de is located in Kaiserslautern

    Now let us break the information down once more and see which blocks make up the statement.

    - The statement is made about a thing, i.e. http://www.uni-kl.de.
    - The statement has a property concerned with the thing it describes, i.e. "is located in".
    - The property of the thing has a value, i.e. Kaiserslautern.

    Since the statement has been broken down, each part can be identified individually. In our case the thing/resource/subject is http://www.uni-kl.de, the property/predicate attached to it is isLocatedIn and the value/object of the property is Kaiserslautern. More statements about this resource can again be made using simple English sentences.

    University of Kaiserslautern has a department of Computer Science

    http://www.uni-kl.de was founded in July, 1970

    Note that all statements made above have information for a single subject i.e. TU

    Kaiserslautern but the problem is that the subject is mentioned in three different

    ways. The main idea of RDF is to describe resources, where resources have properties

    and those properties have values. RDF uses a specific terminology for the parts of a statement [2]. The part that identifies the resource being described is called the subject, the part that states a property or characteristic of the subject is called the property/predicate, and the part that gives the value of that property is called the value/object. For example, in our statements:

    - Subject = TU Kaiserslautern / http://www.uni-kl.de / University of Kaiserslautern.
    - Predicate/Property = locatedIn / hasDepartment / foundedOn.
    - Object/Value = Kaiserslautern / Computer Science / July, 1970.

    So far we have been talking about human understanding. To make this information processable by machines, RDF requires the following two things [2]:

    1. A language that is already processable by machines and can represent and exchange these statements.

    2. A machine-processable system for identifying each part of a statement without ambiguity, so that the parts can be matched with resources available on the Web.

    The World Wide Web already has two solid identification mechanisms that are machine processable: the Uniform Resource Locator (URL), which specifies the location of a resource, and the Uniform Resource Identifier (URI), which is a superset of URL and can be created independently by any organization or person. In our example we are lucky to have a resource with a URL, i.e. http://www.uni-kl.de, but what about resources that have no web location, e.g. credit cards, human beings or telephone bills? URIs can identify resources that are 1) on the Web, 2) not on the Web and 3) abstract concepts.

    RDF information can be written independently by anyone using XML [2]. RDF defines a specific syntax for the representation of information: since XML is already a machine-processable and exchangeable format, RDF uses an XML-based syntax called RDF/XML. Another, relatively new serialization for RDF data named Turtle [15] is also available; its triple notation closely resembles the pattern syntax of SPARQL queries and it has become very popular, but in this paper we discuss only RDF/XML.

    RDF Model

    RDF data can be represented in the form of triples which follow a certain pattern or

    can be represented in the form of a directed graph.

    Now that we know that a statement can be broken down into parts and these parts

    can be identified using URIs, we can use RDF triples to represent the information. An

    example of such can be seen as follows:

    "July,

    1970".
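    As a minimal sketch, the three statements about TU Kaiserslautern discussed in this section can be written as (subject, predicate, object) tuples in Python. The namespace string (with a trailing slash added for readability) and the choice to treat every object as a plain literal are assumptions made for illustration; this is not the original listing.

```python
# Hypothetical namespace based on the paper's "ct" prefix example;
# the trailing slash is an assumption so that expanded names stay readable.
CT = "http://www.abc.com/customType/"
TUKL = "http://www.uni-kl.de"

# Each statement becomes one (subject, predicate, object) triple.
triples = [
    (TUKL, CT + "isLocatedIn", "Kaiserslautern"),
    (TUKL, CT + "hasDepartment", "Computer Science"),
    (TUKL, CT + "foundedOn", "July, 1970"),
]


def to_ntriples_line(s, p, o):
    """Render one triple in an N-Triples-like form, with the object as a plain literal."""
    return f'<{s}> <{p}> "{o}" .'


lines = [to_ntriples_line(*t) for t in triples]
```

    Every triple shares the same subject URI, which is what lets a store recognize that all three statements describe a single resource.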

    The graph representation of such information can be seen in the following figure:

    In this notation, resources are drawn as ovals, predicates as directed edges and literal values as rectangles. A single arc represents a single triple, which consists of a subject, a predicate and an object (the object can itself be a resource). The graph grows because objects that are resources can in turn have their own properties and values.
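    The graph view can be sketched in a few lines of Python: each subject is a node and each triple contributes one outgoing arc labelled with its predicate. The fourth triple (Kaiserslautern located in Germany) is an illustrative assumption, added only to show how an object that is itself a resource extends the graph.

```python
from collections import defaultdict

# Illustrative triples; the "ct:" names follow the paper's example prefix.
triples = [
    ("http://www.uni-kl.de", "ct:isLocatedIn", "Kaiserslautern"),
    ("http://www.uni-kl.de", "ct:hasDepartment", "Computer Science"),
    ("http://www.uni-kl.de", "ct:foundedOn", "July, 1970"),
    # Assumed extra statement: an object acting as the subject of another triple.
    ("Kaiserslautern", "ct:isLocatedIn", "Germany"),
]


def build_graph(triples):
    """Map each subject node to its outgoing (predicate, object) arcs."""
    graph = defaultdict(list)
    for s, p, o in triples:
        graph[s].append((p, o))
    return graph


graph = build_graph(triples)
```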

    URIs are mostly long strings, so to keep the representation compact and readable RDF allows the use of prefixes. The substitution is declared with XML namespace declarations at the beginning of the document; the namespace part of a fully qualified URI is then replaced by its XML prefix. A simple example is as follows:

    Prefix: ct, namespace URI: http://www.abc.com/customType

    Thus the predicates become:

    ct:isLocatedIn, ct:hasDepartment, ct:foundedOn.
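    Prefix substitution is plain string replacement, as the following sketch suggests. The function name expand and the trailing slash on the namespace are assumptions made for illustration.

```python
# Prefix table; the trailing slash on the namespace is an assumption.
PREFIXES = {"ct": "http://www.abc.com/customType/"}


def expand(qname, prefixes=PREFIXES):
    """Expand 'prefix:local' into a full URI; leave unknown prefixes untouched."""
    prefix, sep, local = qname.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    return qname


expanded = [expand(q) for q in ("ct:isLocatedIn", "ct:hasDepartment", "ct:foundedOn")]
```

    A full URI such as http://www.uni-kl.de passes through unchanged, because "http" is not a declared prefix.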

    RDF Syntax

    As discussed earlier, RDF uses an XML structure for representing information; however, the flavor of XML that RDF uses is a separate specification, which the RDF documents call RDF/XML [2].

    For understanding RDF let us look at the code example given below for the

    following piece of information that is represented using RDF/XML syntax.

    Technical University Kaiserslautern is located in Kaiserslautern. The university

    has department of Computer Science. It was founded in July, 1970.

    July, 1970

    The very first line, i.e. the XML declaration, indicates that the content following it is XML.

    The piece of code that says

    ern" /> and those properties which have their values as another rdf resource will

    be represented in the form of .

    The closing tag </rdf:RDF> marks the end of the RDF content.

    A single RDF document may contain information about more than one resource, each enclosed in its own tags.

    Above we presented the basic syntax and structure by means of an example. The detailed syntax and specification can be found in [2].
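    To make the structure described in this section concrete, the sketch below embeds a small RDF/XML document and extracts its triples with Python's standard XML parser. The document is an assumed reconstruction and not the author's original listing: the ct namespace (with trailing slash) and the resource URI used for Kaiserslautern are hypothetical. The extractor handles only flat rdf:Description blocks, not the full RDF/XML grammar.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Assumed reconstruction of the example; URIs under abc.com are hypothetical.
RDF_XML = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ct="http://www.abc.com/customType/">
  <rdf:Description rdf:about="http://www.uni-kl.de">
    <ct:isLocatedIn rdf:resource="http://www.abc.com/places/Kaiserslautern" />
    <ct:hasDepartment>Computer Science</ct:hasDepartment>
    <ct:foundedOn>July, 1970</ct:foundedOn>
  </rdf:Description>
</rdf:RDF>
"""


def extract_triples(rdf_xml):
    """Return (subject, predicate, object) tuples from flat rdf:Description blocks."""
    root = ET.fromstring(rdf_xml)
    triples = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            # ElementTree spells qualified names as '{namespace}local'.
            predicate = prop.tag.replace("{", "").replace("}", "")
            # A property either points at another resource or carries literal text.
            obj = prop.get(f"{{{RDF_NS}}}resource") or (prop.text or "").strip()
            triples.append((subject, predicate, obj))
    return triples


parsed = extract_triples(RDF_XML)
```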

    2.2 SPARQL

    A lot of work has been done to develop a query language that fulfills the requirements of the Semantic Web standards. The aim has generally been a query language that closely resembles SQL and can deal with semantic data. Many query languages have been proposed in the literature, each with its own pros and cons; SPARQL [3] has become the focal point for many researchers.

    SPARQL not only emulates SQL syntax but also supports full pattern matching, optional pattern matching, conjunction and disjunction. Its most obvious drawback is that, as specified in [3], it cannot alter stored RDF data.

    As we already know, the basic idea of RDF is to represent information as triples consisting of subject, predicate and object. SPARQL is no exception: it is built on the same triple pattern.

    SPARQL Syntax and Structure

    This section looks at the basic syntax and structure of a SPARQL query. Consider the example query below:

    PREFIX ct: <http://www.abc.com/customType>
    SELECT ?name
    WHERE
    {
      <http://www.uni-kl.de> ct:hasDepartment ?name .
    }

    Let us first see what happens when this query is executed on the RDF data we have been using so far. The output of the query would be:

    If we break down the query mentioned in the code segment above, we will be able

    to explain each and every part in detail. The very first line in the code segment begins

    with the keyword PREFIX. When dealing with RDF triples we had to identify each part of a statement using an identifier, and identifiers tend to be long strings; we therefore used XML namespaces with prefixes to shorten them. The PREFIX keyword serves the same purpose. Prefixes are declared first, at the top of the query, and a single query can declare multiple prefixes and use them wherever necessary.

    The second line of the code segment contains the SELECT keyword. As already discussed, SPARQL draws on standard SQL. The SELECT keyword marks the beginning of the query proper and plays the same role as in SQL: it names the variables whose bindings the query should return.

    The FROM keyword, which does not appear in our example because we are using the local RDF data, again works much as in SQL. It identifies the dataset on which the query is to be executed, which can be a local file as well as a remote one.

    Last comes the WHERE keyword. We have seen both representations of RDF data, i.e. triples and graph. The pattern specified in the braces after WHERE is matched against the stored triples (equivalently, against the graph). A WHERE clause can contain more than one pattern, each terminated by a dot. The WHERE keyword itself is optional and can be omitted, but the braces containing the pattern must remain.
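    The matching step described above can be illustrated with a toy evaluator in Python. It is a sketch of basic triple-pattern matching only, not an implementation of SPARQL: the function names are hypothetical, terms beginning with "?" are treated as variables, and the data uses the prefixed names from the running example.

```python
def match_pattern(pattern, triples):
    """Match one (s, p, o) pattern; terms starting with '?' are variables."""
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A repeated variable must bind to the same value each time.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            results.append(binding)
    return results


def select(variables, patterns, triples):
    """Join the bindings of several patterns, then project the chosen variables."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for solution in solutions:
            # Substitute variables that are already bound before matching.
            bound = tuple(solution.get(t, t) for t in pattern)
            for binding in match_pattern(bound, triples):
                next_solutions.append({**solution, **binding})
        solutions = next_solutions
    return [tuple(s[v] for v in variables) for s in solutions]


DATA = [
    ("http://www.uni-kl.de", "ct:isLocatedIn", "Kaiserslautern"),
    ("http://www.uni-kl.de", "ct:hasDepartment", "Computer Science"),
    ("http://www.uni-kl.de", "ct:foundedOn", "July, 1970"),
]

names = select(
    ["?name"],
    [("http://www.uni-kl.de", "ct:hasDepartment", "?name")],
    DATA,
)
```

    Here names holds a single row containing "Computer Science", since only one triple has the requested subject and predicate.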

    In the above example and explanation we have seen only the basic syntax and structure of SPARQL; for a detailed specification and more complex queries with additional features, see [3].

    3 Data Management

    In this section we discuss some data management techniques for the Resource Description Framework (RDF): Sesame [4], a storage solution for RDF that can sit on top of a relational DBMS; the Vertically Partitioned Approach [5, 6], a performance-enhancing technique that stores RDF data using a decomposed storage model [9]; and RDF-3X [7, 8], an engine that follows a RISC-style architecture to achieve high performance for SPARQL queries over RDF data.

    3.1 Sesame

    Sesame is a framework for processing RDF data. It is an open source Java framework for storing, querying and reasoning about RDF and RDF Schema. It can be used both as a database for RDF and as a Java library for developing applications that work with RDF and RDF Schema. The implementation follows a generic architecture, the Sesame architecture [4], which is discussed below. Sesame has been designed with the flexibility to support a variety of storage systems (relational databases, in-memory stores, file systems) and

    offers developers a wide range of tools for exploiting RDF and Semantic Web standards. Sesame also supports SPARQL, and local and remote stores are accessed transparently through the same API.

    A packaged product and source code for Sesame can be downloaded from

    http://www.openrdf.org/download.jsp.

    Sesame's Architecture Overview

    The overall architecture of Sesame is shown in Figure 3; the individual components are explained here.

    The RDF data managed by Sesame is stored in a scalable repository, and how it is stored depends on the repository chosen. A DBMS suits this purpose very well. A wide range of DBMSs is available, each with its own features and strengths, so Sesame is implemented in a DBMS-independent fashion. To achieve this, all DBMS-specific code is concentrated in a single architectural layer, the Storage And Inference Layer (SAIL). SAIL is an API that translates the RDF-specific requests made by Sesame's functional modules into calls to the underlying DBMS; the functional modules are clients of this layer. They are discussed further below.
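    The idea of concentrating storage-specific code behind one layer can be sketched as an abstract interface with interchangeable backends. The sketch is in Python for brevity, although Sesame itself is a Java framework; the class and method names (Sail, MemorySail, add, find) are hypothetical and do not reproduce Sesame's actual API.

```python
from abc import ABC, abstractmethod


class Sail(ABC):
    """Hypothetical storage interface; names are illustrative, not Sesame's API."""

    @abstractmethod
    def add(self, s, p, o): ...

    @abstractmethod
    def find(self, s=None, p=None, o=None): ...


class MemorySail(Sail):
    """In-memory backend; a relational backend would implement the same two methods."""

    def __init__(self):
        self._triples = set()

    def add(self, s, p, o):
        self._triples.add((s, p, o))

    def find(self, s=None, p=None, o=None):
        # None acts as a wildcard for that position.
        return sorted(
            t for t in self._triples
            if all(want is None or want == got for want, got in zip((s, p, o), t))
        )


def departments_of(store: Sail, university):
    """A client 'module' that depends only on the Sail interface."""
    return [o for _, _, o in store.find(s=university, p="ct:hasDepartment")]


store = MemorySail()
store.add("http://www.uni-kl.de", "ct:hasDepartment", "Computer Science")
store.add("http://www.uni-kl.de", "ct:isLocatedIn", "Kaiserslautern")
```

    Because departments_of never touches the backend directly, the in-memory store could be swapped for a database-backed one without changing the client code.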

    Sesame is packaged so that it can be deployed both as a web application and as a web service. The packaged implementation is deployed in a web container that supports Java servlets and can then be accessed via HTTP/S or SOAP. For scalability, a separate handler is provided for each communication protocol, and an additional protocol handler can be added to access Sesame in a different way. The request router receives requests from the protocol handlers and routes them to the respective functional modules, and routes the responses back.

    Fig. 3. Sesame's Architecture

    Sesame's Functional Modules

    The Query Module:

    The query module in Sesame's implementation uses RQL [10], in a variant that follows the W3C recommendations more closely, for example by allowing domain and range restrictions to be optional and multiple. This variant is also known as SeRQL, the Sesame RDF Query Language. In this paper, however, we do not assume any specific query language when explaining the module. The path followed by the module when responding to a request is shown in Figure 4. The module carries out two important functions on a query: parsing and optimization. It first parses the query and creates a query tree model. This tree is then passed to an optimizer, which creates an optimized version of the query tree model.

    For example, a SPARQL query can be translated into an SQL query, optimized with respect to the underlying DBMS and then forwarded for execution.
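    The translation step can be sketched for a single triple pattern, using Python's built-in sqlite3 module as a stand-in DBMS. The table name triples(s, p, o) and the function pattern_to_sql are assumptions for illustration; parsing of real query text and the optimization step are omitted.

```python
import sqlite3


def pattern_to_sql(pattern):
    """Translate one (s, p, o) pattern into SQL over a triples(s, p, o) table.

    Terms starting with '?' are variables and become selected columns;
    all other terms become equality conditions with bound parameters.
    """
    columns = ("s", "p", "o")
    selected, conditions, params = [], [], []
    for column, term in zip(columns, pattern):
        if term.startswith("?"):
            selected.append(f"{column} AS {term[1:]}")
        else:
            conditions.append(f"{column} = ?")
            params.append(term)
    sql = f"SELECT {', '.join(selected) or '1'} FROM triples"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql, params


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("http://www.uni-kl.de", "ct:isLocatedIn", "Kaiserslautern"),
        ("http://www.uni-kl.de", "ct:hasDepartment", "Computer Science"),
    ],
)

sql, params = pattern_to_sql(("http://www.uni-kl.de", "ct:hasDepartment", "?name"))
rows = conn.execute(sql, params).fetchall()
```

    Constants are passed as bound parameters rather than spliced into the SQL text, which keeps the generated statement safe to execute on arbitrary literals.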

    Fig. 4. The Query Module flow path

    The Admin Module:

    The two main functions of the admin module are to add RDF(S) data to the repository incrementally and to clear the repository. Populating the repository with information extracted from RDF(S) follows a simple process. RDF(S) data is generally available online or locally as serialized XML (the file extension may vary; both .xml and .rdf(s) are in use). Many parsers are available to extract data from these serialized XML files, e.g. the Jena toolkit. The parser receives the XML file and produces the data as (subject, predicate, object) or (subject, property, value) RDF triples. The admin module then communicates with SAIL and inserts the data into the repository. Reporting errors and warnings is also the responsibility of this module.

    The RDF Export Module:

    The simplest part of the Sesame architecture is the export module, which is responsible only for exporting RDF(S) data. Some tools need the schema information, some need the RDF data and some need both, depending on the scenario. Based on the request made, this module can export the schema, the data or both. After communicating with SAIL the

    module receives the triples and produces a serialized, XML-formatted file. This enables Sesame to be integrated with other RDF tools.

    3.2 Vertically Partitioned Approach

    The Vertically Partitioned Approach [5, 6] is an alternative to the property table. To understand it, let us first gain a basic understanding of property tables.

    Usually RDF data is first parsed into (subject, predicate, object) or (subject, property, value) triples and then fed into an RDBMS. Since many literals are long strings on which pattern-based querying is inefficient, they are usually shortened to improve performance: a simple mapping between the long literals and short identifiers is kept in an identifier table [13]. A simple example of storing RDF data in one table is shown in Figure 5(a).
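    The identifier mapping can be sketched as a small dictionary class that assigns an integer to each distinct string on first sight. The class name Dictionary and its methods are hypothetical; real stores keep this mapping in a table or index rather than in memory.

```python
class Dictionary:
    """Bidirectional mapping between strings and compact integer ids."""

    def __init__(self):
        self._ids = {}
        self._strings = []

    def encode(self, value):
        # Assign the next free id on first sight and reuse it afterwards.
        if value not in self._ids:
            self._ids[value] = len(self._strings)
            self._strings.append(value)
        return self._ids[value]

    def decode(self, ident):
        return self._strings[ident]


def encode_triples(triples, dictionary):
    """Replace every term of every triple by its integer id."""
    return [tuple(dictionary.encode(term) for term in t) for t in triples]


dictionary = Dictionary()
encoded = encode_triples(
    [
        ("http://www.uni-kl.de", "ct:isLocatedIn", "Kaiserslautern"),
        ("http://www.uni-kl.de", "ct:hasDepartment", "Computer Science"),
        ("http://www.uni-kl.de", "ct:foundedOn", "July, 1970"),
    ],
    dictionary,
)
```

    The long subject URI is stored once and referenced three times by the same id, which is where the space saving comes from.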

    The drawback of this simple and straightforward approach is the query processing time needed to retrieve results from the store. For this reason the developers of the Jena Semantic Web toolkit, Jena2 [11, 12], proposed the property table concept, which is considerably more efficient for query processing. The proposal contains two types of property table. The first type is the clustered property table: properties that tend to occur together on the same subjects are clustered into one table, and the remaining triples are inserted into a plain triples table like the one described above. An example is shown in Figure 5(b). The second type, the property-class table, exploits the class (type) of the subjects: subjects of the same class usually share the same properties, so they are grouped into one table per class. Again, the left-over triples are stored in a plain triples table, as with the clustered property table. This technique of storing triples by class was found useful in Jena2 and is also very effective for storing reified statements. In the Semantic Web, reification means a statement about a statement: for example, one statement is "Earth revolves around the Sun" and a statement that reifies it is "Scientists believe that Earth revolves around the Sun". When reified statements are stored, rdf:Statement is treated as the class and the properties are rdf:subject, rdf:predicate and rdf:object. An example of a property-class table is shown in Figure 5(c).
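    A clustered property table can be sketched as follows. The sample triples are made up for illustration and are not taken from Figure 5; the function name build_property_table and the rule of pushing a second value for the same cell into the left-over table are assumptions of this sketch.

```python
def build_property_table(triples, clustered_properties):
    """Split triples into a wide property table and a left-over triples list."""
    table, leftover = {}, []
    for s, p, o in triples:
        if p in clustered_properties:
            # One row per subject, one column per clustered property.
            row = table.setdefault(s, dict.fromkeys(clustered_properties))
            if row[p] is None:
                row[p] = o
            else:
                # A second value does not fit into a single cell.
                leftover.append((s, p, o))
        else:
            leftover.append((s, p, o))
    return table, leftover


# Made-up sample data for illustration only.
SAMPLE = [
    ("ID1", "type", "BookType"),
    ("ID1", "title", "XYZ"),
    ("ID1", "author", "Fox, Joe"),
    ("ID2", "type", "CDType"),
    ("ID2", "title", "ABC"),
    ("ID2", "artist", "Orr, Tim"),
    ("ID3", "type", "BookType"),
]

table, leftover = build_property_table(SAMPLE, ["type", "title"])
```

    Subject ID3 has no title, so its row carries a None in that column; such NULL cells are the cost that the vertically partitioned approach avoids.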

    Having seen the property table technique, let us look at the alternative approach, which improves query performance by using a fully decomposed storage model [9]. The Vertically Partitioned Approach is simple and straightforward: one table is created for each unique property in the data, and every triple is inserted into the table of its property. Each table consists of two columns; the first holds the subject and the second the property value. The most interesting and performance-enhancing part of this approach is that each table is sorted on its subject column. Subjects can therefore be located quickly, and fast merge joins can be used to reconstruct information about multiple properties for subsets of subjects. An example of this approach is shown in Figure 6.
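    The sketch below builds one subject-sorted, two-column table per property and joins two of them with a linear merge join. The sample triples are made up for illustration and are not taken from Figure 6; the function names are hypothetical.

```python
from collections import defaultdict


def vertically_partition(triples):
    """Build one two-column (subject, value) table per property, sorted by subject."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return {p: sorted(rows) for p, rows in tables.items()}


def merge_join(left, right):
    """Join two subject-sorted tables on subject in a single forward pass."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            subject = left[i][0]
            # Pair this left row with the whole run of equal subjects on the right,
            # which covers multi-valued properties.
            k = j
            while k < len(right) and right[k][0] == subject:
                out.append((subject, left[i][1], right[k][1]))
                k += 1
            i += 1
    return out


# Made-up sample data for illustration only.
SAMPLE = [
    ("ID1", "type", "BookType"),
    ("ID1", "title", "XYZ"),
    ("ID2", "type", "CDType"),
    ("ID2", "title", "ABC"),
    ("ID3", "type", "BookType"),
]

tables = vertically_partition(SAMPLE)
joined = merge_join(tables["type"], tables["title"])
```

    ID3 simply has no row in the title table, so no NULL is stored and it drops out of the join.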

    The vertically partitioned approach has several advantages over the property table technique. Some of them are listed below:

    - Support for multi-valued attributes: a subject that has more than one value for a particular property can easily be stored in the decomposed storage model; the values are simply added as successive rows.

    - Support for heterogeneous records: this is the biggest advantage of the vertically partitioned approach over property tables. With unstructured or poorly structured data, some subjects will always lack values for some properties. Such values are simply omitted when populating the tables; in other words, NULL values no longer need to be stored.

    Fig. 5. RDF Triple data and property table examples

    This approach certainly has its own disadvantages in some scenarios; however, it has been observed to have the upper hand over the property table technique. A detailed performance comparison of the two approaches can be found in Section 6 of [5, 6].

    Fig. 6. Vertically Partitioned Approach example

    3.3 RDF-3X

    RDF-3X [7, 8] is an engine implementation built around three salient features: 1) a generic storage solution for RDF data that requires no further tuning, 2) a query processor and 3) a query optimizer.

    Storage and Indexing

    Triples Store and Dictionary:

    As discussed earlier, the current state-of-the-art schema for storing RDF data is the property table, but this engine again uses the simple approach in which all RDF data is extracted as (subject, predicate, object) or (subject, property, value) triples. Once extracted, the triples are stored in a repository, which is a custom storage implementation rather than an RDBMS, in keeping with the RISC-style design principle. Section 3.2 mentioned the cost of storing RDF triples directly in a single table; the engine answers the criticism that a single table incurs too many self-joins by creating indexes, which prove to be very efficient.

    Once again a dictionary is used, i.e. a mapping from long string literals to identifiers, as before. The cost is that the dictionary itself must be indexed, but it brings two benefits: 1) compression of the triple store and 2) simplification of the query processor. All triples are stored in a clustered B-tree in lexicographically sorted order. This data structure allows SPARQL patterns to be converted into range scans. Another advantage is that, for a given pattern, the bindings of all unknown components can be found in a single scan, located in logarithmic amortized time.

    Compressed Indexes:

    When applying pattern matching on the triple store we have so far assumed that the pattern is supplied in the standard order. In many cases, however, different components of the pattern are bound. To answer any pattern in a single scan, whatever its form, the engine stores all permutations of the three columns subject, predicate and object. This ultimately results in six different orderings of every triple; however the

  • 8/3/2019 Rdf, Sw, Sparql Final

    15/18

    RDF TRIPLE STORES, SPARQL AND THE SEMANTIC WEB. 15

    engine overcomes the redundancy by applying the compression. The standard

    ordering of the pattern is (subject (s), predicate (p), object (o)). The six possibilities

    for a single triple then become (SOP, SPO, OSP, OPS, POS, PSO). While storing

    each permutation in leaf page of clustered B-Tree, each permutation is first sorted

    alphabetically. A detail on compressing the indexes and compression algorithm used

    with comparison to other algorithms can be found in [7, 8] in Section 3.2.
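
    The choice of index for a given pattern can be sketched as follows. This is a
    simplified illustration: sorted lists stand in for the compressed B-trees,
    compression is omitted, and the function names build_indexes and match are
    hypothetical.

```python
from bisect import bisect_left, bisect_right
from itertools import permutations

ORDERS = ["".join(p) for p in permutations("SPO")]  # SPO, SOP, PSO, POS, OSP, OPS


def build_indexes(triples):
    """Store every (s, p, o) id triple once per ordering, each copy sorted."""
    indexes = {}
    for order in ORDERS:
        pos = ["SPO".index(c) for c in order]
        indexes[order] = sorted(tuple(t[i] for i in pos) for t in triples)
    return indexes


def match(indexes, s=None, p=None, o=None):
    """Answer a triple pattern with one range scan; None marks a variable."""
    bound = {"S": s, "P": p, "O": o}
    fixed = [c for c in "SPO" if bound[c] is not None]
    free = [c for c in "SPO" if bound[c] is None]
    # Pick the ordering that lists the bound components first.
    order = "".join(fixed + free)
    index = indexes[order]
    prefix = tuple(bound[c] for c in fixed)
    lo = bisect_left(index, prefix)
    hi = bisect_right(index, prefix + (float("inf"),) * len(free))
    # Reorder each hit back to (s, p, o).
    return [tuple(row[order.index(c)] for c in "SPO") for row in index[lo:hi]]
```

    With all six orderings available, a pattern such as (?s, ?p, o) is served by the OSP
    index and a pattern such as (s, ?p, o) by the SOP index, so no pattern needs a full
    scan followed by filtering.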

    Aggregated Indexes:

    Additional aggregated indexes are built over two of the three columns of a triple. Each
    entry holds a pair of values together with a count, i.e. the number of triples in which
    that pair occurs. This is done for all six ordered pairs (SP, PS, SO, OS, PO, OP) and
    the results are stored in the database. Compression is applied once again, so the space
    overhead of these indexes is almost negligible. The same is done for single values: one
    column is considered at a time and each value is stored with its count, and compression
    again keeps the overhead negligible. Aggregated indexes simplify query translation,
    because for many SPARQL query patterns partial triples are sufficient.
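
    As a rough sketch of what the aggregated indexes contain, the counts can be computed
    as below. In-memory counters stand in for the stored, compressed indexes, and the
    function name build_aggregated is hypothetical.

```python
from collections import Counter
from itertools import permutations


def build_aggregated(triples):
    """Count value pairs and single values over all (s, p, o) id triples."""
    # Six pair indexes: SP, SO, PS, PO, OS, OP. Three single-value indexes: S, P, O.
    pairs = {"".join(k): Counter() for k in permutations("SPO", 2)}
    singles = {k: Counter() for k in "SPO"}
    for triple in triples:
        value = dict(zip("SPO", triple))
        for name in pairs:
            pairs[name][(value[name[0]], value[name[1]])] += 1
        for name in singles:
            singles[name][value[name]] += 1
    return pairs, singles
```

    A pattern such as (s, p, ?o) in which ?o is never used elsewhere in the query can then
    be answered from the SP entry alone: the pair and its count are enough, and the
    individual objects never have to be read.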

    Query Processing and Query Optimization

    Translating SPARQL Queries:

    Before a query can be optimized it must be transformed into a calculus representation.
    The engine constructs a query graph, which can be read as relational tuple calculus and
    is easier to optimize. Every query is first parsed and expanded into a set of triple
    patterns. Each component of a triple pattern is either a literal or a variable. Literals
    are mapped to their ids using the dictionary described earlier.

    For a conjunctive query the expansion yields a set of triple patterns. If the set holds
    a single pattern, its result is retrieved and returned directly. If it holds more than
    one pattern, a join order is chosen (discussed further below), the results of the
    individual patterns are joined and the joined result is returned. Each triple pattern
    corresponds to one node of the query graph constructed at the start. During matching,
    each node is evaluated against the database and its results are retrieved in a single
    range scan. When the query graph contains more than one variable, the bindings of each
    variable are likewise obtained with a single scan.

    Duplicates are eliminated with an aggregation operator when the query contains a
    DISTINCT clause. Finally the ids are translated back into strings using the
    dictionary.
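
    The steps above (match each pattern, join on shared variables, remove duplicates,
    translate ids back to strings) can be sketched as follows. This is an illustration
    under simplifying assumptions: each pattern is matched by a plain scan instead of an
    index range scan, the join is a nested loop, and a dictionary-based de-duplication
    stands in for the aggregation operator. All function names are hypothetical.

```python
def is_var(term):
    """Variables are strings such as '?x'; literals are integer ids."""
    return isinstance(term, str) and term.startswith("?")


def scan(triples, pattern):
    """Yield one binding dict per id triple that matches the pattern."""
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if is_var(term):
                # A variable used twice in one pattern must bind consistently.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield binding


def join(left, right):
    """Combine bindings that agree on every shared variable."""
    out = []
    for a in left:
        for b in right:
            if all(a[v] == b[v] for v in a.keys() & b.keys()):
                out.append({**a, **b})
    return out


def evaluate(triples, patterns, select, to_string, distinct=False):
    """Evaluate a conjunctive query and return rows of strings."""
    rows = list(scan(triples, patterns[0]))
    for pattern in patterns[1:]:
        rows = join(rows, list(scan(triples, pattern)))
    result = [tuple(row[v] for v in select) for row in rows]
    if distinct:
        result = list(dict.fromkeys(result))  # drop duplicates, keep order
    return [tuple(to_string[i] for i in row) for row in result]
```

    Note that the whole evaluation runs on ids; the dictionary is consulted only once at
    the end, for the rows that are actually returned.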

    Optimizing join ordering:

    Join ordering is one of the most important issues in optimizing query plans. Many
    methods exist, but almost none of them address the demanding join properties created by
    the intrinsic characteristics of RDF and SPARQL. The three properties, or requirements,
    observed by [7, 8] are:


    1) Subqueries of a SPARQL query tend to be star-shaped, combining several
    attribute-like properties of the same entity. They therefore require a strategy that
    can build bushy join trees rather than only left-deep or right-deep trees.

    2) These star joins occur at the nodes of long join paths, mostly at the start or end
    of a path, and a SPARQL query can easily lead to 10 or more joins. Exact optimization
    therefore requires very fast plan enumeration; otherwise the optimizer must fall back
    on heuristic approximation.

    3) A strong set of triple indexes has been built and stored in the database, and these
    indexes should be exploited fully. This encourages extensive use of joins over the
    sorted indexes, provided that the sort orders are preserved when join plans are
    generated.

    Together these properties rule out the most notable methods previously used for
    optimizing query plans. The first property rules out methods that cannot generate
    bushy plans for star-shaped chains. The second property argues against
    transformation-based top-down enumeration and leaves a bottom-up method as the
    practical choice. The third property rules out sampling-based plan enumeration, which
    is unlikely to produce all order-preserving plans for queries with more than 10 joins.
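
    Order preservation matters because two scans that deliver their rows sorted on the
    join variable can be combined in a single pass over both inputs. A merge join is the
    usual operator for this; it is named here as an illustration and the sketch below is
    not taken from the engine. It assumes both inputs are lists of (key, payload) pairs
    sorted by key.

```python
def merge_join(left, right):
    """Join two lists of (key, payload) pairs that are both sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row of this key with every right row of the run.
            while i < len(left) and left[i][0] == lk:
                for _, payload in right[j:j_end]:
                    out.append((lk, left[i][1], payload))
                i += 1
            j = j_end
    return out
```

    The output is again sorted by the join key, so a further join on the same variable
    can consume it directly. A plan that reorders rows in between loses this advantage,
    which is why the optimizer has to track sort orders while it enumerates plans.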

    The proposed solution, which optimizes query plans exactly while addressing all three
    properties, uses the bottom-up dynamic programming framework of [14]. The technique is
    discussed further in Section 4.2 of [7, 8].
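
    To give an idea of bottom-up dynamic programming over bushy join trees, the sketch
    below enumerates all connected subsets of the query graph, from small to large, and
    keeps the cheapest plan for each. It is a plain subset enumeration and not the
    algorithm of [14]; the cost model (sum of estimated intermediate result sizes), the
    selectivity inputs and the function name best_bushy_plan are assumptions made for the
    example, and sort orders are ignored.

```python
from itertools import combinations


def best_bushy_plan(cards, edges):
    """Bottom-up DP over connected subsets; returns (cost, rows, plan).

    cards: {relation: estimated rows}; edges: {(a, b): join selectivity}.
    The query graph is assumed to be connected.
    """
    rels = sorted(cards)
    sel = {frozenset(e): s for e, s in edges.items()}
    # Base case: a single relation costs nothing to "join".
    best = {frozenset([r]): (0.0, float(cards[r]), r) for r in rels}

    for size in range(2, len(rels) + 1):
        for subset in map(frozenset, combinations(rels, size)):
            members = sorted(subset)
            # Split the subset into two non-empty halves in every possible way.
            for k in range(1, size // 2 + 1):
                for left in map(frozenset, combinations(members, k)):
                    right = subset - left
                    if left not in best or right not in best:
                        continue  # a half is not connected, so it has no plan
                    crossing = [sel[frozenset((a, b))] for a in left for b in right
                                if frozenset((a, b)) in sel]
                    if not crossing:
                        continue  # joining the halves would be a cross product
                    rows = best[left][1] * best[right][1]
                    for s in crossing:
                        rows *= s
                    cost = best[left][0] + best[right][0] + rows
                    if subset not in best or cost < best[subset][0]:
                        best[subset] = (cost, rows, (best[left][2], best[right][2]))
    return best[frozenset(rels)]
```

    For a chain A-B-C-D in which the joins A-B and C-D are very selective and B-C is not,
    this enumeration returns the bushy plan ((A, B), (C, D)), which a search restricted to
    left-deep trees cannot produce.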

    Handling Disjunctive Queries:

    SPARQL supports both conjunctive and disjunctive queries. The RDF-3X engine does not
    focus on disjunctive queries, but it supports their optimization to some degree. The
    UNION expression returns the union of the bindings produced by two or more pattern
    groups. The OPTIONAL expression returns the bindings of a pattern group if it has any
    results and NULL values otherwise. For optimization, both UNION and OPTIONAL
    expressions are treated as nested subqueries: the nested subqueries are optimized
    first, and the optimized subqueries are then treated as base relations when the outer
    query is optimized.
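
    The effect of the two expressions on variable bindings can be sketched as follows.
    The sketch shows only the semantics described above, not how the engine optimizes the
    nested subqueries; bindings are plain dictionaries, None stands in for NULL, and the
    function names are hypothetical.

```python
def union(*groups):
    """UNION: concatenate the bindings produced by each pattern group."""
    return [binding for group in groups for binding in group]


def optional(left, right):
    """OPTIONAL: extend each left binding with matching right bindings.

    A left binding without any match is kept, with None for the variables
    that only the optional group would have bound.
    """
    optional_vars = {v for b in right for v in b}
    out = []
    for a in left:
        matches = [b for b in right
                   if all(a[v] == b[v] for v in a.keys() & b.keys())]
        if matches:
            out.extend({**a, **b} for b in matches)
        else:
            out.append({**{v: None for v in optional_vars - set(a)}, **a})
    return out
```

    Both functions take and return complete lists of bindings, which mirrors the idea of
    treating the optimized subquery as a base relation of the outer query.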

    The RDF-3X engine also preserves cardinality. Standard SPARQL semantics demands that
    the right number of bindings is produced, even when a query yields many duplicates, so
    the engine must keep track of the multiplicity of each result during execution. It
    does so while scanning the indexes: a scan over an index that is not aggregated
    reports a multiplicity of 1, while a scan over an aggregated index reports the stored
    count as the multiplicity.
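
    A small sketch of this bookkeeping is given below. Each result row carries a
    multiplicity next to its bindings. The counts are computed on the fly here instead of
    being read from a stored aggregated index, and the rule that a join multiplies the
    multiplicities of the rows it combines is an assumption of the sketch, as are all
    function names.

```python
from collections import Counter


def scan_full(triples, s):
    """Unaggregated scan for (s, ?p, ?o): every hit has multiplicity 1."""
    return [({"?p": p, "?o": o}, 1) for (ts, p, o) in triples if ts == s]


def scan_aggregated(triples, s):
    """Aggregated scan for (s, ?p, _): one row per predicate with its count."""
    counts = Counter(p for (ts, p, _) in triples if ts == s)
    return [({"?p": p}, n) for p, n in sorted(counts.items())]


def join_with_multiplicity(left, right):
    """Join on shared variables; multiplicities of matching rows multiply."""
    out = []
    for a, m in left:
        for b, n in right:
            if all(a[v] == b[v] for v in a.keys() & b.keys()):
                out.append(({**a, **b}, m * n))
    return out


def expand(rows):
    """Emit each binding as many times as its multiplicity says."""
    return [binding for binding, m in rows for _ in range(m)]
```

    Expanding the aggregated scan yields the same number of ?p bindings as the full scan,
    so the shorter aggregated index can be used without changing the query result.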

    Because of its complex algebraic operators, the RDF-3X engine raises some cumbersome
    implementation issues. It justifies them with two concrete benefits: it is a
    RISC-style implementation, and it is drastically faster at query execution than the
    systems it was compared against.


    4 Conclusion

    In this paper we discussed the emerging Semantic Web, which has become a necessity for
    the current Web architecture. We reviewed some of the most important standards behind
    the Semantic Web, and we covered RDF and SPARQL at a basic level, including their
    basic syntax and structure.

    We then explained a few of the available techniques for storing RDF data and examined
    them with respect to query performance. We first looked at the most generic
    architecture that can be followed for Semantic Web data management, going into the
    details of the constituent parts of the Sesame architecture. Then we noted the
    problems that can arise when data is stored in a single table and explained a
    technique that is more advanced than the currently popular Property Tables: the
    vertically partitioned approach, which offers better performance when RDF data is
    queried and an alternative to Property Tables for storing it. Finally we discussed a
    very different implementation, an engine that stores RDF data efficiently, parses
    queries to optimize their plans and executes them efficiently. This engine uses a
    RISC-style architecture for storing RDF data and a complex set of algebraic operators
    to optimize queries.

    5 References

    [1] T. Berners-Lee, J. Hendler, O. Lassila. The Semantic Web. Scientific American, May

    17 2001, 34-43.

    [2] Graham Klyne, Jeremy J. Carroll, Brian McBride. Resource Description Framework

    (RDF): Concepts and Abstract Syntax, W3C Recommendation. 2004.

    [3] Eric Prud'hommeaux, Andy Seaborne. SPARQL Query Language for RDF. W3C

    Recommendation. 2008.

    [4] Jeen Broekstra, Arjohn Kampman, Frank van Harmelen. Sesame: A Generic

    Architecture for storing and querying RDF and RDF Schema. First International Semantic

    Web Conference Sardinia, Italy, June 9-12, 2002, Pages: 54-68.

    [5] Daniel J. Abadi, Adam Marcus, Samuel R. Madden, Kate Hollenbach. Scalable

    Semantic Web Data Management using Vertical Partitioning. VLDB '07 Proceedings of the

    33rd international conference on Very large data bases, 2007, Pages: 411-422.

    [6] Daniel J. Abadi, Adam Marcus, Samuel R. Madden, Kate Hollenbach. SW-Store: a

    vertically partitioned DBMS for Semantic Web data management. The VLDB Journal - The

    International Journal on Very Large Data Bases Volume 18 Issue 2, April 2009, Pages: 385-406.

    [7] Thomas Neumann, Gerhard Weikum. RDF-3X: a RISC-style Engine for RDF.

    Proceedings of the VLDB Endowment Volume 1 Issue 1, August 2008, Pages: 647-659.

    [8] Thomas Neumann, Gerhard Weikum. The RDF-3X Engine for scalable management

    of RDF Data. The VLDB Journal - The International Journal on Very Large Data Bases

    Volume 19 Issue 1, February 2010, Pages: 91-113.

    [9] G. P. Copeland and S. N. Khoshafian. A decomposition storage model. In proceeding

    of SIGMOD, pages: 268-279, 1985.


    [10] Gregory Karvounarakis, Sofia Alexaki, Vassilis Christophides, Dimitris Plexousakis.

    RQL: a declarative query language for RDF. WWW '02 Proceedings of the 11th

    international conference on World Wide Web, Pages: 592-603.

    [11] K. Wilkinson. Jena property table implementation. In SSWS, 2006.

    [12] K. Wilkinson, C. Sayers, H. Kuno, D. Reynolds. Efficient RDF Storage and Retrieval

    in Jena2. In SWDB, pages: 131-150, 2003.

    [13] E. I. Chong et al. An efficient sql-based rdf querying scheme. In VLDB, 2005.

    [14] G. Moerkotte, Thomas Neumann. Analysis of two existing and one new dynamic

    programming algorithm for the generation of optimal bushy join trees without cross

    products. In VLDB, 2006.

    [15] David Beckett, Tim Berners-Lee. Turtle - Terse RDF Triple Language. W3C Team
    Submission, 2011.