
Semantic similarity method for keyword query system on RDF




Minho Bae a, Sanggil Kang b, Sangyoon Oh a,*

a Department of Computer Engineering, Ajou University, Suwon, Republic of Korea
b Department of Computer and Information Engineering, Inha University, Incheon, Republic of Korea

Article info

Article history:
Received 31 October 2013
Received in revised form 11 April 2014
Accepted 23 April 2014
Available online 6 July 2014

Keywords:
Keyword query
RDF
Semantic similarity
WordNet

Abstract

Keyword query on RDF data is an effective option because it is lightweight and it is not necessary to have prior knowledge on the data schema or a formal query language such as SPARQL. However, optimizing the query processing to produce the most relevant results with only minimum computations is a challenging research issue. Current proposals suffer from several drawbacks, e.g., limited scalability, tight coupling with the existing ontology, and too many computations. To address these problems, we propose a novel approach to keyword search with automatic depth decisions using the relational and semantic similarities. Our approach uses a predicate that represents the semantic relationship between the subject and object. We take advantage of this to narrow down the target RDF data. The semantic similarity score is then calculated for objects with the same predicate. We make a linear combination of two scores to get the similarity score that is used to determine the depth of given keyword query results. We evaluate our algorithm against other approaches in terms of accuracy and query processing performance. The results of our empirical experiments show that our approach outperforms other existing approaches in terms of efficiency and query processing performance.

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

The RDF data model is a major part of the Semantic Web architecture, and more RDF data than ever before are available to represent Web data. As a response to such popularity, there has been a recent explosion in the number of proposals for the design of RDF data management systems that index and process RDF data sets [1]. However, most of these proposals have not yet completely solved the scalability or efficiency problems inherited from the RDF graph model. As the amount of RDF data is quickly approaching Web scale, these problems are worsening. Therefore, many of the proposals and systems limited to single-machine storage with an RDBMS are not feasible for supporting this scale of RDF data. Although some of these proposals offer distributed storage that scales as the data size grows [2,3], the scalability issue remains one of the important research challenges to be solved. Query efficiency is another challenging research issue. The majority of current proposals support the complex SPARQL language, the W3C-recommended query language for RDF [4]. However, SPARQL expressions fall short for meaningful semantic queries [5] and are too complex for simple information retrieval from large RDF collections.

Several recent proposals have focused on the keyword query concept, which does not require any knowledge about the target RDF data or prior knowledge of the SPARQL query language. As in a keyword search on the World Wide Web (WWW) through a popular search engine (e.g., Google or Bing), a keyword query on RDF data sets requires only keywords instead of the triple patterns in SPARQL (i.e., a triple with variables such as <?x foaf:name "Jon Foobar">) that are formalized based on the RDF structure (i.e., triples of subject, predicate, and object). A keyword query is a more intuitive way of specifying information needs. For massive-scale RDF data sets, processing a formal query such as SPARQL requires a tremendous number of computations and produces large amounts of intermediate data for a join operation. As a result, a keyword query with an optimized indexing structure has become a promising design for a Web-scale RDF data management system.

However, processing a keyword query on a graph-based RDF data set is a non-trivial task. Although it is relatively easy to locate the keyword in the data set, it is a complex and complicated process to provide ranked results of the keyword query [6]. In this paper, we identify challenges that need to be tackled to make use of a keyword querying paradigm for a massive-scale RDF data set. For example, how do we determine the size of the query results (i.e., sub-graphs) that will be returned to the user? In other words, how much search depth (or ranking) do we need to satisfy users? This is a necessary question to address because, unless we naively return the whole sub-graph connected to the keyword, we need a specific depth for each keyword so that the search process (i.e., query processing) can be halted at a certain point and the results returned to the users. For information retrieval on the WWW,

Contents lists available at ScienceDirect

journal homepage: www.elsevier.com/locate/neucom

Neurocomputing

http://dx.doi.org/10.1016/j.neucom.2014.04.062
0925-2312/© 2014 Elsevier B.V. All rights reserved.

* Corresponding author. Tel.: +82 31 219 2633; fax: +82 31 219 1725.
E-mail address: [email protected] (S. Oh).

Neurocomputing 146 (2014) 264–275


recalling the naïve results returned by early search engines, a vast number of URL returns was not a big problem. However, when Google introduced its PageRank algorithm [7] to rank the results from the most important first, the quality of the search results improved astonishingly. Likewise, the depth of the search is a critical factor for RDF querying. The key research challenges in this area that have to be addressed are as follows:

□ How can we determine when to stop the search? What metric can we use to halt the query processing? A ranked keyword search on graph-based data, including RDF data, poses certain challenges. Approaches such as XRANK, an XML-based keyword search [8], utilize a tree structure and the existence of the root element. However, for a keyword search on RDF data, the same strategy cannot be applied since the graph structure is more complex (e.g., it is not hierarchical data storage and there is no root). In addition, choosing the right metric for both ranking the results and eventually halting the process is another challenging research issue. Zhang et al. [9] proposed a multi-level semantic similarity method that measures the string similarity between elements of the triple, statements, and graph. While this approach is practical, we need a general method that takes advantage of the semantic relationship of the keyword that can be derived from the predicate.

□ How can we develop a comprehensive RDF data management system that enables an automatic depth-determining algorithm for high-performance query processing on massive-scale RDF data? There are two main problems in managing massive-scale RDF data. First, currently popular systems such as Virtuoso [2] and RDF-3x [10] are limited in scalability because they were designed to support a formal query language, SPARQL, and use an RDBMS for storage. Second, many approaches to RDF data management do not fully utilize indexing. They simply use an index to locate the stored vertex that contains the keyword [6].

Motivated by these problems, we propose a novel automatic depth search algorithm for an RDF data management system. Our main contributions to these challenges are as follows:

• Automatic k-depth decision for a keyword search based on relational and semantic similarities. Our algorithm automatically determines the depth of the search by calculating the similarity between the given keywords and vertices in the RDF graph (i.e., RDF data set). In the query processing algorithm, we first search for the keyword in the RDF data set using an index. Since the predicate represents the semantic relationship between a subject and object, we take advantage of this to narrow down the target RDF data. The similarity score is then calculated for objects with the same predicate as the keyword vertex. We use the word distance for the relational similarity, and we measure the semantic similarity using a WordNet [11] value to determine the semantic similarity score of the object with respect to the given keyword.

• A novel scalable and high-performance structure indexing. This algorithm is based on the indexing structure described in our previous work [12], which explored how to index massive-scale RDF data sets for formal queries such as SPARQL and for keyword queries. In particular, we extend the indexing structure for a keyword query and modify the keyword querying algorithm to adopt the automatic k-depth using the similarity.

To empirically verify the effectiveness of our approach, we conducted experiments on an RDF data management system based on our indexing and querying algorithm. In these experiments, we verified the effectiveness of our algorithm and demonstrated the process of computing the relational and semantic similarity. We also compared our approach with a naïve keyword search approach in terms of both query response time and accuracy.

We proceed as follows. In Section 2, we present related research on approaches that support a keyword query on an RDF data set and on algorithms for calculating the similarity in structured data. In Section 3, we present our approach for efficient RDF keyword query processing with an automatic k-depth decision. In Section 4, we present our empirical experiment results to demonstrate the effectiveness of our automatic k-depth algorithm for keyword queries on RDF data sets. Finally, we provide some concluding remarks and describe some future research directions in Section 5.

2. Related work

There have been many proposals that provide a top-k keyword query over structured or semi-structured data. In He et al. [6], BLINKS, a bi-level indexing and query processing scheme, was proposed to search for the top-k keywords of a data graph. In addition, the authors introduced bi-level indexing to reduce the abundant memory usage of previous approaches and improve the search performance. For this, the data graph is partitioned into sub-graphs (or blocks) rather than naively storing whole graphs and the shortest paths. A keyword list is built first. The graph is then partitioned into sub-graphs using each keyword in the list and the breadth-first search (BFS) algorithm. Since it only searches the sub-graph of the keyword, the overall query processing performance is improved.

The keyword search system over structured data by Elbassuoni and Blanco [13] directly retrieves results for the keyword query. Given a keyword phrase, the authors use a sub-graph retrieval algorithm to return a set of sub-graphs that match the query keywords using their ranking model. To do so, for all triple data, they maintain a list of keywords derived from the subject- and object-associated predicate. They then create an inverted index for each keyword query with a list of corresponding triples so that they can join these data to obtain the sub-graphs of all keywords in the query by adapting the backtracking algorithm [14]. Considering and maintaining keywords from all three factors (subject, predicate, and object) requires time, memory, and a well-designed schema, rather than only the literal values of the objects.

In Zhang et al. [9], various methods for dealing with the similarity in RDF graph matching were proposed. The authors use four methods and provide a formula for their integration. The first method measures the level: the closer the node, the better the similarity. The second method measures a string similarity between objects using WordNet. Finally, the third and fourth methods measure the RDF data statement similarity and the RDF graph structure similarity, respectively.

An algorithm for the top-k exploration of sub-graphs to retrieve the top-k most relevant structured queries was proposed in Tran et al. [15]. In this proposal, a keyword index is used for elements, and a structure index is used for an RDF data graph. The keywords in the query are translated into expressive formal queries. First, instead of mapping the keywords to the data tuples, they are mapped to the elements in the data graph using a keyword index. Next, with each mapping of such keyword elements, the summary graph (i.e., the structure index of the original RDF data graph) is explored to search for the augmented query graph (substructure) connecting them. From these sub-graphs, conjunctive queries are created by mapping the edges to predicates, and the vertices to variables or literal values. Their system then generates the top-k queries using a scoring function, instead of computing answers for the keywords. Finally, users need to select their appropriate queries from these proposed structure queries to find


the necessary information based on their needs. This approach has many interesting capabilities. However, possible information loss for users who choose a small k parameter value is an inherent drawback. In Ref. [6], the authors improved on previous efforts with respect to the accuracy of the query results by introducing a new phase, which transforms a sub-graph query into a set of entities by traversing the sub-graph. However, this new approach requires more time to process, while improving the accuracy of the results.

In many approaches to keyword searches [16,17], WordNet is used as a linguistic similarity measure source. WordNet is a lexical English database in which nouns, verbs, adjectives, and adverbs are grouped into sets (synsets). There are multiple methods for calculating the semantic similarity between words in WordNet, which can be categorized as follows: edge-based methods [18,19], information-based statistics methods [20], and hybrid methods [21]. An edge-based method is a similarity measure using path linking (a short distance means high similarity between two words). Information-based statistics methods are used when it is difficult to find a common link using an edge-based method. The basic concept of this method is that the more common links there are, the higher the similarity. A hybrid method adapts both of the methods described above.

There are many similarity measurement systems that use WordNet and a proposed similarity algorithm [18–24], including WordNet::Similarity [25], the Semantic Similarity System [26], and SimPack [27].

3. Keyword query processing on RDF with semantic similarity method

The k-depth is the metric we introduced in a previous work [12] to define the search depth. The k-depth represents the number of relations between nodes. However, like other keyword query processing approaches, our previous work cannot determine the k-depth automatically, so the user must provide it. In this paper, we extend our indexing structure with an additional similarity collection that is used for the similarity score calculation, and we propose a novel query algorithm that automatically determines the k-depth for matching the sub-graphs (i.e., sets of triples). The similarity score between the keyword and the name vertices of the RDF data set (graph) is the parameter that represents the user's interest. The higher the similarity, the more relevant the vertex is to the user's inquiry. The algorithm uses the predicate of the keyword object to calculate the distance score and calculates the semantic similarity score of the objects. The advantages of our approach include the following: high accuracy of the query results, and high-performance query processing by narrowing down the size of the target data with the predicate. An overview of our approach is depicted in Fig. 1.

3.1. Structured indexing algorithm with collections

In this section, we describe an off-line indexing process in which RDF data sets are preprocessed and stored in the distributed data storage in collection form. To address the scalability issue and improve the query performance, we adopt a key-value storage design [28] that can be scaled to support massive-scale RDF data in collection form. The RDF data sets are stored in the RDF management system (i.e., as stored triples, a triplestore) using distributed storage.

Updated from our previous work [12], the building process of the similarity collection is added to our indexing algorithm. The indexing algorithm builds the following five collections: (1) a vertex collection that stores subjects and objects (RDF allows subjects and objects to be interchanged); (2) a predicate collection that stores predicate data; (3) a literal collection that stores literal data; (4) an RDF data collection, which stores the relations of the data (the RDF data collection is the most important of these because it is the first indexing product of our algorithm, from which the other three collections are derived); and, in addition to our previous work, (5) a collection of similarities between literals.
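As a minimal sketch of this indexing step, the following Python builds the first four collections from a list of triples. The function name, the ID-assignment order, and the crude literal test are our own illustrative assumptions, not the authors' implementation, and the similarity collection (5) is omitted here.

```python
# Hypothetical sketch of the collection-building step; build_collections
# and its internal structures are illustrative assumptions.
def build_collections(triples):
    """triples: iterable of (subject, predicate, object) strings."""
    vertex, predicate, literal = {}, {}, {}   # value/ID maps
    rdf_data = {}                             # vID -> {"in": {pID: [vIDs]}, "out": [vIDs]}

    def vid(value):
        # Assign a fresh numeric ID to each distinct vertex value.
        return vertex.setdefault(value, len(vertex))

    def pid(value):
        return predicate.setdefault(value, len(predicate))

    for s, p, o in triples:
        s_id, p_id, o_id = vid(s), pid(p), vid(o)
        if not o.startswith("http://"):       # crude literal test for the sketch
            literal[o_id] = o
        rdf_data.setdefault(s_id, {"in": {}, "out": []})["out"].append(o_id)
        rdf_data.setdefault(o_id, {"in": {}, "out": []})["in"].setdefault(p_id, []).append(s_id)
    return vertex, predicate, literal, rdf_data
```

The incoming document keeps the predicate ID while the outgoing document stores only target vertex IDs, mirroring the design described in Section 3.1.1.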

3.1.1. Collections

The RDF data collection has vertex documents (entities in a collection that represent each value, i.e., an object or subject) and incoming and outgoing documents that represent data at a k-depth of 1. The k-depth is a degree showing the number of vertex documents related to the current vertex document. For example, http://foobar.xx/bob has a k-depth of 1 relationship with 010242125234, http://foobar.xx/christ, bob, and http://foobar.xx/john in Table 1.

Fig. 1. Overview of indexing and keyword query processing.


We represent an RDF data collection as a key-value pair:

(vertex id, <incoming documents, outgoing documents>)    (1)

The pair consists of a vertex id as the key, and a pair of incoming and outgoing documents as the value. Each element of the incoming documents is a pair of (predicate id : a list of vertex ids). Each element of the outgoing documents is a list of outgoing vertex ids. We do not store predicates in an outgoing document because we are able to deduce the predicate through the target's incoming document.

If a vertex document stored in the RDF data collection is used as a subject only, the vertex document is stored only in an outgoing document. If it is used as an object only, it is stored only in an incoming document. However, there are cases where an RDF datum is both a subject and an object. For example, as shown in Fig. 2, the vertex document “http://foobar.xx/bob” is both a subject and an object at the same time, and it is stored in both an incoming and an outgoing document. The format of incoming and outgoing documents is modeled on JSON. The bracket “{}” encloses a set of key: value pairs, and the bracket “[]” encloses a set of values. Table 2 is an example of the RDF data collection for the RDF data example in Table 1.

The main reason for this design is to improve the information retrieval performance by grouping and storing the related data together. In Table 2, the elements are stored as numbers and flags, which will be explained in detail later.
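As a concrete illustration of the pair in Eq. (1), the entry for vertex 0 ("http://foobar.xx/bob") from Table 2 might serialize as the following JSON (the "up"/"down" key names are our own labels for the incoming and outgoing documents):

```python
import json

# Illustrative key-value entry for vertex 0 from Table 2: the incoming
# document maps predicate ID 1 (foaf:know) to the subject vertex 2 that
# points at bob, and the outgoing document lists the objects bob points
# to; outgoing predicates are deduced from the targets' incoming
# documents, so they are not stored twice.
entry = {"0": {"up": {"1": [2]}, "down": [1, 4, 6]}}
print(json.dumps(entry, sort_keys=True))
```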

Table 1
RDF data example.

Subject                           Predicate        Object
http://foobar.xx/bob              foaf:phone       010242125234
http://foobar.xx/bob              foaf:know        http://foobar.xx/christ
http://foobar.xx/bob              foaf:name        bob
http://foobar.xx/christ           foaf:name        Christ
http://foobar.xx/john             foaf:phone       01052325234
http://foobar.xx/john             foaf:name        John
http://foobar.xx/john             foaf:know        http://foobar.xx/bob
http://foobar.xx/john             foaf:based_near  http://sws.geonames.org/3020251/
http://sws.geonames.org/3020251/  gn:population    4242
http://sws.geonames.org/3020251/  gn:postalCode    24,124
http://sws.geonames.org/3020251/  gn:name          Embrun
http://foobar.xx/john             foaf:birthday    12/22/2000

Fig. 2. RDF data graph of the example in Table 1.

Table 2
RDF data collection of the RDF example in Table 1.

Vertices  Up (incoming documents)  Down (outgoing documents)
[0]       {"1": [2]}               [1,4,6]
[1]       {"1": [0]}               [7]
[2]       {}                       [0,3,8,9,11]
[3]       {"3": [2]}               [5,10,12]
[4]       {"0": [0]}               []
[5]       {"4": [3]}               []
[6]       {"2": [0]}               []
[7]       {"2": [1]}               []
[8]       {"0": [2]}               []
[9]       {"2": [2]}               []
[10]      {"6": [3]}               []
[11]      {"7": [2]}               []
[12]      {"5": [3]}               []


As mentioned, our RDF management system and querying algorithm utilize four collections of the given RDF data. Since the data retrieval performance from the RDF documents becomes slower as we have more RDF data to store, we need an effective method to reduce the amount of stored data. We propose a flagging and grouping technique for our indexing algorithm to optimize the storage space. Table 3 shows three collections from the example in Table 1.

3.1.2. Flagging and grouping

It is more space-efficient to translate strings (vertex/literal) into numbers and to store these numbers instead of the native strings. To do so, we use a flagging technique (Flag). To apply this technique, we introduce the predicate collection and vertex collection described above. The values in these two collections have corresponding string objects. A pID and a vID are assigned to the predicate and vertex values in the collections, respectively. In general, literal-type data are used in the objects, not the subjects, of RDF. Since we assume that the keywords used in a user query are literals, we build a literal collection, as shown in Table 3, using the literal data to expedite the search process. We identify literals in the RDF data using Jena [29].

The Max_Change_Subject is used to calculate the semantic distance score for the similarity, and it is calculated when the literal collection table is built. The meaning and use of Max_Change_Subject are fully explained in the next section. Table 4 shows a grouping example. We use an example different from Table 1 because the example in Table 1 does not have any entity data eligible for grouping.
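The grouping step of Table 4 can be sketched as merging vertices whose incoming and outgoing documents are identical into one hypervertex. The helper below is our own sketch of that idea, not the authors' code; the document representations are kept as hashable strings for simplicity.

```python
# Hypothetical sketch of the grouping technique (Table 4): vertices with
# identical incoming and outgoing documents collapse into a hypervertex.
def group_vertices(rdf_data):
    """rdf_data: {vertex_id: (incoming_repr, outgoing_repr)} with hashable reprs.
    Returns {tuple_of_vertex_ids: (incoming_repr, outgoing_repr)}."""
    groups = {}
    for vertex_id, docs in rdf_data.items():
        groups.setdefault(docs, []).append(vertex_id)
    return {tuple(ids): docs for docs, ids in groups.items()}

# The two rows of Table 4 share the same documents, so they merge.
table4 = {0: ('{"1": [2,3], "2": [4]}', "[5]"),
          1: ('{"1": [2,3], "2": [4]}', "[5]")}
print(group_vertices(table4))
```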

3.2. Keyword query processing algorithm

As noted, an effective way to determine the optimal search depth for keyword queries is critical for the practical use of a keyword query in an RDF management system. If the set of returned results with the k-th highest overall relevancy to the given keywords, i.e., from the most relevant triple to the k-th ranked triple, is too small, the relevant information may not be in the returned results. On the other hand, if the result set is too big, the quality of service of the RDF management system will be low because of the lengthy query response time required to process a query over massive-scale RDF data, as well as the low accuracy of the results. In this section, we describe our novel method for determining the k-depth using the similarity between keywords and the names of the vertices in the RDF graph.

3.2.1. Semantic similarity scores

To retrieve data semantically related to a user query with an optimal k-depth on RDF, we define the similarity score based on two measurements: a relational strength using the keyword predicate class, and a semantic strength between two objects. The relational strength is obtained using the predicate homogeneity because the predicate class explicitly describes the semantic relation between the subject and object linked by the predicate class. In Fig. 2, the subject http://foobar.xx/john has four predicates: foaf:phone, foaf:name, foaf:know, and foaf:based_near. The predicate class generally consists of predicates with similar semantic meaning measured from the RDF vocabulary. For example, Friend of a Friend (FOAF) [30] describes people in machine-readable Web pages. The links between vertices are therefore predicates that describe the person, such as their name, phone number, and homepage. Likewise, the predicates of Semantically Inter-linked Online Communities (SIOC) [31] describe blogs, forums, and mailing lists. In our proposed algorithm, the number of subject changes is used as the base unit for calculating the semantic distance between two objects.

However, with this measurement only, we may overlook an object in the data that is not close to the keyword object in semantic distance but is similar in terms of semantic meaning. To solve this problem, we exploit the lexical meaning of objects as another measurement. We use a measurement algorithm for the lexical similarity between two objects, based on Lin's algorithm [24], to search for objects that have a high semantic similarity score in the RDF data set. In the following, we explain the process of these two measurements in detail.

As described in Section 3.1, the RDF data set is indexed and stored in our proposed structure, i.e., collections. The measuring process of the relational strength using the predicate is as follows: first, we obtain the ID of the literal data that matches the given keyword to obtain the vertex ID (to locate the vertex) in the RDF data collection. Using the predicate IDs, we then obtain the ID from the incoming list of the predicate collection. We thereby obtain the list of predicates connected to the keyword object.
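The first lookup step can be sketched as a scan of the literal collection of Table 3; find_keyword_vertex is our own illustrative helper, and the case-insensitive match is an assumption, not a detail from the paper.

```python
# Hypothetical keyword lookup against the literal collection (Table 3);
# the subset of entries below uses the IDs from that table.
literal_collection = {6: "bob", 7: "Christ", 9: "John", 10: "Embrun"}  # ID -> value

def find_keyword_vertex(keyword, literals):
    """Return the ID of the literal matching the keyword, or None."""
    for literal_id, value in literals.items():
        if value.lower() == keyword.lower():
            return literal_id
    return None

print(find_keyword_vertex("john", literal_collection))  # -> 9
```

With the matching ID in hand, the incoming documents of the RDF data collection yield the predicates connected to the keyword object.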

After we get the predicate list, we calculate the distance score. Each arrow in Fig. 3 shows a change of subject in the RDF graph. A change in subject indicates the number of subjects from

Table 3
Three collections of the RDF example in Table 1.

Vertex collection
vID  Value
0    http://foobar.xx/bob
1    http://foobar.xx/christ
2    http://foobar.xx/john
3    http://sws.geonames.org/3020251/
4    010242125234
5    4242
6    bob
7    Christ
8    01052325234
9    John
10   Embrun
11   12/22/2000
12   24,124

Predicate collection
pID  Value
0    foaf:phone
1    foaf:know
2    foaf:name
3    foaf:based_near
4    gn:population
5    gn:postalCode
6    gn:name
7    foaf:birthday

Literal collection
ID   Value         Max_Change_Subject
4    010242125234  3
5    4242          3
6    bob           3
7    Christ        3
8    01052325234   3
9    John          3
10   Embrun        3
11   12/22/2000    3
12   24,124        3

Table 4
Grouping example.

Vertices  Incoming                Outgoing
[0]       {"1": [2,3], "2": [4]}  [5]
[1]       {"1": [2,3], "2": [4]}  [5]

↓

Hypervertices  Incoming                Outgoing
[0,1]          {"1": [2,3], "2": [4]}  [5]


the subject of a query keyword to that of an object, i.e., how many subjects are encountered on the path from the query keyword to the vertex (object). For example, in the figure, the number of subject changes for the object “4242” with respect to the query keyword “John” is 2 because there are two subjects, “http://foobar.xx/john” and “http://sws.geonames.org/3020251/”, between “John” and “4242”. It therefore has a subject change number of 2. According to the rationale above, we can measure the semantic distance score (SDS) between objects. We also normalize the raw semantic distance so that its maximum value is one, using the maximum number of subject changes as the basis, as seen in Eq. (2).

SDS(k, oi) = 1 − (NSC(k, oi) − 1) / Max[NSC(k, o1), …, NSC(k, oN)]   (k: keyword, oi: object in comparison)   (2)

where NSC(k, oi) is the number of subject changes between a keyword k and the ith object in the sub-graph of the RDF, and Max[NSC(k, o1), …, NSC(k, oN)] is the maximum value in the set of subject changes between the keyword k and all objects (i.e., from the first object, denoted o1, to the last object, denoted oN).
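Eq. (2) can be sketched directly in Python. The NSC values below follow the worked example later in this section, where NSC("John", "bob") = 2 and the maximum number of subject changes is 3; the function name is ours:

```python
def sds(nsc_i, nsc_all):
    """Semantic distance score (Eq. 2): normalize the number of subject
    changes so that the closest object scores 1 and farther objects
    score proportionally less."""
    return 1 - (nsc_i - 1) / max(nsc_all)

# Worked example: NSC("John", "bob") = 2, maximum NSC over all objects = 3.
nsc_values = [1, 2, 3]
print(sds(2, nsc_values))  # -> 1 - (2-1)/3 = 2/3
```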

The other similarity measurement is the semantic similarity score (SSS), as mentioned above. We use WordNet for this calculation because there have been many related works using WordNet for similarity studies, and many algorithms have been proposed as a result. We experimented empirically with various algorithms [18–20,23,32] and selected Lin's algorithm [32] as the base algorithm to modify. From Lin's algorithm, the semantic score between a query keyword and the ith object, denoted k and oi respectively, is defined as seen in Eq. (3).

SSS(k, oi) = 2 · log r(c0) / (log r(ci) + log r(ck))   (k: keyword, oi: object in comparison)   (3)

where r(ck) is the ratio of the number of classes including k, denoted ck, to the total number of classes; r(ci) is the ratio of the number of classes including oi, denoted ci; and r(c0) is the ratio of the number of classes including both ck and ci, denoted c0. Fig. 5 shows an RDF example of the semantic similarity score computed with Lin's algorithm and WordNet. As shown in the figure, most of the

semantic similarities are zero for the objects, because many of the literals in the RDF data are proper nouns not contained in the WordNet database. Since they do not exist in WordNet, the similarity is naturally zero.
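Eq. (3) can be sketched with hypothetical class-frequency ratios (the numbers below are illustrative, not from the paper). In Lin's original measure, c0 is the most specific class subsuming both words, so r(c0) is at least as large as r(ci) and r(ck) and the score stays within [0, 1]; the zero case covers literals absent from WordNet:

```python
import math

def sss(r_c0, r_ci, r_ck):
    """Semantic similarity score (Eq. 3), Lin's measure:
    2 * log r(c0) / (log r(ci) + log r(ck))."""
    if r_c0 == 0:       # keyword or object not covered by WordNet:
        return 0.0      # similarity is naturally zero (see text)
    return 2 * math.log(r_c0) / (math.log(r_ci) + math.log(r_ck))

# Hypothetical ratios: the shared class is more general than either word.
print(sss(r_c0=0.20, r_ci=0.05, r_ck=0.08))
```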

Finally, we obtain the semantic similarity (SS) of the ith object, denoted oi, for a query keyword using a linear combination of SDS and SSS with a parameter p, as seen in Eq. (4).

SS(k, oi) = p · SDS(k, oi) + (1 − p) · SSS(k, oi)   (k: keyword, oi: object in comparison)   (4)

where 0 < p < 1. It is challenging to determine an optimal value of the parameter p automatically, so we determine the optimal p empirically, as explained in the Evaluation section.

3.2.2. Keyword query processing with automatic k-depth decision

In this section, we describe the query processing in detail using

the similarity score example from Section 3.2.1. In Fig. 4, the red arrows representing "non-related predicates" are removed in the early stage of the algorithm. For example, "John" has the predicate class "foaf", whereas the class of the red-arrowed predicates is "gn".

After we obtain the predicates of the keyword object, the query processor calculates the semantic distance score (SDS) using Eq. (2). First, we calculate the maximum number of subject changes (NSC(k, oi)) in Eq. (2) off-line, rather than on demand for each query, which would damage the query processing performance. The calculation process is as follows: (1) we visit each literal entity in the literal collection to obtain the sub-graph; (2) we then perform a basic depth-first search (DFS) on each sub-graph recursively and increment the maximum number of subject changes by 1. For example, "John" in Fig. 3 has the subject "http://foobar.xx/john" in the incoming document from the RDF data collection. We then visit each of the four vertices ("01052325234", "http://foobar.xx/bob", "http://sws.geonames.org/3020251/", "01052325234") of "http://foobar.xx/john" in turn. Since the "01052325234" vertices are objects, we halt the recursion there. However, since "http://foobar.xx/bob" and "http://sws.geonames.org/3020251/" are subjects, we perform an additional round of recursion (i.e., find the sub-graph and increment the maximum number of subject changes) by searching

Fig. 3. The number of subject changes (RDF data graph of the example of Table 1).


the incoming and outgoing documents of the subjects. We eventually obtain the maximum number of subject changes when the recursion halts.
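The off-line NSC computation can be sketched as a DFS that adds one each time the path enters another subject. The toy graph below follows Fig. 3, but the adjacency encoding and function name are ours:

```python
# Toy sub-graph following Fig. 3: each subject maps to its neighbour
# vertices; literals (objects) have no outgoing edges and end the recursion.
graph = {
    "http://foobar.xx/john": ["John", "01052325234",
                              "http://foobar.xx/bob",
                              "http://sws.geonames.org/3020251/"],
    "http://foobar.xx/bob": ["bob"],
    "http://sws.geonames.org/3020251/": ["Embrun", "4242"],
}

def nsc(graph, start, target, changes=1, seen=None):
    """Number of subject changes between a keyword's subject and a target
    object: DFS that increments the count every time another subject is
    entered. Returns None if the target is unreachable."""
    seen = seen or set()
    seen.add(start)
    for v in graph.get(start, []):
        if v == target:
            return changes
        if v in graph and v not in seen:        # v is itself a subject
            r = nsc(graph, v, target, changes + 1, seen)
            if r is not None:
                return r
    return None

# "John" maps to the subject http://foobar.xx/john; two subjects lie on
# the path to "4242", so NSC = 2, matching the example in the text.
print(nsc(graph, "http://foobar.xx/john", "4242"))  # -> 2
```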

We then calculate the semantic similarity score (SSS) according to Eq. (3) and finally obtain the semantic similarity (SS). In the query processing algorithm, we compare this SS value with a threshold α. If the semantic similarity value is higher than α, the object (i.e., the literal value of the vertex) is relevant to the user's inquiry, and we include the triple containing the object in the result set. The threshold α varies depending on the data set; therefore, we obtain the optimal α through empirical experiments.

To explain the pseudo code of the algorithm below in detail, we set the threshold and the weight on the two scores as α = 0.5 and p = 0.6, respectively. For the user's query keyword "John", we obtain the ID of "John" from the literal collection and the maximum number of subject changes from memory. "bob" has the same predicate class as "John". The semantic distance score (SDS) of "John" and "bob" in Fig. 4 is 1 − (2 − 1)/3 = 2/3, and the semantic similarity score (SSS) is 0.3, as seen in Fig. 5. Finally, the semantic similarity (SS) is 0.6 · 2/3 + 0.4 · 0.3 = 0.52. Since the SS value is larger than the threshold (α = 0.5), "bob" is determined to be relevant, and the triple of "bob" is added to the result set.
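The arithmetic of this worked example can be checked directly against Eq. (4):

```python
def ss(sds_score, sss_score, p=0.6):
    """Semantic similarity (Eq. 4): linear combination of SDS and SSS."""
    return p * sds_score + (1 - p) * sss_score

alpha = 0.5                  # threshold from the example
sds_bob = 1 - (2 - 1) / 3    # SDS("John", "bob") = 2/3
sss_bob = 0.3                # SSS("John", "bob") from Fig. 5
score = ss(sds_bob, sss_bob)
print(round(score, 2), score > alpha)  # -> 0.52 True
```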

Fig. 4. Semantic distance score: SDS (RDF data graph of the example of Table 1). (For interpretation of the references to color in this figure, the reader is referred to the web version of this article.)

Fig. 5. Semantic similarity score: SSS (RDF data graph of the example of Table 1).


The following pseudo code describes the SearchGraph algorithm, which is iterated recursively until the k-depth reaches 1. Fig. 6 illustrates the SearchGraph algorithm running on the RDF data example in Table 1.

Algorithm SearchGraph(vID, callerID)
Input: vID, callerID, α (threshold)
Data: RDF data collection (table) with incoming & outgoing lists
Output: R containing the triples of the auto k-depth sub-graph of the vertex

1: If (α > similarity(vID, callerID)) then
2:   return result
3: Else
4:   R.add(extractTriples(GetUpList(vID)))
     // Retrieve 1-depth graph of incoming edges of the vertex
     // (except for triples in which OBJECT is vID and SUBJECT is callerID)
5:   R.add(extractTriples(GetDownList(vID)))
     // Retrieve 1-depth graph of outgoing edges of the vertex
     // (except for triples in which SUBJECT is callerID and OBJECT is vID)
6:   For each subject in vID.upList
7:     R.add(SearchGraph(subject.ID, vID))
       // Add (k−1)-depth graph of adjacent vertices of the given vertex
8:   For each object in vID.downList
9:     R.add(SearchGraph(object.ID, vID))
       // Add (k−1)-depth graph of adjacent vertices of the given vertex
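The pseudo code above can be sketched as runnable Python over a toy in-memory collection. This is our sketch, not the paper's implementation: the similarity function is stubbed, the predicate is a placeholder "p", and the caller-exclusion check is simplified:

```python
def search_graph(v_id, caller_id, coll, similarity, alpha, result=None):
    """Recursive SearchGraph sketch: stop expanding once the similarity
    of the vertex drops below the threshold alpha; otherwise collect the
    1-depth triples and recurse into adjacent subjects/objects."""
    result = result if result is not None else set()
    if alpha > similarity(v_id, caller_id):
        return result
    up, down = coll[v_id]            # (incoming subjects, outgoing objects)
    for s in up:                     # 1-depth incoming triples
        if s != caller_id:
            result.add((s, "p", v_id))
    for o in down:                   # 1-depth outgoing triples
        if o != caller_id:
            result.add((v_id, "p", o))
    for s in up:                     # (k-1)-depth recursion upward
        if s in coll and s != caller_id:
            search_graph(s, v_id, coll, similarity, alpha, result)
    for o in down:                   # (k-1)-depth recursion downward
        if o in coll and o != caller_id:
            search_graph(o, v_id, coll, similarity, alpha, result)
    return result

# Toy collection: vertex -> ([incoming subjects], [outgoing objects]).
coll = {"john": (["bob"], ["4242"]), "bob": ([], ["john"])}
triples = search_graph("john", None, coll, lambda v, c: 0.52, alpha=0.5)
print(sorted(triples))
```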

4. Evaluation

In this section, we present our study of the effectiveness and usability of the approach. We performed two experiments: one on

the method to obtain the optimal weight p between the semantic distance score (SDS) and the semantic similarity score (SSS) in the linear combination, Eq. (4), and the other focusing on a comparison against the popular RDF management system Jena [29]. The first experiment evaluates the usability of our approach: for the given data sets, we vary p and obtain its optimal value by observing the precision, recall, and accuracy values. The second experiment evaluates the effectiveness and efficiency of our approach by comparing it with Jena, which supports SPARQL, in terms of query response time and accuracy. The empirical results show that a system based on our approach outperforms the Jena system.

4.1. Evaluation environment setup

4.1.1. Systems

For our evaluation, we use two physical server nodes to host

MongoDB [33] as a key-value based distributed storage. We also utilize its sharding feature to store RDF data in a distributed manner. We have one master node for indexing, querying, and storing, and one additional node for data distribution (i.e., sharding). Table 5 shows the environment setup for the evaluation.

4.1.2. Datasets

We used RDF data from DBpedia [34], one of the Linked Data

projects that contains resources of the Wikipedia project, and WordNet 3.0, for the keyword query processing experiments. We generated

Fig. 6. Illustration of SearchGraph algorithm running (RDF data graph of the example of Table 1).

Table 5. Simulation environment setup.

                 Main server   Node
Server name      Ritchie       MAGI
CPU clock        1.6 GHz       2.0 GHz
RAM              8 GB          4 GB
Number of cores  2             1
OS               CentOS 5.6    CentOS 5.6
Development      Java 1.7.0    Java 1.7.0
Database         MongoDB       MongoDB


3 datasets of different sizes (i.e., 1 MB, 5 MB, and 10 MB) based on the DBpedia data sets. The numbers of triples in the datasets are 8k, 16k, and 29k, respectively. These datasets are relatively smaller than those used in the keyword search evaluations of other research papers [6,15]. However, we argue that the sizes are big enough for our evaluation purposes: validating the feasibility of our semantic similarity equation (i.e., Eq. (4)) and the effectiveness of our automatic search depth decision algorithm, which returns exact matches and relevant information (i.e., triples).

4.2. Obtaining an optimal p

In our semantic similarity equation, the value p is a weight between the semantic similarity score and the semantic distance

Fig. 7. Precision, recall, and accuracy by varying p.

Fig. 8. F-measure of precision and recall in Fig. 7.

Fig. 9. Comparison of query processing performance with Jena (with a depth one, exact match only).


score. p varies depending on the data set, since each RDF data set has a unique graph structure and set of triples. Therefore, obtaining the optimal p is critical for determining the depth of the search. As noted above, the semantic similarity score of a proper noun with the query keyword is zero when we compare an object with a word outside WordNet. With the optimal p, we balance the score of structural (i.e., relational) strength and the score of semantic strength. In this section, we present a method to obtain the optimal p value.

For this experiment, we prepared small data sets (i.e., sub-graphs) based on DBpedia data, similar to the example in Table 1 in terms of size and structure. We set the threshold to 0.4 (α = 0.4). The precision, recall, and accuracy data are shown in Fig. 7. Precision and recall are the more important measures for information retrieval in RDF data, since the ratio of irrelevant to relevant data increases as the number of triples increases (i.e., big RDF data). Therefore, we find the optimal p using the F-measure in Fig. 8, which is the harmonic mean of precision and recall. As shown in Fig. 8, the optimal value of p for the given data sets is 0.7.
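The optimal p is read off the F-measure curve; the measure itself is just the harmonic mean of precision and recall. A minimal sketch of the selection step, with illustrative (precision, recall) pairs that are not the values from Fig. 7:

```python
def f_measure(precision, recall):
    """F-measure: harmonic mean of precision and recall (as in Fig. 8)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical sweep over p: pick the p whose (precision, recall) pair
# maximizes F. The numbers here are illustrative only.
sweep = {0.5: (0.80, 0.70), 0.7: (0.85, 0.82), 0.9: (0.90, 0.60)}
best_p = max(sweep, key=lambda p: f_measure(*sweep[p]))
print(best_p)  # -> 0.7
```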

4.3. Keyword query performance

We evaluate the keyword query performance of our proposed system. We first compare two RDF management systems, Jena (i.e., a popular RDF management system) and our proposed system, to evaluate performance and usability. We examine the

effectiveness of our approach by comparing the response time of Jena with ours. Since Jena with SPARQL supports only exact-match results and does not have the k-depth capability that automatically determines the search depth of relevant information (i.e., it is not specialized for keyword queries), we were limited in planning experiments for various k-depths. Thus, we conducted the experiments by setting the search depth of the two systems to one (i.e., k-depth = 1, exact match only) and two (i.e., Jena uses DESCRIBE queries). We varied the RDF data set size from 1 MByte to 100 MByte. The query response times are shown in Figs. 9 and 10 for k-depths one and two, respectively. We set the threshold to 0.4 (α = 0.4) and p to 0.7, as obtained in Section 4.2. In these experiments, we prepared a set of 10 keywords for each test and iterated 10 times to obtain the average response time. We used the identical keyword set for our system and Jena.

We observe that our system performs better by an order of magnitude (i.e., 3–45 times better for the given datasets). The query processing time increases linearly as the size of the data increases, regardless of k-depth. Also, it takes more time to process a query as we increase the search depth (i.e., k-depth). However, our automatic search depth

Fig. 10. Comparison of query processing performance with Jena (with a depth two).

Fig. 11. Comparison of query processing performance (with an automatic depth).

Table 6. Maximum search depth (k-depth) of queries in Fig. 11.

Data size (MByte)       1   10   30   50   70   100
Maximum search depth    2    4    4    6    7     7


design keeps the query processing time close to optimal, based on the semantic distance and semantic similarity, instead of consuming much time to return barely relevant data. Thus, this increase in processing time can be minimized. The query response time of Jena does not change much from k-depth one to k-depth two. This result may suggest that Jena's performance does not vary much as the search depth increases. However, analyzing Jena's results in terms of search depth is not adequate, since Jena does not have an automatic search depth decision capability. The Jena system simply returns the indexed objects of the given subject in response to the DESCRIBE query that we used to imitate k-depth two. We present the query performance results of Jena with k-depth two only as guidance.

The main design factor behind this difference is the existence of collections, an indexed structure built particularly for keyword search. Although our approach requires an indexing preprocessing step, the preprocessing is done off-line. Our system clearly outperforms the conventional system. In addition, as presented in our previous work [12], our approach is not limited to keyword queries: it is also applicable to formal queries such as SPARQL, and our system gives competitive performance for formal queries as well.

Our next experiment evaluates the main contribution of this paper, the automatic decision of the search depth, in terms of query response performance. In this experiment, we evaluate the performance of our proposed system with the automatic search depth decision capability; the results are shown in Fig. 11. Our system automatically decides the search depth based on the similarity measures. Unlike conventional RDF data management systems, our system is designed to return as much relevant data as possible. Table 6 shows the maximum search depth (k-depth) of the queries in Fig. 11. By comparing the performance data of Jena in Figs. 9 and 10 with the performance of our system with automatic search depth decision in Fig. 11, we find that the calculation overhead of the automatic search depth decision is marginal. The overhead is 29% of processing time in the worst case, with four more levels of search depth (i.e., the fixed depth two of Jena vs. the automatically decided depth six of our system). For the small datasets, our system performs 50% better even with an equal or greater search depth. Thus, we claim that our approach provides more accurate and higher-quality results with a marginal sacrifice of response time and without any help from users (i.e., automatic decision of search depth), instead of returning only exact-match results.

The direction for improving the query processing performance seems apparent. Since the similarity calculation using WordNet takes 49–72% of the total query processing time, as seen in Table 7, a more efficient algorithm for similarity calculation using WordNet would make the overall query processing time, with its additional relevant information, more competitive with simple exact matching.

5. Conclusion

We propose an effective and scalable solution to the search-depth-determination problem of keyword queries on RDF data sets. In our

novel approach, the predicate, which expresses the semantic relation, plays a major role, and we calculate similarities from two different aspects: structural relation strength (i.e., the distance score) and semantic strength using WordNet. Although there are existing approaches that consider the semantic characteristics of RDF, word similarity cannot be used effectively (or practically) in those approaches, because most objects (i.e., literals) are proper nouns. Therefore, the distance score of our approach makes a difference in the practical management of RDF data. Besides usability (i.e., the returned results include the exact match and relevant triples with the optimal search depth), our approach also has potential for high-performance query processing and support of massive-scale RDF data (i.e., efficient key-value based distributed indexing and storage). The results of empirical experiments show that our approach performs better by an order of magnitude (3–45 times w.r.t. the given RDF data sets) than a popular RDF management system for the exact match (i.e., k-depth = 1, exact match only) and depth two (i.e., k-depth = 2, using DESCRIBE queries) in terms of query response time. The experimental results for the automatic search depth decision show that our approach performs 29% slower in the worst case than the conventional exact-match approach. However, it performs 50% better for small data sets (10–30 MByte) even with deeper search depths (two vs. four). The performance sacrifice is marginal, and more importantly, our approach provides relevant information (i.e., triples) with the optimal search depth.

In the future, we will consider a new method for semantic similarity calculation using WordNet to improve query processing performance, since current methods rely heavily on graph traversal algorithms and the WordNet processing time is a major performance cost. For instance, we could adopt another lexical resource for word comparison and build a novel algorithm.

Acknowledgment

This research was supported by the MSIP (Ministry of Science, ICT & Future Planning), Korea, under the ITRC (Information Technology Research Center) support program supervised by the NIPA (National IT Industry Promotion Agency) (NIPA-2014-(H0301-14-1011)).

References

[1] C. Weiss, P. Karras, A. Bernstein, Hexastore: sextuple indexing for semantic web data management, in: VLDB, Auckland, New Zealand, 2008.

[2] O. Erling, I. Mikhailov, Virtuoso: RDF support in a native RDBMS, in: Semantic Web Information Management, Berkeley, CA, USA, 2009.

[3] K. Rohloff, R.E. Schantz, High-performance, massively scalable distributed systems using the MapReduce software framework: the SHARD triple-store, in: PSI EtA, Reno/Tahoe, NV, USA, 2010.

[4] S. Sakr, G. Al-Naymat, Relational processing of RDF queries: a survey, ACM SIGMOD Rec. 38 (2009) 23–28 ⟨http://www.sigmod.org/publications/sigmod-record/0912⟩.

[5] K. Zeng, J. Yang, H. Wang, B. Shao, Z. Wang, A distributed graph engine for web scale RDF data, in: VLDB, Trento, Italy, 2013.

Table 7. Total query processing time vs. WordNet processing time (in ms).

Tasks                         Data size (MByte)
                              1         10        30        50         70         100
Total query processing time   2034.3    4757.2    8412.4    26,341.3   35,266.2   38,124.2
WordNet processing time       1322.3    3341.4    5623.5    14,123.2   17,285.4   19,617.3
                              (65.0%)   (70.2%)   (66.8%)   (53.6%)    (49.0%)    (51.5%)


[6] H. He, H. Wang, J. Yang, P.S. Yu, BLINKS: ranked keyword searches on graphs, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, Beijing, China, 2007.

[7] S. Brin, L. Page, The anatomy of a large-scale hypertextual Web search engine, in: WWW, Brisbane, Australia, 1997.

[8] L. Guo, F. Shao, C. Botev, J. Shanmugasundaram, XRANK: ranked keyword search over XML documents, in: SIGMOD, San Diego, CA, USA, 2003.

[9] D. Zhang, T. Song, J. He, X. Shi, Y. Dong, A similarity-oriented RDF graph matching algorithm for ranking linked data, in: Computer and Information Technology, Chengdu, China, 2012.

[10] T. Neumann, G. Weikum, The RDF-3X engine for scalable management of RDF data, VLDB J. 19 (1) (2010) 91–113.

[11] G.A. Miller, WordNet: a lexical database for English, Commun. ACM 38 (11) (1995) 39–41.

[12] M. Bae, D. Nguyen, S. Kang, S. Oh, K-depth RDF keyword search algorithm based on structure indexing, Adv. Methods Technol. Agent Multi-Agent Syst. (2013) 252–346.

[13] S. Elbassuoni, R. Blanco, Keyword search over RDF graphs, in: ACM Information and Knowledge Management, Glasgow, Scotland, UK, 2011.

[14] S. Wernicke, A faster algorithm for detecting network motifs, Algorithms Bioinform. 3692 (2005) 165–177.

[15] T. Tran, H. Wang, S. Rudolph, P. Cimiano, Top-k exploration of query candidates for efficient keyword search on graph-shaped (RDF) data, in: Proceedings of the International Conference on Data Engineering, Shanghai, China, 2009.

[16] I. Ilyas, G. Beskales, M. Soliman, A survey of top-k query processing techniques in relational database systems, ACM Comput. Surv. 40 (4) (2008) 1–58.

[17] The Semantic Web: Research and Applications, Heraklion, Crete, Greece, 2012.

[18] Z. Wu, M. Palmer, Verb semantics and lexical selection, in: Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, New Mexico State University, Las Cruces, New Mexico, 1994, pp. 133–138.

[19] P. Resnik, Using information content to evaluate semantic similarity in a taxonomy, IJCAI 1 (1995) 448–453.

[20] P. Resnik, Semantic similarity in a taxonomy: an information-based measure and its application to problems of ambiguity in natural language, J. Artif. Intell. Res. 11 (1999) 95–130.

[21] J.J. Jiang, D.W. Conrath, Semantic similarity based on corpus statistics and lexical taxonomy, in: Research on Computational Linguistics (ROCLING X), Taiwan, 1997.

[22] C. Leacock, M. Chodorow, Combining local context and WordNet similarity for word sense identification, in: WordNet: An Electronic Lexical Database, 1998, pp. 265–283.

[23] G. Hirst, D. St-Onge, Lexical chains as representations of context for the detection and correction of malapropisms, 1997.

[24] D. Lin, An information-theoretic definition of similarity, in: Machine Learning, San Francisco, CA, 1998, pp. 296–304.

[25] T. Pedersen, WordNet::Similarity ⟨http://www.d.umn.edu/~tpederse/similarity.html⟩.

[26] Semantic Similarity Systems, Intelligent Systems Lab., Technical University of Crete ⟨http://www.intelligence.tuc.gr/similarity/⟩.

[27] SimPack Project ⟨http://www.ifi.unizh.ch/ddis/simpack.html⟩.

[28] A. Termehchy, M. Winslett, Keyword search over key-value stores, in: WWW, Raleigh, NC, USA, 2010.

[29] B. McBride, Jena: implementing the RDF model and syntax specification, in: Proceedings of the Semantic Web workshop, Hong Kong, China, 2001.

[30] FOAF Vocabulary Specification 0.98, 2010 ⟨http://xmlns.com/foaf/spec/20100809.html⟩.

[31] J. Breslin, A. Harth, U. Bojars, S. Decker, Towards semantically-interlinked online communities, The Semantic Web: Research and Applications, Lect. Notes Comput. Sci. 3532 (2005) 500–514.

[32] F. Lin, K. Sandkuhl, A survey of exploiting WordNet in ontology matching, in: Proceedings of the World Computer Congress, TC 12: IFIP AI 2008 Stream, Milano, Italy, September 7–10, 2008.

[33] MongoDB ⟨http://www.mongodb.org/⟩.

[34] DBpedia ⟨http://dbpedia.org/Datasets⟩.

Minho Bae received his B.S. in Computer Science from Ajou University in 2012. Currently, he is a Ph.D. student in Computer Science at Ajou University. His research interests include the semantic web, RDF, cloud computing, and web services.

Sanggil Kang received the M.S. and Ph.D. degrees in Electrical Engineering from Columbia University and Syracuse University, USA, in 1995 and 2002, respectively. He is currently an Associate Professor in the Department of Computer Science and Information Engineering at INHA University, Korea. His research interests include the semantic web, artificial intelligence, multimedia systems, inference systems, etc.

Sangyoon Oh received his Ph.D. in Computer Science from Indiana University, Bloomington (USA) in 2006. He is currently an Associate Professor in the Department of Information and Computer Engineering at Ajou University, South Korea. Before joining Ajou University, he worked for SK Telecom, South Korea. His research interests include data management for the semantic web, distributed science collaborations, and cloud computing.
