Lebanese American University School of Engineering
Electrical & Computer Engineering Department
COE594 Undergraduate Research Project
Final Report
A Prototype for Semantic Querying and
Evaluation of Textual Data
Christian G. Kallas
Supervisor: Dr. Joe Tekli
Byblos, April 2017
Abstract

This undergraduate research project addresses the problem of semantic-aware search in textual SQL databases. This problem has emerged as a required extension of standard containment keyword-based querying, to meet user needs in textual databases and IR applications. We build on top of a semantic-aware inverted index called SemIndex, previously developed by our peers, to provide semantic-aware search, result selection, and result ranking functionality. Various weighting functions and search algorithms have been developed for that purpose. An easy-to-use graphical interface was also designed to help testers write and execute their queries. Preliminary experiments reflect the effectiveness and efficiency of our solution.
Table of Contents

Abstract
1 Introduction
 1.1 Context
 1.2 Goal and Contributions
2 Background
 2.1 Literature Overview
 2.2 SemIndex Logical Model
 2.3 SemIndex Physical Implementation
3 SemIndex Weight Functions
 3.1 Index Node Weight
 3.2 Index Edge Weight
 3.3 Data Node Weight
 3.4 Data Edge Weight
 3.5 Relevance Scoring Measure
4 Querying Algorithms
 4.1 Generic Solutions
  4.1.1 Legacy Inverted Index Search
  4.1.2 Query Relaxation
  4.1.3 Query Disambiguation
 4.2 SemIndex Querying Algorithms
  4.2.1 SemIndex Search
  4.2.2 SemIndex Disambiguated
  4.2.3 SemIndex Query As You Type
5 Prototype Functionality
 5.1 Interface Manipulation and Customization
 5.2 Range and KNN Search Operators
 5.3 Relevant Document Generator
6 Experimental Evaluation
 6.1 Experimental Metrics
  6.1.1 Quality: Precision, Recall and F-Value
  6.1.2 Ranking: Mean Average Precision
  6.1.3 Ordering: using Kendall's Tau
 6.2 Preliminary Results
  6.2.1 Querying Effectiveness
  6.2.2 Querying Efficiency
7 Conclusion
 7.1 Summary of Contribution
 7.2 Ongoing Works
List of Figures

Figure 1 - SemIndex framework (from [7])
Figure 2 - Sample inverted index (a) and corresponding SemIndex graph (b), based on the textual collection Δ in Table 1
Figure 3 - Extract from the knowledge graph of WordNet, with corresponding inverted index
Figure 4 - SemIndex graph obtained after coupling the data collection and the knowledge base graphs in Figure 3
Figure 5 - SemIndex physical design
Figure 6 - Sample node linkage in SemIndex graph
Figure 7 - Querying panel
Figure 8 - Query as you go panel
Figure 9 - Query as you go meaning chooser panel
Figure 10 - Relevant Document Generator
Figure 11 - Comparing precision (PR) and recall (R) results obtained using SemIndex versus legacy InvIndex, with query group Q1 (unrelated queries), varying the number of query terms k and link distance L
Figure 12 - Comparing precision (PR) and recall (R) results obtained using SemIndex versus legacy InvIndex, with query group Q2 (expanded queries), varying the number of query terms k and link distance L
Figure 13 - Comparing f-value levels obtained using SemIndex versus legacy InvIndex, with query group Q1 (unrelated queries) and query group Q2 (expanded queries)
Figure 14 - Comparing SemIndex and legacy InvIndex query execution time, considering queries of group Q1 (unrelated queries), varying the number of query terms k
Figure 15 - Comparing SemIndex and legacy InvIndex query execution time, considering queries of group Q1 (unrelated queries), varying link distance L

List of Tables

Table 1 - Sample movie data collection extracted from IMDB, titled Δ
Table 2 - Test queries
1 Introduction
1.1 Context

Processing keyword-based queries is a fundamental problem in the domain of Information Retrieval (IR) [12, 15, 16]. A standard containment keyword-based query, which retrieves textual entities that contain a set of keywords, is generally supported by a full-text index. The inverted index is considered one of the most useful full-text indexing techniques for very large textual collections [16]; it is supported by many relational DBMSs, and has been extended toward semi-structured and unstructured data to support keyword-based queries.
In a previous study [7], a group of colleagues developed SemIndex: a semantic-aware inverted index model designed to process semantic-aware queries. An extended query model with different levels of semantic awareness was defined, so that both semantic-aware queries and standard containment queries can be processed within the same framework. Figure 1 illustrates the overall framework of the SemIndex approach and its main components. Briefly, the Indexer manages index generation and maintenance, while the Query Processor processes and answers semantic-aware (or standard) queries issued by the user through the SemIndex component.
Figure 1- SemIndex Framework (from [7])
1.2 Goal and Contributions

While the study in [7] focused on the index design model of SemIndex, the goal of our current project is to complete the design and development of SemIndex's query evaluation engine. To this end, our contributions include: i) dedicated weight functions, associated with the different elements of SemIndex, allowing semantic query result selection and ranking; ii) a dedicated relevance scoring measure, required in the query evaluation process in order to retrieve and rank relevant query answers; iii) three alternative query processing algorithms (in addition to the main algorithm); iv) a dedicated GUI allowing the user to easily manipulate the prototype system; and v) an experimental study evaluating SemIndex's querying effectiveness and efficiency.
2 Background
2.1 Literature Overview

Early approaches to keyword search over RDBs use traditional IR scores (e.g., TF-IDF) to find ways to join tuples from different tables in order to answer a given keyword query [1, 14, 20]. The proposed search algorithms focus on enumerating join networks, called candidate networks, to connect relevant tuples by joining different relational tables. The result for a given query comes down to a sequence of candidate networks, each made of a set of tuples containing the query keywords in their text attributes, connected through their primary-foreign key references, and ranked based on candidate network size and coverage. More recent methods on RDB
full-text search in [24, 25] focus on more meaningful scoring functions and generation of top-k
candidate networks of tuples, allowing to group and/or expand candidate networks based on
certain weighting functions in order to produce more relevant results. The authors in [27] tackle
the issue of keyword search on streams of relational data, whereas the approach in [39]
introduces keyword search for RDBs with star-schemas found in OLAP applications. Other
approaches introduced natural language interfaces providing alternate access to a RDB using
text-to-SQL transformations [22, 33], or extracting structured information (e.g., identifying
entities) from text (e.g., Web documents) and storing it in a DBMS to simplify querying [9, 10].
Keyword-based search for other data models, such as XML [17, 19] and RDF [3, 4] have also been
studied.
More recent approaches have developed so-called semantic-aware or knowledge-aware (keyword) query systems, which have emerged over the past decade as a natural extension to traditional containment queries, driven by (non-expert) user demands. Most existing works
in this area have incorporated semantic knowledge at the query processing level, to: i) pre-
process queries using query rewriting/relaxation and query expansion [5, 11, 28], ii) disambiguate
queries using semantic disambiguation and entity recognition techniques [5, 23, 32], and/or iii)
post-process query results using semantic result organization and re-ranking [32, 35, 38]. Yet,
various challenges remain unsolved, namely: i) time latencies when involving query pre-
processing and post-processing [11, 28], ii) complexity of query rewriting/relaxation and query
disambiguation requiring context information (e.g., user profiles or query logs) which is not
always available [13, 37], and iii) limited user involvement, where the user is usually constrained
to providing feedback and/or performing query refinement after the first round of results has
been provided by the system [6, 31].
Our work is complementary to most existing DB search algorithms in that our approach extends syntactic keyword-term matching (where only tuples containing exact occurrences of the query keywords are identified as results) toward semantic-based keyword matching (where tuples containing terms that are lexically and semantically related to the query terms are also identified as potential results), a functionality which, to our knowledge, remains unaddressed in most existing DB search algorithms. We build on an adapted index structure able to integrate and extend textual information with domain knowledge (not only at the querying level, but rather) at the
most basic data indexing level, providing a semantic-aware inverted index capable of supporting
semantic-based querying, and allowing to answer most challenges identified above.
2.2 SemIndex Logical Model
SemIndex's logical design consists of a semantic knowledge graph structure, obtained by combining the graph representations of the two input resources: i) the textual collection (e.g., IMDB) and ii) the reference knowledge base (e.g., WordNet), into a single SemIndex graph structure.
Consider for instance the sample data collection in Table 1.
Table 1. Sample movie data collection extracted from IMDB, titled Δ.
ID Textual content
O1 When a Stranger Calls (2006): A young high school student babysits for a very rich family.
She begins to receive strange phone calls threatening the children...
O2 Days of Thunder (1990): Cole Trickle is a young racer from California with years of
experience in open-wheel racing winning championships in Sprint car racing…
O3 Sound of Music, The (1965): Maria had longed to be a nun since she was a young girl, yet
when she became old enough discovered that it wasn’t at all what she thought...
Term | Object IDs[]
“car” | O1, O2
“light” | O1
“sound” | O3
“zen” | O1
… | …
a. Inverted index InvIndex(Δ).
b. SemIndex graph representing InvIndex(Δ).
Figure 2 - Sample inverted index (a) and corresponding SemIndex graph (b), based on the textual collection Δ in Table 1.
Figure 2.a shows an extract from an inverted index built on the sample movie database in Table 1, where data objects O1, O2, and O3 have been indexed using index terms extracted from the database, sorted in alphabetical order. It is
important to note that this simple index is typically used to answer containment queries [41],
aiming at finding data objects that contain one or more terms. When a keyword query mapping
two or more index terms must be processed, the corresponding posting lists are read and
merged. The index terms and their mappings with the data objects can be generated using
classical Natural Language Processing (NLP) techniques (including stemming, lemmatization, and
stop-words removal) [30], which could be either embedded in the DBMS or supplied by a third-
party provider.
An extract from the WordNet ontology is shown in Figure 3, where S1, S2 and S3 represent senses (i.e., synsets) with their string values (i.e., the synsets' glosses/definitions), and T1, T2, …, T11 represent terms, with their string values (i.e., literal words/expressions) shown alongside the nodes.
a. Sample graph representing a KB extract from WordNet.
Term | Sense IDs[]
“acid” | S1, S3
“clean” | S2
“light” | S2
“lsd” | S3
“lysergic” | S1, S3
… | …
b. Extract of inverted index InvIndex(GKB) connecting terms in GKB with corresponding senses (to speed up term/synset lookup).
Figure 3 - Extract from the knowledge graph of WordNet, with corresponding inverted index.
A sample SemIndex graph is shown in Figure 4, built based on the textual collection Δ from Table 1 and the KB extract in Figure 3. It comprises 3 data nodes (O1 – O3), 3 index sense nodes (S1 – S3), and 11 index term nodes (T1 – T11), along with data and index edges.
a. SemIndex graph before removing edge labels and string values.
b. Final SemIndex graph representation.
Figure 4 SemIndex graph obtained after coupling the data collection and the knowledge base graphs in Figure 3.
2.3 SemIndex Physical Implementation

Fig. 5.a shows the ER conceptual diagram of the SemIndex data graph, whereas Fig. 5.b depicts the data coverage of each relation in the resulting RDB schema. The relations are described separately in the following subsections.
2.3.1. Data Index
The DDL1 statement for creating relation DataIndex is shown below:
CREATE TABLE DataIndex(objectid INT PRIMARY KEY, weight DECIMAL);
Relation DataIndex stores (in attribute DataIndex.objectid) the identifiers of data objects from the
data collection, represented as data nodes in SemIndex (cf. Fig. 5) along with data node weights
(e.g., an object rank score, stored in attribute DataIndex.weight, which is computed and then
updated during query processing, cf. Section 3).
2.3.2. Lexicon
The DDL statement for creating the Lexicon relation is shown below:
CREATE TABLE Lexicon (nodeid INT PRIMARY KEY, value VARCHAR(255), weight DECIMAL); -- MySQL requires an explicit VARCHAR length; 255 is an assumed bound
Relation Lexicon stores the lexicon of the knowledge base (KB) used to index the data collection (Δ), i.e., the set of all index nodes in SemIndex (cf. Fig. 5). Searchable term nodes are stored in their lemmatized form (in attribute Lexicon.value) along with corresponding node identifiers (e.g., WordNet node identifiers, stored in Lexicon.nodeid), whereas sense nodes have null values (in attribute Lexicon.value). Index node weights are computed and dynamically updated during query processing, and stored in attribute Lexicon.weight.
a. Conceptual ER model describing SemIndex
physical design.
b. Data representation of each relation in the resulting RDB schema.
Figure 5 SemIndex physical design.
1 Data Definition Language.
2.3.3. Posting List
The DDL statement for creating the PostingList relation is shown below:
CREATE TABLE PostingList(nodeid INT, objectid INT, weight DECIMAL, PRIMARY KEY(nodeid, objectid) );
Relation PostingList stores the inverted index of the textual collection (Δ), which comes down to
data edges in SemIndex. PostingList results from joining relations Lexicon with DataIndex, linking
data nodes (PostingList.objectid) with corresponding searchable term nodes
(PostingList.nodeid), each with its corresponding data edge weight (e.g., term frequency score).
PostingList is clustered on attribute nodeid, and for each nodeid, the posting list is sorted on
objectid to optimize search time.
2.3.4. Concepts/Terms Links
The DDL statement for creating the links between terms and concepts, in a Neighbors relation, is shown below:
CREATE TABLE Neighbors(id INT PRIMARY KEY, node1id INT, node2id INT, relationship VARCHAR(255), weight DECIMAL); -- MySQL requires an explicit VARCHAR length; 255 is an assumed bound
Relation Neighbors stores all index edges in SemIndex linking index nodes (stored as pairs of index node identifiers in attributes Neighbors.node1id and Neighbors.node2id), including term-to-term, term-to-sense and sense-to-sense relationships, along with index edge labels (stored in Neighbors.relationship) and corresponding index edge weights (stored in Neighbors.weight). When using WordNet, the relationship label can take one of 28 possible lexical/semantic relationship types (e.g., hypernym, hyponym, meronym, related-to, etc.), as well as the has-sense/has-term labels introduced to connect WordNet term nodes with their sense nodes. An id attribute was added since several edges can exist between two index nodes.
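To illustrate how the Neighbors relation supports navigating from an index node to its semantic neighbors, the following is a minimal JDBC sketch against the schema above (the connection URL, credentials, and node identifier are placeholders, not the prototype's actual configuration):

    import java.sql.*;
    import java.util.*;

    public class NeighborLookup {
        // Returns the identifiers of index nodes directly reachable from nodeId,
        // together with the relationship label and edge weight of each link.
        public static List<String> neighborsOf(Connection conn, int nodeId) throws SQLException {
            String sql = "SELECT node2id, relationship, weight FROM Neighbors WHERE node1id = ?";
            List<String> result = new ArrayList<>();
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setInt(1, nodeId);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        result.add(rs.getInt("node2id") + " (" + rs.getString("relationship")
                                   + ", w=" + rs.getDouble("weight") + ")");
                    }
                }
            }
            return result;
        }

        public static void main(String[] args) throws SQLException {
            // Placeholder connection parameters; adapt to the local MySQL setup.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://localhost/semindex", "user", "password")) {
                neighborsOf(conn, 42).forEach(System.out::println);
            }
        }
    }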
3 SemIndex Weight Functions
3.1 Index Node Weight

The index node weight is computed directly on the index nodes, regardless of each incoming node's respective importance in determining the importance of index node $v_i$. It is computed according to the following formula, where $Fan\text{-}in(v_i)$ denotes the number of incoming links of $v_i$:

$W_{IndexNode}(v_i) = \frac{Fan\text{-}in(v_i)}{\max_{v_j \in V_{index}}(Fan\text{-}in(v_j))} \in [0,1]$    (1)

The rationale here is: a node is more important if it receives more links from other nodes.
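As an illustration, the following is a minimal sketch (in Java, the prototype's implementation language) of the normalization in Eq. (1), assuming fan-in counts per index node have already been collected (the counts below are hypothetical):

    import java.util.Map;

    public class IndexNodeWeight {
        // Computes W_IndexNode for every index node as its fan-in divided by the
        // maximum fan-in over all index nodes (Eq. 1), yielding weights in [0, 1].
        public static Map<Integer, Double> compute(Map<Integer, Integer> fanIn) {
            int max = fanIn.values().stream().mapToInt(Integer::intValue).max().orElse(1);
            final double maxFanIn = Math.max(max, 1); // avoid division by zero
            return fanIn.entrySet().stream().collect(java.util.stream.Collectors.toMap(
                    Map.Entry::getKey, e -> e.getValue() / maxFanIn));
        }

        public static void main(String[] args) {
            // Hypothetical fan-in counts for three index nodes.
            Map<Integer, Integer> fanIn = Map.of(1, 4, 2, 2, 3, 1);
            System.out.println(compute(fanIn)); // e.g. {1=1.0, 2=0.5, 3=0.25} (order may vary)
        }
    }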
3.2 Index Edge Weight

Given an index edge $e_i^j$ incoming from index node $v_i$ and outgoing toward index node $v_j$, we define:

$W_{IndexEdge}(e_i^j) = \frac{1}{Fan\text{-}out_{Label}(v_i)} \in \, ]0,1]$    (2)

where $Fan\text{-}out_{Label}(v_i)$ is the number of outgoing links of $v_i$ carrying the same relationship label. Thus the weight decreases as the number of outgoing links from a given index node increases, taking into account the semantic relation type describing the index link at hand.
3.3 Data Node Weight

Data node weight is defined as follows:

$W_{DataNode}(n_i) = \frac{Fan\text{-}In(n_i)}{\max_{n_j \in G_{SemIndex}}(Fan\text{-}In(n_j))} \in [0,1]$    (3)

where $Fan\text{-}In(n_i)$ designates the number of foreign key/primary key data links (joins) outgoing from the data nodes (tuples) where the foreign keys reside, toward data node (tuple) $n_i$ where the primary key resides. The rationale followed is a PageRank-like score: counting the FK/PK links pointing to a data node gives a rough estimate of how important the data node is. The underlying assumption is that more important data nodes are likely to receive more FK/PK links from other data nodes.
3.4 Data Edge Weight

The data edge weight is a TF-IDF score, where TF underlines the frequency (number of occurrences) of the index node label within a given data node connected via the data edge in question, and IDF underlines the number of data edges connecting the same index node to data nodes, i.e., the fan-out of the index node in question. Given a data edge $e_i^j$ incoming from index node $v_i$ toward data node $n_j$, we define:

$W_{DataEdge}(e_i^j) = TF(v_i.l, n_j) \times IDF(v_i.l)$    (4)

where TF is calculated according to the following function:

$TF(v_i.l, n_j) = \frac{NbOcc(v_i.l, n_j)}{\max_{v_k}(NbOcc(v_k.l, n_j))} \in [0,1]$    (5)

with $NbOcc(v.l, n)$ the number of occurrences of label $v.l$ within data node $n$, and the maximum taken over all index nodes $v_k$ linked to $n_j$. Let $N$ be the total number of data nodes in $G_{SemIndex}$, and $DF(v_i.l, G_{SemIndex})$ the number of data nodes in $G_{SemIndex}$ containing at least one occurrence of $v_i.l$. In order to obtain an IDF in [0, 1], we redefine the classical formula as follows:

$IDF(v_i.l, G_{SemIndex}) = 1 - \frac{DF(v_i.l, G_{SemIndex})}{N} \in [0,1[$    (6)
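A minimal sketch of Eqs. (4)-(6), assuming occurrence counts and document frequencies have already been extracted from the posting lists (all counts below are hypothetical):

    import java.util.Map;

    public class DataEdgeWeight {
        // TF of term label l in a data node: its occurrence count divided by the
        // maximum occurrence count of any indexed label in that data node (Eq. 5).
        static double tf(Map<String, Integer> occInNode, String l) {
            int max = occInNode.values().stream().mapToInt(Integer::intValue).max().orElse(1);
            return occInNode.getOrDefault(l, 0) / (double) max;
        }

        // IDF rescaled to [0, 1[: one minus the fraction of data nodes containing l (Eq. 6).
        static double idf(int df, int totalDataNodes) {
            return 1.0 - df / (double) totalDataNodes;
        }

        // Data edge weight is the product of the two (Eq. 4).
        static double dataEdgeWeight(Map<String, Integer> occInNode, String l,
                                     int df, int totalDataNodes) {
            return tf(occInNode, l) * idf(df, totalDataNodes);
        }

        public static void main(String[] args) {
            // Hypothetical counts: "car" occurs twice in a movie plot whose most frequent
            // indexed label occurs 4 times; "car" appears in 100 of 1000 data nodes.
            Map<String, Integer> occ = Map.of("car", 2, "young", 4);
            System.out.println(dataEdgeWeight(occ, "car", 100, 1000)); // 0.5 * 0.9 = 0.45
        }
    }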
3.5 Relevance Scoring Measure

The scores of data nodes returned as query answers are computed using typical Dijkstra-style shortest-distance computations. Yet, instead of identifying the shortest (smallest) distance, we identify the maximum similarity (similarity being the inverse function of distance):

$Weight(n_i, n_l) = \Big( W_{DataNode}(n_l) \times W_{DataEdge}(e_k^l) \times \tfrac{1}{d(n_k, n_l)} + W_{IndexNode}(n_k) \times W_{IndexEdge}(e_j^k) \times \tfrac{1}{d(n_j, n_l)} + W_{IndexNode}(n_j) \times W_{IndexEdge}(e_i^j) \times \tfrac{1}{d(n_i, n_l)} \Big) \, / \, d(n_i, n_l)$    (7)

where $d(n_i, n_j)$ is the distance, in number of edges, between two nodes. In the following example, $d(n_k, n_l) = 1$, $d(n_j, n_l) = 2$, and $d(n_i, n_l) = 3$.
Figure 6 Sample node linkage in SemIndex graph
The previous example consists of a single-term query (i.e., starting from a single index node $n_i$). When processing a multi-keyword query (starting from multiple index nodes $n_i$, $n_o$, $n_p$, etc.), the data node is assigned a weight vector of the form $\langle weight(n_i, n_l), weight(n_o, n_l), weight(n_p, n_l) \rangle$. To obtain a single weight score for the data node considering all starting index nodes, we can use the average function to combine the individual weights:

$score(n_l) = \underset{n \in \{n_i, n_o, n_p\}}{average}\big(weight(n, n_l)\big)$    (8)
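The sketch below illustrates the Dijkstra-style max-similarity propagation underlying Eqs. (7)-(8) in simplified form: it maximizes a score that decays with edge weights and distance, rather than minimizing a summed distance. The graph encoding and decay scheme are simplifications, not the prototype's exact implementation:

    import java.util.*;

    public class MaxSimilaritySearch {
        record Edge(int target, double weight) {}

        // Dijkstra-style traversal that propagates the MAXIMUM similarity score
        // instead of the minimum distance: each hop multiplies the running score
        // by the edge weight and a 1/d distance decay (a simplification of Eq. 7).
        public static Map<Integer, Double> scores(Map<Integer, List<Edge>> graph, int start) {
            Map<Integer, Double> best = new HashMap<>();
            // Max-heap ordered by current score; entries are {node, score, depth}.
            PriorityQueue<double[]> pq = new PriorityQueue<>((a, b) -> Double.compare(b[1], a[1]));
            pq.add(new double[]{start, 1.0, 0});
            while (!pq.isEmpty()) {
                double[] cur = pq.poll();
                int node = (int) cur[0];
                if (best.containsKey(node)) continue; // already settled with a higher score
                best.put(node, cur[1]);
                for (Edge e : graph.getOrDefault(node, List.of())) {
                    int d = (int) cur[2] + 1;
                    double s = cur[1] * e.weight() / d; // decay by edge weight and distance
                    if (!best.containsKey(e.target())) pq.add(new double[]{e.target(), s, d});
                }
            }
            return best;
        }

        public static void main(String[] args) {
            // Hypothetical tiny graph: index node 1 -> index node 2 -> data node 3.
            Map<Integer, List<Edge>> g = Map.of(
                    1, List.of(new Edge(2, 0.8)),
                    2, List.of(new Edge(3, 0.5)));
            System.out.println(scores(g, 1)); // node 3 scored 1.0 * 0.8/1 * 0.5/2 = 0.2
        }
    }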
4 Querying Algorithms

We considered different querying algorithms in our solution: i) three generic solutions adapted from the literature (legacy inverted index search, query relaxation, and query disambiguation), and ii) three solutions we designed specifically for SemIndex (SemIndex Search, SemIndex Disambiguated, and SemIndex Query As You Type). We describe the different algorithms in the following subsections.
4.1 Generic Solutions

4.1.1 Legacy Inverted Index Search

This is a standard containment keyword-based query that:
1- Retrieves textual entities that contain a set of keywords;
2- Queries the inverted (term, objectIDs[]) list with every term in the query, to identify the object IDs associated with all (or at least one of) the query terms, based on the query type at hand (conjunctive or disjunctive);
3- Assigns a score to every potential query answer (data object) following a predefined relevance ordering scheme, and returns the results ranked by their scores in ascending order.

Legacy inverted index search is supported by many relational DBMSs, and has been extended to support keyword-based queries on semi-structured and unstructured data.
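As an illustration, a minimal sketch of the posting-list merge at the core of step 2, assuming the posting lists have already been loaded per query term (the IDs below are hypothetical):

    import java.util.*;

    public class InvIndexSearch {
        // Conjunctive containment query: object IDs present in EVERY term's posting list.
        static Set<Integer> conjunctive(List<Set<Integer>> postings) {
            Set<Integer> result = new HashSet<>(postings.get(0));
            for (Set<Integer> p : postings.subList(1, postings.size())) result.retainAll(p);
            return result;
        }

        // Disjunctive containment query: object IDs present in AT LEAST ONE posting list.
        static Set<Integer> disjunctive(List<Set<Integer>> postings) {
            Set<Integer> result = new HashSet<>();
            for (Set<Integer> p : postings) result.addAll(p);
            return result;
        }

        public static void main(String[] args) {
            // Hypothetical posting lists for two query terms (cf. Table 1).
            List<Set<Integer>> postings = List.of(Set.of(1, 2), Set.of(1, 3));
            System.out.println(conjunctive(postings));  // [1]
            System.out.println(disjunctive(postings));  // e.g. [1, 2, 3] (order may vary)
        }
    }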
4.1.2 Query Relaxation

Algorithm steps:
1- Perform part-of-speech tagging on the query;
2- Select the most common sense for each token (based on WordNet's usage frequency, computed from the Brown text corpus);
3- For each selected sense si, include in the query the synonymous terms of si's synset, as well as the synonymous terms of all senses included in the direct semantic context of si, i.e., senses related to si via a semantic relationship (e.g., hypernymy, hyponymy, meronymy, etc.); a sketch of this step follows below;
4- Run the resulting query as a traditional keyword containment query on the data collection's inverted index (syntactic processing).
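A minimal sketch of the expansion step (step 3), assuming the synonyms and semantically related terms have already been looked up from WordNet into plain maps (the maps and their contents below are hypothetical):

    import java.util.*;

    public class QueryRelaxation {
        // Expands each query term with its synonyms and with the synonyms of
        // directly related senses, then returns the relaxed term set (step 3).
        static Set<String> relax(List<String> queryTerms,
                                 Map<String, Set<String>> synonyms,
                                 Map<String, Set<String>> related) {
            Set<String> relaxed = new LinkedHashSet<>(queryTerms);
            for (String t : queryTerms) {
                relaxed.addAll(synonyms.getOrDefault(t, Set.of()));
                relaxed.addAll(related.getOrDefault(t, Set.of()));
            }
            return relaxed;
        }

        public static void main(String[] args) {
            // Hypothetical WordNet lookups for the single-term query "car".
            Map<String, Set<String>> syn = Map.of("car", Set.of("auto", "automobile"));
            Map<String, Set<String>> rel = Map.of("car", Set.of("vehicle", "motor"));
            System.out.println(relax(List.of("car"), syn, rel));
            // e.g. [car, auto, automobile, vehicle, motor] (set order may vary)
        }
    }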
4.1.3 Query Disambiguation

Algorithm steps:
1- Perform WSD on the query terms using the Simplified (Adapted) Lesk algorithm;
2- For each of the identified senses, include its synonymous terms in the query;
3- Run the resulting query as a traditional keyword containment query on the data collection's inverted index (syntactic processing).
4.2 SemIndex Querying Algorithms

4.2.1 SemIndex Search

Algorithm steps:
1- Identifies the index (searchable term) nodes mapping to each query term;
2- Identifies, for each of the selected index nodes, the minimum-distance paths within the link distance threshold, using Dijkstra's shortest path algorithm;
3- Identifies the shortest paths which contain data edges linking to data nodes, and adds the resulting data nodes to the list of output data nodes;
4- Merges the resulting data nodes with the list of existing answer data nodes. Each answer node is then assigned a score by adding its distance from every query term index node. The algorithm finally returns the list of answer data nodes ranked by path score in ascending order.
4.2.2 SemIndex Disambiguated

Algorithm steps:
1- Perform WSD on the query terms using the Simplified (Adapted) Lesk algorithm (in the SemIndex prototype, the simplified Lesk algorithm is used; a minimal sketch follows below);
2- Identify in SemIndex the index nodes of the disambiguated senses;
3- Run the resulting query, starting from the identified index nodes, as a keyword containment query on SemIndex.
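A minimal sketch of simplified Lesk disambiguation (step 1), assuming each candidate sense carries its WordNet gloss: the sense whose gloss overlaps most with the remaining query terms wins (the glosses below are hypothetical):

    import java.util.*;

    public class SimplifiedLesk {
        record Sense(int id, String gloss) {}

        // Picks, for one query term, the candidate sense whose gloss shares the
        // most words with the remaining query terms (simplified Lesk).
        static Sense disambiguate(List<Sense> candidates, Set<String> context) {
            Sense best = candidates.get(0);
            int bestOverlap = -1;
            for (Sense s : candidates) {
                Set<String> glossWords =
                        new HashSet<>(Arrays.asList(s.gloss().toLowerCase().split("\\W+")));
                glossWords.retainAll(context); // overlap between gloss and query context
                if (glossWords.size() > bestOverlap) {
                    bestOverlap = glossWords.size();
                    best = s;
                }
            }
            return best;
        }

        public static void main(String[] args) {
            // Hypothetical senses for "car", disambiguated against context {muscle, speed}.
            List<Sense> senses = List.of(
                    new Sense(1, "a motor vehicle with four wheels driven at speed"),
                    new Sense(2, "a wheeled vehicle adapted to the rails of a railroad"));
            System.out.println(disambiguate(senses, Set.of("muscle", "speed")).id()); // 1
        }
    }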
4.2.3 SemIndex Query As You Type

Algorithm steps:
1- Let the user choose the sense of each term in the query according to WordNet;
2- Identify in SemIndex the index nodes of the chosen senses;
3- Run the resulting query, starting from the identified index nodes, as a keyword containment query on SemIndex.
5 Prototype Functionality

5.1 Interface Manipulation and Customization

The interface was upgraded to handle multiple options for customized querying. If two options, Option 1 and Option 2, each expose multiple values (say a1, b1, c1 and a2, b2, c2 respectively), and the user selects a1 and c1 under Option 1 and a2 and b2 under Option 2, then the executed queries are all combinations of these selections (a sketch follows below):

Query 1 with a1, a2
Query 2 with a1, b2
Query 3 with c1, a2
Query 4 with c1, b2

Generated data sheets (Excel files) carry the selected options in their titles in order to differentiate between the results, e.g., NAME with C1 and A2.
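A minimal sketch of how the selected option values can be combined into individual query configurations (a simple Cartesian product over the selections of each option; the option names are the hypothetical ones from the example above):

    import java.util.*;

    public class OptionCombinations {
        // Builds the Cartesian product of the values selected under each option,
        // yielding one configuration per executed query.
        static List<List<String>> combine(List<List<String>> selectedPerOption) {
            List<List<String>> result = new ArrayList<>();
            result.add(new ArrayList<>());
            for (List<String> values : selectedPerOption) {
                List<List<String>> next = new ArrayList<>();
                for (List<String> partial : result)
                    for (String v : values) {
                        List<String> extended = new ArrayList<>(partial);
                        extended.add(v);
                        next.add(extended);
                    }
                result = next;
            }
            return result;
        }

        public static void main(String[] args) {
            // Option 1: {a1, c1} selected; Option 2: {a2, b2} selected.
            System.out.println(combine(List.of(List.of("a1", "c1"), List.of("a2", "b2"))));
            // [[a1, a2], [a1, b2], [c1, a2], [c1, b2]]
        }
    }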
Figure 7 below shows the querying interface, with its multiple options: Query Types, Link Distance (shown only for SemIndex-related query types, hidden otherwise), Query Relaxation Multiplier (shown only if Query Relaxation is chosen), Query Options, and Results. Moreover, the text field where the queries are written now displays, next to each line, the number of the query, indicating that each line is a separate query.
Besides the multi-querying approach, error handling (missing input, wrong option combinations, etc.) was developed, along with hiding and showing certain options depending on the selection, to facilitate and highlight each option: for example, Link Distance is only shown for SemIndex-related queries, and selecting Quality Results under Results makes both the Choose Relevant Data File and Generate Relevant Data File controls appear.

The prototype was also customized for specific search methods, mainly Query As You Type, since it requires user input (as described in Section 4.2.3); Figures 8 and 9 illustrate the process.
Figure 7 - Querying Panel
Figure 8 - Query as you go panel
Figure 9 - Query as you go meaning chooser panel
5.2 Range and KNN Search Operators

In the prototype (Figures 7 and 8), the results of a query are normally written to an Excel sheet. A query might return, say, 185 results, each with its own score.

A KNN operator was developed and implemented to bound the number of returned results: e.g., to obtain the top 25 of the 185 results, KNN is set to 25.

Range, on the other hand, operates on the score, i.e., the combination of weights leading to a specific result: a range of 0.5 returns all results having a score greater than or equal to 0.5 and discards the rest. Range is a powerful option for showing only highly relevant results, e.g., Range 0.8.

Due to the design of the prototype, Range and KNN can be used together, e.g., to obtain the top 25 of the 185 results having a score of at least 0.6. As shown in Figure 7, both KNN and Range are entered as simple numbers at the bottom of the Querying Panel. A minimal sketch of the two operators follows.
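A minimal sketch of the two selection operators applied to an already-scored result list (the scores below are hypothetical):

    import java.util.*;

    public class RangeKnn {
        // Applies the Range operator (score >= threshold) and then the KNN operator
        // (keep the top k results by descending score).
        static List<Map.Entry<Integer, Double>> select(Map<Integer, Double> scored,
                                                       double range, int k) {
            return scored.entrySet().stream()
                    .filter(e -> e.getValue() >= range)                        // Range
                    .sorted(Map.Entry.<Integer, Double>comparingByValue().reversed())
                    .limit(k)                                                  // KNN
                    .toList();
        }

        public static void main(String[] args) {
            Map<Integer, Double> scored = Map.of(1, 0.9, 2, 0.4, 3, 0.7, 4, 0.65);
            System.out.println(select(scored, 0.6, 2)); // [1=0.9, 3=0.7]
        }
    }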
5.3 Relevant Document Generator

The Relevant Document Generator was created to facilitate performing experiments. With this tool, the user can add IDs directly, or search the database using keywords and select multiple IDs at once. The user can then reorder or remove IDs before generating the CSV file used in computations such as precision, recall, f-value, MAP, and Kendall's tau. Figure 10 shows the customized interface of the Relevant Document Generator.
Figure 10 - Relevant Document Generator
6 Experimental Evaluation
6.1 Experimental Metrics
6.1.1 Quality: Precision, Recall and F-Value

Recall and precision are useful measures that have a fixed range and are easy to compare across queries and engines. Precision is a measure of the usefulness of a hitlist; recall is a measure of the completeness of the hitlist.
Recall is defined as:
recall = # relevant hits in hitlist / # relevant documents in the collection
Recall is a measure of how well the engine performs in finding relevant documents. Recall is
100% when every relevant document is retrieved. In theory, it is easy to achieve good recall:
simply return every document in the collection for every query! Therefore, recall by itself is not a
good measure of the quality of a search engine.
Precision is defined as:
precision = # relevant hits in hitlist / # hits in hitlist
Precision is a measure of how well the engine performs in not returning nonrelevant documents.
Precision is 100% when every document returned to the user is relevant to the query. There is no
easy way to achieve 100% precision other than in the trivial case where no document is ever
returned for any query!
Both precision and recall have a fixed range: 0.0 to 1.0 (or 0% to 100%).
A retrieval engine must have a high recall to be admissible in most applications. The key in
building better retrieval engines is to increase precision without sacrificing recall.
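A minimal sketch of the three quality metrics over a returned hitlist and a relevant (golden truth) set (the IDs below are hypothetical):

    import java.util.*;

    public class QualityMetrics {
        // Precision: fraction of returned hits that are relevant.
        // Recall: fraction of all relevant documents that were returned.
        // F-value: harmonic mean of the two.
        static double[] prf(List<Integer> hits, Set<Integer> relevant) {
            long relevantHits = hits.stream().filter(relevant::contains).count();
            double precision = hits.isEmpty() ? 0 : relevantHits / (double) hits.size();
            double recall = relevant.isEmpty() ? 0 : relevantHits / (double) relevant.size();
            double f = (precision + recall == 0) ? 0
                     : 2 * precision * recall / (precision + recall);
            return new double[]{precision, recall, f};
        }

        public static void main(String[] args) {
            List<Integer> hits = List.of(1, 2, 3, 4);  // returned results
            Set<Integer> relevant = Set.of(1, 3, 5);   // golden truth
            System.out.println(Arrays.toString(prf(hits, relevant)));
            // [0.5, 0.666..., 0.571...]
        }
    }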
6.1.2 Ranking: Mean Average Precision
Recall and precision are measures for the entire hitlist. They do not account for the quality of
ranking the hits in the hitlist. Users want the retrieved documents to be ranked according to their
relevance to the query instead of just being returned as a set. The most relevant hits must be in
the top few documents returned for a query. Relevance ranking can be measured by computing
precision at different cut-off points. For example, if the top 10 documents are all relevant to the
query and the next ten are all nonrelevant, we have 100% precision at a cut off of 10 documents
but a 50% precision at a cut off of 20 documents. Relevance ranking in this hitlist is very good
since all relevant documents are above all the nonrelevant ones. Sometimes the term recall at n
is used informally to refer to the actual number of relevant documents up to that point in the
hitlist, i.e., recall at n is the same as (n * precision at n).
Precision at n does not measure recall. A new measure called average precision that combines
precision, relevance ranking, and overall recall is defined as follows:
Let n be the number of hits in the hitlist; let h[i] be the ith hit in the hitlist; let rel[i] be 1 if h[i] is
relevant and 0 otherwise; let R be the total number of relevant documents in the collection for
the query.
Then precision at hit $j$ is: $precision[j] = \frac{\sum_{k=1}^{j} rel[k]}{j}$

and average precision is: $AvgP = \frac{\sum_{j=1}^{n} precision[j] \times rel[j]}{R}$

(mean average precision, MAP, being the mean of AvgP over a set of test queries). That is, average precision is the sum of the precision at each relevant hit in the hitlist, divided by the total number of relevant documents in the collection.
Average precision is an ideal measure of the quality of retrieval engines. To get an average
precision of 1.0, the engine must retrieve all relevant documents (i.e., recall = 1.0) and rank them
perfectly (i.e., precision at R = 1.0). Notice that bad hits do not add anything to the numerator,
but they reduce the precision at all points below and thereby reduce the contribution of each
relevant document in the hitlist that is below the bad one. However, average precision is not a general measure of utility, since it assigns no cost to bad hits that appear below all relevant hits in the hitlist.
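A minimal sketch of average precision over a ranked hitlist (MAP being its mean over a set of test queries; the IDs below are hypothetical):

    import java.util.*;

    public class AveragePrecision {
        // Average precision: sum of precision-at-j over relevant ranks j,
        // divided by the total number of relevant documents R in the collection.
        static double averagePrecision(List<Integer> rankedHits, Set<Integer> relevant) {
            double sum = 0;
            int relevantSeen = 0;
            for (int j = 0; j < rankedHits.size(); j++) {
                if (relevant.contains(rankedHits.get(j))) {
                    relevantSeen++;
                    sum += relevantSeen / (double) (j + 1); // precision at rank j+1
                }
            }
            return relevant.isEmpty() ? 0 : sum / relevant.size();
        }

        public static void main(String[] args) {
            // Relevant docs {1, 2, 3}; hits ranked <1, 9, 2>: AP = (1/1 + 2/3) / 3 = 0.555...
            System.out.println(averagePrecision(List.of(1, 9, 2), Set.of(1, 2, 3)));
        }
    }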
6.1.3 Ordering: using Kendall’s Tau
MAP evaluates the ranking of relevant results w.r.t. non-relevant results. Yet, it does not evaluate the ordering of relevant results w.r.t. a reference ordering. To do so, dedicated statistical correlation measures can be used, namely Kendall's tau coefficient and/or Spearman's rho coefficient. Both behave almost the same way (but are computed differently): the correlation between two variables is high when observations have a similar (or, for a correlation of 1, identical) rank (i.e., the relative position of the observations within the variable: 1st, 2nd, 3rd, etc.) between the two variables, and low when observations have a dissimilar (or, for a correlation of -1, fully opposite) rank between the two variables.
We will be using Kendall's tau. Let (x1, y1), (x2, y2), …, (xn, yn) be a set of observations of the joint random variables X and Y respectively, such that all the values of (xi) and (yi) are unique. Any pair of observations (xi, yi) and (xj, yj), where i < j, are said to be concordant if the ranks for both elements agree: that is, if both xi > xj and yi > yj, or if both xi < xj and yi < yj. They are said to be discordant if xi > xj and yi < yj, or if xi < xj and yi > yj. If xi = xj or yi = yj, the pair is neither concordant nor discordant.

The Kendall τ coefficient is defined as:

$\tau = \frac{(\text{number of concordant pairs}) - (\text{number of discordant pairs})}{n(n-1)/2} \in [-1, 1]$

where the denominator is the total number of pair combinations, so the coefficient must be in the range −1 ≤ τ ≤ 1.
- If the agreement between the two rankings is perfect (i.e., the two rankings are the same), the coefficient has value 1.
- If the disagreement between the two rankings is perfect (i.e., one ranking is the reverse of the other), the coefficient has value −1.
- If X and Y are independent, then we would expect the coefficient to be approximately zero.
Example:

1. Consider the sequence of relevant results <a, b, c, d>, and the sequence of returned results <a, b, c, d>.
Here, $\tau = \frac{6 - 0}{6} = 1$: perfect agreement in ordering, with 6 concordant pairs: (a, a) and (b, b); (a, a) and (c, c); (a, a) and (d, d); (b, b) and (c, c); (b, b) and (d, d); (c, c) and (d, d).

2. Consider the sequence of relevant results <a, b, c, d>, and the sequence of returned results <b, a, c, d>.
Here, $\tau = \frac{5 - 1}{6} = \frac{2}{3}$, with 1 discordant pair: (a, b) and (b, a).

3. Consider the sequence of relevant results <a, b, c, d>, and the sequence of returned results <b, a, d, c>.
Here, $\tau = \frac{4 - 2}{6} = \frac{2}{6} = \frac{1}{3}$, with 2 discordant pairs: (a, b) and (b, a); (d, c) and (c, d).

4. Consider the sequence of relevant results <a, b, c, d>, and the sequence of returned results <d, c, b, a>.
Here, $\tau = \frac{0 - 6}{6} = -1$, with 6 discordant pairs: (a, d) and (b, c); (a, d) and (c, b); (a, d) and (d, a); (b, c) and (c, b); (b, c) and (d, a); (c, b) and (d, a).
In our experimental evaluation, we plan to evaluate Kendall’s tau considering the relevant
results which have been retrieved, while disregarding non-relevant results and relevant ones
which were not retrieved. For example:
5. Consider the sequence of relevant results <a, b, c, d>, and the sequence of returned results <a, b, c>. We compute Kendall's tau considering the relevant results which were retrieved: a, b, and c.
Here, $\tau = \frac{3 - 0}{3} = 1$, with 3 concordant pairs: (a, a) and (b, b); (a, a) and (c, c); (b, b) and (c, c).

6. Consider the sequence of relevant results <a, b, c, d>, and the sequence of returned results <a, x, b, c>. We compute Kendall's tau considering the relevant results which were retrieved: a, b, and c, producing the same result as above: $\tau = \frac{3 - 0}{3} = 1$, with 3 concordant pairs: (a, a) and (b, b); (a, a) and (c, c); (b, b) and (c, c).
In other words, MAP is necessary to evaluate the ranking of retrieved relevant results w.r.t. non-relevant results, and Kendall's tau is necessary to evaluate the ordering (relative ranking) within the retrieved relevant results themselves. We need to compute both.
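A minimal sketch of Kendall's tau over two orderings, restricted (as described above) to the relevant results that were actually retrieved:

    import java.util.*;

    public class KendallTau {
        // Kendall's tau between a reference ordering and a returned ordering:
        // (concordant - discordant) / (n(n-1)/2), computed over the items
        // present in both orderings (i.e., the retrieved relevant results).
        static double tau(List<String> reference, List<String> returned) {
            Map<String, Integer> refRank = new HashMap<>(), retRank = new HashMap<>();
            for (int i = 0; i < reference.size(); i++) refRank.put(reference.get(i), i);
            for (int i = 0; i < returned.size(); i++) retRank.put(returned.get(i), i);
            List<String> common = new ArrayList<>(reference);
            common.retainAll(returned);
            int n = common.size(), concordant = 0, discordant = 0;
            for (int i = 0; i < n; i++)
                for (int j = i + 1; j < n; j++) {
                    int a = Integer.compare(refRank.get(common.get(i)), refRank.get(common.get(j)));
                    int b = Integer.compare(retRank.get(common.get(i)), retRank.get(common.get(j)));
                    if (a == b) concordant++; else discordant++;
                }
            return n < 2 ? 0 : (concordant - discordant) / (n * (n - 1) / 2.0);
        }

        public static void main(String[] args) {
            List<String> relevant = List.of("a", "b", "c", "d");
            System.out.println(tau(relevant, List.of("b", "a", "c", "d"))); // 2/3 (example 2)
            System.out.println(tau(relevant, List.of("d", "c", "b", "a"))); // -1  (example 4)
            System.out.println(tau(relevant, List.of("a", "x", "b", "c"))); // 1   (example 6)
        }
    }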
6.2 Preliminary Results

To validate our approach, we implemented the SemIndex algorithms in Java, using MySQL 5.6 as the RDBMS and WordNet 3.0 as the knowledge base. We used the IMDB movies table2 as an average-scale3 input textual collection, with the movie_id attribute and the (title, plot) attributes concatenated in one column (cf. Table 1), for a total size of around 75 MB spanning more than 7 million rows. WordNet 3.0 had a total size of around 26 MB, including more than 117k synsets (senses). We formulated different kinds of queries, organized in three major categories: i) unrelated queries, ii) expanded queries, and iii) identical-term queries, shown in Table 2.
Table 2 - Test queries
Query group Q1 – Unrelated queries | Query group Q2 – Expanded queries | Query group Q3 – Identical terms
Q1_1: “time” | Q2_1: “car” | Q3_1: “time”
Q1_2: “love”, “date” | Q2_2: “car”, “muscle” | Q3_2: “time”, “time”
Q1_3: “fly”, “power”, “man” | Q2_3: “car”, “muscle”, “classic” | Q3_3: “time”, “time”, “time”
Q1_4: “robot”, “human”, “war”, “world” | Q2_4: “car”, “muscle”, “classic”, “speed” | Q3_4: “time”, “time”, “time”, “time”
Q1_5: “mafia”, “kill”, “mob”, “hit”, “family” | Q2_5: “car”, “muscle”, “classic”, “speed”, “thrills” | Q3_5: “time”, “time”, “time”, “time”, “time”
2 Internet Movie DataBase raw files are available online at http://www.imdb.com/. We used a dedicated data extraction tool (at http://imdbpy.sourceforge.net/) to transform the IMDB files into an RDB.
3 Tests using large-scale TREC data collections and the Yago ontology as a reference KB are underway within a dedicated comparative evaluation study.
The first category consists of queries with varying numbers of selection terms (keywords), e.g., from 1 (single-term query) to 5, where all terms are different and all queries are unrelated (i.e., queries with no common selection terms, cf. sample query group Q1 in Table 2). The second category consists of queries with varying numbers of selection terms, where terms are different yet queries are related, such that each query expands its predecessor by adding an additional selection term (cf. sample query group Q2 in Table 2). The third category consists of queries with varying numbers of identical selection terms (cf. sample query group Q3 in Table 2, made of term “time”). We used queries from this category to double-check and verify the time complexity of the SemIndex querying process, testing the system's performance regardless of the nature of the particular terms used4.
We considered 5 groups of queries (made of 5 queries each) within each category (e.g., Q1 is one
of the 5 groups of queries considered within the category of unrelated queries). Each query was
tested on every one of the 100 combinations of SemIndex generated by combining the different
chunks of the IMDB movies table (10%,20%,30%, …, 100%) with every chunk of WordNet (10%,
20%, 30%, …, 100%), at link distance threshold values varying from L = 1 to 5. All queries were
processed 5 times each, retaining average processing time. Hence, all in all, we ran an overall of:
3 (categories) × 5 (groups) × 5 (queries) × 100 (SemIndex chunks) × 5 (L values) × 5 (runs) = 187,500
query execution tasks. Hereunder, we present and discuss the results obtained with three sample
query groups Q1, Q2, and Q3, corresponding to each category as shown in Table 2, compiled in
Figure 11 and Figure 12 (remaining query groups show similar behavior, cf. technical report in
[36], and thus were omitted here for ease of presentation).
6.2.1 Querying Effectiveness
In addition to evaluating SemIndex's efficiency (processing time), we also evaluated its effectiveness (result quality), i.e., the interestingness of semantic-aware answers from the user's perspective. To do so, we collected the results of our test queries obtained with and without semantic indexing, i.e., querying using the legacy InvIndex (which is equivalent to executing queries as standard containment queries (L = 1) in SemIndex), and performing semantic-aware queries (L = 2-to-5) with SemIndex. Results were mapped against user feedback (user judgments, utilized as golden truth) evaluating the quality of the matches produced by the system, by computing the precision and recall metrics commonly utilized in IR evaluation [29]. Precision (PR) identifies the number of correctly returned results w.r.t. the total number of results (correct and false) returned by the system. Recall (R) underlines the number of correctly returned results w.r.t. the total number of correct results, including those not returned by the system. Since comparing one approach's precision improvement against another's recall improvement is difficult, it is widespread practice to consider the f-value, the harmonic mean of precision and recall, such that high precision and recall, and thus high f-value, characterize good retrieval quality [29]. Ten test subjects (six master's students and four doctoral students, who were not part of the system development team) were involved in the experiment as human judges. Testers were asked to evaluate the quality of the top 1000 results (movie objects returned) per query (since manually evaluating the tens of thousands of obtained results is practically infeasible),
4 When processing queries made of different terms, the terms map to different index nodes in the SemIndex graph, each index node highlighting different connectivity properties and graph neighborhoods, which entail different shortest paths when navigating the graph to perform semantic-aware querying. The span and length of these shortest paths might affect processing time (cf. querying efficiency in Section 6.2.2). Thus, we chose to utilize same-term queries (e.g., query group Q3) to evaluate querying time considering the querying process itself, regardless of the particular properties of the shortest paths and neighborhoods of each different term.
obtained with L = 6 (as an upper bound of the tested link distance thresholds). Tester ratings ({-1, 0, 1}, i.e., {not relevant, neutral, relevant}) were acquired for each query answer. Then, we quantified inter-tester agreement by computing pair-wise correlation scores5 among testers for each of the rated query answers, and subsequently selected the top 500 answers per query having the highest average inter-tester correlation scores6, which we utilized as the experiment's golden truth.

Results for queries of groups Q1 (unrelated queries) and Q2 (expanded queries) are shown in Figure 11 and Figure 12 respectively, whereas overall f-value results are provided in Figure 13. Results highlight several observations.

1) Precision and link distance: One can realize that precision levels computed with both query groups Q1 (unrelated queries, Figure 11.a) and Q2 (expanded queries, Figure 12.a) generally increase with link distance (L), until reaching L = 3 (with Q2) or L = 4 (with Q1), where precision starts to slightly decrease toward L = 5. On one hand, this shows that the number of correct (i.e., user-expected) results increases as more semantically related terms are covered in the querying process (with L > 1). On the other hand, this also shows that over-navigating the SemIndex graph to link terms with semantically related ones located as far as L ≥ 3 hops away might include results which: i) are somehow semantically related to the original query terms, but which ii) are not necessarily interesting for the users. For instance, term “congo” (meaning: black tea grown in China) is linked to term “time” through L = 5 hops in SemIndex (“time” >> “snap” >> “reception” >> “tea” >> “congo”). Yet, results (movie objects) containing term “congo” (e.g., movies about the country Congo, or its continent Africa) were not judged relevant by human testers when applying query “time” (testers were probably expecting movies about the passage of time or time travel instead).

2) Precision and number of query terms: Here, one can realize that precision levels with queries of group Q1 (unrelated queries, Figure 11.a) do not seem to vary largely w.r.t. the number of terms (k) per query, whereas precision levels with queries of group Q2 (expanded queries, Figure 12.a) clearly increase with k. These seemingly different results are due to the human testers' expectations, where testers were required to judge the quality of each query's results given the user's supposed intent. On one hand, given that queries in group Q1 are unrelated, result quality was evaluated separately for each query, based on the query's own keyword terms (e.g., the intent of query Q1_1 is identifying movies that have to do with time, and the intent of Q1_3 is movies that have to do with a flying power man, etc.). On the other hand, given that queries in group Q2 are expanded versions of one another, result quality was evaluated based on the user's intent, which would be naturally expressed with the most expanded (i.e., most expressive) query: Q2_5. One can realize that using fewer query terms here produces lower precision levels, which is due to the system returning more results which are (semantically related to the query terms but) not necessarily related to the user's intent (e.g., query “car” might return movies that have to do with trains or taxi cabs, whereas the user is apparently searching for movies that have to do with muscle cars with speed driving and thrills, cf. Q2_5). In other words, with query group Q2: the fewer the query terms used, the lower the query's expressiveness w.r.t.
5 Using the Pearson Correlation Coefficient (PCC), producing scores in [-1, 1] such that: -1 designates that one tester's ratings are a decreasing function of the other tester's ratings (i.e., answers deemed relevant by one tester are deemed irrelevant by the other, and vice versa), 1 designates that one tester's ratings are an increasing function of the other tester's ratings (i.e., answers are deemed relevant/irrelevant by testers alike), and 0 means that tester ratings are not correlated.
6 Having an average inter-tester PCC score ≥ 0.4.
the user's intent, and thus the larger the number of returned results which are not necessarily related to the user's intent, producing lower precision.
a. Precision results
b. Recall results
Figure 11 - Comparing precision (PR) and recall (R) results obtained using SemIndex versus legacy InvIndex, with query group Q1 (unrelated queries), varying the number of query terms k and link distance L
a. Precision results
b. Recall results
Figure 12 - Comparing precision (PR) and recall (R) results obtained using SemIndex versus legacy InvIndex, with query group Q2 (expanded queries), varying the number of query terms k and link distance L
a. F-value results with query group Q1 b. F-value results with query group Q2
Figure 13 - Comparing f-value levels obtained using SemIndex versus legacy InvIndex, with query group Q1 (unrelated queries) and query group Q2 (expanded queries)
3) Recall and link distance: As for recall, one can realize that the levels obtained with both Q1 and Q2 steadily increase with link distance (L) varying from L = 1 (legacy InvIndex) to 5 (Figure 11.b and Figure 12.b). This maps to observation 1: the number of correct (i.e., user-expected) results returned by the system increases as more semantically related terms are covered in the querying process. In other words, the more correct results are returned by the system, the fewer correct results are missed, and thus the higher the recall levels. Note that returning noisy (incorrect) results along with the correct ones does not affect recall (it rather affects precision, as explained in observation 1).

4) Recall and number of query terms: Recall levels vary in a similar fashion to precision levels when varying link distance (L): increasing with the increase of L, which accounts for more semantic coverage (returning more semantically related results) in the SemIndex graph (Figure 11.b and Figure 12.b). However, recall levels tend to decrease (rather than increase) with k. This is due to the fact that shorter (less expressive) queries (i.e., with smaller k values) naturally return more (semantically related) results than larger queries made of multiple terms (larger k), which necessarily identify fewer results (e.g., fewer movie objects matching the query's terms). Hence, a decrease in the number of returned results (with increasing k values) means the number of correct (and incorrect) results naturally decreases, which leads to a decrease in recall.

5) As for f-value results, levels clearly and significantly increase with the increase of link distance L, whereas they slightly decrease with the increase of the number of query keywords k. This naturally confirms the precision and recall levels obtained above, where the determining factor affecting retrieval quality remains link distance, whereas an increase in the number of keywords k tends to reduce system recall with higher values of k (queries becoming very selective, thus missing some relevant results). Note that f-value levels are consistently and significantly higher than those obtained with the legacy InvIndex, highlighting a substantial improvement of semantic-aware retrieval quality over syntactic retrieval quality.
6.2.2 Querying Efficiency
We ran the same querying tasks through the legacy InvIndex built on top of IMDB, and compared the obtained query time results with those of SemIndex. Figure 14 and Figure 15 plot the results obtained when running queries of group Q1 (unrelated), varying the number of query terms k (Figure 14) and the SemIndex link distance threshold L (Figure 15). Similar results were obtained with queries of groups Q2 (expanded) and Q3 (identical terms) and are omitted here for clarity of presentation (they are provided in [36]).
a. SemIndex query execution time
b. InvIndex versus SemIndex at L = 1 c. InvIndex versus SemIndex at L = 2
d. InvIndex versus SemIndex at L = 3
e. InvIndex versus SemIndex at L = 4
f. InvIndex versus SemIndex at L = 5
Figure 14 - Comparing SemIndex and legacy InvIndex query execution time, considering queries of group Q1 (unrelated queries), varying the number of query terms k
a. SemIndex query execution time
b. InvIndex versus SemIndex at k = 1 c. InvIndex versus SemIndex at k = 2
d. InvIndex versus SemIndex at k = 3
e. InvIndex versus SemIndex at k = 4
f. InvIndex versus SemIndex at k = 5
Figure 15 - Comparing SemIndex and legacy InvIndex query execution time, considering queries of group Q1 (unrelated queries), varying link distance L
[Chart panels: Figure 14 plots time (in seconds) versus the number of query terms k, at link distances L = 1 to 5; Figure 15 plots time (in seconds) versus link distance L, at k = 1 to 5; each panel compares SemIndex against InvIndex.]
On the one hand, results in Figure 14 and Figure 15 show that SemIndex and InvIndex have very close query time levels when link distance is small (L = 1 and L = 2), and that SemIndex time increases as link distance increases, reaching its highest levels at L = 5 (almost 8 times higher than InvIndex time levels, with SemIndex reaching 5 hops into the semantic graph structure to identify more semantically related results). On the other hand, Figure 14 shows that both SemIndex and InvIndex query time levels increase slightly when increasing the number of query keywords k at small link distances (L = 1 and L = 2), and that the pace of increase grows with k at higher link distance thresholds (L = 3 to 5). In other words, the time to navigate the semantic graph, up to the allowed link distance L, remains the foremost determining factor in query execution time. Also, results in Figure 15 show that InvIndex query time is invariant w.r.t. variations in link distance L (since it does not navigate the semantic graph, and thus performs no semantic-aware processing).
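This behavior is consistent with a bounded-depth traversal of the semantic graph: each extra hop allowed by L enlarges the set of index nodes to visit, and hence the work per query. The sketch below illustrates the idea with a plain breadth-first expansion; the graph representation and names are ours, not the prototype's actual data structures:

from collections import deque

def expand_terms(graph, seed_terms, max_hops):
    # graph: dict mapping a term to the terms it is semantically linked to
    # Collect every term reachable from the query terms within max_hops
    # links, i.e., within the link distance threshold L
    visited = set(seed_terms)
    frontier = deque((t, 0) for t in seed_terms)
    while frontier:
        term, hops = frontier.popleft()
        if hops == max_hops:
            continue                      # threshold reached, stop expanding
        for neighbor in graph.get(term, ()):
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append((neighbor, hops + 1))
    return visited    # this set, and thus the work per query, grows with max_hops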
7 Conclusion
7.1 Summary of Contribution
The main goal of our research project was to complete the design and development of SemIndex's query evaluation engine. To this end, we designed and implemented various components and functionalities, including: i) dedicated weight functions, associated with the different elements of SemIndex, allowing semantic query result selection and ranking, coupled with ii) a dedicated relevance scoring measure, required in the query evaluation process in order to retrieve and rank relevant query answers, iii) various alternative query processing algorithms (in addition to the main algorithm), as well as iv) a dedicated GUI allowing users to easily manipulate the prototype system. Preliminary experimental results highlight our solution's querying effectiveness and efficiency.
7.2 Ongoing Work
We are currently completing an extended experimental study to evaluate SemIndex's properties in terms of: i) genericity: supporting different types of textual (structured, semi-structured, and NoSQL) data collections and different semantic knowledge sources (general-purpose sources such as Roget's thesaurus [40], Yago [18], and Google [21], as well as domain-specific ones such as ODP [26] for describing semantic relations between Web pages, FOAF [2] for identifying relations between persons in social networks, and SSG [34] for describing visual and semantic relations between vector graphics); ii) effectiveness: evaluating the interestingness of semantic-aware query answers considering different query answer weighting and ranking (result ordering) schemes, in comparison with IR-based indexing, query expansion, and semantic disambiguation methods; and iii) efficiency: reducing the index's building and query processing costs using multithreading and various index fragmentation and sub-graph mining techniques [8].
References
[1] Agrawal S., Chaudhuri S. and Das G., DBXplorer: enabling keyword search over relational databases. Proceedings of the International Conference on Management of Data (SIGMOD'02), 2002. p. 627.
[2] Aleman-Meza B., Nagarajan M., Ding L., et al., Scalable Semantic Analytics on Social Networks for Addressing the Problem of Conflict of Interest Detection. ACM Transaction on the Web (TWeb), 2008. 2(1):7.
[3] Bednar P., Sarnovsky M. and Demko V., RDF vs. NoSQL databases for the Semantic Web applications. IEEE 12th International Symposium on Applied Machine Intelligence and Informatics (SAMI'14), 2014. pp. 361-364.
[4] Blanco R., Mika P. and Vigna S., Effective and Efficient Entity Search in RDF data. In International Semantic Web Conference (ISWC'11), 2011. pp. 83–97.
[5] Burton-Jones A., Storey V.C., Sugumaran V. and Purao S., A Heuristic-Based Methodology for Semantic Augmentation of User Queries on the Web. In Proceedings of the International Conference on Conceptual Modeling (ER'03), 2003. pp. 476–489.
[6] Chandramouli K., Kliegr T. , Nemrava J., et al., Query Refinement and user Relevance Feedback for contextualized image retrieval. 5th International Conference on Visual Information Engineering (VIE), 2008. pp. 453 - 458.
[7] Chbeir R., Luo Y., Tekli J., et al., SemIndex: Semantic-Aware Inverted Index. 18th East-European Conference on Advanced Databases and Information Systems (ADBIS'14), 2014. pp. 290-307.
[8] Cheng J., Ke Y., Fu A.W.-C., et al., Fast graph query processing with a low-cost index. VLDB Journal, 2011. 20(4):521-539.
[9] Cheng T., Yan X. and Chang K. C., EntityRank: searching entities directly and holistically. Proceedings of the 33rd international conference on Very Large Data Bases (VLDB'07), 2007. pp. 387-398.
[10] Chu E., Baid A., Chen T., et al., A relational approach to incrementally extracting and querying structure in unstructured data. Proceedings of the 33rd international conference on Very Large Data Bases (VLDB'07), 2007. pp. 1045-1056.
[11] Cimiano P.; Handschuh S. and Staab S., Towards the Self-Annotating Web. In Proceedings of the International World Wide Web Conference (WWW'04), 2004. pp. 462-471.
[12] Das S., et al., Making unstructured data SPARQL using semantic indexing in Oracle database. In Proceedings of the IEEE International Conference on Data Engineering (ICDE'12), 2012. pp. 1405–1416.
[13] de Lima E.F. and Pedersen J.O., Phrase Recognition and Expansion for Short, Precision biased Queries based on a Query Log. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999. pp. 145-152, Berkeley, CA.
[14] Ding B., Xu Yu J., Wang S., et al., Finding top-k min-cost connected trees in databases. Proceedings of the International Conference on Data Engineering (ICDE'07), 2007.
[15] Florescu D., Kossmann D. and Manolescu I., Integrating Keyword Search into XML Query Processing. Computer Networks, 2000. 33(1-6):119–135.
[16] Frakes W.B. and R.A. Baeza-Yates, Information retrieval: Data structures and algorithms. Prentice-Hall, 1992.
[17] Guo L., Shao F., Botev C., et al., XRANK: ranked keyword search over XML documents. Proceedings of the International Conference on Management of Data (SIGMOD), 2003. pp. 16-27.
[18] Hoffart J., Suchanek F.M., Berberich K., et al., YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell., 2013. 194: 28-61.
[19] Hristidis V., Koudas N., Papakonstantinou Y., et al., Keyword proximity search in XML trees. IEEE Transactions on Knowledge and Data Engineering (TKDE), 2006. 18(4):525–539.
[20] Hristidis V. and Papakonstantinou Y., DISCOVER: Keyword search in relational databases. Proceedings of the International Conference on Very Large Databases (VLDB), 2002.
[21] Klapaftis I. and Manandhar S., Evaluating Word Sense Induction and Disambiguation Methods. Language Resources and Evaluation, 2013. 47(3):579-605.
[22] Li F. and Jagadish H.V., Constructing an Interactive Natural Language Interface for Relational Databases. Proceedings of the VLDB Endowment, 2014. 8(1):73-84.
[23] Li Y., Yang H. and Jagadish H.V., Term Disambiguation in Natural Language Query for XML. In Proceedings of the International Conference on Flexible Query Answering Systems (FQAS), 2006. LNAI 4027, pp. 133–146.
[24] Liu F., Yu C., Meng W., et al., Effective keyword search in relational databases. Proceedings of the 2006 ACM SIGMOD international conference on Management of data, 2006. pp. 563-574.
[25] Luo Y., Lin X., Wang W., et al., Spark: top-k keyword query in relational databases. Proceedings of the 2007 ACM International Conference on Management of Data (SIGMOD-07), 2007. pp. 115-126.
[26] Maguitman A., Menczer F., Roinestad H., et al., Algorithmic Detection of Semantic Similarity. Proceedings of the International Conference on the World Wide Web (WWW), 2005. pp. 107-116.
[27] Markowetz A., Yang Y. and Papadias D., Keyword search on relational data streams. Proceedings of the International Conference on Management of Data (SIGMOD'07), 2007. pp. 605–616.
[28] Martinenghi D. and Torlone R., Taxonomy-based relaxation of query answering in relational databases. VLDB Journal, 2014. 23(5):747-769.
[29] Salton G. and McGill M.J., Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.
[30] Miller S., Bobrow R., Ingria R., et al., Hidden understanding models of natural language. Proceedings of the 32nd annual meeting on Association for Computational Linguistics, Stroudsburg, PA, USA, 1994. pp. 25–32.
[31] Mishra C. and Koudas N., Interactive Query Refinement. Proceedings of the International Conference on Extending Database Technology (EDBT'09), 2009. pp. 862-873.
[32] Navigli R. and Crisafulli G., Inducing Word Senses to Improve Web Search Result Clustering. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, 2010. pp. 116–126, MIT, USA.
[33] Nihalani N., Silakari S. and Motwani M., Natural Language Interface for Database: A Brief Review. International Journal of Computer Science Issues, 2011. 8(2):600-608.
[34] Salameh K., Tekli J. and Chbeir R., SVG-to-RDF Image Semantization. 7th International SISAP Conference, 2014. pp. 214-228.
[35] Nguyen S.H., Świeboda W. and Jaśkiewicz G., Semantic Evaluation of Search Result Clustering Methods. Intelligent Tools for Building a Scientific Information Platform, Studies in Computational Intelligence, 2013. Vol. 467, pp. 393-414.
[36] Tekli J., Chbeir R., Luo Y., et al., SemIndex: Semantic-Aware Inverted Index - Technical Report. Available at http://sigappfr.acm.org/Projects/SemIndex/, 2015.
[37] Voorhees E.M., Query Expansion Using Lexical-Semantic Relations. In Proceedings of the 17th International ACM/SIGIR Conference on Research and Development in Information Retrieval, 1994. pp. 61-69.
[38] Wen H., Huang G.S. and L. Z., Clustering web search results using semantic information. International Conference on Machine Learning and Cybernetics, 2009. Vol. 3, pp. 1504-1509.
[39] Wu P., Sismanis Y. and Reinwald B., Towards keyword-driven analytical processing. Proceedings of the International Conference on Management of Data (SIGMOD'07), 2007. pp. 617–628.
[40] Yarowsky D., Word-Sense Disambiguation Using Statistical Models of Roget's Categories Trained on Large Corpora. Proceedings of the International Conference on Computational Linguistics (COLING'92), 1992. Vol. 2, pp. 454-460, Nantes.
[41] Zhang C., Naughton J., DeWitt D., et al., On supporting containment queries in relational database management systems. SIGMOD Record, 2001. 30(2):425–436.