Enhancing the Query by Object Approach using Schema Summarization
Techniques
Thesis submitted in partial fulfillment
of the requirements for the degree of
MS by Research
in
Computer Science Engineering
by
Ammar Yasir
200702005
Center of Data Engineering
International Institute of Information Technology
Hyderabad - 500 032, INDIA
July 2015
Copyright © Ammar Yasir, 2015
All Rights Reserved
International Institute of Information Technology
Hyderabad, India
CERTIFICATE
It is certified that the work contained in this thesis, titled “Enhancing the Query-By-Object approach
using Schema Summarization techniques” by Ammar Yasir, has been carried out under my supervision
and is not submitted elsewhere for a degree.
Date Adviser: Prof. P. Krishna Reddy
Dedicated to my parents
Mrs. Shahar Bano, Mr. Ziaul Hasan and sister Sara Hasan for their everlasting love and
support.
Acknowledgments
This dissertation would not have been written without the constant support and encouragement of
many people.
Firstly, I would like to express my deepest gratitude to Professor P. Krishna Reddy for his expert
guidance. He has supported me throughout my thesis with invaluable discussions and feedback. He also
encouraged me to take up challenging problems and gave me freedom to explore my ideas.
I would also like to thank my colleagues in IT for Agriculture Lab and Center for Data Engineering,
especially M Kumara Swamy and R Uday Kiran sir for their critical comments and constructive sug-
gestions. I would also like to thank my labmates Gowtham Srinivas, Somya, Satheesh for their fruitful
discussions.
I am grateful to all my friends for providing constant support and motivation. Ashray, Abhinav, Rohit
Nigam, Rohit Gautam, Romit, Shubhangi, Sankalp, Ankur Goel, Vinay, Shrikant, Rakshit, Siddharth
and Ankit made my stay at IIIT one of the best experiences of my life.
Lastly, I am forever indebted to my mother Mrs. Shahar Bano and my father Mr. Ziaul Hasan for
their patience, understanding and encouragement.
Abstract
Modern day organizations use databases to manage information for their business operations. Since
the introduction of DBMSs in the mid-1960s, database technology has made significant advances in
terms of functionality and performance. As a result, modern day database systems can process a large
number of complex queries on any database. An important area of database research focuses on improv-
ing the usability of databases. Research efforts are ongoing to develop efficient user interfaces to access
information from databases, focusing not only on the design of user-interfaces but more importantly,
improving the process of user interaction and the underlying architecture.
Information Requirement Elicitation (IRE) was proposed in the literature, which recommends a
framework for developing interactive interfaces, allowing users to access database systems without hav-
ing prior knowledge of a query language. An approach called ‘Query-by-Object’ (QBO) has been
proposed in the literature for IRE by exploiting simple calculator-like operations. In QBO, the database
is represented with the help of objects and operators are provided to relate information between objects.
However, the QBO approach was proposed by assuming that the underlying database is simple and con-
tains a few tables, each of small size. Large databases have complex database schemas. Given a large
number of tables in a schema, the number of objects is also large. Locating information of interest and
how it is related to other objects becomes a challenging task for the user. Also, the number of possible
operations between objects increases significantly. In this thesis, we investigate opportunities for a
better organization of options available to the user for interacting with the database without making any
changes to the organization of data at the physical layer. First, we try to determine entities in the schema
that collectively represent a conceptual unit or topic in the database. Similarly, we organize the
instances of an object into a hierarchy based on their attribute values. The organization of
objects into topics allows the user to relate information at a higher level of abstraction and reduces the
number of operational pairs that need to be defined in QBO. We also evaluate the research decisions
through system analysis and usability studies, which were conducted with the help of a fully functional
prototype developed for a real, complex database.
An important process in the proposed approach is discovering topical structures in the database
schema. The problem has gained attention recently in the database community as the problem of
Schema Summarization. Schema summarization for a relational database schema is a challenge that
involves identifying semantically correlated elements in a database schema. Research efforts are being
made to propose schema summarization approaches by exploiting database schema and data stored in
the database. Existing efforts for schema summarization are data-oriented. In scenarios where data is
inconsistent or insufficient, existing approaches suffer. In this thesis, we explore the database documen-
tation as an information source. We aim to utilize the schema and database documentation to provide an
efficient schema summary. We propose a notion of table similarity by exploiting the referential relation-
ship between tables and the similarity of passages describing the corresponding tables in the database
documentation. Using the notion of table similarity, we propose a clustering based approach for schema
summary generation. Experimental results on a benchmark database show that the proposed approach,
although independent of the data stored in the database, is as effective as the data-oriented approaches.
Contents
Chapter Page
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Overview of Existing Efforts for Access Methods in Database Systems . . . . . . . . . 2
1.2 Overview of Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Overview of Proposed Approach for Enhanced Query-by-Object Approach . . 3
1.2.1.1 Overview of Query-by-Object approach . . . . . . . . . . . . . . . 3
1.2.1.2 Issues with Query-by-Object approach . . . . . . . . . . . . . . . . 4
1.2.1.3 Proposed Enhanced Query-by-Object Approach . . . . . . . . . . . 4
1.2.2 Overview of Proposed Schema Summarization Approach . . . . . . . . . . . . 5
1.2.2.1 Overview of Schema Summarization . . . . . . . . . . . . . . . . . 5
1.2.2.2 Proposed Approach for Schema Summarization . . . . . . . . . . . 6
1.3 Thesis Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1 Innovative Query Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.1 Visual Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.2 Text Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.3 Other Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 Schema Summarization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.1 Schema Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.2 Mining Database Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.3 Topical Structures in Databases . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.4 Schema Summarization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3 Enhanced Query-by-Object Approach for Information Requirement Elicitation in Large Databases 16
3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1.1 Query-by-Object Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1.2 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1.3 Discovering Topical Structures in Databases . . . . . . . . . . . . . . . . . . . 19
3.2 Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2.1 Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2.2 Basic Idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.2.1 Organization into topics: . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.2.2 Facilitating Instance Selection: . . . . . . . . . . . . . . . . . . . . 23
3.2.2.3 Defining Operations: . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2.3 QBT protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2.3.1 QBT Developer Protocol . . . . . . . . . . . . . . . . . . . . . . . 25
3.2.3.2 QBT User protocol . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3 System Prototype . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.1 CONFIG-DB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3.2 User Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.1 Experimental Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.2 Performance Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.3 Usability Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.3.1 Experiment 1, Task Analysis: . . . . . . . . . . . . . . . . . . . . . 32
3.4.3.2 Experiment 2, User Survey: . . . . . . . . . . . . . . . . . . . . . 34
3.4.3.3 Limitations and possible improvements for the usability study . . . . 35
3.5 Summary of the chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4 Exploiting Schema and Documentation for Summarizing Relational Databases . . . . . . . . 37
4.1 Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.1.1 Basic Idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1.2 Schema based Table Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1.3 Documentation based Table Similarity . . . . . . . . . . . . . . . . . . . . . . 40
4.1.3.1 Finding Relevant Text from the Documentation: . . . . . . . . . . . 40
4.1.3.2 Similarity of passages: . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.1.4 Table Similarity Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.1.5 Clustering Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.1.5.1 Influential tables and Cluster Centers . . . . . . . . . . . . . . . . . 44
4.1.5.2 Clustering Objective Function . . . . . . . . . . . . . . . . . . . . . 44
4.1.5.3 Clustering Process . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2.2 Evaluation Metric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2.3 Effect of window function (f) on combined table similarity and clustering . . . 46
4.2.4 Effect of document similarity measure (S) on similarity metric and clustering
accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2.5 Effect of contribution factor (α) on table similarity and clustering . . . . . . . 47
4.2.6 Comparison of Clustering Algorithm . . . . . . . . . . . . . . . . . . . . . . 48
4.3 Summary of the chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5 Conclusion and Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
List of Figures
Figure Page
1.1 The TPCE schema without table categories . . . . . . . . . . . . . . . . . . . . . . . 5
3.1 QBO user protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2 The iDisc approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3 Topical Structure for QBT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.4 QBT user protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.5 System Prototype Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.6 CONFIG-DB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.7 Traditional calculator versus System Prototype UI . . . . . . . . . . . . . . . . . . . . 29
3.8 System Prototype UI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.9 Treemap representation of user’s selection (object and granularity) . . . . . . . . . 30
3.10 QBO Approach Prototype . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.11 QBT Approach Prototype (with topic modeling and binning) . . . . . . . . . . . . 32
3.12 Average ratings for questions from questionnaire . . . . . . . . . . . . . . . . . . . . 34
4.1 TPCE Schema . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2 accsim and accclust values on varying window function, f . . . . . . . . . . . . . . . 46
4.3 accsim and accclust values for document similarity functions S . . . . . . . . . . . . . 46
4.4 Accuracy of similarity metric on varying values of α . . . . . . . . . . . . . . . . . . 47
4.5 Accuracy of clustering on varying values of α . . . . . . . . . . . . . . . . . . . . . . 47
4.6 Clustering accuracy for different clustering algorithms . . . . . . . . . . . . . . . . . 48
List of Tables
Table Page
3.1 Operator Matrix for Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2 QBO Developer and User Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.3 Within-Topic Matrix 1 (WT-I) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.4 Between-Topic Matrix (BT) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.5 Time taken and number of attempts for each task . . . . . . . . . . . . . . . . . . . . 32
3.6 Query building time results for QBT . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.7 Query building time results for QBO . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.1 Referential Similarity between tables security, daily market and watch item . . . . . . 40
Chapter 1
Introduction
A database is a well-organized collection of related data. For example, an address book which
stores the names, phone numbers, and addresses of people you know represents a database. A database
management system (DBMS) is a collection of programs that enables users to create and maintain a
database. The DBMS is a general purpose software system that facilitates the process of defining,
constructing, manipulating, and sharing databases among various users and applications.
Since their introduction in the mid-1960s, DBMSs have enjoyed enormous success. An important
feature of a DBMS is that it offers data independence. Application programs utilizing the database
are insulated from the changes in the way data is structured and stored. A DBMS provides a suite
of sophisticated techniques to store and retrieve data efficiently. It also has a potential for enforcing
standards among database users in a large organization, for example, name and formats of data elements,
terminology, and display formats. A DBMS also ensures the security of the database by enforcing access
controls for users, and ensures durability: the recovery of the database in the face of failures, errors
of many kinds, or intentional misuse. Overall, the prime selling feature of the database approach has
been the reduced application development time. A DBMS provides support for important functions that
are common to all applications accessing data in the DBMS, making application development less time
consuming.
With the rapid increase of published information and the abundance of data, users require sophisti-
cated tools to simplify the task of managing data and extracting useful information in a timely fashion.
To deliver such sophisticated systems, database technology has made great strides in the area of data
storage, transaction management, concurrency control and query interfaces. As a result, modern day
DBMSs can efficiently process a large number of complex queries on any database. Although ad-
vances in database technology have concentrated heavily on functionality and performance, ‘usability’
of databases leaves a lot to be desired. The important aspect while discussing the usability of a database
is not just the design of the user interface, but also more importantly the process of interaction and the
underlying architecture.
In this thesis, we have made two contributions, first, we propose enhancements for the Query-by-
Object approach by using schema summarization techniques. Also, we propose an efficient approach
for generating schema summary by utilizing the schema structure and database documentation.
In the remaining part of this chapter, we will first overview existing efforts for providing access
to data in database systems and issues involved in providing efficient data access. Then we give an
overview of our proposed approach in the thesis. Further, we discuss the issue of schema summa-
rization, review the existing approaches for schema summarization and give an overview of proposed
approach for schema summarization. Finally, we mention the major contributions made in the thesis
and organization of the thesis.
1.1 Overview of Existing Efforts for Access Methods in Database Sys-
tems
In this section we discuss some of the common approaches for providing access to data in a database
system:
• Database Query Interfaces: Structured query models like SQL or XQuery are powerful means
of interacting with the database. SQL is a textual language with a simple English-like syntax, and
is widely implemented in most commercial database systems. Alternatively, users can use visual
query systems (VQSs) [1]. VQSs are query systems for databases that use visual representations
to locate information of interest and express related requests. VQSs can be seen as an evolution of
query languages and were aimed to improve the effectiveness of the human-computer interaction.
Query-by-Example [2], for example, allowed users to query a database by creating example tables
in the query interface and has influenced many commercial products like Microsoft Access. Form-
based interfaces are widely regarded as the most user-friendly querying method. A form is a
named collection of objects having the same structure. The structured representation of a query
form is an abstraction of conventional paper forms, therefore users felt at ease with the system.
The system presented in [3, 4] provided visual tools for users to frame queries using forms.
• Keyword Search: Searches are a specialized class of queries [5]. A search consists of keywords
representing the user’s information requirements, and the underlying data is usually a collection
of unstructured documents. A search engine retrieves the documents relevant to the query and
ranks the retrieved documents. The keyword search query mechanism allows users to freely
express their query requirements and coupled with instantaneous response time, makes it easier
to refine queries. Although a mainstay of Information Retrieval (IR) systems, the keyword-search
approach has been extended to the database domain as well [6]. Systems such as BANKS [7] and
DBXplorer [8] provide an IR-style keyword-based search engine over relational data.
• Information Requirement Elicitation: In the m-commerce environment, the ‘Information Require-
ment Elicitation’ (IRE) approach and its conceptual design were proposed by Sun [9]. IRE describes
an interactive communication in which information systems help users to specify their
requirements with adaptive choice prompts. Users initiate IRE sessions by expressing their needs.
In an IRE-enabled system, there is an IRE component, which is triggered upon receiving a user’s
request. The IRE component checks whether the information requirement is specific enough. If
not, the component generates choice prompts for the missing elements by utilizing user inputs,
user context, and user preference. The loop continues until the required request information can
be provided to the user. A prototype of IRE in an imagined m-commerce scenario is demonstrated
in [10].
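The elicitation loop described above can be sketched in a few lines. This is only our reading of the approach; the function names and the example requirement fields are invented for illustration and are not part of Sun's design [9]:

```python
# Sketch of an IRE-style elicitation loop: refine the user's requirement
# with choice prompts until it is specific enough to answer. All names
# here are illustrative, not from the original IRE design.

def elicit(requirement, generate_prompt, ask_user, is_specific):
    """Refine a partial requirement (a dict of known elements) until
    is_specific() judges it complete enough to serve."""
    while not is_specific(requirement):
        prompt = generate_prompt(requirement)    # choice prompt for a missing element
        choice = ask_user(prompt)                # user picks one of the offered options
        requirement = {**requirement, **choice}  # fold the answer into the requirement
    return requirement
```

A hypothetical m-commerce session would then call `elicit` with a prompt generator that inspects which elements (say, a category and a city) are still missing, looping exactly as the IRE component does.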
1.2 Overview of Proposed Approach
Structured query models like SQL/XQuery are very effective for expressing queries.
However, these models require a user to specify a query using a fixed syntax, have prior knowledge
of the database structure and model, and express the query in terms of that particular structure. Novice
users are not skilled at using SQL-like query languages, as such languages have a complex structure. While
VQSs offer a friendlier approach, systems like QBE do not perform well with large schemas. Secondly,
a user needs to be aware of the values in the database to fill the example tables. Another challenge for the
user is grasping the join relationships between data entities to express complex queries. Similarly, form-
based interfaces, although convenient for users, pose a limitation on the number of queries that can
be executed. The keyword-search approach is not entirely effective, as users express queries with complex
semantics and expect precise, complete results. The IRE approach proposed a grand framework whose
potential has not yet been fully realized. Based on the notion of IRE, the Query-by-Object (QBO) approach was
proposed for developing query interfaces. In this thesis, we propose enhancements to the existing QBO
approach, to design user interfaces efficiently for large databases. Another important area in the context
of database usability is to generate the summary for a complex database schema. As part of this thesis,
we also propose techniques to generate efficient schema summary.
1.2.1 Overview of Proposed Approach for Enhanced Query-by-Object Approach
In this section, we present a brief overview of the QBO approach and the challenges involved in
developing user interfaces based on the QBO approach. Later we describe an overview of the proposed
approach for an enhanced query-by-object approach.
1.2.1.1 Overview of Query-by-Object approach
IRE uses a series of steps to elicit information from users where each step adds to the information
about the user’s intent. However, IRE does not permit users to utilize the results of intermediate queries
to progressively build complex queries. Based on the notion of IRE, the Query-by-Object (QBO) in-
terface was proposed by Bhalla et al. in [11, 12] for the m-commerce environment. In this system, users
communicate with a DBMS through a web interface. The user’s intent is captured via objects and path
navigation through an option-based interface. In the end, a query is formulated and executed at the
DBMS server by converting it into its SQL equivalent.
Initially, the user is presented with an object menu. Users perform navigation operations and select
one or two objects at the desired level of granularity. Unlike IRE, the QBO approach supports the closure
property: each step executes and its result can be used in the next step. This allows users to
gather and combine query results, and to search for information in a logical way, whereby
intermediate results are refined or combined to get the intended result. The Query-by-Object approach has
been used to develop user interfaces for mobile devices [13], GIS systems [14] and e-learning systems
[15]. An empirical study was conducted in [14] to evaluate user interaction through the QBO interface.
The study showed that the QBO approach is easy, intuitive and simple to use for common users.
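The closure property can be illustrated with a minimal sketch. The object names and the set-based operators below are our own invention for illustration; the actual operators of the systems cited above differ:

```python
# Sketch of QBO-style closure: every operation on objects yields another
# object, so intermediate results can feed later steps. Object and
# operator names are illustrative only.

class QueryObject:
    """A named set of instances; operations return new QueryObjects."""

    def __init__(self, name, instances):
        self.name = name
        self.instances = set(instances)

    def combine(self, other):
        """An intersection-style (AND) operator between two objects."""
        return QueryObject(f"({self.name} AND {other.name})",
                           self.instances & other.instances)

    def merge(self, other):
        """A union-style (OR) operator between two objects."""
        return QueryObject(f"({self.name} OR {other.name})",
                           self.instances | other.instances)
```

Because `combine` and `merge` return `QueryObject`s, a result such as "budget hotels near the station" can itself be combined with a further object in the next step, which is exactly the progressive query building that IRE alone does not support.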
1.2.1.2 Issues with Query-by-Object approach
Designing user interfaces based on the QBO approach to provide information access for a general-purpose
database is a challenging issue [16]. The QBO approach uses a database to store objects and
operations, where each object corresponds to a relation in the schema. Developing user interfaces based
on the QBO approach becomes challenging when the complexity of the underlying database (schema and
data) increases. A large number of tables in the schema makes it harder for the user to locate his/her in-
formation of interest and see how it is related to other elements in the schema. The issue is compounded
when the object instances are large in number. Hence, there is a need for a better organization
of options available to the user in the QBO interface. Also, with the increase in the number of tables, the
number of pairwise operations between tables increases significantly.
1.2.1.3 Proposed Enhanced Query-by-Object Approach
To address the issues of QBO, first, we exploit the notion of detecting topical structures in databases
to represent the schema at a higher level of abstraction. Identifying topical structures allows tables which
are semantically correlated to be grouped together, which provides a better organization for options pre-
sented to the users. We use an elaborated approach called iDisc [17], which utilizes the database schema
structure and the data stored in the database to generate a clustering of schema entities, representing the
topical structure of the database. We discuss the iDisc approach in detail in Chapter 3. Secondly, in-
stead of defining operations between each pair of tables, we can define operations between topics and
within topics which can reduce the number of pairs for which operators need to be defined. Similarly,
to facilitate easier instance selection, we allow selection of instances based on attribute values of the
table. Later, we organize instances of an attribute into bins, providing a two-level hierarchy for instance
selection. The developer protocol is modified to include the steps required to generate the abstract
levels. Consequently, the user protocol is also modified for the proposed approach. We also discuss the
engineering of a prototype based on the proposed approach.

Figure 1.1 The TPCE schema without table categories
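A back-of-the-envelope count illustrates why grouping tables into topics reduces the developer's effort. The topic sizes below are hypothetical, and the actual within-topic and between-topic matrices are described in Chapter 3; this sketch only counts the operator pairs involved:

```python
# Counting operator pairs under plain QBO versus topic-based QBT.
# QBO needs an operator entry for every unordered pair of tables;
# QBT needs one between-topic matrix over the topics plus one
# within-topic matrix per topic. Topic sizes here are hypothetical.

def pairs(n):
    """Number of unordered pairs among n elements."""
    return n * (n - 1) // 2

def qbo_pairs(n_tables):
    """Operator entries when every pair of tables needs one."""
    return pairs(n_tables)

def qbt_pairs(topic_sizes):
    """Operator entries with one between-topic matrix and one
    within-topic matrix per topic."""
    between = pairs(len(topic_sizes))
    within = sum(pairs(s) for s in topic_sizes)
    return between + within
```

For example, a 33-table schema needs 528 pairwise entries under plain QBO, while a hypothetical grouping into four topics of sizes 9, 12, 5, and 7 needs only 139, which is the kind of reduction the topical organization aims for.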
1.2.2 Overview of Proposed Schema Summarization Approach
In this section, we briefly describe the issue of schema summarization, the approaches proposed in
the literature and the proposed approach for schema summarization.
1.2.2.1 Overview of Schema Summarization
Detecting topical structures of the database schema is an interesting challenge. In the literature, the
term schema summarization has been used interchangeably with detection of topical structures. Modern
enterprise databases consist of hundreds of interlinked tables. While users are accustomed to data being
represented in two-dimensional tables, grasping joins between tables is a challenge for general users.
For example, Figure 1.1 describes the schema diagram of the TPCE benchmark database. The TPCE
database simulates the working of an online brokerage firm. Although moderate in terms of
schema size, the complex relationships in the schema make it difficult for users to familiarize themselves
with the database schema. As the complexity of database schemas increases, the amount of time spent on
understanding the metadata and schema structure becomes significant.
Database normalization is a process of analyzing the given relation schemas based on their functional
dependencies and primary keys to achieve the desirable properties of (1) minimizing redundancy and (2)
minimizing insertion, deletion, and update anomalies. Unsatisfactory relational schemas that do not
meet the normal form tests are decomposed into smaller relation schemas that meet the tests and hence
achieve the desirable properties. However, through the process of normalization what users perceive
as a single independent unit of information is disintegrated into smaller relations. Coupled with odd
naming conventions for tables, this makes it harder for a user to locate his information of interest easily.
Schema summarization has been proposed to assist users in understanding a complex database schema
easily.
A schema summary represents a higher level of abstraction of the database schema. A user is initially
presented with a few important concepts from the database. Subsequently, a user can zoom into sections
of the schema in which he is interested. Generating a schema summary involves identifying semantically
correlated elements in the schema. Existing approaches [18, 19, 20, 21] exploit the schema structure and
data stored in the database to generate schema summary with a clustering based approach. In scenarios
where the data stored in databases is insufficient, the existing approaches suffer. In this thesis, given a
current snapshot of the database (schema), we investigate the database documentation as an additional
source of information and propose an algorithm to generate summary by exploiting the database schema
structure and documentation.
1.2.2.2 Proposed Approach for Schema Summarization
A foreign key relationship between two tables indicates that there exists a semantic relationship
between them. However, referential relationships alone do not provide good results [53]. Hence, we
attempt to supplement this referential similarity between tables with another notion of similarity, such
that tables belonging to one category attain a higher intra-category similarity. The additional similarity
criterion is based on the similarity between the passages of text describing the tables in the database
documentation. The intuition behind this notion of similarity is that the tables belonging to the same
category should share some common terms about the category in the documentation. We combine the
referential similarity and document similarity by using a weighted function and obtain a table similarity
metric over the relational database schema. After pairwise similarity between tables is identified, we
use a Weighted K-Center clustering algorithm to partition tables into k clusters. Experiments conducted
on a benchmark database show that the proposed approach is as effective as the existing data-oriented
approaches.
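As a rough illustration of this pipeline, the weighted combination and a greedy weighted k-center partitioning might be sketched as follows. The exact weighting, the influence weights, and the clustering details are given in Chapter 4; the code below is only a sketch under our own simplifying assumptions, and any table names or numbers used with it are invented:

```python
# Sketch: combine referential and documentation-based similarity with a
# weight alpha, then partition tables with a greedy weighted k-center
# heuristic. This is an illustrative reading of the approach, not the
# exact algorithm of Chapter 4.

def combined_similarity(ref_sim, doc_sim, alpha):
    """Weighted combination of referential and documentation similarity."""
    return alpha * ref_sim + (1 - alpha) * doc_sim

def weighted_k_center(tables, sim, weight, k):
    """Greedy k-center: seed with the most influential table, repeatedly
    add the table with the largest weighted distance (1 - similarity) to
    its nearest center, then assign every table to its most similar
    center."""
    centers = [max(tables, key=weight)]
    while len(centers) < k:
        far = max((t for t in tables if t not in centers),
                  key=lambda t: weight(t) * min(1 - sim(t, c) for c in centers))
        centers.append(far)
    clusters = {c: [] for c in centers}
    for t in tables:
        best = max(centers, key=lambda c: sim(t, c))
        clusters[best].append(t)
    return clusters
```

The choice of a k-center-style objective reflects the goal stated above: each cluster is anchored by an influential table (a cluster center), and every other table joins the center it is most similar to under the combined metric.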
1.3 Thesis Contribution
• Proposed enhancements to the existing QBO approach by detecting topical structures in the
databases.
• Presented an advanced system to query relational databases, based on the enhanced QBO ap-
proach.
• Explored the database documentation as a source of information for generating schema summary
and proposed an algorithm to exploit the database schema and documentation to generate efficient
schema summary.
1.4 Thesis Organization
In the next chapter, we discuss the QBO approach and the issue of designing QBO interface for com-
plex databases. We discuss proposed enhancements over the existing QBO approach and also discuss
results from system and user level evaluation of the system. We also study an advanced system for
querying relational databases, the tool used in usability evaluations of the approach proposed in Chapter
3. In Chapter 4, we present the problem of summarizing a relational database and propose an algorithm
to generate schema summary by utilizing schema and documentation. We also discuss a thorough ex-
perimental evaluation of the proposed approach. Chapter 5 presents the summary of the work discussed
in the thesis, conclusions, and the future work.
Chapter 2
Related Work
One of the earliest works in the field of database usability [22] focused on analyzing the expressive
power of a declarative query language SEQUEL, in comparison to natural language. However, the
importance of usability in database systems was first addressed in [23]. Since then, most of the research
efforts in the context of database usability have been focused on developing innovative query interfaces.
In [24], the author describes the initial enthusiasm and user-induced frustration in building interactive
information systems.
In 2007, Jagadish et al. [25] provided a second wind to the research domain, discussing a set of
five ‘pain points’ on why databases are so difficult to use. The first pain point describes how a complex
schema structure makes it hard for the users to locate their information of interest and construct relevant
queries. The authors propose that an abstraction of the presentation data model is needed to allow users
to structure information in a natural way. As users have different views on the organization of data in a
database, various personalized presentation models are developed for different classes of users. However,
when users are presented with multiple views, they do not understand the underlying difference between
the views and tend to become confused and lose trust in the system. This issue is discussed as the
second pain point in the context of database usability. The third pain point deals with the issue of
users getting an unexpected result, or being unable to pose a query, without any explanation from the
database system. The fourth pain point describes that the existing query interfaces are not modeled as
WYSIWYG (What-You-See-Is-What-You-Get), which is a desired quality in any user interface. The
last pain point discusses that the creation of a database is a challenging task for novice users and is a
reason why a lot of modern day information is not present in databases. The authors later introduced a
presentation data model for direct data manipulation with a schema-later approach.
An important aspect of usability in databases is to provide information access with minimal efforts
to database end-users. In the literature, various visual query systems and textual interfaces have been
proposed to provide efficient data access. We review some of the prominent works in the field of query
systems in Section 2.1. In the context of improving database usability, generating schema summary
for complex database schemas has also received attention of late. We review some of the proposed
techniques for generating efficient schema summary in Section 2.2.
2.1 Innovative Query Interfaces
2.1.1 Visual Interfaces
Using visual representations for query specification is perhaps the most researched field in the context
of database usability. Query-by-Example (QBE) [2] was one of the first graphical query languages
with minimum syntax developed for database systems. QBE and its successor Office-by-Example [26]
were both based on domain relational calculus [27]. In QBE, rather than specifying a query using a
fixed syntax; the query is formulated by filling templates of the relations, displayed visually on the
computer screen. The inputs to the template can be translated into an SQL equivalent and executed on
the database. Using QBE requires no knowledge of syntactic constructs or of the internal structure of the database, as users are presented only with table skeletons. QBE is relationally
complete. With some additional commands, condition boxes and other constructs, users can express
all queries that belong to the class of relational algebra. It has been an influence on developing visual
querying facilities in products like Microsoft Access, IBM Visual XQuery Builder, Borland’s Paradox
and open source tools like query builder for phpMyAdmin.
Query by Templates (QBT) [28] was a generalization of QBE language for databases modeled with
SGML. QBT incorporates the structure of the documents for composing powerful queries by displaying
a template for a representative entry in the database. The template describes the type of data expected
in the database. The user specifies examples of data in the template, and the system retrieves data matched
by the user-specified template, similar to QBE. QBT allows various templates like flat templates, nested
templates and structured templates, unlike QBE where the table skeleton is the only available template.
Query-by-Diagram (QBD*) [29] is a visual query system that allows navigation based on abstractions of the E-R semantic model. QBD* allows users to extract information from the database without
worrying about the logical model of the schema. The process of query formulation in QBD* is as fol-
lows: The query structure is based on the selection of a main concept, which is the first entity selected
by the user. The user then performs navigation on the ER model to select paths starting from the main
concept. The path represents a subquery that selects a subset of instances of the main concept. Set-
based operations like union, intersection and difference are available to combine various subqueries.
The main feature of QBD* is that it provides a graphical mechanism capable of expressing recursive
queries (transitive closure).
Query-by-Icon (QBI) [30] provides an icon-based visual query system capable of querying and ex-
ploring databases. QBI provides an interface with pure iconic specifications, without the use of any
diagrams. A user perceives the underlying database as a set of classes, each having several properties
called generalized attributes (GA). Generalized attributes encapsulate and hide from the user the details
of specifying a query. To construct a query, users select compatible classes via their corresponding
icons. When users select a class, its GAs are used to define the selection condition. Similarly, the user
also selects GAs, which will be part of the output result. Query results are saved to be explored further
in the construction of complex queries. A comparison study of QBD* and QBI [1] suggested that expert
users perform better using the QBD* system while QBI performed slightly better for non-expert users.
VISIONARY [31] is a visual query language, based on diagrammatic paradigm like QBD*. In
VISIONARY, a vision represents the external data model that uses a combination of icons and text to
provide visual primitives of concepts and associations, each represented by a name and a multiplicity.
Users formulate queries by choosing a primary concept, the selection predicates and the attributes to be
retrieved in output. If the interpretation given to a query is not the one the user had in mind, the user can
force a different interpretation by disabling some associations and enabling others. The internal data
model is relational, using SQL as the query language. An intermediate graph-based model provides
mapping between the visual and the relational models.
Kaleidoquery [32] is a powerful visual query language for object databases, supporting the capabil-
ities of the OQL language. Kaleidoquery uses a visual filter flow model, where filters are used to filter
out information of interest for users. The class instances are considered as an information flow, and information is filtered using constraints on the class attributes. The output of the query is then examined, or
it flows into other queries to be further refined. Kaleidoquery separates the tasks of writing the query
constraints and organizing the structure and ordering of the results, providing a more dynamic evolution
of queries than OQL.
Liu and Jagadish [33] designed a spreadsheet algebra for the relational database that continuously
presents data to users in a WYSIWYG (What You See Is What You Get) manner. By dividing query
specification into progressive refinement steps, users can extend intermediate results to construct com-
plex queries. The data manipulation actions are reversible, and users can modify an operation specified
earlier without redoing the later operations. Users can also specify at least all single-block SQL queries
without being exposed to complex database concepts. Non-technical users benefit from the direct ma-
nipulation interface as it allows easier and more accurate specification of queries.
VISQUE [34] describes a visual interaction language by exploiting End-User Development tech-
niques, web-based user interface design and data models. VISQUE uses knowledge visualization tech-
niques like a tree-based metaphor to represent a multidimensional database schema and also allows
the construction of complex SQL-like queries, including set-based, nested and aggregation queries.
Due to the popularity of touch-based and motion-tracking devices, research efforts have been made
to design user interfaces that allow gesture-based querying over relational databases [35]. The database
query interface allows users to manipulate results directly by interacting with them in a sequence of ges-
tures. Corresponding to each table, a view is created in the workspace that can be directly manipulated.
Each gesture denotes a single manipulation action and impacts only the view. Users need to learn only a few gestures, each corresponding to an action. Users can undo each action to return to the previous workspace
state. Each action corresponds to the execution of a specific SQL query. Actions are stackable and can
be performed in sequence, manipulating tables in the workspace till the desired result is achieved.
Application developers designing query interfaces for a specific purpose prefer to use form-based
interfaces [4, 3]. In form-based interfaces, the user is presented with a list of searchable fields, each with
an entry area that can be used to indicate the search string. To pose a query, the user needs to fill in the
areas of the form relevant to their search. The form-based approach is especially relevant as end-users
are accustomed to manual form-based work.
In [4], the authors study a simple form model that includes hierarchically structured forms with an
event-driven routing. To assist users in the creation of forms for view definitions, an inference com-
ponent was provided to create view definitions consisting of the hierarchical structure and functional
dependencies among form fields. The inference component uses a collection of rules and heuristics
along with a purposeful dialog. The Expert Database Design System [4] assists a designer in the view
integration process. The system provides rules for incrementally integrating the form views and heuris-
tics for mapping the form fields into entity types and relationships. Some other form-based systems for
databases are the GRIDS system [36], which allowed users to pose queries in a semi-IR fashion, and the
Acuity Project [37], which used form generation techniques for data-entry operations such as updating
tables in a relational database.
In [38], the authors tried to automate the construction of query forms. With a limited number of forms, the system can express a wide range of queries, which helps relax the restriction on expressiveness posed by form-based querying mechanisms. Given a set of interesting queries, similar queries are identified and subsequently clustered, so that the queries in each cluster can be posed using a single form.
2.1.2 Text Interfaces
With the explosion of data availability on the web and the ease of access to data through search
engines, we observe databases playing second fiddle in terms of popularity. Search engines, for example,
Google [39], allow users to issue keyword-based queries freely and, coupled with instantaneous response times, provide a satisfactory experience for the user. While there is still room for improvement, the
success story of the web search engines suggests that any data management system is more useful if
users can extract information from the system with minimal efforts.
Keyword searches in databases [6] allow users to query databases using a set of keywords. The
BANKS system [7] integrates keyword querying and interactive browsing of databases. BANKS models
a database as a graph, where tuples correspond to nodes, and foreign-key and other links between tuples
correspond to edges. Answers to a query are modeled as rooted trees connecting tuples that match
individual keywords in the query. Answers are ranked using a notion of proximity coupled with a
notion of prestige of nodes based on inlinks, the latter being inspired by techniques developed for web
search. Another keyword-search system DBXplorer [8], uses a symbol table that is used at search time
to determine efficiently the locations of query keywords in the database. Given a set of keywords, the
symbol table is looked up to identify the relevant database tables, and all potential subsets of tables that might jointly contain rows having all the keywords are enumerated. For each enumerated join
tree, a SQL statement is constructed (and executed) that joins the tables in the tree and selects the rows
that contain all keywords. The system then presents the final rows to the user.
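The symbol-table lookup described above can be sketched as follows; the keyword-to-table mapping and the table names are invented for illustration, and the actual DBXplorer system additionally prunes candidate sets using the schema's join graph before constructing join trees.

```python
# Illustrative sketch of DBXplorer-style keyword lookup (not the
# actual system): a symbol table maps each keyword to the tables that
# contain it, and a candidate set of tables must jointly cover every
# query keyword. The table names and mapping below are invented.
from itertools import combinations

symbol_table = {
    "jack":   {"actor"},
    "matrix": {"film"},
    "1999":   {"film", "film_actor"},
}

def candidate_table_sets(keywords, symbol_table):
    """Enumerate subsets of tables that jointly cover all keywords."""
    tables = set().union(*(symbol_table[k] for k in keywords))
    covers = []
    for r in range(1, len(tables) + 1):
        for subset in combinations(sorted(tables), r):
            if all(symbol_table[k] & set(subset) for k in keywords):
                covers.append(set(subset))
    return covers

# The smallest cover for ["jack", "matrix"] spans two tables, so a
# join tree over `actor` and `film` would be constructed next.
candidates = candidate_table_sets(["jack", "matrix"], symbol_table)
```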
Keyword search has also been extended for XML databases. The aim of such systems is to identify
the smallest element that contains most of the keywords [40] or the smallest element that is meaningful
[41]. In [42], the authors describe ObjectRank which uses a metric of authority transfer on a data graph
to improve result quality for ranking results in keyword searches in the database. Ranking of SQL query
results has been studied in [43] using probabilistic models.
Research efforts have also focused on combining form-based approaches and keyword search. Given
a set of keywords, a system retrieves a set of forms instead of query results [44, 45]. The systems
create inverted SQL queries from the SQL queries in the forms. Unlike traditional keyword search on
databases, the techniques do not require any special purpose indices, and instead make use of standard
text indices supported by most database systems.
Some information systems use a ‘page-and-link’ approach for accessing data resources, for example,
a web directory. A Web Directory is a repository of Web pages that are organized into a topic hierarchy.
Typically, directory users locate the information sought simply by browsing through the topic hierarchy,
identifying the relevant topics and finally examining the pages listed under the relevant topics. Some of
the common web directories include [46, 47]. Users select related links as per their needs; each link helps them narrow down to the information they seek.
2.1.3 Other Works
In [48], the authors proposed a new paradigm for data interaction called guided interaction, which
uses interaction to guide users through the query construction, query execution and result examination processes. The authors mandate that databases should be responsive to the user, and all possible actions
are enumerated so as to allow discovery and exploration. The database can also preemptively deliver
insights to aid in query construction. The proposed paradigm applies to any general database interaction interface, whether that be SQL-writing, form-filling, keyword-typing or any other interface.
The authors suggested how information in the database could be leveraged to guide a user during query
construction by following these core principles.
Query recommendation is a popular feature in modern systems, especially search engines. These
recommendations are built by mining search query logs from existing users [49, 50]. The method
proposed in [49] is based on a query clustering process that identifies semantically similar queries by
exploiting historical preferences of registered users. The method also ranks the semantically correlated
queries. In [50], the authors model a search-engine user’s sequential search behavior, representing it as a query refinement process. This model is combined with a traditional content-based similarity method
to compensate for the high sparsity of real query log data. In [51], the concept of auto-completion
was proposed to rapidly suggest predicates to the user to create conjunctive SELECT-PROJECT-JOIN
queries. In [52], the authors proposed a method to mine SQL query logs and identify potential query
templates.
In CompleteSearch [53], Bast et al. modify the inverted index data structure to provide incrementally
changing search results for document search. TASTIER [54] provides find-as-you-type in relational
databases by partitioning the relation graph. In the information retrieval area, Anick et al. [55] achieve
interactive query refinement by extracting key concepts from the results and presenting them to the user.
Faceted search [56] extends this to present the user with multiple facets of the results, allowing for
mixing of search and browse steps.
2.2 Schema Summarization
2.2.1 Schema Matching
Information integration is an important challenge in data management [57, 58]. Schema matching [59] involves identifying semantic correspondences or mappings among attributes from different databases. In [60, 61], the authors describe schema-oriented approaches for finding correlated schema elements using names, descriptions, relationships and constraints. In [62], the authors proposed an integrated approach, combining linguistic matching with a structure matching process. In [63], a fragment-oriented approach was proposed for matching large schemas to reduce the matching complexity. Identifying mappings
is analogous to finding similarity between schema elements belonging to two different schemas.
2.2.2 Mining Database Structures
Mining database structure has received attention recently [64, 65, 66]. Bellman [64] performs data mining on the database structure, identifying attributes with similar values and discovering join relationships among tables along with their directions and sizes. Such analysis can help in preparing data for data
mining or for identifying foreign keys for schema mapping. In [65], the authors addressed the problem
of mining a data instance for structural clues. The structural clues help in identifying data instances
that may contain errors, missing values, and duplicate records that may ultimately be helpful in data
design. The authors proposed a set of information-theoretic tools that identify structural summaries that
are useful for characterizing the information content of the data.
2.2.3 Topical Structures in Databases
Wu et al. [21] proposed an elaborate approach, iDisc, to discover topical structures in relational databases. The approach first models the database in three representations: graph-based, vector-based and similarity-based. The graph-based representation models the database as a graph where the tables represent
the nodes in the graph, and foreign-key relationships represent the edges. In the vector-based representation, each table is modeled as a document, and hence the database is represented as a collection of documents.
Similarity-based representation computes a similarity matrix by considering the similarity of attributes
between schema elements. iDisc then performs clustering on each of the database representations and
then combines the clusterings using a voting scheme to generate topical structures.
2.2.4 Schema Summarization
The problem of schema summarization was introduced by Jagadish et al. in [18]. The proposed approach for generating a schema summary utilized abstract elements and abstract links. Each abstract element represents a cluster of original schema elements, and each abstract link represents one or more links between the schema elements within those abstract elements. The authors used the notion of summary importance
and summary coverage to generate schema summary representing important schema elements with a
broader coverage.
The approach in [18] was proposed in the context of XML schemas. The assumptions made in [18] do not carry over to relational schemas. Yang et al. [20] proposed an improved algorithm for relational
schema summarization. The authors proposed a new definition for the importance of tables in a rela-
tional database schema based on information theory and statistical models. The authors also described
a novel distance function that quantified the similarity between elements in the schema. Based on the
distance function, a clustering based approach was proposed for generating schema summary.
In [19], the authors apply the technique of community detection in social networks for schema sum-
marization. The approach partitioned the database schema elements into subject groups by using mod-
ularity based community detection. By utilizing the table importance measure proposed in [20], a
hierarchical clustering algorithm was proposed to build a multi-level navigation structure in the schema summary.
The schema summary described foreign-key relationships, subclass relationships and overlap of data
instances.
2.3 Discussion
Although VQSs like QBE and its derivatives are relationally complete and more user-friendly than SQL/XQuery, they still require prior knowledge of the schema structure and, to some extent, a grasp of the join relationships between tables. Query interfaces like form-based interfaces restrict the number of queries a user can construct for data access. In the proposed effort, we aim to provide an easy-to-use interface for novice users, which expands the range of queries a user can execute on the system.
In keyword search systems, although users are content with querying using keywords, they often need to express more complex query semantics. Also, users expect precise and complete answers to their queries, whereas keyword-search-based systems may return many irrelevant results without any explanation. Query recommender systems face a similar scenario. In the proposed approach, we
emphasize providing precise and complete answers, like a structured query language.
Schema matching involves identifying semantic correspondences or mappings among attributes from
different databases whereas the proposed approach identifies semantically correlated elements within
a schema. In [64, 65], the aim was to identify semantic relationships (foreign key) between tables.
The proposed approach aims to identify clusters of strongly correlated schema elements. The existing
schema summarization approaches [18, 20, 19, 21] are data oriented, utilizing schema and data available
in the tables. In contrast, the proposed approach uses schema information and database documentation
to generate schema summary.
Chapter 3
Enhanced Query-by-Object Approach for Information Requirement
Elicitation in Large Databases
Databases are more useful when users can extract information from them with minimal effort. Most database systems provide powerful, structured query models like SQL to query the database.
However, these models require users to specify an unambiguous query explicitly using a fixed syntax
and have a prior knowledge of the database structure, which is unfavorable for novice users. Hence, al-
ternate query interfaces are required for information access that are more suited to the skills of a novice
user yet still provide expressive power like SQL. Research efforts are going on to design efficient query
interfaces that simplify the process of accessing information stored in a database.
Information Requirement Elicitation (IRE) [9] proposes an interactive framework for accessing infor-
mation. IRE proposes that user interfaces should allow users to specify their information requirements
using adaptive choice prompts. In the literature, Query-By-Object (QBO) approach has been proposed
to develop user interfaces for mobile devices [13], GIS systems [14] and e-learning systems [15] based
on IRE framework. The QBO approach provides a web-based interface for building a query using mul-
tiple user-level steps. The main advantage of this approach is the simplicity with which a query can be expressed. The QBO
approach uses a database to store the objects and entities. However, for databases with a large number of tables and rows, the QBO approach does not scale well.
In this chapter, we propose an improved QBO approach, Query-by-Topics (QBT), to design user
interfaces based on IRE framework that works on large relational databases. In the proposed approach,
we represent the objects at a higher level of abstraction by clustering database entities and representing
each cluster as a topic. Similarly, we organize instances of an entity in groups based on values of a user-
selected attribute. The aim of this chapter is not to propose an approach for detecting topical structures
but rather to show how such an approach can be applied in practical scenarios like information systems.
Experiments were conducted at the system and user level on a real dataset using a QBT based prototype
and the results obtained are encouraging.
The rest of the chapter is organized as follows. In Section 3.1, we explain the QBO approach and
discovering topical structures in a database. In Section 3.2, we present the proposed framework. In
Section 3.3, we discuss the prototype development based on the proposed approach. In Section 3.4,
Figure 3.1 QBO user protocol
we present experiments and analysis of the proposed approach. The last section contains summary and
conclusions.
3.1 Background
In this section, we explain the Query-By-Object Approach (QBO) in detail and also describe the
framework for discovering topical structures in databases.
3.1.1 Query-by-Object Approach
The ‘Information Requirement Elicitation’ [9] framework allows users to build their queries in a series
of steps. The result of each step is used to determine the user’s intent. Based on the notion of IRE,
the Query-By-Object (QBO) approach was proposed in [14]. In this approach, the user communicates
with a database through a high-level interface. The initial intent of the user is captured via selection of
objects from an object menu. The user navigates to select the granularity of these objects and operators
to operate between the selected objects. The user’s actions are tracked in a query-bag, visible to the
user at all stages. Finally, an SQL equivalent query is formulated and is executed at DBMS server. In
the IRE framework, intermediate queries cannot be utilized further and hence, there is not much support
for complex queries. In QBO, the user is allowed to gather and combine query results. This is supported by
closure property of the interface. It states that the result of an operation on objects leads to the formation
of another object. Hence, the results of a query can be used to answer an extended query. As the QBO
interface involves multiple user level steps, non-technical users can easily understand and use the system
for retrieving information from the databases. The developer protocol and user protocol (Figure 3.1) for
the QBO approach are as follows:
3.1.2 Example
Consider an example where a developer builds a QBO based system that users will query.
System development based on QBO Developer Protocol: The following steps are taken by the devel-
oper:
           | film     | actor    | film actor
film       | U, I, C  | R        | R
actor      | R        | U, I, C  | R
film actor | R        | R        | U, I, C

Table 3.1 Operator Matrix for Example 1
QBO Developer Protocol:
1. Store objects and entities in an RDBMS.
2. Define operators for each pair of objects.
3. Provide IRE-based object selection, operation selection and support for the closure property.

QBO User Protocol:
1. Select an object.
2. Select granularity of object.
3. Select another object.
4. Select the operator.
5. Display result.
6. If required, extend the query by selecting another object.

Table 3.2 QBO Developer and User Protocols
• Database:
– film - (film id, film name, film rating)
– actor - (actor id, actor name)
– film actor - (film id, actor id, actor rating)
• In this approach, the relations in the entity-relationship (ER) data model are considered as objects.
Next, the developer defines pairwise operations between these objects. Four types of operators were
proposed: UNION (U), INTERSECT (I), COMPLEMENT (C) and RELATE (R). The ‘RELATE’
operator has different connotations depending on the chosen objects it operates on. The pairwise
operations are shown in Table 3.1.
• A web-based interface provides a list of objects, instances and operations user can select from.
The system also allows the user to combine query responses.
Steps taken by the user based on QBO User Protocol: Consider an example query that the user is
interested in: Find all actors who have worked with the actor ‘Jack’. Such a query can be expressed with
QBO as: Find names of films actor ‘Jack’ has worked in, then find names of actors who worked in these
films. User level steps are:
• Select object: actor
• Select granularity: actor-‘Jack’
• Select another object: film
Figure 3.2 The iDisc approach
• Select operator: Relate
• Select another object: actor
• Select operator: Relate
• Display result
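Under the hood, the user-level steps above can be translated into roughly the following SQL; the toy data and the underscore-separated identifiers are assumptions for illustration, not the system's actual translation.

```python
# Rough SQL equivalent of the QBO steps above, executed on a toy copy
# of the example schema (the data and underscore-separated column
# names are invented for illustration).
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE actor(actor_id INTEGER, actor_name TEXT);
CREATE TABLE film(film_id INTEGER, film_name TEXT);
CREATE TABLE film_actor(film_id INTEGER, actor_id INTEGER);
INSERT INTO actor VALUES (1, 'Jack'), (2, 'Mary'), (3, 'Tom');
INSERT INTO film VALUES (10, 'Film A'), (11, 'Film B');
INSERT INTO film_actor VALUES (10, 1), (10, 2), (11, 3);
""")

# Steps 1-4 relate 'Jack' to his films; steps 5-6 relate those films
# back to actors, extending the query via the closure property.
rows = con.execute("""
    SELECT DISTINCT a2.actor_name
    FROM actor a1
    JOIN film_actor fa1 ON fa1.actor_id = a1.actor_id
    JOIN film_actor fa2 ON fa2.film_id = fa1.film_id
    JOIN actor a2 ON a2.actor_id = fa2.actor_id
    WHERE a1.actor_name = 'Jack' AND a2.actor_name <> 'Jack'
""").fetchall()
# rows -> [('Mary',)]
```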
3.1.3 Discovering Topical Structures in Databases
Discovering topical structures in databases allows us to group semantically related tables in a single
group, helping in identifying what users might perceive as a single unit of information in the database.
Consider a database D, consisting of a set of tables T = {T1, T2, ..., Tn}. The topical structure of D describes a partitioning C = {C1, C2, ..., Ck} of the tables in T such that the tables in the same partition have a
semantic relationship and belong to one subject area. In [17], the authors proposed iDisc, a system
which discovers topical structure in a database by clustering tables into quality clusters. Clustering [67]
is the process of grouping a set of data objects into multiple groups or clusters so that objects within a
cluster have high similarity, but are very dissimilar to objects in other clusters.
The iDisc approach is described in Figure 3.2. The input to iDisc is D consisting of a set of tables
T and returns a clustering C of the tables in T . In the iDisc approach, a database is first modeled by
various representations namely vector-based, graph-based and similarity-based.
In the vector-based model, each table is represented as a document in a bag-of-words model, and a
database is hence represented as a set of documents. In the graph-based model, the database is repre-
sented as an undirected graph. The nodes in the graph are the tables in the database (T ). Two tables Ti
and Tj share an edge in the undirected graph if there exists a foreign key relationship between Ti and
Tj . In the similarity-based representation, a database D is represented as an n × n similarity matrix M , where n = |T | and M [i, j] represents the similarity between tables Ti and Tj . The similarity
between two tables is calculated by finding matching attributes based on a greedy matching strategy
[68]. The table similarity is then obtained by averaging the similarities of the matched attributes.
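A simplified sketch of this greedy attribute-matching strategy follows, using a generic string-similarity measure as a stand-in for the exact attribute similarity of [68]:

```python
# Simplified sketch of the similarity-based representation: attribute
# pairs are matched greedily in best-first order of a name-similarity
# score, and table similarity is the average score of the matched
# pairs. The string measure below stands in for the metric of [68].
from difflib import SequenceMatcher

def attr_sim(a, b):
    return SequenceMatcher(None, a, b).ratio()

def table_similarity(attrs_i, attrs_j):
    pairs = sorted(((attr_sim(a, b), a, b)
                    for a in attrs_i for b in attrs_j), reverse=True)
    used_i, used_j, scores = set(), set(), []
    for s, a, b in pairs:  # greedy best-first matching
        if a not in used_i and b not in used_j:
            used_i.add(a)
            used_j.add(b)
            scores.append(s)
    return sum(scores) / len(scores) if scores else 0.0

# Each entry M[i][j] of the similarity matrix would be filled by
# calling table_similarity on the attribute lists of tables Ti, Tj.
```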
In the next phase, clustering algorithms are implemented for each database representation model.
The vector-based and similarity-based models use a hierarchical agglomerative clustering approach. A cluster quality metric is defined to measure cluster quality. For the graph-based repre-
sentation, shortest path betweenness and spectral graph partitioning techniques are used for partitioning
the graph into connected components. Similar to other representations, a cluster quality metric is used
to measure the quality of the connected components. After the clustering process ends, the base-clusterer for each representation selects the clustering with the highest quality score, yielding a preliminary clustering for each representation.
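The select-the-best-clustering step can be sketched with a toy average-linkage agglomerative procedure; the quality metric used here (mean intra-cluster minus mean inter-cluster similarity) is an illustrative assumption, as iDisc defines its own metric.

```python
# Toy agglomerative clustering over a symmetric similarity matrix M,
# keeping the merge level with the highest quality score. The quality
# metric (mean intra-cluster minus mean inter-cluster similarity) is
# an illustrative assumption; iDisc defines its own metric.
def quality(clusters, M):
    intra = [M[i][j] for c in clusters for i in c for j in c if i < j]
    inter = [M[i][j]
             for a in range(len(clusters))
             for b in range(a + 1, len(clusters))
             for i in clusters[a] for j in clusters[b]]
    mean = lambda v: sum(v) / len(v) if v else 0.0
    return mean(intra) - mean(inter)

def cluster_sim(a, b, M):  # average linkage between two clusters
    return sum(M[i][j] for i in a for j in b) / (len(a) * len(b))

def best_clustering(M):
    clusters = [[i] for i in range(len(M))]
    best, best_q = [list(c) for c in clusters], quality(clusters, M)
    while len(clusters) > 1:
        # merge the most similar pair of clusters
        x, y = max(((x, y) for x in range(len(clusters))
                    for y in range(x + 1, len(clusters))),
                   key=lambda p: cluster_sim(clusters[p[0]],
                                             clusters[p[1]], M))
        clusters[x] += clusters.pop(y)
        q = quality(clusters, M)
        if q > best_q:
            best, best_q = [sorted(c) for c in clusters], q
    return best

# Four tables forming two tight pairs (invented similarities):
M = [[1.0, 0.9, 0.1, 0.1],
     [0.9, 1.0, 0.1, 0.1],
     [0.1, 0.1, 1.0, 0.8],
     [0.1, 0.1, 0.8, 1.0]]
groups = best_clustering(M)  # groups tables {0, 1} and {2, 3}
```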
After identifying the preliminary clusterings, iDisc uses a multi-level aggregation approach that combines the results from each base-clusterer through a voting scheme to generate the final clusters. A clusterer-boosting technique is also used in the aggregation, assigning weights to the base-clusterers to produce a more accurate final clustering. Finally, a representative for each cluster is discovered using an importance metric based on the centrality scores of the tables in the graph-based representation. The output of iDisc is a clustering of the tables in the database, where each labeled cluster represents a topic.
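The voting-based aggregation can be sketched as follows. The majority threshold and the union-find merging are our simplifications of iDisc's multi-level scheme, with the per-clusterer weights standing in for the boosting step:

```python
def same(clustering, i, j):
    """True if tables i and j share a cluster in this clustering."""
    return any(i in c and j in c for c in clustering)

def aggregate(n, clusterings, weights):
    """Combine base clusterings by weighted voting on table pairs: two
    tables end up together if the weighted vote that they belong to the
    same cluster exceeds half the total weight."""
    total = sum(weights)
    parent = list(range(n))

    def find(x):                          # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            vote = sum(w for c, w in zip(clusterings, weights) if same(c, i, j))
            if vote > total / 2:          # majority of (boosted) weight
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Three base clusterings of 4 tables; the first clusterer is boosted.
final = aggregate(4,
                  [[[0, 1], [2, 3]], [[0, 1], [2], [3]], [[0], [1], [2, 3]]],
                  weights=[2, 1, 1])
```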
3.2 Proposed Approach
In this section, we first present a case study for eSaguTM , an IT-based personalized agro-advisory
system. From the case study, we highlight our motivation and the problem we aim to solve. Later, we
discuss the proposed approach in detail.
3.2.1 Case Study
The eSagu system aims to improve the productivity of farms by delivering high-quality personalized (farm-specific) agro-expert advice in a timely manner to each farm at the farmer's doorstep, without the farmer asking a question. In eSagu, the agriculture scientist, rather than visiting the crop in person,
delivers the expert advice by getting the crop status in the form of both digital photographs and the
related information. The eSagu system records data about the farmers, farm history, sowing details,
soil details, crop details and information about problems/diseases observed by farmers. Agro-experts
need to analyze the observation data from various perspectives to deliver personalized advice and have
complex query requirements. Also, query requirements tend to change frequently. The agro-experts are
familiar with the data domain but are not technical experts. Hence, there is a need for a higher-level interface and presentation model to access data in the eSagu system. The issue is that a query interface proposed to elicit the information requirements of non-technical users should be easy to use while still allowing users to pose a wide range of queries.
The QBO approach and its merits have been discussed in Section 3.1. To design user interfaces based
on QBO to provide information requirement elicitation for eSagu, we face the following scenarios:
• Implement the eSagu system in a RDBMS, where each table would correspond to an object. The
eSagu database consists of 84 tables.
• Define operations between 84 × 84 object pairs.
• Provide a web-based interface presenting a list of tables (84 tables) and instances (some tables containing more than 10^4 rows).
Use Case: Consider the scenario when a user is trying to query the eSagu database using a web-based
interface designed using the developer’s protocol. The user protocol would include:
• Select an object: a user would have to analyze a list of 84 objects and locate his object of interest.
• Select granularity or instance selection: Even if instance selection is based on attribute values,
attributes can have a large number of distinct values.
• Select operator: A user would have to grasp how each object would relate to other objects.
A complex database may contain a large number of tables in the schema due to conceptual design or
schema normalization. In such cases, it is difficult for the user to locate his information of interest.
A naive solution, to organize objects alphabetically, may not be efficient. For example, in the eSagu
database, there are 35 tables for various crop observations, cotton observation, crossandara observation
and likewise 33 others. If a user wants to browse through all such observation tables, he would need to
know all the crop names. An organized list where crop observation tables are grouped together and then
sorted alphabetically would be more intuitive for the user. Hence, when objects are numerous, there is a need to represent them at a higher level of abstraction. Similarly, better organization is needed when object instances are numerous.
In general we are faced with the following problems for QBO developers and users:
• Large number of tables in the schema makes it harder for the user to locate his information of
interest.
• With a large number of instances in each table, selection of desired instance becomes difficult.
• With a large number of tables, the number of pairwise operations between tables also increases. For n tables in the schema, in the worst case n × n operational pairs exist.
3.2.2 Basic Idea
In the proposed approach, we exploit the notion of detecting topical structures in databases to represent the schema at a higher level of abstraction. Identifying topical structures allows tables that are semantically correlated to be grouped together, which provides a better organization for the options presented to the users. Secondly, instead of defining operations between each pair of tables, we can define operations between topics and within topics; hence, the number of pairs for which operators have to be defined can be reduced significantly. Similarly, to facilitate easier instance selection, we organize the instances of an attribute into bins, providing a two-level hierarchy for instance selection. The developer protocol is modified to include the steps required to generate these abstract levels. Consequently, the user protocol is also modified for the proposed approach.
The proposed approach has the following additional processes to QBO:
• Organizing objects into topical structures.
• Facilitating instance selection.
• Defining operators for the topical structure.
We discuss each of these processes in detail in the following subsections.
3.2.2.1 Organization into topics:
For organizing objects into topical structures, we use the iDisc approach described in Section 3.1.
Given a database containing a set of tables T = (T1, T2, ..Tn) as input, the iDisc framework generates a
clustering C = (C1, C2, ..Ck) of tables in the schema along with representative tables for each cluster
L = (L1, L2, ..Lk). Ci represents the set of tables belonging to the ith cluster, where Li represents the
representative table of the cluster Ci. The name of the representative table is used as the label of the cluster. Each labeled cluster collectively represents a topic in the database.
In QBO approach, the hierarchy of information organization is as follows:
Tables→ Attributes→ Attribute Instances
After generating topical structures of the database, we make the following modification in the hierarchy
of organization:
Topics→ Tables→ Attributes→ Attribute Instances
In other words, we introduce topics and present the database tables belonging to a topic as its granularity. Hence, an object in QBT is a topic that has three levels of granularity (tables, attributes and attribute instances), in contrast to QBO, where an object had only two levels of granularity (attributes and attribute instances). Our approach is also in accordance with the IRE framework. By introducing topics, users
can browse the database contents semantically, providing more intuitive options to the users.
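The modified hierarchy can be pictured as a nested structure that the interface walks level by level; the topics, tables, attributes and values below are invented for illustration:

```python
# Hypothetical fragment of the QBT hierarchy for an eSagu-like database:
# Topics -> Tables -> Attributes -> Attribute Instances.
hierarchy = {
    "Observation": {                       # topic (labeled cluster)
        "cotton_observation": {
            "problem": ["boll rot", "leaf spot"],
        },
    },
    "Farmer": {
        "farmer": {
            "gender": ["male", "female"],
            "age": [20, 21, 22],
        },
    },
}

def options(path=()):
    """Return the options shown to the user at a given depth of the
    hierarchy: topics at depth 0, then tables, attributes, instances."""
    node = hierarchy
    for step in path:
        node = node[step]
    return sorted(node) if isinstance(node, dict) else list(node)
```

Each user selection simply extends `path` by one element, which is exactly the drill-down behaviour of the cascading menus described later.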
3.2.2.2 Facilitating Instance Selection:
For selecting instance(s) of an object, selection based on attribute values comes naturally to the user. Thus, we first ask the user to select an attribute and then select its instances. However, when the number of instances of an attribute is large, we need an efficient organization of the options. Here, two concerns are in conflict: while we want to allow the user to drill down to his requirements in multiple steps, we may end up creating too many steps, which is unfavorable for the user. We therefore create a two-level hierarchy for attribute values such that few steps are required for instance selection while still providing a better organization. In the two-level hierarchy, we group the attribute instances into intervals: the first level represents the intervals and the second level represents the instances themselves.
Considering the values of an attribute as a data distribution, we relate interval creation to determining bins for a histogram of that distribution. Methods for calculating the number of bins (k), or the bin width (h), for a data distribution of n values are as follows:

• Sturges' formula: k = ⌈log2 n + 1⌉

• Square root choice: k = √n

• Scott's choice (based on bin width): h = 3.5σ / n^(1/3), where h represents the bin width

• Freedman-Diaconis' choice: h = 2 × IQR(x) / n^(1/3), where IQR is the interquartile range
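Assuming numeric attribute values, the four rules above can be computed as follows; for the width-based rules, the bin count is obtained by dividing the data range by h:

```python
import math
from statistics import pstdev, quantiles

def bin_counts(data):
    """Number of bins k under the four rules above, converting the
    width-based rules (Scott, Freedman-Diaconis) via range / h."""
    n = len(data)
    spread = max(data) - min(data)
    sturges = math.ceil(math.log2(n) + 1)
    sqrt_choice = math.ceil(math.sqrt(n))
    scott_h = 3.5 * pstdev(data) / n ** (1 / 3)
    q1, _, q3 = quantiles(data, n=4)              # quartiles
    fd_h = 2 * (q3 - q1) / n ** (1 / 3)
    return {
        "sturges": sturges,
        "sqrt": sqrt_choice,
        "scott": math.ceil(spread / scott_h),
        "freedman_diaconis": math.ceil(spread / fd_h),
    }

k = bin_counts(list(range(1, 101)))   # a uniform distribution of 100 values
```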
We would like to point out that the aim of the proposed approach is to make it easier for the user to select instances. For example, if we have a textual attribute representing the names of people in a community, one simple solution is to bin based on the first letter of the name rather than on the distribution. With textual attributes in perspective, we additionally provide a search box that acts as a filter for instance selection. The usability of the search tool becomes even more prominent when the textual attributes contain long texts.
3.2.2.3 Defining Operations:
Next, we need to define the operators used in QBT. Operators enable us to perform complex queries on databases involving one or more objects; the selected objects act as operands to the operators. We define two types of operator matrix:

i Within-Topic Operator Matrix (WT): This matrix represents all possible operations within a topic. It includes operations between a topic's representative table and the other tables belonging to the topic, and between tables of the same topic.
Figure 3.3 Topical Structure for QBT
ii Between-Topics Operator Matrix (BT): This matrix represents the possible operations between the representative tables of the topics. The diagonal elements represent the WT matrices of the topics, and the non-diagonal elements represent operations between two distinct topics.
By defining operational pairs between topics and within topics, we significantly reduce the number of pairs for which operations need to be defined. The reduction depends on the topical structure identified for the database. Figure 3.3 shows an example of the organization of tables into topical structures. A topic is represented by its representative table, and all tables belonging to a topic are called its subordinate tables. The first subscript denotes the topic and the second denotes whether the table is a representative table or a subordinate table of the topic; the tables of each topic are further distinguished as a, b, and so on. Table 3.3 describes the Within-Topic matrix for the first topic (WT-I) and Table 3.4 describes the Between-Topic matrix (BT). The following scenarios arise in the context of Figure 3.3.
        T11     T12a    T12b    T12c
T11     U,I,C   R       R       R
T12a    R       U,I,C   R       R
T12b    R       R       U,I,C   R
T12c    R       R       R       U,I,C

Table 3.3 Within-Topic Matrix 1 (WT-I)

        T11     T21
T11     [WT-I]  R
T21     R       [WT-II]

Table 3.4 Between-Topic Matrix (BT)
Scenario 1. The two selected objects belong to the same topic. There are three possibilities:

• Both tables are representative tables {T11, T11}: As there is only one representative table per topic, this represents operations between the same table. The possible operations are provided in the Within-Topic operator matrix (WT-I[1,1]).

• One table is the representative table and the other is a subordinate table {T11, T12a}: This case represents a RELATE operation between the two tables. The operations are defined in the Within-Topic operator matrix (WT-I[1,2]).

• Both tables are subordinate tables {T12a, T12b}: The two tables relate either directly or through the representative table of the corresponding topic; the operations are performed at this higher level (WT-I[2,3]).

Scenario 2. The two selected objects belong to different topics. Again there are three possibilities:

• Both selected tables are representative tables {T11, T21}: The possible operations are defined in the Between-Topics operator matrix (BT[1,2]).

• One table is a representative table and the other is a subordinate table {T11, T22a}: The tables are related at the higher level via the representative tables of the two topics (BT[1,2]).

• Both tables are subordinate tables {T12a, T22a}: Similar to the above case, the two tables are related through their representative tables. The possible operations are defined in the Between-Topics matrix (BT[1,2]).
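The scenario resolution above reduces to a simple lookup: if both selected tables fall in the same topic, the pair is covered by that topic's within-topic matrix; otherwise the between-topics matrix applies. A sketch, using the topic layout of Figure 3.3 (the subordinate table names are illustrative):

```python
# Topic layout mirroring Figure 3.3: topic -> (representative, subordinates).
topics = {
    "I":  ("T11", ["T12a", "T12b", "T12c"]),
    "II": ("T21", ["T22a", "T22b"]),
}

def topic_of(table):
    """Find which topic a table belongs to."""
    for topic, (rep, subs) in topics.items():
        if table == rep or table in subs:
            return topic
    raise KeyError(table)

def operator_matrix(t1, t2):
    """Return which matrix defines the operations for the pair (t1, t2)."""
    if topic_of(t1) == topic_of(t2):
        return f"WT-{topic_of(t1)}"   # Scenario 1: within-topic matrix
    return "BT"                       # Scenario 2: between-topics matrix
```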
3.2.3 QBT protocols
In this section we describe the QBT developer protocol and QBT user protocol:
3.2.3.1 QBT Developer Protocol
• Store objects and entities in a database (RDBMS).

• Organize the tables in the schema based on the topics of the tables, as described in Section 3.2.2.1.

• Create a framework to organize attribute instances into the two-level hierarchy, as explained in Section 3.2.2.2.

• Define operations within each topic and between topics, as described in Section 3.2.2.3.
• Provide an interface based on QBT to allow object selection and instance selection, and to support the closure property.
3.2.3.2 QBT User protocol
The user protocol for QBT is described in Figure 3.4. The main options in the QBT are as follows.
• Select a topic
• Select granularity (a table, attribute and attribute values)
• Select another topic
• Select an operation
• Display result
• Extend query, if required
Figure 3.4 QBT user protocol
3.3 System Prototype
In this section, we discuss the prototypes developed for the QBO approach and the QBT approach, based on the notion of IRE. As shown in Figure 3.5, the system prototype is based on a client-server architecture. Users interact with the system through a web-based user interface (EQBO client), which allows object selection and operator selection, and also displays query results. The back-end (EQBO server) consists of a system that processes the inputs given by a user and generates an SQL query that is executed on a relational database server (MySQL). The results of the SQL query are presented to the user at every stage of interaction. The user interface was implemented in PHP using open-source jQueryUI tools and visual tools. The options available to the user are refined by means of AJAX calls to the server, using JSON objects for information transfer between client and server.
The developer protocols were followed to define objects and operations. For the QBO prototype, each table in the database corresponds to an object. The attributes of a table are considered as its granularity, based on which instances of the object can be selected. In the QBT prototype, we discovered topical structures in the database, with topics corresponding to objects.

Figure 3.5 System Prototype Architecture
Operators are required when a user wants to relate information from one object with another object. Analogous to a calculator, a query can be expressed as A op B = C, where A is the left operand, B is the right operand, op represents the operator and C represents the result. A and B represent objects defined in the database. Considering objects analogous to the numbers in a calculator, operators can be unary, requiring a single left operand (A) as argument, or binary, requiring both a left operand (A) and a right operand (B). In a calculator, one or two objects of the same type (numbers) operate and result in an object of the same type (a number). In our system, however, objects are of different types, as each object corresponds to a table with different attributes. Consequently, depending on the operator selected, the resulting object can be of type A, type B or type A join B.
Four binary operators are defined for any general-purpose database: ADD (union), MINUS (complement), AND (intersect) and RELATE. For two identical objects, the binary operators ADD, MINUS and AND are defined. For two different objects, the binary operator RELATE provides a natural join between the objects. For each object, unary operators are defined corresponding to each direct join relationship it has with other objects. In addition to the default operators, the database administrator can define domain-specific operators to provide more flexibility to end users.
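A sketch of how the default binary operators could be translated to SQL, assuming each operand is already an SQL selection over the relevant object. The templates, the `USING`-based join, and the column names are illustrative rather than the prototype's actual implementation (note also that standard SQL `EXCEPT`/`INTERSECT` only became available in recent MySQL versions, which may need a different formulation):

```python
# Set-style operators apply when both operands select from the same object.
OPERATORS = {
    "ADD":   "{a} UNION {b}",      # union of the two selections
    "MINUS": "{a} EXCEPT {b}",     # complement: rows of A not in B
    "AND":   "{a} INTERSECT {b}",  # intersection of the two selections
}

def combine(op, a, b, join_key=None):
    """Build the SQL for 'A op B'; RELATE joins two different objects."""
    if op == "RELATE":
        return f"SELECT * FROM ({a}) x JOIN ({b}) y USING ({join_key})"
    return OPERATORS[op].format(a=a, b=b)

q = combine("ADD",
            "SELECT farmer_id FROM farmer WHERE age = 20",
            "SELECT farmer_id FROM farmer WHERE age = 21")
j = combine("RELATE",
            "SELECT * FROM farmer", "SELECT * FROM farm",
            join_key="farmer_id")
```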
3.3.1 CONFIG-DB
Configuration information corresponding to different databases is stored in the CONFIG-DB (Figure 3.6). The CONFIG-DB consists of an objects table which stores the names of the objects identified from a database; it simply maintains an index of objects. Object granularity and attribute values are accessed from the original database to which the object belongs. If the database is to be represented as topics, a topics table is defined similarly. Each topic has a representative object; objects belonging to the same topic have the same topic id, otherwise the topic id is null. In addition, the unary operators table and binary operators table store operator details such as the left operand object, the right operand object, the SQL query for the operator, the resultant object, and the icon location used to visually represent the operator in the user interface. For any database, by default each table is indexed as an object and the default binary and unary operators are defined. However, the CONFIG-DB can be re-populated by the database administrator to enable topic representation or to define more operators.

Figure 3.6 CONFIG-DB
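An illustrative CONFIG-DB layout inferred from the description above, created here in SQLite for self-containment; the prototype uses MySQL, and the actual table and column names may differ:

```python
import sqlite3

# Hypothetical CONFIG-DB schema: an index of objects, optional topics,
# and the unary/binary operator tables described in the text.
ddl = """
CREATE TABLE objects (
    object_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    topic_id  INTEGER);                 -- NULL when topics are not used
CREATE TABLE topics (
    topic_id  INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    representative_object INTEGER REFERENCES objects);
CREATE TABLE binary_operators (
    operator_id   INTEGER PRIMARY KEY,
    left_object   INTEGER REFERENCES objects,
    right_object  INTEGER REFERENCES objects,
    sql_template  TEXT,                 -- SQL query for the operator
    result_object INTEGER REFERENCES objects,
    icon_location TEXT);                -- icon shown in the UI
CREATE TABLE unary_operators (
    operator_id   INTEGER PRIMARY KEY,
    left_object   INTEGER REFERENCES objects,
    sql_template  TEXT,
    result_object INTEGER REFERENCES objects,
    icon_location TEXT);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(ddl)
```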
3.3.2 User Interface
The design of the user interface was motivated by the aim of providing an interface analogous to a traditional calculator. As presented in Figure 3.7, the selection of numbers is analogous to object selection (red), and operator selection for numbers is analogous to operator selection for objects (blue). The display section of a traditional calculator, which shows the result and the numbers or operators selected, is analogous to a query-bag that keeps track of the user's interactions and intermediate results (green).

Figure 3.7 Traditional calculator versus System Prototype UI

Figure 3.8 depicts the user interface, which consists of four sections (each indicated by a numbered arrow). The first section (1) of
Figure 3.8 describes the interaction process of selecting objects, granularity and instances. In general, this section should provide an efficient representation of the database schema and data. It is well established that visual representations of objects can be easily manipulated by the user. However, for large-scale databases that contain a large number of tables, attributes and attribute values, visual representations become complex and are restricted by the screen size. Consider a database with 100 tables, each table consisting of 15 attributes on average and each attribute containing 1000 distinct values on average. Visual representations like treemaps, which provide a compact representation of hierarchical data, suffer badly as the screen becomes densely packed. Similarly, graph representations, which can represent both objects and the relationships between them, also suffer as the network structure becomes dense and confusing to the user. To deal with large-scale databases, we use cascading menus to represent the database hierarchy. The left-most menu represents objects, grouped by topics. On selection of an object, its attributes are presented in the second menu. Subsequently, attribute selection leads to the third menu, consisting of attribute values. Since the attribute values are likely to be numerous, a search box is provided
to users for locating the desired information. The second section (2) of Figure 3.8 represents operator selection. Operators are represented as a grid of buttons, similar to the grid of operators on a traditional calculator. The operator grid is updated based on the object selected. In addition to operators, the grid contains a ‘backspace’ and a ‘calculate’ button, to undo previous selections and to evaluate an expression respectively. An important design choice is the use of icons along with a textual representation of operator functionality. An icon provides a visual representation of the operator’s functionality that can be easily grasped by the end user. For example, a + icon is displayed for an operation that adds more instances of an object to an existing selection. The third section (3) of Figure 3.8 describes the representation of the query-bag, which keeps track of the user’s selections (objects and operators), similar to the display section of a calculator. In general, the user’s selections represent a very small subset of the available options. We can thus use visual tools to represent the user’s selections and
operation results.

Figure 3.8 System Prototype UI

Figure 3.9 Treemap representation of user’s selection (object and granularity)

Object selection is represented through treemaps, as they provide a compact representation of hierarchical data, and operator selection is represented via icons. For example, Figure 3.9 shows the treemap representation of the user selecting a farmer object with granularity for gender as male and age as 20, 21 or 22. Note that the treemaps are not displayed until the user has made a selection on an object. The fourth section (4) of Figure 3.8 represents the results of the SQL query formulated from the user’s interactions. Each selection made by the user updates the query; correspondingly, an SQL query is executed on the database server and its results are presented back to the user in real time. The SQL query results are displayed using the query-by-example (QBE) approach. Real-time presentation of the SQL query results allows users to validate their selections at every stage and reduces the probability of formulating a wrong query.
3.4 Experiments
3.4.1 Experimental Methodology
To analyze the effectiveness of the proposed approach, we conducted system-level experiments and a usability study. The system-level experiments evaluate the reduction in navigation burden and the reduction in the number of operational pairs compared to the QBO approach. The usability study consists of a task analysis and an ease-of-use survey on a real database with real users. For the usability study, we developed two prototypes, one based on the QBO approach and one based on the QBT approach. The interfaces of the two prototypes are almost identical, except that the QBO prototype does not group objects by topics and does not provide bins for instances. First, we perform a task analysis on the QBT and QBO prototypes to check whether the proposed approach is beneficial to the user. To compensate for the limitations of the task analysis (discussed later in Section 3.4.3.1), we then ask the users to explore the database on their own and pose queries from their day-to-day requirements using both prototypes. After the exploration session, they fill out a questionnaire rating the prototypes. This may not be the most rigorous usability evaluation, but it reduces the bias inherent in the task analysis.
3.4.2 Performance Analysis
We measure the effect of using topical structures at the system level by measuring the reduction factor for operational pairs (RFop). The reduction factor compares the number of operation pairs in the QBT approach with that in the QBO approach. If the number of operation pairs in QBT is OPqbt and in QBO is OPqbo, the reduction factor is defined as follows:

RFop = 1 − OPqbt / OPqbo    (3.1)

We illustrate the metric by referring to Figure 3.3, where the total number of tables is 8. When the tables are divided into two topics, the operation pairs are as follows: two 4 × 4 WT matrices (32 pairs) and the two off-diagonal cells of the BT matrix (2 pairs). Hence OPqbt is 34, while OPqbo is 64 (8 × 8), giving a reduction factor of 0.46. For the eSagu database, after identifying topical structures, the operational pairs were calculated for the between-topics matrix (BT) and the within-topic matrices (WT). The observed reduction factor (RFop) was 0.76.
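The computation of RFop for the example of Figure 3.3 can be checked directly:

```python
def reduction_factor(op_qbt, op_qbo):
    """Equation (3.1): RFop = 1 - OPqbt / OPqbo."""
    return 1 - op_qbt / op_qbo

# Figure 3.3 example: 8 tables split into two topics of 4 tables each.
wt_pairs = 2 * 4 * 4          # two 4x4 within-topic matrices
bt_pairs = 2                  # two off-diagonal between-topics cells
op_qbt = wt_pairs + bt_pairs  # 34 pairs under QBT
op_qbo = 8 * 8                # 64 pairs under QBO
rf = reduction_factor(op_qbt, op_qbo)   # 1 - 34/64 = 0.46875, i.e. 0.46
```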
3.4.3 Usability Study
Usability tests were conducted with four real users who had computer experience but were not skilled in SQL or query languages. The users belonged to the age group 20-26 and were agriculture consultants at the IT for Agriculture lab, IIIT Hyderabad. The users were familiar with the database domain, mainly eSagu, and could validate the query results comfortably. Users were briefed about the QBT prototype for 15 minutes, along with a quick demonstration of a sample query. Before the experiments, users were allowed a 5-minute practice session to acquaint themselves with the tool. We performed two experiments: a task analysis and a USE survey [69].

Figure 3.10 QBO Approach Prototype
Figure 3.11 QBT Approach Prototype (with topic modeling and binning)

Task    User1    User2    User3    User4
T1      21 (1)   16 (1)   41 (2)   22 (1)
T2      18 (2)   31 (2)   30 (1)   27 (1)
T3      170 (3)  81 (2)   79 (1)   112 (2)
T4      17 (1)   18 (1)   22 (1)   25 (1)
T5      25 (1)   18 (1)   41 (2)   24 (1)
T6      140 (2)  151 (2)  110 (2)  103 (2)

Table 3.5 Time taken in seconds (attempts taken) for each task
3.4.3.1 Experiment 1, Task Analysis:
After the initial interactive session, the users were given six tasks. The tasks were as follows:
• T1: Find the details of family members for the farmer D.Laxama Reddy.
• T2: Find all the farms owned by the farmer named Polepally Thirumalreddy.
• T3: Find all the observations given to farmers from Malkapur village who grow cotton crops.
• T4: Find the details of livestock belonging to the farmer d.laxama reddy.
• T5: Find all the farmers belonging to the coordinator named k .s narayana.
• T6: Find all the advice given to farmers from Malkapur village.
Task Min Time Max Time Average Std. Deviation Avg. time for query construction
T1 16 41 25 10.98 20
T2 18 31 26.5 5.91 17.66
T3 79 170 110.5 42.44 55.25
Table 3.6 Query building time results for QBT
Task Min Time Max Time Average Std. Deviation Avg. time for query construction
T4 17 25 20.5 3.69 20.5
T5 18 41 27 10.29 21.6
T6 103 151 126 23.13 63
Table 3.7 Query building time results for QBO
Each task involved constructing a query corresponding to the task requirement and retrieving the correct result. Ideally, we would like to evaluate the two prototypes on the same tasks. However, if a user performs the tasks on one prototype and then repeats them on the other, the second prototype would be at an advantage because the user has already gained experience with the tasks. To address this issue, we instead divided the tasks into two groups of three tasks each. The first three tasks (T1, T2 and T3) were performed on the QBT prototype, while the last three tasks (T4, T5 and T6) were performed on the QBO prototype. While different tasks were performed on the two prototypes, we ensured that the tasks were similar in nature and complexity: task T1 is similar to task T4, differing only in the objects involved (family details versus livestock details); T2 and T5 both represent a join operation, differing only in the objects involved; and T3 and T6 both represent a complex join involving three objects.
Table 3.5 shows the time taken by each user to build his query for all six tasks and also the total
number of attempts taken to complete each task. Note that we only account for the time taken by the
user to build the query and not the time taken by the system to execute the query. The average time to
complete all the tasks was 5 minutes and 36 seconds. The longest time to complete the six tasks was
6 minutes 31 seconds while the fastest time was 5 minutes and 13 seconds. The standard deviation to
complete all six tasks was 37 seconds.
Table 3.6 and Table 3.7 show the query building times for the two prototypes. Additionally, for QBT,
the average time to complete the first three tasks (T1, T2 and T3) successfully was about 2 minutes
and 35 seconds. The longest time taken to complete the tasks on the QBT prototype was 3 minutes
and 29 seconds and the fastest time was 2 minutes and 3 seconds. The standard deviation of time to
complete the first three queries was 40 seconds. For QBO, the average time for the last three tasks (T4,
T5 and T6) was 2 minutes and 54 seconds, while the longest time was 3 minutes and 7 seconds and
the fastest time was 2 minutes and 32 seconds. The standard deviation of time to complete the last
three tasks was 16 seconds. The average number of trials required by users to complete all six tasks
was 9.75. The average number of attempts required to complete the first three tasks was 4.75, while the average number of attempts required to complete the last three tasks was 4.25. The maximum number of attempts required by a user for any single task was 3 (for T3). The average time for query construction for the first three tasks was 34 seconds, while for the last three tasks it was 41 seconds.

Figure 3.12 Average ratings for questions from questionnaire
As discussed in the experimental methodology, the tasks were performed first on the QBT prototype and then on the QBO prototype, so the QBO prototype had the advantage that users were already accustomed to performing similar tasks on the QBT prototype. Nevertheless, the average query-construction time for QBT is lower than for QBO, which shows that users are able to locate their information more quickly in QBT than in QBO.
3.4.3.2 Experiment 2, Use Survey:
After the task evaluation, we conducted a survey to determine how the users felt about the prototypes
individually. Users were asked to explore the prototypes and pose various queries from their day-to-day
requirements. After the users had explored the database using the two prototypes they were asked to fill
in a questionnaire based on a USE survey. The questionnaire asked the users to rate both the prototypes
based on the following questions:
• Q1: The tool is easy to use.
• Q2: The tool is sufficient for my information requirements.
• Q3: The tool can be used with minimal efforts.
• Q4: The tool requires minimal training and can be used without written instructions.
• Q5: I can locate my information easily.
• Q6: The tool requires minimal steps to formulate a query.
The users had to respond to each question on a scale ranging from 0 (completely disagree) to 10
(completely agree). Finally, each user was asked to give feedback about their general perception of the prototypes, to obtain additional comments about strengths and weaknesses for improving the tool.
In Figure 3.12, we present the average ratings provided by the users for each of the questions. The
mean rating for the QBT prototype was 6.95 with a standard deviation of 0.24. The mean rating for
the QBO prototype was 6.33 with a standard deviation of 0.30. For Q1, the QBT prototype received
an average rating of 7.25 while QBO prototype received an average rating of 6.5. For Q2, the QBT
prototype received an average rating of 6.75 while QBO received an average rating of 6.25. For Q3, the
QBT prototype received an average rating of 7 while the QBO prototype received an average rating of
6.5. For Q4, both the QBT prototype and the QBO prototype received an average rating of 6.75; the two prototypes differ not so much in user-interface design as in the process of interaction. For Q5, the QBT prototype received an average rating of 7.25, whereas the QBO prototype received an average rating of 6. In the QBT prototype, we introduced topics to organize objects, which helps the user locate objects quickly. For Q6, the QBT prototype received an average rating of 6.75, whereas the QBO prototype received an average rating of 6. The highest ratings for the QBT prototype were received for questions Q1 and Q5, while the lowest were received for Q2, Q4 and Q6 alike. For the QBO prototype, the highest rating was received for Q4, and the lowest ratings were received for questions Q5 and Q6.
From the USE survey, we see that the QBT prototype received its highest ratings for Q1 and Q5, which shows that after exploring the data through the prototype, the users feel that the QBT prototype is easy to use and that they can locate their desired information more quickly than with QBO. On the other hand, the lower ratings for Q2, Q4 and Q6 show that there is still scope for improvement, as users feel they are not able to express all their requirements. After the users were given the freedom to explore both prototypes, the QBT prototype in general received higher ratings than the QBO prototype. Although the difference in ratings is not large, it shows a preference for QBT over QBO.
3.4.3.3 Limitations and possible improvements for the usability study
The users for our usability study were a group of agricultural experts working in the IT for Agriculture
lab, IIIT Hyderabad. They matched our target audience of users who are unfamiliar with database
systems but familiar with the data they want to query. However, the usability study could have been
conducted iteratively with different groups of users rather than with a single group of agricultural experts.
Another possible extension was an expert review of the prototypes.
We used a limited set of six questions in the survey. In [68], the authors describe an array of
questions that could be used for a detailed study of user behavior. Using another popular questionnaire,
such as the System Usability Scale (SUS), was also an alternative. The questionnaire could also have
included questions that directly compare the two prototypes. Measuring the internal consistency or a
reliability score could have been used to validate our questionnaire. We used mean ratings to evaluate
our study, while other measures such as the standard deviation and the correlation between questionnaire
ratings could be studied for a more detailed analysis.
For the task analysis, we made each user perform three similar tasks on the two prototypes. While
completing the first three tasks on one prototype, users gained experience that helped them complete
their tasks on the other prototype. This creates a bias in favor of one of the prototypes. Similarly, we
could not let users complete the same task on both prototypes, which would again have created a bias.
3.5 Summary of the chapter
Accessing a database requires the user to be familiar with query languages. The QBO approach,
based on the IRE framework, provides an interface where a user progressively builds queries in multiple
steps. This approach works well for small databases but does not perform well for a database consisting
of a large number of tables and rows. In this chapter, we proposed Query-by-Topics, which provides
enhancements over the existing QBO approach. We exploit topical structures in large databases to
represent objects at a higher level of abstraction. We also organize instances of an object in a two-level
hierarchy based on a user-selected attribute. The advantages of this approach include reduced
navigational burden for the user and a reduced number of operations at the system level. The QBT prototype
was implemented for a real database, and experiments were conducted at the system level and user level
to demonstrate these advantages.
Chapter 4
Exploiting Schema and Documentation for Summarizing Relational
Databases
According to a recent study, users take more time to express and formulate their query requirements
compared to the time taken for executing the query and displaying the result [70]. With the increase
in complexity of modern day databases, users spend a considerable amount of time in understanding
a given schema in order to locate their information of interest. To address these issues, the notion of
schema summarization was proposed in the literature [25, 18].
Schema summarization involves identifying semantically related schema elements, representing what
users may perceive as a single unit of information in the schema. Identifying abstract representations
of schema entities helps in efficient browsing and better understanding of complex database schema.
Practical applications of schema summarization are as follows:
• Schema Matching [71, 59] is a well-researched problem. Schema matching involves identifying
mappings between attributes from different schemas. After identifying abstract representations of
schema elements, we can reduce the number of mapping identification operations by identifying
mappings at an abstract level rather than the schema level.
• In Query Interfaces, users construct their query by selecting tables from the schema. A quick
schema summary lookup might help the user understand where the desired information is
located and how it is related to other entities in the schema.
The problem of schema summarization has gained attention recently in the database community.
Existing approaches [18, 19, 20] for generating schema summary exploit two main sources of database
information, the database schema and data stored in the database. In another related work, Wu et al.
[21] described an elaborate approach (iDisc) for clustering schema elements into topical structures by
exploiting the schema and the data stored in the database.
In this chapter, we propose an alternative approach for schema summarization by exploiting the
documentation of the database, in addition to its schema. It can be noted that we investigated how
documentation of the database provides the scope for efficient schema summarization. The database
37
Figure 4.1 TPCE Schema
documentation contains domain specific information about the database which can be used as an in-
formation source. For each table, first we identify the corresponding passages in the documentation.
Later, a table similarity metric is defined by exploiting similarity of the passages describing the schema
elements in the documentation and the referential relationships between tables. Using the similarity met-
ric, a greedy weighted k-center clustering algorithm is used for clustering tables and generating schema
summary. The experimental results on the TPCE [72] benchmark database show the effectiveness of
the proposed approach.
The rest of the chapter is organized as follows: In section 4.1, we describe the proposed approach
including the basic idea, table similarity measure and clustering algorithm. In section 4.2, we discuss
the experimental results and analysis. Section 4.3 presents conclusions and future work.
4.1 Proposed Approach
We use the TPCE schema [72] described in Figure 4.1 as the running example in this chapter. The
TPCE schema consists of 33 tables that are grouped into four categories of tables: Customer (blue),
Market (green), Broker (red) and dimension (yellow). This categorization is provided by the TPCE
benchmark and it also serves as the gold standard for evaluation of our experiments.
Existing approaches for clustering database tables are data oriented, utilizing the schema and the data
in the database for generating the schema summary. In scenarios where the data is insufficient, or some
tables do not contain data, we have to look for alternate sources of information. For example, in the TPCE
benchmark database, if no active transactions are considered, the table trade request is empty and hence
cannot be considered for clustering by existing approaches. We therefore investigate alternative sources
of information for a database. Databases are typically accompanied by documentation or a requirements
document. These documents contain domain-specific information about the database that can be
exploited for generating a schema summary. Although one can go through the documentation and infer
the schema summary manually, this is not always feasible: documentation for an enterprise database
is generally large, spanning hundreds of pages. The documentation for TPCE is 286 pages long, and
manually going through it would be a tedious process for the user.
In the proposed approach, we aim to develop an efficient method for schema summary generation,
using only the schema and the documentation.
4.1.1 Basic Idea
A foreign key relationship between two tables indicates a semantic relationship between them.
However, referential relationships alone do not provide good results [20]. Hence, we supplement this
referential similarity between tables with another notion of similarity, such that the tables belonging
to one category attain higher intra-category similarity. This additional similarity criterion is based on
the similarity between the passages of text representing the tables in the database documentation. The
intuition behind this notion of similarity is that tables belonging to the same category should share
some common terms about the category in the documentation. We combine the referential similarity
and the document similarity by means of a weighted function and obtain a table similarity metric over
the relational database schema. After the pairwise similarity between tables is identified, we use a
weighted k-center clustering algorithm to partition the tables into k clusters.
We propose a measure for table similarity. The measure has two components: one based on referential
relationships and the other based on the similarity of corresponding passages in the documentation. We
first explain the components and then present the table similarity measure.
4.1.2 Schema based Table Similarity
In a relational database, foreign keys are used to implement referential constraints between two ta-
bles. The presence of foreign keys thus implies that the two tables have a semantic relationship. Such
constraints are imposed by the database designer or administrator and form the basic ground truth on
the similarity between tables. In our approach, referential similarity between two tables R and S is ex-
pressed as RefSim(R,S).
              Security   Daily Market   Watch Item
Security          -           1             1
Daily Market      1           -             0
Watch Item        1           0             -
Table 4.1 Referential similarity between the tables security, daily market and watch item
RefSim(R,S) = 1 if R and S have a foreign key constraint, and 0 otherwise.
Example 1: Consider the three tables Security, Daily market and Watch item (S, D and W ) in the TPCE
schema. Table security has a foreign key relationship with daily market and watch item, hence
RefSim(S,D) = RefSim(D,S) = 1 and RefSim(S,W ) = RefSim(W,S) = 1. The pairwise
similarity is described in Table 4.1.
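As an illustration, the referential similarity of Example 1 can be computed directly from a list of foreign-key pairs. The following Python sketch is not part of the thesis; the table names follow the running TPCE example and the function name is our own.

```python
# Sketch (not from the thesis): RefSim as a symmetric 0/1 similarity
# computed from a list of foreign-key pairs.
def ref_sim(tables, foreign_keys):
    """Return a dict mapping (R, S) to 1 if R and S are linked by a
    foreign key constraint, and 0 otherwise (RefSim is symmetric)."""
    linked = set()
    for r, s in foreign_keys:
        linked.add((r, s))
        linked.add((s, r))  # RefSim(R,S) = RefSim(S,R)
    return {(r, s): (1 if (r, s) in linked else 0)
            for r in tables for s in tables if r != s}

tables = ["security", "daily_market", "watch_item"]
fks = [("security", "daily_market"), ("security", "watch_item")]
sim = ref_sim(tables, fks)
# Matches Table 4.1: security links to both other tables,
# while daily_market and watch_item are not linked to each other.
```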
4.1.3 Documentation based Table Similarity
In addition to the referential similarity, we also try to infer the similarity between tables using
database documentation as an external source of information. First, we find the passage describing
the table in the documentation using passage retrieval approach. The similarity between two tables thus
corresponds to the similarity between the corresponding passages in the documentation. The passage
from the documentation representing a table Ti is referred to as the table-document of Ti, TD(Ti). The
first task is to identify the table-document for each table from the documentation. Later, we find pairwise
similarity between the table-documents.
4.1.3.1 Finding Relevant Text from the Documentation:
Passage retrieval [73, 74, 75, 76, 77] is a well-researched domain. Passage retrieval algorithms return
the top-m passages that are most likely to answer an input query. We use a sliding-window based
passage retrieval approach similar to the one described in [78]. In this chapter, we focus on
using passage retrieval to evaluate table similarity from the database documentation rather than on
comparing different approaches for passage retrieval.
Consider a table Ti with a set of attributes Ai = (Ai1, Ai2..Aik). Given a database documentation
(D), for each table Ti we construct a query Q(Ti) consisting of the table name and all its attributes as
keywords.
Q(Ti) = <Ti, Ai1, Ai2, ..., Aik>   (4.1)
In a sliding-window based passage retrieval approach, given a window size wi for Ti, we search wi
consecutive sentences of the document sequentially for the keywords in Q(Ti). If at any instance the
window matches all the keywords from Q(Ti), the passage in the window is considered a potential table-
document for Ti. In cases where multiple windows are identified, we apply a ranking function [79]
to the retrieved passages and choose the passage with the highest ranking score. The selection of an
appropriate window size is a crucial step, as the number of keywords in Q(Ti) varies for each Ti. We
propose two types of window functions (f(Q(Ti))):
• Independent window function, f(Q(Ti)) = c, where c is a numeric constant.
• Linear window function, f(Q(Ti)) = a× |Q(Ti)|+ c, where a and c are numeric constants.
After the passage describing the table is identified, we store the passage in a separate document and
represent it as the table-document TD(Ti) for the table Ti.
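The sliding-window retrieval described above can be sketched as follows. This is a simplified Python illustration, not the thesis implementation: the ranking function [79] for competing windows is omitted and the first matching window is returned.

```python
def linear_window(query_keywords, a=2, c=1):
    """Linear window function f(Q(Ti)) = a * |Q(Ti)| + c."""
    return a * len(query_keywords) + c

def find_table_document(sentences, query_keywords, window_size):
    """Slide a window of `window_size` consecutive sentences over the
    documentation and return the first window that contains every
    keyword (case-insensitive). Ranking of competing windows is
    omitted in this sketch."""
    keys = [k.lower() for k in query_keywords]
    for start in range(len(sentences) - window_size + 1):
        window = " ".join(sentences[start:start + window_size]).lower()
        if all(k in window for k in keys):
            return sentences[start:start + window_size]
    return None  # no window matched all keywords
```

For a table Ti, the query keywords would be the table name and its attribute names, and `window_size` would be obtained from the chosen window function, e.g. `linear_window(Q)`.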
4.1.3.2 Similarity of passages:
Once the table-documents have been identified, we have a corpus containing one table-document for
each table. The table-documents are pre-processed by removing stop-words and performing stemming
using the Porter Stemmer. A table-document can be modeled in two ways:
• TF-IDF Vector: TD(i) = (w1, w2, ..., wd) is represented as a d-dimensional TF-IDF feature
vector, where d = |corpus| and wj is the TF-IDF score of the jth term in TD(i).
• Binary Vector: TD(i) is represented as a d-dimensional binary vector TD(i) = (w1, w2, ..., wd),
where d = |corpus| and wj is 1 if TD(i) contains the jth term and 0 otherwise.
We then calculate pairwise similarity between table-documents using the cosine similarity measure
or the Jaccard coefficient:

DocSimcos(R,S) = DocSim(docR, docS) = (docR · docS) / (|docR| × |docS|)   (4.2)

DocSimjacc(R,S) = DocSim(docR, docS) = |docR ∩ docS| / |docR ∪ docS|   (4.3)
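A minimal Python sketch of the two measures (an illustration, assuming TF-IDF vectors are given as sparse term-to-weight dictionaries and binary vectors as term sets):

```python
import math

def cosine_sim(vec_r, vec_s):
    """Equation (4.2): cosine similarity of two TF-IDF vectors,
    represented here as sparse term -> weight dictionaries."""
    dot = sum(w * vec_s.get(term, 0.0) for term, w in vec_r.items())
    norm_r = math.sqrt(sum(w * w for w in vec_r.values()))
    norm_s = math.sqrt(sum(w * w for w in vec_s.values()))
    return dot / (norm_r * norm_s) if norm_r and norm_s else 0.0

def jaccard_sim(terms_r, terms_s):
    """Equation (4.3): Jaccard coefficient of two binary term sets."""
    r, s = set(terms_r), set(terms_s)
    return len(r & s) / len(r | s) if r | s else 0.0
```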
4.1.4 Table Similarity Measure
For two tables R and S, let RefSim(R,S) represent the referential similarity and DocSim(R,S)
represent the document similarity between R and S. We combine the referential similarity and document
similarity using a weighing scheme as
Sim(R,S) = α×RefSim(R,S) + (1− α)×DocSim(R,S) (4.4)
where α is a user-specified parameter called the contribution factor, 0 ≤ α ≤ 1. It controls the
contribution of the referential similarity to the table similarity. In some cases, two tables have a low value
Algorithm 1 Finding Table Similarity
Input: D: Database schema, TD: Set of table-document vectors, S: Document similarity measure,
α: Contribution factor
Output: Sim: Pairwise similarity between tables in the database
RefSim← REFERENCE-SIMILARITY(D)
DocSim← DOCUMENT-SIMILARITY(TD, S)
Sim← α×RefSim+ (1− α)×DocSim
for all tables as k do
  for all tables as i do
    for all tables as j do
      if Sim(i, k)× Sim(k, j) > Sim(i, j) then
        Sim(i, j)← Sim(i, k)× Sim(k, j)
      end if
    end for
  end for
end for
return Sim

procedure REFERENCE-SIMILARITY(D)
  for all tables as R do
    for all tables as S do
      if R, S have a foreign key relationship in D then
        RefSim(R,S)← 1
      else
        RefSim(R,S)← 0
      end if
    end for
  end for
  return RefSim
end procedure

procedure DOCUMENT-SIMILARITY(TD, S)
  for all tables as R do
    for all tables as S do
      DocSim(R,S)← S(TD(R), TD(S))
    end for
  end for
  return DocSim
end procedure
of (combined) similarity, but have high similarity to a common neighboring table. For example, in
Figure 4.1, tables account permission(AP ) and customer(C) do not have a referential similarity
but both are similar to the table customer account(CA). In such cases, two tables gain similarity as
they have similar neighbors. For the previous example, similarity between account permission and
customer should be max(Sim(AP,C) , Sim(AP,CA)× Sim(CA,C)).
We construct the undirected database graph G = (V,E), where nodes (V ) correspond to tables in the
database schema. For any two tables R and S, we define an edge representing the combined similarity
Sim(R,S) between the tables. The database graph G is a complete graph.
Consider a path p : R = Ti, Ti+1, Ti+2, ..., Tj = S between two tables Ti and Tj. The similarity
between the tables Ti and Tj along the path p is

Simp(R,S) = ∏_{k=i}^{j−1} Sim(Tk, Tk+1)   (4.5)

The path with the maximum similarity between R and S then gives the complete similarity between R
and S:

Sim(R,S) = max_p Simp(R,S)   (4.6)
As we construct a complete graph, we use the Floyd-Warshall algorithm for finding the shortest paths
in a weighted graph; in our case, the "shortest" path between two tables is the one with the maximum
similarity. Since we construct a complete graph for finding all-pairs maximum-similarity paths, this
step takes O(n³) running time.
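This propagation step can be sketched as a Floyd-Warshall style closure where path "length" is the product of edge similarities and we keep the maximum. A Python illustration (not the thesis code; `sim` is assumed to be a dict-of-dicts of pairwise similarities in [0, 1] with Sim(T, T) = 1):

```python
def max_path_similarity(sim):
    """Floyd-Warshall style closure: for every table pair, compute the
    maximum over all paths of the product of edge similarities
    (Equations 4.5 and 4.6)."""
    tables = list(sim)
    best = {r: dict(sim[r]) for r in tables}  # start from direct similarities
    for k in tables:          # candidate intermediate table
        for i in tables:
            for j in tables:
                via_k = best[i][k] * best[k][j]
                if via_k > best[i][j]:
                    best[i][j] = via_k
    return best
```

For instance, with hypothetical values Sim(AP, CA) = 0.8 and Sim(CA, C) = 0.9, a direct similarity Sim(AP, C) = 0.1 would be raised to 0.8 × 0.9 = 0.72 via the common neighbor.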
Algorithm 1 describes the procedure for calculating the pairwise similarity between tables in a
schema. By taking the database schema, set of extracted passages, a document similarity measure and
contribution factor as input, the algorithm returns pairwise similarity between tables. First we calculate
the referential and document similarity for O(n²) pairs and later combine them using the contribution
factor. The procedure REFERENCE-SIMILARITY() takes as input the database schema and calculates
the similarity between two tables based on the referential relationships. The procedure DOCUMENT-
SIMILARITY() takes as input the passage corresponding to each table, a document similarity measure
and calculates the similarity between tables based on the similarity of corresponding passages of the
tables. Note that for every table, the passage is extracted by employing the passage retrieval approach
described in Section 4.1.3.
4.1.5 Clustering Algorithm
For generating the summary, we use a greedy weighted k-center clustering algorithm. It solves a
min-max optimization problem, where we want to minimize the maximum distance between a table and its
cluster center.
4.1.5.1 Influential tables and Cluster Centers
In schema summarization, the notion of influential table is used for clustering [20]. The notion says
that the most important tables should not be grouped in the same cluster. We measure the influence
of a table by measuring the influence one table has on other tables in the schema [80]. Specifically, if
a table is closely related to a large number of tables in the database, it will have a high influence score.
The influence score helps in identifying the cluster centers, described in the clustering process. The
influence of a table R on another table S in the database schema is defined as
f(R,S) = 1 − e^(−Sim(R,S)²)   (4.7)
The influence score of a table is thus defined as

f(R) = Σ_{ti ∈ T} f(R, ti)   (4.8)

where T represents the set of tables in the database.
4.1.5.2 Clustering Objective Function
The clustering objective function aims to minimize the following measure [20]:

Q = max_{i=1..k} max_{R ∈ Ci} f(R) × (1 − Sim(R,Center(Ci)))   (4.9)

where k is the number of clusters, f(R) is the influence score of table R, and Center(Ci) represents
the center of the ith cluster (Ci).
4.1.5.3 Clustering Process
We use a weighted k-center algorithm that takes the influence score into account. In this
approach, the most influential table is selected as the first cluster center, and all tables are assigned
to this cluster. In each subsequent iteration, the table with the lowest weighted similarity to its cluster
center separates out to form a new cluster center. The remaining tables are then re-assigned to the closest
cluster center. We repeat the process for k iterations, so that k clusters are identified for the database
schema. The time complexity of the greedy clustering algorithm is O(kn²) [81], where n is the number
of tables in the schema.
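The influence score (Equations 4.7 and 4.8) and the greedy weighted k-center process can be sketched in Python as follows. This is an illustration under our own naming, assuming `sim` is a complete dict-of-dicts of pairwise similarities with Sim(T, T) = 1.

```python
import math

def influence(sim, r):
    """f(R) = sum over tables t of 1 - exp(-Sim(R,t)^2)  (Eqs. 4.7, 4.8)."""
    return sum(1.0 - math.exp(-sim[r][t] ** 2) for t in sim if t != r)

def weighted_k_center(sim, k):
    """Greedy weighted k-center: the most influential table becomes the
    first center; in each round, the table with the largest weighted
    distance f(R) * (1 - Sim(R, center)) becomes a new center, and all
    tables are re-assigned to their closest center."""
    f = {t: influence(sim, t) for t in sim}
    centers = [max(f, key=f.get)]
    assign = {t: centers[0] for t in sim}
    while len(centers) < k:
        far = max((t for t in sim if t not in centers),
                  key=lambda t: f[t] * (1.0 - sim[t][assign[t]]))
        centers.append(far)
        for t in sim:  # re-assign every table to its closest center
            assign[t] = max(centers, key=lambda c: sim[t][c])
    return centers, assign
```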
4.2 Experimental Results
In this section, we present results of experiments conducted on our proposed approach. The following
variables have been used at different stages in our approach:
• Window size function (f ) for table-document discovery.
• Document similarity measure (S), for calculating the similarity of passage describing the tables
in the documentation.
• α, the contribution factor in combined table similarity metric.
• k, the number of clusters for the clustering algorithm.
Varying any of the variables affects the table similarity metric and clustering. We study the influence
of these variables by varying one variable while keeping the other variables constant. Later, we conduct
experiments on the clustering algorithm and compare our approach with other existing approaches.
4.2.1 Experimental Setup
We used the TPCE database [72], provided by TPC. It is an online transaction processing workload,
simulating the OLTP workload of a brokerage firm. TPC also provides a software package EGen to
facilitate the implementation of the TPCE database. We used the following parameters to implement an
instance of TPCE: Number of Customers = 5000, Scale factor = 36000, Initial trade days = 10.
The TPCE schema consists of 33 tables, which are grouped into four categories: Customer, Market,
Broker and Dimension. We use this categorization as the gold standard to measure the accuracy of our
approach. The dimension tables are not an explicit category; they are used as companion tables to the
other fact tables and hence can be considered outliers to our clustering process. We thus aim to cluster
the other 29 tables and measure the accuracy of these 29 tables against the given gold standard.
In addition, TPC also provides the documentation for the TPCE benchmark. It is a 286-page
document and contains information about the TPCE business and application environment, the database and
the database transactions involved. This document serves as the external source in the proposed schema
summarization approach.
4.2.2 Evaluation Metric
The accuracy of the clustering and of the table similarity metric is evaluated by means of an accuracy
score proposed in [20]. The accuracy score has different connotations for clustering evaluation and table
similarity evaluation. For the table similarity metric, we find the top-n neighbors of each table based on
the Sim metric described in Equation (4.6). Unless specifically mentioned, we find the top-5 neighbors
in our experiments. From the gold standard, if the category of table Ti is Ca and mi is the count of tables
in the top-n neighborhood of Ti belonging to category Ca, then the average accuracy of the similarity
metric is defined as

accsim = ( Σ_{i ∈ T} mi/n ) / |T|   (4.10)
Similarly for clustering accuracy, consider a cluster i containing ni tables. If the category of the
cluster center of cluster i is Ca, let mi denote the count of tables in the cluster that belong to
category Ca. Then the accuracy of cluster i and the overall clustering accuracy are

accclusti = mi / ni   (4.11)

accclust = ( Σ_i mi ) / |T|   (4.12)
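The two accuracy scores can be sketched in Python as follows (an illustration, assuming the gold standard is given as a table-to-category mapping):

```python
def acc_sim(gold, neighbors, n):
    """Equation (4.10): average fraction of each table's top-n neighbors
    that share its gold-standard category. `gold` maps table -> category;
    `neighbors` maps table -> list of its top-n neighbor tables."""
    total = sum(sum(1 for s in neighbors[t] if gold[s] == gold[t]) / n
                for t in gold)
    return total / len(gold)

def acc_clust(gold, clusters, centers):
    """Equations (4.11)-(4.12): a table counts as correct if it has the
    same category as the center of its cluster. `clusters` maps a cluster
    id to its member tables; `centers` maps a cluster id to its center."""
    correct = sum(1 for cid, members in clusters.items()
                  for t in members if gold[t] == gold[centers[cid]])
    return correct / sum(len(m) for m in clusters.values())
```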
Figure 4.2 accsim and accclust values on varying window function f
Figure 4.3 accsim and accclust values for document similarity functions S
4.2.3 Effect of window function (f ) on combined table similarity and clustering
In this experiment we measure the impact of varying the window function f for the window size (w)
on the clustering accuracy and the table similarity metric. We fix α = 0.5, k = 3 and use the TF-IDF
based cosine similarity for table-document similarity. We experiment with the following window
functions:
• wi = f(Q(Ti)) = 10
• wi = f(Q(Ti)) = 20
• wi = f(Q(Ti)) = 2× |Q(Ti)|+ 1
• wi = f(Q(Ti)) = 3× |Q(Ti)|+ 1
The results of this experiment are shown in Figure 4.2. We observe that although the function
f = 20 gives respectable results, it is hard to determine a suitable value for such a constant (f = 10 gives
poor results). Using a constant window size can cause loss of information in some cases or add noise in
other cases. The linear window functions, which gave comparable results, are therefore preferred. In
further experiments, unless otherwise specified, we use the window function
f(Q(Ti)) = 2× |Q(Ti)|+ 1.
Figure 4.4 Accuracy of similarity metric on varying values of α (plotted with and without dimension tables)
Figure 4.5 Accuracy of clustering on varying values of α
4.2.4 Effect of document similarity measure (S) on similarity metric and clustering accuracy
The table-documents identified for each table can be of variable length. We study the two similarity
measures described in Equation (4.2) and Equation (4.3): cosine similarity and Jaccard similarity. We
compare the accuracy of the similarity metric and the clustering algorithm for the two measures, with
k = 3, α = 0.5 and f(Q(Ti)) = 2 × |Q(Ti)| + 1. The results of the experiments are shown in Figure
4.3. We observe that the TF-IDF based cosine similarity measure gives more consistent results.
This can be attributed to the fact that the table-documents share many similar terms about the domain
of the document, and hence term frequency and inverse document frequency play an important role in
determining the score of the terms in a document.
4.2.5 Effect of contribution factor (α) on table similarity and clustering
In this section we measure the impact of varying α on the clustering accuracy and table similarity
metric. In this experiment, we fix w = 2×|Q(Ti)| and k = 3, while varying α from 0 to 1. Figure 4.4 and
Figure 4.5 show the results of varying α on clustering accuracy and accuracy of similarity metric. One
interesting observation is we achieve the best clustering accuracy when the contribution of referential
similarity and document similarity are almost equal (α = 0.4, 0.5, 0.6). This shows that rather than
one notion of similarity supplementing the other, both similarities have equal importance in generating
schema summary. Also using any single similarity measure (when α is 0 or 1) produces low accuracy
results which verify the claims made in this chapter.
Figure 4.6 Clustering accuracy of clusts, clustv, clustd and clustc for different values of k
4.2.6 Comparison of Clustering Algorithms
In this section, we compare clustering algorithms for schema summary generation. In addition to the
proposed weighted k-center clustering algorithm using an influence function (clusts), we implement the
following clustering algorithms:
• clustc, a community detection based schema summarization approach proposed in [19].
• clustd, the schema summarization approach proposed in [20]; this approach uses a table
importance metric based weighted k-center clustering algorithm.
• clustv, which combines the results of clustering using reference similarity and clustering using
document similarity through a voting scheme similar to [21]. This algorithm focuses on combining
clusterings from different similarity models rather than combining the similarity models themselves.
Figure 4.6 shows the clustering accuracy achieved for k = (2, 3, 4) for the various clustering algorithms.
We observed that clusts and clustd achieve similar accuracy, with clusts giving slightly higher
accuracy as it was able to successfully cluster the table trade request. If no active transactions are
considered for the TPCE database, the table trade request is empty, and data-oriented approaches are
unable to classify it. For the clustv and clustc approaches, no specific patterns were observed.
The low accuracy of clustv arises because referential similarity alone provides a very imbalanced and
ineffective clustering that significantly deters the overall clustering accuracy in the voting scheme.
4.3 Summary of the chapter
Schema summarization has been proposed in the literature to help users in exploring complex database
schemas. Existing approaches for schema summarization are data oriented. In this chapter, we proposed
a schema summarization approach for relational databases using database schema and the database doc-
umentation. We proposed a combined similarity measure to incorporate similarities from both sources
and proposed a framework for schema summary generation. Experiments were conducted on a bench-
mark database and the results showed that the proposed approach is as effective as the existing data
oriented approaches.
Chapter 5
Conclusion and Future work
With the rapid increase in the amount of published information and the explosion of data, users
require sophisticated tools to simplify the task of managing data and extracting useful information in a
timely fashion. Consequently, databases and database systems are essential to every organization for its
business operations.
Accessing information stored in a database requires the user to be familiar with query languages.
Naive users are not skilled at using a general-purpose query language like SQL, which has a complex
structure. As a result, research efforts are ongoing to provide easy-to-use query interfaces with expressive
power comparable to SQL.
The QBO approach, based on the IRE framework, provides an interface where the user progressively
builds a query in multiple steps. The QBO approach works well for small databases but does not perform
well on a database consisting of a large number of tables and rows. In this thesis, we proposed Query-by-
Topics, which provides enhancements over the existing QBO approach. We exploit topical structures
in large databases to represent objects at a higher level of abstraction. We also organize instances of
an object in a two-level hierarchy based on a user-selected attribute. The advantages of this approach
include reduced navigational burden for the user and a reduced number of operations at the system level.
We also implemented a system prototype for a real database and made efforts to extend it to any
general-purpose database. Experiments were conducted at the system level to estimate the reduction
in navigational burden and the reduction in the number of operational pairs. A usability study was also
conducted using the system prototype to evaluate our efforts against human factors.
A key step in the proposed approach was to represent schema elements at a higher level of abstraction.
Schema summarization has been proposed in the literature to cluster database schema entities and
to present a high-level abstraction of the schema that helps users explore complex database schemas.
Existing approaches for schema summarization are data-oriented. In this thesis, we proposed a schema
summarization approach for relational databases by utilizing the database schema and the database doc-
umentation. We proposed a combined similarity measure to incorporate similarities from both sources
and proposed a framework for schema summary generation. Experiments were conducted on a bench-
mark database and the results showed that the proposed approach is as effective as the existing data
oriented approaches.
As part of future work, we would like to address the limitations of the usability study (mentioned
in Section 3.4.3.3). For schema summarization, we would like to develop approaches to learn the
values of the various parameters used in the proposed approach. Also, apart from the database documentation,
documents like the requirements document could be exploited for schema summarization. Lastly,
another line of research could focus on developing a unified approach to combine the notions of similarity
from the schema, the data and the database documentation.
Related Publications
1. Ammar Yasir, M. Kumara Swamy and P. Krishna Reddy, Exploiting Schema and Documenta-
tion for Summarizing Relational Databases, International Conference on Big Data Analytics,
LNCS Volume 7678, 2012, pp. 77-99.
2. Ammar Yasir, M. Kumara Swamy and P. Krishna Reddy, Enhanced Query by Object Approach
for Information Requirement Elicitation in Large Databases, International Conference on Big
Data Analytics, LNCS Volume 7678, 2012, pp 26-41.
Bibliography
[1] Tiziana Catarci, Maria Francesca Costabile, Stefano Levialdi, and Carlo Batini. Visual query
systems for databases: A survey. Journal of Visual Languages and Computing, 8(2):215–260,
1997.
[2] Moshe M. Zloof. Query by example. In Proceedings of the May 19-22, 1975, national computer
conference and exposition, AFIPS ’75, pages 431–438, New York, NY, USA, 1975. ACM.
[3] Joobin Choobineh. Human Factors in Management Information Systems. Ablex Publishing Corp.,
Norwood, NJ, USA, 1988.
[4] Joobin Choobineh, Michael V. Mannino, and Veronica P. Tseng. A form-based approach for
database analysis and design. Communications of the ACM, 35(2):108–120, February 1992.
[5] Raghu Ramakrishnan and Johannes Gehrke. Database management systems (3. ed.). McGraw-
Hill, 2003.
[6] Lu Qin, Jeffrey Xu Yu, and Lijun Chang. Keyword search in databases: The power of RDBMS.
In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data,
SIGMOD ’09, pages 681–694, New York, NY, USA, 2009. ACM.
[7] Arvind Hulgeri and Charuta Nakhe. Keyword searching and browsing in databases using BANKS.
In Proceedings of the 18th International Conference on Data Engineering, ICDE ’02, pages 431–,
Washington, DC, USA, 2002. IEEE Computer Society.
[8] Sanjay Agrawal, Surajit Chaudhuri, and Gautam Das. DBXplorer: Enabling keyword search over
relational databases. In Proceedings of the 2002 ACM SIGMOD international conference on Man-
agement of data, SIGMOD ’02, pages 627–627, New York, NY, USA, 2002. ACM.
[9] Jun Sun. Information requirement elicitation in mobile commerce. Communications of the ACM,
46(12):45–47, December 2003.
[10] Jun (John) Sun, Hoh Peter In, and Kuncara Aji Sukasdadi. A prototype of information requirement
elicitation in m-commerce. In 2003 IEEE International Conference on Electronic Commerce (CEC
2003), 24-27 June 2003, Newport Beach, CA, USA, page 53, 2003.
[11] Subhash Bhalla, Masaki Hasegawa, Enrique Gutierrez, and Nadia Berthouze. Computational in-
terface for web based access to dynamic contents. International Journal of Computational Science
and Engineering, 2(5/6):302–306, August 2006.
[12] S. Bhalla and M. Hasegawa. Query-by-object interface for accessing dynamic contents on the web.
In TENCON ’02. Proceedings. 2002 IEEE Region 10 Conference on Computers, Communications,
Control and Power Engineering, volume 1, pages 310–313, 2002.
[13] Takatoshi Akiyama and Yutaka Watanobe. An advanced search interface for mobile devices. In
Proceedings of the 2012 Joint International Conference on Human-Centered Computer Environ-
ments, HCCE ’12, pages 230–235, New York, NY, USA, 2012. ACM.
[14] Shapiee Abd Rahman, Subhash Bhalla, and Tetsuya Hashimoto. Query-by-object interface for
information requirement elicitation in m-commerce. International Journal of Human Computer
Interaction, 20(2), 2006.
[15] Kazumi Nemoto and Yutaka Watanobe. An advanced search system for learning objects. In
Proceedings of the 13th International Conference on Humans and Computers, HC ’10, pages 94–
99, Fukushima-ken, Japan, 2010. University of Aizu Press.
[16] M. Hasegawa, S. Bhalla, and T. Izumita. A high-level query interface for web user’s access to data
resources. In Frontier of Computer Science and Technology, 2007. FCST 2007. Japan-China Joint
Workshop on, pages 98–105, 2007.
[17] Wensheng Wu, Berthold Reinwald, Yannis Sismanis, and Rajesh Manjrekar. Discovering topical
structures of databases. In Proceedings of the 2008 ACM SIGMOD International Conference on
Management of Data, SIGMOD ’08, pages 1019–1030, New York, NY, USA, 2008. ACM.
[18] Cong Yu and H. V. Jagadish. Schema summarization. In Proceedings of the 32nd international
conference on Very large data bases, VLDB ’06, pages 319–330. VLDB Endowment, 2006.
[19] Xue Wang, Xuan Zhou, and Shan Wang. Summarizing large-scale database schema using community detection. Journal of Computer Science and Technology, 27:515–526, 2012.
[20] Xiaoyan Yang, Cecilia M. Procopiuc, and Divesh Srivastava. Summarizing relational databases.
Proceedings of the VLDB Endowment, 2(1):634–645, August 2009.
[21] Wensheng Wu, Berthold Reinwald, Yannis Sismanis, and Rajesh Manjrekar. Discovering topical
structures of databases. In Proceedings of the 2008 ACM SIGMOD international conference on
Management of data, SIGMOD ’08, pages 1019–1030, New York, NY, USA, 2008. ACM.
[22] Ben Shneiderman. Improving the human factors aspect of database interactions. ACM Transac-
tions on Database Systems, 3(4):417–439, December 1978.
[23] C. J. Date. Database usability. In Proceedings of the 1983 ACM SIGMOD international conference
on Management of data, SIGMOD ’83, pages 1–1, New York, NY, USA, 1983. ACM.
[24] Tiziana Catarci. What happened when database researchers met usability. Information Systems,
25(3):177–212, 2000.
[25] H. V. Jagadish, Adriane Chapman, Aaron Elkiss, Magesh Jayapandian, Yunyao Li, Arnab Nandi,
and Cong Yu. Making database systems usable. In Proceedings of the 2007 ACM SIGMOD
international conference on Management of data, SIGMOD ’07, pages 13–24, New York, NY,
USA, 2007. ACM.
[26] Kyu-Young Whang, Art Ammann, Anthony Bolmarcich, Maria Hanrahan, Guy Hochgesang,
Kuan-Tsae Huang, Al Khorasani, Ravi Krishnamurthy, Gary Sockut, Paula Sweeney, Vance Wad-
dle, and Moshe Zloof. Office-by-example: an integrated office system and database manager. ACM
Transactions on Information Systems, 5(4):393–427, October 1987.
[27] E. F. Codd. Relational completeness of data base sublanguages. In: R. Rustin (ed.): Database
Systems: 65-98, Prentice Hall and IBM Research Report RJ 987, San Jose, California, 1972.
[28] Arijit Sengupta and Andrew Dillon. Query by templates: a generalized approach for visual query
formulation for text dominated databases. In Proceedings of the IEEE international forum on
Research and technology advances in digital libraries, IEEE ADL ’97, pages 36–47, Washington,
DC, USA, 1997. IEEE Computer Society.
[29] Michele Angelaccio, Tiziana Catarci, and Giuseppe Santucci. Query by diagram: A fully visual
query system. Journal of Visual Languages and Computing, 1(3):255–273, September 1990.
[30] Antonio Massari, Stefano Pavani, Lorenzo Saladini, and Panos K. Chrysanthis. QBI: Query by
icons. In Proceedings of the 1995 ACM SIGMOD international conference on Management of
data, SIGMOD ’95, pages 477–, New York, NY, USA, 1995. ACM.
[31] Francesca Benzi, Dario Maio, and Stefano Rizzi. VISIONARY: a viewpoint-based visual language
for querying relational databases. Journal of Visual Languages and Computing, 10(2):117–145,
1999.
[32] Norman Murray, Norman Paton, and Carole Goble. Kaleidoquery: A visual query language for
object databases. In Proceedings of the Working Conference on Advanced Visual Interfaces, AVI
’98, pages 247–257, New York, NY, USA, 1998. ACM.
[33] Bin Liu and H.V. Jagadish. A spreadsheet algebra for a direct data manipulation query interface.
In Data Engineering, 2009. ICDE ’09. IEEE 25th International Conference on, pages 417–428,
2009.
[34] Clemente Rafael Borges and José Antonio Macías. Feasible database querying using a visual end-
user approach. In Proceedings of the 2nd ACM SIGCHI symposium on Engineering interactive
computing systems, EICS ’10, pages 187–192, New York, NY, USA, 2010. ACM.
[35] Arnab Nandi and Michael Mandel. The interactive join: recognizing gestures for database queries.
In CHI ’13 Extended Abstracts on Human Factors in Computing Systems, CHI EA ’13, pages
1203–1208, New York, NY, USA, 2013. ACM.
[36] Roberta Evans Sabin and Tieng K. Yap. Integrating information retrieval techniques with tradi-
tional db methods in a web-based database browser. In Proceedings of the 1998 ACM symposium
on Applied Computing, SAC ’98, pages 760–766, New York, NY, USA, 1998. ACM.
[37] Saurabh Sinha, Kirk Bowers, Sandra A. Mamrak, and Ra A. Mamrak. Accessing a medical
database using www-based user interfaces. Technical report, The Ohio State University, 1998.
[38] Magesh Jayapandian and H. V. Jagadish. Automating the design and construction of query forms.
In Proceedings of the 22Nd International Conference on Data Engineering, ICDE ’06, pages 125–,
Washington, DC, USA, 2006. IEEE Computer Society.
[39] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine.
In Proceedings of the seventh international conference on World Wide Web 7, WWW7, pages
107–117, Amsterdam, The Netherlands, The Netherlands, 1998. Elsevier Science Publishers B. V.
[40] Lin Guo, Feng Shao, Chavdar Botev, and Jayavel Shanmugasundaram. XRank: Ranked keyword
search over XML documents. In Proceedings of the 2003 ACM SIGMOD international conference
on Management of data, SIGMOD ’03, pages 16–27, New York, NY, USA, 2003. ACM.
[41] Yunyao Li, Cong Yu, and H. V. Jagadish. Schema-free XQuery. In Proceedings of the Thirtieth
international conference on Very large data bases - Volume 30, VLDB ’04, pages 72–83. VLDB
Endowment, 2004.
[42] Andrey Balmin, Vagelis Hristidis, and Yannis Papakonstantinou. ObjectRank: Authority-based
keyword search in databases. In (e)Proceedings of the Thirtieth International Conference on Very
Large Data Bases, Toronto, Canada, August 31 - September 3 2004, pages 564–575, 2004.
[43] Sanjay Agrawal, Surajit Chaudhuri, Gautam Das, and Aristides Gionis. Automated ranking of
database query results. In CIDR, pages 888–899, 2003.
[44] Eric Chu, Akanksha Baid, Xiaoyong Chai, AnHai Doan, and Jeffrey Naughton. Combining key-
word search and forms for ad hoc querying of databases. In Proceedings of the 2009 ACM SIG-
MOD International Conference on Management of data, SIGMOD ’09, pages 349–360, New York,
NY, USA, 2009. ACM.
[45] Aditya Ramesh, S. Sudarshan, Purva Joshi, and ManishaNaik Gaonkar. Keyword search on form
results. The VLDB Journal, 22(1):99–123, 2013.
[46] Google Directory, http://dir.google.com/.
[47] Open Web Directory, http://dmoz.org/.
[48] Arnab Nandi and H. V. Jagadish. Guided interaction: Rethinking the query-result paradigm.
PVLDB, 4(12):1466–1469, 2011.
[49] Ricardo Baeza-Yates, Carlos Hurtado, and Marcelo Mendoza. Query recommendation using query
logs in search engines. In Proceedings of the 2004 international conference on Current Trends in
Database Technology, EDBT’04, pages 588–596, Berlin, Heidelberg, 2004. Springer-Verlag.
[50] Zhiyong Zhang and Olfa Nasraoui. Mining search engine query logs for query recommendation.
In Proceedings of the 15th international conference on World Wide Web, WWW ’06, pages 1039–
1040, New York, NY, USA, 2006. ACM.
[51] Arnab Nandi and H. V. Jagadish. Assisted querying using instant-response interfaces. In Proceed-
ings of the 2007 ACM SIGMOD international conference on Management of data, SIGMOD ’07,
pages 1156–1158, New York, NY, USA, 2007. ACM.
[52] Nodira Khoussainova, YongChul Kwon, Magdalena Balazinska, and Dan Suciu. SnipSuggest: Context-aware autocompletion for SQL. Proceedings of the VLDB Endowment, 4(1):22–33, October 2010.
[53] Holger Bast and Ingmar Weber. The CompleteSearch engine: Interactive, efficient, and towards IR & DB integration. In CIDR, pages 88–95, 2007.
[54] Guoliang Li, Shengyue Ji, Chen Li, and Jianhua Feng. Efficient type-ahead search on relational
data: a tastier approach. In Proceedings of the 2009 ACM SIGMOD International Conference on
Management of data, SIGMOD ’09, pages 695–706, New York, NY, USA, 2009. ACM.
[55] Peter Anick. Using terminological feedback for web search refinement: A log-based study. In Pro-
ceedings of the 26th Annual International ACM SIGIR Conference on Research and Development
in Informaion Retrieval, SIGIR ’03, pages 88–95, New York, NY, USA, 2003. ACM.
[56] Ka-Ping Yee, Kirsten Swearingen, Kevin Li, and Marti Hearst. Faceted metadata for image search
and browsing. In Proceedings of the SIGCHI Conference on Human Factors in Computing Sys-
tems, CHI ’03, pages 401–408, New York, NY, USA, 2003. ACM.
[57] Paul Brown, Peter J. Haas, Jussi Myllymaki, Hamid Pirahesh, Berthold Reinwald, and Yannis
Sismanis. Toward automated large-scale information integration and discovery. In Data Manage-
ment in a Connected World, Essays Dedicated to Hartmut Wedekind on the Occasion of His 70th
Birthday, pages 161–180, 2005.
[58] AnHai Doan and Alon Y. Halevy. Semantic-integration research in the database community. AI
Mag., 26(1):83–94, March 2005.
[59] Erhard Rahm and Philip A. Bernstein. A survey of approaches to automatic schema matching. The
VLDB Journal, 10(4):334–350, December 2001.
[60] S. Bergamaschi, S. Castano, and M. Vincini. Semantic integration of semistructured and structured
data sources. SIGMOD Record, 28(1):54–59, March 1999.
[61] Luigi Palopoli, Giorgio Terracina, and Domenico Ursino. Experiences using dike, a system for
supporting cooperative information system and data warehouse design. Information Systems,
28(7):835–865, October 2003.
[62] Jayant Madhavan, Philip A. Bernstein, and Erhard Rahm. Generic schema matching with cupid.
In Proceedings of the 27th International Conference on Very Large Data Bases, VLDB ’01, pages
49–58, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.
[63] Erhard Rahm, Hong-Hai Do, and Sabine Massmann. Matching large xml schemas. SIGMOD
Record, 33(4):26–31, December 2004.
[64] Tamraparni Dasu, Theodore Johnson, S. Muthukrishnan, and Vladislav Shkapenyuk. Mining
database structure; or, how to build a data quality browser. In Proceedings of the 2002 ACM
SIGMOD international conference on Management of data, SIGMOD ’02, pages 240–251, New
York, NY, USA, 2002. ACM.
[65] Periklis Andritsos, Renee J. Miller, and Panayiotis Tsaparas. Information-theoretic tools for min-
ing database structure from large data sets. In Proceedings of the 2004 ACM SIGMOD interna-
tional conference on Management of data, SIGMOD ’04, pages 731–742, New York, NY, USA,
2004. ACM.
[66] Yannis Sismanis, Paul Brown, Peter J. Haas, and Berthold Reinwald. GORDIAN: efficient and
scalable discovery of composite keys. In Proceedings of the 32nd International Conference on
Very Large Data Bases, Seoul, Korea, September 12-15, 2006, pages 691–702, 2006.
[67] Jiawei Han, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques. Morgan
Kaufmann Publishers Inc., San Francisco, CA, USA, 3rd edition, 2011.
[68] Erhard Rahm and Philip A. Bernstein. A survey of approaches to automatic schema matching. The
VLDB Journal, 10(4):334–350, December 2001.
[69] A. Lund. Measuring usability with the USE questionnaire. Usability and User Experience Special Interest Group, 8(2), October 2001.
[70] Arnab Nandi and H. V. Jagadish. Guided interaction: Rethinking the query-result paradigm.
PVLDB, 4(12):1466–1469, 2011.
[71] AnHai Doan and Alon Y. Halevy. Semantic-integration research in the database community. AI
Magazine, 26(1):83–94, March 2005.
[72] TPC-E Benchmark, http://www.tpc.org/tpce/.
[73] Charles L. A. Clarke, Gordon V. Cormack, D. I. E. Kisman, and Thomas R. Lynam. Question
answering by passage selection (multitext experiments for TREC-9). In Proceedings of The Ninth
Text REtrieval Conference, TREC 2000, Gaithersburg, Maryland, USA, November 13-16, 2000,
2000.
[74] Abraham Ittycheriah, Martin Franz, Wei-Jing Zhu, Adwait Ratnaparkhi, and Richard J. Mam-
mone. Ibm’s statistical question answering system. In Proceedings of The Ninth Text REtrieval
Conference, TREC 2000, Gaithersburg, Maryland, USA, November 13-16, 2000, 2000.
[75] Gerard Salton, J. Allan, and Chris Buckley. Approaches to passage retrieval in full text information
systems. In Proceedings of the 16th annual international ACM SIGIR conference on Research and
development in information retrieval, SIGIR ’93, pages 49–58, New York, NY, USA, 1993. ACM.
[76] Stefanie Tellex, Boris Katz, Jimmy Lin, Aaron Fernandes, and Gregory Marton. Quantitative eval-
uation of passage retrieval algorithms for question answering. In Proceedings of the 26th annual
international ACM SIGIR conference on Research and development in informaion retrieval, SIGIR
’03, pages 41–47, New York, NY, USA, 2003. ACM.
[77] Mengqiu Wang and Luo Si. Discriminative probabilistic models for passage based retrieval. In
Proceedings of the 31st annual international ACM SIGIR conference on Research and development
in information retrieval, SIGIR ’08, pages 419–426, New York, NY, USA, 2008. ACM.
[78] W. Xi, R. Xu-Rong, C. S. Khoo, and E. P. Lim. Incorporating window-based passage-level evidence in document retrieval. Journal of Information Science, 27:73–80, 2001.
[79] Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gat-
ford. Okapi at TREC-3. In Proceedings of The Third Text Retrieval Conference, TREC 1994,
Gaithersburg, Maryland, USA, November 2-4, 1994, pages 109–126, 1994.
[80] Yang Zhou, Hong Cheng, and Jeffrey Xu Yu. Graph clustering based on structural/attribute simi-
larities. Proceedings of the VLDB Endowment, 2(1):718–729, August 2009.
[81] M. E. Dyer and A. M. Frieze. A simple heuristic for the p-centre problem. Operations Research Letters, 3(6):285–288, February 1985.