Enhancing the Query by Object Approach using Schema Summarization
Techniques
Thesis submitted in partial fulfillment
of the requirements for the degree of
MS by Research
in
Computer Science Engineering
by
Ammar Yasir
200702005
Center of Data Engineering
International Institute of Information Technology
Hyderabad - 500 032, INDIA
July 2015
Copyright © Ammar Yasir, 2015
All Rights Reserved
International Institute of Information Technology
Hyderabad, India
CERTIFICATE
It is certified that the work contained in this thesis, titled “Enhancing the Query-By-Object approach
using Schema Summarization techniques” by Ammar Yasir, has been carried out under my supervision
and is not submitted elsewhere for a degree.
Date Adviser: Prof. P. Krishna Reddy
Dedicated to my parents
Mrs. Shahar Bano, Mr. Ziaul Hasan and sister Sara Hasan for their everlasting love and
support.
Acknowledgments
This dissertation would not have been written without the constant support and encouragement of
many people.
Firstly, I would like to express my deepest gratitude to Professor P. Krishna Reddy for his expert
guidance. He has supported me throughout my thesis with invaluable discussions and feedback. He also
encouraged me to take up challenging problems and gave me freedom to explore my ideas.
I would also like to thank my colleagues in IT for Agriculture Lab and Center for Data Engineering,
especially M Kumara Swamy and R Uday Kiran sir for their critical comments and constructive sug-
gestions. I would also like to thank my labmates Gowtham Srinivas, Somya, Satheesh for their fruitful
discussions.
I am grateful to all my friends for providing constant support and motivation. Ashray, Abhinav, Rohit
Nigam, Rohit Gautam, Romit, Shubhangi, Sankalp, Ankur Goel, Vinay, Shrikant, Rakshit, Siddharth
and Ankit made my stay at IIIT one of the best experiences of my life.
Lastly, I am forever indebted to my mother Mrs. Shahar Bano and my father Mr. Ziaul Hasan for
their patience, understanding and encouragement.
Abstract
Modern day organizations use databases to manage information for their business operations. Since
the introduction of DBMSs in the mid-1960s, database technology has made significant advances in
terms of functionality and performance. As a result, modern day database systems can process a large
number of complex queries on any database. An important area of database research focuses on improv-
ing the usability of databases. Research efforts are ongoing to develop efficient user interfaces to access
information from databases, focusing not only on the design of user-interfaces but more importantly,
improving the process of user interaction and the underlying architecture.
Information Requirement Elicitation (IRE) was proposed in the literature, which recommends a
framework for developing interactive interfaces, allowing users to access database systems without hav-
ing prior knowledge of a query language. An approach called ‘Query-by-Object’ (QBO) has been
proposed in the literature for IRE by exploiting simple calculator-like operations. In QBO, the database
is represented with the help of objects and operators are provided to relate information between objects.
However, the QBO approach was proposed by assuming that the underlying database is simple and con-
tains a few tables, each of small size. Large databases have complex database schemas. Given a large
number of tables in a schema, the number of objects is also large. Locating information of interest and
how it is related to other objects becomes a challenging task for the user. Also, the number of possible
operations between objects increases significantly. In this thesis, we investigate opportunities for a
better organization of options available to the user for interacting with the database without making any
changes to the organization of data at the physical layer. First, we try to determine entities in the schema
that collectively represent a conceptual unit or topic in the database. Similarly, we organize the
instances of an object into a hierarchy based on their attribute values. The organization of
objects into topics allows the user to relate information at a higher level of abstraction and reduces the
number of operational pairs that need to be defined in QBO. We also evaluate the research decisions
through system analysis and usability studies, which were conducted with the help of a fully functional
prototype developed for a real, complex database.
An important process in the proposed approach is discovering topical structures in the database
schema. The problem has gained attention recently in the database community as the problem of
Schema Summarization. Schema summarization for a relational database schema is a challenge that
involves identifying semantically correlated elements in a database schema. Research efforts are being
made to propose schema summarization approaches by exploiting database schema and data stored in
the database. Existing efforts for schema summarization are data-oriented. In scenarios where data is
inconsistent or insufficient, existing approaches suffer. In this thesis, we explore the database documen-
tation as an information source. We aim to utilize the schema and database documentation to provide an
efficient schema summary. We propose a notion of table similarity by exploiting the referential relation-
ship between tables and the similarity of passages describing the corresponding tables in the database
documentation. Using the notion of table similarity, we propose a clustering based approach for schema
summary generation. Experimental results on a benchmark database show that the proposed approach,
although independent of the data stored in the database, is as effective as the data-oriented approaches.
Contents
Chapter Page
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Overview of Existing Efforts for Access Methods in Database Systems . . . . . . . . . 2
1.2 Overview of Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Overview of Proposed Approach for Enhanced Query-by-Object Approach . . 3
1.2.1.1 Overview of Query-by-Object approach . . . . . . . . . . . . . . . 3
1.2.1.2 Issues with Query-by-Object approach . . . . . . . . . . . . . . . . 4
1.2.1.3 Proposed Enhanced Query-by-Object Approach . . . . . . . . . . . 4
1.2.2 Overview of Proposed Schema Summarization Approach . . . . . . . . . . . . 5
1.2.2.1 Overview of Schema Summarization . . . . . . . . . . . . . . . . . 5
1.2.2.2 Proposed Approach for Schema Summarization . . . . . . . . . . . 6
1.3 Thesis Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1 Innovative Query Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.1 Visual Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.2 Text Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.3 Other Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 Schema Summarization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.1 Schema Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.2 Mining Database Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.3 Topical Structures in Databases . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.4 Schema Summarization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3 Enhanced Query-by-Object Approach for Information Requirement Elicitation in Large Databases 16
3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1.1 Query-by-Object Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1.2 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1.3 Discovering Topical Structures in Databases . . . . . . . . . . . . . . . . . . . 19
3.2 Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2.1 Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2.2 Basic Idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.2.1 Organization into topics: . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.2.2 Facilitating Instance Selection: . . . . . . . . . . . . . . . . . . . . 23
3.2.2.3 Defining Operations: . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2.3 QBT protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2.3.1 QBT Developer Protocol . . . . . . . . . . . . . . . . . . . . . . . 25
3.2.3.2 QBT User protocol . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3 System Prototype . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.1 CONFIG-DB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3.2 User Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.1 Experimental Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.2 Performance Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.3 Usability Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.3.1 Experiment 1, Task Analysis: . . . . . . . . . . . . . . . . . . . . . 32
3.4.3.2 Experiment 2, User Survey: . . . . . . . . . . . . . . . . . . . . . 34
3.4.3.3 Limitations and possible improvements for the usability study . . . . 35
3.5 Summary of the chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4 Exploiting Schema and Documentation for Summarizing Relational Databases . . . . . . . . 37
4.1 Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.1.1 Basic Idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1.2 Schema based Table Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1.3 Documentation based Table Similarity . . . . . . . . . . . . . . . . . . . . . . 40
4.1.3.1 Finding Relevant Text from the Documentation: . . . . . . . . . . . 40
4.1.3.2 Similarity of passages: . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.1.4 Table Similarity Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.1.5 Clustering Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.1.5.1 Influential tables and Cluster Centers . . . . . . . . . . . . . . . . . 44
4.1.5.2 Clustering Objective Function . . . . . . . . . . . . . . . . . . . . . 44
4.1.5.3 Clustering Process . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2.2 Evaluation Metric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2.3 Effect of window function (f) on combined table similarity and clustering . . . 46
4.2.4 Effect of document similarity measure (S) on similarity metric and clustering
accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2.5 Effect of contribution factor (α) on table similarity and clustering . . . . . . . 47
4.2.6 Comparison of Clustering Algorithm . . . . . . . . . . . . . . . . . . . . . . 48
4.3 Summary of the chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5 Conclusion and Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
List of Figures
Figure Page
1.1 The TPCE schema without table categories . . . . . . . . . . . . . . . . . . . . . . . 5
3.1 QBO user protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2 The iDisc approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3 Topical Structure for QBT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.4 QBT user protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.5 System Prototype Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.6 CONFIG-DB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.7 Traditional calculator versus System Prototype UI . . . . . . . . . . . . . . . . . . . . 29
3.8 System Prototype UI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.9 Treemap representation of user’s selection (object and granularity) . . . . . . . . . 30
3.10 QBO Approach Prototype . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.11 QBT Approach Prototype (with topic modeling and binning) . . . . . . . . . . . . 32
3.12 Average ratings for questions from questionnaire . . . . . . . . . . . . . . . . . . . . 34
4.1 TPCE Schema . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2 accsim and accclust values on varying window function, f . . . . . . . . . . . . . . . 46
4.3 accsim and accclust values for document similarity functions S . . . . . . . . . . . . . 46
4.4 Accuracy of similarity metric on varying values of α . . . . . . . . . . . . . . . . . . 47
4.5 Accuracy of clustering on varying values of α . . . . . . . . . . . . . . . . . . . . . . 47
4.6 Clustering accuracy for different clustering algorithms . . . . . . . . . . . . . . . . . 48
List of Tables
Table Page
3.1 Operator Matrix for Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2 QBO Developer and User Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.3 Within-Topic Matrix 1 (WT-I) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.4 Between-Topic Matrix (BT) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.5 Time taken and number of attempts for each task . . . . . . . . . . . . . . . . . . . . 32
3.6 Query building time results for QBT . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.7 Query building time results for QBO . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.1 Referential Similarity between tables security, daily market and watch item . . . . . . 40
Chapter 1
Introduction
A database is a well-organized collection of related data. For example, an address book which
stores the names, phone numbers, and addresses of people you know represents a database. A database
management system (DBMS) is a collection of programs that enables users to create and maintain a
database. The DBMS is a general purpose software system that facilitates the process of defining,
constructing, manipulating, and sharing databases among various users and applications.
Since their introduction in the mid-1960s, DBMSs have enjoyed enormous success. An important
feature of a DBMS is that it offers data independence. Application programs utilizing the database
are insulated from the changes in the way data is structured and stored. A DBMS provides a suite
of sophisticated techniques to store and retrieve data efficiently. It also has a potential for enforcing
standards among database users in a large organization, for example, name and formats of data elements,
terminology, and display formats. A DBMS also ensures the security of the database by enforcing access
controls for users, and ensures durability: the recovery of the database in the face of failures, errors
of many kinds, or intentional misuse. Overall, the prime selling feature of the database approach has
been the reduced application development time. A DBMS provides support for important functions that
are common to all applications accessing data in the DBMS, making application development less time
consuming.
With the rapid increase of published information and the abundance of data, users require sophisti-
cated tools to simplify the task of managing data and extracting useful information in a timely fashion.
To deliver such sophisticated systems, database technology has made great strides in the area of data
storage, transaction management, concurrency control and query interfaces. As a result, modern day
DBMSs can efficiently process a large number of complex queries on any database. Although ad-
vances in database technology have concentrated heavily on functionality and performance, ‘usability’
of databases leaves a lot to be desired. The important aspect while discussing the usability of a database
is not just the design of the user interface, but also more importantly the process of interaction and the
underlying architecture.
In this thesis, we have made two contributions, first, we propose enhancements for the Query-by-
Object approach by using schema summarization techniques. Also, we propose an efficient approach
for generating schema summary by utilizing the schema structure and database documentation.
In the remaining part of this chapter, we will first overview existing efforts for providing access
to data in database systems and issues involved in providing efficient data access. Then we give an
overview of our proposed approach in the thesis. Further, we discuss the issue of schema summa-
rization, review the existing approaches for schema summarization and give an overview of proposed
approach for schema summarization. Finally, we mention the major contributions made in the thesis
and organization of the thesis.
1.1 Overview of Existing Efforts for Access Methods in Database Sys-
tems
In this section we discuss some of the common approaches for providing access to data in a database
system:
• Database Query Interfaces: Structured query models like SQL or XQuery are powerful means
of interacting with the database. SQL is a textual language with a simple English-like syntax, and
is widely implemented in most commercial database systems. Alternatively, users can use visual
query systems (VQSs) [1]. VQSs are query systems for databases that use visual representations
to locate information of interest and express related requests. VQSs can be seen as an evolution of
query languages and were aimed to improve the effectiveness of the human-computer interaction.
Query-by-Example [2], for example, allowed users to query a database by creating example tables
in the query interface and has influenced many commercial products like Microsoft Access. Form-
based interfaces are widely regarded as the most user-friendly querying method. A form is a
named collection of objects having the same structure. The structured representation of a query
form is an abstraction of conventional paper forms, therefore users felt at ease with the system.
The system presented in [3, 4] provided visual tools for users to frame queries using forms.
• Keyword Search: Searches are a specialized class of queries [5]. A search consists of keywords
representing the user’s information requirements, and the underlying data is usually a collection
of unstructured documents. A search engine retrieves the documents relevant to the query and
ranks the retrieved documents. The keyword search query mechanism allows users to freely
express their query requirements and coupled with instantaneous response time, makes it easier
to refine queries. Although a mainstay of Information Retrieval (IR) systems, the keyword-search
approach has been extended to the database domain as well [6]. Systems such as BANKS [7] and
DBXplorer [8] provide an IR-style keyword-based search engine over relational data.
• Information Requirement Elicitation: In the m-commerce environment, the ‘Information Require-
ment Elicitation’ (IRE) approach and its conceptual design were proposed by Sun [9]. IRE describes
an interactive communication in which information systems help users to specify their
requirements with adaptive choice prompts. Users initiate IRE sessions by expressing their needs.
In an IRE-enabled system, there is an IRE component, which is triggered upon receiving a user’s
request. The IRE component checks whether the information requirement is specific enough. If
not, the component generates choice prompts for the missing elements by utilizing user inputs,
user context, and user preference. The loop continues until the required request information can
be provided to the user. A prototype of IRE in an imagined m-commerce scenario is demonstrated
in [10].
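The elicitation loop described above can be sketched in a few lines. This is only our reading of the approach; the function names and the example requirement fields are invented for illustration and are not part of Sun's design [9]:

```python
# Sketch of an IRE-style elicitation loop: refine the user's requirement
# with choice prompts until it is specific enough to answer. All names
# here are illustrative, not from the original IRE design.

def elicit(requirement, generate_prompt, ask_user, is_specific):
    """Refine a partial requirement (a dict of known elements) until
    is_specific() judges it complete enough to serve."""
    while not is_specific(requirement):
        prompt = generate_prompt(requirement)    # choice prompt for a missing element
        choice = ask_user(prompt)                # user picks one of the offered options
        requirement = {**requirement, **choice}  # fold the answer into the requirement
    return requirement
```

A hypothetical m-commerce session would then call `elicit` with a prompt generator that inspects which elements (say, a category and a city) are still missing, looping exactly as the IRE component does.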
1.2 Overview of Proposed Approach
Structured query models like SQL/XQuery are very effective for expressing queries.
However, these models require a user to specify a query using a fixed syntax, have prior knowledge
of the database structure and model, and express the query in terms of that particular structure. Novice
users are not skilled at using SQL-like query languages, as such languages have a complex structure. While
VQSs offer a friendlier approach, systems like QBE do not perform well with large schemas. Secondly,
a user needs to be aware of the values in the database to fill the example tables. Another challenge for the
user is grasping the join relationships between data entities to express complex queries. Similarly, form-
based interfaces, although convenient for users, pose a limitation on the number of queries that can
be executed. The keyword-search approach is not entirely effective, as users express queries with complex
semantics and expect precise, complete results. The IRE approach proposed a grand framework whose
potential has not yet been fully realized. Based on the notion of IRE, the Query-by-Object (QBO) approach was
proposed for developing query interfaces. In this thesis, we propose enhancements to the existing QBO
approach, to design user interfaces efficiently for large databases. Another important area in the context
of database usability is to generate the summary for a complex database schema. As part of this thesis,
we also propose techniques to generate efficient schema summary.
1.2.1 Overview of Proposed Approach for Enhanced Query-by-Object Approach
In this section, we present a brief overview of the QBO approach and the challenges involved in
developing user interfaces based on the QBO approach. Later we describe an overview of the proposed
approach for an enhanced query-by-object approach.
1.2.1.1 Overview of Query-by-Object approach
IRE uses a series of steps to elicit information from users where each step adds to the information
about the user’s intent. However, IRE does not permit users to utilize the results of intermediate queries
to progressively build complex queries. Based on the notion of IRE, the Query-by-Object (QBO) in-
terface was proposed by Bhalla et al. in [11, 12] for the m-commerce environment. In this system, users
communicate with a DBMS through a web interface. The user’s intent is captured via objects and path
navigation through an option-based interface. In the end, a query is formulated and executed at the
DBMS server by converting it into its SQL equivalent.
Initially, the user is presented with an object menu. Users perform navigation operations and select
one or two objects at the desired level of granularity. Unlike IRE, the QBO approach supports the closure
property: each step executes and its result can be used in the next step. This allows users to
gather and combine query results, and to search for information in a logical way, whereby
intermediate results are refined or combined to get the intended result. The Query-by-Object approach has
been used to develop user interfaces for mobile devices [13], GIS systems [14] and e-learning systems
[15]. An empirical study was conducted in [14] to evaluate user interaction through the QBO interface.
The study showed that the QBO approach is easy, intuitive and simple to use for common users.
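The closure property can be illustrated with a minimal sketch. The object names and the set-based operators below are our own invention for illustration; the actual operators of the systems cited above differ:

```python
# Sketch of QBO-style closure: every operation on objects yields another
# object, so intermediate results can feed later steps. Object and
# operator names are illustrative only.

class QueryObject:
    """A named set of instances; operations return new QueryObjects."""

    def __init__(self, name, instances):
        self.name = name
        self.instances = set(instances)

    def combine(self, other):
        """An intersection-style (AND) operator between two objects."""
        return QueryObject(f"({self.name} AND {other.name})",
                           self.instances & other.instances)

    def merge(self, other):
        """A union-style (OR) operator between two objects."""
        return QueryObject(f"({self.name} OR {other.name})",
                           self.instances | other.instances)
```

Because `combine` and `merge` return `QueryObject`s, a result such as "budget hotels near the station" can itself be combined with a further object in the next step, which is exactly the progressive query building that IRE alone does not support.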
1.2.1.2 Issues with Query-by-Object approach
Designing user interfaces based on the QBO approach to provide information access for a general-purpose
database is a challenging issue [16]. The QBO approach uses a database to store objects and
operations, where each object corresponds to a relation in the schema. Developing user interfaces based
on the QBO approach becomes challenging when the complexity of the underlying database (schema and
data) increases. A large number of tables in the schema makes it harder for the user to locate his/her in-
formation of interest and see how it is related to other elements in the schema. The issue is compounded
when the object instances are large in number. Hence, there is a need for a better organization
of options available to the user in the QBO interface. Also, with the increase in the number of tables, the
number of pairwise operations between tables increases significantly.
1.2.1.3 Proposed Enhanced Query-by-Object Approach
To address the issues of QBO, first, we exploit the notion of detecting topical structures in databases
to represent the schema at a higher level of abstraction. Identifying topical structures allows tables which
are semantically correlated to be grouped together, which provides a better organization for options pre-
sented to the users. We use an elaborated approach called iDisc [17], which utilizes the database schema
structure and the data stored in the database to generate a clustering of schema entities, representing the
topical structure of the database. We discuss the iDisc approach in detail in Chapter 3. Secondly, in-
stead of defining operations between each pair of tables, we can define operations between topics and
within topics which can reduce the number of pairs for which operators need to be defined. Similarly,
to facilitate easier instance selection, we allow selection of instances based on attribute values of the
table. Later, we organize instances of an attribute into bins, providing a two-level hierarchy for instance
selection. The developer protocol is modified to include the steps required to generate the abstract
levels. Consequently, the user protocol is also modified for the proposed approach. We also discuss the
engineering of a prototype based on the proposed approach.

Figure 1.1 The TPCE schema without table categories
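A back-of-the-envelope count illustrates why grouping tables into topics reduces the developer's effort. The topic sizes below are hypothetical, and the actual within-topic and between-topic matrices are described in Chapter 3; this sketch only counts the operator pairs involved:

```python
# Counting operator pairs under plain QBO versus topic-based QBT.
# QBO needs an operator entry for every unordered pair of tables;
# QBT needs one between-topic matrix over the topics plus one
# within-topic matrix per topic. Topic sizes here are hypothetical.

def pairs(n):
    """Number of unordered pairs among n elements."""
    return n * (n - 1) // 2

def qbo_pairs(n_tables):
    """Operator entries when every pair of tables needs one."""
    return pairs(n_tables)

def qbt_pairs(topic_sizes):
    """Operator entries with one between-topic matrix and one
    within-topic matrix per topic."""
    between = pairs(len(topic_sizes))
    within = sum(pairs(s) for s in topic_sizes)
    return between + within
```

For example, a 33-table schema needs 528 pairwise entries under plain QBO, while a hypothetical grouping into four topics of sizes 9, 12, 5, and 7 needs only 139, which is the kind of reduction the topical organization aims for.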
1.2.2 Overview of Proposed Schema Summarization Approach
In this section, we briefly describe the issue of schema summarization, the approaches proposed in
the literature and the proposed approach for schema summarization.
1.2.2.1 Overview of Schema Summarization
Detecting topical structures of the database schema is an interesting challenge. In the literature, the
term schema summarization has been used interchangeably with detection of topical structures. Modern
enterprise databases consist of hundreds of interlinked tables. While users are accustomed to data being
represented in two-dimensional tables, grasping joins between tables is a challenge for general users.
For example, Figure 1.1 describes the schema diagram of the TPCE benchmark database. The TPCE
database simulates the working of an online brokerage firm. Although moderate in terms of
schema size, the complex relationships in the schema make it difficult for users to familiarize themselves
with the database schema. As the complexity of database schemas increases, the amount of time spent on
understanding the metadata and schema structure becomes significant.
Database normalization is a process of analyzing the given relation schemas based on their functional
dependencies and primary keys to achieve the desirable properties of (1) minimizing redundancy and (2)
minimizing insertion, deletion, and update anomalies. Unsatisfactory relational schemas that do not
meet the normal form tests are decomposed into smaller relation schemas that meet the tests and hence
achieve the desirable properties. However, through the process of normalization what users perceive
as a single independent unit of information is disintegrated into smaller relations. Coupled with odd
naming conventions for tables, this makes it harder for a user to locate his information of interest easily.
Schema summarization has been proposed to assist users in understanding a complex database schema
easily.
A schema summary represents a higher level of abstraction of the database schema. A user is initially
presented with a few important concepts from the database. Subsequently, a user can zoom into sections
of the schema in which he is interested. Generating a schema summary involves identifying semantically
correlated elements in the schema. Existing approaches [18, 19, 20, 21] exploit the schema structure and
data stored in the database to generate schema summary with a clustering based approach. In scenarios
where the data stored in databases is insufficient, the existing approaches suffer. In this thesis, given a
current snapshot of the database (schema), we investigate the database documentation as an additional
source of information and propose an algorithm to generate summary by exploiting the database schema
structure and documentation.
1.2.2.2 Proposed Approach for Schema Summarization
A foreign key relationship between two tables indicates that there exists a semantic relationship
between them. However, referential relationships alone do not provide good results [53]. Hence, we
attempt to supplement this referential similarity between tables with another notion of similarity, such
that tables belonging to one category attain a higher intra-category similarity. The additional similarity
criterion is based on the similarity between the passages of text describing the tables in the database
documentation. The intuition behind this notion of similarity is that the tables belonging to the same
category should share some common terms about the category in the documentation. We combine the
referential similarity and document similarity by using a weighted function and obtain a table similarity
metric over the relational database schema. After pairwise similarity between tables is identified, we
use a Weighted K-Center clustering algorithm to partition tables into k clusters. Experiments conducted
on a benchmark database show that the proposed approach is as effective as the existing data-oriented
approaches.
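As a rough illustration of this pipeline, the weighted combination and a greedy weighted k-center partitioning might be sketched as follows. The exact weighting, the influence weights, and the clustering details are given in Chapter 4; the code below is only a sketch under our own simplifying assumptions, and any table names or numbers used with it are invented:

```python
# Sketch: combine referential and documentation-based similarity with a
# weight alpha, then partition tables with a greedy weighted k-center
# heuristic. This is an illustrative reading of the approach, not the
# exact algorithm of Chapter 4.

def combined_similarity(ref_sim, doc_sim, alpha):
    """Weighted combination of referential and documentation similarity."""
    return alpha * ref_sim + (1 - alpha) * doc_sim

def weighted_k_center(tables, sim, weight, k):
    """Greedy k-center: seed with the most influential table, repeatedly
    add the table with the largest weighted distance (1 - similarity) to
    its nearest center, then assign every table to its most similar
    center."""
    centers = [max(tables, key=weight)]
    while len(centers) < k:
        far = max((t for t in tables if t not in centers),
                  key=lambda t: weight(t) * min(1 - sim(t, c) for c in centers))
        centers.append(far)
    clusters = {c: [] for c in centers}
    for t in tables:
        best = max(centers, key=lambda c: sim(t, c))
        clusters[best].append(t)
    return clusters
```

The choice of a k-center-style objective reflects the goal stated above: each cluster is anchored by an influential table (a cluster center), and every other table joins the center it is most similar to under the combined metric.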
1.3 Thesis Contribution
• Proposed enhancements to the existing QBO approach by detecting topical structures in the
databases.
• Presented an advanced system to query relational databases, based on the enhanced QBO ap-
proach.
• Explored the database documentation as a source of information for generating schema summary
and proposed an algorithm to exploit the database schema and documentation to generate efficient
schema summary.
1.4 Thesis Organization
In the next chapter, we discuss the QBO approach and the issue of designing QBO interface for com-
plex databases. We discuss proposed enhancements over the existing QBO approach and also discuss
results from system and user level evaluation of the system. We also study an advanced system for
querying relational databases, the tool used in usability evaluations of the approach proposed in Chapter
3. In Chapter 4, we present the problem of summarizing a relational database and propose an algorithm
to generate schema summary by utilizing schema and documentation. We also discuss a thorough ex-
perimental evaluation of the proposed approach. Chapter 5 presents the summary of the work discussed
in the thesis, conclusions, and the future work.
Chapter 2
Related Work
One of the earliest works in the field of database usability [22] focused on analyzing the expressive
power of a declarative query language SEQUEL, in comparison to natural language. However, the
importance of usability in database systems was first addressed in [23]. Since then, most of the research
efforts in the context of database usability have been focused on developing innovative query interfaces.
In [24], the author describes the initial enthusiasm and user-induced frustration in building interactive
information systems.
In 2007, Jagadish et al. [25] provided a second wind to the research domain, discussing a set of
five ‘pain points’ on why databases are so difficult to use. The first pain point describes how a complex
schema structure makes it hard for the users to locate their information of interest and construct relevant
queries. The authors propose that an abstraction of the presentation data model is needed to allow users
to structure information in a natural way. As users have different views on the organization of data in a
database, various personalized presentation models are developed for different classes of users. However,
when users are presented with multiple views, they do not understand the underlying difference between
the views and tend to become confused and lose trust in the system. This issue is discussed as the
second pain point in the context of database usability. The third pain point deals with the issue of
users getting an unexpected result, or being unable to pose a query, without any explanation from the
database system. The fourth pain point describes that the existing query interfaces are not modeled as
WYSIWYG (What-You-See-Is-What-You-Get), which is a desired quality in any user interface. The
last pain point discusses that the creation of a database is a challenging task for novice users and is a
reason why a lot of modern day information is not present in databases. The authors later introduced a
presentation data model for direct data manipulation with a schema-later approach.
An important aspect of usability in databases is to provide information access with minimal efforts
to database end-users. In the literature, various visual query systems and textual interfaces have been
proposed to provide efficient data access. We review some of the prominent works in the field of query
systems in Section 2.1. In the context of improving database usability, generating schema summary
for complex database schemas has also received attention of late. We review some of the proposed
techniques for generating efficient schema summary in Section 2.2.
2.1 Innovative Query Interfaces
2.1.1 Visual Interfaces
Using visual representations for query specification is perhaps the most researched field in the context
of database usability. Query-by-Example (QBE) [2] was one of the first graphical query languages
with minimum syntax developed for database systems. QBE and its successor Office-by-Example [26]
were both based on domain relational calculus [27]. In QBE, rather than specifying a query using a
fixed syntax; the query is formulated by filling templates of the relations, displayed visually on the
computer screen. The inputs to the template can be translated into an SQL equivalent and executed on
the database. Using QBE requires no knowledge of syntactic constructs or of the internal structure of the database, as users are presented only with table skeletons. QBE is relationally
complete. With some additional commands, condition boxes and other constructs, users can express
all queries that belong to the class of relational algebra. It has been an influence on developing visual
querying facilities in products like Microsoft Access, IBM Visual XQuery Builder, Borland’s Paradox
and open source tools like query builder for phpMyAdmin.
Query by Templates (QBT) [28] was a generalization of QBE language for databases modeled with
SGML. QBT incorporates the structure of the documents for composing powerful queries by displaying
a template for a representative entry in the database. The template describes the type of data expected
in the database. The user specifies examples of data in the template, and the system retrieves data matched
by the user-specified template, similar to QBE. QBT allows various templates like flat templates, nested
templates and structured templates, unlike QBE where the table skeleton is the only available template.
Query-by-Diagram (QBD*) [29] is a visual query system that allows navigation based on abstractions of the E-R semantic model. QBD* allows users to extract information from the database without
worrying about the logical model of the schema. The process of query formulation in QBD* is as fol-
lows: The query structure is based on the selection of a main concept, which is the first entity selected
by the user. The user then performs navigation on the ER model to select paths starting from the main
concept. The path represents a subquery that selects a subset of instances of the main concept. Set-
based operations like union, intersection and difference are available to combine various subqueries.
The main feature of QBD* is that it provides a graphical mechanism capable of expressing recursive
queries (transitive closure).
Query-by-Icon (QBI) [30] provides an icon-based visual query system capable of querying and ex-
ploring databases. QBI provides an interface with pure iconic specifications, without the use of any
diagrams. A user perceives the underlying database as a set of classes, each having several properties
called generalized attributes (GA). Generalized attributes encapsulate and hide from the user the details
of specifying a query. To construct a query, users select compatible classes via their corresponding
icons. When users select a class, its GAs are used to define the selection condition. Similarly, the user
also selects GAs, which will be part of the output result. Query results are saved to be explored further
in the construction of complex queries. A comparison study of QBD* and QBI [1] suggested that expert
users perform better using the QBD* system while QBI performed slightly better for non-expert users.
VISIONARY [31] is a visual query language, based on diagrammatic paradigm like QBD*. In
VISIONARY, a vision represents the external data model that uses a combination of icons and text to
provide visual primitives of concepts and associations, each represented by a name and a multiplicity.
Users formulate queries by choosing a primary concept, the selection predicates and the attributes to be
retrieved in output. If the interpretation given to a query is not the one the user had in mind, the user can
force a different interpretation by disabling some associations and enabling others. The internal data
model is relational, using SQL as the query language. An intermediate graph-based model provides
mapping between the visual and the relational models.
Kaleidoquery [32] is a powerful visual query language for object databases, supporting the capabil-
ities of the OQL language. Kaleidoquery uses a visual filter flow model, where filters are used to filter
out information of interest for users. The class instances are considered as an information flow, and information is filtered using constraints on the class attributes. The output of the query is then examined, or
it flows into other queries to be further refined. Kaleidoquery separates the tasks of writing the query
constraints and organizing the structure and ordering of the results, providing a more dynamic evolution
of queries than OQL.
Liu and Jagadish [33] designed a spreadsheet algebra for the relational database that continuously
presents data to users in a WYSIWYG (What You See Is What You Get) manner. By dividing query
specification into progressive refinement steps, users can extend intermediate results to construct com-
plex queries. The data manipulation actions are reversible, and users can modify an operation specified
earlier without redoing the later operations. Users can also specify at least all single-block SQL queries
without being exposed to complex database concepts. Non-technical users benefit from the direct ma-
nipulation interface as it allows easier and more accurate specification of queries.
VISQUE [34] describes a visual interaction language by exploiting End-User Development tech-
niques, web-based user interface design and data models. VISQUE uses knowledge visualization tech-
niques like a tree-based metaphor to represent a multidimensional database schema and also allows
the construction of complex SQL-like queries, including set-based, nested and aggregation queries.
Due to the popularity of touch-based and motion-tracking devices, research efforts have been made
to design user interfaces that allow gesture-based querying over relational databases [35]. The database
query interface allows users to manipulate results directly by interacting with them in a sequence of ges-
tures. Corresponding to each table, a view is created in the workspace that can be directly manipulated.
Each gesture denotes a single manipulation action and impacts only the view. Users need to learn only a few gestures, each corresponding to an action. Users can undo each action to return to the previous workspace
state. Each action corresponds to the execution of a specific SQL query. Actions are stackable and can
be performed in sequence, manipulating tables in the workspace till the desired result is achieved.
Application developers designing query interfaces for a specific purpose prefer to use form-based
interfaces [4, 3]. In form-based interfaces, the user is presented with a list of searchable fields, each with
an entry area that can be used to indicate the search string. To pose a query, the user needs to fill in the
areas of the form relevant to their search. The form-based approach is especially relevant as end-users
are accustomed to manual form-based work.
In [4], the authors study a simple form model that includes hierarchically structured forms with an
event-driven routing. To assist users in the creation of forms for view definitions, an inference com-
ponent was provided to create view definitions consisting of the hierarchical structure and functional
dependencies among form fields. The inference component uses a collection of rules and heuristics
along with a purposeful dialog. The Expert Database Design System [4] assists a designer in the view
integration process. The system provides rules for incrementally integrating the form views and heuris-
tics for mapping the form fields into entity types and relationships. Some other form-based systems for
databases are the GRIDS system [36], which allowed users to pose queries in a semi-IR fashion, and the
Acuity Project [37], which used form generation techniques for data-entry operations such as updating
tables in a relational database.
In [38], the authors tried to automate the construction of query forms. With a limited number of forms, the system can express a wide range of queries, which helps relax the restriction on expressiveness posed by form-based querying mechanisms. Given a set of interesting queries, similar queries are identified and subsequently clustered, so that the queries in each cluster can be posed using a single form.
2.1.2 Text Interfaces
With the explosion of data availability on the web and the ease of access to data through search
engines, we observe databases playing second fiddle in terms of popularity. Search engines, for example,
Google [39], allow users to issue keyword-based queries freely and, coupled with instantaneous response times, provide a satisfactory experience for the user. While there is still room for improvement, the
success story of the web search engines suggests that any data management system is more useful if
users can extract information from the system with minimal efforts.
Keyword searches in databases [6] allow users to query databases using a set of keywords. The
BANKS system [7] integrates keyword querying and interactive browsing of databases. BANKS models
a database as a graph, where tuples correspond to nodes, and foreign-key and other links between tuples
correspond to edges. Answers to a query are modeled as rooted trees connecting tuples that match
individual keywords in the query. Answers are ranked using a notion of proximity coupled with a
notion of prestige of nodes based on inlinks, the latter being inspired by techniques developed for web
search. Another keyword-search system DBXplorer [8], uses a symbol table that is used at search time
to determine efficiently the locations of query keywords in the database. Given a set of keywords, the
symbol table is looked up to identify the relevant database tables, and all potential subsets of tables that might jointly contain rows having all the keywords are enumerated. For each enumerated join
tree, a SQL statement is constructed (and executed) that joins the tables in the tree and selects the rows
that contain all keywords. The system then presents the final rows to the user.
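The symbol-table lookup described above can be sketched as follows; the keyword-to-table mapping and the table names are invented for illustration, and the actual DBXplorer system additionally prunes candidate sets using the schema's join graph before constructing join trees.

```python
# Illustrative sketch of DBXplorer-style keyword lookup (not the
# actual system): a symbol table maps each keyword to the tables that
# contain it, and a candidate set of tables must jointly cover every
# query keyword. The table names and mapping below are invented.
from itertools import combinations

symbol_table = {
    "jack":   {"actor"},
    "matrix": {"film"},
    "1999":   {"film", "film_actor"},
}

def candidate_table_sets(keywords, symbol_table):
    """Enumerate subsets of tables that jointly cover all keywords."""
    tables = set().union(*(symbol_table[k] for k in keywords))
    covers = []
    for r in range(1, len(tables) + 1):
        for subset in combinations(sorted(tables), r):
            if all(symbol_table[k] & set(subset) for k in keywords):
                covers.append(set(subset))
    return covers

# The smallest cover for ["jack", "matrix"] spans two tables, so a
# join tree over `actor` and `film` would be constructed next.
candidates = candidate_table_sets(["jack", "matrix"], symbol_table)
```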
Keyword search has also been extended for XML databases. The aim of such systems is to identify
the smallest element that contains most of the keywords [40] or the smallest element that is meaningful
[41]. In [42], the authors describe ObjectRank which uses a metric of authority transfer on a data graph
to improve result quality for ranking results in keyword searches in the database. Ranking of SQL query
results has been studied in [43] using probabilistic models.
Research efforts have also focused on combining form-based approaches and keyword search. Given
a set of keywords, a system retrieves a set of forms instead of query results [44, 45]. The systems
create inverted SQL queries from the SQL queries in the forms. Unlike traditional keyword search on
databases, the techniques do not require any special purpose indices, and instead make use of standard
text indices supported by most database systems.
Some information systems use a ‘page-and-link’ approach for accessing data resources, for example,
a web directory. A Web Directory is a repository of Web pages that are organized into a topic hierarchy.
Typically, directory users locate the information sought simply by browsing through the topic hierarchy,
identifying the relevant topics and finally examining the pages listed under the relevant topics. Some of
the common web directories include [46, 47]. Users select related links as per their needs; each link helps them narrow down to the information they seek.
2.1.3 Other Works
In [48], the authors proposed a new paradigm for data interaction called guided interaction, which
uses interaction to guide users through the query construction, query execution and result examination processes. The authors mandate that databases should be responsive to the user, and all possible actions
are enumerated so as to allow discovery and exploration. The database can also preemptively deliver
insights to aid in query construction. The proposed paradigm applies to any general database interaction interface, whether that be SQL-writing, form-filling, keyword-typing or any other interface.
The authors suggested how information in the database could be leveraged to guide a user during query
construction by following these core principles.
Query recommendation is a popular feature in modern systems, especially search engines. These
recommendations are built by mining search query logs from existing users [49, 50]. The method
proposed in [49] is based on a query clustering process that identifies semantically similar queries by
exploiting historical preferences of registered users. The method also ranks the semantically correlated
queries. In [50], the authors model a search-engine user’s sequential search behavior, representing it as a query refinement process. This model is combined with a traditional content-based similarity method
to compensate for the high sparsity of real query log data. In [51], the concept of auto-completion
was proposed to rapidly suggest predicates to the user to create conjunctive SELECT-PROJECT-JOIN
queries. In [52], the authors proposed a method to mine SQL query logs and identify potential query
templates.
In CompleteSearch [53], Bast et al. modify the inverted index data structure to provide incrementally
changing search results for document search. TASTIER [54] provides find-as-you-type in relational
databases by partitioning the relation graph. In the information retrieval area, Anick et al. [55] achieve
interactive query refinement by extracting key concepts from the results and presenting them to the user.
Faceted search [56] extends this to present the user with multiple facets of the results, allowing for
mixing of search and browse steps.
2.2 Schema Summarization
2.2.1 Schema Matching
Information integration is an important challenge in data management [57, 58]. Schema matching [59] involves identifying semantic correspondences or mappings among attributes from different databases. In [60, 61], the authors describe schema-oriented approaches for finding correlated schema elements using names, descriptions, relationships and constraints. In [62], the authors proposed an integrated approach, combining linguistic matching with a structure matching process. In [63], a fragment-oriented approach was proposed for matching large schemas to reduce the matching complexity. Identifying mappings
is analogous to finding similarity between schema elements belonging to two different schemas.
2.2.2 Mining Database Structures
Mining database structure has received attention recently [64, 65, 66]. Bellman [64] performs data mining on the database structure, identifying attributes with similar values and discovering join relationships among tables along with their directions and sizes. Such analysis can help in preparing data for data
mining or for identifying foreign keys for schema mapping. In [65], the authors addressed the problem
of mining a data instance for structural clues. The structural clues help in identifying data instances
that may contain errors, missing values, and duplicate records that may ultimately be helpful in data
design. The authors proposed a set of information-theoretic tools that identify structural summaries that
are useful for characterizing the information content of the data.
2.2.3 Topical Structures in Databases
Wu et al. [21] proposed an elaborate approach, iDisc, to discover topical structures in relational databases. The approach first models the database in three representations: graph-based, vector-based and similarity-based. The graph-based representation models the database as a graph where the tables represent
the nodes in the graph, and foreign-key relationships represent the edges. In the vector-based representation, each table is modeled as a document, and hence the database is represented as a collection of documents.
Similarity-based representation computes a similarity matrix by considering the similarity of attributes
between schema elements. iDisc then performs clustering on each of the database representations and
then combines the clusterings using a voting scheme to generate topical structures.
2.2.4 Schema Summarization
The problem of schema summarization was introduced by Jagadish et al. in [18]. The proposed approach for generating a schema summary utilized abstract elements and abstract links. Each abstract element represents a cluster of original schema elements, and each abstract link represents one or more links between the schema elements within those abstract elements. The authors used the notion of summary importance
and summary coverage to generate schema summary representing important schema elements with a
broader coverage.
The approach in [18] was proposed in the context of XML schemas. The assumptions made in [18] do not carry over to relational schemas. Yang et al. [20] proposed an improved algorithm for relational
schema summarization. The authors proposed a new definition for the importance of tables in a rela-
tional database schema based on information theory and statistical models. The authors also described
a novel distance function that quantified the similarity between elements in the schema. Based on the
distance function, a clustering based approach was proposed for generating schema summary.
In [19], the authors apply the technique of community detection in social networks for schema sum-
marization. The approach partitioned the database schema elements into subject groups by using mod-
ularity based community detection. By utilizing the table importance measure proposed in [20], a
hierarchical clustering algorithm was proposed to build a multi-level navigation structure in the schema summary.
The schema summary described foreign-key relationships, subclass relationships and overlap of data
instances.
2.3 Discussion
Although VQSs like QBE and its derivatives are relationally complete and more user-friendly than SQL/XQuery, they still require prior knowledge of the schema structure and, to some extent, a grasp of the join relationships between tables. Query interfaces like form-based interfaces restrict the number of queries a user can construct for data access. In the proposed effort, we aim to provide an easy-to-use interface for novice users, which expands the range of queries a user can execute on the system.
In keyword search systems, although users are content with querying using keywords, they often need to express more complex query semantics. Also, users expect precise and complete answers to their queries, whereas keyword-search-based systems may return many irrelevant results without any explanation. Query recommender systems face a similar scenario. In the proposed approach, we
emphasize providing precise and complete answers, like a structured query language.
Schema matching involves identifying semantic correspondences or mappings among attributes from
different databases whereas the proposed approach identifies semantically correlated elements within
a schema. In [64, 65], the aim was to identify semantic relationships (foreign key) between tables.
The proposed approach aims to identify clusters of strongly correlated schema elements. The existing
schema summarization approaches [18, 20, 19, 21] are data oriented, utilizing schema and data available
in the tables. In contrast, the proposed approach uses schema information and database documentation
to generate schema summary.
Chapter 3
Enhanced Query-by-Object Approach for Information Requirement
Elicitation in Large Databases
Databases are more useful when users can extract information from them with minimal effort. Most database systems provide powerful, structured query models like SQL to query the database.
However, these models require users to specify an unambiguous query explicitly using a fixed syntax
and have a prior knowledge of the database structure, which is unfavorable for novice users. Hence, al-
ternate query interfaces are required for information access that are more suited to the skills of a novice
user yet still provide expressive power like SQL. Research efforts are going on to design efficient query
interfaces that simplify the process of accessing information stored in a database.
Information Requirement Elicitation (IRE) [9] proposes an interactive framework for accessing infor-
mation. IRE proposes that user interfaces should allow users to specify their information requirements
using adaptive choice prompts. In the literature, Query-By-Object (QBO) approach has been proposed
to develop user interfaces for mobile devices [13], GIS systems [14] and e-learning systems [15] based
on IRE framework. The QBO approach provides a web-based interface for building a query using mul-
tiple user-level steps. The main advantage of this approach is the simplicity with which a query can be expressed. The QBO
approach uses a database to store the objects and entities. However, for databases with a large number of tables and rows, the QBO approach does not scale well.
In this chapter, we propose an improved QBO approach, Query-by-Topics (QBT), to design user
interfaces based on IRE framework that works on large relational databases. In the proposed approach,
we represent the objects at a higher level of abstraction by clustering database entities and representing
each cluster as a topic. Similarly, we organize instances of an entity in groups based on values of a user-
selected attribute. The aim of this chapter is not to propose an approach for detecting topical structures
but rather to show how such an approach can be applied in practical scenarios like information systems.
Experiments were conducted at the system and user level on a real dataset using a QBT based prototype
and the results obtained are encouraging.
The rest of the chapter is organized as follows. In Section 3.1, we explain the QBO approach and
discovering topical structures in a database. In Section 3.2, we present the proposed framework. In
Section 3.3, we discuss the prototype development based on the proposed approach. In Section 3.4,
Figure 3.1 QBO user protocol
we present experiments and analysis of the proposed approach. The last section contains summary and
conclusions.
3.1 Background
In this section, we explain the Query-By-Object Approach (QBO) in detail and also describe the
framework for discovering topical structures in databases.
3.1.1 Query-by-Object Approach
The ‘Information Requirement Elicitation’ [9] framework allows users to build their queries in a series
of steps. The result of each step is used to determine the user’s intent. Based on the notion of IRE,
the Query-By-Object (QBO) approach was proposed in [14]. In this approach, the user communicates
with a database through a high-level interface. The initial intent of the user is captured via selection of
objects from an object menu. The user navigates to select the granularity of these objects and operators
to operate between the selected objects. The user’s actions are tracked in a query-bag, visible to the
user at all stages. Finally, an SQL equivalent query is formulated and is executed at DBMS server. In
the IRE framework, intermediate queries cannot be utilized further and hence, there is not much support
for complex queries. In QBO, the user is allowed to gather and combine query results. This is supported by
closure property of the interface. It states that the result of an operation on objects leads to the formation
of another object. Hence, the results of a query can be used to answer an extended query. As the QBO
interface involves multiple user level steps, non-technical users can easily understand and use the system
for retrieving information from the databases. The developer protocol and user protocol (Figure 3.1) for
the QBO approach are as follows:
3.1.2 Example
Consider an example where a developer builds a QBO based system that users will query.
System development based on QBO Developer Protocol: The following steps are taken by the devel-
oper:
           | film     | actor    | film actor
film       | U, I, C  | R        | R
actor      | R        | U, I, C  | R
film actor | R        | R        | U, I, C

Table 3.1 Operator Matrix for Example 1
QBO Developer Protocol:
1. Store objects and entities in an RDBMS.
2. Define operators for each pair of objects.
3. Provide IRE-based object selection, operation selection and support for the closure property.

QBO User Protocol:
1. Select an object.
2. Select granularity of object.
3. Select another object.
4. Select the operator.
5. Display result.
6. If required, extend the query by selecting another object.

Table 3.2 QBO Developer and User Protocols
• Database:
– film - (film id, film name, film rating)
– actor - (actor id, actor name)
– film actor - (film id, actor id, actor rating)
• In this approach, the relations in the entity-relationship (ER) data model are considered as objects.
Next, the developer defines pairwise operations between these objects. Four types of operators were
proposed: UNION (U), INTERSECT (I), COMPLEMENT (C) and RELATE (R). The ‘RELATE’
operator has different connotations depending on the chosen objects it operates on. The pairwise
operations are shown in Table 3.1.
• A web-based interface provides a list of objects, instances and operations user can select from.
The system also allows the user to combine query responses.
Steps taken by the user based on QBO User Protocol: Consider an example query that the user is
interested in: Find all actors who have worked with the actor ‘Jack’. Such a query can be expressed with
QBO as: Find names of films actor ‘Jack’ has worked in, then find names of actors who worked in these
films. User level steps are:
• Select object: actor
• Select granularity: actor-‘Jack’
• Select another object: film
Figure 3.2 The iDisc approach
• Select operator: Relate
• Select another object: actor
• Select operator: Relate
• Display result
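Under the hood, the user-level steps above can be translated into roughly the following SQL; the toy data and the underscore-separated identifiers are assumptions for illustration, not the system's actual translation.

```python
# Rough SQL equivalent of the QBO steps above, executed on a toy copy
# of the example schema (the data and underscore-separated column
# names are invented for illustration).
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE actor(actor_id INTEGER, actor_name TEXT);
CREATE TABLE film(film_id INTEGER, film_name TEXT);
CREATE TABLE film_actor(film_id INTEGER, actor_id INTEGER);
INSERT INTO actor VALUES (1, 'Jack'), (2, 'Mary'), (3, 'Tom');
INSERT INTO film VALUES (10, 'Film A'), (11, 'Film B');
INSERT INTO film_actor VALUES (10, 1), (10, 2), (11, 3);
""")

# Steps 1-4 relate 'Jack' to his films; steps 5-6 relate those films
# back to actors, extending the query via the closure property.
rows = con.execute("""
    SELECT DISTINCT a2.actor_name
    FROM actor a1
    JOIN film_actor fa1 ON fa1.actor_id = a1.actor_id
    JOIN film_actor fa2 ON fa2.film_id = fa1.film_id
    JOIN actor a2 ON a2.actor_id = fa2.actor_id
    WHERE a1.actor_name = 'Jack' AND a2.actor_name <> 'Jack'
""").fetchall()
# rows -> [('Mary',)]
```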
3.1.3 Discovering Topical Structures in Databases
Discovering topical structures in databases allows us to group semantically related tables in a single
group, helping in identifying what users might perceive as a single unit of information in the database.
Consider a database D, consisting of a set of tables T = {T1, T2, ..., Tn}. The topical structure of D describes a partitioning C = {C1, C2, ..., Ck} of the tables in T such that the tables in the same partition have a
semantic relationship and belong to one subject area. In [17], the authors proposed iDisc, a system
which discovers topical structure in a database by clustering tables into quality clusters. Clustering [67]
is the process of grouping a set of data objects into multiple groups or clusters so that objects within a
cluster have high similarity, but are very dissimilar to objects in other clusters.
The iDisc approach is described in Figure 3.2. The input to iDisc is D consisting of a set of tables
T and returns a clustering C of the tables in T . In the iDisc approach, a database is first modeled by
various representations namely vector-based, graph-based and similarity-based.
In the vector-based model, each table is represented as a document in a bag-of-words model, and a
database is hence represented as a set of documents. In the graph-based model, the database is repre-
sented as an undirected graph. The nodes in the graph are the tables in the database (T ). Two tables Ti
and Tj share an edge in the undirected graph if there exists a foreign key relationship between Ti and
Tj . In the similarity-based representation, a database D is represented as an n × n similarity matrix M , where n = |T | and M [i, j] represents the similarity between tables Ti and Tj . The similarity
between two tables is calculated by finding matching attributes based on a greedy matching strategy
[68]. The table similarity is then obtained by averaging the similarities of the matched attributes.
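A simplified sketch of this greedy attribute-matching strategy follows, using a generic string-similarity measure as a stand-in for the exact attribute similarity of [68]:

```python
# Simplified sketch of the similarity-based representation: attribute
# pairs are matched greedily in best-first order of a name-similarity
# score, and table similarity is the average score of the matched
# pairs. The string measure below stands in for the metric of [68].
from difflib import SequenceMatcher

def attr_sim(a, b):
    return SequenceMatcher(None, a, b).ratio()

def table_similarity(attrs_i, attrs_j):
    pairs = sorted(((attr_sim(a, b), a, b)
                    for a in attrs_i for b in attrs_j), reverse=True)
    used_i, used_j, scores = set(), set(), []
    for s, a, b in pairs:  # greedy best-first matching
        if a not in used_i and b not in used_j:
            used_i.add(a)
            used_j.add(b)
            scores.append(s)
    return sum(scores) / len(scores) if scores else 0.0

# Each entry M[i][j] of the similarity matrix would be filled by
# calling table_similarity on the attribute lists of tables Ti, Tj.
```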
In the next phase, clustering algorithms are implemented for each database representation model.
The vector-based and similarity-based models use a hierarchical agglomerative clustering approach. A cluster quality metric is defined to measure cluster quality. For the graph-based repre-
sentation, shortest path betweenness and spectral graph partitioning techniques are used for partitioning
the graph into connected components. Similar to other representations, a cluster quality metric is used
to measure the quality of the connected components. After the clustering process ends, the base-clusterer for each representation selects the clustering with the highest quality score, yielding a preliminary clustering for each representation.
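The select-the-best-clustering step can be sketched with a toy average-linkage agglomerative procedure; the quality metric used here (mean intra-cluster minus mean inter-cluster similarity) is an illustrative assumption, as iDisc defines its own metric.

```python
# Toy agglomerative clustering over a symmetric similarity matrix M,
# keeping the merge level with the highest quality score. The quality
# metric (mean intra-cluster minus mean inter-cluster similarity) is
# an illustrative assumption; iDisc defines its own metric.
def quality(clusters, M):
    intra = [M[i][j] for c in clusters for i in c for j in c if i < j]
    inter = [M[i][j]
             for a in range(len(clusters))
             for b in range(a + 1, len(clusters))
             for i in clusters[a] for j in clusters[b]]
    mean = lambda v: sum(v) / len(v) if v else 0.0
    return mean(intra) - mean(inter)

def cluster_sim(a, b, M):  # average linkage between two clusters
    return sum(M[i][j] for i in a for j in b) / (len(a) * len(b))

def best_clustering(M):
    clusters = [[i] for i in range(len(M))]
    best, best_q = [list(c) for c in clusters], quality(clusters, M)
    while len(clusters) > 1:
        # merge the most similar pair of clusters
        x, y = max(((x, y) for x in range(len(clusters))
                    for y in range(x + 1, len(clusters))),
                   key=lambda p: cluster_sim(clusters[p[0]],
                                             clusters[p[1]], M))
        clusters[x] += clusters.pop(y)
        q = quality(clusters, M)
        if q > best_q:
            best, best_q = [sorted(c) for c in clusters], q
    return best

# Four tables forming two tight pairs (invented similarities):
M = [[1.0, 0.9, 0.1, 0.1],
     [0.9, 1.0, 0.1, 0.1],
     [0.1, 0.1, 1.0, 0.8],
     [0.1, 0.1, 0.8, 1.0]]
groups = best_clustering(M)  # groups tables {0, 1} and {2, 3}
```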
After identifying the preliminary clusterings, iDisc uses a multi-level aggregation approach that combines the results from each base-clusterer through a voting scheme to generate the final clusters. A clusterer-boosting technique is also used in the aggregation, assigning weights to the base-clusterers to produce a more accurate final clustering. Finally, a representative for each cluster is discovered using an importance metric based on the centrality scores of the tables in the graph-based representation. The output of iDisc is a clustering of the tables in the database, where each labeled cluster represents a topic.
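The voting-based aggregation can be sketched as follows. The majority threshold and the union-find merging are our simplifications of iDisc's multi-level scheme, with the per-clusterer weights standing in for the boosting step:

```python
def same(clustering, i, j):
    """True if tables i and j share a cluster in this clustering."""
    return any(i in c and j in c for c in clustering)

def aggregate(n, clusterings, weights):
    """Combine base clusterings by weighted voting on table pairs: two
    tables end up together if the weighted vote that they belong to the
    same cluster exceeds half the total weight."""
    total = sum(weights)
    parent = list(range(n))

    def find(x):                          # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            vote = sum(w for c, w in zip(clusterings, weights) if same(c, i, j))
            if vote > total / 2:          # majority of (boosted) weight
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Three base clusterings of 4 tables; the first clusterer is boosted.
final = aggregate(4,
                  [[[0, 1], [2, 3]], [[0, 1], [2], [3]], [[0], [1], [2, 3]]],
                  weights=[2, 1, 1])
```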
3.2 Proposed Approach
In this section, we first present a case study for eSaguTM , an IT-based personalized agro-advisory
system. From the case study, we highlight our motivation and the problem we aim to solve. Later, we
discuss the proposed approach in detail.
3.2.1 Case Study
The eSagu system aims to improve the productivity of farms by delivering high-quality personalized (farm-specific) agro-expert advice in a timely manner to each farm at the farmer's doorstep, without the farmer asking a question. In eSagu, the agriculture scientist, rather than visiting the crop in person,
delivers the expert advice by getting the crop status in the form of both digital photographs and the
related information. The eSagu system records data about the farmers, farm history, sowing details,
soil details, crop details and information about problems/diseases observed by farmers. Agro-experts
need to analyze the observation data from various perspectives to deliver personalized advice and have
complex query requirements. Also, query requirements tend to change frequently. The agro-experts are
familiar with the data domain but are not technical experts. Hence, there is a need for a higher-level interface and presentation model to access data in the eSagu system. The issue is that a query interface proposed to elicit the information requirements of non-technical users should be easy to use while still allowing users to pose a wide range of queries.
The QBO approach and its merits have been discussed in Section 3.1. To design user interfaces based
on QBO to provide information requirement elicitation for eSagu, we face the following scenarios:
• Implement the eSagu system in a RDBMS, where each table would correspond to an object. The
eSagu database consists of 84 tables.
• Define operations between 84 × 84 object pairs.
• Provide a web-based interface presenting a list of tables (84 tables) and instances (some tables containing more than 10^4 rows).
Use Case: Consider the scenario when a user is trying to query the eSagu database using a web-based
interface designed using the developer’s protocol. The user protocol would include:
• Select an object: a user would have to analyze a list of 84 objects and locate his object of interest.
• Select granularity or instance selection: Even if instance selection is based on attribute values,
attributes can have a large number of distinct values.
• Select operator: A user would have to grasp how each object would relate to other objects.
A complex database may contain a large number of tables in the schema due to conceptual design or
schema normalization. In such cases, it is difficult for the user to locate his information of interest.
A naive solution, to organize objects alphabetically, may not be efficient. For example, in the eSagu
database, there are 35 tables for various crop observations, cotton observation, crossandara observation
and likewise 33 others. If a user wants to browse through all such observation tables, he would need to
know all the crop names. An organized list where crop observation tables are grouped together and then
sorted alphabetically would be more intuitive for the user. Hence, when objects are numerous, there is a need to represent them at a higher level of abstraction. Similarly, better organization is needed when object instances are numerous.
In general we are faced with the following problems for QBO developers and users:
• Large number of tables in the schema makes it harder for the user to locate his information of
interest.
• With a large number of instances in each table, selection of desired instance becomes difficult.
• With a large number of tables, the number of pairwise operations between tables also increases. For n tables in the schema, in the worst case n × n operational pairs exist.
3.2.2 Basic Idea
In the proposed approach, we exploit the notion of detecting topical structures in databases to represent the schema at a higher level of abstraction. Identifying topical structures allows tables that are semantically correlated to be grouped together, which provides a better organization for the options presented to the users. Secondly, instead of defining operations between each pair of tables, we can define operations between topics and within topics; hence, the number of pairs for which operators have to be defined can be reduced significantly. Similarly, to facilitate easier instance selection, we organize the instances of an attribute into bins, providing a two-level hierarchy for instance selection. The developer protocol is modified to include the steps required to generate these abstract levels. Consequently, the user protocol is also modified for the proposed approach.
The proposed approach has the following additional processes to QBO:
• Organizing objects into topical structures.
• Facilitating instance selection.
• Defining operators for the topical structure.
We discuss each of these processes in detail in the following subsections.
3.2.2.1 Organization into topics:
For organizing objects into topical structures, we use the iDisc approach described in Section 3.1.
Given a database containing a set of tables T = (T1, T2, ..Tn) as input, the iDisc framework generates a
clustering C = (C1, C2, ..Ck) of tables in the schema along with representative tables for each cluster
L = (L1, L2, ..Lk). Ci represents the set of tables belonging to the ith cluster, where Li represents the
representative table of the cluster Ci. The name of the representative table is used as the label of the cluster. Each labeled cluster collectively represents a topic in the database.
In QBO approach, the hierarchy of information organization is as follows:
Tables→ Attributes→ Attribute Instances
After generating topical structures of the database, we make the following modification in the hierarchy
of organization:
Topics→ Tables→ Attributes→ Attribute Instances
In other words, we introduce topics and present the database tables belonging to a topic as its granularity. Hence, an object in QBT is a topic that has three levels of granularity (tables, attributes and attribute instances), in contrast to QBO, where an object had only two levels of granularity (attributes and attribute instances). Our approach is also in accordance with the IRE framework. By introducing topics, users
can browse the database contents semantically, providing more intuitive options to the users.
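The modified hierarchy can be pictured as a nested structure that the interface walks level by level; the topics, tables, attributes and values below are invented for illustration:

```python
# Hypothetical fragment of the QBT hierarchy for an eSagu-like database:
# Topics -> Tables -> Attributes -> Attribute Instances.
hierarchy = {
    "Observation": {                       # topic (labeled cluster)
        "cotton_observation": {
            "problem": ["boll rot", "leaf spot"],
        },
    },
    "Farmer": {
        "farmer": {
            "gender": ["male", "female"],
            "age": [20, 21, 22],
        },
    },
}

def options(path=()):
    """Return the options shown to the user at a given depth of the
    hierarchy: topics at depth 0, then tables, attributes, instances."""
    node = hierarchy
    for step in path:
        node = node[step]
    return sorted(node) if isinstance(node, dict) else list(node)
```

Each user selection simply extends `path` by one element, which is exactly the drill-down behaviour of the cascading menus described later.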
3.2.2.2 Facilitating Instance Selection:
For selecting instance(s) of an object, selection based on attribute values comes naturally to the user. Thus, we first ask the user to select an attribute and then select its instances. However, when the number of instances of an attribute is large, we need an efficient organization of the options. Here, two concerns are in conflict: while we want to allow the user to drill down to his requirements in multiple steps, we may end up creating too many steps, which is unfavorable for the user. We therefore create a two-level hierarchy for attribute values such that few steps are required for instance selection while still providing a better organization. In the two-level hierarchy, we group the attribute instances into intervals: the first level represents the intervals and the second level represents the instances themselves.
Considering the values of an attribute as a data distribution, we relate interval creation to determining bins for a histogram of that distribution. Methods for calculating the number of bins (k), or the bin width (h), for a data distribution of n values are as follows:

• Sturges' formula: k = ⌈log2 n + 1⌉

• Square root choice: k = √n

• Scott's choice (based on bin width): h = 3.5σ / n^(1/3), where h represents the bin width

• Freedman-Diaconis' choice: h = 2 × IQR(x) / n^(1/3), where IQR is the interquartile range
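Assuming numeric attribute values, the four rules above can be computed as follows; for the width-based rules, the bin count is obtained by dividing the data range by h:

```python
import math
from statistics import pstdev, quantiles

def bin_counts(data):
    """Number of bins k under the four rules above, converting the
    width-based rules (Scott, Freedman-Diaconis) via range / h."""
    n = len(data)
    spread = max(data) - min(data)
    sturges = math.ceil(math.log2(n) + 1)
    sqrt_choice = math.ceil(math.sqrt(n))
    scott_h = 3.5 * pstdev(data) / n ** (1 / 3)
    q1, _, q3 = quantiles(data, n=4)              # quartiles
    fd_h = 2 * (q3 - q1) / n ** (1 / 3)
    return {
        "sturges": sturges,
        "sqrt": sqrt_choice,
        "scott": math.ceil(spread / scott_h),
        "freedman_diaconis": math.ceil(spread / fd_h),
    }

k = bin_counts(list(range(1, 101)))   # a uniform distribution of 100 values
```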
We would like to point out that the aim of the proposed approach is to make it easier for the user to select instances. For example, if we have a textual attribute representing the names of people in a community, one simple solution is to bin based on the first letter of the name rather than on the distribution. With textual attributes in perspective, we additionally provide a search box that acts as a filter for instance selection. The usability of the search tool becomes even more prominent when the textual attributes contain long texts.
3.2.2.3 Defining Operations:
Next, we need to define the operators used in QBT. Operators enable us to perform complex queries on databases involving one or more objects; the selected objects act as operands to the operators. We define two types of operator matrix:

i Within-Topic Operator Matrix (WT): This matrix represents all possible operations within a topic. It includes operations between a topic's representative table and the other tables belonging to the topic, and between tables of the same topic.
Figure 3.3 Topical Structure for QBT
ii Between-Topics Operator Matrix (BT): This matrix represents the possible operations between the representative tables of the topics. The diagonal elements represent the WT matrices of the topics, and the non-diagonal elements represent operations between two distinct topics.
By defining operational pairs between topics and within topics, we significantly reduce the number of pairs for which operations need to be defined. The reduction depends on the topical structure identified for the database. Figure 3.3 shows an example of the organization of tables into topical structures. A topic is represented by its representative table, and all tables belonging to a topic are called its subordinate tables. The first subscript denotes the topic and the second denotes whether the table is a representative table or a subordinate table of the topic; the tables of each topic are further distinguished as a, b, and so on. Table 3.3 describes the Within-Topic matrix for the first topic (WT-I) and Table 3.4 describes the Between-Topic matrix (BT). The following scenarios arise in the context of Figure 3.3.
        T11     T12a    T12b    T12c
T11     U,I,C   R       R       R
T12a    R       U,I,C   R       R
T12b    R       R       U,I,C   R
T12c    R       R       R       U,I,C

Table 3.3 Within-Topic Matrix 1 (WT-I)

        T11     T21
T11     [WT-I]  R
T21     R       [WT-II]

Table 3.4 Between-Topic Matrix (BT)
Scenario 1. The two selected objects belong to the same topic. There are three possibilities:

• Both tables are representative tables {T11, T11}: As there is only one representative table per topic, this represents operations between the same table. The possible operations are provided in the Within-Topic operator matrix (WT-I[1,1]).

• One table is the representative table and the other is a subordinate table {T11, T12a}: This case represents a RELATE operation between the two tables. The operations are defined in the Within-Topic operator matrix (WT-I[1,2]).

• Both tables are subordinate tables {T12a, T12b}: The two tables relate either directly or through the representative table of the corresponding topic; the operations are performed at this higher level (WT-I[2,3]).

Scenario 2. The two selected objects belong to different topics. Again there are three possibilities:

• Both selected tables are representative tables {T11, T21}: The possible operations are defined in the Between-Topics operator matrix (BT[1,2]).

• One table is a representative table and the other is a subordinate table {T11, T22a}: The tables are related at the higher level via the representative tables of the two topics (BT[1,2]).

• Both tables are subordinate tables {T12a, T22a}: Similar to the above case, the two tables are related through their representative tables. The possible operations are defined in the Between-Topics matrix (BT[1,2]).
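The scenario resolution above reduces to a simple lookup: if both selected tables fall in the same topic, the pair is covered by that topic's within-topic matrix; otherwise the between-topics matrix applies. A sketch, using the topic layout of Figure 3.3 (the subordinate table names are illustrative):

```python
# Topic layout mirroring Figure 3.3: topic -> (representative, subordinates).
topics = {
    "I":  ("T11", ["T12a", "T12b", "T12c"]),
    "II": ("T21", ["T22a", "T22b"]),
}

def topic_of(table):
    """Find which topic a table belongs to."""
    for topic, (rep, subs) in topics.items():
        if table == rep or table in subs:
            return topic
    raise KeyError(table)

def operator_matrix(t1, t2):
    """Return which matrix defines the operations for the pair (t1, t2)."""
    if topic_of(t1) == topic_of(t2):
        return f"WT-{topic_of(t1)}"   # Scenario 1: within-topic matrix
    return "BT"                       # Scenario 2: between-topics matrix
```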
3.2.3 QBT protocols
In this section we describe the QBT developer protocol and QBT user protocol:
3.2.3.1 QBT Developer Protocol
• Store objects and entities in a database (RDBMS).

• Organize the tables in the schema based on the topics of the tables, as described in Section 3.2.2.1.

• Create a framework to organize attribute instances into the two-level hierarchy, as explained in Section 3.2.2.2.

• Define operations within each topic and between topics, as described in Section 3.2.2.3.
• Provide an interface based on QBT to allow object selection and instance selection, and to support the closure property.
3.2.3.2 QBT User protocol
The user protocol for QBT is described in Figure 3.4. The main options in the QBT are as follows.
• Select a topic
• Select granularity (a table, attribute and attribute values)
• Select another topic
• Select an operation
• Display result
• Extend query, if required
Figure 3.4 QBT user protocol
3.3 System Prototype
In this section, we discuss the prototypes developed for the QBO approach and the QBT approach, based on the notion of IRE. As shown in Figure 3.5, the system prototype is based on a client-server architecture. Users interact with the system through a web-based user interface (EQBO client), which allows object selection and operator selection, and also displays query results. The back-end (EQBO server) consists of a system that processes the inputs given by a user and generates an SQL query that is executed on a relational database server (MySQL). The results of the SQL query are presented to the user at every stage of interaction. The user interface was implemented in PHP using open-source jQueryUI tools and visual tools. The options available to the user are refined by means of AJAX calls to the server, using JSON objects for information transfer between client and server.
The developer protocols were followed to define objects and operations. For the QBO prototype, each table in the database corresponds to an object. The attributes of a table are considered as its granularity, based on which instances of the object can be selected. In the QBT prototype, we discovered topical structures in the database, with topics corresponding to objects.

Figure 3.5 System Prototype Architecture
Operators are required when a user wants to relate information from one object with another object. Analogous to a calculator, a query can be expressed as A op B = C, where A is the left operand, B is the right operand, op represents the operator and C represents the result. A and B represent objects defined in the database. Considering objects analogous to the numbers in a calculator, operators can be unary, requiring a single left operand (A) as argument, or binary, requiring both a left operand (A) and a right operand (B). In a calculator, one or two objects of the same type (numbers) operate and result in an object of the same type (a number). In our system, however, objects are of different types, as each object corresponds to a table with different attributes. Consequently, depending on the operator selected, the resulting object can be of type A, type B or type A join B.
Four binary operators are defined for any general-purpose database: ADD (union), MINUS (complement), AND (intersect) and RELATE. For two identical objects, the binary operators ADD, MINUS and AND are defined. For two different objects, the binary operator RELATE provides a natural join between the objects. For each object, unary operators are defined corresponding to each direct join relationship it has with other objects. In addition to the default operators, the database administrator can define domain-specific operators to provide more flexibility to end users.
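A sketch of how the default binary operators could be translated to SQL, assuming each operand is already an SQL selection over the relevant object. The templates, the `USING`-based join, and the column names are illustrative rather than the prototype's actual implementation (note also that standard SQL `EXCEPT`/`INTERSECT` only became available in recent MySQL versions, which may need a different formulation):

```python
# Set-style operators apply when both operands select from the same object.
OPERATORS = {
    "ADD":   "{a} UNION {b}",      # union of the two selections
    "MINUS": "{a} EXCEPT {b}",     # complement: rows of A not in B
    "AND":   "{a} INTERSECT {b}",  # intersection of the two selections
}

def combine(op, a, b, join_key=None):
    """Build the SQL for 'A op B'; RELATE joins two different objects."""
    if op == "RELATE":
        return f"SELECT * FROM ({a}) x JOIN ({b}) y USING ({join_key})"
    return OPERATORS[op].format(a=a, b=b)

q = combine("ADD",
            "SELECT farmer_id FROM farmer WHERE age = 20",
            "SELECT farmer_id FROM farmer WHERE age = 21")
j = combine("RELATE",
            "SELECT * FROM farmer", "SELECT * FROM farm",
            join_key="farmer_id")
```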
3.3.1 CONFIG-DB
Configuration information corresponding to different databases is stored in the CONFIG-DB (Figure 3.6). The CONFIG-DB consists of an objects table which stores the names of the objects identified from a database; it simply maintains an index of objects. Object granularity and attribute values are accessed from the original database to which the object belongs. If the database is to be represented as topics, a topics table is defined similarly. Each topic has a representative object; objects belonging to the same topic have the same topic id, otherwise the topic id is null. In addition, the unary operators table and binary operators table store operator details such as the left operand object, the right operand object, the SQL query for the operator, the resultant object, and the icon location used to visually represent the operator in the user interface. For any database, by default each table is indexed as an object and the default binary and unary operators are defined. However, the CONFIG-DB can be re-populated by the database administrator to enable topic representation or to define more operators.

Figure 3.6 CONFIG-DB
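An illustrative CONFIG-DB layout inferred from the description above, created here in SQLite for self-containment; the prototype uses MySQL, and the actual table and column names may differ:

```python
import sqlite3

# Hypothetical CONFIG-DB schema: an index of objects, optional topics,
# and the unary/binary operator tables described in the text.
ddl = """
CREATE TABLE objects (
    object_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    topic_id  INTEGER);                 -- NULL when topics are not used
CREATE TABLE topics (
    topic_id  INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    representative_object INTEGER REFERENCES objects);
CREATE TABLE binary_operators (
    operator_id   INTEGER PRIMARY KEY,
    left_object   INTEGER REFERENCES objects,
    right_object  INTEGER REFERENCES objects,
    sql_template  TEXT,                 -- SQL query for the operator
    result_object INTEGER REFERENCES objects,
    icon_location TEXT);                -- icon shown in the UI
CREATE TABLE unary_operators (
    operator_id   INTEGER PRIMARY KEY,
    left_object   INTEGER REFERENCES objects,
    sql_template  TEXT,
    result_object INTEGER REFERENCES objects,
    icon_location TEXT);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(ddl)
```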
3.3.2 User Interface
The design of the user interface was motivated by the aim of providing an interface analogous to a traditional calculator. As presented in Figure 3.7, the selection of numbers is analogous to object selection (red), and operator selection for numbers is analogous to operator selection for objects (blue). The display section of a traditional calculator, which shows the result and the numbers or operators selected, is analogous to a query-bag that keeps track of the user's interactions and intermediate results (green).

Figure 3.7 Traditional calculator versus System Prototype UI

Figure 3.8 depicts the user interface, which consists of four sections (each indicated by a numbered arrow). The first section (1) of
Figure 3.8 describes the interaction process of selecting objects, granularity and instances. In general, this section should provide an efficient representation of the database schema and data. It is well established that visual representations of objects can be easily manipulated by the user. However, for large-scale databases that contain a large number of tables, attributes and attribute values, visual representations become complex and are restricted by the screen size. Consider a database with 100 tables, each table consisting of 15 attributes on average and each attribute containing 1000 distinct values on average. Visual representations like treemaps, which provide a compact representation of hierarchical data, suffer badly as the screen becomes densely packed. Similarly, graph representations, which can represent both objects and the relationships between them, also suffer as the network structure becomes dense and confusing to the user. To deal with large-scale databases, we use cascading menus to represent the database hierarchy. The left-most menu represents objects, grouped by topics. On selection of an object, its attributes are presented in the second menu. Subsequently, attribute selection leads to the third menu, consisting of attribute values. Since the attribute values are likely to be numerous, a search box is provided
to users for locating the desired information. The second section (2) of Figure 3.8 represents operator selection. Operators are represented as a grid of buttons, similar to the grid of operators on a traditional calculator. The operator grid is updated based on the object selected. In addition to operators, the grid contains a ‘backspace’ and a ‘calculate’ button, to undo previous selections and to evaluate an expression respectively. An important design choice is the use of icons along with a textual representation of operator functionality. An icon provides a visual representation of the operator’s functionality that can be easily grasped by the end user. For example, a + icon is displayed for an operation that adds more instances of an object to an existing selection. The third section (3) of Figure 3.8 describes the representation of the query-bag, which keeps track of the user’s selections (objects and operators), similar to the display section of a calculator. In general, the user’s selections represent a very small subset of the available options. We can thus use visual tools to represent the user’s selections and
operation results.

Figure 3.8 System Prototype UI

Figure 3.9 Treemap representation of user’s selection (object and granularity)

Object selection is represented through treemaps, as they provide a compact representation of hierarchical data, and operator selection is represented via icons. For example, Figure 3.9 shows the treemap representation of the user selecting a farmer object with granularity for gender as male and age as 20, 21 or 22. Note that the treemaps are not displayed until the user has made a selection on an object. The fourth section (4) of Figure 3.8 represents the results of the SQL query formulated from the user’s interactions. Each selection made by the user updates the query; correspondingly, an SQL query is executed on the database server and its results are presented back to the user in real time. The SQL query results are displayed using the query-by-example (QBE) approach. Real-time presentation of the SQL query results allows users to validate their selections at every stage and reduces the probability of formulating a wrong query.
3.4 Experiments
3.4.1 Experimental Methodology
To analyze the effectiveness of the proposed approach, we conducted system-level experiments and a usability study. The system-level experiments evaluate the reduction in navigation burden and the reduction in the number of operational pairs compared to the QBO approach. The usability study consists of a task analysis and an ease-of-use survey on a real database with real users. For the usability study, we developed two prototypes, one based on the QBO approach and one based on the QBT approach. The interfaces of the two prototypes are almost identical, except that the QBO prototype does not group objects by topics and does not provide bins for instances. First, we perform a task analysis on the QBT and QBO prototypes to check whether the proposed approach is beneficial to the user. To compensate for the limitations of the task analysis (discussed later in Section 3.4.3.1), we then ask the users to explore the database on their own and pose queries from their day-to-day requirements using both prototypes. After the exploration session, they fill out a questionnaire rating the prototypes. This may not be the most rigorous usability evaluation, but it reduces the bias inherent in the task analysis.
3.4.2 Performance Analysis
We measure the effect of using topical structures at the system level by measuring the reduction factor for operational pairs (RFop). The reduction factor compares the number of operation pairs in the QBT approach with that in the QBO approach. If the number of operation pairs in QBT is OPqbt and in QBO is OPqbo, the reduction factor is defined as follows:

RFop = 1 − OPqbt / OPqbo    (3.1)

We illustrate the metric by referring to Figure 3.3, where the total number of tables is 8. When the tables are divided into two topics, the operation pairs are as follows: two 4 × 4 WT matrices (32 pairs) and the two off-diagonal cells of the BT matrix (2 pairs). Hence OPqbt is 34, while OPqbo is 64 (8 × 8), giving a reduction factor of 0.46. For the eSagu database, after identifying topical structures, the operational pairs were calculated for the between-topics matrix (BT) and the within-topic matrices (WT). The observed reduction factor (RFop) was 0.76.
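The computation of RFop for the example of Figure 3.3 can be checked directly:

```python
def reduction_factor(op_qbt, op_qbo):
    """Equation (3.1): RFop = 1 - OPqbt / OPqbo."""
    return 1 - op_qbt / op_qbo

# Figure 3.3 example: 8 tables split into two topics of 4 tables each.
wt_pairs = 2 * 4 * 4          # two 4x4 within-topic matrices
bt_pairs = 2                  # two off-diagonal between-topics cells
op_qbt = wt_pairs + bt_pairs  # 34 pairs under QBT
op_qbo = 8 * 8                # 64 pairs under QBO
rf = reduction_factor(op_qbt, op_qbo)   # 1 - 34/64 = 0.46875, i.e. 0.46
```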
3.4.3 Usability Study
Usability tests were conducted with four real users who had computer experience but were not skilled in SQL or query languages. The users belonged to the age group 20-26 and were agriculture consultants at the IT for Agriculture lab, IIIT Hyderabad. The users were familiar with the database domain, mainly eSagu, and could validate the query results comfortably. Users were briefed about the QBT prototype for 15 minutes, along with a quick demonstration of a sample query. Before the experiments, users were allowed a 5-minute practice session to acquaint themselves with the tool. We performed two experiments: a task analysis and a USE survey [69].

Figure 3.10 QBO Approach Prototype
Figure 3.11 QBT Approach Prototype (with topic modeling and binning)

Task    User1    User2    User3    User4
T1      21 (1)   16 (1)   41 (2)   22 (1)
T2      18 (2)   31 (2)   30 (1)   27 (1)
T3      170 (3)  81 (2)   79 (1)   112 (2)
T4      17 (1)   18 (1)   22 (1)   25 (1)
T5      25 (1)   18 (1)   41 (2)   24 (1)
T6      140 (2)  151 (2)  110 (2)  103 (2)

Table 3.5 Time taken in seconds (attempts taken) for each task
3.4.3.1 Experiment 1, Task Analysis:
After the initial interactive session, the users were given six tasks. The tasks were as follows:
• T1: Find the details of family members for the farmer D.Laxama Reddy.
• T2: Find all the farms owned by the farmer named Polepally Thirumalreddy.
• T3: Find all the observations given to farmers from Malkapur village who grow cotton crops.
• T4: Find the details of livestock belonging to the farmer d.laxama reddy.
• T5: Find all the farmers belonging to the coordinator named k .s narayana.
• T6: Find all the advice given to farmers from Malkapur village.
Task Min Time Max Time Average Std. Deviation Avg. time for query construction
T1 16 41 25 10.98 20
T2 18 31 26.5 5.91 17.66
T3 79 170 110.5 42.44 55.25
Table 3.6 Query building time results for QBT
Task Min Time Max Time Average Std. Deviation Avg. time for query construction
T4 17 25 20.5 3.69 20.5
T5 18 41 27 10.29 21.6
T6 103 151 126 23.13 63
Table 3.7 Query building time results for QBO
Each task involved constructing a query corresponding to the task requirement and retrieving the correct result. Ideally, we would like to evaluate the two prototypes on the same tasks. However, if a user performs the tasks on one prototype and then repeats them on the other, the second prototype would be at an advantage because the user has already gained experience with the tasks. To address this issue, we instead divided the tasks into two groups of three tasks each. The first three tasks (T1, T2 and T3) were performed on the QBT prototype, while the last three tasks (T4, T5 and T6) were performed on the QBO prototype. While different tasks were performed on the two prototypes, we ensured that the tasks were similar in nature and complexity: task T1 is similar to task T4, differing only in the objects involved (family details versus livestock details); T2 and T5 both represent a join operation, differing only in the objects involved; and T3 and T6 both represent a complex join involving three objects.
Table 3.5 shows the time taken by each user to build his query for all six tasks and also the total
number of attempts taken to complete each task. Note that we only account for the time taken by the
user to build the query and not the time taken by the system to execute the query. The average time to
complete all the tasks was 5 minutes and 36 seconds. The longest time to complete the six tasks was
6 minutes 31 seconds while the fastest time was 5 minutes and 13 seconds. The standard deviation to
complete all six tasks was 37 seconds.
Table 3.6 and Table 3.7 show the query building times for the two prototypes. Additionally, for QBT,
the average time to complete the first three tasks (T1, T2 and T3) successfully was about 2 minutes
and 35 seconds. The longest time taken to complete the tasks on the QBT prototype was 3 minutes
and 29 seconds and the fastest time was 2 minutes and 3 seconds. The standard deviation of time to
complete the first three queries was 40 seconds. For QBO, the average time for the last three tasks (T4,
T5 and T6) was 2 minutes and 54 seconds, while the longest time was 3 minutes and 7 seconds and
the fastest time was 2 minutes and 32 seconds. The standard deviation of time to complete the last
three tasks was 16 seconds. The average number of trials required by users to complete all six tasks
was 9.75. The average number of attempts required to complete the first three tasks was 4.75, while the average number of attempts required to complete the last three tasks was 4.25. The maximum number of attempts required by a user for any single task was 3 (for T3). The average time for query construction for the first three tasks was 34 seconds, while for the last three tasks it was 41 seconds.

Figure 3.12 Average ratings for questions from questionnaire
As discussed in the experimental methodology, the tasks were performed first on the QBT prototype and then on the QBO prototype, so the QBO prototype had the advantage that users were already accustomed to performing similar tasks on the QBT prototype. Nevertheless, the average query-construction time for QBT is lower than for QBO, which shows that users are able to locate their information more quickly in QBT than in QBO.
3.4.3.2 Experiment 2, Use Survey:
After the task evaluation, we conducted a survey to determine how the users felt about the prototypes
individually. Users were asked to explore the prototypes and pose various queries from their day-to-day
requirements. After the users had explored the database using the two prototypes they were asked to fill
in a questionnaire based on a USE survey. The questionnaire asked the users to rate both the prototypes
based on the following questions:
• Q1: The tool is easy to use.
• Q2: The tool is sufficient for my information requirements.
• Q3: The tool can be used with minimal efforts.
• Q4: The tool requires minimal training and can be used without written instructions.
• Q5: I can locate my information easily.
• Q6: The tool requires minimal steps to formulate a query.
The users had to respond to each question on a scale ranging from 0 (completely disagree) to 10
(completely agree). Finally, each user was asked to give feedback about their general perception of the prototypes, to obtain additional comments about strengths and weaknesses for improving the tool.
In Figure 3.12, we present the average ratings provided by the users for each of the questions. The
mean rating for the QBT prototype was 6.95 with a standard deviation of 0.24. The mean rating for
the QBO prototype was 6.33 with a standard deviation of 0.30. For Q1, the QBT prototype received
an average rating of 7.25 while QBO prototype received an average rating of 6.5. For Q2, the QBT
prototype received an average rating of 6.75 while QBO received an average rating of 6.25. For Q3, the
QBT prototype received an average rating of 7 while the QBO prototype received an average rating of
6.5. For Q4, both the QBT prototype and the QBO prototype received an average rating of 6.75; the two prototypes differ not so much in user-interface design as in the process of interaction. For Q5, the QBT prototype received an average rating of 7.25, whereas the QBO prototype received an average rating of 6. In the QBT prototype, we introduced topics to organize objects, which helps the user locate objects quickly. For Q6, the QBT prototype received an average rating of 6.75, whereas the QBO prototype received an average rating of 6. The highest ratings for the QBT prototype were received for questions Q1 and Q5, while the lowest were received for Q2, Q4 and Q6 alike. For the QBO prototype, the highest rating was received for Q4, and the lowest ratings were received for questions Q5 and Q6.
From the USE survey, we see that the QBT prototype received its highest ratings for Q1 and Q5, which shows that after exploring the data through the prototype, the users feel that the QBT prototype is easy to use and that they can locate their desired information more quickly than with QBO. On the other hand, the lower ratings for Q2, Q4 and Q6 show that there is still scope for improvement, as users feel they are not able to express all their requirements. After the users were given the freedom to explore both prototypes, the QBT prototype in general received higher ratings than the QBO prototype. Although the difference in ratings is not large, it shows a preference for QBT over QBO.
3.4.3.3 Limitations and possible improvements for the usability study
The users for our usability study were a group of agricultural experts working in the IT for Agriculture
lab, IIIT Hyderabad. They matched our target audience of users who are unfamiliar with database
systems but familiar with the data they want to query. However, the usability study could have been
conducted iteratively with different groups of users rather than with a single group of agricultural experts.
Another possible extension was an expert review of the prototypes.
We used a limited set of six questions in the survey. In [68], the authors describe an array of
questions that could be used for a detailed study of user behavior. Using another popular questionnaire,
such as the System Usability Scale (SUS), was also an alternative. The questionnaire could also have
included questions that directly compare the two prototypes. Measuring the internal consistency or a
reliability score could have been used to validate our questionnaire. We used mean ratings to evaluate
our study, while other measures such as the standard deviation and the correlation between questionnaire
ratings could be studied for a more detailed analysis.
For the task analysis, we made each user perform three similar tasks on the two prototypes. While
completing the first three tasks on one prototype, users gained experience that helped them complete
their tasks on the other prototype. This creates a bias in favor of one of the prototypes. Similarly, we
could not let users complete the same task on both prototypes, which would again have created a bias.
3.5 Summary of the chapter
Accessing a database requires the user to be familiar with query languages. The QBO approach,
based on the IRE framework, provides an interface where a user progressively builds queries in multiple
steps. This approach works well for small databases but does not perform well for a database consisting
of a large number of tables and rows. In this chapter, we proposed Query-by-Topics, which provides
enhancements over the existing QBO approach. We exploit topical structures in large databases to
represent objects at a higher level of abstraction. We also organize instances of an object in a two-level
hierarchy based on a user-selected attribute. The advantages of this approach include reduced
navigational burden for the user and a reduced number of operations at the system level. The QBT prototype
was implemented for a real database, and experiments were conducted at the system level and user level
to demonstrate these advantages.
Chapter 4
Exploiting Schema and Documentation for Summarizing Relational
Databases
According to a recent study, users take more time to express and formulate their query requirements
compared to the time taken for executing the query and displaying the result [70]. With the increase
in complexity of modern day databases, users spend a considerable amount of time in understanding
a given schema in order to locate their information of interest. To address these issues, the notion of
schema summarization was proposed in the literature [25, 18].
Schema summarization involves identifying semantically related schema elements, representing what
users may perceive as a single unit of information in the schema. Identifying abstract representations
of schema entities helps in efficient browsing and better understanding of complex database schema.
Practical applications of schema summarization are as follows:
• Schema Matching [71, 59] is a well-researched problem. Schema matching involves identifying
mappings between attributes from different schemas. After identifying abstract representations of
schema elements, we can reduce the number of mapping identification operations by identifying
mappings at an abstract level rather than the schema level.
• In Query Interfaces, users construct their query by selecting tables from the schema. A quick
schema summary lookup might help the user understand where the desired information is
located and how it is related to other entities in the schema.
The problem of schema summarization has gained attention recently in the database community.
Existing approaches [18, 19, 20] for generating schema summary exploit two main sources of database
information, the database schema and data stored in the database. In another related work, Wu et al.
[21] described an elaborate approach (iDisc) for clustering schema elements into topical structures by
exploiting the schema and the data stored in the database.
In this chapter, we propose an alternative approach for schema summarization by exploiting the
documentation of the database, in addition to its schema. It can be noted that we investigated how
documentation of the database provides the scope for efficient schema summarization. The database
37
Figure 4.1 TPCE Schema
documentation contains domain specific information about the database which can be used as an in-
formation source. For each table, first we identify the corresponding passages in the documentation.
Later, a table similarity metric is defined by exploiting similarity of the passages describing the schema
elements in the documentation and the referential relationships between tables. Using the similarity met-
ric, a greedy weighted k-center clustering algorithm is used for clustering tables and generating schema
summary. The experimental results on the TPCE [72] benchmark database show the effectiveness of
the proposed approach.
The rest of the chapter is organized as follows: In section 4.1, we describe the proposed approach
including the basic idea, table similarity measure and clustering algorithm. In section 4.2, we discuss
the experimental results and analysis. Section 4.3 presents conclusions and future work.
4.1 Proposed Approach
We use the TPCE schema [72] described in Figure 4.1 as the running example in this chapter. The
TPCE schema consists of 33 tables that are grouped into four categories of tables: Customer (blue),
Market (green), Broker (red) and dimension (yellow). This categorization is provided by the TPCE
benchmark and it also serves as the gold standard for evaluation of our experiments.
Existing approaches for clustering database tables are data oriented, utilizing the schema and the data
in the database for generating the schema summary. In scenarios where the data is insufficient, or some
tables do not contain data, we have to look for alternate sources of information. For example, in the TPCE
benchmark database, if no active transactions are considered, the table trade request is empty and hence
cannot be considered for clustering by existing approaches. We therefore investigate alternative sources
of information for a database. Databases are typically accompanied by documentation or a requirements
document. These documents contain domain-specific information about the database that can be
exploited for generating a schema summary. Although one can go through the documentation and infer
the schema summary manually, this is not always feasible: documentation for an enterprise database
is generally large, spanning hundreds of pages. The documentation for TPCE is 286 pages long, and
manually going through it would be a tedious process for the user.
In the proposed approach, we aim to develop an efficient method for schema summary generation,
using only the schema and the documentation.
4.1.1 Basic Idea
A foreign key relationship between two tables indicates a semantic relationship between them.
However, referential relationships alone do not provide good results [20]. Hence, we supplement this
referential similarity between tables with another notion of similarity, such that the tables belonging
to one category attain higher intra-category similarity. This additional similarity criterion is based on
the similarity between the passages of text representing the tables in the database documentation. The
intuition behind this notion of similarity is that tables belonging to the same category should share
some common terms about the category in the documentation. We combine the referential similarity
and the document similarity by means of a weighted function and obtain a table similarity metric over
the relational database schema. After the pairwise similarity between tables is identified, we use a
weighted k-center clustering algorithm to partition the tables into k clusters.
We propose a measure for table similarity. The measure has two components: one based on referential
relationships and the other based on the similarity of corresponding passages in the documentation. We
first explain the components and then present the table similarity measure.
4.1.2 Schema based Table Similarity
In a relational database, foreign keys are used to implement referential constraints between two ta-
bles. The presence of foreign keys thus implies that the two tables have a semantic relationship. Such
constraints are imposed by the database designer or administrator and form the basic ground truth on
the similarity between tables. In our approach, referential similarity between two tables R and S is ex-
pressed as RefSim(R,S).
              Security   Daily Market   Watch Item
Security          -           1             1
Daily Market      1           -             0
Watch Item        1           0             -
Table 4.1 Referential similarity between the tables security, daily market and watch item
RefSim(R,S) = 1 if R and S have a foreign key constraint, and 0 otherwise.
Example 1: Consider the three tables Security, Daily market and Watch item (S, D and W ) in the TPCE
schema. Table security has a foreign key relationship with daily market and watch item, hence
RefSim(S,D) = RefSim(D,S) = 1 and RefSim(S,W ) = RefSim(W,S) = 1. The pairwise
similarity is described in Table 4.1.
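As an illustration, the referential similarity of Example 1 can be computed directly from a list of foreign-key pairs. The following Python sketch is not part of the thesis; the table names follow the running TPCE example and the function name is our own.

```python
# Sketch (not from the thesis): RefSim as a symmetric 0/1 similarity
# computed from a list of foreign-key pairs.
def ref_sim(tables, foreign_keys):
    """Return a dict mapping (R, S) to 1 if R and S are linked by a
    foreign key constraint, and 0 otherwise (RefSim is symmetric)."""
    linked = set()
    for r, s in foreign_keys:
        linked.add((r, s))
        linked.add((s, r))  # RefSim(R,S) = RefSim(S,R)
    return {(r, s): (1 if (r, s) in linked else 0)
            for r in tables for s in tables if r != s}

tables = ["security", "daily_market", "watch_item"]
fks = [("security", "daily_market"), ("security", "watch_item")]
sim = ref_sim(tables, fks)
# Matches Table 4.1: security links to both other tables,
# while daily_market and watch_item are not linked to each other.
```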
4.1.3 Documentation based Table Similarity
In addition to the referential similarity, we also try to infer the similarity between tables using
database documentation as an external source of information. First, we find the passage describing
the table in the documentation using passage retrieval approach. The similarity between two tables thus
corresponds to the similarity between the corresponding passages in the documentation. The passage
from the documentation representing a table Ti is referred to as the table-document of Ti, TD(Ti). The
first task is to identify the table-document for each table from the documentation. Later, we find pairwise
similarity between the table-documents.
4.1.3.1 Finding Relevant Text from the Documentation:
Passage retrieval [73, 74, 75, 76, 77] is a well-researched domain. Passage retrieval algorithms return
the top-m passages that are most likely to answer an input query. We use a sliding-window based
passage retrieval approach similar to the one described in [78]. In this chapter, we focus on
using passage retrieval to evaluate table similarity from the database documentation rather than on
comparing different approaches for passage retrieval.
Consider a table Ti with a set of attributes Ai = (Ai1, Ai2..Aik). Given a database documentation
(D), for each table Ti we construct a query Q(Ti) consisting of the table name and all its attributes as
keywords.
Q(Ti) = <Ti, Ai1, Ai2, ..., Aik>   (4.1)
In a sliding-window based passage retrieval approach, given a window size wi for Ti, we search wi
consecutive sentences of the document sequentially for the keywords in Q(Ti). If at any instance the
window matches all the keywords from Q(Ti), the passage in the window is considered a potential table-
document for Ti. In cases where multiple windows are identified, we apply a ranking function [79]
to the retrieved passages and choose the passage with the highest ranking score. The selection of an
appropriate window size is a crucial step, as the number of keywords in Q(Ti) varies for each Ti. We
propose two types of window functions (f(Q(Ti))):
• Independent window function, f(Q(Ti)) = c, where c is a numeric constant.
• Linear window function, f(Q(Ti)) = a× |Q(Ti)|+ c, where a and c are numeric constants.
After the passage describing the table is identified, we store the passage in a separate document and
represent it as the table-document TD(Ti) for the table Ti.
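The sliding-window retrieval described above can be sketched as follows. This is a simplified Python illustration, not the thesis implementation: the ranking function [79] for competing windows is omitted and the first matching window is returned.

```python
def linear_window(query_keywords, a=2, c=1):
    """Linear window function f(Q(Ti)) = a * |Q(Ti)| + c."""
    return a * len(query_keywords) + c

def find_table_document(sentences, query_keywords, window_size):
    """Slide a window of `window_size` consecutive sentences over the
    documentation and return the first window that contains every
    keyword (case-insensitive). Ranking of competing windows is
    omitted in this sketch."""
    keys = [k.lower() for k in query_keywords]
    for start in range(len(sentences) - window_size + 1):
        window = " ".join(sentences[start:start + window_size]).lower()
        if all(k in window for k in keys):
            return sentences[start:start + window_size]
    return None  # no window matched all keywords
```

For a table Ti, the query keywords would be the table name and its attribute names, and `window_size` would be obtained from the chosen window function, e.g. `linear_window(Q)`.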
4.1.3.2 Similarity of passages:
Once the table-documents have been identified, we have a corpus containing one table-document for
each table. The table-documents are pre-processed by removing stop-words and performing stemming
using the Porter Stemmer. A table-document can be modeled in two ways:
• TF-IDF Vector: TD(i) = (w1, w2, ..., wd) is represented as a d-dimensional TF-IDF feature
vector, where d = |corpus| and wj is the TF-IDF score of the jth term in TD(i).
• Binary Vector: TD(i) is represented as a d-dimensional binary vector TD(i) = (w1, w2, ..., wd),
where d = |corpus| and wj is 1 if TD(i) contains the jth term and 0 otherwise.
We then calculate pairwise similarity between table-documents using the cosine similarity measure
or the Jaccard coefficient:

DocSimcos(R,S) = DocSim(docR, docS) = (docR · docS) / (|docR| × |docS|)   (4.2)

DocSimjacc(R,S) = DocSim(docR, docS) = |docR ∩ docS| / |docR ∪ docS|   (4.3)
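A minimal Python sketch of the two measures (an illustration, assuming TF-IDF vectors are given as sparse term-to-weight dictionaries and binary vectors as term sets):

```python
import math

def cosine_sim(vec_r, vec_s):
    """Equation (4.2): cosine similarity of two TF-IDF vectors,
    represented here as sparse term -> weight dictionaries."""
    dot = sum(w * vec_s.get(term, 0.0) for term, w in vec_r.items())
    norm_r = math.sqrt(sum(w * w for w in vec_r.values()))
    norm_s = math.sqrt(sum(w * w for w in vec_s.values()))
    return dot / (norm_r * norm_s) if norm_r and norm_s else 0.0

def jaccard_sim(terms_r, terms_s):
    """Equation (4.3): Jaccard coefficient of two binary term sets."""
    r, s = set(terms_r), set(terms_s)
    return len(r & s) / len(r | s) if r | s else 0.0
```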
4.1.4 Table Similarity Measure
For two tables R and S, let RefSim(R,S) represent the referential similarity and DocSim(R,S)
represent the document similarity between R and S. We combine the referential similarity and document
similarity using a weighing scheme as
Sim(R,S) = α×RefSim(R,S) + (1− α)×DocSim(R,S) (4.4)
where α is a user-specified parameter called the contribution factor, 0 ≤ α ≤ 1. It controls the
contribution of the referential similarity to the table similarity. In some cases, two tables have a low value
Algorithm 1 Finding Table Similarity
Input: D: Database schema, TD: Set of table-document vectors, S: Document similarity measure,
α: Contribution factor
Output: Sim: Pairwise similarity between tables in the database
RefSim← REFERENCE-SIMILARITY(D)
DocSim← DOCUMENT-SIMILARITY(TD, S)
Sim← α×RefSim+ (1− α)×DocSim
for all tables as k do
  for all tables as i do
    for all tables as j do
      if Sim(i, k)× Sim(k, j) > Sim(i, j) then
        Sim(i, j)← Sim(i, k)× Sim(k, j)
      end if
    end for
  end for
end for
return Sim

procedure REFERENCE-SIMILARITY(D)
  for all tables as R do
    for all tables as S do
      if R, S have a foreign key relationship in D then
        RefSim(R,S)← 1
      else
        RefSim(R,S)← 0
      end if
    end for
  end for
  return RefSim
end procedure

procedure DOCUMENT-SIMILARITY(TD, S)
  for all tables as R do
    for all tables as S do
      DocSim(R,S)← S(TD(R), TD(S))
    end for
  end for
  return DocSim
end procedure
of (combined) similarity, but have high similarity to a common neighboring table. For example, in
Figure 4.1, tables account permission(AP ) and customer(C) do not have a referential similarity
but both are similar to the table customer account(CA). In such cases, two tables gain similarity as
they have similar neighbors. For the previous example, similarity between account permission and
customer should be max(Sim(AP,C) , Sim(AP,CA)× Sim(CA,C)).
We construct the undirected database graph G = (V,E), where nodes (V ) correspond to tables in the
database schema. For any two tables R and S, we define an edge representing the combined similarity
Sim(R,S) between the tables. The database graph G is a complete graph.
Consider a path p : R = Ti, Ti+1, Ti+2, ..., Tj = S between two tables Ti and Tj. The similarity
between the tables Ti and Tj along the path p is

Simp(R,S) = ∏_{k=i}^{j−1} Sim(Tk, Tk+1)   (4.5)

The path with the maximum similarity between R and S then gives the complete similarity between R
and S:

Sim(R,S) = max_p Simp(R,S)   (4.6)
As we construct a complete graph, we use the Floyd-Warshall algorithm for finding the shortest paths
in a weighted graph; in our case, the "shortest" path between two tables is the one with the maximum
similarity. Since we construct a complete graph for finding all-pairs maximum-similarity paths, this
step takes O(n³) running time.
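This propagation step can be sketched as a Floyd-Warshall style closure where path "length" is the product of edge similarities and we keep the maximum. A Python illustration (not the thesis code; `sim` is assumed to be a dict-of-dicts of pairwise similarities in [0, 1] with Sim(T, T) = 1):

```python
def max_path_similarity(sim):
    """Floyd-Warshall style closure: for every table pair, compute the
    maximum over all paths of the product of edge similarities
    (Equations 4.5 and 4.6)."""
    tables = list(sim)
    best = {r: dict(sim[r]) for r in tables}  # start from direct similarities
    for k in tables:          # candidate intermediate table
        for i in tables:
            for j in tables:
                via_k = best[i][k] * best[k][j]
                if via_k > best[i][j]:
                    best[i][j] = via_k
    return best
```

For instance, with hypothetical values Sim(AP, CA) = 0.8 and Sim(CA, C) = 0.9, a direct similarity Sim(AP, C) = 0.1 would be raised to 0.8 × 0.9 = 0.72 via the common neighbor.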
Algorithm 1 describes the procedure for calculating the pairwise similarity between tables in a
schema. By taking the database schema, set of extracted passages, a document similarity measure and
contribution factor as input, the algorithm returns pairwise similarity between tables. First we calculate
the referential and document similarity for O(n²) pairs and later combine them using the contribution
factor. The procedure REFERENCE-SIMILARITY() takes as input the database schema and calculates
the similarity between two tables based on the referential relationships. The procedure DOCUMENT-
SIMILARITY() takes as input the passage corresponding to each table, a document similarity measure
and calculates the similarity between tables based on the similarity of corresponding passages of the
tables. Note that for every table, the passage is extracted by employing the passage retrieval approach
described in Section 4.1.3.
4.1.5 Clustering Algorithm
For generating the summary, we use a greedy weighted k-center clustering algorithm. It solves a
min-max optimization problem, where we want to minimize the maximum distance between a table and its
cluster center.
4.1.5.1 Influential tables and Cluster Centers
In schema summarization, the notion of influential table is used for clustering [20]. The notion says
that the most important tables should not be grouped in the same cluster. We measure the influence
of a table by measuring the influence one table has on other tables in the schema [80]. Specifically, if
a table is closely related to a large number of tables in the database, it will have a high influence score.
The influence score helps in identifying the cluster centers, described in the clustering process. The
influence of a table R on another table S in the database schema is defined as
f(R,S) = 1 − e^(−Sim(R,S)²)   (4.7)
The influence score of a table is thus defined as

f(R) = Σ_{ti ∈ T} f(R, ti)   (4.8)

where T represents the set of tables in the database.
4.1.5.2 Clustering Objective Function
The clustering objective function aims to minimize the following measure [20]:

Q = max_{i=1..k} max_{R ∈ Ci} f(R) × (1 − Sim(R,Center(Ci)))   (4.9)

where k is the number of clusters, f(R) is the influence score of table R, and Center(Ci) represents
the center of the ith cluster (Ci).
4.1.5.3 Clustering Process
We use a weighted k-center algorithm that takes the influence score into account. In this
approach, the most influential table is selected as the first cluster center, and all tables are assigned
to this cluster. In each subsequent iteration, the table with the lowest weighted similarity to its cluster
center separates out to form a new cluster center. The remaining tables are then re-assigned to the closest
cluster center. We repeat the process for k iterations, so that k clusters are identified for the database
schema. The time complexity of the greedy clustering algorithm is O(kn²) [81], where n is the number
of tables in the schema.
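The influence score (Equations 4.7 and 4.8) and the greedy weighted k-center process can be sketched in Python as follows. This is an illustration under our own naming, assuming `sim` is a complete dict-of-dicts of pairwise similarities with Sim(T, T) = 1.

```python
import math

def influence(sim, r):
    """f(R) = sum over tables t of 1 - exp(-Sim(R,t)^2)  (Eqs. 4.7, 4.8)."""
    return sum(1.0 - math.exp(-sim[r][t] ** 2) for t in sim if t != r)

def weighted_k_center(sim, k):
    """Greedy weighted k-center: the most influential table becomes the
    first center; in each round, the table with the largest weighted
    distance f(R) * (1 - Sim(R, center)) becomes a new center, and all
    tables are re-assigned to their closest center."""
    f = {t: influence(sim, t) for t in sim}
    centers = [max(f, key=f.get)]
    assign = {t: centers[0] for t in sim}
    while len(centers) < k:
        far = max((t for t in sim if t not in centers),
                  key=lambda t: f[t] * (1.0 - sim[t][assign[t]]))
        centers.append(far)
        for t in sim:  # re-assign every table to its closest center
            assign[t] = max(centers, key=lambda c: sim[t][c])
    return centers, assign
```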
4.2 Experimental Results
In this section, we present results of experiments conducted on our proposed approach. The following
variables have been used at different stages in our approach:
• Window size function (f ) for table-document discovery.
• Document similarity measure (S), for calculating the similarity of passage describing the tables
in the documentation.
• α, the contribution factor in combined table similarity metric.
• k, the number of clusters for the clustering algorithm.
Varying any of the variables affects the table similarity metric and clustering. We study the influence
of these variables by varying one variable while keeping the other variables constant. Later, we conduct
experiments on the clustering algorithm and compare our approach with other existing approaches.
4.2.1 Experimental Setup
We used the TPCE database [72], provided by TPC. It is an online transaction processing workload,
simulating the OLTP workload of a brokerage firm. TPC also provides a software package EGen to
facilitate the implementation of the TPCE database. We used the following parameters to implement an
instance of TPCE: Number of Customers = 5000, Scale factor = 36000, Initial trade days = 10.
The TPCE schema consists of 33 tables, which are grouped into four categories: Customer, Market,
Broker and Dimension. We use this categorization as the gold standard to measure the accuracy of our
approach. The dimension tables are not an explicit category; they are used as companion tables to the
other fact tables and hence can be considered outliers to our clustering process. We thus aim to cluster
the other 29 tables and measure the accuracy of these 29 tables against the given gold standard.
In addition, TPC also provides the documentation for the TPCE benchmark. It is a 286-page
document and contains information about the TPCE business and application environment, the database and
the database transactions involved. This document serves as the external source in the proposed schema
summarization approach.
4.2.2 Evaluation Metric
The accuracy of the clustering and of the table similarity metric is evaluated by means of an accuracy
score proposed in [20]. The accuracy score has different connotations for clustering evaluation and table
similarity evaluation. For the table similarity metric, we find the top-n neighbors of each table based on
the Sim metric described in Equation (4.6). Unless specifically mentioned, we find the top-5 neighbors
in our experiments. From the gold standard, if the category of table Ti is Ca and mi is the count of tables
in the top-n neighborhood of Ti belonging to category Ca, then the average accuracy of the similarity
metric is defined as

accsim = ( Σ_{i ∈ T} mi/n ) / |T|   (4.10)
Similarly for clustering accuracy, consider a cluster i containing ni tables. If the category of the
cluster center of cluster i is Ca, let mi denote the count of tables in the cluster that belong to
category Ca. Then the accuracy of cluster i and the overall clustering accuracy are

accclusti = mi / ni   (4.11)

accclust = ( Σ_i mi ) / |T|   (4.12)
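The two accuracy scores can be sketched in Python as follows (an illustration, assuming the gold standard is given as a table-to-category mapping):

```python
def acc_sim(gold, neighbors, n):
    """Equation (4.10): average fraction of each table's top-n neighbors
    that share its gold-standard category. `gold` maps table -> category;
    `neighbors` maps table -> list of its top-n neighbor tables."""
    total = sum(sum(1 for s in neighbors[t] if gold[s] == gold[t]) / n
                for t in gold)
    return total / len(gold)

def acc_clust(gold, clusters, centers):
    """Equations (4.11)-(4.12): a table counts as correct if it has the
    same category as the center of its cluster. `clusters` maps a cluster
    id to its member tables; `centers` maps a cluster id to its center."""
    correct = sum(1 for cid, members in clusters.items()
                  for t in members if gold[t] == gold[centers[cid]])
    return correct / sum(len(m) for m in clusters.values())
```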
Figure 4.2 accsim and accclust values on varying window function f
Figure 4.3 accsim and accclust values for document similarity functions S
4.2.3 Effect of window function (f ) on combined table similarity and clustering
In this experiment we measure the impact of varying the window function f for the window size (w)
on the clustering accuracy and the table similarity metric. We fix α = 0.5, k = 3 and use the TF-IDF
based cosine similarity for table-document similarity. We experiment with the following window
functions:
• wi = f(Q(Ti)) = 10
• wi = f(Q(Ti)) = 20
• wi = f(Q(Ti)) = 2× |Q(Ti)|+ 1
• wi = f(Q(Ti)) = 3× |Q(Ti)|+ 1
The results of this experiment are shown in Figure 4.2. We observe that although the function
f = 20 gives respectable results, it is hard to determine a suitable value for such a constant (f = 10 gives
poor results). Using a constant window size can cause loss of information in some cases or add noise in
other cases. The linear window functions, which gave comparable results, are therefore preferred. In
further experiments, unless otherwise specified, we use the window function
f(Q(Ti)) = 2× |Q(Ti)|+ 1.
Figure 4.4 Accuracy of similarity metric on varying values of α (plotted with and without dimension tables)
Figure 4.5 Accuracy of clustering on varying values of α
4.2.4 Effect of document similarity measure (S) on similarity metric and clustering accuracy
The table-documents identified for each table can be of variable length. We study the two similarity
measures described in Equation (4.2) and Equation (4.3): cosine similarity and Jaccard similarity. We
compare the accuracy of the similarity metric and the clustering algorithm for the two measures, with
k = 3, α = 0.5 and f(Q(Ti)) = 2 × |Q(Ti)| + 1. The results of the experiments are shown in Figure
4.3. We observe that the TF-IDF based cosine similarity measure gives more consistent results.
This can be attributed to the fact that the table-documents share many similar terms about the domain
of the document, and hence term frequency and inverse document frequency play an important role in
determining the score of the terms in a document.
4.2.5 Effect of contribution factor (α) on table similarity and clustering
In this section we measure the impact of varying α on the clustering accuracy and table similarity
metric. In this experiment, we fix w = 2×|Q(Ti)| and k = 3, while varying α from 0 to 1. Figure 4.4 and
Figure 4.5 show the results of varying α on clustering accuracy and accuracy of similarity metric. One
interesting observation is we achieve the best clustering accuracy when the contribution of referential
similarity and document similarity are almost equal (α = 0.4, 0.5, 0.6). This shows that rather than
one notion of similarity supplementing the other, both similarities have equal importance in generating
schema summary. Also using any single similarity measure (when α is 0 or 1) produces low accuracy
results which verify the claims made in this chapter.
Figure 4.6 Clustering accuracy of clusts, clustv, clustd and clustc for different values of k
4.2.6 Comparison of Clustering Algorithms
In this section, we compare clustering algorithms for schema summary generation. In addition to the
proposed weighted k-center clustering algorithm using an influence function (clusts), we implement the
following clustering algorithms:
• clustc, a community detection based schema summarization approach proposed in [19].
• clustd, the schema summarization approach proposed in [20]; this approach uses a table
importance metric based weighted k-center clustering algorithm.
• clustv, which combines the results of clustering using reference similarity and clustering using
document similarity through a voting scheme similar to [21]. This algorithm focuses on combining
clusterings from different similarity models rather than combining the similarity models themselves.
Figure 4.6 shows the clustering accuracy achieved for k = (2, 3, 4) for the various clustering algorithms.
We observed that clusts and clustd achieve similar accuracy, with clusts giving slightly higher
accuracy as it was able to successfully cluster the table trade request. If no active transactions are
considered for the TPCE database, the table trade request is empty, and data-oriented approaches are
unable to classify it. For the clustv and clustc approaches, no specific patterns were observed.
The low accuracy of clustv arises because referential similarity alone provides a very imbalanced and
ineffective clustering that significantly deters the overall clustering accuracy in the voting scheme.
4.3 Summary of the chapter
Schema summarization has been proposed in the literature to help users in exploring complex database
schemas. Existing approaches for schema summarization are data oriented. In this chapter, we proposed
a schema summarization approach for relational databases using database schema and the database doc-
umentation. We proposed a combined similarity measure to incorporate similarities from both sources
and proposed a framework for schema summary generation. Experiments were conducted on a bench-
mark database and the results showed that the proposed approach is as effective as the existing data
oriented approaches.
Chapter 5
Conclusion and Future work
With the rapid increase in the amount of published information and the explosion of data, users
require sophisticated tools to simplify the task of managing data and extracting useful information in a
timely fashion. Consequently, databases and database systems are essential to every organization for its
business operations.
Accessing information stored in a database requires the user to be familiar with query languages.
Naive users are not skilled at using a general-purpose query language like SQL, which has a complex
structure. As a result, research efforts are ongoing to provide easy-to-use query interfaces with expressive
power comparable to SQL.
The QBO approach, based on the IRE framework, provides an interface where the user progressively
builds a query in multiple steps. The QBO approach works well for small databases but does not perform
well on a database consisting of a large number of tables and rows. In this thesis, we proposed Query-by-
Topics, which provides enhancements over the existing QBO approach. We exploit topical structures
in large databases to represent objects at a higher level of abstraction. We also organize instances of
an object in a two-level hierarchy based on a user-selected attribute. The advantages of this approach
include reduced navigational burden for the user and a reduced number of operations at the system level.
We also implemented a system prototype for a real database and made efforts to extend it to any
general-purpose database. Experiments were conducted at the system level to estimate the reduction
in navigational burden and the reduction in the number of operational pairs. A usability study was also
conducted using the system prototype to evaluate our efforts against human factors.
A key step in the proposed approach was to represent schema elements at a higher level of abstraction.
Schema summarization has been proposed in the literature to cluster database schema entities and
to present a high-level abstraction of the schema that helps users explore complex database schemas.
Existing approaches for schema summarization are data-oriented. In this thesis, we proposed a schema
summarization approach for relational databases by utilizing the database schema and the database doc-
umentation. We proposed a combined similarity measure to incorporate similarities from both sources
and proposed a framework for schema summary generation. Experiments were conducted on a bench-
mark database and the results showed that the proposed approach is as effective as the existing data
oriented approaches.
As part of future work, we would like to address the limitations of the usability study (mentioned
in Section 3.4.3.3). For schema summarization, we would like to develop approaches to learn the
values of the various parameters used in the proposed approach. Also, apart from the database documentation,
documents like the requirements document could be exploited for schema summarization. Lastly,
another line of research could focus on developing a unified approach to combine the notions of similarity
from the schema, the data and the database documentation.
Related Publications
1. Ammar Yasir, M. Kumara Swamy and P. Krishna Reddy, Exploiting Schema and Documenta-
tion for Summarizing Relational Databases, International Conference on Big Data Analytics,
LNCS Volume 7678, 2012, pp. 77-99.
2. Ammar Yasir, M. Kumara Swamy and P. Krishna Reddy, Enhanced Query by Object Approach
for Information Requirement Elicitation in Large Databases, International Conference on Big
Data Analytics, LNCS Volume 7678, 2012, pp 26-41.
Bibliography
[1] Tiziana Catarci, Maria Francesca Costabile, Stefano Levialdi, and Carlo Batini. Visual query
systems for databases: A survey. Journal of Visual Languages and Computing, 8(2):215–260,
1997.
[2] Moshe M. Zloof. Query by example. In Proceedings of the May 19-22, 1975, national computer
conference and exposition, AFIPS ’75, pages 431–438, New York, NY, USA, 1975. ACM.
[3] Joobin Choobineh. Human Factors in Management Information Systems. Ablex Publishing Corp.,
Norwood, NJ, USA, 1988.
[4] Joobin Choobineh, Michael V. Mannino, and Veronica P. Tseng. A form-based approach for
database analysis and design. Communications of the ACM, 35(2):108–120, February 1992.
[5] Raghu Ramakrishnan and Johannes Gehrke. Database management systems (3. ed.). McGraw-
Hill, 2003.
[6] Lu Qin, Jeffrey Xu Yu, and Lijun Chang. Keyword search in databases: The power of RDBMS.
In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data,
SIGMOD ’09, pages 681–694, New York, NY, USA, 2009. ACM.
[7] Arvind Hulgeri and Charuta Nakhe. Keyword searching and browsing in databases using BANKS.
In Proceedings of the 18th International Conference on Data Engineering, ICDE ’02, pages 431–,
Washington, DC, USA, 2002. IEEE Computer Society.
[8] Sanjay Agrawal, Surajit Chaudhuri, and Gautam Das. DBXplorer: Enabling keyword search over
relational databases. In Proceedings of the 2002 ACM SIGMOD international conference on Man-
agement of data, SIGMOD ’02, pages 627–627, New York, NY, USA, 2002. ACM.
[9] Jun Sun. Information requirement elicitation in mobile commerce. Communications of the ACM,
46(12):45–47, December 2003.
[10] Jun (John) Sun, Hoh Peter In, and Kuncara Aji Sukasdadi. A prototype of information requirement
elicitation in m-commerce. In 2003 IEEE International Conference on Electronic Commerce (CEC
2003), 24-27 June 2003, Newport Beach, CA, USA, page 53, 2003.
[11] Subhash Bhalla, Masaki Hasegawa, Enrique Gutierrez, and Nadia Berthouze. Computational in-
terface for web based access to dynamic contents. International Journal of Computational Science
and Engineering, 2(5/6):302–306, August 2006.
[12] S. Bhalla and M. Hasegawa. Query-by-object interface for accessing dynamic contents on the web.
In TENCON ’02. Proceedings. 2002 IEEE Region 10 Conference on Computers, Communications,
Control and Power Engineering, volume 1, pages 310–313, 2002.
[13] Takatoshi Akiyama and Yutaka Watanobe. An advanced search interface for mobile devices. In
Proceedings of the 2012 Joint International Conference on Human-Centered Computer Environ-
ments, HCCE ’12, pages 230–235, New York, NY, USA, 2012. ACM.
[14] Shapiee Abd Rahman, Subhash Bhalla, and Tetsuya Hashimoto. Query-by-object interface for
information requirement elicitation in m-commerce. International Journal of Human Computer
Interaction, 20(2), 2006.
[15] Kazumi Nemoto and Yutaka Watanobe. An advanced search system for learning objects. In
Proceedings of the 13th International Conference on Humans and Computers, HC ’10, pages 94–
99, Fukushima-ken, Japan, 2010. University of Aizu Press.
[16] M. Hasegawa, S. Bhalla, and T. Izumita. A high-level query interface for web user’s access to data
resources. In Frontier of Computer Science and Technology, 2007. FCST 2007. Japan-China Joint
Workshop on, pages 98–105, 2007.
[17] Wensheng Wu, Berthold Reinwald, Yannis Sismanis, and Rajesh Manjrekar. Discovering topical
structures of databases. In Proceedings of the 2008 ACM SIGMOD International Conference on
Management of Data, SIGMOD ’08, pages 1019–1030, New York, NY, USA, 2008. ACM.
[18] Cong Yu and H. V. Jagadish. Schema summarization. In Proceedings of the 32nd international
conference on Very large data bases, VLDB ’06, pages 319–330. VLDB Endowment, 2006.
[19] Xue Wang, Xuan Zhou, and Shan Wang. Summarizing large-scale database schema using community detection. Journal of Computer Science and Technology, 27:515–526, 2012.
[20] Xiaoyan Yang, Cecilia M. Procopiuc, and Divesh Srivastava. Summarizing relational databases.
Proceedings of the VLDB Endowment, 2(1):634–645, August 2009.
[21] Wensheng Wu, Berthold Reinwald, Yannis Sismanis, and Rajesh Manjrekar. Discovering topical
structures of databases. In Proceedings of the 2008 ACM SIGMOD international conference on
Management of data, SIGMOD ’08, pages 1019–1030, New York, NY, USA, 2008. ACM.
[22] Ben Shneiderman. Improving the human factors aspect of database interactions. ACM Transac-
tions on Database Systems, 3(4):417–439, December 1978.
[23] C. J. Date. Database usability. In Proceedings of the 1983 ACM SIGMOD international conference
on Management of data, SIGMOD ’83, pages 1–1, New York, NY, USA, 1983. ACM.
[24] Tiziana Catarci. What happened when database researchers met usability. Information Systems,
25(3):177–212, 2000.
[25] H. V. Jagadish, Adriane Chapman, Aaron Elkiss, Magesh Jayapandian, Yunyao Li, Arnab Nandi,
and Cong Yu. Making database systems usable. In Proceedings of the 2007 ACM SIGMOD
international conference on Management of data, SIGMOD ’07, pages 13–24, New York, NY,
USA, 2007. ACM.
[26] Kyu-Young Whang, Art Ammann, Anthony Bolmarcich, Maria Hanrahan, Guy Hochgesang,
Kuan-Tsae Huang, Al Khorasani, Ravi Krishnamurthy, Gary Sockut, Paula Sweeney, Vance Wad-
dle, and Moshe Zloof. Office-by-example: an integrated office system and database manager. ACM
Transactions on Information Systems, 5(4):393–427, October 1987.
[27] E. F. Codd. Relational completeness of data base sublanguages. In: R. Rustin (ed.): Database
Systems: 65-98, Prentice Hall and IBM Research Report RJ 987, San Jose, California, 1972.
[28] Arijit Sengupta and Andrew Dillon. Query by templates: a generalized approach for visual query
formulation for text dominated databases. In Proceedings of the IEEE international forum on
Research and technology advances in digital libraries, IEEE ADL ’97, pages 36–47, Washington,
DC, USA, 1997. IEEE Computer Society.
[29] Michele Angelaccio, Tiziana Catarci, and Giuseppe Santucci. Query by diagram: A fully visual
query system. Journal of Visual Languages and Computing, 1(3):255–273, September 1990.
[30] Antonio Massari, Stefano Pavani, Lorenzo Saladini, and Panos K. Chrysanthis. QBI: Query by
icons. In Proceedings of the 1995 ACM SIGMOD international conference on Management of
data, SIGMOD ’95, pages 477–, New York, NY, USA, 1995. ACM.
[31] Francesca Benzi, Dario Maio, and Stefano Rizzi. VISIONARY: a viewpoint-based visual language
for querying relational databases. Journal of Visual Languages and Computing, 10(2):117–145,
1999.
[32] Norman Murray, Norman Paton, and Carole Goble. Kaleidoquery: A visual query language for
object databases. In Proceedings of the Working Conference on Advanced Visual Interfaces, AVI
’98, pages 247–257, New York, NY, USA, 1998. ACM.
[33] Bin Liu and H.V. Jagadish. A spreadsheet algebra for a direct data manipulation query interface.
In Data Engineering, 2009. ICDE ’09. IEEE 25th International Conference on, pages 417–428,
2009.
[34] Clemente Rafael Borges and José Antonio Macías. Feasible database querying using a visual end-
user approach. In Proceedings of the 2nd ACM SIGCHI symposium on Engineering interactive
computing systems, EICS ’10, pages 187–192, New York, NY, USA, 2010. ACM.
[35] Arnab Nandi and Michael Mandel. The interactive join: recognizing gestures for database queries.
In CHI ’13 Extended Abstracts on Human Factors in Computing Systems, CHI EA ’13, pages
1203–1208, New York, NY, USA, 2013. ACM.
[36] Roberta Evans Sabin and Tieng K. Yap. Integrating information retrieval techniques with tradi-
tional db methods in a web-based database browser. In Proceedings of the 1998 ACM symposium
on Applied Computing, SAC ’98, pages 760–766, New York, NY, USA, 1998. ACM.
[37] Saurabh Sinha, Kirk Bowers, Sandra A. Mamrak, and Ra A. Mamrak. Accessing a medical
database using www-based user interfaces. Technical report, The Ohio State University, 1998.
[38] Magesh Jayapandian and H. V. Jagadish. Automating the design and construction of query forms.
In Proceedings of the 22Nd International Conference on Data Engineering, ICDE ’06, pages 125–,
Washington, DC, USA, 2006. IEEE Computer Society.
[39] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine.
In Proceedings of the seventh international conference on World Wide Web 7, WWW7, pages
107–117, Amsterdam, The Netherlands, The Netherlands, 1998. Elsevier Science Publishers B. V.
[40] Lin Guo, Feng Shao, Chavdar Botev, and Jayavel Shanmugasundaram. XRank: Ranked keyword
search over XML documents. In Proceedings of the 2003 ACM SIGMOD international conference
on Management of data, SIGMOD ’03, pages 16–27, New York, NY, USA, 2003. ACM.
[41] Yunyao Li, Cong Yu, and H. V. Jagadish. Schema-free XQuery. In Proceedings of the Thirtieth
international conference on Very large data bases - Volume 30, VLDB ’04, pages 72–83. VLDB
Endowment, 2004.
[42] Andrey Balmin, Vagelis Hristidis, and Yannis Papakonstantinou. ObjectRank: Authority-based
keyword search in databases. In (e)Proceedings of the Thirtieth International Conference on Very
Large Data Bases, Toronto, Canada, August 31 - September 3 2004, pages 564–575, 2004.
[43] Sanjay Agrawal, Surajit Chaudhuri, Gautam Das, and Aristides Gionis. Automated ranking of
database query results. In CIDR, pages 888–899, 2003.
[44] Eric Chu, Akanksha Baid, Xiaoyong Chai, AnHai Doan, and Jeffrey Naughton. Combining key-
word search and forms for ad hoc querying of databases. In Proceedings of the 2009 ACM SIG-
MOD International Conference on Management of data, SIGMOD ’09, pages 349–360, New York,
NY, USA, 2009. ACM.
[45] Aditya Ramesh, S. Sudarshan, Purva Joshi, and ManishaNaik Gaonkar. Keyword search on form
results. The VLDB Journal, 22(1):99–123, 2013.
[46] Google Directory, http://dir.google.com/.
[47] Open Web Directory, http://dmoz.org/.
[48] Arnab Nandi and H. V. Jagadish. Guided interaction: Rethinking the query-result paradigm.
PVLDB, 4(12):1466–1469, 2011.
[49] Ricardo Baeza-Yates, Carlos Hurtado, and Marcelo Mendoza. Query recommendation using query
logs in search engines. In Proceedings of the 2004 international conference on Current Trends in
Database Technology, EDBT’04, pages 588–596, Berlin, Heidelberg, 2004. Springer-Verlag.
[50] Zhiyong Zhang and Olfa Nasraoui. Mining search engine query logs for query recommendation.
In Proceedings of the 15th international conference on World Wide Web, WWW ’06, pages 1039–
1040, New York, NY, USA, 2006. ACM.
[51] Arnab Nandi and H. V. Jagadish. Assisted querying using instant-response interfaces. In Proceed-
ings of the 2007 ACM SIGMOD international conference on Management of data, SIGMOD ’07,
pages 1156–1158, New York, NY, USA, 2007. ACM.
[52] Nodira Khoussainova, YongChul Kwon, Magdalena Balazinska, and Dan Suciu. SnipSuggest: Context-aware autocompletion for SQL. Proceedings of the VLDB Endowment, 4(1):22–33, October 2010.
[53] Holger Bast and Ingmar Weber. The CompleteSearch engine: Interactive, efficient, and towards IR & DB integration. In CIDR, pages 88–95, 2007.
[54] Guoliang Li, Shengyue Ji, Chen Li, and Jianhua Feng. Efficient type-ahead search on relational
data: a tastier approach. In Proceedings of the 2009 ACM SIGMOD International Conference on
Management of data, SIGMOD ’09, pages 695–706, New York, NY, USA, 2009. ACM.
[55] Peter Anick. Using terminological feedback for web search refinement: A log-based study. In Pro-
ceedings of the 26th Annual International ACM SIGIR Conference on Research and Development
in Informaion Retrieval, SIGIR ’03, pages 88–95, New York, NY, USA, 2003. ACM.
[56] Ka-Ping Yee, Kirsten Swearingen, Kevin Li, and Marti Hearst. Faceted metadata for image search
and browsing. In Proceedings of the SIGCHI Conference on Human Factors in Computing Sys-
tems, CHI ’03, pages 401–408, New York, NY, USA, 2003. ACM.
[57] Paul Brown, Peter J. Haas, Jussi Myllymaki, Hamid Pirahesh, Berthold Reinwald, and Yannis
Sismanis. Toward automated large-scale information integration and discovery. In Data Manage-
ment in a Connected World, Essays Dedicated to Hartmut Wedekind on the Occasion of His 70th
Birthday, pages 161–180, 2005.
[58] AnHai Doan and Alon Y. Halevy. Semantic-integration research in the database community. AI
Mag., 26(1):83–94, March 2005.
[59] Erhard Rahm and Philip A. Bernstein. A survey of approaches to automatic schema matching. The
VLDB Journal, 10(4):334–350, December 2001.
[60] S. Bergamaschi, S. Castano, and M. Vincini. Semantic integration of semistructured and structured
data sources. SIGMOD Record, 28(1):54–59, March 1999.
[61] Luigi Palopoli, Giorgio Terracina, and Domenico Ursino. Experiences using dike, a system for
supporting cooperative information system and data warehouse design. Information Systems,
28(7):835–865, October 2003.
[62] Jayant Madhavan, Philip A. Bernstein, and Erhard Rahm. Generic schema matching with cupid.
In Proceedings of the 27th International Conference on Very Large Data Bases, VLDB ’01, pages
49–58, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.
[63] Erhard Rahm, Hong-Hai Do, and Sabine Massmann. Matching large xml schemas. SIGMOD
Record, 33(4):26–31, December 2004.
[64] Tamraparni Dasu, Theodore Johnson, S. Muthukrishnan, and Vladislav Shkapenyuk. Mining
database structure; or, how to build a data quality browser. In Proceedings of the 2002 ACM
SIGMOD international conference on Management of data, SIGMOD ’02, pages 240–251, New
York, NY, USA, 2002. ACM.
[65] Periklis Andritsos, Renee J. Miller, and Panayiotis Tsaparas. Information-theoretic tools for min-
ing database structure from large data sets. In Proceedings of the 2004 ACM SIGMOD interna-
tional conference on Management of data, SIGMOD ’04, pages 731–742, New York, NY, USA,
2004. ACM.
[66] Yannis Sismanis, Paul Brown, Peter J. Haas, and Berthold Reinwald. GORDIAN: efficient and
scalable discovery of composite keys. In Proceedings of the 32nd International Conference on
Very Large Data Bases, Seoul, Korea, September 12-15, 2006, pages 691–702, 2006.
[67] Jiawei Han, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques. Morgan
Kaufmann Publishers Inc., San Francisco, CA, USA, 3rd edition, 2011.
[68] Erhard Rahm and Philip A. Bernstein. A survey of approaches to automatic schema matching. The
VLDB Journal, 10(4):334–350, December 2001.
[69] A. Lund. Measuring usability with the USE questionnaire. Usability and User Experience Special Interest Group, 8(2), October 2001.
[70] Arnab Nandi and H. V. Jagadish. Guided interaction: Rethinking the query-result paradigm.
PVLDB, 4(12):1466–1469, 2011.
[71] AnHai Doan and Alon Y. Halevy. Semantic-integration research in the database community. AI
Magazine, 26(1):83–94, March 2005.
[72] TPC-E Benchmark, http://www.tpc.org/tpce/.
[73] Charles L. A. Clarke, Gordon V. Cormack, D. I. E. Kisman, and Thomas R. Lynam. Question
answering by passage selection (multitext experiments for TREC-9). In Proceedings of The Ninth
Text REtrieval Conference, TREC 2000, Gaithersburg, Maryland, USA, November 13-16, 2000,
2000.
[74] Abraham Ittycheriah, Martin Franz, Wei-Jing Zhu, Adwait Ratnaparkhi, and Richard J. Mam-
mone. Ibm’s statistical question answering system. In Proceedings of The Ninth Text REtrieval
Conference, TREC 2000, Gaithersburg, Maryland, USA, November 13-16, 2000, 2000.
[75] Gerard Salton, J. Allan, and Chris Buckley. Approaches to passage retrieval in full text information
systems. In Proceedings of the 16th annual international ACM SIGIR conference on Research and
development in information retrieval, SIGIR ’93, pages 49–58, New York, NY, USA, 1993. ACM.
[76] Stefanie Tellex, Boris Katz, Jimmy Lin, Aaron Fernandes, and Gregory Marton. Quantitative eval-
uation of passage retrieval algorithms for question answering. In Proceedings of the 26th annual
international ACM SIGIR conference on Research and development in informaion retrieval, SIGIR
’03, pages 41–47, New York, NY, USA, 2003. ACM.
[77] Mengqiu Wang and Luo Si. Discriminative probabilistic models for passage based retrieval. In
Proceedings of the 31st annual international ACM SIGIR conference on Research and development
in information retrieval, SIGIR ’08, pages 419–426, New York, NY, USA, 2008. ACM.
[78] W. Xi, R. Xu-Rong, C. S. Khoo, and E. P. Lim. Incorporating window-based passage-level evidence in document retrieval. Journal of Information Science, 27:73–80, 2001.
[79] Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gat-
ford. Okapi at TREC-3. In Proceedings of The Third Text Retrieval Conference, TREC 1994,
Gaithersburg, Maryland, USA, November 2-4, 1994, pages 109–126, 1994.
[80] Yang Zhou, Hong Cheng, and Jeffrey Xu Yu. Graph clustering based on structural/attribute simi-
larities. Proceedings of the VLDB Endowment, 2(1):718–729, August 2009.
[81] M. E. Dyer and A. M. Frieze. A simple heuristic for the p-centre problem. Operations Research Letters, 3(6):285–288, February 1985.