
KNOWLEDGE DISCOVERY IN INTERNET DATABASES

A THESIS

SUBMITTED TO THE FACULTY OF GRADUATE STUDIES AND RESEARCH IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

FOR THE DEGREE OF MASTER OF SCIENCE

IN

COMPUTER SCIENCE UNIVERSITY OF REGINA

BY Xiaobo Yu

Regina, Saskatchewan

December 22, 1997

© Copyright 1997: Xiaobo Yu


The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.

The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.

Abstract

A major objective in knowledge discovery in Internet database research is to support

exploration and analysis of large amounts of data from several databases, each

available via the Internet. This thesis describes an approach to achieving this objective

based on a multidatabase. The multidatabase system provides a single front-end

for several autonomous, heterogeneous database management systems.

A prototype software system, called KDID, has been developed to perform discovery

tasks on Internet databases. A discovery task is decomposed into parameters

for the task and a global database query. The global query is translated and decom-

posed into a set of local database queries, which are sent to Internet databases by

database agents. KDID standardizes and accumulates the results of the local queries

in a single database called the multidatabase. Knowledge discovery is then performed

on the retrieved data by a discovery tool, DB-Discover, which performs high level,

dynamic summarization and generalization of large amounts of data.

The approach is based on a global schema, which describes some related data. The

correspondence between this global schema and the individual databases is maintained

in a central registry. A registration subsystem is included in KDID to register Internet

databases. The subsystem interacts with database administrators to obtain database

schemas and integrate them with the global schema.

Acknowledgement

First, I must thank my supervisor and mentor, Dr. Howard Hamilton, for his

guidance, critical advice, encouragement, and patience, without which this thesis

might not have been completed. I am grateful to him for accepting me as one of

his students, for believing in my ability to obtain this degree, for providing access

to equipment, and for arranging financial support for me. An excellent supervisor,

researcher, educator, and a fine human being, Dr. Hamilton has left a profound and

lasting impression on me.

Next, I want to thank members of my thesis committee: Dr. Larry Saxton and Dr.

Menchi Liu provided helpful comments, and Dr. Ken Runtz served as the external

examiner. The Institute for Robotics and Intelligent Systems, the Faculty of Graduate

Studies and Research, and the Department of Computer Science provided much-appreciated

financial support. I also thank all members of the Department.

I am also greatly indebted to Dr. Edmund H. Dale, Professor Emeritus, and Miss

Anne Rigney for helping me to come to the University of Regina to continue my

studies. Without their help, I might not have been able to come to Canada; and

having come, Dr. Dale was always there to offer support, advice and encouragement.

His belief in me gave me confidence and determination to succeed in a culture and

environment that were completely new to me. I also thank my friends Allan and

Sharon Schmidt, Stuart and Yvonne Mann, and Len Morrison for their valuable

friendship and help, which made my stay in Regina pleasant.

Many thanks go to all my friends, Chu Tongsheng, Gai Huifa, Hu Qiang, Shu Jun,

Wang Changwen, Xie Yongzeng, Xing Minqing and Xu Zhan, and their families, for

all their help and friendship. Special thanks are due to Brock Barber, for his valuable

and friendly help with DB-Discover. I also thank my fellow graduate students and

office mates, Carlos Rivera, Colin Carter, Li Liangchun, Pang Wanglin, Sivakumar

Nagarajan, November Scheidt and Zhang Jian, who made this learning experience

enjoyable.

Especially, I am indebted to my parents and younger brother in China for their

constant love and understanding. Last, but not least, I am indebted to my wife, Fu

Lei, for her unwavering love, devotion and understanding, without which this thesis

could not have been finished.

Contents

Abstract
Acknowledgement
Table of Contents
List of Tables
List of Figures

Chapter 1 Introduction

Chapter 2 Background and Related Research
  2.1 The Internet and Internet Databases
    2.1.1 The Internet
    2.1.2 The Client-Server Model
    2.1.3 The Hypertext Transfer Protocol
    2.1.4 The HTML Forms
    2.1.5 The World Wide Web
    2.1.6 Internet Databases
  2.2 The Multidatabase Model
  2.3 Knowledge Discovery in Databases
    2.3.1 Types of Discovered Knowledge
    2.3.2 DB-Discover
  2.4 Resource Discovery in the Internet
    2.4.1 The Client Directory Server Model
    2.4.2 The Multiple Layered Database Model

Chapter 3 An Overview of the KDID System
  3.1 The Architecture of KDID
    3.1.1 The Internet Front-end
    3.1.2 The Interface Module
    3.1.3 The Multidatabase Module
    3.1.4 The KDD Application Module
  3.2 Databases Types
    3.2.1 Overview
    3.2.2 Oracle Databases
    3.2.3 Mini SQL Databases
    3.2.4 Microsoft Access Databases
  3.3 The Multidatabase Architecture for the KDID System
    3.3.1 The Architecture
    3.3.2 Data Integration
    3.3.3 Multidatabase Query Processing

Chapter 4 Design Issues
  4.1 The Interface Module
    4.1.1 The Form Data Parser
    4.1.2 Global Query Composition
  4.2 On-line Database Registration
    4.2.1 Security Issues
    4.2.2 The Registration Approach
  4.3 Global Query to Internal Query Translation
  4.4 Query Decomposition
    4.4.1 The Decomposition Algorithm
    4.4.2 Query Decomposition Examples
  4.5 Mechanisms for Resolving Data Value Differences
    4.5.1 Data Value Standardization
    4.5.2 Scale Conversion Functions

Chapter 5 Prototype Design and Testing
  5.1 Constructing a Test Multidatabase System
    5.1.1 Global Schema
    5.1.2 The D1 Database
    5.1.3 The D2 Database
    5.1.4 The D3 Database
    5.1.5 Relationships Between the Three Databases
  5.2 Database Registration Process
  5.3 Typical Knowledge Discovery Tasks
    5.3.1 Discovery Task 1
    5.3.2 Discovery Task 2
    5.3.3 Task Realization Using HTML Forms
    5.3.4 Discovery Results
  5.4 Performance Analysis
    5.4.1 Parallel Data Retrieval from Participating Databases
    5.4.2 Data Value Standardization
  5.5 Discussion

Chapter 6 Conclusion
  6.1 Summary
  6.2 Contributions
  6.3 Areas for Future Research

Bibliography

Appendix A Schema Information for the D1 Database
  A.1 The AWARD Table
  A.2 The DISCIPLINE Table
  A.3 The AREA Table

Appendix B Schema Information of Database D2
  B.1 The SCHOLARSHIP Table
  B.2 The COMMITTEE Table
  B.3 The GRANT-TYPE Table

Appendix C Schema Information of Database D3
  C.1 The ORGANIZATION Table

Appendix D Glossary

List of Tables

2.1 Typical Network Domain Types
2.2 Values of the TYPE Tag
2.3 Example Internet Databases
2.4 A Sample Server Database
2.5 A Sample Portion of a Relation in Layer 1
2.6 A Generalized Portion of a Relation in Layer 2
3.1 Functions Used to Contact the Mini SQL Server
3.2 ODBC Functions Used to Contact a Database Server
3.3 Data Types in Oracle, Mini SQL and MS Access
4.1 Format of Schema Mapping
4.2 Examples of Schema Mapping
4.3 Database Information for the Example Multidatabase
4.4 Example of Provinces and their Variations
5.1 Global Schema of the AWARD Table
5.2 Global Schema of the ORGANIZATION Table
5.3 Schema of the GRANT-TYPE Table
5.4 Comparison Between the AWARD and SCHOLARSHIP Tables
5.5 Time for Sequential and Parallel Retrieval
5.6 Standardization Time with 32K Memory and a 23788-Item Dictionary
5.7 Standardization Time with 64K Memory and a 47576-Item Dictionary
5.8 Standardization Time with 64K Memory and a 127576-Item Dictionary
A.1 Schema of the AWARD Table
A.2 Schema of the DISCIPLINE Table
A.3 Schema of the AREA Table
B.1 Schema of the SCHOLARSHIP Table
B.2 Schema of the COMMITTEE Table
B.3 Schema of the GRANT-TYPE Table
C.1 Schema of the ORGANIZATION Table

List of Figures

1.1 Number of Servers
1.2 Internet User Growth
2.1 Client-Server Model
2.2 An Example of an HTML Form Script
2.3 An HTML Form Example
2.4 Architecture of a Multidatabase System
2.5 A Concept Hierarchy for the Province Attribute: Tree View
2.6 Client Directory Server Model
2.7 The Format of a Query to the Directory Server
2.8 Multiple Layered Database Model
3.1 The Architecture of KDID
3.2 Software Hierarchy Chart
3.3 An HTML Form with User Data
3.4 An Example of a Form Data Set
3.5 Processed (Name, Value) Pairs
3.6 A Tabbed Concept Hierarchy File
3.7 Architecture of the KDID Multidatabase
3.8 Type Mapping Diagram
4.1 Flow Diagram for Query Processing
4.2 The Form Data Parser
4.3 Global Query Composition
4.4 On-line Database Registration
4.5 Global Query to Internal Query Translation
4.6 Procedure processallpredicates
4.7 Query Decomposition Algorithm
4.8 Global Query
4.9 Internal Query
4.10 Local Query Submitted to DB1, DB2
4.11 Local Query Submitted to DB3
4.12 Data Value Standardization
4.13 Procedure lookup
4.14 A Hash Table Example
5.1 A Concept Hierarchy for the PROVINCE Attribute
5.2 Relations among the Three Tables
5.3 Secure Site Certificate
5.4 User Database Registration Application Form
5.5 Form for Schema Mapping
5.6 User Database Registration Acknowledgement
5.7 An SQL-like Query for Discovery Task 1
5.8 SQL Query for Task 1 After Transformation
5.9 Query 1 for Task 1 for the D1 Database
5.10 Query 2 for Task 1 for the D3 Database
5.11 Query 3 for Task 1 for the D2 Database
5.12 An SQL-like Query for Discovery Task 2
5.13 Query 1 for Task 2 for the D1 Database
5.16 Query 4 for Task 2 for the D3 Database
5.17 HTML Form for Discovery Task 1
5.18 HTML Form for Discovery Task 1 with User Data
5.19 Final Result for Discovery Task 1
5.20 Retrieved Data without Generalization
5.21 Test Time for Parallel and Sequential Retrieval

Chapter 1

Introduction

Since the amount of information connected to the Internet is growing rapidly, an

increasing potential for knowledge discovery in this information exists. As shown in

Figure 1.1 taken from [64], the number of servers connected to the Internet grew from

one thousand in 1984 to more than 16 million in January, 1997. Additionally, as

shown in Figure 1.2, the number of users grew from fewer than one million in 1993

to more than 47 million in January 1997 [65].

In the 1990s, Internet information retrieval tools, such as Wide Area Information

Servers (WAIS), Archie, Prospero, Gopher, the World Wide Web (WWW), Netfind,

X.500, and Indie have been developed to help users find interesting information in the

Internet [68]. These tools facilitate browsing, searching, and organizing information

accessible via the Internet, but they do not provide knowledge discovery techniques

for structured data sources, such as databases, connected to the Internet.

Knowledge discovery provides a means of coping with the massive amount of data

produced daily. Users want to find knowledge hidden in this information. Knowl-

edge discovery is "the non-trivial process of identifying valid, novel, potentially useful,

and ultimately understandable patterns in data" [34]. If the output of a knowledge

discovery technique is a pattern that is considered interesting, it can be considered

as "knowledge". Knowledge discovery combines database and machine learning tech-

niques. Many algorithms for discovering association rules, classification rules, sequen-

tial patterns, and time sequences have been proposed for knowledge extraction from

databases and have been applied successfully to large relational databases [34], but

they have not been applied to the global information base, i.e., all data available via

the Internet.

Figure 1.1: Number of Servers [64]

There are several complications to applying knowledge discovery techniques to the

global information base: the number of sites, the fast growing user community, the

limited bandwidth, the great amounts of data, the unstructured nature of much of

this data, and the slow speed of data analysis. Included in the global information base

are many databases, which due to their structured nature are amenable to efficient

analysis.

An Internet database is a database which provides access to Internet users. Often,

an Internet database provides authoritative information in a specific category with

WWW as its database management interface. For example, the Semaphore Corporation's

Internet databases include a complete postal database with all United States

addresses, zip codes and mail carrier route numbers. In this thesis, attention is restricted

to relational databases with interfaces accepting Structured Query Language

(SQL) queries, which are the most common type of commercial databases on the

Internet.

Figure 1.2: Internet User Growth.

Knowledge discovery in Internet databases concerns the application of knowledge

discovery in database techniques to multiple relational databases available on the

Internet. The goal is to analyze more data than can be analyzed by a user without

automated support.

In this thesis, a multidatabase approach is introduced for conducting knowledge

discovery in Internet databases. The approach has been embodied in the Knowledge

Discovery in Internet Databases (KDID) software, which was designed and implemented

as part of this thesis research.

There are four main reasons why constructing such a model is worthwhile. First,

the multidatabase approach provides a means of using databases already connected

to the Internet. According to the Gale Directory of Databases [49], the number of

on-line databases increased from 4,000 in 1989 to 10,000 databases in 1996. These

databases are distributed worldwide and cover a large variety of subject areas. They

are provided by 1,800 on-line services and database distributors. A multidatabase

approach is appropriate because the databases are autonomous, heterogeneous, and

geographically dispersed. As well, databases containing similar information may use

different schemas. Building a multidatabase system allows the use of a single front-end

for many different database management systems.
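The single front-end idea can be illustrated with a minimal Python sketch. It assumes two hypothetical local databases (in-memory SQLite databases standing in for remote Internet databases) whose tables and columns are named differently, and a hypothetical registry that maps global attribute names to local ones. None of the table, column, or function names below are taken from KDID; they are illustrative assumptions only.

```python
import sqlite3

# Hypothetical global schema: AWARD(name, province, amount).
GLOBAL_ATTRS = ["name", "province", "amount"]

# Hypothetical registry: for each local database, the local table and the
# local column that corresponds to each global attribute.
REGISTRY = {
    "db1": {"table": "award",
            "cols": {"name": "recipient", "province": "prov", "amount": "amt"}},
    "db2": {"table": "scholarship",
            "cols": {"name": "holder", "province": "province", "amount": "stipend"}},
}


def make_local_databases():
    """Create two in-memory SQLite databases that stand in for remote ones."""
    db1 = sqlite3.connect(":memory:")
    db1.execute("CREATE TABLE award (recipient TEXT, prov TEXT, amt INTEGER)")
    db1.executemany("INSERT INTO award VALUES (?, ?, ?)",
                    [("Ann", "SK", 1000), ("Bob", "ON", 2500)])
    db2 = sqlite3.connect(":memory:")
    db2.execute("CREATE TABLE scholarship (holder TEXT, province TEXT, stipend INTEGER)")
    db2.executemany("INSERT INTO scholarship VALUES (?, ?, ?)",
                    [("Cy", "SK", 4000), ("Di", "BC", 500)])
    return {"db1": db1, "db2": db2}


def local_query(db_name, min_amount):
    """Translate the global condition 'amount >= min_amount' into local SQL."""
    entry = REGISTRY[db_name]
    cols = entry["cols"]
    select_list = ", ".join(f"{cols[a]} AS {a}" for a in GLOBAL_ATTRS)
    sql = f"SELECT {select_list} FROM {entry['table']} WHERE {cols['amount']} >= ?"
    return sql, (min_amount,)


def run_global_query(databases, min_amount):
    """Send one local query per database and accumulate the rows in one table."""
    multi = sqlite3.connect(":memory:")
    multi.execute("CREATE TABLE award (name TEXT, province TEXT, amount INTEGER, source TEXT)")
    for db_name, conn in databases.items():
        sql, params = local_query(db_name, min_amount)
        for row in conn.execute(sql, params):
            multi.execute("INSERT INTO award VALUES (?, ?, ?, ?)", (*row, db_name))
    return multi


multi_db = run_global_query(make_local_databases(), min_amount=1000)
rows = multi_db.execute(
    "SELECT name, province, amount, source FROM award ORDER BY name").fetchall()
print(rows)
```

The user of such a front-end states the condition once, against the global attribute names, and never sees the local column names.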

Secondly, the multidatabase approach extends the scope of existing knowledge discovery

from database systems from a single database to multiple databases. Several

existing knowledge discovery in database software packages permit the discovery of

useful information from large amounts of data stored in a single relational database.

For example, the DB-Discover research software system for knowledge discovery is

useful for data access and summarization for a single relational database [18]. It

allows high level, dynamic organization of data without modifying the data in any

database. Provided with access to a relational multidatabase, DB-Discover can sum-

marize interesting knowledge from multiple relational databases instead of a single

database.
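To give a feel for this kind of summarization, the sketch below generalizes a province attribute through a small concept hierarchy and merges the rows that become identical, without changing the stored data. It illustrates attribute-oriented generalization in general terms only; it is not DB-Discover's actual algorithm or interface, and the hierarchy, data, and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical concept hierarchy: province -> region -> "Canada".
PARENT = {
    "Saskatchewan": "Prairies", "Manitoba": "Prairies", "Alberta": "Prairies",
    "Ontario": "Central", "Quebec": "Central",
    "Prairies": "Canada", "Central": "Canada",
}


def generalize(value, levels):
    """Climb the hierarchy `levels` steps, stopping at the root."""
    for _ in range(levels):
        value = PARENT.get(value, value)
    return value


def summarize(rows, levels):
    """Generalize the province and merge rows, keeping a count and a total."""
    summary = defaultdict(lambda: [0, 0])
    for province, amount in rows:
        key = generalize(province, levels)
        summary[key][0] += 1        # number of original rows merged
        summary[key][1] += amount   # total amount of the merged rows
    return {key: tuple(value) for key, value in summary.items()}


awards = [("Saskatchewan", 1000), ("Manitoba", 2000),
          ("Ontario", 1500), ("Quebec", 500)]
by_region = summarize(awards, levels=1)
by_country = summarize(awards, levels=2)
print(by_region)
print(by_country)
```

Because the hierarchy is applied at query time, the same rows can be viewed at the province, region, or country level without rewriting the source table.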

Thirdly, the proposed approach provides a convenient way for database adminis-

trators to make additional databases available for discovery tasks. By using Hypertext

Markup Language (HTML) forms, a database administrator can register a database

with the KDID system. As well, the relationship between data in this database and

the global schema used by the KDID system can be conveniently defined.

Fourthly, the approach provides a convenient way for users to initiate discovery

tasks. The KDID system is accessible via Internet browsers, such as Netscape. Using

HTML-based forms provided by KDID, a user can specify a DB-Discover task,

including what data are to be retrieved from the multidatabase. The usefulness of

the discovered knowledge is affected by how well the user understands the domain

described by the retrieved data [6]. HTML forms help users understand the domain.

For example, if the user tries to discover information relating patient age with the

occurrence of diseases, it is more meaningful and convenient if he/she is able to select

items from a form based on a database schema for medical information than if he/she

must compose SQL queries. The incorporation of an HTML-based interface in the

KDID system makes it easier for any user to initiate discovery tasks without the aid

of database specialists.

The main goal of this thesis is to demonstrate the feasibility of conducting knowl-

edge discovery from Internet databases. The thesis focuses on three particular tasks:

on-line registration of user databases, translation of discovery tasks entered on a

World Wide Web (web) page into a series of queries on appropriate databases, and

the integration of results from these databases.

Chapter 2 presents background material required to place the design of the KDID

system in context. The background material concerns the Internet and Internet

databases, multidatabases, and techniques for knowledge discovery in databases. A

survey of relevant research on resource discovery in the Internet is also presented.

Examples are supplied to illustrate each model of resource discovery.

Chapter 3 is an overview of the KDID system for knowledge discovery in Internet

databases. It describes the general architecture of the KDID system and each com-

ponent. Relevant information concerning database types is also presented. The main

focus is on the multidatabase model, the platform for all knowledge discovery tasks.

Chapter 4 presents the design issues encountered during the implementation of

a prototype KDID system. An approach for manipulating schema information and

algorithms for query processing in the multidatabase context are described and ex-

amples are supplied. Mechanisms for resolving differences in data values are also

presented.

Chapter 5 describes the creation of a multidatabase for testing, the on-line database

registration, and the results of some knowledge discovery tasks. The performance of

the KDID system is also analyzed.

Chapter 6 presents conclusions. It summarizes the contributions of this research

and discusses possible areas for future research.

Chapter 2

Background and Related Research

In this chapter, background material and related research are presented. Since

this thesis concerns the application of knowledge discovery techniques to Internet

databases using a multidatabase approach, three appropriate background subjects

are the Internet and Internet databases, multidatabases, and knowledge discovery

techniques. Section 2.1 describes the Internet, the client-server model, the HTTP

protocol, HTML forms, the WWW, and Internet databases. Section 2.2 explains the

multidatabase model. Section 2.3 describes the techniques applied in knowledge dis-

covery in databases. Section 2.4 presents a survey of approaches to resource discovery

in the Internet, which is the most relevant topic in the literature to the subject of

this thesis.

2.1 The Internet and Internet Databases

2.1.1 The Internet

The Internet is the worldwide collection of inter-connected computer networks

and gateways that use the Internet Protocol (IP) and function as a single cooperative

network [48]. The Internet provides three levels of network services: connectionless

packet delivery, connection oriented streams, and application level services [48]. Information

can be transferred in electronic form via communication paths, such as

optical fiber lines and satellites, around the globe through the Internet.

Any computer system directly connected to the network has a domain name and an

IP address. A domain name is typically of the form system.site.domain, for example,

gopher.voa.gov. The most common domain types are shown in Table 2.1. An IP address is a unique 32-bit unsigned integer assigned to a computer.

Domain Code   Meaning                        Example
edu           educational organization       www.sims.berkeley.edu
com           commercial organization        www.bypass.com
gov           US governmental organization   gopher.voa.gov
mil           US military organization       web.nps.navy.mil
net           network organization           www.myhomepage.net
org           non-profit organization        ftp.ifcss.org
ca            Canada                         mercury.cs.uregina.ca
cn            China                          www.qd.sd.cn
uk            the United Kingdom             www.comp.brad.ac.uk

Table 2.1: Typical Network Domain Types.

2.1.2 The Client-Server Model

Figure 2.1: Client-Server Model.

To understand how the Internet operates, the concept of client-server computing

is crucial. The client-server model, as shown in Figure 2.1, is a form of distributed

computing that divides the application processing between a client and a server that

are connected by a network. A client process, which often corresponds to a user

interface, sends a request to a server process. The server receives the request, performs

the appropriate action, and returns the result to the client process [50]. Typically, a

server waits for any request messages sent to a particular port on a machine. Client-

server processing requires reliable communication between clients and servers.

With a client-server architecture, it is possible to create an interface that is in-

dependent of the server machine hosting the data. Therefore, the user interface of a

client-server application can be run on a Windows NT computer and the server can

be run on a mainframe. Clients can also be written for DOS-based or UNIX-based

computers. This allows information to be stored in a central server and disseminated

to different types of remote computers.
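The request and reply cycle described in this section can be sketched in a few lines of Python. The sketch assumes that both processes run on the same machine and talk over the loopback interface; the "action" performed by the server (upper-casing the request) and the function names are hypothetical and chosen only to keep the example short.

```python
import socket
import threading


def serve_once(listener):
    """Server process: wait for one request, act on it, return the result."""
    conn, _addr = listener.accept()
    with conn:
        request = conn.recv(1024).decode("ascii")
        conn.sendall(request.upper().encode("ascii"))


def ask_server(port, message):
    """Client process: send a request to the server's port and read the reply."""
    with socket.create_connection(("127.0.0.1", port)) as sock:
        sock.sendall(message.encode("ascii"))
        return sock.recv(1024).decode("ascii")


# Port 0 asks the operating system for any free port on the local machine.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]

server = threading.Thread(target=serve_once, args=(listener,))
server.start()
reply = ask_server(port, "hello server")
server.join()
listener.close()
print(reply)
```

The client needs to know only the server's address and port, not how or where the server computes its answer, which is the independence the text describes.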

2.1.3 The Hypertext Transfer Protocol

The Hypertext Transfer Protocol (HTTP) is an application-level protocol for distributed,

collaborative, hypermedia information systems [9]. The HTTP protocol is

based on the client-server model described in Section 2.1.2. A client establishes a connection

with a server and sends a request consisting of a request method, a Uniform

Resource Locator (URL), and the body of the request. The server responds with a

status line and the body of the reply. These ideas are now explained in more detail.

Most HTTP communication is initiated when a client process, such as a browser,

sends a request to a server controlling resources. HTTP communication takes place

over Transmission Control Protocol/Internet Protocol (TCP/IP) connections. The default is TCP port 80 [8], but other ports can be used; for example, the port number

for the HTTP server on chiron.cs.uregina.ca is 8050.

An HTTP Uniform Resource Locator (URL) is used to specify the location of

network resources via the HTTP protocol [9]. The syntax for an HTTP URL is

"http:" "//" host [ ":" port ] [ path ]

The identified resource is located by the HTTP server process listening for TCP connections

on the specified port of the host. The path identifies the full path of the resource.

For example, http://chiron.cs.uregina.ca:8050/htbin/some-action means that

an HTTP server is listening for TCP connections on port 8050 of chiron.cs.uregina.ca. The

directory on chiron is called /htbin and the resource is some-action.
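As an illustration of the URL syntax, the following sketch splits an HTTP URL into host, port, and path with Python's standard `urllib.parse` module and falls back to port 80 when no port is given. The helper name `split_http_url` is hypothetical, and the example URL is the one discussed above.

```python
from urllib.parse import urlsplit

DEFAULT_HTTP_PORT = 80


def split_http_url(url):
    """Return (host, port, path) for an HTTP URL, applying the default port."""
    parts = urlsplit(url)
    if parts.scheme != "http":
        raise ValueError("not an HTTP URL: " + url)
    port = parts.port if parts.port is not None else DEFAULT_HTTP_PORT
    return parts.hostname, port, parts.path or "/"


with_port = split_http_url("http://chiron.cs.uregina.ca:8050/htbin/some-action")
without_port = split_http_url("http://chiron.cs.uregina.ca")
print(with_port)
print(without_port)
```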

The body of the request contains the actual data to be transferred preceded

by several optional header lines. The data type of the body is determined via the

CONTENT-TYPE and CONTENT-ENCODING header fields. A CONTENT-TYPE

header specifies the media type of the data, and a CONTENT-ENCODING header

indicates an additional encoding technique applied to the content. The CONTENT-LENGTH

header indicates the size of a request or reply.
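The layout of a request, with header lines preceding the body, can be made concrete with a small sketch. It assembles an HTTP/1.0-style message by hand and then reads the headers back. The helper names, the path /htbin/kdid, and the form fields are assumptions made for illustration; header names are written in mixed case here because HTTP treats them as case-insensitive.

```python
def build_request(method, path, body, content_type):
    """Assemble a request: request line, headers, blank line, then the body."""
    data = body.encode("ascii")
    lines = [
        f"{method} {path} HTTP/1.0",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(data)}",
        "",   # the empty line that separates headers from the body
        "",
    ]
    return "\r\n".join(lines).encode("ascii") + data


def parse_headers(message):
    """Split a message into a dictionary of headers and the body bytes."""
    head, _, body = message.partition(b"\r\n\r\n")
    header_lines = head.decode("ascii").split("\r\n")[1:]  # skip request line
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers, body


message = build_request("POST", "/htbin/kdid", "SORT=disc_code&ORDER=ASC",
                        "application/x-www-form-urlencoded")
headers, body = parse_headers(message)
print(headers)
print(body)
```

The receiver relies on the Content-Length value to know how many bytes of body to read, and on Content-Type to know how to interpret them.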

The request methods for HTTP include GET and POST. The GET method re-

trieves whatever data is identified by the URL. If the URL refers to a data producing

process, the produced data is returned in the response rather than the source text

of the process. The POST method requests that the destination server accept the

body of the request as a new subordinate of the resource identified by the URL. For

example, POST might be used when the body of the request is a filled-in form being

submitted. The actual function performed by the POST method is determined by

the server and dependent on the URL [9].
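The difference between the two methods can be seen in a small self-contained sketch built on Python's standard `http.server` and `http.client` modules. A toy server runs on the local machine; GET returns data identified by the path, while POST hands a filled-in form to the server, which decides what to do with it (here it only echoes the decoded fields). The handler, paths, and field names are hypothetical and are not KDID's.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qsl, urlencode


class DemoHandler(BaseHTTPRequestHandler):
    """Toy server: GET returns fixed text, POST echoes the decoded form."""

    def _reply(self, text):
        data = text.encode("ascii")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def do_GET(self):
        self._reply("resource at " + self.path)

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        pairs = parse_qsl(self.rfile.read(length).decode("ascii"))
        self._reply("received " + repr(pairs))

    def log_message(self, format, *args):
        pass  # keep the demonstration quiet


def send(port, method, path, body=None, headers=None):
    """Open a connection, send one request, and return the reply body."""
    conn = HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request(method, path, body=body, headers=headers or {})
        return conn.getresponse().read().decode("ascii")
    finally:
        conn.close()


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

get_text = send(server.server_port, "GET", "/htbin/some-action")
form = urlencode([("SORT", "disc_code"), ("ORDER", "ASC")])
post_text = send(server.server_port, "POST", "/htbin/kdid", body=form,
                 headers={"Content-Type": "application/x-www-form-urlencoded"})

server.shutdown()
server.server_close()
print(get_text)
print(post_text)
```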

2.1.4 The HTML Forms

The Hypertext Markup Language (HTML) is a simple data format used to create

hypertext documents that are portable from one platform to another [8]. A document

in HTML consists of a mixture of HTML commands and any amount of American

Standard Code for Information Interchange (ASCII) text.

A form is a template for a form data set and an associated method and action

URL. INPUT, SELECT, and TEXTAREA are tags used to specify user-entered inputs

within the form. The INPUT tag specifies a simple input inside a form. Attributes

of INPUT are TYPE, NAME, VALUE, CHECKED, SIZE, and MAXLENGTH. The

TYPE attribute has the six possible values shown in Table 2.2.

Value | Explanation
text | Default text entry field
password | Text entry. Characters entered represented as asterisks
checkbox | A single toggle button
radio | A single button
submit | A push button causing the current form to be assembled into a query URL and sent to a remote server
reset | A push button causing the various input elements in the form to be reset to their default values

Table 2.2: Values of the TYPE Tag.

Any number of SELECT tags is allowed inside one form and they can be intermixed with other HTML commands and plain text. SELECT has opening and

closing tags, i.e., <SELECT> and </SELECT>. Each SELECT tag has a sequence of

OPTION tags. The attributes to SELECT include NAME, SIZE, and MULTIPLE.

Figure 2.2 gives an example of an HTML form script. The value of each INPUT TYPE determines the form element that is produced. For example, INPUT TYPE="RADIO" produces

a radio button on the form. A SELECT tag can have multiple values if the attribute

MULTIPLE follows. For example, the TARGET options are "amount", "province",

"org_name", and "dept". Several or all of them can be selected at the same time.

<FORM ACTION="http://chiron.cs.uregina.ca:8050/htbin/kdid" METHOD="POST">
Sort by
<SELECT NAME="SORT">
<OPTION SELECTED="SELECTED">disc_code
<OPTION>area_code
</SELECT>
<INPUT TYPE="RADIO" NAME="ORDER" VALUE="ASC" CHECKED> Ascending
<INPUT TYPE="RADIO" NAME="ORDER" VALUE="ASC"> Descending
Choose a concept hierarchy file
<SELECT NAME="LOAD">
<OPTION SELECTED="SELECTED">nserc.chf
<OPTION>pas1.chf
<OPTION>pas.chf
</SELECT>
Select target attributes first
<SELECT NAME="TARGET" MULTIPLE>
<OPTION SELECTED="SELECTED">amount
<OPTION>province
<OPTION>org_name
<OPTION>dept
</SELECT>
<INPUT TYPE="SUBMIT" VALUE="QUERY">
<INPUT TYPE="RESET" VALUE="RESET">
</FORM>

Figure 2.2: An Example of an HTML Form Script.

A form data set is a sequence of (NAME, VALUE) pairs. The names are specified

by the NAME attribute of form inputs, and the values are either by default given

by the form or filled in by a user. The resulting form data set is used to access

information service as a function of the action and method [8].

In order to protect form data from misinterpretation during transmission and processing,

a number of encoding methods can be used. The default encoding for all forms

is application/x-www-form-urlencoded. When a form is

application/x-www-form-urlencoded, the form field names and values are escaped. Space characters are

replaced by the "+" sign, and non-alphanumeric characters are replaced by "%HH",

i.e., a percent sign and two hexadecimal digits representing the ASCII code of the

character. Fields are listed in the order they appear in the HTML script with the

name separated from the value by an "=" sign, and the pairs are separated from each

other by an "&" sign.
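The encoding rules above can be illustrated with Python's standard `urllib.parse.urlencode`, which produces application/x-www-form-urlencoded text from a sequence of (NAME, VALUE) pairs. The field names SORT, ORDER and TARGET come from Figure 2.2; the COMMENT field and its value are hypothetical and are included only to show how a space and an "&" are escaped.

```python
from urllib.parse import urlencode

# (NAME, VALUE) pairs in the order the fields appear in the form.
form_data_set = [
    ("SORT", "disc_code"),
    ("ORDER", "ASC"),
    ("TARGET", "amount"),
    ("TARGET", "province"),
    ("COMMENT", "R&D dept"),  # hypothetical field, not part of Figure 2.2
]

encoded = urlencode(form_data_set)
print(encoded)
# SORT=disc_code&ORDER=ASC&TARGET=amount&TARGET=province&COMMENT=R%26D+dept
```

Note that this library leaves a few punctuation characters, such as "_" and ".", unescaped, so its output is slightly more permissive than the rule as stated in the text.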

An HTML browser, such as Netscape, processes a form by presenting the HTML script with each field in its initial state. The user can modify the fields. When

the user submits the form, the form data set is processed according to its method

and action URL. For example, the action URL in the form shown in Figure 2.2

is http://chiron.cs.uregina.ca:8050/htbin/kdid, and the method is POST. Therefore,

when the user clicks on the submit button, an HTML form appears in the browser, as shown in

Figure 2.3.

2.1.5 The World Wide Web

The World Wide Web ties together vast amounts of information stored on the

Internet. The World Wide Web is a collection of mutually referencing hypertext doc-

uments scattered across the Internet, and serviced by HTTP servers. Each server

provides access to the information at its site to local and remote clients. Web documents

are often written in HTML. By clicking on highlighted words on a web page,

the user jumps to another document linked to the highlighted text. Netscape is a

client application for the World Wide Web.

HTML documents link to each other by including URLs of documents they want

to reference. URLs can point either to static documents or to scripts, such as that

shown in Figure 2.2, which dynamically generate documents. The function of an

HTTP server is to accept URLs from clients, interpret those URLs as either local

document addresses or invocations of local scripts which produce documents, and

to transmit the resulting HTML documents to the client. The function of an HTTP client, such as Netscape, is to display documents received from servers and to transmit

URLs to servers in response to interactive events.

Figure 2.3: An HTML Form Example.

2.1.6 Internet Databases

As described in Chapter 1, an Internet database is a database with access provided

to Internet users. Here we limit attention to relational databases that use the

American National Standards Institute (ANSI) Structured Query Language (SQL) statements to manipulate data. Internet databases run in distributed, client-server

environments. Clients of an Internet database are allowed to be on any Internet host.

Some Internet databases are accessed directly as SQL servers, i.e., servers that receive

SQL queries as requests. Other Internet databases use WWW pages as their

database management interface. A user of such a database communicates with it

with interactive and dynamic HTML forms.

Typically, an Internet database with a web interface has an underlying database

server. User requests are translated into SQL (or another query language) and trans-

mitted to the database server. The results are returned to the user via the web.

The HTTP server for the web interface and the database server may be on the same

machine or different machines. In the approach described in this thesis, a separate

database server is required for the Internet databases to be accessed by the KDID

system.

Table 2.3 shows 10 Internet databases and their data categories. For example,

the College Theatre Internet Database provides a platform to share the college the-

atre experience for college theatre clubs and departments among US colleges and

universities.

Table 2.3: Example Internet Databases.

Internet Database | Subject
College Theatre Internet Database | College theatre experience
Showa Chemical Internet Database | Chemical products
Past Pages Internet Database | Information on paperbacks, cheaper hardcovers, magazines, postcards, prints
Florida Film Locations Internet Database | Location pictures from various film commissions in Florida
Nexus Database | Classified advertisements
ASFA Database | Aquatic sciences and fisheries abstracts
Embase Database | Clinical and experimental aspects of arthritis
Environmental Sciences & Pollution Management Database | Aspects of environmental sciences
Findex Database | Market research reports
Metadex Database | Aspects of metallurgical science and technology

2.2 The Multidatabase Model

A multidatabase system provides integrated access to autonomous, heterogeneous

databases via a single, relatively simple request [47]. Multidatabase systems facilitate

the sharing of information among different departments of a company or among

different companies where database management systems are incompatible with each

other. With a multidatabase system, a user does not have to send multiple requests

in different languages to multiple information sources. Instead the user sends one

query to the multidatabase system which handles the details.

Figure 2.4: Architecture of a Multidatabase System.

Typically, a multidatabase system includes a multidatabase, which is a database

that acts as a front-end to multiple, possibly heterogeneous local database management

systems [47], as shown in Figure 2.4. A local database management system

does not have to make any modification and retains full control over local data and

processing. Cooperating with the global system by serving global requests is strictly

voluntary.

A key aspect of multidatabases is site autonomy. Each site determines indepen-

dently what information it will share with the global system, and what global requests

it will service [47]. Global changes, such as the addition and deletion of other sites, do

not have any effect on the local database management systems. The multidatabase

model is desirable because the capital invested in hardware, software and user training

for the local databases is preserved. As well, site autonomy acts as a security

measure [47], because the local database management system has full control over

what processing options are allowed.

A number of multidatabase systems have been developed for production use, including

the DATAPLEX system and the Integrated Manufacturing Data Administration

System (IMDAS).

DATAPLEX is a distributed database management system. It allows queries to

retrieve and update data managed by diverse database systems. The relational model

is used as the global data model to provide a uniform user interface. The DATAPLEX

system has an SQL Parser, a Distributed Query Decomposer and a Data Dictionary

Manager. The SQL Parser checks the syntax of the SQL statements and determines

their meaning. The Distributed Query Decomposer decomposes an SQL query into

a set of local queries. As well, it determines the location of the user and merges all

results at the user's location. The Data Dictionary Manager finds the location of the

data referenced by a query [81].

The IMDAS system was developed to provide access to sources of manufacturing

data and to allow new application programs to be added without changing the existing

databases. The IMDAS system uses an SQL-like query language. Each local database

system has a Basic Data Server to provide the interface between the local database

management system and the integrated database system [81].

2.3 Knowledge Discovery in Databases

The notion of finding useful patterns in data has been given a variety of names,

including data mining, knowledge extraction, information discovery, information harvesting,

data archaeology, and data pattern processing [20]. The term knowledge

discovery in databases (KDD) was coined by Piatetsky-Shapiro in 1989 [33]. This

term emphasizes that knowledge is the end product of a data-driven discovery pro-

cess.

KDD has evolved, and continues to evolve, from the intersection of the research

fields of machine learning, pattern recognition, databases, statistics, knowledge acquisition

for expert systems, data visualization, and high performance computing

[20]. Techniques from machine learning, pattern recognition, and statistics are being

adapted to serve in KDD components. An important factor behind KDD is the

research involving database management. Effective data manipulation can greatly

improve the performance of a KDD system. In this thesis, the KDD technique implemented

in DB-Discover, the discovery of characteristic rules, is applied to Internet

databases.

In this section, knowledge discovery techniques are categorized based on the type

of discovered knowledge. As well, the DB-Discover system is described.

2.3.1 Types of Discovered Knowledge

Types of knowledge include classification rules, decision trees, association rules,

characteristic rules, and discriminant rules.

Data classification classifies data based on the values of certain attributes [20]. It is the process of finding the common properties among a set of data in a database. A training set is required to construct a classification model. Each tuple in the training

set consists of the same set of attributes as the tuples in a large database. Each tuple

has a known class identity. The objective of the classification process is to analyze

the training set and develop an accurate model for each class. Such models are used

to classify data in the large database to develop better descriptions for each class in

the database. These descriptions are called classification rules.

A decision tree is generated from a training set. A classification algorithm takes

the training set of attribute values and a class as input. The decision tree for the

training set consists of nodes that are tests on the attributes [32]. The outgoing

branches of a node correspond to all possible outcomes of the test at the node. The

subset of the training set at a node in the tree is partitioned along the branches.

Discovering association rules requires the derivation of a set of rules in the form of

$A_1 \wedge \dots \wedge A_m \rightarrow B_1 \wedge \dots \wedge B_n$, where $A_i$ (for $i \in \{1, \dots, m\}$) and $B_j$ (for $j \in \{1, \dots, n\}$) are sets of attribute values, from the relevant data sets in a database [20]. A good

example is to search for associations among items a customer purchases together in

a single transaction. For example, a KDD system might discover that if a customer

buys macaroni, he also has a tendency to buy cheese in the same transaction.
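The strength of such a rule is commonly quantified by its support and confidence. These two measures are standard in the association rule literature but are not defined in this thesis, so the sketch below is offered only as an illustration; the transactions are invented sample data.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for transaction in transactions if items <= set(transaction))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions."""
    antecedent_support = support(transactions, antecedent)
    if antecedent_support == 0:
        return 0.0
    both = support(transactions, set(antecedent) | set(consequent))
    return both / antecedent_support


# Invented sample data: each set is one customer transaction.
transactions = [
    {"macaroni", "cheese", "milk"},
    {"macaroni", "cheese"},
    {"macaroni", "bread"},
    {"cheese", "bread"},
]

rule_support = support(transactions, {"macaroni", "cheese"})
rule_confidence = confidence(transactions, {"macaroni"}, {"cheese"})
print(rule_support, rule_confidence)
```

Here the rule "macaroni implies cheese" holds in two of the three transactions that contain macaroni.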

A characteristic discovery task is to find interesting relationships between various

attributes of one or more relations in the database. A characteristic relation is a

relation generalized from data retrieved from a database guided by a set of concept

hierarchies.

A discriminant rule is an assertion that discriminates concepts of the target class

from the contrasting class [40]. A discriminant rule can be discovered by generalizing

data in both the target class and the contrasting classes synchronously and excluding

properties that overlap between the two classes.

DB-Discover is a research software package that permits the discovery of useful

information from large amounts of database data. DB-Discover is applied on rela-

tional databases and produces generalized relations. It performs generalization and

summarization efficiently. Data summarization presents the general characteristics

or a summarized high level view over a set of user specified data from a database.

DB-Discover allows high level, dynamic data organization without modifying the data

itself.

Software for performing characteristic discovery tasks has been implemented in

DB-Discover. DB-Discover is based on an attribute-oriented generalization algorithm

which takes as input a relation retrieved from a database and generalizes the data

guided by a set of concept hierarchies [18], as explained below.

Data generalization is then performed on the data by applying data generalization

techniques, including attribute removal, concept tree climbing, attribute threshold

control, propagation of counts and other aggregation function values. The summarized

data is expressed in the form of a generalized relation on which other operations

or transformations can be performed to transform the summarized data into different

output formats. For example, the generalized relation can be mapped into charts and

curves, using visualization tools.

A concept hierarchy, as shown in Figure 2.5, is a tree of concepts arranged hierar-

chically according to generality. For discrete valued attributes, leaf nodes correspond

to actual data values which may be found in the database. For continuous valued

attributes, leaf nodes represent ranges of values. Each higher level concept is a gen-

eralization.

Figure 2.5: A Concept Hierarchy for the Province Attribute: Tree View.
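Concept tree climbing, attribute threshold control and the propagation of counts can be sketched in a few lines of Python. This is a simplified illustration, not the DB-Discover implementation: the hierarchy below is an assumed fragment for the province attribute (the actual content of Figure 2.5 is not reproduced here), and the function names are hypothetical.

```python
from collections import Counter

# Assumed fragment of a concept hierarchy: child concept -> parent concept.
PARENT = {
    "Alberta": "Prairies",
    "Saskatchewan": "Prairies",
    "Manitoba": "Prairies",
    "Ontario": "Central Canada",
    "Quebec": "Central Canada",
    "Prairies": "Canada",
    "Central Canada": "Canada",
}


def climb(value):
    """Replace a concept by its parent; the root is left unchanged."""
    return PARENT.get(value, value)


def generalize(rows, index, threshold):
    """Climb the hierarchy on attribute `index` until the number of distinct
    values is at most `threshold`, then merge identical tuples with counts."""
    rows = [tuple(row) for row in rows]
    while len({row[index] for row in rows}) > threshold:
        climbed = [row[:index] + (climb(row[index]),) + row[index + 1:]
                   for row in rows]
        if climbed == rows:  # every value is already at the root
            break
        rows = climbed
    return Counter(rows)


rows = [("Alberta",), ("Saskatchewan",), ("Manitoba",), ("Ontario",), ("Quebec",)]
print(generalize(rows, 0, 2))
```

With a threshold of 2 the five provinces are generalized into two regions, and the count attached to each generalized tuple records how many original tuples it covers.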

2.4 Resource Discovery in the Internet

A variety of researchers are attempting to apply knowledge discovery techniques

to data available on the Internet. In this section, two proposals, the client directory

server model and the multiple layered database model, are described and evaluated.

2.4.1 The Client Directory Server Model

Li and Danzig [54] proposed the Client Directory Server (CDS) model for resource

discovery in the Internet based on the client-server model, shown in Figure 2.1. Thousands

of servers are available to provide information over the Internet. If every server

had to be consulted for every query, query time would be a bottleneck and there

would be duplicate searches.

Instead, the CDS model, shown in Figure 2.6, uses a directory to determine relevant

servers. Formally, a Client Directory Server (CDS) system is defined as (c, d, S), where c is a client corresponding to a user interface that requests information, d is a

server directory that contains a mechanism for dynamically locating incoming server

names, server locations and other information, and S is a set of servers scattered

through the network.

Figure 2.6: Client Directory Server Model.

A server directory records a description and summary for each information server.

In Figure 2.6, the client sends a query to the directory server to identify the servers

that are appropriate for the query. Then the client sends a request to each of these

servers. The dashed lines show that if only one server has to be visited, no other

server will be contacted, which decreases the network load.

Table 2.4: A Sample Server Database.

Server_name | Location | Topic
cs.fas.sfu.ca | Computing Science, Simon Fraser | knowledge discovery, intelligent scheduling
mercury.cs.uregina.ca | Computer Science, University of Regina | knowledge discovery, intelligent scheduling
www.enee.umd.edu | University of Maryland | information filtering, computational linguistics
www.mcc.com:80 | MCC Corporation | knowledge discovery

Table 2.4 shows a sample database with data for four servers. The format of the

query sent by the client to the directory server is shown in Figure 2.7.

select server_name from server_db where one of keywords like "knowledge discovery"

Figure 2.7: The Format of a Query to the Directory Server.

If the client asks, "Find all sites that have knowledge discovery related materials,"

the result of the query is cs.fas.sfu.ca, mercury.cs.uregina.ca, and www.mcc.com:80.

Then the client sends requests to the three servers to look for the information.

The advantage of the CDS model is the server directory. It keeps track of each

information server. Before a request is sent to any server, the directory is queried.

Only the servers involved in the request are contacted. This mechanism decreases

query time.
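The directory lookup can be sketched as a keyword match over the topic entries recorded for each server. In the Python fragment below the server names follow the sample server database; the assignment of topics to servers is an assumption made for the example, chosen so that the query result agrees with the one stated above, and the function name `find_servers` is hypothetical.

```python
# Server directory: server name -> topics recorded for that server.
# The topic assignment is assumed for illustration.
SERVER_DB = {
    "cs.fas.sfu.ca": ["knowledge discovery", "intelligent scheduling"],
    "mercury.cs.uregina.ca": ["knowledge discovery", "intelligent scheduling"],
    "www.enee.umd.edu": ["information filtering", "computational linguistics"],
    "www.mcc.com:80": ["knowledge discovery"],
}


def find_servers(directory, keyword):
    """Return the servers with at least one topic containing `keyword`."""
    return [name for name, topics in directory.items()
            if any(keyword in topic for topic in topics)]


relevant = find_servers(SERVER_DB, "knowledge discovery")
print(relevant)
```

Only the servers in `relevant` would then be contacted by the client, which is the source of the reduction in query time.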

The CDS model is keyword based. The relevance of a database to a query is

determined by keyword analysis of the entry for the database in the server directory,

not the database itself. The CDS model does not apply to integrated data sources

because it assumes its input data is text data in unstructured form. As well, the

client server directory model assumes that descriptions of all servers are provided.

2.4.2 The Multiple Layered Database Model

Han and Zaiane proposed a multiple layered database model for knowledge and

resource discovery in the Internet [41]. A multiple layered database (MLDB) is a

database formed by generalization and transformation of information. Knowledge

discovery proceeds from the lowest layer to the highest. Layer 0 is the original

database. Layer 1 and higher layers of an MLDB are constructed on top of layer

0. Knowledge discovery can be performed efficiently in an MLDB once it has been

constructed. Generalization from each layer to the next higher layer reduces the size

of the database.

Figure 2.8: Multiple Layered Database Model [41].

The MLDB model is based on the studies on databases and knowledge discovery

[34] [71] [42]. With the generalization techniques developed for knowledge discovery from databases, the voluminous, primitive information in the global information base

can be transformed and generalized into classified, structured high level information.

It can then be stored into a distributive database as layer 1 in the MLDB model [41].

The database constructed in layer 1 is too large to create and maintain. Further

merging and generalization at this level produces a higher layer, layer 2. This level can

be replicated at each site for further integration. Information retrieval and knowledge

discovery can be conducted in this higher level database.

The MLDB model is based on an extended relational model with newly defined

operators to handle information in the form of hypertext and multimedia. An MLDB has three components (s, h, d), where s is a database schema, h is a set of concept

hierarchies, and d is a set of database relations. Successful generalization is a key to

the construction of an MLDB system.

Concept generalization on nonnumeric values, such as indices, keywords, site locations,

server names, and organizations, relies on the concept hierarchies that represent

necessary background knowledge to direct generalization. With the usage of a concept

hierarchy, primitive data can be expressed in terms of generalized concepts in a

higher layer. Concept hierarchies in an MLDB must be standard and observed by all local databases.

An algorithm based on the attribute-oriented generalization for the construction

of an MLDB is presented in [41]. A sample relation in layer 1 is presented in Table

2.5.

http://www.vuse.vanderbilt.edu | Vanderbilt University | computer science department | machine learning | kdd tutorial | ...
http://www.isi.edu | University of Southern California | information science institute | machine learning, data mining | query reformulation for dynamic information integration | ...
http://www-db.stanford.edu | Stanford University | department of computer science | interoperability of heterogeneous databases | an approach to behavior sharing in federated database systems | ...
... | ... | ... | ... | ... | ...

Table 2.5: A Sample Portion of a Relation in Layer 1.

By extracting documents related to computer science, a layer 2 relation can be

created by generalization, shown in Table 2.6.

Organization | Interest | Publication | ...
Vanderbilt University | machine learning | ... | ...
University of Southern California | machine learning | ... | ...
Stanford University | interoperability of heterogeneous databases | ... | ...
... | ... | ... | ...

Table 2.6: A Generalized Portion of a Relation in Layer 2.

The advantage of the MLDB model is that it tries to structure all Internet resources

into the relational database model. Then KDD techniques can be applied to

a constructed relational database. The major weakness of the MLDB model is that

all data at each layer have to be stored except layer 0. With the huge amounts of data

available on the Internet, it would take too much disk space to store data at the lower

levels, for example, level 1. The MLDB model is a bottom-up method, that is, no

results are produced until everything has been summarized. As well, there are many

possible summaries for the same documents, and it is hard to obtain and maintain

concept hierarchies.

Chapter 3

An Overview of the KDID System

In this chapter, the architecture of the KDID system for knowledge discovery

in Internet databases is described. KDID addresses the need to conduct knowledge

discovery tasks in relational databases available on-line via the Internet.

This chapter is organized as follows. In Section 3.1, the components of the KDID system are described. Section 3.2 covers the database types. In Section 3.3, the

multidatabase model, which is the basis of the KDID system, is introduced, and the

data integration problems are discussed.

3.1 The Architecture of KDID

As shown in Figure 3.1, KDID consists of three main components: the Interface

Module, the Multidatabase Module and the KDD Application Module. Figure 3.2

gives a software hierarchy chart for the main components. The KDID Control Mod-

ule, shown at the top of Figure 3.2, is a relatively small module that implements the

communication paths shown in Figure 3.1. It invokes the other modules and coordinates

communication among them. The Internet Front-end is the communication

medium between the user and the KDID system. It receives user requests from Web

pages, sends user requests as HTML form data to the Interface Module, and receives

back information to be displayed. The Interface Module parses the HTML form data

and composes a global query. The global query is sent to the Query Manager of the

Multidatabase Module. The Query Manager then translates the global query into the

KDID internal format according to mappings between local and global schema. The

Query Manager decomposes the internal query into a set of local database queries. It

then checks each query for legitimacy and reports any errors. If all queries are decomposed

successfully, a database agent is run for each local database. Each agent sends a

query to its local database management system and receives the results. These results

are uploaded into the multidatabase system in standardized format. When all relevant

tasks have been processed, a KDD application is applied to the multidatabase.

The knowledge discovered is then returned to the Internet Front-end.

Each of the modules is described in greater detail in the remainder of this section.

Figure 3.1: The Architecture of KDID.

Figure 3.2: Software Hierarchy Chart.

3.1.1 The Internet Front-end

The Internet Front-end provides two services for KDID: input of discovery task

parameters including a multidatabase query and output of discovered results.

An SQL query sent to a multidatabase can return a large amount of data. The

KDID control module in Figure 3.2 directs the KDD application module to conduct

the discovery task on the results of the query. The discovered knowledge is given

at a higher conceptual level than the original data because high level concepts from

the concept hierarchies are used instead of base-level data. This helps to reduce the

amount of data returned dramatically. If the discovered knowledge does not meet the

user's need, the task parameters can be refined and the user can send a new request.

Knowledge discovered by the KDD application is sent to an Internet browser. The

discovered knowledge is translated to HTML format and displayed on a web page.

Advantages in implementing the Internet front-end based on the World Wide Web

include no interface development and versatile hypertext links. No actual interface

programming is needed because the browser provides generic graphical user interface

with features such as navigation, scrolling, interactive fields, text layouts and image

display. KDID supports hypertext links to database query forms and links that

represent database queries. Hypertext links not only provide a simple means of inter-

connecting databases, but they also allow databases to be connected to any resource

accessible via the Internet.

The World Wide Web does not solve all database or program problems. A number

of issues remain to be solved. They include performance and interface programma-

bility. The speed at which current browsers download documents leaves much to

be desired, even when the documents are locally resident. There are limitations on

the interfaces that can currently be created using HTML. The lack of a general-purpose

front-end scripting language with commands for storing and retrieving local

information and for moving between forms and documents under program control is a major

restriction.

3.1.2 The Interface Module

The Interface Module facilitates the process of obtaining results for a discovery

task involving data from several databases for end users. Many existing knowledge

discovery from database systems can only deal with a single database or a single

type of database. The Interface Module parses HTML form data and composes the

processed form data into a global query. The global query is translated into a unique

internal format which is decomposed into a set of queries suitable for each database

management system.

Form Data Parser

When the user at the Internet Front-end submits an HTML form, a form data

set is passed to the Interface Module. For example, when a user submits the HTML form shown in Figure 3.3, the form data set shown in Figure 3.4 is parsed.

The Form Data Parser parses the form data and obtains a set of (Name, Value)

pairs. For example, the (Name, Value) pairs shown in Figure 3.5 are obtained from

the form data set in Figure 3.4.
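The parsing step can be sketched with Python's standard `urllib.parse.parse_qsl`, which decodes an application/x-www-form-urlencoded string into (Name, Value) pairs. The sample string below is an assumed form data set built from field names used elsewhere in this chapter; it is not a transcription of Figure 3.4, and the helper names are hypothetical.

```python
from urllib.parse import parse_qsl


def parse_form_data(form_data):
    """Decode a form data set into a list of (Name, Value) pairs."""
    return parse_qsl(form_data, keep_blank_values=True)


def group_by_name(pairs):
    """Collect the values of repeated names, e.g. several TARGET entries."""
    grouped = {}
    for name, value in pairs:
        grouped.setdefault(name, []).append(value)
    return grouped


# Assumed sample input, for illustration only.
sample = ("SORT=disc_code&ORDER=ASC&LOAD=nserc.chf"
          "&TARGET=amount&TARGET=area_code&tuple=10000")

pairs = parse_form_data(sample)
print(pairs)
print(group_by_name(pairs)["TARGET"])
```

Grouping by name is convenient because a SELECT with the MULTIPLE attribute contributes one pair per selected option.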

Global Query Composer

The Global Query Composer takes the processed results from the Form Data Parser

to compose a knowledge discovery task and a global query. The global query is

decomposed into a set of local database queries to retrieve the relevant data from

the local databases. Some (Name, Value) pairs are relevant to the discovery task,

and some to both the discovery task and the global query. For example, the (Name,

Value) pair (tuple, 10000) is a parameter of a typical DB-Discover discovery task that

sets the maximum number of tuples that should be retrieved from the database to

10000. The (TARGET, amount) and (TARGET, area_code) pairs are the target

attributes of both the global query and discovery task.

From the example result in Figure 3.5, a global query

    select amount, area_code, disc_code, dept, org_name, province, cname
    from award a, organization b, committee c
    where a.org_code = b.org_code and a.cname = c.com_name
    and ( disc_code >= 23500 or disc_code <= 24500 )

and a discovery task

    login dbd dbd
    connect pas
    load nserc.chf
    select amount, area_code, disc_code, dept, org_name, province, cname
    from award a, organization b, committee c
    where a.org_code = b.org_code and a.cname = c.com_name
    and ( disc_code >= 23500 or disc_code <= 24500 )
    set threshold for default 10
    set threshold for tuple 10000
    set sum amount
    set percent sum

are composed.

Figure 3.3: An HTML Form with User Data.

Figure 3.4: An Example of a Form Data Set.

Figure 3.5: Processed (Name, Value) Pairs.

The global query is also used to specify the relevant data to be retrieved from

the multidatabase for the discovery task. A discovery task has parameters to make

database connections and to load concept hierarchy files. It also has parameters to

set attribute options and thresholds. A predicate of a task might have higher level

concepts.

3.1.3 The Multidatabase Module

The Multidatabase Module provides an interface to multiple databases in a way

that makes them appear as a single database. This module communicates with the

Interface Module and the KDD Application Module. It could be used unchanged

with any KDD application module that makes global queries requiring translation into

retrievals from multiple databases. The global query passed from the Interface Module

is sent to the Query Manager in the Multidatabase Module. The Query Manager

translates the query into an internal format according to the schema mappings. The

internal query is then decomposed into a set of local database queries. A database

agent is run for each of the local databases. A database agent forms a gateway to a

local database. It checks the validity and legality of the query passed by the Query

Manager. If the processing operation is permitted, and the local database is registered

to the schema information manager, the database agent sends the query to the local

database and receives the results. The retrieved data are standardized and uploaded

into the multidatabase.
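The gateway role of a database agent can be sketched as follows. This is an illustrative Python fragment, not the KDID code: the class `DatabaseAgent`, the set of registered databases, the list of permitted operations, and the choice of a list of dictionaries with lower-case column names as the "standardized" format are all assumptions, and an in-memory SQLite database stands in for a local database management system.

```python
import sqlite3


class DatabaseAgent:
    """Gateway to one local database (illustrative sketch)."""

    def __init__(self, name, connection, allowed_operations=("select",)):
        self.name = name
        self.connection = connection
        self.allowed_operations = set(allowed_operations)

    def run(self, query, registered_databases):
        words = query.split()
        if not words:
            raise ValueError("empty query")
        # The local database must be registered with the schema information.
        if self.name not in registered_databases:
            raise PermissionError(self.name + " is not registered")
        # Only permitted processing operations are forwarded.
        if words[0].lower() not in self.allowed_operations:
            raise PermissionError(words[0] + " is not permitted")
        cursor = self.connection.execute(query)
        columns = [column[0].lower() for column in cursor.description]
        # Assumed standardized format: one dictionary per retrieved tuple.
        return [dict(zip(columns, row)) for row in cursor.fetchall()]


# Stand-in local database with one sample tuple.
local_db = sqlite3.connect(":memory:")
local_db.execute("CREATE TABLE award (amount INTEGER, province TEXT)")
local_db.execute("INSERT INTO award VALUES (12000, 'Saskatchewan')")

agent = DatabaseAgent("pas", local_db)
rows = agent.run("select amount, province from award", {"pas"})
print(rows)
```

A query whose first keyword is not in the permitted set, or an agent whose database is not registered, is rejected before anything is sent to the local database.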

An advantage of the configuration of the Multidatabase Module in Figure 3.1 is

that each database agent is implemented and installed at the site where the multidatabase

resides. There is no imposition of any code on the local database machines,

which preserves site autonomy, as discussed in Section 2.2. Acceptance of KDID among database administrators should be increased because KDID only requires the

processing of database queries at their sites. A disadvantage of this approach is

that all raw data must be transferred across the network to the machine where the

multidatabase is running.

3.1.4 The KDD Application Module

The knowledge discovery applications most closely related to database technolo-

gies are data summarization and generalization tools. These applications present the

general characteristics or a summarized high level view over a set of user specified

data in a database. Data in databases contain detailed information at the lowest

level. For example, the "award" relation may contain attributes about information

concerning award amounts, discipline codes, and provinces or area codes. It is de-

sirable to summarize a large set of data and present it to the user at a higher level.

For example, a discipline code that is over 23000 and lower than 23500 is "computer

science". One such tool is DB-Discover, a data generalization tool.

The features of DB-Discover that KDID adopts are described below.

Features of DB-Discover

A KDD technique implemented in DB-Discover is attribute-oriented induction

(AOI). An AOI algorithm takes as input a relation retrieved from a database and

generalizes the data guided by a set of concept hierarchies. In KDID, the input is a

relation retrieved from the multidatabase, which is created by KDID from data sets

retrieved from several different types of databases.
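The generalization step of AOI can be illustrated with a minimal sketch. It is not DB-Discover's implementation: the function name `generalize`, the `parent_of` dictionary, and the sample rows are hypothetical, and thresholds are ignored. Each value of one attribute is replaced by its parent concept, and identical generalized tuples are merged with a count.

```python
from collections import Counter


def generalize(rows, attribute_index, parent_of):
    """Replace one attribute by its parent concept and merge equal tuples.

    rows            -- iterable of tuples (the input relation)
    attribute_index -- position of the attribute to generalize
    parent_of       -- hypothetical one-level concept hierarchy {child: parent}
    Returns a dict mapping each generalized tuple to its count.
    """
    counts = Counter()
    for row in rows:
        values = list(row)
        # Values without a parent in the hierarchy are left unchanged.
        values[attribute_index] = parent_of.get(values[attribute_index],
                                                values[attribute_index])
        counts[tuple(values)] += 1
    return dict(counts)


# Hypothetical sample data: (discipline concept, province)
sample_rows = [("HARDWARE", "Sask"), ("AI", "Sask"), ("AI", "Ont")]
sample_hierarchy = {"HARDWARE": "Computer", "AI": "Computer"}
generalized = generalize(sample_rows, 0, sample_hierarchy)
```

Three detailed tuples collapse into two generalized tuples, which is the kind of higher-level summary the text describes.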

To perform a knowledge discovery task requires several steps. First, a database

connection must be established. In the KDID system, since the multidatabase is

created based on the Oracle model, the connection to the multidatabase is under the

Oracle environment, which requires a logon name and password. The logon name

and password are not visible to any user because a system script is used. For

example,

<INPUT TYPE="hidden" NAME="LOGIN" VALUE="login dbd dbd">,

where the input type for KDID is "hidden". This feature protects the security of the

multidatabase. After a connection is established, one or more concept hierarchy files must be

loaded. Concept hierarchy files are defined using tabbed ASCII files [74]. Figure

3.6 is an example of a tabbed hierarchy file for the DISC-CODE attribute shown in Figure 2.5.

code Computer
    HARDWARE            23000-23500
    SYS ORGANIZATION    23500-24000
    SOFTWARE            24000-24500
    THEORY              24500-25000
    MATHEMATICS         25000-25500
    DATABASES           25500-26000
    AI                  26000-26500
    COMPUTING METHODS   26500-27000

Figure 3.6: A Tabbed Concept Hierarchy File.

When the hierarchies have been loaded, the discovery task can be defined. The

first step is to select target attributes. It is in the same format as a standard SQL query. The format is

select (attribute list),

in which the attribute list may be several attributes. When the target attributes and

tables have been specified, the predicate needs to be defined, e.g.,

where disc-code = "Computer" and province = "Western".

The values "Computer" and "Western" are not actual data values in the database, but

concepts defined in the DISC-CODE and PROVINCE hierarchies. The system trans-

lates the predicate disc-code = "Computer" into disc-code >= 23000 and disc-code

<= 27000.
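The translation from a concept to a range predicate can be sketched as follows. This is an illustration only, not the DB-Discover code: the dictionary `DISC_CODE_RANGES` and the function `expand_concept` are hypothetical names, and only ranges stated in the text or legible in Figure 3.6 are included.

```python
# Hypothetical in-memory form of part of the DISC-CODE hierarchy:
# each concept maps to the (low, high) code range it covers.
DISC_CODE_RANGES = {
    "Computer": (23000, 27000),
    "THEORY": (24500, 25000),
    "AI": (26000, 26500),
}


def expand_concept(attribute, concept, ranges):
    """Rewrite `attribute = concept` as a range predicate on stored codes."""
    low, high = ranges[concept]
    return f"{attribute} >= {low} and {attribute} <= {high}"


predicate = expand_concept("disc-code", "Computer", DISC_CODE_RANGES)
```

A concept that is absent from the hierarchy raises `KeyError` here; a real system would presumably report that the concept is undefined.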

Other task parameters affect different aspects of the discovery task. These include

values to set the retrieval threshold, which determines the maximum number of tuples to

be retrieved, and to set the attribute threshold, which specifies the maximum number

of distinct values that the attribute may have. After data has been retrieved, cer-

tain numerical summary information can be displayed. These include setting percent

count, which displays the proportion of the current tuple's counts in relation to the

sum of all tuples' counts to be displayed. The sum of an attribute is the total number

of items for non-numeric data and the total of the values for numeric data. Sums can

be automatically calculated by the DB-Discover system [74].

3.2 Database Types

3.2.1 Overview

As described in Section 2.3, types of databases include relational databases, object-

oriented databases, deductive databases, and multimedia databases. To ensure gener-

ality, three different database models are included in the KDID system. They include

Oracle databases, Mini SQL databases, and Microsoft Access databases. These three

types of databases are relational databases, but the data management and manipu-

lation languages differ. These databases are chosen because they are available on the

local network and it is easier to implement the network agents for relational databases.

The other database models are more complicated and current knowledge discovery

tools are designed for relational databases.

3.2.2 Oracle Databases

The Oracle Corporation introduced the Oracle relational database management

system in 1979. The Oracle database management system uses ANSI standard SQL

as the database access language [58]. It is one of the largest selling relational database

management systems in the commercial world. Oracle developed a superset of the

regular SQL, called SQL*Plus. Any interaction that deals with the Oracle server is

done through these SQL statements.

The SQL standard was defined by the ANSI/X3H2 committee as a module lan-

guage. A module language is a small language for expressing SQL operations in pure

SQL syntactic form. It is not possible to use the module language to code direct calls

to the Oracle database management system. Instead SQL statements are embedded

directly into the host language. Embedded SQL statements are prefixed by EXEC

SQL and terminated by a semicolon in C. An executable SQL statement can appear

wherever an executable host statement can appear.

3.2.3 Mini SQL Databases

Mini SQL is a database engine designed to provide fast access to stored data with

low memory requirements [45]. Mini SQL offers only a subset of the SQL standard

as its query interface. It is chosen as a test database management system because it

is designed to work in the client server environment over a TCP/IP network.

The Mini SQL server listens for connections on a TCP socket. The availability

of the TCP socket allows client applications to access data stored on machines over

the network. The Mini SQL system supplies an Application Programming Interface

(API) library, which allows any C program to communicate with the database server.

Using the API library avoids the need for Embedded SQL. Table 3.1 shows

some of the functions used to implement the network agent to contact the Mini SQL

server over the network.

Name             Purpose
msqlConnect      form an interconnection with a Mini SQL server
msqlSelectDB     select a database
msqlQuery        send a query to a selected database
msqlStoreResult  store result returned by a SELECT query
msqlFreeResult   free data space
msqlListDBs      obtain a list of databases
msqlListTables   retrieve a list of tables
msqlListFields   obtain information about fields in a table
msqlClose        close connection to the Mini SQL server

Table 3.1: Functions Used to Contact the Mini SQL Server.

3.2.4 Microsoft Access Databases

Microsoft Access is a relational database management system. MS Access provides

a variety of objects to display and manage information. These objects include macros

and modules. A macro is a list of actions to be performed by the database system.

For example, the MS Access database system can automatically open a set of forms

when a database is opened. MS Access provides a built-in database programming

language, Access Basic. Procedures can be written in Access Basic for operations

requiring complex, automated processing. A module is a Microsoft Access object that

contains Access Basic procedures. For example, a MS Access database may need to

get information from other database systems. Procedures to deal with this action are

modules.

The MS Access database management system uses the Open Database Connectiv-

ity (ODBC) standard defined by Microsoft. ODBC is an industry standard to enable

access to data in most of the popular data sources such as Access, dBase, Paradox,

Informix, Sybase and Microsoft SQL server. Using ODBC, a client can access local

data without knowing the database model of the participating database. An ODBC driver is a database programming interface which allows an application developed for

one database, such as Microsoft Access, to be ported to another database, such as

Mini SQL, without any involvement of the original developer. An application com-

municates with the ODBC driver and the ODBC driver passes each call to the database.

Table 3.2 lists some ODBC functions for interacting with a database server.

Name           Purpose
SQLPrepare     prepare an SQL string for execution
SQLExecute     execute a prepared statement, using the current values of the parameter markers in the statement
SQLFetch       fetch a row of data from a result set
SQLGetData     return data for a single unbound column in the current row
SQLTables      return a list of table names stored in a specific data source
SQLColumns     return the list of column names in specified tables
SQLConnect     load a driver and establish a connection to a data source
SQLError       return error or status information
SQLDisconnect  close the connection associated with a specific connection handle

Table 3.2: ODBC Functions Used to Contact a Database Server.

3.3 The Multidatabase Architecture for the KDID System

3.3.1 The Architecture

Users of KDID view all the local databases as a single database and leave the

internal view to the system. Consider a user who requests information on heart disease

medication in three different hospitals. Information about medication and drug usages

and pharmaceutical manufacturer is distributed among different databases. For users

of the KDID system, their goal is to obtain the knowledge from the different databases,

that is, there will be only the retrieval task involved at the local database level.

There is no need for the KDID system to update, insert or delete any data from any

local database. These tasks will be taken care of by the local database management

systems.

Figure 3.7: Architecture of the KDID Multidatabase.

Figure 3.7 shows the architecture of the KDID multidatabase model. The global

database agent module parses a global SQL query, which in the KDID system is part

of a discovery task entered through the Internet front-end. The global database agent

contacts the schema information manager to check the legitimacy of the query. If the

query is legal, it is sent to the query processor, which will translate it into the KDID

system's internal query format. Each attribute has its full path to each local database.

The global query processor will check the attribute names used in each local database.

When the global query is translated into the internal query, the query decomposer

takes it as input to decompose into a set of local database queries. Each of the local

queries will be sent to a local database that has its own local database access module

to access the local database management system. These databases exist at different

geographic locations and use different database schema architectures. Each employs

its own database management system with its own database query and manipulation

languages.

The Schema Information Manager coordinates the global database agent and the

local database agents. When a local database administrator decides to register his/her

database to the KDID system, the Schema Information Manager contacts the local

database, using the information supplied by the local user. Once the local schema

information has been retrieved, the schema information manager supplies information

about the global schema. The local database administrator then describes the re-

lationship between the global and the local attributes. If there is no relationship

between their database and the databases registered, the local database adminis-

trator can abort the registration process. The registration is strictly voluntary and

before the registration, information on the KDID system and a brief description

of the databases registered are presented to the local database administrator. Once

the local attributes have been selected for sharing with other databases, a concept

hierarchy is searched and displayed for the user.

The multidatabase system suits the KDID knowledge discovery task. It provides

a means for resolving the differences in data representation and function among local

database management systems, and it discovers more meaningful knowledge for users.

Data gathered from several databases include a wide range of information, and patterns

discovered based on these data might be more interesting.

KDID is capable of integrating relational databases in this research, but as ob-

served, the system can be extended to include network, hierarchical and object-

oriented databases. The users will be provided with a global integrated view of data

stored in different databases. User requests will be formulated in terms of the inte-

grated data model and then translated into local database queries. The information

to resolve data conflicts is stored in the KDID system directory.

3.3.2 Data Integration

Creating a multidatabase is complicated by the heterogeneity and autonomy of its

local systems. Heterogeneity exists through differences at the operating, database,

hardware, or communication level of the local systems. In this section, the focus

is on the database integration issues. The types of data integration, the reasons

the differences occur, and examples of how to solve the integration problems are

discussed. The issues include attribute name difference, attribute value difference,

which is purely syntactic, scale and type difference, missing data, and conflicting

values.

Each database participates in a multidatabase model by exporting a part of its

schema, called the export schema. A global schema is created by the integration of

multiple export schemas. If there are no relations among the concepts represented

in each local schema, the global schema is simply the union of the local schemas.

The same concepts may be represented in different databases, and concepts may be

represented differently.

Types of data integration conflicts are described below. According to Batini et al.

[7], different perspectives and equivalent constructs cause data integration conflicts.

Perspective difference is a modeling problem caused during the design phase of a

database schema. Different designers adopt different viewpoints when modeling the

same information. The rich set of constructs in data models allows for possibilities of

modeling. This results in variations in the conceptual database structure. Typically,

in conceptual models, several combinations of constructs can model the same real-

world domain equivalently.

Attribute Name Difference

Attributes having the same meaning may be given different names in different

databases. These are called synonyms. Another case is homonyms, that is, two

different attributes having an identical name. Homonyms can be resolved easily by

changing one of the attribute names into a different name in the global database.

Attribute Value Difference

Owing to design perspective differences, different designers may use different mod-

els for the same entity in different databases. For example, some designers may assign

one character to the attribute SEX, others may assign six, thus the data values re-

trieved from different databases are different. This can be resolved by the standard-

ization procedure.

Scale and Type Difference

The same attribute may be stored in different databases using different units of

measure. This difference can be resolved by defining a conversion function when

the data is retrieved. For example, if the attribute PRICE is measured in one

database in Canadian dollars, but measured in another in US dollars, a function

price = price × 1.36 can convert the US dollars into Canadian dollars.
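A conversion function of this kind might look like the sketch below. The rate 1.36 is the one quoted in the text; the function name `standardize_price` and the currency labels are hypothetical and are not part of KDID.

```python
US_TO_CAD = 1.36  # conversion rate quoted in the text


def standardize_price(price, currency):
    """Return a price in Canadian dollars (hypothetical helper).

    `currency` is the unit used by the local database that supplied
    the value: "CAD" or "USD".
    """
    if currency == "CAD":
        return price
    if currency == "USD":
        # Round to cents to hide floating-point noise.
        return round(price * US_TO_CAD, 2)
    raise ValueError(f"no conversion defined for {currency}")


converted = standardize_price(100.0, "USD")
```

Applying the conversion at retrieval time means the multidatabase only ever stores values in one unit.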

Different database systems define different data types, and most relational databases

define them according to the SQL standard. Type differences among database management

systems create problems when implementing the multidatabase model. Types have

to be converted for data definition and data manipulation before the multidatabase

creation. For example,

CREATE TABLE tempo ( col1 NUMBER(10) NOT NULL, col2 CHAR(50) ).

Table 3.3 shows the data types defined in Oracle, Mini SQL, and Microsoft Access

and their inter-relationships. The names of data types and their parameters differ

when used for data definition. This makes an application program dependent on the

underlying database management system.

[Table 3.3: the table body was lost in extraction. Column headings: Oracle, Mini SQL, Microsoft Access. Legible type names: CHAR, VARCHAR2, NUMBER, LONG, LONG RAW, RAW, TEXT, UNSIGNED, BIT, LONGBINARY, MEMO, CURRENCY, SHORT, FLOAT, INTEGER, DATETIME.]

Table 3.3: Data Types in Oracle, Mini SQL and MS Access.

The data type mapping developed by Steindl [80] is presented in Figure 3.8. Based

on this mapping, a function can be implemented to convert Mini SQL data types and

Microsoft Access data types to Oracle data types.

void mapping(char *ldbtype, char *oracletype, int paralen)

paralen is determined by the local database type ldbtype. If ldbtype is a string of

characters, the length has to be determined. If it is a number, the precision is required.

Figure 3.8: Type Mapping
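The shape of such a mapping function can be sketched in Python. The correspondences below are illustrative assumptions only and are not Steindl's mapping; the table names `STRING_TYPES` and `NUMERIC_TYPES` are hypothetical. The sketch only shows how the length or precision parameter (paralen in Figure 3.8) is attached to the resulting Oracle type.

```python
# Assumed, illustrative correspondences (not taken from the thesis).
STRING_TYPES = {"CHAR": "CHAR", "TEXT": "VARCHAR2"}
NUMERIC_TYPES = {"INT", "UNSIGNED", "SHORT", "LONG", "INTEGER"}


def mapping(ldbtype, paralen):
    """Map a local type name to an Oracle type declaration.

    ldbtype -- type name used by the local database
    paralen -- length for string types, precision for numeric types
    """
    key = ldbtype.upper()
    if key in STRING_TYPES:
        return f"{STRING_TYPES[key]}({paralen})"
    if key in NUMERIC_TYPES:
        return f"NUMBER({paralen})"
    raise ValueError(f"no Oracle mapping defined for {ldbtype}")


column_type = mapping("char", 50)
```

Unlike the C signature in Figure 3.8, which writes the result into an output parameter, this sketch returns the Oracle type as a string.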

Missing Data

A database at one location may not store all the information of interest concerning

an entity. Data can be present in one database but missing from the other; data in

one relation may be the summarization of the other relation.

Conflicting Values

If databases at different geographic locations store information concerning the

similar data items, there is a danger of conflicting values. It is difficult to establish that

a conflict exists and to correct the discrepancy. If there are "Medication" relations at

two databases, how do we determine when the same drug is being prescribed in each

relation? If the drug has prices in each relation, should these prices necessarily be

equal? One option is to project both relations, then later in the process of analysis

and discovery, it might be safe to take an average depending on the discovery factors.

Furthermore, in the process of knowledge discovery, the exact amount or exact price

may not be required; instead, a range may be required.

3.3.3 Multidatabase Query Processing

The user input from the Internet Front-end first passes through the HTML form

data parser, which identifies the necessary entries for query composition. The global

query is composed by the query composition module and is translated into a full-path

internal query. The internal query is decomposed into a set of local database queries,

which in turn are sent to different local database agents. If a local query is legitimate,

the query is executed against the local database management system.

A global query is a full-path query that may include constant predicates and join

conditions. A local database query has only the constant conditions with the join

condition decomposed into attributes to be retrieved.

The format of a global query is:

select (attribute list)
from (relation list)
where (constant condition) and (join condition)

The attribute list may contain several full-path attributes in the form database:relation.attribute.

The relation list is of the form database:relation.
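Splitting a full-path attribute into its three components is straightforward. The sketch below assumes the database:relation.attribute form given above; the names `FullPath` and `parse_full_path` are hypothetical.

```python
from typing import NamedTuple


class FullPath(NamedTuple):
    database: str
    relation: str
    attribute: str


def parse_full_path(path):
    """Split 'database:relation.attribute' into its three parts."""
    database, colon, rest = path.partition(":")
    relation, dot, attribute = rest.partition(".")
    if not (colon and dot and database and relation and attribute):
        raise ValueError(f"not a full-path attribute: {path!r}")
    return FullPath(database, relation, attribute)


amount = parse_full_path("nserc:award.amount")
```

Grouping parsed attributes by their database component gives the set of local databases a global query touches.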

The basis of query processing in KDID is the SQL select statement. Condi-

tions in the select statement have one of the following formats:

1. $A\ \theta\ d$ or $A\ \theta\ B$, where $\theta \in \{=, >, \geq, <, \leq\}$, $d$ is a primitive value, and $A$ and $B$ are attribute names. For example, in a condition over nserc:award.amount, nserc:award.org-code, pas:organization.org-code, and 10000, the first three are attribute names and 10000 is a primitive value.

2. $A\ \theta\ c$, where $\theta \in \{\text{like}, \text{not like}\}$, $c$ is a character string, and $A$ is an attribute name. For example,

pas.organization.province like S%

means that pas.organization.province is an attribute name and that S% is any character string starting with the letter S.

3. $A\ \theta\ B$, where $\theta \in \{\text{in}, \text{not in}\}$, and $A$ and $B$ are attribute names. For example,

pas.organization.province in (Alberta, Manitoba, Ontario)

means that pas.organization.province is an attribute name and that $\theta$ can be chosen from in and not in.
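The three condition formats can be told apart with simple pattern matching. The following is a rough sketch, not the KDID parser: the regular expressions and the function `classify` are assumptions, and quoting and nested expressions are not handled.

```python
import re

# One pattern per condition format; operands may not contain operator characters.
COMPARISON = re.compile(r"^\s*([^\s<>=]+)\s*(>=|<=|=|>|<)\s*([^\s<>=]+)\s*$")
LIKE = re.compile(r"^\s*(\S+)\s+(not\s+like|like)\s+(.+?)\s*$", re.IGNORECASE)
IN = re.compile(r"^\s*(\S+)\s+(not\s+in|in)\s*\((.*)\)\s*$", re.IGNORECASE)


def classify(condition):
    """Return (format name, (left operand, operator, right operand))."""
    for kind, pattern in (("like", LIKE), ("in", IN), ("comparison", COMPARISON)):
        match = pattern.match(condition)
        if match:
            return kind, match.groups()
    raise ValueError(f"unrecognized condition: {condition!r}")


kind, parts = classify("pas.organization.province like S%")
```

The comparison pattern is tried last so that the keyword operators are recognized first.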

Chapter 4

Design Issues

In this chapter, design issues related to the KDID system are described. Various design issues that arose during the implementation of a prototype system are

described and algorithms for the major modules are presented.

Figure 4.1 is the flow diagram for the query processing. A user query is parsed and

assembled from an HTML script file, which in turn is translated into an internal query.

The internal query is then decomposed into a set of local queries. Results from the

local queries are retrieved and transformed into internal results. The internal results

are imported into the multidatabase system. The user results are the final knowledge

discovered, using one of the KDD applications.

This chapter is organized as follows. In Section 4.1, design issues related to the

Interface Module are discussed. Section 4.2 presents the on-line database registration

process, and the mappings of local database schema information to the global schema.

Section 4.3 describes the global query to internal query translation algorithm. In

Section 4.4, the global query decomposition algorithm is proposed and examples of

decomposing a global query are given. In Section 4.5, mechanisms for resolving data

value differences are discussed.

4.1 The Interface Module

Two design issues related to the Interface Module are the design of the HTML form data parser and the global query composition algorithm.

Figure 4.1: Flow Diagram for Query Processing.

4.1.1 The Form Data Parser

An HTML form data parser is provided in Figure 4.2. Given a form data set, this

parser removes redundant special characters, converts the hexadecimal numbers to

ASCII codes, and collects all (NAME, VALUE) pairs. It is described below.

Let $S = (N, V, P)$ be an HTML form data set, where $N$ is a set of HTML commands, $V$ is a set of values following the HTML commands, and $P$ is a set of special characters and hexadecimal numbers. Let $E = \{E_i = (N_i, V_i) \mid N_i \in N, V_i \in V\}$ be a set of processed (Name, Value) pairs.

The parser procedure processes a form data set $S$, removes redundant elements in $P$ and returns all tuples in $E$.

When a URL containing a link to an HTML form script is activated, a form ap-

pears on a Web page. After the user fills in the fields and submits the form, the parser

begins to check the CONTENT_TYPE of the form. If the CONTENT_TYPE is

not application/x-www-form-urlencoded, the procedure prints an error message

"encoding method not implemented" and exits. Otherwise, the parser procedure

continues to check the request method. The request method implemented for the

Procedure parser

Input: S, an HTML form data set
Output: E, a set of (Name, Value) pairs
begin
    t := get(S, "CONTENT_TYPE")
    if t ≠ "application/x-www-form-urlencoded" then
        return error("not_implemented")
    end if
    m := get(S, "METHOD")
    if m = "POST" then
        len := get(S, "CONTENT_LENGTH")
        while len > 0 do
            N_i' := getword(S, len)
            V_i' := getword(S, len)
            N_i := del_special_char_convert_hex(N_i')
            V_i := del_special_char_convert_hex(V_i')
            if V_i ≠ null then
                add (N_i, V_i) to E
            end if
        end while
    else
        return error("method_must_be_POST")
    end if
    return E
end

Figure 4.2: The Form Data Parser.

parser is POST. If a request method other than POST appears in the script file,

an error message "method must be POST" is printed and the procedure exits. If

the method is POST, the CONTENT_LENGTH is obtained from S. The function

getword extracts $N_i'$ and $V_i'$ from S, and the length len of S is decremented based on

the lengths of $N_i'$ and $V_i'$, which vary from pair to pair. The pair $N_i$ and $V_i$ are the results after the special characters are removed and the hexadecimal numbers

are converted to ASCII codes. All pairs are returned, using a data structure.
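The parser procedure can be approximated in Python with the standard library. This is a sketch under stated assumptions, not the thesis implementation: the function name `parse_form`, the dictionary `environ` holding CGI-style variables, and the use of the REQUEST_METHOD key are choices made for this example, and the body is assumed to be ASCII so that characters and bytes coincide.

```python
from urllib.parse import unquote_plus


def parse_form(environ, body):
    """Return the non-empty (name, value) pairs of a urlencoded POST body."""
    if environ.get("CONTENT_TYPE") != "application/x-www-form-urlencoded":
        raise ValueError("encoding method not implemented")
    if environ.get("REQUEST_METHOD") != "POST":
        raise ValueError("method must be POST")
    length = int(environ.get("CONTENT_LENGTH", 0))
    pairs = []
    for field in body[:length].split("&"):
        if not field:
            continue
        name, _, value = field.partition("=")
        # Decode '+' and %XX escapes, as the hexadecimal conversion step does.
        name, value = unquote_plus(name), unquote_plus(value)
        if value:  # pairs with an empty value are dropped
            pairs.append((name, value))
    return pairs


sample_body = "TARGET=amount&TABLE=nserc%3Aaward&PREDICATE=amount+%3E+10000&EMPTY="
sample_environ = {
    "CONTENT_TYPE": "application/x-www-form-urlencoded",
    "REQUEST_METHOD": "POST",
    "CONTENT_LENGTH": str(len(sample_body)),
}
sample_pairs = parse_form(sample_environ, sample_body)
```

The order of the pairs is preserved, which matters because the composition step relies on targets, table, and predicates arriving in sequence.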

4.1.2 Global Query Composition

Let R be the assembled SQL query. $E = \{E_i = (N_i, V_i) \mid N_i \in N, V_i \in V\}$ is

defined in Section 4.1.1. The GQC procedure for global query composition, as shown

in Figure 4.3, takes the (NAME, VALUE) pairs processed by the HTML form data

parser as input, and assembles them into an SQL-like global query. The output string

R is sent to the query translation function.

The GQC procedure assumes that the series of pairs begins with one or more pairs

specifying the target of the SELECT statement, followed by one pair specifying the

table name, then followed by zero or more pairs specifying predicates and connec-

tors to be combined to form a WHERE clause. The literal strings "TARGET",

"PREDICATE", and "CONNECTOR" are reserved words in HTML script files.

The GQC procedure checks each attribute name $N_i$ to determine if it is "TARGET". If so, the corresponding target attribute value is concatenated to R. The variable

next is updated to the position of the pair immediately following the last observed

pair with name $N_i$ equal to "TARGET". If $N_{next}$ is "TABLE", then $V_{next}$ is concate-

nated to R in the format of database:table. Otherwise, an error message "table not

found" is printed and the procedure exits. When the procedure begins to process

the predicates, if $E_{next+1}$ is the first predicate, the keyword WHERE is concate-

nated onto R. The procedure processes the keywords like and in separately to ensure

that the special characters, such as single quotation marks, double quotation marks,

backslashes, percent signs, and ampersands, are processed appropriately.

Procedure GQC

Input: E, a (Name, Value) set
Output: R, an output SQL string
begin
    R := "SELECT"
    for each tuple E_i ∈ E do
        if N_i = "TARGET" then
            R := concat(R, V_i)
            next := i + 1
        end if
    end for
    R := concat(R, "FROM")
    if N_next ≠ "TABLE" then
        return error("table_not_found")
    else
        R := concat(R, V_next)
    end if
    if next < |E| then
        R := concat(R, "WHERE")
        for i = next + 1 ... |E| do
            if N_i = "PREDICATE" or N_i = "CONNECTOR" then
                R := concat(R, V_i)
            else
                return error("predicate_or_connector_not_found")
            end if
        end for
    end if
    return R
end

Figure 4.3: Global Query Composition.
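A Python rendering of the GQC procedure is sketched below. It follows the pseudocode's order of checks, but it is an interpretation: the function name `compose_global_query` is hypothetical, targets are joined with commas (the pseudocode only says "concat"), and errors are raised as exceptions rather than printed.

```python
def compose_global_query(pairs):
    """Assemble an SQL-like global query from ordered (name, value) pairs."""
    targets = []
    next_index = 0
    for i, (name, value) in enumerate(pairs):
        if name == "TARGET":
            targets.append(value)
            next_index = i + 1  # position just after the last TARGET pair
    if next_index >= len(pairs) or pairs[next_index][0] != "TABLE":
        raise ValueError("table not found")
    parts = ["SELECT", ", ".join(targets), "FROM", pairs[next_index][1]]
    remaining = pairs[next_index + 1:]
    if remaining:
        parts.append("WHERE")
        for name, value in remaining:
            if name not in ("PREDICATE", "CONNECTOR"):
                raise ValueError("predicate or connector not found")
            parts.append(value)
    return " ".join(parts)


sample_query = compose_global_query([
    ("TARGET", "amount"),
    ("TARGET", "province"),
    ("TABLE", "nserc:award"),
    ("PREDICATE", "amount > 10000"),
    ("CONNECTOR", "and"),
    ("PREDICATE", "province like 'S%'"),
])
```

The special handling of like and in values mentioned in the text (quotation marks, percent signs) is omitted here.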

4.2 On-line Database Registration

Before a query can be translated into local database queries, the mappings from

the global schema to each of the local schemas must be resolved. In [76][81][59], the

authors assume that mapping functions have been implemented and no details of

mappings are described. The mappings are vital to KDID since they involve Internet

databases. A fundamental assumption in KDID is that each local database admin-

istrator registers his/her local database with the KDID system. By registering their

databases with KDID, they are providing access to their information under controlled

conditions. This sharing allows the discovery of more interesting knowledge.

An on-line strategy for mapping database schemas is proposed in this thesis. Us-

ing HTML forms provided by the KDID database registration subsystem, the local

database administrator can manipulate the meta-data, that is, information about

database schemas and their intended meaning, to indicate how their local database

maps to a global schema provided by KDID. In the remainder of this section, the

registration subsystem is described with emphasis on security aspects and the regis-

tration process. Secure connections are used because the information about the local

databases is sensitive; for example, a local database account name and password

might be provided for a dedicated account.

4.2.1 Security Issues

In this section, the relevant security issues are discussed and protocols and stan-

dards are introduced.

As the Internet continues to develop, security is receiving increasing atten-

tion. Security is a baseline requirement for network computing. Privacy, authenti-

cation, authorization, and integrity are all required in any security strategy to prevent

eavesdropping, manipulation, and impersonation [63].

Various solutions to the security problem have been proposed and implemented.

The Netscape Company proposed the Secure Sockets Layer (SSL) transport proto-

col to improve Internet security. The SSL protocol provides data encryption, server

authentication, message integrity, and optional client authentication for a TCP/IP

connection [63]. The connection security provided by the SSL protocol has the following prop-

erties: private connections, peers' identity authentication, and reliable connections.

Encryption is used after an initial handshake to define a secret key [63]. Symmetric

cryptography is used for data encryption. To authenticate a peer's identity, asym-

metric (or public key) cryptography is used. Message transport, which includes a

message integrity check using a keyed Message Authentication Code (MAC), ensures

reliable connections.

To manipulate schema information through the Internet, a secure Internet server

must be set up. The Apache-SSL is a secure web server, based on Apache [3] and the

SSL protocol. Apache is an HTTP server based on the NCSA httpd server version 1.3

with increased functionality, speed, and reliability. The Apache-SSL server has every

feature described by the SSL protocol. The server uses a single X.509 certificate that

enables the server to authenticate itself to clients requesting SSL connections. When a

server presents a certificate during an SSL handshake, the Internet browser checks the

certificate against its certificate database. If the server certificate is in the database,

or if the server certificate is signed by a Certificate Authority whose certificate is in

the database, the SSL handshake can conclude successfully. The format and meaning

of the X.509 certificates are defined by RSA Laboratory Inc. ($61.

4.2.2 The Registration Approach

Figure 4.4 describes the steps taken to register a local database with the KDID schema manager, retrieve local schema information, and integrate it with the

global schema. The local database to be registered is assumed to contain data relevant

to the subject of the global schema. Consider a global schema concerning books

published. The local user here is assumed to be the local database administrator

who knows the content of one publisher's database and is aware of the KDID system.

When a local database administrator decides to register his/her local database, a

hypertext link is activated and a registration form is displayed. This connection is

secure, based on the secure server setup using the SSL protocol.

When the user fills in the required information and submits the registration form,

Steps for On-line Database Registration

1. Local database administrator (DBA) initiates registration;
2. Establish a secure connection between server and client;
3. Local DBA completes the application form;
4. Parse user submitted data;
5. Create a database agent using user data;
6. Retrieve global schema information with their intended meaning;
7. Generate an HTML form for the local DBA;
8. Process user submitted mappings;
9. Update the global schema information.

Figure 4.4: On-line Database Registration.

the KDID system parses the user data and creates an agent for this database. The database agent uses the supplied information to determine the network location of the

database; for example, the address of the NSERC database is chiron.cs.uregina.ca.

Schema information is retrieved for the relations specified by the local database.

After the local schema information has been retrieved, the global schema is queried

to obtain all global attributes and their intended meanings.

An HTML form is generated automatically based on the global and local schema

information. In the first part of the form, each global attribute is shown, preceded

by a number. The second part of the form displays the content of the local database

schema, including the name, type and length for each local attribute. The local user

is required to fil1 in the number of the corresponding global attribute. If there is no

mapping for a local attribute in the global schema, the default is "N/AV.
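The way the submitted form could be turned into mapping entries can be sketched as follows. This is only an illustration under stated assumptions, not the KDID implementation: the function name `build_mappings`, the triple layout of the form entries, and the sample attribute names are all hypothetical.

```python
def build_mappings(db_name, global_attrs, form_entries):
    """Turn submitted form entries into global-to-local attribute mappings.

    global_attrs: global attribute names in the order shown on the form,
                  so the attribute numbered 1 is global_attrs[0].
    form_entries: (relation, local_attribute, choice) triples, where choice
                  is the number typed by the local DBA or "N/A".
    """
    mappings = {}
    for relation, local_attr, choice in form_entries:
        choice = choice.strip()
        if choice.upper() == "N/A":
            continue  # no corresponding global attribute: nothing to update
        number = int(choice)
        if not 1 <= number <= len(global_attrs):
            raise ValueError(f"no global attribute numbered {number}")
        global_attr = global_attrs[number - 1]
        mappings.setdefault(global_attr, []).append((db_name, relation, local_attr))
    return mappings


# Hypothetical example data, not taken from the prototype.
global_attrs = ["AMOUNT", "DEPARTMENT", "ORG_CODE"]
entries = [("AWARD", "DEPT", "2"), ("AWARD", "ORGCODE", "3"), ("AWARD", "NOTES", "N/A")]
result = build_mappings("NSERC", global_attrs, entries)
print(result)
```

Local attributes left at "N/A" simply produce no entry, which matches the rule that no update is made for unmapped attributes.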

Global attribute name | dbname:relation:attribute[, dbname:relation:attribute] ...

Table 4.1: Format of Schema Mapping.

The last step is to update the global schema according to the user mappings. The format of schema mapping is shown in Table 4.1. The part in square brackets is optional, that is, a global attribute can be in several local databases and their local names might be different. Table 4.2 is an example.

DEPARTMENT | NSERC:AWARD.DEPT, COMACC:SCHOLARSHIP.DEPARTMENT
ORG_CODE | NSERC:AWARD.ORGCODE, PAS:ORGANIZATION.ORG_CODE

Table 4.2: Examples of Schema Mapping.

4.3 Global Query to Internal Query Translation

Procedure global_qry_to_internal_qry

Input: GQ, a global query
Output: IQ, an internal query string

procedure global_qry_to_internal_qry
begin
    get_three_parts(GQ, targets, relations, predicates)
    process_all_targets(targets, IQ)
    process_all_relations(relations, IQ)
    process_all_predicates(predicates, IQ)
end {global_qry_to_internal_qry}

Figure 4.5: Global Query to Internal Query Translation.

GQ is the assembled global query. Let IQ be the internal query. Figure 4.5 shows the global query to internal query translation procedure. The main objective is to decompose the query string passed from Figure 4.3 into three parts: the targets, the relations, and the predicates. Each part is processed separately, with target attributes, relations, and predicates translated into attributes, relations and predicates with full paths to local databases. Figure 4.6 shows the procedure to process all predicates. The process_all_targets and the process_all_relations procedures parse the target attributes and all relations in the global query respectively.

The process_all_predicates procedure gets the total number of predicates and the number of items in each predicate. Each predicate is processed and every attribute is retrieved based on the schema mapping information from the system directory for the valid path, using the function lookup. If an attribute cannot be found in the directory, an error message is passed back.

Procedure process_all_predicates

Input: predicates, a set of predicates
Output: IQ, an internal query string

procedure process_all_predicates
begin
    num_predicates := get_predicate_num(predicates)
    for i = 1 to num_predicates do
        num_items_in_predicate := get_items(predicate)
        for j = 1 to num_items_in_predicate do
            if j ≠ num_items_in_predicate - 1 then
                skip_special_character(predicate, COMMA)
            else
                skip_special_character(predicate, END_OF_STRING)
            end if
            tmp_pred := process_each_predicate(predicate, attr)
            attr := lookup(attr)
            if attr ≠ null then
                new_pred := assemble_predicate(attr)
            else
                print_error("mapping_not_found")
            end if
            IQ := concat(IQ, new_pred)
        end for
    end for
end {process_all_predicates}

Figure 4.6: Procedure process_all_predicates.
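The predicate translation step can be sketched in a few lines. This is a minimal sketch, assuming the system directory is an in-memory dictionary and that each predicate has already been split into an (attribute, operator, value) triple; the `DIRECTORY` contents and the triple layout are illustrative assumptions, not the prototype's data structures.

```python
# Hypothetical system directory: global attribute -> full paths in local databases.
DIRECTORY = {
    "SEX": ["DB1:PATIENT.SEX", "DB2:PATIENT.SEX"],
    "AGE": ["DB1:PATIENT.AGE", "DB2:PATIENT.AGE"],
}


def lookup(attr):
    """Return the full paths for a global attribute, or None if unmapped."""
    return DIRECTORY.get(attr.upper())


def process_all_predicates(predicates):
    """Translate (attr, op, value) predicates into an internal WHERE string."""
    parts = []
    for attr, op, value in predicates:
        paths = lookup(attr)
        if paths is None:
            # Corresponds to passing back the "mapping not found" error.
            raise KeyError(f"mapping not found: {attr}")
        # One predicate per local database that holds the attribute.
        parts.extend(f"{path} {op} {value}" for path in paths)
    return " AND ".join(parts)


where = process_all_predicates([("SEX", "=", "'male'"), ("AGE", ">", "60")])
print(where)
```

Raising an exception is one possible way to pass the error back; a real system might instead collect all unmapped attributes and report them together.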

4.4 Query Decomposition

4.4.1 The Decomposition Algorithm

The decomposition algorithm to process a translated internal query is presented in Figure 4.7. Given an internal query, SELECT A FROM Tables WHERE Conds, for each table specification DB..T appearing in Tables, where DB is a database name and T is a table name, a local database query

SELECT $A_{DB..T}$ FROM T WHERE $Conds_{DB..T}$

is generated, where $A_{DB..T}$ consists of each attribute NAME such that DB..T.NAME occurs either in A or in Conds. $Conds_{DB..T}$ consists of those conditions in Conds involving only constant terms and terms of the form DB..T.NAME, that is, $Conds_{DB..T}$ does not involve any references to other tables DB'..T' where DB ≠ DB' or T ≠ T'.

The local database query is formed by consulting the schema mapping information in the system directory to remove parts involving tables other than DB..T. The query is then translated into the specific DBMS manipulation language, replacing terms of the form DB..T.NAME with the attribute name NAME. If $Conds_{DB..T}$ is empty, the WHERE keyword is omitted.
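The decomposition rule above can be illustrated with a short sketch. It assumes the internal query has already been split into a target list, a table list, and a list of WHERE conjuncts, with terms written as DB..T.NAME; the function name `decompose` and the sample query are hypothetical, and the translation into a specific DBMS language is left out. As a simplification, conditions that contain no table term at all are dropped.

```python
import re

# A fully qualified term such as "DB1..PATIENT.AGE": database, table, attribute.
TERM = re.compile(r"(\w+)\.\.(\w+)\.(\w+)")


def decompose(attrs, tables, conds):
    """Return {(db, table): local SQL string} for an internal query.

    attrs:  target list A, e.g. ["DB1..PATIENT.SEX"]
    tables: table specifications, e.g. ["DB1..PATIENT"]
    conds:  conjuncts of the WHERE clause, each a string
    """
    local_queries = {}
    for spec in tables:
        db, table = spec.split("..")
        # A_{DB..T}: attributes of this table that occur in A or in Conds.
        names = []
        for text in list(attrs) + list(conds):
            for d, t, name in TERM.findall(text):
                if (d, t) == (db, table) and name not in names:
                    names.append(name)
        # Conds_{DB..T}: conditions whose terms all belong to this table.
        own = []
        for cond in conds:
            terms = TERM.findall(cond)
            if terms and all((d, t) == (db, table) for d, t, _ in terms):
                own.append(TERM.sub(r"\3", cond))  # DB..T.NAME -> NAME
        sql = f"SELECT {', '.join(names)} FROM {table}"
        if own:  # the WHERE keyword is omitted when no condition remains
            sql += " WHERE " + " AND ".join(own)
        local_queries[(db, table)] = sql
    return local_queries


queries = decompose(
    attrs=["DB1..PATIENT.SEX", "DB1..PATIENT.AGE", "DB3..DRUG.Dname"],
    tables=["DB1..PATIENT", "DB3..DRUG"],
    conds=[
        "DB1..PATIENT.SEX = 'male'",
        "DB1..PATIENT.AGE > 60",
        "DB1..PATIENT.Drugno = DB3..DRUG.DNO",
    ],
)
for key, sql in queries.items():
    print(key, sql)
```

The cross-database join condition appears in neither local query, but its attributes (Drugno and DNO) are still selected so that the join can be evaluated later in the multidatabase.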

4.4.2 Query Decomposition Examples

To illustrate the decomposition of internal queries, let us consider an example multidatabase environment. The multidatabase consists of four databases located at three different sites: REGINA (Headquarters and Hospitals), WINNIPEG (Hospitals), and EDMONTON (Pharmaceutical Manufacturing). Each local database management system may be different. In Table 4.3, DB1 at REGINA, DB2 at WINNIPEG, and DB3 at EDMONTON are three local databases, and DB4 at REGINA is the multidatabase.

Algorithm For Query Decomposition

Input: Query of the form Select (A) From (Tables) Where (C)
Output: (database, query) pairs

Define DB to be the list of databases in (DBS)
       A to be the list of attributes in (A)
       T to be the list of tables in (Tables)
       C to be the list of conditions in (C)

for each table DB..T do
    for each DB ∈ DBS do
        generate SELECT $A_{DB..T}$ FROM T WHERE $C_{DB..T}$
            where every attribute NAME ∈ $A_{DB..T}$
                such that DB..T.NAME ∈ A ∪ DB..T.NAME ∈ C
            and $C_{DB..T}$ ∈ C
                where $C_{DB..T}$ = constant terms ∪ $C_{DB..T}$ = DB..T.NAME
                and $C_{DB..T}$ ≠ $C_{DB'..T'}$ such that DB ≠ DB' ∪ T ≠ T'
        check the system directory to remove parts not in DB..T
        translate query into a specific DBMS manipulation language
        replace DB..T.NAME with attribute name NAME
        if $C_{DB..T}$ = NULL
            omit keyword WHERE
        end if
    end for
end for

Figure 4.7: Query Decomposition Algorithm.

Three database schemas exist to store the medical information.

Schema A
    PATIENT (PNO, Name, Sex, Age, Phys, Diag, Pres, Drug_Use_Time)
    PHYSICIAN (Physno, Physname, Deptno)

Schema B
    DRUG (DNO, Dname, Ingredient, Manufacturer, Date)

Database | Schema | Location | Local/Global
DB1 | A | REGINA | local
DB2 | A | WINNIPEG | local
DB3 | B | EDMONTON | local
DB4 | C | REGINA | global

Table 4.3: Database Information for the Example Multidatabase.

Schema A exists in database DB1 at site REGINA, and in database DB2 at site WINNIPEG. Schema B exists in database DB3 at site EDMONTON. Only patients, drugs and physicians at local hospitals are stored in databases DB1, DB2, DB3, respectively. Schema C is created dynamically based on the global query created by the Interface Module. Schema C exists in the multidatabase DB4 at site REGINA, where the global information about patients, drugs, and physicians is stored. An example of schema C would appear like

PATIENT (SEX, AGE, DIAGNOSIS, PRESCRIPTION, ONDRUG_TIME).

Consider the following discovery task: "Look for interesting relations between male patients over age 60 and their medication". A suitable global query to retrieve data relevant to this discovery task, generated by the Interface Module, is shown in Figure 4.8.

SELECT SEX, AGE, DIAGNOSIS, PRESCRIPTION, ONDRUG_TIME
FROM PATIENT, PHYSICIAN, DRUG
WHERE SEX = 'male' AND AGE > 60
  AND PATIENT.Drugno = DRUG.DNO
  AND PATIENT.Physno = PHYSICIAN.Physno

Figure 4.8: Global Query.

A user is unaware of the complexity of the local and global schema mappings and of the details of the local database schema information; he/she can only submit qualified queries based on the information supplied by the HTML form. The user's work is made easier by the Internet front-end and HTML forms, which require a user to enter as few keywords as possible to start a discovery task. After a fully qualified query has been successfully composed, it is submitted to the query manager for further processing. In this phase, the query manager produces a translated internal query, as shown in Figure 4.9.

SELECT DB1:PATIENT.SEX, DB2:PATIENT.SEX,
       DB1:PATIENT.AGE, DB2:PATIENT.AGE,
       DB1:PATIENT.DIAGNOSIS, DB2:PATIENT.DIAGNOSIS,
       DB1:PATIENT.PRESCRIPTION,
       DB1:PATIENT.ONDRUG_TIME, DB2:PATIENT.ONDRUG_TIME,
       DB1:PHYSICIAN.Physname, DB2:PHYSICIAN.Physname,
       DB1:PATIENT.Drugno, DB3:DRUG.DNO
FROM DB1:PATIENT, DB2:PATIENT, DB1:PHYSICIAN, DB2:PHYSICIAN, DB3:DRUG
WHERE DB1:PATIENT.SEX = 'male' AND DB2:PATIENT.SEX = 'male'
  AND DB1:PATIENT.AGE > 60 AND DB2:PATIENT.AGE > 60
  AND DB1:PATIENT.Drugno = DB3:DRUG.DNO
  AND DB2:PATIENT.Drugno = DB3:DRUG.DNO
  AND DB1:PATIENT.Physno = DB1:PHYSICIAN.Physno
  AND DB2:PATIENT.Physno = DB2:PHYSICIAN.Physno

Figure 4.9: Internal Query.

SELECT SEX, AGE, DIAGNOSIS, PRESCRIPTION, ONDRUG_TIME, Physname, Drugno
FROM PATIENT, PHYSICIAN
WHERE SEX = 'male' AND AGE > 60 AND PATIENT.Physno = PHYSICIAN.Physno

Figure 4.10: Local Query Submitted to DB1, DB2.

SELECT DNO, Dname
FROM DRUG

Figure 4.11: Local Query Submitted to DB3.

The query manager then calls the decompose procedure to decompose the internal query string. The set of local queries L produced by the decomposition consists of queries pruned for each local database system. For this example, L includes $L_{DB1}$, $L_{DB2}$, and $L_{DB3}$, as shown in Figures 4.10 and 4.11.

Each local query is submitted to the local database management system. Under the control of the local database management systems, all local queries are executed. The transaction of each local query is monitored by the query manager. If there is any error or transaction failure, an error message is returned by the query manager. This makes it easier to keep track of the work of each local query.

The results of local queries are sent back to the headquarters' site at REGINA to be inserted in the global database, DB4.

Thus, when a user submits an HTML form, a query is composed by the Interface Module. The query manager translates it into an internal query, and decomposes it into a set of local queries. If the local queries are executed successfully, the local results from each local database management system are transferred to the multidatabase at the headquarters' site. The results are inserted into temporary tables created in the multidatabase, DB4. The KDD application module is then executed, based on the tables in DB4. The result returned to the user is the discovered knowledge, that is, the user result shown in Figure 4.1.

4.5 Mechanisms for Resolving Data Value Differences

In this section, mechanisms are presented for resolving data value and scale differences among component databases. When the data for the same attribute are represented differently in two databases, it is difficult to provide a solution that automates the standardization process. In [15] and [24], the authors do not give any solution but assume that the database administrator knows the differences between the two data sets retrieved from the two databases. The database administrator may either treat the data differently by changing the attribute name in one database to another name, or retrieve all data and let the user decide.

4.5.1 Data Value Standardization

In this thesis, an assumption for resolving data value differences is that a standard dictionary is supplied for each attribute in the global schema. The format for a dictionary name is attribute.dic. For example, the dictionary for the attribute PROVINCE is province.dic. If the type of an attribute is number, the values of that attribute are not checked. The reason is discussed in the next subsection.

Figure 4.12 shows the data value standardization algorithm. qresult is an attribute set retrieved from a relation. The algorithm checks each attribute attribute_i. If the type of attribute_i is not number, each value attribute_ij of attribute_i is checked. If the value attribute_ij is found in the standard dictionary for attribute_i, the value is replaced by the standard value and added into sresult, the set of standardized results. If the attribute value from the retrieved set cannot be found in the dictionary, the value is added into the dictionary as the standard value and put into sresult.

Standardization Algorithm

Input: qresult, a set of retrieved results
Output: sresult, a set of standardized results

begin
    for each attribute_i ∈ qresult do
        if attribute_i.type ≠ number_type then
            for each attribute_ij ∈ attribute_i do
                standard := lookup(attribute_i.dic, attribute_ij)
                if standard = not found then
                    insert(attribute_i.dic, attribute_ij)
                    insert(sresult, attribute_ij)
                else
                    insert(sresult, standard)
                end if
            end for
        end if
    end for
end

Figure 4.12: Data Value Standardization.
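The standardization loop can be sketched as follows. This is a sketch under stated assumptions rather than the prototype's code: each per-attribute dictionary file is modelled as a Python dict, the set of numeric attributes is passed in explicitly, and numeric values are copied through unchanged (the source algorithm only says they are not checked). The sample values are hypothetical.

```python
def standardize(qresult, dictionaries, numeric_attrs=()):
    """Replace retrieved values by their standard values.

    qresult:       {attribute: [values retrieved for that attribute]}
    dictionaries:  {attribute: {value: standard value}}, updated in place
    numeric_attrs: attributes of type number, which are not checked
    """
    sresult = {}
    for attr, values in qresult.items():
        if attr in numeric_attrs:
            sresult[attr] = list(values)  # assumption: pass numbers through
            continue
        dic = dictionaries.setdefault(attr, {})
        standardized = []
        for value in values:
            standard = dic.get(value)
            if standard is None:
                # Unknown value: it becomes its own standard from now on.
                dic[value] = value
                standard = value
            standardized.append(standard)
        sresult[attr] = standardized
    return sresult


dictionaries = {"PROVINCE": {"SK": "Saskatchewan", "Sask.": "Saskatchewan"}}
qresult = {"PROVINCE": ["SK", "Sask.", "Yukon"], "AMOUNT": [100, 200]}
sresult = standardize(qresult, dictionaries, numeric_attrs={"AMOUNT"})
print(sresult)
```

Because unknown values are inserted as their own standard, the order in which databases are queried can influence which spelling becomes the standard; a curated dictionary avoids that.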

The procedure lookup, shown in Figure 4.13, uses a data structure called a hash table. The attribute values and their standards are kept in linked lists, one list per slot of an array called the bucket. The hash function hash selects a slot in the bucket to keep an attribute value and its standard value. When an attribute value is passed in to check whether it has a standard value, the first step is to determine which slot to search. Then a linear search is performed through all nodes in that slot. If a standard value is found, it is returned. Otherwise, "not found" is returned. The advantage of the hash table method depends on the size of the bucket. The size of a bucket is normally a suitable power of 2 [43]. For example, if the size of the bucket is 64 and the values are evenly distributed, the search is about 64 times faster than searching all values of the dictionary.

Procedure lookup

Input: dictionary, a set of standard values
       attribute, an attribute value
Output: standard, a standard value

procedure lookup(dictionary, attribute)
begin
    bucket := put_dictionary_in_bucket(dictionary)
    index := hash(attribute)
    for each node in bucket[index] do
        if compare(node → item, attribute) == 0 then
            return node → standard
        end if
    end for
    return "not found"
end

Figure 4.13: Procedure lookup.

Figure 4.14 is an example of the hash table. The bucket has 8 slots and contains the values of Canadian provinces and their standard values. To make the diagram easier to illustrate, it is assumed that the hash function returns the same value 0 for "AB" and "BC", 3 for "MB", "NB" and "NF", 5 for "ON" and "PEI", and 8 for "PQ" and "SK". When the hash function is implemented, it is crucial to choose a good function to distribute values evenly.
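A chained hash table of this kind can be sketched briefly. The class name `StandardTable`, the polynomial string hash, and the use of Python lists in place of linked lists are assumptions made for illustration; the thesis does not specify the hash function used by the prototype.

```python
class StandardTable:
    """Hash table mapping attribute values to standard values, with chaining."""

    def __init__(self, num_slots=8):
        # One chain (here a Python list) per slot.
        self.slots = [[] for _ in range(num_slots)]

    def _index(self, value):
        # Simple polynomial string hash, reduced to a slot number.
        h = 0
        for ch in value:
            h = (h * 31 + ord(ch)) % len(self.slots)
        return h

    def insert(self, value, standard):
        self.slots[self._index(value)].append((value, standard))

    def lookup(self, value):
        # Linear search through the single chain selected by the hash.
        for item, standard in self.slots[self._index(value)]:
            if item == value:
                return standard
        return None  # plays the role of "not found"


table = StandardTable()
table.insert("SK", "Saskatchewan")
table.insert("NB", "New Brunswick")
table.insert("Sask.", "Saskatchewan")
print(table.lookup("SK"), table.lookup("XX"))
```

Only the chain in one slot is scanned per lookup, which is where the speed-up over a full scan of the dictionary comes from when values are spread evenly.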

Table 4.4 gives an example for the attribute "province", showing variations in its values and its standard value. By convention, a province may be given by its full name or by an abbreviation; for example, "Saskatchewan" might be abbreviated as "SK" or "Sask", or written in full as "Saskatchewan". The results retrieved from different databases might have different values. Without standardization, the multidatabase treats the values differently even if they refer to the same province.

[Figure: a bucket of hash slots, each holding a chain of value and standard pairs, for example NB with New Brunswick, NS with Nova Scotia, and ON with Ontario.]

Figure 4.14: A Hash Table Example.

Variation A | Variation B | Variation C | Standard
Alberta | | AB | Alberta
British Columbia | B.C. | BC | British Columbia
Manitoba | Man. | MB | Manitoba
New Brunswick | N.B. | NB | New Brunswick
Newfoundland | Nfld | NF | Newfoundland
Nova Scotia | N.S. | NS | Nova Scotia
Ontario | Ont. | ON | Ontario
Prince Edward Island | P.E.I. | PE | Prince Edward Island
Quebec | P.Q. | QC | Quebec
Saskatchewan | Sask. | SK | Saskatchewan
Northwest Territory | N.W.T. | NT | Northwest Territory
Yukon | Yuk. | YT | Yukon

Table 4.4: Example of Provinces and their Variations.

The standardization algorithm resolves the value difference problem. If it finds a standard for an attribute value in the standard dictionary, the standard value is put into the result set. If a standard value cannot be found, the attribute value is inserted into the standard dictionary. This mechanism guarantees that every attribute value in the retrieved data set has a standard value.

4.5.2 Scale Conversion Functions

The standardization approach in Section 4.5.1 is admittedly inadequate because of the scale difference issue discussed in Section 3.3.2. For example, the attribute PRICE might be measured in Canadian dollars in one database, but measured in US dollars in another. As a database is registered, a conversion function for attribute values from a local database to the global schema must be specified. When the database registration subsystem is activated, a database agent is created to retrieve the local schema. For each attribute, a conversion function is provided if the attribute type is number. The default for an attribute is 1, that is, no conversion is necessary for attribute values. For example, the attribute PRICE can have a conversion function such as CND = USD × Exchange_rate. A fill-in area is provided after Exchange_rate on the HTML form, where the database administrator can fill in the exchange rate. When the registration subsystem processes the registration form, a set of conversion functions is provided to make the attribute value conversions.
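A registry of such conversion functions can be sketched as follows. The registry layout, the database name `D_US`, and the factor 1.25 are placeholders chosen for illustration only; in particular, 1.25 is not a real exchange rate and the thesis does not describe how the prototype stores its conversion functions.

```python
def make_conversion(factor=1.0):
    """Return a function that scales a local value to the global scale.

    The default factor of 1 means that no conversion is necessary.
    """
    return lambda value: value * factor


# Conversion functions registered per (database, attribute) pair.
# Placeholder: PRICE in the hypothetical database D_US is stored in US dollars.
conversions = {("D_US", "PRICE"): make_conversion(1.25)}


def convert(db, attr, value):
    """Apply the registered conversion, or the identity if none is registered."""
    return conversions.get((db, attr), make_conversion())(value)


print(convert("D_US", "PRICE", 8))   # scaled by the placeholder factor
print(convert("D1", "AMOUNT", 500))  # no conversion registered
```

Keeping the factor as data entered at registration time means a changed exchange rate only requires updating the registry, not the queries.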

Chapter 5

Prototype Design and Testing

In the previous chapters the architecture of the KDID system has been presented and design issues have been described. A prototype of the KDID system has been implemented and tested. In this chapter, the design of the prototype is described and preliminary testing on the Natural Sciences and Engineering Research Council (NSERC) data is presented. First, in Section 5.1, the design of the multidatabase is explained. Each local database and the relationships among the local databases are described. Section 5.2 describes the database registration process, and Section 5.3 presents two typical knowledge discovery tasks. In Section 5.4, the speed analysis of the data retrieval from the participating databases and data value standardization are presented.

5.1 Constructing a Test Multidatabase System

A multidatabase system was created for testing the KDID system by partitioning an existing database. The original database was the Natural Sciences and Engineering Research Council (NSERC) database, which is an archival database of award and grant listings, grant allocations and award committees. This original NSERC database was partitioned into three related databases, with different schemas and data. Section 5.1.1 presents the global schema. Section 5.1.2 describes an Oracle database of awards, Section 5.1.3 a Microsoft Access database of award committees, and Section 5.1.4 a Mini SQL database on organizations and grants. Descriptions of the databases and the relations among them are presented in Section 5.1.5.

5.1.1 Global Schema

A global schema for a multidatabase system provides users with an integrated view of the multiple databases. A user does not have to know the underlying structure or schema information of any participating database. Knowledge discovery is conducted in the front-end multidatabase.

For the example multidatabase system, part of the global schema information is shown in Tables 5.1 and 5.2.

Attribute Name | Null | Type
AMOUNT | NOT NULL | NUMBER
AREA_CODE | NOT NULL | NUMBER
CNT2 | NOT NULL | NUMBER
COMP_YR | NOT NULL | NUMBER
COMMITTEE | NOT NULL | NUMBER
DEPARTMENT | NOT NULL | VARCHAR
ORG_CODE | NOT NULL | NUMBER
PROJECT | NOT NULL | VARCHAR
RECEIPIENT | NOT NULL | VARCHAR

Table 5.1: Global Schema of the AWARD Table.

Attribute Name | Null | Type
ORG_CODE | NOT NULL | NUMBER
ORGNAME | NOT NULL | VARCHAR
PROVINCE | NOT NULL | VARCHAR

Table 5.2: Global Schema of the ORGANIZATION Table.

The "PROVINCE" attribute in the ORGANIZATION table, shown in Table 5.2, is categorized according to the region information into a concept hierarchy, presented in Figure 5.1. The top level is Canada, and the second level can be divided according to geographical location, for example, the Maritime region, the Western region, Ontario, Quebec, the Yukon and Northwestern Areas, and areas outside Canada.

Canada
    Western
        British Columbia
        Prairies
            Alberta
            Saskatchewan
            Manitoba
    Ontario
    Quebec
    Atlantic
        Maritime
            New Brunswick
            Nova Scotia
            Prince Edward Island
        Newfoundland

Figure 5.1: A Concept Hierarchy for the PROVINCE Attribute.
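One simple way to represent the hierarchy of Figure 5.1 is a child-to-parent map. The representation and the helper names `ancestors` and `generalize` are assumptions made for this sketch; the thesis does not state how KDID stores its concept hierarchies.

```python
# Child -> parent links for the concepts shown in Figure 5.1.
PARENT = {
    "Western": "Canada", "Ontario": "Canada",
    "Quebec": "Canada", "Atlantic": "Canada",
    "British Columbia": "Western", "Prairies": "Western",
    "Alberta": "Prairies", "Saskatchewan": "Prairies", "Manitoba": "Prairies",
    "Maritime": "Atlantic", "Newfoundland": "Atlantic",
    "New Brunswick": "Maritime", "Nova Scotia": "Maritime",
    "Prince Edward Island": "Maritime",
}


def ancestors(concept):
    """Return the path from a concept up to the root of the hierarchy."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def generalize(concept, levels=1):
    """Climb the hierarchy by the given number of levels, stopping at the root."""
    for _ in range(levels):
        concept = PARENT.get(concept, concept)
    return concept


print(ancestors("Saskatchewan"))
print(generalize("Nova Scotia", 2))
```

Replacing each PROVINCE value by `generalize(value)` reduces the number of distinct values, which is the basic step used when a relation is generalized.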

5.1.2 The D1 Database

The D1 database is an Oracle database of awards containing the AWARD, DISCIPLINE, and AREA tables. The D1 database is maintained at the IRIS Center for Excellence, University of Regina, and resides on a SUN 4 Sparcstation 10 with 32 megabytes of main memory. It is connected to the network using a LANCE Ethernet DMA pseudo device.

The AWARD table contains information on awards offered, the award amount, the recipient, and the area in which the university is located. The DISCIPLINE table contains data on the discipline title, and each title has its own standard code. Area information is represented in the AREA table.

Detailed schema information of the D1 database is given in Appendix A.

5.1.3 The D2 Database

The D2 database is a Microsoft Access database on an IBM compatible personal computer running Windows NT 4.0 with a 66 MHz Intel 486 processor and 16 megabytes of memory. A database server written using the Open DataBase Connectivity (ODBC) protocol runs in the background, listening for connections on a TCP socket.

The D2 database contains the SCHOLARSHIP, COMMITTEE, and GRANT_TYPE tables. The SCHOLARSHIP table has data similar to that in the AWARD table in the D1 database, with attribute name differences. For example, the attribute DEPT in the D1 database is called DEPARTMENT in the SCHOLARSHIP table. The COMMITTEE table describes the committee names and their standard codes. Table 5.3 is part of the schema of the GRANT_TYPE table.

Detailed schema information is given in Appendix B.

Attribute Name | Null | Type
GRANT_CODE | NOT NULL | TEXT
GRANT_ORDER | NOT NULL | NUMBER
GRANT_TITLE | NOT NULL | TEXT

Table 5.3: Schema of the GRANT_TYPE Table.

5.1.4 The D3 Database

The D3 database is a Mini SQL database. Mini SQL is a lightweight database management system designed to provide fast access to stored data with low memory requirements [45]. The database utilizes a well-known TCP socket and accepts multiple connections. The Mini SQL database utilizes memory-mapped I/O and cache techniques to offer rapid access to data.

The D3 database runs on DBLEARN.CS.UREGINA.CA, a SUN 40 MHz Sparcstation IPX with 16 megabytes of main memory and 198 megabytes of disk space.

The D3 database contains tables related to different organizations. The ORGANIZATION table has data on all the provinces and areas.

Detailed schema information is given in Appendix C.

5.1.5 Relationships Between the Three Databases

The three databases have been designed to have interleaving relations between each other. The D1 database has an AWARD table, which is similar to the SCHOLARSHIP table in the D2 database. Table 5.4 compares relevant attributes of the two tables. Figure 5.2 gives the relationships among the AWARD, the SCHOLARSHIP and the ORGANIZATION tables.

The ORGANIZATION table of the D3 database contains regional data on each organization. The AWARD table in the D1 database has an attribute "ORG_CODE", which references the ORGANIZATION table. The "ORGCODE" attribute of the SCHOLARSHIP table in the D2 database has the same reference.

Global Attribute | D2 (Access) Attribute | D2 (Access) Type | D1 (Oracle) Attribute | D1 (Oracle) Type
AREA_CODE | AREA_CODE | NUMBER | AREA_CODE | NUMBER(14)
DEPT | DEPARTMENT | TEXT | DEPT | VARCHAR2(35)
ORG_CODE | ORGCODE | NUMBER | ORG_CODE | NUMBER(14)

Table 5.4: Comparison Between the AWARD and SCHOLARSHIP Tables.

[Figure: relations among the AWARD, SCHOLARSHIP, and ORGANIZATION tables.]

Figure 5.2: Relations among the Three Tables.

5.2 Database Registration Process

Figure 5.3: Secure Site Certificate.

When the hypertext link for on-line database registration is activated, a new window pops up. It shows a site certificate conforming to the X.509 standard. If the user accepts the conditions stated on the certificate, he/she can continue with the connection and registration. Figure 5.3 is a snapshot from the Netscape browser. The Netscape browser does not recognize the authority signing the site certificate because a test certificate was used. If a site needs to be recognized by the Netscape browser, the site is required to certify itself through some recognized authority. The certificate describes the site authority, encryption method and encryption level.

After the user accepts the certificate, the application registration form appears in the browser; otherwise, the connection fails. Figure 5.4 is a snapshot of the application form with user input data.

On the registration application form, the user is again reminded of the issue of connection security. If he/she feels any doubt, he/she can exit from the registration process. In Figure 5.4, the required information must be entered by the user. The required information includes the database platform, the database name, the login name, the password, and the host that the database resides on. The optional information section allows the user to give a brief description of the database content, which helps others to better understand the mappings. The KDID system cannot create a database agent if any part of the required information is missing. When the user completes all required information and submits the form, the data is encrypted. The database agent retrieves the participating database schema information. All tables are retrieved from the database with the SQL command SELECT TABLE_NAME FROM USER_TABLES, assuming that the database supports the command. This command is supported by Oracle. If the database does not recognize the command, the local database administrator should supply the names of the tables in the database.

Figure 5.4: User Database Registration Application Form.

If the schema retrieval is successful, a two-part HTML form is generated automatically. Figure 5.5 gives a snapshot of an example HTML form.

Figure 5.5: Form for Schema Mapping.

As shown in Figure 5.5, the first part of the form describes the global schema. Each global attribute has a corresponding number and its intended meaning. The first global attribute is "AMOUNT" and its corresponding meaning is the amount of an award offered to a candidate. The second part of the form shows the local attribute information. For each local attribute, the user is required to enter the number of the corresponding global attribute. If there is no mapping between a participating attribute and any of the global attributes, the default is "N/A".

Once the user submits the form, the KDID system processes the mappings supplied and updates the global schema accordingly, adding in mappings for the new database. If the mapping supplied is "N/A", no update is executed for that attribute. Another web page appears informing the user that the registration information has been processed, and an acknowledgment message is issued by the KDID registration subsystem. Figure 5.6 is a screen snapshot of the acknowledgement message. When the user gets the mail acknowledgment, the registration process is finished.

Figure 5.6: User Database Registration Acknowledgement.

5.3 Typical Knowledge Discovery Tasks

This section illustrates two typical knowledge discovery tasks for the KDID system. Both require information from the D1, D2, and D3 databases.

SELECT amount, area_code, disc_code, dept, orgname, province, cname
FROM AWARD a, ORGANIZATION b, COMMITTEE c
WHERE a.org_code = b.orgcode AND a.ctee_code = c.ctee_code
  AND ( disc_code = "HARDWARE" OR disc_code = "SOFTWARE" )

Figure 5.7: An SQL-like Query for Discovery Task 1.

SELECT amount, area_code, disc_code, dept, orgname, province, cname
FROM AWARD a, ORGANIZATION b, COMMITTEE c
WHERE a.org_code = b.orgcode AND a.ctee_code = c.ctee_code
  AND (( disc_code >= 23000 and disc_code <= 23500 )
    or ( disc_code >= 24000 and disc_code <= 24500 ))

Figure 5.8: SQL Query for Task 1 After Transformation.

5.3.1 Discovery Task 1

The following is a typical discovery task (expressed in English for comprehensibility):

Analyze the relationship between awards offered and the discipline area where the discipline area can be either hardware related or software related.

To identify the relationship between the amount of an award and the discipline area, it is necessary to select the amount, area code, department name, organization code and committee information first. To specify the task, an SQL-like query can be constructed, as shown in Figure 5.7.

The query in Figure 5.7 includes high level concepts such as "HARDWARE" and "SOFTWARE", which do not appear in the database as values for the DISC_CODE attribute. It is necessary to transform these high level concepts to the primitive level of the data values present in the local database. Using concept hierarchies, the KDID system substitutes the range 23000 to 23500 for "HARDWARE" and the range 24000 to 24500 for "SOFTWARE". The transformed query is shown in Figure 5.8.
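The substitution of high level concepts by code ranges can be sketched with a small text rewrite. This is only an illustrative sketch: the table `CONCEPT_RANGES` holds the two ranges quoted above, while the function name `expand_concepts` and the regular-expression approach are assumptions, not a description of how KDID performs the transformation.

```python
import re

# Ranges quoted in the text for the two high level concepts.
CONCEPT_RANGES = {"HARDWARE": (23000, 23500), "SOFTWARE": (24000, 24500)}


def expand_concepts(predicate, attr="disc_code"):
    """Rewrite attr = "CONCEPT" tests into range tests on attr."""
    pattern = re.compile(rf'{re.escape(attr)}\s*=\s*"([^"]+)"')

    def substitute(match):
        concept = match.group(1)
        if concept not in CONCEPT_RANGES:
            return match.group(0)  # leave unknown values untouched
        low, high = CONCEPT_RANGES[concept]
        return f"({attr} >= {low} and {attr} <= {high})"

    return pattern.sub(substitute, predicate)


transformed = expand_concepts('( disc_code = "HARDWARE" OR disc_code = "SOFTWARE" )')
print(transformed)
```

Each equality test is replaced in place, so the surrounding parentheses and the OR between the two concepts keep the grouping of the original predicate.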

Figure 5.8 is the query against the global schema. It is translated into a full path query according to the global and local schema mappings in order to decompose it into a set of local queries.

Figures 5.9, 5.10, and 5.11 are the decomposed and transformed queries for the D1, D2, and D3 databases.

SELECT amount, area_code, disc_code, department, org_code, ctee_code
FROM AWARD
WHERE (( disc_code >= 23000 and disc_code <= 23500 ))
   or (( disc_code >= 24000 and disc_code <= 24500 ))

Figure 5.9: Query 1 for Task 1 for the D1 Database.

SELECT orgname, province, orgcode
FROM ORGANIZATION

Figure 5.10: Query 2 for Task 1 for the D3 Database.

In Figure 5.7, the global attribute name for department is "dept", while in Figure 5.9, it is changed according to the schema mapping to "department". The attribute name "org_code" in the global schema corresponds to "org_code" in Figure 5.9 and "orgcode" in Figure 5.10. This mapping overcomes a semantic difference between the local databases, as discussed in Chapter 4.

SELECT cname, ctee_code
FROM COMMITTEE

Figure 5.11: Query 3 for Task 1 for the D2 Database.

5.3.2 Discovery Task 2

Discovery task 2 is as follows:

Find interesting relations between all awards, scholarships and discipline where the discipline can be either structural engineering or mechanical engineering.

This task is more complex than the first. First, it is necessary to retrieve the AWARD relation from the D1 database and the SCHOLARSHIP relation from the D2 database. An SQL-like query is shown in Figure 5.12.

SELECT amount, area_code, disc_code, dept, orgname, province, cname
FROM AWARD a1, SCHOLARSHIP a2, ORGANIZATION b, COMMITTEE c
WHERE a1.org_code = b.orgcode AND a2.org_code = b.orgcode
  AND a1.ctee_code = c.ctee_code AND a2.ctee_code = c.ctee_code
  AND ( disc_code = "STRUCTURAL ENGINEERING"
        OR disc_code = "MECHANICAL ENGINEERING" )

Figure 5.12: An SQL-like Query for Discovery Task 2.

The high level concepts "MECHANICAL ENGINEERING" and "STRUCTURAL ENGINEERING" are transformed into the ranges 00500 - 02000 and 07000 - 08500 respectively. According to the techniques described in Chapter 4, the query in Figure 5.12 is translated and decomposed into four local queries. Figure 5.13 is the query for the D1 database, Figure 5.14 and Figure 5.15 are the queries for the D2 database, and the query in Figure 5.16 is the query for the D3 database. The data retrieved from the queries in Figure 5.13 and Figure 5.14 are standardized and appended to one file.

5.3.3 'Pask Realization Using HTML Forms

The discovery tasks can be specified using HTML forms. Using HTML forms

makes it easier for the end users to perform knowledge discovery tasks without regard

to the details of the query format. If the user had to compose an SQL query for a

complex multidatabase system, it would be easy to make mistakes.

SELECT amount, area-code, disc-code, department, org-code, ctee-code
FROM AWARD
WHERE (( disc-code >= 00500 and disc-code <= 02000 ))
or (( disc-code >= 07000 and disc-code <= 08500 ))

Figure 5.13: Query 1 for Task 2 for the D1 Database.

SELECT amount, area-code, disc-code, dept, org-code, ctee-code
FROM SCHOLARSHIP
WHERE (( disc-code >= 00500 and disc-code <= 02000 ))

SELECT cname, ctee-code FROM COMMITTEE

HTML-based forms need scripts to be activated. Figure 5.17 shows a snapshot of a sample form for discovery task 1. The user can select a concept hierarchy and target attributes. He/she can specify the attribute threshold and the tuple threshold. The attribute threshold specifies the maximum number of distinct values that an attribute may have in the prime relation. In DB-Discover, the discovered result from a database is called the prime relation. An attribute may have many possible values if it is numerical, as is the AMOUNT attribute in the AWARD table, and it may also have many values if it is a discrete-valued attribute. The DB-Discover system reduces the number of distinct values for numerical attributes by mapping the values into a finite number of ranges. Detailed information on the attribute threshold is given in [17]. The default threshold for all attributes retrieved is 10 in Figure 5.17. The user can change it to a value from 5 to 20. On the form in Figure 5.17, the user can also change the tuple threshold, which limits the number of tuples in the prime relation. After the number of distinct values for each attribute in the relation has been reduced to the attribute thresholds in the first round of generalization, the number of tuples in the prime relation is compared to the tuple threshold. The relation is generalized repeatedly until the number of tuples is less than the tuple threshold.
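The effect of an attribute threshold on a numerical attribute can be sketched as follows. The text does not say how DB-Discover chooses its ranges, so the equal-width ranges used here are an assumption made only for illustration, and generalize_numeric is a hypothetical name.

```python
def generalize_numeric(values, threshold):
    """Map numeric values to at most `threshold` range labels.

    Assumption: equal-width ranges. This is a stand-in, not the
    range mapping used by DB-Discover.
    """
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    if len(set(values)) <= threshold:
        # Already within the threshold: no generalization needed.
        return [str(v) for v in values]
    low, high = min(values), max(values)
    width = (high - low) / threshold
    labels = []
    for v in values:
        # The maximum value would fall just past the last range; clamp it.
        index = min(int((v - low) / width), threshold - 1)
        start = low + index * width
        labels.append(f"{start:g} - {start + width:g}")
    return labels


if __name__ == "__main__":
    amounts = [16530, 16651, 17000, 25178, 12000, 11000, 7826, 18530, 64300]
    for amount, label in zip(amounts, generalize_numeric(amounts, 5)):
        print(amount, "->", label)
```

Because every value is assigned to one of `threshold` ranges, the number of distinct labels can never exceed the attribute threshold.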

SELECT org-name, province, org-code FROM ORGANIZATION

Figure 5.17: HTML Form for Discovery Task 1.

The user might want the total number of items for non-numeric data and the total of the values for numeric data. On the form in Figure 5.17, the user can decide which attributes should be summed by the system. If an attribute is specified to be summed, the values of that attribute are summed as they are retrieved from the database.

Once the user specifies all parameters and fills in the predicates, he/she can submit the form to conduct the discovery task. Figure 5.18 is a snapshot of the form in Figure 5.17 with user input data.

The intermediate steps and results, which include the communications between participating databases and the KDID system, data retrieval, table creation, data insertion, and knowledge discovery, are not displayed for the user. The final result, that is, the discovered knowledge, is displayed on another HTML page, which can be scrolled or printed using the Netscape browser. Figure 5.19 is the snapshot of the final result for the discovery task in Figure 5.7.

The results shown in Figure 5.19 are at a higher conceptual level than the data

stored in the multidatabase.

Figure 5.20 shows data retrieved directly from the multidatabase without any

generalization. The amounts shown in Figure 5.19 are generalized into ranges while

those in Figure 5.20 are numbers. As well, the AREA-CODES in Figure 5.19 are

more intuitive than the discrete numbers in Figure 5.20.

5.4 Performance Analysis

In this section, speed is analyzed with regard to the use of single versus multiple database agents and data value standardization using the lookup procedure.

5.4.1 Parallel Data Retrieval from Participating Databases

Figure 5.18: HTML Form for Discovery Task 1 with User Data.

Figure 5.19: Final Result for Discovery Task 1.

16530  864  24004  Computer Science        British Columbia  Animal Biology
16651  850  24005  Computing Science       British Columbia  Animal Biology
17000  862  24005  Comp. & Info. Sci.      British Columbia  Animal Biology
25178  864  24007  Computer Science        British Columbia  Animal Biology
12000  121  24004  Computer Science        British Columbia  Animal Biology
11000  862  24005  Computer Science        British Columbia  Animal Biology
(deleted)
 7826  864  24004  Electrical Engineering  Alberta           Animal Biology
18530  861  24006  Computer Science        Alberta           Animal Biology
(deleted)
18530  861  24006  Computer Science        Saskatchewan      Animal Biology
64300  850  24500  Computer Science        Saskatchewan      Animal Biology
(deleted)

Figure 5.20: Retrieved Data without Generalization.

Two implementation techniques for data retrieval from all participating databases were considered. In the sequential implementation, after the successful query of the first database, the second database agent is created. The global query is decomposed into a set of local queries, but only one database agent is activated and all other agents are idle.

The parallel implementation is based on the observation that all database agents can be created at the same time and no interaction among them is required.

The sequential and parallel retrieval methods have both been implemented, and empirical timing tests have been run for varied input sizes. In this section, the results of those tests are presented.

Number of    Time for Parallel    Time for Sequential    Performance Ratio
Tuples       Retrieval (sec)      Retrieval (sec)        (Sequential/Parallel)
120,846       199.038             1420.878               7.139
171,409       463.220             1770.956               3.823
230,630      1053.521             2079.330               1.974
317,353      1398.433             2777.899               1.986
549,562      2890.432             4621.859               1.599

Table 5.5: Time for Sequential and Parallel Retrieval.

Memory    Dictionary    Number of    Minimum       Maximum       Average
size      size          tuples       time (sec)    time (sec)    time (sec)
32K       23,788        23,788        0.155         0.317         0.189
32K       23,788        2,728,364    30.035        32.373        31.761
32K       23,788        4,092,556    46.727        51.569        49.068
32K       23,788        6,566,967    81.711        84.507        82.955

Table 5.6: Standardization Time with 32K Memory and a 23788-Item Dictionary.

The program sets up a signal handler before it creates any database agents. Given the number n of databases passed in, the program forks n children and creates n agents. After all agents finish retrieving, the parent process gets a signal and continues its execution. The implementation uses the fork() system call. fork() makes an identical copy of the calling process with a new process ID number. fork() returns zero to the new child process and returns the process ID of the child to the parent process. Given a discovery task, the tuple threshold was varied in order to change the retrieval size.
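The fork-based pattern can be sketched in Python, whose os.fork() wraps the same system call. This is a simplified illustration rather than the KDID code: the parent here waits for each child with os.waitpid() instead of installing a signal handler, and retrieve_in_parallel and the agent callable are hypothetical names. It runs only on Unix-like systems.

```python
import os


def retrieve_in_parallel(databases, agent):
    """Fork one child process per database and wait for all of them.

    `agent` is a hypothetical callable that retrieves data from one
    database and returns True on success. Returns {database: exit_code}.
    """
    children = {}
    for name in databases:
        pid = os.fork()
        if pid == 0:
            # fork() returned 0: this is the child, acting as one agent.
            try:
                ok = agent(name)
            except Exception:
                ok = False
            # Exit at once so the child never runs the parent's code.
            os._exit(0 if ok else 1)
        # fork() returned the child's process ID: this is the parent.
        children[pid] = name

    results = {}
    for pid, name in children.items():
        # Block until this particular child has finished retrieving.
        _, status = os.waitpid(pid, 0)
        results[name] = os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
    return results


if __name__ == "__main__":
    # Stand-in agent: pretends that retrieval from D2 fails.
    print(retrieve_in_parallel(["D1", "D2", "D3"], lambda name: name != "D2"))
```

All children are created before the parent starts waiting, so the agents run concurrently and the total time is governed by the slowest agent rather than by the sum of all retrieval times.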

The test varied the number of tuples retrieved from the three participating databases from about 120,000 tuples to about 550,000 tuples. The amount of data retrieved increased from 5 Megabytes to 100 Megabytes. Each of the queries sent out to the participating databases has three attributes. The timing results are presented in Table 5.5, and a graph of these results is shown in Figure 5.21. In Table 5.5, the retrieval methods are listed in the top row and the number of tuples retrieved in the left-hand column. Cells of the table give the retrieval time in seconds. The last column gives the ratio of the sequential to parallel retrieval times.

As shown in the graph in Figure 5.21 and the ratios in Table 5.5, for the same query the parallel method is between about 1.6 and 7.1 times faster than the sequential method. The time required by both the sequential and parallel approaches increases as more tuples are retrieved.
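The performance ratio is simply the sequential time divided by the parallel time. The short Python sketch below recomputes it from the timings in Table 5.5; the names TABLE_5_5 and performance_ratios are introduced here for illustration only.

```python
# Timings copied from Table 5.5: (tuples, parallel_sec, sequential_sec).
TABLE_5_5 = [
    (120846, 199.038, 1420.878),
    (171409, 463.220, 1770.956),
    (230630, 1053.521, 2079.330),
    (317353, 1398.433, 2777.899),
    (549562, 2890.432, 4621.859),
]


def performance_ratios(rows):
    """Return sequential/parallel time for each row, rounded to 3 places."""
    return [round(sequential / parallel, 3) for _, parallel, sequential in rows]


if __name__ == "__main__":
    for (tuples, _, _), ratio in zip(TABLE_5_5, performance_ratios(TABLE_5_5)):
        print(f"{tuples:>8} tuples: sequential/parallel = {ratio}")
```

The computed ratios agree with the last column of Table 5.5 and fall from about 7.1 for the smallest retrieval to about 1.6 for the largest.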

Figure 5.21: Test Time for Parallel and Sequential Retrieval.

5.4.2 Data Value Standardization

In this subsection, the results obtained in resolving the data value difference problem for various sizes of the dictionary and the input tuple set are presented. The tests were conducted on a Silicon Graphics O2 with a 174 MHz processor and 64 Megabytes of main memory. Three parameters are varied in the tests. The first is the allocated memory size for the bucket to store the items in the dictionary. The size is limited by the amount of memory available for dynamic allocation by the malloc function. The second is the dictionary size, which corresponds to the number of items in the dictionary for an attribute. The third is the number of tuples, which is varied from 23,788 to 6,566,967.

The timing results are presented in Tables 5.6, 5.7, and 5.8. In these tables, the minimum, maximum and average times for 5 runs are presented. Time is recorded in seconds. In Table 5.6, the memory allocated for the bucket is 32 Kbytes, and the dictionary has 23,788 items. In Table 5.7, the bucket size is increased to 64 Kbytes, and the dictionary size is doubled to 47,576. The bucket size in Table 5.8 is not changed, but the dictionary size is increased to 127,576. The standardization time for an attribute with about 6.5 million values is slightly more than one minute (about 83 seconds on average in Table 5.6). When memory size is increased, only a small speedup occurs; for example, with 2,728,364 input tuples, the speedup from Table 5.6 to Table 5.7 is 1.03. When the dictionary size is increased and the memory size remains the same, the speed decreases slightly.

Memory    Dictionary    Number of    Minimum       Maximum       Average
size      size          tuples       time (sec)    time (sec)    time (sec)
64K       47,576        23,788        0.141         0.315         0.177
64K       47,576        2,728,364    29.126        33.016        30.798

Table 5.7: Standardization Time with 64K Memory and a 47576-Item Dictionary.

The analysis of the timing results shows that the retrieved data can be standardized quickly with the standardization algorithm presented in Section 4.5.1. The hash function was not changed in the tests. The results show that a 23,788-item dictionary and 32K of memory are adequate for standardizing up to 6.5 million tuples.
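The dictionary lookup behind data value standardization can be sketched with a hash table. The algorithm of Section 4.5.1 is not reproduced in this section, so the code below is only an assumed stand-in: it uses Python's built-in dict in place of the bucket structure allocated with malloc, and the variant spellings and the names build_dictionary and standardize are chosen for illustration.

```python
def build_dictionary(pairs):
    """Build a lookup table from (variant, standard value) pairs.

    Keys are normalized (trimmed, lower-cased) so that small spelling
    differences in case or spacing still match.
    """
    return {variant.strip().lower(): standard for variant, standard in pairs}


def standardize(values, dictionary):
    """Replace each value by its standard form; unknown values pass through."""
    return [dictionary.get(value.strip().lower(), value) for value in values]


if __name__ == "__main__":
    # Hypothetical dictionary for a department attribute.
    dept = build_dictionary([
        ("Computing Science", "Computer Science"),
        ("Comp. & Info. Sci.", "Computer Science"),
    ])
    print(standardize(
        ["Computing Science", "Comp. & Info. Sci.", "Electrical Engineering"],
        dept,
    ))
```

Each lookup takes roughly constant time on average, which is consistent with standardization time growing in proportion to the number of tuples in the reported tables.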

Memory    Dictionary    Number of    Minimum       Maximum       Average
size      size          tuples       time (sec)    time (sec)    time (sec)
64K       127,576       23,788        0.142         0.308         0.148
64K       127,576       2,728,364    29.057        35.482        32.521

Table 5.8: Standardization Time with 64K Memory and a 127576-Item Dictionary.

5.5 Discussion

This chapter has described how discovery tasks are processed by the KDID system, as well as the process for registering a database on-line, and a performance analysis for parallel versus sequential retrieval and data value standardization. Three test databases were created to show how the KDID system works. The test data comes from the NSERC database, and the data has been partitioned to simulate some of the complexities of accessing multiple databases with KDID.

The on-line database registration process shows that the Internet connection between the participating database and the client is secure and that schema information retrieval is fast and easy as long as the required parameters are supplied. Once a known type of database is about to be contacted, a database agent is created and tries to locate the site information of that database on the Internet.

The user can specify discovery tasks using HTML forms; details of the composition of the task are handled by KDID. The user has options to specify the parameters for the discovery task. The KDID system allows the user to perform discovery tasks on multiple databases as if they were a single integrated database.

The analysis of parallel versus sequential retrieval shows that the parallel retrieval method reduces retrieval time substantially (by a factor of about 1.6 to 7.1 in Table 5.5), thus improving the performance of the KDID system.

Chapter 6

Conclusion

In this chapter, the thesis is summarized, the contributions are identified, and suggestions for future research are given.

6.1 Summary

This thesis has presented the KDID system for knowledge discovery in Internet databases. In particular, it has described the overall system architecture, specific design issues, and implementation of major parts of the KDID system, as well as on-line database registration, query decomposition, parallel retrieval, and network database agents.

KDID is intended as a proof-of-concept system showing that the overall approach is feasible. It has four major components: the Internet Front-end, the Interface Module, the Multidatabase Module, and the KDD Application Module. The Internet Front-end is the primary communication medium between the end users and the KDID system. The Internet Front-end is accessed using a web browser, such as Netscape Navigator. The browser produces a form data set corresponding to the discovery task specified by the user. The Interface Module parses this form data set and composes a global query. The Multidatabase Module translates the global query into an internal query and then decomposes it into a set of local queries. It also creates local database agents and monitors the data retrieval process. The retrieved data is standardized and uploaded into a multidatabase. After all local queries have been processed, the KDD application is activated, and discovered knowledge is passed to the Internet Front-end for presentation to the end user.
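The step in which a form data set is parsed and turned into a global query can be sketched as follows. The field names attribute, table and condition are hypothetical, since the actual form fields of KDID are not listed here, and compose_global_query is a name introduced for this sketch. A production system would also need to validate the submitted values before using them in a query.

```python
from urllib.parse import parse_qs


def compose_global_query(form_data):
    """Compose a global query from a URL-encoded form data set.

    Assumed (hypothetical) field names: `attribute`, `table`, `condition`,
    each of which may occur several times.
    """
    fields = parse_qs(form_data)
    attributes = fields.get("attribute", [])
    tables = fields.get("table", [])
    conditions = fields.get("condition", [])
    if not attributes or not tables:
        raise ValueError("form must name at least one attribute and one table")
    query = f"SELECT {', '.join(attributes)} FROM {', '.join(tables)}"
    if conditions:
        query += " WHERE " + " AND ".join(conditions)
    return query


if __name__ == "__main__":
    data = (
        "attribute=amount&attribute=province"
        "&table=AWARD&table=ORGANIZATION"
        "&condition=AWARD.org-code+%3D+ORGANIZATION.org-code"
    )
    print(compose_global_query(data))
```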

The on-line database registration process makes it possible to query and manipulate meta-data, that is, information about database schemas and their intended meaning. The Internet connection between the client of the KDID system and a local database server is a secure one based on the SSL protocol. The client and server use the SSL Handshake Protocol described in [63]. SSL takes data to be transmitted, fragments the data into manageable blocks, and encrypts them before any transmission occurs. The user is asked for the required information about the local database, and the KDID system creates a database agent to query the local schema information. If the schema retrieval is successful, an HTML form is generated automatically. The local user is asked to map the local attributes to their global counterparts. According to the mappings supplied by the user, the integrated global schema is updated.

The query decomposition algorithm ensures that no cross-database joins are performed. If a cross-database join is present in the global query, the attributes in the join condition are selected from each database separately. The join condition is executed in the multidatabase after the data has been retrieved. Given a global query, SELECT ATT FROM TABLES WHERE CONDS, a set of local queries is generated in the form SELECT ATT_DB.T FROM T WHERE CONDS_DB.T. ATT_DB.T is a list of attributes belonging to a single database. If the list CONDS_DB.T is empty, the local query has no predicates and the keyword WHERE is omitted. The example in Section 4.5.2 shows how the algorithm works.
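The shape of the decomposition can be sketched in Python. This is not the algorithm of Section 4.5.2, which is not reproduced here; it is a simplified illustration that assumes one local query per table, defers every join to the multidatabase, and uses hypothetical names (decompose, select, filters, joins, location).

```python
def decompose(select, filters, joins, location):
    """Split a global query into one local query per table.

    select:   list of (table, attribute) pairs in the global SELECT list
    filters:  list of (table, predicate_text) single-table conditions
    joins:    list of ((table1, attr1), (table2, attr2)) join conditions
    location: dict mapping each table to the database that holds it
    Returns {(database, table): local_query_text}.
    """
    attributes = {}

    def add(table, attribute):
        columns = attributes.setdefault(table, [])
        if attribute not in columns:
            columns.append(attribute)

    for table, attribute in select:
        add(table, attribute)
    # Join attributes are selected on each side; the join itself is left
    # to be executed later in the multidatabase.
    for (t1, a1), (t2, a2) in joins:
        add(t1, a1)
        add(t2, a2)

    conditions = {}
    for table, predicate in filters:
        conditions.setdefault(table, []).append(predicate)

    queries = {}
    for table, columns in attributes.items():
        query = f"SELECT {', '.join(columns)} FROM {table}"
        if conditions.get(table):
            # WHERE is emitted only when the table has local predicates.
            query += " WHERE " + " AND ".join(conditions[table])
        queries[(location[table], table)] = query
    return queries


if __name__ == "__main__":
    local = decompose(
        select=[("AWARD", "amount"), ("ORGANIZATION", "province")],
        filters=[("AWARD", "disc-code >= 00500")],
        joins=[(("AWARD", "org-code"), ("ORGANIZATION", "org-code"))],
        location={"AWARD": "D1", "ORGANIZATION": "D2"},
    )
    for key, query in local.items():
        print(key, query)
```

In this sketch a table with no local predicates yields a query without a WHERE clause, matching the rule stated above.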

Data integration issues make constructing a multidatabase difficult. The types of data integration, the reason for the differences, and possible solutions are discussed in Section 3.3.2. Mechanisms for resolving data value differences are proposed in Section 4.5 and an example shows that the mechanisms are feasible.

Three different database agents are implemented in the prototype KDID system. The Oracle database agent is implemented using embedded SQL, the Mini SQL agent is implemented using the API supplied with the Mini SQL system, and the Microsoft Access agent is implemented using the ODBC protocol. All retrieved data are converted into a standard format for further processing.

The section on data retrieval in parallel shows that this method is faster than the sequential retrieval method.

6.2 Contributions

The research involved in this thesis makes the following original contributions.

1. The KDID system for knowledge discovery from different Internet databases

was proposed and a prototype system was implemented. Using a single, inte-

grated interface, an existing knowledge discovery technique was applied to data

collected from more than one relational database in the Internet. The system

makes it possible for administrators of relational databases scattered in the In-

ternet with similar contents to cooperate with each other, and KDID makes it

easier to summarize the complete set of related data.

2. On-line database registration makes meta-data manipulation feasible and relatively easy. Previous research has assumed that the mappings between database schemas are provided [59] [76] [81] [83]. Here, schema mapping is accomplished by creating database agents that generate HTML forms to obtain schema mapping information from the local database administrators.

3. Participation in the KDID system is easy and voluntary. Once a database administrator has decided to participate, only the mappings between global and local schemas at the central registry are updated. There is no need to change any local database or to inform any other participants about the new database. Similarly, an administrator can remove a database from KDID relatively easily, although this thesis has not defined a process for deleting correspondences from the general schema.

4. Retrieval of data from multiple databases can be conducted in parallel. In this research, we showed the benefit that results from this approach in the KDID system.

6.3 Areas for Future Research

The implemented KDID system can only conduct knowledge discovery in relational databases. In future research work, other types of databases could be considered, including object-oriented databases, deductive databases and temporal databases. The data models describing these database types are more complex than the relational model. As well, the variations in the structures of object-oriented and deductive databases will complicate the process of decomposing queries and integrating results. Another factor is that existing knowledge discovery tools mainly focus on relational databases. Thus, knowledge discovery tools will need to be created or adapted for use with object-oriented and deductive databases.

Although several integration problems were described and examples of each were presented, a solution to only one of these problems, differences in data values, was developed. Future research should focus on overcoming structural and semantic differences.

Given the proliferation of databases, it would be useful to develop tools that can automate the specification of database schema mappings. The on-line database registration method only partially automates the process, because after the local database schema information has been retrieved, the local database administrator is required to supply the mapping. If heuristics were developed to map local schemas to the global schema automatically, the process of schema mapping would be simplified. Some research work has been conducted by [72] [85]. A framework for supporting automated data integration is being developed.

Bibliography

[1] Agrawal, R., Imielinski, T., and Swami A., "Mining Associations between Sets of Items in Massive Databases," Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 207-216, Washington D.C., May 1993.

[2] Agrawal, R., Mannila, H., Srikant, R., Toivonen H., and Verkamo, A. I., "Fast Discovery of Association Rules," Fayyad, U. M., Piatetsky-Shapiro, Smyth, P., Uthurusamy, R. (eds), Advances in Knowledge Discovery and Data Mining, pp. 307-328, AAAI/MIT Press, 1996.

[3] The Apache Group, "Apache HTTP Server Project," http://www.apache.org/.

[4] Arens, Y., Chen, C. Y., Hsu, C. N., and Knoblock, C. A., "Retrieving and

Integrating Data from Multiple Information Sources," International Journal of

Intelligent and Cooperative Information Systems, 2(2):127-158, 1993.

[5] Bal, H. E., Kaashoek, M. F., Tanenbaum, A. S., and Jansen, J., "Replication Techniques for Speeding up Parallel Application on Distributed Systems," Concurrency Practice and Experience, 4(5):337-355, 1992.

[6] Barber, D. B., "Attribute Selection Strategies for Attribute-oriented Generalization," M. Sc. thesis, University of Regina, 1997.

[7] Batini, C., Lenzerini, M., and Navathe, S., "A Comparative Analysis of Methodologies of Database Schema Integration," ACM Computing Surveys, 18(4):323-364, 1986.

[8] Berners-Lee, T., and Connolly, D., "Hypertext Markup Language - 2.0," RFC 1866, MIT/W3C, http://ecco.bsee.swin.edu.au/text/html-spec, November 1995.

[9] Berners-Lee, T., Fielding, R., and Frystyk, H., "Hypertext Transfer Protocol - HTTP/1.0," http://www.ics.uci.edu/pub/ietf/http/rfe, 1996.

[10] Berry, M., "Large Scale Singular Value Computations," International Journal of Supercomputer Applications, 6(1):15-49, 1992.

[11] Bestavros, A., Demand-Based Document Dissemination for the World-Wide Web, Technical Report BU-CS-95003, Computer Science Department, Boston University, 1995.

[12] Bestavros, A., Carter, R., Crovella, M. E., Cunha, C. R., Heddaya, A., and

Mirdad, S. A., Application-Level Document Caching in the Internet, Technical

Report BU-CS-95002, Computer Science Department, Boston University, 1995.

[13] Borgida, A., Brachman, R. J., McGuinness, D. L., and Resnick, L. A., "CLASSIC: A Structural Data Model for Objects," Proceedings ACM SIGMOD Symposium on the Management of Data, pp. 58-67, 1989.

[14] Brachman, R. J. and Anand, T., "The Process of Knowledge Discovery in Databases", Fayyad, U. M., Piatetsky-Shapiro, Smyth, P., Uthurusamy, R. (eds), Advances in Knowledge Discovery and Data Mining, pp. 37-57, AAAI/MIT Press, 1996.

[15] Breitbart, Y., Olson, P. L., and Thompson G. R., "Database Integration in a Distributed Heterogeneous Database System", Hurson A. R., Bright, M. W., and Pakzad, S., (eds), Multidatabase: an Advanced Solution for Global Information Sharing, IEEE Computer Society Press, 1994.

[16] Buneman, P., Davidson, S. B., Hart, K., and Overton, C., "A Data Transformation System for Biological Data Sources," Proceedings of International Conference on Very Large Data Bases, pp. 158-169, 1995.

[17] Carter, C. L., Hamilton, H. J., and Cercone, N., The Software Architecture of DBLEARN, Technical Report CS94-04, Department of Computer Science, University of Regina, January, 1994.

[18] Carter, C. L. and Hamilton, H. J., Performance Evaluation of Attribute-Oriented Algorithms for Knowledge Discovery from Databases: Extended Report, Technical Report 95-6, Department of Computer Science, University of Regina, 1995.

[19] Chankhunthod, A., Danzig, P., Neerdaels, C., Schwartz, M. F., and Worrell,

K. J., "A Hierarchical Internet Object Cache," USENIX 1996 Annual Technical

Conference, San Diego, CA, January, 1996.

[20] Chen, M., Han, J., and Yu, P., "Data Mining: An Overview from a Database Perspective," IEEE Transactions on Knowledge Discovery and Data Engineering, 8(6):866-882, 1996.

[21] Crovella, M. E. and Carter, R. L., Dynamic Server Selection in the Internet, Technical Report BU-CS-95014, Computer Science Department, Boston University, 1995.

[22] Date, C. J., An Introduction to Database Systems, Addison-Wesley, 1990.

[23] Dayal, U. and Hwang, H. Y., "View Definition and Generalisation for Database Integration in a Multidatabase System," IEEE Transactions on Software Engineering, SE-10(6):628-644, 1984.

[24] Deen, S. M., Amin, R. R., and Taylor, M. C., "Data Integration in Distributed Databases", Hurson A. R., Bright, M. W., and Pakzad, S., (eds), Multidatabase: an Advanced Solution for Global Information Sharing, IEEE Computer Society Press, 1994.

[25] Deerwester, S., Dumais, S. T., Furnas, G. W., and Landauer, T. K., "Indexing by Latent Semantic Analysis," Journal of the American Society for Information Science, 41(6):391-407, 1990.

[26] Desai, B. C., An Introduction to Database Systems, West Publishing, 1990.

[27] Dumais, S. T., "Enhancing Performance in Latent Semantic Indexing (LSI) Retrieval," Behavior Research Methods, Instruments and Computers, 23(2):229-236, 1991.

[28] Dumais, S. T., Furnas, G. W., Landauer, T. K., Deerwester, S., and Harshman, R., "Using Latent Semantic Analysis to Improve Access to Textual Information," Proceedings of ACM CHI'88 Conference on Human Factors in Computing Systems, pp. 281-285, 1988.

[29] El-Medani, G., A Visual Query Facility for Multimedia Databases, Technical Report TR95-18, Department of Computer Science, University of Alberta, 1995.

[30] Fang, D., Hammer, J., McLeod, D., and Si, A., "Remote-Exchange: An Approach to Controlled Sharing among Autonomous, Heterogeneous Database Systems," Proceedings of the IEEE Spring Compcon. IEEE, San Francisco, February 1991.

[31] Fang, D., Hammer, J., and McLeod, D., "An Approach to Behavior Sharing in Federated Database Systems," M. T. Ozsu, U. Dayal, and P. Valduriez (eds), Distributed Object Management, Morgan Kaufman, 1993.

[32] Fayyad, U., Djorgovski, S., and Weir, N., "Automating the Analysis and Cataloging of Sky Surveys," Fayyad, U. M., Piatetsky-Shapiro, Smyth, P., Uthurusamy, R. (eds), Advances in Knowledge Discovery and Data Mining, pp. 471-493, AAAI/MIT Press, 1996.

[33] Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P., "From Data Mining to Knowledge Discovery: An Overview," Fayyad, U. M., Piatetsky-Shapiro, Smyth, P., Uthurusamy, R. (eds), Advances in Knowledge Discovery and Data Mining, pp. 1-34, AAAI/MIT Press, 1996.

[34] Frawley, W. J., Piatetsky-Shapiro, G., and Matheus, C. J., "Knowledge Discovery in Databases: An Overview," Piatetsky-Shapiro, G. and Frawley, W. J. (eds), Knowledge Discovery in Databases, pp. 1-27, AAAI/MIT Press, 1991.

[35] Furnas, G. W., Landauer, T. K., Gomez, L. M., and Dumais, S. T., "The Vocabulary Problem in Human-system Communication," Communications of the ACM, 30(11):964-971, 1987.

[36] Gunter, G., "The Mixed Powerdomain," Theoretical Computer Science, 103:311-334, 1992.

[37] Hammer, J., McLeod, D., and Si, A., "Object Discovery and Unification in a Federated Database System," Proceedings of the Workshop on Interoperability of Database Systems and Database Applications, pp. 3-18, Swiss Information Society, University of Fribourg, Switzerland, October 1993.

[38] Hammer, J., Garcia-Molina, H., Labio, W., Widom, J., and Zhuge, Y., "The Stanford Data Warehousing Project," Data Engineering Bulletin, 18(2):41-48, June 1995.

[39] Hammer, J., Garcia-Molina, H., Ireland, K., Papakonstantinou, Y., Ullman, J., and Widom, J., "Information Translation, Mediation, and Mosaic-Based Browsing in the TSIMMIS System," Proceedings of the ACM SIGMOD International Conference on Management of Data, San Jose, California, June 1995.

[40] Han, J. and Fu, Y., "Attribute-Oriented Induction in Data Mining," Fayyad, U. M., Piatetsky-Shapiro, Smyth, P., Uthurusamy, R. (eds), Advances in Knowledge Discovery and Data Mining, pp. 394-421, AAAI/MIT Press, 1996.

[41] Han, J., Zaiane, O. R., and Fu, Y., Resource and Knowledge Discovery in Global Information System: A Multiple Layered Database Approach, Technical Report CMPT TR94-10, School of Computing Science, Simon Fraser University, Canada, 1994.

[42] Han, J., Fu, Y., and Ng, R., "Cooperative Query Answering Using Multiple Layered Databases," Proceedings of the Second International Conference on Cooperative Information System, Toronto, Canada, May 1994.

[43] Horspool, R. N., The Berkeley UNIX Environment, Prentice-Hall, 1992.

[44] Hsu, C. and Knoblock, C. A., "Discovering Robust Knowledge from Dynamic Closed-World Data," Proceedings of the Thirteenth National Conference on Artificial Intelligence, Portland, Oregon, 1996.

[45] Hughes, D. J., Mini SQL: A Lightweight Database Engine, Hughes Technologies Pty, 1996.

[46] Hull, R., "Managing Semantic Heterogeneity in Databases: A Theoretical

Perspective," ACM SIGACT-SIGMOD-SIGART Symposium on Principles of

Database Systems, PODS 1997, pp. 51-55, 1997.

[47] Hurson A. R., Bright, M. W., and Pakzad, S., Multidatabase: an Advanced

Solution for Global Information Sharing, IEEE Computer Society Press, 1994.

[48] Internet Resources, http://132.15.104.104/0kinawa/NetInfo/Resources.html.

[49] Knight-Ridder Information, "Gale Directory of Databases," http://www.rs.ch/krinfo/products/datastar/sheet8/GDDB.HTM.

[50] Krochmal, J., LAN Applications, New Riders Publishing, Carmel, Indiana USA, 1993.

[51] Levy, A. Y., Mendelzon, A. O., Sagiv, Y., and Srivastava, D., "Answering Queries Using Views," Proceedings ACM Symposium on Principles of Database Systems, pp. 95-104, 1995.

[52] Levy, A. Y., Rajaraman, A., and Ordille, J. J., "Querying Heterogeneous Infor-

mation Sources Using Source Descriptions," Proceedings of International Con-

ference on Very Large Data Bases, pp. 251-262, 1996.

[53] Levy, A. Y., Rajaraman, A., and Ullman, J. D., "Answering Queries Using Limited External Query Processors," Proceedings ACM Symposium on Principles of Database Systems, pp. 227-237, 1996.

[54] Li, S. H. and Danzig, P. B., Two-Dimensional Visualization for Internet Resource Discovery, Technical Report USGCS-96, Computer Science Department, University of Southern California, 1996.

[55] Li, S. H., and Danzig, P. B., "Vocabulary Problem in Internet Resource Discovery," Second International Workshop on Next Generation Information Technologies and Systems, 1995.

[56] Li, S. H. and Danzig, P. B., Vintage: A Visual Information Retrieval Interface Based on Latent Semantic Indexing, Technical Report USCCS96xx, Computer Science Department, University of Southern California, 1996.

[57] Libkin, L., "Approximation in Databases," Proceedings of the International Conference on Database Theory, pp. 411-424, 1995.

[58] Linden B., Oracle 7 Server SQL Language Reference Manual, Oracle Corporation, December 1992.

[59] Litwin, W., Mark, L., and Roussopoulos, N., "Interoperability on Multiple Autonomous Databases," ACM Computing Surveys, 22(3):267-293, 1990.

[60] Meng, W., and Yu, C., Query Processing in Multiple Database Systems, pp. 551-572, Addison-Wesley, Reading, MA, 1995.

[61] Mock, K., "Adaptive User Models for Intelligent Information Filtering," Proceedings of the Third Golden West International Conference on Intelligent Systems, Las Vegas, Nevada, 1994.

[62] Murthy, S. K., Kasif, S., and Salzberg, S., "A System for Induction of Oblique Decision Trees," Journal of Artificial Intelligence Research, 2:1-32, 1994.

[63] Netscape Communications Corporation, "The Secure Sockets Layer Protocol (SSL)," http://home.netscape.com/info/security-dot-html.

[64] Network Wizards, "Internet Domain Survey, January 1997," http://www.nw.com/zone/WWW/report.html.

[65] Nua Internet Survey, "Internet Survey February 1997," http://www.nua.ie/surveys/WhatsNew.html#February.

[66] Nural, S., Koksal, P., Ozcan, F., and Dogac, A., "Query Decomposition and Processing in Multidatabase Systems," Proceedings of OODBMS Symposium of the European Joint Conference on Engineering Systems Design and Analysis, Montpellier, July 1996.

[67] Object-Orientation FAQ; see http://www.bjlkent.edu.tr/Online/oofaq/~~-fq-S-3.5.htrn

[68] Obraczka, K., Danzig, P. B., and Li, S. H., "Internet Resource Discovery Services," IEEE Computer, 26(9):8-22, September 1993.

[69] Ozsu, M. T. and Valduriez, P., Principles of Distributed Database Systems,

Prentice-Hall, 1991.

[70] Paredaens, J., Van den Bussche, J., and Van Gucht, D., "Towards a Theory

of Spatial Database Queries," Proceedings of the 13th ACM Symposium on the

Principles of Database Systems, pp. 279-288, 1994.

[71] Read, R. L., Fussell, D. S., and Silberschatz, A., "Multi-Resolution Relational Data Model," Proceedings of the Eighteenth International Conference on Very Large Data Bases, pp. 134-150, Vancouver, Canada, 1992.

[72] Reference Architecture for the Intelligent Integration of Information, version 2.0, draft, 1995. Developed by the I3 Program of DARPA; see http://dc.isx.com/I3/html/briefs/I3brief.html#ref.

[73] Reiter, R., "Towards a Logical Reconstruction of Relational Database Theory," Brodie, M. L., et al. editors, On Conceptual Modeling, 1984.

[74] Rivera, C. B. and Carter, C. L., A Tutorial Guide to DB-Discover, Version

2.0, Technical Report CS95-05, Department of Computer Science, University of

Regina, July, 1995.

[75] Salton, G. and McGill, M. J., Introduction to Modern Information Retrieval,

McGraw-Hill, 1983.

[76] Sheth, A. P. and Larson, J. A., "Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases," ACM Computing Surveys, 22(3):183-236, 1990.

[77] Srikant, R. and Agrawal, R., "Mining Quantitative Association Rules in Large Relational Tables," Proceedings of the ACM SIGMOD Conference on Management of Data, Montreal, Canada, June 1996.

[78] Srikant, R. and Agrawal, R., "Mining Generalized Association Rules," Proceedings of the 21st International Conference on Very Large Databases, Zurich, Switzerland, September, 1995.

[79] Srikant, R. and Agrawal, R., "Fast Algorithms for Mining Association Rules," Proceedings of the 20th International Conference on Very Large Databases, Santiago, Chile, September, 1994.

[80] Steindl, C., "Is Interoperability Achievable With ODBC?" Institute of Computer Science, Johannes Kepler University Linz, Austria, 1996.

[81] Thomas, G., "Heterogeneous Distributed Database Systems for Production Use," ACM Computing Surveys, 22(3):237-266, 1990.

[82] Tsai, P. S. M. and Chen, A. L. P., "Concept Hierarchies for Database Integration in a Multidatabase System," International Conference on Management of Data, 1994.

[83] Ullman, J. D., "Information Integration Using Logical Views," Proceedings of

International Conference on Database Theory, pp. 19-40, 1997.

[84] Vaghani, J., Ramamohanarao, K., Kemp, D. B., Somogyi, Z., Stuckey, P. J., Leask, T. S., and Harland, J., "The Aditi Deductive Database System," VLDB Journal, 3(2):245-288, 1994.

[85] Wiederhold, G., "Foreword: Intelligent Integration of Information," Journal of Intelligent Information Systems, 6(2/3):281-291, 1996.

[86] "What is X.509?" RSA Laboratories, Inc.; see http://www.rsa.com/rsalabs/faq/q165.html.

[87] Zhuge, Y., Garcia-Molina, H., Hammer, J., and Widom, J., "View Maintenance in a Warehousing Environment," Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 316-327, San Jose, California, June 1995.

Appendix A

Schema Information for the D1 Database

A.1 The AWARD Table

Attribute names: AMOUNT, AREA-CODE, COMP-YR, CTEE-CODE, DEPT, DISC-CODE, FISCAL-YR, GRANT-CODE, INSTAL, ORG-CODE, PROJECT, RECP-NAME

Null: NOT NULL is specified for three attributes.

Types, in the order listed: NUMBER, VARCHAR2(35), NUMBER(14), NUMBER(14), VARCHAR2(6), VARCHAR2(6), NUMBER(14), VARCHAR2(200), VARCHAR2(25)

Table A.1: Schema of the AWARD Table

A.2 The DISCIPLINE Table

| Attribute Name | Null     | Type         |
| DISC-CODE      | NOT NULL | NUMBER(14)   |
| DISC-TITLE     | NOT NULL | VARCHAR2(63) |

Table A.2: Schema of the DISCIPLINE Table

A.3 The AREA Table

| Attribute Name | Type         |
| AREA-CODE      | NUMBER       |
| AREA-TITLE     | VARCHAR2(65) |

Table A.3: Schema of the AREA Table

Appendix B

Schema Information of Database D2

B.1 The SCHOLARSHIP Table

| Attribute Name | Type   |
| AMOUNT         | NUMBER |
| AREA-CODE      | NUMBER |
| CNT2           | NUMBER |
| COMP-YR        | NUMBER |
| CTEE-CODE      | NUMBER |
| DEPARTMENT     | TEXT   |
| DISC-CODE      | NUMBER |
| FISCAL-YR      | NUMBER |
| GRANT-CODE     | TEXT   |
| INSTAL         | TEXT   |
| ORGCODE        | NUMBER |
| PROJECT        | TEXT   |
| RECP-NAME      | TEXT   |

Null: NOT NULL is specified for four attributes.

Table B.1: Schema of the SCHOLARSHIP Table

B.2 The COMMITTEE Table

| Attribute Name | Null     | Type   |
| CNAME          | NOT NULL | TEXT   |
| CTEE-CODE      | NOT NULL | NUMBER |

Table B.2: Schema of the COMMITTEE Table

B.3 The GRANT-TYPE Table

Table B.3: Schema of the GRANT-TYPE Table

Appendix C

Schema Information of Database D3

C.1 The ORGANIZATION Table

| Attribute Name | Type     |
| ORG-CODE       | INTEGER  |
| ORGNAME        | CHAR(51) |
| PROVINCE       | CHAR(17) |

Null: NOT NULL is specified for two attributes.

Table C.1: Schema of the ORGANIZATION Table

Appendix D

Glossary

Attribute-oriented induction: An attribute-oriented induction algorithm

takes as input a relation retrieved from a database and generalizes the data guided

by a set of concept hierarchies.

Characteristic discovery task: A characteristic discovery task is one that

requires the finding of interesting relationships between various attributes of one or

more relations in the database.

Client-server model: A form of distributed computing that divides the application processing between a client and a server that are connected by a network.

Concept hierarchy: A concept hierarchy is a tree of concepts arranged hierar-

chically according to generality.

Data classification: Data classification classifies data based on the values of

certain attributes.

Database agent: A database agent forms a gateway to a local database.

Decision tree: A decision tree is generated from a training set. A classification

algorithm takes the training set of attribute values and a class as input.

Form: A form is a template for a form data set and an associated method and

action URL.

Form data set: A form data set is a sequence of (NAME, VALUE) pairs.

Full-path query: A query that specifies the database name and relation name for each attribute mentioned. The form of an attribute is database:relation.attribute.

Global query: A global query is a full-path query that may include constant

predicates and join conditions.

Global schema: A global schema is created by the integration of multiple

export schemas.

Homonym: Homonyms are different attributes having identical names.

Hypertext Markup Language: The Hypertext Markup Language is a simple

data format used to create hypertext documents that are portable from one platform

to another.

Hypertext Transfer Protocol: The Hypertext Transfer Protocol is an application-level protocol for distributed, collaborative, hypermedia information systems.

Internet: The Internet is the worldwide collection of inter-connected computer

networks and gateways that use the Internet Protocol (IP) and function as a single

cooperative network.

Internet database: An Internet database is a database with access provided

to Internet users.

Knowledge discovery: Knowledge discovery is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.

Knowledge discovery in Internet databases: Knowledge discovery in Internet databases concerns the application of techniques for knowledge discovery in databases to multiple databases available on the Internet.

Local database query: A local database query contains only constant conditions; any join condition is decomposed into attributes to be retrieved.

Macro: A macro is a list of actions to be performed by the database system.

Module language: A module language is a small language for expressing SQL operations in pure SQL syntactic form.

Multidatabase system: A multidatabase provides integrated access to autonomous, heterogeneous databases via a single, relatively simple request.

Multiple layered database: A multiple layered database is a database formed by generalization and transformation of information.

Schema Information Manager: The schema information manager coordinates

the global database agent and the local database agents.

Synonym: Synonyms are attributes in different databases that have the same meaning although they may have different names.

Secure Sockets Layer: The Secure Sockets Layer transport protocol provides data encryption, server authentication, message integrity, and optional client authentication for a TCP/IP connection.

World Wide Web: The World Wide Web is a collection of mutually referencing

hypertext documents scattered across the Internet, and serviced by HTTP servers.
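
To make the glossary entries for attribute-oriented induction and concept hierarchy more concrete, the following is a minimal Python sketch of one generalization step. It is not the KDTD implementation: the hierarchy contents, the sample tuples, and the function names generalize and induce are assumptions chosen only for illustration. Each value of one attribute is replaced by its parent concept, and tuples that become identical are merged with a count.

```python
from collections import Counter

# Hypothetical concept hierarchy (an assumption for illustration):
# each concept maps to its more general parent concept.
HIERARCHY = {
    "Regina": "Saskatchewan",
    "Saskatoon": "Saskatchewan",
    "Calgary": "Alberta",
    "Saskatchewan": "Prairies",
    "Alberta": "Prairies",
}


def generalize(value, hierarchy, levels=1):
    """Climb up to `levels` steps in the hierarchy, stopping at the root."""
    for _ in range(levels):
        if value not in hierarchy:
            break
        value = hierarchy[value]
    return value


def induce(rows, attr_index, hierarchy, levels=1):
    """Generalize one attribute of every tuple, then merge identical
    tuples and count how many input tuples each merged tuple covers."""
    counts = Counter()
    for row in rows:
        generalized = list(row)
        generalized[attr_index] = generalize(row[attr_index], hierarchy, levels)
        counts[tuple(generalized)] += 1
    return dict(counts)


# Hypothetical input relation: (city, granting agency).
rows = [("Regina", "NSERC"), ("Saskatoon", "NSERC"), ("Calgary", "NSERC")]
one_level = induce(rows, 0, HIERARCHY)
two_levels = induce(rows, 0, HIERARCHY, levels=2)
```

Climbing one level merges the two Saskatchewan cities into a single tuple; climbing two levels collapses all three tuples into one. A full algorithm would also decide how far to generalize each attribute, which this sketch leaves to the caller.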
