XDSearch: an efficient search engine for XML document schemata

Eric Jui-Lin Lu*, Yu-Ming Jung

Department of Information Management, Chaoyang University of Technology, 168 Gifeng E. Road, Wufeng,

Taichung County 413, Taiwan, ROC

Abstract

Electronic commerce is an emerging trade model under dramatically rapid development. So far, enormous numbers of business

transactions have been conducted over the Internet. It is believed that extensible markup language (XML) is the best layout format for

exchanging messages over the Internet. Since XML developers can define their own elements, it is common that various elements may be

used to illustrate the same thing or one element name is used to describe different things. This makes it extremely difficult to exchange XML

documents among businesses, not to mention redundant investments in the design of XML documents. If a business can obtain a document

schema similar to the one that is currently being used and modify the schema to fit its needs, then not only can the development costs be

reduced, but also the redundancy in the design of XML documents can be avoided. Furthermore, the difficulty in data interchange among

trading partners can be alleviated. To solve the problems, many well-known international organizations have joined forces to develop XML

repositories in the hope of increasing reusability of collected document schemata. Unfortunately, there is scarcely any efficient search

mechanism provided for these XML repositories. In this paper, by taking advantage of the concept of ontology and the neural network

techniques, we shall propose and implement a search engine, called XDSearch, for XML document schemata. XDSearch allows developers

to easily and quickly locate document schemata in an XML repository as close to what they need as possible.

© 2002 Elsevier Science Ltd. All rights reserved.

Keywords: Extensible markup language; Search engine; Neural network; Ontology; XML repository

1. Introduction

Electronic commerce is an emerging trade model under

rapid development. Up to now, enormous amounts of

business transactions have been processed over the

Internet. To conduct transactions on the networks,

enterprises must be able to exchange messages efficiently.

It is believed that extensible markup language (XML) is

the best layout format for exchanging messages over the

Internet (Ciancarini, Vitali, & Mascolo, 1999; Fernandez,

Tan, & Suciu, 2000; Lu, Chou, & Tsai, 2001; Webber,

1998). Since XML is a meta-language, developers can

define tags and attributes to fit their own needs. Fig. 1

shows an example purchase order encoded in XML. As

shown in Fig. 1, it is obvious that the purchase order,

numbered as 12345678, is dated on 09/12/1999 and

contains two product items. Its vendor’s name is

‘Executive Office Supplies’. To describe or constrain the

logical structure of XML documents, developers in

general use schemata to validate the contents of XML

documents (Garofalakis, Gionis, Rastogi, Seshadri, &

Shim, 2000; Moh, Lim, & Ng, 2000). An XML document

schema specifies what tags are allowed, in what order they

should be, what attributes may appear in specific tags, and

what tags may be included in other tags in an XML

document. Fig. 2 is an example of document type

definition (DTD) associated with the purchase order

shown in Fig. 1.

Since XML allows developers to define their own

elements and attributes, it is common that various element

names may be used to illustrate one thing or one element

name is used to describe various things (hereinafter the

‘synonymy and polysemy’ problem). This makes

it extremely difficult to exchange XML documents among

businesses. Additionally, the investments in the design of

XML documents may be redundant because similar

documents are likely to be redefined by many companies.

Moreover, when the number of types of business documents

becomes large, the design of XML documents will become a

huge burden to the information technology (IT) department.

To resolve the problems, it is believed that XML

repositories (or registries) have to be established so that

0957-4174/03/$ - see front matter © 2002 Elsevier Science Ltd. All rights reserved.

PII: S0957-4174(02)00150-1

Expert Systems with Applications 24 (2003) 213–224

www.elsevier.com/locate/eswa

* Corresponding author. Fax: +886-437-42337.

E-mail address: [email protected] (E.J.L. Lu).

re-usable document schemata and entities, such as XML

processing rules and utilities, can be registered, and later

developers can reuse the registered schemata (Kotok, 2002).

By making use of existing schemata, the development costs

can be reduced, the redundancy in the design of XML

documents can be avoided, and interoperability can be

improved. Furthermore, the difficulty in data interchange

among trading partners can be alleviated (Iacovou,

Benbasat, & Dexter, 1995; Lu & Hwang, 2001). As a

result, many internationally well-known organizations have

joined forces to develop XML repositories in projects such

as ebXML and BizTalk. Even the US General Accounting
Office, in its recently published report entitled ‘Electronic
Government: Challenges to Effective Adoption of the
Extensible Markup Language’, recommended the establishment
of XML repositories to speed up XML development.

Once a repository is established, the number of
registered entities will keep growing. Therefore,
a fast, user-friendly, and intelligent search engine is
imperative for XML repositories (Kotok, 1999; Kotsakis,
2002; Kotsakis & Bohm, 2000). For data-centric applications
in electronic commerce, a search engine for XML
repositories should fulfill at least the following requirements
(Kobayashi & Takeda, 2000).

Precision. Precision means that the search results
returned by a document schema search engine should
conform to the user’s request as precisely as possible. The
order of elements is not important here: as long as
the values of elements a, b, and c can be retrieved, a schema
of the form (a, b, c) is as good as one of the form (a, c, b)
for data-centric applications (Bourret, Bornhovd, & Buchmann,
2000; Shanmugasundaram, Tufte, & He, 1999).

Concision. Concision implies that, among document schemata
of the same precision, shorter schemata are better than
longer ones. For example, when searching for document
schemata that contain a, b, and c, (a, b, c) should rank
higher than (a, (b, c)?) because documents with a complex
structure are harder to process than those with a simple
structure. Since it is extremely difficult to measure the
recall rate and since most web users tend to view search
results without hitting the ‘next page’ button (Kobayashi &
Takeda, 2000), search results should be sorted by degree of
concision: the more concise a schema, the higher its rank.

Consideration of the synonymy and polysemy problem.
Due to the synonymy and polysemy problem, it is likely
that, although <uprice>, which means unit price, exists in a
repository, the search engine cannot locate <uprice> for a
user who requests ‘unit_price’. Therefore, the design of
a search engine for document schemata should take the
synonymy and polysemy problem into account.

Speed. The search for document schemata should be as

fast as possible.

Maintainability. A good search engine should be easy to

maintain.

Currently, there are two major approaches for search
engines: keyword search and directory search (Filman &
Pant, 1998; Gudivada, Raghavan, Grosky, & Kasanagottu,
1997; Kobayashi & Takeda, 2000). The keyword search
uses keywords to create indices on the collected documents.
Users search for required documents by providing keywords.
Because XML document schemata such as DTD
contain special operators such as ‘*’, ‘+’, ‘?’, and ‘|’,
keyword search is not suitable for XML repositories. The
directory search uses classification methods to classify
similar documents into categories, compares the collected
documents with the user’s query for similarity, and then
presents the search results. Conventional classification
methods are based on keywords, phrases, links, etc. and thus
are not suitable for XML repositories. Kotsakis (2002) and
Kotsakis and Bohm

(2000) proposed an XML schema directory (XSD) which

merges similar document schemata into categories. The

similarity of document schemata is measured by using

Zhang and Shasha’s algorithm to calculate editing distances

between any two ordered trees (Zhang & Shasha, 1989).

Although XSD is shown to be fast and guarantees 100%

accuracy, XSD does not take into account the synonymy and

polysemy problem. Also, as stated earlier, the order of
elements in XML documents is not important for data-centric
applications. Thus, it is not appropriate to use Zhang
and Shasha’s algorithm to measure the similarity between
two schemata. Note that calculating the editing distance
between two unordered trees has been proven to be
NP-complete (Zhang, Statman, & Shasha, 1992). Therefore,
overcoming these problems and developing an efficient
document schema search engine is an important research
topic that demands immediate attention.

Fig. 1. Simple XML for a purchase order.

Fig. 2. An example of DTD.

There are several XML schema languages, such as DTD,

XML Schema, and XML-Data Reduced (XDR), to

describe XML document structure. Among these schema

languages, DTD has been adopted by XML since it was

released. Because DTD is simple and has been widely

adopted in a variety of industries, in this paper, we design

and implement a search engine, called XDSearch, for

document schemata written in DTD. In XDSearch, to

resolve the synonymy and polysemy problem, the concept

of ontology is deployed. Then, the similarity between two
document schemata is measured by their Hamming
distance. Since a bias r can be specified by users, the
collected schemata with Hamming distances less than or
equal to r will be returned by XDSearch. With r = 0,
XDSearch guarantees 100% accuracy. Additionally, the

ranking of search results is determined by minimum

description length (MDL) (Rissanen, 1978) which is used

to measure the concision of a document schema. Because

the time complexity of calculating Hamming distance is

high, we propose a 2-CC4 neural network, which is adapted

from the CC4 neural network, to speed up the search in this

paper. In this research, we design XDSearch with

modularity in mind such that any change to a module can

be done without affecting other modules. Therefore,

XDSearch proposed in this paper completely fulfills the

above requirements.

This paper is organized as follows. The current

development of search engine technology will be briefly

reviewed in Section 2. The proposed search engine for XML

document schemata, XDSearch, will be thoroughly

described in Section 3. In Section 4, the experimental
results of XDSearch are studied and analyzed. Finally, our
conclusions are presented in Section 5.

2. Related works

Currently, there are two major approaches for search
engines: keyword search and directory search (Filman & Pant,
1998; Gudivada et al., 1997; Kobayashi & Takeda, 2000).

2.1. Keyword search

In general, keyword search engines use crawlers or

robots to discover Web resources and store collected data in

server-side database systems. Search engines then create

indices on the collected data to provide fast lookup services

for users (Filman & Pant, 1998; Shu & Kak, 1999). The

most famous search engines of this type are AltaVista,

WebCrawler, Excite, and Infoseek.

There are two obvious approaches of using keyword

search on XML repositories. One approach is to index

the names of document types such as ‘purchase order’.

Users can then locate all document schemata for purchase
orders. However, this approach has some serious limitations.
Firstly, it is hard to find an effective ranking
algorithm for it since every schema is a purchase order.
Secondly, users may be flooded with search results if there
are a large number of document schemata for purchase
orders. Moreover, even if there exist some schemata that
are very close to users’ needs, they cannot be retrieved when
they are not categorized as purchase orders.

The other approach is to index element and attribute
names of document schemata, such as <Date> as shown in
Fig. 2. Users can then search for schemata that contain the
queried keywords. For example, by specifying both
‘Purchase_Order’ and ‘Date’, all document schemata
containing both ‘Purchase_Order’ and ‘Date’ are retrieved.
Note that the search results will also include schemata in
which ‘Purchase_Order’ and ‘Date’ are not element or
attribute names but merely default values of some elements
or attributes. Also, because XML document schemata such as
DTD contain special operators such as ‘*’, ‘+’, ‘?’, and ‘|’,
and because keyword search engines use keywords to look up
the contents of schemata, users cannot specify a query like
(a, (b? | c+)). Furthermore, due to the synonymy and
polysemy problem, it is likely that, although <uprice>, which
means unit price, exists in a repository, the search engine
cannot locate <uprice> for a user who requests ‘unit_price’.

2.2. Subject directory search

In general, directory search engines use classification

methods to classify all collected data into a hierarchy and

allow users to search for the desired web pages by either

entering a query or navigating the hierarchy (Kobayashi &

Takeda, 2000; Szuprowicz, 1997). The most famous

search engines of this type are Yahoo, Infoseek, Excite,

and Lycos.

Conventional classification methods are based on key-

words, phrases, links, etc. With chosen classifications,

similar documents are grouped together. Search engines will

then retrieve documents that are similar to a user’s query.

However, conventional classification methods are not

suitable for XML repositories because they do not take

into account the logical structure of document schemata.

Thus, Kotsakis (2002) and Kotsakis and Bohm (2000)

proposed an XML Schema Directory (XSD). XSD aggregates

similar document schemata into merger schemata and

then merges similar merger schemata into higher-level

schemata. The similarity of document schemata is measured

by editing distance between two trees. Because an XML

schema is actually a tree, the distance between any two

schemata is based on the number of edit operations (such as

insertion, deletion, and substitution) that need to be

performed to convert a tree into another. The smaller the

distance between two schemata, the more similar they are.

XSD guarantees 100% accuracy and is shown to be fast

because it deploys a very fast algorithm, proposed by Zhang

and Shasha (1989), to calculate the minimum editing

distance between two ordered trees. However, XSD has

several major drawbacks. First of all, as stated earlier, the
order of elements in XML documents is not important for
data-centric applications. A schema of the form (a, b, c) is
as good as one of the form (c, a, b). As a result, it is
inappropriate to measure similarity between schemata by
using algorithms that calculate the editing distance between
two ordered trees. Note that calculating the editing distance
between any two unordered trees has been proven to be
NP-complete (Zhang et al., 1992). Also, by treating schemata
with identical elements but in different order as different
schemata, the size of XSD grows rapidly. Furthermore,
XSD does not take into account the synonymy and
polysemy problem. Thus, it is likely that, in XSD, two
schemata that are identical except for the names of their
root elements are treated as two different schemata. This
makes the XML Schema Directory unnecessarily large.

3. XDSearch: a search engine for XML document schemata

From the earlier discussions, it is clear that there is a
strong need for a new measure of similarity between two
XML schemata and a new ranking algorithm for XML schema
search engines, because editing distance is not an
appropriate way to measure the similarity
between two schemata. Garofalakis et al. (2000) stated that

an XML document can be described by more than one DTD,

and that a well-defined DTD should have two properties:

precision and concision. Precision means that the DTD

should precisely conform to the XML document. Concision

means that the DTD should be trim and streamlined. This

research will utilize and extend these two properties in the

design of XDSearch.

Precision. Let U denote a user’s request, DSi represent a
document schema in an XML repository, and f(U, DSi) be
the difference between U and DSi. If f(U, DSi) = 0, DSi is
identical to U. For example, assume that there are three
known document schemata, DS1: (a, b), DS2: (a, b, c), and
DS3: (a, b, c, d), where a, b, c, and d represent different
elements. If U is (a, b, c), then f(U, DS2) = 0. In other
words, DS2 precisely matches U. Note that if U is (b, c, d),
then none of the DSi satisfies U. To increase flexibility, a
radius, denoted as r, is adopted to handle this situation. The
radius is the bias that the user can tolerate. In this example,
if r is set to 2, both DS2 and DS3 will satisfy U. In this
research, we use the Hamming distance (HD), denoted as
HD(x, y), to measure the differences between vectors x and
y. Let DS1 be (1,1,0,0), DS2 be (1,1,1,0), DS3 be (1,1,1,1),
and U be (0,1,1,1). Then the values of HD(U, DSi) for
i = 1, 2, 3 are 3, 2, and 1, respectively.
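The worked example above can be checked with a few lines of Python. This is only a minimal sketch: the function names `hamming_distance` and `within_radius` are illustrative and do not come from the paper, and the filter uses the "less than or equal to r" rule stated in the introduction.

```python
from typing import Dict, List, Sequence


def hamming_distance(x: Sequence[int], y: Sequence[int]) -> int:
    """Count the positions at which two equal-length binary vectors differ."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(1 for xi, yi in zip(x, y) if xi != yi)


def within_radius(query: Sequence[int],
                  schemata: Dict[str, Sequence[int]],
                  r: int) -> List[str]:
    """Return the names of all schemata whose distance to the query is at most r."""
    return [name for name, vec in schemata.items()
            if hamming_distance(query, vec) <= r]


# Vectors taken from the example in the text (terms a, b, c, d in that order).
SCHEMATA = {
    "DS1": (1, 1, 0, 0),
    "DS2": (1, 1, 1, 0),
    "DS3": (1, 1, 1, 1),
}
U = (0, 1, 1, 1)  # the request (b, c, d)

if __name__ == "__main__":
    for name, vec in SCHEMATA.items():
        print(name, hamming_distance(U, vec))
    print(within_radius(U, SCHEMATA, r=2))
```

With r = 2 the filter keeps DS2 and DS3, which agrees with the text; with r = 0 only an exact match survives.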

Concision. Concision means that a document schema
should be trim and streamlined. Suppose that we have the XML
documents {ab, abab, ababab}, where a and b represent
different elements. There are at least two possible candidate
DTDs for these XML documents: (1) (ab)*; (2) ab | ab(ab | abab).
It is quite obvious that the first DTD is more
concise than the second one. In this research, we use MDL
to measure concision and determine the ranking order of
search results (Garofalakis et al., 2000; Rissanen, 1978).
The calculation of MDL involves the following rules:

1. Each element is counted as one, but the root element is not counted.

2. Each meta-character, such as |, *, +, ?, (, and ), is counted as one.

3. The length of any substructure has to be counted.

In the example shown in Fig. 2, the DTD has a root
element called Purchase_Order. According to the first rule,
Purchase_Order is not counted. Since there are four
elements, the MDL of the example DTD is four based on the
first rule. Because the element ‘Detail’ has the meta-character
‘*’, according to the second rule, the MDL is
now five. Note that the DTD contains a substructure, Detail,
which is of length 3. Therefore, the total MDL
of the example DTD is eight (i.e. MDL(DTD) = 8).

If users do not specify any constraint on the queried
document structure, the document schemata with smaller
MDL should be ranked higher. However, if any constraint is
imposed on the queried document structure, the
search results are sorted in ascending order of
|MDL(U) − MDL(DSi)|, so that the schemata whose MDL is
closest to that of the query come first. For example, assume
that there are three candidate DTDs, DTD1: (a, b, c), DTD2:
(a, b*, c), and DTD3: (a, b, c*). If U is (a, b*, c), the
values of |MDL(U) − MDL(DTDi)| are 1, 0, and 0,
respectively. Thus, the order of the search results is DTD2,
DTD3, DTD1.
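The two ranking cases can be sketched as follows. This is a hedged illustration, not the authors' ranking module: `rank_by_mdl` is a hypothetical name, and the MDL values are assumed inputs (3, 4, and 4, chosen only so that the differences reproduce the 1, 0, 0 of the example). Ties keep their input order because Python's sort is stable.

```python
from typing import List, Optional, Tuple


def rank_by_mdl(candidates: List[Tuple[str, int]],
                query_mdl: Optional[int] = None) -> List[str]:
    """Order candidate schemata by concision.

    candidates: (name, MDL value) pairs; the MDL values are assumed to be
                precomputed, as the schema tables store them.
    query_mdl:  MDL of the user's structured query, or None when the user
                imposes no structural constraint.
    """
    if query_mdl is None:
        # No constraint: smaller MDL ranks higher.
        key = lambda item: item[1]
    else:
        # Constraint given: rank by closeness of MDL to the query's MDL.
        key = lambda item: abs(query_mdl - item[1])
    return [name for name, _ in sorted(candidates, key=key)]


# Illustrative MDL values (assumptions, not taken from the paper).
CANDIDATES = [("DTD1", 3), ("DTD2", 4), ("DTD3", 4)]

if __name__ == "__main__":
    print(rank_by_mdl(CANDIDATES, query_mdl=4))
    print(rank_by_mdl(CANDIDATES))
```

Under these assumed values, the constrained query puts DTD2 and DTD3 ahead of DTD1, while the unconstrained case puts the shortest schema, DTD1, first.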

XDSearch is composed of the information component,

search component, and interface component, and its

architecture is shown in Fig. 3.

3.1. Information component

The information component collects and provides data

for the search engine to work on. These data can be stored in

various formats such as flat files, relational databases, or

even native XML databases. In XDSearch, all data are saved

in a relational database except for DTD files. The

information component is composed of a document schema

repository, a term table, and schema tables.

3.1.1. Document schema repository

The document schema repository consists of XML

document schemata described in either DTDs, XML

Schemas, or XDR. In the current design of XDSearch,

only DTD is considered. Thus, the document schema

repository is also called DTD repository. Each DTD file is

actually stored in the DTD repository and is accessible

through the Internet.

3.1.2. Term table

As described earlier, since XML is a meta-language,

developers can define their own tags, and this results in the

synonymy and polysemy problem. To resolve the synonymy

and polysemy problem, the concept of ontology has to be

incorporated into the design of search engines for XML

document schemata.

An ontology generally contains a hierarchy of a set of

objects within a domain and describes the relationships

among these objects (Chandrasekaran, Josephson, &

Richard Benjamins, 1999; Decker et al., 2000). For

example, business documents may contain the fields ‘order
quantity’ and ‘number of return’. An ontology may be used
to represent both as ‘qty+’, meaning that the inventory stock
is increased. In other words, both ‘order quantity’ and
‘number of return’ are mapped to ‘qty+’, as can be seen in
Fig. 4. Additionally, the terms used in the term table can be
further clustered into categories. For example, in Fig. 4,
‘qty+’, ‘qty−’, and ‘money’ are clustered into the
category named ‘numeric’, which is also a term name.

In XDSearch, all element and attribute names are saved

in the field ‘Tag_Name’ of the table ‘DTD_Detail’ as shown

in Fig. 5(a). Each ‘Tag_Name’ is associated with (a)

a specific DTD file which is referenced by the ‘DTD_No’

that in turn points to the URL of the DTD file and (b) a term

name which is referenced by the ‘Term_No’ to which it

belongs. Note that the term table also indicates to which

category a term belongs. Because a category name is

actually a term name, we have a recursive relationship from

the ‘Category_No’ to ‘Term_No’ in the term table. The

terms used in the term table must be unique. To store

the relationships shown in Fig. 4, the tables are created as

shown in Fig. 5(b).
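The relationships described above can be sketched with an in-memory SQLite database. Only the names `DTD`, `DTD_Detail`, `Tag_Name`, `DTD_No`, `Term_No`, and `Category_No` come from the text; the table name `Term`, the columns `Term_Name`, `URL`, and `MDL`, the sample rows, the URL, and the helper functions are assumptions made for illustration, since Fig. 5 is not reproduced here.

```python
import sqlite3
from typing import List, Optional


def build_demo_db() -> sqlite3.Connection:
    """Create a small in-memory database mirroring the described tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Term (
            Term_No     INTEGER PRIMARY KEY,
            Term_Name   TEXT NOT NULL UNIQUE,          -- terms must be unique
            Category_No INTEGER REFERENCES Term(Term_No) -- recursive relationship
        );
        CREATE TABLE DTD (
            DTD_No INTEGER PRIMARY KEY,
            URL    TEXT NOT NULL,
            MDL    INTEGER NOT NULL
        );
        CREATE TABLE DTD_Detail (
            DTD_No   INTEGER NOT NULL REFERENCES DTD(DTD_No),
            Tag_Name TEXT NOT NULL,
            Term_No  INTEGER NOT NULL REFERENCES Term(Term_No)
        );
    """)
    # Hypothetical rows: 'numeric' is a category that is itself a term.
    conn.executemany("INSERT INTO Term VALUES (?, ?, ?)", [
        (1, "numeric", None),
        (2, "qty+", 1),
        (3, "qty-", 1),
        (4, "money", 1),
    ])
    conn.execute("INSERT INTO DTD VALUES (1, 'http://example.org/po.dtd', 8)")
    conn.executemany("INSERT INTO DTD_Detail VALUES (?, ?, ?)", [
        (1, "order_quantity", 2),
        (1, "number_of_return", 2),
        (1, "uprice", 4),
    ])
    return conn


def tags_for_term(conn: sqlite3.Connection, term_name: str) -> List[str]:
    """List every tag name that the ontology maps to the given term."""
    rows = conn.execute(
        "SELECT d.Tag_Name FROM DTD_Detail d "
        "JOIN Term t ON t.Term_No = d.Term_No "
        "WHERE t.Term_Name = ? ORDER BY d.Tag_Name",
        (term_name,),
    ).fetchall()
    return [row[0] for row in rows]


def category_of(conn: sqlite3.Connection, term_name: str) -> Optional[str]:
    """Follow Category_No back into the term table to find a term's category."""
    row = conn.execute(
        "SELECT c.Term_Name FROM Term t "
        "JOIN Term c ON c.Term_No = t.Category_No "
        "WHERE t.Term_Name = ?",
        (term_name,),
    ).fetchone()
    return row[0] if row else None
```

Because differently named tags share one `Term_No`, a query phrased in terms of the ontology reaches all of them, which is how the term table addresses the synonymy problem.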

3.1.3. Schema tables

The schema tables contain the brief descriptions, the

URLs, and the MDL values of registered document

schemata as well as all the element and attribute names in

these schemata. In XDSearch, the schema tables include

both ‘DTD’ and ‘DTD_Detail’ tables, as shown in Fig. 5(a).

For data-centric applications, what matters is the data
contained in XML documents. Therefore, a schema of the
form (a, b, c) is as good as one of the form (c, b, a)
(Bourret et al., 2000; Shanmugasundaram et al., 1999). As a
result, XDSearch retrieves all registered schemata whose
element names match the user’s request. Additionally, based
on Lu and Hwang’s (2001) observations, attributes can be
treated as elements
without losing any information for data-centric applications.

Thus, attributes, like elements, are treated as terms in

XDSearch. For example, for the element ‘Qty’ in Fig. 2,

there is an attribute ‘unit’. In XDSearch, both ‘Qty’ and

‘unit’ are saved in the ‘DTD_Detail’ table, and thus the

MDL value of the DTD is incremented by two, not one.

3.2. Interface component

The goal of the interface component is to facilitate the
use and management of the search engine. The management
interface provides system managers with an easy-to-use
interface for maintaining XDSearch. In the user interface,
users may select terms from different categories. Fig. 6
shows an example interface that we developed for XDSearch.
A selected term can be configured as a child element of the
root element or of any selected element. For example, as
shown in Fig. 6, both ‘Item_No’ and ‘Price’ are
sub-elements of ‘Items’. The radius of HD can also be chosen
by users. Additionally, each selected term can be associated
with a meta-character such as *, +, ?, or | by clicking
either the ‘Has (*, +, ?)’ button or the ‘Has (|)’ button. For
example, ‘Items’ in Fig. 6 is associated with the
meta-character ‘+’.

Fig. 3. The architecture of XDSearch.

Fig. 4. Ontology.

3.3. Search component

The search component is the kernel of XDSearch.

There are three modules in the search component: a
comparison module, a ranking module, and a maintenance
module.

3.3.1. Comparison module

The comparison module compares Us with each DSi in the
document schema repository. In XDSearch, HD is used for
measuring the differences between two DTDs. Initially, all
DTDs, including Us and DTDi, are turned into binary
vectors of length n, where n is the number of terms in the
term table. Each element of a vector is set to either 1 or 0,
following the sequence of terms in the term table: if
an element of a DTD is associated with the ith term,
the ith element of the vector is set to 1;
otherwise it is set to 0. For example, if the length of the
term table is nine and a purchase order has six terms, this
purchase order is transformed into a vector of length nine
that contains six 1s, as shown in Fig. 7. After
the comparisons, all the DTDs with HD(U, DTDi) ≤ r are
included in the search results. Note that, with r = 0,
XDSearch guarantees 100% accuracy.
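The conversion step can be sketched as below. This is a minimal sketch under stated assumptions: `dtd_to_vector` is a hypothetical helper, and the nine term names are invented for illustration (Fig. 7 is not reproduced here); only the sizes, nine terms in the table and six in the purchase order, follow the text.

```python
from typing import Iterable, List, Sequence


def dtd_to_vector(dtd_terms: Iterable[str],
                  term_table: Sequence[str]) -> List[int]:
    """Map the terms used by a DTD onto a binary vector ordered like the term table."""
    present = set(dtd_terms)
    unknown = present - set(term_table)
    if unknown:
        # Every tag is expected to be mapped to a registered term first.
        raise ValueError(f"terms missing from the term table: {sorted(unknown)}")
    return [1 if term in present else 0 for term in term_table]


# Hypothetical term table of length nine.
TERM_TABLE = ["po_no", "date", "vendor", "item_no", "qty+",
              "qty-", "money", "address", "phone"]
# Hypothetical purchase order that uses six of those terms.
PURCHASE_ORDER = ["po_no", "date", "vendor", "item_no", "qty+", "money"]

if __name__ == "__main__":
    print(dtd_to_vector(PURCHASE_ORDER, TERM_TABLE))
```

Since the vector position depends only on the term table, two schemata that list the same terms in a different order yield the same vector, which matches the requirement that element order be ignored for data-centric applications.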

The algorithm of calculating HD is shown in Fig. 8.

Assume that the number of DTDs in the DTD table is m and

the number of terms in each DTD is p. From Fig. 8, it is easy

to see that the time complexity of calculating HD is O(mp²).

In the following sections, we will study the CC4 network

and design a 2-CC4 network to speed up the calculation.

CC4 neural network. The CC4 algorithm, proposed by

Kak and Tang, is a corner classification training algorithm

for three-layered feed-forward neural networks (Kak, 1993;

Tang & Kak, 1998). The architecture of CC4 is shown in

Fig. 9. The input and output data are all binary vectors, and

the weights are all integers. The number of input neurons is

the length of the input vector plus one. The additional

neuron is the bias neuron with a constant input of 1.

Fig. 8. Algorithm of HD.

Fig. 7. Conversion of a DTD to a vector.

Fig. 5. Ontology described in RDBMS.

Fig. 6. An example DTD query.

The number of neurons in the hidden layer is the number of

training patterns, and each hidden neuron represents one

training pattern. The number of output neurons depends

upon the user’s requirement. The input layer and the

hidden layer, as well as the hidden layer and the output

layer, are fully connected.

The training process of the CC4 neural network is described as follows. The weight of the connection from the input neuron i, where i = 1, 2, …, n, to the hidden neuron j, where j = 1, 2, …, H, is denoted as w_ij, and X_ji is the value of the ith element of the jth training pattern (or training vector). The value of w_ij is determined by Eq. (1), where r is the user-defined radius and s is the number of 1’s in the training vector.

$$
w_{ij} =
\begin{cases}
1, & X_{ji} = 1 \\
-1, & X_{ji} = 0 \\
r - s + 1, & i = n
\end{cases}
\tag{1}
$$

For the jth neuron of the hidden layer, its activity function is defined as net_j = Σ_i X_i w_ij. The transfer function f(net_j), which is the output value of the jth neuron of the hidden layer, is shown in Eq. (2).

$$
H_j = f(\mathrm{net}_j) =
\begin{cases}
1, & \mathrm{net}_j > 0 \\
0, & \mathrm{net}_j \le 0
\end{cases}
\tag{2}
$$

The weight of the connection from the hidden neuron j to the output neuron k is denoted as u_jk, and Y_jk is the expected value of the kth output neuron for the jth training pattern. The weights of the hidden layer to the output layer are assigned according to Eq. (3).

$$
u_{jk} =
\begin{cases}
1, & Y_{jk} = 1 \\
-1, & Y_{jk} = 0
\end{cases}
\tag{3}
$$

The activity function of the output neuron k is defined as net_k = Σ_j H_j u_jk. The transfer function f(net_k), which is the output value of the kth neuron of the output layer, is shown in Eq. (4).

$$
Y_k = f(\mathrm{net}_k) =
\begin{cases}
1, & \mathrm{net}_k > 0 \\
0, & \mathrm{net}_k \le 0
\end{cases}
\tag{4}
$$

The CC4 network is fast, requires only one-time training, and can determine which DTD fulfills the user’s demand. However, because CC4 is a supervised network, the expected output values must be given in the training process. Moreover, because the radius r is fixed in the bias weight of Eq. (1) at training time, the CC4 network has to be trained with all possible radiuses to be able to answer queries with different radiuses. Therefore, it is impractical to use the CC4 network for querying document schemata.
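A minimal sketch of the CC4 procedure, written directly from Eqs. (1) to (4), may make the dependence on r easier to see. It assumes that the bias neuron is appended as the last input with a constant value of 1; the function names `train_cc4` and `predict_cc4` and the sample patterns are hypothetical and are not taken from the original CC4 papers.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[int]]


def train_cc4(patterns: Sequence[Sequence[int]],
              targets: Sequence[Sequence[int]],
              r: int) -> Tuple[Matrix, Matrix]:
    """One-pass training: one hidden neuron per training pattern."""
    w = []  # input-to-hidden weights, Eq. (1); last entry is the bias weight
    for x in patterns:
        s = sum(x)  # number of 1's in the training vector
        w.append([1 if xi == 1 else -1 for xi in x] + [r - s + 1])
    # hidden-to-output weights, Eq. (3)
    u = [[1 if y == 1 else -1 for y in row] for row in targets]
    return w, u


def predict_cc4(w: Matrix, u: Matrix, x: Sequence[int]) -> List[int]:
    """Forward pass using the binary step functions of Eqs. (2) and (4)."""
    xb = list(x) + [1]  # append the constant bias input
    hidden = [1 if sum(a * b for a, b in zip(xb, row)) > 0 else 0 for row in w]
    n_out = len(u[0])
    return [1 if sum(h * row[k] for h, row in zip(hidden, u)) > 0 else 0
            for k in range(n_out)]


# Hypothetical training set: three patterns, one output neuron.
PATTERNS = [(1, 1, 1), (0, 0, 0), (0, 0, 1)]
TARGETS = [[1], [0], [1]]

w0, u0 = train_cc4(PATTERNS, TARGETS, r=0)
w1, u1 = train_cc4(PATTERNS, TARGETS, r=1)  # a different radius needs new weights
```

Because r enters the bias weight, the two calls to `train_cc4` produce different weight matrices; answering the same query under another radius requires another trained network.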

The 2-CC4 Algorithm. By modifying the CC4 network,

we designed a new neural network. The modified CC4

network, called 2-CC4, is a two-layered CC4 network. The

second half of the original CC4 network is cut off.

The architecture of the 2-CC4 network is shown in Fig. 10.

The input layer and output layer of the 2-CC4 network are

identical to the input layer and the hidden layer of the

original CC4 network, respectively. The input layer and

the output layer of the new network are also fully connected.

The weights of the input layer to the output layer are

assigned according to Eq. (5). The activity function and the

transfer function of the jth output neuron are shown in Eqs.

(6) and (7), respectively. Because the radius can be

dynamically determined by the user, r is now associated

with the calculation of the activity function, as shown in

Eq. (6).

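$$
w_{ij} =
\begin{cases}
1, & X_{ji} = 1 \\
-1, & X_{ji} = 0 \\
1 - s, & i = n
\end{cases}
\tag{5}
$$

$$
\mathrm{net}_j = \sum_i X_i w_{ij} + r
\tag{6}
$$

$$
Y_j = f(\mathrm{net}_j) =
\begin{cases}
\mathrm{net}_j, & \mathrm{net}_j > 0 \\
0, & \mathrm{net}_j \le 0
\end{cases}
\tag{7}
$$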
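A short derivation may help to show why thresholding net_j at zero acts as a radius test. It uses only Eqs. (5) and (6) and assumes that the bias neuron contributes its weight 1 − s with a constant input of 1. The counts a, b, and c are auxiliary symbols introduced here for illustration; they do not appear in the original notation.

```latex
\begin{align*}
a &= \#\{i : X_i = 1,\ X_{ji} = 1\}, \quad
b = \#\{i : X_i = 1,\ X_{ji} = 0\}, \quad
c = \#\{i : X_i = 0,\ X_{ji} = 1\}, \\
s &= a + c, \qquad \mathrm{HD} = b + c, \\
\mathrm{net}_j &= \underbrace{(a - b)}_{\text{term inputs}}
  + \underbrace{(1 - s)}_{\text{bias}} + r
  = a - b + 1 - a - c + r
  = 1 + r - \mathrm{HD}, \\
\mathrm{net}_j &> 0 \iff \mathrm{HD} \le r
  \quad (\text{all quantities are integers}).
\end{align*}
```

Under these assumptions, an output neuron is active exactly when the Hamming distance between the query and the corresponding training vector does not exceed r.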

Fig. 9. Architecture of CC4 neural network.

Fig. 10. Architecture of 2-CC4 network.


Suppose that there are three training vectors: (1,1,1), (0,0,0), and (0,0,1). We can then obtain the weights of the input layer to the output layer as shown in Table 1. Given the input U1, whose input vector is (1,1,1) with r = 0, the output vector net_j(U1) is (1,0,0), which indicates that the first training vector matches U1. Similarly, if U2: {(1,0,1); r = 0} and U3: {(1,0,1); r = 1}, the output vectors net_j(U2) and net_j(U3) are (0,0,0) and (1,0,1), respectively. In other words, there is no training vector that matches U2. However, if we tolerate a radius of one, both the first and the third training vectors match (1,0,1).
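The worked example of Table 1 can be reproduced with a few lines of Python. This is a sketch written from Eqs. (5) to (7), not the algorithm of Fig. 11, which is not reproduced in this text; the function names `learn_2cc4` and `query_2cc4` are hypothetical, and the bias is assumed to be the last input with a constant value of 1.

```python
from typing import List, Sequence


def learn_2cc4(training_vectors: Sequence[Sequence[int]]) -> List[List[int]]:
    """Build one weight row per training vector, following Eq. (5).

    The last entry of each row is the bias weight 1 - s, where s is the
    number of 1's in the training vector. The radius is not needed here.
    """
    return [[1 if x == 1 else -1 for x in t] + [1 - sum(t)]
            for t in training_vectors]


def query_2cc4(weights: Sequence[Sequence[int]],
               query: Sequence[int], r: int) -> List[int]:
    """Evaluate Eqs. (6) and (7) for every output neuron."""
    xb = list(query) + [1]  # append the constant bias input
    outputs = []
    for row in weights:
        net = sum(a * b for a, b in zip(xb, row)) + r  # Eq. (6)
        outputs.append(net if net > 0 else 0)          # Eq. (7)
    return outputs


WEIGHTS = learn_2cc4([(1, 1, 1), (0, 0, 0), (0, 0, 1)])

out_u1 = query_2cc4(WEIGHTS, (1, 1, 1), r=0)
out_u2 = query_2cc4(WEIGHTS, (1, 0, 1), r=0)
out_u3 = query_2cc4(WEIGHTS, (1, 0, 1), r=1)
```

The weights are computed once and reused for every radius, which is the practical difference from CC4: only the query step receives r.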

The algorithm of 2-CC4 can be divided into two parts:

learning and testing as shown in Fig. 11. The learning

algorithm will be executed only when the DTD table (or the

term table) is updated. The testing algorithm will be

executed whenever a search query is requested. Assume

that the number of DTDs in the DTD table is m and the

number of terms in the term table is n. As shown in Fig. 11,

the time complexity of the testing algorithm is O(mn).

When n is smaller than p^2, 2-CC4 is faster than HD.
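To make the comparison concrete, the two bounds can be evaluated for the parameter values used in the experiments of Section 4 (p = 30 and n between 50 and 200). The figures below ignore constant factors and are only a rough illustration of the bounds, not a prediction of measured times.

```latex
\[
p^2 = 30^2 = 900, \qquad n \in \{50, 100, 150, 200\} \;\Rightarrow\; n < p^2
\]
\[
\frac{m p^2}{m n} = \frac{p^2}{n} = \frac{900}{100} = 9 \quad \text{for } n = 100
\]
```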

3.3.2. Ranking module

The ranking module evaluates those DS_i with HD(U, DS_i) ≤ r, denoted as DS_i^r, and sorts DS_i^r based on HD(U, DS_i^r) and MDL(DS_i^r). The ranking module can be divided into two parts according to their functions: calculation and sorting.

Calculation. The length required for describing DS_i^r is measured by MDL as described earlier. In the calculation function, the MDL for each DS_i^r, denoted as MDL(DS_i^r), is calculated.

Sorting. The sorting function sorts DS_i^r based on HD(U, DS_i^r) and MDL(DS_i^r). The major sorting criterion is HD(U, DS_i^r). First, the sorting function sorts DS_i^r in ascending order of HD(U, DS_i^r). For those DS_i^r with identical HD(U, DS_i^r), the ordering is then based on MDL(DS_i^r). If a user does not specify any constraint on the logical structure of U, DS_i^r is sorted in ascending order of MDL(DS_i^r); otherwise, DS_i^r is sorted in ascending order of |MDL(U) − MDL(DS_i^r)|.
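The two-level ordering can be sketched with a composite sort key. The `Candidate` record, the `rank` function, and all HD and MDL values below are hypothetical; the MDL computation itself is described earlier in the paper and is not reimplemented here.

```python
from typing import List, NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    name: str
    hd: int      # HD(U, DS_i^r), already known to be within the radius
    mdl: float   # MDL(DS_i^r), assumed to be computed beforehand


def rank(candidates: Sequence[Candidate],
         query_mdl: Optional[float] = None) -> List[Candidate]:
    """Sort by HD first, then by MDL or by distance to the query's MDL.

    query_mdl is None when the user gave no structural constraint.
    """
    if query_mdl is None:
        return sorted(candidates, key=lambda c: (c.hd, c.mdl))
    return sorted(candidates, key=lambda c: (c.hd, abs(query_mdl - c.mdl)))


CANDIDATES = [
    Candidate("a", 1, 5),
    Candidate("b", 0, 20),
    Candidate("c", 0, 9),
    Candidate("d", 0, 12),
]

unconstrained = [c.name for c in rank(CANDIDATES)]
constrained = [c.name for c in rank(CANDIDATES, query_mdl=11)]
```

In both cases candidate "a" comes last because its Hamming distance is larger, regardless of its small MDL.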

3.3.3. Maintenance module

This module accepts requests from the management

interface to perform the requested tasks. All fundamental

maintenance functions such as insertion, deletion, and

modification are included in this module. The major tasks

accomplished by the maintenance module are to assure the

accuracy and consistency of the term table and the schema

tables.

4. Experiments

For interoperability, XDSearch was developed in Java. Microsoft Access was used to store terms and schema information except for DTD files. Jigsaw, a Java Servlet server, was used as the servlet engine and the Web server. All our experiments were executed on a PC with an Intel Celeron 300 processor and 256 megabytes of main memory. The operating system installed on the PC was Microsoft Windows 2000 Professional. All programs were developed with JDK 1.2.

4.1. Performance comparisons

Lu et al. (2001) defined 12 DTDs for Taiwan’s flower distribution channel. The smallest and largest numbers of elements and attributes in these DTDs are 20 and 43, respectively, and the average number is 32. Therefore, we designed our experiments such that the number of terms in each DTD is between 20 and 40. Each

Fig. 11. 2-CC4 algorithms.

Table 2
Training and search time for 2-CC4 and HD (in ms) with p = 30 and n = 100

No. of DTDs   2-CC4 training time   2-CC4 training time per DTD   2-CC4 search time   HD search time
1000          10                    0.010                         371                 771
2000          20                    0.010                         811                 1552
3000          30                    0.010                         1212                2323
4000          40                    0.010                         1612                3094
5000          50                    0.010                         2033                3865
6000          60                    0.010                         2414                4646
7000          71                    0.010                         2835                5417
8000          90                    0.011                         3225                6179
9000          100                   0.011                         3635                6970
10,000        110                   0.011                         4016                7751

Table 1
Realization of schema matching using 2-CC4

Training vector   s   Weights            net_j(U1)   net_j(U2)   net_j(U3)
(1, 1, 1)         3   (1, 1, 1, -2)      1           0           1
(0, 0, 0)         0   (-1, -1, -1, 1)    0           0           0
(0, 0, 1)         1   (-1, -1, 1, 0)     0           0           1


experiment is conducted in various configurations. The

numbers of terms in the term table are 50, 100, 150, and 200.

The numbers of DTDs in the DTD table are within the range

between 1000 and 10,000. Because the search time for both

2-CC4 and HD is close to zero milliseconds (ms) when there

is only one request, the execution time of each test is

measured by running one request 100 times. The contents of

all DTDs and the user request U are randomly generated

from the terms in the term table.
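The measurement setup can be approximated as follows. The paper does not describe its generator or timer, so everything here is an assumption made for illustration: the helper names `random_dtd_vector` and `time_repeated`, the use of a seeded `random.Random`, and the use of `time.perf_counter`.

```python
import random
import time
from typing import Callable, List


def random_dtd_vector(n: int, p: int, rng: random.Random) -> List[int]:
    """Binary vector of length n with exactly p randomly placed 1's."""
    ones = set(rng.sample(range(n), p))
    return [1 if i in ones else 0 for i in range(n)]


def time_repeated(fn: Callable[[], object], repeats: int = 100) -> float:
    """Total wall-clock time in milliseconds for `repeats` calls of fn."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) * 1000.0


rng = random.Random(0)                       # fixed seed for repeatability
dtd_table = [random_dtd_vector(100, 30, rng) for _ in range(1000)]
user_request = random_dtd_vector(100, 30, rng)


def one_search() -> int:
    """Count exact matches by a plain Hamming-distance scan (r = 0)."""
    return sum(1 for d in dtd_table
               if sum(x != y for x, y in zip(d, user_request)) == 0)


elapsed_ms = time_repeated(one_search, repeats=100)
```

Repeating one request many times, as in the experiments, averages out timer resolution when a single request completes in well under a millisecond.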

In the first experiment, all DTDs in the DTD table and the user request are generated at random. The number of terms for all DTDs in this experiment is 30 (i.e. p = 30). Table 2 lists the experimental results when the length of the term table is 100 (i.e. n = 100). In the table, column one indicates the number of DTDs in the DTD table, denoted as m; column two is the training time required for 2-CC4; and column three is the average training time for each DTD. As shown in the table, the average training time for each DTD is about 0.01 ms, and the training time grows linearly as m increases. The table also shows that 2-CC4 searches faster than HD. When m increases while both n and p remain fixed, the search times for both 2-CC4 and HD increase linearly, because the time complexities of both 2-CC4 and HD are directly proportional to m. Additionally, the training time of 2-CC4 is quite short. This makes it possible to retrain the 2-CC4 network online without interrupting services.

In the second experiment, we measure the influence on search time when n is varied. Table 3 shows the experimental results when p = 30. From the table, it is easy to tell that the search time of 2-CC4 increases as n grows. However, the increases in the search time of HD are relatively insignificant as n is varied. For m = 5000, as n grows from 50 to 100, 150, and 200, the search time for 2-CC4 increases by about 41% ((2193 − 1553)/1553), 79% ((2774 − 1553)/1553), and 101% ((3115 − 1553)/1553), respectively. By contrast, for HD, the search time increases by only about 6%, 9%, and 11%. This is because the time complexity of HD does not depend on n.
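The percentages quoted for the second experiment can be rechecked from the m = 5000 row of Table 3. The snippet below is only a convenience for verifying the arithmetic; the helper name `pct_increase` is hypothetical.

```python
def pct_increase(base: int, value: int) -> int:
    """Relative increase over `base`, rounded to the nearest whole percent."""
    return round(100 * (value - base) / base)


# Search times in ms for m = 5000 and n = 50, 100, 150, 200 (Table 3).
CC4_MS = [1553, 2193, 2774, 3115]
HD_MS = [14571, 15512, 15913, 16183]

cc4_growth = [pct_increase(CC4_MS[0], v) for v in CC4_MS[1:]]
hd_growth = [pct_increase(HD_MS[0], v) for v in HD_MS[1:]]
```

The lists come out as 41, 79, 101 for 2-CC4 and 6, 9, 11 for HD, matching the figures in the text.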

The third experiment is designed to find out the influence on search time when p is varied. Table 4 lists the experimental results when n = 100. For m = 5000, as p grows from 20 to 25, 30, 35, and 40, the search time for HD increases by about 32% ((11,597 − 8793)/8793), 78% ((15,613 − 8793)/8793), 133% ((20,460 − 8793)/8793), and 189% ((25,416 − 8793)/8793), respectively. For 2-CC4, the search time increases by only about 20%, 29%, 43%, and 53%. Therefore, the search time required by HD increases much faster than that required by 2-CC4 as p grows.

All the experiments described so far are in the worst case because we assume that all DTDs in the DTD table have an identical number of terms (i.e. p = 30 for all experiments conducted so far). However, the DTDs in the DTD table in general do not have an identical number of terms. Therefore, in the fourth and fifth experiments, the contents and numbers of terms of all DTDs in the DTD table are generated randomly. In the fourth experiment, we find out the influence on search time when the radius is varied. Table 5 shows the experimental results when m = 1000 and n = 100. According to the table, 2-CC4 is at least 3.8 times faster than HD.

Table 6 shows the experimental results of the fifth experiment when n = 100 and the number of terms of U is 30, which is approximately the average number of terms found in the DTDs of Taiwan’s flower distribution channel. With the most commonly chosen radiuses (i.e. r is 0, 1, 2, or 3), 2-CC4 is still faster than HD.

Table 3
Search time for 2-CC4 and HD (in ms) with p = 30

                 Number of terms in the term table
Number of DTDs   50                100               150               200
                 2-CC4   HD        2-CC4   HD        2-CC4   HD        2-CC4   HD
1000             290     2925      410     3085      511     3165      591     3185
5000             1553    14,571    2193    15,512    2774    15,913    3115    16,183
10,000           3055    29,562    4357    32,557    5568    33,358    6279    33,549

Table 4
Search time for 2-CC4 and HD (in ms) with n = 100

                 Number of terms of U
Number of DTDs   20                25                30                35                40
                 2-CC4   HD        2-CC4   HD        2-CC4   HD        2-CC4   HD        2-CC4   HD
1000             341     1743      380     2304      410     3064      491     4016      521     5027
5000             1762    8793      2123    11,597    2273    15,613    2523    20,460    2694    25,416
10,000           3535    17,635    4216    24,375    4526    32,567    4978    41,520    5438    51,384


In fact, all experimental results indicate that 2-CC4 is far

superior to HD.

4.2. Discussions

From the previous discussions, XDSearch is shown to have the following properties.

Precision. XDSearch uses Hamming distance to measure the similarity between any two document schemata. Users are allowed to specify the value of r. With r = 0, XDSearch guarantees 100% accuracy. When r > 0 is specified by users, additional, similar document schemata are also retrieved.

Concision. MDL is deployed as the ranking algorithm for XDSearch. If a user only knows what elements should be included in a schema but has no idea about the structure of the schema, DTDs with smaller MDL in the repository should be ranked higher than other DTDs with larger MDL. Additionally, if the user specifies constraints (such as ‘*’, ‘+’, ‘?’, ‘|’, etc.) on document structure, DTDs with the smaller |MDL(U) − MDL(DS_i^r)| should be ranked higher than other DTDs with larger |MDL(U) − MDL(DS_i^r)|. For a query like the one shown in Fig. 6, its search results are shown in Fig. 12. Because the MDL of the query is 11, the results are sorted in ascending order of |MDL(U) − MDL(DS_i^r)|.

Consideration of the synonymy and polysemy problem. The concept of ontology was incorporated into the design of XDSearch and was implemented in the term table and the schema tables. Therefore, document schemata for purchase orders, which were defined as either <Puchase_Order>, <Orders>, or <PO> in DTD, can be found. An example is

Table 6
All DTDs are generated with a random number of terms when the length of the term table is 100

              Radius
No. of DTDs   0                1                2                 3
              2-CC4   HD       2-CC4   HD       2-CC4   HD        2-CC4   HD
1000          40      260      121     731      190     1241      290     2043
2000          141     560      351     1622     480     2644      661     4196
3000          231     781      530     2533     721     3996      1001    6359
4000          310     1152     691     3194     941     5098      1312    8242
5000          411     1402     871     4076     1182    6369      1642    10,295
6000          501     1803     1051    4827     1423    7691      2023    12,659
7000          521     1853     1192    5578     1642    8953      2323    14,662
8000          661     2403     1422    6680     1933    10,575    2684    16,995
9000          721     2644     1552    7391     2143    11,797    3024    19,168
10,000        801     2955     1743    8302     2353    13,048    3305    21,290

Table 5
Search time for 2-CC4 and HD (in ms) with m = 1000 and n = 100

         Number of terms of U
Radius   20              25              30              35              40
         2-CC4   HD      2-CC4   HD      2-CC4   HD      2-CC4   HD      2-CC4   HD
0        50      191     40      281     40      260     60      361     51      350
1        111     510     130     651     121     731     160     991     181     1241
2        160     801     180     962     190     1241    241     1712    250     1903
3        200     992     230     1302    241     1682    301     2303    320     2614
4        240     1182    260     1553    290     2043    340     2774    361     3154
5        271     1282    290     1733    330     2324    371     3065    411     3645
6        270     1412    311     1893    351     2614    400     3404    430     3996
7        290     1483    330     2053    370     2714    420     3595    461     4186
8        290     1530    351     2093    390     2865    440     3816    470     4467
9        301     1552    351     2193    390     2935    451     3915    491     4626

Fig. 12. Example search results.


shown in Fig. 12. Moreover, as shown in Fig. 13, elements (for example, ‘No’ and ‘Date’) match the ‘Doc_No’ and ‘DateTime’ elements of the example query, respectively. Although the term table and the schema tables are currently defined in relational databases, they can be replaced by a native XML database such as Software AG’s Tamino or eXcelon’s XIS.

Speed. As shown in the experimental results, XDSearch is fast. In the worst case, each U needs about 4.44 ms using the 2-CC4 algorithm when the number of known DTDs is 1000 and the length of the term table is 100. Also, as shown in the first experiment, the training time is at most 110 ms for up to 10,000 DTDs. This makes it possible to retrain the 2-CC4 network online without interrupting services.

Maintainability. We designed XDSearch with modularity in mind such that any change to a module can be made without affecting other modules. Moreover, we developed several utilities to assist with system administration.

5. Conclusions and future works

To facilitate the development of electronic commerce applications, great efforts have been dedicated to the development of XML repositories. One of the core features of XML repositories is to provide users with an intelligent search utility to locate reusable entities such as document schemata. However, none of the known search engines is suitable for searching document schemata. In this paper, we proposed an efficient search engine for XML document schemata. This search engine, called XDSearch, provides speedy and accurate search for developers to locate document schemata. Additionally, the development of XDSearch plus document schema extractors such as XTRACT (Garofalakis et al., 2000) and DTD-Miner (Moh et al., 2000) helps research in areas such as web mining, web-based databases, etc.

Currently, this model has one major drawback: newly collected XML schemata cannot be translated and placed in the DTD table automatically. This translation requires human intervention to correctly map an element to a specific term in the ontology. Generating the DTD table may therefore require considerable effort. However, this is a one-time task. Once more and more DTDs are collected, what remains is a minor job for the system managers to update the DTDs.

Acknowledgements

This research was supported in part by the Research

Board of Chaoyang University of Technology, Taiwan,

ROC, under contract number: 89-A016.

References

Bourret, R., Bornhovd, C., & Buchmann, A. (2000). A generic load/extract

utility for data transfer between XML documents and relational

databases. Proceedings of the Second International Workshop on

Advanced Issues of E-Commerce and Web-based Information Systems

(pp. 134–143).

Chandrasekaran, B., Josephson, J. R., & Benjamins, V. R. (1999).

What are ontologies, and why do we need them? IEEE Intelligent

Systems, 14, 20–26.

Ciancarini, P., Vitali, F., & Mascolo, C. (1999). Managing complex

documents over the WWW: a case study for XML. IEEE Transactions

on Knowledge and Data Engineering, 11, 629–638.

Decker, S., Melnik, S., Van Harmelen, F., Fensel, D., Klein, M., Broekstra,

J., Erdmann, M., & Horrocks, I. (2000). The semantic web: the roles of

XML and RDF. IEEE Internet Computing, 63–74.

Fernandez, M., Tan, W.-C., & Suciu, D. (2000). SilkRoute: trading between

relations and XML. Computer Networks, 33, 723–745.

Filman, R. E., & Pant, S. (1998). Search the Internet. IEEE Internet

Computing, 21–23.

Garofalakis, M., Gionis, A., Rastogi, R., Seshadri, S., & Shim, K. (2000).

XTRACT: a system for extracting document type descriptors from

XML documents. SIGMOD, 29(2), 165–176.

Gudivada, V. N., Raghavan, V. V., Grosky, W. I., & Kasanagottu, R.

(1997). Information retrieval on the world wide web. IEEE Internet

Computing, 58–68.

Iacovou, C. L., Benbasat, I., & Dexter, A. (1995). Electronic data

interchange and small organizations: adoption and impact of technology.

MIS Quarterly, 465–485.

Kak, S. C. (1993). New training algorithm in feedforward neural networks.

In P. P. Wang (Ed.), Advances in fuzzy theory and technologies. First

International Conference on Fuzzy Theory and Technology, Durham,

NC, October 1992, Durham, NC: Bookwright Press.

Kobayashi, M., & Takeda, K. (2000). Information retrieval on the web.

ACM Computing Surveys, 32, 144–173.

Kotok, A. (1999). White Paper on Global XML Repositories for XML/EDI.

XML/EDI Group.

Kotok, A. (2002). Government and finance industry urge caution on XML.

XML.com.

Kotsakis, E. (2002). XSD: a hierarchical access method for indexing XML

schemata. Knowledge and Information Systems, 4, 168–201.

Kotsakis, E., & Bohm, K. (2000). XML schema directory: a data structure

for XML data processing. Proceedings of the First International

Conference on Web Information Systems Engineering (pp. 62–69).

Lu, E. J.-L., Chou, S., & Tsai, R.-H. (2001). An empirical study of XML/

EDI. Journal of Systems and Software, 58, 269–277.

Lu, E. J.-L., & Hwang, R.-J. (2001). A distributed EDI model. Journal of

Systems and Software, 56(1), 1–7.

Moh, C.-H., Lim, E.-P., & Ng, W.-K. (2000). DTD-Miner: a tool for

mining DTD from XML documents. Proceedings of the Second International

Workshop on Advanced Issues of E-Commerce and Web-based

Information Systems (pp. 144–151).

Fig. 13. The DTD for P_Order.


Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14,

465–471.

Shanmugasundaram, J., Tufte, K., & He, G. (1999). Relational databases for

querying XML documents: limitations and opportunities. Proceedings

of the 25th VLDB Conference (pp. 302–314).

Shu, B., & Kak, S. (1999). A neural network-based intelligent metasearch

engine. Information Sciences, 120, 1–11.

Szuprowicz, B. O. (1997). Search engine technologies for the

world wide web and Internet. Computer Technology Research

Corp, USA.

Tang, K.-W., & Kak, S. C. (1998). A new corner classification

approach to neural network training. Circuits, Systems, and Signal

Processing, 17(4), 459–469.

Webber, D. R. R. (1998). Introducing XML/EDI frameworks. Electronic

Markets, 8, 38–41.

Zhang, K., Statman, R., & Shasha, D. (1992). On the editing distance between

unordered labeled trees. Information Processing Letters, 42, 133–139.

Zhang, K., & Shasha, D. (1989). Simple fast algorithms for the editing

distance between trees and related problems. SIAM Journal of

Computing, 18(6), 1245–1262.
