XDSearch: an efficient search engine for XML document schemata
Eric Jui-Lin Lu*, Yu-Ming Jung
Department of Information Management, Chaoyang University of Technology, 168 Gifeng E. Road, Wufeng,
Taichung County 413, Taiwan, ROC
Abstract
Electronic commerce is an emerging trade model undergoing rapid development. So far, enormous numbers of business
transactions have been conducted over the Internet. It is believed that extensible markup language (XML) is the best layout format for
exchanging messages over the Internet. Since XML developers can define their own elements, it is common that various elements may be
used to illustrate the same thing or one element name is used to describe different things. This makes it extremely difficult to exchange XML
documents among businesses, not to mention redundant investments in the design of XML documents. If a business can obtain a document
schema similar to the one that is currently being used and modify the schema to fit its needs, then not only can the development costs be
reduced, but also redundant design of XML documents can be avoided. Furthermore, the difficulty in data interchange among
trading partners can be alleviated. To solve the problems, many well-known international organizations have joined forces to develop XML
repositories in the hope of increasing reusability of collected document schemata. Unfortunately, there is scarcely any efficient search
mechanism provided for these XML repositories. In this paper, by taking advantage of the concept of ontology and the neural network
techniques, we shall propose and implement a search engine, called XDSearch, for XML document schemata. XDSearch allows developers
to easily and quickly locate document schemata in an XML repository as close to what they need as possible.
© 2002 Elsevier Science Ltd. All rights reserved.
Keywords: Extensible markup language; Search engine; Neural network; Ontology; XML repository
1. Introduction
Electronic commerce is an emerging trade model under
rapid development. Up to now, enormous numbers of
business transactions have been processed over the
Internet. To conduct transactions on the networks,
enterprises must be able to exchange messages efficiently.
It is believed that extensible markup language (XML) is
the best layout format for exchanging messages over the
Internet (Ciancarini, Vitali, & Mascolo, 1999; Fernandez,
Tan, & Suciu, 2000; Lu, Chou, & Tsai, 2001; Webber,
1998). Since XML is a meta-language, developers can
define tags and attributes to fit their own needs. Fig. 1
shows an example purchase order encoded in XML. As
shown in Fig. 1, it is obvious that the purchase order,
numbered 12345678, is dated 09/12/1999 and
contains two product items. Its vendor’s name is
‘Executive Office Supplies’. To describe or constrain the
logical structure of XML documents, developers in
general use schemata to validate the contents of XML
documents (Garofalakis, Gionis, Rastogi, Seshadri, &
Shim, 2000; Moh, Lim, & Ng, 2000). An XML document
schema specifies what tags are allowed, in what order they
should be, what attributes may appear in specific tags, and
what tags may be included in other tags in an XML
document. Fig. 2 is an example of document type
definition (DTD) associated with the purchase order
shown in Fig. 1.
Since XML allows developers to define their own
elements and attributes, it is common that various element
names may be used to illustrate one thing or one element
name is used to describe various things (we shall call this the
‘synonymy and polysemy’ problem hereinafter). This makes
it extremely difficult to exchange XML documents among
businesses. Additionally, the investments in the design of
XML documents may be redundant because similar
documents are likely to be redefined by many companies.
Moreover, when the number of types of business documents
becomes large, the design of XML documents will become a
huge burden to the information technology (IT) department.
To resolve the problems, it is believed that XML
repositories (or registries) have to be established so that
0957-4174/03/$ - see front matter © 2002 Elsevier Science Ltd. All rights reserved.
PII: S0957-4174(02)00150-1
Expert Systems with Applications 24 (2003) 213–224
www.elsevier.com/locate/eswa
* Corresponding author. Fax: +886-437-42337.
E-mail address: [email protected] (E.J.L. Lu).
re-usable document schemata and entities, such as XML
processing rules and utilities, can be registered, and later
developers can reuse the registered schemata (Kotok, 2002).
By making use of existing schemata, the development costs
can be reduced, redundant design of XML
documents can be avoided, and interoperability can be
improved. Furthermore, the difficulty in data interchange
among trading partners can be alleviated (Iacovou,
Benbasat, & Dexter, 1995; Lu & Hwang, 2001). As a
result, many internationally well-known organizations have
joined forces to develop XML repositories in projects such
as ebXML, BizTalk, etc. Even the US General Accounting
Office, in its recently published report entitled ‘Electronic
Government: Challenges to Effective Adoption of the
Extensible Markup Language’, recommended establish-
ment of XML repositories to speed up XML development.
Once a repository is established, the number of
registered entities will keep growing. Therefore,
a fast and user-friendly intelligent search engine will
be imperative for XML repositories (Kotok, 1999; Kotsakis,
2002; Kotsakis & Bohm, 2000). For data-centric appli-
cations in electronic commerce, a search engine for XML
repositories should at least fulfill the following requirements
(Kobayashi & Takeda, 2000).
Precision. Precision means that the search results
returned by a document schema search engine should
conform to users’ requests as precisely as possible. Also, the
order of elements is not important because, as long as
the values of elements a, b, and c can be retrieved, a schema
in a form like (a, b, c) is as good as a schema in a
form like (a, c, b) for data-centric applications (Bourret,
Bornhovd, & Buchmann, 2000; Shanmugasundaram, Tufte,
& He, 1999).
Concision. Concision implies that, for document sche-
mata of the same precision, the briefest document schemata
are better than longer ones. For example, when searching for
document schemata which contain a, b, and c, the
ranking of (a, b, c) should be higher than that of (a, (b, c)?)
because documents with complex structures are harder to
process than those with simple structures. Since it is
extremely difficult to measure recall rate and since most
web users tend to view search results without hitting ‘next
page’ buttons (Kobayashi & Takeda, 2000), search results
should be sorted by the degree of concision: the more
concise the schemata, the higher they are ranked.
Consideration of the synonymy and polysemy problem.
Due to the synonymy and polysemy problem, it is likely
that, although <uprice>, which means unit price, exists in a
repository, its search engine cannot locate <uprice> for a user
if the user requests for ‘unit_price’. Therefore, the design of
a search engine for document schemata should take into
account the synonymy and polysemy problem.
Speed. The search for document schemata should be as
fast as possible.
Maintainability. A good search engine should be easy to
maintain.
Currently, there are two major approaches for search
engines: keyword search and directory search (Filman &
Pant, 1998; Gudivada, Raghavan, Grosky, & Kasanagottu,
1997; Kobayashi & Takeda, 2000). The keyword search
uses keywords to create indices on the collected documents.
Users search for required documents by providing key-
words. Because XML document schemata such as DTD
contain special operators such as ‘*’, ‘+’, ‘?’, and ‘|’, keyword
search is not suitable for XML repositories. The directory search uses
classification methods to classify similar documents into
categories, compare the similarity between collected
documents with user’s queries, and then present search
results. Conventional classification methods are based on
keywords, phrases, links, etc. and thus are not suitable for
XML repositories. Kotsakis (2002) and Kotsakis and Bohm
(2000) proposed an XML schema directory (XSD) which
merges similar document schemata into categories. The
similarity of document schemata is measured by using
Zhang and Shasha’s algorithm to calculate editing distances
between any two ordered trees (Zhang & Shasha, 1989).
Although XSD is shown to be fast and guarantees 100%
accuracy, XSD does not take into account the synonymy and
polysemy problem. Also, as stated earlier, the order of
elements in XML documents is not important for data-
centric applications. Thus, it is not appropriate to use Zhang
and Shasha’s algorithm to measure the similarity between
two schemata. Note that, it was proven that the calculation
of editing distance between two unordered trees is
NP-complete (Zhang, Statman, & Shasha, 1992). Therefore,
Fig. 1. Simple XML for a purchase order.
Fig. 2. An example of DTD.
overcoming these problems and developing an efficient
document schema search engine is an important topic of
research that demands immediate attention.
There are several XML schema languages, such as DTD,
XML Schema, and external data representation (XDR), to
describe XML document structure. Among these schema
languages, DTD has been adopted by XML since it was
released. Because DTD is simple and has been widely
adopted in a variety of industries, in this paper, we design
and implement a search engine, called XDSearch, for
document schemata written in DTD. In XDSearch, to
resolve the synonymy and polysemy problem, the concept
of ontology is deployed. Then, the similarity between two
document schemata is measured by their Hamming
distance. Since a bias r can be specified by users, the
collected schemata with Hamming distances less than or
equal to r will be returned by XDSearch. With r = 0,
XDSearch guarantees 100% accuracy. Additionally, the
ranking of search results is determined by minimum
description length (MDL) (Rissanen, 1978) which is used
to measure the concision of a document schema. Because
the time complexity of calculating Hamming distance is
high, we propose a 2-CC4 neural network, which is adapted
from the CC4 neural network, to speed up the search in this
paper. In this research, we design XDSearch with
modularity in mind such that any change to a module can
be done without affecting other modules. Therefore,
XDSearch proposed in this paper completely fulfills the
above requirements.
This paper is organized as follows. The current
development of search engine technology will be briefly
reviewed in Section 2. The proposed search engine for XML
document schemata, XDSearch, will be thoroughly
described in Section 3. In Section 4, the experimental
results of XDSearch are studied and analyzed. Finally, our
conclusion will be made in Section 5.
2. Related works
Currently, there are two major approaches for search
engines: keyword search and directory search (Filman & Pant,
1998; Gudivada et al., 1997; Kobayashi & Takeda, 2000).
2.1. Keyword search
In general, keyword search engines use crawlers or
robots to discover Web resources and store collected data in
server-side database systems. Search engines then create
indices on the collected data to provide fast lookup services
for users (Filman & Pant, 1998; Shu & Kak, 1999). The
most famous search engines of this type are AltaVista,
WebCrawler, Excite, and Infoseek.
There are two obvious approaches of using keyword
search on XML repositories. One approach is to index
the names of document types such as ‘purchase order’.
Users can then locate all document schemata for purchase
orders. However, this approach has some serious limi-
tations. Firstly, it is hard to find an effective ranking
algorithm for it since every schema is a purchase order.
Secondly, users may be flooded with search results if there
are a large number of document schemata for purchase
orders. Moreover, even if there exist some schemata which
are very close to users’ needs, they cannot be retrieved since
they are not categorized as purchase order.
The other approach is to index element and attribute
names of document schemata such as <Date> as shown in
Fig. 2. Users can then search for schemata that contain
queried keywords. For example, by specifying both
‘Purchase_Order’ and ‘Date’, all document schemata
containing both ‘Purchase_Order’ and ‘Date’ are retrieved.
Note that the search results will also include schemata in
which ‘Purchase_Order’ and ‘Date’ are not element or
attribute names but are merely the default values of
some elements or attributes. Also, because XML document
schemata such as DTD contain special operators such as
‘*’, ‘+’, ‘?’, and ‘|’, and because keyword search engines
use keywords to look up the contents of schemata, users
cannot specify a query like (a, (b? | c+)). Furthermore, due
to the synonymy and polysemy problem, it is likely that,
although <uprice>, which means unit price, exists in a
repository, its search engine cannot locate <uprice> for a user
if the user requests for unit_price.
2.2. Subject directory search
In general, directory search engines use classification
methods to classify all collected data into a hierarchy and
allow users to search for the desired web pages by either
entering a query or navigating the hierarchy (Kobayashi &
Takeda, 2000; Szuprowicz, 1997). The most famous
search engines of this type are Yahoo, Infoseek, Excite,
and Lycos.
Conventional classification methods are based on key-
words, phrases, links, etc. With chosen classifications,
similar documents are grouped together. Search engines will
then retrieve documents that are similar to a user’s query.
However, conventional classification methods are not
suitable for XML repositories because they do not take
into account the logical structure of document schemata.
Thus, Kotsakis (2002) and Kotsakis and Bohm (2000)
proposed an XML Schema Directory (XSD). XSD aggre-
gates similar document schemata into merger schemata and
then merges similar merger schemata into higher-level
schemata. The similarity of document schemata is measured
by editing distance between two trees. Because an XML
schema is actually a tree, the distance between any two
schemata is based on the number of edit operations (such as
insertion, deletion, and substitution) that need to be
performed to convert a tree into another. The smaller the
distance between two schemata, the more similar they are.
XSD guarantees 100% accuracy and is shown to be fast
because it deploys a very fast algorithm, proposed by Zhang
and Shasha (1989), to calculate the minimum editing
distance between two ordered trees. However, XSD has
several major drawbacks. First of all, as stated earlier, the
order of elements in XML documents is not important for
data-centric applications. A schema in a form like (a, b, c) is
as good as a schema in a form like (c, a, b). As a result, it is
inappropriate to measure similarity between schemata by
using algorithms that calculate editing distance between two
ordered trees. Note that, it was proven that the calculation of
editing distance between any two unordered trees is NP-
complete (Zhang et al., 1992). Also, by treating schemata
with identical elements but in different order as different
schemata, the size of XSD grows rapidly. Furthermore,
XSD does not take into account the synonymy and
polysemy problem. Thus, it is likely that, in XSD, two
identical schemata, except for their root elements, are
treated as two different schemata. This will make XML
Schema Directory unnecessarily huge.
3. XDSearch: a search engine for XML document
schemata
From the earlier discussions, it is clear that there is a
strong need to propose a new measurement to compare the
similarity between two XML schemata and a new ranking
algorithm for XML schemata search engine, because editing
distance is not an appropriate way to measure the similarity
between two schemata. Garofalakis et al. (2000) stated that
an XML document can be described by more than one DTD,
and that a well-defined DTD should have two properties:
precision and concision. Precision means that the DTD
should precisely conform to the XML document. Concision
means that the DTD should be trim and streamlined. This
research will utilize and extend these two properties in the
design of XDSearch.
Precision. Let U denote a user’s request, DS_i represent a
document schema in an XML repository, and f(U, DS_i) be
the difference between U and DS_i. If f(U, DS_i) = 0, DS_i is
identical to U. For example, assume that there are three
known document schemata: DS1: (a, b), DS2: (a, b, c), and
DS3: (a, b, c, d) where a, b, c, and d represent different
elements. If U is (a, b, c), then f(U, DS2) = 0. In other
words, DS2 precisely matches U. Note that if U is (b, c, d),
then none of the DS_i satisfies U. To increase flexibility, a radius,
denoted as r, is adopted to handle this situation. The radius
is a bias that the user can tolerate. In this example, if r is set
to 2, both DS2 and DS3 will then satisfy U. In this
research, we use the Hamming distance (HD), denoted as
HD(x, y), to measure the difference between vectors x and
y. Let DS1 be (1,1,0,0), DS2 be (1,1,1,0), DS3 be (1,1,1,1),
and U be (0,1,1,1). Then HD(U, DS_i) are 3, 2, and 1,
respectively.
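The distances in this example can be verified with a minimal sketch (an illustrative snippet, not the paper's implementation):

```python
# Hamming distance between two equal-length binary vectors: the number
# of positions in which they differ. Vectors are those from the example.
def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

U = (0, 1, 1, 1)
for ds in [(1, 1, 0, 0), (1, 1, 1, 0), (1, 1, 1, 1)]:
    print(hamming(U, ds))  # prints 3, then 2, then 1
```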
Concision. Concision means that a document schema
should be trim and streamlined. Suppose that we have XML
documents of {ab, abab, ababab} where a and b represent
different elements. There are at least two possible candidate
DTDs for these XML documents: (1) (ab)*; (2) ab | ab(ab | abab).
It is quite obvious that the first DTD is more
concise than the second one. In this research, we use MDL
to measure concision and determine the ranking order of
search results (Garofalakis et al., 2000; Rissanen, 1978).
The calculation of MDL involves the following rules:
1. Each element is counted one, but the root element is not
counted.
2. Each meta-character, such as |, *, +, ?, (, and ), is counted
one.
3. The length of any substructure has to be counted.
In the example shown in Fig. 2, this DTD has a root
element called Purchase_Order. According to the first rule,
the Purchase_Order is not counted. Since there are four
elements, the MDL of the example DTD is four based on the
first rule again. Because the element ‘Detail’ has a meta-
character ‘*’, according to the second rule, the MDL is
now five. Note that there is a substructure Detail in the DTD
which is of length 3. Therefore, the total value of the MDL
of the example DTD is eight (i.e. MDL(DTD) = 8).
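The three counting rules can be sketched as follows. The nested purchase-order structure below is an illustrative stand-in for the Fig. 2 DTD (element names such as Vendor_Name and Total are assumptions), not the paper's code:

```python
# Hedged sketch of the three MDL counting rules. A schema node is
# modeled as (name, metachar, children).
def mdl(element, is_root=True):
    name, metachar, children = element
    cost = 0 if is_root else 1              # rule 1: count each non-root element once
    if metachar:
        cost += 1                           # rule 2: each meta-character (*, +, ?, |) counts one
    for child in children:
        cost += mdl(child, is_root=False)   # rule 3: substructure lengths are counted too
    return cost

# A purchase order shaped like Fig. 2: four top-level elements, where
# 'Detail' carries '*' and contains a substructure of length 3.
po = ("Purchase_Order", None, [
    ("Date", None, []),
    ("Vendor_Name", None, []),
    ("Detail", "*", [("Item_No", None, []), ("Qty", None, []), ("Price", None, [])]),
    ("Total", None, []),
])
print(mdl(po))  # → 8, i.e. 4 elements + 1 meta-character + substructure of length 3
```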
If users do not specify any constraint on the queried
document structure, the document schemata with smaller
MDL should be ranked higher. However, if there is any
constraint imposed on the queried document structure, the
search results will be sorted in ascending order based on
the value of |MDL(U) − MDL(DS_i)|. For example, assume
that there are three candidate DTDs: DTD1: (a, b, c), DTD2:
(a, b*, c), and DTD3: (a, b, c*). If U is (a, b*, c), the
values of |MDL(U) − MDL(DTD_i)| are 1, 0, and 0,
respectively. Thus, the order of the search results is DTD2,
DTD3, DTD1.
XDSearch is composed of the information component,
search component, and interface component, and its
architecture is shown in Fig. 3.
3.1. Information component
The information component collects and provides data
for the search engine to work on. These data can be stored in
various formats such as flat files, relational databases, or
even native XML databases. In XDSearch, all data are saved
in a relational database except for DTD files. The
information component is composed of a document schema
repository, a term table, and schema tables.
3.1.1. Document schema repository
The document schema repository consists of XML
document schemata described in either DTDs, XML
Schemas, or XDR. In the current design of XDSearch,
only DTD is considered. Thus, the document schema
repository is also called DTD repository. Each DTD file is
actually stored in the DTD repository and is accessible
through the Internet.
3.1.2. Term table
As described earlier, since XML is a meta-language,
developers can define their own tags, and this results in the
synonymy and polysemy problem. To resolve the synonymy
and polysemy problem, the concept of ontology has to be
incorporated into the design of search engines for XML
document schemata.
An ontology generally contains a hierarchy of a set of
objects within a domain and describes the relationships
among these objects (Chandrasekaran, Josephson, &
Richard Benjamins, 1999; Decker et al., 2000). For
example, business documents may contain fields of ‘order
quantity’ and ‘number of return’. An ontology may be used
to represent both as ‘qty+’, meaning that the inventory stock is
increased. In other words, both ‘order quantity’ and ‘number
of return’ are mapped to ‘qty+’, and this can be seen in
Fig. 4. Additionally, the terms used in the term table can be
further clustered into categories. For example, in Fig. 4,
‘qty+’, ‘qty−’, and ‘money’ are clustered into the
category named ‘numeric’ which is also a term name.
In XDSearch, all element and attribute names are saved
in the field ‘Tag_Name’ of the table ‘DTD_Detail’ as shown
in Fig. 5(a). Each ‘Tag_Name’ is associated with (a)
a specific DTD file which is referenced by the ‘DTD_No’
that in turn points to the URL of the DTD file and (b) a term
name which is referenced by the ‘Term_No’ to which it
belongs. Note that the term table also indicates to which
category a term belongs. Because a category name is
actually a term name, we have a recursive relationship from
the ‘Category_No’ to ‘Term_No’ in the term table. The
terms used in the term table must be unique. To store
the relationships shown in Fig. 4, the tables are created as
shown in Fig. 5(b).
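The term-table lookup can be sketched as a pair of mappings. Every name below is an illustrative assumption based on Fig. 4, not the actual database contents:

```python
# Hedged sketch of the term table: element/attribute names map to
# canonical terms (resolving synonymy), and terms cluster into category
# terms; a category is itself a term, giving the recursive relationship.
TERM_OF = {                  # Tag_Name -> term
    "order_quantity": "qty+",
    "number_of_return": "qty+",
    "unit_price": "money",
    "uprice": "money",
}

CATEGORY_OF = {              # term -> category term
    "qty+": "numeric",
    "qty-": "numeric",
    "money": "numeric",
}

def canonical_term(tag_name):
    """Map a schema tag to its unique term, or None if unregistered."""
    return TERM_OF.get(tag_name.lower())

print(canonical_term("uprice"))                        # → money
print(CATEGORY_OF[canonical_term("order_quantity")])   # → numeric
```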
3.1.3. Schema tables
The schema tables contain the brief descriptions, the
URLs, and the MDL values of registered document
schemata as well as all the element and attribute names in
these schemata. In XDSearch, the schema tables include
both ‘DTD’ and ‘DTD_Detail’ tables, as shown in Fig. 5(a).
For data-centric applications, what matters is
the data contained in XML documents. Therefore, a
schema in a form like (a, b, c) is as good as a schema in a
form like (c, b, a) (Bourret et al., 2000; Shanmugasundaram
et al., 1999). As a result, XDSearch shall retrieve all
registered schemata as long as their element names match
user’s requests. Additionally, based on the observations of
Lu and Hwang (2001), attributes can be treated as elements
without losing any information for data-centric applications.
Thus, attributes, like elements, are treated as terms in
XDSearch. For example, for the element ‘Qty’ in Fig. 2,
there is an attribute ‘unit’. In XDSearch, both ‘Qty’ and
‘unit’ are saved in the ‘DTD_Detail’ table, and thus the
MDL value of the DTD is incremented by two, not one.
3.2. Interface component
The goal of the interface component is to facilitate the
use and management of the search engine. The management
Fig. 3. The architecture of XDSearch.
Fig. 4. Ontology.
interface is to provide system managers an easy-to-use
interface for maintaining XDSearch. In the user interface,
users may select terms from different categories. Fig. 6
shows an example interface we developed for XDSearch.
A selected term can be configured as the child element of the
root element or any selected element. For example, as
shown in Fig. 6, both ‘Item_No’ and ‘Price’ are the sub-
elements of ‘Items’. The radius of HD can also be chosen by
users. Additionally, each selected term can be associated
with a meta-character such as *, +, ?, or | by clicking
either the button ‘Has (*, +, ?)’ or the button ‘Has (|)’. For
example, ‘Items’ in Fig. 6 is associated with the meta-
character ‘+’.
3.3. Search component
The search component is the kernel of XDSearch.
There are three modules in the search component: a
comparison module, a ranking module, and a mainten-
ance module.
3.3.1. Comparison module
The comparison module compares a user’s request U with
each DS_i in the document schema repository. In XDSearch,
HD is used for measuring the difference between two DTDs.
Initially, all DTDs, including U and each DTD_i, are turned into
binary vectors of length n, where n is the number of terms in the
term table. According to the sequence of terms in the term
table, each element of the vectors is set to be either 1 or 0. If
an element of a DTD is associated with the ith element of
the vector, the ith element of the vector is set to be 1;
otherwise it is set to be 0. For example, if the length of the
term table is nine and a purchase order has six terms, this
purchase order is transformed into a vector of length nine,
and the vector has six elements of 1 as shown in Fig. 7. After
the comparisons, all the DTDs with HD(U, DTD_i) ≤ r are
included in the search results. Note that, with r = 0,
XDSearch guarantees 100% accuracy.
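The conversion of Fig. 7 and the radius filter can be sketched as follows. The term names are illustrative assumptions (the real term table lives in the database):

```python
# A DTD becomes a binary vector over the term table: 1 where the DTD
# uses the term, 0 elsewhere. Schemata within radius r of the query pass.
def to_vector(dtd_terms, term_table):
    used = set(dtd_terms)
    return [1 if t in used else 0 for t in term_table]

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

term_table = ["date", "vendor", "item_no", "qty", "price", "total",
              "ship_to", "bill_to", "currency"]          # n = 9 terms
po = to_vector(["date", "vendor", "item_no", "qty", "price", "total"],
               term_table)
print(sum(po))                   # → 6 (six 1-elements, as in Fig. 7)

query = to_vector(["date", "vendor", "item_no", "qty", "price"], term_table)
print(hamming(query, po) <= 1)   # → True: within radius r = 1
```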
The algorithm of calculating HD is shown in Fig. 8.
Assume that the number of DTDs in the DTD table is m and
the number of terms in each DTD is p. From Fig. 8, it is easy
to see that the time complexity of calculating HD is O(mp²).
In the following sections, we will study the CC4 network
and design a 2-CC4 network to speed up the calculation.
CC4 neural network. The CC4 algorithm, proposed by
Kak and Tang, is a corner classification training algorithm
for three-layered feed-forward neural networks (Kak, 1993;
Tang & Kak, 1998). The architecture of CC4 is shown in
Fig. 9. The input and output data are all binary vectors, and
the weights are all integers. The number of input neurons is
the length of the input vector plus one. The additional
neuron is the bias neuron with a constant input of 1.
Fig. 8. Algorithm of HD.
Fig. 7. Conversion of a DTD to a vector.
Fig. 5. Ontology described in RDBMS.
Fig. 6. An example DTD query.
The number of neurons in the hidden layer is the number of
training patterns, and each hidden neuron represents one
training pattern. The number of output neurons depends
upon the user’s requirement. The connections between the
input layer and the hidden layer as well as between the
hidden layer and the output layer are fully connected.
The training process of the CC4 neural network is
described as follows. The weight of the connection from the
input neuron i, where i = 1, 2, …, n, to the hidden neuron j,
where j = 1, 2, …, H, is denoted as w_ij, and X_ji is the value of
the ith element of the jth training pattern (or training vector).
The value of w_ij is determined by Eq. (1), where r is the user-
defined radius and s is the number of 1’s in the training
vector.

  w_ij = 1,            if X_ji = 1
  w_ij = −1,           if X_ji = 0
  w_ij = r − s + 1,    if i = n                           (1)

For the jth neuron of the hidden layer, its activity function is
defined as net_j = Σ_i X_i·w_ij. The transfer function f(net_j),
which is the output value of the jth neuron of the hidden
layer, is shown in Eq. (2).

  H_j = f(net_j) = 1,  if net_j > 0
                 = 0,  if net_j ≤ 0                       (2)
The weight of the connection from the hidden neuron j to the
output neuron k is denoted as u_jk, and Y_jk is the expected
value of the kth output neuron for the jth training pattern.
The weights of the hidden layer to the output layer are
assigned according to Eq. (3).

  u_jk = 1,    if Y_jk = 1
  u_jk = −1,   if Y_jk = 0                                (3)

The activity function of the output neuron k is defined as
net_k = Σ_j H_j·u_jk. The transfer function f(net_k), which is the
output value of the kth neuron of the output layer, is shown
in Eq. (4).

  Y_k = f(net_k) = 1,  if net_k > 0
                 = 0,  if net_k ≤ 0                       (4)
The CC4 network is fast, requires only one-time training,
and can determine which DTD fulfills the user’s demand.
However, because CC4 is a supervised network, the
expected output values must be given in the training
process. To be able to query with different radii, the CC4
network would have to be trained with all possible radii.
Therefore, it is impractical to use the CC4 network for
querying document schemata.
The 2-CC4 Algorithm. By modifying the CC4 network,
we designed a new neural network. The modified CC4
network, called 2-CC4, is a two-layered CC4 network. The
second half of the original CC4 network is cut off.
The architecture of the 2-CC4 network is shown in Fig. 10.
The input layer and output layer of the 2-CC4 network are
identical to the input layer and the hidden layer of the
original CC4 network, respectively. The connections
between the input layer and the output layer of the new
network is also fully connected.
The weights of the input layer to the output layer are
assigned according to Eq. (5). The activity function and the
transfer function of the jth output neuron are shown in Eqs.
(6) and (7), respectively. Because the radius can be
dynamically determined by the user, r is now associated
with the calculation of the activity function, as shown in
Eq. (6).
  w_ij = 1,        if X_ji = 1
  w_ij = −1,       if X_ji = 0
  w_ij = 1 − s,    if i = n                               (5)

  net_j = Σ_i X_i·w_ij + r                                (6)

  Y_j = f(net_j) = net_j,  if net_j > 0
                 = 0,      if net_j ≤ 0                   (7)
Fig. 9. Architecture of CC4 neural network.
Fig. 10. Architecture of 2-CC4 network.
Suppose that there are three training vectors: (1,1,1), (0,0,0),
and (0,0,1). We can then obtain the weights of the input
layer to the output layer as shown in Table 1. Providing
inputs of U1 whose input vector is (1,1,1) and r = 0, we
obtain that the output vector net_j^(U1) is (1,0,0), and this
indicates that the first training vector matches U1. Similarly,
if U2: {(1,0,1); r = 0} and U3: {(1,0,1); r = 1}, the output
vectors net_j^(U2) and net_j^(U3) are (0,0,0) and (1,0,1), respectively.
In other words, there is no training vector that matches U2.
However, if we tolerate a bias of one, both the first and the
third training vectors match (1,0,1).
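The worked example above can be reproduced with a short sketch of the 2-CC4 forward pass (Eqs. (5)-(7)); this is an illustrative snippet, not the authors' Java implementation:

```python
# Hedged sketch of 2-CC4: one output neuron per training vector,
# weights from Eq. (5), net input from Eq. (6), transfer from Eq. (7).
def train_weights(vectors):
    weights = []
    for x in vectors:
        s = sum(x)                       # number of 1s in the training vector
        w = [1 if xi == 1 else -1 for xi in x]
        w.append(1 - s)                  # bias weight (Eq. (5), i = n)
        weights.append(w)
    return weights

def query(weights, u, r):
    x = list(u) + [1]                    # bias input is constant 1
    outs = []
    for w in weights:
        net = sum(xi * wi for xi, wi in zip(x, w)) + r   # Eq. (6): r added at query time
        outs.append(net if net > 0 else 0)               # Eq. (7)
    return outs

W = train_weights([(1, 1, 1), (0, 0, 0), (0, 0, 1)])
print(query(W, (1, 1, 1), r=0))   # → [1, 0, 0]: U1 matches the first vector
print(query(W, (1, 0, 1), r=0))   # → [0, 0, 0]: U2 matches nothing
print(query(W, (1, 0, 1), r=1))   # → [1, 0, 1]: U3 matches the first and third
```

Because r enters only the query-time net input, a single trained network serves every radius, which is exactly what makes 2-CC4 practical where CC4 is not.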
The algorithm of 2-CC4 can be divided into two parts:
learning and testing as shown in Fig. 11. The learning
algorithm will be executed only when the DTD table (or the
term table) is updated. The testing algorithm will be
executed whenever a search query is requested. Assume
that the number of DTDs in the DTD table is m and the
number of terms in the term table is n. As shown in Fig. 11,
the time complexity of the testing algorithm is O(mn).
When n is smaller than p², 2-CC4 is better than HD.
3.3.2. Ranking module
The ranking module evaluates those DS_i with HD(U,
DS_i) ≤ r, denoted as DS_i^r, and sorts DS_i^r based on
HD(U, DS_i^r) and MDL(DS_i^r). The ranking module can
be divided into two parts according to their functions:
calculation and sorting.
Calculation. The length required for describing DS_i^r is
measured by MDL as described earlier. In the calculation
function, the MDL for each DS_i^r, denoted as MDL(DS_i^r),
is calculated.
Sorting. The sorting function sorts DS_i^r based on
HD(U, DS_i^r) and MDL(DS_i^r). The major sorting criterion
is HD(U, DS_i^r). Firstly, the sorting function sorts DS_i^r in
ascending order based on the value of HD(U, DS_i^r). For
those DS_i^r with identical HD(U, DS_i^r), the ordering is
then based on MDL(DS_i^r). If a user does not specify any
constraint on the logical structure of U, DS_i^r is sorted in
ascending order based on the value of MDL(DS_i^r);
otherwise, DS_i^r is sorted in ascending order based on
the value of |MDL(U) − MDL(DS_i^r)|.
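The sorting rules can be sketched as follows. The dictionary field names and the identical HD values are illustrative assumptions, with the MDL figures taken from the Section 3 example:

```python
# Hedged sketch of the ranking order: ascending HD first, then ascending
# MDL, or ascending |MDL(U) - MDL(DS)| when the query constrains the
# logical structure.
def rank(results, mdl_u=None):
    # results: list of dicts with 'hd' and 'mdl' for each DS_i^r
    def key(ds):
        mdl_key = ds["mdl"] if mdl_u is None else abs(mdl_u - ds["mdl"])
        return (ds["hd"], mdl_key)
    return sorted(results, key=key)

# The Section 3 example: U = (a, b*, c) with MDL(U) = 4 against three DTDs,
# all assumed to lie at the same Hamming distance from U.
dtds = [
    {"name": "DTD1", "hd": 0, "mdl": 3},   # (a, b, c)
    {"name": "DTD2", "hd": 0, "mdl": 4},   # (a, b*, c)
    {"name": "DTD3", "hd": 0, "mdl": 4},   # (a, b, c*)
]
print([d["name"] for d in rank(dtds, mdl_u=4)])  # → ['DTD2', 'DTD3', 'DTD1']
```

Without a structural constraint (mdl_u=None), the same call ranks the most concise schema, DTD1, first.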
3.3.3. Maintenance module
This module accepts requests from the management
interface to perform the requested tasks. All fundamental
maintenance functions such as insertion, deletion, and
modification are included in this module. The major tasks
accomplished by the maintenance module are to assure the
accuracy and consistency of the term table and the schema
tables.
4. Experiments
For interoperability, XDSearch was developed in Java.
Microsoft Access was used to store terms and schema
information except for DTD files. Jigsaw, a Java Servlet
server, was used as the servlet engine and the Web server. All
our experiments were executed on a PC with an Intel Celeron
300 processor and 256 Megabytes of main memory. The
operating system installed on the PC was Microsoft Windows
2000 Professional. All programs were developed with JDK 1.2.
4.1. Performance comparisons
Lu et al. (2001) defined 12 DTDs for Taiwan’s flower
distribution channel. The smallest and
largest numbers of elements and attributes in these DTDs
are 20 and 43, respectively, and the average number is 32.
Therefore, we design our experiments such that the number
of terms in each DTD is between 20 and 40. Each
Fig. 11. 2-CC4 algorithms.
Table 2
Training and search time for 2-CC4 and HD (in ms) with p = 30 and n = 100

No. of DTDs | 2-CC4 training time | Avg. training time per DTD | 2-CC4 search time | HD search time
1000 | 10 | 0.010 | 371 | 771
2000 | 20 | 0.010 | 811 | 1552
3000 | 30 | 0.010 | 1212 | 2323
4000 | 40 | 0.010 | 1612 | 3094
5000 | 50 | 0.010 | 2033 | 3865
6000 | 60 | 0.010 | 2414 | 4646
7000 | 71 | 0.010 | 2835 | 5417
8000 | 90 | 0.011 | 3225 | 6179
9000 | 100 | 0.011 | 3635 | 6970
10,000 | 110 | 0.011 | 4016 | 7751
Table 1
Realization of schema matching using 2-CC4

Training vector | s | Weights | net(U1)_j | net(U2)_j | net(U3)_j
(1, 1, 1) | 3 | (1, 1, 1, −2) | 1 | 0 | 1
(0, 0, 0) | 0 | (−1, −1, −1, 1) | 0 | 0 | 0
(0, 0, 1) | 1 | (−1, −1, 1, 0) | 0 | 0 | 1
E.J.-L. Lu, Y.-M. Jung / Expert Systems with Applications 24 (2003) 213–224
experiment is conducted in various configurations. The
numbers of terms in the term table are 50, 100, 150, and 200.
The numbers of DTDs in the DTD table are within the range
between 1000 and 10,000. Because the search time for both
2-CC4 and HD is close to zero milliseconds (ms) when there
is only one request, the execution time of each test is
measured by running one request 100 times. The contents of
all DTDs and the user request U are randomly generated
from the terms in the term table.
In the first experiment, all DTDs in the DTD table and the
user request are generated at random. The number of terms
for all DTDs in this experiment is 30 (i.e. p ¼ 30). Table 2
lists the experimental results when the length of the term
table is 100 (i.e. n ¼ 100). In the table, column one indicates
the number of DTDs in the DTD table which is denoted as
m, column two is the training time required for 2-CC4, and
column three is the average training time for each DTD. As
shown in the table, the average training time for each DTD
is about 0.01 ms, and the training time grows linearly as m
increases. Also as shown in the table, the search time of
2-CC4 is shorter than that of HD. When m increases while both n and p
remain fixed, the search times of both 2-CC4 and HD
increase linearly. This is because the time complexities
of both 2-CC4 and HD are directly proportional to m.
Additionally, the training time of 2-CC4 is quite short. This
makes it possible to retrain the 2-CC4 network online
without interrupting services.
In the second experiment, we intend to measure the
influence on search time when n is varied. Table 3 shows the
experimental results when p ¼ 30. From the table, it is easy
to tell that the search time of 2-CC4 increases as n grows.
However, the increases in the search time of HD are
relatively insignificant as n is varied. For m ¼ 5000, as n
grows from 50 to 100, 150, and 200, the search time for
2-CC4 increases by about 41% ((2193 − 1553)/1553), 79%
((2774 − 1553)/1553), and 101% ((3115 − 1553)/1553),
respectively. Similarly, for HD, the search time increases by
only about 6, 9, and 11%. This is because the time complexity of
HD has nothing to do with n.
The third experiment is designed to find out the influence
on search time when p is varied. Table 4 lists the
experimental results when n ¼ 100. For m ¼ 5000, as p
grows from 20 to 25, 30, 35, and 40, the search time for HD
increases by about 32% ((11,597 − 8793)/8793), 79%
((15,613 − 8793)/8793), 133% ((20,460 − 8793)/8793),
and 189% ((25,416 − 8793)/8793), respectively. Similarly,
for 2-CC4, the search time increases by about 20, 29, 43,
and 53%. Therefore, it is obvious that the search time
required by HD increases much faster than that required by
2-CC4 as p grows.
All the experiments described so far represent the worst case
because we assume that all DTDs in the DTD table have an
identical number of terms (i.e. p = 30 for all experiments
conducted so far). However, the DTDs in the DTD table in
general do not have an identical number of terms. Therefore, in
the fourth and fifth experiments, the contents and the numbers
of terms of all DTDs in the DTD table are generated randomly. In the
fourth experiment, we intend to find out the influence on
search time when the radius is varied. Table 5 shows the
experimental results when m = 1000 and n = 100. According
to the table, it is obvious that 2-CC4 is at least 3.8 times
faster than HD.
Table 6 shows the experimental results of the fifth
experiment when n = 100 and the number of terms of U is
30. With the most commonly chosen radii (i.e. r is either
0, 1, 2, or 3) and the number of terms of U set to 30
(which is the average number of terms found in Taiwan’s
flower distribution channel), 2-CC4 is still faster than HD.
Table 3
Search time for 2-CC4 and HD (in ms) with p = 30

No. of DTDs | n = 50 (2-CC4 / HD) | n = 100 (2-CC4 / HD) | n = 150 (2-CC4 / HD) | n = 200 (2-CC4 / HD)
1000 | 290 / 2925 | 410 / 3085 | 511 / 3165 | 591 / 3185
5000 | 1553 / 14,571 | 2193 / 15,512 | 2774 / 15,913 | 3115 / 16,183
10,000 | 3055 / 29,562 | 4357 / 32,557 | 5568 / 33,358 | 6279 / 33,549
Table 4
Search time for 2-CC4 and HD (in ms) with n = 100

No. of DTDs | p = 20 (2-CC4 / HD) | p = 25 (2-CC4 / HD) | p = 30 (2-CC4 / HD) | p = 35 (2-CC4 / HD) | p = 40 (2-CC4 / HD)
1000 | 341 / 1743 | 380 / 2304 | 410 / 3064 | 491 / 4016 | 521 / 5027
5000 | 1762 / 8793 | 2123 / 11,597 | 2273 / 15,613 | 2523 / 20,460 | 2694 / 25,416
10,000 | 3535 / 17,635 | 4216 / 24,375 | 4526 / 32,567 | 4978 / 41,520 | 5438 / 51,384
In fact, all experimental results indicate that 2-CC4 is far
superior to HD.
4.2. Discussions
From the previous discussions, it is shown that XDSearch has the following properties.
Precision. XDSearch uses Hamming distance to measure
the similarity between any two document schemata. Users
are allowed to specify the value of r. With r = 0, XDSearch
guarantees 100% accuracy. When r > 0 is specified by
users, more similar document schemata will be retrieved.
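As an illustration, the radius-based retrieval can be sketched over binary term vectors. This is our own Python sketch, not the paper's implementation; the five-term vocabulary, the DTD names, and their vectors are all hypothetical.

```python
def hamming(u, v):
    # Number of term positions where the two bit vectors differ.
    return sum(a != b for a, b in zip(u, v))

# Hypothetical DTD bit vectors over a 5-term vocabulary.
dtds = {"A": (1, 0, 1, 1, 0), "B": (1, 0, 1, 0, 0), "C": (0, 1, 0, 1, 1)}
u = (1, 0, 1, 1, 0)  # user request U

print([k for k, v in dtds.items() if hamming(u, v) <= 0])  # ['A']  exact match only
print([k for k, v in dtds.items() if hamming(u, v) <= 1])  # ['A', 'B']
```

Raising r trades precision for recall: r = 0 returns only the exact match, while r = 1 also admits the schema differing in a single term.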
Concision. MDL is deployed as the ranking algorithm for
XDSearch. If a user only knows what elements should be
included in a schema but has no idea about the structure of
the schema, DTDs with smaller MDL in the repository
should be ranked higher than other DTDs with larger MDL.
Additionally, if the user specifies constraints (such as ‘*’,
‘+’, ‘?’, ‘|’, etc.) on the document structure, DTDs with
smaller |MDL(U) − MDL(DS_i^r)| should be ranked higher
than other DTDs with larger |MDL(U) − MDL(DS_i^r)|. For a
query like the one shown in Fig. 6, the search results are
shown in Fig. 12. Because the MDL of the query is 11, the
results are sorted in ascending order of |MDL(U) − MDL(DS_i^r)|.
Consideration of the synonymy and polysemy problem.
The concept of ontology was incorporated into the design of
XDSearch and was implemented in the term table and the
schema tables. Therefore, document schemata for purchase
orders, which were defined as either <Purchase_Order>, <Orders>, or <PO> in DTDs, can all be found. An example is
Table 6
Search time (in ms) when all DTDs are generated with a random number of terms and the length of the term table is 100

No. of DTDs | r = 0 (2-CC4 / HD) | r = 1 (2-CC4 / HD) | r = 2 (2-CC4 / HD) | r = 3 (2-CC4 / HD)
1000 | 40 / 260 | 121 / 731 | 190 / 1241 | 290 / 2043
2000 | 141 / 560 | 351 / 1622 | 480 / 2644 | 661 / 4196
3000 | 231 / 781 | 530 / 2533 | 721 / 3996 | 1001 / 6359
4000 | 310 / 1152 | 691 / 3194 | 941 / 5098 | 1312 / 8242
5000 | 411 / 1402 | 871 / 4076 | 1182 / 6369 | 1642 / 10,295
6000 | 501 / 1803 | 1051 / 4827 | 1423 / 7691 | 2023 / 12,659
7000 | 521 / 1853 | 1192 / 5578 | 1642 / 8953 | 2323 / 14,662
8000 | 661 / 2403 | 1422 / 6680 | 1933 / 10,575 | 2684 / 16,995
9000 | 721 / 2644 | 1552 / 7391 | 2143 / 11,797 | 3024 / 19,168
10,000 | 801 / 2955 | 1743 / 8302 | 2353 / 13,048 | 3305 / 21,290
Table 5
Search time for 2-CC4 and HD (in ms) with m = 1000 and n = 100

Radius | U terms = 20 (2-CC4 / HD) | 25 (2-CC4 / HD) | 30 (2-CC4 / HD) | 35 (2-CC4 / HD) | 40 (2-CC4 / HD)
0 | 50 / 191 | 40 / 281 | 40 / 260 | 60 / 361 | 51 / 350
1 | 111 / 510 | 130 / 651 | 121 / 731 | 160 / 991 | 181 / 1241
2 | 160 / 801 | 180 / 962 | 190 / 1241 | 241 / 1712 | 250 / 1903
3 | 200 / 992 | 230 / 1302 | 241 / 1682 | 301 / 2303 | 320 / 2614
4 | 240 / 1182 | 260 / 1553 | 290 / 2043 | 340 / 2774 | 361 / 3154
5 | 271 / 1282 | 290 / 1733 | 330 / 2324 | 371 / 3065 | 411 / 3645
6 | 270 / 1412 | 311 / 1893 | 351 / 2614 | 400 / 3404 | 430 / 3996
7 | 290 / 1483 | 330 / 2053 | 370 / 2714 | 420 / 3595 | 461 / 4186
8 | 290 / 1530 | 351 / 2093 | 390 / 2865 | 440 / 3816 | 470 / 4467
9 | 301 / 1552 | 351 / 2193 | 390 / 2935 | 451 / 3915 | 491 / 4626
Fig. 12. Example search results.
shown in Fig. 12. Moreover, as shown in Fig. 13, the elements
‘No’ and ‘Date’, for example, match the ‘Doc_No’ and ‘DateTime’
elements of the example query, respectively. Although the
term table and the schema tables are currently defined in
relational databases, they can be replaced by a native XML
database such as Software AG’s Tamino or eXcelon’s XIS.
Speed. As shown in the experimental results, XDSearch
is really fast. In the worst case, each U needs about 4.44 ms
using the 2-CC4 algorithm when the number of known DTDs is
1000 and the length of the term table is 100. Also, as shown in
the first experiment, the training time is less than 110 ms,
which makes it possible to retrain the 2-CC4 network
online without interrupting services.
Maintainability. We designed XDSearch with modularity
in mind such that any change to a module can be made
without affecting other modules. Moreover, we developed
several utilities to assist with system administration tasks.
5. Conclusions and future works
In facilitating the development of the applications of
electronic commerce, great efforts have been dedicated to
the development of XML repositories. One of the core
features of XML repositories is to provide users an
intelligent search utility to locate reusable entities such as
document schemata. However, none of the existing search engines
is suitable for searching document schemata. In this paper,
we proposed an efficient search engine for XML document
schemata. This search engine, called XDSearch, provides
speedy and accurate search for developers to locate
document schemata. Additionally, the development of
XDSearch plus document schema extractors such as
XTRACT (Garofalakis et al., 2000) and DTD-Miner (Moh
et al., 2000) helps research in areas such as web mining,
web-based databases, etc.
Currently, this model has one major drawback: newly
collected XML schemata cannot be translated and placed in
the DTD table automatically. This translation requires
human intervention to correctly map each element to a
specific term in the ontology, which may require a huge effort in
the generation of the DTD table. However, this is a one-time
task. As more and more DTDs are collected, updating the
DTDs becomes a minor job for the system managers.
Acknowledgements
This research was supported in part by the Research
Board of Chaoyang University of Technology, Taiwan,
ROC, under contract number: 89-A016.
References
Bourret, R., Bornhovd, C., & Buchmann, A. (2000). A generic load/extract
utility for data transfer between XML documents and relational
databases. Proceedings of the Second International Workshop on
Advanced Issues of E-Commerce and Web-based Information Systems
(pp. 134–143).
Chandrasekaran, B., Josephson, J. R., & Richard Benjamins, V. (1999).
What are ontologies, and why do we need them? IEEE Intelligent
Systems, 14, 20–26.
Ciancarini, P., Vitali, F., & Mascolo, C. (1999). Managing complex
documents over the WWW: a case study for XML. IEEE Transactions
on Knowledge and Data Engineering, 11, 629–638.
Decker, S., Melnik, S., Van Harmelen, F., Fensel, D., Klein, M., Broekstra,
J., Erdmann, M., & Horrocks, I. (2000). The semantic web: the roles of
XML and RDF. IEEE Internet Computing, 63–74.
Fernandez, M., Tan, W.-C., & Suciu, D. (2000). SilkRoute: trading between
relations and XML. Computer Networks, 33, 723–745.
Filman, R. E., & Pant, S. (1998). Search the Internet. IEEE Internet
Computing, 21–23.
Garofalakis, M., Gionis, A., Rastogi, R., Seshadri, S., & Shim, K. (2000).
XTRACT: a system for extracting document type descriptors from
XML documents. SIGMOD, 29(2), 165–176.
Gudivada, V. N., Raghavan, V. V., Grosky, W. I., & Kasanagottu, R.
(1997). Information retrieval on the world wide web. IEEE Internet
Computing, 58–68.
Iacovou, C. L., Benbasat, I., & Dexter, A. (1995). Electronic data
interchange and small organizations: adoption and impact of technol-
ogy. MIS Quarterly, 465–485.
Kak, S. C. (1993). New training algorithm in feedforward neural networks.
In P. P. Wang (Ed.), Advances in fuzzy theory and technologies. First
International Conference on Fuzzy Theory and Technology, Durham,
NC, October 1992, Durham, NC: Bookwright Press.
Kobayashi, M., & Takeda, K. (2000). Information retrieval on the web.
ACM Computing Surveys, 32, 144–173.
Kotok, A. (1999). White Paper on Global XML Repositories for XML/EDI.
XML/EDI Group.
Kotok, A. (2002). Government and finance industry urge caution on XML.
XML.com.
Kotsakis, E. (2002). XSD: a hierarchical access method for indexing XML
schemata. Knowledge and Information Systems, 4, 168–201.
Kotsakis, E., & Bohm, K. (2000). XML schema directory: a data structure
for XML data processing. Proceedings of the First International
Conference on Web Information Systems Engineering (pp. 62–69).
Lu, E. J.-L., Chou, S., & Tsai, R.-H. (2001). An empirical study of XML/
EDI. Journal of Systems and Software, 58, 269–277.
Lu, E. J.-L., & Hwang, R.-J. (2001). A distributed EDI model. Journal of
Systems and Software, 56(1), 1–7.
Moh, C.-H., Lim, E.-P., & Ng, W.-K. (2000). DTD-Miner: a tool for
mining DTD from XML documents. Proceedings of the Second Inter-
national Workshop on Advanced Issues of E-Commerce and Web-based
Information Systems (pp. 144–151).
Fig. 13. The DTD for P_Order.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14,
465–471.
Shanmugasundaram, J., Tufte, K., & He, G. (1999). Relational databases for
querying XML documents: limitations and opportunities. Proceedings
of the 25th VLDB Conference (pp. 302–314).
Shu, B., & Kak, S. (1999). A neural network-based intelligent metasearch
engine. Information Sciences, 120, 1–11.
Szuprowicz, B. O. (1997). Search engine technologies for the
world wide web and Internet. Computer Technology Research
Corp, USA.
Tang, K.-W., & Kak, S. C. (1998). A new corner classification
approach to neural network training. Circuits, Systems, and Signal
Processing, 17(4), 459–469.
Webber, D. R. R. (1998). Introducing XML/EDI frameworks. Electronic
Markets, 8, 38–41.
Zhang, K., Statman, R., & Shasha, D. (1992). On the editing distance between
unordered labeled trees. Information Processing Letters, 42, 133–139.
Zhang, K., & Shasha, D. (1989). Simple fast algorithms for the editing
distance between trees and related problems. SIAM Journal of
Computing, 18(6), 1245–1262.