Induction of Integrated View for XML Data with ... · Induction of Integrated View for XML Data with HeterogeneousDTDs Euna Jeong Computer Science and Information Engineering National

Induction of Integrated View for XML Datawith Heterogeneous DTDs

Euna JeongComputer Science and Information Engineering

National Taiwan University#I Roosevelt Rd. Sec. 4, Taipei, Taiwan

[email protected]

ABSTRACTThis paper proposes a novel approach to integrating het-erogeneous XML DTDs. With this approach, an informa-tion agent can be easily extended to integrate heterogeneousXML-based contents and perform federated search. Basedon a tree grammar inference technique, this approach derivesan integrated view of XML DTDs in an information integra-tion framework. The derivation takes advantages of nam-ing and structural similarities among DTDs in similar do-mains. The complete approach consists of three main steps.(1) DTD clustering clusters DTDs in similar domains intoclasses. (2) Schema learning applies a tree grammar infer-ence technique to generate a set of tree grammar rules fromthe DTDs in a class from the previous step. (3) Mmamiza-tion optimizes the rules generated in the previous step andtransforms them into an integrated view. We have imple-mented the proposed approach into a system called DEEPand tested the system on artificial and real domains. Theexperimental results reveal that this system can effectivelyand efficiently integrate radically different DTDs.

KeywordsXML DTD, Semistructured data, Federated search, Dis-tributed databases, Intelligent Agent, Mark-up schemes

1. INTRODUCTIONSoftware agents [7, 131 and integration systems of het-

erogeneous databases [3, 11, 12, 61 are widely studied anddeveloped to allow users to find, collect, filter and manageinformation sources spread on the Internet. The design con-cerns of these systems vary for different domains, but allshare a common need for a layer of an integrated view andsource descriptions to seamlessly integrate heterogeneous in-formation sources. The integrated view must be designedfor each application domain. The source descriptions areneeded to map source schemas to the integrated view. How-

Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copiesare not made or distributed for profit or commercial advantage and thatcopies bear this notice and the full citation on the first page. To copyotherwise, or republish, to post on servers or to redistribute to lists,requires prior specific permission and/or a fee.CIKM'OI, November 5-10,2OOl, Atlanta, Georgia, USA.Copyright 2001 ACM l-581 13-436-3/01/0011...$5.00.

Chun-Nan HsuInstitute of Information Science

Academia SinicaNankang, Taipei, Taiwan

[email protected]

ever, previous work in information integration requires bothof them be constructed manually in a laborious and time-consuming manner. Constructing integrated view becomesone of the major bottlenecks for building large-scaled infor-mation integration agents(IIAs).

The approach presented in this paper is based on theframework of previous work in information integration. Inparticular, this approach addresses the problem of automaticderivation of the integrated view for XML DTDs(DocumentType Definition) [l]. Although XML is becoming an indus-trial standard for exchanging data on the Internet, when themaintenance of the information sources is independent of theintegrator, it is difficult and sometimes impossible to havesuch a common DTD. This paper reports our preliminaryresult for resolving this problem.

1.1 XML Information IntegrationFigure 1 shows the diagram of an XML information inte-

gration agent. When the user submits a request to the agentthrough a user interface, the request will be translated intoan XML-QL [4] query based on an integrated view, whichis a query template that helps formulate queries. Given thetranslated query, the query decomposer transforms the queryinto a set of subqueries against each integrated XML infor-mation source based on source descriptions. Finally, queryexecutor issues the subqueries to each information source,combines the answers, and returns the result as an XMLdocument to the user.

This paper describes how to automatically derive the in-tegrated view and source descriptions by a view inferencesystem in Figure 1. The derivation is conducted offline be-fore the IIA can provide service. The view inference systemis to automatically discover the association between closelyrelated DTDs, identify elements with similar underlying se-mantics, and generate an integrated view that covers thesesemantically similar elements.

EXAMPLE 1. Suppose we have LOUT DTDs as shown inTable 1. They are extracted from published papers and docu-ments: (a) and (c) from [d], (b) from [8], and (d) from [2].This set of DTDs will serve as the example input and willbe used to explain OUT view integration approach throughoutthe paper. Table 2 shows an integrated view that may be de-rived from two DTDs in Table l(a) and (b). Reformulatedsubqueries for these two DTDs are shown in Table 3. In thiscase, we assume that there are two related XML documents,COOKBOOK.xml with COOKBOOK DTD and BIB.xml with BIBD T D . 0

151

Figure 1: XML information integration agent

(a) COOKBOOK DTDI (!ELBMENT cookbook (title, author+, year, isbn, publisher))2 (!ELEMENT author (authorname))3 (!ELEMF.NT authorname (firstname, lastname))4 (!ELEnENT publisher (name, address))(b) BIB DTD

6 (!ELFXBNT bib (title, author+, publisher, price))6 (!ATTLIST bib year CDATA #REQUIRED)7 (!ELEMENT author (last, first))8 (!ELEH!XT publisher (name, email))(c) MOVIE DTD

9 (!ELEnBNT movie (title, year, director*, featuring, genre))10 (!ELEH!dNT featuring (creditactor*, actors*))

(d) LIST DTD11 (!ELEKENT list (movie*))12 (!ELEMENT movie (title, year, directedby, genres, featuring))13 (!BLEMENT directed-by (director*))14 (!ELEnBNT genraa (genre*))16 (!ELEEIENT actor (firstname, lastname))

Table 1: Example DTDs

1.2 Illustrative ExamplesXML data is an instance of semistructured data. In XML,

each element represents a logical component of a document.Each element has a tag to indicate its semantics. Elementscan be atomic (i.e., character strings) or contain other sub-elements, which allow us to encode structural semantics inan application domain. The definitions of the role of tagsare formally given in a DTD. A DTD establishes a set ofconstraints for XML documents. XML with a DTD is self-descriptive and provides a semistructured data model.

Distinct DTDs in a similar application domain may usedifferent labels for the same concept, or use the same labelfor different concepts. The same concepts can also be definedwith different structures. In spite of the possible diversity, ifthe underlying concepts of a set of DTDs are closely related,though they are created by different authors, they will revealstructural and naming similarities.

Table 4 shows several example cases that our approachcan handle. The elements of pairs of input DTDs are shownin the second column and their corresponding subsumptionrelationships are illustrated as trees in the third column. Foreach case, the view inference system can integrate the pairof trees into a single tree as shown in the fourth column.

We will describe how the view inference system resolvesthese csaes in the following Sections. The view inference sys-tem consists of three major components shown in Figure 2.A brief description of each module is as follows:

DTD cluster takes a collection of source DTDs as input andclusters similar DTDs into DTD classes. We mergesimilar element types as a preprocessing step to speedup the clustering. The DTD cluster uses hierarchical

1 UBEBE <cookbooklbib>2 <title>$title</>3 <*>4 <authorIauthorname>5 <firstnamaIfirst>$first</>6 <lastnamaIlast>$last</>7 a8 c/>9 <publisher>10 <address>$address</>11 <WU0~>SIW!lC2</>12 <amail>$amail</>13 us14 <ye?ir>Eyear</>16 <isbn>$isbn</>16 <price>Sprice</>17 +18 CONSTRUCT result etatements

Table 2: Integrated view for BOOK domain

(a) For COOKBOOK.xml (b) For BIB.rmllWB!aE WERE2 <cookbook> <bib year = $year>3 <title>$title</> <title>Stitle</>4 <author> <*>5 <authorname> <author>6 <firstname>$first</> <first>$first</>7 <lastname>$last</> <last>$last</>8 </> x/s9 c/a (0

10 <publisher> <publisher>11 <address>$address</> <email>$email</>12 <name>tiame </> <nalua>bame </>13 </>14 cyear>$yearc/>15 <isbn>$isbn</>16 </a I N “COOKBOOK.xml”17 CONSTRUCT result statements

x/s<price>$price</>

</> I N “BIB.xml”CONSTRUCT W8dt 8t&?f?h?flt8

Table 3: Reformulated queries for BOOK domain

agglomerative clustering method [16] to cluster DTDs.We define a generic tree matching method [14] to com-pute the distance between DTDs for clustering.

Schema learner infers the general rules describing source DTDsin each DTD class. Our approach is based on a treegrammar inference technique [9], which takes a finitesample of trees as input and induces a tree grammarthat describes the sample trees. In our approach, theinput trees correspond to XML DTDs and the treegrammar corresponds to an integrated schema.

Minimizer optimizes the learned tree grammar rules. Thelearned rules are adjusted to fit the characteristics ofDTD and transformed into the integrated view as wellas the source descriptions to be used in IIAs.

-_

: ; - - - w ( r ) - -

. _ - *A dkcfim ofXML DTDs

Result set of integrated dlsmas,

:‘\~,:*------------------

,- ,‘.__,

Process im ____, D&flow _________.

Figure 2: Diagram of the view inference system

The reminder of the paper is organized as follows. Sec-tion 2 defines the problem of view inference for XML DTDs.

1 5 2

I Example DTDs Input output

egl:<ELEMENT author (firstname,lastname)> author author .

eg2:<ELEMENT author authorname> I I1 <ELEMENT authorname (firstname,lastname)> firstname lastname authorname authorlauthomame

firstname lastname f i rs tname lastname

The same concept may have the same element name, but has different hierarchical structures.

egl:<ELEMENT publisher (name,email)> publisher author authorlpublisher2 eg2:<ELEMENT author (name,email)>

n a m e email n a m e email n a m e email

The two different concepts have the same sub-elements list.

egl:<ELEMENT featuring #PCDATA> featuring featuring featuring

3 eg2:cELEMENT featuring (creditactor,actor)>credit-actor actor credit-actor actor

The same concept may have the same element name, but has different sub-elements list in different DTDs.

egl:<ELEMENT book (year, . ..)> b o o k b i beg2:IATTLIST bib year #CDATA>

book(bib

4 year . . . year . . . year . . .

The same concept can be defined as an element in one DTD, and an attribute in another DTD. Because attribute names arehandled the same as element names, the tree structures are equal. The difference will be described in Section 2.

Table 4: Several example cases that our approach can handle: Emmple DTDs: pairs of input DTDs; Input: treestructure of DTDs; Output: tree structure of integrated view

Section 3 describes the details of our view inference ap-proach. The approach has been implemented into a systemcalled DEEP and tested in several domains. We report theexperimental results in Section 4. Section 5 reviews relatedwork. Finally, Section 6 draws conclusions.

2. A MODEL OF XML VIEW INFERENCE

2.1 Types and DTD ClassWe model a DTD as a labeled, directed tree. The nodes

in the tree represent objects and are labeled with an elementor attribute name. We assume that each tree has a distin-guished root object, and all objects are reachable from theroot. Each node in the tree has its own type. The type of anobject is defined by its label and its immediately adjacentchild nodes (objects). XML attributes are treated in thesame way as element tags.

Each type is denoted by tiy where i is the type id. Thetype id of all leaf nodes (i.e., #PCDATA type) is 0, whichimplies that the value of atomic objects is ignored in thispaper. Each type has a type definition of the form [label:Type(label)], where label is a regular expression over a finiteset N of names and Type(labe1) is either #PCDATA for leafnodes or a regular expression over N with type id as thesubscription.

A DTD schema consists of a sequence of type definitions.

EXAMPLE 2. Given the set 2, of Source DTDs in Table 1,the following type set F can be constructed. The underlinedlabels, such os year0 of ts, correspond to XML attributes. Inthis example, tl - t4 comprise type definitions for COOKBOOKDTD, ts - ty for BIB DTD, ts - tg for MOVIE DTD, andthe others for LIST DTD. For brevity, the #PCDATA typeelements are abbreviated. 0TV = [ c o o k b o o k : (tit&, ( a u t h o r s ) + , yew-o, isbno, publiaher~)];ts = [ a u t h o r : (authornames)];ts = [authomwme : (firstnamen, lo s tname , , ) ] ;tl = [ p u b l i s h e r : (name, addresso)];ts = [ b i b : (tit&, ( a u t h o r s ) + , publisher7, prima, yearo)];

ts = [ a u t h o r : (jirsto, la&o)];t7 = [ p u b l i s h e r : (nameo, em&)];ts = [ m o v i e : (Wee, yeare, (directore)*, featwings, genree)];tg = Veaturing : ((credit-actoro)*, (actorl4)*)];tjo = [ l is t : (mouiell)*];tll = [ m o v i e : (tit&, ~ear,~, directed-byla, genres~s, featuringo)];tjr = [ d i r e c t e d - b y : (dimtwo)*];t,s = [ g e n r e s : (genreo)*];t24 = [ a c t o r : (firstnameo, lastnameo)];

A DTD class is a set of similar DTD schemas. Since wemake no assumption that the input DTDs must describethe same domain, the input DTDs may describe drasticallydifferent domains. Therefore, DTDs need to be clusteredinto classes of similar domains so that it is meaningful forthe system to derive an integrated view. This task is the goalof the DTD clustering in our approach. Given the DTDs inTable 1, the output should be two classes. One class containsCOOKBOOK and BIB DTDs, and the other contains MOVIE andLIST DTDs.

2.2 From Tree Grammar to Integrated SchemaGiven a DTD class, schema learner generates a tree au-

tomaton that describes DTDs (as trees) in the input DTDclass. The corresponding tree grammar of the tree automa-ton generates an infinite language that expresses all inputtrees in the DTD class. The learned rules will then be opti-mized by minimizer, which will insert, the wildcard symbol$ to the tree grammar for common labels and subtrees.

A tree grammar M = (&, N, 6, F) is a quadruple: a seeQ of states, a set Jz/ of labels, the state-transition function6, and a set F of final states, T E &.

Since M reduces frontier nodes first and the root nodelast, it is also known as a frontier-to-root automaton. Wedefine 6, which operates on trees of depth 0 or 1 as follows:

l a tree which consists of single node labeled by a (i.e.,tree primitive) belongs to a state q, denoted as S, = q.

l a tree with m subtrees ql, q2,. . . , qm and the root, la-beled by a belongs to a state q, denoted as 6,(q1, q2,. . . , pm)

= q, where a is a regular expression over N U {$} and

1 5 3

Ql,QZ,... ,Pm are also regular expressions over eachsubtree. The wildcard $ matches any label.

For example, a state transition function &irstn- = q1 gen-erates a single node tree with label firstname and belongsto state 91. If a tree has a root node labeled by author andtwo child subtrees, q1 and qa, and it belongs to state qe, thenit is represented as Sauthor(ql, qz) = qe.

An integrated schema is a DTD schema generated fromoptimized state-transition functions of a tree grammar. Wedefine a mapping function 6 from a state-transition functionto a type definition in a DTD schema as follows:

l if 6, = q is a function for starting states, then e(& = q)maps to to type with character value.

l if&(ql,q2,... , qm) = q is a function for internal states,then e(&(ql, 42,. . . , qm) = q) is a type t whose label isa and !&pe(a) is a regular expression over labels of qiwith type id of qi as the subscription, where 1 5 i 5 m.

2.3 Integrated View and Source DescriptionBefore we proceed to describe the integrated view, we

provide formal criteria against which the integrated schemashould be evaluated. The first one, called coverage, guaran-tees that the integrated schema derives all DTD schemas inthe input DTD class.

DEFINITION 1. Given a regular expression R over somealphabet C, let L(R) denote the regular language defined byR. We assume that L(#PCDATA) is a subset of any regularlanguage. Let D, = {. . . , t,, . . .} be a DTD schema in theDTD cZass C, D, = {. . . , t,, . . . ) be the integrated schemafor C, and 1, and 1, are labels oft, and t,, respectively. Wesay that D, covers C if there exists a mapping u from typesof each D, in C to types in D, such that:

l if L(Z,) C L(Z,) and L(Tkpe(l,)) C L(Tkpe(15)), thenCT(&) = t,.

l if all types in ~(0~) = {.., u(&), ..} can construct aDTD schema. 0

Secondly, we require that the integrated schema be com-pact. Intuitively, the integrated schema should be the small-est type set that covers the input DTD class so that similartypes in different DTDs should be mapped to the same typein the integrated schema.

DEFINITION 2. An integrated schema D, is more compactthan an integrated schema 0: if D, covers given DTD classwhich is covered by D’, and has a smaller type set than 0:.An integrated schema D, is the most compact one for a givenDTD class if there is no integrated schema 0: more compactthan D,. 0

An integrated view is an XML-QL query template trans-formed from the most compact integrated schema. Differentorderings of the body statements are considered equivalent.We define a mapping Y from a type in the integrated schemato a body statement as follows:

l if t = [label : (tl , . . . , tm)] is a type with m sub-elements, then u(t) maps to “(1abel)pl . . .p,,,(/)” where

,pln are statements of types tl, . . . , t,, resp.l 2;’ 1 [label : (#PCDATA)] is a type with no sub-

element, then u(t) maps to “(label) $var (/)“.In the case of complex element types, the type of the label(i.e., ‘Pype(label)) is mapped to a sequence of statements.But in the case of atomic element types, it is mapped to

a variable. If the label is a wildcard (‘?I?‘), we denote thestatement as “( *)“, which means it matches any element.The result of the query is defined by the result statements,following the specification of XML-QL. Note that variablesin XML-QL are preceded by %” symbol.

Given a query based on the integrated view, the systemmust translate it into subqueries against the related sourceDTDs with assigned URLs. The source description for eachDTD is generated from types in o-l (DC) (the inverse func-tion of u) in Definition 1. The mapping is the same as ufrom a type in each DTD schema to a body statement withthe following additional condition:

l if t = [label : (tl,. . , tm)] is a type with m sub-elements and sub-elements ti+l, . . . , t,, 0 < i+l 5 m,are types corresponding to XML attributes, then u(t)_-maps to “( labelpi+ . . .&)pl . . . pi(/)“, where pi+r, . . . 1 ,p,are statements of the form: “label, = $var” and label,is the label of corresponding type.

For example, ts = [bib : (titlee, year-o)] will be transformedto a pattern “(bib year= $year)(tatle) $title (/)(I)“.

Finally, we give a model of XML view inference problem.View inference is a function from a set of DTD schemas(ina DTD class) to an integrated view derived from the mostcompact integrated schema that covers the given DTD schemas,and source descriptions for the given DTD schemas.

3. VIEW INFERENCE SYSTEMIn this section, we describe the details of our approach to

XML view inference in the data-flow order.

3.1 RenamerRenamer as a preprocessing step is an optional module

that requires human intervention. The internal nodes inXML DTDs offer both naming and structural hints for thesystem to associate related elements in different DTDs, whileleaf nodes offer very limited information to the system. Re-namer module is designed to allow human users to pro-vide additional hints for the system to associate related leafnodes. In the case of leaf nodes, the element name can berenamed manually to another internal/leaf element namein different DTDs so that they will be considered to sharethe same underlying concept. For example, in Example 2,element name first in ts may be changed to f irstname.

3.2 DTD CLUSTERThis module contains three steps to complete clustering a

collection of source DTDs into several DTD classes. As thefirst step, we merge element types. The purpose is to reducethe number of element types as well as the distance betweenDTD trees, so that DTDs of similar domains can have abetter chance to be clustered together. We also describethe method of computing distance between DTD trees andclustering method in detail.

3.2. I Type MergingThe type merging method is a preprocessing step before

clustering. First, we consider types that have the same label.If ti and tl in two different DTDs have the same label, thenwe compare their sub-element lists. Let Subi/Subj be the setof k/Q’s sub-elements, respectively. If Subi and Subj havecommon elements, then these two types are merged and theresult type definition consists of the common label and the

1 5 4

union of Sub, and Subj. If one of two types is an atomic type,it will be merged to another. Case 3 in Table 4 is resolved inthis step. Occurrence indicators (7, *, +) in type definitionsare merged according to the precedence order: 7 < * < +.For example, author+ and author* are merged to author*.

EXAMPLE 3. Given the type set l- in Ezample 2, the merged*me set F is oenemted (IS follows:

and types tr ond t7 to s,. An atomic type tfeaturing :#PCDATA] is merged to SB. Based on the merged types,we mod& the input DTD schemas. Let r”’ be the result-ing merged type set. For each root type (mot element) inr”‘, the set of all reachable types will constitute the modifiedDTD schema. Continuously in Table 1, all DTD schemasare wdefined according to the merged type set T”‘. For euample, nO”IE DTD schema will be re&Jined with Jive types,ST, sa, sm, 611, and s,e. 0

3.2.2 Tree MatrixIn order to cluster a collection of source DTDs, a dis-

tame measure quantifying the degree of association betweenDTDs is necessary. We use a generic matching method be-tween two trees proposed by Lu [14] as the distance measure.Basically, the idea ia to compute the distance by calculatingthe minimum number of modifications required to transformthe input tree into a reference tree. We extend Lu’s algo-rithm to compute the distance between two labeled trees.

There are five types of operations, (l)parent-child split-ting, (2)sibling splitting, (3)parent-child merging, (4)siblingmerging, and (5)substituting, aa illustrated in Figure 3.

Figure 3: The five types of operations on tree nodes

To explain the algorithm clearly, we need to introducesome definitions and terminologies us follows (following [14]):

l All nodes in a tree T are labeled with postfix-ordered.. a/T represents the subtree of a tree T at node a.l rr(a, b) represents a region in a tree T, where (1 5 b,

which is a set of nodes of T such that zz is a node in~(0, b) if a 2 z 5 b.

l The n regions in Y, n(pi, r;), for i = 1,. , n are saidto be consistent if the following holds true:

1. ~@i,ri) and w(pj,rj) do not overlap for 1 5i,j<n,r;<p,.

2. Any node in w(p;, T;) does not have predecessor-descendant relation with any node in n(pj,rj),i#j.

The tree matching algorithm performs the operations inan efficient way by a divide-and-conquer strategy. In “di-vide”, the matching between a subtree ojX and a region ofY is reduced to the problem of matching the children of thesuhtree a/X to each one of the regions of Y such that theyare consistent. This process is illustrated in Figure 4 wherethe n regions are consistent in the tree Y.

Figure 4: Tree matching method

With all the possible matchings between subtrees a/Xand regions of Y identified, the last problem is to find theminimum subtree which covera all the n regions in Y. The“conquer” part resolves this problem. First, let s be thenearest cmnn~n predecessor for the n regions. The secondstep is to compute the minimum number of operations tomatch a/X into the subtree s/T in Y. The distance fornfpl, vi), i = 1,. , n can now be computed according tothe following equation:

n

1

nd=xdi+ ]w@~,r..-~ln@<,ri)l +(s-r,+l).

i=1 i=, >In this equation, the first term is the total distance of map-ping from n subtrees of a/X to n regions n@i,rz), i =1,. ,?I. The second term is the cost of merging needed toremove the nodes not belonging to the n regions in T(PI, rn)(the pale gray part in Figure 4). Finally, the third termis the number of merging operations used to remove nodesfrom P, to s excluding r,(the dark gray part in Figure 4).

3.2.3 Hierarchical Clustering MethodWe employ a hierarchical clustering method [16], which is

used widely in information retrieval. The basic idea is: ini-tially start with a separate class for each DTD; successivelymerge the classes closest to one another, until the numberof classes is sufficiently small. We use the average distancefor computing the distance between classes.

The hierarchical clustering method computes a hierarchyof classes whose leaves are individual DTDs and internalnodes corresponding to classes formed by merging classesfrom lower levels.

3.3 Schema LearnerOur technique for generating integrated schema is based

on tree grammar inference. Gmmmotical inference is thetask to induce hidden grammatical rules from a set of ex-amples. In a tree grammar, its terminal alphabet is corn-posed of a set of primitive trees, then the induced rules willproduce trees by assembling the primitive elements. The

problem of deriving an integrated schema from similar DTDschemas is reduced to this problem. We first describes thetree grammar inference framework. We then show how totreat integrated schema generation as an inference task.

We adopt the k-follower method [9], which applies a sim-ple heuristic of state-merging process. We say that twostates of a finite automaton whose last k words match havea tail in common. The two states are k-equivalent where kis a nonnegative integer.

Some additional definitions and terminologies for trees aregiven as follows (following [9]):

l Z’(a += V) is the replacement of the subtree of T ata with a subtree U.

l Depth*(a) is the number of nodes on the path fromthe root of T to a, excluding a.

DEFINITION 3. Let S be a given finite set of trees, S&,be the set of all subtrees of the member trees in S, and S bea union of S and S&,. Also, let k be a nonnegative integerand “$” be a special character not in the set of node labelsin S. The k-follower Hi(T) of a tree T with respect to S isdefined by

H;(T) = {U(b ti $)}

where tree U and node b satisfy the following conditions:l U E S and bfU = T;l If there exists a tree U E S, then Depthu(b) 5 k, or if

there exists a tree U E S&,, then Deptho(b) = k. 0Hence, the k-follower of tree T is the trees that are thereplacements of T in each tree U in S with $.

EXAMPLE 4. Given a set S of two trees, let T be a singlenode tree labeled j%a, the k-follower of T is shown in Fig-ure 5. The label author is abbreviated to a, authorname toan, firstname to fn, and lastname to in. Cl

Figure 5: Example of k-follower set

We define the equivalence relation between trees based onthe concept of k-follower.

DEFINITION 4. Let R” be an equivalence relation on treesins. (T,U)ER” iflHi(T)=Hj(U)fork>O. 0

Given a finite set S of trees and any nonnegative integer k,the tree grammar inference algorithm generates a nondeter-ministic tree automaton. At first, the automaton containsone state for each subtree of S. The initial equivalence rela-tion consists of the distinct subtrees in S. The equivalenceclasses are induced from the equivalence relation and defined

as follows: [TIRh = {U E SI(T,U) E R”}. The equivalenceclasses will become the states of a nondeterministic tree au-tomaton M. If subtrees of S not belonging to the sameequivalent class have the same set of k-follower, then theirclasses are merged to a new one. The set of final states LFis a set of equivalence relations whose k-follower has $ as anelement, i.e., F = {[TIRhI$ E Hi(T)}. Formally, our treegrammar inference algorithm is given as follows:

ALGORITHM 1. Given a set S of source DTDs,Step 1. Generate the set S and initialize k to 0.Step 2. For each subtree T an S, generate the k-follower with

respect to the set S.Step 3. If Hi(T) = H;(U), then the states of the automaton

corresponds to the same equivalence class.Step 4. If the equivalence classes have changed, then go to

Step 2 with k increased by 1. Otherwise go to Step 5.Step 5. Generate state-transition functions.

EXAMPLE 5. Suppose we are given source DTDs, (a) and(b) in Table 1. The set S has two DTDs and the generatedset S has 15 subtrees. The Al orithm 1 is terminated when k= 2, because its equivalence c asses are the same with k = 1.9The inferred tree automaton is M = ({F, 91, . . . ,qls), N,6, {F)} where the state transition functions are as follows.The corresponding tree grammar is shown an Figure 6. 0

Figure 6: Learned tree grammar

3.4 MinimizerThe minimizer subsystem optimizes the state-transition

functions generated by the schema learner subsystem andtransforms the optimized functions into an integrated schema.The optimization strategy is to merge/modify functions thathave parent-child relationships or common labels/subtrees.The algorithm is given as follows:

ALGORITHM 2. Given a set P of states,Step 1. P’ = P,Step 2. For every two states pi,pj E P, (1) if they have com-

mon root label and pi is a subtree of pj, then the rootlabel of pj is changed to $. (2) if they have commonroot label and subtrees, then merge their labels and sub-trees. (3) if they have the same subtrees, then mergetheir labels.

Step 3. If P’ # P, go to Step 1. Otherwise return states.

Case 1 in Table 4 is the case of Step 2(l) in Algorithm 2. Step4 will resolve Case 2. The different combinations of methodsof Step 2 generate several sets of states. The most compactintegrated schema is selected as the integrated view.

1 5 6

We tested DEEP on three domains, book, play and movie-list. The test DTDs are prepared as follows. We started bycollecting two to three seed DTDs in the test domains. Theseed DTDs serve as the “golden rule” for performance evalu-ation. From these seed DTDs, we constructed 100 DTDs foreach domain. These 100 DTDs were generated by variousperturbation using tree modification operations in Figure 3with different modification rates. The modification rate is

sp = [cookbook/bib : (title(l, (%,)*,p”blisherg, $!ea,o, (isho)?. (priceo)?),i the ratio of the number of nodes that are modified and the.I = p : ((a”thor,a”thor”ame)?)1; total number of nodes in a given tree. The modificationi* = [a”thor,a”thornamP : ((firat”ame,firat)o, (laatnome,lost)a),; rate starts from 10% and is increased bv 10% in each iter-,g = [publisher : (nonleo, (addresaa)?, (smoilo)?,];2

EXAMPLE 8. Consider the Given query in Table 5(a) thatrettieves titles of books/articles whose publisher is Addi-son Wesley and the publishing year is later than 1998. Inthis case, suppose there are two related XML documents,COOKBOOK.xml with CCIOKBOCIK DTD and BIB.xml with BIBDTD. The system refomulates the query into two XML-QLqueties in Table 5(b) and (c). Case 4 in Table 4 is resolvedin this step. 0

ation until it reaches 100%. Each it&ion generates tennew DTDs by modifying seed DTDs. The modification isconducted by applying a randomly selected operator to therandomly selected node. Each data set has tested in twocases: with or without renomer described in Section 3.1.

Table 6: Summary of experimental results: mod:modification rate; t: number of types; mt: nun-her of merged types; pw.: average precision ofclustering; pen.: number of being renamed elemerits/number of similar concepts; accu.: accuracy,in terms of percentage.

4. EXPERIMENTAL RESULTSWe have implemented a system called DEEP based on

our approach and conducted some preliminary experiments.The main concern of the experiments is bow to evaluate thequality of the results.

Figure 7: Quality of DEEP

The experimental results are shown in Table 6. The firstperformance measure is the coorrectness of the clustering.We compute the precision of clustering, pre,, where n is aclustering level (i.e., the number of classes). The precisionpre, is obtained by the following equation:

where ~CD* 1 is the number of correctly clustered DTDs inCi and IC;l is the number of DTDs in a DTD class Ci.The fourth and eighth columns of Table 6 show the averageprecision of clustering without and with renamer, resp., atn = 3. As the modification rate increases, the precisiondegrades gracefully from 100% to 75% with the renamer,as shown in the dotted line of Figure 7 (a). Without theprocess, we see that the degradation is 38% (from 100% to62%), 13% worse than with the renamer, as shown in thestraight line of Figure 7 (a).

The second measure is the accumcy of integrated schema.The result wan achieved without DTD Cluster. The 6ftb andtenth columns of Table 6 show the accuracy for the systemto correctly 6nd out similar concepts. In both columns, a/bmeam that a is the number of similar concepts discoveredby the system and b is the total number of similar conceptsin the data set. Figure 7(b) shows accuracies ranging from50% to 18% without renamer and ranging from 100% to 62%with renasuer. In the figure, %mnaming is the percentage ofthe accuracy achieved by rammer. The shaded area of Fig-ure 7(b) shows the additional associations identified by thesystem. This illustrates that, with help of renamer, DEEPcan recognize up to 25% more of associations.

We now analyze some cases that DEEP fails to find as-sociations between similar concepts. First, if two elementtypes have no common labels and child nodes, they cannotbe matched to the same type, though they are conceptuallysimilar. Another problem is that users may provide mislead-ing renaming that confuses our system. To solve this kind

157

of problems, we provide a tool to correct minor mistakesmanually in the DEEP system.

5. RELATED WORKThe most closely related work is a system that learns map-

pings between source schema and the integrated schema,called LSD [5]. Given an integrated schema, first a set ofdata sources have been manually mapped to the integratedschema. Then the system learns from these mappings topropose mappings for new data sources. Since the focus ofLSD is on how to find one-to-one mappings for leaf elements(i.e., #PCDATA type) of source schema, it can not learn hi-erarchical or nested structures of source schema. Anotherdifference between LSD and our work is that the integratedschema of their system is given by human, while ours isgenerated dynamically and automatically by the system.

Another related work is XTRACT [lo], a system that ex-tracts a DTD from XML documents. Input XML documentsare assumed to conform to the same DTD. From each ele-ment in XML documents, XTRACT tries to derive a regularexpression that describes the sub-element sequences for theelement. Since DTDs are not mandatory, an XML docu-ment may not always have an accompanying DTD. Toolsthat can infer an accurate DTD for given XML documentsare useful. It is straightforward to extend our system toextract a DTD from XML documents using schema learnersubsystem. In this case, the set of sample trees consists ofXML documents and the inferred rules will generate a DTDwhich can cover all input documents.

Extracting schema from semistructured data is consideredin [15]. This work focuses on finding a typing for semistruc-tured data. They use an approximately typing method tomerge types. Therefore, the generated types are expressedusing incoming and outgoing labeled edges. The extractedschema is plain sequences of types instead of arbitrary reg-ular expressions.

6. CONCLUSIONS AND FUTURE WORKIn this paper, we have proposed a view inference approach

that automatically derives integrated view for an IIA to ac-cess XML-based sources. This approach enhances the ex-tensibility of an IIA. This problem arises because manuallyconstructing an integrated view for each application domainis error-prone and labor-intensive. To make the developmentof an extensible IIA more efficient, it is desirable to have atool that automates this task.

Our solution to this problem is a novel approach that ap-plies hierarchical agglomerative clustering and tree grammarinference technique to generate an integrated view so that anIIA can integrate a new XML document source easily. Theexperimental results show that our method can effectivelyand efficiently generate appropriate integrated view for thetest domains. The performance is further improved with re-namer. We conclude that our view inference approach is afeasible solution to alleviate engineering bottlenecks for thedevelopment of extensible IIA.

Our future work includes applying this approach to large-scaled real-world applications. We are participating in aproject aiming at building a digital museum of historicalphotographs in Taiwan. In this project, we need to inte-grate a variety of photograph collections maintained inde-pendently by a variety of agencies. We also planned to in-vestigate how to minimize human intervention in renamer.

AcknowledgementsThe research reported here was supported in part by theNational Science Council in Taiwan under Grant No. NSC89-2218-E-002-014, 89-2750-P-001-007, and 89-2213-E-001-039.

7.PI

PI

[31

PI

[51

PI

[71

PI

PI

PO1

WI

WI

[I31

P4

[I51

P61

REFERENCEST. Bray, J. Paoli, and C. M. Sperberg-McQueen.Extensible Markup Language(XML) 1.0, 1998. W3CRecommendation.P. Buneman, S. Davidson, G. Hillebrand, andD. Suciu. A query language and optimizationtechniques for unstructured data. In Proceedings ofSIGMOD, 1996.S. Chawathe, H. Garcia-Molina, J. Hammer,K. Ireland, Y. Papakonstantinou, J. Ullman, andJ. Widom. The TSIMMIS project: Integration ofheterogeneous information sources. In Proceedings ofthe Information Processing Society of JapanConference, pages 7-18, Tokyo, Japan, October 1995.A. Deutsch, M. Fernandez, D. Florescu, A. Levy, andD. Suciu. XML-QL: a query language for XML, 1998.A. Doan, P. Domingos, and A. Levy. Learning sourcedescriptions for data integration. In 3rd InternationalWorkshop on the Web and Databases, 2000.0. Duschka and M. Genesereth. Query planning ininfomaster. In Proceedings of the ACM Symposium onApplied Computing, San Jose, CA, February 1997.0. Etzioni and D. Weld. A softbot-based interface tothe Internet. In C. ACM, 1994.M. Fernandez, J. Simeon, and P. Wadler. XML query1anguages:experiences and examplars, 1999. W3CDraft manuscript.H. Fukuda and K. Kamata. Inference of tree automatafrom sample set of trees. International Journal ofComputer and Information Sciences, 13:177-196, 1984.M. Garofalakis, A. Gionis, R. Rastogi, S. Seshadri,and K. Shim. XTRACT: a system for extractingdocument type descriptors from xml documents. InProceedings of the ACM SIGMOD, 2000.T. Kirk, A. Y. Levy, Y. Sagiv, and D. Srivsstava. Theinformation manifold. In Proceedings of the AAAISpring Symposium on Information Gathering inDistributed Heterogeneous Environments, Stanford,California, March 1995.C. A. Knoblock, Y. Arens, and C. N. Hsu.Cooperating agents for information retrieval. InProceedings of International Conference onCooperative Information Systems, 1994.C. Kwok and D. Weld. Planning to gatherinformation. In Proceedings on 13th NationalConference of AI, 1996.S. Y. Lu. A tree matching algorithm based on nodesplitting and merging. In IEEE Transactions onPattern Analysis and Machine Intelligence, volume 6,pages 249-256, 1984.S. Nestorov, S. Abiteboul, and R. Motwani.Extracting schema from semistructured data. InProceedings of the ACM SIGMOD, pages 295-306,Seattle, June 1998.E. Rasmussen. CIusterang Algorithms, chapter 16.Prentice Hall, 1992.

1 5 8

Documents

Induction of Integrated View for XML Data with ... · Induction of Integrated View for XML Data with HeterogeneousDTDs Euna Jeong Computer Science and Information Engineering National