
Computing Structural Similarity of Source XML Schemas against Domain XML Schema

Jianxin Li¹, Chengfei Liu¹, Jeffrey Xu Yu², Jixue Liu³, Guoren Wang⁴, Chi Yang¹

¹ Swinburne University of Technology

² Chinese University of Hong Kong

³ University of South Australia

⁴ Northeastern University of China

Outline

Motivation
Related Work
Problem Statement
Structural Similarity Model
Algorithms
Experiments
Conclusions and Future Work

Motivation

XML has become the standard for representing, exchanging and integrating data on the web.

Different source providers may define different schemas for their data, depending on their applications.

When no exact results exist, users still expect approximate results to be returned.

[Figures: two source schema trees, each rooted at university and built from the elements faculty, campus, depart, lib, stu, prof, cname and book; * marks elements with cardinality "*".]

Fig. 1 Schema of 1st Source S1

Fig. 2 Schema of 2nd Source S2

Motivation

Users may issue queries based on their common understanding, i.e., the domain schema.

For example:

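[Figure: domain schema tree rooted at uni and built from the elements dept, lib, stu, prof, cname and book; * marks elements with cardinality "*". The source schema S1 is shown again beside it for comparison.]

Fig. 3 Domain Schema T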

The domain schema does not match either of the source schemas. To return approximate results efficiently, the system should determine which source schema is more similar to the domain schema.

Brief XPath queries:

Q1: uni[swin]/dept[ICT]/prof;

Q2: uni[swin]/lib[./cname[Hawthorn]]/book;

… …

How can the similarity between the domain schema and the source schemas be computed?


Related Work

Measuring the similarity between XML documents (to cluster XML documents):

Edit distance: detects the changes required to turn one XML document into another, such as re-labeling, deleting, and inserting.

Binary tree: similar to edit distance. XML documents are represented as tree-structured data, and the similarity is obtained by comparing the corresponding binary trees.

Time series: each occurrence of a tag corresponds to a given impulse. By analyzing the frequencies, the degree of similarity between documents can be stated.

Measuring the similarity between XML schemas (to derive schema matching, schema mapping or schema integration):

Cupid, XClust and Similarity Flooding proposed structural match algorithms that emphasize only the name and data type similarities present at the leaf level.

COMA: the similarity between elements is computed recursively from the similarity between their respective children, using a leaf-level matcher.

In summary, the above methods all compute similarity in a symmetric way.

Related Work

Example of the binary tree model BiBranch, where the smaller the BiB value is, the more similar the corresponding pair of trees is.

According to the above computation, T2 is more similar to T0 than the other trees, which gives the sorted list T2 > T1 = T3 = T4. However, this ranking is not correct in query applications.

Fig. 4 Example of BiBranch model

The symmetric similarity model cannot satisfy query needs!

Problem Statement

Given a domain schema tree $T_0 = (V_0, E_0, v_{r0}, Card)$ and a source schema tree $T = (V, E, v_r, Card)$, we need to compute their structural similarity distance $SSD(T_0, T)$.

An XML schema tree is defined as $T = (V, E, v_r, Card)$ where

$V$ is a finite set of nodes, representing elements and attributes of the schema.

$E$ is a set of directed edges.

$v_r \in V$ is the root node of tree $T$.

$Card: V \to \{\text{“1”}, \text{“*”}\}$.
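The tuple definition above maps onto a small data structure. The following is a minimal Python sketch rather than the authors' implementation: the class name `SchemaTree`, its methods, the choice to identify nodes by element name, and the example elements and cardinalities are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaTree:
    """Sketch of T = (V, E, vr, Card); nodes are identified by element name."""
    root: str                                    # vr
    parent: dict = field(default_factory=dict)   # E, stored as child -> parent
    card: dict = field(default_factory=dict)     # Card: V -> {"1", "*"}

    def __post_init__(self):
        # Assumption: the root occurs exactly once unless stated otherwise.
        self.card.setdefault(self.root, "1")

    def add(self, node, parent, card="1"):
        """Add the directed edge parent -> node with the given cardinality."""
        if card not in ("1", "*"):
            raise ValueError("Card maps nodes to '1' or '*'")
        if parent not in self.card:
            raise KeyError(f"unknown parent {parent!r}")
        self.parent[node] = parent
        self.card[node] = card
        return self

    @property
    def nodes(self):                             # V
        return set(self.card)

    @property
    def edges(self):                             # E as (parent, child) pairs
        return {(p, c) for c, p in self.parent.items()}


# Hypothetical tree; element names and cardinalities are illustrative only.
t0 = SchemaTree("uni")
t0.add("dept", "uni", "*").add("prof", "dept", "*")
t0.add("lib", "uni", "*").add("book", "lib", "*")
```

Storing the edges as a child-to-parent map keeps the parent lookup constant-time, which is convenient when checking PC and AD relationships between node pairs.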

Problem Statement

This work differs from previous approaches in the following aspects:

The purpose of the similarity computation is to choose a similar data source for queries.

The similarity computation is asymmetric: the schema that users' queries conform to is taken as the domain schema.

We consider the parent-child (PC) and ancestor-descendant (AD) relationships rather than sibling order, because these relationships are important in formulating a query.

We take into account the cardinality of schema elements.

An index based on an encoding scheme is provided to improve the efficiency of the computation.

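Structural Similarity Model

The model takes into account three factors: element coverage, consistency of element pair relationships, and the difference of element cardinality.

Ratio of Interesting Object:

$$RIO(V, V_0) = \frac{|V'|}{|V_0|}$$

where $V' = V \cap V_0$ is the set of interesting nodes in $V$.

Cardinality similarity of node pairs:

$$CSNP(v_1, v_2, v_{01}, v_{02}) = \begin{cases} 1, & RCard(v_1, v_2) = RCard(v_{01}, v_{02}) \\ \beta, & RCard(v_1, v_2) \neq RCard(v_{01}, v_{02}) \end{cases}$$

Here $\beta$ stands for the value used in the second branch; its symbol and the two comparison operators are not legible in the slide text, so $=$ and $\neq$ are the assumed reading.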

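Structural Similarity Model

Similarity of node pairs: SNP(v1, v2, v01, v02)

$$SNP(v_1, v_2, v_{01}, v_{02}) = \begin{cases} CSNP(v_1, v_2, v_{01}, v_{02}), & \text{case 1} \\ \alpha \cdot CSNP(v_1, v_2, v_{01}, v_{02}), & \text{case 2} \\ 1, & \text{case 3} \\ 0, & \text{case 4} \end{cases}$$

case 1: $(v_{01}/v_{02} \wedge v_1/v_2)$ or $(v_{01}//v_{02} \wedge v_1//v_2)$

case 2: $(v_{01}/v_{02} \wedge v_1//v_2)$ or $(v_{01}//v_{02} \wedge v_1/v_2)$

case 3: $eSibling(v_{01}, v_{02}) \wedge eSibling(v_1, v_2)$

case 4: others.

Here $\alpha$ stands for the factor applied in case 2; its symbol is not legible in the slide text.

Similarity of source schema w.r.t. domain schema SSD(T0, T)

$$SSD(T_0, T) = RIO(V, V_0) \times \left( \frac{1}{C^{2}_{|V \cap V_0|}} \sum SNP(v_1, v_2, v_{01}, v_{02}) \right)$$

where the sum runs over the pairs of interesting nodes and $C^{2}_{|V \cap V_0|}$ is the number of such pairs.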
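The RIO, CSNP, SNP and SSD definitions on these slides can be combined into a short computation. The Python sketch below is an illustration under stated assumptions, not the authors' code: nodes of the two schemas are matched by element name; `ALPHA` and `BETA` are placeholder values for the case 2 factor and the CSNP mismatch value; `rcard` assumes the relative cardinality is "*" whenever any node on the path below the ancestor has cardinality "*"; and eSibling is assumed to hold for any two nodes that are not in an ancestor-descendant relationship. All function names and the example schemas are hypothetical.

```python
from itertools import combinations

ALPHA = 0.5  # assumed factor for case 2 (PC in one schema, AD in the other)
BETA = 0.5   # assumed CSNP value when the relative cardinalities differ


def orient(parent, a, b):
    """Return (kind, anc, desc) with kind 'PC' or 'AD', or None if unrelated."""
    for anc, desc in ((a, b), (b, a)):
        v, depth = desc, 0
        while v in parent:            # walk from desc towards the root
            v, depth = parent[v], depth + 1
            if v == anc:
                return ("PC" if depth == 1 else "AD", anc, desc)
    return None


def rcard(tree, anc, desc):
    """Assumed RCard: '*' if any node below anc on the path to desc is '*'."""
    v = desc
    while v != anc:
        if tree["card"][v] == "*":
            return "*"
        v = tree["parent"][v]
    return "1"


def snp(dom, src, a, b):
    d, s = orient(dom["parent"], a, b), orient(src["parent"], a, b)
    if d is None or s is None:
        # Case 3 (unrelated in both schemas, the assumed eSibling) or case 4.
        return 1.0 if d is None and s is None else 0.0
    if d[1:] != s[1:]:
        return 0.0                    # ancestor and descendant swapped: case 4
    csnp = 1.0 if rcard(dom, *d[1:]) == rcard(src, *s[1:]) else BETA
    return csnp if d[0] == s[0] else ALPHA * csnp   # case 1, otherwise case 2


def ssd(dom, src):
    shared = sorted(set(dom["card"]) & set(src["card"]))   # V' = V ∩ V0
    pairs = list(combinations(shared, 2))                  # C(|V'|, 2) pairs
    if not pairs:
        return 0.0
    rio = len(shared) / len(dom["card"])
    return rio * sum(snp(dom, src, a, b) for a, b in pairs) / len(pairs)


def tree(triples):
    """Build a tree from (node, parent, card) triples; parent None marks the root."""
    return {"parent": {n: p for n, p, _ in triples if p is not None},
            "card": {n: c for n, _, c in triples}}


# Hypothetical schemas; element names and cardinalities are illustrative only.
domain = tree([("uni", None, "1"), ("dept", "uni", "*"), ("prof", "dept", "*"),
               ("lib", "uni", "*"), ("book", "lib", "*")])
source = tree([("uni", None, "1"), ("faculty", "uni", "*"),
               ("dept", "faculty", "*"), ("prof", "dept", "*"),
               ("lib", "uni", "*"), ("book", "lib", "*")])
print(ssd(domain, domain), ssd(domain, source))
```

In this example the only pair that changes category is (uni, dept), which is a PC pair in the domain schema and an AD pair in the source schema, so it is scored by case 2. The function is asymmetric by construction, since RIO is normalized by the size of the domain schema only.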

Structural Similarity Model

Comparison of SSD and BiBranch models:

BiBranch model: T2 > T1 = T3 = T4

SSD model: T1 = T4 > T3 > T2

The SSD results satisfy our expectation.

Fig. 5 Example of SSD model

Algorithms

Techniques:

Trimming rules: root node, leaf node, internal node.

Numbering scheme as index: pre (preorder), post (postorder), C (cardinality), P (parent), RD (rightmost descendant's preorder).

Algorithms:

Basic Algorithm (BA): conducts pairwise comparisons.

Improved Algorithm (IA): reduces the number of similarity comparisons.
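The numbering scheme listed under Techniques can be produced in a single depth-first traversal. The Python sketch below is an assumption about how such an index might be built and queried, not the authors' algorithm: the function names, the tuple layout `(pre, post, C, P, RD)`, the convention that a leaf's RD equals its own preorder number, and the example tree are all hypothetical.

```python
def number_tree(children, card, root):
    """Return {node: (pre, post, C, P, RD)} computed by one depth-first walk."""
    index = {}
    counter = {"pre": 0, "post": 0}

    def visit(node, parent):
        counter["pre"] += 1
        pre = counter["pre"]
        for child in children.get(node, []):
            visit(child, node)
        counter["post"] += 1
        # Once the subtree is finished, the preorder counter holds the
        # preorder number of the rightmost descendant (or of the node
        # itself when it is a leaf).
        index[node] = (pre, counter["post"], card[node], parent, counter["pre"])

    visit(root, None)
    return index


def is_ancestor(index, anc, desc):
    """AD test: desc's preorder lies in the interval (pre(anc), RD(anc)]."""
    pre_a, _, _, _, rd_a = index[anc]
    return pre_a < index[desc][0] <= rd_a


def is_parent(index, par, child):
    """PC test: compare the stored parent field P."""
    return index[child][3] == par


# Hypothetical schema tree; names and cardinalities are illustrative only.
children = {"uni": ["dept", "lib"], "dept": ["stu", "prof"],
            "lib": ["cname", "book"]}
card = {"uni": "1", "dept": "*", "lib": "*", "stu": "*", "prof": "*",
        "cname": "1", "book": "*"}
index = number_tree(children, card, "uni")
```

With such an index, both relationship tests become constant-time comparisons, which is one plausible way an index could reduce the cost of the pairwise comparisons.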

Experiments

Response Time vs. Similarity Degree

Fig. 6 The schema size is set to 20, 40, 60 and 80 nodes, and for each size the similarity degree is set to 25%, 50%, 75% and 100%. Panels: (a) schema size = 20 nodes, (b) schema size = 40 nodes, (c) schema size = 60 nodes, (d) schema size = 80 nodes.

Experiments

Response Time vs. Nested Level

Fig. 7 The schema size is 128 nodes and the level varies from 4 to 16.

Speedup vs. Fanout

Fig. 8 The schema size is set to 128 nodes and the fanout varies from 2 to 5.

Experiments

Response Time vs. Schema Size

Fig. 9 The schema size is set to 20, 40, 60 and 80 nodes.

Fig. 10 The three public datasets: TPC-H-nested.xsd (17), genexml.xsd (85) and mondial-3.0.xsd (120).

Conclusions and Future Work

Contributions:

Proposed the structural similarity problem for the purpose of query applications.

Designed a brief structural similarity model and discussed its effectiveness.

Implemented the relevant algorithms and demonstrated their efficiency with synthetic and real data sets.

Future work:

Improve the similarity model and make it more accurate.

Apply this similarity model to improve query evaluation.

Thanks & Questions