Computing Structural Similarity of Source XML Schemas
against Domain XML Schema
Jianxin Li (1), Chengfei Liu (1), Jeffrey Xu Yu (2), Jixue Liu (3), Guoren Wang (4), Chi Yang (1)
(1) Swinburne University of Technology
(2) Chinese University of Hong Kong
(3) University of South Australia
(4) Northeastern University of China
Outline
Motivation
Related Work
Problem Statement
Structural Similarity Model
Algorithms
Experiments
Conclusions and Future Work
Motivation
XML has become the standard for representing, exchanging and integrating data on the web.
Different source providers may define different schemas for their data based on different applications.
When exact results do not exist, approximate results are also expected to be returned.
[Fig. 1: Schema of the 1st source S1. Fig. 2: Schema of the 2nd source S2. Both trees are rooted at university and contain the elements depart, lib, faculty, campus, cname, book, stu and prof; * marks an element with cardinality "*".]
Motivation
Users may issue queries based on their common understanding, i.e., the domain schema.
For example:
[Fig. 3: Domain schema T, rooted at uni with the elements dept, lib, cname, book, stu and prof; * marks an element with cardinality "*". The source schema S1 of Fig. 1 is shown next to it for comparison.]
The domain schema matches neither of the source schemas exactly. To return approximate results efficiently, the system should determine which source schema is more similar to the domain schema.
Brief XPath queries:
Q1: uni[swin]/dept[ICT]/prof;
Q2: uni[swin]/lib[./cname[Hawthorn]]/book;
… …
How can we compute the similarity between the domain schema and the source schemas?
[The source schema S2 of Fig. 2 is shown again for comparison.]
Related Work
Measuring the similarity between XML documents (to cluster XML documents):
Edit distance: detects the changes required to turn one XML document into another, such as re-labeling, deleting and inserting nodes.
Binary tree: similar to edit distance. XML documents are represented as tree-structured data, and the similarity is obtained by comparing the corresponding binary trees.
Time series: each occurrence of a tag corresponds to a given impulse. By analyzing the frequencies, the degree of similarity between documents can be stated.
Measuring the similarity between XML schemas (to derive schema matching, schema mapping or schema integration):
Cupid, XClust and Similarity Flooding: proposed structural match algorithms that emphasize only the name and data type similarities present at the leaf level.
COMA: the similarity between elements is computed recursively from the similarity between their respective children, using a leaf-level matcher.
In summary, the above methods compute the similarity in a symmetric way.
Related Work
Example of the binary tree model BiBranch, where the smaller the BiB value is, the more similar the corresponding pair of trees are.
According to the above computation, T2 is more similar to T0 than the others. We have a sorted list: T2 > T1 = T3 = T4. However, this ranking is not correct for query applications.
Fig. 4 Example of BiBranch model
The symmetric similarity model cannot satisfy query needs!
Problem Statement
Given a domain schema tree T0 = (V0, E0, vr0, Card) and a source schema tree T = (V, E, vr, Card), we need to compute their structural similarity distance SSD(T0, T).
An XML schema tree is defined as T = (V, E, vr, Card) where
V is a finite set of nodes, representing elements and attributes of the schema.
E is a set of directed edges.
vr ∈ V is the root node of tree T.
Card: V → {"1", "*"} gives the cardinality of each node.
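The definition T = (V, E, vr, Card) can be sketched as a small data structure. This is only an illustration: the class name SchemaTree, the child-to-parent storage of E, the use of element labels as node identifiers, and the toy tree at the end are all assumptions, not something taken from the slides or from Fig. 3.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaTree:
    """T = (V, E, vr, Card), with nodes identified by their labels (assumption)."""
    root: str                                   # vr
    parent: dict = field(default_factory=dict)  # E, stored as child -> parent
    card: dict = field(default_factory=dict)    # Card: node -> "1" or "*"

    def __post_init__(self):
        # Assumption: the root occurs exactly once.
        self.card.setdefault(self.root, "1")

    def add(self, node, parent, card="1"):
        """Add the directed edge (parent, node) and record Card(node)."""
        if card not in ("1", "*"):
            raise ValueError("Card must be '1' or '*'")
        if parent != self.root and parent not in self.parent:
            raise ValueError(f"unknown parent {parent!r}")
        self.parent[node] = parent
        self.card[node] = card
        return self

    @property
    def nodes(self):
        """V: the root plus every node that has a parent."""
        return {self.root} | set(self.parent)

    def edges(self):
        """E as a set of (parent, child) pairs."""
        return {(p, c) for c, p in self.parent.items()}


# Toy domain schema; its shape is assumed for illustration only.
t0 = SchemaTree("uni")
t0.add("dept", "uni", "*").add("prof", "dept", "*").add("lib", "uni", "*")
print(sorted(t0.nodes), sorted(t0.edges()))
```

Storing E as a child-to-parent map keeps the tree property (one parent per node) implicit, which is convenient for the ancestor checks used later.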
Problem Statement
In this work, we focus on several different aspects:
The purpose of the similarity computation is to choose a similar data source for queries.
The similarity computation is asymmetric: the schema that users' queries conform to is taken as the domain schema.
We consider the parent-child (PC) and ancestor-descendant (AD) relationships rather than the sibling order, because they are important in formulating a query.
We take into account the cardinality of schema elements.
An index based on an encoding scheme is provided to improve the efficiency of the computation.
Structural Similarity Model
The model takes into account three factors: element coverage, consistency of element pair relationships, and the difference of element cardinality.
Ratio of Interesting Object:
\[
\mathrm{RIO}(V, V_0) = \frac{|V'|}{|V_0|}
\]
where V' = V ∩ V0 is the set of interesting nodes in V.
Cardinality similarity of node pairs:
\[
\mathrm{CSNP}(v_1, v_2, v_{01}, v_{02}) =
\begin{cases}
1, & \mathrm{RCard}(v_1, v_2) = \mathrm{RCard}(v_{01}, v_{02}) \\
\ldots, & \mathrm{RCard}(v_1, v_2) \neq \mathrm{RCard}(v_{01}, v_{02})
\end{cases}
\]
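The two measures on this slide can be sketched in a few lines. Several things here are assumptions for illustration: nodes are matched by exact label equality, the cardinality relation RCard of a pair is passed in as an opaque value, and the value returned when the two relations differ is a hypothetical parameter `mismatch_value` (the slide does not give a usable number for it). The label sets are the element names visible in Fig. 1 and Fig. 3.

```python
def rio(source_nodes, domain_nodes):
    """Ratio of Interesting Object: |V ∩ V0| / |V0|."""
    domain = set(domain_nodes)
    if not domain:
        raise ValueError("the domain schema has no nodes")
    return len(set(source_nodes) & domain) / len(domain)


def csnp(rcard_source, rcard_domain, mismatch_value=0.5):
    """Cardinality similarity of a node pair.

    Returns 1 when the cardinality relations agree; otherwise returns
    mismatch_value, which is a hypothetical knob, not a value from the slides.
    """
    return 1.0 if rcard_source == rcard_domain else mismatch_value


# Element labels of the domain schema T (Fig. 3) and of source S1 (Fig. 1).
domain = {"uni", "dept", "lib", "cname", "book", "stu", "prof"}
source = {"university", "depart", "lib", "faculty", "campus",
          "cname", "book", "stu", "prof"}

# With exact label matching, 5 of the 7 domain labels occur in the source.
print(rio(source, domain))
print(csnp(("1", "*"), ("1", "*")), csnp(("1", "*"), ("*", "*")))
```

Exact label matching treats uni and university as different elements; a real system would presumably resolve such name variants before computing the ratio.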
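Structural Similarity Model
Similarity of node pairs: SNP(v1,v2,v01,v02)
\[
\mathrm{SNP}(v_1, v_2, v_{01}, v_{02}) =
\begin{cases}
\mathrm{CSNP}(v_1, v_2, v_{01}, v_{02}), & \text{case 1} \\
\mathrm{CSNP}(v_1, v_2, v_{01}, v_{02}), & \text{case 2} \\
1, & \text{case 3} \\
0, & \text{case 4}
\end{cases}
\]
where
\[
\begin{aligned}
\text{case 1:}\quad & ((v_{01}/v_{02}) \wedge (v_1/v_2)) \ \text{or}\ ((v_{01}//v_{02}) \wedge (v_1//v_2)) \\
\text{case 2:}\quad & ((v_{01}/v_{02}) \wedge (v_1//v_2)) \ \text{or}\ ((v_{01}//v_{02}) \wedge (v_1/v_2)) \\
\text{case 3:}\quad & \mathrm{Sibling}(v_{01}, v_{02}) \wedge \mathrm{Sibling}(v_1, v_2) \\
\text{case 4:}\quad & \text{others.}
\end{aligned}
\]
Similarity of source schema w.r.t. domain schema SSD(T0,T)
\[
\mathrm{SSD}(T_0, T) = \mathrm{RIO}(V, V_0) \times \left( \frac{1}{C^{2}_{|V \cap V_0|}} \sum \mathrm{SNP}(v_1, v_2, v_{01}, v_{02}) \right)
\]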
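A minimal sketch of how the node-pair cases and the overall score could be combined, under stated assumptions: trees are given as child-to-parent dictionaries, matching nodes carry the same label in both schemas, RCard of a pair is taken to be the pair of the two nodes' Card values, and both `mismatch_value` and `mixed_weight` (an optional discount for case 2) are hypothetical parameters that do not come from the slides. The function names and the toy trees are also illustrative.

```python
from itertools import combinations


def relation(parent, a, b):
    """'PC' if a is b's parent, 'AD' if a is a higher ancestor of b,
    'SIB' if a and b share a parent, otherwise None."""
    if parent.get(b) == a:
        return "PC"
    p = parent.get(b)
    while p is not None:
        p = parent.get(p)
        if p == a:
            return "AD"
    if a != b and parent.get(a) is not None and parent.get(a) == parent.get(b):
        return "SIB"
    return None


def snp(rel_src, rel_dom, csnp_value, mixed_weight=1.0):
    """Similarity of one node pair, following cases 1 to 4."""
    if rel_src in ("PC", "AD") and rel_dom in ("PC", "AD"):
        if rel_src == rel_dom:
            return csnp_value                  # case 1
        return mixed_weight * csnp_value       # case 2 (weight is an assumption)
    if rel_src == "SIB" and rel_dom == "SIB":
        return 1.0                             # case 3
    return 0.0                                 # case 4


def ssd(dom_parent, dom_card, src_parent, src_card,
        mismatch_value=0.5, mixed_weight=1.0):
    """RIO times the average SNP over all pairs of shared nodes."""
    dom_nodes = set(dom_parent) | set(dom_parent.values())
    src_nodes = set(src_parent) | set(src_parent.values())
    shared = sorted(dom_nodes & src_nodes)
    n_pairs = len(shared) * (len(shared) - 1) // 2   # C(|V ∩ V0|, 2)
    if n_pairs == 0:
        return 0.0
    total = 0.0
    for a, b in combinations(shared, 2):
        for x, y in ((a, b), (b, a)):                # try both directions
            rd, rs = relation(dom_parent, x, y), relation(src_parent, x, y)
            if rd and rs:
                same = ((src_card.get(x, "1"), src_card.get(y, "1")) ==
                        (dom_card.get(x, "1"), dom_card.get(y, "1")))
                total += snp(rs, rd, 1.0 if same else mismatch_value, mixed_weight)
                break
    return (len(shared) / len(dom_nodes)) * total / n_pairs


# Toy trees (child -> parent); shapes are assumed, not taken from the figures.
domain = {"dept": "uni", "lib": "uni", "prof": "dept", "book": "lib"}
nested = dict(domain)                                  # same structure
flat = {"dept": "uni", "lib": "uni", "prof": "uni", "book": "lib"}
cards = {"dept": "*", "lib": "*", "prof": "*", "book": "*"}

print(ssd(domain, cards, nested, cards), ssd(domain, cards, flat, cards))
```

In this sketch the source that keeps prof under dept scores higher than the one that moves prof directly under uni, which is the kind of ranking the asymmetric model is meant to produce. Pairs with no PC, AD or sibling relation fall into case 4, so even a structurally identical source does not necessarily reach a score of 1 here.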
Structural Similarity Model
Comparison of SSD and BiBranch models:
BiBranch model: T2 > T1 = T3 = T4
SSD model: T1 = T4 > T3 > T2
The results satisfy our expectation!
Fig. 5 Example of SSD model
Algorithms
Techniques:
Trimming rules: root node, leaf node, internal node.
Numbering scheme as index: pre (preorder), post (postorder), C (cardinality), P (parent), RD (rightmost descendant's preorder).
Algorithms:
Basic Algorithm (BA): conducts pairwise comparisons.
Improved Algorithm (IA): reduces the number of similarity comparisons.
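The numbering scheme (pre, post, C, P, RD) can be illustrated with a single depth-first traversal. This is a sketch under assumptions: P is stored as the parent's preorder number (0 for the root), a leaf's RD equals its own preorder number, and the function names and toy tree are hypothetical. It is not the Basic or Improved Algorithm itself, only the index they are described as relying on.

```python
def number_schema(children, card, root):
    """Return {node: (pre, post, C, P, RD)} from one depth-first traversal."""
    index = {}
    counter = {"pre": 0, "post": 0}

    def visit(node, parent_pre):
        counter["pre"] += 1
        pre = counter["pre"]
        for child in children.get(node, []):
            visit(child, pre)
        rd = counter["pre"]      # largest preorder number assigned in this subtree
        counter["post"] += 1
        index[node] = (pre, counter["post"], card.get(node, "1"), parent_pre, rd)

    visit(root, 0)
    return index


def is_ancestor(index, a, d):
    """a is a proper ancestor of d iff pre(a) < pre(d) <= RD(a)."""
    return index[a][0] < index[d][0] <= index[a][4]


def is_parent(index, a, d):
    """a is the parent of d iff P(d) equals pre(a)."""
    return index[d][3] == index[a][0]


# Toy schema (parent -> ordered children); shape is assumed for illustration.
children = {"uni": ["dept", "lib"], "dept": ["stu", "prof"], "lib": ["book"]}
card = {"dept": "*", "stu": "*", "prof": "*", "lib": "*", "book": "*"}
idx = number_schema(children, card, "uni")

for name in ("uni", "dept", "stu", "prof", "lib", "book"):
    print(name, idx[name])
```

With this index, the PC and AD tests needed for each node pair become constant-time comparisons of numbers instead of walks up the tree.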
Experiments
Response Time vs. Similarity Degree
Fig. 6 The schema size is set to 20, 40, 60 and 80 nodes. For each size, the similarity degree is adjusted to 25%, 50%, 75% and 100%.
(a) schema size = 20 nodes (b) schema size = 40 nodes
(c) schema size = 60 nodes (d) schema size = 80 nodes
Experiments
Response Time vs. Nested Level
Fig. 7 The schema size is 128 nodes and the nesting level varies from 4 to 16.
Speedup vs. Fanout
Fig. 8 The schema size is set to 128 nodes and the fanout varies from 2 to 5.
Experiments
Response Time vs. Schema Size
Fig. 9 The schema size is set to 20, 40, 60 and 80 nodes.
Fig. 10 The three public datasets: TPC-H-nested.xsd (17), genexml.xsd (85) and mondial-3.0.xsd (120).
Conclusions and Future Work
Contributions:
Proposed the structural similarity problem for the purpose of query applications;
Designed a concise structural similarity model and discussed its effectiveness;
Implemented the relevant algorithms and demonstrated their efficiency with synthetic and real data sets.
Future work:
Improve the similarity model and make it more accurate;
Apply this similarity model to improve query evaluation.
Thanks & Questions