s Ö N Þ Ì ø m XML W ä > - academic-journals.org · 84! · X M ô 2 Abstract XML has become a standard so that the transactional processes operate well in enterprise data exchange

XML

(eXtensible Markup Language XM)

XML XMLXML

XML XML

XML(frequent tag data sets) (frequent character

data sets)XML XML

XMLXML 75%

40XML XML

Compressing the Native XML Database via Association Mining

Chin-Feng Lee Chaoyang University of Technology,

Chi-Ming Tang Chaoyang University of Technology

84

Abstract

XML has become a standard so that the transactional processes operate well in enterprise data exchange. However the existing database systems like relational databases provide inadequate facilities to manage the nested and ordered structures in XML documents. Therefore, there exist two important issues about the storage capacity for huge XML documents and the complexity mapping between the relational databases and XML repository. A native XML database is a solution to efficiently retrieve XML documents as basic units without requiring complicated transformation. Moreover, database compression is bound to relief the storage capacities. Hence, we use association mining techniques to compress a native XML database for solving the above problems. The frequent character data sets and frequent tag sets can be explored out and be applied to establish a set of database compression rules. The proposed method also applies dynamically mining techniques to maintain the compression rules without periodically decompressing and exploring the whole database compression again if there are any database updates.

The proposed approach contributes to the native XML database both in extracting hidden information and lossless compression, respectively. The experimental results show that our compression method has powerful compression effectiveness and the static compression can reach the ratio of 75%. When we apply the porposed dynamical mining techniques, we can save 40 seconds in database compression time.

Keywords: XML, Data Mining, Data Compression, Native XML Database

XML(eXtensible Markup Language) W3CXML

DTD(Document Type Definition) XML SchemaXML

(Cannane et al., 2000; Lee et al., 2001; Strobel, 2002; WWW Consortum, 2005)

XMLXML

XML(Bertino et al., 2001; Cannane

Electronic Commerce Studies 85

et al. 2000) XMLXML

(Florescu and Lossmann, 1999; Fong et al., 2003) XML

XMLXML

(Huffman Coding)(Arithmetic Encoding) (Dictionary Coding)

(Knowledge DBcovery in Databases)

XML(Agrawal and Srikant,

1994; Agrawal et al., 1993; Han and Kamber, 2001) XML(frequent tag data sets)

(frequent character data sets)

XML

XML

XML

XML XML

86

XML(eXtensible Markup Language)

XML (tag) (character data)

DTD(Document Type Definition) W3C XML

1 XML “ ” “ ”“ ” “ ” (character data)

“ ” “ ” “ ” “” “ ” “ ” “ ”

“020301” “ 2”

1 XML

(Item)

(Agrawal and Srikant, 1994)

X Y [Support , Confidence] X Y(Support) (Confidence)

Probability(X Y) (X Y)Probability(Y|X) X Y

(Minimum Support)(Minimum Confidence)


Class Inheritance Tree(CIT) 2001Apriori (Goh et al. 1998; Lee et al., 2006)

(2001) CIT(Equivalence Class)

CIT CIT

Step1 X={X1, X2, …, Xn} CIT(1) Xi

(Equivalence Class) EC={EC1, EC2, …, ECn} X

(2) CIT ECi

Step2 CIT(Decision

Equivalence Class) DECi

(Condition Equivalence Class) CECi CEC=EC-{DEC1,DEC2, …, DECn } CIT CIT

CECi DECi

Step3(1) Xi Xi = Xi

t (X1, X2, X3, …, Xi-1, , Xi+1, …, Xn ) t’ (X1, X2, X3, …, Xi-1 , Xi+1, …, Xn ) X1, X2, X3, …, Xi-1 , Xi+1, …, Xn

i R*B( ) R Xi =EC B( ) Byte

(2)

88

t (X1, …, Xj1-1, 1, Xj1+1, …, Xj2-1, 2, Xj2+1, …, Xjk-1, k, Xjk+1, …, Xn)

t (X1, …, Xj1-1, Xj1+1, …, Xj2-1 , Xj2+1, …, Xjk-1, Xjk+1, …, Xn)

X1, …, Xj1-1, Xj1+1, …, Xj2-1, Xj2+1, …, Xjk-1 , Xjk+1, …Xn i

ji R*k

1ii)(B R

B( i) i

Byte for i = 1 to k

XML2 XML

XMLXML Collections

Collection DTD (Parse)XML (2001)

XML Collection XML(Frequent Character Data Sets) (Frequent Tag Sets)

XML Collection XMLCollection

XML


2

DTD XML

XML (Hierarchical Structure) XML n-ary

( Document Tree)(Sub-element) (Parent-ChildNodes)

XDB XML (Native XML Database)XDB = {C1, C2, …, Cn} XDB n Colletions

Collection Ci XML X={X1, X2, …, Xm}XML DTD

Collection Ci DTD(Depth-First-Search DFS) XML Xi

Di-Tree (Root Node)(Tag Node) (Character Data Node)

1. XML2. q p p x

q x.yCase 1. q m {r1, r2, …, rm} r1

x.y.z1 r2 x.y.z2 Ym

90

x.y.zmCase 2. x.y

q x.y3(a) DTD XML

3(b) X1 XMLX1 “

” D1-Tree (Content) “” “1” X1 “ ”XML DTD “ ”

“ ” “1” Case 1 “ ”“ ”

“ ” “1.1” “ ”XML DTD “ ”

“ ” “1” Case 1 “ ”“ ”

“1” “ ” “1.1.1” “” “1.1.1”

XML 4 D1-Tree

XML“1”

“ ” “1” “ ”“2”

Collection XMLXML

5

3 (a)DTD

(b) XML


XML

XML

= { i1, i2, …, ik} { 1, 2, …, h}{ 1, 2, …, h} XML

EC XML

Definition 1 (Equivalence Class of Character Data)

EC = EC< i1, i2, …, ik> =

{< kiii ...,,,21 >| ji ij XML

for j = 1, 2, …, k}EC< 150> = {<2.2.3, 2.2.4>, <3.2.2,

3.2.3>} “ ” “150” 23 XML “ ” “2.2.3”

“3.2.2” “150” “2.2.4” “3.2.3”

4XML x1

D1-Tree 5

92

Definition 2 k

k (k 2)k

1 (k =1) 11 = < > 2 = < 80>

< > 1 <80> 2

Definition 3 (support)

kEC XML

|EC |

= < 80 >EC =EC< 80 > = {<1.2.2, 1.2.3>, <3.2.1, 3.2.2>}

XML 2|EC< 80 >| =2

Minimum-Support Threshold, min-sup

3.1(min-sup)

CF1 CS1

CF1 Table(1) Content 1(2) List XML

CS1 Table(1) Content 1(2) List

Apriori k (k2) CFk Table CFk Table

(1) Content k(k 2)(2) List

1 k(k 2)

CF1 Table p q p q Content


1 2 1 2 11 2 EC 1

EC 2 1 2 XMLEC 1 Most Significant Digit(msd)

EC 1 G1, G2, …, Gm1|Gm1| (say Gj j = 1, 2, …, m1)

msd(say j) 1 j XML EC 2G1’, G2’, …, Gm2’ msd(G1) = msd(G1’)

msd(Gh) = msd(Gh’) …msd(G ) = msd(G ’) h m1 h’ m2|EC< 1, 2>|= min{|G1|, |G1’|} + min{|G2|, |G2’|} + …+ min{|Gh|,

|Gh’|} |EC< 1, 2>| min-sup < 1, 2> 22

k(k 2)1 2min-sup=2

Content 1 2 1 = <bread>, 2 = <coke cola> EC 1 msd G1={1.2.1}, G2={2.2.3},

G3={4.2.2}(m1=3) EC 2 msd G1’={1.2.2}, G2’={2.2.1}(m2=2) Most Significant Digit msd(G1)=1,msd(G2)=2, msd(G3)=4, msd(G1’)=1, msd(G2’)=2 msd(G1)=msd(G1’),msd(G2)=msd(G2’) |EC< 1, 2>|=|EC<bread,coke cola>|= min{|G1|,|G1’|}+min{|G2|, |G2’|}=1+1=2 bread, coke cola

XML XMLmin-sup 2 2

CF2 Table Content “<bread, coke cola>” List “<1.2.1, 1.2.2>, <2.2.3, 2.2.1>”

2

1 CF1 1

Content List

bread 1.2.1, 2.2.3, 4.2.2

coke cola 1.2.2, 2.2.1

cheese 1.2.3, 4.2.1

2 CF2 2

Content List

<bread, coke cola> <1.2.1, 1.2.2>, <2.2.3, 2.2.1>

<bread, cheese> <1.2.1, 1.2.3>, <4.2.2, 4.2.1>

94

2 32

min-sup=2Contetnt 1 2 1=<bread, coke

cola>, 2=<bread, cheese> EC 1 msd G1={<1.2.1, 1.2.2>}, G2={<2.2.3, 2.2.1>}(m1=2) EC 2 msd G1’={<1.2.1,1.2.3>}, G2’={<4.2.2, 4.2.1>} most significant digitmsd(G1)=1, msd(G2)=2, msd(G1’)=1, msd(G2’)=4 msd(G1)=msd (G1’) |EC< 1, 2>|=|EC<bread, coke ke cola, cheese>|=min{|G1|, |G1’|}=1

<bread, coke cola, chees> XML1 min-up

XMLXML

= { i1, i2, …, ik} { 1, 2, …, h}

{ 1, 2, …, h} XML

EC XML

Definition 4 (Equivalence Class of Tag)

EC = EC< i1, i2, …, ik>{< 1i , 2i , …, ki >|

XML }

Definition 5 1

1 (k 2)1

Definition 6 (support)

1 ECXML

|EC |


(2001)

Metarule C Metarule T

CFk Table Content k(k 1)

t (P1, …, Pj1-1, 1, Pj1+1, …, Pj2-1, 2, Pj2+1, …, Pjk-1, n, Pjk+1, …, Pn)t (P1, …, Pj1-1, Pj1+1, …, Pj2-1, Pj2+1, …, Pjk-1, Pjk+1, …, Pn) List

P1, …, Pj1-1, Pj1+1, …, Pj2-1, Pj2+1, …Pj k-1, …Pjk+1, Pn DTD

i i

List

|List|*k

i 1

B( i) |CD| i List

B( i) i Byte for i =1, 2, ..., n i

XML

t1 , , , t1’ ,{1.1.1, 1.2.1}, {3.1.1, 3.2.1} “ ” “

” “ ” “ ”“ ” “ ” “ ”

“ ” “ ” “” “ ” “ ” “ ” “ ”

“{1.1.1, 1.2.1}, {3.1.1, 3.2.1}” List

TFk Table Content k (k 1 )

t (P1, …, Pj1-1, 1, Pj1+1, …, Pj2-1, 2, Pj2+1, …, Pjk-1, n, Pjk+1, …, Pn)

96

t (P1, …, Pj1-1, Pj1+1, …, Pj2-1, Pj2+1, …, Pjk-1, Pjk+1, …, Pn) ListP1, …, Pj1-1, Pj1+1, …, Pj2-1, Pj2+1, …Pj k-1, …Pjk+1, Pn DTD

i List

|List|*k

i 1B( i) |List| i

List B( i) i Byte for i =1, 2, ..., n

(Heuristic Compression Method)

Apriori

3“ ” “ ”

“{1.1.1, 1.2.1}, {3.1.1, 3.2.1}”“ ” “ ” CD “{1.1.1, 1.2.2}, {3.1.1, 3.2.3}, {4.1.1, 4.2.1}” 1 3

“ ” “1.1.1” “3.1.1”“RID 1”

“RID 2” “1.1.1” “3.1.1”“RID 2” CD “{4.1.1, 4.2.1}” “RID 2”

3

Compression Rules List

t1

t1’{1.1.1, 1.2.1}, {3.1.1, 3.2.1}

t2

t2’{1.1.1, 1.2.2}, {3.1.1, 3.2.3}, {4.1.1, 4.2.1}

1. (Least Penalty Heuristic Method; LPHM) XML


2. (Most Penalty Heuristic Method; MPHM) XML

3. (Maximal Compression Spaces Heuristic Method; MaxCSHM)

4. (Minimum Compression Spaces Heuristic Method; MinCSHM)

Java J2SDK 1.4.1Intel P4-2.4G 1.5GB

Microsoft Windows 2000 ProfessionalDTD

6 7 XML RAR ZIP

7 DTD IBM Almaden Research Center (2005) Assoc.gen

IBM 4Assoc.gen

XML XML

RARZIP (LPHM)

(Non-Heuristic Method)

98

4 Assoc.gen

T L_TLEN D NTRANS N L_NITEMS I PATLEN R CORR

6 DTD

7 DTD

8 1000 XML 10% 20% 30% 90%1 6

DTD T20. I10. N100. R1. 8

180% 90%

190% 74.18 %

XML30%

30%

<!ELEMENT transaction (itemset)> <!ELEMENT itemset (item+)>

<!ELEMENT item (#PCDATA)>

<!ELEMENT Sale (Header, itemset)> <!ELEMENT Header (SID, name, note)> <!ELEMENT SID(#PCDATA)> <!ELEMENT name(#PCDATA)> <!ELEMENT note(#PCDATA)> <!ELEMENT itemset (item+)> <!ELEMENT item (#PCDATA)>


(%)

(

%

)

8 1000 XML

5 XML6 DTD XML 2500 5000 7500 10000

T20. I10. N0.1k. R1 30%1 2500 XML

709,462 bytes 137,364bytes 709,462 bytes72,768 bytes

122,252 bytes 60,661 bytes78,230 bytes(709,462 + 72,768)

(78,230)/(846,826 + 122,252 + 60,661)= 75.96(%) 5000 750010000 XML

75%

100

5

2500 5000 7000 10000

XML (bytes) 846,826 1683,026 2,524,412 3,371,170 (bytes) 709,462 1,410,092 2,114,996 2,824,351

(bytes) 137,364 272,934 409,416 546,819 (bytes) 709,462 1,410,092 2,114,996 2,824,351 (bytes) 72,768 140,466 210,612 289,971 (bytes) 782,230 1,550,558 2,325,608 3,114,332 (bytes) 122,252 255,522 383,270 537,114

(bytes) 60,661 122,927 184,305 265,827 (%) 75.96 75.22 75.21 74.61

(%) 75.25

RAR ZIP

9 30% T20. I10. N100. R1. XML RAR ZIP

6 DTD1 A 2

1 BRAR ZIP

404550556065707580

2500 5000 7500 10000

XML

(

%

)

9 DTD RAR ZIP

10 30% T20. I10. N100. R1.7 DTD XML RAR


ZIP1 A 2

1 B RARZIP 9 10

RAR ZIP

40

45

50

5560

65

70

75

2500 5000 7500 10000

XML

10 DTD RAR ZIP

11 30% T20. I10. N100. R1. k(k 2) 1

XML 7DTD

10000 210 574.61%

102

0

30

60

90

120

150

180

210

240

1000 2500 5000 7500 10000

XML

11

12 T20. I10. N100. R1.30% k(k 2)

7 DTD(LPHM) k(k 2)

1000 2500 5000 7500 10000 XML

LPHM

1000 2500 5000 7500 10000

XML

k(

bytes

)

12


XML

(Lossless Compression)

XML

< > “ ”“ ”

XML75% ZIP RAR

(2001) 2001 —

R. Agrawal, T. Imielinski, and A. Swami(1993), “Mining Association Rules between Sets of Items in Large Databases,” Proceedings of the ACM SIGMOD Conference on Management of Data, Washington, D.C., 207-216.

R. Agrawal and R. Srikant (1994), “Fast Algorithms for Mining Association Rules,” Proceedings of the 20th International Conference on Very Large Data Bases (VLDB’94), 487-499.

E. Bertino and B. Catania (2001), “Integrating XML and Database,” IEEE Internet Computing, 5(4), pp. 84-88.

A. Cannane and H. E. Williams (2000), “A Compression Scheme for Large Databases,” Proceedings of the Australian Database Conference (ADC'2000), 22(2), 6-11.

S.W., Changchien and T.C., Lu (2001)“A New Efficient Association Rules Mining Method Using Class Inheritance Tree (CIT),” Proceedings of the 12th International Conference on Information Management (ICIM 2001), Twaiwn.

D. Florescu and D. Kossmann (1999), “Storing and Querying XML Data Using an RDBMS,” IEEE Data Engineering Bulletin, 22(3), 27-34.

104

J. Fong, H. K. Wong, and Z. Cheng (2003), “Converting Relational Database into XML Documents with DOM,” Information and Software Technology, 45, 335-355.

C. L. Goh, K. M. Aisaka, Tsukamoto, K. Harumoto, and S. Nishio (1998), “Database Compression with Data Mining Methods,” Proceedings of the 5th International Conference on Foundations of Data OrganiPation (FODO'98), 97-106.

J. Han and M. Kamber, (2001), Data Mining: Concepts and Techniques, Morgan Kaufmann.

J. W. Lee, K. Lee, and W. Kim (2001), “Preparations for Semantics-Based XML Mining,” Proceedings of the IEEE International Conference on Data Mining, 345-352.

C. F. Lee, S. W. Changchien, W. T.Wang, and J. J. Shen (2006): “A Data Mining Approach to Database Compression,” Information Systems Frontier (ISF), 8(3), 147-161.

M. Strobel (2002), “An XML Schema Representation for the Communication Design of Electronic Negotiations,” Computer Networks, 39, 661-680.

IBM Almaden Research Center (2005), Quest Synthetic Data Generation, http://www.almaden.ibm.com/software/quest/Resources/dataset/syndata.html.

World Wide Web Consortum (2005), “Extensible Markup Language (XML) Version 1.1,” http://www.w3.org/TR/2004/REC-xml11-20040204/.

[email protected]

s9214612@ cyut.edu.tw