An Empirical Evaluation of XML Compression Tools. Sherif Sakr, School of Computer Science and Engineering, University of New South Wales. 1st International Workshop on Benchmarking of XML and Semantic Web Applications (BenchmarX’09), 20 April 2009.

XML Compression Benchmark


DESCRIPTION

This paper presents an extensive experimental study of the state of the art in XML compression tools. The study reports the behavior of nine XML compressors on a large corpus of XML documents that covers the different natures and scales of XML documents. In addition to assessing and comparing the performance characteristics of the evaluated tools, the study assesses the effectiveness and practicality of using them in the real world. Finally, we provide guidelines and recommendations to help developers and users select the most suitable XML compression tool for their needs.


An Empirical Evaluation of XML Compression Tools

Sherif Sakr

School of Computer Science and Engineering, University of New South Wales

1st International Workshop on Benchmarking of XML and Semantic Web Applications (BenchmarX’09)

20 April 2009

S. Sakr (CSE, UNSW) BenchmarX’09 20 April 2009 1 / 20


XML Compression: Why?

XML has become a popular standard with many useful applications.

XML is often referred to as self-describing data.

On the one hand, this self-describing feature gives XML great flexibility.

On the other hand, it introduces the main problem of verbosity.

XML compression has many advantages such as:

Reducing the network bandwidth required for data exchange.

Reducing the disk space required for storage.

Minimizing the main memory requirements of processing and querying XML documents.


XML Compressors: Classifications I

With respect to the awareness of the structure of the XML documents:

General Text Compressors: They are XML-blind: they treat XML documents as ordinary plain text and apply traditional text compression techniques.

XML Conscious Compressors: They are designed to take advantage of their awareness of the XML document structure to achieve better compression ratios than the general text compressors.

Schema dependent compressors: Both the encoder and the decoder must have access to the document schema information to carry out the compression process. They are not commonly used in practice.

Schema independent compressors: Schema information is not required for the encoding and decoding processes.


XML Compressors: Classifications II

With respect to the ability of supporting queries:

Non-Queriable (Archival) XML Compressors: They do not allow any queries to be processed over the compressed format. Their main focus is achieving the highest compression ratio.

By default, general purpose text compressors belong to the non-queriable group of compressors.

Queriable XML Compressors: They allow queries to be processed over their compressed formats. The compression ratio is usually worse than that of the archival XML compressors. Their main focus is avoiding full document decompression during query execution.

The ability to query compressed XML formats directly is important for many applications hosted on resource-limited computing devices such as mobile devices and GPS systems.

By default, all queriable compressors are XML conscious compressors as well.


Examination Criteria

In our study we considered, to the best of our knowledge, all XML compressors that fulfil the following conditions:

Is publicly and freely available, either as open source code or as a binary version.

Is schema-independent.

Is able to run under our Linux operating system versions:

Ubuntu 7.10 (Linux 2.6.20 Kernel)

Ubuntu 7.10 (Linux 2.6.22 Kernel)


Examined Compressors

XML Compressors List

Compressor        Features  Code Available    Compressor  Features  Code Available
GZIP (1.3.12)     GAI       Y                 XGrind      SQI       Y
BZIP2 (1.0.4)     GAI       Y                 XBzip       SQI       N
PPM (j.1)         GAI       Y                 XQueC       SQI       N
XMill (0.7)       SAI       Y                 XCQ         SQI       N
XMLPPM (0.98.3)   SAI       Y                 XPress      SQI       N
SCMPPM (0.93.3)   SAI       Y                 XQzip       SQI       N
XWRT (3.2)        SAI       Y                 XSeq        SQI       N
Exalt (0.1.0)     SAI       Y                 QXT         SQI       N
AXECHOP           SAI       Y                 ISX         SQI       N
DTDPPM            SAD       Y                 XAUST       SAD       Y
rngzip            SQD       Y                 Millau      SAD       N

Symbols list of XML compressors features

Symbol  Description                    Symbol  Description
G       General Text Compressor        S       Specific XML Compressor
D       Schema dependent Compressor    I       Schema Independent Compressor
A       Archival XML Compressor        Q       Queriable XML Compressor


Testing Corpus (Data Sets)

Determining the XML files that should be used for evaluating the set of XML compression tools is not a simple task.

The documents of our corpus are classified into four categories:

Regular documents: Regular document structure and short data contents. They reflect the XML view of relational data. The data ratio of these documents is between 40% and 60%.

Irregular documents: Very deep, complex and irregular structure. More challenging in terms of compression efficiency.

Textual documents: Simple structure, with a high proportion of the contents devoted to data values. The data contents of these documents represent more than 70% of the document size.

Structural documents: No data contents at all. 100% of each document's size is devoted to its structure information. They are used to assess the claim that XML conscious compressors exploit the well-known structure of XML documents to achieve higher compression ratios on the structural parts of XML documents.
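The data ratio and the structural documents described above can be illustrated with a small sketch. This is not the procedure used in the study, which the slides do not describe; the function names `data_ratio` and `structural_copy` are hypothetical, whitespace-only text is counted as structure, and attribute names are kept with emptied values as an assumption.

```python
import xml.etree.ElementTree as ET


def data_ratio(xml_text: str) -> float:
    """Approximate share of the document size taken by data values."""
    root = ET.fromstring(xml_text)
    data_bytes = 0
    for elem in root.iter():
        # Element text and tail text both count as data when non-blank.
        for text in (elem.text, elem.tail):
            if text and text.strip():
                data_bytes += len(text.strip().encode("utf-8"))
        # Attribute values are treated as data, attribute names as structure.
        for value in elem.attrib.values():
            data_bytes += len(value.encode("utf-8"))
    return data_bytes / len(xml_text.encode("utf-8"))


def structural_copy(xml_text: str) -> str:
    """Return a copy that keeps the element structure and drops all data values."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        elem.text = None
        elem.tail = None
        for name in list(elem.attrib):
            elem.attrib[name] = ""
    return ET.tostring(root, encoding="unicode")


doc = "<a><b>hello</b><c/></a>"
print(data_ratio(doc))                   # 5 data bytes out of 23
print(data_ratio(structural_copy(doc)))  # no data left in the structural copy
```

Under these assumptions a structural copy always has a data ratio of 0, which matches the description of the structural documents above.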


Testing Corpus (Data Sets)

Data Set Name  Document Name               Size (MB)  Tags  Number of Nodes  Depth  Data Ratio
EXI            Telecomp.xml                0.65       39    651398           7      0.48
EXI            Weblog.xml                  2.60       12    178419           3      0.31
EXI            Invoice.xml                 0.93       52    78377            7      0.57
EXI            Array.xml                   22.18      47    1168115          10     0.68
EXI            Factbook.xml                4.12       199   104117           5      0.53
EXI            Geographic Coordinates.xml  16.20      17    55               3      1
XMark          XMark1.xml                  11.40      74    520546           12     0.74
XMark          XMark2.xml                  113.80     74    5167121          12     0.74
XMark          XMark3.xml                  571.75     74    25900899         12     0.74
XBench         DCSD-Small.xml              10.60      50    6190628          8      0.45
XBench         DCSD-Normal.xml             105.60     50    6190628          8      0.45
XBench         TCSD-Small.xml              10.95      24    831393           8      0.78
XBench         TCSD-Normal.xml             106.25     24    8085816          8      0.78
Wikipedia      EnWikiNews.xml              71.09      20    2013778          5      0.91
Wikipedia      EnWikiQuote.xml             127.25     20    2672870          5      0.97
Wikipedia      EnWikiSource.xml            1036.66    20    13423014         5      0.98
Wikipedia      EnWikiVersity.xml           83.35      20    3333622          5      0.91
Wikipedia      EnWikTionary.xml            570.00     20    28656178         5      0.77
DBLP           DBLP.xml                    130.72     32    4718588          5      0.58
U.S House      USHouse.xml                 0.52       43    16963            16     0.77
SwissProt      SwissProt.xml               112.13     85    13917441         5      0.60
NASA           NASA.xml                    24.45      61    2278447          8      0.66
Shakespeare    Shakespeare.xml             7.47       22    574156           7      0.64
Lineitem       Lineitem.xml                31.48      18    2045953          3      0.19
Mondial        Mondial.xml                 1.75       23    147207           5      0.77
BaseBall       BaseBall.xml                0.65       46    57812            6      0.11
Treebank       Treebank.xml                84.06      250   10795711         36     0.70
Random         Random-R1.xml               14.20      100   1249997          28     0
Random         Random-R2.xml               53.90      200   3750002          34     0
Random         Random-R3.xml               97.85      300   7500017          30     0


Testing Environments

To ensure the consistency of the performance behaviors of the evaluated XML compressors, we ran our experiments on two different environments.

One environment has high computing resources and the other has considerably limited computing resources.

      High Resources Setup                 Limited Resources Setup
OS    Ubuntu 7.10 (Linux 2.6.22 Kernel)    Ubuntu 7.10 (Linux 2.6.20 Kernel)
CPU   Intel Core 2 Duo E6850               Intel Pentium 4
      3.00 GHz, FSB 1333MHz                2.66GHz, FSB 533MHz
      4MB L2 Cache                         512KB L2 Cache
HD    Seagate ST3250820AS - 250 GB         Western Digital WD400BB - 40 GB
RAM   4 GB                                 512 MB


Performance Criteria

We measure and compare the performance of the XML compression tools using the following metrics:

Compression Ratio: represents the ratio between the sizes of the compressed and uncompressed XML documents.

Compression Ratio = (Compressed Size) / (Uncompressed Size)

Compression Time: represents the elapsed time during the compression process.

Decompression Time: represents the elapsed time during the decompression process.

For all metrics: the lower the metric value, the better the compressor.
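As a rough illustration of the three metrics, the sketch below measures them for two general purpose compressors from the Python standard library. It is an assumption-laden stand-in: the study ran command-line tools driven by shell and Perl scripts, whereas this calls in-process library functions, and the helper name `measure` and the sample document are hypothetical.

```python
import bz2
import gzip
import time


def measure(compress, decompress, data: bytes) -> dict:
    """Return compression ratio and elapsed times for one compressor."""
    start = time.perf_counter()
    packed = compress(data)
    compression_time = time.perf_counter() - start

    start = time.perf_counter()
    restored = decompress(packed)
    decompression_time = time.perf_counter() - start

    if restored != data:
        raise ValueError("round trip did not restore the input")
    return {
        # Compressed size divided by uncompressed size: lower is better.
        "compression_ratio": len(packed) / len(data),
        "compression_time": compression_time,
        "decompression_time": decompression_time,
    }


# Hypothetical, highly repetitive sample; real documents compress less well.
sample = b"<item><name>widget</name><price>9.99</price></item>" * 2000
compressors = {
    "gzip": (gzip.compress, gzip.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
}
for name, (compress, decompress) in compressors.items():
    print(name, measure(compress, decompress, sample))
```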


Experimental Framework

We evaluated 11 XML compressors: 3 general purpose text compressors and 8 XML conscious compressors.

Our corpus consists of 57 documents: 27 original documents, 27 structural copies and 3 randomly generated structural documents.

We ran the experiments on two different platforms.

For each combination of an XML test document and an XML compressor, we ran two different operations (compression and decompression).

To ensure accuracy, all reported numbers for our time metrics are the average of five executions with the highest and the lowest values removed.

We created our own mix of Unix shell and Perl scripts to run and collect the results of this large number of runs.

The web page of this study provides access to the test files, the examined XML compressors and the detailed results of this study: http://xmlcompbench.sourceforge.net/
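The timing rule stated above (five executions, highest and lowest removed, remainder averaged) can be sketched as follows. The function name `trimmed_average` and the sample timings are hypothetical; the sketch assumes that exactly one highest and one lowest value are dropped, leaving three values to average.

```python
def trimmed_average(timings):
    """Average the timings after dropping one lowest and one highest value."""
    if len(timings) < 3:
        raise ValueError("need at least three timings")
    kept = sorted(timings)[1:-1]
    return sum(kept) / len(kept)


# Made-up timings in seconds, for illustration only.
runs = [5.0, 1.0, 2.0, 3.0, 9.0]
print(trimmed_average(runs))  # averages 2.0, 3.0 and 5.0
```

Dropping the extremes makes the reported time less sensitive to a single slow run caused by unrelated system activity.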


Experimental Results: Detailed Compression Ratios of Structural Documents

[Figure: compression ratio (y-axis, 0.00 to 0.10) of each compressor for each structural document (x-axis, BaseBall through XMark3, followed by Random-R1 to Random-R3). Legend: Bzip2, Gzip, PPM, XMillBzip2, XMillGzip, XMillPPM, XMLPPM, XWRT, SCMPPM, Exalt, Axechop.]


Experimental Results: Detailed Compression Ratios of Original Documents

[Figure: compression ratio (y-axis, 0.00 to 0.50) of each compressor for each original document (x-axis, BaseBall through XMark3). Legend: Bzip2, Gzip, PPM, XMillBzip2, XMillGzip, XMillPPM, XMLPPM, XWRT, SCMPPM, Exalt, Axechop.]


Experimental Results : Average Compression Ratios

[Figure (a) Structural documents: average compression ratio (y-axis, 0.00 to 0.07) per compressor. x-axis, left to right: XMillPPM, XMillGzip, XMillBzip2, XMLPPM, Bzip2, Exalt, Axechop, Gzip, SCMPPM, PPM, XWRT.]

[Figure (b) Original documents: average compression ratio (y-axis, 0.12 to 0.24) per compressor. x-axis, left to right: SCMPPM, XMLPPM, XWRT, XMillBzip2, XMillPPM, Bzip2, PPM, XMillGzip, Gzip.]


Overall Performance of XML Compressors

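[Figure: overall performance of each compressor on three metrics (x-axis groups: Compression Ratio, Compression Time, Decompression Time; y-axis, 0 to 7). Legend: Bzip2, Gzip, PPM, XMillBzip2, XMillGzip, XMillPPM, XMLPPM, XWRT, SCMPPM.]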


Proposed Ranking of XML Compressors

The results of our experiments have not shown a clear winner.

Different ranking methods and different weights for the factors could be used for this task. Deciding the weight of each metric depends mainly on the scenarios and requirements of the applications where these compression tools are used.

We used three ranking functions which give different weights to our performance metrics:

WF1 = (1/3 * CR) + (1/3 * CT) + (1/3 * DCT)
WF2 = (1/2 * CR) + (1/4 * CT) + (1/4 * DCT)
WF3 = (3/5 * CR) + (1/5 * CT) + (1/5 * DCT)

where CR represents the compression ratio metric, CT represents the compression time metric and DCT represents the decompression time metric.


Proposed Ranking of XML Compressors

[Figure: score of each compressor under the ranking functions WF1, WF2 and WF3 (x-axis groups; y-axis, 0.0 to 3.0). Legend: Bzip2, Gzip, PPM, XMillBzip2, XMillGzip, XMillPPM, XMLPPM, XWRT, SCMPPM.]


Conclusions

The primary innovation in XML compression mechanisms was introduced in XMill: separating the structural part of the XML document from the data part and then grouping related data items into homogeneous containers that can be compressed separately. Most subsequent XML compressors have emulated this idea in different ways.

The dominant practice in most XML compressors is to use the well-known structure of XML documents to apply a pre-processing encoding step and then forward the results of this step to general purpose compressors.

There are no solid, publicly available implementations of grammar-based XML compression techniques or of queriable XML compressors.

The authors of XML compressors should pay more attention to making the source code of their implementations available.
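The separation idea attributed to XMill in the first conclusion can be illustrated with a toy sketch. It is not XMill's actual format or algorithm: the function names are hypothetical, containers are keyed simply by element tag name, zlib stands in for the back-end compressor, and attributes, tail text and mixed content are ignored.

```python
import xml.etree.ElementTree as ET
import zlib
from collections import defaultdict


def split_structure_and_data(xml_text):
    """Separate a document into a structure stream and per-tag data containers."""
    root = ET.fromstring(xml_text)
    structure = []
    containers = defaultdict(list)

    def visit(elem):
        structure.append("<" + elem.tag)
        text = (elem.text or "").strip()
        if text:
            # "#" marks the position of a value stored in a container.
            structure.append("#")
            containers[elem.tag].append(text)
        for child in elem:
            visit(child)
        structure.append(">")

    visit(root)
    return structure, dict(containers)


def compress_parts(structure, containers):
    """Compress the structure stream and each container independently."""
    packed = {"structure": zlib.compress("\n".join(structure).encode("utf-8"))}
    for tag, values in containers.items():
        packed["data:" + tag] = zlib.compress("\n".join(values).encode("utf-8"))
    return packed


doc = "<r><p><n>a</n><y>1</y></p><p><n>b</n><y>2</y></p></r>"
structure, containers = split_structure_and_data(doc)
print(containers)
print(sorted(compress_parts(structure, containers)))
```

Grouping values that share a tag puts similar data next to each other, which is the property a general purpose back-end compressor can then exploit.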


Conclusions

We believe that this paper could be valuable both for the developers of new XML compression tools and for interested users.

Developers can use the results of this paper to decide which points can be improved in order to make an effective contribution.

We recommend tackling the development of stable, efficient queriable XML compressors. Although a lot of literature has been presented in this domain, efficient, scalable and stable implementations are still missing.

For users, this study could help in selecting a suitable compressor for their requirements.

For users who need the highest compression ratio, the results of our experiments recommend either the PPM compressor or the XWRT compressor, each with its compression parameter set to the highest level.

For users who need the fastest compression time with a moderate compression ratio, gzip and XMillGzip are considered the best choices.


The End

Thank You
