9
VCODEX: A DATA COMPRESSION PLATFORM Kiem-Phong Vo AT&T Labs, Shannon Laboratory, 180 Park Avenue, Florham Park, NJ 07932, USA Keywords: Data compression, transformation. Abstract: Vcodex is a platform to compress and transform data. A standard interface, data transform, is defined to rep- resent any algorithm or technique to encode data. Although primarily geared toward data compression, a data transform can perform any type of processing including encryption, portability encoding and others. Vcodex provides a core set of data transforms implementing a wide variety of compression algorithms ranging from general purpose ones such as Huffman or Lempel-Ziv to structure-driven ones such as reordering fields and columns in relational data tables. Such transforms can be reused and composed together to build more com- plex compressors. An overview of the software and data architecture of Vcodex will be presented. Examples and experimental results show how compression performance beyond traditional approaches can be achieved by customizing transform compositions based on data semantics. 1 INTRODUCTION Modern business systems routinely manage huge amounts of data. For example, the daily volumes of billing and network management data generated by an international communication company involve tens to hundreds of gigabytes (Fowler et al., 2004). Further, certain of such data is required by laws to be kept on- line for several years. Thus, data compression is a critical component in these systems. Much of compression research has traditionally been focused on developing general purpose tech- niques which treat data as unstructured streams of bytes. Algorithms based primarily on either pattern matching, statistical analysis or some combination thereof (Huffman, 1952; Jones, 1991; Witten, 1987; Ziv and Lempel, 1977; Ziv and Lempel, 1978) de- tect and remove different forms of information re- dundancy in data. Well-known compression tools such as the ubiquitous Unix Gzip or Windows Winzip are based on such general purpose techniques, for example, the Lempel-Ziv compression method (Ziv and Lempel, 1977) and Huffman coding (Huffman, 1952). A recently popular compressor, Bzip2 (Se- ward, 1994), first reorders data by the famous BWT or Burrows-Wheeler Transform (Burrows and Wheeler, 1994) to induce better compressibility before passing it to other compressors. Structures in data often provide a good source of information redundancy that can be exploited to improve compression performance. For example, columns or fields in relational data tend to be sparse and may share non-trivial relationships. Buchsbaum et al (Buchsbaum et al., 2003) developed the Pzip technique that assumes some conventional compres- sor such as Gzip and groups table columns to improve their compressibility by such a compressor. Vo and Vo (Vo and Vo, 2004; Vo and Vo, 2006) took a dif- ferent approach to the same problem and showed how to use dependency relations among table columns to rearrange data for better compressibility. For a dif- ferent class of data, namely XML, the Xmill com- pressor (Liefke and Suciu, 2000) works in a similar spirit to Pzip by grouping parts of a document with the same tags to be compressed by some conventional compressor. In practice, data often consist of ad-hoc mix- tures of structures beyond just simple tables or sin- gle XML documents. For example, a log file gen- erated by some network management system might contain many records pertaining to different network events. Due to their common formats, the records belonging to a particular event might form a table. As such, the compressibility of such a log file could be enhanced by grouping records by event types first before compressing each group by one of the above 81 Vo K. (2007). VCODEX: A DATA COMPRESSION PLATFORM. In Proceedings of the Second International Conference on Software and Data Technologies - SE, pages 81-89 DOI: 10.5220/0001344600810089 Copyright c SciTePress

VCODEX: A DATA COMPRESSION PLATFORM · Thus, data compression is a critical component in these systems. Much of compression research has traditionally been focused on developing general

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: VCODEX: A DATA COMPRESSION PLATFORM · Thus, data compression is a critical component in these systems. Much of compression research has traditionally been focused on developing general

VCODEX: A DATA COMPRESSION PLATFORM

Kiem-Phong VoAT&T Labs, Shannon Laboratory, 180 Park Avenue, Florham Park, NJ 07932, USA

Keywords: Data compression, transformation.

Abstract: Vcodex is a platform to compress and transform data. A standard interface, data transform, is defined to rep-resent any algorithm or technique to encode data. Although primarily geared toward data compression, a datatransform can perform any type of processing including encryption, portability encoding and others. Vcodexprovides a core set of data transforms implementing a wide variety of compression algorithms ranging fromgeneral purpose ones such as Huffman or Lempel-Ziv to structure-driven ones such as reordering fields andcolumns in relational data tables. Such transforms can be reused and composed together to build more com-plex compressors. An overview of the software and data architecture of Vcodex will be presented. Examplesand experimental results show how compression performance beyond traditional approaches can be achievedby customizing transform compositions based on data semantics.

1 INTRODUCTION

Modern business systems routinely manage hugeamounts of data. For example, the daily volumes ofbilling and network management data generated by aninternational communication company involve tens tohundreds of gigabytes (Fowler et al., 2004). Further,certain of such data is required by laws to be kept on-line for several years. Thus, data compression is acritical component in these systems.

Much of compression research has traditionallybeen focused on developing general purpose tech-niques which treat data as unstructured streams ofbytes. Algorithms based primarily on either patternmatching, statistical analysis or some combinationthereof (Huffman, 1952; Jones, 1991; Witten, 1987;Ziv and Lempel, 1977; Ziv and Lempel, 1978) de-tect and remove different forms of information re-dundancy in data. Well-known compression toolssuch as the ubiquitous Unix Gzip or Windows Winzipare based on such general purpose techniques, forexample, the Lempel-Ziv compression method (Zivand Lempel, 1977) and Huffman coding (Huffman,1952). A recently popular compressor, Bzip2 (Se-ward, 1994), first reorders data by the famous BWT orBurrows-Wheeler Transform (Burrows and Wheeler,1994) to induce better compressibility before passingit to other compressors.

Structures in data often provide a good sourceof information redundancy that can be exploited toimprove compression performance. For example,columns or fields in relational data tend to be sparseand may share non-trivial relationships. Buchsbaumet al (Buchsbaum et al., 2003) developed the Pziptechnique that assumes some conventional compres-sor such as Gzip and groups table columns to improvetheir compressibility by such a compressor. Vo andVo (Vo and Vo, 2004; Vo and Vo, 2006) took a dif-ferent approach to the same problem and showed howto use dependency relations among table columns torearrange data for better compressibility. For a dif-ferent class of data, namely XML, the Xmill com-pressor (Liefke and Suciu, 2000) works in a similarspirit to Pzip by grouping parts of a document withthe same tags to be compressed by some conventionalcompressor.

In practice, data often consist of ad-hoc mix-tures of structures beyond just simple tables or sin-gle XML documents. For example, a log file gen-erated by some network management system mightcontain many records pertaining to different networkevents. Due to their common formats, the recordsbelonging to a particular event might form a table.As such, the compressibility of such a log file couldbe enhanced by grouping records by event types firstbefore compressing each group by one of the above

81Vo K. (2007).VCODEX: A DATA COMPRESSION PLATFORM.In Proceedings of the Second International Conference on Software and Data Technologies - SE, pages 81-89DOI: 10.5220/0001344600810089Copyright c© SciTePress

Page 2: VCODEX: A DATA COMPRESSION PLATFORM · Thus, data compression is a critical component in these systems. Much of compression research has traditionally been focused on developing general

table compressors. Unfortunately, from the point ofview of building compression tools, it is hard, if notimpossible, to automatically detect and take advan-tage of such ad hoc structures in data. Nonethe-less, this points out that, in each particular circum-stance, optimum compression performance may re-quire the ability to form arbitrary compositions ofdiverse data transformations, both generic and data-specific. Indeed, this ability is implicitly needed evenin implementing general purpose compressors such asGzip and Bzip2 that combine distinct techniques deal-ing with different aspects of information redundancy.Thus, a major challenge to compression research aswell as to practical use of compression on large scaledata is to provide a way to construct, reuse and in-tegrate different data transformation and compressiontechniques to suit particular data semantics. This isthe problem that the Vcodex platform addresses.

Central to Vcodex is the notion of data transformsor software components to alter data in some invert-ible ways. As it is beyond the scope of this paper,we only mention that by taking this general approachto data transformation Vcodex can also accommodatetechniques beyond compression such as encryption,translation between character sets or just simple cod-ing of binary data for portability. For maximum us-ability, two main issues must be addressed:

• Defining a standard software interface for datatransforms: A standardized software interfacehelps transform users as it eases implementationof applications and ensures that application coderemains stable when algorithms change. Trans-form developers also benefit as a standard inter-face simplifies and encourages reuse of existingtechniques in implementing new ones.

• Defining a self-describing and portable standarddata format: As new data transforms may becontinually developed by independent parties, itis important to have a common data format thatcan accommodate arbitrary composition of trans-forms. Such a format should allow receivers ofdata to decode them without having to know howthey were encoded. In addition, encoded datashould be independent of OS and hardware plat-forms so that they can be easily transported andshared.

The rest of the paper gives an overview of theVcodex platform and how it addresses the abovesoftware and data issues. Experiments based onthe well-known Canterbury Corpus (Bell and Powell,2001) for testing data compressors shows that Vcodexcould far outperform conventional compressors suchas Gzip or Bzip2.

2 SOFTWARE ARCHITECTURE

Vcodex is written in the C language in a style com-patible with C++. The platform is divided into twomain layers. The base layer is a software librarydefining and providing a standard software interfacefor data transforms. This layer assumes that data areprocessed in segments small enough to fit entirely inmemory. The library can be used directly by applica-tions written in C or C++ in the same way that other Cand C++ libraries are used. A command tool Vczip iswritten on top of the library layer to enable file com-pression without low-level programming. Large filesare handled by being broken into chunks with suitablesizes for in-memory processing.

2.1 Library Design

Transformation handle and operationsVcodex t – maintaining contexts and states

vcopen(), vcclose(), vcapply(), ...

Discipline structure Data transformsVcdisc t Vcmethod t

Parameters for data processing, Burrows-Wheeler, Huffman,Event handler Delta compression, Table transform, etc.

Figure 1: A discipline and method library architecture forVcodex.

Figure 1 summarizes the design of the baseVcodex library which was built in the style of the Dis-cipline and Method library architecture (Vo, 2000).The top part shows that a transformation handle oftype Vcodex t provides a holding place for contextsused in transforming different data types as well asstates retained between data transformation calls. Avariety of functions can be performed on such han-dles. The major ones are vcopen() and vcclose()for handle opening and closing and vcapply() fordata encoding or decoding. A transformation han-dle is parameterized by an optional discipline struc-ture of type Vcdisc t and a required data transformof type Vcmethod t. Each discipline structure is sup-plied by the application and provides additional infor-mation about the data to be processed. On the otherhand, a data transform is selected from a predefinedset and specifies the transformation technique. Sec-tion 2.2 describes a few common data transforms suchas Burrows-Wheeler, Huffman, etc. Complex datatransformations can be composed from simpler onesby passing existing handles into newly opened han-dles for additional processing.

Figure 2 shows an example to construct and use atransformation handle for delta compression by com-posing two data transforms: Vcdelta and Vchuffman(Section 2.2). Here, delta compression (Hunt et al.,1998) is a general technique to compress a target

ICSOFT 2007 - International Conference on Software and Data Technologies

82

Page 3: VCODEX: A DATA COMPRESSION PLATFORM · Thus, data compression is a critical component in these systems. Much of compression research has traditionally been focused on developing general

1. typedef struct _vcdisc_s2. { void* data;3. ssize_t size;4. Vcevent_f eventf;5. } Vcdisc_t;6. Vcdisc_t disc = { "Source data to compare against", 30, 0 };

7. Vcodex_t* huff = vcopen(0, Vchuffman, 0, 0, VC_ENCODE);8. Vcodex_t* diff = vcopen(&disc, Vcdelta, 0, huff, VC_ENCODE);

9. ssize_t cmpsz = vcapply(diff, "Target data to compress", 23, &cmpdt);

Figure 2: An example of delta compression.

data given a related source data. It is often used toconstruct software patches or to optimize storage inrevision control systems (Tichy, 1985). The trans-form Vcdelta implements a generalized Lempel-Zivparsing technique for both delta and normal compres-sion whose data format was the subject of the IETF(Internet Engineering Task Force) VCDIFF ProposedStandard RFC3284 (Korn et al., 2002; Korn and Vo,2002).

• Lines 1–6 show the type of a discipline structureand disc, an instance of it. The fields data andsize of disc provide any source data available tocompare against in delta compression. Even if nosource data is given, the target data would stillbe compressed in the usual Lempel-Ziv fashion,i.e., by factoring out repeating patterns. The fieldeventf of disc is optional and, if used, specifies afunction to process events such as handle openingand closing.

• Lines 7 and 8 show handle creation and composi-tion. The Huffman coding handle huff is createdfirst. The first and second arguments to vcopen()specify the optional discipline structure, e.g., discon line 8, and the selected data transform, e.g.,Vchuffman on line 7. Variants or modes of a trans-form can be given in the third argument (see theexamples in Figure 5). The fourth argument isused to compose a transform with another one.For example, on line 8, the handle huff is usedto further compress the output of diff. The lastargument of vcopen() must be one of VC ENCODEor VC DECODE to tell if the handle is opened for en-coding or decoding.

• Line 9 gives an example of calling vcapply() tocompress some given data. The compressed resultis returned in cmpdt while its size is returned bythe call. A transform such as Vcdelta creates mul-tiple output data segments coding different typesof data (Section 3). Each such segment wouldbe separately passed to the follow-on transforma-tion handle, in this case, huff. As such, huff cangenerate the appropriate Huffman codes to matchwith each segment and improve compression per-formance. Note also that the source data in thediscipline structure disc could be changed before

each vcapply() call to match with the data to becompressed.

The above example shows the encoding side.The decoding side is exactly the same except thatVC DECODE replaces VC ENCODE and the data passedto vcapply() would be some previously compresseddata. An advantage in being able to write applica-tion code based on a standard interface as provided isthat algorithmic enhancement of the data transform inuse does not require changes at the application level.In addition, experimentation with different data trans-forms only need minimal code editing.

2.2 Data Transforms

1. typedef struct _vcmethod_s2. { ssize_t (*encodef)(Vcodex_t*, void*, ssize_t, void**);3. ssize_t (*decodef)(Vcodex_t*, void*, ssize_t, void**);4. int (*eventf)(Vcodex_t*, int, void*);5. char* name;6. ...7. } Vcmethod_t;

Figure 3: The type of a data transform.

Figure 3 shows Vcmethod t, the type of a datatransform. Lines 2 and 3 show that each transformprovides two functions for encoding and decoding.An optional event handling function eventf, if pro-vided, is used to process certain events. For example,some transforms maintain states throughout the life-time of a handle. The structures to keep such stateswould be created or deleted at handle opening andclosing via such event handling functions. Each trans-form also has a name that uniquely identifies it amongthe set of all available transforms. This name is en-coded in the output data (Section 3) to make the en-coded data self-describing.

Vcodex provides a large collection of trans-forms for building efficient compressors of generaland structured data including a number that areapplication-specific. Below are brief overviews of afew important data transforms:

• Vcdelta: This implements a delta compres-sor based on a generalized Lempel-Ziv parsingmethod. It outputs data in the Vcdiff encodingformat as described in the IETF Proposed Stan-dard RFC3284 (Korn et al., 2002).

• Vcsieve: This transform used approximatematching to perform delta compression as wellas compression of genetic sequences, a class ofdata notoriously difficult to compress (Manziniand Rastero, 2004).

• Vcbwt: This implements the Burrows-Wheelertransform.

VCODEX: A DATA COMPRESSION PLATFORM

83

Page 4: VCODEX: A DATA COMPRESSION PLATFORM · Thus, data compression is a critical component in these systems. Much of compression research has traditionally been focused on developing general

• Vctranspose: This treats a dataset as a table, i.e.,a two-dimension array of bytes, and transposes it.

• Vctable: This uses column dependency (Vo andVo, 2006) to transform table data and enhancetheir compressibility.

• Vcmtf: This provides the move-to-front data trans-form (Bentley et al., 1986) and a predictive vari-ant of it. The predictive variant uses a heuristicalgorithm that keeps track of how often a charac-ter follows another and, when favorable, moves itto the front as soon as its predecessor is encoded.

• Vcrle: This provides the run-length encoder anda variant that only encodes runs of zeros using aspecial binary coding method (Deorowicz, 2000).

• Vchuffman: This implements static Huffman cod-ing.

• Vchuffgroup: This is implemented on top ofVchuffman. It divides data into short segments ofequal length and groups segments compress welltogether with a single Huffman code table.

• Vcama: This collects records with the same lengthtogether to form tables that could then be com-pressed via Vctable. It was originally written todeal with an arcane data format specific to tele-phone data called AMA, hence the name. But itcan also handle other general record types.

• Vcmap: This provides a collection of methods tomap data from one character set to another. Forexample, different variants translate data betweenASCII and various versions of EBCDIC.

2.3 The Vczip Command

The Vczip command enables usage of the availabledata transforms without low level programming. Itprovides a syntax to compose different transforms to-gether to process a data file. In this way, end users canexperiment with different combinations of transformson their data to optimize compressibility.

Figure 4 shows the transform composition syntax.Lines 1 and 2 show that each transform is specified bya name, e.g., delta or table, followed by an optionalsequence of arguments depending on the particulartransform. Lines 3 and 4 show that such argumentsare separated by periods. Lines 5 and 6 show that thelist of transforms can be empty to signify usage of adefault compression method defined by Vczip. Butif not, it must start with -m and follow by a comma-separated sequence of transforms.

Figure 5 shows a snapshot of an experiment toexplore different ways to compress data using Gzip,

Bzip2 and various compositions of Vcodex data trans-forms. The file kennedy.xls was an Excel spreadsheettaken from the Canterbury Corpus (Bell and Powell,2001).

• Lines 1–6 show the raw size of kennedy.xls andcompression results by Gzip and Bzip2. Gzipcompressed the data by roughly a factor of 5 whileBzip2 achieved a factor of 8 to 1.

• Lines 7–10 show Vczip instantiated as differentvariations of a BWT compressor. The 0 argumentto the run length encoding transform rle meantthat only runs of 0’s were coded. The same argu-ment to the move-to-front transform mtf restrictedit to moving a data byte only after access. As thatwas analogous to Bzip2, about the same perfor-mance resulted. On the other hand, line 9 useda variant that aggressively predicted and moveddata (Section 2.2) resulting in a 12 to 1 compres-sion factor.

• Lines 11-14 show the use of the table transformVctable (Vo and Vo, 2006) to treat the data as a2-dimensional array of bytes and reorder data bycolumn dependency before running same back-end data transforms as earlier. The compressionfactor improved to about 19 to 1 with that. Insert-ing the Burrows-Wheeler transform after the tabletransform further improved the compression fac-tor to nearly 30 to 1.

• Line 15 shows that the compressed data was de-coded into a file x and compared against the orig-inal data to test correctness. As shown, decom-pression would always be done with the option -u,i.e., no knowledge of transforms used on encodingrequired.

Figure 6 shows another experiment to compressa large file of telephone data in a proprietary formatcalled AMA. This type of data consists of records inwhich the first 4 bytes of a record tell the length ofthe record. The Vcama transform collected records ofthe same length together before passing them to thecolumn dependency table transform Vctable. Again,this ability to exploit structures in data resulted in thebest compression performance.

3 DATA ARCHITECTURE

The output of a transformation is data to store or trans-port. For maximal usability, the main issues to be ad-dressed are: portability and self-description. Porta-bility means that primitive types such as bits, integersand strings should be standardly encoded so that data

ICSOFT 2007 - International Conference on Software and Data Technologies

84

Page 5: VCODEX: A DATA COMPRESSION PLATFORM · Thus, data compression is a critical component in these systems. Much of compression research has traditionally been focused on developing general

1. Transform -> {delta, table, bwt, huffman, ama, ...}2. Transform -> {delta, table, bwt, huffman, ama, ...} . Arglist(Transform)3. Arglist(Transform) -> Nil4. Arglist(Transform) -> Arg(Transform) . Arglist(Transform)5. TransformList -> Nil6. TransformList -> "-m" TransformList , Transform

Figure 4: Language syntax to compose transforms.

1. ls -l kennedy.xls2. -rw------- 4 kpv 1029744 Nov 11 1996 kennedy.xls

3. gzip < kennedy.xls > out; ls -l out4. -rw-r--r-- 1 kpv 206767 Apr 11 12:30 out

5. bzip2 < kennedy.xls > out; ls -l out6. -rw-r--r-- 1 kpv 130280 Apr 11 12:30 out

7. vczip -mbwt,mtf.0,rle.0,huffgroup < kennedy.xls > out; ls -l out8. -rw-r--r-- 1 kpv 129946 Apr 11 12:31 out

9. vczip -mbwt,mtf,rle.0,huffgroup < kennedy.xls > out; ls -l out10. -rw-r--r-- 1 kpv 84281 Apr 11 12:31 out

11. vczip -mtable,mtf,rle.0,huffgroup < kennedy.xls > out; ls -l out12. -rw-r--r-- 1 kpv 53918 Apr 11 12:31 out

13. vczip -mtable,bwt,mtf,rle.0,huffgroup < kennedy.xls > out; ls -l out14. -rw-r--r-- 1 kpv 35130 Apr 11 12:31 out

15. vczip -u < out >x; cmp x kennedy.xls

Figure 5: Experimenting with different combinations of data transforms.

can be produced on one OS/hardware platform andtransparently decoded on another. Self-descriptionmeans that data can be decoded without knowledgeof how they were encoded.

3.1 Portable Data Encoding

Transform writers must take care to ensure portabilityof output data. Vcodex provides a variety of func-tions to encode strings, bits, and integers in portableforms. String data are assumed to be in the ASCIIcharacter set. When running programs on a platformnot based on ASCII, for example, IBM mainframeswith EBCDIC, a function vcstrcode() can be usedto transcribe string data from the native character setinto ASCII for storage. For example, the identifica-tion string of a data transform is often constructedfrom its name (Section 2.2). As the name is in thenative character set, it would need to be translated toASCII for portability.

For bit encoding, it is assumed that the fundamen-tal unit of storage is an 8-bit byte. The bits in a byte

are position from left to right, i.e., the highest bit is atposition 0, the second highest bit at position 1, and soon. Then a sequence of bits would be simply imposedonto a sequence of bytes in that order. For example,the 12-bit sequence 101010101111 would be coded inthe 2-byte sequence 170 240. That is, the first eightbits, 10101010, are stored in a byte so that byte wouldhave value 170. The last four bits, 1111, are paddedwith 0’s before storage so the representing byte wouldhave value 240.

Integers are unsigned and encoded in a variable-sized format originally introduced in the Sfio li-brary (Korn and Vo, 1991). Each integer is treated asa number in base 128 so that each digit can be storedin a single byte. Except for the least significant one,each byte would turn on its most significant bit, MSB,to indicate that the next byte is a part of the encoding.For example, consider the integer 123456789 whichis represented by four digits 58, 111, 26, 21 in base128. Below are the coding of these digits shown inbits for clarity.

VCODEX: A DATA COMPRESSION PLATFORM

85

Page 6: VCODEX: A DATA COMPRESSION PLATFORM · Thus, data compression is a critical component in these systems. Much of compression research has traditionally been focused on developing general

1. ls -l ningaui.ama2. -rw-r--r-- 1 kpv kpv 73452794 Sep 12 2005 ningaui.ama

3. gzip < ningaui.ama >out; lsl out4. -rw-r--r-- 1 kpv kpv 30207067 Apr 11 12:55 out

5. bzip2 < ningaui.ama >out; lsl out6. -rw-r--r-- 1 kpv kpv 22154475 Apr 11 12:57 out

7. vczip -mama,table,mtf,rle.0,huffgroup < ningaui.ama >out; lsl out8. -rw-r--r-- 1 kpv kpv 20058182 Apr 11 13:00 out

Figure 6: Compression based on data-specific structures.

10111010 11101111 10011010 00010101

MSB+58 MSB+111 MSB+26 0+21

3.2 Self-describing Data

A large file might need to be divided into segmentssmall enough to process in memory. Each such seg-ment is called a window. The internal representationof a transformed window would depend on what datatransforms were used. But at the file level, that can betreated simply as a sequence of bytes.

1. [ID_file:] 0xd6 0xc3 0xc4 0xd82. [Reserved:] Size Data3. [Transforms:] Size4. [Transform1:] ID_transform5. [Argument:] Size Data6. [Transform2:] ID_transform7. [Argument:] Size Data8. ...9. [Window1:] Indicator

10 [Transformed data]: Size Data11. [Window2:] Indicator12 [Transformed data]: Size Data13. ...

Figure 7: The self-describing format of transformed data.

Figure 7 shows the structure of encoded data inwhich each file consists of a set of header data fol-lowed by one or more Window sections and gives anexample of header data.• Line 1 shows that each file is identified by four

starting bytes which are the ASCII letters V, C, Dand X with their high bits on.

• Line 2 shows a Reserved section usable by an ap-plication to store additional data. As shown hereand later, each data element is represented by aSize and Data pair that tells number of data bytesfollowed by the data itself.

• Lines 3–8 show the Transforms section whichencodes the original composition of transforms.

This starts with Size, the length in bytes of therest of the section. Each transform consists of anidentification string which, by convention, mustbe in ASCII followed by any additional data.

• Lines 9–13 show the list of Window sections.Each window section starts with an Indicatorbyte which can be one of the values VC RAW,VC SOURCEFILE and VC TARGETFILE. VC RAW meansthat the encoded data were not transformed. Thelatter two values are as defined in the Vcdiff Pro-posed Standard RFC3284 (Korn et al., 2002) toindicate that the data was delta compressed (Huntet al., 1998) against data from either the source ortarget file respectively. Following the Indicatorbyte is the transformed data.

0xd6 0xc3 0xc4 0xd8 [ID_file]0 [Reserved size]16 [Transforms size]delta\0 [ID_transform]0 [Argument size]huffman\0 [ID_transform]0 [Argument size]

Figure 8: Header data and data schema of a delta compres-sor.

Figure 8 gives an example based on a deltacompressor resulted from composing the transformsVcdelta and Vchuffman. The top part shows first theliteral bytes defining the header data which essen-

ICSOFT 2007 - International Conference on Software and Data Technologies

86

Page 7: VCODEX: A DATA COMPRESSION PLATFORM · Thus, data compression is a critical component in these systems. Much of compression research has traditionally been focused on developing general

tially only coded the names of the transforms. Thebottom part shows pictorially the various componentsof a compressed file. Each file starts with the headerdata which tells the file type via the four magic bytesVCDX and the transform composition sequence. Thecompressed data are divided into a sequence of win-dows. The raw data in each window is processedfirst by Vcdelta to generate the so-called delta in-structions defined in the IETF RFC3284 (Korn et al.,2002). In this method, three different sections of dataare generated, instructions, addresses of matches andunmatched data. Each section would then be pro-cessed separately by the following Huffman compres-sor Vchuffman. Partitioning of data by types in thisway helps to improve compression as the differentsections often have distinct statistical properties.

4 PERFORMANCE

An experiment was done to test compression perfor-mance by tailoring transform compositions to partic-ular data. The small and large files from CanterburyCorpus (Bell and Powell, 2001) were used to compareVczip, Gzip and Bzip2. Sizes were measured in bytesand times in seconds. Timing results were obtainedon a Pentium 4, 2.8GHZ, with 1Gb RAM, runningRedhat Linux.

Figure 9 presents compression and timing data intwo tables. The top and bottom parts of each ta-ble show results of the small and large files respec-tively. The Algs entries show the algorithm composi-tions used to compress the files. Below are the algo-rithms:

• B: The Burrows-Wheeler transform Vcbwt.

• M: The move-to-front transform Vcmtf with predic-tion.

• R: The run-length encoder Vcrle with coding onlyruns of 0’s.

• h: The Huffman coder Vchuffman.

• H: The Huffman coder with groupingVchuffgroup.

• T: The table transform Vctable.

• S: The delta compressor Vcsieve using approxi-mate matching.

• D: The transform Vcsieve with reverse matchingand mapping of the letter pairs A and T, G and C,etc.

The Bits entries show the average number of bitsneeded to encode a byte with each row’s overall win-ner in bold. Vczip only lost slightly to Bzip2 on

cp.html and to Gzip on grammar.lsp. Its weightedcompression rates of 1.21 bits per byte for the smallfiles and 1.68 for the large files outdid the best re-sults of 1.49 and 1.72 shown at the Corpus website.The pure Huffman coder Vchuffman outperformed itsgrouping counterpart Vchuffgroup on small files asthe cost of coding groups became too expensive insuch cases.

The best cases of Vczip typically ran somewhatslower than Bzip2 and Gzip. Although some speedloss was due to the general nature of transform com-position, the major part was due to sophisticated al-gorithms such as move-to-front with prediction. Inpractice, applications often make engineering choicesbetween faster data transforms with worse compres-sion vs. slower ones with better compression.

For any collection of transforms, the number ofalgorithm combinations that make sense tend to besmall and dictated by the nature of the transforms.For example, Huffman coding should never be usedbefore other transforms as it destroys any structure indata. Thus, although not done here, it is possible tofind an optimal composition sequence by trying allcandidate composition sequences on small data sam-ples.

5 RELATED WORKS

Vcodex introduces the notion of data transforms asstandard software components to encapsulate com-pression algorithms for reusability. A variety oflibrary packages such as zlib (Gailly and Adler,2005) or bzlib (Seward, 1994) also aim at improvingreusability. However, except for their overall externalcompression interface, the constituent algorithms inthese packages are buried in the implemented code.For example, although Huffman coding is requiredin both zlib and bzlib, different versions were hard-coded with incompatible data formats. By contrast,new Vcodex data transforms are often crafted fromexisting ones via either composition or direct usagein implementation. Application systems with propri-etary data types may also develop special data trans-forms for such types to enhance compressibility. Inthis way, exotic compositions of compression algo-rithms can be made to fit any particular data semanticsand gain optimal compression performance.

A side of data compression often overlooked is thedesign of the output data. If not carefully done, suchdata may not be sharable across machines due to dif-ferent architectures or may become stale and unusableover time due to algorithm evolution. Unfortunately,few compression tools have ever published their cod-

VCODEX: A DATA COMPRESSION PLATFORM

87

Page 8: VCODEX: A DATA COMPRESSION PLATFORM · Thus, data compression is a critical component in these systems. Much of compression research has traditionally been focused on developing general

Vczip Bzip2 GzipFile Size Algs Cmpsz Bits Cmpsz Bits Cmpsz Bits

alice29.txt 152089 BMRH 42654 2.24 43202 2.27 54423 2.86ptt5 513216 TBMRH 33149 0.52 49759 0.78 56438 0.88fields.c 11150 BMRh 3036 2.18 3039 2.18 3134 2.25kennedy.xls 1029744 TBMRH 35130 0.27 130280 1.01 206767 1.61sum 38240 SBMRH 12406 2.60 12909 2.70 12920 2.70lcet10.txt 426754 BMRH 105778 1.98 107706 2.02 144874 2.72plrabn12.txt 481861 BMRH 143817 2.39 145577 2.42 195195 3.24cp.html 24603 BMRh 7703 2.50 7624 2.48 7991 2.60grammar.lsp 3721 BMRh 1274 2.74 1283 2.76 1234 2.65xargs.1 4227 BMRh 1728 3.27 1762 3.33 1748 3.31asyoulik.txt 125179 BMRH 39508 2.52 39569 2.53 48938 3.13Total 2810784 426183 1.21 542710 1.54 733662 2.09

E.coli 4638690 DH 1137801 1.96 1251004 2.16 1341243 2.31bible.txt 4047392 BMRH 786709 1.55 845623 1.67 1191061 2.35world192.txt 2473400 BMRH 425077 1.37 489583 1.58 724593 2.34Total 11159482 2349587 1.68 2586210 1.85 3256897 2.33

Vczip Bzip2 GzipFile Cmptm Dectm Cmptm Dectm Cmptm Dectm

alice29.txt 0.05 0.01 0.03 0.01 0.01 0.01ptt5 0.27 0.08 0.03 0.01 0.02 0.01fields.c 0.01 0.01 0.01 0.01 0.01 0.01kennedy.xls 0.27 0.18 0.23 0.07 0.11 0.01sum 0.03 0.01 0.01 0.01 0.01 0.01lcet10.txt 0.17 0.08 0.12 0.06 0.05 0.01plrabn12.txt 0.21 0.11 0.15 0.09 0.08 0.01cp.html 0.01 0.01 0.01 0.01 0.01 0.01grammar.lsp 0.01 0.01 0.01 0.01 0.01 0.01xargs.1 0.01 0.01 0.01 0.01 0.01 0.01asyoulik.txt 0.04 0.01 0.02 0.01 0.01 0.01

E.coli 7.85 0.10 1.92 1.03 1.93 0.08bible.txt 2.47 1.04 1.64 0.72 0.55 0.06world192.txt 1.31 0.62 1.04 0.42 0.23 0.03

Figure 9: Compression size (bytes), rate (bits/byte) and time (seconds) for the Canterbury Corpus.

ing formats. Two rare exceptions are the Deflate For-mat (Deutsch, 1996) used in Gzip and the Vcdiff For-mat (Korn et al., 2002) for delta encoding which havebeen standardized and sanctioned by the Internet En-gineering Task Force to facilitate transporting of com-pressed data over the Internet. An appealing featureof the Vcodex self-describing data architecture is thatthe data format of each transform can be kept in-dependent from others that may be composed withit in applications. This allows any specification andstandardization effort to focus on the format of eachtransform independently from others. This should becontrasted, for example, with Deflate which must de-fine in details both the output of its main Lempel-Zivcoder and also that of the low level Huffman coderused to clean up any remaining statistical redundancyin the Lempel-Ziv output.

6 CONCLUSION

Advances in compression are continually pulled bytwo opposing forces: generalization to devise tech-niques applicable to all data types and specializationto devise techniques exploiting specific structures incertain classes of data. General techniques such asLempel-Ziv or Burrows-Wheeler Transform are easyto use and do perform well with most data. How-ever, as seen in our experiments earlier, such tech-niques could seldom match the performance of algo-rithms specifically designed for classes of data suchas tables, DNA sequences, etc. Unfortunately, thecost to develop data-specific compression tools canquickly become prohibitive considering the plethoraof arcane data types used in large application systems.Vcodex provides a unique solution to this problem byapplying a software and data engineering approach tocompression. Its data transform interface frees algo-rithm designers to focus only on the data transfor-

ICSOFT 2007 - International Conference on Software and Data Technologies

88

Page 9: VCODEX: A DATA COMPRESSION PLATFORM · Thus, data compression is a critical component in these systems. Much of compression research has traditionally been focused on developing general

mation task at hand without having to be concernedwith other transforms that they may require. Further,users of compression also have the opportunity to mixand match data transforms to optimize the compress-ibility of their data without being locked into somecompositional sequences picked by the tool design-ers. The best indication of success for Vcodex is thatthe platform has been in use for a few years in a num-ber of large data warehouse applications handling ter-abytes of data daily. Tranform compositions properlytailored to data types help achieving compression ra-tios up to hundreds to one, resulting in significant costsavings. A variety of data transforms have been con-tinually developed along with new data types withoutdisrupting ongoing usage. Further, as the number ofdata types is limited per application system, simplelearning algorithms based on training with small sam-ples of data could often be developed to automaticallyfind optimal combinations of transforms for effectivecompression. In this way, the platform has proven tobe an effective tool to help advancing compression notjust in specialization but also in generalization.

REFERENCES

T. Bell and M. Powell. The Canterbury Corpus,http://corpus.canterbury.ac.nz. Technical Report,2001.

J. Bentley, D. Sleator, R. Tarjan, and V. Wei. A LocallyAdapative Data Compression Scheme. Comm. of theACM, 29:320–330, 1986.

A. Buchsbaum, G.S. Fowler, and R. Giancarlo. ImprovingTable Compression with Combinatorial Optimization.J. of the ACM, 50(6):825–51, 2003.

M. Burrows and D.J. Wheeler. A Block-Sorting LosslessData Compression Algorithm. Report 124, DigitalSystems Research Center, 1994.

S. Deorowicz. Improvements to Burrows-Wheeler Com-pression Algorithm. Software—Practice and Experi-ence, 30(13):1465–1483, 2000.

P. Deutsch. DEFLATE Compressed Data Format Spec-ification version 1.3. In http://www.ietf.org. IETFRFC1951, 1996.

G.S. Fowler, A. Hume, D.G. Korn, and K.-P. Vo. Migratingan MVS Mainframe Application to a PC. In Proceed-ings of Usenix’2004. USENIX, 2004.

J. Gailly and M. Adler. Zlib, http://www.zlib.net. Technicalreport, 2005.

D.A. Huffman. A Method for the Construction ofMinimum-Redundancy Codes. Proc. of the IRE,40(9):1098–1101, Sept 1952.

J.J. Hunt, K.-P. Vo, and W.F. Tichy. Delta Algorithms: AnEmpirical Analysis. ACM Transactions on SoftwareEngineering and Methodology, 7:192–214, 1998.

D.W. Jones. Practical Evaluation of a Data CompressionAlgorithm. In Data Compression Conference. IEEEComputer Society Press, 1991.

David G. Korn and Kiem-Phong Vo. SFIO: Safe/FastString/File IO. In Proc. of the Summer ’91 UsenixConference, pages 235–256. USENIX, 1991.

D.G. Korn, J. MacDonals, J. Mogul, and K.-P. Vo.The VCDIFF Generic Differencing and Compres-sion Data Format. Internet Engineering Task Force,www.ietf.org, RFC 3284, 2002.

D.G. Korn and K.-P. Vo. Engineering a Differencingand Compression Data Format. In Proceedings ofUsenix’2002. USENIX, 2002.

H. Liefke and D. Suciu. Xmill: an efficient compressor forxml data. In Proc. of SIGMOD, pages 153–164, 2000.

G. Manzini and M. Rastero. A Simple and Fast DNA Com-pression Algorithm. Software—Practice and Experi-ence, 34:1397–1411, 2004.

J. Seward. Bzip2, http://www.bzip.org. Technical report,1994.

W. F. Tichy. RCS—a system for version control. Software—Practice and Experience, 15(7):637–654, 1985.

B.D. Vo and K.-P. Vo. Using Column Dependency to Com-press Tables. Data Compression Conference, 2004.

B.D. Vo and K.-P. Vo. Compressing Table Data with Col-umn Dependency. Theoretical Computer Science, ac-cepted for publication, 2006.

K.-P. Vo. The Discipline and Method Architecture forReusable Libraries. Software—Practice and Experi-ence, 30:107–128, 2000.

I.H. Witten, M. Radford, and J.G. Cleary. ArithmeticCoding for Data Compression. Comm. of the ACM,30(6):520–540, June 1987.

J. Ziv and A. Lempel. A Universal Algorithm for Sequen-tial Data Compression. IEEE Trans. on InformationTheory, 23(3):337–343, May 1977.

J. Ziv and A. Lempel. Compression of Individual Sequencesvia Variable-Rate Coding. IEEE Trans. on Informa-tion Theory, 24(5):530–536, 1978.

VCODEX: A DATA COMPRESSION PLATFORM

89