StarNT: Dictionary-based Fast Transform

Weifeng Sun ([email protected])
Computer Science Department, University of Central Florida
M5 research group, University of Central Florida
20 November 2002


Page 2: Current Text Compression Model

• First-order entropy coder
  • Huffman (& word-based Huffman)
  • Arithmetic: arbitrary precision
• PPM: prediction based on history
  • BWT (bzip2 -9) explores unbounded context information
• LZ-family (gzip -9, fast)
  • Adaptive dictionary
  • Encode repeated patterns based on history
• Others: DMC, etc.

Page 3: Star-transform: Roadmap

• Static dictionary (<= LZ-family)
• Better context information for PPM/BWT

=> LIPT: Dictionary Based Transform
   + Ternary search tree for fast transform encoding
   + Better mapping for fast transform decoding
=> StarNT: Dictionary Based Fast Transform
   + Domain-specific dictionaries
=> StarZip: Multi-corpora Text Compression System

Page 4: StarNT: Transform paradigm

Compression: Original text ("She owns a hotel.") -> Transform encoding (using the dictionary) -> Transformed text ("aa~ aD a aaU.") -> Compression algorithm (PPM, Huffman) -> Compressed text (binary code)

Decompression: Compressed text (binary code) -> Decompression algorithm (PPM, Huffman) -> Transformed text ("aa~ aD a aaU.") -> Transform decoding (using the dictionary) -> Original text ("She owns a hotel.")

Figure 1. Text transform paradigm

• Popular idea in image compression!
• BWT falls in this category (MTF + entropy coder).

Page 5: StarNT: Compression Philosophy

• Transform the text into some intermediate form which can be compressed with better efficiency.
• Exploit the natural redundancy of the language in making this transformation.

Page 6: Star-family Review

• *-Transform
  • Originally proposed by Dr. Amar Mukherjee
• LPT (Length Preserving Transformation)
• RLPT (Reverse Length Preserving Transformation)
• SCLPT (Shortened-Context Length Preserving Transformation)
• LIPT (Length-Index Preserving Transformation)
• StarNT (Ternary search tree + new mapping)

Page 7: *-Encoding

• Replace each character in the input word by a special placeholder character ‘*’, retaining at most two characters from the original word.
• Preserve the length of the original word.
• Example: ‘a’ --> ‘*’, ‘am’ --> ‘*a’, ‘there’ --> ‘*****’, ‘which’ --> ‘a****’

Page 8: *-transform demo

Text dictionary:

Word          Code
a             *
is            **
to            *a
the           ***
long          ****
this          ***a
test          ***b
method        ******
example       *******
sentence      ********
demonstrate   ***********

Input text: This is a long example to demonstrate the “substitution” method.

Encoded text: ***a^ ** * **** ******* *a *********** *** “substitution” ******.
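The demo above can be reproduced with a few lines of Python. This is only a sketch: the table is copied from the demo dictionary, while the function name `star_encode` and the reading of ‘^’ as "first letter was capitalized" are assumptions inferred from the encoded text, not something the slides state.

```python
import re

# Word -> star code table copied from the demo dictionary.
STAR_DICTIONARY = {
    "a": "*", "is": "**", "to": "*a", "the": "***", "long": "****",
    "this": "***a", "test": "***b", "method": "******",
    "example": "*******", "sentence": "********",
    "demonstrate": "***********",
}


def star_encode(text, dictionary=STAR_DICTIONARY):
    """Replace every dictionary word by its star code; copy everything else."""
    def encode_word(match):
        word = match.group(0)
        code = dictionary.get(word.lower())
        if code is None:
            return word          # not in the dictionary: left unchanged
        if word[0].isupper():
            return code + "^"    # assumption: '^' flags a capitalized first letter
        return code

    return re.sub(r"[A-Za-z]+", encode_word, text)


encoded = star_encode(
    "This is a long example to demonstrate the “substitution” method."
)
print(encoded)
```

Every code has the same length as the word it replaces, which is the length-preserving property the *-encoding relies on.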

Page 9: LIPT

• Improvement upon *-transform
  • The run-length step of bzip2 destroys repeated ‘*’.
• -> LPT -> RLPT -> SCLPT
• -> LIPT (Length-Index Preserving Transformation)
  • First: encode words according to length information
  • Second: consider frequency (partially sorted)
  • Uses a binary tree (sort dictionary first; very slow)

Page 10: StarNT: New Transform

Improvement?

Page 11: StarNT: Consideration 1

Figure 2: Frequency of words versus length of words in English text
[Chart: "Length of Words" (ticks 1 to 13) against "Frequency of Words" (0% to 25%).]

Page 12: StarNT: Consideration 2

• Goal:
  • Make the transformed intermediate output more “delicious” to the backend compressor
• How?
  • Maintain some of the original context information
    • Preserve word frequency information
    • Use word length information
  • Provide some kind of “artificial” but strong context

Page 13: StarNT: Consideration 3

• Fast transform encoding and decoding
  • Ternary search tree: encoding phase
    Searching for a string of length k in a ternary search tree with n strings requires at most O(log n + k) comparisons.
  • Better mapping: decoding phase
    Looking up a word takes O(1) time.

Page 14: StarNT: Ternary Search Tree

• Hash table (fast, but difficult to design; slow unsuccessful search)
• Binary tree (slower, space efficient)
• Digital search tries (fast, exorbitant space requirement)
• Ternary search trees (fast and space efficient)
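To make the data structure concrete, here is a minimal ternary search tree in Python that maps words to dictionary indices. It is an illustrative sketch, not the StarNT implementation: the class and method names are hypothetical, and the O(log n + k) bound quoted earlier assumes a reasonably balanced tree, which this sketch does not enforce.

```python
class Node:
    __slots__ = ("char", "lo", "eq", "hi", "value")

    def __init__(self, char):
        self.char = char
        self.lo = self.eq = self.hi = None
        self.value = None  # payload, set only on the last character of a word


class TernarySearchTree:
    def __init__(self):
        self.root = None

    def insert(self, word, value):
        if not word:
            raise ValueError("empty words are not supported")
        self.root = self._insert(self.root, word, 0, value)

    def _insert(self, node, word, i, value):
        ch = word[i]
        if node is None:
            node = Node(ch)
        if ch < node.char:
            node.lo = self._insert(node.lo, word, i, value)
        elif ch > node.char:
            node.hi = self._insert(node.hi, word, i, value)
        elif i + 1 < len(word):
            node.eq = self._insert(node.eq, word, i + 1, value)
        else:
            node.value = value
        return node

    def search(self, word):
        """Return the stored value, or None if the word is absent."""
        node, i = self.root, 0
        while node is not None and word:
            ch = word[i]
            if ch < node.char:
                node = node.lo
            elif ch > node.char:
                node = node.hi
            elif i + 1 < len(word):
                node, i = node.eq, i + 1
            else:
                return node.value
        return None


tree = TernarySearchTree()
for index, word in enumerate(["the", "of", "to", "one", "out"], start=1):
    tree.insert(word, index)
print(tree.search("to"), tree.search("two"))
```

Each node stores one character and three children, so common prefixes are shared as in a trie, while a failed lookup stops as soon as a character has no matching branch.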

Page 15: LIPT: Dictionary Organization in Memory (based on binary tree)

[Diagram]
• Dictionary Header: Specification Name; Dictionary Version (Major, Minor, Micro); Date Updated (yymmdd)
• Length Index Information (LII), Level 1 Index: one entry per word length (1, 2, 3, 4, 5, ..., Max-1, Max), each holding a Word Index Ptr. An annotation indicates that the entry for words of length Max-1 is ‘0’.
• Word Index Information (WII), Level 2 Index: one block per length (words of length 1, 2, 3, ..., Max), each divided into per-letter lists (‘a’ list, ‘b’ list, ‘c’ list, ..., ‘z’ list).

Page 16: StarNT: Consideration 4

• Shorter transformed intermediate file size
• The meaning of the symbol ‘*’ changed!

Transform                ‘*’ denotes
*-encoding, ..., LIPT    Words in the dictionary
StarNT                   Words not in the dictionary

Page 17: StarNT: Dictionary mapping (1)

• The most frequently used words are listed at the beginning of the dictionary. There are 312 words in this group.

Page 18: StarNT: Dictionary mapping (2)

• The remaining words are stored in the dictionary D according to their lengths: longer words are stored after shorter words. Words of the same length are sorted by their frequency of occurrence.
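The two ordering rules (a block of the most frequent words first, then the rest by length and, within a length, by frequency) can be sketched as a single sort. The function name `order_dictionary`, the `word_counts` input, and the alphabetical tie-break are assumptions made for illustration; only the size of the leading block (312) comes from the slides.

```python
def order_dictionary(word_counts, top_n=312):
    """Return words in dictionary order.

    word_counts maps each word to its frequency of occurrence.
    """
    # Most frequent first; ties broken alphabetically (an assumption).
    by_frequency = sorted(word_counts, key=lambda w: (-word_counts[w], w))
    head = by_frequency[:top_n]
    # Remaining words: shorter before longer, then more frequent first.
    tail = sorted(by_frequency[top_n:],
                  key=lambda w: (len(w), -word_counts[w], w))
    return head + tail


toy_counts = {"the": 100, "of": 90, "interconnectivity": 1,
              "pink": 5, "b": 2, "blue": 7}
print(order_dictionary(toy_counts, top_n=2))
```

With `top_n=2` the toy input puts "the" and "of" first, then the one-letter word, the two four-letter words in frequency order, and the longest word last.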

Page 19: StarNT: Dictionary mapping (3)

• To achieve better compression performance for the backend data compression algorithm, only the letters [a..zA..Z] are used to represent the codeword.
• This also allows fast transform decoding (the codeword denotes the index of the word in the dictionary).

Page 20: StarNT: Dictionary Demo

Index   Word                Codeword
1       the                 a
2       of                  b
3       to                  c
...
52      one                 Z
53      out                 aa
...
312     thousand            eZ
313     b                   fa
...
3574    pink                apL
...
54432   interconnectivity   tfN
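The index/codeword pairs in the demo are consistent with a bijective base-52 numbering over the alphabet a..z followed by A..Z. The sketch below reconstructs that mapping from the table; the function names are hypothetical and the scheme is inferred from the listed examples rather than taken from the StarNT source code.

```python
import string

ALPHABET = string.ascii_lowercase + string.ascii_uppercase  # a..z then A..Z
BASE = len(ALPHABET)  # 52


def index_to_codeword(index):
    """Map a 1-based dictionary index to a codeword (1 -> 'a', 53 -> 'aa')."""
    if index < 1:
        raise ValueError("dictionary indices start at 1")
    letters = []
    while index > 0:
        index, digit = divmod(index - 1, BASE)
        letters.append(ALPHABET[digit])
    return "".join(reversed(letters))


def codeword_to_index(codeword):
    """Inverse mapping: pure arithmetic, no dictionary search needed."""
    index = 0
    for letter in codeword:
        index = index * BASE + ALPHABET.index(letter) + 1
    return index


for index in (1, 52, 53, 312, 313, 3574, 54432):
    print(index, index_to_codeword(index))
```

Because the codeword is just the index written in this alphabet, a decoder can compute the index and read the word from an array, which matches the O(1) lookup claimed for the decoding phase.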

Page 21: StarNT: Transform encoding and decoding

Replacer
Special characters (‘*’, ‘~’, ‘`’, and ‘\’)
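A toy encoder and decoder can show how the codewords and the special characters might fit together. The roles assigned here are assumptions: a ‘*’ prefix for words not in the dictionary follows the Consideration 4 slide, a ‘~’ suffix for a capitalized first letter follows the "She" -> "aa~" example in Figure 1, and a ‘`’ suffix for all-uppercase words is a guess. Escaping with ‘\’ is not sketched, so input containing any special character is rejected. The class `ToyStarNT` and its five-word dictionary are hypothetical.

```python
import re
import string

ALPHABET = string.ascii_lowercase + string.ascii_uppercase
BASE = len(ALPHABET)


def index_to_codeword(index):
    letters = []
    while index > 0:
        index, digit = divmod(index - 1, BASE)
        letters.append(ALPHABET[digit])
    return "".join(reversed(letters))


def codeword_to_index(codeword):
    index = 0
    for letter in codeword:
        index = index * BASE + ALPHABET.index(letter) + 1
    return index


class ToyStarNT:
    def __init__(self, words):
        self.words = list(words)  # lowercase words in dictionary order
        self.codes = {w: index_to_codeword(i)
                      for i, w in enumerate(self.words, start=1)}

    def encode(self, text):
        if any(c in text for c in "*~`\\"):
            raise ValueError("escaping of special characters is not sketched")

        def repl(match):
            word = match.group(0)
            code = self.codes.get(word.lower())
            if code is None:
                return "*" + word          # word not in the dictionary
            if word.islower():
                return code
            if word == word.lower().capitalize():
                return code + "~"          # first letter capitalized
            if word.isupper():
                return code + "`"          # all uppercase (assumed role)
            return "*" + word              # mixed case: keep verbatim

        return re.sub(r"[A-Za-z]+", repl, text)

    def decode(self, text):
        def repl(match):
            star, letters, mark = match.groups()
            if star:
                return letters + mark
            word = self.words[codeword_to_index(letters) - 1]
            if mark == "~":
                return word.capitalize()
            if mark == "`":
                return word.upper()
            return word

        return re.sub(r"(\*?)([A-Za-z]+)([~`]?)", repl, text)


toy = ToyStarNT(["a", "the", "she", "hotel", "owns"])
transformed = toy.encode("She owns a hotel.")
print(transformed, "->", toy.decode(transformed))
```

Encoding looks words up by string (the job of the ternary search tree in StarNT), while decoding only converts a codeword to an array index.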

Page 22: StarNT: Experiment

Results & Conclusions

Page 23: StarNT: Benchmark texts

Calgary                Canterbury                    Textfiles (from Gutenberg)
File     Actual size   File            Actual size   File          Actual size
bib      111261        alice29.txt     152089        1musk10.txt   1344739
book1    768771        asyoulik.txt    125179        anne11.txt    586960
book2    610856        cp.html         24603         world95.txt   2988578
news     377109        fields.c        11150
paper1   53161         grammar.lsp     3721
paper2   82199         lcet10.txt      426754
paper3   46526         plrabn12.txt    481861
paper4   13286         xargs.1         4227
paper5   11954         bible.txt       4047392
paper6   38105         kjv.gutenberg   4846137
progc    39611         world192.txt    2473400
progl    71646
progp    49379
trans    93695

Page 24: StarNT: Backend compressor

• PPMD (order 5)
• Bzip2 -9
  • BWT + MTF + entropy coder
• Gzip -9
  • A variation of the LZ77 algorithm + static Huffman
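The compression results later in the deck are reported in bits per character (BPC), that is, 8 times the compressed size divided by the original size. As a rough sketch of how such figures can be measured, the following uses Python's standard `bz2` and `zlib` modules as stand-ins for bzip2 -9 and gzip -9. `zlib` emits the same DEFLATE stream as gzip but with a different container, so sizes differ by a few bytes. PPMD has no standard-library binding and is left out; the function names are hypothetical.

```python
import bz2
import zlib


def bits_per_character(original: bytes, compressed: bytes) -> float:
    """BPC = 8 * compressed size / original size."""
    if not original:
        raise ValueError("original data must not be empty")
    return 8 * len(compressed) / len(original)


def measure_bpc(data: bytes) -> dict:
    """Compress data with both stand-in backends at their highest level."""
    return {
        "bzip2 -9": bits_per_character(data, bz2.compress(data, compresslevel=9)),
        "gzip -9": bits_per_character(data, zlib.compress(data, 9)),
    }


sample = b"She owns a hotel. " * 500
for name, bpc in measure_bpc(sample).items():
    print(f"{name}: {bpc:.2f} BPC")
```

To evaluate a transform, the same measurement would be taken on the transformed text while still dividing by the size of the original, untransformed file.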

Page 25: StarNT: Timing Performance -- Encoding speed

Table 1: Comparison of Encoding Speed of Various Compressors with/without Transform (in seconds)

Corpus      bzip2  bzip2+StarNT  bzip2+LIPT  gzip  gzip+StarNT  gzip+LIPT  PPMD  PPMD+StarNT  PPMD+LIPT
Calgary     0.36   0.76          1.33        0.23  0.86         1.7        9.58  7.94         9.98
Canterbury  2.73   3.04          5.22        2.46  3.36         6.59       68.3  55.7         69.2
Gutenberg   4.09   4.4           7.01        2.28  3.78         9.67       95.4  75.2         90.9
AVERAGE     1.69   2.05          3.47        1.33  2.06         4.47       41.9  33.9         41.9

Page 26: StarNT: Timing Performance -- Decoding speed

Table 2: Comparison of Decoding Speed of Various Compressors with/without Transform (in seconds)

Corpus      bzip2  bzip2+StarNT  bzip2+LIPT  gzip  gzip+StarNT  gzip+LIPT  PPMD  PPMD+StarNT  PPMD+LIPT
Calgary     0.13   0.33          1.66        0.04  0.27         1.64       9.65  8.07         10.9
Canterbury  0.82   1.53          6.77        0.22  1.16         9.15       71.2  57.8         77.2
Gutenberg   1.15   2.22          8.46        0.29  1.44         7.99       95.4  76.9         98.7
AVERAGE     0.51   1             4.4         0.14  0.72         5.27       43    35           46.4

Page 27: StarNT: Timing Performance -- Conclusion (1)

• The average compression time using the new transform algorithm with bzip2 -9, gzip -9 and PPMD is 28.1% slower, 50.4% slower and 21.2% faster, respectively, compared to the original bzip2 -9, gzip -9 and PPMD.
• The average decompression time using the new transform algorithm with bzip2 -9 and gzip -9 is 1 and 6 times slower, respectively, and with PPMD it is 18.6% faster, compared to the original bzip2 -9, gzip -9 and PPMD. However, since the decoding process is fairly fast for bzip2 and gzip, this increase is negligible.

Page 28: StarNT: Timing Performance -- Transform speed comparison

Table 3: Comparison of Transform Encoding and Decoding Speed (in seconds)

            StarNT                                   LIPT
Corpora     Transform Encoding  Transform Decoding   Transform Encoding  Transform Decoding
Calgary     0.42                0.18                 1.66                1.45
Canterbury  1.26                0.85                 5.7                 5.56
Gutenberg   1.68                1.12                 6.89                6.22
AVERAGE     0.89                0.54                 3.75                3.58

Page 29: StarNT: Timing Performance -- Conclusion (2)

• For all corpora, the average transform encoding and decoding times using the new transform decrease by about 76.3% and 84.9%, respectively, in comparison to the times taken by LIPT.
• The decoding module runs faster than the encoding module by 39.3% on average. The main reason is that the hash function used in the decoding phase is more efficient than the ternary search tree in the encoding module.
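The three percentages in this conclusion follow directly from the AVERAGE row of Table 3. A short check, with a hypothetical helper name:

```python
def percent_decrease(old, new):
    """Relative reduction from old to new, in percent."""
    return 100 * (old - new) / old


# AVERAGE row of Table 3 (seconds).
starnt_encode, starnt_decode = 0.89, 0.54
lipt_encode, lipt_decode = 3.75, 3.58

encode_gain = round(percent_decrease(lipt_encode, starnt_encode), 1)
decode_gain = round(percent_decrease(lipt_decode, starnt_decode), 1)
decode_vs_encode = round(percent_decrease(starnt_encode, starnt_decode), 1)

print(encode_gain, decode_gain, decode_vs_encode)
```

The first two values compare StarNT against LIPT for the same phase; the third compares StarNT decoding against StarNT encoding.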

Page 30: StarNT: Compression Results (BPC) Using StarNT

File           Size (byte)  bzip2 -9  bzip2 -9+StarNT  gzip -9  gzip -9+StarNT  PPMD  PPMD+StarNT
paper5         11954        3.24      2.76             3.34     2.78            2.98  2.56
paper4         13286        3.12      2.46             3.33     2.55            2.89  2.34
paper6         38105        2.58      2.29             2.77     2.40            2.41  2.17
progc          39611        2.53      2.32             2.68     2.45            2.36  2.17
paper3         46526        2.72      2.28             3.11     2.47            2.58  2.24
progp          49379        1.74      1.69             1.81     1.76            1.70  1.64
paper1         53161        2.49      2.21             2.79     2.35            2.33  2.10
progl          71646        1.74      1.58             1.80     1.65            1.68  1.51
paper2         82199        2.44      2.14             2.89     2.35            2.32  2.07
trans          93695        1.53      1.22             1.61     1.25            1.47  1.14
bib            111261       1.97      1.71             2.51     2.12            1.86  1.62
news           377109       2.52      2.29             3.06     2.57            2.35  2.16
book2          610856       2.06      1.92             2.70     2.24            1.96  1.85
book1          768771       2.42      2.28             3.25     2.66            2.30  2.24
grammar.lsp    3721         2.76      2.42             2.68     2.38            2.36  2.06
xargs.1        4227         3.33      2.90             3.32     2.87            2.94  2.57
fields.c       11150        2.18      1.98             2.25     2.03            2.04  1.81
cp.html        24603        2.48      2.01             2.60     2.13            2.26  1.85
asyoulik.txt   125179       2.53      2.27             3.12     2.58            2.47  2.24
alice29.txt    152089       2.27      2.06             2.85     2.38            2.18  2.00
lcet10.txt     426754       2.02      1.81             2.71     2.14            1.93  1.78
plrabn12.txt   481861       2.42      2.23             3.23     2.60            2.32  2.22
world192.txt   2473400      1.58      1.36             2.33     1.87            1.49  1.30
bible.txt      4047392      1.67      1.53             2.33     1.87            1.60  1.47
kjv.gutenberg  4846137      1.66      1.55             2.34     1.94            1.57  1.47
anne11.txt     586960       2.22      2.05             3.02     2.47            2.13  2.01
1musk10.txt    1344739      2.08      1.88             2.91     2.34            1.91  1.82
world95.txt    2736128      1.57      1.34             2.37     1.89            1.49  1.29
Average                     2.28      2.02             2.70     2.25            2.14  1.92

Page 31: StarNT: Compression Performance

• Facilitated with StarNT, bzip2 -9, gzip -9 and PPMD achieve an average improvement in compression ratio of 11.2% over bzip2 -9, 16.4% over gzip -9, and 10.2% over PPMD.
• StarNT works better than LIPT when applied with a backend compressor.
• In conjunction with bzip2, our transform algorithm achieves better compression performance than the original PPMD. Combined with the timing performance, we conclude that bzip2+StarNT is better than PPMD in both time complexity and compression performance.

Page 32: StarNT: BPC comparison of new approaches based on BWT

File     Size (byte)  Mbswic [Arna00]  bks98 [BKSh99]  best x of 2x-1 [Chap00]  bzip2+LIPT  bzip2+StarNT
bib      111261       2.05             1.94            1.94                     1.93        1.71
book1    768771       2.29             2.33            2.29                     2.31        2.28
book2    610856       2.02             2.00            2.00                     1.99        1.92
news     377109       2.55             2.47            2.48                     2.45        2.29
paper1   53161        2.59             2.44            2.45                     2.33        2.21
paper2   82199        2.49             2.39            2.39                     2.26        2.14
progc    39611        2.68             2.47            2.51                     2.44        2.32
progl    71646        1.86             1.70            1.71                     1.66        1.58
progp    49379        1.85             1.69            1.71                     1.72        1.69
trans    93695        1.63             1.47            1.48                     1.47        1.22
Average               2.20             2.09            2.10                     2.06        1.94

Page 33: StarNT: BPC comparison of new approaches based on PPM

File     Size (byte)  Multi-alphabet CTW order 16 [SOIm00]  NEW [Effr00]  PPMD+LIPT  PPMD+StarNT
bib      111261       1.86                                  1.84          1.83       1.62
book1    768771       2.22                                  2.39          2.23       2.24
book2    610856       1.92                                  1.97          1.91       1.85
news     377109       2.36                                  2.37          2.31       2.16
paper1   53161        2.33                                  2.32          2.21       2.10
paper2   82199        2.27                                  2.33          2.17       2.07
progc    39611        2.38                                  2.34          2.30       2.17
progl    71646        1.66                                  1.59          1.61       1.51
progp    49379        1.64                                  1.56          1.68       1.64
trans    93695        1.43                                  1.38          1.41       1.14
Average               2.01                                  2.01          1.97       1.85

Page 34: StarZip

A Multi-corpora Lossless Text Compression Tool

Page 35: StarZip: A Multi-corpora Lossless Text Compression Tool

• StarNT: transform engine
• Domain-specific dictionaries

Page 36: StarZip: preliminary experiment

• Five corpora used (from ibiblio.com)

Corpus                  # of files  Size   Entries in the dictionary
Literature              3064        1.2 G  60533
History                 233         9.11M  39740
Political               969         33.4M  38464
Psychology              55          13.3M  45165
Computer Network (RFC)  3237        145M   13987

Page 37: StarZip: preliminary experiment – gzip

Corpus             gzip  gzip+cd  gzip+sd  sd/gzip  Improvement
Gutenberg corpus   2.87  2.47     2.31     20%      7%
History corpus     2.35  2.14     1.92     18%      10%
Political corpus   2.49  2.09     1.98     20%      5%
Psychology corpus  2.63  2.27     2.1      20%      8%
RFC corpus         1.96  1.75     1.65     16%      6%
Average            2.46  2.14     1.99     19%      7%
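The slides do not define the column abbreviations. One plausible reading, assumed here, is that "cd" is a common (general) dictionary and "sd" a domain-specific one, that "sd/gzip" is the BPC reduction of gzip+sd relative to plain gzip, and that "Improvement" is the reduction of gzip+sd relative to gzip+cd. The sketch below recomputes both columns for three rows of the gzip table under that reading; the helper name is hypothetical. Because the published BPC values are rounded, recomputing other rows can differ from the table by one percentage point.

```python
def percent_reduction(baseline, value):
    """BPC reduction of value relative to baseline, rounded to whole percent."""
    return round(100 * (baseline - value) / baseline)


# (gzip, gzip+cd, gzip+sd) BPC values copied from the gzip table.
rows = {
    "History": (2.35, 2.14, 1.92),
    "Political": (2.49, 2.09, 1.98),
    "RFC": (1.96, 1.75, 1.65),
}

derived = {
    corpus: (percent_reduction(plain, sd), percent_reduction(cd, sd))
    for corpus, (plain, cd, sd) in rows.items()
}
for corpus, (sd_vs_plain, sd_vs_cd) in derived.items():
    print(f"{corpus}: sd/gzip {sd_vs_plain}%, improvement over cd {sd_vs_cd}%")
```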

Page 38: StarZip: preliminary experiment – bzip2

Corpus             bzip2  bzip2+cd  bzip2+sd  sd/bzip2  Improvement
Gutenberg corpus   2.26   2.09      1.97      13%       6%
History corpus     1.86   1.78      1.6       16%       10%
Political corpus   2.11   1.92      1.81      14%       6%
Psychology corpus  2.3    2.13      1.97      14%       7%
RFC corpus         1.48   1.43      1.39      7%        3%
Average            2.00   1.87      1.75      13%       6%

Page 39: StarZip: preliminary experiment – PPMD

Corpus             ppmd  ppmd+cd  ppmd+sd  sd/ppmd  Improvement
Gutenberg corpus   2.13  2.02     1.93     9%       4%
History corpus     1.8   1.72     1.58     12%      8%
Political corpus   2.02  1.85     1.75     13%      5%
Psychology corpus  2.21  2.01     1.88     15%      7%
RFC corpus         1.47  1.41     1.37     3%       3%
Average            1.93  1.80     1.70     10%      5%

Page 40: StarNT

Review & Open Topic

Page 41: StarNT: Review (1)

• Static dictionary
• Fast encoding (ternary search tree)
• Fast decoding (well-designed mapping)
• Better compression ratio
• StarZip

Page 42: StarNT: Review (2)

Figure 4: Compression effectiveness versus compression speed
[Plot: encoding speed (Kbytes per second; axis ticks 10, 100, 1,000) against compression (bits per character, 0 to 2.5) for Bzip2, PPMD, Gzip, Bzip2+StarNT, PPMD+StarNT and Gzip+StarNT.]

Figure 5: Compression effectiveness versus decompression speed
[Plot: decoding speed (Kbytes per second; axis ticks 100, 1,000, 10,000) against compression (bits per character, 0 to 2.5) for Bzip2, PPMD, Gzip, Bzip2+StarNT, PPMD+StarNT and Gzip+StarNT.]

Page 43: StarNT: Open Topic

• Theoretical explanation
• Flexible dictionary
  • Only length and frequency information is used. If semantic information were used, multiple words could share the same codeword.
  • Expand the dictionary dynamically (as in the LZ-family)
• Other approaches to improve PPM
  • Block prediction

Page 44: StarNT: References (1)

[AwMu01] F. Awan and A. Mukherjee, "LIPT: A Lossless Text Transform to Improve Compression", Proceedings of International Conference on Information Technology: Coding and Computing, IEEE Computer Society, Las Vegas, Nevada, 2001.

[Arna00] Z. Arnavut, "Move-to-Front and Inversion Coding", Proceedings of Data Compression Conference, IEEE Computer Society, Snowbird, Utah, March 2000, pp. 193-202.

[BKSh99] B. Balkenhol, S. Kurtz, and Y. M. Shtarkov, "Modifications of the Burrows Wheeler Data Compression Algorithm", Proceedings of Data Compression Conference, IEEE Computer Society, Snowbird, Utah, March 1999, pp. 188-197.

[BeSe97] J. L. Bentley and R. Sedgewick, "Fast Algorithms for Sorting and Searching Strings", Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, January 1997.

[BuWh94] M. Burrows and D. J. Wheeler, "A Block-Sorting Lossless Data Compression Algorithm", SRC Research Report 124, Digital Systems Research Center, Palo Alto, CA, 1994.

[Chap00] B. Chapin, "Switching Between Two On-line List Update Algorithms for Higher Compression of Burrows-Wheeler Transformed Data", Proceedings of Data Compression Conference, IEEE Computer Society, Snowbird, Utah, March 2000, pp. 183-191.

[Effr00] M. Effros, "PPM Performance with BWT Complexity: A New Method for Lossless Data Compression", Proceedings of Data Compression Conference, IEEE Computer Society, Snowbird, Utah, March 2000, pp. 203-212.

Page 45: StarNT: References (2)

[FrMu96] R. Franceschini and A. Mukherjee, "Data Compression Using Encrypted Text", Proceedings of the Third Forum on Research and Technology Advances in Digital Libraries, ADL 96, pp. 130-138.

[Howa93] P. G. Howard, "The Design and Analysis of Efficient Lossless Data Compression Systems", Ph.D. thesis, Brown University, Providence, RI, 1993.

[KrMu98] H. Kruse and A. Mukherjee, "Preprocessing Text to Improve Compression Ratios", Proceedings of Data Compression Conference, IEEE Computer Society, Snowbird, Utah, 1998, p. 556.

[Moff90] A. Moffat, "Implementing the PPM Data Compression Scheme", IEEE Transactions on Communications, 38(11), pp. 1917-1921, 1990.

[SOIm00] K. Sadakane, T. Okazaki, and H. Imai, "Implementing the Context Tree Weighting Method for Text Compression", Proceedings of Data Compression Conference, IEEE Computer Society, Snowbird, Utah, March 2000, pp. 123-132.

[Sewa00] J. Seward, "On the Performance of BWT Sorting Algorithms", Proceedings of Data Compression Conference, IEEE Computer Society, Snowbird, Utah, March 2000, pp. 173-182.

Page 46: END

Thank you!