High Throughput Compression of Double-Precision Floating-Point Data Martin Burtscher and Paruj Ratanaworabhan School of Electrical and Computer Engineering

High Throughput Compression of Double-Precision Floating-Point Data

Martin Burtscher and Paruj Ratanaworabhan

School of Electrical and Computer Engineering

Cornell University

Fast Floating-Point Compression

Introduction

Scientific programs
  Produce and transfer lots of 64-bit FP data
  Exchange 100s of MB/s, generate 1TB/day of new data

Large amounts of data
  Are expensive to store and transfer
  Take a long time to transfer

Data compression
  Can reduce amount of data
  Can speed up transfer


IEEE 754 Double-Precision Values

Goal
  Compress linear streams of FP data fast and well
  Online operation and lossless compression

Challenges
  Floating-point data are hard to compress
  FP codes may generate over 90% unique values

Related work on lossless FP compression
  Focuses on 32-bit single-precision values
  Relies on smoothness of data or known geometry


Floating-Point Data Compression

Our approach
  Predict FP data with value prediction algorithms and encode the difference

Format:
  Bit 63: sign (S); bits 62-52: exponent; bits 51-0: mantissa

Value predictors
  Hardware devices to speed up processors
  Predict instruction result by extrapolating sequences of previously computed results
  Employ very fast and simple algorithms
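The 64-bit layout above (sign in bit 63, exponent in bits 62-52, mantissa in bits 51-0) can be inspected directly. The following is a minimal sketch; the helper names `double_to_bits` and `split_fields` are hypothetical and chosen only for illustration.

```python
import struct

def double_to_bits(x):
    """Reinterpret a Python float (an IEEE 754 double) as a 64-bit unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def split_fields(x):
    """Return (sign, exponent, mantissa) of a double as unsigned integers."""
    bits = double_to_bits(x)
    sign = bits >> 63                    # bit 63
    exponent = (bits >> 52) & 0x7FF      # bits 62-52 (11 bits, biased)
    mantissa = bits & ((1 << 52) - 1)    # bits 51-0 (52 bits)
    return sign, exponent, mantissa

if __name__ == "__main__":
    for v in (1.0, -2.5, 0.1):
        s, e, m = split_fields(v)
        print(f"{v!r}: sign={s} exponent={e:#05x} mantissa={m:#015x}")
```

Because the sign and exponent occupy the most significant bits, two doubles of similar magnitude tend to share their leading bits, which is what a predict-and-encode-the-difference scheme can exploit.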


FPC Algorithm

  Make two predictions
  Select closer value
  XOR with true value
  Count leading zeros
  Encode value
  Update predictors

[Figure: FPC compression data flow. Each 64-bit double from the uncompressed 1D stream of doubles is compared against the 64-bit predictions of the FCM and DFCM predictors. A selector picks the closer value, which is XORed with the true value. A leading zero byte counter and an encoder then emit a 1-bit predictor code plus a 3-bit leading zero byte count (1+3 bits) per value, followed by a remainder of 0 to 8 bytes. The compressed stream is laid out as bit_a, cnt_a, bit_b, cnt_b, remainder_a, remainder_b, ...]
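The six steps of the FPC algorithm can be sketched in a few dozen lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: the table size, the two hash functions, and the handling of a leading-zero-byte count of 4 are assumptions that the slides do not specify, and the output is kept as a list of tuples rather than packed into the 1+3-bit headers of the real stream format.

```python
import struct

MASK64 = (1 << 64) - 1
TABLE_SIZE = 1 << 16  # assumption: power-of-two table size, not given on the slide

def to_bits(x):
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def from_bits(b):
    return struct.unpack("<d", struct.pack("<Q", b))[0]

class Predictors:
    """FCM and DFCM predictor state. The hash functions are assumptions."""
    def __init__(self):
        self.fcm = [0] * TABLE_SIZE
        self.dfcm = [0] * TABLE_SIZE
        self.fcm_hash = 0
        self.dfcm_hash = 0
        self.last = 0

    def predict(self):
        # FCM predicts a value; DFCM predicts a difference from the last value.
        fcm_pred = self.fcm[self.fcm_hash]
        dfcm_pred = (self.dfcm[self.dfcm_hash] + self.last) & MASK64
        return fcm_pred, dfcm_pred

    def update(self, true_bits):
        self.fcm[self.fcm_hash] = true_bits
        self.fcm_hash = ((self.fcm_hash << 6) ^ (true_bits >> 48)) & (TABLE_SIZE - 1)
        delta = (true_bits - self.last) & MASK64
        self.dfcm[self.dfcm_hash] = delta
        self.dfcm_hash = ((self.dfcm_hash << 2) ^ (delta >> 40)) & (TABLE_SIZE - 1)
        self.last = true_bits

def leading_zero_bytes(x):
    """Number of all-zero bytes at the top of a 64-bit value (0 to 8)."""
    return 8 - (x.bit_length() + 7) // 8

def compress(values):
    p = Predictors()
    out = []
    for v in values:
        true_bits = to_bits(v)
        fcm_pred, dfcm_pred = p.predict()            # make two predictions
        xor_f = true_bits ^ fcm_pred                 # XOR with true value
        xor_d = true_bits ^ dfcm_pred
        # The smaller XOR result has at least as many leading zeros: select it.
        code, residual = (0, xor_f) if xor_f <= xor_d else (1, xor_d)
        lzb = leading_zero_bytes(residual)           # count leading zero bytes
        if lzb == 4:
            lzb = 3  # assumption: nine possible counts must fit in three bits
        remainder = residual.to_bytes(8, "big")[lzb:]
        out.append((code, lzb, remainder))           # encode value
        p.update(true_bits)                          # update predictors
    return out

def decompress(records):
    p = Predictors()
    values = []
    for code, lzb, remainder in records:
        fcm_pred, dfcm_pred = p.predict()
        pred = dfcm_pred if code else fcm_pred
        true_bits = pred ^ int.from_bytes(remainder, "big")
        values.append(from_bits(true_bits))
        p.update(true_bits)
    return values
```

Decompression works because both sides update identical predictor state from the same true values, so the decompressor can regenerate each prediction and undo the XOR. A repeated value such as the second `1.0` in `[1.0, 1.0]` is predicted exactly by the DFCM path in this sketch and costs no remainder bytes at all.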


Algorithm/Implementation Co-Design

Inner loop (about 50 and 70 C statements)
  Compresses or decompresses one block of data
  Accounts for over 90% of execution time

Loop body optimizations
  Loop body is used to hide memory latency
  No fp, int mult, or int div instructions
  No branches (only conditional moves)
  Single basic block (>100 machine instructions)
  Average IPC > 5.4 and 5.1 on Itanium 2
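The "no branches" property can be illustrated with a mask-based select, the arithmetic pattern that a compiler can map to conditional moves. This is only a sketch of the idea: Python does not compile to conditional moves, and the function name `select_branch_free` is hypothetical.

```python
MASK64 = (1 << 64) - 1

def select_branch_free(xor_fcm, xor_dfcm):
    """Pick the smaller of two 64-bit XOR residuals without an if/else.

    Returns (use_dfcm, residual), where use_dfcm is 1 if the DFCM residual
    is strictly smaller and 0 otherwise.
    """
    use_dfcm = int(xor_dfcm < xor_fcm)      # 0 or 1
    mask = (-use_dfcm) & MASK64             # all zeros or all ones
    residual = (xor_dfcm & mask) | (xor_fcm & ~mask & MASK64)
    return use_dfcm, residual
```

Keeping the selection free of control flow is what lets the whole loop body remain a single basic block.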


Evaluation Method

System
  1.6 GHz Itanium 2, Intel C Itanium Compiler 9.1
  Red Hat Enterprise Linux AS4

Scientific datasets
  Linear streams of 64-bit FP data (18 – 277MB)
  4 observations: spitzer, temp, error, info
  4 simulations: comet, plasma, brain, control
  5 messages: bt, lu, sp, sppm, sweep3d


Compression Throughput

[Figure: compression throughput (Gb/s, axis 0 to 6) versus compression ratio (axis 1.0 to 2.0) for BZIP2, GZIP, PLMI, FSD, DFCM, and FPC.]


Decompression Throughput

[Figure: decompression throughput (Gb/s, axis 0 to 7) versus compression ratio (axis 1.0 to 2.0) for BZIP2, DFCM, FPC, FSD, GZIP, and PLMI.]


Summary and Conclusions

FPC algorithm
  Highest throughput and mean compression ratio
  1.02 – 15.05 absolute compression ratio
  840 and 680 MB/s throughput on a 1.6GHz Itanium 2 (= 2 and 2.5 machine cycles per byte)
  http://www.csl.cornell.edu/~burtscher/research/FPC/

Conclusions
  Value predictors are fast & accurate data models
  Algorithm/implementation co-design is essential
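The cycles-per-byte figures in the summary follow from dividing the clock rate by the throughput. The short calculation below is a sketch that assumes 1 MB = 10^6 bytes, which the slides do not state.

```python
def cycles_per_byte(clock_hz, throughput_bytes_per_s):
    """Machine cycles spent per byte at a given clock rate and throughput."""
    return clock_hz / throughput_bytes_per_s

CLOCK_HZ = 1.6e9  # 1.6 GHz Itanium 2

cpb_840 = cycles_per_byte(CLOCK_HZ, 840e6)  # 840 MB/s, assuming MB = 10**6 bytes
cpb_680 = cycles_per_byte(CLOCK_HZ, 680e6)  # 680 MB/s, same assumption

if __name__ == "__main__":
    print(f"840 MB/s: {cpb_840:.2f} cycles/byte")
    print(f"680 MB/s: {cpb_680:.2f} cycles/byte")
```

Under that assumption the results come out near 1.9 and 2.35 cycles per byte, roughly in line with the rounded values of 2 and 2.5 quoted on the slide.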
