




Fine-Grain Parallelization of Entropy Coding on GPGPUs

Ana Balevic

Abstract

Massively parallel GPUs are increasingly used for acceleration of data-parallel functions, such as image transforms and motion estimation. The last stage, i.e. entropy coding, is typically executed on the CPU due to inherent data dependencies in lossless compression algorithms. We propose a simple way of efficiently dealing with these dependencies, and present two novel parallel compression algorithms specifically designed for efficient execution on many-core architectures. By fine-grain parallelization of lossless compression algorithms, we can significantly speed up the entropy coding and enable completion of all encoding phases on the GPGPU.

Parallelization of Entropy Coding on GPGPUs

GPGPU Tesla Architecture

- Parallel array of streaming processors (SPs)
- Large number of lightweight threads
- Shared memory space for each block of threads (e.g. 256 threads)
- Fine-grain scheduling as a mechanism for memory latency hiding (see the minimal kernel sketch below)
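The following is a minimal sketch (not taken from the poster) of how this hierarchy is used in practice: one thread block stages its slice of the input in on-chip shared memory before processing. Kernel and buffer names are illustrative.

```cuda
#include <cuda_runtime.h>

__global__ void stage_in_shared(const unsigned char *in, unsigned char *out, int n)
{
    __shared__ unsigned char tile[256];                // one byte per thread in a 256-thread block
    int gid = blockIdx.x * blockDim.x + threadIdx.x;   // global element index

    if (gid < n) tile[threadIdx.x] = in[gid];          // cooperative, coalesced load into shared memory
    __syncthreads();                                   // the whole block sees the staged tile
    if (gid < n) out[gid] = tile[threadIdx.x];         // placeholder for per-block processing
}

// Launch example: one thread per input byte, 256 threads per block.
// stage_in_shared<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
```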

Algorithm Parallelization

Parallelization Levels
I. Data partitioning
II. Parallelization on the block level (both granularities are sketched below)
  a) Coarse-grain (one thread processes one data block)
  b) Fine-grain (one thread block processes one data block)
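As an illustrative sketch of how the two granularities map onto a CUDA launch, assume the input is split into data blocks of BLOCK_SIZE symbols each; kernel names and the processing body are placeholders.

```cuda
#define BLOCK_SIZE 256

// a) Coarse-grain: one thread encodes one whole data block sequentially.
__global__ void encode_coarse(const unsigned char *in, int nblocks)
{
    int b = blockIdx.x * blockDim.x + threadIdx.x;      // thread index = data-block index
    if (b >= nblocks) return;
    for (int i = 0; i < BLOCK_SIZE; ++i) {
        unsigned char sym = in[b * BLOCK_SIZE + i];      // sequential walk over the data block
        (void)sym;                                       // ... encode sym ...
    }
}

// b) Fine-grain: one thread block cooperates on one data block, one symbol per thread.
__global__ void encode_fine(const unsigned char *in)
{
    unsigned char sym = in[blockIdx.x * BLOCK_SIZE + threadIdx.x];  // one symbol per thread
    (void)sym;   // ... encode sym cooperatively (shared memory, prefix sums, atomics) ...
}
```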

Fine-Grain Parallel Algorithm Design (skeleton sketched below)
I. Assignment of codes to the input data
II. Computation of destination memory locations for the codes
III. Concurrent output of the codes to the compressed data array
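A self-contained skeleton of the three phases is given below, using a trivial fixed-length "code" (32 bits per symbol) so that phases II and III stay race-free; the PAVLE and PARLE sections that follow specialize each phase for variable-length output. All names are illustrative.

```cuda
__global__ void encode_skeleton(const unsigned char *in, unsigned int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // I.   Assign a code to the input symbol (placeholder: the symbol itself).
    unsigned int code = in[i];
    unsigned int len  = 32;                       // fixed length, in bits

    // II.  Compute the destination location; with a fixed length this is a simple
    //      product, with variable lengths it becomes a prefix sum of the lengths.
    unsigned int bitpos = (unsigned int)i * len;

    // III. Output the code to the compressed data array (word-aligned here, so a
    //      plain store suffices; unaligned variable-length output needs atomics).
    out[bitpos / 32] = code;
}
```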

Parallel Variable Length Encoding Algorithm (PAVLE)

PAVLE Algorithm Steps

I. Assignment of codewords to source data (lookup sketch below)
- Static codeword table (e.g. Huffman codewords as in the JPEG standard)
- Codewords can also be computed on the fly (e.g. Golomb codes)
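A sketch of phase I, assuming a static codeword table resident in constant memory; the table contents, the 8-bit symbol width, and all names are illustrative rather than the JPEG Huffman table itself.

```cuda
__constant__ unsigned int g_codeword[256];   // codeword bits, right-aligned
__constant__ unsigned int g_codelen[256];    // codeword length in bits

__global__ void assign_codewords(const unsigned char *in,
                                 unsigned int *codes, unsigned int *lengths, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        unsigned char sym = in[i];
        codes[i]   = g_codeword[sym];    // one table lookup per thread
        lengths[i] = g_codelen[sym];     // lengths feed the prefix sum of phase II
    }
}
```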

II. Computation of output positions (prefix-sum sketch below)
- Bit-accurate computation of the output position: the output position is determined by the memory address and the starting bit position.
- Efficient parallel computation using the prefix sum primitive
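One way to realize phase II (a sketch; the poster's own scan implementation may differ) is an exclusive prefix sum over the per-symbol codeword lengths, which yields the bit position at which each codeword starts. The destination memory address is then bitpos / 32 and the starting bit within that word is bitpos % 32.

```cuda
#include <thrust/device_vector.h>
#include <thrust/scan.h>

void compute_bit_positions(const thrust::device_vector<unsigned int> &lengths,
                           thrust::device_vector<unsigned int> &bitpos)
{
    // bitpos[i] = lengths[0] + lengths[1] + ... + lengths[i-1], with bitpos[0] = 0
    thrust::exclusive_scan(lengths.begin(), lengths.end(), bitpos.begin());
}
```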

III. Parallel bit-level output (atomicOr sketch below)
- Parallelized writing of variable-length codewords
- Safe handling of race conditions via atomic operations
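A sketch of phase III, assuming the output buffer is zero-initialized, codewords are at most 32 bits and right-aligned in codes[], and the stream is packed MSB-first; atomicOr makes concurrent writes into shared output words safe. The poster's optimized variant uses shared-memory atomics; this global-memory version keeps the sketch short.

```cuda
__global__ void scatter_codewords(const unsigned int *codes, const unsigned int *lengths,
                                  const unsigned int *bitpos, unsigned int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    unsigned int code   = codes[i];
    unsigned int len    = lengths[i];
    unsigned int start  = bitpos[i];
    unsigned int word   = start / 32;            // destination word index
    unsigned int offset = start % 32;            // starting bit within that word

    atomicOr(&out[word], (code << (32 - len)) >> offset);      // upper part of the codeword
    if (offset + len > 32)                                      // codeword straddles two words
        atomicOr(&out[word + 1], code << (64 - offset - len));  // spill into the next word
}
```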

Parallel Run Length Encoding Algorithm (PARLE)

- Repetitions of a symbol are replaced by a (symbol, run length) pair
- Starts and ends of symbol runs can be identified by parallel flag generation
- Output positions for RLE pairs also efficiently computed using a prefix sum
- Run lengths accumulated using reduction or computed by differencing (see the sketch below)
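A sketch of the PARLE building blocks (illustrative names, global-memory version): a flag marks the start of each run; an exclusive prefix sum over the flags (e.g. with thrust::exclusive_scan) then gives every run its output slot, and run lengths follow by differencing consecutive run start positions.

```cuda
__global__ void mark_run_starts(const unsigned char *in, unsigned int *flags, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        flags[i] = (i == 0 || in[i] != in[i - 1]) ? 1u : 0u;   // 1 at the first symbol of a run
}

// 'out_idx' holds the exclusive prefix sum of 'flags'; only run starts write output.
__global__ void emit_runs(const unsigned char *in, const unsigned int *flags,
                          const unsigned int *out_idx,
                          unsigned char *run_symbol, unsigned int *run_start, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i]) {
        run_symbol[out_idx[i]] = in[i];   // the run's symbol
        run_start[out_idx[i]]  = i;       // run length = next run's start minus this start
    }
}
```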

Performance Results

Algorithm complexity: work complexity O(n), parallel run time O(log n). Dominant component: the prefix sum primitive.

Test Environment: Intel 2.66 GHz QuadCore CPU, 2 GB RAM, nVidia GeForce GTX280 GPU with shared-memory (SM) atomic instructions on 32-bit words. The serial encoder and the scan kernel are given as references.

Performance Results: Huffman encoding using PAVLE achieves a significant speed-up on a GPU in comparison to a serial CPU-based encoder. An optimized version of PAVLE using shared-memory atomics achieves a ~35x speed-up; PARLE achieves a 6x speed-up.

[Plot: PAVLE encoding time [ms] vs. data size [MB] (0.25-48 MB); curves: vle_kernel_gm1, vle_kernel_sm1, scan, cpu encoder.]

[Plot: PARLE encoding time [ms] vs. data size [MB] (1-32 MB); curves: cpu rle, gpu parle, gpu scan.]

Conclusion

The fine-grain parallel compression algorithms for fixed-length and variable-length encoding achieved up to 6x (PARLE) and 35x (PAVLE) speed-up on the nVidia GeForce GTX280 GPGPU supporting CUDA 1.3. These two algorithms can be easily combined with readily available data-parallel transforms and motion-estimation algorithms. In this way, we provide attractive parallel algorithm building blocks for compression, which enable completion of all coding stages of GPU-accelerated codecs directly on a GPGPU. With this approach, not only do we significantly speed up the entropy coding process (while relieving the CPU), but we also reduce the data transfer time from the GPU to system memory. PAVLE and PARLE will be released as open source in the upcoming CUDA Compression library.

Ana Balevic, 2009. Contact: [email protected] Source Code: http://svn2.xp-dev.com/svn/ana-b-pavle/