MASSIVELY PARALLEL LDPC DECODING ON GPU
Vivek Tulsidas Bhat, Priyank Gupta
"Workload Partitioning"
Priyank: Motivation and LDPC introduction. Analysis of the sequential algorithm and build-up to the parallelization strategy. Lessons Learned : Part 1.
Vivek: Parallelization strategy. Results and discussion. Lessons Learned : Part 2. Conclusion.
Motivation
FEC (forward error correction) codes are used extensively in various applications to ensure reliable communication.
Current application trends demand increased data rates.
Given the Shannon limit, low-complexity encoders and decoders are necessary.
Enter LDPC: Low-Density Parity-Check codes.
LDPC : Quick Overview
Iterative approach.
Inherently data-parallel.
Computationally expensive.
Therefore, a perfect candidate for parallelization.
Our Initial Approach
Parallel Code Flow
[Flowchart boxes: Likelihood Ratio Initialization, Probability Ratio Initialization, Likelihood Ratio Recomputation, Probability Ratio Recomputation, Next Guess Calculation, and the decision "Found Codeword or Max Iter." with branches No and Yes, the latter leading to Report Results.]
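The "Found Codeword" test in the flow above can be sketched as a syndrome check: a guess is a codeword when every parity check sums to zero modulo 2. This is a minimal sketch, not the project's code. It hard-codes the 3x7 matrix that appears in the later illustrations, and the names `is_codeword`, `N_CHECKS` and `N_BITS` are hypothetical.

```c
#include <stddef.h>

enum { N_CHECKS = 3, N_BITS = 7 };

/* Parity-check matrix from the illustrations in this deck (3 checks, 7 bits). */
static const unsigned char H[N_CHECKS][N_BITS] = {
    {1, 0, 0, 1, 1, 1, 0},
    {0, 1, 0, 1, 1, 0, 1},
    {0, 0, 1, 0, 1, 1, 1},
};

/* Returns 1 if every parity check is satisfied (H * guess = 0 mod 2), else 0. */
int is_codeword(const unsigned char guess[N_BITS])
{
    for (size_t i = 0; i < N_CHECKS; i++) {
        unsigned parity = 0;
        for (size_t j = 0; j < N_BITS; j++)
            parity ^= (unsigned)(H[i][j] & guess[j]);
        if (parity != 0)
            return 0;   /* check i fails: keep iterating */
    }
    return 1;           /* all checks pass: report results */
}
```

In the decoding loop, this test would run after each Next Guess Calculation, and the loop would also stop once the maximum iteration count is reached.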
Analysis of Sequential Code
Sparse Matrix Representation
typedef struct mod2entry mod2entry; /* Defined below */
typedef struct mod2block mod2block; /* Definition not shown on the slide */

typedef struct /* Representation of a sparse matrix */
{
  int n_rows;           /* Number of rows in the matrix */
  int n_cols;           /* Number of columns in the matrix */

  mod2entry *rows;      /* Ptr to array of row headers */
  mod2entry *cols;      /* Ptr to array of column headers */

  mod2block *blocks;    /* Allocated blocks */
  mod2entry *next_free; /* Next free entry */

} mod2sparse;

struct mod2entry /* Structure representing a non-zero entry, or
                    the header for a row or column */
{
  int row, col;            /* Row and column indexes */

  mod2entry *left, *right, /* Pointers to adjacent entries in row */
            *up, *down;    /* and column, or to headers. Free */
                           /* entries are linked by 'left'. */

  double pr, lr;           /* Probability and likelihood ratios - not used */
                           /* by the mod2sparse module itself */
};
Likelihood Ratio Computation
LR_estimator = 1 (initial)
Forward Transition:
  element_LR(nth) = LR_estimator(nth)
  LR_estimator(n+1th) = LR_estimator(nth) * 2/element_PR(n+1th) - 1
Reverse Transition:
  temp = element_LR(nth) * LR_estimator(nth)
  element_LR(n-1th) = (1-temp) / (1+temp)
  LR_estimator(n-1th) = LR_estimator(nth) * 2/element_PR(n-1th) - 1
1 0 0 1 1 1 0
0 1 0 1 1 0 1
0 0 1 0 1 1 1
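The forward and reverse transitions can be sketched as two passes over the non-zero entries of one row. This is a rough sketch under assumptions: the slide writes the per-entry factor as 2/element_PR - 1, and the code assumes this stands for 2/(1 + pr) - 1, which equals P(0) - P(1) when pr = P(1)/P(0). The function name `update_row_lr` and the flat-array layout are hypothetical. The original code walks linked mod2entry nodes instead.

```c
#include <stddef.h>

/*
 * Two-pass likelihood-ratio update for the n non-zero entries of ONE row
 * (one parity check). pr[k] is the probability ratio stored in the k-th
 * entry of the row; lr[k] receives its likelihood ratio.
 */
void update_row_lr(const double *pr, double *lr, size_t n)
{
    double est = 1.0;                      /* LR_estimator, initially 1 */

    /* Forward transition: lr[k] temporarily holds the product of the
       factors of all entries to the LEFT of k. */
    for (size_t k = 0; k < n; k++) {
        lr[k] = est;
        est *= 2.0 / (1.0 + pr[k]) - 1.0;  /* assumed form of the factor */
    }

    /* Reverse transition: fold in the product of the factors of all
       entries to the RIGHT of k, then convert to a likelihood ratio. */
    est = 1.0;
    for (size_t k = n; k-- > 0; ) {
        double temp = lr[k] * est;
        lr[k] = (1.0 - temp) / (1.0 + temp);
        est *= 2.0 / (1.0 + pr[k]) - 1.0;
    }
}
```

For a check with only two bits, each bit's likelihood ratio comes out equal to the other bit's probability ratio, which matches the intuition that the two bits must agree.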
Probability Ratio Computation
1 0 0 1 1 1 0
0 1 0 1 1 0 1
0 0 1 0 1 1 1
PR_estimator(nth) = Likelihood_Ratio(nth) (initial)
Top-Down Transition:
  element_PR(nth) = PR_estimator(nth)
  PR_estimator(n+1th) = PR_estimator(nth) * element_LR(nth)
Bottom-Up Transition:
  element_PR(n-1th) = element_PR(nth) * PR_estimator(nth)
  PR_estimator(n-1th) = PR_estimator(nth) * element_LR(nth)
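The column-wise update can be sketched in the same style. This is a minimal sketch under assumptions: the estimator is taken to restart at 1 for the bottom-up pass (the slide does not state this), and the name `update_col_pr` and the flat-array layout are hypothetical.

```c
#include <stddef.h>

/*
 * Two-pass probability-ratio update for the n non-zero entries of ONE
 * column (one bit). channel_lr is the likelihood ratio of the bit from
 * the channel; lr[k] is the likelihood ratio stored in the k-th entry of
 * the column; pr[k] receives channel_lr times the product of lr over all
 * OTHER entries of the column.
 */
void update_col_pr(double channel_lr, const double *lr, double *pr, size_t n)
{
    double est = channel_lr;   /* PR_estimator, initially the channel ratio */

    /* Top-down transition: product of everything above entry k. */
    for (size_t k = 0; k < n; k++) {
        pr[k] = est;
        est *= lr[k];
    }

    /* Bottom-up transition: multiply in everything below entry k.
       Restarting the estimator at 1 is an assumption. */
    est = 1.0;
    for (size_t k = n; k-- > 0; ) {
        pr[k] *= est;
        est *= lr[k];
    }
}
```

Each entry ends up excluding its own likelihood ratio, so a check never feeds its own message back to itself.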
Lessons Learned : Part 1
"entities must not be multiplied beyond necessity"
Parallelization Strategy
Transformation
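[Figure: decoding flow for Codeword i: Likelihood Ratio Computation, Probability Ratio Recomputation, Next Guess Calculation, then the decision "Found Codeword or Max Iter." (No / Yes: Report Results). The same flow is shown side by side for Codeword i-2, Codeword i-1, Codeword i+1 and Codeword i+2.]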
Use 1-D arrays
BSC channel data (N M-bit codewords read at a time)
BSC data array with N codewords aligned
Likelihood ratios for all MN bits
Bit probabilities for MN bits
Decoded blocks (N M-bit codewords)
Each thread does the computation for one bit. So for N M-bit codewords, we need MN threads for the likelihood ratio, probability ratio and decoded block computations.
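The one-thread-per-bit layout implies a simple mapping between a flat index and a (codeword, bit) pair. This is a minimal sketch, assuming the N codewords are stored back to back in the 1-D arrays. The names `bit_index`, `locate_bit` and `flat_index` are hypothetical.

```c
#include <stddef.h>

typedef struct {
    size_t codeword;   /* which of the N codewords */
    size_t bit;        /* which of the M bits inside that codeword */
} bit_index;

/* Map a flat thread index t in [0, M*N) to its codeword and bit. */
bit_index locate_bit(size_t t, size_t m_bits)
{
    bit_index idx;
    idx.codeword = t / m_bits;
    idx.bit = t % m_bits;
    return idx;
}

/* Inverse mapping: position of (codeword, bit) in the 1-D arrays. */
size_t flat_index(size_t codeword, size_t bit, size_t m_bits)
{
    return codeword * m_bits + bit;
}
```

On the GPU, t would typically be derived from the block and thread indices. With M = 7, flat index 16 lands on bit 2 of codeword 2.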
Likelihood Ratio Computation : Revisited
1 0 0 1 1 1 0
0 1 0 1 1 0 1
0 0 1 0 1 1 1
The likelihood ratio estimators for forward and reverse estimation are calculated on the host before the likelihood ratio kernel is launched.
Note: The illustration shows just one codeword. This is done for N codewords at a time.
[Figure panels: Likelihood Ratio Estimator : Forward Estimation; Likelihood Ratio Estimator : Reverse Estimation]
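Once the forward and reverse estimators are available, each element's likelihood ratio depends only on its own two estimator values, which is what makes a one-thread-per-element kernel possible. The sketch below illustrates that split for one row. It is not the project's code: the names are hypothetical, and the factor 2/(1 + pr) - 1 is an assumed reading of the slide's formula.

```c
#include <stddef.h>

/* Assumed per-entry factor: P(0) - P(1) for probability ratio pr = P(1)/P(0). */
static double factor(double pr)
{
    return 2.0 / (1.0 + pr) - 1.0;
}

/*
 * Host-side step (sequential): for each entry k of a row,
 * fwd[k] = product of the factors of the entries before k,
 * rev[k] = product of the factors of the entries after k.
 */
void precompute_estimators(const double *pr, double *fwd, double *rev, size_t n)
{
    double est = 1.0;
    for (size_t k = 0; k < n; k++) {
        fwd[k] = est;
        est *= factor(pr[k]);
    }
    est = 1.0;
    for (size_t k = n; k-- > 0; ) {
        rev[k] = est;
        est *= factor(pr[k]);
    }
}

/* Body of a hypothetical per-element kernel: no dependence on neighbours. */
double element_lr(double fwd_k, double rev_k)
{
    double temp = fwd_k * rev_k;
    return (1.0 - temp) / (1.0 + temp);
}
```

The trade-off is that the prefix and suffix products stay sequential on the host, while only the final per-element combination runs in parallel.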
Probability Ratio Computation : Revisited
1 0 0 1 1 1 0
0 1 0 1 1 0 1
0 0 1 0 1 1 1
Likewise for the probability ratio computation, except that the operations are done column by column.
[Figure panels: Probability Ratio Estimator : Top-Down Transition; Probability Ratio Estimator : Bottom-Up Transition]
Salient Features of our implementation
Efficient sparse matrix representation of the standard parity-check matrix.
Simple mathematical model for likelihood ratio and probability ratio computation.
Dedicated data structures for the likelihood ratio and probability ratio kernels.
Code is easily customizable for different code rates.
Supports a larger number of codewords without any major change to the program architecture.
Experimental Setup
                              CPU                GPU1                     GPU2
Platform                      Intel Core 2 Duo   NVIDIA GeForce 8400 GS   NVIDIA GeForce GT120
Clock Speed (Memory Clock)    2.6 GHz            900 MHz                  500 MHz
Memory                        4 GB               512 MB                   512 MB
CUDA Toolkit Version          -NA-               2.3                      2.2
Programming Environment       Linux              Visual Studio            Linux
Results (1/3)
Tested extensively for a code rate of (3,7) on a BSC with an error probability of 0.05.
Optimal execution configuration: numThreadsPerBlock = 256, numBlocks = 7 * mul_factor, where mul_factor depends on the number of codewords to be decoded:
mul_factor = num_codewords / numThreadsPerBlock
Bit error rate is evaluated as the percentage of bits that differ from the original source file.
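The execution configuration can be worked through as plain arithmetic. The sketch below assumes that the 7 is the codeword length in bits and that num_codewords is a multiple of 256, since the integer division would otherwise drop codewords. `launch_config` and `make_launch_config` are hypothetical names.

```c
typedef struct {
    unsigned threads_per_block;
    unsigned num_blocks;
} launch_config;

/* Derive the launch configuration from the number of codewords. */
launch_config make_launch_config(unsigned num_codewords)
{
    const unsigned threads_per_block = 256;  /* numThreadsPerBlock */
    const unsigned bits_per_codeword = 7;    /* assumed meaning of the 7 */

    /* Integer division: exact only when num_codewords is a multiple of 256. */
    unsigned mul_factor = num_codewords / threads_per_block;

    launch_config cfg;
    cfg.threads_per_block = threads_per_block;
    cfg.num_blocks = bits_per_codeword * mul_factor;
    return cfg;
}
```

For 1024 codewords this gives mul_factor = 4 and 28 blocks of 256 threads, or 7168 threads in total: one per bit of the 1024 seven-bit codewords.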
Results (2/3) : Software Execution Time
Results (3/3) : Bit Error Rate Curve
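A bit error rate of the kind plotted here can be computed by counting the bits that differ between the original source data and the decoded output. This is a minimal sketch, assuming both are available as byte buffers of equal length. The function name `bit_error_rate` is hypothetical, and the result is a fraction rather than a percentage.

```c
#include <stddef.h>

/* Fraction of bits that differ between source and decoded (0.0 to 1.0). */
double bit_error_rate(const unsigned char *source,
                      const unsigned char *decoded,
                      size_t n_bytes)
{
    if (n_bytes == 0)
        return 0.0;

    size_t errors = 0;
    for (size_t i = 0; i < n_bytes; i++) {
        unsigned diff = (unsigned)(source[i] ^ decoded[i]);
        while (diff != 0) {          /* count the set bits of the XOR */
            errors += diff & 1u;
            diff >>= 1;
        }
    }
    return (double)errors / (8.0 * (double)n_bytes);
}
```

Multiplying the result by 100 gives the percentage change with respect to the source file.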
Lessons Learned : Part 2
High occupancy does not guarantee better performance.
Although the GPU implementation provides considerable speedup, its BER results are not attractive (in fact, worse than the CPU-based implementation).
The absence of a double-precision floating-point unit in the GPU impacted the results: the probability ratio and likelihood ratio computations are based on double-precision arithmetic.
Reliability? Random bit flips? These could be catastrophic depending on the application for which LDPC decoding is used.
Other programming paradigms: OpenMP? Not as attractive as the GPU in terms of speedup, but a better BER curve.
Case for built-in ECC features within the GPU architecture: NVIDIA Fermi architecture!
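The double-precision point can be illustrated with the (1-temp) / (1+temp) step of the likelihood ratio computation. This sketch is an illustration under assumptions, not a measurement from this project, and the function names are hypothetical. When temp is very close to 1, single precision rounds it to exactly 1 and the likelihood ratio collapses to 0, while double precision keeps a small non-zero value.

```c
/* Likelihood ratio from temp, in double precision. */
double lr_from_temp_double(double temp)
{
    return (1.0 - temp) / (1.0 + temp);
}

/* Same formula in single precision, as on a GPU without a double unit. */
float lr_from_temp_float(float temp)
{
    return (1.0f - temp) / (1.0f + temp);
}
```

For temp = 1 - 1e-9, the double version returns roughly 5e-10, whereas the value passed to the float version has already been rounded to 1.0f, so the result is exactly 0.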
Future Work
Trying this for an AWGN channel with different error probabilities.
How does this perform on better GPU architectures? Tesla? Fermi?
Any other parallelization strategies?
cuBLAS routines for sparse matrix computations on the GPU?
Acknowledgement
We would like to thank Prof. Ali Akoglu and Murat Arabaci (OCSL Lab) for guiding us throughout the course of this project.
Questions : Ask!
Backup Slides
Code Transformation: Likelihood ratio Init Kernel
Code Transformation: Initprp Decode Kernel
Code Transformation: Likelihood Ratio Kernel
Code Transformation: Probability Ratio Kernel
Code Transformation: Next Guess Kernel