

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 57, NO. 5, MAY 2010 1083

VLSI Implementation of a High-Throughput Soft-Bit-Flipping Decoder for Geometric LDPC Codes

Junho Cho, Member, IEEE, Jonghong Kim, and Wonyong Sung, Senior Member, IEEE

Abstract—VLSI-based decoding of geometric low-density parity-check (LDPC) codes using the sum–product or min-sum algorithms is known to be very difficult due to the large memory requirement and the high interconnection complexity caused by high variable and check node degrees. In this paper, a low-complexity high-performance algorithm is introduced for decoding such high-weight LDPC codes. The developed soft-bit-flipping (SBF) algorithm operates in a similar way to the bit-flipping (BF) algorithm but further utilizes the reliability of estimates to improve error performance. A hybrid decoding scheme comprising the BF and SBF algorithms is also proposed to shorten the decoding time. A parallel and pipelined VLSI architecture is developed to increase the throughput without consuming much chip area. The (1057, 813) and (273, 191) projective-geometry LDPC codes are used for performance evaluation, and a decoder for the former is designed in VLSI.

Index Terms—Bit flipping (BF), finite geometry, low-density parity-check (LDPC) codes, projective geometry (PG), soft BF (SBF).

I. INTRODUCTION

LOW-DENSITY PARITY-CHECK (LDPC) codes have been of great interest since the late 1990s due to their capacity-approaching error performance [1]–[4]. As LDPC codes are recently used in many high-bandwidth applications, VLSI-based implementation of the codes has attracted very high interest. The soft-decision-based sum-product algorithm (SPA) with infinite precision [5], [6] and the hard-decision-based bit-flipping (BF) algorithm with 1-bit precision [2] are the two extreme counterparts, which were developed in the earliest stage of LDPC history. The SPA is generally known to show the best error performance but is not easily implemented in hardware because of its high complexity, whereas the low-complexity BF algorithm exhibits far worse error performance than the SPA. They have yielded many variants that attempt to overcome their relative weaknesses.

The LDPC codes are categorized into several classes according to the construction method. Among them, projective-geometry (PG)-based LDPC codes, also known as difference-set cyclic codes, show good error performance with various decoding algorithms [7]. The PG-LDPC codes have a cyclic or quasi-cyclic code structure that leads to regular encoder and decoder implementations. Note that the cyclic or quasi-cyclic code structure has various implementation advantages, such as encoding with linear feedback shift registers. The PG-LDPC codes have a large minimum distance that prevents the occurrence of an error floor. The error floor is, in many cases, considered a serious disadvantage of well-performing random LDPC codes compared to algebraic codes, such as Bose–Chaudhuri–Hocquenghem and Reed–Solomon codes. The PG-LDPC codes with these excellent properties, however, have been of little practical concern due to their large row and column weights, which make hardware implementation infeasible with the high-performance soft-decision decoding algorithms, such as the SPA and the min-sum algorithm (MSA).

Manuscript received September 29, 2009; revised January 05, 2010 and March 08, 2010; accepted March 08, 2010. First published May 10, 2010; current version published May 21, 2010. This work was supported in part by the Ministry of Education, Science and Technology (MEST), Republic of Korea, under the Brain Korea 21 Project and in part by the National Research Foundation funded by the Korea government (MEST) under Grant 20090075770. This work was recommended by Associate Editor C.-C. Wang.

The authors are with the School of Electrical Engineering and Computer Science, Seoul National University, Seoul 151-744, Korea (e-mail: [email protected]; [email protected]; [email protected]).

Digital Object Identifier 10.1109/TCSI.2010.2047743

We have developed the soft BF (SBF) algorithm to take advantage of both the SPA and the BF algorithm in [8]. The developed SBF algorithm enables a low-cost implementation of high-weight LDPC codes, such as the PG-LDPC codes, while sustaining good performance close to that of the SPA. The SBF algorithm follows the basic structure of the BF algorithm to reduce the interconnection complexity and the amount of computation, and it applies reliability estimation that imitates the SPA to bridge the error-performance gap. A hybrid decoding scheme using both the BF and SBF algorithms has also been proposed to shorten the decoding time [8]. In this paper, the SBF algorithm is introduced with a more detailed description of how the initial flipping thresholds are determined, and the hybrid decoding scheme is elaborated in more detail. Then, an efficient hybrid SBF decoder is implemented in VLSI for the (1057, 813) PG-LDPC code. The developed hybrid SBF decoder is parallelized and pipelined to increase the throughput. Simulation results demonstrate that the hybrid SBF decoder can achieve a peak throughput of 17.37 Gbits/s with only a 7.4-mm² area when implemented in a 0.18-μm CMOS technology. It shows performance within 0.2 dB of the floating-point SPA at the target bit error rate. To the best of our knowledge, this is the first VLSI design for the practical use of high-weight PG-LDPC codes.

This paper is organized as follows. Section II introduces various LDPC decoding algorithms. The SBF algorithm is described in Section III, and the hybrid decoding scheme is presented in Section IV. The architecture of the developed hybrid SBF decoder is then illustrated in Section V. The results of the VLSI implementation are shown in Section VI. Finally, concluding remarks are made in Section VII.



TABLE I: THRESHOLD VALUES OF THE 4-BIT SBF ALGORITHM FOR THE (1057, 813) PG-LDPC CODE

II. LDPC DECODING ALGORITHMS AND MARGINALIZATION

Many message-passing-based algorithms have been developed for decoding of LDPC codes, including the SPA [5], [6], MSA [6], [9], [10], BF [2], and weighted BF (WBF) [7] algorithms. These algorithms iteratively exchange estimates between variable and check nodes along the edges defined by the Tanner graph, thereby augmenting the reliability of the estimation at every iteration. Among the various algorithms, the SPA is considered to show the best error performance but also demands the largest implementation complexity in terms of interconnection as well as computation. The SPA computes an outgoing message of a node along an adjacent edge $e$ using all incoming messages excluding the message along edge $e$. This exclusion of the incoming message, called marginalization, ensures that only extrinsic information is exploited to obtain the new estimate [4]. This is known to be an important property for improving error performance but is a primary factor inducing high computation and interconnection complexities. The MSA is an approximation of the SPA, which has smaller computational but still large interconnection complexity. The BF algorithm generally refers to a very simple hard-decision algorithm, obtained by removing marginalization from either Gallager's algorithm A or B. While the SPA and the MSA repeatedly propagate the probabilities of nodes in the real-number domain, the BF algorithm only changes the hard-decision values reflecting the probability. More specifically, a variable node is inverted in the BF algorithm when the number of its unsatisfied parity-check sums reaches a threshold related to the variable node degree; the threshold is fixed (Gallager algorithm A) or can vary during decoding (Gallager algorithm B). The BF algorithm has significantly smaller complexity but shows much worse error performance than the SPA or the MSA. The WBF, modified WBF (MWBF) [11], and improved MWBF (IMWBF) [12] algorithms originate from the BF algorithm and try to improve its poor error performance by assigning weights to check nodes, which represent the reliability of the hard decisions.
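As a minimal illustration of this hard-decision rule, the sketch below inverts every bit whose count of unsatisfied parity checks reaches a given threshold; the adjacency structures and names are ours, not taken from the paper.

# Minimal hard-decision BF step: invert every bit whose number of unsatisfied
# parity checks reaches `threshold`. M_n maps a variable node to its check
# nodes; N_m maps a check node to its variable nodes (illustrative structures).
def bf_step(hard, M_n, N_m, threshold):
    syndrome = {m: sum(hard[n] for n in nbrs) % 2 for m, nbrs in N_m.items()}
    return [bit ^ (sum(syndrome[m] for m in M_n[n]) >= threshold)
            for n, bit in enumerate(hard)]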

Fig. 1. Distribution of the flipping function for (dashed line) correct and (solid line) erroneous variables and the corresponding flipping thresholds $\Delta_i$.

Although marginalization is an important operation for improving error performance, it entails additional computations. Furthermore, marginalization requires a memory that is $d$ times larger, where $d$ denotes the number of edges per node. It is because, with marginalization, $d$ different messages need to be stored and delivered to the $d$ different neighbors, whereas without marginalization, a universal message is broadcast from each node. Therefore, for PG-LDPC codes with relatively high weights, marginalization makes the implementation infeasible; e.g., the (1057, 813) PG-LDPC code has variable and check node degrees of 33, and the (4161, 3431) PG-LDPC code has those of 64. Pseudomarginalization, proposed by the MWBF algorithm, seems a proper remedy for such high-weight LDPC codes. When estimating the next status of the variable nodes using pseudomarginalization, a particular amount of the received value is subtracted from the sum of the incoming check-node messages, thereby considerably alleviating the accumulation of intrinsic information through iterations while demanding almost the same memory regardless of the node degree.
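The memory argument can be made concrete with a rough count of outgoing-message storage; the 4-bit message width below is an illustrative assumption, while the code length and node degree are taken from the text.

# Rough message-memory comparison: d distinct outgoing messages per node with
# marginalization versus one broadcast message per node without it.
def outgoing_message_bits(num_nodes, degree, bits_per_msg, marginalized):
    msgs_per_node = degree if marginalized else 1
    return num_nodes * msgs_per_node * bits_per_msg

full = outgoing_message_bits(1057, 33, 4, marginalized=True)     # with marginalization
bcast = outgoing_message_bits(1057, 33, 4, marginalized=False)   # broadcast only
print(full // bcast)   # ratio equals the node degree, 33 for the (1057, 813) code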

III. SBF ALGORITHM

The SBF algorithm follows the basic structure of the MWBF algorithm employing pseudomarginalization but uses refined flipping criteria over soft-valued variable nodes to achieve better error performance [8]. Let us assume binary phase-shift keying modulation over an additive white Gaussian noise channel. The weight of the MWBF is then defined by $w_m = \min_{n \in \mathcal{N}(m)} |y_n|$ for check node $m$, and that of the IMWBF by $w_{m,n} = \min_{n' \in \mathcal{N}(m) \backslash n} |y_{n'}|$, applying marginalization, where $\mathcal{N}(m)$, $\mathcal{N}(m) \backslash n$, and $y_n$ denote the set of neighboring variable nodes of check node $m$, the same set excluding variable node $n$, and the $n$th received variable, respectively. Note that the log-likelihood ratio (LLR) of the received signal is computed by $L_n = 2 y_n / \sigma^2$, where $\sigma^2$ is the noise variance. The weight marginalization adopted in the IMWBF algorithm is not used in this paper because it causes too large a hardware increase for PG-LDPC codes. Moreover, to calculate the weight, the summation is taken over all neighboring variable nodes rather than their minimum, because the minimum is only suitable for accurate reliability computation based on the LLR and exact marginalization. Thus, in the SBF algorithm, the weight for the $m$th check node is computed by

$w_m = \sum_{n \in \mathcal{N}(m)} |y_n|. \qquad (1)$

Fig. 2. FER of the (1057, 813) PG-LDPC code under various decoding algorithms.

The flipping function for variable node $n$ is the same as those of the MWBF and IMWBF algorithms, which is defined by

$E_n = \sum_{m \in \mathcal{M}(n)} (2 s_m - 1)\, w_m - \alpha\, |y_n| \qquad (2)$

where $\mathcal{M}(n)$, $s_m$, and $\alpha$ denote the set of neighboring check nodes of variable node $n$, the $m$th syndrome component, and a nonnegative integer constant, respectively. The flipping function indicates the extent of disagreement with the current hard-decision bit of variable node $n$ provided by the soft-valued parity-check sums. Note that all the constituent computations of (2) can be performed within variable node $n$, where the subtraction of $\alpha |y_n|$ describes the pseudomarginalization.
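Assuming the reconstructed forms of (1) and (2) above and illustrative adjacency structures, the following floating-point sketch computes the check weights and flipping functions before any quantization.

# N_m[m] lists the variable nodes of check m, M_n[n] lists the check nodes of
# variable n, y is the received vector, hard holds the current hard decisions
# (0/1), and alpha is the integer constant of (2); all names are ours.
def check_weights(y, N_m):
    return {m: sum(abs(y[n]) for n in nbrs) for m, nbrs in N_m.items()}      # eq. (1)

def flipping_functions(y, hard, M_n, N_m, alpha):
    w = check_weights(y, N_m)
    s = {m: sum(hard[n] for n in nbrs) % 2 for m, nbrs in N_m.items()}       # syndrome
    return {n: sum((2 * s[m] - 1) * w[m] for m in checks) - alpha * abs(y[n])  # eq. (2)
            for n, checks in M_n.items()}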

Fig. 3. (Dashed line) Frame and (solid line) bit error rates of the (273, 191) PG-LDPC code.

Fig. 4. Evolution of the FER through iterations at two $E_b/N_0$ values; the solid line corresponds to 4 dB.

In order to improve error performance without exact marginalization in both variable and check nodes, the SBF algorithm adopts the following two methods. The first is nonuniform quantization. Suppose that the variable nodes are quantized into $q$ bits using a set of nonuniform quantization levels, and so are the check nodes using their own set of levels. Although it is simple to find efficient quantization levels for the SPA or the MSA because they operate in the exact probability domain, it does not seem analytically tractable to calculate them for the simplified BF-based algorithms. Therefore, in this paper, good quantization levels are roughly found by simulation, where the results for $q = 4$ are depicted in Table I. Good error performance is achieved when the variables with small magnitudes are assigned fine quantization levels. Given the initial quantization levels, small changes are made to each level by either changing it further or keeping the older value, and we keep track of the good levels that yield better error performance. The quantized values are represented in a sign-magnitude form that excludes 0, for ease of implementation. The second method is to reflect the flipping function on the soft-valued variable nodes with multiple flipping thresholds rather than on the hard-decision bits with just one threshold, as in the MWBF and IMWBF algorithms. The SBF algorithm employs four levels of flipping strength, namely, strong flip, weak flip, maintaining, and strengthening. They are determined by the three flipping thresholds $\Delta_1 < \Delta_2 < \Delta_3$. The strong and weak flips imply shifts of a signal toward the direction of its opposite sign. Therefore, they are not "flips" in a strict sense, but we use these terms to inherit the conventional terminology of the modified BF algorithms. A similar method has been proposed by Palanki et al. [13], in which a three-state BF algorithm allows an erased value for the variables and deactivation of parity-check sums. The wider reflection of the flipping function increases the utilization ratio of the incoming information obtained at each decoding iteration. Moreover, by reflecting the flipping function not on the hard-decision bits but on the soft-valued variable nodes, the SBF enables seamless propagation of information through decoding iterations, as the SPA and MSA do. Note that, in the MWBF and IMWBF algorithms, only one or a few elements of the hard-decision vector that have the largest flipping function are inverted.
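As an illustration of the sign-magnitude quantization just described, the sketch below maps a received value to a 1-bit sign and a 3-bit magnitude index; the boundary values are placeholders chosen for illustration, not the tuned levels of Table I.

# Illustrative 4-bit sign-magnitude quantizer: 1 sign bit + 3 magnitude bits.
# Every code maps to a nonzero reconstruction level, so 0 itself is not
# represented. Boundaries are finer near zero, as suggested in the text.
BOUNDARIES = [0.15, 0.3, 0.5, 0.8, 1.2, 1.7, 2.3]   # 7 boundaries -> 8 magnitude codes

def quantize_4bit(y):
    sign = 0 if y >= 0 else 1                  # hard-decision bit
    level = sum(abs(y) > b for b in BOUNDARIES)  # magnitude index 0..7
    return sign, level

print(quantize_4bit(0.2), quantize_4bit(-1.5))   # e.g. (0, 1) (1, 5)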


Fig. 5. (Dashed line) Frame and (solid line) bit error rates of the (1057, 813) PG-LDPC code. Iteration limits are set to 40.

Fig. 6. (Dashed line) Average number of iterations and (solid line) throughput for the decoding of the (1057, 813) PG-LDPC code. Iteration limits are set to 40.

Finally, the SBF algorithm is carried out as follows:

Initialization:

Set the iteration number $k$ to zero. Quantize all variable nodes using the variable-node quantization levels.

Step 1. Check node update:

For each check node $m$, compute the syndrome $s_m$ and the weight $w_m$ by (1), and then quantize the weight using the check-node quantization levels.

Step 2. Decision:

If all the parity checks are satisfied, i.e., if all the syndrome components are zero, the hard-decision vector of the variable sequence is produced as the decoded word, and the decoding is finished. If $k$ is larger than the predefined limit $k_{\max}$, finish the decoding with a decoding-failure message.

Fig. 7. Bit error rate of the (1057, 813) PG-LDPC code under hybrid decoding schemes.

Fig. 8. Average number of iterations under hybrid decoding schemes.

Step 3. Variable node update:

For each variable node $n$, compute the flipping function $E_n$ defined by (2). Then, according to the flipping function, update the variable node with SBF as follows.

a) (Strong flip) If $E_n \ge \Delta_3$, shift the soft value of variable node $n$ by a large step toward its opposite sign.

b) (Weak flip) If $\Delta_2 \le E_n < \Delta_3$, shift the soft value by a small step toward its opposite sign.

c) (Maintain) If $\Delta_1 \le E_n < \Delta_2$, keep the soft value unchanged.

d) (Strengthen) If $E_n < \Delta_1$, increase the magnitude (reliability) of the current value.

Fig. 9. Serial architecture of a hybrid SBF decoder.

If no flipping occurs, adjust the flipping thresholds $\Delta_3$ and $\Delta_2$ using some positive integer constant $\delta$ (the adaptation is described in Section V). Increment the iteration number $k$, and then go to Step 1.
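A minimal sketch of the Step 3 update on sign-magnitude values (sign, level) follows; the step sizes used for the strong flip, the weak flip, and the strengthening are illustrative assumptions, not the tuned values of the decoder.

# Walk `steps` positions toward the opposite sign along the sign-magnitude
# number line ..., (s,1), (s,0), (s^1,0), (s^1,1), ...
def shift_toward_opposite(sign, level, steps):
    if steps <= level:
        return sign, level - steps
    return sign ^ 1, steps - level - 1

def sbf_update(value, E_n, d1, d2, d3):
    # d1 < d2 < d3 play the roles of Delta_1..Delta_3; levels run from 0 to 7.
    sign, level = value
    if E_n >= d3:
        return shift_toward_opposite(sign, level, 2)   # strong flip (assumed step of 2)
    if E_n >= d2:
        return shift_toward_opposite(sign, level, 1)   # weak flip (assumed step of 1)
    if E_n >= d1:
        return sign, level                             # maintain
    return sign, min(level + 1, 7)                     # strengthen the current decision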

The three flipping thresholds $\Delta_1$, $\Delta_2$, and $\Delta_3$ are initially set to predetermined values. By evaluating $E_n$ for a large number of correct and erroneous variable nodes separately, appropriate initial values for the $\Delta_i$'s can be obtained, with which the correct variables are flipped less and the erroneous ones are flipped more at the first iteration. An example distribution of the flipping function and the corresponding $\Delta_i$'s are shown in Fig. 1. Then, the integer constants $\alpha$ in (2) and $\delta$ in Step 3 are obtained by simulation to achieve the best error performance, employing the quantization levels given in Table I.

The frame error rates (FERs) of the (1057, 813) PG-LDPC code under the SBF algorithm with various quantization levels are shown in Fig. 2. For comparison, the error performances of the SPA, WBF, and IMWBF algorithms are also shown. The SBF algorithm is tested using fixed-point computations, whereas the other algorithms are based on floating-point arithmetic. It is noticeable that the 6-bit SBF even outperforms the SPA in the high signal-to-noise ratio (SNR) region with the same iteration limit of 200. Heuristic algorithms often show better error performance than the numerically accurate SPA in the high-SNR region because the approximate (inaccurate) computations seem less affected by the correlation between messages, which deteriorates the error-correcting capability [10]. Meanwhile, as was observed earlier in [13] for the (255, 127) Euclidean-geometry LDPC code, which is another class of geometric LDPC codes, the primary reason for the good performance of the BF-based algorithms seems to be the large number of redundant parity-check sums (i.e., large variable node degrees) of the geometric LDPC codes. Note that LDPC codes with large variable node degrees, on the other hand, make the implementation of the marginalization-based SPA and MSA much more costly. Moreover, no error floor is observed down to the simulated FER even with as few as 2 bits of quantization when decoding with the SBF algorithm. When tested on the (273, 191) PG-LDPC code with the corresponding $\alpha$, $\delta$, $\Delta_i$'s, and quantization levels, the SBF decoding also shows good error performance, as shown in Fig. 3. All geometric codes have good structural properties and large weights since they are constructed from finite geometries with just different parameters. Therefore, it is expected that the SBF algorithm generally shows performance near that of the SPA when applied to this special class of LDPC codes.

IV. HYBRID DECODING WITH BF AND SBF ALGORITHMS

The SBF algorithm flips the variable nodes that exhibit almost the largest flipping function, as explained in Step 3 of the algorithm description. Although such a safe and conservative change of variable nodes toward the opposite direction improves the error performance, more iterations are required for decoding convergence. The error performances as a function of the iteration number are shown in Fig. 4, where a fixed flipping threshold is used for the BF algorithm. Even though the error performance with a large number of iterations improves as the number of quantization bits $q$ increases, the decoding convergence becomes slower because only a small number of variable nodes change their values, by at most two levels, within the dynamic range of the $q$-bit representation. On the other hand, the BF algorithm aggressively inverts a larger number of variables with a fixed threshold, thereby achieving faster convergence.

Fig. 10. Flow chart of the hybrid SBF decoding process.

The slow-decoding-convergence problem of the SBF algorithm can be alleviated by two-stage hybrid decoding; it first attempts the fast BF decoding with a fixed threshold and a very small iteration limit and then employs the SBF decoding in the case of a BF decoding failure. In this paper, compromising among the error performance, the convergence speed, and the implementation cost, the number of quantization bits is set to $q = 4$. Frame and bit error rates of the (1057, 813) PG-LDPC code under the three decoding schemes are shown in Fig. 5, where the hybrid SBF decoding scheme assigns 4 and 36 iterations to the BF and SBF stages, respectively. The average number of iterations and the throughput of these decoding schemes are shown in Fig. 6. The decoding throughput is estimated by simulation based on the developed four-stage pipelined hybrid SBF decoder with a parallel factor of 64 and a clock frequency of 345 MHz. The hybrid SBF decoding scheme shows considerably higher decoding throughput than SBF-only decoding near $E_b/N_0$ of 4.5–7.0 dB. Meanwhile, when $E_b/N_0$ is below 4.0 dB or beyond 7.5 dB, the decoding throughputs of the SBF-only and the hybrid SBF schemes are almost the same. It is because the BF algorithm barely succeeds in decoding below 4.0 dB (see Fig. 2) and the SBF algorithm reaches decoding convergence as fast as the BF algorithm beyond 7.5 dB. When the SNR is larger than 4.1 dB, which is the case of practical concern where the BER is already small, at least 1.05 Gbits/s of throughput is achieved. The throughput increases up to 17.37 Gbits/s if the received words are perfectly error free.

The error performance and decoding time of the hybrid decoding schemes with various combinations of decoding iterations are shown in Figs. 7 and 8, respectively. When we conduct the BF decoding for fewer than three iterations, the BF decoding fails frequently, and the decoding time does not change much with the hybrid decoding (see the corresponding curve in Fig. 8); on the other hand, if we assign more than five iterations to the BF decoding, the reduced number of iterations for the SBF decoding deteriorates the overall error performance (see the corresponding curve in Fig. 7). If we vary the number of iterations for the BF decoding within three to five while maintaining the total number of iterations for the hybrid decoding, almost the same performance is observed for both the error rate and the decoding time. These combinations of iterations seem to be good compromises between the error and time performance.
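The two-stage schedule can be summarized by the following sketch, assuming hypothetical bf_decode and sbf_decode routines and the 4/36 iteration split used above.

# Two-stage hybrid schedule: a short 1-bit BF pass, then 4-bit SBF on the
# original soft values if BF fails. bf_decode/sbf_decode are assumed to return
# (success, codeword) and are not defined here.
def hybrid_decode(y, bf_decode, sbf_decode, bf_iters=4, sbf_iters=36):
    ok, word = bf_decode(y, max_iters=bf_iters)      # fast hard-decision stage
    if ok:
        return ok, word
    # Roll back to the received soft values and run the SBF stage.
    return sbf_decode(y, max_iters=sbf_iters)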


Fig. 11. Structure of (a) VPU and (b) CPU.

V. ARCHITECTURE OF THE HYBRID SBF DECODER

A. Serial Architecture

A serial hybrid SBF decoder is developed in this paper as shown in Fig. 9. The decoder consists of an input buffer (Ibuff), a rollback buffer (Rbuff), an output buffer (Obuff), variable nodes, check nodes, a flip unit, a variable node processing unit (VPU), and a check node processing unit (CPU), where the buffers and nodes are realized by shift registers. The 1-bit BF decoding and the 4-bit SBF decoding can be simply implemented in shared hardware since they are based on almost the same algorithm with just different precision.

Once an $N$-bit word is transmitted to the developed decoder, the received analog signals are quantized using the boundary values of the quantization levels and stored in the 4-bit input buffers in a sequential manner by shifting the registers bit by bit (thus taking $N$ clock cycles). Then, the variable nodes are initialized in a single clock cycle by fetching the sign bits of the input buffer in a fully parallel manner (see Fig. 10 for the detailed decoding flow). In the following check node update step, the VPU computes a parity check using only the sign bits of the connected 33 variable nodes and stores it in the corresponding check node, again in a serial manner. The upper part of the VPU shown in Fig. 11(a) is used in this step. While the check nodes are being updated, the input buffer transfers its sign bits to the rollback buffer and its magnitudes to the variable nodes for the case of a BF decoding failure. At the same time, the next received word fills the emptied input buffer in a first-in first-out (FIFO) manner. Also in this step, the 1-bit output buffers containing the previously decoded word deliver the outputs with a flag of either decoding success or failure, which was determined in the previous check node update step.

When this output is not a final decoding result but an intermediate word, it is nullified with an output-disable signal. Similar to the case of the input buffer, the sign bit of the current variable node fills the emptied output buffer in a FIFO manner, which will be the output in the next check node update step. If the current word being decoded is a legitimate codeword, all parity-check sums are identified as satisfied after the check node update step, and the next received word can be fetched in a single clock cycle because all the required data are already stored in the suitable positions. In this way, i.e., by overlapping the input and output (IO) buffering with the check node update and by fetching a received word in a single clock cycle, a very high throughput can be attained, particularly at high SNR where only a few iterations are needed for decoding success. If any of the check nodes are not satisfied, the variable node update step follows. In this step, the CPU adds up the parity bits of the connected 33 check nodes and compares the result with the fixed flipping threshold, thereby generating a flipping signal for the corresponding variable node. The sign bit of each variable node is updated by reflecting this signal. After the aforementioned first iteration of the check and variable node update steps, the check node update step is operated without filling the magnitudes of the variable nodes and without input buffering until the four iterations assigned to the BF decoding are consumed. This can be implemented by clock gating.

If the BF decoding fails, the sign bits of the variable nodes are replaced by the contents of the rollback buffer. It takes one cycle to complete the rollback, as in the initialization step. Now, the variable nodes contain the soft-valued received word. Utilizing the full 4-bit precision and all hardware resources of the VPU and CPU shown in Fig. 11, the SBF decoding is conducted. Each 4-bit variable node consists of an inverted 1-bit hard-decision value and the 3-bit reliability of that hard-decision value. Then, in the check node update step, the VPU generates 4-bit parity checks in a sign-magnitude form; the most significant bit, representing the inverted hard-decision parity check, is obtained by inverting the modulo-2 sum of the hard-decision bits, and the remaining 3 bits are calculated by the ordinary sum of the reliabilities, followed by quantization with the check-node threshold levels [see Fig. 11(a)].

Fig. 12. Structure of variable nodes when the parallel factor is 32.

Fig. 13. Structure of the threshold adaptation unit.

The result is then converted to the 4-bit two's-complement representation before being stored in the check node. The two's-complement format is efficient for the check nodes, whereas the sign-magnitude format is advantageous for the variable nodes. It is because the CPU requires a signed summation of the check nodes, while the VPU sums up the absolute values of the variable nodes. The format conversion unit is attached at the last stage of the VPU rather than at the first stage of the CPU because the critical path is in the CPU. As a result of the VPU operations, the check nodes obtain 4-bit values whose dynamic range is $[-8, -1]$ and $[0, 7]$. However, the 4-bit SBF algorithm requires the dynamic range of $[-8, -1]$ and $[1, 8]$ for conducting the variable node update step. Therefore, a straightforward implementation of the CPU would require a conditional addition for every input from the check nodes, which adds one to the magnitude when the sign is nonnegative. In our implementation, however, the dynamic-range mismatch is simply compensated by separately adding the number of nonnegative check nodes connected to the CPU, as well as adding the check nodes as they are, as shown in Fig. 11(b). In this way, the complexity of the conditional additions is greatly alleviated. The lower part of Fig. 11(b) shows the hardware for threshold adaptation. If no flipping occurs in the variable node update step even though there is at least one unsatisfied parity check, $\Delta_3$ is modified to the maximum reliability value that has been detected in the last variable node update step, and $\Delta_2$ is modified correspondingly. Both in the VPU and the CPU, Wallace trees and carry look-ahead adders are employed to minimize the critical path delay caused by the large input addition.
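A small sketch of this compensation with illustrative integer inputs is shown below; it reproduces the effect of a conditional +1 per nonnegative check-node input by adding the count of nonnegative inputs once.

def cpu_sum(check_values):
    # Signed sum of the check-node inputs plus one per nonnegative input.
    plain_sum = sum(check_values)
    nonneg_count = sum(1 for c in check_values if c >= 0)
    return plain_sum + nonneg_count

# Equivalent to summing (c + 1 if c >= 0 else c) for each input c.
assert cpu_sum([3, -2, 0]) == sum(c + 1 if c >= 0 else c for c in [3, -2, 0])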

As described so far, the BF decoding can be easily implemented on the SBF decoder. Therefore, the hybrid SBF decoding is a very efficient way to improve the throughput with only a small increase in the implementation cost.

B. Parallel Architecture

Fig. 14. Structure of (a) VPU and (b) CPU with four pipeline stages.

The serial decoder can be converted to a parallel one by simply duplicating the VPU, the CPU (except the threshold adaptor), the flip unit, and the corresponding interconnection networks $P$ times, where $P$ denotes the parallel factor. If the VPU of the serial decoder is connected to a certain set of variable nodes, the $i$th VPU of the parallel decoder should be connected to a correspondingly offset set of variable nodes. The buffers and nodes are vertically grouped by $P$ elements to deliver proper messages to the VPUs and CPUs, where an example of the variable nodes with a parallel factor of 32 is shown in Fig. 12. Note that the number of shift registers for the buffers and nodes is almost unchanged by the parallelization. Therefore, although the shift registers dominate the total cell area in the serial architecture, the portion of the computational units for the VPUs and CPUs becomes larger as $P$ increases. It is also expected that the power consumption in the VPUs and CPUs becomes more significant with increasing $P$, since the signal transitions in the shift registers are not affected by $P$.
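One plausible way to picture the grouping, assuming a simple modulo assignment of register positions to lanes (our illustration, not the exact wiring of Fig. 12), is the following.

# Assign position j of an N-element register bank to lane j % P; each lane
# then holds at most ceil(N/P) positions and processes one of them per cycle.
def lane_assignment(N, P):
    return {lane: [j for j in range(N) if j % P == lane] for lane in range(P)}

groups = lane_assignment(1057, 64)
print(len(groups[0]), len(groups[63]))   # 17 and 16: ceil(1057/64) and floor(1057/64)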


TABLE II: IMPLEMENTATION RESULTS

Footnotes to Table II: Measured at a given BER. Error performance is measured with 50 iterations, while the decoding throughput is measured with 16 iterations. Error performance is measured with 32 iterations, while the decoding throughput is measured with 16 iterations.

In the developed parallel SBF decoder, there is a unified threshold adaptation unit, as shown in Fig. 13. Although the initialization and rollback time is not changed, the time to complete the check node and variable node update steps is reduced to $\lceil N/P \rceil$ clock cycles by the parallelization.

The pipelining technique is applied to the parallel architecture to increase the clock frequency. For the four-stage pipelining, the Wallace tree is partitioned into several subtrees, two pipeline registers are inserted between the front and back ends of the entire Wallace tree, and another pipeline register lies before the quantization unit in the VPU and before the threshold adaptation unit in the CPU (see Fig. 14).

With $s$-stage pipelining, the check node and variable node processing times are increased by $s - 1$ clock cycles. Pipelining is also required in the threshold adaptation unit, whose path delay increases with the parallel factor $P$. For the threshold adaptation unit, many pipeline stages can be inserted if required, since the modified threshold values are not used until the next variable node update step, $\lceil N/P \rceil$ or more clock cycles later.
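As a rough consistency check of the peak-throughput figure, assume that an error-free frame is resolved by a single check node update pass of $\lceil 1057/64 \rceil = 17$ clock cycles plus the four pipeline stages, i.e., about 21 cycles per frame (this cycle count is our estimate, not a figure stated in the text):

$$\frac{1057\ \text{bits} \times 345 \times 10^{6}\ \text{cycles/s}}{17 + 4\ \text{cycles}} \approx 17.37\ \text{Gbits/s},$$

which agrees with the peak throughput quoted in Sections I and IV.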

VI. VLSI IMPLEMENTATION RESULTS AND COMPARISON WITH OTHER WORKS

A hybrid SBF decoder was synthesized with Synopsys Design Compiler using the Chartered Semiconductor 0.18-μm 5LM CMOS standard cell library. Synopsys Astro was used for placement and routing, and PrimeTime for static timing and power estimation. Listed in Table II are the post-layout simulation results compared with the state-of-the-art decoders.

Fig. 15. Layout of the hybrid SBF decoder.

The developed hybrid SBF decoder (Fig. 15) shows comparable error performance with the (1057, 813) PG-LDPC code when compared to the state-of-the-art decoders presented in [14]–[17] using well-known structured LDPC codes. With a parallel factor of 64 and the four-stage pipeline technique, the developed decoder attains a peak throughput of 17.37 Gbits/s when there is no error in the received frames. On the other hand, if every decoding trial fails, the throughput decreases to 253 Mbits/s by consuming all 40 iterations. The developed decoder demonstrates 1.05 Gbits/s at an $E_b/N_0$ of 4.1 dB.


The decoders in [14]–[16] show considerably higher throughput, and that in [17] presents comparable throughput, but they require much larger chip areas. Since the throughput of the proposed LDPC decoder can be increased almost proportionally by raising the parallel factor, there is room to increase the throughput by investing more chip area. In this sense, the developed decoder seems very competitive when comparing the maximum throughput per chip area. The developed chip, however, does not show a good minimum throughput. This is partly because the iteration limit of this chip is set to 40, while that of the others is 16. The minimum throughput of the developed decoder increases to 656 Mbits/s when the iteration limit is reduced to 16. In this paper, the iteration limit is set to the number yielding approximately twice the decoding failures compared to the asymptotic case when a sufficiently large number of iterations is given (see Fig. 4). However, in some prior works, such as [15] and [17], it seems that the error performance is measured with a large iteration limit while the decoder is implemented with a small one to meet the minimum throughput. In this case, the implemented decoder chip may show worse performance than expected. In addition, the developed SBF decoder chip does not employ the layered scheduling that can operate both the VPU and the CPU simultaneously. If the minimum throughput is considered very important, this design can be improved by employing the layered decoding algorithm. Another remark should be made that, in many works, the number of clock cycles required for IO (i.e., the time to output a decoded frame and fill the memories or registers with a received frame) has not been considered when estimating the decoding throughput. Since the lengths of the LDPC codes used in [14]–[17] are thousands of bits, the IO time would considerably reduce the demonstrated throughput. In this paper, we implemented IO buffers that shorten the data fetching time to a single clock cycle, and the IO time is taken into account in computing the throughput. Without these IO buffers, the chip area shrinks by approximately 1.0 mm² (13.5%) at the expense of reduced throughput. Most importantly, this chip demonstrates the implementation feasibility of LDPC codes with large variable and check node degrees, while practical LDPC codes have only employed small variable and check node degrees to limit the chip size. The implementation feasibility of large-weight LDPC codes can lead to the discovery of better practical codes.

VII. CONCLUDING REMARKS

The SBF algorithm adopts the pseudomarginalization scheme for hardware complexity reduction but employs nonuniform quantization and multistrength BF techniques to improve error performance. The hardware reduction is more prominent for large-weight LDPC codes, such as geometric LDPC codes, when compared with the sum-product or min-sum decoding algorithms. In order to improve the convergence speed, a hybrid decoding scheme has also been developed. When implemented in a 0.18-μm CMOS technology, a 4-bit hybrid SBF decoder employing a 64-way parallel and four-stage pipelined architecture achieves a peak throughput of 17.37 Gbits/s with a 7.4-mm² area. It provides performance within 0.2 dB of the floating-point SPA at the target bit error rate.

REFERENCES

[1] R. G. Gallager, "Low-density parity-check codes," IRE Trans. Inf. Theory, vol. IT-8, no. 1, pp. 21–28, Jan. 1962.
[2] R. G. Gallager, Low-Density Parity-Check Codes. Cambridge, MA: MIT Press, 1963.
[3] D. J. MacKay, "Good error-correcting codes based on very sparse matrices," IEEE Trans. Inf. Theory, vol. 45, no. 2, pp. 399–432, Mar. 1999.
[4] T. J. Richardson and R. L. Urbanke, "The capacity of low-density parity-check codes under message-passing decoding," IEEE Trans. Inf. Theory, vol. 47, no. 2, pp. 599–618, Feb. 2001.
[5] R. M. Tanner, "A recursive approach to low complexity codes," IEEE Trans. Inf. Theory, vol. IT-27, no. 5, pp. 533–547, Sep. 1981.
[6] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger, "Factor graphs and the sum–product algorithm," IEEE Trans. Inf. Theory, vol. 47, no. 2, pp. 498–519, Feb. 2001.
[7] Y. Kou, S. Lin, and M. P. C. Fossorier, "Low-density parity-check codes based on finite geometries: A rediscovery and new results," IEEE Trans. Inf. Theory, vol. 47, no. 7, pp. 2711–2736, Nov. 2001.
[8] J. Cho and W. Sung, "High-performance and low-complexity decoding of high-weight LDPC codes," (in Korean) J. Korea Inf. Commun. Soc., vol. 34, no. 5, pp. 498–504, May 2009.
[9] M. P. C. Fossorier, M. Mihaljevic, and H. Imai, "Reduced complexity iterative decoding of low-density parity check codes based on belief propagation," IEEE Trans. Commun., vol. 47, no. 5, pp. 673–680, May 1999.
[10] J. Chen, A. Dholakia, E. Eleftheriou, M. P. C. Fossorier, and X.-Y. Hu, "Reduced-complexity decoding of LDPC codes," IEEE Trans. Commun., vol. 53, no. 8, pp. 1288–1299, Aug. 2005.
[11] J. Zhang and M. P. C. Fossorier, "A modified weighted bit-flipping decoding of low-density parity-check codes," IEEE Commun. Lett., vol. 8, no. 3, pp. 165–167, Mar. 2004.
[12] M. Jiang, C. Zhao, Z. Shi, and Y. Chen, "An improvement on the modified weighted bit flipping decoding algorithm for LDPC codes," IEEE Commun. Lett., vol. 9, no. 9, pp. 814–816, Sep. 2005.
[13] R. Palanki, M. P. C. Fossorier, and J. S. Yedidia, "Iterative decoding of multiple-step majority logic decodable codes," IEEE Trans. Commun., vol. 55, no. 6, pp. 1099–1102, Jun. 2007.
[14] L. Liu and C. J. R. Shi, "Sliced message passing: High throughput overlapped decoding of high-rate low-density parity-check codes," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 55, no. 11, pp. 3697–3710, Dec. 2008.
[15] C. Zhang, Z. Wang, J. Sha, L. Li, and J. Lin, "Flexible LDPC decoder design for multi-Gb/s applications," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 57, no. 1, pp. 116–124, Jan. 2010.
[16] H. Zhong, W. Xu, N. Xie, and T. Zhang, "Area-efficient min-sum decoder design for high-rate quasi-cyclic low-density parity-check codes in magnetic recording," IEEE Trans. Magn., vol. 43, no. 12, pp. 4117–4122, Dec. 2007.
[17] M. M. Mansour and N. R. Shanbhag, "A 640-Mb/s 2048-bit programmable LDPC decoder chip," IEEE J. Solid-State Circuits, vol. 41, no. 3, pp. 684–698, Mar. 2006.

Junho Cho (M'10) received the B.S., M.S., and Ph.D. degrees in electrical engineering and computer science from Seoul National University, Seoul, Korea, in 2004, 2006, and 2010, respectively.

He was a Visiting Researcher with the University of Illinois at Urbana–Champaign, Urbana, from November 2008 to September 2009. He is currently a Postdoctoral Researcher with Seoul National University. His research interests are the design and implementation of error-correcting codes and signal processing for reliable storage devices.


Jonghong Kim received the B.S. degree in electronic and information engineering from the Seoul National University of Technology, Seoul, Korea, in 2007 and the M.S. degree in electrical engineering and computer science from Seoul National University, Seoul, in 2009, where he is currently working toward the Ph.D. degree.

His research interests include VLSI implementation of error correction systems and its optimization.

Wonyong Sung (S'84–M'87–SM'07) received the Ph.D. degree in electrical and computer engineering from the University of California, Santa Barbara, in 1987. During his Ph.D. course, he studied vector and multiprocessor implementation of signal processing algorithms.

He has been a Faculty Member with Seoul National University, Seoul, Korea, since 1989. He was a member of the editorial board of the Signal Processing Journal, Elsevier, from 2007 to 2009. His main research interests are the development of VLSI systems and parallel processing software for signal processing. His research topics also include fixed-point optimization and speech signal processing.

Dr. Sung was an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II from 2000 to 2001. He is a member of the VLSI Systems and Applications Technical Committee of the IEEE Circuits and Systems Society and the Chair of the Design and Implementation of Signal Processing Systems Technical Committee of the IEEE Signal Processing Society. He was the General Chair of the IEEE Workshop on Signal Processing Systems (SiPS) 2003 and a Technical Program Cochair of SiPS 2009.