
[IEEE 2006 Fortieth Asilomar Conference on Signals, Systems and Computers - Pacific Grove, CA, USA (2006.10.29-2006.11.1)] 2006 Fortieth Asilomar Conference on Signals, Systems and


Decoding of Quasi-cyclic LDPC Codes Using an On-the-Fly Computation

Kiran K. Gunnam, Gwan S. Choi, Weihuang Wang, Euncheol Kim, and Mark B. Yeary

Department of ECE, Texas A&M University, College Station, TX-77843

Department of ECE, University of Oklahoma, Norman, OK-73 109

Abstract-The implementation complexity of the decoder for low-density parity-check (LDPC) codes is dictated by memory and interconnect requirements. We propose new LDPC decoder architectures that reduce the message passing memory by 80% (for standard message passing) and 55% (for layered decoding), and the router requirements by more than 50%. These novel architectures are based on a scheduling of computation that results in "on-the-fly computation" of variable node and check node reliability messages. The architectures are targeted at quasi-cyclic LDPC codes such as array LDPC codes (regular QC-LDPC) and Block LDPC codes (irregular QC-LDPC). FPGA and ASIC implementation results show substantial gains when compared to existing work in the literature.

Index Terms-low-density parity-check (LDPC) codes, offset min-sum, vector processing, decoder architecture, layered decoding, turbo decoding message passing, array LDPC, quasi-cyclic LDPC, Block LDPC, irregular LDPC, IEEE 802.16e, WiMax, IEEE 802.11n, Wi-Fi, on-the-fly computation, block-serial processing.

I. INTRODUCTION

Low-density parity-check (LDPC) codes and turbo codes are among the best known codes that operate near the Shannon limit [1]. Compared with turbo decoders, LDPC decoders require simpler computational processing and are more suitable for parallelization and low-complexity implementation. LDPC codes are being considered for error correction coding in virtually all next-generation communication systems. While parallel LDPC decoder designs for randomly constructed LDPC codes suffer from complex interconnect issues [2], various semi-parallel [3]-[8] and parallel implementations [9]-[10], based on structured LDPC codes, alleviate the interconnect complexity. Mansour and Shanbhag [3] introduced the concept of turbo decoding message passing (TDMP), sometimes also called layered decoding [4], using BCJR for their architecture-aware LDPC (AA-LDPC) codes. TDMP offers 2x throughput and significant memory advantages when compared to standard two-phase message passing (TPMP). A quantitative performance comparison for different check updates was given by Chen et al. [11]. Their research showed that the performance loss for offset min-sum (OMS) decoding with 5-bit quantization is less than 0.1 dB in SNR compared with that of optimal floating-point SP (sum of products) and BCJR.

This paper introduces the following concepts to LDPC decoder implementation: block serial scheduling [5], value-reuse, scheduling of layered processing, out-of-order block processing, master-slave router, and dynamic state. We collectively term these concepts on-the-fly computation, since at their core they minimize memory and re-computation by employing just-in-time scheduling. Based on this paradigm, we present a new TPMP architecture that exploits the properties of regular array LDPC codes [9] and the value-reuse property of the offset min-sum algorithm to reduce the memory, routing and computational requirements. The key features of this architecture are: 1) 80% savings in message passing memory when compared to other semi-parallel architectures based on min-sum and its variants [6], [7]; 2) scalability to any code length due to the concentric and regular layout, unlike the fully parallel architecture [2]; 3) a reduction of router multiplexers by 50% or more, based on the dynamic state concept.

Another main contribution of this paper based on on-the-fly computation is an efficient turbo decoding message passing (TDMP) architecture which utilizes the value-reuse property of OMS, the cyclic shift property of quasi-cyclic (QC) LDPC codes, and an extension of block serial scheduling [5]. The resulting decoder architecture has the following key advantages: 1) removal of the memory needed to store the sum of the variable node messages and the channel log-likelihood ratios (LLR) when compared to other semi-parallel architectures [3]-[8]; 2) 40%-72% savings in storage of extrinsic messages, depending on the code rate, when compared to other semi-parallel architectures [3]-[8]; 3) need for only one cyclic shifter instead of two when compared to the work in [3]-[8]; 4) removal of the memory needed to store variable node messages when compared to [6]-[8]; and finally, 5) a 2x increase in throughput, as the number of required iterations decreases by 50% when compared to [5]-[8].

The rest of the paper is organized as follows. Section II introduces the background of QC LDPC codes and OMS, the decoding algorithm. Section III presents the equations which facilitate the decoding process. Section IV presents the value-reuse property and the micro-architecture of the CNU. The architecture for TPMP using OMS is presented in Section V. The new data flow graph and architecture for TDMP using OMS are presented in Section VI. Section VII shows the FPGA and ASIC implementation results and a performance comparison with related work. Section VIII concludes the paper.

1-4244-0785-0/06/$20.00

II. BACKGROUND

A. Regular Quasi-cyclic LDPC Codes

The array LDPC parity-check matrix is specified by three parameters: a prime number $p$ and two integers $k$ (check-node degree) and $j$ (variable-node degree) such that $j, k \le p$ [12]. It is given by

$$
H = \begin{bmatrix}
I & I & I & \cdots & I \\
I & \alpha & \alpha^{2} & \cdots & \alpha^{k-1} \\
I & \alpha^{2} & \alpha^{4} & \cdots & \alpha^{2(k-1)} \\
\vdots & \vdots & \vdots & & \vdots \\
I & \alpha^{j-1} & \alpha^{(j-1)2} & \cdots & \alpha^{(j-1)(k-1)}
\end{bmatrix}, \tag{1}
$$

where $I$ is a $p \times p$ identity matrix and $\alpha$ is a $p \times p$ permutation matrix representing a single left cyclic shift (or, equivalently, a down cyclic shift) of $I$. The exponent of $\alpha$ in $H$ is called the shift coefficient and denotes multiple cyclic shifts, with the number of shifts given by the value of the exponent.
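The construction in (1) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' tooling: the function names `shifted_identity` and `array_ldpc_h` are hypothetical, the matrix is stored as dense lists for readability, and the caller is assumed to supply a prime $p$.

```python
def shifted_identity(p, s):
    """p x p identity cyclically shifted down by s rows, i.e. alpha**s."""
    return [[1 if r == (c + s) % p else 0 for c in range(p)] for r in range(p)]


def array_ldpc_h(p, j, k):
    """Assemble H of (1): block (l, n), 0-based, is alpha**(l * n).

    Assumes p is prime and j, k <= p (primality is not checked here).
    """
    if j > p or k > p:
        raise ValueError("array LDPC codes need j, k <= p")
    h = []
    for l in range(j):
        blocks = [shifted_identity(p, (l * n) % p) for n in range(k)]
        for r in range(p):
            # Concatenate row r of every block in this block row.
            h.append([blk[r][c] for blk in blocks for c in range(p)])
    return h


H = array_ldpc_h(5, 3, 4)   # toy (3, 4) code with p = 5
```

Every row of the resulting matrix has weight $k$ and every column has weight $j$, which is what makes the code regular.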

B. Irregular Quasi-cyclic array LDPC Codes

Rate-compatible array LDPC codes, introduced in [13], are a modified version of the above for efficient encoding and multi-rate compatibility. Their H matrix has the following structure:

$$
H = \begin{bmatrix}
I & I & I & \cdots & I & I & \cdots & I \\
0 & I & \alpha & \cdots & \alpha^{j-2} & \alpha^{j-1} & \cdots & \alpha^{k-2} \\
0 & 0 & I & \cdots & \alpha^{2(j-3)} & \alpha^{2(j-2)} & \cdots & \alpha^{2(k-3)} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots & & \vdots \\
0 & 0 & 0 & \cdots & I & \alpha^{j-1} & \cdots & \alpha^{(j-1)(k-j)}
\end{bmatrix}, \tag{2}
$$

where 0 is the $p \times p$ null matrix. The LDPC codes defined by $H$ in (2) have a codeword length $N = kp$, a number of parity checks $M = jp$, and an information block length $K = (k - j)p$. The family of rate-compatible codes is obtained by successively puncturing the leftmost $p$ columns and the topmost $p$ rows. According to this construction, a rate-compatible code within a family can be uniquely specified by a single parameter, say, $q$ with $0 \le q \le j - 2$. To have a wide range of rate-compatible codes, we can also fix $j$ and $p$ and select different values for the parameter $k$. Since all the codes share the same base matrix size $p$, the same hardware implementation can be used. It is worth mentioning that this specific form is suitable for efficient linear-time LDPC encoding [13]. The systematic encoding procedure is carried out by associating the first $N - K$ columns of $H$ with parity bits and the remaining $K$ columns with information bits.
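As a quick numerical illustration of the puncturing rule, the sketch below computes the code dimensions after removing the leftmost $p$ columns and topmost $p$ rows $q$ times. The helper name `rate_compatible_params` is hypothetical, and the formulas $N = (k - q)p$ and $M = (j - q)p$ are derived here from the construction above rather than quoted from the paper.

```python
def rate_compatible_params(p, j, k, q=0):
    """Return (N, M, K, rate) after puncturing q block columns and q block rows.

    Each puncturing step removes p columns and p rows, so N and M both
    shrink by p while the information length K = N - M stays (k - j) * p.
    """
    if not 0 <= q <= j - 2:
        raise ValueError("q must satisfy 0 <= q <= j - 2")
    n = (k - q) * p          # codeword length
    m = (j - q) * p          # number of parity checks
    info = n - m             # information block length
    return n, m, info, info / n


# Example with j = 5, k = 25, p = 61.
for q in range(4):
    print(q, rate_compatible_params(61, 5, 25, q))
```

With these parameters the rate rises from 0.8 at $q = 0$ to about 0.909 at $q = 3$, because the information length is unchanged while the codeword gets shorter.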

C. Irregular Quasi-cyclic Block LDPC Codes

The block irregular LDPC codes have competitive performance and provide flexibility and low encoding/decoding complexity [14]-[15]. The Block LDPC code is defined by a matrix $H$ as:

$$
H = \begin{bmatrix}
P_{1,1} & P_{1,2} & \cdots & P_{1,N_b} \\
P_{2,1} & P_{2,2} & \cdots & P_{2,N_b} \\
\vdots & \vdots & \ddots & \vdots \\
P_{M_b,1} & P_{M_b,2} & \cdots & P_{M_b,N_b}
\end{bmatrix} = P^{H_b}, \tag{3}
$$

where $P_{i,j}$ is one of a set of z-by-z cyclically right shifted identity matrices or a z-by-z zero matrix. Each 1 in the base matrix $H_b$ is replaced by a permuted identity matrix, while each 0 in $H_b$ is replaced by a negative value to denote a z-by-z zero matrix.
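The expansion of a base matrix into $H$ can be sketched as follows. The representation is an assumption made for illustration: the base matrix is given as a table of shift values, where a non-negative entry $s$ stands for the identity cyclically right shifted by $s$ and any negative entry stands for the z-by-z zero block. The function name `expand_base_matrix` and the toy table are hypothetical and are not taken from any standard.

```python
def expand_base_matrix(shift_table, z):
    """Expand a table of shift values into the full binary H of (3)."""
    h = []
    for block_row in shift_table:
        for r in range(z):
            row = []
            for s in block_row:
                if s < 0:
                    row.extend([0] * z)      # z-by-z zero block
                else:
                    # Identity right shifted by s: the 1 of row r sits in
                    # column (r + s) mod z.
                    row.extend(1 if c == (r + s) % z else 0 for c in range(z))
            h.append(row)
    return h


toy_table = [[0, 1, -1],
             [2, -1, 0]]      # made-up 2 x 3 example
H = expand_base_matrix(toy_table, z=3)
```

Storing only the shift table is what keeps the code description compact: the full matrix is never needed by a decoder that works block by block.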

D. Offset Min-sum Decoding of LDPC

Assume binary phase shift keying (BPSK) modulation (a 1 is mapped to -1 and a 0 is mapped to 1) over an additive white Gaussian noise (AWGN) channel. The received values $y_n$ are Gaussian with mean $x_n = \pm 1$ and variance $\sigma^2$. The reliability messages used in the belief propagation (BP)-based offset min-sum algorithm can be computed in two phases: 1) check-node processing and 2) variable-node processing. The two operations are repeated iteratively until the decoding criterion is satisfied. This is also referred to as standard message passing or two-phase message passing (TPMP). For the $i$th iteration, $Q_{nm}^{(i)}$ is the message from variable node $n$ to check node $m$, $R_{mn}^{(i)}$ is the message from check node $m$ to variable node $n$, $M(n)$ is the set of the neighboring check nodes for variable node $n$, and $N(m)$ is the set of the neighboring variable nodes for check node $m$. The message passing for TPMP based on OMS is described in the following three steps, as given in [11], to facilitate the discussion on TDMP in the next section:

Step 1. Check-node processing: for each $m$ and $n \in N(m)$,

$$R_{mn}^{(i)} = \delta_{mn}^{(i)} \max\left(\kappa_{mn}^{(i)} - \beta,\ 0\right), \tag{4}$$

$$\kappa_{mn}^{(i)} = \left|R_{mn}^{(i)}\right| = \min_{n' \in N(m) \setminus n} \left|Q_{n'm}^{(i-1)}\right|, \tag{5}$$

where $\beta$ is a positive constant that depends on the code parameters [11]. For the (3, 6) rate 0.5 array LDPC code, $\beta$ is computed as 0.15 using the density evolution technique presented in [11]. The sign of the check-node message $R_{mn}^{(i)}$ is defined as

$$\delta_{mn}^{(i)} = \prod_{n' \in N(m) \setminus n} \operatorname{sgn}\left(Q_{n'm}^{(i-1)}\right). \tag{6}$$

Step 2. Variable-node processing: for each $n$ and $m \in M(n)$,

$$Q_{nm}^{(i)} = L_n^{(0)} + \sum_{m' \in M(n) \setminus m} R_{m'n}^{(i)}, \tag{7}$$

where the log-likelihood ratio of bit $n$ is $L_n^{(0)} = y_n$.

Step 3. Decision: for final decoding,

$$P_n = L_n^{(0)} + \sum_{m \in M(n)} R_{mn}^{(i)}. \tag{8}$$

A hard decision is taken by setting $\hat{x}_n = 0$ if $P_n \ge 0$ and $\hat{x}_n = 1$ if $P_n < 0$. If $\hat{x} H^T = 0$, the decoding process is finished with $\hat{x}$ as the decoder output; otherwise, steps 1-3 are repeated. If the decoding process does not end within the predefined maximum number of iterations, $it_{max}$, the decoder stops, outputs an error message flag and proceeds to the decoding of the next data frame.
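The three steps can be sketched as a small floating-point reference model. This is a minimal sketch for checking the equations, not the hardware architecture: the function name `oms_tpmp_decode` and the dictionary-based message storage are illustrative choices, every check node is assumed to have degree at least 2, and the variable-node update uses the identity $Q_{nm} = P_n - R_{mn}$, which follows from (7) and (8).

```python
def oms_tpmp_decode(h, llr, beta=0.15, it_max=20):
    """Two-phase OMS decoding following (4)-(8).

    h   : parity-check matrix as a list of 0/1 rows
    llr : channel values L_n (positive means bit 0 is more likely)
    Returns (hard_decisions, iterations_used, success_flag).
    """
    n_checks, n_bits = len(h), len(h[0])
    nbrs = [[n for n in range(n_bits) if h[m][n]] for m in range(n_checks)]
    # Before the first iteration, Q_nm is just the channel LLR.
    q = {(m, n): llr[n] for m in range(n_checks) for n in nbrs[m]}
    x_hat = [0 if v >= 0 else 1 for v in llr]
    for it in range(1, it_max + 1):
        r = {}
        for m in range(n_checks):                      # Step 1
            for n in nbrs[m]:
                others = [q[(m, n2)] for n2 in nbrs[m] if n2 != n]
                kappa = min(abs(v) for v in others)    # (5)
                delta = 1.0
                for v in others:                       # (6)
                    if v < 0:
                        delta = -delta
                r[(m, n)] = delta * max(kappa - beta, 0.0)   # (4)
        p_sum = list(llr)                              # Step 3, (8)
        for (m, n), v in r.items():
            p_sum[n] += v
        x_hat = [0 if v >= 0 else 1 for v in p_sum]
        if all(sum(x_hat[n] for n in nbrs[m]) % 2 == 0 for m in range(n_checks)):
            return x_hat, it, True                     # all parity checks satisfied
        for (m, n), v in r.items():                    # Step 2, (7)
            q[(m, n)] = p_sum[n] - v
    return x_hat, it_max, False


# Toy usage: a (7, 4) Hamming parity-check matrix, all-zero codeword,
# one unreliable received value.
H = [[1, 1, 0, 1, 1, 0, 0],
     [1, 0, 1, 1, 0, 1, 0],
     [0, 1, 1, 1, 0, 0, 1]]
decoded, iterations, ok = oms_tpmp_decode(H, [2.0, 2.0, 2.0, -0.5, 2.0, 2.0, 2.0])
```

The model stores one R and one Q value per edge, which is the memory cost that the value-reuse property discussed later avoids.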

III. TDMP FOR QC-LDPC

In TDMP, the array LDPC code with $j$ block rows can be viewed as a concatenation of $j$ layers or constituent sub-codes, similar to the observations made for AA-LDPC codes in [3]. After the check-node processing is finished for one block row, the messages are immediately used to update the variable nodes (Step 2 above), whose results are then provided for processing the next block row of check nodes (Step 1 above). We first illustrate the vector equations for TDMP for array LDPC codes, assuming that the H matrix has the structure in (1). These equations are directly applicable to all other regular QC-LDPC codes (such as cyclotomic coset based LDPC). For rate-compatible array LDPC codes and Block LDPC codes, minor modifications to the vector equations are necessary [18].

$$\vec{R}_{l,n}^{(0)} = 0, \quad \vec{P}_n = \vec{L}_n^{(0)} \quad \text{[Initialization for each new received data frame]}, \tag{9}$$

$$\forall i = 1, 2, \ldots, it_{max} \quad \text{[Iteration loop]},$$

$$\forall l = 1, 2, \ldots, j \quad \text{[Sub-iteration loop]},$$

$$\forall n = 1, 2, \ldots, k \quad \text{[Block column loop]},$$

$$\left[\vec{Q}_{l,n}^{(i)}\right]^{S(l,n)} = \left[\vec{P}_n\right]^{S(l,n)} - \vec{R}_{l,n}^{(i-1)}, \tag{10}$$

$$\vec{R}_{l,n}^{(i)} = f\left(\left[\vec{Q}_{l,n'}^{(i)}\right]^{S(l,n')},\ \forall n' = 1, 2, \ldots, k\right), \tag{11}$$

$$\left[\vec{P}_n\right]^{S(l,n)} = \left[\vec{Q}_{l,n}^{(i)}\right]^{S(l,n)} + \vec{R}_{l,n}^{(i)}, \tag{12}$$

where the vectors $\vec{R}_{l,n}^{(i)}$ and $\vec{Q}_{l,n}^{(i)}$ represent all the R and Q messages in each $p \times p$ block of the H matrix, and $s(l, n)$ denotes the shift coefficient for the block in the $l$th block row and $n$th block column of the H matrix. $[\vec{Q}_{l,n}^{(i)}]^{S(l,n)}$ denotes that the vector $\vec{Q}_{l,n}^{(i)}$ is cyclically shifted down by the amount $s(l, n)$, and $k$ is the check-node degree of the block row. A negative sign on $s(l, n)$ indicates a cyclic up shift (equivalently, a cyclic right shift). $f(\cdot)$ denotes the check-node processing, which can be done using BCJR, SP or OMS. For this work, we use OMS as defined in (4)-(6). If a block row is processed in serial fashion using $p$ check-node units (11), then the output of the CNU will also be in serial form. As soon as the output vector $\vec{R}_{l,n}^{(i)}$ corresponding to each block column $n$ of the H matrix for a block row $l$ is available, it can be used to produce the updated sum $[\vec{P}_n]^{S(l,n)}$ (12). This sum can be used immediately in (10) to process block row $l + 1$, except that the shift $s(l, n)$ imposed on $\vec{P}_n$ has to be undone and a new shift $s(l + 1, n)$ has to be imposed. This is done simply by imposing a shift corresponding to the difference of $s(l + 1, n)$ and $s(l, n)$.
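A software sketch of the layered schedule may make the role of the shift difference more concrete. It is a floating-point model under stated assumptions, not the proposed hardware: the names `shift_down`, `oms_check` and `tdmp_decode` are hypothetical, the shift table is passed in explicitly, the decoder runs a fixed number of iterations without early termination, and every block row is assumed to have check-node degree at least 2.

```python
def shift_down(v, s):
    """Cyclic down shift: the element at position c moves to position c + s."""
    p = len(v)
    return [v[(r - s) % p] for r in range(p)]


def oms_check(q_in, beta):
    """f() of (11) for a single check node, using OMS as in (4)-(6)."""
    out = []
    for i in range(len(q_in)):
        others = q_in[:i] + q_in[i + 1:]
        sign = 1.0
        for v in others:
            if v < 0:
                sign = -sign
        out.append(sign * max(min(abs(v) for v in others) - beta, 0.0))
    return out


def tdmp_decode(llr_blocks, shift, beta=0.15, it_max=10):
    """Layered decoding per (9)-(12); shift[l][n] is s(l, n)."""
    j, k, p = len(shift), len(shift[0]), len(llr_blocks[0])
    P = [list(b) for b in llr_blocks]                      # (9): P_n = L_n
    R = [[[0.0] * p for _ in range(k)] for _ in range(j)]  # (9): R = 0
    cur = [0] * k            # shift currently imposed on each P_n
    for _ in range(it_max):
        for l in range(j):
            Q = []
            for n in range(k):
                # Undo the old shift and impose the new one in a single step
                # by shifting with the difference of the two coefficients.
                P[n] = shift_down(P[n], shift[l][n] - cur[n])
                cur[n] = shift[l][n]
                Q.append([a - b for a, b in zip(P[n], R[l][n])])    # (10)
            for c in range(p):                                      # (11)
                r_new = oms_check([Q[n][c] for n in range(k)], beta)
                for n in range(k):
                    R[l][n][c] = r_new[n]
            for n in range(k):
                P[n] = [a + b for a, b in zip(Q[n], R[l][n])]       # (12)
    P = [shift_down(P[n], -cur[n]) for n in range(k)]  # back to natural order
    return [[0 if v >= 0 else 1 for v in blk] for blk in P]


# Toy usage: array code shifts s(l, n) = l * n mod p with p = 5, j = 3, k = 4.
p, j, k = 5, 3, 4
shifts = [[(l * n) % p for n in range(k)] for l in range(j)]
llr = [[2.0] * p for _ in range(k)]
llr[0][0] = -0.5             # one unreliable bit in an all-zero codeword
bits = tdmp_decode(llr, shifts, it_max=1)
```

Because the P vectors are kept in shifted form between layers, only one shift per block column per layer is needed, which mirrors the single-shifter argument in the text.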

A. Extension for Irregular Quasi-cyclic array LDPC Codes

Note that due to the slight irregularity in the array LDPC matrix defined in (2), each block row $l$ has a check-node degree of $k - l + 1$. The variable nodes in each block column $n$ have a node degree equal to $\min(n, j)$. One simple way to facilitate implementation is to assume that all the block rows have equal check-node degree and to set the check-node messages corresponding to null blocks in the H matrix to zero so that they do not affect the variable-node processing. Similarly, the variable-node messages belonging to the null blocks are always set to positive infinity so that they do not affect the check-node processing. For check-node updates based on SP or OMS, the message with maximum reliability does not affect the CNU output. In the specific case of OMS, this is easy to see, as the CNU output magnitude depends only on the two smallest input magnitudes.

B. Extension for Irregular Quasi-cyclic Block LDPC Codes

There are several null blocks in the Block LDPC codes defined by (3), so the above method of dealing with irregularity introduces significant idle clock cycles. One possible way to deal with this check-node irregularity is to set the check-node degree in the CNU processor unit based on the block row that is being processed. In addition, out-of-order processing needs to be enforced in the generation of R messages, as explained in later sections.

IV. VALUE-REUSE PROPERTIES OF OMS

This section presents the micro-architecture of the serial CNU for OMS [16]. For each check node $m$, $|R_{mn}^{(i)}|$, $\forall n \in N(m)$, takes only two values, which are the two smallest input magnitudes. Since, $\forall n \in N(m)$, $\delta_{mn}^{(i)}$ takes a value of either +1 or -1 and $|R_{mn}^{(i)}|$ takes only two values, (4) gives rise to only three possible values for the whole set $R_{mn}^{(i)}$, $\forall n \in N(m)$. In a VLSI implementation, this property significantly simplifies the logic and reduces the memory.

Fig. 1(a) shows the CNU micro-architecture for a (5, 25) code, while Fig. 1(b) shows the corresponding block diagram. In the first 25 clock cycles of the check-node processing, incoming variable messages are compared with the two current smallest values (partial state, PS) to generate the new partial state: M1, the first minimum value; M2, the second minimum value; and the index of M1. The final state (FS) is then computed by offsetting the partial state. It should be noted that the final state includes only M1, -M1 and +/-M2 with offset correction. The M1_M2 finder computes the two smallest values from the incoming data and the current minimum values stored in the partial state. The offset module applies the offset correction and stores the results in the Final State module. The R selector then assigns one of these 3 values to each output R message, based on the index of M1 and the sign of the R message generated by the sign XOR logic (6). While the final state depends on the offset correction, the offset correction in turn depends on the completion of the partial state. In operation, the final state and partial state will operate on different check nodes. The serial CNU finds the two smallest values with 2 comparators in a serial fashion and reduces the number of offset-correction computations from $k$ to 2. Normally, CNU (check-node unit) processing is done using signed-magnitude arithmetic for (4)-(6), and VNU (variable-node unit) processing (7) is done in 2's complement arithmetic. This requires 2's complement to signed-magnitude conversion at the inputs of the CNU and signed-magnitude to 2's complement conversion at the output of the CNU. In the proposed scheme, the 2's complement conversion is applied to only 2 values instead of $k$ values at the output of the CNU. The value-reuse property also reduces the memory requirement significantly. Conventionally, the number of messages each CNU stores is equal to the number of edges it has, that is, $k$. Now only four units of information are needed: the three values that $R_{mn}^{(i)}$ may take and the location of M1. The check-node message to the VNU is then readily chosen by multiplexing.

Fig. 1. Serial CNU for OMS using the value-reuse property: (a) detailed micro-architecture; (b) corresponding block diagram.

Fig. 2. Serial CNU with dynamic shift mechanism for OMS using the value-reuse property.
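The value-reuse idea can be illustrated with a behavioral model of the serial CNU. This is a sketch of the data flow only, with no claim about the actual register-level design: the function name `serial_cnu` and the floating-point arithmetic are assumptions made for readability.

```python
def serial_cnu(q_in, beta):
    """Behavioral model of a serial OMS check-node unit with value reuse."""
    # Partial state: running M1, M2, index of M1, and the running sign product.
    m1 = m2 = float("inf")
    idx = -1
    sign_prod = 1
    signs = []
    for i, q in enumerate(q_in):        # one input per clock cycle
        mag = abs(q)
        s = -1 if q < 0 else 1
        signs.append(s)                 # sign bits are kept per edge
        sign_prod *= s
        if mag < m1:                    # new smallest value
            m1, m2, idx = mag, m1, i
        elif mag < m2:                  # new second smallest value
            m2 = mag
    # Final state: the offset correction is applied to two values only.
    fs_m1 = max(m1 - beta, 0.0)
    fs_m2 = max(m2 - beta, 0.0)
    # R selection: the edge that supplied M1 receives M2, all others receive M1.
    out = []
    for i in range(len(q_in)):
        mag = fs_m2 if i == idx else fs_m1
        # Product of the other signs equals total product times own sign.
        out.append(sign_prod * signs[i] * mag)
    return out


print(serial_cnu([3.0, -1.0, 2.0, -4.0], beta=0.5))
```

The model keeps only M1, M2, the index of M1 and the sign bits, yet reproduces the per-edge outputs of (4)-(6); at most three distinct values appear in the output list.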

Fig. 3. Fixed code block-serial architecture for TPMP and OMS for regular array LDPC codes. No routers are needed. No message passing memory is needed except for the sign bits of Q messages.

V. FIXED CODE ARCHITECTURE USING TPMP AND OMS

Fig. 3 presents the proposed architecture for the implementation of a (3, 30) array LDPC code of length 1830 with a circulant matrix size of 61. The check-node processing unit array is composed of 3 sub-arrays. Each sub-array contains 61 serial CNUs which compute the partial state for each block row to produce the check-node messages for each block column of H. The CNU array for block row 1 is an array of 61 simple CNUs (Fig. 1). The CNU arrays for block rows 2 and 3 are composed of dynamic CNUs (Fig. 2). The variable-node processing array is composed of 61 parallel VNU units which can process 3 x 61 messages each clock cycle. The sign bits are stored in a FIFO (implemented as RAM); however, there is no need to subject these values to shifts, as they are not modified in the check-node partial state processing. In the array of simple serial CNUs that performs check-node processing for the first block row of the H matrix, the processing for each row is done such that all the comparisons are performed locally within one CNU to update the partial state each clock cycle and to transfer the partial state to the final state once every $d_c$ cycles, where $d_c$ is the check-node degree. In the array of dynamic CNUs designed for the second block row of the H matrix, CNU 122 gets its partial state from CNU 121, CNU 121 gets its partial state from CNU 120, and so on. The array of dynamic CNUs for the third block row is designed such that the connections between partial state registers among the various units achieve cyclic shifts of [0,2,..,58]. A similar principle is used when making connections for the final state in the CNU array to achieve reverse routing.

Initially the variable messages are available row-wise, as they are set to the soft log-likelihood ratio (LLR) information of the bits coming from the channel. Q Init is an SRAM of size 2N that holds the channel LLR values of two different frames. It can supply $p$ intrinsic values to the VNUs each clock cycle. The data path of the design is set to 5 bits, which gives BER performance within 0.1-0.2 dB SNR of the floating-point sum-of-products algorithm [11]. Each iteration takes $d_c + 3$ clock cycles. For the (3, 30) code this results in 6 x 33 clock cycles to process each frame when the maximum number of iterations is set to 6. For a (3, 6) code this results in 20 x 9 clock cycles to process each frame when the number of iterations is set to 20.

VI. MULTI-RATE ARCHITECTURE USING TDMP AND OMS

A. Optimally Scaled Architecture

A new data flow graph is designed based on TDMP and on the value-reuse property of the min-sum algorithm described above (see Fig. 4). For ease of discussion, we illustrate the architecture for a specific structured code: a regular array code of length 1525 described by (1), with $j = 5$, $k = 25$ and $p = 61$. The discussion can be easily generalized to any other structured code. First, the functionality of each block in the architecture is explained. A check-node processing unit (CNU) is the serial CNU based on OMS described in the previous section. The CNU array is composed of $p$ computation units that compute the partial state for each block row to produce the R messages in block serial fashion. Since the final state of previous block rows, in which the compact information for CNU messages is stored, is needed for TDMP, it is stored in register banks. There is one register bank of depth $j - 1$, which is 4 in this case, connected with each CNU. Each final state register bank is the same as the final state register bank in the CNU. Besides the shifted Q messages, the CNU array also takes as input the sign information of previously computed R messages in order to perform the R selection operation. The sign bits are stored in a sign FIFO. The total length of the sign FIFO is $k$, and each block row has $p$ one-bit sign FIFOs. We need $j - 1$ such FIFO banks in total. $p$ R select units are used for $R_{old}$. An R select unit generates the R messages for the 25 (= $k$) edges of a check node, in a serial fashion, from the three possible values stored in the final state register associated with that particular check node. Its functionality and structure are the same as those of the block denoted R select in the CNU. This unit can be treated as a de-compressor of the check-node edge information, which is stored in compact form in the FS registers. Generating the R messages for all the layers in this way amounts to significant memory savings, which are quantified in a later section. The shifter is constructed as a cyclic down logarithmic shifter to achieve the cyclic shifts specified by the binary encoded value of the shift. The logarithmic shifter is composed of $\log_2(p)$ stages of $p$ switches. Since a cyclic up shift is also needed in the operation of the decoder, a cyclic up shift by $u$ can be achieved simply by a cyclic down shift by $p - u$ on the vector of size $p$.

Fig. 4. Multi-rate LDPC decoder using layered decoding and OMS.

The decoding operation proceeds as per the vector equations described in Section III. At the beginning of the decoding process, the P vector is set to the received channel values in the first $k$ clock cycles (i.e., the first sub-iteration), as the channel values arrive in chunks of $p$, while the output vector of the R select unit is set to the zero vector. The multiplexer array at the input of the cyclic shifter is used for this initialization. The CNU array takes the output of the cyclic shifter serially, and the partial state stage operates on these values. After $k$ clock cycles, partial state processing is complete and the final state stage in the CNU array produces the final state for each check node in 2 clock cycles. Then the R select unit within each CNU starts generating the $k$ check-node messages in serial fashion. The CNU array thus produces the check-node messages in a block serial fashion, as $p$ CNUs operate in parallel. The P vector is computed by adding the delayed version of the Q vector (which is stored in a FIFO SRAM until the serial CNU produces its output) to the output vector R of the CNU. Note that the P vector that is generated can be used immediately to generate the Q vector as the input to the CNU array, as the CNU array is ready to process the next block row. This is possible because CNU processing is split into three stages, as shown in the pipeline diagram, and the partial state stage and final state stage can operate simultaneously on two different block rows. Now, the P message vector undergoes a cyclic shift by the difference between the shift of the block row being processed and that of the block row just processed. This shift value can be either positive or negative, indicating that a down shift or an up shift needs to be performed by the cyclic shifter. The R messages are then subtracted from the shifted P sum messages to get the shifted version of the Q messages.
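The logarithmic cyclic shifter mentioned in this subsection can be modeled stage by stage. The sketch below is a behavioral illustration only: the function names are hypothetical, and it assumes the number of stages is $\lceil \log_2 p \rceil$, with stage $t$ either passing the vector through or rotating it down by $2^t$ positions according to bit $t$ of the binary encoded shift value.

```python
def log_shift_down(v, s):
    """Cyclic down shift by s, built from conditional power-of-two stages."""
    p = len(v)
    s %= p
    stage = 0
    while (1 << stage) < p:             # ceil(log2(p)) stages in total
        if (s >> stage) & 1:            # control bit for this stage
            step = 1 << stage
            v = [v[(r - step) % p] for r in range(p)]
        stage += 1
    return v


def log_shift_up(v, u):
    """Cyclic up shift by u, realized as a down shift by p - u."""
    p = len(v)
    return log_shift_down(v, (p - u) % p)


vec = list(range(61))                   # p = 61
down7 = log_shift_down(vec, 7)
restored = log_shift_up(down7, 7)       # restores the original order
```

Reusing the down shifter for up shifts is what lets a single shifter serve both directions of the shift difference between consecutive block rows.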

Fig. 5. Pipeline for the layered decoding: (a) for regular QC-LDPC codes (regular array LDPC codes and cyclotomic coset based LDPC) and rate-compatible array LDPC codes; (b) for irregular QC-LDPC codes (Block LDPC codes of IEEE 802.16e, IEEE 802.11n). Both panels show the stages CNU PS, CNU FS, R selection, P update and Q update across the previous, current and next layers. In (b), R selection for Rnew operates out-of-order to feed the data for PS processing of the next layer.

A snapshot of the decoder pipeline is shown in Fig. 5. Here, the partial state stage in the CNU (CNU PS) is operating on the 2nd block row during the clock cycles labeled 0 to 24 (note that these labels do not denote the actual clock numbers, as the snapshot is taken in the middle of the processing). The final state stage in the CNU (CNU FS) cannot start until the end of PS processing, that is, clock cycle 25. As soon as the FS is done in clock cycle 26, R select is able to select the output R messages, and P and Q message processing starts. With the first block of Q messages ready, PS for the next block row can be started immediately. Note that all the logic blocks (other than the storage elements) are active over 90% of the time. The only exception is the offset module, which is composed of two 5-bit adders, in each CNU. The proportion of all the CNU FS logic in the overall decoder is less than 4%. The control unit also contains the array code parameters such as $j$, $k$ and $q$; these can be changed to support multi-rate decoding. The family of rate-compatible codes is obtained by puncturing the leftmost $p$ columns and the topmost $p$ rows of the H matrix defined in (2) $q$ times. Changing $q$ from 0 to 3 (= $j - 2$) gives code rates from 0.8 to 0.909. Changing $k$ from 15 to 61 while fixing $j = 5$ results in code rates from 0.666 to 0.91.

The Q FIFO needs to be of maximum depth $p$, as $k$ can take a maximum value equal to $p$. Note that for Block LDPC codes, the Q FIFO has to be replaced with a regular memory of the same size. In addition, out-of-order processing is employed on R selection for Rnew to account for the irregularity of Block LDPC codes (see Fig. 5(b)) [18].

B. Scalable Architecture

Note that the throughput of the architecture is increased by increasing p of the code and scaling the hardware accordingly. While the complexity of the computational units scales linearly with p, the complexity of the cyclic shifter increases with the factor (p/2) log2(p). So, it is necessary to change the architecture for large values of p. Alternatively, low throughput applications may call for low parallelization. To suit this requirement, minor changes in the proposed architecture are necessary. Let us assume the desired parallelization is M < p. For ease of implementation, choose M close to a power of 2. The cyclic shifter needed is M x M. Since a p x p cyclic shift has to be achieved with consecutive shifts of M x M, it is necessary that the complete vector of size p is available in M banks with depth s = ceil(p/M), and shifting is achieved in part by the cyclic shifter and in part by the address generation. Now, all the CNU and variable node processing is done in a time division multiplexed fashion for each sub-vector of length M, so as to process the vector of size p to mimic the pipeline in Fig. 5.

One should note that it is not possible to achieve the cyclic shifts specified by s(l, n) (= 0, 1, ..., z-1) on a vector of length z with a cyclic shifter of size M x M if M is not an integer multiple of z. So, to be able to accommodate the different shifts needed for WiMax LDPC codes, we can use a Benes network as in [19], which has a complexity of 2 log2(M) - 1 stages of M 2-in-1 multiplexers. A memory can be used to store the control inputs needed for the different shifts when supporting one expansion factor [3], [19]. The authors in [3] use an Omega network, which is less complex than a Benes network [19]. However, both [3] and [19] support only one base H matrix. Note that the memory for providing control signals to this network is equal to (M/2)(2 log2(M) - 1) bits for every shift value that needs to be supported. This would be a very large requirement for supporting all the WiMax codes [14]. So, to avoid the control memory requirement, the cyclic shifter is constructed as a Master-slave Benes network [17]-[18] to accommodate 114 different parity check matrices at run time for IEEE 802.16e. This approach eliminates the control memory requirements by generating the control signals for the slave data router with the help of a self-routing master network. Note that the control memory savings are specific to an IEEE 802.16e decoder.
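To give a feel for the router sizes discussed in this subsection, the following sketch counts stages, multiplexers, and control bits for an M x M Benes network. It is a rough illustration, not the authors' implementation: the function names are hypothetical, M is restricted to a power of two, and the control-bit count assumes one control bit per 2 x 2 switch (that is, one bit per pair of 2-in-1 multiplexers).

```python
def log2_exact(m: int) -> int:
    """Return log2(m) for a power of two m >= 2, else raise ValueError."""
    if m < 2 or m & (m - 1):
        raise ValueError("m must be a power of two, m >= 2")
    return m.bit_length() - 1


def benes_stages(m: int) -> int:
    """Number of stages in an m x m Benes network: 2*log2(m) - 1."""
    return 2 * log2_exact(m) - 1


def benes_mux_count(m: int) -> int:
    """Total 2-in-1 multiplexers: m per stage."""
    return m * benes_stages(m)


def benes_control_bits(m: int) -> int:
    """Control bits per routed shift, assuming one bit per 2 x 2 switch."""
    return (m // 2) * benes_stages(m)


def control_memory_bits(m: int, num_shifts: int) -> int:
    """Memory needed to store control words for num_shifts distinct shifts."""
    return num_shifts * benes_control_bits(m)


if __name__ == "__main__":
    m = 64  # hypothetical router size, chosen only for illustration
    print("stages:", benes_stages(m))
    print("2-in-1 multiplexers:", benes_mux_count(m))
    print("control bits per shift:", benes_control_bits(m))
    print("control memory for 64 shifts (bits):", control_memory_bits(m, 64))
```

For the hypothetical M = 64, this gives 11 stages, 704 multiplexers, and 352 control bits per shift, so the stored control words grow linearly with the number of shift values to be supported. This is the cost that a self-routing master network avoids.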

VII. IMPLEMENTATION RESULTS AND DISCUSSION

We prototyped the proposed fixed code (see Fig. 6) and multi-rate decoder architectures (see Tables I, III, and IV) on a Xilinx Virtex 2V8000-5 device. We also implemented the proposed decoder architecture using the open source standard cell library vsclib013 [20] in 130 nm technology. Synthesis was done using the Synopsys Design Analyzer tool, while layout was done using Cadence's Silicon Ensemble tool. The synthesis results and performance metrics for the proposed architecture are given in Table V. The design details for the work in [3] are summarized in Table II. When compared to the latest reported multi-rate decoder [3] (Table II), the decoder design for IEEE 802.11n LDPC codes has around 5.5x the area efficiency and 2.6x the energy efficiency for a given data throughput. The numbers are normalized for a 180 nm CMOS process.
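The 5.5x and 2.6x figures can be traced back to the efficiency entries of Tables II and V. The sketch below recomputes them under stated assumptions: the inputs are the values as read from those tables (640 Mbps and 14.3 mm2 for [3]; 244 Mbps/mm2 and 47 pJ/bit/iteration for the IEEE 802.11n decoder; 123 pJ/bit/iteration for [3]), the function name is hypothetical, and the 180 nm normalization of the proposed design is taken as given rather than rederived.

```python
def area_efficiency(throughput_mbps: float, area_mm2: float) -> float:
    """Area efficiency in Mbps per mm^2."""
    return throughput_mbps / area_mm2


# Reference design [3], already in 180 nm (values as read from Table II).
ref_area_eff = area_efficiency(640.0, 14.3)  # Mbps/mm^2
ref_energy = 123.0                           # pJ/bit/iteration

# Proposed IEEE 802.11n decoder, normalized to 180 nm (as read from Table V).
prop_area_eff = 244.0                        # Mbps/mm^2
prop_energy = 47.0                           # pJ/bit/iteration

area_gain = prop_area_eff / ref_area_eff
energy_gain = ref_energy / prop_energy

if __name__ == "__main__":
    print(f"reference area efficiency: {ref_area_eff:.2f} Mbps/mm^2")
    print(f"area efficiency gain: {area_gain:.2f}x")
    print(f"energy efficiency gain: {energy_gain:.2f}x")
```

With these inputs the ratios come out near 5.45 and 2.62, consistent with the rounded 5.5x and 2.6x quoted in the text.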

Each TDMP iteration consists of j sub-iterations, and each sub-iteration takes k (= check node degree) clock cycles, as defined in Sections II and III. Note that k can take any value less than the k_max supported by the decoder implementation. This feature, along with the ability to control the number of layers, enables the decoder to handle a wide range of rate-compatible array LDPC codes and Block LDPC codes.

To achieve the same BER as that of the TPMP schedule on SP (or the equivalent TPMP schedule on BCJR), the TDMP schedule on OMS needs half the number of iterations, a convergence gain similar to those reported for TDMP-BCJR [3] and TDMP-SP [4].
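The sub-iteration timing described above translates directly into a cycle budget. The following sketch is a first-order estimate only: the function names are hypothetical, the clock frequency is an illustrative parameter rather than a measured value, and pipeline fill, input/output buffering, and early termination are ignored.

```python
def tdmp_cycles(j: int, k: int, iterations: int) -> int:
    """Clock cycles for `iterations` TDMP iterations.

    Each iteration has j sub-iterations (layers) of k clock cycles each.
    """
    if min(j, k, iterations) < 1:
        raise ValueError("j, k and iterations must be positive")
    return iterations * j * k


def decode_time_us(j: int, k: int, iterations: int, f_mhz: float) -> float:
    """Decoding time in microseconds at a clock of f_mhz MHz."""
    return tdmp_cycles(j, k, iterations) / f_mhz


if __name__ == "__main__":
    # j = 5, k = 25, 10 iterations, and 100 MHz are illustrative values only.
    print("cycles:", tdmp_cycles(5, 25, 10))
    print("time (us):", decode_time_us(5, 25, 10, 100.0))
```

For example, j = 5, k = 25, and 10 iterations give 1250 clock cycles, or 12.5 microseconds at the assumed 100 MHz clock. Halving the iteration count, as the TDMP schedule permits relative to TPMP, halves this budget.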

VIII. CONCLUSION

This paper presents decoder architectures for regular and irregular quasi-cyclic LDPC codes that provide substantial gains over existing implementations. The proposed architectures utilize an on-the-fly computation paradigm which permits scheduling of the computations in a way that reduces the memory requirements and re-computations. Using this paradigm, run-time configurable and multi-rate VLSI architectures for the rate-compatible array LDPC codes and irregular Block LDPC codes are designed. Both the fixed code decoders and the multi-rate decoders achieve significant savings. When compared to the latest reported multi-rate decoders, the decoder design for IEEE 802.11n LDPC codes has around 5.5x the area efficiency and 2.6x the energy efficiency for a given data throughput.

REFERENCES

[1] D. MacKay and R. Neal, "Near Shannon limit performance of low density parity check codes," Electronics Letters, vol. 32, pp. 1645-1646, Aug. 1996.

[2] A. Blanksby and C. Howland, "A 690-mW 1-Gb/s 1024-b, rate-1/2 low-density parity-check code decoder," IEEE J. of Solid-State Circuits, vol. 37, no. 3, pp. 404-412, Mar. 2002.

[3] M. Mansour and N. Shanbhag, "A 640-Mb/s 2048-bit programmable LDPC decoder chip," IEEE J. of Solid-State Circuits, vol. 41, no. 3, pp. 684-698, Mar. 2006.

[4] D. Hocevar, "A reduced complexity decoder architecture via layered decoding of LDPC codes," IEEE SiPS, pp. 107-112, Oct. 2004.

[5] K. Gunnam, G. Choi and M. B. Yeary, "An LDPC decoding schedule for memory access reduction," IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 5, pp. 173-176, May 2004.

[6] M. Karkooti and J. Cavallaro, "Semi-parallel reconfigurable architectures for real-time LDPC decoding," Proceedings of the Int. Conf. on Information Technology, Coding and Computing, vol. 1, pp. 579-585, Apr. 2004.

[7] T. Brack, F. Kienle, and N. Wehn, "Disclosing the LDPC Code Decoder Design Space," Proceedings of Design Automation and Test in Europe (DATE) Conference, pp. 200-205, March 2006.

[8] L. Yang, M. Shen, H. Liu, and C. Shi, "An FPGA implementation of low-density parity-check code decoder with multi-rate capability," Proceedings of the Asia and South Pacific Design Automation Conference, vol. 2, pp. 760-763, 18-21 Jan. 2005.

[9] E. Kim and G. Choi, "Diagonal low-density parity-check code for simplified routing in decoder," IEEE SiPS, pp. 756-761, Nov. 2005.

[10] K. Gunnam, G. Choi and M. B. Yeary, "A parallel layered decoder architecture for array LDPC codes," accepted for IEEE VLSI Design Conference, January 2007.

[11] J. Chen, A. Dholakia, E. Eleftheriou, M. Fossorier and X. Y. Hu, "Reduced-complexity decoding of LDPC codes," IEEE Trans. on Communications, vol. 53, pp. 1288-1299, Aug. 2005.

[12] J. L. Fan, "Array codes as low density parity check codes," Proc. 2nd International Symposium on Turbo Codes and Related Topics, pp. 543-546, Brest, France, Sept. 2000.

[13] A. Dholakia and S. Olcer, "Rate-compatible low-density parity-check codes for digital subscriber lines," in Proceedings of IEEE International Conference on Communications, pp. 415-419, Jun. 2004.

[14] "Part 16: air interface for fixed and mobile broadband wireless access systems amendment for physical and medium access control layers for combined fixed and mobile operation in licensed bands," IEEE P802.16e-2005, October 2005.

[15] IEEE 802.11 Wireless LANs, WWiSE Proposal: High Throughput extension to the 802.11 Standard. IEEE 11-04-0886-00-000n.

[16] K. Gunnam and G. Choi, "A Low Power Architecture for Min-Sum Decoding of LDPC Codes," TAMU, ECE Technical Report, May 2006, TAMU-ECE-2006-02. [Online]. Available: http://dropzone.tamu.edu/techpubs

[17] K. Gunnam, G. Choi, M. B. Yeary and M. Atiquzzaman, "VLSI architectures for layered decoding for irregular LDPC codes of WiMax," TAMU, ECE Technical Report, July 2006, TAMU-ECE-2006-08. [Online]. Available: http://dropzone.tamu.edu/techpubs

[18] K. Gunnam, "Area and energy efficient VLSI architectures for low-density parity-check decoders using an on-the-fly computation," PhD Dissertation, Texas A&M University, October 2006.

[19] G. Malema and M. Liebelt, "Interconnection network for structured low-density parity-check decoders," Asia-Pacific Conference on Communications, pp. 537-540, 03-05 Oct. 2005.

[20] Open source standard cell library. [Online]. Available: http://www.vlsitechnology.org


Fig. 6. Comparison of proposed fixed code TPMP architecture implementations with [6] and [7]. [Bar chart. Designs compared: p = 61, rate 0.9, length 1830; p = 211, rate 0.5, length 1266; M. Karkooti et al. [6], rate 0.5, length 1536; T. Brack et al. [7], rate 0.8, length 3000. Metrics: No. Slices, No. LUTs, message passing memory (bits), input buffer (bits), frequency (MHz), throughput (Mbps), throughput per LUT (Kbps).]

Table I. FPGA implementation results for the layered decoder for rate-compatible array code [18]

                      M = 61              M = 61              Available
Code parameters       p = 61, j = 5,      p = 347, j = 3,
                      k = 10,...,61,      k = 6,...,32,
                      q = 0,...,3         q = 1
N = pk                610-3,721           2,082-11,104
Code rate             0.5-0.9167          0.5-0.9063
Slices                6,002               6,182               46,592
LUT                   7,713               8,022               93,184
SFF                   9,981               10,330              93,184
BRAM                  12                  129                 168
Memory (bits)         37,210              131,860
Throughput (Mbps)     68-329              113-319
Maximum iterations    10, TDMP, OMS       10, TDMP, OMS

Table II. Reference ASIC implementation results [3]

LDPC code                           AA-LDPC, code length 2048, rate compatible
Decoded throughput, td              640 Mbps
Area                                14.3 mm2
Frequency                           125 MHz
Nominal power dissipation           787 mW
CMOS technology                     180 nm, 1.8 V
Area efficiency for td, 180 nm      44.75 Mbps/mm2
Energy efficiency for td, 180 nm    123 pJ/bit/iteration

Table III. FPGA implementation results of the multi-rate decoder for IEEE 802.16e (supports z = 24, 48, and 96 and all the code rates) (Device: Xilinx 2V8000ff152-5, frequency 110 MHz) [17]-[18]

                      M = 24      M = 48      M = 96      Available
Slices                1,640       3,239       6,568       46,592
LUT                   2,982       5,664       11,028      93,184
SFF                   1,582       3,165       6,330       93,184
BRAM                  38          73          100         168
Memory (bits)         65,760      65,760      60,288
Throughput (Mbps)     41-70       57-139      61-278

Table IV. FPGA implementation results for the multi-rate decoder, fully compliant to IEEE 802.11n (Device: Xilinx 2V8000ff152-5, frequency 110 MHz) [18]

                      M = 27      M = 54      M = 81      Available
Slices                01L 3       3,647       55S141      46,592
LUT                   3,317       6,335       9 94)       93,184
SFF                   1,780       3,560       5,341       93,184
BRAM                  46          89          133         168
Memory (bits)         56,640      56,640      55,344
Throughput (Mbps)
  z = 81              119         238         356
  z = 54              119         238         178
  z = 27              119         119         119

Table V. ASIC implementation results, multi-rate decoders for Block LDPC codes, fully compliant to IEEE 802.16e (WiMax) and IEEE 802.11n [18]

Code parameters supported           Fully compliant to          Fully compliant to
                                    IEEE 802.11n,               IEEE 802.16e,
                                    z = 27, 54, 81              z = 24, 28, ..., 96
                                    & all the code rates        & all the code rates
Area (mm2)
  CNU array                         0.44                        0.53
  VNU array                         0.07                        0.08
  Memory                            1.04 [55,344 bits]          1.23 [60,288 bits]
  Pipeline flip flops               0.02                        0.03
  Cyclic shifter                    0.12 [logarithmic shifter]  0.74 [Master-slave router]
  Wiring                            0.08                        0.12
  Total chip area                   1.78                        2.73
Frequency                           500 MHz                     500 MHz
Decoded throughput                  1.15 Gbps                   1.37 Gbps
Power dissipation                   282 mW                      429 mW
Technology                          130 nm CMOS, 1.2 V          130 nm CMOS, 1.2 V
Area efficiency for td, 180 nm      244 Mbps/mm2                191 Mbps/mm2
Energy efficiency for td, 180 nm    47 pJ/bit/iteration         71 pJ/bit/iteration