06018325

2118 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 20, NO. 11, NOVEMBER 2012

Transactions BriefsA Nonbinary LDPC Decoder Architecture With Adaptive

Message Control

Weiguo Tang, Jie Huang, Lei Wang, and Shengli Zhou

AbstractA new decoder architecture for nonbinary low-density parity-check (LDPC) codes is presented in this paper to reduce the hardware oper-ational complexity in VLSI implementations. The low decoding complexityis achieved by employing adaptive message control (AMC) that dynami-cally trims the message length of belief information to reduce the amountof memory accesses and arithmetic operations. To implement the proposedAMC, we develop the architecture of a horizontal sequential nonbinaryLDPC decoder. Key components in the architecture have been designedwith the consideration of variable message lengths to leverage the benefitof the proposed AMC. Simulation results demonstrate that the proposednonbinary LDPC decoder architecture can significantly reduce hardwareoperations and power consumption as compared with existing work withnegligible performance degradation.

Index TermsAdaptive control, decoding, Galois field, Min-Sum, nonbi-nary low-density parity-check (LDPC) codes, VLSI architecture.

I. INTRODUCTION

Low-density parity-check (LDPC) codes [1], [2] are considered asone of the most powerful capacity-approaching codes. LDPC codescan be constructed in both binary domain and Galois fields (i.e.,

, where ). Binary LDPC codes have been studiedextensively [3][5] and adopted in many communication protocols,such as DVB-T2, WiMax, etc. In general, a very long code length isrequired for binary LDPC codes to approach the channel capacity.Nonbinary LDPC codes constructed in Galois fields [6] offer improvedperformance at a moderate code length. In addition, nonbinary LDPCcodes can be combined with high order modulations [7], [17] toincrease the bandwidth efficiency. Due to these features, design andimplementation of nonbinary LDPC codes have become critical formany emerging applications such as underwater acoustic communi-cations [17].

A key challenge in the application of nonbinary LDPC codes is theirhigh decoding complexity, as each symbol in the codeword is decodedusing a long message (e.g., in ). A lot of research effortaims at reducing the decoding complexity of nonbinary LDPC codesat the algorithm level [7][9], [11]. To deal with the problem that com-putational complexity increases exponentially with , the extendedMin-Sum (EMS) was proposed in [10] where only the most significant

entries in a message were used in the decoding. A decoding tech-nique developed in [12] conducted the EMS with a reduced complexityof

with minor performance degradation. It should benoted that these algorithm-level techniques do not explicitly considerthe complexity in the implementation of nonbinary LDPC decoders.While many hardware-efficient VLSI architectures were proposed for

Manuscript received December 05, 2010; revised March 04, 2011 and June14, 2011; accepted August 12, 2011. Date of publication September 15, 2011;date of current version July 27, 2012.

The authors are with the Department of Electrical and Computer Engineering,University of Connecticut, Storrs, CT 06269 USA (e-mail: [email protected]).

Digital Object Identifier 10.1109/TVLSI.2011.2165346

binary LDPC decoders [3][5], few results [14][16] exist for nonbi-nary LDPC decoders.

Different from these existing work targeting hardware implementa-tion cost, the focus of this paper is to reduce the hardware operationalcomplexity in nonbinary LDPC decoder architectures. This enables ef-ficient decoding suitable for emerging applications such as underwateracoustic sensor networks [17] that are under the severe resource (e.g.,energy) constraints. It was reported [4] that memory accesses and arith-metic operations are the two major contributors to the operating cost inLDPC decoders. As the amount of memory accesses and arithmeticoperations is largely determined by the message length, reducing mes-sage length is deemed as an effective way for efficient decoding. Basedon this fact, our past work [18] has proposed to use adaptive messagecontrol (AMC) to reduce the decoding complexity. Different from theEMS which maintains a constant message length for every symbol, theproposed AMC adjusts the message length adaptively, which can re-duce the message length at the required performance.

In this paper, we develop a horizontal sequential VLSI architecturefor the nonbinary LDPC decoder employing the AMC. The designof the key components in this architecture, such as variable node andcheck node update units, is optimized by exploiting the variable lengthsorters, which can be dynamically configured in different functionalunits to accommodate variable message lengths. The AMC is imple-mented by a low-complexity approximation method to avoid hardwareoverheads and performance impact. A mapping table based approachis proposed to conduct searching operations with low complexity. Weapply AMC to EMS to address the memory and throughput issuescaused by the worst case message length. Note that AMC can also beemployed in other decoding update rules such as the Min-Max algo-rithm. In addition, the proposed AMC can be generally applied to mostexisting decoder architectures (sequential, partial parallel and fully par-allel).

II. ADAPTIVE MESSAGE CONTROLA nonbinary LDPC code is defined by its parity check matrix (PCM)

, which is an sparse matrix with low density ofnonzero entries. The nonzero entries of take values from a Galoisfield . A length- vector with entries having values from

is a codeword if and only if . Each entry in the code-word is called a symbol, which is represented by a length- messagerecording the belief information, i.e., the probabilities of this noisysymbol to be any of the elements in . An LDPC codewith PCM can be represented by a bipartite graph called Tannergraph, which consists of two groups of nodes: variable nodes

, and check nodes

. A variable node

is connected to a check node

if and only if

in the PCM isnonzero.

The major decoding operation at a check node is to refine the esti-mated message of a symbol based on the messages of other symbolsthat are correlated by the PCM . The elementary check node oper-ation in the Min-Sum (MS) decoding algorithm [10], [12] can be ex-pressed as

(1)

where

and

are the length- variable node messages, and thebasic operation is defined as

, where

and

are the entries in messages

(inlog domain) corresponding to , respectively.

U.S. Government work not protected by U.S. copyright.

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 20, NO. 11, NOVEMBER 2012 2119

On the other hand, each variable node improves the fidelity of beliefinformation based on the received messages from multiple check nodesconnected by the PCM. The elementary operation at a variable node canbe expressed as [10], [12]

(2)

which sums up the belief information associated with the same finite-field element.

The hardware complexity in implementing (1) and (2) is determinedby the lengths of variable node messages (VNMs) and check node mes-sages (CNMs). Intuitively, when the distribution of belief informationis more concentrated, a shorter message might be sufficient to retainmost of the belief information.

The basic idea of AMC is to keep as few entries in a message aspossible without incurring much information loss. It has been demon-strated [18] that message truncation can be implemented by finding theminimal that satisfies

(3)

where is the confidence factor that determines the tradeoffbetween performance and operational complexity [18], and indi-cates the th entry in the log domain representation of the message that is sorted in order. Note that messages are usually normalized withrespect to the largest entry to maintain the numerical stability [12]. Inthis case, the truncation criteria can be recast as

(4)

where the threshold is used to truncate messages.The operation of AMC in a nonbinary LDPC decoder can be sum-

marized as follows.Initialization Channel message truncation:

, where

is thereceived channel message.

Variable node message:

, where

is the variable nodemessage from the variable node

to the check node

.

Iterations Permutation

, where

is the nonzero PCM ele-ment, and the multiplication is conducted in .

Check node update

(5)

where is the set of neighboring variable nodes of thecheck node

excluding the variable node

.

Inverse permutation

, where

is the nonzero PCMelement, and the division is conducted in .

Variable node update

(6)

where is the set of neighboring check nodes of the variablenode

excluding the check node

, and means that theAMC is applied to all the intermediate results.

Tentative decoding

(7)

Fig. 1. Top-level block diagram of the horizontal nonbinary LDPC decoderemploying the AMC.

III. VLSI ARCHITECTURE FOR SEQUENTIAL AMC-BASED DECODERIn this section, we present a sequential nonbinary LDPC decoder

architecture for the proposed AMC. We first discuss the top level ar-chitecture and then detail the design of several key components such asthe variable length sorter, variable node update unit (VNU), and checknode update unite (CNU). The proposed AMC can be applied to dif-ferent nonbinary LDPC decoding schedules, such as sequential, par-tially parallel and fully parallel architectures. We adopt the sequentialschedule in this paper due to its high convergence speed [13].

A. Top Level Decoder ArchitectureFig. 1 illustrates the proposed sequential nonbinary LDPC decoder

architecture that performs the zigzag decoding schedule [13]. At the be-ginning of the decoding, the truncated channel messages are loaded intothe memory RAMb. Tentative decoding is conducted with the channelmessages and the results are stored in the memory RAMc. If tenta-tive decoding suceeds, decoding terminates and RAMc outputs the finalresult. Otherwise, the decoder initializes the intermediate check nodemessages (ICNMs) in the memory RAMd with the channel messages.

After the initialization, the CNU reads ICNMs from RAMd to per-form the check node update described in (5). The CNMs from the CNUare inversely permuted, and then passed to the VNU along with the as-sociated channel message to perform variable node update as describedby (6). The VNU also generates the messages according to (7) for thetentative decoding unit (TDU) to conduct tentative decoding.

If tentative decoding does not succeed, the VNMs from VNU arethen permuted and sent back to the CNU to update the correspondingICNMs, which are then stored in the RAMd. After this, the decoderproceeds to another variable node. This process continues until thechecksum unit (CSU) decides to terminate because of either successfuldecoding or reaching the limit of iterations.

B. Variable Length SorterSorting is an important operation in the CNU and VNU. In the CNU,

the sorter chooses

elements with the largest belief information asrequired in (5). In the VNU, truncation as shown in (6) discards thesmaller elements and the remaining elements are sorted in order as theinput of CNU. The basic sorting operation is to insert a new elementinto a vector according to its magnitude. We will take the sorter in theCNU as an example to explain its design.

The proposed variable length sorter is illustrated in Fig. 2, whereonly the data path of belief information is shown as it determines theshifting operation. Other data, such as the Galois field element andvector index, are associated with the belief information and move alongwith it. The sorter consists of

stages to accommodate the longestmessage length. Each stage contains a comparator and a register. Eachtime a new belief information comes in, it is compared with all the


Fig. 2. Implementation of the variable length sorter, where only the data pathof the belief information is shown for simplicity.

belief information in the active stages in parallel. The content of thefirst stage having a larger belief information will be replaced by thenew input, whereas its original belief information will be shifted to theright stage and so on. Employing AMC, the message length reducesgradually. Thus only the last

stages are enabled. This is controlledby the longer one of the two messages in the CNU or VNU.

As indicated, the major operations in the sorter are comparisons andshiftings. The number of these operations is mainly determined by thelength of the message to be processed. The proposed AMC dynamicallyadjusts the message length during the iteration, thereby reducing thecomplexity of sorting.

C. Searching

Searching is another major operation in the CNU and VNU. In theVNU, the belief information in one message needs to search for itscounterpart in another message to perform (2). Thus, the location (e.g.,index) of a Galois field element needs to be determined. The mappinginformation between the Galois field elements and their indexes in amessage is maintained by a mapping table of words with wordlength of bits. For example, if the th entry in the message is asso-ciated with the Galois field element, then is written into the mappingtable at the address . When an element (e.g., ) needs to be searchedin message , the content at the address of the mapping table is readout, which provides the location of in . The table is initialized with. Thus a negative value indicates that is not in the message .

In the CNU,

different elements need to be generated accordingto (1). The output of the sorter is considered valid if and only if thecorresponding Galois field element has not been generated previously.A searching operation has to be conducted on the current output tocompare it with the previous outputs of the sorter. As only the existenceof the element needs to be determined, a mapping table with only bits can be constructed. Since this table mainly records the status (i.e.,existence or not) of Galois field elements, it is referred to as status tablein the CNU.

Note that the proposed mapping-based searching scheme is naturallylow complexity in comparison with direct searching of the message,especially when the message is very long.

D. Variable Node Update Unit

The function of VNU is to compute (2), where the belief informa-tion associated with the same Galois field element in two messages aresummed. Fig. 3 shows the block diagram of the VNU. In general, twomessages

and

may have different lengths due to the AMC.The mapping table (see Section III-C) is utilized to sum up the entries

associated with the same finite filed element. The final results are sent tothe variable length sorter with the associated Galois field elements andAMC is then applied to the outputs of the sorter, i.e., starting from thelargest entry in the sorter, when the entry is smaller than the threshold[see (4)], the following entries are discarded. This reduces the hardware

Fig. 3. Implementation of the VNU unit.

Fig. 4. Implementation of the CNU unit.

operations in the VNU such as real number additions/subtractions (forbelief information), comparisons, and sorting operations.

E. Check Node Update UnitThe CNU produces the

largest elements among all the

combinations from two input messages with lengths

and

(assume

) as described in (1). It was shown [12] that the complexity ofCNU can be reduced to

additions and insertion operations if thetwo input messages are sorted in the descending order. We adopt thismethod in the CNU design.

The complexity of the CNU depends on two factors. The first factoris the number of outputs (i.e.,

), which determines how many in-sertion operations are needed. The second factor is the length of thesorter, which determines how many comparisons and shifting opera-tions are involved in each insertion operation. Employing the proposedAMC, the message length decreases thus

becomes smaller duringthe iteration. This reduces the number of arithmetic operations and in-sertions. To address the second factor, the sorter is configured to bethe same length of the shorter message (

in the above example),thereby reducing the complexity of each insertion operation. Note thatthe CNU also performs additions in the Galois field. However, addi-tions in the Galois field are essentially bit-wise XOR operations, thusthe complexity is much lower than the real number additions of beliefinformation.

The proposed design of CNU is shown in Fig. 4. A variable lengthsorter as described in Section III-B is employed to perform shifting andcomparison operations. Searching operations are based on the methoddiscussed in Section III-C.

IV. SIMULATION RESULTS AND DISCUSSIONSIn this section, we evaluate the proposed AMC-based decoder using

an irregular GF(16) nonbinary LDPC code that has been used in a mul-ticarrier underwater acoustic system [17]. We will compare key oper-ations such as memory access, real number addition (for belief infor-mation), comparison, shifting, Galois field multiplication and division(for permutation and inverse permutation) with the existing decoderarchitectures. The hardware-related measures are also studied with theoverheads of the AMC being considered. In all simulations the code-words are transmitted through the binary AWGN channel. The max-imum number of decoding iterations is set to 10.

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 20, NO. 11, NOVEMBER 2012 2121

TABLE IHARDWARE OPERATION REDUCTION OF EMS, AMC-EMS, AND AMC-MS COMPARED WITH MS

TABLE IIESTIMATED HARDWARE MEASURES OF MS, EMS, AMC-EMS

Fig. 5. Performance of joint AMC and EMS with different and .

A. Decoding PerformanceOur past work [18] has shown that applying AMC to MS incurs

less than 0.1 dB loss at the block-error-rate (BLER) of with , and the performance loss increases to 0.35 dB when . In comparison, EMS when reducing

from 16 to 12and 10 results in the performance loss of 0.15 and 0.3 dB, respectively,at the same BLER level. One issue of AMC-MS is that the hardwarecomplexity and throughput is determined by the worst case messagelength, even though the operational complexity can be significantly re-duced. To address this issue, we propose to apply AMC with EMS inthis paper. Fig. 5 shows the BLER performance of AMC with EMSinitialization. The AMC with and incurs about0.03 and 0.1 dB performance loss, respectively, to the conventionalEMS. However, AMC can further reduce the hardware operations ofEMS under the same throughput and memory size.

B. Complexity Reduction by the AMCWe choose three cases: EMS with

, AMC-MS with , and AMC-EMS with and

, to evaluate thereduction in hardware operational complexity. The results are listed inTable I as compared with MS at SNR 2.8 dB, where BLER can reach

. Clearly, AMC-MS is more effective in reducing hardware oper-ations than EMS; i.e., the number of major operations in AMC-MS isless than 50% of that of EMS. By applying AMC to EMS, the hard-ware operations can be further reduced at the expense of negligibleperformance loss (see Fig. 5). It is expected that by reducing hardwareoperations, the proposed AMC will also enable significant reduction inenergy consumption.

C. Comparison of Hardware ImplementationsAlthough the physical implementation of the proposed decoder ar-

chitecture is beyond the scope of this paper, hardware-related measuresare necessary to evaluate the decoder efficiency. In this subsection, wewill study the hardware cost, throughput (latency), and power savingsof the proposed decoder architecture. Similar to the existing work [14],[15], these results will be estimated and compared with existing de-coders.

As discussed in Section III, the AMC-based decoder needs sortersfor truncation operations. However, the amount of hardware opera-tions is reduced at the CNU and VNU due to AMC, which can offsetthe overhead of the sorters. Note that the CNU dominates the decoderhardware complexity. Thus, we will focus on this unit for comparingthe hardware cost. The major components in the CNU are: 1) statustable and message RAM; 2) registers; 3) MUX; 4) comparators; 5)adders; and 6) decoder for sorter control as shown in Fig. 2. Comparedwith EMS, AMC-EMS needs an additional decoder that generates thecontrol signal for the variable sorter. As shown in Table II, this de-coder takes about 2.2% of the total transistor count in the CNU. Also,the slightly larger memory size (3.76%) in AMC-EMS compared withEMS comes from the overhead to record the length of each variablemessage, while EMS keeps messages with a constant length. The over-head in the VNU for AMC is also very small, consisting only of a de-coder and a comparator. Both AMC-EMS and EMS (

and bits quantization) save about 30% of the hardware cost as com-pared with MS. Note that since the proposed AMC is applied to theCNU and VNU only, the hardware cost of other units in Fig. 1 in dif-ferent implementations is more or less the same and thus we do notinclude them in the comparison.

For all the three decoders in Table II, the critical path is in the sorter,which is comprised of a comparator, an XOR gate and a 2-to-1 MUX.The critical path consists of 12 2-inputs XOR gates for MS and EMS,assuming a 5-bit carry ripple chain comparator implementation. ForAMC-EMS, the cell enable (CE) signal of the sorter is generated by


the outputs from both the decoder and comparator, thereby adding onemore 2-input AND gate on the critical path. In a full parallel decoderimplementation, all the messages are updated simultaneously. Thusthe worst case message length determines the number of clock cy-cles to finish one decoding iteration, although the message length forAMC-EMS will be reduced during the iterations. The AMC-EMS hasslightly lower throughput (about 4%) than EMS due to the longer crit-ical path in the sorter, while it takes the same number of clock cyclesto finish one iteration. As MS needs to compute 32 messages whileAMC-EMS (and EMS) only needs to compute 20 in the CNU, thethroughput can actually be improved by by truncation as com-pared with MS. On the other hand, in a sequential implementationthe messages are updated in series. As the message length varies a lotin AMC-EMS, so does the number of clock cycles spent on updatingthe messages in each iteration. Thus, in a sequential architecture, thetotal number of clock cycles for finishing decoding iterations is deter-mined by the average message length. The average AMC-EMS mes-sage length is about 60% of that of EMS, indicating about higherthroughput than EMS due to a smaller amount of message computation.Compared with MS, sequential AMC-EMS can improve the throughputby about .

Although AMC-EMS has slightly higher hardware cost and smallerthroughput (in a full parallel implementation) than EMS, it enablessignificant power savings critical to emerging applications such as un-derwater acoustic communications [17], where one of the very scarceresources is energy. Note that so far no measured results of powerconsumption from the implementations of nonbinary LDPC decoderscan be found in the literature. However, the power consumption canbe reasonably estimated by considering the major hardware operationsthat dominate the power consumption, such as memory access, addi-tion, comparison, shifting, Galois filed multiplication and division, asshown in Table I. Assume that each kind of these operations consumesabout the same power for MS, EMS, and AMC-EMS. Note that this as-sumption is conservative because EMS and AMC-EMS use a smallermemory size. As shown in Table II, among all the major operations,the AMC-EMS shows the smallest reduction in memory accesses andreal additions, which indicates that AMC-EMS can potentially reduceabout 65% and 50% of power consumption in memory accesses andreal additions, as compared with MS and EMS, respectively, under thesame clock frequency. The power reduction in other operations, suchas comparison and shifting, is even larger. Thus, we expect the overallpower reduction of AMC-EMS to reach about 65% and 50% comparedwith MS and EMS, respectively, assuming an equivalent implementa-tion.

V. CONCLUSION

In this paper, we proposed a new nonbinary LDPC decoding archi-tecture based on AMC, which can significantly reduce the hardwareoperations and power consumption in VLSI implementations by adap-tively adjusting the message length of belief information while main-taining the required performance. A truncation scheme was developedto implement the proposed AMC. The architecture design was opti-mized to fully exploit the benefit of the proposed AMC. Further work isbeing directed towards a full fledged ASIC implementation for under-water acoustic sensor networks subject to stringent energy constraints.

REFERENCES[1] R. G. Gallager, Low-Density Parity-Check Codes. Cambridge, MA:

MIT Press, 1963.[2] D. J. C. MacKay and R. Neal, Good codes based on very sparse

matrices, in Proc. Cryptography Coding, 5th IMA Conf., 1995, pp.100111.

[3] T. Zhang and K. K. Parhi, VLSI implementation-oriented (3,k) reg-ular low-density parity-check codes, in Proc. IEEE Workshop SignalProcess. Syst. (SIPS), 2001, pp. 2536.

[4] M. M. Mansour and N. R. Shanbhag, High-throughput LDPC de-coders, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 11,no. 6, pp. 976996, Dec. 2003.

[5] J. Sha, Z. Wang, M. Gao, and L. Li, Multi-Gb/s LDPC code design andimplementation, IEEE Trans. Very Large Scale Integr. (VLSI) Syst.,vol. 17, no. 2, pp. 262268, Feb. 2009.

[6] M. Davey and D. J. C. MacKay, Low density parity check codes overGF(q), IEEE Commun. Lett., vol. 2, no. 6, pp. 165167, Jun. 1998.

[7] R. Peng and R. Chen, Application of nonbinary LDPC codes for com-munication over fading channels using higher order modulations, inProc. IEEE Globecom, 2006, pp. 15.

[8] V. Savin, Min-Max decoding for nonbinary LDPC codes, in Proc.IEEE ISIT, 2008, pp. 960964.

[9] H. Song and J. R. Cruz, Reduced-complexity decoding of q-ary LDPCcodes for magnetic recording, IEEE Trans. Magn., vol. 39, no. 2, pp.10811087, Mar. 2003.

[10] D. Declercq and M. Fossorier, Decoding algorithms for nonbinaryLDPC codes over GF(q), IEEE Trans. Commun., vol. 55, no. 4, pp.633643, Apr. 2007.

[11] H. Wymeersch, H. Steendam, and M. Moeneclaey, Log-domain de-coding of LDPC codes over GF(q), in Proc. IEEE ICC, 2004, pp.772776.

[12] A. Voicila, D. Decercq, F. Verdier, M. Fossorier, and P. Urard, Low-compleixty, low memory EMS algorithm for non-binary LDPC codes,in Proc. ICC, 2007, pp. 671676.

[13] Y. Chang, A. Vila Casado, M. Chang, and R. D. Wesel, Lower-com-plexity layered belief-propagation decoding of LDPC codes, in Proc.ICC, 2008, pp. 11551160.

[14] X. Zhang and F. Cai, Efficient partial-parallel decoder architecture forquasi-cyclic nonbinary LDPC codes, IEEE Trans. Circuits Syst. I, Reg.Papers, vol. 58, no. 2, pp. 402414, Feb. 2010.

[15] J. Lin, J. Sha, and Z. Wang, An efficient VLSI architecture for nonbi-nary LDPC decoders, IEEE Trans. Circuits Syst. II, Exp. Briefs, vol.57, no. 1, pp. 5155, Jan. 2010.

[16] A. Voicila, F. Verdier, D. Decercq, M. Fossorier, and P. Urard, Archi-tecture of a low-complexity non-binary LDPC decoder for high orderfields, in Proc. ISCIT, 2007, pp. 12011206.

[17] J. Huang, S. Zhou, and P. Willet, Nonbinary LDPC coding formulticarrier underwater acoustic communication, IEEE J. Sel. AreasCommun., vol. 26, no. 9, pp. 16841696, Sep. 2008.

[18] W. Tang, J. Huang, L. Wang, and S. Zhou, Nonbinary LDPC de-coding by Min-Sum with adaptive message control, in Proc. Int. Conf.Acoust., Speech, Signal Process. (ICASSP), 2011, pp. 31643167.

Documents

06018325