
Semi-Parallel Reconfigurable Architectures for Real-Time LDPC Decoding

Marjan Karkooti and Joseph R. Cavallaro

Center for Multimedia Communication, Department of Electrical and Computer Engineering

Rice University, 6100 Main St., Houston, TX 77005-1892. {marjan, cavallar}@rice.edu

Abstract

This paper presents a semi-parallel architecture for decoding Low Density Parity Check (LDPC) codes. A modified version of the Min-Sum algorithm has been used, which has the advantage of simpler computations compared to the Sum-Product algorithm without any loss in performance. The special structure of the parity check matrix of the proposed code leads to an efficient semi-parallel implementation of the decoder for a family of (3, 6) LDPC codes. A prototype architecture has been implemented in VHDL on programmable hardware. The design is easily scalable and reconfigurable for larger block sizes. Simulation results show that our proposed decoder for a block length of 1536 bits can achieve data rates up to 127 Mbps.

Keywords: Reconfigurable architecture, FPGA implementation, channel coding, parallel architecture, area-time tradeoffs.

1. Introduction

Future generations of wireless devices will need to transmit and receive high data rate information in real-time. This poses a challenge to find an optimal coding scheme that has good performance and can be efficiently implemented in hardware. Error correcting codes insert redundancy into the transmitted data stream so that the receiver can detect and possibly correct errors that occur during transmission.

Low Density Parity Check (LDPC) codes are a special case of error correcting codes that have recently been receiving a lot of attention because of their very high throughput and very good decoding performance. The inherent parallelism of the decoding algorithm for LDPC codes makes it very suitable for hardware implementation.

Gallager [4] proposed LDPC codes in the early 1960s, but his work received no attention until after the invention of turbo codes, which used the same concept of iterative decoding. In 1996, MacKay and Neal [7] re-discovered LDPC codes. While standards for Viterbi and turbo codes have emerged for communication applications, the flexibility of designing LDPC codes allows for a larger family of codes and encoder/decoder structures. Some initial proposals for LDPC codes for DVB-S2 are emerging [6].

In the last few years some work has been done on designing architectures for LDPC coding. This area is still very active, and researchers are exploring the trade-offs between area, time, power consumption and performance. Here we mention some of the most closely related work in this area. Blanksby and Howland [1] directly mapped the Sum-Product decoding algorithm to hardware. They used a fully parallel approach and wired all the functional units together according to the Tanner graph connections. Although this decoder has very good performance, the routing complexity and overhead make the approach infeasible for larger block lengths. Also, implementing all the processing units enlarges the chip area.

Another approach is a semi-parallel decoder, in which the functional units are reused in order to decrease the chip area. A semi-parallel architecture takes more time to decode a codeword, so its throughput is lower than that of a fully parallel architecture. Zhang [11] presented an FPGA implementation of a (3, 6) regular LDPC semi-parallel decoder with a symbol decoding throughput of tens of Mbps. He used a multi-layered interconnection network to access messages from memory. Mansour [8] proposed a low-power, rate-1/2 regular semi-parallel decoder architecture. He used a fully-structured parity check matrix, which led to a simpler memory addressing scheme than [11]. Chen [2] implemented a semi-parallel architecture for a rate-1/2, 8088-bit irregular LDPC code both on FPGA and ASIC. They used a multiplexer network to select the inputs for the processing units, and they report decoding throughputs for both the FPGA and ASIC implementations.


H = [ 1 0 1 0 1 0 1 0
      1 0 0 1 0 1 0 1
      0 1 1 0 0 1 1 0
      0 1 0 1 1 0 0 1 ]

[Figure 1. Tanner graph of a parity check matrix: check nodes f1–f4 connected to bit nodes x1–x8 according to H.]

All of these architectures use either the Sum-Product or the BCJR algorithm.

The contributions of this paper are as follows. First, we designed a structured parity check matrix which is suitable for a semi-parallel hardware design and is very efficient in terms of memory usage: instead of storing the locations of all the ones in the matrix, we can store certain "block shift values" and then restore the addresses using counters. Second, we introduce a semi-parallel architecture for decoding LDPC codes that is scalable to a variety of block lengths. The decoder is the first implementation of the Modified Min-Sum algorithm and achieves very good performance with low complexity.

The paper is organized as follows: Sections 2 and 3 give an overview of LDPC codes and their encoding/decoding algorithms. Section 4 proposes the architecture for the LDPC decoder; implementation issues and results are discussed in this part. We show that by using a structured parity check matrix, a scalable hardware architecture can be designed. Concluding remarks follow in Section 5.

2. Low Density Parity Check Codes

Low Density Parity Check codes are a class of linear block codes defined by a parity check matrix H. The parity check matrix H, of size (N − K) × N, consists of only zeros and ones and is very sparse, which means that the density of ones in the matrix is very low. Given K information bits, the set of LDPC codewords c in the code space of length N spans the null space of the parity check matrix H, in which cH^T = 0.

For a (W_b, W_c) regular LDPC code, each column of the parity check matrix H has W_b ones and each row has W_c ones. If the degrees per row or column are not constant, then the code is irregular. Some irregular codes have shown better performance than regular ones [3], but irregularity results in more complex hardware and inefficiency in terms of re-usability of the functional units. In this work we have considered regular codes to achieve full utilization of the processing units. The code rate R is equal to K/N, which means that (N − K) redundant bits have been added to the message so as to correct the errors.

LDPC codes can be represented effectively by a bipartite graph called a "Tanner" graph. There are two classes of nodes in a Tanner graph, "Bit nodes" and "Check nodes". The Tanner graph of a code is drawn according to the following rule: Check node f_i (i = 1, ..., N − K) is connected to Bit node x_j (j = 1, ..., N) whenever element h_ij in the parity check matrix H is a one. Figure 1 shows the Tanner graph of a small parity check matrix H. In this graph each Bit node is connected to two Check nodes (bit degree = 2) and each Check node has a degree of four.
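As a concrete illustration of this rule, the short Python sketch below (ours, not part of the original paper) builds the neighbour lists of the Tanner graph for the H matrix of Figure 1 and confirms the stated node degrees.

```python
# Minimal sketch: derive the Tanner-graph adjacency of the example parity check
# matrix of Figure 1. Check node f_i connects to bit node x_j whenever H[i][j] == 1.
H = [
    [1, 0, 1, 0, 1, 0, 1, 0],
    [1, 0, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 1, 0, 0, 1],
]

check_neighbors = [[j for j, h in enumerate(row) if h] for row in H]          # per f_i
bit_neighbors = [[i for i, row in enumerate(H) if row[j]] for j in range(8)]  # per x_j

assert all(len(n) == 4 for n in check_neighbors)  # check-node degree = 4
assert all(len(n) == 2 for n in bit_neighbors)    # bit-node degree = 2
```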

3. Encoding and decoding

In order to encode a message u of K bits with an LDPC code, one might compute c = uG, in which c is the N-bit codeword and G, of size K × N, is the generator matrix of the code. At first glance, encoding may seem to be a computationally expensive task, but there exist reduced-complexity algorithms for encoding LDPC codes [10]. In this paper, our focus is on the decoder; we will discuss the issues in decoder design in more detail.
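For illustration only, the following sketch shows the encoding operation c = uG over GF(2) for a toy rate-1/2 code; the generator matrix used here is hypothetical and is not the paper's code.

```python
import numpy as np

# Illustrative sketch: systematic encoding c = u G over GF(2) for a toy code
# with K = 4 information bits and N = 8 code bits. P is a made-up parity part.
K, N = 4, 8
P = np.array([[1, 0, 1, 1],
              [1, 1, 0, 1],
              [1, 1, 1, 0],
              [0, 1, 1, 1]], dtype=int)
G = np.hstack([np.eye(K, dtype=int), P])   # G = [I_K | P], size K x N

u = np.array([1, 0, 1, 1])                 # K information bits
c = u @ G % 2                              # N-bit codeword
print(c)
```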

The Min-Sum algorithm is an approximation of the Sum-Product algorithm in which a set of calculations on the nonlinear function Φ(x) = −log(tanh(x/2)) is approximated by a minimum function. In the literature, it has been shown that scaling the soft information during Min-Sum decoding results in better performance. Using density evolution, Heo [5] showed that a scaling factor of 0.8 is optimal for a (3, 6) LDPC code. We call this version of the algorithm the "Modified Min-Sum" algorithm.
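The following sketch illustrates one check-node update of the scaled ("modified") min-sum rule under the assumption that the scaling factor is applied directly to the extrinsic check-to-bit messages; the function name and the example messages are ours, not the paper's.

```python
import math

# Sketch of a scaled min-sum check-node update: each outgoing message is the
# minimum magnitude of the other incoming messages, with the product of their
# signs, multiplied by a scaling factor.
def check_node_update(incoming, scale=0.75):
    """incoming: list of LLR messages from the neighbouring bit nodes."""
    outgoing = []
    for k in range(len(incoming)):
        others = incoming[:k] + incoming[k + 1:]
        sign = math.prod(1 if m >= 0 else -1 for m in others)
        mag = min(abs(m) for m in others)
        outgoing.append(scale * sign * mag)
    return outgoing

print(check_node_update([0.9, -1.4, 2.1, -0.3, 1.2, 0.7]))
```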

Figure 2 shows a comparison between the performance of the Sum-Product, Min-Sum and Modified Min-Sum algorithms. It can be seen that scaling the soft information not only compensates for the loss of performance caused by the approximation, but also results in performance superior to the Sum-Product algorithm, because of the reduction in the overestimation error. Modified Min-Sum is used as the decoding algorithm in our architecture.

Table 1 shows a comparison between the number of calculations needed by each of the decoding algorithms for a (3, 6) LDPC code in each iteration of decoding. From the table it is clear that the Modified Min-Sum algorithm substitutes the costly function evaluations with additions and shifts. Although Modified Min-Sum requires a few more additions than the other algorithms, it is still preferred since the nonlinear function evaluations are omitted.


Table 1. Complexity comparison between the algorithms per iteration (operation types; Modified Min-Sum uses extra additions and shifts for the scaling).

Algorithm          Additions     Nonlinear function evals.   Shifts
Log-Sum-Product    yes           yes                         -
Min-Sum            yes           -                           -
Mod. Min-Sum       yes (more)    -                           yes

[Figure 2. Comparison of different decoding algorithms: BER vs. Eb/No for block size 768, rate 1/2, with 20 iterations of Min-Sum, Log-Sum-Product and Modified Min-Sum.]


The function Φ(x) = −log(tanh(x/2)) is sensitive to quantization error, which results in a loss of decoder performance. Either direct implementation or look-up tables can be used to implement this function. Direct implementation is costly in hardware [1]. Look-up tables (LUTs) are very sensitive to the number of quantization bits and the number of LUT entries [11]. Since several LUTs must be used in parallel in each functional unit, they can occupy a large area of the chip. Omitting this function from the decoding saves area and complexity.
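The sketch below illustrates why this matters: it evaluates Φ(x) = −log(tanh(x/2)) directly and through a deliberately coarse, hypothetical 16-entry look-up table, so the approximation error of the LUT approach can be seen; the table size and spacing are ours, not the paper's.

```python
import math

# Sketch: the nonlinear function used by the Sum-Product algorithm, computed
# directly and via a coarse look-up table.
def phi(x):
    return -math.log(math.tanh(x / 2.0))

# Hypothetical 16-entry LUT over (0, 4] with uniform spacing of 0.25.
STEP = 0.25
LUT = [phi((i + 1) * STEP) for i in range(16)]

def phi_lut(x):
    idx = min(int(x / STEP), len(LUT) - 1)
    return LUT[idx]

for x in (0.3, 1.0, 2.5):
    print(x, phi(x), phi_lut(x))
```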

All of the above iterative decoding algorithms have the following steps; they differ only in the messages that they pass among the nodes (a minimal software sketch of this loop is given after the list).

• Initialization: read the values from the channel in each Bit node x_j and send the messages to the corresponding Check nodes f_i.

• Iteration: compute the messages at the Check nodes and pass a unique message to each Bit node.

• Compute the messages at the Bit nodes and pass them to the Check nodes.

• Threshold the values calculated in each Bit node to find a codeword.

• If the codeword satisfies all the parity check equations, or if the maximum number of iterations is reached, then stop; otherwise continue iterating.
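A minimal software sketch of this loop, using the scaled min-sum updates and the small H matrix of Figure 1, is shown below; the flooding schedule and stopping test follow the steps above, but the code itself and its example channel values are illustrative and are not the paper's implementation.

```python
import math

# Minimal end-to-end sketch of the iterative decoding loop (scaled min-sum).
H = [
    [1, 0, 1, 0, 1, 0, 1, 0],
    [1, 0, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 1, 0, 0, 1],
]

def decode(channel_llr, max_iter=20, scale=0.75):
    m_rows, n_cols = len(H), len(H[0])
    # bit-to-check messages, initialised with the channel LLRs
    b2c = {(i, j): channel_llr[j] for i in range(m_rows) for j in range(n_cols) if H[i][j]}
    c2b = {}
    for _ in range(max_iter):
        # check-node half-iteration
        for i in range(m_rows):
            cols = [j for j in range(n_cols) if H[i][j]]
            for j in cols:
                others = [b2c[(i, k)] for k in cols if k != j]
                sign = math.prod(1 if m >= 0 else -1 for m in others)
                c2b[(i, j)] = scale * sign * min(abs(m) for m in others)
        # bit-node half-iteration and hard decision
        total = [channel_llr[j] + sum(c2b[(i, j)] for i in range(m_rows) if H[i][j])
                 for j in range(n_cols)]
        hard = [0 if t >= 0 else 1 for t in total]
        for i in range(m_rows):
            for j in range(n_cols):
                if H[i][j]:
                    b2c[(i, j)] = total[j] - c2b[(i, j)]
        # stop early if all parity checks are satisfied
        if all(sum(H[i][j] * hard[j] for j in range(n_cols)) % 2 == 0 for i in range(m_rows)):
            break
    return hard

# all-zero codeword sent with BPSK over AWGN -> positive LLRs, one corrupted bit
print(decode([2.1, 1.7, -0.4, 1.9, 2.3, 1.5, 1.8, 2.0]))
```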

We consider an AWGN (Additive White Gaussian Noise)channel and BPSK (Binary Phase Shift Keying) modulationof the signals.

4. Architecture design

The structure of the parity check matrix has a major role in the performance of the decoder, and finding a good matrix is an essential part of the decoder design. As mentioned earlier, the parity check matrix determines the connections between the different processing nodes in the decoder according to the Tanner graph. Also, the degree of each node is proportional to the amount of computation that must be done in that node. For example, a code whose node degrees are twice those of a (3, 6) code has twice as many connections, which results in twice as many messages being passed among the nodes and twice the memory needed to store them. Chung et al. [3] showed that (3, 6) is the best choice for a rate-1/2 LDPC code. We have used a (3, 6) code in our design.

In each iteration of the decoding, first all the Check nodes receive and update their messages; then, in the next half-iteration, all the Bit nodes update their messages. If we choose a one-to-one relation between processing units in the hardware and the Bit and Check nodes in the Tanner graph, then the design is fully parallel. Obviously, a fully parallel approach takes a large area, but it is very fast. There is also no need for central memory blocks to store the messages; they can be latched close to the processing units [1]. With this approach, the hardware design is tied to one particular parity check matrix.

Table 2 shows a comparison between the resources needed for a parallel, semi-parallel or serial implementation of the decoder. In this table, W_b is the degree of the Bit nodes, W_c is the degree of the Check nodes, b is the number of bits per message and S is the folding factor of the semi-parallel design.

Implementing the LDPC decoding algorithm in a fully-serial architecture has the smallest area, since it is sufficient to have just one Bit Functional Unit (BFU) and one Check Functional Unit (CFU). The fully-serial approach is suitable for Digital Signal Processors (DSPs), in which only a few functional units are available. However, the decoding speed of a serial decoder is very low.

To balance the trade-off between area and time, the best strategy is a semi-parallel design. This involves creating N_c CFUs and N_b BFUs, in which N_c × S = N − K and N_b × S = N, and reusing these units throughout the decoding time.


Table 2. LDPC decoder hardware resource comparison.

Design Parameter                  Fully Parallel      Semi-Parallel           Fully Serial
Code length                       N                   N                       N
Information length                K                   K                       K
Code rate                         K/N                 K/N                     K/N
BFUs                              N                   N/S                     1
CFUs                              N − K               (N − K)/S               1
Message memory (bits)             b × N × W_b         b × N × W_b             b × N × W_b
Wiring                            largest             moderate                smallest
Time per iteration                shortest            moderate                longest
Counters (address generators)     none                several                 one
Address decoders (for memories)   none                several                 one
Memory type                       scattered latches   several memory blocks   one memory block

[Figure 3. Parity check matrix of a (3,6) LDPC code: scatter plot of the nonzero entries, with columns on the horizontal axis and rows on the vertical axis.]

For a semi-parallel design, the parity check matrix should be structured so as to enable re-use of the units. Also, in order to design a fast architecture for LDPC decoding, we should first design a good H matrix that results in good performance. Following a block-structured design similar to [8], we have designed H matrices for (3, 6) LDPC codes.

Figure 3 shows the structured parity check matrix used in this paper. The matrix consists of 3 × 6 square sub-blocks, each of size (N/6) × (N/6), where N/6 is a power of two. Each sub-block is an identity matrix that has been cyclically shifted to the right by s_mn positions (m = 1, ..., 3, n = 1, ..., 6). The shift values can be any value between 0 and N/6 − 1, and they have been determined with a heuristic search for the best performance among codes of the same structure. Our approach differs from [8] in that the sub-block length is not a prime number; also, the shifts are determined by simulation, searching for the best matrix that satisfies our constraints (with the highest girth [9]).

[Figure 4. Simulation results for the decoding performance of different block lengths: BER vs. Eb/No for Modified Min-Sum with 20 iterations and block lengths 768 and 1536.]

Figure 4 compares the performance of two (3, 6) LDPC codes of rate 1/2 with block lengths of 768 and 1536 designed with the above structure. As a comparison point, [11] reports the BER achieved by a longer LDPC code at comparable SNR values.
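The construction just described can be sketched in a few lines; the shift values used below are placeholders rather than the ones found by the heuristic search.

```python
import numpy as np

# Sketch of the block-structured parity check matrix: a 3 x 6 array of
# (N/6) x (N/6) identity sub-blocks, each cyclically right-shifted by a
# per-block value.
def build_h(shifts, sub_size):
    rows = []
    for shift_row in shifts:
        blocks = [np.roll(np.eye(sub_size, dtype=int), s, axis=1) for s in shift_row]
        rows.append(np.hstack(blocks))
    return np.vstack(rows)

shifts = [[0, 3, 7, 12, 21, 30],
          [1, 5, 9, 14, 26, 31],
          [2, 6, 11, 19, 27, 33]]          # example values only
H = build_h(shifts, sub_size=128)           # N = 6 * 128 = 768
assert H.shape == (384, 768)
assert (H.sum(axis=0) == 3).all() and (H.sum(axis=1) == 6).all()   # (3, 6) regular
```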

4.1. Reconfigurable architecture

For LDPC codes, increasing the block length improves performance. That is because the Bit and Check nodes receive extrinsic information from nodes that are very far from them in the block, which increases the error-correcting ability of the code. A scalable architecture that can be adjusted to different block lengths enables us to choose a suitable block length N for different applications; usually N is on the order of a few thousand bits for practical uses. Our design is flexible for any (3, 6) LDPC block length in which the sub-block size N/6 is a power of two; for example, N = 768. By choosing different sub-block sizes we obtain different block lengths. We will discuss the statistics and design of the architecture for a block length of 1536 bits. The proposed LDPC decoder can be scaled for any block length of this form; the largest block length is determined by the physical limitations of the platform, such as the FPGA or ASIC. It should be noted that changing the block length is an off-line process, since a new bitstream file must be compiled and downloaded to the FPGA.

The overall architecture of the (3, 6) LDPC decoder is shown in Figure 5.


[Figure 5. Overall architecture of a semi-parallel LDPC decoder: 48 CFUs and 96 BFUs, a controller, the channel input, message memories MEM_mn (m = 1..3, n = 1..6), initialization memories MemInit_n (n = 1..6), code-bit memories MemCode_mn, and the output.]

This semi-parallel architecture consists of 18 (3 × 6) memory units MEM_mn (m = 1..3, n = 1..6) to store the values passed between the Bit nodes and Check nodes, and 6 memories MemInit_n (n = 1..6) to store the initial values read from the channel. MemCode_mn stores the code bits resulting from each iteration of the decoding. The architecture has several Bit Functional Units and Check Functional Units that are reused within each iteration. Since the code rate is 1/2, there are twice as many columns in the parity check matrix as rows, which means that the number of BFUs should be twice the number of CFUs to balance the time spent on each half-iteration. For a block length of 1536, we have chosen a parallelism factor of S = 16, which means that we have (1536/2)/16 = 48 CFUs and 1536/16 = 96 BFUs. Each of these units is used 16 times in each iteration. The units perform computations on different input sets that are synchronized by the controller unit.

Figure 6 shows the interconnection between the memories, the address generators and the CFUs that are used in the first half of each iteration. In each cycle the address generators (ADGC) produce the addresses of the messages for the CFUs. Split/Merge (S/M) units pack and unpack the messages stored to and read from the memories. To increase the parallelism factor, it is possible to pack several messages into a single memory location. This places a constraint on the design of the H matrix, since the shift values must all be multiples of the packing factor. A finite state machine, the control unit, supervises the flow of messages into and out of the memories and the functional units.
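The counter-based address generation mentioned in the contributions follows directly from this structure: because every sub-block is a cyclically shifted identity, a counter plus the stored shift value reproduces the position of every one in the block. A hypothetical sketch of that idea:

```python
# Hypothetical sketch of counter-based address generation: the single 1 in row
# r of a sub-block shifted right by `shift` sits at column (r + shift) mod
# sub_size, so a counter plus the stored shift value yields all message addresses.
def addresses_for_block(shift, sub_size):
    for r in range(sub_size):             # counter value = local row index
        yield r, (r + shift) % sub_size   # (local row, local column) of the 1

# Example: sub-block of size 8 shifted right by 3.
print(list(addresses_for_block(shift=3, sub_size=8)))
```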

Figure 7 shows the architecture of the Check Functional Unit (CFU). Each CFU has W_c = 6 inputs and 6 outputs. The unit computes the minimum over each choice of five out of its six inputs and sends the result to the output port corresponding to the input that is excluded from the set. For example, Out1 is the result of:

Out1 = min(|In2|, |In3|, |In4|, |In5|, |In6|),   (1)

in which | · | denotes the absolute value.
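A direct software rendering of equation (1), applied to all six outputs, is given below; sign handling, which the hardware performs through the sign-magnitude conversions of Figure 7, is omitted here.

```python
# Sketch of equation (1) for all six CFU outputs: each output is the minimum
# magnitude over the other five inputs.
def cfu_magnitudes(inputs):
    assert len(inputs) == 6
    mags = [abs(x) for x in inputs]
    return [min(mags[:k] + mags[k + 1:]) for k in range(6)]

print(cfu_magnitudes([0.9, -1.4, 2.1, -0.3, 1.2, 0.7]))
# -> Out1 uses In2..In6, Out2 uses In1 and In3..In6, and so on.
```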

[Figure 6. Connections between memories, CFUs and address generators: three CFU/MEM sets, each containing 16 CFUs, the memories MEM_mn and MemCode_mn, Split/Merge (S/M) units and address generators (ADGC), all supervised by the controller.]

[Figure 7. Check Functional Unit (CFU) architecture: the six inputs In1–In6 pass through absolute-value (ABS) units, a tree of pairwise minimum (Min) units produces the six outputs Out1–Out6 after sign-magnitude to two's-complement (SM→2's) conversion, and a separate path checks the six code bits and outputs a Valid flag.]

Also, during the computations of the current iteration, the CFU checks the code bits resulting from the previous iteration to verify whether they satisfy the corresponding parity check equation (step 5 of the decoding algorithm). Once the first half of the iteration is complete, the results of all the parity checks on the codeword are ready as well. With this strategy, the computations in the Check nodes and Bit nodes proceed continuously, without waiting for the codeword from the previous iteration to be checked, which increases the decoding speed.

The interconnection between the BFUs, the memory units and the address generators (ADGB) is shown in Figure 8. The locations of the messages in the memories are chosen such that a single address generator can service all the BFUs. The controller ensures that all the units are synchronized.

The architecture of the Bit Functional Unit (BFU) is shown in Figure 9.


[Figure 8. Connections between memories, BFUs and address generators: six BFU/Mem sets, each containing 16 BFUs, the memories MEM_mn, MemInit_n and MemCode_mn, Split/Merge (S/M) units and a shared address generator (ADGB), all supervised by the controller.]

[Figure 9. Bit Functional Unit (BFU) architecture: the three inputs In1–In3 and the initial channel value are combined with adders, each output Out1–Out3 is scaled by shift-and-add (>>1 plus >>2), and the thresholded sum produces the code bit.]

This unit adds different combinations of its inputs and scales them by a factor of 0.75, which is implemented with shifts and an addition. It also thresholds the sum of its inputs to find the code bit corresponding to that Bit node.
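The shift-and-add scaling can be sketched as follows for an integer fixed-point message; the truncation behaviour of the shifts makes the result an approximation of 0.75x.

```python
# Sketch of the shift-and-add scaling used in the BFU: for an integer
# fixed-point message x, 0.75 * x is approximated as (x >> 1) + (x >> 2),
# avoiding a multiplier. (The right shifts truncate, so the result is approximate.)
def scale_three_quarters(x):
    return (x >> 1) + (x >> 2)

for x in (8, 13, -12):
    print(x, scale_three_quarters(x), 0.75 * x)
```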

This architecture can also be used for structured irregular codes with minor modifications. For example, assume that the parity check matrix of the irregular code is similar to Figure 3 but has more block rows and block columns, some of which are all-zero blocks; the result is an irregular code with two different row degrees and two different column degrees. We would add some circuitry so that, for the all-zero blocks of the parity check matrix, a zero message is sent to the corresponding inputs of the BFUs and CFUs. In this case the BFUs and CFUs have as many inputs and outputs as the maximum column and row degrees, respectively.

4.2. FPGA architecture

For real-time hardware, fixed-point computations are less costly than floating point, so a fixed-point decoder uses quantized values of the soft information. There is a trade-off between the number of quantization bits, the area of the design, the power consumption and the performance.

[Figure 10. Comparison between different quantization levels: BER vs. Eb/No for Modified Min-Sum with 4, 5 and 6 quantization bits and with floating point.]

Using more bits decreases the bit error rate but increases the area and power consumption of the chip. Also, depending on the nature of the messages, the split between the integer and fractional parts of the representation is important. Our simulations show that using 5 bits per message is enough for good performance; each message is divided into one sign bit, two integer bits and two fractional bits. Figure 10 shows the performance of the decoder using 4, 5 and 6 bits, as well as the floating-point version.
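A sketch of this quantization scheme, assuming simple rounding and saturation (the exact rounding mode of the hardware is not specified in the text), is shown below.

```python
# Sketch of the 5-bit quantization: one sign bit, two integer bits and two
# fractional bits, i.e. magnitudes on a 0.25 grid saturating at 3.75.
def quantize(llr, frac_bits=2, int_bits=2):
    step = 1.0 / (1 << frac_bits)                  # 0.25
    max_mag = (1 << (int_bits + frac_bits)) - 1    # 15 steps -> 3.75
    mag = min(round(abs(llr) / step), max_mag)
    return (-1 if llr < 0 else 1) * mag * step

for x in (0.1, -1.37, 2.6, 9.0):
    print(x, quantize(x))
```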

Since the memory blocks in the FPGA have no more than two ports, we increase the number of messages read or written per clock cycle by packing them: eight message values are stored in a single memory address. This enables us to read 2 × 8 = 16 messages per memory per cycle from the dual-port memories.
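The packing step can be sketched as follows; the bit ordering within the word is illustrative, not necessarily the hardware's actual layout.

```python
# Sketch of packing eight 5-bit messages into one memory word (40 bits here),
# so a dual-port memory can deliver 2 x 8 = 16 messages per cycle.
BITS = 5
MASK = (1 << BITS) - 1

def pack(msgs):                      # msgs: eight 5-bit integers (0..31)
    word = 0
    for k, m in enumerate(msgs):
        word |= (m & MASK) << (k * BITS)
    return word

def unpack(word):
    return [(word >> (k * BITS)) & MASK for k in range(8)]

msgs = [3, 17, 31, 0, 9, 22, 14, 5]
assert unpack(pack(msgs)) == msgs
```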

A prototype architecture has been implemented by writing VHDL (Hardware Description Language) code and targeting a Xilinx VirtexII-3000 FPGA. Table 3 shows the utilization statistics of the FPGA. Based on the Leonardo Spectrum synthesis tool report, the maximum clock frequency of this decoder is 121 MHz. Considering the parameters of our design, it takes a number of cycles to initialize the memories with the values read from the channel, a number of cycles for each CFU and each BFU half-iteration, and additional cycles to send out the resulting codeword. Assuming that the decoder performs n_itr iterations to finish the decoding, the data rate can be calculated with the following equation:

Data Rate = (K × f_clk) / N_cycles,   (2)

and

N_cycles = C_init + n_itr × (C_CFU + C_BFU) + C_out,


Table 3. Xilinx VirtexII-3000 FPGA utilization statistics.

Resource        Used      Utilization rate
Slices          11,352    79%
4-input LUTs    20,374    71%
Bonded IOBs     100       14%
Block RAMs      66        68%

in which K is the number of information bits, f_clk is the clock frequency, and the cycle counts C_init, C_CFU, C_BFU and C_out depend on the block length N, the packing ratio of the messages in the memories, the number of BFUs N_b and the number of CFUs N_c. With the maximum number of iterations, n_itr = 20 (the worst case), the data rate can reach 127 Mbps. This architecture is suitable for a family of codes with the structure described earlier and different block lengths, parallelism ratios and message lengths.
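Equation (2) can be evaluated as in the sketch below; the per-phase cycle counts and the clock frequency used in the example are placeholders, not the reported values.

```python
# Sketch of equation (2) with illustrative cycle counts.
def data_rate(k_bits, f_clk_hz, c_init, c_cfu, c_bfu, c_out, n_iter):
    cycles = c_init + n_iter * (c_cfu + c_bfu) + c_out
    return k_bits * f_clk_hz / cycles

# e.g. K = 768 information bits, a 100 MHz clock (illustrative) and 20 iterations
rate = data_rate(768, 100e6, c_init=32, c_cfu=16, c_bfu=16, c_out=32, n_iter=20)
print(rate / 1e6, "Mbps")
```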

Changing the block size of the codeword changes the sizes of the memory blocks. If we assume that the codes are still (3, 6) and have a parity check matrix similar to Figure 3, then all the CFUs, BFUs and address generators can be reused in the new architecture. The sizes of the memories change, and there is a slight modification in the address generator units because they must address a different number of memory words. This is done by changing the sizes of the counters used in the address generators; since the counters are parametric in the VHDL code, it only requires recompiling the code with the new values.

4.3. LabVIEW implementation

An alternative design has been implemented using LabVIEW FPGA from National Instruments. This architecture has the same characteristics as the VHDL version; the only difference is that it is built with the graphical interface of LabVIEW and runs in co-simulation mode, in which data input/output is done on the host PC and the decoding on the FPGA. This enables us to use the LDPC decoder in our end-to-end communication testbed at the Center for Multimedia Communication (CMC) at Rice University and to connect it directly to National Instruments radios and other hardware.

5. Conclusion

A semi-parallel architecture for decoding LDPC codes has been designed and implemented on Xilinx VirtexII FPGAs. The special structure of the parity check matrix simplifies the memory addressing and results in efficient storage of the matrix. The Modified Min-Sum algorithm provides good decoding performance with simple computations in the functional units. The semi-parallel architecture is easily scalable to different block sizes, message lengths and parallelism factors. For a (3, 6) LDPC code with a block length of 1536 bits, the decoder achieves a data rate of up to 127 Mbps.

6. Acknowledgements

This work was supported in part by a National Instruments Fellowship, and by NSF under grants ANI-9979465, EIA-0224458, and EIA-0321266.

References

[1] A. Blanksby and C. Howland. A 690-mW 1-Gbps 1024-b, Rate-1/2 Low-Density Parity-Check Code Decoder. IEEE Journal of Solid-State Circuits, 37(3):404–412, Mar 2002.

[2] Y. Chen and D. Hocevar. A FPGA and ASIC Implementation of Rate 1/2 8088-b Irregular Low Density Parity Check Decoder. IEEE Global Telecommunications Conference (GLOBECOM), 2003.

[3] S. Chung, T. Richardson, and R. Urbanke. Analysis of Sum-Product Decoding of Low-Density Parity-Check Codes Using a Gaussian Approximation. IEEE Trans. on Inform. Theory, 47(2):657–670, Feb 2001.

[4] R. Gallager. Low-Density Parity-Check Codes. IRE Trans. on Inform. Theory, 8:21–28, Jan 1962.

[5] J. Heo. Analysis of Scaling Soft Information on Low Density Parity Check Codes. Electronics Letters, 39(2):219–221, Jan 2003.

[6] L. Lee. LDPC Code, Application to the Next Generation Wireless Communication Systems, 2003. Fall VTC, panel presentation by Hughes Network.

[7] D. MacKay and R. Neal. Near Shannon Limit Performance of Low Density Parity Check Codes. Electronics Letters, 32:1645–1646, Aug 1996.

[8] M. Mansour and N. Shanbhag. Low Power VLSI Decoder Architectures for LDPC Codes. Proc. of the Int. Symp. on Low Power Electronics and Design, pages 284–289, 2002.

[9] Y. Mao and A. Banihashemi. A Heuristic Search for Good Low-Density Parity-Check Codes at Short Block Lengths. IEEE Int. Conf. on Communications, pages 41–44, Jun 2001.

[10] T. Richardson and R. Urbanke. Efficient Encoding of Low-Density Parity-Check Codes. IEEE Trans. on Inform. Theory, 47(2):638–656, Feb 2001.

[11] T. Zhang. Efficient VLSI Architectures for Error-Correcting Coding. PhD thesis, University of Minnesota, Jul 2002.
