CHAPTER 3
PARTIALLY PARALLEL LDPC DECODER ARCHITECTURE WITH
SCALABLE THROUGHPUT
3.1 INTRODUCTION
A high-throughput, parallelism-enhanced quasi-cyclic LDPC decoder
architecture is presented in this chapter. The decoder employs a partly parallel
decoding scheme and supports the fixed code rate of 1/2 with 19 different code
lengths of the IEEE 802.16e standard irregular LDPC codes. The main constraints
in the implementation of LDPC decoder architectures are interconnect complexity
and storage requirements. Fully parallel architectures achieve high throughput but
suffer from routing congestion for longer codes. Partly parallel architectures reduce
the routing complexity at the cost of throughput. In this architecture, the
parallelism is enhanced through a parallel factor (PF), which takes the values 1, 2
and 4, so that the throughput scales with the chosen parallel factor. The throughput
of the decoder increases with enhanced parallelism at the cost of hardware. The
performance and hardware utilization are compared with similar architectures.
3.2 ARCHITECTURE OF THE ENHANCED PARALLELISM DECODER
The overall architecture is explained in this section. First, the flow of
the decoding process and the mapping of the decoding algorithm onto it are
described. The decoding procedure involves two phases of operation: 1. check
node computation in the first phase and 2. variable node update in the second
phase. The two phases of operation together constitute one iteration. After the
defined maximum number of iterations is completed, the decoded code bits are
obtained from the check sums. The processing flow is illustrated with the flow
diagram shown in Figure 3.1.
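
The flow can also be expressed in behavioural form. The sketch below is a
minimal Python model of the flooding (TPMP) schedule of Figure 3.1, not the
hardware itself; the parity check matrix H is represented simply as a list of
checks, each listing its variable node indices.

# Behavioural Python model of the two-phase (flooding) decoding flow of
# Figure 3.1. H is a list of checks, each a list of variable indices.
def decode(H, llr, max_iterations=10):
    # initialisation: every variable-to-check message is the intrinsic LLR
    v2c = {(c, v): llr[v] for c, row in enumerate(H) for v in row}
    c2v = {}
    bits = [int(x < 0) for x in llr]
    for _ in range(max_iterations):
        # phase 1: check node update (min-sum, as in Equations (3.2)-(3.3))
        for c, row in enumerate(H):
            for v in row:
                others = [v2c[(c, u)] for u in row if u != v]
                sign = -1 if sum(x < 0 for x in others) % 2 else 1
                c2v[(c, v)] = sign * min(abs(x) for x in others)
        # phase 2: variable node update (intrinsic plus extrinsic sums)
        for c, row in enumerate(H):
            for v in row:
                v2c[(c, v)] = llr[v] + sum(
                    c2v[(d, v)] for d, r in enumerate(H) if v in r and d != c)
        # hard decision and check-sum test after the iteration
        bits = [int(llr[v] + sum(c2v[(d, v)]
                for d, r in enumerate(H) if v in r) < 0)
                for v in range(len(llr))]
        if all(sum(bits[v] for v in row) % 2 == 0 for row in H):
            break
    return bits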
Figure 3.1 Flow chart illustrating the two-phase decoding process
The two phases of operation are carried out by check node
processors (CNPs) and variable node processors (VNPs) respectively. The numbers
of check node and variable node processors are decided by the parallel factor and
the size of the base matrix. The base matrix size of the IEEE 802.16e standard for
code rate 1/2 is 12 × 24. The number of check node processors is 12 × PF
and the number of variable node processors is 24 × PF. A CNP computes z/PF
rows sequentially; hence, 12 × PF CNPs compute 12 × z rows in z/PF clock
cycles. A VNP updates z/PF columns sequentially; hence, 24 × PF VNPs
update 24 × z columns in z/PF cycles. The total number of clock cycles
required to complete these two phases is 2(z/PF). The decoding throughput is
given by Equation (3.1):
Throughput = (N × R × f) / ((N_CT + l) × N_it)          (3.1)

where N is the code length,
R is the code rate,
f is the synthesized frequency,
N_CT is the total number of clock cycles, given by 2(z/PF),
l is the latency due to pipelining, and
N_it is the number of iterations, which is set at 10 based on the MATLAB
simulation results.
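
As a numerical check on Equation (3.1), the short routine below evaluates the
throughput for given code parameters. With N = 2304, R = 0.5, z = 96, PF = 1
and f = 226 MHz, a pipeline latency of 6 cycles makes N_CT + l = 198, the cycle
count reported later in Table 3.2, and reproduces the 131.5 Mbps figure; the
latency value is inferred from that cycle count rather than stated explicitly here.

def decoder_throughput(n, rate, f_hz, z, pf, latency, n_iterations=10):
    """Equation (3.1): (N * R * f) / ((N_CT + l) * N_it), N_CT = 2 * z / PF."""
    n_ct = 2 * (z // pf)                 # clock cycles for the two phases
    return (n * rate * f_hz) / ((n_ct + latency) * n_iterations)

# 2304-bit, rate-1/2 code (z = 96), PF = 1, 226 MHz, 6 cycles of latency
print(decoder_throughput(2304, 0.5, 226e6, 96, 1, 6) / 1e6)   # -> 131.5 Mbps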
The overall architecture is shown in Figure 3.2. The architecture
has two types of processing units, CNPs and VNPs, and three types of data
storage units, CRAMs, VRAMs and INDEXROMs. The messages for the CNPs
are read from the CRAMs, computed in the CNPs and written into the VRAMs.
The messages in the same row are read simultaneously and sent to the CNPs.
The messages in a CRAM are accessed sequentially in row order and written
back in the same order. The function of a VNP is to add the column messages
along with the intrinsic message. Hence, to access the column-wise messages
from the VRAMs, the addresses of the column messages are stored in the
INDEXROMs. The column addresses are read from the INDEXROMs and sent
to the VRAM address lines to access the data for the VNPs. The updated
messages from the VNPs for the CNPs are stored in the same address locations
of the CRAMs. The processing units and data storage units are explained in
detail in the following sections.
Figure 3.2 Architecture of the parallelism enhanced Quasi-cyclic LDPC Decoder
3.2.1 Check Node Processor
The incoming messages to the CNP are in sign-magnitude form.
The most significant bit (MSB) is the sign bit, which is 1 for a negative number
and 0 for a positive number; the remaining bits give the absolute value of the
message. In the first iteration, the intrinsic values of the neighbouring variable
nodes are taken as the inputs to the CNPs. The CNP computes the check node to
variable node messages from the incoming neighbouring variable node messages
as defined in Chapter 2; the update equations are recalled here for reference as
Equations (3.2) and (3.3):
sign(L(c→v)) = ∏_{v' ∈ N(c)\v} sign(L(v'→c))          (3.2)

|L(c→v)| = min_{v' ∈ N(c)\v} |L(v'→c)|                (3.3)

where N(c)\v denotes the neighbouring variable nodes of check node c
excluding v.
As shown in Figure 3.3, the updated message from the check node C1 to the
variable node V1 is computed from the messages of the variable nodes V2 and V3.

Figure 3.3 Check to Variable node update
The architecture of the CNP is shown in Figure 3.4. The CNP has
three parts: 1. the comparator logic, which computes the minimum and
sub-minimum values from the absolute values of the neighbouring variable node
messages together with the index of the minimum value; 2. the XOR logic, which
finds the product of the signs of all the incoming messages; and 3. the
distributor, which distributes the computed messages to all the neighbouring
variable nodes. The inputs to the distributor are the minimum and sub-minimum
values, the index of the minimum value, the product of the signs of all the
messages and the signs of all the incoming messages. The sub-minimum value is
used as the magnitude of the updating message of a neighbouring variable node
when the index of the minimum equals the index of that neighbouring node; all
other neighbouring nodes are assigned the minimum value. The sign of an
outgoing message is obtained by an XOR operation between the sign of the
corresponding incoming message and the product of the signs of all the messages.
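
The three parts can be modelled behaviourally in a few lines of Python; the
sketch below operates on (sign, magnitude) pairs and is an illustration of the
computation, not the RTL of the CNP.

def cnp_update(inputs):
    """Behavioural CNP model: inputs is a list of (sign, magnitude) pairs
    from the neighbouring variable nodes; returns one pair per node."""
    signs = [s for s, _ in inputs]
    mags = [m for _, m in inputs]
    # comparator logic: minimum, sub-minimum and index of the minimum
    min_index = min(range(len(mags)), key=lambda i: mags[i])
    minimum = mags[min_index]
    sub_minimum = min(m for i, m in enumerate(mags) if i != min_index)
    # XOR logic: product of the signs of all incoming messages
    sign_product = 0
    for s in signs:
        sign_product ^= s
    # distributor: sub-minimum goes to the node holding the minimum,
    # the minimum to every other node; each output sign excludes its own input
    return [(sign_product ^ s,
             sub_minimum if i == min_index else minimum)
            for i, s in enumerate(signs)]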
The outgoing messages from the CNPs are to be processed by the VNPs,
which add the updated messages in each column. To facilitate the addition
operation, the outgoing messages from the CNPs are converted into two's
complement form.
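
A behavioural sketch of this conversion, assuming the 6-bit quantization
(one sign bit and five magnitude bits) used throughout the chapter, is given
below.

WIDTH = 6                                # 6-bit quantized messages

def sm_to_twos_complement(sign, magnitude):
    """Sign-magnitude (sign bit, 5-bit magnitude) to two's complement."""
    value = -magnitude if sign else magnitude
    return value & ((1 << WIDTH) - 1)

def twos_complement_to_sm(value):
    """Two's complement back to sign-magnitude, used on the VNP outputs."""
    if value & (1 << (WIDTH - 1)):       # MSB set: negative number
        return 1, (1 << WIDTH) - value
    return 0, value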
Figure 3.4a Minimum value finder with 6 inputs (LOW5 is the minimum value
among the 6 inputs)

Figure 3.4b Sub-minimum value finder with 6 inputs
Figure 3.4c XOR Logic

Figure 3.4d Distributor
The comparator uses pre-computation logic. It has two
comparator-and-swap modules, CS-A and CS-B. CS-A compares and swaps the two
inputs based on the MSBs of the absolute values of the two inputs, while CS-B
compares and swaps the two inputs based on the remaining bits. An XOR
operation is performed on the MSBs of the absolute values of the two messages,
and its output is used to select either CS-A or CS-B. If the XOR output is 1,
the MSBs of the two inputs differ and CS-A is selected; if it is 0, the MSBs of
the two inputs are the same and CS-B is selected. The compare-and-swap unit
has three outputs: low, high and sel. If input1 is greater than input2, the sel
output is high; otherwise it is low. The sel output is used to find the index
value of the input.
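
The following Python fragment models this pre-computation scheme
behaviourally, assuming 5-bit magnitudes (the 6-bit messages less the sign bit);
in hardware CS-A and CS-B operate in parallel and the XOR output only selects
between their results.

MAG_WIDTH = 5                            # magnitude bits of a 6-bit message

def compare_and_swap(in1, in2):
    """Pre-computation compare-and-swap (CS-A/CS-B pair).
    Returns (low, high, sel); sel is 1 when in1 > in2."""
    msb1, msb2 = in1 >> (MAG_WIDTH - 1), in2 >> (MAG_WIDTH - 1)
    rest_mask = (1 << (MAG_WIDTH - 1)) - 1
    sel_a = msb1 > msb2                              # CS-A: MSBs alone
    sel_b = (in1 & rest_mask) > (in2 & rest_mask)    # CS-B: remaining bits
    sel = sel_a if (msb1 ^ msb2) else sel_b          # XOR of MSBs selects
    low, high = (in2, in1) if sel else (in1, in2)
    return low, high, int(sel)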
The LDPC codes of the IEEE 802.16e standard are irregular codes;
they have different row weights and column weights. The block rows 1, 4, 5, 7,
8, 10, 11 and 12 have 6 non-zero circulant matrices, so the weight of these rows
is 6. The block rows 2, 3, 6 and 9 have 7 non-zero matrices, so the weight of
these rows is 7. The number of inputs to a CNP is 6 if the row weight is 6 and
7 if the row weight is 7. Hence, two different CNPs are required to compute 6
and 7 messages.
The check node computation process is pipelined as shown in Figure 3.5 to
improve the performance.
Figure 3.5. Pipelining in the Check node computation phase
3.2.2 Variable Node Processor
The variable node processor is simply an adder. As shown in Figure 3.6, the
updated message from the variable node V1 to the check node C1 is computed
from the messages of the check nodes C2 and C3.
Figure 3.6 Variable to Check node update
The VNP has two parts. The first part adds the intrinsic message of
that particular variable node and the updated messages from its neighbouring
check nodes. The second part distributes the updated messages to the
neighbouring nodes by subtracting the original input message from the total sum
obtained in the first part. The updated outgoing messages are converted into
sign-magnitude form to be used by the CNPs. Since the column weights of the
IEEE 802.16e standard irregular codes are 2, 3 and 6, three different VNPs are
used to add 3, 4 and 7 inputs respectively; they are denoted VNP3, VNP4 and
VNP7. The architectures of the VNPs are shown in Figure 3.7. The numbers of
adder stages between the input and output of VNP3, VNP4 and VNP7 are 2, 2
and 4 respectively. To reduce the critical path delay in VNP7, pipeline registers
are introduced.
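
In software terms the VNP reduces to the following sketch; two's complement
arithmetic is implicit in Python integers.

def vnp_update(check_messages, intrinsic):
    """Behavioural VNP model: form the total sum of the intrinsic LLR and
    all incoming check node messages, then derive each outgoing message by
    subtracting the corresponding original input."""
    total = intrinsic + sum(check_messages)
    return [total - m for m in check_messages]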
Figure 3.7 Variable Node Processor
The processing steps involved in the variable node processor are shown in
Figure 3.8. The column address for the VRAM is read from the INDEXROM in
the first cycle and sent as the address to the VRAM; the column messages are
read from the VRAM and sent to the VNP in the second clock cycle; and the
updated messages from the VNPs are written into the CRAM in the third cycle.
These processes are overlapped as shown in Figure 3.8 to improve the
performance by introducing pipeline stages.
Figure 3.8 Variable node processing phase
3.2.3 Memory Organization
The messages updated by the CNPs and VNPs are stored in the VRAMs
and CRAMs respectively. In the first phase, data are read from the CRAMs and
given to the CNPs for processing; the processed data are then written into the
VRAMs. After all the rows are processed in the first phase, the decoder enters
the second phase of decoding. In the second phase, the VRAM addresses are
read from the INDEXROMs, the messages for the VNPs are accessed from the
VRAMs and, after processing in the VNPs, written into the CRAMs. This forms
one iteration, and the above processes are repeated until the maximum number
of iterations is reached.
The number of memory blocks to store the messages for the CNPs and
VNPs is equal to the number of block rows. Hence, the decoder has 12 CRAMs,
12 VRAMs and 12 INDEXROMs. Each CRAM and VRAM has 6 or 7 memory
banks according to the number of non-zero matrices in each block row. The
messages are quantized to 6 bits and each location stores one 6-bit message. The
messages are stored row-wise in the memory locations, that is, the message in
the first row is stored in the first location, the message in the second row in the
second location, and so on. A circulant matrix of size 12 × 12 is shown in
Figure 3.8 and the corresponding memory organization for different parallel
factors is shown in Figure 3.9. The non-zero message in the first row is at the
fourth column position, the non-zero message in the second row is at the fifth
column position, and so on. With parallel factor PF = 1, the messages are stored
row-wise as shown in Figure 3.9 (PF = 1). If the parallel factor is 2, the
messages of rows 1 to 6 are stored in bank 1 and the messages of rows 7 to 12
are stored in bank 2, as shown in Figure 3.9 (PF = 2). If PF = 4, four messages
are packed as a single word and stored in a single location. Hence, the numbers
of locations with parallel factors 1, 2 and 4 are 12, 6 and 3 respectively.
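
The row-to-location mapping for the three parallel factors can be summarised
in a short sketch; this is a software illustration of the storage scheme of
Figure 3.9, not the memory controller itself.

def pack_rows(messages, pf):
    """Arrange the z messages of one circulant into memory banks.
    pf = 1: one bank, one message per location, in row order.
    pf = 2: two banks holding rows 1..z/2 and z/2+1..z.
    pf = 4: one bank of z/4 locations, four messages packed per word
            (location i holds rows i, i+z/4, i+2z/4, i+3z/4)."""
    z = len(messages)
    if pf == 1:
        return [messages]
    if pf == 2:
        return [messages[: z // 2], messages[z // 2 :]]
    if pf == 4:
        q = z // 4
        return [[tuple(messages[i + j * q] for j in range(4))
                 for i in range(q)]]
    raise ValueError("parallel factor must be 1, 2 or 4")

For the example circulant, packing the column indices 4, 5, ..., 12, 1, 2, 3
with pf = 4 yields the words (4, 7, 10, 1), (5, 8, 11, 2) and (6, 9, 12, 3),
matching Figure 3.9.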
Figure 3.8 Circulant matrix of size 12 × 12
PF = 1 (one bank, 12 locations; each entry is the column index of the stored
message):
location:  1  2  3  4  5  6  7   8   9  10  11  12
data:      4  5  6  7  8  9  10  11  12  1   2   3

PF = 2 (two banks of 6 locations):
location:      1   2   3   4  5  6
bank 1 data:   4   5   6   7  8  9
bank 2 data:  10  11  12   1  2  3

PF = 4 (one bank, 3 locations, four messages packed per word):
location 1: 4, 7, 10, 1
location 2: 5, 8, 11, 2
location 3: 6, 9, 12, 3

Figure 3.9 Illustration of Data storage with different parallel factors
The data are stored in the CRAMs and VRAMs in the same fashion as
illustrated in Figure 3.9. To access column-wise data from the VRAMs, the
column addresses are stored in the INDEXROMs. In the given example for
PF = 1, the column 1 message is stored in location 10, so the value 10 is stored
in location 1 of the
INDEXROM, and so on. If PF = 2, the column 1, column 2 and column 3
messages are stored in bank 2 and the column 4, column 5 and column 6
messages are stored in bank 1. Hence, a tag is attached to the column address to
identify the bank.

The contents of the INDEXROM for PF = 1 and PF = 2 are given in
Figure 3.10. In the case of PF = 1, the data are accessed in sequence from the
locations 10, 11, 12, 1, 2, ..., 9. With PF = 2, the most significant bit of each
word indicates the bank tag and the remaining bits represent the location within
bank 1 or bank 2. For the illustrated circulant matrix, with PF = 2, two VNPs
are assigned to each block column: VNPA processes columns 1 to 6 and VNPB
processes columns 7 to 12. If the tag is 1, the message from the given memory
location of bank 2 is sent to VNPA and the message from the same location of
bank 1 is sent to VNPB; if the tag is 0, the message from the given memory
location of bank 1 is sent to VNPA and the message from the same location of
bank 2 is sent to VNPB.
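
The tag-based routing can be modelled as follows; the 4-bit word width matches
the 12-row example, and in general the width grows with the circulant size z.

def fetch_for_vnps(index_word, bank1, bank2, width=4):
    """Decode one INDEXROM word for PF = 2. The MSB is the bank tag and
    the remaining bits are the location; returns the (VNPA, VNPB) messages."""
    tag = index_word >> (width - 1)
    loc = (index_word & ((1 << (width - 1)) - 1)) - 1   # locations are 1-based
    if tag:      # tag = 1: bank 2 feeds VNPA, bank 1 feeds VNPB
        return bank2[loc], bank1[loc]
    return bank1[loc], bank2[loc]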
PF = 1 (12 locations; each word is the row address of the column message):
location:  1     2     3     4     5     6    ...  11    12
data:      1010  1011  1100  0001  0010  0011 ...  1000  1001

PF = 2 (6 locations; the first bit is the bank tag, the rest the location):
location:  1     2     3     4     5     6
data:      1100  1101  1110  0001  0010  0011

Figure 3.10 Contents of INDEXROM with parallel factors 1 and 2
If PF = 4, four data are merged and stored in one location, and these
four data from the CRAM are sent to four CNPs. Each CNP receives 6 or 7
messages from the 6 or 7 CRAM banks. After being processed by the CNPs, the
data are stored in the same location of the VRAM. However, to allow
column-wise access, the data from the CNPs are shuffled by a shuffle network
before being stored in the VRAM, as shown in Figure 3.11. Of the shuffled four
data, data 1 is sent to VNPA, data 2 to VNPB, data 3 to VNPC and data 4 to
VNPD. After being processed in the VNPs, the data are reshuffled and stored in
the same location of the CRAM.

CRAM (PF = 4):             VRAM (after shuffling):
location 1: 4, 7, 10, 1    location 1: 1, 4, 7, 10
location 2: 5, 8, 11, 2    location 2: 2, 5, 8, 11
location 3: 6, 9, 12, 3    location 3: 3, 6, 9, 12

Figure 3.11 Contents of CRAM and VRAM
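
In the example of Figure 3.11 the shuffle amounts to a one-position rotation of
the packed word. A minimal sketch, assuming the rotation amount is derived
from the circulant shift, is:

def shuffle(word, amount=1):
    """Rotate the packed four-message word so the VRAM copy is
    column-ordered. For Figure 3.11, shuffle((4, 7, 10, 1)) -> (1, 4, 7, 10)."""
    return word[-amount:] + word[:-amount]

def reshuffle(word, amount=1):
    """Inverse rotation applied to the VNP outputs before writing to CRAM."""
    return word[amount:] + word[:amount]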
The total number of memory bits for various code lengths is shown in Table 3.1.
Table 3.1 Total memory Requirement for various code lengths

Code length              576                1152                2304
                   PF = 1   PF = 2    PF = 1   PF = 2    PF = 1    PF = 2
CRAM (bits)        10,944   10,944    21,888   21,888    43,776    43,776
VRAM (bits)        10,944   10,944    21,888   21,888    43,776    43,776
INDEXROM (bits)     9,360    5,184    22,464   11,232    52,436    22,176
Total              32,248   27,072    66,240   55,008   139,968   109,728
Table 3.2 Comparison of Device Utilization and throughput of Architectures
with and without pipelining for various code lengths

Code length                576               1152               2304
                      WOP      WP       WOP      WP       WOP      WP
Slice registers      13100    13925    24055    24891    45358    46194
Slices                1880     4318     1939     1983     1076     2140
LUTs                  8101     8183    12209    13440    20084    20901
BRAMs                   16       12       18       14       56       56
Memory bits             32,248            66,240            139,988
Clock cycles            72       54      144      102      288      198
Clock (MHz)            126      226      126      226      180      226
Throughput (Mbps)     50.4    120.5     50.4   127.62       72    131.5

(WOP: without pipelining; WP: with pipelining)

The hardware and throughput are compared for the code lengths 576,
864, 1152, 1728 and 2304; the comparison is given in Table 3.3.

Table 3.3 Comparison of Device Utilization and throughput of Architectures
with pipelining for various code lengths

Code length          576      864     1152     1728     2304
Slice registers    13925    19490    24891    35850    46194
Slices              4318     1982     1983     2596     2140
LUTs                8183    11463    13440    19158    20901
BRAMs                 12       14       14       18       56
Memory bits       32,248   49,680   66,240  104,976  139,988
Clock cycles          54       78      102      150      198
Clock (MHz)          226      226      226      226      226
Throughput (Mbps)  120.5   125.17   127.62      130    131.5
The hardware and throughput of the architectures using block RAMs
and distributed RAMs in the FPGA are compared in Table 3.4. The architectures
are designed for the code length of 2304 bits with code rate 0.5 and parallel
factor PF = 1. The slice utilization is higher and the block RAM usage is lower
in the architecture designed with distributed RAM. The throughput is increased
by 8% in the architecture using block RAM.
Table 3.4 Comparison of Device Utilization and throughput of Architectures
with block RAM and distributed RAM for code length 2304 bits / 0.5 code rate

                    Distributed RAM   Block RAM
Slice registers          46194           3084
Slices                    2140             -
LUTs                     20901           4603
BRAMs                       56             76
Memory bits                  139,988
Clock cycles               198            199
Clock (MHz)                226            246
Throughput (Mbps)        131.5            142
The architectures with parallel factors 1, 2 and 4, designed to
support the code length of 2304 bits with 0.5 code rate, are compared. Table 3.5
gives the comparison of hardware and throughput of the architectures with
PF = 1, 2 and 4. As the parallel factor increases, the throughput increases along
with the hardware.
Table 3.5 Comparison of Device Utilization and throughput of Architectures
with parallel factors 1, 2 and 4 designed to support code length 2304
bits / 0.5 code rate

                    PF = 1   PF = 2   PF = 4
Slice registers      46194    39913    11493
Slices                2140     2238      -
LUTs                 20901    23826    18302
BRAMs                   56       12      160
Memory bits              139,988
Clock cycles           198      102       54
Clock (MHz)            226      185      221
Throughput (Mbps)    131.5      209      472
The proposed architectures are compared with the existing
architectures in Table 3.6. The proposed architecture achieves very high
throughput when compared with the existing architectures.
Table 3.6 Comparison of Device Utilization and throughput of the proposed
architecture with existing architectures
Architecture:              Proposed | Vikram (2013) | - | Karkoot (2008)
No. of CNPs and VNPs:      48, 96 | 48, 48 | 48, 96 | 81
No. of quantization bits:  6 | 4 | 7 | 7
Code length and code rate: 2304 | 2304 | 1536, Regular | 628, 1296, 1944,
                           Regular and Irregular
Application:               WiMAX IEEE 802.16e | WiMAX IEEE 802.16e | - |
                           IEEE 802.11n
Decoding algorithm:        TPMP-Minsum | TPMP-Modified Minsum |
                           Modified Minsum | Layered MMS
Slice registers:           11493 | 2024 | 3455 | 12368
Slices:                    - | 3141 | 9881 | 11328
LUTs:                      18302 | 9547 | 18174 | 17104
BRAMs:                     160 | 87 | 66 | 87
Memory (bits):             87532 | 20736 | NA | -
Clock cycles:              54 | 78 | - | -
Clock (MHz):               221 | 144 | - | 211
Throughput (Mbps):         472 (source data rate) | 266 (source data rate) |
                           397 (data rate) | 56 and 84 (code lengths 1296
                           and 1944)
FPGA:                      Virtex V | Virtex V | - | Virtex 4