CHAPTER 3
PARTIALLY PARALLEL LDPC DECODER ARCHITECTURE WITH
SCALABLE THROUGHPUT
3.1 INTRODUCTION
A high-throughput, parallelism-enhanced quasi-cyclic LDPC decoder
architecture is presented in this chapter. The decoder employs a partly parallel
decoding scheme and supports the fixed code rate of 1/2 with 19 different code
lengths of the IEEE 802.16e standard irregular LDPC codes. The main constraints
in the implementation of LDPC decoder architectures are interconnect complexity
and storage requirements. Fully parallel architectures achieve high throughput but
suffer from routing congestion for longer codes. Partly parallel architectures reduce
the routing complexity at the cost of throughput. In this architecture, the
parallelism is enhanced through a parallel factor (PF), which takes the values 1, 2
and 4, so that the throughput scales with the chosen parallel factor. The throughput
of the decoder increases with enhanced parallelism at the cost of hardware. The
performance and hardware utilization are compared with similar architectures.
3.2 ARCHITECTURE OF THE ENHANCED PARALLELISM DECODER
The overall architecture is explained in this section. First, the flow of
the decoding process and the mapping of the decoding algorithm onto it are
described. The decoding procedure involves two phases of operation: 1. check
node computation in the first phase and 2. variable node update in the second
phase. The two phases of operation together constitute one iteration. After the
defined maximum number of iterations is completed, the decoded code bits are
obtained from the check sums. The processing flow is illustrated with the flow
diagram shown in Figure 3.1.
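
The flow can also be expressed in behavioural form. The sketch below is a
minimal Python model of the flooding (TPMP) schedule of Figure 3.1, not the
hardware itself; the parity check matrix H is represented simply as a list of
checks, each listing its variable node indices.

# Behavioural Python model of the two-phase (flooding) decoding flow of
# Figure 3.1. H is a list of checks, each a list of variable indices.
def decode(H, llr, max_iterations=10):
    # initialisation: every variable-to-check message is the intrinsic LLR
    v2c = {(c, v): llr[v] for c, row in enumerate(H) for v in row}
    c2v = {}
    bits = [int(x < 0) for x in llr]
    for _ in range(max_iterations):
        # phase 1: check node update (min-sum, as in Equations (3.2)-(3.3))
        for c, row in enumerate(H):
            for v in row:
                others = [v2c[(c, u)] for u in row if u != v]
                sign = -1 if sum(x < 0 for x in others) % 2 else 1
                c2v[(c, v)] = sign * min(abs(x) for x in others)
        # phase 2: variable node update (intrinsic plus extrinsic sums)
        for c, row in enumerate(H):
            for v in row:
                v2c[(c, v)] = llr[v] + sum(
                    c2v[(d, v)] for d, r in enumerate(H) if v in r and d != c)
        # hard decision and check-sum test after the iteration
        bits = [int(llr[v] + sum(c2v[(d, v)]
                for d, r in enumerate(H) if v in r) < 0)
                for v in range(len(llr))]
        if all(sum(bits[v] for v in row) % 2 == 0 for row in H):
            break
    return bits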
Figure 3.1 Flow chart illustrating the two-phase decoding process
The two phases of operation are carried out by check node
processors (CNPs) and variable node processors (VNPs) respectively. The numbers
of check node and variable node processors are decided by the parallel factor and
the size of the base matrix. The base matrix size of the IEEE 802.16e standard for
code rate 1/2 is 12 × 24. The number of check node processors is 12 × PF
and the number of variable node processors is 24 × PF. A CNP computes z/PF
rows sequentially; hence, 12 × PF CNPs compute 12 × z rows in z/PF clock
cycles. A VNP updates z/PF columns sequentially; hence, 24 × PF VNPs
update 24 × z columns in z/PF cycles. The total number of clock cycles
required to complete these two phases is 2(z/PF). The decoding throughput is
given by Equation (3.1):
Throughput = (N × R × f) / ((N_CT + l) × N_it)          (3.1)

where N is the code length,
R is the code rate,
f is the synthesized frequency,
N_CT is the total number of clock cycles, given by 2(z/PF),
l is the latency due to pipelining, and
N_it is the number of iterations, which is set at 10 based on the MATLAB
simulation results.
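
As a numerical check on Equation (3.1), the short routine below evaluates the
throughput for given code parameters. With N = 2304, R = 0.5, z = 96, PF = 1
and f = 226 MHz, a pipeline latency of 6 cycles makes N_CT + l = 198, the cycle
count reported later in Table 3.2, and reproduces the 131.5 Mbps figure; the
latency value is inferred from that cycle count rather than stated explicitly here.

def decoder_throughput(n, rate, f_hz, z, pf, latency, n_iterations=10):
    """Equation (3.1): (N * R * f) / ((N_CT + l) * N_it), N_CT = 2 * z / PF."""
    n_ct = 2 * (z // pf)                 # clock cycles for the two phases
    return (n * rate * f_hz) / ((n_ct + latency) * n_iterations)

# 2304-bit, rate-1/2 code (z = 96), PF = 1, 226 MHz, 6 cycles of latency
print(decoder_throughput(2304, 0.5, 226e6, 96, 1, 6) / 1e6)   # -> 131.5 Mbps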
The overall architecture is shown in Figure 3.2. The architecture
has two types of processing units, CNPs and VNPs, and three types of data
storage units, CRAMs, VRAMs and INDEXROMs. The messages for the CNPs
are read from the CRAMs, computed in the CNPs and written into the VRAMs.
The messages in the same row are read simultaneously and sent to the CNPs.
The messages in a CRAM are accessed sequentially in row order and written
back in the same order. The function of a VNP is to add the column messages
along with the intrinsic message. Hence, to access the column-wise messages
from the VRAMs, the addresses of the column messages are stored in the
INDEXROMs. The column addresses are read from the INDEXROMs and sent
to the VRAM address lines to access the data for the VNPs. The updated
messages from the VNPs for the CNPs are stored in the same address locations
of the CRAMs. The processing units and data storage units are explained in
detail in the following sections.
Figure 3.2 Architecture of the parallelism enhanced Quasi-cyclic LDPC Decoder
3.2.1 Check Node Processor
The incoming messages to the CNP are in sign-magnitude form.
The most significant bit (MSB) is the sign bit, which is 1 for a negative number
and 0 for a positive number; the remaining bits give the absolute value of the
message. In the first iteration, the intrinsic values of the neighbouring variable
nodes are taken as the inputs to the CNPs. The CNP computes the check node to
variable node messages from the incoming neighbouring variable node messages
as defined in Chapter 2; the update equations are recalled here for reference as
Equations (3.2) and (3.3):
sign(L(c→v)) = ∏_{v' ∈ N(c)\v} sign(L(v'→c))          (3.2)

|L(c→v)| = min_{v' ∈ N(c)\v} |L(v'→c)|                (3.3)

where N(c)\v denotes the neighbouring variable nodes of check node c
excluding v.
As shown in Figure 3.3, the updated message from the check node C1 to the
variable node V1 is computed from the messages of the variable nodes V2 and V3.

Figure 3.3 Check to Variable node update
The architecture of the CNP is shown in Figure 3.4. The CNP has
three parts: 1. the comparator logic, which computes the minimum and
sub-minimum values from the absolute values of the neighbouring variable node
messages together with the index of the minimum value; 2. the XOR logic, which
finds the product of the signs of all the incoming messages; and 3. the
distributor, which distributes the computed messages to all the neighbouring
variable nodes. The inputs to the distributor are the minimum and sub-minimum
values, the index of the minimum value, the product of the signs of all the
messages and the signs of all the incoming messages. The sub-minimum value is
used as the magnitude of the updating message of a neighbouring variable node
when the index of the minimum equals the index of that neighbouring node; all
other neighbouring nodes are assigned the minimum value. The sign of an
outgoing message is obtained by an XOR operation between the sign of the
corresponding incoming message and the product of the signs of all the messages.
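
The three parts can be modelled behaviourally in a few lines of Python; the
sketch below operates on (sign, magnitude) pairs and is an illustration of the
computation, not the RTL of the CNP.

def cnp_update(inputs):
    """Behavioural CNP model: inputs is a list of (sign, magnitude) pairs
    from the neighbouring variable nodes; returns one pair per node."""
    signs = [s for s, _ in inputs]
    mags = [m for _, m in inputs]
    # comparator logic: minimum, sub-minimum and index of the minimum
    min_index = min(range(len(mags)), key=lambda i: mags[i])
    minimum = mags[min_index]
    sub_minimum = min(m for i, m in enumerate(mags) if i != min_index)
    # XOR logic: product of the signs of all incoming messages
    sign_product = 0
    for s in signs:
        sign_product ^= s
    # distributor: sub-minimum goes to the node holding the minimum,
    # the minimum to every other node; each output sign excludes its own input
    return [(sign_product ^ s,
             sub_minimum if i == min_index else minimum)
            for i, s in enumerate(signs)]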
The outgoing messages from the CNPs are to be processed by the VNPs,
which add the updated messages in each column. To facilitate the addition
operation, the outgoing messages from the CNPs are converted into two's
complement form.
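
A behavioural sketch of this conversion, assuming the 6-bit quantization
(one sign bit and five magnitude bits) used throughout the chapter, is given
below.

WIDTH = 6                                # 6-bit quantized messages

def sm_to_twos_complement(sign, magnitude):
    """Sign-magnitude (sign bit, 5-bit magnitude) to two's complement."""
    value = -magnitude if sign else magnitude
    return value & ((1 << WIDTH) - 1)

def twos_complement_to_sm(value):
    """Two's complement back to sign-magnitude, used on the VNP outputs."""
    if value & (1 << (WIDTH - 1)):       # MSB set: negative number
        return 1, (1 << WIDTH) - value
    return 0, value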
Figure 3.4a Minimum value finder with 6 inputs (LOW5 is the minimum value
among the 6 inputs)

Figure 3.4b Sub-minimum value finder with 6 inputs
Figure 3.4c XOR Logic

Figure 3.4d Distributor
The comparator uses pre-computation logic. It has two
comparator-and-swap modules, CS-A and CS-B. CS-A compares and swaps the two
inputs based on the MSBs of the absolute values of the two inputs, while CS-B
compares and swaps the two inputs based on the remaining bits. An XOR
operation is performed on the MSBs of the absolute values of the two messages,
and its output is used to select either CS-A or CS-B. If the XOR output is 1,
the MSBs of the two inputs differ and CS-A is selected; if it is 0, the MSBs of
the two inputs are the same and CS-B is selected. The compare-and-swap unit
has three outputs: low, high and sel. If input1 is greater than input2, the sel
output is high; otherwise it is low. The sel output is used to find the index
value of the input.
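
The following Python fragment models this pre-computation scheme
behaviourally, assuming 5-bit magnitudes (the 6-bit messages less the sign bit);
in hardware CS-A and CS-B operate in parallel and the XOR output only selects
between their results.

MAG_WIDTH = 5                            # magnitude bits of a 6-bit message

def compare_and_swap(in1, in2):
    """Pre-computation compare-and-swap (CS-A/CS-B pair).
    Returns (low, high, sel); sel is 1 when in1 > in2."""
    msb1, msb2 = in1 >> (MAG_WIDTH - 1), in2 >> (MAG_WIDTH - 1)
    rest_mask = (1 << (MAG_WIDTH - 1)) - 1
    sel_a = msb1 > msb2                              # CS-A: MSBs alone
    sel_b = (in1 & rest_mask) > (in2 & rest_mask)    # CS-B: remaining bits
    sel = sel_a if (msb1 ^ msb2) else sel_b          # XOR of MSBs selects
    low, high = (in2, in1) if sel else (in1, in2)
    return low, high, int(sel)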
The LDPC codes of the IEEE 802.16e standard are irregular codes;
they have different row weights and column weights. The block rows 1, 4, 5, 7,
8, 10, 11 and 12 have 6 non-zero circulant matrices, so the weight of these rows
is 6. The block rows 2, 3, 6 and 9 have 7 non-zero matrices, so the weight of
these rows is 7. The number of inputs to a CNP is 6 if the row weight is 6 and
7 if the row weight is 7. Hence, two different CNPs are required to compute 6
and 7 messages.
The check node computation process is pipelined as shown in Figure 3.5 to
improve the performance.
Figure 3.5. Pipelining in the Check node computation phase
3.2.2 Variable Node Processor
The variable node processor is simply an adder. As shown in Figure 3.6, the
updated message from the variable node V1 to the check node C1 is computed
from the messages of the check nodes C2 and C3.
Figure 3.6 Variable to Check node update
The VNP has two parts. The first part adds the intrinsic message of
that particular variable node and the updated messages from its neighbouring
check nodes. The second part distributes the updated messages to the
neighbouring nodes by subtracting the original input message from the total sum
obtained in the first part. The updated outgoing messages are converted into
sign-magnitude form to be used by the CNPs. Since the column weights of the
IEEE 802.16e standard irregular codes are 2, 3 and 6, three different VNPs are
used to add 3, 4 and 7 inputs respectively; they are denoted VNP3, VNP4 and
VNP7. The architectures of the VNPs are shown in Figure 3.7. The numbers of
adder stages between the input and output of VNP3, VNP4 and VNP7 are 2, 2
and 4 respectively. To reduce the critical path delay in VNP7, pipeline registers
are introduced.
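
In software terms the VNP reduces to the following sketch; two's complement
arithmetic is implicit in Python integers.

def vnp_update(check_messages, intrinsic):
    """Behavioural VNP model: form the total sum of the intrinsic LLR and
    all incoming check node messages, then derive each outgoing message by
    subtracting the corresponding original input."""
    total = intrinsic + sum(check_messages)
    return [total - m for m in check_messages]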
Figure 3.7 Variable Node Processor
The processing steps involved in the variable node processor are shown in
Figure 3.8. The column address for the VRAM is read from the INDEXROM in
the first cycle and sent as the address to the VRAM; the column messages are
read from the VRAM and sent to the VNP in the second clock cycle; and the
updated messages from the VNPs are written into the CRAM in the third cycle.
These processes are overlapped as shown in Figure 3.8 to improve the
performance by introducing pipeline stages.
Figure 3.8 Variable node processing phase
3.2.3 Memory Organization
The messages updated by the CNPs and VNPs are stored in the VRAMs
and CRAMs respectively. In the first phase, data are read from the CRAMs and
given to the CNPs for processing; the processed data are then written into the
VRAMs. After all the rows are processed in the first phase, the decoder enters
the second phase of decoding. In the second phase, the VRAM addresses are
read from the INDEXROMs, the messages for the VNPs are accessed from the
VRAMs and, after processing in the VNPs, written into the CRAMs. This forms
one iteration, and the above processes are repeated until the maximum number
of iterations is reached.
The number of memory blocks to store the messages for the CNPs and
VNPs is equal to the number of block rows. Hence, the decoder has 12 CRAMs,
12 VRAMs and 12 INDEXROMs. Each CRAM and VRAM has 6 or 7 memory
banks according to the number of non-zero matrices in each block row. The
messages are quantized to 6 bits and each location stores one 6-bit message. The
messages are stored row-wise in the memory locations, that is, the message in
the first row is stored in the first location, the message in the second row in the
second location, and so on. A circulant matrix of size 12 × 12 is shown in
Figure 3.8 and the corresponding memory organization for different parallel
factors is shown in Figure 3.9. The non-zero message in the first row is at the
fourth column position, the non-zero message in the second row is at the fifth
column position, and so on. With parallel factor PF = 1, the messages are stored
row-wise as shown in Figure 3.9 (PF = 1). If the parallel factor is 2, the
messages of rows 1 to 6 are stored in bank 1 and the messages of rows 7 to 12
are stored in bank 2, as shown in Figure 3.9 (PF = 2). If PF = 4, four messages
are packed as a single word and stored in a single location. Hence, the numbers
of locations with parallel factors 1, 2 and 4 are 12, 6 and 3 respectively.
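
The row-to-location mapping for the three parallel factors can be summarised
in a short sketch; this is a software illustration of the storage scheme of
Figure 3.9, not the memory controller itself.

def pack_rows(messages, pf):
    """Arrange the z messages of one circulant into memory banks.
    pf = 1: one bank, one message per location, in row order.
    pf = 2: two banks holding rows 1..z/2 and z/2+1..z.
    pf = 4: one bank of z/4 locations, four messages packed per word
            (location i holds rows i, i+z/4, i+2z/4, i+3z/4)."""
    z = len(messages)
    if pf == 1:
        return [messages]
    if pf == 2:
        return [messages[: z // 2], messages[z // 2 :]]
    if pf == 4:
        q = z // 4
        return [[tuple(messages[i + j * q] for j in range(4))
                 for i in range(q)]]
    raise ValueError("parallel factor must be 1, 2 or 4")

For the example circulant, packing the column indices 4, 5, ..., 12, 1, 2, 3
with pf = 4 yields the words (4, 7, 10, 1), (5, 8, 11, 2) and (6, 9, 12, 3),
matching Figure 3.9.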
Figure 3.8 Circulant matrix of size 12 × 12
PF = 1 (one bank, 12 locations; each entry is the column index of the stored
message):
location:  1  2  3  4  5  6  7   8   9  10  11  12
data:      4  5  6  7  8  9  10  11  12  1   2   3

PF = 2 (two banks of 6 locations):
location:      1   2   3   4  5  6
bank 1 data:   4   5   6   7  8  9
bank 2 data:  10  11  12   1  2  3

PF = 4 (one bank, 3 locations, four messages packed per word):
location 1: 4, 7, 10, 1
location 2: 5, 8, 11, 2
location 3: 6, 9, 12, 3

Figure 3.9 Illustration of Data storage with different parallel factors
The data are stored in the CRAMs and VRAMs in the same fashion as
illustrated in Figure 3.9. To access column-wise data from the VRAMs, the
column addresses are stored in the INDEXROMs. In the given example for
PF = 1, the column 1 message is stored in location 10, so the value 10 is stored
in location 1 of the
INDEXROM, and so on. If PF = 2, the column 1, column 2 and column 3
messages are stored in bank 2 and the column 4, column 5 and column 6
messages are stored in bank 1. Hence, a tag is attached to the column address to
identify the bank.

The contents of the INDEXROM for PF = 1 and PF = 2 are given in
Figure 3.10. In the case of PF = 1, the data are accessed in sequence from the
locations 10, 11, 12, 1, 2, ..., 9. With PF = 2, the most significant bit of each
word indicates the bank tag and the remaining bits represent the location within
bank 1 or bank 2. For the illustrated circulant matrix, with PF = 2, two VNPs
are assigned to each block column: VNPA processes columns 1 to 6 and VNPB
processes columns 7 to 12. If the tag is 1, the message from the given memory
location of bank 2 is sent to VNPA and the message from the same location of
bank 1 is sent to VNPB; if the tag is 0, the message from the given memory
location of bank 1 is sent to VNPA and the message from the same location of
bank 2 is sent to VNPB.
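
The tag-based routing can be modelled as follows; the 4-bit word width matches
the 12-row example, and in general the width grows with the circulant size z.

def fetch_for_vnps(index_word, bank1, bank2, width=4):
    """Decode one INDEXROM word for PF = 2. The MSB is the bank tag and
    the remaining bits are the location; returns the (VNPA, VNPB) messages."""
    tag = index_word >> (width - 1)
    loc = (index_word & ((1 << (width - 1)) - 1)) - 1   # locations are 1-based
    if tag:      # tag = 1: bank 2 feeds VNPA, bank 1 feeds VNPB
        return bank2[loc], bank1[loc]
    return bank1[loc], bank2[loc]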
PF = 1 (12 locations; each word is the row address of the column message):
location:  1     2     3     4     5     6    ...  11    12
data:      1010  1011  1100  0001  0010  0011 ...  1000  1001

PF = 2 (6 locations; the first bit is the bank tag, the rest the location):
location:  1     2     3     4     5     6
data:      1100  1101  1110  0001  0010  0011

Figure 3.10 Contents of INDEXROM with parallel factors 1 and 2
If PF = 4, four data are merged and stored in one location, and these
four data from the CRAM are sent to four CNPs. Each CNP receives 6 or 7
messages from the 6 or 7 CRAM banks. After being processed by the CNPs, the
data are stored in the same location of the VRAM. However, to allow
column-wise access, the data from the CNPs are shuffled by a shuffle network
before being stored in the VRAM, as shown in Figure 3.11. Of the shuffled four
data, data 1 is sent to VNPA, data 2 to VNPB, data 3 to VNPC and data 4 to
VNPD. After being processed in the VNPs, the data are reshuffled and stored in
the same location of the CRAM.

CRAM (PF = 4):             VRAM (after shuffling):
location 1: 4, 7, 10, 1    location 1: 1, 4, 7, 10
location 2: 5, 8, 11, 2    location 2: 2, 5, 8, 11
location 3: 6, 9, 12, 3    location 3: 3, 6, 9, 12

Figure 3.11 Contents of CRAM and VRAM
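
In the example of Figure 3.11 the shuffle amounts to a one-position rotation of
the packed word. A minimal sketch, assuming the rotation amount is derived
from the circulant shift, is:

def shuffle(word, amount=1):
    """Rotate the packed four-message word so the VRAM copy is
    column-ordered. For Figure 3.11, shuffle((4, 7, 10, 1)) -> (1, 4, 7, 10)."""
    return word[-amount:] + word[:-amount]

def reshuffle(word, amount=1):
    """Inverse rotation applied to the VNP outputs before writing to CRAM."""
    return word[amount:] + word[:amount]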
The total number of memory bits for various code lengths is shown in Table 3.1.
Table 3.1 Total memory Requirement for various code lengths

Code length              576                1152                2304
                   PF = 1   PF = 2    PF = 1   PF = 2    PF = 1    PF = 2
CRAM (bits)        10,944   10,944    21,888   21,888    43,776    43,776
VRAM (bits)        10,944   10,944    21,888   21,888    43,776    43,776
INDEXROM (bits)     9,360    5,184    22,464   11,232    52,436    22,176
Total              32,248   27,072    66,240   55,008   139,968   109,728
Table 3.2 Comparison of Device Utilization and throughput of Architectures
with and without pipelining for various code lengths

Code length                576               1152               2304
                      WOP      WP       WOP      WP       WOP      WP
Slice registers      13100    13925    24055    24891    45358    46194
Slices                1880     4318     1939     1983     1076     2140
LUTs                  8101     8183    12209    13440    20084    20901
BRAMs                   16       12       18       14       56       56
Memory bits             32,248            66,240            139,988
Clock cycles            72       54      144      102      288      198
Clock (MHz)            126      226      126      226      180      226
Throughput (Mbps)     50.4    120.5     50.4   127.62       72    131.5

(WOP: without pipelining; WP: with pipelining)

The hardware and throughput are compared for the code lengths 576,
864, 1152, 1728 and 2304; the comparison is given in Table 3.3.

Table 3.3 Comparison of Device Utilization and throughput of Architectures
with pipelining for various code lengths

Code length          576      864     1152     1728     2304
Slice registers    13925    19490    24891    35850    46194
Slices              4318     1982     1983     2596     2140
LUTs                8183    11463    13440    19158    20901
BRAMs                 12       14       14       18       56
Memory bits       32,248   49,680   66,240  104,976  139,988
Clock cycles          54       78      102      150      198
Clock (MHz)          226      226      226      226      226
Throughput (Mbps)  120.5   125.17   127.62      130    131.5
The hardware and throughput of the architectures using block RAMs
and distributed RAMs in the FPGA are compared in Table 3.4. The architectures
are designed for the code length of 2304 bits with code rate 0.5 and parallel
factor PF = 1. The slice utilization is higher and the block RAM usage is lower
in the architecture designed with distributed RAM. The throughput is increased
by 8% in the architecture using block RAM.
Table 3.4 Comparison of Device Utilization and throughput of Architectures
with block RAM and distributed RAM for code length 2304 bits / 0.5 code rate

                    Distributed RAM   Block RAM
Slice registers          46194           3084
Slices                    2140             -
LUTs                     20901           4603
BRAMs                       56             76
Memory bits                  139,988
Clock cycles               198            199
Clock (MHz)                226            246
Throughput (Mbps)        131.5            142
The architectures with parallel factors 1, 2 and 4, designed to
support the code length of 2304 bits with 0.5 code rate, are compared. Table 3.5
gives the comparison of hardware and throughput of the architectures with
PF = 1, 2 and 4. As the parallel factor increases, the throughput increases along
with the hardware.
Table 3.5 Comparison of Device Utilization and throughput of Architectures
with parallel factors 1, 2 and 4 designed to support code length 2304
bits / 0.5 code rate

                    PF = 1   PF = 2   PF = 4
Slice registers      46194    39913    11493
Slices                2140     2238      -
LUTs                 20901    23826    18302
BRAMs                   56       12      160
Memory bits              139,988
Clock cycles           198      102       54
Clock (MHz)            226      185      221
Throughput (Mbps)    131.5      209      472
The proposed architectures are compared with the existing
architectures in Table 3.6. The proposed architecture achieves very high
throughput when compared with the existing architectures.
Table 3.6 Comparison of Device Utilization and throughput of the proposed
architecture with existing architectures
Architecture:              Proposed | Vikram (2013) | - | Karkoot (2008)
No. of CNPs and VNPs:      48, 96 | 48, 48 | 48, 96 | 81
No. of quantization bits:  6 | 4 | 7 | 7
Code length and code rate: 2304 | 2304 | 1536, Regular | 628, 1296, 1944,
                           Regular and Irregular
Application:               WiMAX IEEE 802.16e | WiMAX IEEE 802.16e | - |
                           IEEE 802.11n
Decoding algorithm:        TPMP-Minsum | TPMP-Modified Minsum |
                           Modified Minsum | Layered MMS
Slice registers:           11493 | 2024 | 3455 | 12368
Slices:                    - | 3141 | 9881 | 11328
LUTs:                      18302 | 9547 | 18174 | 17104
BRAMs:                     160 | 87 | 66 | 87
Memory (bits):             87532 | 20736 | NA | -
Clock cycles:              54 | 78 | - | -
Clock (MHz):               221 | 144 | - | 211
Throughput (Mbps):         472 (source data rate) | 266 (source data rate) |
                           397 (data rate) | 56 and 84 (code lengths 1296
                           and 1944)
FPGA:                      Virtex V | Virtex V | - | Virtex 4