Tinoosh Mohsenin and Bevan M. Baas

VLSI Computation Lab, ECE Department, University of California, Davis

Split-Row: A Reduced Complexity, High Throughput Low Density Parity Check (LDPC) Decoder Architecture

Outline

Introduction to LDPC Codes
Split-Row Decoder Algorithm
Error Performance Comparison
Decoder Implementation Results
Conclusion

Error Correction in Communication Systems

Error correction is widely used in most communication systems.

[Figure: block diagram of a communication link. Binary information → Encoder (redundancy added) → encoded information → channel with noise → noisy information → Decoder (error detection and correction) → corrected information.]

LDPC Codes Applications

Standards:
  10 Gigabit Ethernet (10GBASE-T): 2006
  Digital Video Broadcasting (DVB-S2): 2005
  Next generation of WiFi and WiMAX

Problems with current LDPC decoders:
  Insufficient memory bandwidth
  High interconnect complexity

[www.ieee802.org/3/an/]

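LDPC Coding

[Figure: image transmission example. Transmitter: the image is LDPC encoded (encoded image plus parity bits) and sent over a noisy channel. Receiver: the received image is decoded iteratively; the decoded image is shown after iteration 1 and after iteration 14. Modified images from [MacKay 2001].]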

LDPC Decoding: Message Passing Algorithm

Performs row and column operations iteratively.

Parity check matrix H (row processing works along the rows, column processing along the columns):

    1 0 0 0 0 1 0 1 0
    0 1 0 1 0 0 0 0 1
    0 0 1 0 1 0 1 0 0
    0 0 1 1 0 0 0 1 0
    1 0 0 0 1 0 0 0 1
    0 1 0 0 0 1 1 0 0

[Figure: decoding flow. Received information from channel → row processing (output α) → column processing (output β) → error correction → parity check; the stages repeat for each iteration.]

Serial Decoders

One or a few row and column processing units.

Features:
  Simple
  Small area
  Small number of memories

Disadvantages:
  Low memory bandwidth
  Low throughput: 100 Kbps to 10 Mbps

[Figure: a single memory (Mem) shared by a row processor (Row) and a column processor (Col).]

Full Parallel Decoders

Row and column processors are directly mapped according to the parity check matrix

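High throughput

Disadvantages:
  Large circuit area
  High interconnect complexity

Example: 2048-bit 10GBASE-T code
  Row weight = 32, column weight = 6, 5 quantization bits per message
  Wires from row processors: 5 x 384 x 32 = 61440
  Wires from column processors: 5 x 2048 x 6 = 61440
  139 mm2 in 0.18 µm CMOS
  122,000 long inter-processor wires
  1.3 Gbps

[Figure: 384 row processors (Row1, Row2, ..., Row384) wired to 2048 column processors (Col1, Col2, Col3, ..., Col2048) according to the parity check matrix.]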
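The wire counts in the 10GBASE-T example can be reproduced with a few lines of arithmetic. This is a minimal sketch; the function name `message_wires` is made up for illustration, and it assumes every message travels on its own set of wires, one wire per quantization bit.

```python
def message_wires(num_processors, weight, bits_per_message):
    """Wires leaving one processor type: one message per H entry, one wire per bit."""
    return num_processors * weight * bits_per_message


ROWS, COLS = 384, 2048          # row and column processors in the example
ROW_WEIGHT, COL_WEIGHT = 32, 6  # ones per row and per column of H
BITS = 5                        # quantization bits per message

# Both processor types serve the same set of ones in H.
ones_in_h = ROWS * ROW_WEIGHT

row_out = message_wires(ROWS, ROW_WEIGHT, BITS)
col_out = message_wires(COLS, COL_WEIGHT, BITS)
total = row_out + col_out

print(row_out, col_out, total)
```

The two directions together give 122,880 wires, which is in line with the roughly 122,000 long inter-processor wires quoted for this decoder.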


Key Features of Split-Row Decoder

Row processing (dominates decoder complexity):
  Increased parallelism
  Reduced number of memory accesses
  Reduced processor complexity

Results:
  Smaller decoder area and higher utilization
  Lower interconnect complexity
  Higher throughput
  Simpler hardware implementation

Standard vs. Split-Row Decoder

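[Figure: Standard decoder: one memory (Mem) with row (Row) and column (Col) processors operating on all N columns, row weight = Wr. Split-Row decoder: two halves, each with its own memory (MemA, MemB), row processor (RowA, RowB), and column processor (ColA, ColB), operating on N/2 columns with row weight = Wr/2; the halves are linked by the Sign A and Sign B signals.]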

Split-Row Algorithm-Mathematical View

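The magnitude part of the row processor output α is larger for the Split-Row decoder.

MinSum (sign and magnitude both taken over the full row):

\[
\alpha_{ij} = \prod_{j':\,h_{ij'}=1,\ j'\neq j} \operatorname{sign}(\beta_{ij'}) \times \min_{j':\,h_{ij'}=1,\ j'\neq j} |\beta_{ij'}|
\]

Split-Row (sign taken over the full row, magnitude taken only over the half of the row that contains column j, written h^{Split}):

\[
\alpha_{ij}^{Split} = \prod_{j':\,h_{ij'}=1,\ j'\neq j} \operatorname{sign}(\beta_{ij'}) \times \min_{j':\,h^{Split}_{ij'}=1,\ j'\neq j} |\beta_{ij'}|
\]

\[
|\alpha_{ij}^{Split}| \ge |\alpha_{ij}|
\]

By normalizing the α values with a scale factor S < 1, the error performance of the Split-Row decoder is improved:

\[
\alpha_{ij}^{Split} = S \times \prod_{j':\,h_{ij'}=1,\ j'\neq j} \operatorname{sign}(\beta_{ij'}) \times \min_{j':\,h^{Split}_{ij'}=1,\ j'\neq j} |\beta_{ij'}|
\]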
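The difference between the two row updates can be sketched in a few lines of Python. This is an illustration rather than the hardware implementation: the function names and the input values are made up, the row is assumed to be split into two equal halves, and the full-row sign is computed directly instead of being exchanged over Sign A and Sign B wires.

```python
import math


def sign(x):
    """Sign of a message as +1.0 or -1.0 (zero is treated as positive)."""
    return -1.0 if x < 0 else 1.0


def minsum_row(betas):
    """Standard MinSum: sign and minimum are taken over all other inputs of the row."""
    outputs = []
    for j in range(len(betas)):
        others = betas[:j] + betas[j + 1:]
        sign_product = math.prod(sign(b) for b in others)
        outputs.append(sign_product * min(abs(b) for b in others))
    return outputs


def split_row(betas, scale=1.0):
    """Split-Row: full-row sign, but the minimum uses only inputs in the same half."""
    half = len(betas) // 2
    outputs = []
    for j in range(len(betas)):
        start, stop = (0, half) if j < half else (half, len(betas))
        others = betas[:j] + betas[j + 1:]
        local = [betas[k] for k in range(start, stop) if k != j]
        sign_product = math.prod(sign(b) for b in others)
        outputs.append(scale * sign_product * min(abs(b) for b in local))
    return outputs


# Illustrative inputs for one row of weight 6 (not taken from the slides).
betas = [0.9, -2.1, 1.5, -0.3, 2.4, 1.1]
standard = minsum_row(betas)
split = split_row(betas)
split_scaled = split_row(betas, scale=0.5)

print(standard)
print(split)
print(split_scaled)
```

Every Split-Row output has the same sign as the MinSum output and a magnitude that is at least as large, because its minimum is taken over fewer inputs. The scale factor S pulls those magnitudes back down.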


Bit Error Rate Performance Comparison

Code length: 1536 bits
Message length: 1155 bits
Row weight: 16
Column weight: 4
No. of iterations: 15
MS: MinSum
MS Split-Row: MinSum Split-Row
S: Scale factor

[Figure: bit error probability (10^-8 to 10^-1, log scale) versus Eb/N0 (0 to 7 dB). Curves: MS, S=0.6; MS Split-Row, S=0.4; MS Split-Row, S=0.3; MS Split-Row, S=0.5; MS Split-Row, S=1.0. A gap of 0.6 dB is marked on the plot.]

Bit Error Rate Performance Comparison

Code length: 2048 bits
Message length: 1723 bits
Row weight: 32
Column weight: 6
No. of iterations: 15
MS: MinSum
MS Split-Row: MinSum Split-Row
S: Scale factor

[Figure: bit error probability (10^-8 to 10^-1, log scale) versus Eb/N0 (0 to 7 dB). Curves: MS, S=0.6; MS Split-Row, S=0.4; MS Split-Row, S=0.3; MS Split-Row, S=0.5; MS Split-Row, S=1.0. A gap of 0.3 dB is marked on the plot.]


A Full-Parallel Decoder Implementation

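LDPC code example:
  Code length = 1536 bits
  Message length = 770 bits
  Row weight = 6
  Column weight = 3

In the Split-Row decoder:
  The number of wires between the two halves is 3% of the total number of wires.
  Row processors in each half are 2.7 times smaller.
  Each row processor in each half is connected to only 3 column processors.

[Figure: Standard decoder: H has 768 rows and 1536 columns (row weight = 6, column weight = 3); a single Row+Col block in which each row processor (Row) connects to six column processors (Col1 ... Col6). Split-Row decoder: Row+Col Left and Row+Col Right blocks; RowA connects to Col1 ... Col3 and RowB to Col4 ... Col6, with Sign A and Sign B passed between the halves.]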
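The 3% figure can be sanity-checked with the same wire-counting approach used for the 10GBASE-T example. This is a rough sketch under two assumptions that are not stated on the slide: messages use 5 bits each, and every row contributes exactly one Sign A and one Sign B wire between the halves.

```python
ROWS, COLS = 768, 1536         # row count read from the figure labels
ROW_WEIGHT, COL_WEIGHT = 6, 3
BITS = 5                       # assumed quantization bits per message

# Message wires between row and column processors (both directions).
row_output_wires = BITS * ROWS * ROW_WEIGHT
col_output_wires = BITS * COLS * COL_WEIGHT
message_wires = row_output_wires + col_output_wires

# Assumed: one Sign A wire and one Sign B wire per row cross between the halves.
sign_wires = 2 * ROWS

fraction = sign_wires / message_wires
print(message_wires, sign_wires, round(100 * fraction, 1))
```

Under these assumptions the sign wires come to about 3.3% of the message wires, which agrees with the 3% quoted above.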

Full Parallel Decoder Architecture

0.18 µm CMOS technology, 6 metal layers

Split-Row, each half includes:
  768 row processors
  768 column processors

[Figure: chip layouts. Standard MinSum: 1536 input registers, 1536 output registers, and the processor array (labeled "1536 Row + 1536 Col Processors"); die edges labeled 4.7 mm. Split-Row: 1536 input registers, 1536 output registers, Row+Col Processors Left and Row+Col Processors Right, joined by the sign wires SignA 0 ... SignA 767 and SignB 0 ... SignB 767; die edges labeled 4.1 mm.]

Split-Row vs. Standard Decoder

1536-bit (3,6) quasi-cyclic LDPC code
Quantization is set to 5 bits per message.
For the throughput computation, the number of decoding iterations is set to 15.
Reported numbers are based on chip implementation results in 0.18 µm.

                        Avg. wire    Chip size  Clk freq.  Throughput  CAD tool P&R     CAD tool P&R
                        length (mm)  (mm2)      (MHz)      (Gbps)      run time (mins)  req. mem (GB)
Standard MinSum         0.224        22.1       32         3.2         320              3.9
Split-Row (this work)   0.142        16.8       53         5.4         193              2.3
Improvement             1.58×        1.3×       1.7×       1.7×        1.65×            1.7×
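The improvement row follows directly from the two rows of measurements. The sketch below recomputes it; the dictionary keys and the `improvement` helper are illustrative names, and the only assumption is that clock frequency and throughput improve when they grow while the other metrics improve when they shrink.

```python
standard = {"avg_wire_mm": 0.224, "chip_mm2": 22.1, "clk_mhz": 32,
            "throughput_gbps": 3.2, "pnr_minutes": 320, "pnr_mem_gb": 3.9}
split_row = {"avg_wire_mm": 0.142, "chip_mm2": 16.8, "clk_mhz": 53,
             "throughput_gbps": 5.4, "pnr_minutes": 193, "pnr_mem_gb": 2.3}

# Metrics where a larger value is the better one.
HIGHER_IS_BETTER = {"clk_mhz", "throughput_gbps"}


def improvement(metric):
    """Improvement factor of Split-Row over standard MinSum for one metric."""
    if metric in HIGHER_IS_BETTER:
        return split_row[metric] / standard[metric]
    return standard[metric] / split_row[metric]


factors = {metric: improvement(metric) for metric in standard}
for metric, factor in factors.items():
    print(f"{metric}: {factor:.2f}x")
```

The computed factors match the table to within rounding.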

Conclusion

Split-Row decoder method provides a significant reduction in circuit area

Results in:
  Reduced wire interconnect complexity
  Increased circuit area utilization
  Increased speed
  Simpler implementation

A good tradeoff between hardware complexity and error performance

Acknowledgments

Intel Corporation
UC Micro
NSF Grant No. 0430090
UCD Faculty Research Grant

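Message Passing (Row processing)

Parity check matrix H:

    1 0 0 0 0 1 0 1 0
    0 1 0 1 0 0 0 0 1
    0 0 1 0 1 0 1 0 0
    0 0 1 1 0 0 0 1 0
    1 0 0 0 1 0 0 0 1
    0 1 0 0 0 1 1 0 0

MinSum:

\[
\alpha_{ij} = \prod_{j':\,h_{ij'}=1,\ j'\neq j} \operatorname{sign}(\beta_{ij'}) \times \min_{j':\,h_{ij'}=1,\ j'\neq j} |\beta_{ij'}|
\]

[Figure: decoding flow. Initial value (received information from channel) → row processing → column processing → error correction → parity check, repeated each iteration; β enters the row processing stage and α leaves it.]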

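Message Passing (Column processing)

Parity check matrix H:

    1 0 0 0 0 1 0 1 0
    0 1 0 1 0 0 0 0 1
    0 0 1 0 1 0 1 0 0
    0 0 1 1 0 0 0 1 0
    1 0 0 0 1 0 0 0 1
    0 1 0 0 0 1 1 0 0

\[
\beta_{ij} = \lambda_j + \sum_{i':\,h_{i'j}=1,\ i'\neq i} \alpha_{i'j}
\]

λj is the received information.

[Figure: decoding flow. Initial value → row processing → column processing → error correction → parity check, repeated each iteration; α enters the column processing stage and β leaves it.]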

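Error correction

\[
V_i = \begin{cases} 1 & \text{if } y_i \le 0 \\ 0 & \text{if } y_i > 0 \end{cases}
\]

Parity check matrix H:

    1 0 0 0 0 1 0 1 0
    0 1 0 1 0 0 0 0 1
    0 0 1 0 1 0 1 0 0
    0 0 1 1 0 0 0 1 0
    1 0 0 0 1 0 0 0 1
    0 1 0 0 0 1 1 0 0

[Figure: decoding flow. Initial value → row processing → column processing → error correction → parity check. For column 1, y1 is formed from the received value λ1 and the α messages arriving at that column.]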

Parity check

    H · [V̂0 V̂1 V̂2 V̂3 V̂4 V̂5 V̂6 V̂7 V̂8]^T   = 0 (stop decoding)
                                              ≠ 0 (repeat decoding)

Parity check matrix H (columns 0 to 8):

    1 0 0 0 0 1 0 1 0
    0 1 0 1 0 0 0 0 1
    0 0 1 0 1 0 1 0 0
    0 0 1 1 0 0 0 1 0
    1 0 0 0 1 0 0 0 1
    0 1 0 0 0 1 1 0 0

[Figure: decoding flow. Initial value → row processing → column processing → error correction → parity check, repeated each iteration.]
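The four stages (row processing, column processing, error correction, parity check) can be put together in a short MinSum decoder for the 6 x 9 example matrix. This is a minimal floating-point sketch and not the chip's fixed-point datapath. The function name `minsum_decode`, the input values, and the sign convention (non-positive values decode to bit 1) are assumptions made for illustration.

```python
H = [
    [1, 0, 0, 0, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 0, 0, 0, 1],
    [0, 0, 1, 0, 1, 0, 1, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 1, 1, 0, 0],
]


def minsum_decode(H, received, max_iters=15, scale=1.0):
    """Return (hard decisions, iterations used) for received soft values."""
    rows = [[j for j, h in enumerate(row) if h] for row in H]
    # Initial value: every beta starts as the received information lambda_j.
    beta = {(i, j): received[j] for i, cols in enumerate(rows) for j in cols}
    bits = [1 if v <= 0 else 0 for v in received]
    for iteration in range(1, max_iters + 1):
        # Row processing: sign product and minimum magnitude of the other inputs.
        alpha = {}
        for i, cols in enumerate(rows):
            for j in cols:
                others = [beta[i, k] for k in cols if k != j]
                sign_product = 1.0
                for b in others:
                    sign_product *= -1.0 if b < 0 else 1.0
                alpha[i, j] = scale * sign_product * min(abs(b) for b in others)
        # Column processing: y_j sums lambda_j and all alphas of column j;
        # beta_ij leaves out the alpha that came from row i itself.
        y = list(received)
        for (i, j), a in alpha.items():
            y[j] += a
        for (i, j), a in alpha.items():
            beta[i, j] = y[j] - a
        # Error correction: hard decision on y (assumed sign convention).
        bits = [1 if v <= 0 else 0 for v in y]
        # Parity check: stop as soon as every row sums to zero modulo 2.
        if all(sum(bits[j] for j in cols) % 2 == 0 for cols in rows):
            return bits, iteration
    return bits, max_iters


# A code word of H, with bit 0 sent as +2.0 and bit 1 as -2.0 (illustrative values).
codeword = [1, 0, 1, 0, 1, 0, 0, 1, 0]
# Position 1 arrives with the wrong sign and a small magnitude.
received = [-2.0, -0.5, -2.0, 2.0, -2.0, 2.0, 2.0, -2.0, 2.0]
decoded, iterations = minsum_decode(H, received)
print(decoded, iterations)
```

In this example the weak, wrong-signed value at position 1 is outvoted by the two rows that check it, so the parity check passes after the first iteration.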

LDPC Codes

An LDPC code is defined by a binary matrix called the parity check matrix H.
Rows define parity check equations (constraints) between encoded symbols in a code word, and columns define the length of the code.
V is a valid code word if H·V^T = 0.
The decoder in the receiver checks whether the condition H·V^T = 0 holds.
Example: parity check matrix for a (9, 5) LDPC code, row weight = 3, column weight = 2 (columns 1 to 9):

    1 0 0 0 0 1 0 1 0
    0 1 0 1 0 0 0 0 1
    0 0 1 0 1 0 1 0 0
    0 0 1 1 0 0 0 1 0
    1 0 0 0 1 0 0 0 1
    0 1 0 0 0 1 1 0 0

    H · [V1 V2 V3 V4 V5 V6 V7 V8 V9]^T   ≠ 0 (there is an error)
                                         = 0 (there is no error)
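The parity check condition is easy to try out on the example matrix. The sketch below is illustrative: the `syndrome` helper and the test vector are not from the slides. Counting the ones in the matrix as printed gives 3 per row and 2 per column.

```python
H = [
    [1, 0, 0, 0, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 0, 0, 0, 1],
    [0, 0, 1, 0, 1, 0, 1, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 1, 1, 0, 0],
]


def syndrome(H, v):
    """Compute H * v^T over GF(2); an all-zero result means v is a code word."""
    return [sum(h * bit for h, bit in zip(row, v)) % 2 for row in H]


row_weights = [sum(row) for row in H]
col_weights = [sum(col) for col in zip(*H)]

valid = [1, 0, 1, 0, 1, 0, 0, 1, 0]  # satisfies all six parity checks
corrupted = valid[:]
corrupted[0] ^= 1                    # flip the first bit

print(row_weights, col_weights)
print(syndrome(H, valid))      # all zeros: no error
print(syndrome(H, corrupted))  # nonzero entries mark the failed checks
```

Flipping one bit makes exactly the two checks that include it fail, since every column of H has weight 2.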

Row and Column Processor Architecture

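[Figure: Column processor (Col. Proc.): adders combine the channel input with the incoming row messages (inputs 1 to 3) to form the outputs. Row processor (Row Proc.): the magnitudes and signs of the inputs are separated; two-input comparators (Comp, inputs in1 and in2, outputs H and L) and three-input sorters (Sort_3, inputs in1 to in3, outputs H, M, and L) reduce inputs In1 ... In6 to the two smallest magnitudes, Min1 and Min2; the sign logic uses the SignA and SignB signals. The processors are grouped into Row+Col Procs. Left and Row+Col Procs. Right.]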

[Figure: bit error probability (10^-7 to 10^-1, log scale) versus Eb/N0 (0 to 8 dB).]

Throughput = Clk × Code length / Imax

P = C f V^2

C = k ε W L / d

[Figure: wire geometry with dimensions L, W, and d.]
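As a quick check, the throughput formula can be applied to the clock frequencies from the comparison table (32 MHz and 53 MHz, 1536-bit code, 15 iterations). The function name is illustrative.

```python
def throughput_gbps(clk_mhz, code_length_bits, max_iterations):
    """Throughput = Clk * code length / Imax, converted to Gbps."""
    return clk_mhz * 1e6 * code_length_bits / max_iterations / 1e9


standard_gbps = throughput_gbps(32, 1536, 15)
split_row_gbps = throughput_gbps(53, 1536, 15)
print(f"{standard_gbps:.2f} {split_row_gbps:.2f}")
```

The results, about 3.28 Gbps and 5.43 Gbps, are consistent with the 3.2 Gbps and 5.4 Gbps reported in the table.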

What is the critical path, and how do you make sure that the sign is computed correctly?
Answer: The critical path is the sign computation, which depends on the other half of the decoder. The static timing analysis in place and route reports the slowest path delay, which ensures that the circuit works correctly.

Why does the decoder chip become smaller when it is split into halves?
Answer: First, the size and total number of column processors do not change. The main benefit comes from the row processors, each of which shrinks by more than a factor of two. The reason is that a row processor contains several stages of comparators, and the number of comparators drops by more than half when the number of inputs is halved.

You mentioned that the design is power efficient, but you did not report any power numbers.
Answer: For this paper we did not obtain power numbers, but the power can be estimated from the fact that most of the energy is consumed in the wires (P = 1/2 C f V^2). Assuming it scales down linearly, the reduction is about 58%.
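The linear-scaling estimate can be written out explicitly. The sketch below assumes that wire capacitance, and therefore wire power at a fixed frequency and voltage, is proportional to the average wire length from the comparison table; neither assumption is confirmed by measured power data.

```python
avg_wire_standard_mm = 0.224
avg_wire_split_row_mm = 0.142

# Assumed: C is proportional to wire length, so P = 1/2 C f V^2 scales the same way.
capacitance_ratio = avg_wire_split_row_mm / avg_wire_standard_mm
improvement_factor = 1 / capacitance_ratio
reduction_percent = (1 - capacitance_ratio) * 100

print(f"{improvement_factor:.2f}x shorter wires")
print(f"{reduction_percent:.1f}% lower wire capacitance than the standard decoder")
```

Under these assumptions the standard decoder's wires are about 1.58 times longer (58% more capacitance than Split-Row), which corresponds to roughly 37% less wire capacitance when measured against the standard decoder. The notes do not say which of the two readings the 58% figure refers to.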

Are there other works close to your design?

Which applications can tolerate this error performance loss?
This is a very broad question. It really depends on the power budget and how low you want to go on BER.

What is the difference between Viterbi and LDPC codes?

What is the difference between turbo and LDPC codes?

If you don't know the answer: "I was not involved in that part of the project, but from what I know ..."

Review the previous works.

If asked why the chip figure is not square?

If somebody asks: "The way you proposed did not decrease the number of wires, so how can you say that it decreases the interconnect complexity?"
You should notice that we are talking about long wires. Because when there is a large number of wires connecting one

Hard decision vs. soft:
In hard decision decoding, each received symbol is thresholded to yield a single received bit as input to the decoding algorithm, and the messages passed between variable and check nodes are single bits only.
In soft decision decoding, multiple bits are used to represent each received symbol and the messages passed between variable and check node

How did you compute
