

A Scalable Turbo Decoding Algorithm for High-Throughput Network-on-Chip Implementation

Ra’ed Al-Dujaily, An Li, Robert G. Maunder, Terrence Mak, Bashir M. Al-Hashimi, and Lajos Hanzo

Southampton Wireless, ECS, University of Southampton, Southampton SO17 1BJ, U.K.

Abstract—Wireless communication at near-capacity transmission throughputs is facilitated by employing sophisticated Error Correction Codes (ECCs), such as turbo codes. However, real-time communication at high transmission throughputs is only possible if the challenge of implementing turbo decoders having equally high processing throughputs can be overcome. Furthermore, in many applications, turbo decoders are required to have the flexibility of supporting a wide variety of turbo code parametrizations. This motivates the implementation of turbo decoders using Networks-on-Chip (NoCs), which facilitate flexible and high-throughput parallel processing. However, turbo decoders conventionally operate on the basis of the Logarithmic Bahl-Cocke-Jelinek-Raviv (Log-BCJR) algorithm, which has an inherently-serial nature, owing to its data dependencies. This limits the exploitation of the NoC's computing resources, particularly as the size of the NoC is scaled up. Motivated by this, we propose a novel turbo decoder algorithm, which eliminates the data dependencies of the Log-BCJR algorithm and therefore has an inherently-parallel nature. We show that by jointly optimizing the proposed algorithm with the NoC architecture, a significantly improved utility of the available computing resources is achieved. Owing to this, our proposed turbo decoder achieves a factor of up to 2.13 higher processing throughput than a Log-BCJR benchmarker.

Index Terms—Turbo codes, BCJR, network-on-chip, performance evaluation

LIST OF ACRONYMS

3GPP      3rd Generation Partnership Project
ASIC      Application Specific Integrated Circuit
ASIP      Application-Specific Instruction-set Processor
AWGN      Additive White Gaussian Noise
BCJR      Bahl-Cocke-Jelinek-Raviv
BER       Bit Error Rate
BPSK      Binary Phase Shift Keying
CC        Clock Cycle
DOR       Dimension-Ordered Routing
ECC       Error Correction Code
FIFO      First In First Out
FPGA      Field Programmable Gate Array
GPU       Graphical Processing Unit
IC        Integrated Circuit
IP        Intellectual Property
LDPC      Low-Density Parity-Check
LLR       Log-Likelihood Ratio
Log-BCJR  Logarithmic Bahl-Cocke-Jelinek-Raviv
LTE       Long-Term Evolution
NI        Network Interface
NoC       Network on Chip
RA        Repeat Accumulate
WiMAX     Worldwide Interoperability for Microwave Access

The financial support of the EPSRC, Swindon UK under the grants EP/J015520/1, EP/L010550/1 and the TSB, Swindon UK under the grant TS/L009390/1 is gratefully acknowledged. The research data for this paper is available at http://eprints.soton.ac.uk/397738/.

LIST OF SYMBOLS

N      Number of bits in each message frame
M      Number of trellis states
P      Number of windows in each frame
K      Number of bits in each window
I      Number of iterations
Pr     Probability
Eb/N0  Normalized signal to noise ratio per bit
Eb     Energy per bit
N0     Noise power spectral density
b1     Frame of message bits
b2     Frame of parity bits
b3     Frame of systematic bits
b̄1     Frame of received message soft bits
b̄2     Frame of received parity soft bits
b̄3     Frame of received systematic soft bits
αk     Forward state metrics
βk     Backward state metrics
γk     A priori branch metrics
δk     A posteriori branch metrics
π      Interleaver
π−1    De-interleaver
a      Superscript denoting a priori information
e      Superscript denoting extrinsic information
u      Superscript denoting upper decoder
l      Superscript denoting lower decoder
p      Superscript denoting a posteriori information
k      Subscript denoting the bit index in a frame

I. INTRODUCTION

During the last two decades, wireless communication has been revolutionized by turbo codes [1] and other iterative ECCs, such as Low-Density Parity-Check (LDPC) [2] and Repeat Accumulate (RA) codes [3]. Like these other iterative ECCs, turbo codes provide resilience to the transmission errors that are caused by noise, interference and fading during wireless communication. This is achieved by using a turbo encoder to process each frame of message bits before their transmission, and then employing a corresponding turbo decoder in the receiver to detect and correct transmission errors. In contrast to non-iterative ECCs, the error correction capability of turbo codes is so strong that they facilitate reliable communication, even when employing a transmission throughput that closely approaches the theoretical capacity of the wireless channel. However, in order to facilitate real-time communication with a high transmission throughput, the turbo encoder and decoder must be designed to have equally high processing throughputs. This is a particular challenge when designing the turbo decoder, since its complexity is typically much higher than that of the encoder, which is almost insignificant in comparison. Furthermore, in many wireless communication applications, the design of turbo decoders with high processing throughputs is complicated by the requirement to support numerous different turbo code parametrizations. For example, the turbo decoder employed by the Long-Term Evolution (LTE) standard [4] for cellular telephony is required to support 188 different parametrizations, each corresponding to a different message frame length in the range spanning from 40 to 6144 bits. Furthermore, state-of-the-art mobile devices typically support numerous wireless communication standards, each employing different turbo code parametrizations, or different iterative ECCs altogether. This motivates the employment of high-throughput Application-Specific Instruction-set Processors (ASIPs), allowing mobile devices to flexibly employ the same hardware to implement various turbo decoders and other iterative ECC decoders.

In recent years, the continued scaling of integration technologies has enabled the implementation of hundreds or even thousands of ASIPs within a single Integrated Circuit (IC), hence facilitating flexible high-throughput processing. However, to avoid a bottleneck, the structure used to interconnect these ASIPs has to be carefully designed. This becomes a particular challenge as the amount of inter-ASIP communication is increased beyond the bandwidth of classical models of on-chip interconnection, such as point-to-point and bus-based connections. This motivates a Network on Chip (NoC) architecture [5], which comprises a network of ASIPs, each having a router that is connected to those of the neighboring ASIPs. Communication among the ASIPs is achieved using packet-switching, whereby the messages are dynamically routed from the source ASIP to the destination ASIP along a path formed of interconnected routers. Recently, NoCs have been proposed [6]–[9] as the basis of multi-core ICs for implementing flexible, high-throughput turbo decoders. These efforts have focused on optimizing the NoC architecture to suit the Log-BCJR algorithm [10], [11], which is the basis of a conventional turbo decoder's operation. However, in this paper we show that the inherently-serial nature of the Log-BCJR algorithm's data dependencies results in a low utility of an NoC's computing resources, as the number of ASIPs is scaled up.

Against this background, this paper proposes a novel turbo decoding algorithm, which eliminates the data dependencies of the Log-BCJR algorithm. This grants the turbo decoder an inherently-parallel nature, which is better suited to NoC implementations. In particular, we propose a novel technique for self-regulating the exchange of information within the NoC, in order to avoid congestion. More specifically, rather than adhering to rigidly-defined schedules like that of the conventional Log-BCJR turbo decoding algorithm, the operation of each ASIP in the NoC is adapted in response to the delivery of information across the NoC. Each ASIP makes the most beneficial and timely use of the delivered information, hence maximizing the quality of the information that it generates for delivery over the NoC. The delivery of that information to the connected ASIP stimulates its operation, causing the schedule to cascade organically, with the processing stimulating the networking and the networking stimulating the processing. Compared to the classic Log-BCJR benchmarker, the proposed turbo decoder achieves a significantly improved utility of the NoC's computing resources. Owing to this, our proposed turbo decoder achieves a significantly higher processing throughput than the Log-BCJR benchmarker, particularly as the number of ASIPs in the NoC is scaled up.


The rest of this paper is organized as follows. Section II reviews our previous work [12]–[15] on the fully-parallel turbo decoder, which forms the basis of the proposed algorithm. Following this, our novel NoC-optimized turbo decoding algorithm is proposed in Section III of this paper. Furthermore, in Section IV of this paper, we propose an NoC architecture, which we jointly optimize with the proposed turbo decoding algorithm. We present our simulation results in Section V, before we offer our conclusions in Section VI.

II. BACKGROUND

In this section, we provide a background discussion on the computations performed by the fully-parallel turbo decoding algorithm of our previous work [12]–[15], which serve as the basis of the novel NoC-optimized turbo decoding algorithm of Section III. We commence by introducing our notation in Section II-A. The schematic of a turbo decoder is described in Section II-B. Following this, Section II-C describes the preprocessing invoked for initializing the turbo decoder. The proposed NoC-optimized turbo decoding algorithm of Section III operates on the basis of algorithmic blocks, which are described in Section II-D.

A. Preliminaries

In this section, we introduce the notation used throughout this paper when referring to the operation of the LTE turbo encoder and the turbo decoder. The LTE turbo encoder [4] may be employed to encode a frame [b^u_{1,k}]_{k=1}^{N} comprising N message bits, each having a binary value b^u_{1,k} ∈ {0, 1}. Here, 188 different values in the range spanning from 40 to 6144 are supported for the frame length N. The message frame [b^u_{1,k}]_{k=1}^{N} is provided to an upper convolutional encoder. In response, this produces a frame [b^u_{2,k}]_{k=1}^{N} comprising N parity bits, as well as a frame [b^u_{3,k}]_{k=1}^{N} comprising N systematic bits. In addition to this, the upper convolutional encoder produces three termination message bits [b^u_{1,k}]_{k=N+1}^{N+3}, as well as three termination parity bits [b^u_{2,k}]_{k=N+1}^{N+3}. Meanwhile, the message frame [b^u_{1,k}]_{k=1}^{N} is interleaved, in order to obtain the interleaved message frame [b^l_{1,k}]_{k=1}^{N}. This is provided to a lower convolutional encoder, which produces a frame [b^l_{2,k}]_{k=1}^{N} comprising N parity bits, as well as three termination message bits [b^l_{1,k}]_{k=N+1}^{N+3} and three termination parity bits [b^l_{2,k}]_{k=N+1}^{N+3}. Note that the lower convolutional encoder does not produce any systematic bits. The LTE turbo encoder has a coding rate of R = N/(3N + 12), since it outputs a total of (3N + 12) bits, namely [b^u_{2,k}]_{k=1}^{N}, [b^u_{3,k}]_{k=1}^{N}, [b^l_{2,k}]_{k=1}^{N}, [b^u_{1,k}]_{k=N+1}^{N+3}, [b^u_{2,k}]_{k=N+1}^{N+3}, [b^l_{1,k}]_{k=N+1}^{N+3} and [b^l_{2,k}]_{k=N+1}^{N+3}.
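As a quick check of the rate expression (a worked example added here for illustration), consider the longest LTE frame length, N = 6144:

```latex
R = \frac{N}{3N + 12} = \frac{6144}{3 \cdot 6144 + 12} = \frac{6144}{18444} \approx 0.333
```

In other words, the code rate approaches 1/3 for long frames, with the 12 termination bits contributing only a negligible rate loss.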

Throughout the remainder of this paper, the superscripts 'u' and 'l' are used only when necessary to explicitly distinguish the upper and lower components of the turbo code, but they are omitted when the discussion applies equally to both.

Following their modulation and transmission over a wireless channel, the (3N + 12) turbo-encoded bits may be demodulated and provided to the turbo decoder. Owing to the effect of noise in the wireless channel, however, the demodulator will be uncertain of the correct values of these turbo-encoded bits. Therefore, instead of providing a hard-valued decision for each bit, the demodulator provides soft-valued decisions, in the form of a priori Logarithmic Likelihood Ratios (LLRs). More specifically, the a priori LLR pertaining to the bit b_{j,k} is defined by

    b^{a}_{j,k} = \ln \frac{\Pr(b_{j,k} = 1)}{\Pr(b_{j,k} = 0)}.    (1)

In this notation, the diacritical bar indicates a soft-valued decision, while the superscript 'a' corresponds to a priori information. In the remainder of this paper, the alternative superscripts 'e' and 'p' are used to indicate extrinsic and a posteriori information, respectively.
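As an illustrative numerical example (not from the original paper), a demodulator that considers a particular bit to be '1' with probability 0.9 would output

```latex
b^{a}_{j,k} = \ln\frac{0.9}{0.1} = \ln 9 \approx 2.20
```

so the sign of an LLR conveys the more likely bit value, its magnitude conveys the confidence, and a value of 0 indicates complete uncertainty.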

B. Schematic

The algorithm proposed for the NoC implementation of the LTE turbo decoder may be characterized by the schematic of Figure 1, which is provided with (3N + 12) a priori LLRs, namely [b^{u,a}_{2,k}]_{k=1}^{N}, [b^{u,a}_{3,k}]_{k=1}^{N}, [b^{l,a}_{2,k}]_{k=1}^{N}, [b^{u,a}_{1,k}]_{k=N+1}^{N+3}, [b^{u,a}_{2,k}]_{k=N+1}^{N+3}, [b^{l,a}_{1,k}]_{k=N+1}^{N+3} and [b^{l,a}_{2,k}]_{k=N+1}^{N+3}.

As shown in Figure 1, these a priori LLRs are input to the algorithmic blocks arranged in two rows, each comprising (N + 3) algorithmic blocks. Here, each algorithmic block operates on the basis of the LTE turbo code's state transition diagram, which is depicted in Figure 2.

Note that the first N algorithmic blocks in each row of Figure 1 are interconnected through the interleaver. During the turbo decoding process, these algorithmic blocks operate as described in Section II-D. They exchange LLRs in an iterative manner, which is suited to the implementation of the NoC, as described in Section III. However, before this iterative decoding process commences, it must be initialized using a small amount of preprocessing. More specifically, the last three algorithmic blocks in each row are isolated and only have to be operated once, before the iterative decoding process commences. This may be performed externally to the NoC, as described in Section II-C.


[Graphical content of Figure 1 (the two rows of (N + 3) algorithmic blocks, their a priori and extrinsic LLR connections, the forward and backward state metrics exchanged between neighboring blocks, and the interleaver linking the two rows) is not reproduced here.]

Fig. 1: Schematic characterizing the proposed turbo decoding algorithm for NoC implementation.

[The state transition diagram itself is not reproduced here.]

Fig. 2: State transition diagram of the LTE turbo code, depicting all legitimate transitions between the M = 8 possible values of the previous state S_{k−1} ∈ {0, 1, 2, . . . , M − 1} and the subsequent state S_k ∈ {0, 1, 2, . . . , M − 1}. The binary value c(S_{k−1}, S_k) ∈ {0, 1} indicates whether a transition between the states S_{k−1} and S_k is legitimate. Each legitimate transition implies message, parity and systematic bit values of b_1(S_{k−1}, S_k) ∈ {0, 1}, b_2(S_{k−1}, S_k) ∈ {0, 1} and b_3(S_{k−1}, S_k) ∈ {0, 1}, respectively.

C. Preprocessing

Before iterative decoding is commenced, the last three algorithmic blocks in each row of Figure 1 apply some preprocessing to the LLRs [b^a_{1,k}]_{k=N+1}^{N+3} and [b^a_{2,k}]_{k=N+1}^{N+3}, which pertain to the termination bits. These LLRs are converted into the state metrics [β_k]_{k=N}^{N+2}, where β_k = [β_k(S_k)]_{S_k=0}^{M−1} pertains to the M = 8 possible values of the state S_k, as shown in Figure 2. This is achieved using the conventional Log-BCJR backwards recursion, which commences with the operation of the algorithmic block having the index k = N + 3, before activating the block with index k = N + 2 and then concluding with the operation of the block having the index k = N + 1. When operating each of these algorithmic blocks, an a priori metric is computed for each of the transitions in Figure 2, according to

    \gamma_k(S_{k-1}, S_k) = \sum_{j=1}^{2} \left[ b_j(S_{k-1}, S_k) \cdot b^{a}_{j,k} \right],    (2)

where the notation b_j(S_{k−1}, S_k) is defined in the caption of Figure 2. Following this, an extrinsic backwards metric is computed for each of the M = 8 possible states in Figure 2, according to

    \beta_{k-1}(S_{k-1}) = \max^*_{\{S_k \,|\, c(S_{k-1}, S_k) = 1\}} \left[ \gamma_k(S_{k-1}, S_k) + \beta_k(S_k) \right],    (3)

where the notation c(S_{k−1}, S_k) is defined in the caption of Figure 2. This backwards recursion is initialized using β_{N+3} = [0, −∞, −∞, . . . , −∞], since termination guarantees a final state of S_{N+3} = 0. Note that (3) employs the Jacobian logarithm of [16], which is defined for two operands as

    \max^*(\delta_1, \delta_2) = \max(\delta_1, \delta_2) + \ln\left(1 + e^{-|\delta_1 - \delta_2|}\right)    (4)

and may be extended to more operands by exploiting its associative property. The above process generates the state metrics β_N of Figure 1, which are used throughout the iterative decoding process of Section III. Note that the state metrics α_0 = [0, −∞, −∞, . . . , −∞] are also used throughout the iterative decoding process, since the initial state of S_0 = 0 is guaranteed.
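To make the preprocessing concrete, the following Python sketch implements the two-operand Jacobian logarithm of (4) and a single backwards-recursion step of (2) and (3). The trellis is supplied as a list of transitions; the two-state toy trellis in the usage example is only a placeholder, since the full M = 8 state LTE trellis of Figure 2 is not reproduced in this paper's text.

```python
import math

NEG_INF = float("-inf")

def max_star(a, b):
    """Two-operand Jacobian logarithm of (4): max*(a,b) = max(a,b) + ln(1 + e^-|a-b|)."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

def backward_step(transitions, llrs_a, beta_k, M):
    """One step of the termination backwards recursion, per (2) and (3).

    transitions: list of (s_prev, s_next, (b1, b2)) tuples with c(s_prev, s_next) = 1
    llrs_a:      a priori LLRs (b^a_1,k, b^a_2,k) of the termination message/parity bits
    beta_k:      list of M backward state metrics beta_k(S_k)
    returns      list of M backward state metrics beta_{k-1}(S_{k-1})
    """
    beta_prev = [NEG_INF] * M
    for s_prev, s_next, bits in transitions:
        gamma = sum(b * llr for b, llr in zip(bits, llrs_a))      # equation (2)
        beta_prev[s_prev] = max_star(beta_prev[s_prev],
                                     gamma + beta_k[s_next])      # equation (3)
    return beta_prev

# Usage with a 2-state placeholder trellis (NOT the LTE trellis of Figure 2):
toy_trellis = [(0, 0, (0, 0)), (0, 1, (1, 1)), (1, 0, (1, 0))]
beta_N3 = [0.0, NEG_INF]                 # termination guarantees the final state 0
beta_N2 = backward_step(toy_trellis, (-0.8, 1.2), beta_N3, M=2)
print(beta_N2)
```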

D. Algorithmic block operation

As described in Section II-B, the proposed turbo decoder algorithm operates the first N algorithmic blocks in each row according to a novel iterative decoding schedule, as will be described in Section III. Regardless of when an algorithmic block is activated during this schedule, its operation follows the same process, as detailed in the following discussion.


Whenever the block in the upper row having the index k ∈ {1, 2, 3, . . . , N} is operated during this iterative process, it accepts L = 3 a priori LLRs as inputs, as shown in Figure 1. More specifically, the block accepts the a priori message LLR b^{u,a}_{1,k} that has been most recently provided by the interleaver, while the a priori parity and systematic LLRs b^{u,a}_{2,k} and b^{u,a}_{3,k} are accepted from the demodulator. By contrast, the block in the lower row having the index k ∈ {1, 2, 3, . . . , N} accepts only L = 2 a priori LLRs when it is operated, namely the a priori message and parity LLRs b^{l,a}_{1,k} and b^{l,a}_{2,k}. Furthermore, each block having the index k ∈ {1, 2, 3, . . . , N} accepts the vector of a priori forward state metrics α_{k−1} = [α_{k−1}(S_{k−1})]_{S_{k−1}=0}^{M−1}, as well as the vector of a priori backward state metrics β_k = [β_k(S_k)]_{S_k=0}^{M−1} that have been most recently provided by the neighboring algorithmic blocks. Note that at the start of the iterative decoding process, zero values are employed for the above-mentioned inputs if the interleaver or the neighboring blocks have not yet provided any updated values. The operation of each block having the index k ∈ {1, 2, 3, . . . , N} is completed using the equations provided in (5) – (8). More specifically, (5) is employed for combining the inputs, in order to obtain an a posteriori metric δ_k(S_{k−1}, S_k) for each transition in the state transition diagram of Figure 2. These a posteriori transition metrics are then combined by (6), (7) and (8), in order to produce the vector of extrinsic forward state metrics α_k = [α_k(S_k)]_{S_k=0}^{M−1}, the vector of extrinsic backward state metrics β_{k−1} = [β_{k−1}(S_{k−1})]_{S_{k−1}=0}^{M−1} and the extrinsic message LLR b^e_{1,k}, respectively. These extrinsic state metrics are provided to the neighboring algorithmic blocks, while the extrinsic message LLR b^e_{1,k} is provided to the interleaver π, so that it can be delivered to a block in the other row and used as an a priori message LLR, where b^{l,a}_{1,k} = b^{u,e}_{1,π(k)}.

III. TURBO DECODING ALGORITHM PROPOSED FOR NOC IMPLEMENTATION

In this section, we propose our novel NoC-optimized algorithm for the implementation of the LTE turbo decoder [4]. This is achieved by scheduling the computations of the algorithmic blocks described in Section II-D in a manner that is particularly suited to NoC operation. This is possible, since the algorithmic blocks of Section II-D are not bound by a strict scheduling that requires their operation according to forward and backward recursions, as in the conventional Log-BCJR turbo decoding algorithm [17]. In our previous work [12]–[15], this property was exploited to achieve fully-parallel operation, in which all algorithmic blocks are operated concurrently, in every clock cycle. More specifically, while [12] introduced the fully-parallel turbo decoding algorithm, [13]–[15] considered the implementation of that algorithm in Application Specific Integrated Circuit (ASIC), Field Programmable Gate Array (FPGA) and Graphical Processing Unit (GPU) applications, respectively. In contrast to this previous work, the novel scheduling proposed in this section is specifically designed for mapping onto an NoC, in which each ASIP performs the operations of Section II-D for different algorithmic blocks in different Clock Cycles (CCs). In particular, the proposed scheduling self-regulates the exchange of information within the NoC, in order to avoid congestion, where the operation of each ASIP in the NoC is adapted in response to the arrival of information delivered by the NoC. More specifically, rather than scheduling the operation of the algorithmic blocks according to strict forward and backward recursions, the proposed turbo decoder algorithm schedules their operation according to the exchange of information within the NoC. Here, the operation of the algorithmic blocks generates information for delivery over the NoC, while the delivery of that information to the connected algorithmic blocks stimulates their operation and so on. In this way, the processing stimulates the networking and the networking stimulates the processing, causing the schedule to grow organically.

As described in Section II-B, the iterative decoding process of the proposed turbo-decoding algorithm may be completed using an NoC, comprising an interconnected network of Intellectual Property (IP) cores. Here, each IP core is responsible for the operation of a different subset of the first N algorithmic blocks in each row of Figure 1. We refer to these subsets as windows, with each window comprising K adjacent algorithmic blocks, as shown in Figure 3. We assume that the IP cores are sufficiently powerful, or are specifically designed, to require only a single clock cycle for completing the processing of (5) – (8) for a single algorithmic block, as detailed in Section II-D. Note however that this implies that multiple algorithmic blocks within the same window cannot be operated concurrently. During the first (2K − 1) CCs, the iterative decoding process is initialized by performing a forward recursion and then a backward recursion within each window of the upper row. More specifically, as shown in Figure 3(a), the forward recursion invokes the operations of Section II-D for the jth algorithmic block in each window of the upper row during the jth clock cycle, where j ∈ {1, 2, 3, . . . , K − 1}. Following this, the backward recursion invokes the operations of Section II-D for the (2K − j)th algorithmic block during the jth clock cycle, where j ∈ {K, K + 1, K + 2, . . . , 2K − 1}.

Page 6: A Scalable Turbo Decoding Algorithm for High-Throughput ...high-throughput turbo decoders. These efforts have fo-cused on optimizing the NoC architecture to suit the Log-BCJR algorithm

IEEE ACCESS 6

The turbo decoding algorithm proposed for NoC implementation:

    \delta_k(S_{k-1}, S_k) = \sum_{j=1}^{L} \left[ b_j(S_{k-1}, S_k) \cdot b^{a}_{j,k} \right] + \alpha_{k-1}(S_{k-1}) + \beta_k(S_k)    (5)

    \alpha_k(S_k) = \left[ \max^*_{\{S_{k-1} \,|\, c(S_{k-1}, S_k) = 1\}} \left[ \delta_k(S_{k-1}, S_k) \right] \right] - \beta_k(S_k)    (6)

    \beta_{k-1}(S_{k-1}) = \left[ \max^*_{\{S_k \,|\, c(S_{k-1}, S_k) = 1\}} \left[ \delta_k(S_{k-1}, S_k) \right] \right] - \alpha_{k-1}(S_{k-1})    (7)

    b^{e}_{j,k} = \left[ \max^*_{\{(S_{k-1}, S_k) \,|\, b_j(S_{k-1}, S_k) = 1\}} \left[ \delta_k(S_{k-1}, S_k) \right] \right] - \left[ \max^*_{\{(S_{k-1}, S_k) \,|\, b_j(S_{k-1}, S_k) = 0\}} \left[ \delta_k(S_{k-1}, S_k) \right] \right] - b^{a}_{j,k}    (8)
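As an illustration of (5) – (8), the sketch below evaluates one algorithmic block in pure Python. It reuses the max_star() helper, NEG_INF constant and transition-list trellis format shown in Section II-C; all function and variable names are our own, and this is a simplified behavioural model rather than the ASIP datapath described in Section IV.

```python
def operate_block(transitions, llrs_a, alpha_prev, beta_curr, M):
    """One activation of an algorithmic block, following (5)-(8).

    transitions: list of (s_prev, s_next, bits) with c(s_prev, s_next) = 1, where
                 bits = (b1, b2, b3) for the upper row (L = 3) or (b1, b2) for the lower (L = 2)
    llrs_a:      the L a priori LLRs (message, parity[, systematic]) of bit k
    alpha_prev:  a priori forward state metrics alpha_{k-1}
    beta_curr:   a priori backward state metrics beta_k
    returns      (alpha_k, beta_{k-1}, extrinsic message LLR b^e_1,k)
    """
    alpha_new = [NEG_INF] * M
    beta_new = [NEG_INF] * M
    num = den = NEG_INF                        # accumulators for equation (8), j = 1

    for s_prev, s_next, bits in transitions:
        # Equation (5): a posteriori transition metric.
        delta = (sum(b * llr for b, llr in zip(bits, llrs_a))
                 + alpha_prev[s_prev] + beta_curr[s_next])
        # Equations (6) and (7): accumulate the per-state max* before the subtraction.
        alpha_new[s_next] = max_star(alpha_new[s_next], delta)
        beta_new[s_prev] = max_star(beta_new[s_prev], delta)
        # Equation (8): split the transitions according to the message bit b1.
        if bits[0] == 1:
            num = max_star(num, delta)
        else:
            den = max_star(den, delta)

    alpha_k = [a - b for a, b in zip(alpha_new, beta_curr)]      # subtract beta_k(S_k)
    beta_km1 = [b - a for b, a in zip(beta_new, alpha_prev)]     # subtract alpha_{k-1}(S_{k-1})
    llr_e = num - den - llrs_a[0]
    return alpha_k, beta_km1, llr_e

# Usage with the same 2-state placeholder trellis as before, L = 2 (lower row):
alpha_k, beta_km1, llr_e = operate_block(
    toy_trellis, llrs_a=(0.5, -1.0),
    alpha_prev=[0.0, 0.0], beta_curr=[0.0, 0.0], M=2)
```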

[Graphical content of Figure 3 not reproduced here.]

Fig. 3: Data flow within a single window of the proposed NoC implementation for the cases of (a) performing the first half-iteration, (b) participating in the dynamically regulated iterative decoding process and (c) remaining active when the window would otherwise be idle. The clock cycle index j is shown in curly brackets.

Note that the extrinsic LLRs b^{u,e}_{1,k} that are generated during the forward recursion are discarded and that only those obtained during the backward recursion are provided to the interleaver. In the NoC, interleaving is achieved by appropriately routing the extrinsic LLRs between IP cores, which may take several CCs, depending both on the path length and on the level of congestion.

As shown in Figure 3(b), whenever the interleaving of an extrinsic LLR is completed throughout the remainder of the iterative decoding process, the resultant a priori LLR b^a_{1,k} is immediately processed in that jth clock cycle by the corresponding algorithmic block, as described in Section II-D. However, the resultant extrinsic LLR b^e_{1,k} is not provided to the interleaver, since it is extrinsic to b^a_{1,k} and does not benefit from the corresponding a priori information. Instead, the resultant extrinsic state metric vectors α_k and β_{k−1} are provided to the adjacent algorithmic blocks, having the indices k + 1 and k − 1, respectively. In the following (j + 1)st clock cycle of Figure 3(b), one of these extrinsic state metric vectors is processed as described in Section II-D by the corresponding adjacent algorithmic block, provided that it resides within the same window as the block with index k. Here, a random selection is employed if both adjacent blocks reside within the same window as the block with index k. Finally, it is the extrinsic LLR b^e_{1,k+1} or b^e_{1,k−1} produced by this adjacent algorithmic block that is provided to the interleaver. In this way, the delivery of each LLR through the interleaver stimulates the generation of another LLR, dynamically regulating the level of congestion within the NoC.

Owing to the delays imposed by the interleaver, there may be CCs in which none of the algorithmic blocks within a particular window is engaged in the dynamically-regulated process described above. During these CCs, it is still beneficial to perform the operations of Section II-D for the algorithmic blocks within the window, so that information may be propagated using the extrinsic state metrics. In this way, the IP core dedicated to each window is able to perform useful computations in every clock cycle, achieving a hardware resource utility of 100%. Note however that the interleaver is not provided with the extrinsic LLRs that are produced when operating algorithmic blocks in this way, since this would cause excessive congestion in the NoC. As shown in Figure 3(c), during these otherwise-unutilized CCs, priority is given to propagating the information within the a priori LLR b^a_{1,k} that has been most recently provided by the interleaver. This is achieved by performing the operations of Section II-D for the algorithmic blocks within the window in order of decreasing proximity to the kth block, alternating between the blocks in the forward and backward directions. This propagation is halted once all of the blocks in the window have been operated, or if the interleaver provides the window with another a priori LLR, whereupon the above-described process is restarted.

When not employing otherwise-unutilized CCs for propagating the information supplied by a priori LLRs, these CCs may be used to propagate information supplied by the adjacent windows. More specifically, when the algorithmic blocks at the adjoining ends of the adjacent windows are operated as described in Section II-D, they will provide updated a priori state metrics α_{k_first−1} or β_{k_last}. Here, k_first and k_last are the indices of the first and last algorithmic blocks in the window considered. The information provided by α_{k_first−1} may be propagated by using a forward recursion to perform the operations of Section II-D for the algorithmic blocks in the window in ascending order of index k. Likewise, a backward recursion may be employed to operate the algorithmic blocks in descending order of index k and propagate the information provided by β_{k_last}. These recursions are halted once all of the blocks in the window have been operated, or if the window is provided with an updated a priori LLR or state metric α_{k_first−1} or β_{k_last}, whereupon the corresponding above-mentioned process is restarted.

Following the completion of the iterative decoding process, the a posteriori LLRs {b^{u,p}_{1,k}}_{k=1}^{N} may be obtained according to b^{u,p}_{1,k} = b^{u,a}_{1,k} + b^{u,e}_{1,k}. A hard decision may be imposed upon these a posteriori LLRs in order to obtain the bits {b^u_{1,k}}_{k=1}^{N}, where each bit value is obtained as the result of the binary test b^u_{1,k} = (b^{u,p}_{1,k} > 0).
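In code, this final step is a one-liner per bit; the sketch below (our own illustration) combines the a priori and extrinsic message LLRs of the upper decoder and slices off the hard decisions.

```python
def hard_decisions(llrs_a, llrs_e):
    """A posteriori LLRs b^p = b^a + b^e, followed by the binary test b^p > 0."""
    return [1 if (a + e) > 0 else 0 for a, e in zip(llrs_a, llrs_e)]

# e.g. hard_decisions([0.7, -1.2], [1.1, -0.3]) -> [1, 0]
```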

IV. THE NOC PROPOSED FOR TURBO DECODER IMPLEMENTATION

In this section, we propose an NoC platform for implementing the proposed scalable turbo decoding algorithm of Section III. We commence in Section IV-A by providing a brief introduction to the architecture of a typical NoC platform. Following this, Section IV-B describes the topology of the tiles in our proposed NoC platform and the mapping of the windows described in Section III onto these tiles. Finally, the NoC routing employed for implementing the interleaver is described in Section IV-C.

A. NoC architecture

An NoC comprises an on-chip interconnection network between many identical nodes, each of which is referred to as a tile. A tile comprises four main components, namely an IP core, a router, Network Interfaces (NIs) and links. The IP cores refer to blocks of reusable components, which may be ASIPs, ASICs, FPGAs, CPUs, DSPs or memory, for example. In order to perform inter-tile communication, a router is employed in each tile and connected to the local IP core via an NI, while the routers of different tiles are connected via links, which are typically on-chip wires.

The layout pattern of the interconnections between the on-chip components is referred to as the network topology. Many topologies exist in the literature, ranging from the simple shared bus [18] and ring [19] topologies to more complicated application-specific topologies, such as those of [20], [21]. The selection of an appropriate topology depends on design constraints such as performance, area, power, locality of traffic and quality of service. Given a particular network topology, routing is responsible for choosing a path between any two communicating nodes, which are referred to as a source-destination pair. The careful selection of an appropriate routing algorithm is crucial for achieving good network performance. An effective routing algorithm has to carefully consider load balancing, throughput, latency, power consumption, fault-tolerance, deadlocks, livelocks and the limited hardware resources available on chip [22]. Although an abundance of routing algorithms has been proposed in the literature [23], [24], they are typically classified into three kinds, namely deterministic, oblivious and adaptive [23].

B. Topology and Mapping onto IP Cores

As shown in Figure 4(a), we employ a two-dimensional (2D) mesh topology for implementing our proposed turbo decoder NoC, since its regular grid-like structure is particularly suited to fabrication [25], [26]. Furthermore, the proposed turbo decoder schematic of Figure 1 is also a regular 2D structure, which can therefore be readily mapped to an NoC employing a 2D mesh topology. As shown in Figure 4(a), a 2D mesh NoC having the dimensions of X by 2Y may be invoked for decoding frames comprising N bits that are decomposed into P windows, where XY = P. Accordingly, the unshaded tiles ∈ {1, 2, 3, . . . , XY} and the shaded tiles ∈ {XY + 1, XY + 2, XY + 3, . . . , 2XY} respectively correspond to the upper and the lower decoders of Figure 1, wherein each tile processes one of the P windows. Note that adjacent windows of algorithmic blocks are mapped to adjacent tiles in Figure 4(a), according to a pattern that meanders from side to side in the 2D mesh topology.
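One concrete realization of such a side-to-side (serpentine) mapping is sketched below. The exact tile numbering of Figure 4(a) is not fully recoverable from the extracted text, so the coordinate convention, the 0-based indexing and the function name are our own illustrative assumptions.

```python
def window_to_tile(p, X):
    """Map window index p (0-based, 0 <= p < X*Y) to mesh coordinates (x, y),
    meandering from side to side: even rows run left-to-right, odd rows right-to-left.
    The same mapping could be reused for the lower decoder on rows Y..2Y-1."""
    y, offset = divmod(p, X)
    x = offset if y % 2 == 0 else X - 1 - offset
    return x, y

# e.g. for X = 4, windows 0..7 map to
# (0,0) (1,0) (2,0) (3,0) (3,1) (2,1) (1,1) (0,1)
```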

The 2D mesh is employed to interleave the extrinsic LLRs b^e_{1,k} that are produced by the tiles of Figure 4(a). These are routed according to the approach of Section IV-C and delivered to the destination tile, where they are used as the a priori LLRs b^a_{1,k}. As described in Section IV-A, each tile includes an IP core, which comprises an ASIP processor and some memory blocks, as shown in Figure 4(b).

[Graphical content of Figure 4 not reproduced here.]

Fig. 4: (a) Schematic of the proposed 2D mesh NoC, having the dimensions of X by 2Y, (b) schematic of the IP core in each tile, (c) schematic of the router in each tile.


In each clock cycle, the ASIP performs the operations of Section II-D for a particular one of the algorithmic blocks in the corresponding window, depending on the delivery of the a priori LLRs b^a_{1,k}, according to the novel schedule described in Section III. More specifically, in each unshaded tile corresponding to the upper decoder, the ASIP processor is designed to process (5) – (8) using L = 3, as detailed in Section II-D. As demonstrated in [13], the ASIP may complete these calculations for a single algorithmic block within a single clock cycle using 75 adders (some of which are used for maximum or subtraction operations), where the critical path is formed by a chain of 6 adders. Furthermore, a block of memory is employed for storing all a priori message LLRs b^{u,a}_{1,k}, all a priori parity LLRs b^{u,a}_{2,k} and all a priori systematic LLRs b^{u,a}_{3,k} for all algorithmic blocks in the corresponding window. By contrast, the ASIP processors of the shaded tiles corresponding to the lower decoder process (5) – (8) with only L = 2, requiring 74 adders with a critical path comprising 6 adders. Accordingly, these tiles do not require memory for storing a priori systematic LLRs b^{l,a}_{3,k}, as discussed in Section II-D.

In addition to these LLRs, each IP core employs another block of memory for storing the branch metrics δ_k, as well as the forward state metrics α_k and backward state metrics β_{k−1}, as shown in Figure 4(b). Most of these intermediate variables are produced and consumed within the same window of algorithmic blocks and hence within the same tile. However, as described in Section III, when the algorithmic blocks with indices k = k_first and k = k_last in a particular window are operated, the resultant sets of M backward state metrics β_{k_first−1} and M forward state metrics α_{k_last} must be provided to the preceding window and the following window, respectively. Compared to the single extrinsic LLRs that are produced at a time by each tile, these sets of state metrics comprise M times as much data, where we have M = 8 in the case of the LTE turbo code. However, this data is only exchanged with a neighboring window and hence with a neighboring tile. Therefore, in order to avoid congestion on the 2D mesh, these sets of state metrics are exchanged using one-hop links between adjacent tiles in our proposed NoC implementation, which are shown as the dotted lines in Figures 4(a) and 4(b). These single-hop links may be formed using shared registers and links, where the size of the registers is specified according to the size of the forward and backward state metrics.

C. Routing

The router of each tile in the proposed NoC has five bi-directional ports, which are connected to the IP core of that tile and to the routers of its four neighboring tiles, as shown in Figure 4. These connections form the paths that are used for conveying the extrinsic message LLRs b^e_{1,k} between the tiles associated with the upper and the lower decoders of the proposed turbo decoder algorithm. Note that the tiles residing at the edges or the corners of the NoC mesh have fewer than five ports, as shown in Figure 4. In each clock cycle, an input may be provided to some or all of a router's ports, and it may provide an output to one of its ports. More specifically, an input will be provided on the port connected to the associated IP core if that core generates an extrinsic message LLR b^e_{1,k} during the clock cycle, as described in Section III. Likewise, an input will be provided on a port connected to a neighboring router if that router provides an output on its corresponding port, as will be detailed below. As shown in Figure 4(c), an input provided on a particular port is passed to a corresponding First In First Out (FIFO) buffer containing four memory elements, where this number was found to offer a desirable tradeoff between NoC performance and hardware requirements [22], [27]. If the FIFO buffer is currently empty, then this input will become immediately available at its output. By contrast, if the FIFO buffer is partially full, then the input is stored in the memory element representing the end of the queue and the value stored at the start of the queue is made available at the output of the FIFO buffer. Finally, if the FIFO buffer is totally full, then the input is rejected and the credit out signal of Figure 4(c) is asserted, in order to inform the source of the input. Following this, the crossbar switch of Figure 4(c) is directed by the routing algorithm discussed below to select the value at the front of the queue in a particular one of the FIFO buffers. The routing algorithm also directs the crossbar switch to clock the selected value into the output buffer of a particular one of the ports, ready to be provided in the next clock cycle to the associated IP core or a neighboring router, as appropriate. However, if the credit in signal of Figure 4(c) is asserted by that IP core or neighboring router, then the output buffer is not updated, so that another attempt can be made to pass the rejected value in the next clock cycle.
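The buffering and backpressure behaviour just described can be modelled in a few lines. The class below is a behavioural sketch of a single four-entry input FIFO with a credit-style "full" indication; it is our own illustration and not the RTL of the router in Figure 4(c).

```python
from collections import deque

class InputFifo:
    """Behavioural model of one router input buffer with DEPTH = 4 entries."""
    DEPTH = 4

    def __init__(self):
        self.queue = deque()

    def offer(self, flit):
        """Try to accept a value; returns False (credit out asserted) when full."""
        if len(self.queue) >= self.DEPTH:
            return False            # rejected: the sender retries in the next clock cycle
        self.queue.append(flit)
        return True

    def front(self):
        """Value currently presented to the crossbar, or None if empty."""
        return self.queue[0] if self.queue else None

    def pop(self):
        """Remove the value once the crossbar has forwarded it."""
        return self.queue.popleft()
```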

The router of Figure 4(c) employs round-robin arbitration to direct the crossbar switch to select which input FIFO buffer should provide the output of the router in each clock cycle. In successive CCs, round-robin arbitration rotates through each input FIFO buffer in turn, skipping over any that are empty. Furthermore, the router of Figure 4(c) employs the deterministic Dimension-Ordered Routing (DOR) algorithm of [27], which is also known as XY routing for the 2D mesh topology. As suggested by the terminology, the DOR algorithm transfers the LLRs from the source tile to the destination tile in a dimension-by-dimension manner, which readily guarantees freedom from deadlock. More specifically, if the destination tile of an LLR has a different X coordinate to that of its current location, then it is routed in the appropriate X direction. Otherwise, it is routed in the appropriate Y direction, until it reaches its destination. Since the network traffic of the proposed scalable turbo decoder is self-regulated, the DOR algorithm can outperform even the most sophisticated adaptive routing algorithms, such as negative-first [29], odd-even [30] and DyAD [31], [32], as shown in [27], [28].
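Both router policies above are small enough to state directly in code. The sketch below gives an XY (DOR) next-hop decision and a round-robin arbiter over the input FIFOs, reusing the InputFifo model above; the coordinate and port-name conventions are illustrative assumptions rather than the exact ones of [27].

```python
def xy_next_hop(cur, dst):
    """DOR / XY routing: resolve the X dimension first, then the Y dimension."""
    (cx, cy), (dx, dy) = cur, dst
    if dx != cx:
        return "EAST" if dx > cx else "WEST"
    if dy != cy:
        return "SOUTH" if dy > cy else "NORTH"
    return "LOCAL"                      # arrived: deliver to the tile's IP core

def round_robin(fifos, last_served):
    """Pick the next non-empty input FIFO after the one served last; None if all are empty."""
    n = len(fifos)
    for i in range(1, n + 1):
        idx = (last_served + i) % n
        if fifos[idx].front() is not None:
            return idx
    return None

# e.g. xy_next_hop((1, 2), (3, 0)) -> "EAST"   (the X offset is corrected before Y)
```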

V. RESULTS AND EVALUATION

In this section, we compare the proposed NoC turbo decoder with the application of the conventional Log-BCJR turbo decoder to an NoC. We commence in Section V-A by discussing the differences between the NoC implementation of the proposed turbo decoding algorithm and the conventional Log-BCJR decoding algorithm. Following this, we describe the simulation tools and methodology used for their comparison in Section V-B. Finally, we present the results of our experiments in Section V-C.

A. Implementation of the Benchmarker

In order to serve as a benchmarker for the proposed NoC turbo decoder, a conventional LTE turbo decoder employing the Log-BCJR algorithm with the PIVI windowing technique [33] was also applied to the proposed NoC of Figure 4. More specifically, this conventional turbo decoder was implemented using the same topology and mapping as the proposed parallel turbo decoding algorithm, where P tiles are used for each of the upper and lower decoders, with each tile allocated to the processing of one of the P windows in a frame. This approach facilitates a direct and fair comparison, since this topology and mapping does not particularly benefit or impede either algorithm. In the Log-BCJR benchmarker, the IP core in each tile is capable of processing one of the K = N/P bits in the corresponding window per clock cycle. However, in contrast to the proposed turbo decoding algorithm, here the ASIP processor in the IP core is designed to process the conventional Log-BCJR algorithm of (9) – (13), where we have L = 3 for the upper decoder and L = 2 for the lower decoder. Accordingly, each IP core schedules the processing of the bits in the corresponding window across several consecutive CCs, using a conventional Log-BCJR forward recursion, followed by a backward recursion. More explicitly, during the first (K − 1) CCs of the turbo decoding process, each tile corresponding to the upper decoder performs a forward recursion, in which (9) and (10) are performed for the jth bit in the window during the jth clock cycle, where j ∈ {1, 2, 3, . . . , K − 1}. During the next K CCs, each tile corresponding to the upper decoder performs a backward recursion, in which (11) – (13) are performed for the (2K − j)th bit in the jth clock cycle, where j ∈ {K, K + 1, K + 2, . . . , 2K − 1}. As and when the extrinsic LLRs b^{u,e}_{1,k} are obtained during the backward recursion, they are routed across the NoC to the tiles of the lower decoder, which may take several CCs. At the corresponding tiles of the lower decoder, these LLRs become the a priori message LLRs b^{l,a}_{1,k} according to the interleaver π, where b^{l,a}_{1,k} = b^{u,e}_{1,π(k)}. A tile of the lower decoder may begin its forward recursion as soon as the a priori message LLR corresponding to the first bit in the window has been received. However, once the forward recursion has started, it may stall while waiting for each successive a priori message LLR to be received, owing to the data dependencies of the Log-BCJR algorithm. Once a tile of the lower decoder has completed its forward recursion, it may perform the backward recursion during the next K successive CCs, in order to generate extrinsic LLRs to be routed across the NoC. When these LLRs arrive at the corresponding tiles of the upper decoder of Figure 4(a), they become a priori message LLRs. The rest of the iterative decoding process continues in this manner, where stalls may be incurred during the forward recursions of the tiles corresponding to both the upper and lower decoders.

Note that since the tiles process the windows independently of each other, the processing of one window may begin earlier than that of another. However, the a priori message LLRs corresponding to a particular window in either the upper or the lower decoder are typically provided by many different windows of the other decoder, owing to the action of the interleaver. This tends to approximately synchronize the processing of the windows in the benchmarker. Owing to this, the throughput of the benchmarker tends to be limited by the longest routes through the NoC, which cause stalls in the forward recursion processing performed throughout the NoC. Note that this bottleneck affecting the benchmarker's processing efficiency is imposed by the data dependencies of the Log-BCJR algorithm, which are eliminated in the proposed NoC-optimized algorithm of Section III.


The Log-BCJR turbo decoding algorithm:

    \gamma_k(S_{k-1}, S_k) = \sum_{j=1}^{L} \left[ b_j(S_{k-1}, S_k) \cdot b^{a}_{j,k} \right]    (9)

    \alpha_k(S_k) = \max^*_{\{S_{k-1} \,|\, c(S_{k-1}, S_k) = 1\}} \left[ \gamma_k(S_{k-1}, S_k) + \alpha_{k-1}(S_{k-1}) \right]    (10)

    \beta_{k-1}(S_{k-1}) = \max^*_{\{S_k \,|\, c(S_{k-1}, S_k) = 1\}} \left[ \gamma_k(S_{k-1}, S_k) + \beta_k(S_k) \right]    (11)

    \delta_k(S_{k-1}, S_k) = \gamma_k(S_{k-1}, S_k) + \alpha_{k-1}(S_{k-1}) + \beta_k(S_k)    (12)

    b^{e}_{j,k} = \left[ \max^*_{\{(S_{k-1}, S_k) \,|\, b_j(S_{k-1}, S_k) = 1\}} \left[ \delta_k(S_{k-1}, S_k) \right] \right] - \left[ \max^*_{\{(S_{k-1}, S_k) \,|\, b_j(S_{k-1}, S_k) = 0\}} \left[ \delta_k(S_{k-1}, S_k) \right] \right] - b^{a}_{j,k}    (13)
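For contrast with the block-at-a-time sketch of Section III, the following sketch runs (9) – (13) over one window in the conventional serial order, which makes the data dependency explicit: each α_k needs α_{k−1} and each β_{k−1} needs β_k, so the steps of each recursion cannot be reordered or spread over otherwise-idle CCs. The helper functions and the transition-list trellis format are the same illustrative ones used earlier.

```python
def log_bcjr_window(transitions, llrs_a, alpha_in, beta_in, M):
    """Serial Log-BCJR over a K-bit window, per (9)-(13).

    llrs_a:   list of K tuples of a priori LLRs (one tuple of L values per bit)
    alpha_in: forward state metrics entering the window (alpha_{k_first - 1})
    beta_in:  backward state metrics entering the window (beta_{k_last})
    returns   list of K extrinsic message LLRs b^e_1,k
    """
    K = len(llrs_a)

    def gammas(k):
        return {(sp, sn): sum(b * l for b, l in zip(bits, llrs_a[k]))
                for sp, sn, bits in transitions}                    # equation (9)

    # Forward recursion: alpha_k depends on alpha_{k-1} -> inherently serial.
    alphas = [alpha_in]
    for k in range(K):
        g = gammas(k)
        a = [NEG_INF] * M
        for sp, sn, _ in transitions:
            a[sn] = max_star(a[sn], g[(sp, sn)] + alphas[k][sp])    # equation (10)
        alphas.append(a)          # alphas[-1] would be handed to the following window

    # Backward recursion, producing the extrinsic LLRs via (11)-(13).
    beta = beta_in
    llrs_e = [0.0] * K
    for k in reversed(range(K)):
        g = gammas(k)
        num = den = NEG_INF
        beta_prev = [NEG_INF] * M
        for sp, sn, bits in transitions:
            delta = g[(sp, sn)] + alphas[k][sp] + beta[sn]          # equation (12)
            beta_prev[sp] = max_star(beta_prev[sp], g[(sp, sn)] + beta[sn])  # (11)
            if bits[0] == 1:
                num = max_star(num, delta)
            else:
                den = max_star(den, delta)
        llrs_e[k] = num - den - llrs_a[k][0]                        # equation (13)
        beta = beta_prev
    return llrs_e
```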

[Figure 5 depicts the tool-chain: a cycle-accurate NoC simulator (SystemC), parametrized by the NoC size, topology, routing algorithm, simulation time, number of ports, number of buffers, channel size, etc., feeding a BER simulator (Matlab) that produces the decoders' error correction performance results.]

Fig. 5: The tool-chain used to evaluate the proposed NoC-optimized turbo decoding algorithm.

B. Evaluation Methodology

The performance comparison of Section V-C was performed using the tool-chain of Figure 5. Here, the Noxim simulator [34] was modified to simulate the traffic that is produced by the proposed NoC. More specifically, Noxim was configured to simulate the schedules used by the proposed NoC-optimized turbo decoding algorithm of Section III and by the benchmarker Log-BCJR algorithm of Section V-A. In this way, Noxim became able to identify which specific extrinsic LLRs are generated by each tile in each clock cycle. More specifically, by simulating the routing of these LLRs across the NoC, Noxim could also identify which a priori LLRs are delivered to each tile in each clock cycle and then apply the turbo decoding schedule to determine which extrinsic LLRs are produced in response and when. Note that while Noxim has awareness of when the LLRs are generated and delivered, it does not have awareness of the specific values of these LLRs, since it does not simulate the computations of the turbo decoding algorithms of Sections III and V-A. Instead, the traffic traces generated by Noxim were provided to Matlab simulations of the algorithms, which were programmed to simulate the computations of the algorithms with consideration of when the a priori LLRs become available at each tile. These Matlab simulations were run repeatedly in a Monte Carlo manner, in order to characterize the Bit Error Ratio (BER) performance of the proposed NoC-optimized turbo decoding algorithm, as well as of the benchmarker Log-BCJR algorithm. More specifically, after each iteration of each run of each algorithm, our simulation recorded the number of decoding errors observed and the number of CCs that had elapsed so far. By averaging across the different runs of the algorithms, the BER may be characterized as a function of the number of iterations performed, or equivalently of the number of CCs performed. During the operation of the proposed NoC-optimized turbo decoding algorithm of Section III, the number of 'equivalent iterations' is given by dividing the total number of generated extrinsic LLRs by 2N, noting that some tiles may have generated more extrinsic LLRs than others. In the case of the benchmarker Log-BCJR algorithm of Section V-A, an iteration is completed whenever a complete set of 2N extrinsic LLRs is generated.
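The 'equivalent iterations' bookkeeping is straightforward; the helper below (our own illustration) turns a per-tile tally of generated extrinsic LLRs into the equivalent-iteration count used on the vertical axis of Figure 6(b).

```python
def equivalent_iterations(extrinsic_llrs_per_tile, N):
    """Total generated extrinsic LLRs divided by 2N, per the definition in Section V-B."""
    return sum(extrinsic_llrs_per_tile) / (2 * N)

# e.g. 16 tiles that each generated 256 extrinsic LLRs, with N = 512:
# equivalent_iterations([256] * 16, 512) -> 4.0
```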

As shown in Figure 5, our Noxim simulations are parametrized by the NoC topology, size, and methods of arbitration and routing, which were configured as described in Section IV. Furthermore, our Noxim and Matlab simulations are both parametrized by the frame length N, the window size K and the interleaver design π, as shown in Figure 5. Here, the standard LTE interleaver designs were employed for all simulations, while the effects of the parameters N and K are investigated in Section V-C.

C. Results and Discussions

In this section, we discuss the results of the comparison between the proposed NoC turbo decoder and the application of the conventional Log-BCJR turbo decoder to an NoC. The hardware resource utility achieved by both implementations is discussed in Section V-C1, while their error correction capability is discussed in Section V-C2.

[Graphical content of Figure 6 not reproduced here; both panels plot curves for N = 512 with K ∈ {4, 16, 64} and for N = 6144 with K ∈ {48, 192, 768}.]

Fig. 6: Cycle-accurate Noxim simulation results of routing in the proposed and benchmarker NoC implementations of the LTE turbo decoder. (a) Number of CCs required to complete different numbers of decoding iterations I by the benchmarker employing different frame lengths N and window sizes K. (b) Comparison between the number of decoding iterations completed by the benchmarker and the number of equivalent decoding iterations completed by the proposed implementation, in the same number of CCs.

1) Hardware resource utility: Figure 6(a) characterizes the number of CCs required by the benchmarker NoC implementation of the LTE turbo decoder for performing different numbers of decoding iterations I, when decoding frames having different frame lengths N and window sizes K. As may be expected, the number of CCs required for the benchmarker to complete each iteration grows with the window size K. However, the relationship is not entirely linear, since a larger NoC size typically results in heavier inter-tile traffic, leading to a greater prevalence of stalls and hence requiring more CCs to complete each iteration than might otherwise be expected. Note that the NoC size employed is dictated by the ratio of the frame length N to the window size K, as shown in Table I. Owing to this, Table I shows that a small-size NoC facilitates a higher hardware resource utility for the benchmarker than a large-size NoC for a given frame length N. Here, the utility is quantified as

    \frac{\text{CCs required per } I}{\text{CCs used per } I} \times 100\%,

where the number of CCs used per I is obtained using the experimental results of Figure 6(a), while the number of CCs required per I may be obtained as 4K, since a single iteration comprises two half-iterations, each of which includes a K-clock-cycle forward recursion and a K-clock-cycle backward recursion. Note that although a small-size NoC implementation achieves a better hardware resource utility than a large-size NoC implementation, the overall number of CCs required per iteration in Figure 6(a) is typically worse, owing to the employment of a larger window size K.
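For example, taking the first configuration of Table I (N = 512, K = 4, 16x16 NoC), the benchmarker needs 4K = 16 CCs of useful work per iteration but was measured to spend 150 CCs per iteration, giving

```latex
\frac{\text{CCs required per } I}{\text{CCs used per } I} \times 100\% = \frac{16}{150} \times 100\% \approx 10.7\%
```

which is the resource utility entry reported for that row of Table I.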

By contrast, our proposed NoC implementation maintains a constant hardware resource utility of 100% at all times, regardless of the frame length N, the window size K and the NoC size. This is due to its elimination of the Log-BCJR data dependencies, which allows it to avoid the stalls that are incurred by the benchmarker. As shown in Figure 6(b), this advantage of the proposed NoC implementation allows it to complete 1.8 to 2.5 times as many equivalent iterations as the benchmarker in the same number of CCs, when implementing the LTE turbo decoder. Note that this performance gain of the proposed NoC implementation is related more to the NoC size than to the window size K, as it is in the benchmarker implementation. This is because the data dependencies within a window have been eliminated in the proposed NoC implementation.

2) Error correction capability: Figures 7 and 8 characterize the BER performance of the proposed and benchmarker NoC implementations of the LTE turbo decoder, when employing frame lengths of N = 512 and N = 6144, respectively. In each case, the BER performance is plotted for various window sizes K and for the numbers of CCs that are required for the benchmarker to complete I ∈ {2, 4, 8, 16, 32} iterations, when using BPSK modulation for communication over an AWGN channel.


TABLE I: A comparison between the proposed and benchmarker NoC implementations for different configurations, in terms of hardware resource utility, error correction capability and throughput. Here, the hardware resource utility of the benchmarker is defined as (CCs required per I)/(CCs used per I) × 100%, the error correction capability is represented by the Eb/N0 value where a BER of 10^-4 is achieved and the throughput is indicated by the CCs used in order to achieve the BER of 10^-4. In each case, BPSK modulation is employed for communication over an AWGN channel.

Frame length N | Window size K | No. of cores 2P | NoC size | CCs required per I | CCs used per I | Resource utility (%) | Eb/N0 (dB) at BER = 10^-4 | CCs for benchmarker | CCs for proposed | Throughput gain (%)
512  |   4 | 256 | 16×16 |   16 |  150 | 10.7 | 2.61 |  3981 |  1271 | 213
512  |  16 |  64 |  8×8  |   64 |  292 | 21.9 | 2.61 |  3744 |  1825 | 105
512  |  64 |  16 |  4×4  |  256 |  669 | 38.3 | 2.61 |  5567 |  4261 | 30.7
2048 |  16 | 256 | 16×16 |   64 |  466 | 13.7 | 1.78 |  3647 |  2442 | 49.3
2048 |  64 |  64 |  8×8  |  256 | 1146 | 22.3 | 1.78 |  5049 |  3528 | 43.1
2048 | 256 |  16 |  4×4  | 1024 | 2694 | 38.0 | 1.78 | 20862 | 18022 | 15.8
6144 |  48 | 256 | 16×16 |  192 | 1620 | 11.9 | 1.47 | 21829 | 15421 | 41.6
6144 | 192 |  64 |  8×8  |  768 | 3611 | 21.3 | 1.47 | 36824 | 26084 | 41.2
6144 | 768 |  16 |  4×4  | 3072 | 8438 | 36.4 | 1.47 | 85143 | 76713 | 11.0

After employing a sufficiently high number of CCs in each case, the NoC implementations can be seen to converge towards the BER performance of a conventional Log-BCJR LTE turbo decoder employing only a single window of length K = N and I = 50 decoding iterations, hence validating the operation of the NoC.

In all cases, Figures 7 and 8 show that for any particular number of CCs, the proposed implementation achieves a superior BER. In the case where N = 512 and K = 4, the proposed implementation using 1300 CCs achieves a superior BER to the benchmarker using 2100 CCs, as shown in Figure 7(a). This very significant performance gain may be attributed to the avoidance of stalls in the proposed NoC implementation, which allows it to complete around four equivalent iterations within the number of CCs required for the benchmarker to complete I = 2 iterations, as shown in Figure 6(b). Furthermore, as described in Section III, the proposed implementation achieves a higher hardware resource utility than the benchmarker, with the result that each of its equivalent decoding iterations has a greater benefit to the BER than each iteration of the benchmarker. Note however that the gain demonstrated by the proposed implementation for N = 512 and K = 4 diminishes as N or K is increased. For example, in the case where N = 6144 and K = 192, the proposed implementation using 30000 CCs achieves a similar BER to the benchmarker using 58000 CCs, as shown in Figure 8(b).
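For reference, the error-correction-capability metric used in Table I (the Eb/N0 at which a BER of 10^-4 is reached) can be extracted from a simulated BER curve by interpolating in the logarithmic BER domain. The sketch below applies this to a made-up BER curve, since the actual simulated data points are not reproduced in this section.

```python
import math

def ebno_at_target_ber(ebno_points, ber_points, target=1e-4):
    """Interpolate log10(BER) against Eb/N0 to find where the target BER is
    crossed. Inputs are assumed sorted by increasing Eb/N0 with monotonically
    decreasing BER (hypothetical simulated points)."""
    log_target = math.log10(target)
    for (x0, y0), (x1, y1) in zip(zip(ebno_points, ber_points),
                                  zip(ebno_points[1:], ber_points[1:])):
        l0, l1 = math.log10(y0), math.log10(y1)
        if l0 >= log_target >= l1:
            return x0 + (x1 - x0) * (log_target - l0) / (l1 - l0)
    return None  # target BER not reached over the simulated range

# Hypothetical BER curve, for illustration only (not the paper's data):
ebno = [1.0, 1.5, 2.0, 2.5, 3.0]
ber = [3e-2, 5e-3, 4e-4, 2e-5, 8e-7]
print(ebno_at_target_ber(ebno, ber))  # Eb/N0 (dB) where BER = 1e-4 is crossed
```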

Note that in conventional PIVI-based Log-BCJR LTE turbo decoders, it is necessary to synchronize the processing within the algorithmic blocks in order to avoid error correction performance degradation. However, in a NoC implementation, the requirement to synchronize the processing of the algorithmic blocks causes stalls, which temporarily prevent decoding progress from being made and which lead to a poor resource utility. By contrast, the novel scheduling of the proposed NoC implementation eliminates stalls, by removing the requirement to synchronize the processing of the algorithmic blocks. This allows the proposed implementation to maintain 100% resource utility and to make more decoding progress within a particular amount of time, leading to the BER gains shown in Figures 7 and 8.
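The mechanism described above can be illustrated with a deliberately simplified toy model, which is our own construction rather than the Noxim-based simulation used in this paper: each of P processing elements spends one CC on useful computation per algorithmic step, plus a random stall while waiting for data to traverse the congested NoC; if every step ends with a synchronization, all elements idle until the slowest one finishes.

```python
import random

random.seed(0)
P, STEPS = 64, 1000   # toy model: 64 processing elements, 1000 algorithmic steps

work_ccs = 0      # element-CCs spent on useful computation (1 CC per element per step)
elapsed_ccs = 0   # element-CCs elapsed when every step ends with a synchronization

for _ in range(STEPS):
    # Each element needs 1 CC of computation plus a random stall (0-3 extra CCs)
    # while it waits for data to arrive over the congested NoC.
    step_times = [1 + random.randint(0, 3) for _ in range(P)]
    work_ccs += P                        # every element performed 1 CC of useful work
    elapsed_ccs += P * max(step_times)   # with a barrier, all P wait for the slowest

print(f"toy benchmarker-style utility: {100 * work_ccs / elapsed_ccs:.1f}%")
# Removing the data dependencies removes the stalls, so a stall-free schedule
# keeps every element busy in every CC, i.e. a utility of 100%.
```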

Table I quantifies the average number of CCs required by the proposed and benchmarker NoC implementations of the LTE turbo decoder to achieve a BER of 10^-4 at the Eb/N0 value where this BER is achieved by the Log-BCJR turbo decoder employing I = 8 decoding iterations. For all combinations of the frame length N and the window size K considered, the proposed NoC implementation can be seen to require significantly fewer CCs than the benchmarker, for the reasons discussed above. The throughput of these NoC implementations is proportional to the reciprocal of the number of CCs that they employ. In accordance with this, Table I quantifies the throughput gain that is offered by the proposed NoC implementation over the benchmarker. This throughput gain is as high as 213% for the case where N = 512 and K = 4. For most other combinations of N and K, the throughput gain offered by the proposed NoC implementation is several tens of percent.
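Since throughput is proportional to 1/CCs, the throughput gains listed in Table I follow directly from the CC counts. The snippet below reproduces a few of the tabulated gains from CC values copied out of the table; it is purely a consistency check, not part of any implementation.

```python
# Throughput gain of the proposed implementation over the benchmarker:
#   gain = (CCs_benchmarker / CCs_proposed - 1) * 100%,
# where the CC counts needed to reach BER = 1e-4 are copied from Table I.
table_rows = [
    # (N, K, CCs for benchmarker, CCs for proposed)
    (512,    4,  3981,  1271),
    (512,   16,  3744,  1825),
    (6144, 192, 36824, 26084),
    (6144, 768, 85143, 76713),
]

for N, K, cc_bench, cc_prop in table_rows:
    gain = 100.0 * (cc_bench / cc_prop - 1.0)
    print(f"N={N:5d}, K={K:4d}: throughput gain = {gain:5.1f}%")
# Table I lists these (rounded) as 213, 105, 41.2 and 11.0 percent.
```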

[Fig. 7 graphs (axis and legend data only): BER (10^0 down to 10^-7) versus Eb/N0 (0.5 to 3.5 dB). Curves: Log-BCJR (K=N), I = 50; Log-BCJR (PIVI) with CCs ∈ {1300, 1500, 2100, 3200, 5300} in (a), {1600, 2100, 3200, 5400, 9800} in (b) and {2300, 3600, 6200, 11000, 22000} in (c); FP with CCs ∈ {1300, 1500, 2100} in (a), {1600, 2100, 3200} in (b) and {2300, 3600, 6200} in (c).]

Fig. 7: BER performance achieved by the proposed Fully-Parallel (FP) and benchmarker Log-BCJR (PIVI) NoC implementations of the LTE turbo decoder, when employing a frame length of N = 512, window sizes of (a) K = 4, (b) K = 16 and (c) K = 64, as well as the numbers of CCs that are required for the benchmarker to complete various numbers of decoding iterations I ∈ {2, 4, 8, 16, 32}. The BER performance of a non-windowing Log-BCJR (K = N) LTE turbo decoder is also provided for the case of employing the same frame length of N = 512 and I = 50 iterations. In each case, BPSK modulation over an AWGN channel is assumed.

[Fig. 8 graphs (axis and legend data only): BER (10^0 down to 10^-7) versus Eb/N0 (0.8 to 1.8 dB). Curves: Log-BCJR (K=N), I = 50; Log-BCJR (PIVI) with CCs ∈ {4200, 7400, 14000, 27000} in (a), {8200, 15000, 30000, 58000} in (b) and {18000, 34000, 68000, 140000} in (c); FP with CCs ∈ {4200, 7400, 14000} in (a), {8200, 15000, 30000} in (b) and {18000, 34000, 68000} in (c).]

Fig. 8: BER performance achieved by the proposed FP and benchmarker Log-BCJR (PIVI) NoC implementations of the LTE turbo decoder, when employing a frame length of N = 6144, window sizes of (a) K = 48, (b) K = 192 and (c) K = 768, as well as the numbers of CCs that are required for the benchmarker to complete various numbers of decoding iterations I ∈ {2, 4, 8, 16}. The BER performance of a non-windowing Log-BCJR (K = N) LTE turbo decoder is also provided for the case of employing the same frame length of N = 6144 and I = 50 iterations. In each case, BPSK modulation over an AWGN channel is assumed.

VI. CONCLUSIONS

Owing to their programmable arrangement of high-performance IP cores, NoCs hold the promise of facilitating turbo decoders that can flexibly support different frame lengths with high throughputs. In this paper, we have proposed a novel NoC implementation of the LTE turbo decoder. This implementation is based upon a novel algorithm that dispenses with the serial data dependencies of the Log-BCJR algorithm, which otherwise cause frequent stalls in NoC implementations. By avoiding these stalls, our proposed NoC implementation achieves a significantly higher hardware resource utility than a benchmarker NoC implementation based on the Log-BCJR algorithm. This facilitates significant throughput improvements for a wide variety of frame lengths N and window sizes K, including a throughput gain of 213% for the case of decoding an N = 512-bit frame using a NoC comprising 16×16 IP cores. Our future work will consider the implementation of the proposed NoC in silicon.


Ra’ed Al-Dujaily received both B.Sc. and M.Sc. degrees in control and computer engineering from the University of Technology, Baghdad, Iraq, in 1998 and 2001, respectively. He received the Ph.D. degree from the School of Electrical, Electronic and Computer Engineering, Newcastle University, UK. His research interests include Network-on-Chip, reconfigurable computing, VLSI circuit design and fault tolerance in VLSI.

An Li received his first class honors BEng degree in Electronic Engineering from the University of Southampton in 2011 and his MPhil degree from the University of Cambridge in 2012. He is currently a PhD student in the Wireless Communication research group at the University of Southampton. His research interests include parallel turbo decoding algorithms and their implementations upon VLSI, FPGA and GPGPU.

Robert G. Maunder has studied with Electronics and Computer Science, University of Southampton, UK, since October 2000. He was awarded a first class honors BEng in Electronic Engineering in July 2003, as well as a PhD in Wireless Communications in December 2007. He became a lecturer in 2007 and an Associate Professor in 2013. Rob's research interests include joint source/channel coding, iterative decoding, irregular coding and modulation techniques. For further information on this research, please refer to http://users.ecs.soton.ac.uk/rm.

Terrence Mak (S’05-M’09) received both B.Eng. and M.Phil. degrees in Systems Engineering from the Chinese University of Hong Kong in 2003 and 2005, respectively, and the Ph.D. degree from Imperial College London in 2009. He joined the School of Electrical, Electronic and Computer Engineering at Newcastle University as a lecturer in 2010. During his Ph.D., he worked as a Research Engineer Intern in the VLSI group at Sun Microsystems Laboratories in Menlo Park, California. He also worked as a Visiting Research Scientist in the Poon Neuroengineering Laboratory at MIT. He was the recipient of both the Croucher Foundation Scholarship and the US Naval Research Excellence in Neuroengineering award in 2005. In 2008, he served as the Co-Chair of the UK Asynchronous Forum, and in March 2008 he was the Local Arrangement Chair of the Fourth International Workshop on Applied Reconfigurable Computing. His research interests include FPGA architecture design, Network-on-Chip, reconfigurable computing and VLSI design for biomedical applications.

Bashir M. Al-Hashimi (M’99-SM’01-F’09) is a Professor of Computer Engineering and Dean of the Faculty of Physical Sciences and Engineering at the University of Southampton, UK. He is ARM Professor of Computer Engineering and Co-Director of the ARM-ECS research centre. His research interests include methods, algorithms and design automation tools for energy-efficient embedded computing systems. He has published over 300 technical papers, authored or co-authored 5 books and has graduated 31 PhD students.

Lajos Hanzo (FREng, FIEEE, FIET, Eurasip Fellow, RS Wolfson Fellow, www-mobile.ecs.soton.ac.uk) holds the Chair of Telecommunications at Southampton University, UK. He has co-authored 1500+ IEEE Xplore entries and 20 IEEE Press & Wiley books, graduated 100+ PhD students and has an H-index of 59.