
This work is licensed under a Creative Commons Attribution 3.0 License. For more information, see http://creativecommons.org/licenses/by/3.0/.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2016.2586309, IEEE Access


Implementation of a Fully-Parallel Turbo Decoder on a General-Purpose Graphics Processing Unit

An Li, Robert G. Maunder, Bashir M. Al-Hashimi, Lajos Hanzo

Abstract—Turbo codes comprising a parallel concatenation of upper and lower convolutional codes are widely employed in state-of-the-art wireless communication standards, since they facilitate transmission throughputs that closely approach the channel capacity. However, this necessitates high processing throughputs in order for the turbo code to support real-time communications. In state-of-the-art turbo code implementations, the processing throughput is typically limited by the data dependencies that occur within the forward and backward recursions of the Log-BCJR algorithm, which is employed during turbo decoding. In contrast to the highly-serial Log-BCJR turbo decoder, we have recently proposed a novel Fully Parallel Turbo Decoder (FPTD) algorithm, which eliminates the data dependencies and performs fully parallel processing. In this paper, we propose an optimized FPTD algorithm, which reformulates the operation of the FPTD algorithm so that the upper and lower decoders have identical operation, in order to support Single Instruction Multiple Data (SIMD) operation. This allows us to develop a novel General Purpose Graphics Processing Unit (GPGPU) implementation of the FPTD, which has application in Software-Defined Radios (SDRs) and virtualized Cloud Radio Access Networks (C-RANs). As a benefit of its higher degree of parallelism, we show that our FPTD improves the processing throughput of the Log-BCJR turbo decoder by between 2.3 and 9.2 times, when employing a high-specification GPGPU. However, this is achieved at the cost of a moderate increase of the overall complexity, by between 1.7 and 3.3 times.

Index Terms—fully-parallel turbo decoder, parallel processing, GPGPU computing, software defined radio, cloud radio access network

I. INTRODUCTION

Channel coding has become an essential component of wireless communications, since it is capable of correcting the transmission errors that occur when communicating over noisy channels. In particular, turbo coding [1]–[3] is a channel coding technique that facilitates near-theoretical-limit transmission throughputs, which approach the capacity of a wireless channel.

The financial support of the EPSRC, Swindon UK, under the grants EP/J015520/1 and EP/L010550/1, and the TSB, Swindon UK, under the grant TS/L009390/1 is gratefully acknowledged. The research data for this paper is available at http://eprints.soton.ac.uk/id/eprint/385323.

Owing to this, turbo codes comprising a concatenation of upper and lower convolutional codes are widely employed in state-of-the-art mobile telephony standards, such as WiMAX [4] and LTE [5]. However, the processing throughput of turbo decoding can impose a bottleneck on the transmission throughput in real-time or throughput-demanding applications, such as flawless, high-quality video conferencing. In dedicated receiver hardware, a state-of-the-art turbo decoder Application-Specific Integrated Circuit (ASIC) may be used for eliminating the bottleneck of turbo decoding. However, this bottleneck is a particular problem in the flexible receiver architectures of Software-Defined Radio (SDR) [6] and virtualized Cloud Radio Access Network (C-RAN) [7], [8] systems, which employ only programmable devices, such as Central Processing Units (CPUs) or Field-Programmable Gate Arrays (FPGAs), which typically exhibit either a limited processing capability or a high cost. Although CPUs are capable of carrying out most of the LTE and WiMAX baseband operations, they are not well-suited to the most processor-intensive aspect, namely turbo decoding [9], [10]. Likewise, while high-performance, large FPGAs are well-suited to the parallel processing demands of state-of-the-art turbo decoding algorithms, they are relatively expensive. By contrast, General-Purpose Graphics Processing Units (GPGPUs) offer the advantage of high-performance parallel processing at a low cost. Owing to this, GPGPUs have been favoured over CPUs and FPGAs as the basis of SDRs, where a high processing throughput at a low cost is required [11], [12]. This motivates the implementation of the turbo decoding algorithm on a GPGPU, as was first demonstrated in [13], [14].

However, turbo decoder implementations typically operate on the basis of the serially-oriented Logarithmic Bahl-Cocke-Jelinek-Raviv (Log-BCJR) algorithm [15].


Fig. 1: Schematic of the Log-BCJR turbo decoder using windowing and PIVI techniques.

More specifically, this algorithm processes the bits of the message frame using both forward and backward recursions [16], which impose strict data dependencies and hence require processing that is spread over numerous consecutive clock cycles. In order to mitigate the inherent bottleneck that the serial nature of the Log-BCJR algorithm imposes on the achievable processing throughput, the above-mentioned GPGPU implementation of [13] invokes a variety of methods for increasing the parallelism of the algorithm. For example, the windowing technique of [17], [18] decomposes each frame of N bits into P equal-length windows, giving a window length of W = N/P, as shown in Figure 1. The processing throughput may be increased by a factor of P upon processing the windows concurrently, each using separate forward and backward recursions. Here, the Previous Iteration Value Initialization (PIVI) technique of [17], [18] may be employed for allowing the adjacent windows to assist each other's operation. However, the error correction capability of the PIVI Log-BCJR turbo decoder is degraded as the number P of windows is increased. For this reason, the maximum number of windows employed in previous GPGPU implementations of the LTE turbo decoder associated with N = 6144 was P = 192 [13], [17], [18], which avoids any significant error correction performance degradation and facilitates a 192-fold increase in the grade of parallelism [19]. Furthermore, the concept of trellis state-level parallelism may be employed [12].

More specifically, the forward and backward recursions of the Log-BCJR algorithm operate on the basis of trellises comprising M states per bit [15]. Since there are no data dependencies amongst the calculations performed for each of the M states, these can be performed concurrently. Since the LTE turbo decoder relies on M = 8 states, the combination of trellis state-level parallelism and windowing facilitates a degree of parallelism of up to P × M = 1536, occupying 1536 concurrent threads on a GPGPU. However, GPGPUs are typically capable of exploiting much higher degrees of parallelism than this [20], implying that the existing GPGPU-based turbo decoder implementations do not exploit the full potential for achieving a high processing throughput. Although a higher degree of parallelism may be achieved by processing several frames in parallel [21], this would only be useful when several frames were available for simultaneous decoding. Furthermore, the act of processing frames in parallel does not improve the processing latency of the turbo decoder, which hence exceeds the tolerable transmission latency of many applications.

Motivated by these issues, we previously proposed a Fully-Parallel Turbo Decoder (FPTD) algorithm [22], which dispenses with the serial data dependencies of the conventional Log-BCJR turbo decoder algorithm. This enables every bit in a frame to be processed concurrently, hence achieving a much higher degree of parallelism than previously demonstrated in the literature. Thus, the FPTD is well suited to multi-core processors [23], potentially facilitating a significant processing throughput gain relative to the state-of-the-art.


However, our previous work of [22] considered the FPTD at a purely algorithmic level, without addressing its hardware implementation. Against this background, the contributions of this paper are as follows.

1) We propose a beneficial enhancement of the FPTD algorithm of [22], so that it supports Single Instruction Multiple Data (SIMD) operation and therefore becomes better suited to implementation on a GPGPU. More specifically, we reformulate the FPTD algorithm so that the operations performed by the upper decoder are identical to those carried out by the lower decoder, despite the differences between the treatment of the systematic bits in the upper and lower encoders. The proposed SIMD FPTD algorithm also requires less high-speed memory and has a lower computational complexity than the FPTD algorithm of [22], both of which are desirable characteristics for GPGPU implementations.

2) We propose a beneficial GPGPU implementation of our SIMD FPTD for the LTE turbo code, achieving a throughput of up to 18.7 Mbps. Furthermore, our design overcomes a range of significant challenges related to topological mapping, data rearrangement and memory allocation.

3) We implement a PIVI Log-BCJR LTE turbo decoder on a GPGPU as a benchmarker, achieving a throughput of up to 8.2 Mbps when using a window size of N/P = 32, while facilitating the same BER as our SIMD FPTD. This is comparable to the throughputs of 6.8 Mbps and 4 Mbps demonstrated by the pair of state-of-the-art benchmarkers of [13] and [17], respectively.

4) We show that, when used for implementing the LTE turbo decoder, the proposed SIMD FPTD achieves a degree of parallelism that is between 4 and 24 times higher, representing a processing throughput improvement of between 2.3 and 9.2 times, as well as a latency reduction of between 2 and 8.2 times. However, this is achieved at the cost of increasing the overall complexity by a factor of between 1.7 and 3.3.

The rest of the paper is organized as follows. Section II provides an overview of GPGPU computing and its employment for the Log-BCJR turbo decoder. Section III introduces our SIMD FPTD algorithm, proposed for the implementation of the LTE turbo decoder. Section IV discusses the implementation of the proposed SIMD FPTD using a GPGPU, considering topological mapping, data rearrangement and memory allocation. Section V presents our simulation results, including the error correction performance, degree of parallelism, processing latency, processing throughput and complexity. Finally, Section VI offers our conclusions.

II. GPU COMPUTING AND IMPLEMENTATIONS

GPUs offer a flexible throughput-oriented processing architecture, which was originally designed for facilitating massively parallel numerical computations, such as 3D graphics [9] and physics simulations [24]. Additionally, GPGPU technology provides an opportunity to utilize the GPU's capability of performing several trillion Floating-point Operations Per Second (FLOPS) for general-purpose applications, such as the computations performed in an SDR platform. In particular, the Compute Unified Device Architecture (CUDA) [20] platform offers a software programming model, which enables programmers to efficiently exploit a GPGPU's computational units for general-purpose computations. As shown in Figure 2, a programmer may specify GPGPU instructions using CUDA kernels, which are software subroutines that may be called by the host CPU and then executed on the GPU's computational units. CUDA manages these computational units at three levels, corresponding to grids, thread blocks and threads. Each call of a kernel invokes a grid, which typically comprises many thread blocks, each of which typically comprises many threads, as shown in Figure 2. However, during the kernel's execution, all threads are grouped into warps, each of which comprises 32 threads. Each warp is operated in a Single Instruction Multiple Data (SIMD) [25] fashion, with all of the 32 constituent threads executing identical instructions at the same time, but on different data elements.
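To make this hierarchy concrete, the following minimal CUDA sketch (our illustration, not code from the paper) launches a kernel over a small grid and recovers each thread's position within it; the built-in warpSize variable equals 32 on current NVIDIA devices.

    #include <cstdio>

    // Minimal illustration of the CUDA execution hierarchy: one grid per kernel
    // call, many thread blocks per grid, many threads per block, with threads
    // grouped into 32-thread warps for SIMD execution.
    __global__ void hierarchyDemo()
    {
        int tid  = threadIdx.x;          // index of this thread within its block
        int bid  = blockIdx.x;           // index of this block within the grid
        int warp = tid / warpSize;       // warpSize is 32 on current devices
        printf("block %d, thread %2d -> warp %d\n", bid, tid, warp);
    }

    int main()
    {
        hierarchyDemo<<<2, 64>>>();      // a grid of 2 thread blocks, 64 threads each
        cudaDeviceSynchronize();         // wait for the kernel (and its printf) to finish
        return 0;
    }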

There are several different types of memory in a GPU, including global memory, shared memory and registers, as shown in Figure 2. Each type of memory has different properties, which may be best exploited in different circumstances in order to optimize the performance of the application.


Fig. 2: Schematic of GPU computing.

More specifically, global memory is an off-chip memory that typically has a large capacity and is accessible from the host CPU, as well as from any thread on the GPU. Global memory is typically used for exchanging data between the CPU and the GPU, although it has the highest access latency and the most limited bandwidth, compared to the other types of GPU memory. By contrast, shared memory is a user-controlled on-chip cache, which has a very high bandwidth (bytes/second) and an extremely low access latency. However, shared memory has a limited capacity and an access scope that is limited to a single thread block. Owing to this, a thread in a particular thread block cannot access any shared memory allocated to any other thread block. Furthermore, all data stored in shared memory is released automatically once the execution of the corresponding thread block is completed. In comparison to global and shared memory, registers have the largest bandwidth and the smallest access latency. However, registers have a very limited capacity and their access scope is limited to a single thread.
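As a brief illustration of how these three memory types cooperate (a sketch of our own, not the paper's kernel), the fragment below stages values from global memory through a per-thread register into block-scoped shared memory:

    // 'in' and 'out' reside in global memory, 'tile' in the shared memory of
    // each thread block, and 'r' in a per-thread register. Each block reverses
    // its own 256-element tile, exploiting the block-wide scope of shared memory.
    __global__ void reverseTiles(const float* in, float* out)
    {
        __shared__ float tile[256];                     // on-chip, visible to this block only
        int g = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
        float r = in[g];                                // register: fastest, thread-private
        tile[threadIdx.x] = r;                          // stage the value in shared memory
        __syncthreads();                                // make the tile visible block-wide
        out[g] = tile[blockDim.x - 1 - threadIdx.x];    // reversed within the block
    }

Launching this kernel as reverseTiles<<<n/256, 256>>>(in, out) for an input length n that is a multiple of 256 exercises all three access scopes described above.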

Considering these features, many previous research projects have explored the employment of GPGPUs in SDR applications, as shown in Figure 3. Note that a GPGPU-based virtualized C-RAN implementation has not yet been explored, although a C-RAN system has been implemented on the Amazon Elastic Compute Cloud (Amazon EC2) [48] using only CPUs. More specifically, [37] compared several different SDR implementation approaches in terms of programmability, flexibility, energy consumption and computing power.

2008: [26] LTE receiver; [27] Massive parallel LDPC; [28] Parallel belief propagation LDPC
2009: [29], [30] Parallel LDPC
2010: [31] Entire WiMAX handset; [13], [14] Turbo decoder
2011: [32] 2×2 MIMO WiMAX transmitter and receiver; [33] FFT, QPSK demapper and IIR filter; [34] DVB-T2 physical layer; [21] LTE turbo decoder; [35] Soft-out MIMO detector; [36] DVB-S2 LDPC
2012: [37] Software-defined radio architecture for cognitive testbeds; [18] Programmable turbo decoder; [11] Soft-out MIMO detector
2013: [38], [39] LTE turbo decoder; [40] WiMAX and WiFi LDPC
2014: [41] FFT and FIR filtering; [42] LTE transmitter and receiver; [10] WiFi and WiMAX; [17], [43] LTE turbo decoder; [44], [45] Maximum a posteriori algorithm; [46] LDPC block codes and LDPC convolutional codes
2015: [12] LTE base station; [47] Parallel LDPC

Fig. 3: Selected implementations of GPU-based SDRs, where the implementations of an entire transmission system are colored in red, whilst the implementations focusing on a particular application are colored in black.


In particular, [37] recommended the employment of a GPGPU as a co-processor to complement an ASIC, FPGA or Digital Signal Processor (DSP). Additionally, [33] characterized the performance of GPGPUs when employed for three different operations, namely a Fast Fourier Transform (FFT), a Quadrature Phase Shift Keying (QPSK) demapper and an Infinite Impulse Response (IIR) filter. Similarly, [41] compared the processing throughput and energy efficiency of a particular FPGA and a particular GPGPU, when implementing both the FFT and a Finite Impulse Response (FIR) filter.

As shown in Figure 3, [12], [26], [31], [32], [42] implemented an entire transmitter, receiver or transceiver for the LTE or WiMAX standard on an SDR platform that employs GPGPUs. Additionally, [11], [35] implemented a soft-output Multiple-Input Multiple-Output (MIMO) detector, while [34] implemented the Digital Video Broadcasting (DVB) physical layer on a GPGPU. All of these previous research efforts demonstrated that GPGPUs offer an improved processing throughput, compared to the family of implementations using only a CPU. Furthermore, [12] showed that an LTE base station supporting a peak data rate of 150 Mbps can be implemented using four NVIDIA GTX 680 GPUs, achieving a similar energy efficiency to a particular dedicated LTE baseband hardware solution. However, [12], [42] demonstrated that turbo decoding is the most processor-intensive operation of base station processing, requiring at least 64% of the processing resources used for receiving a message frame, where the remaining 36% includes the FFT, demapping, demodulation and other operations. Motivated by this, a number of previous research efforts [13], [14], [17], [18], [21], [38], [39], [43]–[45] have proposed GPGPU implementations dedicated to turbo decoding, as shown in Figure 3. Additionally, the authors of [27]–[30], [36], [40], [46], [47] have proposed GPGPU implementations of LDPC decoders.

III. SINGLE-INSTRUCTION-MULTIPLE-DATA FULLY-PARALLEL TURBO DECODER ALGORITHM

In this section, the operation of the proposed SIMD FPTD algorithm is detailed in Section III-A and it is compared with the FPTD algorithm of [22] in Section III-B.

A. Operation of the proposed SIMD FPTD algorithm

In this section, we detail our proposed SIMD FPTD algorithm for the LTE turbo decoder, using the schematic of Figure 4(a). The corresponding turbo encoder is not illustrated in this paper, since it is identical to the conventional LTE turbo encoder [5]. As in the PIVI Log-BCJR turbo decoder, the proposed SIMD FPTD employs an upper decoder and a lower decoder, which are separated by an interleaver. Accordingly, Figure 4(a) shows two rows of so-called algorithmic blocks, where the upper row constitutes the upper decoder, while the lower decoder is comprised of the lower row of algorithmic blocks. Like the PIVI Log-BCJR turbo decoder, the input to the proposed SIMD FPTD comprises Logarithmic Likelihood Ratios (LLRs) [49], where each LLR $\bar{b} = \ln[\Pr(b=1)/\Pr(b=0)]$ conveys soft information pertaining to the corresponding bit $b$ within the turbo encoder. More specifically, when decoding frames comprising N bits, this input comprises the six LLR vectors shown in Figure 4(a): (a) a vector $[b_{2,k}^{a,u}]_{k=1}^{N+3}$ comprising N+3 a priori parity LLRs for the upper decoder; (b) a vector $[b_{3,k}^{a,u}]_{k=1}^{N}$ comprising N a priori systematic LLRs for the upper decoder; (c) a vector $[b_{1,k}^{a,u}]_{k=N+1}^{N+3}$ comprising three a priori message termination LLRs for the upper decoder; (d) a vector $[b_{2,k}^{a,l}]_{k=1}^{N+3}$ comprising N+3 a priori parity LLRs for the lower decoder; (e) a vector $[b_{3,k}^{a,l}]_{k=1}^{N}$ comprising N a priori systematic LLRs for the lower decoder; (f) a vector $[b_{1,k}^{a,l}]_{k=N+1}^{N+3}$ comprising three a priori message termination LLRs for the lower decoder. Note that the vectors $[b_{2,k}^{a,u}]_{k=1}^{N+3}$ and $[b_{2,k}^{a,l}]_{k=1}^{N+3}$ include LLRs pertaining to the three parity termination bits of the two component codes [5]. Furthermore, the vector $[b_{3,k}^{a,l}]_{k=1}^{N}$ is not provided by the channel, but may instead be obtained by rearranging the order of the LLRs in the vector $[b_{3,k}^{a,u}]_{k=1}^{N}$ using the interleaver $\pi$, where $b_{3,k}^{a,l} = b_{3,\pi(k)}^{a,u}$. Moreover, as in the PIVI Log-BCJR turbo decoder, the SIMD FPTD algorithm employs the iterative operation of the upper and lower decoders. As shown in Figure 4(a), these iteratively exchange the vectors $[b_{1,k}^{e,u}]_{k=1}^{N}$ and $[b_{1,k}^{e,l}]_{k=1}^{N}$ of extrinsic LLRs through the interleaver $\pi$, in order to obtain the a priori message vectors $[b_{1,k}^{a,u}]_{k=1}^{N}$ and $[b_{1,k}^{a,l}]_{k=1}^{N}$ for the upper and lower decoders, respectively [19].


(a) FPTD algorithm for the case of employing an odd-even interleaver

(b) Mapping of the proposed SIMD FPTD algorithm onto the GPGPU

Fig. 4: Schematics of the proposed SIMD FPTD algorithm and its mapping onto the GPGPU, where $M_k$ represents global memory on the GPGPU device.

Here, $b_{1,k}^{a,l} = b_{1,\pi(k)}^{e,u}$ and $b_{1,\pi(k)}^{a,u} = b_{1,k}^{e,l}$. Following the completion of the iterative decoding process, a vector $[b_{1,k}^{p}]_{k=1}^{N}$ of a posteriori LLRs can be obtained, where $b_{1,k}^{p} = b_{1,k}^{a,u} + b_{3,k}^{a,u} + b_{1,k}^{e,u}$. Throughout the remainder of this paper, the superscripts 'u' and 'l' are used only when necessary to explicitly distinguish between the upper and lower components of the turbo code, and are omitted when the discussion applies equally to both.

As in the PIVI Log-BCJR turbo decoder, the proposed SIMD FPTD algorithm employs two half-iterations per decoder iteration. However, the two half-iterations do not correspond to the separate operation of the upper and lower decoders, as in the PIVI Log-BCJR turbo decoder. Furthermore, during each half-iteration, the proposed SIMD FPTD algorithm does not operate the algorithmic blocks of Figure 4(a) in a serial manner, using forward and backward recursions. Instead, the first half-iteration performs the fully-parallel operation of the lightly-shaded algorithmic blocks of Figure 4(a) concurrently, namely the odd-indexed blocks of the upper decoder and the even-indexed blocks of the lower decoder. Furthermore, the second half-iteration performs the concurrent operation of the remaining darkly-shaded algorithmic blocks of Figure 4(a), in a fully-parallel manner.


$$\gamma_k^t(S_{k-1}, S_k) = b_1(S_{k-1}, S_k) \cdot b_{1,k}^{a,t-1} + b_2(S_{k-1}, S_k) \cdot b_{2,k}^{a} + b_3(S_{k-1}, S_k) \cdot b_{3,k}^{a} \quad (1)$$

$$\alpha_k^t(S_k) = \max^*_{\{S_{k-1} \,|\, c(S_{k-1}, S_k) = 1\}} \left[ \gamma_k^t(S_{k-1}, S_k) + \alpha_{k-1}^{t-1}(S_{k-1}) \right] \quad (2)$$

$$\beta_{k-1}^t(S_{k-1}) = \max^*_{\{S_k \,|\, c(S_{k-1}, S_k) = 1\}} \left[ \gamma_k^t(S_{k-1}, S_k) + \beta_k^{t-1}(S_k) \right] \quad (3)$$

$$b_{1,k}^{e,t} = \max^*_{\{(S_{k-1}, S_k) \,|\, b_1(S_{k-1}, S_k) = 1\}} \left[ b_2(S_{k-1}, S_k) \cdot b_{2,k}^{a} + \alpha_{k-1}^{t-1}(S_{k-1}) + \beta_k^{t-1}(S_k) \right] - \max^*_{\{(S_{k-1}, S_k) \,|\, b_1(S_{k-1}, S_k) = 0\}} \left[ b_2(S_{k-1}, S_k) \cdot b_{2,k}^{a} + \alpha_{k-1}^{t-1}(S_{k-1}) + \beta_k^{t-1}(S_k) \right] \quad (4)$$

This decomposition of the algorithmic blocks into odd and even sets is motivated by the odd-even nature of the Quadratic Permutation Polynomial (QPP) interleaver [19] used by the LTE turbo code and of the Almost Regular Permutation (ARP) interleaver used by the WiMAX turbo code [4]. More explicitly, QPP and ARP interleavers only connect algorithmic blocks in the upper decoder that have an odd index k to blocks that also have an odd index in the lower decoder. Similarly, even-indexed blocks in the upper decoder are only connected to even-indexed blocks in the lower decoder. It is this fully-parallel operation of algorithmic blocks that yields a significantly higher degree of parallelism than the PIVI Log-BCJR turbo decoder algorithm, as well as a significantly higher decoding throughput. More specifically, rather than requiring tens or hundreds of consecutive time periods to complete the forward and backward recursions in each window of the PIVI Log-BCJR turbo decoder, the proposed SIMD FPTD algorithm completes each half-iteration using only a single time period, during which all algorithmic blocks in the corresponding set are operated concurrently. Note also that this odd-even concurrent operation of algorithmic blocks in the upper and lower decoders represents a significant difference between the FPTD algorithm and a PIVI Log-BCJR decoder employing a window length of W = 1, as considered in [38]. More specifically, a PIVI Log-BCJR decoder having a window length of W = 1 may require as many as I = 65 iterations to maintain a similar BER performance to a PIVI Log-BCJR decoder having a window length of W = 32 and I = 7 iterations [38]. By contrast, by taking advantage of the odd-even feature, our FPTD algorithm requires only I = 36 iterations to achieve this, as will be detailed in Section V-A.
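The odd-even property exploited here follows directly from the QPP definition $\pi(k) = (f_1 k + f_2 k^2) \bmod N$, since the LTE coefficients always have $f_1$ odd and $f_2$ even, while N is even. The host-side sketch below is our own illustration; the coefficients $f_1 = 263$ and $f_2 = 480$ are the values we believe the LTE specification (3GPP TS 36.212) assigns to N = 6144. It verifies that $\pi(k)$ and k always share the same parity:

    #include <cassert>

    // QPP interleaver pi(k) = (f1*k + f2*k^2) mod N, indexed from 0 here.
    unsigned qpp(unsigned k, unsigned f1, unsigned f2, unsigned N)
    {
        unsigned long long kk = k;                   // 64-bit to avoid overflow of f2*k*k
        return (unsigned)((f1 * kk + f2 * kk * kk) % N);
    }

    int main()
    {
        const unsigned N = 6144, f1 = 263, f2 = 480; // assumed LTE coefficients for N = 6144
        for (unsigned k = 0; k < N; k++)
            assert(qpp(k, f1, f2, N) % 2 == k % 2);  // parity preserved: odd-even property
        return 0;
    }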

In the t-th time period of the proposed SIMD FPTD, each algorithmic block of the corresponding odd or even shading having an index $k \in \{1, 2, 3, \ldots, N\}$ accepts five inputs and generates three outputs, as shown in Figure 4(a). In addition to the LLRs $b_{1,k}^{a,t-1}$, $b_{2,k}^{a}$ and $b_{3,k}^{a}$, the k-th algorithmic block requires the vectors $\alpha_{k-1}^{t-1} = [\alpha_{k-1}^{t-1}(S_{k-1})]_{S_{k-1}=0}^{M-1}$ and $\beta_k^{t-1} = [\beta_k^{t-1}(S_k)]_{S_k=0}^{M-1}$. Here, $\alpha_{k-1}^{t-1}(S_{k-1})$ is the forward metric provided for the state $S_{k-1} \in [0, M-1]$ in the previous time period t-1 by the preceding algorithmic block, where the LTE turbo code employs M = 8 states. Similarly, $\beta_k^{t-1}(S_k)$ is the backward metric provided for the state $S_k \in [0, M-1]$ in the previous time period by the following algorithmic block.

Fig. 5: State transition diagram of the LTE turbo code.


The k-th algorithmic block combines these inputs using four steps, which correspond to Equations (1), (2), (3) and (4), respectively. As in the conventional Log-BCJR turbo decoder, (1) obtains an a priori metric $\gamma_k^t(S_{k-1}, S_k)$ for the transition between a particular pair of states $S_{k-1}$ and $S_k$. As shown in Figure 5, for the case of the LTE turbo code, this transition implies a particular binary value for the corresponding message bit $b_{1,k}$, parity bit $b_{2,k}$ or systematic bit $b_{3,k}$. Note that since $b_3(S_{k-1}, S_k) \equiv b_1(S_{k-1}, S_k)$ for the LTE turbo code, there are only four possible values for $\gamma_k^t(S_{k-1}, S_k)$, namely $b_{2,k}^{a}$, $(b_{1,k}^{a,t-1} + b_{3,k}^{a})$, $(b_{1,k}^{a,t-1} + b_{2,k}^{a} + b_{3,k}^{a})$ and zero. All four of these possible values can be calculated using as few as two additions, as shown in Figure 6, which provides an optimized datapath for the k-th algorithmic block of the proposed SIMD FPTD. Following this, (2) and (3) may be employed to obtain the state metrics $\alpha_k^t$ and $\beta_{k-1}^t$, respectively. Here, $c(S_{k-1}, S_k)$ adopts a binary value of 1 if there is a transition between the states $S_{k-1}$ and $S_k$ in the state transition diagram of Figure 5, while

$$\max^*(\delta_1, \delta_2) = \max(\delta_1, \delta_2) + \ln\left(1 + e^{-|\delta_1 - \delta_2|}\right) \quad (5)$$

is the Jacobian logarithm [16], as employed by the Log-BCJR decoder. Note that the Jacobian logarithm may be approximated as

$$\max^*(\delta_1, \delta_2) \approx \max(\delta_1, \delta_2) \quad (6)$$

in order to reduce the computational complexity, as in the Max-Log-BCJR algorithm. Note that for those transitions having a metric $\gamma_k^t(S_{k-1}, S_k)$ of zero, the corresponding terms in (2) and (3) can be ignored, hence reducing the number of additions required. This is shown in the optimized datapath of Figure 6. Finally, (4) may be employed for obtaining the extrinsic LLR $b_{1,k}^{e,t}$, as shown in Figure 6. This LLR may then be output by the algorithmic block, as shown in Figure 4(a).

When operating the k-th algorithmic block in the first half-iteration of the iterative decoding process, the a priori message LLR provided by the other row is unavailable, hence it is initialized as $b_{1,k}^{a,t-1} = 0$, accordingly. Likewise, the forward state metrics from the neighboring algorithmic blocks are unavailable, hence these are initialized as $\alpha_{k-1}^{t-1} = [0, 0, 0, \ldots, 0]$ for $k \in [2, N]$. However, in the case of the k = 1st algorithmic block, we employ $\alpha_0^{t-1} = [0, -\infty, -\infty, \ldots, -\infty]$ in all decoding iterations, since the LTE trellis is guaranteed to start from an initial state of $S_0 = 0$. Similarly, before operating the k-th algorithmic block in the first half-iteration, we employ $\beta_k^{t-1} = [0, 0, 0, \ldots, 0]$ for $k \in [1, N-1]$. Furthermore, we employ $\beta_{N+3}^{t-1} = [0, -\infty, -\infty, \ldots, -\infty]$, since the LTE turbo code employs three termination bits to guarantee $S_{N+3} = 0$. Note that (1), (2), (3) and (4) reveal that $\beta_N$ is independent of $\alpha_N$. Therefore, the algorithmic blocks with indices $k \in [N+1, N+3]$, shown as unshaded blocks in Figure 4(a), can be processed before and independently of the iterative decoding process. This may be achieved by employing only Equations (1) and (3), where the term $b_3(S_{k-1}, S_k) \cdot b_{3,k}^{a}$ is omitted from (1). More specifically, these equations are employed in a backward recursion, in order to successively calculate $\beta_{N+2}$, $\beta_{N+1}$ and $\beta_N$, the latter of which is employed throughout the iterative decoding process by the N-th algorithmic block.
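A minimal host-side sketch of this initialization follows (our own code; it represents $-\infty$ by a large negative float, which is a common implementation convention rather than anything prescribed here):

    #include <vector>

    const float NEG_INF = -1.0e30f;  // stand-in for -infinity
    const int   M       = 8;         // states per trellis stage in the LTE turbo code

    // alpha_0 and beta_{N+3} pin the trellis to state 0, while all other state
    // metric vectors start as all-zero before the first half-iteration.
    void initMetrics(std::vector<float>& alpha, std::vector<float>& beta, int N)
    {
        alpha.assign((N + 1) * M, 0.0f);         // alpha_0 .. alpha_N
        beta.assign((N + 4) * M, 0.0f);          // beta_0 .. beta_{N+3}
        for (int s = 1; s < M; s++) {
            alpha[s] = NEG_INF;                  // alpha_0 = [0, -inf, ..., -inf]
            beta[(N + 3) * M + s] = NEG_INF;     // beta_{N+3} = [0, -inf, ..., -inf]
        }
        // beta_{N+2}, beta_{N+1} and beta_N would then be pre-computed by the
        // backward recursion of (3) over the three termination stages.
    }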

B. Comparison with the FPTD algorithm of [22]

In this section, we compare the proposed SIMD FPTD algorithm with the original FPTD algorithm of [22]. In particular, we compare the operation, temporary storage requirements and computational complexity of these decoders. Note that, in analogy to (1), the FPTD algorithm of [22] employs a summation of three a priori LLRs when operating the algorithmic blocks of the upper row having an index $k \in \{1, 2, 3, \ldots, N\}$. However, a summation of just two a priori LLRs is employed for the corresponding blocks in the lower row of the FPTD algorithm of [22], since in this case the term $b_3(S_{k-1}, S_k) \cdot b_{3,k}^{a}$ is omitted from the equivalent of (1). By contrast, the proposed SIMD FPTD algorithm employs (1) in all algorithmic blocks, ensuring that all of them operate in an identical manner, hence facilitating SIMD operation, which is desirable for GPGPU implementations. This is achieved by including the a priori systematic LLR $b_{3,k}^{a}$ in the calculation of (1), regardless of whether the algorithmic block appears in the upper or the lower row. Furthermore, in contrast to the FPTD algorithm of [22], $b_{3,k}^{a}$ is omitted from the calculation of (4), regardless of which row the algorithmic block appears in.

A further difference between the proposed SIMD FPTD algorithm and the original FPTD algorithm of [22] is motivated by reductions in memory usage and computational complexity.


Fig. 6: The optimized datapath inside the k-th algorithmic block of the proposed SIMD FPTD algorithm for the LTE turbo decoder.

More specifically, the algorithmic blocks of the proposed SIMD FPTD algorithm are redesigned to use fewer intermediate variables and computations. In particular, the transition metric $\gamma_k(S_{k-1}, S_k)$ of (1) can only adopt three non-zero values, as described above. By contrast, the original FPTD algorithm of [22] needs to calculate and store a different transition metric $\delta_k(S_{k-1}, S_k)$ for each of the sixteen transitions. The proposed approach allows a greater proportion of the intermediate variables to be stored in the GPGPU's limited number of low-latency registers, with less reliance on its high-latency memory. Since the GPGPU's low-latency registers are shared among all N algorithmic blocks, the benefit of reducing the reliance of each block on intermediate variables is magnified N-fold. Owing to this, a slight reduction in the memory usage of each algorithmic block results in a huge reduction in the total memory usage, especially when N is large.

Furthermore, while the proposed SIMD FPTD algorithm, the original FPTD algorithm of [22] and the PIVI Log-BCJR decoder all require the same number of max* operations per algorithmic block, the proposed SIMD FPTD algorithm requires the fewest additions and subtractions. More specifically, as shown in the optimized datapath for the LTE turbo code of Figure 6, the proposed SIMD FPTD algorithm requires only 45 additions and subtractions per algorithmic block. This is approximately 5% lower than the 47.5 additions and subtractions required by the original FPTD algorithm of [22], as well as approximately 19% lower than the 55.5 required by the PIVI Log-BCJR decoder. Note that this computational complexity reduction is achieved by exploiting the relationship $\max^*(A+C, B+C) = \max^*(A, B) + C$ [50]. This relationship holds for both the exact max* of (5) and the approximate max* of (6). More specifically, (4) requires sixteen additions for obtaining $\alpha_{k-1} + \beta_k$ for the sixteen transitions in the LTE trellis, eight of which also require an extra addition for obtaining $\alpha_{k-1} + \beta_k + b_{2,k}^{a}$ before the max* operation. By grouping the transitions carefully, the additions of $b_{2,k}^{a}$ can be moved to after the max* operation. Owing to this, only two additions are required, rather than eight, as shown in Figure 6. Note that the datapath of Figure 6 has been specifically optimized for the LTE turbo code. By contrast, the FPTD algorithm of [22] is optimized for general turbo code applicability, yielding a more desirable design in the case of the duo-binary WiMAX turbo code [4], for example.
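The identity is immediate from (5), since a common offset C cancels inside the correction term and distributes over the maximum:

$$\max{}^*(A+C,\, B+C) = \max(A+C,\, B+C) + \ln\left(1 + e^{-|(A+C)-(B+C)|}\right) = \max(A, B) + C + \ln\left(1 + e^{-|A-B|}\right) = \max{}^*(A, B) + C.$$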

IV. IMPLEMENTATION OF THE SIMD FPTD ALGORITHM ON A GPGPU

This section describes the implementation of the proposed SIMD FPTD algorithm on an NVIDIA GPGPU platform, adopting the Compute Unified Device Architecture (CUDA) [20]. The mapping of the SIMD FPTD algorithm onto the GPGPU and its memory allocation are discussed in Sections IV-A and IV-B, respectively.


The pseudo code of the proposed GPGPU kernel designed for implementing the SIMD FPTD algorithm is described in Section IV-C.

A. Mapping the SIMD FPTD algorithm onto a GPGPU

The proposed SIMD FPTD algorithm of Figure 4(a) may be mapped onto a CUDA GPGPU using a single kernel. Here, two approaches are compared. In the first approach, each execution of the kernel performs one half-iteration of the proposed algorithm, requiring 2I kernel repetitions in order to complete I decoding iterations. For this approach, the GPU kernel repetitions are scheduled serially by the CPU, with the CPU achieving synchronization between each pair of consecutive half-iterations. This synchronization ensures that all parts of a particular half-iteration are completed before any part of the next half-iteration begins. However, according to our experimental results, this synchronization occupies an average of 31.3% of the total processing time, owing to the communication overhead between the CPU and the GPU. Owing to this, our second approach performs all 2I half-iterations within a single GPU kernel run, eliminating the requirement for any communication between the CPU and the GPU during the iterative decoding process. However, the inter-block synchronization has to be carried out by the GPU, in order to maintain the odd-even nature of the operation. Since CUDA GPGPUs do not have any native support for inter-block synchronization, here we include the lock-free inter-block synchronization technique of [51]. We perform this synchronization at the end of every half-iteration, which reduces the time dedicated to synchronization from 31.3% to 15.5%, according to our experimental results. Owing to this superior performance compared to CPU synchronization, inter-block synchronization on the GPU is used for our proposed FPTD implementation and its performance is characterized in Section V.
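As a rough sketch of how such a lock-free barrier can be realized (our own illustration in the spirit of [51], not the paper's code), one thread per block can atomically increment a global arrival counter and spin until all blocks of the current half-iteration have arrived; calling the barrier with a monotonically increasing goal avoids ever resetting the counter. Note that this pattern is only safe when all thread blocks are resident on the GPU simultaneously.

    __device__ volatile int g_arrived = 0;    // global arrival counter shared by all blocks

    // Lock-free inter-block barrier: call with goal = gridDim.x * h at the end
    // of the h-th half-iteration (h = 1, 2, ...), so no reset is ever needed.
    __device__ void interBlockSync(int goal)
    {
        __threadfence();                      // publish this block's global-memory writes
        __syncthreads();                      // ensure the whole block reached the barrier
        if (threadIdx.x == 0) {
            atomicAdd((int*)&g_arrived, 1);   // announce this block's arrival
            while (g_arrived < goal) { }      // spin until every block has arrived
        }
        __syncthreads();                      // release the remaining threads of the block
    }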

Our kernel employs N threads, one for each of the N algorithmic blocks that are operated within each half-iteration of Figure 4(a). Here, the k-th thread processes the k-th algorithmic block in the upper or lower row, according to the odd-even arrangement of Figure 4(a), where $k \in [1, N]$. Note that it would be possible to achieve further parallelism by employing eight threads per algorithmic block, rather than just one. This would facilitate state-level parallelism, as described in Section I for the conventional GPGPU implementation of the PIVI Log-BCJR turbo decoder. However, our experiments reveal that state-level parallelism offers no advantage for the proposed SIMD FPTD algorithm. More specifically, according to the Nsight profiler of [52], the processing throughput of the proposed FPTD implementation is bounded by the memory bandwidth rather than by the memory access latency, which implies that the parallelism of N is already large enough to make the most of the GPGPU's computing resources. Furthermore, employing state-level parallelism would result in a requirement for more accesses to the global memory, in order to load the a priori LLRs $b_{1,k}^{a}$, $b_{2,k}^{a}$ and $b_{3,k}^{a}$, which would actually degrade the throughput.

The algorithmic blocks of the proposed SIMD FPTD algorithm are arranged in groups of 32, in order for the corresponding threads to form warps, which are particularly suited to SIMD operation. In order to maximize the computation throughput, special care must be taken to avoid thread divergence. This arises when 'if' and 'else' statements cause the different threads of a warp to operate differently, resulting in the serial processing of each possible outcome. However, the schematic of Figure 4(a) is prone to thread divergence, since each half-iteration comprises the operation of algorithmic blocks in both the upper and the lower row, as indicated using light and dark shading. More specifically, 'if' and 'else' statements are required to determine whether each algorithmic block resides in the top or bottom row of Figure 4(a), when deciding which inputs and outputs to consider. This motivates the alternative design of Figure 4(b), in which all algorithmic blocks within the same half-iteration have been relocated to the same row, in order to avoid these 'if' and 'else' statements. More specifically, the algorithmic blocks that have an even index in the upper row have been swapped with those from the lower row. As a result, the upper row comprises the lightly-shaded blocks labeled $u_k$ (for odd k) and $l_k$ (for even k), whilst the lower row comprises the darkly-shaded blocks labeled $u_k$ (for even k) and $l_k$ (for odd k). Consequently, the operation of alternate half-iterations corresponds to the alternate operation of the upper and lower rows of Figure 4(b). Note that this rearrangement of algorithmic blocks requires a corresponding rearrangement of inputs, outputs and memory, as will be discussed in Section IV-B.


As described in Section III, the consideration of the termination bits by the three algorithmic blocks at the end of the upper and lower rows can be isolated from the operation of the iterative processes. Therefore, we recommend the processing of all termination bits using the CPU, before beginning the iterative decoding process on the GPGPU. This aids the mapping of algorithmic blocks to warps and also avoids thread divergence, since the processing of the termination bits is not identical to that of the other bits, as shown in Figure 4(b).

B. Data arrangement and memory allocation

Note that because the proposed SIMD FPTD employs the rearranged schematic of Figure 4(b) rather than that of Figure 4(a), the corresponding datasets must also be rearranged, using swaps and mergers. More specifically, for the a priori parity LLRs $b_2^a$ and the systematic LLRs $b_3^a$, the rearrangement can be achieved by swapping the corresponding elements in the upper and lower datasets, following the same rule that was applied to the algorithmic blocks of Figure 4(b). For the forward and backward metrics $\alpha$ and $\beta$, as well as for the a priori message LLRs $b_1^a$, the rearrangement can be achieved by merging the two separate datasets for the upper and lower rows together. Furthermore, there is no need to store both the a priori and the extrinsic LLRs, since interleaving can be achieved by writing the latter into the memory used for storing the former, but in an interleaved order. Note that this arrangement also offers the benefit of minimizing memory usage, which is achieved without causing any overwriting, as shown in Figure 7. More explicitly, the k-th memory slot $M_k$ of Figure 4(b) may be used for passing the k-th forward state metrics $\alpha_k^{u/l}$ between the algorithmic blocks $u_k/l_k$ and $u_{k+1}/l_{k+1}$, for example. During the first half-iteration, the upper algorithmic block $u_k$ is operated to obtain $\alpha_k^u$, which is stored in $M_k$. Then, during the second half-iteration, the data stored in $M_k$ is provided to the algorithmic block $u_{k+1}$, before being overwritten by the new data $\alpha_k^l$, which is provided by the algorithmic block $l_k$.

As illustrated in Figure 4(b), there are a total of seven datasets that must be stored throughout the decoding process, namely $[b_{2,k}^{a,u}]_{k=1}^{N}$, $[b_{2,k}^{a,l}]_{k=1}^{N}$, $[b_{3,k}^{a,u}]_{k=1}^{N}$, $[b_{3,k}^{a,l}]_{k=1}^{N}$, $[b_{1,k}^{a}]_{k=1}^{N}$, $[\alpha_k]_{k=1}^{N}$ and $[\beta_k]_{k=1}^{N}$, requiring an overall memory resource of 21N floating-point numbers. As shown in Figure 4(b), these datasets are stored in the global memory, since it has a large capacity and is accessible from the host CPU, as well as from any thread in the GPGPU device. However, the global memory has a relatively high access latency and a limited bandwidth. In order to minimize the impact of this, each algorithmic block employs local low-latency registers to store all intermediate variables that are required multiple times within a half-iteration. More specifically, the k-th algorithmic block uses registers to store $b_{2,k}^{a}$, $(b_{1,k}^{a} + b_{3,k}^{a})$, $(b_{1,k}^{a} + b_{2,k}^{a} + b_{3,k}^{a})$, $\alpha_{k-1}$ and $\beta_k$, comprising a total of 19 floating-point numbers, as shown in Figure 6.

C. Pseudo code

Algorithm 1 describes the operation of the k-th thread, dedicated to the computation of the k-th algorithmic block, in analogy to the datapath of Figure 6. Note that the labels of Register (R) and Global memory (G) shown in Algorithm 1 indicate the type of memory used for storing the corresponding data. Each thread is grouped into four steps, as follows. The first step caches the a priori LLR $b_{2,k}^{a}$ and the a priori state metrics $\alpha_{k-1}$ and $\beta_k$ from the global memory to the local registers. Furthermore, the first step computes $b_{13}^{a} = b_{1,k}^{a,t-1} + b_{3,k}^{a}$ and $b_{123}^{a} = b_{1,k}^{a,t-1} + b_{2,k}^{a} + b_{3,k}^{a}$, before storing the results in the local registers. Following this, the second and third steps compute the extrinsic forward state metrics $\alpha_k^t$ and the extrinsic backward state metrics $\beta_{k-1}^t$, in analogy to the datapath of Figure 6. The results of these computations are written directly into the corresponding memory slot $M_k$ in the global memory, as shown in Figure 4(b). In the fourth step, the extrinsic LLR $b_{1,k}^{e,t}$ is computed and stored in the global memory. Here, interleaving or deinterleaving is achieved by storing the extrinsic LLRs into particular global memory slots, selected according to the design of the LTE interleaver. Note that the intermediate values of $\delta_0$ and $\delta_1$ require the storage of two floating-point numbers in registers, as shown in Algorithm 1. However, instead of using two new registers, they can be stored respectively in the registers that were previously used for storing the values of $b_{13}^{a}$ and $b_{123}^{a}$, since these are not required in the calculations of the fourth step. As a result, a total of 19 registers are required per thread, as discussed above.


(a) Forward state metrics $\alpha$

(b) Backward state metrics $\beta$

Fig. 7: Schematic of using the global memory to store the intermediate data of $\alpha$ and $\beta$.


V. RESULTS

In the following sub-sections, we compare the performance of the proposed GPGPU implementation of our SIMD FPTD algorithm with that of the state-of-the-art GPGPU turbo decoder implementation, in terms of error correction performance, degree of parallelism, processing throughput and complexity. Both turbo decoders were implemented using single-precision floating-point arithmetic and both were characterized using the Windows 8 64-bit operating system, an Intel [email protected] CPU, 16 GB RAM and an NVIDIA GTX680 GPGPU. This GPGPU has eight Multiprocessors (MPs) and 192 CUDA cores per MP, with a GPU clock rate of 1.06 GHz and a memory clock rate of 3 GHz.

The state-of-the-art benchmarker employs the Log-BCJR algorithm, with PIVI windowing, state-level parallelism and Radix-2 operation [17], [18]. This specific combination was selected, since it offers a high throughput and a low complexity, at a negligible cost in terms of BER degradation. This algorithm was mapped onto the GPGPU according to the approach described in [13]. Furthermore, as recommended in [13] and [18], the longest LTE frames comprising N = 6144 bits were decomposed into P ∈ {192, 128, 96, 64, 32} partitions. This is equivalent to having PIVI window lengths of W = N/P ∈ {32, 48, 64, 96, 192}, respectively.
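For concreteness, each partition count corresponds to a window length via $W = N/P$:

$$W = \frac{6144}{192} = 32, \quad \frac{6144}{128} = 48, \quad \frac{6144}{96} = 64, \quad \frac{6144}{64} = 96, \quad \frac{6144}{32} = 192.$$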

A. BER performance

Figure 8 compares the BER performance of both the PIVI Log-BCJR turbo decoder and the proposed SIMD FPTD algorithm, when employing the approximate max* operation of (6). Here, BPSK modulation was employed for transmission over an AWGN channel. For both decoders, the BER performance is provided for a relatively short frame length of N = 768 bits, as well as for the longest frame length that is supported by the LTE standard, namely N = 6144 bits. We have not included the BER performance of the two decoders when employing the exact max* operation of (5), but we found that it obeys the same trends as Figure 8.

[Fig. 8: BER performance (BER versus Eb/N0 in dB) for the PIVI Log-BCJR turbo decoder having window lengths of W ∈ {32, 48, 64, 96, 192} and performing I = 7 iterations, as compared with that of the proposed SIMD FPTD when performing I ∈ {36, 39, 42, 46, 49} iterations, for frame lengths of N = 768 and N = 6144. Here, both decoders use the approximate max* operation.]


Figure 8 characterizes the BER performance of the PIVI Log-BCJR turbo decoder when employing I = 7 iterations and window lengths of W ∈ {32, 48, 64, 96, 192}. In addition to this, Figure 8 provides the BER performance of the SIMD FPTD algorithm when performing I ∈ {36, 39, 42, 46, 49} iterations.


Algorithm 1 A kernel for computing a half-iteration of the proposed SIMD FPTD algorithm

Step 1: Loading data
for i = 0 to 7 do
  (R) $\alpha(i) \leftarrow$ (G) $\alpha^{t-1}_{k-1}(i)$
  (R) $\beta(i) \leftarrow$ (G) $\beta^{t-1}_{k}(i)$
end for
(R) $b^{a}_{13} \leftarrow$ (G) $b^{a,t-1}_{1,k}$ + (G) $b^{a}_{3,k}$
(R) $b^{a}_{2} \leftarrow$ (G) $b^{a}_{2,k}$
(R) $b^{a}_{123} \leftarrow b^{a}_{2} + b^{a}_{13}$

Step 2: Computing forward state metrics
(G) $\alpha^{t}_{k}(0) \leftarrow$ max*($\alpha(0)$, $\alpha(1) + b^{a}_{123}$)
(G) $\alpha^{t}_{k}(1) \leftarrow$ max*($\alpha(2) + b^{a}_{13}$, $\alpha(3) + b^{a}_{2}$)
(G) $\alpha^{t}_{k}(2) \leftarrow$ max*($\alpha(4) + b^{a}_{2}$, $\alpha(5) + b^{a}_{13}$)
(G) $\alpha^{t}_{k}(3) \leftarrow$ max*($\alpha(6) + b^{a}_{123}$, $\alpha(7)$)
(G) $\alpha^{t}_{k}(4) \leftarrow$ max*($\alpha(0) + b^{a}_{123}$, $\alpha(1)$)
(G) $\alpha^{t}_{k}(5) \leftarrow$ max*($\alpha(2) + b^{a}_{2}$, $\alpha(3) + b^{a}_{13}$)
(G) $\alpha^{t}_{k}(6) \leftarrow$ max*($\alpha(4) + b^{a}_{13}$, $\alpha(5) + b^{a}_{2}$)
(G) $\alpha^{t}_{k}(7) \leftarrow$ max*($\alpha(6)$, $\alpha(7) + b^{a}_{123}$)

Step 3: Computing backward state metrics
(G) $\beta^{t}_{k-1}(0) \leftarrow$ max*($\beta(0)$, $\beta(4) + b^{a}_{123}$)
(G) $\beta^{t}_{k-1}(1) \leftarrow$ max*($\beta(0) + b^{a}_{123}$, $\beta(4)$)
(G) $\beta^{t}_{k-1}(2) \leftarrow$ max*($\beta(1) + b^{a}_{13}$, $\beta(5) + b^{a}_{2}$)
(G) $\beta^{t}_{k-1}(3) \leftarrow$ max*($\beta(1) + b^{a}_{2}$, $\beta(5) + b^{a}_{13}$)
(G) $\beta^{t}_{k-1}(4) \leftarrow$ max*($\beta(2) + b^{a}_{2}$, $\beta(6) + b^{a}_{13}$)
(G) $\beta^{t}_{k-1}(5) \leftarrow$ max*($\beta(2) + b^{a}_{13}$, $\beta(6) + b^{a}_{2}$)
(G) $\beta^{t}_{k-1}(6) \leftarrow$ max*($\beta(3) + b^{a}_{123}$, $\beta(7)$)
(G) $\beta^{t}_{k-1}(7) \leftarrow$ max*($\beta(3)$, $\beta(7) + b^{a}_{123}$)

Step 4: Computing the extrinsic LLR
(R) $\delta_0 \leftarrow$ max*($\alpha(2) + \beta(5)$, $\alpha(3) + \beta(1)$)
$\delta_0 \leftarrow$ max*($\delta_0$, $\alpha(4) + \beta(2)$)
$\delta_0 \leftarrow$ max*($\delta_0$, $\alpha(5) + \beta(6)$)
$\delta_0 \leftarrow \delta_0 + b^{a}_{2}$
$\delta_0 \leftarrow$ max*($\delta_0$, $\alpha(0) + \beta(0)$)
$\delta_0 \leftarrow$ max*($\delta_0$, $\alpha(1) + \beta(4)$)
$\delta_0 \leftarrow$ max*($\delta_0$, $\alpha(6) + \beta(7)$)
$\delta_0 \leftarrow$ max*($\delta_0$, $\alpha(7) + \beta(3)$)

(R) $\delta_1 \leftarrow$ max*($\alpha(0) + \beta(4)$, $\alpha(1) + \beta(0)$)
$\delta_1 \leftarrow$ max*($\delta_1$, $\alpha(6) + \beta(3)$)
$\delta_1 \leftarrow$ max*($\delta_1$, $\alpha(7) + \beta(7)$)
$\delta_1 \leftarrow \delta_1 + b^{a}_{2}$
$\delta_1 \leftarrow$ max*($\delta_1$, $\alpha(2) + \beta(1)$)
$\delta_1 \leftarrow$ max*($\delta_1$, $\alpha(3) + \beta(5)$)
$\delta_1 \leftarrow$ max*($\delta_1$, $\alpha(4) + \beta(6)$)
$\delta_1 \leftarrow$ max*($\delta_1$, $\alpha(5) + \beta(2)$)

(G) $b^{e,t}_{1,\pi(k)} \leftarrow \delta_1 - \delta_0$
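As a rough illustration, Steps 1 and 2 of Algorithm 1 might be mapped onto a CUDA kernel along the following lines. This is a sketch under assumed array names and an assumed double-buffered layout, not the authors' implementation, which writes into the memory slots $M_k$ under the odd/even block schedule of Figure 7; the approximate max* of (6) is used here.

```cuda
// One thread per algorithmic block k = 1..N; sketch of Algorithm 1, Steps 1-2.
__device__ __forceinline__ float max_star(float a, float b) {
    // Approximate max* of (6); the exact max* of (5) would add the
    // correction term logf(1.0f + expf(-fabsf(a - b))).
    return fmaxf(a, b);
}

__global__ void fptd_forward_step(const float *alpha_in,  // stages 0..N, 8 floats each
                                  float *alpha_out,       // assumed double buffer
                                  const float *ba1,       // a priori LLRs b^a_{1,k}
                                  const float *ba2,       // a priori LLRs b^a_{2,k}
                                  const float *ba3,       // a priori LLRs b^a_{3,k}
                                  int N) {
    int k = blockIdx.x * blockDim.x + threadIdx.x + 1;
    if (k > N) return;

    // Step 1: cache the previous state metrics and a priori LLRs in registers.
    float a[8];
    for (int i = 0; i < 8; i++) a[i] = alpha_in[8 * (k - 1) + i];
    float b2   = ba2[k - 1];
    float b13  = ba1[k - 1] + ba3[k - 1];
    float b123 = b2 + b13;

    // Step 2: the eight extrinsic forward state metrics, one two-way max* each.
    float *out = &alpha_out[8 * k];
    out[0] = max_star(a[0],        a[1] + b123);
    out[1] = max_star(a[2] + b13,  a[3] + b2);
    out[2] = max_star(a[4] + b2,   a[5] + b13);
    out[3] = max_star(a[6] + b123, a[7]);
    out[4] = max_star(a[0] + b123, a[1]);
    out[5] = max_star(a[2] + b2,   a[3] + b13);
    out[6] = max_star(a[4] + b13,  a[5] + b2);
    out[7] = max_star(a[6],        a[7] + b123);
}
```

A launch of the form fptd_forward_step<<<(N + 255) / 256, 256>>>(...) would then process all N algorithmic blocks concurrently, in keeping with the degree of parallelism $D_p = N$ discussed in Section V-B.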

As may be expected, the BER performance of the PIVI Log-BCJR turbo decoder improves when employing longer window lengths W. Therefore, more iterations I of the SIMD FPTD algorithm are required in order to achieve the same BER performance as the PIVI Log-BCJR turbo decoder, when W is increased. More specifically, Figure 8 shows that when employing N = 6144-bit frames, the SIMD FPTD algorithm requires I ∈ {36, 39, 42, 46, 49} decoding iterations in order to achieve the same BER performance as the PIVI Log-BCJR turbo decoder performing I = 7 iterations with the window lengths of W ∈ {32, 48, 64, 96, 192}, respectively. Note that in all cases, the proposed SIMD FPTD algorithm is capable of achieving the same BER performance as the PIVI Log-BCJR turbo decoder, albeit at the cost of requiring a greater number of decoding iterations I. Note that the necessity for the FPTD to perform several times more iterations than the Log-BCJR turbo decoder was discussed extensively in [22].

B. Degree of parallelism

The degree of parallelism of the PIVI Log-BCJR turbo decoder may be considered to be given by $D_p^{\text{Log-BCJR}} = M \cdot N / W$, where N is the frame length, W is the window length and M = 8 is the number of states in the LTE turbo code trellis. Here, M = 8 threads can be employed for achieving state parallelism, while decoding each of the N/W windows simultaneously. By contrast, the degree of parallelism of the FPTD can be simply defined as $D_p^{\text{FPTD}} = N$, since all algorithmic blocks can be processed in parallel threads and because we do not exploit state parallelism in this case. Table I compares the parallelism $D_p$ of the proposed SIMD FPTD with that of the PIVI Log-BCJR turbo decoder, when decomposing N = 6144-bit frames into windows comprising various numbers of bits W. Depending on the window length W chosen for the PIVI Log-BCJR turbo decoder, the degree of parallelism achieved by the proposed SIMD FPTD can be seen to be between 4 and 24 times higher.
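These extremes follow directly from the two definitions, since the ratio reduces to W/M:

$$\frac{D_p^{\text{FPTD}}}{D_p^{\text{Log-BCJR}}} = \frac{N}{M \cdot N / W} = \frac{W}{M}, \qquad \frac{32}{8} = 4 \quad \text{and} \quad \frac{192}{8} = 24.$$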

C. Processing latency

Figure 9 compares the processing latency of both the proposed SIMD FPTD and of the PIVI Log-BCJR decoder, when decoding frames comprising N = 6144 bits using both the approximate max* operation of (6) and the exact max* operation of (5).


TABLE I: Comparison between the PIVI Log-BCJR turbo decoder and the proposed SIMD FPTD in terms of degree of parallelism, overall latency, pipelined throughput and complexity (IPBPHI and IPB), where N = 6144 for both decoders, I = 7 and W ∈ {32, 48, 64, 96, 192} for the PIVI Log-BCJR turbo decoder, whilst I ∈ {36, 39, 42, 46, 49} for the FPTD. Results are presented using the format x/y, where x corresponds to the case where the approximate max* operation of (6) is employed, while y corresponds to the exact max* operation of (5).

                      Degree of        Overall latency    Pipelined throughput
                      parallelism Dp   (µs)               (Mbps)                 IPW              IPBPHI         IPB
Log-BCJR  W = 32      1536             816.9 / 1041.5     8.2 / 6.3              2511 / 3842      19.6 / 30.0    274 / 420
          W = 48      1024             1111.5 / 1415.5    5.9 / 4.6              3720 / 5697      19.4 / 29.7    272 / 416
          W = 64      768              1338.9 / 1786.6    4.8 / 3.6              5044 / 7914      19.7 / 30.9    276 / 433
          W = 96      512              1945.7 / 2549.3    3.3 / 2.5              7368 / 11.3k     19.2 / 29.4    269 / 412
          W = 192     256              3694.6 / 4842.6    1.7 / 1.3              14.7k / 22.6k    19.1 / 29.4    267 / 412
FPTD      I = 36      6144             402.5 / 451.8      18.7 / 16.1            200 / 439        6.3 / 13.8     454 / 994
          I = 39      6144             427.4 / 482.1      17.4 / 14.9            200 / 439        6.3 / 13.8     491 / 1076
          I = 42      6144             454.6 / 516.4      16.2 / 13.9            200 / 439        6.3 / 13.8     529 / 1159
          I = 46      6144             486.6 / 556.2      14.8 / 12.7            200 / 439        6.3 / 13.8     580 / 1270
          I = 49      6144             513.4 / 589.6      13.9 / 12.0            200 / 439        6.3 / 13.8     617 / 1352

[Fig. 9: Latency for the proposed SIMD FPTD with I ∈ {36, 39, 42, 46, 49}, as compared with that of the PIVI Log-BCJR turbo decoder with I = 7 and W ∈ {32, 48, 64, 96, 192}. Latency (ms) is shown for the FPTD and the Log-BCJR decoder using the approximate and exact max* operations, with each bar decomposed into its memory copy and processing contributions.]

Note that different numbers of iterations I ∈ {36, 39, 42, 46, 49} are used for the SIMD FPTD, while I = 7 iterations and different window lengths W ∈ {32, 48, 64, 96, 192} are employed for the PIVI Log-BCJR turbo decoder, as discussed in Section V-A. Here, the overall latency includes two parts, namely the time used for memory transfer between the CPU and the GPU, as well as the time used for the iterative decoding process. More specifically, the memory transfer includes transferring the channel LLRs from the CPU to the GPU at the beginning of the iterative decoding process and transferring the decoded results from the GPU to the CPU at the end of that process. Therefore, the time used for memory transfer depends only on the frame length N and it is almost independent of the type of decoder and of the values of I and W, as shown in Figure 9. Note that the latency was quantified by averaging over the decoding of 5000 frames for each configuration. By contrast, the time used for the iterative decoding process differs significantly between the proposed SIMD FPTD and the Log-BCJR turbo decoder. More specifically, Table I shows that the overall latency of the SIMD FPTD is in the range from 402.5 µs to 513.4 µs, when the number of iterations is increased from I = 36 to I = 49, provided that the approximate max* operation of (6) is employed, hence meeting the sub-1 ms requirement of the LTE physical layer [53]. By contrast, the overall latency of the PIVI Log-BCJR decoder ranges from 816.9 µs to 3694.6 µs, when the window length increases from W = 32 to W = 192 and I = 7 iterations are performed, assuming that the approximate max* operation of (6) is employed. These extremities of the range are 2 times and 7.2 times worse than those of the proposed SIMD FPTD, respectively. Additionally, when the exact max* operation of (5) is employed, the overall latency of the SIMD FPTD increases by 12.3% and 14.8% for the cases of I = 36 and I = 49, compared to those obtained when employing the approximate max* operation of (6). By contrast, the overall latency increases in this case by 27.5% and 31.1% for the PIVI Log-BCJR decoder associated with W = 32 and W = 192, hence further widening the gap to the latency of the proposed SIMD FPTD.
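For reference, a latency measurement of this kind might be taken with CUDA events, as sketched below; the kernel invocation and buffer names are placeholders, and in keeping with the methodology above the result would be averaged over 5000 frames.

```cuda
#include <cuda_runtime.h>

// Sketch: time the memory transfers plus 2I half-iteration kernel launches
// for one frame. fptd_half_iteration, d_llrs, h_llrs etc. are placeholders.
float time_one_frame(int I) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    // cudaMemcpy(d_llrs, h_llrs, in_bytes, cudaMemcpyHostToDevice);
    // for (int t = 0; t < 2 * I; t++)
    //     fptd_half_iteration<<<blocks, threads>>>(/* ... */);
    // cudaMemcpy(h_decoded, d_decoded, out_bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until all recorded work has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms * 1000.0f;  // microseconds, matching the units of Table I
}
```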


D. Processing throughput

Table I presents the processing throughputs that were measured on the GPGPU, when employing the proposed SIMD FPTD and the PIVI Log-BCJR turbo decoder to decode frames comprising N = 6144 bits. Here, throughputs are presented for the case where the approximate max* operation of (6) is employed, as well as for the case of employing the exact max* operation of (5). Note that when the iterative decoding of a particular frame has been completed, its memory transfer from the GPU to the CPU can be pipelined with the iterative decoding of the next frame and with the memory transfer from the CPU to the GPU of the frame after that. Since Figure 9 shows that the iterative decoding is the slowest of these three processes, it imposes a bottleneck on the overall processing throughput. Owing to this, the throughput presented in Table I was obtained by considering only the iterative decoding process, based on the assumption that throughput = N / (latency of the iterative decoding). As shown in Table I, the proposed GPGPU implementation of the SIMD FPTD achieves throughputs of up to 18.7 Mbps. This exceeds the average throughput of 7.6 Mbps, which is typical in 100 MHz LTE uplink channels [54], demonstrating the suitability of the proposed implementation for C-RAN applications. Furthermore, higher throughputs can be achieved either by using a more advanced GPU or by using multiple GPUs in parallel.
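The three-stage pipeline described above might be arranged with CUDA streams along the following lines; the stream structure is a sketch, the buffer names and the inter-stream synchronization are placeholders, and the host buffers would need to be page-locked for the copies to overlap with the kernels.

```cuda
#include <cuda_runtime.h>

// Sketch: overlap the host-to-device copy of frame f+1 and the
// device-to-host copy of frame f-1 with the iterative decoding of frame f.
void decode_frames_pipelined(int num_frames, int I) {
    cudaStream_t copy_in, decoding, copy_out;
    cudaStreamCreate(&copy_in);
    cudaStreamCreate(&decoding);
    cudaStreamCreate(&copy_out);

    for (int f = 0; f < num_frames; f++) {
        // cudaMemcpyAsync(d_llrs[f % 3], h_llrs[f], in_bytes,
        //                 cudaMemcpyHostToDevice, copy_in);
        // ... an event would make `decoding` wait for frame f's copy ...
        // for (int t = 0; t < 2 * I; t++)
        //     fptd_half_iteration<<<blocks, threads, 0, decoding>>>(/* ... */);
        // cudaMemcpyAsync(h_decoded[f], d_decoded[f % 3], out_bytes,
        //                 cudaMemcpyDeviceToHost, copy_out);
    }
    cudaStreamSynchronize(copy_out);
    cudaStreamDestroy(copy_in);
    cudaStreamDestroy(decoding);
    cudaStreamDestroy(copy_out);
}
```

With such a schedule, the steady-state throughput is set by the slowest stage, which Figure 9 identifies as the iterative decoding itself.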

Recall from Figure 8 that the proposed SIMD FPTD performing I = 36 iterations achieves the same BER performance as the PIVI Log-BCJR turbo decoder performing I = 7 iterations and having the window length of W = 32. Here, W = 32 corresponds to the maximum degree of parallelism of P = 192 that can be achieved for the PIVI Log-BCJR turbo decoder, without imposing a significant BER performance degradation [19]. In the case of this comparison, Table I reveals that the processing throughput of the proposed SIMD FPTD is 2.3 times and 2.5 times higher than that of the PIVI Log-BCJR turbo decoder, when the approximate max* operation and the exact max* operation are employed, respectively. An even higher processing throughput improvement is offered by the proposed SIMD FPTD, when the parallelism of the PIVI Log-BCJR turbo decoder is reduced, for the sake of improving its BER performance. For example, the proposed SIMD FPTD performing I = 49 iterations has a processing throughput that is 8.2 times (approximate max*) and 9.2 times (exact max*) higher than that of the PIVI Log-BCJR turbo decoder having a window length of W = 192, while offering the same BER performance. Note that owing to its lower computational complexity, the approximate max* operation of (6) facilitates a higher processing throughput than the exact max* operation of (5), in the case of both decoders.

Furthermore, Table II compares the processing throughput of the proposed SIMD FPTD GPGPU implementation to that of other GPGPU implementations of the LTE turbo decoder found in the literature [13], [17], [18], [38], [43]. Here, the throughputs of all implementations are quantified for the case of decoding N = 6144-bit frames, when using the approximate max* operation of (6). Note that the throughputs shown in Table II for the benchmarkers employing the PIVI Log-BCJR algorithm have been linearly scaled to the case of performing I = 7 iterations, in order to perform a fair comparison. However, different GPUs are used for the different implementations, which complicates their precise performance comparison. In order to make fairer comparisons, we consider two different methods for normalizing the throughputs of the different implementations, namely $\frac{\text{throughput} \times 10^6}{\text{clock freq.} \times \text{mem. BW}}$ and $\frac{\text{throughput} \times 10^6}{\text{clock freq.} \times \text{mem. freq.}}$.

More specifically, the authors of [38] proposed a loosely synchronized parallel turbo decoding algorithm, in which the iterative operation of the partitions is not guaranteed to operate synchronously. In their contribution, the normalized throughput was obtained as $\frac{\text{throughput} \times 10^6}{\text{clock freq.} \times \text{mem. BW}}$, since the GPGPU's global memory bandwidth imposes the main bottleneck upon the corresponding implementation. Similarly, we suggest using the same normalization method for our proposed FPTD, since its performance is also bounded by the global memory bandwidth, according to the experimental results from the Nsight profiler, as discussed in Section IV-A. As shown in Table II, the benchmarker of [38] achieves a normalized throughput of 100.6, when performing I = 12 iterations for decoding N = 6144-bit frames, divided into P = 768 partitions. However, this approach results in an Eb/N0 degradation of 0.2 dB, compared to that of the conventional Log-BCJR turbo decoding algorithm employing P = 64 partitions and performing I = 6 iterations.
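Substituting this work's entries from Table II illustrates the first normalization; the raw numeric values appear to be taken in Mbps (scaled by 10^6 to bps), MHz and GB/s, respectively:

$$\frac{18.7 \times 10^6}{1006 \times 192.2} \approx 96.7 \qquad \text{and} \qquad \frac{24.9 \times 10^6}{1006 \times 192.2} \approx 128.8.$$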


TABLE II: Comparison between the proposed GPGPU implementation of our SIMD FPTD algorithm and other GPGPU implementations of turbo decoders found in the literature.

                          This work      This work    [18]               [17]         [43]        [13]          [38]
Algorithm                 SIMD FPTD      Log-BCJR     Log-BCJR           Log-BCJR     Log-BCJR    Log-BCJR      Parallel Log-BCJR
GPU                       GTX 680        GTX 680      GeForce 9800 GX2   Tesla K20c   GTX Titan   Tesla C1060   GTX 480
Clock freq. [MHz]         1006           1006         1500               706          836         1296          1401
Memory freq. [MHz]        1502           1502         1000               1300         1502        800           924
Memory bandwidth [GB/s]   192.2          192.2        128                208          288         102           177.4
Windowing mechanism       odd-even       PIVI         PIVI               PIVI         PIVI        PIVI          loose synchronization
Window size W             1              32           32                 32           32          32            8
Iterations I              36 (27¹)       7            7                  7            7           7             12¹
Throughput [Mbps]         18.7 (24.9¹)   8.2          1.9                4            3.2         6.8           25¹
Normalized throughput²    96.7 (128.8¹)  42.4         9.9                27.2         13.3        51.4          100.6¹
Normalized throughput³    12.4 (16.5¹)   5.4          1.3                3.0          4.4         6.6           19.3¹

¹ Subject to a cost of 0.2 dB degradation in frame error rate performance.
² The throughput is normalized as $\frac{\text{throughput} \times 10^6}{\text{clock freq.} \times \text{mem. BW}}$, which is appropriate for applications where the performance is bounded by the global memory bandwidth, as used in [38].
³ The throughput is normalized as $\frac{\text{throughput} \times 10^6}{\text{clock freq.} \times \text{mem. freq.}}$, which is appropriate for applications where the performance is bounded by the compute latency and the memory latency.

When tolerating this 0.2 dB degradation, our proposed SIMD FPTD algorithm requires only I = 27 iterations, rather than I = 36, as shown in Table II. In this case, the normalized throughput of our proposed SIMD FPTD algorithm is 128.8, which is 28% higher than that of the loosely synchronized parallel turbo decoding algorithm of [38]. Furthermore, our approach has the advantage of being able to maintain a constant BER performance, while the loosely synchronized parallel turbo decoding algorithm of [38] suffers from a BER performance that varies from frame to frame, owing to its asynchronous decoding process.

By contrast, using the normalization of $\frac{\text{throughput} \times 10^6}{\text{clock freq.} \times \text{mem. freq.}}$ is more appropriate for the other implementations listed in Table II, since according to our experimental results, the computational latency and the global memory access latency constitute the main bottlenecks of these implementations of the Log-BCJR algorithm. This may be attributed to the low degree of parallelism of the decoder compared to the capability of the GPGPU, particularly when only a single frame is decoded at a time. Note that the normalized throughputs obtained using the different normalization methods are not comparable to each other. As shown in Table II, the normalized throughput of 5.4 achieved by our PIVI Log-BCJR benchmarker is significantly better than those of [18], [17] and [43]. Although the benchmarker of [13] achieves a better normalized throughput of 6.6, this is achieved by decoding a batch of 100 frames at a time, which can readily achieve


a higher degree of parallelism than decoding only a single frame at a time, as in all of the other schemes, as discussed in [18]. Owing to this, the computing latency and memory latency may no longer limit the throughput performance, implying that the normalized throughput for [13] may be inappropriate. Additionally, this throughput can only be achieved when there are 100 frames available for simultaneous decoding, which may not occur frequently in practice, hence resulting in an unfair comparison with the other benchmarkers. Furthermore, while decoding several frames in parallel improves the overall processing throughput, it does not improve the processing latency of each frame.

E. Complexity

The complexity of the proposed GPGPU implementation of our SIMD FPTD algorithm may be compared with that of the PIVI Log-BCJR turbo decoder by considering the number of GPGPU instructions that are issued per bit of the message frame. This is motivated, since all GPGPU thread operations are commanded by instructions. More specifically, while performing one half-iteration and one interleaving operation for each turbo decoder, the average number of Instructions Per Warp (IPW) was measured using the NVIDIA analysis tool Nsight [52]. Using this, the average number of Instructions Per Bit (IPB) may be obtained as

$$\text{IPB} = 2I \cdot \text{IPBPHI} = \frac{2I \cdot \text{IPW} \cdot D_p}{32N},$$

where IPBPHI is the average number of Instructions Per Bit Per Half-Iteration, N is the frame length and $D_p$ is the corresponding degree of parallelism. Here, $D_p/32$ represents the total number of warps, since each warp includes 32 of the $D_p$ threads employed by the decoder.
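For example, taking the FPTD's approximate-max* figures from Table I (IPW = 200, $D_p$ = N = 6144 and I = 36) reproduces the tabulated complexities:

$$\text{IPBPHI} = \frac{\text{IPW} \cdot D_p}{32N} = \frac{200 \times 6144}{32 \times 6144} = 6.25 \approx 6.3, \qquad \text{IPB} = 2 \times 36 \times 6.3 \approx 454.$$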

Table I quantifies the IPBPHI and IPB for both the proposed SIMD FPTD and the PIVI Log-BCJR turbo decoder, when employing the approximate and exact max* operations of (6) and (5), respectively. As shown in Table I, the IPBPHI of the proposed SIMD FPTD is around one third of that of the PIVI Log-BCJR turbo decoder, when employing the approximate max* operation, although this ratio grows to one half, when employing the exact max* operation. Note however that the proposed SIMD FPTD algorithm requires more decoding iterations than the PIVI Log-BCJR turbo decoder for achieving a particular BER performance, as quantified in Section V-A. Therefore, the overall IPB complexity of the proposed SIMD FPTD is 1.7 to 3.3 times higher than that of the PIVI Log-BCJR turbo decoder, depending on the number of iterations I, the window length W and the type of max* operation performed, as shown in Table I. Note that this trend broadly agrees with that of our previous work [22], which showed that the FPTD algorithm has a complexity that is 2.9 times higher than that of the state-of-the-art LTE turbo decoder employing the Log-BCJR algorithm, as obtained by comparing the number of additions/subtractions and max* operations employed by the different algorithms. Note that the increased complexity of the FPTD represents the price that must be paid for increasing the decoding throughput by a factor of up to 9.2.

VI. CONCLUSIONS

In this paper, we have proposed a SIMD FPTD algorithm and demonstrated its implementation on a GPGPU. We have also characterized its performance in terms of BER performance, degree of parallelism, GPGPU processing throughput and complexity. Furthermore, these characteristics have been compared with those of the state-of-the-art PIVI Log-BCJR turbo decoder. This comparison shows that owing to its increased degree of parallelism, the proposed SIMD FPTD offers a processing throughput that is between 2.3 and 9.2 times higher and a processing latency that is between 2 and 8.2 times better than that of the benchmarker. However, this is achieved at the cost of requiring a greater number of iterations than the benchmarker in order to achieve a particular BER performance, which may result in a 1.7 to 3.3 times increase in overall complexity. In our future work, we will conceive techniques for disabling particular algorithmic blocks in the FPTD, once they have confidently decoded their corresponding bits. With this approach, we expect to significantly reduce the complexity of the FPTD, such that it approaches that of the Log-BCJR turbo decoder, without compromising the BER performance.


REFERENCES

[1] J. Woodard and L. Hanzo, "Comparative Study of Turbo Decoding Techniques: An Overview," IEEE Transactions on Vehicular Technology, vol. 49, pp. 2208–2233, Nov 2000.

[2] S. Chaabouni, N. Sellami, and M. Siala, "Mapping Optimization for a MAP Turbo Detector Over a Frequency-Selective Channel," IEEE Transactions on Vehicular Technology, vol. 63, pp. 617–627, Feb 2014.

[3] M. Brejza, L. Li, R. Maunder, B. Al-Hashimi, C. Berrou, and L. Hanzo, "20 Years of Turbo Coding and Energy-Aware Design Guidelines for Energy-Constrained Wireless Applications," IEEE Communications Surveys & Tutorials, vol. PP, pp. 1–1, Jun 2015.

[4] IEEE, "IEEE Standard for Local and Metropolitan Area Networks Part 16: Air Interface for Broadband Wireless Access Systems," 2012.

[5] ETSI, "LTE; Evolved Universal Terrestrial Radio Access (E-UTRA); Multiplexing and channel coding," Feb 2013.

[6] A. A. Abidi, "The Path to the Software-Defined Radio Receiver," IEEE Journal of Solid-State Circuits, vol. 42, pp. 954–966, May 2007.

[7] P. Demestichas, A. Georgakopoulos, D. Karvounas, K. Tsagkaris, V. Stavroulaki, J. Lu, C. Xiong, and J. Yao, "5G on the Horizon: Key Challenges for the Radio-Access Network," IEEE Vehicular Technology Magazine, vol. 8, pp. 47–53, Sep 2013.

[8] D. Wubben, P. Rost, J. S. Bartelt, M. Lalam, V. Savin, M. Gorgoglione, A. Dekorsy, and G. Fettweis, "Benefits and Impact of Cloud Computing on 5G Signal Processing: Flexible centralization through cloud-RAN," IEEE Signal Processing Magazine, vol. 31, pp. 35–44, Nov 2014.

[9] W. Chen, Y. Chang, S. Lin, L. Ding, and L. Chen, "Efficient Depth Image Based Rendering with Edge Dependent Depth Filter and Interpolation," in 2005 IEEE International Conference on Multimedia and Expo, (Amsterdam), pp. 1314–1317, IEEE, Jul 2005.

[10] R. Li, Y. Dou, J. Zhou, L. Deng, and S. Wang, "CuSora: Real-time Software Radio using Multi-core Graphics Processing Unit," Journal of Systems Architecture, vol. 60, pp. 280–292, Mar 2014.

[11] S. Roger, C. Ramiro, A. Gonzalez, V. Almenar, and A. M. Vidal, "Fully Parallel GPU Implementation of a Fixed-Complexity Soft-Output MIMO Detector," IEEE Transactions on Vehicular Technology, vol. 61, pp. 3796–3800, Oct 2012.

[12] Q. Zheng, Y. Chen, H. Lee, R. Dreslinski, C. Chakrabarti, A. Anastasopoulos, S. Mahlke, and T. Mudge, "Using Graphics Processing Units in an LTE Base Station," Journal of Signal Processing Systems, vol. 78, pp. 35–47, Jan 2015.

[13] M. Wu, Y. Sun, and J. R. Cavallaro, "Implementation of a 3GPP LTE turbo decoder accelerator on GPU," in IEEE Workshop on Signal Processing Systems, SiPS: Design and Implementation, (San Francisco, CA), pp. 192–197, IEEE, Oct 2010.

[14] D. Lee, M. Wolf, and H. Kim, "Design Space Exploration of the Turbo Decoding Algorithm on GPUs," in Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems, CASES '10, (Scottsdale, AZ, USA), pp. 217–226, ACM Press, Oct 2010.

[15] L. Bahl, J. Cocke, F. Jelinek, and J. Raviv, "Optimal Decoding of Linear Codes for Minimizing Symbol Error Rate (Corresp.)," IEEE Transactions on Information Theory, vol. 20, pp. 284–287, Mar 1974.

[16] P. Robertson, E. Villebrun, and P. Hoeher, "A Comparison of Optimal and Sub-optimal MAP Decoding Algorithms Operating in the Log Domain," in Proceedings IEEE International Conference on Communications ICC '95, vol. 2, (Seattle, WA, USA), pp. 1009–1013, IEEE, Jun 1995.

[17] Y. Zhang, Z. Xing, L. Yuan, C. Liu, and Q. Wang, "The Acceleration of Turbo Decoder on the Newest GPGPU of Kepler Architecture," in 14th International Symposium on Communications and Information Technologies (ISCIT), (Incheon), pp. 199–203, IEEE, Sep 2014.

[18] D. R. N. Yoge and N. Chandrachoodan, "GPU Implementation of a Programmable Turbo Decoder for Software Defined Radio Applications," in 2012 25th International Conference on VLSI Design, (Hyderabad), pp. 149–154, IEEE, Jan 2012.

[19] A. Nimbalker, T. Blankenship, B. Classon, T. Fuja, and D. Costello, "Contention-Free Interleavers for High-Throughput Turbo Decoding," IEEE Transactions on Communications, vol. 56, pp. 1258–1267, Aug 2008.

[20] NVIDIA, CUDA C Programming Guide, v6.5 ed., Mar 2015.

[21] M. Wu, Y. Sun, G. Wang, and J. R. Cavallaro, "Implementation of a High Throughput 3GPP Turbo Decoder on GPU," Journal of Signal Processing Systems, vol. 65, pp. 171–183, Nov 2011.

[22] R. G. Maunder, "A Fully-Parallel Turbo Decoding Algorithm," IEEE Transactions on Communications, vol. 63, pp. 2762–2775, Aug 2015.

[23] L. Sousa, S. Momcilovic, V. Silva, and G. Falcao, "Multi-core Platforms for Signal Processing: Source and Channel Coding," in 2009 IEEE International Conference on Multimedia and Expo, (New York, NY), pp. 1809–1812, IEEE, Jun 2009.

[24] E. Hermann, B. Raffin, F. Faure, T. Gautier, and J. Allard, Multi-GPU and Multi-CPU Parallelization for Interactive Physics Simulations, vol. 6272 of Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2010.

[25] S. Cook, CUDA Programming: A Developer's Guide to Parallel Computing with GPUs. Applications of GPU Computing Series, Elsevier Science, 2012.

[26] H. Berg, C. Brunelli, and U. Lucking, "Analyzing Models of Computation for Software Defined Radio Applications," in International Symposium on System-on-Chip, (Tampere), pp. 1–4, IEEE, Nov 2008.

[27] G. Falcao, L. Sousa, and V. Silva, "Massive Parallel LDPC Decoding on GPU," in Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming - PPoPP '08, (New York, New York, USA), p. 83, ACM Press, 2008.

[28] S. Wang, S. Cheng, and Q. Wu, "A Parallel Decoding Algorithm of LDPC Codes using CUDA," in 2008 42nd Asilomar Conference on Signals, Systems and Computers, pp. 171–175, IEEE, Oct 2008.

[29] G. Falcao, S. Yamagiwa, V. Silva, and L. Sousa, "Parallel LDPC Decoding on GPUs Using a Stream-Based Computing Approach," Journal of Computer Science and Technology, vol. 24, pp. 913–924, Sep 2009.

[30] G. Falcao, V. Silva, and L. Sousa, "How GPUs Can Outperform ASICs for Fast LDPC Decoding," in Proceedings of the 23rd International Conference on Supercomputing - ICS '09, (New York, New York, USA), p. 390, ACM Press, 2009.

[31] J. Kim, S. Hyeon, and S. Choi, "Implementation of an SDR System using Graphics Processing Unit," IEEE Communications Magazine, vol. 48, pp. 156–162, Mar 2010.

[32] C. Ahn, J. Kim, J. Ju, J. Choi, B. Choi, and S. Choi, "Implementation of an SDR Platform using GPU and Its Application to a 2 x 2 MIMO WiMAX System," Analog Integrated Circuits and Signal Processing, vol. 69, pp. 107–117, Dec 2011.

[33] P.-H. Horrein, C. Hennebert, and F. Petrot, "Integration of GPU Computing in a Software Radio Environment," Journal of Signal Processing Systems, vol. 69, pp. 55–65, Oct 2012.

[34] S. Gronroos, K. Nybom, and J. Bjorkqvist, "Complexity Analysis of Software Defined DVB-T2 Physical Layer," Analog Integrated Circuits and Signal Processing, vol. 69, pp. 131–142, Dec 2011.

[35] M. Wu, Y. Sun, S. Gupta, and J. R. Cavallaro, "Implementation of a High Throughput Soft MIMO Detector on GPU," Journal of Signal Processing Systems, vol. 64, pp. 123–136, Jul 2011.

[36] G. Falcao, J. Andrade, V. Silva, and L. Sousa, "Real-time DVB-S2 LDPC Decoding on Many-core GPU Accelerators," in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1685–1688, IEEE, May 2011.

[37] M. Dardaillon, K. Marquet, T. Risset, and A. Scherrer, "Software Defined Radio Architecture Survey for Cognitive Testbeds," in 2012 8th International Wireless Communications and Mobile Computing Conference (IWCMC), (Limassol), pp. 189–194, IEEE, Aug 2012.

[38] X. Jiao, C. Chen, P. Jaaskelainen, V. Guzma, and H. Berg, "A 122Mb/s Turbo Decoder using a Mid-range GPU," in 2013 9th International Wireless Communications and Mobile Computing Conference (IWCMC), (Sardinia), pp. 1090–1094, IEEE, Jul 2013.

[39] M. Wu, G. Wang, B. Yin, C. Studer, and J. R. Cavallaro, "HSPA+/LTE-A Turbo Decoder on GPU and Multi-core CPU," in 2013 Asilomar Conference on Signals, Systems and Computers, (Pacific Grove, CA), pp. 824–828, IEEE, Nov 2013.

[40] G. Wang, M. Wu, B. Yin, and J. R. Cavallaro, "High Throughput Low Latency LDPC Decoding on GPU for SDR Systems," in 2013 IEEE Global Conference on Signal and Information Processing, pp. 1258–1261, IEEE, Dec 2013.

[41] V. Adhinarayanan, T. Koehn, K. Kepa, W.-c. Feng, and P. Athanas, "On the Performance and Energy Efficiency of FPGAs and GPUs for Polyphase Channelization," in International Conference on ReConFigurable Computing and FPGAs (ReConFig14), (Cancun), pp. 1–7, IEEE, Dec 2014.

[42] S. Bang, C. Ahn, Y. Jin, S. Choi, J. Glossner, and S. Ahn, "Implementation of LTE System on an SDR Platform using CUDA and UHD," Analog Integrated Circuits and Signal Processing, vol. 78, pp. 599–610, Mar 2014.

[43] H. Ahn, Y. Jin, S. Han, S. Choi, and S. Ahn, "Design and Implementation of GPU-based Turbo Decoder with a Minimal Latency," in The 18th IEEE International Symposium on Consumer Electronics (ISCE 2014), (JeJu Island), pp. 1–2, IEEE, Jun 2014.

[44] J. A. Briffa, "A GPU Implementation of a MAP Decoder for Synchronization Error Correcting Codes," IEEE Communications Letters, vol. 17, pp. 996–999, May 2013.

[45] J. A. Briffa, "Graphics Processing Unit Implementation and Optimisation of a Flexible Maximum A-posteriori Decoder for Synchronisation Correction," The Journal of Engineering, pp. 1–13, Jan 2014.

[46] Z. Yue and F. C. M. Lau, "Implementation of Decoders for LDPC Block Codes and LDPC Convolutional Codes Based on GPUs," IEEE Transactions on Parallel and Distributed Systems, vol. 25, pp. 663–672, Mar 2014.

[47] J.-H. Hong and K.-S. Chung, "Parallel LDPC Decoding on a GPU using OpenCL and Global Memory for Accelerators," in 2015 IEEE International Conference on Networking, Architecture and Storage (NAS), pp. 353–354, IEEE, Aug 2015.

[48] F. Ge, H. Lin, A. Khajeh, C. J. Chiang, M. E. Ahmed, W. B. Charles, W.-c. Feng, and R. Chadha, "Cognitive Radio Rides on the Cloud," in Military Communications Conference, (San Jose, CA), pp. 1448–1453, IEEE, Oct 2010.

[49] C. Berrou and A. Glavieux, "Near Optimum Error Correcting Coding and Decoding: Turbo-codes," IEEE Transactions on Communications, vol. 44, pp. 1261–1271, Oct 1996.

[50] J. Erfanian, S. Pasupathy, and G. Gulak, "Reduced Complexity Symbol Detectors with Parallel Structure for ISI Channels," IEEE Transactions on Communications, vol. 42, pp. 1661–1671, Feb 1994.

[51] S. Xiao and W.-c. Feng, "Inter-block GPU Communication via Fast Barrier Synchronization," in 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), pp. 1–12, IEEE, Apr 2010.

[52] NVIDIA, NVIDIA Nsight Visual Studio Edition 4.6 User Guide, 4.6 ed., 2015.

[53] ETSI, "ETSI TS 136 213," vol. V12.7.0, 2015.

[54] Motorola, "Real-World LTE Performance for Public Safety," White paper, Sep 2010.

An Li received his first class honors BEng degree in Electronic Engineering from the University of Southampton in 2011 and his MPhil degree from the University of Cambridge in 2012. He is currently a PhD student in the Wireless Communication research group at the University of Southampton. His research interests include parallel turbo decoding algorithms and their implementations upon VLSI, FPGA and GPGPU.

Robert G. Maunder has studied with Electronics and Computer Science, University of Southampton, UK, since October 2000. He was awarded a first class honors BEng in Electronic Engineering in July 2003, as well as a PhD in Wireless Communications in December 2007. He became a lecturer in 2007 and an Associate Professor in 2013. Rob's research interests include joint source/channel coding, iterative decoding, irregular coding and modulation techniques. For further information on this research, please refer to http://users.ecs.soton.ac.uk/rm.

Bashir M. Al-Hashimi (M'99-SM'01-F'09) is a Professor of Computer Engineering and Dean of the Faculty of Physical Sciences and Engineering at the University of Southampton, UK. He is ARM Professor of Computer Engineering and Co-Director of the ARM-ECS research centre. His research interests include methods, algorithms and design automation tools for energy-efficient embedded computing systems. He has published over 300 technical papers, authored or co-authored 5 books and has graduated 31 PhD students.


Lajos Hanzo (FREng, FIEEE, FIET, EURASIP Fellow, RS Wolfson Fellow, www-mobile.ecs.soton.ac.uk) holds the Chair of Telecommunications at Southampton University, UK. He has co-authored 1577 IEEE Xplore entries and 20 IEEE Press & Wiley books, graduated 100+ PhD students and has 24 000+ citations.