
Novel Interpolation and Polynomial Selection for Low-Complexity Chase Soft-Decision Reed-Solomon Decoding


1318 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 20, NO. 7, JULY 2012

    Transactions Briefs

    Novel Interpolation and Polynomial Selection for Low-Complexity Chase Soft-Decision Reed-Solomon Decoding

    Xinmiao Zhang, Yingquan Wu, Jiangli Zhu, and Yu Zheng

    Abstract—Algebraic soft-decision decoding (ASD) of Reed-Solomon (RS) codes can achieve substantial coding gain with polynomial complexity. Particularly, the low-complexity Chase (LCC) ASD decoding has a better performance-complexity tradeoff. In the LCC decoding, 2^η test vectors need to be interpolated over, and a polynomial selection scheme needs to be employed to select one interpolation output to send to the rest of the decoding steps. The interpolation and polynomial selection can account for a significant part of the LCC decoder area, especially in the case of long RS codes and large η. In this paper, simplifications are first proposed for a low-complexity polynomial selection scheme. Then a novel interpolation scheme is developed by making use of the simplified polynomial selection. Instead of interpolating over each vector, our scheme first generates the information necessary for the polynomial selection. Then only the selected vectors are interpolated over. The proposed interpolation and polynomial selection schemes can lead to 162% higher efficiency in terms of throughput-over-area ratio for an example LCC decoder for a (458, 410) RS code over GF(2^9).

    Index Terms—Algebraic soft-decision decoding (ASD), interpolation, polynomial selection, Reed-Solomon (RS) codes, VLSI design.

    I. INTRODUCTION

    Reed-Solomon (RS) codes are widely adopted in digital communication and storage systems. Algebraic soft-decision decoding (ASD) algorithms were proposed to incorporate the channel information into an interpolation-based decoding process. They can achieve significant coding gain with polynomial complexity. Among existing ASD algorithms, the low-complexity Chase (LCC) [1] decoding, which tests 2^η vectors of points with multiplicity one, can achieve a better performance-complexity tradeoff.

    ASD algorithms share two major steps: the interpolation and factorization. Applying the re-encoding and coordinate transformation [2] to the LCC decoding, the number of points to be interpolated in each test vector can be reduced from n to n−k for an (n, k) RS code. In addition, as proved in [3], the factorization can be eliminated from the re-encoded LCC decoding. Interpolation architectures based on Kötter's forward interpolation [4] were developed in [5]. Nevertheless, interpolating over each of the 2^η test vectors from the beginning leads to high complexity. A backward interpolation scheme was proposed in [6] to delete points from a given interpolation result. Accordingly, the interpolation result of a test vector with only one point different from the current vector can be derived by one backward and one Kötter's forward interpolation iteration. Moreover, these two interpolations can be unified in one single iteration [7]. To further reduce the latency, the parallel interpolator in [8] can be used to generate multiple outputs at a time. Processing all interpolation outputs would lead to a large computational requirement. Hence, polynomial selection schemes were developed in

    Manuscript received December 07, 2010; revised April 14, 2011; accepted April 26, 2011. Date of publication June 07, 2011; date of current version June 01, 2012. This work was supported by the National Science Foundation under Grant 0846331 and Grant 0802159.

    X. Zhang, J. Zhu, and Y. Zheng are with Case Western Reserve University, Cleveland, OH 44106 USA (e-mail: [email protected]; [email protected]; [email protected]).

    Y. Wu is with Link_A_Media, Santa Clara, CA 95051 USA.

    Digital Object Identifier 10.1109/TVLSI.2011.2150254

    [1], [7] to pick only one interpolation result to send to the remaining decoding steps. Both of them are based on root search over finite fields.

    Despite all available techniques, the interpolation and root-search-based polynomial selection still account for a significant proportion of the LCC decoder area, especially when the finite field is large and/or parallel processing needs to be employed to reduce the latency. One such case is magnetic recording, which usually requires an RS codeword length of 4 Kbits or longer and large η.

    In [9], a low-complexity polynomial selection scheme was proposed for the re-encoded LCC decoder. By presetting one message symbol in the encoding process, the polynomial selection only needs to test whether the evaluation value of a polynomial constructed from the interpolation output over the preset point is zero. Although the encoder needs to be modified and one message symbol is sacrificed, this polynomial selection leads to great complexity reduction.

    In this paper, the polynomial selection in [9] is further simplified so that the zero testing can be done directly on the evaluation value of the interpolation output. Then a novel interpolation scheme is proposed by making use of the simplified polynomial selection. Our scheme first derives the evaluation value needed for the polynomial selection without going through the entire interpolation process for each test vector. Then only the selected vectors are interpolated over. Since a single evaluation value, instead of the whole interpolation output polynomial, is all that is needed to tell whether the corresponding vector should be selected, our proposed scheme requires significantly lower complexity. In addition, efficient architectures are developed to implement the proposed schemes. Based on synthesis results, the proposed interpolation and polynomial selection architecture requires 75% less area and achieves 37% higher throughput compared to the previous best design [8] for a (458, 410) LCC RS decoder over GF(2^9). From hardware complexity analysis, the proposed design leads to 162% higher efficiency in the overall decoder in terms of throughput-over-area ratio. Furthermore, as η increases, the saving that can be brought by the proposed design becomes more significant.

    II. LCC DECODING ALGORITHM

    This paper considers an (n, k) RS code over a finite field GF(q). ASD algorithms consist of multiplicity assignment, interpolation, and factorization steps, and are only different in the first step. Among available ASD algorithms, the LCC decoding can achieve a better performance-complexity tradeoff. The LCC multiplicity assignment first selects the η least reliable code positions. For each unreliable code position, two points, (α_i, y_i) and (α_i, ỹ_i), are assigned. Here α_i is the field element used for the evaluation map encoding, and y_i and ỹ_i are the hard-decision and the second most likely symbols for the ith position. For each of the rest of the positions, only (α_i, y_i) is assigned. All these points have multiplicity one. 2^η test vectors are formed by choosing one point from each position, and decoding needs to be done for each vector.
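    The formation of the 2^η test vectors described above can be sketched as follows; the toy η and all symbol values are invented for illustration only.

```python
# Hypothetical toy example: form the 2**ETA test vectors by choosing, for
# each unreliable position, either the hard-decision symbol or the second
# most likely symbol. The symbol values below are made up.
from itertools import product

ETA = 3
hard = [5, 1, 7]    # hypothetical hard-decision symbols y_i
second = [2, 4, 3]  # hypothetical second most likely symbols

vectors = [tuple(s if b else h for h, s, b in zip(hard, second, bits))
           for bits in product((0, 1), repeat=ETA)]

assert len(vectors) == 2 ** ETA       # 8 distinct test vectors
assert len(set(vectors)) == 2 ** ETA  # no two choices coincide
```

    Each bit pattern corresponds to one path in the test-vector tree discussed later in the paper.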

    The interpolation finds a polynomial, Q(x, y), with minimum weighted degree that passes each point with its associated multiplicity. To reduce the complexity of the interpolation, the re-encoding and coordinate transformation [4] have to be employed. Fig. 1 shows the re-encoded LCC decoder. Denote the received word by r, and the set of the k most reliable code positions by R̄. The re-encoding is to generate a codeword c, such that r_i = c_i for i ∈ R̄. Then the errors can be found by decoding r − c. Since r_i − c_i = 0 for i ∈ R̄, the interpolation over the corresponding points can be eliminated by applying a coordinate transformation.

    1063-8210/$26.00 © 2011 IEEE


    Fig. 1. Block diagram of the re-encoded LCC decoder.

    Fig. 2. FERs of the LCC decoding for a (458, 410) RS code over an EPR4-equalized magnetic recording channel with 100% AWGN.

    Accordingly, only n−k points need to be interpolated over in each test vector. In the LCC decoding, each interpolation output is in the format Q(x, y) = q_0(x) + q_1(x)y. It was proved in [3] that q_1(x) and q_0(x) can be used as the locator and evaluator, respectively, to correct the errors in r − c. Hence, the factorization step can be removed. Moreover, the entire codeword can be efficiently recovered using the full Chien-search-based (FCSB) scheme in [10].

    The interpolation over each test vector can be carried out by Kötter's forward interpolation [4], which is listed in Algorithm A for the case of the LCC decoding. Passing all 2^η interpolation outputs to the following decoder blocks would cause very high computational complexity. Hence, polynomial selection schemes [1], [7] were developed to pick only one Q(x, y). Both of them are based on root search over finite fields, and are very hardware-demanding. A novel polynomial selection was proposed for the re-encoded LCC decoder in [9]. By presetting a message symbol in the encoder, the polynomial selection can be carried out by testing whether the evaluation value of a polynomial constructed from Q(x, y) over the preset point is zero. If it is zero, the corresponding Q(x, y) is selected. Both the constructed polynomial and its evaluation value can be computed with simple hardware units. Although the encoder needs to be modified and one message symbol is sacrificed, the proposed polynomial selection leads to great complexity reduction.

    Algorithm A: Kötter's Interpolation Algorithm

    Initialize: q^{(0)}(x, y) = 1, q^{(1)}(x, y) = y

    Start: for each interpolation point (x_j, y_j):

    A1: compute the discrepancy coefficients Δ_l = q^{(l)}(x_j, y_j) for l = 0, 1

    A2: among the candidates with Δ_l ≠ 0, let l* be the one whose q^{(l)}(x, y) has the minimum weighted degree

    A3: q^{(l)}(x, y) ← Δ_{l*} q^{(l)}(x, y) − Δ_l q^{(l*)}(x, y) for l ≠ l*

    A4: q^{(l*)}(x, y) ← (x − x_j) q^{(l*)}(x, y)

    Output: the candidate q^{(l)}(x, y) with the minimum weighted degree
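    Steps A1-A4 above can be sketched in software as follows. This is a minimal sketch only: the small field GF(2^4) with primitive polynomial x^4 + x + 1, a y-weight of one, and the interpolation points are all illustrative assumptions, not the parameters used in the paper.

```python
# Minimal sketch of Koetter's forward interpolation (Steps A1-A4) for
# multiplicity-one points, with candidates Q(x, y) = q0(x) + q1(x)*y.
PRIM = 0b10011  # x^4 + x + 1, defines GF(2^4)

def gf_mul(a, b):
    """Multiply two GF(2^4) elements (carryless multiply with reduction)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= PRIM
    return r

def poly_eval(p, x):
    """Horner evaluation of a univariate polynomial (p[i] = coeff of x^i)."""
    r = 0
    for c in reversed(p):
        r = gf_mul(r, x) ^ c
    return r

def poly_scale(p, s):
    return [gf_mul(c, s) for c in p]

def poly_add(p, q):  # addition is XOR in characteristic 2
    n = max(len(p), len(q))
    p = p + [0] * (n - len(p)); q = q + [0] * (n - len(q))
    return [a ^ b for a, b in zip(p, q)]

def mul_by_root(p, xj):
    """Multiply p(x) by (x - xj), which equals (x + xj) over GF(2^m)."""
    return poly_add([0] + p, poly_scale(p, xj))

def koetter_interpolate(points):
    # candidates held as (q0, q1, weighted_degree)
    cands = [([1], [0], 0), ([0], [1], 1)]  # Q = 1 and Q = y
    for xj, yj in points:
        # A1: discrepancy coefficients
        d = [poly_eval(q0, xj) ^ gf_mul(poly_eval(q1, xj), yj)
             for q0, q1, _ in cands]
        nz = [l for l in range(len(cands)) if d[l]]
        if not nz:
            continue
        # A2: nonzero-discrepancy candidate of minimum weighted degree
        ls = min(nz, key=lambda l: cands[l][2])
        q0s, q1s, _ = cands[ls]
        new = []
        for l, (q0, q1, wd) in enumerate(cands):
            if l == ls:
                # A4: multiply the chosen candidate by (x - xj)
                new.append((mul_by_root(q0, xj), mul_by_root(q1, xj), wd + 1))
            else:
                # A3: cancel the discrepancy using the chosen candidate
                new.append((poly_add(poly_scale(q0, d[ls]), poly_scale(q0s, d[l])),
                            poly_add(poly_scale(q1, d[ls]), poly_scale(q1s, d[l])),
                            wd))
        cands = new
    # Output: candidate with minimum weighted degree
    q0, q1, _ = min(cands, key=lambda c: c[2])
    return q0, q1

points = [(2, 7), (3, 1), (5, 9)]  # made-up interpolation points
q0, q1 = koetter_interpolate(points)
for xj, yj in points:  # Q vanishes at every interpolated point
    assert poly_eval(q0, xj) ^ gf_mul(poly_eval(q1, xj), yj) == 0
```

    The update in A3 works because Δ_{l*}Q_l − Δ_l Q_{l*} has a zero discrepancy at the new point while preserving the zeros at all previously interpolated points.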

    Fig. 2 shows the frame error rates (FERs) of the LCC decoding for a (458, 410) RS code over GF(2^9) when every vector is tested, and when the polynomial selection in [9] is adopted to pick the first two qualified Q(x, y) in the case that the vectors are tested in order of decreasing reliability. These simulations are carried out over an EPR4-equalized magnetic recording channel with 100% additive white Gaussian noise (AWGN), and the modification to the LCC multiplicity assignment [11] is employed to improve the performance. This modification adds common erasures to all test vectors and does not affect the rest of the decoding steps. Note that the code rate actually decreases to 409/458 when the polynomial selection in [9] is used. This loss has been taken into account in the SNR for fair comparison. As shown in Fig. 2, using the polynomial selection in [9] only leads to negligible performance degradation. For reference, the curves of the hard-decision decoding (HDD) and the Kötter-Vardy (KV) ASD decoding [12] with maximum multiplicity are also included in Fig. 2.

    III. REDUCED-COMPLEXITY POLYNOMIAL SELECTION AND INTERPOLATION

    Although efficient architectures [6]-[8] have been developed based on backward-forward LCC interpolation, the interpolation still occupies a significant proportion of the LCC decoder area, especially when the finite field is large and/or parallel processing needs to be employed to reduce the latency. In this section, after further simplifying the polynomial selection in [9], a novel scheme is developed to significantly reduce the interpolation complexity.

    As mentioned previously, the evaluation value of a polynomial constructed from the interpolation output is tested for polynomial selection in [9]. It can be derived that this value differs from the evaluation value of the interpolation output Q(x, y) itself over the preset point only by a nonzero factor. Hence, whether the constructed polynomial evaluates to zero can be told by testing the evaluation value of Q(x, y) instead. Accordingly, the polynomial selection can be done by picking the Q(x, y) whose evaluation value over the preset point is zero. This simplification does not affect the error-correcting performance, and can save a finite field inversion. More importantly, the evaluation value of Q(x, y) can be derived by tracing the polynomial updating during the interpolation without knowing the two separate evaluation values of q_0(x) and q_1(x) as needed in the original computation. This property greatly facilitates the interpolation scheme proposed next.

    Algorithm B: Proposed Interpolation Scheme

    Initialize: set the initial candidate polynomials as in Algorithm A; initialize their evaluation values over the preset point and over the points of the unreliable positions

    Start:

    B1: Interpolate over the points with code positions in S_1; update the initial evaluation values to follow the polynomial updating

    B2: Update the evaluation values to derive the test value for each vector; select the first two test vectors whose test value is zero

    B3: Interpolate over the rest of the points for each selected vector

    The simplified polynomial selection only needs to test a single evaluation value. Hence, if this value can be derived without first building Q(x, y), the polynomial selection can be applied to first pick the test vectors, and then the interpolation only needs to construct Q(x, y) for the selected vectors. Compared to interpolating over 2^η test vectors, this approach has the potential to substantially lower the complexity of the interpolation step in the LCC decoding.

    Algorithm B lists the proposed interpolation scheme for the re-encoded LCC decoder, assuming that the vectors are tested in the order of decreasing reliability. The n−k code positions outside R̄ are further divided into two sets: S_2 for the η most unreliable code positions and S_1 for the rest of the code positions. Without loss of generality, assume the points in S_1 are interpolated first. To better explain the proposed scheme, a subscript j is added when necessary to denote the updated polynomials derived in the jth interpolation iteration of Algorithm A. Accordingly, the polynomials after iteration j pass all points that have been interpolated in iterations 0 through j, and the evaluation value of the final output over the preset point is the value that needs to be tested for polynomial selection. Instead of first building Q(x, y) and then carrying out the evaluation, this value can be derived through updating the initial values by following the polynomial updating that should have been done for each interpolation


    Fig. 3. Test vector tree.

    Fig. 4. Evaluation values that need to be stored.

    iteration. The initial values can be derived easily as shown in Algorithm B. Since the points in S_1 are common to all test vectors, the interpolation over them is first done in Step B1 of Algorithm B using the same process as listed in Algorithm A. At the same time, the evaluation values over the preset point are updated according to Steps A3-A4 as

    e^{(l)} ← Δ_{l*} e^{(l)} − Δ_l e^{(l*)} for l ≠ l*,   e^{(l*)} ← (x_p − x_j) e^{(l*)}   (1)

    where e^{(l)} denotes the evaluation value of candidate q^{(l)}(x, y) over a point (x_p, y_p), using the discrepancy coefficients Δ_l computed from the interpolation process. At the end of Step B1, the polynomials that pass all points in S_1, and their evaluation values, are derived.

    In Step B2, the test value is computed for each test vector without carrying out any interpolation. It can be derived through iteratively updating the evaluation values using (1) if the discrepancy coefficients over the points of S_2 are available. Considering this, the evaluation values over (α_i, y_i) and (α_i, ỹ_i) for each position i ∈ S_2 are also initialized and updated in the same way as the evaluation value over the preset point during Step B1. Hence, they are also available at the end of Step B1. Since the evaluation values over the first point of S_2 are actually the discrepancy coefficients for interpolating that point, they can be used in an updating equation similar to (1) to derive the evaluation values over the remaining points. Likewise, the updated evaluation values over the second point of S_2 equal the corresponding discrepancy coefficients, and are used to update the other evaluation values again. This process can be applied iteratively to derive all the evaluation values, and accordingly the test value for each vector. After the first two test vectors are selected based on their test values, the interpolation is continued from the results of Step B1 to cover the points in S_2 for each selected vector in Step B3.
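    The lockstep updating of (1) — carrying the evaluation value at a fixed point through the interpolation updates instead of evaluating the finished polynomial — can be sketched as follows. A small prime field GF(13), the interpolation points, and the preset point are illustrative assumptions only.

```python
# Sketch of the idea behind (1): track the evaluation value e of each
# candidate Q(x, y) = q0(x) + q1(x)*y at a fixed "preset" point alongside
# the Koetter-style polynomial updates.
P = 13  # toy prime field

def ev(poly, x):  # Horner evaluation of a univariate polynomial mod P
    r = 0
    for c in reversed(poly):
        r = (r * x + c) % P
    return r

def comb(a, b, s, t):  # coefficientwise s*a + t*b mod P
    n = max(len(a), len(b))
    a = a + [0] * (n - len(a)); b = b + [0] * (n - len(b))
    return [(s * x + t * y) % P for x, y in zip(a, b)]

def mul_by_root(a, xj):  # multiply by (x - xj)
    return comb([0] + a, a, 1, -xj)

points = [(1, 5), (2, 3), (4, 7)]  # made-up interpolation points
xp, yp = 6, 2                      # hypothetical preset point

# Candidates with tracked value e = Q(xp, yp).
cands = [{"q0": [1], "q1": [0], "wd": 0, "e": 1},   # Q = 1
         {"q0": [0], "q1": [1], "wd": 1, "e": yp}]  # Q = y
for xj, yj in points:
    d = [(ev(c["q0"], xj) + ev(c["q1"], xj) * yj) % P for c in cands]
    nz = [l for l in range(len(cands)) if d[l]]
    if not nz:
        continue
    ls = min(nz, key=lambda l: cands[l]["wd"])
    star = {k: (list(v) if isinstance(v, list) else v)
            for k, v in cands[ls].items()}  # snapshot of the chosen candidate
    for l, c in enumerate(cands):
        if l == ls:
            # A4, and its evaluation-value counterpart in (1)
            c["q0"] = mul_by_root(c["q0"], xj)
            c["q1"] = mul_by_root(c["q1"], xj)
            c["wd"] += 1
            c["e"] = (xp - xj) * c["e"] % P
        else:
            # A3, and its evaluation-value counterpart in (1)
            c["q0"] = comb(c["q0"], star["q0"], d[ls], -d[l])
            c["q1"] = comb(c["q1"], star["q1"], d[ls], -d[l])
            c["e"] = (d[ls] * c["e"] - d[l] * star["e"]) % P

# The tracked value matches evaluating the final polynomials explicitly.
for c in cands:
    assert c["e"] == (ev(c["q0"], xp) + ev(c["q1"], xp) * yp) % P
```

    Because the evaluation map is linear, applying (1) to the stored scalars reproduces exactly what evaluating the updated polynomials would give, which is why a single tracked value per candidate suffices for the selection test.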

    For a code position i ∈ S_2, each test vector can take (α_i, y_i) or (α_i, ỹ_i). Hence, the test vectors can be mapped to a binary tree as shown in Fig. 3 [6]. In this figure, each path from the root to a leaf represents a test vector. In addition, 0 and 1 denote that (α_i, y_i) and (α_i, ỹ_i), respectively, is included in the vector. An edge starting from a node in level l represents using the evaluation values over the corresponding point of the lth unreliable position as the discrepancy coefficients to update other evaluation values. If the evaluation values of a node are stored, they can be shared in the evaluation value updating (EVU) corresponding to the edges going to its children nodes. Since the evaluation values on a point are no longer needed after they are used to update other evaluation values, fewer evaluation values need to be stored for a node in a deeper level, as shown in Fig. 4. However, there are 2^l nodes in level l, and hence using a breadth-first scheme to traverse the tree would lead to a large memory requirement. Instead, our design

    Fig. 5. Interpolator architecture. (a) PE unit. (b) PU unit.

    Fig. 6. Architecture of the EVU unit.

    adopts a depth-first scheme. When a node is reached, the updated values can replace the previously stored values of the node in the same level. Accordingly, evaluation values only need to be remembered for η nodes, one for each level.
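    The storage saving of the depth-first traversal can be sketched as follows; the EVU work itself is abstracted to a counter, and the tree depth is an arbitrary illustrative value.

```python
# Sketch of the depth-first tree traversal: one storage slot per level
# (ETA slots total) suffices, because a sibling overwrites the slot of a
# node that has already been fully explored.
ETA = 4
edges = 0
leaves = []
stored = [None] * ETA  # one evaluation-value set remembered per level

def dfs(level, vec):
    global edges
    if level == ETA:
        leaves.append(tuple(vec))
        return
    for branch in (0, 1):  # 0: (alpha_i, y_i), 1: (alpha_i, y~_i)
        edges += 1         # one EVU pass per tree edge
        stored[level] = branch
        dfs(level + 1, vec + [branch])

dfs(0, [])
assert len(set(leaves)) == 2 ** ETA   # every test vector visited once
assert edges == 2 ** (ETA + 1) - 2    # each edge passed exactly once
```

    A breadth-first traversal would instead have to hold up to 2^l node states alive in level l, which is the memory blow-up the text refers to.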

    The proposed interpolation scheme requires two interpolators and one EVU unit. The interpolation over the common points in Step B1 is carried out by Interpolator 1. In Step B3, each of the two selected test vectors has η points in S_2 to be interpolated over. To reduce the latency, Interpolators 1 and 2 are employed to interpolate over the two test vectors in parallel. Moreover, the EVU unit is used to update the evaluation values in Steps B1 and B2.

    Interpolator 1 can be implemented by using the architecture in Fig. 5. It mainly consists of the polynomial evaluation (PE) unit for Step A1 and the polynomial updating (PU) unit for Steps A3-A4. In the PE unit, Horner's rule is applied and the coefficients of each polynomial are input serially with the most significant one first. Two copies of the PE unit are employed to compute the two discrepancy coefficients simultaneously. The PU architecture shown in Fig. 5(b) takes care of the PU for one pair of coefficients, and two copies of the PU unit are used to update all polynomial coefficients with the same x-degree simultaneously. Since switching the two candidate polynomials in the memory does not affect the interpolation output, the updated coefficients are written back to fixed memory blocks to save multiplexors. Moreover, once a coefficient is updated, it is sent to the PE unit to calculate the evaluation value for the next iteration. Interpolator 2 is only used to interpolate over the last η points of the second selected vector. During Step B2, the evaluation values required in this interpolation process have already been computed and stored. Hence, they can be passed to Interpolator 2. Accordingly, Interpolator 2 only needs the PU units, and its area requirement is much smaller than that of Interpolator 1.

    Fig. 6 shows the EVU architecture, which carries out the computations in (1). This architecture is very similar to the PU unit in Fig. 5(b). However, the data are routed to the memory blocks differently. At the beginning of Step B1, the initial evaluation values on the points are loaded into RAM 2 in Fig. 6. During Step B1, the evaluation values are updated in the EVU unit by using the discrepancy coefficients from Interpolator 1. All the updated values are written back to RAM 2, except that in the last iteration, the evaluation values on the two points of the first unreliable position are stored into RAM 1. This is because these values will be used as the discrepancy coefficients during the first EVU iteration in Step B2. Similarly, during each iteration of Step B2, the updated evaluation values of the first two points are written into RAM 1 to be used as the discrepancy coefficients in the next iteration, while the rest are written to RAM 2. Hence, all the updated values in the solid circle of Fig. 4 are stored in RAM 1, and those in the dotted circle are stored in RAM 2.


    IV. HARDWARE COMPLEXITY ANALYSES AND COMPARISONS

    This section analyzes the hardware complexity of our polynomial selection and interpolation scheme and compares it with prior efforts for a (458, 410) RS code over GF(2^9). Then the complexity reduction of the LCC decoder achieved by using our design is investigated.

    A. Hardware Complexity Analyses and Comparisons of the Polynomial Selection and Interpolation

    Since Steps B1-B3 are carried out serially, the latency of the proposed interpolation architecture is the sum of those of the three steps. The interpolation in Step B1 over the common points in S_1 takes n−k−η iterations. In addition, the PU can be carried out concurrently with the PE of the next interpolation iteration. Hence, the number of clock cycles required by the interpolation in Step B1 is determined by the maximum x-degree of the polynomials in each iteration plus the interpolation pipelining latency. In the worst case, the maximum x-degree increases by one in every iteration, which sets the interpolation latency of Step B1.

    In Step B1, the EVU unit is also activated, and each updating iteration takes a fixed number of clock cycles if one EVU unit is employed. Since the maximum x-degree starts from one and increases at most by one in each iteration, the interpolation runs faster than the EVU in the first several iterations. To avoid delaying the interpolation, the discrepancy coefficients computed from the interpolator are buffered in RAM 1 of Fig. 6. If the EVU catches up later, it needs to wait until the corresponding coefficients are computed. Moreover, Step B2 can start right after the last EVU iteration is completed. Accordingly, the number of clock cycles required for Step B1 is decided by the slower of the interpolation and the EVU.

    During Step B2, each edge of the binary tree in Fig. 3 will be passed once in the worst case using our depth-first scheme. To reduce the latency, the binary tree can be divided into sub-trees and one EVU unit can be employed for each sub-tree to carry out the EVU in parallel. To balance the load on each EVU unit, the tree should be divided as symmetrically as possible. This can be done by splitting from the top node that has two children each time. The latency of Step B2 is decided by that of the tallest sub-tree. Assume that the entire binary tree is divided into 2^w sub-trees. Then the tallest sub-tree has exactly one edge between the nodes in levels l and l+1 for each l < w, and has a full binary tree starting from the node in level w. Since fewer evaluation values need to be updated for an edge ending at a deeper-level node, the latency of Step B2 can be derived by summing the updating costs over the edges of this tallest sub-tree. The last clock cycle is spent on testing the evaluation value for polynomial selection. When η is large, the latency of Step B2 is mainly decided by the term contributed by the full sub-tree. Accordingly, increasing w by one can reduce this latency by almost a half. However, it will also double the number of EVU engines. Hence, the speed-area tradeoff needs to be considered when choosing w.

    In Step B3, the last η points in the two selected vectors are interpolated simultaneously by the two interpolators, and this step consists of η interpolation iterations. Similarly, the maximum x-degree is considered in the worst case. However, in the last iteration, only the polynomial of lower weighted degree will be sent to the output, and thus only this polynomial needs to be updated. Since decoding a test vector can correct at most ⌊(n−k)/2⌋ errors, the x-degree of this output polynomial is bounded accordingly, and so is the number of clock cycles required in Step B3.

    Table I lists the hardware complexity of the proposed interpolation scheme for the LCC decoder of a (458, 410) RS code over GF(2^9). Since the proposed polynomial selection only needs to test whether a single evaluation value is zero, the corresponding hardware complexity is negligible and is not included in this table. After exploring different values, w is set to one, i.e., two EVU units are employed, to increase hardware efficiency. The resulting number of clock cycles taken by the entire interpolation process is listed in Table I. Among previous

    TABLE I
    HARDWARE REQUIREMENT OF INTERPOLATION AND POLYNOMIAL SELECTION FOR A (458, 410) RS CODE OVER GF(2^9)

    TABLE II
    SYNTHESIS RESULTS OF INTERPOLATION AND POLYNOMIAL SELECTION FOR A (458, 410) RS CODE OVER GF(2^9)

    interpolator designs, the one in [8] is the most efficient for large η. When 4-parallel processing is employed, it can finish the interpolation in 2828 clock cycles. To test four Q(x, y) simultaneously, four polynomial selection engines are needed. Previously, a highly parallel Chien search needed to be employed for each engine in order to match the interpolation speed. Both the proposed design and that in [8] have the same critical path. Hence the proposed interpolation and polynomial selection can achieve 37% higher throughput. To further evaluate our design, it is modeled using Verilog-HDL and synthesized using Synopsys Design Compiler with SMIC 0.18-μm CMOS technology at 1.8 V power supply and 150 MHz clock frequency. Moreover, Synopsys Power Compiler is used to estimate the power consumption, and the results are listed in Table II. In terms of throughput-over-area ratio, the proposed interpolator is 69% more efficient than the interpolator in [8]. When the polynomial selection is considered, the proposed design is 457% more efficient. Our memory compiler only generates memory with power-of-two depths, and each memory cell has eight transistors. Hence, much area is wasted on unused memory portions when the proposed interpolator is synthesized. If a more optimized memory compiler were available, the proposed interpolator would occupy even less area. Although the proposed design needs more memory than that in [8], the number of memory accesses is smaller, and the logic gate requirement is much lower. As a result, the proposed interpolator has much lower power consumption.

The proposed polynomial selection is achieved through testing a single evaluation value for each test vector. Apparently, its complexity is negligible compared to that of the parallel exhaustive root search in prior polynomial selection schemes. The complexity of the proposed interpolation is also significantly lower than those of the architectures employing backward-forward interpolation [6]-[8]. In previous schemes, the interpolation result of the first test vector is derived by forward interpolation. Then the result for each of the following vectors can be derived by one iteration of unified backward-forward interpolation. Since only one interpolation result needs to be stored, previous schemes have a small memory requirement. Nevertheless, they need many more logic gates because parallel processing has to be adopted. The latency of interpolating the first test vector in prior designs is about the same as the sum of the latencies of Steps B1 and B3 in our scheme. In the worst case, Step B2 needs twice as many EVU iterations as the unified backward-forward interpolation needs. However, the number of values that need to be updated, and hence the number of clock cycles required in each iteration, is much smaller. The number of evaluation values updated by an EVU unit in each iteration reduces level by level over the tree, down to one in the last level. On the other hand, the number of polynomial coefficients that need to be updated in each backward-forward iteration remains at about n - k.


TABLE III
HARDWARE REQUIREMENT OF LCC DECODERS WITH η = 8 FOR A (458, 410) RS CODE OVER GF(2^10)

TABLE IV
COMPARISONS OF LCC DECODERS FOR A (458, 410) RS CODE

Such a per-iteration update count is usually much smaller than n - k. Hence the proposed interpolation scheme has a much shorter latency. As can be seen from Table I, our design employing a 2-parallel EVU can achieve even higher speed than the 4-parallel unified backward-forward interpolator in [8]. Another advantage of our design is that parallel processing is less costly. Since the EVU engine only takes a small part of the interpolation area, duplicating this unit leads to a lower area overhead.
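As a rough illustration of this latency argument, the sketch below compares total update counts under our own simplifying assumptions: a binary test-vector tree with one EVU iteration per node, 2^(η-i) evaluation values updated at tree level i, and about n - k coefficient updates in each of the 2^η - 1 unified backward-forward iterations. It mirrors the scaling only, not the exact cycle counts of either architecture.

```python
# Toy cost model (our assumptions, not the papers' exact architectures):
# it only illustrates why the per-iteration work of the EVU-based scheme
# shrinks down the tree while backward-forward work stays near n - k.
def evu_updates(eta: int) -> int:
    """Total evaluation-value updates over an eta-level binary tree."""
    # level i has 2^(i-1) nodes, each assumed to update 2^(eta-i) values
    return sum(2 ** (i - 1) * 2 ** (eta - i) for i in range(1, eta + 1))

def bf_updates(eta: int, n_minus_k: int) -> int:
    """Coefficient updates for (2^eta - 1) backward-forward iterations."""
    return (2 ** eta - 1) * n_minus_k

eta, n_minus_k = 8, 48   # example parameters for a (458, 410) RS code
print(evu_updates(eta), bf_updates(eta, n_minus_k))
```

Under these assumptions the EVU-based scheme performs about an order of magnitude fewer updates, which is consistent with the latency advantage claimed above.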

B. Complexity Reduction in the Overall LCC Decoder

The LCC decoder can be pipelined to achieve higher throughput.

    To increase the hardware utilization efficiency, each pipelining stageshould be adjusted to have about the same latency. Based on this idea,pipelining can be applied according to the cutsets in Fig. 1.

The most efficient architectures for the re-encoder and codeword recovery can be found in [13] and [10], respectively. Their complexities are listed in Table III. The proposed interpolator has a shorter latency. Hence, when it is adopted in the LCC decoder, besides including extra units to compute the evaluation values needed by the polynomial selection, higher-level parallel processing needs to be used in the re-encoder. Moreover, our proposed scheme selects two interpolation outputs. Since the latency of the FCSB codeword recovery in [10] is less than half of the interpolation latency, running the codeword recovery twice does not affect the decoder throughput. Also, the number of roots of the selected polynomial can be counted during the FCSB codeword recovery. If it matches the polynomial degree, the corresponding output is sent as the final decoding output. Using the equivalent gate estimation explained in [9], it can be calculated from Table III that the LCC decoder with η = 8 for the (458, 410) RS code employing the proposed interpolation and polynomial selection requires 49261 XOR gates, and can decode a received word in 2067 clock cycles. Comparatively, the decoder employing the most efficient previous designs needs 102547 XOR gates and 2828 clock cycles to decode each word.

To apply the proposed polynomial selection, the systematic encoder needs to be modified [9]. The extra complexity needed for this modification is equivalent to 4316 XOR gates. For the purpose of fair comparison, this extra area requirement is added to the hardware complexity of the proposed design in Table IV. As shown in this table, for the decoder with η = 8, employing our proposed scheme can reduce the area requirement by 48% and increase the throughput by 37%. Hence, our design can lead to 162% higher efficiency in terms of speed-over-area ratio. The hardware complexities of the LCC decoders with other values of η (up to 10) are also listed in Table IV. From Table IV, the proposed scheme can increase the decoder efficiency by 53% for the smaller η, and by 491% when η = 10.
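These percentages can be reproduced directly from the gate and cycle counts given above: 102547 XOR gates and 2828 cycles for the best previous design, versus 49261 XOR gates plus the 4316-gate encoder modification and 2067 cycles for the proposed design.

```python
# Reproducing the comparison figures from the numbers quoted in the text.
gates_prev, cycles_prev = 102547, 2828        # best previous design
gates_prop, cycles_prop = 49261 + 4316, 2067  # proposed + encoder change

throughput_gain = cycles_prev / cycles_prop - 1   # ~37% higher throughput
area_reduction = 1 - gates_prop / gates_prev      # ~48% smaller area
eff_prev = 1 / (gates_prev * cycles_prev)         # speed-over-area ratio
eff_prop = 1 / (gates_prop * cycles_prop)
efficiency_gain = eff_prop / eff_prev - 1         # ~162% higher efficiency

print(f"{throughput_gain:.0%} {area_reduction:.0%} {efficiency_gain:.0%}")
```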

The error-correcting performance of the LCC decoder can be improved by using a larger η. As η increases, the efficiency improvement achieved by our proposed scheme becomes more significant. The interpolation latency is dominated by a term with 2^η. However, this term is multiplied by a larger factor in backward-forward-based interpolation. Hence the interpolation latency grows faster with η in previous schemes. Alternatively, higher-level parallel processing can be adopted to reduce the interpolation latency when η is larger. As aforementioned, the overhead for parallel processing in our proposed interpolation scheme is much lower. In the case that more interpolation results are generated at a time, more copies of the expensive parallel Chien search need to be employed for the root-search-based polynomial selection. This would increase the overall decoder area significantly. On the other hand, our polynomial selection only needs to test a single evaluation value for each test vector. Its complexity is negligible in the LCC decoder.

V. CONCLUSION

In this paper, a low-complexity polynomial selection scheme for the re-encoded LCC decoder was further optimized. Based on the optimized polynomial selection, a novel interpolation scheme was developed. Different from conventional designs, the test vectors are first selected, and then the interpolation is carried out only on the chosen vectors. In addition, efficient interpolation architectures were developed. Compared to previous efforts, our proposed scheme leads to a significant speedup and area reduction. Moreover, the saving brought by our design further increases with η. Future work will be directed to further improving the efficiency of the LCC decoder.

REFERENCES

[1] J. Bellorado and A. Kavcic, "Low-complexity soft-decoding algorithms for Reed-Solomon codes, Part I: An algebraic soft-in hard-out Chase decoder," IEEE Trans. Inf. Theory, vol. 56, no. 3, pp. 945-959, Mar. 2010.
[2] W. J. Gross, F. R. Kschischang, R. Koetter, and P. Gulak, "A VLSI architecture for interpolation in soft-decision decoding of Reed-Solomon codes," in Proc. SiPS, 2002, pp. 39-44.
[3] J. Zhu and X. Zhang, "Factorization-free low-complexity Chase soft-decision decoding of Reed-Solomon codes," presented at the ISCAS, Taipei, Taiwan, 2009.
[4] R. Koetter, "On algebraic decoding of algebraic-geometric and cyclic codes," Ph.D. dissertation, Dept. Elect. Eng., Linköping Univ., Linköping, Sweden, 1996.
[5] Z. Wang and J. Ma, "High-speed interpolation architecture for soft-decision decoding of Reed-Solomon codes," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 14, no. 9, pp. 937-950, Sep. 2006.
[6] J. Zhu, X. Zhang, and Z. Wang, "Backward interpolation architecture for algebraic soft-decision Reed-Solomon decoding," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 17, no. 11, pp. 1602-1615, Nov. 2009.
[7] X. Zhang and J. Zhu, "Algebraic soft-decision decoder architectures for long Reed-Solomon codes," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 57, no. 10, pp. 787-792, Oct. 2010.
[8] X. Zhang and J. Zhu, "Reduced-complexity multi-interpolator algebraic soft-decision Reed-Solomon decoder," in Proc. SiPS, 2010, pp. 402-407.
[9] X. Zhang, Y. Wu, and J. Zhu, "A novel polynomial selection scheme for low-complexity Chase algebraic soft-decision Reed-Solomon decoding," presented at the ISCAS, Rio de Janeiro, Brazil, 2011.
[10] X. Zhang and Y. Zheng, "Efficient codeword recovery architecture for low-complexity Chase Reed-Solomon decoding," presented at the ITA Workshop, San Diego, CA, 2011.
[11] X. Zhang, J. Zhu, and W. Zhang, "Modified low-complexity Chase soft-decision decoder of Reed-Solomon codes," Springer J. Signal Process. Syst., to be published.
[12] R. Koetter and A. Vardy, "Algebraic soft-decision decoding of Reed-Solomon codes," IEEE Trans. Inf. Theory, vol. 49, no. 11, pp. 2809-2825, Nov. 2003.
[13] J. Zhu and X. Zhang, "High-speed re-encoder design for algebraic soft-decision Reed-Solomon decoding," presented at the ISCAS, Paris, France, May 2010.