
168 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 1, JANUARY 2013

REFERENCES

[1] K. Takeuchi, G. Tanaka, H. Matsushita, K. Yoshizumi, Y. Katsuki, andT. Sato, “Observations of supply-voltage-noise dispersion in sub-nsec,”presented at the ITC, Santa Clara, CA, 2008, 26.3.

[2] J. Wang, D. Walker, A. Majhi, B. Kruseman, G. Gronthoud, L. E.Villagra, P. Wiel, and S. Eichenberger, “Power supply noise in delaytesting,” presented at the ITC, Santa Clara, CA, 2006, 17.3.

[3] E. Alon, V. Stojanovic, and M. A. Horowitz, “Circuits and techniquesfor high-resolution measurement of on-chip power supply noise,” IEEEJ. Solid-State Circuits, vol. 40, no. 4, pp. 820–828, Apr. 2005.

[4] E. Alon, V. Abramzon, B. Nezamfar, and M. Horowitz, “On-die powersupply noise measurement techniques,” IEEE Trans. Adv. Packag., vol.32, no. 2, pp. 248–259, Feb. 2009.

[5] M. Takamiya and M. Mizuno, “A sampling oscilloscope macro towardfeedback physical design methodology,” in Proc. Symp. VLSI Circuits,2004, pp. 240–243.

[6] M. Nagata, “On-chip measurements complementary to design flow forintegrity in SoCs,” in Proc. DAC, 2007, pp. 400–403.

[7] K. Shimazaki, M. Fukazawa, M. Nagata, S. Miyahara, M. Hirata, K.Sato, and H. Tsujikawa, “An integrated timing and dynamic supplynoise verification for nano-meter CMOS SoC designs,” in Proc. CICC,2005, pp. 31–34.

[8] Y. Kanno, Y. Kondoh, T. Irita, K. Hirose, R. Mori, Y. Yasu, S. Ko-matsu, and H. Mizuno, “In-situ measurement of supply-noise mapswith millivolt accuracy and nanosecond-order time resolution,” in Proc.VLSI Circuits, 2006, pp. 63–64.

[9] J. R. Vazquez and J. P. de Gyvez, “Power supply noise monitor forsignal integrity faults,” in Proc. DATA, 2004, pp. 1406–1407.

[10] S. Naffziger, B. Stackhouse, T. Grutkowski, D. Josephson, J. Desai, E.Alon, and M. H. Horowitz, “The implementation of a 2-core, multi-threaded itanium family processor,” IEEE J. Solid-State Circuits, vol.41, no. 1, pp. 197–209, Jan. 2005.

[11] M. Fukazawa, T. Matsuno, T. Uemura, R. Akiyama, T. Kagemoto, H.Makino, H. Takata, and M. Nagata, “Fine-grained in-circuit contin-uous-time probing technique of dynamic supply variations in SoCs,”in Proc. ISSCC, 2007, pp. 288–289.

[12] S. Pant and E. Chiprout, “Power grid physics and implications forCAD,” in Proc. DAC, 2006, pp. 199–204.

[13] J. Rearick and R. Rodgers, “Calibrating clock stretch during AC scantesting,” presented at the ITC, Austin, TX, 2005, 11.3.

[14] K. Takeuchi, A. Yoshikawa, M. Komoda, K. Kotani, H. Matsushita,Y. Katsuki, Y. Yamamoto, and T. Sato, “Clock-skew test module forexploring reliable clock-distribution under process and global voltage-temperature variations,” IEEE Trans. Very Large Scale Integr. (VLSI)Syst., vol. 16, no. 11, pp. 1559–1566, Nov. 2008.

Low-Complexity Multiplier for $GF(2^m)$ Based on All-One Polynomials

Jiafeng Xie, Pramod Kumar Meher, and Jianjun He

Abstract—This paper presents an area-time-efficient systolic structure for multiplication over $GF(2^m)$ based on irreducible all-one polynomial (AOP). We have used a novel cut-set retiming to reduce the duration of the critical-path to one XOR gate delay. It is further shown that the systolic structure can be decomposed into two or more parallel systolic branches, where the pair of parallel systolic branches has the same input operand, and they can share the same input operand registers. From the application-specific integrated circuit and field-programmable gate array synthesis results we find that the proposed design provides significantly less area-delay and power-delay complexities over the best of the existing designs.

Index Terms—All-one polynomial, finite field, systolic design.

I. INTRODUCTION

Finite field multipliers over $GF(2^m)$ have wide applications in elliptic curve cryptography (ECC) and error control coding systems [1], [2]. Polynomial basis multipliers are popularly used because they are relatively simple to design, and offer scalability for the fields of higher orders. Efficient hardware design for polynomial-based multiplication is therefore important for real-time applications [3]–[5]. All-one polynomial (AOP) is one of the classes of polynomials considered suitable to be used as irreducible polynomial for efficient implementation of finite field multiplication. Multipliers for the AOP-based binary fields are simple and regular, and therefore, a number of works have explored their efficient realization [6]–[17]. Irreducible AOPs are not abundant. They are very often not preferred in cryptosystems for security reasons, and one has to make a careful choice of the field order to use irreducible AOPs for cryptographic applications [1], [9]. The AOP-based multipliers can be used for the nearly AOP (NAOP), which could be used for efficient realization of ECC systems [18]. AOP-based fields could also be used for efficient implementation of Reed-Solomon encoders [19]. Besides, the AOP-based architectures can be used as a kernel circuit for field exponentiation, inversion, and division architectures [20]–[23].

Systolic design is a preferred type of specialized hardware solution due to its high level of pipelining ability, local connectivity and many other advantageous features [24], [25]. In [13], a bit-parallel AOP-based systolic multiplier has been suggested by Lee et al. Another efficient systolic design is presented in [14]. In a recent paper [15], a low-complexity bit-parallel systolic Montgomery multiplier has been suggested. Very recently [16], an efficient digit-serial systolic Montgomery multiplier for AOP-based binary extension field has been presented.

Manuscript received May 06, 2011; revised October 24, 2011 and December 08, 2011; accepted December 15, 2011. Date of publication January 12, 2012; date of current version December 19, 2012. This work was supported by National Science Foundation (NSF) of China under Grant 60843002, 61174132 and by Natural Science Program of Hunan Province under Grant 09JJ6098.

J. Xie and J. He are with the School of Information Science and Engineering, Central South University, Changsha 410083, China (e-mail: [email protected]; [email protected]).

P. Kumar Meher is with the Department of Embedded Systems, Institute for Infocomm Research, Singapore 138632 (e-mail: [email protected]).

Digital Object Identifier 10.1109/TVLSI.2011.2181434

1063-8210/$31.00 © 2012 IEEE

The systolic structures for field multiplication have two major issues. First, the registers in the systolic structures usually consume large area and power. Second, the systolic structures usually have a latency of nearly $m$ cycles, which is very often undesired for real-time applications. Therefore, in this paper, we present a novel register-sharing technique to reduce the register requirement in the systolic structure. The proposed algorithm not only facilitates sharing of registers by the neighboring processing elements (PEs) to reduce the register complexity but also helps reduce the latency. Cut-set retiming allows one to introduce a certain number of delays on all the edges in one direction of any cut-set of a signal flow-graph (SFG) by removing an equal number of delays on all the edges in the reverse direction of the same cut-set [24]. When all the edges are in a single direction, one can introduce any desired number of delays on all the edges of any cut-set of an SFG. Therefore, this technique is highly useful for pipelining digital circuits to reduce the critical path. In this paper, we propose a novel cut-set retiming approach to reduce the clock-period. The proposed structure is found to involve significantly less area-time-power complexity compared with the existing designs.

The rest of this paper is organized as follows. The proposed algorithm for finite field multiplication over $GF(2^m)$ based on AOP is derived in Section II. In Section III, the proposed structure is presented. In Section IV, we have listed the complexities and compared them with those of the existing structures. Finally, the conclusion is given in Section V.

II. ALGORITHM

Let $f(x) = x^m + x^{m-1} + \cdots + x + 1$ be an irreducible AOP of degree $m$ over $GF(2)$. As a requirement of irreducible AOP for $GF(2^m)$, $(m+1)$ is prime and 2 is primitive modulo $(m+1)$. The set $\{\alpha^{m-1}, \alpha^{m-2}, \ldots, \alpha, 1\}$ forms the polynomial basis (where $\alpha$ is a root of $f(x)$), such that an element $X$ of the binary field can be given by

$$X = x_{m-1}\alpha^{m-1} + x_{m-2}\alpha^{m-2} + \cdots + x_1\alpha + x_0 \tag{1}$$

where $x_j \in GF(2)$ for $j = 0, 1, \ldots, m-1$. Since $\alpha$ is a root of $f(x)$, we have $f(\alpha) = 0$, and

$$(\alpha + 1) f(\alpha) = \alpha(\alpha^m + \alpha^{m-1} + \cdots + \alpha + 1) + (\alpha^m + \alpha^{m-1} + \cdots + \alpha + 1) = \alpha^{m+1} + 1 = 0 \tag{2}$$

Therefore, we have

$$\alpha^{m+1} = 1 \tag{3}$$

This property of AOP [17] is used to reduce the complexity of field multiplications as discussed in the following.
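The existence condition and the identity behind (2) and (3) can be checked numerically. The following is a minimal sketch, not part of the original design; the function names are hypothetical, and polynomials over $GF(2)$ are assumed to be stored as Python integers whose bit $k$ is the coefficient of $x^k$.

```python
def is_prime(n):
    """Trial-division primality test (adequate for small field orders)."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def order_of_two(p):
    """Multiplicative order of 2 modulo an odd prime p."""
    k, v = 1, 2 % p
    while v != 1:
        v = (v * 2) % p
        k += 1
    return k


def aop_is_irreducible(m):
    """AOP of degree m is irreducible iff (m+1) is prime and 2 is primitive mod (m+1)."""
    p = m + 1
    return p > 2 and is_prime(p) and order_of_two(p) == m


def identity_holds(m):
    """Check (x + 1) * f(x) == x^(m+1) + 1 over GF(2) for the degree-m AOP f."""
    f = (1 << (m + 1)) - 1          # all-one polynomial: bits 0..m set
    product = (f << 1) ^ f          # x*f(x) + f(x), addition is XOR
    return product == ((1 << (m + 1)) | 1)


valid = [m for m in range(2, 40) if aop_is_irreducible(m)]
print(valid)  # [2, 4, 10, 12, 18, 28, 36]
```

The short list of admissible degrees illustrates why irreducible AOPs are described as not abundant; every admissible $m$ above is even.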

Any element $X$ in $GF(2^m)$ given by (1) in polynomial basis representation can be represented as $X = x_0 + x_1\alpha + \cdots + x_m\alpha^m$, where $x_j \in GF(2)$, and $\{1, \alpha, \ldots, \alpha^m\}$ is the extended polynomial basis [17]. Similarly, if $A, B, C \in GF(2^m)$, they can be represented by the extended polynomial basis as

$$A = \sum_{j=0}^{m} a_j\alpha^j, \qquad B = \sum_{j=0}^{m} b_j\alpha^j, \qquad C = \sum_{j=0}^{m} c_j\alpha^j \tag{4}$$

where $a_j$, $b_j$, and $c_j \in GF(2)$ for $0 \le j \le m-1$, and $a_m = 0$, $b_m = 0$, and $c_m = 0$.

If $C$ is the product of elements $A$ and $B$, then we have

$$C = A \cdot B \bmod f(\alpha) \tag{5}$$

which can be decomposed to the form

$$C = \sum_{i=0}^{m} b_i \cdot \left(\alpha^i \cdot A \bmod f(\alpha)\right) \tag{6}$$

Equation (6) can be expressed as a finite field accumulation

$$C = \sum_{i=0}^{m} X_i \tag{7}$$

where $X_i$ is given by

$$X_i = b_i \cdot A^{(i)} \tag{8a}$$

for $A^{(0)} = A$ and $A^{(i)} = \alpha^i \cdot A \bmod f(\alpha)$, and using (3), $A^{(i)}$ can be obtained from $A$ as

$$A^{(i)} = a_{m-i+1} + a_{m-i+2}\alpha + \cdots + a_m\alpha^{i-1} + a_0\alpha^{i} + \cdots + a_{m-i}\alpha^{m} \tag{8b}$$

such that $A^{(i+1)}$ can be obtained from $A^{(i)}$ recursively as

$$A^{(i+1)} = \alpha \cdot A^{(i)} \bmod f(\alpha) \tag{9}$$

The partial product generation and modular reduction are performed according to (8) and (9), respectively. The additions of the reduced polynomials are performed according to (7).
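The accumulation of (7) to (9) can be modelled in software as a shift-and-XOR loop. The sketch below is a behavioural reference only, under the assumption that an element in the extended basis is held as an $(m+1)$-bit integer (bit $j$ is the coefficient of $\alpha^j$); the function names and the final conversion back to the $m$-bit polynomial basis are illustrative choices, not taken from the text. A schoolbook multiplication modulo the AOP is included only as a cross-check.

```python
def rotl1(a, m):
    """Multiply by alpha in the extended basis: cyclic left shift of (m+1) bits,
    which follows from alpha^(m+1) = 1."""
    return ((a << 1) | (a >> m)) & ((1 << (m + 1)) - 1)


def aop_multiply_extended(a, b, m):
    """Accumulate X_i = b_i * A^(i) for i = 0..m; result stays in the extended basis."""
    acc = 0
    a_i = a                      # A^(0) = A
    for i in range(m + 1):
        if (b >> i) & 1:
            acc ^= a_i           # bit-multiplication and bit-addition
        a_i = rotl1(a_i, m)      # A^(i+1) = alpha * A^(i)
    return acc


def to_polynomial_basis(c, m):
    """Remove the alpha^m term using alpha^m = 1 + alpha + ... + alpha^(m-1)."""
    if (c >> m) & 1:
        c ^= (1 << (m + 1)) - 1  # adding f(alpha) = 0 clears bit m
    return c


def reference_multiply(a, b, m):
    """Schoolbook carry-less multiply followed by reduction modulo the AOP."""
    f = (1 << (m + 1)) - 1
    prod = 0
    for i in range(m):
        if (b >> i) & 1:
            prod ^= a << i
    for d in range(2 * m - 2, m - 1, -1):
        if (prod >> d) & 1:
            prod ^= f << (d - m)
    return prod


# Example for m = 4: (alpha + 1) * alpha^3 = alpha^4 + alpha^3 = 1 + alpha + alpha^2
print(to_polynomial_basis(aop_multiply_extended(0b00011, 0b01000, 4), 4))  # 7
```

Because the reduction step is a pure rotation, no gates are needed for it in hardware; only the AND and XOR operations contribute logic.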

Equation (9) can be expressed as

$$A^{(i+1)} = a^{(i)}_0\alpha + a^{(i)}_1\alpha^2 + \cdots + a^{(i)}_{m-1}\alpha^{m} + a^{(i)}_m\alpha^{m+1} \tag{10a}$$

where

$$A^{(i)} = \sum_{j=0}^{m} a^{(i)}_j\alpha^j \tag{10b}$$

Substituting (3) into (10a), $A^{(i+1)}$ can be obtained as

$$A^{(i+1)} = a^{(i+1)}_0 + a^{(i+1)}_1\alpha + \cdots + a^{(i+1)}_m\alpha^{m} \tag{11a}$$

where

$$a^{(i+1)}_0 = a^{(i)}_m \tag{11b}$$

$$a^{(i+1)}_j = a^{(i)}_{j-1} \quad \text{for } 1 \le j \le m \tag{11c}$$

It is also possible to extend (11) further to obtain $A^{(i+l)}$ directly from $A^{(i)}$ for $1 \le l \le m$, such that

$$a^{(i+l)}_j = \begin{cases} a^{(i)}_{m-l+j+1} & \text{for } 0 \le j \le l-1 \\ a^{(i)}_{j-l} & \text{otherwise} \end{cases} \tag{12}$$

We have used the above equations to derive the proposed linear systolic structure based on a novel cut-set retiming strategy and register-sharing technique.
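The single-step shift of (11) and the direct $l$-step form of (12) can be compared with a few lines of code. This is a sketch under the assumption that the coefficients $a_0, \ldots, a_m$ are held in a Python list indexed by $j$; the helper names are hypothetical.

```python
def shift_once(bits):
    """One reduction step: new a_0 is the old a_m, new a_j is the old a_(j-1)."""
    return [bits[-1]] + bits[:-1]


def shift_by(bits, l):
    """Direct l-step form: wrap-around taps for j <= l-1, plain shift otherwise."""
    m = len(bits) - 1
    return [bits[m - l + j + 1] if j <= l - 1 else bits[j - l] for j in range(m + 1)]


# Example for m = 4: the element 1 shifted by two positions becomes alpha^2
print(shift_by([1, 0, 0, 0, 0], 2))  # [0, 0, 1, 0, 0]
```

Since both forms are fixed wire permutations, a PE can take the unshifted operand and derive its own shifted copy locally, which is the property the register-sharing structure relies on.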

III. PROPOSED STRUCTURE

In this section, we derive a basic systolic design followed by the proposed register sharing structure.

A. Basic Systolic Design

For systolic implementation of multiplication over $GF(2^m)$, the operations of (7), (8) and (11) can be performed recursively. Each recursion is composed of three steps, i.e., modular reduction of (11), bit-multiplication of (8), and bit-addition of (7). Equations (7), (8) and (11) can be represented by the SFG (shown in Fig. 1) consisting of $m$ modular reduction nodes $R(i)$ and $m$ addition nodes $A(i)$ for $1 \le i \le m$, and $(m+1)$ multiplication nodes $M(i)$ for $0 \le i \le m$. The functions of these nodes are shown in Fig. 1(b)–(d). Node $R(i)$ performs the modular reduction of degree by one according to (11). Node $M(i)$ performs an AND operation of a bit of operand $B$ with a reduced form of operand $A$, according to (8). Node $A(i)$ performs the bit-addition operation according to (7), as shown in Fig. 1(d), on the partial result available to the node.

Fig. 1. SFG of the algorithm. (a) The SFG. (b) Function of node $R(i)$. (c) Function of node $M(i)$. (d) Function of node $A(i)$.

Fig. 2. Cut-set retiming of the SFG. (a) Cut-set retiming in a general way. (b) Proposed cut-set retiming. (c) Formation of PE. “D” denotes unit delay.

Generally, we can introduce a delay between the reduction node and its corresponding bit-multiplication and bit-addition nodes, as shown in Fig. 2(a), such that the critical-path is not larger than $(T_A + T_X)$, where $T_A$ and $T_X$ refer to the propagation delays of an AND gate and an XOR gate, respectively. In this section, however, we introduce a novel cut-set retiming to reduce the critical-path of a PE to $T_X$. It is observed that node $R(i)$ performs only the bit-shift operation according to (11), and therefore it does not involve any time consumption. Therefore, we introduce a critical-path which is not larger than $T_X$, as shown in Fig. 2(b). To derive the basic design of a systolic multiplier, we have shown the formation of a PE of the retimed SFG in Fig. 2(c). It can be observed that the cut-set retiming allows the reduction, bit-addition, and bit-multiplication operations to be performed concurrently, so that the critical-path is reduced to $\max\{T_{\mathrm{add}}, T_{\mathrm{mul}}, T_{\mathrm{red}}\}$, where $T_{\mathrm{add}}$, $T_{\mathrm{mul}}$ and $T_{\mathrm{red}}$ are, respectively, the computation times of the bit-addition nodes, bit-multiplication nodes, and reduction nodes.

Fig. 3. Proposed systolic structure. (a) Systolic design. (b) Function of PE[0]. (c) Function of PE[1]. (d) Function of regular PE (from PE[2] to PE[$m-1$]). (e) Function of PE[$m$]. (f) Function of PE[$m+1$].

Fig. 4. Structure of PEs. (a) Internal structure of a regular PE. (b) Internal structure of PE[0] of Fig. 4. (c) An example of AND cell for a specific value of $m$. (d) Structure of the AC. (e) Structure of BSC for a specific value of $m$. (f) Alternate structure of a regular PE. (g) Alternate structure of PE[0].

The basic design of the systolic multiplier thus derived is shown in Fig. 3. It consists of $(m+2)$ PEs, and the functions of the PEs are shown in Fig. 3. During each cycle period, the regular PE (from PE[2] to PE[$m-1$]) not only performs the modular reduction operation according to (11), but also performs the bit-multiplication and bit-addition operations concurrently. The detailed circuit of a regular PE is shown in Fig. 4.

The regular PE, as shown in Fig. 4(a), consists of three basic cells, namely the bit-shift cell (BSC), the AND cell, and the XOR cell. The AND cell and the XOR cell correspond to node $M(i)$ and node $A(i)$ of the SFG of Fig. 1, respectively. The structure of PE[1] of Fig. 3 is shown in Fig. 4(b). It consists of an AND cell and a BSC. Each XOR cell and AND cell in the PE consists of $(m+1)$ gates working in parallel. Fig. 4(c) shows an example of the AND cell for a specific value of $m$. PE[$m+1$] of the systolic structure in Fig. 3 consists of only an XOR cell, as shown in Fig. 4(d), which performs bit-by-bit XOR operations of its pair of inputs. The BSC in the PE performs the bit-shift operation according to (11). We have shown an example of the structure of the BSC (of PE[1] of Fig. 4) in Fig. 4(e). Note that according to (12), one can obtain $A^{(i)}$ directly from $A^{(0)}$ for $1 \le i \le m$, i.e., every PE of the structure of Fig. 3 can have the same input operand $A^{(0)}$, and $A^{(i)}$ can be obtained from the BSC after $A^{(0)}$ is fed as input. Therefore, we can change the circuit-designs of Fig. 4(a) and (b) into the form of Fig. 4(f) and (g), respectively. Besides, according to (11), the operation of node $R(i)$ does not involve any area and time-consumption. Therefore, the minimum duration of clock-period of a regular PE amounts to $\max\{T_A, T_X\} = T_X$. The proposed systolic design yields the first output of the desired product $(m+2)$ cycles after the first input is fed to the structure, while the successive outputs are available in each cycle.

Fig. 5. Proposed low latency systolic structure. (a) The systolic structure. (b) Function of the AC.
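The latency and throughput behaviour described above can be illustrated with a cycle-level software model. The following is a behavioural sketch only: it assumes $(m+2)$ stages with one pipeline register each, and the `stage` function is a simplified stand-in that does not reproduce the gate-level PEs, the retimed register placement, or the exact division of work between PE[0], PE[1] and the last PE. All names are hypothetical.

```python
def rotl1(a, m):
    """Multiply by alpha: cyclic left shift of an (m+1)-bit word."""
    return ((a << 1) | (a >> m)) & ((1 << (m + 1)) - 1)


def stage(k, item, m):
    """Work done by pipeline stage k on one item (shifted operand, B, partial sum)."""
    a_k, b, acc = item
    if k <= m:
        if (b >> k) & 1:
            acc ^= a_k            # AND cell followed by XOR cell
        a_k = rotl1(a_k, m)       # bit-shift cell
    return (a_k, b, acc)          # stage m+1 only forwards the result in this sketch


def run_pipeline(pairs, m):
    """Feed one (A, B) pair per cycle; return (cycle, product) for each output."""
    n = m + 2
    regs = [None] * n             # regs[k] holds the output register of stage k
    feed = list(pairs)
    results = []
    cycle = 0
    while feed or any(r is not None for r in regs):
        cycle += 1
        # Update from the last stage backwards so each stage reads last cycle's data.
        for k in range(n - 1, 0, -1):
            prev = regs[k - 1]
            regs[k] = stage(k, prev, m) if prev is not None else None
        if feed:
            a, b = feed.pop(0)
            regs[0] = stage(0, (a, b, 0), m)
        else:
            regs[0] = None
        if regs[n - 1] is not None:
            results.append((cycle, regs[n - 1][2]))
    return results


# m = 4: first product after m + 2 = 6 cycles, then one product per cycle
print(run_pipeline([(0b00011, 0b01000), (0b00101, 0b00110)], 4))  # [(6, 24), (7, 30)]
```

The products are reported in the extended basis; the model is meant to show the pipeline timing rather than the circuit.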

B. Shared-Register Low-Latency Systolic Structure

For irreducible AOP, $m$ is an even number. Therefore, let � and � be two integers such that �� � � �� � , where is an integer in the range � � � �. For example, if we choose � � ��, then � � �, � �, (7) can be rewritten as

$$C = \sum_{i=0}^{m/2} X_i + \sum_{i=m/2+1}^{m} X_i \tag{13}$$

As shown in (13), one of the sums contains $(m/2+1)$ partial products while the other has $m/2$ partial products. Based on (13), the systolic structure of Fig. 4 could be modified to a form shown in Fig. 5, which consists of two systolic branches. The upper branch consists of ��� � �� PEs and the lower branch consists of �� � � PEs and a delay cell. Besides, an addition-cell (AC) is required to perform the final addition of the outputs of the two systolic arrays, as shown in Fig. 5(b). The structure has the PEs of the same complexity as those in Fig. 3, but the latency of the structure is only ��� � �� cycles.
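The two-branch decomposition can be checked against the single accumulation in software. The sketch below assumes that the accumulation of (7) is split into a first sum over $i = 0, \ldots, m/2$ and a second sum over $i = m/2+1, \ldots, m$, that the second branch obtains its starting operand by a direct multi-position shift, and that the addition cell is a bitwise XOR; the function names are hypothetical.

```python
def rot(a, l, m):
    """Cyclic left shift of an (m+1)-bit word by l positions (multiply by alpha^l)."""
    n = m + 1
    l %= n
    return ((a << l) | (a >> (n - l))) & ((1 << n) - 1)


def branch_sum(a, b, m, start, stop):
    """Accumulate b_i * A^(i) for start <= i < stop, starting from a pre-shifted A."""
    acc = 0
    a_i = rot(a, start, m)        # direct shift, no dependence on the other branch
    for i in range(start, stop):
        if (b >> i) & 1:
            acc ^= a_i
        a_i = rot(a_i, 1, m)
    return acc


def two_branch_product(a, b, m):
    """Upper branch: i = 0..m/2; lower branch: i = m/2+1..m; AC adds the two."""
    half = m // 2
    upper = branch_sum(a, b, m, 0, half + 1)
    lower = branch_sum(a, b, m, half + 1, m + 1)
    return upper ^ lower


print(two_branch_product(0b00011, 0b01000, 4))  # 24, same as the single accumulation
```

Both branches read the same operand $A$ and differ only in which bits of $B$ they consume, which is what makes sharing the operand registers between paired PEs possible.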

It is observed that the two systolic branches in Fig. 5 share the same input operand �, and the PEs in both the branches perform the same operation except the last PE in each of the branches. Therefore, we present an efficient structure using the register-sharing technique as shown in Fig. 6, where the structure consists of ��� � �� PEs and an AC. The circuit of its regular PE (from PE[2] to ������ ��) is shown in Fig. 6(c). It combines two regular PEs of Fig. 5(a) together by sharing one input-operand-transfer. The other PEs need some minor modifications, as shown in Fig. 6(b), (d) and (e), respectively. The function of the AC is the same as that in Fig. 5. Thus, the whole structure requires only � ��� � � �� � � bit-registers, while the structure of Fig. 4 requires �������� bit-registers. Besides, the latency of the structure is ��� � �� cycles, while the duration of cycle period of a regular PE is still $T_X$.

Fig. 6. Low-latency register-sharing systolic structure. (a) The systolic structure. (b) Structure of PE[1]. (c) Structure of a regular PE (from PE[2] to ������� ��). (d) Structure of �������. (e) Structure of ������� ��.

Fig. 7. Improved low-latency systolic structure. (a) The proposed systolic array merging. (b) Improved systolic structure.

We may further decompose the design in Fig. 6. For example, if we choose � � ��, then � � �, � �, (7) can be rewritten as

� �

�����

���

�� �

�����

�����

�� �

������

�����

�� �

������

�� (14)

Following the same approach as the one used to derive the structure of Fig. 5, we can have the design in Fig. 7(a), which consists of four systolic branches. Similarly, following the approach presented to derive the structure of Fig. 6 from Fig. 5, we may have the design shown in Fig. 7(b). The design of Fig. 7(b) requires only ��� � �� cycles of latency. When $m$ is a large number, � and � can be chosen as

� � � � ��� �� (15)

to obtain an optimal realization.

IV. HARDWARE AND TIME COMPLEXITY

The proposed structure (see Fig. 6) requires ������ PEs and one AC. Each of the regular PEs consists of ��� � XOR gates in a pair of XOR cells and ��� � AND gates in a pair of AND cells. Besides, the AC requires �� � XOR gates. Moreover, � ��� � � ��� � bit-registers are required for transferring data to the nearby PE. The latency of the design is ��� � �� cycles, where the duration of the clock-period is $T_X$. The structure of Fig. 7 requires nearly the same gate-counts as that of Fig. 6, but its latency is ������ cycles. The number of gates, latency and critical-path of the proposed designs (see Figs. 6 and 7) and the existing designs of [11]–[16] are listed in Table I.

It can be seen that the proposed design outperforms the existing designs. Although slightly more registers than that in [11] are used, the proposed design requires shorter latency and lower critical-path than the other as well as the MUX gates. The digit-serial structures of [12] and [16] yield one product word in ��� and ������ �� clock-periods, respectively, while the proposed structure produces one product word in every clock-period. Besides, as shown in Fig. 7, the proposed design can be extended further to obtain a more efficient design for high-speed implementation, especially when $m$ is a large number.

TABLE I. AREA AND TIME COMPLEXITY. Table notes define the delay of a 3-input XOR gate, the delay of a T flip-flop, the digit size (for the digit-serial structure), and the delay of a 2:1 MUX.

TABLE II. COMPARISON OF AREA AND TIME COMPLEXITY.

TABLE III. FPGA SYNTHESIS RESULT OF PROPOSED AND EXISTING DESIGNS. PC: power consumption. LE: logic element.

The proposed design (see Fig. 7) has been coded in VHDL and synthesized by Synopsys Design Compiler using TSMC 90-nm library for � � �� along with the bit-parallel systolic design of [15] and digit-serial systolic structure of [16]. The average computation time (ACT), area and power consumption (at 100 MHz frequency) thus obtained are listed in Table II. The proposed design has at least 28.5% less area-delay product (ADP) and 28.2% lower power-delay product (PDP) compared to the existing ones.

Besides, we have synthesized the proposed design (see Fig. 7) and the designs of [15] and [16] for � � �� and implemented them on an Altera FPGA: Cyclone-II EP2C15AF256A7 using Quartus II 9.0. From the synthesis result, as shown in Table III, we find that the proposed design has lower ADP and less PDP than the existing ones.

V. CONCLUSION

Efficient systolic design for the multiplication over $GF(2^m)$ based on irreducible AOP is proposed. By novel cut-set retiming we have been able to reduce the critical path to one XOR gate delay, and by sharing of registers for the input-operands in the PEs, we have derived a low-latency bit-parallel systolic multiplier. Compared with the existing systolic structures for bit-parallel realization of multiplication over $GF(2^m)$, the proposed one is found to involve less area, shorter critical-path and lower latency. From ASIC and FPGA synthesis results we find that the proposed design involves significantly less ADP and PDP than the existing designs. Moreover, our proposed design can be extended to further reduce the latency.

REFERENCES

[1] M. Ciet, J. J. Quisquater, and F. Sica, “A secure family of composite finite fields suitable for fast implementation of elliptic curve cryptography,” in Proc. Int. Conf. Cryptol. India, 2001, pp. 108–116.

[2] H. Fan and M. A. Hasan, “Relationship between $GF(2^m)$ Montgomery and shifted polynomial basis multiplication algorithms,” IEEE Trans. Comput., vol. 55, no. 9, pp. 1202–1206, Sep. 2006.

[3] C.-L. Wang and J.-L. Lin, “Systolic array implementation of multipliers for finite fields $GF(2^m)$,” IEEE Trans. Circuits Syst., vol. 38, no. 7, pp. 796–800, Jul. 1991.

[4] B. Sunar and C. K. Koc, “Mastrovito multiplier for all trinomials,” IEEE Trans. Comput., vol. 48, no. 5, pp. 522–527, May 1999.

[5] C. H. Kim, C.-P. Hong, and S. Kwon, “A digit-serial multiplier for finite field $GF(2^m)$,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 13, no. 4, pp. 476–483, 2005.

[6] C. Paar, “Low complexity parallel multipliers for Galois fields $GF((2^n)^4)$ based on special types of primitive polynomials,” in Proc. IEEE Int. Symp. Inform. Theory, 1994, p. 98.

[7] H. Wu, “Bit-parallel polynomial basis multiplier for new classes of finite fields,” IEEE Trans. Comput., vol. 57, no. 8, pp. 1023–1031, Aug. 2008.

[8] S. Fenn, M. G. Parker, M. Benaissa, and D. Taylor, “Bit-serial multiplication in $GF(2^m)$ using all-one polynomials,” IEE Proc. Com. Digit. Tech., vol. 144, no. 6, pp. 391–393, 1997.

[9] K.-Y. Chang, D. Hong, and H.-S. Cho, “Low complexity bit-parallel multiplier for $GF(2^m)$ defined by all-one polynomials using redundant representation,” IEEE Trans. Comput., vol. 54, no. 12, pp. 1628–1629, Dec. 2005.

[10] H.-S. Kim and S.-W. Lee, “LFSR multipliers over $GF(2^m)$ defined by all-one polynomial,” Integr., VLSI J., vol. 40, no. 4, pp. 571–578, 2007.

[11] P. K. Meher, Y. Ha, and C.-Y. Lee, “An optimized design of serial-parallel finite field multiplier for $GF(2^m)$ based on all-one polynomials,” in Proc. ASP-DAC, 2009, pp. 210–215.

[12] M. Sandoval, M. F. Uribe, and C. Kitsos, “Bit-serial and digit-serial $GF(2^m)$ Montgomery multipliers using linear feedback shift registers,” IET Comput. Digit. Tech., vol. 5, no. 2, pp. 86–94, 2011.

[13] C.-Y. Lee, E.-H. Lu, and J.-Y. Lee, “Bit-parallel systolic multipliers for $GF(2^m)$ fields defined by all-one and equally spaced polynomials,” IEEE Trans. Comput., vol. 50, no. 6, pp. 385–393, May 2001.

[14] Y.-R. Ting, E.-H. Lu, and Y.-C. Lu, “Ringed bit-parallel systolic multipliers over a class of fields $GF(2^m)$,” Integr., VLSI J., vol. 38, no. 4, pp. 571–578, 2005.

[15] C.-Y. Lee, J.-S. Horng, I.-C. Jou, and E.-H. Lu, “Low-complexity bit-parallel systolic Montgomery multipliers for special classes of $GF(2^m)$,” IEEE Trans. Comput., vol. 54, no. 9, pp. 1061–1070, Sep. 2005.

[16] S. Talapatra, H. Rahaman, and J. Mathew, “Low complexity digit serial systolic Montgomery multipliers for special class of $GF(2^m)$,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 5, pp. 847–852, May 2010.

[17] T. Itoh and S. Tsujii, “Structure of parallel multipliers for a class of fields $GF(2^m)$,” Inform. Computation, vol. 83, no. 1, pp. 21–40, 1989.

[18] C. Negre, “Quadrinomial modular arithmetic using modified polynomial basis,” in Proc. ITCC, 2005, pp. 550–555.

[19] Z. Chen, M. Jing, J. Chen, and Y. Chang, “New viewpoint of bit-serial/parallel normal basis multipliers using irreducible all-one polynomial,” in Proc. ISCAS, 2006, pp. 1499–1502.

[20] C.-Y. Lee, C. W. Chiou, J. M. Lin, and C. C. Chang, “Scalable and systolic Montgomery multiplier over $GF(2^m)$ generated by trinomials,” IET Circuits Dev. Syst., vol. 1, no. 6, pp. 477–484, 2007.

[21] C.-Y. Lee, “Error-correcting codes for concurrent error correction in bit-parallel systolic and scalable multipliers for shifted dual basis of $GF(2^m)$,” in Proc. Intern. Sym. Para. Dis. Process. Appl., 2010, pp. 405–412.

[22] C.-Y. Lee and C. W. Chiou, “New bit-parallel systolic architectures for computing multiplication, multiplicative inversion and division in $GF(2^m)$ under polynomial basis and normal basis representations,” J. Signal Process. Syst., vol. 52, no. 3, 2008.

[23] C.-W. Chiou, C. C. Chang, C. Y. Lee, T. W. Hou, and J. M. Lin, “Concurrent error detection and correction in Gaussian normal basis multiplier over $GF(2^m)$,” IEEE Trans. Comput., vol. 58, no. 6, pp. 851–857, 2009.

[24] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. New York: Wiley, 1999.

[25] P. K. Meher, “Systolic and non-systolic scalable modular designs of finite field multipliers for Reed-Solomon Codec,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 17, no. 6, pp. 747–757, Jun. 2009.

Design and Implementation of an On-Chip Permutation Network for Multiprocessor System-On-Chip

Phi-Hung Pham, Junyoung Song, Jongsun Park, and Chulwoo Kim

Abstract—This paper presents the silicon-proven design of a novel on-chip network to support guaranteed traffic permutation in multiprocessor system-on-chip applications. The proposed network employs a pipelined circuit-switching approach combined with a dynamic path-setup scheme under a multistage network topology. The dynamic path-setup scheme enables runtime path arrangement for arbitrary traffic permutations. The circuit-switching approach offers a guarantee of permuted data, and its compact overhead enables the benefit of stacking multiple networks. A 0.13-μm CMOS test-chip validates the feasibility and efficiency of the proposed design. Experimental results show that the proposed on-chip network achieves 1.9× to 8.2× reduction of silicon overhead compared to other design approaches.

Index Terms—Guaranteed throughput, multistage interconnection network, network-on-chip, permutation network, pipelined circuit-switching, traffic permutation.

I. INTRODUCTION

Multiprocessor system-on-chip (MPSoC) designs interconnected by on-chip networks are currently emerging for applications such as parallel processing and scientific computing [1]–[6].

Manuscript received January 12, 2011; revised June 23, 2011 and October 20, 2011; accepted December 07, 2011. Date of publication January 17, 2012; date of current version December 19, 2012. This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea Government (MEST) (2011-0020128).

The authors are with the School of Electrical Engineering, Korea University, Seoul 136-713, South Korea (e-mail: [email protected]; [email protected]).

Digital Object Identifier 10.1109/TVLSI.2011.2181545

Permutation traffic, a traffic pattern in which each input sends traffic to exactly one output and each output receives traffic from exactly one input, is one of the important traffic classes exhibited by on-chip multiprocessing applications [7], [8]. Standard permutations of traffic occur in general-purpose MPSoCs: for example, polynomial, sorting, and fast Fourier transform (FFT) computations cause shuffled permutation, whereas matrix transposes or corner-turn operations exhibit transpose permutation [6]. Recently, application-specific MPSoCs targeting flexible Turbo/LDPC decoding have been developed, and they exhibit arbitrary and concurrent traffic permutations due to their multi-mode and multi-standard features [3]–[5]. In addition, many MPSoC applications (e.g., Turbo/LDPC decoding [3]–[5]) compute in real time; therefore, guaranteeing throughput (i.e., lossless data, predictable latency, guaranteed bandwidth, and in-order delivery) is critical for such permutation traffic.
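The definition of permutation traffic above can be made concrete with a small Python sketch. The function names are hypothetical, and the bit-rotation form of the shuffle and the row/column swap for the transpose are the usual textbook definitions, assumed here for illustration rather than taken from this paper.

```python
def is_permutation(dest):
    """True if dest[i], the output targeted by input i, hits every output exactly once."""
    return sorted(dest) == list(range(len(dest)))


def perfect_shuffle(n_bits):
    """Shuffle permutation (assumed textbook form): rotate the n_bits-bit address left by one."""
    n = 1 << n_bits
    mask = n - 1
    return [((i << 1) | (i >> (n_bits - 1))) & mask for i in range(n)]


def transpose(k):
    """Transpose permutation on a k x k grid: node (row, col) sends to node (col, row)."""
    return [(i % k) * k + (i // k) for i in range(k * k)]


if __name__ == "__main__":
    print(perfect_shuffle(3))                  # destination of each of 8 inputs
    print(transpose(2))                        # destination of each of 4 inputs
    print(is_permutation(perfect_shuffle(3)))  # every output is used exactly once
    print(is_permutation([0, 0, 1, 2]))        # two inputs contend for output 0
```

A destination list that fails `is_permutation` has at least two inputs contending for the same output, which is exactly the situation a permutation pattern rules out.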

Most on-chip networks in practice are general-purpose and use routing algorithms such as dimension-ordered routing and minimal adaptive routing. To support permutation traffic patterns, on-chip permutation networks using application-aware routing are needed to achieve better performance than general-purpose networks [8]. Such application-aware routing is configured before the application runs and can be implemented as source routing or distributed routing. However, it cannot efficiently handle dynamic changes of the permutation pattern, which occur in many application phases [8]. The difficulty lies in the design effort to compute the routing that supports permutation changes at runtime, as well as to guarantee [9] the permuted traffic. This becomes a great challenge when these permutation networks must be implemented under very limited on-chip power and area overhead.

A review of on-chip permutation networks (supporting either full or partial permutation) with regard to their implementation shows that most of the networks employ a packet-switching mechanism to deal with the conflict of permuted data [3]–[6]. Their implementations use either first-input first-output (FIFO) queues for the conflicting data [3], [5], [6], or time-slot allocation in the overall system at the cost of more routing stages [5], or a complex routing with a deflection technique that avoids buffering of the conflicting data [4]. The choices of network design factors, i.e., topology, switching technique, and routing algorithm, have different impacts on the on-chip implementation.

Regarding the topology, regular direct topologies, such as mesh and torus [2], [3], [6], are intuitively feasible for physical layout in a 2-D chip. In contrast, the high wiring irregularity and the large router radix of indirect topologies such as Benes or Butterfly [4], [5] pose a challenge for physical implementation [10]. However, an arbitrary permutation pattern, with its intensive load on individual source-destination pairs, stresses the regular topologies, which may lead to throughput degradation [7]. In fact, indirect multistage topologies are preferred for on-chip traffic-permutation intensive applications [4], [5].
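To illustrate how a permutation can concentrate load on a regular topology, the following Python sketch counts how many flows share the busiest link of a k x k mesh under dimension-ordered (XY) routing. It is a simplified model built on assumptions: the node numbering (id = y*k + x), the one-flow-per-source traffic, and the function names are all hypothetical, and it does not reproduce the evaluation in [7].

```python
from collections import Counter


def xy_route(src, dst, k):
    """Directed links used by XY routing on a k x k mesh (assumed node id = y*k + x)."""
    x, y = src % k, src // k
    dx, dy = dst % k, dst // k
    links = []
    while x != dx:  # travel along the X dimension first
        nx = x + (1 if dx > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dy:  # then along the Y dimension
        ny = y + (1 if dy > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links


def max_link_load(dest, k):
    """Largest number of source-destination flows sharing one directed link."""
    load = Counter()
    for src, dst in enumerate(dest):
        for link in xy_route(src, dst, k):
            load[link] += 1
    return max(load.values(), default=0)


if __name__ == "__main__":
    k = 4
    transpose = [(i % k) * k + (i // k) for i in range(k * k)]
    neighbor_swap = [i ^ 1 for i in range(k * k)]
    print(max_link_load(transpose, k), max_link_load(neighbor_swap, k))
```

In this model, the transpose permutation on a 4 x 4 mesh puts three flows on its busiest link, while a nearest-neighbor swap puts only one flow on any link, which hints at why some permutations stress a mesh more than others.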

Regarding the switching technique, packet switching requires an excessive amount of on-chip power and area for the queuing buffers (FIFOs) with pre-computed queuing depth at the switching nodes and/or network interfaces [3]–[6].

Regarding the routing algorithm, deflection routing [4] is not energy-efficient due to the extra hops needed for deflected data transfer, compared to minimal routing [2], [3], [5]. Moreover, deflection makes packet latency less predictable; hence, it is hard to guarantee the latency and the in-order delivery of data.

This paper presents a novel silicon-proven design of an on-chip permutation network to support guaranteed throughput of permutated traf-
