Research Article
High Performance and Low Power Hardware Implementation for Cryptographic Hash Functions

Yunlong Zhang,1 Joohee Kim,1 Ken Choi,1 and Taeshik Shon2

1 Electrical and Computer Engineering, Illinois Institute of Technology, Chicago, IL 60616, USA
2 Division of Information and Computer Engineering, College of Information Technology, Ajou University, San 5, Woncheon-Dong, Yeongtong-Gu, Suwon 443-749, Republic of Korea

Correspondence should be addressed to Ken Choi; kchoi@ece.iit.edu

Received 12 September 2013; Accepted 4 January 2014; Published 2 March 2014

Academic Editor: Jongsung Kim

Copyright © 2014 Yunlong Zhang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Since hash functions are cryptography's most widely used primitives, efficient hardware implementation of hash functions is of critical importance. The proposed high performance hardware implementation of hash functions uses the sponge construction, which generates a digest of any desired length, and considers two key design metrics: throughput and power consumption. Firstly, this paper introduces the unfolding transformation, which increases the throughput of the hash function, and pipelining and parallelism design techniques, which reduce the delay. Secondly, we propose a frequency trade-off technique, which gives a range of frequency values for making a trade-off between low dynamic power consumption and high throughput. Finally, we use a load-enable based clock gating scheme to eliminate wasted signal toggling in the idle mode of the hash encryption system. We demonstrated the proposed design techniques using 45 nm CMOS technology at 10 MHz. The results show that we can achieve up to 47.97 times higher throughput, 6.31% delay reduction, and 13.65% dynamic power reduction.

1. Introduction

The explosion of e-commerce has boosted the number of transactions over the internet; thus we have to prevent intruders from accessing sensitive information. This situation calls for a higher level of security protection. There are many types of modern cryptography, for example, symmetric-key cryptography, public-key cryptography, and cryptographic hash functions. Cryptographic hash functions are used in almost every modern application, especially in a multitude of protocols, for example, in digital signatures to achieve message authentication and integrity protection. For example, hash-based message authentication codes (HMACs) are used in the IP security protocol and also in the secure sockets layer (SSL) protocol [1].

Some hash functions, such as the message-digest algorithm (MD) series (MD4 and its strengthened variant MD5) and the secure hash algorithm (SHA) series (SHA-0 and SHA-1), were widely used but have been broken in practice. Considering the potential danger of SHA-2 being attacked, in 2008 the National Institute of Standards and Technology (NIST) started the NIST hash competition to develop the future hash standard SHA-3 [2].

Although software encryption is becoming more prevalent today, hardware design is the embodiment of choice for many commercial and military applications [3]. Firstly, a hardware design is much faster than the corresponding software implementation [4]. Secondly, a hardware implementation provides physical protection and thus a high level of security [5]. However, a hash function with a higher security level requires more complicated logic, and a larger amount of information requires a higher clock frequency to improve the efficiency (or throughput). As a result, the power dissipation of the hardware design increases tremendously. This causes serious problems in hardware systems, such as lower reliability, higher energy consumption, and higher device costs. Thus, low power techniques are highly valued in today's hardware design.
Hindawi Publishing Corporation International Journal of Distributed Sensor Networks Volume 2014, Article ID 736312, 12 pages http://dx.doi.org/10.1155/2014/736312



Figure 1: Sponge construction [6]. (Labels: padded message M, state memory split into a br-bit part and a c-bit part and initialized to 0, function f, output Z, absorbing phase, squeezing phase.)

The rest of this paper is organized as follows. Section 2 introduces the sponge construction and the low power methods used in this paper. Section 3 analyzes the hash function built on the sponge construction and its original hardware implementation, and then presents the unfolding transformation and the pipelining and parallelism design techniques used to improve the throughput and delay of the hash function. Section 4 constructs the hash encryption system and introduces two low power techniques: the frequency trade-off technique and the load-enable based clock gating scheme. Section 5 concludes the paper.

2. Background of the Research

In this section, the sponge construction is explained first. Then we introduce the two dynamic power reduction methods used in this paper.

2.1. Sponge Construction. The idea of the sponge construction came from the design of RadioGatún, and its final definition was given at the Ecrypt Hash Workshop in Barcelona [6]. As shown in Figure 1, the sponge construction takes an input of arbitrary length, keeps a finite internal state, and produces an output of any desired length.

There are three components in the sponge construction [7]:

(i) a state memory;
(ii) a function of fixed length that permutes or transforms the state memory;
(iii) a padding function.

The state memory in Figure 1 is divided into two parts: the top section, called the bitrate, of $br$ bits and the bottom section, called the capacity, of $c$ bits. The input message ($M$ in Figure 1) is padded to a whole multiple of the bitrate, so the padded input message can be broken into $br$-bit blocks.

The sponge construction consists of two processes: absorbing and squeezing. In the absorbing process (left of the dashed line in Figure 1), firstly, the input message is padded and the state memory is initialized; secondly, the first $br$-bit block of the padded input is XORed with the $br$-bit part of the initial state memory; thirdly, the fixed length function (block $f$ in Figure 1) updates the state memory. Steps two and three are then repeated, one block per iteration, until all the padded $br$-bit blocks are used up. In the squeezing process (right of the dashed line), firstly, the $br$-bit part of the latest state memory is taken as the first $br$-bit output; secondly, if more output bits are needed, the fixed length function updates the state memory and the $br$-bit part of the new state memory becomes the next $br$-bit output. This process is repeated until the desired number of output bits ($Z$ in Figure 1) is produced.
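The absorbing and squeezing steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the state is held in one integer whose top $br$ bits are the bitrate part, and `toy_f` is a hypothetical stand-in for the fixed length function $f$ with no security value.

```python
from typing import Callable, List


def sponge(blocks: List[int], f: Callable[[int], int],
           br: int, c: int, out_blocks: int) -> List[int]:
    """Absorb br-bit blocks, then squeeze out_blocks br-bit outputs."""
    mask = (1 << (br + c)) - 1
    state = 0                          # state memory initialized to zero
    for block in blocks:               # absorbing phase
        state ^= block << c            # XOR the block into the br-bit part
        state = f(state) & mask        # fixed length function updates the state
    out = []
    for n in range(out_blocks):        # squeezing phase
        if n > 0:                      # first output is read without calling f
            state = f(state) & mask
        out.append(state >> c)         # br-bit part of the current state
    return out


def toy_f(state: int) -> int:
    """Hypothetical 16-bit stand-in for f (illustration only, not secure)."""
    return (state * 0x9E3779B1 + 1) & 0xFFFF


# One 8-bit block absorbed into a 16-bit state (br = 8, c = 8), two outputs.
print(sponge([0x01], toy_f, br=8, c=8, out_blocks=2))
```

Reading the first output block before applying $f$ again follows the order given in the text for the squeezing process.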

The extent to which the $c$-bit part is altered by the input message depends on the fixed length function [7]. The security of the hash function, for example, its resistance to collision or preimage attacks, relies on this $c$-bit part. Because its input and output sizes can be arbitrarily long, the sponge construction allows various primitives to be built, such as hash functions. The Keccak hash function, known as the new SHA-3, uses this sponge construction.

2.2. Dynamic Power Reduction Methods. Digital circuits consume dynamic power in the active mode. There are two sources of dynamic power consumption [8]:

(i) the charging and discharging of the output capacitance;
(ii) the short-circuit current that flows when the PMOS and NMOS networks are both ON.

Because the short-circuit power is usually less than 10% of the total dynamic power [9], the dynamic power consumption that we try to reduce refers to the switching power for the rest of this paper. Dynamic power is given by (1), where $C_L$ is the load capacitance, $V_{DD}$ is the supply voltage, $f$ is the clock frequency, and TR is the toggle rate of the gate output:

\[
P_{\text{dynamic}} = \frac{1}{2} C_L V_{DD}^{2} f \cdot \mathrm{TR}. \tag{1}
\]

Since power optimization at RTL has a significant impact with reasonable accuracy, RTL is considered the optimal stage for low power techniques [8]. According to (1), four parameters determine the dynamic power consumption: supply voltage, clock frequency, load capacitance, and the toggle rate of the gate output. Because reducing the supply voltage increases the critical path delay, and changing the capacitance of the gate output requires redesigning the load logic, it is more efficient to focus on clock frequency and toggle rate at RTL.
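As a small numerical illustration of (1), the sketch below evaluates the switching power for one set of operating points. The capacitance, voltage, and toggle rate are hypothetical values chosen only to make the arithmetic easy to follow; they are not measurements from this paper.

```python
def dynamic_power(c_load: float, vdd: float, freq: float, toggle_rate: float) -> float:
    """Switching power in watts according to equation (1)."""
    return 0.5 * c_load * vdd ** 2 * freq * toggle_rate


# Hypothetical operating point: 1 pF load, 1.0 V supply, 10 MHz clock, TR = 0.2.
base = dynamic_power(c_load=1e-12, vdd=1.0, freq=10e6, toggle_rate=0.2)
half_clock = dynamic_power(c_load=1e-12, vdd=1.0, freq=5e6, toggle_rate=0.2)
half_toggle = dynamic_power(c_load=1e-12, vdd=1.0, freq=10e6, toggle_rate=0.1)
print(base, half_clock, half_toggle)
```

Halving either the clock frequency or the toggle rate halves the switching power, which is why these two parameters are the ones targeted at RTL.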

2.2.1. Dynamic Voltage/Frequency Scaling. Figure 2 shows a basic dynamic voltage/frequency scaling (DVFS) system. By collecting information about the workload and the temperature, the DVFS controller determines a clock frequency that is sufficient to finish the work and gives the best performance without overheating. This variable clock frequency scheme reduces dynamic power by choosing a proper clock frequency.

2.2.2. Load-Enable Based Clock Gating. The combinational clock gating technique is widely used to address dynamic power in single-level registers, whereas the sequential clock gating method considers multiple-level (pipeline) registers. In this research, we focus on the combinational clock gating technique; in particular, we use the load-enable based clock gating scheme [10].

Figure 2: DVFS system [9]. (Labels: DVFS controller with workload and temperature inputs, voltage control, frequency control, switching voltage regulator, Vin, VDD, core logic.)

Figure 3: Load-enable based clock gating. (Labels: D[N-1:0], Q[N-1:0], en, clk, gclk, FFs.)

Figure 3 shows the typical structure of the load-enable based clock gating scheme. If the data do not change during some consecutive clock periods, or the enable signal is kept low, the clock toggles in those periods are wasted. This technique can be applied to a circuit with a multiplexer in which the enable signal is the selection signal, or to a pipelined circuit such as the hash encryption system in this research.
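The effect of gating can be illustrated with a cycle-level behavioural model. This is a simplified sketch and not a description of the circuit in Figure 3: the function names are hypothetical, the gated clock is modelled simply as the clock passing only in cycles where the enable is high, and latch timing is ignored.

```python
def run_register(data, enable, init=0):
    """Load-enable register: Q takes D only in cycles where enable is high."""
    q, trace = init, []
    for d, e in zip(data, enable):
        if e:
            q = d
        trace.append(q)
    return trace


def clock_edges_at_register(enable, gated: bool) -> int:
    """Number of clock edges that reach the flip-flops over the given cycles."""
    if gated:
        return sum(1 for e in enable if e)   # clock passes only when enabled
    return len(enable)                        # free-running clock


data = [5, 6, 7, 8]
enable = [1, 0, 0, 1]
print(run_register(data, enable))                     # register contents per cycle
print(clock_edges_at_register(enable, gated=False))   # edges without gating
print(clock_edges_at_register(enable, gated=True))    # edges with gating
```

The register contents are the same in both cases; only the number of clock edges delivered to the flip-flops, and therefore the toggle activity in (1), changes.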

3. Proposed High-Speed Hashing Module in Hardware

Cryptographic hash functions provide powerful protection for data and have been utilized in the security layer of every communication protocol. However, as protocols evolve, data sizes and communication speeds are increasing dramatically, and the low throughput of hash functions becomes a bottleneck in these digital communication systems. A promising solution is hardware implementation on reconfigurable devices, which combines high flexibility with speed and physical security.

Various techniques have been proposed to speed up hash functions or to improve their throughput, for example, unfolding transformation and pipeline and parallelism techniques. In this section, the characteristics relevant to the hardware implementation of the hash algorithm are presented first. Then the high-speed hashing module is introduced based on the delay bound analysis. Finally, two techniques, unfolding transformation and pipeline and parallelism, are used to optimize the inner logic of the transformation rounds.

Table 1: The parameters of SHAT.

| SHAT | Hash value | Number of steps |
|---|---|---|
| SHAT-128 | 128 bits | 48 |
| SHAT-256 | 256 bits | 48 |
| SHAT-384 | 384 bits | 48 |

3.1. Hash Algorithm Specification. In this section, we introduce a cryptographic hash algorithm based on the sponge construction, called the sponge hash algorithm (SHAT). SHAT is a hash function generating 128-, 256-, or 384-bit hash values. According to the hash value length, SHAT is denoted by SHAT-$(128 \cdot i)$ $(i = 1, 2, 3)$. The parameters of SHAT are shown in Table 1.

3.1.1. $G$ Function. The $G$ function of SHAT consists of an $S$-box and a diffusion layer. The $S$-box is a substitution function that satisfies the confusion property on each 4-bit word. A 32-bit input word $W$, for example, is divided into eight 4-bit words $(w_0, \ldots, w_7)$, and each 4-bit word goes through the $S$-box. The $S$-box is defined as $sw_i = \mathrm{Sbox}(w_i)$ $(i = 0, \ldots, 7)$ and is specified in Table 2. The diffusion layer is a permutation that satisfies the diffusion property (the same as the $P$ function of Camellia [11]). For computational efficiency, the diffusion layer should be expressed using only bit-wise exclusive ORs. The branch number of the diffusion layer should be optimal against differential and linear cryptanalysis for security [11]. When all eight 4-bit outputs of the $S$-box $(sw_0, \ldots, sw_7)$ are available, the diffusion layer mixes them. The diffusion layer is defined as (2):

\[
\begin{pmatrix} w'_0 \\ w'_1 \\ w'_2 \\ w'_3 \\ w'_4 \\ w'_5 \\ w'_6 \\ w'_7 \end{pmatrix}
=
\begin{pmatrix}
0 & 1 & 1 & 1 & 1 & 0 & 0 & 1 \\
1 & 0 & 1 & 1 & 1 & 1 & 0 & 0 \\
1 & 1 & 0 & 1 & 0 & 1 & 1 & 0 \\
1 & 1 & 1 & 0 & 0 & 0 & 1 & 1 \\
0 & 1 & 1 & 1 & 1 & 1 & 1 & 0 \\
1 & 0 & 1 & 1 & 0 & 1 & 1 & 1 \\
1 & 1 & 0 & 1 & 1 & 0 & 1 & 1 \\
1 & 1 & 1 & 0 & 1 & 1 & 0 & 1
\end{pmatrix}
\begin{pmatrix} w_0 \\ w_1 \\ w_2 \\ w_3 \\ w_4 \\ w_5 \\ w_6 \\ w_7 \end{pmatrix}
\tag{2}
\]
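The $G$ function described above can be sketched as follows, using the $S$-box values of Table 2 and the rows of the binary matrix in (2). Two points are assumptions of this sketch rather than statements from the text: $w_0$ is taken to be the most significant nibble of the 32-bit word, and the vector multiplied by the matrix is taken to hold the $S$-box outputs $sw_0, \ldots, sw_7$.

```python
# S-box values from Table 2: SBOX[w] = Sbox(w).
SBOX = [0x1, 0x2, 0x4, 0xB, 0xD, 0xE, 0xA, 0x5,
        0xF, 0x8, 0x9, 0x7, 0x6, 0x3, 0x0, 0xC]

# Rows of the binary matrix in equation (2); row j selects the nibbles
# that are XORed together to form output nibble j.
DIFFUSION = ["01111001", "10111100", "11010110", "11100011",
             "01111110", "10110111", "11011011", "11101101"]


def g_function(word: int) -> int:
    """Apply the S-box to each nibble, then the XOR-only diffusion layer."""
    # Assumption: w_0 is the most significant nibble of the 32-bit word.
    w = [(word >> (28 - 4 * j)) & 0xF for j in range(8)]
    sw = [SBOX[x] for x in w]
    out = 0
    for row in DIFFUSION:
        nibble = 0
        for j, bit in enumerate(row):
            if bit == "1":
                nibble ^= sw[j]
        out = (out << 4) | nibble
    return out


print(hex(g_function(0x00000000)))
```

For the all-zero word every $S$-box output is 0x1, so each output nibble is 1 when its matrix row has an odd number of ones and 0 otherwise.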

3.1.2. Hash Function of SHAT. SHAT uses the hermetic sponge construction, as shown in Figure 4. As mentioned in Section 2, $br$ is called the bitrate and $c$ is called the capacity. The bitrate ($br$) and the capacity ($c$) of SHAT-$(128 \cdot i)$ $(i = 1, 2, 3)$ are $32 \cdot i$ and $96 \cdot i$ bits, respectively. The internal state $S$ is divided into $4 \cdot i$ sections as $S = (S_0, \ldots, S_{4i-1})$ $(i = 1, 2, 3)$.

Table 2: $S$-box of the $G$ function.

| $w$ | Sbox($w$) | $w$ | Sbox($w$) |
|---|---|---|---|
| 0x0 | 0x1 | 0x8 | 0xF |
| 0x1 | 0x2 | 0x9 | 0x8 |
| 0x2 | 0x4 | 0xA | 0x9 |
| 0x3 | 0xB | 0xB | 0x7 |
| 0x4 | 0xD | 0xC | 0x6 |
| 0x5 | 0xE | 0xD | 0x3 |
| 0x6 | 0xA | 0xE | 0x0 |
| 0x7 | 0x5 | 0xF | 0xC |

Figure 4: Sponge construction of SHAT. (Labels: initial state 0, 0, $128 \cdot i$; bitrate br and capacity c; message blocks M0 to Mn; outputs H0, H1, H2, H3; Perm blocks; initialization, absorbing, and squeezing phases.)

In the absorbing phase, the input message $M = (M_0, M_1, \ldots, M_{n-1})$ shown in Figure 4 is padded to a whole multiple of the bitrate ($br$). The padding method is as follows. Let $l$ be the total length of the input message in bits (we assume that $l$ is a whole multiple of four, that is, the message is an integer number of hexadecimal digits). We append a 1 bit to the end of the message, followed by $k$ zero bits, where $k$ is the smallest nonnegative integer that satisfies

\[
(l + 1 + k) \bmod (32 \cdot i) = 0. \tag{3}
\]
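The padding rule in (3) can be written out directly. The sketch below represents the message as a string of '0' and '1' characters purely for readability; the function names are hypothetical and the representation is not taken from the paper.

```python
def pad_message(bits: str, i: int) -> str:
    """Append a 1 bit and then k zero bits so that (l + 1 + k) mod (32 * i) == 0."""
    block = 32 * i
    k = (-(len(bits) + 1)) % block     # smallest nonnegative k satisfying (3)
    return bits + "1" + "0" * k


def split_blocks(padded: str, i: int):
    """Break the padded message into br-bit blocks, with br = 32 * i."""
    block = 32 * i
    return [padded[p:p + block] for p in range(0, len(padded), block)]


padded = pad_message("10101010", i=1)   # l = 8, so k = 23
print(len(padded), split_blocks(padded, i=1))
```

Because the 1 bit is always appended, a message whose length is already a multiple of $32 \cdot i$ gains a full extra block.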

Then we set $S_{4i-1}$ as the bitrate part, which is XORed with the padded $br$-bit message block. The result then goes through the one-way compression function Perm. Perm is a permutation process that has 48 steps, and each STEP is defined in Algorithm 1. In Algorithm 1, the left circular rotations $\mathrm{rot}_k$ are $\mathrm{rot}_0 = 19$, $\mathrm{rot}_1 = 1$, and $\mathrm{rot}_2 = 14$. In the squeezing phase, the output of SHAT is defined in (4). SHAT-$(128 \cdot i)$ $(i = 1, 2, 3)$ is specified in Algorithm 2.

\[
\mathrm{SQUEEZE}(S, i) =
\begin{cases}
S_3, & i = 1, \\
(S_3, S_7), & i = 2, \\
(S_3, S_7, S_{11}), & i = 3.
\end{cases}
\tag{4}
\]

3.2. Hardware Implementation. Following the specification of SHAT-$(128 \cdot i)$ $(i = 1, 2, 3)$ in Algorithm 2, the architecture of SHAT is illustrated in Figure 5.

The $S$-box of the $G$ function is designed from a Karnaugh map. According to Table 2, we obtain the logic functions of the $S$-box shown in (5), where $A_i$ $(i = 0, 1, 2, 3)$ are the input bits of the $S$-box and $Q_i$ $(i = 0, 1, 2, 3)$ are the output bits, with $A_3$ and $Q_3$ the most significant bits:

\[
\begin{aligned}
Q_3 &= \bar{A}_3 \bar{A}_2 A_1 A_0 + A_3 A_2 A_1 A_0 + \bar{A}_3 A_2 \bar{A}_1 + \bar{A}_3 A_2 \bar{A}_0 + A_3 \bar{A}_2 \bar{A}_1 + A_3 \bar{A}_2 \bar{A}_0, \\
Q_2 &= A_3 \bar{A}_1 \bar{A}_0 + \bar{A}_3 A_2 \bar{A}_1 + A_3 A_1 A_0 + \bar{A}_3 A_2 A_0 + \bar{A}_3 \bar{A}_2 A_1 \bar{A}_0, \\
Q_1 &= A_3 \bar{A}_1 \bar{A}_0 + \bar{A}_2 A_1 A_0 + \bar{A}_3 \bar{A}_2 A_0 + A_2 \bar{A}_1 A_0 + \bar{A}_3 A_2 A_1 \bar{A}_0, \\
Q_0 &= \bar{A}_3 \bar{A}_1 \bar{A}_0 + \bar{A}_3 A_1 A_0 + A_3 A_2 \bar{A}_1 A_0 + A_3 \bar{A}_2 \bar{A}_0 + A_3 \bar{A}_2 A_1.
\end{aligned}
\tag{5}
\]
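A sum-of-products form such as (5) can be checked exhaustively against Table 2, since the $S$-box has only 16 inputs. In the sketch below the complemented literals are a reconstruction chosen so that the product terms of (5) reproduce Table 2 with $A_3$ and $Q_3$ as the most significant bits; this bit ordering is an assumption of the sketch.

```python
# S-box values from Table 2: SBOX[w] = Sbox(w).
SBOX = [0x1, 0x2, 0x4, 0xB, 0xD, 0xE, 0xA, 0x5,
        0xF, 0x8, 0x9, 0x7, 0x6, 0x3, 0x0, 0xC]


def sbox_logic(a3: int, a2: int, a1: int, a0: int) -> int:
    """Evaluate the four sum-of-products outputs for one 4-bit input."""
    n3, n2, n1, n0 = 1 - a3, 1 - a2, 1 - a1, 1 - a0   # complemented inputs
    q3 = ((n3 & n2 & a1 & a0) | (a3 & a2 & a1 & a0) | (n3 & a2 & n1)
          | (n3 & a2 & n0) | (a3 & n2 & n1) | (a3 & n2 & n0))
    q2 = ((a3 & n1 & n0) | (n3 & a2 & n1) | (a3 & a1 & a0)
          | (n3 & a2 & a0) | (n3 & n2 & a1 & n0))
    q1 = ((a3 & n1 & n0) | (n2 & a1 & a0) | (n3 & n2 & a0)
          | (a2 & n1 & a0) | (n3 & a2 & a1 & n0))
    q0 = ((n3 & n1 & n0) | (n3 & a1 & a0) | (a3 & a2 & n1 & a0)
          | (a3 & n2 & n0) | (a3 & n2 & a1))
    return (q3 << 3) | (q2 << 2) | (q1 << 1) | q0


# Compare the gate-level form with the lookup table for all 16 inputs.
matches = all(
    sbox_logic((w >> 3) & 1, (w >> 2) & 1, (w >> 1) & 1, w & 1) == SBOX[w]
    for w in range(16)
)
print(matches)
```

The same exhaustive comparison is a convenient regression check whenever the logic is re-minimized or mapped to a different gate library.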

There are 48 iteration rounds in the basic architecture of the Perm function. We use the rolling loop technique to reduce the area requirement: our design is a single operation block that is reused 48 times, as shown in Figure 6. Here $r_i$ $(i = 0, \ldots, 47)$ is a counter for the number of iteration rounds, from 0 to 47. The critical path is highlighted by the bold line. Since the delay of a circular shift is negligible in hardware implementation, the critical path delay of this architecture is

\[
\mathrm{Delay}_{n=1} = 4 \cdot \mathrm{Delay}(\oplus) + \mathrm{Delay}(g). \tag{6}
\]

3.3. Proposed High-Speed Module. In the previous section, we introduced the rolling loop technique to construct the Perm function. Although this approach is area efficient, the throughput is low because 48 clock cycles are required to generate the result. Many architectures can be derived by varying the Perm function to solve this problem. We applied the unfolding transformation technique. This high-speed module combines several STEP blocks into a single round and can even take advantage of architectures with a completely round-unrolled circuit. By unfolding, hidden concurrencies can be parallelized [12]. In addition, the pipeline and parallelism technique explained in [13] is used to improve the unfolded construction of the hash function. This technique relies on precomputation, which is identified by analysing the inner logic and architecture of the hash function.

3.3.1. Unfolding Transformation. According to Figure 6, the mathematical expression of one iteration round is

\[
\begin{aligned}
S'_3 &= \mathrm{ROT}(S'_1) \oplus (S'_0 \oplus S_2), \\
S'_2 &= S_1, \\
S'_1 &= G(S_3 \oplus r \oplus S_2) \oplus (S_0 \oplus S_1), \\
S'_0 &= S_3 \oplus r.
\end{aligned}
\tag{7}
\]


Step($S$)

(i) For $k = 0$ to $i - 1$:
    (a) $S_{4k+3} = S_{4k+3} \oplus r$
    (b) $S_{4k} = S_{4k} \oplus S_{4k+1}$
    (c) $S_{4k+2} = S_{4k+2} \oplus S_{4k+3}$
    (d) $S_{4k} = S_{4k} \oplus G(S_{4k+2})$
    (e) $S_{4k+2} = S_{4k+2} \oplus (S_{4k} \lll \mathrm{rot}_k)$
(ii) Temp $= S_{4i-1}$
(iii) For $k = 4i - 1$ down to 1: $S_k = S_{k-1}$
(iv) $S_0 =$ Temp

Algorithm 1: Typical one step algorithm.

SHAT-$(128 \cdot i)(M)$

Inputs: $n$ padded message blocks $M = (M_0, M_1, \ldots, M_{n-1})$
Outputs: $(128 \cdot i)$-bit hash value $(H_0, H_1, H_2, H_3)$

(1) $S = (S_0, \ldots, S_{4i-1}) = (0, 0, \ldots, 0, 128 \cdot i)$ (initialization)
(2) Perm($S$)
(3) For $j = 0$ to $n - 1$ (absorbing phase):
    (i) For $k = 0$ to $i - 1$: $S_{4k+3} = S_{4k+3} \oplus M_{j,k}$
    (ii) Perm($S$)
(4) $H_0 = \mathrm{SQUEEZE}(S, i)$ (squeezing phase)
(5) For $k = 1$ to 3:
    (i) Perm($S$)
    (ii) $H_k = \mathrm{SQUEEZE}(S, i)$

Algorithm 2: SHAT-$(128 \cdot i)$.
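Algorithms 1 and 2 can be transcribed into a short executable sketch. It is not a reference implementation and has not been validated against any test vectors, since none are given here. The following are assumptions of the sketch: $w_0$ is the most significant nibble in the $G$ function, the round constant $r$ equals the step index 0 to 47, and each padded block $M_j$ is supplied as a list of $i$ 32-bit words $M_{j,0}, \ldots, M_{j,i-1}$.

```python
MASK = 0xFFFFFFFF
SBOX = [0x1, 0x2, 0x4, 0xB, 0xD, 0xE, 0xA, 0x5,
        0xF, 0x8, 0x9, 0x7, 0x6, 0x3, 0x0, 0xC]             # Table 2
DIFF = [0x79, 0xBC, 0xD6, 0xE3, 0x7E, 0xB7, 0xDB, 0xED]      # rows of (2)
ROT = (19, 1, 14)                                            # rot_0, rot_1, rot_2


def g(word: int) -> int:
    """G function: nibble-wise S-box followed by the XOR diffusion layer."""
    sw = [SBOX[(word >> (28 - 4 * j)) & 0xF] for j in range(8)]  # w_0 = top nibble (assumed)
    out = 0
    for row in DIFF:
        nib = 0
        for j in range(8):
            if (row >> (7 - j)) & 1:
                nib ^= sw[j]
        out = (out << 4) | nib
    return out


def rotl(x: int, n: int) -> int:
    """32-bit left circular rotation."""
    return ((x << n) | (x >> (32 - n))) & MASK


def step(s, r, i):
    """One STEP of Algorithm 1 on a state of 4*i 32-bit words."""
    s = list(s)
    for k in range(i):
        s[4 * k + 3] ^= r
        s[4 * k] ^= s[4 * k + 1]
        s[4 * k + 2] ^= s[4 * k + 3]
        s[4 * k] ^= g(s[4 * k + 2])
        s[4 * k + 2] ^= rotl(s[4 * k], ROT[k])
    return [s[-1]] + s[:-1]            # S_k = S_{k-1}, S_0 = old S_{4i-1}


def perm(s, i):
    """Perm: 48 STEPs; r is assumed to be the step index."""
    for r in range(48):
        s = step(s, r, i)
    return s


def shat(blocks, i):
    """Algorithm 2; returns the digest as 4*i 32-bit words (H_0 .. H_3)."""
    s = perm([0] * (4 * i - 1) + [128 * i], i)      # initialization + Perm
    for block in blocks:                            # absorbing phase
        for k in range(i):
            s[4 * k + 3] ^= block[k]
        s = perm(s, i)
    digest = []
    for n in range(4):                              # squeezing phase
        if n > 0:
            s = perm(s, i)
        digest += [s[4 * k + 3] for k in range(i)]  # SQUEEZE(S, i) from (4)
    return digest


print([hex(x) for x in shat([[0x80000000]], 1)])
```

For $i = 1$ the `step` function reduces to the four assignments in (7), which gives a direct way to cross-check the rolled hardware round against the algorithmic description.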

Figure 5: A typical SHAT core. (Labels: padding unit, RAM with 32-bit wide registers, SHAT, message digest extraction, control unit; input data, padded data of n × 32 bits, message digest of 4 × 32 bits, 128 bits.)

Figure 6: Typical architecture of one STEP round. (Labels: inputs and outputs S0 to S3, round counter ri, g block, ROT block.)

Here $S_i$ $(i = 0, 1, 2, 3)$ is the input of the current round and $S'_i$ $(i = 0, 1, 2, 3)$ is the output of this round (or the input of the next round). In order to distribute the 48 operations equally over the rounds, the possible values of the unfolding factor are the divisors of 48, that is, 1, 2, 3, 4, 6, 8, 12, 16, 24, and 48. For example, if we unfold two STEP operations in each round, we get 24 rounds in one permutation process. The throughput is given as

\[
\text{Throughput} = \frac{(\#\text{ of bits}) \cdot f_{\text{round}}}{\#\text{ of rounds}}. \tag{8}
\]
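As a quick check of (8), the sketch below evaluates the throughput of SHAT-128 for a few numbers of rounds per permutation. It assumes that the number of bits is the 32-bit block of SHAT-128 and that one round is completed per clock cycle at a fixed 10 MHz clock; the function name is hypothetical.

```python
def throughput_bps(block_bits: int, f_hz: float, rounds: int) -> float:
    """Equation (8): bits processed per second for a given number of rounds."""
    return block_bits * f_hz / rounds


for rounds in (48, 24, 12, 1):
    mbps = throughput_bps(block_bits=32, f_hz=10e6, rounds=rounds) / 1e6
    print(rounds, round(mbps, 2))
```

At a fixed clock the throughput scales inversely with the number of rounds, so the fully unrolled case is 48 times the rolled one before any change in the achievable clock frequency is taken into account.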

Considering (7), although this unfolding transformation reduces the maximum operating frequency, the throughput is increased significantly because the number of operations is reduced from 48 to 24. The mathematical expression of one iteration round is replaced by

\[
\begin{aligned}
\mathrm{temp}_3 &= \mathrm{ROT}(\mathrm{temp}_1) \oplus (\mathrm{temp}_0 \oplus S_2), \\
\mathrm{temp}_2 &= S_1, \\
\mathrm{temp}_1 &= G(S_3 \oplus r \oplus S_2) \oplus (S_0 \oplus S_1), \\
\mathrm{temp}_0 &= S_3 \oplus r, \\
S'_3 &= \mathrm{ROT}(S'_1) \oplus (S'_0 \oplus \mathrm{temp}_2), \\
S'_2 &= \mathrm{temp}_1, \\
S'_1 &= G(\mathrm{temp}_3 \oplus r \oplus \mathrm{temp}_2) \oplus (\mathrm{temp}_0 \oplus \mathrm{temp}_1), \\
S'_0 &= \mathrm{temp}_3 \oplus r.
\end{aligned}
\tag{9}
\]
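The unfolded round in (9) is simply two applications of (7) with the intermediate state kept in temporaries, which can be confirmed by comparing a rolled and an unfolded permutation. In the sketch below `g` is a hypothetical placeholder mixing function, not the $G$ function of Section 3.1.1, because the equivalence does not depend on what $G$ computes. The two round constants are written `r` and `r_next`, following the labels $r_i$ and $r_{i+1}$ in Figure 7, and are assumed to be consecutive step indices.

```python
MASK = 0xFFFFFFFF


def rotl(x: int, n: int = 19) -> int:
    """32-bit left circular rotation (rot_0 = 19 for SHAT-128)."""
    return ((x << n) | (x >> (32 - n))) & MASK


def g(x: int) -> int:
    """Placeholder for G (hypothetical, for the equivalence check only)."""
    return (x * 0x9E3779B1 + 0x7F4A7C15) & MASK


def one_step(s, r):
    """One iteration round, equation (7)."""
    s0, s1, s2, s3 = s
    n0 = s3 ^ r
    n1 = g(s3 ^ r ^ s2) ^ (s0 ^ s1)
    n2 = s1
    n3 = rotl(n1) ^ (n0 ^ s2)
    return (n0, n1, n2, n3)


def two_steps(s, r, r_next):
    """Two STEPs unfolded into one round, equation (9)."""
    s0, s1, s2, s3 = s
    t0 = s3 ^ r
    t1 = g(s3 ^ r ^ s2) ^ (s0 ^ s1)
    t2 = s1
    t3 = rotl(t1) ^ (t0 ^ s2)
    n0 = t3 ^ r_next
    n1 = g(t3 ^ r_next ^ t2) ^ (t0 ^ t1)
    n2 = t1
    n3 = rotl(n1) ^ (n0 ^ t2)
    return (n0, n1, n2, n3)


def perm_rolled(s):
    for r in range(48):                 # 48 rounds, one STEP each
        s = one_step(s, r)
    return s


def perm_unfolded(s):
    for r in range(0, 48, 2):           # 24 rounds, two STEPs each
        s = two_steps(s, r, r + 1)
    return s


start = (0x01234567, 0x89ABCDEF, 0x0F1E2D3C, 0x4B5A6978)
print(perm_rolled(start) == perm_unfolded(start))
```

The same pattern extends to any unfolding factor that divides 48: only the number of temporaries per round changes, not the result.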

3.3.2. Pipeline and Parallelism. Suppose we unroll two STEP operations in each round. This certainly lowers the maximum frequency while increasing the throughput; however, increased area is introduced as a penalty. If some logic can be computed in parallel, and this parallelism lies on the critical path, then the delay of each round can be decreased so that the operating frequency can be increased. According to (8), when the number of operations is kept constant (the number of bits is also kept constant), the throughput increases with the frequency. This method could be used in any other hardware implementation of a hash function.

For example, Figure 7 shows the architecture with two STEP operations unfolded in one round that has the minimum critical path delay. The critical path, highlighted by the bold line, is composed of seven XOR gates and two $G$ functions. Compared with the architecture of one STEP block, unfolding two STEPs in one round adds three 32-bit XOR gates and one $G$ function to the critical path.

In Figure 7, the cycle counter $r_{i+1}$ can first be combined with $\mathrm{temp}_2$ and then XORed with $\mathrm{temp}_3$ in the second STEP part. In the first STEP part, by comparison, $r_i$ is XORed with $S_3$ and then XORed with $S_2$. We can see that an additional component is needed to combine $\mathrm{temp}_3$ and $r_{i+1}$. Because this output must be generated, the area penalty cannot be avoided.

Thus, when we increase the number of unfolded STEP operations, for example, to three, four, or five, the delay of each round increases by three 32-bit XOR gates and one $G$ function per additional STEP. Therefore, the normalized delay with unfolding factor $n$ $(n = 1, 2, 3, \ldots)$ is

\[
\mathrm{Delay}_n = \frac{4 \cdot \mathrm{Delay}(\oplus) + \mathrm{Delay}(g) + (n - 1) \cdot \bigl(3 \cdot \mathrm{Delay}(\oplus) + \mathrm{Delay}(g)\bigr)}{n}. \tag{10}
\]

Taking the limit of (10) as $n$ grows gives

\[
\lim_{n \to \infty} \mathrm{Delay}_n = 3 \cdot \mathrm{Delay}(\oplus) + \mathrm{Delay}(g). \tag{11}
\]

This is the delay bound of SHAT, which means that the delay of one SHAT operation round cannot be less than this bound.
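The behaviour of (10) and its limit (11) can be seen numerically. The gate delays used below are hypothetical unit values chosen for illustration (one unit for a 32-bit XOR, five units for $G$); they are not taken from the 45 nm library used in the experiments.

```python
def normalized_delay(n: int, d_xor: float, d_g: float) -> float:
    """Equation (10): per-STEP delay when n STEPs are unfolded into one round."""
    return (4 * d_xor + d_g + (n - 1) * (3 * d_xor + d_g)) / n


def delay_bound(d_xor: float, d_g: float) -> float:
    """Equation (11): limit of the normalized delay for large n."""
    return 3 * d_xor + d_g


D_XOR, D_G = 1.0, 5.0        # hypothetical delays in arbitrary units
for n in (1, 2, 4, 48):
    print(n, normalized_delay(n, D_XOR, D_G))
print("bound", delay_bound(D_XOR, D_G))
```

The normalized delay decreases monotonically toward the bound, and the gap is exactly one XOR delay divided by $n$, which is why the gain from further unfolding flattens out quickly.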

Figure 7: Proposed architecture of two STEPs round. (Labels: inputs and outputs S0 to S3, intermediate values temp0 to temp3, round counters ri and ri + 1, two g blocks, two ROT blocks, XOR gates.)

3.4. Experimental Results. We introduce a measure of hardware efficiency in (12) [14], which refines the usual figure of merit (FOM). We assume that power is proportional to the gate count, so the metric is divided by GE a second time, in place of the power dissipation, when throughput is traded off for power. Note that one gate equivalent (GE) is equal to the area of a two-input NAND gate in 45 nm CMOS technology.

\[
\mathrm{FOM} = \frac{\text{Throughput}}{\mathrm{GE}^2}. \tag{12}
\]
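The FOM in (12) can be evaluated directly from block size, cycle count, and area. The scaling used in the sketch below, nanobits per clock cycle per GE squared, is an inference: it is the scaling that reproduces the SHAT entries of the result tables from their block size, number of operations, and area, and it is not stated explicitly in the text.

```python
def fom(block_bits: int, cycles: int, area_ge: float) -> float:
    """Equation (12), scaled to nanobits per clock cycle per GE^2 (assumed unit)."""
    bits_per_cycle = block_bits / cycles
    return bits_per_cycle / area_ge ** 2 * 1e9


# SHAT-128: 32-bit block, 48 operations, 1605 GE.
print(round(fom(32, 48, 1605), 2))
# SHAT-256: 64-bit block, 48 operations, 3193 GE.
print(round(fom(64, 48, 3193), 2))
```

Because the area enters squared, a design that halves its area at equal throughput quadruples its FOM, which is why compact designs score well even with modest throughput.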

Table 3 shows the hardware implementation results of some 128-bit hash functions using a 100 kHz clock frequency and 45 nm CMOS technology. Firstly, the throughput of SHAT-128 (66.67 kbps) is lower than that of five other implementations: MD4 (112.28 kbps), MD5 (83.66 kbps), H-Present-128 with 32 rounds (200 kbps), and ARMADILLO2-B (250 kbps and 1000 kbps). However, the area of SHAT-128 is on average only 28.42% of theirs, which makes the hardware efficiency of SHAT-128 13.12 times higher on average. Secondly, the area of SHAT-128 (1605 GE) is larger than that of three implementations, namely, U-QUARK with 544 rounds (1379 GE), PHOTON-128 with 996 rounds (1122 GE), and SPONGENT-128 with an 8-bit block and 2380 rounds (1060 GE); however, the throughput of SHAT-128 is 94.27 times higher, so the FOM of SHAT-128 is 46.75 times higher on average. Finally, the area of SHAT-128 (1605 GE) is smaller than that of four implementations, namely, H-Present-128 with 559 rounds (2330 GE), U-QUARK with 68 rounds (2392 GE), PHOTON-128 with 156 rounds (1708 GE), and SPONGENT-128 with 70 rounds (1687 GE), and its throughput is also 5.95 times higher than theirs on average. As a result, the FOM of SHAT-128 is 9.66 times higher on average.

Table 3: Hardware implementation results of some 128-bit hash functions.

| Hash function | Block size (bits) | Number of operations | Throughput at 100 kHz (kbps) | Area (GE) | FOM |
|---|---|---|---|---|---|
| SHAT-128 | 32 | 48 | 66.67 | 1605 | 258.80 |
| H-Present-128 [15] | 128 | 559 | 11.45 | 2330 | 21.09 |
| H-Present-128 [15] | 128 | 32 | 200 | 4256 | 110.41 |
| MD4 [15] | 512 | 456 | 112.28 | 7350 | 20.78 |
| MD5 [15] | 512 | 612 | 83.66 | 8400 | 11.86 |
| ARMADILLO2-B [15] | 64 | 256 | 250 | 4353 | 13.19 |
| ARMADILLO2-B [15] | 64 | 64 | 1000 | 6025 | 27.55 |
| U-QUARK [15] | 8 | 544 | 1.47 | 1379 | 7.73 |
| U-QUARK [15] | 8 | 68 | 11.76 | 2392 | 20.56 |
| PHOTON-128 [15] | 16 | 996 | 1.61 | 1122 | 12.78 |
| PHOTON-128 [15] | 16 | 156 | 10.26 | 1708 | 35.15 |
| SPONGENT-128 [15] | 8 | 2380 | 0.34 | 1060 | 2.99 |
| SPONGENT-128 [15] | 16 | 70 | 11.43 | 1687 | 40.16 |

In Table 4, firstly, the throughput of SHAT-256 is 51.05% of that of Grøstl; however, the area of SHAT-256 is only 21.84% of that of Grøstl, which gives SHAT-256 an 84.47 times higher hardware efficiency. Secondly, the throughput of SHAT-256 is on average 412.91 times higher than that of two implementations, PHOTON-256 with 156 rounds (2177 GE) and SPONGENT-256 with 9520 rounds (1950 GE); although the area of SHAT-256 (3193 GE) is larger, its FOM is still 158.25 times higher than theirs. Thirdly, compared with SHA-256, ARMADILLO2-E, BLAKE, PHOTON-256 with 156 rounds, and SPONGENT-256 with 140 rounds, the throughput of SHAT-256 is 4.65 times higher on average and its area is only 49.15% of theirs on average. Therefore, the FOM of SHAT-256 is 119.14 times higher on average.

Table 4: Hardware implementation results of some 256-bit hash functions.

| Hash function | Block size (bits) | Number of operations | Throughput at 100 kHz (kbps) | Area (GE) | FOM |
|---|---|---|---|---|---|
| SHAT-256 | 64 | 48 | 133.33 | 3193 | 130.78 |
| SHA-256 [14] | 512 | 490 | 104.48 | 8588 | 14.17 |
| ARMADILLO2-E [14] | 128 | 512 | 25 | 8653 | 3.34 |
| ARMADILLO2-E [14] | 128 | 128 | 100 | 11914 | 7.05 |
| BLAKE [14] | 32 | 816 | 72.79 | 13575 | 0.21 |
| Grøstl [14] | 64 | 196 | 261.14 | 14622 | 1.53 |
| PHOTON-256 [14] | 32 | 156 | 3.21 | 2177 | 6.78 |
| PHOTON-256 [14] | 32 | 156 | 20.51 | 4362 | 10.17 |
| SPONGENT-256 [14] | 16 | 9520 | 0.17 | 1950 | 0.44 |
| SPONGENT-256 [14] | 16 | 140 | 11.43 | 3281 | 10.62 |

Table 5: Hardware implementation results of some 384-bit hash functions.

| Hash function | Block size (bits) | Number of operations | Throughput at 100 kHz (kbps) | Area (GE) | FOM |
|---|---|---|---|---|---|
| SHAT-384 | 96 | 48 | 200 | 4753 | 88.53 |
| SHA-384 [14] | 1024 | 84 | 1219.04 | 43330 | 6.49 |

Table 6: Performance results of hash function using pipeline and parallelism.

| Number of iteration rounds | Area (GE) | Area increase (%) | Delay (ns) | Delay reduction (%) | Power (μW) | Power increase (%) |
|---|---|---|---|---|---|---|
| 48 | 965 | 0.00 | 0.94 | 0.00 | 27.27 | 0.00 |
| 24 | 2010 | 4.15 | 1.87 | 2.60 | 79.05 | 0.91 |
| 16 | 3055 | 5.53 | 2.81 | 3.44 | 136.42 | 3.52 |
| 12 | 4100 | 6.22 | 3.74 | 4.35 | 193.71 | 4.67 |
| 8 | 6190 | 6.91 | 5.61 | 4.92 | 308.48 | 5.73 |
| 6 | 8280 | 7.25 | 7.47 | 5.32 | 423.18 | 6.21 |
| 4 | 12460 | 7.60 | 11.20 | 5.64 | 652.62 | 6.68 |
| 3 | 16640 | 7.77 | 14.93 | 5.74 | 882.00 | 6.90 |
| 2 | 25000 | 7.94 | 22.40 | 5.88 | 1340.80 | 7.12 |
| 1 | 50080 | 8.12 | 44.70 | 6.31 | 2695.40 | 7.40 |

Table 7: Performance results of unrolling steps constructions.

| Number of iteration rounds | Area (GE) | Delay (ns) | Power (μW) | Throughput at 10 MHz (Mbps) |
|---|---|---|---|---|
| 48 | 965 | 0.94 | 27.27 | 6.67 |
| 24 | 1930 | 1.92 | 78.34 | 13.33 |
| 16 | 2895 | 2.91 | 131.78 | 20.00 |
| 12 | 3860 | 3.91 | 185.06 | 26.67 |
| 8 | 5790 | 5.90 | 291.77 | 40.00 |
| 6 | 7720 | 7.89 | 398.42 | 53.33 |
| 4 | 11580 | 11.87 | 611.78 | 80.00 |
| 3 | 15440 | 15.84 | 825.06 | 106.67 |
| 2 | 23160 | 23.80 | 1251.70 | 160.00 |
| 1 | 46320 | 47.71 | 2509.70 | 320.00 |

In Table 5, the throughput of SHA-384 is 6.09 times higher than that of SHAT-384; however, the area of SHA-384 is 9.11 times higher. As a result, the hardware efficiency of SHAT-384 is 13.64 times higher than that of SHA-384.

Then we implement the unfolding transformation technique with 10 different numbers of unrolling loops (1, 2, ..., 48) by using 45 nm CMOS technology at 10 MHz to evaluate the performance of SHAT-128; the results are shown in Table 7. As we can see in Table 7, the throughput of the PERM function can reach up to 47.97 times that of the original design, which is 6.67 Mbps. However, area, delay, and power increase dramatically as a penalty.
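The throughput column of Table 7 can be checked with a few lines of Python. This is a minimal sketch, assuming that throughput equals clock frequency times block size divided by the number of iteration rounds (clock cycles) per Perm call, with the 32-bit block of SHAT-128; the function name `throughput_mbps` is illustrative and not taken from the paper.

```python
def throughput_mbps(f_mhz: float, block_bits: int, rounds: int) -> float:
    """Throughput in Mbps, assuming one block is processed every `rounds` clock cycles."""
    return f_mhz * block_bits / rounds


# Iteration rounds per Perm call for the ten designs listed in Table 7.
ROUNDS = [48, 24, 16, 12, 8, 6, 4, 3, 2, 1]

if __name__ == "__main__":
    base = throughput_mbps(10.0, 32, 48)  # original design: 48 rounds at 10 MHz
    for r in ROUNDS:
        t = throughput_mbps(10.0, 32, r)
        print(f"{r:2d} rounds: {t:7.2f} Mbps ({t / base:.0f}x the original)")
```

Under this assumption the exact ratio between the fully unrolled and the original design is 48; the reported 47.97 is consistent with dividing 320 Mbps by the rounded value 6.67 Mbps.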

Finally, we implement the pipeline and parallelism technique to reconstruct the STEP block. As shown in Table 6, compared with the original circuit, the critical path delay is reduced by up to 6.31%, while the power and area increase by roughly 8% at most.

4. Low Power Design for Hash Function

Low power design is a significant consideration in hardware implementation. The power consumption determines a device's lifetime, reliability, and energy cost. Thus low power techniques are now applied to nearly every application. There are many methods to reduce power consumption, such as clock gating and power gating, which target dynamic power and leakage power, respectively. Decreasing the clock frequency also reduces the power dissipation dramatically.

Firstly, we propose the frequency trade-off technique. By using this method we obtain a range of frequency values for making a trade-off between low power consumption and high throughput of the hash function. Secondly, we construct a hash encryption system which includes an input data padding unit, RAM registers, the main hash computing construction, a message digest extraction component, and a main control unit. Thirdly, by analyzing the idle mode and control signals of this hash encryption system, a load-enable based clock gating scheme is applied to reduce the dynamic power consumption.

4.1. Frequency Trade-Off Technique. According to (1), reducing the clock frequency decreases dynamic power dissipation linearly. In Section 2.2 we discussed the DVFS technique: by collecting information about workload and temperature, DVFS determines a clock frequency that is sufficient for the required performance. However, modifying the clock frequency at RTL is not easy, and normally we treat the clock frequency as a constant. Also, dynamic frequency scaling reduces the number of operations a system can issue in a given amount of time, thus reducing performance. Therefore, there is an issue we need to consider: a high clock frequency brings high throughput, but the dramatically increased dynamic power consumption is a critical drawback; a low clock frequency minimizes the dynamic power dissipation, but it decreases the throughput as well.

However, according to the unfolding transformation technique introduced in Section 3.3, the maximum frequency of the Perm function decreases as the number of unrolling loops increases. This means that we can decrease the clock frequency while increasing the throughput of the hash algorithm. Thus the unfolding transformation technique achieves high performance without a high clock frequency. Building on this advantage, by choosing a proper clock frequency we can make a trade-off between high performance and low power consumption.

Next, we explain how to get this scope of frequency values from the two performance bounds. First, we obtain two values for the rolling Perm circuit: the dynamic power consumption $P_1$ and the clock frequency $f_1$, which is defined by the necessity of circuit design (the clock period computed from $f_1$ needs to be not less than the critical path delay). Then, according to (8), we can get the throughput $T_1$ at this frequency. Thus the two performance bounds are defined in (13), where $n$ is the number of iteration rounds in one Perm function with rolling STEPs:

$$
P_{\max} = P_1 \cdot n, \qquad T_{\min} = T_1. \tag{13}
$$

This method can be defined as follows. Referring to the performance of the original folded circuit (we assume that this circuit is the one with 48 iteration rounds in one Perm function), each unfolding transformation design with a different number of unrolled STEPs (2, 3, ..., 48) has two performance bounds: one is the maximum dynamic power and the other is the minimum throughput of the circuit. These two performance bounds determine the boundary of the proper frequency range for each unfolding transformation circuit. This means that when we choose a specific clock frequency within this range, the total dynamic power consumption of that PERM function will be no more than the defined maximum dynamic power $P_{\max}$, and its throughput will be no less than the fixed minimum throughput $T_{\min}$.

Figure 8: Hash encryption system (block diagram: input message, receiver, RAM, clock divider, main control, hash process, and LCD displayer; the output digest of n × br bits is sent to the digest display).

This clock frequency scope gives us many different choices for different circuit designs using the unfolding transformation technique. The results of this frequency trade-off technique are shown in Table 9 in Section 4.4.
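The frequency range for one unrolled design can be estimated as follows. This is a minimal sketch under stated assumptions: dynamic power scales linearly with clock frequency as in (1), the total power of a complete Perm function is the per-round power times the number of iteration rounds, and throughput is proportional to frequency divided by the number of rounds. The constants are the 48-round reference values (27.27 μW at 10 MHz) from Table 7; the function name `frequency_range` is hypothetical.

```python
F_REF_MHZ = 10.0       # frequency at which the Table 7 power values were measured
BASE_ROUNDS = 48       # iteration rounds of the original (rolled) Perm circuit
BASE_POWER_UW = 27.27  # dynamic power of the rolled circuit at F_REF_MHZ


def frequency_range(rounds: int, power_at_ref_uw: float) -> tuple:
    """Return (f_min, f_max) in MHz for a design with `rounds` iteration rounds.

    f_min keeps throughput at or above T_min (the rolled design at F_REF_MHZ);
    f_max keeps total dynamic power at or below P_max = P_1 * n.
    """
    p_max = BASE_POWER_UW * BASE_ROUNDS
    f_min = F_REF_MHZ * rounds / BASE_ROUNDS
    f_max = F_REF_MHZ * p_max / (power_at_ref_uw * rounds)
    return f_min, f_max


if __name__ == "__main__":
    # 24-round design: 78.34 uW at 10 MHz according to Table 7.
    lo, hi = frequency_range(24, 78.34)
    print(f"{lo:.2f} MHz < f_24 < {hi:.2f} MHz")
```

For the 24-round design this gives a range of about 5.00 to 6.96 MHz, which agrees with the corresponding row of Table 9.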

4.2. Hash Encryption System Design. The hash encryption system is divided into 5 main parts, as shown in Figure 8.

Firstly, the receiver and RAM section is actually our padding unit. We use serial communication to connect the PC and the hash encryption system. Thus we need a clock divider to generate a clock that is synchronous with the Baud rate of the serial communication. We choose 4800 Baud as our transmission rate, which is not fast but gives a low error rate (less than 3%). In this case one Baud represents 1 bit. Our transmission rule is one start bit "0", then an 8-bit message, and one finish bit "1". The start bit and finish bit are added to the transmitted message bits automatically. The sampling rate of the receiver is 16, and the FPGA board provides a 100 MHz clock frequency. Thus the clock period used in sampling is 1302 times the provided 100 MHz clock period, as shown in (14). The resulting error is 0.0064%, which is less than 3%.

$$
\text{Sampling Clock Cycles} = \frac{\text{Clock Frequency}}{\text{Baud rate} \cdot \text{Sampling Rate}} = \frac{100\ \text{MHz}}{16 \times 4800\ \text{B/s}} \approx 1302. \tag{14}
$$
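The divider value and its rounding error can be reproduced with a short calculation. This is a small sketch of the arithmetic in (14); the variable and function names are illustrative only.

```python
def sampling_divider(clock_hz: float, baud: int, oversample: int):
    """Return (integer divider, relative rounding error in percent)."""
    exact = clock_hz / (baud * oversample)  # ideal, non-integer divider
    divider = round(exact)                  # value a hardware counter can use
    error_pct = abs(exact - divider) / exact * 100.0
    return divider, error_pct


if __name__ == "__main__":
    div, err = sampling_divider(100e6, 4800, 16)
    print(f"divider = {div}, error = {err:.4f}%")
```

With a 100 MHz clock, 4800 Baud, and 16 samples per bit, the ideal ratio is about 1302.08, so rounding to 1302 gives an error of roughly 0.0064%.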

The liquid crystal display (LCD) limits the number of characters we can display to 32 hexadecimal characters, which matches the number of digest bits of SHAT-128 (32 × 4 = 128 bits). Thus $br$ for each padded block is set to 32 bits, which consist of eight 4-bit hexadecimal numbers.

Secondly, the hash function introduced in Section 3 is designed as a sponge construction, as shown in Figure 4. After absorbing $n$ 32-bit message blocks, a 128-bit digest is squeezed out.

Finally, the main control unit manages the working order of the receiver, the hash process, and the LCD display. Figure 9 shows the pipelined operation of the system.

Because we use serial communication, the speed is slow. We apply 4800 Baud as our Baud rate for a low error rate; thus each 32-bit block needs roughly 7 ms. For example, if seven 32-bit blocks need to be transmitted, roughly 50 ms is spent on data receiving and padding. The hash function used in this system performs one STEP in each round, which means that there are 48 iteration rounds for a complete Perm function; nevertheless, hash processing needs only roughly 6 μs. The LCD displaying period also costs much time. Even though we can finish the LCD initialization before we get the hash digest, we still need roughly 15 ms to completely display all data.

4.3. Load-Enable Based Clock Gating. In this section we introduce the load-enable based clock gating technique for the hash encryption system.

Clock gating is the most widely used low power technique at RTL. At RTL it is more practical to control the toggle rate of gate outputs than the other three parameters, namely, $V_{DD}$, clock frequency, and gate output capacitance. According to Figure 9, the hash encryption system is organized as a pipeline. The finishing signal of each process can be treated as the enable signal in load-enable based clock gating, as shown in Figure 3. On the other hand, the XOR-based clock gating technique needs to specify the outputs of single-level flip-flops, which are not easily determined in our encryption system; thus load-enable based clock gating is our best option as a low power method.

As shown in Figure 10, there are three signal pairs to realize this load-enable based clock gating: en_div and fsh_r, en_h and fsh_h, and en_lcd and fsh_lcd. Because the receiver runs at a specific clock frequency that corresponds to the serial communication, the main control unit does not gate the clock signal of the receiver directly; by controlling the clock signal of the clock divider with en_div, the receiver can be properly managed.

Figure 9 gives the three operation phases of the encryption system. In the first phase, the en_div and en_lcd signals are asserted to logic one and en_h is set to logic zero; thus the receiver starts receiving input messages and padding them into RAM. In the meantime, the system begins the initialization process for the LCD displayer, while the hash processing unit waits for the padded input message. Because serial communication takes a long time, owing to the low Baud rate and to the fact that it transmits the message one bit at a time, the LCD displayer initialization can be finished before the padded message is ready. Thus en_lcd can be set to logic zero by the main control unit when fsh_lcd switches to logic one.

Figure 9: Three phases of hash encryption system.

| Unit | Phase one | Phase two | Phase three |
|---|---|---|---|
| Receiver | Data receiving and padding | Idle | Idle |
| Hash | Idle | Hash processing | Idle |
| LCD | Initialization, then idle | Idle | LCD displaying |

Figure 10: Control signals of hash encryption system (the block diagram of Figure 8 annotated with the padded message and digest data paths, the receiver clock clk_r from the clock divider, and the control signals en_div, fsh_r, en_h, fsh_h, en_lcd, and fsh_lcd).

During the second phase, because the padded message is ready, fsh_r switches to logic one. Then en_div is set to zero, which means that the clock divider is turned off; no receiver clock is produced, and thus the receiver stops working. In this phase en_h is asserted to logic one for hash encryption, which is our core function. en_lcd is still zero, waiting for the hash digest generated by hash processing.

The system enters the third phase when the fsh_h signal switches to logic one. In this phase the hash digest is ready; thus both the receiver and the hash process are in idle mode, which means that en_div and en_h are both set to logic zero. Signal en_lcd is asserted to logic one to start LCD displaying, and it is set back to zero when the displaying process is finished. This is the end of the whole operation; the device is then turned off or repeats these three phases for another input message.

By analyzing the construction and process of the hash encryption system, we can determine the idle time of each component. Then, by applying load-enable based clock gating to each component, the dynamic power dissipation of this system can be reduced, as shown in Table 8 in Section 4.4.
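The phase-dependent enable signals described above can be summarized in a small behavioral model. This is only a sketch of the control sequence, not the authors' RTL: the function names and the `lcd_busy` flag (true while LCD initialization or displaying is still running, that is, before fsh_lcd goes high) are assumptions introduced for illustration.

```python
def enables(phase: int, lcd_busy: bool) -> dict:
    """Clock-enable signals issued by the main control unit in a given phase."""
    if phase == 1:   # receiving and padding; LCD initializes in parallel
        return {"en_div": 1, "en_h": 0, "en_lcd": int(lcd_busy)}
    if phase == 2:   # hash processing; receiver and LCD are idle
        return {"en_div": 0, "en_h": 1, "en_lcd": 0}
    if phase == 3:   # LCD displaying; receiver and hash unit are idle
        return {"en_div": 0, "en_h": 0, "en_lcd": int(lcd_busy)}
    raise ValueError("phase must be 1, 2, or 3")


def next_phase(phase: int, fsh_r: bool, fsh_h: bool) -> int:
    """Advance the phase when the finishing signal of the active unit goes high."""
    if phase == 1 and fsh_r:
        return 2
    if phase == 2 and fsh_h:
        return 3
    return phase


if __name__ == "__main__":
    print(enables(1, lcd_busy=True))
    print(enables(2, lcd_busy=False))
```

In every phase at most two of the three units receive a clock, which is where the savings from gating the idle units come from.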

4.4. Experimental Results. Using a 10 MHz clock frequency and 45 nm CMOS technology, the results of the frequency trade-off technique are shown in Tables 9, 10, and 11. Table 9 shows that the area and critical path delay are unchanged compared with the unfolding transformation technique. Tables 10 and 11 give the variation of dynamic power consumption and throughput with the frequency trade-off method. Note that $f_i$ stands for frequency, $T_i$ stands for throughput, and $T_{i,\mathrm{pct}}$ is the percentage increase compared with the minimum throughput ($T_{\min}$), which is 6.67 Mbps. $P_i$ is the total dynamic power consumption for finishing a complete Perm function, and $P_{i,\mathrm{pct}}$ is the percentage of power reduction compared with the maximum power consumption ($P_{\max}$), defined as 1308.96 μW, which is the product of 48 (the number of iteration rounds) and 27.27 μW (as shown in Table 7). Note that $i$ stands for the number of iteration rounds.

Table 8: Hardware implementation with/without load-enable based clock gating.

| System type | Area (GE) | Area increase (%) | Delay (ns) | Delay increase (%) | Power (μW) | Power reduction (%) |
|---|---|---|---|---|---|---|
| Original | 14053 | n/a | 1.63 | n/a | 1830.20 | n/a |
| Clock gated | 14565 | 3.64 | 1.72 | 5.52 | 1580.36 | 13.65 |

Table 9: Area and delay performances of frequency trade-off technique.

| Number of iteration rounds | Area (GE) | Delay (ns) | Frequency (MHz) |
|---|---|---|---|
| 48 | 965 | 0.94 | 10.00 |
| 24 | 1930 | 1.92 | 5.00 < $f_{24}$ < 6.96 |
| 16 | 2895 | 2.91 | 3.33 < $f_{16}$ < 6.20 |
| 12 | 3860 | 3.91 | 2.50 < $f_{12}$ < 5.89 |
| 8 | 5790 | 5.90 | 1.67 < $f_{8}$ < 5.60 |
| 6 | 7720 | 7.89 | 1.25 < $f_{6}$ < 5.47 |
| 4 | 11580 | 11.87 | 0.83 < $f_{4}$ < 5.34 |
| 3 | 15440 | 15.84 | 0.63 < $f_{3}$ < 5.28 |
| 2 | 23160 | 23.80 | 0.42 < $f_{2}$ < 5.22 |
| 1 | 46320 | 47.71 | 0.21 < $f_{1}$ < 5.21 |

Then we apply the load-enable based clock gating scheme to the hash encryption system, using the 100 MHz clock frequency provided on the FPGA board and 45 nm CMOS technology. As shown in Table 8, the dynamic power decreases by 13.65%. However, this comes at the cost of a 3.64% increase in area and a 5.52% increase in critical path delay.


Table 10: Dynamic power consumption of frequency trade-off technique.

| Number of iteration rounds | Power (μW) | Reduction (%) | Frequency (MHz) |
|---|---|---|---|
| 48 | 1308.96 | n/a | 10.00 |
| 24 | 940.08 < $P_{24}$ < 1308.48 | 0.04 < $P_{24,\mathrm{pct}}$ < 28.18 | 5.00 < $f_{24}$ < 6.96 |
| 16 | 702.88 < $P_{16}$ < 1307.20 | 0.13 < $P_{16,\mathrm{pct}}$ < 46.30 | 3.33 < $f_{16}$ < 6.20 |
| 12 | 555.12 < $P_{12}$ < 1308.00 | 0.07 < $P_{12,\mathrm{pct}}$ < 57.59 | 2.50 < $f_{12}$ < 5.89 |
| 8 | 389.04 < $P_{8}$ < 1307.12 | 0.14 < $P_{8,\mathrm{pct}}$ < 70.28 | 1.67 < $f_{8}$ < 5.60 |
| 6 | 298.80 < $P_{6}$ < 1307.64 | 0.10 < $P_{6,\mathrm{pct}}$ < 77.17 | 1.25 < $f_{6}$ < 5.47 |
| 4 | 203.92 < $P_{4}$ < 1306.76 | 0.17 < $P_{4,\mathrm{pct}}$ < 84.42 | 0.83 < $f_{4}$ < 5.34 |
| 3 | 154.71 < $P_{3}$ < 1306.89 | 0.16 < $P_{3,\mathrm{pct}}$ < 88.18 | 0.63 < $f_{3}$ < 5.28 |
| 2 | 104.30 < $P_{2}$ < 1306.76 | 0.17 < $P_{2,\mathrm{pct}}$ < 92.03 | 0.42 < $f_{2}$ < 5.22 |
| 1 | 52.29 < $P_{1}$ < 1307.60 | 0.10 < $P_{1,\mathrm{pct}}$ < 96.01 | 0.21 < $f_{1}$ < 5.21 |

Table 11: Throughput performances of frequency trade-off technique.

| Number of iteration rounds | Throughput (Mbps) | Improvement (%) | Frequency (MHz) |
|---|---|---|---|
| 48 | 6.67 | n/a | 10.00 |
| 24 | 6.67 < $T_{24}$ < 9.28 | 0.00 < $T_{24,\mathrm{pct}}$ < 39.13 | 5.00 < $f_{24}$ < 6.96 |
| 16 | 6.67 < $T_{16}$ < 12.40 | 0.00 < $T_{16,\mathrm{pct}}$ < 85.91 | 3.33 < $f_{16}$ < 6.20 |
| 12 | 6.67 < $T_{12}$ < 15.71 | 0.00 < $T_{12,\mathrm{pct}}$ < 135.53 | 2.50 < $f_{12}$ < 5.89 |
| 8 | 6.67 < $T_{8}$ < 22.40 | 0.00 < $T_{8,\mathrm{pct}}$ < 235.83 | 1.67 < $f_{8}$ < 5.60 |
| 6 | 6.67 < $T_{6}$ < 29.17 | 0.00 < $T_{6,\mathrm{pct}}$ < 337.33 | 1.25 < $f_{6}$ < 5.47 |
| 4 | 6.67 < $T_{4}$ < 42.72 | 0.00 < $T_{4,\mathrm{pct}}$ < 540.48 | 0.83 < $f_{4}$ < 5.34 |
| 3 | 6.67 < $T_{3}$ < 56.32 | 0.00 < $T_{3,\mathrm{pct}}$ < 744.38 | 0.63 < $f_{3}$ < 5.28 |
| 2 | 6.67 < $T_{2}$ < 83.52 | 0.00 < $T_{2,\mathrm{pct}}$ < 1152.17 | 0.42 < $f_{2}$ < 5.22 |
| 1 | 6.67 < $T_{1}$ < 166.72 | 0.00 < $T_{1,\mathrm{pct}}$ < 2399.55 | 0.21 < $f_{1}$ < 5.21 |

5. Conclusion

To achieve a high performance and low power hardware implementation for cryptographic hash functions that use the sponge construction, we proceed in four steps. Firstly, we use the unfolding transformation technique to improve the throughput of the hash function. Secondly, pipeline and parallelism design techniques are implemented to reduce the critical path delay by modifying the structure of the permutation function. Thirdly, a frequency trade-off technique is proposed to calculate a frequency scope that can be used to make a trade-off between low dynamic power consumption and high throughput of the hash function. Finally, a load-enable based clock gating scheme is applied to the hash encryption system to eliminate the wasted toggle rate of signals in the idle mode.

The experimental results show that the unfolding transformation technique achieves up to 47.97 times higher throughput, the pipeline and parallelism methods give a 6.31% delay reduction, the load-enable based clock gating scheme decreases dynamic power consumption by 13.65%, and the frequency trade-off technique shows how to choose the clock frequency of the hash function to achieve low power consumption and high throughput.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This research was supported by the MKE (The Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program (NIPA-2012-H0301-12-3007) supervised by the NIPA (National IT Industry Promotion Agency).

References

[1] H. Michail and C. Goutis, "Holistic methodology for designing ultra high-speed SHA-1 hashing cryptographic module in hardware," in Proceedings of the IEEE International Conference on Electron Devices and Solid-State Circuits (EDSSC '08), pp. 1–4, Hong Kong, December 2008.

[2] "Cryptographic hash algorithm competition," NIST Computer Security Resource Center, http://csrc.nist.gov/groups/ST/hash/sha-3/index.html.

[3] B. Schneier, Applied Cryptography: Protocols, Algorithms, and Source Code in C, John Wiley & Sons, New York, NY, USA, 2nd edition, 1996.

[4] J. Nakajima and M. Mitsuru, "Performance analysis and parallel implementation of dedicated hash function," in Proceedings of the International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT '02), vol. 2332, pp. 165–180, Amsterdam, The Netherlands, 2002.

[5] P. C. van Oorschot, A. Somayaji, and G. Wurster, "Hardware-assisted circumvention of self-hashing software tamper resistance," IEEE Transactions on Dependable and Secure Computing, vol. 2, no. 2, pp. 82–92, 2005.

[6] G. Bertoni, J. Daemen, M. Peeters, and G. van Assche, "Cryptographic sponge functions," The Sponge Functions Corner, http://sponge.noekeon.org/index.html.

[7] "Sponge function," Wikipedia, http://en.wikipedia.org/wiki/Sponge_function.

[8] L. Li, Power optimization from register transfer level to transistor level in deeply scaled CMOS technology [Ph.D. thesis], Illinois Institute of Technology, Chicago, Ill, USA, 2012.

[9] N. Weste and D. Harris, CMOS VLSI Design: A Circuits and Systems Perspective, Addison-Wesley, Reading, Mass, USA, 2010.

[10] Y. Zhang, Q. Tong, L. Li et al., "Automatic register transfer level CAD tool design for advanced clock gating and low power schemes," in Proceeding of the International SoC Design Conference (ISOCC '12), pp. 21–24, Jeju Island, Republic of Korea, 2012.

[11] K. Aoki, T. Ichikawa, and M. Kanda, "Specification of Camellia - a 128-bit block cipher," Nippon Telegraph and Telephone Corporation, Mitsubishi Electric Corporation, 2000.

[12] Y. K. Lee, H. Chan, and I. Verbauwhede, "Throughput optimized SHA-1 architecture using unfolding transformation," in Proceedings of the 17th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP '06), pp. 354–359, Steamboat Springs, Colo, USA, September 2006.

[13] H. Michail, A. P. Kakarountas, O. Koufopavlou, and C. E. Goutis, "A low-power and high-throughput implementation of the SHA-1 hash function," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '05), vol. 4, pp. 4086–4089, Kobe, Japan, May 2005.

[14] S. Badel, N. Dagtekin, J. Nakahara Jr. et al., "ARMADILLO: a multi-purpose cryptographic primitive dedicated to hardware," in Cryptographic Hardware and Embedded Systems, CHES 2010, vol. 6225 of Lecture Notes in Computer Science, pp. 398–412, 2010.

[15] K. Lin, Y. Zhang, K. Choi, J. Kang, and S. Hong, "Multi-purpose bimodal cryptographic algorithm and its hardware implementation," in Proceedings of the FTRA International Conference on Advanced IT, Engineering and Management (FTRA AIM '13), Seoul, Korea, 2013.


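Figure 1: Sponge construction [6] (the padded message M is absorbed in br-bit blocks into a state of br + c bits initialized to zero, with the function f applied after each block; the output Z is then squeezed out in br-bit blocks).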

The rest of this paper is organized as follows. Sponge construction and the low power methods used in this paper are introduced in Section 2. In Section 3, we analyze the hash function designed with the sponge construction and its original hardware implementation, and then present the unfolding transformation and the pipelining and parallelism design techniques used to improve the throughput and delay of the hash function. In Section 4, we construct the hash encryption system and introduce two low power techniques: the frequency trade-off technique and the load-enable based clock gating scheme. The paper is concluded in Section 5.

2. Background of the Research

In this section, sponge construction is explained first. Next, we introduce the two dynamic power reduction methods used in this paper.

2.1. Sponge Construction. The idea of sponge construction came from the design of RadioGatún, and its final definition was given at the Ecrypt Hash Workshop in Barcelona [6]. As shown in Figure 1, the sponge construction takes an arbitrary-length input, keeps a finite internal state, and gives an output of any desired length.

There are three components in sponge construction [7]:

(i) a state memory;
(ii) a function of fixed length that permutes or transforms the state memory;
(iii) a padding function.

The state memory in Figure 1 is divided into two parts: the top section, called the bitrate, of $br$ bits, and the bottom section, called the capacity, of $c$ bits. The input message ($M$ in Figure 1) is padded to a whole multiple of the bitrate. Thus the padded input message can be broken into a number of $br$-bit blocks.

Sponge construction consists of two processes: absorbing and squeezing. In the part to the left of the dashed line in Figure 1, called absorbing, firstly, the input message is padded and the state memory is initialized; secondly, the first $br$-bit block of the padded input is XORed with the initial $br$ bits of the state memory; thirdly, the fixed length function (block $f$ in Figure 1) updates the state memory. Steps two and three are then repeated until all the padded $br$-bit blocks are used up. In the right section, called squeezing, firstly, the $br$ bits of the latest state memory form the first $br$-bit output; secondly, if more output bits are needed, the fixed length function is used to update the state memory, and the $br$ bits of the new state memory form the second $br$-bit output. This process is repeated until the desired number of output bits ($Z$ in Figure 1) is produced.

The extent to which the $c$-bit part is altered by the input message depends on the fixed length function [7]. The security of the hash function, for example, its resistance to collision or preimage attacks, relies on this $c$-bit part. Because of its arbitrarily long input and output sizes, the sponge construction allows building various primitives such as hash functions. The Keccak hash function, known as the new SHA-3, uses this sponge construction.
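The absorbing and squeezing steps can be sketched in a few lines of Python. This is a toy illustration only: the state sizes are assumed values, the message is assumed to be already padded into $br$-bit blocks, and `toy_f` is a placeholder bijection introduced here for illustration; it is neither the Perm function of SHAT nor a secure permutation.

```python
BR = 32                  # bitrate in bits (assumed for illustration)
C = 96                   # capacity in bits (assumed for illustration)
WIDTH = BR + C
MASK = (1 << WIDTH) - 1


def toy_f(state: int) -> int:
    """Placeholder fixed-length function on the whole state; NOT secure."""
    # Rotate left by 7 bits, then add a constant modulo 2**WIDTH.
    # Both steps are invertible, so toy_f is a bijection on the state.
    rotated = ((state << 7) | (state >> (WIDTH - 7))) & MASK
    return (rotated + 0x9E3779B97F4A7C15) & MASK


def sponge(blocks, out_blocks, f=toy_f):
    """Absorb br-bit integer blocks, then squeeze `out_blocks` br-bit outputs."""
    state = 0                                      # initialized state memory
    for block in blocks:                           # absorbing phase
        state ^= (block & ((1 << BR) - 1)) << C    # XOR into the top br bits
        state = f(state)                           # update the state memory
    out = []
    for k in range(out_blocks):                    # squeezing phase
        if k > 0:
            state = f(state)                       # update before each extra block
        out.append(state >> C)                     # top br bits are the output
    return out


if __name__ == "__main__":
    print([hex(z) for z in sponge([0x01234567, 0x89ABCDEF], 4)])
```

Note that the capacity bits are never XORed with the input or exposed directly in the output, which is why the security argument rests on them.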

2.2. Dynamic Power Reduction Methods. Digital circuits consume dynamic power in the active mode. There are two sources of dynamic power consumption [8]:

(i) charging and discharging of the output capacitance;
(ii) short-circuit current when the PMOS and NMOS networks are both ON.

Because the short-circuit power is usually less than 10% of the total dynamic power [9], the dynamic power consumption that we try to reduce refers to switching power for the rest of this paper. Dynamic power is given by (1), where $C_L$ is the load capacitance, $V_{DD}$ is the supply voltage, $f$ is the clock frequency, and TR is the toggle rate of the gate output:

$$
P_{\text{dynamic}} = \frac{1}{2} C_L V_{DD}^2 f \cdot \mathrm{TR}. \tag{1}
$$

Since power optimization at RTL has a significant impact with reasonable accuracy, RTL is considered the optimal stage for low power techniques [8]. According to (1), four parameters, namely, supply voltage, clock frequency, load capacitance, and the toggle rate of the gate output, determine the dynamic power consumption. Because reducing the supply voltage increases the critical path delay and changing the gate output capacitance requires redesigning the load logic, it is more efficient to focus on clock frequency and toggle rate at RTL.
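As a quick numerical illustration of (1), the following sketch evaluates the switching power for one set of parameter values. The numbers are hypothetical and chosen only to show the linear dependence on clock frequency and toggle rate; they are not measurements from this paper.

```python
def dynamic_power(c_load_f: float, vdd_v: float, f_hz: float, toggle_rate: float) -> float:
    """Switching power in watts: 0.5 * C_L * V_DD^2 * f * TR."""
    return 0.5 * c_load_f * vdd_v ** 2 * f_hz * toggle_rate


if __name__ == "__main__":
    # Hypothetical values: 1 pF load, 1.0 V supply, 10 MHz clock, toggle rate 0.5.
    p = dynamic_power(1e-12, 1.0, 10e6, 0.5)
    p_half_f = dynamic_power(1e-12, 1.0, 5e6, 0.5)     # halve the clock frequency
    p_half_tr = dynamic_power(1e-12, 1.0, 10e6, 0.25)  # halve the toggle rate
    print(p, p_half_f, p_half_tr)
```

Halving either the clock frequency or the toggle rate halves the switching power, whereas a change in supply voltage has a quadratic effect.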

2.2.1. Dynamic Voltage/Frequency Scaling. Figure 2 shows a basic dynamic voltage/frequency scaling (DVFS) system. By collecting information about the workload and the temperature, the DVFS controller determines the clock frequency that is sufficient to finish the work and gives the best performance without overheating. This variable clock frequency scheme leads to dynamic power reduction by choosing a proper clock frequency.

2.2.2. Load-Enable Based Clock Gating. The combinational clock gating technique is widely used to address dynamic power for single-level registers, while the sequential clock gating method considers multiple-level (pipeline) registers. In this research we focus on the combinational clock gating technique; in particular, we use the load-enable based clock gating scheme [10].

Figure 2: DVFS system [9] (the DVFS controller takes workload and temperature as inputs and drives the voltage control of a switching voltage regulator, which converts Vin to VDD, and the frequency control of the core logic).

Figure 3: Load-enable based clock gating (the enable signal en and the clock clk produce the gated clock gclk that drives the flip-flops with data input D[N-1:0] and output Q[N-1:0]).

Figure 3 shows a typical structure of the load-enable based clock gating scheme. If the data do not change during some consecutive clock periods, or the enable signal is kept low, the clock pulses in those periods are wasted. This technique can be applied to a circuit with a mux in which the enable signal is the selection signal, or to a pipelined circuit such as the hash encryption system in this research.
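The saving can be illustrated with a cycle-level behavioral model that counts how many clock pulses reach the register with and without gating. This is a simplified sketch with hypothetical names: it models the gated clock as the clock ANDed with the enable signal and ignores the latch that a practical clock-gating cell typically uses to avoid glitches.

```python
def simulate(enable_trace, data_trace, gated: bool):
    """Return (final register value, number of clock pulses seen by the register)."""
    q = 0
    clock_pulses = 0
    for en, d in zip(enable_trace, data_trace):
        if gated:
            if en:                 # gclk pulses only while the enable is high
                clock_pulses += 1
                q = d
        else:
            clock_pulses += 1      # register is clocked in every cycle
            if en:                 # mux loads new data, otherwise q recirculates
                q = d
    return q, clock_pulses


if __name__ == "__main__":
    enable = [1, 0, 0, 1]
    data = [5, 6, 7, 8]
    print(simulate(enable, data, gated=False))
    print(simulate(enable, data, gated=True))
```

Both variants end with the same register contents, but the gated version receives clock pulses only in the cycles where new data is actually loaded.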

3. Proposed High-Speed Hashing Module in Hardware

Cryptographic hash functions provide powerful protection for data and are used in the security layer of almost every communication protocol. However, as protocols evolve and data sizes and communication speeds increase dramatically, the low throughput of the hash function becomes a bottleneck in these digital communication systems. A promising solution is hardware implementation on reconfigurable devices, which combines high flexibility with speed and physical security.

Various techniques have been proposed to speed up or improve the throughput of hash functions, for example, unfolding transformation and pipeline and parallelism techniques. In this section, the characteristics relevant to the hardware implementation of the hash algorithm are presented first. Then the high-speed hashing module is introduced based on the delay bound analysis. Finally, two techniques, unfolding transformation and pipeline and parallelism, are used to optimize the inner logic of the transformation rounds.

Table 1: The parameters of SHAT.

| SHAT | Hash value | Number of steps |
|---|---|---|
| SHAT-128 | 128 bits | 48 |
| SHAT-256 | 256 bits | 48 |
| SHAT-384 | 384 bits | 48 |

3.1. Hash Algorithm Specification. In this section we introduce a cryptographic hash algorithm with sponge construction called sponge hash algorithm (SHAT). SHAT is a hash function generating 128-, 256-, or 384-bit hash values. According to the hash value length, SHAT can be denoted by SHAT-$(128 \cdot i)$ $(i = 1, 2, 3)$. The parameters of SHAT are shown in Table 1.

3.1.1. G Function. The $G$ function of SHAT consists of an $S$-box and a diffusion layer. The $S$-box is a substitution function that satisfies the confusion property on each 4-bit word. A 32-bit input word $W$, for example, is divided into eight 4-bit words $(w_0, \ldots, w_7)$. Each 4-bit word goes through this $S$-box. The definition of the $S$-box is $sw_i = \mathrm{Sbox}(w_i)$ $(i = 0, \ldots, 7)$. This $S$-box is specified in Table 2. The diffusion layer is a permutation that satisfies the diffusion property (the same as the $P$ function of Camellia [11]). For computational efficiency, this diffusion layer should be represented using only bit-wise exclusive ORs. The branch number of the diffusion layer should be optimal against differential and linear cryptanalysis for security [11]. When we get all eight 4-bit outputs of the $S$-box $(sw_0, \ldots, sw_7)$, this diffusion layer mixes them. The diffusion layer is defined as (2):

$$
\begin{pmatrix} w'_0 \\ w'_1 \\ w'_2 \\ w'_3 \\ w'_4 \\ w'_5 \\ w'_6 \\ w'_7 \end{pmatrix}
=
\begin{pmatrix}
0 & 1 & 1 & 1 & 1 & 0 & 0 & 1 \\
1 & 0 & 1 & 1 & 1 & 1 & 0 & 0 \\
1 & 1 & 0 & 1 & 0 & 1 & 1 & 0 \\
1 & 1 & 1 & 0 & 0 & 0 & 1 & 1 \\
0 & 1 & 1 & 1 & 1 & 1 & 1 & 0 \\
1 & 0 & 1 & 1 & 0 & 1 & 1 & 1 \\
1 & 1 & 0 & 1 & 1 & 0 & 1 & 1 \\
1 & 1 & 1 & 0 & 1 & 1 & 0 & 1
\end{pmatrix}
\begin{pmatrix} w_0 \\ w_1 \\ w_2 \\ w_3 \\ w_4 \\ w_5 \\ w_6 \\ w_7 \end{pmatrix}
\tag{2}
$$
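The $G$ function can be sketched directly from Table 2 and the matrix in (2). This is a minimal illustration, not the authors' implementation: it works on a list of eight 4-bit words to avoid assuming how $w_0, \ldots, w_7$ are packed into the 32-bit word, it applies the matrix to the $S$-box outputs as the text describes, and it interprets the matrix product over GF(2), that is, as XORs of the selected words. The function names are hypothetical.

```python
# S-box of the G function (Table 2): SBOX[w] = Sbox(w).
SBOX = [0x1, 0x2, 0x4, 0xB, 0xD, 0xE, 0xA, 0x5,
        0xF, 0x8, 0x9, 0x7, 0x6, 0x3, 0x0, 0xC]

# Rows of the binary matrix in (2); character k of a row selects input word k.
ROWS = ["01111001", "10111100", "11010110", "11100011",
        "01111110", "10110111", "11011011", "11101101"]


def diffuse(words):
    """Mix eight 4-bit words using only XORs, one output word per matrix row."""
    out = []
    for row in ROWS:
        acc = 0
        for bit, w in zip(row, words):
            if bit == "1":
                acc ^= w
        out.append(acc)
    return out


def g_function(words):
    """Apply the S-box to each 4-bit word, then the diffusion layer."""
    return diffuse([SBOX[w & 0xF] for w in words])


if __name__ == "__main__":
    print([hex(x) for x in g_function([0, 1, 2, 3, 4, 5, 6, 7])])
```

Because every matrix entry is 0 or 1, each output word is an XOR of five or six input words, which keeps the hardware cost of the diffusion layer to a few XOR gates per bit.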

Table 2: $S$-box of the $G$ function.

| $sw$ | Sbox($w$) | $sw$ | Sbox($w$) |
|---|---|---|---|
| 0x0 | 0x1 | 0x8 | 0xF |
| 0x1 | 0x2 | 0x9 | 0x8 |
| 0x2 | 0x4 | 0xA | 0x9 |
| 0x3 | 0xB | 0xB | 0x7 |
| 0x4 | 0xD | 0xC | 0x6 |
| 0x5 | 0xE | 0xD | 0x3 |
| 0x6 | 0xA | 0xE | 0x0 |
| 0x7 | 0x5 | 0xF | 0xC |

3.1.2. Hash Function of SHAT. SHAT uses the hermetic sponge construction shown in Figure 4. As mentioned in Section 2, $br$ is called the bitrate and $c$ is called the capacity. The bitrate ($br$) and the capacity ($c$) of SHAT-$(128 \cdot i)$ ($i = 1, 2, 3$) are $32 \cdot i$ and $96 \cdot i$, respectively. The internal state $S$ is divided into $4 \cdot i$ sections, $S = (S_0, \ldots, S_{4i-1})$ ($i = 1, 2, 3$).

Figure 4: Sponge construction of SHAT (phases: initialization, absorbing of $M_0, \ldots, M_n$, squeezing of $H_0, \ldots, H_3$).

In the absorbing phase, the input message $M = (M_0, M_1, \ldots, M_{n-1})$ shown in Figure 4 is padded to a whole multiple of the bitrate ($br$). The padding method is as follows. Let $l$ be the total length of the input message (we assume that $l$ is a whole multiple of four, that is, the message consists of an integer number of hexadecimal digits). We append a 1 to the end of the message, followed by $k$ zero bits, where $k$ is the smallest nonnegative integer that satisfies

\[
(l + 1 + k) \bmod (32 \cdot i) = 0. \tag{3}
\]
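The padding rule of (3) can be sketched as below. This is only an illustration under stated assumptions: the message is represented as a Python list of 0/1 values, and the helper names `pad_message` and `split_blocks` are hypothetical.

```python
def pad_message(bits, i=1):
    """Append a 1 bit, then k zero bits, so the length is a multiple of 32*i.

    `bits` is a list of 0/1 values (a representation chosen for this sketch).
    """
    rate = 32 * i
    # Smallest k >= 0 with (l + 1 + k) mod rate == 0, as required by (3).
    k = (-(len(bits) + 1)) % rate
    return bits + [1] + [0] * k


def split_blocks(padded, i=1):
    """Cut the padded bit list into br-bit blocks M_0 ... M_{n-1}."""
    rate = 32 * i
    return [padded[j:j + rate] for j in range(0, len(padded), rate)]
```

Note that a message whose length is already a multiple of the bitrate still receives a full extra block, because the appended 1 bit is mandatory.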

Then we set $S_{4i-1}$ as the bitrate section, which is XORed with the padded $br$-bit message block. The result then goes through the one-way compression function Perm. Perm is a permutation process which has 48 steps. Each STEP is defined in Algorithm 1. In Algorithm 1, the left circular rotations $\mathrm{rot}_k$ are $\mathrm{rot}_0 = 19$, $\mathrm{rot}_1 = 1$, and $\mathrm{rot}_2 = 14$. In the squeezing phase, the output of SHAT is defined in (4). SHAT-$(128 \cdot i)$ ($i = 1, 2, 3$) is specified in Algorithm 2.

\[
\mathrm{SQUEEZE}(S, i) =
\begin{cases}
S_3, & i = 1, \\
S_3 \parallel S_7, & i = 2, \\
S_3 \parallel S_7 \parallel S_{11}, & i = 3.
\end{cases}
\tag{4}
\]

3.2. Hardware Implementation. Following the specification of SHAT-$(128 \cdot i)$ ($i = 1, 2, 3$) in Algorithm 2, the architecture of SHAT is illustrated in Figure 5.

The $S$-box of the $G$ function is designed from a Karnaugh map. According to Table 2, we get the logic functions of the $S$-box shown in (5). We set $A_i$ ($i = 0, 1, 2, 3$) as the input bits of the $S$-box and $Q_i$ ($i = 0, 1, 2, 3$) as the output bits.

\[
\begin{aligned}
Q_3 &= \bar{A}_3 \bar{A}_2 A_1 A_0 + A_3 A_2 A_1 A_0 + \bar{A}_3 A_2 \bar{A}_1 \\
    &\quad + \bar{A}_3 A_2 \bar{A}_0 + A_3 \bar{A}_2 \bar{A}_1 + A_3 \bar{A}_2 \bar{A}_0, \\
Q_2 &= A_3 A_1 A_0 + \bar{A}_3 A_2 \bar{A}_1 + A_3 \bar{A}_1 \bar{A}_0 \\
    &\quad + \bar{A}_3 A_2 A_0 + \bar{A}_3 \bar{A}_2 A_1 \bar{A}_0, \\
Q_1 &= A_3 \bar{A}_1 \bar{A}_0 + \bar{A}_2 A_1 A_0 + \bar{A}_3 \bar{A}_2 A_0 \\
    &\quad + A_2 \bar{A}_1 A_0 + \bar{A}_3 A_2 A_1 \bar{A}_0, \\
Q_0 &= \bar{A}_3 \bar{A}_1 \bar{A}_0 + \bar{A}_3 A_1 A_0 + A_3 A_2 \bar{A}_1 A_0 \\
    &\quad + A_3 \bar{A}_2 \bar{A}_0 + A_3 \bar{A}_2 A_1.
\end{aligned}
\tag{5}
\]

Here an overbar denotes the logical complement, $A_3$ is the most significant input bit, and $Q_3$ is the most significant output bit.
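A sum-of-products form of the $S$-box can be checked exhaustively against Table 2, since there are only 16 inputs. The sketch below writes one such form, using the same product terms as (5), with complemented literals as `n3 ... n0`. The function name `sbox_logic` and the bit ordering ($A_3$ and $Q_3$ as most significant bits) are assumptions of this example.

```python
# S-box of Table 2: index = 4-bit input, value = 4-bit output.
SBOX = [0x1, 0x2, 0x4, 0xB, 0xD, 0xE, 0xA, 0x5,
        0xF, 0x8, 0x9, 0x7, 0x6, 0x3, 0x0, 0xC]


def sbox_logic(a3, a2, a1, a0):
    """Two-level AND/OR form of the S-box; inputs and outputs are 0/1 bits."""
    n3, n2, n1, n0 = 1 - a3, 1 - a2, 1 - a1, 1 - a0   # complemented inputs
    q3 = ((n3 & n2 & a1 & a0) | (a3 & a2 & a1 & a0) | (n3 & a2 & n1)
          | (n3 & a2 & n0) | (a3 & n2 & n1) | (a3 & n2 & n0))
    q2 = ((a3 & a1 & a0) | (n3 & a2 & n1) | (a3 & n1 & n0)
          | (n3 & a2 & a0) | (n3 & n2 & a1 & n0))
    q1 = ((a3 & n1 & n0) | (n2 & a1 & a0) | (n3 & n2 & a0)
          | (a2 & n1 & a0) | (n3 & a2 & a1 & n0))
    q0 = ((n3 & n1 & n0) | (n3 & a1 & a0) | (a3 & a2 & n1 & a0)
          | (a3 & n2 & n0) | (a3 & n2 & a1))
    return (q3 << 3) | (q2 << 2) | (q1 << 1) | q0


def matches_table():
    """True if the logic form reproduces every entry of the lookup table."""
    return all(
        sbox_logic((w >> 3) & 1, (w >> 2) & 1, (w >> 1) & 1, w & 1) == SBOX[w]
        for w in range(16)
    )
```

An exhaustive comparison of this kind is a cheap sanity check before committing a hand-minimized Karnaugh-map result to RTL.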

There are 48 iteration rounds in the basic architecture of the Perm function. We use the rolling loop technique to reduce the area requirement: our design is a single operation block which is reused 48 times, as shown in Figure 6. Here $r_i$ is a counter for the number of iteration rounds, running from 0 to 47. The critical path is highlighted by the bold line. Since the delay of a circular shift is negligible in a hardware implementation, the critical path delay of this architecture is

\[
\mathrm{Delay}_n = 4 \cdot \mathrm{Delay}(\oplus) + \mathrm{Delay}(g). \tag{6}
\]

3.3. Proposed High-Speed Module. In the previous section, we introduced the rolling loop technique to construct the Perm function. Although this approach is area efficient, throughput is kept low because 48 clock cycles are required to generate the result. Many architectures can be derived by varying the Perm function to solve this problem. We apply the unfolding transformation technique. This high-speed module combines several STEP blocks into a single round and can even use a completely round-unrolled circuit. By unfolding, the hidden concurrencies can be parallelized [12]. In addition, the pipeline and parallelism technique explained in [13] is used to improve the unfolded construction of the hash function. This technique relies on precomputation, based on an analysis of the inner logic and architecture of the hash function.

3.3.1. Unfolding Transformation. According to Figure 6, the mathematical expression of one iteration round is

\[
\begin{aligned}
S'_3 &= \mathrm{ROT}(S'_1) \oplus (S'_0 \oplus S_2), \\
S'_2 &= S_1, \\
S'_1 &= G(S_3 \oplus r \oplus S_2) \oplus (S_0 \oplus S_1), \\
S'_0 &= S_3 \oplus r.
\end{aligned}
\tag{7}
\]


Step($S$)

(i) For $k = 0$ to $i - 1$:
    (a) $S_{4k+3} = S_{4k+3} \oplus r$
    (b) $S_{4k} = S_{4k} \oplus S_{4k+1}$
    (c) $S_{4k+2} = S_{4k+2} \oplus S_{4k+3}$
    (d) $S_{4k} = S_{4k} \oplus G(S_{4k+2})$
    (e) $S_{4k+2} = S_{4k+2} \oplus (S_{4k} \lll \mathrm{rot}_k)$
(ii) Temp $= S_{4i-1}$
(iii) For $k = 4i - 1$ to 1:
    $S_k = S_{k-1}$
(iv) $S_0 =$ Temp

Algorithm 1: Typical one step algorithm.
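Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch, not the authors' code. It assumes 32-bit state words held in a list, takes the $G$ function as a caller-supplied callable `g`, and uses the rotation amounts 19, 1, and 14 quoted in the text. The names `rotl32` and `step` are hypothetical.

```python
MASK = 0xFFFFFFFF
ROT = (19, 1, 14)   # rot_0, rot_1, rot_2 as given in the text


def rotl32(x, n):
    """Left circular rotation of a 32-bit word by n bits (0 < n < 32)."""
    return ((x << n) | (x >> (32 - n))) & MASK


def step(state, r, g):
    """One STEP of Algorithm 1 on a list of 4*i 32-bit words.

    `r` is the round counter word and `g` is a stand-in for the G function.
    Returns a new list; the input list is left untouched.
    """
    s = list(state)
    i = len(s) // 4
    for k in range(i):                       # step (i), items (a)-(e)
        s[4 * k + 3] ^= r
        s[4 * k] ^= s[4 * k + 1]
        s[4 * k + 2] ^= s[4 * k + 3]
        s[4 * k] ^= g(s[4 * k + 2])
        s[4 * k + 2] ^= rotl32(s[4 * k], ROT[k])
    # Steps (ii)-(iv): move the last word to the front, shift the rest up.
    return [s[-1]] + s[:-1]
```

For $i = 1$ the result of `step` coincides with the closed-form round expression (7), which is a convenient cross-check between the algorithmic and the architectural descriptions.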

SHAT-$(128 \cdot i)$($M$)

Inputs: $n$ padded message blocks $M = (M_0, M_1, \ldots, M_{n-1})$
Outputs: $(128 \cdot i)$-bit hash value $(H_0, H_1, H_2, H_3)$

(1) $S = (S_0, \ldots, S_{4i-1}) = (0, 0, \ldots, 0, 128 \cdot i)$ (initialization)
(2) Perm($S$)
(3) For $j = 0$ to $n - 1$ (absorbing phase):
    (i) For $k = 0$ to $i - 1$:
        $S_{4k+3} = S_{4k+3} \oplus M_{j,k}$
    (ii) Perm($S$)
(4) $H_0 = \mathrm{SQUEEZE}(S, i)$ (squeezing phase)
(5) For $k = 1$ to 3:
    (i) Perm($S$)
    (ii) $H_k = \mathrm{SQUEEZE}(S, i)$

Algorithm 2: SHAT-$(128 \cdot i)$.
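The control flow of Algorithm 2 can be sketched as a small sponge skeleton. This is only a sketch under stated assumptions: the 48-step Perm function is passed in as a callable `perm` that returns the new state list, each padded block is a list of $i$ 32-bit words, and SQUEEZE returns the selected words as a list. The names `squeeze` and `shat_hash` are hypothetical.

```python
def squeeze(state, i):
    """SQUEEZE of (4): the words S_3, S_7, S_11 that exist for this i."""
    return [state[4 * k + 3] for k in range(i)]


def shat_hash(blocks, i, perm):
    """Sponge skeleton of Algorithm 2.

    `blocks`: list of padded blocks, each a list of i 32-bit words.
    `perm`:   caller-supplied stand-in for Perm; takes and returns a list.
    """
    state = [0] * (4 * i - 1) + [128 * i]    # step (1): initialization
    state = perm(state)                      # step (2)
    for block in blocks:                     # step (3): absorbing phase
        for k in range(i):
            state[4 * k + 3] ^= block[k]
        state = perm(state)
    digest = [squeeze(state, i)]             # step (4): H_0
    for _ in range(3):                       # step (5): H_1 .. H_3
        state = perm(state)
        digest.append(squeeze(state, i))
    return digest
```

With $n$ message blocks the skeleton calls `perm` exactly $n + 4$ times: once after initialization, once per absorbed block, and three times while squeezing.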

Figure 5: A typical SHAT core (blocks: padding unit, RAM, control unit, SHAT, message digest extraction; labels: input data, padded data, $n \times 32$ bits, 32-bit wide registers, $4 \times 32$ bits, 128 bits, message digest).

Here $S_i$ ($i = 0, 1, 2, 3$) is the input of the current round and $S'_i$ ($i = 0, 1, 2, 3$) is the output of this round (or the input of the next round). In order to distribute the 48 operations equally over the rounds, the possible values for the unfolding factor are the divisors of 48, that is, 1, 2, 3, 4, 6, 8, 12, 16, 24, and 48. For example, we can unfold two STEP operations in each round; then we get 24 rounds in one permutation process. The expression of throughput is given as

\[
\text{Throughput} = \frac{(\#\text{ of bits}) \cdot f_{\text{round}}}{\#\text{ of rounds}}. \tag{8}
\]

Figure 6: Typical architecture of one STEP round (inputs $S_0, \ldots, S_3$ and $r_i$; blocks $g$ and ROT; outputs $S_0, \ldots, S_3$).
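Equation (8) is easy to evaluate for every admissible unfolding factor. The sketch below assumes a 32-bit block (the SHAT-128 bitrate) and holds the clock at 10 MHz for every design, as in the "throughput at 10 MHz" comparison. The names `throughput` and `UNFOLDING_FACTORS` are introduced here for illustration.

```python
def throughput(bits_per_block, f_round_hz, rounds):
    """Equation (8): bits processed per second."""
    return bits_per_block * f_round_hz / rounds


# Unfolding factors must divide 48 so that every round holds the same
# number of STEP operations.
UNFOLDING_FACTORS = [d for d in range(1, 49) if 48 % d == 0]

for factor in UNFOLDING_FACTORS:
    rounds = 48 // factor
    mbps = throughput(32, 10e6, rounds) / 1e6
    print(f"{factor:2d} STEPs per round -> {rounds:2d} rounds, {mbps:7.2f} Mbps")
```

At a fixed clock the throughput grows in proportion to the unfolding factor; in practice the longer critical path lowers the achievable clock frequency, which is the trade-off discussed in the text.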

Considering (7), although this unfolding transformation reduces the maximum operating frequency, the throughput is increased significantly because the number of operations is reduced from 48 to 24. The mathematical expression of one iteration round is replaced by

\[
\begin{aligned}
\mathrm{temp}_3 &= \mathrm{ROT}(\mathrm{temp}_1) \oplus (\mathrm{temp}_0 \oplus S_2), \\
\mathrm{temp}_2 &= S_1, \\
\mathrm{temp}_1 &= G(S_3 \oplus r \oplus S_2) \oplus (S_0 \oplus S_1), \\
\mathrm{temp}_0 &= S_3 \oplus r, \\
S'_3 &= \mathrm{ROT}(S'_1) \oplus (S'_0 \oplus \mathrm{temp}_2), \\
S'_2 &= \mathrm{temp}_1, \\
S'_1 &= G(\mathrm{temp}_3 \oplus r \oplus \mathrm{temp}_2) \oplus (\mathrm{temp}_0 \oplus \mathrm{temp}_1), \\
S'_0 &= \mathrm{temp}_3 \oplus r.
\end{aligned}
\tag{9}
\]
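The claim that (9) is simply two applications of (7) fused into one round can be checked numerically. The sketch below is illustrative only. It assumes 32-bit words and the $i = 1$ rotation amount of 19, and takes $G$ as a caller-supplied callable. It also passes separate round constants `r_a` and `r_b` to the two halves, whereas (9) writes $r$ for both. All function names are hypothetical.

```python
MASK = 0xFFFFFFFF


def rotl32(x, n=19):
    """Left circular rotation of a 32-bit word (rot_0 = 19 by default)."""
    return ((x << n) | (x >> (32 - n))) & MASK


def one_step(s, r, g):
    """Single round, equation (7)."""
    s0 = s[3] ^ r
    s1 = g(s[3] ^ r ^ s[2]) ^ (s[0] ^ s[1])
    return [s0, s1, s[1], rotl32(s1) ^ (s0 ^ s[2])]


def two_steps_unfolded(s, r_a, r_b, g):
    """Two STEPs evaluated in one round, equation (9)."""
    t0 = s[3] ^ r_a
    t1 = g(s[3] ^ r_a ^ s[2]) ^ (s[0] ^ s[1])
    t2 = s[1]
    t3 = rotl32(t1) ^ (t0 ^ s[2])
    n0 = t3 ^ r_b
    n1 = g(t3 ^ r_b ^ t2) ^ (t0 ^ t1)
    return [n0, n1, t1, rotl32(n1) ^ (n0 ^ t2)]
```

The unfolded form removes the register boundary between the two STEPs, so the intermediate values `t0` to `t3` exist only as combinational signals.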

3.3.2. Pipeline and Parallelism. Suppose we unroll two STEP operations in each round. This reduces the maximum frequency while increasing the throughput; however, increased area is introduced as a penalty. If some logic can be evaluated in parallel, and this parallelism lies on the critical path, then the delay of each round can be decreased, so that the operating frequency can be increased. According to (8), when the number of operations is kept constant (the number of bits is also kept constant), the throughput increases with the frequency. This method could be used in the hardware implementation of any other hash function.

For example, Figure 7 shows the architecture with two STEP operations unfolded in one round, which has the minimum critical path delay. The critical path is composed of seven XOR gates and two $G$ functions. Compared with the architecture of one STEP block, unfolding two STEPs in one round adds three 32-bit XOR gates and one $G$ function to the critical path. The critical path is highlighted by the bold line.

In Figure 7, the cycle counter $r_{i+1}$ can first be combined with $\mathrm{temp}_2$ and then XORed with $\mathrm{temp}_3$ in the second STEP part. Comparing this with the first STEP part, where $r_i$ is XORed with $S_3$ and then with $S_2$, we can see that an additional component is needed to combine $\mathrm{temp}_3$ and $r_{i+1}$. Because this output must be generated, this area penalty cannot be avoided.

Thus, when we increase the number of unfolded STEP operations, for example, to three, four, or five, each additional STEP increases the round delay by three 32-bit XOR gates and one $G$ function. Therefore, the normalized delay with unfolding factor $n$ ($n = 1, 2, 3, \ldots$) is

\[
\mathrm{Delay}_n = \frac{4 \cdot \mathrm{Delay}(\oplus) + \mathrm{Delay}(g) + (n - 1) \cdot \bigl(3 \cdot \mathrm{Delay}(\oplus) + \mathrm{Delay}(g)\bigr)}{n}. \tag{10}
\]

Taking the limit of (10) as $n$ grows gives

\[
\lim_{n \to \infty} \mathrm{Delay}_n = 3 \cdot \mathrm{Delay}(\oplus) + \mathrm{Delay}(g). \tag{11}
\]

This is the delay bound of SHAT, which means that the delay of one SHAT operation round cannot be less than this bound.
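The bound in (11) follows from (10) by collecting terms. As a short worked step, write $X = \mathrm{Delay}(\oplus)$ and $D_g = \mathrm{Delay}(g)$ (shorthand introduced here only for brevity):

\[
\frac{4X + D_g + (n - 1)(3X + D_g)}{n}
= \frac{n\,(3X + D_g) + X}{n}
= 3X + D_g + \frac{X}{n}
\;\xrightarrow{\;n \to \infty\;}\; 3X + D_g .
\]

For instance, $n = 2$ gives $(7X + 2D_g)/2 = 3.5X + D_g$ per STEP, in agreement with the seven XOR gates and two $G$ functions on the two-STEP critical path. The only term that unfolding amortizes is the single extra XOR delay $X$.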

Figure 7: Proposed architecture of two STEPs round (inputs $S_0, \ldots, S_3$, $r_i$, and $r_{i+1}$; intermediate values $\mathrm{temp}_0$ to $\mathrm{temp}_3$; two $g$ blocks and two ROT blocks; outputs $S_0, \ldots, S_3$).

3.4. Experimental Results. We introduce a measurement of hardware efficiency in (12) [14]. This is an improvement on the normal figure of merit (FOM). We assume that the power is proportional to the gate count; we can then divide the metric by the gate count a second time, instead of by the power dissipation, when we want to trade off throughput for power. Note that one gate equivalent (GE) is equal to the area of a two-input NAND gate in 45 nm CMOS technology.

\[
\mathrm{FOM} = \frac{\text{Throughput}}{\mathrm{GE}^2} \tag{12}
\]
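As a worked example of (12), the sketch below evaluates the FOM for the three SHAT variants. The unit scaling is an assumption of this example: throughput is taken as bits per clock cycle (block size divided by the number of operations) and the result is expressed in nanobits per cycle per GE$^2$, a scaling under which the SHAT rows evaluate to 258.80, 130.78, and 88.53. The function name `fom_nanobits` is hypothetical.

```python
def fom_nanobits(block_bits, cycles, area_ge):
    """Throughput per clock cycle divided by area squared.

    Returned in nanobits per cycle per GE^2; the 1e9 scaling is an
    assumption chosen for readability, not something stated in (12).
    """
    bits_per_cycle = block_bits / cycles
    return bits_per_cycle / area_ge ** 2 * 1e9


for name, bits, cycles, area in [("SHAT-128", 32, 48, 1605),
                                 ("SHAT-256", 64, 48, 3193),
                                 ("SHAT-384", 96, 48, 4753)]:
    print(name, round(fom_nanobits(bits, cycles, area), 2))
```

Because the area enters squared, halving the gate count at equal throughput quadruples the FOM, which is why compact designs rank highly under this metric.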

Table 3 shows the hardware implementation results of some 128-bit hash functions at a 100 kHz clock frequency in 45 nm CMOS technology. Firstly, the throughput of SHAT-128 (66.67 kbps) is lower than that of most of the following five implementations: MD4 (112.28 kbps), MD5 (83.66 kbps), H-Present-128-32-round (200 kbps), and ARMADILLO2-B (25.0 kbps and 100.0 kbps). However, the area of SHAT-128 is on average only 28.42% of that of these hash functions. As a result, the hardware efficiency of SHAT-128 is 13.12 times higher on average. Secondly, the area of SHAT-128 (1605 GE) is larger than that of three hash algorithms, namely, U-QUARK-544-round (1379 GE), PHOTON-128-996-round (1122 GE), and SPONGENT-128-8-bit-2380-round (1060 GE); however, the throughput of SHAT-128 is 94.27 times higher. Thus the FOM of SHAT-128 is 46.75 times higher on average. Finally, the area of SHAT-128 (1605 GE) is less than that of four other hash algorithms, namely, H-Present-128-559-round (2330 GE), U-QUARK-68-round (2392 GE), PHOTON-128-156-round (1708 GE), and SPONGENT-128-70-round (1687 GE), and the throughput of SHAT-128 is also 5.95 times higher than that of these four algorithms on average. As a result, the FOM of SHAT-128 is 9.66 times higher on average.

In Table 4, firstly, the throughput of SHAT-256 is 51.05% of that of Grostl; however, the area of SHAT-256 is only 21.84% of that of Grostl, which results in an 84.47 times higher hardware efficiency for SHAT-256. Secondly, the throughput of SHAT-256 is on average 412.91 times higher than that of two hash algorithms, PHOTON-256-156-round (2177 GE) and SPONGENT-256-9520-round (1950 GE); although the area of SHAT-256 (3193 GE) is larger, the FOM of SHAT-256 is still 158.25 times higher than that of these two hash algorithms. Thirdly, compared with SHA-256, ARMADILLO2-E, BLAKE, PHOTON-256-156-round, and SPONGENT-256-140-round, the throughput of SHAT-256 is 4.65 times higher on average and the area of SHAT-256 is on average only 49.15% of that of these hash algorithms. Therefore, the FOM of SHAT-256 is 119.14 times higher on average.

Table 3: Hardware implementation results of some 128-bit hash functions.

| Hash function | Block size (bits) | Number of operations | Throughput at 100 kHz (kbps) | Area (GE) | FOM |
|---|---|---|---|---|---|
| SHAT-128 | 32 | 48 | 66.67 | 1605 | 258.80 |
| H-Present-128 [15] | 128 | 559 | 11.45 | 2330 | 21.09 |
| | 128 | 32 | 200 | 4256 | 110.41 |
| MD4 [15] | 512 | 456 | 112.28 | 7350 | 20.78 |
| MD5 [15] | 512 | 612 | 83.66 | 8400 | 11.86 |
| ARMADILLO2-B [15] | 64 | 256 | 25.0 | 4353 | 13.19 |
| | 64 | 64 | 100.0 | 6025 | 27.55 |
| U-QUARK [15] | 8 | 544 | 1.47 | 1379 | 7.73 |
| | 8 | 68 | 11.76 | 2392 | 20.56 |
| PHOTON-128 [15] | 16 | 996 | 1.61 | 1122 | 12.78 |
| | 16 | 156 | 10.26 | 1708 | 35.15 |
| SPONGENT-128 [15] | 8 | 2380 | 0.34 | 1060 | 2.99 |
| | 16 | 70 | 11.43 | 1687 | 40.16 |

Table 4: Hardware implementation results of some 256-bit hash functions.

| Hash function | Block size (bits) | Number of operations | Throughput at 100 kHz (kbps) | Area (GE) | FOM |
|---|---|---|---|---|---|
| SHAT-256 | 64 | 48 | 133.33 | 3193 | 130.78 |
| SHA-256 [14] | 512 | 490 | 104.48 | 8588 | 14.17 |
| ARMADILLO2-E [14] | 128 | 512 | 25 | 8653 | 3.34 |
| | 128 | 128 | 100 | 11914 | 7.05 |
| BLAKE [14] | 32 | 816 | 72.79 | 13575 | 0.21 |
| Grostl [14] | 64 | 196 | 261.14 | 14622 | 1.53 |
| PHOTON-256 [14] | 32 | 156 | 3.21 | 2177 | 6.78 |
| | 32 | 156 | 20.51 | 4362 | 10.17 |
| SPONGENT-256 [14] | 16 | 9520 | 0.17 | 1950 | 0.44 |
| | 16 | 140 | 11.43 | 3281 | 10.62 |

Table 5: Hardware implementation results of some 384-bit hash functions.

| Hash function | Block size (bits) | Number of operations | Throughput at 100 kHz (kbps) | Area (GE) | FOM |
|---|---|---|---|---|---|
| SHAT-384 | 96 | 48 | 200 | 4753 | 88.53 |
| SHA-384 [14] | 1024 | 84 | 1219.04 | 43330 | 6.49 |

Table 6: Performance results of hash function using pipeline and parallelism.

| Number of iteration rounds | Area (GE) | Area increase (%) | Delay (ns) | Delay reduction (%) | Power (µW) | Power increase (%) |
|---|---|---|---|---|---|---|
| 48 | 965 | 0.00 | 0.94 | 0.00 | 27.27 | 0.00 |
| 24 | 2010 | 4.15 | 1.87 | 2.60 | 79.05 | 0.91 |
| 16 | 3055 | 5.53 | 2.81 | 3.44 | 136.42 | 3.52 |
| 12 | 4100 | 6.22 | 3.74 | 4.35 | 193.71 | 4.67 |
| 8 | 6190 | 6.91 | 5.61 | 4.92 | 308.48 | 5.73 |
| 6 | 8280 | 7.25 | 7.47 | 5.32 | 423.18 | 6.21 |
| 4 | 12460 | 7.60 | 11.20 | 5.64 | 652.62 | 6.68 |
| 3 | 16640 | 7.77 | 14.93 | 5.74 | 882.00 | 6.90 |
| 2 | 25000 | 7.94 | 22.40 | 5.88 | 1340.80 | 7.12 |
| 1 | 50080 | 8.12 | 44.70 | 6.31 | 2695.40 | 7.40 |

Table 7: Performance results of unrolling steps constructions.

| Number of iteration rounds | Area (GE) | Delay (ns) | Power (µW) | Throughput at 10 MHz (Mbps) |
|---|---|---|---|---|
| 48 | 965 | 0.94 | 27.27 | 6.67 |
| 24 | 1930 | 1.92 | 78.34 | 13.33 |
| 16 | 2895 | 2.91 | 131.78 | 20.00 |
| 12 | 3860 | 3.91 | 185.06 | 26.67 |
| 8 | 5790 | 5.90 | 291.77 | 40.00 |
| 6 | 7720 | 7.89 | 398.42 | 53.33 |
| 4 | 11580 | 11.87 | 611.78 | 80.00 |
| 3 | 15440 | 15.84 | 825.06 | 106.67 |
| 2 | 23160 | 23.80 | 1251.70 | 160.00 |
| 1 | 46320 | 47.71 | 2509.70 | 320.00 |

In Table 5, the throughput of SHA-384 is 6.09 times higher than that of SHAT-384; however, the area of SHA-384 is 9.11 times higher, which results in the hardware efficiency of SHAT-384 being 13.64 times higher than that of SHA-384.

Then we implement the unfolding transformation technique with 10 different numbers of unrolling loops (1, 2, ..., 48) in 45 nm CMOS technology at 10 MHz to evaluate the performance of SHAT-128; the results are shown in Table 7. As Table 7 shows, the throughput of the Perm function can reach up to 47.97 times that of the original design, which is 6.67 Mbps. However, area, delay, and power increase dramatically as a penalty.

Finally, we implement the pipeline and parallelism technique to reconstruct the STEP block, as shown in Table 6. Compared with the performance of the original circuit, the critical path delay is reduced by up to 6.31%, while the power and area increase by up to about 8%.

4. Low Power Design for Hash Function

Low power design is a significant consideration in hardware implementation. The power consumption determines a device's lifetime, reliability, and energy cost. Thus low power techniques are applied to nearly every application nowadays. There are many methods to reduce power consumption, such as clock gating and power gating, which address dynamic power and leakage power, respectively. Decreasing the clock frequency also reduces the power dissipation dramatically.

Firstly, we propose the frequency trade-off technique. By using this method, we obtain a range of frequency values for making a trade-off between low power consumption and high throughput of the hash function. Secondly, we construct a hash encryption system which includes an input data padding unit, RAM, registers, the main hash computing construction, a message digest extraction component, and a main control unit. Thirdly, by analyzing the idle mode and control signals of this hash encryption system, a load-enable based clock gating scheme is applied to reduce the dynamic power consumption.

4.1. Frequency Trade-Off Technique. According to (1), reducing the clock frequency is an effective method to decrease dynamic power dissipation linearly. In Section 2.2, we discussed the DVFS technique. By collecting information about workload and temperature, DVFS determines a clock frequency sufficient for the required performance. However, modifying the clock frequency at RTL is not easy; normally we treat the clock frequency as constant. Moreover, dynamic frequency scaling reduces the number of operations a system can issue in a given amount of time, thus reducing performance. Therefore, there is a trade-off to consider: a high clock frequency brings high throughput, but the dramatically increased dynamic power consumption is a critical drawback; a low clock frequency minimizes the dynamic power dissipation, but it decreases the throughput as well.

However, according to the unfolding transformation technique introduced in Section 3.3, the maximum frequency of the Perm function decreases as the number of unrolled loops increases. This means that we can decrease the clock frequency while increasing the throughput of the hash algorithm. Thus the unfolding transformation technique achieves high performance without a high clock frequency. Thanks to this advantage, by choosing a proper clock frequency, we can make a trade-off between high performance and low power consumption.

Next, we explain how to obtain this range of frequency values from the two performance bounds. First, we obtain two values for the rolled Perm circuit: the dynamic power consumption $P_1$ and the clock frequency $f_1$, which is constrained by the circuit design (the clock period computed from $f_1$ must be no less than the critical path delay). Then, according to (8), we get the throughput $T_1$ at this frequency. The two performance bounds are then defined in (13), where $n$ is the number of iteration rounds in one Perm function with rolled STEPs:

\[
P_{\max} = P_1 \cdot n, \qquad T_{\min} = T_1. \tag{13}
\]

This method can be defined as follows. Referring to the performance of the original folded circuit (we assume that this circuit is the one with 48 iteration rounds in one Perm function), each unfolding transformation design with a different number of unrolled STEPs (2, 3, ..., 48) has two performance bounds: one is the maximum dynamic power and the other is the minimum throughput of the circuit. These two performance bounds determine the boundary of the proper frequency range for each unfolded circuit. That is, when we choose a clock frequency within this range, the total dynamic power consumption of the Perm function will be no more than the defined maximum dynamic power $P_{\max}$, and its throughput will be no less than the fixed minimum throughput $T_{\min}$.

Figure 8: Hash encryption system (blocks: receiver, RAM, clock divider, hash process, main control, LCD displayer; labels: input message, $n \times br$ bits, output digest, digest display).

This clock frequency range offers many different choices for circuit designs that use the unfolding transformation technique. The results of this frequency trade-off technique are shown in Table 9 in Section 4.4.
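The frequency window for an unfolded design can be sketched numerically. The example below is a sketch under explicit assumptions, not the authors' procedure. It assumes that dynamic power scales linearly with clock frequency, and that the power compared against $P_{\max}$ is the per-round power multiplied by the number of rounds in one Perm function. It uses the 10 MHz power figures reported for the unfolded designs (27.27, 78.34, 131.78, and 2509.70 µW for 48, 24, 16, and 1 rounds). The names `POWER_UW_AT_10MHZ` and `frequency_window` are hypothetical.

```python
# Dynamic power in microwatts measured at 10 MHz, keyed by the number of
# iteration rounds per Perm function (values quoted from the 10 MHz results).
POWER_UW_AT_10MHZ = {48: 27.27, 24: 78.34, 16: 131.78, 1: 2509.70}
F_REF_MHZ = 10.0


def frequency_window(rounds):
    """Return (f_low, f_high) in MHz for a design with `rounds` rounds.

    Assumptions: power is proportional to frequency, and the budget is
    per-round power times the number of rounds, as in (13).
    """
    p_max = POWER_UW_AT_10MHZ[48] * 48                 # P_max = P_1 * n
    f_low = F_REF_MHZ * rounds / 48                    # throughput >= T_min
    f_high = F_REF_MHZ * p_max / (POWER_UW_AT_10MHZ[rounds] * rounds)
    return f_low, f_high


for rounds in (24, 16, 1):
    low, high = frequency_window(rounds)
    print(f"{rounds:2d} rounds: {low:.2f} MHz < f < {high:.2f} MHz")
```

Under these assumptions the 24-round design yields a window of roughly 5.00 to 6.96 MHz. Any clock inside the window consumes no more than $P_{\max}$ per Perm function while delivering at least $T_{\min}$.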

4.2. Hash Encryption System Design. The hash encryption system is divided into five main parts, as shown in Figure 8.

Firstly, the receiver and RAM section is our padding unit. We use serial communication to connect a PC and the hash encryption system. Thus we need a clock divider to generate a clock that is synchronous with the baud rate of the serial communication. We choose 4800 baud as our transmission rate, a modest speed chosen for a low error rate (less than 3%). In this case, one baud represents 1 bit. Our transmission frame is one start bit "0", then an 8-bit message, and one finish (stop) bit "1". The start bit and finish bit are added to the transmitted message bits automatically. The sampling rate of the receiver is 16, and the FPGA board provides a 100 MHz clock. Thus the clock period used in sampling is 1302 times the period of the provided 100 MHz clock, as shown in (14). The resulting error is 0.0064%, which is less than 3%.

\[
\text{Sampling Clock Cycles} = \frac{\text{Clock Frequency}}{\text{Baud rate} \cdot \text{Sampling Rate}} = \frac{100\ \text{MHz}}{16 \times 4800\ \text{B/s}} \approx 1302 \tag{14}
\]
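The arithmetic behind (14) and the quoted error can be reproduced in a few lines. In this sketch the error is taken to be the relative rounding error of the integer divider, which is an assumption about how the 0.0064% figure is defined. The function name `sampling_divider` is hypothetical.

```python
def sampling_divider(clock_hz, baud, oversampling=16):
    """Integer clock divider of (14) and its relative rounding error in percent."""
    exact = clock_hz / (baud * oversampling)   # ideal, non-integer ratio
    divider = round(exact)                     # what a counter can implement
    error_pct = abs(divider - exact) / exact * 100
    return divider, error_pct


divider, error_pct = sampling_divider(100_000_000, 4800)
print(divider, f"{error_pct:.4f}%")
```

With a 100 MHz clock, 4800 baud, and 16 samples per bit, the ideal ratio is about 1302.08, so rounding down to 1302 changes the sampling clock by far less than the 3% tolerance mentioned in the text.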

The liquid crystal display (LCD) limits the number of characters we can display to 32 hexadecimal characters, which matches the 128-bit digest of SHAT-128. Thus the bitrate $br$ of each padded block is set to 32 bits, which consist of eight 4-bit hexadecimal digits.

Secondly, the hash function introduced in Section 3 is designed as a sponge construction, as shown in Figure 4. After absorbing $n$ 32-bit message blocks, a 128-bit digest is squeezed out.

Finally, the main control unit manages the working order of the receiver, the hash process, and the LCD display. Figure 9 shows the pipelined operation of the system.

Because we use serial communication, the transfer is slow. We use 4800 baud for a low error rate; thus each 32-bit block needs roughly 7 ms. For example, if seven 32-bit blocks need to be transmitted, roughly 50 ms is spent on data receiving and padding. The hash function used in this system performs one STEP per round, which means that there are 48 iteration rounds for a complete Perm function; even so, hash processing needs only roughly 6 µs. The LCD displaying period also costs much time: even though we can finish LCD initialization before we get the hash digest, we still need roughly 15 ms to display all the data.

4.3. Load-Enable Based Clock Gating. In this section, we introduce the load-enable based clock gating technique for the hash encryption system.

Clock gating is the most widely used low power technique at RTL. At RTL it is more practical to control the toggle rate of gate outputs than any of the other three components, namely, $V_{DD}$, clock frequency, and gate output capacitance. According to Figure 9, the hash encryption system is organized as a pipeline. The finishing signal of each process can be treated as the enable signal in load-enable based clock gating, as shown in Figure 3. On the other hand, the XOR-based clock gating technique needs to specify the outputs of single-level flip-flops, which are not easily determined in our encryption system; thus load-enable based clock gating is our best option as a low power method.

As shown in Figure 10, there are three signal pairs to realize this load-enable based clock gating: en_div and fsh_r, en_h and fsh_h, and en_lcd and fsh_lcd. Because the receiver runs at a specific clock frequency corresponding to the serial communication, the main control unit does not gate the clock signal of the receiver directly; instead, by controlling the clock signal of the clock divider with en_div, the receiver can be properly managed.

Figure 9 gives the three operation phases of the encryption system. In the first phase, the en_div and en_lcd signals are asserted to logic one and en_h is set to logic zero; thus the receiver starts receiving input messages and padding them into RAM. In the meantime, the system begins the initialization process for the LCD displayer, while the hash processing unit waits for the padded input message. Because serial communication takes a long time, owing to the low baud rate and to the fact that message bits are transmitted one by one, LCD displayer initialization can be finished before the padded message is ready. Thus en_lcd can be set to logic zero by the main control unit when fsh_lcd switches to logic one.

Figure 9: Three phases of hash encryption system (phase one: receiver performs data receiving and padding, hash is idle, LCD performs initialization and then idles; phase two: receiver idle, hash processing, LCD idle; phase three: receiver idle, hash idle, LCD displaying).

Figure 10: Control signals of hash encryption system (blocks: receiver, RAM, clock divider, hash process, main control, LCD displayer; signals: input message, padded message, digest, digest display, en_div, fsh_r, en_h, fsh_h, en_lcd, fsh_lcd, clk_r).

During the second phase, the padded message is ready, so fsh_r switches to logic one. Then en_div is set to zero, which means that the clock divider is turned off and no receiver clock is produced; thus the receiver stops working. In this phase, en_h is asserted to logic one for hash encryption, which is our core function. en_lcd remains zero while waiting for the hash digest generated by hash processing.

The system enters the third phase when the fsh_h signal switches to logic one. In this phase, the hash digest is ready; thus both the receiver and the hash process are in idle mode, which means that en_div and en_h are both set to logic zero. Signal en_lcd is asserted to logic one to start LCD displaying and is set back to zero when the displaying process is finished. This is the end of the whole process; the device is then turned off or repeats these three phases for another input message.

By analyzing the construction and operation of the hash encryption system, we can determine the idle time of each component. Then, by applying load-enable based clock gating to each component, the dynamic power dissipation of this system can be reduced, as shown in Table 8 in Section 4.4.
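The enable logic of the three phases can be modelled behaviourally. The sketch below is a minimal model and not the RTL of the main control unit. It treats every signal as a 0/1 level, and it splits the LCD finish flag into an initialization flag and a display flag, which is an assumption made so that the model is purely combinational. The function name `enables` and its argument names are hypothetical.

```python
def enables(fsh_r, fsh_h, fsh_lcd_init, fsh_lcd_disp):
    """Return (en_div, en_h, en_lcd) for the main control unit.

    fsh_r:        1 once the padded message is ready
    fsh_h:        1 once the hash digest is ready
    fsh_lcd_init: 1 once LCD initialization has finished (assumed flag)
    fsh_lcd_disp: 1 once LCD displaying has finished (assumed flag)
    """
    if not fsh_r:
        # Phase one: receive and pad; the LCD initializes, then idles.
        en_lcd = 0 if fsh_lcd_init else 1
        return 1, 0, en_lcd
    if not fsh_h:
        # Phase two: only the hash core is clocked.
        return 0, 1, 0
    # Phase three: only the LCD is clocked, until the display is done.
    en_lcd = 0 if fsh_lcd_disp else 1
    return 0, 0, en_lcd
```

At most two of the three enables are high at any time, and in phases two and three only one is, which is where the dynamic power saving of the gated system comes from.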

4.4. Experimental Results. Using a 10 MHz clock frequency and 45 nm CMOS technology, the results of the frequency trade-off technique are shown in Tables 9, 10, and 11. Table 9 shows that the area and critical path delay are unchanged compared with the unfolding transformation technique. Tables 10 and 11 give the variation of dynamic power consumption and throughput under the frequency trade-off method. Note that $f_i$ stands for frequency, $T_i$ stands for throughput, and $T_{i,\mathrm{pct}}$ is the percentage increase compared with the minimum throughput ($T_{\min}$), which is 6.67 Mbps. $P_i$ is the total dynamic power consumed in finishing a complete Perm function, and $P_{i,\mathrm{pct}}$ is the percentage power reduction compared with the maximum power consumption ($P_{\max}$), defined as 1308.96 µW, which is the product of 48 (the number of iteration rounds) and 27.27 µW (as shown in Table 7). Note that $i$ stands for the number of iteration rounds.

Then we apply the load-enable based clock gating scheme to the hash encryption system, using a 100 MHz clock frequency, which is available on the FPGA board, and 45 nm CMOS technology. As shown in Table 8, the dynamic power decreases by 13.65%. However, this comes at the cost of a 3.64% increase in area and a 5.52% increase in critical path delay.

Table 8: Hardware implementation with/without load-enable based clock gating.

| System type | Area (GE) | Area increase (%) | Delay (ns) | Delay increase (%) | Power (µW) | Power reduction (%) |
|---|---|---|---|---|---|---|
| Original | 14053 | n/a | 1.63 | n/a | 1830.20 | n/a |
| Clock gated | 14565 | 3.64 | 1.72 | 5.52 | 1580.36 | 13.65 |

Table 9: Area and delay performances of frequency trade-off technique.

| Number of iteration rounds | Area (GE) | Delay (ns) | Frequency (MHz) |
|---|---|---|---|
| 48 | 965 | 0.94 | 10.00 |
| 24 | 1930 | 1.92 | $5.00 < f_{24} < 6.96$ |
| 16 | 2895 | 2.91 | $3.33 < f_{16} < 6.20$ |
| 12 | 3860 | 3.91 | $2.50 < f_{12} < 5.89$ |
| 8 | 5790 | 5.90 | $1.67 < f_{8} < 5.60$ |
| 6 | 7720 | 7.89 | $1.25 < f_{6} < 5.47$ |
| 4 | 11580 | 11.87 | $0.83 < f_{4} < 5.34$ |
| 3 | 15440 | 15.84 | $0.63 < f_{3} < 5.28$ |
| 2 | 23160 | 23.80 | $0.42 < f_{2} < 5.22$ |
| 1 | 46320 | 47.71 | $0.21 < f_{1} < 5.21$ |


Table 10: Dynamic power consumption of frequency trade-off technique.

| Number of iteration rounds | Power (µW) | Reduction (%) | Frequency (MHz) |
|---|---|---|---|
| 48 | 1308.96 | n/a | 10.00 |
| 24 | $940.08 < P_{24} < 1308.48$ | $28.18 > P_{24,\mathrm{pct}} > 0.04$ | $5.00 < f_{24} < 6.96$ |
| 16 | $702.88 < P_{16} < 1307.20$ | $46.30 > P_{16,\mathrm{pct}} > 0.13$ | $3.33 < f_{16} < 6.20$ |
| 12 | $555.12 < P_{12} < 1308.00$ | $57.59 > P_{12,\mathrm{pct}} > 0.07$ | $2.50 < f_{12} < 5.89$ |
| 8 | $389.04 < P_{8} < 1307.12$ | $70.28 > P_{8,\mathrm{pct}} > 0.14$ | $1.67 < f_{8} < 5.60$ |
| 6 | $298.80 < P_{6} < 1307.64$ | $77.17 > P_{6,\mathrm{pct}} > 0.10$ | $1.25 < f_{6} < 5.47$ |
| 4 | $203.92 < P_{4} < 1306.76$ | $84.42 > P_{4,\mathrm{pct}} > 0.17$ | $0.83 < f_{4} < 5.34$ |
| 3 | $154.71 < P_{3} < 1306.89$ | $88.18 > P_{3,\mathrm{pct}} > 0.16$ | $0.63 < f_{3} < 5.28$ |
| 2 | $104.30 < P_{2} < 1306.76$ | $92.03 > P_{2,\mathrm{pct}} > 0.17$ | $0.42 < f_{2} < 5.22$ |
| 1 | $52.29 < P_{1} < 1307.60$ | $96.01 > P_{1,\mathrm{pct}} > 0.10$ | $0.21 < f_{1} < 5.21$ |

Table 11: Throughput performances of frequency trade-off technique.

| Number of iteration rounds | Throughput (Mbps) | Improvement (%) | Frequency (MHz) |
|---|---|---|---|
| 48 | 6.67 | n/a | 10.00 |
| 24 | $6.67 < T_{24} < 9.28$ | $0.00 < T_{24,\mathrm{pct}} < 39.13$ | $5.00 < f_{24} < 6.96$ |
| 16 | $6.67 < T_{16} < 12.4$ | $0.00 < T_{16,\mathrm{pct}} < 85.91$ | $3.33 < f_{16} < 6.20$ |
| 12 | $6.67 < T_{12} < 15.71$ | $0.00 < T_{12,\mathrm{pct}} < 135.53$ | $2.50 < f_{12} < 5.89$ |
| 8 | $6.67 < T_{8} < 22.40$ | $0.00 < T_{8,\mathrm{pct}} < 235.83$ | $1.67 < f_{8} < 5.60$ |
| 6 | $6.67 < T_{6} < 29.17$ | $0.00 < T_{6,\mathrm{pct}} < 337.33$ | $1.25 < f_{6} < 5.47$ |
| 4 | $6.67 < T_{4} < 42.72$ | $0.00 < T_{4,\mathrm{pct}} < 540.48$ | $0.83 < f_{4} < 5.34$ |
| 3 | $6.67 < T_{3} < 56.32$ | $0.00 < T_{3,\mathrm{pct}} < 744.38$ | $0.63 < f_{3} < 5.28$ |
| 2 | $6.67 < T_{2} < 83.52$ | $0.00 < T_{2,\mathrm{pct}} < 1152.17$ | $0.42 < f_{2} < 5.22$ |
| 1 | $6.67 < T_{1} < 166.72$ | $0.00 < T_{1,\mathrm{pct}} < 2399.55$ | $0.21 < f_{1} < 5.21$ |

5. Conclusion

In order to achieve a high performance and low power hardware implementation of a cryptographic hash function based on the sponge construction, firstly, we use the unfolding transformation technique to improve the throughput of the hash function; secondly, pipeline and parallelism design techniques are implemented to reduce the critical path delay by modifying the structure of the permutation function; thirdly, a frequency trade-off technique is proposed to calculate a frequency range which can be used to make a trade-off between low dynamic power consumption and high throughput of the hash function; finally, a load-enable based clock gating scheme is applied to the hash encryption system to eliminate wasted signal toggling in idle mode.

The experimental results show that the unfolding transformation technique achieves up to 47.97 times higher throughput, the pipeline and parallelism methods give a 6.31% delay reduction, the load-enable based clock gating scheme decreases dynamic power consumption by 13.65%, and the frequency trade-off technique shows how to choose the clock frequency of the hash function to achieve low power consumption and high throughput.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This research was supported by the MKE (The Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program (NIPA-2012-H0301-12-3007) supervised by the NIPA (National IT Industry Promotion Agency).

References

[1] H. Michail and C. Goutis, "Holistic methodology for designing ultra high-speed SHA-1 hashing cryptographic module in hardware," in Proceedings of the IEEE International Conference on Electron Devices and Solid-State Circuits (EDSSC '08), pp. 1–4, Hong Kong, December 2008.

[2] "Cryptographic hash algorithm competition," NIST Computer Security Resource Center, http://csrc.nist.gov/groups/ST/hash/sha-3/index.html.

[3] B. Schneier, Applied Cryptography: Protocols, Algorithms, and Source Code in C, John Wiley & Sons, New York, NY, USA, 2nd edition, 1996.

[4] J. Nakajima and M. Mitsuru, "Performance analysis and parallel implementation of dedicated hash function," in Proceedings of the International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT '02), vol. 2332, pp. 165–180, Amsterdam, The Netherlands, 2002.

[5] P. C. van Oorschot, A. Somayaji, and G. Wurster, "Hardware-assisted circumvention of self-hashing software tamper resistance," IEEE Transactions on Dependable and Secure Computing, vol. 2, no. 2, pp. 82–92, 2005.

[6] G. Bertoni, J. Daemen, M. Peeters, and G. van Assche, "Cryptographic sponge functions," The Sponge Functions Corner, http://sponge.noekeon.org/index.html.

[7] "Sponge function," Wikipedia, http://en.wikipedia.org/wiki/Sponge_function.

[8] L. Li, Power optimization from register transfer level to transistor level in deeply scaled CMOS technology [Ph.D. thesis], Illinois Institute of Technology, Chicago, Ill, USA, 2012.

[9] N. Weste and D. Harris, CMOS VLSI Design: A Circuits and Systems Perspective, Addison-Wesley, Reading, Mass, USA, 2010.

[10] Y. Zhang, Q. Tong, L. Li et al., "Automatic register transfer level CAD tool design for advanced clock gating and low power schemes," in Proceedings of the International SoC Design Conference (ISOCC '12), pp. 21–24, Jeju Island, Republic of Korea, 2012.

[11] K. Aoki, T. Ichikawa, and M. Kanda, "Specification of Camellia - a 128-bit block cipher," Nippon Telegraph and Telephone Corporation, Mitsubishi Electric Corporation, 2000.

[12] Y. K. Lee, H. Chan, and I. Verbauwhede, "Throughput optimized SHA-1 architecture using unfolding transformation," in Proceedings of the 17th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP '06), pp. 354–359, Steamboat Springs, Colo, USA, September 2006.

[13] H. Michail, A. P. Kakarountas, O. Koufopavlou, and C. E. Goutis, "A low-power and high-throughput implementation of the SHA-1 hash function," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '05), vol. 4, pp. 4086–4089, Kobe, Japan, May 2005.

[14] S. Badel, N. Dagtekin, J. Nakahara Jr. et al., "ARMADILLO: a multi-purpose cryptographic primitive dedicated to hardware," in Cryptographic Hardware and Embedded Systems, CHES 2010, vol. 6225 of Lecture Notes in Computer Science, pp. 398–412, 2010.

[15] K. Lin, Y. Zhang, K. Choi, J. Kang, and S. Hong, "Multi-purpose bimodal cryptographic algorithm and its hardware implementation," in Proceedings of the FTRA International Conference on Advanced IT Engineering and Management (FTRA AIM '13), Seoul, Korea, 2013.

Figure 2: DVFS system [9]. [Block diagram with blocks: DVFS controller, switching voltage regulator, core logic; signals: workload, temperature, voltage control, frequency control, Vin, VDD.]

Figure 3: Load-enable based clock gating. [Circuit diagram with labels: FFs, E, D, clk, en, gclk, D[N-1:0], Q[N-1:0].]

gating technique; in particular, we use the load-enable based clock gating scheme [10].

Figure 3 shows a normal structure of the load-enable based clock gating scheme. If the data do not change during some consecutive clock periods, or the enable signal is kept low, those clock periods are wasted. This technique can be applied to a circuit with a mux in which an enable signal is the selection signal, or to a pipeline construction circuit such as the hash encryption system in this research.
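The behaviour described above can be sketched with a small cycle-level model. This is only an illustration, not the circuit of [10]: the function name `simulate_load_enable` is hypothetical, and the gated clock is assumed to be the plain AND of `clk` and the enable signal.

```python
def simulate_load_enable(data, enable):
    """Model a register with a load-enable input, one list entry per clock cycle.

    Returns the register output per cycle, the number of clock edges that
    reach the flip-flops when the clock is gated by the enable signal, and
    the number of edges without gating.
    """
    q = 0
    outputs = []
    gated_edges = 0
    for d, en in zip(data, enable):
        if en:
            # Assumed gating: gclk = clk AND en, so the flip-flops are only
            # clocked in cycles where the enable signal is high.
            q = d
            gated_edges += 1
        # With the enable low, the register simply holds its value.
        outputs.append(q)
    ungated_edges = len(outputs)  # without gating every cycle toggles the clock pin
    return outputs, gated_edges, ungated_edges
```

In this model, the difference between the two edge counts is the number of wasted clock periods that the gating removes.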

3. Proposed High-Speed Hashing Module in Hardware

Cryptographic hash functions provide powerful protection for data; they have been utilized in the security layer of every communication protocol. However, as protocols evolve, data sizes and communication speeds are increasing dramatically, and the low throughput of hash functions seems to be a bottleneck in these digital communication systems. A promising solution is hardware implementation on reconfigurable devices, which combines high flexibility with speed and physical security.

Various techniques have been proposed to speed up or to improve the throughput of hash functions, for example, unfolding transformation and pipeline and parallelism techniques. In this section, the characteristics which are relevant to the hardware implementation of the hash algorithm will be presented. Then the high-speed hashing module will be introduced based on the delay bound analysis. Finally, two techniques, unfolding transformation and pipeline and parallelism, will be used to optimize the inner logic of the transformation rounds.

Table 1: The parameters of SHAT.

| SHAT | Hash value | Number of steps |
|---|---|---|
| SHAT-128 | 128 bits | 48 |
| SHAT-256 | 256 bits | 48 |
| SHAT-384 | 384 bits | 48 |

3.1. Hash Algorithm Specification. In this section, we introduce a cryptographic hash algorithm with sponge construction called sponge hash algorithm (SHAT). SHAT is a hash function generating 128-, 256-, or 384-bit hash values. According to the hash value length, SHAT can be denoted by SHAT-$(128 \cdot i)$ $(i = 1, 2, 3)$. The parameters of SHAT are shown in Table 1.

3.1.1. $G$ Function. The $G$ function of SHAT consists of an $S$-box and a diffusion layer. The $S$-box is a substitution function that satisfies the confusion property on each 4-bit word. A 32-bit input word $W$, for example, is divided into eight 4-bit words $(w_0, \ldots, w_7)$. Each 4-bit word needs to go through this $S$-box. The definition of the $S$-box is $sw_i = \mathrm{Sbox}(w_i)$ $(i = 0, \ldots, 7)$. This $S$-box is specified in Table 2. The diffusion layer is a permutation that satisfies the diffusion property (the same as the $P$ function of Camellia [11]). Considering computational efficiency, this diffusion layer should be represented using only bit-wise exclusive ORs. The branch number of the diffusion layer should be optimal against differential and linear cryptanalysis for security [11]. When we get all eight 4-bit outputs of the $S$-box $(sw_0, \ldots, sw_7)$, this diffusion layer mixes them. The diffusion layer is defined as (2):

$$
\begin{pmatrix} w'_0 \\ w'_1 \\ w'_2 \\ w'_3 \\ w'_4 \\ w'_5 \\ w'_6 \\ w'_7 \end{pmatrix}
=
\begin{pmatrix}
0 & 1 & 1 & 1 & 1 & 0 & 0 & 1 \\
1 & 0 & 1 & 1 & 1 & 1 & 0 & 0 \\
1 & 1 & 0 & 1 & 0 & 1 & 1 & 0 \\
1 & 1 & 1 & 0 & 0 & 0 & 1 & 1 \\
0 & 1 & 1 & 1 & 1 & 1 & 1 & 0 \\
1 & 0 & 1 & 1 & 0 & 1 & 1 & 1 \\
1 & 1 & 0 & 1 & 1 & 0 & 1 & 1 \\
1 & 1 & 1 & 0 & 1 & 1 & 0 & 1
\end{pmatrix}
\begin{pmatrix} w_0 \\ w_1 \\ w_2 \\ w_3 \\ w_4 \\ w_5 \\ w_6 \\ w_7 \end{pmatrix}
\tag{2}
$$
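A minimal software sketch of the $G$ function may help when checking a hardware model against Table 2 and (2). The S-box values and matrix rows are taken from the text; the nibble order (here $w_0$ is the most significant 4 bits of $W$) is an assumption, since the text does not state it, and the name `g_function` is hypothetical.

```python
# S-box of Table 2, indexed by the 4-bit input w.
SBOX = [0x1, 0x2, 0x4, 0xB, 0xD, 0xE, 0xA, 0x5,
        0xF, 0x8, 0x9, 0x7, 0x6, 0x3, 0x0, 0xC]

# Rows of the binary diffusion matrix in (2); row j selects which
# S-box outputs are XORed together to form output nibble j.
DIFFUSION = ["01111001", "10111100", "11010110", "11100011",
             "01111110", "10110111", "11011011", "11101101"]


def g_function(word):
    """Apply the S-box layer and the XOR-only diffusion layer to a 32-bit word."""
    # Assumption: w_0 is the most significant nibble of the input word.
    nibbles = [(word >> (28 - 4 * j)) & 0xF for j in range(8)]
    sw = [SBOX[w] for w in nibbles]
    result = 0
    for row in DIFFUSION:
        acc = 0
        for bit, value in zip(row, sw):
            if bit == "1":
                acc ^= value
        result = (result << 4) | acc
    return result
```

Because the diffusion layer uses only XORs of 4-bit values, each output nibble in hardware is a small XOR tree over the selected S-box outputs.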

3.1.2. Hash Function of SHAT. SHAT uses the hermetic sponge construction as shown in Figure 4. As we mentioned in Section 2, $br$ is called bitrate and $c$ is called capacity. The bitrate ($br$) and the capacity ($c$) of SHAT-$(128 \cdot i)$ $(i = 1, 2, 3)$ are $32 \cdot i$ and $96 \cdot i$, respectively. The internal state $S$ is divided into $4 \cdot i$ $(i = 1, 2, 3)$ sections as $S = (S_0, \ldots, S_{4i-1})$ $(i = 1, 2, 3)$.

Table 2: $S$-box of the $G$ function.

| $w$ | $sw = \mathrm{Sbox}(w)$ | $w$ | $sw = \mathrm{Sbox}(w)$ |
|---|---|---|---|
| 0x0 | 0x1 | 0x8 | 0xF |
| 0x1 | 0x2 | 0x9 | 0x8 |
| 0x2 | 0x4 | 0xA | 0x9 |
| 0x3 | 0xB | 0xB | 0x7 |
| 0x4 | 0xD | 0xC | 0x6 |
| 0x5 | 0xE | 0xD | 0x3 |
| 0x6 | 0xA | 0xE | 0x0 |
| 0x7 | 0x5 | 0xF | 0xC |

Figure 4: Sponge construction of SHAT. [Diagram: initialization, absorbing, and squeezing phases; the state, split into bitrate $br$ and capacity $c$, passes through repeated Perm blocks; message blocks $M_0, \ldots, M_n$ are XORed in and digest words $H_0$, $H_1$, $H_2$, $H_3$ are output.]

In the absorbing phase, the input message $M = (M_0, M_1, \ldots, M_{n-1})$ shown in Figure 4 is padded to a whole multiple of the bitrate ($br$). The padding method is as follows. Let $l$ be the total length of the input message (we assume that $l$ is a whole multiple of four, that is, a whole number of hexadecimal digits). We append 1 to the end of the message, followed by $k$ zero bits, where $k$ is the smallest nonnegative integer satisfying

$$(l + 1 + k) \bmod (32 \cdot i) = 0 \tag{3}$$
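The padding rule in (3) can be sketched as follows. This is an illustrative model only: the message is represented as a list of bits, and the name `pad_message` is hypothetical.

```python
def pad_message(message_bits, i=1):
    """Pad a list of bits per (3) and split it into (32 * i)-bit blocks.

    A single 1 bit is appended, followed by the smallest number k of 0 bits
    such that (l + 1 + k) mod (32 * i) == 0, where l is the message length.
    """
    block_size = 32 * i
    l = len(message_bits)
    k = (-(l + 1)) % block_size  # smallest nonnegative k satisfying (3)
    padded = list(message_bits) + [1] + [0] * k
    return [padded[j:j + block_size] for j in range(0, len(padded), block_size)]
```

Note that a message whose length is already a multiple of the bitrate still gains a full extra block, because the appended 1 bit must always be present.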

Then we set $S_{4i-1}$ as the bitrate part that is XORed with the padded $br$-bit message block. The result then goes through the one-way compression function Perm. Perm is a permutation process which has 48 steps. Each STEP is defined in Algorithm 1. In Algorithm 1, the left circular rotations $\mathrm{rot}_k$ are $\mathrm{rot}_0 = 19$, $\mathrm{rot}_1 = 1$, and $\mathrm{rot}_2 = 14$. In the squeezing phase, the SQUEEZE function of SHAT is defined in (4). SHAT-$(128 \cdot i)$ $(i = 1, 2, 3)$ is specified in Algorithm 2.

$$
\mathrm{SQUEEZE}(S, i) =
\begin{cases}
(S_3), & i = 1 \\
(S_3, S_7), & i = 2 \\
(S_3, S_7, S_{11}), & i = 3
\end{cases}
\tag{4}
$$

3.2. Hardware Implementation. Following the guidelines of SHAT-$(128 \cdot i)$ $(i = 1, 2, 3)$ as shown in Algorithm 2, the architecture of SHAT is illustrated in Figure 5. The $S$-box of the $G$ function is designed from a Karnaugh map. According to Table 2, we get the logic functions of the $S$-box as shown in (5). We set $A_i$ $(i = 0, 1, 2, 3)$ as the input bits of the $S$-box and $Q_i$ $(i = 0, 1, 2, 3)$ as the output bits.

$$
\begin{aligned}
Q_3 &= \bar{A}_3 \bar{A}_2 A_1 A_0 + A_3 A_2 A_1 A_0 + \bar{A}_3 A_2 \bar{A}_1 + \bar{A}_3 A_2 \bar{A}_0 + A_3 \bar{A}_2 \bar{A}_1 + A_3 \bar{A}_2 \bar{A}_0 \\
Q_2 &= A_3 \bar{A}_1 \bar{A}_0 + \bar{A}_3 A_2 \bar{A}_1 + A_3 A_1 A_0 + \bar{A}_3 A_2 A_0 + \bar{A}_3 \bar{A}_2 A_1 \bar{A}_0 \\
Q_1 &= A_3 \bar{A}_1 \bar{A}_0 + \bar{A}_2 A_1 A_0 + \bar{A}_3 \bar{A}_2 A_0 + A_2 \bar{A}_1 A_0 + \bar{A}_3 A_2 A_1 \bar{A}_0 \\
Q_0 &= \bar{A}_3 \bar{A}_1 \bar{A}_0 + \bar{A}_3 A_1 A_0 + A_3 A_2 \bar{A}_1 A_0 + A_3 \bar{A}_2 \bar{A}_0 + A_3 \bar{A}_2 A_1
\end{aligned}
\tag{5}
$$
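As a sanity check, sum-of-products logic for the $S$-box can be compared exhaustively against Table 2. In the sketch below the placement of the complemented literals is an assumption: it is the assignment that matches both the term structure of (5) and the Table 2 values, with $A_3$ taken as the most significant input bit. The name `sbox_logic` is hypothetical.

```python
# S-box of Table 2, indexed by the 4-bit input.
SBOX = [0x1, 0x2, 0x4, 0xB, 0xD, 0xE, 0xA, 0x5,
        0xF, 0x8, 0x9, 0x7, 0x6, 0x3, 0x0, 0xC]


def sbox_logic(w):
    """Evaluate the S-box as two-level AND/OR logic on the input bits A3..A0."""
    a3, a2, a1, a0 = (w >> 3) & 1, (w >> 2) & 1, (w >> 1) & 1, w & 1
    n3, n2, n1, n0 = 1 - a3, 1 - a2, 1 - a1, 1 - a0  # complemented inputs
    q3 = ((n3 & n2 & a1 & a0) | (a3 & a2 & a1 & a0) | (n3 & a2 & n1)
          | (n3 & a2 & n0) | (a3 & n2 & n1) | (a3 & n2 & n0))
    q2 = ((a3 & n1 & n0) | (n3 & a2 & n1) | (a3 & a1 & a0)
          | (n3 & a2 & a0) | (n3 & n2 & a1 & n0))
    q1 = ((a3 & n1 & n0) | (n2 & a1 & a0) | (n3 & n2 & a0)
          | (a2 & n1 & a0) | (n3 & a2 & a1 & n0))
    q0 = ((n3 & n1 & n0) | (n3 & a1 & a0) | (a3 & a2 & n1 & a0)
          | (a3 & n2 & n0) | (a3 & n2 & a1))
    return (q3 << 3) | (q2 << 2) | (q1 << 1) | q0


# Exhaustive comparison of the logic equations with the lookup table.
MISMATCHES = [w for w in range(16) if sbox_logic(w) != SBOX[w]]
```

An empty `MISMATCHES` list indicates that the two-level logic reproduces the table for all 16 inputs.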

There are 48 iteration rounds in the basic architecture of the Perm function. We use the rolling loop technique to reduce the area requirement. Our design is a single operation block which is reused 48 times, as shown in Figure 6. Here $r_i$ $(i = 0$ to $47)$ is a counter for the number of iteration rounds, from 0 to 47. The critical path is highlighted by a bold line. Since the delay of a circular shift is negligible in hardware implementation, the critical path delay of this architecture is

$$\mathrm{Delay}_n = 4 \cdot \mathrm{Delay}(\oplus) + \mathrm{Delay}(g) \tag{6}$$

3.3. Proposed High-Speed Module. In the previous section, we introduced the rolling loop technique to construct the Perm function. Although this approach is area efficient, throughput is kept low due to the requirement of 48 clock cycles to generate the result. Many architectures can be derived by varying the Perm function to solve this problem. We applied the unfolding transformation technique. This high-speed module combines STEP blocks into a single round and can even take advantage of architectures with a completely round-unrolled circuit. By unfolding, the hidden concurrencies can be parallelized [12]. Also, in [13], the pipeline and parallelism technique was explained to improve the unfolding construction of hash functions. This technique is related to precomputing by analysing the inner logic and architecture of the hash function.

3.3.1. Unfolding Transformation. According to Figure 6, the mathematical expression of one iteration round is described as

$$
\begin{aligned}
S'_3 &= \mathrm{ROT}(S'_1) \oplus (S'_0 \oplus S_2) \\
S'_2 &= S_1 \\
S'_1 &= G(S_3 \oplus r \oplus S_2) \oplus (S_0 \oplus S_1) \\
S'_0 &= S_3 \oplus r
\end{aligned}
\tag{7}
$$

Step($S$)
  (i) For $k = 0$ to $i - 1$:
        (a) $S_{4k+3} = S_{4k+3} \oplus r$
        (b) $S_{4k} = S_{4k} \oplus S_{4k+1}$
        (c) $S_{4k+2} = S_{4k+2} \oplus S_{4k+3}$
        (d) $S_{4k} = S_{4k} \oplus G(S_{4k+2})$
        (e) $S_{4k+2} = S_{4k+2} \oplus (S_{4k} \lll \mathrm{rot}_k)$
  (ii) Temp $= S_{4i-1}$
  (iii) For $k = 4i - 1$ to 1: $S_k = S_{k-1}$
  (iv) $S_0 =$ Temp

Algorithm 1: Typical one step algorithm.
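The following sketch transcribes Algorithm 1 and, for the $i = 1$ case, the closed-form round expression (7), so that the two can be compared. It is an illustration under assumptions: words are 32 bits, the $G$ function is passed in as a callable `g` (any 32-bit function works for the comparison), and the names `step` and `step_closed_form` are hypothetical.

```python
MASK = 0xFFFFFFFF
ROT = (19, 1, 14)  # rot_0, rot_1, rot_2 from the text


def rotl(x, n):
    """Left circular rotation of a 32-bit word."""
    return ((x << n) | (x >> (32 - n))) & MASK


def step(state, r, g):
    """One STEP following Algorithm 1 on a state of 4*i 32-bit words."""
    s = list(state)
    i = len(s) // 4
    for k in range(i):
        s[4 * k + 3] ^= r                             # (a)
        s[4 * k] ^= s[4 * k + 1]                      # (b)
        s[4 * k + 2] ^= s[4 * k + 3]                  # (c)
        s[4 * k] ^= g(s[4 * k + 2])                   # (d)
        s[4 * k + 2] ^= rotl(s[4 * k], ROT[k])        # (e)
    # (ii)-(iv): S_0 takes the old last word, every other word shifts up by one.
    return [s[-1]] + s[:-1]


def step_closed_form(state, r, g):
    """The same round for i = 1, written as in (7)."""
    s0, s1, s2, s3 = state
    n0 = s3 ^ r
    n1 = g(s3 ^ r ^ s2) ^ (s0 ^ s1)
    n2 = s1
    n3 = rotl(n1, ROT[0]) ^ (n0 ^ s2)
    return [n0, n1, n2, n3]
```

For a four-word state both functions return the same result for any `g`, which is what makes (7) a valid starting point for the unfolding transformation.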

SHAT-$(128 \cdot i)(M)$
  Inputs: $n$ padded message blocks $M = (M_0, M_1, \ldots, M_{n-1})$
  Outputs: $(128 \cdot i)$-bit hash value $(H_0, H_1, H_2, H_3)$
  (1) $S = (S_0, \ldots, S_{4i-1}) = (0, 0, \ldots, 0, 128 \cdot i)$   (initialization)
  (2) Perm($S$)
  (3) For $j = 0$ to $n - 1$:   (absorbing phase)
        (i) For $k = 0$ to $i - 1$: $S_{4k+3} = S_{4k+3} \oplus M_{j,k}$
        (ii) Perm($S$)
  (4) $H_0 = \mathrm{SQUEEZE}(S, i)$   (squeezing phase)
  (5) For $k = 1$ to 3:
        (i) Perm($S$)
        (ii) $H_k = \mathrm{SQUEEZE}(S, i)$

Algorithm 2: SHAT-$(128 \cdot i)$.
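The control flow of Algorithm 2 can be sketched as below. This is a structural illustration only, not a reference implementation: the permutation is supplied by the caller as `perm`, each message block is assumed to be a list of $i$ 32-bit words, the initial state is assumed to be all zeros except for the last word, and the names `squeeze` and `shat_sponge` are hypothetical.

```python
def squeeze(state, i):
    """SQUEEZE of (4): read out words S_3, S_7, ... (one per group of four)."""
    return tuple(state[4 * k + 3] for k in range(i))


def shat_sponge(blocks, i, perm):
    """Sponge flow of Algorithm 2 with a caller-supplied permutation."""
    # (1) Initialization; assumed to be zeros followed by the value 128 * i.
    state = [0] * (4 * i - 1) + [128 * i]
    # (2) One permutation before any message is absorbed.
    state = perm(state)
    # (3) Absorbing phase: XOR each block into the bitrate words, then permute.
    for block in blocks:
        for k in range(i):
            state[4 * k + 3] ^= block[k]
        state = perm(state)
    # (4)-(5) Squeezing phase: four outputs H_0..H_3, with Perm between them.
    digest = [squeeze(state, i)]
    for _ in range(3):
        state = perm(state)
        digest.append(squeeze(state, i))
    return digest
```

Each of the four squeezed outputs carries $32 \cdot i$ bits, so the digest totals $128 \cdot i$ bits, in line with Table 1.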

Figure 5: A typical SHAT core. [Block diagram with blocks: padding unit, RAM, SHAT, control unit, message digest extraction; labels: input data, padded data, message digest, $n \times 32$ bits, 32-bit wide registers, $4 \times 32$ bits, 128 bits.]

Figure 6: Typical architecture of one STEP round. [Diagram with inputs $S_0$, $S_1$, $S_2$, $S_3$ and round counter $r_i$, XOR gates, the $g$ block, the ROT block, and outputs $S_0$, $S_1$, $S_2$, $S_3$.]

Here $S_i$ $(i = 0, 1, 2, 3)$ is the input of the current round and $S'_i$ $(i = 0, 1, 2, 3)$ is the output of this round (or the input of the next round). In order to distribute 48 operations equally over each round, the possible values for the unfolding factor are the divisors of 48, that is, 1, 2, 3, 4, 6, 8, 12, 16, 24, and 48. For example, we can unfold two STEP operations in each round; then we get 24 rounds in one permutation process. The expression of throughput is given as

$$\mathrm{Throughput} = (\#\text{ of bits}) \cdot \frac{f_{\text{round}}}{\#\text{ of rounds}} \tag{8}$$
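As a quick numerical illustration of (8), the sketch below evaluates the throughput of a 32-bit-block design for different round counts. The function name `throughput_bps` is hypothetical, and the printed values assume an ideal design in which every round takes exactly one clock cycle.

```python
def throughput_bps(bits_per_block, f_round_hz, rounds):
    """Throughput per (8): bits processed per block times round frequency over rounds."""
    return bits_per_block * f_round_hz / rounds


if __name__ == "__main__":
    # Rolled SHAT-128 (48 rounds) at 100 kHz, in kbps.
    print(round(throughput_bps(32, 100e3, 48) / 1e3, 2))
    # Unfolding factors 1 to 48 at 10 MHz, in Mbps.
    for rounds in (48, 24, 16, 12, 8, 6, 4, 3, 2, 1):
        print(rounds, round(throughput_bps(32, 10e6, rounds) / 1e6, 2))
```

With a fixed clock, halving the number of rounds doubles the throughput, so a fully unrolled design is ideally 48 times faster than the rolled one, which is close to the improvement reported in the experiments.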

Considering (7), although this unfolding transformation reduces the maximum operation frequency, the throughput is increased significantly due to the fact that the number of operations is reduced from 48 to 24. The mathematical expression of one iteration round is replaced by

$$
\begin{aligned}
\mathrm{temp}_3 &= \mathrm{ROT}(\mathrm{temp}_1) \oplus (\mathrm{temp}_0 \oplus S_2) \\
\mathrm{temp}_2 &= S_1 \\
\mathrm{temp}_1 &= G(S_3 \oplus r \oplus S_2) \oplus (S_0 \oplus S_1) \\
\mathrm{temp}_0 &= S_3 \oplus r \\
S'_3 &= \mathrm{ROT}(S'_1) \oplus (S'_0 \oplus \mathrm{temp}_2) \\
S'_2 &= \mathrm{temp}_1 \\
S'_1 &= G(\mathrm{temp}_3 \oplus r \oplus \mathrm{temp}_2) \oplus (\mathrm{temp}_0 \oplus \mathrm{temp}_1) \\
S'_0 &= \mathrm{temp}_3 \oplus r
\end{aligned}
\tag{9}
$$
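The two-STEP round can be checked against two consecutive single STEPs in software. The sketch below assumes 32-bit words, a rotation amount of 19, a caller-supplied `g`, and separate round counters `r0` and `r1` for the first and second STEP (the figure of the two-STEP architecture labels them $r_i$ and $r_{i+1}$); all function names are hypothetical.

```python
MASK = 0xFFFFFFFF


def rotl(x, n):
    """Left circular rotation of a 32-bit word."""
    return ((x << n) | (x >> (32 - n))) & MASK


def one_step(state, r, g):
    """Single STEP round in closed form, as in (7)."""
    s0, s1, s2, s3 = state
    n0 = s3 ^ r
    n1 = g(s3 ^ r ^ s2) ^ (s0 ^ s1)
    return [n0, n1, s1, rotl(n1, 19) ^ (n0 ^ s2)]


def two_steps_unfolded(state, r0, r1, g):
    """Two STEPs merged into one round, following the structure of (9)."""
    s0, s1, s2, s3 = state
    temp0 = s3 ^ r0
    temp1 = g(s3 ^ r0 ^ s2) ^ (s0 ^ s1)
    temp2 = s1
    temp3 = rotl(temp1, 19) ^ (temp0 ^ s2)
    n0 = temp3 ^ r1
    n1 = g(temp3 ^ r1 ^ temp2) ^ (temp0 ^ temp1)
    n2 = temp1
    n3 = rotl(n1, 19) ^ (n0 ^ temp2)
    return [n0, n1, n2, n3]
```

The unfolded round is purely combinational between register stages, which is why its critical path is longer while the number of clocked rounds halves.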

3.3.2. Pipeline and Parallelism. Suppose we unroll two STEP operations in each round; this reduces the achievable frequency while increasing the throughput. However, increased area is introduced as a penalty. If some logic can be computed in parallel and this parallelism lies on the critical path, then the delay of each round can be decreased so that the frequency of each operation is increased. According to (8), when the number of operations is kept constant (the number of bits is also kept constant), the throughput increases with the frequency. This method could be used in any other hardware implementation of a hash function.

For example, Figure 7 shows the architecture of unfolding two STEP operations in one round, which has the minimum critical path delay. The critical path is composed of seven XOR gates and two $G$ functions. By unfolding two STEPs in one round, the critical path grows by three 32-bit XOR gates and one $G$ function compared with the architecture of one STEP block. The critical path is highlighted by a bold line.

In Figure 7, the cycle counter $r_{i+1}$ can be combined with $\mathrm{temp}_2$ first and then XORed with $\mathrm{temp}_3$ in the second STEP part. Compared with the first STEP part, where $r_i$ is XORed with $S_3$ and then XORed with $S_2$, there is an additional component which is used to make a calculation with $\mathrm{temp}_3$ and $r_{i+1}$. Because this output must be generated, this area penalty cannot be avoided.

Thus, when we increase the number of unfolded STEP operations, for example, to three, four, five, and so on, each round delay increases by three 32-bit XOR gates and one $G$ function. Therefore, the normalized delay with unfolding factor $n$ $(n = 1, 2, 3, \ldots)$ is

$$\mathrm{Delay}_n = \frac{4 \cdot \mathrm{Delay}(\oplus) + \mathrm{Delay}(g) + (n - 1) \cdot (3 \cdot \mathrm{Delay}(\oplus) + \mathrm{Delay}(g))}{n} \tag{10}$$

Taking the limit in $n$, (10) becomes

$$\lim_{n \to \infty} \mathrm{Delay}_n = 3 \cdot \mathrm{Delay}(\oplus) + \mathrm{Delay}(g) \tag{11}$$

This is the delay bound of SHAT, which means that the delay of one SHAT operation round cannot be less than this bound.
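The normalized delay and its bound can be explored numerically. The sketch below uses arbitrary illustrative gate delays (they are not measurements from this work), and the function names are hypothetical. Algebraically, the normalized delay equals $3 \cdot \mathrm{Delay}(\oplus) + \mathrm{Delay}(g) + \mathrm{Delay}(\oplus)/n$, so it decreases monotonically toward the bound.

```python
def normalized_delay(n, d_xor, d_g):
    """Per-STEP delay of a round that unfolds n STEP operations."""
    return (4 * d_xor + d_g + (n - 1) * (3 * d_xor + d_g)) / n


def delay_bound(d_xor, d_g):
    """Limit of the normalized delay as the unfolding factor grows."""
    return 3 * d_xor + d_g


if __name__ == "__main__":
    d_xor, d_g = 1.0, 2.0  # illustrative units only
    for n in (1, 2, 3, 4, 6, 8, 12, 16, 24, 48):
        print(n, normalized_delay(n, d_xor, d_g))
    print("bound", delay_bound(d_xor, d_g))
```

For $n = 2$ the numerator is seven XOR delays plus two $G$ delays, which matches the critical path described for the two-STEP round.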

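Figure 7: Proposed architecture of two STEPs round. [Diagram with inputs $S_0$, $S_1$, $S_2$, $S_3$, round counters $r_i$ and $r_{i+1}$, intermediate values $\mathrm{temp}_0$, $\mathrm{temp}_1$, $\mathrm{temp}_2$, $\mathrm{temp}_3$, two $g$ blocks, two ROT blocks, XOR gates, and outputs $S_0$, $S_1$, $S_2$, $S_3$.]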

3.4. Experimental Results. We introduce a measurement of hardware efficiency in (12) [14]. This is an improvement of the normal figure of merit (FOM). We assume that the power is proportional to the gate count; then we can divide the metric by another GE instead of power dissipation when we want to trade off throughput for power. Note that one gate equivalent (GE) is equal to the area of a two-input NAND gate in 45 nm CMOS technology.

$$\mathrm{FOM} = \frac{\mathrm{Throughput}}{\mathrm{GE}^2} \tag{12}$$
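The figure of merit in (12) is straightforward to evaluate. In the sketch below, the scaling is an assumption: throughput is converted to bits per clock cycle and the result is expressed in nanobits per cycle per GE squared, a choice that reproduces the magnitude of the FOM values tabulated for SHAT. The function name `fom` is hypothetical.

```python
def fom(throughput_kbps, area_ge, f_khz=100.0):
    """Figure of merit per (12), scaled to nanobits per clock cycle per GE^2.

    The unit scaling is an assumption chosen to match the tabulated values;
    (12) itself only fixes the ratio Throughput / GE^2.
    """
    bits_per_cycle = throughput_kbps / f_khz
    return bits_per_cycle / area_ge ** 2 * 1e9


if __name__ == "__main__":
    print(round(fom(66.67, 1605), 2))   # SHAT-128 figures from the text
    print(round(fom(200.0, 4753), 2))   # SHAT-384 figures from the text
```

Because area enters squared, a design that is half the size needs only a quarter of the throughput to reach the same FOM, which favours compact implementations.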

Table 3 shows the hardware implementation results of some 128-bit hash functions using a 100 kHz clock frequency and 45 nm CMOS technology. Firstly, the throughput of SHAT-128 (66.67 kbps) is less than that of 5 other hash algorithms, such as MD4 (112.28 kbps), MD5 (83.66 kbps), H-Present-128-32-round (200 kbps), and ARMADILLO2-B (25.0 kbps and 100.0 kbps). However, the area of SHAT-128 is only 28.42% of that of these hash functions on average. This results in the hardware efficiency of SHAT-128 being 13.12 times higher on average. Secondly, the area of SHAT-128 (1605 GE) is larger than that of 3 hash algorithms, namely, U-QUARK-544-round (1379 GE), PHOTON-128-996-round (1122 GE), and SPONGENT-128-8-bit-2380-round (1060 GE); however, the throughput of SHAT-128 is 94.27 times higher. Thus, the FOM of SHAT-128 is 46.75 times higher on average. Finally, the area of SHAT-128 (1605 GE) is less than that of 4 other hash algorithms, namely, H-Present-128-559-round (2330 GE), U-QUARK-68-round (2392 GE), PHOTON-128-156-round (1708 GE), and SPONGENT-128-70-round (1687 GE). The throughput of SHAT-128 is also 5.95 times higher than that of these 4 hash algorithms on average. This results in the FOM of SHAT-128 being 9.66 times higher on average.

Table 3: Hardware implementation results of some 128-bit hash functions.

| Hash function | Block size (bits) | Number of operations | Throughput at 100 kHz (kbps) | Area (GE) | FOM |
|---|---|---|---|---|---|
| SHAT-128 | 32 | 48 | 66.67 | 1605 | 258.80 |
| H-Present-128 [15] | 128 | 559 | 11.45 | 2330 | 21.09 |
| H-Present-128 [15] | 128 | 32 | 200 | 4256 | 110.41 |
| MD4 [15] | 512 | 456 | 112.28 | 7350 | 20.78 |
| MD5 [15] | 512 | 612 | 83.66 | 8400 | 11.86 |
| ARMADILLO2-B [15] | 64 | 256 | 25.0 | 4353 | 13.19 |
| ARMADILLO2-B [15] | 64 | 64 | 100.0 | 6025 | 27.55 |
| U-QUARK [15] | 8 | 544 | 1.47 | 1379 | 7.73 |
| U-QUARK [15] | 8 | 68 | 11.76 | 2392 | 20.56 |
| PHOTON-128 [15] | 16 | 996 | 1.61 | 1122 | 12.78 |
| PHOTON-128 [15] | 16 | 156 | 10.26 | 1708 | 35.15 |
| SPONGENT-128 [15] | 8 | 2380 | 0.34 | 1060 | 2.99 |
| SPONGENT-128 [15] | 16 | 70 | 11.43 | 1687 | 40.16 |

Table 4: Hardware implementation results of some 256-bit hash functions.

| Hash function | Block size (bits) | Number of operations | Throughput at 100 kHz (kbps) | Area (GE) | FOM |
|---|---|---|---|---|---|
| SHAT-256 | 64 | 48 | 133.33 | 3193 | 130.78 |
| SHA-256 [14] | 512 | 490 | 104.48 | 8588 | 14.17 |
| ARMADILLO2-E [14] | 128 | 512 | 25 | 8653 | 3.34 |
| ARMADILLO2-E [14] | 128 | 128 | 100 | 11914 | 7.05 |
| BLAKE [14] | 32 | 816 | 72.79 | 13575 | 0.21 |
| Grøstl [14] | 64 | 196 | 261.14 | 14622 | 1.53 |
| PHOTON-256 [14] | 32 | 156 | 3.21 | 2177 | 6.78 |
| PHOTON-256 [14] | 32 | 156 | 20.51 | 4362 | 10.17 |
| SPONGENT-256 [14] | 16 | 9520 | 0.17 | 1950 | 0.44 |
| SPONGENT-256 [14] | 16 | 140 | 11.43 | 3281 | 10.62 |

Table 5: Hardware implementation results of some 384-bit hash functions.

| Hash function | Block size (bits) | Number of operations | Throughput at 100 kHz (kbps) | Area (GE) | FOM |
|---|---|---|---|---|---|
| SHAT-384 | 96 | 48 | 200 | 4753 | 88.53 |
| SHA-384 [14] | 1024 | 84 | 1219.04 | 43330 | 6.49 |

Table 6: Performance results of hash function using pipeline and parallelism.

| Number of iteration rounds | Area (GE) | Area increase (%) | Delay (ns) | Delay reduction (%) | Power (μW) | Power increase (%) |
|---|---|---|---|---|---|---|
| 48 | 965 | 0.00 | 0.94 | 0.00 | 27.27 | 0.00 |
| 24 | 2010 | 4.15 | 1.87 | 2.60 | 79.05 | 0.91 |
| 16 | 3055 | 5.53 | 2.81 | 3.44 | 136.42 | 3.52 |
| 12 | 4100 | 6.22 | 3.74 | 4.35 | 193.71 | 4.67 |
| 8 | 6190 | 6.91 | 5.61 | 4.92 | 308.48 | 5.73 |
| 6 | 8280 | 7.25 | 7.47 | 5.32 | 423.18 | 6.21 |
| 4 | 12460 | 7.60 | 11.20 | 5.64 | 652.62 | 6.68 |
| 3 | 16640 | 7.77 | 14.93 | 5.74 | 882.00 | 6.90 |
| 2 | 25000 | 7.94 | 22.40 | 5.88 | 1340.80 | 7.12 |
| 1 | 50080 | 8.12 | 44.70 | 6.31 | 2695.40 | 7.40 |

Table 7: Performance results of unrolling steps constructions.

| Number of iteration rounds | Area (GE) | Delay (ns) | Power (μW) | Throughput at 10 MHz (Mbps) |
|---|---|---|---|---|
| 48 | 965 | 0.94 | 27.27 | 6.67 |
| 24 | 1930 | 1.92 | 78.34 | 13.33 |
| 16 | 2895 | 2.91 | 131.78 | 20.00 |
| 12 | 3860 | 3.91 | 185.06 | 26.67 |
| 8 | 5790 | 5.90 | 291.77 | 40.00 |
| 6 | 7720 | 7.89 | 398.42 | 53.33 |
| 4 | 11580 | 11.87 | 611.78 | 80.00 |
| 3 | 15440 | 15.84 | 825.06 | 106.67 |
| 2 | 23160 | 23.80 | 1251.70 | 160.00 |
| 1 | 46320 | 47.71 | 2509.70 | 320.00 |

In Table 4, firstly, the throughput of SHAT-256 is 51.05% of that of Grøstl; however, the area of SHAT-256 is only 21.84% of that of Grøstl; this results in 84.47 times higher hardware efficiency for SHAT-256. Secondly, the throughput of SHAT-256 (3193 GE) is 412.91 times higher on average than that of 2 hash algorithms, namely, PHOTON-256-156-round (2177 GE) and SPONGENT-256-9520-round (1950 GE); although the area of SHAT-256 is larger, the FOM of SHAT-256 is still 158.25 times higher than that of these 2 hash algorithms. Thirdly, compared with SHA-256, ARMADILLO2-E, BLAKE, PHOTON-256-156-round, and SPONGENT-256-140-round, the throughput of SHAT-256 is 4.65 times higher on average and the area of SHAT-256 is only 49.15% of that of these hash algorithms on average. Therefore, the FOM of SHAT-256 is 119.14 times higher on average.

In Table 5, the throughput of SHA-384 is 6.09 times higher than that of SHAT-384; however, the area of SHA-384 is 9.11 times higher; this results in the hardware efficiency of SHAT-384 being 13.64 times higher than that of SHA-384.

Then we implement the unfolding transformation technique with 10 different numbers of unrolled loops (1, 2, ..., 48) using 45 nm CMOS technology at 10 MHz to evaluate the performance of SHAT-128; the results are shown in Table 7. As we can see in Table 7, the throughput of the Perm function can be up to 47.97 times higher than that of the original design, which is 6.67 Mbps. However, area, delay, and power increase dramatically as a penalty.

Finally, we implement the pipeline and parallelism technique to reconstruct the STEP block. As shown in Table 6, compared with the performance of the original circuit, the critical path delay is reduced by up to 6.31%, while the power and area increase by up to about 8%.

4. Low Power Design for Hash Function

Low power design is a significant consideration in hardware implementation. The power consumption determines a device's lifetime, reliability, and energy cost. Thus, low power techniques are now applied to nearly every application. There are many methods to reduce power consumption, such as clock gating and power gating, which address dynamic power and leakage power, respectively. Decreasing the clock frequency also reduces the power dissipation dramatically.

Firstly, we propose the frequency trade-off technique. By using this method, we can obtain a range of frequency values for making a trade-off between low power consumption and high throughput of the hash function. Secondly, we construct a hash encryption system which includes an input data padding unit, RAM registers, the main hash computing construction, a message digest extraction component, and a main control unit. Thirdly, by analyzing the idle mode and control signals of this hash encryption system, the load-enable based clock gating scheme is applied to reduce the dynamic power consumption.

4.1. Frequency Trade-Off Technique. According to (1), reducing the clock frequency is an effective method to decrease dynamic power dissipation linearly. In Section 2.2, we discussed the DVFS technique. By collecting information about workload and temperature, DVFS determines a clock frequency sufficient for the required performance. However, modifying the clock frequency at RTL is not easy; normally we treat the clock frequency as constant. Also, dynamic frequency scaling reduces the number of operations a system can issue in a given amount of time, thus reducing performance. Therefore, there is an issue we need to consider: a high clock frequency brings high throughput, but the dramatically increased dynamic power consumption is a critical drawback; a low clock frequency minimizes the dynamic power dissipation, but it decreases the throughput as well.

However, with the unfolding transformation technique introduced in Section 3.3, the maximum frequency of the Perm function decreases while the number of unrolled loops increases. This means that we can decrease the clock frequency while increasing the throughput of the hash algorithm. Thus, the unfolding transformation achieves high performance without a high clock frequency. Given this advantage, by choosing a proper clock frequency we can make a trade-off between high performance and low power consumption.

Next, we explain how to get this scope of frequency values from the two performance bounds. First, we obtain two values for the rolled Perm circuit: the dynamic power consumption $P_1$ and the clock frequency $f_1$, which is defined by the necessity of circuit design (the clock period computed from $f_1$ needs to be not less than the critical path delay). Then, according to (8), we can get the throughput $T_1$ at this frequency. Thus, the two performance bounds are defined in (13), where $n$ is the number of iteration rounds in one Perm function with rolled STEPs.

$$
\begin{aligned}
P_{\max} &= P_1 \cdot n \\
T_{\min} &= T_1
\end{aligned}
\tag{13}
$$

This method can be defined as follows. Referring to the performance of the original folded circuit (we assume that this circuit is the one with 48 iteration rounds in one Perm function), each unfolding transformation design with a different number of unrolled STEPs (2, 3, ..., 48) has two performance bounds: one is the maximum dynamic power and the other is the minimum throughput of the circuit. These two performance bounds are used to determine the boundary of the proper frequency range for each unfolding transformation circuit. This means that when we choose a specific clock frequency in this range, the total dynamic power consumption of that Perm function will be no more than the defined maximum dynamic power $P_{\max}$ and its throughput will be no less than the fixed minimum throughput $T_{\min}$.

Figure 8: Hash encryption system. [Block diagram with blocks: receiver, RAM, clock divider, main control, hash process, LCD displayer, AND gate; labels: input message, $n \times br$ bits, output digest, digest display.]

This clock frequency scope gives us many different choices for different circuit designs using the unfolding transformation technique. The results of this frequency trade-off technique are shown in Table 9 in Section 4.4.
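One possible reading of the frequency scope is sketched below. It is an interpretation under stated assumptions, not the authors' procedure: dynamic power is assumed to scale linearly with clock frequency from a reference measurement (`p_ref` at `f_ref`), the lower limit follows from the throughput expression (8) and $T_{\min}$, the upper limit follows from $P_{\max}$ and from the critical path delay, and the function name `frequency_window` and the sample numbers are illustrative.

```python
def frequency_window(p1, t1, n_rolled, p_ref, f_ref, rounds, bits, delay_s):
    """Return (f_low, f_high) for one unfolded design.

    p1, t1    : dynamic power and throughput of the rolled circuit
    n_rolled  : iteration rounds of the rolled circuit (n in (13))
    p_ref     : dynamic power of the unfolded design measured at f_ref
    rounds    : iteration rounds of the unfolded design
    bits      : message bits absorbed per Perm call
    delay_s   : critical path delay of the unfolded design, in seconds
    """
    p_max = p1 * n_rolled               # first bound of (13)
    t_min = t1                          # second bound of (13)
    f_low = t_min * rounds / bits       # throughput at f must not drop below t_min
    f_power = f_ref * p_max / p_ref     # assumes power is proportional to frequency
    f_timing = 1.0 / delay_s            # clock period must cover the critical path
    return f_low, min(f_power, f_timing)


if __name__ == "__main__":
    # Illustrative numbers in the style of the 10 MHz measurements.
    low, high = frequency_window(
        p1=27.27e-6, t1=32 * 10e6 / 48, n_rolled=48,
        p_ref=78.34e-6, f_ref=10e6, rounds=24, bits=32, delay_s=1.92e-9)
    print(low / 1e6, high / 1e6)  # window limits in MHz
```

Any frequency inside the returned window keeps the modeled power at or below $P_{\max}$ and the throughput at or above $T_{\min}$.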

4.2. Hash Encryption System Design. The hash encryption system is divided into 5 main parts, as shown in Figure 8.

Firstly, the receiver and RAM section is our padding unit. We use serial communication to connect the PC and the hash encryption system. Thus, we need a clock divider to generate a clock that is synchronous with the Baud rate of the serial communication. We choose 4800 Baud as our transmission rate, a deliberately modest speed chosen for a low error rate (less than 3%). In this case, one Baud represents 1 bit. Our transmission format is one start bit "0", then an 8-bit message, and one finish bit "1". The start bit and finish bit are added to the transmitted message bits automatically. The sampling rate of the receiver is 16 and the FPGA board provides a 100 MHz clock. Thus, the clock period used in sampling is 1302 times the provided 100 MHz clock period, as shown in (14). The resulting error is 0.0064%, which is less than 3%.

$$\text{Sampling Clock Cycles} = \frac{\text{Clock Frequency}}{\text{Baud rate} \cdot \text{Sampling Rate}} = \frac{100\ \mathrm{MHz}}{16 \times 4800\ \mathrm{B/s}} \approx 1302 \tag{14}$$
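The divider value and its rounding error can be reproduced with a few lines. The function name `sampling_divider` is hypothetical; the inputs are the figures given in the text.

```python
def sampling_divider(f_clk_hz, baud_rate, sampling_rate):
    """Return the integer clock divider of (14) and its relative error in percent."""
    exact = f_clk_hz / (baud_rate * sampling_rate)
    divider = round(exact)
    error_percent = abs(exact - divider) / exact * 100.0
    return divider, error_percent


if __name__ == "__main__":
    divider, error_percent = sampling_divider(100e6, 4800, 16)
    print(divider, round(error_percent, 4))
```

The error comes only from rounding the exact ratio (about 1302.08) to an integer number of clock cycles.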

The liquid crystal display (LCD) limits the number of characters we can display to 32 hexadecimal characters, which matches the number of digest bits of SHAT-128. Thus, our $br$ for each padded block is set to 32 bits, which consist of eight 4-bit hexadecimal numbers.

Secondly, the hash function which we introduced in Section 3 is designed as a sponge construction, as shown in Figure 4. After absorbing $n$ 32-bit message blocks, a 128-bit digest is squeezed out.

Finally, the main control unit is designed for managing the working order between the receiver, the hash process, and the LCD display. Figure 9 shows the pipelined operation of the system.

Because we use serial communication, the speed is slow. We apply 4800 Baud for a low error rate; thus, each 32-bit block needs roughly 7 ms. For example, if seven 32-bit blocks need to be transmitted, roughly 50 ms is spent on data receiving and padding. The hash function used in this system performs one STEP each round, which means that there are 48 iteration rounds for a complete Perm function; even so, hash processing needs only roughly 6 μs. The LCD displaying period also costs much time. Even though we can finish LCD initialization before we get the hash digest, we still need roughly 15 ms to completely display all data.

4.3. Load-Enable Based Clock Gating. In this section, we introduce the load-enable based clock gating technique for the hash encryption system.

Clock gating is the most widely used low power technique at RTL. It is more reasonable to determine the toggle rate of gate output at RTL than any of the other three components, namely, $V_{DD}$, clock frequency, and gate output capacitance. According to Figure 9, the hash encryption system is composed of a pipeline construction. The finishing signal of each process can be treated as the enable signal in load-enable based clock gating, as shown in Figure 3. On the other hand, the XOR-based clock gating technique needs to specify the outputs of single-level flip-flops, which are not easily determined in our encryption system; thus, load-enable based clock gating is our best option as a low power method.

As shown in Figure 10, there are three signal pairs to realize this load-enable based clock gating: en_div and fsh_r, en_h and fsh_h, and en_lcd and fsh_lcd. Because the receiver runs at a specific clock frequency which corresponds to the serial communication, the main control unit does not gate the clock signal of the receiver directly; by controlling the clock signal of the clock divider with en_div, the receiver can be properly managed.

Figure 9 gives us three operation phases of the encryption system. In the first phase, the en_div and en_lcd signals are asserted to logic one and en_h is asserted to logic zero; thus, the receiver starts receiving input messages and padding them into RAM. In the meantime, the system begins the initialization process for the LCD displayer. However, the hash processing unit is waiting for the padded input message. Since the serial communication takes a long time, due to the low Baud rate and its characteristic of transmitting the message one bit at a time, LCD displayer initialization can be finished before the padded message is ready. Thus, en_lcd can be asserted to logic zero by the main control unit when fsh_lcd switches to logic one.

Figure 9: Three phases of hash encryption system.

| Unit | Phase one | Phase two | Phase three |
|---|---|---|---|
| Receiver | Data receiving and padding | Idle | Idle |
| Hash | Idle | Hash processing | Idle |
| LCD | Initialization, then idle | Idle | LCD displaying |

Figure 10: Control signals of hash encryption system. [Block diagram with blocks: receiver, RAM, clock divider, main control, hash process, LCD displayer, AND gate; data labels: input message, padded message, digest, digest display; control signals: en_div, fsh_r, en_h, fsh_h, en_lcd, fsh_lcd, clk_r.]

During the second phase, the padded message is ready, so fsh_r switches to logic one. Then en_div is set to zero, which turns off the clock divider; no receiver clock is produced, and the receiver stops working. In this phase en_h is asserted to logic one for hash encryption, which is our core function. en_lcd is still zero, waiting for the hash digest generated by hash processing.

The system enters the third phase when the fsh_h signal switches to logic one. In this phase the hash digest is ready; thus both the receiver and the hash process are idle, which means that en_div and en_h are both held at logic zero. Signal en_lcd is asserted to logic one to start LCD displaying and is set back to zero when the displaying process is finished. This is the end of the whole process; the device is then turned off or repeats these three phases for another input message.
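The three phases can be summarized as a small behavioural model. The following Python sketch is only an illustration of the enable values described above, not the RTL of the main control unit; the names `Enables`, `enables`, and `next_phase`, and the flags `lcd_init_done` and `display_done`, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Enables:
    en_div: int  # enables the clock divider that clocks the receiver
    en_h: int    # enables the hash processing unit
    en_lcd: int  # enables the LCD displayer


def enables(phase: int, lcd_init_done: bool = False,
            display_done: bool = False) -> Enables:
    """Return the enable values the text describes for each phase."""
    if phase == 1:
        # Receiving and padding; the LCD is gated once fsh_lcd reports
        # that its initialization has finished.
        return Enables(en_div=1, en_h=0, en_lcd=0 if lcd_init_done else 1)
    if phase == 2:
        # Hash processing only.
        return Enables(en_div=0, en_h=1, en_lcd=0)
    if phase == 3:
        # LCD displaying; gated again when the display is complete.
        return Enables(en_div=0, en_h=0, en_lcd=0 if display_done else 1)
    raise ValueError("phase must be 1, 2, or 3")


def next_phase(phase: int, fsh_r: int, fsh_h: int) -> int:
    """Advance on the finishing signals fsh_r and fsh_h."""
    if phase == 1 and fsh_r:
        return 2
    if phase == 2 and fsh_h:
        return 3
    return phase
```

In this model every component whose enable is zero receives no clock edges, which is the source of the dynamic power saving.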

By analyzing the construction and process of the hash encryption system, we can figure out the idle time of each component. Then, by applying load-enable based clock gating to each component, the dynamic power dissipation of this system can be reduced, as shown in Table 8 in Section 4.4.

4.4. Experimental Results. Using a 10 MHz clock frequency and 45 nm CMOS technology, the results of the frequency trade-off technique are shown in Tables 9, 10, and 11. Table 9 shows that the area and critical path delay are unchanged compared with the unfolding transformation technique. Tables 10 and 11 give the variation of dynamic power consumption and throughput under the frequency trade-off method. Note that $f_i$ stands for frequency, $T_i$ stands for throughput, and $T_{i,\text{pct}}$ is the percentage increase compared with the minimum throughput ($T_{\min}$), which is 6.67 Mbps. $P_i$ is the total dynamic power consumed in finishing a complete Perm function, and $P_{i,\text{pct}}$ is the percentage of power reduction compared with the maximum power consumption ($P_{\max}$), defined as 1308.96 μW, which is the product of 48 (the number of iteration rounds) and 27.27 μW (as shown in Table 7). Note that $i$ stands for the number of iteration rounds.

Table 8: Hardware implementation with/without load-enable based clock gating.

| System type | Area (GE) | Area increase (%) | Delay (ns) | Delay increase (%) | Power (μW) | Power reduction (%) |
|---|---|---|---|---|---|---|
| Original | 14053 | n/a | 1.63 | n/a | 1830.20 | n/a |
| Clock gated | 14565 | 3.64 | 1.72 | 5.52 | 1580.36 | 13.65 |

Table 9: Area and delay performances of frequency trade-off technique.

| Number of iteration rounds | Area (GE) | Delay (ns) | Frequency (MHz) |
|---|---|---|---|
| 48 | 965 | 0.94 | 10.00 |
| 24 | 1930 | 1.92 | $5.00 < f_{24} < 6.96$ |
| 16 | 2895 | 2.91 | $3.33 < f_{16} < 6.20$ |
| 12 | 3860 | 3.91 | $2.50 < f_{12} < 5.89$ |
| 8 | 5790 | 5.90 | $1.67 < f_{8} < 5.60$ |
| 6 | 7720 | 7.89 | $1.25 < f_{6} < 5.47$ |
| 4 | 11580 | 11.87 | $0.83 < f_{4} < 5.34$ |
| 3 | 15440 | 15.84 | $0.63 < f_{3} < 5.28$ |
| 2 | 23160 | 23.80 | $0.42 < f_{2} < 5.22$ |
| 1 | 46320 | 47.71 | $0.21 < f_{1} < 5.21$ |

Then we apply the load-enable based clock gating scheme to the hash encryption system, using a 100 MHz clock frequency, which can be provided on the FPGA board, and 45 nm CMOS technology. As shown in Table 8, the dynamic power decreases by 13.65%. However, a 3.64% increase in area and a 5.52% increase in critical path delay are the price paid.


Table 10: Dynamic power consumption of frequency trade-off technique.

| Number of iteration rounds | Power (μW) | Reduction (%) | Frequency (MHz) |
|---|---|---|---|
| 48 | 1308.96 | n/a | 10.00 |
| 24 | $940.08 < P_{24} < 1308.48$ | $0.04 < P_{24,\text{pct}} < 28.18$ | $5.00 < f_{24} < 6.96$ |
| 16 | $702.88 < P_{16} < 1307.20$ | $0.13 < P_{16,\text{pct}} < 46.30$ | $3.33 < f_{16} < 6.20$ |
| 12 | $555.12 < P_{12} < 1308.00$ | $0.07 < P_{12,\text{pct}} < 57.59$ | $2.50 < f_{12} < 5.89$ |
| 8 | $389.04 < P_{8} < 1307.12$ | $0.14 < P_{8,\text{pct}} < 70.28$ | $1.67 < f_{8} < 5.60$ |
| 6 | $298.80 < P_{6} < 1307.64$ | $0.10 < P_{6,\text{pct}} < 77.17$ | $1.25 < f_{6} < 5.47$ |
| 4 | $203.92 < P_{4} < 1306.76$ | $0.17 < P_{4,\text{pct}} < 84.42$ | $0.83 < f_{4} < 5.34$ |
| 3 | $154.71 < P_{3} < 1306.89$ | $0.16 < P_{3,\text{pct}} < 88.18$ | $0.63 < f_{3} < 5.28$ |
| 2 | $104.30 < P_{2} < 1306.76$ | $0.17 < P_{2,\text{pct}} < 92.03$ | $0.42 < f_{2} < 5.22$ |
| 1 | $52.29 < P_{1} < 1307.60$ | $0.10 < P_{1,\text{pct}} < 96.01$ | $0.21 < f_{1} < 5.21$ |

The largest power reduction in each row corresponds to the lowest frequency of that row.

Table 11: Throughput performances of frequency trade-off technique.

| Number of iteration rounds | Throughput (Mbps) | Improvement (%) | Frequency (MHz) |
|---|---|---|---|
| 48 | 6.67 | n/a | 10.00 |
| 24 | $6.67 < T_{24} < 9.28$ | $0.00 < T_{24,\text{pct}} < 39.13$ | $5.00 < f_{24} < 6.96$ |
| 16 | $6.67 < T_{16} < 12.4$ | $0.00 < T_{16,\text{pct}} < 85.91$ | $3.33 < f_{16} < 6.20$ |
| 12 | $6.67 < T_{12} < 15.71$ | $0.00 < T_{12,\text{pct}} < 135.53$ | $2.50 < f_{12} < 5.89$ |
| 8 | $6.67 < T_{8} < 22.40$ | $0.00 < T_{8,\text{pct}} < 235.83$ | $1.67 < f_{8} < 5.60$ |
| 6 | $6.67 < T_{6} < 29.17$ | $0.00 < T_{6,\text{pct}} < 337.33$ | $1.25 < f_{6} < 5.47$ |
| 4 | $6.67 < T_{4} < 42.72$ | $0.00 < T_{4,\text{pct}} < 540.48$ | $0.83 < f_{4} < 5.34$ |
| 3 | $6.67 < T_{3} < 56.32$ | $0.00 < T_{3,\text{pct}} < 744.38$ | $0.63 < f_{3} < 5.28$ |
| 2 | $6.67 < T_{2} < 83.52$ | $0.00 < T_{2,\text{pct}} < 1152.17$ | $0.42 < f_{2} < 5.22$ |
| 1 | $6.67 < T_{1} < 166.72$ | $0.00 < T_{1,\text{pct}} < 2399.55$ | $0.21 < f_{1} < 5.21$ |

5. Conclusion

To achieve a high performance and low power hardware implementation of a cryptographic hash function that uses the sponge construction, we applied four techniques. Firstly, we used the unfolding transformation technique to improve the throughput of the hash function. Secondly, pipeline and parallelism design techniques were implemented to reduce the critical path delay by modifying the structure of the permutation function. Thirdly, a frequency trade-off technique was proposed to calculate a frequency range that can be used to trade off low dynamic power consumption against high throughput of the hash function. Finally, a load-enable based clock gating scheme was applied in the hash encryption system to eliminate wasted signal toggling in the idle mode.

The experimental results have shown that the unfolding transformation technique can achieve up to 47.97 times higher throughput, the pipeline and parallelism methods give a 6.31% delay reduction, the load-enable based clock gating scheme decreases dynamic power consumption by 13.65%, and the frequency trade-off technique shows how to choose the clock frequency of the hash function to achieve low power consumption and high throughput.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This research was supported by the MKE (The Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program (NIPA-2012-H0301-12-3007) supervised by the NIPA (National IT Industry Promotion Agency).

References

[1] H. Michail and C. Goutis, "Holistic methodology for designing ultra high-speed SHA-1 hashing cryptographic module in hardware," in Proceedings of the IEEE International Conference on Electron Devices and Solid-State Circuits (EDSSC '08), pp. 1-4, Hong Kong, December 2008.

[2] "Cryptographic hash algorithm competition," NIST Computer Security Resource Center, http://csrc.nist.gov/groups/ST/hash/sha-3/index.html.


[3] B. Schneier, Applied Cryptography: Protocols, Algorithms, and Source Code in C, John Wiley & Sons, New York, NY, USA, 2nd edition, 1996.

[4] J. Nakajima and M. Mitsuru, "Performance analysis and parallel implementation of dedicated hash function," in Proceedings of the International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT '02), vol. 2332, pp. 165-180, Amsterdam, The Netherlands, 2002.

[5] P. C. van Oorschot, A. Somayaji, and G. Wurster, "Hardware-assisted circumvention of self-hashing software tamper resistance," IEEE Transactions on Dependable and Secure Computing, vol. 2, no. 2, pp. 82-92, 2005.

[6] G. Bertoni, J. Daemen, M. Peeters, and G. van Assche, "Cryptographic sponge functions," The Sponge Functions Corner, http://sponge.noekeon.org/index.html.

[7] "Sponge function," Wikipedia, http://en.wikipedia.org/wiki/Sponge_function.

[8] L. Li, Power optimization from register transfer level to transistor level in deeply scaled CMOS technology [Ph.D. thesis], Illinois Institute of Technology, Chicago, Ill, USA, 2012.

[9] N. Weste and D. Harris, CMOS VLSI Design: A Circuits and Systems Perspective, Addison-Wesley, Reading, Mass, USA, 2010.

[10] Y. Zhang, Q. Tong, L. Li, et al., "Automatic register transfer level CAD tool design for advanced clock gating and low power schemes," in Proceedings of the International SoC Design Conference (ISOCC '12), pp. 21-24, Jeju Island, Republic of Korea, 2012.

[11] K. Aoki, T. Ichikawa, and M. Kanda, "Specification of Camellia, a 128-bit block cipher," Nippon Telegraph and Telephone Corporation, Mitsubishi Electric Corporation, 2000.

[12] Y. K. Lee, H. Chan, and I. Verbauwhede, "Throughput optimized SHA-1 architecture using unfolding transformation," in Proceedings of the 17th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP '06), pp. 354-359, Steamboat Springs, Colo, USA, September 2006.

[13] H. Michail, A. P. Kakarountas, O. Koufopavlou, and C. E. Goutis, "A low-power and high-throughput implementation of the SHA-1 hash function," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '05), vol. 4, pp. 4086-4089, Kobe, Japan, May 2005.

[14] S. Badel, N. Dagtekin, J. Nakahara Jr., et al., "ARMADILLO: a multi-purpose cryptographic primitive dedicated to hardware," in Cryptographic Hardware and Embedded Systems, CHES 2010, vol. 6225 of Lecture Notes in Computer Science, pp. 398-412, 2010.

[15] K. Lin, Y. Zhang, K. Choi, J. Kang, and S. Hong, "Multi-purpose bimodal cryptographic algorithm and its hardware implementation," in Proceedings of the FTRA International Conference on Advanced IT, Engineering and Management (FTRA AIM '13), Seoul, Korea, 2013.


Table 2: S-box of the G function.

| $s_w$ | Sbox($w$) | $s_w$ | Sbox($w$) |
|---|---|---|---|
| 0x0 | 0x1 | 0x8 | 0xF |
| 0x1 | 0x2 | 0x9 | 0x8 |
| 0x2 | 0x4 | 0xA | 0x9 |
| 0x3 | 0xB | 0xB | 0x7 |
| 0x4 | 0xD | 0xC | 0x6 |
| 0x5 | 0xE | 0xD | 0x3 |
| 0x6 | 0xA | 0xE | 0x0 |
| 0x7 | 0x5 | 0xF | 0xC |

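Figure 4: Sponge construction of SHAT. (The state, initialized to $0, \ldots, 0, 128 \cdot i$ and split into the bitrate $br$ and the capacity $c$, passes through Perm in three phases: initialization, absorbing of the message blocks $M_0, \ldots, M_n$ by XOR, and squeezing of $H_0$, $H_1$, $H_2$, $H_3$.)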

the bitrate ($br$) and the capacity ($c$) of SHAT-($128 \cdot i$) ($i = 1, 2, 3$) are $32 \cdot i$ and $96 \cdot i$, respectively. The internal state $S$ is divided into $4 \cdot i$ ($i = 1, 2, 3$) sections as $S = (S_0, \ldots, S_{4i-1})$.

In the absorbing phase, the input message $M = (M_0, M_1, \ldots, M_{n-1})$ shown in Figure 4 is padded to a whole multiple of the bitrate ($br$). Our padding method is as follows. Let $l$ be the total length of the input message (we assume that $l$ is a whole multiple of four, that is, an integer number of hexadecimal digits). We append a single 1 bit to the end of the message, followed by $k$ zero bits, where $k$ is the smallest nonnegative integer that satisfies

$$
(l + 1 + k) \bmod (32 \cdot i) = 0. \tag{3}
$$
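As an illustration of the padding rule in (3), the following Python sketch pads a message given as a string of '0' and '1' characters. The function name `pad_message` and the bit-string representation are assumptions made for this example only.

```python
def pad_message(bits: str, i: int = 1) -> str:
    """Append a 1 bit and then k zero bits so that the total length
    is a multiple of the bitrate 32*i, with k as small as possible."""
    br = 32 * i
    k = (-(len(bits) + 1)) % br  # smallest nonnegative k satisfying (3)
    return bits + "1" + "0" * k
```

For example, a message that is already 32 bits long is extended to 64 bits for SHAT-128, because the mandatory 1 bit no longer fits into the first block.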

Then we set $S_{4i-1}$ as the bitrate part of the state, which is XORed with the padded $br$-bit message block. The result then goes through the one-way compression function Perm. Perm is a permutation process with 48 steps; each STEP is defined in Algorithm 1. In Algorithm 1, the left circular rotations $rot_k$ are $rot_0 = 19$, $rot_1 = 1$, and $rot_2 = 14$. In the squeezing phase, the output of SHAT is defined in (4). SHAT-($128 \cdot i$) ($i = 1, 2, 3$) is specified in Algorithm 2.

$$
\text{SQUEEZE}(S, i) =
\begin{cases}
S_3, & i = 1 \\
S_3, S_7, & i = 2 \\
S_3, S_7, S_{11}, & i = 3
\end{cases}
\tag{4}
$$

3.2. Hardware Implementation. Following the specification of SHAT-($128 \cdot i$) ($i = 1, 2, 3$) in Algorithm 2, the architecture of SHAT is illustrated in Figure 5.

The S-box of the $G$ function is designed from a Karnaugh map. From Table 2 we obtain the logic functions of the S-box shown in (5). We set $A_i$ ($i = 0, 1, 2, 3$) as the input bits of the S-box and $Q_i$ ($i = 0, 1, 2, 3$) as the output bits. The complemented literals, which are marked by overbars, follow from the truth table in Table 2.

$$
\begin{aligned}
Q_3 &= \overline{A_3}\,\overline{A_2}A_1A_0 + A_3A_2A_1A_0 + \overline{A_3}A_2\overline{A_1} + \overline{A_3}A_2\overline{A_0} + A_3\overline{A_2}\,\overline{A_1} + A_3\overline{A_2}\,\overline{A_0} \\
Q_2 &= A_3A_1A_0 + \overline{A_3}A_2\overline{A_1} + A_3\overline{A_1}\,\overline{A_0} + \overline{A_3}A_2A_0 + \overline{A_3}\,\overline{A_2}A_1\overline{A_0} \\
Q_1 &= A_3\overline{A_1}\,\overline{A_0} + A_2\overline{A_1}A_0 + \overline{A_3}\,\overline{A_2}A_0 + \overline{A_2}A_1A_0 + \overline{A_3}A_2A_1\overline{A_0} \\
Q_0 &= \overline{A_3}\,\overline{A_1}\,\overline{A_0} + \overline{A_3}A_1A_0 + A_3A_2\overline{A_1}A_0 + A_3\overline{A_2}\,\overline{A_0} + A_3\overline{A_2}A_1
\end{aligned}
\tag{5}
$$
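A sum-of-products form for the S-box can be checked exhaustively against Table 2. The Python sketch below does this for one candidate set of product terms; the placement of the complemented literals is an assumption derived from the truth table in Table 2 (it is listed in the comments), and the names `SBOX` and `sbox_logic` are hypothetical.

```python
# Table 2 as a lookup table: SBOX[w] = Sbox(w).
SBOX = [0x1, 0x2, 0x4, 0xB, 0xD, 0xE, 0xA, 0x5,
        0xF, 0x8, 0x9, 0x7, 0x6, 0x3, 0x0, 0xC]


def sbox_logic(w: int) -> int:
    """Evaluate the S-box as two-level AND/OR logic (n* = complemented input)."""
    a3, a2, a1, a0 = (w >> 3) & 1, (w >> 2) & 1, (w >> 1) & 1, w & 1
    n3, n2, n1, n0 = 1 - a3, 1 - a2, 1 - a1, 1 - a0
    # Q3: two 4-literal terms and four 3-literal terms
    q3 = ((n3 & n2 & a1 & a0) | (a3 & a2 & a1 & a0) | (n3 & a2 & n1)
          | (n3 & a2 & n0) | (a3 & n2 & n1) | (a3 & n2 & n0))
    # Q2: four 3-literal terms and one 4-literal term
    q2 = ((a3 & a1 & a0) | (n3 & a2 & n1) | (a3 & n1 & n0)
          | (n3 & a2 & a0) | (n3 & n2 & a1 & n0))
    # Q1
    q1 = ((a3 & n1 & n0) | (a2 & n1 & a0) | (n3 & n2 & a0)
          | (n2 & a1 & a0) | (n3 & a2 & a1 & n0))
    # Q0
    q0 = ((n3 & n1 & n0) | (n3 & a1 & a0) | (a3 & a2 & n1 & a0)
          | (a3 & n2 & n0) | (a3 & n2 & a1))
    return (q3 << 3) | (q2 << 2) | (q1 << 1) | q0
```

Because the input is only 4 bits wide, comparing `sbox_logic(w)` with `SBOX[w]` for all 16 values of `w` is a complete check.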

There are 48 iteration rounds in the basic architecture of the Perm function. We use the rolling loop technique to reduce the area requirement: our design is a single operation block which is reused 48 times, as shown in Figure 6. Here $r_i$ ($i = 1$ to 47) is a counter for the number of iteration rounds, from 0 to 47. The critical path is highlighted by a bold line. Since the delay of a circular shift is negligible in hardware, the critical path delay of this architecture is

$$
\text{Delay}_n = 4 \cdot \text{Delay}(\oplus) + \text{Delay}(g). \tag{6}
$$

3.3. Proposed High-Speed Module. In the previous section we introduced the rolling loop technique to construct the Perm function. Although this approach is area efficient, throughput is kept low because 48 clock cycles are required to generate the result. Many architectures can be obtained by varying the Perm function to solve this problem. We applied the unfolding transformation technique. This high-speed module combines several STEP blocks into a single round and can even take advantage of architectures with a completely round-unrolled circuit. By unfolding, the hidden concurrencies can be parallelized [12]. In addition, the pipeline and parallelism technique explained in [13] improves the unfolded construction of a hash function. This technique relies on precomputation, obtained by analysing the inner logic and architecture of the hash function.

3.3.1. Unfolding Transformation. According to Figure 6, the mathematical expression of one iteration round is

$$
\begin{aligned}
S'_3 &= \text{ROT}(S'_1) \oplus (S'_0 \oplus S_2) \\
S'_2 &= S_1 \\
S'_1 &= G(S_3 \oplus r \oplus S_2) \oplus (S_0 \oplus S_1) \\
S'_0 &= S_3 \oplus r
\end{aligned}
\tag{7}
$$

Algorithm 1: Typical one step algorithm.

Step($S$)
(i) For $k = 0$ to $i - 1$:
    (a) $S_{4k+3} = S_{4k+3} \oplus r$
    (b) $S_{4k} = S_{4k} \oplus S_{4k+1}$
    (c) $S_{4k+2} = S_{4k+2} \oplus S_{4k+3}$
    (d) $S_{4k} = S_{4k} \oplus G(S_{4k+2})$
    (e) $S_{4k+2} = S_{4k+2} \oplus (S_{4k} \lll rot_k)$
(ii) Temp $= S_{4i-1}$
(iii) For $k = 4i - 1$ to 1: $S_k = S_{k-1}$
(iv) $S_0 =$ Temp

Algorithm 2: SHAT-($128 \cdot i$).

SHAT-($128 \cdot i$)($M$)
Inputs: $n$ padded message blocks $M = (M_0, M_1, \ldots, M_{n-1})$
Outputs: ($128 \cdot i$)-bit hash value $(H_0, H_1, H_2, H_3)$
(1) $S = (S_0, \ldots, S_{4i-1}) = (0, 0, \ldots, 0, 128 \cdot i)$ (initialization)
(2) Perm($S$)
(3) For $j = 0$ to $n - 1$ (absorbing phase):
    (i) For $k = 0$ to $i - 1$: $S_{4k+3} = S_{4k+3} \oplus M_{j,k}$
    (ii) Perm($S$)
(4) $H_0 =$ SQUEEZE($S$, $i$) (squeezing phase)
(5) For $k = 1$ to 3:
    (i) Perm($S$)
    (ii) $H_k =$ SQUEEZE($S$, $i$)
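To make Algorithms 1 and 2 concrete, the following Python sketch models them in software. It is a minimal sketch under stated assumptions, not a reference implementation: the full definition of $G$ lies outside this section, so `g` is assumed to apply the 4-bit S-box of Table 2 to each nibble of a 32-bit word; the round value $r$ is assumed to be the round counter 0 to 47; and a message block is assumed to be a list of $i$ 32-bit words. The helper `round_eq7` restates the one-round expression for $i = 1$ so that it can be compared with `step`.

```python
MASK = 0xFFFFFFFF
SBOX = [0x1, 0x2, 0x4, 0xB, 0xD, 0xE, 0xA, 0x5,
        0xF, 0x8, 0x9, 0x7, 0x6, 0x3, 0x0, 0xC]
ROT = (19, 1, 14)  # rot_0, rot_1, rot_2


def rotl(x, n):
    """32-bit left circular rotation."""
    return ((x << n) | (x >> (32 - n))) & MASK


def g(x):
    """ASSUMPTION: G applies the Table 2 S-box to each of the 8 nibbles."""
    out = 0
    for nib in range(8):
        out |= SBOX[(x >> (4 * nib)) & 0xF] << (4 * nib)
    return out


def step(s, r, i=1):
    """Algorithm 1 on a state of 4*i 32-bit words; returns the new state."""
    s = list(s)
    for k in range(i):
        s[4 * k + 3] ^= r
        s[4 * k] ^= s[4 * k + 1]
        s[4 * k + 2] ^= s[4 * k + 3]
        s[4 * k] ^= g(s[4 * k + 2])
        s[4 * k + 2] ^= rotl(s[4 * k], ROT[k])
    return [s[-1]] + s[:-1]  # S_k = S_{k-1}, S_0 = old S_{4i-1}


def round_eq7(s, r):
    """One iteration round written in functional form (case i = 1)."""
    s0, s1, s2, s3 = s
    n0 = s3 ^ r
    n1 = g(s3 ^ r ^ s2) ^ (s0 ^ s1)
    n3 = rotl(n1, ROT[0]) ^ (n0 ^ s2)
    return [n0, n1, s1, n3]


def perm(s, i=1):
    """48 STEPs; ASSUMPTION: r is the round counter 0..47."""
    for r in range(48):
        s = step(s, r, i)
    return s


def squeeze(s, i):
    return [s[4 * k + 3] for k in range(i)]


def shat(blocks, i=1):
    """Algorithm 2; blocks is a list of padded blocks of i 32-bit words."""
    s = perm([0] * (4 * i - 1) + [128 * i], i)  # initialization
    for block in blocks:                        # absorbing phase
        for k in range(i):
            s[4 * k + 3] ^= block[k]
        s = perm(s, i)
    digest = squeeze(s, i)                      # squeezing phase
    for _ in range(3):
        s = perm(s, i)
        digest += squeeze(s, i)
    return digest
```

With $i = 1$ the function returns four 32-bit words, that is, a 128-bit digest. Because `g` and the round constants are assumptions, the digests produced by this sketch should not be expected to match the hardware.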

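Figure 5: A typical SHAT core. (Block diagram: input data enter the padding unit; padded data, $n \times 32$ bits, are stored in RAM; the SHAT block operates on 32-bit wide registers, $4 \times 32$ bits, under the control unit; the message digest extraction block outputs the 128-bit message digest.)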

Here $S_i$ ($i = 0, 1, 2, 3$) is the input of the current round and $S'_i$ ($i = 0, 1, 2, 3$) is the output of this round (or the input of the next round). In order to distribute the 48 operations equally over the rounds, the possible unfolding factors are the divisors of 48, that is, 1, 2, 3, 4, 6, 8, 12, 16, 24, and 48. For example, if we unfold two STEP operations in each round, we get 24 rounds in one permutation process. The throughput is given by

$$
\text{Throughput} = \frac{(\#\text{ of bits}) \cdot f_{\text{round}}}{\#\text{ of rounds}}. \tag{8}
$$

Figure 6: Typical architecture of one STEP round. (Datapath from inputs $S_0$, $S_1$, $S_2$, $S_3$ and round counter $r_i$ through the $g$ function, the XOR gates, and the ROT block to the outputs $S_0$, $S_1$, $S_2$, $S_3$.)
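As a quick numerical illustration of (8), the short Python sketch below evaluates the throughput for a 32-bit block. The function name `throughput_mbps` is hypothetical, and the frequency is taken in MHz so that the result is in Mbps.

```python
def throughput_mbps(block_bits: int, f_mhz: float, rounds: int) -> float:
    """Throughput = (# of bits) * f_round / (# of rounds)."""
    return block_bits * f_mhz / rounds
```

At 10 MHz, 48 rounds give about 6.67 Mbps and 24 rounds give about 13.33 Mbps, provided that the unfolded circuit can still run at the same clock frequency.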

Considering (7), although this unfolding transformation reduces the maximum operating frequency, the throughput is increased significantly because the number of operations is reduced from 48 to 24. The mathematical expression of one iteration round is replaced by

$$
\begin{aligned}
\text{temp}_3 &= \text{ROT}(\text{temp}_1) \oplus (\text{temp}_0 \oplus S_2) \\
\text{temp}_2 &= S_1 \\
\text{temp}_1 &= G(S_3 \oplus r \oplus S_2) \oplus (S_0 \oplus S_1) \\
\text{temp}_0 &= S_3 \oplus r \\
S'_3 &= \text{ROT}(S'_1) \oplus (S'_0 \oplus \text{temp}_2) \\
S'_2 &= \text{temp}_1 \\
S'_1 &= G(\text{temp}_3 \oplus r \oplus \text{temp}_2) \oplus (\text{temp}_0 \oplus \text{temp}_1) \\
S'_0 &= \text{temp}_3 \oplus r
\end{aligned}
\tag{9}
$$
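The unfolded round in (9) should compute exactly what two consecutive single rounds compute. The Python sketch below checks this equivalence in software. It is only an illustration: `g` is a stand-in 32-bit function rather than the real $G$, the rotation amount 19 is $rot_0$, and the two round values are passed separately as `r_a` and `r_b` (Figure 7 labels them $r_i$ and $r_{i+1}$).

```python
MASK = 0xFFFFFFFF


def rotl(x, n=19):
    """32-bit left circular rotation (rot_0 = 19)."""
    return ((x << n) | (x >> (32 - n))) & MASK


def g(x):
    """Stand-in for G (hypothetical); the equivalence holds for any g."""
    return (x * 5 + 1) & MASK


def one_step(s, r):
    """Single iteration round in functional form."""
    s0, s1, s2, s3 = s
    n0 = s3 ^ r
    n1 = g(s3 ^ r ^ s2) ^ (s0 ^ s1)
    n2 = s1
    n3 = rotl(n1) ^ (n0 ^ s2)
    return (n0, n1, n2, n3)


def two_steps(s, r_a, r_b):
    """Two STEPs unfolded into one round, using intermediate temp values."""
    s0, s1, s2, s3 = s
    temp0 = s3 ^ r_a
    temp1 = g(s3 ^ r_a ^ s2) ^ (s0 ^ s1)
    temp2 = s1
    temp3 = rotl(temp1) ^ (temp0 ^ s2)
    n0 = temp3 ^ r_b
    n1 = g(temp3 ^ r_b ^ temp2) ^ (temp0 ^ temp1)
    n2 = temp1
    n3 = rotl(n1) ^ (n0 ^ temp2)
    return (n0, n1, n2, n3)
```

In hardware the benefit comes from evaluating both halves within one clock cycle, so that 24 cycles are needed instead of 48.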

3.3.2. Pipeline and Parallelism. Suppose that we unroll two STEP operations in each round. This reduces the maximum frequency while increasing the throughput; however, increased area is introduced as a penalty. If some logic can be computed in parallel, and this parallelism lies on the critical path, then the delay of each round can be decreased, so that the operating frequency can be increased. According to (8), when the number of operations is kept constant (and the number of bits is also kept constant), the throughput increases with the frequency. This method can be used in the hardware implementation of any other hash function.

For example, Figure 7 shows the architecture with two STEP operations unfolded in one round that has the minimum critical path delay. The critical path is composed of seven XOR gates and two $G$ functions. By unfolding two STEPs in one round, the critical path grows by three 32-bit XOR gates and one $G$ function compared with the architecture of one STEP block. The critical path is highlighted by a bold line.

In Figure 7, the cycle counter $r_{i+1}$ can be combined with $\text{temp}_2$ first and then XORed with $\text{temp}_3$ in the second STEP part. Compared with the first STEP part, where $r_i$ is XORed with $S_3$ and then with $S_2$, there is an additional component that combines $\text{temp}_3$ and $r_{i+1}$. Because this output must be generated, the area penalty cannot be avoided.

Thus, when we increase the number of unfolded STEP operations (for example, to three, four, or five), the delay of each round increases by three 32-bit XOR gates and one $G$ function per additional STEP. Therefore the normalized delay with unfolding factor $n$ ($n = 1, 2, 3, \ldots$) is

$$
\text{Delay}_n = \frac{4 \cdot \text{Delay}(\oplus) + \text{Delay}(g) + (n - 1) \cdot (3 \cdot \text{Delay}(\oplus) + \text{Delay}(g))}{n}. \tag{10}
$$

Taking the limit of (10) as $n$ grows gives

$$
\lim_{n \to \infty} \text{Delay}_n = 3 \cdot \text{Delay}(\oplus) + \text{Delay}(g). \tag{11}
$$

This is the delay bound of SHAT, which means that the normalized delay of one SHAT operation round cannot be less than this bound.
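The behaviour of the normalized delay can be illustrated numerically. In the Python sketch below, the function name `normalized_delay` and the gate delays passed to it are hypothetical placeholders, not measured values.

```python
def normalized_delay(n: int, d_xor: float, d_g: float) -> float:
    """Per-STEP delay when n STEPs are unfolded into one round:
    (4*Dxor + Dg + (n - 1)*(3*Dxor + Dg)) / n."""
    return (4 * d_xor + d_g + (n - 1) * (3 * d_xor + d_g)) / n
```

With `d_xor = 1.0` and `d_g = 2.0` (arbitrary units), the value is 6.0 for `n = 1` and 5.5 for `n = 2`, and it approaches the bound of 5.0 as `n` grows.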

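Figure 7: Proposed architecture of two STEPs round. (Datapath from inputs $S_0$, $S_1$, $S_2$, $S_3$ through two $g$ functions, two ROT blocks, and XOR gates to the outputs $S_0$, $S_1$, $S_2$, $S_3$, with round counters $r_i$ and $r_{i+1}$ and intermediate values $\text{temp}_0$ to $\text{temp}_3$.)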

3.4. Experimental Results. We use the measure of hardware efficiency in (12) [14], which improves on the usual figure of merit (FOM). We assume that the power is proportional to the gate count; the metric is therefore divided by the gate count (GE) a second time, in place of the power dissipation, when we want to trade off throughput for power. Note that one gate equivalent (GE) is equal to the area of a two-input NAND gate in 45 nm CMOS technology.

$$
\text{FOM} = \frac{\text{Throughput}}{\text{GE}^2} \tag{12}
$$
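The metric in (12) can be evaluated with a few lines of Python. The scaling used below is an assumption: throughput is expressed in bits per clock cycle and the result is multiplied by $10^9$, which appears to reproduce the FOM column of Table 3 but is not stated in this section. The function name `fom` is hypothetical.

```python
def fom(throughput_kbps: float, area_ge: float, f_khz: float = 100.0) -> float:
    """Throughput / GE^2, ASSUMED to be scaled to nano-bits per cycle per GE^2."""
    bits_per_cycle = throughput_kbps / f_khz
    return bits_per_cycle / area_ge ** 2 * 1e9
```

For the SHAT-128 values in Table 3 (66.67 kbps at 100 kHz and 1605 GE) this gives approximately 258.8.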

Table 3 shows the hardware implementation results of some 128-bit hash functions, using a 100 kHz clock frequency and 45 nm CMOS technology. Firstly, the throughput of SHAT-128 (66.67 kbps) is less than that of 5 other hash algorithms: MD4 (112.28 kbps), MD5 (83.66 kbps), H-Present-128-32-round (200 kbps), and ARMADILLO2-B (250 kbps and 1000 kbps). However, the area of SHAT-128 is only 28.42% of the area of these hash functions on average. As a result, the hardware efficiency of SHAT-128 is 13.12 times higher on average. Secondly, the area of SHAT-128 (1605 GE) is larger than that of 3 hash algorithms, namely, U-QUARK-544-round (1379 GE), PHOTON-128-996-round (1122 GE), and SPONGENT-128-8-bit-2380-round (1060 GE); however, the throughput of SHAT-128 is 94.27 times higher. Thus the FOM of SHAT-128 is 46.75 times higher on average. Finally, the area of SHAT-128 (1605 GE) is less than that of 4 other hash algorithms, namely, H-Present-128-559-round (2330 GE), U-QUARK-68-round (2392 GE), PHOTON-128-156-round (1708 GE), and SPONGENT-128-70-round (1687 GE), and the throughput of SHAT-128 is also 5.95 times higher than that of these 4 hash algorithms on average. As a result, the FOM of SHAT-128 is 9.66 times higher on average.

In Table 4, firstly, the throughput of SHAT-256 is 51.05% of that of Grostl; however, the area of SHAT-256 is only 21.84% of that of Grostl, which results in 84.47 times higher hardware efficiency for SHAT-256. Secondly, the throughput of SHAT-256 (3193 GE) is 412.91 times higher on average than that of 2 hash algorithms, PHOTON-256-156-round (2177 GE) and SPONGENT-256-9520-round (1950 GE); although the area of SHAT-256 is larger, the FOM of SHAT-256 is still 158.25 times higher than that of these 2 hash algorithms. Thirdly, compared with SHA-256, ARMADILLO2-E, BLAKE, PHOTON-256-156-round, and SPONGENT-256-140-round, the throughput of SHAT-256 is 4.65 times higher on average and the area of SHAT-256 is only 49.15% of that of these hash algorithms on average. Therefore, the FOM of SHAT-256 is 119.14 times higher on average.

Table 3: Hardware implementation results of some 128-bit hash functions.

| Hash function | Block size (bits) | Number of operations | Throughput at 100 kHz (kbps) | Area (GE) | FOM |
|---|---|---|---|---|---|
| SHAT-128 | 32 | 48 | 66.67 | 1605 | 258.80 |
| H-Present-128 [15] | 128 | 559 | 11.45 | 2330 | 21.09 |
| H-Present-128 [15] | 128 | 32 | 200 | 4256 | 110.41 |
| MD4 [15] | 512 | 456 | 112.28 | 7350 | 20.78 |
| MD5 [15] | 512 | 612 | 83.66 | 8400 | 11.86 |
| ARMADILLO2-B [15] | 64 | 256 | 250 | 4353 | 13.19 |
| ARMADILLO2-B [15] | 64 | 64 | 1000 | 6025 | 27.55 |
| U-QUARK [15] | 8 | 544 | 1.47 | 1379 | 7.73 |
| U-QUARK [15] | 8 | 68 | 11.76 | 2392 | 20.56 |
| PHOTON-128 [15] | 16 | 996 | 1.61 | 1122 | 12.78 |
| PHOTON-128 [15] | 16 | 156 | 10.26 | 1708 | 35.15 |
| SPONGENT-128 [15] | 8 | 2380 | 0.34 | 1060 | 2.99 |
| SPONGENT-128 [15] | 16 | 70 | 11.43 | 1687 | 40.16 |

Table 4: Hardware implementation results of some 256-bit hash functions.

| Hash function | Block size (bits) | Number of operations | Throughput at 100 kHz (kbps) | Area (GE) | FOM |
|---|---|---|---|---|---|
| SHAT-256 | 64 | 48 | 133.33 | 3193 | 130.78 |
| SHA-256 [14] | 512 | 490 | 104.48 | 8588 | 14.17 |
| ARMADILLO2-E [14] | 128 | 512 | 25 | 8653 | 3.34 |
| ARMADILLO2-E [14] | 128 | 128 | 100 | 11914 | 7.05 |
| BLAKE [14] | 32 | 816 | 72.79 | 13575 | 0.21 |
| Grostl [14] | 64 | 196 | 261.14 | 14622 | 1.53 |
| PHOTON-256 [14] | 32 | 156 | 3.21 | 2177 | 6.78 |
| PHOTON-256 [14] | 32 | 156 | 20.51 | 4362 | 10.17 |
| SPONGENT-256 [14] | 16 | 9520 | 0.17 | 1950 | 0.44 |
| SPONGENT-256 [14] | 16 | 140 | 11.43 | 3281 | 10.62 |

Table 5: Hardware implementation results of some 384-bit hash functions.

| Hash function | Block size (bits) | Number of operations | Throughput at 100 kHz (kbps) | Area (GE) | FOM |
|---|---|---|---|---|---|
| SHAT-384 | 96 | 48 | 200 | 4753 | 88.53 |
| SHA-384 [14] | 1024 | 84 | 1219.04 | 43330 | 6.49 |

Table 6: Performance results of hash function using pipeline and parallelism.

| Number of iteration rounds | Area (GE) | Area increase (%) | Delay (ns) | Delay reduction (%) | Power (μW) | Power increase (%) |
|---|---|---|---|---|---|---|
| 48 | 965 | 0.00 | 0.94 | 0.00 | 27.27 | 0.00 |
| 24 | 2010 | 4.15 | 1.87 | 2.60 | 79.05 | 0.91 |
| 16 | 3055 | 5.53 | 2.81 | 3.44 | 136.42 | 3.52 |
| 12 | 4100 | 6.22 | 3.74 | 4.35 | 193.71 | 4.67 |
| 8 | 6190 | 6.91 | 5.61 | 4.92 | 308.48 | 5.73 |
| 6 | 8280 | 7.25 | 7.47 | 5.32 | 423.18 | 6.21 |
| 4 | 12460 | 7.60 | 11.20 | 5.64 | 652.62 | 6.68 |
| 3 | 16640 | 7.77 | 14.93 | 5.74 | 882.00 | 6.90 |
| 2 | 25000 | 7.94 | 22.40 | 5.88 | 1340.80 | 7.12 |
| 1 | 50080 | 8.12 | 44.70 | 6.31 | 2695.40 | 7.40 |

Table 7: Performance results of unrolling steps constructions.

| Number of iteration rounds | Area (GE) | Delay (ns) | Power (μW) | Throughput at 10 MHz (Mbps) |
|---|---|---|---|---|
| 48 | 965 | 0.94 | 27.27 | 6.67 |
| 24 | 1930 | 1.92 | 78.34 | 13.33 |
| 16 | 2895 | 2.91 | 131.78 | 20.00 |
| 12 | 3860 | 3.91 | 185.06 | 26.67 |
| 8 | 5790 | 5.90 | 291.77 | 40.00 |
| 6 | 7720 | 7.89 | 398.42 | 53.33 |
| 4 | 11580 | 11.87 | 611.78 | 80.00 |
| 3 | 15440 | 15.84 | 825.06 | 106.67 |
| 2 | 23160 | 23.80 | 1251.70 | 160.00 |
| 1 | 46320 | 47.71 | 2509.70 | 320.00 |

In Table 5, the throughput of SHA-384 is 6.09 times higher than that of SHAT-384; however, the area of SHA-384 is 9.11 times larger. As a result, the hardware efficiency of SHAT-384 is 13.64 times higher than that of SHA-384.

Then we implement the unfolding transformation technique with 10 different numbers of unrolling loops (1, 2, ..., 48), using 45 nm CMOS technology at 10 MHz, to evaluate the performance of SHAT-128; the results are shown in Table 7. As Table 7 shows, the throughput of the Perm function can be up to 47.97 times higher than the original one, which is 6.67 Mbps. However, area, delay, and power increase dramatically as a penalty.

Finally, we implement the pipeline and parallelism technique to reconstruct the STEP block. As shown in Table 6, compared with the performance of the original circuit, the critical path delay is reduced by up to 6.31%, while the power and area increase by about 8%.

4. Low Power Design for Hash Function

Low power design is a significant consideration in hardware implementation. The power consumption determines a device's lifetime, reliability, and energy cost. Thus low power techniques are now applied in almost every application. There are many methods to reduce power consumption, such as clock gating and power gating, which target dynamic power and leakage power, respectively. Decreasing the frequency also reduces the power dissipation dramatically.

Firstly, we propose the frequency trade-off technique. With this method we obtain a range of frequency values for trading off low power consumption against high throughput of the hash function. Secondly, we construct a hash encryption system which includes an input data padding unit, RAM registers, the main hash computing construction, a message digest extraction component, and a main control unit. Thirdly, by analyzing the idle mode and control signals of this hash encryption system, a load-enable based clock gating scheme is applied to reduce the dynamic power consumption.

4.1. Frequency Trade-Off Technique. According to (1), reducing the clock frequency is an effective method to decrease dynamic power dissipation linearly. In Section 2.2 we discussed the DVFS technique. By collecting information about workload and temperature, DVFS determines a clock frequency that is sufficient for the required performance. However, modifying the clock frequency at RTL is not easy; normally we treat the clock frequency as constant. Moreover, dynamic frequency scaling reduces the number of operations a system can issue in a given amount of time, thus reducing performance. Therefore there is an issue we need to consider: a high clock frequency brings high throughput, but the dramatically increased dynamic power consumption is a critical drawback; a low clock frequency minimizes the dynamic power dissipation, but it decreases the throughput as well.

However, with the unfolding transformation technique introduced in Section 3.3, the maximum frequency of the Perm function decreases while the number of unrolling loops increases. This means that we can decrease the clock frequency while increasing the throughput of the hash algorithm. Thus the unrolling transformation technique achieves high performance without a high clock frequency. Thanks to this advantage, by choosing a proper clock frequency we can make a trade-off between high performance and low power consumption.

Next we explain how to obtain this range of frequency values from the two performance bounds. First we obtain two values for the rolling Perm circuit: the dynamic power consumption $P_1$ and the clock frequency $f_1$, which is constrained by the circuit design (the clock period computed from $f_1$ must be not less than the critical path delay). Then, according to (8), we can get the throughput $T_1$ at this frequency. The two performance bounds are defined in (13), where $n$ is the number of iteration rounds in one Perm function with rolling STEPs.

$$
\begin{aligned}
P_{\max} &= P_1 \cdot n \\
T_{\min} &= T_1
\end{aligned}
\tag{13}
$$

This method can be stated as follows. With reference to the performance of the original folded circuit (we assume that this is the circuit with 48 iteration rounds in one Perm function), each unfolding transformation design with a different number of unrolled STEPs (2, 3, ..., 48) has two performance bounds: one is the maximum dynamic power and the other is the minimum throughput of the circuit. These two performance bounds determine the boundary of the proper frequency range for each unfolded circuit. This means that when we choose a specific clock frequency within this range, the total dynamic power consumption of the Perm function will be not more than the defined maximum dynamic power $P_{\max}$, and its throughput will be not less than the fixed minimum throughput $T_{\min}$.

Figure 8: Hash encryption system. (Block diagram: the input message enters the receiver, which is clocked through the clock divider and an AND gate; padded data, $n \times br$ bits, are stored in RAM; the hash process produces the output digest; the LCD displayer shows the digest; all blocks are coordinated by the main control unit.)

This clock frequency range gives us many different choices for different circuit designs that use the unfolding transformation technique. The results of this frequency trade-off technique are shown in Table 9 in Section 4.4.
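The frequency range for a given unfolded design can be computed from the two bounds. The Python sketch below assumes that dynamic power scales linearly with clock frequency and that the power of a complete Perm function is the per-round power multiplied by the number of rounds; the function name `frequency_range` and its parameters are hypothetical, and the default values are the 48-round figures from Table 7 (27.27 μW at 10 MHz).

```python
def frequency_range(rounds: int, p_round_uw: float,
                    f_ref_mhz: float = 10.0, ref_rounds: int = 48,
                    p_ref_uw: float = 27.27):
    """Return (f_min, f_max) in MHz for a design with `rounds` rounds per Perm.

    p_round_uw is the power of the unfolded design measured at f_ref_mhz.
    """
    # Lower bound: throughput must not fall below T_min, so the rate
    # f / rounds must not fall below f_ref / ref_rounds.
    f_min = f_ref_mhz * rounds / ref_rounds
    # Upper bound: rounds * p_round * (f / f_ref) must not exceed P_max.
    p_max = p_ref_uw * ref_rounds
    f_max = f_ref_mhz * p_max / (rounds * p_round_uw)
    return f_min, f_max
```

For the 24-round design with 78.34 μW from Table 7, this yields a range of roughly 5.00 MHz to 6.96 MHz, which agrees with the corresponding row of Table 9.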

4.2. Hash Encryption System Design. The hash encryption system is divided into 5 main parts, as shown in Figure 8.

Firstly, the receiver and RAM section is actually our padding unit. We use serial communication to connect the PC and the hash encryption system. Thus, we need a clock divider to generate a clock that is synchronous with the Baud rate of the serial communication. We choose 4800 Bauds as our transmission Baud rate, which is deliberately slow in order to keep the error rate low (less than 3%). In this case, one Baud represents 1 bit. Each transmitted frame consists of one start bit "0", then an 8-bit message, and one finish bit "1". The start bit and finish bit are added to the transmitted message bits automatically. The sampling rate of the receiver is 16, and the FPGA board provides a 100 MHz clock. Thus, the clock period used in sampling is 1302 times the period of the provided 100 MHz clock, as shown in (14). The resulting error is 0.0064%, which is less than 3%.

$$\text{Sampling Clock Cycles} = \frac{\text{Clock Frequency}}{\text{Baud rate} \cdot \text{Sampling Rate}} = \frac{100\,\text{MHz}}{16 \times 4800\,\text{B/s}} \approx 1302. \tag{14}$$
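The arithmetic in (14) and the quoted rounding error can be reproduced with a few lines of Python. This is only a numerical check of the stated figures; the function name is hypothetical, and rounding to the nearest integer is our assumption about how the divider value was chosen.

```python
def sampling_divider(clock_hz, baud_rate, oversampling):
    """Return the integer clock divider from (14) and its rounding error in percent."""
    exact = clock_hz / (baud_rate * oversampling)
    divider = round(exact)  # assumption: nearest-integer divider
    error_percent = abs(exact - divider) / exact * 100.0
    return divider, error_percent


divider, error_percent = sampling_divider(100e6, 4800, 16)
print(divider, f"{error_percent:.4f}%")
```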

Because the liquid crystal display (LCD) limits the number of characters we can display to 32 hexadecimal characters, this number matches the number of digest bits of SHAT-128. Thus, $b_r$ for each padded block is set to 32 bits, which consist of eight 4-bit hexadecimal digits.

Secondly, the hash function, which we introduced in Section 3, is designed as a sponge construction, as shown in Figure 4. After absorbing $n$ 32-bit message blocks, a 128-bit digest is squeezed out.

Finally, the main control unit manages the working order between the receiver, the hash process, and the LCD display. Figure 9 shows the pipelined operation of the system.

Because we use serial communication, the transfer is slow. We use 4800 Bauds as our Baud rate for a low error rate; thus, each 32-bit block needs roughly 7 ms. For example, if seven 32-bit blocks need to be transmitted, roughly 50 ms is spent on data receiving and padding. The hash function used in this system performs one STEP per round, which means that there are 48 iteration rounds for a complete Perm function; even so, hash processing needs only roughly 6 μs. LCD displaying also takes considerable time: even though we can finish LCD initialization before we get the hash digest, we still need roughly 15 ms to display all the data.

4.3. Load-Enable Based Clock Gating. In this section, we introduce the load-enable based clock gating technique for the hash encryption system.

Clock gating is the most widely used low power technique at RTL. At RTL, it is more practical to control the toggle rate of gate outputs than any of the other three components, namely $V_{DD}$, clock frequency, and gate output capacitance. According to Figure 9, the hash encryption system has a pipelined structure. The finishing signal of each process can be treated as the enable signal in load-enable based clock gating, as shown in Figure 3. On the other hand, the XOR-based clock gating technique needs to specify the outputs of single-level flip-flops, which are not easily determined in our encryption system; thus, load-enable based clock gating is our best option for a low power method.

As shown in Figure 10, three signal pairs realize this load-enable based clock gating: en_div and fsh_r, en_h and fsh_h, and en_lcd and fsh_lcd. Because the receiver runs at a specific clock frequency that corresponds to the serial communication, the main control unit does not gate the clock signal of the receiver directly; instead, by controlling the clock signal of the clock divider with en_div, the receiver can be properly managed.

Figure 9 gives the three operation phases of the encryption system. In the first phase, en_div and en_lcd are asserted to logic one and en_h is set to logic zero; thus, the receiver starts receiving input messages and padding them into the RAM. In the meantime, the system begins the initialization process for the LCD displayer, while the hash processing unit waits for the padded input message. Because the serial communication takes a long time, owing to the low Baud rate and its bit-by-bit transmission, LCD displayer initialization can be finished before the padded message is ready. Thus, en_lcd can be set to logic zero by the main control unit when fsh_lcd switches to logic one.

Figure 9: Three phases of hash encryption system. (Phase one: the receiver performs data receiving and padding, the hash unit is idle, and the LCD is initialized and then idle. Phase two: the receiver is idle, the hash unit performs hash processing, and the LCD is idle. Phase three: the receiver and hash unit are idle, and the LCD performs displaying.)

Figure 10: Control signals of hash encryption system. (Blocks as in Figure 8, exchanging the padded message and the digest, with control signals en_div, fsh_r, en_h, fsh_h, en_lcd, and fsh_lcd and the receiver clock clk_r.)

During the second phase, the padded message is ready, so fsh_r switches to logic one. Then en_div is set to zero, which means that the clock divider is turned off; no receiver clock is produced, and thus the receiver stops working. In this phase, en_h is asserted to logic one for hash encryption, which is our core function, while en_lcd remains zero, waiting for the hash digest generated by hash processing.

The system enters the third phase when the fsh_h signal switches to logic one. In this phase, the hash digest is ready; thus, both the receiver and the hash process are in idle mode, which means that en_div and en_h are both set to logic zero. Signal en_lcd is asserted to logic one to start LCD displaying and is set back to zero when the displaying process is finished. This is the end of the whole operation; the device is then turned off or repeats these three phases for another input message.
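The three phases described above can be summarized as a small behavioral model. The Python sketch below is not RTL and is not taken from the authors' design; it only mirrors the enable and finish signals named in the text. The class and attribute names are hypothetical, and the model assumes, as the text does, that LCD initialization finishes before the padded message is ready.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Enables:
    en_div: int  # enables the clock divider (and thereby the receiver)
    en_h: int    # enables the hash process
    en_lcd: int  # enables the LCD displayer


class MainControl:
    """Behavioral model of the three-phase enable sequencing (illustrative only)."""

    def __init__(self):
        self.phase = 1
        self.lcd_initialized = False
        self.finished = False

    def enables(self):
        if self.finished:
            return Enables(0, 0, 0)
        if self.phase == 1:
            # Receiver runs; the LCD is enabled only until its initialization ends.
            return Enables(1, 0, 0 if self.lcd_initialized else 1)
        if self.phase == 2:
            return Enables(0, 1, 0)
        return Enables(0, 0, 1)

    def step(self, fsh_r=0, fsh_h=0, fsh_lcd=0):
        """Apply the finish signals seen in this step and return the new enables."""
        if self.finished:
            pass
        elif self.phase == 1:
            if fsh_lcd:
                self.lcd_initialized = True
            if fsh_r:
                self.phase = 2
        elif self.phase == 2:
            if fsh_h:
                self.phase = 3
        elif fsh_lcd:
            self.finished = True
        return self.enables()
```

Each component is clocked only while its enable is one, which is the idle time that the clock gating scheme exploits.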

By analyzing the structure and operation of the hash encryption system, we can determine the idle time of each component. Then, by applying load-enable based clock gating to each component, the dynamic power dissipation of this system can be reduced, as shown in Table 8 in Section 4.4.

4.4. Experimental Results. Using a 10 MHz clock frequency and 45 nm CMOS technology, the results of the frequency trade-off technique are shown in Tables 9, 10, and 11. Table 9 shows that the area and critical path delay are unchanged compared with the unfolding transformation technique. Tables 10 and 11 give the variation of dynamic power consumption and throughput under the frequency trade-off method. Here $f_i$ stands for frequency, $T_i$ stands for throughput, and $T_{i,\mathrm{pct}}$ is the percentage increase over the minimum throughput ($T_{\min}$), which is 6.67 Mbps. $P_i$ is the total dynamic power consumed in finishing a complete Perm function, and $P_{i,\mathrm{pct}}$ is the percentage power reduction relative to the maximum power consumption ($P_{\max}$), defined as 1308.96 μW, which is the product of 48 (the number of iteration rounds) and 27.27 μW (as shown in Table 7). The subscript $i$ stands for the number of iteration rounds.

Table 8: Hardware implementation with/without load-enable based clock gating.

| System type | Area (GE) | Area increase (%) | Delay (ns) | Delay increase (%) | Power (μW) | Power reduction (%) |
|---|---|---|---|---|---|---|
| Original | 14053 | n/a | 1.63 | n/a | 1830.20 | n/a |
| Clock gated | 14565 | 3.64 | 1.72 | 5.52 | 1580.36 | 13.65 |

Table 9: Area and delay performances of frequency trade-off technique.

| Number of iteration rounds | Area (GE) | Delay (ns) | Frequency (MHz) |
|---|---|---|---|
| 48 | 965 | 0.94 | 10.00 |
| 24 | 1930 | 1.92 | $5.00 < f_{24} < 6.96$ |
| 16 | 2895 | 2.91 | $3.33 < f_{16} < 6.20$ |
| 12 | 3860 | 3.91 | $2.50 < f_{12} < 5.89$ |
| 8 | 5790 | 5.90 | $1.67 < f_{8} < 5.60$ |
| 6 | 7720 | 7.89 | $1.25 < f_{6} < 5.47$ |
| 4 | 11580 | 11.87 | $0.83 < f_{4} < 5.34$ |
| 3 | 15440 | 15.84 | $0.63 < f_{3} < 5.28$ |
| 2 | 23160 | 23.80 | $0.42 < f_{2} < 5.22$ |
| 1 | 46320 | 47.71 | $0.21 < f_{1} < 5.21$ |

Then we apply the load-enable based clock gating scheme to the hash encryption system, using the 100 MHz clock frequency provided on the FPGA board and 45 nm CMOS technology. As shown in Table 8, the dynamic power decreases by 13.65%, at the cost of a 3.64% increase in area and a 5.52% increase in critical path delay.


Table 10: Dynamic power consumption of frequency trade-off technique. In each row, the left-hand bound corresponds to the lower frequency bound and the right-hand bound to the upper frequency bound.

| Number of iteration rounds | Power (μW) | Reduction (%) | Frequency (MHz) |
|---|---|---|---|
| 48 | 1308.96 | n/a | 10.00 |
| 24 | $940.08 < P_{24} < 1308.48$ | $28.18 > P_{24,\mathrm{pct}} > 0.04$ | $5.00 < f_{24} < 6.96$ |
| 16 | $702.88 < P_{16} < 1307.20$ | $46.30 > P_{16,\mathrm{pct}} > 0.13$ | $3.33 < f_{16} < 6.20$ |
| 12 | $555.12 < P_{12} < 1308.00$ | $57.59 > P_{12,\mathrm{pct}} > 0.07$ | $2.50 < f_{12} < 5.89$ |
| 8 | $389.04 < P_{8} < 1307.12$ | $70.28 > P_{8,\mathrm{pct}} > 0.14$ | $1.67 < f_{8} < 5.60$ |
| 6 | $298.80 < P_{6} < 1307.64$ | $77.17 > P_{6,\mathrm{pct}} > 0.10$ | $1.25 < f_{6} < 5.47$ |
| 4 | $203.92 < P_{4} < 1306.76$ | $84.42 > P_{4,\mathrm{pct}} > 0.17$ | $0.83 < f_{4} < 5.34$ |
| 3 | $154.71 < P_{3} < 1306.89$ | $88.18 > P_{3,\mathrm{pct}} > 0.16$ | $0.63 < f_{3} < 5.28$ |
| 2 | $104.30 < P_{2} < 1306.76$ | $92.03 > P_{2,\mathrm{pct}} > 0.17$ | $0.42 < f_{2} < 5.22$ |
| 1 | $52.29 < P_{1} < 1307.60$ | $96.01 > P_{1,\mathrm{pct}} > 0.10$ | $0.21 < f_{1} < 5.21$ |

Table 11: Throughput performances of frequency trade-off technique.

| Number of iteration rounds | Throughput (Mbps) | Improvement (%) | Frequency (MHz) |
|---|---|---|---|
| 48 | 6.67 | n/a | 10.00 |
| 24 | $6.67 < T_{24} < 9.28$ | $0.00 < T_{24,\mathrm{pct}} < 39.13$ | $5.00 < f_{24} < 6.96$ |
| 16 | $6.67 < T_{16} < 12.4$ | $0.00 < T_{16,\mathrm{pct}} < 85.91$ | $3.33 < f_{16} < 6.20$ |
| 12 | $6.67 < T_{12} < 15.71$ | $0.00 < T_{12,\mathrm{pct}} < 135.53$ | $2.50 < f_{12} < 5.89$ |
| 8 | $6.67 < T_{8} < 22.40$ | $0.00 < T_{8,\mathrm{pct}} < 235.83$ | $1.67 < f_{8} < 5.60$ |
| 6 | $6.67 < T_{6} < 29.17$ | $0.00 < T_{6,\mathrm{pct}} < 337.33$ | $1.25 < f_{6} < 5.47$ |
| 4 | $6.67 < T_{4} < 42.72$ | $0.00 < T_{4,\mathrm{pct}} < 540.48$ | $0.83 < f_{4} < 5.34$ |
| 3 | $6.67 < T_{3} < 56.32$ | $0.00 < T_{3,\mathrm{pct}} < 744.38$ | $0.63 < f_{3} < 5.28$ |
| 2 | $6.67 < T_{2} < 83.52$ | $0.00 < T_{2,\mathrm{pct}} < 1152.17$ | $0.42 < f_{2} < 5.22$ |
| 1 | $6.67 < T_{1} < 166.72$ | $0.00 < T_{1,\mathrm{pct}} < 2399.55$ | $0.21 < f_{1} < 5.21$ |

5. Conclusion

In order to achieve a high performance and low power hardware implementation of a cryptographic hash function that uses sponge construction, we applied four techniques. Firstly, we used the unfolding transformation technique to improve the throughput of the hash function. Secondly, pipeline and parallelism design techniques were implemented to reduce the critical path delay by modifying the structure of the permutation function. Thirdly, a frequency trade-off technique was proposed to calculate a frequency range that can be used to make a trade-off between low dynamic power consumption and high throughput of the hash function. Finally, a load-enable based clock gating scheme was applied in the hash encryption system to eliminate wasted signal toggling in idle mode.

The experimental results show that the unfolding transformation technique can achieve up to 47.97 times higher throughput, the pipeline and parallelism methods give a 6.31% delay reduction, the load-enable based clock gating scheme decreases dynamic power consumption by 13.65%, and the frequency trade-off technique shows how to choose the clock frequency of the hash function to achieve low power consumption and high throughput.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This research was supported by the MKE (The Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program (NIPA-2012-H0301-12-3007) supervised by the NIPA (National IT Industry Promotion Agency).

References

[1] H. Michail and C. Goutis, "Holistic methodology for designing ultra high-speed SHA-1 hashing cryptographic module in hardware," in Proceedings of the IEEE International Conference on Electron Devices and Solid-State Circuits (EDSSC '08), pp. 1–4, Hong Kong, December 2008.

[2] "Cryptographic hash algorithm competition," NIST Computer Security Resource Center, http://csrc.nist.gov/groups/ST/hash/sha-3/index.html.

[3] B. Schneier, Applied Cryptography: Protocols, Algorithms, and Source Code in C, John Wiley & Sons, New York, NY, USA, 2nd edition, 1996.

[4] J. Nakajima and M. Mitsuru, "Performance analysis and parallel implementation of dedicated hash function," in Proceedings of the International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT '02), vol. 2332, pp. 165–180, Amsterdam, The Netherlands, 2002.

[5] P. C. van Oorschot, A. Somayaji, and G. Wurster, "Hardware-assisted circumvention of self-hashing software tamper resistance," IEEE Transactions on Dependable and Secure Computing, vol. 2, no. 2, pp. 82–92, 2005.

[6] G. Bertoni, J. Daemen, M. Peeters, and G. van Assche, "Cryptographic sponge functions," The Sponge Functions Corner, http://sponge.noekeon.org/index.html.

[7] "Sponge function," WIKIPEDIA, http://en.wikipedia.org/wiki/Sponge_function.

[8] L. Li, Power optimization from register transfer level to transistor level in deeply scaled CMOS technology [Ph.D. thesis], Illinois Institute of Technology, Chicago, Ill, USA, 2012.

[9] N. Weste and D. Harris, CMOS VLSI Design: A Circuits and Systems Perspective, Addison-Wesley, Reading, Mass, USA, 2010.

[10] Y. Zhang, Q. Tong, L. Li et al., "Automatic register transfer level CAD tool design for advanced clock gating and low power schemes," in Proceeding of the International SoC Design Conference (ISOCC '12), pp. 21–24, Jeju Island, Republic of Korea, 2012.

[11] K. Aoki, T. Ichikawa, and M. Kanda, "Specification of Camellia - a 128-bit block cipher," Nippon Telegraph and Telephone Corporation, Mitsubishi Electric Corporation, 2000.

[12] Y. K. Lee, H. Chan, and I. Verbauwhede, "Throughput optimized SHA-1 architecture using unfolding transformation," in Proceedings of the 17th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP '06), pp. 354–359, Steamboat Springs, Colo, USA, September 2006.

[13] H. Michail, A. P. Kakarountas, O. Koufopavlou, and C. E. Goutis, "A low-power and high-throughput implementation of the SHA-1 hash function," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '05), vol. 4, pp. 4086–4089, Kobe, Japan, May 2005.

[14] S. Badel, N. Dagtekin, J. Nakahara Jr. et al., "ARMADILLO: a multi-purpose cryptographic primitive dedicated to hardware," in Cryptographic Hardware and Embedded Systems, CHES 2010, vol. 6225 of Lecture Notes in Computer Science, pp. 398–412, 2010.

[15] K. Lin, Y. Zhang, K. Choi, J. Kang, and S. Hong, "Multi-purpose bimodal cryptographic algorithm and its hardware implementation," in Proceedings of the FTRA International Conference on Advanced IT, Engineering and Management (FTRA AIM '13), Seoul, Korea, 2013.


Step($S$)

(i) For $k = 0$ to $i - 1$:
    (a) $S_{4k+3} = S_{4k+3} \oplus r$
    (b) $S_{4k} = S_{4k} \oplus S_{4k+1}$
    (c) $S_{4k+2} = S_{4k+2} \oplus S_{4k+3}$
    (d) $S_{4k} = S_{4k} \oplus G(S_{4k+2})$
    (e) $S_{4k+2} = S_{4k+2} \oplus (S_{4k} \lll \mathrm{rot}_k)$
(ii) Temp $= S_{4i-1}$
(iii) For $k = 4i - 1$ to $1$: $S_k = S_{k-1}$
(iv) $S_0 =$ Temp

Algorithm 1: Typical one step algorithm.
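Algorithm 1 can be sketched in Python as follows. This is an illustrative model rather than the authors' implementation: the state is a list of 32-bit words, and both the $G$ function and the rotation amounts $\mathrm{rot}_k$ are hypothetical placeholders, since their definitions lie outside this part of the text.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32


def placeholder_g(x):
    """Hypothetical stand-in for G; the real G function is not reproduced here."""
    return (x * 0x9E3779B1 + 1) & MASK32


def step(state, r, g=placeholder_g, rot=None):
    """One STEP of Algorithm 1 on a state of 4*i 32-bit words."""
    s = list(state)
    i = len(s) // 4
    if rot is None:
        rot = [7] * i  # hypothetical rotation amounts
    for k in range(i):
        s[4 * k + 3] ^= r                             # (a)
        s[4 * k] ^= s[4 * k + 1]                      # (b)
        s[4 * k + 2] ^= s[4 * k + 3]                  # (c)
        s[4 * k] ^= g(s[4 * k + 2])                   # (d)
        s[4 * k + 2] ^= rotl32(s[4 * k], rot[k])      # (e)
    # (ii)-(iv): move the last word to the front and shift the others up by one
    return [s[-1]] + s[:-1]
```

For $i = 1$, the returned words are $(S_3 \oplus r,\; G(S_3 \oplus r \oplus S_2) \oplus (S_0 \oplus S_1),\; S_1,\; \mathrm{ROT}(\cdot) \oplus (S_3 \oplus r \oplus S_2))$, which matches the first four lines of (9) when read as $\mathrm{temp}_0$ to $\mathrm{temp}_3$.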

SHAT-$(128 \cdot i)(M)$

Inputs: $n$ padded message blocks $M = (M_0, M_1, \ldots, M_{n-1})$
Outputs: $(128 \cdot i)$-bit hash value $(H_0, H_1, H_2, H_3)$

(1) $S = (S_0, \ldots, S_{4i-1}) = (0, 0, \ldots, 0, 128 \cdot i)$ (initialization)
(2) Perm($S$)
(3) For $j = 0$ to $n - 1$ (absorbing phase):
    (i) For $k = 0$ to $i - 1$: $S_{4k+3} = S_{4k+3} \oplus M_{j,k}$
    (ii) Perm($S$)
(4) $H_0 =$ SQUEEZE($S$, $i$) (squeezing phase)
(5) For $k = 1$ to $3$:
    (i) Perm($S$)
    (ii) $H_k =$ SQUEEZE($S$, $i$)

Algorithm 2: SHAT-$(128 \cdot i)$.

Figure 5: A typical SHAT core. (Blocks: padding unit, RAM of 32-bit wide registers, SHAT, message digest extraction, and control unit; the input data is padded into $n \times 32$ bits, and the message digest is $4 \times 32$ bits, that is, 128 bits.)

Figure 6: Typical architecture of one STEP round. (Inputs $S_0$, $S_1$, $S_2$, $S_3$ and round constant $r_i$; a $g$ block and a ROT block; outputs $S_0$, $S_1$, $S_2$, $S_3$.)

Here $S_i$ ($i = 0, 1, 2, 3$) is the input of the current round and $S'_i$ ($i = 0, 1, 2, 3$) is the output of this round (or the input of the next round). In order to distribute the 48 operations equally over the rounds, the possible values for the unfolding factor are the divisors of 48, that is, 1, 2, 3, 4, 6, 8, 12, 16, 24, and 48. For example, we can unfold two STEP operations in each round; then we get 24 rounds in one permutation process. The expression for throughput is given as

$$\text{Throughput} = \frac{(\#\text{ of bits}) \cdot f_{\text{round}}}{\#\text{ of rounds}}. \tag{8}$$

Considering (7), although this unfolding transformation reduces the maximum operating frequency, the throughput is increased significantly because the number of operations is reduced from 48 to 24. The mathematical expression of one iteration round is replaced by

$$\begin{aligned}
\mathrm{temp}_3 &= \mathrm{ROT}(\mathrm{temp}_1) \oplus (\mathrm{temp}_0 \oplus S_2),\\
\mathrm{temp}_2 &= S_1,\\
\mathrm{temp}_1 &= G(S_3 \oplus r \oplus S_2) \oplus (S_0 \oplus S_1),\\
\mathrm{temp}_0 &= S_3 \oplus r,\\
S'_3 &= \mathrm{ROT}(S'_1) \oplus (S'_0 \oplus \mathrm{temp}_2),\\
S'_2 &= \mathrm{temp}_1,\\
S'_1 &= G(\mathrm{temp}_3 \oplus r \oplus \mathrm{temp}_2) \oplus (\mathrm{temp}_0 \oplus \mathrm{temp}_1),\\
S'_0 &= \mathrm{temp}_3 \oplus r.
\end{aligned} \tag{9}$$
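As a quick numerical illustration of (8), the Python sketch below lists the admissible unfolding factors and evaluates the throughput at a fixed clock. The 32-bit block and the 10 MHz clock are the values used for SHAT-128 in the experimental results; the function name is hypothetical, and holding the clock fixed while the round count changes is a simplification that ignores the growth of the critical path.

```python
TOTAL_STEPS = 48

# Unfolding factors must divide 48 so that the STEPs spread evenly over the rounds.
unfolding_factors = [d for d in range(1, TOTAL_STEPS + 1) if TOTAL_STEPS % d == 0]


def throughput_bps(bits_per_block, f_round_hz, rounds):
    """Throughput from (8): bits * f_round / rounds."""
    return bits_per_block * f_round_hz / rounds


for factor in unfolding_factors:
    rounds = TOTAL_STEPS // factor
    mbps = throughput_bps(32, 10e6, rounds) / 1e6
    print(f"{factor:2d} STEPs per round -> {rounds:2d} rounds, {mbps:7.2f} Mbps")
```

With one STEP per round (48 rounds) this gives about 6.67 Mbps, and with two STEPs per round (24 rounds) about 13.33 Mbps, in line with the 10 MHz figures of Table 7.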

3.3.2. Pipeline and Parallelism. Suppose we unroll two STEP operations in each round. This certainly lowers the frequency while increasing the throughput; however, increased area is introduced as a penalty. If some logic can be executed in parallel, and this parallelism occurs in the critical path, then the delay of each round can be decreased so that the operating frequency increases. According to (8), when the number of operations is kept constant (and the number of bits is also kept constant), the throughput increases with the frequency. This method can be used in any other hardware implementation of a hash function.

For example, Figure 7 shows the architecture that unfolds two STEP operations in one round with the minimum critical path delay. The critical path is composed of seven XOR gates and two $G$ functions. By unfolding two STEPs in one round, the critical path grows by three 32-bit XOR gates and one $G$ function compared with the architecture of one STEP block. The critical path is highlighted by a bold line.

In Figure 7, the cycle counter $r_{i+1}$ can be combined with $\mathrm{temp}_2$ first and then XORed with $\mathrm{temp}_3$ in the second STEP part. Compared with the first STEP part, where $r_i$ is XORed with $S_3$ and then XORed with $S_2$, we can see that an additional component is needed to combine $\mathrm{temp}_3$ and $r_{i+1}$. Because this output must be generated, this area penalty cannot be avoided.

Thus, when we increase the number of unfolded STEP operations (for example, three, four, five, ...), each round delay increases by three 32-bit XOR gates and one $G$ function. Therefore, the normalized delay with unfolding factor $n$ ($n = 1, 2, 3, \ldots$) is

$$\mathrm{Delay}_n = \frac{4 \cdot \mathrm{Delay}(\oplus) + \mathrm{Delay}(g) + (n - 1) \cdot \bigl(3 \cdot \mathrm{Delay}(\oplus) + \mathrm{Delay}(g)\bigr)}{n}. \tag{10}$$

Taking the limit of (10) as $n$ grows gives

$$\lim_{n \to \infty} \mathrm{Delay}_n = 3 \cdot \mathrm{Delay}(\oplus) + \mathrm{Delay}(g). \tag{11}$$

This is the delay bound of SHAT, which means that the delay of one SHAT operation round cannot be less than this bound.
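Rearranging (10) gives $\mathrm{Delay}_n = 3 \cdot \mathrm{Delay}(\oplus) + \mathrm{Delay}(g) + \mathrm{Delay}(\oplus)/n$, which makes the limit in (11) immediate. The Python sketch below evaluates both expressions. The gate delays used in the example call are arbitrary illustrative numbers, not measurements from the paper, and the function names are hypothetical.

```python
def normalized_delay(n, d_xor, d_g):
    """Per-STEP delay of a round that unfolds n STEPs, following (10)."""
    total = 4 * d_xor + d_g + (n - 1) * (3 * d_xor + d_g)
    return total / n


def delay_bound(d_xor, d_g):
    """Limit of (10) for large n, following (11)."""
    return 3 * d_xor + d_g


# Illustrative gate delays in ns (assumed values, not taken from the paper).
D_XOR, D_G = 0.1, 0.5
for n in (1, 2, 4, 48):
    print(n, round(normalized_delay(n, D_XOR, D_G), 4), delay_bound(D_XOR, D_G))
```

The gap to the bound is exactly one XOR delay divided by $n$, so the benefit of further unfolding shrinks quickly as $n$ grows.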

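Figure 7: Proposed architecture of two STEPs round. (Inputs $S_0$, $S_1$, $S_2$, $S_3$ with round constants $r_i$ and $r_{i+1}$; two $g$ blocks and two ROT blocks; intermediate values $\mathrm{temp}_0$ to $\mathrm{temp}_3$; outputs $S_0$, $S_1$, $S_2$, $S_3$.)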

3.4. Experimental Results. We introduce a measurement of hardware efficiency in (12) [14]. This is a refinement of the usual figure of merit (FOM). We assume that the power is proportional to the gate count; we can then divide the metric by another GE, instead of the power dissipation, when we want to trade off throughput for power. Note that one gate equivalent (GE) is equal to the area of a two-input NAND gate in 45 nm CMOS technology.

$$\mathrm{FOM} = \frac{\text{Throughput}}{\mathrm{GE}^2}. \tag{12}$$
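The metric in (12) is simple to evaluate. The Python sketch below does so under one assumption that is ours: the unit scaling (nanobits per clock cycle per GE squared, with the throughput quoted at a 100 kHz clock) is inferred because it reproduces the FOM values reported for SHAT-128 and MD4 in Table 3; the paper itself states only the ratio in (12). The function name is hypothetical.

```python
def figure_of_merit(throughput_kbps, area_ge, clock_khz=100.0):
    """FOM from (12), scaled to nanobits per clock cycle per GE^2 (assumed units)."""
    bits_per_cycle = throughput_kbps / clock_khz
    return bits_per_cycle / (area_ge ** 2) * 1e9


# Values quoted for SHAT-128 and MD4 in Table 3 (throughput at 100 kHz, area in GE).
print(round(figure_of_merit(66.67, 1605), 2))
print(round(figure_of_merit(112.28, 7350), 2))
```

Squaring the area penalizes large designs more strongly than a plain throughput-per-area ratio would, which is why a compact design can score well despite a modest throughput.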

Table 3 shows the hardware implementation results of some 128-bit hash functions using a 100 kHz clock frequency and 45 nm CMOS technology. Firstly, the throughput of SHAT-128 (66.67 kbps) is less than that of 5 other hash implementations, namely MD4 (112.28 kbps), MD5 (83.66 kbps), H-Present-128-32-round (200 kbps), and ARMADILLO2-B (250 kbps and 1000 kbps). However, the area of SHAT-128 is on average only 28.42% of that of these hash functions. As a result, the hardware efficiency of SHAT-128 is 13.12 times higher on average. Secondly, the area of SHAT-128 (1605 GE) is larger than that of 3 hash algorithms, namely U-QUARK-544-round (1379 GE), PHOTON-128-996-round (1122 GE), and SPONGENT-128-8-bit-2380-round (1060 GE); however, the throughput of SHAT-128 is 94.27 times higher. Thus, the FOM of SHAT-128 is 46.75 times higher on average. Finally, the area of SHAT-128 (1605 GE) is less than that of 4 other hash algorithms, namely H-Present-128-559-round (2330 GE), U-QUARK-68-round (2392 GE), PHOTON-128-156-round (1708 GE), and SPONGENT-128-70-round (1687 GE), and the throughput of SHAT-128 is also 5.95 times higher than that of these 4 hash algorithms on average. As a result, the FOM of SHAT-128 is 9.66 times higher on average.

In Table 4, firstly, the throughput of SHAT-256 is 51.05% of that of Grostl; however, the area of SHAT-256 is only 21.84% of that of Grostl, which results in 84.47 times higher hardware efficiency for SHAT-256. Secondly, the throughput of SHAT-256 (3193 GE) is on average 412.91 times higher than that of 2 hash algorithms, PHOTON-256-156-round (2177 GE) and SPONGENT-256-9520-round (1950 GE); although the area of SHAT-256 is larger, the FOM of SHAT-256 is still 158.25 times higher than that of these 2 hash algorithms. Thirdly, compared with SHA-256, ARMADILLO2-E, BLAKE, PHOTON-256-156-round, and SPONGENT-256-140-round, the throughput of SHAT-256 is 4.65 times higher on average, and the area of SHAT-256 is on average only 49.15% of that of these hash algorithms. Therefore, the FOM of SHAT-256 is 119.14 times higher on average.

Table 3: Hardware implementation results of some 128-bit hash functions.

| Hash function | Block size (bits) | Number of operations | Throughput at 100 kHz (kbps) | Area (GE) | FOM |
|---|---|---|---|---|---|
| SHAT-128 | 32 | 48 | 66.67 | 1605 | 258.80 |
| H-Present-128 [15] | 128 | 559 | 11.45 | 2330 | 21.09 |
| H-Present-128 [15] | 128 | 32 | 200 | 4256 | 110.41 |
| MD4 [15] | 512 | 456 | 112.28 | 7350 | 20.78 |
| MD5 [15] | 512 | 612 | 83.66 | 8400 | 11.86 |
| ARMADILLO2-B [15] | 64 | 256 | 250 | 4353 | 13.19 |
| ARMADILLO2-B [15] | 64 | 64 | 1000 | 6025 | 27.55 |
| U-QUARK [15] | 8 | 544 | 1.47 | 1379 | 7.73 |
| U-QUARK [15] | 8 | 68 | 11.76 | 2392 | 20.56 |
| PHOTON-128 [15] | 16 | 996 | 1.61 | 1122 | 12.78 |
| PHOTON-128 [15] | 16 | 156 | 10.26 | 1708 | 35.15 |
| SPONGENT-128 [15] | 8 | 2380 | 0.34 | 1060 | 2.99 |
| SPONGENT-128 [15] | 16 | 70 | 11.43 | 1687 | 40.16 |

Table 4: Hardware implementation results of some 256-bit hash functions.

| Hash function | Block size (bits) | Number of operations | Throughput at 100 kHz (kbps) | Area (GE) | FOM |
|---|---|---|---|---|---|
| SHAT-256 | 64 | 48 | 133.33 | 3193 | 130.78 |
| SHA-256 [14] | 512 | 490 | 104.48 | 8588 | 14.17 |
| ARMADILLO2-E [14] | 128 | 512 | 25 | 8653 | 3.34 |
| ARMADILLO2-E [14] | 128 | 128 | 100 | 11914 | 7.05 |
| BLAKE [14] | 32 | 816 | 72.79 | 13575 | 0.21 |
| Grostl [14] | 64 | 196 | 261.14 | 14622 | 1.53 |
| PHOTON-256 [14] | 32 | 156 | 3.21 | 2177 | 6.78 |
| PHOTON-256 [14] | 32 | 156 | 20.51 | 4362 | 10.17 |
| SPONGENT-256 [14] | 16 | 9520 | 0.17 | 1950 | 0.44 |
| SPONGENT-256 [14] | 16 | 140 | 11.43 | 3281 | 10.62 |

Table 5: Hardware implementation results of some 384-bit hash functions.

| Hash function | Block size (bits) | Number of operations | Throughput at 100 kHz (kbps) | Area (GE) | FOM |
|---|---|---|---|---|---|
| SHAT-384 | 96 | 48 | 200 | 4753 | 88.53 |
| SHA-384 [14] | 1024 | 84 | 1219.04 | 43330 | 6.49 |

Table 6: Performance results of hash function using pipeline and parallelism.

| Number of iteration rounds | Area (GE) | Area increase (%) | Delay (ns) | Delay reduction (%) | Power (μW) | Power increase (%) |
|---|---|---|---|---|---|---|
| 48 | 965 | 0.00 | 0.94 | 0.00 | 27.27 | 0.00 |
| 24 | 2010 | 4.15 | 1.87 | 2.60 | 79.05 | 0.91 |
| 16 | 3055 | 5.53 | 2.81 | 3.44 | 136.42 | 3.52 |
| 12 | 4100 | 6.22 | 3.74 | 4.35 | 193.71 | 4.67 |
| 8 | 6190 | 6.91 | 5.61 | 4.92 | 308.48 | 5.73 |
| 6 | 8280 | 7.25 | 7.47 | 5.32 | 423.18 | 6.21 |
| 4 | 12460 | 7.60 | 11.20 | 5.64 | 652.62 | 6.68 |
| 3 | 16640 | 7.77 | 14.93 | 5.74 | 882.00 | 6.90 |
| 2 | 25000 | 7.94 | 22.40 | 5.88 | 1340.80 | 7.12 |
| 1 | 50080 | 8.12 | 44.70 | 6.31 | 2695.40 | 7.40 |

Table 7: Performance results of unrolling steps constructions.

| Number of iteration rounds | Area (GE) | Delay (ns) | Power (μW) | Throughput at 10 MHz (Mbps) |
|---|---|---|---|---|
| 48 | 965 | 0.94 | 27.27 | 6.67 |
| 24 | 1930 | 1.92 | 78.34 | 13.33 |
| 16 | 2895 | 2.91 | 131.78 | 20.00 |
| 12 | 3860 | 3.91 | 185.06 | 26.67 |
| 8 | 5790 | 5.90 | 291.77 | 40.00 |
| 6 | 7720 | 7.89 | 398.42 | 53.33 |
| 4 | 11580 | 11.87 | 611.78 | 80.00 |
| 3 | 15440 | 15.84 | 825.06 | 106.67 |
| 2 | 23160 | 23.80 | 1251.70 | 160.00 |
| 1 | 46320 | 47.71 | 2509.70 | 320.00 |

In Table 5, the throughput of SHA-384 is 6.09 times higher than that of SHAT-384; however, the area of SHA-384 is 9.11 times higher, which makes the hardware efficiency of SHAT-384 13.64 times higher than that of SHA-384.

Then we implement the unfolding transformation technique with 10 different numbers of unrolled loops (1, 2, ..., 48), using 45 nm CMOS technology at 10 MHz, to evaluate the performance of SHAT-128; the results are shown in Table 7. As Table 7 shows, the throughput of the Perm function can reach up to 47.97 times that of the original design, which is 6.67 Mbps. However, area, delay, and power increase dramatically as a penalty.

Finally, we apply the pipeline and parallelism technique to reconstruct the STEP block. As shown in Table 6, compared with the performance of the original circuit, the critical path delay is reduced by up to 6.31%, while the power and area increase by up to about 8%.

4. Low Power Design for Hash Function

Low power design is a significant consideration in hardware implementation. The power consumption determines a device's lifetime, reliability, and energy cost. Thus, low power techniques are now applied to nearly every application. There are many methods to reduce power consumption, such as clock gating and power gating, which address dynamic power and leakage power, respectively. Decreasing the clock frequency also reduces power dissipation dramatically.

Firstly, we propose the frequency trade-off technique. By using this method, we can obtain a range of frequency values for making a trade-off between low power consumption and high throughput of the hash function. Secondly, we construct a hash encryption system that includes an input data padding unit, RAM registers, the main hash computing construction, a message digest extraction component, and a main control unit. Thirdly, by analyzing the idle mode and control signals of this hash encryption system, a load-enable based clock gating scheme is applied to reduce the dynamic power consumption.

4.1. Frequency Trade-Off Technique. According to (1), reducing the clock frequency is an effective method to decrease dynamic power dissipation linearly. In Section 2.2, we discussed the DVFS technique. By collecting information about workload and temperature, DVFS determines a clock frequency sufficient for the required performance. However, modifying the clock frequency at RTL is not easy; normally, we treat the clock frequency as a constant. Also, dynamic frequency scaling reduces the number of operations a system can issue in a given amount of time, thus reducing performance. Therefore, there is a trade-off to consider: a high clock frequency brings high throughput but dramatically increases dynamic power consumption, whereas a low clock frequency minimizes dynamic power dissipation but decreases the throughput as well.

However according to the unfolding transformationtechnique which is introduced in Section 33 the maximumfrequency of Perm function will decrease while the numberof unrolling loops increases It means that we can decreasethe clock frequency while increasing throughput of thehash algorithm Thus this unrolling transformation tech-nique compromises high performance without high clockfrequency According to this advantage by choosing properclock frequency we can make a trade-off between highperformance and low power consumption

Next we explain how to get this scope of frequency valuefrom the two performance bounds For example first weachieve two values of rolling Perm circuit dynamic powerconsumption 119875

1and clock frequency 119891

1which is defined by

the necessity of circuit design (the clock period computedfrom 119891

1needs to be not less than the critical path delay)

Then according to (8) we can get the throughput 1198791at this

frequency Thus those two performance bounds are definedin (13) where 119899 is the number of iteration rounds in one Permfunction with rolling STEPs

119875max = 1198751 sdot 119899

119879min = 1198791(13)

This method can be defined as the following referringto the performance of original folding circuit (we assumethat this circuit is the one with 48 iteration rounds in one

International Journal of Distributed Sensor Networks 9

Receiver

RAM

Maincontrol LCD

displayer

Hashprocess

Clockdivider

Inputmessage

Digestdisplay

Outputdigestn times br bits

and

Figure 8 Hash encryption system

Perm function) each unfolding transformation design withdifferent numbers of unrolling STEPs (2 3 48) has twoperformance bounds one is maximum dynamic power andthe other is minimum throughput of the circuit These twoperformance bounds are used to determine the boundary ofproper frequency range for each unfolding transformationcircuit It means that when we choose one specific clockfrequency in this value scope the total dynamic powerconsumption of that PERM function will be not more thandefined maximum dynamic power 119875max and its throughputwill be not less than that fixed minimum throughput 119879min

This clock frequency scope gives us many differentchoices for different circuit designs by using unfoldingtransformation techniqueThe results of this frequency trade-off technique are shown in Table 9 in Section 44

42 Hash Encryption System Design The hash encryptionsystem is divided into 5 main parts as shown in Figure 8

Firstly, the receiver and RAM section is effectively our padding unit. We use serial communication to connect the PC and the hash encryption system. Thus we need a clock divider to generate a clock that is synchronous with the Baud rate of the serial link. We choose 4800 Baud as the transmission rate, a deliberately modest speed that keeps the error rate low (less than 3%). In this case one Baud represents 1 bit. Each transmitted frame consists of one start bit "0", an 8-bit message, and one stop bit "1". The start bit and stop bit are added to the transmitted message bits automatically. The sampling rate of the receiver is 16, and the FPGA board provides a 100 MHz clock frequency. Thus the clock period used for sampling is 1302 times the period of the provided 100 MHz clock, as shown in (14). The resulting error is 0.0064%, which is less than 3%.

\[
\text{Sampling Clock Cycles} = \frac{\text{Clock Frequency}}{\text{Baud rate} \cdot \text{Sampling Rate}} = \frac{100\ \text{MHz}}{16 \times 4800\ \text{B/s}} \approx 1302 \tag{14}
\]
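The divider value in (14) and the quoted rounding error can be checked numerically. This is a small sketch of that arithmetic; the function name and its return convention are assumptions made for illustration and are not part of the original design.

```python
def sampling_divider(clock_hz, baud, oversample):
    """Return (integer divider, rounding error in percent) for a UART receiver.

    The exact ratio is clock_hz / (baud * oversample); hardware can only
    count whole clock cycles, so the ratio is rounded to the nearest integer.
    """
    exact = clock_hz / (baud * oversample)
    divider = round(exact)
    error_pct = abs(exact - divider) / exact * 100.0
    return divider, error_pct


if __name__ == "__main__":
    divider, error_pct = sampling_divider(100_000_000, 4800, 16)
    print(divider, f"{error_pct:.4f}%")
```

With the values used in this paper (100 MHz, 4800 Baud, 16 samples per bit) the exact ratio is about 1302.08, so rounding to 1302 gives an error of about 0.0064%.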

Because the liquid crystal display (LCD) can show only 32 hexadecimal characters, which matches the number of digest bits of SHAT-128, the size $b_r$ of each padded block is set to 32 bits, which consist of eight 4-bit hexadecimal digits.

Secondly, the hash function introduced in Section 3 is designed as a sponge construction, as shown in Figure 4. After absorbing $n$ 32-bit message blocks, a 128-bit digest is squeezed out.

Finally, the main control unit manages the working order of the receiver, the hash process, and the LCD display. Figure 9 shows the pipelined operation of the system.

Because we use serial communication, data transfer is slow. At 4800 Baud, chosen for its low error rate, each 32-bit block needs roughly 7 ms. For example, if seven 32-bit blocks need to be transmitted, roughly 50 ms is spent on data receiving and padding. The hash function used in this system performs one STEP in each round, which means that a complete Perm function takes 48 iteration rounds; even so, hash processing needs only roughly 6 μs. The LCD displaying period also takes considerable time. Even though LCD initialization can finish before the hash digest is available, roughly 15 ms is still needed to display all the data.
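The receive-time estimate above follows directly from the Baud rate. The sketch below reproduces it under two assumptions that the text does not settle: whether the start and stop bits are counted in the estimate (the `framed` flag, a hypothetical parameter), and that one Baud carries exactly one bit.

```python
BAUD = 4800  # bits per second on the serial link


def transfer_time_ms(payload_bits, framed=False):
    """Time in milliseconds to receive `payload_bits` of message data.

    With framed=True every 8 payload bits are sent as a 10-bit frame
    (one start bit, eight data bits, one stop bit).
    """
    bits_on_wire = payload_bits * 10 // 8 if framed else payload_bits
    return bits_on_wire / BAUD * 1000.0


if __name__ == "__main__":
    print(f"one block, payload only: {transfer_time_ms(32):.2f} ms")
    print(f"one block, with framing: {transfer_time_ms(32, framed=True):.2f} ms")
    print(f"seven blocks, payload only: {transfer_time_ms(7 * 32):.2f} ms")
```

Counting payload bits only gives about 6.7 ms per block and about 47 ms for seven blocks, which is consistent with the rounded figures of 7 ms and 50 ms; including the framing bits gives somewhat larger values.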

4.3. Load-Enable Based Clock Gating. In this section we introduce the load-enable based clock gating technique for the hash encryption system.

Clock gating is the most widely used low power technique at RTL. At RTL it is more practical to control the toggle rate of gate outputs than the other three components of dynamic power, namely $V_{DD}$, clock frequency, and gate output capacitance. According to Figure 9, the hash encryption system operates as a pipeline. The finishing signal of each process can be treated as the enable signal in load-enable based clock gating, as shown in Figure 3. XOR-based clock gating, on the other hand, needs to specify the outputs of single-level flip-flops, which are not easily determined in our encryption system. Thus load-enable based clock gating is our best option as a low power method.

As shown in Figure 10, three signal pairs realize this load-enable based clock gating: en_div and fsh_r, en_h and fsh_h, and en_lcd and fsh_lcd. Because the receiver runs at a specific clock frequency that corresponds to the serial communication, the main control unit does not gate the clock signal of the receiver directly. Instead, by controlling the clock signal of the clock divider with en_div, the receiver can be properly managed.

Figure 9 gives the three operation phases of the encryption system. In the first phase, the en_div and en_lcd signals are asserted to logic one and en_h is set to logic zero; thus the receiver starts receiving input messages and padding them into RAM. At the same time, the system begins the initialization process for the LCD displayer, while the hash processing unit waits for the padded input message. Because serial communication takes a long time, owing to the low Baud rate and to the bit-by-bit nature of the transmission, LCD displayer initialization can finish before the padded message is ready. Thus en_lcd can be set to logic zero by the main control unit when fsh_lcd switches to logic one.

[Figure 9: Three phases of hash encryption system. Phase one: the receiver performs data receiving and padding, the hash unit is idle, and the LCD is initialized and then idle. Phase two: the receiver is idle, the hash unit performs hash processing, and the LCD is idle. Phase three: the receiver and the hash unit are idle while the LCD displays the digest.]

[Figure 10: Control signals of hash encryption system. The block diagram of Figure 8 annotated with the signals en_div, en_h, en_lcd, fsh_r, fsh_h, fsh_lcd, and clk_r, together with the padded message and digest data paths.]

During the second phase, the padded message is ready, so fsh_r switches to logic one. Then en_div is set to zero, which turns off the clock divider; no receiver clock is produced, and the receiver stops working. In this phase en_h is asserted to logic one for hash encryption, which is our core function, while en_lcd remains zero, waiting for the digest generated by hash processing.

The system enters the third phase when the fsh_h signal switches to logic one. In this phase the hash digest is ready, so both the receiver and the hash process are in idle mode, which means that en_div and en_h are both set to logic zero. Signal en_lcd is asserted to logic one to start LCD displaying and is set back to zero when the displaying process is finished. This is the end of the whole process; the device is then turned off or repeats these three phases for another input message.
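The three phases can be summarized as a small behavioral model of the main control unit. This is an illustrative sketch only, not the RTL used in the paper: the `Phase` enumeration, the function names, and the separate `lcd_init_done` flag (standing in for the first assertion of fsh_lcd) are assumptions introduced here.

```python
from enum import Enum


class Phase(Enum):
    RECEIVE = 1   # phase one: data receiving and padding, LCD initialization
    HASH = 2      # phase two: hash processing
    DISPLAY = 3   # phase three: LCD displaying
    DONE = 4      # everything finished; all clocks gated


def next_phase(phase, fsh_r, fsh_h, fsh_lcd):
    """Advance to the next phase when the matching finish signal is high."""
    if phase is Phase.RECEIVE and fsh_r:
        return Phase.HASH
    if phase is Phase.HASH and fsh_h:
        return Phase.DISPLAY
    if phase is Phase.DISPLAY and fsh_lcd:
        return Phase.DONE
    return phase


def enables(phase, lcd_init_done):
    """Return the clock-gating enables (en_div, en_h, en_lcd) for a phase."""
    if phase is Phase.RECEIVE:
        # The LCD clock is only needed until its initialization has finished.
        return (1, 0, 0 if lcd_init_done else 1)
    if phase is Phase.HASH:
        return (0, 1, 0)
    if phase is Phase.DISPLAY:
        return (0, 0, 1)
    return (0, 0, 0)


if __name__ == "__main__":
    phase = Phase.RECEIVE
    print(phase.name, enables(phase, lcd_init_done=False))
    print(phase.name, enables(phase, lcd_init_done=True))
    phase = next_phase(phase, fsh_r=1, fsh_h=0, fsh_lcd=0)
    print(phase.name, enables(phase, lcd_init_done=True))
    phase = next_phase(phase, fsh_r=0, fsh_h=1, fsh_lcd=0)
    print(phase.name, enables(phase, lcd_init_done=True))
```

In this model at most two enables are high at any time, and only one is high once LCD initialization has completed, which is the property that load-enable based clock gating exploits.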

By analyzing the structure and operation of the hash encryption system, we can determine the idle time of each component. Applying load-enable based clock gating to each component then reduces the dynamic power dissipation of the system, as shown in Table 8 in Section 4.4.
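As a rough illustration of how much idle time each component has, the phase durations quoted in Section 4.2 (about 50 ms of receiving for seven blocks, about 6 μs of hashing, about 15 ms of displaying) can be combined as follows. The LCD initialization time is not given in the text, so `lcd_init_ms` is a purely hypothetical placeholder, and the resulting fractions describe idle time only; they do not predict the measured power saving.

```python
def idle_fractions(receive_ms=50.0, hash_ms=0.006, display_ms=15.0,
                   lcd_init_ms=5.0):
    """Return the fraction of one complete run during which each unit idles.

    lcd_init_ms is an assumed value; initialization overlaps phase one,
    so it adds to the LCD's busy time but not to the total run time.
    """
    total_ms = receive_ms + hash_ms + display_ms
    busy_ms = {
        "receiver": receive_ms,
        "hash": hash_ms,
        "lcd": display_ms + lcd_init_ms,
    }
    return {unit: 1.0 - busy / total_ms for unit, busy in busy_ms.items()}


if __name__ == "__main__":
    for unit, fraction in idle_fractions().items():
        print(f"{unit}: idle {fraction:.2%} of the run")
```

Under these assumed durations the hash unit is idle for more than 99.99% of a run, which suggests why gating its clock is attractive even though its active phase is the core function of the system.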

4.4. Experimental Results. Using a 10 MHz clock frequency and 45 nm CMOS technology, the results of the frequency trade-off technique are shown in Tables 9, 10, and 11. Table 9 shows that the area and critical path delay are unchanged compared with the unfolding transformation technique. Tables 10 and 11 give the variation of dynamic power consumption and throughput under the frequency trade-off method. Here $f_i$ stands for frequency, $T_i$ stands for throughput, and $T_{i\,\mathrm{pct}}$ is the percentage increase relative to the minimum throughput ($T_{\min}$), which is 6.67 Mbps. $P_i$ is the total dynamic power consumed in completing one Perm function, and $P_{i\,\mathrm{pct}}$ is the percentage power reduction relative to the maximum power consumption ($P_{\max}$), defined as 1308.96 μW, which is the product of 48 (the number of iteration rounds) and 27.27 μW (as shown in Table 7). In all cases $i$ stands for the number of iteration rounds.

Table 8: Hardware implementation with/without load-enable based clock gating.

System type   Area (GE)   Increase (%)   Delay (ns)   Increase (%)   Power (μW)   Reduction (%)
Original      14053       n/a            1.63         n/a            1830.20      n/a
Clock gated   14565       3.64           1.72         5.52           1580.36      13.65

Table 9: Area and delay performances of frequency trade-off technique.

Number of iteration rounds   Area (GE)   Delay (ns)   Frequency (MHz)
48                           965         0.94         10.00
24                           1930        1.92         5.00 < $f_{24}$ < 6.96
16                           2895        2.91         3.33 < $f_{16}$ < 6.20
12                           3860        3.91         2.50 < $f_{12}$ < 5.89
8                            5790        5.90         1.67 < $f_{8}$ < 5.60
6                            7720        7.89         1.25 < $f_{6}$ < 5.47
4                            11580       11.87        0.83 < $f_{4}$ < 5.34
3                            15440       15.84        0.63 < $f_{3}$ < 5.28
2                            23160       23.80        0.42 < $f_{2}$ < 5.22
1                            46320       47.71        0.21 < $f_{1}$ < 5.21

Then we apply the load-enable based clock gating scheme to the hash encryption system, using the 100 MHz clock frequency provided by the FPGA board and 45 nm CMOS technology. As shown in Table 8, the dynamic power decreases by 13.65%, at the cost of a 3.64% increase in area and a 5.52% increase in critical path delay.
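The percentages in Table 8 follow from the raw area, delay, and power figures. The short sketch below recomputes them; it assumes the two-decimal reading of the delay and power columns (1.63 ns, 1.72 ns, 1830.20 μW, 1580.36 μW), although the percentages themselves do not depend on where the decimal point is placed.

```python
def percent_change(before, after):
    """Signed relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    area = percent_change(14053, 14565)        # gate equivalents
    delay = percent_change(1.63, 1.72)         # nanoseconds
    power = percent_change(1830.20, 1580.36)   # microwatts
    print(f"area  {area:+.2f}%")
    print(f"delay {delay:+.2f}%")
    print(f"power {power:+.2f}%")
```

The computed changes are an increase of about 3.64% in area, an increase of about 5.52% in delay, and a decrease of about 13.65% in power, matching the table.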

Table 10: Dynamic power consumption of frequency trade-off technique.

Number of iteration rounds   Power (μW)                     Reduction (%)                         Frequency (MHz)
48                           1308.96                        n/a                                   10.00
24                           940.08 < $P_{24}$ < 1308.48    0.04 < $P_{24\,\mathrm{pct}}$ < 28.18   5.00 < $f_{24}$ < 6.96
16                           702.88 < $P_{16}$ < 1307.20    0.13 < $P_{16\,\mathrm{pct}}$ < 46.30   3.33 < $f_{16}$ < 6.20
12                           555.12 < $P_{12}$ < 1308.00    0.07 < $P_{12\,\mathrm{pct}}$ < 57.59   2.50 < $f_{12}$ < 5.89
8                            389.04 < $P_{8}$ < 1307.12     0.14 < $P_{8\,\mathrm{pct}}$ < 70.28    1.67 < $f_{8}$ < 5.60
6                            298.80 < $P_{6}$ < 1307.64     0.10 < $P_{6\,\mathrm{pct}}$ < 77.17    1.25 < $f_{6}$ < 5.47
4                            203.92 < $P_{4}$ < 1306.76     0.17 < $P_{4\,\mathrm{pct}}$ < 84.42    0.83 < $f_{4}$ < 5.34
3                            154.71 < $P_{3}$ < 1306.89     0.16 < $P_{3\,\mathrm{pct}}$ < 88.18    0.63 < $f_{3}$ < 5.28
2                            104.30 < $P_{2}$ < 1306.76     0.17 < $P_{2\,\mathrm{pct}}$ < 92.03    0.42 < $f_{2}$ < 5.22
1                            52.29 < $P_{1}$ < 1307.60      0.10 < $P_{1\,\mathrm{pct}}$ < 96.01    0.21 < $f_{1}$ < 5.21

Table 11: Throughput performances of frequency trade-off technique.

Number of iteration rounds   Throughput (Mbps)           Improvement (%)                          Frequency (MHz)
48                           6.67                        n/a                                      10.00
24                           6.67 < $T_{24}$ < 9.28      0.00 < $T_{24\,\mathrm{pct}}$ < 39.13     5.00 < $f_{24}$ < 6.96
16                           6.67 < $T_{16}$ < 12.4      0.00 < $T_{16\,\mathrm{pct}}$ < 85.91     3.33 < $f_{16}$ < 6.20
12                           6.67 < $T_{12}$ < 15.71     0.00 < $T_{12\,\mathrm{pct}}$ < 135.53    2.50 < $f_{12}$ < 5.89
8                            6.67 < $T_{8}$ < 22.40      0.00 < $T_{8\,\mathrm{pct}}$ < 235.83     1.67 < $f_{8}$ < 5.60
6                            6.67 < $T_{6}$ < 29.17      0.00 < $T_{6\,\mathrm{pct}}$ < 337.33     1.25 < $f_{6}$ < 5.47
4                            6.67 < $T_{4}$ < 42.72      0.00 < $T_{4\,\mathrm{pct}}$ < 540.48     0.83 < $f_{4}$ < 5.34
3                            6.67 < $T_{3}$ < 56.32      0.00 < $T_{3\,\mathrm{pct}}$ < 744.38     0.63 < $f_{3}$ < 5.28
2                            6.67 < $T_{2}$ < 83.52      0.00 < $T_{2\,\mathrm{pct}}$ < 1152.17    0.42 < $f_{2}$ < 5.22
1                            6.67 < $T_{1}$ < 166.72     0.00 < $T_{1\,\mathrm{pct}}$ < 2399.55    0.21 < $f_{1}$ < 5.21

5. Conclusion

To achieve a high performance and low power hardware implementation of a cryptographic hash function that uses the sponge construction, we applied four techniques. Firstly, the unfolding transformation technique improves the throughput of the hash function. Secondly, pipelining and parallelism design techniques reduce the critical path delay by modifying the structure of the permutation function. Thirdly, the proposed frequency trade-off technique calculates a frequency range that can be used to trade off low dynamic power consumption against high throughput of the hash function. Finally, a load-enable based clock gating scheme is applied to the hash encryption system to eliminate wasted signal toggling in idle mode.

The experimental results show that the unfolding transformation technique achieves up to 47.97 times higher throughput, the pipelining and parallelism methods give a 6.31% delay reduction, and the load-enable based clock gating scheme decreases dynamic power consumption by 13.65%. The frequency trade-off technique shows how to choose the clock frequency of the hash function to achieve low power consumption and high throughput.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This research was supported by the MKE (The Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program (NIPA-2012-H0301-12-3007) supervised by the NIPA (National IT Industry Promotion Agency).

References

[1] H. Michail and C. Goutis, "Holistic methodology for designing ultra high-speed SHA-1 hashing cryptographic module in hardware," in Proceedings of the IEEE International Conference on Electron Devices and Solid-State Circuits (EDSSC '08), pp. 1–4, Hong Kong, December 2008.

[2] "Cryptographic hash algorithm competition," NIST Computer Security Resource Center, http://csrc.nist.gov/groups/ST/hash/sha-3/index.html.

[3] B. Schneier, Applied Cryptography: Protocols, Algorithms, and Source Code in C, John Wiley & Sons, New York, NY, USA, 2nd edition, 1996.

[4] J. Nakajima and M. Mitsuru, "Performance analysis and parallel implementation of dedicated hash function," in Proceedings of the International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT '02), vol. 2332, pp. 165–180, Amsterdam, The Netherlands, 2002.

[5] P. C. van Oorschot, A. Somayaji, and G. Wurster, "Hardware-assisted circumvention of self-hashing software tamper resistance," IEEE Transactions on Dependable and Secure Computing, vol. 2, no. 2, pp. 82–92, 2005.

[6] G. Bertoni, J. Daemen, M. Peeters, and G. van Assche, "Cryptographic sponge functions," The Sponge Functions Corner, http://sponge.noekeon.org/index.html.

[7] "Sponge function," Wikipedia, http://en.wikipedia.org/wiki/Sponge_function.

[8] L. Li, Power optimization from register transfer level to transistor level in deeply scaled CMOS technology [Ph.D. thesis], Illinois Institute of Technology, Chicago, Ill, USA, 2012.

[9] N. Weste and D. Harris, CMOS VLSI Design: A Circuits and Systems Perspective, Addison-Wesley, Reading, Mass, USA, 2010.

[10] Y. Zhang, Q. Tong, L. Li et al., "Automatic register transfer level CAD tool design for advanced clock gating and low power schemes," in Proceedings of the International SoC Design Conference (ISOCC '12), pp. 21–24, Jeju Island, Republic of Korea, 2012.

[11] K. Aoki, T. Ichikawa, and M. Kanda, "Specification of Camellia - a 128-bit block cipher," Nippon Telegraph and Telephone Corporation, Mitsubishi Electric Corporation, 2000.

[12] Y. K. Lee, H. Chan, and I. Verbauwhede, "Throughput optimized SHA-1 architecture using unfolding transformation," in Proceedings of the 17th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP '06), pp. 354–359, Steamboat Springs, Colo, USA, September 2006.

[13] H. Michail, A. P. Kakarountas, O. Koufopavlou, and C. E. Goutis, "A low-power and high-throughput implementation of the SHA-1 hash function," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '05), vol. 4, pp. 4086–4089, Kobe, Japan, May 2005.

[14] S. Badel, N. Dagtekin, J. Nakahara Jr. et al., "ARMADILLO: a multi-purpose cryptographic primitive dedicated to hardware," in Cryptographic Hardware and Embedded Systems, CHES 2010, vol. 6225 of Lecture Notes in Computer Science, pp. 398–412, 2010.

[15] K. Lin, Y. Zhang, K. Choi, J. Kang, and S. Hong, "Multi-purpose bimodal cryptographic algorithm and its hardware implementation," in Proceedings of the FTRA International Conference on Advanced IT, Engineering and Management (FTRA AIM '13), Seoul, Korea, 2013.


numbers are reduced from 48 to 24. The mathematical expression of one iteration round is replaced by

\[
\begin{aligned}
\mathrm{temp}_3 &= \mathrm{ROT}(\mathrm{temp}_1) \oplus (\mathrm{temp}_0 \oplus S_2), \\
\mathrm{temp}_2 &= S_1, \\
\mathrm{temp}_1 &= G(S_3 \oplus r \oplus S_2) \oplus (S_0 \oplus S_1), \\
\mathrm{temp}_0 &= S_3 \oplus r, \\
S'_3 &= \mathrm{ROT}(S'_1) \oplus (S'_0 \oplus \mathrm{temp}_2), \\
S'_2 &= \mathrm{temp}_1, \\
S'_1 &= G(\mathrm{temp}_3 \oplus r \oplus \mathrm{temp}_2) \oplus (\mathrm{temp}_0 \oplus \mathrm{temp}_1), \\
S'_0 &= \mathrm{temp}_3 \oplus r.
\end{aligned}
\tag{9}
\]
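The structure of (9) can be checked against two consecutive single STEPs in software. In the sketch below the single-STEP function is read off the first four lines of (9); the definitions of `rot` and `g` are arbitrary placeholders chosen only so that the code runs, since the actual ROT and G functions of SHAT are not given in this section. The two counter arguments allow the second half to use a different counter value, as the discussion of Figure 7 suggests; passing the same value twice reproduces (9) literally.

```python
MASK = 0xFFFFFFFF  # all state words are 32 bits wide


def rot(x, k=7):
    """Placeholder rotate-left; the real rotation amount is not stated here."""
    return ((x << k) | (x >> (32 - k))) & MASK


def g(x):
    """Placeholder 32-bit mixing function standing in for G."""
    return (x * 0x9E3779B1 + 1) & MASK


def step(state, r):
    """One STEP, following the first four lines of (9)."""
    s0, s1, s2, s3 = state
    t0 = s3 ^ r
    t1 = g(s3 ^ r ^ s2) ^ (s0 ^ s1)
    t2 = s1
    t3 = rot(t1) ^ (t0 ^ s2)
    return (t0, t1, t2, t3)


def two_steps(state, r_first, r_second):
    """Two STEPs unfolded into one round, written out as in (9)."""
    s0, s1, s2, s3 = state
    temp0 = s3 ^ r_first
    temp1 = g(s3 ^ r_first ^ s2) ^ (s0 ^ s1)
    temp2 = s1
    temp3 = rot(temp1) ^ (temp0 ^ s2)
    n0 = temp3 ^ r_second
    n1 = g(temp3 ^ r_second ^ temp2) ^ (temp0 ^ temp1)
    n2 = temp1
    n3 = rot(n1) ^ (n0 ^ temp2)
    return (n0, n1, n2, n3)


if __name__ == "__main__":
    state = (0x01234567, 0x89ABCDEF, 0xDEADBEEF, 0x0BADF00D)
    assert two_steps(state, 0, 1) == step(step(state, 0), 1)
    print("unfolded round matches two single STEPs")
```

The equivalence holds for any choice of `rot` and `g`, because unfolding only rearranges when the intermediate values are computed; it does not change what is computed.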

3.3.2. Pipeline and Parallelism. Suppose we unroll two STEP operations in each round. This certainly lowers the clock frequency while increasing the throughput; however, the increased area is incurred as a penalty. If some logic can be computed in parallel, and this parallelism occurs on the critical path, then the delay of each round decreases, so the frequency of each operation can be increased. According to (8), when the number of operations and the number of bits are both kept constant, the throughput increases with frequency. This method could be used in the hardware implementation of any other hash function.

For example, Figure 7 shows the architecture for unfolding two STEP operations in one round that has the minimum critical path delay. The critical path is composed of seven XOR gates and two $G$ functions. Unfolding two STEPs in one round therefore adds three 32-bit XOR gates and one $G$ function to the critical path, compared with the architecture of one STEP block. The critical path is highlighted by the bold line.

In Figure 7, the cycle counter $r_{i+1}$ can first be combined with $\mathrm{temp}_2$ and then XORed with $\mathrm{temp}_3$ in the second STEP part. In the first STEP part, by contrast, $r_i$ is XORed with $S_3$ and the result is then XORed with $S_2$. Comparing the two, the second STEP part needs an additional component to compute the XOR of $\mathrm{temp}_3$ and $r_{i+1}$. Because this value must be generated as an output, the area penalty cannot be avoided.

Thus when we increase the number of unfolded STEP operations (for example, three, four, five, ...), each round's delay increases by three 32-bit XOR gates and one $G$ function per additional STEP. Therefore the normalized delay with unfolding factor $n$ ($n = 1, 2, 3, \ldots$) is

\[
\mathrm{Delay}_n = \frac{4 \cdot \mathrm{Delay}(\oplus) + \mathrm{Delay}(g) + (n - 1) \cdot \bigl(3 \cdot \mathrm{Delay}(\oplus) + \mathrm{Delay}(g)\bigr)}{n}. \tag{10}
\]

Taking the limit of (10) as $n$ grows gives

\[
\lim_{n \to \infty} \mathrm{Delay}_n = 3 \cdot \mathrm{Delay}(\oplus) + \mathrm{Delay}(g). \tag{11}
\]

This is the delay bound of SHAT, which means that the delay of one SHAT operation round cannot be less than this bound.
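Equations (10) and (11) are easy to explore numerically. The sketch below evaluates the normalized delay for increasing unfolding factors; the unit delays used for the XOR gate and the $g$ function are made-up illustrative values, not figures from the 45 nm library.

```python
def normalized_delay(n, delay_xor, delay_g):
    """Per-STEP delay of a round that unfolds n STEPs, following (10)."""
    first_step = 4 * delay_xor + delay_g
    each_extra_step = 3 * delay_xor + delay_g
    return (first_step + (n - 1) * each_extra_step) / n


def delay_bound(delay_xor, delay_g):
    """Limit of the normalized delay as n grows without bound, as in (11)."""
    return 3 * delay_xor + delay_g


if __name__ == "__main__":
    d_xor, d_g = 1.0, 5.0   # hypothetical unit delays
    for n in (1, 2, 4, 8, 16, 48):
        print(n, round(normalized_delay(n, d_xor, d_g), 4))
    print("bound", delay_bound(d_xor, d_g))
```

With these example values the normalized delay falls from 9.0 at $n = 1$ toward the bound of 8.0, and the gap shrinks as one XOR delay divided by $n$.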

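[Figure 7: Proposed architecture of two STEPs round. The state words S0 to S3 pass through two g functions, two ROT blocks, and XOR gates, with the counters r_i and r_{i+1} and the intermediate values temp0 to temp3, to produce the next state S0 to S3.]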

3.4. Experimental Results. We introduce a measure of hardware efficiency in (12) [14], which refines the usual figure of merit (FOM). We assume that power is proportional to gate count; the metric can then be divided by GE a second time, in place of power dissipation, when we want to trade off throughput against power. Note that one gate equivalent (GE) is equal to the area of a two-input NAND gate in 45 nm CMOS technology.

\[
\mathrm{FOM} = \frac{\text{Throughput}}{\mathrm{GE}^2} \tag{12}
\]

Table 3 shows the hardware implementation results of some 128-bit hash functions, using a 100 kHz clock frequency and 45 nm CMOS technology. Firstly, the throughput of SHAT-128 (66.67 kbps) is lower than that of five other hash algorithm configurations, namely MD4 (112.28 kbps), MD5 (83.66 kbps), H-Present-128-32-round (200 kbps), and ARMADILLO2-B (250 kbps and 1000 kbps). However, the area of SHAT-128 is only 28.42% of theirs on average, which makes the hardware efficiency of SHAT-128 13.12 times higher on average. Secondly, the area of SHAT-128 (1605 GE) is larger than that of three hash algorithms, namely U-QUARK-544-round (1379 GE), PHOTON-128-996-round (1122 GE), and SPONGENT-128-8-bit-2380-round (1060 GE); however, the throughput of SHAT-128 is 94.27 times higher. Thus the FOM of SHAT-128 is 46.75 times higher on average. Finally, the area of SHAT-128 (1605 GE) is smaller than that of four other hash algorithms, namely H-Present-128-559-round (2330 GE), U-QUARK-68-round (2392 GE), PHOTON-128-156-round (1708 GE), and SPONGENT-128-70-round (1687 GE), and its throughput is also 5.95 times higher than theirs on average. As a result, the FOM of SHAT-128 is 9.66 times higher on average.

Table 3: Hardware implementation results of some 128-bit hash functions.

Hash function        Block size (bits)   Number of operations   Throughput at 100 kHz (kbps)   Area (GE)   FOM
SHAT-128             32                  48                     66.67                          1605        258.80
H-Present-128 [15]   128                 559                    11.45                          2330        21.09
                     128                 32                     200                            4256        110.41
MD4 [15]             512                 456                    112.28                         7350        20.78
MD5 [15]             512                 612                    83.66                          8400        11.86
ARMADILLO2-B [15]    64                  256                    250                            4353        13.19
                     64                  64                     1000                           6025        27.55
U-QUARK [15]         8                   544                    1.47                           1379        7.73
                     8                   68                     11.76                          2392        20.56
PHOTON-128 [15]      16                  996                    1.61                           1122        12.78
                     16                  156                    10.26                          1708        35.15
SPONGENT-128 [15]    8                   2380                   0.34                           1060        2.99
                     16                  70                     11.43                          1687        40.16

Table 4: Hardware implementation results of some 256-bit hash functions.

Hash function        Block size (bits)   Number of operations   Throughput at 100 kHz (kbps)   Area (GE)   FOM
SHAT-256             64                  48                     133.33                         3193        130.78
SHA-256 [14]         512                 490                    104.48                         8588        14.17
ARMADILLO2-E [14]    128                 512                    25                             8653        3.34
                     128                 128                    100                            11914       7.05
BLAKE [14]           32                  816                    72.79                          13575       0.21
Grostl [14]          64                  196                    261.14                         14622       1.53
PHOTON-256 [14]      32                  156                    3.21                           2177        6.78
                     32                  156                    20.51                          4362        10.17
SPONGENT-256 [14]    16                  9520                   0.17                           1950        0.44
                     16                  140                    11.43                          3281        10.62

Table 5: Hardware implementation results of some 384-bit hash functions.

Hash function   Block size (bits)   Number of operations   Throughput at 100 kHz (kbps)   Area (GE)   FOM
SHAT-384        96                  48                     200                            4753        88.53
SHA-384 [14]    1024                84                     1219.04                        43330       6.49

Table 6: Performance results of hash function using pipeline and parallelism.

Number of iteration rounds   Area (GE)   Increase (%)   Delay (ns)   Reduction (%)   Power (μW)   Increase (%)
48                           965         0.00           0.94         0.00            27.27        0.00
24                           2010        4.15           1.87         2.60            79.05        0.91
16                           3055        5.53           2.81         3.44            136.42       3.52
12                           4100        6.22           3.74         4.35            193.71       4.67
8                            6190        6.91           5.61         4.92            308.48       5.73
6                            8280        7.25           7.47         5.32            423.18       6.21
4                            12460       7.60           11.20        5.64            652.62       6.68
3                            16640       7.77           14.93        5.74            882.00       6.90
2                            25000       7.94           22.40        5.88            1340.80      7.12
1                            50080       8.12           44.70        6.31            2695.40      7.40

Table 7: Performance results of unrolling steps constructions.

Number of iteration rounds   Area (GE)   Delay (ns)   Power (μW)   Throughput at 10 MHz (Mbps)
48                           965         0.94         27.27        6.67
24                           1930        1.92         78.34        13.33
16                           2895        2.91         131.78       20.00
12                           3860        3.91         185.06       26.67
8                            5790        5.90         291.77       40.00
6                            7720        7.89         398.42       53.33
4                            11580       11.87        611.78       80.00
3                            15440       15.84        825.06       106.67
2                            23160       23.80        1251.70      160.00
1                            46320       47.71        2509.70      320.00

In Table 4, firstly, the throughput of SHAT-256 is 51.05% of that of Grostl; however, the area of SHAT-256 is only 21.84% of that of Grostl, which gives SHAT-256 an 84.47 times higher hardware efficiency. Secondly, the throughput of SHAT-256 (3193 GE) is on average 412.91 times higher than that of two hash algorithms, PHOTON-256-156-round (2177 GE) and SPONGENT-256-9520-round (1950 GE); although the area of SHAT-256 is larger, its FOM is still 158.25 times higher than that of these two algorithms. Thirdly, compared with SHA-256, ARMADILLO2-E, BLAKE, PHOTON-256-156-round, and SPONGENT-256-140-round, the throughput of SHAT-256 is 4.65 times higher on average, and its area is only 49.15% of theirs on average. Therefore, the FOM of SHAT-256 is 119.14 times higher on average.

In Table 5, the throughput of SHA-384 is 6.09 times higher than that of SHAT-384; however, the area of SHA-384 is 9.11 times larger, which makes the hardware efficiency of SHAT-384 13.64 times higher than that of SHA-384.
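The comparison in Table 5 can be reproduced from the raw throughput and area figures using the definition in (12). The sketch below computes only ratios, which do not depend on the units or scaling applied to the FOM column; the function name is illustrative.

```python
def fom(throughput_kbps, area_ge):
    """Figure of merit from (12): throughput divided by area squared."""
    return throughput_kbps / area_ge ** 2


if __name__ == "__main__":
    shat_384 = fom(200.0, 4753)       # SHAT-384 row of Table 5
    sha_384 = fom(1219.04, 43330)     # SHA-384 row of Table 5
    print(f"throughput ratio (SHA/SHAT): {1219.04 / 200.0:.2f}")
    print(f"area ratio (SHA/SHAT):       {43330 / 4753:.2f}")
    print(f"FOM ratio (SHAT/SHA):        {shat_384 / sha_384:.2f}")
```

The FOM ratio computed this way comes out very close to the 13.64 quoted above; a small difference in the last digit can arise because the quoted figure is derived from the rounded FOM values in the table.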

Then we implement the unfolding transformation technique with 10 different numbers of unrolled loops (1, 2, ..., 48), using 45 nm CMOS technology at 10 MHz, to evaluate the performance of SHAT-128; the results are shown in Table 7. As Table 7 shows, the throughput of the Perm function can reach up to 47.97 times that of the original design, which is 6.67 Mbps. However, area, delay, and power increase dramatically as a penalty.
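The throughput column of Table 7 is consistent with a simple relation: block bits multiplied by clock frequency and divided by the number of iteration rounds. The sketch below assumes that this is the form of (8), which is not reproduced in this section, and uses the 32-bit block of SHAT-128.

```python
def throughput_mbps(rounds, block_bits=32, clock_mhz=10.0):
    """Throughput of one Perm function, assuming one round per clock cycle."""
    return block_bits * clock_mhz / rounds


if __name__ == "__main__":
    for rounds in (48, 24, 16, 12, 8, 6, 4, 3, 2, 1):
        print(f"{rounds:2d} rounds: {throughput_mbps(rounds):7.2f} Mbps")
    # Ratio computed the way the text does, from the rounded 6.67 Mbps.
    print(f"gain over original: {throughput_mbps(1) / 6.67:.3f}")
```

The exact ratio between the fully unrolled and the original design is 48; the slightly smaller figure of 47.97 quoted in the text results from dividing 320.00 Mbps by the rounded value 6.67 Mbps.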

Finally, we apply the pipeline and parallelism technique to reconstruct the STEP block. As shown in Table 6, compared with the original circuit, the critical path delay is reduced by up to 6.31%, while power and area increase by about 8%.

4. Low Power Design for Hash Function

Low power design is a significant consideration in hardware implementation. Power consumption determines a device's lifetime, reliability, and energy cost. Thus low power techniques are now routinely applied in nearly every application. There are many methods for reducing power consumption, such as clock gating and power gating, which address dynamic power and leakage power, respectively. Lowering the clock frequency also reduces power dissipation dramatically.

Firstly, we propose the frequency trade-off technique. With this method we obtain a range of frequency values for trading off low power consumption against high throughput of the hash function. Secondly, we construct a hash encryption system, which includes an input data padding unit, RAM registers, the main hash computing construction, a message digest extraction component, and a main control unit. Thirdly, by analyzing the idle modes and control signals of this hash encryption system, we apply a load-enable based clock gating scheme to reduce the dynamic power consumption.

4.1. Frequency Trade-Off Technique. According to (1), reducing the clock frequency is an effective way to decrease dynamic power dissipation linearly. In Section 2.2 we discussed the DVFS technique: by collecting information about workload and temperature, DVFS determines a clock frequency sufficient for the required performance. However, modifying the clock frequency at RTL is not easy, and normally we treat the clock frequency as constant. Moreover, dynamic frequency scaling reduces the number of operations a system can issue in a given amount of time, thus reducing performance. There is therefore a trade-off to consider: a high clock frequency brings high throughput, but dramatically increased dynamic power consumption is a critical drawback, while a low clock frequency minimizes dynamic power dissipation but decreases the throughput as well.

However, according to the unfolding transformation technique introduced in Section 3.3, the maximum frequency of the Perm function decreases as the number of unrolled loops increases. This means that we can lower the clock frequency while still increasing the throughput of the hash algorithm. The unrolling transformation thus delivers high performance without a high clock frequency. Exploiting this advantage, we can choose a proper clock frequency to trade off high performance against low power consumption.

Next we explain how to obtain this frequency range from the two performance bounds. First we obtain two values for the rolled Perm circuit: the dynamic power consumption $P_1$ and the clock frequency $f_1$, which is constrained by the circuit design (the clock period computed from $f_1$ must be no less than the critical path delay). Then, according to (8), we obtain the throughput $T_1$ at this frequency. The two performance bounds are then defined in (13), where $n$ is the number of iteration rounds in one Perm function with rolled STEPs.

\[
P_{\max} = P_1 \cdot n, \qquad T_{\min} = T_1 \tag{13}
\]

This method can be defined as the following referringto the performance of original folding circuit (we assumethat this circuit is the one with 48 iteration rounds in one

International Journal of Distributed Sensor Networks 9

Receiver

RAM

Maincontrol LCD

displayer

Hashprocess

Clockdivider

Inputmessage

Digestdisplay

Outputdigestn times br bits

and

Figure 8 Hash encryption system

Perm function) each unfolding transformation design withdifferent numbers of unrolling STEPs (2 3 48) has twoperformance bounds one is maximum dynamic power andthe other is minimum throughput of the circuit These twoperformance bounds are used to determine the boundary ofproper frequency range for each unfolding transformationcircuit It means that when we choose one specific clockfrequency in this value scope the total dynamic powerconsumption of that PERM function will be not more thandefined maximum dynamic power 119875max and its throughputwill be not less than that fixed minimum throughput 119879min

This clock frequency scope gives us many differentchoices for different circuit designs by using unfoldingtransformation techniqueThe results of this frequency trade-off technique are shown in Table 9 in Section 44

42 Hash Encryption System Design The hash encryptionsystem is divided into 5 main parts as shown in Figure 8

Firstly the receiver and RAM section is actually ourpadding unit We use serial communication technique toconnect PC and the hash encryption system Thus weneed clock divider to generate proper clock cycle to besynchronous with Baud rate of serial communication Wechoose 4800 Bauds as our transmission Baud rate which isnot a quick speed for low error rate (less than 3) In thiscase one Baud represents 1 bit Our rule of transmission isa one start bit ldquo0rdquo then 8-bit message and one finish bit ldquo1rdquoThis start bit and finish bit will be added into the transmissionmessage bits automatically the sampling rate of receiver is 16and FPGA board provides 100MHz clock frequency Thusthe clock period used in sampling is 1302 times provided100MHz clock period as shown in (14)This error is 00064less than 3

Sampling Clock Cycles =Clock Frequency

Baud rate sdot Sampling Rate

=100MHz

16 times 4800Bs

asymp 1302

(14)

Because the liquid crystal display (LCD) limits the numberof characters we can display which are 32 characters inhexadecimal this number is suitable for the number ofdigest bits of SHAT-128 Thus our 119887119903 for each padded blockis determined to be 32 bits which consist of eight 4-bithexadecimal numbers

Secondly hash functionwhichwe introduced in Section 3is designed as sponge construction as shown in Figure 4Absorbing 119899 32-bit message blocks there are 128 bits digestthat will be squeezed out

Finally the main control unit is designed for managingthe working order between receiver hash process and LCDdisplay Figure 9 shows the pipeline working of system

Because we use serial communication technique thespeed will be slow We apply 4800 Bauds as our Baud ratefor low error rate thus each 32-bit block needs roughly 7msFor example there are seven 32-bit blocks that need to betransmitted roughly 50ms needs to be dissipated for datareceiving and padding Although the hash function that weused in this system is one STEP each round this means thatthere are 48 iteration rounds for a complete Perm functionHowever hash processing just needs roughly 6 120583s It also costsmuch time in LCD displaying period Even though we canfinish LCD initialization before we get hash digest we stillneed roughly 15ms to completely display all data

43 Load-Enable Based Clock Gating In this section weintroduce the load-enable based clock gating technique forthe hash encryption system

Clock gating is themost widely used low power techniqueat RTL It is more reasonable to determine the toggle rate ofgate output at RTL than any other three components such as119881DD clock frequency and gate output capacitance Accordingto Figure 9 the hash encryption system is composed of apipeline construction Finishing signal of each process canbe treated as enable signal in load-enable based clock gatingas shown in Figure 3 On the other hand XOR-based clockgating technique needs to specify the outputs of single levelflip-flops which is not easily determined in our encryptionsystem thus the load-enable based clock gating is our bestoption for low power method

As shown in Figure 10 there are three signal pairs torealize this load-enable based clock gating 119890119899 119889119894V and 119891119904ℎ 119903119890119899 ℎ and 119891119904ℎ ℎ and 119890119899 119897119888119889 and 119891119904ℎ 119897119888119889 Because receiveris implemented in a specific clock frequency which is cor-responding to the serial communication the main controlunit will not gate the clock signal of receiver directly bycontrolling the clock signal of clock divider with 119890119899 119889119894Vreceiver can be properly managed

Figure 9 gives the three operation phases of the encryption system. In the first phase, the en_div and en_lcd signals are asserted to logic one and en_h is set to logic zero; thus the receiver starts receiving input messages and padding them into RAM. In the meantime, the system begins the initialization process for the LCD displayer, while the hash processing unit waits for the padded input message. Because the serial communication takes a long time, due to the low Baud rate and the fact that it transmits the message one bit at a time, LCD displayer initialization can be finished before the padded message is ready. Thus en_lcd can be set to logic zero by the main control unit when fsh_lcd switches to logic one.

Figure 9: Three phases of hash encryption system (rows: Receiver, Hash, LCD; columns: phase one, phase two, phase three; states: data receiving and padding, initialization, hash processing, LCD displaying, idle).

Figure 10: Control signals of hash encryption system (blocks: Receiver, RAM, Main control, LCD displayer, Hash process, Clock divider; data paths: Input message, Padded message, Digest, Digest display; signals: en_lcd, fsh_r, en_div, en_h, fsh_h, fsh_lcd, clk_r).

During the second phase, because the padded message is ready, fsh_r switches to logic one. Then en_div is set to zero, which means that the clock divider is turned off; the receiver clock is no longer produced, and thus the receiver stops working. In this phase, en_h is asserted to logic one for hash processing, which is our core function. en_lcd is still zero, waiting for the hash digest generated by hash processing.

The system enters the third phase when the fsh_h signal switches to logic one. In this phase, the hash digest is ready; thus both the receiver and the hash process are in idle mode, which means that en_div and en_h are both set to logic zero. Signal en_lcd is asserted to logic one to start LCD displaying and is set back to zero when the displaying process is finished. This is the end of the whole procedure; the device is then either turned off or repeats these three phases for another input message.

By analyzing the construction and process of the hash encryption system, we can determine the idle time of each component. Then, by applying load-enable based clock gating to each component, the dynamic power dissipation of this system can be reduced, as shown in Table 8 in Section 4.4.
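The three-phase behaviour described above can be summarized as a small state machine. The sketch below is our own software model, not the authors' RTL: the function names are hypothetical, and it assumes that each finishing signal is a level that reads one once the corresponding job is complete (fsh_lcd covering LCD initialization in phase one and displaying in phase three).

```python
# Illustrative model of the main control unit's enable signals (not the authors' RTL).
from typing import Dict


def enables(phase: int, fsh_lcd: int) -> Dict[str, int]:
    """Return en_div, en_h, and en_lcd for a phase (1, 2, or 3).

    fsh_lcd is 1 once the LCD has finished its current job
    (initialization in phase one, displaying in phase three).
    """
    if phase == 1:   # receiving and padding; LCD initializes in parallel
        return {"en_div": 1, "en_h": 0, "en_lcd": 0 if fsh_lcd else 1}
    if phase == 2:   # hash processing only
        return {"en_div": 0, "en_h": 1, "en_lcd": 0}
    if phase == 3:   # digest ready; LCD displays it
        return {"en_div": 0, "en_h": 0, "en_lcd": 0 if fsh_lcd else 1}
    raise ValueError("phase must be 1, 2, or 3")


def next_phase(phase: int, fsh_r: int, fsh_h: int) -> int:
    """Advance on the finishing signal of the unit that is currently active."""
    if phase == 1 and fsh_r:
        return 2
    if phase == 2 and fsh_h:
        return 3
    return phase


# A hand-written event trace: (fsh_r, fsh_h, fsh_lcd) sampled at successive times.
trace = [(0, 0, 0), (0, 0, 1), (1, 0, 1), (0, 1, 1), (0, 0, 0), (0, 0, 1)]
phase = 1
for fsh_r, fsh_h, fsh_lcd in trace:
    print(phase, enables(phase, fsh_lcd))
    phase = next_phase(phase, fsh_r, fsh_h)
```

In every phase at most two of the three units have an active clock, and for most of the time only one does, which is the idle time that the clock gating exploits.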

4.4. Experimental Results. By using a 10 MHz clock frequency and 45 nm CMOS technology, the results of the frequency trade-off technique are shown in Tables 9, 10, and 11. Table 9 shows that the area and critical path delay are unchanged compared with the unfolding transformation technique. Tables 10 and 11 give the variation of dynamic power consumption and throughput under the frequency trade-off method. Note that $f_i$ stands for frequency, $T_i$ stands for throughput, and $T_{i\mathrm{pct}}$ is the percentage increase compared with the minimum throughput ($T_{\min}$), which is 6.67 Mbps. $P_i$ is the total dynamic power consumed in finishing a complete Perm function, and $P_{i\mathrm{pct}}$ is the percentage of power reduction compared with the maximum power consumption ($P_{\max}$), defined as 1308.96 μW, which is the product of 48 (the number of iteration rounds) and 27.27 μW (as shown in Table 7). Note that $i$ stands for the number of iteration rounds.

Table 8: Hardware implementation with/without load-enable based clock gating.

| System type | Area (GE) | Area increase (%) | Delay (ns) | Delay increase (%) | Power (μW) | Power reduction (%) |
|---|---|---|---|---|---|---|
| Original | 14053 | n/a | 1.63 | n/a | 1830.20 | n/a |
| Clock gated | 14565 | 3.64 | 1.72 | 5.52 | 1580.36 | 13.65 |

Table 9: Area and delay performances of frequency trade-off technique.

| Number of iteration rounds | Area (GE) | Delay (ns) | Frequency (MHz) |
|---|---|---|---|
| 48 | 965 | 0.94 | 10.00 |
| 24 | 1930 | 1.92 | $5.00 < f_{24} < 6.96$ |
| 16 | 2895 | 2.91 | $3.33 < f_{16} < 6.20$ |
| 12 | 3860 | 3.91 | $2.50 < f_{12} < 5.89$ |
| 8 | 5790 | 5.90 | $1.67 < f_{8} < 5.60$ |
| 6 | 7720 | 7.89 | $1.25 < f_{6} < 5.47$ |
| 4 | 11580 | 11.87 | $0.83 < f_{4} < 5.34$ |
| 3 | 15440 | 15.84 | $0.63 < f_{3} < 5.28$ |
| 2 | 23160 | 23.80 | $0.42 < f_{2} < 5.22$ |
| 1 | 46320 | 47.71 | $0.21 < f_{1} < 5.21$ |

Then we apply the load-enable based clock gating scheme to the hash encryption system, using a 100 MHz clock frequency, which can be provided on the FPGA board, and 45 nm CMOS technology. As shown in Table 8, the dynamic power decreases by 13.65%. However, this comes at the cost of a 3.64% increase in area and a 5.52% increase in critical path delay.


Table 10: Dynamic power consumption of frequency trade-off technique.

| Number of iteration rounds | Power (μW) | Reduction (%) | Frequency (MHz) |
|---|---|---|---|
| 48 | 1308.96 | n/a | 10.00 |
| 24 | $940.08 < P_{24} < 1308.48$ | $28.18 > P_{24\mathrm{pct}} > 0.04$ | $5.00 < f_{24} < 6.96$ |
| 16 | $702.88 < P_{16} < 1307.20$ | $46.30 > P_{16\mathrm{pct}} > 0.13$ | $3.33 < f_{16} < 6.20$ |
| 12 | $555.12 < P_{12} < 1308.00$ | $57.59 > P_{12\mathrm{pct}} > 0.07$ | $2.50 < f_{12} < 5.89$ |
| 8 | $389.04 < P_{8} < 1307.12$ | $70.28 > P_{8\mathrm{pct}} > 0.14$ | $1.67 < f_{8} < 5.60$ |
| 6 | $298.80 < P_{6} < 1307.64$ | $77.17 > P_{6\mathrm{pct}} > 0.10$ | $1.25 < f_{6} < 5.47$ |
| 4 | $203.92 < P_{4} < 1306.76$ | $84.42 > P_{4\mathrm{pct}} > 0.17$ | $0.83 < f_{4} < 5.34$ |
| 3 | $154.71 < P_{3} < 1306.89$ | $88.18 > P_{3\mathrm{pct}} > 0.16$ | $0.63 < f_{3} < 5.28$ |
| 2 | $104.30 < P_{2} < 1306.76$ | $92.03 > P_{2\mathrm{pct}} > 0.17$ | $0.42 < f_{2} < 5.22$ |
| 1 | $52.29 < P_{1} < 1307.60$ | $96.01 > P_{1\mathrm{pct}} > 0.10$ | $0.21 < f_{1} < 5.21$ |

In each row, the left end of the power and reduction ranges corresponds to the lowest frequency in the range and the right end to the highest.

Table 11: Throughput performances of frequency trade-off technique.

| Number of iteration rounds | Throughput (Mbps) | Improvement (%) | Frequency (MHz) |
|---|---|---|---|
| 48 | 6.67 | n/a | 10.00 |
| 24 | $6.67 < T_{24} < 9.28$ | $0.00 < T_{24\mathrm{pct}} < 39.13$ | $5.00 < f_{24} < 6.96$ |
| 16 | $6.67 < T_{16} < 12.40$ | $0.00 < T_{16\mathrm{pct}} < 85.91$ | $3.33 < f_{16} < 6.20$ |
| 12 | $6.67 < T_{12} < 15.71$ | $0.00 < T_{12\mathrm{pct}} < 135.53$ | $2.50 < f_{12} < 5.89$ |
| 8 | $6.67 < T_{8} < 22.40$ | $0.00 < T_{8\mathrm{pct}} < 235.83$ | $1.67 < f_{8} < 5.60$ |
| 6 | $6.67 < T_{6} < 29.17$ | $0.00 < T_{6\mathrm{pct}} < 337.33$ | $1.25 < f_{6} < 5.47$ |
| 4 | $6.67 < T_{4} < 42.72$ | $0.00 < T_{4\mathrm{pct}} < 540.48$ | $0.83 < f_{4} < 5.34$ |
| 3 | $6.67 < T_{3} < 56.32$ | $0.00 < T_{3\mathrm{pct}} < 744.38$ | $0.63 < f_{3} < 5.28$ |
| 2 | $6.67 < T_{2} < 83.52$ | $0.00 < T_{2\mathrm{pct}} < 1152.17$ | $0.42 < f_{2} < 5.22$ |
| 1 | $6.67 < T_{1} < 166.72$ | $0.00 < T_{1\mathrm{pct}} < 2399.55$ | $0.21 < f_{1} < 5.21$ |

5. Conclusion

This paper targets a high performance and low power hardware implementation of a cryptographic hash function that uses sponge construction. Firstly, we use the unfolding transformation technique to improve the throughput of the hash function. Secondly, pipelining and parallelism design techniques are implemented to reduce the critical path delay by modifying the structure of the permutation function. Thirdly, a frequency trade-off technique is proposed to calculate a frequency range which can be used to make a trade-off between low dynamic power consumption and high throughput of the hash function. Finally, a load-enable based clock gating scheme is applied in the hash encryption system to eliminate the wasted toggling of signals in the idle mode.

The experimental results have shown that the unfolding transformation technique can achieve up to 47.97 times higher throughput, the pipelining and parallelism methods give a 6.31% delay reduction, the load-enable based clock gating scheme decreases dynamic power consumption by 13.65%, and the frequency trade-off technique shows how to choose the clock frequency of the hash function to achieve low power consumption and high throughput.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This research was supported by the MKE (The Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program (NIPA-2012-H0301-12-3007) supervised by the NIPA (National IT Industry Promotion Agency).

References

[1] H. Michail and C. Goutis, "Holistic methodology for designing ultra high-speed SHA-1 hashing cryptographic module in hardware," in Proceedings of the IEEE International Conference on Electron Devices and Solid-State Circuits (EDSSC '08), pp. 1–4, Hong Kong, December 2008.

[2] "Cryptographic hash algorithm competition," NIST Computer Security Resource Center, http://csrc.nist.gov/groups/ST/hash/sha-3/index.html.

[3] B. Schneier, Applied Cryptography: Protocols, Algorithms, and Source Code in C, John Wiley & Sons, New York, NY, USA, 2nd edition, 1996.

[4] J. Nakajima and M. Mitsuru, "Performance analysis and parallel implementation of dedicated hash function," in Proceedings of the International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT '02), vol. 2332, pp. 165–180, Amsterdam, The Netherlands, 2002.

[5] P. C. van Oorschot, A. Somayaji, and G. Wurster, "Hardware-assisted circumvention of self-hashing software tamper resistance," IEEE Transactions on Dependable and Secure Computing, vol. 2, no. 2, pp. 82–92, 2005.

[6] G. Bertoni, J. Daemen, M. Peeters, and G. van Assche, "Cryptographic sponge functions," The Sponge Functions Corner, http://sponge.noekeon.org/index.html.

[7] "Sponge function," Wikipedia, http://en.wikipedia.org/wiki/Sponge_function.

[8] L. Li, Power optimization from register transfer level to transistor level in deeply scaled CMOS technology [Ph.D. thesis], Illinois Institute of Technology, Chicago, Ill, USA, 2012.

[9] N. Weste and D. Harris, CMOS VLSI Design: A Circuits and Systems Perspective, Addison-Wesley, Reading, Mass, USA, 2010.

[10] Y. Zhang, Q. Tong, L. Li et al., "Automatic register transfer level CAD tool design for advanced clock gating and low power schemes," in Proceedings of the International SoC Design Conference (ISOCC '12), pp. 21–24, Jeju Island, Republic of Korea, 2012.

[11] K. Aoki, T. Ichikawa, and M. Kanda, "Specification of Camellia - a 128-bit block cipher," Nippon Telegraph and Telephone Corporation, Mitsubishi Electric Corporation, 2000.

[12] Y. K. Lee, H. Chan, and I. Verbauwhede, "Throughput optimized SHA-1 architecture using unfolding transformation," in Proceedings of the 17th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP '06), pp. 354–359, Steamboat Springs, Colo, USA, September 2006.

[13] H. Michail, A. P. Kakarountas, O. Koufopavlou, and C. E. Goutis, "A low-power and high-throughput implementation of the SHA-1 hash function," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '05), vol. 4, pp. 4086–4089, Kobe, Japan, May 2005.

[14] S. Badel, N. Dagtekin, J. Nakahara Jr. et al., "ARMADILLO: a multi-purpose cryptographic primitive dedicated to hardware," in Cryptographic Hardware and Embedded Systems, CHES 2010, vol. 6225 of Lecture Notes in Computer Science, pp. 398–412, 2010.

[15] K. Lin, Y. Zhang, K. Choi, J. Kang, and S. Hong, "Multi-purpose bimodal cryptographic algorithm and its hardware implementation," in Proceedings of the FTRA International Conference on Advanced IT Engineering and Management (FTRA AIM '13), Seoul, Korea, 2013.



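Table 3: Hardware implementation results of some 128-bit hash functions.

| Hash function | Block size (bits) | Number of operations | Throughput at 100 kHz (kbps) | Area (GE) | FOM |
|---|---|---|---|---|---|
| SHAT-128 | 32 | 48 | 66.67 | 1605 | 258.80 |
| H-Present-128 [15] | 128 | 559 | 11.45 | 2330 | 21.09 |
| H-Present-128 [15] | 128 | 32 | 200 | 4256 | 110.41 |
| MD4 [15] | 512 | 456 | 112.28 | 7350 | 20.78 |
| MD5 [15] | 512 | 612 | 83.66 | 8400 | 11.86 |
| ARMADILLO2-B [15] | 64 | 256 | 25.0 | 4353 | 13.19 |
| ARMADILLO2-B [15] | 64 | 64 | 100.0 | 6025 | 27.55 |
| U-QUARK [15] | 8 | 544 | 1.47 | 1379 | 7.73 |
| U-QUARK [15] | 8 | 68 | 11.76 | 2392 | 20.56 |
| PHOTON-128 [15] | 16 | 996 | 1.61 | 1122 | 12.78 |
| PHOTON-128 [15] | 16 | 156 | 10.26 | 1708 | 35.15 |
| SPONGENT-128 [15] | 8 | 2380 | 0.34 | 1060 | 2.99 |
| SPONGENT-128 [15] | 16 | 70 | 11.43 | 1687 | 40.16 |

Table 4: Hardware implementation results of some 256-bit hash functions.

| Hash function | Block size (bits) | Number of operations | Throughput at 100 kHz (kbps) | Area (GE) | FOM |
|---|---|---|---|---|---|
| SHAT-256 | 64 | 48 | 133.33 | 3193 | 130.78 |
| SHA-256 [14] | 512 | 490 | 104.48 | 8588 | 14.17 |
| ARMADILLO2-E [14] | 128 | 512 | 25 | 8653 | 3.34 |
| ARMADILLO2-E [14] | 128 | 128 | 100 | 11914 | 7.05 |
| BLAKE [14] | 32 | 816 | 72.79 | 13575 | 0.21 |
| Grostl [14] | 64 | 196 | 261.14 | 14622 | 1.53 |
| PHOTON-256 [14] | 32 | 156 | 3.21 | 2177 | 6.78 |
| PHOTON-256 [14] | 32 | 156 | 20.51 | 4362 | 10.17 |
| SPONGENT-256 [14] | 16 | 9520 | 0.17 | 1950 | 0.44 |
| SPONGENT-256 [14] | 16 | 140 | 11.43 | 3281 | 10.62 |

Table 5: Hardware implementation results of some 384-bit hash functions.

| Hash function | Block size (bits) | Number of operations | Throughput at 100 kHz (kbps) | Area (GE) | FOM |
|---|---|---|---|---|---|
| SHAT-384 | 96 | 48 | 200 | 4753 | 88.53 |
| SHA-384 [14] | 1024 | 84 | 1219.04 | 43330 | 6.49 |

Table 6: Performance results of hash function using pipeline and parallelism.

| Number of iteration rounds | Area (GE) | Area increase (%) | Delay (ns) | Delay reduction (%) | Power (μW) | Power increase (%) |
|---|---|---|---|---|---|---|
| 48 | 965 | 0.00 | 0.94 | 0.00 | 27.27 | 0.00 |
| 24 | 2010 | 4.15 | 1.87 | 2.60 | 79.05 | 0.91 |
| 16 | 3055 | 5.53 | 2.81 | 3.44 | 136.42 | 3.52 |
| 12 | 4100 | 6.22 | 3.74 | 4.35 | 193.71 | 4.67 |
| 8 | 6190 | 6.91 | 5.61 | 4.92 | 308.48 | 5.73 |
| 6 | 8280 | 7.25 | 7.47 | 5.32 | 423.18 | 6.21 |
| 4 | 12460 | 7.60 | 11.20 | 5.64 | 652.62 | 6.68 |
| 3 | 16640 | 7.77 | 14.93 | 5.74 | 882.00 | 6.90 |
| 2 | 25000 | 7.94 | 22.40 | 5.88 | 1340.80 | 7.12 |
| 1 | 50080 | 8.12 | 44.70 | 6.31 | 2695.40 | 7.40 |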


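Table 7: Performance results of unrolling steps constructions.

| Number of iteration rounds | Area (GE) | Delay (ns) | Power (μW) | Throughput at 10 MHz (Mbps) |
|---|---|---|---|---|
| 48 | 965 | 0.94 | 27.27 | 6.67 |
| 24 | 1930 | 1.92 | 78.34 | 13.33 |
| 16 | 2895 | 2.91 | 131.78 | 20.00 |
| 12 | 3860 | 3.91 | 185.06 | 26.67 |
| 8 | 5790 | 5.90 | 291.77 | 40.00 |
| 6 | 7720 | 7.89 | 398.42 | 53.33 |
| 4 | 11580 | 11.87 | 611.78 | 80.00 |
| 3 | 15440 | 15.84 | 825.06 | 106.67 |
| 2 | 23160 | 23.80 | 1251.70 | 160.00 |
| 1 | 46320 | 47.71 | 2509.70 | 320.00 |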

times higher hardware efficiency of SHAT-256. Secondly, the throughput of SHAT-256 (3193 GE) is on average 412.91 times higher than that of two hash algorithms, PHOTON-256-156-round (2177 GE) and SPONGENT-256-9520-round (1950 GE); although the area of SHAT-256 is larger, the FOM of SHAT-256 is still 158.25 times higher than that of these two hash algorithms. Thirdly, compared with SHA-256, ARMADILLO2-E, BLAKE, PHOTON-256-156-round, and SPONGENT-256-140-round, the throughput of SHAT-256 is 4.65 times higher on average, and the area of SHAT-256 is only 49.15% of that of these hash algorithms on average. Therefore, the FOM of SHAT-256 is 119.14 times higher on average.

In Table 5, the throughput of SHA-384 is 6.09 times higher than that of SHAT-384; however, the area of SHA-384 is 9.11 times higher. As a result, the hardware efficiency of SHAT-384 is 13.64 times higher than that of SHA-384.

Then we implement the unfolding transformation technique with 10 different numbers of unrolling loops (1, 2, ..., 48) by using 45 nm CMOS technology at 10 MHz to evaluate the performance of SHAT-128; the results are shown in Table 7. As we can see in Table 7, the throughput of the PERM function can reach up to 47.97 times that of the original one, which is 6.67 Mbps. However, area, delay, and power increase dramatically as a penalty.
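The throughput column of Table 7 can be reproduced with a simple relation. Equation (8) is not reproduced in this part of the paper, so the sketch below assumes that one iteration round completes per clock cycle and that each complete Perm function absorbs one 32-bit block (the SHAT-128 block size in Table 3); the function and constant names are ours.

```python
# Throughput of the Perm function versus unrolling (illustrative check of Table 7).
BLOCK_BITS = 32     # message bits absorbed per Perm call in SHAT-128
TOTAL_STEPS = 48    # STEPs in one complete Perm function


def throughput_mbps(rounds: int, clock_mhz: float = 10.0) -> float:
    """Assumes one iteration round per clock cycle, so a block takes `rounds` cycles."""
    return BLOCK_BITS * clock_mhz / rounds


for rounds in (48, 24, 16, 12, 8, 6, 4, 3, 2, 1):
    steps_per_round = TOTAL_STEPS // rounds
    print(f"{rounds:2d} rounds, {steps_per_round:2d} STEPs per round: "
          f"{throughput_mbps(rounds):7.2f} Mbps")

speedup_exact = throughput_mbps(1) / throughput_mbps(48)
speedup_from_table = 320.00 / 6.67   # ratio of the rounded Table 7 entries
print(f"speed-up: {speedup_exact:.2f} (unrounded), {speedup_from_table:.3f} (from table values)")
```

Under these assumptions the unrounded speed-up of the fully unrolled design is 48; dividing the rounded table entries gives about 47.976, which is consistent with the 47.97 reported above.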

Finally, we implement the pipeline and parallelism technique to reconstruct the STEP block. As shown in Table 6, compared with the performance of the original circuit, the critical path delay is reduced by up to 6.31%, while the power and area increase by roughly 8% at most.

4. Low Power Design for Hash Function

Low power design is a significant consideration in hardware implementation. The amount of power consumed determines a device's lifetime, reliability, and energy cost. Thus low power techniques are applied to nearly every application nowadays. There are many methods to reduce power consumption, such as clock gating and power gating, which address dynamic power and leakage power, respectively. Lowering the clock frequency also reduces power dissipation dramatically.

Firstly, we propose the frequency trade-off technique. By using this method, we obtain a range of frequency values for making a trade-off between low power consumption and high throughput of the hash function. Secondly, we construct a hash encryption system which includes an input data padding unit, RAM registers, the main hash computing construction, a message digest extraction component, and a main control unit. Thirdly, by analyzing the idle mode and control signals of this hash encryption system, a load-enable based clock gating scheme is applied to reduce the dynamic power consumption.

4.1. Frequency Trade-Off Technique. According to (1), reducing the clock frequency is an effective method to decrease dynamic power dissipation linearly. In Section 2.2, we discussed the DVFS technique. By collecting information about workload and temperature, DVFS determines a clock frequency sufficient for the required performance. However, modifying the clock frequency at RTL is not easy; normally we treat the clock frequency as constant. Also, dynamic frequency scaling reduces the number of operations a system can issue in a given amount of time, thus reducing performance. Therefore, there is an issue we need to consider: a high clock frequency brings high throughput, but the dramatically increased dynamic power consumption is a critical drawback. A low clock frequency minimizes dynamic power dissipation; however, it decreases the throughput as well.

However, according to the unfolding transformation technique introduced in Section 3.3, the maximum frequency of the Perm function decreases as the number of unrolling loops increases. It means that we can decrease the clock frequency while increasing the throughput of the hash algorithm. Thus this unrolling transformation technique achieves high performance without a high clock frequency. Given this advantage, by choosing a proper clock frequency, we can make a trade-off between high performance and low power consumption.

Next, we explain how to get this range of frequency values from the two performance bounds. First, we obtain two values for the rolling Perm circuit: the dynamic power consumption $P_1$ and the clock frequency $f_1$, which is determined by the requirements of the circuit design (the clock period computed from $f_1$ must be not less than the critical path delay). Then, according to (8), we can get the throughput $T_1$ at this frequency. Thus the two performance bounds are defined in (13), where $n$ is the number of iteration rounds in one Perm function with rolling STEPs:

\[
P_{\max} = P_1 \cdot n, \qquad T_{\min} = T_1. \tag{13}
\]

This method can be defined as follows. Referring to the performance of the original folding circuit (we assume that this circuit is the one with 48 iteration rounds in one Perm function), each unfolding transformation design with a different number of unrolling STEPs (2, 3, ..., 48) has two performance bounds: one is the maximum dynamic power and the other is the minimum throughput of the circuit. These two performance bounds are used to determine the boundary of the proper frequency range for each unfolding transformation circuit. It means that when we choose a specific clock frequency in this range, the total dynamic power consumption of that PERM function will be not more than the defined maximum dynamic power $P_{\max}$, and its throughput will be not less than the fixed minimum throughput $T_{\min}$.

Figure 8: Hash encryption system (blocks: Receiver, RAM, Main control, LCD displayer, Hash process, Clock divider; labels: Input message, Digest display, Output digest, n × br bits).

This clock frequency range gives us many different choices for different circuit designs that use the unfolding transformation technique. The results of this frequency trade-off technique are shown in Table 9 in Section 4.4.
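To make the two bounds concrete, the following sketch derives a frequency window for each unrolled design from the Table 7 measurements. The formulas are our reading of the method rather than a statement of the authors' exact procedure: we assume that dynamic power scales linearly with clock frequency, as (1) suggests, that the total power of a design with $i$ iteration rounds is $i$ times its per-round power, and that throughput is proportional to the clock frequency divided by the number of rounds. All names in the code are hypothetical.

```python
# Frequency window per unrolled design, derived from Table 7 (illustrative sketch).
from typing import Tuple

TOTAL_ROUNDS = 48        # rounds of the original rolled circuit
BASE_CLOCK_MHZ = 10.0    # clock at which Table 7 was measured

# iteration rounds -> dynamic power in microwatts at 10 MHz (copied from Table 7)
POWER_UW = {48: 27.27, 24: 78.34, 16: 131.78, 12: 185.06, 8: 291.77,
            6: 398.42, 4: 611.78, 3: 825.06, 2: 1251.70, 1: 2509.70}

P_MAX_UW = POWER_UW[TOTAL_ROUNDS] * TOTAL_ROUNDS   # P_max = P_1 * n


def frequency_window_mhz(rounds: int) -> Tuple[float, float]:
    """Return (f_low, f_high) in MHz for a design with `rounds` iteration rounds."""
    # Throughput is proportional to f / rounds, so T >= T_min gives the lower bound.
    f_low = BASE_CLOCK_MHZ * rounds / TOTAL_ROUNDS
    # Total power per Perm is rounds * P(f), with P(f) assumed linear in f.
    f_high = P_MAX_UW * BASE_CLOCK_MHZ / (rounds * POWER_UW[rounds])
    return f_low, f_high


for rounds in (24, 16, 12, 8, 6, 4, 3, 2, 1):
    f_low, f_high = frequency_window_mhz(rounds)
    print(f"{rounds:2d} rounds: {f_low:.2f} MHz < f < {f_high:.3f} MHz")
```

With these assumptions the lower bounds match the frequency column of Table 9 after rounding, and the upper bounds match it after truncation to two decimals (for example, 5.00 MHz and about 6.962 MHz for 24 rounds).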

4.2. Hash Encryption System Design. The hash encryption system is divided into 5 main parts, as shown in Figure 8.

Firstly, the receiver and RAM section is actually our padding unit. We use serial communication to connect the PC and the hash encryption system. Thus we need a clock divider to generate a clock that is synchronous with the Baud rate of the serial communication. We choose 4800 baud as our transmission Baud rate, which is not a high speed, in order to keep the error rate low (less than 3%). In this case, one baud represents 1 bit. Our transmission format is one start bit "0", then an 8-bit message, and one finish bit "1". The start bit and finish bit are added to the transmitted message bits automatically. The sampling rate of the receiver is 16, and the FPGA board provides a 100 MHz clock frequency. Thus the clock period used in sampling is 1302 times the provided 100 MHz clock period, as shown in (14). The resulting error is 0.0064%, which is less than 3%.

\[
\text{Sampling Clock Cycles} = \frac{\text{Clock Frequency}}{\text{Baud rate} \cdot \text{Sampling Rate}} = \frac{100\,\text{MHz}}{16 \times 4800\,\text{B/s}} \approx 1302. \tag{14}
\]
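The divider value and the quoted error can be checked numerically. The short sketch below uses only the figures given in the text (100 MHz board clock, 4800 baud, 16 samples per bit); the variable names are ours, and the error is taken to be the relative difference between the ideal and the rounded divider, which is our assumption about how the figure was obtained.

```python
# Clock divider for 16x oversampling of a 4800 baud serial link (illustrative check).
CLOCK_HZ = 100_000_000   # FPGA board clock
BAUD_RATE = 4800         # bits per second on the serial link
SAMPLING_RATE = 16       # receiver samples per bit

ideal_divider = CLOCK_HZ / (BAUD_RATE * SAMPLING_RATE)   # about 1302.083
divider = round(ideal_divider)                           # integer value used in hardware
error_pct = abs(ideal_divider - divider) / ideal_divider * 100
actual_baud = CLOCK_HZ / (divider * SAMPLING_RATE)

print(f"ideal divider: {ideal_divider:.3f}")
print(f"integer divider: {divider}")
print(f"relative error: {error_pct:.4f} %")
print(f"effective baud rate: {actual_baud:.2f}")
```

Rounding the divider to 1302 leaves a relative error of about 0.0064%, far below the 3% tolerance mentioned above.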

The liquid crystal display (LCD) limits the number of characters we can display to 32 hexadecimal characters; this number matches the number of digest bits of SHAT-128, since 32 characters of 4 bits each give 128 bits. Thus our $br$ for each padded block is set to 32 bits, which consist of eight 4-bit hexadecimal numbers.

Secondly, the hash function which we introduced in Section 3 is designed as a sponge construction, as shown in Figure 4. After $n$ 32-bit message blocks are absorbed, a 128-bit digest is squeezed out.

Finally, the main control unit is designed to manage the working order among the receiver, the hash process, and the LCD display. Figure 9 shows the pipelined operation of the system.


[7] ldquoSponge functionrdquo WIKIPEDIA httpenwikipediaorgwikiSponge function

[8] L Li Power optimization from register transfer level to transistorlevel in deeply scaled CMOS technology [PhD thesis] IllinoisInstitute of Technology Chicago Ill USA 2012

[9] N Weste and D Harris CMOS VLSI Design A Circuits andSystems Perspective Addison-Wesley Reading Mass USA2010

[10] Y Zhang Q Tong L Li et al ldquoAutomatic register transferlevel CAD tool design for advanced clock gating and lowpower schemesrdquo in Proceeding of the International SoC DesignConference (ISOCC rsquo12) pp 21ndash24 Jeju Island Republic ofKorea 2012

[11] K Aoki T Ichikawa and M Kanda ldquoSpecification of Camel-liamdasha 128-bit block cipherrdquo Nippon Telegraphy and TelephoneCorporation Mitsubishi Electric Corporation 2000

[12] Y K Lee H Chan and I Verbauwhede ldquoThroughput opti-mized SHA-1 architecture using unfolding transformationrdquoin Proceedings of the 17th IEEE International Conference onApplication-Specific Systems Architectures and Processors (ASAPrsquo06) pp 354ndash359 Steamboat Springs Colo USA September2006

[13] H Michail A P Kakarountas O Koufopavlou and C EGoutis ldquoA low-power and high-throughput implementation ofthe SHA-1 hash functionrdquo in Proceedings of the IEEE Interna-tional Symposium on Circuits and Systems (ISCAS rsquo05) vol 4pp 4086ndash4089 Kobe Japan May 2005

[14] S Badel N Dagtekin J Nakahara Jr et al ldquoARMADILLO amulti-purpose cryptographic primitive dedicated to hardwarerdquoin Cryptographic Hardware and Embedded Systems CHES 2010vol 6225 of Lecture Notes in Computer Science pp 398ndash4122010

[15] K Lin Y Zhang K Choi J Kang and S Hong ldquoMulti-purposebimodal cryptographic algorithm and its hardware implemen-tationrdquo in Proceedings of the FTRA International Conference onAdvanced IT Engineering and Management (FTRA AIM rsquo13)Seoul Korea 2013



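Table 7: Performance results of unrolling steps constructions.

| Number of iteration rounds | Area (GE) | Delay (ns) | Power (µW) | Throughput at 10 MHz (Mbps) |
|---|---|---|---|---|
| 48 | 965 | 0.94 | 27.27 | 6.67 |
| 24 | 1930 | 1.92 | 78.34 | 13.33 |
| 16 | 2895 | 2.91 | 131.78 | 20.00 |
| 12 | 3860 | 3.91 | 185.06 | 26.67 |
| 8 | 5790 | 5.90 | 291.77 | 40.00 |
| 6 | 7720 | 7.89 | 398.42 | 53.33 |
| 4 | 11580 | 11.87 | 611.78 | 80.00 |
| 3 | 15440 | 15.84 | 825.06 | 106.67 |
| 2 | 23160 | 23.80 | 1251.70 | 160.00 |
| 1 | 46320 | 47.71 | 2509.70 | 320.00 |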

times higher hardware efficiency of SHAT-256. Secondly, the throughput of SHAT-256 (3193 GE) is on average 412.91 times higher than that of two hash algorithms, PHOTON-256-156-round (2177 GE) and SPONGENT-256-9520-round (1950 GE); although the area of SHAT-256 is larger, its FOM is still 158.25 times higher than that of these two algorithms. Thirdly, compared with SHA-256, ARMADILLO2-E, BLAKE, PHOTON-256-156-round, and SPONGENT-256-140-round, the throughput of SHAT-256 is 4.65 times higher on average, and its area is on average only 49.15% of that of these hash algorithms. Therefore, the FOM of SHAT-256 is 119.14 times higher on average.

In Table 5, the throughput of SHA-384 is 6.09 times higher than that of SHAT-384; however, the area of SHA-384 is 9.11 times larger. As a result, the hardware efficiency of SHAT-384 is 13.64 times higher than that of SHA-384.

Then we implement the unfolding transformation technique with 10 different numbers of unrolling loops (1, 2, ..., 48), using 45 nm CMOS technology at 10 MHz, to evaluate the performance of SHAT-128; the results are shown in Table 7. As Table 7 shows, the throughput of the PERM function can be up to 47.97 times higher than that of the original circuit, which is 6.67 Mbps. However, area, delay, and power increase dramatically as a penalty.
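The throughput column of Table 7 can be reproduced with a short calculation. The sketch below assumes that throughput equals the block size br divided by the time of one Perm call (iteration rounds times the clock period); this formula is our reading of the table, not a quotation of the paper's throughput equation, and the function name is hypothetical.

```python
# Assumed model: throughput = br * f / n, where n is the number of
# iteration rounds (clock cycles) needed for one Perm call.
BR_BITS = 32       # block size br of SHAT-128, as stated in Section 4.2
CLOCK_MHZ = 10.0   # evaluation clock frequency used for Table 7


def throughput_mbps(iteration_rounds, clock_mhz=CLOCK_MHZ, br_bits=BR_BITS):
    """Throughput in Mbps when one Perm call takes `iteration_rounds` cycles."""
    return br_bits * clock_mhz / iteration_rounds


if __name__ == "__main__":
    for n in (48, 24, 16, 12, 8, 6, 4, 3, 2, 1):
        print(f"{n:2d} rounds: {throughput_mbps(n):7.2f} Mbps")
```

Under this assumption the fully unrolled circuit is exactly 48 times faster than the rolled one; the reported 47.97 follows from dividing the rounded table entries 320.00 and 6.67.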

Finally, we implement the pipeline and parallelism technique to reconstruct the STEP block. As shown in Table 6, compared with the original circuit, the critical path delay is reduced by up to 6.31%, while the power and area increase within 8%.

4. Low Power Design for Hash Function

Low power design is a significant consideration in hardware implementation. The power consumption determines a device's lifetime, reliability, and energy cost. Thus low power techniques are now applied to nearly every application. There are many methods to reduce power consumption, such as clock gating and power gating, which target dynamic power and leakage power, respectively. Lowering the clock frequency also reduces power dissipation dramatically.

Firstly, we propose the frequency trade-off technique. With this method we obtain a range of frequency values for making a trade-off between low power consumption and high throughput of the hash function. Secondly, we construct a hash encryption system that includes an input data padding unit, RAM registers, the main hash computing construction, a message digest extraction component, and a main control unit. Thirdly, by analyzing the idle mode and control signals of this hash encryption system, we apply a load-enable based clock gating scheme to reduce the dynamic power consumption.

4.1. Frequency Trade-Off Technique. According to (1), reducing the clock frequency is an effective way to decrease dynamic power dissipation linearly. In Section 2.2 we discussed the DVFS technique: by collecting information about workload and temperature, DVFS determines a clock frequency sufficient for the required performance. However, modifying the clock frequency at RTL is not easy, so we normally treat the clock frequency as a constant. Moreover, dynamic frequency scaling reduces the number of operations a system can issue in a given amount of time, thus reducing performance. This raises an issue: a high clock frequency brings high throughput, but the dramatically increased dynamic power consumption is a critical drawback; a low clock frequency minimizes the dynamic power dissipation, but it decreases the throughput as well.

However, according to the unfolding transformation technique introduced in Section 3.3, the maximum frequency of the Perm function decreases as the number of unrolling loops increases. This means that we can decrease the clock frequency while still increasing the throughput of the hash algorithm. Thus the unfolding transformation technique achieves high performance without a high clock frequency. Exploiting this advantage, we can choose a proper clock frequency to make a trade-off between high performance and low power consumption.

Next, we explain how to get this scope of frequency values from the two performance bounds. First, we obtain two values for the rolling Perm circuit: the dynamic power consumption $P_1$ and the clock frequency $f_1$, which is constrained by the circuit design (the clock period computed from $f_1$ must not be less than the critical path delay). Then, according to (8), we can get the throughput $T_1$ at this frequency. The two performance bounds are then defined in (13), where $n$ is the number of iteration rounds in one Perm function with rolling STEPs:

$$P_{\max} = P_1 \cdot n, \qquad T_{\min} = T_1. \tag{13}$$

This method can be stated as follows. Taking the performance of the original folding circuit as the reference (we assume that this circuit is the one with 48 iteration rounds in one Perm function), each unfolding transformation design with a different number of unrolling STEPs (2, 3, ..., 48) has two performance bounds: one is the maximum dynamic power and the other is the minimum throughput of the circuit. These two bounds determine the boundary of the proper frequency range for each unfolding transformation circuit. That is, when we choose a specific clock frequency within this range, the total dynamic power consumption of that PERM function will not exceed the defined maximum dynamic power $P_{\max}$, and its throughput will not fall below the fixed minimum throughput $T_{\min}$.

Figure 8: Hash encryption system (block diagram: input message, receiver, RAM, clock divider, main control, hash process, LCD displayer, digest display; the padded message is $n \times br$ bits).

This clock frequency scope gives us many different choices for different circuit designs that use the unfolding transformation technique. The results of this frequency trade-off technique are shown in Table 9 in Section 4.4.
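The frequency window implied by the two bounds can be sketched numerically. The code below is a minimal illustration, assuming that dynamic power scales linearly with clock frequency, that the total power of one Perm call is the per-cycle power from Table 7 multiplied by the number of iteration rounds, and that throughput is br times frequency divided by the number of rounds. The function and constant names are hypothetical.

```python
REF_MHZ = 10.0          # frequency at which the Table 7 values were obtained
BR_BITS = 32            # block size br of SHAT-128
ROLLED_ROUNDS = 48      # iteration rounds of the original (rolled) circuit
ROLLED_POWER_UW = 27.27 # power of the rolled circuit at 10 MHz (Table 7)

P_MAX_UW = ROLLED_POWER_UW * ROLLED_ROUNDS        # Eq. (13): P_max = P_1 * n
T_MIN_MBPS = BR_BITS * REF_MHZ / ROLLED_ROUNDS    # Eq. (13): T_min = T_1


def frequency_window_mhz(rounds, power_uw_at_ref):
    """Return (f_low, f_high) in MHz for a design with `rounds` iteration rounds.

    f_low keeps the throughput at or above T_min; f_high keeps the total
    dynamic power of one Perm call at or below P_max (linear scaling assumed).
    """
    f_low = T_MIN_MBPS * rounds / BR_BITS
    f_high = REF_MHZ * P_MAX_UW / (rounds * power_uw_at_ref)
    return f_low, f_high


if __name__ == "__main__":
    table7_power = {24: 78.34, 16: 131.78, 12: 185.06, 8: 291.77,
                    6: 398.42, 4: 611.78, 3: 825.06, 2: 1251.70, 1: 2509.70}
    for rounds, power in table7_power.items():
        low, high = frequency_window_mhz(rounds, power)
        print(f"{rounds:2d} rounds: {low:.3f} MHz < f < {high:.3f} MHz")
```

Under these assumptions the computed windows agree with the ranges listed in Table 9 to within 0.01 MHz.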

4.2. Hash Encryption System Design. The hash encryption system is divided into 5 main parts, as shown in Figure 8.

Firstly, the receiver and RAM section is actually our padding unit. We use serial communication to connect the PC and the hash encryption system. Thus we need a clock divider to generate a clock that is synchronous with the Baud rate of the serial communication. We choose 4800 Bauds as our transmission Baud rate, a modest speed chosen to keep the error rate low (less than 3%). In this case, one Baud represents 1 bit. Each transmission frame consists of one start bit "0", then an 8-bit message, and one finish bit "1". The start bit and finish bit are added to the transmitted message bits automatically. The sampling rate of the receiver is 16, and the FPGA board provides a 100 MHz clock. Thus the clock period used in sampling is 1302 times the provided 100 MHz clock period, as shown in (14). The resulting error is 0.0064%, which is less than 3%.

$$\text{Sampling Clock Cycles} = \frac{\text{Clock Frequency}}{\text{Baud rate} \cdot \text{Sampling Rate}} = \frac{100\,\text{MHz}}{16 \times 4800\,\text{B/s}} \approx 1302. \tag{14}$$
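The divider value and the rounding error quoted above can be checked with a few lines. This is only an arithmetic sketch of (14); the function name is hypothetical, and the error is taken to be the relative difference between the exact ratio and the integer divider.

```python
CLOCK_HZ = 100_000_000  # FPGA board clock
BAUD = 4800             # serial Baud rate
OVERSAMPLING = 16       # receiver samples per bit


def sampling_divider(clock_hz=CLOCK_HZ, baud=BAUD, oversampling=OVERSAMPLING):
    """Return (integer divider, exact ratio, relative error in percent)."""
    exact = clock_hz / (baud * oversampling)
    divider = round(exact)  # the hardware counter can only use an integer
    error_pct = abs(exact - divider) / exact * 100
    return divider, exact, error_pct


if __name__ == "__main__":
    divider, exact, error_pct = sampling_divider()
    print(f"divider={divider}, exact={exact:.4f}, error={error_pct:.4f}%")
```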

Because the liquid crystal display (LCD) limits the number of characters we can display to 32 hexadecimal characters, this number suits the number of digest bits of SHAT-128 (32 characters of 4 bits each give 128 bits). Thus our $br$ for each padded block is set to 32 bits, which consist of eight 4-bit hexadecimal numbers.

Secondly, the hash function, which we introduced in Section 3, is designed as a sponge construction, as shown in Figure 4. After absorbing $n$ 32-bit message blocks, a 128-bit digest is squeezed out.

Finally, the main control unit is designed to manage the working order among the receiver, hash process, and LCD display. Figure 9 shows the pipelined operation of the system.

Because we use serial communication, the speed is slow. We use 4800 Bauds as our Baud rate for a low error rate; thus each 32-bit block needs roughly 7 ms. For example, if seven 32-bit blocks need to be transmitted, roughly 50 ms is spent on data receiving and padding. The hash function used in this system performs one STEP each round, which means that there are 48 iteration rounds for a complete Perm function; even so, hash processing needs only roughly 6 µs. The LCD displaying period also costs much time: even though we can finish LCD initialization before we get the hash digest, we still need roughly 15 ms to display all the data.
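The receive times quoted above can be approximated as follows. The sketch assumes that the "roughly 7 ms" figure counts only the 32 payload bits of a block; counting the start and finish bits of each byte as well (10 bits per byte) gives a somewhat larger value. The function name and the `framed` flag are hypothetical.

```python
def block_time_ms(block_bits=32, baud=4800, framed=False):
    """Time in ms to receive one block over the serial link.

    With framed=True each 8-bit byte is sent as 10 bits (start + 8 data + finish).
    """
    bits = block_bits // 8 * 10 if framed else block_bits
    return bits / baud * 1000


if __name__ == "__main__":
    print(f"payload only: {block_time_ms():.2f} ms per block")
    print(f"with framing: {block_time_ms(framed=True):.2f} ms per block")
    print(f"seven blocks: {7 * block_time_ms():.1f} ms (payload only)")
```

Either way, the receive time is several orders of magnitude longer than the microsecond-scale hash processing, which is why the hash unit is idle for most of each cycle.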

4.3. Load-Enable Based Clock Gating. In this section, we introduce the load-enable based clock gating technique for the hash encryption system.

Clock gating is the most widely used low power technique at RTL. At RTL it is more practical to control the toggle rate of gate outputs than any of the other three components of dynamic power, namely $V_{DD}$, clock frequency, and gate output capacitance. According to Figure 9, the hash encryption system is organized as a pipeline. The finishing signal of each process can be treated as the enable signal in load-enable based clock gating, as shown in Figure 3. On the other hand, the XOR-based clock gating technique needs the outputs of single-level flip-flops to be specified, which is not easily done in our encryption system; thus load-enable based clock gating is our best option as a low power method.

As shown in Figure 10, three signal pairs realize this load-enable based clock gating: en_div and fsh_r, en_h and fsh_h, and en_lcd and fsh_lcd. Because the receiver runs at a specific clock frequency that corresponds to the serial communication, the main control unit does not gate the clock signal of the receiver directly; instead, by controlling the clock signal of the clock divider with en_div, the receiver can be properly managed.

Figure 9 gives the three operation phases of the encryption system. In the first phase, the en_div and en_lcd signals are asserted to logic one and en_h is asserted to logic zero; thus the receiver starts receiving input messages and padding them into RAM. In the meantime, the system begins the initialization process for the LCD displayer, while the hash processing unit waits for the padded input message. Because serial communication takes a long time, owing to the low Baud rate and the bit-by-bit transmission, LCD displayer initialization can be finished before the padded message is ready. Thus en_lcd can be asserted to logic zero by the main control unit when fsh_lcd switches to logic one.

Figure 9: Three phases of hash encryption system. Receiver: data receiving and padding (phase one), idle (phases two and three). Hash: idle (phase one), hash processing (phase two), idle (phase three). LCD: initialization, then idle (phase one), idle (phase two), LCD displaying (phase three).

Figure 10: Control signals of hash encryption system (same blocks as Figure 8, with the signals en_div, fsh_r, clk_r, en_h, fsh_h, en_lcd, and fsh_lcd).

During the second phase, the padded message is ready, so fsh_r switches to logic one. Then en_div is asserted to zero, which means that the clock divider is turned off and no divided clock is produced; thus the receiver stops working. In this phase, en_h is asserted to logic one for hash encryption, which is our core function, and en_lcd remains zero while waiting for the hash digest generated by hash processing.

The system enters the third phase when the fsh_h signal switches to logic one. In this phase the hash digest is ready, so both the receiver and the hash process are in idle mode, which means that en_div and en_h are both asserted to logic zero. Signal en_lcd is asserted to logic one to start LCD displaying and is asserted back to zero when the displaying process is finished. This is the end of the whole sequence; the device is then turned off or repeats these three phases for another input message.

By analyzing the construction and operation of the hash encryption system, we can determine the idle time of each component. Then, by applying load-enable based clock gating to each component, the dynamic power dissipation of the system can be reduced, as shown in Table 8 in Section 4.4.
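The enable-signal behaviour described for the three phases can be summarized as a small behavioural model. This is a sketch in Python rather than RTL, and it reflects only our reading of the text: the class and function names are hypothetical, while the signal names (en_div, en_h, en_lcd, fsh_r, fsh_h, fsh_lcd) are those of Figure 10.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Enables:
    """Clock-enable outputs of the main control unit (1 = clock running)."""
    en_div: int  # clock divider, and therefore the receiver
    en_h: int    # hash processing unit
    en_lcd: int  # LCD displayer


def enables(phase, fsh_lcd):
    """Enable values for a phase; fsh_lcd is True once the LCD task of
    that phase (initialization or displaying) has finished."""
    if phase == 1:  # receiving and padding, LCD initializing
        return Enables(en_div=1, en_h=0, en_lcd=0 if fsh_lcd else 1)
    if phase == 2:  # hash processing only
        return Enables(en_div=0, en_h=1, en_lcd=0)
    if phase == 3:  # digest display only
        return Enables(en_div=0, en_h=0, en_lcd=0 if fsh_lcd else 1)
    raise ValueError("phase must be 1, 2, or 3")


def next_phase(phase, fsh_r, fsh_h):
    """Advance on the finishing signal of the currently active stage."""
    if phase == 1 and fsh_r:
        return 2
    if phase == 2 and fsh_h:
        return 3
    return phase


if __name__ == "__main__":
    print(enables(1, fsh_lcd=False))
    print(enables(next_phase(1, fsh_r=True, fsh_h=False), fsh_lcd=False))
    print(enables(3, fsh_lcd=True))
```

In every phase at most two of the three blocks are clocked, and whenever a block is idle its enable is zero, which is where the dynamic power saving comes from.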

44 Experimental Results By using 10MHz clock frequencyand 45 nm CMOS technology the results of frequency trade-off technique are shown in Tables 9 10 and 11 Table 9 showsthat the area and critical path delay are not changed compar-ing with the unfolding transformation technique Tables 10

Table 8 Hardware implementationwithwithout load-enable basedclock gating

Systemtype

Area Delay Power

(GE) Increase() (ns) Increase

() (120583W) Reduction()

Original 14053 na 163 na 183020 naClockgated 14565 364 172 552 158036 1365

Table 9 Area and delay performances of frequency trade-offtechnique

Number ofiteration rounds

Area(GE)

Delay(ns)

Frequency(MHz)

48 965 094 100024 1930 192 500 lt 119891

24lt 696

16 2895 291 333 lt 11989116lt 620

12 3860 391 250 lt 11989112lt 589

8 5790 590 167 lt 1198918lt 560

6 7720 789 125 lt 1198916lt 547

4 11580 1187 083 lt 1198914lt 534

3 15440 1584 063 lt 1198913lt 528

2 23160 2380 042 lt 1198912lt 522

1 46320 4771 021 lt 1198911lt 521

and 11 give us the variation of dynamic power consumptionand throughput with frequency trade-off method Note that119891119894stands for frequency 119879

119894stands for throughput and 119879

119894pct isthe percentage of increasing comparing with the minimumthroughput (119879min) which is 667Mbps 119875

119894means the total

dynamic power consumption by finishing a complete Permfunction and 119875

119894pct is the percentage of power reductioncomparing with the maximum power consumption (119875max)defined as 130896 120583W which is calculated from the productof 48 (number of iteration rounds) and 2727120583W(as shown inTable 7)Note that 119894 stands for the number of iteration rounds

Then we apply load-enable based clock gating schemeto hash encryption system by using 100MHz clock fre-quency which can be provided on FPGA board and 45 nmCMOS technology As shown in Table 8 the dynamic powerdecreases 1365 However 364 increased area and 552increased critical path delay are sacrificed

International Journal of Distributed Sensor Networks 11

Table 10: Dynamic power consumption of frequency trade-off technique.

| Number of iteration rounds | Power (µW) | Reduction (%) | Frequency (MHz) |
|---|---|---|---|
| 48 | 1308.96 | n/a | 10.00 |
| 24 | $940.08 < P_{24} < 1308.48$ | $28.18 > P_{24\text{pct}} > 0.04$ | $5.00 < f_{24} < 6.96$ |
| 16 | $702.88 < P_{16} < 1307.20$ | $46.30 > P_{16\text{pct}} > 0.13$ | $3.33 < f_{16} < 6.20$ |
| 12 | $555.12 < P_{12} < 1308.00$ | $57.59 > P_{12\text{pct}} > 0.07$ | $2.50 < f_{12} < 5.89$ |
| 8 | $389.04 < P_{8} < 1307.12$ | $70.28 > P_{8\text{pct}} > 0.14$ | $1.67 < f_{8} < 5.60$ |
| 6 | $298.80 < P_{6} < 1307.64$ | $77.17 > P_{6\text{pct}} > 0.10$ | $1.25 < f_{6} < 5.47$ |
| 4 | $203.92 < P_{4} < 1306.76$ | $84.42 > P_{4\text{pct}} > 0.17$ | $0.83 < f_{4} < 5.34$ |
| 3 | $154.71 < P_{3} < 1306.89$ | $88.18 > P_{3\text{pct}} > 0.16$ | $0.63 < f_{3} < 5.28$ |
| 2 | $104.30 < P_{2} < 1306.76$ | $92.03 > P_{2\text{pct}} > 0.17$ | $0.42 < f_{2} < 5.22$ |
| 1 | $52.29 < P_{1} < 1307.60$ | $96.01 > P_{1\text{pct}} > 0.10$ | $0.21 < f_{1} < 5.21$ |

Table 11: Throughput performances of frequency trade-off technique.

| Number of iteration rounds | Throughput (Mbps) | Improvement (%) | Frequency (MHz) |
|---|---|---|---|
| 48 | 6.67 | n/a | 10.00 |
| 24 | $6.67 < T_{24} < 9.28$ | $0.00 < T_{24\text{pct}} < 39.13$ | $5.00 < f_{24} < 6.96$ |
| 16 | $6.67 < T_{16} < 12.40$ | $0.00 < T_{16\text{pct}} < 85.91$ | $3.33 < f_{16} < 6.20$ |
| 12 | $6.67 < T_{12} < 15.71$ | $0.00 < T_{12\text{pct}} < 135.53$ | $2.50 < f_{12} < 5.89$ |
| 8 | $6.67 < T_{8} < 22.40$ | $0.00 < T_{8\text{pct}} < 235.83$ | $1.67 < f_{8} < 5.60$ |
| 6 | $6.67 < T_{6} < 29.17$ | $0.00 < T_{6\text{pct}} < 337.33$ | $1.25 < f_{6} < 5.47$ |
| 4 | $6.67 < T_{4} < 42.72$ | $0.00 < T_{4\text{pct}} < 540.48$ | $0.83 < f_{4} < 5.34$ |
| 3 | $6.67 < T_{3} < 56.32$ | $0.00 < T_{3\text{pct}} < 744.38$ | $0.63 < f_{3} < 5.28$ |
| 2 | $6.67 < T_{2} < 83.52$ | $0.00 < T_{2\text{pct}} < 1152.17$ | $0.42 < f_{2} < 5.22$ |
| 1 | $6.67 < T_{1} < 166.72$ | $0.00 < T_{1\text{pct}} < 2399.55$ | $0.21 < f_{1} < 5.21$ |

5. Conclusion

To achieve a high performance and low power hardware implementation of a cryptographic hash function that uses the sponge construction, we apply four techniques. Firstly, we use the unfolding transformation technique to improve the throughput of the hash function. Secondly, pipeline and parallelism design techniques are implemented to reduce the critical path delay by modifying the structure of the permutation function. Thirdly, a frequency trade-off technique is proposed to calculate a frequency scope that can be used to make a trade-off between low dynamic power consumption and high throughput of the hash function. Finally, a load-enable based clock gating scheme is applied in the hash encryption system to eliminate the wasted toggle rate of signals in the idle mode.

The experimental results show that the unfolding transformation technique can achieve up to 47.97 times higher throughput, the pipeline and parallelism methods give a 6.31% delay reduction, the load-enable based clock gating scheme decreases dynamic power consumption by 13.65%, and the frequency trade-off technique shows how to choose the clock frequency of the hash function to achieve low power consumption and high throughput.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This research was supported by the MKE (The Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program (NIPA-2012-H0301-12-3007) supervised by the NIPA (National IT Industry Promotion Agency).

References

[1] H. Michail and C. Goutis, "Holistic methodology for designing ultra high-speed SHA-1 hashing cryptographic module in hardware," in Proceedings of the IEEE International Conference on Electron Devices and Solid-State Circuits (EDSSC '08), pp. 1–4, Hong Kong, December 2008.

[2] "Cryptographic hash algorithm competition," NIST Computer Security Resource Center, http://csrc.nist.gov/groups/ST/hash/sha-3/index.html.

[3] B. Schneier, Applied Cryptography: Protocols, Algorithms, and Source Code in C, John Wiley & Sons, New York, NY, USA, 2nd edition, 1996.

[4] J. Nakajima and M. Mitsuru, "Performance analysis and parallel implementation of dedicated hash function," in Proceedings of the International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT '02), vol. 2332, pp. 165–180, Amsterdam, The Netherlands, 2002.

[5] P. C. van Oorschot, A. Somayaji, and G. Wurster, "Hardware-assisted circumvention of self-hashing software tamper resistance," IEEE Transactions on Dependable and Secure Computing, vol. 2, no. 2, pp. 82–92, 2005.

[6] G. Bertoni, J. Daemen, M. Peeters, and G. van Assche, "Cryptographic sponge functions," The Sponge Functions Corner, http://sponge.noekeon.org/index.html.

[7] "Sponge function," Wikipedia, http://en.wikipedia.org/wiki/Sponge_function.

[8] L. Li, Power optimization from register transfer level to transistor level in deeply scaled CMOS technology [Ph.D. thesis], Illinois Institute of Technology, Chicago, Ill, USA, 2012.

[9] N. Weste and D. Harris, CMOS VLSI Design: A Circuits and Systems Perspective, Addison-Wesley, Reading, Mass, USA, 2010.

[10] Y. Zhang, Q. Tong, L. Li, et al., "Automatic register transfer level CAD tool design for advanced clock gating and low power schemes," in Proceeding of the International SoC Design Conference (ISOCC '12), pp. 21–24, Jeju Island, Republic of Korea, 2012.

[11] K. Aoki, T. Ichikawa, and M. Kanda, "Specification of Camellia - a 128-bit block cipher," Nippon Telegraph and Telephone Corporation, Mitsubishi Electric Corporation, 2000.

[12] Y. K. Lee, H. Chan, and I. Verbauwhede, "Throughput optimized SHA-1 architecture using unfolding transformation," in Proceedings of the 17th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP '06), pp. 354–359, Steamboat Springs, Colo, USA, September 2006.

[13] H. Michail, A. P. Kakarountas, O. Koufopavlou, and C. E. Goutis, "A low-power and high-throughput implementation of the SHA-1 hash function," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '05), vol. 4, pp. 4086–4089, Kobe, Japan, May 2005.

[14] S. Badel, N. Dagtekin, J. Nakahara Jr., et al., "ARMADILLO: a multi-purpose cryptographic primitive dedicated to hardware," in Cryptographic Hardware and Embedded Systems, CHES 2010, vol. 6225 of Lecture Notes in Computer Science, pp. 398–412, 2010.

[15] K. Lin, Y. Zhang, K. Choi, J. Kang, and S. Hong, "Multi-purpose bimodal cryptographic algorithm and its hardware implementation," in Proceedings of the FTRA International Conference on Advanced IT, Engineering and Management (FTRA AIM '13), Seoul, Korea, 2013.


Page 9: Research Article High Performance and Low Power Hardware ...downloads.hindawi.com/journals/ijdsn/2014/736312.pdf · Research Article High Performance and Low Power Hardware Implementation

International Journal of Distributed Sensor Networks 9

Receiver

RAM

Maincontrol LCD

displayer

Hashprocess

Clockdivider

Inputmessage

Digestdisplay

Outputdigestn times br bits

and

Figure 8 Hash encryption system

Perm function) each unfolding transformation design withdifferent numbers of unrolling STEPs (2 3 48) has twoperformance bounds one is maximum dynamic power andthe other is minimum throughput of the circuit These twoperformance bounds are used to determine the boundary ofproper frequency range for each unfolding transformationcircuit It means that when we choose one specific clockfrequency in this value scope the total dynamic powerconsumption of that PERM function will be not more thandefined maximum dynamic power 119875max and its throughputwill be not less than that fixed minimum throughput 119879min

This clock frequency scope gives us many differentchoices for different circuit designs by using unfoldingtransformation techniqueThe results of this frequency trade-off technique are shown in Table 9 in Section 44

42 Hash Encryption System Design The hash encryptionsystem is divided into 5 main parts as shown in Figure 8

Firstly the receiver and RAM section is actually ourpadding unit We use serial communication technique toconnect PC and the hash encryption system Thus weneed clock divider to generate proper clock cycle to besynchronous with Baud rate of serial communication Wechoose 4800 Bauds as our transmission Baud rate which isnot a quick speed for low error rate (less than 3) In thiscase one Baud represents 1 bit Our rule of transmission isa one start bit ldquo0rdquo then 8-bit message and one finish bit ldquo1rdquoThis start bit and finish bit will be added into the transmissionmessage bits automatically the sampling rate of receiver is 16and FPGA board provides 100MHz clock frequency Thusthe clock period used in sampling is 1302 times provided100MHz clock period as shown in (14)This error is 00064less than 3

Sampling Clock Cycles =Clock Frequency

Baud rate sdot Sampling Rate

=100MHz

16 times 4800Bs

asymp 1302

(14)

Because the liquid crystal display (LCD) limits the numberof characters we can display which are 32 characters inhexadecimal this number is suitable for the number ofdigest bits of SHAT-128 Thus our 119887119903 for each padded blockis determined to be 32 bits which consist of eight 4-bithexadecimal numbers

Secondly hash functionwhichwe introduced in Section 3is designed as sponge construction as shown in Figure 4Absorbing 119899 32-bit message blocks there are 128 bits digestthat will be squeezed out

Finally, the main control unit manages the working order among the receiver, the hash process, and the LCD display. Figure 9 shows the pipelined operation of the system.

Because we use serial communication, data transfer is slow. With a Baud rate of 4800 Bauds, chosen for its low error rate, each 32-bit block needs roughly 7 ms. For example, if seven 32-bit blocks need to be transmitted, roughly 50 ms is spent on data receiving and padding. The hash function used in this system performs one STEP per round, which means that a complete Perm function takes 48 iteration rounds; even so, hash processing needs only roughly 6 μs. The LCD displaying period also costs considerable time: even though LCD initialization can be finished before the hash digest is available, we still need roughly 15 ms to display all of the data.
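The orders of magnitude quoted above can be reproduced with a small calculation. This is a sketch under stated assumptions: the hash unit is taken to run from the 100 MHz board clock, and the helper names are hypothetical. Counting only the 32 payload bits reproduces the roughly 7 ms per block; setting `bits_per_byte=10` would also count the start and finish bits.

```python
BAUD = 4800            # bits per second on the serial link
BLOCK_BITS = 32
CLK_HZ = 100e6         # assumption: hash unit clocked at the 100 MHz board clock
ROUNDS_PER_PERM = 48   # one STEP per round

def receive_time_ms(blocks, bits_per_byte=8):
    """Serial transfer time; bits_per_byte=10 would include start and finish bits."""
    bits_on_wire = blocks * (BLOCK_BITS // 8) * bits_per_byte
    return bits_on_wire / BAUD * 1e3

perm_time_us = ROUNDS_PER_PERM / CLK_HZ * 1e6

print(f"{receive_time_ms(1):.2f} ms per block")      # 6.67 ms per block
print(f"{receive_time_ms(7):.2f} ms for 7 blocks")   # 46.67 ms for 7 blocks
print(f"{perm_time_us:.2f} us per Perm call")        # 0.48 us per Perm call
```

The gap of roughly four orders of magnitude between receiving and hashing is what leaves the hash unit idle for most of the operating time.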

4.3. Load-Enable Based Clock Gating. In this section, we introduce the load-enable based clock gating technique for the hash encryption system.

Clock gating is the most widely used low power technique at RTL. Among the four components of dynamic power, the toggle rate of the gate outputs is more reasonable to address at RTL than the other three, namely $V_{DD}$, clock frequency, and gate output capacitance. According to Figure 9, the hash encryption system is organized as a pipeline. The finishing signal of each process can be treated as the enable signal in load-enable based clock gating, as shown in Figure 3. In contrast, the XOR-based clock gating technique needs the outputs of single-level flip-flops to be specified, which is not easily done in our encryption system; thus load-enable based clock gating is our best option as a low power method.

As shown in Figure 10, three signal pairs realize this load-enable based clock gating: en_div and fsh_r, en_h and fsh_h, and en_lcd and fsh_lcd. Because the receiver runs at a specific clock frequency that corresponds to the serial communication, the main control unit does not gate the clock signal of the receiver directly; instead, by controlling the clock signal of the clock divider with en_div, the receiver can be properly managed.

Figure 9 gives the three operation phases of the encryption system. In the first phase, the en_div and en_lcd signals are asserted to logic one and en_h is held at logic zero; thus the receiver starts receiving input messages and padding them into RAM. In the meantime, the system begins the initialization process for the LCD displayer, while the hash processing unit waits for the padded input message. Because serial communication takes a long time, owing to the low Baud rate and to its bit-by-bit transmission, LCD displayer initialization can be finished before the padded message is ready. Thus en_lcd can be set to logic zero by the main control unit when fsh_lcd switches to logic one.

Figure 9: Three phases of hash encryption system. (Receiver: data receiving and padding in phase one, idle in phases two and three. Hash: idle in phase one, hash processing in phase two, idle in phase three. LCD: initialization and then idle in phase one, idle in phase two, LCD displaying in phase three.)

Figure 10: Control signals of hash encryption system. (Blocks: receiver, RAM, main control, hash process, LCD displayer, and clock divider. Signals: en_div, fsh_r, en_h, fsh_h, en_lcd, fsh_lcd, and clk_r.)

During the second phase, the padded message is ready, so fsh_r switches to logic one. Then en_div is set to zero, which turns off the clock divider; no divided clock is produced, and thus the receiver stops working. In this phase, en_h is asserted to logic one for hash encryption, which is our core function. en_lcd remains zero, waiting for the hash digest generated by hash processing.

The system enters the third phase when the fsh_h signal switches to logic one. In this phase the hash digest is ready, so both the receiver and the hash process are in idle mode, which means that en_div and en_h are both at logic zero. Signal en_lcd is asserted to logic one to start LCD displaying and returns to zero when the displaying process is finished. This ends the whole operation; the device is then turned off or repeats these three phases for another input message.
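The three phases can be summarized as a small behavioral model of the main control unit. This is a sketch of the described behavior and not the authors' RTL: the function names, the `lcd_init_done` flag, and the use of phase 0 for "all units gated" are hypothetical choices made for illustration.

```python
def enables(phase, lcd_init_done=False):
    """Return (en_div, en_h, en_lcd) for the given phase (1, 2, or 3)."""
    if phase == 1:
        # Receiver runs; the LCD is clocked only until its initialization finishes.
        return (1, 0, 0 if lcd_init_done else 1)
    if phase == 2:
        return (0, 1, 0)      # only the hash process is clocked
    if phase == 3:
        return (0, 0, 1)      # only the LCD displayer is clocked
    return (0, 0, 0)          # everything gated once displaying has finished

def next_phase(phase, fsh_r, fsh_h, fsh_lcd):
    """Advance on the finishing signal of the unit that is currently active."""
    if phase == 1 and fsh_r:
        return 2
    if phase == 2 and fsh_h:
        return 3
    if phase == 3 and fsh_lcd:
        return 0              # done; a new message would restart at phase 1
    return phase

print(enables(1), enables(2), enables(3))   # (1, 0, 1) (0, 1, 0) (0, 0, 1)
```

In every phase at most two of the three units receive a clock, which is where the dynamic power saving comes from.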

By analyzing the construction and operation of the hash encryption system, we can determine the idle time of each component. Applying load-enable based clock gating to each component then reduces the dynamic power dissipation of the system, as shown in Table 8 in Section 4.4.

4.4. Experimental Results. Using a 10 MHz clock frequency and 45 nm CMOS technology, the results of the frequency trade-off technique are shown in Tables 9, 10, and 11. Table 9 shows that the area and critical path delay are unchanged compared with the unfolding transformation technique. Tables 10 and 11 give the variation of dynamic power consumption and throughput under the frequency trade-off method. Note that $f_i$ stands for frequency, $T_i$ stands for throughput, and $T_{i\,\mathrm{pct}}$ is the percentage increase compared with the minimum throughput ($T_{\min}$), which is 6.67 Mbps. $P_i$ is the total dynamic power consumed in finishing a complete Perm function, and $P_{i\,\mathrm{pct}}$ is the percentage power reduction compared with the maximum power consumption ($P_{\max}$), defined as 1308.96 μW, which is the product of 48 (the number of iteration rounds) and 27.27 μW (as shown in Table 7). Note that $i$ stands for the number of iteration rounds.

Table 8: Hardware implementation with/without load-enable based clock gating.

System type    Area (GE)   Increase (%)   Delay (ns)   Increase (%)   Power (μW)   Reduction (%)
Original       14053       n/a            1.63         n/a            1830.20      n/a
Clock gated    14565       3.64           1.72         5.52           1580.36      13.65

Table 9: Area and delay performances of frequency trade-off technique.

Number of iteration rounds   Area (GE)   Delay (ns)   Frequency (MHz)
48                           965         0.94         10.00
24                           1930        1.92         $5.00 < f_{24} < 6.96$
16                           2895        2.91         $3.33 < f_{16} < 6.20$
12                           3860        3.91         $2.50 < f_{12} < 5.89$
8                            5790        5.90         $1.67 < f_{8} < 5.60$
6                            7720        7.89         $1.25 < f_{6} < 5.47$
4                            11580       11.87        $0.83 < f_{4} < 5.34$
3                            15440       15.84        $0.63 < f_{3} < 5.28$
2                            23160       23.80        $0.42 < f_{2} < 5.22$
1                            46320       47.71        $0.21 < f_{1} < 5.21$

Then we apply the load-enable based clock gating scheme to the hash encryption system, using a 100 MHz clock frequency, which the FPGA board can provide, and 45 nm CMOS technology. As shown in Table 8, the dynamic power decreases by 13.65%. However, this costs a 3.64% increase in area and a 5.52% increase in critical path delay.
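The three percentages follow directly from the original and clock-gated entries of Table 8. A minimal check, with a hypothetical helper name:

```python
def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100

area_increase = pct_change(14053, 14565)          # area in GE
delay_increase = pct_change(1.63, 1.72)           # critical path delay in ns
power_reduction = -pct_change(1830.20, 1580.36)   # dynamic power in uW

print(f"{area_increase:.2f} {delay_increase:.2f} {power_reduction:.2f}")  # 3.64 5.52 13.65
```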


Table 10: Dynamic power consumption of frequency trade-off technique.

Number of iteration rounds   Power (μW)                    Reduction (%)                           Frequency (MHz)
48                           1308.96                       n/a                                     10.00
24                           $940.08 < P_{24} < 1308.48$   $0.04 < P_{24\,\mathrm{pct}} < 28.18$   $5.00 < f_{24} < 6.96$
16                           $702.88 < P_{16} < 1307.20$   $0.13 < P_{16\,\mathrm{pct}} < 46.30$   $3.33 < f_{16} < 6.20$
12                           $555.12 < P_{12} < 1308.00$   $0.07 < P_{12\,\mathrm{pct}} < 57.59$   $2.50 < f_{12} < 5.89$
8                            $389.04 < P_{8} < 1307.12$    $0.14 < P_{8\,\mathrm{pct}} < 70.28$    $1.67 < f_{8} < 5.60$
6                            $298.80 < P_{6} < 1307.64$    $0.10 < P_{6\,\mathrm{pct}} < 77.17$    $1.25 < f_{6} < 5.47$
4                            $203.92 < P_{4} < 1306.76$    $0.17 < P_{4\,\mathrm{pct}} < 84.42$    $0.83 < f_{4} < 5.34$
3                            $154.71 < P_{3} < 1306.89$    $0.16 < P_{3\,\mathrm{pct}} < 88.18$    $0.63 < f_{3} < 5.28$
2                            $104.30 < P_{2} < 1306.76$    $0.17 < P_{2\,\mathrm{pct}} < 92.03$    $0.42 < f_{2} < 5.22$
1                            $52.29 < P_{1} < 1307.60$     $0.10 < P_{1\,\mathrm{pct}} < 96.01$    $0.21 < f_{1} < 5.21$

Table 11: Throughput performances of frequency trade-off technique.

Number of iteration rounds   Throughput (Mbps)         Improvement (%)                           Frequency (MHz)
48                           6.67                      n/a                                       10.00
24                           $6.67 < T_{24} < 9.28$    $0.00 < T_{24\,\mathrm{pct}} < 39.13$     $5.00 < f_{24} < 6.96$
16                           $6.67 < T_{16} < 12.4$    $0.00 < T_{16\,\mathrm{pct}} < 85.91$     $3.33 < f_{16} < 6.20$
12                           $6.67 < T_{12} < 15.71$   $0.00 < T_{12\,\mathrm{pct}} < 135.53$    $2.50 < f_{12} < 5.89$
8                            $6.67 < T_{8} < 22.40$    $0.00 < T_{8\,\mathrm{pct}} < 235.83$     $1.67 < f_{8} < 5.60$
6                            $6.67 < T_{6} < 29.17$    $0.00 < T_{6\,\mathrm{pct}} < 337.33$     $1.25 < f_{6} < 5.47$
4                            $6.67 < T_{4} < 42.72$    $0.00 < T_{4\,\mathrm{pct}} < 540.48$     $0.83 < f_{4} < 5.34$
3                            $6.67 < T_{3} < 56.32$    $0.00 < T_{3\,\mathrm{pct}} < 744.38$     $0.63 < f_{3} < 5.28$
2                            $6.67 < T_{2} < 83.52$    $0.00 < T_{2\,\mathrm{pct}} < 1152.17$    $0.42 < f_{2} < 5.22$
1                            $6.67 < T_{1} < 166.72$   $0.00 < T_{1\,\mathrm{pct}} < 2399.55$    $0.21 < f_{1} < 5.21$
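The entries of Tables 10 and 11 are related by simple formulas, which the sketch below reproduces for the 24-round design. It assumes that one 32-bit block is processed per complete Perm function, so throughput is the block size times the frequency divided by the number of rounds; the helper names are hypothetical.

```python
BLOCK_BITS = 32
P_MAX_UW = 48 * 27.27     # 1308.96 uW, as defined in the text
T_MIN_MBPS = 6.67         # rounded minimum throughput quoted in the text

def throughput_mbps(f_mhz, rounds):
    """Throughput when one 32-bit block is processed every `rounds` clock cycles."""
    return BLOCK_BITS * f_mhz / rounds

def improvement_pct(t_mbps):
    """Throughput increase relative to T_min, in percent."""
    return (t_mbps / T_MIN_MBPS - 1) * 100

def reduction_pct(p_uw):
    """Power reduction relative to P_max, in percent."""
    return (1 - p_uw / P_MAX_UW) * 100

t24 = throughput_mbps(6.96, 24)
print(f"{t24:.2f} Mbps, +{improvement_pct(t24):.2f}%")    # 9.28 Mbps, +39.13%
print(f"{reduction_pct(940.08):.2f}% power reduction")    # 28.18% power reduction
```

The largest throughput gain and the largest power reduction sit at opposite ends of each frequency range, which is the trade-off the two tables express.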

5. Conclusion

In order to achieve a high performance and low power hardware implementation of a cryptographic hash function that uses sponge construction, firstly, we use the unfolding transformation technique to improve the throughput of the hash function; secondly, pipelining and parallelism design techniques are implemented to reduce the critical path delay by modifying the structure of the permutation function; thirdly, a frequency trade-off technique is proposed to calculate a frequency range that can be used to trade off low dynamic power consumption against high throughput of the hash function; finally, a load-enable based clock gating scheme is applied in the hash encryption system to eliminate wasted signal toggling in idle mode.

The experimental results have shown that the unfolding transformation technique can achieve up to 47.97 times higher throughput, the pipelining and parallelism methods give a 6.31% delay reduction, the load-enable based clock gating scheme decreases dynamic power consumption by 13.65%, and the frequency trade-off technique shows how to choose the clock frequency of the hash function to achieve low power consumption and high throughput.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This research was supported by the MKE (The Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program (NIPA-2012-H0301-12-3007) supervised by the NIPA (National IT Industry Promotion Agency).

References

[1] H. Michail and C. Goutis, "Holistic methodology for designing ultra high-speed SHA-1 hashing cryptographic module in hardware," in Proceedings of the IEEE International Conference on Electron Devices and Solid-State Circuits (EDSSC '08), pp. 1–4, Hong Kong, December 2008.

[2] "Cryptographic hash algorithm competition," NIST Computer Security Resource Center, http://csrc.nist.gov/groups/ST/hash/sha-3/index.html.

[3] B. Schneier, Applied Cryptography: Protocols, Algorithms, and Source Code in C, John Wiley & Sons, New York, NY, USA, 2nd edition, 1996.

[4] J. Nakajima and M. Mitsuru, "Performance analysis and parallel implementation of dedicated hash function," in Proceedings of the International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT '02), vol. 2332, pp. 165–180, Amsterdam, The Netherlands, 2002.

[5] P. C. van Oorschot, A. Somayaji, and G. Wurster, "Hardware-assisted circumvention of self-hashing software tamper resistance," IEEE Transactions on Dependable and Secure Computing, vol. 2, no. 2, pp. 82–92, 2005.

[6] G. Bertoni, J. Daemen, M. Peeters, and G. van Assche, "Cryptographic sponge functions," The Sponge Functions Corner, http://sponge.noekeon.org/index.html.

[7] "Sponge function," WIKIPEDIA, http://en.wikipedia.org/wiki/Sponge_function.

[8] L. Li, Power optimization from register transfer level to transistor level in deeply scaled CMOS technology [Ph.D. thesis], Illinois Institute of Technology, Chicago, Ill, USA, 2012.

[9] N. Weste and D. Harris, CMOS VLSI Design: A Circuits and Systems Perspective, Addison-Wesley, Reading, Mass, USA, 2010.

[10] Y. Zhang, Q. Tong, L. Li, et al., "Automatic register transfer level CAD tool design for advanced clock gating and low power schemes," in Proceeding of the International SoC Design Conference (ISOCC '12), pp. 21–24, Jeju Island, Republic of Korea, 2012.

[11] K. Aoki, T. Ichikawa, and M. Kanda, "Specification of Camellia - a 128-bit block cipher," Nippon Telegraph and Telephone Corporation, Mitsubishi Electric Corporation, 2000.

[12] Y. K. Lee, H. Chan, and I. Verbauwhede, "Throughput optimized SHA-1 architecture using unfolding transformation," in Proceedings of the 17th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP '06), pp. 354–359, Steamboat Springs, Colo, USA, September 2006.

[13] H. Michail, A. P. Kakarountas, O. Koufopavlou, and C. E. Goutis, "A low-power and high-throughput implementation of the SHA-1 hash function," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '05), vol. 4, pp. 4086–4089, Kobe, Japan, May 2005.

[14] S. Badel, N. Dagtekin, J. Nakahara Jr., et al., "ARMADILLO: a multi-purpose cryptographic primitive dedicated to hardware," in Cryptographic Hardware and Embedded Systems, CHES 2010, vol. 6225 of Lecture Notes in Computer Science, pp. 398–412, 2010.

[15] K. Lin, Y. Zhang, K. Choi, J. Kang, and S. Hong, "Multi-purpose bimodal cryptographic algorithm and its hardware implementation," in Proceedings of the FTRA International Conference on Advanced IT Engineering and Management (FTRA AIM '13), Seoul, Korea, 2013.



Page 11: Research Article High Performance and Low Power Hardware ...downloads.hindawi.com/journals/ijdsn/2014/736312.pdf · Research Article High Performance and Low Power Hardware Implementation

International Journal of Distributed Sensor Networks 11

Table 10 Dynamic power consumption of frequency trade-off technique

Number of iteration rounds Power Frequency(120583W) Reduction () (MHz)

48 130896 na 100024 94008 lt 119875

24lt 130848 2818 lt 119875

24pct lt 004 500 lt 11989124lt 696

16 70288 lt 11987516lt 130720 4630 lt 119875

16pct lt 013 333 lt 11989116lt 620

12 55512 lt 11987512lt 130800 5759 lt 119875

12pct lt 007 250 lt 11989112lt 589

8 38904 lt 1198758lt 130712 7028 lt 119875

8pct lt 014 167 lt 1198918lt 560

6 29880 lt 1198756lt 130764 7717 lt 119875

6pct lt 010 125 lt 1198916lt 547

4 20392 lt 1198754lt 130676 8442 lt 119875

4pct lt 017 083 lt 1198914lt 534

3 15471 lt 1198753lt 130689 8818 lt 119875

3pct lt 016 063 lt 1198913lt 528

2 10430 lt 1198752lt 130676 9203 lt 119875

2pct lt 017 042 lt 1198912lt 522

1 5229 lt 1198751lt 130760 9601 lt 119875

1pct lt 010 021 lt 1198911lt 521

Table 11 Throughput performances of frequency trade-off technique

Number of iteration rounds Throughput Frequency(Mbps) Improvement () (MHz)

48 667 na 100024 667 lt 119879

24lt 928 000 lt 119879

24pct lt 3913 500 lt 11989124lt 696

16 667 lt 11987916lt 124 000 lt 119879

16pct lt 8591 333 lt 11989116lt 620

12 667 lt 11987912lt 1571 000 lt 119879

12pct lt 13553 250 lt 11989112lt 589

8 667 lt 1198798lt 2240 000 lt 119879

8pct lt 23583 167 lt 1198918lt 560

6 667 lt 1198796lt 2917 000 lt 119879

6pct lt 33733 125 lt 1198916lt 547

4 667 lt 1198794lt 4272 000 lt 119879

4pct lt 54048 083 lt 1198914lt 534

3 667 lt 1198793lt 5632 000 lt 119879

3pct lt 74438 063 lt 1198913lt 528

2 667 lt 1198792lt 8352 000 lt 119879

2pct lt 115217 042 lt 1198912lt 522

1 667 lt 1198791lt 16672 000 lt 119879

1pct lt 239955 021 lt 1198911lt 521

5 Conclusion

In order to achieve high performance and low power hard-ware implementation for cryptographic hash function whichuses sponge construction firstly we use unfolding transfor-mation technique to improve the throughput of hash func-tion secondly pipeline and parallelism design techniques areimplemented to reduce the critical path delay by modifyingthe structure of permutation function thirdly frequencytrade-off technique is proposed to calculate a frequency scopewhich can be used to make a trade-off between low dynamicpower consumption and high throughput of hash functionfinally load-enable based clock gating scheme is applied inhash encryption system to eliminate wasted toggle rate ofsignals in the idle mode

The experimental results have shown that unfoldingtransformation technique can achieve up to 4797 timeshigher throughput pipeline and parallelism methods give631delay reduction load-enable based clock gating schemedecreases 1365 dynamic power consumption and fre-quency trade-off technique shows how to decide the clockfrequency of the hash function to achieve low power con-sumption and high throughput

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

This research was supported by the MKE (The Ministry ofKnowledge Economy) Korea under the ITRC (InformationTechnology Research Center) support program (NIPA-2012-H0301-12-3007) supervised by the NIPA (National IT Indus-try Promotion Agency)

References

[1] H Michail and C Goutis ldquoHolistic methodology for design-ing ultra high-speed SHA-1 hashing cryptographic module inhardwarerdquo in Proceedings of the IEEE International Conferenceon Electron Devices and Solid-State Circuits (EDSSC rsquo08) pp 1ndash4 Hong Kong December 2008

[2] ldquoCryptographic hash algorithm competitionrdquo NIST ComputerSecurity Resource Center httpcsrcnistgovgroupsSThashsha-3indexhtml

12 International Journal of Distributed Sensor Networks

[3] B. Schneier, Applied Cryptography: Protocols, Algorithms, and Source Code in C, John Wiley & Sons, New York, NY, USA, 2nd edition, 1996.

[4] J. Nakajima and M. Mitsuru, “Performance analysis and parallel implementation of dedicated hash function,” in Proceedings of the International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT ’02), vol. 2332, pp. 165–180, Amsterdam, The Netherlands, 2002.

[5] P. C. van Oorschot, A. Somayaji, and G. Wurster, “Hardware-assisted circumvention of self-hashing software tamper resistance,” IEEE Transactions on Dependable and Secure Computing, vol. 2, no. 2, pp. 82–92, 2005.

[6] G. Bertoni, J. Daemen, M. Peeters, and G. van Assche, “Cryptographic sponge functions,” The Sponge Functions Corner, http://sponge.noekeon.org/index.html.

[7] “Sponge function,” Wikipedia, http://en.wikipedia.org/wiki/Sponge_function.

[8] L. Li, Power optimization from register transfer level to transistor level in deeply scaled CMOS technology [Ph.D. thesis], Illinois Institute of Technology, Chicago, Ill, USA, 2012.

[9] N. Weste and D. Harris, CMOS VLSI Design: A Circuits and Systems Perspective, Addison-Wesley, Reading, Mass, USA, 2010.

[10] Y. Zhang, Q. Tong, L. Li, et al., “Automatic register transfer level CAD tool design for advanced clock gating and low power schemes,” in Proceeding of the International SoC Design Conference (ISOCC ’12), pp. 21–24, Jeju Island, Republic of Korea, 2012.

[11] K. Aoki, T. Ichikawa, and M. Kanda, “Specification of Camellia - a 128-bit block cipher,” Nippon Telegraph and Telephone Corporation, Mitsubishi Electric Corporation, 2000.

[12] Y. K. Lee, H. Chan, and I. Verbauwhede, “Throughput optimized SHA-1 architecture using unfolding transformation,” in Proceedings of the 17th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP ’06), pp. 354–359, Steamboat Springs, Colo, USA, September 2006.

[13] H. Michail, A. P. Kakarountas, O. Koufopavlou, and C. E. Goutis, “A low-power and high-throughput implementation of the SHA-1 hash function,” in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS ’05), vol. 4, pp. 4086–4089, Kobe, Japan, May 2005.

[14] S. Badel, N. Dagtekin, J. Nakahara Jr., et al., “ARMADILLO: a multi-purpose cryptographic primitive dedicated to hardware,” in Cryptographic Hardware and Embedded Systems, CHES 2010, vol. 6225 of Lecture Notes in Computer Science, pp. 398–412, 2010.

[15] K. Lin, Y. Zhang, K. Choi, J. Kang, and S. Hong, “Multi-purpose bimodal cryptographic algorithm and its hardware implementation,” in Proceedings of the FTRA International Conference on Advanced IT, Engineering and Management (FTRA AIM ’13), Seoul, Korea, 2013.
