High-Speed Implementation of the KECCAK Hash Function on FPGA

International Journal of Advanced Computer Science, Vol. 2, No. 8, Pp. 303-307, Aug., 2012.

Manuscript Received: 14 Sep. 2011

Revised: 25 Jan. 2012

Accepted: 5 Mar. 2012

Published: 15 Sep. 2012

Keywords: SHA-3, KECCAK Hash Function, Unrolling Method, Pipeline Register, High-Speed Implementation

Abstract: Because of the weakening of the widely used SHA-1 hash algorithm and concerns over the similarly structured algorithms of the SHA-2 family, the US NIST has initiated the SHA-3 contest in order to select a suitable drop-in replacement. In this paper we review the KECCAK hash function's algorithm and apply several methods to improve its performance with respect to throughput, frequency and timing. Trying to improve any one of these parameters may adversely affect the others. Different architectures are coded in VHDL, implemented on FPGAs, and compared in terms of speed.

1. Introduction

In today’s modern world of e-mail, internet banking,

online shopping, and other sensitive digital

communications, cryptography has become a vital tool for

ensuring the privacy of data transfers. Hash functions

operate at the root of many popular cryptographic methods

in current use, such as the Digital Signature Standard

(DSS), Transport Layer Security (TLS) and Internet

Protocol Security (IPSec) protocols, numerous random

number generation algorithms, encryption algorithms,

all-or-nothing transforms, and password storage

mechanisms [1].

As cryptographic algorithms become more widely

used, the need for high-speed implementations of these

algorithms increases. Software-based implementations of

cryptographic algorithms fall short in performance in many

applications, e.g. on heavily loaded servers. Therefore, an

obvious need for high-speed implementations exists.

In many of these cryptographic schemes, the

throughput of the incorporated hash functions specifies the

throughput of the system. Especially in applications where

transmission and reception rates are high, any latency or

delay on calculating the digital signature of the data packet

leads to degradation of the network’s quality of service [2].

A. Gholipour is with the Iran University of Science and Technology. S. Mirzakuchaki is with the Department of Electrical Engineering, Iran University of Science and Technology.

Reprogrammable hardware is an almost ideal choice for cryptographic implementations because high speed can be achieved without significant reduction in flexibility.

Flexibility, meaning that the design can be easily

changed or modified, is of especially great importance in

cryptographic implementations for the following reasons.

First, a cryptographic algorithm can be considered secure

only until proven otherwise. If a severe flaw in an algorithm

is found, the algorithm must be replaced with a more secure

one. Second, in many applications, a large variety of

different algorithms are in use, and therefore, it should be

easy to change from one algorithm to another.

Following the weakening of the widely-used SHA-1

hash algorithm and concerns over the similarly-structured

algorithms of the SHA-2 family, the NIST has set up the

SHA-3 competition with the goal of identifying one (or

more) modern hash functions which can act as a drop-in

replacement for the SHA-2 family [3].

The KECCAK hash function is one of the candidates accepted by NIST for the SHA-3 hash function competition. In this paper we describe the implementation of KECCAK on FPGAs.

The paper is organized as follows. Section 2 presents the KECCAK algorithm, Section 3 describes some techniques that increase the speed of the implementation, and Section 4 reports the results. Finally, conclusions are offered in Section 5.

2. KECCAK Algorithm

KECCAK is a family of hash functions that are based on the sponge construction and use as a building block a permutation from a set of 7 permutations. The 7 KECCAK-f permutations are denoted KECCAK-f[b], where $b = 25 \cdot 2^{l}$ and $l$ ranges from 0 to 6. KECCAK-f[b] is a permutation over $s \in \mathbb{Z}_2^{b}$, where the bits of $s$ are numbered from 0 to $b - 1$; $b$ is the width of the permutation. These KECCAK-f permutations are iterated constructions consisting of a sequence of almost identical rounds. The number of rounds $n_r$ depends on the permutation width and is given by $n_r = 12 + 2l$, where $2^{l} = b/25$. This gives 24 rounds for KECCAK-f[1600].
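The relation between the permutation width and the number of rounds can be tabulated with a few lines of Python. This is only an illustrative sketch: the function name `keccak_f_parameters` is an assumption of this example, and it simply evaluates $b = 25 \cdot 2^{l}$ and $n_r = 12 + 2l$ for $l = 0, \ldots, 6$.

```python
# Permutation widths and round counts of the seven KECCAK-f[b] permutations.
# Assumes b = 25 * 2**l and n_r = 12 + 2*l, as stated in the text.
# The function name is illustrative and not taken from any KECCAK package.

def keccak_f_parameters():
    """Return a list of (l, lane_size_w, width_b, rounds_nr) tuples."""
    params = []
    for l in range(7):          # l ranges from 0 to 6
        w = 2 ** l              # lane size in bits
        b = 25 * w              # permutation width in bits
        nr = 12 + 2 * l         # number of rounds
        params.append((l, w, b, nr))
    return params


if __name__ == "__main__":
    for l, w, b, nr in keccak_f_parameters():
        print(f"l={l}  w={w:2d}  b={b:4d}  rounds={nr}")
```

The last entry corresponds to KECCAK-f[1600], the permutation with 64-bit lanes used in the rest of this paper.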

High-Speed Implementation of the KECCAK Hash Function on FPGA, Atefeh Gholipour & Sattar Mirzakuchaki. International Journal Publishers Group (IJPG) ©

The KECCAK hash function produces a final message digest of 256 bits, which depends on the input message, composed of multiple blocks of 1024 bits each. Each input message block is XORed onto a part of the current state and the result is passed through the KECCAK-f permutation. The KECCAK algorithm consists of 3 stages: (i) initialization and padding; (ii) absorbing phase; and (iii) squeezing phase. Pseudocode for this algorithm is given below [4, 5].

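KECCAK[r, c, d](M)

- Initialization and padding

$S[x, y] = 0, \quad \forall (x, y) \in (0 \ldots 4,\ 0 \ldots 4)$

$P = M \,\|\, \mathtt{0x01} \,\|\, \mathrm{byte}(d) \,\|\, \mathrm{byte}(r/8) \,\|\, \mathtt{0x01} \,\|\, \mathtt{0x00} \,\|\, \ldots \,\|\, \mathtt{0x00}$

- Absorbing phase

for every block $P_i$ in $P$:

$S[x, y] = S[x, y] \oplus P_i[x + 5y], \quad \forall (x, y)$ such that $x + 5y < r/w$

$S = \text{KECCAK-}f[r + c](S)$

- Squeezing phase

while output is requested:

$Z = Z \,\|\, S[x, y], \quad \forall (x, y)$ such that $x + 5y < r/w$

$S = \text{KECCAK-}f[r + c](S)$

return $Z$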
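The absorbing and squeezing phases can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the reference implementation: the message is assumed to be already padded and split into blocks of $r/w$ lanes, the state is held as a flat list of 25 lanes indexed by $x + 5y$, and the permutation is passed in as a callable. The names `sponge`, `rate_lanes` and `permutation` are hypothetical.

```python
# Minimal sketch of the sponge construction (absorb, then squeeze).
# Assumptions: blocks are already padded; each block is a list of
# rate_lanes integers; the state is a flat list of 25 lanes where
# lane (x, y) is stored at index x + 5*y.

def sponge(blocks, rate_lanes, permutation, output_lanes):
    """Absorb pre-padded blocks and return output_lanes lanes of output.

    blocks       -- list of blocks P_i, each a list of rate_lanes integers
    rate_lanes   -- r / w, the number of lanes touched per block
    permutation  -- callable taking and returning a list of 25 lanes
    output_lanes -- number of lanes to return in Z
    """
    state = [0] * 25                      # S[x, y] = 0 for all (x, y)

    # Absorbing phase: XOR each block into the first r/w lanes, then permute.
    for block in blocks:
        for i in range(rate_lanes):
            state[i] ^= block[i]
        state = permutation(state)

    # Squeezing phase: read r/w lanes at a time. The permutation is only
    # applied again when more output is still needed, which yields the
    # same Z as the pseudocode.
    z = []
    while len(z) < output_lanes:
        z.extend(state[:rate_lanes])
        if len(z) < output_lanes:
            state = permutation(state)
    return z[:output_lanes]
```

Passing the permutation as a parameter keeps the sketch independent of the permutation width; for the hash function described here it would be KECCAK-f[1600] with a rate of 1024 / 64 = 16 lanes.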

The state is logically grouped into a 5×5 matrix of 64-bit words (lanes). The KECCAK-f permutation consists of 24 rounds, which are identical except for the addition of a round-dependent constant. Each round R has five steps (θ, ρ, π, χ and ι), which feature simple logical operations and permutations of the state bits. The initial state is all zero, and each input block is mixed into the current state before the permutation is applied.

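$R = \iota \circ \chi \circ \pi \circ \rho \circ \theta$

$\theta$:

$C[x] = A[x, 0] \oplus A[x, 1] \oplus A[x, 2] \oplus A[x, 3] \oplus A[x, 4], \quad \forall x \in 0 \ldots 4$

$D[x] = C[x - 1] \oplus \mathrm{ROT}(C[x + 1], 1), \quad \forall x \in 0 \ldots 4$

$A[x, y] = A[x, y] \oplus D[x], \quad \forall (x, y) \in (0 \ldots 4,\ 0 \ldots 4)$

$\rho$:

$A[x, y] = \mathrm{ROT}(A[x, y], r(x, y)), \quad \forall (x, y) \in (0 \ldots 4,\ 0 \ldots 4)$

$\pi$:

$B[X, Y] = A[x, y], \quad \begin{pmatrix} X \\ Y \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ 2 & 3 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}$

$\chi$:

$A[x, y] = B[x, y] \oplus \left( \mathrm{NOT}\ B[x + 1, y]\ \ \mathrm{AND}\ \ B[x + 2, y] \right), \quad \forall (x, y) \in (0 \ldots 4,\ 0 \ldots 4)$

$\iota$:

$A[0, 0] = A[0, 0] \oplus RC$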

Here the following conventions are in use. All the

operations on the indices are done modulo 5. A denotes the

complete permutation state array and A[x, y] denotes a

particular lane in that state. B[x, y], C[x] and D[x] are

intermediate variables. The symbol $\oplus$ denotes the bitwise

exclusive OR, NOT the bitwise complement and AND the

bitwise AND operation. Finally, ROT(W, r) denotes the

bitwise cyclic shift operation, moving bit at position i into

position i + r (modulo the lane size).

The constants r(x, y) are the cyclic shift offsets and are specified in Table 1.

TABLE 1 VALUE OF OFFSET IN STEP ρ

        x=3    x=4    x=0    x=1    x=2
y=2     153    231      3     10    171
y=1      55    276     36    300      6
y=0      28     91      0      1    190
y=4     120     78    210     66    253
y=3      21    136    105     45     15

The constants RC[i] are the round constants. Their values in hexadecimal notation for lane size 64 are shown in Table 2.

TABLE 2 VALUE OF RC[I] CONSTANT

RC[0]    0x0000000000000001
RC[1]    0x0000000000008082
RC[2]    0x800000000000808A
RC[3]    0x8000000080008000
RC[4]    0x000000000000808B
RC[5]    0x0000000080000001
RC[6]    0x8000000080008081
RC[7]    0x8000000000008009
RC[8]    0x000000000000008A
RC[9]    0x0000000000000088
RC[10]   0x0000000080008009
RC[11]   0x000000008000000A
RC[12]   0x000000008000808B
RC[13]   0x800000000000008B
RC[14]   0x8000000000008089
RC[15]   0x8000000000008003
RC[16]   0x8000000000008002
RC[17]   0x8000000000000080
RC[18]   0x000000000000800A
RC[19]   0x800000008000000A
RC[20]   0x8000000080008081
RC[21]   0x8000000000008080
RC[22]   0x0000000080000001
RC[23]   0x8000000080008008
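To make the round function concrete, the following Python sketch models KECCAK-f[1600] in software. It is an illustrative model rather than the VHDL design evaluated in this paper. It assumes the state is stored as `A[x][y]` with 64-bit lanes, uses the rotation offsets of Table 1 reduced modulo the lane size, and takes the round constants from the KECCAK specification [4]. The ρ and π steps are merged into one loop, using $X = y$ and $Y = 2x + 3y$ from the matrix relation of the π step. The function names are assumptions of this example.

```python
# Software model of the KECCAK-f[1600] round function (illustrative sketch).
# State layout assumption: A[x][y] is the 64-bit lane at column x, row y.

MASK = (1 << 64) - 1

# Rotation offsets r(x, y), indexed as R[x][y]; reduced modulo 64 in rot().
R = [
    [0, 36, 3, 105, 210],
    [1, 300, 10, 45, 66],
    [190, 6, 171, 15, 253],
    [28, 55, 153, 21, 120],
    [91, 276, 231, 136, 78],
]

# Round constants RC[0..23] for lane size 64, per the KECCAK specification.
RC = [
    0x0000000000000001, 0x0000000000008082, 0x800000000000808A, 0x8000000080008000,
    0x000000000000808B, 0x0000000080000001, 0x8000000080008081, 0x8000000000008009,
    0x000000000000008A, 0x0000000000000088, 0x0000000080008009, 0x000000008000000A,
    0x000000008000808B, 0x800000000000008B, 0x8000000000008089, 0x8000000000008003,
    0x8000000000008002, 0x8000000000000080, 0x000000000000800A, 0x800000008000000A,
    0x8000000080008081, 0x8000000000008080, 0x0000000080000001, 0x8000000080008008,
]


def rot(w, r):
    """Cyclic left shift of a 64-bit lane by r positions (r taken modulo 64)."""
    r %= 64
    return ((w << r) | (w >> (64 - r))) & MASK


def keccak_round(A, rc):
    """Apply one round (theta, rho, pi, chi, iota) and return the new state."""
    # theta: column parities mixed into every lane
    C = [A[x][0] ^ A[x][1] ^ A[x][2] ^ A[x][3] ^ A[x][4] for x in range(5)]
    D = [C[(x - 1) % 5] ^ rot(C[(x + 1) % 5], 1) for x in range(5)]
    A = [[A[x][y] ^ D[x] for y in range(5)] for x in range(5)]
    # rho and pi: rotate each lane, then move it to position (y, 2x + 3y)
    B = [[0] * 5 for _ in range(5)]
    for x in range(5):
        for y in range(5):
            B[y][(2 * x + 3 * y) % 5] = rot(A[x][y], R[x][y])
    # chi: non-linear step along each row
    A = [[B[x][y] ^ ((~B[(x + 1) % 5][y] & MASK) & B[(x + 2) % 5][y])
          for y in range(5)] for x in range(5)]
    # iota: add the round constant to lane (0, 0)
    A[0][0] ^= rc
    return A


def keccak_f1600(A):
    """Apply all 24 rounds to a 5x5 state of 64-bit lanes."""
    for rc in RC:
        A = keccak_round(A, rc)
    return A
```

In hardware, each call to `keccak_round` corresponds to one pass through the combinational round logic; how many such passes are evaluated per clock cycle is what distinguishes the architectures discussed in the following sections.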

Atefeh Gholipour et al.: High-Speed Implementation of the KECCAK Hash Function on FPGA.



3. Speed Optimization Techniques

This section discusses methods for architectural speed optimization in an FPGA. There are three primary definitions of speed depending on the context of the problem: throughput, latency, and timing [6].

In the context of processing data in an FPGA,

throughput refers to the amount of data that is processed per

clock cycle. A common metric for throughput is bits per

second. Latency refers to the time between data input and

processed data output. The typical metric for latency will be

time or clock cycles. Timing refers to the logic delays

between sequential elements.

Several techniques have been proposed to improve the

implementation. The most relevant are:

A. Unrolling Technique

The unrolling technique optimizes the data dependency between rounds. An unrolled architecture implements multiple rounds of the core compression function in combinational logic, thereby reducing the number of clock cycles required to compute the hash. This comes at the cost of an increase in area. The number of rounds unrolled, k, must be a divisor of the total number of rounds, n, of the algorithm, so that the number of clock cycles needed to execute the algorithm decreases by a factor of k. The goal is to increase the minimum clock period by a factor smaller than k, thus allowing for shorter latency and higher throughput [7].
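The trade-off described above can be expressed numerically. The sketch below is only an illustration: `period_growth`, the factor by which the minimum clock period increases after unrolling, is a hypothetical design-dependent input, and both function names are assumptions of this example.

```python
# Illustrative model of the unrolling trade-off.
# period_growth is a hypothetical input: the factor by which the minimum
# clock period grows when k rounds are chained in combinational logic.

def unrolled_cycles(total_rounds, k):
    """Clock cycles per block when k rounds are computed per clock cycle."""
    if total_rounds % k != 0:
        raise ValueError("k must divide the total number of rounds")
    return total_rounds // k


def unrolling_speedup(k, period_growth):
    """Throughput gain of a k-times unrolled design over the iterative one.

    The cycle count drops by a factor of k while each cycle becomes
    period_growth times longer, so the net gain is k / period_growth.
    """
    return k / period_growth


if __name__ == "__main__":
    print(unrolled_cycles(24, 2))          # cycles per block for k = 2
    print(unrolling_speedup(2, 1.6))       # gain if the period grows by 1.6x
```

A gain above 1 means unrolling pays off, which is the case exactly when the clock period grows by a factor smaller than k.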

B. Embedded Memories

Embedded memories can be used to store the required constant values.

C. Pipelining Techniques

A pipelined design conceptually works very much like an assembly line: the raw material or data input enters the front end, is passed through various stages of manipulation and processing, and then exits as a finished product or data output. The beauty of a pipelined design is that new data can begin processing before the prior data has finished. Because the data computation in a hash function is highly dependent, however, the resulting throughput is usually not improved, and more complex control logic is required.

D. Add Register Layers

An architectural technique for improving timing is to add intermediate layers of registers to the critical path. This technique should be used in highly pipelined designs where the additional clock cycle latency does not violate the design specifications and the overall functionality is not affected by the further addition of registers.

4. Implementation Result

It is possible to design different architectures of

KECCAK. We will describe the high-speed core design

depicted in Fig. 1 [5].

In this configuration the core will be capable of

processing 128 bytes in 24 clock cycles.

The core is composed of three main components: the

round function, the state register and the input/output buffer.

The I/O buffer allows the core to compute the absorbing phase while the words of the next block are transferred through the bus. An alternative that saves area is to store the words composing the block directly in the state register.

We also consider two further architectures: one using a pipeline register and one using the unrolling technique. In the first architecture the core is capable of processing 128 bytes in 48 clock cycles, while the second takes 16 clock cycles to process 128 bytes.

Fig. 1 The high-speed

The unrolling architecture of the core is illustrated in Fig. 2. As in the high-speed core, the I/O buffer allows the core to compute the absorbing phase while the words of the next block are transferred through the bus. The round function is implemented by two R blocks, each of which works in a different clock cycle.

Fig. 2 Unrolling architecture of the Hash




Together, R1 and R2 perform three rounds every two clock cycles. The control signals are not shown in Fig. 1; these signals determine whether the buffer is in input or output mode and specify which of the R blocks is active in a given clock cycle. The processing of a complete message block requires 16 clock cycles.
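As a quick check of this cycle count, assuming that all 24 rounds of KECCAK-f[1600] are executed at the stated rate of three rounds per two clock cycles:

```latex
\text{cycles per block}
  = 24\ \text{rounds} \times \frac{2\ \text{cycles}}{3\ \text{rounds}}
  = 16\ \text{cycles}
```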

The presented hashing cores were captured in VHDL and were fully simulated and verified using Model Technology’s ModelSim simulator.

We used Altera Quartus II and Xilinx ISE to evaluate the VHDL designs on the target FPGAs. These tools provide estimates of the amount of resources needed and the maximum clock frequency reached [8, 9].

The throughput is calculated by:

$\text{Throughput} = \dfrac{\text{Block size} \times \text{Max. frequency}}{\text{Clock cycles}}$ (Equ. 1)

The block size is 1024 bits.
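Equ. 1 can be evaluated directly. The short Python sketch below applies it to the maximum frequencies and clock cycle counts reported for the Altera Stratix III device in Table 3; the function and variable names are assumptions of this example.

```python
# Evaluate Equ. 1: throughput = block size * max frequency / clock cycles.
# The (frequency in MHz, clock cycles) pairs are the Stratix III figures
# reported in Table 3; the names used here are illustrative.

def throughput_bps(block_size_bits, max_freq_hz, clock_cycles):
    """Return the throughput in bits per second."""
    return block_size_bits * max_freq_hz / clock_cycles


BLOCK_SIZE_BITS = 1024

stratix3 = {
    "original": (230.57, 24),
    "register layers": (382.85, 48),
    "unrolling": (212.49, 16),
}

for name, (freq_mhz, cycles) in stratix3.items():
    gbps = throughput_bps(BLOCK_SIZE_BITS, freq_mhz * 1e6, cycles) / 1e9
    print(f"{name:16s} {gbps:.3f} Gbit/s")
```

The computed values agree with the throughput column of Table 3 up to rounding in the last digit. They also show why the register-layer design does not gain throughput: its clock frequency rises by less than a factor of two while its cycle count doubles.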

Applying the unrolling method decreases the number of clock cycles needed to complete the round function, which is 24 for the original implementation. In this case we decrease it to 16 clock cycles, which equals the number of clock cycles used for reading the inputs.

Adding registers in the combinational path increases the maximum frequency of the circuit. In this implementation we use one register layer.

The results of these implementations on the two FPGAs are shown in Tables 3 and 4.

TABLE 3 PERFORMANCE ESTIMATION ON ALTERA STRATIXIII EP3SE50F484C2

Architecture                   Max Freq. (MHz)   Required Clock Cycles   Throughput (Gbit/s)
Original Architecture          230.57            24                      9.83
Register Layers Architecture   382.85            48                      8.17
Unrolling Architecture         212.49            16                      13.59

TABLE 4 PERFORMANCE ESTIMATION ON VIRTEX 5 XC5VLX50FF324-3

Architecture                   Max Freq. (MHz)   Required Clock Cycles   Throughput (Gbit/s)
Original
Register Layers Architecture   146.649           48                      3.13
Unrolling

Another important issue in hardware implementation is the occupied area. This parameter is expressed by the number of registers and logic elements used. The numbers of registers and logic elements used by each implementation are shown in Tables 5 and 6.

TABLE 5 NUMBER OF USED REGISTERS FOR EACH ARCHITECTURE

Architecture                   Altera StratixIII EP3SE50F484C2   Virtex5 XC5VLX50FF324-3
Original Architecture          2641 (38000)                      2640 (28800)
Register Layers Architecture   2641 (38000)                      4242 (28800)
Unrolling Architecture         4250 (38000)                      2652 (28800)

TABLE 6 NUMBER OF USED LOGICS FOR EACH ARCHITECTURE

Architecture                   Altera StratixIII EP3SE50F484C2   Virtex5 XC5VLX50FF324-3
Original Architecture          4304 (38000) ALUTs                1434 (7200) Slices
Register Layers Architecture   14402 (38000) ALUTs               2636 (7200) Slices
Unrolling Architecture         5633 (38000) ALUTs                1562 (7200) Slices

As the tables above show, increasing the throughput or the maximum frequency also increases the occupied area. For the register-layer method, the frequency gain is small relative to the additional hardware it requires, which is not acceptable.

5. Conclusion

In this paper we reviewed the KECCAK hash function’s algorithm and applied several methods to improve its performance with respect to throughput, frequency and timing. Trying to improve any one of these parameters may adversely affect the others. Different architectures were coded in VHDL, implemented on FPGAs, and compared in terms of speed.

The most important method for increasing throughput is loop unrolling, which we applied to our architecture; to increase the frequency we added register layers in the critical path. As explained above, the unrolling method achieves the highest throughput, and the penalty is an increase in area.


References

[1] R. P. McEvoy, F. M. Crowe, C. C. Murphy, and W. P. Marnane, "Optimisation of the SHA-2 family of hash functions on FPGAs," IEEE Computer Society Annual Symposium on Emerging VLSI Technologies and Architectures (ISVLSI'06), pp. 317-322, 2006.

[2] Jae-Bong Yoo, Byung-Ki Kim, Ho-Min Jung, Taewan Gu, Chan-Young Park, and Young-Woong Ko, "Efficient Pipelined Hardware Implementation of RIPEMD-160 Hash Function," International Journal of Electronics, Circuits and Systems, Vol. 2, No. 2, Spring 2008.

[3] National Institute of Standards and Technology (NIST), Cryptographic Hash Algorithm Competition Website. http://csrc.nist.gov/groups/ST/hash/sha-3.

[4] G. Bertoni, J. Daemen, M. Peeters, and G. Van Assche, "Keccak specifications." http://keccak.noekeon.org/Keccak-specifications.pdf

[5] G. Bertoni, J. Daemen, M. Peeters, and G. Van Assche, "Keccak sponge function family main document." http://keccak.noekeon.org/Keccak-main-2.1.pdf

[6] Steve Kilts, "Advanced FPGA Design: Architecture, Implementation, and Optimization," Wiley-IEEE Press, 2007.

[7] Roar Lien, "FPGA Implementations of SHA-1 Secure Hash Standard," Thesis, 2003.

[8] J. Strömbergson, "Implementation of the Keccak hash function in FPGA devices." http://www.strombergson.com/files/Keccak_in_FPGAs.pdf.

[9] G. Bertoni, J. Daemen, M. Peeters, and G. Van Assche, "Keccak Hardware implementation in VHDL," File archive, December 2008. http://keccak.noekeon.org/KeccakVHDL-1.0.zip
