
A Globally-Asynchronous Locally-Synchronous VLSI Circuit for the SAFER Cryptoalgorithm

T. Villiger, J. Muttersbach, H. Kaeslin, N. Felber, W. Fichtner

Integrated Systems Laboratory, Swiss Federal Institute of Technology

ACiD-WG Workshop, Neuchatel, Switzerland

Funded by: Infineon Technologies, Germany

Philips Semiconductors Zurich, Switzerland

Commission for Technology and Innovation, Switzerland

Integrated Systems Laboratory

Outline

• GALS principle

• GALS building blocks

• Data channels

• SAFER cryptoalgorithm

• Architecture of the cryptochip

• Design flow

• Results

• Conclusion, Future work


GALS Principle

• Locally-synchronous (LS) modules perform all functionality

• Data are transferred in self-timed manner between LS modules

[Figure: two locally-synchronous modules, each enclosed in an asynchronous wrapper, connected by data and handshake signals]

Benefits:

➜ Facilitates clocking of SOCs

➜ Modularity enhances optimization and re-use of LS blocks

➜ Provides a hook for low-power operation

➜ Natural inclusion of totally asynchronous modules


The asynchronous wrapper

[Figure: asynchronous wrapper enclosing a locally-synchronous module, with in and out ports on both the wrapper and the module, and a pausable clock]

• Asynchronous port controllers allow for fast handshake processing

• Metastability of data is prevented by pausing the clock

• No extra latency (synchronizers, FIFOs, ...) introduced

• Wrapper shall be assembled from predesigned elements


Pausable clock generation

• Ring oscillator for local clock generation

• Arbitration with Mutual Exclusion (ME) elements

• Programmable delay line

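[Figure: pausable clock generator with blocks labelled C, delay line, arbitration (ME) and ring oscillator; signals Ri, Ai, lclk, rclk, clkgrant and delout]

[Timing diagram: Ri, lclk, Ai, rclk, clkgrant and delout over phases 1 to 4]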
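
The arbitration between port requests and the local clock can be illustrated with a small behavioural model. This is only a sketch under simplifying assumptions that are not taken from the slides: a single port, integer times in picoseconds, a fixed half period, and an ideal ME element that the clock owns during its high phase. The function name and the `(arrival, hold)` request format are hypothetical.

```python
def simulate_pausable_clock(half_period, requests, n_edges):
    """Return (rising_edge_times, grant_times) of a simplified pausable clock.

    requests: list of (arrival_time, hold_time) pairs for one port.
    A request is granted only while the clock does not own the ME element;
    a rising edge is postponed while a granted request is still held.
    """
    pending = sorted(requests)
    rising_edges, grants = [], []
    rclk_rise = half_period   # first time the oscillator asks for a rising edge
    me_free = 0               # time at which the ME element becomes free
    while len(rising_edges) < n_edges:
        # Serve every request that reaches the ME element before rclk rises.
        while pending and max(pending[0][0], me_free) < rclk_rise:
            arrival, hold = pending.pop(0)
            granted = max(arrival, me_free)
            me_free = granted + hold
            grants.append(granted)
        # The rising edge waits for the ME element: this is clock stretching.
        rise = max(rclk_rise, me_free)
        rising_edges.append(rise)
        me_free = rise + half_period        # clock owns the ME while high
        rclk_rise = rise + 2 * half_period  # next request for a rising edge
    return rising_edges, grants


edges, grants = simulate_pausable_clock(1000, [(2500, 200), (4800, 500)], 4)
print(edges)   # [1000, 3000, 5300, 7300]
print(grants)  # [2500, 4800]
```

In this run the first request fits into the low phase and leaves the clock untouched, while the second is still held when the third rising edge is due, so that edge moves from 5000 ps to 5300 ps.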


Delayline and Arbitration

Principle of delay line using slices

[Figure: chain of delay slices with signals fin, fout, bin, bout, delayIn and delayOut, plus an emergency margin]

• every slice can bypass rest of delay line

• delay adjustable over a wide range

• small delay increment (≈ 350 ps per slice with 0.25 µm technology)

Multiple arbitration block

[Figure: row of three ME elements arbitrating the request/acknowledge pairs Ri1/Ai1, Ri2/Ai2 and Ri3/Ai3 against rclk, producing clkgrant]

• safely arbitrates between incoming requests and rising edges on rclk

• row of mutual exclusion elements

• scales linearly to multiple ports
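
To get a feel for how the slice count sets the local clock period, the following sketch assumes that one clock period equals two trips through the delay loop and that every active slice adds the quoted 350 ps. The `fixed_ps` parameter, standing for the delay of everything in the loop other than the slices, is a hypothetical placeholder, as are the function names.

```python
SLICE_DELAY_PS = 350  # approximate per-slice increment quoted for 0.25 µm


def period_ps(n_slices, fixed_ps=0, slice_ps=SLICE_DELAY_PS):
    """Clock period if the loop delay is fixed_ps plus n_slices slices."""
    return 2 * (fixed_ps + n_slices * slice_ps)


def slices_for_period(target_ps, fixed_ps=0, slice_ps=SLICE_DELAY_PS):
    """Smallest slice count whose period is at least target_ps."""
    half = -(-target_ps // 2)              # ceiling of target_ps / 2
    needed = max(0, half - fixed_ps)
    return -(-needed // slice_ps)          # ceiling division


for target in (2100, 3300):
    n = slices_for_period(target)
    print(target, n, period_ps(n))
```

Under these assumptions the period can only be adjusted in steps of twice the slice delay, so a 3300 ps target is rounded up to 3500 ps.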


GALS building blocks

• Port controller responsible for managing all data transfers

• Each unit is captured by a structural VHDL description

• Asynchronous port controllers achieve cycle times less than 350ps

• Poll-type ports ask for clock stretching only to prevent metastability and ensure data correctness: “proceed while waiting”

• Demand-type ports also ensure data integrity but stop the local clock as soon as they are enabled: “sleep while waiting”

• ROM (LUT) port: Interface at the ROM port is just a fake delay between Rp and Ap matching the data delay

• RAM access: Essentially a Demand-Out port with bidirectional data transfer (two data vectors in opposite directions controlled by a common handshake pair)
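
The difference between the two port types can be expressed as the number of local clock cycles a module executes between enabling a port and the arrival of its communication partner. The sketch below is a deliberately coarse model: it ignores the short stretch around the transfer itself, and the function name and arguments are hypothetical.

```python
def cycles_while_waiting(port_type, wait_ps, period_ps):
    """Local clock cycles executed while the port waits for its partner."""
    if port_type == "demand":
        # "sleep while waiting": the clock stops as soon as the port is enabled
        return 0
    if port_type == "poll":
        # "proceed while waiting": the clock keeps running at its nominal period
        return wait_ps // period_ps
    raise ValueError(f"unknown port type: {port_type!r}")


print(cycles_while_waiting("demand", 10_000, 2_100))  # 0
print(cycles_while_waiting("poll", 10_000, 2_100))    # 4
```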


D-input port

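[Figure: state graph of the D-input AFSM with states 0 to 7 and the following transitions (input burst / output burst):]

Den+ Rp* / Ri+

Ai+ Rp+ / Ap+

Rp- / Ri-

Ai- / Ap-

Den- Rp* / Ri+

Ai+ Rp+ / Ap+

Rp- / Ri-

Ai- / Ap-

[Figure: D-input AFSM block with signals Den, Ri, Ai, Rp, Ap, Ta, Set, D, lclk and data]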

• Performs 2-phase to 4-phase conversion

• Transfer acknowledge (Ta) indicates successful transfer

• 3D-tools available for synthesis (by K.Y. Yun)
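
The burst sequence of the D-input controller can be replayed in a few lines to check that it is self-consistent. The sketch assumes that the eight transitions are traversed in the order listed and that all signals start at 0; both are assumptions, since the state numbering of the original graph is not recoverable. The directed don't-care `Rp*` is modelled by letting Rp change only in the burst that names `Rp+`.

```python
# (input burst, output burst) pairs; 1 means a rising edge, 0 a falling edge
BURSTS = [
    ({"Den": 1}, {"Ri": 1}),           # Den+ Rp* / Ri+
    ({"Ai": 1, "Rp": 1}, {"Ap": 1}),   # Ai+ Rp+ / Ap+
    ({"Rp": 0}, {"Ri": 0}),            # Rp- / Ri-
    ({"Ai": 0}, {"Ap": 0}),            # Ai- / Ap-
    ({"Den": 0}, {"Ri": 1}),           # Den- Rp* / Ri+
    ({"Ai": 1, "Rp": 1}, {"Ap": 1}),   # Ai+ Rp+ / Ap+
    ({"Rp": 0}, {"Ri": 0}),            # Rp- / Ri-
    ({"Ai": 0}, {"Ap": 0}),            # Ai- / Ap-
]


def walk(bursts):
    """Apply every burst in order; fail if an edge would not change a signal."""
    signals = {"Den": 0, "Rp": 0, "Ai": 0, "Ri": 0, "Ap": 0}
    transfers = 0
    for inputs, outputs in bursts:
        for name, value in {**inputs, **outputs}.items():
            if signals[name] == value:
                raise ValueError(f"{name} is already {value}")
            signals[name] = value
        transfers += outputs.get("Ap") == 1   # one Ap+ per port handshake
    return signals, transfers


final, transfers = walk(BURSTS)
print(final, transfers)
```

One pass through the cycle returns every signal to 0 and completes two four-phase Rp/Ap handshakes, one for each edge of Den, which is consistent with the 2-phase to 4-phase conversion named on the slide.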


D-input port (cont.)

Synthesis results

Ri = Rp Ri + Den Z0 + Den Ap Z0

Ap = Rp Ai + Ai Ap

Z0 = Rp Z0 + Ai Z0 + Den Rp Ap

2-level AND-OR implementation

[Figure: gate-level schematic with inputs Ai, Rp, Den and Reset, state variable Z0, and outputs Ri and Ap]

Compact, fast and hazard-free implementation of async FSM


Data transfer mechanism

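[Figure: data channel between module LS1 (D-Out port, Clockgen1) and module LS2 (P-In port, Clockgen2); signals En1, Ta1, lclk1, Ri1, Ai1, data1, Rp, Ap, Ri2, Ai2, En2, Ta2, lclk2 and data2]

[Timing diagram: Ri1, Ai1, Rp, Ap, Ri2, Ai2, lclk1, lclk2, En1, En2 and data2]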

• All transfer channels use a rendezvous scheme

• Push channels only, but pull channels would also be possible

• Data latches needed due to undetermined clock relation between the modules
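
In a rendezvous scheme a transfer completes only once both partners have arrived, so the slower side sets the pace. The sketch below models that with absolute ready times per transfer and a fixed handshake duration; the function name, the time unit (picoseconds) and the `handshake_ps` parameter are hypothetical.

```python
def rendezvous_times(sender_ready, receiver_ready, handshake_ps=0):
    """Completion time of each transfer on a blocking (rendezvous) channel."""
    done = []
    t = 0
    for s, r in zip(sender_ready, receiver_ready):
        # Wait for the previous transfer and for both partners, then shake hands.
        t = max(t, s, r) + handshake_ps
        done.append(t)
    return done


print(rendezvous_times([100, 200, 900], [150, 600, 700], handshake_ps=10))
# [160, 610, 910]
```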


Data channel simulation

[Figure: simulated waveforms of Pen1, Ta1, lclk1, Rp, Ap, lclk2, Ta2 and Den2; vertical axes from 0 to 3, time axis from 10 ns to 50 ns]

P-out to D-in channel:

• Clock stretching infrequent for P-ports

• Fast transfer processing

• Local clocks restart in phase with data


SAFER cryptoalgorithm

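[Figure: SAFER round i. The round inputs (8 bytes) are combined with subkey key 2i-1 by bytewise XOR and bytewise add, passed through exp and log boxes, combined with subkey key 2i, and passed through layers of 2-PHT boxes to give the round outputs (8 bytes). Output transformation: combination with key 2r+1 yields the ciphertext (8 bytes).]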

• Secret-key iterated block cipher

• Encryption and decryption slightly different

• Byte oriented: blocks of 8 bytes

• Recommended number of rounds: 10 to 12

• Additional input/output transformation

• Comes with variable key lengths (implemented version: SK-128)
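
The byte-level primitives named in the round diagram are small enough to write out. The sketch below uses the standard SAFER definitions: the exp box maps x to 45^x mod 257 with the value 256 stored as 0, the log box is its inverse, and the 2-PHT maps (a, b) to (2a + b, a + b) modulo 256. It covers only these primitives, not the key schedule or a full round, and the function names are illustrative.

```python
def build_tables():
    """Return the exp and log byte tables used by SAFER."""
    exp = [pow(45, x, 257) % 256 for x in range(256)]  # 256 is stored as 0
    log = [0] * 256
    for x, y in enumerate(exp):
        log[y] = x                                     # log inverts exp
    return exp, log


def pht2(a, b):
    """2-point pseudo-Hadamard transform on two bytes."""
    return (2 * a + b) % 256, (a + b) % 256


def ipht2(c, d):
    """Inverse of pht2, as needed for decryption."""
    return (c - d) % 256, (2 * d - c) % 256


EXP, LOG = build_tables()
print(EXP[0], EXP[1], EXP[128])   # 1 45 0
print(pht2(1, 2))                 # (4, 3)
```

Because both tables are fixed permutations of the 256 byte values, they map naturally onto a ROM, which fits the exp/log LUT block of the chip architecture.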


Cryptographic operation modes

[Figure: block diagrams of the four operation modes (ECB, CBC, CFB and OFB), built from SAFER encryption and SAFER decryption blocks with key k, initialization vector IV, and FIFOs]

p : Plaintext

c : Ciphertext

k : Key

IV: Initialization Vector
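
The four modes differ only in how blocks are chained around the cipher. The sketch below shows the chaining for 8-byte blocks using a toy stand-in cipher (bytewise addition of the key), which is not SAFER and offers no security. Full-block feedback is assumed for CFB and OFB; the feedback width used on the chip is not stated on the slide.

```python
def toy_encrypt(block, key):
    """Placeholder block cipher, NOT SAFER: bytewise add modulo 256."""
    return bytes((b + k) % 256 for b, k in zip(block, key))


def toy_decrypt(block, key):
    return bytes((b - k) % 256 for b, k in zip(block, key))


def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def ecb_encrypt(blocks, key):
    return [toy_encrypt(p, key) for p in blocks]


def ecb_decrypt(blocks, key):
    return [toy_decrypt(c, key) for c in blocks]


def cbc_encrypt(blocks, key, iv):
    out, prev = [], iv
    for p in blocks:
        prev = toy_encrypt(xor(p, prev), key)   # chain previous ciphertext
        out.append(prev)
    return out


def cbc_decrypt(blocks, key, iv):
    out, prev = [], iv
    for c in blocks:
        out.append(xor(toy_decrypt(c, key), prev))
        prev = c
    return out


def cfb_encrypt(blocks, key, iv):
    out, prev = [], iv
    for p in blocks:
        prev = xor(p, toy_encrypt(prev, key))   # cipher runs on the feedback
        out.append(prev)
    return out


def cfb_decrypt(blocks, key, iv):
    out, prev = [], iv
    for c in blocks:
        out.append(xor(c, toy_encrypt(prev, key)))
        prev = c
    return out


def ofb_crypt(blocks, key, iv):
    """OFB encryption and decryption are the same operation."""
    out, stream = [], iv
    for b in blocks:
        stream = toy_encrypt(stream, key)       # keystream ignores the data
        out.append(xor(b, stream))
    return out
```

Note that CFB and OFB use only the encryption direction of the block cipher, for both encrypting and decrypting data.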


Design Example: MARILYN SAFER cryptochip

[Figure: block diagram of the MARILYN chip with controller, key preparation, bias LUT, SAFER datapath, exp/log LUT, Mux1, Mux2, subkey RAM with RAM interface, and async. FIFO; the blocks are connected through ports labelled D, P and R and have separate clk inputs; external signals: data in, data out, control & user key in, keys ready, IV, modus]

• SAFER block cipher algorithm

• Supports ECB, CBC, CFB, OFB

• Implements both encryption and decryption

• 5 “true” clock domains

• Synchronous counterpart named “MERLIN”


Area figures

               nominal (sync.)   module       wrapper area   wrapper area
               cycle time        area         w/o sync       with sync
                                 [µm²]        [µm²]          [µm²]

controller     2.1 ns               38 457       12 501         18 315
key prep       3.4 ns              250 299       13 059         18 531
bias ROM       1.2 ns               32 805        4 104          4 104
datapath       3.3 ns              214 956       28 629         37 863
mux1           2.3 ns              150 831       29 628         39 204
mux2           2.2 ns              126 747       26 361         34 227
exp/log ROM    1.7 ns              291 672       12 708         12 708
subkey RAM     –                   237 415       17 784         17 784
async FIFO     –                    34 182       –              –

total                            1 377 364      144 774        182 736
area                             100%           10.5%          13.2%
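
The totals in the area table can be recomputed from the individual rows. The short script below does so; it treats the “–” wrapper entries of the async FIFO as zero, which is an assumption about how the table was summed.

```python
# (module area, wrapper area w/o sync, wrapper area with sync) in µm²
AREAS = {
    "controller":  (38_457, 12_501, 18_315),
    "key prep":    (250_299, 13_059, 18_531),
    "bias ROM":    (32_805, 4_104, 4_104),
    "datapath":    (214_956, 28_629, 37_863),
    "mux1":        (150_831, 29_628, 39_204),
    "mux2":        (126_747, 26_361, 34_227),
    "exp/log ROM": (291_672, 12_708, 12_708),
    "subkey RAM":  (237_415, 17_784, 17_784),
    "async FIFO":  (34_182, 0, 0),   # "–" in the table, taken as 0 here
}

module, wrap_plain, wrap_sync = (sum(col) for col in zip(*AREAS.values()))
print(module, wrap_plain, wrap_sync)

# Wrapper area relative to the module area, as in the last table row
print(f"{100 * wrap_plain / module:.2f}%  {100 * wrap_sync / module:.2f}%")

# Wrapper area as a share of module plus wrapper area
print(f"{100 * wrap_plain / (module + wrap_plain):.2f}%")
```

The three sums reproduce the totals on the slide. The first ratio rounds to the listed 10.5%; the second evaluates to about 13.27%, which the slide lists as 13.2%.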


Design flow

[Figure: design flow chart. Front end: xbm descriptions, logic equations and structural VHDL (3D, async. units); register-level VHDL (LS modules); system VHDL; synthesis (Synopsys Design Compiler); hierarchical Verilog netlist; functional verification. Back end (Cadence DFII, Silicon Ensemble, Physical Design Planner): chip floorplan, pads & power infrastructure, block & cell placement, timing-driven and non-timing-driven block layouting, block LEF and block DEF, top-level DEF, flattening, flat DEF, flat SDF, flat Verilog, GCF constraints, global tree generation, chip finishing, flat DFII layout, PDP database, GDS II out, DRC, LVS, functional verification.]


MARILYN

• Technology: 0.25 µm, 5 metal, CMOS

• Core: 1.7 x 1.7 mm²

• Die: 2 x 2 mm²

• 66 pads (→ JLCC68 package)

• Contains extra test blocks for GALS components

• GALS overhead ≈ 10%

• Throughput 232 Mbit/s at 10 rounds ECB

• Max. throughput up to 780 Mbit/s (feedthrough)


Results

Throughput

                        MARILYN       MERLIN        GALS
                        (GALS)                      benefit

ECB enc., 10 rounds     232 MBit/s    303 MBit/s    −23%
ECB enc., 12 rounds     194 MBit/s    255 MBit/s    −24%
CBC dec., 10 rounds     227 MBit/s    303 MBit/s    −25%
CBC dec., 12 rounds     191 MBit/s    255 MBit/s    −25%

Energy/MBit

                        MARILYN    MERLIN           MERLIN            GALS
                        (GALS)     w/ clk-gating    w/o clk-gating    benefit

ECB enc., 10 rounds     555 nJ     737 nJ           973 nJ            25%
CBC dec., 10 rounds     577 nJ     733 nJ           965 nJ            21%
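
The “GALS benefit” column can be reproduced from the other columns. The sketch below assumes that the energy benefit is measured against MERLIN with clock gating; that reading is inferred from the numbers rather than stated on the slide. The helper name is illustrative.

```python
def percent_change(gals, sync):
    """Change of the GALS figure relative to the synchronous one, in percent."""
    return 100 * (gals - sync) / sync


# (MARILYN, MERLIN) throughput in MBit/s
throughput = {
    "ECB enc., 10 rounds": (232, 303),
    "ECB enc., 12 rounds": (194, 255),
    "CBC dec., 10 rounds": (227, 303),
    "CBC dec., 12 rounds": (191, 255),
}
# (MARILYN, MERLIN with clock gating) energy per MBit in nJ
energy = {
    "ECB enc., 10 rounds": (555, 737),
    "CBC dec., 10 rounds": (577, 733),
}

for name, (gals, sync) in throughput.items():
    print(f"throughput {name}: {percent_change(gals, sync):+.0f}%")
for name, (gals, sync) in energy.items():
    # An energy saving is a negative change, so flip the sign for "benefit".
    print(f"energy {name}: {-percent_change(gals, sync):.0f}% saved")
```

With these inputs the script yields −23%, −24%, −25% and −25% for throughput and 25% and 21% for energy, matching the table.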


Conclusion

• Complete methodology for GALS architectures developed

• Over 25% energy reduction (energy per MBit throughput)

• Area overhead below 10%

• Throughput up to 25% lower than synchronous version (due to datapath not running at full speed, reason currently being investigated)

• GALS building blocks:

– Consist mainly of technology independent structural VHDL

– Only mutual exclusion (ME) element requires cell design

• And most important: GALS works on silicon!


Future work

• Improve testability

• Extend the communication schemes beyond point-to-point links

• Address system-level aspects like deadlock analysis and system partitioning

• Improve tool flow support (hierarchical flow, timing verification, integration of asynchronous tools)

• Possibility to withdraw a data transfer request (bus deferral)

• Power estimation theory

• Finer time-slice resolution of ring-oscillator
