Upload
others
View
12
Download
0
Embed Size (px)
Citation preview
A Globally-Asynchronous Locally-SynchronousVLSI Circuit for the SAFER Cryptoalgorithm
T.Villiger, J.Muttersbach, H.Kaeslin, N.Felber, W.Fichtner
Integrated Systems Laboratory, Swiss Federal Institute of Technology
ACiD-WG Workshop, Neuchatel, Switzerland
Funded by: Infineon Technologies, Germany
Philips Semiconductors Zurich, Switzerland
Commission for Technology and Innovation, Switzerland
IntegratedSystems
Laboratory
Outline
• GALS principle
• GALS building blocks
• Data channels
• SAFER cryptoalgorithm
• Architecture of the cryptochip
• Design flow
• Results
• Conclusion, Future work
IntegratedSystems
Laboratory Slide 1
GALS Principle
• Locally-synchronous (LS) modules perform all functionality
• Data are transferred in self-timed manner between LS modules
asynchronous wrapperasynchronous wrapper
data
handshake
locally-synchronousmodule
locally-synchronousmodule
Benefits:
➜ Facilitates clocking of SOCs
➜ Modularity enhances optimization and re-use of LS blocks
➜ Provides a hook for low-power operation
➜ Natural inclusion of totally asynchronous modules
IntegratedSystems
Laboratory Slide 2
The asynchronous wrapper
out
out�
pausable�clock
in
asynchronous wrapper�
in
out
in�
locally-synchronous�module
• Asynchronous port controllers allow for fast handshake processing
• Metastability of data is prevented by pausing the clock
• No extra latency (synchronizers, FIFOs. . . ) introduced
• Wrapper shall be assembled from predesigned elements
IntegratedSystems
Laboratory Slide 3
Pausable clock generation
• Ring oscillator for local clock generation
• Arbitration with Mutual Exclusion (ME) elements
• Programmable delayline
C
delay line
arbitration
ringoscillator
Ri�
Ai�
lclk
rclk
clkgrantME
�
delout�
Ri
lclk
Ai
rclk
clkgrant�
delout�
1 2 3 4
IntegratedSystems
Laboratory Slide 4
Delayline and Arbitration
Principle of delayline using slices Multiple Arbitration block
delayslice
fin�
fout�
binbout
delayslice
delayIn�
delayOut�
emergencymargin
delayslice
Ri1�
Ai1�
clkgrant
rclk
ME ME ME
Ri2�
Ai2�
Ri3�
Ai3�
• every slice can bypass rest ofdelayline
• delay adjustable over a wide range
• small delay increment (≈ 350psper slice with 0.25µm technology)
• safely arbitrates between incomingrequests and rising edges on rclk
• row of mutual exclusion elements
• scales linearly to multiple ports
IntegratedSystems
Laboratory Slide 5
GALS building blocks
• Port controller responsible for managing all data transfers
• Each unit is captured by a structural VHDL description
• Asynchronous port controllers achieve cycle times less than 350ps
• A Poll-type ports ask for clock stretching only to prevent metastability andensure data correctness: “proceed while waiting”
• Demand-type ports also ensure data integrity but stop the local clock as soonas they are enabled: “sleep while waiting”
• ROM (LUT) port: Interface at the ROM port is just a fake delay between Rpand Ap matching the data delay
• RAM access: Essentially a Demand-Out port with bidirectional data transfer(two data vectors in opposite direction controlled by a common handshake pair)
IntegratedSystems
Laboratory Slide 6
D-input port
0
4
Den+ Rp* / Ri+�
Ai+ Rp+ / Ap+�
Rp- /�
Ri-
Ai- /Ap-
Den- Rp* / Ri+Ai+ Rp+ / Ap+
Rp- /�Ri-
Ai- /�Ap-
1 2
7 3
56
D-input�AFSM
Den�
Ri Ai
Ap�
Rp�
Ta
Set�D
�
lclk
data�
• Performs 2-phase to 4-phase conversion
• Transfer acknowledge (Ta) indicates successful transfer
• 3D-tools available for synthesis (by K.Y. Yun)
IntegratedSystems
Laboratory Slide 7
D-input port (cont.)
Synthesis results
Ri = Rp Ri + Den Z0 + Den Ap Z0
Ap = Rp Ai + Ai Ap
Z0 = Rp Z0 + Ai Z0 + Den Rp Ap
2-level AND-ORImplementation
Ri Ap
Z0
AiRpDen Reset
Compact, fast and hazard-free implementation of async FSM
IntegratedSystems
Laboratory Slide 8
Data transfer mechanism
Clockgen1 Clockgen2
En1�
Ta1
lclk1
Ri1 Ai1
Rp�Ap
data1�
LS1�
LS2
Ri2 Ai2
En2�
Ta2
lclk2
data2�
D-Out�Port
P-InPort
Ai1�
Rp
Ap�
Ai2�
lclk1
lclk2
En1
En2
data2
Ri1�
Ri2
• The transfer channels all work with rendezvous scheme
• Push channels only, but pull channels would also be possible
• Data latches needed due to undetermined clock relation between the modules
IntegratedSystems
Laboratory Slide 9
Data channel simulation
0�
3�
0�
3�
0�
3�
0�
3�
0�
3�
0�
3�
0�
3�
0�
3�
10n�
20n�
30n�
40n�
50n�
time [s]�
Pen
1T
a1�
lclk
1
�
Rp�
Ap�
lclk
2
�
Ta2�
Den
2
�
P-out to D-inchannel:
• Clock stretchinginfrequent forP-ports
• Fast transferprocessing
• Local clocksrestart in phasewith data
IntegratedSystems
Laboratory Slide 10
SAFER cryptoalgorithm
key2
i-1
exp log� log� log� log�exp exp exp
2-PHT 2-PHT 2-PHT 2-PHT
2-PHT 2-PHT 2-PHT 2-PHT
2-PHT 2-PHT 2-PHT 2-PHT
key
2i
round inputs (8 bytes)
round outputs (8 bytes)
bytewise XOR�
bytewise add�
2-PHT
key2
r+1
ciphertext (8 bytes)
output transformation:
• Secret-key iterated block cipher
• Encryption and decryptionslightly different
• Byte oriented: blocks of 8 bytes
• Recommended number of rounds10 to 12
• Additional input/outputtransformation
• Comes with variable key lengths(Implemented version: SK-128)
IntegratedSystems
Laboratory Slide 11
Cryptographic operation modes
ECB Mode CFB mode
SAFERencryption
SAFERdecryption
p� c p�c
k�
k� -1 k
�
SAFERencryptionIV
k�
SAFERencryption
IV
p� c p�c
CBC mode OFB mode
k�
k� -1
SAFERencryption
SAFERdecryption
FIFO
p� c c
FIFOIV�
IV�
p�
k�
SAFERencryptionIV
p� ck
�
SAFERencryptionIV
p�c
p : Plaintext
c : Ciphertext
k : Key
IV: Initialization Vector
IntegratedSystems
Laboratory Slide 12
Design Example: MARILYN SAFER cryptochip
SAFER�
datapath
exp/log LUT
Mux2�16� 16�
subkey�RAM
RA
M in
terf
.
bias LUT�Key �
preparation
async. FIFO
Controller8
�
IV
�
mod
us�
Mux1�
DP
D
D D
D
DD
D
D
D
D
R
D
D D
P�
DP D
D
RD
D
D
keys
rea
dy
data in� data out�
control & user key in clk
clk�clk�
clk�
clk�
• SAFER blockcipher algorithm
• Supports ECB,CBC, CFB, OFB
• Implements bothencryption anddecryption
• 5 “true” clockdomains
• Synchronouscounterpartnamed“MERLIN”
IntegratedSystems
Laboratory Slide 13
Area figures
nominal (sync.) wrapper area wrapper area
cycle module w/o sync with sync
time [µm2] [µm2] [µm2]
controller 2.1ns 38 457 12 501 18 315
key prep 3.4ns 250 299 13 059 18 531
bias ROM 1.2ns 32 805 4 104 4 104
datapath 3.3ns 214 956 28 629 37 863
mux1 2.3ns 150 831 29 628 39 204
mux2 2.2ns 126 747 26 361 34 227
exp/log ROM 1.7ns 291 672 12 708 12 708
subkey RAM – 237 415 17 784 17 784
async FIFO – 34 182 – –
1 377 364 144 774 182 736
area 100% 10.5% 13.2%
IntegratedSystems
Laboratory Slide 14
Design flow
global tree generation
pads & pow
er infrastructure
block & cell placem
ent
�
xbm descriptions
�
logic equations
�structural V
HD
L�
system V
HD
L
synthesis
�
hierarchical Verilog netlist
�
block LE
Fblock D
EF
chip floorplan
�
top-level DE
F
flat DE
F
�
flat SDF
�
functionalverification
LV
S
GD
S II
DR
C
GD
S II out
PDP database
�
flat Verilog
non-timing driven
�
block layouting
block LE
Fblock D
EF
flattening
�
chip finishing
�
functionalverification
async. units
Cadence D
FII
�
Silicon Ensem
ble
Physical Design
Planner
3D�
Synopsys
�
design compiler
Silicon
�
Ensem
ble
timing-driven
block layouting
register-level
�
VH
DL
flat DFII layout
LS m
odules
�
GC
F constraints
IntegratedSystems
Laboratory Slide 15
MARILYN
• Technology:0.25µm, 5 metal, CMOS
• Core: 1.7 x 1.7 mm2
• Die: 2 x 2 mm2
• 66 pads (→ JLCC68 package)
• Contains extra testblocks forGALS components
• GALS overhead ≈ 10%
• Throuhgput 232Mbit/s at 10rounds ECB
• Max. throuhgput up to780Mbit/s (feedthrough)
IntegratedSystems
Laboratory Slide 16
Results
MARILYN MERLIN MERLIN GALS
(GALS) w/ clk-gating w/o clk-gating benefitth
rou
ghp
ut
EC
B
enc. 10 rounds 232 MBit/s 303 MBit/s −23%
12 rounds 194 MBit/s 255 MBit/s −24%C
BC
dec
. 10 rounds 227 MBit/s 303 MBit/s −25%
12 rounds 191 MBit/s 255 MBit/s −25%
ener
gy/
MB
it
ECB enc.555 nJ 737 nJ 973 nJ 25%
10 rounds
CBC dec.577 nJ 733 nJ 965 nJ 21%
10 rounds
IntegratedSystems
Laboratory Slide 17
Conclusion
• Complete methodology for GALS architectures developed
• Over 25% energy reduction (energy per MBit throughput)
• Area overhead below 10%
• Throughput up to 25% lower than synchronous version(due to datapath not running at full speed, reason currently being investigated)
• GALS building blocks:
– Consist mainly of technology independent structural VHDL
– Only mutual exclusion (ME) element requires cell design
• And most important: GALS works on silicon!
IntegratedSystems
Laboratory Slide 18
Future work
• Improve testability
• Extend the communication schemes beyond point to point links
• Adress system level aspects like deadlock analysis and system partitioning
• Improve tool flow support (hierarchical flow, timing verification, integration ofasynchronous tools)
• Possibility to withdraw a data transfer request (bus deferral)
• Power estimation theory
• Finer time-slice resolution of ring-oscillator
IntegratedSystems
Laboratory Slide 19