IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com © 2009 - All rights reserved
Asynchronous Circuit Design GALS Systems
Synchronous and GALS NoCs
- DAAD Workshop, Nis, Serbia, July 2009 -
Dr. Miloš Krstić
Overview
• Motivation
• Problems of the synchronous design
• Asynchronous circuit design
• GALS - State of the Art
• Synchronous and GALS NoCs
Challenges with Synchronous Design
• Most digital systems today operate synchronously.
• However, the complexity of electronic systems grows enormously.
Property / Year                             1999    2001    2005    2011
CMOS process [µm]                           0.18    0.15    0.1     0.05
Transistors on chip [Mtrans/cm²]            7       14      41      247
On-chip clock [GHz]                         1.25    1.77    3.5     10
Off-chip clock [GHz]                        0.48    0.722   1.035   1.54
Power dissipation (handheld systems) [W]    1.4     1.7     2.4     2.2
Vdd [V]                                     1.5     1.2     0.9     0.5
Classical Synchronous Paradigm
• Usually digital circuits are designed to work synchronously
[Figure: register pipeline R1, R2, R3, R4 with combinational logic blocks CL3 and CL4, clocked by CLK; a second panel adds a CLK gating signal]
Synchronous communication
• Clock edges determine the time instants where data must be sampled
• Data wires may glitch between clock edges (setup/hold times must be satisfied)
• Data are transmitted at a fixed rate - clock frequency
[Figure: clocked data waveform carrying the bit sequence 1 1 0 0 1 0]
Problems with Synchronous Design
• As clock speeds increase, clock distribution becomes difficult:
We need to minimize clock skew.
There is an upper limit to clock speed that depends on the material properties of the device.
It is not possible to propagate a signal from one side of the chip to the other within a single clock cycle.
• Worst-case performance.
• Sensitive to variations in voltage, temperature and process.
• Not modular (fixed clock rate: poor match for reusability of components).
• The clock burns a large fraction of chip power (~40-70%).
• Synchronization failure.
What is Asynchronous Design ? (I)
• Synchronization is achieved without a global clock.
• Asynchronous Communication:
Handshake mechanisms
[Figure: sender and receiver connected by request, acknowledge and data signals]
What is Asynchronous Design ? (II)
[Figure: registers R1, R2, R3, R4 with combinational logic CL3 and CL4; each register has a local controller (CTL), and the controllers are linked by REQ and ACK signals. Example: the same pipeline drawn as links/channels (REQ, ACK, DATA) along which tokens flow]
Asynchronous design styles (I)
• Bundled data (Single Rail) 4 - phase protocol
This style is very widely used because of very small and fast asynchronous controllers
[Figure: 4-phase protocol waveforms for REQ, ACK and DATA (n data wires). The REQ/ACK sequence is always the same; the data-validity window has some variations]
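As a rough illustration of the 4-phase (return-to-zero) bundled-data handshake, the Python sketch below walks a sender and a receiver through one transfer. The function name `four_phase_transfer` and the trace format are assumptions made for this example only; they do not come from the slides.

```python
def four_phase_transfer(value):
    """Return (received_value, trace) for one 4-phase bundled-data transfer.

    Each trace entry is (event, req, ack, data).
    """
    req, ack = 0, 0
    trace = []

    data = value                      # sender drives data before raising req
    req = 1                           # phase 1: request rises, data is valid
    trace.append(("req+", req, ack, data))

    received = data                   # receiver captures data while req is high
    ack = 1                           # phase 2: acknowledge rises
    trace.append(("ack+", req, ack, data))

    req = 0                           # phase 3: request returns to zero
    trace.append(("req-", req, ack, data))

    ack = 0                           # phase 4: acknowledge returns to zero
    trace.append(("ack-", req, ack, data))
    return received, trace


received, trace = four_phase_transfer(0b101)
```

Both control wires end at 0 again, which is why every transfer costs four control events.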
Bundled data
• Validity signal
Similar to an aperiodic local clock
• n-bit data communication requires n+1 wires
• Data wires may glitch when not valid
[Figure: bundled-data waveform carrying the bit sequence 1 1 0 0 1 0 together with its validity signal]
Asynchronous design styles (II)
• Bundled data (Single Rail) 2 - phase protocol
This style looks simpler and faster than 4-phase, but controllers are more complex
[Figure: 2-phase protocol waveforms for REQ, ACK and DATA (n data wires)]
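For comparison, a minimal Python sketch of 2-phase (transition) signalling follows: every transition on REQ announces new data and every transition on ACK confirms it, so a transfer costs two control events instead of four. The name `two_phase_transfers` is hypothetical and the model ignores all timing.

```python
def two_phase_transfers(values):
    """Send each value with one req transition and one ack transition."""
    req, ack = 0, 0
    received, events = [], []
    for data in values:
        req ^= 1                      # a transition (either direction) means "data valid"
        events.append(("req", req))
        received.append(data)         # receiver captures the bundled data
        ack ^= 1                      # a transition means "data consumed"
        events.append(("ack", ack))
    return received, events


received, events = two_phase_transfers([7, 8])
```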
Asynchronous design styles (III)
• 4-phase dual rail protocol
Each data bit encoded into 2 wires
Offers generation of Delay-Insensitive circuits
Introduces very big area overhead
[Figure: 4-phase dual-rail channel with ACK and 2n DATA wires; the data alternate between EMPTY and VALID codewords]
Value       d.t   d.f
EMPTY       0     0
VALID "0"   0     1
VALID "1"   1     0
Not used    1     1
Dual rail
• Two wires per bit
“00” = spacer, “01” = 0, “10” = 1
[Figure: dual-rail waveforms of the two wires of one bit, with a spacer between successive values]
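The encoding can be written down directly. The Python sketch below uses (d.t, d.f) pairs as on the slide ("00" spacer, "01" = 0, "10" = 1); the helper names are assumptions for illustration.

```python
SPACER = (0, 0)                       # "00": empty codeword between two values


def encode_bit(bit):
    """Return the (d.t, d.f) pair for a data bit."""
    return (1, 0) if bit else (0, 1)


def decode_bit(pair):
    """Return 0 or 1 for a valid pair, None for the spacer."""
    if pair == SPACER:
        return None
    if pair == (1, 1):
        raise ValueError("codeword 11 is not used")
    return 1 if pair == (1, 0) else 0


def encode_word(bits):
    """Encode n bits onto 2n wires."""
    return [encode_bit(b) for b in bits]


def is_valid(word):
    """A word is valid once every bit carries a valid codeword."""
    return all(pair in ((0, 1), (1, 0)) for pair in word)
```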
Asynchronous modules
• Signaling protocol:
reqin+ start+ [computation] done+ reqout+ ackout+ ackin+
reqin- start- [reset] done- reqout- ackout- ackin-
[Figure: asynchronous module with a DATAPATH (Data IN to Data OUT) and a CONTROL block (req in / ack in, req out / ack out) connected by start and done signals]
Asynchronous components
• Asynchronous design requires additional components and special logic
• Such components are not available in a standard synchronous design kit
• The critical components are the C-element and the Mutex
Muller C-element
a   b   z
0   0   0
0   1   no change
1   0   no change
1   1   1
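The truth table translates into a very small behavioural model. The Python class below is a sketch with assumed names (`CElement`, `update`); it captures only the logical behaviour, not delays or the transistor-level implementation.

```python
class CElement:
    """Muller C-element: output follows the inputs when they agree, else holds."""

    def __init__(self, z=0):
        self.z = z                    # stored output value

    def update(self, a, b):
        if a == b:                    # 00 -> 0, 11 -> 1
            self.z = a
        return self.z                 # 01 / 10 -> no change


c = CElement()
outputs = [c.update(a, b) for a, b in [(0, 1), (1, 1), (1, 0), (0, 0)]]
```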
Mutual Exclusion element
• ME prevents multiple event propagation
ME is used for arbitration
[Figure: MUTEX element with request inputs R1, R2 and grant outputs G1, G2, and its implementation with internal nodes x1, x2]
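The key property of the ME element is that G1 and G2 are never high at the same time. The Python sketch below models only this logical behaviour. It is an assumption of the sketch that simultaneous requests are resolved in favour of R1; a real mutex resolves the tie through metastability, in either direction.

```python
class Mutex:
    """Behavioural mutual-exclusion element with requests r1, r2 and grants g1, g2."""

    def __init__(self):
        self.g1 = 0
        self.g2 = 0

    def update(self, r1, r2):
        # A grant is released as soon as its request is withdrawn.
        if not r1:
            self.g1 = 0
        if not r2:
            self.g2 = 0
        # A grant is given only while the other side does not hold one.
        if r1 and not self.g2:
            self.g1 = 1               # tie-break assumption: R1 wins
        elif r2 and not self.g1:
            self.g2 = 1
        return self.g1, self.g2
```

A second request that arrives while the first is granted simply waits until the first request is withdrawn.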
Dual-rail logic
• Dual-rail logic requires additional logic for each logical operation
[Figure: dual-rail AND gate with inputs A.t, A.f, B.t, B.f and outputs C.t, C.f]
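One possible logic function for such a gate is sketched below in Python. It is an assumption that the gate enumerates all four valid input combinations, so that the output stays empty until both inputs are valid; the slide's figure may use a different structure, and real implementations typically use C-elements for the product terms, which this stateless model does not capture.

```python
def dual_rail_and(a, b):
    """Dual-rail AND on (t, f) pairs; returns the (c_t, c_f) pair."""
    a_t, a_f = a
    b_t, b_f = b
    c_t = a_t & b_t                                   # both inputs are a valid 1
    c_f = (a_f & b_f) | (a_f & b_t) | (a_t & b_f)     # at least one valid 0, both valid
    return c_t, c_f


ONE, ZERO, EMPTY = (1, 0), (0, 1), (0, 0)
```

With this choice, one single-rail AND becomes four product terms and an OR, which illustrates the area overhead mentioned earlier.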
Completion detection (dual-rail)
[Figure: completion detection tree; the dual-rail bits are combined through a C-element (C) into a single done signal]
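A behavioural Python sketch of dual-rail completion detection follows. It assumes per-bit validity is the OR of the two rails and that the tree behaves like one n-input C-element: done rises when all bits are valid, falls when all bits are empty, and holds otherwise. The class name is hypothetical.

```python
class CompletionDetector:
    """done = C-element over the per-bit validity signals of a dual-rail word."""

    def __init__(self):
        self.done = 0

    def update(self, word):
        valid = [t | f for (t, f) in word]    # a bit is valid when one rail is high
        if all(valid):
            self.done = 1                     # every bit has arrived
        elif not any(valid):
            self.done = 0                     # every bit has returned to the spacer
        return self.done                      # mixed state: hold the previous value
```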
Completion detection (bundled-data)
[Figure: conventional logic block with a matched delay element in parallel that turns start into done]
Muller pipeline
• "The" delay-insensitive handshake machine
• C[i] accepts 1/0 from C[i-1] only if C[i+1]=0/1
• Think of 1010101.. as waves: 10 10 10 1..
• The C-elements propagate waves precisely
• Timing depends on local delays, may vary along the pipe
• If RIGHT is quiet, pipe will fill and stall
[Figure: chain of C-elements C[i-1], C[i], C[i+1], C[i+2] between LEFT and RIGHT; each stage passes REQ to its successor and returns ACK to its predecessor]
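The rule "C[i] accepts 1/0 from C[i-1] only if C[i+1]=0/1" can be tried out with a small Python sketch. The assumptions here are a synchronous, unit-delay update of all stages, an eager LEFT environment that always offers the complement of C[0], and a quiet RIGHT environment whose acknowledge stays at 0. The function names are hypothetical.

```python
def step(c, left, right):
    """One unit-delay update of a Muller pipeline with state list c."""
    n = len(c)
    nxt = list(c)
    for i in range(n):
        prev = left if i == 0 else c[i - 1]
        succ = right if i == n - 1 else c[i + 1]
        if prev != succ:              # C-element inputs (prev, NOT succ) agree
            nxt[i] = prev
    return nxt


def fill_until_stall(n, max_steps=100):
    """Feed the pipe from LEFT while RIGHT stays quiet; return the stalled state."""
    c = [0] * n
    for _ in range(max_steps):
        left = 1 - c[0]               # eager 4-phase environment on the left
        nxt = step(c, left, right=0)
        if nxt == c:                  # nothing can move any more: pipe is full
            return c
        c = nxt
    return c


stalled = fill_until_stall(4)
```

Under these assumptions a 4-stage pipe stalls in the alternating state `[0, 1, 0, 1]`, which matches the picture of waves packed as tightly as the protocol allows.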
Micropipelines (Sutherland 89)
[Figure: Sutherland micropipeline; latches (L) separated by logic blocks, controlled by a chain of C-elements with delay elements between Rin/Aout and Rout/Ain]
Abstract Pipeline
• Bubbles
• Tokens
Valid (0 or 1, who cares) and Empty tokens
[Figure: pipeline stages holding E V V E E]
Abstract Rings
• 3 stages, 1 bubble:
3 steps for token round
6 steps to cycle
[Figure: ring states V E V, V E E, V V E, E V E, with the token and the bubble marked]
Building Blocks
• Latch, Source, Sink
• Fork
• Join (wait for all)
• Merge (wait for one)
• MUX (inputs 0, 1)
• DEMUX (outputs 0, 1)
• Function Block (Join; CL; Fork)
Describing Asynchronous Circuits - STGs
[Figure: signal transition graph (STG) with transitions A+, B+, A–, B– and the corresponding waveforms of A and B; A is an input, B is an output]
Control specification – C element
[Figure: STG of a C-element with inputs A, B and output C (transitions A+, B+, C+, A-, B-, C-), next to the C-element symbol]
Control specification – FIFO Controller
[Figure: FIFO controller (FIFO cntrl) with left handshake Ri/Ao and right handshake Ro/Ai, its STG over Ri+, Ao+, Ri-, Ao-, Ro+, Ai+, Ro-, Ai-, and an implementation using C-elements]
A simple filter: specification
y := 0;
loop
    x := READ (IN);
    WRITE (OUT, (x+y)/2);
    y := x;
end loop
[Figure: filter block with input channel IN (Rin, Ain) and output channel OUT (Rout, Aout)]
J. Cortadella - Introduction to asynchronous circuit design: specification and synthesis
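The specification above is a two-tap averaging filter. A direct Python transcription is sketched below; it replaces the endless loop and the READ/WRITE channel operations with a finite input list, and it assumes real-valued division, which the specification does not state.

```python
def averaging_filter(samples):
    """Return (x + y) / 2 for each input x, where y is the previous input."""
    y = 0                             # y := 0
    out = []
    for x in samples:                 # x := READ(IN)
        out.append((x + y) / 2)       # WRITE(OUT, (x + y) / 2)
        y = x                         # y := x
    return out


result = averaging_filter([2, 4, 6])
```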
A simple filter: block diagram
[Figure: datapath IN -> x -> y, with adder + producing OUT; the control block has handshakes Rin/Ain, Rout/Aout, Rx/Ax, Ry/Ay and Ra/Aa]
• x and y are level-sensitive latches (transparent when R=1)
• + is a bundled-data adder (matched delay between Ra and Aa)
• Rin indicates the validity of IN
• After Ain+ the environment is allowed to change IN
• (Rout,Aout) control a level-sensitive latch at the output
A simple filter: control spec.
[Figure: filter block diagram and the control specification as an STG over the handshake pairs Rin/Ain, Rx/Ax, Ry/Ay, Ra/Aa and Rout/Aout]
A simple filter: control impl.
[Figure: the STG over Rin/Ain, Rx/Ax, Ry/Ay, Ra/Aa and Rout/Aout, and a control implementation built around a C-element that connects Rin, Ain, Rx, Ax, Ry, Ay, Ra, Aa, Rout and Aout]
Taking delays into account
[Figure: STG with transitions x+, x-, y+, y-, z+, z- and a gate-level circuit with signals x, y, z and internal nodes x', z']
Delay assumptions:
• Environment: 3 time units
• Gates: 1 time unit
events:  x+  x'-  y+  z+  z'-  x-  x'+  z-  z'+  y-
time:    3   4    5   6   7    9   10   12   13   14
Taking delays into account
[Figure: the same STG and circuit as on the previous slide, annotated "very slow" and "failure !"]
Delay assumptions: unbounded delays
events:  x+  x'-  y+  z+  x-  x'+  y-
time:    3   4    5   6   9   10   11
Gate vs wire delay models
• Gate delay model: delays in gates, no delays in wires
• Wire delay model: delays in gates and wires
Delay models for async. circuits
• Bounded delays (BD): realistic for gates and wires.
Technology mapping is easy, verification is difficult
• Speed independent (SI): Unbounded (pessimistic) delays for gates and “negligible” (optimistic) delays for wires.
Technology mapping is more difficult, verification is easy
• Delay insensitive (DI): Unbounded (pessimistic) delays for gates and wires.
DI class (built out of basic gates) is almost empty
• Quasi-delay insensitive (QDI): Delay insensitive except for critical wire forks (isochronic forks).
Formally, it is the same as speed independent
In practice, different synthesis strategies are used
[Figure: diagram relating the delay-model classes BD, SI/QDI and DI]
Desynchronization - concept
• Start with synchronous design
• Replace clock with local handshake
• Use standard CAD tools
• Does not change datapath
• Guaranteed correctness
* Eyal Friedman, Desynchronization - From Synchronous to Asynchronous design, Seminar in VLSI Architecture, Technion, Israel, Spring 2008
Desynchronization - flow steps
• Main assumptions:
Normal Combinatorial logic, DFF
single clock
single clock edge
[Figure: pipeline D-FF, CL, D-FF, CL, D-FF driven by a single CLK]
Desynchronization flow step #1
• Replace DFF by M+S latches
[Figure: the clocked D-FF pipeline before and after each D-FF is replaced by a master (M) and slave (S) latch pair]
Desynchronization flow step #2
• Add matched delays
• Respect bundling assumption
Delay > Tpd of CL
Delay serves as completion signal
[Figure: the M/S latch pipeline before and after a matched delay is added in parallel with each CL block; still driven by CLK]
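The bundling condition "Delay > Tpd of CL" is a simple inequality over the path delays of the combinational block. The Python sketch below checks it and sizes a matched delay with a safety margin. The 10% margin and the path delays in the usage line are made-up illustration values, and both function names are hypothetical.

```python
def bundling_ok(delay_ps, path_delays_ps):
    """True if the matched delay exceeds the worst-case logic delay Tpd."""
    return delay_ps > max(path_delays_ps)


def matched_delay(path_delays_ps, margin=0.10):
    """Worst-case path delay plus a relative safety margin (assumed 10%)."""
    t_pd = max(path_delays_ps)
    return t_pd * (1 + margin)


paths = [420, 380, 510]               # hypothetical path delays in ps
delay = matched_delay(paths)
```

Because the delay is sized for the slowest path, the completion signal is always safe but never faster than the worst case of that block.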
Desynchronization flow step #3
• Replace clock by local handshake controllers
[Figure: the M/S latch pipeline with matched delays before and after CLK is replaced by one local handshake controller (ctrl) per latch]
Why Asynchronous Design?
• We are used to sync design
Logic and timing assumptions are simpler, but not true in reality
Currently it is very hard to solve big problems of synchronous design like clock skew, big power consumption, process variability ...
• Common arguments for asynchronous design:
Low power ?
High speed ?
Low emission ?
Low sensitivity to PVT (Process, Voltage, Temperature) variations ?
High modularity (SoC) ?
No clock distribution and timing problems (works) ?
Secure chips ?
Why not Asynchronous Design?
• Overhead (area, speed, power)
• Hard to design
Non-decomposable to small combinatorial logic blocks
Converting synchronous design to asynchronous typically fails
• Few CAD tools
There is no real complete design flow available
There is only one commercial async EDA vendor available (Handshake Solutions) with a very specific design flow (HASTE)
• Hard to test
Asynchronous test methods are not present yet (or not mature enough), and it is difficult to go into any production without proper testing
Available tools
• There are several tools available for automation of Asynchronous Design
• Most tools are developed at universities
• Two groups of tools: for synthesis of asynchronous controllers and for design of complete systems
• Group I
Minimalist
Petrify
3D
• Group II
BALSA
TAST
HASTE
Minimalist
• Developed at Columbia University
• “burst-mode” synthesis package
• based on synthesis of asynchronous FSMs
• integrates synthesis, testability and verification tools
• Good side
Produces hazard-free control circuits
Contains several different algorithms for synthesis
Can provide generalized C-element based mapping and also behavioral Verilog
• Bad side
Does not support arbitration or extended burst-mode (EBM)
No optimal algorithm selection
Petrify
• Designed by J. Cortadella, M. Kishinevsky, A. Kondratyev, L. Lavagno, A. Yakovlev
• Synthesis of Asynchronous controllers defined as Petri Nets or Signal Transition Graphs (STG)
• Good side
Produces optimal hazard-free control circuits
Can provide generalized C-element based mapping, complex-gate mapping and mapping to the technology libraries
• Bad side
Supports only asynchronous design, not mixed sync-async
With increased number of signals, synthesis time grows exponentially
Suitable for relatively small controllers
3D
• Produced by Kenneth Yun
• “Extended Burst-Mode” synthesis package
• Good side
Produces hazard-free control circuits
Supports restricted multiple-input change (input burst) with don't-care inputs
Supports input choices based on sampling possibly glitchy signals
Suitable for mixed sync-async systems (like GALS)
• Bad side
No technology mapping
No optimal algorithm selection
No support or further development
TAST
• Produced by TIMA Laboratory, France
• TAST is a compiler/synthesizer of asynchronous digital circuits from a high-level communication description language
Input is the CHP language, which can describe Petri Nets.
It uses VHDL as the format for behavioral and post-synthesis simulation.
Produces QDI (dual-rail, 1-M code rail) circuits
• Good side
Produces complete asynchronous system and provides full design-flow
• Bad side
Uses QDI style, which gives very big area overhead
Output circuits are not optimized
Not available at the moment
TAST Design flow
BALSA
• Produced by the University of Manchester
• BALSA is a compiler/synthesizer of asynchronous digital circuits from a high-level communication description language
Input is BALSA language developed specially for this package
Produces Bundled data, Dual-rail, 1-M code rail circuits
• Good side
Produces complete asynchronous system and provides full design-flow
• Bad side
Gives large overhead compared with manual design (up to 300 %)
Not all tools are freely available
BALSA Design Flow
Asynchronous Success Stories - Philips
Philips developed its own full design flow based on TANGRAM language
Design flow also contains design for testability
Asynchronous Demonstrators
DCC error corrector - 1993-1994 - Low Power
80C51 - 1995 - Low Power, Low EMI
Smartcards - 1998 - Low Power, Security
DCC error corrector    date     area [mm²]   power [mW]
synchronous            93       3.4          2.60
async (dual-rail)      93/05    7.0          0.41
synchronous            94       3.3          0.60
async (single rail)    94/09    3.9          0.08
Asynchronous Success Stories - Philips 80c51 (I)
• Application - Pager baseband controller
First asynchronous µC (microcontroller) ever on the market
• Motivations for asynchronous solution of 80c51
Low power
Low EMI for easy integration
Asynchronous Success Stories - Philips 80c51 (II)
• Low power issue
Circuit is only active when and where needed
Asynchronous Success Stories - Philips 80c51 (III)
• Low current peaks
Asynchronous Success Stories - Philips 80c51 (IV)
• Low EMI
Asynchronous Success Stories - RAPPID
• RAPPID - Revolving Asynchronous Pentium Processor Instruction-length Decoder
• Instruction Length Decoder was performance bottleneck in ca. 1995-vintage CISC processors
• Potential for optimization for common cases (RISC-like)
• Results
Developed a novel aggressive asynchronous method
About 3x throughput T=3x
About one half latency L=2x
About one half power P=2x
About same area A=0.8x
Namely, this is a TxLxPxA improvement on the order of 10x
Asynchronous Success Stories - Amulet
• The Amulet group was formed at the University of Manchester
• Amulet1 (1994)
60000 transistors in 1.0 µm, ARM6 instruction set
Half the instruction throughput with the same energy efficiency as ARM6
• Amulet2e (1996)
450000 transistors in 0.5 µm, ARM7 compatible
Still half the performance of a synchronous chip
• Amulet3i (2000)
800000 transistors in 0.35 µm, ARM9 compatible
Same performance as the synchronous solution with an equal or marginally better energy efficiency
Globally Asynchronous Locally Synchronous (GALS) Systems
GALS Technique
• GALS is the abbreviation for Globally-Asynchronous Locally-Synchronous systems.
• GALS techniques have the potential to solve some of the most challenging design issues of SoC integration of communication systems.
GALS method
[Figure: synchronous blocks 1, 2 and 3, each enclosed in an asynchronous wrapper and exchanging Req, Ack and Data; alternatively, the wrapped blocks exchange Data through network nodes]
• GALS can be used on its own or within the NoC concept
GALS as a Powerful Design Technique
• In wireless communication systems, GALS can address the main design challenges.
• GALS makes data transfer between the blocks very easy.
• Design problems such as timing closure or clock-tree generation are limited to the level of much smaller local blocks.
• Decoupling the local blocks from a central clock source reduces spectral noise considerably.
• Power saving is automatically integrated in the asynchronous wrapper.
Power reduction with GALS
[Figure: power distribution in a high-performance CPU, with shares for DATAPATH, MEMORY, CONTROL, I/O and CLOCK]
• The clock signal is the dominant source of power consumption.
• First estimations showed that power savings of about 30% could be expected in the clock net from applying GALS.
• Recently, some more pessimistic power estimation figures have been presented.
• GALS techniques allow the frequency and voltage levels to be set independently for each locally synchronous module.
• When using dynamic voltage scaling (DVS), an average energy reduction of up to 30% can be reached.
Potential for reducing EMI with GALS
• We have simulated noise generated on the power supply line in the synchronous and request-driven GALS system.
[Figure: two power-supply noise spectra, one for the synchronous and one for the request-driven GALS system; x-axis: Frequency [GHz], 0.5 to 4.5; y-axis: dB, -20 down to -140]
GALS introduces a reduction of about 20 dB
GALS Opportunities – 3D Integration
• 3D integration can be a very interesting application field
[Figure: 3D stack of Sensor, A/D, Memory, DSP and Comm layers]
GALS Opportunities - NoCs
• Another interesting application can be Networks on Chips and MP SoCs (Multi-Processor System-on-Chip)
[Figure: Network on Chip; master and slave IP cores, each attached through an interface (IF) to one of several interconnected switches]
GALS Opportunities – Process Scaling and Variability
• Asynchronous design gives average-case performance in comparison to worst-case performance of synchronous system
Variability on the Vth makes individual transistors faster or slower, more or less energy consuming.
[Figure: Vth distribution around VtNom for 65 nm minimum-size transistors; Vth variability = +/- 30% (+/- 3σ)]
GALS Methods
• GALS based on synchronizers
• GALS based on asynchronous FIFOs
• GALS based on pausible clocking
GALS with the Synchronizers
[Figure: handshake converter between a clockless domain and a clocked domain; labels: req/ack on each side, 2-phase handshake, 4-phase handshake, data, clock]
GALS with FIFOs
[Figure: Locally Synchronous Module 1 (Clock 1) writes into a FIFO via Wr_clk, Wr_en, Data and full; Locally Synchronous Module 2 (Clock 2) reads from it via Rd_clk, Rd_en, Data, empty and Rd_valid]
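The FIFO interface can be illustrated with a small behavioural Python model. The signal names (Wr_en, Rd_en, full, empty, Rd_valid) follow the figure labels; the class itself, the default depth and the method names are assumptions. The sketch does not model the two clock domains or the synchronization of the full and empty flags, which is the hard part of a real dual-clock FIFO.

```python
from collections import deque


class GalsFifo:
    """Bounded FIFO with full/empty flags between two synchronous modules."""

    def __init__(self, depth=4):
        self.depth = depth
        self.items = deque()

    @property
    def full(self):
        return len(self.items) >= self.depth

    @property
    def empty(self):
        return len(self.items) == 0

    def write(self, wr_en, data):
        """Call once per Wr_clk edge; returns True if the data was stored."""
        if wr_en and not self.full:
            self.items.append(data)
            return True
        return False

    def read(self, rd_en):
        """Call once per Rd_clk edge; returns (rd_valid, data)."""
        if rd_en and not self.empty:
            return True, self.items.popleft()
        return False, None
```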
Asynchronous wrappers
• A GALS system usually contains synchronous islands that communicate with each other through asynchronous wrappers
• An asynchronous wrapper surrounds each locally synchronous island
The wrapper consists of a pausable clock and input and output ports
Classical Pausible Clocking GALS approach
[Figure: Locally Synchronous Module 1 with Local Clock Generator 1 and an output port inside Asynchronous Wrapper 1, connected by data and handshake signals to the input port of Locally Synchronous Module 2 with Local Clock Generator 2 inside Asynchronous Wrapper 2; signals stretch1 and stretch2 act on the local clock generators]
• Published in Jens Muttersbach et al., "Globally-Asynchronous Locally-Synchronous Architectures to Simplify the Design of On-Chip Systems", in Proc. of ASIC/SOC Conference, pp. 317-321, Sept. 1999.
Pausable Clock Generator
[Figure: pausable clock generator with an ARBITER, a C-element and a DELAY LINE; signals ACKI1/2, REQI1/2, LCLK, RCLK, STOPI, RCLKD, clk_grant, rclk, rclkd. The delay line is a chain of DELAY SLICEs (ports fin, fout, bout, bin) with control inputs cc1, cc2, ..., ccn]
Main challenges of the typical GALS methods
• In many solutions, data transfer throughput is a critical problem.
Most of them can perform a data transfer only every second cycle of the local clock.
• Some described circuits can theoretically transfer data every clock cycle.
However, intensive stretching of the pausable clock generator will significantly diminish the practical performance.
• The latency of the transferred data is not known in advance and may vary significantly from one data transfer to the next.
• It is not very practical to use ring oscillators for local clock generation.
• All solutions are oriented towards very general applications.
They are not optimised for specific systems and environmental demands.
Basic concept of the request-driven operation
• This approach covers point-to-point communication with very intensive but bursty data transfer.
• When receiving an input burst, a GALS block can operate in a request-driven mode.
• When there is no input activity, the data stored inside the locally synchronous pipeline has to be flushed out.
Then a local clock generator drives the GALS blocks.
• A Time-out function controls the transition from request driven operation to local clock generation mode.
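The mode switching described above can be sketched as a small Python model. Everything concrete in it is an assumption for illustration: time is discretised into steps, each request produces one clock pulse, the time-out is counted in quiet steps, and `pipeline_depth` stands for the number of local clock pulses needed to flush the synchronous pipeline.

```python
def wrapper_clock(requests, timeout=3, pipeline_depth=2):
    """Return one (mode, clock_pulse) pair per time step.

    requests: iterable of 0/1, 1 meaning an input request in that step.
    """
    quiet = 0                         # steps since the last request
    to_flush = 0                      # clock pulses still needed to empty the pipeline
    out = []
    for r in requests:
        if r:
            quiet = 0
            to_flush = pipeline_depth             # new data entered the pipeline
            out.append(("request-driven", 1))     # the request itself clocks the block
        else:
            quiet += 1
            if quiet > timeout and to_flush > 0:
                to_flush -= 1
                out.append(("local-clock", 1))    # time-out expired: flush with local clock
            else:
                mode = "idle" if to_flush == 0 else "waiting"
                out.append((mode, 0))
    return out


trace = wrapper_clock([1, 1, 0, 0, 0, 0, 0, 0])
```

In this model the block receives no clock pulses at all once the pipeline is flushed, which is the behaviour that makes the scheme comparable to clock gating.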
Request-driven asynchronous wrapper
• Local clock can be generated either internally or externally.
[Figure: request-driven asynchronous wrapper; a locally synchronous module with an input port and an output port (each with Data and handshake signals), a local clock generation block and a time-out detection block; the module is clocked either by the request-driven clock or by the local clock]
What can we gain from this GALS technique?
• Reliable and fast transfer of large bursts of data is achieved. Data transfer is possible at every clock cycle of the synchronous block.
• In request-driven operation there is no arbitration in the input port. The circuit responds immediately to input requests.
• The clock speed is determined by the master and not by the slower participant in the communication.
• The local clock can be generated internally or externally.
• The proposed architecture offers an efficient power-saving mechanism, similar to clock gating.
• EMI should be reduced due to varying delays and frequencies in the different asynchronous wrappers.
Building the wrapper components - input port
[Figure: INPUT CONTROLLER with signals REQ_INT, ACKEN, RST, ACKC, REQ_A1, ACK_A, ACK_INT, REQI1, ACKI1, ST and STOP]
• The input port has to control the dataflow according to a 'broad' 4-phase handshake protocol.
• The input port consists of a speed-independent (SI) input controller along with a few additional gates that provide glitch-free transitions of the input signals.
Input controller specification
[Figure: state graph of the input controller with states 0 to 9, grouped into idle mode, request-driven mode, transitional mode and local clock generation mode. Transition labels (input burst / output burst), in extraction order:
ACKC-, ST+ /
REQ_A1+ / REQI1+
REQ_A1+ / REQ_INT+, RST+, ACK_A+
ACKC+, REQ_A1- / REQ_INT-, RST-, ACK_A-
ACKC-, REQ_A1+ / REQ_INT+, RST+, ACK_A+
STOP+ / RST+
STOP-, ST- / RST-
ACKI1+ / ACKEN+, REQI1-
ACKC- / ACK_A+, RST+
ACKI1-, ACKC+ /
REQ_A1-, ST- / ACK_A-, RST-, ACKEN-
REQ_A1+ / REQ_INT+, RST+, ACK_A+
ST+ /]
• The input controller is modeled as an AFSM (asynchronous finite state machine).
• The controller is specified according to burst-mode requirements.
• The burst-mode AFSM is implemented as a 'Huffman machine' without explicit latches.
[Figure: Huffman machine; a hazard-free combinational network with inputs A, B, C, outputs X, Y, Z and a fed-back state of several bits]
Input controller implementation
• The burst-mode input controller is synthesized using the 3D tool, which supports 2-level hazard-free logic minimization and achieves optimal state assignment:
REQ_INT = REQ_A1 REQ_INT + ACKC' REQ_INT + REQ_A1 ACKC' ST' ACKEN'
ACK_A = ACKC' REQ_INT + REQ_A1 RST + ACKC' ST ACKI1' ACKEN Z0' + REQ_A1 ACKC' ST' ACKEN'
ACKEN = ACKI1 + REQ_A1 ACKEN + ST ACKEN
RST = STOP + ACKC' REQ_INT + REQ_A1 RST + ST RST + ACKC' ST ACKI1' ACKEN Z0' + REQ_A1 ACKC' ST' ACKEN'
REQ_I1 = REQ_A1 ST ACKI1' ACKEN'
Z0 = ACKI1 + REQ_A1' ACKC + REQ_A1' ST' Z0 + ACKC' ACKEN Z0 + ACKC ACKEN' Z0
• Logic equations are automatically converted into synthesizable structural VHDL code with our 3DC tool.
• Formal analysis of the asynchronous wrapper is performed.
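The logic equations of the input controller can be exercised in software before synthesis. The following is only a minimal sketch, assuming a zero-delay single evaluation pass over the six equations; the function name `eval_controller` and the dictionary representation are hypothetical and are not part of the 3D or 3DC tools.

```python
def eval_controller(s):
    """One combinational pass over the input-controller equations.

    `s` maps signal names to 0/1.
    Inputs: REQ_A1, ACKC, ST, ACKI1, STOP.
    Fed-back values: REQ_INT, ACKEN, RST, Z0.
    """
    n = lambda x: 1 - x  # logical NOT on 0/1 values
    req_a1, ackc, st = s["REQ_A1"], s["ACKC"], s["ST"]
    acki1, stop = s["ACKI1"], s["STOP"]
    req_int, acken, rst, z0 = s["REQ_INT"], s["ACKEN"], s["RST"], s["Z0"]

    # Product terms shared by several equations
    t_start = req_a1 & n(ackc) & n(st) & n(acken)
    t_local = n(ackc) & st & n(acki1) & acken & n(z0)

    return {
        "REQ_INT": (req_a1 & req_int) | (n(ackc) & req_int) | t_start,
        "ACK_A": (n(ackc) & req_int) | (req_a1 & rst) | t_local | t_start,
        "ACKEN": acki1 | (req_a1 & acken) | (st & acken),
        "RST": (stop | (n(ackc) & req_int) | (req_a1 & rst)
                | (st & rst) | t_local | t_start),
        "REQ_I1": req_a1 & st & n(acki1) & n(acken),
        "Z0": (acki1 | (n(req_a1) & ackc) | (n(req_a1) & n(st) & z0)
               | (n(ackc) & acken & z0) | (ackc & n(acken) & z0)),
    }


# All signals low, then the input burst REQ_A1+ is applied
names = ["REQ_A1", "ACKC", "ST", "ACKI1", "STOP",
         "REQ_INT", "ACK_A", "ACKEN", "RST", "REQ_I1", "Z0"]
idle = dict.fromkeys(names, 0)
after_request = eval_controller({**idle, "REQ_A1": 1})
```

In this sketch, raising REQ_A1 from the all-zero state asserts REQ_INT, RST and ACK_A in one pass, which agrees with the burst REQ_A1+ / REQ_INT+, RST+, ACK_A+ in the specification. A real hazard analysis needs gate delays and is not covered here.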
VHDL description of a port
-- inverters
UN1: inv1x port map (ackc,t3);
UN2: inv1x port map (st,t4);
UN3: inv1x port map (clk1,t5);
UN4: inv1x port map (req,t6);
UN5: inv1x port map (ackeni,t7);
UN6: inv1x port map (endi,t8);
UN7: inv1x port map (z0,t9);
UN8: inv1x port map (z1,t10);
UN8i: inv1x port map (dvsi,t11);
-- reqcix
U6: and2ix port map (reqci,ackc,t1);
U7: and2x port map (req,reqci,t28);
U8: and4x port map (req,t3,t4,t9,t12);
U9: or3x port map (t1,t28,t12,reqcix);
-- ackix
U7i: and2x port map (req,reseti,t2);
U7ii: and2x port map (st,acki,t31);
U13: and3x port map (req,t3,z0,t13);
U14: or5x port map (t1,t13,t12,t2,t31,ackix);
-- ackenix
U10: and2x port map (ackc,ackeni,t14);
U12: and2x port map (t9,ackeni,t15);
U15: or3x port map (t15,t14,clk1,ackenix);
-- resetix
U11: and3x port map (st,t3,z0,t16);
U19: or5x port map (endi,t1,t2,t12,t16,resetix);
-- reqiix
U17: and2x port map (t7,t9,t17);
U18: and3x port map (req,st,t5,t18);
U20: and2x port map (t18,t17,reqiix);
-- z0x
U25: and2x port map (req,z0,t22);
U26: and2x port map (st,z0,t23);
U23: and3x port map (ackc,t5,ackeni,t21);
U27: or4x port map (t21,t22,t23,endi,z0x);
-- z1x
U28: and2x port map (t6,ackc,t24);
U29: and2x port map (ackc,z1,t25);
U30: and3x port map (t6,t4,z1,t26);
U32: or3x port map (t25,t26,t24,z1x);

-- example gate model with a fixed 100 ps delay
entity and2x is
  port (a,b: in std_logic; c: out std_logic);
end and2x;
architecture struc of and2x is
  attribute DONT_TOUCH_NETWORK of a,b,c: signal is true;
begin
  c<=(a and b) after 100 ps;
end struc;
Externally-driven GALS Wrapper
[Figure: externally-driven GALS wrapper. An asynchronous wrapper surrounds the locally synchronous module and contains an input port, an output port, a clock management unit (CMU) and time-out detection. Labels: handshake signals, Data_in, Data_out, request driven clock, externally generated clock, external clock. Legend: adapted block, reused blocks.]
Clock Management Unit
[Figure: clock management unit schematic. Signals: ECLK, external_clock, REQI1, Stretch, ACKI1, clk_grant, STOPI, ste, cg, sti. Elements: MUTEX (M1, M2, M3), C-elements (C1, C2, C3), AND2, OR2, INV1.]
Baseband processor for WLAN
• The goal of one of our projects was to develop a wireless broadband communication system in the 5 GHz band.
• The modem is compliant with the IEEE802.11a WLAN standard
• System uses Orthogonal Frequency Division Multiplexing (OFDM) with data rates ranging from 6 to 54 Mbit/s.
• The synchronous baseband processor was implemented as an ASIC (700k gates).
Structure of the synchronous baseband processor
• Baseband processor includes receiver and transmitter datapath structure.
• Very complex blocks are implemented such as Viterbi decoder, FFT, IFFT, CORDIC processors, ...
[Figure: block diagram of the baseband processor. Legend: 80 Msps block, 20 Msps block.
Transmitter: input buffer, scrambler, signal field generator, encoder, interleaver, mapper, pilot insertion, pilot scrambler, IFFT, guard interval insertion, preamble insertion.
Receiver: synchronizer datapath, channel estimator, demapper, deinterleaver, Viterbi decoder, encoder, interleaver, mapper, descrambler, parallel converter, FFT, synchronizer tracking, buffer 20 - 80, buffer 80 - 20.]
Design challenges in the baseband processor
• The design of the baseband processor involves challenges such as:
- several clock domains,
- global clock tree generation,
- large number of clock leaves (36 k flip-flops),
- clock skew handling,
- timing closure between the different modules,
- clock gating,
- power consumption,
- EMI.
• Request–driven GALS architecture was developed as a possible solution for those problems.
GALS partitioning
[Figure: GALS partitioning of the baseband processor. The block diagram of the synchronous processor is divided into the GALS blocks Tx_1, Tx_2, Tx_3 and Tx_int (async-sync interface) on the transmitter side and Rx_1, Rx_2, Rx_3, Rx_TRA and Rx_int (async-sync interface) on the receiver side. Additional blocks: token rate adaptation, FIFO TA, activation interface.]
• The partitioning process has to take into account possible power saving.
Legend: 80 Msps block, 20 Msps block, rate adaptation block, interface block
Test strategy
• We are using a hardware tester which is strictly cycle based and cannot react to asynchronous output signals of the circuit.
• The GALS arbitration processes preclude cycle level determinism.
• We want to have a possibility to run very complex functional tests internally.
• Applied test technique should support system diagnosis.
• A test strategy based on Built-In Self-Test (BIST) is proposed.
• BIST reduces the effort for generating a test program and enables us to use a synchronous tester.
Design for Testability in GALS
• Test pattern generators (TPG) and test data evaluators (TDE) are based on the linear feedback shift register structure with embedded additional logic.
• A central BIST controller performs control of the test procedure.
• We can run hierarchical tests.
• This BIST technique can be used as a method for prototype verification.
• In combination with the scan approach, BIST can be even used as a basis for the manufacturing test.
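The LFSR basis of the test pattern generators and test data evaluators can be sketched in a few lines. This is only an illustration under stated assumptions: the 4-bit width, the feedback polynomial x^4 + x^3 + 1 and the function names are chosen here for brevity and do not describe the actual BIST logic of the chip.

```python
def lfsr_step(state, width=4, taps=(4, 3)):
    """Advance a Fibonacci LFSR by one clock; taps are 1-indexed bit positions."""
    feedback = 0
    for t in taps:
        feedback ^= (state >> (t - 1)) & 1
    return ((state << 1) | feedback) & ((1 << width) - 1)


def tpg_patterns(seed, count):
    """Pattern generator: yield `count` pseudo-random patterns from `seed`."""
    state = seed
    for _ in range(count):
        yield state
        state = lfsr_step(state)


def tde_signature(responses, seed=0):
    """Evaluator: compact a stream of 4-bit responses into one signature."""
    signature = seed
    for r in responses:
        signature = lfsr_step(signature) ^ r
    return signature


patterns = list(tpg_patterns(seed=1, count=15))
good = tde_signature(patterns)   # stand-in for a fault-free response stream
faulty = list(patterns)
faulty[3] ^= 1                   # inject a single-bit error
bad = tde_signature(faulty)
```

With these taps the generator visits all 15 non-zero states before repeating, and a single-bit error in the response stream always changes the signature, so a synchronous tester only has to read out and compare one value at the end of the test.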
[Figure: BIST structure of the GALS baseband processor. Blocks: Tx_1, Tx_2, Tx_3, Tx_int, Rx_1, Rx_2, Rx_3, Rx_TRA, Rx_int, FIFO_TA, activation interface. Test pattern generators TPG0 to TPG4 and test data evaluators TDE0 to TDE10 are attached to the blocks; a BIST internal loop is marked.]
Design flow
• We have used the IHP 0.25 µm CMOS process.
• The asynchronous wrapper is equivalent to about 1.3 k inverter gates.
The tunable clock generator alone accounts for 0.9 k gates.
• The asynchronous wrapper has a throughput of up to 150 Msps in request-driven mode and 100 Msps in local mode.
This application needs 80 Msps.
[Figure: design flow.
Asynchronous wrappers: AFSM specification, 3D logic synthesis, 3DC tool (translation from 3D to structural VHDL), formal analysis (LoLA).
Synchronous blocks: functional specification, VHDL description.
Steps: abstract behavioural simulation, gate mapping, realistic behavioural simulation, timing driven synthesis, power estimation, postsynthesis simulation, layout, back annotation, power estimation, tape-out.
Tools named in the figure: Synopsys DC, Cadence Silicon Encounter, Model Sim, Prime Power, LoLA.]
Area and power distribution
• Area and power statistics are based on the synthesized netlist data.
Locally synchronous blocks occupy around 90% of the total area; the BIST circuitry requires around 3.5%, interface blocks 2.9%, and asynchronous wrappers 2%.
• Based on the switching activities in a realistic transceiver scenario, power estimation with the Prime Power tool has been performed.
Synchronous datapath logic uses most of the power (around 52.4%), local synchronous clock trees use 34.5%, async-to-sync interfaces 7%, and asynchronous wrappers 2.9%.
• After layout, the estimated power consumption is 324.6 mW.
Implementational results
• Our GALS baseband processor is fabricated and tested.
• The total number of pins is 120 and the silicon area including pads is 45.1 mm².
• Measured dynamic power dissipated in the purely synchronous baseband processor was 332 mW; for the GALS baseband processor it was slightly lower, at 328 mW.
[Figure labels: Receiver, Transmitter]
Improving System Integration with GALS
• Synchronous baseband processor challenges and how GALS addresses them:
- several clock domains: solved by the GALS architecture,
- global clock tree generation: no global clock in GALS,
- large number of clock leaves: clock leaves distributed over GALS blocks,
- clock skew handling: clock skew is reduced from 660 ps to 486 ps,
- timing closure between blocks: communication between the blocks through handshaking,
- clock gating: clock gating embedded in the asynchronous wrapper.
EMI measurement (I)
• The supply voltage variation spectrum of the inner processor core is measured.
[Figure: supply voltage variation spectrum of the synchronous and the GALS baseband processor. Vertical axis: dB (0 to -70); horizontal axis: MHz (0 to 500). The annotated difference is about 5 dB.]
EMI measurement (II)
• Additionally, cycle-to-cycle instantaneous supply voltage peaks are reduced from 140 mV (synchronous design) to less than 100 mV (GALS).
• This reduction can be very important for mixed-signal designs and for secure systems.
• An application with fine-grained GALS partitioning can lead to results closer to theoretical maximum reduction.
Conclusions
• There are several asynchronous designs currently on the market
Asynchronous design is used most successfully in medium-complexity, medium-performance circuits
• Future applications
GALS, large networks on chip (NoCs)
3D integration
Some local blocks in a GALS system could then be asynchronous
Asynchronous circuitry can provide lower EMI for SoCs
• Design and test flow remains a problem
Synchronous and GALS Networks on Chips
Synchronous and GALS NoCs
• Today on-chip design is more and more communication-centric
• Classical topologies are not sufficient (point-to-point, mesh, bus, etc.)
• Shared bus = low performance
Bandwidth is shared
Bus width (bits) relatively small
Global clock frequency limited
• Disadvantage of multiple buses
Not scalable, not generic
• Promising alternative could be Networks on Chip (NoCs)
• NoCs can be implemented completely synchronously, mesochronously, or in GALS fashion
NoC Paradigm
• Apply network protocols to SoCs
• Network:
Provides communication
Satisfies quality-of-service requirements:
Reliability
Performance: Throughput, latency, ..
Power ?
• Additional requirements unique to NoC
Energy bounds
Area
Fit it to the standard design flow
Switching Network Basics
• Transport Layer: Msg end-to-end
Implemented using network adapters
Assembly and disassembly of the packets at source/destination
• Network Layer: Pkt end-to-end
Implemented using routers
Routers decide the routing path to destination
header of the packet
topology knowledge
Scalable distributed system: load shared between routers
• Data-Link Layer : Pkt over link
Packets: header, payload, trailer
Error correction (on packet): redundancy, error correction codes
* Technion - Asynchronous NoC - Nikolai Samolazov
Bus vs. Network Arguments
                    BUS                                    NoC
Scalability:        Every IP adds parasitic capacitance    Only P2P connections
                    Timing is difficult                    Can be pipelined
                    Bus arbiter performance                Load shared by routers
Bandwidth:          Limited and shared by all IP           Scales with network size
Latency:            Zero when granted control              Network latency always exists
Cost:               Low area                               Significant area
Design complexity:  Simple: well known and understood      Requires changes in HW and sometimes SW levels
Hybrid Network
• Shared Busses as first level communication medium
• NoC routers as main communication devices
Homogeneous NoC
[Figure: homogeneous NoC in which every node is a functional unit (FU).]
* NoC General Concepts - Andreas Ehliar - Per Karlström
Heterogeneous NoC
[Figure: heterogeneous NoC connecting functional units (FU) with MUL, ALU and DSP blocks.]
Quality of Service
• Guaranteed latency
• Guaranteed bandwidth
• Correctness
Design Issues - Architecture
[Figure: functional units (FU) connected by an on-chip network.]
NoC Design
• Architecture
Network Adapter and Router Architecture
- Asynchronous or synchronous
Network Topology
Routing Strategy
- Static Routing
- Adaptive Routing
Interconnect
- Repeaters
- Pipelining
• Design Technology
Tools and Methodologies
Simulation and validation (correctness, performance, power)
- SystemC
Design Issues - Flow Control
Design Issues - Long Wires
• Solving the global interconnect mess
Delay
Bit errors
Repeaters
Clock domains
• Create one optimized solution that can be reused
Design Issues - Long Wires
• Add flip flops to increase clock frequency
• What about ACKs?
[Figure: two NoC routers connected by a long link.]
Design Issues - Long Wires
• Bit errors on long wires will not be avoidable in the future
• Use error correcting codes
Disadvantage: More wires, more throughput needed
• Use parity bits to discover errors
Resend damaged packets
No longer possible to guarantee real-time performance
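The parity-and-resend option can be sketched as follows. This is a minimal illustration, assuming a single even-parity bit per word and a hypothetical `channel` callable that models the long wire; none of these names come from the slides. A single parity bit detects only an odd number of bit errors.

```python
def add_parity(bits):
    """Append an even-parity bit so that the number of 1s in the word is even."""
    return bits + [sum(bits) % 2]


def parity_ok(word):
    """True if the received word still has even parity."""
    return sum(word) % 2 == 0


def send_with_retry(payload, channel, max_tries=3):
    """Resend until the parity check passes; return (payload, tries used)."""
    for attempt in range(1, max_tries + 1):
        received = channel(add_parity(payload))
        if parity_ok(received):
            return received[:-1], attempt
    return None, max_tries


def flaky_channel():
    """Hypothetical link model that corrupts one bit of the first word only."""
    calls = {"n": 0}

    def channel(word):
        calls["n"] += 1
        if calls["n"] == 1:
            word = word.copy()
            word[0] ^= 1  # single-bit error on the first transmission
        return word

    return channel


data, tries = send_with_retry([1, 0, 1, 1], flaky_channel())
```

The number of tries depends on the errors that occur, which is why retransmission makes a fixed latency bound hard to guarantee.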
Design Issues - Long Wires
• Possibility to create heavily optimized solution
Low voltage signaling
Advanced symbol encoding/decoding
Wave pipelining
Design Issues - Long Wires
• High performance interconnect through wave pipelining
Need very careful analysis
[Figure: NoC routers connected by wave-pipelined links.]
Design Issues - Long Wires
• Wave pipelining performance
3.45 GHz signaling on one bit line in 0.25 µm
More energy efficient than regular pipeline
Faster than regular pipeline
• Disadvantage
Much harder to test/verify
Network Topologies
• Mesh
• Tree
• Fat-Tree
• Routing algorithm depends on topology
Routing
• Routing: path from source to destination.
Must be deadlock-free and livelock-free
Livelock: message proceeds indefinitely, but never arrives
Possible only in adaptive non-minimal routing
Deadlock: packets waiting for each other in a cycle
• Three main categories:
Static (non-adaptive): predetermined path
Minimal fully adaptive: routes through any shortest path
Partially adaptive:
multiple routing paths
Some paths not shortest
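As an example of the static (non-adaptive) category, dimension-order XY routing on a 2D mesh is sketched below. XY routing is a common textbook choice that is deadlock-free on a mesh; the slides do not say which algorithm they have in mind, and the coordinate convention and function name are assumptions made for this sketch.

```python
def xy_route(src, dst):
    """Static XY routing on a 2D mesh: travel along X first, then along Y.

    `src` and `dst` are (x, y) router coordinates; returns the list of
    routers visited, including both endpoints.
    """
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


route = xy_route((0, 0), (2, 1))
```

The path is fully determined by source and destination, so it is always minimal and no livelock can occur; the price is that the router cannot steer around a congested link.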
Wormhole Routing
• Header is forwarded ASAP, without waiting for the trailer
• Used in high-performance parallel computing networks (lumped)
Not in the internet (distributed)
• Packet may span several routers
Packet divided into flits (atomic flow control units)
• Main Disadvantage: cascaded contention
Packet requests busy link
VLSI routers have small buffers, so a packet cannot be buffered in a single router
Routers spanned by packet are stalled
Practical limitation, prevents achieving theoretical bandwidth
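The division of a packet into flits, and the reason a blocked packet stalls several routers, can be illustrated with a small model. The flit size, the header/body/trailer tagging and the helper names are assumptions made for this sketch only.

```python
def to_flits(packet: bytes, flit_bytes: int = 4):
    """Split a packet into (kind, data) flits: header, body..., trailer."""
    chunks = [packet[i:i + flit_bytes] for i in range(0, len(packet), flit_bytes)]
    kinds = ["body"] * len(chunks)
    if kinds:
        kinds[0] = "header"
        if len(kinds) > 1:
            kinds[-1] = "trailer"
    return list(zip(kinds, chunks))


def routers_spanned(n_flits: int, buffer_flits: int) -> int:
    """Routers occupied by a blocked packet if each router buffers `buffer_flits` flits."""
    return -(-n_flits // buffer_flits)  # ceiling division


flits = to_flits(b"ABCDEFGHIJ", flit_bytes=4)
stalled = routers_spanned(n_flits=8, buffer_flits=2)
```

In this simplified model an 8-flit packet with 2-flit buffers occupies 4 routers while its header waits for a busy link, which is the cascaded contention named above.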
NoC Design Characteristics: Cost
• Area
Network components area
Wires, repeaters area
• Power
Energy per transmitted packet
Idle power
NoC Design Characteristics: Performance
• Latency [sec]
From header leaving source, to trailer reaching destination
Composed of waiting latency + network latency
Waiting Latency
Time message waits before entering the network
Network Latency
Time message travels inside the network
• Throughput [bits/sec]
Measured at network port
Average amount of user data that is accepted by the network on that port in a certain amount of time
• Aggregate Throughput [bits/sec]
Sum of the throughputs at all network ports
NoC Saturation
• Offered Load (OL)
Traffic produced by network clients as a percentage of the maximal network bandwidth
L: number of cycles needed to accept the message, D: average number of cycles between messages
$OL = \frac{L}{L + D}$
• Saturation Threshold:
Offered load at which the average latency rises sharply towards infinity
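A short numeric example of the offered load, assuming the definition OL = L / (L + D) in terms of the two symbols defined on this slide; the function name is hypothetical.

```python
def offered_load(accept_cycles: float, gap_cycles: float) -> float:
    """Offered load OL = L / (L + D).

    accept_cycles: L, cycles needed to accept one message
    gap_cycles:    D, average number of cycles between messages
    """
    return accept_cycles / (accept_cycles + gap_cycles)


# A client that needs 4 cycles per message and then waits 12 cycles on average
light = offered_load(4, 12)
# The same client injecting back to back
full = offered_load(4, 0)
```

Here `light` is 0.25 (25 % of the port bandwidth) and `full` is 1.0; a network is normally operated below its saturation threshold.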
Cost - Performance Tradeoff
Santiago Gonzalez Pestana et al. “Cost-Performance Trade-offs in Networks on Chip: A Simulation-Based Approach”, DATE 2004
Architecture of On-Chip Router
•Technion, Asynchronous vs. Synchronous Design Techniques for NoCs
•Robert Mullins, Asynchronous vs. Synchronous Design Techniques for NoCs
Router Pipeline
• Numerous stages of Router Pipeline
• Raise communication latency
• Can make packet buffers less effective
• Incurs pipelining overheads
Synchronous NoCs - Summary
• Can design high-performance single cycle routers
• Design is simplified by presence of global synchrony
• Distribution of global clock can be eased by
New clock generation / distribution techniques
Source synchronous communication
Limitations of Fully-Synchronous Networks
1. Difficult to distribute clock
Network spread over die & may have irregular layout
Minimising skew costs complexity and power
• Alternatives/extensions to PLL and H-tree:
Clock deskewing techniques
Distributed Clock Generator (DCG).
Distributed PLLs
Standing-wave oscillators and rotary clock schemes
Resonant global clocks, optical clock distribution etc.
Limitations of Fully-Synchronous Networks
2. Single Network Clock Frequency
Communicating synchronous IP blocks may operate at different and potentially adaptive clock frequencies
What is most appropriate network clock frequency?
Why Asynchronous NoCs
• No clock distribution, simple solution
• Networked IP blocks run at different clock frequencies
No synchronization issues at interfaces
• Ability to exploit data / path-dependent delays
Low-latency common or high-priority paths through router
• Freedom to optimize network links
Not constrained by need to distribute/generate multiple clock frequencies. Can exploit high-frequency narrow links
Dynamic latency/throughput trade-offs (adaptive pipeline depth)
Exploit dynamic optimizations on links (e.g. DVS)
• Easy to use interfaces, modularity, robust and simple implementation, reduced design time
• Some arguments for reduced power
Different NoC Architectures
• Router clocks derived from a single source
• Locally Generated Clocks (periodic & free-running)
• Synchronous Routers with Asynchronous Links
• Locally Clocked Routers / Asynchronous Interconnect (GALS style network)
Can support asynchronous interconnects
No longer exploiting periodic nature of router clocks
Correct operation is independent of the delay of the link
• GALS interfaces with pausible clocks
If necessary clock is stretched, data is always transferred reliably
Need to construct local delay line
• Local aperiodic clock generation
• Data-Driven Local Clock
Similarities to stoppable GALS interface and asynchronous priority arbiters
Mesochronous Clocking
• Clock skew may force the system to be partitioned into multiple clock domains
• Can exploit the fact that only the phase of each router’s clock differs, simple error-free clock-domain crossing possible (single clock source)
Router clocks derived from a single source
• Each router’s clock may be generated from the global network clock, either by:
Clock division or
Clock multiplication
• Clock domain crossing techniques can exploit known clock frequency relationships
A. Chakraborty and M. Greenstreet, “Efficient Self-Timed Interfaces for Crossing Clock Domains”, in Proceedings of ASYNC’03
L. F. G. Sarmenta, G. A. Pratt and S. A. Ward, “Rational Clocking”, ICCD’95
Using Synchronisers for GALS NoCs
• Asynchronous channel uses 4-phase bundled data protocol
A. Sheibanyrad, A. Greiner, Two efficient synchronous asynchronous converters well-suited for networks-on-chip in GALS architectures, 2005
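The event order of a 4-phase (return-to-zero) bundled-data transfer can be written down as a short trace generator. This is a generic sketch of the protocol, not the converter design from the cited paper; the event names are chosen here for illustration.

```python
def four_phase_transfer(data):
    """Yield (event, req, ack, data) for one 4-phase bundled-data transfer."""
    req = ack = 0
    # Bundling constraint: data must be stable before req rises
    yield ("data_valid", req, ack, data)
    req = 1
    yield ("req+", req, ack, data)   # sender signals valid data
    ack = 1
    yield ("ack+", req, ack, data)   # receiver has latched the data
    req = 0
    yield ("req-", req, ack, data)   # sender returns to zero
    ack = 0
    yield ("ack-", req, ack, data)   # receiver returns to zero, channel idle


trace = list(four_phase_transfer(0xA5))
events = [step[0] for step in trace]
```

Every transfer costs four signal transitions on the req/ack pair, and both wires are back at 0 before the next word can be sent.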
Locally Generated Clocks (periodic & free-running)
• Can exploit knowledge about clocks (when crossing clock domains) even if all we know is that they are periodic, examples:
predictive synchronizers [Dally][Frank/Ginosar]
asynchronous FIFOs [Chakraborty/Greenstreet]
Using Asynchronous FIFOs in GALS NoCs
• Synchronous network wrappers assemble/disassemble data packets
• Can connect many independent clock domains
NoC architecture for low power
• NoC concept together with GALS methodology gives good opportunities for power saving
• Each hardware block in a NoC system can be set to its optimal frequency/voltage
• Best is to combine DVFS with GALS concept in order to reduce power
NoC architecture for DVFS – LETI Solution (NoCs 2008)
• A fully asynchronous Network-on-Chip
• IP units are synchronous islands using programmable Local Clock Generator
• Within the IP unit
Synchronization is done thanks to Pausable Clock
A Power Unit manages internal Vcore generated using external Vhigh and Vlow
A Network Interface is in charge of
NoC communications
Local Power Management
• Main CPU in charge of global power management
DVFS with GALS NoCs
• Each synchronous IP is an independent power and frequency domain
• A local fine grain Dynamic Voltage Scaling:
Implementation of a local hardware controller to control transitions between Vhigh and Vlow
Ensures smooth DVS transitions for IP safe computation
• A local fine grain Dynamic Frequency Scaling:
Automatic frequency scaling
Use of clock generation re-programming to find the optimal V/F point of operation
• Thanks to pausable clock technique, IP unit continues its operation during DVFS phases
• GALS architecture and local clock generation is a natural enabler for easy local DVFS
NoC Unit architecture
• Each IP core encapsulated with
Network Interface
Test Wrapper
Pausable Clock
Power Supply Unit
• IP units have 5 supply modes
Init: reset at Vhigh (1.2V)
High: Vhigh supply
Low: Vlow supply (0.8V)
Hopping: switch Vhigh / Vlow for DVFS
Idle: retention state at Vlow (no clock)
Off: stand-by mode
Local Power Manager
• Local Power Manager handles unit power modes
• A set of programmable registers, through the NoC
• Configuration of
Programmable delay line
Power Supply Unit
• Pulse Width modulator used to control the Hopping mode
Power Supply Unit
• Power Supply Unit manages Vcore
• Two power switches, Thigh and Tlow (LVT transistors)
• A Hopping Unit
• An Ultra Cut-Off Generator
Hopping Unit
• Energy per operation scales with V²
Decrease Voltage (and Frequency) to be energy efficient
• «Triple state» power supply
Use of two PMOS power switches
Vhigh (1.2 V), Vlow (0.7 V), or OFF (0 V)
• Switch between Vhigh and Vlow
Transitions take less than 100 ns
Mean speed / mean power of the IP is programmed by a PWM
• Compatible with synchronous and asynchronous IPs
For GALS system: coordination done with local clock generator
• Can easily be integrated in any CMOS circuit
No inductor contrary to traditional DC/DC converters
No capacitor contrary to charge pump implementation
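The V² scaling and the effect of the PWM can be put into numbers with a first-order model. The sketch below assumes that energy per operation is proportional to V² and that `share_high` is the fraction of operations executed at Vhigh; both the model and the function names are illustrative assumptions, not measured results from the slides.

```python
def relative_energy(v: float, v_ref: float) -> float:
    """First-order dynamic energy per operation relative to v_ref (E ~ V^2)."""
    return (v / v_ref) ** 2


def hopping_mean_energy(share_high: float, v_high: float, v_low: float) -> float:
    """Mean relative energy per operation when a fraction `share_high`
    of the operations runs at v_high and the rest at v_low."""
    return share_high * 1.0 + (1.0 - share_high) * relative_energy(v_low, v_high)


# Voltages taken from the slide: Vhigh = 1.2 V, Vlow = 0.7 V
low_only = relative_energy(0.7, 1.2)
half_and_half = hopping_mean_energy(0.5, 1.2, 0.7)
```

Under these assumptions an operation at 0.7 V costs about 34 % of the energy of the same operation at 1.2 V, and an even split between the two levels gives about 67 %.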
Ultra Cut-Off Generator
• When reverse polarizing the gate, the leakage current goes through a minimum
• The optimal polarization point varies with the temperature, the supply voltage and the process corners
• The proposed UCO generator automatically polarizes the gate of the Power switch to its point of minimum leakage
• Compensates for temperature variation, alleviates corners variations.
• The gate oxide reliability is considered by introducing a passive stress reduction mechanism
Pausable Clock Interface
• Temporarily pauses the clock when a transfer (NoC) or a supply switch is required
• Based on
Two GALS ports: Synchronous-to-Asynchronous and Asynchronous-to-Synchronous
A programmable delay line
A pausable clock generator
• Pausable Clock Generator arbitrates pause requests
Pausable Clock Interface
• Programmable delay line
Precise, small and low power
Using Standard cells
On the same unit power domain
Power Gain
• Programmable delay line matches with unit logic on the same power domain
Compensates for any mismatch through reprogramming
• Power reduction
Vhigh=1.2V and Vlow=0.8V
35 % dynamic power reduction between High and Low modes
Hopping mode is used to save power without any latency cost
Thanks to the UCO, leakage power is reduced by two decades (a factor of 100)
• Power Supply Unit efficiency
Hopping Unit
Only resistive losses in the power transistors
About 1 mW dynamic power
=> more than 95 % power efficiency
90 % total efficiency (external DC-DC taken into account)
An adaptive and reliable Power Supply Unit giving high power reduction factor and high power efficiency
Physical Implementation
• Power Switch
One single Power-Switch for the complete power domain
Sized to get a speed loss<5%
Area : about <5% of the power domain
• Hopping Unit
Area: 140 µm × 35 µm
Hopping Transition : <100 ns
Synchronous or Asynchronous?
• A clockless on-chip network appears to be an elegant solution, although some questions remain:
Test
Performance concerns
Shouldn’t asynchronous designs offer latency advantages?
Fast local control, path/data dependent delays, DI interconnects
Perhaps asynchronous routers mimic synchronous architectures too closely?
Exploit flexibility, novel architectures, different topologies
Overheads for data-driven clocking or GALS currently look small in comparison to the classical approach
• Synchronous design has advantages too
Predictability and determinism can be exploited
Fast single cycle routers possible
Global snapshot of state is good for scheduling
• Still lots of interesting research to be done
GALAXY project
• GALAXY project (GALS InterfAce for CompleX Digital SYstem Integration) is funded in the FP7 program of EU
• www.galaxy-project.org
Project goals
• This project builds on a technology approach in which the EU currently has world leadership
• We are on the way to provide an integrated GALS NoC design flow
• We will provide an interoperability framework between the existing open and commercial CAD tools
• The project is evaluating the ability of the GALS approach to
solve system integration issues,
implement a complex GALS system on 40 nm CMOS process,
explore the low EMI and low-power properties,
and robustness to process variability problems.