Post-Silicon Patchable Hardware
Masahiro Fujita
VLSI Design and Education Center (VDEC), The University of Tokyo
July 22nd, 2011
Respin Statistics (North America)
Respin is becoming more frequent
[Figure: bar chart of first-silicon success rate (0% to 100%) by year, 1998, 2000, 2002, 2004; bar labels 48%, 44%, 39%, 33%]
Fujita Lab. – VLSI Design and Education Center - University of Tokyo 2
[G. S. Spirakis, DATE 2006]
Manufacturing Cost
Respin risk is increasing dramatically
[Figure: mask set cost (US$, $1M to $5M) by technology node: 90nm, 65nm, 45nm, 32nm]
[Nikkei Electronics, 2008]
Causes for Respins
Logic and functional errors are the leading cause
[Figure: horizontal bar chart, IC/ASIC Designs Having One or More Re-spins by Type of Flaw (0% to 100%); categories: Logic/Function, Clock, Fast Path, Slow Path, Delay/Glitch, Power, Yield, Analog, Firmware, Mixed Signal, IR Drop; bar labels range from 15% to 91%]
[Collett International Research 2005]
Conventional SoC Design Flow
[Figure: design flow. Pre-silicon: High-Level Description, High-Level Synthesis, Machine-Generated RTL, RTL Verification/Simulation, Logic Synthesis, Place & Route. Post-silicon: SoC, Error Detection, Bug Localization (need to understand RTL), RTL Bug Fix. Legend: Design, RTL Validation.]
75% of the whole development time [Source: Intel 2007]
Proposed Patchable SoC Design Flow
[Figure: design flow. Pre-silicon: High-Level Description, High-Level Verification/Simulation, High-Level Synthesis of Patchable HW, Logic Synthesis, Place & Route. Post-silicon: Patchable SoC, Error Detection, High-Level Bug Localization, Bug Fix, Patch Compilation, Patch. Legend: Design, High-Level ECO.]
No Respin Needed!
Proposed Patchable Hardware
Efficeum offers behavioral-level programmability using a patchable controller
[Figure: custom datapath (ALU1, ALU2) driven by a patchable controller made of a hardwired FSM and a patch FSM]
Partially-Programmable Circuit (PPC) offers logic-level programmability using a mixed gate/LUT circuit
Efficeum: An Energy-Efficient Patchable Accelerator for Post-Silicon Engineering Changes
Energy Efficiency vs. Programmability
Energy efficiency of 90nm OFDM
  Fixed-function HW: 200 GOPS/W
  Embedded proc.: 4 GOPS/W (50X lower)
  Laptop proc.: 0.05 GOPS/W (4,000X lower)
>100 GOPS high performance
~1W power/thermal constraints
Energy efficiency (in [GOPS/W] or [J/op])
  How much computation can be done in a given energy
  Slowing down the chip reduces power but not efficiency
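The efficiency figures above can be cross-checked with a little arithmetic. The sketch below is only an illustration: it uses the three GOPS/W values quoted on the slide, while the function names and the 100 GOPS and 50 GOPS workloads are assumptions chosen for the example.

```python
# Energy-efficiency values quoted above for 90nm OFDM, in GOPS/W.
EFFICIENCY = {
    "fixed_function_hw": 200.0,
    "embedded_proc": 4.0,
    "laptop_proc": 0.05,
}


def joules_per_op(gops_per_watt):
    """Convert GOPS/W to J/op (1 GOPS/W is 1e9 operations per joule)."""
    return 1.0 / (gops_per_watt * 1e9)


def power_watts(throughput_gops, gops_per_watt):
    """Power needed to sustain a throughput at a given efficiency."""
    return throughput_gops / gops_per_watt


def efficiency_gap(name):
    """How many times less efficient `name` is than fixed-function HW."""
    return EFFICIENCY["fixed_function_hw"] / EFFICIENCY[name]


if __name__ == "__main__":
    for name, eff in EFFICIENCY.items():
        print(f"{name}: {efficiency_gap(name):.0f}x gap, "
              f"{joules_per_op(eff) * 1e12:.0f} pJ/op, "
              f"{power_watts(100.0, eff):.1f} W at 100 GOPS")
    # Halving the throughput halves the power; J/op does not change.
    print(power_watts(50.0, 200.0), power_watts(100.0, 200.0))
```

Under these numbers, only the fixed-function design sustains 100 GOPS within roughly 1 W (0.5 W); the embedded processor would need 25 W for the same throughput.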
Fixed-Function Accelerator
Achieves high energy efficiency by customization:
  Hardwired controller → No reprogrammability
  Highly-customized datapath → Low flexibility
[Figure: local store and hardwired controller driving, through control signals, registers (Reg 1, Reg 2, Reg 3, ...) that a sparse interconnect network connects to a comparator, a multiplier, ALU1, ALU2, ...]
Proposed Patchable Accelerator
Behavioral reprogrammability by control patching
Increased flexibility by adding register file via data bus
[Figure: the fixed-function accelerator (local store, hardwired controller, registers, sparse interconnect network, comparator, multiplier, ALU1, ALU2, ...) extended with patch logic on the control bus and a register file on the data bus]
Patch Logic
[Figure: patch logic around the hardwired controller. Labels: Program Counter; comparators "=?" against PC1, PC2, ... with replacement values PC1', PC2', ... (Program Counter Patch); comparator ">PCpatch?"; Control Signal Memory; Control Signal Patch; Patch Memory]
Patching Example (1/2)
Scheduling Result of Initial Design
PC   Next PC   Logic
1    2         Hardwired
2    3         Hardwired
3    1         Hardwired
4              Patch
5              Patch
[Figure: dataflow graph for the initial design; the ALU1, ALU2, and MUL1 columns of the schedule show the operation executed on each unit at each PC]
Patching Example (2/2)
Scheduling Result After Engineering Change
PC   Next PC     Logic
1    4 (was 2)   Hardwired
2    3           Hardwired
3    1           Hardwired
4    3           Patch
5                Patch
[Figure: dataflow graph after EC; the ALU1, ALU2, and MUL1 columns of the schedule show the operation executed on each unit at each PC]
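The next-PC behavior of the patching example can be mimicked in a few lines. This is a minimal sketch under stated assumptions, not the actual Efficeum controller: it reads the second schedule as step 1 jumping to patch step 4, which then returns to step 3, and it assumes that PCs above a threshold (`PC_PATCH`, set to 3 here) are served from the patch memory. The function and variable names are hypothetical.

```python
# Next-PC column of the initial schedule: steps 1-3 are hardwired.
HARDWIRED_NEXT_PC = {1: 2, 2: 3, 3: 1}
# Assumed threshold: PCs above this value are served by the patch memory.
PC_PATCH = 3


def next_pc(pc, pc_redirects, patch_next_pc):
    """Return the next program counter for one control step."""
    if pc in pc_redirects:        # "=?" comparator hit: program counter patch
        return pc_redirects[pc]
    if pc > PC_PATCH:             # ">PCpatch?": step lives in the patch memory
        return patch_next_pc[pc]
    return HARDWIRED_NEXT_PC[pc]  # unpatched step: hardwired controller


def trace(start, steps, pc_redirects=None, patch_next_pc=None):
    """List the PCs visited over `steps` control steps."""
    pc_redirects = pc_redirects or {}
    patch_next_pc = patch_next_pc or {}
    visited, pc = [], start
    for _ in range(steps):
        visited.append(pc)
        pc = next_pc(pc, pc_redirects, patch_next_pc)
    return visited


if __name__ == "__main__":
    print(trace(1, 6))  # initial design
    print(trace(1, 6, pc_redirects={1: 4}, patch_next_pc={4: 3}))  # after EC
```

With an empty patch the trace cycles through the hardwired steps 1, 2, 3; with the patch loaded, step 2 is bypassed in favor of patch step 4 without touching the hardwired table.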
Patching-Based Post-Silicon ECO Flow
[Figure: flow. Design: C Program → High-Level Synthesis → Fixed-Function HW → Inserting RF & Patch Logic → Efficeum. Post-Silicon ECO (Spec. Change & Bug Fix): Post-ECO Program → Patch Compilation (Computing the Difference Between Two Programs) → Patch → Writing into Patch Memory. Small dataflow graphs with +, -, x, << operations illustrate the two programs.]
Experimental Setup
Example: 8x8 IDCT
Technology: FreePDK 45nm
Logic Synthesis: Synopsys Design Compiler Ultra
  High effort options with gated clock optimization
P&R: Cadence SoC Encounter
Simulation: Synopsys VCS
Power/timing analysis: Synopsys PrimeTime PX
  Simulation results are used for power calculation
Energy efficiencies (GOPS/W) are compared
Energy Efficiency Comparison
Offers a tradeoff between efficiency and programmability
[Figure: energy efficiency for 8x8 IDCT (FreePDK 45nm technology), from No Patching to Fully-Patched; labels 6%, 48%, 89%]
Area & Performance Comparison
[Figure: comparison charts; labels 20%, 5%, 5X Smaller, Up to 40% Increase]
Area Comparison
[Figure: area of a fully-programmable accelerator, a single-function hardwired accelerator, and Efficeum; labels 4x reduction, 18% increase]
(Technology: FreePDK 45nm (NCSU/Nangate), Operating Frequency: 200MHz)
Power Comparison
[Figure: power of a fully-programmable accelerator, a single-function hardwired accelerator, and Efficeum; labels 6x reduction, 13% increase]
(Technology: FreePDK 45nm (NCSU/Nangate), Operating Frequency: 200MHz)
Incremental High-Level Synthesis and Patch Compilation for High-Level ECO
Conventional High-Level Synthesis
Several phases are applied separately
This prevents incremental synthesis
[Figure: three separate phases. Scheduling assigns the +, x, << operations to Steps 1 to 3; allocation provides ADD1, ADD2, MUL1, SHFT1; binding maps operations to these units and to registers, yielding an FSM and a datapath]
Incremental High-Level Synthesis
Each operation is scheduled and bound incrementally, and the hardware is enhanced accordingly
[Figure: successive rounds of incremental scheduling & binding; each round adds operations (+, x, <<) to Steps 1 to 3 and extends the FSM and the datapath (ADD1, ADD2, MUL1, SHFT1, registers)]
Patch Compilation
Same as incremental synthesis, except that datapath enhancement is not allowed (only the FSM is enhanced)
[Figure: successive rounds of incremental scheduling & binding on a fixed datapath (ADD1, ADD2, MUL1, SHFT1, registers); only the FSM grows, from Step 1 to Steps 1 to 3]
Incremental Scheduling Procedure
Extension of Swing Modulo Scheduling for VLIW compilers [Llosa et al., PACT '96]
[Figure: the same graph scheduled two ways along the scheduling-order axis. Conventional scheduling (top-down): very long register lifetime, 6 registers required. Incremental swing scheduling (mix of top-down and bottom-up): shorter register lifetime, only 4 registers required.]
Swing Scheduling Procedure
Mix of top-down and bottom-up scheduling
Phase 1: Bottom-up scheduling of the critical path and its ancestors
Phase 2: Top-down scheduling of the descendants of scheduled operations
Phase 3: Bottom-up scheduling of the ancestors of scheduled operations
Incremental Step Insertion
A novel technique enabling incremental swing scheduling
During swing scheduling, a new control step is inserted between the scheduled steps when needed
[Figure: three snapshots of a schedule containing Steps A and B, into which Steps C, D, and E are inserted]
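Step insertion amounts to renumbering the control steps that follow the insertion point. The sketch below illustrates that idea only; the dictionary representation of a schedule, the function names, and the operation names are assumptions made for the example and are not taken from the Cyneum implementation.

```python
def insert_step_after(schedule, step):
    """Insert an empty control step right after `step`.

    `schedule` maps operation name -> control step (1-based). Operations in
    later steps are pushed down by one; earlier ones are untouched.
    """
    return {op: (s + 1 if s > step else s) for op, s in schedule.items()}


def schedule_with_insertion(schedule, new_op, after_step):
    """Place `new_op` in a freshly inserted step after `after_step`."""
    updated = insert_step_after(schedule, after_step)
    updated[new_op] = after_step + 1
    return updated


if __name__ == "__main__":
    scheduled = {"op_a": 1, "op_b": 2, "op_c": 3}  # hypothetical operations
    print(schedule_with_insertion(scheduled, "op_new", after_step=1))
```

Operations already scheduled keep their relative order, so earlier scheduling decisions stay valid after the insertion.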
Incremental Binding Procedure
For each operation, all possible combinations of (resource, registers) are examined
If no such binding is found, new interconnects between resource and registers are introduced
[Figure: incremental scheduling & binding of Steps 1 to 3 on a datapath with ADD1, ADD2, MUL1, SHFT and registers; a multiplier exists but no register-to-multiplier interconnect exists, so the datapath is enhanced]
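The binding rule described above can be sketched as a search over (resource, register) pairs followed by an optional datapath enhancement. This is a simplified illustration under assumptions: the datapath is modeled as a set of (source, destination) interconnects, the choice of the pair needing the fewest new interconnects is a heuristic picked for this example, and all names are hypothetical. Setting `allow_enhancement=False` models patch compilation, where the datapath is frozen.

```python
from itertools import product


def wires_needed(resource, src_regs, dst_reg):
    """Interconnects required to run an operation on `resource`."""
    return {(src, resource) for src in src_regs} | {(resource, dst_reg)}


def bind_operation(op_type, src_regs, resources, registers, wires,
                   allow_enhancement=True):
    """Bind one operation; return (resource, dst_reg, added_wires) or None.

    resources: dict resource name -> operation type it implements
    wires:     set of existing (source, destination) interconnects
    """
    candidates = [name for name, kind in resources.items() if kind == op_type]
    combos = list(product(candidates, registers))
    # Examine every (resource, register) combination on the current datapath.
    for res, dst in combos:
        if wires_needed(res, src_regs, dst) <= wires:
            return res, dst, set()
    if not allow_enhancement or not combos:
        return None  # e.g. patch compilation: the datapath is frozen
    # Otherwise enhance the datapath with the fewest new interconnects.
    res, dst = min(
        combos, key=lambda c: len(wires_needed(c[0], src_regs, c[1]) - wires))
    added = wires_needed(res, src_regs, dst) - wires
    wires.update(added)
    return res, dst, added


if __name__ == "__main__":
    resources = {"ADD1": "add", "MUL1": "mul"}
    registers = ["R1", "R2"]
    wires = {("R1", "ADD1"), ("R2", "ADD1"), ("ADD1", "R1"), ("MUL1", "R2")}
    print(bind_operation("add", ["R1", "R2"], resources, registers, wires))
    print(bind_operation("mul", ["R1", "R2"], resources, registers, wires))
```

In the usage example the multiplier exists but has no register-to-multiplier interconnect, so the second call adds the two missing wires, mirroring the situation on the slide.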
Experimental Setup
The proposed method has been implemented in our high-level synthesis framework Cyneum
  Incremental high-level synthesis
  Patch compiler for Efficeum
Example: 5 benchmark designs
  C programs of about 100 lines
  Functions from IDCT, ADPCM, MPEG
  Post-ECO examples are generated by random graph perturbation (next slide)
Evaluated the quality of the method through the patch size and compilation time
Generating ECO Examples
The following graph perturbations are randomly applied to the CDFG
  Perturbation 1: Opcode Change
  Perturbation 2: Operand Change
  Perturbation 3: Introducing a New Operation
[Figure: the original CDFG and the result of each perturbation]
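The three perturbations can be illustrated on a toy graph representation. The sketch below is an assumption-laden example, not the generator used in the experiments: a CDFG is reduced to a list of `[opcode, left, right]` nodes, the opcode set and input names are made up, and each perturbation kind is picked with equal probability.

```python
import random

OPCODES = ["+", "-", "*", "<<"]  # assumed operation set
INPUTS = ["a", "b", "c"]         # hypothetical primary inputs


def sources(index):
    """Operands a node at `index` may read: inputs or earlier nodes."""
    return INPUTS + list(range(index))


def perturb(cdfg, rng):
    """Apply one random perturbation; return (kind, new_cdfg).

    A node is [opcode, left, right]; an operand is an input name or the
    index of an earlier node, which keeps the graph acyclic.
    """
    new = [list(node) for node in cdfg]
    kind = rng.choice(["opcode_change", "operand_change", "new_operation"])
    if kind == "opcode_change":
        node = rng.choice(new)
        node[0] = rng.choice([op for op in OPCODES if op != node[0]])
    elif kind == "operand_change":
        i = rng.randrange(len(new))
        slot = rng.choice([1, 2])
        new[i][slot] = rng.choice(
            [s for s in sources(i) if s != new[i][slot]])
    else:
        i = len(new)
        new.append([rng.choice(OPCODES), rng.choice(sources(i)),
                    rng.choice(sources(i))])
    return kind, new


def make_post_eco_design(cdfg, m, seed):
    """Apply `m` random perturbations, as in an ECO of size M."""
    rng = random.Random(seed)
    for _ in range(m):
        _, cdfg = perturb(cdfg, rng)
    return cdfg


if __name__ == "__main__":
    original = [["+", "a", "b"], ["*", 0, "c"], ["<<", 1, "a"]]
    print(make_post_eco_design(original, m=3, seed=0))
```

Varying `seed` for a fixed `m` yields different post-ECO designs derived from the same original graph.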
Evaluation of Patch Compiler
For each benchmark, random CDFG perturbation is applied M times. For each M, 100 different post-ECO designs are generated and then patches are compiled.
[Figure: two plots against ECO Size M (#Perturbations): Average Patch Size (#FSM States) and Average Compilation Time [sec.]]
PPC: Increasing Yield Using Partially-Programmable Circuits
A collaborative work with Prof. Shigeru Yamashita (Ritsumeikan Univ.)
Design For Yield
At physical level, there are many techniques available for yield/defect tolerance enhancement
At logic level, the following techniques can be applied for each module
  TMR: Voting
  DMR: If one module is defective, the other can be used
  Reconfigurable devices: synthesize not to use defective parts
Too much overhead in area and performance
Objective of This Work
Enhance the defect tolerance by using a Partially-Programmable Circuit (PPC)
  PPC is a hybrid LUT/gate circuit
  To correct a single defect, full programmability such as that of FPGAs is unnecessary
  A defective wire can be made redundant by reprogramming LUTs in PPC
Propose a design methodology
  Synthesis of PPC
  Where to put LUTs
  How to reprogram LUTs for defective wires
PPC (Partially-Programmable Circuit)
Non-programmable part consisting of conventional gates
Programmable part consisting of LUTs and MUXs
[Figure: conventional gates, LUTs, and a MUX between the primary inputs and the primary outputs]
Defect Correction in PPC
Find out that a wire ci is defective
By reprogramming LUTs, the wire ci becomes redundant
Call ci a Robust Connection (RC)
[Figure: PPC with conventional gates, LUTs, and a MUX]
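Whether a wire is a robust connection can be checked by brute force on a very small circuit. The sketch below is a toy illustration under assumptions: the circuit (one AND gate feeding one LUT), the target function, and all names are hypothetical, and exhaustive enumeration of LUT truth tables stands in for the CSPF/SPFD-based analysis, which it does not implement.

```python
from itertools import product

BITS = (0, 1)


def spec(a, b, c):
    """Hypothetical target function of the toy circuit."""
    return (a & b) | c


def evaluate(a, b, c, lut, lut_inputs, stuck=None):
    """Output of the toy PPC: gate g = a AND b feeds a LUT.

    `stuck` forces the wire from g to the LUT to a constant (0 or 1).
    """
    wires = {"a": a, "b": b, "c": c, "g": a & b}
    if stuck is not None:
        wires["g"] = stuck
    return lut[tuple(wires[name] for name in lut_inputs)]


def find_lut_program(lut_inputs, stuck=None):
    """Search all LUT truth tables for one that matches `spec`."""
    rows = list(product(BITS, repeat=len(lut_inputs)))
    for outputs in product(BITS, repeat=len(rows)):
        lut = dict(zip(rows, outputs))
        if all(evaluate(a, b, c, lut, lut_inputs, stuck) == spec(a, b, c)
               for a, b, c in product(BITS, repeat=3)):
            return lut
    return None


if __name__ == "__main__":
    # Without redundant wires the defect on g cannot be corrected.
    print(find_lut_program(("g", "c"), stuck=0))
    # With a and b also wired to the LUT, a repairing program exists.
    print(find_lut_program(("g", "c", "a", "b"), stuck=0) is not None)
```

In this toy case the wire from g becomes a robust connection only after a and b are also connected to the LUT, because the LUT can then recompute a AND b on its own.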
PPC Example: Initial Circuit
LUTs are used partially in the circuit
There is no redundancy now
[Figure: circuit with nodes N1 to N9 and LUTs between the primary inputs and the primary outputs]
PPC Example: Redundancy Addition 1
By adding this wire, some wires become Robust Connections (RCs)
RC: Wires which can become redundant by reprogramming LUTs
NRC: Wires which are not RCs
[Figure: the example circuit (nodes N1 to N9, LUTs) with one added wire]
PPC Example: Redundancy Addition 2
[Figure: the example circuit (nodes N1 to N9, LUTs) after a second wire addition]
PPC Example: Redundancy Addition 3
[Figure: the example circuit (nodes N1 to N9, LUTs) after a third wire addition]
PPC Example: Final Circuit
Since we assume a single defect, multiple redundant connections to an LUT are multiplexed
Colored wires: Robust Connection (RC)
Black wires: Non-Robust Connection (NRC)
[Figure: the example circuit (nodes N1 to N9, LUTs) with a MUX in front of an LUT]
CSPF: Flexibility of Logic Circuits
[Figure: truth tables of the logic functions f, g1, and g2 alongside their CSPFs, in which some entries are replaced by "*"]
SPFD: Flexibility of LUT Circuits
[Figure: LUT L1 with inputs g1 and g2 and output f, with truth tables illustrating g1's flexibility by SPFDs]
Proposed Synthesis Method
Basic procedure
1. Perform a LUT mapping
2. Determine LUTs to keep
   Needs a better heuristic
   Reconvergence points, non-critical nodes
3. Perform a technology re-mapping
4. Add redundant wires to LUTs
   How to find good wires?
   Currently, wires are identified exhaustively
Preliminary Experiments
1. Mapping to K-input LUTs (K=3,4,5)
2. Re-mapping while keeping LUTs at the outputs
3. For each LUT, if connecting a wire to the LUT makes another wire an RC, then that wire is selected. Terminate if the number of LUT inputs is 6.
[Figure: circuit with LUTs kept at the primary outputs]
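Step 3 above is a greedy loop with an input-count bound. The sketch below shows only that loop structure: the robustness check is passed in as a caller-supplied function (`makes_wire_robust`, a hypothetical name), since how the experiments decide that a wire becomes an RC is not spelled out here.

```python
MAX_LUT_INPUTS = 6  # termination bound from step 3 above


def add_redundant_wires(lut_inputs, candidate_wires, makes_wire_robust):
    """Greedily connect candidate wires to one LUT.

    `makes_wire_robust(inputs, wire)` is a caller-supplied (hypothetical)
    check that returns True when connecting `wire` to a LUT with the given
    inputs turns some other wire into a robust connection.
    """
    inputs = list(lut_inputs)
    for wire in candidate_wires:
        if len(inputs) >= MAX_LUT_INPUTS:
            break
        if wire not in inputs and makes_wire_robust(inputs, wire):
            inputs.append(wire)
    return inputs


if __name__ == "__main__":
    # Stub check for demonstration only: accept wires whose name starts with "n".
    def accept(inputs, wire):
        return wire.startswith("n")

    print(add_redundant_wires(["x1", "x2", "x3"],
                              ["n1", "p1", "n2", "n3", "n4"], accept))
```

The bound keeps each LUT at six inputs or fewer, so a 3-input LUT can accept at most three redundant wires in this scheme.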
Experimental Results
#Connections which are robust to stuck-at-0/1
            Stuck-at-0                       Stuck-at-1
            Robust              Non-Robust   Robust              Non-Robust
Circuit     Original   Added                 Original   Added
alu2        586        13       25           582        20       29
alu4        1093       28       92           1070       25       115
apex6       591        86       106          589        98       108
rot         421        161      218          411        172      228
too_large   472        15       256          453        9        275
vda         1251       153      221          1318       213      154
C880        185        32       144          212        57       117
C1355       219        225      143          390        176      182
C1908       626        37       66           599        36       93
C2670       710        59       176          704        66       182
C3540       1500       63       210          1462       52       248
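The table can be summarized as the fraction of original connections that are robust. The sketch below is a reading aid rather than part of the reported results: it assumes the "Original" robust column and the "Non-Robust" column together cover all original connections (the sums agree between stuck-at-0 and stuck-at-1 for the rows used), and it copies only three rows from the table. The function names are hypothetical.

```python
# (robust original, robust added, non-robust) per fault type, copied from
# three rows of the results table.
RESULTS = {
    "alu2": {"sa0": (586, 13, 25), "sa1": (582, 20, 29)},
    "rot": {"sa0": (421, 161, 218), "sa1": (411, 172, 228)},
    "C3540": {"sa0": (1500, 63, 210), "sa1": (1462, 52, 248)},
}


def original_connections(row):
    """Original connections, assumed to be robust original + non-robust."""
    robust_original, _added, non_robust = row
    return robust_original + non_robust


def robust_fraction(row):
    """Share of the original connections that are robust."""
    robust_original, _added, _non_robust = row
    return robust_original / original_connections(row)


if __name__ == "__main__":
    for circuit, faults in RESULTS.items():
        for fault, row in faults.items():
            print(f"{circuit} {fault}: {robust_fraction(row):.1%} of "
                  f"{original_connections(row)} original connections robust")
```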