Dependable System Design for Reconfigurable Safety-Critical Applications
PhD Student: Anees Ullah
Research Group: CAD group
Advisor: Prof. Luca Sterpone
Outline
• Dependability
• Re-configurability
• FPGA architecture and Slice Internals
• Problem Statement and goal
• Fine-grain Self-Repairing System (FGSR)
• Dynamically Reconfigurable TMR (DrTMR)
• ASIC fault emulation (FE)
• Conclusions
• Future Work
Dependability
• Reliable computation in today’s nanometric technologies is challenging
  – Manufacturing defects (stuck-at, stuck-open, bridge faults)
  – In-operation effects (radiation, electromigration, aging)
• Dependable system design techniques should be applied from RTL to implementation phases
• These mitigation strategies are assessed with testing and fault injection techniques to ensure the desired reliability levels are attained
Re-configurability
• FPGAs are electronic devices whose functionality can be upgraded and changed in the field
• Redundancy for error masking/detection combined with re-configuration for correction is used in critical system design
• Re-configurability can also be used for fault injection/emulation as a tool of dependability assessment
• SRAM-based FPGAs are highly sensitive to soft errors and have long recovery times
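The masking/detection idea in the bullets above can be illustrated with a minimal sketch. It is a software model only, not a hardware description: `majority` and `dmr_mismatch` are hypothetical helper names, and the 4-bit values are arbitrary test data.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote, the masking step of TMR."""
    return (a & b) | (a & c) | (b & c)


def dmr_mismatch(a: int, b: int) -> bool:
    """Duplication with comparison (DMR): detects an error but cannot mask it."""
    return a != b


golden = 0b1011
upset = golden ^ 0b0100  # one replica suffers a single bit flip

voted = majority(golden, upset, golden)      # TMR masks the flipped bit
needs_repair = dmr_mismatch(golden, upset)   # DMR only raises a flag
```

In a real system the flag (or a voter disagreement) would be the trigger for reconfiguring the faulty region; that repair step is outside this sketch.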
SRAM-based FPGAs: Architecture & Reconfiguration
• Rows (Clock Regions)
• Major Column
• Minor Column
• SLICE
[Figure: FPGA floorplan of a clock region made of IOB, BRAM, DSP, CMT and CLB major columns. A CLB major column is shown split into minor columns (frames); labels: Frame 0, Frame 1, Frame 25, Frame 35, "20 CLB High".]
FPGA’s Slice Architecture
[Figure: slice schematic. Labels: LUT A to LUT D with inputs A1 to A6 and outputs O5/O6; carry chain (PRECYINIT, ACY0, BCY0, CCY0, DCY0, COUT); output multiplexers AOUTMUX to DOUTMUX driving AMUX to DMUX; bypass inputs AX, BX, CX, DX; slice outputs A, B, C, D; M1 to M4 and X1 to X4.]
Problem Statement and Goal
• To mitigate the effects of Multiple Bit Upsets (MBUs) in the configuration memory and reduce the recovery times
• Fine-grain redundancy offers better mitigation properties, precise diagnosis of faults and fast error detection while fine-grain reconfiguration improves the recovery times
• The challenge is the overhead introduced by fine-grain approaches and the lack of design tools and methodologies
• The goal of this work is to exploit the FPGA’s primitive elements (LUT, carry-chains) to support fine-grain approaches to redundancy and reconfiguration
Fine-grain Self-Repairing System (FGSR)
• Self-repairing systems contain a Static and a Dynamic region implemented on a reconfigurable platform
• The Static region consists of resources (processors, memories, I/O interfaces) that are always required for the operation of the system
• The Dynamic region contains Partially Reconfigurable Modules (PRMs), which can change functionality while the system is operating
• The research aim is to increase the fault tolerance capabilities for PRMs with reliable and fast error-detectors having a low area-overhead
Matteo Sonza Reorda, Luca Sterpone and Anees Ullah, “An Error-Detection and Self-Repairing Method for Dynamically and Partially Reconfigurable Systems”, in IEEE European Test Symposium (ETS 2013), May 27-31, 2013 (extended version submitted to IEEE Transactions on Computers)
FGSR - Methodology
[Figure: design flow. Blocks: DUT HDL; Synthesis (XST); Net-list Extraction; Constraint Generator; Translate & Map; Mapped NCD; NCD2XDL; XDL; XDL Manipulator; XDL’; XDL2NCD; Mapped NCD’; P & R (PAR); Bitstream (BITGEN).]
[Figure: Regions Formation. Static Region and Dynamic Region; the Dynamic Region holds a Single Bit Error Region and a Multiple Bit Error Region.]
[Figure: LUT-based Error Detection and carry-chain detectors. Single Bit Error Region CLB column: Column Logic Delay ~ 23 ns, Column Routing Delay ~ 36 ns, OR LUT Tree over slices, FLAG 1. Multiple Bit Error Region CLB column: Logic Delay < 1 ns, Routing Delay < 3 ns.]
[Figure: Single bit Error Detection and Multiple bit Error Detection circuits, each built from LUT A and LUT B (inputs A1 to A6) with a DX input and an Error output.]
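The flag convergence in the detector figure (per-slice error flags merged by an OR LUT tree) can be sketched in software as follows. This is an illustration under stated assumptions, not the authors' detector: the function names are hypothetical, the comparison is a plain duplicate-and-compare, and a fan-in of 6 is assumed because the slice figure shows 6-input LUTs.

```python
from typing import List


def slice_error_flag(copy_a: int, copy_b: int) -> int:
    """Compare two redundant outputs; 1 means they disagree."""
    return int(copy_a != copy_b)


def or_tree(flags: List[int], fan_in: int = 6) -> int:
    """Reduce the flags level by level with fan_in-input OR nodes."""
    if not flags:
        return 0
    level = list(flags)
    while len(level) > 1:
        level = [int(any(level[i:i + fan_in]))
                 for i in range(0, len(level), fan_in)]
    return level[0]


def or_tree_depth(n_flags: int, fan_in: int = 6) -> int:
    """Number of OR levels needed to merge n_flags into a single flag."""
    depth = 0
    while n_flags > 1:
        n_flags = -(-n_flags // fan_in)  # ceiling division
        depth += 1
    return depth


# 20 slices in a column, one of which sees a mismatch between its copies.
flags = [slice_error_flag(1, 1)] * 20
flags[7] = slice_error_flag(1, 0)
column_flag = or_tree(flags)
```

The depth of such a tree grows with the number of flags, which is one way to read why the figure reports logic and routing delays per CLB column.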
FGSR - Results
[Chart: Number of upsets (axis 0 to 10,000). Series: SEUs Wrong Answer (#), SEUs Corrected (#), MEUs Wrong Answer (#), MEUs Corrected (#).]
[Chart: Recovery Time in Microseconds (axis 0 to 25,000). Series: DMR, TMR, Our Approach.]
[Chart: Clock Period in Nanoseconds (axis 0 to 35). Series: DMR, TMR, Our approach.]
[Chart: Number of LUTs (axis 0 to 20,000). Series: DMR, TMR, Our approach.]
Dynamically Reconfigurable TMR (DrTMR)
Mixed placement of TMR domains increases the probability of Cross-Domain Errors (CDEs) and makes partial reconfiguration of individual TMR domains impossible
1. “On the Optimal Reconfiguration times of TMR circuits on SRAM based FPGAs”, in IEEE NASA/ESA Conference on Adaptive Hardware and Systems (AHS 2013), June 25-27, 2013
2. “Real-Time SEU Tolerant Circuits on SRAM-based FPGAs”, in 2013 IEEE Convention on Radiation Effects on Components and Systems (RADECS 2013), Sep 23-27, 2013
3. “Recovery Time and Fault Tolerance Improvements for Circuits mapped on SRAM based FPGAs”, Journal of Electronic Testing: Theory and Applications (JETTA)
CDE-aware Voter placement - Voters from the same Domain in a Slice
Critical-Voter placement - Voters from different Domains in a Slice
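Why cross-domain errors matter can be shown with a small software model. It is a sketch under simplifying assumptions: each TMR domain carries a single bit, a fault simply inverts that bit, and `run_tmr` is a hypothetical helper, not part of any tool flow named on these slides.

```python
def majority(a: int, b: int, c: int) -> int:
    """2-of-3 majority vote on single-bit values."""
    return (a & b) | (a & c) | (b & c)


def run_tmr(value: int, corrupted_domains: set) -> int:
    """Invert the replica in every corrupted domain, then vote."""
    replicas = [value ^ 1 if d in corrupted_domains else value
                for d in range(3)]
    return majority(*replicas)


single_domain = run_tmr(1, {0})     # fault confined to one domain
cross_domain = run_tmr(1, {0, 1})   # one upset reaching two domains
```

With a fault confined to one domain the voter still returns the correct value; once a single upset corrupts two domains the vote itself is wrong, which is the case that keeping domains apart is meant to avoid.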
DrTMR - Methodology
[Figure: design flow. Blocks: TMR (X-TMR); Hierarchy Formation; Error Detection and Flag Convergence; Floor-planning (PlanAhead); Constraint Placement (Xilinx Placer); Routing (Xilinx PAR); Bitstream Generation (Xilinx BITGEN). Files exchanged between blocks: EDIF, UCF, NCD.]
[Figure: voter arrangements with voters MAJ_V0, MAJ_V1, MAJ_V2 and MIN_V0, domain signals Sig_D0, Sig_D1, Sig_D2 and Primary outputs. LUT-level detail: XNOR 1 and XNOR 2 on LUT inputs A1 to A5, with DX input and signals SIG1, SIG1_MAJJ, SIG2_MAJJ.]
DrTMR - Results
[Chart: Recovery Time in Microseconds (log axis, 1 to 100,000). Series: XNOR TMR, Min-TMR, Xilinx-TMR.]
[Chart: Number of Cross-Domain Errors (log axis, 1 to 10,000) per benchmark: RISC5x, CORDICR2P, CORDICP2R, MINIMIPS, LEON II, L80SOC, HP16, RS_DEC, FPU, DCT. Series: XTMR Routing, Min-TMR Routing, XNOR-TMR Routing.]
[Chart: LUTs in Thousands (log axis, 0.001 to 100). Series: Xilinx TMR, Min-TMR, XNOR-TMR.]
[Chart: Clock Period in Nanoseconds (axis 0 to 25). Series: X-TMR, XNOR-TMR, Min-TMR.]
Error Latency = 28 ns @ 192 CLBs
Zero CDEs in Logic
1 LUT per Majority Voter
ASIC Fault Emulation (FE)
• The simulation of a circuit in the presence of faults is called fault simulation and it is a mandatory step to determine the yield of any VLSI fabrication process
• Fault Emulation is the hardware realization of the fault simulation process, used to speed it up
• The research aim of this activity is to exploit the flexibility of LUTs for fault injection, emulating ASIC permanent faults on an FPGA
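What fault simulation computes can be shown on a toy circuit. The sketch below assumes a two-gate netlist, y = (a AND b) OR c, and a single stuck-at fault model; the netlist, net names and helper functions are all hypothetical and much smaller than any benchmark in this deck.

```python
from itertools import product

INPUTS = ("a", "b", "c")
GATES = [("n1", "and", ("a", "b")), ("y", "or", ("n1", "c"))]
OPS = {"and": lambda x, y: x & y, "or": lambda x, y: x | y}


def simulate(inputs, fault=None):
    """Evaluate the netlist; fault is (net, stuck_value) or None."""
    nets = dict(inputs)
    for name in INPUTS:
        if fault and fault[0] == name:
            nets[name] = fault[1]
    for out, op, (i0, i1) in GATES:
        nets[out] = OPS[op](nets[i0], nets[i1])
        if fault and fault[0] == out:
            nets[out] = fault[1]  # a stuck net ignores its driver
    return nets["y"]


def detected_faults(patterns):
    """Return the stuck-at faults whose output differs from the fault-free one."""
    faults = [(net, v) for net in (*INPUTS, "n1", "y") for v in (0, 1)]
    detected = set()
    for bits in patterns:
        vec = dict(zip(INPUTS, bits))
        good = simulate(vec)
        detected.update(f for f in faults if simulate(vec, f) != good)
    return detected, faults


all_patterns = list(product((0, 1), repeat=3))
detected, faults = detected_faults(all_patterns)
coverage = len(detected) / len(faults)
```

Every fault costs one extra evaluation per pattern, so the work grows with faults times patterns; that product is what moving the evaluation into hardware is meant to speed up.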
“Effective emulation of permanent faults in ASICs through Dynamically Reconfigurable FPGAs”, in 24th IEEE International Conference on Field Programmable Logic and Applications, Munich, Germany, September 2-4, 2014
FE - Methodology
[Figure: flow with two phases, Custom Mapping Phase and Fault Dictionary Creation Phase. Blocks: ASIC Verilog Netlist; Net-list Parsing; ASIC Logic Cones; Duplication-free Map; Address lines bind; EDIF; Xilinx tools; Physical Information; TetraMax; ASIC Fault List Extractor; ASIC Collapsed fault list; Fault Dictionary Algorithm; Fault dictionary.]
[Figure: logic cone with nets a to j mapped onto LUT A (address lines A1 to A6, inputs I0 to I5, output O6).]
O6 = (((I0 and I1)’ and I5)’ and (I2 and (I3 or I4))’)’
“f” @ 1: O6 = (((I0 and I1)’ and I5)’ and (I2 and (1))’)’
where the constant 1 is written as (I3 or I3’) and (I4 or I4’)
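The two O6 expressions above can be turned into LUT contents with a short sketch. It assumes that net "f" is the output of (I3 or I4) and that "@ 1" means stuck-at-1, as the second expression suggests. The packing convention (bit k of the integer holds the output for address k, with I0 as the least significant address bit) is also an assumption, and `o6` and `lut_init` are hypothetical names.

```python
def o6(i0, i1, i2, i3, i4, i5, f_stuck_at_1=False):
    """Evaluate O6 = (((I0 and I1)' and I5)' and (I2 and f)')' with f = I3 or I4."""
    def nand(x, y):
        return 1 - (x & y)
    f = 1 if f_stuck_at_1 else (i3 | i4)
    return nand(nand(nand(i0, i1), i5), nand(i2, f))


def lut_init(func) -> int:
    """Pack the 64-entry truth table of a 6-input function into one integer."""
    init = 0
    for addr in range(64):
        bits = [(addr >> k) & 1 for k in range(6)]  # I0 is address bit 0
        init |= func(*bits) << addr
    return init


golden_init = lut_init(o6)
faulty_init = lut_init(lambda *b: o6(*b, f_stuck_at_1=True))

# Addresses where the fault changes the LUT output.
differing_addresses = [a for a in range(64)
                       if ((golden_init ^ faulty_init) >> a) & 1]
```

Because the faulty function uses the same six inputs, only the truth table changes between the fault-free and the faulty version in this model; the wiring stays the same.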
FE - Results
[Chart: Fault Simulation Times on Log Scale (0.001 to 10,000,000 s) per benchmark: Adder64, Multiplier32, Multiplier64, FIR filter, Hilbert transform. Series: ASIC Tmodelsim (s), FPGA Xilinx Mapper Tmodelsim (s), FPGA Custom Mapper Tmodelsim (s), FPGA Custom Mapper Tfpga (s).]
[Chart: Number of LUTs (axis 0 to 12,000). Series: Xilinx Mapper, Custom Mapper.]
[Chart: Delay in Nanoseconds (axis 0 to 140). Series: Xilinx mapper, Custom mapper.]
No re-compilation
Guaranteed emulation
Conclusions
• The fine-grain approaches to reconfiguration and redundancy simultaneously improve fault tolerance and reconfiguration times
• These improvements are achieved at an affordable area and clock period overhead
Future Work
• As future work, radiation testing of the developed mitigation techniques needs to be conducted
• The developed CAD algorithms can be optimized with respect to clock period, area and reconfiguration times