Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
MAPLD 2008
XRTC Use of Fault Injection XRTC Use of Fault Injection to Simulate Upsets in to Simulate Upsets in Reconfigurable Reconfigurable FPGAsFPGAsGary Swift, Chen Wei Tseng, and Gregory Miller, Xilinx, Inc., Gregory R. Allen, Jet Propulsion Laboratory / Caltech, and Heather Quinn, Los Alamos National Laboratory
MAPLD 2008G.Swift et al., page 2
Overview
• Introduction - Reconfigurable FPGAs– Design-Level vs. Configuration-Level– Radiation Test Consortium (XRTC)
• XRTC Beam Tests - Methodology and Results• Verifying Redundant Designs• XRTC Fault Injector• Lessons Learned So Far• Future Directions
XilinxRadiation
TestConsortium
MAPLD 2008G.Swift et al., page 3
Introducing Virtex-4QV FPGAs
• Space-Grade Reconfigurable Family of Four – Guaranteed 300 krad(Si) and Latchup Immune– Bigger and More Powerful = More Complex
• Design-Level vs. Configuration Level– Triple Modular Redundancy XTMR
• Resides in Design-Level Providing Upset Robustness• Protects Both Levels, but many more Configuration Upsets
– Errors only on statistically “unlucky” coincident upsets• In two domains, same voted segment• During single scrub cycle (fraction of a second)
– SRL16s, LUTROM, and LUTRAM Cross Levels• Formerly forbidden, new Virtex-4 feature allows their use
MAPLD 2008G.Swift et al., page 4
XRTC Beam Tests
• Basic Philosophy: Continuous Monitoring with In-Beam Strip Charts of:– Design Functionality– Configuration Upsets
– Also Power and Temperature
DUT
FuncMonFPGA
Counter/Buffer
Host Computer
Counter/Buffer
ConfigMonFPGA
Host Computer
BEAM
MAPLD 2008G.Swift et al., page 5
XRTC Apparatus
BEAM
FPGAundertest
Testing at Texas A&M Cyclotron Institute in Air
MAPLD 2008G.Swift et al., page 6
XRTC Results – Space Upset Rates
• SEFI Rate is about one per century• Unprotected Designs: a few upsets per day in GEO• Mitigation makes these upset negligible• Robustness at one error per century (SEFI Rate) with:
– Design-Level: Triple Modular Redundancy• Assures no single-point of failure
– Config-Level: Configuration Management• Prevents upset accumulation (transparent to design operation)• SEFI detection logic triggers reconfiguration (intrusive)
MAPLD 2008G.Swift et al., page 7
XRTC Fault Injector
Requirements– Configuration-level Fault Injector (or Upset Simulator)– Speed and ease of comprehensive single-bit injection– Kernel command set allows any middle-ware approach
• Either hardware or software generated commands– No impact on DUT designs– Minimum impact on FuncMon design
• Only need to add-on error signalling to ConfigMon• Introduce an easy-to-adapt template for FuncMon add-on
MAPLD 2008G.Swift et al., page 8
XRTC Fault Injector
Same apparatus as for beam testing– Leverage ConfigMon functionality without breaking it – Kernel is add-on to ConfigMon– First priority – Inject as fast as possible
• Saves time by skipping intermediate “clean” frame– Requires three- way coordination
• Injector hands off to FuncMon after injecting fault• FuncMon tests functionality and reports results• Certain results cause ConfigMon to scrub or re-configure
– Scripting of kernel commands was natural addition
MAPLD 2008G.Swift et al., page 9
FaultMon GUI Addition
Same (new) ConfigMon
GUI with FaultMon GUI
“sidecar”
MAPLD 2008G.Swift et al., page 10
Three FaultMon GUI Controls
Stop at End of Part
Manual ControlOne operation at a time
Script ControlExecute a list of kernel commands
Auto ControlComprehensive single-bit fault injection
1. Choose STOP
condition(s)
2. Choose starting point
3. Kick it off
4. Observe it run
MAPLD 2008G.Swift et al., page 11
Fault Injection Lessons So Far
• Very useful to designers and beam testers• Found 9 SelectMAP SEFI bits• Found an I/O Test Bit on certain pairs of I/Os• Many problems trace to state machine implementation
– Modern synthesis tools may “optimize” in bad ways• Trimming “extra” states• Changing the type of state machine
• Still working out complications– Half-latches give inconsistent results– Not all detected single faults are “real”– Other inconsistencies being worked
MAPLD 2008G.Swift et al., page 12
Future Directions
• Near-term – Replace and augment beam tests– Simulate tri-flux test– Simulate multi-bit upsets (MBUs)
• Limitations of in-beam testing for robust TMR– Beam Time Required is Expensive – No Help with Locating of Problem Areas
• Longer-term – Expand to Flight Design Qual– May require expanded test platform
MAPLD 2008G.Swift et al., page 13
V-4QV TMR-Counter Results
37% full, no single points-of-failure
1.E-04
1.E-03
1.E-02
1.E-01
1.E+00
1.E+01
1.E-07 1.E-06 1.E-05 1.E-04 1.E-03
Raw Bit Flip Rate (upsets/bit-sec)
Syst
em E
rror
Rat
e (e
rror
s/se
c)
Kevin Heldt &
Scott A. Anderson
Xilinx Confidential • Unpublished Work © Copyright 2009 Xilinx
MAPLD 2008G.Swift et al., page 14
V-4QV TMR-Counter Results
37% Full, 237 Scrub Faults &
213 Scrub+Reset Faults
1.00E-04
1.00E-03
1.00E-02
1.00E-01
1.00E+00
1.00E+01
1.00E-07 1.00E-06 1.00E-05 1.00E-04 1.00E-03
Raw Bit Flip Rate (upsets/bit-sec)
Syst
em E
rror
Rat
e (e
rror
s/sy
stem
-sec
)
Xilinx Confidential • Unpublished Work © Copyright 2009 Xilinx
MAPLD 2008G.Swift et al., page 15
V-4QV TMR-Counter Results
37% Full
1.E-04
1.E-03
1.E-02
1.E-01
1.E+00
1.E+01
1.E-07 1.E-06 1.E-05 1.E-04 1.E-03
Raw Bit Flip Rate (upsets/bit-sec)
Syst
em E
rror
Rat
e (e
rror
s/se
c)
NO FAILs 450 FAILs
Xilinx Confidential • Unpublished Work © Copyright 2009 Xilinx
MAPLD 2008G.Swift et al., page 16
41% Full, Partial Triplication4016 Scrub Fails,
0 Scrub+Reset Fails
1.00E-04
1.00E-03
1.00E-02
1.00E-01
1.00E+00
1.00E+01
1.00E-07 1.00E-06 1.00E-05 1.00E-04 1.00E-03
Raw Bit Flip Rate (upsets/bit-sec)
Syst
em E
rror
Rat
e (e
rror
s/sy
stem
-sec
)V-4QV TMR-Counter Results
Xilinx Confidential • Unpublished Work © Copyright 2009 Xilinx
MAPLD 2008G.Swift et al., page 17
Backup Material
• Virtex-4QV Devices and Features• Space Upset Rate Examples• Photos of XRTC Apparatus In Use
MAPLD 2008G.Swift et al., page 18
Virtex-4QV Devices
XQR4V XQR4V XQR4V XQR4V Description SX55 FX60 FX140 LX200 CFG* Configuration Bits* (millions) 15.4 14.5 34.5 43.0 BRAM Block Memory Bits 5,898,240 4,276,224 10,174,464 6,193,152 LOGIC Slices (2 Lookup Tables/slice) 24,576 25,280 63,168 89,088 DSP** 18x18 MACs** 512 128 192 96 PPC PowerPC405 Processors - 2 2 - DCM Clock Managers 12 12 20 12 MGT*** High-speed Transceivers*** - N/A N/A - IOBs Input/Output Blocks 640 576 896 960
* Only real memory cells in the Configuration Bit Stream are counted here (not counting BRAM) ** MAC=multiply-and-accumulate block for digital signal processing (DSP) *** MGTs are not supported for Virtex-4QV devices
Architectural Features
MAPLD 2008G.Swift et al., page 19
Example Space Upset Rates
Altitude ------------ XQR4V ----------- Orbit (km) Incl*
SX55 FX60 FX140 LX200 HI% 400 51.6º 0.73 0.69 1.61 2.03 69 LEO 800 22.0º 7.56 7.12 16.7 21.1 2
POLAR 833 98.7º 6.02 5.67 13.3 16.8 22 MEO 1200 65.0º 23.3 21.9 51.6 65.1 5 GEO 36,000 0º 4.28 4.03 9.5 11.9 94 * Incl = Inclination HI% = fraction from heavy ions
Configuration Cells
MAPLD 2008G.Swift et al., page 20
Example Space Upset Rates
BRAM Cells
Altitude ------------ XQR4V ----------- Orbit (km) Incl*
SX55 FX60 FX140 LX200 HI% 400 51.6º 0.72 0.52 1.24 0.75 84 LEO 800 22.0º 4.05 2.94 6.99 4.25 5
POLAR 833 98.7º 4.00 2.90 6.90 4.20 37 MEO 1200 65.0º 13.3 9.63 22.9 13.9 10 GEO 36,000 0º 4.49 3.26 7.75 4.71 98 * Incl = Inclination HI% = fraction from heavy ions
MAPLD 2008G.Swift et al., page 21
Example Space Upset Rates
Virtex-4QV SEFIs
Altitude ------------- SEFIs ------------ Orbit (km) Incl*
POR GSIG SMAP+ TOTAL HI%400 51.6º 1225 2161 1500 515 58 LEO 800 22.0º 100 114 112 36 13
POLAR 833 98.7º 131 165 146 49 14 MEO 1200 65.0º 32 37 35 11 3 GEO 36,000 0º 225 560 290 103 91
* Incl = Inclination HI% = fraction from heavy ions SMAP+ = SMAP & FAR SEFIs combined
MAPLD 2008G.Swift et al., page 22
Mature Test Methods & Apparatus
Parallel Cable
Parallel Cable
40-Pin Ribbon
Banana Cable
Rx
CFG FPGA
EXTERNAL MGTs
40 Pin IDE
CONNECTOR
S Test MICTOR CONNECTORS
Reset Switches
BNC POWER LUGS
Fan Pwr
FUNC FPGA
FUNCM ON SDRAM DIM M S
DUT FPGA
Programming Headers
Compact FLASH Header
P14 Teradyne Conn A
P15 Teradyne Conn B
Clock SMA Connectors
CFGM ON & FUNCM ON RS - 232 Connectors
1.5V NC
3.0V
2.5V
GND
3.3V
C FG M O N
V C C _B 6
FU N C M O N
V C C _B 3
FU N C M O N
V C C _B 2
FU N C M O N
V C C _B 0
FU N C M O N
V C C _B 1
FU N C M O N
V ref_B 0
FU N C M O N
V ref_B 1
G N D
BNC POWER LUGS DUT RS -232 Connector
Programming Proms & Flash
System ACE
CFGM ON SDRAM DIM M
GPIB Interface
PowerSupply Control
And SEL Monitor
ConfigurationMonitor
2 U HP6629 PowerSupply (Service)
CounterBoard
DIO Cable
Data Out40-Pin Ribbon
Control In40-Pin Ribbon
2 U HP6623 PowerSupply (DUT)
FunctionalMonitor
CounterBoard
DIO Cable
Rx
Rx
RxTxTx
Tx
Tx
Breakout Box
Rx
Data Out40-Pin Ribbon
Data Out40-Pin Ribbon
Control In40-Pin Ribbon
Tx
Breakout Board
Inside Vacuum Chamber
Tx
Rx
Data Out40-Pin Ribbon
Bulkhead
Readback /Programming
Laptop
Boeing
XilinxRadiation
TestConsortium
SEAKR
MAPLD 2008G.Swift et al., page 23
XRTC Beam Tests
• Static Results– Config cells– User BRAM & FFs– Functional Upsets
(aka SEFIs)– Both Protons
& Heavy Ions
• Dynamic & Mitigation Campaigns Underway
10-8
10-7
10-6
10-5
0 50 100Effective LET (MeV-cm2/mg)
Cro
ss S
ectio
n (c
m2 /d
evic
e)
SX55FX60LX200
POR SEFI
10-16
10-15
10-14
10-13
0 20 40 60 80 100 1Energy (MeV)
Cro
ss S
ectio
n (c
m2 /bit)
SX55FX60LX200
Configuration Cells
10-7
10-8
10-9
10-10
0 20 40 60 80 100 120Effective LET (MeV*cm2/mg)
Cro
ss S
ectio
n (c
m2 /b
it)
SX55FX60LX200
Configuration Cells
MAPLD 2008G.Swift et al., page 24
ConfigMonFPGA
FuncMonFPGA
XRTC ApparatusTesting at Texas A&M Cyclotron Institute in Vacuum Chamber
FPGAundertest
MAPLD 2008G.Swift et al., page 25
The TMR Verification Problem
• “Working” TMR may actually be broken– Stuck-at faults– Domain criss-crossing
• In the pathological case of only two working domains, a design’s error cross-section is double!
MAPLD 2008G.Swift et al., page 26
• Benchtop smoke test for three-leg functionality• In-beam tri-flux test (expensive and non-specific)
– Probability of a system error is approximately proportional to the square of upsets per scrub cycle
• Fault Injection (again)
The TMR Verification Problem
Counters
r (bit errors/bit-second)
1e-7 1e-6 1e-5 1e-4 1e-30.0001
0.001
0.01
0.1
1
10
datageneral approximationsmall-r form extrapolated
R (
syst
em e
rror
s / s
econ
d)