Swift 1 P173/MAPLD 2005
Upset Susceptibility and Design Mitigation of PowerPC405 Processors Embedded in Virtex II-Pro FPGAs
Authors
Gary Swift, Jet Propulsion Laboratory/California Institute of Technology
Gregory Allen, Jet Propulsion Laboratory/California Institute of Technology
Jeffrey George, The Aerospace Corporation
Sana Rezgui, Xilinx Corporation
Carl Carmichael, Xilinx Corporation
Fayez Chayab, MDRobotics
Abstract
We show recent results for the upset susceptibility of the registers and caches in the embedded PowerPC405 in the Xilinx V2P40 FPGA. For critical flight designs where configuration upsets are mitigated effectively, these upsets can dominate the system error rate.
We consider several techniques for implementing various levels of redundancy to reduce system errors, including single-, dual- and triple-chip options. We conclude that the dual-chip option may often be the best choice and warrants further study.
Background - Reconfigurable FPGA Upsets
The basic building blocks are soft to upset [Ref. 1].
[Figure: Configuration Cells and Block RAM, XQR2VP40. Cross section (cm²/bit) vs. LET (MeV per mg/cm²); curves: Config, BRAM.]
Background - Upset Mitigation
Critical applications require design-level upset mitigation
• Design Triplication
– The use of TMR (triple modular redundancy) in a design allows correct function through triplicated majority voters even when a configuration element is upset.
– The extra design effort is now largely automated by new software (TMRtool).
• Active Configuration Scrubbing
– Upsets in the configuration must not be allowed to accumulate or TMR will “break”.
– Scrubbing uses some resources, but can be implemented so that it is transparent to system operation.
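The interaction between voting and scrubbing can be illustrated with a minimal Python sketch. It is a toy model, not the TMRtool implementation: the "design" is reduced to one integer word held in three copies, and the names `majority_vote`, `flip_bit`, and `scrub` are hypothetical helpers introduced only for this illustration.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit follows at least two inputs."""
    return (a & b) | (a & c) | (b & c)


def flip_bit(word: int, bit: int) -> int:
    """Model a single-event upset as one flipped bit."""
    return word ^ (1 << bit)


def scrub(copies: list, golden: int) -> list:
    """Model scrubbing: rewrite every copy from a known-good (golden) value."""
    return [golden for _ in copies]


golden = 0b1011_0010
copies = [golden, golden, golden]

# One upset in one copy is outvoted by the other two.
copies[0] = flip_bit(copies[0], 3)
single_upset_ok = majority_vote(*copies) == golden

# Without scrubbing, a second upset of the same bit in another copy
# accumulates, and the voter now returns the wrong value ("broken" TMR).
unscrubbed = [copies[0], flip_bit(copies[1], 3), copies[2]]
accumulated_ok = majority_vote(*unscrubbed) == golden

# With a scrub between the two upsets, the vote stays correct.
scrubbed = scrub(copies, golden)
scrubbed[1] = flip_bit(scrubbed[1], 3)
scrubbed_ok = majority_vote(*scrubbed) == golden

print(single_upset_ok, accumulated_ok, scrubbed_ok)
```

The middle case is the reason scrubbing matters: the voter masks any single upset, but two upsets to the same bit in different copies defeat it.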
Embedded “Hard-Core” Processor(s) Upset
PowerPC 405 cores in Virtex II-Pro family FPGAs offer unprecedented computational power inside an FPGA, but include additional upsetable storage elements
[Figure: General Purpose Registers, XQR2VP40 embedded PPC405 core. Cross section (cm²/bit) vs. LET (MeV per mg/cm²); curves: Ones, Zeros.]
Processor Upsets – Data Cache
Processor caches are very important features for increased performance; however, upsets in the caches can lead to system errors.
[Figure: Data Cache, XQR2VP40 embedded PPC405 core. Cross section (cm²/bit) vs. LET (MeV per mg/cm²); curves: Ones, Zeros.]
Processor Upset Mitigation
The “obvious” solution of implementing TMR with three processor cores is not an available single chip option because the maximum number of processors per FPGA is currently two.
Tradeoffs between upset robustness and system complexity, possibly spanning multiple FPGAs, must be considered.
One-Chip Solution
Running two processors in lockstep is conceptually simple, especially as they can reside in a single FPGA. A fast TMR-ed comparison block is required to contain errors and not allow them to propagate into the rest of the system. A processor upset will appear to the comparison block as a disagreement, necessitating that both processors be stopped within the current clock cycle. Both must then be forced to roll back to a known good software “bookmark” or, alternatively, to reboot.
Flow Chart: One-Chip Solution
• Single instruction executed (in lockstep)
• Compare processor outputs
• Error detected?
– N: continue with the next instruction
– Y: stop execution; initialize processor reboot; execute reboot and/or resynchronization processes
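The lockstep-compare-rollback loop can be sketched in Python as a toy simulation. Everything here is an illustrative assumption rather than the flight design: each "processor" is a single accumulator, `run_lockstep` and its `upset_at` fault-injection parameter are hypothetical, and the bookmark is simply a saved program counter and accumulator value.

```python
from dataclasses import dataclass


@dataclass
class Core:
    """Toy processor: its whole state is one accumulator."""
    acc: int = 0

    def step(self, operand: int) -> int:
        self.acc += operand
        return self.acc


def run_lockstep(program, upset_at=None, bookmark_every=4):
    """Run two cores in lockstep; roll both back to the last bookmark on a mismatch.

    upset_at is a hypothetical fault-injection hook: the instruction index at
    which core B suffers a one-time bit flip.
    """
    a, b = Core(), Core()
    bookmark_pc, bookmark_acc = 0, 0
    rollbacks = 0
    pc = 0
    while pc < len(program):
        if pc % bookmark_every == 0:
            # Both cores agree here, so this state is a known-good bookmark.
            bookmark_pc, bookmark_acc = pc, a.acc
        out_a = a.step(program[pc])
        out_b = b.step(program[pc])
        if upset_at == pc:
            b.acc ^= 0x10          # inject a single upset into core B
            out_b = b.acc
            upset_at = None        # the upset happens only once
        if out_a != out_b:
            # The comparator cannot tell which core is wrong: stop both, roll back.
            a.acc = b.acc = bookmark_acc
            pc = bookmark_pc
            rollbacks += 1
            continue
        pc += 1
    return a.acc, rollbacks


program = [1, 2, 3, 4, 5, 6, 7, 8]
print(run_lockstep(program))               # no upset
print(run_lockstep(program, upset_at=5))   # one upset, one rollback
```

With only two copies the comparison detects an error but cannot identify the faulty core, which is why every upset costs a rollback or reboot in this scheme.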
Advantages
• Contained in one chip
– No chip-to-chip interconnects (minimal latency and propagation delay)
– Lower power consumption
– Less board area
– No chip-to-chip synchronization
• Technology is more developed and tested [Ref. 2]
Disadvantages
• More system outages
– Reboot or rollback on every error
– Not suitable for some critical real-time applications
• Twice as many errors as on a single processor, but at least they are detected
Note: Requires extra device – either watchdog timer or external configuration scrubber
Two-Chip Solution
With four processors in lockstep (necessitating two chips), a solution as robust as full TMR is possible. In this scheme, a pair of processors that get into a disagreement due to an upset will be stopped while the system runs without interruption on the processor pair that is in agreement. Correct internal state information is available in the working pair, so the stopped pair can be re-synchronized at a convenient opportunity, preferably soon. Thus, it is possible to re-synchronize almost transparently and rapidly get back to full four-processor lockstep operation with minimal intrusion. As a side effect of using two separate FPGAs, additional robustness is possible by adding cross-strapped configuration control.
Flow Chart: Two-Chip Solution
• Power up configuration (both FPGAs from the same ROM)
• Parallel internal error checking
• Error detected?
– N: continue checking
– Y: processors with disagreement halt
• Wait for an opportunity to reconfigure
• Healthy FPGA takes over and initiates a full or partial reconfiguration of the upset FPGA
• Resynchronization arbitrator synchronizes processors to appropriate location
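A toy Python model of the dual-pair scheme is sketched below. It is an illustration under stated assumptions, not the planned implementation: each core is reduced to one integer, the `Pair` class and the `run` and `resynchronize` functions are hypothetical names, the upset is injected on chip B at a chosen step, and the reconfiguration of the upset FPGA is omitted so that only the halt and resynchronization behavior is shown.

```python
from dataclasses import dataclass, field


@dataclass
class Pair:
    """Two lockstepped cores on one FPGA; each core's state is one integer."""
    name: str
    cores: list = field(default_factory=lambda: [0, 0])
    halted: bool = False

    def step(self, operand: int) -> None:
        if not self.halted:
            self.cores = [c + operand for c in self.cores]

    def self_check(self) -> None:
        # Internal comparison: a disagreeing pair halts itself.
        if self.cores[0] != self.cores[1]:
            self.halted = True


def resynchronize(upset: Pair, healthy: Pair) -> None:
    """Copy known-good state from the healthy pair and restart the upset pair."""
    upset.cores = [healthy.cores[0], healthy.cores[0]]
    upset.halted = False


def run(program, upset_at, resync_at):
    chip_a, chip_b = Pair("A"), Pair("B")
    outputs = []
    for pc, operand in enumerate(program):
        chip_a.step(operand)
        chip_b.step(operand)
        if pc == upset_at:
            chip_b.cores[1] ^= 0x8      # hypothetical injected upset on chip B
        chip_a.self_check()
        chip_b.self_check()
        running = [p for p in (chip_a, chip_b) if not p.halted]
        if not running:
            raise RuntimeError("both pairs upset: full reboot required")
        outputs.append(running[0].cores[0])   # system output never stalls
        if pc == resync_at:                   # the "convenient opportunity"
            for p in (chip_a, chip_b):
                if p.halted:
                    resynchronize(p, running[0])
    return outputs, chip_a.halted, chip_b.halted


print(run([1, 2, 3, 4, 5, 6], upset_at=2, resync_at=4))
```

The output stream is identical to the upset-free case because the agreeing pair carries the system between the upset and the resynchronization point; only simultaneous disagreement in both pairs forces a reboot.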
Advantages
• Reboots are rare; a reboot requires simultaneous errors in two separate processors
• Processor upsets are transparently handled without system outage until convenient re-synchronization opportunities
• Enhanced robustness – outages lowered to less than the SEFI rate of ~1 in 80 years per device
• Allows added configuration robustness
– Chips check each other (not self-checking)
– Eliminates need for external watchdog timer
Disadvantages
• Complicated
– Inter-chip communication/synchronization
– Transparent reboot/resynchronization of both processors in chip with error
• Twice the power consumption
• In-beam testing is not yet done (although planned for the near future)
Three-Chip Solution
The three-chip implementation (also known as the “virtual FPGA” solution [Ref. 3]) takes the responsibility of error detection out of the hands of the upsetable FPGAs by adding a Radiation-Hardened ASIC. Note that only one processor per FPGA is needed. The ASIC handles stopping error propagation and re-synchronizing an upset processor. Additionally, the ASIC can be used for configuration control of all three FPGAs.
Flow Chart: Three-Chip Solution
• Configure all three FPGAs
• Processors execute a cycle in lockstep
• Error is detected?
– N: continue with the next cycle
– Y: re-synchronize state of device with upset
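The voting and re-synchronization role of the ASIC can be sketched in Python as follows. This is a toy model under stated assumptions, not the ASIC design from [Ref. 3]: each FPGA's processor is reduced to one integer of state, and `vote`, `run`, `upset_at`, and `upset_device` are hypothetical names used only for the illustration.

```python
from collections import Counter


def vote(outputs):
    """Return (majority value, index of the dissenting device or None).

    Raises if all three outputs differ, which a 2-of-3 voter cannot resolve.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count == 3:
        return value, None
    if count == 2:
        dissenter = next(i for i, out in enumerate(outputs) if out != value)
        return value, dissenter
    raise RuntimeError("no majority: all three devices disagree")


def run(program, upset_at=None, upset_device=0):
    states = [0, 0, 0]          # one toy processor state per FPGA
    voted_outputs, resyncs = [], 0
    for pc, operand in enumerate(program):
        states = [s + operand for s in states]
        if pc == upset_at:
            states[upset_device] ^= 0x4   # hypothetical injected upset
        value, dissenter = vote(states)
        voted_outputs.append(value)       # the upset never reaches the output
        if dissenter is not None:
            states[dissenter] = value     # re-synchronize the upset device
            resyncs += 1
    return voted_outputs, resyncs


print(run([1, 2, 3, 4], upset_at=1, upset_device=2))
```

Unlike the two-processor comparison, a three-way vote identifies which device is wrong, so the voted output stays correct and only the dissenting device needs its state reloaded.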
Advantages
• Maximum robustness to upsets
• Only three processors in lockstep (but in 3 chips)
• More fabric available for other functions
• No system outages; errors and SEFIs are handled transparently
• Most implementation details are confined to the ASIC and don’t affect the IP in the FPGAs significantly
Disadvantages
• Complex ASIC development for controller to vote outputs and re-load/re-sync upset processor
• ASIC development cost (currently funded though)
• Board area
Conclusions
• The two-chip and three-chip solutions have about the same robustness, power consumption, and system complexity, and both handle upsets better than the one-chip solution.
• The two- vs. three-chip decision mostly boils down to the familiar FPGA vs. ASIC debate.
• The three-chip solution may use less power than the two-chip solution. (Is the ASIC’s power consumption less than that of one processor core?)
• At present, the JPL-preferred approach is the two-chip implementation, which achieves maximum flexibility and near-maximum robustness to upsets.
References
• [1] J. George et al., “Initial Single-Event Effects Testing and Mitigation in the Xilinx Virtex II-Pro FPGA,” Paper 211, MAPLD 2005.
• [2] M. Wang and G. Bolotin, “SEU Mitigation Techniques for Xilinx Virtex-II Pro FPGA,” Paper D110, MAPLD 2004, http://klabs.org/mapld04/presentations/session_d/1_d110_wang_s.ppt
• [3] J. Lyke and B. Marty, Virtual Field Programmable Gate Array Triple Modular Redundant Cell Design, Air Force Research Laboratory: Space Vehicles Directorate, AFRL-VS-PS-TR-2004-1093, April 28, 2004.