View
215
Download
0
Category
Tags:
Preview:
Citation preview
Advanced Space Advanced Space Computing with System-Computing with System-
Level Fault ToleranceLevel Fault Tolerance
Grzegorz Cieslewski, Adam Jacobs,
Chris Conger, Alan D. George
ECE Dept., University of Florida
NSF CHREC Center
2
OutlineOutline Overview
NASA Dependable Multiprocessor
Reconfigurable Fault Tolerance (RFT)
Space Applications
Novel Computing Platforms
RapidIO
Conclusions
3
OverviewOverview What is advanced space computing?
New concepts, methods, and technologies to enable and deploy high-performance computing in space – for an increasing variety of missions and applications
Why is advanced space computing vital? On-board data processing
Downlink bandwidth to Earth is extremely limited Sensor data rates, resolutions, and modes are dramatically increasing Remote data processing from Earth is no longer viable Must process sensor data where it is captured, then downlink results
On-board autonomous processing & control Remote control from Earth is often not viable Propagation delays and bandwidth limits are insurmountable Space vehicles and space-delivered vehicles require autonomy Autonomy requires high-speed computing for decision-making
Why is it difficult to achieve? Cannot simply strap a Cray to a rocket!
Hazardous radiation environment in space Platforms with limited power, weight, size, cooling, etc. Traditional space processing technologies (RadHard) are severely limited
Potential for long mission times with diverse set of needs Need powerful yet adaptive technologies Must ensure high levels of reliability and availability
4
Taxonomy of Fault Taxonomy of Fault ToleranceTolerance
First, let us define various possible modes/methods of providing fault tolerance (FT) Many other options beyond simply throwing triple-modular redundancy (TMR) at the problem Software FT vs. hardware FT concepts largely similar, differences only at implementation level Radiation-hardening not listed, falls under “prevention” as opposed to detection or correction
DetectCorrect
orMask
Fault-TolerantHLL (e.g. MPI)
FT-HLL
Concurrent ErrorDetection
CED
Self-CheckingPairs
SCP
Algorithm-BasedFault-Tolerance
ABFT
Error CorrectionCodes
ECCN-Version
Programming
NVP
ByzantineResilience
BR
Checkpointing& Roll-back
CR
Software-ImplementedFault Tolerance
SIFTN-Modular
Redundancy
NMRTemporal and spatial
variants possiblefor many techniques
Most of these FT modes are currently being used at UF
5
NASA/Honeywell/UF ProjectNASA/Honeywell/UF Project 1st Space Supercomputer
Funded by NASA NMP In-situ sensor processing Autonomous control Speedups of 100 to 1000 First fault-tolerant, parallel,
reconfigurable computer for space
Infrastructure for fault-tolerant, high-speed computing in space
Robust system services Fault-tolerant MPI services Application services FPGA services Standard design framework Transparent API to resources for
earth & space scientists
NASA Dependable Multiprocessor (DM)
SystemController
B
SystemController
A(RHPPC) Data
Processor(PPC, FPGA)
#1
Spacecraft I /FMission -Specific
Devices
Instruments
. . . High-Speed Network A
Mission -SpecificSpacecraft Interface
Spacecraft I /F
Spacecraft I /F
High-Speed Network B
DataProcessor
(PPC, FPGA)#N
Reconfigurable Reconfigurable Cluster Cluster
ComputerComputer
6
Dependable Dependable MultiprocessorMultiprocessor
DM System Architecture Dual system controllers
Redundant radiation-hardened PPC boards
Monitor data processors’ health and communicate with spacecraft
Data processing engines High-performance, low-power
COTS SBCs running Linux PowerPC with AltiVec capabilities Optional FPGA co-processor for
additional performance Scalable to 20 data processing
nodes Redundant Interconnect
Dual GigE connections Automatically switch networks
when error is detected
DM Middleware (DMM) FT System Services
Manages status and health of multiple concurrent jobs
FT Embedded MPI (FEMPI) Lightweight subset of MPI Allows fault recovery without
restarting an entire parallel application
Application & FPGA Services Commonly used libraries such as
ATLAS, FFTW, GSL Simplified, generic API for FPGA
usage through USURP*
High-Availability Middleware Framework used to enable health
monitoring of cluster
* USURP is a standardized interface specification for RC platforms, developed by researchers at UF
7
DMM ComponentsDMM Components Mission Manager (MM)
Controls high-level job deployment Facilitates replication of lower-level
jobs Spatial or temporal replication Automatically compares and
validates outputs Monitors real-time deadlines Enables roll-forward / roll-back
when faults occur Job Manager (JM)
Controls low-level job deployment and scheduling across system
FT Manager (FTM) Manages low-level system faults
(node crash, job crash)
JM Agent (JMA) Deploys and monitors
programs on given node Provides application “heartbeat”
to system controller Mass Data Store (MDS)
Provides reliable centralized data services
Enables reliable checkpointing
Hardened Processor COTS Packet-Switched Network COTS Processor
COTS OS and Drivers COTS OS and Drivers
Reliable Messaging Middleware
JM FTM
Reliable Messaging Middleware
JMA ASL
JM – Job Manager FEMPI – Fault-Tolerant Embedded MPI JMA – Job Manager Agent ASL – Application Services Library FTM – Fault Tolerance Manager FCL – FPGA Coprocessor Library
Hardened System
COTS Data Processors
FCL FEMPI
MPI Application Process
Mission-Specific Parameters
Mission Manager
8
Algorithm-Based Fault Algorithm-Based Fault ToleranceTolerance Commonly refers to matrix coding method that is
preserved through certain linear algebra operations Matrix and vector multiply
Discrete Fourier Transform Discrete Wavelet Transform
Matrix decomposition: C = AB (LU, QR, Cholesky) Matrix inversion
Used to detect errors in these operations, and in certain cases allows for error correction
ABFT algorithms integrate with DM through Application Services API
An improved method of using ABFT on the 2D-FFT and SAR has been researched at UF Uses Hamming encoding Low overhead due to ABFT
Important aspects of ABFT currently under investigation at UF Round-off analysis Coverage analysis Code types Encoding and Decoding strategies Overhead
1. Augment the Input
Matrix
2. Partial Transform
3. Verify the Checksums
Fault-tolerant Partial Transform
Time Domain Data Matrix
Frequency Domain Data
Matrix
Fault-Tolerant 2D DFT
4. Transpose3. Fault-
tolerant Partial Transform
2. Transpose1. Fault-
tolerant Partial Transform
Computation Flow of Fault-tolerant 2D-FFT
15%
25%
35%
45%
55%
65%
75%
85%
95%
128 256 512 1024 2048 4096
Image Size [N x N]
Ove
rhea
d I
ncu
rred
Error Free
With Error
Experimental Overhead of Fault-tolerant RDP vs. a Fault-intolerant Version
9
Source Code Source Code TransformationsTransformations Most science applications are inherently non-fault-tolerant
Requires SIFT framework to improve reliability Possible to immunize programs against most errors by
transforming application source code Less overhead More control over FT techniques Compiler-independent Integrates with DM system through Application Services API
Custom source-to-source (S2S) transformation tool is currently under development at UF Accepts C source files as inputs Generates fault tolerant versions Uses fine-grain NMR-type of approach to provide improved
reliability and dependability Provides means of control flow checking (CFC) through software Minimizes number of undetected errors
Transformation options to be supported by the tool Variable replication Function replication Memory duplication / memory checking Synchronization intervals Condition evaluation
Post-evaluation verification Evaluation using replicated variables
Block protection
10
Reconfigurable Fault Reconfigurable Fault ToleranceTolerance GOAL – Research how to take advantage of reconfigurable nature of FPGAs, to provide
dynamically-adaptive fault tolerance in RC systems Leverage partial reconfiguration (PR) where advantageous
Explore virtual architectures to enable PR and reconfigurable
fault tolerance (RFT)
MOTIVATION – Why go with fixed/static FT, when
performance & reliability can be tuned as needed? Environmentally-aware & adaptive computing is wave of future
Achieving power savings and/or performance improvement,
without sacrificing reliability
CHALLENGES – limitations in concepts and tools, open-
ended problem requires innovative solutions Conventional methods typically based upon radiation-
hardened components and/or fault masking via chip-level TMR
Highly-custom nature of FPGA architectures in different systems and
apps makes defining a common approach to FT difficult
Satellite orbits, passing throughthe Van Allen radiation belt
Per
form
ance
Fault Tolerance
11
Reconfigurable FTReconfigurable FT Virtual Architecture for RFT
Novel concept of adaptable component-level protection (ACP)
Common components within VA: Adaptable protection frame – largely module/design-independent (see figure above) Error Status Register (ESR) for system-level error tracking/handling Re-synchronization controller or interfaces, for state saving and restoration Configuration controller, two options:
Internal configuration through ICAP External configuration controller
Benefits of internal protection: Early error detection and handling = faster recovery Redundancy can be changed into parallelism PR can be leveraged to provide uninterrupted operation of
non-failed components Challenges of internal protection:
Impossible to eliminate single points of failure, may still need higher-level (external) detection and handling
Stronger possibility of fault/error going unnoticed Single-event functional interrupts (SEFI) are major
concern
Protection Module “Frame”
A BB
2× parallel, SCP
A
no parallel, TMR
BA DC
4× parallel, single
BLANK
BLANK
no parallel, SCP“sockets” for modulesFPGA arch. diagram
AdaptableComponent-
levelProtection
Reliable ICAP
Controller
Reliable ESR&
Resynch. Controller
Reliable StaticLogic
ACP Module #1
ACP Module #N
To RH FPGA_CON
Error detection indicators
State saving &
restoration interfaces
VA concept diagram
FPGA
12
Space ApplicationsSpace Applications Synthetic Aperture Radar (SAR)
Used to form high-resolution images of Earth’s surface from moving platform in space
Patch-based processing with significant amount of overlap between patch boundaries
Parallelizable on multiple levels of granularity, possible without need for any inter-processor communication (one patch per node)
2-dimensional data set, can range in size from several hundred Megabytes to Gigabytes
Data set not significantly reduced through course of application
Highly amenable to ABFT; possible application for the Dependable Multiprocessor project
13
Space ApplicationsSpace Applications Hyperspectral Imaging (HSI)
Uses traditional beamforming techniques to perform coarse-grained classification on hyperspectral images
Adjustable to enable real-time processing Mostly embarrassingly parallel, exception being
weight computation (shown in red below) 3-dimensional data set, reduced through course of
application Auto-correlation sample matrix (ACSM) calculation
and beamforming (detection) amenable to ABFT Suggest NMR for weight computation (weight)
Parallel and multi-FPGA decompositions explored
14
Space ApplicationsSpace Applications Cosmic Ray Elimination
Uses image processing techniques to remove artifacts caused by cosmic rays
Image shows pre- and post-processed versions of a Hubble Telescope observation
Images are highly parallelizable, with minimal communication necessary
Main computation: median filtering Fault-tolerant median filter developed
Other portions of algorithm replicated by hand or S2S translator
Other aerospace-related application kernels Space-Time Adaptive Processing (STAP) Ground Moving Target Indicator (GMTI) Airborne LIDAR Digital Down Conversion (DDC) PDF Estimation
15
Novel Computing Novel Computing PlatformsPlatforms Fixed multi-core (FMC) devices
Cell Heterogeneous, vector compute engine, 3.2 GHz
clock rate, ~70 W max. power consumption GPU
Homogeneous, many (e.g. 100+) stream processors, ~1.5 GHz clock rate, ~120 W max. power consumption
Reconfigurable multi-core (RMC) devices Field-Programmable Object Array (FPOA)
Heterogeneous, coarse-grained processing elements, 1 GHz clock rate, ~35 W max power consumption
Field-Programmable Gate Array (FPGA) Heterogeneous, fine-grained processing elements,
max. clock rate ~500 MHz, achievable clock rate varies, ~30 W max. power consumption
Tilera Homogeneous, coarse-grained processing elements
(64 32-bit MIPS-like processors on-chip), ~750 MHz clock rate, ~30 W max. power consumption
Element CXi Heterogeneous, coarse-grained processing
elements, 200 MHz clock rate, ~1 W max. power consumption
Cell processor block diagram - http://www.research.ibm.com/journal/rd/494/kahle.html
FPOA architecture - http://www.mathstar.com/Architecture.php
1616
RC: Vital Technology for RC: Vital Technology for SpaceSpace
HPEC devices featured here; similar results vs. 65nm Xeon, 90nm GPU, etc. (see RSSI’08).
Results excerpted from pending presentation from CHREC-UF site for HPEC’08 Workshop.
Versatility in space missions (adapts as needs demand) Fixed archs. burdened with fixed choices, limited tradeoffs
Performance in space missions (speed, power, size, etc.) e.g. Computational density per Watt (CDW) device metric FPGAs far exceed FMC devices (CPU, Cell, GPU, etc.)
Parallel Operations – scales up to max. # of adds and mults (# of adds = # of mults) possible
Achievable Frequency – lowest frequency after PAR of DSP & logic-only impls. of add & mult comp. cores [FPGA]
Power – scales linearly with resource util; max. power reduced by ratio of achievable freq. to max. freq. [FPGA]
Parallel Operations – scales up to max. # of adds and mults (# of adds = # of mults) possible
Achievable Frequency – lowest frequency after PAR of DSP & logic-only impls. of add & mult comp. cores [FPGA]
Power – scales linearly with resource util; max. power reduced by ratio of achievable freq. to max. freq. [FPGA]
17
RapidIORapidIO High-speed embedded system
interconnect, replacement for bus-based backplanes Parallel and serial variants, serial is wave of future Multiple programming models
Research with RapidIO at UF Simulative research studying capability of RapidIO-based
computing platforms to support space-based radar (SBR) processing
Custom testbed designed and built, for verification of simulation models & experimentation with RapidIO & FPGAs
256 Pulses, 6 Beams, 1 Engine per Task per FPGA: 64k Ranges
0
10
20
30
40
50
60
70
80
90
100
0 256 512 768 1024 1280 1536 1792 2048
Time (ms)
SD
RA
M U
tiliz
ati
on
(%
)
Experimental logic analyzer measurements
Visualization of simulated GMTI application progress
Trace files
18
ConclusionsConclusions Fault tolerance for space should be more than
RadHard components & spatial TMR designs Fixed worst-case designs extremely limited in perf/Watt Instead, many FT methods & modes can be exploited Adaptive systems that react to environmental changes COTS featured inside critical performance path RadHard for FT management, outside critical perf. path
UF active on many space-related FT issues NASA Dependable Multiprocessor, CHREC RFT F4-08 Modes: SIFT, ABFT, S2S, RFT, FEMPI, CR, CED, etc. Devices: PPC/AV, FPGA, FPOA, Tilera, ElementCXi, etc. Space apps: HSI, SAR, LIDAR, GMTI, CRE, et al.
19
2009 IEEE Aerospace 2009 IEEE Aerospace ConferenceConference Track 7.12 Dependable Software for High Performance
Embedded Computing Platforms Transient error detection and recovery techniques
Compiler-based fault-tolerant techniques Algorithm-based fault-tolerant techniques
Tools and techniques for designing reliable software SIFT management frameworks Software dependability analysis Adaptive fault-tolerant techniques FT applications
Track Chairs Richard Linderman Richard.Linderman@rl.af.mil Grzegorz Cieslewski cieslewski@hcs.ufl.edu
Dates Abstract Submissions Due: July 1st, 2008 Paper Submissions Due: November 2nd, 2008
Recommended