Advanced Space Computing with System-Level Fault Tolerance Grzegorz Cieslewski, Adam Jacobs, Chris Conger, Alan D. George ECE Dept., University of Florida

Advanced Space Advanced Space Computing with System-Computing with System-

Level Fault ToleranceLevel Fault Tolerance

Grzegorz Cieslewski, Adam Jacobs,

Chris Conger, Alan D. George

ECE Dept., University of Florida

NSF CHREC Center

2

OutlineOutline Overview

NASA Dependable Multiprocessor

Reconfigurable Fault Tolerance (RFT)

Space Applications

Novel Computing Platforms

RapidIO

Conclusions

3

OverviewOverview What is advanced space computing?

New concepts, methods, and technologies to enable and deploy high-performance computing in space – for an increasing variety of missions and applications

Why is advanced space computing vital? On-board data processing

Downlink bandwidth to Earth is extremely limited Sensor data rates, resolutions, and modes are dramatically increasing Remote data processing from Earth is no longer viable Must process sensor data where it is captured, then downlink results

On-board autonomous processing & control Remote control from Earth is often not viable Propagation delays and bandwidth limits are insurmountable Space vehicles and space-delivered vehicles require autonomy Autonomy requires high-speed computing for decision-making

Why is it difficult to achieve? Cannot simply strap a Cray to a rocket!

Hazardous radiation environment in space Platforms with limited power, weight, size, cooling, etc. Traditional space processing technologies (RadHard) are severely limited

Potential for long mission times with diverse set of needs Need powerful yet adaptive technologies Must ensure high levels of reliability and availability

4

Taxonomy of Fault Taxonomy of Fault ToleranceTolerance

First, let us define various possible modes/methods of providing fault tolerance (FT) Many other options beyond simply throwing triple-modular redundancy (TMR) at the problem Software FT vs. hardware FT concepts largely similar, differences only at implementation level Radiation-hardening not listed, falls under “prevention” as opposed to detection or correction

DetectCorrect

orMask

Fault-TolerantHLL (e.g. MPI)

FT-HLL

Concurrent ErrorDetection

CED

Self-CheckingPairs

SCP

Algorithm-BasedFault-Tolerance

ABFT

Error CorrectionCodes

ECCN-Version

Programming

NVP

ByzantineResilience

BR

Checkpointing& Roll-back

CR

Software-ImplementedFault Tolerance

SIFTN-Modular

Redundancy

NMRTemporal and spatial

variants possiblefor many techniques

Most of these FT modes are currently being used at UF

5

NASA/Honeywell/UF ProjectNASA/Honeywell/UF Project 1st Space Supercomputer

Funded by NASA NMP In-situ sensor processing Autonomous control Speedups of 100 to 1000 First fault-tolerant, parallel,

reconfigurable computer for space

Infrastructure for fault-tolerant, high-speed computing in space

Robust system services Fault-tolerant MPI services Application services FPGA services Standard design framework Transparent API to resources for

earth & space scientists

NASA Dependable Multiprocessor (DM)

SystemController

B

SystemController

A(RHPPC) Data

Processor(PPC, FPGA)

#1

Spacecraft I /FMission -Specific

Devices

Instruments

. . . High-Speed Network A

Mission -SpecificSpacecraft Interface

Spacecraft I /F

Spacecraft I /F

High-Speed Network B

DataProcessor

(PPC, FPGA)#N

Reconfigurable Reconfigurable Cluster Cluster

ComputerComputer

6

Dependable Dependable MultiprocessorMultiprocessor

DM System Architecture Dual system controllers

Redundant radiation-hardened PPC boards

Monitor data processors’ health and communicate with spacecraft

Data processing engines High-performance, low-power

COTS SBCs running Linux PowerPC with AltiVec capabilities Optional FPGA co-processor for

additional performance Scalable to 20 data processing

nodes Redundant Interconnect

Dual GigE connections Automatically switch networks

when error is detected

DM Middleware (DMM) FT System Services

Manages status and health of multiple concurrent jobs

FT Embedded MPI (FEMPI) Lightweight subset of MPI Allows fault recovery without

restarting an entire parallel application

Application & FPGA Services Commonly used libraries such as

ATLAS, FFTW, GSL Simplified, generic API for FPGA

usage through USURP*

High-Availability Middleware Framework used to enable health

monitoring of cluster

* USURP is a standardized interface specification for RC platforms, developed by researchers at UF

7

DMM ComponentsDMM Components Mission Manager (MM)

Controls high-level job deployment Facilitates replication of lower-level

jobs Spatial or temporal replication Automatically compares and

validates outputs Monitors real-time deadlines Enables roll-forward / roll-back

when faults occur Job Manager (JM)

Controls low-level job deployment and scheduling across system

FT Manager (FTM) Manages low-level system faults

(node crash, job crash)

JM Agent (JMA) Deploys and monitors

programs on given node Provides application “heartbeat”

to system controller Mass Data Store (MDS)

Provides reliable centralized data services

Enables reliable checkpointing

Hardened Processor COTS Packet-Switched Network COTS Processor

COTS OS and Drivers COTS OS and Drivers

Reliable Messaging Middleware

JM FTM

Reliable Messaging Middleware

JMA ASL

JM – Job Manager FEMPI – Fault-Tolerant Embedded MPI JMA – Job Manager Agent ASL – Application Services Library FTM – Fault Tolerance Manager FCL – FPGA Coprocessor Library

Hardened System

COTS Data Processors

FCL FEMPI

MPI Application Process

Mission-Specific Parameters

Mission Manager

8

Algorithm-Based Fault Algorithm-Based Fault ToleranceTolerance Commonly refers to matrix coding method that is

preserved through certain linear algebra operations Matrix and vector multiply

Discrete Fourier Transform Discrete Wavelet Transform

Matrix decomposition: C = AB (LU, QR, Cholesky) Matrix inversion

Used to detect errors in these operations, and in certain cases allows for error correction

ABFT algorithms integrate with DM through Application Services API

An improved method of using ABFT on the 2D-FFT and SAR has been researched at UF Uses Hamming encoding Low overhead due to ABFT

Important aspects of ABFT currently under investigation at UF Round-off analysis Coverage analysis Code types Encoding and Decoding strategies Overhead

1. Augment the Input

Matrix

2. Partial Transform

3. Verify the Checksums

Fault-tolerant Partial Transform

Time Domain Data Matrix

Frequency Domain Data

Matrix

Fault-Tolerant 2D DFT

4. Transpose3. Fault-

tolerant Partial Transform

2. Transpose1. Fault-

tolerant Partial Transform

Computation Flow of Fault-tolerant 2D-FFT

15%

25%

35%

45%

55%

65%

75%

85%

95%

128 256 512 1024 2048 4096

Image Size [N x N]

Ove

rhea

d I

ncu

rred

Error Free

With Error

Experimental Overhead of Fault-tolerant RDP vs. a Fault-intolerant Version

9

Source Code Source Code TransformationsTransformations Most science applications are inherently non-fault-tolerant

Requires SIFT framework to improve reliability Possible to immunize programs against most errors by

transforming application source code Less overhead More control over FT techniques Compiler-independent Integrates with DM system through Application Services API

Custom source-to-source (S2S) transformation tool is currently under development at UF Accepts C source files as inputs Generates fault tolerant versions Uses fine-grain NMR-type of approach to provide improved

reliability and dependability Provides means of control flow checking (CFC) through software Minimizes number of undetected errors

Transformation options to be supported by the tool Variable replication Function replication Memory duplication / memory checking Synchronization intervals Condition evaluation

Post-evaluation verification Evaluation using replicated variables

Block protection

10

Reconfigurable Fault Reconfigurable Fault ToleranceTolerance GOAL – Research how to take advantage of reconfigurable nature of FPGAs, to provide

dynamically-adaptive fault tolerance in RC systems Leverage partial reconfiguration (PR) where advantageous

Explore virtual architectures to enable PR and reconfigurable

fault tolerance (RFT)

MOTIVATION – Why go with fixed/static FT, when

performance & reliability can be tuned as needed? Environmentally-aware & adaptive computing is wave of future

Achieving power savings and/or performance improvement,

without sacrificing reliability

CHALLENGES – limitations in concepts and tools, open-

ended problem requires innovative solutions Conventional methods typically based upon radiation-

hardened components and/or fault masking via chip-level TMR

Highly-custom nature of FPGA architectures in different systems and

apps makes defining a common approach to FT difficult

Satellite orbits, passing throughthe Van Allen radiation belt

Per

form

ance

Fault Tolerance

11

Reconfigurable FTReconfigurable FT Virtual Architecture for RFT

Novel concept of adaptable component-level protection (ACP)

Common components within VA: Adaptable protection frame – largely module/design-independent (see figure above) Error Status Register (ESR) for system-level error tracking/handling Re-synchronization controller or interfaces, for state saving and restoration Configuration controller, two options:

Internal configuration through ICAP External configuration controller

Benefits of internal protection: Early error detection and handling = faster recovery Redundancy can be changed into parallelism PR can be leveraged to provide uninterrupted operation of

non-failed components Challenges of internal protection:

Impossible to eliminate single points of failure, may still need higher-level (external) detection and handling

Stronger possibility of fault/error going unnoticed Single-event functional interrupts (SEFI) are major

concern

Protection Module “Frame”

A BB

2× parallel, SCP

A

no parallel, TMR

BA DC

4× parallel, single

BLANK

BLANK

no parallel, SCP“sockets” for modulesFPGA arch. diagram

AdaptableComponent-

levelProtection

Reliable ICAP

Controller

Reliable ESR&

Resynch. Controller

Reliable StaticLogic

ACP Module #1

ACP Module #N

To RH FPGA_CON

Error detection indicators

State saving &

restoration interfaces

VA concept diagram

FPGA

12

Space ApplicationsSpace Applications Synthetic Aperture Radar (SAR)

Used to form high-resolution images of Earth’s surface from moving platform in space

Patch-based processing with significant amount of overlap between patch boundaries

Parallelizable on multiple levels of granularity, possible without need for any inter-processor communication (one patch per node)

2-dimensional data set, can range in size from several hundred Megabytes to Gigabytes

Data set not significantly reduced through course of application

Highly amenable to ABFT; possible application for the Dependable Multiprocessor project

13

Space ApplicationsSpace Applications Hyperspectral Imaging (HSI)

Uses traditional beamforming techniques to perform coarse-grained classification on hyperspectral images

Adjustable to enable real-time processing Mostly embarrassingly parallel, exception being

weight computation (shown in red below) 3-dimensional data set, reduced through course of

application Auto-correlation sample matrix (ACSM) calculation

and beamforming (detection) amenable to ABFT Suggest NMR for weight computation (weight)

Parallel and multi-FPGA decompositions explored

14

Space ApplicationsSpace Applications Cosmic Ray Elimination

Uses image processing techniques to remove artifacts caused by cosmic rays

Image shows pre- and post-processed versions of a Hubble Telescope observation

Images are highly parallelizable, with minimal communication necessary

Main computation: median filtering Fault-tolerant median filter developed

Other portions of algorithm replicated by hand or S2S translator

Other aerospace-related application kernels Space-Time Adaptive Processing (STAP) Ground Moving Target Indicator (GMTI) Airborne LIDAR Digital Down Conversion (DDC) PDF Estimation

15

Novel Computing Novel Computing PlatformsPlatforms Fixed multi-core (FMC) devices

Cell Heterogeneous, vector compute engine, 3.2 GHz

clock rate, ~70 W max. power consumption GPU

Homogeneous, many (e.g. 100+) stream processors, ~1.5 GHz clock rate, ~120 W max. power consumption

Reconfigurable multi-core (RMC) devices Field-Programmable Object Array (FPOA)

Heterogeneous, coarse-grained processing elements, 1 GHz clock rate, ~35 W max power consumption

Field-Programmable Gate Array (FPGA) Heterogeneous, fine-grained processing elements,

max. clock rate ~500 MHz, achievable clock rate varies, ~30 W max. power consumption

Tilera Homogeneous, coarse-grained processing elements

(64 32-bit MIPS-like processors on-chip), ~750 MHz clock rate, ~30 W max. power consumption

Element CXi Heterogeneous, coarse-grained processing

elements, 200 MHz clock rate, ~1 W max. power consumption

Cell processor block diagram - http://www.research.ibm.com/journal/rd/494/kahle.html

FPOA architecture - http://www.mathstar.com/Architecture.php

1616

RC: Vital Technology for RC: Vital Technology for SpaceSpace

HPEC devices featured here; similar results vs. 65nm Xeon, 90nm GPU, etc. (see RSSI’08).

Results excerpted from pending presentation from CHREC-UF site for HPEC’08 Workshop.

Versatility in space missions (adapts as needs demand) Fixed archs. burdened with fixed choices, limited tradeoffs

Performance in space missions (speed, power, size, etc.) e.g. Computational density per Watt (CDW) device metric FPGAs far exceed FMC devices (CPU, Cell, GPU, etc.)

Parallel Operations – scales up to max. # of adds and mults (# of adds = # of mults) possible

Achievable Frequency – lowest frequency after PAR of DSP & logic-only impls. of add & mult comp. cores [FPGA]

Power – scales linearly with resource util; max. power reduced by ratio of achievable freq. to max. freq. [FPGA]

Parallel Operations – scales up to max. # of adds and mults (# of adds = # of mults) possible

Achievable Frequency – lowest frequency after PAR of DSP & logic-only impls. of add & mult comp. cores [FPGA]

Power – scales linearly with resource util; max. power reduced by ratio of achievable freq. to max. freq. [FPGA]

17

RapidIORapidIO High-speed embedded system

interconnect, replacement for bus-based backplanes Parallel and serial variants, serial is wave of future Multiple programming models

Research with RapidIO at UF Simulative research studying capability of RapidIO-based

computing platforms to support space-based radar (SBR) processing

Custom testbed designed and built, for verification of simulation models & experimentation with RapidIO & FPGAs

256 Pulses, 6 Beams, 1 Engine per Task per FPGA: 64k Ranges

0

10

20

30

40

50

60

70

80

90

100

0 256 512 768 1024 1280 1536 1792 2048

Time (ms)

SD

RA

M U

tiliz

ati

on

(%

)

Experimental logic analyzer measurements

Visualization of simulated GMTI application progress

Trace files

18

ConclusionsConclusions Fault tolerance for space should be more than

RadHard components & spatial TMR designs Fixed worst-case designs extremely limited in perf/Watt Instead, many FT methods & modes can be exploited Adaptive systems that react to environmental changes COTS featured inside critical performance path RadHard for FT management, outside critical perf. path

UF active on many space-related FT issues NASA Dependable Multiprocessor, CHREC RFT F4-08 Modes: SIFT, ABFT, S2S, RFT, FEMPI, CR, CED, etc. Devices: PPC/AV, FPGA, FPOA, Tilera, ElementCXi, etc. Space apps: HSI, SAR, LIDAR, GMTI, CRE, et al.

19

2009 IEEE Aerospace 2009 IEEE Aerospace ConferenceConference Track 7.12 Dependable Software for High Performance

Embedded Computing Platforms Transient error detection and recovery techniques

Compiler-based fault-tolerant techniques Algorithm-based fault-tolerant techniques

Tools and techniques for designing reliable software SIFT management frameworks Software dependability analysis Adaptive fault-tolerant techniques FT applications

Track Chairs Richard Linderman [email protected] Grzegorz Cieslewski [email protected]

Dates Abstract Submissions Due: July 1st, 2008 Paper Submissions Due: November 2nd, 2008

Documents

Advanced Space Computing with System-Level Fault Tolerance Grzegorz Cieslewski, Adam Jacobs, Chris Conger, Alan D. George ECE Dept., University of Florida