Cieslewski UF FTWorkshop Final2


1/19

Advanced Space Computing with System-Level Fault Tolerance

Grzegorz Cieslewski, Adam Jacobs,

Chris Conger, Alan D. George

ECE Dept., University of Florida

NSF CHREC Center


2/19

Outline

Overview

NASA Dependable Multiprocessor

Reconfigurable Fault Tolerance (RFT)

Space Applications

Novel Computing Platforms

RapidIO

Conclusions


3/19

Overview

What is advanced space computing?

New concepts, methods, and technologies to enable and deploy high-performance computing in space for an increasing variety of missions and applications

Why is advanced space computing vital?

On-board data processing

Downlink bandwidth to Earth is extremely limited

Sensor data rates, resolutions, and modes are dramatically increasing

Remote data processing from Earth is no longer viable

Must process sensor data where it is captured, then downlink results

On-board autonomous processing & control

Remote control from Earth is often not viable

Propagation delays and bandwidth limits are insurmountable

Space vehicles and space-delivered vehicles require autonomy

Autonomy requires high-speed computing for decision-making

Why is it difficult to achieve?

Cannot simply strap a Cray to a rocket!

Hazardous radiation environment in space

Platforms with limited power, weight, size, cooling, etc.

Traditional space processing technologies (RadHard) are severely limited

Potential for long mission times with diverse set of needs

Need powerful yet adaptive technologies

Must ensure high levels of reliability and availability


4/19

Taxonomy of Fault Tolerance

First, let us define various possible modes/methods of providing fault tolerance (FT)

Many other options beyond simply throwing triple-modular redundancy (TMR) at the problem

Software FT vs. hardware FT concepts largely similar, differences only at implementation level

Radiation-hardening not listed, falls under prevention as opposed to detection or correction

Figure: taxonomy of FT methods. Category labels: Detect; Correct or Mask. Methods shown:

Fault-Tolerant HLL, e.g. MPI (FT-HLL)

Concurrent Error Detection (CED)

Self-Checking Pairs (SCP)

Algorithm-Based Fault Tolerance (ABFT)

Error Correction Codes (ECC)

N-Version Programming (NVP)

Byzantine Resilience (BR)

Checkpointing & Roll-back (CR)

Software-Implemented Fault Tolerance (SIFT)

N-Modular Redundancy (NMR)

Temporal and spatial variants possible for many techniques

Most of these FT modes are currently being used at UF


5/19

NASA/Honeywell/UF Project

1st Space Supercomputer

Funded by NASA NMP

In-situ sensor processing

Autonomous control

Speedups of 100 to 1000

First fault-tolerant, parallel, reconfigurable computer for space

Infrastructure for fault-tolerant, high-speed computing in space

Robust system services

Fault-tolerant MPI services

Application services

FPGA services

Standard design framework

Transparent API to resources for earth & space scientists

NASA Dependable Multiprocessor (DM)

Figure: DM block diagram, labeled "Reconfigurable Cluster Computer". Blocks shown: System Controller A (RHPPC), System Controller B, Data Processors #1 through #N (PPC, FPGA), High-Speed Network A, High-Speed Network B, spacecraft interfaces, mission-specific devices, instruments.


6/19

Dependable Multiprocessor

DM System Architecture

Dual system controllers

Redundant radiation-hardened PPC boards

Monitor data processors' health and communicate with spacecraft

Data processing engines

High-performance, low-power COTS SBCs running Linux

PowerPC with AltiVec capabilities

Optional FPGA co-processor for additional performance

Scalable to 20 data processing nodes

Redundant Interconnect

Dual GigE connections

Automatically switch networks when error is detected

DM Middleware (DMM)

FT System Services

Manages status and health of multiple concurrent jobs

FT Embedded MPI (FEMPI)

Lightweight subset of MPI

Allows fault recovery without restarting an entire parallel application

Application & FPGA Services

Commonly used libraries such as ATLAS, FFTW, GSL

Simplified, generic API for FPGA usage through USURP*

High-Availability Middleware

Framework used to enable health monitoring of cluster

* USURP is a standardized interface specification for RC platforms, developed by researchers at UF


7/19

DMM Components

Mission Manager (MM)

Controls high-level job deployment

Facilitates replication of lower-level jobs

Spatial or temporal replication

Automatically compares and validates outputs

Monitors real-time deadlines

Enables roll-forward / roll-back when faults occur

Job Manager (JM)

Controls low-level job deployment and scheduling across system

FT Manager (FTM)

Manages low-level system faults (node crash, job crash)

JM Agent (JMA)

Deploys and monitors programs on given node

Provides application heartbeat to system controller

Mass Data Store (MDS)

Provides reliable centralized data services

Enables reliable checkpointing
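
The replicate, compare, and roll-back behavior listed for the Mission Manager can be sketched in a few lines. This is only an illustration of temporal replication with majority voting, not DMM code: `run_replicated` and `flaky_sum` are hypothetical names, and the "roll-back" here is simply rerunning all replicas from the saved inputs.

```python
from collections import Counter

def run_replicated(job, inputs, replicas=3, max_attempts=2):
    """Run `job` several times on the same inputs and vote on the result.

    Returns the majority output. If no strict majority exists, every
    replica is rerun from the saved inputs (a very simple roll-back).
    """
    for _ in range(max_attempts):
        outputs = [job(inputs) for _ in range(replicas)]
        value, votes = Counter(outputs).most_common(1)[0]
        if votes > replicas // 2:
            return value
    raise RuntimeError("no majority after retries")

# Hypothetical job: a sum whose second invocation is corrupted,
# standing in for a transient fault on one replica.
calls = {"n": 0}

def flaky_sum(xs):
    calls["n"] += 1
    total = sum(xs)
    return total + 1000 if calls["n"] == 2 else total

result = run_replicated(flaky_sum, (1, 2, 3))
print(result)
```

Spatial replication would run the same replicas on different nodes instead of back to back; the voting step is unchanged.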

Figure: DMM software stack. Hardened system (hardened processor): COTS OS and drivers, reliable messaging middleware, JM, FTM, Mission Manager, mission-specific parameters. COTS data processors (COTS processor): COTS OS and drivers, reliable messaging middleware, JMA, ASL, FCL, FEMPI, MPI application process. Also shown: COTS packet-switched network.

Legend: JM = Job Manager; JMA = Job Manager Agent; FTM = Fault Tolerance Manager; FEMPI = Fault-Tolerant Embedded MPI; ASL = Application Services Library; FCL = FPGA Coprocessor Library


8/19

Algorithm-Based Fault Tolerance

Commonly refers to matrix coding method that is preserved through certain linear algebra operations

Matrix and vector multiply

Discrete Fourier Transform

Discrete Wavelet Transform

Matrix decomposition: C = AB (LU, QR, Cholesky)

Matrix inversion

Used to detect errors in these operations, and in certain cases allows for error correction

ABFT algorithms integrate with DM through Application Services API

An improved method of using ABFT on the 2D-FFT and SAR has been researched at UF

Uses Hamming encoding

Low overhead due to ABFT

Important aspects of ABFT currently under investigation at UF

Round-off analysis

Coverage analysis

Code types

Encoding and decoding strategies

Overhead
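
To make the "matrix coding preserved through linear algebra" idea concrete, here is a minimal sketch of the classic checksum scheme for matrix multiply: a column-checksum A times a row-checksum B yields a product whose last row and column are checksums of the result. It is not the Hamming-encoded 2D-FFT method mentioned above, and all function names are illustrative. Integers are used so equality checks are exact; with floating point a tolerance would be needed, which is where round-off analysis comes in.

```python
def matmul(a, b):
    """Plain triple-loop matrix product on lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add_column_checksum(a):
    """Append a row holding the sum of each column."""
    return a + [[sum(col) for col in zip(*a)]]

def add_row_checksum(b):
    """Append a column holding the sum of each row."""
    return [row + [sum(row)] for row in b]

def check_and_correct(c):
    """Verify a full-checksum matrix in place.

    Returns None if consistent, or the (row, col) of a single corrected
    data element. Raises if the error pattern cannot be corrected.
    """
    n, m = len(c) - 1, len(c[0]) - 1
    bad_rows = [i for i in range(n) if sum(c[i][:m]) != c[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c[i][j] for i in range(n)) != c[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        c[i][j] += c[i][m] - sum(c[i][:m])  # restore from the row checksum
        return (i, j)
    raise ValueError("uncorrectable error pattern")

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
cf = matmul(add_column_checksum(a), add_row_checksum(b))
cf[0][1] += 5                      # inject a single error into the result
fixed = check_and_correct(cf)
data = [row[:2] for row in cf[:2]]
print(fixed, data)
```

The checksum row and column add one extra row of A and one extra column of B to the multiply, which is the source of ABFT's comparatively low overhead versus full replication.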

Figure: Computation flow of fault-tolerant 2D-FFT (block shown: fault-tolerant partial transform)

Figure: Experimental overhead of fault-tolerant RDP vs. a fault-intolerant version. X-axis: image size [N x N], 128 to 4096. Y-axis: overhead incurred, 15% to 95%. Series: error free, with error.


9/19

Source Code Transformations

Most science applications are inherently non-fault-tolerant

Requires SIFT framework to improve reliability

Possible to immunize programs against most errors by transforming application source code

Less overhead

More control over FT techniques

Compiler-independent

Integrates with DM system through Application Services API

Custom source-to-source (S2S) transformation tool is currently under development at UF

Accepts C source files as inputs

Generates fault-tolerant versions

Uses fine-grain NMR-type approach to provide improved reliability and dependability

Provides means of control flow checking (CFC) through software

Minimizes number of undetected errors

Transformation options to be supported by the tool

Variable replication

Function replication

Memory duplication / memory checking

Synchronization intervals

Condition evaluation

Post-evaluation verification

Evaluation using replicated variables

Block protection
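
As a rough picture of what variable replication with synchronization intervals means, the sketch below shows a small routine before and after a hand-applied transformation. The UF tool operates on C sources and its actual output is not shown in these slides; this Python version only illustrates the idea, and `vote`, `dot_protected`, and the `fault_at` injection hook are hypothetical.

```python
def vote(copies):
    """Majority vote over three replicas; raises if all three disagree."""
    a, b, c = copies
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise RuntimeError("replicas disagree: uncorrectable")

def dot_original(xs, ys):
    acc = 0
    for x, y in zip(xs, ys):
        acc += x * y
    return acc

def dot_protected(xs, ys, sync_interval=2, fault_at=None):
    """Same computation with the accumulator triplicated.

    `fault_at` is a demo-only hook: (iteration, replica index) at which
    one bit of one replica is flipped to mimic a transient upset.
    """
    acc = [0, 0, 0]
    for i, (x, y) in enumerate(zip(xs, ys)):
        for r in range(3):
            acc[r] += x * y
        if fault_at is not None and fault_at[0] == i:
            acc[fault_at[1]] ^= 1 << 4
        if (i + 1) % sync_interval == 0:
            agreed = vote(acc)          # synchronization point
            acc = [agreed, agreed, agreed]
    return vote(acc)

xs, ys = [1, 2, 3, 4], [5, 6, 7, 8]
clean = dot_original(xs, ys)
recovered = dot_protected(xs, ys, fault_at=(1, 2))
print(clean, recovered)
```

A shorter synchronization interval repairs a corrupted replica sooner at the cost of more voting; a longer one lowers overhead but leaves more time for a second upset to hit another replica.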


10/19

Reconfigurable Fault Tolerance

GOAL

Research how to take advantage of reconfigurable nature of FPGAs to provide dynamically-adaptive fault tolerance in RC systems

Leverage partial reconfiguration (PR) where advantageous

Explore virtual architectures to enable PR and reconfigurable fault tolerance (RFT)

MOTIVATION

Why go with fixed/static FT, when performance & reliability can be tuned as needed?

Environmentally-aware & adaptive computing is wave of future

Achieving power savings and/or performance improvement, without sacrificing reliability

CHALLENGES

Limitations in concepts and tools; open-ended problem requires innovative solutions

Conventional methods typically based upon radiation-hardened components and/or fault masking via chip-level TMR

Highly-custom nature of FPGA architectures in different systems and apps makes defining a common approach to FT difficult

Figure: satellite orbits passing through the Van Allen radiation belt


11/19

Reconfigurable FT

Virtual Architecture for RFT

Novel concept of adaptable component-level protection (ACP)

Common components within VA:

Adaptable protection frame, largely module/design-independent (see VA concept diagram)

Error Status Register (ESR) for system-level error tracking/handling

Re-synchronization controller or interfaces, for state saving and restoration

Configuration controller, two options:

Internal configuration through ICAP

External configuration controller

Benefits of internal protection:

Early error detection and handling = faster recovery

Redundancy can be changed into parallelism

PR can be leveraged to provide uninterrupted operation of non-failed components

Challenges of internal protection:

Impossible to eliminate single points of failure, may still need higher-level (external) detection and handling

Stronger possibility of fault/error going unnoticed

Single-event functional interrupts (SEFI) are major concern
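
The trade between redundancy and parallelism can be put in numbers with a toy model. The sketch assumes a protection frame with four module sockets, which is inferred from the concept diagram rather than stated in the text, and uses the usual copy counts for unprotected, SCP, and TMR operation. `FrameConfig` and `configure` are hypothetical names.

```python
from dataclasses import dataclass

# Module copies needed per protection level: unprotected, self-checking
# pair, triple-modular redundancy.
COPIES = {"single": 1, "SCP": 2, "TMR": 3}

@dataclass(frozen=True)
class FrameConfig:
    protection: str
    parallel_modules: int
    blank_sockets: int

def configure(sockets, protection, modules=None):
    """Fit `modules` distinct modules (default: as many as possible)
    into a frame with `sockets` sockets at the given protection level."""
    copies = COPIES[protection]
    most = sockets // copies
    if modules is None:
        modules = most
    if not 1 <= modules <= most:
        raise ValueError("configuration does not fit in the frame")
    return FrameConfig(protection, modules, sockets - modules * copies)

for protection, modules in [("single", None), ("SCP", None),
                            ("TMR", None), ("SCP", 1)]:
    print(configure(4, protection, modules))
```

Under these assumptions a four-socket frame holds four unprotected modules, two modules as self-checking pairs, or one module under TMR with one socket left blank, which is how the same fabric can be retuned as the radiation environment changes.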

Figure: VA concept diagram. An FPGA with an adaptable component-level protection frame and sockets for modules, shown in four example configurations: 2 parallel, SCP; no parallel, TMR; 4 parallel, single; no parallel, SCP (unused sockets blank).


12/19

Space Applications

Synthetic Aperture Radar (SAR)

Used to form high-resolution images of Earth's surface from moving platform in space

Patch-based processing with significant amount of overlap between patch boundaries

Parallelizable on multiple levels of granularity, possible without need for any inter-processor communication (one patch per node)

2-dimensional data set, can range in size from several hundred Megabytes to Gigabytes

Data set not significantly reduced through course of application

Highly amenable to ABFT; possible application for the Dependable Multiprocessor project
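
Patch-based decomposition with overlap can be sketched as an index calculation along one dimension. The patch length, overlap, and the round-robin node assignment below are illustrative choices, not parameters taken from the UF SAR implementation.

```python
def patch_bounds(total, patch, overlap):
    """Return (start, stop) pairs covering range(total) with patches of
    length `patch` that overlap their neighbors by `overlap` samples.
    The last patch is clipped to the end of the data."""
    if not 0 <= overlap < patch:
        raise ValueError("overlap must be in [0, patch)")
    step = patch - overlap
    bounds = []
    start = 0
    while True:
        stop = min(start + patch, total)
        bounds.append((start, stop))
        if stop == total:
            break
        start += step
    return bounds

bounds = patch_bounds(total=10, patch=4, overlap=2)
# One patch per node; wrap around if there are more patches than nodes.
assignment = {i: i % 3 for i in range(len(bounds))}
print(bounds, assignment)
```

Because each patch carries its own overlap region, a node can process its patch without exchanging data with neighbors, at the price of recomputing the overlapped samples.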


13/19

Space Applications

Hyperspectral Imaging (HSI)

Uses traditional beamforming techniques to perform coarse-grained classification on hyperspectral images

Adjustable to enable real-time processing

Mostly embarrassingly parallel, exception being weight computation (shown in red below)

3-dimensional data set, reduced through course of application

Auto-correlation sample matrix (ACSM) calculation and beamforming (detection) amenable to ABFT

Suggest NMR for weight computation (weight)

Parallel and multi-FPGA decompositions explored


14/19

Space Applications

Cosmic Ray Elimination

Uses image processing techniques to remove artifacts caused by cosmic rays

Image shows pre- and post-processed versions of a Hubble Telescope observation

Images are highly parallelizable, with minimal communication necessary

Main computation: median filtering

Fault-tolerant median filter developed

Other portions of algorithm replicated by hand or S2S translator
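
The slides do not say how the fault-tolerant median filter works, so the sketch below shows just one possible approach, on a 1D signal for brevity: each median is verified with a cheap counting check and recomputed if the check fails. `is_valid_median` and the `compute` hook (used here to inject a faulty result) are assumptions for illustration.

```python
def median_of(window):
    """Middle element of the sorted window (upper middle if even length)."""
    return sorted(window)[len(window) // 2]

def is_valid_median(window, m):
    """Check m without sorting: enough elements must lie on each side.

    For k = len(window) // 2, the value at sorted position k has at least
    k + 1 elements <= it and at least len(window) - k elements >= it.
    """
    k = len(window) // 2
    return (sum(v <= m for v in window) >= k + 1
            and sum(v >= m for v in window) >= len(window) - k)

def median_filter(signal, radius=1, compute=median_of):
    """Filter with windows clipped at the edges; recompute on a bad check."""
    out = []
    for i in range(len(signal)):
        window = signal[max(0, i - radius): i + radius + 1]
        m = compute(window)
        if not is_valid_median(window, m):
            m = median_of(window)   # detected error: redo this pixel only
        out.append(m)
    return out

signal = [10, 11, 255, 12, 11]          # 255 mimics a cosmic-ray hit
clean = median_filter(signal)
# Using max() as the "median" stands in for a faulty computation.
recovered = median_filter(signal, compute=max)
print(clean, recovered)
```

The check costs two linear passes per window and repairs only the affected output, which is much cheaper than running the whole filter twice and comparing.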

Other aerospace-related application kernels

Space-Time Adaptive Processing (STAP)

Ground Moving Target Indicator (GMTI)

Airborne LIDAR

Digital Down Conversion (DDC)

PDF Estimation


15/19

Novel Computing Platforms

Fixed multi-core (FMC) devices

Cell: Heterogeneous, vector compute engine, 3.2 GHz clock rate, ~70 W max. power consumption

GPU: Homogeneous, many (e.g. 100+) stream processors, ~1.5 GHz clock rate, ~120 W max. power consumption

Reconfigurable multi-core (RMC) devices

Field-Programmable Object Array (FPOA): Heterogeneous, coarse-grained processing elements, 1 GHz clock rate, ~35 W max. power consumption

Field-Programmable Gate Array (FPGA): Heterogeneous, fine-grained processing elements, max. clock rate ~500 MHz, achievable clock rate varies, ~30 W max. power consumption

Tilera: Homogeneous, coarse-grained processing elements (64 32-bit MIPS-like processors on-chip), ~750 MHz clock rate, ~30 W max. power consumption

Element CXi: Heterogeneous, coarse-grained processing elements, 200 MHz clock rate, ~1 W max. power consumption

Cell processor block diagram - http://www.research.ibm.com/journal/rd/494/kahle.html

FPOA architecture - http://www.mathstar.com/Architecture.php


16/19

RC: Vital Technology for Space

HPEC devices featured here; similar results vs. 65nm Xeon, 90nm GPU, etc. (see RSSI08).

Results excerpted from pending presentation from CHREC-UF site for HPEC08 Workshop.

Versatility in space missions (adapts as needs demand)

Fixed archs. burdened with fixed choices, limited tradeoffs

Performance in space missions (speed, power, size, etc.)

e.g. Computational density per Watt (CDW) device metric

FPGAs far exceed FMC devices (CPU, Cell, GPU, etc.)

Parallel Operations: scales up to max. # of adds and mults possible (# of adds = # of mults)

Achievable Frequency: lowest frequency after PAR of DSP & logic-only impls. of add & mult comp. cores [FPGA]

Power: scales linearly with resource util.; max. power reduced by ratio of achievable freq. to max. freq. [FPGA]
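
Read together, the three notes suggest how a CDW figure might be computed: operations issued in parallel times achievable clock rate, divided by estimated power. The sketch below assumes exactly that reading. The operation counts and clock rates in the example are placeholders chosen for arithmetic convenience, not measurements of any device.

```python
def computational_density(parallel_ops, freq_hz):
    """Operations per second: ops issued per cycle times clock rate."""
    return parallel_ops * freq_hz

def fpga_power(max_power_w, utilization, achievable_hz, max_hz):
    """Power estimate per the notes: linear in resource utilization,
    with max. power scaled by achievable / max. frequency."""
    return max_power_w * utilization * (achievable_hz / max_hz)

def cdw(parallel_ops, freq_hz, power_w):
    """Computational density per Watt, in ops/s/W."""
    return computational_density(parallel_ops, freq_hz) / power_w

# Placeholder inputs only: a hypothetical fully utilized reconfigurable
# device and a hypothetical fixed device.
rc_power = fpga_power(max_power_w=30.0, utilization=1.0,
                      achievable_hz=250e6, max_hz=500e6)
rc_cdw = cdw(parallel_ops=200, freq_hz=250e6, power_w=rc_power)
fixed_cdw = cdw(parallel_ops=8, freq_hz=3.2e9, power_w=70.0)
print(rc_power, rc_cdw, fixed_cdw)
```

The point of the metric is visible even with made-up inputs: a device that runs many operations per cycle at a modest clock can come out ahead per Watt of one that runs few operations at a high clock.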



17/19

RapidIO

High-speed embedded system interconnect, replacement for bus-based backplanes

Parallel and serial variants, serial is wave of future

Multiple programming models

Research with RapidIO at UF

Simulative research studying capability of RapidIO-based computing platforms to support space-based radar (SBR) processing

Custom testbed designed and built, for verification of simulation models & experimentation with RapidIO & FPGAs

Figure: SDRAM utilization (%) vs. time (ms), 0 to 2048 ms. Configuration: 256 pulses, 6 beams, 1 engine per task per FPGA, 64k ranges.

Figure: Experimental logic analyzer measurements

Figure: Visualization of simulated GMTI application progress (trace files)


18/19

Conclusions

Fault tolerance for space should be more than RadHard components & spatial TMR designs

Fixed worst-case designs extremely limited in perf/Watt

Instead, many FT methods & modes can be exploited

Adaptive systems that react to environmental changes

COTS featured inside critical performance path

RadHard for FT management, outside critical perf. path

UF active on many space-related FT issues

NASA Dependable Multiprocessor, CHREC RFT F4-08

Modes: SIFT, ABFT, S2S, RFT, FEMPI, CR, CED, etc.

Devices: PPC/AV, FPGA, FPOA, Tilera, ElementCXi, etc.

Space apps: HSI, SAR, LIDAR, GMTI, CRE, et al.


19/19

2009 IEEE Aerospace Conference

Track 7.12: Dependable Software for High Performance Embedded Computing Platforms

Transient error detection and recovery techniques

Compiler-based fault-tolerant techniques

Algorithm-based fault-tolerant techniques

Tools and techniques for designing reliable software

SIFT management frameworks

Software dependability analysis

Adaptive fault-tolerant techniques

FT applications

Track Chairs

Richard Linderman

Grzegorz Cieslewski

Dates

Abstract Submissions Due: July 1st, 2008

Paper Submissions Due: November 2nd, 2008
