Cieslewski UF FTWorkshop Final2


  • 8/3/2019 Cieslewski UF FTWorkshop Final2

    1/19

    Advanced Space Computing with System-Level Fault Tolerance

    Grzegorz Cieslewski, Adam Jacobs, Chris Conger, Alan D. George

    ECE Dept., University of Florida

    NSF CHREC Center


    Outline

    Overview

    NASA Dependable Multiprocessor

    Reconfigurable Fault Tolerance (RFT)

    Space Applications

    Novel Computing Platforms

    RapidIO

    Conclusions


    Overview

    What is advanced space computing?
      New concepts, methods, and technologies to enable and deploy high-performance computing in space for an increasing variety of missions and applications

    Why is advanced space computing vital?
      On-board data processing
        Downlink bandwidth to Earth is extremely limited
        Sensor data rates, resolutions, and modes are dramatically increasing
        Remote data processing from Earth is no longer viable
        Must process sensor data where it is captured, then downlink results
      On-board autonomous processing & control
        Remote control from Earth is often not viable
        Propagation delays and bandwidth limits are insurmountable
        Space vehicles and space-delivered vehicles require autonomy
        Autonomy requires high-speed computing for decision-making

    Why is it difficult to achieve?
      Cannot simply strap a Cray to a rocket!
      Hazardous radiation environment in space
      Platforms with limited power, weight, size, cooling, etc.
      Traditional space processing technologies (RadHard) are severely limited
      Potential for long mission times with diverse set of needs
      Need powerful yet adaptive technologies
      Must ensure high levels of reliability and availability


    Taxonomy of Fault Tolerance

    First, let us define various possible modes/methods of providing fault tolerance (FT)
      Many other options beyond simply throwing triple-modular redundancy (TMR) at the problem
      Software FT vs. hardware FT concepts largely similar, differences only at implementation level
      Radiation-hardening not listed, falls under prevention as opposed to detection or correction

    [Figure: taxonomy of FT techniques, arranged by whether they detect faults or correct/mask them]

    Techniques shown:
      FT-HLL   Fault-Tolerant HLL (e.g. MPI)
      CED      Concurrent Error Detection
      SCP      Self-Checking Pairs
      ABFT     Algorithm-Based Fault Tolerance
      ECC      Error Correction Codes
      NVP      N-Version Programming
      BR       Byzantine Resilience
      CR       Checkpointing & Roll-back
      SIFT     Software-Implemented Fault Tolerance
      NMR      N-Modular Redundancy

    Temporal and spatial variants possible for many techniques

    Most of these FT modes are currently being used at UF
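
To make the masking idea behind NMR concrete, here is a minimal Python sketch of a majority voter. It is an illustration only and is not taken from the slides: the names `nmr_vote` and `run_nmr` are hypothetical, replica results are assumed to be hashable, and the replicas run one after another (a temporal variant) rather than on separate hardware.

```python
from collections import Counter


def nmr_vote(results):
    """Return the value held by a strict majority of replica results.

    Raises ValueError when no value is held by more than half of the
    replicas: the fault is detected but cannot be masked.
    """
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise ValueError("no majority among replicas")
    return value


def run_nmr(func, args, n=3):
    """Run `func` n times on the same arguments and vote on the results.

    Spatial NMR would place the replicas on separate hardware; this
    sketch simply repeats the call (temporal redundancy).
    """
    return nmr_vote([func(*args) for _ in range(n)])


if __name__ == "__main__":
    # One faulty replica out of three is outvoted.
    print(nmr_vote([42, 42, 7]))
```

With n = 3 this is TMR: a single faulty replica is masked, while two disagreeing faulty replicas are only detected.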


    NASA/Honeywell/UF Project

    1st Space Supercomputer
      Funded by NASA NMP
      In-situ sensor processing
      Autonomous control
      Speedups of 100 to 1000
      First fault-tolerant, parallel, reconfigurable computer for space

    Infrastructure for fault-tolerant, high-speed computing in space
      Robust system services
      Fault-tolerant MPI services
      Application services
      FPGA services
      Standard design framework
      Transparent API to resources for earth & space scientists

    NASA Dependable Multiprocessor (DM)

    [Figure: DM block diagram, labeled "Reconfigurable Cluster Computer". Redundant System Controllers A and B (RHPPC) and Data Processors #1 ... #N (PPC, FPGA) are connected by High-Speed Network A and High-Speed Network B, together with a mission-specific spacecraft interface, mission-specific devices, and instruments.]


    Dependable Multiprocessor

    DM System Architecture
      Dual system controllers
        Redundant radiation-hardened PPC boards
        Monitor data processors' health and communicate with spacecraft
      Data processing engines
        High-performance, low-power COTS SBCs running Linux
        PowerPC with AltiVec capabilities
        Optional FPGA co-processor for additional performance
        Scalable to 20 data processing nodes
      Redundant interconnect
        Dual GigE connections
        Automatically switch networks when error is detected

    DM Middleware (DMM)
      FT System Services
        Manages status and health of multiple concurrent jobs
      FT Embedded MPI (FEMPI)
        Lightweight subset of MPI
        Allows fault recovery without restarting an entire parallel application
      Application & FPGA Services
        Commonly used libraries such as ATLAS, FFTW, GSL
        Simplified, generic API for FPGA usage through USURP*
      High-Availability Middleware
        Framework used to enable health monitoring of cluster

    * USURP is a standardized interface specification for RC platforms, developed by researchers at UF


    DMM Components

    Mission Manager (MM)
      Controls high-level job deployment
      Facilitates replication of lower-level jobs
        Spatial or temporal replication
      Automatically compares and validates outputs
      Monitors real-time deadlines
      Enables roll-forward / roll-back when faults occur

    Job Manager (JM)
      Controls low-level job deployment and scheduling across system

    FT Manager (FTM)
      Manages low-level system faults (node crash, job crash)

    JM Agent (JMA)
      Deploys and monitors programs on given node
      Provides application heartbeat to system controller

    Mass Data Store (MDS)
      Provides reliable centralized data services
      Enables reliable checkpointing

    [Figure: DMM software stack. A hardened system (hardened processor) and COTS data processors (COTS processors) are linked by a COTS packet-switched network; each side runs a COTS OS and drivers plus reliable messaging middleware. Other blocks shown: JM, FTM, JMA, ASL, FCL, FEMPI, MPI Application Process, Mission Manager, Mission-Specific Parameters.]

    Legend:
      JM     Job Manager
      JMA    Job Manager Agent
      FTM    Fault Tolerance Manager
      FEMPI  Fault-Tolerant Embedded MPI
      ASL    Application Services Library
      FCL    FPGA Coprocessor Library
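
The replicate, compare, and roll-back behavior attributed to the Mission Manager can be sketched in a few lines of Python. This is an assumption-laden illustration, not the DMM implementation: `run_replicated` and `make_flaky_sum` are hypothetical names, replication is temporal (the same job is simply called again), and roll-back means discarding the outputs and restarting from the job's input.

```python
def run_replicated(job, data, replicas=2, max_retries=3):
    """Run `job` `replicas` times on the same input.

    The output is accepted only if all replicas agree; otherwise the
    outputs are discarded and the job is rerun from its input.
    Returns (output, attempt_number).
    """
    for attempt in range(1, max_retries + 1):
        outputs = [job(data) for _ in range(replicas)]
        if all(out == outputs[0] for out in outputs[1:]):
            return outputs[0], attempt
    raise RuntimeError("replicas disagreed on every attempt")


def make_flaky_sum(faulty_calls):
    """Hypothetical job for testing: sums its input, but corrupts the
    result on the listed call numbers to imitate a transient fault."""
    calls = {"n": 0}

    def job(data):
        calls["n"] += 1
        total = sum(data)
        return total + 1 if calls["n"] in faulty_calls else total

    return job


if __name__ == "__main__":
    job = make_flaky_sum({2})          # second call returns a wrong sum
    print(run_replicated(job, [1, 2, 3]))
```

Two replicas only detect a mismatch (a self-checking pair); agreement after a rerun is what provides the recovery here.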


    Algorithm-Based Fault Tolerance

    Commonly refers to matrix coding method that is preserved through certain linear algebra operations
      Matrix and vector multiply
      Discrete Fourier Transform
      Discrete Wavelet Transform
      Matrix decomposition: C = AB (LU, QR, Cholesky)
      Matrix inversion

    Used to detect errors in these operations, and in certain cases allows for error correction

    ABFT algorithms integrate with DM through Application Services API

    An improved method of using ABFT on the 2D-FFT and SAR has been researched at UF
      Uses Hamming encoding
      Low overhead due to ABFT

    Important aspects of ABFT currently under investigation at UF
      Round-off analysis
      Coverage analysis
      Code types
      Encoding and decoding strategies
      Overhead

    [Figure: Computation flow of fault-tolerant 2D-FFT, built from fault-tolerant partial transforms]

    [Figure: Experimental overhead of fault-tolerant RDP vs. a fault-intolerant version. X-axis: image size [N x N], 128 to 4096. Y-axis: overhead incurred, 15% to 95%. Series: error free, with error.]
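
As a concrete instance of a matrix code that is preserved through a linear algebra operation, the sketch below shows the classic checksum scheme for matrix multiplication: a column-checksum row is appended to A, a row-checksum column to B, and their product carries both checksums. This is a generic textbook-style illustration in Python, not the Hamming-encoded 2D-FFT method mentioned on the slide. The function names are hypothetical, and integer data is assumed so that round-off (listed above as an open issue) does not need a tolerance.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def add_column_checksum(a):
    """Return a copy of `a` with an extra row holding each column's sum."""
    return a + [[sum(col) for col in zip(*a)]]


def add_row_checksum(b):
    """Return a copy of `b` with each row extended by that row's sum."""
    return [row + [sum(row)] for row in b]


def check_and_correct(cf):
    """Verify a full-checksum matrix in place.

    Returns None if all checksums hold, or the (row, col) of a single
    data element that was corrected. Raises ValueError for any other
    error pattern, which this sketch only detects.
    """
    n, m = len(cf) - 1, len(cf[0]) - 1
    bad_rows = [i for i in range(n) if sum(cf[i][:m]) != cf[i][m]]
    bad_cols = [j for j in range(m)
                if sum(cf[i][j] for i in range(n)) != cf[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        # Rebuild the element from its row checksum and the other entries.
        cf[i][j] = cf[i][m] - (sum(cf[i][:m]) - cf[i][j])
        return (i, j)
    raise ValueError("error pattern not correctable by this sketch")


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    cf = matmul(add_column_checksum(a), add_row_checksum(b))
    cf[1][0] = 99                      # inject a single corrupted element
    print(check_and_correct(cf), cf)
```

The intersection of the one failing row check and the one failing column check locates the bad element, which is why a single error can be corrected while more complex patterns are only detected.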


    Source Code Transformations

    Most science applications are inherently non-fault-tolerant
      Requires SIFT framework to improve reliability

    Possible to immunize programs against most errors by transforming application source code
      Less overhead
      More control over FT techniques
      Compiler-independent
      Integrates with DM system through Application Services API

    Custom source-to-source (S2S) transformation tool is currently under development at UF
      Accepts C source files as inputs
      Generates fault-tolerant versions

    Uses fine-grain NMR-type approach to provide improved reliability and dependability
      Provides means of control flow checking (CFC) through software
      Minimizes number of undetected errors

    Transformation options to be supported by the tool
      Variable replication
      Function replication
      Memory duplication / memory checking
      Synchronization intervals
      Condition evaluation
        Post-evaluation verification
        Evaluation using replicated variables
      Block protection
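
The effect of variable replication with synchronization intervals can be illustrated by comparing a function with a hand-transformed version of it. The sketch is in Python for brevity, although the tool described above takes C sources, and it is not output of that tool: `dot_replicated`, `sync_interval`, and the `inject_at` test hook are all hypothetical.

```python
class FaultDetected(Exception):
    """Raised when replicated copies of a variable disagree."""


def dot_original(xs, ys):
    """Untransformed code: a single accumulator, no checking."""
    acc = 0
    for x, y in zip(xs, ys):
        acc += x * y
    return acc


def dot_replicated(xs, ys, sync_interval=4, inject_at=None):
    """Hand-written stand-in for transformed code.

    The accumulator is duplicated and the two copies are compared every
    `sync_interval` iterations and once more before the value is used.
    `inject_at` flips one bit in one copy at that iteration to imitate
    an upset (for testing only).
    """
    acc0 = acc1 = 0
    for step, (x, y) in enumerate(zip(xs, ys), start=1):
        acc0 += x * y
        acc1 += x * y
        if step == inject_at:
            acc1 ^= 1 << 3             # simulated single-bit upset
        if step % sync_interval == 0 and acc0 != acc1:
            raise FaultDetected(f"mismatch at step {step}")
    if acc0 != acc1:                   # final check before the value is used
        raise FaultDetected("mismatch at exit")
    return acc0


if __name__ == "__main__":
    print(dot_replicated([1, 2, 3], [4, 5, 6]))
```

A longer synchronization interval lowers the checking overhead but delays detection; two copies detect a mismatch, and a third would be needed to vote and mask it.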


    Reconfigurable Fault Tolerance

    GOAL
      Research how to take advantage of reconfigurable nature of FPGAs, to provide dynamically-adaptive fault tolerance in RC systems
      Leverage partial reconfiguration (PR) where advantageous
      Explore virtual architectures to enable PR and reconfigurable fault tolerance (RFT)

    MOTIVATION
      Why go with fixed/static FT, when performance & reliability can be tuned as needed?
      Environmentally-aware & adaptive computing is wave of future
      Achieving power savings and/or performance improvement, without sacrificing reliability

    CHALLENGES
      Limitations in concepts and tools, open-ended problem requires innovative solutions
      Conventional methods typically based upon radiation-hardened components and/or fault masking via chip-level TMR
      Highly-custom nature of FPGA architectures in different systems and apps makes defining a common approach to FT difficult

    [Figure: Satellite orbits, passing through the Van Allen radiation belt]


    Reconfigurable FT

    Virtual Architecture for RFT
      Novel concept of adaptable component-level protection (ACP)

    Common components within VA:
      Adaptable protection frame, largely module/design-independent (see VA concept diagram)
      Error Status Register (ESR) for system-level error tracking/handling
      Re-synchronization controller or interfaces, for state saving and restoration
      Configuration controller, two options:
        Internal configuration through ICAP
        External configuration controller

    Benefits of internal protection:
      Early error detection and handling = faster recovery
      Redundancy can be changed into parallelism
      PR can be leveraged to provide uninterrupted operation of non-failed components

    Challenges of internal protection:
      Impossible to eliminate single points of failure, may still need higher-level (external) detection and handling
      Stronger possibility of fault/error going unnoticed
      Single-event functional interrupts (SEFI) are major concern

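    [Figure: VA concept diagram. An FPGA with adaptable component-level protection around sockets for modules. Example socket configurations: no parallel, TMR (module A); no parallel, SCP (unused sockets blank); 2 parallel, SCP (modules A and B); 4 parallel, single (modules A, B, C, D).]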
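
The trade between redundancy and parallelism can be sketched as a mapping from a protection mode to a socket layout. This Python sketch rests on assumptions read off the diagram labels rather than stated in the slides: four module sockets, one copy per module in single mode, two in SCP, three in TMR. The function name `assign_sockets` is hypothetical.

```python
def assign_sockets(mode, modules, sockets=4):
    """Map modules onto sockets for a given protection mode.

    Returns a list of length `sockets`; None marks a blank socket.
    Copy counts per mode are an assumption of this sketch.
    """
    copies = {"single": 1, "SCP": 2, "TMR": 3}[mode]
    needed = copies * len(modules)
    if needed > sockets:
        raise ValueError(
            f"{mode} with {len(modules)} module(s) needs {needed} sockets")
    layout = [m for m in modules for _ in range(copies)]
    return layout + [None] * (sockets - needed)


if __name__ == "__main__":
    for mode, mods in [("TMR", ["A"]), ("SCP", ["A", "B"]),
                       ("single", ["A", "B", "C", "D"]), ("SCP", ["A"])]:
        print(mode, assign_sockets(mode, mods))
```

Moving from TMR to single mode frees sockets for more distinct modules, which is the sense in which redundancy is converted into parallelism when the environment allows it.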


    Space Applications

    Synthetic Aperture Radar (SAR)
      Used to form high-resolution images of Earth's surface from moving platform in space
      Patch-based processing with significant amount of overlap between patch boundaries
      Parallelizable on multiple levels of granularity, possible without need for any inter-processor communication (one patch per node)
      2-dimensional data set, can range in size from several hundred megabytes to gigabytes
      Data set not significantly reduced through course of application
      Highly amenable to ABFT; possible application for the Dependable Multiprocessor project
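
Patch-based decomposition with overlap can be sketched as an index calculation. The Python below is a generic illustration, not the SAR code used in the project: `overlapping_patches` is a hypothetical helper that splits one dimension of the data set into fixed-size patches sharing a given number of samples, so that each patch can be handed to a separate node with no inter-processor communication.

```python
def overlapping_patches(total, patch, overlap):
    """Return (start, stop) index pairs covering range(total).

    Consecutive patches share `overlap` samples; the last patch is
    shifted back so that it ends exactly at `total`.
    """
    if not 0 <= overlap < patch:
        raise ValueError("overlap must be in [0, patch)")
    if patch >= total:
        return [(0, total)]
    step = patch - overlap
    starts = list(range(0, total - patch, step)) + [total - patch]
    return [(s, s + patch) for s in starts]


if __name__ == "__main__":
    # One patch per node: node index -> slice of the data set.
    assignment = dict(enumerate(overlapping_patches(10, 4, 2)))
    print(assignment)
```

The overlap is recomputed on each node, trading extra computation for the absence of communication between nodes.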


    Space Applications

    Hyperspectral Imaging (HSI)
      Uses traditional beamforming techniques to perform coarse-grained classification on hyperspectral images
      Adjustable to enable real-time processing
      Mostly embarrassingly parallel, exception being weight computation (shown in red in the original figure)
      3-dimensional data set, reduced through course of application
      Auto-correlation sample matrix (ACSM) calculation and beamforming (detection) amenable to ABFT
      Suggest NMR for weight computation (weight)
      Parallel and multi-FPGA decompositions explored


    Space Applications

    Cosmic Ray Elimination
      Uses image processing techniques to remove artifacts caused by cosmic rays
      Image shows pre- and post-processed versions of a Hubble Telescope observation
      Images are highly parallelizable, with minimal communication necessary
      Main computation: median filtering
      Fault-tolerant median filter developed
      Other portions of algorithm replicated by hand or S2S translator

    Other aerospace-related application kernels
      Space-Time Adaptive Processing (STAP)
      Ground Moving Target Indicator (GMTI)
      Airborne LIDAR
      Digital Down Conversion (DDC)
      PDF Estimation
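
For readers unfamiliar with the main computation in cosmic ray elimination, here is a minimal 3x3 median filter in Python. It is a plain filter, not the fault-tolerant version mentioned on the slide (whose design is not described here), and the edge handling (reusing the nearest valid row or column) is an assumption of this sketch.

```python
def clamp(value, upper):
    """Limit an index to the range [0, upper]."""
    return max(0, min(value, upper))


def median_filter_3x3(image):
    """Return a new image where each pixel is the median of its 3x3
    neighbourhood. `image` is a non-empty list of equal-length rows."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [image[clamp(r + dr, rows - 1)][clamp(c + dc, cols - 1)]
                      for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
            out_row.append(sorted(window)[4])   # middle of 9 values
        out.append(out_row)
    return out


if __name__ == "__main__":
    frame = [[10, 10, 10],
             [10, 255, 10],      # isolated bright pixel, e.g. a ray hit
             [10, 10, 10]]
    print(median_filter_3x3(frame))
```

Each output pixel depends only on a small neighbourhood of the input, which is why the images can be split across nodes with minimal communication.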


    Novel Computing Platforms

    Fixed multi-core (FMC) devices
      Cell
        Heterogeneous, vector compute engine, 3.2 GHz clock rate, ~70 W max. power consumption
      GPU
        Homogeneous, many (e.g. 100+) stream processors, ~1.5 GHz clock rate, ~120 W max. power consumption

    Reconfigurable multi-core (RMC) devices
      Field-Programmable Object Array (FPOA)
        Heterogeneous, coarse-grained processing elements, 1 GHz clock rate, ~35 W max. power consumption
      Field-Programmable Gate Array (FPGA)
        Heterogeneous, fine-grained processing elements, max. clock rate ~500 MHz, achievable clock rate varies, ~30 W max. power consumption
      Tilera
        Homogeneous, coarse-grained processing elements (64 32-bit MIPS-like processors on-chip), ~750 MHz clock rate, ~30 W max. power consumption
      Element CXi
        Heterogeneous, coarse-grained processing elements, 200 MHz clock rate, ~1 W max. power consumption

    Cell processor block diagram - http://www.research.ibm.com/journal/rd/494/kahle.html

    FPOA architecture - http://www.mathstar.com/Architecture.php


    RC: Vital Technology for Space

    Versatility in space missions (adapts as needs demand)
      Fixed archs. burdened with fixed choices, limited tradeoffs

    Performance in space missions (speed, power, size, etc.)
      e.g. Computational density per Watt (CDW) device metric
      FPGAs far exceed FMC devices (CPU, Cell, GPU, etc.)

    HPEC devices featured here; similar results vs. 65nm Xeon, 90nm GPU, etc. (see RSSI08).

    Results excerpted from pending presentation from CHREC-UF site for HPEC08 Workshop.

    Parallel Operations: scales up to max. # of adds and mults (# of adds = # of mults) possible

    Achievable Frequency: lowest frequency after PAR of DSP & logic-only impls. of add & mult comp. cores [FPGA]

    Power: scales linearly with resource util.; max. power reduced by ratio of achievable freq. to max. freq. [FPGA]
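
The three notes above suggest how a CDW figure could be assembled. The Python sketch below assumes that computational density is the number of parallel operations times the achievable frequency, and that CDW divides this by power. That formula is an assumption of this sketch, as are the function names and the example numbers, none of which are measurements from the presentation.

```python
def fpga_power(max_power_w, achievable_mhz, max_mhz, utilization=1.0):
    """Power estimate following the notes above: linear in resource
    utilization, with max. power reduced by achievable/max frequency."""
    return max_power_w * (achievable_mhz / max_mhz) * utilization


def cdw(parallel_ops, achievable_mhz, power_w):
    """Computational density per watt, in operations per second per watt.

    Assumes density = parallel operations x achievable frequency.
    """
    return parallel_ops * achievable_mhz * 1e6 / power_w


if __name__ == "__main__":
    # Made-up device: 100 parallel ops, 250 MHz achievable of 500 MHz max.,
    # 30 W max. power, fully utilized.
    power = fpga_power(30.0, 250.0, 500.0)
    print(power, cdw(100, 250.0, power))
```

Under these assumptions a device that clocks lower but draws proportionally less power keeps the same CDW, so the metric rewards parallel operations per watt rather than raw clock rate.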


    RapidIO

    High-speed embedded system interconnect, replacement for bus-based backplanes
      Parallel and serial variants, serial is wave of future
      Multiple programming models

    Research with RapidIO at UF
      Simulative research studying capability of RapidIO-based computing platforms to support space-based radar (SBR) processing
      Custom testbed designed and built, for verification of simulation models & experimentation with RapidIO & FPGAs

    [Figure: plot titled "256 Pulses, 6 Beams, 1 Engine per Task per FPGA: 64k Ranges". X-axis: time (ms), 0 to 2048. Y-axis: SDRAM utilization (%), 0 to 100. Other panel labels: experimental logic analyzer measurements; visualization of simulated GMTI application progress; trace files.]


    Conclusions

    Fault tolerance for space should be more than RadHard components & spatial TMR designs
      Fixed worst-case designs extremely limited in perf/Watt

    Instead, many FT methods & modes can be exploited
      Adaptive systems that react to environmental changes
      COTS featured inside critical performance path
      RadHard for FT management, outside critical perf. path

    UF active on many space-related FT issues
      NASA Dependable Multiprocessor, CHREC RFT F4-08
      Modes: SIFT, ABFT, S2S, RFT, FEMPI, CR, CED, etc.
      Devices: PPC/AV, FPGA, FPOA, Tilera, ElementCXi, etc.
      Space apps: HSI, SAR, LIDAR, GMTI, CRE, et al.


    2009 IEEE Aerospace Conference

    Track 7.12 Dependable Software for High Performance Embedded Computing Platforms
      Transient error detection and recovery techniques
      Compiler-based fault-tolerant techniques
      Algorithm-based fault-tolerant techniques
      Tools and techniques for designing reliable software
      SIFT management frameworks
      Software dependability analysis
      Adaptive fault-tolerant techniques
      FT applications

    Track Chairs
      Richard Linderman [email protected]
      Grzegorz Cieslewski [email protected]

    Dates
      Abstract Submissions Due: July 1st, 2008
      Paper Submissions Due: November 2nd, 2008