06 December 2007
FPGA Self-Repair using an Organic Embedded System Architecture
Kening Zhang, Jaafar Alghazo and Ronald F. DeMara
University of Central Florida



Reconfigurable Hardware with Self-Healing: based on SRAM FPGA platform

Organic Computing (OC): biologically-inspired computing with “self-x” properties

System Property
• Composed of a large collection of autonomous systems
• Communication networks among autonomous systems
• Each autonomous system owns its sensors and actuators

Self-x Characteristics
• Self-organization
• Self-configuration
• Self-optimization
• Self-healing
• Self-protection
• Self-explaining
• Context-awareness
• Self-synchronization

Technical Objective:

OC Approach: addresses system controllability with increasing complexity

Example Relevance: How to achieve a sustainable presence in NASA’s Moon, Mars & Beyond objective?

Reliability, Availability, Sustainability: support long-lifetime missions with multiple failure occurrences

Research Focus:

Sponsors:
• NASA: FPGA platform and Genetic Algorithm research
• DARPA: OC approach and SOAR Longevity Platform


Goal: Autonomous FPGA Refurbishment

Redundancy vs. Refurbishment

• Overhead from Unutilized Spares (weight, size, power)
  Redundancy: increases with amount of spare capacity
  Refurbishment: only weakly related to recovery capacity

• Granularity of Fault Coverage (resolution where fault handled)
  Redundancy: restricted at design-time
  Refurbishment: variable at recovery-time

• Fault-Resolution Latency (availability via downtime required to handle fault)
  Redundancy: based on time required to select spare resource
  Refurbishment: based on time required to find suitable recovery

• Quality of Repair (likelihood and completeness)
  Redundancy: determined by adequacy of spares available (?)
  Refurbishment: affected by multiple characteristics (+ or -)

• Autonomous Operation (fix without outside intervention)
  Redundancy: yes
  Refurbishment: yes

increase availability without carrying pre-configured spares …


Fault-Handling Techniques for SRAM-based FPGAs

[Figure: taxonomy of device-failure characteristics and fault-handling methods]

Device Failure Characteristics
• Duration: Transient (SEU); Permanent (SEL, oxide breakdown, electromigration, LPD)
• Target: device configuration; processing datapath
• Stages compared for each method: Detection, Isolation, Diagnosis, Recovery

Methods
• Scrubbing
• TMR
• BIST (STARS)
• CED
• Evolutionary: Vigander; population-based GA using extrinsic fitness evaluation; evolutionary algorithm using intrinsic fitness evaluation
• OC: Autonomous Element (AE), Autonomous Supervisor (AS)

Entries recoverable from the figure: bitwise comparison; reload bitstream / invert bit value; ignore discrepancy; majority vote; supplementary testbench; Cartesian intersection; worst-case clock period dilation; replicate in spare resource; duplex output comparison; duplex/triplex output comparison; fast run-time location; select spare resource; (not addressed); unnecessary


Autonomous System-on-a-Chip (ASoC) Architecture

Dual-layer ASoC proposed by Lipsa et al. [Lipsa 05]
• Functional Layer
  • Functional Elements (FEs), e.g. CPU, RAM, network interface
• Autonomic Layer
  • Autonomic Elements (AEs)
    • Monitor
    • Actuator
    • Communication interface
  • Autonomic Supervisor (AS)

UCF Approach for fault coverage of Functional Layer & Autonomic Layer
• achieved by assessing consensus among elements
  1. first to realize failure detection
  2. consensus provides an organic method for fitness evaluation of competing alternatives during evolution, providing a self-regulating approach to fault resolution


EHW Environments

• Evolvable Hardware (EHW) Environments enable experimental methods to research soft computing intelligent search techniques

• EHW operates by repetitive reprogramming of real-world physical devices using an iterative refinement process:

Two modes of Evolvable Hardware:

• Extrinsic Evolution: Genetic Algorithm with simulation in the loop (software model); when done, build it. Device “design-time” refinement.
• Intrinsic Evolution: Genetic Algorithm with hardware in the loop. Device “run-time” refinement; new approach to Autonomous Repair of failed devices.

Application: Deep Space Satellite
• >100 FPGAs onboard
• hostile environment: radiation, thermal stress
• How to achieve reliability to avoid mission failure?


Genetic Algorithms (GAs)

Mechanism coarsely modeled after neo-Darwinism (natural selection + genetics)

[Figure: GA cycle] start; population of candidate solutions; evaluate fitness of individuals using the fitness function; stop when goal reached; otherwise selection of parents; crossover and mutation of parents produce offspring; replacement into the population.


Genetic Mechanisms

• Guided trial-and-error search techniques using principles of Darwinian evolution
  • iterative selection, “survival of the fittest”
  • genetic operators: mutation, crossover, …
  • implementor must define fitness function

• GAs frequently use strings of 1s and 0s to represent candidate solutions (genotype chromosomes)
  • GA operation: if 100101 is better than 010001, it will have more chance to breed and influence the future population

• Genotype changes during evolution must adhere to the Xilinx-defined bitstream format

• To prevent undesirable conditions that may damage the FPGA, such as a mutation that ties two logic outputs together, a logical genotype is used for evolution and mapped to a physical phenotype
  • Logic # = functional logic index number for LUT
  • Row/Column = physical location of LUT in FPGA

• Can invoke Elitism Operator (E=1, E=2, …), which guarantees that the fitness of the best individual never decreases from one generation to the next
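The mechanisms on this slide can be illustrated with a minimal sketch. It is a toy GA over bitstrings, not the deck's FPGA genotype: the fitness function (`ones_count`), tournament selection, one-point crossover and all parameter values are assumptions chosen for illustration. The one property it is meant to show is the elitism guarantee: with E=1 the best individual is copied unchanged, so the best fitness never drops.

```python
import random

def evolve(fitness, length=6, pop_size=20, generations=30, elite=1,
           mutation_rate=0.05, seed=0):
    """Toy GA over bitstrings; returns the best fitness seen in each generation."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    best_history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        best_history.append(fitness(pop[0]))
        # Elitism: carry the top `elite` individuals over unchanged.
        next_pop = [ind[:] for ind in pop[:elite]]
        while len(next_pop) < pop_size:
            # Binary tournament selection of two parents (an assumed scheme).
            p1 = max(rng.sample(pop, 2), key=fitness)
            p2 = max(rng.sample(pop, 2), key=fitness)
            cut = rng.randrange(1, length)          # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation applied independently to each bit.
            child = [b ^ 1 if rng.random() < mutation_rate else b for b in child]
            next_pop.append(child)
        pop = next_pop
    return best_history

def ones_count(bits):
    """Hypothetical fitness function: number of 1 bits."""
    return sum(bits)

history = evolve(ones_count)
```

Setting `elite=0` removes the guarantee, which is one way to see why the operator matters when evolved configurations are returned to service.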


Loosely Coupled Solution on Xilinx Virtex II Pro & Virtex 4

[Figure: Avnet FPGA Development Board with PCI interface, Virtex-II Pro FPGA and off-chip RAM; control hosted on PC, which supplies the bit file and input data and receives the FPGA output]

• The entire system operates on a 32-bit basis
• The Virtex-II Pro/4 is mounted on a development board, which can then be interfaced with a workstation running Xilinx EDK and ISE.


Organic Embedded System (OES) Architecture

One Dimensional Column-oriented OES based on Xilinx Virtex II Pro FPGA platform

• FEs and AEs reside on two distinct layers with an interconnection structure between them
• AEs and FEs can be realized in hardware, software, or co-design
• AE layer supervises functionality of FE elements while requiring no application-specific algorithms on the AE layer
• Observer/Controller architecture includes an AS element that has no counterpart to evaluate whether the AS is fault-free; the proposed approach addresses this by minimizing AS complexity
• Utilizes Xilinx partial reconfiguration technology to manipulate relocatable bitstreams


OES AE Component Design

AEs decentralize Observer/Controller functionality:
• Concurrent Error Detection (CED) unit collects 2 FE outputs for discrepancy identification
• A Checksum unit for AE fault detection, whose values are checked against Stored Checksum values
• Evaluator of outputs from 2 FEs against the checksum, and Actuator which initiates the recovery phase
• An important architectural property is that all AE components are identical in structure even though they monitor different types of FEs.
• Homogeneous characteristics deliver a uniform-behavior property, which is leveraged for the consensus-based evaluation fault-handling methodology
• OC Concept: although AE components add complexity to the design, they ease the fault-handling integration difficulties inherent in current commercial IP cores


Consensus-Based Evaluation (CBE)

• Uses a Relative Fitness Measure
  • Pairwise discrepancy checking yields relative fitness measure
  • Broad temporal consensus in the population used to determine fitness metric
  • Transition between Fitness States occurs in the population
  • Provides graceful degradation in presence of changing environments, applications and inputs, since this is a moving measure

• Test Inputs = Normal Inputs for Data Throughput
  • CBE uses neither additional functional test vectors nor additional resource test vectors
  • Potential for higher availability as regeneration is integrated with normal operation
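A minimal sketch of the relative-fitness idea, under stated assumptions: the population is four software models of a 1-bit full adder, one with a hypothetical stuck-at-0 fault on input `b`; pairs are drawn at random and fed random "normal" inputs; each individual's fitness is its fraction of agreeing comparisons over the window. The function names and the random pairing scheme are illustrative and not taken from the deck.

```python
import random

def full_adder(a, b, cin):
    """Fault-free 1-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return (s, cout)

def faulty_adder(a, b, cin):
    """Hypothetical fault model: input b stuck at 0."""
    return full_adder(a, 0, cin)

def relative_fitness(population, window, seed=0):
    """Pairwise discrepancy checking: no golden reference, no extra test vectors."""
    rng = random.Random(seed)
    agreements = [0] * len(population)
    comparisons = [0] * len(population)
    for _ in range(window):
        i, j = rng.sample(range(len(population)), 2)    # pick two competitors
        x = tuple(rng.randint(0, 1) for _ in range(3))  # ordinary throughput input
        agree = population[i](*x) == population[j](*x)
        for k in (i, j):
            comparisons[k] += 1
            agreements[k] += agree                      # bool counts as 0 or 1
    return [a / c if c else 1.0 for a, c in zip(agreements, comparisons)]

population = [full_adder, full_adder, full_adder, faulty_adder]
fitness = relative_fitness(population, window=600)
faulty_index = min(range(len(fitness)), key=fitness.__getitem__)
```

Healthy individuals also lose some fitness whenever they are paired with the faulty one, which is why the measure is relative: the faulty individual stands out only because it disagrees with every partner, not because any single comparison names it.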


Genetic Operators: Mutation

[Figures: Mutation on genotype chromosomes; Mutation on phenotype chromosomes]

• original functionality is F = F1·(F3 + F4), with input F2 left unassigned by the synthesis tool
• mutation operator replaces input F4 with the unused input F2, giving F = F1·(F3 + F2)
• shadow shows changed input and LUT contents
• some opportunity for input stuck-at fault or LUT content stuck-at fault
• functionalities of LUTs remain undistorted while the search space is explored

Typical Approach: bit inversion of LUT functionality
Selected Approach: input interconnections of LUTs are mutated

Rearrange input interconnections to search for unused LUT resources that occlude the faulty resource


Genetic Operators: Cell Swapping

Cell-Swap operation on Genotype chromosomes

Cell-Swap operation on Phenotype chromosomes

interchanges two distinct LUT blocks while maintaining correct logic order and functionalities in genotype

• exchange all LUT input interconnections, LUT content and physical 2-tuple (Col#, Row#) as well as the logic sequence


Genetic Operators: PMX Operator

Partial Match Crossover (PMX) maintains crossover information as well as order information

• two genotype configuration streams are aligned at a LUT boundary
• crossover site selected at random along the LUT boundary
• this crossover point defines a left/right partition used to effect crossover through LUT-by-LUT exchange
• suppose the crossover point is at position 4 of the LUT vector:
  • first step is to map configuration B to configuration A by exchanging the following aligned LUTs: {(4,7),(5,2),(6,1),(7,5)}
  • applying PMX results in two new configurations A’ and B’
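The exchange step can be sketched as follows. This is one simple single-cut variant of partially matched crossover, offered as an illustration and not as the authors' implementation. The parent vectors `A` and `B` are hypothetical: the original figure is not available, so they were chosen only so that the aligned pairs from position 4 onward reproduce the set quoted on the slide.

```python
def pmx_single_point(a, b, cut):
    """Give the child b's genes from `cut` onward while keeping a's gene set.

    Each swap moves the wanted gene into place and sends the displaced gene
    to the wanted gene's old slot, so the child stays a valid permutation.
    """
    child = list(a)
    for i in range(cut, len(child)):
        j = child.index(b[i])                 # where the wanted gene sits now
        child[i], child[j] = child[j], child[i]
    return child

# Hypothetical LUT orderings (logic index numbers), not taken from the slide.
A = [1, 2, 3, 4, 5, 6, 7]
B = [3, 4, 6, 7, 2, 1, 5]
CUT = 3  # position 4 of the LUT vector when counting from 1

aligned_pairs = list(zip(A[CUT:], B[CUT:]))   # the LUT-by-LUT exchanges
A_new = pmx_single_point(A, B, CUT)           # plays the role of A'
B_new = pmx_single_point(B, A, CUT)           # plays the role of B'
```

Because every exchange is a swap inside one parent, no LUT index is duplicated or lost, which is the property that lets crossover preserve the logical order information the slide mentions.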


Illustrative Example: Gate-Level Design of OES

• Experiment circuit: 1-bit full adder
• Fault-free model: Duplex
• Fault-impact model: TMR
• Fault-detect model: CBE
• Fault recovery strategy: GA operation
• Experimental setup:
  • Hardware prototype implemented in Xilinx Virtex-II Pro FPGA
  • VHDL implementation
  • Using the GNAT library along with the MRRA framework and JTAG reconfiguration interface


MCNC-91 Benchmark Case Studies

System Availability under Multiple Faults

Circuit Name | Circuit Function | Inputs | Outputs | Approximate Gates
z4ml         | 2-bit add        | 7      | 4       | 20
cm85a        | logic            | 11     | 3       | 38
cm138a       | logic            | 6      | 8       | 17

• Fc = number of correct behaviors of FE observed during evolutionary recovery phase
• Fe = number of errant or discrepant behaviors
• 1 = exactly one output required to detect the fault during the original CED configuration
• 2 = number of reconfigurations required, i.e. one from CED to TMR, and one back from TMR to CED
• Fc1 & Fe1 = number of correct and faulty outputs of the FE during the AE repair period
• Fc2 & Fe2 = number of correct and faulty outputs during the FE repair period
• n = number of reconfigurations of the FE
• β = reconfiguration-to-computation time ratio


Experimental Results

Redundancy for both FE (RFE) and AE (RAE) = ratio of unused LUT inputs to total number of LUT inputs

Fc = number of correct behaviors of FE observed during evolutionary recovery phase

Fe = number of errant or discrepant behaviors

n = number of reconfigurations of the FE

β represents reconfiguration to computation time ratio

• Fault Free arrangement: CED FEs with cold standby FE

• Inject a stuck-at-zero or stuck-at-one fault at one of the FE’s LUT input pins

• CED -> TMR to identify faulty FE or AE

• CBE used to resolve faulty AE


Conclusion

• A self-adapting and self-healing OES architecture was developed for autonomic operation without human intervention.

• The OES architecture is capable of handling many single-fault scenarios and several multiple-fault scenarios for small digital logic designs.

• Experimental results support our design objectives: availability during the repair phase averaged 75.05%, 82.21%, and 65.21% for the z4ml, cm85a, and cm138a circuits respectively under the stated conditions.

• The reconfiguration time ratio (β) is the key factor limiting availability during AE repair.

• Future work: evaluate extensions of the OES architecture addressing scalability in terms of pipelined stages.


Backup Slides

• On following pages …


Isolation of a single faulty individual with 1-out-of-64 impact

• Outliers are identified after EW iterations have elapsed
• Expected D.V. = (1/64)*600 = 9.375 from individual impacted by fault
• Isolated faulty individual’s DV differs from the average DV by 33 after 1 or more observation intervals of length EW

[Figure: instantaneous DV (point values) for a sample individual in the population and population oracles (solid lines), over a Sliding Window]
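The expected value quoted above is a one-line calculation, reproduced here as a small sketch. It assumes DV is a count of discrepancies and that inputs are drawn uniformly, so a fault articulated by 1 of 64 inputs shows up in 1/64 of the window on average. The function name and the added spread estimate (a binomial standard deviation) are illustrative and not from the slide.

```python
from math import sqrt

def expected_discrepancies(window, articulating_inputs, total_inputs):
    """Mean number of iterations in the window that articulate the fault."""
    return window * articulating_inputs / total_inputs

def discrepancy_std(window, articulating_inputs, total_inputs):
    """Binomial standard deviation of that count, assuming independent uniform inputs."""
    p = articulating_inputs / total_inputs
    return sqrt(window * p * (1 - p))

ev = expected_discrepancies(600, 1, 64)   # the slide's (1/64)*600
sd = discrepancy_std(600, 1, 64)
```

Comparing `ev` with `sd` gives a rough feel for how many discrepancies separate a minimally impacted individual from noise within a single window.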


Future Work: Development Board to Self-Contained FPGA

[Figure: roadmap over Year 1, Year 2 and Year 3, from the Avnet FPGA Development Board (PCI interface, Virtex-II Pro FPGA, off-chip RAM, control hosted on PC, bit file and input data in, output back) to CRR on a Chip (Xilinx Virtex-II Pro) with control via on-chip PowerPC, configurations in on-chip RAM blocks, functional CLBs, and reconfiguration through ICAP; a device fault is marked]

Qualitative Analysis of CRR model
• Number of iterations and completeness of regeneration repair
• Percentage of time the device remains online despite physical resource fault (availability)

Hardware Resource Management
• Optimization of hardware profile for Xilinx Virtex II Pro

Field Testing on SRAM-based FPGA in a Cubesat mission


OES Integrated FE and AE Failure Detection Procedure

• System Initialization
  • FE Initialization step
  • Compute Checksum step
• FE Fault Detection/Recovery
  • AE-CED fault detection
  • FE fault-recovery
• AE Fault Detection Phase
  • A fault may exist in the CED, Actuator, or Evaluator,
  • a fault may exist in the Check Sum component, or
  • a fault may exist in the Stored CheckSum-LUT.

Runtime inputs to the FE are applied to both active instances under a CED strategy. After allowing for the propagation time of the FE inputs through the AE, the expected output is supplied to the AE-CED for fault detection. The outputs of the FEs are then compared in the AE-CED module, and any discrepancy between the two values indicates that a fault has occurred in either one of the FEs or the AE-CED itself. Further detection is required to distinguish which of the two is faulty. If the AE component is identified as innocent, then the fault must have occurred in an FE: this output is discarded and control branches to a fault identification phase, which wakes up the cold standby FE and constructs a temporary TMR system that can articulate the faulty FE under the newly supplied external input. Furthermore, as described in Section 3.3, the actuator initiates a repair cycle, which may require automatic evolutionary repair of the identified faulty FE; that FE is set as standby-under-repair, and the AE-CED returns to receiving inputs from the remaining two active FEs. The decision-making procedure incurs at least one throughput-delay penalty.


Previous Work

Detection Characteristics of FPGA Fault-Handling Schemes

Approach | Fault Handling Method | Fault Detection: Latency | Fault Detection: Distinguish Transients | Resource Coverage: Logic | Resource Coverage: Interconnect | Resource Coverage: Comparator | Fault Isolation Granularity
TMR | Spatial voting | Negligible | No | Yes | Yes | No | Voting element
[Vigander01] | Spatial voting & offline evolutionary regeneration | Negligible | No | Yes | No | No | Voting element
[Lohn, Larchev, DeMara03] | Offline evolutionary regeneration | Negligible | No | Yes | Yes | No | Unnecessary
[Lach98] | Static-capability tile reconfiguration | Relies on independent fault detection mechanism
STARS [Abramovici01] | Online BIST | Up to 8.5M erroneous outputs | Test pattern transients | Yes | Yes | No | LUT function
[Keymeulen, Stoica, Zebulum00] | Population-based fault insensitive design | Design-time prevention emphasis | No | Yes | Yes | No | Not addressed at runtime
CRR | Competing configurations with temporal voting and online regeneration | Negligible | Transients are attenuated automatically | Yes | Yes | Yes | Unnecessary, but can isolate functional components

… Strategy #1) Evolve redundancy into design before the anticipated failure or …


Previous Work

Fault Recovery Characteristics of Selected Approaches

Approach | Online Recovery | Basis for Recovery | Quality of Recovery | Availability | Externally-supplied Elements | Resource Recycling | Pre-determined Limits | Power Consumption
TMR | Yes | Requires 2 datapaths are operational | Either complete or none | 100% for single fault, 0% thereafter | 2 of 3 Majority Voter | No | Single datapath | 3n
[Vigander01] | No | Design complexity | Non-deterministic | Non-deterministic | GA Controller, function test vectors | Yes | None | 3n+r
[Lohn, Larchev, DeMara03] | No | Design complexity | Non-deterministic | Non-deterministic | GA Controller, function test vectors | Yes | None | 2n+r
[Lach98] | No | Available spares | Either complete or none | Either complete or none | Device test vectors and controller | No | Only one faulty CLB per tile | 2n+i+r
STARS [Abramovici01] | Yes | Available spares | Restricted by non-optimizable re-routing | Only ~93% regardless of fault occurrence | Test Reconfiguration Controller + device test vectors | Yes | Available spares within routing chokepoints | s • (c+m+b)
[Keymeulen, Stoica, Zebulum00] | No | Depends on characteristics at design time | Non-deterministic | Non-deterministic | None at runtime | No | Depends on redundancy during design | n • (1 + f(g))
CRR | Yes | Recovery complexity | Optimized by second-order fitness metric | Adaptable | Optional RAM. RAM coverage is intrinsic. No test vectors. | Yes | None | 2n+r

… Strategy #2) Evolve recovery from specific failure after (and if) it occurs or …


CRR Arrangement in SRAM FPGA

Configurations in Population
• $C = C^L \cup C^R$
• $C^L$ = subset of left-half configurations
• $C^R$ = subset of right-half configurations
• $|C^L| = |C^R| = |C|/2$

Discrepancy Operator
• Baseline Discrepancy Operator (written here as $\otimes$) is a dyadic operator with binary output:

  $C_i^L \otimes C_i^R = \begin{cases} 0 & \text{if } Z(C_i^L) = Z(C_i^R) \\ 1 & \text{otherwise} \end{cases}$

• $Z(C_i)$ is the FPGA data throughput output of configuration $C_i$
• Each half-configuration is evaluated using an embedded checker (XNOR gate) within each individual
• Any fault in the checker lowers that individual’s fitness, so the individual is no longer preferred and eventually undergoes repair
• Variants labeled in the figure, both built from $C_{i,j}^L \;\mathrm{EOR}\; C_{i,j}^R$ over index $j$: RS (Hamming Distance) and WTA (Equivalence)

[Figure: Reconfiguration Algorithm. SRAM-based FPGA holding an L Half-Configuration and an R Half-Configuration, each with Function Logic (L, R) and a Discrepancy Check (L, R); signals: configuration bit stream, input data, data output, feedback, control; off-chip EEPROM (NOTE: a non-volatile memory is already required to boot any SRAM FPGA from cold start ... this is not an additional chip)]


Terminology and Characteristics

Pristine Pool: CP. Any $C_i \in C$ is a member of CP at generation G if and only if
  $\sum_{K=1}^{G} C_K^L \otimes C_K^R = 0$

Suspect Pool: CS. Any $C_i \in C$ is a member of CS at generation G if and only if at least one of
  $C_K^L \otimes C_K^R \neq 0 \quad (1 \le K \le G)$

Under Repair Pool: CU. Any $C_i \in C$ is a member of CU at generation G if and only if a condition holds on $\sum_{K=1}^{G} C_K^L \otimes C_K^R$ relative to 1 (the comparison symbol is missing in the source)

Refurbished Pool: CR. After a Genetic Operator is applied, the newly generated individual is a member of CR at generation G if and only if
  $\sum_{K=1}^{G} C_K^L \otimes C_K^R = 0$

$E_D$ is the Discrepancy Count of $C_i$ and $E_C$ is the Correctness Count of $C_i$

Length of Evaluation Fitness Window: $E_W = E_D + E_C$

Fitness Metric: $f(C_i) = E_C / E_W$


Sketch of CRR Approach
Premise: Recovery Complexity << Design Complexity

1. Initialization
  • Population P of functionally-identical yet physically-distinct configurations
  • Partition P into sub-populations that use supersets of physically-distinct resources, e.g. size |P|/2 to designate physical FPGA left-half or right-half resource utilization

2. Fitness Assessment
  • Discrepancy Operator is some function of bitwise agreement between each half’s output
  • Four Fitness States defined for configurations as {CP, CS, CU, CR} with transitions, respectively: Pristine → Suspect → Under Repair → Refurbished
  • Fitness Evaluation Window W determines comparison interval
  • fitness assessment via pairwise discrepancy (temporal voting vs. spatial voting)

3. Regeneration
  • Genetic Operators used to recover from fault based on Reintroduction Rate
  • Operators are applied only once, then offspring are returned to “service” without concern about increasing fitness


State Transitions during lifetime of ith Half-Configuration

Configuration Health States

[Figure: state diagram with states primordial, pristine, suspect, under repair and refurbished; eleven numbered transitions labeled by agreement (L = R) or discrepancy (L ≠ R) between half-configuration outputs and by comparisons of fitness fi against thresholds fRT and fOT; transitions marked partial repair and complete repair; regions labeled COMPETITION and EVOLUTION]


Procedural Flow under Competitive Runtime Reconfiguration

[Figure: flowchart of the PRIMARY LOOP]
• Initialization: population partitioned into functionally-identical yet physically-distinct half-configurations
• Selection: choose FPGA configuration(s) labeled L and R
• Detection: apply functional inputs to compute FPGA outputs using L, R
• Decision: are the L, R results discrepancy-free (L = R)? YES / NO
• Fitness Adjustment: update fitness of only L and R based on detection results
• Decision: is either L's or R's fitness < Repair Threshold? If so, invoke Genetic Operators only once and only on L or R
• Adjust Controls: detection mode, overlap interval, ...

Integrates all fault-handling stages using EC strategy
• Detects faults by the occurrence of discrepancy
• Isolates faults by accumulation of discrepancies
• Failure-specific refurbishment using Genetic Operators: Intra-Module-Crossover, Inter-Module-Crossover, Intra-Module-Mutation

Realize online device refurbishment
• Refurbished online without additional function or resource test vectors
• Repair during the normal data throughput process


Fitness Evaluation Window

• Fitness Evaluation Window: W denotes the number of iterations used to evaluate fitness before the state of an individual is determined

• Determination of W for 3x3 multiplier
  • 6 input pins articulating $2^6 = 64$ possible inputs
  • W should be selected so that all possible inputs appear
  • More formally, let rand(X) return some $x_i \in X$ at random; seek W such that $\bigcup_{i=1}^{W} \{\mathrm{rand}(X)\} = X$ with high probability

• $x_K$ = distinct orderings of K inputs showing in D trials:

  $x_K = K^D - \sum_{m=1}^{K-1} \binom{K}{m}\, x_m, \qquad x_1 = 1$

• if D constant, can calculate $P_{K>1}$ successively

• probability $P_K$ of K inputs showing after D trials is the ratio $P_K = x_K / K^D$
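The successive calculation can be sketched in a few lines. This assumes inputs are drawn uniformly and independently, and it uses the standard count of length-D sequences that contain all K symbols, which matches the slide's definitions of x_K and P_K; the function name is illustrative.

```python
from math import comb

def coverage_probability(k, d):
    """Probability that all k equally likely inputs appear at least once in d trials.

    x[m] counts length-d sequences over m given symbols that use every one of
    them: all m**d sequences, minus those that use only a proper subset.
    """
    x = [0] * (k + 1)
    for m in range(1, k + 1):
        x[m] = m ** d - sum(comb(m, j) * x[j] for j in range(1, m))
    return x[k] / k ** d   # exact integers until the final division

# K = 64 inputs of the 3x3 multiplier, with the window length 600 used in the deck.
p_full_coverage = coverage_probability(64, 600)
```

Evaluating `coverage_probability(64, d)` over a range of `d` shows how quickly coverage saturates, which is the trade-off behind choosing W: a longer window articulates more inputs but delays every fitness decision.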


When K=64:

W Determination


Integer Multiplier Case Study

• 3-bit x 3-bit unsigned multiplier automated design:
  – Building blocks
    • Half-Adder: 18 templates created
    • Full-Adder: 24 templates
    • Parallel-And: 1 template created
  – Randomly select templates for instantiation in modules

GA operators
• External-Module-Crossover
• Internal-Module-Crossover
• Internal-Module-Mutation

GA parameters
• Population size: 20 individuals
• Crossover rate: 5%
• Mutation rate: up to 80% per bit

Experimental Evaluation
• Xilinx Virtex II Pro on Avnet PCI board

Experiments Demonstrate …
• Objective fitness function replaced by the Consensus-based Evaluation Approach and Relative Fitness
• Elimination of additional test vectors
• Temporal Assessment process


Template Fault Coverage

[Figures: Half-Adder Template A; Half-Adder Template B]

Template A
– Gate3 is an AND gate
– Will lose correctness if a Stuck-At-Zero fault occurs in the second input line of Gate3

Template B
– Gate3 is a NOT gate and only uses the first input line
– Will work correctly even if the second input line is stuck at Zero or One


Regeneration Performance

Parameters:
• Difference (vs. Hamming Distance)
• Evaluation Window: Ew = 600
• Suspect Threshold: S = 1 - 6/600 = 99%
• Repair Threshold: R = 1 - 4/600 = 99.3%
• Re-introduction rate: r = 0.1

Repairs evolved in-situ, in real-time, without additional test vectors, while allowing device to remain partially online.


Isolation of a single faulty individual with 1-out-of-64 impact

• Outliers are identified after W iterations have elapsed
• E.V. = (1/64)*600 = 9.375 from minimum-impact faulty individual
• Isolated individual’s f differs from the average DV by 33 after 1 or more observation intervals of length W