Dependable System Design For Reconfigurable Safety-Critical Applications

PhD Student: Anees Ullah

Research Group: CAD group

Advisor: Prof. Luca Sterpone


Outline

• Dependability

• Re-configurability

• FPGA architecture and Slice Internals

• Problem Statement and goal

• Fine-grain Self-Repairing System (FGSR)

• Dynamically Reconfigurable TMR (DrTMR)

• ASIC fault emulation (FE)

• Conclusions

• Future Work


Dependability

• Reliable computation in today’s nanometric technologies is challenging
  – Manufacturing defects (stuck-at, stuck-open, bridge faults)
  – In-operation effects (radiation, electromigration, aging)

• Dependable system design techniques should be applied at every phase, from RTL to implementation

• These mitigation strategies are assessed with testing and fault injection techniques to ensure that the desired reliability levels are attained
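As a minimal sketch of what a fault injection campaign checks, the snippet below flips each bit of a tiny configuration word and compares the result with a golden run. The 2-input AND "circuit", the workload and all names are hypothetical and only illustrate the idea; they are not taken from the slides.

```python
def run_circuit(init, inputs):
    """Evaluate a 2-input LUT: the output is truth-table bit number `inputs`."""
    return (init >> inputs) & 1

def injection_campaign(golden_init, workload, n_bits=4):
    """Flip each configuration bit once and compare against the golden run."""
    outcomes = {}
    for bit in range(n_bits):
        faulty_init = golden_init ^ (1 << bit)  # emulate a single bit upset
        mismatch = any(
            run_circuit(faulty_init, x) != run_circuit(golden_init, x)
            for x in workload
        )
        outcomes[bit] = "wrong answer" if mismatch else "no effect"
    return outcomes

AND2_INIT = 0b1000        # truth table of a 2-input AND (hypothetical circuit)
WORKLOAD = [0b00, 0b11]   # hypothetical input vectors applied during the test
report = injection_campaign(AND2_INIT, WORKLOAD)
```

With this workload only the upsets in bits 0 and 3 are observed, which shows why the measured reliability depends on the stimuli used during the campaign.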

Re-configurability

• FPGAs are electronic devices whose functionality can be upgraded and changed in the field

• Redundancy for error masking/detection, combined with re-configuration for correction, is used in critical system design

• Re-configurability can also be used for fault injection/emulation as a dependability assessment tool

• SRAM-based FPGAs are highly sensitive to Soft Errors and have long recovery times
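The combination of redundancy for masking and re-configuration for correction can be sketched as follows: a bitwise majority vote masks the error, and the replica that disagrees with the vote becomes the candidate for repair. This is an illustrative model with hypothetical function names, not the implementation used in this work.

```python
def majority(a, b, c):
    """Bitwise 2-out-of-3 vote over three redundant words."""
    return (a & b) | (a & c) | (b & c)

def faulty_replica(a, b, c):
    """Index of the single replica that disagrees with the vote, else None."""
    voted = majority(a, b, c)
    suspects = [i for i, word in enumerate((a, b, c)) if word != voted]
    # More than one suspect means the vote itself cannot be trusted.
    return suspects[0] if len(suspects) == 1 else None

voted = majority(0b1010, 0b1010, 0b1110)           # error in replica 2 is masked
to_repair = faulty_replica(0b1010, 0b1010, 0b1110)  # replica to reconfigure
```

Here `voted` keeps the correct value while `to_repair` points at the replica whose configuration would have to be rewritten.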

SRAM-based FPGAs: architecture & Reconfiguration

• Rows (Clock Regions)

• Major Column

• Minor Column

• SLICE

[Figure: device floorplan of one Clock Region showing the major columns: IOB, BRAM, DSP and CMT columns]

[Figure: a CLB Major Column, 20 CLBs high, divided into minor columns (frames); frames labelled Frame 0, Frame 1, Frame 25 and Frame 35]

FPGA’s Slice Architecture

[Figure: slice schematic. Four LUTs (LUT A to LUT D) with inputs A1 to A6 and outputs O5 and O6; bypass inputs AX, BX, CX, DX; carry chain with PRECYINIT, ACY0 to DCY0 and COUT; output multiplexers AOUTMUX to DOUTMUX; slice outputs A to D and AMUX to DMUX; internal nodes M1 to M4 and X1 to X4]

Problem statement and Goal

• To mitigate the effects of Multiple Bit Upsets (MBUs) in the configuration memory and reduce the recovery times

• Fine-grain redundancy offers better mitigation properties, precise diagnosis of faults and fast error detection while fine-grain reconfiguration improves the recovery times

• The challenge is the overhead introduced by fine-grain approaches and the lack of design tools and methodologies

• The goal of this work is to exploit the FPGA’s primitive elements (LUT, carry-chains) to support fine-grain approaches to redundancy and reconfiguration
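The effect of reconfiguration granularity on recovery time can be estimated with a back-of-the-envelope model: the time is the number of configuration bytes divided by the configuration port throughput. The 36 frames per CLB column follow the frame labels in the architecture figure; the 41 words per frame and the 400 MB/s port are assumed illustrative values that the slides do not state, and command overhead is ignored.

```python
def reconfiguration_time_us(frames, words_per_frame=41, bytes_per_word=4,
                            port_bytes_per_s=400e6):
    """Time in microseconds to rewrite `frames` configuration frames.

    words_per_frame and port_bytes_per_s are assumptions, not source data.
    """
    payload_bytes = frames * words_per_frame * bytes_per_word
    return payload_bytes / port_bytes_per_s * 1e6

column_us = reconfiguration_time_us(36)  # a whole CLB column (36 frames)
frame_us = reconfiguration_time_us(1)    # a single frame
```

Under these assumptions a single frame is rewritten 36 times faster than the full column, which is the intuition behind fine-grain reconfiguration.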


Fine-grain self-repairing (FGSR)

• Self-repairing systems contain a Static and a Dynamic region implemented on a reconfigurable platform

• The Static region consists of resources (processors, memories, I/O interfaces) that are always required for the operation of the system

• The Dynamic region contains Partially Reconfigurable Modules (PRMs), modules that can change functionality while the system is operating

• The research aim is to increase the fault tolerance of PRMs with reliable and fast error detectors that have a low area overhead

Matteo Sonza Reorda, Luca Sterpone and Anees Ullah, “An Error-Detection and Self-Repairing Method for Dynamically and Partially Reconfigurable Systems”, in IEEE European Test Symposium (ETS 2013), May 27-31, 2013 (extended version submitted to IEEE Transactions on Computers)

FGSR - Methodology

[Figure: tool flow. Blocks: DUT HDL, Synthesis (XST), Net-list Extraction, Constraint Generator, Translate & Map, Mapped NCD, NCD2XDL, XDL, XDL Manipulator, XDL’, XDL2NCD, Mapped NCD’, P & R (PAR), Bitstream (BITGEN)]

[Figure: Regions Formation. Static Region and Dynamic Region; the Dynamic Region contains a Single Bit Error Region and a Multiple Bit Error Region]

[Figure: LUT-based Error Detection and carry-chain detectors along the CLB columns of the Single Bit Error Region and the Multiple Bit Error Region, with an OR LUT Tree across SLICEs driving FLAG 1. Annotated delays, in the order they appear: Column Logic Delay ~ 23 ns, Column Routing Delay ~ 36 ns; Logic Delay < 1 ns, Routing Delay < 3 ns]

[Figure: Multiple bit Error Detection and Single bit Error Detection circuits, each built from LUT A and LUT B (inputs A1 to A6) and the DX input, producing the Error signal]
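A rough way to reason about the OR LUT Tree that collects error flags is to count the LUT levels and LUTs needed to reduce N flags to one. The sketch below assumes 6-input LUTs and a simple level-by-level reduction, and it models each detector as a duplicate-and-compare check; the slides do not spell out either detail, so both are assumptions.

```python
import math

def or_tree(n_flags, lut_inputs=6):
    """Return (levels, luts) of a level-by-level OR reduction of n_flags signals."""
    levels, luts, signals = 0, 0, n_flags
    while signals > 1:
        signals = math.ceil(signals / lut_inputs)  # LUTs used at this level
        luts += signals
        levels += 1
    return levels, luts

def error_flag(outputs_a, outputs_b):
    """Assumed detector: XOR each duplicated output pair, then OR the mismatches."""
    return int(any(a ^ b for a, b in zip(outputs_a, outputs_b)))

levels, luts = or_tree(36)            # e.g. 36 hypothetical flags
flag = error_flag([1, 0, 1], [1, 1, 1])
```

The number of levels grows only logarithmically with the number of flags, which is why a tree keeps the detection latency short.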

FGSR - Results

[Chart: Number of upsets. Series: SEUs Wrong Answer (#), SEUs Corrected (#), MEUs Wrong Answer (#), MEUs Corrected (#)]

[Chart: Recovery Time in Microseconds. Series: DMR [µs], TMR [µs], Our Approach [µs]]

[Chart: Clock Period in Nanoseconds. Series: DMR (ns), TMR (ns), Our approach (ns)]

[Chart: Number of LUTs. Series: DMR (# LUTs), TMR (# LUTs), Our approach (# LUTs)]

Dynamically Reconfigurable TMR (DrTMR)

Mixed placement of TMR domains increases the probability of Cross-Domain Errors (CDEs) and makes partial reconfiguration of individual TMR domains impossible

1. “On the Optimal Reconfiguration times of TMR circuits on SRAM based FPGAs”, in IEEE NASA/ESA Conference on Adaptive Hardware and Systems (AHS 2013), June 25-27, 2013

2. “Real-Time SEU Tolerant Circuits on SRAM-based FPGAs”, in 2013 IEEE Convention on Radiation Effects on Components and Systems (RADECS 2013), Sep 23-27, 2013

3. “Recovery Time and Fault Tolerance Improvements for Circuits mapped on SRAM based FPGAs”, Journal of Electronic Testing: Theory and Applications (JETTA)

• CDE-aware Voter placement: voters from the same domain in a Slice

• Critical-Voter placement: voters from different domains in a Slice

DrTMR - Methodology

[Figure: tool flow. Blocks: TMR (X-TMR), Hierarchy Formation, Floor-planning (PlanAhead), Constraint Placement (Xilinx Placer), Error Detection and Flag Convergence, Routing (Xilinx PAR), Bitstream Generation (Xilinx BITGEN); intermediate files: EDIF, UCF, NCD]

[Figure: voter structure. Blocks MAJ_V0, MAJ_V1, MAJ_V2 and MIN_V0 fed by Sig_D0, Sig_D1 and Sig_D2, with Primary outputs; XNOR 1 and XNOR 2 mapped on LUT inputs A1 to A5 together with the DX input; signals SIG1, SIG1_MAJJ and SIG2_MAJJ]

DrTMR - Results

Benchmarks: RISC5x, CORDICR2P, CORDICP2R, MINIMIPS, LEON II, L80SOC, HP16, RS_DEC, FPU, DCT

[Chart: Recovery Time in Microseconds (log scale). Series: XNOR TMR (us), Min-TMR (us), Xilinx-TMR (us)]

[Chart: Number of Cross-Domain Errors (log scale). Series: XTMR Routing (# CDEs), Min-TMR Routing (# CDEs), XNOR-TMR Routing (# CDEs)]

[Chart: LUTs in Thousands (log scale). Series: Xilinx TMR (# LUTs), Min-TMR (# LUTs), XNOR-TMR (# LUTs)]

[Chart: Clock Period in Nanoseconds. Series: X-TMR (ns), XNOR-TMR (ns), Min-TMR (ns)]

• Error Latency = 28 ns @ 192 CLBs

• Zero CDEs in Logic

• 1 LUT per Majority Voter
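The "1 LUT per Majority Voter" figure is plausible because a 3-input majority function fits in a single LUT truth table. The sketch below derives that truth table, plus one for an XNOR-based agreement check in the spirit of the XNOR 1 / XNOR 2 blocks; the helper names and the exact agreement function are assumptions, not the mapping used in DrTMR.

```python
def lut3_init(func):
    """Pack a 3-input function into an 8-bit truth table (input a = address LSB)."""
    init = 0
    for addr in range(8):
        a, b, c = addr & 1, (addr >> 1) & 1, (addr >> 2) & 1
        init |= func(a, b, c) << addr
    return init

def majority(a, b, c):
    """2-out-of-3 vote over the three TMR domain signals."""
    return (a & b) | (a & c) | (b & c)

def xnor(a, b):
    return 1 - (a ^ b)

def all_agree(a, b, c):
    """Assumed detector: 1 only when the three domain signals are equal."""
    return xnor(a, b) & xnor(b, c)

majority_init = lut3_init(majority)
agree_init = lut3_init(all_agree)
```

Both functions are symmetric in their inputs, so the resulting truth tables do not depend on which domain drives which LUT pin.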

ASIC Fault Emulation (FE)

• The simulation of a circuit in the presence of faults is called fault simulation, and it is a mandatory step to determine the yield of any VLSI fabrication process

• Fault emulation is the hardware realization of the fault simulation process, used to speed it up

• The research aim of this activity is to exploit the flexibility of LUTs for fault injection, emulating permanent ASIC faults on an FPGA

“Effective emulation of permanent faults in ASICs through Dynamically Reconfigurable FPGAs”, in 24th IEEE International Conference on Field Programmable Logic and Applications, Munich, Germany, September 2-4, 2014

FE - Methodology

[Figure: tool flow in two phases. Custom Mapping Phase: ASIC Verilog Netlist, Net-list Parsing, ASIC Logic Cones, Duplication-free Map, Address lines bind, EDIF, Xilinx tools, Physical Information. Fault Dictionary Creation Phase: ASIC Verilog Netlist, TetraMax, ASIC Fault List Extractor, ASIC Collapsed fault list, Fault Dictionary Algorithm, Fault dictionary]

[Figure: an ASIC logic cone with nets a to j mapped onto LUT A, with inputs I0 to I5 on address lines A1 to A6 and output O6]

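Fault-free function: O6 = ( ((I0 and I1)’ and I5)’ and (I2 and (I3 or I4))’ )’

Net “f” stuck-at-1: O6 = ( ((I0 and I1)’ and I5)’ and (I2 and (1))’ )’

The constant 1 is written as (I3 or I3’) and (I4 or I4’), which keeps I3 and I4 in the expression.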
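The two expressions above differ only in the LUT truth table, so injecting the fault amounts to rewriting the LUT contents. The sketch below evaluates both functions over all 64 input combinations and packs them into 64-bit values; taking net “f” to be the output of (I3 or I4) and I0 as the least significant address bit are assumptions made for illustration.

```python
def lut_o6(i0, i1, i2, i3, i4, i5, f_stuck=None):
    """Evaluate the logic cone; f is assumed to be the (I3 or I4) net."""
    f = (i3 | i4) if f_stuck is None else f_stuck
    g = 1 - ((1 - (i0 & i1)) & i5)   # ((I0 and I1)' and I5)'
    h = 1 - (i2 & f)                 # (I2 and f)'
    return 1 - (g & h)               # (g and h)'

def init_value(f_stuck=None):
    """64-entry truth table as an integer; bit k is O6 at address k (I0 = LSB)."""
    init = 0
    for addr in range(64):
        bits = [(addr >> k) & 1 for k in range(6)]
        init |= lut_o6(*bits, f_stuck=f_stuck) << addr
    return init

golden = init_value()              # fault-free LUT contents
faulty = init_value(f_stuck=1)     # contents emulating "f" stuck-at-1
changed_entries = bin(golden ^ faulty).count("1")
```

Only the entries where the fault is both activated (I3 = I4 = 0) and propagated to O6 change, so the faulty circuit is obtained by rewriting a few truth-table bits without re-running the mapping flow.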

FE - Results

Benchmarks: Adder64, Multiplier32, Multiplier64, FIR filter, Hilbert transform

[Chart: Fault Simulation Times on Log Scale. Series: ASIC Tmodelsim (s), FPGA Xilinx Mapper Tmodelsim (s), FPGA Custom Mapper Tmodelsim (s), FPGA Custom Mapper Tfpga (s)]

[Chart: Number of LUTs. Series: Xilinx Mapper (# LUTs), Custom Mapper (# LUTs)]

[Chart: Delay in Nanoseconds. Series: Xilinx mapper (Delay ns), Custom mapper (Delay ns)]

• No re-compilation

• Guaranteed emulation

Conclusions

• The fine-grain approaches to reconfiguration and redundancy simultaneously improve fault tolerance and reconfiguration times

• These improvements are achieved at an affordable area and clock period overhead

Future Work

• As future work, radiation testing of the developed mitigation techniques needs to be conducted

• The developed CAD algorithms can be optimized with respect to clock period, area and reconfiguration times