University of Michigan, Electrical Engineering and Computer Science

StageNet: A Reconfigurable CMP Fabric for Resilient Systems

Shantanu Gupta, Shuguang Feng, Jason Blome, Scott Mahlke

2nd Workshop on Reconfigurable and Adaptable Architecture, Dec 1, 2007



Slide 2: Reliability Challenge

• Increasing defect rates are a major challenge [ITRS’03]
• ↑ power density, ↓ feature sizes, ↑ failures in time (FIT)
• Permanent faults
  ► Manufacturing defects
  ► Time-dependent dielectric breakdown (TDDB)
  ► Negative bias temperature instability (NBTI)
  ► Electromigration (EM)
  ► …

[Srinivasan, DSN’04]

For the 32nm technology node, an 8-core CMP would face ~30 faults in 4 years
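The ~30-fault figure can be related to a per-core failure rate with simple arithmetic. The sketch below uses the standard definition of FIT (failures per 10^9 device-hours) and assumes a constant failure rate spread evenly across the cores, which is a simplification not stated on the slide. The function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # leap days ignored for simplicity


def expected_faults(fit_per_core: float, cores: int, years: float) -> float:
    """Expected number of faults for a constant FIT rate per core."""
    device_hours = cores * years * HOURS_PER_YEAR
    return fit_per_core * device_hours / 1e9


def implied_fit_per_core(faults: float, cores: int, years: float) -> float:
    """Invert expected_faults: the FIT per core implied by a fault count."""
    device_hours = cores * years * HOURS_PER_YEAR
    return faults * 1e9 / device_hours


if __name__ == "__main__":
    # Figures taken from the slide: ~30 faults, 8 cores, 4 years.
    fit = implied_fit_per_core(faults=30, cores=8, years=4)
    print(f"implied FIT per core: {fit:.0f}")
```

Under these assumptions the slide's estimate corresponds to a per-core rate on the order of 10^5 FIT.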

Slide 3: Tolerating Permanent Faults

• Traditional solutions
  ► TMR
  ► Tandem / HP NonStop
  ► Impractical for mainstream
    • Cost
    • Power
    • Low gain
• Current approaches
  ► Detection/Prediction
    • Using sensors
    • Analytical models
    • Redundant execution
    • BIST
  ► Repair
    • Replacement
    • Reconfiguration

[Figure labels: K-pos DP-31/32; Teramac (1995)]

Slide 4: Reconfiguration Granularity

• Range of choices for the reconfiguration granularity: CORE level, STAGE level, MODULE level
• Coarser granularity: lower design complexity, lower overheads
• Finer granularity: better resource utilization

[Figure: pipeline with stage labels FETCH, DEC, EXEC, MEM, WB]

Related work cited on the slide:
- ElastIC, DT’06
- Reunion, MICRO’06
- Configurable Isolation, ISCA’07
- Online Diagnosis of Hard Faults, MICRO’05
- Ultra Low-Cost Defect Protection, ASPLOS’06

Slide 5: Mean Time to Failure Comparison

[Figure: MTTF increase (%) versus area increase (%), with curves for MODULE level, STAGE level, and CORE level]

CORE level
+ Easiest to do in practice
-- Poorest MTTF gains

STAGE level
+ Circuit/logical boundary
+ Improved MTTF gains
-- Architectural complexity

MODULE level
+ Best MTTF gains
-- Hardest to repair

Slide 6: Throughput Comparison

[Figure: throughput curves for STAGE level and CORE level]

STAGE-level reconfiguration allows significantly more graceful throughput degradation

• Monte Carlo study
• Randomly injected failures
• Assumes that stages are shared resources
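A Monte Carlo comparison of this kind can be sketched as follows. This is a minimal illustration, not the authors' experiment: it assumes each fault disables one distinct, uniformly chosen stage instance, that any working stage can be shared by any pipeline, and that throughput is proportional to the number of usable pipelines. All names and parameter values are hypothetical.

```python
import random


def working_pipelines(faulty, cores, stages):
    """Return (core_level, stage_level) counts of usable pipelines.

    faulty is a set of (core, stage) pairs marking broken stage instances.
    """
    # Core level: a core is lost as soon as any one of its stages is broken.
    core_level = sum(
        all((c, s) not in faulty for s in range(stages)) for c in range(cores)
    )
    # Stage level: stages are shared, so the scarcest stage type is the limit.
    stage_level = min(
        sum((c, s) not in faulty for c in range(cores)) for s in range(stages)
    )
    return core_level, stage_level


def monte_carlo(cores, stages, n_faults, trials, seed=0):
    """Average usable pipelines after n_faults random stage failures."""
    rng = random.Random(seed)
    slots = [(c, s) for c in range(cores) for s in range(stages)]
    core_total = stage_total = 0
    for _ in range(trials):
        faulty = set(rng.sample(slots, n_faults))
        core_level, stage_level = working_pipelines(faulty, cores, stages)
        core_total += core_level
        stage_total += stage_level
    return core_total / trials, stage_total / trials


if __name__ == "__main__":
    for k in range(0, 9, 2):
        core_avg, stage_avg = monte_carlo(cores=8, stages=5, n_faults=k, trials=2000)
        print(f"{k} faults: core-level {core_avg:.2f}, stage-level {stage_avg:.2f}")
```

In this model the stage-level count can never fall below the core-level count, because every fully healthy core contributes one working copy of each stage type.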

Slide 7: Goal of this Research

• Design a computing substrate
  ► Fault tolerant
  ► Graceful performance degradation with defects
  ► Highly reconfigurable
  ► Adaptable to the workload

A design that can meet the challenge of facing ~100s of faults while maintaining 70-80% throughput

Slide 8: CMP Fabric

[Figure: four cores (Core 0 to Core 3), each a pipeline of Stage1, Stage2, Stage3, ..., StageN]

Slide 9: StageNet CMP Fabric

[Figure: four rows of stages (Stage1, Stage2, Stage3, ..., StageN); labels: Configuration Manager, Allocator, Logical pipeline]
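The idea of assembling logical pipelines from whichever stages still work can be sketched as below. This is an illustrative assumption about what an allocator might do, not the policy used in StageNet: it presumes any stage can feed any copy of the next stage, and it pairs working stages in row order. The function name and the health-map layout are hypothetical.

```python
def allocate_pipelines(healthy):
    """Build logical pipelines from a stage health map.

    healthy[row][s] is True when stage s in physical row `row` still works.
    Each returned pipeline is a list whose entry s names the row that
    supplies stage s.
    """
    n_stages = len(healthy[0])
    # For every stage type, collect the rows that still have a working copy.
    available = [
        [row for row in range(len(healthy)) if healthy[row][s]]
        for s in range(n_stages)
    ]
    # The scarcest stage type bounds how many pipelines can be formed.
    n_pipelines = min(len(rows) for rows in available)
    return [[available[s][p] for s in range(n_stages)] for p in range(n_pipelines)]


if __name__ == "__main__":
    # Three rows, three stages, one broken stage in every row.
    health = [
        [True, False, True],
        [True, True, False],
        [False, True, True],
    ]
    for pipeline in allocate_pipelines(health):
        print(pipeline)
```

In the example every row has a broken stage, so core-level repair would leave no usable pipeline, while stage-level allocation still forms two.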

Slide 10: StageNet CMP Fabric - Benefits

[Figure: four rows of stages (Stage1, Stage2, Stage3, ..., StageN) with a Configuration Manager]

Slide 11: StageNet CMP Fabric - Issues

• Performance / Efficiency
  ► Scaling with number of stages
  ► Impact of router delay
    • Transmission delay (tdelay)
    • Congestion delay
• Design overheads
  ► Area
  ► Power
• Micro-architectural concerns
  ► Data forwarding logic
  ► Control flow handling

[Figure labels: Allocator; 256 bits; 64]

Slide 12: Experimental Setup

• MiBench suite
• SimpleScalar 4.0: simulates an in-order core with default parameters
• Stored statistics for the benchmarks
  - No. of instructions
  - No. of cycles
  - Branch mispredicts
  - I/D cache misses
  - …
• StageNet Model: parameterizable performance model for StageNet
• CPI results

Slide 13: Effect of varying pipeline depth

[Figure: tdelay = 1]

Slide 14: Effect of varying transmission delay

[Figure: stages = 10]

Slide 15: Performance enhancement

• Router delay is the leading cause of the slowdown
• Need some way to improve system utilization
• Let us send macro-ops (MOPs)
  ► A MOP is an instruction bundle
    • Upper bound on length
    • Upper bound on live-ins / live-outs
    • No branches in between
  ► Advantages
    • Amortizes delay / contention
    • Increases resource utilization

[Figure: example MOP formation with max length 4 and max live-ins 2; operation labels: >>, ST, LD, +, /, >>, &, <<, ST, +, LD]
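The MOP constraints above can be illustrated with a greedy bundling pass. This is a sketch under stated assumptions rather than the authors' algorithm: it enforces only the length and live-in bounds (live-outs are omitted because they need liveness information about later code), and it reads "no branches in between" as "a branch may only be the last instruction of a bundle". The `Instr` type, `form_mops`, and the register names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    op: str
    dst: tuple = ()          # registers written
    src: tuple = ()          # registers read
    is_branch: bool = False


def live_ins(bundle):
    """Registers read by the bundle before any instruction in it writes them."""
    written, needed = set(), set()
    for ins in bundle:
        needed.update(r for r in ins.src if r not in written)
        written.update(ins.dst)
    return needed


def form_mops(instrs, max_len=4, max_live_ins=2):
    """Greedily pack instructions into macro-ops under the two bounds."""
    mops, current = [], []
    for ins in instrs:
        candidate = current + [ins]
        if current and (len(candidate) > max_len
                        or len(live_ins(candidate)) > max_live_ins):
            mops.append(current)      # close the bundle that cannot grow further
            candidate = [ins]
        current = candidate
        if ins.is_branch:             # a branch may only end a bundle
            mops.append(current)
            current = []
    if current:
        mops.append(current)
    return mops


if __name__ == "__main__":
    program = [
        Instr("LD", dst=("r1",), src=("r0",)),
        Instr("+", dst=("r2",), src=("r1", "r5")),
        Instr(">>", dst=("r3",), src=("r2",)),
        Instr("ST", src=("r3", "r6")),
        Instr("BR", src=("r3",), is_branch=True),
        Instr("&", dst=("r7",), src=("r7", "r8")),
    ]
    for mop in form_mops(program):
        print([ins.op for ins in mop], sorted(live_ins(mop)))
```

A single instruction that already exceeds the live-in bound is placed in a bundle of its own, since it cannot be split further.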

Slide 16: Effect of varying MOP size

[Figure: tdelay = 4, stages = 10]

Slide 17: Conclusions

• Reliability-aware architectures with finer-grained reconfiguration are desirable for:
  ► Better MTTF gains
  ► Graceful throughput degradation
• StageNet, a potential solution, allows stage-level reconfiguration and is:
  ► Easy to reconfigure
  ► Inherently redundant
  ► Potentially scalable in issue width
• Using StageNet, significant reconfiguration flexibility can be gained in exchange for a small loss in performance

Slide 18: Future Work

• Micro-architectural issues
  ► Data bypass handling
  ► Control flow handling
  ► Sharing state between pipeline stages
• Network design
  ► Design of routers
  ► Design of interconnection
• Simulation setup
  ► Validation of results using a cycle-accurate simulator

Slide 19: StageNet: A Reconfigurable CMP Fabric for Resilient Systems

Slide 20: Backup slides

Slide 21: Repair

[Figure labels: DECODER (repeated), CHECKER (majority), IF/ID, ID/EX, BIST test vectors]

Sources cited on the slide:
- ElastIC, DT’06
- H. Qin, UC Berkeley
- Ultra Low-Cost Defect Protection for Microprocessor Pipelines, ASPLOS’06
- F. Bower, Tolerating Hard Faults in Microprocessor Array Structures, DSN’04