StageNet: A Reconfigurable CMP Fabric for Resilient Systems
Shantanu Gupta, Shuguang Feng, Jason Blome, Scott Mahlke
University of Michigan, Electrical Engineering and Computer Science
2nd Workshop on Reconfigurable and Adaptable Architecture
Dec 1, 2007
Reliability Challenge
• Increasing defect rates are a major challenge [ITRS’03]
• ↑ power density, ↓ feature sizes, ↑ failures in time (FIT)
• Permanent faults
  ► Manufacturing defects
  ► Time dependent dielectric breakdown (TDDB)
  ► Negative bias temperature instability (NBTI)
  ► Electromigration (EM)
  ► ….
[Srinivasan, DSN’04]
For the 32nm technology node, an 8-core CMP would face ~30 faults in 4 years
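To put the quoted figure in perspective, here is a back-of-the-envelope conversion. It assumes the ~30 faults arrive at a uniform rate over the four years and that a year has 8760 hours; both are simplifying assumptions, not statements from the slides. One FIT is one failure per 10^9 device-hours.

```latex
\begin{align*}
\text{mean time between faults} &\approx \frac{4 \times 365\ \text{days}}{30} \approx 49\ \text{days} \\
\text{chip-level failure rate} &\approx \frac{30}{4 \times 8760\ \text{h}}
  \approx 8.6 \times 10^{-4}\ \text{h}^{-1}
  \approx 8.6 \times 10^{5}\ \text{FIT}
\end{align*}
```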
Tolerating Permanent Faults
• Traditional solutions
  ► TMR
  ► Tandem / HP NonStop
  ► Impractical for mainstream
    • Cost
    • Power
    • Low gain
• Current approaches
  ► Detection/Prediction
    • Using sensors
    • Analytical models
    • Redundant execution
    • BIST
  ► Repair
    • Replacement
    • Reconfiguration
[Figure labels: K-pos DP-31/32; Teramac (1995)]
Reconfiguration Granularity
• Range of choices for the reconfiguration granularity: CORE level, STAGE level, MODULE level
[Figure: pipeline of FETCH, DEC, EXEC, MEM, WB shown at each granularity. Trade-off labels: lower design complexity and lower overheads versus better resource utilization]
Prior work, as grouped on the slide:
- ElastIC, DT’06; Reunion, MICRO’06; Configurable Isolation, ISCA’07
- Online Diagnosis of Hard Faults, MICRO’05; Ultra Low-Cost Defect Protection, ASPLOS’06
Mean Time to Failure Comparison
[Plot: MTTF increase (%) versus area increase (%), with curves for MODULE level, STAGE level, and CORE level]
CORE level
+ Easiest to do in practice
-- Poorest MTTF gains
STAGE level
+ Circuit/logical boundary
+ Improved MTTF gains
-- Architectural complexity
MODULE level
+ Best MTTF gains
-- Hardest to repair
Throughput Comparison
[Plot: throughput for STAGE level versus CORE level reconfiguration]
STAGE level reconfiguration allows significantly more graceful throughput degradation
• Monte-Carlo study
• Randomly injected failures
• Assumes that stages are shared resources
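A Monte-Carlo comparison of this kind can be sketched as follows. This is a minimal illustration, not the authors' experiment. It assumes faults hit distinct stage instances uniformly at random, that throughput is proportional to the number of usable pipelines, and that sharing stages costs nothing in interconnect delay. The function names and parameters are hypothetical.

```python
import random

def surviving_pipelines(num_cores, num_stages, num_faults, rng):
    """Inject faults at random and count usable pipelines under both schemes."""
    # Every (core, stage) pair is one physical stage instance.
    instances = [(c, s) for c in range(num_cores) for s in range(num_stages)]
    failed = set(rng.sample(instances, num_faults))

    # CORE level: a core is usable only if all of its own stages survive.
    core_level = sum(
        all((c, s) not in failed for s in range(num_stages))
        for c in range(num_cores)
    )
    # STAGE level: stages are shared, so the scarcest stage type is the limit.
    stage_level = min(
        sum((c, s) not in failed for c in range(num_cores))
        for s in range(num_stages)
    )
    return core_level, stage_level

def average_throughput(num_cores, num_stages, num_faults, trials=2000, seed=0):
    """Average the number of usable pipelines over many random trials."""
    rng = random.Random(seed)
    core_total = stage_total = 0
    for _ in range(trials):
        core, stage = surviving_pipelines(num_cores, num_stages, num_faults, rng)
        core_total += core
        stage_total += stage
    return core_total / trials, stage_total / trials

if __name__ == "__main__":
    for faults in range(0, 9, 2):
        core, stage = average_throughput(8, 5, faults)
        print(f"{faults} faults: core-level {core:.2f}, stage-level {stage:.2f}")
```

Under these assumptions the stage-level count can never fall below the core-level count, because every fully working core contributes one working copy of each stage type to the shared pool.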
Goal of this Research
• Design a computing substrate
  ► Fault tolerant
  ► Graceful performance degradation with defects
  ► Highly reconfigurable
  ► Adaptable to the workload
A design that can meet the challenge of facing hundreds of faults while maintaining 70-80% throughput
CMP Fabric
[Figure: four cores (Core 0, Core 1, Core 2, Core 3), each drawn as a pipeline of Stage1, Stage2, Stage3, ..., StageN]
StageNet CMP Fabric
[Figure: four rows of pipeline stages (Stage1, Stage2, Stage3, ..., StageN), a Configuration Manager, an Allocator, and a highlighted logical pipeline]
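The idea of a logical pipeline can be illustrated with a small sketch. The slides do not describe the allocation algorithm, so this simple column-by-column pairing is an assumption made for illustration. It also ignores placement and routing distance. `allocate_pipelines` and the `healthy` grid are hypothetical names.

```python
def allocate_pipelines(healthy):
    """Form logical pipelines from whichever stage copies still work.

    healthy[row][col] is True if the copy of stage type `col` in row `row`
    is functional. Each returned pipeline is a list of row indices, one per
    stage type, naming which physical copy serves that stage.
    """
    num_rows = len(healthy)
    num_stages = len(healthy[0])
    # For each stage type (column), list the rows whose copy still works.
    available = [
        [row for row in range(num_rows) if healthy[row][col]]
        for col in range(num_stages)
    ]
    # The scarcest stage type bounds the number of logical pipelines.
    count = min(len(rows) for rows in available)
    return [[available[col][i] for col in range(num_stages)] for i in range(count)]

if __name__ == "__main__":
    healthy = [
        [True, False, True],
        [True, True, False],
        [False, True, True],
    ]
    for pipeline in allocate_pipelines(healthy):
        print(pipeline)
```

In this example every row has one broken stage, so core-level repair would leave no usable core, while the sketch still assembles two logical pipelines from the surviving stages.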
StageNet CMP Fabric - Benefits
[Figure: the StageNet fabric diagram repeated, with four rows of stages (Stage1, Stage2, Stage3, ..., StageN) and the Configuration Manager]
StageNet CMP Fabric - Issues
• Performance / Efficiency
  ► Scaling with number of stages
  ► Impact of router delay
    • Transmission delay (tdelay)
    • Congestion delay
• Design overheads
  ► Area
  ► Power
• Micro-architectural concerns
  ► Data forwarding logic
  ► Control flow handling
[Figure labels: Allocator; 256 bits; 64]
Experimental Setup
[Flow: MiBench suite → SimpleScalar 4.0 → benchmark statistics → StageNet Model → CPI Results]
• SimpleScalar 4.0: simulates an in-order core with default parameters
• Statistics stored for the benchmarks: no. of instructions, no. of cycles, branch mis-predicts, I/D cache misses, ….
• StageNet Model: parameterizable performance model for StageNet
Effect of varying pipeline depth (tdelay = 1)
Effect of varying transmission delay (stages = 10)
Performance enhancement
• Router delay is the leading cause of the slowdown
• Need some way to improve system utilization
• Let us send macro-ops (MOP)
  ► MOP is an instruction bundle
    • Upper bound on length
    • Upper bound on live-ins / live-outs
    • No branches in between
  ► Advantages
    • Amortizes delay / contention
    • Increases resource utilization
[Figure: example dataflow graph of operations (>>, ST, LD, +, /, >>, &, <<, ST, +, LD) grouped into MOPs with max length 4 and max live-ins 2]
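The MOP constraints listed on this slide can be sketched as a greedy bundler. The slides do not say how MOPs are actually formed, so the greedy in-order scan is an assumption. The live-out bound is left out for brevity, and a branch is assumed to travel as its own bundle. `Instr`, `live_ins`, `form_mops`, and the `"BR"` opcode are hypothetical names.

```python
from typing import List, NamedTuple, Optional, Tuple

class Instr(NamedTuple):
    op: str
    dest: Optional[str]      # register written, or None (e.g. a store)
    srcs: Tuple[str, ...]    # registers read

def live_ins(bundle: List[Instr]) -> set:
    """Registers a bundle reads before any instruction inside it writes them."""
    defined, ins = set(), set()
    for instr in bundle:
        ins.update(s for s in instr.srcs if s not in defined)
        if instr.dest is not None:
            defined.add(instr.dest)
    return ins

def form_mops(program, max_len=4, max_live_ins=2, branch_ops=("BR",)):
    """Greedily pack consecutive instructions into macro-ops (MOPs)."""
    mops, current = [], []
    for instr in program:
        if instr.op in branch_ops:
            # No branches inside a MOP: close the open bundle and
            # emit the branch as a bundle of its own (an assumption).
            if current:
                mops.append(current)
            mops.append([instr])
            current = []
            continue
        candidate = current + [instr]
        if len(candidate) > max_len or len(live_ins(candidate)) > max_live_ins:
            # Adding this instruction would break a bound: start a new bundle.
            # A lone instruction always forms a bundle, even over the bound.
            if current:
                mops.append(current)
            current = [instr]
        else:
            current = candidate
    if current:
        mops.append(current)
    return mops

if __name__ == "__main__":
    program = [
        Instr("LD", "r1", ("r0",)),
        Instr(">>", "r2", ("r1",)),
        Instr("+", "r3", ("r2", "r4")),
        Instr("ST", None, ("r3", "r5")),
        Instr("BR", None, ("r3",)),
        Instr("&", "r6", ("r1", "r2")),
    ]
    for mop in form_mops(program):
        print([i.op for i in mop])
```

With the limits from the slide (max length 4, max live-ins 2), the first three instructions share one bundle. The store starts a new bundle because it would add a third live-in.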
Effect of varying MOP size (tdelay = 4, stages = 10)
Conclusions
• Reliability-aware architectures with finer-grained reconfiguration are desirable for:
  ► Better MTTF gains
  ► Graceful throughput degradation
• StageNet, a potential solution, allows stage level reconfiguration and is:
  ► Easy to reconfigure
  ► Inherently redundant
  ► Potentially scalable in issue width
• Using StageNet, significant reconfiguration flexibility can be gained for a small loss in performance
Future Work
• Micro-architectural issues
  ► Data bypass handling
  ► Control flow handling
  ► Sharing state between pipeline stages
• Network design
  ► Design of routers
  ► Design of interconnection
• Simulation setup
  ► Validation of results using a cycle accurate simulator
Repair
[Figure: ElastIC, DT’06]
[Figure credit: H. Qin, UC Berkeley]
[Figure: replicated DECODER blocks with majority CHECKERs at the IF/ID and ID/EX latches, each fed by BIST test vectors]
Ultra Low-Cost Defect Protection for Microprocessor Pipelines, ASPLOS’06
F. Bower, Tolerating Hard Faults in Microprocessor Array Structures, DSN’04