
Transparent Dynamic Binding with Fault-Tolerant Cache Coherence Protocol for Chip Multiprocessors


Shuchang Shan †‡, Yu Hu †, Xiaowei Li †

†Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences

‡ Graduate University of Chinese Academy of Sciences (GUCAS)



Outline

Introduction

TDB execution model

Experimental results

Conclusion

Architectural level Dual Modular Redundancy

Instruction-level DMR: DIVA [MICRO’99], SHREC [MICRO’04], EDDI [TR’02]

Thread-level DMR (leading thread and trailing thread): AR-SMT [FTCS’99], SRT [ISCA’00]

Core-level DMR: CRTR [ISCA’03], Reunion [MICRO’06], DCC [DSN’07]

[Figure: two identical out-of-order pipelines (Fetch, Decode/Rename, Issue Queue, FUs, Register File, Reorder Buffer, Writeback/Commit) whose results are compared; leading and trailing instructions with an EX’ and CHK stage; core pairs A/A’ and B/B’ with private L1 caches above the memory system.]

For CMP systems, building core-level DMR makes use of the abundant hardware resources!

Core-level Dual Modular Redundancy (DMR)

Using coupled cores to verify each other’s execution

Static binding
– Lacks flexibility
– e.g., Reunion [MICRO’06], CRT [ISCA’02], CRTR [ISCA’03]

Dynamic binding
– Lacks scalability for parallel processing
– e.g., DCC [DSN’07, WDDD’08]

[Figure: static binding (fixed pairs A/A’ and B/B’; other cores marked X and -) versus dynamic binding (pairs A/A’, B/B’ and C/C’ formed from non-adjacent cores; other cores marked X), both over the on-chip network and shared cache.]

Key issue in Core-level DMR

Maintain master-slave memory consistency

Master-slave memory consistency
– Coupled cores must get the same memory value
– External writes cause consistency violations

Reunion [Smolens-MICRO’06]
– Rollback and recovery for the inconsistency

Dynamic Core Coupling (DCC) [LaFrieda-DSN’07]
– Consistency window to stall the external writes

Scalability problem

[Figure: two timelines of LD1, LD2, the external store ST3 and the slave load LD1'. In the first, ST3 falls between LD1 and LD1' and causes a consistency violation. In the second, ST3 is stalled, which adds stall latency.]
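
The violation and the consistency window on this slide can be illustrated with a small sketch. It is a toy model, not the Reunion or DCC implementation: `SharedMemory`, `run_pair` and the `use_window` flag are hypothetical names, and the window is reduced to "hold the external store until the slave has loaded".

```python
from dataclasses import dataclass, field


@dataclass
class SharedMemory:
    """Toy global memory: address -> value (hypothetical helper)."""
    data: dict = field(default_factory=dict)

    def load(self, addr):
        return self.data.get(addr, 0)

    def store(self, addr, value):
        self.data[addr] = value


def run_pair(mem, addr, external_value, use_window):
    """Replay LD1 (master), ST3 (external write) and LD1' (slave).

    With use_window=True the external store is held back until the slave
    has loaded, which is a simplified stand-in for a consistency window.
    Returns (master_value, slave_value).
    """
    master_value = mem.load(addr)            # LD1 on the master
    if not use_window:
        mem.store(addr, external_value)      # ST3 lands between LD1 and LD1'
    slave_value = mem.load(addr)             # LD1' on the slave
    if use_window:
        mem.store(addr, external_value)      # ST3 was stalled until LD1' finished
    return master_value, slave_value
```

Without the window the pair reads different values for the same address. With it the values match, but the external writer pays the stall, which is the cost the following slides are concerned with.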

Scalability problem

External writes occur earlier and more frequently as the system scales
– Reunion: Unacceptable recovery overhead for consistency violation
– DCC: Unacceptable stall latency caused by consistency window

Scalable solution needed
– Reduce the consistency maintenance overhead

[Figure: external write interval breakdown (<100, <200, <300, <500, >500 cycles) for lu, fft, ocean-con, barnes, cholesky, radix, radiosity and their average, with bars labeled 16, 8 and 4 for each benchmark.]

Probability of external writes occurring within certain slacks
– For 4-CMP system: 28% in 100 cycles, 37% in 500 cycles
– For 16-CMP system: 43% in 100 cycles, 55% in 500 cycles

Number of external writes within 1K cycles: 0.3 for 4-CMP, 3.3 for 16-CMP
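
A breakdown like the one in this figure could be computed from a trace of external-write timestamps roughly as follows. This is only a sketch: the slides do not define how the interval is measured, so the code assumes it is the gap between consecutive external writes, and it assumes a gap of exactly 500 cycles falls into the last bucket. `interval_breakdown` is a hypothetical helper name.

```python
from bisect import bisect_right

# Bucket upper bounds in cycles, taken from the figure legend. Placing a gap
# of exactly 500 cycles in the last bucket is an assumption.
BOUNDS = [100, 200, 300, 500]
LABELS = ["<100", "<200", "<300", "<500", ">500"]


def interval_breakdown(write_cycles):
    """Return the fraction of external-write intervals in each bucket."""
    ordered = sorted(write_cycles)
    intervals = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    counts = [0] * len(LABELS)
    for gap in intervals:
        # bisect_right gives the index of the first bound greater than gap.
        counts[bisect_right(BOUNDS, gap)] += 1
    total = len(intervals)
    return {label: (count / total if total else 0.0)
            for label, count in zip(LABELS, counts)}
```

For example, timestamps `[0, 50, 200, 1000]` give gaps of 50, 150 and 800 cycles, so one third of the intervals falls into each of the `<100`, `<200` and `>500` buckets.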

Basic idea

Sphere of Consistency (SoC): the scope of the master-slave memory consistency maintenance
– The memory hierarchy
– The private caches

[Figure: a master and a slave, each with an L1 cache, above global memory; shown twice, once for each SoC option.]

Transparent Dynamic Binding (TDB): reduce the SoC to the scale of private caches; provide a scalable and flexible core-level DMR solution!


TDB principle

The same program input for the pair
Similar memory access behavior

[Figure: one program feeding the private caches A-L1$ and A’-L1$ above global memory.]

Transparent binding
– Master issues L1 miss requests for the logical pair
– Slave is prevented from accessing the global memory

Dynamic binding: using the system network for data communication and result comparison
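
The transparent binding rule can be sketched as a toy model. The class and method names are hypothetical, and delivering each fetched block to both private caches is an assumption about how the pair stays in step, not a description of the actual protocol messages.

```python
class LogicalPair:
    """Toy model of a master/slave pair with private L1 caches."""

    def __init__(self, global_memory):
        self.global_memory = global_memory   # address -> value
        self.master_l1 = {}
        self.slave_l1 = {}
        self.global_requests = 0             # requests that left the pair

    def master_load(self, addr):
        if addr not in self.master_l1:
            # Only the master issues the L1 miss request. The returned block
            # is delivered to both private caches (assumption of this sketch).
            self.global_requests += 1
            value = self.global_memory[addr]
            self.master_l1[addr] = value
            self.slave_l1[addr] = value
        return self.master_l1[addr]

    def slave_load(self, addr):
        # The slave never accesses global memory. A miss means the master
        # has not fetched the block yet, so the caller has to retry later.
        return self.slave_l1.get(addr)
```

A slave load that runs ahead of the master returns `None` and waits. Once the master has fetched the block, both cores read the same value and only one request has reached global memory.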

Transparent dynamic binding

Logical pair: consumer-consumer
– Consumer-consumer data access pattern

Sphere of Consistency
– The private caches

Transparency of slaves
– Passively waiting

[Figure: a master and a slave running the same program as a consumer-consumer logical pair, with a producer and global memory.]

Maintain Consistency under Out-of-Order Execution

Out-of-order execution brings in wrong-path effects [1]

[Figure, animated over five slides: a master and a slave run the same program, with a producer and global memory. Memory accesses are labeled MA1 to MA6 (sequences MA1 MA2 MA3 MA4 MA5 and MA1 MA3 MA6 MA1 MA5). Numbered data blocks 1 to 5 enter the private caches, which are ordered from LRU to MRU, and a pipeline refresh occurs during the sequence.]

Master-slave private cache consistency violation

Invariant: in-order memory instruction retirement sequence

[1] R. Sendag, et al. “The impact of wrong-path memory references in cache-coherent multiprocessor systems.” In JPDC’07
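
A small sketch shows how a single wrong-path reference can make two LRU caches diverge. It assumes a fully associative cache that is filled and reordered at issue time. `LruCache` and `replay` are hypothetical names, and the traces in the usage note are invented for illustration, with labels borrowed from the figure.

```python
from collections import OrderedDict


class LruCache:
    """Fully associative LRU cache; the last key is the MRU block."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def access(self, addr):
        if addr in self.blocks:
            self.blocks.move_to_end(addr)        # hit: promote to MRU
        else:
            if len(self.blocks) >= self.capacity:
                self.blocks.popitem(last=False)  # evict the LRU block
            self.blocks[addr] = True

    def contents(self):
        return list(self.blocks)                 # LRU ... MRU


def replay(capacity, accesses):
    """Run a trace through a fresh cache and return its final contents."""
    cache = LruCache(capacity)
    for addr in accesses:
        cache.access(addr)
    return cache.contents()
```

With a two-block cache, a master that issues `["MA1", "MA3", "MA5"]`, where `MA3` is on a wrong path, ends with `["MA3", "MA5"]`. A slave that issues only `["MA1", "MA5"]` ends with `["MA1", "MA5"]`. The retired sequence is identical, yet the private caches differ.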

Victim Buffer Assisted Conservative Private Cache Ingress Rule

Victim Buffer: filter the wrong-path (WP) data blocks

[Figure, animated over four slides: a master and a slave run the same program above global memory. Memory accesses are labeled MA1 to MA6, and numbered data blocks 1 to 5 arrive at both cores before entering the private caches, which are ordered from LRU to MRU.]

Conservative private cache ingress rule: accept data blocks from correct path into private caches
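
One possible reading of the ingress rule, sketched as a toy model: incoming blocks wait in the victim buffer and enter the private cache only once the requesting memory instruction retires. The class name, the method names and the retire/squash interface are assumptions of this sketch. The slides do not give the hardware interface.

```python
class GuardedCache:
    """Private cache fronted by a small victim buffer (toy model)."""

    def __init__(self):
        self.victim_buffer = {}   # blocks whose instruction has not retired
        self.cache = {}           # blocks admitted from the correct path

    def fill(self, addr, value):
        # Every returned block first lands in the victim buffer.
        self.victim_buffer[addr] = value

    def retire(self, addr):
        # The memory instruction retired, so it was on the correct path:
        # move its block from the buffer into the private cache.
        if addr in self.victim_buffer:
            self.cache[addr] = self.victim_buffer.pop(addr)

    def squash(self, addr):
        # The instruction was on a wrong path: drop its block.
        self.victim_buffer.pop(addr, None)
```

Since both cores retire the same in-order sequence of memory instructions, admitting blocks only at retirement keeps wrong-path fills out of both private caches.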

Maintain Consistency under Out-of-Order Execution

Potential master-slave consistency violation

Invariant: in-order memory instruction retirement sequence

[Figure: a master and a slave run the same program above global memory. Memory accesses are labeled MA1 to MA6, with MA1 and MA5 shown separately. Numbered data blocks 1 to 5 are shown, and the private caches are ordered from LRU to MRU.]

update-after-retirement LRU Replacement policy (uar-LRU)

[Figure, animated over two slides: a master and a slave run the same program above global memory. Memory accesses are labeled MA1 to MA6, with MA1 and MA5 shown separately. Numbered data blocks 1 to 5 are shown, and the private caches are ordered from LRU to MRU.]

uar-LRU: update MRU after the instruction retirement to prevent the wrong-path (WP) memory references from violating the consistency
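
The uar-LRU idea can be sketched with a recency list that changes only at retirement. `UarLruCache` and its methods are hypothetical names, and the sketch covers only hits. How fills and evictions interact with the policy is not specified on the slides.

```python
from collections import OrderedDict


class UarLruCache:
    """LRU cache whose recency order changes only at retirement."""

    def __init__(self, blocks):
        # Blocks are given in LRU ... MRU order.
        self.order = OrderedDict((addr, True) for addr in blocks)

    def access(self, addr):
        # A speculative hit reads the block but leaves the order untouched.
        return addr in self.order

    def retire(self, addr):
        # Only a retired (correct-path) access promotes the block to MRU.
        if addr in self.order:
            self.order.move_to_end(addr)

    def lru_to_mru(self):
        return list(self.order)
```

If the master speculatively touches `MA3` on a wrong path and then retires `MA1`, while the slave only retires `MA1`, both caches end in the same LRU-to-MRU order, because the wrong-path access never updated the master's recency state.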

Master-slave memory consistency violation

External writes violate the master-slave memory consistency
Atomicity of master-slave data access behavior
Lacks scalability as external writes become more frequent

[Figure: Master-slave input coherence: (a) external writes violate the consistency; (b) the master-slave consistency window in DCC]

Transparent Input Coherence Strategy

Take advantage of transparent dynamic binding
Break the atomicity of master-slave data access behavior

[Figure: timeline of LD1, the external store ST3 and LD1' with a checker; block state pairs shown as D D, I D and I D; one step is marked as an optimization.]


Experimental Setup

Full system simulator: Simics + GEMS
Parallel workloads: SPLASH-2
The baseline Dual Modular Redundancy system
– N active cores and another N disabled cores
– Simulates the DMR system where the slaves work without interfering with the masters

The Performance of TDB Proposal

[Figure: normalized runtime (axis from 0.8 to 1.1) for 4P, 8P, 16P and 32P.]

Normalized runtime is 97.2%, 99.8%, 101.2% and 105.4% of the baseline for 4, 8, 16 and 32 cores respectively

Conservative private cache ingress rule helps filter the wrong-path (WP) effects

Network Traffic of TDB Proposal

[Figure: normalized network traffic (axis from 0.8 to 1.1) for 4P, 8P, 16P and 32P.]

The total traffic is increased by 5.2%, 3.6%, 1.3% and 2.5% for 4-, 8-, 16- and 32-core CMP systems

Comparison against DCC [DSN’07]

[Figure: two panels comparing TDB and DCC for 4P, 8P, 16P and 32P: normalized runtime and normalized network traffic. Annotated values: 9.2%, 10.4%, 18% and 37.1%.]

Transparent Dynamic Binding (TDB): scalable and flexible core-level DMR solution!

Conclusion

Transparent Dynamic Binding
– Reduce SoC to the scale of private caches

Techniques to maintain the consistency
– Consumer-consumer data access pattern
– Victim Buffer assisted conservative ingress rule
– uar-LRU replacement policy
– Transparent input coherence policy

Scalable and flexible core-level DMR solution


Q&A?