Illinois-Intel Parallelism Center DeNovoND: Efficient Hardware Support for Disciplined Non-Determinism Hyojin Sung, Rakesh Komuravelli, and Sarita V. Adve Department of Computer Science University of Illinois at Urbana-Champaign

DeNovoND: Efficient Hardware Support for Disciplined Non-Determinism

Page 1: DeNovoND: Efficient Hardware Support for  Disciplined Non-Determinism

Illinois-Intel Parallelism Center

DeNovoND: Efficient Hardware Support for Disciplined Non-Determinism

Hyojin Sung, Rakesh Komuravelli, and Sarita V. Adve

Department of Computer Science

University of Illinois at Urbana-Champaign

Page 2: DeNovoND: Efficient Hardware Support for  Disciplined Non-Determinism

Motivation
• Shared memory is the de facto model for multicore SW and HW
• BUT …
  – Complex SW: data races, unstructured parallelism, memory model, …
  – Inefficient HW: complex coherence/consistency, unnecessary traffic, …
• Recent work on disciplined shared memory
  – SW: Easier programming model
  – HW: If SW is more disciplined, can we build more efficient HW?
• DeNovo: Holistic rethinking of entire memory hierarchy

Page 3: DeNovoND: Efficient Hardware Support for  Disciplined Non-Determinism

Disciplined Shared Memory

Disciplined Shared-Memory =

Global address space +

Implicit, anywhere communication, synchronization

Explicit, structured side-effects

Page 4: DeNovoND: Efficient Hardware Support for  Disciplined Non-Determinism

Disciplined Shared Memory

Deterministic Parallel Java (DPJ) – strong safety properties (OOPSLA ’09)
• Determinism-by-default, simple semantics

DeNovo – performance, complexity and power efficient (PACT ’11)
• Simplify coherence and consistency

Disciplined Shared Memory: explicit effects + structured parallel control

Page 5: DeNovoND: Efficient Hardware Support for  Disciplined Non-Determinism

Limitation
• DeNovo for deterministic programs
  – Important assumptions
    1. No conflicting concurrent accesses, only barrier synchronization
    2. Known side-effects
  – Allowed DeNovo to eliminate design complexity and inefficiency
• Challenges for nondeterministic programs
  – The assumptions do not hold any more
    1. Can have conflicting concurrent accesses, support lock synchronization
    2. Side-effects unknown in critical sections
  – Applications with lock-based non-determinism are common

Page 6: DeNovoND: Efficient Hardware Support for  Disciplined Non-Determinism

Contribution

Deterministic Parallel Java (DPJ) – strong safety properties
• Determinism-by-default, simple semantics
• Explicit & safe non-determinism (POPL ’11)

DeNovoND: Non-deterministic codes with benefits of DeNovo
• Minimal additional HW for non-determinism
• Comparable performance to MESI
• 30% lower network traffic than MESI
• PLUS all advantages of DeNovo for deterministic codes

Disciplined Shared Memory: explicit effects + structured parallel control

Page 7: DeNovoND: Efficient Hardware Support for  Disciplined Non-Determinism

Outline
• Motivation
• Background
  – DPJ/DeNovo for deterministic codes
  – DPJ support for disciplined non-determinism
• DeNovoND Design
• DeNovoND Implementation
• Evaluation
• Conclusion and Future Work

Page 8: DeNovoND: Efficient Hardware Support for  Disciplined Non-Determinism

DPJ for Deterministic Codes
• Structured parallel control
  – Fork-join parallelism
• Explicit region and effect
  – Regions divide heap
  – Read or write effects on regions
• Data-race freedom guarantee
  – Simple, modular type checking

[Figure: parallel tasks issuing stores (ST) and a load (LD) to the heap, annotated with a write effect]

Page 9: DeNovoND: Efficient Hardware Support for  Disciplined Non-Determinism

DPJ for Deterministic Codes
• Java-compatible type system
• Structured parallel control
  – Fork-join parallelism
• Explicit region and effect
  – Regions divide heap
  – Read or write effects on regions
• Data-race freedom guarantee
  – Simple, modular type checking

[Figure: parallel tasks issuing stores (ST) and a load (LD) to the heap, annotated with a write effect]

Hardware – simplify coherence problems!
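As a rough illustration of why declared region effects rule out data races, the sketch below checks at run time whether the effects of tasks forked together interfere. This is not DPJ's static type checker: the `Effect` tuple and the `interferes` and `phase_is_race_free` helpers are hypothetical names, and regions are modeled as plain strings.

```python
from itertools import combinations
from typing import NamedTuple


class Effect(NamedTuple):
    kind: str    # "read" or "write"
    region: str  # hypothetical region name; regions partition the heap


def interferes(a: Effect, b: Effect) -> bool:
    """Two effects interfere if they touch the same region and at least one writes."""
    return a.region == b.region and "write" in (a.kind, b.kind)


def phase_is_race_free(task_effects: list[list[Effect]]) -> bool:
    """True if no pair of distinct parallel tasks has interfering effects."""
    for effects_a, effects_b in combinations(task_effects, 2):
        if any(interferes(a, b) for a in effects_a for b in effects_b):
            return False
    return True
```

Tasks that write disjoint regions, or that only read a shared region, pass the check; a writer and a reader of the same region do not.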

Page 10: DeNovoND: Efficient Hardware Support for  Disciplined Non-Determinism

DeNovo for Deterministic Codes
• Coherence Enforcement
  1. Invalidate stale copies in private cache
  2. Track up-to-date copy
• Explicit effects
  – Compiler knows all writeable regions in this parallel phase
  – Cache can self-invalidate before next parallel phase
• Registration
  – Directory keeps track of one up-to-date copy
  – Writer registers itself before next parallel phase

Page 11: DeNovoND: Efficient Hardware Support for  Disciplined Non-Determinism

DeNovo for Deterministic Codes
• No space overhead
  – Keep valid data or registered core id
  – LLC data arrays double as directory
• No transient states
• No invalidation traffic
• No false sharing

[Figure: state diagram with states Invalid, Valid, and Registered (registry); transitions labeled Read, Write, Write]
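The three states and the registration and self-invalidation steps can be sketched as a toy, single-threaded model. Everything here is an illustrative assumption rather than the actual protocol: the `DeNovoToy` class, word-granularity dictionaries, the choice that a read moves Invalid to Valid and a write moves to Registered, and the rule that a new registration invalidates the previous registrant's copy.

```python
from enum import Enum


class State(Enum):
    INVALID = "I"
    VALID = "V"
    REGISTERED = "R"


class DeNovoToy:
    """Toy model: per-core L1 states plus a shared L2 that holds data or a core id."""

    def __init__(self, num_cores, addrs, initial=0):
        # Each L2 entry is ("data", value) or ("core", id of the registered core).
        self.l2 = {a: ("data", initial) for a in addrs}
        self.state = [{a: State.INVALID for a in addrs} for _ in range(num_cores)]
        self.value = [{} for _ in range(num_cores)]

    def read(self, core, addr):
        if self.state[core][addr] is State.INVALID:
            kind, payload = self.l2[addr]
            if kind == "core":
                # The L2 only knows who is registered; fetch from that core.
                payload = self.value[payload][addr]
            self.value[core][addr] = payload
            self.state[core][addr] = State.VALID
        return self.value[core][addr]

    def write(self, core, addr, value):
        kind, payload = self.l2[addr]
        if kind == "core" and payload != core:
            # Assumption: the previous registrant gives up its copy.
            self.state[payload][addr] = State.INVALID
        self.value[core][addr] = value
        self.state[core][addr] = State.REGISTERED
        self.l2[addr] = ("core", core)  # registration

    def self_invalidate(self, core, written_addrs):
        # Drop Valid copies of anything that may have been written this phase;
        # Registered copies are this core's own writes and stay.
        for a in written_addrs:
            if self.state[core][a] is State.VALID:
                self.state[core][a] = State.INVALID

    def snapshot(self, core):
        return {a: s.value for a, s in self.state[core].items()}
```

With two cores that both start with Valid copies of X and Y, a store to X by core 0 and a store to Y by core 1 followed by self-invalidation leaves core 0 with R X, I Y and core 1 with I X, R Y.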

Page 12: DeNovoND: Efficient Hardware Support for  Disciplined Non-Determinism

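Example Run

[Figure: example run across L1 of Core 1, L1 of Core 2, and the shared L2; legend: Registered (R), Valid (V), Invalid (I). X and Y are in a DeNovo-region. The L1s and the L2 start with V X, V Y. Each core issues a store (ST) followed by a Registration to the L2 and an Ack; the L1s become R X, V Y and V X, R Y, and the L2 holds R C1, R C2. After self-invalidate( ), the L1s hold R X, I Y and I X, R Y.]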

Page 13: DeNovoND: Efficient Hardware Support for  Disciplined Non-Determinism

DPJ Support for Safe Non-Determinism
• Nondeterminism comes from conflicting concurrent accesses
• Isolate these accesses as “atomic”
  – Enclosed in “atomic” sections
  – “Atomic” regions and effects
→ “Disciplined” non-determinism
  – Race freedom, strong isolation
  – Determinism-by-default semantics
• DeNovoND converts “atomic” statements into locks

[Figure: parallel tasks issuing a store (ST) and a load (LD)]
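To make the last bullet concrete, here is a minimal software analogy of lowering an "atomic" section to a lock acquire/release pair. It uses Python's `threading.Lock` purely for illustration; `run_phase` and its parameters are hypothetical names, and nothing here reflects how DeNovoND implements locks in hardware.

```python
import threading


def run_phase(num_tasks=4, increments=1000):
    """Run parallel tasks that update shared state only inside lock-protected sections."""
    counter = 0
    finish_order = []        # which task finished first is non-deterministic
    lock = threading.Lock()  # stands in for the lock behind each "atomic" section

    def task(task_id):
        nonlocal counter
        for _ in range(increments):
            with lock:       # atomic { counter++ } lowered to acquire/release
                counter += 1
        with lock:
            finish_order.append(task_id)

    threads = [threading.Thread(target=task, args=(i,)) for i in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter, finish_order
```

The final count is always `num_tasks * increments` because the conflicting accesses are isolated, while `finish_order` may differ between runs: the non-determinism is confined to the atomic sections.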

Page 14: DeNovoND: Efficient Hardware Support for  Disciplined Non-Determinism

Outline
• Motivation
• Background
• DeNovoND Design
  – Memory Consistency Model
  – Distributed Queue-based Lock
• DeNovoND Implementation
• Evaluation
• Conclusion and Future Work

Page 15: DeNovoND: Efficient Hardware Support for  Disciplined Non-Determinism

Memory Consistency Model
• Deterministic accesses: a load returns the value of the last store from
  1. Same task in this parallel phase
  2. Or before this parallel phase

[Figure: ST 0xa and LD 0xa around a parallel phase, labeled DeNovo Coherence Mechanism]

Page 16: DeNovoND: Efficient Hardware Support for  Disciplined Non-Determinism

Memory Consistency Model
• Non-deterministic accesses: a load returns the value of the last store from
  1. Same task in this parallel phase
  2. Or before this parallel phase
  3. Or in preceding critical sections

[Figure: two ST 0xa and one LD 0xa in a parallel phase containing a critical section]

Page 17: DeNovoND: Efficient Hardware Support for  Disciplined Non-Determinism

Coherence for non-deterministic data
• When to invalidate?
  – Between the start of critical section and any read
• What to invalidate?
  – Entire cache? Regions with “atomic” effect?
  – Track atomic writes in a signature, transfer with lock
• Registration
  – Writer updates before next critical section
• Coherence Enforcement
  1. Invalidate stale copies in private cache
  2. Track up-to-date copy
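The idea of tracking atomic writes in a signature that travels with the lock can be sketched as follows. This is a simplified model under stated assumptions: the signature is an exact Python `set` rather than a hardware filter, registration is modeled as an immediate update of a shared table, signature reset is not modeled, and the `SharedL2`, `Lock`, and `Core` classes are hypothetical.

```python
class SharedL2:
    def __init__(self, initial):
        self.value = dict(initial)  # up-to-date value per address (simplified)
        self.owner = {}             # address -> name of the registered core


class Lock:
    def __init__(self):
        self.signature = set()      # atomic writes made by earlier lock holders


class Core:
    def __init__(self, name, l2):
        self.name = name
        self.l2 = l2
        self.cache = {}             # private copies: address -> value
        self.signature = set()

    def acquire(self, lock):
        # The signature arrives together with the lock.
        self.signature |= lock.signature

    def release(self, lock):
        # Pass on everything this core knows was written atomically.
        lock.signature = set(self.signature)

    def atomic_write(self, addr, value):
        self.cache[addr] = value
        self.l2.value[addr] = value        # simplified registration
        self.l2.owner[addr] = self.name
        self.signature.add(addr)

    def atomic_read(self, addr):
        # Self-invalidate only if the address was written atomically
        # and this core is not the registered (up-to-date) owner.
        possibly_stale = addr in self.signature and self.l2.owner.get(addr) != self.name
        if possibly_stale or addr not in self.cache:
            self.cache[addr] = self.l2.value[addr]
        return self.cache[addr]
```

A core that cached Z in an earlier critical section sees the newer value after it next acquires the lock, because the incoming signature marks Z as possibly stale; addresses outside the signature keep hitting in the private cache.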

Page 18: DeNovoND: Efficient Hardware Support for  Disciplined Non-Determinism

Distributed Queue-based Lock
• Lock primitive that works on DeNovoND
  – No directory, no write invalidation → no spinning for lock
• Modeled after QOSB Lock
  – Lock requests form a distributed queue
  – But much simpler
• Details in the paper
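A distributed queue of lock requests can be illustrated with a small sequential model in which only the tail of the queue is kept centrally and each waiter is linked from its predecessor. This is a generic queue-lock sketch, not the protocol from the paper: `QueueLock`, `Core`, and the hand-off rules are assumptions made for illustration.

```python
class QueueLock:
    """Only the tail of the request queue is kept centrally (e.g. at the shared cache)."""

    def __init__(self):
        self.tail = None


class Core:
    def __init__(self, name):
        self.name = name
        self.holds_lock = False
        self.successor = None   # next core waiting behind this one

    def request(self, lock):
        prev, lock.tail = lock.tail, self
        if prev is None:
            self.holds_lock = True   # empty queue: granted immediately
        else:
            prev.successor = self    # request is forwarded to the previous tail

    def release(self, lock):
        assert self.holds_lock, "release without holding the lock"
        self.holds_lock = False
        nxt, self.successor = self.successor, None
        if nxt is not None:
            nxt.holds_lock = True    # direct hand-off; the waiter never spins
        else:
            lock.tail = None         # nobody waiting: the queue is empty again
```

Waiters are granted the lock in request order, and each hand-off is a single point-to-point transfer from the releasing core to its successor.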

Page 19: DeNovoND: Efficient Hardware Support for  Disciplined Non-Determinism

Outline
• Motivation
• Background
• DeNovoND Design
• DeNovoND Implementation
• Evaluation
• Conclusion and Future Work

Page 20: DeNovoND: Efficient Hardware Support for  Disciplined Non-Determinism

Access Signatures
• Simple and small hardware Bloom filter per core
  – Track accesses with “atomic” effects only
  – Only 256 bits suffice
• Operations on Bloom filter
  – On write: insert address
  – On read: query filter for address for self-invalidation
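The insert and query operations can be sketched in software as a 256-bit Bloom filter. The hash functions (SHA-256 based), the number of hashes, and the `BloomSignature` class with its `merge` and `reset` helpers are assumptions for illustration; the slides do not specify the hardware hashing scheme.

```python
import hashlib


class BloomSignature:
    """256-bit Bloom filter over addresses; may report false positives, never false negatives."""

    def __init__(self, num_bits=256, num_hashes=2):
        self.num_bits = num_bits
        self.num_hashes = num_hashes   # assumed value, not taken from the slides
        self.bits = 0                  # the bit vector, stored as a Python int

    def _positions(self, addr):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{addr}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.num_bits

    def insert(self, addr):
        """On an atomic write: record the address."""
        for p in self._positions(addr):
            self.bits |= 1 << p

    def may_contain(self, addr):
        """On an atomic read: True means the cached copy may be stale."""
        return all((self.bits >> p) & 1 for p in self._positions(addr))

    def merge(self, other):
        """Combine with a signature received along with a lock."""
        self.bits |= other.bits

    def reset(self):
        self.bits = 0
```

A false positive only costs an unnecessary self-invalidation and refetch, so correctness does not depend on the filter being exact; that is what makes a small fixed-size filter acceptable.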

Page 21: DeNovoND: Efficient Hardware Support for  Disciplined Non-Determinism

Example Run

[Figure: example run across L1 of Core 1, L1 of Core 2, and the shared L2, showing per-word states (R, V, I) for X, Y, Z, and W. X and Y are in a DeNovo-region; Z and W are in an atomic DeNovo-region. Events shown: ST and LD, Registration and Ack, Read miss, lock transfer with signature contents Z and W, and self-invalidate( ) with filter reset.]

Page 22: DeNovoND: Efficient Hardware Support for  Disciplined Non-Determinism

Optimization to reduce self-invalidation
1. Loads in Registered state
2. “Touched-atomic” bit
  – Set on first atomic load
  – Subsequent loads don’t self-invalidate
• More in the paper

[Figure: sequence ST, LD, self-invalidate( ), LD, LD. X and Y are in a DeNovo-region; Z and W are in an atomic DeNovo-region.]
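The effect of the two optimizations can be illustrated by counting self-invalidations for repeated atomic loads of one cached word. This is a sketch under assumptions: the `CachedWord` class and `count_self_invalidations` function are hypothetical, and the "touched-atomic" bit is assumed to be cleared at the start of each critical section.

```python
class CachedWord:
    def __init__(self, state="V"):
        self.state = state           # "V" (Valid) or "R" (Registered)
        self.touched_atomic = False  # assumed cleared at the start of each critical section


def count_self_invalidations(word, num_loads, in_signature=True, use_touched_bit=True):
    """Count how many of num_loads atomic loads in one critical section self-invalidate."""
    word.touched_atomic = False      # entering a new critical section
    invalidations = 0
    for _ in range(num_loads):
        if word.state == "R":
            continue                 # optimization 1: our registered copy is up to date
        if use_touched_bit and word.touched_atomic:
            continue                 # optimization 2: already refreshed in this section
        if in_signature:
            invalidations += 1       # drop the copy and refetch; it is Valid again afterwards
        word.touched_atomic = True
    return invalidations
```

With the bit, three loads of a word found in the signature cost one self-invalidation instead of three; a word in Registered state costs none.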

Page 23: DeNovoND: Efficient Hardware Support for  Disciplined Non-Determinism

Overheads
• Hardware Bloom filter
  – 256 bits per core
• Storage overhead
  – One additional state, but no storage overhead (2 bits)
  – “Touched-atomic” bit per word in L1
• Communication overhead
  – Bloom filter piggybacked on lock transfer message
  – Writeback messages for locks
    • Lock writebacks carry more info
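A short worked calculation shows why one extra state costs no storage and how small the filters are in total. It assumes the three DeNovo states shown earlier (Invalid, Valid, Registered) plus one more, and the 16-core configuration used in the evaluation; the helper names are hypothetical.

```python
def state_bits(num_states: int) -> int:
    """Bits needed to encode num_states distinct coherence states."""
    return max(1, (num_states - 1).bit_length())


def total_signature_bytes(num_cores: int, bits_per_core: int = 256) -> int:
    """Total Bloom filter storage across all cores, in bytes."""
    return num_cores * bits_per_core // 8


# Three states already need 2 bits, and a fourth state still fits in 2 bits.
assert state_bits(3) == 2
assert state_bits(4) == 2

# 16 cores x 256 bits = 4096 bits = 512 bytes of signature storage in total.
assert total_signature_bytes(16) == 512
```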

Page 24: DeNovoND: Efficient Hardware Support for  Disciplined Non-Determinism

Evaluation Methodology
• Simulator: Simics + GEMS + Garnet
• System Parameters
  – 16 in-order cores
• Workloads
  – SPLASH-2, PARSEC and STAMP
  – Unchanged except region/effect and self-invalidation
• Protocols
  – MESI and DeNovoND
  – With idealized locks and realistic locks

Page 25: DeNovoND: Efficient Hardware Support for  Disciplined Non-Determinism

MESI vs. DeNovoND: Idealized lock
• DeNovoND performs comparably to MESI for all apps
  – For both DIL-INF and DIL-256

[Figure: per-application results for barnes, ocean, water, fluidanimate, streamcluster, tsp, kmeans, ssca2]

Page 26: DeNovoND: Efficient Hardware Support for  Disciplined Non-Determinism

MESI vs. DeNovoND: Realistic lock
• pthread lock vs. distributed queue-based lock
• DeNovoND performs comparably to or better than MESI

[Figure: per-application results for barnes, ocean, water, fluidanimate, streamcluster, tsp, kmeans, ssca2]

Page 27: DeNovoND: Efficient Hardware Support for  Disciplined Non-Determinism

Network Traffic (Realistic lock)
• DeNovoND has 33% less traffic than MESI (67% max)
  – No invalidation traffic
  – Reduced load misses due to lack of false sharing

[Figure: per-application network traffic for barnes, ocean, water, fluidanimate, streamcluster, tsp, kmeans, ssca2]

Page 28: DeNovoND: Efficient Hardware Support for  Disciplined Non-Determinism

Conclusions and Future Work
• DeNovoND: Efficient HW support for non-determinism
  – Minimal additional HW for safe non-determinism
  – Comparable performance to MESI
  – 30% lower network traffic than MESI
  – PLUS all advantages of DeNovo for deterministic codes
• Future work: broaden the application space further
  – Pipeline parallelism, “lock-free” data structures, OS, legacy codes…