DeNovoND: Efficient Hardware Support for Disciplined Non-Determinism

Illinois-Intel Parallelism Center


Hyojin Sung, Rakesh Komuravelli, and Sarita V. Adve

Department of Computer Science

University of Illinois at Urbana-Champaign


Motivation

• Shared memory is the de facto model for multicore SW and HW
• BUT …
– Complex SW: data races, unstructured parallelism, memory model, …
– Inefficient HW: complex coherence/consistency, unnecessary traffic, …

• Recent work on disciplined shared memory
– SW: Easier programming model
– HW: If SW is more disciplined, can we build more efficient HW?
• DeNovo: Holistic rethinking of entire memory hierarchy


Disciplined Shared Memory

Disciplined Shared-Memory = Global address space + explicit, structured side-effects
(in place of implicit, anywhere communication and synchronization)


Disciplined Shared Memory

Deterministic Parallel Java (DPJ) [OOPSLA '09] – strong safety properties
• Determinism-by-default, simple semantics

DeNovo [PACT '11] – performance, complexity and power efficient
• Simplify coherence and consistency

[Figure: Disciplined Shared Memory = explicit effects + structured parallel control]


Limitation

• DeNovo for deterministic programs
– Important assumptions
1. No conflicting concurrent accesses, only barrier synchronization
2. Known side-effects
– Allowed DeNovo to eliminate design complexity and inefficiency

• Challenges for nondeterministic programs
– The assumptions do not hold any more
1. Can have conflicting concurrent accesses, must support lock synchronization
2. Side-effects unknown in critical sections
– Applications with lock-based non-determinism are common


Contribution

Deterministic Parallel Java (DPJ) – strong safety properties
• Determinism-by-default, simple semantics
• Explicit & safe non-determinism [POPL '11]

DeNovoND: Non-deterministic codes with benefits of DeNovo
• Minimal additional HW for non-determinism
• Comparable performance to MESI
• 30% lower network traffic than MESI
• PLUS all advantages of DeNovo for deterministic codes

[Figure: Disciplined Shared Memory = explicit effects + structured parallel control]


Outline

• Motivation
• Background
– DPJ/DeNovo for deterministic codes
– DPJ support for disciplined non-determinism
• DeNovoND Design
• DeNovoND Implementation
• Evaluation
• Conclusion and Future Work


DPJ for Deterministic Codes

• Java-compatible type system
• Structured parallel control
– Fork-join parallelism
• Explicit region and effect
– Regions divide heap
– Read or write effects on regions
• Data-race freedom guarantee
– Simple, modular type checking

[Figure: fork-join tasks issuing stores (ST) with write effects on heap regions, followed by a load (LD)]

Hardware – simplify coherence problems!


DeNovo for Deterministic Codes

• Coherence Enforcement
1. Invalidate stale copies in private cache
2. Track up-to-date copy
• Explicit effects
– Compiler knows all writeable regions in this parallel phase
– Cache can self-invalidate before next parallel phase
• Registration
– Directory keeps track of one up-to-date copy
– Writer registers itself before next parallel phase


DeNovo for Deterministic Codes

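• No space overhead
– Keep valid data or registered core id
– LLC data arrays double as directory (registry)
• No transient states
• No invalidation traffic
• No false sharing

[Figure: state diagram with states Invalid, Valid and Registered; a Read takes Invalid to Valid, a Write takes Invalid or Valid to Registered]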
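The three stable states can be written down as a small transition function. This is a minimal sketch, not the hardware design: the function names are hypothetical, and the rule that a phase-end self-invalidation only drops Valid words in regions written during the phase is taken from the "explicit effects" bullets rather than from any detailed specification.

```python
from enum import Enum


class State(Enum):
    INVALID = "I"
    VALID = "V"
    REGISTERED = "R"


def on_read(state):
    """A read of an Invalid word fetches it (Invalid -> Valid); hits change nothing."""
    return State.VALID if state is State.INVALID else state


def on_write(state):
    """Any write leaves the word Registered: the writer holds the up-to-date copy."""
    return State.REGISTERED


def on_phase_end(state, region_written_this_phase):
    """Self-invalidation before the next parallel phase.

    Only Valid words in regions written this phase may be stale;
    Registered words were written locally and are kept.
    """
    if state is State.VALID and region_written_this_phase:
        return State.INVALID
    return state
```

Because every transition goes directly from one stable state to another, the sketch has no transient states to model, which mirrors the "no transient states" claim on the slide.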


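Example Run

X in DeNovo-region, Y in DeNovo-region. States: Registered (R), Valid (V), Invalid (I).

Step | L1 of Core 1 | L1 of Core 2 | Shared L2
Initial | V X, V Y | V X, V Y | V X, V Y
ST X on Core 1, ST Y on Core 2 | R X, V Y | V X, R Y | V X, V Y
Registration, Ack | R X, V Y | V X, R Y | R C1, R C2
self-invalidate( ) | R X, I Y | I X, R Y | R C1, R C2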
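The example run (two cores storing to X and Y, registering at the shared L2, then self-invalidating) can be replayed in a toy model. This is a sketch under simplifying assumptions: the class and method names are hypothetical, registration is modeled as a synchronous call that returns an acknowledgement, and each L2 entry is a tuple that holds either data or the id of the registered core.

```python
class SharedL2:
    """Each entry holds either valid data or the id of the registered core."""

    def __init__(self, words):
        self.entry = {w: ("data", 0) for w in words}

    def register(self, word, core_id):
        # The data slot is reused to record the registrant (no separate directory).
        self.entry[word] = ("core", core_id)
        return "Ack"


class L1:
    def __init__(self, core_id, words):
        self.core_id = core_id
        self.state = {w: "V" for w in words}  # everything starts Valid

    def store(self, word, l2):
        self.state[word] = "R"
        return l2.register(word, self.core_id)

    def self_invalidate(self, written_words):
        # Drop Valid copies of words that may have been written elsewhere.
        for w in written_words:
            if self.state[w] == "V":
                self.state[w] = "I"


def example_run():
    words = ["X", "Y"]
    l2 = SharedL2(words)
    c1, c2 = L1(1, words), L1(2, words)
    c1.store("X", l2)  # Core 1: ST X
    c2.store("Y", l2)  # Core 2: ST Y
    c1.self_invalidate(words)
    c2.self_invalidate(words)
    return c1.state, c2.state, l2.entry
```

Calling `example_run()` leaves Core 1 with X Registered and Y Invalid, Core 2 with the mirror image, and the L2 pointing at the registered cores, with no invalidation message sent between the caches.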


DPJ Support for Safe Non-Determinism

• Nondeterminism comes from conflicting concurrent accesses
• Isolate these accesses as “atomic”
– Enclosed in “atomic” sections
– “Atomic” regions and effects
→ “Disciplined” non-determinism
– Race freedom, strong isolation
– Determinism-by-default semantics
• DeNovoND converts “atomic” statements into locks

[Figure: parallel tasks performing a store (ST) and a load (LD) inside atomic sections]
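The idea of lowering an “atomic” section to a lock can be illustrated with a Python analogue. This is only a sketch: DPJ is a Java extension, the `atomic` helper and the single global lock are hypothetical, and the slides do not say how DeNovoND chooses which lock guards which atomic region.

```python
import threading
from contextlib import contextmanager

# Assumption: one lock guards the whole atomic region used below.
_region_lock = threading.Lock()


@contextmanager
def atomic():
    """Run the enclosed block as a critical section (acquire on entry, release on exit)."""
    with _region_lock:
        yield


counter = {"value": 0}


def task(iterations):
    for _ in range(iterations):
        with atomic():
            # Conflicting concurrent accesses are confined to the atomic block.
            counter["value"] += 1


def run(num_tasks=4, iterations=1000):
    counter["value"] = 0
    threads = [threading.Thread(target=task, args=(iterations,)) for _ in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]
```

The order in which tasks enter the atomic block is not fixed, so the schedule is non-deterministic, but each update is isolated and the final count is always the same.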


Outline

• Motivation
• Background
• DeNovoND Design
– Memory Consistency Model
– Distributed Queue-based Lock
• DeNovoND Implementation
• Evaluation
• Conclusion and Future Work


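Memory Consistency Model

• Deterministic accesses: a load (LD 0xa) returns the value of the last store (ST 0xa) from
1. Same task in this parallel phase
2. Or before this parallel phase

[Figure: ST 0xa followed by LD 0xa across a parallel phase, enforced by the DeNovo coherence mechanism]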


Memory Consistency Model

• Non-deterministic accesses: a load (LD 0xa) returns the value of the last store (ST 0xa) from
1. Same task in this parallel phase
2. Or before this parallel phase
3. Or in preceding critical sections

[Figure: ST 0xa in critical sections preceding LD 0xa within a parallel phase]
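The three sources a load may read from can be expressed as a filter over a store trace. This is a rough sketch with hypothetical names: it models critical sections by their position in lock-acquisition order and only selects the candidate stores; picking the last one among them would need an ordering that the slides do not spell out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Store:
    phase: int                              # parallel phase in which the store ran
    task: int                               # task that issued it
    value: int
    critical_section: Optional[int] = None  # position in lock order, None if outside


def visible_stores(stores, phase, task, critical_section=None):
    """Stores a load in (phase, task) may observe.

    critical_section is None for a deterministic load, otherwise the
    position of the load's critical section in lock order.
    """
    visible = []
    for s in stores:
        same_task = s.phase == phase and s.task == task  # rule 1
        earlier_phase = s.phase < phase                  # rule 2
        preceding_cs = (                                 # rule 3 (non-deterministic only)
            critical_section is not None
            and s.phase == phase
            and s.critical_section is not None
            and s.critical_section < critical_section
        )
        if same_task or earlier_phase or preceding_cs:
            visible.append(s)
    return visible
```

A deterministic load never sees a store from another task in the same phase, while an atomic load additionally sees stores made in critical sections that ran before its own.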


Coherence for non-deterministic data

• When to invalidate?
– Between the start of critical section and any read
• What to invalidate?
– Entire cache? Regions with “atomic” effect?
– Track atomic writes in a signature, transfer with lock
• Registration
– Writer updates before next critical section
• Coherence Enforcement
1. Invalidate stale copies in private cache
2. Track up-to-date copy


Distributed Queue-based Lock

• Lock primitive that works on DeNovoND
– No directory, no write invalidation
– No spinning for lock
• Modeled after QOSB Lock
– Lock requests form a distributed queue
– But much simpler
• Details in the paper
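The general shape of a queue-based lock, where requests chain into a queue and the lock is handed directly to the next waiter, can be modeled at the message level. This is an illustrative sketch of the generic technique, not the DeNovoND protocol from the paper: the class, its fields, and the choice to keep only the queue tail in one shared place are assumptions.

```python
class QueueLock:
    """Single-threaded message-level model of a queue-based lock."""

    def __init__(self):
        self.tail = None      # last requester (assumed to be tracked at the shared cache)
        self.holder = None    # core currently holding the lock
        self.successor = {}   # core id -> next core waiting behind it

    def acquire(self, core):
        """Return True if the lock is granted immediately, False if the core is queued."""
        prev, self.tail = self.tail, core
        if prev is None:
            self.holder = core
            return True
        # The request is forwarded to the previous tail, which records its successor.
        # The waiter sleeps until the hand-off instead of spinning on a cached copy.
        self.successor[prev] = core
        return False

    def release(self, core):
        """Hand the lock to the successor, if any, and return the new holder."""
        assert self.holder == core, "only the holder may release"
        nxt = self.successor.pop(core, None)
        if nxt is None:
            self.tail = None
        self.holder = nxt
        return nxt
```

Handing the lock from releaser to successor in one message is what removes the need for waiters to observe a remote write, which a protocol without write invalidations could not deliver to a spinning core.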


Outline

• Motivation
• Background
• DeNovoND Design
• DeNovoND Implementation
• Evaluation
• Conclusion and Future Work


Access Signatures

• Simple and small hardware Bloom filter per core
– Track accesses with “atomic” effects only
– 256 bits suffice
• Operations on Bloom filter
– On write: insert address
– On read: query filter for address for self-invalidation
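A 256-bit signature with the two operations above can be sketched in software. The hash construction here is an assumption made for illustration (bytes of a SHA-256 digest, each of which indexes one of the 256 bits); the slides do not describe the hardware hash functions or how many are used.

```python
import hashlib


class Signature:
    """Software model of a 256-bit Bloom filter over word addresses."""

    BITS = 256

    def __init__(self, num_hashes=2):  # number of hash functions is hypothetical
        self.bits = 0
        self.num_hashes = num_hashes

    def _positions(self, addr):
        digest = hashlib.sha256(addr.to_bytes(8, "little")).digest()
        # Each digest byte is in 0..255, so it indexes the 256-bit vector directly.
        return [digest[i] for i in range(self.num_hashes)]

    def insert(self, addr):
        """On a write with atomic effect: record the address."""
        for p in self._positions(addr):
            self.bits |= 1 << p

    def may_contain(self, addr):
        """On a read: True means the cached copy may be stale and should be invalidated."""
        return all((self.bits >> p) & 1 for p in self._positions(addr))

    def reset(self):
        self.bits = 0
```

A Bloom filter never reports a recorded address as absent, so a stale word is always caught; a false positive only costs an unnecessary self-invalidation and refetch.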


Example Run

X in DeNovo-region, Y in DeNovo-region, Z in atomic DeNovo-region, W in atomic DeNovo-region.

[Figure: cache states of X, Y, Z and W in L1 of Core 1, L1 of Core 2 and the Shared L2 across two lock transfers. Events shown: ST and LD inside critical sections, Registration and Ack, Read miss, lock transfer carrying the signature (Z, W), and self-invalidate( ) with filter reset.]


Optimization to reduce self-invalidation

1. Loads in Registered state
2. “Touched-atomic” bit
– Set on first atomic load
– Subsequent loads don’t self-invalidate
• More in the paper

[Figure: X and Y in DeNovo-region, Z and W in atomic DeNovo-region; a ST and repeated LDs after self-invalidate( )]
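The two optimizations can be read as early exits on the atomic load path. This sketch uses hypothetical names, represents the signature as a plain set of addresses instead of a Bloom filter, and assumes the touched bits are cleared when a new critical section begins, which the slide does not state.

```python
def atomic_load(word, signature):
    """Return 'hit' or 'miss' for a load with atomic effect.

    word: dict with 'addr', 'state' ('I', 'V' or 'R') and 'touched' (bool).
    signature: set of addresses written in preceding critical sections.
    """
    if word["state"] == "R":
        return "hit"  # 1. Registered: this core holds the up-to-date copy
    if word["state"] == "V" and word["touched"]:
        return "hit"  # 2. already checked earlier in this critical section
    word["touched"] = True  # set on first atomic load
    if word["state"] == "V" and word["addr"] not in signature:
        return "hit"  # no recorded remote write, cached copy is usable
    word["state"] = "V"  # Invalid or possibly stale: self-invalidate and refetch
    return "miss"


def on_lock_acquire(cache):
    """Assumption: touched bits are cleared at the start of a critical section."""
    for word in cache:
        word["touched"] = False
```

With the touched bit set, a second load to the same word in the same critical section hits even though its address is still in the signature.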


Overheads

• Hardware Bloom filter
– 256 bits per core
• Storage overhead
– One additional state, but no storage overhead (2 bits)
– “Touched-atomic” bit per word in L1
• Communication overhead
– Bloom filter piggybacked on lock transfer message
– Writeback messages for locks: lock writebacks carry more info


Evaluation Methodology

• Simulator: Simics + GEMS + Garnet
• System Parameters
– 16 in-order cores
• Workloads
– SPLASH-2, PARSEC and STAMP
– Unchanged except region/effect and self-invalidation
• Protocols
– MESI and DeNovoND
– With idealized locks and realistic locks


MESI vs. DeNovoND: Idealized lock

• DeNovoND performs comparably to MESI for all apps
– For both DIL-INF and DIL-256

[Figure: results for barnes, ocean, water, fluidanimate, streamcluster, tsp, kmeans, ssca2]


MESI vs. DeNovoND: Realistic lock

• pthread lock vs. distributed queue-based lock
• DeNovoND performs comparably to or better than MESI



Network Traffic (Realistic lock)

• DeNovoND has 33% less traffic than MESI (67% max)
– No invalidation traffic
– Reduced load misses due to lack of false sharing



Conclusions and Future Work

• DeNovoND: Efficient HW support for non-determinism
– Minimal additional HW for safe non-determinism
– Comparable performance to MESI
– 30% lower network traffic than MESI
– PLUS all advantages of DeNovo for deterministic codes
• Future work: broaden the application space further
– Pipeline parallelism, “lock-free” data structures, OS, legacy codes…
