Illinois-Intel Parallelism Center
DeNovoND: Efficient Hardware Support for Disciplined Non-Determinism
Hyojin Sung, Rakesh Komuravelli, and Sarita V. Adve
Department of Computer Science
University of Illinois at Urbana-Champaign
Motivation
• Shared memory is de-facto model for multicore SW and HW
• BUT …
– Complex SW: data races, unstructured parallelism, memory model, …
– Inefficient HW: complex coherence/consistency, unnecessary traffic, …
• Recent work on disciplined shared memory
– SW: Easier programming model
– HW: If SW is more disciplined, can we build more efficient HW?
• DeNovo: Holistic rethinking of entire memory hierarchy
Disciplined Shared Memory
Disciplined Shared-Memory =
Global address space +
Implicit, anywhere communication, synchronization
Explicit, structured side-effects
Disciplined Shared Memory
Deterministic Parallel Java (DPJ), OOPSLA ‘09 – strong safety properties
• Determinism-by-default, simple semantics
DeNovo, PACT ‘11 – performance, complexity and power efficient
• Simplify coherence and consistency
Disciplined Shared Memory: explicit effects + structured parallel control
Limitation
• DeNovo for deterministic programs
– Important assumptions
1. No conflicting concurrent accesses, only barrier synchronization
2. Known side-effects
– Allowed DeNovo to eliminate design complexity and inefficiency
• Challenges for nondeterministic programs
– The assumptions do not hold any more
1. Can have conflicting concurrent accesses, support lock synchronization
2. Side-effects unknown in critical sections
– Applications with lock-based non-determinism are common
Contribution
Deterministic Parallel Java (DPJ) – strong safety properties
• Determinism-by-default, simple semantics
• Explicit & safe non-determinism (POPL ‘11)
DeNovoND: Non-deterministic codes with benefits of DeNovo
• Minimal additional HW for non-determinism
• Comparable performance to MESI
• 30% lower network traffic than MESI
• PLUS all advantages of DeNovo for deterministic codes
Disciplined Shared Memory: explicit effects + structured parallel control
Outline
• Motivation
• Background
– DPJ/DeNovo for deterministic codes
– DPJ support for disciplined non-determinism
• DeNovoND Design
• DeNovoND Implementation
• Evaluation
• Conclusion and Future Work
DPJ for Deterministic Codes
• Java-compatible type system
• Structured parallel control
– Fork-join parallelism
• Explicit region and effect
– Regions divide heap
– Read or write effects on regions
• Data-race freedom guarantee
– Simple, modular type checking
[Figure: heap divided into regions; parallel tasks issue ST and LD accesses annotated with a write effect]
Hardware – simplify coherence problems!
DeNovo for Deterministic Codes
• Coherence Enforcement
1. Invalidate stale copies in private cache
2. Track up-to-date copy
• Explicit effects
– Compiler knows all writeable regions in this parallel phase
– Cache can self-invalidate before next parallel phase
• Registration
– Directory keeps track of one up-to-date copy
– Writer registers itself before next parallel phase
DeNovo for Deterministic Codes
• No space overhead
– Keep valid data or registered core id
– LLC data arrays double as directory
• No transient states
• No invalidation traffic
• No false sharing
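[Figure: state diagram with the three states Invalid, Valid, and Registered, connected by Read and Write transitions; the shared cache is labeled as the registry]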
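The three states and the self-invalidation/registration steps on these slides can be sketched as a small simulation. This is a minimal sketch, not the DeNovo hardware: the class name `L1Cache`, the dictionary used as the registry, and the per-word granularity are assumptions made for illustration, and it only covers the deterministic case where no two cores write the same word in one phase.

```python
from enum import Enum


class State(Enum):
    INVALID = "I"
    VALID = "V"
    REGISTERED = "R"


class L1Cache:
    """Hypothetical per-core L1 that keeps one DeNovo state per word."""

    def __init__(self, core_id, valid_words=()):
        self.core_id = core_id
        self.state = {addr: State.VALID for addr in valid_words}

    def read(self, addr):
        # A read of an Invalid word misses and fetches the up-to-date copy.
        hit = self.state.get(addr, State.INVALID) is not State.INVALID
        if not hit:
            self.state[addr] = State.VALID
        return hit

    def write(self, addr, registry):
        # A write ends in Registered and records the writer at the shared cache.
        self.state[addr] = State.REGISTERED
        registry[addr] = self.core_id

    def self_invalidate(self, written_words):
        # At a phase boundary, drop Valid copies of words the phase may have
        # written; Registered words are this core's own up-to-date copies.
        for addr in written_words:
            if self.state.get(addr) is State.VALID:
                self.state[addr] = State.INVALID


# Two-core run: X and Y start Valid in both L1s.
registry = {}                          # word -> registered core id (shared L2)
c1 = L1Cache(1, ["X", "Y"])
c2 = L1Cache(2, ["X", "Y"])
c1.write("X", registry)                # Core 1: ST X
c2.write("Y", registry)                # Core 2: ST Y
for cache in (c1, c2):
    cache.self_invalidate(["X", "Y"])  # end of the parallel phase
```

No invalidation messages are sent between the two L1s here: each core discards its own possibly stale words, which is the property the "No invalidation traffic" bullet refers to.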
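Example Run
[Figure: example run with L1 of Core 1, L1 of Core 2, and a shared L2; X and Y are in DeNovo regions. Legend: R = Registered, V = Valid, I = Invalid. Both L1s start with V X, V Y. Core 1 stores to X and Core 2 stores to Y; each sends a Registration to the L2 and receives an Ack, so the L2 holds R C1 for X and R C2 for Y. After self-invalidate( ), Core 1 holds R X, I Y and Core 2 holds I X, R Y.]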
DPJ Support for Safe Non-Determinism
• Nondeterminism comes from conflicting concurrent accesses
• Isolate these accesses as “atomic”
– Enclosed in “atomic” sections
– “Atomic” regions and effects
→ “Disciplined” non-determinism
– Race freedom, strong isolation
– Determinism-by-default semantics
• DeNovoND converts “atomic” statements into locks
[Figure: parallel tasks with ST and LD accesses]
Outline
• Motivation
• Background
• DeNovoND Design
– Memory Consistency Model
– Distributed Queue-based Lock
• DeNovoND Implementation
• Evaluation
• Conclusion and Future Work
Memory Consistency Model
• Deterministic accesses: a load sees the last store from
1. Same task in this parallel phase
2. Or before this parallel phase
[Figure: LD 0xa and ST 0xa relative to a parallel phase; handled by the DeNovo coherence mechanism]
Memory Consistency Model
• Non-deterministic accesses: a load sees the last store from
1. Same task in this parallel phase
2. Or before this parallel phase
3. Or in preceding critical sections
[Figure: LD 0xa and two ST 0xa accesses in critical sections within a parallel phase]
Coherence for non-deterministic data
• Coherence Enforcement
1. Invalidate stale copies in private cache
2. Track up-to-date copy
• When to invalidate?
– Between the start of critical section and any read
• What to invalidate?
– Entire cache? Regions with “atomic” effect?
– Track atomic writes in a signature, transfer with lock
• Registration
– Writer updates before next critical section
Distributed Queue-based Lock
• Lock primitive that works on DeNovoND
– No directory, no write invalidation → no spinning for lock
• Modeled after QOSB Lock
– Lock requests form a distributed queue
– But much simpler
• Details in the paper
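The queue idea can be illustrated with a toy model in which waiters are chained instead of spinning on a shared lock word. This is a sketch of the general queue-based hand-off pattern only, not the DeNovoND or QOSB protocol: the class `QueueLock`, the `tail` pointer kept at the shared cache, and the `next_of` table are all hypothetical names chosen for illustration.

```python
class QueueLock:
    """Toy model: a tail pointer at the shared cache, next pointers at cores."""

    def __init__(self):
        self.tail = None     # last requester (assumed to live at the shared L2)
        self.next_of = {}    # core -> core queued directly behind it
        self.holder = None

    def acquire(self, core):
        # The new requester becomes the tail of the queue.
        prev, self.tail = self.tail, core
        if prev is None:
            self.holder = core         # lock was free: granted immediately
            return True
        self.next_of[prev] = core      # request is forwarded to the old tail
        return False                   # core waits for a hand-off, no spinning

    def release(self, core):
        assert self.holder == core, "only the holder may release"
        successor = self.next_of.pop(core, None)
        if successor is None:
            self.tail = None           # nobody waiting: lock becomes free
        self.holder = successor        # direct hand-off to the next waiter
        return successor


lock = QueueLock()
granted = [lock.acquire(c) for c in (1, 2, 3)]   # core 1 wins, 2 and 3 queue
order = [lock.release(1), lock.release(2), lock.release(3)]
```

Each release is a single message to a known successor, which is why a design like this does not depend on invalidating a lock word that other cores are polling.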
Outline
• Motivation
• Background
• DeNovoND Design
• DeNovoND Implementation
• Evaluation
• Conclusion and Future Work
Access Signatures
• Simple and small hardware Bloom filter per core
– Track accesses with “atomic” effects only
– Only 256 bits suffice
• Operations on Bloom filter
– On write: insert address
– On read: query filter for address for self-invalidation
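A 256-bit signature with the two operations listed above can be sketched in software as follows. This is a minimal sketch under stated assumptions: the slides give only the filter size, so the class name `AccessSignature`, the number of hash functions, and the multiplicative hash are placeholders, not the hardware hash used by DeNovoND.

```python
class AccessSignature:
    """Hypothetical 256-bit Bloom filter over addresses written atomically."""

    BITS = 256

    def __init__(self, num_hashes=2):
        self.bits = 0                  # the whole filter as one integer
        self.num_hashes = num_hashes   # assumed; not given on the slides

    def _positions(self, addr):
        # Placeholder hash family: odd multipliers of a fixed constant.
        for i in range(self.num_hashes):
            yield ((addr * (2 * i + 1) * 0x9E3779B1) >> 7) % self.BITS

    def insert(self, addr):
        # On an atomic write: record the address.
        for pos in self._positions(addr):
            self.bits |= 1 << pos

    def may_contain(self, addr):
        # On an atomic read: True means "possibly written, self-invalidate".
        return all((self.bits >> pos) & 1 for pos in self._positions(addr))

    def reset(self):
        self.bits = 0


sig = AccessSignature()
sig.insert(0x1000)                     # atomic ST to 0x1000
stale = sig.may_contain(0x1000)        # a later atomic LD must self-invalidate
```

A Bloom filter can report an address that was never inserted, but it never misses one that was. For this use, a false positive costs only an unnecessary self-invalidation and refetch, so a small filter affects performance rather than correctness.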
Example Run
[Figure: example run with L1 of Core 1, L1 of Core 2, and a shared L2. X and Y are in DeNovo regions; Z and W are in atomic DeNovo regions. The run shows ST and LD accesses to Z and W inside critical sections, Registration/Ack and read-miss messages to the L2, lock transfers that carry the signature of atomic writes (Z, W), and a final self-invalidate( ) that also resets the filter.]
Optimization to reduce self-invalidation
1. Loads in Registered state
2. “Touched-atomic” bit
– Set on first atomic load
– Subsequent loads don’t self-invalidate
• More in the paper
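The two optimizations can be read as early exits in front of the signature check. The function below is a sketch of that decision only: the name `atomic_load`, the dictionary representation of a cached word, and the `signature_hit` argument (the result of querying the write signature) are assumptions for illustration. The slides do not say when touched-atomic bits are cleared, so that step is deliberately not modeled.

```python
def atomic_load(word, signature_hit):
    """Decide whether an atomic load hits or must refetch.

    word: dict with 'state' in {'I', 'V', 'R'} and a boolean 'touched' bit.
    signature_hit: True if the write signature may contain this address.
    """
    if word["state"] == "R":
        return "hit"          # optimization 1: this core owns the latest copy
    if word["touched"]:
        return "hit"          # optimization 2: already loaded atomically
    word["touched"] = True    # first atomic load sets the bit
    if word["state"] == "I" or signature_hit:
        word["state"] = "V"   # drop the stale copy (if any) and refetch
        return "miss"
    return "hit"              # Valid and not in the signature: still usable


z = {"state": "V", "touched": False}
first = atomic_load(z, signature_hit=True)    # stale: self-invalidate, refetch
second = atomic_load(z, signature_hit=True)   # touched bit avoids a 2nd miss
```

Without the touched bit, the second load would query the same signature, see the same hit, and throw away the copy that was just fetched.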
[Figure: X and Y in DeNovo regions, Z and W in atomic DeNovo regions; a ST followed by LDs, with self-invalidate( ) between the loads]
Overheads
• Hardware Bloom filter
– 256 bits per core
• Storage overhead
– One additional state, but no storage overhead (2 bits)
– “Touched-atomic” bit per word in L1
• Communication overhead
– Bloom filter piggybacked on lock transfer message
– Writeback messages for locks
• Lock writebacks carry more info
Evaluation Methodology
• Simulator: Simics + GEMS + Garnet
• System Parameters
– 16 in-order cores
• Workloads
– SPLASH-2, PARSEC and STAMP
– Unchanged except region/effect and self-invalidation
• Protocols
– MESI and DeNovoND
– With idealized locks and realistic locks
MESI vs. DeNovoND: Idealized lock
• DeNovoND performs comparably to MESI for all apps
– For both DIL-INF and DIL-256
[Figure: per-application results for barnes, ocean, water, fluidanimate, streamcluster, tsp, kmeans, ssca2]
MESI vs. DeNovoND: Realistic lock
• pthread lock vs. distributed queue-based lock
• DeNovoND performs comparably to or better than MESI
[Figure: per-application results for barnes, ocean, water, fluidanimate, streamcluster, tsp, kmeans, ssca2]
Network Traffic (Realistic lock)
• DeNovoND has 33% less traffic than MESI (67% max)
– No invalidation traffic
– Reduced load misses due to lack of false sharing
[Figure: per-application network traffic for barnes, ocean, water, fluidanimate, streamcluster, tsp, kmeans, ssca2]
Conclusions and Future Work
• DeNovoND: Efficient HW support for non-determinism
– Minimal additional HW for safe non-determinism
– Comparable performance to MESI
– 30% lower network traffic than MESI
– PLUS all advantages of DeNovo for deterministic codes
• Future work: broaden the application space further
– Pipeline parallelism, “lock-free” data structures, OS, legacy codes…