A Study of Different Instantiations of the OpenMP Memory
Model and Their Software Cache Implementations
Chen Chen, Joseph B. Manzano, Ge Gan, Guang R. Gao, Vivek Sarkar
April 21st, 2009
Outline
• The OpenMP memory model is not well-defined
• Our solution: Four well-defined instantiations of the OpenMP memory model
• Implementations – Cache Protocols
• Experimental Results
Situation of Shared Memory Parallel Programming —
3-Tier Hierarchy of Programming Models
• Joe parallel programmers: have some basic knowledge of parallel programming languages
• Parallel programming specialists: have some basic knowledge of memory consistency and cache organizations
• Computer system specialists: experts on parallel system architecture
The OpenMP Memory Model
• Temporary view
  – Register, cache, local storage
  – Caches variables
  – Not required
• Flush operation
  – Enforces consistency
  – Flush-set
  – Reordering restriction
  – Serialization requirement
• Data-race programs
  – Unspecified behavior

[Diagram: threads, each with a temporary view of the memory, connected through an interconnection network to main memory]

Is the OpenMP Memory Model well-defined?
The OpenMP Memory Model is not Well-defined
• Complex semantics of the temporary view
  – Some threads own temporary views, others do not
  – Why access memory if the temporary view already has a copy?
• Unspecified semantics of data-race programs
  – Why is reordering (between flush and memory accesses) still restricted?
  – Applications are limited to being data-race-free
• Unclear definition of the flush operation
  – Variables may “escape” the temporary view before a flush
  – The serialization requirement is unnecessary
Our Solution
• Simple semantics for the temporary view
  – We defined ModelIDEAL with very simple semantics
• Specified behavior for all programs
  – We defined four instantiations of the OpenMP memory model. Each one has specified semantics for any program. (They are equivalent for data-race-free programs.)
• Clear definition of the flush operation
  – We defined simple semantics for the flush operation
  – We introduced the non-deterministic flush operation to solve the space-limitation problem
ModelIDEAL

[Diagram: three threads, each with its own temporary view of the memory, connected through an interconnection network to main memory]

• Each thread owns a temporary view.
• Infinitely big space.
• Write: accesses the temporary view.
• Read: 1) accesses the temporary view (hit); or 2) accesses the main memory (miss).
• Flush: writes back the “dirty values” and discards all the values. (One thread.)
ModelGF

[Diagram: threads with temporary views, an interconnection network, and main memory]

• Limited space.
• Non-deterministic flush: a flush operation can be performed at any time. (To solve the limited-space problem.)
• Global flush: a flush operation on all of the threads.
ModelLF

[Diagram: threads with temporary views, an interconnection network, and main memory]

• Limited space.
• Non-deterministic flush.
• Local flush: a flush operation on one thread.
ModelRLF

[Diagram: threads with temporary views, an interconnection network, and main memory]

• Limited space.
• Non-deterministic flush.
• Acquire: discards the “clean values”.
• Release: writes back the “dirty values”.
• Barrier: acquire + release.
Implementations – Cache Protocols
• Each thread contains a cache which corresponds to its temporary view.
• Each operation is performed on one cache line.
• Per-location dirty bits in each cache line.
Centralized Directory for ModelGF
• A directory resides in the main memory
• It holds information about all the caches
• A flush looks up the directory and informs the involved threads
Cell Architecture
• Very small local storage per SPE
• Local storage stores both data and instructions
• SPEs access global memory by DMA transfers

[Diagram: the Power Processing Element (PPE), eight Synergistic Processing Elements (SPEs) each with 256 KB of local storage, and global memory, all connected by the Element Interconnect Bus (EIB)]
OPELL Framework
• An open source toolchain / runtime effort to implement OpenMP for the CBE

[Diagram: a single-source compiler generates sequential code for the PPE and parallel code for the SPEs. The PPE runtime system executes the sequential code, handles task assignments and remote function calls, and triggers the partition manager (PM) to execute tasks. On each SPE, the runtime system consists of a partition manager, which loads and unloads SPE code, and a software cache (SWC), which manages the software caches; the SWC resides in the local storage.]
Experimental Testbeds
• Hardware (PlayStation 3)
  – 3.2 GHz Cell Broadband Engine CPU (with 6 accessible SPEs)
  – 256 MB global shared memory
• Software framework (OPELL)
  – An open source toolchain / runtime effort to implement OpenMP for the CBE
• Benchmarks
  – RandomAccess and Stream from the HPC Challenge benchmark suite
  – Integer Sort (IS), Embarrassingly Parallel (EP), and Multigrid (MG) from the NAS Parallel Benchmarks
Summary of Main Results
• Performance and scalability
  – ModelLF consistently outperforms ModelGF
  – ModelLF shows good scalability
• Impact of cache line eviction
  – Cache line eviction has a significant impact on the performance gap between ModelLF and ModelGF
  – The gap grows as the cache size (per core/thread) shrinks
• Programmability
  – In our experiments, few changes to the OpenMP code are needed
  – In other words, the performance advantage is achieved without compromising programmability
ModelGF vs. ModelLF on Execution Time (Cache size = 32K)

Performance improvement: EP-A: 1.53×; IS-W: 1.32×; MG-W: 1.19×; RandomAccess: 1.05×; Stream: 1.36×

[Chart: normalized execution time of ModelLF and ModelGF for EP-A, IS-W, MG-W, RandomAccess, and Stream]

ModelLF consistently outperforms ModelGF
Speedup as a Function of the Number of SPEs under ModelLF

IS-W and EP-W achieve almost linear speedup. MG-W performs worse because of unbalanced workloads (with 3, 5, or 6 SPEs).

[Chart: speedup of MG-W, IS-W, and EP-W versus the number of threads (SPEs), from 1 to 6]
ModelGF vs. ModelLF on Execution Time and Cache Eviction Ratio for IS-W

The difference in normalized execution time increased from 0.15 to 0.25 as the cache size per SPE was decreased from 64 KB to 4 KB. The two curves of cache eviction ratio overlap because the cache settings are completely identical.

[Charts: normalized execution time and cache eviction ratio (0% to 10%) of ModelGF and ModelLF for cache sizes from 4K to 64K]
ModelGF vs. ModelLF on Execution Time and Cache Eviction Ratio for MG-W

The difference in normalized execution time increased from 0.04 to 0.16 as the cache size per SPE was decreased from 32 KB to 4 KB. The two curves of cache eviction ratio overlap because the cache settings are completely identical.

[Charts: normalized execution time and cache eviction ratio (0% to 1%) of ModelGF and ModelLF for cache sizes from 4K to 32K]
Conclusion and Future Work
• Our contributions
  – Formalization of the OpenMP memory model
  – Performance studies of ModelGF and ModelLF
• Future work
  – Studies of ModelRLF
  – Tests on more benchmarks
  – Evaluations on more many-core architectures
Acknowledgement
• This work was supported by NSF (CNS-0509332, CSR-0720531, CCF-0833166, CCF-0702244), and other government sponsors.
• Joseph B. Manzano, Ge Gan, Guang R. Gao, and Vivek Sarkar are co-authors of the paper
• Joseph B. Manzano and Guang R. Gao gave many useful comments on the slides
BACKUP
Example (1)

X = 0; p = &X; q = &X;
#pragma omp parallel sections
{
    #pragma omp section
    {   // Section 1, assume it is running on thread T1.
1:      *p = 1;
2:      #pragma omp flush (X)
    }
    #pragma omp section
    {   // Section 2, assume it is running on thread T2.
3:      *q = 2;
4:      #pragma omp flush (X)
    }
    #pragma omp section
    {   // Section 3, assume it is running on thread T3.
5:      #pragma omp flush (X) // Assume that compiler cannot establish that p == q
6:      v1 = *p;
7:      v2 = *q;
8:      v3 = *p;
    }
}
ModelIDEAL: v1, v2, and v3 always read the same value. (E.g., {v1==v2==v3==0}, or all 1, or all 2.)
ModelGF: v1, v2 and v3 may read different values. (E.g. {v1==1, v2==v3==2} if there is a non-deterministic flush between statements 6 and 7.)
ModelLF: v1, v2 and v3 may read different values.
The order 1-3-2-4-5-6-7-8 results in {v1==v2==v3==2} under ModelIDEAL; {v1==v2==v3==1} under ModelGF; and {v1==v2==v3==2} under ModelLF.
Example (2)

X = 0; p = &X; q = &X;
#pragma omp parallel sections
{
    #pragma omp section
    {   // Section 1, assume it is running on thread T1.
1:      *p = 1;
2:      #pragma omp critical // A flush (acquire) here.
3:      v1 = *p;
4:      // A flush (release) here.
    }
    #pragma omp section
    {   // Section 2, assume it is running on thread T2.
5:      *q = 2;
6:      #pragma omp critical // A flush (acquire) here.
7:      v1 = *q;
8:      // A flush (release) here.
    }
}
ModelIDEAL, ModelGF, and ModelLF: Statements 2 and 6 will remove the values from the temporary views, so statements 3 and 7 have to access the main memory.
ModelRLF: Statements 2 and 6 are acquire operations. The values are preserved in the temporary views – Statements 3 and 7 can access the temporary views to get the values.
States of Cache Lines
• Invalid: All the words of the cache line are invalid.
• Clean: All the words of the cache line contain “clean values”.
• Dirty: All the words of the cache line contain “dirty values”.
• Clean-Dirty: Clean + Dirty
• Invalid-Dirty: Invalid + Dirty
State Transition Diagram for the Cache Protocol of ModelGF and ModelLF

[Diagram: transitions among the Invalid, Clean, Dirty, Clean-Dirty, and Invalid-Dirty states, triggered by read, write, and flush operations]
State Transition Diagram for the Cache Protocol of ModelRLF

[Diagram: transitions among the Invalid, Clean, Dirty, Clean-Dirty, and Invalid-Dirty states, triggered by read, write, acquire, release, barrier, and flush operations]
Overall Experimental Results: ModelGF vs. ModelLF

Performance (execution time) improvement:

Benchmark    | 4K Cache | 8K    | 16K   | 32K   | 64K
EP-A         |          |       |       | 1.53× |
IS-W         | 1.33×    | 1.32× | 1.32× | 1.32× | 1.32×
MG-W         | 1.20×    | 1.23× | 1.21× | 1.19× |
RandomAccess |          |       |       | 1.05× |
Stream       |          |       |       | 1.36× |

ModelLF consistently outperforms ModelGF