A Study of Different Instantiations of the OpenMP Memory
Model and Their Software Cache Implementations
Chen Chen, Joseph B. Manzano, Ge Gan, Guang R. Gao, Vivek Sarkar
April 21st, 2009
Outline
• The OpenMP memory model is not well-defined
• Our solution: Four well-defined instantiations of the OpenMP memory model
• Implementations – Cache Protocols
• Experimental Results
Situation of Shared Memory Parallel Programming —
3-Tier Hierarchy of Programming Models
• Joe parallel programmers: have some basic knowledge of parallel programming languages
• Parallel programming specialists: have some basic knowledge of memory consistency and cache organizations
• Computer system specialists: experts on parallel system architecture
The OpenMP Memory Model
• Temporary view
  – Register, cache, local storage
  – Caches variables
  – Not required
• Flush operation
  – Enforces consistency
  – Flush-set
  – Reordering restriction
  – Serialization requirement
• Data-race programs
  – Unspecified behavior

[Diagram: threads, each with a temporary view of the memory, connected through an interconnection network to main memory]

Is the OpenMP Memory Model well-defined?
The OpenMP Memory Model is not Well-defined
• Complex semantics of the temporary view
  – Some threads own temporary views, others do not
  – Why access memory if the temporary view already has a copy?
• Unspecified semantics of data-race programs
  – Why is reordering (between flush and memory accesses) still restricted?
  – Applications are limited to being data-race-free
• Unclear definition of the flush operation
  – Variables may “escape” the temporary view before a flush
  – The serialization requirement is unnecessary
Our Solution
• Simple semantics for the temporary view
  – We defined ModelIDEAL with very simple semantics
• Specified behavior for all programs
  – We defined four instantiations of the OpenMP memory model. Each one has specified semantics for any program. (They are equivalent for data-race-free programs.)
• Clear definition of the flush operation
  – We defined simple semantics for the flush operation
  – We introduced the non-deterministic flush operation to solve the space-limitation problem
ModelIDEAL

[Diagram: three threads, each with its own temporary view of the memory, connected through an interconnection network to main memory]

• Each thread owns a temporary view.
• Infinitely big space.
• Write: accesses the temporary view.
• Read: 1) accesses the temporary view (hit); or 2) accesses the main memory (miss).
• Flush: writes back the “dirty values” and discards all the values. (One thread.)
ModelGF

[Diagram: threads with temporary views, an interconnection network, and main memory]

• Limited space.
• Non-deterministic flush: a flush operation can be performed at any time. (To solve the limited-space problem.)
• Global flush: a flush operation on all of the threads.
ModelLF

[Diagram: threads with temporary views, an interconnection network, and main memory]

• Limited space.
• Non-deterministic flush.
• Local flush: a flush operation on one thread.
ModelRLF

[Diagram: threads with temporary views, an interconnection network, and main memory]

• Limited space.
• Non-deterministic flush.
• Acquire: discards the “clean values”.
• Release: writes back the “dirty values”.
• Barrier: acquire + release.
Implementations – Cache Protocols
• Each thread contains a cache which corresponds to its temporary view.
• Each operation is performed on one cache line.
• Per-location dirty bits in each cache line.
Centralized Directory for ModelGF
• A directory resides in the main memory
• It holds information about all the caches
• A flush looks up the directory and informs the involved threads
Cell Architecture
• Very small local storage per SPE
• Local storage stores both data and instructions
• SPEs access global memory by DMA transfers

[Diagram: the Power Processing Element (PPE), eight Synergistic Processing Elements (SPEs) each with 256 KB of local storage, and global memory, all connected by the Element Interconnect Bus (EIB)]
OPELL Framework
• An open source toolchain / runtime effort to implement OpenMP for the CBE

[Diagram: a single-source compiler generates sequential code for the PPE and parallel code for the SPEs. The PPE runtime system executes the sequential code, handles task assignments and remote function calls, and triggers the partition manager (PM) to execute tasks. On each SPE, the runtime system consists of a partition manager, which loads and unloads SPE code, and a software cache (SWC), which manages the software caches; the SWC resides in the local storage.]
Experimental Testbeds
• Hardware (PlayStation 3)
  – 3.2 GHz Cell Broadband Engine CPU (with 6 accessible SPEs)
  – 256 MB global shared memory
• Software framework (OPELL)
  – An open source toolchain / runtime effort to implement OpenMP for the CBE
• Benchmarks
  – RandomAccess and Stream from the HPC Challenge benchmark suite
  – Integer Sort (IS), Embarrassingly Parallel (EP), and Multigrid (MG) from the NAS Parallel Benchmarks
Summary of Main Results
• Performance and scalability
  – ModelLF consistently outperforms ModelGF
  – ModelLF shows good scalability
• Impact of cache line eviction
  – Cache line eviction has a significant impact on the performance gap between ModelLF and ModelGF
  – The gap grows as the cache size (per core/thread) shrinks
• Programmability
  – In our experiments, few changes to the OpenMP code are needed
  – In other words, the performance advantage is achieved without compromising programmability
ModelGF vs. ModelLF on Execution Time (Cache size = 32K)

Performance improvement: EP-A: 1.53×; IS-W: 1.32×; MG-W: 1.19×; RandomAccess: 1.05×; Stream: 1.36×

[Chart: normalized execution time of ModelLF and ModelGF for EP-A, IS-W, MG-W, RandomAccess, and Stream]

ModelLF consistently outperforms ModelGF
Speedup as a Function of the Number of SPEs under ModelLF

IS-W and EP-W achieve almost linear speedup. MG-W performs worse because of unbalanced workloads (with 3, 5, or 6 SPEs).

[Chart: speedup of MG-W, IS-W, and EP-W versus the number of threads (SPEs), from 1 to 6]
ModelGF vs. ModelLF on Execution Time and Cache Eviction Ratio for IS-W

The difference in normalized execution time increased from 0.15 to 0.25 as the cache size per SPE was decreased from 64 KB to 4 KB. The two curves of cache eviction ratio overlap because the cache settings are completely identical.

[Charts: normalized execution time and cache eviction ratio (0% to 10%) of ModelGF and ModelLF for cache sizes from 4K to 64K]
ModelGF vs. ModelLF on Execution Time and Cache Eviction Ratio for MG-W

The difference in normalized execution time increased from 0.04 to 0.16 as the cache size per SPE was decreased from 32 KB to 4 KB. The two curves of cache eviction ratio overlap because the cache settings are completely identical.

[Charts: normalized execution time and cache eviction ratio (0% to 1%) of ModelGF and ModelLF for cache sizes from 4K to 32K]
Conclusion and Future Work
• Our contributions
  – Formalization of the OpenMP memory model
  – Performance studies of ModelGF and ModelLF
• Future work
  – Studies of ModelRLF
  – Tests on more benchmarks
  – Evaluations on more many-core architectures
Acknowledgement
• This work was supported by NSF (CNS-0509332, CSR-0720531, CCF-0833166, CCF-0702244), and other government sponsors.
• Joseph B. Manzano, Ge Gan, Guang R. Gao, and Vivek Sarkar are co-authors of the paper
• Joseph B. Manzano and Guang R. Gao gave many useful comments on the slides
BACKUP
Example (1)

X = 0; p = &X; q = &X;
#pragma omp parallel sections
{
    #pragma omp section
    {   // Section 1, assume it is running on thread T1.
1:      *p = 1;
2:      #pragma omp flush (X)
    }
    #pragma omp section
    {   // Section 2, assume it is running on thread T2.
3:      *q = 2;
4:      #pragma omp flush (X)
    }
    #pragma omp section
    {   // Section 3, assume it is running on thread T3.
5:      #pragma omp flush (X) // Assume that compiler cannot establish that p == q
6:      v1 = *p;
7:      v2 = *q;
8:      v3 = *p;
    }
}
ModelIDEAL: v1, v2, and v3 always read the same value. (E.g., {v1==v2==v3==0}, or all 1, or all 2.)
ModelGF: v1, v2 and v3 may read different values. (E.g. {v1==1, v2==v3==2} if there is a non-deterministic flush between statements 6 and 7.)
ModelLF: v1, v2 and v3 may read different values.
The order 1-3-2-4-5-6-7-8 results in {v1==v2==v3==2} under ModelIDEAL; {v1==v2==v3==1} under ModelGF; and {v1==v2==v3==2} under ModelLF.
Example (2)

X = 0; p = &X; q = &X;
#pragma omp parallel sections
{
    #pragma omp section
    {   // Section 1, assume it is running on thread T1.
1:      *p = 1;
2:      #pragma omp critical // A flush (acquire) here.
3:      v1 = *p;
4:      // A flush (release) here.
    }
    #pragma omp section
    {   // Section 2, assume it is running on thread T2.
5:      *q = 2;
6:      #pragma omp critical // A flush (acquire) here.
7:      v1 = *q;
8:      // A flush (release) here.
    }
}
ModelIDEAL, ModelGF, and ModelLF: Statements 2 and 6 will remove the values from the temporary views, so statements 3 and 7 have to access the main memory.
ModelRLF: Statements 2 and 6 are acquire operations. The values are preserved in the temporary views – Statements 3 and 7 can access the temporary views to get the values.
States of Cache Lines
• Invalid: All the words of the cache line are invalid.
• Clean: All the words of the cache line contain “clean values”.
• Dirty: All the words of the cache line contain “dirty values”.
• Clean-Dirty: Clean + Dirty
• Invalid-Dirty: Invalid + Dirty
State Transition Diagram for the Cache Protocol of ModelGF and ModelLF

[Diagram: transitions among the Invalid, Clean, Dirty, Clean-Dirty, and Invalid-Dirty states, triggered by read, write, and flush operations]
State Transition Diagram for the Cache Protocol of ModelRLF

[Diagram: transitions among the Invalid, Clean, Dirty, Clean-Dirty, and Invalid-Dirty states, triggered by read, write, acquire, release, barrier, and flush operations]
Overall Experimental Results: ModelGF vs. ModelLF

Performance (execution time) improvement:

Benchmark    | 4K Cache | 8K    | 16K   | 32K   | 64K
EP-A         |          |       |       | 1.53× |
IS-W         | 1.33×    | 1.32× | 1.32× | 1.32× | 1.32×
MG-W         | 1.20×    | 1.23× | 1.21× | 1.19× |
RandomAccess |          |       |       | 1.05× |
Stream       |          |       |       | 1.36× |

ModelLF consistently outperforms ModelGF