Basic Concepts in Parallelization

Ruud van der Pas
Senior Staff Engineer, Oracle Solaris Studio
Oracle, Menlo Park, CA, USA

Tutorial IWOMP 2010 (RvdP/V1), June 14, 2010
IWOMP 2010, CCS, University of Tsukuba, Tsukuba, Japan, June 14-16, 2010

Outline

❑ Introduction

❑ Parallel Architectures

❑ Parallel Programming Models

❑ Data Races

❑ Summary

Introduction

Why Parallelization?

Parallelization is another optimization technique

The goal is to reduce the execution time

To this end, multiple processors, or cores, are used

[Figure: execution time on 1 core compared with 4 cores after parallelization]

Using 4 cores, the execution time is ideally 1/4 of the single-core time

What Is Parallelization?

"Something" is parallel if there is a certain level of independence in the order of operations

In other words, it doesn't matter in what order those operations are performed

This applies at every level of granularity:

● A sequence of machine instructions
● A collection of program statements
● An algorithm
● The problem you're trying to solve

What is a Thread?

Loosely said, a thread consists of a series of instructions with its own program counter (“PC”) and state

A parallel program executes threads in parallel

These threads are then scheduled onto processors

[Figure: four instruction streams, Thread 0 to Thread 3, each with its own program counter (PC), scheduled onto processors (P)]

Parallel overhead

❑ The total CPU time often exceeds the serial CPU time:

● The newly introduced parallel portions in your program need to be executed
● Threads need time sending data to each other and synchronizing (“communication”)
✔ Often the key contributor, spoiling all the fun

❑ Typically, things also get worse when increasing the number of threads

❑ Efficient parallelization is about minimizing the communication overhead

Communication

[Figure: wallclock time of serial execution compared with parallel execution, without and with communication]

Parallel - Without communication

● Embarrassingly parallel: 4x faster
● Wallclock time is ¼ of serial wallclock time

Parallel - With communication

● Additional communication
● Less than 4x faster
● Consumes additional resources
● Wallclock time is more than ¼ of serial wallclock time
● Total CPU time increases

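Load balancing

[Figure: per-thread timelines for perfect load balancing and for load imbalance, where 1, 2 or 3 threads are idle]

Perfect Load Balancing

● All threads finish in the same amount of time
● No thread is idle

Load Imbalance

● Different threads need a different amount of time to finish their task
● Threads that finish early are idle until the slowest thread is done
● Total wall clock time increases
● Program does not scale well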
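The effect of load imbalance can be put into numbers with a small sketch. It assumes the per-thread execution times are already known (for example from a profiler); the function names `wallclock_time` and `total_idle_time` are made up for this illustration and are not part of any library.

```c
#include <assert.h>
#include <stddef.h>

/* Wall clock time of a parallel region: the slowest thread decides. */
double wallclock_time(const double *thread_time, size_t nthreads)
{
    assert(nthreads > 0);
    double slowest = thread_time[0];
    for (size_t t = 1; t < nthreads; t++)
        if (thread_time[t] > slowest)
            slowest = thread_time[t];
    return slowest;
}

/* Sum of the time each thread spends waiting for the slowest thread. */
double total_idle_time(const double *thread_time, size_t nthreads)
{
    double wall = wallclock_time(thread_time, nthreads);
    double idle = 0.0;
    for (size_t t = 0; t < nthreads; t++)
        idle += wall - thread_time[t];
    return idle;
}
```

With four threads that each need 5 time units, the idle time is zero. If the same 20 units of work are split as 2, 3, 4 and 11, the wall clock time grows to 11 and the threads are idle for 24 units in total.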

About scalability

Define the speed-up S(P) as S(P) := T(1)/T(P)

The efficiency E(P) is defined as E(P) := S(P)/P

In the ideal case, S(P)=P and E(P)=P/P=1=100%

Unless the application is embarrassingly parallel, S(P) eventually starts to deviate from the ideal curve

Past this point Popt, the application sees less and less benefit from adding processors

Note that both metrics give no information on the actual run-time

As such, they can be dangerous to use

[Figure: speed-up S(P) versus number of processors P, with the ideal line and the point Popt marked]

In some cases, S(P) exceeds P

This is called "superlinear" behaviour

Don't count on this to happen though
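The two definitions translate directly into code. The sketch below assumes T(1) and T(P) are measured elapsed times; the function names are chosen for this example only.

```c
#include <assert.h>

/* Speed-up S(P) := T(1)/T(P), from two measured elapsed times. */
double speedup(double t1, double tp)
{
    assert(tp > 0.0);
    return t1 / tp;
}

/* Efficiency E(P) := S(P)/P, a value of 1.0 corresponds to 100%. */
double efficiency(double t1, double tp, int p)
{
    assert(p > 0);
    return speedup(t1, tp) / p;
}
```

For T(1) = 100 and T(4) = 25 this gives S(4) = 4 and E(4) = 1. For T(4) = 50 it gives S(4) = 2 and E(4) = 0.5. Neither number says whether 25 or 50 time units is acceptable, which is the caveat made above.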

Amdahl's Law

Assume our program has a parallel fraction “f”

This implies the execution time T(1) := f*T(1) + (1-f)*T(1)

On P processors: T(P) = (f/P)*T(1) + (1-f)*T(1)

Amdahl's law:

S(P) = T(1)/T(P) = 1 / (f/P + 1-f)

Comments:

▶ This "law" describes the effect the non-parallelizable part of a program has on scalability

▶ Note that the additional overhead caused by parallelization and speed-up because of cache effects are not taken into account
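As a minimal sketch, the formula can be evaluated as follows. The function names are illustrative. The second function uses the limit of S(P) for P going to infinity, which follows from the formula because the term f/P vanishes.

```c
#include <assert.h>

/* Amdahl's law: S(P) = 1 / (f/P + (1 - f)), with parallel fraction f. */
double amdahl_speedup(double f, int p)
{
    assert(f >= 0.0 && f <= 1.0);
    assert(p >= 1);
    return 1.0 / (f / p + (1.0 - f));
}

/* Upper bound on the speed-up for any number of processors: 1/(1 - f). */
double amdahl_limit(double f)
{
    assert(f >= 0.0 && f < 1.0);
    return 1.0 / (1.0 - f);
}
```

For f = 0.75 the speed-up can never exceed 4, no matter how many processors are used, and on 16 processors it is still below that bound.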

Amdahl's Law

It is easy to scale on a small number of processors

Scalable performance however requires a high degree of parallelization i.e. f is very close to 1

This implies that you need to parallelize that part of the code where the majority of the time is spent

[Figure: speed-up (0 to 16) versus number of processors (1 to 16) for f = 1.00, 0.99, 0.75, 0.50, 0.25 and 0.00]

Amdahl's Law in practice

We can estimate the parallel fraction “f”

Recall: T(P) = (f/P)*T(1) + (1-f)*T(1)

It is trivial to solve this equation for “f”:

f = (1 - T(P)/T(1))/(1 - (1/P))

Example:

T(1) = 100 and T(4) = 37 => S(4) = T(1)/T(4) = 2.70
f = (1-37/100)/(1-(1/4)) = 0.63/0.75 = 0.84 = 84%

Estimated performance on 8 processors is then:

T(8) = (0.84/8)*100 + (1-0.84)*100 = 26.5
S(8) = T(1)/T(8) = 3.77
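The estimate above can be reproduced with a few lines of C. This is a sketch under the same assumptions as Amdahl's law (no parallel overhead, no cache effects); the function names are invented for this example.

```c
#include <assert.h>

/* Solve T(P) = (f/P)*T(1) + (1-f)*T(1) for the parallel fraction f. */
double estimate_parallel_fraction(double t1, double tp, int p)
{
    assert(t1 > 0.0 && p > 1);
    return (1.0 - tp / t1) / (1.0 - 1.0 / p);
}

/* Predict T(P) from T(1) and an estimated parallel fraction f. */
double predict_time(double t1, double f, int p)
{
    assert(p >= 1);
    return (f / p) * t1 + (1.0 - f) * t1;
}
```

Calling `estimate_parallel_fraction(100.0, 37.0, 4)` gives approximately 0.84, and `predict_time(100.0, 0.84, 8)` gives approximately 26.5, in line with the worked example.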

Threads Are Getting Cheap .....

[Figure: elapsed time and speed-up for f = 84% versus number of cores used (1 to 16)]

Using 8 cores: 100/26.5 = 3.8x faster

Numerical results

Consider: A = B + C + D + E

Serial Processing

A = B + C
A = A + D
A = A + E

Parallel Processing

Thread 0: T1 = B + C
Thread 1: T2 = D + E
Thread 0: T1 = T1 + T2

☞ The roundoff behaviour is different and so the numerical results may be different too

☞ This is natural for parallel programs, but it may be hard to differentiate it from an ordinary bug ....
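The difference in roundoff can be demonstrated without any threads, simply by evaluating the two orders of summation. The sketch below assumes IEEE 754 double precision arithmetic; the function names and the input values are chosen for this illustration.

```c
#include <assert.h>

/* Serial order: ((B + C) + D) + E */
double sum_serial(double b, double c, double d, double e)
{
    double a = b + c;
    a = a + d;
    a = a + e;
    return a;
}

/* Order used by two threads: (B + C) + (D + E) */
double sum_two_threads(double b, double c, double d, double e)
{
    double t1 = b + c;   /* partial sum of "Thread 0" */
    double t2 = d + e;   /* partial sum of "Thread 1" */
    return t1 + t2;
}
```

With B = 1e16, C = 0.5, D = -1e16 and E = 0.5, the serial order yields 0.5 and the two-thread order yields 0.0, because 0.5 is lost whenever it is added to a number of magnitude 1e16. Both results are legitimate floating-point answers; neither is a bug.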

Cache Coherence

Typical cache based system

[Figure: CPU connected to memory through an L1 cache and an L2 cache, annotated with a question mark]

Cache line modifications

[Figure: a page in main memory containing a cache line, and a copy of that cache line in the caches, marked "inconsistency !"]

Caches in a parallel computer

❑ A cache line starts in memory

❑ Over time multiple copies of this line may exist

Cache Coherence ("cc"):

✔ Tracks changes in copies
✔ Makes sure correct cache line is used
✔ Different implementations possible
✔ Need hardware support to make it efficient

[Figure: processors CPU 0 to CPU 3, their caches, and memory; three cached copies of a line are marked "invalidated"]

Parallel Architectures

Uniform Memory Access (UMA)

❑ Also called "SMP" (Symmetric Multi Processor)

❑ Memory Access time is Uniform for all CPUs

❑ CPU can be multicore

❑ Interconnect is "cc":
● Bus
● Crossbar

❑ No fragmentation - Memory and I/O are shared resources

Pro

✔ Easy to use and to administer
✔ Efficient use of resources

Con

✔ Said to be expensive
✔ Said to be non-scalable

[Figure: several CPUs, each with a cache, connected through an interconnect to shared memory and I/O]

NUMA

❑ Also called "Distributed Memory" or NORMA (No Remote Memory Access)

❑ Memory Access time is Non-Uniform

❑ Hence the name "NUMA"

❑ Interconnect is not "cc":
● Ethernet, InfiniBand, etc, ......

❑ Runs 'N' copies of the OS

❑ Memory and I/O are distributed resources

Pro

✔ Said to be cheap
✔ Said to be scalable

Con

✔ Difficult to use and administer
✔ Inefficient use of resources

[Figure: several nodes, each with its own CPU, cache, memory (M) and I/O, connected through an interconnect]

The Hybrid Architecture

[Figure: SMP/MC nodes connected by a second-level interconnect]

❑ Second-level interconnect is not cache coherent
● Ethernet, InfiniBand, etc, ....

❑ Hybrid Architecture with all Pros and Cons:
● UMA within one SMP/Multicore node
● NUMA across nodes

cc-NUMA

[Figure: systems connected by a second-level interconnect]

❑ Two-level interconnect:
● UMA/SMP within one system
● NUMA between the systems

❑ Both interconnects support cache coherence i.e. the system is fully cache coherent

❑ Has all the advantages ('look and feel') of an SMP

❑ Downside is the Non-Uniform Memory Access time

Parallel Programming Models

How To Program A Parallel Computer?

❑ There are numerous parallel programming models

❑ The best-known ones are:

● Distributed Memory (any cluster and/or single SMP)
✔ Sockets (standardized, low level)
✔ PVM - Parallel Virtual Machine (obsolete)
✔ MPI - Message Passing Interface (de-facto std)

● Shared Memory (SMP only)
✔ Posix Threads (standardized, low level)
✔ OpenMP (de-facto standard)
✔ Automatic Parallelization (compiler does it for you)

Parallel Programming Models

Distributed Memory - MPI

What is MPI?

❑ MPI stands for the “Message Passing Interface”

❑ MPI is a very extensive de-facto parallel programming API for distributed memory systems (i.e. a cluster)
● An MPI program can however also be executed on a shared memory system

❑ First specification: 1994

❑ The current MPI-2 specification was released in 1997
● Major enhancements
✔ Remote memory management
✔ Parallel I/O
✔ Dynamic process management

More about MPI

❑ MPI has its own data types (e.g. MPI_INT)
● User defined data types are supported as well

❑ MPI supports C, C++ and Fortran
● Include file <mpi.h> in C/C++ and “mpif.h” in Fortran

❑ An MPI environment typically consists of at least:
● A library implementing the API
● A compiler and linker that support the library
● A run time environment to launch an MPI program

❑ Various implementations available
● HPC Clustertools, MPICH, MVAPICH, LAM, Voltaire MPI, Scali MPI, HP-MPI, .....

The MPI Programming Model

[Figure: a cluster of systems; processes 0 to 5, each with its own memory (M)]

The MPI Execution Model

[Figure: all processes start and call MPI_Init, exchange messages (arrows = communication), call MPI_Finalize, and then all processes finish]

The MPI Memory Model

[Figure: processes (P), each with its own private memory, connected by a transfer mechanism]

✔ All threads/processes have access to their own, private, memory only

✔ Data transfer and most synchronization has to be programmed explicitly

✔ All data is private

✔ Data is shared explicitly by exchanging buffers

Example - Send “N” Integers

#include <mpi.h>                          /* include file */

#define N 100                             /* message length (value assumed here) */

int main(int argc, char *argv[])
{
    int data_buffer[N] = {0};
    int me, error_code;
    int you = 0, him = 1;

    MPI_Init(&argc, &argv);               /* initialize MPI environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &me);   /* get process ID */

    if ( me == 0 ) {                      /* process 0 sends */
        error_code = MPI_Send(data_buffer, N, MPI_INT,
                              him, 1957, MPI_COMM_WORLD);
    } else if ( me == 1 ) {               /* process 1 receives */
        error_code = MPI_Recv(data_buffer, N, MPI_INT,
                              you, 1957, MPI_COMM_WORLD,
                              MPI_STATUS_IGNORE);
    }

    MPI_Finalize();                       /* leave the MPI environment */
    return 0;
}

Run time Behavior

Process 0 (me = 0, you = 0, him = 1)

● MPI_Send: N integers, destination = him = 1, label = 1957

Process 1 (me = 1, you = 0, him = 1)

● MPI_Recv: N integers, sender = you = 0, label = 1957

Yes ! Connection established

The Pros and Cons of MPI

❑ Advantages of MPI:

● Flexibility - Can use any cluster of any size
● Straightforward - Just plug in the MPI calls
● Widely available - Several implementations out there
● Widely used - Very popular programming model

❑ Disadvantages of MPI:

● Redesign of application - Could be a lot of work
● Easy to make mistakes - Many details to handle
● Hard to debug - Need to dig into underlying system
● More resources - Typically, more memory is needed
● Special care - Input/Output

Parallel Programming Models

Shared Memory - Automatic Parallelization

Automatic Parallelization (-xautopar)

❑ Compiler performs the parallelization (loop based)

❑ Different iterations of the loop executed in parallel

❑ Same binary used for any number of threads

for (i=0; i<1000; i++) a[i] = b[i] + c[i];

With OMP_NUM_THREADS=4:

Thread 0: iterations 0-249
Thread 1: iterations 250-499
Thread 2: iterations 500-749
Thread 3: iterations 750-999
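The division of iterations shown above is a block distribution. The sketch below computes such a distribution by hand; it is one common scheme and not necessarily what a particular compiler or runtime does. The type `range_t` and the function `static_chunk` are hypothetical names for this example.

```c
#include <assert.h>

/* Half-open iteration range [begin, end) assigned to one thread. */
typedef struct {
    int begin;
    int end;
} range_t;

/* Split n iterations into nthreads contiguous blocks of nearly equal size. */
range_t static_chunk(int n, int nthreads, int tid)
{
    assert(n >= 0 && nthreads > 0);
    assert(tid >= 0 && tid < nthreads);
    int base  = n / nthreads;   /* minimum block size             */
    int extra = n % nthreads;   /* first 'extra' threads get +1   */
    range_t r;
    r.begin = tid * base + (tid < extra ? tid : extra);
    r.end   = r.begin + base + (tid < extra ? 1 : 0);
    return r;
}
```

For n = 1000 and 4 threads, thread 0 gets [0, 250) and thread 3 gets [750, 1000), which matches the ranges listed above. When n is not a multiple of the thread count, the block sizes differ by at most one iteration.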

Parallel Programming Models

Shared Memory - OpenMP

http://www.openmp.org

[Figure: processors 0, 1, ..., P, each with memory (M), all attached to shared memory]

Shared Memory Model

[Figure: the programming model; threads (T), each with its own private memory, all attached to shared memory]

✔ All threads have access to the same, globally shared, memory

✔ Data can be shared or private

✔ Shared data is accessible by all threads

✔ Private data can only be accessed by the thread that owns it

✔ Data transfer is transparent to the programmer

✔ Synchronization takes place, but it is mostly implicit

About data

In a shared memory parallel program variables have a "label" attached to them:

☞ Labelled "Private" ⇨ Visible to one thread only
✔ Change made in local data is not seen by others
✔ Example - Local variables in a function that is executed in parallel

☞ Labelled "Shared" ⇨ Visible to all threads
✔ Change made in global data is seen by all others
✔ Example - Global data

Example - Matrix times vector

#pragma omp parallel for default(none) \
        private(i,j,sum) shared(m,n,a,b,c)
for (i=0; i<m; i++)
{
    sum = 0.0;
    for (j=0; j<n; j++)
        sum += b[i][j]*c[j];
    a[i] = sum;
}

[Figure: a = b * c, with row index i and column index j]

With two threads and m = 10, the rows are divided as follows:

TID = 0, for (i=0,1,2,3,4)      TID = 1, for (i=5,6,7,8,9)

i = 0                           i = 5
sum = Σ b[i=0][j]*c[j]          sum = Σ b[i=5][j]*c[j]
a[0] = sum                      a[5] = sum

i = 1                           i = 6
sum = Σ b[i=1][j]*c[j]          sum = Σ b[i=6][j]*c[j]
a[1] = sum                      a[6] = sum

... etc ...
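For reference, here is a self-contained variant of the matrix times vector loop as a function. It is a sketch with two assumptions that differ from the fragment above: the matrix is stored row by row in a one-dimensional array, and `sum` and `j` are declared inside the loop so that they are private without an explicit clause. The `#ifdef _OPENMP` guard lets the same source compile cleanly with or without OpenMP support.

```c
#include <assert.h>

/* a = b * c, where b is an m x n matrix stored row by row. */
void matvec(int m, int n, double *a, const double *b, const double *c)
{
    assert(m >= 0 && n >= 0);
    int i;
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (i = 0; i < m; i++) {
        double sum = 0.0;              /* local to the iteration: private */
        for (int j = 0; j < n; j++)
            sum += b[i * n + j] * c[j];
        a[i] = sum;                    /* each iteration writes its own a[i] */
    }
}
```

Every iteration of the outer loop writes a different element of `a` and only reads `b` and `c`, so the iterations are independent and can be distributed over the threads.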

A Black and White comparison

MPI                                      OpenMP

De-facto standard                        De-facto standard
Endorsed by all key players              Endorsed by all key players
Runs on any number of (cheap) systems    Limited to one (SMP) system
“Grid Ready”                             Not (yet?) “Grid Ready”
High and steep learning curve            Easier to get started (but, ...)
You're on your own                       Assistance from compiler
All or nothing model                     Mix and match model
No data scoping (shared, private, ..)    Requires data scoping
More widely used (but ....)              Increasingly popular (CMT !)
Sequential version is not preserved      Preserves sequential code
Requires a library only                  Need a compiler
Requires a run-time environment          No special environment
Easier to understand performance         Performance issues implicit

The Hybrid Parallel Programming Model

The Hybrid Programming Model

[Figure: two shared memory systems, each with processors 0, 1, ..., P and memory (M), coupled to each other as distributed memory]

Data Races

About Parallelism

Parallelism ⇨ Independence ⇨ No Fixed Ordering

"Something" that does not obey this rule is not parallel (at that level ...)

Shared Memory Programming

Threads communicate via shared memory

[Figure: threads (T), each with its own private memory, accessing variables X and Y in shared memory]

What is a Data Race?

❑ Two different threads in a multi-threaded shared memory program

❑ Access the same (=shared) memory location

● Asynchronously and
● Without holding any common exclusive locks and
● At least one of the accesses is a write/store

Example of a data race

#pragma omp parallel shared(n)
{n = omp_get_thread_num();}

[Figure: two threads (T), each with private memory, both writing (W) to the variable n in shared memory]

Another example

#pragma omp parallel shared(x)
{x = x + 1;}

[Figure: two threads (T), each with private memory, both reading and writing (R/W) the variable x in shared memory]
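One way to protect an update like this is the OpenMP `atomic` construct. The sketch below is an illustration rather than a fix taken from the slides: it performs the increment `n` times in a parallel loop, and the function name `count_with_atomic` is made up. A `reduction(+:x)` clause on the loop would be another option. The `#ifdef _OPENMP` guards keep the code valid when it is compiled without OpenMP.

```c
#include <assert.h>

/* Increment a shared counter n times; the atomic construct makes each
   read-modify-write of x indivisible, so no update is lost. */
long count_with_atomic(long n)
{
    assert(n >= 0);
    long x = 0;
    long i;
#ifdef _OPENMP
#pragma omp parallel for shared(x)
#endif
    for (i = 0; i < n; i++) {
#ifdef _OPENMP
#pragma omp atomic
#endif
        x += 1;
    }
    return x;
}
```

Without the `atomic` line, two threads can read the same old value of `x`, and one of the two increments is then lost. With it, the function returns `n` for any number of threads.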

About Data Races

❑ Loosely described, a data race means that the update of a shared variable is not well protected

❑ A data race tends to show up in a nasty way:

● Numerical results are (somewhat) different from run to run
● Especially with floating-point data, this is difficult to distinguish from a numerical side-effect
● Changing the number of threads can cause the problem to seemingly (dis)appear
✔ May also depend on the load on the system
● May only show up using many threads

A parallel loop

for (i=0; i<8; i++) a[i] = a[i] + b[i];

Thread 1            Thread 2

a[0]=a[0]+b[0]      a[4]=a[4]+b[4]
a[1]=a[1]+b[1]      a[5]=a[5]+b[5]
a[2]=a[2]+b[2]      a[6]=a[6]+b[6]
a[3]=a[3]+b[3]      a[7]=a[7]+b[7]

(Time runs from top to bottom)

Every iteration in this loop is independent of the other iterations

Not a parallel loop

for (i=0; i<8; i++) a[i] = a[i+1] + b[i];

Thread 1            Thread 2

a[0]=a[1]+b[0]      a[4]=a[5]+b[4]
a[1]=a[2]+b[1]      a[5]=a[6]+b[5]
a[2]=a[3]+b[2]      a[6]=a[7]+b[6]
a[3]=a[4]+b[3]      a[7]=a[8]+b[7]

(Time runs from top to bottom)

Thread 1 reads a[4] in its last iteration, while Thread 2 overwrites a[4] in its first iteration

The result is not deterministic when run in parallel !
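A common way to remove this kind of dependence is to read from a copy of the old values. The sketch below is an illustration and not a technique taken from the slides. It assumes `a` has n+1 elements and `b` has n elements, both of type integer, and the function names are invented. The parallel variant returns -1 if the temporary copy cannot be allocated.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sequential reference: a[i] = a[i+1] + b[i] for i = 0 .. n-1. */
void shift_add_serial(int *a, const int *b, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = a[i + 1] + b[i];
}

/* Same result, but every iteration reads from a snapshot of the old a,
   so no iteration depends on what another iteration writes. */
int shift_add_parallel(int *a, const int *b, int n)
{
    assert(n >= 0);
    size_t bytes = (size_t)(n + 1) * sizeof *a;
    int *old = malloc(bytes);
    if (old == NULL)
        return -1;
    memcpy(old, a, bytes);
    int i;
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (i = 0; i < n; i++)
        a[i] = old[i + 1] + b[i];
    free(old);
    return 0;
}
```

In the sequential loop, a[i+1] has not been modified yet when iteration i reads it, so reading from the snapshot gives the same values. The price is the extra memory and the time needed for the copy.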

About the experiment

❑ We manually parallelized the previous loop
● The compiler detects the data dependence and does not parallelize the loop

❑ Vectors a and b are of type integer

❑ We use the checksum of a as a measure for correctness:
● checksum += a[i] for i = 0, 1, 2, ...., n-2

❑ The correct, sequential, checksum result is computed as a reference

❑ We ran the program using 1, 2, 4, 32 and 48 threads
● Each of these experiments was repeated 4 times

Numerical results

threads: 1 checksum 1953 correct value 1953
threads: 1 checksum 1953 correct value 1953
threads: 1 checksum 1953 correct value 1953
threads: 1 checksum 1953 correct value 1953

threads: 2 checksum 1953 correct value 1953
threads: 2 checksum 1953 correct value 1953
threads: 2 checksum 1953 correct value 1953
threads: 2 checksum 1953 correct value 1953

threads: 4 checksum 1905 correct value 1953
threads: 4 checksum 1905 correct value 1953
threads: 4 checksum 1953 correct value 1953
threads: 4 checksum 1937 correct value 1953

threads: 32 checksum 1525 correct value 1953
threads: 32 checksum 1473 correct value 1953
threads: 32 checksum 1489 correct value 1953
threads: 32 checksum 1513 correct value 1953

threads: 48 checksum 936 correct value 1953
threads: 48 checksum 1007 correct value 1953
threads: 48 checksum 887 correct value 1953
threads: 48 checksum 822 correct value 1953

Data Race In Action !

Summary

Parallelism Is Everywhere

Multiple levels of parallelism:

Granularity          Technology     Programming Model

Instruction Level    Superscalar    Compiler
Chip Level           Multicore      Compiler, OpenMP, MPI
System Level         SMP/cc-NUMA    Compiler, OpenMP, MPI
Grid Level           Cluster        MPI

Threads Are Getting Cheap