Week 5 in the OpenHPI course on parallel programming concepts is about parallel applications in distributed systems. Find the whole course at http://bit.ly/1l3uD4h.
Parallel Programming Concepts OpenHPI Course Week 5 : Distributed Memory Parallelism Unit 5.1: Hardware
Dr. Peter Tröger + Teaching Team
Summary: Week 4
■ Accelerators enable major speedup for data parallelism □ SIMD execution model (no branching)
□ Memory latency managed with many light-weight threads ■ Tackle diversity with OpenCL
□ Loop parallelism with index ranges □ Kernels in C, compiled at runtime
□ Complex memory hierarchy supported ■ Getting fast is easy, getting faster is hard
□ Best practices for accelerators □ Hardware knowledge needed
2
What if my computational problem still demands more power?
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Parallelism for …
■ Speed – compute faster ■ Throughput – compute more in the same time
■ Scalability – compute faster / more with additional resources □ Huge scalability only with shared nothing systems □ Still also depends on application characteristics
[Figure: Processing elements A1–A3 and B1–B3, each group attached to its own main memory — adding processing elements to one machine is scaling up, adding machines is scaling out.]
3
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Parallel Hardware
■ Shared memory system □ Typically a single machine, common address space for tasks
□ Hardware scaling is limited (power / memory wall) ■ Shared nothing (distributed memory) system □ Tasks on multiple machines, can only access local memory □ Global task coordination by explicit messaging
□ Easy scale-out by adding machines to the network
4
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
[Figure: In a shared memory system, tasks on several processing elements access a common shared memory through caches; in a shared nothing system, each processing element has its own local memory and tasks coordinate via messages.]
Parallel Hardware
■ Shared memory system à collection of processors □ Integrated machine for capacity computing
□ Prepared for a large variety of problems ■ Shared-nothing system à collection of computers
□ Clusters and supercomputers for capability computing □ Installation to solve few problems in the best way
□ Parallel software must be able to leverage multiple machines at the same time
□ Difference to distributed systems (Internet, Cloud) ◊ Single organizational domain, managed as a whole ◊ Single parallel application at a time,
no separation of client and server application ◊ Hybrids are possible (e.g. HPC in Amazon AWS cloud)
5
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Shared Nothing: Clusters
■ Collection of stand-alone machines connected by a local network □ Cost-effective technique for a large-scale parallel computer
□ Users are builders, have control over their system □ Synchronization much slower than in shared memory □ Task granularity becomes an issue
6
[Figure: Tasks on separate machines, each with its own local memory, exchanging messages over the local network.]
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Shared Nothing: Supercomputers
■ Supercomputers / Massively Parallel Processing (MPP) systems □ (Hierarchical) cluster with a lot of processors
□ Still standard hardware, but specialized setup □ High-performance interconnection network □ For massive data-parallel applications, mostly simulations
(weapons, climate, earthquakes, airplanes, car crashes, ...) ■ Examples (Nov 2013)
□ BlueGene/Q, 1.5 million cores, 1.5 PB memory, 17.1 PFlops
□ Tianhe-2, 3.1 million cores, 1 PB memory, 17,808 kW power, 33.86 PFlops (quadrillions of calculations per second)
■ Annual ranking with the TOP500 list (www.top500.org)
7
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Example
8
Blue Gene/Q packaging hierarchy [© 2011 IBM Corporation, IBM System Technology Group]:
1. Chip: 16+2 µP cores
2. Single chip module
3. Compute card: one chip module, 16 GB DDR3 memory, heat spreader for H2O cooling
4. Node card: 32 compute cards, optical modules, link chips; 5D torus
5a. Midplane: 16 node cards
5b. I/O drawer: 8 I/O cards w/ 16 GB, 8 PCIe Gen2 x8 slots, 3D I/O torus
6. Rack: 2 midplanes
7. System: 96 racks, 20 PF/s
• Sustained single-node performance: 10x P, 20x L
• MF/Watt: (6x) P, (10x) L (~2 GF/W, Green 500 criteria)
• Software and hardware support for programming models for exploitation of node hardware concurrency
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Interconnection Networks
■ Bus systems □ Static approach, low costs
□ Shared communication path, broadcasting of information
□ Scalability issues with shared bus ■ Completely connected networks
□ Static approach, high costs □ Only direct links, optimal performance
■ Star-connected networks □ Static approach with central switch □ Less links, still very good performance □ Scalability depends on central switch
9
[Figure: Bus network, completely connected network, and star network with a central switch, each linking several processing elements (PEs).]
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Interconnection Networks
■ Crossbar switch □ Dynamic switch-based network
□ Supports multiple parallel direct connections without collisions
□ Fewer edges than a completely connected network, but still scalability issues
■ Fat tree
□ Use ‘wider’ links in higher parts of the interconnect tree
□ Combine tree design advantages with a solution for root node scalability
□ Communication distance between any two nodes is no more than 2 log #PEs
10
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
[Figure: A crossbar switch connecting PE1–PEn in both dimensions, and a fat tree with PEs at the leaves and switches whose link capacity grows toward the root.]
Interconnection Networks
■ Linear array ■ Ring
□ Linear array with connected endings ■ N-way D-dimensional mesh
□ Matrix of processing elements □ Not more than N neighbor links
□ Structured in D dimensions ■ N-way D-dimensional torus
□ Mesh with “wrap-around” connection
11
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
[Figure: Linear array, ring, mesh, and torus arrangements of processing elements.]
Point-to-point networks: ring and fully connected graph
• Ring has only two connections per PE (almost optimal)
• Fully connected graph – optimal connectivity (but high cost)
Mesh and Torus
• Compromise between cost and connectivity
4-way 2D torus 8-way 2D mesh 4-way 2D mesh
Example: Blue Gene/Q 5D Torus
■ 5D torus interconnect in Blue Gene/Q supercomputer □ 2 GB/s on all 10 links, 80ns latency to direct neighbors
□ Additional link for communication with I/O nodes
12
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
[IBM]
Parallel Programming Concepts OpenHPI Course Week 5 : Distributed Memory Parallelism Unit 5.2: Granularity and Task Mapping
Dr. Peter Tröger + Teaching Team
Workload
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
14
■ Last week showed that task granularity may be flexible □ Example: OpenCL work group size
■ But: Communication overhead becomes significant now □ What is the right level of task granularity ?
Surface-To-Volume Effect
■ Envision the work to be done (in parallel) as sliced 3D cube □ Not a demand on the application
data, just a representation ■ Slicing represents splitting into tasks
■ Computational work of a task □ Proportional to the volume of the cube slice □ Represents the granularity of decomposition
■ Communication requirements of the task □ Proportional to the surface of the cube slice
■ “communication-to-computation” ratio □ Fine granularity: Communication high, computation low □ Coarse granularity: Communication low, computation high
15
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
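A small worked example (not from the slides) makes the tradeoff concrete: assume a cube of 64³ cells split into p equal slices along one dimension. Each task computes on 64³/p cells (its volume), and an interior task exchanges its two 64² boundary faces (its surface) with its neighbors. With p = 4 the communication-to-computation ratio is 2·64² / (64³/4) ≈ 0.13; with p = 64 it rises to 2·64² / (64³/64) = 2, so finer slicing drives the ratio up.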
Surface-To-Volume Effect
16
[nicerweb.com]
■ Fine-grained decomposition for using all processing elements ?
■ Coarse-grained decomposition to reduce communication overhead ?
■ A tradeoff question !
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Surface-To-Volume Effect
■ Heatmap example with 64 data cells
■ Version (a): 64 tasks □ 64x4 = 256 messages, 256 data values
□ 64 processing elements used in parallel
■ Version (b): 4 tasks
□ 16 messages, 64 data values
□ 4 processing elements used in parallel
17
[Foster]
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Surface-To-Volume Effect
■ Rule of thumb □ Agglomerate tasks to avoid communication
□ Stop when parallelism is no longer exploited well enough □ Agglomerate in all dimensions at the same time
■ Influencing factors □ Communication technology + topology
□ Serial performance per processing element □ Degree of application parallelism
■ Task communication vs. network topology □ Resulting task graph must be
mapped to network topology □ Task-to-task communication
may need multiple hops
18
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
[Foster]
The Task Mapping Problem
■ Given … □ … a number of homogeneous processing elements
with performance characteristics, □ … some interconnection topology of the processing elements
with performance characteristics, □ … an application dividable into parallel tasks.
■ Questions: □ What is the optimal task granularity ? □ How should the tasks be placed on processing elements ? □ Do we still get speedup / scale-up by this parallelization ?
■ Task mapping is still research, mostly manual tuning today ■ More options with configurable networks / dynamic routing
□ Reconfiguration of hardware communication paths
19
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Parallel Programming Concepts OpenHPI Course Week 5 : Distributed Memory Parallelism Unit 5.3: Programming with MPI
Dr. Peter Tröger + Teaching Team
Message Passing
■ Parallel programming paradigm for “shared nothing” environments □ Implementations for shared memory available,
but typically not the best approach ■ Users submit their message passing program & data as job
■ Cluster management system creates program instances
21
[Figure: The application is submitted as a job from the submission host to the cluster management software, which starts Instance 0 to Instance 3 on the execution hosts.]
Single Program Multiple Data (SPMD)
22
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
[Figure: Input data and one SPMD program (the ring-communication code shown in full in Unit 5.3), instantiated as Instance 0 through Instance 4; every instance runs the same program on its own part of the data.]
Message Passing Interface (MPI)
■ Many optimized messaging libraries for “shared nothing” environments, developed by networking hardware vendors
■ Need for standardized API solution: Message Passing Interface □ Definition of API syntax and semantics
□ Enables source code portability, not interoperability □ Software independent from hardware concepts
■ Fixed number of process instances, defined on startup □ Point-to-point and collective communication
■ Focus on efficiency of communication and memory usage ■ MPI Forum standard
■ Consortium of industry and academia ■ MPI 1.0 (1994), 2.0 (1997), 3.0 (2012)
23
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
MPI Communicators
■ Each application instance (process) has a rank, starting at zero ■ Communicator: Handle for a group of processes
□ Unique rank numbers inside the communicator group □ Instance can determine communicator size and own rank □ Default communicator MPI_COMM_WORLD □ Instance may be in multiple communicator groups
24
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
[Figure: A communicator group of four processes; each has a unique rank 0–3 and sees size 4.]
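A minimal sketch (not part of the slides) of how an instance queries its rank and the communicator size; the calls are standard MPI, the surrounding program is illustrative:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);                // one process instance per rank
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // own rank within the default communicator
    MPI_Comm_size(MPI_COMM_WORLD, &size);  // number of processes in the group
    printf("Instance %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}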
Communication
■ Point-to-point communication between instances

int MPI_Send(void* buf, int count, MPI_Datatype type,
             int destRank, int tag, MPI_Comm com);
int MPI_Recv(void* buf, int count, MPI_Datatype type,
             int sourceRank, int tag, MPI_Comm com);
■ Parameters □ Send / receive buffer + size + data type □ Sender provides receiver rank, receiver provides sender rank □ Arbitrary message tag
■ Source / destination identified by [tag, rank, communicator] tuple ■ Default send / receive will block until the match occurs ■ Useful constants: MPI_ANY_TAG, MPI_ANY_SOURCE ■ Variations in the API for different buffering behavior
25
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Example: Ring communication
26
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
// … (determine rank and comm_size) …
int token;
if (rank != 0) {
    // Receive from your 'left' neighbor if you are not rank 0
    MPI_Recv(&token, 1, MPI_INT, rank - 1, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Process %d received token %d from process %d\n",
           rank, token, rank - 1);
} else {
    // Set the token's value if you are rank 0
    token = -1;
}
// Send your local token value to your 'right' neighbor
MPI_Send(&token, 1, MPI_INT, (rank + 1) % comm_size, 0, MPI_COMM_WORLD);
// Now rank 0 can receive from the last rank.
if (rank == 0) {
    MPI_Recv(&token, 1, MPI_INT, comm_size - 1, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Process %d received token %d from process %d\n",
           rank, token, comm_size - 1);
}
[mpitutorial.com]
Deadlocks
27 Consider:

int a[10], b[10], myrank;
MPI_Status status;
...
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
    MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);
    MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);
} else if (myrank == 1) {
    MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
    MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
}
...
If MPI_Send is blocking, there is a deadlock.
int MPI_Send(void* buf, int count, MPI_Datatype type, int destRank, int tag, MPI_Comm com);
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
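One common way out (a sketch, not from the slides) is to make rank 1 receive in the same order in which rank 0 sends, so the first message can always be matched even when MPI_Send blocks; MPI_Sendrecv or non-blocking calls are alternatives:

if (myrank == 0) {
    MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);
    MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);
} else if (myrank == 1) {
    // Receive tag 1 first, matching the order of the sends on rank 0
    MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
    MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
}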
Collective Communication
■ Point-to-point communication vs. collective communication ■ Use cases: Synchronization, data distribution & gathering
■ All processes in a (communicator) group communicate together □ One sender with multiple receivers (one-to-all) □ Multiple senders with one receiver (all-to-one) □ Multiple senders and multiple receivers (all-to-all)
■ Typical pattern in supercomputer applications ■ Participants continue when the group communication is done
□ Always blocking operation □ Must be executed by all processes in the group
□ No assumptions on the state of other participants on return
28
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Barrier
29 ■ Communicator members block until everybody reaches the barrier
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
[Figure: Three processes each call MPI_Barrier(comm); each blocks until all members of the communicator have reached the barrier.]
Broadcast
■ int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype,
               int rootRank, MPI_Comm comm)
□ rootRank is the rank of the chosen root process
□ Root process broadcasts data in buffer to all other processes, itself included
□ On return, all processes have the same data in their buffer
30
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
[Figure: Broadcast — before the call only the root holds D0; afterwards every process holds D0 in its buffer.]
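A minimal usage sketch (not from the slides); the buffer name and values are illustrative:

// … (determine rank) …
int config[4];
if (rank == 0) {
    // Only the root fills the buffer before the call
    config[0] = 42; config[1] = 7; config[2] = 1; config[3] = 0;
}
MPI_Bcast(config, 4, MPI_INT, 0, MPI_COMM_WORLD);
// Afterwards every process in MPI_COMM_WORLD holds the same four values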
Scatter
■ int MPI_Scatter(void *sendbuf, int sendcnt, MPI_Datatype sendtype, void *recvbuf, int recvcnt, MPI_Datatype recvtype, int rootRank, MPI_Comm comm)
□ sendbuf buffer on root process is divided, parts are sent to all processes, including root
□ MPI_SCATTERV allows varying count of data per rank
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
31
[Figure: Scatter distributes the root's data items D0–D5 across the processes, one item per process; Gather is the inverse operation.]
Gather
■ int MPI_Gather(void *sendbuf, int sendcnt, MPI_Datatype sendtype,
                void *recvbuf, int recvcnt, MPI_Datatype recvtype,
                int rootRank, MPI_Comm comm)
□ Each process (including the root process) sends the data in its sendbuf buffer to the root process
□ Incoming data in recvbuf is stored in rank order □ recvbuf parameter is ignored for all non-root processes
32
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
[Figure: Gather collects the items D0–D5 from the individual processes into the root's receive buffer, stored in rank order.]
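A minimal usage sketch (not from the slides), gathering one integer per process at rank 0; the buffer names and the size limit are illustrative:

// … (determine rank and comm_size) …
int my_value = rank * rank;   // each process contributes one value
int all_values[64];           // receive buffer, only used on the root (assumes comm_size <= 64)
MPI_Gather(&my_value, 1, MPI_INT, all_values, 1, MPI_INT, 0, MPI_COMM_WORLD);
// On rank 0, all_values[i] now holds the value contributed by rank i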
Reduction
■ int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int rootRank, MPI_Comm comm)
□ Similar to MPI_Gather □ Additional reduction operation op to aggregate received
data: maximum, minimum, sum, product, boolean operators, max-location (MPI_MAXLOC), min-location (MPI_MINLOC)
■ MPI implementation can overlap communication and reduction calculation for faster results
33
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
[Figure: Reduce with the '+' operation — the values D0A, D0B, and D0C held by the processes are combined into D0A '+' D0B '+' D0C at the root.]
Example: MPI_Scatter + MPI_Reduce
34

/* -- E. van den Berg 07/10/2001 -- */
#include <stdio.h>
#include "mpi.h"

int main (int argc, char *argv[]) {
    int data[] = {1, 2, 3, 4, 5, 6, 7}; // Size must be >= #processors
    int rank, i = -1, j = -1;

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);

    MPI_Scatter ((void *)data, 1, MPI_INT, (void *)&i,
                 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf ("[%d] Received i = %d\n", rank, i);

    MPI_Reduce ((void *)&i, (void *)&j, 1, MPI_INT, MPI_PROD,
                0, MPI_COMM_WORLD);

    printf ("[%d] j = %d\n", rank, j);
    MPI_Finalize();
    return 0;
}
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
What Else
■ Variations: MPI_Isend, MPI_Sendrecv, MPI_Allgather, MPI_Alltoall, …
■ Definition of virtual topologies for better task mapping ■ Complex data types
■ Packing / Unpacking (sprintf / sscanf) ■ Group / Communicator Management ■ Error Handling ■ Profiling Interface
■ Several implementations available □ MPICH - Argonne National Laboratory □ OpenMPI - Consortium of Universities and Industry □ ...
35
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Parallel Programming Concepts OpenHPI Course Week 5 : Distributed Memory Parallelism Unit 5.4: Programming with Channels
Dr. Peter Tröger + Teaching Team
Communicating Sequential Processes
■ Formal process algebra to describe concurrent systems □ Developed by Tony Hoare at University of Oxford (1977)
◊ Also inventor of QuickSort and Hoare logic □ Computer systems act and interact with the environment □ Decomposition in subsystems (processes) that operate
concurrently inside the system □ Processes interact with other processes, or the environment
■ Book: T. Hoare, Communicating Sequential Processes, 1985
■ A mathematical theory, described with algebraic laws ■ CSP channel concept available in many programming
languages for “shared nothing” systems ■ Complete approach implemented in Occam language
37
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
CSP: Processes
■ Behavior of real-world objects can be described through their interaction with other objects □ Leave out internal implementation details □ Interface of a process is described as set of atomic events
■ Example: ATM and User, both modeled as processes □ card event – insertion of a credit card in an ATM card slot □ money event – extraction of money from the ATM dispenser
■ Alphabet - set of relevant events for an object description
□ Events in the alphabet may never actually happen; interaction is restricted to these events □ αATM = αUser = {card, money}
■ A CSP process is the behavior of an object, described with its alphabet
38
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Communication in CSP
■ Special class of event: Communication □ Modeled as unidirectional channel between processes
□ Channel name is a member of the alphabets of both processes □ Send activity described by multiple c.v events (value v communicated over channel c)
■ Channel approach assumes rendezvous behavior □ Sender and receiver block on the channel operation until the
message is transmitted □ Implicit barrier based on communication
■ With formal foundation, mathematical proofs are possible □ When two concurrent processes communicate with each other
only over a single channel, they cannot deadlock. □ Network of non-stopping processes which is free of cycles
cannot deadlock. □ …
39
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
What‘s the Deal ?
■ Any possible system can be modeled through event chains □ Enables mathematical proofs for deadlock freedom,
based on the basic assumptions of the formalism (e.g. single channel assumption)
■ Some tools available (check readings page)
■ CSP was the formal base for the Occam language □ Language constructs follow the formalism □ Mathematical reasoning about the behavior of written code
■ Still active research (Welsh University), channel concept frequently adopted □ CSP channel implementations for Java, MPI, Go, C, Python …
□ Other formalisms based on CSP, e.g. Task/Channel model
40
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Channels in Scala
41

actor {
  var out: OutputChannel[String] = null
  val child = actor {
    react {
      case "go" => out ! "hello"
    }
  }
  val channel = new Channel[String]
  out = channel
  child ! "go"
  channel.receive {
    case msg => println(msg.length)
  }
}

case class ReplyTo(out: OutputChannel[String])

val child = actor {
  react {
    case ReplyTo(out) => out ! "hello"
  }
}

actor {
  val channel = new Channel[String]
  child ! ReplyTo(channel)
  channel.receive {
    case msg => println(msg.length)
  }
}
Scope-based channel sharing
Sending channels in messages
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Channels in Go
42
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
package main

import fmt "fmt"

func sayHello(ch1 chan string) {
    ch1 <- "Hello World\n"
}

func main() {
    ch1 := make(chan string)
    go sayHello(ch1)
    fmt.Printf(<-ch1)
}

$ 8g chanHello.go
$ 8l -o chanHello chanHello.8
$ ./chanHello
Hello World
$
Concurrent sayHello function
Put value into channel ch1
Program start, create channel
Run sayHello concurrently
Read value from ch1, print it
Compile application
Link application
Run application
Channels in Go
■ select concept allows switching between available channels □ All channels are evaluated
□ If multiple can proceed, one is chosen randomly □ Default clause if no channel is available
■ Channels are typically first-class language constructs □ Example: Client provides a response channel in the request
■ Popular solution to get deterministic behavior
43
select {
case v := <-ch1:
    fmt.Println("channel 1 sends", v)
case v := <-ch2:
    fmt.Println("channel 2 sends", v)
default: // optional
    fmt.Println("neither channel was ready")
}
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Task/Channel Model
■ Computational model for multi-computer by Ian Foster ■ Similar concepts to CSP
■ Parallel computation consists of one or more tasks □ Tasks execute concurrently □ Number of tasks can vary during execution □ Task: Serial program with local memory
□ A task has in-ports and outports as interface to the environment
□ Basic actions: Read / write local memory, send message on outport, receive message on in-port, create new task, terminate
44
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Task/Channel Model
■ Outport / in-port pairs are connected by channels □ Channels can be created and deleted
□ Channels can be referenced as ports, which can be part of a message
□ Send operation is non-blocking □ Receive operation is blocking □ Messages in a channel stay in order
■ Tasks are mapped to physical processors by the execution environment □ Multiple tasks can be mapped to one processor
■ Data locality is explicit part of the model ■ Channels can model control and data dependencies
45
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Programming With Channels
■ Channel-only parallel programs have advantages □ Performance optimization does not influence semantics
◊ Example: Shared-memory channels for some parts □ Task mapping does not influence semantics ◊ Align number of tasks for the problem,
not for the execution environment ◊ Improves scalability of implementation
□ Modular design with well-defined interfaces
■ Communication should be balanced between tasks ■ Each task should only communicate with a small group of
neighbors
46
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Parallel Programming Concepts OpenHPI Course Week 5 : Distributed Memory Parallelism Unit 5.5: Programming with Actors
Dr. Peter Tröger + Teaching Team
Actor Model
■ Carl Hewitt, Peter Bishop and Richard Steiger. A Universal Modular Actor Formalism for Artificial Intelligence IJCAI 1973. □ Mathematical model for concurrent computation □ Actor as computational primitive
◊ Local decisions, concurrently sends / receives messages ◊ Has a mailbox for incoming messages ◊ Concurrently creates more actors
□ Asynchronous one-way message sending
□ Changing topology allowed, typically no order guarantees ◊ Recipient is identified by mailing address ◊ Actors can send their own identity to other actors
■ Available as programming language extension or library in many environments
48
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Erlang – Ericsson Language
■ Functional language with actor support ■ Designed for large-scale concurrency
□ First version in 1986 by Joe Armstrong, Ericsson Labs □ Available as open source since 1998
■ Language goals driven by Ericsson product development □ Scalable distributed execution of phone call handling software
with large number of concurrent activities □ Fault-tolerant operation under timing constraints
□ Online software update ■ Users
□ Amazon EC2 SimpleDB , Delicious, Facebook chat, T-Mobile SMS and authentication, Motorola call processing, Ericsson GPRS and 3G mobile network products, CouchDB, EJabberD, …
49
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Concurrency in Erlang
■ Concurrency Oriented Programming □ Actor processes are completely independent (shared nothing)
□ Synchronization and data exchange with message passing □ Each actor process has an unforgeable name □ If you know the name, you can send a message □ Default approach is fire-and-forget
□ You can monitor remote actor processes ■ Using this gives you …
□ Opportunity for massive parallelism □ No additional penalty for distribution, despite latency issues
□ Easier fault tolerance capabilities □ Concurrency by default
50
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Actors in Erlang
■ Communication via message passing is part of the language ■ Send never fails, works asynchronously (PID ! Message)
■ Actors have mailbox functionality □ Queue of received messages, selective fetching □ Only messages from same source arrive in-order □ receive statement with set of clauses, pattern matching
□ Process is suspended in receive operation until a match

receive
    Pattern1 when Guard1 ->
        expr1, expr2, ..., expr_n;
    Pattern2 when Guard2 ->
        expr1, expr2, ..., expr_n;
    Other ->
        expr1, expr2, ..., expr_n
end
51
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Functions exported + #args
Erlang Example: Ping Pong Actors
52
Start Ping and Pong actors
Blocking recursive receive, scanning the mailbox
Ping actor, sending message to Pong
Blocking recursive receive, scanning the mailbox
Sending message to Ping
[erlang.org]
-module(tut15).
-export([test/0, ping/2, pong/0]).

ping(0, Pong_PID) ->
    Pong_PID ! finished,
    io:format("Ping finished~n", []);
ping(N, Pong_PID) ->
    Pong_PID ! {ping, self()},
    receive
        pong ->
            io:format("Ping received pong~n", [])
    end,
    ping(N - 1, Pong_PID).

pong() ->
    receive
        finished ->
            io:format("Pong finished~n", []);
        {ping, Ping_PID} ->
            io:format("Pong received ping~n", []),
            Ping_PID ! pong,
            pong()
    end.

test() ->
    Pong_PID = spawn(tut15, pong, []),
    spawn(tut15, ping, [3, Pong_PID]).
Pong actor
Actors in Scala
■ Actor-based concurrency in Scala, similar to Erlang ■ Concurrency abstraction on top of threads or processes
■ Communication by non-blocking send operation and blocking receive operation with matching functionality

actor {
  var sum = 0
  loop {
    receive {
      case Data(bytes) => sum += hash(bytes)
      case GetSum(requester) => requester ! sum
    }
  }
}
■ All constructs are library functions (actor, loop, receive, !) ■ Alternative self.receiveWithin() call with timeout ■ Case classes act as message type representation
53
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Case classes, acting as message types
Start the counter actor
Scala Example: Counter Actor
54
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
import scala.actors.Actor
import scala.actors.Actor._

case class Inc(amount: Int)
case class Value

class Counter extends Actor {
  var counter: Int = 0;

  def act() = {
    while (true) {
      receive {
        case Inc(amount) => counter += amount
        case Value =>
          println("Value is " + counter)
          exit()
      }
    }
  }
}

object ActorTest extends Application {
  val counter = new Counter
  counter.start()
  for (i <- 0 until 100000) {
    counter ! Inc(1)
  }
  counter ! Value // Output: Value is 100000
}
Send an Inc message to the counter actor
Send a Value message to the counter actor
Implementation of the counter actor
Blocking receive loop, scanning the mailbox
Actor Deadlocks
55 ■ Synchronous send operator "!?" available in Scala □ Sends a message and blocks in receive afterwards
□ Intended for request-response pattern
■ Original asynchronous send makes deadlocks less probable
[http://savanne.be/articles/concurrency-in-erlang-scala/]

// actorA
actorB !? Msg1(value) match {
  case Response1(r) => // …
}
receive {
  case Msg2(value) => reply(Response2(value))
}

// actorB
actorA !? Msg2(value) match {
  case Response2(r) => // …
}
receive {
  case Msg1(value) => reply(Response1(value))
}

// actorA
actorB ! Msg1(value)
while (true) {
  receive {
    case Msg2(value) => reply(Response2(value))
    case Response1(r) => // ...
  }
}

// actorB
actorA ! Msg2(value)
while (true) {
  receive {
    case Msg1(value) => reply(Response1(value))
    case Response2(r) => // ...
  }
}
Parallel Programming Concepts OpenHPI Course Week 5 : Distributed Memory Parallelism Unit 5.6: Programming with MapReduce
Dr. Peter Tröger + Teaching Team
MapReduce
■ Programming model for parallel processing of large data sets □ Inspired by map() and reduce() in functional programming
□ Intended for best scalability in data parallelism ■ Huge interest started with Google Research publication
□ Jeffrey Dean and Sanjay Ghemawat. “MapReduce: Simplified Data Processing on Large Clusters“
□ Google products rely on internal implementation ■ Apache Hadoop: Widely known open source implementation
□ Scales to thousands of nodes □ Has been shown to process petabytes of data □ Cluster infrastructure with custom file system (HDFS)
■ Parallel programming on very high abstraction level
57
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
MapReduce Concept
■ Map step □ Convert input tuples [key, value] with map() function into one / multiple intermediate tuples [key2, value2] per input
■ Shuffle step: Collect all intermediate tuples with the same key ■ Reduce step
□ Combine all intermediate tuples with the same key by some reduce() function to one result per key
■ Developer just defines stateless map() and reduce() functions ■ Framework automatically ensures parallelization ■ Persistence layer needed for input and output only
58
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
[developers.google.com]
Example: Character Counting
59
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Java Example: Hadoop Word Count
60
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
public class WordCount {
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
  ...
}
[hadoop.apache.org]
MapReduce Data Flow
61
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
[developer.yahoo.com]
Advantages
■ Developer never implements communication or synchronization, implicitly done by the framework □ Allows transparent fault tolerance and optimization
■ Running map and reduce tasks are stateless
□ Only rely on their input, produce their own output □ Repeated execution in case of failing nodes □ Redundant execution for compensating nodes with different
performance characteristics ■ Scale-out only limited by
□ Distributed file system performance (input / output data)
□ Shuffle step communication performance ■ Chaining of map/reduce tasks is very common in practice ■ But: Demands an embarrassingly parallel problem
62
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
Summary: Week 5
■ “Shared nothing” systems provide very good scalability □ Adding new processing elements not limited by “walls”
□ Different options for interconnect technology ■ Task granularity is essential
□ Surface-to-volume effect □ Task mapping problem
■ De-facto standard is MPI programming ■ High level abstractions with
□ Channels □ Actors
□ MapReduce
63
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
"What steps / strategy would you apply to parallelize a given compute-intensive program?"