OpenHPI - Parallel Programming Concepts - Week 5


Week 5 in the OpenHPI course on parallel programming concepts is about parallel applications in distributed systems. Find the whole course at http://bit.ly/1l3uD4h.


Parallel Programming Concepts | OpenHPI Course Week 5: Distributed Memory Parallelism | Unit 5.1: Hardware

Dr. Peter Tröger + Teaching Team

Summary: Week 4

■  Accelerators enable major speedup for data parallelism
    □  SIMD execution model (no branching)
    □  Memory latency managed with many light-weight threads
■  Tackle diversity with OpenCL
    □  Loop parallelism with index ranges
    □  Kernels in C, compiled at runtime
    □  Complex memory hierarchy supported
■  Getting fast is easy, getting faster is hard
    □  Best practices for accelerators
    □  Hardware knowledge needed


What if my computational problem still demands more power?


Parallelism for …

■  Speed – compute faster
■  Throughput – compute more in the same time
■  Scalability – compute faster / more with additional resources
    □  Huge scalability only with shared nothing systems
    □  Still also depends on application characteristics

[Figure: processing elements A1-A3 and B1-B3 with main memory, illustrating scaling up vs. scaling out]

Parallel Hardware

■  Shared memory system
    □  Typically a single machine, common address space for tasks
    □  Hardware scaling is limited (power / memory wall)
■  Shared nothing (distributed memory) system
    □  Tasks on multiple machines, can only access local memory
    □  Global task coordination by explicit messaging
    □  Easy scale-out by adding machines to the network


[Figure: shared memory system (tasks on processing elements with caches accessing one shared memory) vs. shared nothing system (tasks on processing elements with local memory, exchanging messages)]

Parallel Hardware

■  Shared memory system → collection of processors
    □  Integrated machine for capacity computing
    □  Prepared for a large variety of problems
■  Shared-nothing system → collection of computers
    □  Clusters and supercomputers for capability computing
    □  Installation to solve few problems in the best way
    □  Parallel software must be able to leverage multiple machines at the same time
    □  Difference to distributed systems (Internet, Cloud)
        ◊  Single organizational domain, managed as a whole
        ◊  Single parallel application at a time, no separation of client and server application
        ◊  Hybrids are possible (e.g. HPC in the Amazon AWS cloud)


Shared Nothing: Clusters

■  Collection of stand-alone machines connected by a local network
    □  Cost-effective technique for a large-scale parallel computer
    □  Users are builders, have control over their system
    □  Synchronization much slower than in shared memory
    □  Task granularity becomes an issue

[Figure: cluster of machines with local memory; tasks on processing elements exchange messages over the network]

Shared Nothing: Supercomputers

■  Supercomputers / Massively Parallel Processing (MPP) systems
    □  (Hierarchical) cluster with a lot of processors
    □  Still standard hardware, but specialized setup
    □  High-performance interconnection network
    □  For massive data-parallel applications, mostly simulations (weapons, climate, earthquakes, airplanes, car crashes, ...)
■  Examples (Nov 2013)
    □  BlueGene/Q, 1.5 million cores, 1.5 PB memory, 17.1 PFlops
    □  Tianhe-2, 3.1 million cores, 1 PB memory, 17.8 MW power, 33.86 PFlops (quadrillions of calculations per second)
■  Annual ranking with the TOP500 list (www.top500.org)


Example


[IBM System Technology Group, © 2011 IBM Corporation: Blue Gene/Q packaging hierarchy]
1. Chip: 16+2 processor cores
2. Single Chip Module
3. Compute card: one chip module, 16 GB DDR3 memory, heat spreader for H2O cooling
4. Node card: 32 compute cards, optical modules, link chips; 5D torus
5a. Midplane: 16 node cards
5b. I/O drawer: 8 I/O cards with 16 GB, 8 PCIe Gen2 x8 slots, 3D I/O torus
6. Rack: 2 midplanes
7. System: 96 racks, 20 PF/s
• Sustained single-node performance: 10x P, 20x L
• MF/Watt: (6x) P, (10x) L (~2 GF/W, Green 500 criteria)
• Software and hardware support for programming models for exploitation of node hardware concurrency


Interconnection Networks

■  Bus systems
    □  Static approach, low costs
    □  Shared communication path, broadcasting of information
    □  Scalability issues with shared bus
■  Completely connected networks
    □  Static approach, high costs
    □  Only direct links, optimal performance
■  Star-connected networks
    □  Static approach with central switch
    □  Fewer links, still very good performance
    □  Scalability depends on central switch

[Figure: bus network, completely connected network, and star network of processing elements (PEs), the star built around a central switch]

Interconnection Networks

■  Crossbar switch
    □  Dynamic switch-based network
    □  Supports multiple parallel direct connections without collisions
    □  Fewer edges than a completely connected network, but still scalability issues
■  Fat tree
    □  Use ‘wider’ links in higher parts of the interconnect tree
    □  Combine tree design advantages with a solution for root node scalability
    □  Communication distance between any two nodes is no more than 2 log #PEs

[Figure: crossbar switch connecting PE1 ... PEn, and a fat tree with PEs at the leaves and switches with wider links towards the root]

Interconnection Networks

■  Linear array
■  Ring
    □  Linear array with connected endings
■  N-way D-dimensional mesh
    □  Matrix of processing elements
    □  Not more than N neighbor links
    □  Structured in D dimensions
■  N-way D-dimensional torus
    □  Mesh with “wrap-around” connection

[Figure: linear array, ring, mesh, and torus arrangements of PEs; examples of a 4-way 2D torus, an 8-way 2D mesh, and a 4-way 2D mesh]
■  Point-to-point networks: ring and fully connected graph
    □  Ring has only two connections per PE (almost optimal)
    □  Fully connected graph: optimal connectivity, but high cost
■  Mesh and torus: compromise between cost and connectivity

Example: Blue Gene/Q 5D Torus

■  5D torus interconnect in the Blue Gene/Q supercomputer
    □  2 GB/s on all 10 links, 80 ns latency to direct neighbors
    □  Additional link for communication with I/O nodes

[IBM]

Parallel Programming Concepts | OpenHPI Course Week 5: Distributed Memory Parallelism | Unit 5.2: Granularity and Task Mapping

Dr. Peter Tröger + Teaching Team

Workload


■  Last week showed that task granularity may be flexible
    □  Example: OpenCL work group size
■  But: Communication overhead becomes significant now
    □  What is the right level of task granularity?

Surface-To-Volume Effect

■  Envision the work to be done (in parallel) as a sliced 3D cube
    □  Not a demand on the application data, just a representation
■  Slicing represents splitting into tasks
■  Computational work of a task
    □  Proportional to the volume of the cube slice
    □  Represents the granularity of decomposition
■  Communication requirements of the task
    □  Proportional to the surface of the cube slice
■  “Communication-to-computation” ratio
    □  Fine granularity: communication high, computation low
    □  Coarse granularity: communication low, computation high


Surface-To-Volume Effect

[nicerweb.com]

■  Fine-grained decomposition for using all processing elements?
■  Coarse-grained decomposition to reduce communication overhead?
■  A tradeoff question!


Surface-To-Volume Effect

■  Heatmap example with 64 data cells
■  Version (a): 64 tasks
    □  64 x 4 = 256 messages, 256 data values
    □  64 processing elements used in parallel
■  Version (b): 4 tasks
    □  16 messages, 64 data values
    □  4 processing elements used in parallel
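The numbers above follow directly from the block size. A minimal sketch in C (added for illustration; it assumes, as in Foster's heatmap example, that every task exchanges one message with each of its four neighbors per iteration, with wrap-around at the grid border):

#include <stdio.h>

/* Communication volume for a b x b block decomposition of an N x N grid.
 * Assumption (as in Foster's example): each task sends one message per
 * iteration to each of its 4 neighbors, carrying its b boundary values. */
static void decomposition_stats(int N, int b) {
    int tasks    = (N / b) * (N / b);   /* number of b x b blocks        */
    int messages = 4 * tasks;           /* 4 neighbor messages per task  */
    int values   = messages * b;        /* b boundary values per message */
    printf("block %dx%d: %3d tasks, %4d messages, %4d data values\n",
           b, b, tasks, messages, values);
}

int main(void) {
    decomposition_stats(8, 1);  /* version (a): 64 tasks, 256 messages, 256 values */
    decomposition_stats(8, 4);  /* version (b):  4 tasks,  16 messages,  64 values */
    return 0;
}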

[Foster]

Surface-To-Volume Effect

■  Rule of thumb
    □  Agglomerate tasks to avoid communication
    □  Stop when parallelism is no longer exploited well enough
    □  Agglomerate in all dimensions at the same time
■  Influencing factors
    □  Communication technology + topology
    □  Serial performance per processing element
    □  Degree of application parallelism
■  Task communication vs. network topology
    □  Resulting task graph must be mapped to the network topology
    □  Task-to-task communication may need multiple hops

[Foster]

The Task Mapping Problem

■  Given …
    □  … a number of homogeneous processing elements with performance characteristics,
    □  … some interconnection topology of the processing elements with performance characteristics,
    □  … an application divisible into parallel tasks.
■  Questions:
    □  What is the optimal task granularity?
    □  How should the tasks be placed on processing elements?
    □  Do we still get speedup / scale-up by this parallelization?
■  Task mapping is still research, mostly manual tuning today
■  More options with configurable networks / dynamic routing
    □  Reconfiguration of hardware communication paths


Parallel Programming Concepts | OpenHPI Course Week 5: Distributed Memory Parallelism | Unit 5.3: Programming with MPI

Dr. Peter Tröger + Teaching Team

Message Passing

■  Parallel programming paradigm for “shared nothing” environments
    □  Implementations for shared memory available, but typically not the best approach
■  Users submit their message passing program & data as a job
■  Cluster management system creates program instances

[Figure: an application is submitted as a job from the submission host to the cluster management software, which starts instances 0-3 on the execution hosts]

Single Program Multiple Data (SPMD)


[Figure: one SPMD program (the MPI ring example shown later in this unit) plus its input data is started as Instance 0 ... Instance 4; every instance executes the same code on its own part of the data]

Message Passing Interface (MPI)

■  Many optimized messaging libraries for “shared nothing” environments, developed by networking hardware vendors
■  Need for a standardized API solution: Message Passing Interface
    □  Definition of API syntax and semantics
    □  Enables source code portability, not interoperability
    □  Software independent from hardware concepts
■  Fixed number of process instances, defined on startup
    □  Point-to-point and collective communication
■  Focus on efficiency of communication and memory usage
■  MPI Forum standard
    □  Consortium of industry and academia
    □  MPI 1.0 (1994), 2.0 (1997), 3.0 (2012)


MPI Communicators

■  Each application instance (process) has a rank, starting at zero
■  Communicator: Handle for a group of processes
    □  Unique rank numbers inside the communicator group
    □  Instance can determine communicator size and own rank
    □  Default communicator MPI_COMM_WORLD
    □  Instance may be in multiple communicator groups
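A minimal complete MPI program in C (added for illustration, not from the slides) that only queries rank and size of the default communicator:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, comm_size;
    MPI_Init(&argc, &argv);                     /* start the MPI runtime        */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);       /* own rank inside the group    */
    MPI_Comm_size(MPI_COMM_WORLD, &comm_size);  /* number of instances in group */
    printf("Instance %d of %d\n", rank, comm_size);
    MPI_Finalize();                             /* shut down the MPI runtime    */
    return 0;
}

Started with, for example, mpirun -np 4, every instance prints its own rank and the common group size of 4.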

[Figure: one communicator containing four processes with ranks 0-3, each seeing size 4]

Communication

■  Point-to-point communication between instances:

int MPI_Send(void *buf, int count, MPI_Datatype type,
             int destRank, int tag, MPI_Comm comm);
int MPI_Recv(void *buf, int count, MPI_Datatype type,
             int sourceRank, int tag, MPI_Comm comm, MPI_Status *status);

■  Parameters
    □  Send / receive buffer + size + data type
    □  Sender provides the receiver rank, receiver provides the sender rank
    □  Arbitrary message tag
■  Source / destination identified by a [tag, rank, communicator] tuple
■  Default send / receive will block until the match occurs
■  Useful constants: MPI_ANY_TAG, MPI_ANY_SOURCE
■  Variations in the API for different buffering behavior
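A small illustrative sketch (not from the slides) of the wildcard constants and the status object: every non-zero rank sends one value to rank 0, which accepts the messages in whatever order they arrive:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, comm_size, value;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_size);
    if (rank != 0) {
        value = rank * rank;
        MPI_Send(&value, 1, MPI_INT, 0, 42, MPI_COMM_WORLD);  /* tag 42 is arbitrary */
    } else {
        for (int i = 1; i < comm_size; i++) {
            MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            printf("Got %d from rank %d\n", value, status.MPI_SOURCE);
        }
    }
    MPI_Finalize();
    return 0;
}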


Example: Ring communication


// … (determine rank and comm_size) …
int token;
if (rank != 0) {
    // Receive from your 'left' neighbor if you are not rank 0
    MPI_Recv(&token, 1, MPI_INT, rank - 1, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Process %d received token %d from process %d\n",
           rank, token, rank - 1);
} else {
    // Set the token's value if you are rank 0
    token = -1;
}
// Send your local token value to your 'right' neighbor
MPI_Send(&token, 1, MPI_INT, (rank + 1) % comm_size, 0, MPI_COMM_WORLD);
// Now rank 0 can receive from the last rank.
if (rank == 0) {
    MPI_Recv(&token, 1, MPI_INT, comm_size - 1, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Process %d received token %d from process %d\n",
           rank, token, comm_size - 1);
}

[mpitutorial.com]

Deadlocks

Consider:

int a[10], b[10], myrank;
MPI_Status status;
...
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
    MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);
    MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);
} else if (myrank == 1) {
    MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
    MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
}
...

If MPI_Send is blocking, there is a deadlock.

int MPI_Send(void* buf, int count, MPI_Datatype type, int destRank, int tag, MPI_Comm com);
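Two hedged ways to break this cycle (sketches, not from the slides): either rank 1 receives in the same order in which rank 0 sends, or rank 1 posts non-blocking receives for both messages and completes them with a single wait:

/* Fix 1: receive in the order in which the messages are sent */
} else if (myrank == 1) {
    MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
    MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
}

/* Fix 2: non-blocking receives for both tags, completed by MPI_Waitall */
MPI_Request req[2];
MPI_Irecv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &req[0]);
MPI_Irecv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &req[1]);
MPI_Waitall(2, req, MPI_STATUSES_IGNORE);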


Collective Communication

■  Point-to-point communication vs. collective communication
■  Use cases: Synchronization, data distribution & gathering
■  All processes in a (communicator) group communicate together
    □  One sender with multiple receivers (one-to-all)
    □  Multiple senders with one receiver (all-to-one)
    □  Multiple senders and multiple receivers (all-to-all)
■  Typical pattern in supercomputer applications
■  Participants continue when the group communication is done
    □  Always a blocking operation
    □  Must be executed by all processes in the group
    □  No assumptions on the state of other participants on return


Barrier

■  Communicator members block until everybody reaches the barrier


[Figure: three processes each call MPI_Barrier(comm); all of them continue only after the last one has arrived]
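A minimal usage sketch in C (added for illustration; the sleep call just simulates local work of different length):

#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    sleep(rank);                     /* simulate local work of different length */
    printf("Rank %d reached the barrier\n", rank);
    MPI_Barrier(MPI_COMM_WORLD);     /* nobody passes before everybody arrived  */
    printf("Rank %d continues\n", rank);
    MPI_Finalize();
    return 0;
}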


Broadcast

■  int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int rootRank, MPI_Comm comm)
    □  rootRank is the rank of the chosen root process
    □  Root process broadcasts the data in buffer to all other processes, itself included
    □  On return, all processes have the same data in their buffer
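A short illustrative fragment (assuming rank has been determined as in the earlier examples; the value 4711 is just a placeholder):

/* only rank 0 knows the value before the call */
int config = (rank == 0) ? 4711 : 0;
MPI_Bcast(&config, 1, MPI_INT, 0, MPI_COMM_WORLD);
/* afterwards every rank in MPI_COMM_WORLD has config == 4711 */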

[Figure: Broadcast: the data item D0 held by the root is copied to every process in the group]

Scatter

■  int MPI_Scatter(void *sendbuf, int sendcnt, MPI_Datatype sendtype, void *recvbuf, int recvcnt, MPI_Datatype recvtype, int rootRank, MPI_Comm comm)
    □  sendbuf buffer on the root process is divided, parts are sent to all processes, including root
    □  MPI_Scatterv allows a varying count of data per rank


[Figure: Scatter distributes the root's items D0 ... D5, one to each process; Gather collects them back in rank order]

Gather

■  int MPI_Gather(void *sendbuf, int sendcnt, MPI_Datatype sendtype, void *recvbuf, int recvcnt, MPI_Datatype recvtype, int rootRank, MPI_Comm comm)
    □  Each process (including the root process) sends the data in its sendbuf buffer to the root process
    □  Incoming data in recvbuf is stored in rank order
    □  recvbuf parameter is ignored for all non-root processes
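A short illustrative fragment (assuming rank and comm_size are set as before; <stdlib.h> is needed for malloc):

int my_value = rank * 10;                   /* one value per process           */
int *all = NULL;
if (rank == 0) {
    all = malloc(comm_size * sizeof(int));  /* receive buffer only on the root */
}
MPI_Gather(&my_value, 1, MPI_INT, all, 1, MPI_INT, 0, MPI_COMM_WORLD);
/* on rank 0, all[i] now holds the value contributed by rank i */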



Reduction

■  int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int rootRank, MPI_Comm comm)
    □  Similar to MPI_Gather
    □  Additional reduction operation op to aggregate the received data: maximum, minimum, sum, product, boolean operators, max / min with location (MPI_MAXLOC / MPI_MINLOC)
■  MPI implementation can overlap communication and reduction calculation for faster results


[Figure: Reduce with '+': the items D0A, D0B, D0C held by the processes are combined into D0A + D0B + D0C on the root]

Example: MPI_Scatter + MPI_Reduce

/* -- E. van den Berg 07/10/2001 -- */
#include <stdio.h>
#include "mpi.h"

int main (int argc, char *argv[]) {
    int data[] = {1, 2, 3, 4, 5, 6, 7};  // Size must be >= #processors
    int rank, i = -1, j = -1;

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);

    MPI_Scatter ((void *)data, 1, MPI_INT, (void *)&i,
                 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf ("[%d] Received i = %d\n", rank, i);

    MPI_Reduce ((void *)&i, (void *)&j, 1, MPI_INT, MPI_PROD,
                0, MPI_COMM_WORLD);

    printf ("[%d] j = %d\n", rank, j);
    MPI_Finalize();
    return 0;
}


What Else

■  Variations: MPI_Isend, MPI_Sendrecv, MPI_Allgather, MPI_Alltoall, … (see the non-blocking sketch below)
■  Definition of virtual topologies for better task mapping
■  Complex data types
■  Packing / Unpacking (sprintf / sscanf)
■  Group / Communicator Management
■  Error Handling
■  Profiling Interface
■  Several implementations available
    □  MPICH - Argonne National Laboratory
    □  OpenMPI - Consortium of Universities and Industry
    □  ...
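A hedged sketch of the non-blocking variants mentioned above: MPI_Isend / MPI_Irecv return immediately, and the transfer is completed by a later wait call (partner stands for some neighbor rank and is an assumption of this fragment):

int send_val = rank, recv_val = -1;
MPI_Request requests[2];
MPI_Isend(&send_val, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &requests[0]);
MPI_Irecv(&recv_val, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &requests[1]);
/* unrelated computation can overlap with the transfer here */
MPI_Waitall(2, requests, MPI_STATUSES_IGNORE);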


Parallel Programming Concepts | OpenHPI Course Week 5: Distributed Memory Parallelism | Unit 5.4: Programming with Channels

Dr. Peter Tröger + Teaching Team

Communicating Sequential Processes

■  Formal process algebra to describe concurrent systems
    □  Developed by Tony Hoare at the University of Oxford (1977)
        ◊  Also inventor of QuickSort and Hoare logic
    □  Computer systems act and interact with the environment
    □  Decomposition in subsystems (processes) that operate concurrently inside the system
    □  Processes interact with other processes, or the environment
■  Book: T. Hoare, Communicating Sequential Processes, 1985
■  A mathematical theory, described with algebraic laws
■  CSP channel concept available in many programming languages for “shared nothing” systems
■  Complete approach implemented in the Occam language


CSP: Processes

■  Behavior of real-world objects can be described through their interaction with other objects
    □  Leave out internal implementation details
    □  Interface of a process is described as a set of atomic events
■  Example: ATM and User, both modeled as processes
    □  card event – insertion of a credit card in an ATM card slot
    □  money event – extraction of money from the ATM dispenser
■  Alphabet – set of relevant events for an object description
    □  Events in the alphabet may never actually happen; interaction is just restricted to these events
    □  αATM = αUser = {card, money}
■  A CSP process is the behavior of an object, described with its alphabet


Communication in CSP

■  Special class of event: Communication
    □  Modeled as a unidirectional channel between processes
    □  Channel name is a member of the alphabets of both processes
    □  Send activity described by multiple c.v events
■  Channel approach assumes rendezvous behavior
    □  Sender and receiver block on the channel operation until the message is transmitted
    □  Implicit barrier based on communication
■  With the formal foundation, mathematical proofs are possible
    □  When two concurrent processes communicate with each other only over a single channel, they cannot deadlock.
    □  A network of non-stopping processes which is free of cycles cannot deadlock.
    □  …


What's the Deal?

■  Any possible system can be modeled through event chains
    □  Enables mathematical proofs for deadlock freedom, based on the basic assumptions of the formalism (e.g. single channel assumption)
■  Some tools available (check the readings page)
■  CSP was the formal base for the Occam language
    □  Language constructs follow the formalism
    □  Mathematical reasoning about the behavior of written code
■  Still active research (Welsh University), channel concept frequently adopted
    □  CSP channel implementations for Java, MPI, Go, C, Python …
    □  Other formalisms based on CSP, e.g. the Task/Channel model


Channels in Scala

Scope-based channel sharing:

actor {
  var out: OutputChannel[String] = null
  val child = actor {
    react {
      case "go" => out ! "hello"
    }
  }
  val channel = new Channel[String]
  out = channel
  child ! "go"
  channel.receive {
    case msg => println(msg.length)
  }
}

Sending channels in messages:

case class ReplyTo(out: OutputChannel[String])

val child = actor {
  react {
    case ReplyTo(out) => out ! "hello"
  }
}

actor {
  val channel = new Channel[String]
  child ! ReplyTo(channel)
  channel.receive {
    case msg => println(msg.length)
  }
}


Channels in Go


package main

import fmt "fmt"

// Concurrent sayHello function
func sayHello(ch1 chan string) {
    ch1 <- "Hello World\n" // Put value into channel ch1
}

// Program start, create channel
func main() {
    ch1 := make(chan string)
    go sayHello(ch1)  // Run sayHello concurrently
    fmt.Printf(<-ch1) // Read value from ch1, print it
}

$ 8g chanHello.go              # Compile application
$ 8l -o chanHello chanHello.8  # Link application
$ ./chanHello                  # Run application
Hello World
$

Channels in Go

■  The select concept allows switching between available channels
    □  All channels are evaluated
    □  If multiple can proceed, one is chosen randomly
    □  Default clause if no channel is available
■  Channels are typically first-class language constructs
    □  Example: Client provides a response channel in the request
■  Popular solution to get deterministic behavior


select {
case v := <-ch1:
    fmt.Println("channel 1 sends", v)
case v := <-ch2:
    fmt.Println("channel 2 sends", v)
default: // optional
    fmt.Println("neither channel was ready")
}


Task/Channel Model

■  Computational model for multicomputers by Ian Foster
■  Similar concepts to CSP
■  Parallel computation consists of one or more tasks
    □  Tasks execute concurrently
    □  Number of tasks can vary during execution
    □  Task: Serial program with local memory
    □  A task has in-ports and out-ports as its interface to the environment
    □  Basic actions: Read / write local memory, send message on out-port, receive message on in-port, create new task, terminate


Task/Channel Model

■  Out-port / in-port pairs are connected by channels
    □  Channels can be created and deleted
    □  Channels can be referenced as ports, which can be part of a message
    □  Send operation is non-blocking
    □  Receive operation is blocking
    □  Messages in a channel stay in order
■  Tasks are mapped to physical processors by the execution environment
    □  Multiple tasks can be mapped to one processor
■  Data locality is an explicit part of the model
■  Channels can model control and data dependencies


Programming With Channels

■  Channel-only parallel programs have advantages
    □  Performance optimization does not influence semantics
        ◊  Example: Shared-memory channels for some parts
    □  Task mapping does not influence semantics
        ◊  Align the number of tasks with the problem, not with the execution environment
        ◊  Improves scalability of the implementation
    □  Modular design with well-defined interfaces
■  Communication should be balanced between tasks
■  Each task should only communicate with a small group of neighbors


Parallel Programming Concepts | OpenHPI Course Week 5: Distributed Memory Parallelism | Unit 5.5: Programming with Actors

Dr. Peter Tröger + Teaching Team

Actor Model

■  Carl Hewitt, Peter Bishop and Richard Steiger: “A Universal Modular Actor Formalism for Artificial Intelligence”, IJCAI 1973
    □  Mathematical model for concurrent computation
    □  Actor as computational primitive
        ◊  Makes local decisions, concurrently sends / receives messages
        ◊  Has a mailbox for incoming messages
        ◊  Concurrently creates more actors
    □  Asynchronous one-way message sending
    □  Changing topology allowed, typically no order guarantees
        ◊  Recipient is identified by mailing address
        ◊  Actors can send their own identity to other actors
■  Available as a programming language extension or library in many environments


Erlang – Ericsson Language

■  Functional language with actor support
■  Designed for large-scale concurrency
    □  First version in 1986 by Joe Armstrong, Ericsson Labs
    □  Available as open source since 1998
■  Language goals driven by Ericsson product development
    □  Scalable distributed execution of phone call handling software with a large number of concurrent activities
    □  Fault-tolerant operation under timing constraints
    □  Online software update
■  Users
    □  Amazon EC2 SimpleDB, Delicious, Facebook chat, T-Mobile SMS and authentication, Motorola call processing, Ericsson GPRS and 3G mobile network products, CouchDB, ejabberd, …


Concurrency in Erlang

■  Concurrency Oriented Programming
    □  Actor processes are completely independent (shared nothing)
    □  Synchronization and data exchange with message passing
    □  Each actor process has an unforgeable name
    □  If you know the name, you can send a message
    □  Default approach is fire-and-forget
    □  You can monitor remote actor processes
■  Using this gives you …
    □  Opportunity for massive parallelism
    □  No additional penalty for distribution, despite latency issues
    □  Easier fault tolerance capabilities
    □  Concurrency by default


Actors in Erlang

■  Communication via message passing is part of the language
■  Send never fails, works asynchronously (PID ! Message)
■  Actors have mailbox functionality
    □  Queue of received messages, selective fetching
    □  Only messages from the same source arrive in-order
    □  receive statement with a set of clauses, pattern matching
    □  Process is suspended in the receive operation until a match occurs

receive
    Pattern1 when Guard1 -> expr1, expr2, ..., expr_n;
    Pattern2 when Guard2 -> expr1, expr2, ..., expr_n;
    Other -> expr1, expr2, ..., expr_n
end


Erlang Example: Ping Pong Actors

-module(tut15).
-export([test/0, ping/2, pong/0]).    % functions exported + number of arguments

%% Ping actor, sending messages to Pong
ping(0, Pong_PID) ->
    Pong_PID ! finished,
    io:format("Ping finished~n", []);
ping(N, Pong_PID) ->
    Pong_PID ! {ping, self()},
    receive                           % blocking recursive receive, scanning the mailbox
        pong ->
            io:format("Ping received pong~n", [])
    end,
    ping(N - 1, Pong_PID).

%% Pong actor
pong() ->
    receive                           % blocking recursive receive, scanning the mailbox
        finished ->
            io:format("Pong finished~n", []);
        {ping, Ping_PID} ->
            io:format("Pong received ping~n", []),
            Ping_PID ! pong,          % sending message to Ping
            pong()
    end.

%% Start the Ping and Pong actors
test() ->
    Pong_PID = spawn(tut15, pong, []),
    spawn(tut15, ping, [3, Pong_PID]).

[erlang.org]

Actors in Scala

■  Actor-based concurrency in Scala, similar to Erlang
■  Concurrency abstraction on top of threads or processes
■  Communication by a non-blocking send operation and a blocking receive operation with matching functionality:

actor {
  var sum = 0
  loop {
    receive {
      case Data(bytes) => sum += hash(bytes)
      case GetSum(requester) => requester ! sum
    }
  }
}

■  All constructs are library functions (actor, loop, receive, !)
■  Alternative self.receiveWithin() call with timeout
■  Case classes act as message type representation


Scala Example: Counter Actor

import scala.actors.Actor
import scala.actors.Actor._

// Case classes, acting as message types
case class Inc(amount: Int)
case class Value

// Implementation of the counter actor
class Counter extends Actor {
  var counter: Int = 0
  def act() = {
    while (true) {
      receive {                       // blocking receive loop, scanning the mailbox
        case Inc(amount) => counter += amount
        case Value =>
          println("Value is " + counter)
          exit()
      }
    }
  }
}

object ActorTest extends Application {
  val counter = new Counter
  counter.start()                     // start the counter actor
  for (i <- 0 until 100000) {
    counter ! Inc(1)                  // send an Inc message to the counter actor
  }
  counter ! Value                     // send a Value message to the counter actor
  // Output: Value is 100000
}

Actor Deadlocks

■  Synchronous send operator “!?” available in Scala
    □  Sends a message and blocks in receive afterwards
    □  Intended for the request-response pattern
■  Original asynchronous send makes deadlocks less probable

// Deadlock: both actors send synchronously ("!?") and wait for the response first

// actorA
actorB !? Msg1(value) match {
  case Response1(r) => // …
}
receive {
  case Msg2(value) => reply(Response2(value))
}

// actorB
actorA !? Msg2(value) match {
  case Response2(r) => // …
}
receive {
  case Msg1(value) => reply(Response1(value))
}

// No deadlock: asynchronous send ("!"), responses handled in the same receive loop

// actorA
actorB ! Msg1(value)
while (true) {
  receive {
    case Msg2(value) => reply(Response2(value))
    case Response1(r) => // ...
  }
}

// actorB
actorA ! Msg2(value)
while (true) {
  receive {
    case Msg1(value) => reply(Response1(value))
    case Response2(r) => // ...
  }
}

[http://savanne.be/articles/concurrency-in-erlang-scala/]

Parallel Programming Concepts | OpenHPI Course Week 5: Distributed Memory Parallelism | Unit 5.6: Programming with MapReduce

Dr. Peter Tröger + Teaching Team

MapReduce

■  Programming model for parallel processing of large data sets
    □  Inspired by map() and reduce() in functional programming
    □  Intended for best scalability in data parallelism
■  Huge interest started with the Google Research publication
    □  Jeffrey Dean and Sanjay Ghemawat: “MapReduce: Simplified Data Processing on Large Clusters”
    □  Google products rely on an internal implementation
■  Apache Hadoop: Widely known open source implementation
    □  Scales to thousands of nodes
    □  Has been shown to process petabytes of data
    □  Cluster infrastructure with a custom file system (HDFS)
■  Parallel programming on a very high abstraction level


MapReduce Concept

■  Map step
    □  Convert input tuples [key, value] with the map() function into one / multiple intermediate tuples [key2, value2] per input
■  Shuffle step: Collect all intermediate tuples with the same key
■  Reduce step
    □  Combine all intermediate tuples with the same key by some reduce() function to one result per key
■  Developer just defines stateless map() and reduce() functions
■  Framework automatically ensures parallelization
■  Persistence layer needed for input and output only


[developers.google.com]

Example: Character Counting


Java Example: Hadoop Word Count


public class WordCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
  ...
}

[hadoop.apache.org]

MapReduce Data Flow

[Figure: MapReduce data flow; developer.yahoo.com]

Advantages

■  Developer never implements communication or synchronization, it is implicitly done by the framework
    □  Allows transparent fault tolerance and optimization
■  Running map and reduce tasks are stateless
    □  Only rely on their input, produce their own output
    □  Repeated execution in case of failing nodes
    □  Redundant execution for compensating nodes with different performance characteristics
■  Scale-out only limited by
    □  Distributed file system performance (input / output data)
    □  Shuffle step communication performance
■  Chaining of map/reduce tasks is very common in practice
■  But: Demands an embarrassingly parallel problem


Summary: Week 5

■  “Shared nothing” systems provide very good scalability
    □  Adding new processing elements not limited by “walls”
    □  Different options for interconnect technology
■  Task granularity is essential
    □  Surface-to-volume effect
    □  Task mapping problem
■  De-facto standard is MPI programming
■  High level abstractions with
    □  Channels
    □  Actors
    □  MapReduce


“What steps / strategy would you apply to parallelize a given compute-intensive program?”