
Lecture 8: Distributed memory

David Bindel

22 Sep 2011

Logistics

- Interesting CS colloquium tomorrow at 4:15 in Upson B15: “Trumping the Multicore Memory Hierarchy with Hi-Spade”
- Proj 1 due tomorrow at 11:59! I will be gone after 2-3 tomorrow, so get in your questions now. (Alternate suggestion: Monday at 11:59?)
- Suggested two-fold strategy: work on a fast kernel (say 16-by-16), and then work on a good blocked code that employs that kernel. Be careful not to spend too much time on optimizations that the compiler does better.
- When you submit your Makefile, please make sure that you copy any changes that you made to the flags in Makefile.in into the Makefile proper.
- Some good discussions on Piazza – keep it going!

Next HW: a particle dynamics simulation.

Plan for this week

- Last week: shared memory programming
  - Shared memory HW issues (cache coherence)
  - Threaded programming concepts (pthreads and OpenMP)
  - A simple example (Monte Carlo)
- This week: distributed memory programming
  - Distributed memory HW issues (topologies, cost models)
  - Message-passing programming concepts (and MPI)
  - A simple example (“sharks and fish”)

Basic questions

How much does a message cost?
- Latency: time to get between processors
- Bandwidth: data transferred per unit time
- How does contention affect communication?

This is a combined hardware-software question!

We want to understand just enough for reasonable modeling.

Thinking about interconnects

Several features characterize an interconnect:
- Topology: who do the wires connect?
- Routing: how do we get from A to B?
- Switching: circuits, store-and-forward?
- Flow control: how do we manage limited resources?

Thinking about interconnects

- Links are like streets
- Switches are like intersections
- Hops are like blocks traveled
- Routing algorithm is like a travel plan
- Stop lights are like flow control
- Short packets are like cars, long ones like buses?

At some point the analogy breaks down...

Bus topology

[Figure: four processors P0–P3, each with a private cache ($), sharing one memory (Mem) over a single bus]

- One set of wires (the bus)
- Only one processor may use the bus at any given time
- Contention for the bus is an issue
- Example: basic Ethernet, some SMPs

Crossbar

[Figure: 4-by-4 crossbar connecting processors P0–P3, with a switch at every crossing point]

- Dedicated path from every input to every output
- Takes O(p²) switches and wires!
- Example: recent AMD/Intel multicore chips (older: front-side bus)

Bus vs. crossbar

- Crossbar: more hardware
- Bus: more contention (less capacity?)
- Generally seek a happy medium:
  - Less contention than a bus
  - Less hardware than a crossbar
  - May give up one-hop routing

Network properties

Think about latency and bandwidth via two quantities:
- Diameter: maximum distance between nodes
- Bisection bandwidth: the smallest total bandwidth across any cut that splits the machine in half
  - Particularly important for all-to-all communication

Linear topology

- p − 1 links
- Diameter p − 1
- Bisection bandwidth 1

Ring topology

- p links
- Diameter p/2
- Bisection bandwidth 2

Mesh

- May be more than two dimensions
- Route along each dimension in turn

Torus

Torus : Mesh :: Ring : Linear

Hypercube

- Label processors with binary numbers
- Connect p1 to p2 if their labels differ in exactly one bit (see the sketch below)
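To make the labeling concrete, here is a minimal sketch (not from the slides; the function name and the `dim` parameter are made up) that lists a node's hypercube neighbors by flipping one bit of its label at a time:

#include <stdio.h>

/* Sketch: print the hypercube neighbors of node `rank` in a
   `dim`-dimensional hypercube (2^dim nodes). Each neighbor's label
   differs from `rank` in exactly one bit position. */
void print_hypercube_neighbors(int rank, int dim)
{
    for (int k = 0; k < dim; ++k)
        printf("dimension %d: neighbor %d\n", k, rank ^ (1 << k));
}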

Fat tree

- Processors at leaves
- Increase link bandwidth near the root

Others...

- Butterfly network
- Omega network
- Cayley graph

Current picture

- Old: latencies = hops
- New: roughly constant latency (?)
  - Wormhole routing (or cut-through) flattens latencies vs. store-and-forward at the hardware level
  - Software stack dominates HW latency!
  - Latencies not the same between networks (in box vs. across)
  - May also have store-and-forward at the library level
- Old: mapping algorithms to topologies
- New: avoid topology-specific optimization
  - Want code that runs on next year's machine, too!
  - Bundle topology awareness in vendor MPI libraries?
  - Sometimes specify a software topology

α-β model

Crudest model: t_comm = α + βM
- t_comm = communication time
- α = latency
- β = inverse bandwidth
- M = message size

Works pretty well for basic guidance!

Typically α ≫ β ≫ t_flop. More money on the network, lower α.
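A quick worked example with illustrative (made-up) numbers: if α = 10^-6 s and β = 10^-9 s/byte, an 8000-byte message costs roughly 10^-6 + 8 × 10^-6 ≈ 9 × 10^-6 s (bandwidth-dominated), while a 100-byte message costs about 1.1 × 10^-6 s (almost pure latency); the crossover is near M = α/β = 1000 bytes, which is why batching small messages pays off.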

LogP model

Like α-β, but includes CPU time on send/recv:
- Latency: the usual
- Overhead: CPU time to send/recv
- Gap: minimum time between sends/recvs
- P: number of processors

Assumes small messages (gap ∼ bw for fixed message size).

Communication costs

Some basic goals:
- Prefer larger messages to smaller ones (avoid latency)
- Avoid communication when possible
  - Great speedup for Monte Carlo and other embarrassingly parallel codes!
- Overlap communication with computation
  - Models tell you how much computation is needed to mask communication costs.

Message passing programming

Basic operations:
- Pairwise messaging: send/receive
- Collective messaging: broadcast, scatter/gather
- Collective computation: sum, max, other parallel prefix ops
- Barriers (no need for locks!)
- Environmental inquiries (who am I? do I have mail?)

(Much of what follows is adapted from Bill Gropp’s material.)

MPI

- Message Passing Interface
- An interface spec — many implementations
- Bindings to C, C++, Fortran

Hello world

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

Communicators

- Processes form groups
- Messages sent in contexts
  - Separate communication for libraries
- Group + context = communicator
- Identify a process by its rank in a group
- Default is MPI_COMM_WORLD
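As a sketch of how new communicators arise in practice (not from the slides; the function name and `group_size` are made up, though MPI_Comm_split and MPI_Comm_rank are standard MPI calls):

#include <mpi.h>

/* Sketch: carve MPI_COMM_WORLD into sub-communicators of `group_size`
   consecutive ranks. Ranks passing the same color end up in the same
   new communicator; the key (rank) keeps the original ordering. */
MPI_Comm make_group_comm(int group_size)
{
    int rank;
    MPI_Comm group_comm;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_split(MPI_COMM_WORLD, rank / group_size, rank, &group_comm);
    return group_comm;
}

A library can then send and receive on its own communicator without its messages colliding with traffic in MPI_COMM_WORLD.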

Sending and receiving

Need to specify:
- What's the data?
  - Different machines use different encodings (e.g. endianness)
  - ⇒ a “bag o’ bytes” model is inadequate
- How do we identify processes?
- How does the receiver identify messages?
- What does it mean to “complete” a send/recv?

MPI datatypes

Message is (address, count, datatype). This allows:
- Basic types (MPI_INT, MPI_DOUBLE)
- Contiguous arrays
- Strided arrays
- Indexed arrays
- Arbitrary structures

Complex data types may hurt performance?
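For instance, a strided column of a row-major matrix can be described with a derived type; a minimal sketch (not from the slides; `send_column` and its arguments are made up, while the MPI_Type_vector/commit/free calls are standard MPI):

#include <mpi.h>

/* Sketch: send column `col` of a row-major n-by-n double matrix A.
   The column is n blocks of 1 double, with a stride of n doubles. */
void send_column(double* A, int n, int col, int dest, MPI_Comm comm)
{
    MPI_Datatype col_type;
    MPI_Type_vector(n, 1, n, MPI_DOUBLE, &col_type);
    MPI_Type_commit(&col_type);
    MPI_Send(A + col, 1, col_type, dest, 0, comm);
    MPI_Type_free(&col_type);
}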

MPI tags

Use an integer tag to label messages:
- Helps distinguish different message types
- Can screen out messages with the wrong tag
- MPI_ANY_TAG is a wildcard

MPI Send/Recv

Basic blocking point-to-point communication:

int
MPI_Send(void *buf, int count,
         MPI_Datatype datatype,
         int dest, int tag, MPI_Comm comm);

int
MPI_Recv(void *buf, int count,
         MPI_Datatype datatype,
         int source, int tag, MPI_Comm comm,
         MPI_Status *status);

MPI send/recv semantics

- Send returns when the data is handed off to the system
  - ... it might not have arrived at the destination yet!
- Recv ignores messages that don't match the source and tag
  - MPI_ANY_SOURCE and MPI_ANY_TAG are wildcards
- Recv status contains more info (tag, source, size)
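A sketch of a receiver using the wildcards and then inspecting the status (not from the slides; the function name and MAX_MSG buffer size are made up, the MPI calls and status fields are standard):

#include <mpi.h>
#include <stdio.h>

#define MAX_MSG 1024   /* hypothetical buffer size */

/* Sketch: accept a message from any source with any tag, then read
   the actual source, tag, and byte count out of the status. */
void recv_any(void)
{
    char buf[MAX_MSG];
    MPI_Status status;
    int count;
    MPI_Recv(buf, MAX_MSG, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG,
             MPI_COMM_WORLD, &status);
    MPI_Get_count(&status, MPI_CHAR, &count);
    printf("Got %d bytes from rank %d with tag %d\n",
           count, status.MPI_SOURCE, status.MPI_TAG);
}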

Ping-pong pseudocode

Process 0:

for i = 1:ntrials
    send b bytes to 1
    recv b bytes from 1
end

Process 1:

for i = 1:ntrials
    recv b bytes from 0
    send b bytes to 0
end

Ping-pong MPI

void ping(char* buf, int n, int ntrials, int p)
{
    for (int i = 0; i < ntrials; ++i) {
        MPI_Send(buf, n, MPI_CHAR, p, 0, MPI_COMM_WORLD);
        MPI_Recv(buf, n, MPI_CHAR, p, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);  /* status not needed here */
    }
}

(Pong is similar)
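A minimal sketch of the matching pong routine, assuming the same conventions as ping above (it simply reverses the recv/send order, as in the pseudocode):

/* Companion to ping(); same file, needs <mpi.h>. */
void pong(char* buf, int n, int ntrials, int p)
{
    for (int i = 0; i < ntrials; ++i) {
        MPI_Recv(buf, n, MPI_CHAR, p, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(buf, n, MPI_CHAR, p, 0, MPI_COMM_WORLD);
    }
}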

Ping-pong MPI

for (int sz = 1; sz <= MAX_SZ; sz += 1000) {
    if (rank == 0) {
        clock_t t1, t2;
        t1 = clock();
        ping(buf, sz, NTRIALS, 1);
        t2 = clock();
        printf("%d %g\n", sz,
               (double) (t2-t1)/CLOCKS_PER_SEC);
    } else if (rank == 1) {
        pong(buf, sz, NTRIALS, 0);
    }
}
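As an aside, MPI also provides a wall-clock timer; a hedged variant of the timing above (not what the slide uses) with MPI_Wtime, which returns seconds as a double and avoids the CLOCKS_PER_SEC conversion:

/* Sketch: time the same ping with MPI's wall-clock timer. */
double t1 = MPI_Wtime();
ping(buf, sz, NTRIALS, 1);
double t2 = MPI_Wtime();
printf("%d %g\n", sz, t2 - t1);   /* total seconds for NTRIALS round trips */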

Running the code

On my laptop (OpenMPI)

mpicc -std=c99 pingpong.c -o pingpong.x
mpirun -np 2 ./pingpong.x

Details vary, but this is pretty normal.

Approximate α-β parameters (2-core laptop)

[Plot: measured time per message vs. message size (0–18000 bytes), with the fitted α-β model overlaid]

α ≈ 1.46 × 10^-6, β ≈ 3.89 × 10^-10

Where we are now

Can write a lot of MPI code with the six operations we've seen:
- MPI_Init
- MPI_Finalize
- MPI_Comm_size
- MPI_Comm_rank
- MPI_Send
- MPI_Recv

... but there are sometimes better ways.

Next time: non-blocking and collective operations!
