
Page 1: Evaluating the Performance Limitations of MPMD Communication

Evaluating the Performance Limitations of MPMD Communication

Chi-Chao Chang
Dept. of Computer Science
Cornell University

Grzegorz Czajkowski (Cornell)

Thorsten von Eicken (Cornell)

Carl Kesselman (ISI/USC)

Page 2: Evaluating the Performance Limitations of MPMD Communication

Framework

Parallel computing on clusters of workstations
  Hardware communication primitives are message-based
  Programming models: SPMD and MPMD
  SPMD is the predominant model

Why use MPMD?
  Appropriate for a distributed, heterogeneous setting: metacomputing
  Parallel software as “components”

Why use RPC?
  The right level of abstraction
  Message passing requires the receiver to know when to expect incoming communication

Systems with similar philosophy: Nexus, Legion

How do RPC-based MPMD systems perform on homogeneous MPPs?

2

Page 3: Evaluating the Performance Limitations of MPMD Communication

Problem

MPMD systems are an order of magnitude slower than SPMD systems on homogeneous MPPs

1. Implementation trade-off: existing MPMD systems focus on the general case at the expense of performance in the homogeneous case.

2. RPC is more complex when the SPMD assumption is dropped.

3

Page 4: Evaluating the Performance Limitations of MPMD Communication

Approach

MRPC: an MPMD RPC system specialized for MPPs
  Best base-line RPC performance at the expense of heterogeneity
  Start from simple SPMD RPC: Active Messages
  “Minimal” runtime system for MPMD
  Integrate with an MPMD parallel language: CC++
  No modifications to the front-end translator or back-end compiler

Goal is to introduce only the necessary RPC runtime overheads for MPMD

Evaluate it with respect to a highly tuned SPMD system: Split-C over Active Messages

4

Page 5: Evaluating the Performance Limitations of MPMD Communication

MRPC

Implementation
  Library: RPC, marshalling of basic types, remote program execution
  About 4K lines of C++ and 2K lines of C
  Implemented on top of Active Messages (SC ’96): a “dispatcher” handler
  Currently runs on the IBM SP2 (AIX 3.2.5)

Integrated into CC++:
  Relies on CC++ global pointers for RPC binding
  Borrows RPC stub generation from CC++
  No modification to front-end compiler

5

Page 6: Evaluating the Performance Limitations of MPMD Communication

Outline

Design issues in MRPC
MRPC and CC++
Performance results

6

Page 7: Evaluating the Performance Limitations of MPMD Communication

Method Name Resolution

Compiler cannot determine the existence or location of a remote procedure statically

7

SPMD: the same program image runs on every node, so &foo is known everywhere
MPMD: needs a mapping from the name “foo” to the address &foo

[Diagram: resolving the name “foo” to the address &foo through a per-node table.]

MRPC: sender-side stub address caching

Page 8: Evaluating the Performance Limitations of MPMD Communication

Stub address caching

8

Cold invocation: the stub cache is missed, so the name “e_foo” is sent with the request; the remote dispatcher resolves “e_foo” to &e_foo, and the caller caches the returned stub address.
Hot invocation: the cached &e_foo is hit, so subsequent calls through the same global pointer (GP) bypass the name lookup.

[Diagram: cold invocation (cache miss, dispatcher lookup “e_foo” -> &e_foo, result cached) versus hot invocation (cache hit, &e_foo used directly).]
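
To make the caching scheme concrete, here is a minimal C++ sketch of a sender-side stub address cache; the class and function names (StubCache, resolve_via_dispatcher) and the plain hash map are illustrative assumptions, not MRPC's actual data structures.

  // Sketch of sender-side stub address caching (hypothetical names, not the
  // MRPC source). A cold invocation ships the entry-point name to the remote
  // dispatcher and caches the returned stub address; hot invocations reuse it.
  #include <cstdint>
  #include <string>
  #include <unordered_map>

  using StubAddr = std::uintptr_t;  // address of the entry stub in the callee's address space

  class StubCache {
  public:
      StubAddr lookup(int node, const std::string& name) {
          const std::string key = std::to_string(node) + ":" + name;
          auto it = cache_.find(key);
          if (it != cache_.end())
              return it->second;                               // hot invocation: cache hit
          StubAddr addr = resolve_via_dispatcher(node, name);  // cold invocation: dispatcher round trip
          cache_.emplace(key, addr);
          return addr;
      }
  private:
      // Placeholder for the cold-path round trip: the remote dispatcher maps the
      // string (e.g. "e_foo") to the stub address (&e_foo) in its address space.
      StubAddr resolve_via_dispatcher(int /*node*/, const std::string& /*name*/) {
          return 0;  // a real implementation would issue an Active Message here
      }
      std::unordered_map<std::string, StubAddr> cache_;
  };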

Page 9: Evaluating the Performance Limitations of MPMD Communication

Argument Marshalling

Arguments of an RPC can be arbitrary objects
  Must be marshalled and unmarshalled by the RPC stubs
  Even more expensive in a heterogeneous setting

versus… AM: up to four 4-byte arguments, plus arbitrary buffers (the programmer takes care of marshalling)

MRPC: efficient data-copying routines for stubs (see the sketch below)

9
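
As an illustration of what the stub-level data copying amounts to, the sketch below packs flat arguments into one contiguous send buffer; the MarshalBuf class and its operator<< are hypothetical stand-ins, not MRPC's actual copying routines.

  // Illustrative marshalling buffer for RPC stubs (not the actual MRPC code).
  // Flat arguments are appended to one contiguous buffer so the whole request
  // can be handed to the messaging layer as a single bulk transfer.
  #include <cstddef>
  #include <vector>

  class MarshalBuf {
  public:
      template <typename T>                 // T assumed to be a flat (POD) type
      MarshalBuf& operator<<(const T& v) {
          const char* p = reinterpret_cast<const char*>(&v);
          data_.insert(data_.end(), p, p + sizeof(T));   // byte-wise copy, no format conversion
          return *this;
      }
      const char* data() const { return data_.data(); }
      std::size_t size() const { return data_.size(); }
  private:
      std::vector<char> data_;
  };

  // Usage in a (hypothetical) caller stub for foo(int p, double x):
  //   MarshalBuf buf;
  //   buf << p << x;   // then hand buf.data(), buf.size() to the send routine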

Page 10: Evaluating the Performance Limitations of MPMD Communication

Data Transfer

The caller stub does not know about the receive buffer; there is no caller/callee synchronization

versus… AM: caller specifies remote buffer address

MRPC: Efficient buffer management and persistent receive buffers

10

Page 11: Evaluating the Performance Limitations of MPMD Communication

Persistent Receive Buffers

11

[Diagram: cold invocation: data goes from the S-buf to a static, per-node buffer; the dispatcher copies it into a persistent R-buf, and &R-buf is stored in the caller’s cache. Hot invocation: data is sent from the S-buf directly into the persistent R-buf before e_foo runs.]
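
A rough sketch of the receive-side bookkeeping follows, assuming one persistent buffer per (caller node, entry point); the names RBufTable and cold_receive are made up for illustration and are not MRPC's API.

  // Sketch of persistent receive buffers (hypothetical names, not MRPC source).
  // Cold path: the request lands in the static per-node buffer and is copied
  // once into a persistent R-buf, whose address the caller then caches.
  // Hot path: the caller already knows &R-buf, so data can arrive there directly.
  #include <cstddef>
  #include <cstring>
  #include <map>
  #include <utility>
  #include <vector>

  static char static_buf[4096];   // static, per-node landing buffer

  struct RBufTable {
      std::map<std::pair<int, int>, std::vector<char>> bufs;   // (caller node, entry id) -> R-buf

      char* get(int caller, int entry, std::size_t len) {
          std::vector<char>& b = bufs[{caller, entry}];
          if (b.size() < len) b.resize(len);
          return b.data();        // this address is returned to and cached by the caller
      }
  };

  // Cold invocation: copy the message out of the static buffer into the R-buf.
  // (len is assumed to fit in static_buf for this sketch.)
  char* cold_receive(RBufTable& table, int caller, int entry, std::size_t len) {
      char* rbuf = table.get(caller, entry, len);
      std::memcpy(rbuf, static_buf, len);
      return rbuf;                // &R-buf, sent back to the caller for its cache
  }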

Page 12: Evaluating the Performance Limitations of MPMD Communication

Threads

Each RPC requires a new (logical) thread at the receiving end
  No restrictions on the operations performed in remote procedures
  The runtime system must be thread-safe

versus… Split-C: single thread of control per node

MRPC: a custom, non-preemptive threads package (sketched below)

12
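
One way to picture the thread layer is a cooperative scheduler that turns each incoming RPC into a logical thread and runs it without preemption. The sketch below uses a plain closure queue to keep the example short; a real non-preemptive package such as MRPC's would switch stacks and support yielding.

  // Minimal sketch of a non-preemptive thread layer for RPC handlers
  // (illustrative only; MRPC ships its own custom threads package).
  #include <deque>
  #include <functional>
  #include <utility>

  class CoopScheduler {
  public:
      // Each incoming RPC becomes one logical thread; here, simply a closure.
      void spawn(std::function<void()> handler) {
          ready_.push_back(std::move(handler));
      }
      // Run handlers one at a time; nothing is preempted, so the runtime's own
      // data structures need no locking between handlers.
      void run() {
          while (!ready_.empty()) {
              std::function<void()> t = std::move(ready_.front());
              ready_.pop_front();
              t();   // runs until it returns (a real package would also allow yields)
          }
      }
  private:
      std::deque<std::function<void()>> ready_;
  };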

Page 13: Evaluating the Performance Limitations of MPMD Communication

Message Reception

Message reception is not receiver-initiated
  Software interrupts: very expensive

versus… MPI: several different ways to receive a message (poll, post, etc.)
SPMD: the user typically identifies communication phases into which cheap polling can be inserted easily

MRPC: a polling thread (sketched below)

13
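
The polling thread can be pictured as the loop below; am_poll and thread_yield are placeholder names standing in for the real Active Messages poll call and the cooperative yield, not actual API functions.

  // Sketch of message reception via a dedicated polling thread (placeholder
  // helpers, not the real AM/MRPC API). Instead of relying on expensive
  // software interrupts, one cooperative thread polls the network and yields,
  // letting computation threads run between incoming messages.
  static bool running = true;

  static void am_poll()      { /* would drain pending Active Messages and run
                                   the dispatcher handler for each request */ }
  static void thread_yield() { /* would switch to the next ready thread in the
                                   non-preemptive threads package */ }

  void polling_thread() {
      while (running) {
          am_poll();        // deliver incoming RPC requests and replies
          thread_yield();   // hand the CPU back to computation threads
      }
  }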

Page 14: Evaluating the Performance Limitations of MPMD Communication

CC++ over MRPC

14

CC++: caller

  gpA->foo(p,i);

C++ caller stub (generated by the compiler):

  (endpt.InitRPC(gpA, “entry_foo”),
   endpt << p, endpt << i,
   endpt.SendRPC(),
   endpt >> retval,
   endpt.Reset());

CC++: callee

  global class A {
    . . . };

  double A::foo(int p, int i) {
    . . . }

C++ callee stub (generated by the compiler):

  A::entry_foo(. . .) {
    . . .
    endpt.RecvRPC(inbuf, . . . );
    endpt >> arg1; endpt >> arg2;
    double retval = foo(arg1, arg2);
    endpt << retval;
    endpt.ReplyRPC();
    . . . }

MRPC interface: InitRPC, SendRPC, RecvRPC, ReplyRPC, Reset

Page 15: Evaluating the Performance Limitations of MPMD Communication

Micro-benchmarks

  Null RPC:                              AM: 55 μs (1.0)          CC++/MRPC: 87 μs (1.6)    Nexus/MPL: 240 μs (4.4)   (DCE: ~50 μs)
  Global pointer read/write (8 bytes):   Split-C/AM: 57 μs (1.0)  CC++/MRPC: 92 μs (1.6)
  Bulk read (160 bytes):                 Split-C/AM: 74 μs (1.0)  CC++/MRPC: 154 μs (2.1)
  IBM MPI-F and MPL (AIX 3.2.5): 88 μs

Basic communication costs in CC++/MRPC are within 2x of Split-C/AM and other messaging layers.

15

Page 16: Evaluating the Performance Limitations of MPMD Communication

Applications

16

App                  Split-C/AM   CC++/Nexus        CC++/MRPC
em3d-ghost 800       6.9 s        464 s (67.2x)     16.9 s (2.4x)
water-pref 512 mol   0.75 s       12.3 s (16.4x)    2.6 s (3.4x)
FFT 1M               0.78 s       23.1 s (29.6x)    2.8 s (3.6x)
LU 512               0.81 s       15.5 s (19.1x)    2.9 s (3.6x)

3 versions of EM3D, 2 versions of Water, LU and FFT
CC++ versions based on original Split-C code
Runs taken for 4 and 8 processors on the IBM SP-2

Page 17: Evaluating the Performance Limitations of MPMD Communication

Water

17

[Bar chart: execution-time breakdown for Water, Atomic 512 and Prefetch 512, comparing Split-C (SC-4, SC-8) with CC++/MRPC (CC-4, CC-8). Each bar is split into cpu, net, thread mgmt, thread sync, and marsh+copy; y-axis 0.00 to 4.00 s, with labeled values 5.58, 4.84, 3.50, and 3.44.]

Page 18: Evaluating the Performance Limitations of MPMD Communication

Discussion

CC++ applications perform within a factor of 2 to 6 of Split-C
  An order of magnitude improvement over the previous implementation

Method name resolution
  Constant cost, almost negligible in the applications

Threads account for ~25-50% of the gap, including:
  Synchronization (~15-35% of the gap) due to thread safety
  Thread management (~10-15% of the gap), 75% of it context switches

Argument marshalling and data copy
  A large fraction of the remaining gap (~50-75%)
  An opportunity for compiler-level optimizations

18

Page 19: Evaluating the Performance Limitations of MPMD Communication

Related Work

Lightweight RPC
  LRPC: RPC specialization for the local case

High-performance RPC in MPPs
  Concert, pC++, ABCL

Integrating threads with communication
  Optimistic Active Messages, Nexus

Compilation techniques
  Specialized frame management and calling conventions, lazy threads, etc. (Taura’s PLDI ’97)

19

Page 20: Evaluating the Performance Limitations of MPMD Communication

Conclusion

Possible to implement an RPC-based MPMD system that is competitive with SPMD systems on homogeneous MPPs

  Same order-of-magnitude performance
  Trade-off between generality and performance

Questions remaining:
  Scalability to larger numbers of nodes
  Integration with a heterogeneous runtime infrastructure

Slides: http://www.cs.cornell.edu/home/chichao

MRPC, CC++ apps source code: [email protected]

20