New York University MILAN
NYU
Bridging the Gap Between Distributed Shared Memory and
Message Passing
Holger Karl
Department of Computer Science
Courant Institute of Mathematical Sciences
New York University
Partially supported by DARPA/Rome laboratory and Deutsche Forschungsgemeinschaft
New York University MILAN
NYUOverview
Motivation
Related approaches
Charlotte
Annotating Charlotte
Some experiments
Summary
New York University MILAN
NYUMotivation
Scenario Distributed computing using the World Wide Web Use Java to overcome heterogeneity, security concerns and
program installation problems Idea: volunteer computing
– Applets can participate in distributed applications
Challenges Only make use of standard Web browsers/Java virtual machines Consider the Web’s and Java’s idiosyncrasies
– Inherent faultiness, machines come and go
– Host-of-origin imposes a star-like communication topology
Provide a simple programming interface, e.g., DSM No hardware support (e.g., page protection) available Do not forget efficiency
New York University MILAN
NYURelated Approaches
Standard browser Programminginterface
Fault tolerance
JPVM No Message passing Possible
ATLAS No Work stealing Yes
JavaParty Yes (with RMI) Remote objects No
ParaWeb Yes Message Passing NoNo DSM No
Java/DSM No DSM No
Javelin Yes Infrastructuresupport
Possible
Charlotte(PDCS ’96)
Yes DSM Yes
New York University MILAN
NYUCharlotte - Programs
Alternating sequential and parallel steps Sequential steps
executed by a manager application
Parallel steps executed by worker applets
Routines are defined in parallel steps
Routines are methods of objects, derived from a Charlotte class Droutine
public class Matrix extends Droutine { // define routine public void drun (int n, int Id){ // do computation } … public void run () { … parBegin(); addRoutine (this, Size); parEnd(); … }}
New York University MILAN
NYUCharlotte - Memory Semantics
Memory partitioned into private and shared parts Shared memory with CRCW-common semantics Implemented at object level
Charlotte provides classes for every primitive type: int Dint, float Dfloat, etc
Manager has master copy of memory Shared objects have get() and set() methods
– get() brings invalid data from manager if necessary, amortized by bringing “pages” of shared data from the manager
– set() marks data as modified to be flushed to manager at end of parallel step
Updates from different routines incorporated atomically at end of parallel step
Allows eager scheduling for fault tolerance Routines can be executed multiple times without violating exactly
once semantics
New York University MILAN
NYUCharlotte - Matrix Multiplication
public class Matrix extends Droutine { // define routine public void drun (int n, int i){ for (int j=0; j<Size; j++){ sum = 0; for (int k=0; k<Size; k++) sum += A[i][k].get() * B[k][j].get(); C[i][j].set(sum); } …}
New York University MILAN
NYUCharlotte - Pros and Cons
Pros Simple programming model Well-defined DSM semantics Adaptive parallelism Fault tolerant with respect to worker crashes
Cons Efficiency problems
– Shared data is accessed via method invocations
– Loading data from manager incurs latency
– Choosing good page size is difficult
– Sending modified data back to manager requires inspection of status flags
Reason: runtime system does not know which data is read or modified by routines
Make this information explicit in Charlotte programs!
New York University MILAN
NYUAnnotations
Specify read and write
locations for routines
Use methods of class
Droutine
dloc_read()
dloc_write()
First step: Correctness-
preserving annotations
Only send read data at
beginning of routine
public class Matrix extends Droutine {
public void drun (int n, int i) {…}
public Locations dloc_read (int n, int i) { Locations loc = new Locations(); loc.add (B); loc.add (A[i]); return loc; } …}
New York University MILAN
NYURelying on Annotations
Second step: Rely on annotations Data transfer between manager and worker happens only
according to annotations No if-statement in get() necessary No status update in set() necessary No latencies for requesting data updates But, wrong annotations lead to wrong results
Use Uint (unchecked) instead of Dint, etc. Identical interface Remaining overhead: method invocation for data access
Charlotte’s distributed types (Dint) and unchecked distributed types (Uint) can be freely mixed
Simple to switch back and forth
New York University MILAN
NYUSharing Primitive Types
Third step: Use primitive types (int) instead of objects (Uint) Possible since get() and set() are trivial for Uint
Possible since annotations describe data movement completely
Avoids method invocation overhead
But: different interface– No get()/set() syntactic changes
– Call-by-reference / call-by-value
Particularly suited to arrays of distributed types
Freely mix Dint, Uint and shared int
New York University MILAN
NYUAdditional Optimizations
Manager keeps track of workers’ valid data sets Use routines’ read sets to select routine to give to worker
Choose routine with minimal amount of missing data Example: 3 routines, 2 workers
Data can be cached at workers between parallel steps Requires data to be declared unmodified at the beginning of
parallel step
Worker 1
Worker 2
valid: 1-100
valid: 150-200
read_set:20-80
90-150
160-210
New York University MILAN
NYUDiscussion
Gradual incorporation of semantic knowledge into Charlotte programs Large flexibility to mix different levels of annotation semantics in one
program Easy to switch back and forth among these levels Close to message passing behavior Charlotte’s nice properties are preserved (fault tolerance, adaptive
parallelism) Is it efficient?
pureDSM
DSM+hints
DSM+unchecked
objects
DSM+shared
primitives
annotationswith correct-ness check
correctness-assuming
annotations
primitive typesinstead of objects
Messagepassing
programtransformation
New York University MILAN
NYUAnnotating DSM Code
Munin Annotating data with expected access patterns, system chooses
appropriate consistency protocols
Aurora Object-based system in C++, different consistency models are
dynamically selected at runtime
CRL, Cid Library of C-functions, explicit mapping of shared memory in local
memory
Jade Annotations used to extract parallelism from sequential code
All lack flexibility to mix different guarantee levels
New York University MILAN
NYUExperiments - Setup
Setup: Multiplication of 200x200 integer matrices
PentiumPro 200 at NYU with FastEthernet (100MBit/sec)
Pentium 90 at Humboldt University Berlin
Kaffe Virtual Machine Version 0.92 (Java JIT compiler) under Linux
Sequential runtimes: 8.1 sec. (P90), 2.3 sec. (PPro200)
Ping between NYU and HUB: typically 130 msec.
New York University MILAN
NYUExperiments
Multiply two 200x200 integer matrices
Runtimes for Standard Charlotte (Dint) Standard Charlotte plus
correctness-preserving annotations (Dint+A)
Unchecked classes (Uint) Shared primitive types (int) A message passing
implementation (mes.pas.)
Shared primitive types competitive with message passing
Reasonable absolute speedups (compared to sequential runtime)
0
5
10
15
20
25
1 2 3 4No. of Workers
Tim
e (s
ec.)
Dint Dint+A Uint int mes.pas.
0
0.5
1
1.5
2
2.5
1 2 3 4No. of Workers
Ab
solu
te S
pee
du
p
Dint Dint+A Uint int mes.pas.
New York University MILAN
NYUExperiments (contd.)
Ratios between various optimizations
Annotations/Presending give a factor of about three
Getting rid of objects is another factor of two
Altogether: up to nine times faster than standard Charlotte
And better scalability
0
2
4
6
8
10
1 2 3 4No. of Workers
Rat
io
Dint/Dint+A Dint/Uint Dint/int
New York University MILAN
NYUExperiments (contd.)
Manager at NYU, worker at HU Berlin
Similar behavior, if slightly smaller improvement
0
20
40
60
80
100
1 2No. of Workers
Tim
e (
sec.
)
Dint Dint+A int mes.pas.
01234567
1 2No. of Workers
Rat
io
Dint/Dint+A Dint/int Dint/mes.pas.
New York University MILAN
NYUExperiments (contd.)
Colocation and Caching Example: multiply matrix A
by two different matrices B1 and B2
Small impact in a LAN environment
Up to 25% improvement for high-latency connections
Colocation results in smaller standard deviation of runtimes
024
68
1012
141618
NYU (comp.) NYU Berlin (comp.) Berlin
Worker Location
Tim
e (s
ec.)
Dint+Ann. Caching Colocation
New York University MILAN
NYUFuture Work
Investigating more elaborate examples Data with both regular and irregular access patterns
Compiler-generated annotations
New programming interface no nested parallelism
parBegin() / parEnd() makes some annotations awkward
Overlapping communication/computation Multiple worker threads in one virtual machine
Applying annotations to Calypso (a page-based DSM system)
Down the road: use annotations for QoS
New York University MILAN
NYUSummary
Annotations to describe read and write sets of parallel routines Correctness-preserving
Correctness-sensitive
Or directly sharing primitive types
Possible to mix these different levels in a single program and to switch back and forth between them
Big flexibility for programmer
Maintains Charlotte’s advantages like adaptive parallelism and fault tolerance
Performance competitive with message passing programs