Distributed Shared Memory: A Survey of Issues and Algorithms
B. Nitzberg and V. Lo, University of Oregon
INTRODUCTION
• Distributed shared memory is a software abstraction allowing a set of workstations connected by a LAN to share a single paged virtual address space
Why bother with DSM?
• Key idea is to build fast parallel computers that are
– Cheaper than shared memory multiprocessor architectures
– As convenient to use
(Figure: a conventional parallel architecture, with several CPUs, each behind its own cache, connected to a single shared memory.)
Today’s architecture
• Clusters of workstations are much more cost effective
– No need to develop complex bus and cache structures
– Can use off-the-shelf networking hardware
• Gigabit Ethernet
• Myrinet (1.5 Gb/s)
– Can quickly integrate newest microprocessors
Limitations of cluster approach
• Communication within a cluster of workstations is through message passing
– Much harder to program than concurrent access to a shared memory
• Many big programs were written for shared memory architectures
– Converting them to a message passing architecture is a nightmare
Distributed shared memory
DSM = one shared global address space mapped onto the main memories of the individual workstations
Distributed shared memory
• DSM makes a cluster of workstations look like a shared memory parallel computer– Easier to write new programs– Easier to port existing programs
• Key problem is that DSM only provides the illusion of having a shared memory architecture
– Data must still move back and forth among the workstations
Basic approaches
• Hardware implementations:
– Use extensions of traditional hardware caching architectures
• Operating system/library implementations:
– Use virtual memory mechanisms
• Compiler implementations:
– Compiler handles all shared accesses
Design Issues (I)
1. Structure and granularity
– Big units are more efficient
• Virtual memory pages
– Can have false sharing whenever a page contains different variables that are accessed at the same time by different processors
False Sharing
(Figure: one processor accesses x while another accesses y; both variables sit on the same page.)
The page containing x and y will move back and forth between the main memories of the workstations
Design Issues (II)
1. Structure and granularity (cont'd)
– Shared objects can also be
• Objects from a distributed object-oriented system
• Data types from an existing language
Design Issues (III)
2. Coherence semantics
– Strict consistency is not possible
– Various authors have proposed weaker consistency models
• Cheaper to implement
• Harder to use in a correct fashion
Design Issues (IV)
3. Scalability
– Possibly very high but limited by
• Central bottlenecks
• Global knowledge operations and storage
Design Issues (V)
4. Heterogeneity
– Possible but complex to implement
Portability Issues
• Portability of programs
– Some DSMs allow programs written for a multiprocessor architecture to run on a cluster of workstations without any modifications (dusty decks)
– More efficient DSMs require more changes
• Portability of DSM
– Some DSMs require specific OS features
(Not in paper)
Implementation Issues (I)
1. Data location and access:
• Keep data in a single centralized location
• Let data migrate (better) but must have a way to locate them
• Centralized server (bottleneck)
• Have a "home" node associated with each piece of data
• Will keep track of its location
Implementation Issues (II)
1. Data location and access (cont'd):
• Can either
• Maintain a single copy of each piece of data
• Replicate it on demand
• Must then either
• Propagate updates to all replicas
• Use an invalidation protocol
Invalidation protocol
• Before update: the three workstations all hold X = 0
• At update time: the writer's copy becomes X = 5; the two other copies are marked INVALID
Main advantage
• Locality of updates:
– A page that is being modified has a high likelihood of being modified again
• Invalidation mechanism minimizes consistency overhead
– One single invalidation replaces many updates
A realization: Munin
• Developed at Rice University
• Based on software objects (variables)
• Used the processor's virtual memory to detect accesses to the shared objects
• Included several techniques for reducing consistency-related communication
• Only ran on top of the V kernel
Munin main strengths
• Excellent performance
• Portability of programs
– Allowed programs written for a multiprocessor architecture to run on a cluster of workstations with a minimum number of changes (dusty decks)
Munin main weakness
• Very poor portability of Munin itself
– Depended on some features of the V kernel
• Not maintained since the late 80's
Consistency model
• Munin uses software release consistency
– Only requires the memory to be consistent at specific synchronization points
SW release consistency (I)
• Well-written parallel programs use locks to achieve mutual exclusion when they access shared variables
– P(&mutex) and V(&mutex)
– lock(&csect) and unlock(&csect)
– acquire( ) and release( )
• Unprotected accesses can produce unpredictable results
SW release consistency (II)
• SW release consistency only guarantees the correctness of operations performed within an acquire/release pair
• No need to export the new values of shared variables until the release
• Must guarantee that the workstation has received the most recent values of all shared variables when it completes an acquire
SW release consistency (III)
Processor P1:
  shared int x;
  acquire( );
  x = 1;
  release( );   // export x = 1

Processor P2:
  shared int x;
  acquire( );   // wait for new value of x
  x++;
  release( );   // export x = 2
SW release consistency (IV)
• Must still decide how to release updated values
– Munin uses eager release:
• New values of shared variables are propagated at release time
SW release consistency (V)
Eager release: each release forwards the update to the two other processors.
Multiple write protocol
• Designed to fight false sharing
• Uses a copy-on-write mechanism
• Whenever a process is granted access to write-shared data, the page containing these data is marked copy-on-write
• The first attempt to modify the contents of the page results in the creation of a copy of the original page (the twin)
Creating a twin (not in paper)
• Before: the page contains x = 1 and y = 2
• The first write access creates the twin, an identical copy (x = 1, y = 2)
• After: the page contains x = 3 and y = 2 while the twin still contains x = 1 and y = 2
• Comparing the page with its twin reveals that the new value of x is 3
Other DSM Implementations (I)
• Software release consistency with lazy release (TreadMarks)
– Faster and designed to be portable
• Sequentially-consistent software DSM (IVY):
– Sends messages to other copies at each write
– Much slower
Other DSM Implementations (II)
• Entry consistency (Midway):
– Requires each variable to be associated with a synchronization object (typically a lock)
– Acquire/release operations on a given synchronization object only involve the variables associated with that object
– Requires less data traffic
– Does not handle dusty decks well
Other DSM Implementations (III)
• Structured DSM systems (Linda):
– Offer the programmer a shared tuple space accessed using specific synchronized methods
– Require a very different programming style
TODAY'S IMPACT
• Very low:
– According to W. Zwaenepoel, the truth is that computer clusters are "only suitable for coarse-grained parallel computation" and this is "[a] fortiori true for DSM"
– DSM competed with the OpenMP model and OpenMP won