
Page 1

COSC 6374

Parallel Computation

Remote Direct Memory Access

Edgar Gabriel

Fall 2015

Communication Models

(Figure: three communication models)

• Message Passing Model: P0 and P1, buffers A and B, explicit send / receive

• Remote Memory Access: P0 and P1, buffers A and B, one-sided put

• Shared Memory Model: P0 and P1, shared data, assignment A = B

Page 2

Data Movement

(Figure: on each node, data moves between CPU, memory and NIC)

Message Passing Model:

• Two-sided communication

Remote Memory Access:

• One-sided communication

Remote Direct Memory Access

• Direct Memory Access (DMA) allows data to be sent directly from an attached device to the memory on the computer's motherboard.

• The CPU is freed from involvement with the data transfer, thus speeding up overall computer operation.

• Remote Direct Memory Access (RDMA): two or more computers communicate directly from the main memory of one system to the main memory of another.

Page 3

One-sided communication in MPI

• MPI-2 defines one-sided communication:

– A process can put some data into the main memory of another process (MPI_Put)

– A process can get some data from the main memory of another process (MPI_Get)

– A process can perform some operations on a data item in the main memory of another process (MPI_Accumulate)

• Target process not actively involved in the communication
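Of these three calls, only MPI_Put and MPI_Get appear in the examples later in the slides. As an illustration, a minimal MPI_Accumulate sketch (not from the slides); it assumes an already created window win in which every process exposes a single int, and an open access epoch (window creation and synchronization are introduced on the following slides):

    int myvalue = 42;     // local contribution, illustrative value
    int target  = 0;      // rank whose exposed int is updated

    MPI_Accumulate ( &myvalue, 1, MPI_INT,   // origin: address, count, type
                     target,                 // rank of the target process
                     0, 1, MPI_INT,          // target: disp, count, type
                     MPI_SUM,                // operation applied at the target
                     win );

The target process issues no matching call; the contribution is combined with the target's exposed memory using the operation MPI_SUM.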

RDMA in MPI

• Problems:

– How can a process define which part of its main memory is available for RDMA?

– How can a process define when this part of the main memory is available for RDMA?

– How can a process define who is allowed to access its memory?

– How can a process define which elements in a remote memory it wants to access?

Page 4: COSC 6374 Parallel Computation Remote Direct Memory Accessgabriel/courses/cosc6374_f15/ParCo_24_RDMA.pdf · Remote Direct Memory Access •Direct Memory Access (DMA) allows data to

4

The window concept of MPI-2 (I)

• An MPI_Win defines the group of processes allowed to access a certain memory area

• Arguments:

– base: Starting address for the public memory region

– size: size of the public memory area in bytes

– disp_unit: displacement unit in bytes; a target address is computed as base + disp * disp_unit

– info: hint to the MPI library on how the window will be used (e.g. only reading or only writing)

– comm: communicator defining the group of processes allowed to access the memory window

MPI_Win_create (void *base, MPI_Aint size, int disp_unit,
                MPI_Info info, MPI_Comm comm, MPI_Win *win);
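A minimal sketch (not from the slides) of creating and freeing such a window; the buffer, its size and the "no_locks" info hint are assumptions chosen for the example:

    double   buf[100];                          // memory to be exposed
    MPI_Info info;
    MPI_Win  win;

    MPI_Info_create ( &info );
    MPI_Info_set ( info, "no_locks", "true" );  // hint: no passive target locking

    MPI_Win_create ( buf,                       // base of the public region
                     100 * sizeof(double),      // size in bytes
                     sizeof(double),            // disp_unit: displacements are
                                                // counted in doubles
                     info, MPI_COMM_WORLD, &win );
    MPI_Info_free ( &info );

    /* ... access and exposure epochs, see the following slides ... */

    MPI_Win_free ( &win );

MPI_Win_create is collective over the communicator, so every process of MPI_COMM_WORLD has to call it, even if it contributes a window of size 0.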

The window concept of MPI-2 (II)

• Definition of a temporal window:

– Access Epoch: time slot in which a process accesses remote memory of another process

– Exposure Epoch: time slot in which a process allows access to one of its memory windows by other processes

• Does a process have control over when other processes access its memory window?

– yes: active target communication

– no: passive target communication

Page 5: COSC 6374 Parallel Computation Remote Direct Memory Accessgabriel/courses/cosc6374_f15/ParCo_24_RDMA.pdf · Remote Direct Memory Access •Direct Memory Access (DMA) allows data to

5

Active Target Communication (I)

• Synchronization of all operations within a window

– collective across all processes of win

– No difference between access and exposure epoch

– Starts or closes an access and exposure epoch

• Arguments

– assert: Hint to the library on the usage (default: 0)

MPI_Win_fence ( int assert, MPI_Win win);
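For illustration, a sketch (not from the slides) of a fence-delimited epoch using the optional assert hints; passing 0, as done throughout these slides, is always valid:

    MPI_Win_fence ( MPI_MODE_NOPRECEDE, win );  // no RMA calls precede this fence

    /* ... MPI_Put / MPI_Get / MPI_Accumulate calls on win ... */

    MPI_Win_fence ( MPI_MODE_NOSUCCEED, win );  // no RMA calls follow this fence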

Data exchange (I)

• A single process controls the data parameters of both processes

• Put data described by (oaddr, ocount, otype) into the main memory of the process with rank rank, in the window win, at the position (base + disp*disp_unit, tcount, ttype)

– base and disp_unit have been defined in MPI_Win_create

– Value of base and disp_unit not known by the process calling MPI_Put!

MPI_Put (void *oaddr, int ocount, MPI_Datatype otype, int rank, MPI_Aint disp, int tcount, MPI_Datatype ttype, MPI_Win win);

Page 6

Example: Ghost-cell update

Parallel matrix-vector multiply for band matrices: a 4x4 band matrix, with the rows of the system A*x = rhs distributed over two processes.

(Figure: rows 1-2 of A, x and rhs on Process 0; rows 3-4 on Process 1)

Process 0:
  50*x1 + 30*x2         = rhs1
  20*x1 + 50*x2 + 30*x3 = rhs2

Process 1:
  20*x2 + 50*x3 + 30*x4 = rhs3
  20*x3 + 50*x4         = rhs4

Process 0 needs x3, Process 1 needs x2.

Example: Ghost-cell update (II)

• Ghost cells: (read-only) copy of elements held by another process

• Ghost-cells for 2-D matrices: additional row of data

(Figure: 1-D example with ghost cells x2 and x3 shared between Process 0 and Process 1; 2-D example with the domain split into blocks of nxlocal x ny data points across Process 0, Process 1 and Process 2)

Page 7

Example: Ghost-cell update (III)

• Data structure: u[i][j] is stored in a matrix

• nxlocal: no. of data points in x direction, ny: no. of data points in y direction

• Extent of variable u: u[nxlocal+2][ny], with u[1:nxlocal][0:ny-1] containing the local data
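Since the window created on the next slide spans (nxlocal+2)*ny doubles starting at u, the array has to be one contiguous block of memory. A minimal allocation sketch (one possible layout, not prescribed by the slides; C99 variably modified type used for the 2-D indexing u[i][j]):

    #include <stdlib.h>

    // contiguous (nxlocal+2) x ny block; row 0 and row nxlocal+1 are the
    // ghost rows, rows 1..nxlocal hold the local data
    double (*u)[ny] = malloc ( (nxlocal + 2) * ny * sizeof(double) );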

Example: Ghost-cell update (IV)

// disp_unit = 1: the displacements below are byte offsets;
// MPI_COMM_WORLD assumed as the communicator
MPI_Win_create ( u, (nxlocal+2)*ny*sizeof(double), 1,
                 MPI_INFO_NULL, MPI_COMM_WORLD, &win );

MPI_Win_fence ( 0, win );
// first local row -> ghost row nxlocal+1 of process rank-1
MPI_Put ( &u[1][0], ny, MPI_DOUBLE, rank-1,
          (nxlocal+1)*ny*sizeof(double), ny, MPI_DOUBLE, win );
// last local row -> ghost row 0 of process rank+1
MPI_Put ( &u[nxlocal][0], ny, MPI_DOUBLE, rank+1,
          0, ny, MPI_DOUBLE, win );
MPI_Win_fence ( 0, win );

MPI_Win_free ( &win );

Page 8

Comments to the example

• Modifications to the data items might only be visible after closing the corresponding epochs

– No guarantee whether the data item is really transferred during MPI_Put or during MPI_Win_fence

• If multiple processes modify the very same memory address at the very same process, no guarantees are given on which data item will be visible.

– Responsibility of the user to get it right

Passive Target Communication

• MPI_Win_lock starts an access epoch to access the main memory of the process with rank rank

• All RDMA operations between a lock/unlock appear atomic

• lock_type: MPI_LOCK_EXCLUSIVE or MPI_LOCK_SHARED

• Updates to the local memory exposed through the MPI window should also happen using MPI_Win_lock/MPI_Put (see the sketch below)

– Otherwise the access order between local update and RDMA access is undefined (race condition)

MPI_Win_lock (int lock_type, int rank, int assert, MPI_Win win);

MPI_Win_unlock (int rank, MPI_Win win);
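The recommendation above (local updates through MPI_Win_lock/MPI_Put) could look like the following sketch; myrank, idx, newval and a window over an array of doubles with byte displacements (disp_unit 1) are assumptions of the example:

    double newval = 3.14;

    MPI_Win_lock ( MPI_LOCK_EXCLUSIVE, myrank, 0, win );
    MPI_Put ( &newval, 1, MPI_DOUBLE,
              myrank,                 // target: the calling process itself
              idx * sizeof(double),   // displacement in bytes
              1, MPI_DOUBLE, win );
    MPI_Win_unlock ( myrank, win );   // completes the put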

Page 9

Example: Ghost-cell update (V)

// disp_unit = 1: the displacements below are byte offsets;
// MPI_COMM_WORLD assumed as the communicator
MPI_Win_create ( u, (nxlocal+2)*ny*sizeof(double), 1,
                 MPI_INFO_NULL, MPI_COMM_WORLD, &win );

MPI_Win_lock ( MPI_LOCK_EXCLUSIVE, rank-1, 0, win );
MPI_Put ( &u[1][0], ny, MPI_DOUBLE, rank-1,
          (nxlocal+1)*ny*sizeof(double), ny, MPI_DOUBLE, win );
MPI_Win_unlock ( rank-1, win );

MPI_Win_lock ( MPI_LOCK_EXCLUSIVE, rank+1, 0, win );
MPI_Put ( &u[nxlocal][0], ny, MPI_DOUBLE, rank+1,
          0, ny, MPI_DOUBLE, win );
MPI_Win_unlock ( rank+1, win );

One-sided vs. Two-sided communication

• One-sided communication doesn't need

– message matching

– unexpected message queues

– and it involves only one of the two processors

 potentially faster!

• One-sided communication in MPI can potentially optimize

– multiple transactions

– between multiple processes

Page 10

Limitations of the MPI-2 model

• Synchronization costs (e.g. MPI_Win_fence) can be significant

• Static model

– Size of a memory window cannot be altered after creating an MPI_Win

– Difficult to support dynamic data structures such as a linked list

• Passive target model has limited usability

– But that is what most other RDMA libraries focus on

• In MPI-3:

– Introduction of dynamic windows

– Extended functionality for passive target operations

Use case: distributed linked list

• A linked list maintained across multiple processes

– E.g. after a global sort operation of all elements

– E.g. having fixed rules for the keys:
  rank 0: keys which start with 'a' to 'd'
  rank 1: keys which start with 'e' to 'h' …

(Figure: list elements distributed across Rank 0, Rank 1 and Rank 2)

Page 11

Use case: Distributed linked list

typedef struct {
    char     key[MAX_KEY_SIZE];
    char     value[MAX_VALUE_SIZE];
    MPI_Aint next_disp;    // together with next_rank: equivalent to the next
    int      next_rank;    // pointer in a non-distributed linked list
    void    *next_local;   // next local element
} ListElem;

// Create an MPI data type describing this structure using
// MPI_Type_create_struct. Not shown here for brevity
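The omitted datatype construction could, for example, be sketched as below. It assumes that only the first four fields are communicated (next_local is only meaningful on the owning process) and that the extent is resized to the full size of the C struct:

    #include <stddef.h>   /* offsetof */

    MPI_Datatype ListElem_type, tmp_type;
    int          blocklens[4] = { MAX_KEY_SIZE, MAX_VALUE_SIZE, 1, 1 };
    MPI_Aint     displs[4]    = { offsetof(ListElem, key),
                                  offsetof(ListElem, value),
                                  offsetof(ListElem, next_disp),
                                  offsetof(ListElem, next_rank) };
    MPI_Datatype types[4]     = { MPI_CHAR, MPI_CHAR, MPI_AINT, MPI_INT };

    MPI_Type_create_struct ( 4, blocklens, displs, types, &tmp_type );
    MPI_Type_create_resized ( tmp_type, 0, sizeof(ListElem), &ListElem_type );
    MPI_Type_commit ( &ListElem_type );
    MPI_Type_free ( &tmp_type );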

Traversing a distributed linked list

// key (search key), myrank, win and the datatype ListElem_type
// are assumed to be set up already
ListElem local_copy, *current;
ListElem *head;            // assumed to be already set
int found = 0;

current = head;

// shared (read-only) lock to all processes that are part of win
MPI_Win_lock_all ( 0, win );
while ( !found ) {
    if ( current->next_rank != myrank ) {
        // next element lives on another process: fetch a private copy
        MPI_Get ( &local_copy, 1, ListElem_type,
                  current->next_rank, current->next_disp,
                  1, ListElem_type, win );
        // enforce the completion of all pending operations to that process
        // without having to release the lock(s)
        MPI_Win_flush ( current->next_rank, win );
        current = &local_copy;
    } else {
        // next_local is only valid for elements owned by this process
        current = current->next_local;
    }
    if ( strcmp (current->key, key) == 0 )
        found = 1;
}
MPI_Win_unlock_all ( win );

Page 12

Inserting elements into a linked list

• Assuming that only the local process is allowed to insert an element (e.g. after a global sort operation)

– Remote processes are only allowed to read elements on other processes

• Requires dynamically allocating memory and extending a memory region

• A dynamic window defines only the participating group of processes

– More than one memory region can be attached to a single window

MPI_Win_create_dynamic (MPI_Info info, MPI_Comm comm, MPI_Win *win);

MPI_Win_attach (MPI_Win win, void *base, MPI_Aint size);

Inserting elements into a linked list (II)

ListElem *t, *t2, *current;    // head, key, value, comm assumed to be set

// create window instance once
MPI_Win_create_dynamic ( MPI_INFO_NULL, comm, &win );

// insert each element into the memory window
t = (ListElem *) malloc ( sizeof(ListElem) );
strncpy ( t->key,   key,   MAX_KEY_SIZE );    // key/value are fixed-size
strncpy ( t->value, value, MAX_VALUE_SIZE );  // arrays in ListElem

current = find_prev_element ( head, key, value );
t2 = current->next_local;
current->next_local = t;
t->next_local = t2;
MPI_Win_attach ( win, t, sizeof(ListElem) );

// add another element (fields and local links set as above)
t = (ListElem *) malloc ( sizeof(ListElem) );
MPI_Win_attach ( win, t, sizeof(ListElem) );

MPI_Barrier ( comm );

Similarly for updating next_rank and next_disp on current and t; a sketch follows below.
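One way to fill in the elided update of next_rank and next_disp, assuming current and t live on the calling process (rank myrank); for a dynamic window the displacement of an attached element is its address as returned by MPI_Get_address:

    MPI_Aint t_disp;

    MPI_Get_address ( t, &t_disp );    // address usable as a displacement in
                                       // the dynamic window

    // t inherits the old remote link of current ...
    t->next_rank = current->next_rank;
    t->next_disp = current->next_disp;

    // ... and current now points to the newly attached element t
    current->next_rank = myrank;
    current->next_disp = t_disp;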