March 19, 2003
2003 Michigan Technological University

Steven Seidel
Department of Computer Science
Michigan Technological University
steve@mtu.edu



Overview

Background

Collective operations in the UPC language

The V1.0 UPC collectives specification

Relocalization operations

Computational operations

Performance and implementation issues

Extensions

Other work


Background

UPC is an extension of C that provides a partitioned shared memory programming model.

The V1.1 UPC spec was adopted on March 25.

Processes in UPC are called threads.

Each thread has a private (local) address space.

All threads share a global address space that is partitioned among the threads.

A shared object that resides in thread i's partition is said to have affinity to thread i.

If thread i has affinity to a shared object x, it is expected that accesses to x take less time than accesses to shared objects to which thread i does not have affinity.


UPC programming model

shared [5] int A[10*THREADS];
int i;

A[0]=7;
i=3;
A[i]=A[0]+2;

[Figure: A is distributed in blocks of five elements across threads th0, th1, th2 (blocks start at indices 0, 5, 10, 15, 20, 25); each thread holds a private copy of i in local memory. A[0]=7 writes into thread 0's partition, each thread sets its own i to 3, and A[i]=A[0]+2 stores 9 into A[3], which also has affinity to thread 0.]


Collective operations in UPC

If any thread calls a collective function, then all threads must also call that function.

Collective function arguments are single-valued: corresponding arguments have the same value on every thread.

V1.1 UPC contains several collective functions:
  upc_notify and upc_wait
  upc_barrier
  upc_all_alloc
  upc_all_lock_alloc

These collectives provide synchronization and memory allocation across all threads.


shared void *upc_all_alloc(nblocks, nbytes);

This function allocates shared [nbytes] char[nblocks*nbytes].

shared [5] char *p;

p=upc_all_alloc(4,5);

[Figure: the call allocates four 5-byte blocks distributed across threads th0, th1, th2 (blocks start at indices 0, 5, 10, 15); each thread's private pointer p points to the start of the shared allocation.]


The V1.0 UPC Collectives Spec

First draft by Wiebel and Greenberg, March 2002.

Spec discussed at the May 2002 and SC'02 UPC workshops.

Many helpful comments from Dan Bonachea and Brian Wibecan.

V1.0 will be released shortly.


Collective functions

Initialization
  upc_all_init

"Relocalization" collectives change data affinity.
  upc_all_broadcast
  upc_all_scatter
  upc_all_gather
  upc_all_gather_all
  upc_all_exchange
  upc_all_permute

"Computational" collectives for reduction and sorting.
  upc_all_reduce
  upc_all_prefix_reduce
  upc_all_sort


void upc_all_broadcast(dst, src, blk);

Thread 0 sends the same block of data to each thread.

shared [] char src[blk];
shared [blk] char dst[blk*THREADS];

[Figure: thread 0's src block of blk bytes is copied into each thread's block of dst.]


void upc_all_scatter(dst, src, blk);

Thread 0 sends a unique block of data to each thread.

shared [] char src[blk*THREADS];
shared [blk] char dst[blk*THREADS];

[Figure: block i of thread 0's src is copied to thread i's block of dst.]


void upc_all_gather(dst, src, blk);

Each thread sends a block of data to thread 0.

shared [blk] char src[blk*THREADS];
shared [] char dst[blk*THREADS];

[Figure: thread i's block of src is copied to block i of dst, which has affinity to thread 0.]


void upc_all_gather_all(dst, src, blk);

Each thread sends one block of data to all threads.

[Figure: every thread receives a copy of every thread's src block in its own portion of dst.]


void upc_all_exchange(dst, src, blk);

Each thread sends a unique block of data to each thread.

[Figure: block j of thread i's src is copied to block i of thread j's dst, a transpose of blocks.]


void upc_all_permute(dst, src, perm, blk);

Thread i sends a block of data to thread perm[i].

[Figure: with perm = (1, 2, 0), thread 0's src block goes to thread 1, thread 1's to thread 2, and thread 2's to thread 0.]


Computational collectives

Reduce and prefix reduce
  One function for each C scalar type, e.g., upc_all_reduceI(…) returns an integer
  Operations: +, *, &, |, XOR, &&, ||, min, max, or a user-defined binary function

Sort
  User-defined comparison function

void upc_all_sort(shared void *A,
                  size_t size, size_t n, size_t blk,
                  int (*func)(shared void *, shared void *));


int upc_all_reduceI(src, UPC_ADD, n, blk, NULL);

shared [3] int src[4*THREADS];
int i;

i=upc_all_reduceI(src,UPC_ADD,12,3,NULL);

Thread 0 receives src[0] + src[1] + … + src[n-1].

[Figure: src holds the 12 values 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048 in blocks of three across th0, th1, th2; each thread forms a partial sum (3591, 56, 448) and thread 0 receives the total, 4095, in i.]


void upc_all_prefix_reduceI(dst, src, UPC_ADD, n, blk, NULL);

shared [*] int src[3*THREADS], dst[3*THREADS];

Thread k receives src[0] + src[1] + … + src[k] in dst[k].

[Figure: src holds 1, 2, 4, 8, 16, 32, 64, 128, 256 in blocks of three (one block per thread, starting at indices 0, 3, 6); dst receives the prefix sums 1, 3, 7, 15, 31, 63, 127, 255, 511 in the same layout.]


Performance and implementation issues

"Push" or "pull"?

Synchronization semantics

Effects of data distribution


A "pull" implementation of upc_all_broadcast

void upc_all_broadcast( shared void *dst,
                        shared const void *src, size_t blk )
{
    upc_memcpy( (shared char *)dst + MYTHREAD,
                (shared char *)src, blk );
}

[Figure: each thread copies thread 0's src block into its own block of dst.]


A "push" implementation of upc_all_broadcast

void upc_all_broadcast( shared void *dst,
                        shared const void *src, size_t blk )
{
    int i;
    upc_forall( i=0; i<THREADS; ++i; 0 )  // Thread 0 only
        upc_memcpy( (shared char *)dst + i,
                    (shared char *)src, blk );
}

[Figure: thread 0 copies its src block into each thread's block of dst.]


Synchronization semantics

When are function arguments ready?

When are function results available?


Synchronization semantics

Arguments with affinity to thread i are ready when thread i calls the function; results with affinity to thread i are ready when thread i returns.

This is appealing but it is incorrect: in a broadcast, thread 1 does not know when thread 0 is ready.

[Figure: threads 1 and 2 pull thread 0's src block into their own blocks of dst.]


Synchronization semantics

Require the implementation to provide barriers at function entry and exit.

This is convenient for the programmer but it is likely to adversely affect performance.

void upc_all_broadcast( shared void *dst,
                        shared const void *src, size_t blk )
{
    upc_barrier;  // pull
    upc_memcpy( (shared char *)dst + MYTHREAD,
                (shared char *)src, blk );
    upc_barrier;
}


Synchronization semantics

V1.0 spec: Synchronization is a user responsibility.

#define numelems 10
shared [] int A[numelems];
shared [numelems] int B[numelems*THREADS];

void upc_all_broadcast( shared void *dst,
                        shared const void *src, size_t blk )
{
    upc_memcpy( (shared char *)dst + MYTHREAD,
                (shared char *)src, blk );
}
...
// Initialize A.
...
upc_barrier;
upc_all_broadcast( B, A, sizeof(int)*numelems );
upc_barrier;


Performance and implementation issues

Data distribution affects both performance and implementation.


void upc_all_prefix_reduceI(dst, src, UPC_ADD, n, blk, NULL);

shared int src[3*THREADS], dst[3*THREADS];

Thread k receives src[0] + src[1] + … + src[k] in dst[k].

[Figure: with the default cyclic layout (block size 1), the source values 1, 2, 4, 8, 16, 32, 64, 128, 256 are spread one element per thread in round-robin order, so the prefix sums 1, 3, 7, 15, 31, 63, 127, 255, 511 land on alternating threads and each partial sum depends on values held by other threads.]


Extensions

Strided copying

Vectors of offsets for src and dst arrays

Variable-sized blocks

Reblocking (cf. the preceding example of prefix reduce)

  shared int src[3*THREADS];
  shared [3] int dst[3*THREADS];

  upc_forall(i=0; i<3*THREADS; i++; ?)
      dst[i] = src[i];


More sophisticated synchronization semantics

Consider the "pull" implementation of broadcast.

There is no need for arbitrary threads i and j (i, j != 0) to synchronize with each other.

Each thread does a pairwise synchronization with thread 0.

Thread i will not have to wait if it reaches its synchronization point after thread 0.

Thread 0 returns from the call after it has sync’d with each thread.


What's next?

The V1.0 collective spec will be adopted in the next few weeks.

A reference implementation will be available from MTU immediately afterwards.


MuPC run time system for UPC

UPC memory model (Chuck Wallace)

UPC programmability (Phil Merkey)

UPC test suite (Phil Merkey)

http://www.upc.mtu.edu