X10 Tutorial

PSC Software Productivity Study

May 23 – 27, 2005

Vivek Sarkar

IBM T.J. Watson Research Center

vsarkar@us.ibm.com

This work has been supported in part by the Defense Advanced Research Projects Agency (DARPA)

under contract No. NBCH30390004.

Acknowledgments

• X10 core team

− Philippe Charles

− Chris Donawa

− Kemal Ebcioglu

− Christian Grothoff

− Allan Kielstra

− Christoph von Praun

− Vivek Sarkar

− Vijay Saraswat

• X10 productivity team

− Catalina Danis

− Christine Halverson

• Additional contributors to PSC Productivity Study

− David Bader

− Bill Clark

− Nick Nystrom

− John Urbanic


Outline

1. What is X10?

• background, status

2. Basic X10 (single place)

• async, finish, atomic

• future, force

3. Basic X10 (arrays & loops)

• points, rectangular regions, arrays

• for, foreach

4. Scalable X10 (multiple places)

• places, distributions, distributed arrays, ateach, BadPlaceException

5. Clocks

• creation, registration, next, resume, drop, ClockUseException

6. Basic serial constructs that differ from Java

• const, nullable, extern

7. Advanced topics

• Value types, conditional atomic sections (when), general regions & distributions

• Refer to language spec for details


What is X10?

• X10 is a new experimental language developed in the IBM PERCS project as part of the DARPA program on High Productivity Computing Systems (HPCS)

• X10’s goal is to provide a new parallel programming model and its embodiment in a high level language that:

1. is more productive than current models,

2. can support higher levels of abstraction better than current models, and

3. can exploit the multiple levels of parallelism and nonuniform data access that are critical for obtaining scalable performance in current and future HPC systems.


X10 status and schedule

• 6/2003 PERCS programming model concept (end of PERCS Phase 1)

• 7/2003 Start of PERCS Phase 2

• 2/2004 Kickoff of X10 as concrete embodiment of PERCS programming model as a new language

• 7/2004 First draft of X10 language specification

• 2/2005 First X10 implementation --- unoptimized single-VM prototype

» Emulates distributed parallelism in a single process

» This is what you will use to run X10 programs this week

• 5/2005 X10 productivity study at Pittsburgh Supercomputing Center

• 7/2005 Results from X10 application & productivity studies

• 2H2005 Revise language based on application & productivity feedback

• 2H2005 Start participation in High Productivity Language “consortium”?

• 1/2006 Second X10 implementation --- optimized multi-VM prototype

• 6/2006 Open source release of X10 reference implementation

• 6/2006 Design completed for production X10 implementation in Phase 3 (end of Phase 2)


Current X10 Environment: Unoptimized Single-VM Implementation

• Foo.x10 --- X10 source program; must contain a class named Foo with a “public static void main(String[] args)” method

• x10c Foo.x10 --- X10 compiler; translates Foo.x10 to Foo.java, then uses javac to generate Foo.class from Foo.java

• Foo.java --- X10 program translated into Java; a // #line pseudocomment in Foo.java specifies the source line mapping in Foo.x10

• Foo.class --- class file generated by javac from Foo.java

• x10 Foo.x10 --- runs the program on the X10 Virtual Machine (JVM + J2SE libraries + X10 libraries + X10 Multithreaded Runtime)

• External DLLs are reached through the X10 extern interface

• Outputs: X10 program output and X10 Abstract Performance Metrics (event counts, distribution efficiency)

Caveat: this is a prototype implementation with many limitations. Please be patient!


Examples of X10 Compiler Error Messages

Case 1: Error message identifies source file and line number; carets indicate the column range

  1) x10c TutError1.x10
  TutError1.x10:8: Could not find field or local variable "evenSum".
  for (int i = 2 ; i <= n ; i += 2 ) evenSum += i;
  ^----^

Case 2: Error message identifies source file, line number, and column range

  x10c: TutError2.x10:4:27:4:27: unexpected token(s) ignored

Case 3: Error message reported by the Java compiler --- look for the #line comment in the .java file to identify the X10 source location

  x10c: C:\vivek\eclipse\workspace\x10\examples\Tutorial\TutError3.java:49:
  local variable n is accessed from within inner class; needs to be declared final


Future X10 Environment

[Figure: layered software stack, top to bottom, with X10 Libraries alongside the X10 language layer.]

• Very High Level Languages (VHLL’s), Domain Specific Languages (DSL’s) --- implicit parallelism, implicit data distributions

• X10 High Level Language --- X10 places: abstraction of explicit control & data distribution

• X10 Deployment --- mapping of places to nodes in HPC Parallel Environment

• HPC Runtime Environment (Parallel Environment, MPI, LAPI, …) --- primitive constructs for parallelism, communication, and synchronization

• HPC Parallel System --- target system for parallel application


Future X10 Environment: Targeting Scalable HPC Parallel Systems

[Figure: scale-out system diagram. Labels: Functional Gigabit Ethernet; Pset 0 through Pset 1023; I/O Node 0 through I/O Node 1023; C-Node 0 through C-Node 63, each with a “Thin” X10 VM; “Thick” X10 VM; “Full” X10 VM; Front-end Nodes; File Servers; Console; interconnect.]

Future X10 Environment: Targeting Scalable HPC Parallel Systems (contd.)

[Figure: the same scale-out system diagram, extended with a node-level view. Additional labels: Memory; L3 Cache; L2 Cache; Proc Cluster; PEs, L1 $; Clusters (scale-out); Multiple cores on a chip; Coprocessors (SPUs).]


X10 vs. Java

• X10 is an extended subset of Java

− Base language = Java 1.4

  • Java 5 features (generics, metadata, etc.) are currently not supported in X10

− Notable features removed from Java

  • Concurrency --- threads, synchronized, etc.

  • Java arrays --- replaced by X10 arrays

− Notable features added to Java

  • Concurrency --- async, finish, atomic, future, force, foreach, ateach, clocks

  • Distribution --- points, distributions

  • X10 arrays --- multidimensional distributed arrays, array reductions, array initializers

  • Serial constructs --- nullable, const, extern, value types

• X10 supports both OO and non-OO programming paradigms


x10.lang standard library

• Java package with “built in” classes that provide support for selected X10 constructs

− Standard types

  • boolean, byte, char, double, float, int, long, short, String

− x10.lang.Object -- root class for all instances of X10 objects

− x10.lang.clock --- clock instances & clock operations

− x10.lang.dist --- distribution instances & distribution operations

− x10.lang.place --- place instances & place operations

− x10.lang.point --- point instances & point operations

− x10.lang.region --- region instances & region operations

• All X10 programs implicitly import the x10.lang.* package, so the x10.lang prefix can be omitted when referring to members of x10.lang.* classes

− e.g., place.MAX_PLACES, dist.factory.block([0:100,0:100]), …

• Similarly, all X10 programs also implicitly import the java.lang.* package

− e.g., X10 programs can use Math.min() and Math.max() from java.lang


Calling foreign functions from X10 programs

• Java methods

− Can be called directly from X10 programs

− Java class will be loaded automatically as part of X10 program execution

− Basic rule: don’t call any method that can perform wait/notify or related thread operations

• Calling synchronized methods is okay

• C functions

− Need to use extern declaration in X10, and perform a System.loadLibrary() call


Resources available in current X10 installation

• Readme.txt --- basic information on X10 installation and usage

• Limitations.txt --- list of known limitations in the current X10 implementation

• etc/standard.cfg --- default configuration information

• examples/ -- root directory for a number of working X10 example programs

− examples/Constructs shows usage of different X10 constructs

− examples/Tutorial contains examples used in this tutorial


X10 Programming Model (Single Place)

Storage classes:

• Immutable Data (I) --- final variables, value type instances

• Shared Heap (H)

• Activity Stacks (S)

[Figure: a single place in which activities, each with its own activity stack (S), share one heap (H) and immutable data (I). The place is locally synchronous: access to the intra-place shared heap is coherent.]

• Activity = lightweight thread

− Main program starts as single activity in Place 0

• Permitted object references (pointers):

− I → H, H → I, I → I, H → H, S → H, S → I

• Prohibited references:

− H → S, I → S, S → S

− No data sharing permitted between parent activity’s stack and child activity’s stack

• Single Place Memory model

− No coherence constraints needed for I and S storage classes

− Guaranteed coherence for H storage class --- all writes to same shared location are observed in same order by all activities

− Largest deployment granularity for a single place is a single SMP


Basic X10 (Single Place)

Core constructs used for intra-place (shared memory) parallel programming:

• Async = construct used to execute a statement in parallel as a new activity

• Finish = construct used to check for global termination of statement and all the activities that it has created

• Atomic = construct used to coordinate accesses to shared heap by multiple activities

• Future = construct used to evaluate an expression in parallel as a new activity

• Force = construct used to check for termination of future


• async <stmt>

− Parent activity creates a new child activity to execute <stmt> in the same place as the parent activity

− An async statement returns immediately – parent execution proceeds immediately to next statement

− Any access to parent’s local data must be through final variables

  • Similar to data access rules for inner classes in Java

• Example

  public class TutAsync {
    const boxedInt oddSum=new boxedInt();
    const boxedInt evenSum=new boxedInt();
    public static void main(String[] args) {
      // Variable n must be declared as final --- its value is passed from parent to child activity
      final int n = 100;
      // async statement
      async for (int i=1 ; i<=n ; i+=2 ) oddSum.val += i;
      for (int j=2 ; j<=n ; j+=2 ) evenSum.val += j;
      . . .


• finish <stmt>

− Execute <stmt> as usual, but wait until all activities spawned (transitively) by <stmt> have terminated before completing the execution of finish <stmt>

− finish traps all exceptions thrown by activities spawned by <stmt>, and throws a wrapping exception after <stmt> has terminated

• Example (see TutAsync.x10):

      . . .
      // finish statement
      finish {
        async for (int i=1 ; i<=n ; i+=2 ) oddSum.val += i;
        for (int j=2 ; j<=n ; j+=2 ) evenSum.val += j;
      }
      // Both oddSum and evenSum have been computed now
      System.out.println("oddSum = " + oddSum.val + " ; evenSum = " + evenSum.val);
    } // main()
  } // TutAsync

Console output:

oddSum = 2500 ; evenSum = 2550
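For readers who know plain Java threads better than X10, the async/finish pattern can be approximated as follows. This is a hedged Java analogue, not X10: the class name AsyncFinishSketch and the method sums are hypothetical, Thread.start() stands in for async, and Thread.join() stands in for reaching the end of the finish block.

```java
public class AsyncFinishSketch {
    // Returns {oddSum, evenSum} for 1..n; the odd sum is computed by a child thread.
    static int[] sums(final int n) {
        final int[] oddSum = {0};   // written only by the child thread
        int evenSum = 0;            // written only by the parent thread
        Thread child = new Thread(() -> {              // plays the role of "async"
            for (int i = 1; i <= n; i += 2) oddSum[0] += i;
        });
        child.start();
        for (int j = 2; j <= n; j += 2) evenSum += j;  // parent keeps working
        try {
            child.join();                              // plays the role of leaving "finish"
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new int[] { oddSum[0], evenSum };
    }

    public static void main(String[] args) {
        int[] s = sums(100);
        System.out.println("oddSum = " + s[0] + " ; evenSum = " + s[1]);
    }
}
```

Unlike finish, join() waits for one named thread only; finish also waits for every activity spawned transitively inside its body.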


Atomic statements & methods

• atomic <stmt>, atomic <method-decl>

• An atomic statement/method is conceptually executed in a single step, while other activities are suspended

− Note: programmer does not manage any locks explicitly

• An atomic section may not include

− Blocking operations

− Creation of activities

• Example (see TutAtomic1.x10):

  finish {
    async for (int i=1 ; i<=n ; i+=2 ) { double r = 1.0d / i ; atomic rSum += r; }
    for (int j=2 ; j<=n ; j+=2 ) { double r = 1.0d / j ; atomic rSum += r; }
  }
  System.out.println("rSum = " + rSum);

Console output:

rSum = 5.187377517639618


Another Example (TutAtomic2.x10)

public class TutAtomic2 {
  const boxedInt a = new boxedInt(100);
  const boxedInt b = new boxedInt(100);
  // Each atomic method updates a and b in a single conceptual step
  public static atomic void incr_a() { a.val++ ; b.val-- ; }
  public static atomic void decr_a() { a.val-- ; b.val++ ; }
  public static void main(String args[]) {
    int sum;
    finish {
      async for (int i=1 ; i<=10 ; i++ ) incr_a();
      for (int i=1 ; i<=10 ; i++ ) decr_a();
    }
    atomic sum = a.val + b.val;
    System.out.println("a+b = " + sum);
  } // main()
} // TutAtomic2

Console output:

a+b = 200
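A rough Java counterpart of this example is sketched below, assuming that synchronized static methods are an acceptable stand-in for X10 atomic methods. The class AtomicSketch and its methods are hypothetical names. The analogy is loose: in X10 the programmer manages no locks, whereas here all three methods share the class lock.

```java
public class AtomicSketch {
    private static int a = 100;
    private static int b = 100;

    // Each method holds the class lock, so a and b always change together.
    static synchronized void incrA() { a++; b--; }
    static synchronized void decrA() { a--; b++; }
    static synchronized int sum() { return a + b; }

    // Runs 10 increments in a child thread and 10 decrements in the caller.
    static int run() {
        Thread child = new Thread(() -> {
            for (int i = 1; i <= 10; i++) incrA();
        });
        child.start();
        for (int i = 1; i <= 10; i++) decrA();
        try {
            child.join();   // wait for the child, as "finish" would
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sum();
    }

    public static void main(String[] args) {
        System.out.println("a+b = " + run());
    }
}
```

Because every update preserves a+b, the result is 200 regardless of how the two threads interleave.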


Future & Force

• future<type> F = future { <expr> }

− Parent activity creates a new asynchronous child activity to evaluate <expr> (in the parent’s own place when, as here, no place is specified)

• <type> value = F.force()

− Caller blocks until return value is obtained from future (and all activities spawned transitively by <expr> have terminated )

• Example (see TutFuture2.x10):

// Note that future<int> and int are different types

future<int> Fi = future { fib(10) } ;

int i = Fi.force();

// Nested future types can also be created (if need be)

future<future<int>> FFj= future { future{fib(100)} };

future<int> Fj = FFj.force();

int j = Fj.force();


Example (TutFuture1.x10)

public class TutFuture1 {
  static int fib(final int n) {
    if ( n <= 0 ) return 0;
    else if ( n == 1 ) return 1;
    else {
      // Evaluate both recursive calls as futures, then force both results
      future<int> fn_1 = future { fib(n-1) };
      future<int> fn_2 = future { fib(n-2) };
      return fn_1.force() + fn_2.force();
    }
  } // fib()
  public static void main(String[] args) {
    System.out.println("fib(10) = " + fib(10));
  } // main()
} // TutFuture1

Example of recursive divide-and-conquer parallelism --- calls to fib(n-1) and fib(n-2) execute in

parallel
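The same divide-and-conquer shape can be sketched in Java with the fork/join framework. This is an analogue under stated assumptions rather than the tutorial's code: FutureFibSketch and Fib are hypothetical names, fork() plays the role of creating a future, and join() plays the role of force().

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class FutureFibSketch {
    // One task per fib(n) call, mirroring one future per recursive call.
    static class Fib extends RecursiveTask<Integer> {
        private final int n;
        Fib(int n) { this.n = n; }

        @Override
        protected Integer compute() {
            if (n <= 0) return 0;
            if (n == 1) return 1;
            Fib fn1 = new Fib(n - 1);
            Fib fn2 = new Fib(n - 2);
            fn1.fork();                      // like: future { fib(n-1) }
            fn2.fork();                      // like: future { fib(n-2) }
            return fn2.join() + fn1.join();  // like: force() on both futures
        }
    }

    static int fib(int n) {
        return ForkJoinPool.commonPool().invoke(new Fib(n));
    }

    public static void main(String[] args) {
        System.out.println("fib(10) = " + fib(10));
    }
}
```

Each task joins only the subtasks it forked itself, which matches the first sufficient condition for deadlock freedom given in the deadlock discussion of this tutorial.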


Parallel Programming Pitfalls: Deadlock

• Deadlock occurs when parallel threads/activities acquire locks or perform other blocking operations in a sequence that creates a dependence cycle

• Java example:

− Thread 0

  • synchronized (Foo.a) { synchronized(Foo.b) { … } }

− Thread 1

  • synchronized (Foo.b) { synchronized(Foo.a) { … } }

• MPI example:

− Process 0:

  • MPI_Recv(recvbuf, count, MPI_REAL, 1, tag, …)

− Process 1:

  • MPI_Recv(recvbuf, count, MPI_REAL, 0, tag, …)


Parallel Programming Pitfalls: Deadlock (contd.)

• X10 guarantee

− Any program written with async, finish, atomic, foreach, ateach, and clock parallel constructs will never deadlock

• Unrestricted use of future and force may lead to deadlock (see examples/Constructs/Future/FutureDeadlock_MustFailTimeout.x10):

− f1 = future { a1() } ;

− f2 = future { a2() };

− int a1() { … f2.force(); … }

− int a2() { … f1.force(); … }

• Restricted use of future and force in X10 can preserve guaranteed freedom from deadlocks

− Sufficient condition #1: ensure that activity that creates the future also performs the force() operation

− Sufficient condition #2: . . .


Parallel Programming Pitfalls: Data Races

• A data race occurs when two (or more) threads/activities can access the same shared location in parallel and at least one of the accesses is a write operation

• Java example:

− Thread 0: a++ ; b-- ;

− Thread 1: a++ ; b--;

− Data race can violate invariant that (a+b) is constant

− Data race may also prevent multiple increments from being combined correctly

• X10 guidelines for avoiding data races

− Use atomic methods and blocks without worrying about deadlock

− Declare data to be read-only (i.e., final or value type instance) whenever possible


Points

• A point is an element of an n-dimensional Cartesian space (n>=1) with integer-valued coordinates e.g., [5], [1, 2], …

− Dimensions are numbered from 0 to n-1

− n is also referred to as the rank of the point

• A point variable can hold values of different ranks e.g.,

− point p; p = [1]; … p = [2,3]; …

• The following operations are defined on a point-valued expression p1

− p1.rank --- returns rank of point p1

− p1.get(i) --- returns element i of point p1

  • Returns element (i mod p1.rank) if i < 0 or i >= p1.rank

− p1.lt(p2), p1.le(p2), p1.gt(p2), p1.ge(p2)

  • Returns true iff p1 is lexicographically <, <=, >, or >= p2

  • Only defined when p1.rank and p2.rank are equal


Example (see TutPoint.x10)

public class TutPoint {
  public static void main(String[] args) {
    point p1 = [1,2,3,4,5];
    point p2 = [1,2];
    point p3 = [2,1];
    System.out.println("p1 = " + p1 + " ; p1.rank = " + p1.rank + " ; p1.get(2) = " + p1.get(2));
    System.out.println("p2 = " + p2 + " ; p3 = " + p3 + " ; p2.lt(p3) = " + p2.lt(p3));
  } // main()
} // TutPoint

Console output:

p1 = [1,2,3,4,5] ; p1.rank = 5 ; p1.get(2) = 3
p2 = [1,2] ; p3 = [2,1] ; p2.lt(p3) = true
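The lexicographic comparison behind p2.lt(p3) can be illustrated with a small Java sketch. It assumes a point is represented as a plain int[] of coordinates; PointSketch and lt are hypothetical names, not part of x10.lang.

```java
public class PointSketch {
    // Lexicographic "less than" on two points of equal rank:
    // the first coordinate that differs decides the result.
    static boolean lt(int[] p1, int[] p2) {
        if (p1.length != p2.length) {
            throw new IllegalArgumentException("ranks differ");
        }
        for (int d = 0; d < p1.length; d++) {
            if (p1[d] != p2[d]) return p1[d] < p2[d];
        }
        return false; // the points are equal
    }

    public static void main(String[] args) {
        System.out.println(lt(new int[] {1, 2}, new int[] {2, 1}));
    }
}
```

Here [1,2] precedes [2,1] because dimension 0 already differs (1 < 2), so dimension 1 is never consulted.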


Rectangular Regions

• A rectangular region is the set of points contained in a rectangular subspace

• A region variable can hold values of different ranks e.g.,

− region R; R = [0:10]; … R = [-100:100, -100:100]; … R = [0:-1]; …

• The following operations are defined on a region-valued expression R

− R.rank = # dimensions in region; R.size() = # points in region

− R.contains(P) = true if region R contains point P

− R.contains(S) = true if region R contains region S

− R.equal(S) = true if region R equals region S

− R.rank(i) = projection of region R on dimension i (a one-dimensional region)

− R.rank(i).low() = lower bound of ith dimension of region R

− R.rank(i).high() = upper bound of ith dimension of region R

− R.ordinal(P) = ordinal value of point P in region R

− R.coord(N) = point in region R with ordinal value = N

− R1 && R2 = region intersection (will be rectangular if R1 and R2 are rectangular)

− R1 || R2 = union of regions R1 and R2 (may not be rectangular)

− R1 – R2 = region difference (may not be rectangular)


Example (see TutRegion.x10)

public class TutRegion {
  public static void main(String[] args) {
    region R1 = [1:10, -100:100];
    System.out.println("R1 = " + R1 + " ; R1.rank = " + R1.rank + " ; R1.size() = " + R1.size() + " ; R1.ordinal([10,100]) = " + R1.ordinal([10,100]));
    region R2 = [1:10,90:100];
    System.out.println("R2 = " + R2 + " ; R1.contains(R2) = " + R1.contains(R2) + " ; R2.rank(1).low() = " + R2.rank(1).low() + " ; R2.coord(0) = " + R2.coord(0));
  } // main()
} // TutRegion

Console output:

R1 = {1:10,-100:100} ; R1.rank = 2 ; R1.size() = 2010 ; R1.ordinal([10,100]) = 2009
R2 = {1:10,90:100} ; R1.contains(R2) = true ; R2.rank(1).low() = 90 ; R2.coord(0) = [1,90]
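The size, ordinal, and coord values in this output are consistent with numbering the points of a rectangular region in lexicographic (row-major) order. The Java sketch below reproduces them under that assumption. RegionSketch and its methods are hypothetical names, and a region is represented simply by arrays of inclusive lower and upper bounds.

```java
public class RegionSketch {
    // Number of points in the region lo[d]..hi[d] (inclusive) over all dimensions.
    static int size(int[] lo, int[] hi) {
        int s = 1;
        for (int d = 0; d < lo.length; d++) {
            s *= Math.max(0, hi[d] - lo[d] + 1); // an empty dimension gives size 0
        }
        return s;
    }

    // Position of point p in lexicographic order, starting from 0.
    static int ordinal(int[] lo, int[] hi, int[] p) {
        int ord = 0;
        for (int d = 0; d < lo.length; d++) {
            if (p[d] < lo[d] || p[d] > hi[d]) {
                throw new IllegalArgumentException("point outside region");
            }
            ord = ord * (hi[d] - lo[d] + 1) + (p[d] - lo[d]);
        }
        return ord;
    }

    // Inverse of ordinal: the point whose ordinal value is n.
    static int[] coord(int[] lo, int[] hi, int n) {
        if (n < 0 || n >= size(lo, hi)) {
            throw new IllegalArgumentException("ordinal out of range");
        }
        int[] p = new int[lo.length];
        for (int d = lo.length - 1; d >= 0; d--) {
            int extent = hi[d] - lo[d] + 1;
            p[d] = lo[d] + n % extent;
            n /= extent;
        }
        return p;
    }

    public static void main(String[] args) {
        int[] lo = {1, -100}, hi = {10, 100};   // R1 = [1:10, -100:100]
        System.out.println(size(lo, hi) + " " + ordinal(lo, hi, new int[] {10, 100}));
    }
}
```

For R1 the extents are 10 and 201, so the size is 10 * 201 = 2010 and the last point [10,100] has ordinal 9 * 201 + 200 = 2009.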


X10 Arrays

• Java arrays are one-dimensional and local

− e.g., array args in main(String[] args)

− Multi-dimensional arrays are represented as “arrays of arrays” in Java

• X10 has true multi-dimensional arrays (as in C, Fortran) that can be distributed (as in UPC, Co-Array Fortran, ZPL, Chapel, etc.)

• Array declaration

− “T [.] A” declares an X10 array with element type T

− An array variable can hold values of different ranks

− The [.] syntax is used to avoid confusion with Java arrays

• Array creation

− “new T [ R ]” creates a local rectangular X10 array with rectangular region R as the index domain and T as the element (range) type

− e.g., int[.] A = new int[ [0:N+1, 0:N+1] ];

• Array initializers can also be specified in conjunction with creation (see TutArray1.x10)

− E.g., int[.] A = new int[ [1:10,1:10] ] (point[i,j]) { return i+j; } ;


X10 Array Operations

• The following operations are defined on array-valued expressions

− A.rank = # dimensions in array

− A.region = index region (domain) of array

− A[P] = element at point P, where P belongs to A.region

− A | R = restriction of array onto region R

  • Useful for extracting subarrays

− A.sum(), A.max() = sum/max of elements in array

− A1 op A2 returns result of applying a pointwise op on array elements, when A1.region = A2.region

  • Op can include +, -, *, and /

− A1 || A2 = disjoint union of arrays A1 and A2 (A1.region and A2.region must be disjoint)

− A1.overlay(A2)

  • Returns an array with region A1.region || A2.region, with element value A2[P] for all points P in A2.region and A1[P] otherwise

− A.distribution = distribution of array A

  • Will be discussed later when we introduce X10 places


Example (see TutArray1.x10)

public class TutArray1 {
  public static void main(String[] args) {
    int[.] A = new int[ [1:10,1:10] ] (point [i,j]) { return i+j;} ;
    System.out.println("A.rank = " + A.rank + " ; A.region = " + A.region);
    int[.] B = A | [1:5,1:5];
    System.out.println("B.max() = " + B.max());
  } // main()
} // TutArray1

Console output:

A.rank = 2 ; A.region = {1:10,1:10}
B.max() = 10


Pointwise for loop

• X10 extends Java’s for loop to support sequential iteration over points in region R in canonical lexicographic order

− for ( point p : R ) . . .

• Standard point operations can be used to extract individual index values from point p

− for ( point p : R ) { int i = p.get(0); int j = p.get(1); . . . }

• Or an “exploded” syntax can be used instead of explicitly declaring a point variable

− for ( point [i,j] : R ) { . . . }

• The exploded syntax declares the constituent variables (i, j, …) as local int variables in the scope of the for loop body


Example (see TutFor.x10)

public class TutFor {
  public static void main(String[] args) {
    region R = [0:1,0:2];
    System.out.print("Points in region " + R + " =");
    for ( point p : R ) System.out.print(" " + p);
    System.out.println();
    // Use exploded syntax instead
    System.out.print("(i,j) pairs in region " + R + " =");
    for ( point[i,j] : R )
      System.out.print("(" + i + "," + j + ")");
  } // main()
} // TutFor

Console output:

Points in region {0:1,0:2} = [0,0] [0,1] [0,2] [1,0] [1,1] [1,2]
(i,j) pairs in region {0:1,0:2} =(0,0)(0,1)(0,2)(1,0)(1,1)(1,2)


foreach loop (Parallel iteration)

• The X10 foreach loop is similar to the pointwise for loop, except that each iteration executes in parallel as a new asynchronous activity i.e.,

− “foreach ( point p : R ) S” is equivalent to “for ( point p : R ) async S”

• As before, finish can be used to wait for termination of all foreach iterations

− finish foreach ( point[i,j] : [0:M-1,0:N-1] ) . . .

• Special case: use foreach to create a single-dimensional parallel loop

− foreach ( point[i] : [0 : N-1] ) S

• Allowing a single foreach construct to span multiple dimensions makes it convenient to write parallel matrix code that is independent of the underlying rank and region e.g.

− foreach ( point p : A.region ) A[p] = f(B[p], C[p], D[p]) ;

• Multiple foreach instances may access shared data in the same place; use finish, atomic, and force to avoid data races


Example (see TutForeach1.x10)

public class TutForeach1 {
  public static void main(String[] args) {
    final int N = 5;
    int[.] A = new int[[1:N,1:N]] (point[i,j]) {return i+j;};

    // For the A[i,j] = F(A[i,j]) case,
    // both loops can execute in parallel
    finish foreach ( point[i,j] : A.region )
      A[i,j] = A[i,j] + 1;

    // For the A[i,j] = F(A[i,j-1]) case,
    // only the outer loop can execute in parallel
    finish foreach ( point[i] : A.region.rank(0) )
      for (point[j]:
           [(A.region.rank(1).low()+1):A.region.rank(1).high()])
        A[i,j] = A[i,j-1] + 1;

NOTE: A.region.rank(0) is the same as [1:N]


Example contd. (see TutForeach1.x10)

    // For the A[i,j] = F(A[i-1,j]) case,
    // only the inner loop can execute in parallel
    for (point[i]:
         [(A.region.rank(0).low()+1):A.region.rank(0).high()] )
      finish foreach ( point[j] : A.region.rank(1) )
        A[i,j] = A[i-1,j] + 1;

    // For the A[i,j] = F(A[i-1,j],A[i,j-1]) case,
    // use loop skewing to execute the inner loop in parallel
    // (t = i + j indexes the anti-diagonal being updated)
    for ( point[t] : [4:2*N]) {
      finish foreach ( point[j] : [Math.max(2,t-N):Math.min(N,t-2)]) {
        int i = t - j;
        System.out.print("(" + i + "," + j + ")");
        A[i,j] = A[i-1,j] + A[i,j-1] + 1;
      }
    }

Console output (the order within one value of t may vary from run to run):

(2,2)(3,2)(2,3)(4,2)(3,3)(2,4)(5,2)(3,4)(4,3)(2,5)(5,3)(4,4)(3,5)(5,4)(4,5)(5,5)
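To see why loop skewing is safe for the A[i,j] = F(A[i-1,j], A[i,j-1]) case, the Java sketch below compares a plain sequential double loop with a skewed version whose inner loop runs in parallel. It is an illustrative analogue: SkewSketch and its methods are hypothetical names, a parallel IntStream stands in for finish foreach, and arrays use indices 1..n with index 0 left unused.

```java
import java.util.stream.IntStream;

public class SkewSketch {
    // A[i][j] = i + j for 1 <= i, j <= n; row 0 and column 0 are unused padding.
    static int[][] init(int n) {
        int[][] a = new int[n + 1][n + 1];
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++)
                a[i][j] = i + j;
        return a;
    }

    // Reference version: ordinary sequential loop nest.
    static int[][] sequential(int n) {
        int[][] a = init(n);
        for (int i = 2; i <= n; i++)
            for (int j = 2; j <= n; j++)
                a[i][j] = a[i - 1][j] + a[i][j - 1] + 1;
        return a;
    }

    // Skewed version: t = i + j is sequential, points on one anti-diagonal run in parallel.
    static int[][] skewed(int n) {
        int[][] a = init(n);
        for (int t = 4; t <= 2 * n; t++) {
            final int tt = t;
            IntStream.rangeClosed(Math.max(2, t - n), Math.min(n, t - 2))
                     .parallel()
                     .forEach(j -> {
                         int i = tt - j;
                         a[i][j] = a[i - 1][j] + a[i][j - 1] + 1;
                     });
            // forEach returns only after all iterations finish, like "finish foreach"
        }
        return a;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.deepEquals(sequential(5), skewed(5)));
    }
}
```

Every element on anti-diagonal t reads only elements on anti-diagonal t-1, which were completed in the previous outer iteration, so the parallel inner loop has no data race.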


Limitations of using a Single Place

[Figure: the single-place model from before: locally synchronous activities, each with an activity stack (S), sharing a heap (H) and immutable data (I), i.e. final variables and value type instances.]

• Largest deployment granularity for a single place is a single SMP

− Smallest granularity can be a single CPU or even a single hardware thread

• Single SMP is inadequate for solving problems with large memory and compute requirements

• X10 solution: incorporate multiple places as a core foundation of the X10 programming model

− Enable deployment on large-scale clustered machines, with integrated support for intra-place parallelism


Scalable X10: using multiple places

• Place = collection of activities & objects

− Activities and data objects do not move after being created

• Scalar object, O --- maps to a single place specified by O.location

• Array object, A --- may be local to a place or distributed across multiple places, as specified by A.distribution

Storage classes:

• Immutable Data (I) --- final variables, value type instances

• PGAS

− Local Heap (LH)

− Remote Heap (RH)

• Activity Stacks (S)

[Figure: Place 0 through Place (MAX_PLACES-1), each locally synchronous with its own activities, activity stacks (S), and local heap (LH). The places are globally asynchronous: they form a Partitioned Global Address Space (PGAS) and exchange inbound and outbound activities and activity replies.]


Locality Rule

• Any access to a mutable (shared heap) datum must be performed by an activity located at the same place as the datum

• The prohibited references are similar to before:

− LH/RH → S, I → S, S → S

• Local-to-remote (LH → RH) and remote-to-local (RH → LH) heap references are freely permitted

• However, direct access via a remote heap reference is not permitted!

• Inter-place data accesses can only be performed by creating remote activities (with weaker ordering guarantees than intra-place data accesses)

• The locality rule is currently not checked by default. Instead, the user can perform the check explicitly by inserting a place cast operator as follows:

− “(@ P) E” checks if expression E can be evaluated at place P

  • If so, expression E is evaluated as usual

  • If not, a BadPlaceException is thrown


Activity Execution within a Place

[Figure: activity queues within a place. Labels: Inbound activities, Outbound activities, Inbound replies, Outbound replies, Ready Activities, Executing Activities, Blocked Activities, Completed Activities, Future.]

• Atomic sections do not have blocking semantics

• A place-local activity can only access its stack (S), place-local heap (LH), or immutable data (I)


Places

• place.MAX_PLACES = total number of places

− Default value is 4

− Can be changed by using the -NUMBER_OF_LOCAL_PLACES option in x10 command

• place.places = Set of all places in an X10 program (see java.util.Set)

• place.factory.place(i) = place corresponding to index i

• here = place in which current activity is executing

• <place-expr>.toString() returns a string of the form “place(id=99)”

• <place-expr>.id returns the id of the place

[Figure: X10 Data Structures are mapped to X10 Places, which are mapped to System Nodes.]

• The X10 language defines the mapping from X10 objects to X10 places, and abstract performance metrics on places

• A future X10 deployment system will define the mapping from X10 places to system nodes; this is not supported in the current implementation


Extension of async and future to places

• async (P) S

− Creates new activity to execute statement S at place P

− “async S” is equivalent to “async (here) S”

• future (P) { E }

− Create new activity to evaluate expression E at place P

− “future { E } ” is equivalent to “future (here) { E }”

• Note that “here” in a child activity for an async/future computation will refer to the place P at which the child activity is executing, not the place where the parent activity is executing

• The goal is to specify the destination place for async/future activities so as to obey the Locality Rule e.g.,

− async (O.location) O.x = 1;

− future<int> F = future (A.distribution[i]) { A[i] } ;


Distribution = mapping from region to places

• Creating distributions (x10.lang.dist):

− dist D1 = R-> here; // local distribution – maps region R to here

− dist D2 = dist.factory.block(R); // blocked distribution

− dist D3 = dist.factory.cyclic(R); // cyclic distribution

− dist D4 = dist.factory.unique(); // identity map on [0:MAX_PLACES-1]

• Using distributions

− D[P] = place to which point P is mapped by distribution D (assuming that P is in D.region)

− Allocate a distributed array e.g., T[.] A = new T[ D ];

  • Allocates an array with index set = D.region, such that element A[P] is located at place D[P] for each point P in D.region

  • NOTE: “new T[R]” for region R is equivalent to “new T[R->here]”

− Iterating over a distribution --- generalization of “foreach” to “ateach”

  • ateach is discussed in more detail later


Operations defined on distributions

• D.region = source region of distribution

• D.rank = rank of D.region

• D | R = region restriction for distribution D and region R (returns a restricted distribution)

• D | P = place restriction for distribution D and place P (returns region mapped by D to place P)

• D1 || D2 = union of distributions D1 and D2 (assumes that D1.region and D2.region are disjoint)

• D1.overlay(D2); // Overlay of D2 over D1 – asymmetric union

• D.contains(p) = true iff D.region contains point p

• D = R -> P, constant distribution which maps entire region R to place P

• D1 – D2 = distribution difference = D1 | (D1.region – D2.region)

• D.distributionEfficiency() = load balance efficiency of distribution D


Inter-place communication using async and future

• Question: how to assign A[i] = B[j], when A[i] and B[j] may be in different places?

• Answer #1 --- use nested async’s!

  finish async ( B.distribution[j] ) {
    final int bb = B[j];                     // read B[j] at its own place
    async ( A.distribution[i] ) A[i] = bb;   // write A[i] at its own place
  }

• Answer #2 --- use future-force and an async!

  final int b = future (B.distribution[j]) { B[j] }.force();
  finish async ( A.distribution[i] ) A[i] = b;


Load Balance Efficiency

• Consider a parallel application that is executed on P places

• Let T(i) = computation load mapped to place i

− For distribution D, T(i) = (D | place.factory.place(i)).size()

• Let Tmax = max { T(i) | 0 <= i <= P-1 }

• Let E = SUM { T(i) | 0 <= i <= P-1 } / (Tmax * P)

• E is the load balance efficiency, 1/P <= E <= 1

• E = 1 is the best case: computation load is perfectly balanced

• E = 1/P is the worst case: all computation load is placed on a single processor/place

• Load balance efficiency is one of the key factors that limit speedup on a parallel machine

− there are several other factors e.g., comm. & synchronization overhead

− ignoring other factors, we expect speedup to be <= E * P

• NOTE: also try “x10 -DUMP_STATS_ON_EXIT=true …” to see activity and atomic counts
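The definition of E can be checked with a few lines of Java. This is a minimal sketch: LoadBalanceSketch and efficiency are hypothetical names, and the per-place loads T(i) are passed in directly as an int array instead of being derived from a distribution.

```java
public class LoadBalanceSketch {
    // E = SUM{T(i)} / (Tmax * P), where P = load.length and Tmax = max T(i).
    static double efficiency(int[] load) {
        long sum = 0;
        int max = 0;
        for (int t : load) {
            sum += t;
            max = Math.max(max, t);
        }
        if (max == 0) {
            throw new IllegalArgumentException("total load must be positive");
        }
        return (double) sum / ((double) max * load.length);
    }

    public static void main(String[] args) {
        System.out.println(efficiency(new int[] {25, 25, 25, 25}));  // perfectly balanced
        System.out.println(efficiency(new int[] {100, 0, 0, 0}));    // all load on one place
        System.out.println(efficiency(new int[] {40, 20, 20, 20}));  // one overloaded place
    }
}
```

For loads {40, 20, 20, 20} on P = 4 places, E = 100 / (40 * 4) = 0.625, so the expected speedup is at most E * P = 2.5 even before communication and synchronization overheads are considered.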


ateach loop (distributed parallel iteration)

• The X10 ateach loop is similar to the foreach loop, except that each iteration executes in parallel at a place specified by a distribution

− “ateach ( point p : D ) S ” is equivalent to “for ( point p : D.region ) async (D[p]) S”

• As before, finish can be used to wait for termination of all ateach iterations

− “finish ateach( point[i] : dist.factory.unique() ) S” creates one activity per place, as in an SPMD computation

− ateach is a convenient construct for writing parallel matrix code that is independent of the underlying distribution e.g.,

  • ateach ( point p : A.distribution ) A[p] = f(B[p], C[p], D[p]) ;


Example (see TutAteach1.x10)

public class TutAteach1 {
  public static void main(String args[]) {
    finish ateach( point[i] : dist.factory.unique() ) {
      System.out.println("Hello from " + i);
    }
  } // main()
} // TutAteach1

Console output (the order may vary from run to run):

Hello from 1
Hello from 0
Hello from 3
Hello from 2

dist.factory.unique() maps point i in the region, [0 : place.MAX_PLACES-1], to place place.factory.place(i)

Example: converting foreach to ateach (see TutAteach2.x10)

foreach version:

// For the A[i,j] = F(A[i,j]) case,
// both loops can execute in parallel
finish foreach ( point[i,j] : A.region )
    A[i,j] = A[i,j] + 1;

ateach version #1:

finish ateach ( point[i,j] : A.distribution )
    A[i,j] = A[i,j] + 1;

ateach version #2 (create only one activity per place):

finish ateach ( point p : dist.factory.unique() )
    for ( point[i,j] : A.distribution | here )
        A[i,j] = A[i,j] + 1;
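
The difference between the two ateach versions is the number of activities created: one per point versus one per place. The sketch below models version #2 in Python, not X10. It assumes a distribution is a dict from (i, j) to place id, and the filter on `owner == place` plays the role of the restriction "A.distribution | here"; the function name is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def increment_one_activity_per_place(A, dist, num_places):
    """Add 1 to every A[i][j]; return the number of activities that were created."""
    def local_work(place):
        # for ( point[i,j] : A.distribution | here ): visit only the local points
        for (i, j), owner in dist.items():
            if owner == place:
                A[i][j] += 1

    with ThreadPoolExecutor(max_workers=num_places) as pool:
        # ateach over the unique distribution: exactly one activity per place
        futures = [pool.submit(local_work, place) for place in range(num_places)]
        for f in futures:       # finish
            f.result()
    return len(futures)

A = [[0, 1], [2, 3]]
dist = {(i, j): i for i in range(2) for j in range(2)}   # row i lives at place i
created = increment_one_activity_per_place(A, dist, num_places=2)
print(A, created)  # [[1, 2], [3, 4]] 2
```

Version #1 would create four activities for this 2 x 2 array; version #2 creates two, one per place, and each runs an ordinary sequential loop over its local points.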

Example: converting foreach to ateach, contd. (see TutAteach2.x10)

foreach version:

// For the A[i,j] = F(A[i,j-1]) case,
// only the outer loop can execute in parallel
finish foreach ( point[i] : [1:N] )
    for ( point[j] : [2:N] )
        A[i,j] = A[i,j-1] + 1;

ateach version:

// Assume that N is a multiple of place.MAX_PLACES
finish ateach ( point[i] : dist.factory.block([1:N]) )
    for ( point[j] : [2:N] )
        A[i,j] = A[i,j-1] + 1;
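
To see which rows end up at which place in the ateach version, it helps to write the block mapping out. The sketch below is Python, not X10, and `block_dist` is a hypothetical helper. It assumes that a block distribution hands out equal contiguous chunks in place order and, like the slide, that the region size is a multiple of the number of places; the behaviour of dist.factory.block for other sizes is not modelled.

```python
def block_dist(lo, hi, num_places):
    """Map each index in [lo:hi] to a place id using equal contiguous blocks."""
    n = hi - lo + 1
    if n % num_places != 0:
        raise ValueError("sketch assumes the region size is a multiple of num_places")
    block = n // num_places
    return {i: (i - lo) // block for i in range(lo, hi + 1)}

rows = block_dist(1, 8, 4)
print(rows)  # {1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2, 7: 3, 8: 3}
```

With N = 8 and 4 places, rows 1 and 2 are handled at place 0, rows 3 and 4 at place 1, and so on. The sequential j loop for a given row runs entirely within the activity created for that row.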

X10 clocks: Motivation

• Activity coordination using finish and force() is accomplished by checking for activity termination

• However, there are many cases in which a producer-consumer relationship exists among the activities, and a “barrier”-like coordination is needed without waiting for activity termination

− The activities involved may be in the same place or in different places

[Figure: Activity 0, Activity 1, Activity 2, . . . shown across Phase 0 and Phase 1]

X10 Clocks

clock c = clock.factory.clock();

− Allocate a clock, register current activity with it. Phase 0 of c starts.

async(…) clocked (c1,c2,…) S

ateach(…) clocked (c1,c2,…) S

foreach(…) clocked (c1,c2,…) S

− Create async activities registered on clocks c1, c2, …

c.resume();

− Nonblocking operation that signals completion of work by current activity for this phase of clock c

next;

− Barrier --- suspend until all clocks that the current activity is registered with can advance. c.resume() is first performed for each such clock, if needed.

− next can be viewed like a “finish” of all computations under way in the current phase of the clock
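
The barrier behaviour of next can be approximated with a standard thread barrier. The sketch below is Python, not X10, and is only an analogy: `threading.Barrier` has a fixed number of parties, so it does not model dynamic registration, drop, or the nonblocking resume. The function name `run_phases` and the data are made up for illustration.

```python
import threading

def run_phases(n):
    """n activities produce a value in phase 0 and read a neighbour's value in phase 1."""
    barrier = threading.Barrier(n)      # stands in for a clock with n registered activities
    produced = [None] * n
    consumed = [None] * n

    def activity(i):
        produced[i] = i * i                     # phase 0: produce
        barrier.wait()                          # next: everyone has finished phase 0
        consumed[i] = produced[(i + 1) % n]     # phase 1: safe to read a neighbour's value

    threads = [threading.Thread(target=activity, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                                # finish
    return consumed

print(run_phases(4))  # [1, 4, 9, 0]
```

Without the barrier, an activity could read its neighbour's slot before it was written. No activity terminates between the two phases, which is the producer-consumer situation that finish and force() alone do not cover.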

X10 Clocks (contd.)

c.drop();

− Unregister with c. A terminating activity will implicitly drop all clocks that it is registered on.

c.registered()

− Return true iff current activity is registered on clock c

− c.dropped() returns the opposite of c.registered()

ClockUseException

− Thrown if an activity attempts to transmit or operate on a clock that it is not registered on
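
The interaction between next and drop can be made concrete with a toy clock. The sketch below is Python, not X10, and is not how the X10 runtime implements clocks. It assumes a clock only needs a registration count, an arrival count, and a phase number; it leaves out resume(), registered(), and ClockUseException. The class `Clock` and function `demo` are hypothetical.

```python
import threading

class Clock:
    """Toy clock: activities register, call next() as a barrier, and may drop()."""

    def __init__(self):
        self._cond = threading.Condition()
        self._registered = 0    # activities currently registered
        self._arrived = 0       # registered activities waiting in next()
        self._phase = 0

    def register(self):
        with self._cond:
            self._registered += 1

    def drop(self):
        with self._cond:
            self._registered -= 1
            self._advance_if_ready()    # the others must not keep waiting for us

    def next(self):
        with self._cond:
            phase = self._phase
            self._arrived += 1
            self._advance_if_ready()
            while self._phase == phase:
                self._cond.wait()

    def _advance_if_ready(self):
        # Caller holds the lock. Advance once every remaining activity has arrived.
        if self._arrived > 0 and self._arrived == self._registered:
            self._arrived = 0
            self._phase += 1
            self._cond.notify_all()

def demo(n):
    clock = Clock()
    log = []

    def activity(i):
        for phase in range(i + 1):      # activity i takes part in i + 1 phases
            log.append(phase)
            clock.next()
        clock.drop()                    # leave the clock before terminating

    threads = []
    for i in range(n):
        clock.register()                # parent registers each child before it starts
        threads.append(threading.Thread(target=activity, args=(i,)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log

print(demo(3))  # [0, 0, 0, 1, 1, 2]
```

Each activity leaves after a different number of phases. Because drop re-checks the advance condition, the remaining activities are never blocked waiting for one that has already left.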

Example (see TutClock1.x10)

finish async {
    final clock c = clock.factory.clock();
    foreach (point[i]: [1:N]) clocked (c) {
        while ( true ) {
            int old_A_i = A[i];
            int new_A_i = Math.min(A[i],B[i]);
            if ( i > 1 ) new_A_i = Math.min(new_A_i,B[i-1]);
            if ( i < N ) new_A_i = Math.min(new_A_i,B[i+1]);
            A[i] = new_A_i;
            next; // barrier: all activities have finished updating A
            int old_B_i = B[i];
            int new_B_i = Math.min(B[i],A[i]);
            if ( i > 1 ) new_B_i = Math.min(new_B_i,A[i-1]);
            if ( i < N ) new_B_i = Math.min(new_B_i,A[i+1]);
            B[i] = new_B_i;
            next; // barrier: all activities have finished updating B
            if ( old_A_i == new_A_i && old_B_i == new_B_i ) break;
        } // while
    } // foreach
} // finish async

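NOTE: exiting from the while loop terminates the activity for iteration i, and automatically deregisters the activity from the clock

NOTE: the clocked (c) clause on the foreach is an example of transmitting a clock from the parent activity to the activities it creates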

nullable

• By default, object references in X10 are not allowed to take on the null value

• However, the nullable type constructor can be used to enable certain object references to be set to null, or to compare them with null e.g.,

T1 a;
nullable T2 b;
a = null; // Not allowed
b = null; // Allowed

• NOTE: “const” is simply a shorthand for “static final”

extern

• X10 provides a simple mechanism for invoking external functions written in C

• Currently, the C function is restricted to arguments with primitive types or references to “unsafe” X10 arrays

• The X10 program must contain an external declaration of the C function as follows …

static extern char doit(int a, float b);

… and also a statement to ensure that the native DLL, <dll>.dll, is loaded

static { System.loadLibrary("<dll>"); }

• The X10 compiler then generates a file called <class>_x10stub.c

• To generate the DLL, the C programmer must include the file jni.h in their C source, compile it, and link it with the object file obtained from <class>_x10stub.c
