X10 Tutorial
PSC Software Productivity Study
May 23 – 27, 2005
Vivek Sarkar
IBM T.J. Watson Research Center
vsarkar@us.ibm.com
This work has been supported in part by the Defense Advanced Research Projects Agency (DARPA)
under contract No. NBCH30390004.
Acknowledgments
• X10 core team
− Philippe Charles
− Chris Donawa
− Kemal Ebcioglu
− Christian Grothoff
− Allan Kielstra
− Christoph von Praun
− Vivek Sarkar
− Vijay Saraswat
• X10 productivity team
− Catalina Danis
− Christine Halverson
• Additional contributors to PSC Productivity Study
− David Bader
− Bill Clark
− Nick Nystrom
− John Urbanic
Outline
1. What is X10?
• background, status
2. Basic X10 (single place)
• async, finish, atomic
• future, force
3. Basic X10 (arrays & loops)
• points, rectangular regions, arrays
• for, foreach
4. Scalable X10 (multiple places)
• places, distributions, distributed arrays, ateach, BadPlaceException
5. Clocks
• creation, registration, next, resume, drop, ClockUseException
6. Basic serial constructs that differ from Java
• const, nullable, extern
7. Advanced topics
• Value types, conditional atomic sections (when), general regions & distributions
• Refer to language spec for details
What is X10?
• X10 is a new experimental language developed in the IBM PERCS project as part of the DARPA program on High Productivity Computing Systems (HPCS)
• X10’s goal is to provide a new parallel programming model and its embodiment in a high level language that:
1. is more productive than current models,
2. can support higher levels of abstraction better than current models, and
3. can exploit the multiple levels of parallelism and nonuniform data access that are critical for obtaining scalable performance in current and future HPC systems.
X10 status and schedule
• 6/2003 PERCS programming model concept (end of PERCS Phase 1)
• 7/2004 Start of PERCS Phase 2
• 2/2004 Kickoff of X10 as concrete embodiment of PERCS programming model as a new language
• 7/2004 First draft of X10 language specification
• 2/2005 First X10 implementation -- unoptimized single-VM prototype
» Emulates distributed parallelism in a single process
» This is what you will use to run X10 programs this week
• 5/2005 X10 productivity study at Pittsburgh Supercomputing Center
• 7/2005 Results from X10 application & productivity studies
• 2H2005 Revise language based on application & productivity feedback
• 2H2005 Start participation in High Productivity Language “consortium”?
• 1/2006 Second X10 implementation – optimized multi-VM prototype
• 6/2006 Open source release of X10 reference implementation
• 6/2006 Design completed for production X10 implementation in Phase 3 (end of Phase 2)
Current X10 Environment: Unoptimized Single-VM Implementation
• Foo.x10 --- X10 source program; must contain a class named Foo with a “public static void main(String[] args)” method
• x10c --- X10 compiler; translates Foo.x10 to Foo.java, then uses javac to generate Foo.class from Foo.java
• Foo.java --- X10 program translated into Java; a // #line pseudocomment in Foo.java specifies the source line mapping in Foo.x10
• Foo.class --- executed by the X10 Virtual Machine (JVM + J2SE libraries + X10 libraries + X10 Multithreaded Runtime)
− External DLL’s are reached through the X10 extern interface
− Outputs: X10 Program Output and X10 Abstract Performance Metrics (event counts, distribution efficiency)
• Commands
− x10c Foo.x10
− x10 Foo.x10
• Caveat: this is a prototype implementation with many limitations. Please be patient!
Examples of X10 Compiler Error Messages
1) x10c TutError1.x10
TutError1.x10:8: Could not find field or local variable "evenSum".
for (int i = 2 ; i <= n ; i += 2 ) evenSum += i;
^----^
Case 1: Error message identifies source file and line number; carets indicate the column range
2) x10c TutError2.x10
x10c: TutError2.x10:4:27:4:27: unexpected token(s) ignored
Case 2: Error message identifies source file, line number, and column range
3) x10c TutError3.x10
x10c: C:\vivek\eclipse\workspace\x10\examples\Tutorial\TutError3.java:49: local variable n is accessed from within inner class; needs to be declared final
Case 3: Error message reported by the Java compiler; look for the #line comment in the .java file to identify the X10 source location
Future X10 Environment
• Very High Level Languages (VHLL’s), Domain Specific Languages (DSL’s) --- implicit parallelism, implicit data distributions
• X10 High Level Language, X10 Libraries --- X10 places: abstraction of explicit control & data distribution
• X10 Deployment --- mapping of places to nodes in HPC Parallel Environment
• HPC Runtime Environment (Parallel Environment, MPI, LAPI, …) --- primitive constructs for parallelism, communication, and synchronization
• HPC Parallel System --- target system for parallel application
Future X10 Environment: Targeting Scalable HPC Parallel Systems
[Figure: system diagram. Labels: Front-end Nodes; File Servers; “Thick” X10 VM; “Full” X10 VM; Pset 0 … Pset 1023; I/O Node 0 … I/O Node 1023; C-Node 0 … C-Node 63; “Thin” X10 VM; Functional Gigabit Ethernet; Console interconnect.]
[Figure, second version: adds a memory hierarchy (Memory, L3 Cache, L2 Cache, Proc Clusters of PEs with L1 $) and the levels of parallelism listed below.]
• Clusters (scale-out)
• SMP
• Multiple cores on a chip
• Coprocessors (SPUs)
• SMTs
• SIMD
• ILP
X10 vs. Java
• X10 is an extended subset of Java
− Base language = Java 1.4
• Java 5 features (generics, metadata, etc.) are currently not supported in X10
− Notable features removed from Java
• Concurrency --- threads, synchronized, etc.
• Java arrays --- replaced by X10 arrays
− Notable features added to Java
• Concurrency --- async, finish, atomic, future, force, foreach, ateach, clocks
• Distribution --- points, distributions
• X10 arrays --- multidimensional distributed arrays, array reductions, array initializers
• Serial constructs --- nullable, const, extern, value types
• X10 supports both OO and non-OO programming paradigms
x10.lang standard library
• Java package with “built in” classes that provide support for selected X10 constructs
− Standard types
• boolean, byte, char, double, float, int, long, short, String
− x10.lang.Object -- root class for all instances of X10 objects
− x10.lang.clock --- clock instances & clock operations
− x10.lang.dist --- distribution instances & distribution operations
− x10.lang.place --- place instances & place operations
− x10.lang.point --- point instances & point operations
− x10.lang.region --- region instances & region operations
• All X10 programs implicitly import the x10.lang.* package, so the x10.lang prefix can be omitted when referring to members of x10.lang.* classes
− e.g., place.MAX_PLACES, dist.factory.block([0:100,0:100]), …
• Similarly, all X10 programs also implicitly import the java.lang.* package
− e.g., X10 programs can use Math.min() and Math.max() from java.lang
Calling foreign functions from X10 programs
• Java methods
− Can be called directly from X10 programs
− Java class will be loaded automatically as part of X10 program execution
− Basic rule: don’t call any method that can perform wait/notify or related thread operations
• Calling synchronized methods is okay
• C functions
− Need to use extern declaration in X10, and perform a System.loadLibrary() call
Resources available in current X10 installation
• Readme.txt --- basic information on X10 installation and usage
• Limitations.txt --- list of known limitations in the current X10 implementation
• etc/standard.cfg --- default configuration information
• examples/ -- root directory for a number of working X10 example programs
− examples/Constructs shows usage of different X10 constructs
− examples/Tutorial contains examples used in this tutorial
X10 Programming Model (Single Place)
• Storage classes:
− Immutable Data (I) -- final variables, value type instances
− Shared Heap (H)
− Activity Stacks (S)
• Activity = lightweight thread
− Main program starts as single activity in Place 0
• Permitted object references (pointers):
− I -> H, H -> I, I -> I, H -> H, S -> H, S -> I
• Prohibited references:
− H -> S, I -> S, S -> S
− No data sharing permitted between parent activity’s stack and child activity’s stack
• Single Place Memory model
− No coherence constraints needed for I and S storage classes
− Guaranteed coherence for H storage class --- all writes to same shared location are observed in same order by all activities
− Largest deployment granularity for a single place is a single SMP
[Figure: Place 0 with its Activities, Activity Stacks (S), Shared Heap (H), and Immutable Data (I); labeled Locally Synchronous (coherent access to intra-place shared heap).]
Basic X10 (Single Place)
Core constructs used for intra-place (shared memory) parallel programming:
• Async = construct used to execute a statement in parallel as a new activity
• Finish = construct used to check for global termination of statement and all the activities that it has created
• Atomic = construct used to coordinate accesses to shared heap by multiple activities
• Future = construct used to evaluate an expression in parallel as a new activity
• Force = construct used to check for termination of future
async statement
• async <stmt>
− Parent activity creates a new child activity to execute <stmt> in the same place as the parent activity
− An async statement returns immediately – parent execution proceeds immediately to next statement
− Any access to parent’s local data must be through final variables
• Similar to data access rules for inner classes in Java
• Example
public class TutAsync {
  const boxedInt oddSum=new boxedInt();
  const boxedInt evenSum=new boxedInt();
  public static void main(String[] args) {
    // Variable n must be declared as final --- its value is passed from parent to child activity
    final int n = 100;
    async for (int i=1 ; i<=n ; i+=2 ) oddSum.val += i;
    for (int j=2 ; j<=n ; j+=2 ) evenSum.val += j;
    . . .
finish statement
• finish <stmt>
− Execute <stmt> as usual, but wait until all activities spawned (transitively) by <stmt> have terminated before completing the execution of finish <stmt>
− finish traps all exceptions thrown by activities spawned by <stmt>, and throws a wrapping exception after <stmt> has terminated.
• Example (see TutAsync.x10):
    . . .
    finish {
      async for (int i=1 ; i<=n ; i+=2 ) oddSum.val += i;
      for (int j=2 ; j<=n ; j+=2 ) evenSum.val += j;
    }
    // Both oddSum and evenSum have been computed now
    System.out.println("oddSum = " + oddSum.val + " ; evenSum = " + evenSum.val);
  } // main()
} // TutAsync
Console output:
oddSum = 2500 ; evenSum = 2550
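Java analogue (a sketch, not X10): the async/finish pattern above can be approximated in plain Java with a thread for the child activity and join() standing in for finish. The class name FinishAsyncSketch and the method sums() are hypothetical names chosen for illustration; the X10 runtime is not assumed to be implemented this way.

```java
class FinishAsyncSketch {
    // Returns {oddSum, evenSum} for 1..n, computed by two concurrent activities.
    static int[] sums(final int n) {
        final int[] oddSum = {0};   // written only by the child thread
        final int[] evenSum = {0};  // written only by the parent thread

        // Analogue of "async": the child thread sums the odd numbers.
        Thread child = new Thread(() -> {
            for (int i = 1; i <= n; i += 2) oddSum[0] += i;
        });
        child.start();

        // The parent continues immediately with the even numbers.
        for (int j = 2; j <= n; j += 2) evenSum[0] += j;

        // Analogue of "finish": wait for the child before reading its result.
        try {
            child.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return new int[] { oddSum[0], evenSum[0] };
    }

    public static void main(String[] args) {
        int[] s = sums(100);
        System.out.println("oddSum = " + s[0] + " ; evenSum = " + s[1]);
    }
}
```

Without the join(), the parent could read oddSum before the child has finished, which is the situation finish is there to rule out.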
Atomic statements & methods
• atomic <stmt>, atomic <method-decl>
• An atomic statement/method is conceptually executed in a single step, while other activities are suspended
− Note: programmer does not manage any locks explicitly
• An atomic section may not include
− Blocking operations
− Creation of activities
• Example (see TutAtomic1.x10):
finish {
  async for (int i=1 ; i<=n ; i+=2 ) {
    double r = 1.0d / i ;
    atomic rSum += r;
  }
  for (int j=2 ; j<=n ; j+=2 ) {
    double r = 1.0d / j ;
    atomic rSum += r;
  }
}
System.out.println("rSum = " + rSum);
Console output:
rSum = 5.187377517639618
Another Example (TutAtomic2.x10)
public class TutAtomic2 {
const boxedInt a = new boxedInt(100);
const boxedInt b = new boxedInt(100);
public static atomic void incr_a() { a.val++ ; b.val-- ; }
public static atomic void decr_a() { a.val-- ; b.val++ ; }
public static void main(String args[]) {
int sum;
finish {
async for (int i=1 ; i<=10 ; i++ ) incr_a();
for (int i=1 ; i<=10 ; i++ ) decr_a();
}
atomic sum = a.val + b.val;
System.out.println("a+b = " + sum);
} // main()
} // TutAtomic2
Console output:
a+b = 200
Future & Force
• future<type> F = future { <expr> }
− Parent activity creates a new asynchronous child activity to evaluate <expr>; with no place specified, the child runs in the same place as the parent
• <type> value = F.force()
− Caller blocks until return value is obtained from future (and all activities spawned transitively by <expr> have terminated)
• Example (see TutFuture2.x10):
// Note that future<int> and int are different types
future<int> Fi = future { fib(10) } ;
int i = Fi.force();
// Nested future types can also be created (if need be)
future<future<int>> FFj= future { future{fib(100)} };
future<int> Fj = FFj.force();
int j = Fj.force();
Example (TutFuture1.x10)
public class TutFuture1 {
static int fib(final int n) {
if ( n <= 0 ) return 0;
else if ( n == 1 ) return 1;
else {
future<int> fn_1 = future { fib(n-1) };
future<int> fn_2 = future { fib(n-2) };
return fn_1.force() + fn_2.force();
}
} // fib()
public static void main(String[] args) {
System.out.println("fib(10) = " + fib(10));
} // main()
} // TutFuture1
Example of recursive divide-and-conquer parallelism --- calls to fib(n-1) and fib(n-2) execute in
parallel
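Java analogue (a sketch, not X10): the same recursive divide-and-conquer shape can be written with the fork/join framework, where fork() plays the role of creating a future and join() plays the role of force(). The class name FibFutureSketch is hypothetical, and this is only an approximation of the X10 semantics.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

class FibFutureSketch extends RecursiveTask<Integer> {
    private final int n;

    FibFutureSketch(int n) { this.n = n; }

    @Override
    protected Integer compute() {
        if (n <= 0) return 0;
        if (n == 1) return 1;
        // Analogue of "future { fib(n-1) }" and "future { fib(n-2) }".
        FibFutureSketch fn1 = new FibFutureSketch(n - 1);
        FibFutureSketch fn2 = new FibFutureSketch(n - 2);
        fn1.fork();
        fn2.fork();
        // Analogue of force(): block until both results are available.
        // The most recently forked task is joined first.
        return fn2.join() + fn1.join();
    }

    static int fib(int n) {
        return ForkJoinPool.commonPool().invoke(new FibFutureSketch(n));
    }

    public static void main(String[] args) {
        System.out.println("fib(10) = " + fib(10));  // prints "fib(10) = 55"
    }
}
```

Each task joins only tasks it created itself, which matches the first sufficient condition for deadlock freedom given later in this tutorial.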
Parallel Programming Pitfalls: Deadlock
• Deadlock occurs when parallel threads/activities acquire locks or perform other blocking operations in a sequence that creates a dependence cycle
• Java example:
− Thread 0
• synchronized (Foo.a) { synchronized(Foo.b) { … } }
− Thread 1
• synchronized (Foo.b) { synchronized(Foo.a) { … } }
• MPI example:
− Process 0:
• MPI_Recv(recvbuf, count, MPI_REAL, 1, tag, …)
− Process 1:
• MPI_Recv(recvbuf, count, MPI_REAL, 0, tag, …)
Parallel Programming Pitfalls: Deadlock (contd.)
• X10 guarantee
− Any program written with async, finish, atomic, foreach, ateach, and clock parallel constructs will never deadlock
• Unrestricted use of future and force may lead to deadlock (see examples/Constructs/Future/FutureDeadlock_MustFailTimeout.x10):
− f1 = future { a1() } ;
− f2 = future { a2() };
− int a1() { … f2.force(); … }
− int a2() { … f1.force(); … }
• Restricted use of future and force in X10 can preserve guaranteed freedom from deadlocks
− Sufficient condition #1: ensure that activity that creates the future also performs the force() operation
− Sufficient condition #2: . . .
Parallel Programming Pitfalls: Data Races
• A data race occurs when two (or more) threads/activities can access the same shared location in parallel such that at least one of the accesses is a write operation
• Java example:
− Thread 0: a++ ; b-- ;
− Thread 1: a++ ; b--;
− Data race can violate invariant that (a+b) is constant
− Data race may also prevent multiple increments from being combined correctly
• X10 guidelines for avoiding data races
− Use atomic methods and blocks without worrying about deadlock
− Declare data to be read-only (i.e., final or value type instance) whenever possible
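Java analogue (a sketch, not X10): the (a+b) invariant from the data race example can be protected in Java by making both updates one indivisible step. Here synchronized methods stand in for X10 atomic methods; unlike X10 atomic, synchronized does take an explicit lock on the object. The class name AtomicPairSketch and its methods are hypothetical.

```java
class AtomicPairSketch {
    private int a = 100;
    private int b = 100;

    // Each method updates a and b together, so a + b never changes.
    synchronized void incrA() { a++; b--; }
    synchronized void decrA() { a--; b++; }
    synchronized int getA()   { return a; }
    synchronized int sum()    { return a + b; }

    // Runs `iterations` increments in a child thread and the same number
    // of decrements in the caller, then returns {a, a + b}.
    static int[] run(final int iterations) {
        final AtomicPairSketch p = new AtomicPairSketch();
        Thread child = new Thread(() -> {
            for (int i = 0; i < iterations; i++) p.incrA();
        });
        child.start();
        for (int i = 0; i < iterations; i++) p.decrA();
        try {
            child.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return new int[] { p.getA(), p.sum() };
    }

    public static void main(String[] args) {
        int[] r = run(10000);
        System.out.println("a = " + r[0] + " ; a+b = " + r[1]);
    }
}
```

If the synchronized keywords are removed, increments can be lost and the final value of a may differ from 100, which is the second hazard listed above.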
Points
• A point is an element of an n-dimensional Cartesian space (n>=1) with integer-valued coordinates e.g., [5], [1, 2], …
− Dimensions are numbered from 0 to n-1
− n is also referred to as the rank of the point
• A point variable can hold values of different ranks e.g.,
− point p; p = [1]; … p = [2,3]; …
• The following operations are defined on a point-valued expression p1
− p1.rank --- returns rank of point p1
− p1.get(i) --- returns element i of point p1
• Returns element (i mod p1.rank) if i < 0 or i >= p1.rank
− p1.lt(p2), p1.le(p2), p1.gt(p2), p1.ge(p2)
• Returns true iff p1 is lexicographically <, <=, >, or >= p2
• Only defined when p1.rank and p2.rank are equal
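Java sketch (not X10): the point operations above can be modeled on int arrays to make the lexicographic ordering concrete. PointSketch and its methods are hypothetical names. One assumption is made: "i mod rank" is taken to be the non-negative mathematical modulus (Math.floorMod), which the slide does not spell out for negative i.

```java
class PointSketch {
    // Element i of point p; out-of-range indices wrap around (assumed floorMod).
    static int get(int[] p, int i) {
        return p[Math.floorMod(i, p.length)];
    }

    // Lexicographic comparison: the first differing coordinate decides.
    static int compare(int[] p1, int[] p2) {
        if (p1.length != p2.length) {
            throw new IllegalArgumentException("ranks differ");
        }
        for (int d = 0; d < p1.length; d++) {
            if (p1[d] != p2[d]) return Integer.compare(p1[d], p2[d]);
        }
        return 0;
    }

    static boolean lt(int[] p1, int[] p2) { return compare(p1, p2) < 0; }
    static boolean le(int[] p1, int[] p2) { return compare(p1, p2) <= 0; }
    static boolean gt(int[] p1, int[] p2) { return compare(p1, p2) > 0; }
    static boolean ge(int[] p1, int[] p2) { return compare(p1, p2) >= 0; }

    public static void main(String[] args) {
        int[] p2 = {1, 2};
        int[] p3 = {2, 1};
        System.out.println("p2.lt(p3) = " + lt(p2, p3));  // prints "p2.lt(p3) = true"
    }
}
```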
Example (see TutPoint.x10)
public class TutPoint {
public static void main(String[] args) {
point p1 = [1,2,3,4,5];
point p2 = [1,2];
point p3 = [2,1];
System.out.println("p1 = " + p1 + " ; p1.rank = " + p1.rank + " ; p1.get(2) = " + p1.get(2));
System.out.println("p2 = " + p2 + " ; p3 = " + p3 + " ; p2.lt(p3) = " + p2.lt(p3));
} // main()
} // TutPoint
Console output:
p1 = [1,2,3,4,5] ; p1.rank = 5 ; p1.get(2) = 3
p2 = [1,2] ; p3 = [2,1] ; p2.lt(p3) = true
Rectangular Regions
• A rectangular region is the set of points contained in a rectangular subspace
• A region variable can hold values of different ranks e.g.,
− region R; R = [0:10]; … R = [-100:100, -100:100]; … R = [0:-1]; …
• The following operations are defined on a region-valued expression R
− R.rank = # dimensions in region; R.size() = # points in region
− R.contains(P) = true if region R contains point P
− R.contains(S) = true if region R contains region S
− R.equal(S) = true if region R equals region S
− R.rank(i) = projection of region R on dimension i (a one-dimensional region)
− R.rank(i).low() = lower bound of ith dimension of region R
− R.rank(i).high() = upper bound of ith dimension of region R
− R.ordinal(P) = ordinal value of point P in region R
− R.coord(N) = point in region R with ordinal value = N
− R1 && R2 = region intersection (will be rectangular if R1 and R2 are rectangular)
− R1 || R2 = union of regions R1 and R2 (may not be rectangular)
− R1 - R2 = region difference (may not be rectangular)
Example (see TutRegion.x10)
public class TutRegion {
public static void main(String[] args) {
region R1 = [1:10, -100:100];
System.out.println("R1 = " + R1 + " ; R1.rank = " + R1.rank + " ; R1.size() = " + R1.size() + " ; R1.ordinal([10,100]) = " + R1.ordinal([10,100]));
region R2 = [1:10,90:100];
System.out.println("R2 = " + R2 + " ; R1.contains(R2) = " + R1.contains(R2) + " ; R2.rank(1).low() = " + R2.rank(1).low() + " ; R2.coord(0) = " + R2.coord(0));
} // main()
} // TutRegion
Console output:
R1 = {1:10,-100:100} ; R1.rank = 2 ; R1.size() = 2010 ; R1.ordinal([10,100]) = 2009
R2 = {1:10,90:100} ; R1.contains(R2) = true ; R2.rank(1).low() = 90 ; R2.coord(0) = [1,90]
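Java sketch (not X10): size, ordinal, and coord for a rectangular region can be modeled with per-dimension bounds. RegionSketch is a hypothetical class; it assumes ordinals follow the canonical lexicographic (row-major) order, which agrees with the values printed by TutRegion.x10.

```java
class RegionSketch {
    private final int[] low;   // inclusive lower bound per dimension
    private final int[] high;  // inclusive upper bound per dimension

    RegionSketch(int[] low, int[] high) {
        if (low.length != high.length) {
            throw new IllegalArgumentException("bounds must have equal rank");
        }
        this.low = low.clone();
        this.high = high.clone();
    }

    int rank() { return low.length; }

    // Number of points along dimension d (0 for an empty range such as 0:-1).
    private int extent(int d) { return Math.max(0, high[d] - low[d] + 1); }

    int size() {
        int s = 1;
        for (int d = 0; d < rank(); d++) s *= extent(d);
        return s;
    }

    // Position of point p in lexicographic order, starting at 0.
    int ordinal(int[] p) {
        int ord = 0;
        for (int d = 0; d < rank(); d++) ord = ord * extent(d) + (p[d] - low[d]);
        return ord;
    }

    // Inverse of ordinal: the point whose ordinal value is n.
    int[] coord(int n) {
        int[] p = new int[rank()];
        for (int d = rank() - 1; d >= 0; d--) {
            p[d] = low[d] + n % extent(d);
            n /= extent(d);
        }
        return p;
    }

    public static void main(String[] args) {
        RegionSketch r1 = new RegionSketch(new int[]{1, -100}, new int[]{10, 100});
        System.out.println("size = " + r1.size()
            + " ; ordinal([10,100]) = " + r1.ordinal(new int[]{10, 100}));
    }
}
```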
X10 Arrays
• Java arrays are one-dimensional and local
− e.g., array args in main(String[] args)
− Multi-dimensional arrays are represented as “arrays of arrays” in Java
• X10 has true multi-dimensional arrays (as in C, Fortran) that can be distributed (as in UPC, Co-Array Fortran, ZPL, Chapel, etc.)
• Array declaration
− “T [.] A” declares an X10 array with element type T
− An array variable can hold values of different ranks
− The [.] syntax is used to avoid confusion with Java arrays
• Array creation
− “new T [ R ]” creates a local rectangular X10 array with rectangular region R as the index domain and T as the element (range) type
− e.g., int[.] A = new int[ [0:N+1, 0:N+1] ];
• Array initializers can also be specified in conjunction with creation (see TutArray1.x10)
− E.g., int[.] A = new int[ [1:10,1:10] ] (point[i,j]) { return i+j; } ;
X10 Array Operations
• The following operations are defined on array-valued expressions
− A.rank = # dimensions in array
− A.region = index region (domain) of array
− A[P] = element at point P, where P belongs to A.region
− A | R = restriction of array onto region R
• Useful for extracting subarrays
− A.sum(), A.max() = sum/max of elements in array
− A1 op A2 returns result of applying a pointwise op on array elements, when A1.region = A2.region
• Op can include +, -, *, and /
− A1 || A2 = disjoint union of arrays A1 and A2 (A1.region and A2.region must be disjoint)
− A1.overlay(A2)
• Returns an array with region A1.region || A2.region, with element value A2[P] for all points P in A2.region and A1[P] otherwise
− A.distribution = distribution of array A
• Will be discussed later when we introduce X10 places
Example (see TutArray1.x10)
public class TutArray1 {
  public static void main(String[] args) {
    int[.] A = new int[ [1:10,1:10] ] (point [i,j]) { return i+j;} ;
    System.out.println("A.rank = " + A.rank + " ; A.region = " + A.region);
    int[.] B = A | [1:5,1:5];
    System.out.println("B.max() = " + B.max());
  } // main()
} // TutArray1
Console output:
A.rank = 2 ; A.region = {1:10,1:10}
B.max() = 10
Pointwise for loop
• X10 extends Java’s for loop to support sequential iteration over points in region R in canonical lexicographic order
− for ( point p : R ) . . .
• Standard point operations can be used to extract individual index values from point p
− for ( point p : R ) { int i = p.get(0); int j = p.get(1); . . . }
• Or an “exploded” syntax can be used instead of explicitly declaring a point variable
− for ( point [i,j] : R ) { . . . }
• The exploded syntax declares the constituent variables (i, j, …) as local int variables in the scope of the for loop body
Example (see TutFor.x10)
public class TutFor {
public static void main(String[] args) {
region R = [0:1,0:2];
System.out.print("Points in region " + R + " =");
for ( point p : R ) System.out.print(" " + p);
System.out.println();
// Use exploded syntax instead
System.out.print("(i,j) pairs in region " + R + " =");
for ( point[i,j] : R )
System.out.print("(" + i + "," + j + ")");
System.out.println();
} // main()
} // TutFor
Console output:
Points in region {0:1,0:2} = [0,0] [0,1] [0,2] [1,0] [1,1] [1,2]
(i,j) pairs in region {0:1,0:2} =(0,0)(0,1)(0,2)(1,0)(1,1)(1,2)
foreach loop (Parallel iteration)
• The X10 foreach loop is similar to the pointwise for loop, except that each iteration executes in parallel as a new asynchronous activity i.e.,
− “foreach ( point p : R ) S” is equivalent to “for ( point p : R ) async S”
• As before, finish can be used to wait for termination of all foreach iterations
− finish foreach ( point[i,j] : [0:M-1,0:N-1] ) . . .
• Special case: use foreach to create a single-dimensional parallel loop
− foreach ( point[i] : [0 : N-1] ) S
• Allowing a single foreach construct to span multiple dimensions makes it convenient to write parallel matrix code that is independent of the underlying rank and region e.g.
− foreach ( point p : A.region ) A[p] = f(B[p], C[p], D[p]) ;
• Multiple foreach instances may access shared data in the same place; use finish, atomic, force to avoid data races
Example (see TutForeach1.x10)
public class TutForeach1 {
public static void main(String[] args) {
final int N = 5;
int[.] A = new int[[1:N,1:N]] (point[i,j]) {return i+j;};
// For the A[i,j] = F(A[i,j]) case,
// both loops can execute in parallel
finish foreach ( point[i,j] : A.region )
A[i,j] = A[i,j] + 1;
// For the A[i,j] = F(A[i,j-1]) case,
// only the outer loop can execute in parallel
finish foreach ( point[i] : A.region.rank(0) )
for (point[j]:
[(A.region.rank(1).low()+1):A.region.rank(1).high()])
A[i,j] = A[i,j-1] + 1;
NOTE: A.region.rank(0) is the same as [1:N]
Example contd. (see TutForeach1.x10)
// For the A[i,j] = F(A[i-1,j]) case,
// only the inner loop can execute in parallel
for (point[i]:
[(A.region.rank(0).low()+1):A.region.rank(0).high()] )
finish foreach ( point[j] : A.region.rank(1) )
A[i,j] = A[i-1,j] + 1;
// For the A[i,j] = F(A[i-1,j],A[i,j-1]) case,
// use loop skewing to execute the inner loop in parallel
for ( point[t] : [4:2*N]) {
finish foreach ( point[j] : [Math.max(2,t-N):Math.min(N,t-2)]) {
int i = t - j;
System.out.print("(" + i + "," + j + ")");
A[i,j] = A[i-1,j] + A[i,j-1] + 1;
}
System.out.println();
Console output:
(2,2)
(3,2)(2,3)
(4,2)(3,3)(2,4)
(5,2)(3,4)(4,3)(2,5)
(5,3)(4,4)(3,5)
(5,4)(4,5)
(5,5)
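Java sketch (not X10): the loop skewing step can be checked by comparing the skewed traversal against the plain nested-loop order. All points with the same t = i + j depend only on points with a smaller t, so they can run in parallel. LoopSkewSketch is a hypothetical class; a parallel IntStream stands in for "finish foreach", and the array starts from A[i][j] = i + j, which is a simplification of the full TutForeach1.x10 program.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

class LoopSkewSketch {
    // 1-based n x n array (row and column 0 are unused), A[i][j] = i + j.
    static int[][] init(int n) {
        int[][] a = new int[n + 1][n + 1];
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++)
                a[i][j] = i + j;
        return a;
    }

    // Reference version: plain sequential nested loops.
    static int[][] sequential(int n) {
        int[][] a = init(n);
        for (int i = 2; i <= n; i++)
            for (int j = 2; j <= n; j++)
                a[i][j] = a[i - 1][j] + a[i][j - 1] + 1;
        return a;
    }

    // Skewed version: sequential over t = i + j, parallel within each t.
    static int[][] skewed(int n) {
        final int[][] a = init(n);
        for (int t = 4; t <= 2 * n; t++) {
            final int tt = t;
            IntStream.rangeClosed(Math.max(2, tt - n), Math.min(n, tt - 2))
                .parallel()
                .forEach(j -> {
                    int i = tt - j;
                    a[i][j] = a[i - 1][j] + a[i][j - 1] + 1;
                });
        }
        return a;
    }

    static boolean agree(int n) {
        return Arrays.deepEquals(sequential(n), skewed(n));
    }

    public static void main(String[] args) {
        System.out.println("skewed == sequential: " + agree(5));
    }
}
```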
Limitations of using a Single Place
• Largest deployment granularity for a single place is a single SMP
− Smallest granularity can be a single CPU or even a single hardware thread
• Single SMP is inadequate for solving problems with large memory and compute requirements
• X10 solution: incorporate multiple places as a core foundation of the X10 programming model
− Enable deployment on large-scale clustered machines, with integrated support for intra-place parallelism
[Figure: single-place model: Place 0 with its Activities, Activity Stacks (S), Shared Heap (H), and Immutable Data (I) -- final variables, value type instances; labeled Locally Synchronous (coherent access to intra-place shared heap).]
Scalable X10: using multiple places
• Place = collection of activities & objects
− Activities and data objects do not move after being created
• Scalar object, O -- maps to a single place specified by O.location
• Array object, A -- may be local to a place or distributed across multiple places, as specified by A.distribution
• Storage classes:
− Immutable Data (I) -- final variables, value type instances
− PGAS
• Local Heap (LH)
• Remote Heap (RH)
− Activity Stacks (S)
[Figure: Place 0 … Place (MAX_PLACES -1), each with Activities, Activity Stacks (S), and a Local Heap (LH), under a Partitioned Global Address Space (PGAS). Labels: Locally Synchronous (coherent access to intra-place shared heap); Globally Asynchronous; Outbound activities; Inbound activities; Outbound activity replies; Inbound activity replies.]
Locality Rule
• Any access to a mutable (shared heap) datum must be performed by an activity located at the same place as the datum
• The prohibited references are similar to before:
− LH/RH -> S, I -> S, S -> S
• Local-to-remote (LH -> RH) and remote-to-local (RH -> LH) heap references are freely permitted
• However, direct access via a remote heap reference is not permitted!
• Inter-place data accesses can only be performed by creating remote activities (with weaker ordering guarantees than intra-place data accesses)
• The locality rule is currently not checked by default. Instead, the user can perform the check explicitly by inserting a place cast operator as follows:
− “(@ P) E” checks if expression E can be evaluated at place P
• If so, expression E is evaluated as usual
• If not, a BadPlaceException is thrown
Activity Execution within a Place
[Figure: a Place containing Ready Activities, Executing Activities, Blocked Activities (Clock, Future), and Completed Activities, with Outbound activities, Inbound activities, Outbound replies, and Inbound replies crossing the place boundary.]
• Atomic sections do not have blocking semantics
• A place-local activity can only access its stack (S), place-local heap (LH), or immutable data (I)
Places
• place.MAX_PLACES = total number of places
− Default value is 4
− Can be changed by using the -NUMBER_OF_LOCAL_PLACES option in x10 command
• place.places = Set of all places in an X10 program (see java.util.Set)
• place.factory.place(i) = place corresponding to index i
• here = place in which current activity is executing
• <place-expr>.toString() returns a string of the form “place(id=99)”
• <place-expr>.id returns the id of the place
[Figure: X10 Data Structures -> X10 Places -> System Nodes]
• X10 language defines mapping from X10 objects to X10 places, and abstract performance metrics on places
• Future X10 deployment system will define mapping from X10 places to system nodes; not supported in current implementation
Extension of async and future to places
• async (P) S
− Creates new activity to execute statement S at place P
− “async S” is equivalent to “async (here) S”
• future (P) { E }
− Create new activity to evaluate expression E at place P
− “future { E } ” is equivalent to “future (here) { E }”
• Note that “here” in a child activity for an async/future computation will refer to the place P at which the child activity is executing, not the place where the parent activity is executing
• The goal is to specify the destination place for async/future activities so as to obey the Locality Rule e.g.,
− async (O.location) O.x = 1;
− future<int> F = future (A.distribution[i]) { A[i] } ;
Distribution = mapping from region to places
• Creating distributions (x10.lang.dist):
− dist D1 = R-> here; // local distribution – maps region R to here
− dist D2 = dist.factory.block(R); // blocked distribution
− dist D3 = dist.factory.cyclic(R); // cyclic distribution
− dist D4 = dist.factory.unique(); // identity map on [0:MAX_PLACES-1]
• Using distributions
− D[P] = place to which point P is mapped by distribution D (assuming that P is in D.region)
− Allocate a distributed array e.g., T[.] A = new T[ D ];
• Allocates an array with index set = D.region, such that element A[P] is located at place D[P] for each point P in D.region
• NOTE: “new T[R]” for region R is equivalent to “new T[R->here]”
− Iterating over a distribution – generalization of “foreach” to “ateach”
• ateach is discussed in more detail later
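Java sketch (not X10): a distribution is a function from points to places, which can be illustrated for a one-dimensional region [0:n-1] and P places. DistSketch and its methods are hypothetical names. The cyclic mapping is the usual round-robin rule; the block mapping assumes contiguous chunks of size ceil(n/P), which may differ from the convention used by dist.factory.block in the X10 implementation.

```java
class DistSketch {
    // Cyclic: point i goes to place i mod P (round-robin).
    static int cyclicPlace(int i, int places) {
        return i % places;
    }

    // Block (assumed convention): contiguous chunks of ceil(n / P) points.
    static int blockPlace(int i, int n, int places) {
        int chunk = (n + places - 1) / places;
        return i / chunk;
    }

    public static void main(String[] args) {
        int n = 10, places = 4;
        for (int i = 0; i < n; i++) {
            System.out.println("point " + i
                + " : block -> place " + blockPlace(i, n, places)
                + " ; cyclic -> place " + cyclicPlace(i, places));
        }
    }
}
```

In X10 terms, blockPlace(i, n, P) and cyclicPlace(i, P) play the role of D[P] for the two kinds of distribution D.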
Operations defined on distributions
• D.region = source region of distribution
• D.rank = rank of D.region
• D | R = region restriction for distribution D and region R (returns a restricted distribution)
• D | P = place restriction for distribution D and place P (returns region mapped by D to place P)
• D1 || D2 = union of distributions D1 and D2 (assumes that D1.region and D2.region are disjoint)
• D1.overlay(D2); // Overlay of D2 over D1 – asymmetric union
• D.contains(p) = true iff D.region contains point p
• D = R -> P, constant distribution which maps entire region R to place P
• D1 - D2 = distribution difference = D1 | (D1.region - D2.region)
• D.distributionEfficiency() = load balance efficiency of distribution D
Inter-place communication using async and future
• Question: how to assign A[i] = B[j], when A[i] and B[j] may be in different places?
• Answer #1 --- use nested async’s!
finish async ( B.distribution[j] ) {
final int bb = B[j];
async ( A.distribution[i] ) A[i] = bb;
}
• Answer #2 --- use future-force and an async!
final int b = future (B.distribution[j]) { B[j] }.force();
finish async ( A.distribution[i] ) A[i] = b;
Load Balance Efficiency
• Consider a parallel application that is executed on P places
• Let T(i) = computation load mapped to place i
− For distribution D, T(i) = (D | place.factory.place(i)).size()
• Let Tmax = max { T(i) | 0 <= i <= P-1 }
• Let E = SUM { T(i) | 0 <= i <= P-1 } / (Tmax * P)
• E is the load balance efficiency, 1/P <= E <= 1
• E = 1 is the best case: the computation load is perfectly balanced
• E = 1/P is the worst case: all of the computation load is placed on a single processor/place
• Load balance efficiency is one of the key factors that limit speedup on a parallel machine
− there are several other factors e.g., comm. & synchronization overhead
− ignoring other factors, we expect speedup to be <= E * P
• NOTE: also try “x10 –DUMP_STATS_ON_EXIT=true …” to see activity and atomic counts
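The definition of E can be checked by hand with a few lines of Java. This is a minimal sketch that assumes the per-place loads T(i) are already available as an int array; the class and method names are hypothetical and this is not the implementation of D.distributionEfficiency().

```java
/** Computes load balance efficiency E = SUM T(i) / (Tmax * P) from per-place loads. */
final class LoadBalanceSketch {
    /** load[i] is T(i), the computation load on place i; load.length is P. */
    static double efficiency(int[] load) {
        long sum = 0;
        int max = 0;
        for (int t : load) {
            sum += t;
            max = Math.max(max, t);
        }
        if (max == 0) {
            // E is undefined when no place has any load (this also covers an empty array).
            throw new IllegalArgumentException("at least one place must have nonzero load");
        }
        return (double) sum / ((double) max * load.length);
    }
}
```

For example, loads {4, 4, 4, 4} give E = 1, loads {16, 0, 0, 0} give E = 1/4 = 1/P, and loads {5, 3, 4, 4} give E = 16/20 = 0.8, so ignoring other factors the expected speedup on 4 places is at most 3.2.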
ateach loop (distributed parallel iteration)
• The X10 ateach loop is similar to the foreach loop, except that each iteration executes in parallel at a place specified by a distribution
− “ateach ( point p : D ) S ” is equivalent to “for ( point p : D.region ) async (D[p]) S”
• As before, finish can be used to wait for termination of all ateach iterations
− “finish ateach( point[i] : dist.factory.unique() ) S” creates one activity per place, as in an SPMD computation
− ateach is a convenient construct for writing parallel matrix code that is independent of the underlying distribution e.g.,
• ateach ( point p : A.distribution ) A[p] = f(B[p], C[p], D[p]) ;
Example (see TutAteach1.x10)
public class TutAteach1 {
public static void main(String args[]) {
finish ateach( point[i] : dist.factory.unique() ) {
System.out.println("Hello from " + i);
}
} // main()
} // TutAteach1
Console output:
Hello from 1
Hello from 0
Hello from 3
Hello from 2
dist.factory.unique() maps point i in the region, [0 : place.MAX_PLACES-1], to place place.factory.place(i)
Example: converting foreach to ateach (see TutAteach2.x10)
foreach version:
// For the A[i,j] = F(A[i,j]) case,
// both loops can execute in parallel
finish foreach ( point[i,j] : A.region )
A[i,j] = A[i,j] + 1;
ateach version #1:
finish ateach ( point[i,j] : A.distribution)
A[i,j] = A[i,j] + 1;
ateach version #2 (create only one activity per place):
finish ateach ( point p : dist.factory.unique() )
for ( point[i,j] : A.distribution | here )
A[i,j] = A[i,j] + 1;
Example: converting foreach to ateach, contd. (see TutAteach2.x10)
foreach version:
// For the A[i,j] = F(A[i,j-1]) case,
// only the outer loop can execute in parallel
finish foreach ( point[i] : [1:N] )
for ( point[j]: [2:N] )
A[i,j] = A[i,j-1] + 1;
ateach version:
// Assume that N is a multiple of place.MAX_PLACES
finish ateach ( point[i] : dist.factory.block([1:N]) )
for ( point[j]: [2:N] )
A[i,j] = A[i,j-1] + 1;
X10 clocks: Motivation
• Activity coordination using finish and force() is accomplished by checking for activity termination
• However, there are many cases in which a producer-consumer relationship exists among the activities, and a “barrier”-like coordination is needed without waiting for activity termination
− The activities involved may be in the same place or in different places
[Figure: Activity 0, Activity 1, Activity 2, . . . shown side by side, advancing together through Phase 0, Phase 1, . . .]
X10 Tutorial 57
X10 Clocks
clock c = clock.factory.clock();
− Allocate a clock, register current activity with it. Phase 0 of c starts.
async(…) clocked (c1,c2,…) S
ateach(…) clocked (c1,c2,…) S
foreach(…) clocked (c1,c2,…) S
− Create async activities registered on clocks c1, c2, …
c.resume();
− Nonblocking operation that signals completion of work by current activity for this phase of clock c
next;
− Barrier --- suspend until all clocks that the current activity is registered with can advance. c.resume() is first performed for each such clock, if needed.
− next can be viewed as a “finish” of all computations under way in the current phase of the clock
X10 Tutorial 58
X10 Clocks (contd.)
c.drop();
− Unregister with c. A terminating activity will implicitly drop all clocks that it is registered on.
c.registered()
− Return true iff current activity is registered on clock c
− c.dropped() returns the opposite of c.registered()
ClockUseException
− Thrown if an activity attempts to transmit or operate on a clock that it is not registered on
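Readers familiar with Java may find it helpful to compare clocks with java.util.concurrent.Phaser. The sketch below is a loose analogy, not X10 semantics: new Phaser(1) plays the role of clock.factory.clock(), register() before starting a thread plays the role of “clocked (c)”, arriveAndAwaitAdvance() plays the role of next, and arriveAndDeregister() plays the role of c.drop(). The class name ClockSketch and its method are hypothetical. Each worker marks its work for a phase, waits at the barrier, and then checks that every worker's mark for that phase is visible.

```java
import java.util.concurrent.Phaser;
import java.util.concurrent.atomic.AtomicInteger;

/** Phaser-based analogy for clocked activities; returns how many barrier violations were seen. */
final class ClockSketch {
    static int run(int n, int phases) {
        final Phaser clock = new Phaser(1);            // parent is registered, like clock.factory.clock()
        final int[][] done = new int[phases][n];       // done[ph][id] == 1 once worker id finished phase ph
        final AtomicInteger violations = new AtomicInteger();
        Thread[] workers = new Thread[n];
        for (int k = 0; k < n; k++) {
            final int id = k;
            clock.register();                          // register the child before it starts, like clocked (c)
            workers[k] = new Thread(() -> {
                for (int ph = 0; ph < phases; ph++) {
                    done[ph][id] = 1;                  // this worker's work for phase ph
                    clock.arriveAndAwaitAdvance();     // like next: wait for all registered parties
                    int sum = 0;
                    for (int v : done[ph]) {
                        sum += v;
                    }
                    if (sum != n) {
                        violations.incrementAndGet();  // some worker had not finished phase ph
                    }
                }
                clock.arriveAndDeregister();           // like c.drop() when the activity ends
            });
            workers[k].start();
        }
        clock.arriveAndDeregister();                   // parent drops the clock so it does not block the children
        for (Thread t : workers) {
            try {
                t.join();                              // like the enclosing finish
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return violations.get();
    }
}
```

One difference worth noting: in this Java sketch a thread must deregister explicitly, whereas a terminating X10 activity drops its clocks implicitly.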
Example (see TutClock1.x10)
finish async {
final clock c = clock.factory.clock();
foreach (point[i]: [1:N]) clocked (c) {
while ( true ) {
int old_A_i = A[i]; int new_A_i = Math.min(A[i],B[i]);
if ( i > 1 ) new_A_i = Math.min(new_A_i,B[i-1]);
if ( i < N ) new_A_i = Math.min(new_A_i,B[i+1]);
A[i] = new_A_i;
next;
int old_B_i = B[i]; int new_B_i = Math.min(B[i],A[i]);
if ( i > 1 ) new_B_i = Math.min(new_B_i,A[i-1]);
if ( i < N ) new_B_i = Math.min(new_B_i,A[i+1]);
B[i] = new_B_i;
next;
if ( old_A_i == new_A_i && old_B_i == new_B_i ) break;
} // while
} // foreach
} // finish async
NOTE: exiting from the while loop terminates the activity for iteration i, and automatically deregisters the activity from the clock
NOTE: the “clocked (c)” clause on the foreach is an example of transmitting a clock from parent to child
nullable
• By default, object references in X10 are not allowed to take on the null value
• However, the nullable type constructor can be used to enable certain object references to be set to null, or to compare them with null e.g.,
T1 a;
nullable T2 b;
a = null; // Not allowed
b = null; // Allowed
• NOTE: “const” is simply a shorthand for “static final”
extern
• X10 provides a simple mechanism for invoking external functions written in C
• Currently, the C function is restricted to arguments with primitive types or references to “unsafe” X10 arrays
• The X10 program must contain an external declaration of the C function as follows …
static extern char doit(int a, float b);
… and also a statement to ensure that the native DLL, <dll>.dll is loaded
static { System.loadLibrary("<dll>"); }
• The X10 compiler then generates a file called <class>_x10stub.c
• To generate the DLL, the C programmer must include the file jni.h in their C source file, compile the C function, and link it with the object file obtained from <class>_x10stub.c