X10: An Object-Oriented Approach to Non-uniform Cluster Computing
Vijay Saraswat
IBM Research
July 23, 2003 IBM PL Day 2005 2
Overview
Introduction and context
  Clustered Computing
Language model and constructs
  Big picture
  places, atomic, async, finish, clocks, arrays
Example programs and demo
Conclusion and Future Work
  Guarantees
  Challenges
Acknowledgements
X10 Tools: Julian Dolby, Steve Fink, Robert Fuhrer, Matthias Hauswirth, Peter Sweeney, Frank Tip, Mandana Vaziri
University partners: MIT (StreamIt), Purdue University (X10), UC Berkeley (StreamBit), U. Delaware (Atomic sections), U. Illinois (Fortran plug-in), Vanderbilt University (Productivity metrics), DePaul U (Semantics)
X10 core team: Philippe Charles, Chris Donawa (IBM Toronto), Kemal Ebcioglu, Christian Grothoff (Purdue), Allan Kielstra (IBM Toronto), Maged Michael, Christoph von Praun, Vivek Sarkar
Additional contributors to X10 ideas: David Bacon, Bob Blainey, Perry Cheng, Julian Dolby, Guang Gao (U Delaware), Robert O'Callahan, Filip Pizlo (Purdue), Lawrence Rauchwerger (Texas A&M), Mandana Vaziri, Jan Vitek (Purdue), V.T. Rajan, Radha Jagadeesan (DePaul)
X10 PM+Tools Team Lead: Kemal Ebcioglu, Vivek Sarkar
PERCS Principal Investigator: Mootaz Elnozahy
Performance and Productivity Challenges
1) Memory wall: Architectures exhibit severe non-uniformities in bandwidth & latency in memory hierarchy
[Figure: memory hierarchy. Memory at the top, then L3 cache, then several L2 caches, each serving a processor cluster of PEs with L1 caches. Levels of parallelism shown alongside: Clusters (scale-out), SMP, Multiple cores on a chip, Coprocessors (SPUs), SMTs, SIMD, ILP.]
2) Frequency wall: Architectures introduce hierarchical heterogeneous parallelism to compensate for frequency scaling slowdown
3) Scalability wall: Software will need to deliver ~10^5-way parallelism to utilize peta-scale parallel systems
High Complexity Limits Development Productivity
HPC Software Lifecycle
[Figure: HPC software lifecycle. Boxes: Requirements; Input Data; Written Specification; Algorithm Development; Source Code; Parallel Specification; Development of Parallel Source Code (Design, Code, Test, Port, Scale, Optimize); Production Runs of Parallel Code; Maintenance and Porting of Parallel Code.]
[Figure: memory hierarchy (Memory, L3 cache, L2 caches, processor clusters of PEs with L1 caches), captioned "One billion transistors in a chip".]
1995: entire chip can be accessed in 1 cycle
2010: only small fraction of chip can be accessed in 1 cycle
Major sources of complexity for application developer:
1) Severe non-uniformities in data accesses
2) Applications must exhibit large degrees of parallelism (up to ~10^5 threads)
Complexity leads to increases in all phases of HPC Software Lifecycle related to parallel code
PERCS Programming Model/Tools: Overall Architecture
PERCS = Productive Easy-to-use Reliable Computer Systems
Integrated Programming Environment: Edit, Compile, Debug, Visualize, Refactor
  Use Eclipse platform (eclipse.org) as foundation for integrating tools
  Morphogenic Software: separation of concerns, separation of roles
  Productivity Metrics
  Performance Exploration
[Figure: overall architecture, top to bottom.]
  Source languages and toolkits
    X10 source code: X10 Development Toolkit
    Java+Threads+Conc utils: Java Development Toolkit
    C/C++ /MPI /OpenMP: C Development Toolkit
    Fortran/MPI/OpenMP: Fortran Development Toolkit
  Components and runtimes
    X10 Components, X10 runtime
    Java components, Java runtime
    Fortran components, Fortran runtime
    C/C++ components, C/C++ runtime
    Fast extern interface
  Integrated Concurrency Library: messages, synchronization, threads
  Continuous Program Optimization (CPO)
  PERCS System Software (K42)
  PERCS System Hardware
X10 Design Assumptions
Support High Productivity (&, possibly) High Performance Programmer
Scalability
  Axiom: Programmer must have explicit language constructs to deal with non-uniformity of access.
  Axiom: Allow specification of a large collection of activities.
  Axiom: A program must use scalable synchronization constructs.
  Axiom: The runtime may implement aggregate operations more efficiently than user-specified iterations with index variables.
  Axiom: The user may know more than the compiler/RTS.
Productivity
  Axiom: OO provides proven baseline productivity, maintenance, portability benefits.
  Axiom: Design must rule out large classes of errors (Type safe, Memory safe, Pointer safe, Lock safe, Clock safe …)
  Axiom: Design must support incremental introduction of explicit place types/remote operations.
  Axiom: PM must integrate with static tools (Eclipse) -- flag performance problems, refactor code, detect races.
  Axiom: PM must support automatic static and dynamic optimization (CPO).
The X10 Programming Model
A program is a collection of places, each containing resident data and a dynamic collection of activities.
Program may distribute aggregate data (arrays) across places during allocation.
Program may directly operate only on local data, using atomic blocks.
Program may spawn multiple (local or remote) activities in parallel.
Program must use asynchronous operations to access/update remote data.
Program may detect termination or (repeatedly) detect quiescence of a data-dependent, distributed set of activities.
Cluster Computing: Common framework for P>=1, covering Shared Memory (P=1) and MPI (P > 1)
[Figure: two places. Each place holds activities with activity-local storage (heap, stack, control), a place-local heap, and its part of the partitioned global heap. Places exchange outbound and inbound activities and activity replies. Immutable data is shown outside the places.]
Granularity of place can range from single register file to an entire SMP system
Constructs, by level:
  place
  distribution
  async, {at/for}each
  atomic, when, finish, clock
Formalized in Saraswat, Jagadeesan “Concurrent Clustered Programming”.
async
async (P) S
  Parent activity creates a new child activity at place P, to execute statement S; returns immediately.
  S may reference final variables in enclosing blocks.

double A[D]=…; // Global dist. array
final int k = …;
async ( A.distribution[99] ) {
  // Executed at A[99]’s place
  atomic A[99] = k;
}
async PlaceExpressionSingleListopt Statement
cf Cilk’s spawn
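In plain Java terms, async (P) S resembles handing S to an executor that stands for place P and continuing without waiting. The sketch below is an analogy only, not X10 runtime code: the `Place` class, the single worker thread per place, and the two-place block layout of 50 elements each are assumptions made for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for an X10 place: it owns one block of the array and
// runs the activities sent to it on its own worker thread.
final class Place {
    final double[] block;
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    Place(int blockSize) { block = new double[blockSize]; }

    // Analogue of "async (P) S": enqueue S at this place and return immediately.
    CompletableFuture<Void> async(Runnable s) {
        return CompletableFuture.runAsync(s, worker);
    }

    void shutdown() { worker.shutdown(); }
}

final class AsyncSketch {
    // Assumed layout: global indices 0..49 live at place 0, 50..99 at place 1.
    static final int BLOCK = 50;

    // Writes k to global index 99 by running the update at the owning place.
    static double writeAtOwner(int k) {
        Place[] places = { new Place(BLOCK), new Place(BLOCK) };
        final int index = 99;
        Place owner = places[index / BLOCK];   // plays the role of A.distribution[99]
        // The lambda captures k and index, like S referencing final variables.
        CompletableFuture<Void> done = owner.async(() -> owner.block[index % BLOCK] = k);
        done.join();                           // wait only so the result can be read back
        double result = owner.block[index % BLOCK];
        for (Place p : places) p.shutdown();
        return result;
    }

    public static void main(String[] args) {
        System.out.println(writeAtOwner(7));
    }
}
```

Because each place has a single worker thread in this sketch, updates to a block are serialized at its owner, which is a much coarser guarantee than the atomic block in the slide's example.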
finish
finish S
  Execute S, but wait until all (transitively) spawned async’s have terminated.
Rooted Exception Model
  Trap all exceptions thrown by spawned activities.
  Throw an (aggregate) exception if any spawned async terminates abruptly.
Useful for expressing “synchronous” operations on remote data
  And potentially, ordering information in a weakly consistent memory model

finish ateach(point [i]:A) A[i] = i;
finish async(A.distribution[j]) A[j] = 2;
// All A[i]=i will complete before A[j]=2;

Statement ::= finish Statement
cf Cilk’s sync
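The finish behaviour described above (wait for transitively spawned activities, trap their exceptions, rethrow an aggregate) can be approximated in Java with a `java.util.concurrent.Phaser` used as a counter of pending activities. This is a sketch under assumptions: `FinishScope` and `FinishSketch` are hypothetical names, and a shared thread pool replaces X10 places.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Phaser;
import java.util.concurrent.atomic.AtomicIntegerArray;

// Hypothetical analogue of a "finish" scope: it counts every activity spawned
// through it, including nested spawns, and collects their exceptions.
final class FinishScope {
    private final ExecutorService pool;
    private final Phaser pending = new Phaser(1);   // the one initial party is the scope owner
    private final Queue<Throwable> failures = new ConcurrentLinkedQueue<>();

    FinishScope(ExecutorService pool) { this.pool = pool; }

    // Analogue of "async S" inside the scope; S may itself call async on the scope.
    void async(Runnable s) {
        pending.register();                          // count the child before it starts
        pool.execute(() -> {
            try { s.run(); }
            catch (Throwable t) { failures.add(t); } // trap now, report when the scope ends
            finally { pending.arriveAndDeregister(); }
        });
    }

    // Blocks until all counted activities have terminated, then reports failures.
    void await() {
        pending.arriveAndAwaitAdvance();
        if (!failures.isEmpty()) {
            RuntimeException aggregate = new RuntimeException("spawned activities failed");
            failures.forEach(aggregate::addSuppressed);
            throw aggregate;
        }
    }
}

final class FinishSketch {
    // Mirrors the slide: all A[i] = i complete before A[j] = 2 starts.
    static int run(int n, int j) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        AtomicIntegerArray a = new AtomicIntegerArray(n);
        try {
            FinishScope first = new FinishScope(pool);
            for (int i = 0; i < n; i++) {
                final int idx = i;
                first.async(() -> a.set(idx, idx));
            }
            first.await();
            FinishScope second = new FinishScope(pool);
            second.async(() -> a.set(j, 2));
            second.await();
            return a.get(j);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(run(100, 5));
    }
}
```

A child registers any activity it spawns before it deregisters itself, so the pending count cannot reach zero while nested work is still outstanding.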
atomic
Atomic blocks are conceptually executed in a single step, while other activities are suspended.
An atomic block may not include
  Blocking operations
  Accesses to data at remote places
  Creation of activities at remote places

// push data onto concurrent list-stack
Node<int> node=new Node<int>(17);
atomic { node.next = head; head = node; }

// target defined in lexically enclosing environment.
// (parameters renamed: "new" is a reserved word)
public atomic boolean CAS( Object oldVal, Object newVal) {
  if (target.equals(oldVal)) {
    target = newVal;
    return true;
  }
  return false;
}

Statement ::= atomic Statement
MethodModifier ::= atomic
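For readers more familiar with Java, the list-stack push can be approximated with one shared lock standing in for the atomic block. This is an assumption made for illustration, not a description of how X10 implements atomic; `ListStack` and `AtomicSketch` are hypothetical names.

```java
// Hypothetical Java analogue of the slide's list-stack push. The X10 atomic
// block is approximated by a single shared lock.
final class ListStack {
    private static final class Node {
        final int value;
        Node next;
        Node(int value) { this.value = value; }
    }

    private final Object lock = new Object();
    private Node head;

    void push(int value) {
        Node node = new Node(value);   // allocation stays outside the critical section
        synchronized (lock) {          // stands in for: atomic { node.next = head; head = node; }
            node.next = head;
            head = node;
        }
    }

    int size() {
        synchronized (lock) {
            int n = 0;
            for (Node p = head; p != null; p = p.next) n++;
            return n;
        }
    }
}

final class AtomicSketch {
    // Pushes from several threads at once and returns the final stack size.
    static int run(int threads, int pushesPerThread) {
        ListStack stack = new ListStack();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < pushesPerThread; i++) stack.push(i);
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return stack.size();
    }

    public static void main(String[] args) {
        System.out.println(run(4, 1000));
    }
}
```

The critical section contains only the two pointer updates and nothing that blocks, in line with the restrictions listed for atomic blocks.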
when
Activity suspends until a state in which the guard is true; in that state the body is executed atomically.
Statement ::= WhenStatement
WhenStatement ::= when ( Expression ) Statement

class OneBuffer {
  nullable Object datum = null;
  boolean filled = false;
  public void send(Object v) {
    when ( !filled ) {
      this.datum = v;
      this.filled = true;
    }
  }
  public Object receive() {
    when ( filled ) {
      Object v = datum;
      datum = null;
      filled = false;
      return v;
    }
  }
}
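A rough Java counterpart of OneBuffer uses a guarded wait loop on the object's monitor: the loop condition is the negated guard, and the code after the loop is the body, run while holding the monitor. This is a sketch of the idea only; the `OneBufferSketch` driver and its producer thread are assumptions added for illustration.

```java
// Hypothetical Java analogue of the slide's OneBuffer. "when (guard) body" is
// approximated by waiting on the monitor until the guard holds.
final class OneBuffer {
    private Object datum = null;
    private boolean filled = false;

    synchronized void send(Object v) throws InterruptedException {
        while (filled) wait();     // suspend until the guard !filled holds
        datum = v;
        filled = true;
        notifyAll();               // guards of waiting receivers may now hold
    }

    synchronized Object receive() throws InterruptedException {
        while (!filled) wait();    // suspend until the guard filled holds
        Object v = datum;
        datum = null;
        filled = false;
        notifyAll();               // guards of waiting senders may now hold
        return v;
    }
}

final class OneBufferSketch {
    // A producer sends 1..n through the buffer; the caller sums what it receives.
    static int run(int n) {
        OneBuffer buffer = new OneBuffer();
        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= n; i++) buffer.send(i);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        int sum = 0;
        try {
            for (int i = 0; i < n; i++) sum += (Integer) buffer.receive();
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(run(100));
    }
}
```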
regions, distributions
Region: a (multi-dimensional) set of indices
Distribution: a mapping from indices to places
High level algebraic operations are provided on regions and distributions
region R = 0:100;
region R1 = [0:100, 0:200];
region RInner = [1:99, 1:199];
// a local distribution
distribution D1=R-> here;
// a blocked distribution
distribution D = block(R);
// union of two distributions
distribution D = (0:1) -> P0 || (2:N) -> P1;
distribution DBoundary = D - RInner;
Based on ZPL.
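To make "a mapping from indices to places" concrete, here is a minimal Java sketch of a one-dimensional block distribution. The class name, the ceiling-division block size, and the numbering of places as 0..places-1 are assumptions; how X10's block(R) splits a region that does not divide evenly is not stated on the slide.

```java
// Hypothetical sketch of a one-dimensional block distribution: a mapping from
// the indices lo..hi (inclusive) to place ids 0..places-1.
final class BlockDistribution {
    final int lo, hi, places, blockSize;

    BlockDistribution(int lo, int hi, int places) {
        if (hi < lo || places <= 0) {
            throw new IllegalArgumentException("empty region or no places");
        }
        this.lo = lo;
        this.hi = hi;
        this.places = places;
        int count = hi - lo + 1;
        // Ceiling division: every place but the last holds blockSize indices.
        this.blockSize = (count + places - 1) / places;
    }

    // Returns the id of the place that owns index i.
    int placeOf(int i) {
        if (i < lo || i > hi) throw new IndexOutOfBoundsException("index " + i);
        return (i - lo) / blockSize;
    }

    public static void main(String[] args) {
        BlockDistribution d = new BlockDistribution(0, 100, 4); // region 0:100 over 4 places
        for (int i : new int[] { 0, 25, 26, 100 }) {
            System.out.println(i + " -> place " + d.placeOf(i));
        }
    }
}
```

With 101 indices over 4 places the block size in this sketch is 26, so the last place holds only 23 indices.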
arrays
Arrays may be
  Multidimensional
  Distributed
  Value types
  Initialized in parallel:
    int [D] A= new int[D] (point [i,j]) {return N*i+j;};
Array section
  A [RInner]
High level parallel array, reduction and span operators
  Highly parallel library implementation
  A-B (array subtraction)
  A.reduce(intArray.add,0)
  A.sum()
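The array operations listed above have loose single-machine counterparts in the Java standard library. The sketch below shows parallel initialization, element-wise subtraction and a sum reduction; the row-major flattening and the `ArrayOps` helper are assumptions, and nothing here is distributed across places.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical single-machine counterparts of the slide's array operations.
final class ArrayOps {
    // Analogue of "new int[D] (point [i,j]) { return N*i+j; }":
    // fills a rows x cols array, stored row-major, in parallel.
    static int[] init(int rows, int cols, int n) {
        int[] a = new int[rows * cols];
        Arrays.parallelSetAll(a, k -> n * (k / cols) + (k % cols));
        return a;
    }

    // Analogue of A-B (array subtraction).
    static int[] subtract(int[] a, int[] b) {
        if (a.length != b.length) throw new IllegalArgumentException("shape mismatch");
        return IntStream.range(0, a.length).parallel().map(k -> a[k] - b[k]).toArray();
    }

    // Analogue of A.reduce(intArray.add, 0) and A.sum().
    static int sum(int[] a) {
        return Arrays.stream(a).parallel().reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        int[] a = init(2, 3, 10);
        System.out.println(Arrays.toString(a));
        System.out.println(sum(subtract(a, a)));
    }
}
```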
ateach, foreach
ateach (point p:A) S
  Creates |region(A)| async statements
  Instance p of statement S is executed at the place where A[p] is located
foreach (point p:R) S
  Creates |R| async statements in parallel at current place
Termination of all activities can be ensured using finish.

ateach ( FormalParam: Expression ) Statement
foreach ( FormalParam: Expression ) Statement
public boolean run() {
  distribution D = distribution.factory.block(TABLE_SIZE);
  long[.] table = new long[D] (point [i]) { return i; };
  long[.] RanStarts = new long[distribution.factory.unique()]
    (point [i]) { return starts(i); };
  long[.] SmallTable = new long value[TABLE_SIZE]
    (point [i]) { return i*S_TABLE_INIT; };
  // finish waits for every spawned update before the result is checked
  finish ateach (point [i] : RanStarts ) {
    long ran = nextRandom(RanStarts[i]);
    for (int count: 1:N_UPDATES_PER_PLACE) {
      int J = f(ran);
      long K = SmallTable[g(ran)];
      async atomic table[J] ^= K;
      ran = nextRandom(ran);
    }
  }
  return table.sum() == EXPECTED_RESULT;
}
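The table-update pattern in the program above (many parallel streams, each applying atomic XOR updates to a shared table) can be imitated on a single JVM. In this sketch a parallel stream plays the role of finish ateach, and `AtomicLongArray.accumulateAndGet` plays the role of async atomic table[J] ^= K. The xorshift generator, the seeding, and the use of the random value itself as K are assumptions; the slide does not define nextRandom, f, g or starts.

```java
import java.util.concurrent.atomic.AtomicLongArray;
import java.util.stream.IntStream;

final class RandomAccessSketch {
    // Assumed stand-in for the slide's nextRandom: one 64-bit xorshift step.
    static long nextRandom(long x) {
        x ^= x << 13;
        x ^= x >>> 7;
        x ^= x << 17;
        return x;
    }

    // Analogue of the table initializer: table[i] = i.
    static AtomicLongArray newTable(int size) {
        AtomicLongArray table = new AtomicLongArray(size);
        for (int i = 0; i < size; i++) table.set(i, i);
        return table;
    }

    // Runs `streams` parallel update streams of `updates` XOR updates each.
    // forEach returns only after every stream has finished, like finish.
    static void update(AtomicLongArray table, int streams, int updates) {
        int size = table.length();
        IntStream.range(0, streams).parallel().forEach(s -> {
            long ran = nextRandom(s + 1);                    // nonzero seed per stream
            for (int c = 0; c < updates; c++) {
                int j = (int) Long.remainderUnsigned(ran, size);
                final long k = ran;
                table.accumulateAndGet(j, k, (cur, v) -> cur ^ v);  // atomic table[j] ^= k
                ran = nextRandom(ran);
            }
        });
    }

    public static void main(String[] args) {
        AtomicLongArray table = newTable(64);
        update(table, 4, 1000);
        update(table, 4, 1000);   // XOR is its own inverse, so this undoes the first pass
        boolean restored = true;
        for (int i = 0; i < table.length(); i++) restored &= table.get(i) == i;
        System.out.println(restored);
    }
}
```

Applying the same update sequence twice restores the initial table regardless of interleaving, because XOR updates commute and cancel. That gives a simple correctness check that does not depend on a precomputed EXPECTED_RESULT.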
clocks
Operations
  clock c = new clock();
  c.resume();
    Signals completion of work by activity in this clock phase.
  next;
    Blocks until all clocks it is registered on can advance. Implicitly resumes all clocks.
  c.drop();
    Unregister activity with c.
  async (P) clock (c1,…,cn) S
    (Clocked async): activity is registered on the clocks (c1,…,cn)
Static Semantics
  An activity may operate only on those clocks it is live on.
  In finish S, S may not contain any top-level clocked asyncs.
Dynamic Semantics
  A clock c can advance only when all its registered activities have executed c.resume().
No explicit operation to register a clock.
Supports over-sampling, hierarchical nesting.
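The closest standard Java counterpart to a clock is probably `java.util.concurrent.Phaser`. Under that reading, `arriveAndAwaitAdvance()` corresponds roughly to next, `arrive()` to c.resume(), and `arriveAndDeregister()` to c.drop(). The mapping is an analogy rather than an equivalence, and `ClockSketch` is a hypothetical name. The sketch runs several workers through phases in lockstep and checks that no worker enters a new phase before all have finished the previous one.

```java
import java.util.concurrent.Phaser;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicIntegerArray;

final class ClockSketch {
    // Runs `workers` threads through `phases` phases; a Phaser plays the clock.
    // Returns true if, after every advance, all workers had finished that phase.
    static boolean run(int workers, int phases) {
        Phaser clock = new Phaser(workers);            // all workers registered up front
        AtomicIntegerArray finished = new AtomicIntegerArray(workers);
        AtomicBoolean inStep = new AtomicBoolean(true);
        Thread[] threads = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            final int me = w;
            threads[w] = new Thread(() -> {
                for (int p = 1; p <= phases; p++) {
                    finished.set(me, p);               // the work of phase p
                    clock.arriveAndAwaitAdvance();     // analogue of "next"
                    for (int other = 0; other < workers; other++) {
                        if (finished.get(other) < p) inStep.set(false);
                    }
                }
                clock.arriveAndDeregister();           // analogue of "c.drop()"
            });
            threads[w].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return inStep.get();
    }

    public static void main(String[] args) {
        System.out.println(run(4, 5));
    }
}
```

Unlike the clocked async on the slide, registration here is explicit through the Phaser constructor, which is one place where the analogy departs from "No explicit operation to register a clock."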
Example: SpecJBB
finish async {
  clock c = new clock();
  Company company = createCompany(...);
  for (int w : 0:wh_num)
    for (int t: 0:term_num)
      async clocked(c) { // a client
        initialize;
        next; //1.
        while (company.mode!=STOP) {
          select a transaction;
          think;
          process the transaction;
          if (company.mode==RECORDING)
            record data;
          if (company.mode==RAMP_DOWN) {
            c.resume(); //2.
          }
        }
        gather global data;
      } // a client
  // master activity
  next; //1.
  company.mode = RAMP_UP;
  sleep rampuptime;
  company.mode = RECORDING;
  sleep recordingtime;
  company.mode = RAMP_DOWN;
  next; //2.
  // All clients in RAMP_DOWN
  company.mode = STOP;
} // finish
// Simulation completed.
print results.
Formal semantics (FX10)
Based on Middleweight Java (MJ)
Configuration is a tree of located processes
  Tree necessary for finish.
Clocks formalized using short circuits (PODC 88).
Bisimulation semantics.
Basic theorems
  Equational laws
  Clock quiescence is stable.
  Monotonicity of places.
  Deadlock freedom (for language w/out when).
  …
  Type Safety
  …
  Memory Safety
Current Status
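We have an operational X10 0.41 implementation
All programs shown here run.
[Figure: implementation pipeline. X10 source goes through the Parser (generated from the X10 Grammar) to an AST; Analysis passes produce an Annotated AST; the Code emitter, using Code Templates, produces Target Java; this runs on a JVM with the X10 Multithreaded RTS and Native code to produce Program output.]
Structure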
•Translator based on Polyglot (Java compiler framework)
•X10 extensions are modular.
•Uses Jikes parser generator.
Code metrics
•Parser: ~45/14K*
•Translator: ~112/9K
•RTS: ~190/10K
•Polyglot base: ~517/80K
•Approx 180 test cases.
(* classes+interfaces/LOC)
Limitations
•Clocked final not yet implemented.
•Type-checking incomplete.
•No type inference.
•Implicit syntax not supported.
Timeline (PEM Events)
09/03  PERCS Kickoff
02/04  X10 Kickoff
07/04  X10 0.32 Spec Draft
02/05  X10 Prototype #1
07/05  X10 Productivity Study
12/05  X10 Prototype #2
06/06  Open Source Release?
Future Work: Implementation
Type checking/inference
  Clocked types
  Place-aware types
Consistency management
  Lock assignment for atomic sections
  Data-race detection
Activity aggregation
  Batch activities into a single thread.
Message aggregation
  Batch “small” messages.
Load-balancing
  Dynamic, adaptive migration of places from one processor to another.
Continuous optimization
Efficient implementation of scan/reduce
Efficient invocation of components in foreign languages
  C, Fortran
Garbage collection across multiple places
Welcome University Partners and other collaborators.
Future work: Other topics
Design/Theory
  Atomic blocks
  Structural study of concurrency and distribution
  Clocked types
  Hierarchical places
  Weak memory model
  Persistence/Fault tolerance
  Database integration
Tools
  Refactoring language.
Applications
  Several HPC programs planned currently.
  Also: web-based applications.
Welcome University Partners and other collaborators.
Backup material
Type system
Value classes
  May only have final fields.
  May only be subclassed by value classes.
  Instances of value classes can be copied freely between places.
nullable is a type constructor
  nullable T contains the values of T and null.
Place types: T@P, specify the place at which the data object lives.
Future work: Include generics and dependent types.
Example: Latch
public class Latch implements future {
  protected boolean forced = false;
  protected nullable boxed result = null;
  protected nullable exception z = null;

  public atomic boolean setValue( nullable Object val, nullable exception z ) {
    if ( forced ) return false;
    // these assignments happen only once.
    this.result = new boxed(); // allocate the box first: result starts out null
    this.result.val = val;
    this.z = z;
    this.forced = true;
    return true;
  }
  public atomic boolean forced() {
    return forced;
  }
  public Object force() {
    when ( forced ) {
      if (z != null) throw z;
      return result;
    }
  }
}

public interface future {
  boolean forced();
  Object force();
}
public class boxed {
nullable Object val;
}
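A plain-Java rendering of the Latch idea is a write-once cell whose force() blocks until a value or a failure has been set. The sketch below is an approximation: it uses the object's monitor in place of atomic and when, stores the value directly instead of in a boxed wrapper, and restricts the failure to RuntimeException so that force() needs no throws clause. All three choices are assumptions made for brevity.

```java
// Hypothetical Java analogue of the slide's Latch: a write-once result cell.
final class Latch {
    private boolean forced = false;
    private Object result = null;
    private RuntimeException failure = null;

    // Returns false if the latch was already set; only the first call wins.
    synchronized boolean setValue(Object val, RuntimeException z) {
        if (forced) return false;
        result = val;
        failure = z;
        forced = true;
        notifyAll();               // wake activities blocked in force()
        return true;
    }

    synchronized boolean forced() {
        return forced;
    }

    // Blocks until the latch is set, then returns the value or rethrows the failure.
    synchronized Object force() {
        boolean interrupted = false;
        while (!forced) {
            try {
                wait();
            } catch (InterruptedException e) {
                interrupted = true;    // keep waiting, restore the flag afterwards
            }
        }
        if (interrupted) Thread.currentThread().interrupt();
        if (failure != null) throw failure;
        return result;
    }

    public static void main(String[] args) {
        Latch latch = new Latch();
        new Thread(() -> latch.setValue("done", null)).start();
        System.out.println(latch.force());
    }
}
```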