X10—The Big Picture
© 2007 IBM Corporation1
IBM
Researc
h
X10: The Big Picturehttp://x10.sf.net [email protected]
Vijay Saraswat
July 2007
IBM ResearchThis material is based upon work
supported by the Defense Advanced
Research Projects Agency under its
Agreement No. HR0011-07-9-0002.
X10—The Big Picture
© 2007 IBM Corporation2
IBM
Researc
h
Acknowledgments
Recent Publications
1. “Constrained Types for OO Languages”, Submitted.
2. “Deadlock-free scheduling of X10 Computations with bounded
resources”, SPAA 2007.
3. “A Theory of Memory Models”, PPoPP 2007.
4. “May-Happen-in-Parallel Analysis of X10 Programs”, PPoPP
2007.
5. “An annotation and compiler plug-in system for X10”, IBM
Technical Report, Feb 2007.
6. “Experiences with an SMP Implementation for X10 based on
the Java Concurrency Utilities” Workshop on Programming
Models for Ubiquitous Parallelism (PMUP), September 2006.
7. "An Experiment in Measuring the Productivity of Three
Parallel Programming Languages”, P-PHEC workshop,
February 2006.
8. "X10: An Object-Oriented Approach to Non-Uniform Cluster
Computing", OOPSLA conference, October 2005.
9. "Concurrent Clustered Programming", CONCUR conference,
August 2005.
10. "X10: an Experimental Language for High Productivity
Programming of Scalable Systems", P-PHEC workshop,
February 2005.
Tutorials
� TiC 2006, PACT 2006, OOPSLA 2006, PPoPP 2007
� Graduate course on X10 at U Pisa (07/07)
� X10 Core Team– Rajkishore Barik, Ganesh
Bikshandi, Chris Donawa, Allan Kielstra, Sreedhar Kodali, Nathaniel Nystrom, Igor Peshansky, Christoph von Praun, Vijay Saraswat, Vivek Sarkar, Lex Spoon, Pradeep Varma, Krishna Venkat, Tong Wen
� X10 Tools– Philippe Charles, Julian Dolby,
Robert Fuhrer, Frank Tip, MandanaVaziri
� Emeritus– Kemal Ebcioglu, Christian Grothoff,
Vincent Cave� Research colleagues
– Ras Bodik, Guang Gao, Radha Jagadeesan, Jens Palsberg, RodricRabbah, Jan Vitek
– Vinod Tipparaju, Jarek Nieplocha(PNNL)
– Kathy Yelick, Dan Bonachea(Berkeley)
– Doug Lea (SUNY Oswego)– Several others at IBM
X10—The Big Picture
© 2007 IBM Corporation3
IBM
Researc
h
A new era of mainstream parallel processing
The Challenge Parallelism scaling replaces frequency scaling as foundation forincreased performance � Profound impact on future software
Multi-core chips Cluster ParallelismHeterogeneous Parallelism
16B/cycle (2x)16B/cycle
BIC
FlexIOTM
MIC
Dual XDRTM
16B/cycle
EIB (up to 96B/cycle)
16B/cycle
64-bit Power Architecture with VMX
PPE
SPE
LS
SXU
SPU
SMF
PXUL1
PPU
16B/cycle
L232B/cycle
LS
SXU
SPU
SMF
LS
SXU
SPU
SMF
LS
SXU
SPU
SMF
LS
SXU
SPU
SMF
LS
SXU
SPU
SMF
LS
SXU
SPU
SMF
LS
SXU
SPU
SMF
16B/cycle (2x)16B/cycle
BIC
FlexIOTM
MIC
Dual XDRTM
16B/cycle
EIB (up to 96B/cycle)
16B/cycle
64-bit Power Architecture with VMX
PPE
SPE
LS
SXU
SPU
SMF
LS
SXU
SPU
SMF
LS
SXU
SPU
LS
SXU
SPU
SMF
PXUL1
PPU
16B/cycle
PXUL1
PPU
16B/cycle
L232B/cycle
LS
SXU
SPU
SMF
LS
SXU
SPU
SMF
LS
SXU
SPU
LS
SXU
SPU
SMF
LS
SXU
SPU
SMF
LS
SXU
SPU
SMF
LS
SXU
SPU
LS
SXU
SPU
SMF
LS
SXU
SPU
SMF
LS
SXU
SPU
SMF
LS
SXU
SPU
LS
SXU
SPU
SMF
LS
SXU
SPU
SMF
LS
SXU
SPU
SMF
LS
SXU
SPU
LS
SXU
SPU
SMF
LS
SXU
SPU
SMF
LS
SXU
SPU
SMF
LS
SXU
SPU
LS
SXU
SPU
SMF
LS
SXU
SPU
SMF
LS
SXU
SPU
SMF
LS
SXU
SPU
LS
SXU
SPU
SMF
LS
SXU
SPU
SMF
LS
SXU
SPU
SMF
LS
SXU
SPU
LS
SXU
SPU
SMF
. . .
L2 Cache
PEs,L1 $
PEs,L1 $
. . .
. . .. . .
L2 Cache
PEs,L1 $
PEs,L1 $
. . .
Memory
PEs,
SMP Node
PEs,
. . . . . .
Memory
PEs,
SMP Node
PEs,
Interconnect
Our response: Use X10 as a new language for parallel hardware that builds onexisting tools, compilers, runtimes, virtual machines and libraries
X10—The Big Picture
© 2007 IBM Corporation4
IBM
Researc
h
A strategy – radical incrementalism
� One (mainstream) programming model to rule them all.– Multicore, Clusters, Combos– Sequential Java +
concurrency– places, async, finish, clock,
atomic, annotations
� Keep it simple, stupid– Look to aggressively
parallelize sequential programs
– Concurrency as a means to an end
� Address all approaches– VM extensions
– Libraries
– Annotations + extensive static analysis
– New languages
– Domain-specific parallelism
X10—The Big Picture
© 2007 IBM Corporation5
IBM
Researc
h
Themes
� Target productivity, without sacrificing performance.
� Build on what we understand – core sequential OO base.
� Manage concurrency and distribution explicitly.
� Target the “mainstream”programmer; do not ignore the expert.
� Rule out large classes of errors by design – statically as far as possible, else dynamically.
� Keep the language small, orthogonal.
� Make easy things easy, hard things possible.
� Address the entire tool chain: development environment, analysis tools, debugging, …
X10—The Big Picture
© 2007 IBM Corporation6
IBM
Researc
h
Programming Model -- The Big Picture
� Places
– Partition data and activities that operate on the data
– But keep address space global.
– Locality central to heterogeneity, clustering.
� Asynchrony
– Express fine-grained parallelism
– Exploit fork-join structure, recursive partitioning.
� Atomicity
– Specify granularity of atomicity.
– Focus on the desired property, not the mechanism.
� Ordering
– Termination detection
– Quiescence detection
X10—The Big Picture
© 2007 IBM Corporation7
IBM
Researc
h
X10 Programming Model – The Big Picture
• Dynamic parallelism with a Partitioned Global Address Space
• Places encapsulate binding of activities and globally addressable data
• Number of places currently fixed at launch time
• All concurrency is expressed as asynchronous activities – subsumes threads, structured parallelism, messaging, DMA transfers, etc.
• Atomic blocks enforce mutual exclusion of co-located data
• No place-remote accesses permitted in atomic section
• Immutable data offers opportunity for single-assignment parallelism
Storage classes:
� Activity-local
� Place-local
� Partitioned global
� Immutable
X10—The Big Picture
© 2007 IBM Corporation8
IBM
Researc
h
X10 v1.01 Cheat sheetStm:
async [ ( Place ) ] [clocked ClockList ] Stm
atomic Stm
when ( SimpleExpr ) Stm
finish Stm
next; c.resume() c.drop()
for( i : Region ) Stm
foreach ( i : Region ) Stm
ateach ( I : Distribution ) Stm
Expr:
ArrayExpr
ClassModifier : Kind
MethodModifier:
atomic nonblocking sequential
local safe
Type:
DataType
nullable<Type>
future <Type >
Kind :
value | reference
ClassType , InterfaceType :
TypeName
[ DepParameters ] [ PlaceTypeSpecifier ]
x10.lang has the following classes (among others)
point, range, region, distribution, clock, array
Some of these are supported by special syntax.
Annotations: @ InterfaceType
Annotations suffix types and prefix almost all other syntactic elements.
X10—The Big Picture
© 2007 IBM Corporation9
IBM
Researc
h
X10 v1.01 Cheat sheet: Array support
ArrayExpr:
new ArrayType ( Formal ) { Stm }
Distribution Expr -- Lifting
ArrayExpr [ Region ] -- Section
ArrayExpr | Distribution -- Restriction
ArrayExpr || ArrayExpr -- Union
ArrayExpr.overlay(ArrayExpr) -- Update
ArrayExpr. scan( [fun [, ArgList] )
ArrayExpr. reduce( [fun [, ArgList] )
ArrayExpr.lift( [fun [, ArgList] )
ArrayType:
Type [Kind] [ ]
Type [Kind] [ region(N) ]
Type [Kind] [ Region ]
Type [Kind] [ Distribution ]
Region:
Expr : Expr -- 1-D region
[ Range, …, Range ] -- Multidimensional Region
Region && Region -- Intersection
Region || Region -- Union
Region – Region -- Set difference
BuiltinRegion
Dist:
Region -> Place -- Constant distribution
Distribution | Place -- Restriction
Distribution | Region -- Restriction
Distribution || Distribution -- Union
Distribution – Distribution -- Set difference
Distribution.overlay ( Distribution )
BuiltinDistribution
Language supports type safety, memory safety, place safety, clock safety.
X10—The Big Picture
© 2007 IBM Corporation10
IBM
Researc
h
X10 availability
� X10 is an open source project (Eclipse Public License).
� Website: http://x10.sf.net
� Reference implementation in Java, runs on any Java 5 VM.
– Windows/Intel, Linux/Intel
– AIX/PPC, Linux/PPC
– Runs on multiprocessors
� Website contains
– Tutorial material
– Presentations
– Download instructions
– Copies of some papers
– Pointers to mailing list
X10—The Big Picture
© 2007 IBM Corporation11
IBM
Researc
h
Multi-core SMP Implementation
Analysis passesX10
source
AST
X10 Parser
Code
Generation
Templates
Java code emitter
Annotated AST
X10 Grammar
Target Java
DOMO
Static Analyzer
Java compiler
X10FrontEnd
Outbound
activities
Inbound activities
Outbound
repliesInbound replies
Place
Ready Activities
Completed
Activities
Blocked
Activities
Clock
Future
ExecutingActivities
. . .
Atomic sections do
not have blocking semantics
Activity can only access
its stack, place-local mutable data, or global
immutable data
X10 classfiles
(Java classfiles with
special annotations for X10 analysis info)
Java Concurrency Utilities (JCU)
Ready
Activities
Completed
Activities
Blocked
Activities
Clock
Future
Executing
Activities
. . .
Ready
Activities
Completed
Activities
Blocked
Activities
Clock
Future
Executing
Activities
. . .
Place 0 Place 1
. . .
Exte
rn inte
rfa
ce
Fortran,C/C++ DLL’s
X10Runtime
JCU thread pool
High Performance JRE
(IBM J9 VM+ Testarossa JIT
Compilermodified for X10
on PPC/AIX)
Portable Standard
Java 5 RuntimeEnvironment
(Runs on multiple
Platforms)
JavaRuntime
Common components w/ SAFARI
STM library
X10 libraries
Data Marking: DATA
X10—The Big Picture
© 2007 IBM Corporation12
IBM
Researc
h
PERCS Programming Model, Tools
X10 source codeFortran source code
(w/ MPI, OpenMP)
C/C++ source code(w/ MPI, OpenMP, UPC)
JavaTM source code(w/ threads & conc utils) . . .
C/C++components
Fortran components
C/C++ runtime Fortran runtime
. . .
X10 Components
X10 runtime
Integrated Parallel Runtime: MPI + LAPI + RDMA + OpenMP + threads
Java components
Java runtime
Fast externinterface
Dynamic Compilation + Continuous Program Optimization
Text in blue identifies
exploratory PERCS
contributions
Productivity Measurements
X10 Development
Toolkit
Java Development
Toolkit
C/C++Development
Toolkit+ MPI extensions
Fortran Development
Toolkit
. . .
PerformanceExploration
X10 Compiler
Java Compiler
C/C++ Compilerw/ UPC
extensions
Fortran Compiler
. . .
Parallel Tools Platform (PTP)
Eclipse
platform
Refactoring for Concurrency
X10—The Big Picture
© 2007 IBM Corporation13
IBM
Researc
h
X10DT: Enhancing productivity
� Code editing
� Refactoring
� Code visualization
� Data visualization
� Debugging
� Static performance analysis
Vision: State-of-the-art IDE for a modern OO language for HPC
X10 Incremental Builder;Problems View populatedw/ X10 compiler messages
Source editor w/ syntax
highlighting, auto indenting,
some content assist
Outline View populated w/
X10 types, members, loops
X10 Launch Configuration
X10—The Big Picture
© 2007 IBM Corporation14
IBM
Researc
h
Operational X10 implementation (since 02/2005)
Analysis passes
X10 source
AST
Parser
Code Templates
Code emitter
Annotated AST
X10 Grammar
Target Java
JVM
X10 Multithreaded
RTS
Native code
Program output
Structure
• Translator based on Polyglot (Java compiler framework)
• X10 extensions are modular.
• Uses Jikes parser generator.
Code metrics
•Parser: ~45/14K*
•Translator: ~112/9K
•RTS: ~190/10K – revised for JUC
•Polyglot base: ~517/80K
•Approx 280 test cases.
(* classes+interfaces/LOC)
New features 6/07
• Annotations
• X10lib v 1
09/03
02/04
07/04
02/05
07/05
12/05
PERCS Kickoff
X10 Kickoff
X10 0.32 Spec Draft
X10 Prototype #1
X10 ProductivityStudy
X10 Prototype #2
Open Source Release
X10 Compiler (06/2007)
12/06
6/07
Annotations, X10lib v1
X10—The Big Picture
© 2007 IBM Corporation15
IBM
Researc
h
X10Flash
� Distributed runtime
– In C/C++
– On top of messaging library (GASNet, ARMCI, LAPI)
– Targeted for high-performance clusters of SMPs.
� X10lib
– Runtime also made available as a standalone library.
– Supporting global address space, places, asyncs, clocks, futures etc.
� Performance goal
– To be competitive with MPI
� Release schedule
– Internal demonstration 12/07
– External release 2008
X10—The Big Picture
© 2007 IBM Corporation16
IBM
Researc
h
Conclusion
� X10 is intended to span multicore, heterogenoussystems and clusters.
� X10 offers a simple programming model for concurrency and distribution
– Places
– Asynchrony
– Atomicity
– Ordering
� X10 is being developed as an open source project
– Reference implementation available on http://x10.sf.net
� A clustered implementation is being worked on
– Compiles to C/C++
– Uses LAPI for inter-process messaging
– Uses algorithmic scheduling for intra-process scheduling
IBM Research: Software Technology
© 2005 IBM Corporation17
Pro
gra
mm
ing T
echnolo
gie
s
Backup
X10—The Big Picture
© 2007 IBM Corporation18
IBM
Researc
h
Comparison with MPI
Model of parallelism
� MPI: data parallel, SPMD
� X10: structured multithreading, SPMD is a special case.
Synchronization model
� MPI: clear distinction between control synchronization (barriers) and communication (or combine them in a collective).
– synchronous: send / receive
– asynchronous: put/get
� X10: shared memory paradigm
– address space partitioned, • locality of access visible to the programmer
– remote access are asynchronous (but may be made synchronous)
– Supports “active messages”.
X10—The Big Picture
© 2007 IBM Corporation19
IBM
Researc
h
Comparison with UPC
X10 is similar to UPC in ...� shared partitioned global
address space� barrier synchronization:
– UPC split barrier (upc_notify, upc_wait)
– X10 clocks: resume(), next� parallel loops with affinity
– UPC affinity clause in upc_forall
– X10 distribution in ateach
X10 extends UPC in ...� dynamic structured
multithreading (not SPMD)� safety properties (managed
runtime)� distributed computation and
data, concept of places� array language,
multidimensional arrays� critical section synchronization
– simple atomic blocks, not locks
– conditional atomic blocks� type system exposes access
locality� multiple, custom distributions
X10—The Big Picture
© 2007 IBM Corporation20
IBM
Researc
h
Comparison with CILK
X10 is similar to CILK in ...� structured concurrency
– Cilk: spawn, sync– X10: async, finish (block scoped,
rooted exception model), method body not forced to finish spawned asyncs.
– serial elision only for non-blocking activities in X10
X10 extends CILK in ...
� distributed computation and data, concept of places
– intra/inter-place work-stealing scheduler?
� array language, multidimensional arrays
� critical sections
– simple atomic blocks, not locks
– X10 has condition synchronization
� X10 managed runtime and type system: safety properties
X10—The Big Picture
© 2007 IBM Corporation21
IBM
Researc
h
Comparison with Java ™
X10 language builds on the Java language
Shared underlying philosophy: shared syntactic and semantic tradition, simple, small, easy to use, efficiently implementable, machine independent
X10 does not have:� Dynamic class loading� Java’s concurrency features
– thread library, volatile, synchronized, wait, notify
X10 restricts:
� Class variables and static initialization
X10 adds to Java:� value types, nullable� Array language
– Multi-dimensional arrays, aggregate operations
� New concurrency features– activities (async, future), atomic
blocks, clocks� Distribution
– places– distributed arrays
X10—The Big Picture
© 2007 IBM Corporation22
IBM
Researc
h
Overview of green field work
� Compilation for Combos– Cell + Opteron– Graphics cards – NVidia
(CUDA)
� Implement specific annotations and supporting static analyses– Integrate with WALA
(http://wala.sf.net)
� Support Implicit Parallelism– tryasync, dependency
annotations, flow annotations, …
– Static/dynamic support
� Runtime– x10lib – an implementation of
the X10 runtime in C/C++ for high-performance clusters
– Algorithmic scheduling• Design extensions to handle
X10 control constructs.• Evaluate different
scheduling strategies (work-stealing, parallel depth-first scheduling, …)
• Implement
X10—The Big Picture
© 2007 IBM Corporation23
IBM
Researc
h
Algorithmic Scheduling
� Centralized task-queue is a bottleneck
– Each worker must synchronize in order to get new work
� Thread pool may grow unboundedly
– when causes thread to suspend.
� Observation: finish/async programs have series/parallel dependence graphs.
� Solution: use Cilk-style work-stealing scheduler
– Guaranteed O(T1/p + T∞)
runtime, and p*S1 space.
� Todo:
– Investigate alternate schedulers with better space utilization/cache behavior (cfparallel depth-first scheduler)
– Consider hiearchical, multi-place scheduler
– Handle all X10 control constructs.
Preliminary results: very encouraging. Good scaling, absolute performance.
X10—The Big Picture
© 2007 IBM Corporation24
IBM
Researc
h
Speedup for spannning tree on a 32-core Niagara
Runtime ~ 0.5s for p=24, V=5M, E=20M
Speedup on a 32-core machine for SpanF
0
5
10
15
20
Num Cores
Sp
eed
up
V=1M
V=2M
V=3M
V=4M
V=5M
Ideal speedup
V=1M 1 1.86 3.75 7.27 12.90
V=2M 1 1.93 3.76 7.49 12.63
V=3M 1 1.92 3.65 7.32 12.58
V=4M 1 1.91 3.74 7.37 12.71
V=5M 1 1.95 3.82 7.43 13.15
Ideal 1 2.00 4.00 8.00 16.00
1 2 4 8 16
X10—The Big Picture
© 2007 IBM Corporation25
IBM
Researc
h
Speedup for spanning tree on 8-proc Opteron/Solaris
Runtime ~ 1.19s for p=8, V=5M, E=20M
Speedup on 8-way Opteron/Solaris
0
2
4
6
8
10
Num processors
Sp
eed
up
N=5M
N=4M
N=3M
N=2M
N=1M
Optimal speedup
N=5M 1 2.08 3.93 6.97
N=4M 1 1.95 4.11 7.04
N=3M 1 1.94 3.77 6.48
N=2M 1 1.99 3.83 7.00
N=1M 1 2.02 3.88 6.73
Optimal 1 2.00 4 8
1 2 4 8