Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
© 2012 IBM Corporation
X10 for Productivity and Performance at Scale
2012 HPC Challenge Class 2
Olivier Tardieu, David Grove, Bard Bloom, David Cunningham, Benjamin Herta, Prabhanjan Kambadur, Vijay Saraswat,
Avraham Shinnar, Mikio Takeuchi, Mandana Vaziri
This material is based upon work supported by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0002.
© 2012 IBM Corporation
X10
§ Programming language – 8 years of R&D by IBM Research with support from DARPA/HPCS (PERCS) – modern object-oriented language
• evolution of Java • strongly typed, memory safe (garbage collected), pre/postconditions, invariants • constructs for concurrency and distribution
§ Focus on scale – HPC and Big Data – scalable productivity and performance
§ Open source – specification and implementation: http://x10-lang.org – doc: “A Brief Introduction to X10 for the High Performance Programmer”
§ Portable and interoperable – backend compilers: C++, Java, CUDA – transports: sockets (IP & IB), PAMI, MPI, DCMF – architectures: Power (Linux, AIX), x86 (Linux, OSX, Windows), BlueGene/P
2
© 2012 IBM Corporation
Programming Model: APGAS Asynchronous Partitioned Global Address Space
3
…
Ac%vi%es
Local Heap
Place 0
…
… …
Ac%vi%es
Local Heap
Place N
…
Global Reference
Concurrency • async S • finish S
Distribution • at(P) S • at(P) e
§ Two basic ideas: places and asynchrony
© 2012 IBM Corporation
Scalable Productivity and Performance
§ Scalable APGAS – many activities
• work-stealing scheduler – many places
• finish specialization via static analysis, dynamic analysis & pragmas
§ High-performance interconnects – RDMAs
• /* finish */ Array.asyncCopy(srcArray, dstArray); – collectives
• Team.WORLD.barrier(role);
X10 delivers high productivity and high performance at peta scale
© 2012 IBM Corporation
§ Compute Node – 32 Power7 cores 3.84 GHz – 128 GB memory – peak: 982 Gflops – Torrent interconnect
§ Drawer – 8 nodes
§ Rack – 8 to 12 drawers
§ Full System – up to 1740 compute nodes
• from 1470 in initial submission – up to 55680 cores – up to 1.7 Petaflops
(1 Petaflops with 1024 compute nodes)
PERCS Prototype (Power 775)
© 2012 IBM Corporation
Performance at Scale
cores perf. at scale parallel efficiency relative perf. X10 LOCs C LOCs
STREAM 55,680 397 TB/s 98% vs. 1 node 85% vs. class 1 60 0
FFT 32,768 27/35 Tflops 93% vs. 1 node 42% vs. class 1 213 23+FFTE
HPL 32,768 589 Tflops 80% vs. 1 core 80% vs. class 1 588 120+BLAS
RA 32,768 844 Gups 100% vs. 1 drawer 76% vs. class 1 143 0
UTS 55,680 596 B nodes/s 98% vs. 1 core >> MPI, UPC 493 137+SHA1
§ FFT – 27 TFlops with 1024 hosts – 35 TFlops with 32,768 cores spread across 1740 hosts – FFTE: 1094 C LOCs
§ UTS: Unbalanced Tree Search – law: fixed geometric – at scale: 69,312,400,211,958 nodes in 116s. – SHA1: 505 C LOCS implementation
© 2012 IBM Corporation
Performance Graphs
589231 22.4
18.0
16.00
18.00
20.00
22.00
24.00
0
200000
400000
600000
800000
0 16384 32768
Gflo
ps/p
lace
Gflo
ps
Places
G-HPL
396614
7.23
7.12
0
5
10
15
0
100000
200000
300000
400000
500000
0 27840 55680
GB
/s/p
lace
GB
/s
Places
EP Stream (Triad)
844
0.82 0.82
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
0
100
200
300
400
500
600
700
800
900
0 8192 16384 24576 32768
Gup
s/pl
ace
Gup
s
Places
G-RandomAccess
26958
0.88
0.82
0.00 0.20 0.40 0.60 0.80 1.00 1.20
0 5000
10000 15000 20000 25000 30000
0 16384 32768 G
flops
/pla
ce
Gflo
ps
Places
G-FFT
356344
596451 10.93 10.87
10.71
0.00
2.00
4.00
6.00
8.00
10.00
12.00
14.00
0
100000
200000
300000
400000
500000
600000
700000
0 13920 27840 41760 55680
Mill
ion
node
s/s/
plac
e
Mill
ion
node
s/s
Places
UTS
© 2012 IBM Corporation
Unbalanced Tree Search
§ Problem statement – count nodes in randomly generated tree – separable random number generator => work can be relocated – highly unbalanced => need for distributed load balancing
§ Key insights – lifeline-based global work stealing [PPoPP’11]
• n random victims then p lifelines (hypercube) – compact work queue (for wide but shallow trees)
• thief steals half of each work item – finish only accounts for lifelines – sparse communication graph
• bounded list of potential random victims • finish trades contention for latency
X10 enables the rapid exploration of the design space leading to a scalable algorithm