X10 for Productivity and Performance at Scalex10.sourceforge.net/documentation/hpcc/x10-hpcc2012-slides.pdf · Benjamin Herta, Prabhanjan Kambadur, Vijay Saraswat, Avraham Shinnar,

© 2012 IBM Corporation

X10 for Productivity and Performance at Scale

2012 HPC Challenge Class 2

Olivier Tardieu, David Grove, Bard Bloom, David Cunningham, Benjamin Herta, Prabhanjan Kambadur, Vijay Saraswat,

Avraham Shinnar, Mikio Takeuchi, Mandana Vaziri

This material is based upon work supported by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0002.


X10

§  Programming language – 8 years of R&D by IBM Research with support from DARPA/HPCS (PERCS) – modern object-oriented language

•  evolution of Java •  strongly typed, memory safe (garbage collected), pre/postconditions, invariants •  constructs for concurrency and distribution

§  Focus on scale – HPC and Big Data – scalable productivity and performance

§  Open source – specification and implementation: http://x10-lang.org – doc: “A Brief Introduction to X10 for the High Performance Programmer”

§  Portable and interoperable – backend compilers: C++, Java, CUDA –  transports: sockets (IP & IB), PAMI, MPI, DCMF – architectures: Power (Linux, AIX), x86 (Linux, OSX, Windows), BlueGene/P

2


Programming Model: APGAS Asynchronous Partitioned Global Address Space

3

…

Ac%vi%es

Local Heap

Place 0

…

… …

Ac%vi%es

Local Heap

Place N

…

Global Reference

Concurrency •  async S •  finish S

Distribution •  at(P) S •  at(P) e

§  Two basic ideas: places and asynchrony


Scalable Productivity and Performance

§  Scalable APGAS – many activities

•  work-stealing scheduler – many places

•  finish specialization via static analysis, dynamic analysis & pragmas

§  High-performance interconnects – RDMAs

•  /* finish */ Array.asyncCopy(srcArray, dstArray); – collectives

•  Team.WORLD.barrier(role);

X10 delivers high productivity and high performance at peta scale


§  Compute Node – 32 Power7 cores 3.84 GHz – 128 GB memory – peak: 982 Gflops – Torrent interconnect

§  Drawer – 8 nodes

§  Rack – 8 to 12 drawers

§  Full System – up to 1740 compute nodes

•  from 1470 in initial submission – up to 55680 cores – up to 1.7 Petaflops

(1 Petaflops with 1024 compute nodes)

PERCS Prototype (Power 775)


Performance at Scale

cores perf. at scale parallel efficiency relative perf. X10 LOCs C LOCs

STREAM 55,680 397 TB/s 98% vs. 1 node 85% vs. class 1 60 0

FFT 32,768 27/35 Tflops 93% vs. 1 node 42% vs. class 1 213 23+FFTE

HPL 32,768 589 Tflops 80% vs. 1 core 80% vs. class 1 588 120+BLAS

RA 32,768 844 Gups 100% vs. 1 drawer 76% vs. class 1 143 0

UTS 55,680 596 B nodes/s 98% vs. 1 core >> MPI, UPC 493 137+SHA1

§  FFT – 27 TFlops with 1024 hosts – 35 TFlops with 32,768 cores spread across 1740 hosts – FFTE: 1094 C LOCs

§  UTS: Unbalanced Tree Search –  law: fixed geometric – at scale: 69,312,400,211,958 nodes in 116s. – SHA1: 505 C LOCS implementation


Performance Graphs

589231 22.4

18.0

16.00

18.00

20.00

22.00

24.00

0

200000

400000

600000

800000

0 16384 32768

Gflo

ps/p

lace

Gflo

ps

Places

G-HPL

396614

7.23

7.12

0

5

10

15

0

100000

200000

300000

400000

500000

0 27840 55680

GB

/s/p

lace

GB

/s

Places

EP Stream (Triad)

844

0.82 0.82

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

0

100

200

300

400

500

600

700

800

900

0 8192 16384 24576 32768

Gup

s/pl

ace

Gup

s

Places

G-RandomAccess

26958

0.88

0.82

0.00 0.20 0.40 0.60 0.80 1.00 1.20

0 5000

10000 15000 20000 25000 30000

0 16384 32768 G

flops

/pla

ce

Gflo

ps

Places

G-FFT

356344

596451 10.93 10.87

10.71

0.00

2.00

4.00

6.00

8.00

10.00

12.00

14.00

0

100000

200000

300000

400000

500000

600000

700000

0 13920 27840 41760 55680

Mill

ion

node

s/s/

plac

e

Mill

ion

node

s/s

Places

UTS


Unbalanced Tree Search

§  Problem statement – count nodes in randomly generated tree – separable random number generator => work can be relocated – highly unbalanced => need for distributed load balancing

§  Key insights –  lifeline-based global work stealing [PPoPP’11]

•  n random victims then p lifelines (hypercube) – compact work queue (for wide but shallow trees)

•  thief steals half of each work item –  finish only accounts for lifelines – sparse communication graph

•  bounded list of potential random victims •  finish trades contention for latency

X10 enables the rapid exploration of the design space leading to a scalable algorithm

Documents

X10 for Productivity and Performance at Scalex10.sourceforge.net/documentation/hpcc/x10-hpcc2012-slides.pdf · Benjamin Herta, Prabhanjan Kambadur, Vijay Saraswat, Avraham Shinnar,