Page 1

Algorithmic Challenges of Exascale Computing

Kathy Yelick

Associate Laboratory Director for Computing Sciences and Acting NERSC Center Director

Lawrence Berkeley National Laboratory

EECS Professor, UC Berkeley

Page 2

Obama for Communication-Avoiding Algorithms

“New Algorithm Improves Performance and Accuracy on Extreme-Scale Computing Systems. On modern computer architectures, communication between processors takes longer than the performance of a floating point arithmetic operation by a given processor. ASCR researchers have developed a new method, derived from commonly used linear algebra methods, to minimize communications between processors and the memory hierarchy, by reformulating the communication patterns specified within the algorithm. This method has been implemented in the TRILINOS framework, a highly-regarded suite of software, which provides functionality for researchers around the world to solve large scale, complex multi-physics problems.”

FY 2012 Congressional Budget Request, Volume 4, FY2010 Accomplishments, Advanced Scientific Computing Research (ASCR), pages 65-67.

Page 3

Energy Cost Challenge for Computing Facilities

At ~$1M per MW, energy costs are substantial
•  1 petaflop in 2010 uses 3 MW
•  1 exaflop in 2018 possible in 200 MW with “usual” scaling
•  1 exaflop in 2018 at 20 MW is the DOE target

[Figure: projected system power, 2005–2020, comparing “usual” scaling with the 20 MW goal]

Page 4

Measuring Efficiency

•  Race-to-Halt generally minimizes energy use
•  For scientific computing centers, the metric should be science output per Watt…
–  NERSC in 2010 ran at 450 publications per MW-year
–  But that number drops with each new machine
•  Next best: application performance per Watt
–  Newest, largest machine is best
–  Lower energy and cost per core

[Figure: cost in $ per core-hour (0.00–0.10), broken into center, sysadmin, and power & cooling components, for Old-HPC, Cluster, and New-HPC systems]

Page 5

New Processor Designs are Needed to Save Energy

•  Server processors have been designed for performance, not energy
–  Graphics processors are 10–100x more efficient
–  Embedded processors are 100–1000x
–  Need manycore chips with thousands of cores

Cell phone processor (0.1 Watt, 4 Gflop/s)

Server processor (100 Watts, 50 Gflop/s)

Page 6


The Amdahl Case for Heterogeneity

[Figure: asymmetric speedup (0–250) vs. size of the fat core in thin-core units (1–256), for F = 0.999, 0.99, 0.975, 0.9, 0.5; annotated at 256 cores, 193 cores, and 1 core]

F is the fraction of time spent in parallel work; 1 - F is serial.

Model: a chip with area for 256 thin cores, configured as up to 256 “thin” cores plus one “fat” core that uses some of the thin-core area (256 small cores vs. 1 fat core). Assumes the fat/thin speedup equals the square root of the area advantage.

Heterogeneity analysis by Mark Hill, U. Wisconsin
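The curves above come from a simple closed-form model. Below is a minimal sketch of that Amdahl-style analysis, assuming (as the slide does) a 256-thin-core-unit chip and a fat core whose speed is the square root of its area; the function name and the scan over fat-core sizes are illustrative, not part of the original analysis.

```python
import math

def asymmetric_speedup(f, n=256, r=1):
    """Estimated speedup of an asymmetric chip (Hill-Marty style model).

    f : fraction of time in parallel work (1 - f is serial)
    n : total chip area in thin-core units (256 on the slide)
    r : area of the single fat core, in thin-core units

    Assumes the fat core runs sqrt(r) times faster than a thin core; the
    serial fraction runs on the fat core alone, the parallel fraction on
    the fat core plus the remaining n - r thin cores.
    """
    perf_fat = math.sqrt(r)
    serial = (1.0 - f) / perf_fat
    parallel = f / (perf_fat + (n - r))
    return 1.0 / (serial + parallel)

# Reproduce the flavor of the plot: best fat-core size for each F.
for f in (0.999, 0.99, 0.975, 0.9, 0.5):
    best_r = max(range(1, 257), key=lambda r: asymmetric_speedup(f, 256, r))
    print(f"F={f}: best fat core ~{best_r} thin-core units, "
          f"speedup {asymmetric_speedup(f, 256, best_r):.0f}x")
```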

Page 7

New Processors Mean New Software

•  Exascale will have chips with thousands of tiny processor cores, and a few large ones
•  Architecture is an open question:
–  a sea of embedded cores with heavyweight “service” nodes
–  lightweight cores as accelerators to CPUs

[Figure: estimated system power by component (interconnect, memory, processors): ~130 MW with server processors vs. ~75 MW with manycore]

Page 8

Memory Capacity is Not Keeping Pace

Technology trends work against constant or increasing memory per core:
•  Memory density is doubling every three years; processor logic every two
•  Storage costs (dollars/MB) are dropping only gradually compared to logic costs

[Figure: Cost of Computation vs. Memory. Source: David Turek, IBM]

Question: Can you double concurrency without doubling memory?

Page 9

Why avoid communication?

•  Running time of an algorithm is the sum of 3 terms:
–  # flops × time_per_flop
–  # words moved / bandwidth
–  # messages × latency
•  time_per_flop << 1/bandwidth << latency
•  The gaps are growing exponentially with time [FOSC]
•  And these are hard to change: “Latency is physics, bandwidth is money”

Annual improvements:
–  time_per_flop: 59%
–  Bandwidth: network 26%, DRAM 23%
–  Latency: network 15%, DRAM 5%
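The three-term model above is easy to play with numerically. The sketch below is illustrative only; the constants are assumed, order-of-magnitude values chosen so that time_per_flop << 1/bandwidth << latency, not measurements from any particular machine.

```python
def model_time(n_flops, n_words, n_messages,
               time_per_flop=1e-10,  # ~10 Gflop/s per core (assumed)
               sec_per_word=1e-9,    # ~8 GB/s for 8-byte words (assumed)
               latency=1e-6):        # ~1 microsecond per message (assumed)
    """time = #flops * time_per_flop + #words / bandwidth + #messages * latency"""
    return (n_flops * time_per_flop
            + n_words * sec_per_word
            + n_messages * latency)

# 10x fewer words than flops and 100,000x fewer messages, yet the word term
# matches the flop term and the message term is still not negligible:
print(model_time(1e9, 1e8, 1e4))   # about 0.1 + 0.1 + 0.01 seconds
```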

Page 10

Bandwidth (to Memory and Remote Nodes) is an Energy Hog

[Figure: energy per operation in picojoules (log scale, 1–10,000), now vs. 2018, for on-chip/CMP communication, intranode/SMP communication, and internode/MPI communication]

Page 11

Value of Local Store Memory

•  Unit stride is as important as cache hits on hardware with prefetch
–  Don’t cut unit stride when tiling
•  Software-controlled memory (“scratchpad”) gives more control
–  May also matter for a new level of memory between DRAM and disk

[Figure: Cell STRIAD (64 KB concurrency): bandwidth in GB/s (0–30) vs. stanza size (16–2048) for 1–8 SPEs]

Joint work with Shoaib Kamil, Lenny Oliker, John Shalf, Kaushik Datta
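As a rough, software-level analogue of the STRIAD experiment, the sketch below times a triad a = b + s·c applied in unit-stride stanzas of varying length. This is only a numpy-level illustration with assumed sizes: at short stanza lengths the per-slice overhead plays the role that lost prefetch and startup latency play in hardware, so effective bandwidth drops, but the absolute numbers say nothing about the Cell results on the slide.

```python
import time
import numpy as np

def stanza_triad_gbs(stanza, gap=2048, n=1 << 23, trials=3):
    """Effective GB/s of a = b + s*c applied to unit-stride stanzas of
    length `stanza`, separated by gaps of `gap` untouched elements."""
    a = np.zeros(n); b = np.random.rand(n); c = np.random.rand(n); s = 3.0
    starts = list(range(0, n - stanza, stanza + gap))
    best = float("inf")
    for _ in range(trials):
        t0 = time.perf_counter()
        for i in starts:
            a[i:i + stanza] = b[i:i + stanza] + s * c[i:i + stanza]
        best = min(best, time.perf_counter() - t0)
    bytes_moved = 3 * 8 * stanza * len(starts)   # read b and c, write a
    return bytes_moved / best / 1e9

for stanza in (16, 64, 256, 1024, 4096):
    print(f"stanza {stanza:5d}: {stanza_triad_gbs(stanza):6.2f} GB/s")
```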

Page 12

New Processors Mean New Software

•  Exascale will have chips with thousands of tiny processor cores, and a few large ones
•  Architecture is an open question:
–  a sea of embedded cores with heavyweight “service” nodes
–  lightweight cores as accelerators to CPUs
•  Low-power memory and storage technology are key

[Figure: estimated system power by component (interconnect, memory, processors): ~130 MW with server processors, ~75 MW with manycore, ~25 MW with low-power memory and interconnect]

Page 13

Why Avoid Synchronization?

•  Processors do not run at the same speed
–  They never did, due to caches
–  Power / temperature management makes this worse (variations on the order of 60%)
•  HPC can’t turn this off
–  Power swings of 50% on some systems
–  At $3M/MW of capital costs, you don’t want 50% headroom

Page 14

Errors Can Turn into Performance Problems

•  Fault resilience introduces inhomogeneity in execution rates (error correction is not instantaneous)

Slide source: John Shalf

Page 15

Challenges to Exascale

1)  System power is the primary constraint
2)  Concurrency (1000x today)
3)  Memory bandwidth and capacity are not keeping pace
4)  Processor architecture is open, but likely heterogeneous
5)  Programming model: heroic compilers will not hide this
6)  Algorithms need to minimize data movement, not flops
7)  I/O bandwidth unlikely to keep pace with machine speed
8)  Resiliency critical at large scale (in time or processors)
9)  Bisection bandwidth limited by cost and energy

Unlike the last 20 years, most of these (1–7) are equally important across scales, e.g., for 1000 1-PF machines.

[Figure: performance growth over time]

Page 16

Algorithms to Optimize for Communication


Page 17

Avoiding Communication in Iterative Solvers

•  Consider sparse iterative methods for Ax=b
–  Krylov subspace methods: GMRES, CG, …
–  Can we lower the communication costs?
•  Latency of communication: reduce # messages by computing multiple reductions at once
•  Bandwidth to the memory hierarchy: compute Ax, A^2x, …, A^kx with one read of A
•  Solve time is dominated by:
–  Sparse matrix–vector multiply (SpMV), which even on one processor is dominated by “communication” time to read the matrix
–  Global collectives (reductions), which are latency-limited

Joint work with Jim Demmel, Mark Hoemmen, Marghoob Mohiyuddin

Page 18

[Figure: roofline plots for a Xeon X5550 (Nehalem) and an NVIDIA C2050 (Fermi): attainable Gflop/s (1–1024, log scale) vs. arithmetic intensity (1/32–32 flops/byte), with single- and double-precision peaks, a DP add-only ceiling, and the kernels SpMV, 7-point and 27-point stencils, RTM/wave eqn., GTC/chargei, GTC/pushi, and DGEMM placed against the rooflines]

Autotuning Gets Kernel Performance Near Optimal

•  Roofline model captures bandwidth and computation limits
•  Autotuning gets kernels near the roof

Work by Williams, Oliker, Shalf, Madduri, Kamil, Im, Ethier,…
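The roofline bound itself is just a minimum of two lines, as the small sketch below shows; the peak, bandwidth, and arithmetic-intensity numbers are assumed, illustrative values, not the measured figures behind the plots above.

```python
def roofline_gflops(ai, peak_gflops, stream_bw_gbs):
    """Attainable performance is capped by either the compute peak or by
    (arithmetic intensity in flops/byte) x (memory bandwidth in GB/s)."""
    return min(peak_gflops, ai * stream_bw_gbs)

# Illustrative machine: 85 Gflop/s DP peak, 35 GB/s sustained bandwidth (assumed).
for kernel, ai in [("SpMV", 0.17), ("7pt stencil", 0.5), ("DGEMM", 8.0)]:
    gf = roofline_gflops(ai, peak_gflops=85.0, stream_bw_gbs=35.0)
    print(f"{kernel:12s} AI={ai:5.2f} flops/byte -> <= {gf:5.1f} Gflop/s")
```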

Page 19

Communication-Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

•  Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
•  Idea: pick up the part of A and x that fits in fast memory and compute each of the k products
•  Example: A tridiagonal, n = 32, k = 3
•  Works for any “well-partitioned” A

[Figure: dependency diagram for computing A·x, A^2·x, A^3·x from x on a tridiagonal matrix with n = 32]
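A minimal sketch of the idea for the slide’s tridiagonal example: each block of rows pulls in k ghost layers of A and x once and then forms all k products locally, trading a little redundant arithmetic near the block edges for a single read of A (or a single exchange with neighbors in the parallel case). The partitioning and function name here are illustrative, not the authors’ implementation.

```python
import numpy as np

def matrix_powers_block(A, x, k, lo, hi):
    """Rows lo:hi of [A x, A^2 x, ..., A^k x] for a tridiagonal A, reading
    only the block of A and x it needs (plus k ghost rows on each side)."""
    n = A.shape[0]
    g_lo, g_hi = max(0, lo - k), min(n, hi + k)        # block plus k ghost layers
    A_loc = A[g_lo:g_hi, g_lo:g_hi]
    v = x[g_lo:g_hi].copy()
    cols = []
    for _ in range(k):
        v = A_loc @ v                                  # local (partly redundant) product
        cols.append(v[lo - g_lo : hi - g_lo].copy())   # interior rows stay exact
    return np.column_stack(cols)

# The slide's example: tridiagonal A, n = 32, k = 3, four blocks of 8 rows.
n, k = 32, 3
A = np.diag(2.0 * np.ones(n)) - np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1)
x = np.random.rand(n)
blocks = [matrix_powers_block(A, x, k, lo, lo + 8) for lo in range(0, n, 8)]
assert np.allclose(np.vstack(blocks)[:, -1], np.linalg.matrix_power(A, k) @ x)
```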

Page 20

Communication-Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

•  Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
•  Sequential algorithm
•  Example: A tridiagonal, n = 32, k = 3
•  Saves bandwidth (one read of A and x for k steps)
•  Saves latency (number of independent read events)

[Figure: the same diagram, with the blocks processed sequentially in Steps 1–4]

Page 21

Communication-Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

•  Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
•  Parallel algorithm
•  Example: A tridiagonal, n = 32, k = 3
•  Each processor communicates once with its neighbors

[Figure: the same diagram, partitioned across Proc 1–Proc 4]

Page 22

Communication-Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

•  Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
•  Parallel algorithm
•  Example: A tridiagonal, n = 32, k = 3
•  Each processor works on an (overlapping) trapezoid
•  Saves latency (# of messages), not bandwidth, but adds redundant computation

[Figure: the same diagram, with overlapping trapezoids for Proc 1–Proc 4]

Page 23

Matrix Powers Kernel on a General Matrix

•  Saves communication for “well-partitioned” matrices
•  Serial: O(1) moves of data vs. O(k)
•  Parallel: O(log p) messages vs. O(k log p)
•  For implicit memory management (caches), uses a TSP algorithm for data layout

Joint work with Jim Demmel, Mark Hoemmen, Marghoob Mohiyuddin

Page 24

Bigger Kernel (A^kx) Runs at Faster Speed than Simpler (Ax)

[Figure: speedups on an Intel Clovertown (8 cores)]

Jim Demmel, Mark Hoemmen, Marghoob Mohiyuddin, Kathy Yelick

Page 25

Minimizing Communication of GMRES to Solve Ax=b

•  GMRES: find x in span{b, Ab, …, A^kb} minimizing ||Ax - b||_2

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)              … SpMV
    MGS(w, v(0), …, v(i-1))     … Modified Gram–Schmidt
    update v(i), H
  endfor
  solve least-squares problem with H

Communication-avoiding GMRES:
  W = [v, Av, A^2v, …, A^kv]
  [Q, R] = TSQR(W)              … “Tall Skinny QR”
  build H from R
  solve least-squares problem with H

Sequential case: # words moved decreases by a factor of k
Parallel case: # messages decreases by a factor of k

•  Oops: W is built like the power method, so precision is lost!
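The “precision lost” problem is easy to reproduce: the monomial basis [v, Av, …, A^kv] behaves like the power method, so its columns all line up with the dominant eigenvector and the basis becomes numerically rank-deficient as k grows. A small, self-contained demonstration (with a stand-in diagonal matrix, not one of the talk’s test problems):

```python
import numpy as np

np.random.seed(0)
n = 200
A = np.diag(np.linspace(1.0, 1000.0, n))   # stand-in SPD matrix (assumed)
v = np.random.rand(n)

for k in (4, 8, 16, 24):
    W = np.empty((n, k + 1))
    W[:, 0] = v / np.linalg.norm(v)
    for j in range(k):
        W[:, j + 1] = A @ W[:, j]                   # monomial basis
        W[:, j + 1] /= np.linalg.norm(W[:, j + 1])  # scaling alone doesn't fix it
    print(f"k = {k:2d}:  cond(W) = {np.linalg.cond(W):.1e}")

# cond(W) grows rapidly with k, which is why CA-GMRES switches to a
# better-conditioned (e.g., Newton) basis.
```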

Page 26

TSQR: An Architecture-Dependent Algorithm

W is split into row blocks W0, W1, W2, W3; each block is factored locally and the resulting R factors are combined in a reduction tree:

•  Parallel: local QRs give R00, R10, R20, R30; pairs combine into R01, R11, then into R02 (a binary tree)
•  Sequential: R00 is combined with each successive block in turn, giving R01, R02, R03 (a running accumulation)
•  Dual core: a hybrid of the two

The reduction tree can be chosen dynamically. Multicore / multisocket / multirack / multisite / out-of-core: which tree is best is an open question.

Work by Laura Grigori, Jim Demmel, Mark Hoemmen, Julien Langou
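A minimal numpy sketch of the parallel (binary-tree) variant, computing only the R factor; in the real algorithm each leaf QR runs on a different processor or block of rows on disk, and the implicit Q is kept in tree form rather than reconstructed.

```python
import numpy as np

def tsqr_r(W, n_blocks=4):
    """R factor of a tall-skinny matrix W via local QRs on row blocks
    followed by a pairwise (binary) reduction tree of the R factors."""
    Rs = [np.linalg.qr(B, mode='r') for B in np.array_split(W, n_blocks)]
    while len(Rs) > 1:
        Rs = [np.linalg.qr(np.vstack(Rs[i:i + 2]), mode='r')
              for i in range(0, len(Rs), 2)]
    return Rs[0]

W = np.random.rand(100_000, 10)
R_tree = tsqr_r(W)
R_flat = np.linalg.qr(W, mode='r')
# R is unique up to the signs of its rows, so compare absolute values.
assert np.allclose(np.abs(R_tree), np.abs(R_flat))
```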

Page 27

TSQR Performance Results

•  Parallel
–  Intel Clovertown: up to 8x speedup (8 cores, dual socket, 10M x 10)
–  Pentium III cluster, Dolphin interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
–  BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
–  Grid: 4x on 4 cities (Dongarra et al.)
–  Cloud: early result, up and running using Mesos
•  Sequential
–  Out-of-core on a PowerPC laptop
•  As little as 2x slowdown vs. (predicted) infinite DRAM
•  LAPACK with virtual memory never finished

Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson

Page 28

Matrix Powers Kernel (and TSQR) in GMRES

[Figure: relative norm of the residual Ax - b (10^0 down to 10^-5) vs. iteration count (0–1000), comparing original GMRES, CA-GMRES with the monomial basis, and CA-GMRES with the Newton basis]

Page 29

Communication-Avoiding Krylov Method (GMRES)

Performance on 8 core Clovertown

Page 30

CA-Krylov Methods: Summary and Future Work

•  Communication avoidance works
–  Provably optimal
–  Faster in practice
•  Ongoing work for “hard” matrices
–  Matrices that partition poorly (high surface-to-volume ratio)
–  Idea: separate out dense rows (HSS matrices)
–  [Erin Carson, Nick Knight, Jim Demmel]

Page 31

Beyond Domain Decomposition: 2.5D Matrix Multiply

[Figure: 2.5D MM on BG/P (n = 65,536): percentage of machine peak vs. number of nodes (256–2048) for 2.5D Broadcast-MM, 2.5D Cannon-MM, 2D MM (Cannon), and ScaLAPACK PDGEMM, against a perfect strong scaling line]

•  Conventional “2D algorithms” use a P^(1/2) x P^(1/2) mesh and minimal memory
•  New “2.5D algorithms” use a (P/c)^(1/2) x (P/c)^(1/2) x c mesh and c-fold memory
•  Matmul sends c^(1/2) times fewer words (matches the lower bound)
•  Matmul sends c^(3/2) times fewer messages (matches the lower bound)

Work by Edgar Solomonik and Jim Demmel. Lesson: never waste fast memory.
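The headline factors fall straight out of the scalings quoted above. The sketch below evaluates the asymptotic per-processor word and message counts (constants dropped, problem size assumed) to show the c^(1/2) and c^(3/2) ratios; it is a back-of-the-envelope model, not the algorithm itself.

```python
import math

def matmul_comm_per_proc(n, p, c=1):
    """Asymptotic per-processor communication for n x n dense matmul on p
    processors with c-fold memory replication (c = 1 is the classic 2D
    algorithm). Constants are dropped; only the scalings matter."""
    words = n * n / math.sqrt(c * p)
    messages = math.sqrt(p / c**3)
    return words, messages

n, p, c = 65_536, 2048, 16
w2d, m2d = matmul_comm_per_proc(n, p, 1)
w25, m25 = matmul_comm_per_proc(n, p, c)
print(f"c = {c}: {w2d / w25:.0f}x fewer words, {m2d / m25:.0f}x fewer messages")
# -> factors of c^(1/2) = 4 and c^(3/2) = 64
```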

Page 32

Communication Avoidance in Multigrid: the Method of Local Corrections (MLC) for Poisson’s Equation

MLC uses domain decomposition, plus a noniterative form of multigrid, to represent the solution in a way that minimizes global communication and increases computational intensity.
•  Local domains fit in cache; local solves are computed using an FFT and a simplified version of the FMM. For 16^3–32^3 domains, this yields ~5 flops/byte.
•  The (global) coarse problem is small; its communication is comparable to a single relaxation step.
•  Weak scaling: 95% efficiency up to 1024 processors.

[Figure annotation: real analytic, with a rapidly convergent Taylor expansion]

Work by Phil Colella et al

Page 33

Avoiding Synchronization


Page 34

Avoid Synchronization from Applications

Computations as DAGs: view parallel executions as the directed acyclic graph of the computation.

[Figure: task DAGs for Cholesky and QR factorization of a 4 x 4 tile matrix]

Slide source: Jack Dongarra
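To make the DAG idea concrete, here is a small, hypothetical scheduler sketch: tasks start as soon as their own inputs are ready rather than waiting at a global barrier between phases. It only illustrates the execution model; real systems (PLASMA, and the UPC LU code later in the talk) use their own runtimes, and the task names below are made up.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter   # Python 3.9+

def run_dag(tasks, deps, workers=4):
    """Run a DAG of zero-argument tasks; each task waits only on its own
    prerequisites, never on a global barrier."""
    order = TopologicalSorter({t: deps.get(t, ()) for t in tasks}).static_order()
    futures = {}

    def make_job(fn, prereqs):
        def job():
            for p in prereqs:        # wait only on this task's inputs
                p.result()
            return fn()
        return job

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for name in order:           # submit in dependency order
            prereqs = [futures[d] for d in deps.get(name, ())]
            futures[name] = pool.submit(make_job(tasks[name], prereqs))
        return {name: f.result() for name, f in futures.items()}

# Toy panel/update dependence chain with made-up names:
out = run_dag(
    tasks={"panel0": lambda: "P0", "update01": lambda: "U01", "panel1": lambda: "P1"},
    deps={"update01": ["panel0"], "panel1": ["update01"]},
)
print(out)
```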

Page 35

Avoiding Synchronization in Communication

•  Two-sided message passing (e.g., MPI) requires matching a send with a receive to identify the memory address where the data should land
–  Wildly popular in HPC, but cumbersome in some applications
–  Couples data transfer with synchronization
•  Using a global address space decouples synchronization from data transfer
–  Pay for what you need!
–  Note: global addressing ≠ cache-coherent shared memory

[Figure: a one-sided put message carries an address and a data payload and is deposited directly into the target’s memory by the network interface; a two-sided message carries a message id and a data payload and must be matched by the host CPU]

Joint work with Dan Bonachea, Paul Hargrove, Rajesh Nishtala and rest of UPC group
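As a point of comparison only (the work on this slide uses UPC and GASNet, not MPI), the sketch below shows the same decoupling in MPI’s one-sided interface via mpi4py, assumed to be installed: the origin rank puts data directly into a window of the target’s memory, no matching receive is posted, and synchronization is a separate fence call.

```python
# Run with, e.g.: mpirun -n 2 python one_sided_put.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank exposes 16 doubles that other ranks may write directly.
win = MPI.Win.Allocate(16 * 8, comm=comm)
local = np.frombuffer(win.tomemory(), dtype=np.float64)
local[:] = 0.0

win.Fence()                       # open an access epoch (synchronization...)
if rank == 0:
    payload = np.full(16, 42.0)
    win.Put(payload, 1)           # one-sided put: rank 1 posts no receive
win.Fence()                       # ...is separate from the data transfer

if rank == 1:
    print("rank 1 sees:", local[:4])
```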

Page 36

Event Driven LU in UPC

•  Assignment of work is static; the schedule is dynamic
•  Ordering needs to be imposed on the schedule
–  Critical-path operation: panel factorization
•  General issue: dynamic scheduling in partitioned memory
–  Can deadlock in memory allocation
–  “Memory-constrained” lookahead

[Figure: task DAG for event-driven LU (some edges omitted)]

Page 37

DAG Scheduling Outperforms Bulk-Synchronous Style

[Figure: UPC vs. ScaLAPACK LU performance in GFlops on 2x4 and 4x4 processor grids]

The UPC LU factorization code adds cooperative (non-preemptive) threads for latency hiding
–  New problem in partitioned memory: allocator deadlock
–  Can run out of memory locally due to unlucky execution order

PLASMA on shared memory; UPC on partitioned memory
PLASMA by Dongarra et al.; UPC LU joint work with Parry Husbands

Page 38

Irregular vs. Regular Parallelism

•  Computations with known task graphs can be mapped to resources offline (before the computation starts)
–  Regular graphs: by a compiler (static) or a runtime (semi-static)
–  Irregular graphs: by a DAG scheduler
–  No need for online scheduling
•  If graphs are not known ahead of time (structure, task costs, communication costs), then dynamic scheduling is needed
–  Task stealing / task sharing
–  Demonstrated on shared memory
•  Conclusion: if your task graph is dynamic, the runtime needs to be; but what if it is static?

Page 39

Load Balancing with Locality

•  Locality is important:
–  When memory hierarchies are deep
–  When computational intensity is low
•  Most (all?) successful examples of applications and machines where locality matters use static scheduling
–  Unless they have an irregular/dynamic task graph
•  Two extremes are well studied
–  Dynamic parallelism without locality
–  Static parallelism (with threads = processors) with locality
•  Dynamic scheduling and locality control don’t mix
–  Locality control can cause a non-optimal task schedule, which can blow up memory use (breadth- vs. depth-first traversal)
–  Can run out of memory locally even when you don’t globally

Page 40

Use Two Programming Models to Match the Machine

Hybrid programming is key to saving memory (2011) and sometimes improves performance.

[Figure: run time (sec) and memory per node (GB) vs. cores per MPI process (1, 2, 3, 6, 12) for fvCAM on 240 cores of Jaguar and PARATEC on 768 cores of Jaguar]

Page 41

Getting the Best of Each

PGAS for locality and convenience
•  Global address space: directly read/write remote data
•  Partitioned: data is designated as local or global

[Figure: the PGAS memory model: processes p0, p1, …, pn each hold local data (l:) and globally addressable data (g:, e.g., x: 1, x: 5, x: 7, y: 0) in one global address space]

•  Affinity control
•  Scalability
•  Never say “receive”

The debate is around the control model:
•  Dynamic thread creation vs. SPMD
•  Hierarchical SPMD as a compromise: “think parallel” and group as needed

Page 42

Hierarchical SPMD in Titanium

•  Thread teams may execute distinct tasks:

  partition(T) {
    { model_fluid(); }
    { model_muscles(); }
    { model_electrical(); }
  }

•  Hierarchy for machine / tasks
–  Nearby: access shared data
–  Far away: copy data
•  Advantages:
–  Provable pointer
–  Mixed data / task style
–  Lexical scope prevents some deadlocks

[Figure: team hierarchy A with subteams B, C, D, and pointer spans: span 1 (core local), span 2 (processor local), span 3 (node local), span 4 (global)]

Page 43

Hierarchical Programming Model: Phalanx

•  Invoke functions on a set of cores and a set of memories
•  Hierarchy of memories
–  Can query to get (some) aspects of the hierarchical structure
•  Functionally homogeneous cores (on Echelon)
–  Can query to get (performance) properties of cores
•  Hierarchy of thread blocks
–  May be aligned with hardware based on queries

[Figure: a hierarchy of shared memories above groups of processor + memory pairs]

Echelon ProgSys Team: Michael Garland, Alex Aiken, Brad Chamberlain, Mary Hall, Greg Titus, Kathy Yelick

Page 44

Stepping Back

•  Communication avoidance is as old as tiling
•  Communication optimality is as old as Hong/Kung
•  What’s new?
–  Raising the level of abstraction at which we optimize
–  BLAS2 → BLAS3 → LU, or SpMV/DOT → Krylov
–  Changing numerics in non-trivial ways
–  Rethinking methods to models
•  Communication and synchronization avoidance
•  Software engineering: breaking abstraction
•  Compilers: inter-procedural optimizations

Page 45

Exascale Views

•  “Exascale” is about continuing growth in computing performance for science
–  Energy efficiency is key
–  Job size is irrelevant
•  Success means:
–  Influencing the market: HPC, technical computing, clouds, general purpose
–  Getting more science from data and computing
•  Failure means:
–  A few big machines for a few big applications
•  Not all computing problems are exascale, but they should all be exascale-technology aware

Page 46

Thank You!
