Compiling to Avoid Communication
Kathy Yelick
Associate Laboratory Director for Computing Sciences, Lawrence Berkeley National Laboratory
EECS Professor, UC Berkeley
NERSC Systems and Software Have to Support All of Science
NERSC computing for science
• 4500 users, 500 projects
• ~65% from universities, 30% from labs
• 1500 publications per year!
Systems designed for science
• 1.3 petaflop Cray system, Hopper
• 8 PB filesystem; 250 PB archive
• Several systems for genomics, astronomy, visualization, etc.
~650 applications use these programming models
• 75% Fortran, 45% C/C++, 10% Python
• 85% MPI, 25% with OpenMP
• 10% PGAS or global objects
These are self-reported, likely low
Computational Science has Moved through Difficult Technology Transitions
[Figure: Application Performance Growth (Gordon Bell Prizes), 1990–2020, rising from about 1E+08 to 1E+18 flop/s]
3
Attack of the “killer micros”
Attack of the “killer cellphones”?
The rest of the computing world gets parallelism
Essentially, all models are wrong, but some are useful.
-- George E. Box, Statistician
4
Computing Performance Improvements will be Harder than Ever
[Figure: three panels, each adding one series — Transistors (thousands), Frequency (MHz), Power (W), and Cores — for microprocessors, 1970–2010]
Moore’s Law continues, but power limits performance growth. Parallelism is used instead.
5
Energy Efficient Computing is Key to Performance Growth
[Figure: projected system power, 2005–2020 — usual scaling vs. goal]
6
At $1M per MW, energy costs are substantial
• 1 petaflop in 2010 used 3 MW
• 1 exaflop in 2018 would use 100+ MW with “Moore’s Law” scaling
The problem doesn’t change if we build 1,000 one-petaflop machines instead of one exaflop machine. It affects every university department cluster and cloud data center.
Communication is Expensive in Time
Communication cost has two components:
• Bandwidth: # of words moved / bandwidth
• Latency: # of messages * latency
“Communication” includes both the memory hierarchy and the network. Alas, things are bad and getting worse:
flop_time << 1/bandwidth << latency
Annual improvements [FOSC]:
• Flop time: 59%
• Network: bandwidth 26%, latency 15%
• DRAM: bandwidth 23%, latency 5%
Hard to change: Latency is physics; bandwidth is money!
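To make this model concrete, here is a minimal sketch (an illustration, not from the talk) of the usual latency/bandwidth cost estimate; the machine parameters in main are placeholders.

#include <stdio.h>

/* Communication cost model: time = #messages * latency + #words / bandwidth. */
double comm_time(long messages, long words, double latency_sec, double words_per_sec)
{
    return messages * latency_sec + words / words_per_sec;
}

int main(void)
{
    /* Example: 100 messages carrying 1e6 words total (placeholder parameters). */
    printf("estimated communication time: %g s\n",
           comm_time(100, 1000000, 1.0e-6, 1.0e9));
    return 0;
}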
7
Communication is Expensive in Energy
[Figure: picojoules per operation, now vs. 2018, for on-chip/CMP communication, intranode/SMP communication, and intranode/MPI communication (1 to 10,000 pJ, log scale)]
Communication (any off-chip data access, including to DRAM) is expensive
The Memory Wall Swamp
9
Multicore didn’t cause this, but kept the bandwidth gap growing.
Obama for Communication-Avoiding Algorithms
“New Algorithm Improves Performance and Accuracy on Extreme-Scale Computing Systems. On modern computer architectures, communication between processors takes longer than the performance of a floating point arithmetic operation by a given processor. ASCR researchers have developed a new method, derived from commonly used linear algebra methods, to minimize communications between processors and the memory hierarchy, by reformulating the communication patterns specified within the algorithm. This method has been implemented in the TRILINOS framework, a highly-regarded suite of software, which provides functionality for researchers around the world to solve large scale, complex multi-physics problems.”
FY 2012 Congressional Budget Request, Volume 4, FY2010 Accomplishments, Advanced Scientific Computing Research (ASCR), pages 65-67.
10
Lesson #1: Compress Data Structures
This is a hard one for compilers....
11
Compression of Metadata is Important
12
Select Register Block size r and c
(1) Estimate Fill(r,c):
Fill(r,c) = (#true nonzeros + #explicit zeros filled in) / #true nonzeros
(2) Select the register block RB(r,c) with the highest Effective_Mflops(r,c):
Effective_Mflops(r,c) = Mflops(r,c) / Fill(r,c)
Extra flops, but it may still finish sooner.
This is a hard problem for compilers on sparse matrices!
Offline Analysis to Estimate Mflop/sec
13
Estimate Mflop/s rate using a dense matrix in sparse format
Dense Mflops(r,c) / Tsopf Fill(r,c) = Effective_Mflops(r,c)
Selected RB(5x7) using a sample dense matrix
[Figure: heatmaps plotted over r (number of rows) and c (number of columns)]
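The selection heuristic above can be sketched directly (my own illustration, not the actual auto-tuner; the CSR field names and the dense_mflops profile table are assumptions, and a real tuner would estimate fill from only a sample of the rows):

#include <stdlib.h>
#include <string.h>

/* CSR sparse matrix; field names are illustrative, not from a specific library. */
typedef struct {
    int n;        /* square: n rows and n columns */
    int *rowptr;  /* length n+1 */
    int *col;     /* length rowptr[n]; assumes rowptr[n] > 0 */
} csr_t;

/* Estimate Fill(r,c) = (#true nonzeros + #explicit zeros filled in) / #true nonzeros
   by counting how many r-by-c register blocks the nonzeros touch. */
double estimate_fill(const csr_t *A, int r, int c)
{
    int nblk_cols = (A->n + c - 1) / c;
    char *touched = malloc(nblk_cols);
    long blocks = 0, nnz = A->rowptr[A->n];
    for (int ib = 0; ib * r < A->n; ib++) {              /* each block row */
        memset(touched, 0, nblk_cols);
        int lo = ib * r, hi = lo + r < A->n ? lo + r : A->n;
        for (int i = lo; i < hi; i++)
            for (int k = A->rowptr[i]; k < A->rowptr[i + 1]; k++)
                touched[A->col[k] / c] = 1;
        for (int jb = 0; jb < nblk_cols; jb++) blocks += touched[jb];
    }
    free(touched);
    return (double)(blocks * r * c) / (double)nnz;       /* stored values / true nonzeros */
}

/* Pick (r,c) maximizing Effective_Mflops = dense_mflops[r][c] / Fill(r,c).
   dense_mflops comes from an offline benchmark of a dense matrix stored in
   blocked-sparse format; here it is just an input table. */
void select_block(const csr_t *A, double dense_mflops[8][8], int *rbest, int *cbest)
{
    double best = 0.0;
    *rbest = 1; *cbest = 1;
    for (int r = 1; r <= 8; r++)
        for (int c = 1; c <= 8; c++) {
            double eff = dense_mflops[r - 1][c - 1] / estimate_fill(A, r, c);
            if (eff > best) { best = eff; *rbest = r; *cbest = c; }
        }
}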
14
Auto-tuned SpMV Performance
[Figure: SpMV performance with optimizations applied cumulatively — Naïve, Naïve Pthreads, +NUMA/Affinity, +SW Prefetching, +Matrix Compression, +Cache/LS/TLB Blocking]
Results from Williams, Shalf, Oliker, Yelick
15
Williams’ Roofline Performance Model
[Figure: roofline for a generic machine — attainable Gflop/s (0.5 to 256) vs. actual flop:byte ratio (1/8 to 16), with ceilings for peak DP, mul/add imbalance, without SIMD, and without ILP]
Generic Machine
• Flat top is the flop limit
• Slanted part is the bandwidth limit
• X-axis is the arithmetic intensity of the algorithm
• Bandwidth-reducing optimizations, such as better cache re-use (tiling) and compression, improve intensity
Model by Sam Williams
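As a concrete reading of the model (a sketch, not Williams’ roofline tool; the peak and bandwidth numbers below are placeholders), attainable performance is simply the minimum of the compute peak and bandwidth times arithmetic intensity:

#include <stdio.h>

/* Roofline: attainable Gflop/s = min(peak Gflop/s, bandwidth GB/s * flops per byte). */
double roofline(double peak_gflops, double bw_gbytes_per_s, double intensity)
{
    double bw_bound = bw_gbytes_per_s * intensity;
    return bw_bound < peak_gflops ? bw_bound : peak_gflops;
}

int main(void)
{
    double peak = 85.3, bw = 25.6;                /* placeholder machine numbers */
    for (double ai = 0.125; ai <= 16.0; ai *= 2)  /* sweep arithmetic intensity */
        printf("intensity %6.3f flop/byte -> %6.1f Gflop/s attainable\n",
               ai, roofline(peak, bw, ai));
    return 0;
}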
[Figure: rooflines for Xeon X5550 (Nehalem) and NVIDIA C2050 (Fermi) — attainable Gflop/s (2 to 1024) vs. arithmetic intensity (1/32 to 32), with single-precision peak, double-precision peak, and DP add-only ceilings; kernels plotted: SpMV, 7pt stencil, 27pt stencil, GTC/chargei, GTC/pushi, RTM/wave eqn., and DGEMM]
Autotuning Gets Kernel Performance Near Optimal
• Roofline model captures bandwidth and computation limits
• Autotuning gets kernels near the roof
Work by Williams, Oliker, Shalf, Madduri, Kamil, Im, Ethier, …
Lesson #2: Target Higher Level Loops
Harder than inner loops....
17
Avoiding Communication in Iterative Solvers
• Consider sparse iterative methods for Ax=b
  – Krylov subspace methods: GMRES, CG, …
  – Can we lower the communication costs?
    • Latency of communication, i.e., reduce the # of messages by computing multiple reductions at once
    • Bandwidth to the memory hierarchy, i.e., compute Ax, A²x, …, Aᵏx with one read of A
• Solve time dominated by:
  – Sparse matrix-vector multiply (SpMV)
    • Even on one processor, dominated by “communication” time to read the matrix
  – Global collectives (reductions)
    • Global, latency-limited
Joint work with Jim Demmel, Mark Hoemmen, Marghoob Mohiyuddin
Communication-Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Aᵏx]
[Diagram: x and the entries of A·x, A²·x, A³·x laid out over columns 1 … 32 of a tridiagonal A]
• Replace k iterations of y = A·x with [Ax, A²x, …, Aᵏx]
• Idea: pick up the part of A and x that fits in fast memory, compute each of the k products
• Example: A tridiagonal, n=32, k=3
• Works for any “well-partitioned” A
19
Communication-Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Aᵏx] — Sequential Algorithm
[Diagram: the same tridiagonal example, with the work divided into Step 1 … Step 4, each step reading one block of A and x]
• Replace k iterations of y = A·x with [Ax, A²x, …, Aᵏx]
• Example: A tridiagonal, n=32, k=3
• Saves bandwidth (one read of A and x for k steps)
• Saves latency (number of independent read events)
20
Communication-Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Aᵏx] — Parallel Algorithm
[Diagram: the same tridiagonal example, partitioned across Proc 1 … Proc 4]
• Replace k iterations of y = A·x with [Ax, A²x, …, Aᵏx]
• Example: A tridiagonal, n=32, k=3
• Each processor communicates once with its neighbors
21
Communication-Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Aᵏx] — Parallel Algorithm
[Diagram: the same tridiagonal example; Proc 1 … Proc 4 each work on an overlapping trapezoid]
• Replace k iterations of y = A·x with [Ax, A²x, …, Aᵏx]
• Example: A tridiagonal, n=32, k=3
• Each processor works on an (overlapping) trapezoid
• Saves latency (# of messages), not bandwidth, but adds redundant computation
22
Matrix Powers Kernel on a General Matrix
• Saves communication for “well-partitioned” matrices
• Serial: O(1) data moves vs. O(k)
• Parallel: O(log p) messages vs. O(k log p)
23
Joint work with Jim Demmel, Mark Hoemman, Marghoob Mohiyuddin
For implicit memory management (caches), a TSP algorithm is used for the data layout
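Here is a minimal sequential sketch of the kernel for the tridiagonal example above (my illustration, not the actual implementation, which handles general well-partitioned matrices; this sketch also uses the overlapping, redundant-work variant even sequentially, for simplicity). Each cache-sized block of x is extended by k ghost entries on each side and all k products are computed from that one read; in the parallel version that ghost region is the single message exchanged with each neighbor.

#include <stdlib.h>

/* Tridiagonal matvec on index range [lo, hi): y[i] = sub[i]*x[i-1] + diag[i]*x[i] + sup[i]*x[i+1]. */
static void tri_matvec(int n, const double *sub, const double *diag, const double *sup,
                       const double *x, double *y, int lo, int hi)
{
    for (int i = lo; i < hi; i++) {
        double s = diag[i] * x[i];
        if (i > 0)     s += sub[i] * x[i - 1];
        if (i < n - 1) s += sup[i] * x[i + 1];
        y[i] = s;
    }
}

/* Matrix powers kernel sketch: fill V[j][i] = (A^j x)[i] for j = 1..k, processing one
   block of x at a time so each block of A and x is read once.  For the block that owns
   [blo, bhi), level j is valid on [blo-(k-j), bhi+(k-j)); only the owned range is kept. */
void matrix_powers(int n, int k, int block,
                   const double *sub, const double *diag, const double *sup,
                   const double *x, double **V /* V[1..k], each of length n */)
{
    double *prev = malloc(n * sizeof *prev), *next = malloc(n * sizeof *next);
    for (int blo = 0; blo < n; blo += block) {
        int bhi = blo + block < n ? blo + block : n;
        int glo = blo - k > 0 ? blo - k : 0;              /* ghost region: k extra entries */
        int ghi = bhi + k < n ? bhi + k : n;              /* on each side of the owned block */
        for (int i = glo; i < ghi; i++) prev[i] = x[i];
        for (int j = 1; j <= k; j++) {
            int lo = blo - (k - j) > 0 ? blo - (k - j) : 0;
            int hi = bhi + (k - j) < n ? bhi + (k - j) : n;
            tri_matvec(n, sub, diag, sup, prev, next, lo, hi);
            for (int i = blo; i < bhi; i++) V[j][i] = next[i];   /* keep only the owned part */
            double *t = prev; prev = next; next = t;
        }
    }
    free(prev); free(next);
}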
Bigger Kernel (Aᵏx) Runs at a Faster Speed than the Simpler Kernel (Ax)
Speedups on Intel Clovertown (8 core)
Jim Demmel, Mark Hoemmen, Marghoob Mohiyuddin, Kathy Yelick
Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, Aᵏb} minimizing ||Ax − b||₂
Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)   … SpMV
    MGS(w, v(0), …, v(i-1))
    update v(i), H
  endfor
  solve LSQ problem with H
Communication-avoiding GMRES:
  W = [v, Av, A²v, …, Aᵏv]
  [Q, R] = TSQR(W)   … “Tall Skinny QR”
  build H from R
  solve LSQ problem with H
Sequential case: # of words moved decreases by a factor of k
Parallel case: # of messages decreases by a factor of k
• Oops – W from power method, precision lost!
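The TSQR reduction structure can be sketched as follows (an illustration of the idea only, not the implementation used for these results; qr_factor is a deliberately naive stand-in for a LAPACK-style local QR, and only the final R factor is returned here):

#include <mpi.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

/* Small dense QR via Householder reflections (column-major, leading dim = rows).
   Overwrites 'a'; writes the m x m upper-triangular R into 'r' (column-major).
   A toy stand-in for a tuned local QR, just to keep the sketch self-contained. */
static void qr_factor(double *a, int rows, int m, double *r)
{
    for (int j = 0; j < m; j++) {
        double norm = 0.0;
        for (int i = j; i < rows; i++) norm += a[j*rows+i] * a[j*rows+i];
        norm = sqrt(norm);
        double alpha = (a[j*rows+j] >= 0.0) ? -norm : norm;
        a[j*rows+j] -= alpha;                     /* column j now holds the Householder vector v */
        double vtv = 0.0;
        for (int i = j; i < rows; i++) vtv += a[j*rows+i] * a[j*rows+i];
        if (vtv > 0.0)
            for (int k = j + 1; k < m; k++) {     /* apply I - 2vv^T/(v^Tv) to trailing columns */
                double dot = 0.0;
                for (int i = j; i < rows; i++) dot += a[j*rows+i] * a[k*rows+i];
                double beta = 2.0 * dot / vtv;
                for (int i = j; i < rows; i++) a[k*rows+i] -= beta * a[j*rows+i];
            }
        a[j*rows+j] = alpha;                      /* R's diagonal entry */
    }
    for (int j = 0; j < m; j++)
        for (int i = 0; i < m; i++)
            r[j*m+i] = (i <= j) ? a[j*rows+i] : 0.0;
}

/* TSQR sketch: each rank owns a (local_rows x m) block of the tall-skinny matrix W.
   One local QR, then a binary reduction tree over m x m R factors: O(log P) messages
   instead of one reduction per column. Rank 0 ends up with the final R. */
void tsqr(double *w_local, int local_rows, int m, double *r_out, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    double *r       = malloc((size_t)m * m * sizeof *r);
    double *r_other = malloc((size_t)m * m * sizeof *r_other);
    double *stacked = malloc((size_t)2 * m * m * sizeof *stacked);

    qr_factor(w_local, local_rows, m, r);                    /* local factorization */

    for (int dist = 1; dist < p; dist *= 2) {                /* binary reduction tree */
        if (rank % (2 * dist) != 0) {
            MPI_Send(r, m * m, MPI_DOUBLE, rank - dist, 0, comm);
            break;                                           /* this rank's R has been merged */
        }
        if (rank + dist < p) {
            MPI_Recv(r_other, m * m, MPI_DOUBLE, rank + dist, 0, comm, MPI_STATUS_IGNORE);
            for (int j = 0; j < m; j++) {                    /* stack the two R factors */
                memcpy(stacked + (size_t)j*2*m,     r       + (size_t)j*m, m * sizeof(double));
                memcpy(stacked + (size_t)j*2*m + m, r_other + (size_t)j*m, m * sizeof(double));
            }
            qr_factor(stacked, 2 * m, m, r);                 /* QR of the 2m x m stack */
        }
    }
    if (rank == 0) memcpy(r_out, r, (size_t)m * m * sizeof(double));
    free(r); free(r_other); free(stacked);
}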
25
Matrix Powers Kernel (and TSQR) in GMRES
26
[Figure: relative norm of the residual ||Ax − b|| (10⁻⁵ to 10⁰) vs. iteration count (0 to 1000) for Original GMRES, CA-GMRES (Monomial basis), and CA-GMRES (Newton basis)]
Communication-Avoiding Krylov Method (GMRES)
Performance on 8 core Clovertown
27
CA-Krylov Methods Summary and Future Work
• Communication avoidance works
  – Provably optimal
  – Faster in practice
• Ongoing work on preconditioning
  – Handle “hard” matrices that partition poorly (high surface-to-volume ratio)
  – Idea: separate out dense rows (HSS matrices)
  – [Erin Carson, Nick Knight, Jim Demmel]
28
Lesson #3: Understand Numerics (Or work with someone who does)
Don’t be afraid to change to a “different” right answer
29
Lesson #4: Never Waste Fast Memory
Don’t get hung up on the “owner computes” rule.
30
Beyond Domain Decomposition: 2.5D Matrix Multiply
[Figure: 2.5D MM on BG/P (n=65,536) — percentage of machine peak (0–100%) vs. #nodes (256, 512, 1024, 2048) for 2.5D Broadcast-MM, 2.5D Cannon-MM, 2D MM (Cannon), and ScaLAPACK PDGEMM, with a perfect strong scaling reference]
• Conventional “2D algorithms” use a P^(1/2) × P^(1/2) mesh and minimal memory
• New “2.5D algorithms” use a (P/c)^(1/2) × (P/c)^(1/2) × c mesh and c-fold memory
• Matmul sends c^(1/2) times fewer words – matches the lower bound
• Matmul sends c^(3/2) times fewer messages – matches the lower bound
Work by Edgar Solomonik and Jim Demmel
Surprises:
• Even matrix multiply had room for improvement
• Idea: make copies of the C matrix (as in the prior 3D algorithm, but not as many)
• Result is provably optimal in communication
Lesson: never waste fast memory. Can we generalize this for compiler writers?
31
Deconstructing 2.5D Matrix Multiply (Solomonik & Demmel)
[Diagram: the 3D (i, j, k) iteration space of matrix multiply, tiled into blocks along x, y, and z]
• Tiling the iteration space
• 2D algorithm: never chop the k dimension
• 2.5D or 3D: assume + is associative; chop k, which amounts to replication of the C matrix
Matrix multiplication code has a 3D iteration space; each point in the space is a constant amount of computation (* and +):
for i
  for j
    for k
      C[i,j] … A[i,k] … B[k,j] …
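A minimal sequential sketch of the “chop k” idea (my illustration of the concept, not the 2.5D code run on BG/P): split the k dimension into layers, let each layer accumulate into its own copy of C, and reduce the copies at the end — exactly the freedom that associativity of + provides. In the parallel 2.5D algorithm each copy lives on a different processor layer.

#include <stdlib.h>
#include <string.h>

/* C = A*B for n x n row-major matrices, with the k loop chopped into 'layers' chunks.
   Each chunk sums into its own copy of C and the copies are reduced at the end --
   the replication of C used by 2.5D/3D algorithms. */
void matmul_chop_k(int n, int layers, const double *A, const double *B, double *C)
{
    double *copies = calloc((size_t)layers * n * n, sizeof *copies);
    int chunk = (n + layers - 1) / layers;

    for (int l = 0; l < layers; l++) {               /* one copy of C per k-chunk */
        double *Cl = copies + (size_t)l * n * n;
        int k0 = l * chunk, k1 = k0 + chunk < n ? k0 + chunk : n;
        for (int i = 0; i < n; i++)
            for (int k = k0; k < k1; k++) {
                double aik = A[i * n + k];
                for (int j = 0; j < n; j++)
                    Cl[i * n + j] += aik * B[k * n + j];
            }
    }

    memset(C, 0, (size_t)n * n * sizeof *C);         /* reduce the copies into C */
    for (int l = 0; l < layers; l++)
        for (size_t t = 0; t < (size_t)n * n; t++)
            C[t] += copies[(size_t)l * n * n + t];

    free(copies);
}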
Lower Bound Idea on C = A·B (Irony, Toledo, Tiskin)
33
[Diagram: a box with side lengths x, y, z inside the 3D (i, j, k) iteration space, with its “A shadow” and “C shadow” projections]
# of cubes in a box with side lengths x, y, and z
  = volume of the box = x·y·z = (xz · zy · yx)^(1/2) = (#A squares · #B squares · #C squares)^(1/2)
(i,k) is in the “A shadow” if (i,j,k) is in the 3D set
(j,k) is in the “B shadow” if (i,j,k) is in the 3D set
(i,j) is in the “C shadow” if (i,j,k) is in the 3D set
Thm (Loomis & Whitney, 1949): # of cubes in the 3D set = volume of the 3D set ≤ (area(A shadow) · area(B shadow) · area(C shadow))^(1/2)
Lesson #5: Understand Theory
Lower bounds help identify optimizations.
34
Traditional (Naïve n²) N-body Algorithm (using a 1D decomposition)
• Given n particles and p processors, size M memory
• Each processor has n/p particles
• Algorithm: shift a copy of the particles to the left p times, calculating all pairwise forces
• Computation cost: n²/p
• Communication cost: O(p) messages, O(n) words
[Diagram: p processors in a row, each holding n/p particles]
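A minimal MPI sketch of the shift-based algorithm described above (an illustration under simplifying assumptions: 1D positions, a toy inverse-square “force”, and n divisible by p; not the code behind the results reported later):

#include <mpi.h>
#include <stdlib.h>

/* Naive 1D-decomposition n-body step: each rank owns nloc particles and shifts a
   traveling copy around the ring p times, accumulating the force from every other
   particle: O(p) messages and O(n) words per rank. */
void nbody_forces(int nloc, const double *pos, double *force, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int left = (rank + p - 1) % p, right = (rank + 1) % p;

    double *travel   = malloc(nloc * sizeof *travel);
    double *incoming = malloc(nloc * sizeof *incoming);
    for (int i = 0; i < nloc; i++) { travel[i] = pos[i]; force[i] = 0.0; }

    for (int s = 0; s < p; s++) {
        /* Interact the owned particles with the current traveling block. */
        for (int i = 0; i < nloc; i++)
            for (int j = 0; j < nloc; j++) {
                double d = pos[i] - travel[j];
                if (d != 0.0)
                    force[i] += (d > 0 ? 1.0 : -1.0) / (d * d);   /* toy pairwise force */
            }
        /* Shift the traveling block one processor to the left. */
        MPI_Sendrecv(travel, nloc, MPI_DOUBLE, left, 0,
                     incoming, nloc, MPI_DOUBLE, right, 0,
                     comm, MPI_STATUS_IGNORE);
        double *t = travel; travel = incoming; incoming = t;
    }
    free(travel); free(incoming);
}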
Communication-Avoiding Version (using a “1.5D” decomposition)
Solomonik, Yelick
• Divide p into c groups; replicate particles within each group
  – First row responsible for updating all by orange, second all by green, …
• Algorithm: shift a copy of n/(p*c) particles to the left
  – Combine with previous data before passing to the next level (log steps)
• Reduce across the c copies to produce the final value for each particle
• Total computation: O(n²/p)
• Total communication: O(log(p/c) + log c) messages, O(n·(c/p + 1/c)) words
[Diagram: the p processors arranged as a c × (p/c) grid of particle blocks, replicated across the c rows]
Limit: c ≤ p^(1/2)
36
In theory there is no difference between theory and practice, but in
practice there is.
-- Jan L. A. van de Snepscheut, Computer Scientist or -- Yogi Berra, Baseball player and manager
37
Performance Results
• Over 2x speedup at 24K cores relative to the 1D algorithm
• In general, “.5D” replication is best for small problems on big machines
Driscoll, Georganas, Koanantakool, Solomonik, Yelick (unpublished)
38
[Figure: strong scaling for All-Pairs (196,608 particles) — GFlops/core vs. number of cores (96 to 24,576) for c = 1, 2, 4, 8, 16 and an ideal line]
Have We Seen this Idea Before?
• These algorithms also maximize parallelism beyond “domain decomposition” – as in the SIMD machine days
• Automation depends on an associative operator for the updates (e.g., M. Wolfe)
• Also used for “synchronization avoidance” in a Particle-in-Cell code (Madduri, Su, Oliker, Yelick)
  – Replicate-and-reduce optimization given p copies
  – Useful on vectors / GPUs
39
Lesson #6: Aggregate Communication
Pack messages to better amortize per-message communication costs.
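A minimal sketch of the idea (an illustration, not from the talk): instead of sending each small update as its own message, updates destined for the same rank are packed into a per-destination buffer and flushed as one larger message, so the per-message latency is paid once per batch. The batch size and the 64-rank limit are illustrative assumptions.

#include <mpi.h>

#define BATCH 1024                        /* updates packed per message (illustrative) */
#define MAXP  64                          /* sketch assumes at most 64 ranks */

typedef struct { long index; double value; } update_t;

static update_t outbox[MAXP][BATCH];
static int      count[MAXP];

/* Queue one small update for 'dest'; send only when a full batch has accumulated. */
void send_update(int dest, long index, double value, MPI_Comm comm)
{
    outbox[dest][count[dest]++] = (update_t){ index, value };
    if (count[dest] == BATCH) {           /* one message instead of BATCH messages */
        MPI_Send(outbox[dest], BATCH * (int)sizeof(update_t), MPI_BYTE, dest, 0, comm);
        count[dest] = 0;
    }
}

/* Flush any partially filled buffers, e.g. at the end of a phase. */
void flush_updates(int p, MPI_Comm comm)
{
    for (int dest = 0; dest < p; dest++)
        if (count[dest] > 0) {
            MPI_Send(outbox[dest], count[dest] * (int)sizeof(update_t), MPI_BYTE, dest, 0, comm);
            count[dest] = 0;
        }
}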
40
Lesson #7: Overlap and Pipeline Communication
Sometimes runs contrary to lesson #6 on aggregation
41
Communication Strategies for 3D FFT
Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea!
• Three approaches:
  – Chunk (all rows with the same destination):
    • Wait for the 2nd-dimension FFTs to finish
    • Minimizes # of messages
  – Slab (all rows in a single plane with the same destination):
    • Wait for the chunk of rows destined for one processor to finish
    • Overlap with computation
  – Pencil (one row):
    • Send each row as it completes
    • Maximizes overlap
    • Matches the natural layout
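A minimal sketch of the pencil strategy (an illustration only; fft_row is a hypothetical per-row 1D FFT, e.g. something one would delegate to FFTW): each row is shipped with a non-blocking MPI_Isend as soon as its FFT completes, so communication overlaps the remaining computation.

#include <mpi.h>
#include <stdlib.h>

/* Hypothetical 1D FFT of one row of length len (could wrap an FFTW plan). */
void fft_row(double *row, int len);

/* Pencil strategy: FFT each local row and immediately start sending it to the rank
   that owns it after the transpose, overlapping the sends with later FFTs. */
void fft_and_send_pencils(double *rows, int nrows, int len,
                          const int *dest_rank, MPI_Comm comm)
{
    MPI_Request *reqs = malloc(nrows * sizeof *reqs);
    for (int r = 0; r < nrows; r++) {
        double *row = rows + (size_t)r * len;
        fft_row(row, len);                              /* compute this pencil */
        MPI_Isend(row, len, MPI_DOUBLE, dest_rank[r],   /* ship it right away */
                  r, comm, &reqs[r]);
    }
    MPI_Waitall(nrows, reqs, MPI_STATUSES_IGNORE);      /* drain the outstanding sends */
    free(reqs);
}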
NAS FT Variants Performance Summary
• Slab is always best for MPI; the small-message cost is too high
• Pencil is always best for UPC; more overlap
[Figure: best MFlop rates per thread for all NAS FT benchmark versions — Chunk (NAS FT with FFTW), Best NAS Fortran/MPI, Best MPI (always slabs), and Best UPC (always pencils) — on Myrinet 64, InfiniBand 256, Elan3 256, Elan3 512, Elan4 256, and Elan4 512; the largest run reaches 0.5 Tflops]
Lesson #8: Avoid Synchronization, a Particular Form of Communication
Use One-sided Communication
44
Avoiding Synchronization in Communication
• Two-sided message passing (e.g., MPI) requires matching a send with a receive to identify the memory address where the data goes
  – Wildly popular in HPC, but cumbersome in some applications
  – Couples data transfer with synchronization
• Using a global address space decouples synchronization
  – Pay for what you need!
  – Note: global addressing ≠ cache-coherent shared memory
[Diagram: a two-sided message carries a message id and data payload and must be matched by the host CPU, while a one-sided put message carries the address and data payload and is deposited directly into memory by the network interface]
Joint work with Dan Bonachea, Paul Hargrove, Rajesh Nishtala and rest of UPC group
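The measurements in the talk use GASNet/UPC put; as an analogous illustration in plain MPI (a sketch, not the benchmark code), a one-sided MPI_Put deposits data directly into a window exposed by the target, with no matching receive on the target side:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double local[8] = {0};
    MPI_Win win;
    /* Every rank exposes 'local' as a window that others may write into directly. */
    MPI_Win_create(local, (MPI_Aint)sizeof local, sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0 && p > 1) {
        double payload[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        /* One-sided put: rank 1 posts no receive; the data lands in its window. */
        MPI_Put(payload, 8, MPI_DOUBLE, 1, 0, 8, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);   /* synchronization is explicit and separate from the transfer */

    if (rank == 1) printf("rank 1 received %g ... %g via MPI_Put\n", local[0], local[7]);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}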
Performance Advantage of One-Sided Communication
8-byte Roundtrip Latency
[Figure: 8-byte roundtrip latency (µsec) for MPI ping-pong vs. GASNet put+sync on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed]
• The put/get operations in PGAS languages (remote read/write) are one-sided (no required interaction from the remote processor)
• This is faster for pure data transfers than two-sided send/receive
Flood Bandwidth for 4KB messages
[Figure: flood bandwidth for 4KB messages as a percent of hardware peak, MPI vs. GASNet, on the same six platforms]
UPC at Scale on the MILC (QCD) App
[Figure: MILC performance, sites/second vs. number of cores (512 to 32,768), for UPC Opt, MPI, and UPC Naïve]
47
Shan, Austin, Wright, Strohmaier, Shalf, Yelick (unpublished)
Lesson #9: “Strength-Reduce” Synchronization
Replace global barriers with subset barriers, point-to-point synchronization,
and event-driven execution.
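A minimal sketch of strength-reducing a barrier (an illustration, not from the talk): if a rank only consumes data from its two neighbors, a zero-byte message from each neighbor is enough to know their data is ready, so a global MPI_Barrier can be replaced by point-to-point synchronization among the ranks that actually interact.

#include <mpi.h>

/* Point-to-point synchronization with the left and right neighbors only: each rank
   signals "my data is ready" and waits for the same signal from its two neighbors,
   instead of a global barrier across all ranks. */
void neighbor_sync(MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int left = (rank + p - 1) % p, right = (rank + 1) % p;

    MPI_Request reqs[4];
    MPI_Isend(NULL, 0, MPI_BYTE, left,  0, comm, &reqs[0]);   /* signal the neighbors */
    MPI_Isend(NULL, 0, MPI_BYTE, right, 0, comm, &reqs[1]);
    MPI_Irecv(NULL, 0, MPI_BYTE, left,  0, comm, &reqs[2]);   /* wait for their signals */
    MPI_Irecv(NULL, 0, MPI_BYTE, right, 0, comm, &reqs[3]);
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
}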
48
49
Avoid Synchronization from Applications
Computations as DAGs
View parallel executions as the directed acyclic graph of the computation
[Figure: task DAGs for a 4 x 4 Cholesky and a 4 x 4 QR factorization]
Slide source: Jack Dongarra
Event Driven LU in UPC
• Assignment of work is static; the schedule is dynamic
• Ordering needs to be imposed on the schedule
  – Critical path operation: panel factorization
• General issue: dynamic scheduling in partitioned memory
  – Can deadlock in memory allocation
  – “Memory-constrained” lookahead
[Figure: event-driven LU task DAG; some edges omitted]
51
DAG Scheduling Outperforms Bulk-Synchronous Style
UPC vs. ScaLAPACK
[Figure: GFlops for ScaLAPACK vs. UPC on 2x4 and 4x4 processor grids]
The UPC LU factorization code adds cooperative (non-preemptive) threads for latency hiding
  – New problem in partitioned memory: allocator deadlock
  – Can run out of memory locally due to an unlucky execution order
PLASMA on shared memory; UPC on partitioned memory
PLASMA by Dongarra et al.; UPC LU joint work with Parry Husbands
Another reason to avoid synchronization
• Processors do not run at the same speed
  – They never did, due to caches
  – Power / temperature management makes this worse
52
Lesson #10: Combine Techniques
These are not entirely orthogonal, and there are many interesting trade-offs.
53
Communication-Avoiding (2.5D) and Overlapped (with UPC) Matrix Multiply
[Figure: SUMMA on Hopper (n=32,768) — percentage of machine peak vs. number of cores (1,536; 6,144; 24,576) for 2.5D-overlap, 2.5D, 2D-overlap, and 2D]
Georganas, González-Domínguez, Solomonik, Zheng, Touriño, Yelick (to appear at SC12)
Lessons in Communication Avoidance
1. Compress data structures
2. Target higher-level loops
3. Understand theory / numerics
4. Replicate data
5. Understand theory / lower bounds
6. Aggregate communication
7. Overlap communication
8. Use one-sided communication
9. Synchronization strength reduction
10. Combine the techniques
55
Challenges in Exascale Computing
There are many exascale challenges:
• Scaling
• Synchronization
• Dynamic system behavior
• Irregular algorithms
• Resilience
But don’t forget what’s really important
56
Communication Hurts!
57