Compiling to Avoid Communication
Kathy Yelick
Associate Laboratory Director for Computing Sciences, Lawrence Berkeley National Laboratory
EECS Professor, UC Berkeley
NERSC Systems and Software Have to Support All of Science
NERSC computing for science
• 4500 users, 500 projects
• ~65% from universities, 30% from labs
• 1500 publications per year!
Systems designed for science
• 1.3 petaflop Cray system, Hopper
• 8 PB filesystem; 250 PB archive
• Several systems for genomics, astronomy, visualization, etc.
~650 applications use these programming models
• 75% Fortran, 45% C/C++, 10% Python
• 85% MPI, 25% with OpenMP
• 10% PGAS or global objects
These are self-reported, likely low
Computational Science has Moved through Difficult Technology Transitions
[Figure: Application Performance Growth (Gordon Bell Prizes), 1990–2020, rising from about 1E+08 to 1E+18 flop/s]
3
Attack of the “killer micros”
Attack of the “killer cellphones”?
The rest of the computing world gets parallelism
Essentially, all models are wrong, but some are useful.
-- George E. Box, Statistician
4
Computing Performance Improvements will be Harder than Ever
[Figure: three panels, each adding one series — Transistors (thousands), Frequency (MHz), Power (W), and Cores — for microprocessors, 1970–2010]
Moore’s Law continues, but power limits performance growth. Parallelism is used instead.
5
Energy Efficient Computing is Key to Performance Growth
[Figure: projected system power, 2005–2020 — usual scaling vs. goal]
6
At $1M per MW, energy costs are substantial
• 1 petaflop in 2010 used 3 MW
• 1 exaflop in 2018 would use 100+ MW with “Moore’s Law” scaling
The problem doesn’t change if we build 1,000 one-petaflop machines instead of one exaflop machine. It affects every university department cluster and cloud data center.
Communication is Expensive in Time
Communication cost has two components:
• Bandwidth: # of words moved / bandwidth
• Latency: # of messages * latency
“Communication” includes both the memory hierarchy and the network. Alas, things are bad and getting worse:
flop_time << 1/bandwidth << latency
Annual improvements [FOSC]:
• Flop time: 59%
• Network: bandwidth 26%, latency 15%
• DRAM: bandwidth 23%, latency 5%
Hard to change: Latency is physics; bandwidth is money!
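To make this model concrete, here is a minimal sketch (an illustration, not from the talk) of the usual latency/bandwidth cost estimate; the machine parameters in main are placeholders.

#include <stdio.h>

/* Communication cost model: time = #messages * latency + #words / bandwidth. */
double comm_time(long messages, long words, double latency_sec, double words_per_sec)
{
    return messages * latency_sec + words / words_per_sec;
}

int main(void)
{
    /* Example: 100 messages carrying 1e6 words total (placeholder parameters). */
    printf("estimated communication time: %g s\n",
           comm_time(100, 1000000, 1.0e-6, 1.0e9));
    return 0;
}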
7
Communication is Expensive in Energy
[Figure: picojoules per operation, now vs. 2018, for on-chip/CMP communication, intranode/SMP communication, and intranode/MPI communication (1 to 10,000 pJ, log scale)]
Communication (any off-chip data access, including to DRAM) is expensive
The Memory Wall Swamp
9
Multicore didn’t cause this, but kept the bandwidth gap growing.
Obama for Communication-Avoiding Algorithms
“New Algorithm Improves Performance and Accuracy on Extreme-Scale Computing Systems. On modern computer architectures, communication between processors takes longer than the performance of a floating point arithmetic operation by a given processor. ASCR researchers have developed a new method, derived from commonly used linear algebra methods, to minimize communications between processors and the memory hierarchy, by reformulating the communication patterns specified within the algorithm. This method has been implemented in the TRILINOS framework, a highly-regarded suite of software, which provides functionality for researchers around the world to solve large scale, complex multi-physics problems.”
FY 2012 Congressional Budget Request, Volume 4, FY2010 Accomplishments, Advanced Scientific Computing Research (ASCR), pages 65-67.
10
Lesson #1: Compress Data Structures
This is a hard one for compilers....
11
Compression of Metadata is Important
12
Select Register Block size r and c
(1) Estimate Fill(r,c):
Fill(r,c) = (#true nonzeros + #explicit zeros filled in) / #true nonzeros
(2) Select the register block RB(r,c) with the highest Effective_Mflops(r,c):
Effective_Mflops(r,c) = Mflops(r,c) / Fill(r,c)
Extra flops, but it may still finish sooner.
This is a hard problem for compilers on sparse matrices!
Offline Analysis to Estimate Mflop/sec
13
Estimate Mflop/s rate using a dense matrix in sparse format
Dense Mflops(r,c) / Tsopf Fill(r,c) = Effective_Mflops(r,c)
Selected RB(5x7) using a sample dense matrix
[Figure: heatmaps plotted over r (number of rows) and c (number of columns)]
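The selection heuristic above can be sketched directly (my own illustration, not the actual auto-tuner; the CSR field names and the dense_mflops profile table are assumptions, and a real tuner would estimate fill from only a sample of the rows):

#include <stdlib.h>
#include <string.h>

/* CSR sparse matrix; field names are illustrative, not from a specific library. */
typedef struct {
    int n;        /* square: n rows and n columns */
    int *rowptr;  /* length n+1 */
    int *col;     /* length rowptr[n]; assumes rowptr[n] > 0 */
} csr_t;

/* Estimate Fill(r,c) = (#true nonzeros + #explicit zeros filled in) / #true nonzeros
   by counting how many r-by-c register blocks the nonzeros touch. */
double estimate_fill(const csr_t *A, int r, int c)
{
    int nblk_cols = (A->n + c - 1) / c;
    char *touched = malloc(nblk_cols);
    long blocks = 0, nnz = A->rowptr[A->n];
    for (int ib = 0; ib * r < A->n; ib++) {              /* each block row */
        memset(touched, 0, nblk_cols);
        int lo = ib * r, hi = lo + r < A->n ? lo + r : A->n;
        for (int i = lo; i < hi; i++)
            for (int k = A->rowptr[i]; k < A->rowptr[i + 1]; k++)
                touched[A->col[k] / c] = 1;
        for (int jb = 0; jb < nblk_cols; jb++) blocks += touched[jb];
    }
    free(touched);
    return (double)(blocks * r * c) / (double)nnz;       /* stored values / true nonzeros */
}

/* Pick (r,c) maximizing Effective_Mflops = dense_mflops[r][c] / Fill(r,c).
   dense_mflops comes from an offline benchmark of a dense matrix stored in
   blocked-sparse format; here it is just an input table. */
void select_block(const csr_t *A, double dense_mflops[8][8], int *rbest, int *cbest)
{
    double best = 0.0;
    *rbest = 1; *cbest = 1;
    for (int r = 1; r <= 8; r++)
        for (int c = 1; c <= 8; c++) {
            double eff = dense_mflops[r - 1][c - 1] / estimate_fill(A, r, c);
            if (eff > best) { best = eff; *rbest = r; *cbest = c; }
        }
}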
14
Auto-tuned SpMV Performance
[Figure: SpMV performance with optimizations applied cumulatively — Naïve, Naïve Pthreads, +NUMA/Affinity, +SW Prefetching, +Matrix Compression, +Cache/LS/TLB Blocking]
Results from Williams, Shalf, Oliker, Yelick
15
Williams’ Roofline Performance Model
[Figure: roofline for a generic machine — attainable Gflop/s (0.5 to 256) vs. actual flop:byte ratio (1/8 to 16), with ceilings for peak DP, mul/add imbalance, without SIMD, and without ILP]
Generic Machine
• Flat top is the flop limit
• Slanted part is the bandwidth limit
• X-axis is the arithmetic intensity of the algorithm
• Bandwidth-reducing optimizations, such as better cache re-use (tiling) and compression, improve intensity
Model by Sam Williams
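As a concrete reading of the model (a sketch, not Williams’ roofline tool; the peak and bandwidth numbers below are placeholders), attainable performance is simply the minimum of the compute peak and bandwidth times arithmetic intensity:

#include <stdio.h>

/* Roofline: attainable Gflop/s = min(peak Gflop/s, bandwidth GB/s * flops per byte). */
double roofline(double peak_gflops, double bw_gbytes_per_s, double intensity)
{
    double bw_bound = bw_gbytes_per_s * intensity;
    return bw_bound < peak_gflops ? bw_bound : peak_gflops;
}

int main(void)
{
    double peak = 85.3, bw = 25.6;                /* placeholder machine numbers */
    for (double ai = 0.125; ai <= 16.0; ai *= 2)  /* sweep arithmetic intensity */
        printf("intensity %6.3f flop/byte -> %6.1f Gflop/s attainable\n",
               ai, roofline(peak, bw, ai));
    return 0;
}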
[Figure: rooflines for Xeon X5550 (Nehalem) and NVIDIA C2050 (Fermi) — attainable Gflop/s (2 to 1024) vs. arithmetic intensity (1/32 to 32), with single-precision peak, double-precision peak, and DP add-only ceilings; kernels plotted: SpMV, 7pt stencil, 27pt stencil, GTC/chargei, GTC/pushi, RTM/wave eqn., and DGEMM]
Autotuning Gets Kernel Performance Near Optimal
• Roofline model captures bandwidth and computation limits
• Autotuning gets kernels near the roof
Work by Williams, Oliker, Shalf, Madduri, Kamil, Im, Ethier, …
Lesson #2: Target Higher Level Loops
Harder than inner loops....
17
Avoiding Communication in Iterative Solvers
• Consider sparse iterative methods for Ax=b
  – Krylov subspace methods: GMRES, CG, …
  – Can we lower the communication costs?
    • Latency of communication, i.e., reduce the # of messages by computing multiple reductions at once
    • Bandwidth to the memory hierarchy, i.e., compute Ax, A²x, …, Aᵏx with one read of A
• Solve time dominated by:
  – Sparse matrix-vector multiply (SpMV)
    • Even on one processor, dominated by “communication” time to read the matrix
  – Global collectives (reductions)
    • Global, latency-limited
Joint work with Jim Demmel, Mark Hoemmen, Marghoob Mohiyuddin
Communication-Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Aᵏx]
[Diagram: x and the entries of A·x, A²·x, A³·x laid out over columns 1 … 32 of a tridiagonal A]
• Replace k iterations of y = A·x with [Ax, A²x, …, Aᵏx]
• Idea: pick up the part of A and x that fits in fast memory, compute each of the k products
• Example: A tridiagonal, n=32, k=3
• Works for any “well-partitioned” A
19
Communication-Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Aᵏx] — Sequential Algorithm
[Diagram: the same tridiagonal example, with the work divided into Step 1 … Step 4, each step reading one block of A and x]
• Replace k iterations of y = A·x with [Ax, A²x, …, Aᵏx]
• Example: A tridiagonal, n=32, k=3
• Saves bandwidth (one read of A and x for k steps)
• Saves latency (number of independent read events)
20
Communication-Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Aᵏx] — Parallel Algorithm
[Diagram: the same tridiagonal example, partitioned across Proc 1 … Proc 4]
• Replace k iterations of y = A·x with [Ax, A²x, …, Aᵏx]
• Example: A tridiagonal, n=32, k=3
• Each processor communicates once with its neighbors
21
Communication-Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Aᵏx] — Parallel Algorithm
[Diagram: the same tridiagonal example; Proc 1 … Proc 4 each work on an overlapping trapezoid]
• Replace k iterations of y = A·x with [Ax, A²x, …, Aᵏx]
• Example: A tridiagonal, n=32, k=3
• Each processor works on an (overlapping) trapezoid
• Saves latency (# of messages), not bandwidth, but adds redundant computation
22
Matrix Powers Kernel on a General Matrix
• Saves communication for “well-partitioned” matrices
• Serial: O(1) data moves vs. O(k)
• Parallel: O(log p) messages vs. O(k log p)
23
Joint work with Jim Demmel, Mark Hoemman, Marghoob Mohiyuddin
For implicit memory management (caches), a TSP algorithm is used for the data layout
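Here is a minimal sequential sketch of the kernel for the tridiagonal example above (my illustration, not the actual implementation, which handles general well-partitioned matrices; this sketch also uses the overlapping, redundant-work variant even sequentially, for simplicity). Each cache-sized block of x is extended by k ghost entries on each side and all k products are computed from that one read; in the parallel version that ghost region is the single message exchanged with each neighbor.

#include <stdlib.h>

/* Tridiagonal matvec on index range [lo, hi): y[i] = sub[i]*x[i-1] + diag[i]*x[i] + sup[i]*x[i+1]. */
static void tri_matvec(int n, const double *sub, const double *diag, const double *sup,
                       const double *x, double *y, int lo, int hi)
{
    for (int i = lo; i < hi; i++) {
        double s = diag[i] * x[i];
        if (i > 0)     s += sub[i] * x[i - 1];
        if (i < n - 1) s += sup[i] * x[i + 1];
        y[i] = s;
    }
}

/* Matrix powers kernel sketch: fill V[j][i] = (A^j x)[i] for j = 1..k, processing one
   block of x at a time so each block of A and x is read once.  For the block that owns
   [blo, bhi), level j is valid on [blo-(k-j), bhi+(k-j)); only the owned range is kept. */
void matrix_powers(int n, int k, int block,
                   const double *sub, const double *diag, const double *sup,
                   const double *x, double **V /* V[1..k], each of length n */)
{
    double *prev = malloc(n * sizeof *prev), *next = malloc(n * sizeof *next);
    for (int blo = 0; blo < n; blo += block) {
        int bhi = blo + block < n ? blo + block : n;
        int glo = blo - k > 0 ? blo - k : 0;              /* ghost region: k extra entries */
        int ghi = bhi + k < n ? bhi + k : n;              /* on each side of the owned block */
        for (int i = glo; i < ghi; i++) prev[i] = x[i];
        for (int j = 1; j <= k; j++) {
            int lo = blo - (k - j) > 0 ? blo - (k - j) : 0;
            int hi = bhi + (k - j) < n ? bhi + (k - j) : n;
            tri_matvec(n, sub, diag, sup, prev, next, lo, hi);
            for (int i = blo; i < bhi; i++) V[j][i] = next[i];   /* keep only the owned part */
            double *t = prev; prev = next; next = t;
        }
    }
    free(prev); free(next);
}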
Bigger Kernel (Aᵏx) Runs at a Faster Speed than the Simpler Kernel (Ax)
Speedups on Intel Clovertown (8 core)
Jim Demmel, Mark Hoemmen, Marghoob Mohiyuddin, Kathy Yelick
Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, Aᵏb} minimizing ||Ax − b||₂
Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)   … SpMV
    MGS(w, v(0), …, v(i-1))
    update v(i), H
  endfor
  solve LSQ problem with H
Communication-avoiding GMRES:
  W = [v, Av, A²v, …, Aᵏv]
  [Q, R] = TSQR(W)   … “Tall Skinny QR”
  build H from R
  solve LSQ problem with H
Sequential case: # of words moved decreases by a factor of k
Parallel case: # of messages decreases by a factor of k
• Oops – W from power method, precision lost!
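The TSQR reduction structure can be sketched as follows (an illustration of the idea only, not the implementation used for these results; qr_factor is a deliberately naive stand-in for a LAPACK-style local QR, and only the final R factor is returned here):

#include <mpi.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

/* Small dense QR via Householder reflections (column-major, leading dim = rows).
   Overwrites 'a'; writes the m x m upper-triangular R into 'r' (column-major).
   A toy stand-in for a tuned local QR, just to keep the sketch self-contained. */
static void qr_factor(double *a, int rows, int m, double *r)
{
    for (int j = 0; j < m; j++) {
        double norm = 0.0;
        for (int i = j; i < rows; i++) norm += a[j*rows+i] * a[j*rows+i];
        norm = sqrt(norm);
        double alpha = (a[j*rows+j] >= 0.0) ? -norm : norm;
        a[j*rows+j] -= alpha;                     /* column j now holds the Householder vector v */
        double vtv = 0.0;
        for (int i = j; i < rows; i++) vtv += a[j*rows+i] * a[j*rows+i];
        if (vtv > 0.0)
            for (int k = j + 1; k < m; k++) {     /* apply I - 2vv^T/(v^Tv) to trailing columns */
                double dot = 0.0;
                for (int i = j; i < rows; i++) dot += a[j*rows+i] * a[k*rows+i];
                double beta = 2.0 * dot / vtv;
                for (int i = j; i < rows; i++) a[k*rows+i] -= beta * a[j*rows+i];
            }
        a[j*rows+j] = alpha;                      /* R's diagonal entry */
    }
    for (int j = 0; j < m; j++)
        for (int i = 0; i < m; i++)
            r[j*m+i] = (i <= j) ? a[j*rows+i] : 0.0;
}

/* TSQR sketch: each rank owns a (local_rows x m) block of the tall-skinny matrix W.
   One local QR, then a binary reduction tree over m x m R factors: O(log P) messages
   instead of one reduction per column. Rank 0 ends up with the final R. */
void tsqr(double *w_local, int local_rows, int m, double *r_out, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    double *r       = malloc((size_t)m * m * sizeof *r);
    double *r_other = malloc((size_t)m * m * sizeof *r_other);
    double *stacked = malloc((size_t)2 * m * m * sizeof *stacked);

    qr_factor(w_local, local_rows, m, r);                    /* local factorization */

    for (int dist = 1; dist < p; dist *= 2) {                /* binary reduction tree */
        if (rank % (2 * dist) != 0) {
            MPI_Send(r, m * m, MPI_DOUBLE, rank - dist, 0, comm);
            break;                                           /* this rank's R has been merged */
        }
        if (rank + dist < p) {
            MPI_Recv(r_other, m * m, MPI_DOUBLE, rank + dist, 0, comm, MPI_STATUS_IGNORE);
            for (int j = 0; j < m; j++) {                    /* stack the two R factors */
                memcpy(stacked + (size_t)j*2*m,     r       + (size_t)j*m, m * sizeof(double));
                memcpy(stacked + (size_t)j*2*m + m, r_other + (size_t)j*m, m * sizeof(double));
            }
            qr_factor(stacked, 2 * m, m, r);                 /* QR of the 2m x m stack */
        }
    }
    if (rank == 0) memcpy(r_out, r, (size_t)m * m * sizeof(double));
    free(r); free(r_other); free(stacked);
}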
25
Matrix Powers Kernel (and TSQR) in GMRES
26
[Figure: relative norm of the residual ||Ax − b|| (10⁻⁵ to 10⁰) vs. iteration count (0 to 1000) for Original GMRES, CA-GMRES (Monomial basis), and CA-GMRES (Newton basis)]
Communication-Avoiding Krylov Method (GMRES)
Performance on 8 core Clovertown
27
CA-Krylov Methods Summary and Future Work
• Communication avoidance works
  – Provably optimal
  – Faster in practice
• Ongoing work on preconditioning
  – Handle “hard” matrices that partition poorly (high surface-to-volume ratio)
  – Idea: separate out dense rows (HSS matrices)
  – [Erin Carson, Nick Knight, Jim Demmel]
28
Lesson #3: Understand Numerics (Or work with someone who does)
Don’t be afraid to change to a “different” right answer
29
Lesson #4: Never Waste Fast Memory
Don’t get hung up on the “owner computes” rule.
30
Beyond Domain Decomposition: 2.5D Matrix Multiply
[Figure: 2.5D MM on BG/P (n=65,536) — percentage of machine peak (0–100%) vs. #nodes (256, 512, 1024, 2048) for 2.5D Broadcast-MM, 2.5D Cannon-MM, 2D MM (Cannon), and ScaLAPACK PDGEMM, with a perfect strong scaling reference]
• Conventional “2D algorithms” use a P^(1/2) × P^(1/2) mesh and minimal memory
• New “2.5D algorithms” use a (P/c)^(1/2) × (P/c)^(1/2) × c mesh and c-fold memory
• Matmul sends c^(1/2) times fewer words – matches the lower bound
• Matmul sends c^(3/2) times fewer messages – matches the lower bound
Work by Edgar Solomonik and Jim Demmel
Surprises:
• Even matrix multiply had room for improvement
• Idea: make copies of the C matrix (as in the prior 3D algorithm, but not as many)
• Result is provably optimal in communication
Lesson: never waste fast memory. Can we generalize this for compiler writers?
31
Deconstructing 2.5D Matrix Multiply (Solomonik & Demmel)
[Diagram: the 3D (i, j, k) iteration space of matrix multiply, tiled into blocks along x, y, and z]
• Tiling the iteration space
• 2D algorithm: never chop the k dimension
• 2.5D or 3D: assume + is associative; chop k, which amounts to replication of the C matrix
Matrix multiplication code has a 3D iteration space; each point in the space is a constant amount of computation (* and +):
for i
  for j
    for k
      C[i,j] … A[i,k] … B[k,j] …
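A minimal sequential sketch of the “chop k” idea (my illustration of the concept, not the 2.5D code run on BG/P): split the k dimension into layers, let each layer accumulate into its own copy of C, and reduce the copies at the end — exactly the freedom that associativity of + provides. In the parallel 2.5D algorithm each copy lives on a different processor layer.

#include <stdlib.h>
#include <string.h>

/* C = A*B for n x n row-major matrices, with the k loop chopped into 'layers' chunks.
   Each chunk sums into its own copy of C and the copies are reduced at the end --
   the replication of C used by 2.5D/3D algorithms. */
void matmul_chop_k(int n, int layers, const double *A, const double *B, double *C)
{
    double *copies = calloc((size_t)layers * n * n, sizeof *copies);
    int chunk = (n + layers - 1) / layers;

    for (int l = 0; l < layers; l++) {               /* one copy of C per k-chunk */
        double *Cl = copies + (size_t)l * n * n;
        int k0 = l * chunk, k1 = k0 + chunk < n ? k0 + chunk : n;
        for (int i = 0; i < n; i++)
            for (int k = k0; k < k1; k++) {
                double aik = A[i * n + k];
                for (int j = 0; j < n; j++)
                    Cl[i * n + j] += aik * B[k * n + j];
            }
    }

    memset(C, 0, (size_t)n * n * sizeof *C);         /* reduce the copies into C */
    for (int l = 0; l < layers; l++)
        for (size_t t = 0; t < (size_t)n * n; t++)
            C[t] += copies[(size_t)l * n * n + t];

    free(copies);
}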
Lower Bound Idea on C = A·B (Irony, Toledo, Tiskin)
33
[Diagram: a box with side lengths x, y, z inside the 3D (i, j, k) iteration space, with its “A shadow” and “C shadow” projections]
# of cubes in a box with side lengths x, y, and z
  = volume of the box = x·y·z = (xz · zy · yx)^(1/2) = (#A squares · #B squares · #C squares)^(1/2)
(i,k) is in the “A shadow” if (i,j,k) is in the 3D set
(j,k) is in the “B shadow” if (i,j,k) is in the 3D set
(i,j) is in the “C shadow” if (i,j,k) is in the 3D set
Thm (Loomis & Whitney, 1949): # of cubes in the 3D set = volume of the 3D set ≤ (area(A shadow) · area(B shadow) · area(C shadow))^(1/2)
Lesson #5: Understand Theory
Lower bounds help identify optimizations.
34
Traditional (Naïve n²) N-body Algorithm (using a 1D decomposition)
• Given n particles and p processors, size M memory
• Each processor has n/p particles
• Algorithm: shift a copy of the particles to the left p times, calculating all pairwise forces
• Computation cost: n²/p
• Communication cost: O(p) messages, O(n) words
[Diagram: p processors in a row, each holding n/p particles]
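A minimal MPI sketch of the shift-based algorithm described above (an illustration under simplifying assumptions: 1D positions, a toy inverse-square “force”, and n divisible by p; not the code behind the results reported later):

#include <mpi.h>
#include <stdlib.h>

/* Naive 1D-decomposition n-body step: each rank owns nloc particles and shifts a
   traveling copy around the ring p times, accumulating the force from every other
   particle: O(p) messages and O(n) words per rank. */
void nbody_forces(int nloc, const double *pos, double *force, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int left = (rank + p - 1) % p, right = (rank + 1) % p;

    double *travel   = malloc(nloc * sizeof *travel);
    double *incoming = malloc(nloc * sizeof *incoming);
    for (int i = 0; i < nloc; i++) { travel[i] = pos[i]; force[i] = 0.0; }

    for (int s = 0; s < p; s++) {
        /* Interact the owned particles with the current traveling block. */
        for (int i = 0; i < nloc; i++)
            for (int j = 0; j < nloc; j++) {
                double d = pos[i] - travel[j];
                if (d != 0.0)
                    force[i] += (d > 0 ? 1.0 : -1.0) / (d * d);   /* toy pairwise force */
            }
        /* Shift the traveling block one processor to the left. */
        MPI_Sendrecv(travel, nloc, MPI_DOUBLE, left, 0,
                     incoming, nloc, MPI_DOUBLE, right, 0,
                     comm, MPI_STATUS_IGNORE);
        double *t = travel; travel = incoming; incoming = t;
    }
    free(travel); free(incoming);
}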
Communication-Avoiding Version (using a “1.5D” decomposition)
Solomonik, Yelick
• Divide p into c groups; replicate particles within each group
  – First row responsible for updating all by orange, second all by green, …
• Algorithm: shift a copy of n/(p*c) particles to the left
  – Combine with previous data before passing to the next level (log steps)
• Reduce across the c copies to produce the final value for each particle
• Total computation: O(n²/p)
• Total communication: O(log(p/c) + log c) messages, O(n·(c/p + 1/c)) words
[Diagram: the p processors arranged as a c × (p/c) grid of particle blocks, replicated across the c rows]
Limit: c ≤ p^(1/2)
36
In theory there is no difference between theory and practice, but in
practice there is.
-- Jan L. A. van de Snepscheut, Computer Scientist or -- Yogi Berra, Baseball player and manager
37
Performance Results
• Over 2x speedup at 24K cores relative to the 1D algorithm
• In general, “.5D” replication is best for small problems on big machines
Driscoll, Georganas, Koanantakool, Solomonik, Yelick (unpublished)
38
[Figure: strong scaling for All-Pairs (196,608 particles) — GFlops/core vs. number of cores (96 to 24,576) for c = 1, 2, 4, 8, 16 and an ideal line]
Have We Seen this Idea Before?
• These algorithms also maximize parallelism beyond “domain decomposition” – as in the SIMD machine days
• Automation depends on an associative operator for the updates (e.g., M. Wolfe)
• Also used for “synchronization avoidance” in a Particle-in-Cell code (Madduri, Su, Oliker, Yelick)
  – Replicate-and-reduce optimization given p copies
  – Useful on vectors / GPUs
39
Lesson #6: Aggregate Communication
Pack messages to better amortize per-message communication costs.
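A minimal sketch of the idea (an illustration, not from the talk): instead of sending each small update as its own message, updates destined for the same rank are packed into a per-destination buffer and flushed as one larger message, so the per-message latency is paid once per batch. The batch size and the 64-rank limit are illustrative assumptions.

#include <mpi.h>

#define BATCH 1024                        /* updates packed per message (illustrative) */
#define MAXP  64                          /* sketch assumes at most 64 ranks */

typedef struct { long index; double value; } update_t;

static update_t outbox[MAXP][BATCH];
static int      count[MAXP];

/* Queue one small update for 'dest'; send only when a full batch has accumulated. */
void send_update(int dest, long index, double value, MPI_Comm comm)
{
    outbox[dest][count[dest]++] = (update_t){ index, value };
    if (count[dest] == BATCH) {           /* one message instead of BATCH messages */
        MPI_Send(outbox[dest], BATCH * (int)sizeof(update_t), MPI_BYTE, dest, 0, comm);
        count[dest] = 0;
    }
}

/* Flush any partially filled buffers, e.g. at the end of a phase. */
void flush_updates(int p, MPI_Comm comm)
{
    for (int dest = 0; dest < p; dest++)
        if (count[dest] > 0) {
            MPI_Send(outbox[dest], count[dest] * (int)sizeof(update_t), MPI_BYTE, dest, 0, comm);
            count[dest] = 0;
        }
}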
40
Lesson #7: Overlap and Pipeline Communication
Sometimes runs contrary to lesson #6 on aggregation
41
Communication Strategies for 3D FFT
Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea!
• Three approaches:
  – Chunk (all rows with the same destination):
    • Wait for the 2nd-dimension FFTs to finish
    • Minimizes # of messages
  – Slab (all rows in a single plane with the same destination):
    • Wait for the chunk of rows destined for one processor to finish
    • Overlap with computation
  – Pencil (one row):
    • Send each row as it completes
    • Maximizes overlap
    • Matches the natural layout
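A minimal sketch of the pencil strategy (an illustration only; fft_row is a hypothetical per-row 1D FFT, e.g. something one would delegate to FFTW): each row is shipped with a non-blocking MPI_Isend as soon as its FFT completes, so communication overlaps the remaining computation.

#include <mpi.h>
#include <stdlib.h>

/* Hypothetical 1D FFT of one row of length len (could wrap an FFTW plan). */
void fft_row(double *row, int len);

/* Pencil strategy: FFT each local row and immediately start sending it to the rank
   that owns it after the transpose, overlapping the sends with later FFTs. */
void fft_and_send_pencils(double *rows, int nrows, int len,
                          const int *dest_rank, MPI_Comm comm)
{
    MPI_Request *reqs = malloc(nrows * sizeof *reqs);
    for (int r = 0; r < nrows; r++) {
        double *row = rows + (size_t)r * len;
        fft_row(row, len);                              /* compute this pencil */
        MPI_Isend(row, len, MPI_DOUBLE, dest_rank[r],   /* ship it right away */
                  r, comm, &reqs[r]);
    }
    MPI_Waitall(nrows, reqs, MPI_STATUSES_IGNORE);      /* drain the outstanding sends */
    free(reqs);
}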
NAS FT Variants Performance Summary
• Slab is always best for MPI; the small-message cost is too high
• Pencil is always best for UPC; more overlap
[Figure: best MFlop rates per thread for all NAS FT benchmark versions — Chunk (NAS FT with FFTW), Best NAS Fortran/MPI, Best MPI (always slabs), and Best UPC (always pencils) — on Myrinet 64, InfiniBand 256, Elan3 256, Elan3 512, Elan4 256, and Elan4 512; the largest run reaches 0.5 Tflops]
Lesson #8: Avoid Synchronization, a Particular Form of Communication
Use One-sided Communication
44
Avoiding Synchronization in Communication
• Two-sided message passing (e.g., MPI) requires matching a send with a receive to identify the memory address where the data goes
  – Wildly popular in HPC, but cumbersome in some applications
  – Couples data transfer with synchronization
• Using a global address space decouples synchronization
  – Pay for what you need!
  – Note: global addressing ≠ cache-coherent shared memory
[Diagram: a two-sided message carries a message id and data payload and must be matched by the host CPU, while a one-sided put message carries the address and data payload and is deposited directly into memory by the network interface]
Joint work with Dan Bonachea, Paul Hargrove, Rajesh Nishtala and rest of UPC group
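The measurements in the talk use GASNet/UPC put; as an analogous illustration in plain MPI (a sketch, not the benchmark code), a one-sided MPI_Put deposits data directly into a window exposed by the target, with no matching receive on the target side:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double local[8] = {0};
    MPI_Win win;
    /* Every rank exposes 'local' as a window that others may write into directly. */
    MPI_Win_create(local, (MPI_Aint)sizeof local, sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0 && p > 1) {
        double payload[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        /* One-sided put: rank 1 posts no receive; the data lands in its window. */
        MPI_Put(payload, 8, MPI_DOUBLE, 1, 0, 8, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);   /* synchronization is explicit and separate from the transfer */

    if (rank == 1) printf("rank 1 received %g ... %g via MPI_Put\n", local[0], local[7]);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}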
Performance Advantage of One-Sided Communication
8-byte Roundtrip Latency
[Figure: 8-byte roundtrip latency (µsec) for MPI ping-pong vs. GASNet put+sync on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed]
• The put/get operations in PGAS languages (remote read/write) are one-sided (no required interaction from the remote processor)
• This is faster for pure data transfers than two-sided send/receive
Flood Bandwidth for 4KB messages
[Figure: flood bandwidth for 4KB messages as a percent of hardware peak, MPI vs. GASNet, on the same six platforms]
UPC at Scale on the MILC (QCD) App
[Figure: MILC performance, sites/second vs. number of cores (512 to 32,768), for UPC Opt, MPI, and UPC Naïve]
47
Shan, Austin, Wright, Strohmaier, Shalf, Yelick (unpublished)
Lesson #9: “Strength-Reduce” Synchronization
Replace global barriers with subset barriers, point-to-point synchronization,
and event-driven execution.
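A minimal sketch of strength-reducing a barrier (an illustration, not from the talk): if a rank only consumes data from its two neighbors, a zero-byte message from each neighbor is enough to know their data is ready, so a global MPI_Barrier can be replaced by point-to-point synchronization among the ranks that actually interact.

#include <mpi.h>

/* Point-to-point synchronization with the left and right neighbors only: each rank
   signals "my data is ready" and waits for the same signal from its two neighbors,
   instead of a global barrier across all ranks. */
void neighbor_sync(MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int left = (rank + p - 1) % p, right = (rank + 1) % p;

    MPI_Request reqs[4];
    MPI_Isend(NULL, 0, MPI_BYTE, left,  0, comm, &reqs[0]);   /* signal the neighbors */
    MPI_Isend(NULL, 0, MPI_BYTE, right, 0, comm, &reqs[1]);
    MPI_Irecv(NULL, 0, MPI_BYTE, left,  0, comm, &reqs[2]);   /* wait for their signals */
    MPI_Irecv(NULL, 0, MPI_BYTE, right, 0, comm, &reqs[3]);
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
}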
48
49
Avoid Synchronization from Applications
Computations as DAGs
View parallel executions as the directed acyclic graph of the computation
[Figure: task DAGs for a 4 x 4 Cholesky and a 4 x 4 QR factorization]
Slide source: Jack Dongarra
Event Driven LU in UPC
• Assignment of work is static; the schedule is dynamic
• Ordering needs to be imposed on the schedule
  – Critical path operation: panel factorization
• General issue: dynamic scheduling in partitioned memory
  – Can deadlock in memory allocation
  – “Memory-constrained” lookahead
[Figure: event-driven LU task DAG; some edges omitted]
51
DAG Scheduling Outperforms Bulk-Synchronous Style
UPC vs. ScaLAPACK
[Figure: GFlops for ScaLAPACK vs. UPC on 2x4 and 4x4 processor grids]
The UPC LU factorization code adds cooperative (non-preemptive) threads for latency hiding
  – New problem in partitioned memory: allocator deadlock
  – Can run out of memory locally due to an unlucky execution order
PLASMA on shared memory; UPC on partitioned memory
PLASMA by Dongarra et al.; UPC LU joint work with Parry Husbands
Another reason to avoid synchronization
• Processors do not run at the same speed
  – They never did, due to caches
  – Power / temperature management makes this worse
52
Lesson #10: Combine Techniques
These are not entirely orthogonal, and there are many interesting trade-offs.
53
Communication-Avoiding (2.5D) and Overlapped (with UPC) Matrix Multiply
[Figure: SUMMA on Hopper (n=32,768) — percentage of machine peak vs. number of cores (1,536; 6,144; 24,576) for 2.5D-overlap, 2.5D, 2D-overlap, and 2D]
Georganas, González-Domínguez, Solomonik, Zheng, Touriño, Yelick (to appear at SC12)
Lessons in Communication Avoidance
1. Compress data structures
2. Target higher-level loops
3. Understand theory / numerics
4. Replicate data
5. Understand theory / lower bounds
6. Aggregate communication
7. Overlap communication
8. Use one-sided communication
9. Synchronization strength reduction
10. Combine the techniques
55
Challenges in Exascale Computing
There are many exascale challenges:
• Scaling
• Synchronization
• Dynamic system behavior
• Irregular algorithms
• Resilience
But don’t forget what’s really important
56
Communication Hurts!
57