Reducing Communication in Sparse Matrix Operations
2018 Blue Waters Symposium

Luke Olson
Department of Computer Science, University of Illinois at Urbana-Champaign

Collaborators on this allocation: Amanda Bienz, Bill Gropp, Andrew Reisner, and Lukas Spies, all of the University of Illinois at Urbana-Champaign



Sparse Matrix Operations

Applications: time stepping, linear systems, PCA / clustering, and eigen analysis.
(Figures: XPACC @ Illinois; MD Anderson; Fischer @ Illinois; QMCpack)

• Sparse Matrix-Vector multiplication (SpMV): $w \leftarrow A v$
• $C \leftarrow A B$
• $C \leftarrow R A R^T$
• $w \leftarrow A^{-1} v$
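The kernel behind all of these operations is the SpMV $w \leftarrow A v$. The slides do not specify a storage format, so the minimal sketch below assumes the common compressed sparse row (CSR) layout; the names `csr_spmv`, `indptr`, `indices`, and `data` are illustrative, not taken from the talk's code base.

```python
from typing import List


def csr_spmv(indptr: List[int], indices: List[int], data: List[float],
             v: List[float]) -> List[float]:
    """Return w = A v, where A is stored in compressed sparse row (CSR) form."""
    n_rows = len(indptr) - 1
    w = [0.0] * n_rows
    for i in range(n_rows):
        # The nonzeros of row i are data[indptr[i]:indptr[i + 1]],
        # located in columns indices[indptr[i]:indptr[i + 1]].
        acc = 0.0
        for k in range(indptr[i], indptr[i + 1]):
            acc += data[k] * v[indices[k]]
        w[i] = acc
    return w


# A = [[4, -1, 0], [-1, 4, -1], [0, -1, 4]] in CSR form
indptr = [0, 2, 5, 7]
indices = [0, 1, 0, 1, 2, 1, 2]
data = [4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0]
w = csr_spmv(indptr, indices, data, [1.0, 2.0, 3.0])   # [2.0, 4.0, 10.0]
```

Each row touches only its own nonzeros, but the entries of `v` it reads can live anywhere; that access pattern is what turns into communication once rows are spread across processes.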


What is this talk about? (Why it matters)

• 10s, 100s, 1000s, … of SpMVs in a computation

• SpMV is a major kernel, but it has limited efficiency and limited scalability

• Use machine layout (nodes) on Blue Waters to reduce communication

• Use consistent timings on Blue Waters to develop accurate performance models

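Iterative method for solving $Ax = b$:

while not converged:
  $\alpha \leftarrow \langle r, z \rangle / \langle Ap, p \rangle$
  $x \leftarrow x + \alpha p$
  $r_+ \leftarrow r - \alpha A p$
  $z_+ \leftarrow \mathrm{precond}(r_+)$
  $\beta \leftarrow \langle r_+, z_+ \rangle / \langle r, z \rangle$
  $p \leftarrow z_+ + \beta p$

Communication-avoiding (CA) algorithms: see Eller/Gropp

One SpMV ($Ap$) per iteration; 2…10…100 SpMVs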
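The iteration above is the preconditioned conjugate gradient (PCG) method. A minimal sketch follows, assuming a simple Jacobi (diagonal) preconditioner as a stand-in for `precond` (the talk's case study uses AMG) and a small dense matrix as a stand-in for a distributed SpMV; `pcg`, `matvec`, and `jacobi` are illustrative names. It counts SpMVs to make the point that every iteration performs one.

```python
import math
from typing import Callable, List, Tuple

Vec = List[float]


def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def pcg(matvec: Callable[[Vec], Vec], precond: Callable[[Vec], Vec],
        b: Vec, tol: float = 1e-10, maxiter: int = 100) -> Tuple[Vec, int]:
    """Preconditioned conjugate gradient from x = 0; returns (x, SpMVs performed)."""
    x = [0.0] * len(b)
    r = list(b)                 # r = b - A x with x = 0, so no SpMV is needed
    z = precond(r)
    p = list(z)
    rz = dot(r, z)
    n_spmv = 0
    for _ in range(maxiter):
        if math.sqrt(dot(r, r)) < tol:
            break
        Ap = matvec(p)          # the one SpMV per iteration
        n_spmv += 1
        alpha = rz / dot(Ap, p)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]   # r_+
        z = precond(r)                                      # z_+
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, n_spmv


A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]


def matvec(v: Vec) -> Vec:
    # Dense stand-in for a distributed SpMV.
    return [dot(row, v) for row in A]


def jacobi(r: Vec) -> Vec:
    # Diagonal (Jacobi) preconditioner: z = D^{-1} r.
    return [ri / A[i][i] for i, ri in enumerate(r)]


x, n_spmv = pcg(matvec, jacobi, [2.0, 4.0, 10.0])
```

The inner products also need a global reduction in parallel, but the SpMV is the step whose communication pattern depends on the matrix.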


Anatomy of a Sparse Matrix-Vector (SpMV) product

$w \leftarrow A v$, with the rows of $A$, $w$, and $v$ distributed across processes P0, P1, P2, P3
(Figure panels: data layout; where data is sent)

• Solid blocks: on-process portion
• Patterned blocks: off-process portion (requires communication of the input vector)
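The split into on-process and off-process blocks can be sketched as follows, assuming a contiguous rows-per-process layout (as in the basic SpMV) and 8-byte vector values; `laplacian_1d`, `comm_pattern`, and `send_stats` are hypothetical helper names. For each process, the sketch records which entries of $v$ it must receive and from which process, and from that derives the largest number of messages and bytes any one process sends.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def laplacian_1d(n: int) -> Tuple[List[int], List[int], List[float]]:
    """CSR arrays (indptr, indices, data) for tridiag(-1, 2, -1) of size n."""
    indptr, indices, data = [0], [], []
    for i in range(n):
        for j in (i - 1, i, i + 1):
            if 0 <= j < n:
                indices.append(j)
                data.append(2.0 if j == i else -1.0)
        indptr.append(len(indices))
    return indptr, indices, data


def comm_pattern(indptr: List[int], indices: List[int],
                 n_procs: int) -> List[Dict[int, Set[int]]]:
    """recv[p][q] = set of columns j for which process p needs v[j] from process q."""
    n = len(indptr) - 1
    rows_per_proc = -(-n // n_procs)            # ceiling division
    recv: List[Dict[int, Set[int]]] = [defaultdict(set) for _ in range(n_procs)]
    for i in range(n):
        p = i // rows_per_proc                  # process owning row i (and w[i])
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            q = j // rows_per_proc              # process owning v[j]
            if q != p:                          # off-process block: v[j] must be sent
                recv[p][q].add(j)
    return [dict(r) for r in recv]


def send_stats(recv: List[Dict[int, Set[int]]],
               bytes_per_value: int = 8) -> Tuple[int, int]:
    """(max messages sent by any process, max bytes sent by any process)."""
    n_msgs = [0] * len(recv)
    n_bytes = [0] * len(recv)
    for by_sender in recv:
        for q, cols in by_sender.items():
            n_msgs[q] += 1                      # one message from q to this receiver
            n_bytes[q] += bytes_per_value * len(cols)
    return max(n_msgs), max(n_bytes)


indptr, indices, _ = laplacian_1d(8)
recv = comm_pattern(indptr, indices, n_procs=4)
max_msgs, max_bytes = send_stats(recv)
```

For the 8-row tridiagonal matrix on four processes, process 1 needs $v_1$ from process 0 and $v_4$ from process 2, and no process sends more than two messages (16 bytes). Less regular matrices, such as the coarse AMG levels, produce far larger counts.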


Cost of a Sparse Matrix-Vector (SpMV) product

• Modeling is difficult (more later)
• Basic SpMV: rows-per-process layout

(Figure: % of time in communication vs. non-zeros per core (500K, 100K, 50K), for SpMV and all-reduce; communication pattern, process ID vs. process ID, for an SpMV with the nlpkkt240 matrix)


Case Study: Preconditioning (Algebraic Multigrid)

• AMG: Algebraic Multigrid iteratively whittles away at the error

• Series or hierarchy of successively smaller (and more dense) sparse matrices

• SpMV dominated

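Each level's update uses an SpMV with that level's matrix:
$x \leftarrow x + \omega A_0 r$, $x \leftarrow x + \omega A_1 r$, $x \leftarrow x + \omega A_2 r$, then the same sequence again

(Figure: hierarchy levels 0 to 3 with matrices $A_0, A_1, A_2, A_3$ and nnz / n_rows = 30, 64, 66, and 26, respectively)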
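In AMG, each coarser matrix is typically formed with the triple product $C \leftarrow R A R^T$ from the list of sparse operations. A small sketch follows, assuming a dictionary-of-rows sparse format and a hypothetical restriction $R$ that aggregates pairs of rows; the talk does not say how $R$ is built, and `spgemm`, `transpose`, and `galerkin` are illustrative names.

```python
from collections import defaultdict
from typing import Dict

Sparse = Dict[int, Dict[int, float]]    # row index -> {column index: value}


def spgemm(A: Sparse, B: Sparse) -> Sparse:
    """Row-by-row sparse matrix-matrix product C = A B."""
    C: Sparse = {}
    for i, row in A.items():
        acc: Dict[int, float] = defaultdict(float)
        for k, a_ik in row.items():
            for j, b_kj in B.get(k, {}).items():
                acc[j] += a_ik * b_kj
        C[i] = {j: v for j, v in acc.items() if v != 0.0}   # drop explicit zeros
    return C


def transpose(A: Sparse) -> Sparse:
    T: Dict[int, Dict[int, float]] = defaultdict(dict)
    for i, row in A.items():
        for j, v in row.items():
            T[j][i] = v
    return dict(T)


def galerkin(R: Sparse, A: Sparse) -> Sparse:
    """Coarse-level matrix R A R^T."""
    return spgemm(spgemm(R, A), transpose(R))


# Fine level: tridiag(-1, 2, -1) with 4 rows.
A0 = {i: {j: (2.0 if j == i else -1.0) for j in (i - 1, i, i + 1) if 0 <= j < 4}
      for i in range(4)}
# Hypothetical restriction that aggregates rows {0, 1} and {2, 3}.
R = {0: {0: 1.0, 1: 1.0}, 1: {2: 1.0, 3: 1.0}}
A1 = galerkin(R, A0)    # {0: {0: 2.0, 1: -1.0}, 1: {0: -1.0, 1: 2.0}}
```

This 1D example only checks the arithmetic. In a distributed setting, each of the two products is itself a communication-heavy sparse operation, and the resulting coarse rows can couple to more processes than the fine rows did.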


Case Study: Preconditioning (Algebraic Multigrid)

• MFEM discretization

• Linear elasticity

• 8192 cores, 512 nodes, 10k dof / core

(Figure: time (seconds) per level in AMG hierarchy, levels 0 to 25, log scale)

Smaller matrices == more communication


Observation 1: message volume between procs

(Figure, two panels: maximum number of messages and maximum size of messages (bytes), per AMG level, levels 0 to 20)

1. High volume of messages, high number of messages
2. Diminishing returns with more communicating cores
3. Cost: off-node > on-node > on-socket


Observation 2: limits of communication

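$T = \alpha + \dfrac{\mathrm{ppn} \cdot s}{\min(R_N, \mathrm{ppn} \cdot R_B)}$

where $\alpha$ is the latency, $s$ the message size, ppn the number of communicating processes per node, $R_B$ the bandwidth between two processes, and $R_N$ the node injection bandwidth.

Modeling MPI Communication Performance on SMP Nodes: Is it Time to Retire the Ping Pong Test, Gropp, Olson, Samfass, EuroMPI 2016.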
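The max-rate model can be evaluated directly. The sketch below uses illustrative parameter values chosen for round numbers, not measured Blue Waters values; `max_rate_time` is a hypothetical name.

```python
def max_rate_time(s: float, ppn: int, alpha: float, R_B: float, R_N: float) -> float:
    """Max-rate model: time when each of ppn processes on a node sends s bytes off-node.

    alpha: latency (s); R_B: bandwidth between two processes (bytes/s);
    R_N: node injection bandwidth (bytes/s).
    """
    return alpha + (ppn * s) / min(R_N, ppn * R_B)


# Illustrative parameters only, not measured Blue Waters values.
alpha, R_B, R_N = 1e-6, 5e9, 1e10
times = {ppn: max_rate_time(1e6, ppn, alpha, R_B, R_N) for ppn in (1, 2, 4, 8)}
```

With these numbers, going from one to two communicating processes per node costs nothing extra, because $2 R_B = R_N$. Beyond that point the injection bandwidth is saturated and the time grows linearly with ppn, which is the diminishing return in observation 2.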



Observation 3: node locality

(Figure: time (seconds) vs. number of bytes communicated, for network (PPN ≥ 4), network (PPN < 4), on-node, and on-socket messages)

• Split messages by protocol: short, eager, rendezvous
• Partition messages by locality: on-socket, on-node, and off-node
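A sketch of this partitioning, assuming each (locality, protocol) bucket gets its own latency and bandwidth parameters in the model. The cutoffs `SHORT_MAX` and `EAGER_MAX`, the rank placement, and the default of 16 processes per node on two sockets are hypothetical choices for illustration, not values from the talk.

```python
# Hypothetical protocol cutoffs in bytes; real short/eager/rendezvous limits
# depend on the MPI implementation and its configuration.
SHORT_MAX = 512
EAGER_MAX = 8192


def protocol(n_bytes: int) -> str:
    """Protocol bucket a message of n_bytes falls into under the cutoffs above."""
    if n_bytes <= SHORT_MAX:
        return "short"
    if n_bytes <= EAGER_MAX:
        return "eager"
    return "rendezvous"


def locality(src: int, dst: int, ppn: int, sockets_per_node: int) -> str:
    """Classify a message, assuming ranks are packed node by node, then socket by socket."""
    per_socket = ppn // sockets_per_node
    if src // ppn != dst // ppn:
        return "off-node"
    if src // per_socket != dst // per_socket:
        return "on-node"
    return "on-socket"


def model_key(src: int, dst: int, n_bytes: int, ppn: int = 16, sockets_per_node: int = 2):
    """Which (locality, protocol) set of model parameters applies to this message."""
    return locality(src, dst, ppn, sockets_per_node), protocol(n_bytes)
```

Splitting this way lets each bucket be fit separately, so a cheap on-socket short message is not modeled with the same parameters as a large off-node rendezvous transfer.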



Anatomy of a node level SpMV product

(Figure: six processes, P0 to P5, distributed across three nodes, N0 to N2; the linear system ($w$, $A$, $v$) distributed across the processes)


Standard Communication

(Figure: two nodes of cores; processes n, m, and q)



New Algorithm: On-Node Communication

(Figure: processes n and p)


New Algorithm: Off-Node Communication

(Figure: processes n, m, p, and q)



Node-Aware Parallel (NAP) Matrix Operation

1.) Redistribute initial values
2.) Inter-node communication
3.) Redistribute received values
4.) On-node communication
5.) Local computation with on-process, on-node, and off-node portions of the matrix

Note: step 4 and portions of step 5 can overlap with steps 1, 2, and 3.

(Figure: processes n, m, p, and q at each step)
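The benefit of the node-aware ordering shows up in the inter-node traffic. The sketch below counts inter-node messages and values for a given communication pattern under two simplifying assumptions: ranks are numbered node by node with `ppn` ranks per node, and in the node-aware scheme each pair of nodes exchanges a single aggregated message with duplicate values removed. It ignores the extra on-node messages of steps 1, 3, and 4, which observation 3 suggests are cheaper. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

# (sending rank, receiving rank) -> indices of the vector values it sends
Pattern = Dict[Tuple[int, int], Set[int]]


def standard_internode(pattern: Pattern, ppn: int) -> Tuple[int, int]:
    """(messages, values) crossing the network when every rank sends directly."""
    msgs = vals = 0
    for (src, dst), entries in pattern.items():
        if src // ppn != dst // ppn:
            msgs += 1
            vals += len(entries)
    return msgs, vals


def node_aware_internode(pattern: Pattern, ppn: int) -> Tuple[int, int]:
    """(messages, values) crossing the network if each node pair exchanges one
    aggregated message with duplicate values removed."""
    by_node_pair: Dict[Tuple[int, int], Set[int]] = defaultdict(set)
    for (src, dst), entries in pattern.items():
        n, m = src // ppn, dst // ppn
        if n != m:
            by_node_pair[(n, m)] |= entries
    return len(by_node_pair), sum(len(e) for e in by_node_pair.values())


# Two nodes with two ranks each: node 0 = ranks {0, 1}, node 1 = ranks {2, 3}.
pattern: Pattern = {
    (0, 2): {0, 1}, (0, 3): {0, 1},   # rank 0 sends the same values to both ranks of node 1
    (1, 2): {2}, (1, 3): {3},
    (0, 1): {0},                      # on-node; not counted as inter-node traffic
}
standard = standard_internode(pattern, ppn=2)     # (4, 6)
node_aware = node_aware_internode(pattern, ppn=2) # (1, 4)
```

Four network messages carrying six values become one message carrying four, because values needed by several ranks on the same destination node cross the network only once.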


Case Study: Preconditioning (Algebraic Multigrid)

(Figure, two panels, off-node and on-node: maximum number of messages sent from any process on 16,384 processes, per AMG level, levels 0 to 20, for the reference SpMV and TAPSpMV)


Case Study: Preconditioning (Algebraic Multigrid)

(Figure, two panels, off-node and on-node: maximum size of messages (bytes) sent from any process on 16,384 processes, per AMG level, levels 0 to 20, for the reference SpMV and TAPSpMV)


Case Study: Preconditioning (Algebraic Multigrid)

(Figure, two panels: total time (seconds) per AMG level, levels 0 to 20; strong scaling, time (seconds) vs. number of processes; both comparing the reference SpMV and TAPSpMV)

Node aware sparse matrix-vector multiplication, Bienz, Gropp, Olson, in review JPDC, 2018. arXiv.


Cost analysis on Blue Waters

• Blue Waters provided a unique setting for two aspects:
  1. Model MPI queueing times
  2. Model network contention

(Figure, two panels: time (seconds) vs. number of messages communicated, for message sizes from 16 to 262,144 bytes)

• The MPI_Irecv message queue is costly
• Identified a quadratic cost in the number of messages
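One way a quadratic cost can arise is shown by the toy model below, assuming posted receives are kept in a linear list that is scanned from the front for each arriving message, as in list-based MPI matching queues. This is a sketch of the mechanism, not the model fitted in the talk; `queue_search_cost` is a hypothetical name.

```python
from typing import List


def queue_search_cost(arrival_order: List[int]) -> int:
    """Comparisons needed to match arriving messages against a linear queue of posted receives.

    Receives for tags 0..n-1 are posted in increasing order. Each arriving message
    is matched by scanning the queue from the front; the matched entry is removed.
    """
    posted = sorted(arrival_order)          # receives posted in tag order
    comparisons = 0
    for tag in arrival_order:
        idx = posted.index(tag)             # linear scan of the queue
        comparisons += idx + 1
        posted.pop(idx)
    return comparisons


in_order = queue_search_cost(list(range(100)))
reversed_order = queue_search_cost(list(range(99, -1, -1)))
```

In-order arrival of 100 messages costs 100 comparisons, while reverse-order arrival costs 100 + 99 + … + 1 = 5050, which grows as $n^2/2$ in the number of messages $n$.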


Cost analysis on Blue Waters


• Network contention is costly
• Identified a hop model

(Figure, two panels: time (seconds) vs. number of messages communicated, for message sizes from 16 to 262,144 bytes; labels G0, G1, G2, G3)
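The hop model itself is not spelled out on the slides. As a hedged illustration of the quantity such a model builds on, the sketch below counts minimal hops between routers on a 3D torus (Blue Waters uses a Cray Gemini 3D torus network) and weights each message by its size. The torus dimensions and the bytes-times-hops contention proxy are assumptions for illustration; `torus_hops` and `byte_hops` are hypothetical names.

```python
from typing import Iterable, Sequence, Tuple

Coord = Tuple[int, int, int]


def torus_hops(a: Sequence[int], b: Sequence[int], dims: Sequence[int]) -> int:
    """Minimal hop count between two routers on a torus with wraparound links."""
    hops = 0
    for x, y, length in zip(a, b, dims):
        d = abs(x - y)
        hops += min(d, length - d)     # take the shorter way around each ring
    return hops


def byte_hops(messages: Iterable[Tuple[Coord, Coord, int]], dims: Coord) -> int:
    """Sum over messages of size * hops: a simple proxy for load placed on the network."""
    return sum(size * torus_hops(src, dst, dims) for src, dst, size in messages)


dims = (8, 8, 8)   # hypothetical torus size
load = byte_hops([((0, 0, 0), (1, 0, 0), 100), ((0, 0, 0), (0, 2, 3), 10)], dims)
```

Messages that travel more hops occupy more links, so two patterns with identical message counts and sizes can place very different loads on the network.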


Cost analysis on Blue Waters


(Figure: time (seconds) per level in AMG hierarchy, levels 0 to 6: measured time compared with the max-rate, queue search, and contention models)

Improving Performance Models for Irregular Point-to-Point Communication, Bienz, Gropp, Olson, in review EuroMPI, 2018.


Summary and Ongoing Work

• Drop-in replacement for a range of sparse matrix operations (SpMV, SpMM, MIS(k), assembly operations, etc.)
• Blue Waters was instrumental in testing at scale, reproducible outcomes, and accurate performance analysis.
• (this) Code base: https://github.com/lukeolson/raptor

• Structured code base: https://github.com/cedar-framework/cedar

This research is part of the Blue Waters sustained petascale computing project, which is supported by the National Science Foundation (awards OCI-0725070 and ACI-1238993) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications.