
Page 1

Linear Algebraic Primitives for Parallel Computing on Large Graphs

Aydın Buluç, Ph.D. Defense, March 16, 2010

Committee: John R. Gilbert (Chair), Ömer Eğecioğlu, Fred Chong, Shivkumar Chandrasekaran

Page 2

Sources of Massive Graphs

(WWW snapshot, courtesy Y. Hyun) (Yeast protein interaction network, courtesy H. Jeong)

Graphs naturally arise from the internet and from social interactions.

Many scientific (biological, chemical, cosmological, ecological, etc.) datasets are modeled as graphs.

Page 3

Types of Graph Computations

- Tightly coupled. Tool: graph traversal. Examples: centrality, shortest paths, network flows, strongly connected components.
- Filtering based. Tool: map/reduce. Examples: loop and multi-edge removal, triangle/rectangle enumeration.
- Fuzzy intersection. Examples: clustering, algebraic multigrid.

[Figure: a small example graph (vertices 1-7) for the traversal-based computations and a map/reduce dataflow for the filtering-based ones.]

Page 4

Tightly Coupled Computations

A. Computationally intensive graph/data mining algorithms.

(e.g. graph clustering, centrality, dimensionality reduction)

B. Inherently latency-bound computations.

(e.g. finding s-t connectivity)

Huge graphs + expensive kernels ⇒ the need for high performance and massive parallelism.

Let the special-purpose architectures handle (B), but what can we do for (A) with commodity architectures?

We need memory! Disk is not an option!

Pages 5-8

Software for Graph Computation

"... spending ten years of my life on the TeX project is that software is hard. It's harder than anything ..." (Donald Knuth)

Dealing with software is hard!

High performance computing (HPC) software is harder!

Dealing with parallel HPC software?

Page 9

Tightly Coupled Computations


Parallel Graph Software

Shared memory scales only so far, so there it pays off to target productivity.

Distributed memory should target at least p > 100.

Shared and distributed memory complement each other.

Coarse-grained parallelism for distributed memory.

Page 10

Outline

The Case for Primitives

The Case for Sparse Matrices

Parallel Sparse Matrix-Matrix Multiplication

Software Design of the Combinatorial BLAS

An Application in Social Network Analysis

Other Work

Future Directions


Page 11

All-Pairs Shortest Paths

Input: a weighted graph. Find least-cost paths between all reachable vertex pairs. Classical algorithm: Floyd-Warshall.

Case study: implementation on a multicore architecture, the graphics processing unit (GPU).

    for k = 1:n              % the induction sequence
        for i = 1:n
            for j = 1:n
                if w(i,k) + w(k,j) < w(i,j)
                    w(i,j) = w(i,k) + w(k,j);
                end
            end
        end
    end

[Figure: a 5-vertex example graph and its 5 x 5 cost matrix, showing the k = 1 case of the update.]
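A minimal runnable sketch of the same update in Python/NumPy (the vectorized inner loops and the dense cost-matrix setup are illustrative assumptions; this is not the GPU implementation discussed on the following slides):

    import numpy as np

    def floyd_warshall(w):
        # w[i, j]: edge cost; np.inf where there is no edge; 0 on the diagonal.
        n = w.shape[0]
        for k in range(n):                       # the induction sequence
            # w[i, j] = min(w[i, j], w[i, k] + w[k, j]) for all i, j at once
            w = np.minimum(w, w[:, [k]] + w[[k], :])
        return w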

Page 12


GPU characteristics

Powerful: two Nvidia 8800s > 1 TFLOPS

Inexpensive: $500 each

But:

- Difficult programming model: one instruction stream drives 8 arithmetic units.
- Performance is counterintuitive and fragile: the memory access pattern has subtle effects on cost.
- Extremely easy to underutilize the device: doing it wrong easily costs 100x in time.

[Figure: threads t1 through t16 grouped under shared instruction streams.]

Page 13


Recursive All-Pairs Shortest Paths

Partition the matrix into blocks: W = [A B; C D]. Then, with × the (min,+) matrix product and + the elementwise min:

    A = A*;                  % recursive call
    B = A*B;  C = C*A;  D = D + C*B;
    D = D*;                  % recursive call
    B = B*D;  C = D*C;  A = A + B*C;

Based on R-Kleene algorithm

Well suited for the GPU architecture (see the sketch below):
- Fast matrix-multiply kernel
- In-place computation => low memory bandwidth
- Few, large MatMul calls => low GPU dispatch overhead
- Recursion stack on the host CPU, not on the multicore GPU
- Careful tuning of the GPU code
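For concreteness, a sequential sketch of this recursion over the (min,+) semiring in Python/NumPy. The dense min-plus helper, the splitting rule, and the nonnegative-weight base case are illustrative assumptions; the GPU version instead calls a tuned matrix-multiply kernel:

    import numpy as np

    def minplus(X, Y):
        # (min,+) "matrix multiply": out[i, j] = min_k X[i, k] + Y[k, j]
        return (X[:, :, None] + Y[None, :, :]).min(axis=1)

    def apsp(W):
        # W: dense cost matrix, np.inf for missing edges, nonnegative weights.
        n = W.shape[0]
        if n == 1:
            return np.minimum(W, 0.0)            # closure of a 1x1 block
        m = n // 2
        A, B = W[:m, :m], W[:m, m:]
        C, D = W[m:, :m], W[m:, m:]
        A = apsp(A)                              # A = A*   (recursive call)
        B = minplus(A, B)                        # B = A*B
        C = minplus(C, A)                        # C = C*A
        D = np.minimum(D, minplus(C, B))         # D = D + C*B
        D = apsp(D)                              # D = D*   (recursive call)
        B = minplus(B, D)                        # B = B*D
        C = minplus(D, C)                        # C = D*C
        A = np.minimum(A, minplus(B, C))         # A = A + B*C
        out = np.empty_like(W)
        out[:m, :m], out[:m, m:] = A, B
        out[m:, :m], out[m:, m:] = C, D
        return out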

Page 14

The Case for Primitives

APSP: experiments and observations

[Figure: APSP performance on the GPU, comparing a direct lifting of Floyd-Warshall to the GPU against the right primitive; the chart's callout marks a 480x difference.]

Conclusions:
- High performance is achievable but not simple.
- Carefully chosen and optimized primitives will be key.
- Unorthodox R-Kleene algorithm.

Page 15

Outline

The Case for Primitives

The Case for Sparse Matrices

Parallel Sparse Matrix-Matrix Multiplication

Software Design of the Combinatorial BLAS

An Application in Social Network Analysis

Other Work

Future Directions


Page 16

Sparse Adjacency Matrix and Graph

Every graph is a sparse matrix and vice-versa

Adjacency matrix: sparse array w/ nonzeros for graph edges

Storage-efficient implementation from sparse data structures

[Figure: a 7-vertex directed graph and its sparse adjacency matrix, shown transposed (A^T) next to a vector x.]
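A small illustration of that correspondence in Python with SciPy (the edge list here is a made-up example, not the graph in the figure):

    import scipy.sparse as sp

    # Directed edges (source, destination) of a tiny example graph.
    edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 0)]
    rows, cols = zip(*edges)

    # Adjacency matrix: one stored nonzero per edge.
    A = sp.csr_matrix(([1] * len(edges), (rows, cols)), shape=(4, 4))

    # The out-neighbors of vertex 0 are exactly the nonzero columns of row 0.
    print(A.getrow(0).indices)   # -> [1 2]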

Page 17

The Case for Sparse Matrices

Many irregular applications contain sufficient coarse-grained parallelism that can ONLY be exploited using abstractions at the proper level.

Traditional graph computations vs. graphs in the language of linear algebra:

- Data driven; unpredictable communication.  ->  Fixed communication patterns.
- Irregular and unstructured; poor locality of reference.  ->  Operations on matrix blocks; exploits the memory hierarchy.
- Fine-grained data accesses; dominated by latency.  ->  Coarse-grained parallelism; bandwidth limited.

Page 18

Identification of Primitives

Linear Algebraic Primitives

- Sparse matrix-matrix multiplication (SpGEMM)
- Sparse matrix-vector multiplication
- Sparse matrix indexing
- Element-wise operations (.*)

Matrices over semirings, e.g. (x, +), (and, or), (+, min).
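To make "matrices over semirings" concrete, here is a purely illustrative Python sketch in which matrix-vector multiplication takes the add and multiply operations as parameters; instantiating it with (+, min) gives one shortest-path relaxation step. The function and variable names are assumptions, not the Combinatorial BLAS interface:

    import math

    def spmv_semiring(A, x, add, mul, zero):
        # y = A x over an arbitrary semiring; A is a dict-of-dicts sparse matrix.
        y = {}
        for i, row in A.items():
            acc = zero
            for j, a_ij in row.items():
                acc = add(acc, mul(a_ij, x[j]))
            y[i] = acc
        return y

    # Tropical semiring (add = min, multiply = +): relax distances to vertex 2.
    A = {0: {1: 2.0}, 1: {2: 3.0}, 2: {0: 1.0}}      # A[i][j] = weight of edge i -> j
    x = {0: math.inf, 1: math.inf, 2: 0.0}           # current distances to vertex 2
    y = spmv_semiring(A, x, add=min, mul=lambda a, b: a + b, zero=math.inf)
    print(y)   # {0: inf, 1: 3.0, 2: inf}: vertex 1 now reaches vertex 2 at cost 3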

Page 19

Why focus on SpGEMM?

Applications of sparse GEMM:
- Graph clustering (MCL, peer pressure)
- Subgraph / submatrix indexing
- Shortest path calculations
- Betweenness centrality
- Graph contraction
- Cycle detection
- Multigrid interpolation & restriction
- Colored intersection searching
- Applying constraints in finite element computations
- Context-free parsing ...

MCL main loop (MATLAB-style):

    % Loop until convergence
    A = A * A;              % expand
    A = A .^ 2;             % inflate
    sc = 1 ./ sum(A, 1);
    A = A * diag(sc);       % renormalize columns
    A(A < 0.0001) = 0;      % prune entries
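The same loop sketched with SciPy sparse matrices, where the expansion step is exactly the SpGEMM primitive (the function name, prune threshold, and convergence handling are illustrative assumptions):

    import numpy as np
    import scipy.sparse as sp

    def mcl_step(A, inflation=2.0, prune=1e-4):
        # One expand / inflate / renormalize / prune step of MCL on a CSR matrix.
        A = A @ A                             # expand: sparse GEMM
        A = A.power(inflation)                # inflate: elementwise power
        col_sums = np.asarray(A.sum(axis=0)).ravel()
        col_sums[col_sums == 0] = 1.0         # guard against empty columns
        A = A @ sp.diags(1.0 / col_sums)      # renormalize columns
        A.data[A.data < prune] = 0.0          # prune small entries
        A.eliminate_zeros()
        return A.tocsr()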

Page 20

Outline

The Case for Primitives

The Case for Sparse Matrices

Parallel Sparse Matrix-Matrix Multiplication

Software Design of the Combinatorial BLAS

An Application in Social Network Analysis

Other Work

Future Directions


Page 21

Two Versions of Sparse GEMM

- 1D block-column distribution: C_i = C_i + A * B_i.
- Checkerboard (2D block) distribution: C_ij += A_ik * B_kj.

[Figure: the 1D block-column layout (block columns A_1 ... A_8, B_1 ... B_8, C_1 ... C_8) and the 2D checkerboard layout.]
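A toy serial emulation of the 2D formulation with SciPy, looping over the k stages of C_ij += A_ik * B_kj. The grid size q and the serial loops are illustrative assumptions; the real algorithm assigns one block per processor and communicates the A_ik and B_kj blocks along processor rows and columns:

    import scipy.sparse as sp

    def blocked_spgemm(A, B, q):
        # C = A * B on a q x q grid of sparse blocks (assumes q divides n).
        n = A.shape[0]
        bs = n // q
        blk = lambda M, i, j: M[i * bs:(i + 1) * bs, j * bs:(j + 1) * bs]
        rows = []
        for i in range(q):
            row = []
            for j in range(q):
                acc = sp.csr_matrix((bs, bs))
                for k in range(q):            # the k stages: C_ij += A_ik * B_kj
                    acc = acc + blk(A, i, k) @ blk(B, k, j)
                row.append(acc)
            rows.append(row)
        return sp.bmat(rows, format="csr")    # equals A @ B up to rounding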

Page 22

Comparative Speedup of Sparse 1D & 2D


In practice, 2D algorithms have the potential to scale, but not linearly

Projected performances of Sparse 1D & 2D

[Figure: projected performance of the sparse 1D and 2D algorithms as functions of problem size N and processor count P, using the model E = W / (p (T_comp + T_comm)).]

Page 23

Compressed Sparse Columns (CSC): A Standard Layout

For an n x n matrix with nnz nonzeros, CSC:
- stores entries in column-major order,
- is a dense collection of sparse columns,
- uses O(n + nnz) storage.

[Figure: the column-pointer and row-index arrays of an example matrix.]
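The three CSC arrays are easy to inspect with SciPy (the 3 x 3 matrix below is a made-up example, not the one on the slide):

    import numpy as np
    import scipy.sparse as sp

    M = np.array([[4, 0, 0],
                  [1, 3, 0],
                  [0, 2, 5]])
    A = sp.csc_matrix(M)

    print(A.indptr)    # column pointers, length n + 1:  [0 2 4 5]
    print(A.indices)   # row indices of the nonzeros:    [0 1 1 2 2]
    print(A.data)      # nonzero values:                 [4 1 3 2 5]
    # Storage: (n + 1) + 2 * nnz entries, i.e. O(n + nnz).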

Page 24

Node Level Considerations

Submatrices are hypersparse (i.e. nnz << n).

With a 2D distribution on p processors the matrix is split into √p x √p blocks, with an average of c nonzeros per column overall. Each block's column-pointer array alone has length n/√p, so the total storage across all p blocks is O(n√p + nnz).

A data structure or algorithm that depends on the matrix dimension n (e.g. CSR or CSC) is asymptotically too wasteful for submatrices


Page 25

Sequential Hypersparse Kernel

- Strictly O(nnz) data structure
- Outer-product formulation
- Work-efficient

Time complexity:
- Standard kernel: O(flops + nnz(B) + m + n)
- New hypersparse kernel: O(flops · lg ni + nzc(A) + nzr(B)), with no dependence on the matrix dimension n
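A hedged, dictionary-based sketch of the outer-product formulation in Python (purely illustrative; the actual kernel operates on DCSC structures, not Python dictionaries):

    from collections import defaultdict

    def spgemm_outer(A_cols, B_rows):
        # C = A * B as a sum of outer products: C = sum_k A(:, k) * B(k, :).
        # A_cols[k] = {i: value} is column k of A; B_rows[k] = {j: value} is row k of B.
        C = defaultdict(float)
        # Only indices k where both A(:, k) and B(k, :) are nonempty contribute.
        for k in A_cols.keys() & B_rows.keys():
            for i, a_ik in A_cols[k].items():
                for j, b_kj in B_rows[k].items():
                    C[i, j] += a_ik * b_kj
        return dict(C)

    A_cols = {2: {0: 1.0, 3: 2.0}}            # A has nonzeros only in column 2
    B_rows = {2: {1: 5.0}}                    # B has nonzeros only in row 2
    print(spgemm_outer(A_cols, B_rows))       # {(0, 1): 5.0, (3, 1): 10.0}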

Page 26

Scaling Results for SpGEMM

R-MAT x R-MAT product (graphs with high variance in vertex degree).

Random permutations are useful for the overall computation (<10% imbalance).

Bulk-synchronous algorithms may still suffer from imbalance within the stages, but this goes away as the matrices get larger.

[Figure: SpGEMM speedup vs. number of processors (up to 1600) for R-MAT inputs: SCALE 21 (16-way), SCALE 22 (16-way), SCALE 22 (4-way), and SCALE 23 (16-way).]

Page 27

Outline

The Case for Primitives

The Case for Sparse Matrices

Parallel Sparse Matrix-Matrix Multiplication

Software Design of the Combinatorial BLAS

An Application in Social Network Analysis

Other Work

Future Directions


Page 28

Software Design of the Combinatorial BLAS

Generality: of the numeric type of the matrix elements, of the algebraic operation performed, and of the library interface, without the language abstraction penalty: C++ templates.

- Mixed-precision arithmetic: type traits
- Enforcing the interface and strong type checking: CRTP
- General semiring operations: function objects

    template <class IT, class NT, class DER>
    class SpMat;

Abstraction penalty is not just a programming language issue.

In particular, do not view matrices as element-indexed data structures; stay away from single-element access (the interface should discourage it).

Page 29

Extendability: extend the library while maintaining compatibility and seamless upgrades. ➡ Decouple the parallel logic from the sequential part.

Sequential layer (SpSeq): concrete formats are Tuples, CSC, and DCSC.

Commonalities:
- Support the sequential API
- Composed of a number of arrays
- Each *may* support an iterator concept

Parallel layer: SpPar<Comm, SpSeq>, which can host any parallel logic: asynchronous, bulk synchronous, etc.

Page 30

Outline

The Case for Primitives

The Case for Sparse Matrices

Parallel Sparse Matrix-Matrix Multiplication

Software Design of the Combinatorial BLAS

An Application in Social Network Analysis

Other Work

Future Directions


Page 31

Applications and Algorithms

Betweenness Centrality (BC). C_B(v): among all shortest paths, what fraction of them pass through the vertex of interest? Computed with Brandes' algorithm.
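Formally, with σ_st the number of shortest s-t paths and σ_st(v) the number of those that pass through v, the standard definition is

    C_B(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}}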

[Figure: a typical software stack for a social network analysis application enabled with the Combinatorial BLAS.]

Page 32

Betweenness Centrality using Sparse GEMM

Each level of the (multi-source) breadth-first search is one masked sparse matrix product,

    (A^T X) .* ¬X

i.e. multiply the transposed adjacency matrix by the frontier matrix X, then zero out vertices already in X.

[Figure: a 7-vertex example graph illustrating one BFS level.]

Parallel breadth-first search is implemented with sparse matrix-matrix multiplication

Work-efficient algorithm for BC.
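A sketch of that masked expansion with SciPy for a single BFS root (the tiny example graph and variable names are assumptions; the actual code batches hundreds of roots into a sparse matrix X and uses SpGEMM):

    import numpy as np
    import scipy.sparse as sp

    # Tiny directed example graph: edge u -> v stored as A[u, v] = 1.
    edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
    r, c = zip(*edges)
    A = sp.csr_matrix((np.ones(len(edges)), (r, c)), shape=(5, 5))

    frontier = np.zeros(5, dtype=bool)
    frontier[0] = True                        # BFS rooted at vertex 0
    visited = frontier.copy()
    level = 0
    while frontier.any():
        reached = A.T @ frontier              # expand: one step of A^T x
        frontier = (reached > 0) & ~visited   # mask: keep only unvisited vertices
        if not frontier.any():
            break
        level += 1
        visited |= frontier
        print("level", level, ":", np.flatnonzero(frontier))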

Page 33

BC Performance on Distributed Memory

- TEPS: traversed edges per second
- Batch of 512 vertices at each iteration
- Code only a few lines longer than the MATLAB version

[Figure: BC performance, in millions of TEPS, vs. number of cores (25 to 484) for inputs of scale 17, 18, 19, and 20.]

Input: R-MAT of scale N (2^N vertices, average degree 8).

Pure MPI-1 version. No reliance on any particular hardware.

Page 34

More scaling and sensitivity (Betweenness Centrality)

The input is an R-MAT graph of scale 22 for the batch-size sensitivity experiment.

[Figures: "Sensitivity to batch processing": millions of TEPS vs. number of processing cores (196 to 900) for batch sizes 128, 256, 512, and 1024. "Scaling beyond hundreds": millions of TEPS vs. number of cores (196 to 1216) for scales 21 to 24, using a batch of 512 vertices.]

Page 35

Outline

The Case for Primitives

The Case for Sparse Matrices

Parallel Sparse Matrix-Matrix Multiplication

Software Design of the Combinatorial BLAS

An Application in Social Network Analysis

Other Work

Future Directions


Page 36

SpMV on Multicore

Our parallel algorithms for y = Ax and y = A^T x using the new compressed sparse blocks (CSB) layout have O(√n · lg n) span and Θ(nnz) work, yielding Θ(nnz / (√n · lg n)) parallelism.

[Figure: SpMV performance, in MFlops/sec, on 1 to 8 processors: our CSB algorithms vs. Star-P (CSR + block-row distribution) vs. serial naive CSR.]

Page 37

Outline

The Case for Primitives

The Case for Sparse Matrices

Parallel Sparse Matrix-Matrix Multiplication

Software Design of the Combinatorial BLAS

An Application in Social Network Analysis

Other Work

Future Directions


Page 38

Future Directions

‣ Novel scalable algorithms

‣ Static graphs are just the beginning: dynamic graphs, hypergraphs, tensors.

‣ Architectures (mainly the nodes) are evolving: heterogeneous multicores, and homogeneous multicores with more cores per node.
  - TACC Lonestar (2006): 4 cores/node
  - TACC Ranger (2008): 16 cores/node
  - SDSC Triton (2009): 32 cores/node
  - XYZ Resource (2020): ?

Hierarchical parallelism

Page 39

‣What did I learn about parallel graph computations?

‣No silver bullets.

‣Graph computations are not homogenous.

‣No free lunch.

‣ If communication >> computation for your algorithm, overlapping them will NOT make it scale.

‣Scale your algorithm, not your implementation.

‣ Any sustainable speedup is a success !

Page 40

Related Publications

- Hypersparsity in the 2D decomposition and the sequential kernel: Buluç and Gilbert, "On the Representation and Multiplication of Hypersparse Matrices".
- Parallel analysis of sparse GEMM and the synchronous implementation: Buluç and Gilbert, "Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication".
- The case for primitives, APSP on the GPU: Buluç, Gilbert, and Budak, "Solving Path Problems on the GPU", Parallel Computing, 2009.
- SpMV on multicores: Buluç, Fineman, Frigo, Gilbert, and Leiserson, "Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication Using Compressed Sparse Blocks".
- Betweenness centrality results: Buluç and Gilbert.
- Software design of the library: Buluç and Gilbert.

Page 41

David Bader, Erik Boman, Ceren Budak, Alan Edelman, Jeremy Fineman, Matteo Frigo, Bruce Hendrickson, Jeremy Kepner, Charles Leiserson, Kamesh Madduri, Steve Reinhardt, Eric Robinson, Viral Shah, Sivan Toledo