Faster parallel GraphBLAS kernels and new graph algorithms in matrix algebra
Aydın Buluç, Computational Research Division, Berkeley Lab (LBNL), and EECS, University of California, Berkeley. October 14, 2016
Large Graphs in Scientific Discoveries
[Figure: a 5×5 sparse matrix A and its permuted form PA with a heavy diagonal.]
Matching in bipartite graphs: permuting to heavy diagonal or block triangular form.
Graph partitioning: dynamic load balancing in parallel simulations.
Picture (left) credit: Sanders and Schulz.
Problem size: as big as the sparse linear system to be solved or the simulation to be performed.
The case for distributed memory
Large Graphs in Scientific Discoveries
Schatz et al. (2010) Perspective: Assembly of Large Genomes w/2nd-Gen Seq. Genome Res. (figure reference)
Whole genome assembly (vertices: k-mers or reads). Graph-theoretical analysis of brain connectivity.
26 billion (8B of which are non-erroneous) unique k-mers (vertices) in the hexaploid wheat genome W7984 for k=51.
Potentially millions of neurons and billions of edges with developing technologies.
The case for distributed memory
Outline
• GraphBLAS: standard building blocks for graph algorithms in the language of linear algebra
• SpGEMM: Computing the sparse matrix-matrix multiplication in parallel
• Triangle counting/enumeration in matrix algebra
• Bipartite graph matching in parallel
• Other contributions and future work
The case for sparse matrices
Many irregular applications contain coarse-grained parallelism that can be exploited by abstractions at the proper level.
Traditional graph computations vs. graphs in the language of linear algebra:
• Data driven, unpredictable communication vs. fixed communication patterns
• Irregular and unstructured, poor locality of reference vs. operations on matrix blocks that exploit the memory hierarchy
• Fine-grained data accesses, dominated by latency vs. coarse-grained parallelism, bandwidth limited
Linear-algebraic primitives for graphs:
• Sparse matrix × sparse matrix
• Sparse matrix × sparse vector
• Element-wise operations
• Sparse matrix indexing
Is think-like-a-vertex really more productive? "Our mission is to build up a linear algebra sense to the extent that vector-level thinking becomes as natural as scalar-level thinking." - Charles Van Loan
Examples of semirings in graph algorithms:
• Real field (R, +, ×): classical numerical linear algebra
• Boolean algebra ({0, 1}, |, &): graph traversal
• Tropical semiring (R ∪ {∞}, min, +): shortest paths
• (S, select, select): select subgraph, or contract nodes to form a quotient graph
• (edge/vertex attributes, vertex data aggregation, edge data processing): schema for user-specified computation at vertices and edges
• (R, max, +): graph matching and network alignment
• (R, min, ×): maximal independent set

• Shortened semiring notation: (Set, Add, Multiply); both identities omitted.
• Add traverses edges; Multiply combines edges/paths at a vertex.
• Neither add nor multiply needs to have an inverse.
• Both add and multiply are associative, and multiply distributes over add.
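To make the (Set, Add, Multiply) notation concrete, here is a minimal sketch, assuming a hand-rolled helper `semiring_spmv` and a toy 4-vertex graph (neither is part of any GraphBLAS library): a sparse matrix-vector product parameterized by a semiring, instantiated with the tropical (min, +) semiring to relax shortest-path distances.

```python
import numpy as np
from scipy.sparse import csr_matrix

def semiring_spmv(A, x, add, multiply, identity):
    """y = A 'times' x over a user-supplied semiring: add/multiply are scalar
    functions and identity is the additive identity (value of an empty sum)."""
    indptr, indices, data = A.indptr, A.indices, A.data
    y = np.full(A.shape[0], identity, dtype=float)
    for i in range(A.shape[0]):
        for k in range(indptr[i], indptr[i + 1]):
            y[i] = add(y[i], multiply(data[k], x[indices[k]]))
    return y

# Tropical semiring (min, +): repeated relaxation gives single-source shortest paths.
# Toy 4-vertex graph; A[j, i] holds the weight of edge i -> j.
rows, cols, wts = [0, 0, 1, 2], [1, 2, 3, 3], [3.0, 1.0, 4.0, 2.0]
A = csr_matrix((wts, (cols, rows)), shape=(4, 4))
dist = np.array([0.0, np.inf, np.inf, np.inf])       # distances from vertex 0
for _ in range(3):                                    # n-1 relaxation rounds suffice
    step = semiring_spmv(A, dist, add=min, multiply=lambda a, b: a + b,
                         identity=np.inf)
    dist = np.minimum(dist, step)
print(dist)                                           # [0. 3. 1. 3.]
```

Swapping in (|, &) over {0, 1} or (min, select) changes the graph operation without changing the matrix-vector structure, which is exactly the point of the semiring abstraction.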
Breadth-first search in the language of matrices
[Figure: BFS on a 7-vertex example graph. The frontier vector x is multiplied by the transposed adjacency matrix (y = A^T x); newly discovered vertices record their parents in a parents vector.]
Particular semiring operations: Multiply: select, Add: minimum.
[Figure: subsequent BFS iterations. Each product A^T x expands the current frontier; when several frontier vertices reach the same vertex, the one with the minimum label is selected as its parent, and the parents vector fills in until the traversal completes.]
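A minimal sketch of the idea on these slides, with BFS written as repeated matrix-vector products over a semiring whose multiply selects a candidate parent and whose add keeps the minimum. The function name, the explicit frontier set, and the example graph are illustrative only, not the CombBLAS/GraphBLAS interface.

```python
import numpy as np
from scipy.sparse import csr_matrix

def bfs_parents(A, source):
    """BFS as repeated 'matrix-vector products' over a (min, select) semiring:
    multiply selects the frontier vertex's id as a candidate parent, add keeps
    the minimum candidate. A is a CSR adjacency matrix (row j = out-neighbors of j)."""
    parents = np.full(A.shape[0], -1, dtype=int)
    parents[source] = source
    frontier = {source}
    while frontier:
        nxt = {}
        for j in frontier:                              # frontier vertex j
            for i in A.indices[A.indptr[j]:A.indptr[j + 1]]:
                if parents[i] == -1:                    # undiscovered vertex i
                    cand = j                            # "multiply": select parent id j
                    if i not in nxt or cand < nxt[i]:   # "add": keep the minimum parent
                        nxt[i] = cand
        for i, p in nxt.items():
            parents[i] = p
        frontier = set(nxt)
    return parents

# Illustrative 7-vertex undirected graph (not exactly the slides' example).
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4), (4, 5), (5, 6)]
r = [u for u, v in edges] + [v for u, v in edges]
c = [v for u, v in edges] + [u for u, v in edges]
A = csr_matrix((np.ones(len(r)), (r, c)), shape=(7, 7))
print(bfs_parents(A, 0))    # [0 0 0 1 3 4 5]
```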
Graph Algorithms on GraphBLAS
GraphBLAS primitives, in increasing arithmetic intensity:
• Sparse matrix-sparse vector product (SpMSpV)
• Sparse matrix-dense vector product (SpMV)
• Sparse matrix times multiple dense vectors (SpMM)
• Sparse-sparse matrix product (SpGEMM)
• Sparse-dense matrix product (SpDM3)

Graph algorithms built on them:
• Shortest paths (all-pairs, single-source, temporal)
• Graph clustering (Markov cluster, peer pressure, spectral, local)
• Centrality (PageRank, betweenness, closeness)
• Miscellaneous: connectivity, traversal (BFS), independent sets (MIS), graph matching
Some GraphBLAS basic functions

Function (CombBLAS equiv) | Parameters | Returns | Matlab notation
MxM (SpGEMM) | sparse matrices A and B; optional unary functs | sparse matrix | C = A * B
MxV (SpM{Sp}V) | sparse matrix A; sparse/dense vector x | sparse/dense vector | y = A * x
ewisemult, add, … (SpEWiseX) | sparse matrices or vectors; binary funct, optional unarys | in place or sparse matrix/vector | C = A .* B, C = A + B
reduce (Reduce) | sparse matrix A and funct | dense vector | y = sum(A, op)
extract (SpRef) | sparse matrix A; index vectors p and q | sparse matrix | B = A(p, q)
assign (SpAsgn) | sparse matrices A and B; index vectors p and q | none | A(p, q) = B
buildMatrix (Sparse) | list of edges/triples (i, j, v) | sparse matrix | A = sparse(i, j, v, m, n)
extractTuples (Find) | sparse matrix A | edge list | [i, j, v] = find(A)
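For readers more at home in Python than Matlab, roughly the same operations can be spelled with scipy.sparse. The toy matrices below are purely illustrative, and scipy is not a GraphBLAS implementation (it fixes the (+, ×) semiring and has no masks), so this is only a rough correspondence to the table above.

```python
import numpy as np
from scipy.sparse import csr_matrix, random as sprand

A = sprand(5, 5, density=0.4, format='csr', random_state=0)
B = sprand(5, 5, density=0.4, format='csr', random_state=1)
x = np.arange(5, dtype=float)

C  = A @ B                              # MxM (SpGEMM):         C = A * B
y  = A @ x                              # MxV (SpMV):           y = A * x
E1 = A.multiply(B)                      # ewisemult (SpEWiseX): C = A .* B
E2 = A + B                              # ewiseadd:             C = A + B
s  = np.asarray(A.sum(axis=1)).ravel()  # reduce (Reduce):      y = sum(A, op)
Sub = A[[0, 2], :][:, [1, 3]]           # extract (SpRef):      B = A(p, q)
i, j = A.nonzero()                      # extractTuples (Find): [i, j, v] = find(A)
v = np.asarray(A[i, j]).ravel()
T = csr_matrix((v, (i, j)), shape=A.shape)  # buildMatrix (Sparse): sparse(i, j, v, m, n)
# assign (SpAsgn), i.e. A(p, q) = B, needs a format that supports item
# assignment such as lil_matrix; it is omitted here.
```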
Performance of Linear Algebraic Graph Algorithms
Combinatorial BLAS was the fastest among all tested graph processing frameworks on 3 out of 4 benchmarks in an independent study by Intel. The linear algebra abstraction enables high performance, within 4X of native performance for PageRank and collaborative filtering.
Satish, Nadathur, et al. "Navigating the Maze of Graph Analytics Frameworks Using Massive Graph Datasets", in SIGMOD'14.
The Graph BLAS effort
• The GraphBLAS Forum: http://graphblas.org • IEEE Workshop on Graph Algorithms Building Blocks (at IPDPS):
http://www.graphanalysis.org/workshop2017.html
Abstract-- It is our view that the state of the art in constructing a large collection of graph algorithms in terms of linear algebraic operations is mature enough to support the emergence of a standard set of primitive building blocks. This paper is a position paper defining the problem and announcing our intention to launch an open effort to define this standard.
Outline
• GraphBLAS: standard building blocks for graph algorithms in the language of linear algebra
• SpGEMM: Computing the sparse matrix-matrix multiplication in parallel
• Triangle counting/enumeration in matrix algebra
• Bipartite graph matching in parallel
• Other contributions and future work
Multiple-source breadth-first search
• Sparse array representation => space efficient
• Sparse matrix-matrix multiplication => work efficient
• Three possible levels of parallelism: searches, vertices, edges
• Highly-parallel implementation for Betweenness Centrality*
*: A measure of influence in graphs, based on shortest paths
[Figure: the transposed adjacency matrix A^T and the multi-source frontier matrix B on the 7-vertex example graph.]
[Figure: the product A^T · B advances all searches at once; each column of B is one BFS frontier.]
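A minimal sketch of the multiple-source idea, assuming a frontier matrix B whose column s is the current frontier of search s, so that one sparse product A^T · B advances every search by a level. The function name and the small path-graph example are illustrative only.

```python
import numpy as np
from scipy.sparse import csr_matrix

def multi_source_bfs_levels(A, sources):
    """Level-synchronous BFS from several sources at once. Column s of the
    frontier matrix B is the current frontier of search s, so a single
    product A^T * B (one SpGEMM) advances every search by one level."""
    n, k = A.shape[0], len(sources)
    levels = np.full((n, k), -1, dtype=int)
    visited = np.zeros((n, k), dtype=bool)
    B = csr_matrix((np.ones(k), (sources, np.arange(k))), shape=(n, k))
    level = 0
    while B.nnz:
        r, c = B.nonzero()
        levels[r, c] = level
        visited[r, c] = True
        Nxt = A.T @ B                        # expand all frontiers in one product
        r, c = Nxt.nonzero()
        keep = ~visited[r, c]                # drop vertices already discovered
        B = csr_matrix((np.ones(int(keep.sum())), (r[keep], c[keep])), shape=(n, k))
        level += 1
    return levels

# Illustrative path graph 0-1-2-3, with searches started from vertices 0 and 3.
edges = [(0, 1), (1, 2), (2, 3)]
r = [u for u, v in edges] + [v for u, v in edges]
c = [v for u, v in edges] + [u for u, v in edges]
A = csr_matrix((np.ones(len(r)), (r, c)), shape=(4, 4))
print(multi_source_bfs_levels(A, [0, 3]))
# [[0 3]
#  [1 2]
#  [2 1]
#  [3 0]]
```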
Sparse Matrix-Matrix Multiplication
Why sparse matrix-matrix multiplication? Used for algebraic multigrid, graph clustering, betweenness centrality, graph contraction, subgraph extraction, cycle detection, quantum chemistry, high-dimensional similarity search, …
How do dense and sparse GEMM compare?
• Dense: lower bounds match algorithms; extensive data reuse is possible.
• Sparse: significant gap between lower bounds and algorithms; inherently poor reuse?
What do we obtain here? An improved (higher) lower bound and new optimal algorithms, but only for random matrices: Erdős-Rényi(n, d) graphs, a.k.a. G(n, p = d/n).
2D Algorithm: Sparse SUMMA
[Figure: blocked product C = A · B on 100K × 100K matrices; each block C_ij is accumulated from matching blocks of A and B.]
2D algorithm: Sparse SUMMA (based on dense SUMMA); a general implementation that handles rectangular matrices:
Cij += HyperSparseGEMM(Arecv, Brecv)
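A serial emulation of the Sparse SUMMA stage loop, only to show where Arecv and Brecv come from; in the real distributed algorithm those blocks arrive via broadcasts along processor rows and columns, and the local multiply is the hypersparse kernel. The grid size q and the helper name are assumptions made for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix, random as sprand

def sparse_summa_emulated(A, B, q):
    """Serial emulation of 2D Sparse SUMMA on a q x q processor grid.
    A and B are split into q x q sparse blocks; in stage k the owners of the
    A-blocks in block-column k broadcast along their processor row (Arecv) and
    the owners of the B-blocks in block-row k broadcast along their processor
    column (Brecv); every 'processor' (i, j) accumulates Cij += Arecv * Brecv."""
    rb = np.array_split(np.arange(A.shape[0]), q)   # row ranges of the grid
    cb = np.array_split(np.arange(B.shape[1]), q)   # column ranges of the grid
    kb = np.array_split(np.arange(A.shape[1]), q)   # inner ranges, one per stage
    C = [[csr_matrix((len(rb[i]), len(cb[j]))) for j in range(q)]
         for i in range(q)]
    for k in range(q):                              # SUMMA stages
        for i in range(q):
            Arecv = A[rb[i], :][:, kb[k]]           # block A_{i,k}
            for j in range(q):
                Brecv = B[kb[k], :][:, cb[j]]       # block B_{k,j}
                C[i][j] = C[i][j] + Arecv @ Brecv   # local "HyperSparseGEMM"
    return C

# Sanity check against an ordinary sparse product.
A = sprand(60, 60, density=0.05, format='csr', random_state=0)
B = sprand(60, 60, density=0.05, format='csr', random_state=1)
C = sparse_summa_emulated(A, B, q=3)
full = np.block([[C[i][j].toarray() for j in range(3)] for i in range(3)])
assert np.allclose(full, (A @ B).toarray())
```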
[Plot: runtime (seconds) vs. number of cores (4 to 8100) for R-MAT inputs of scale 21 to 24.]
2D Sparse SUMMA on square inputs: almost linear scaling until bandwidth costs start to dominate; scaling proportional to √p afterwards.
R-MAT, edgefactor 8, a=0.6, b=c=d=0.4/3. NERSC/Franklin Cray XT4.
Aydın Buluç and John R. Gilbert. Parallel sparse matrix-matrix multiplication and indexing: Implementation and experiments. SIAM Journal on Scientific Computing (SISC), 2012.
The computation cube and sparsity-independent algorithms
Matrix multiplication: ∀(i,j) ∈ n × n, C(i,j) = Σ_k A(i,k)·B(k,j)
The computation (discrete) cube:
• A face for each (input/output) matrix
• A grid point for each multiplication
1D algorithms, 2D algorithms, 3D algorithms. How about sparse algorithms?
Sparsity-independent algorithms: assigning grid points to processors is independent of the sparsity structure.
- In particular: if C_ij is nonzero, who holds it?
- All standard algorithms are sparsity independent.
Assumptions:
- Sparsity-independent algorithms
- Input (and output) are sparse
- The algorithm is load balanced
Algorithms attaining lower bounds
Previous lower bound for sparse classical SpGEMM [Ballard et al., SIMAX'11]:
W = Ω( #FLOPs / (P·√M) ) = Ω( d²n / (P·√M) )
New lower bound for Erdős-Rényi(n, d) [here], expected (under some technical assumptions): a higher bound of the form Ω(min{…, …}).
✗ No previous algorithm attains this bound!
✓ Two new algorithms achieve the bounds (up to a logarithmic factor):
i. Recursive 3D, based on [Ballard et al., SPAA'12]
ii. Iterative 3D, based on [Solomonik & Demmel, EuroPar'11]
Ballard, B., Demmel, Grigori, Lipshitz, Schwartz, and Toledo. Communication optimal parallel multiplication of sparse random matrices. In SPAA 2013.
3D parallel SpGEMM in a nutshell
Ariful Azad, Grey Ballard, B., James Demmel, Laura Grigori, Oded Schwartz, Sivan Toledo, Samuel Williams. Exploiting multiple levels of parallelism in sparse matrix-matrix multiplication. SIAM Journal on Scientific Computing (SISC), to appear.
[Figure: the 3D algorithm splits the processor grid into layers; slices A::1, A::2, A::3 of A are distributed across layers, each layer computes a partial product with an All-to-All exchange, and the intermediate results C^intermediate are merged into C^final.]
C^int_ijk = Σ_{l=1..p/c} A_ilk · B_ljk
3D SpGEMM performance
[Plot: time (sec) vs. number of cores (64 to 16384) for nlpkkt160 × nlpkkt160 on Edison, comparing 2D (t=1, 3, 6) and 3D (c=4, 8, 16; t=1, 3, 6) variants; performance improves as the number of 2D threads and of 3D layers and threads increases.]
Strong scaling of different variants of the 2D and 3D algorithms when squaring the nlpkkt160 matrix on Edison.
2D (non-threaded) is the previous state of the art.
3D (threaded), first presented here, beats it by 8X at large concurrencies.
3D SpGEMM performance
• Matrix squaring (A*A), proxy for Markov clustering (MCL)
2004 web crawl of the .it domain: 41 million vertices, 1.1 billion edges.
SSCA (R-‐MAT) matrices on Titan
[Plots: time (sec) vs. number of cores. Left: it-2004 (A·A) on 1024 to 16384 cores, comparing 2D (t=1, 6) and 3D (c=8, 16; t=1, 6). Right: SSCA × SSCA on Titan with c=16, t=8, for scales 24 to 27 on 512 to 65536 cores.]
Outline
• GraphBLAS: standard building blocks for graph algorithms in the language of linear algebra
• SpGEMM: Computing the sparse matrix-matrix multiplication in parallel
• Triangle counting/enumeration in matrix algebra
• Bipartite graph matching in parallel
• Other contributions and future work
Counting triangles (clustering coefficient)
[Figure: example graph on 6 vertices containing 4 triangles.]
Clustering coefficient:
• Pr (wedge i-j-k makes a triangle with edge i-k)
• 3 * # triangles / # wedges
• 3 * 4 / 19 = 0.63 in example
• may want to compute for each vertex j
Cohen’s algorithm to count triangles:
- Count triangles by lowest-degree vertex.
- Enumerate “low-hinged” wedges.
- Keep wedges that close.
Counting triangles (clustering coefficient)
[Figure: the adjacency matrix A split into L and U, the wedge-count matrix B = L·U, and the masked result C.]
A = L + U (hi→lo + lo→hi)
L × U = B (wedge, low hinge)
A ∧ B = C (closed wedge)
sum(C)/2 = 4 triangles
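The four-line recipe above can be checked directly with scipy.sparse. The small graph below is illustrative (not the slide's example), and the degree-based vertex ordering that makes Cohen's algorithm efficient is omitted for brevity; it affects performance, not the count.

```python
import numpy as np
from scipy.sparse import csr_matrix, tril, triu

def count_triangles(A):
    """Cohen's recipe as on the slide: A = L + U, B = L*U enumerates
    low-hinged wedges, A .* B keeps the wedges that close, and every
    triangle is counted twice."""
    L = tril(A, k=-1, format='csr')
    U = triu(A, k=1, format='csr')
    B = L @ U                      # wedge counts, hinged at the lowest vertex
    C = A.multiply(B)              # keep only wedges closed by an edge of A
    return int(C.sum()) // 2

# Illustrative graph: two triangles sharing the edge (1, 2), plus a pendant edge.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]
r = [u for u, v in edges] + [v for u, v in edges]
c = [v for u, v in edges] + [u for u, v in edges]
A = csr_matrix((np.ones(len(r)), (r, c)), shape=(5, 5))
print(count_triangles(A))          # 2
```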
Masked sparse matrix-matrix multiplication (SpGEMM)
Special SpGEMM: the nonzero structure of the product is contained in the original adjacency matrix A. Avoid communicating nonzeros that fall outside this structure via C = MaskedSpGEMM(L, U, A), which is mathematically equivalent to performing B = L·U followed by C = A ∧ B (a small sketch follows below).
[Figure: masked product on the 7-vertex example; only entries falling within the pattern of A are needed.]
Ariful Azad, B., and John R. Gilbert. "Parallel triangle counting and enumeration using matrix algebra". Graph Algorithm Building Blocks (GABB), IPDPSW, 2015.
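A sketch of what the masked product computes, assuming a naive entry-at-a-time formulation: for each nonzero of the mask A, take the dot product of the corresponding row of L and column of U, and never form product entries outside the mask. This is only the mathematical specification, not the parallel algorithm of the paper.

```python
import numpy as np
from scipy.sparse import csr_matrix, tril, triu

def masked_spgemm(L, U, M):
    """Compute C = (L * U) .* M without forming product entries outside the
    mask M: for every nonzero position (i, j) of M, take the dot product of
    row i of L with column j of U."""
    Lr, Uc = L.tocsr(), U.tocsc()
    rows, cols = M.nonzero()
    vals = np.empty(len(rows))
    for t, (i, j) in enumerate(zip(rows, cols)):
        vals[t] = (Lr.getrow(i) @ Uc.getcol(j))[0, 0]   # restricted dot product
    return csr_matrix((vals, (rows, cols)), shape=M.shape)

# Check the equivalence claimed on the slide on a small random adjacency matrix.
rng = np.random.default_rng(0)
D = np.triu((rng.random((8, 8)) < 0.3).astype(float), 1)
A = csr_matrix(D + D.T)                               # symmetric 0/1 adjacency
L, U = tril(A, -1, format='csr'), triu(A, 1, format='csr')
C_masked = masked_spgemm(L, U, A)
C_full   = A.multiply(L @ U)
assert np.allclose(C_masked.toarray(), C_full.toarray())
```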
Distributed memory performance of triangle counting
[Plots: speedup vs. number of cores (1 to 256) for (a) basic triangle counting and (b) masked triangle counting with Bloom filters, on coPapersDBLP, mouse-gene, cage15, and europe_osm.]
• Improved 1D algorithm (SPAA’13 by Ballard, Buluc, et al.), which had not been implemented and evaluated before.
• Scales reasonably well up to 512 cores of NERSC/Edison. Further scaling requires 2D/3D algorithms (ongoing work).
• Bloom filters are used in (b) to signal the presence/absence of nonzeros in the output, to avoid communication.
Outline
• GraphBLAS: standard building blocks for graph algorithms in the language of linear algebra
• SpGEMM: Computing the sparse matrix-matrix multiplication in parallel
• Triangle counting/enumeration in matrix algebra
• Bipartite graph matching in parallel
• Other contributions and future work
Graph vs. Bipartite Graph
• A graph consists of vertices and edges: G(V, E).
• A bipartite graph: vertices are grouped into two sets; there is no edge between vertices within a set.
• Notation: n = #vertices, m = #edges.
[Figure: a graph and a bipartite graph with vertex sets {x1, x2, ...} and {y1, y2, ...}.]
A Matching in a Graph
Matching: A subset M of edges with no common end vertices.
|M| = Cardinality of the matching M
[Figure: a matching of maximal cardinality vs. a maximum-cardinality matching, with matched/unmatched vertices and edges highlighted.]
Maximum-cardinality matching
• Augmenting path: a path that alternates between matched and unmatched edges, with unmatched end points.
• Algorithm: search for augmenting paths and flip the edges along them to increase the matching (see the sketch after the figure below).
[Figure: augmenting the matching by flipping edges along an augmenting path.]
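A minimal serial sketch of the augmenting-path idea (a DFS-based, Kuhn-style search, not the parallel MS-BFS/tree-grafting algorithm of the following slides); the adjacency-list representation and the tiny instance are illustrative only.

```python
def max_bipartite_matching(adj_x, n_y):
    """Maximum-cardinality bipartite matching by repeated single-source
    augmenting-path search (DFS variant). adj_x[i] lists the Y-neighbors of
    X-vertex i; n_y is the number of Y-vertices."""
    match_y = [-1] * n_y                      # match_y[j] = X-vertex matched to y_j

    def try_augment(i, visited):
        for j in adj_x[i]:
            if j in visited:
                continue
            visited.add(j)
            # y_j is free, or its current mate can be re-matched elsewhere:
            if match_y[j] == -1 or try_augment(match_y[j], visited):
                match_y[j] = i                # flip edges along the augmenting path
                return True
        return False

    matched = 0
    for i in range(len(adj_x)):               # one search per X-vertex
        if try_augment(i, set()):
            matched += 1
    return matched, match_y

# Small illustrative instance: x0-{y0,y1}, x1-{y0}, x2-{y1,y2}.
print(max_bipartite_matching([[0, 1], [0], [1, 2]], 3))   # (3, [1, 0, 2])
```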
Single-source (SS) search for augmenting paths
1. Initial matching.
2. Search for an augmenting path from x3; stop when an unmatched vertex is found. Discard the whole tree when no augmenting path is found.
3. Increase the matching by flipping edges in the augmenting path.
Repeat the process for other unmatched vertices.
[Figure: the three steps on the example bipartite graph.]
Multi-source (MS) search for augmenting paths
1. Initial matching.
2. Search for vertex-disjoint augmenting paths from x3 and x1; grow a tree until an unmatched vertex is found in it. A tree cannot be discarded when no augmenting path is found.
3. Increase the matching by flipping edges in the augmenting paths.
Repeat the process until no augmenting path is found.
[Figure: the search forest and vertex-disjoint augmenting paths on the example graph.]
Tree Grafting
[Figure: (a) a maximal matching in a bipartite graph; (b) an alternating BFS forest with active and renewable trees, augmenting within the forest; (c) tree grafting; (d) continuing the BFS toward unvisited vertices.]
Comparison with state-of-the-art
[Plots: relative performance of Push-Relabel, Pothen-Fan, and MS-BFS-Graft on (a) one core and (b) 40 cores of an Intel Westmere-EX, over scientific computing and scale-free/network matrices (kkt_power, hugetrace, delaunay, rgg_n24_s0, coPapersDBLP, amazon0312, cit-Patents, RMAT, road_usa, wb-edu, web-Google, wikipedia).]
Sources of performance
[Plot: relative performance of MS-BFS, MS-BFS (direction-optimized), and MS-BFS-Graft on scientific, scale-free, and network matrices.]
1. Direction-optimized BFS (1.6x); 2. Tree grafting (3x). On 40 cores of an Intel Westmere-EX.
Highest performance improvement on graphs with low matching number
Ariful Azad, B., Alex Pothen. A parallel tree grafting algorithm for maximum cardinality matching in bipartite graphs. In Proceedings of IPDPS, 2015. Ariful Azad, B., Alex Pothen. Computing maximum cardinality matchings in parallel on bipartite graphs via tree-grafting. IEEE Transactions on Parallel and Distributed Systems (TPDS), 2016.
Matching in distributed memory
[Plots: strong scaling of distributed-memory maximum cardinality matching (time in seconds vs. 16 to 2048 cores) on ljournal, cage15, road_usa, nlpkkt200, hugetrace, delaunay_n24, and HV15R; and the effect of initialization (Dynamic Mindegree, Karp-Sipser, Greedy, No Init) on maximal and maximum matching time for nlpkkt200, road_usa, GL7d19, and wikipedia.]
Ariful Azad and Aydin Buluç. Distributed-memory algorithms for maximum cardinality matching in bipartite graphs. In Proceedings of the IPDPS, 2016
Outline
• GraphBLAS: standard building blocks for graph algorithms in the language of linear algebra
• SpGEMM: Computing the sparse matrix-matrix multiplication in parallel
• Triangle counting/enumeration in matrix algebra
• Bipartite graph matching in parallel
• Other contributions and future work
HipMer: An Extreme-Scale De Novo Genome Assembler
Meraculous assembly pipeline (reads → k-mers → contigs → scaffolds):
• New k-mer analysis filters errors using a probabilistic Bloom filter
• Graph algorithm (connected components) scales to 15K cores on NERSC's Edison
• Scaffolding using scalable alignment
• New fast & parallel I/O
The Meraculous assembler is used in production at the Joint Genome Institute:
• Wheat assembly is a "grand challenge"
• The hardest part is contig generation (a large in-memory hash table that represents the graph)
• HipMer is an efficient parallelization of Meraculous that leverages the capabilities of Unified Parallel C (UPC)
• Performance improvement from days to minutes
[Plot: seconds vs. number of cores (1920 to 23040) for the HipMer pipeline stages (k-mer analysis, contig generation, scaffolding, I/O) and overall time with and without cached I/O, against ideal scaling.]
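To make the "probabilistic Bloom filter" step concrete, here is a toy sketch of the standard two-pass trick, not HipMer's actual data structure, hash functions, or parameters: a k-mer seen for the first time only sets bits in a compact Bloom filter, and it is inserted into the real, memory-hungry count table only when seen again, so most single-occurrence (likely erroneous) k-mers never reach the table.

```python
import hashlib

class BloomFilter:
    """Tiny illustrative Bloom filter: set membership with false positives
    but no false negatives, using a fixed bit array and n_hashes hashes."""
    def __init__(self, m_bits=1 << 20, n_hashes=3):
        self.m, self.k = m_bits, n_hashes
        self.bits = bytearray(m_bits // 8 + 1)

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.blake2b(item.encode(), person=str(i).encode()).digest()
            yield int.from_bytes(h[:8], 'little') % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

# First pass over k-mers: only k-mers seen at least twice enter the count table.
kmers = ["ACGTA", "CGTAC", "ACGTA", "TTTTT", "ACGTA", "CGTAC"]
bloom, counts = BloomFilter(), {}
for km in kmers:
    if km in bloom:                          # (probably) seen before
        counts[km] = counts.get(km, 1) + 1   # count the implied first occurrence too
    else:
        bloom.add(km)
print(counts)                                # {'ACGTA': 3, 'CGTAC': 2}
```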
Future Work: Machine Learning
GraphBLAS functions (in increasing arithmetic intensity): SpMSpV, SpMV, SpMM, SpGEMM, SpDM3.
Higher-level machine learning tasks built on them:
• Graphical model structure learning (e.g., CONCORD)
• Clustering (e.g., MCL, spectral clustering)
• Logistic regression, support vector machines
• Dimensionality reduction (e.g., NMF, CX/CUR, PCA)
Future Work: Metagenomics
• Microbiomes: dynamic consortia of hundreds or thousands of microbial species and strains of varying abundance and diversity.
• Metagenomics: the application of high-throughput genome sequencing technologies to DNA extracted from microbiomes.
Challenges:
- Metagenome assembly is computationally harder than single-genome assembly
- Protein clustering at this scale has never been done before
Figure credit: Natalia Ivanova (JGI)
Acknowledgments
Ariful Azad, David Bader, Grey Ballard, Scott Beamer, Jarrod Chapman, James Demmel, Rob Egan, Evangelos Georganas, John Gilbert, Laura Grigori, Steve Hofmeyr, Costin Iancu, Jeremy Kepner, Penporn Koanantakool, Benjamin Lipshitz, Tim Mattson, Scott McMillan, Henning Meyerhenke, Jose Moreira, Sang Oh, Lenny Oliker, John Owens, Alex Pothen, Dan Rokhsar, Oded Schwartz, Sivan Toledo, Sam Williams, Carl Yang, Kathy Yelick.
Work is funded by