Faster parallel GraphBLAS kernels and new graph algorithms in matrix algebra
Aydın Buluç, Computational Research Division, Berkeley Lab (LBNL), and EECS, University of California, Berkeley. October 14, 2016
Large Graphs in Scientific Discoveries
[Figure: a 5×5 sparse matrix A and its permuted form PA with a heavy diagonal.]
Matching in bipartite graphs: permuting to heavy diagonal or block triangular form.
Graph partitioning: dynamic load balancing in parallel simulations.
Picture (left) credit: Sanders and Schulz.
Problem size: as big as the sparse linear system to be solved or the simulation to be performed.
The case for distributed memory
Large Graphs in Scientific Discoveries
Schatz et al. (2010) Perspective: Assembly of Large Genomes w/2nd-Gen Seq. Genome Res. (figure reference)
Whole genome assembly (vertices: k-mers or reads). Graph-theoretical analysis of brain connectivity.
26 billion (8B of which are non-erroneous) unique k-mers (vertices) in the hexaploid wheat genome W7984 for k=51.
Potentially millions of neurons and billions of edges with developing technologies.
The case for distributed memory
Outline
• GraphBLAS: standard building blocks for graph algorithms in the language of linear algebra
• SpGEMM: Computing the sparse matrix-matrix multiplication in parallel
• Triangle counting/enumeration in matrix algebra
• Bipartite graph matching in parallel
• Other contributions and future work
The case for sparse matrices
Many irregular applications contain coarse-grained parallelism that can be exploited by abstractions at the proper level.
Traditional graph computations vs. graphs in the language of linear algebra:
• Data driven, unpredictable communication vs. fixed communication patterns
• Irregular and unstructured, poor locality of reference vs. operations on matrix blocks that exploit the memory hierarchy
• Fine-grained data accesses, dominated by latency vs. coarse-grained parallelism, bandwidth limited
Linear-algebraic primitives for graphs:
• Sparse matrix × sparse matrix
• Sparse matrix × sparse vector
• Element-wise operations
• Sparse matrix indexing
Is think-like-a-vertex really more productive? "Our mission is to build up a linear algebra sense to the extent that vector-level thinking becomes as natural as scalar-level thinking." - Charles Van Loan
Examples of semirings in graph algorithms:
• Real field (R, +, ×): classical numerical linear algebra
• Boolean algebra ({0, 1}, |, &): graph traversal
• Tropical semiring (R ∪ {∞}, min, +): shortest paths
• (S, select, select): select subgraph, or contract nodes to form a quotient graph
• (edge/vertex attributes, vertex data aggregation, edge data processing): schema for user-specified computation at vertices and edges
• (R, max, +): graph matching and network alignment
• (R, min, ×): maximal independent set

• Shortened semiring notation: (Set, Add, Multiply); both identities omitted.
• Add traverses edges; Multiply combines edges/paths at a vertex.
• Neither add nor multiply needs to have an inverse.
• Both add and multiply are associative, and multiply distributes over add.
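To make the (Set, Add, Multiply) notation concrete, here is a minimal sketch, assuming a hand-rolled helper `semiring_spmv` and a toy 4-vertex graph (neither is part of any GraphBLAS library): a sparse matrix-vector product parameterized by a semiring, instantiated with the tropical (min, +) semiring to relax shortest-path distances.

```python
import numpy as np
from scipy.sparse import csr_matrix

def semiring_spmv(A, x, add, multiply, identity):
    """y = A 'times' x over a user-supplied semiring: add/multiply are scalar
    functions and identity is the additive identity (value of an empty sum)."""
    indptr, indices, data = A.indptr, A.indices, A.data
    y = np.full(A.shape[0], identity, dtype=float)
    for i in range(A.shape[0]):
        for k in range(indptr[i], indptr[i + 1]):
            y[i] = add(y[i], multiply(data[k], x[indices[k]]))
    return y

# Tropical semiring (min, +): repeated relaxation gives single-source shortest paths.
# Toy 4-vertex graph; A[j, i] holds the weight of edge i -> j.
rows, cols, wts = [0, 0, 1, 2], [1, 2, 3, 3], [3.0, 1.0, 4.0, 2.0]
A = csr_matrix((wts, (cols, rows)), shape=(4, 4))
dist = np.array([0.0, np.inf, np.inf, np.inf])       # distances from vertex 0
for _ in range(3):                                    # n-1 relaxation rounds suffice
    step = semiring_spmv(A, dist, add=min, multiply=lambda a, b: a + b,
                         identity=np.inf)
    dist = np.minimum(dist, step)
print(dist)                                           # [0. 3. 1. 3.]
```

Swapping in (|, &) over {0, 1} or (min, select) changes the graph operation without changing the matrix-vector structure, which is exactly the point of the semiring abstraction.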
Breadth-first search in the language of matrices
[Figure: BFS on a 7-vertex example graph. The frontier vector x is multiplied by the transposed adjacency matrix (y = A^T x); newly discovered vertices record their parents in a parents vector.]
Particular semiring operations: Multiply: select, Add: minimum.
[Figure: subsequent BFS iterations. Each product A^T x expands the current frontier; when several frontier vertices reach the same vertex, the one with the minimum label is selected as its parent, and the parents vector fills in until the traversal completes.]
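A minimal sketch of the idea on these slides, with BFS written as repeated matrix-vector products over a semiring whose multiply selects a candidate parent and whose add keeps the minimum. The function name, the explicit frontier set, and the example graph are illustrative only, not the CombBLAS/GraphBLAS interface.

```python
import numpy as np
from scipy.sparse import csr_matrix

def bfs_parents(A, source):
    """BFS as repeated 'matrix-vector products' over a (min, select) semiring:
    multiply selects the frontier vertex's id as a candidate parent, add keeps
    the minimum candidate. A is a CSR adjacency matrix (row j = out-neighbors of j)."""
    parents = np.full(A.shape[0], -1, dtype=int)
    parents[source] = source
    frontier = {source}
    while frontier:
        nxt = {}
        for j in frontier:                              # frontier vertex j
            for i in A.indices[A.indptr[j]:A.indptr[j + 1]]:
                if parents[i] == -1:                    # undiscovered vertex i
                    cand = j                            # "multiply": select parent id j
                    if i not in nxt or cand < nxt[i]:   # "add": keep the minimum parent
                        nxt[i] = cand
        for i, p in nxt.items():
            parents[i] = p
        frontier = set(nxt)
    return parents

# Illustrative 7-vertex undirected graph (not exactly the slides' example).
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4), (4, 5), (5, 6)]
r = [u for u, v in edges] + [v for u, v in edges]
c = [v for u, v in edges] + [u for u, v in edges]
A = csr_matrix((np.ones(len(r)), (r, c)), shape=(7, 7))
print(bfs_parents(A, 0))    # [0 0 0 1 3 4 5]
```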
Graph Algorithms on GraphBLAS
GraphBLAS primitives, in increasing arithmetic intensity:
• Sparse matrix-sparse vector product (SpMSpV)
• Sparse matrix-dense vector product (SpMV)
• Sparse matrix times multiple dense vectors (SpMM)
• Sparse-sparse matrix product (SpGEMM)
• Sparse-dense matrix product (SpDM3)

Graph algorithms built on them:
• Shortest paths (all-pairs, single-source, temporal)
• Graph clustering (Markov cluster, peer pressure, spectral, local)
• Centrality (PageRank, betweenness, closeness)
• Miscellaneous: connectivity, traversal (BFS), independent sets (MIS), graph matching
Some GraphBLAS basic functions

Function (CombBLAS equiv) | Parameters | Returns | Matlab notation
MxM (SpGEMM) | sparse matrices A and B; optional unary functs | sparse matrix | C = A * B
MxV (SpM{Sp}V) | sparse matrix A; sparse/dense vector x | sparse/dense vector | y = A * x
ewisemult, add, … (SpEWiseX) | sparse matrices or vectors; binary funct, optional unarys | in place or sparse matrix/vector | C = A .* B, C = A + B
reduce (Reduce) | sparse matrix A and funct | dense vector | y = sum(A, op)
extract (SpRef) | sparse matrix A; index vectors p and q | sparse matrix | B = A(p, q)
assign (SpAsgn) | sparse matrices A and B; index vectors p and q | none | A(p, q) = B
buildMatrix (Sparse) | list of edges/triples (i, j, v) | sparse matrix | A = sparse(i, j, v, m, n)
extractTuples (Find) | sparse matrix A | edge list | [i, j, v] = find(A)
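For readers more at home in Python than Matlab, roughly the same operations can be spelled with scipy.sparse. The toy matrices below are purely illustrative, and scipy is not a GraphBLAS implementation (it fixes the (+, ×) semiring and has no masks), so this is only a rough correspondence to the table above.

```python
import numpy as np
from scipy.sparse import csr_matrix, random as sprand

A = sprand(5, 5, density=0.4, format='csr', random_state=0)
B = sprand(5, 5, density=0.4, format='csr', random_state=1)
x = np.arange(5, dtype=float)

C  = A @ B                              # MxM (SpGEMM):         C = A * B
y  = A @ x                              # MxV (SpMV):           y = A * x
E1 = A.multiply(B)                      # ewisemult (SpEWiseX): C = A .* B
E2 = A + B                              # ewiseadd:             C = A + B
s  = np.asarray(A.sum(axis=1)).ravel()  # reduce (Reduce):      y = sum(A, op)
Sub = A[[0, 2], :][:, [1, 3]]           # extract (SpRef):      B = A(p, q)
i, j = A.nonzero()                      # extractTuples (Find): [i, j, v] = find(A)
v = np.asarray(A[i, j]).ravel()
T = csr_matrix((v, (i, j)), shape=A.shape)  # buildMatrix (Sparse): sparse(i, j, v, m, n)
# assign (SpAsgn), i.e. A(p, q) = B, needs a format that supports item
# assignment such as lil_matrix; it is omitted here.
```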
Performance of Linear Algebraic Graph Algorithms
Combinatorial BLAS was the fastest among all tested graph processing frameworks on 3 out of 4 benchmarks in an independent study by Intel. The linear algebra abstraction enables high performance, within 4X of native performance for PageRank and collaborative filtering.
Satish, Nadathur, et al. "Navigating the Maze of Graph Analytics Frameworks Using Massive Graph Datasets", in SIGMOD'14.
The Graph BLAS effort
• The GraphBLAS Forum: http://graphblas.org • IEEE Workshop on Graph Algorithms Building Blocks (at IPDPS):
http://www.graphanalysis.org/workshop2017.html
Abstract-- It is our view that the state of the art in constructing a large collection of graph algorithms in terms of linear algebraic operations is mature enough to support the emergence of a standard set of primitive building blocks. This paper is a position paper defining the problem and announcing our intention to launch an open effort to define this standard.
Outline
• GraphBLAS: standard building blocks for graph algorithms in the language of linear algebra
• SpGEMM: Computing the sparse matrix-matrix multiplication in parallel
• Triangle counting/enumeration in matrix algebra
• Bipartite graph matching in parallel
• Other contributions and future work
Multiple-source breadth-first search
• Sparse array representation => space efficient
• Sparse matrix-matrix multiplication => work efficient
• Three possible levels of parallelism: searches, vertices, edges
• Highly-parallel implementation for Betweenness Centrality*
*: A measure of influence in graphs, based on shortest paths
[Figure: the transposed adjacency matrix A^T and the multi-source frontier matrix B on the 7-vertex example graph.]
[Figure: the product A^T · B advances all searches at once; each column of B is one BFS frontier.]
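A minimal sketch of the multiple-source idea, assuming a frontier matrix B whose column s is the current frontier of search s, so that one sparse product A^T · B advances every search by a level. The function name and the small path-graph example are illustrative only.

```python
import numpy as np
from scipy.sparse import csr_matrix

def multi_source_bfs_levels(A, sources):
    """Level-synchronous BFS from several sources at once. Column s of the
    frontier matrix B is the current frontier of search s, so a single
    product A^T * B (one SpGEMM) advances every search by one level."""
    n, k = A.shape[0], len(sources)
    levels = np.full((n, k), -1, dtype=int)
    visited = np.zeros((n, k), dtype=bool)
    B = csr_matrix((np.ones(k), (sources, np.arange(k))), shape=(n, k))
    level = 0
    while B.nnz:
        r, c = B.nonzero()
        levels[r, c] = level
        visited[r, c] = True
        Nxt = A.T @ B                        # expand all frontiers in one product
        r, c = Nxt.nonzero()
        keep = ~visited[r, c]                # drop vertices already discovered
        B = csr_matrix((np.ones(int(keep.sum())), (r[keep], c[keep])), shape=(n, k))
        level += 1
    return levels

# Illustrative path graph 0-1-2-3, with searches started from vertices 0 and 3.
edges = [(0, 1), (1, 2), (2, 3)]
r = [u for u, v in edges] + [v for u, v in edges]
c = [v for u, v in edges] + [u for u, v in edges]
A = csr_matrix((np.ones(len(r)), (r, c)), shape=(4, 4))
print(multi_source_bfs_levels(A, [0, 3]))
# [[0 3]
#  [1 2]
#  [2 1]
#  [3 0]]
```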
Sparse Matrix-Matrix Multiplication
Why sparse matrix-matrix multiplication? Used for algebraic multigrid, graph clustering, betweenness centrality, graph contraction, subgraph extraction, cycle detection, quantum chemistry, high-dimensional similarity search, …
How do dense and sparse GEMM compare?
• Dense: lower bounds match algorithms; extensive data reuse is possible.
• Sparse: significant gap between lower bounds and algorithms; inherently poor reuse?
What do we obtain here? An improved (higher) lower bound and new optimal algorithms, but only for random matrices: Erdős-Rényi(n, d) graphs, a.k.a. G(n, p = d/n).
2D Algorithm: Sparse SUMMA
[Figure: blocked product C = A · B on 100K × 100K matrices; each block C_ij is accumulated from matching blocks of A and B.]
2D algorithm: Sparse SUMMA (based on dense SUMMA); a general implementation that handles rectangular matrices:
Cij += HyperSparseGEMM(Arecv, Brecv)
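A serial emulation of the Sparse SUMMA stage loop, only to show where Arecv and Brecv come from; in the real distributed algorithm those blocks arrive via broadcasts along processor rows and columns, and the local multiply is the hypersparse kernel. The grid size q and the helper name are assumptions made for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix, random as sprand

def sparse_summa_emulated(A, B, q):
    """Serial emulation of 2D Sparse SUMMA on a q x q processor grid.
    A and B are split into q x q sparse blocks; in stage k the owners of the
    A-blocks in block-column k broadcast along their processor row (Arecv) and
    the owners of the B-blocks in block-row k broadcast along their processor
    column (Brecv); every 'processor' (i, j) accumulates Cij += Arecv * Brecv."""
    rb = np.array_split(np.arange(A.shape[0]), q)   # row ranges of the grid
    cb = np.array_split(np.arange(B.shape[1]), q)   # column ranges of the grid
    kb = np.array_split(np.arange(A.shape[1]), q)   # inner ranges, one per stage
    C = [[csr_matrix((len(rb[i]), len(cb[j]))) for j in range(q)]
         for i in range(q)]
    for k in range(q):                              # SUMMA stages
        for i in range(q):
            Arecv = A[rb[i], :][:, kb[k]]           # block A_{i,k}
            for j in range(q):
                Brecv = B[kb[k], :][:, cb[j]]       # block B_{k,j}
                C[i][j] = C[i][j] + Arecv @ Brecv   # local "HyperSparseGEMM"
    return C

# Sanity check against an ordinary sparse product.
A = sprand(60, 60, density=0.05, format='csr', random_state=0)
B = sprand(60, 60, density=0.05, format='csr', random_state=1)
C = sparse_summa_emulated(A, B, q=3)
full = np.block([[C[i][j].toarray() for j in range(3)] for i in range(3)])
assert np.allclose(full, (A @ B).toarray())
```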
[Plot: runtime (seconds) vs. number of cores (4 to 8100) for R-MAT inputs of scale 21 to 24.]
2D Sparse SUMMA on square inputs: almost linear scaling until bandwidth costs start to dominate; scaling proportional to √p afterwards.
R-MAT, edgefactor 8, a=0.6, b=c=d=0.4/3. NERSC/Franklin Cray XT4.
Aydın Buluç and John R. Gilbert. Parallel sparse matrix-matrix multiplication and indexing: Implementation and experiments. SIAM Journal on Scientific Computing (SISC), 2012.
The computation cube and sparsity-independent algorithms
Matrix multiplication: ∀(i,j) ∈ n × n, C(i,j) = Σ_k A(i,k)·B(k,j)
The computation (discrete) cube:
• A face for each (input/output) matrix
• A grid point for each multiplication
1D algorithms, 2D algorithms, 3D algorithms. How about sparse algorithms?
Sparsity-independent algorithms: assigning grid points to processors is independent of the sparsity structure.
- In particular: if C_ij is nonzero, who holds it?
- All standard algorithms are sparsity independent.
Assumptions:
- Sparsity-independent algorithms
- Input (and output) are sparse
- The algorithm is load balanced
Algorithms attaining lower bounds
Previous lower bound for sparse classical SpGEMM [Ballard et al., SIMAX'11]:
W = Ω( #FLOPs / (P·√M) ) = Ω( d²n / (P·√M) )
New lower bound for Erdős-Rényi(n, d) [here], expected (under some technical assumptions): a higher bound of the form Ω(min{…, …}).
✗ No previous algorithm attains this bound!
✓ Two new algorithms achieve the bounds (up to a logarithmic factor):
i. Recursive 3D, based on [Ballard et al., SPAA'12]
ii. Iterative 3D, based on [Solomonik & Demmel, EuroPar'11]
Ballard, B., Demmel, Grigori, Lipshitz, Schwartz, and Toledo. Communication optimal parallel multiplication of sparse random matrices. In SPAA 2013.
3D parallel SpGEMM in a nutshell
Ariful Azad, Grey Ballard, B., James Demmel, Laura Grigori, Oded Schwartz, Sivan Toledo, Samuel Williams. Exploiting multiple levels of parallelism in sparse matrix-matrix multiplication. SIAM Journal on Scientific Computing (SISC), to appear.
[Figure: the 3D algorithm splits the processor grid into layers; slices A::1, A::2, A::3 of A are distributed across layers, each layer computes a partial product with an All-to-All exchange, and the intermediate results C^intermediate are merged into C^final.]
C^int_ijk = Σ_{l=1..p/c} A_ilk · B_ljk
3D SpGEMM performance
[Plot: time (sec) vs. number of cores (64 to 16384) for nlpkkt160 × nlpkkt160 on Edison, comparing 2D (t=1, 3, 6) and 3D (c=4, 8, 16; t=1, 3, 6) variants; performance improves as the number of 2D threads and of 3D layers and threads increases.]
Strong scaling of different variants of the 2D and 3D algorithms when squaring the nlpkkt160 matrix on Edison.
2D (non-threaded) is the previous state of the art.
3D (threaded), first presented here, beats it by 8X at large concurrencies.
3D SpGEMM performance
• Matrix squaring (A*A), proxy for Markov clustering (MCL)
2004 web crawl of the .it domain: 41 million vertices, 1.1 billion edges.
SSCA (R-‐MAT) matrices on Titan
[Plots: time (sec) vs. number of cores. Left: it-2004 (A·A) on 1024 to 16384 cores, comparing 2D (t=1, 6) and 3D (c=8, 16; t=1, 6). Right: SSCA × SSCA on Titan with c=16, t=8, for scales 24 to 27 on 512 to 65536 cores.]
Outline
• GraphBLAS: standard building blocks for graph algorithms in the language of linear algebra
• SpGEMM: Computing the sparse matrix-matrix multiplication in parallel
• Triangle counting/enumeration in matrix algebra
• Bipartite graph matching in parallel
• Other contributions and future work
Counting triangles (clustering coefficient)
[Figure: example graph on 6 vertices containing 4 triangles.]
Clustering coefficient:
• Pr (wedge i-j-k makes a triangle with edge i-k)
• 3 * # triangles / # wedges
• 3 * 4 / 19 = 0.63 in example
• may want to compute for each vertex j
Cohen’s algorithm to count triangles:
- Count triangles by lowest-degree vertex.
- Enumerate “low-hinged” wedges.
- Keep wedges that close.
Counting triangles (clustering coefficient)
[Figure: the adjacency matrix A split into L and U, the wedge-count matrix B = L·U, and the masked result C.]
A = L + U (hi→lo + lo→hi)
L × U = B (wedge, low hinge)
A ∧ B = C (closed wedge)
sum(C)/2 = 4 triangles
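The four-line recipe above can be checked directly with scipy.sparse. The small graph below is illustrative (not the slide's example), and the degree-based vertex ordering that makes Cohen's algorithm efficient is omitted for brevity; it affects performance, not the count.

```python
import numpy as np
from scipy.sparse import csr_matrix, tril, triu

def count_triangles(A):
    """Cohen's recipe as on the slide: A = L + U, B = L*U enumerates
    low-hinged wedges, A .* B keeps the wedges that close, and every
    triangle is counted twice."""
    L = tril(A, k=-1, format='csr')
    U = triu(A, k=1, format='csr')
    B = L @ U                      # wedge counts, hinged at the lowest vertex
    C = A.multiply(B)              # keep only wedges closed by an edge of A
    return int(C.sum()) // 2

# Illustrative graph: two triangles sharing the edge (1, 2), plus a pendant edge.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]
r = [u for u, v in edges] + [v for u, v in edges]
c = [v for u, v in edges] + [u for u, v in edges]
A = csr_matrix((np.ones(len(r)), (r, c)), shape=(5, 5))
print(count_triangles(A))          # 2
```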
Masked sparse matrix-matrix multiplication (SpGEMM)
Special SpGEMM: the nonzero structure of the product is contained in the original adjacency matrix A. Avoid communicating nonzeros that fall outside this structure via C = MaskedSpGEMM(L, U, A), which is mathematically equivalent to performing B = L·U followed by C = A ∧ B (a small sketch follows below).
[Figure: masked product on the 7-vertex example; only entries falling within the pattern of A are needed.]
Ariful Azad, B., and John R. Gilbert. "Parallel triangle counting and enumeration using matrix algebra". Graph Algorithm Building Blocks (GABB), IPDPSW, 2015.
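A sketch of what the masked product computes, assuming a naive entry-at-a-time formulation: for each nonzero of the mask A, take the dot product of the corresponding row of L and column of U, and never form product entries outside the mask. This is only the mathematical specification, not the parallel algorithm of the paper.

```python
import numpy as np
from scipy.sparse import csr_matrix, tril, triu

def masked_spgemm(L, U, M):
    """Compute C = (L * U) .* M without forming product entries outside the
    mask M: for every nonzero position (i, j) of M, take the dot product of
    row i of L with column j of U."""
    Lr, Uc = L.tocsr(), U.tocsc()
    rows, cols = M.nonzero()
    vals = np.empty(len(rows))
    for t, (i, j) in enumerate(zip(rows, cols)):
        vals[t] = (Lr.getrow(i) @ Uc.getcol(j))[0, 0]   # restricted dot product
    return csr_matrix((vals, (rows, cols)), shape=M.shape)

# Check the equivalence claimed on the slide on a small random adjacency matrix.
rng = np.random.default_rng(0)
D = np.triu((rng.random((8, 8)) < 0.3).astype(float), 1)
A = csr_matrix(D + D.T)                               # symmetric 0/1 adjacency
L, U = tril(A, -1, format='csr'), triu(A, 1, format='csr')
C_masked = masked_spgemm(L, U, A)
C_full   = A.multiply(L @ U)
assert np.allclose(C_masked.toarray(), C_full.toarray())
```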
Distributed memory performance of triangle counting
[Plots: speedup vs. number of cores (1 to 256) for (a) basic triangle counting and (b) masked triangle counting with Bloom filters, on coPapersDBLP, mouse-gene, cage15, and europe_osm.]
• Improved 1D algorithm (SPAA’13 by Ballard, Buluc, et al.), which had not been implemented and evaluated before.
• Scales reasonably well up to 512 cores of NERSC/Edison. Further scaling requires 2D/3D algorithms (ongoing work).
• Bloom filters are used in (b) to signal the presence/absence of nonzeros in the output, to avoid communication.
Outline
• GraphBLAS: standard building blocks for graph algorithms in the language of linear algebra
• SpGEMM: Computing the sparse matrix-matrix multiplication in parallel
• Triangle counting/enumeration in matrix algebra
• Bipartite graph matching in parallel
• Other contributions and future work
Graph vs. Bipartite Graph
• A graph consists of vertices and edges: G(V, E).
• A bipartite graph: vertices are grouped into two sets; there is no edge between vertices within a set.
• Notation: n = #vertices, m = #edges.
[Figure: a graph and a bipartite graph with vertex sets {x1, x2, ...} and {y1, y2, ...}.]
A Matching in a Graph
Matching: A subset M of edges with no common end vertices.
|M| = Cardinality of the matching M
[Figure: a matching of maximal cardinality vs. a maximum-cardinality matching, with matched/unmatched vertices and edges highlighted.]
Maximum-cardinality matching
• Augmenting path: a path that alternates between matched and unmatched edges, with unmatched end points.
• Algorithm: search for augmenting paths and flip the edges along them to increase the matching (see the sketch after the figure below).
[Figure: augmenting the matching by flipping edges along an augmenting path.]
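A minimal serial sketch of the augmenting-path idea (a DFS-based, Kuhn-style search, not the parallel MS-BFS/tree-grafting algorithm of the following slides); the adjacency-list representation and the tiny instance are illustrative only.

```python
def max_bipartite_matching(adj_x, n_y):
    """Maximum-cardinality bipartite matching by repeated single-source
    augmenting-path search (DFS variant). adj_x[i] lists the Y-neighbors of
    X-vertex i; n_y is the number of Y-vertices."""
    match_y = [-1] * n_y                      # match_y[j] = X-vertex matched to y_j

    def try_augment(i, visited):
        for j in adj_x[i]:
            if j in visited:
                continue
            visited.add(j)
            # y_j is free, or its current mate can be re-matched elsewhere:
            if match_y[j] == -1 or try_augment(match_y[j], visited):
                match_y[j] = i                # flip edges along the augmenting path
                return True
        return False

    matched = 0
    for i in range(len(adj_x)):               # one search per X-vertex
        if try_augment(i, set()):
            matched += 1
    return matched, match_y

# Small illustrative instance: x0-{y0,y1}, x1-{y0}, x2-{y1,y2}.
print(max_bipartite_matching([[0, 1], [0], [1, 2]], 3))   # (3, [1, 0, 2])
```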
Single-source (SS) search for augmenting paths
1. Initial matching.
2. Search for an augmenting path from x3; stop when an unmatched vertex is found. Discard the whole tree when no augmenting path is found.
3. Increase the matching by flipping edges in the augmenting path.
Repeat the process for other unmatched vertices.
[Figure: the three steps on the example bipartite graph.]
Multi-source (MS) search for augmenting paths
1. Initial matching.
2. Search for vertex-disjoint augmenting paths from x3 and x1; grow a tree until an unmatched vertex is found in it. A tree cannot be discarded when no augmenting path is found.
3. Increase the matching by flipping edges in the augmenting paths.
Repeat the process until no augmenting path is found.
[Figure: the search forest and vertex-disjoint augmenting paths on the example graph.]
Tree Grafting
[Figure: (a) a maximal matching in a bipartite graph; (b) an alternating BFS forest with active and renewable trees, augmenting within the forest; (c) tree grafting; (d) continuing the BFS toward unvisited vertices.]
Comparison with state-of-the-art
[Plots: relative performance of Push-Relabel, Pothen-Fan, and MS-BFS-Graft on (a) one core and (b) 40 cores of an Intel Westmere-EX, over scientific computing and scale-free/network matrices (kkt_power, hugetrace, delaunay, rgg_n24_s0, coPapersDBLP, amazon0312, cit-Patents, RMAT, road_usa, wb-edu, web-Google, wikipedia).]
Sources of performance
[Plot: relative performance of MS-BFS, MS-BFS (direction-optimized), and MS-BFS-Graft on scientific, scale-free, and network matrices.]
1. Direction-optimized BFS (1.6x); 2. Tree grafting (3x). On 40 cores of an Intel Westmere-EX.
Highest performance improvement on graphs with low matching number
Ariful Azad, B., Alex Pothen. A parallel tree grafting algorithm for maximum cardinality matching in bipartite graphs. In Proceedings of IPDPS, 2015. Ariful Azad, B., Alex Pothen. Computing maximum cardinality matchings in parallel on bipartite graphs via tree-grafting. IEEE Transactions on Parallel and Distributed Systems (TPDS), 2016.
Matching in distributed memory
[Plots: strong scaling of distributed-memory maximum cardinality matching (time in seconds vs. 16 to 2048 cores) on ljournal, cage15, road_usa, nlpkkt200, hugetrace, delaunay_n24, and HV15R; and the effect of initialization (Dynamic Mindegree, Karp-Sipser, Greedy, No Init) on maximal and maximum matching time for nlpkkt200, road_usa, GL7d19, and wikipedia.]
Ariful Azad and Aydin Buluç. Distributed-memory algorithms for maximum cardinality matching in bipartite graphs. In Proceedings of the IPDPS, 2016
Outline
• GraphBLAS: standard building blocks for graph algorithms in the language of linear algebra
• SpGEMM: Computing the sparse matrix-matrix multiplication in parallel
• Triangle counting/enumeration in matrix algebra
• Bipartite graph matching in parallel
• Other contributions and future work
HipMer: An Extreme-Scale De Novo Genome Assembler
Meraculous assembly pipeline (reads → k-mers → contigs → scaffolds):
• New k-mer analysis filters errors using a probabilistic Bloom filter
• Graph algorithm (connected components) scales to 15K cores on NERSC's Edison
• Scaffolding using scalable alignment
• New fast & parallel I/O
The Meraculous assembler is used in production at the Joint Genome Institute:
• Wheat assembly is a "grand challenge"
• The hardest part is contig generation (a large in-memory hash table that represents the graph)
• HipMer is an efficient parallelization of Meraculous that leverages the capabilities of Unified Parallel C (UPC)
• Performance improvement from days to minutes
[Plot: seconds vs. number of cores (1920 to 23040) for the HipMer pipeline stages (k-mer analysis, contig generation, scaffolding, I/O) and overall time with and without cached I/O, against ideal scaling.]
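To make the "probabilistic Bloom filter" step concrete, here is a toy sketch of the standard two-pass trick, not HipMer's actual data structure, hash functions, or parameters: a k-mer seen for the first time only sets bits in a compact Bloom filter, and it is inserted into the real, memory-hungry count table only when seen again, so most single-occurrence (likely erroneous) k-mers never reach the table.

```python
import hashlib

class BloomFilter:
    """Tiny illustrative Bloom filter: set membership with false positives
    but no false negatives, using a fixed bit array and n_hashes hashes."""
    def __init__(self, m_bits=1 << 20, n_hashes=3):
        self.m, self.k = m_bits, n_hashes
        self.bits = bytearray(m_bits // 8 + 1)

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.blake2b(item.encode(), person=str(i).encode()).digest()
            yield int.from_bytes(h[:8], 'little') % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

# First pass over k-mers: only k-mers seen at least twice enter the count table.
kmers = ["ACGTA", "CGTAC", "ACGTA", "TTTTT", "ACGTA", "CGTAC"]
bloom, counts = BloomFilter(), {}
for km in kmers:
    if km in bloom:                          # (probably) seen before
        counts[km] = counts.get(km, 1) + 1   # count the implied first occurrence too
    else:
        bloom.add(km)
print(counts)                                # {'ACGTA': 3, 'CGTAC': 2}
```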
Future Work: Machine Learning
GraphBLAS functions (in increasing arithmetic intensity): SpMSpV, SpMV, SpMM, SpGEMM, SpDM3.
Higher-level machine learning tasks built on them:
• Graphical model structure learning (e.g., CONCORD)
• Clustering (e.g., MCL, spectral clustering)
• Logistic regression, support vector machines
• Dimensionality reduction (e.g., NMF, CX/CUR, PCA)
Future Work: Metagenomics
• Microbiomes: dynamic consortia of hundreds or thousands of microbial species and strains of varying abundance and diversity.
• Metagenomics: the application of high-throughput genome sequencing technologies to DNA extracted from microbiomes.
Challenges:
- Metagenome assembly is computationally harder than single-genome assembly
- Protein clustering at this scale has never been done before
Figure credit: Natalia Ivanova (JGI)
Acknowledgments
Ariful Azad, David Bader, Grey Ballard, Scott Beamer, Jarrod Chapman, James Demmel, Rob Egan, Evangelos Georganas, John Gilbert, Laura Grigori, Steve Hofmeyr, Costin Iancu, Jeremy Kepner, Penporn Koanantakool, Benjamin Lipshitz, Tim Mattson, Scott McMillan, Henning Meyerhenke, Jose Moreira, Sang Oh, Lenny Oliker, John Owens, Alex Pothen, Dan Rokhsar, Oded Schwartz, Sivan Toledo, Sam Williams, Carl Yang, Kathy Yelick.
Work is funded by