EDGAR: Energy-efficient Data and Graph Algorithms Research
• Funded by Applied Math, ASCR
• Program manager: Sandy Landsberg
• Early Career Research Program (start: 2013)
• PI: Aydın Buluç (Berkeley Lab)
• Postdoctoral Fellows:
  – Ariful Azad (100%, Feb 2014–now)
  – Harsha Simhadri (40%, Sep 2013–Sep 2014)
• Students (short term):
  – Chaitanya Aluru (Nov 2014–now)
  – Eric Lee (Summer 2014)
Problem: maximum cardinality matching in bipartite graphs. Application: block triangular forms of matrices, least squares problems, circuit simulation, weighted matching. Algorithm: search for disjoint paths that alternate between matched and unmatched edges.
Maximum Cardinality Matching

[Figure: bipartite graph with vertices x1, x2, x3 and y1, y2, y3; the legend distinguishes matched and unmatched vertices and edges.]
Innovation: re-use search trees created in one phase in the next phase by grafting branches of trees. This significantly reduces work in the tree traversal and exposes more parallelism.
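For reference, the underlying search that the tree-grafting approach accelerates can be sketched serially: repeatedly look for an augmenting path from an unmatched left vertex and flip its matched and unmatched edges. The sketch below is the classic serial augmenting-path method (Kuhn's algorithm), not the parallel MS-BFS-Graft algorithm of the paper; the example graph on x1..x3 and y1..y3 is illustrative.

```python
# Minimal serial sketch of augmenting-path bipartite matching (Kuhn's
# algorithm). It illustrates the search the tree-grafting method accelerates;
# it is NOT the parallel MS-BFS-Graft algorithm. Example data is illustrative.

def max_bipartite_matching(adj):
    """adj maps each left vertex to the right vertices it is connected to."""
    match_right = {}                      # right vertex -> matched left vertex

    def try_augment(u, visited):
        # Depth-first search for an augmenting path starting at left vertex u.
        for v in adj[u]:
            if v in visited:
                continue
            visited.add(v)
            # v is free, or its current partner can be re-matched elsewhere.
            if v not in match_right or try_augment(match_right[v], visited):
                match_right[v] = u
                return True
        return False

    matched = 0
    for u in adj:                         # one augmentation attempt per left vertex
        if try_augment(u, set()):
            matched += 1
    return matched, match_right

# Example with left vertices x1..x3 and right vertices y1..y3 (as in the figure).
graph = {"x1": ["y1", "y2"], "x2": ["y1"], "x3": ["y2", "y3"]}
print(max_bipartite_matching(graph))      # (3, {'y1': 'x2', 'y2': 'x1', 'y3': 'x3'})
```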
Ariful Azad, Aydın Buluç, and Alex Pothen. A parallel tree grafting algorithm for maximum cardinality matching in bipartite graphs. In Proceedings of the IPDPS, 2015.
[Figure: relative performance of Push-Relabel, Pothen-Fan, and MS-BFS-Graft on kkt_power, hugetrace, delaunay, rgg_n24_s0, coPapersDBLP, amazon0312, cit-Patents, RMAT, road_usa, wb-edu, web-Google, and wikipedia.]
Performance: on a 40-core Intel Westmere-EX, on average 7x faster than the current best algorithm, and up to 42x faster.
[Figure: speedup vs. number of threads (1 to 64) on scientific and scale-free networks; annotations mark the onset of hyperthreading and the change of socket.]
Scaling: one node of Edison (24-core Intel Ivy Bridge); on average 17x speedup relative to the serial algorithm.
Future direction: expose parallelism suitable for distributed-memory matching (no practical algorithm is available); apply to static pivoting for solving systems of sparse linear equations.
[Figure: time in seconds and parallel efficiency vs. number of cores (480 to 7680) for k-mer analysis, graph construction & traversal, and the combined time.]
[Figure: assembly pipeline: reads → k-mers → contigs → scaffolds. Step 1: k-mer analysis; Step 2: graph construction & traversal (contig generation); Step 3: scaffolding (alignment). Example: the k-mers GAT, ATC, TCT, CTG, TGA, AAC, ACC, CCG, AAT, ATG, TGC yield Contig 1: GATCTGA, Contig 2: AACCG, Contig 3: AATGC.]
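To make the contig-generation step concrete, the sketch below chains k-mers into contigs by following per-k-mer extensions, in the spirit of the hash-table traversal of Step 2. It is a minimal serial illustration only, not the distributed implementation from the paper; the extensions are hard-coded to match the figure's example (in the real pipeline they come from Step 1, the k-mer analysis).

```python
# Minimal sketch of de Bruijn contig generation: each k-mer (k=3) is stored
# with its backward and forward single-character extensions, and contigs are
# produced by walking forward from k-mers that have no backward extension.
# Illustration only; not the distributed, one-sided-communication implementation.

kmer_ext = {
    # k-mer: (backward extension, forward extension); None = no extension
    "GAT": (None, "C"), "ATC": ("G", "T"), "TCT": ("A", "G"),
    "CTG": ("T", "A"), "TGA": ("C", None),
    "AAC": (None, "C"), "ACC": ("A", "G"), "CCG": ("A", None),
    "AAT": (None, "G"), "ATG": ("A", "C"), "TGC": ("A", None),
}

def contigs(kmer_ext):
    out = []
    for kmer, (back, _) in kmer_ext.items():
        if back is not None:        # only start a walk at a left end
            continue
        contig, cur = kmer, kmer
        while True:
            fwd = kmer_ext[cur][1]
            if fwd is None:         # right end reached
                break
            contig += fwd
            cur = cur[1:] + fwd     # slide the k-mer window by one base
        out.append(contig)
    return out

print(contigs(kmer_ext))   # ['GATCTGA', 'AACCG', 'AATGC']
```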
Scaling of Steps 1 and 2 on NERSC Edison with the human genome. Improvement for the first two steps: Human: 60 hours → 2 minutes; Wheat: 140 hours → 15 minutes.
Parallel Algorithms for De Novo Genome Assembly
Evangelos Georganas, Aydın Buluç, Jarrod Chapman, Leonid Oliker, Daniel Rokhsar, and Katherine Yelick. Parallel de Bruijn graph construction and traversal for de novo genome assembly. In Supercomputing (SC'14).
• In de novo assembly, billions of reads must be aligned to contigs
• First aligner to parallelize the seed index construction ("fully" parallel)
Evangelos Georganas, Aydın Buluç, Jarrod Chapman, Leonid Oliker, Daniel Rokhsar, and Katherine Yelick. merAligner: A fully parallel sequence aligner. In Proceedings of the IPDPS, 2015.
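The seed index maps every length-k substring of the contigs to its locations, so a read can be anchored by exact seed hits before local alignment. Below is a minimal serial sketch of that index and lookup; merAligner constructs the index in a distributed, fully parallel fashion and follows the anchoring with alignment extension, both omitted here. The contigs, read, and k value are illustrative.

```python
# Minimal serial sketch of a seed (k-mer) index over contigs, illustrating the
# lookup that precedes alignment. merAligner distributes this index across
# processors; the contigs, read, and k below are illustrative only.

def build_seed_index(contigs, k):
    """Map each length-k seed to a list of (contig id, offset) locations."""
    index = {}
    for cid, contig in enumerate(contigs):
        for pos in range(len(contig) - k + 1):
            index.setdefault(contig[pos:pos + k], []).append((cid, pos))
    return index

def seed_hits(read, index, k):
    """Anchor a read by exact seed matches (candidate alignment positions)."""
    hits = []
    for pos in range(len(read) - k + 1):
        for cid, cpos in index.get(read[pos:pos + k], []):
            # Candidate: read position `pos` aligns to contig `cid` at `cpos`.
            hits.append((cid, cpos - pos))
    return hits

contigs = ["GATCTGA", "AACCG", "AATGC"]       # from the example in the previous panel
index = build_seed_index(contigs, k=3)
print(seed_hits("ATCTG", index, k=3))         # [(0, 1), (0, 1), (0, 1)]
```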
[Figure: alignment time in seconds (log scale) vs. number of cores (480 to 15360) for merAligner and ideal scaling on the wheat and human genomes, compared against BWA-mem and Bowtie2 on human.]
Parallel Genome Alignment for De novo Assembly
• Block iterative methods have data locality and cache re-use, and expose more parallelism
• Sparse matrix multiple-vector multiplication (SpMM)
• Motivated by nuclear structure calculations
• Applications: data analysis & spectral clustering
H. Metin Aktulga, Aydın Buluç, Samuel Williams, and Chao Yang. Optimizing sparse matrix-multiple vectors multiplication for nuclear configuration interaction calculations. In Proceedings of the IPDPS, 2014.
Fast Parallel Block Eigensolvers
n × n matrix with nnz nonzeros, partitioned into β × β blocks.

[Figure: block structure of the matrix, with blocks indexed (0,0), (0,1), (1,0), (1,1), each of size β × β.]
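As a point of reference for the SpMM kernel, the sketch below multiplies a CSR sparse matrix by a block of m dense vectors: each nonzero is loaded once and applied to all m vectors, which is the cache-reuse benefit block iterative methods exploit. This is not the paper's CSB/Cilk implementation; the matrix, density, and m are illustrative.

```python
# Minimal sketch of SpMM (sparse matrix times m dense vectors) on a CSR matrix.
# Each nonzero A[i,k] is loaded once and applied to all m right-hand vectors,
# which is the cache-reuse benefit of block iterative methods noted above.
# Not the paper's CSB implementation; data below are illustrative.
import numpy as np
import scipy.sparse as sp

def spmm_csr(A, X):
    """Y = A @ X for CSR A (n x n) and dense X (n x m), written out explicitly."""
    n, m = X.shape
    Y = np.zeros((A.shape[0], m))
    for i in range(A.shape[0]):
        for idx in range(A.indptr[i], A.indptr[i + 1]):
            k, val = A.indices[idx], A.data[idx]
            Y[i, :] += val * X[k, :]        # one nonzero updates all m vectors
    return Y

A = sp.random(1000, 1000, density=0.01, format="csr", random_state=0)
X = np.random.rand(1000, 8)                 # m = 8 vectors, as in block eigensolvers
print(np.allclose(spmm_csr(A, X), A @ X))   # True
```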
[Figure: GFlop/s vs. number of vectors m (1 to 48) for SpMM and SpMM_T, comparing CSB/OpenMP, CSB/Cilk, rowpart/OpenMP, and CSR/OpenMP.]
Adam Lugowski, Shoaib Kamil, Aydın Buluç, Sam Williams, Erika Duriakova, Leonid Oliker, Armando Fox, John Gilbert. Parallel processing of filtered queries in attributed semantic graphs. Journal of Parallel and Distributed Computing (JPDC), 2014.
Filtered Semantic Graph Processing
[Figure: Standard KDT vs. KDT+SEJITS. In standard KDT, the KDT algorithm, filter, and semiring are written in Python on top of CombBLAS primitives in C++; with SEJITS translation, the Python filter and semiring are translated to C++.]
• Filters enable efficient processing of semantic graphs (with edge/vertex attributes)
• Semirings enable customization of graph algorithms using matrix primitives
• Filters & semirings can be a performance bottleneck in high-level languages
• SEJITS enables just-in-time compilation of filters/semirings into Combinatorial BLAS, matching the performance of low-level code
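As a concrete (if simplified) picture of filters and semirings: one BFS frontier expansion can be written as a matrix-vector product over a Boolean semiring, with an edge filter applied to the attributes. The sketch below is a plain serial illustration, not KDT, CombBLAS, or SEJITS-generated code; the graph, attributes, and filter are made up for the example.

```python
# Minimal sketch: one BFS frontier expansion as a "filtered semiring SpMV".
# Edges carry an attribute; the filter keeps only some edges, and the semiring
# (OR, AND over booleans) replaces (+, *) of an ordinary matrix-vector product.
# Illustrative only; not KDT, CombBLAS, or SEJITS-generated code.

# Attributed adjacency "matrix" as a dict: (src, dst) -> edge attribute.
edges = {(0, 1): "email", (0, 2): "phone", (1, 3): "email", (2, 3): "phone"}

def filtered_frontier_expand(edges, frontier, visited, edge_filter):
    """Return vertices reachable in one hop through edges passing the filter."""
    next_frontier = set()
    for (u, v), attr in edges.items():
        # semiring "multiply": edge exists AND u is in the frontier AND filter passes
        if u in frontier and edge_filter(attr) and v not in visited:
            next_frontier.add(v)   # semiring "add": logical OR of contributions
    return next_frontier

def only_email(attr):
    return attr == "email"

frontier, visited = {0}, {0}
while frontier:
    frontier = filtered_frontier_expand(edges, frontier, visited, only_email)
    visited |= frontier
print(visited)   # {0, 1, 3}: vertex 2 is unreachable through "email" edges
```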
[Figure: mean BFS time in seconds (log scale) vs. filter permeability (1%, 10%, 100%) for Python/Python KDT, Python/SEJITS KDT, SEJITS/SEJITS KDT, and C++/C++ CombBLAS.]
• The Graph BLAS Forum: http://istc-bigdata.org/GraphBlas/
• Graph Algorithms Building Blocks (GABB workshop at IPDPS'14 and IPDPS'15): http://www.graphanalysis.org/workshop2015.html
Abstract: It is our view that the state of the art in constructing a large collection of graph algorithms in terms of linear algebraic operations is mature enough to support the emergence of a standard set of primitive building blocks. This paper is a position paper defining the problem and announcing our intention to launch an open effort to define this standard.
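As a small illustration of expressing a graph algorithm through linear-algebra primitives: BFS levels can be computed by repeatedly multiplying the transposed adjacency matrix with the current frontier vector over a Boolean semiring. The sketch below does not use the GraphBLAS API that the effort aims to standardize; the graph is illustrative.

```python
# Minimal sketch of "graph algorithm as linear algebra": BFS levels via
# repeated sparse matrix-vector products over a Boolean semiring.
# Illustrative only; this does not use the GraphBLAS API itself.
import numpy as np
import scipy.sparse as sp

# Directed graph 0->1, 0->2, 1->3, 2->3 as a sparse adjacency matrix A.
A = sp.csr_matrix(([1, 1, 1, 1], ([0, 0, 1, 2], [1, 2, 3, 3])), shape=(4, 4))

levels = np.full(4, -1)
frontier = np.zeros(4, dtype=bool)
frontier[0] = True                     # start BFS from vertex 0
level = 0
while frontier.any():
    levels[frontier] = level
    # y = A^T x over the Boolean semiring: OR over edges leaving the frontier
    reached = (A.T @ frontier.astype(int)) > 0
    frontier = reached & (levels == -1)     # drop already-visited vertices
    level += 1
print(levels)                          # [0 1 1 2]
```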
The Graph BLAS Effort
Up to 8x faster on Kronecker (Graph500) inputs (largest run on 16 billion vertices and 256 billion edges)
For a fixed-size real input, the direction-optimizing algorithm needs 1/16th of the processors (and energy)
Implemented on top of Combinatorial BLAS (i.e., uses a 2D decomposition)
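For illustration, the direction-optimizing idea in serial form: expand the frontier top-down while it is small, and switch to bottom-up (each unvisited vertex scans its neighbors for a parent in the frontier) once it grows, skipping most edge checks. The sketch below shows only the switching heuristic, not the 2D distributed CombBLAS implementation; the switching threshold is illustrative.

```python
# Minimal serial sketch of direction-optimizing BFS: top-down while the
# frontier is small, bottom-up once it is large. The switching rule here
# (frontier no larger than a quarter of the vertices) is illustrative, not
# the tuned heuristic from the papers, and this is not the distributed code.

def direction_optimizing_bfs(adj, source):
    n = len(adj)
    parent = {source: source}
    frontier = {source}
    while frontier:
        if len(frontier) <= n // 4:
            # Top-down: expand outgoing edges of the (small) frontier.
            nxt = set()
            for u in frontier:
                for v in adj[u]:
                    if v not in parent:
                        parent[v] = u
                        nxt.add(v)
        else:
            # Bottom-up: each unvisited vertex looks for any neighbor in the
            # frontier, stopping at the first hit instead of scanning all edges.
            nxt = set()
            for v in range(n):
                if v in parent:
                    continue
                for u in adj[v]:          # assumes a symmetric (undirected) adj
                    if u in frontier:
                        parent[v] = u
                        nxt.add(v)
                        break
        frontier = nxt
    return parent

# Small undirected example: a path 0-1-2-3 plus the edge 1-3.
adj = {0: [1], 1: [0, 2, 3], 2: [1, 3], 3: [1, 2]}
print(direction_optimizing_bfs(adj, 0))   # {0: 0, 1: 0, 2: 1, 3: 1}
```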
Buluç, Beamer, Madduri, Asanović, Patterson. "Distributed-Memory Breadth-First Search on Massive Graphs." Chapter in Parallel Graph Algorithms, Bader (editor), CRC Press / Taylor & Francis (2015), to appear.
Beamer, Buluç, Asanović, Patterson. "Distributed Memory Breadth-First Search Revisited: Enabling Bottom-Up Search." IPDPSW'13.
Direction-optimizing BFS on distributed memory
[Figure: GFlops vs. number of compute nodes (1 to 1024) for n = 4096 and n = 8192 with c = 1, 4, 16; and the 2 × 2 partition of the matrix into blocks A, B, C, D (vertex sets V1 and V2) used by the recursive algorithm.]
A = A*;   % recursive call
B = AB;   C = CA;   D = D + CB;
D = D*;   % recursive call
B = BD;   C = DC;   A = A + BC;
+ is “min”, × is “add”
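A minimal serial sketch of this recursion over the (min,+) semiring is given below, where A* denotes the shortest-path closure of a block; it follows the block structure shown above but none of the communication-avoiding distribution. The input matrix is illustrative and INF marks absent edges.

```python
# Minimal serial sketch of the recursive (min,+) APSP formulation shown above:
# "multiplication" is min-plus, and A* is the shortest-path closure of a block.
# No parallelism or communication avoidance here; the matrix is illustrative.
INF = float("inf")

def minplus(X, Y):
    n, m, p = len(X), len(Y[0]), len(Y)
    return [[min(X[i][k] + Y[k][j] for k in range(p)) for j in range(m)]
            for i in range(n)]

def elemmin(X, Y):
    return [[min(a, b) for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def closure(A):                          # A* via the 2x2 block recursion
    n = len(A)
    if n == 1:
        return [[min(A[0][0], 0)]]
    h = n // 2
    Ablk = [r[:h] for r in A[:h]]; B = [r[h:] for r in A[:h]]
    C = [r[:h] for r in A[h:]];    D = [r[h:] for r in A[h:]]
    Ablk = closure(Ablk)                         # A = A*
    B = minplus(Ablk, B); C = minplus(C, Ablk)   # B = AB; C = CA
    D = elemmin(D, minplus(C, B))                # D = D + CB
    D = closure(D)                               # D = D*
    B = minplus(B, D); C = minplus(D, C)         # B = BD; C = DC
    Ablk = elemmin(Ablk, minplus(B, C))          # A = A + BC
    return [ra + rb for ra, rb in zip(Ablk, B)] + [rc + rd for rc, rd in zip(C, D)]

W = [[0, 3, INF, 7], [8, 0, 2, INF], [5, INF, 0, 1], [2, INF, INF, 0]]
print(closure(W))   # all-pairs shortest path distances; row 0 is [0, 3, 5, 6]
```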
Solomonik, Buluç, Demmel. "Minimizing communication in all-pairs shortest paths." IPDPS, 2013.
Strong scaling on Hopper (Cray XE6, up to 1024 nodes = 24,576 cores); the figure annotations mark 6.2x and 2x speedups.
Algorithm based on Kleene’s recursive formulation on the (min,+) semiring
Communication-avoiding All-Pairs Shortest-Paths
Matrix multiplication: ∀(i,j) ∈ n × n, C(i,j) = Σ_k A(i,k)·B(k,j).
Previous sparse classical lower bound [Ballard et al., SIMAX'11]:
W = Ω( #FLOPs / (P·√M) ) = Ω( d²n / (P·√M) )
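The step from the general bound to the d²n form can be reconstructed as follows; the only ingredient added here is the standard expected flop count for multiplying two Erdős-Rényi(n,d) matrices.

```latex
% Expected flops for ER(n,d) x ER(n,d): each of the n^2 inner products
% contributes about d^2/n flops in expectation (for each of the n summands,
% both factors are nonzero with probability d/n).
\mathbb{E}\left[\#\mathrm{FLOPs}\right] \approx n^2 \cdot \frac{d^2}{n} = d^2 n
\quad\Longrightarrow\quad
W = \Omega\!\left(\frac{\#\mathrm{FLOPs}}{P\sqrt{M}}\right)
  = \Omega\!\left(\frac{d^2 n}{P\sqrt{M}}\right).
```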
New lower bound for Erdős–Rényi(n,d):
W = Ω( min{ dn/√P , d²n/P } )
• Expected (under some technical assumptions)
• Two new algorithms (3D iterative & recursive) that attain the new lower bound
• No previous algorithm attains this bound
Ballard, Buluç, Demmel, Grigori, Lipshitz, Schwartz, and Toledo. Communication optimal parallel multiplication of sparse random matrices. In SPAA 2013.
Assumption: the assignment of data and work to processors is sparsity-pattern-independent.
Communication Optimal Sparse Matrix Multiplication