EDGAR: Energy-efficient Data and Graph Algorithms Research
• Funded by Applied Math, ASCR
• Program manager: Sandy Landsberg
• Early Career Research Program (start: 2013)
• PI: Aydın Buluç (Berkeley Lab)
• Postdoctoral Fellows:
  – Ariful Azad (100%, Feb 2014–now)
  – Harsha Simhadri (40%, Sep 2013–Sep 2014)
• Students (short term):
  – Chaitanya Aluru (Nov 2014–now)
  – Eric Lee (Summer 2014)
Problem: maximum cardinality matching in bipartite graphs. Application: block triangular forms of matrices, least squares problems, circuit simulation, weighted matching. Algorithm: search for disjoint paths that alternate between matched and unmatched edges.
Maximum Cardinality Matching

[Figure: bipartite graph with vertices x1, x2, x3 and y1, y2, y3; the legend distinguishes matched and unmatched vertices and edges.]
Innovation: re-use search trees created in one phase in the next phase by grafting branches of trees. This significantly reduces work in the tree traversal and exposes more parallelism.
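For reference, the underlying search that the tree-grafting approach accelerates can be sketched serially: repeatedly look for an augmenting path from an unmatched left vertex and flip its matched and unmatched edges. The sketch below is the classic serial augmenting-path method (Kuhn's algorithm), not the parallel MS-BFS-Graft algorithm of the paper; the example graph on x1..x3 and y1..y3 is illustrative.

```python
# Minimal serial sketch of augmenting-path bipartite matching (Kuhn's
# algorithm). It illustrates the search the tree-grafting method accelerates;
# it is NOT the parallel MS-BFS-Graft algorithm. Example data is illustrative.

def max_bipartite_matching(adj):
    """adj maps each left vertex to the right vertices it is connected to."""
    match_right = {}                      # right vertex -> matched left vertex

    def try_augment(u, visited):
        # Depth-first search for an augmenting path starting at left vertex u.
        for v in adj[u]:
            if v in visited:
                continue
            visited.add(v)
            # v is free, or its current partner can be re-matched elsewhere.
            if v not in match_right or try_augment(match_right[v], visited):
                match_right[v] = u
                return True
        return False

    matched = 0
    for u in adj:                         # one augmentation attempt per left vertex
        if try_augment(u, set()):
            matched += 1
    return matched, match_right

# Example with left vertices x1..x3 and right vertices y1..y3 (as in the figure).
graph = {"x1": ["y1", "y2"], "x2": ["y1"], "x3": ["y2", "y3"]}
print(max_bipartite_matching(graph))      # (3, {'y1': 'x2', 'y2': 'x1', 'y3': 'x3'})
```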
Ariful Azad, Aydın Buluç, and Alex Pothen. A parallel tree grafting algorithm for maximum cardinality matching in bipartite graphs. In Proceedings of the IPDPS, 2015.
[Figure: relative performance of Push-Relabel, Pothen-Fan, and MS-BFS-Graft on kkt_power, hugetrace, delaunay, rgg_n24_s0, coPapersDBLP, amazon0312, cit-Patents, RMAT, road_usa, wb-edu, web-Google, and wikipedia.]
Performance: on a 40-core Intel Westmere-EX, on average 7x faster than the current best algorithm, and up to 42x faster.
[Figure: speedup vs. number of threads (1 to 64) on scientific and scale-free networks; annotations mark the onset of hyperthreading and the change of socket.]
Scaling: one node of Edison (24-core Intel Ivy Bridge); on average 17x speedup relative to the serial algorithm.
Future direction: expose parallelism suitable for distributed-memory matching (no practical algorithm is available); apply to static pivoting for solving systems of sparse linear equations.
[Figure: time in seconds and parallel efficiency vs. number of cores (480 to 7680) for k-mer analysis, graph construction & traversal, and the combined time.]
[Figure: assembly pipeline: reads → k-mers → contigs → scaffolds. Step 1: k-mer analysis; Step 2: graph construction & traversal (contig generation); Step 3: scaffolding (alignment). Example: the k-mers GAT, ATC, TCT, CTG, TGA, AAC, ACC, CCG, AAT, ATG, TGC yield Contig 1: GATCTGA, Contig 2: AACCG, Contig 3: AATGC.]
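To make the contig-generation step concrete, the sketch below chains k-mers into contigs by following per-k-mer extensions, in the spirit of the hash-table traversal of Step 2. It is a minimal serial illustration only, not the distributed implementation from the paper; the extensions are hard-coded to match the figure's example (in the real pipeline they come from Step 1, the k-mer analysis).

```python
# Minimal sketch of de Bruijn contig generation: each k-mer (k=3) is stored
# with its backward and forward single-character extensions, and contigs are
# produced by walking forward from k-mers that have no backward extension.
# Illustration only; not the distributed, one-sided-communication implementation.

kmer_ext = {
    # k-mer: (backward extension, forward extension); None = no extension
    "GAT": (None, "C"), "ATC": ("G", "T"), "TCT": ("A", "G"),
    "CTG": ("T", "A"), "TGA": ("C", None),
    "AAC": (None, "C"), "ACC": ("A", "G"), "CCG": ("A", None),
    "AAT": (None, "G"), "ATG": ("A", "C"), "TGC": ("A", None),
}

def contigs(kmer_ext):
    out = []
    for kmer, (back, _) in kmer_ext.items():
        if back is not None:        # only start a walk at a left end
            continue
        contig, cur = kmer, kmer
        while True:
            fwd = kmer_ext[cur][1]
            if fwd is None:         # right end reached
                break
            contig += fwd
            cur = cur[1:] + fwd     # slide the k-mer window by one base
        out.append(contig)
    return out

print(contigs(kmer_ext))   # ['GATCTGA', 'AACCG', 'AATGC']
```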
Scaling of Steps 1 and 2 on NERSC Edison with the human genome. Improvement for the first two steps: Human: 60 hours → 2 minutes; Wheat: 140 hours → 15 minutes.
Parallel Algorithms for De Novo Genome Assembly
Evangelos Georganas, Aydın Buluç, Jarrod Chapman, Leonid Oliker, Daniel Rokhsar, and Katherine Yelick. Parallel de Bruijn graph construction and traversal for de novo genome assembly. In Supercomputing (SC'14).
• In de novo assembly, billions of reads must be aligned to contigs
• First aligner to parallelize the seed index construction ("fully" parallel)
Evangelos Georganas, Aydın Buluç, Jarrod Chapman, Leonid Oliker, Daniel Rokhsar, and Katherine Yelick. merAligner: A fully parallel sequence aligner. In Proceedings of the IPDPS, 2015.
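The seed index maps every length-k substring of the contigs to its locations, so a read can be anchored by exact seed hits before local alignment. Below is a minimal serial sketch of that index and lookup; merAligner constructs the index in a distributed, fully parallel fashion and follows the anchoring with alignment extension, both omitted here. The contigs, read, and k value are illustrative.

```python
# Minimal serial sketch of a seed (k-mer) index over contigs, illustrating the
# lookup that precedes alignment. merAligner distributes this index across
# processors; the contigs, read, and k below are illustrative only.

def build_seed_index(contigs, k):
    """Map each length-k seed to a list of (contig id, offset) locations."""
    index = {}
    for cid, contig in enumerate(contigs):
        for pos in range(len(contig) - k + 1):
            index.setdefault(contig[pos:pos + k], []).append((cid, pos))
    return index

def seed_hits(read, index, k):
    """Anchor a read by exact seed matches (candidate alignment positions)."""
    hits = []
    for pos in range(len(read) - k + 1):
        for cid, cpos in index.get(read[pos:pos + k], []):
            # Candidate: read position `pos` aligns to contig `cid` at `cpos`.
            hits.append((cid, cpos - pos))
    return hits

contigs = ["GATCTGA", "AACCG", "AATGC"]       # from the example in the previous panel
index = build_seed_index(contigs, k=3)
print(seed_hits("ATCTG", index, k=3))         # [(0, 1), (0, 1), (0, 1)]
```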
[Figure: alignment time in seconds (log scale) vs. number of cores (480 to 15360) for merAligner and ideal scaling on the wheat and human genomes, compared against BWA-mem and Bowtie2 on human.]
Parallel Genome Alignment for De novo Assembly
• Block iterative methods have data locality and cache re-use, and expose more parallelism
• Sparse matrix multiple-vector multiplication (SpMM)
• Motivated by nuclear structure calculations
• Applications: data analysis & spectral clustering
H. Metin Aktulga, Aydın Buluç, Samuel Williams, and Chao Yang. Optimizing sparse matrix-multiple vectors multiplication for nuclear configuration interaction calculations. In Proceedings of the IPDPS, 2014.
Fast Parallel Block Eigensolvers
n × n matrix with nnz nonzeros, partitioned into β × β blocks.

[Figure: block structure of the matrix, with blocks indexed (0,0), (0,1), (1,0), (1,1), each of size β × β.]
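As a point of reference for the SpMM kernel, the sketch below multiplies a CSR sparse matrix by a block of m dense vectors: each nonzero is loaded once and applied to all m vectors, which is the cache-reuse benefit block iterative methods exploit. This is not the paper's CSB/Cilk implementation; the matrix, density, and m are illustrative.

```python
# Minimal sketch of SpMM (sparse matrix times m dense vectors) on a CSR matrix.
# Each nonzero A[i,k] is loaded once and applied to all m right-hand vectors,
# which is the cache-reuse benefit of block iterative methods noted above.
# Not the paper's CSB implementation; data below are illustrative.
import numpy as np
import scipy.sparse as sp

def spmm_csr(A, X):
    """Y = A @ X for CSR A (n x n) and dense X (n x m), written out explicitly."""
    n, m = X.shape
    Y = np.zeros((A.shape[0], m))
    for i in range(A.shape[0]):
        for idx in range(A.indptr[i], A.indptr[i + 1]):
            k, val = A.indices[idx], A.data[idx]
            Y[i, :] += val * X[k, :]        # one nonzero updates all m vectors
    return Y

A = sp.random(1000, 1000, density=0.01, format="csr", random_state=0)
X = np.random.rand(1000, 8)                 # m = 8 vectors, as in block eigensolvers
print(np.allclose(spmm_csr(A, X), A @ X))   # True
```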
[Figure: GFlop/s vs. number of vectors m (1 to 48) for SpMM and SpMM_T, comparing CSB/OpenMP, CSB/Cilk, rowpart/OpenMP, and CSR/OpenMP.]
Adam Lugowski, Shoaib Kamil, Aydın Buluç, Sam Williams, Erika Duriakova, Leonid Oliker, Armando Fox, John Gilbert. Parallel processing of filtered queries in attributed semantic graphs. Journal of Parallel and Distributed Computing (JPDC), 2014.
Filtered Semantic Graph Processing
[Figure: Standard KDT vs. KDT+SEJITS. In standard KDT, the KDT algorithm, filter, and semiring are written in Python on top of CombBLAS primitives in C++; with SEJITS translation, the Python filter and semiring are translated to C++.]
• Filters enable efficient processing of semantic graphs (with edge/vertex attributes)
• Semirings enable customization of graph algorithms using matrix primitives
• Filters & semirings can be a performance bottleneck in high-level languages
• SEJITS enables just-in-time compilation of filters/semirings into Combinatorial BLAS, matching the performance of low-level code
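As a concrete (if simplified) picture of filters and semirings: one BFS frontier expansion can be written as a matrix-vector product over a Boolean semiring, with an edge filter applied to the attributes. The sketch below is a plain serial illustration, not KDT, CombBLAS, or SEJITS-generated code; the graph, attributes, and filter are made up for the example.

```python
# Minimal sketch: one BFS frontier expansion as a "filtered semiring SpMV".
# Edges carry an attribute; the filter keeps only some edges, and the semiring
# (OR, AND over booleans) replaces (+, *) of an ordinary matrix-vector product.
# Illustrative only; not KDT, CombBLAS, or SEJITS-generated code.

# Attributed adjacency "matrix" as a dict: (src, dst) -> edge attribute.
edges = {(0, 1): "email", (0, 2): "phone", (1, 3): "email", (2, 3): "phone"}

def filtered_frontier_expand(edges, frontier, visited, edge_filter):
    """Return vertices reachable in one hop through edges passing the filter."""
    next_frontier = set()
    for (u, v), attr in edges.items():
        # semiring "multiply": edge exists AND u is in the frontier AND filter passes
        if u in frontier and edge_filter(attr) and v not in visited:
            next_frontier.add(v)   # semiring "add": logical OR of contributions
    return next_frontier

def only_email(attr):
    return attr == "email"

frontier, visited = {0}, {0}
while frontier:
    frontier = filtered_frontier_expand(edges, frontier, visited, only_email)
    visited |= frontier
print(visited)   # {0, 1, 3}: vertex 2 is unreachable through "email" edges
```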
[Figure: mean BFS time in seconds (log scale) vs. filter permeability (1%, 10%, 100%) for Python/Python KDT, Python/SEJITS KDT, SEJITS/SEJITS KDT, and C++/C++ CombBLAS.]
• The Graph BLAS Forum: http://istc-bigdata.org/GraphBlas/
• Graph Algorithms Building Blocks (GABB workshop at IPDPS'14 and IPDPS'15): http://www.graphanalysis.org/workshop2015.html
Abstract: It is our view that the state of the art in constructing a large collection of graph algorithms in terms of linear algebraic operations is mature enough to support the emergence of a standard set of primitive building blocks. This paper is a position paper defining the problem and announcing our intention to launch an open effort to define this standard.
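As a small illustration of expressing a graph algorithm through linear-algebra primitives: BFS levels can be computed by repeatedly multiplying the transposed adjacency matrix with the current frontier vector over a Boolean semiring. The sketch below does not use the GraphBLAS API that the effort aims to standardize; the graph is illustrative.

```python
# Minimal sketch of "graph algorithm as linear algebra": BFS levels via
# repeated sparse matrix-vector products over a Boolean semiring.
# Illustrative only; this does not use the GraphBLAS API itself.
import numpy as np
import scipy.sparse as sp

# Directed graph 0->1, 0->2, 1->3, 2->3 as a sparse adjacency matrix A.
A = sp.csr_matrix(([1, 1, 1, 1], ([0, 0, 1, 2], [1, 2, 3, 3])), shape=(4, 4))

levels = np.full(4, -1)
frontier = np.zeros(4, dtype=bool)
frontier[0] = True                     # start BFS from vertex 0
level = 0
while frontier.any():
    levels[frontier] = level
    # y = A^T x over the Boolean semiring: OR over edges leaving the frontier
    reached = (A.T @ frontier.astype(int)) > 0
    frontier = reached & (levels == -1)     # drop already-visited vertices
    level += 1
print(levels)                          # [0 1 1 2]
```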
The Graph BLAS Effort
Up to 8x faster on Kronecker (Graph500) inputs (largest run on 16 billion vertices and 256 billion edges)
For a fixed-size real input, the direction-optimizing algorithm needs 1/16th of the processors (and energy)
Implemented on top of Combinatorial BLAS (i.e., uses a 2D decomposition)
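For illustration, the direction-optimizing idea in serial form: expand the frontier top-down while it is small, and switch to bottom-up (each unvisited vertex scans its neighbors for a parent in the frontier) once it grows, skipping most edge checks. The sketch below shows only the switching heuristic, not the 2D distributed CombBLAS implementation; the switching threshold is illustrative.

```python
# Minimal serial sketch of direction-optimizing BFS: top-down while the
# frontier is small, bottom-up once it is large. The switching rule here
# (frontier no larger than a quarter of the vertices) is illustrative, not
# the tuned heuristic from the papers, and this is not the distributed code.

def direction_optimizing_bfs(adj, source):
    n = len(adj)
    parent = {source: source}
    frontier = {source}
    while frontier:
        if len(frontier) <= n // 4:
            # Top-down: expand outgoing edges of the (small) frontier.
            nxt = set()
            for u in frontier:
                for v in adj[u]:
                    if v not in parent:
                        parent[v] = u
                        nxt.add(v)
        else:
            # Bottom-up: each unvisited vertex looks for any neighbor in the
            # frontier, stopping at the first hit instead of scanning all edges.
            nxt = set()
            for v in range(n):
                if v in parent:
                    continue
                for u in adj[v]:          # assumes a symmetric (undirected) adj
                    if u in frontier:
                        parent[v] = u
                        nxt.add(v)
                        break
        frontier = nxt
    return parent

# Small undirected example: a path 0-1-2-3 plus the edge 1-3.
adj = {0: [1], 1: [0, 2, 3], 2: [1, 3], 3: [1, 2]}
print(direction_optimizing_bfs(adj, 0))   # {0: 0, 1: 0, 2: 1, 3: 1}
```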
Buluç, Beamer, Madduri, Asanović, Patterson. "Distributed-Memory Breadth-First Search on Massive Graphs." Chapter in Parallel Graph Algorithms, Bader (editor), CRC Press / Taylor & Francis (2015), to appear.
Beamer, Buluç, Asanović, Patterson. "Distributed Memory Breadth-First Search Revisited: Enabling Bottom-Up Search." IPDPSW'13.
Direction-optimizing BFS on distributed memory
[Figure: GFlops vs. number of compute nodes (1 to 1024) for n = 4096 and n = 8192 with c = 1, 4, 16; and the 2 × 2 partition of the matrix into blocks A, B, C, D (vertex sets V1 and V2) used by the recursive algorithm.]
A = A*;   % recursive call
B = AB;   C = CA;   D = D + CB;
D = D*;   % recursive call
B = BD;   C = DC;   A = A + BC;
+ is “min”, × is “add”
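A minimal serial sketch of this recursion over the (min,+) semiring is given below, where A* denotes the shortest-path closure of a block; it follows the block structure shown above but none of the communication-avoiding distribution. The input matrix is illustrative and INF marks absent edges.

```python
# Minimal serial sketch of the recursive (min,+) APSP formulation shown above:
# "multiplication" is min-plus, and A* is the shortest-path closure of a block.
# No parallelism or communication avoidance here; the matrix is illustrative.
INF = float("inf")

def minplus(X, Y):
    n, m, p = len(X), len(Y[0]), len(Y)
    return [[min(X[i][k] + Y[k][j] for k in range(p)) for j in range(m)]
            for i in range(n)]

def elemmin(X, Y):
    return [[min(a, b) for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def closure(A):                          # A* via the 2x2 block recursion
    n = len(A)
    if n == 1:
        return [[min(A[0][0], 0)]]
    h = n // 2
    Ablk = [r[:h] for r in A[:h]]; B = [r[h:] for r in A[:h]]
    C = [r[:h] for r in A[h:]];    D = [r[h:] for r in A[h:]]
    Ablk = closure(Ablk)                         # A = A*
    B = minplus(Ablk, B); C = minplus(C, Ablk)   # B = AB; C = CA
    D = elemmin(D, minplus(C, B))                # D = D + CB
    D = closure(D)                               # D = D*
    B = minplus(B, D); C = minplus(D, C)         # B = BD; C = DC
    Ablk = elemmin(Ablk, minplus(B, C))          # A = A + BC
    return [ra + rb for ra, rb in zip(Ablk, B)] + [rc + rd for rc, rd in zip(C, D)]

W = [[0, 3, INF, 7], [8, 0, 2, INF], [5, INF, 0, 1], [2, INF, INF, 0]]
print(closure(W))   # all-pairs shortest path distances; row 0 is [0, 3, 5, 6]
```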
Solomonik, Buluç, Demmel. "Minimizing communication in all-pairs shortest paths." IPDPS, 2013.
Strong scaling on Hopper (Cray XE6, up to 1024 nodes = 24,576 cores); the figure annotations mark 6.2x and 2x speedups.
Algorithm based on Kleene’s recursive formulation on the (min,+) semiring
Communication-avoiding All-Pairs Shortest-Paths
Matrix multiplication: ∀(i,j) ∈ n × n, C(i,j) = Σ_k A(i,k)·B(k,j).
Previous sparse classical lower bound [Ballard et al., SIMAX'11]:
W = Ω( #FLOPs / (P·√M) ) = Ω( d²n / (P·√M) )
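The step from the general bound to the d²n form can be reconstructed as follows; the only ingredient added here is the standard expected flop count for multiplying two Erdős-Rényi(n,d) matrices.

```latex
% Expected flops for ER(n,d) x ER(n,d): each of the n^2 inner products
% contributes about d^2/n flops in expectation (for each of the n summands,
% both factors are nonzero with probability d/n).
\mathbb{E}\left[\#\mathrm{FLOPs}\right] \approx n^2 \cdot \frac{d^2}{n} = d^2 n
\quad\Longrightarrow\quad
W = \Omega\!\left(\frac{\#\mathrm{FLOPs}}{P\sqrt{M}}\right)
  = \Omega\!\left(\frac{d^2 n}{P\sqrt{M}}\right).
```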
New lower bound for Erdős–Rényi(n,d):
W = Ω( min{ dn/√P , d²n/P } )
• Expected (under some technical assumptions)
• Two new algorithms (3D iterative & recursive) that attain the new lower bound
• No previous algorithm attains this bound
Ballard, Buluç, Demmel, Grigori, Lipshitz, Schwartz, and Toledo. Communication optimal parallel multiplication of sparse random matrices. In SPAA 2013.
Assumption: the assignment of data and work to processors is sparsity-pattern-independent.
Communication Optimal Sparse Matrix Multiplication