
EDGAR - Lawrence Berkeley National Laboratory · 2017-08-04


Page 1:

EDGAR
• Energy-efficient Data and Graph Algorithms Research
• Funded by Applied Math, ASCR
• Program manager: Sandy Landsberg
• Early Career Research Program (start: 2013)
• PI: Aydın Buluç (Berkeley Lab)
• Postdoctoral Fellows:
  – Ariful Azad (100%, Feb 2014-now)
  – Harsha Simhadri (40%, Sep 2013-Sep 2014)
• Students (short term):
  – Chaitanya Aluru (Nov 2014-now)
  – Eric Lee (Summer 2014)

Page 2:

Problem: maximum cardinality matching in a bipartite graph.
Application: block triangular forms of matrices, least squares problems, circuit simulation, weighted matching.
Algorithm: search for disjoint augmenting paths that alternate between matched and unmatched edges.

Maximum Cardinality Matching

[Figure: a bipartite graph with left vertices x1, x2, x3 and right vertices y1, y2, y3; the legend distinguishes matched and unmatched vertices and edges.]

Innovation: re-use search trees created in one phase in the next phase by grafting branches of trees. This significantly reduces work in the tree traversal and exposes more parallelism.

Ariful Azad, Aydın Buluç, and Alex Pothen. A parallel tree grafting algorithm for maximum cardinality matching in bipartite graphs. In Proceedings of IPDPS, 2015.
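The augmenting-path search described above can be sketched in a few lines. This is a minimal serial textbook version (function names are ours), not the paper's parallel tree-grafting algorithm, which additionally re-uses search trees across phases:

```python
def max_cardinality_matching(adj, n_left):
    """Maximum cardinality bipartite matching via augmenting paths.
    adj[u] lists the right-side neighbors of left vertex u."""
    match_left = [None] * n_left   # match_left[u] = right partner of u
    match_right = {}               # match_right[v] = left partner of v

    def try_augment(u, visited):
        # DFS for a path alternating between unmatched and matched
        # edges, starting from the free left vertex u.
        for v in adj[u]:
            if v in visited:
                continue
            visited.add(v)
            # v is free, or its current partner can be re-matched elsewhere
            if v not in match_right or try_augment(match_right[v], visited):
                match_left[u] = v
                match_right[v] = u
                return True
        return False

    for u in range(n_left):
        try_augment(u, set())
    return match_left
```

One augmenting-path search per free vertex gives the classical serial baseline; Pothen-Fan and MS-BFS-Graft instead run many such searches per phase in parallel.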

Page 3:

[Figure: relative performance of Push-Relabel, Pothen-Fan, and MS-BFS-Graft (y-axis: relative performance, 0-16) on kkt_power, hugetrace, delaunay, rgg_n24_s0, coPapersDBLP, amazon0312, cit-Patents, RMAT, road_usa, wb-edu, web-Google, and wikipedia; off-scale bars labeled 18, 35, 29, and 42.]

Performance: on a 40-core Intel Westmere-EX, on average 7x faster than the current best algorithm; can be up to 42x faster.

[Figure: speedup (up to 32x, log scale) vs. number of threads (1-64) for scientific and scale-free networks, with annotations marking the change of socket and the hyperthreading region.]

Scaling: one node of Edison (24-core Intel Ivy Bridge). On average 17x speedup relative to the serial algorithm.

Future direction: expose parallelism suitable for distributed matching (no practical algorithm is available); apply to static pivoting for solving sparse systems of linear equations.

Page 4:

Parallel Algorithms for De Novo Genome Assembly

Pipeline: reads → (1) k-mer analysis → k-mers → (2) graph construction & traversal (contig generation) → contigs → (3) scaffolding (alignment) → scaffolds.

Example contig generation from 3-mers:
  GAT ATC TCT CTG TGA → Contig 1: GATCTGA
  AAC ACC CCG → Contig 2: AACCG
  AAT ATG TGC → Contig 3: AATGC

[Figure: scaling of steps 1 and 2 on NERSC Edison with the human genome; time in seconds (0-1,400) and parallel efficiency (0-1) for k-mer analysis, graph construction & traversal, and combined time, on 480-7,680 cores.]

Improvement for the first two steps:
  Human: 60 hours → 2 minutes
  Wheat: 140 hours → 15 minutes

Evangelos Georganas, Aydın Buluç, Jarrod Chapman, Leonid Oliker, Daniel Rokhsar, and Katherine Yelick. Parallel de Bruijn graph construction and traversal for de novo genome assembly. In Supercomputing (SC'14).
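Contig generation (step 2) can be illustrated with a toy serial sketch that builds a de Bruijn graph over (k-1)-mer nodes and walks non-branching chains. The names are ours, and it ignores reverse complements, branching, and cycles, all of which the paper's distributed hash-table implementation must handle:

```python
def contigs_from_kmers(kmers):
    """Assemble contigs by walking unique-extension chains in a
    de Bruijn graph whose nodes are (k-1)-mers."""
    succ, pred = {}, {}
    for km in kmers:
        # edge from the k-mer's prefix node to its suffix node
        succ.setdefault(km[:-1], []).append(km[1:])
        pred.setdefault(km[1:], []).append(km[:-1])
    contigs = []
    # start from nodes with no predecessor and extend while the
    # extension is unique (a non-branching chain)
    for node in [n for n in succ if n not in pred]:
        contig = node
        while node in succ and len(succ[node]) == 1:
            node = succ[node][0]
            contig += node[-1]
        contigs.append(contig)
    return contigs
```

For example, the k-mers AAC, ACC, CCG chain into the contig AACCG, matching the slide's example.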

Page 5:

• In de novo assembly, billions of reads must be aligned to contigs
• First aligner to parallelize the seed index construction ("fully" parallel)

Evangelos Georganas, Aydın Buluç, Jarrod Chapman, Leonid Oliker, Daniel Rokhsar, and Katherine Yelick. merAligner: A fully parallel sequence aligner. In Proceedings of IPDPS, 2015.

Parallel Genome Alignment for De Novo Assembly

[Figure: strong scaling on 480-15,360 cores; time in seconds (128-16,384, log scale) for merAligner-wheat, ideal-wheat, merAligner-human, ideal-human, BWAmem-human, and Bowtie2-human.]
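The seed index whose construction merAligner parallelizes can be shown serially. This toy sketch (hypothetical function names, no extension/alignment step) maps each k-mer of the contigs to its positions and probes it with seeds drawn from a read:

```python
def build_seed_index(contigs, k):
    """Map every k-mer (seed) to the (contig id, offset) positions
    where it occurs -- the lookup structure a seed-and-extend
    aligner probes once per read seed."""
    index = {}
    for cid, contig in enumerate(contigs):
        for off in range(len(contig) - k + 1):
            index.setdefault(contig[off:off + k], []).append((cid, off))
    return index

def candidate_contigs(index, read, k):
    """Return ids of contigs sharing at least one seed with the read."""
    hits = set()
    for off in range(len(read) - k + 1):
        for cid, _ in index.get(read[off:off + k], []):
            hits.add(cid)
    return hits
```

In the paper's setting this index is a distributed hash table built cooperatively by all processors; here it is a plain Python dict.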

Page 6:

Fast Parallel Block Eigensolvers

• Block iterative methods have data locality and cache re-use, and expose more parallelism
• Sparse matrix times multiple-vector (block) multiplication
• Motivated by nuclear structure calculations
• Applications: data analysis & spectral clustering

H. Metin Aktulga, Aydın Buluç, Samuel Williams, and Chao Yang. Optimizing sparse matrix-multiple vectors multiplication for nuclear configuration interaction calculations. In Proceedings of IPDPS, 2014.

[Figure: an n×n matrix with nnz nonzeros stored in β×β blocks, and GFlop/s vs. number of vectors m (1-48) for SpMM and SpMM_T, comparing CSB/OpenMP, CSB/Cilk, rowpart/OpenMP, and CSR/OpenMP.]
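The kernel in question, a sparse matrix times a block of m vectors (SpMM), can be sketched over raw CSR arrays. The toy loop below only illustrates why blocking helps (each nonzero of A is fetched once and applied to a whole row of X); the paper's CSB/OpenMP and CSB/Cilk kernels are far more elaborate:

```python
import numpy as np

def spmm(indptr, indices, data, X):
    """CSR sparse matrix times a dense block of m vectors.
    Each nonzero A[i,k] is read once and applied to the whole row
    X[k, :], amortizing the irregular access over all m vectors."""
    n = len(indptr) - 1
    Y = np.zeros((n, X.shape[1]))
    for i in range(n):
        for p in range(indptr[i], indptr[i + 1]):
            Y[i, :] += data[p] * X[indices[p], :]
    return Y
```

With m = 1 this degenerates to ordinary SpMV, where every nonzero access touches only one scalar of X and the kernel is memory-bound.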

Page 7:

Adam Lugowski, Shoaib Kamil, Aydın Buluç, Sam Williams, Erika Duriakova, Leonid Oliker, Armando Fox, John Gilbert. Parallel processing of filtered queries in attributed semantic graphs. Journal of Parallel and Distributed Computing (JPDC), 2014.

Filtered Semantic Graph Processing

[Diagram: in standard KDT, the KDT algorithm calls CombBLAS primitives with filters and semirings written in Python; in KDT+SEJITS, SEJITS translates the Python filters and semirings into C++ at runtime.]

• Filters enable efficient processing of semantic graphs (graphs with edge/vertex attributes)
• Semirings enable customization of graph algorithms using matrix primitives
• Filters & semirings can be a performance bottleneck in high-level languages
• SEJITS enables just-in-time compilation of filters/semirings into Combinatorial BLAS, matching the performance of low-level code

[Figure: mean BFS time (seconds, log scale, 0.1-10) vs. filter permeability (1%-100%) for Python/Python KDT, Python/SEJITS KDT, SEJITS/SEJITS KDT, and C++/C++ CombBLAS.]
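The filter/semiring idea can be miniaturized: BFS expressed as repeated sparse matrix-vector products over the (OR, AND) semiring, with an edge filter evaluated per nonzero. This dict-of-edges serial toy is our own construction; it only shows where a slow per-edge Python callback would sit on the critical path:

```python
def filtered_bfs(edges, source, edge_filter=lambda attr: True):
    """BFS as repeated 'SpMV' over the (OR, AND) semiring on a
    filtered edge set. edges maps (u, v) -> edge attribute."""
    parent = {source: source}
    frontier = {source}
    while frontier:
        nxt = set()
        # one semiring matvec: expand the frontier along edges
        # that survive the filter, keeping only new vertices
        for (u, v), attr in edges.items():
            if u in frontier and v not in parent and edge_filter(attr):
                parent[v] = u
                nxt.add(v)
        frontier = nxt
    return parent
```

The `edge_filter` callback is invoked once per candidate edge, which is exactly why interpreting it in Python dominates the runtime and why SEJITS translation to C++ pays off.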

Page 8:

• The Graph BLAS Forum: http://istc-bigdata.org/GraphBlas/
• Graph Algorithms Building Blocks (GABB workshop at IPDPS'14 and IPDPS'15): http://www.graphanalysis.org/workshop2015.html

Abstract: It is our view that the state of the art in constructing a large collection of graph algorithms in terms of linear algebraic operations is mature enough to support the emergence of a standard set of primitive building blocks. This paper is a position paper defining the problem and announcing our intention to launch an open effort to define this standard.

The Graph BLAS Effort

Page 9:

Up to 8x faster on Kronecker (Graph500) inputs (largest run: 16 billion vertices and 256 billion edges).

For a fixed-size real input, the direction-optimizing algorithm needs 1/16th of the processors (and energy).

Implemented on top of Combinatorial BLAS (i.e., uses 2D decomposition).

Buluç, Beamer, Madduri, Asanović, Patterson. "Distributed-Memory Breadth-First Search on Massive Graphs." Chapter in Parallel Graph Algorithms, Bader (ed.), CRC Press / Taylor & Francis (2015), to appear.
Beamer, Buluç, Asanović, Patterson. "Distributed Memory Breadth-First Search Revisited: Enabling Bottom-Up Search." IPDPSW'13.

Direction-optimizing BFS on distributed memory
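The direction-optimizing idea can be sketched serially, with an assumed (and deliberately simplistic) switching rule: run top-down while the frontier is small, and bottom-up (each unvisited vertex scans its neighbors for a visited parent) once it grows. The real algorithm uses an edge-count heuristic and a distributed 2D layout:

```python
def direction_optimizing_bfs(adj, source, threshold_frac=0.25):
    """BFS on a symmetric adjacency list that switches between
    top-down and bottom-up traversal based on frontier size."""
    n = len(adj)
    parent = [-1] * n
    parent[source] = source
    frontier = [source]
    while frontier:
        nxt = []
        if len(frontier) <= threshold_frac * n:
            # top-down: expand every frontier vertex's out-edges
            for u in frontier:
                for v in adj[u]:
                    if parent[v] == -1:
                        parent[v] = u
                        nxt.append(v)
        else:
            # bottom-up: each unvisited vertex looks for any parent
            # in the frontier and stops at the first hit
            in_frontier = set(frontier)
            for v in range(n):
                if parent[v] == -1:
                    for u in adj[v]:
                        if u in in_frontier:
                            parent[v] = u
                            nxt.append(v)
                            break
        frontier = nxt
    return parent
```

The bottom-up pass wins on large frontiers because an unvisited vertex can stop scanning as soon as it finds one frontier neighbor, skipping the bulk of its edges.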

Page 10:

[Figure: GFlops vs. number of compute nodes (1-1,024) for n=4,096 and n=8,192 at replication factors c=1, c=4, and c=16; inset: the 2×2 block partition of the matrix into A, B, C, D.]

A = A*;        % recursive call
B = AB;  C = CA;  D = D + CB;
D = D*;        % recursive call
B = BD;  C = DC;  A = A + BC;

where + is "min" and × is "add"


Solomonik, Buluç, Demmel. "Minimizing communication in all-pairs shortest paths." IPDPS, 2013.

Strong scaling on Hopper (Cray XE6, up to 1,024 nodes = 24,576 cores), with 6.2x and 2x speedups.

Algorithm based on Kleene's recursive formulation on the (min,+) semiring.

Communication-avoiding All-Pairs Shortest-Paths
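The recursion above can be executed directly on the (min,+) semiring. Below is a serial NumPy sketch of our own (assuming n is a power of two and zero diagonal), with none of the communication-avoiding data layout that is the paper's actual contribution:

```python
import numpy as np

def min_plus(A, B):
    """(min,+) 'multiplication': C[i][j] = min_k A[i][k] + B[k][j]."""
    n, p, m = A.shape[0], B.shape[0], B.shape[1]
    return np.array([[min(A[i, k] + B[k, j] for k in range(p))
                      for j in range(m)]
                     for i in range(n)])

def kleene_apsp(D):
    """Closure D* via the 2x2 blocked recursion from the slide:
    A = A*; B = AB; C = CA; D = D + CB;
    D = D*; B = BD; C = DC; A = A + BC."""
    n = D.shape[0]
    if n == 1:
        return np.minimum(D, 0)   # zero-length path to self
    h = n // 2
    A, B = D[:h, :h], D[:h, h:]
    C, Dd = D[h:, :h], D[h:, h:]
    A = kleene_apsp(A)
    B = min_plus(A, B)
    C = min_plus(C, A)
    Dd = np.minimum(Dd, min_plus(C, B))   # D = D + CB, '+' is min
    Dd = kleene_apsp(Dd)
    B = min_plus(B, Dd)
    C = min_plus(Dd, C)
    A = np.minimum(A, min_plus(B, C))     # A = A + BC
    return np.block([[A, B], [C, Dd]])
```

On a 4-vertex path graph with a costly direct shortcut, the closure recovers the cheaper 3-hop shortest path.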

Page 11:

Matrix multiplication: ∀(i,j) ∈ n × n,

    C(i,j) = Σk A(i,k)·B(k,j)


Previous sparse classical lower bound [Ballard, et al. SIMAX'11]:

    W = Ω(#FLOPs / (P·√M)), which for Erdős-Rényi(n,d) gives W = Ω(d²n / (P·√M))

New lower bound for Erdős-Rényi(n,d):

    W = Ω( min{ dn/√P, d²n/P } )

• Expected (under some technical assumptions)
• Two new algorithms (3D iterative & recursive) that attain the new lower bound
• No previous algorithm attains it

Ballard, Buluç, Demmel, Grigori, Lipshitz, Schwartz, and Toledo. Communication optimal parallel multiplication of sparse random matrices. In SPAA 2013.

Assumption: assignment of data and work to processors is sparsity-pattern-independent.

Communication Optimal Sparse Matrix Multiplication
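The quantity #FLOPs in the bounds above is the number of scalar multiplies in sparse C = A·B, which a toy coordinate-format SpGEMM (names are ours) can count directly; for Erdős-Rényi(n,d) inputs it is about d²n in expectation:

```python
from collections import defaultdict

def spgemm(A, B):
    """Sparse C = A*B from coordinate dicts {(i,k): value}; also
    counts the scalar multiplies (#FLOPs), the quantity appearing
    in the communication lower bounds."""
    rows_B = defaultdict(list)
    for (k, j), v in B.items():
        rows_B[k].append((j, v))
    C = defaultdict(float)
    flops = 0
    # each nonzero A[i,k] multiplies every nonzero in row k of B
    for (i, k), a in A.items():
        for j, b in rows_B.get(k, []):
            C[(i, j)] += a * b
            flops += 1
    return dict(C), flops
```

With d nonzeros per row in both factors, each of the dn nonzeros of A meets about d nonzeros in the matching row of B, giving the d²n expected-flops figure used in the bounds.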