28
LLVM Optimizations for PGAS Programs Case Study: LLVM Wide Pointer Optimizations in Chapel CHIUW2014 (colocated with IPDPS 2014), Phoenix, Arizona Akihiro Hayashi , Rishi Surendran, Jisheng Zhao, Vivek Sarkar (Rice University), Michael Ferguson (Laboratory for Telecommunication Sciences) 1

LLVM Optimizationsfor PGAS Programs · uniformly optimize PGAS Programs! ... Thistalk discussesthe prosand consof LLVM based communication optimizationsfor Chapel ... optimizations

  • Upload
    others

  • View
    11

  • Download
    0

Embed Size (px)

Citation preview

Page 1: LLVM Optimizationsfor PGAS Programs · uniformly optimize PGAS Programs! ... Thistalk discussesthe prosand consof LLVM based communication optimizationsfor Chapel ... optimizations

LLVM  Optimizations  for  PGAS  Programs-‐‑‒Case  Study:  LLVM  Wide  Pointer  Optimizations  in  Chapel-‐‑‒

CHIUW2014  (co-‐‑‒located  with  IPDPS  2014),  Phoenix,  Arizona

Akihiro  Hayashi,  Rishi  Surendran,  Jisheng  Zhao,  Vivek  Sarkar  

(Rice  University),  Michael  Ferguson

 (Laboratory  for  Telecommunication  Sciences)1

Page 2: LLVM Optimizationsfor PGAS Programs · uniformly optimize PGAS Programs! ... Thistalk discussesthe prosand consof LLVM based communication optimizationsfor Chapel ... optimizations

Background:  Programming  Model  for  Large-‐‑‒scale  Systemsp Message  Passing  Interface  (MPI)  is  a  ubiquitous  programming  model  n but  introduces  non-‐‑‒trivial  complexity  due  to  message  passing  semantics

p PGAS  languages  such  as  Chapel,  X10,  Habanero-‐‑‒C  and  Co-‐‑‒array  Fortran  provide  high-‐‑‒productivity  features:n Task  parallelismn Data  Distributionn Synchronization

2

Page 3: LLVM Optimizationsfor PGAS Programs · uniformly optimize PGAS Programs! ... Thistalk discussesthe prosand consof LLVM based communication optimizationsfor Chapel ... optimizations

Motivation:  Chapel  Support  for  LLVM

p Widely  used  and  easy  to  Extend3

LLVM  Intermediate  Representa1on  (IR)  

x86  Binary  

C/C++  Frontend  Clang  

C/C++,  Fortran,  Ada,  Objec1ve-­‐C  Frontend  dragonegg  

Chapel  Compiler  chpl  

PPC  Binary   ARM  Binary  

x86  backend  

Power  PC  backend  

ARM  backend  

PTX  backend  

Analysis  &  Op1miza1ons  

GPU  Binary  

UPC  Compiler  

Page 4: LLVM Optimizationsfor PGAS Programs · uniformly optimize PGAS Programs! ... Thistalk discussesthe prosand consof LLVM based communication optimizationsfor Chapel ... optimizations

A  Big  Picture

4 Pictures  borrowed  from  hSp://chapel.cray.com/logo.html,  hSp://llvm.org/Logo.html,  hSp://upc.lbl.gov/,  hSp://commons.wikimedia.org/,  hSps://www.olcf.ornl.gov/1tan/  

Habanero-‐‑‒C,  ...

©  Argonne  Na1onal  Lab.  

©Oak  Ridge  Na1onal  Lab.  

©  RIKEN  AICS  

Page 5: LLVM Optimizationsfor PGAS Programs · uniformly optimize PGAS Programs! ... Thistalk discussesthe prosand consof LLVM based communication optimizationsfor Chapel ... optimizations

Our  ultimate  goal:  A  compiler  that  can  uniformly  optimize  PGAS  Programsp Extend  LLVM  IR  to  support  parallel  programs  with  PGAS  and  explicit  task  parallelismn Two  parallel  intermediate  representations(PIR)  as  extensions  to  LLVM  IR  (Runtime-‐‑‒Independent,  Runtime-‐‑‒Specific)

5

Parallel  Programs  

(Chapel,  X10,  CAF,  HC,  …)  

1.RI-­‐PIR  Gen  2.Analysis  

3.Transforma1on  

1.RS-­‐PIR  Gen  2.Analysis  

3.Transforma1on  

LLVM    Run3me-­‐Independent  

Op3miza3ons  e.g.  Task  Parallel  Construct  

LLVM    Run3me-­‐Specific  Op3miza3ons  e.g.  GASNet  API  

Binary  

Page 6: LLVM Optimizationsfor PGAS Programs · uniformly optimize PGAS Programs! ... Thistalk discussesthe prosand consof LLVM based communication optimizationsfor Chapel ... optimizations

The  first  step:LLVM-‐‑‒based  Chapel  compiler

6

Pictures  borrowed  from  1)  hSp://chapel.cray.com/logo.html                                                                                                                                                                                                            2)  hSp://llvm.org/Logo.html    

p Chapel  compiler  supports  LLVM  IR  generationp This  talk  discusses  the  pros  and  cons  of  LLVM-‐‑‒based  communication  optimizations  for  Chapeln Wide  pointer  optimization

p Preliminary  Performance  evaluation  &  analysis  using  three  regular  applications

Page 7: LLVM Optimizationsfor PGAS Programs · uniformly optimize PGAS Programs! ... Thistalk discussesthe prosand consof LLVM based communication optimizationsfor Chapel ... optimizations

Chapel  languagep An  object-‐‑‒oriented  PGAS  language  developed  by  Cray  Inc.n Part  of  DARPA  HPCS  program

p Key  featuresn Array  Operators:  zip,  replicate,  remap,...n Explicit  Task  Parallelism:  begin,  cobeginn Locality  Control:  Localesn Data-‐‑‒Distribution:  domain  mapsn Synchronizations:  sync

7

Page 8: LLVM Optimizationsfor PGAS Programs · uniformly optimize PGAS Programs! ... Thistalk discussesthe prosand consof LLVM based communication optimizationsfor Chapel ... optimizations

Compilation  Flow

8

Chapel  Programs  

AST  Genera1on  and  Op1miza1ons  

C-­‐code  Genera1on  

LLVM  Op1miza1ons  Backend  Compiler’s  Op1miza1ons  (e.g.  gcc  –O3)  

LLVM  IR  C  Programs    

LLVM  IR  Genera1on  

Binary   Binary  

Page 9: LLVM Optimizationsfor PGAS Programs · uniformly optimize PGAS Programs! ... Thistalk discussesthe prosand consof LLVM based communication optimizationsfor Chapel ... optimizations

The  Pros  and  Cons  of  using  LLVM  for  Chapelp Pro:  Using  address  space  feature  of  LLVM  offers  more  opportunities  for  communication  optimization  than  C  gen

9

// LLVM IR%x = load i64 addrspace(100)* %xptr

// C-Code generationchpl_comm_get(&x, …);

LLVM  Op1miza1ons  (e.g.  LICM,  scalar  replacement)  

Backend  Compiler’s  Op1miza1ons  (e.g.  gcc  –O3)  

Few  chances  of  op1miza1on  because  remote  accesses  are    

lowered  to  chapel  Comm  APIs  

1.  the  exis1ng  LLVM  passes  can  be  used  for  communica1on  

op1miza1ons  2.  Lowered  to  chapel  Comm  APIs  ader  

op1miza1ons  

// Chapel x = remoteData;

Page 10: LLVM Optimizationsfor PGAS Programs · uniformly optimize PGAS Programs! ... Thistalk discussesthe prosand consof LLVM based communication optimizationsfor Chapel ... optimizations

Address  Space  100  generation  in  Chapel

p Address  space  100  =  possibly-‐‑‒remote  (our  convention)

p Constructs  which  generate  address  space  100    n  Array  Load/Store  (Except  Local  constructs)  n  Distributed  Array

p var  d  =  {1..128}  dmapped  Block(boundingBox={1..128});p var  A:  [d]  int;

n Object  and  Field  Load/  Storep class  circle  {  var  radius:  real;  …  }p var  c1  =  new  circle(radius=1.0);    

n On  statementp var  loc0:  int;p on  Locales[1]  {  loc0  =  …;  }

n  Ref  intentp proc  habanero(ref  v:  int):  void  {  v  =  …;  }

10

Except  remote  value    forwarding  op1miza1on

Page 11: LLVM Optimizationsfor PGAS Programs · uniformly optimize PGAS Programs! ... Thistalk discussesthe prosand consof LLVM based communication optimizationsfor Chapel ... optimizations

Motivating  Example  of  address  space  100

11

(Pseudo-Code: Before LICM)for i in 1..N { // REMOTE GET %x = load i64 addrspace(100)* %xptr A(i) = %x;}

(Pseudo-Code: After LICM)// REMOTE GET%x = load i64 addrspace(100)* %xptrfor i in 1..N { A(i) = %x;}

LICM  by  LLVM  

LICM  =  Loop  Invariant  Code  Mo1on  

Page 12: LLVM Optimizationsfor PGAS Programs · uniformly optimize PGAS Programs! ... Thistalk discussesthe prosand consof LLVM based communication optimizationsfor Chapel ... optimizations

The  Pros  and  Cons  of  using  LLVM  for  Chapel  (Contʼ’d)p Drawback:  Using  LLVM  may  lose  opportunity  for  optimizations  and  may  add  overhead  at  runtime    

p In  LLVM  3.3,  many  optimizations  assume  that  the  pointer  size  is  the  same  across  all  address  spaces  

12

typedef  struct  wide_ptr_s  {      chpl_localeID_t  locale;      void*  addr;  }  wide_ptr_t;  

locale   addr  

For  LLVM  Code  Genera1on  :    64bit  packed  pointer  

CHPL_WIDE_POINTERS=node16  

For  C  Code  Genera1on  :    128bit  struct  pointer  

CHPL_WIDE_POINTERS=struct    

wide.locale;wide.addr;

wide >> 48 wide | 48BITS_MASK;

16bit                                                  48bit  

1.  Needs  more  instruc1ons  2.  Lose  opportuni1es  for  Alias  analysis    

Page 13: LLVM Optimizationsfor PGAS Programs · uniformly optimize PGAS Programs! ... Thistalk discussesthe prosand consof LLVM based communication optimizationsfor Chapel ... optimizations

Performance  Evaluations:Experimental  Methodologiesp We  tested  execution  in  the  following  modes

n 1.C-‐‑‒Struct  (-‐‑‒-‐‑‒fast)p C  code  generation  +  struct  pointer  +  gccp Conventional  Code  generation  in  Chapel

n 2.LLVM  without  wide  optimization  (-‐‑‒-‐‑‒fast  -‐‑‒-‐‑‒llvm)p LLVM  IR  generation  +  packed  pointerp Does  not  use  address  space  feature

n 3.LLVM  with  wide  optimization(-‐‑‒-‐‑‒fast  -‐‑‒-‐‑‒llvm  -‐‑‒-‐‑‒llvm-‐‑‒wide-‐‑‒opt)p LLVM  IR  generation  +  packed  pointerp Use  address  space  feature  and  apply  the  existing  LLVM  optimizations

13

Page 14: LLVM Optimizationsfor PGAS Programs · uniformly optimize PGAS Programs! ... Thistalk discussesthe prosand consof LLVM based communication optimizationsfor Chapel ... optimizations

Performance  Evaluations:Platformp Intel  Xeon-‐‑‒based  Cluster

n Per  Node  informationp Intel  Xeon  CPU  [email protected]  x  12  coresp 48GB  of  RAM

p Interconnectn Quad-‐‑‒data  rated  Infinibandn Mellanox  FCA  support

14

Page 15: LLVM Optimizationsfor PGAS Programs · uniformly optimize PGAS Programs! ... Thistalk discussesthe prosand consof LLVM based communication optimizationsfor Chapel ... optimizations

Performance  Evaluations:Details  of  Compiler  &  Runtime  p Compiler:  Chapel  version  1.9.0.23154  (Apr.  2014)n Built  with

p CHPL_̲LLVM=llvm  p CHPL_̲WIDE_̲POINTERS=node16  or  struct  p CHPL_̲COMM=gasnet  CHPL_̲COMM_̲SUBSTRATE=ibv  p CHPL_̲TASK=qthread

n Backend  compiler:  gcc-‐‑‒4.4.7,  LLVM  3.3p Runtime:

n GASNet-‐‑‒1.22.0  (ibv-‐‑‒conduit,  mpi-‐‑‒spawner)n qthreads-‐‑‒1.10  (2  Shepherds,  6  worker  per  shepherd) 15

Page 16: LLVM Optimizationsfor PGAS Programs · uniformly optimize PGAS Programs! ... Thistalk discussesthe prosand consof LLVM based communication optimizationsfor Chapel ... optimizations

Stream-‐‑‒EP

p From  HPCC  benchmarkp Array  Size:  2^30

16

coforall  loc  in  Locales  do  on  loc  {          //  per  each  locale          var  A,  B,  C:  [D]  real(64);          forall  (a,  b,  c)  in  zip(A,  B,  C)  do                  a  =  b  +  alpha  *  c;  }  

Page 17: LLVM Optimizationsfor PGAS Programs · uniformly optimize PGAS Programs! ... Thistalk discussesthe prosand consof LLVM based communication optimizationsfor Chapel ... optimizations

Stream-‐‑‒EP  Result

17

2.56  

1.33  0.72  

0.41   0.24   0.11  

6.62  

3.22  

1.73  

1.01  0.62  

0.26  

2.45  

1.28  0.72  

0.40   0.25   0.10  0  

1  

2  

3  

4  

5  

6  

7  

1  locale   2  locales   4  locales   8  locales   16  locales   32  locales  

Execu3

on  Tim

e  (sec)  

Number  of  Locales  

C-­‐Struct  LLVM  w/o  wopt  LLVM  w/  wopt  

Lower  is  beLer  

vs.  vs.   Overhead  of  introducing  LLVM  +  packed  pointer  (2.6x  slower)    

1

2 3

123

__  vs.  

Performance  improvement  by  LLVM  opt  (2.7x  faster)    

LLVM+wide  opt  is  faster  than  the  conven1onal  C-­‐Struct  (1.1x)  

Page 18: LLVM Optimizationsfor PGAS Programs · uniformly optimize PGAS Programs! ... Thistalk discussesthe prosand consof LLVM based communication optimizationsfor Chapel ... optimizations

Stream-‐‑‒EP  Analysis

18

C-­‐Struct   LLVM  w/o  wopt   LLVM  w/  wopt  

1.39E+11   1.40E+11   5.46E+10  

Dynamic  number  of  Chapel  PUT/GET  APIs  actually  executed  (16  Locales):  

//  C-­‐Struct,  LLVM  w/o  wopt  forall  (a,  b,  c)  in  zip(A,  B,  C)  do            8GETS  /  1PUT  

//  LLVM  w/  wopt  6GETS  (Get  Array  Head,  offs)  forall  (a,  b,  c)  in  zip(A,  B,  C)  do            2GETS  /  1PUT  

LICM  by  LLVM  

Page 19: LLVM Optimizationsfor PGAS Programs · uniformly optimize PGAS Programs! ... Thistalk discussesthe prosand consof LLVM based communication optimizationsfor Chapel ... optimizations

Cholesky  Decomposition

p Use  Futures  &  Distributed  Arrayp  Input  Size:  10,000x10,000  

n  Tile  Size:  500x50019

0001122233

001122233

01122233

1122233

122233

22233

2233

233

33 3

20  1les  

dependencies  

20  1les  

User  Defined  Distribu1on  

Page 20: LLVM Optimizationsfor PGAS Programs · uniformly optimize PGAS Programs! ... Thistalk discussesthe prosand consof LLVM based communication optimizationsfor Chapel ... optimizations

Cholesky  Result

20

Lower  is  beLer  

2401.32  

941.70  730.94  

2781.12  

1105.38  902.86  858.77  

283.32   216.48  

0.00  

500.00  

1000.00  

1500.00  

2000.00  

2500.00  

3000.00  

8  locales   16  locales   32  locales  

Execu3

on  Tim

e  (sec)  

Number  of  Locales  

C-­‐Struct  

LLVM  w/o  wopt  

LLVM  w/  wopt  

1

2 3

vs.  vs.   Overhead  of  introducing  LLVM  +  packed  pointer  (1.2x  slower)    

123

__  vs.  

Performance  improvement  by  LLVM  opt  (4.2x  faster)    

LLVM+wide  opt  is  faster  than  the  conven1onal  C-­‐Struct  (3.4x)  

Page 21: LLVM Optimizationsfor PGAS Programs · uniformly optimize PGAS Programs! ... Thistalk discussesthe prosand consof LLVM based communication optimizationsfor Chapel ... optimizations

Cholesky  Analysis

21

C-­‐Struct   LLVM  w/o  wopt   LLVM  w/  wopt  

1.78E+09   1.97E+09   5.89E+08  

Dynamic  number  of  Chapel  PUT/GET  APIs  actually  executed  (2  Locales):  Obtained  with  1,000  x  1,000  input  (100x100  1le  size)    

//  C-­‐Struct,  LLVM  w/o  wopt  for  jB  in  zero..1leSize-­‐1  do  {      for  kB  in  zero..1leSize-­‐1  do  {          4GETS          for  iB  in  zero..1leSize-­‐1  do  {              8GETS  (+1  GETS  w/  LLVM)              1PUT          }}}  

//  LLVM  w/  wopt  for  jB  in  zero..1leSize-­‐1  do  {      1GET      for  kB  in  zero..1leSize-­‐1  do  {          3GETS          for  iB  in  zero..1leSize-­‐1  do  {              2GETS              1PUT          }}}  

Page 22: LLVM Optimizationsfor PGAS Programs · uniformly optimize PGAS Programs! ... Thistalk discussesthe prosand consof LLVM based communication optimizationsfor Chapel ... optimizations

Smithwaterman

p Use  Futures  &  Distributed  Arrayp  Input  Size:  185,500x192,000  

n  Tile  Size:  11,600x12,000   22

02

13

16  1les  

16  1les  

02

13

02

13

02

13

02

13

02

13

02

13

02

13

02

13

02

13

02

13

02

13

02

13

02

13

02

13

02

13

Cyclic  Distribu1on  dependencies  

Page 23: LLVM Optimizationsfor PGAS Programs · uniformly optimize PGAS Programs! ... Thistalk discussesthe prosand consof LLVM based communication optimizationsfor Chapel ... optimizations

Smithwaterman  Result

23

381.23   379.01  

1260.31   1263.76  

626.38   635.45  

0.00  

200.00  

400.00  

600.00  

800.00  

1000.00  

1200.00  

1400.00  

8  locales   16  locales  

Execu3

on  Tim

e  (sec)  

Number  of  Locales  

C-­‐Struct  

LLVM  w/o  wopt  

LLVM  w/  wopt  

Lower  is  beLer  

1

2 3

vs.  vs.   Overhead  of  introducing  LLVM  +  packed  pointer  (3.3x  slower)    

123

__  vs.  

Performance  improvement  by  LLVM  opt  (2.0x  faster)    

LLVM+wide  opt  is  slower  than  the  conven1onal  C-­‐Struct  (0.6x)  

Page 24: LLVM Optimizationsfor PGAS Programs · uniformly optimize PGAS Programs! ... Thistalk discussesthe prosand consof LLVM based communication optimizationsfor Chapel ... optimizations

Smithwaterman  Analysis

24

C-­‐Struct   LLVM  w/o  wopt   LLVM  w/  wopt  

1.41E+08   1.41E+08   5.26E+07  

Dynamic  number  of  Chapel  PUT/GET  APIs  actually  executed  (1  Locale):  Obtained  with  1,856  x  1,920  input  (232x240  1le  size)    

//  C-­‐Struct,  LLVM  w/o  wopt  for  (ii,  jj)  in  1le_1_2d_domain  {          33  GETS          1  PUTS  }  

//  LLVM  w/  wopt  for  (ii,  jj)  in  1le_1_2d_domain  {          12  GETS          1  PUTS  }    

No  LICM  though  there  are  opportuni3es  

Page 25: LLVM Optimizationsfor PGAS Programs · uniformly optimize PGAS Programs! ... Thistalk discussesthe prosand consof LLVM based communication optimizationsfor Chapel ... optimizations

Key  Insightsp Using  address  space  100  offers  finer-‐‑‒grain  optimization  opportunities  (e.g.  Chapel  Array)

25

for  i  in  {1..N}  {      data  =  A(i);  }  

for  i  in  1..N  {          head  =  GET(pointer  to  array  head)          offset1  =  GET(offset)          data  =  GET(head+i*offset1)  }  

Opportuni1es  for    1.LICM  

2.Aggrega1on  

Page 26: LLVM Optimizationsfor PGAS Programs · uniformly optimize PGAS Programs! ... Thistalk discussesthe prosand consof LLVM based communication optimizationsfor Chapel ... optimizations

Conclusionsp The  first  performance  evaluation  and  analysis  of  LLVM-‐‑‒based  Chapel  compilern Capable  of  utilizing  the  existing  optimizations  passes  even  for  remote  data  (e.g.  LICM)  p Removes  significant  number  of  Comm  APIs

n LLVM  w/  opt  is  always  better  than  LLVM  w/o  optn Stream-‐‑‒EP,  Cholesky  

p LLVM-‐‑‒based  code  generation  is  faster  than  C-‐‑‒based  code  generation  (1.04x,  3.4x)

n Smithwaterman  p LLVM-‐‑‒based  code  generation  is  slower  than  C-‐‑‒based  code  generation    due  to  constraints  of  address  space  feature  in  LLVM

p No  LICM  though  there  are  opportunities  p Significant  overhead  of  Packed  Wide  pointer

26

Page 27: LLVM Optimizationsfor PGAS Programs · uniformly optimize PGAS Programs! ... Thistalk discussesthe prosand consof LLVM based communication optimizationsfor Chapel ... optimizations

Future  Workp Evaluate  other  applications

p Regular  applicationsp Irregular  applications

p Possibly-‐‑‒Remote  to  Definitely-‐‑‒Local  transformation  by  compiler

p PIR  in  LLVM27

local { A(i) = … } // hint by programmmer … = A(i); // Definitely Local

on Locales[1] { // hint by programmer var A: [D] int; // Definitely Local

Page 28: LLVM Optimizationsfor PGAS Programs · uniformly optimize PGAS Programs! ... Thistalk discussesthe prosand consof LLVM based communication optimizationsfor Chapel ... optimizations

Acknowledgementsp Special  thanks  to

n Brad  Chamberlain  (Cray)n Rafael  Larrosa  Jimenez  (UMA)n Rafael  Asenjo  Plaza  (UMA)n Shams  Imam  (Rice)n Sagnak  Tasirlar  (Rice)n Jun  Shirako  (Rice)

28