LLVM-based Communication Optimizations for PGAS Programs
Akihiro Hayashi (Rice University), Jisheng Zhao (Rice University), Michael Ferguson (Cray Inc.), Vivek Sarkar (Rice University)
2nd Workshop on the LLVM Compiler Infrastructure in HPC @ SC15


Page 1: LLVM-based Communication Optimizations for PGAS Programs

1

LLVM-based Communication Optimizations for PGAS Programs

Akihiro Hayashi (Rice University)
Jisheng Zhao (Rice University)
Michael Ferguson (Cray Inc.)
Vivek Sarkar (Rice University)

2nd Workshop on the LLVM Compiler Infrastructure in HPC @ SC15

Page 2: LLVM-based Communication Optimizations for PGAS Programs

2

A Big Picture

Photo Credits : http://chapel.cray.com/logo.html, http://llvm.org/Logo.html, http://upc.lbl.gov/, http://commons.wikimedia.org/, http://cs.lbl.gov/

© Argonne National Lab.

© RIKEN AICS

©Berkeley Lab.

X10, Habanero-UPC++, …

Communication Optimization

Page 3: LLVM-based Communication Optimizations for PGAS Programs

3

PGAS Languages

High-productivity features: Global-View, Task Parallelism, Data Distribution, Synchronization

X10, CAF, Habanero-UPC++

Photo Credits : http://chapel.cray.com/logo.html, http://upc.lbl.gov/

Page 4: LLVM-based Communication Optimizations for PGAS Programs

4

Communication is Implicit in some PGAS Programming Models

Global Address Space: the compiler and runtime are responsible for performing communication across nodes.

Remote Data Access in Chapel:

var x = 1;          // on Node 0
on Locales[1] {     // on Node 1
  … = x;            // DATA ACCESS
}

Page 5: LLVM-based Communication Optimizations for PGAS Programs

5

Communication is Implicit in some PGAS Programming Models (Cont'd)

Remote Data Access:

var x = 1;          // on Node 0
on Locales[1] {     // on Node 1
  … = x;            // DATA ACCESS

Runtime affinity handling:

if (x.locale == MYLOCALE) {
  *(x.addr) = 1;
} else {
  gasnet_get(…);
}

OR Compiler Optimization:

var x = 1;
on Locales[1] {
  … = 1;
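The runtime affinity handling above can be sketched in C. This is a minimal single-node model, not the real Chapel runtime: `wide_ptr`, `fake_remote_get`, and `read_possibly_remote` are hypothetical names, and the "remote" path just copies memory locally where a real runtime would issue a gasnet_get.

```c
#include <string.h>

/* Hypothetical wide pointer: a node id plus a local address. */
typedef struct { int locale; void *addr; } wide_ptr;

static int MY_LOCALE = 0;

/* Stand-in for gasnet_get: this sketch runs on one node, so a remote
 * fetch is modeled as a plain memcpy from the target address. */
static void fake_remote_get(void *dst, const wide_ptr *src, size_t n) {
    memcpy(dst, src->addr, n);
}

/* Lowered read of a possibly-remote int: check affinity at runtime and
 * take the fast local path when the data lives on this node. */
int read_possibly_remote(const wide_ptr *p) {
    int v;
    if (p->locale == MY_LOCALE)
        v = *(int *)p->addr;               /* definitely-local fast path */
    else
        fake_remote_get(&v, p, sizeof v);  /* communication slow path */
    return v;
}
```

The branch itself is the cost the locality optimization on a later slide removes: when the compiler can prove `p->locale == MY_LOCALE`, the check and the slow path disappear.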

Page 6: LLVM-based Communication Optimizations for PGAS Programs

6

Communication Optimization is Important

A synthetic Chapel program on Intel Xeon CPU X5660 clusters with QDR InfiniBand.

[Chart: latency (ms, log scale, 100 to 10,000,000) vs. transferred bytes (4,096 up to ~1M bytes), comparing Optimized (Bulk Transfer) against Unoptimized; the gap between the two versions ranges from 59x to 1,500x. Lower is better.]
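The gap in the chart above is what a simple latency/bandwidth cost model predicts: fine-grained access pays the per-message latency once per element, while a bulk transfer pays it once in total. A sketch in C, with illustrative constants rather than the measured numbers from the slide:

```c
/* Toy alpha-beta cost model: each message pays a fixed latency alpha (us);
 * each byte pays a bandwidth term beta (us per byte). */
double transfer_cost(long messages, long total_bytes,
                     double alpha_us, double beta_us_per_byte) {
    return messages * alpha_us + total_bytes * beta_us_per_byte;
}
```

With alpha = 1 us and beta = 0.001 us/byte, moving 8 MB as one million 8-byte messages costs roughly 1,008,000 us, while one bulk transfer of the same bytes costs about 8,001 us: a ~126x gap from the message count alone, before any per-message software overhead.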

Page 7: LLVM-based Communication Optimizations for PGAS Programs

7

PGAS Optimizations are language-specific

Photo Credits : http://chapel.cray.com/logo.html, http://llvm.org/Logo.html, http://upc.lbl.gov/, http://commons.wikimedia.org/, http://cs.lbl.gov/

X10, Habanero-UPC++, …

© Argonne National Lab.

©Berkeley Lab.

© RIKEN AICS

Chapel Compiler

UPC Compiler

X10 Compiler
Habanero-C Compiler

Language Specific!

Page 8: LLVM-based Communication Optimizations for PGAS Programs

8

Our goal

Photo Credits : http://chapel.cray.com/logo.html, http://llvm.org/Logo.html, http://upc.lbl.gov/, http://commons.wikimedia.org/, http://cs.lbl.gov/

© Argonne National Lab.

© RIKEN AICS

©Berkeley Lab.

X10, Habanero-UPC++, …

Page 9: LLVM-based Communication Optimizations for PGAS Programs

9

Why LLVM?

Widely used language-agnostic compiler, built around the LLVM Intermediate Representation (LLVM IR):

Frontends: Clang (C/C++), dragonegg (C/C++, Fortran, Ada, Objective-C), Chapel frontend, UPC++ frontend
→ Analysis & Optimizations on LLVM IR
→ Backends: x86, PowerPC, ARM, PTX
→ Binaries: x86, PPC, ARM, GPU

Page 10: LLVM-based Communication Optimizations for PGAS Programs

10

Summary & Contributions

Our Observations:
 Many PGAS languages share semantically similar constructs
 PGAS optimizations are language-specific

Contributions: Built a compilation framework that can uniformly optimize PGAS programs (initial focus: communication)
 Enabling existing LLVM passes for communication optimizations
 PGAS-aware communication optimizations

Photo Credits : http://chapel.cray.com/logo.html, http://llvm.org/Logo.html,

Page 11: LLVM-based Communication Optimizations for PGAS Programs

11

Overview of our framework

Chapel Programs → Chapel-LLVM frontend
UPC++ Programs → UPC++-LLVM frontend
X10 Programs → X10-LLVM frontend
CAF Programs → CAF-LLVM frontend

Each frontend (the part that needs to be implemented when supporting a new language/runtime) emits:
1. Vanilla LLVM IR
2. LLVM IR that uses the address space feature to express communications

LLVM IR → LLVM-based Communication Optimization passes → Lowering Pass (generally language-agnostic)

Page 12: LLVM-based Communication Optimizations for PGAS Programs

12

How optimizations work

Chapel:
x = 1;  // x is possibly remote

UPC++:
shared_var<int> x;
x = 1;

Both are expressed in LLVM IR as:
store i64 1, i64 addrspace(100)* %x, …

1. Existing LLVM optimizations treat a remote access as if it were a local access
2. PGAS-aware optimizations are address space-aware

Runtime-Specific Lowering then turns the remaining remote accesses into Communication API Calls.

Page 13: LLVM-based Communication Optimizations for PGAS Programs

13

LLVM-based Communication Optimizations for Chapel

1. Enabling existing LLVM passes
   Loop invariant code motion (LICM), scalar replacement, …

2. Aggregation
   Combine sequences of loads/stores on adjacent memory locations into a single memcpy

These are already implemented in the standard Chapel compiler.

Page 14: LLVM-based Communication Optimizations for PGAS Programs

14

An optimization example: LICM for Communication Optimizations

LICM = Loop Invariant Code Motion

for i in 1..100 {
  %x = load i64 addrspace(100)* %xptr   // loop-invariant remote load
  A(i) = %x;
}

LICM by LLVM hoists the invariant load out of the loop, so the remote access happens once instead of 100 times.
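The effect of hoisting can be sketched in C by counting communication calls. `remote_get` here is a hypothetical stand-in for the runtime's GET, not a Chapel or GASNet API, and the counter only exists to make the savings observable:

```c
static long get_count = 0;

/* Stand-in for a remote GET; counts how many communications happen. */
static int remote_get(const int *remote_addr) {
    get_count++;
    return *remote_addr;
}

/* Unoptimized: the invariant remote load sits inside the loop (n GETs). */
void fill_unopt(int *A, int n, const int *xptr) {
    for (int i = 0; i < n; i++)
        A[i] = remote_get(xptr);
}

/* After LICM: the load is hoisted above the loop (1 GET). */
void fill_licm(int *A, int n, const int *xptr) {
    int x = remote_get(xptr);
    for (int i = 0; i < n; i++)
        A[i] = x;
}
```

The transformed version computes the same array contents while issuing a single GET, which is exactly why LICM doubles as a communication optimization once remote loads are visible as ordinary loads in addrspace(100).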

Page 15: LLVM-based Communication Optimizations for PGAS Programs

15

An optimization example: Aggregation

// p is possibly remote
sum = p.x + p.y;

Unoptimized, this generates two GETs:
load i64 addrspace(100)* %pptr+0   // GET x
load i64 addrspace(100)* %pptr+4   // GET y

Since x and y are adjacent in memory, the two loads are combined into a single GET:
llvm.memcpy(…);
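A C sketch of the same idea, again counting messages; `fake_get` is a hypothetical stand-in that models one GET as one memcpy, and `point` stands in for the possibly-remote record:

```c
#include <string.h>

typedef struct { long x, y; } point;

static long msg_count = 0;

/* Stand-in for a remote GET of n bytes. */
static void fake_get(void *dst, const void *src, size_t n) {
    msg_count++;
    memcpy(dst, src, n);
}

/* Unoptimized: one GET per field. */
long sum_unopt(const point *p) {
    long x, y;
    fake_get(&x, &p->x, sizeof x);
    fake_get(&y, &p->y, sizeof y);
    return x + y;
}

/* Aggregated: the two adjacent loads become one memcpy-sized GET. */
long sum_agg(const point *p) {
    point local;
    fake_get(&local, p, sizeof local);
    return local.x + local.y;
}
```

Both versions return the same sum; the aggregated one halves the message count, and the savings grow with the number of adjacent fields combined.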

Page 16: LLVM-based Communication Optimizations for PGAS Programs

16

LLVM-based Communication Optimizations for Chapel

3. Locality Optimization
   Infer the locality of data and convert possibly-remote accesses to definitely-local accesses at compile time when possible

4. Coalescing
   Remote array access vectorization

These are implemented, but not in the standard Chapel compiler.

Page 17: LLVM-based Communication Optimizations for PGAS Programs

17

An optimization example: Locality Optimization

1: proc habanero(ref x, ref y, ref z) {
2:   var p: int = 0;
3:   var A: [1..N] int;
4:   local { p = z; }
5:   z = A(0) + z;
6: }

1. A is definitely local (declared in this scope)
2. p and z are definitely local (the local block asserts it)
3. So line 5 is a definitely-local access (avoiding runtime affinity checking)

Page 18: LLVM-based Communication Optimizations for PGAS Programs

18

An optimization example: Coalescing

Before:
for i in 1..N {
  … = A(i);
}

After:
localA = A;          // perform bulk transfer
for i in 1..N {
  … = localA(i);     // converted to definitely-local access
}
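The before/after pair can be sketched in C with communication counters. This is a model under assumptions, not the generated code: `remote_read` stands in for a fine-grained element GET, and the bulk transfer is modeled as one memcpy into a caller-provided local buffer.

```c
#include <string.h>

static long elem_gets = 0;  /* fine-grained GETs issued */
static long bulk_gets = 0;  /* bulk transfers issued */

/* Stand-in for a one-element remote GET. */
static int remote_read(const int *a, int i) {
    elem_gets++;
    return a[i];
}

/* Before coalescing: one GET per array element (n messages). */
long sum_before(const int *A, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += remote_read(A, i);
    return s;
}

/* After coalescing: one bulk transfer, then all accesses are local. */
long sum_after(const int *A, int n, int *localA) {
    bulk_gets++;
    memcpy(localA, A, n * sizeof *A);   /* the single bulk GET */
    long s = 0;
    for (int i = 0; i < n; i++)
        s += localA[i];                 /* definitely-local access */
    return s;
}
```

This differs from aggregation in scope: aggregation merges a fixed sequence of adjacent scalar accesses, while coalescing vectorizes a whole remote array traversal into one transfer plus a local loop.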

Page 19: LLVM-based Communication Optimizations for PGAS Programs

19

Performance Evaluations: Benchmarks

Application       Size
Smith-Waterman    185,600 x 192,000
Cholesky Decomp   10,000 x 10,000
NPB EP            CLASS = D
Sobel             48,000 x 48,000
SSCA2 Kernel 4    SCALE = 16
Stream EP         2^30

Page 20: LLVM-based Communication Optimizations for PGAS Programs

20

Performance Evaluations: Platforms

Cray XC30™ Supercomputer @ NERSC
 Node: Intel Xeon E5-2695 @ 2.40GHz x 24 cores, 64GB of RAM
 Interconnect: Cray Aries interconnect with Dragonfly topology

Westmere Cluster @ Rice
 Node: Intel Xeon CPU X5660 @ 2.80GHz x 12 cores, 48GB of RAM
 Interconnect: Quad data rate (QDR) InfiniBand

Page 21: LLVM-based Communication Optimizations for PGAS Programs

21

Performance Evaluations: Details of Compiler & Runtime

Compiler: Chapel Compiler version 1.9.0, LLVM 3.3

Runtime: GASNet-1.22.0
 Cray XC: aries conduit
 Westmere Cluster: ibv conduit

Qthreads-1.10
 Cray XC: 2 shepherds, 24 workers / shepherd
 Westmere Cluster: 2 shepherds, 6 workers / shepherd

Page 22: LLVM-based Communication Optimizations for PGAS Programs

22

BRIEF SUMMARY OF PERFORMANCE EVALUATIONS

Performance Evaluation

Page 23: LLVM-based Communication Optimizations for PGAS Programs

23

Results on the Cray XC (LLVM-unopt vs. LLVM-allopt)

[Chart: performance improvement over LLVM-unopt for SW, Cholesky, Sobel, StreamEP, EP, and SSCA2, broken down by Coalescing, Locality Opt, and Aggregation; per-benchmark speedups are 2.1x, 19.5x, 1.1x, 2.4x, 1.4x, and 1.3x. Higher is better.]

4.6x performance improvement relative to LLVM-unopt on the same # of locales on average (1, 2, 4, 8, 16, 32, 64 locales)

Page 24: LLVM-based Communication Optimizations for PGAS Programs

24

Results on Westmere Cluster (LLVM-unopt vs. LLVM-allopt)

[Chart: performance improvement over LLVM-unopt for SW, Cholesky, Sobel, StreamEP, EP, and SSCA2, broken down by Coalescing, Locality Opt, and Aggregation; per-benchmark speedups are 2.3x, 16.9x, 1.1x, 2.5x, 1.3x, and 2.3x. Higher is better.]

4.4x performance improvement relative to LLVM-unopt on the same # of locales on average (1, 2, 4, 8, 16, 32, 64 locales)

Page 25: LLVM-based Communication Optimizations for PGAS Programs

25

DETAILED RESULTS & ANALYSIS OF CHOLESKY DECOMPOSITION

Performance Evaluation

Page 26: LLVM-based Communication Optimizations for PGAS Programs

26

Cholesky Decomposition dependencies

[Figure: tiled Cholesky decomposition; the triangular tile grid is labeled with owning nodes (0-3), with tiles block-distributed across Node0, Node1, Node2, and Node3, inducing inter-node dependencies.]

Page 27: LLVM-based Communication Optimizations for PGAS Programs

27

Metrics

1. Performance & Scalability
   Baseline (LLVM-unopt) vs. LLVM-based Optimizations (LLVM-allopt)
2. The dynamic number of communication API calls
3. Analysis of optimized code
4. Performance comparison
   Conventional C-backend vs. LLVM-backend

Page 28: LLVM-based Communication Optimizations for PGAS Programs

28

Performance Improvement by LLVM (Cholesky on the Cray XC)

[Chart: speedup over LLVM-unopt on 1 locale]

Locales      1    2    4    8    16   32
LLVM-unopt   1.0  0.1  0.1  0.2  0.2  0.3
LLVM-allopt  2.6  2.7  3.7  4.1  4.3  4.5

LLVM-based communication optimizations show scalability

Page 29: LLVM-based Communication Optimizations for PGAS Programs

29

Communication API call elimination by LLVM (Cholesky on the Cray XC)

[Chart: dynamic number of communication API calls, normalized to LLVM-unopt (100%)]

              LLVM-allopt   Improvement
LOCAL_GET     12.1%         8.3x
REMOTE_GET    0.2%          500x
LOCAL_PUT     89.2%         1.1x
REMOTE_PUT    100.0%        -

Page 30: LLVM-based Communication Optimizations for PGAS Programs

30

Analysis of optimized code

LLVM-unopt:
for jB in zero..tileSize-1 {
  for kB in zero..tileSize-1 {    // 4 GETs
    for iB in zero..tileSize-1 {  // 9 GETs + 1 PUT
}}}

LLVM-allopt:
1. ALLOCATE LOCAL BUFFER
2. PERFORM BULK TRANSFER
for jB in zero..tileSize-1 {
  for kB in zero..tileSize-1 {    // 1 GET
    for iB in zero..tileSize-1 {  // 1 GET + 1 PUT
}}}

Page 31: LLVM-based Communication Optimizations for PGAS Programs

31

Performance comparison with C-backend

[Chart: speedup over LLVM-unopt on 1 locale]

Locales      1    2    4    8    16   32   64
C-backend    2.8  0.1  0.1  0.1  0.3  0.7  0.4
LLVM-unopt   1.0  0.1  0.1  0.2  0.2  0.3  0.4
LLVM-allopt  2.6  2.7  3.7  4.1  4.3  4.5  4.5

C-backend is faster on a single locale!

Page 32: LLVM-based Communication Optimizations for PGAS Programs

32

Current limitation

In LLVM 3.3, many optimizations assume that the pointer size is the same across all address spaces.

For LLVM code generation, a wide pointer is a 64-bit packed pointer: locale (16 bits) | addr (48 bits).
For C code generation, it is a 128-bit struct pointer, so field accesses are direct:
ptr.locale; ptr.addr;

With the packed representation they become bit operations:
ptr >> 48;          // locale
ptr & 48BITS_MASK;  // addr

1. Needs more instructions
2. Loses opportunities for alias analysis
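The 64-bit packed layout described above can be sketched in C; the helper names (`pack`, `locale_of`, `addr_of`) are illustrative, not the names used in the Chapel compiler:

```c
#include <stdint.h>

/* 64-bit packed wide pointer: locale in the top 16 bits,
 * local address in the low 48 bits. */
#define ADDR_BITS 48
#define ADDR_MASK ((UINT64_C(1) << ADDR_BITS) - 1)

uint64_t pack(uint16_t locale, uint64_t addr) {
    return ((uint64_t)locale << ADDR_BITS) | (addr & ADDR_MASK);
}

uint16_t locale_of(uint64_t p) { return (uint16_t)(p >> ADDR_BITS); }

uint64_t addr_of(uint64_t p)   { return p & ADDR_MASK; }
```

Every field access costs a shift or a mask, which is the extra-instruction overhead the slide points out, and because the address is hidden inside an integer, LLVM's alias analysis can no longer see it as a pointer.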

Page 33: LLVM-based Communication Optimizations for PGAS Programs

33

Conclusions

LLVM-based communication optimizations for PGAS programs are a promising way to optimize PGAS programs in a language-agnostic manner.

Preliminary evaluation with 6 Chapel applications:
 Cray XC30 Supercomputer: 4.6x average performance improvement
 Westmere Cluster: 4.4x average performance improvement

Page 34: LLVM-based Communication Optimizations for PGAS Programs

34

Future work

Extend LLVM IR to support parallel programs with PGAS and explicit task parallelism via a higher-level IR:

Parallel Programs (Chapel, X10, CAF, HC, …)
→ Runtime-Independent Optimizations (e.g. task parallel constructs): 1. RI-PIR Gen, 2. Analysis, 3. Transformation
→ Runtime-Specific Optimizations (e.g. GASNet API): 1. RS-PIR Gen, 2. Analysis, 3. Transformation
→ LLVM
→ Binary

Page 35: LLVM-based Communication Optimizations for PGAS Programs

35

Acknowledgements

Special thanks to: Brad Chamberlain (Cray), Rafael Larrosa Jimenez (UMA), Rafael Asenjo Plaza (UMA), Habanero Group at Rice

Page 36: LLVM-based Communication Optimizations for PGAS Programs

36

Backup slides

Page 37: LLVM-based Communication Optimizations for PGAS Programs

37

Compilation Flow

Chapel Programs → AST Generation and Optimizations, then either:

C path:    C-code Generation → C Programs → Backend Compiler's Optimizations (e.g. gcc -O3) → Binary
LLVM path: LLVM IR Generation → LLVM IR → LLVM Optimizations → Binary