LLVM-based Communication Optimizations for PGAS Programs
Akihiro Hayashi (Rice University), Jisheng Zhao (Rice University), Michael Ferguson (Cray Inc.), Vivek Sarkar (Rice University)
2nd Workshop on the LLVM Compiler Infrastructure in HPC @ SC15


Page 1: LLVM-based Communication Optimizations for PGAS Programs

1

LLVM-based Communication Optimizations for PGAS Programs

Akihiro Hayashi (Rice University)
Jisheng Zhao (Rice University)
Michael Ferguson (Cray Inc.)
Vivek Sarkar (Rice University)

2nd Workshop on the LLVM Compiler Infrastructure in HPC @ SC15

Page 2: LLVM-based Communication Optimizations for PGAS Programs

2

A Big Picture

Photo Credits : http://chapel.cray.com/logo.html, http://llvm.org/Logo.html, http://upc.lbl.gov/, http://commons.wikimedia.org/, http://cs.lbl.gov/

© Argonne National Lab.

© RIKEN AICS

©Berkeley Lab.

X10, Habanero-UPC++, …

Communication Optimization

Page 3: LLVM-based Communication Optimizations for PGAS Programs

3

PGAS Languages

High-productivity features: Global-View, Task Parallelism, Data Distribution, Synchronization

X10, CAF, Habanero-UPC++

Photo Credits : http://chapel.cray.com/logo.html, http://upc.lbl.gov/

Page 4: LLVM-based Communication Optimizations for PGAS Programs

4

Communication is Implicit in some PGAS Programming Models

Global Address Space: the compiler and runtime are responsible for performing communication across nodes.

Remote Data Access in Chapel:

var x = 1;          // on Node 0
on Locales[1] {     // on Node 1
  … = x;            // DATA ACCESS
}

Page 5: LLVM-based Communication Optimizations for PGAS Programs

5

Communication is Implicit in some PGAS Programming Models (Cont'd)

Remote Data Access:

var x = 1;          // on Node 0
on Locales[1] {     // on Node 1
  … = x;            // DATA ACCESS

Runtime affinity handling:

if (x.locale == MYLOCALE) {
  *(x.addr) = 1;
} else {
  gasnet_get(…);
}

OR Compiler Optimization:

var x = 1;
on Locales[1] {
  … = 1;
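The runtime affinity handling above can be sketched in C. This is a minimal single-node model, not the real Chapel runtime: `wide_ptr`, `fake_remote_get`, and `read_possibly_remote` are hypothetical names, and the "remote" path just copies memory locally where a real runtime would issue a gasnet_get.

```c
#include <string.h>

/* Hypothetical wide pointer: a node id plus a local address. */
typedef struct { int locale; void *addr; } wide_ptr;

static int MY_LOCALE = 0;

/* Stand-in for gasnet_get: this sketch runs on one node, so a remote
 * fetch is modeled as a plain memcpy from the target address. */
static void fake_remote_get(void *dst, const wide_ptr *src, size_t n) {
    memcpy(dst, src->addr, n);
}

/* Lowered read of a possibly-remote int: check affinity at runtime and
 * take the fast local path when the data lives on this node. */
int read_possibly_remote(const wide_ptr *p) {
    int v;
    if (p->locale == MY_LOCALE)
        v = *(int *)p->addr;               /* definitely-local fast path */
    else
        fake_remote_get(&v, p, sizeof v);  /* communication slow path */
    return v;
}
```

The branch itself is the cost the locality optimization on a later slide removes: when the compiler can prove `p->locale == MY_LOCALE`, the check and the slow path disappear.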

Page 6: LLVM-based Communication Optimizations for PGAS Programs

6

Communication Optimization is Important

A synthetic Chapel program on Intel Xeon CPU X5660 clusters with QDR InfiniBand.

[Chart: latency (ms, log scale, 100 to 10,000,000) vs. transferred bytes (4,096 up to ~1M bytes), comparing Optimized (Bulk Transfer) against Unoptimized; the gap between the two versions ranges from 59x to 1,500x. Lower is better.]
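The gap in the chart above is what a simple latency/bandwidth cost model predicts: fine-grained access pays the per-message latency once per element, while a bulk transfer pays it once in total. A sketch in C, with illustrative constants rather than the measured numbers from the slide:

```c
/* Toy alpha-beta cost model: each message pays a fixed latency alpha (us);
 * each byte pays a bandwidth term beta (us per byte). */
double transfer_cost(long messages, long total_bytes,
                     double alpha_us, double beta_us_per_byte) {
    return messages * alpha_us + total_bytes * beta_us_per_byte;
}
```

With alpha = 1 us and beta = 0.001 us/byte, moving 8 MB as one million 8-byte messages costs roughly 1,008,000 us, while one bulk transfer of the same bytes costs about 8,001 us: a ~126x gap from the message count alone, before any per-message software overhead.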

Page 7: LLVM-based Communication Optimizations for PGAS Programs

7

PGAS Optimizations are language-specific

Photo Credits : http://chapel.cray.com/logo.html, http://llvm.org/Logo.html, http://upc.lbl.gov/, http://commons.wikimedia.org/, http://cs.lbl.gov/

X10, Habanero-UPC++, …

© Argonne National Lab.

©Berkeley Lab.

© RIKEN AICS

Chapel Compiler

UPC Compiler

X10 Compiler
Habanero-C Compiler

Language Specific!

Page 8: LLVM-based Communication Optimizations for PGAS Programs

8

Our goal

Photo Credits : http://chapel.cray.com/logo.html, http://llvm.org/Logo.html, http://upc.lbl.gov/, http://commons.wikimedia.org/, http://cs.lbl.gov/

© Argonne National Lab.

© RIKEN AICS

©Berkeley Lab.

X10, Habanero-UPC++, …

Page 9: LLVM-based Communication Optimizations for PGAS Programs

9

Why LLVM?

Widely used language-agnostic compiler, built around the LLVM Intermediate Representation (LLVM IR):

Frontends: Clang (C/C++), dragonegg (C/C++, Fortran, Ada, Objective-C), Chapel frontend, UPC++ frontend
→ Analysis & Optimizations on LLVM IR
→ Backends: x86, PowerPC, ARM, PTX
→ Binaries: x86, PPC, ARM, GPU

Page 10: LLVM-based Communication Optimizations for PGAS Programs

10

Summary & Contributions

Our Observations:
 Many PGAS languages share semantically similar constructs
 PGAS optimizations are language-specific

Contributions: Built a compilation framework that can uniformly optimize PGAS programs (initial focus: communication)
 Enabling existing LLVM passes for communication optimizations
 PGAS-aware communication optimizations

Photo Credits : http://chapel.cray.com/logo.html, http://llvm.org/Logo.html,

Page 11: LLVM-based Communication Optimizations for PGAS Programs

11

Overview of our framework

Chapel Programs → Chapel-LLVM frontend
UPC++ Programs → UPC++-LLVM frontend
X10 Programs → X10-LLVM frontend
CAF Programs → CAF-LLVM frontend

Each frontend (the part that needs to be implemented when supporting a new language/runtime) emits:
1. Vanilla LLVM IR
2. LLVM IR that uses the address space feature to express communications

LLVM IR → LLVM-based Communication Optimization passes → Lowering Pass (generally language-agnostic)

Page 12: LLVM-based Communication Optimizations for PGAS Programs

12

How optimizations work

Chapel:
x = 1;  // x is possibly remote

UPC++:
shared_var<int> x;
x = 1;

Both are expressed in LLVM IR as:
store i64 1, i64 addrspace(100)* %x, …

1. Existing LLVM optimizations treat a remote access as if it were a local access
2. PGAS-aware optimizations are address space-aware

Runtime-Specific Lowering then turns the remaining remote accesses into Communication API Calls.

Page 13: LLVM-based Communication Optimizations for PGAS Programs

13

LLVM-based Communication Optimizations for Chapel

1. Enabling existing LLVM passes
   Loop invariant code motion (LICM), scalar replacement, …

2. Aggregation
   Combine sequences of loads/stores on adjacent memory locations into a single memcpy

These are already implemented in the standard Chapel compiler.

Page 14: LLVM-based Communication Optimizations for PGAS Programs

14

An optimization example: LICM for Communication Optimizations

LICM = Loop Invariant Code Motion

for i in 1..100 {
  %x = load i64 addrspace(100)* %xptr   // loop-invariant remote load
  A(i) = %x;
}

LICM by LLVM hoists the invariant load out of the loop, so the remote access happens once instead of 100 times.
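The effect of hoisting can be sketched in C by counting communication calls. `remote_get` here is a hypothetical stand-in for the runtime's GET, not a Chapel or GASNet API, and the counter only exists to make the savings observable:

```c
static long get_count = 0;

/* Stand-in for a remote GET; counts how many communications happen. */
static int remote_get(const int *remote_addr) {
    get_count++;
    return *remote_addr;
}

/* Unoptimized: the invariant remote load sits inside the loop (n GETs). */
void fill_unopt(int *A, int n, const int *xptr) {
    for (int i = 0; i < n; i++)
        A[i] = remote_get(xptr);
}

/* After LICM: the load is hoisted above the loop (1 GET). */
void fill_licm(int *A, int n, const int *xptr) {
    int x = remote_get(xptr);
    for (int i = 0; i < n; i++)
        A[i] = x;
}
```

The transformed version computes the same array contents while issuing a single GET, which is exactly why LICM doubles as a communication optimization once remote loads are visible as ordinary loads in addrspace(100).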

Page 15: LLVM-based Communication Optimizations for PGAS Programs

15

An optimization example: Aggregation

// p is possibly remote
sum = p.x + p.y;

Unoptimized, this generates two GETs:
load i64 addrspace(100)* %pptr+0   // GET x
load i64 addrspace(100)* %pptr+4   // GET y

Since x and y are adjacent in memory, the two loads are combined into a single GET:
llvm.memcpy(…);
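A C sketch of the same idea, again counting messages; `fake_get` is a hypothetical stand-in that models one GET as one memcpy, and `point` stands in for the possibly-remote record:

```c
#include <string.h>

typedef struct { long x, y; } point;

static long msg_count = 0;

/* Stand-in for a remote GET of n bytes. */
static void fake_get(void *dst, const void *src, size_t n) {
    msg_count++;
    memcpy(dst, src, n);
}

/* Unoptimized: one GET per field. */
long sum_unopt(const point *p) {
    long x, y;
    fake_get(&x, &p->x, sizeof x);
    fake_get(&y, &p->y, sizeof y);
    return x + y;
}

/* Aggregated: the two adjacent loads become one memcpy-sized GET. */
long sum_agg(const point *p) {
    point local;
    fake_get(&local, p, sizeof local);
    return local.x + local.y;
}
```

Both versions return the same sum; the aggregated one halves the message count, and the savings grow with the number of adjacent fields combined.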

Page 16: LLVM-based Communication Optimizations for PGAS Programs

16

LLVM-based Communication Optimizations for Chapel

3. Locality Optimization
   Infer the locality of data and convert possibly-remote accesses to definitely-local accesses at compile time when possible

4. Coalescing
   Remote array access vectorization

These are implemented, but not in the standard Chapel compiler.

Page 17: LLVM-based Communication Optimizations for PGAS Programs

17

An optimization example: Locality Optimization

1: proc habanero(ref x, ref y, ref z) {
2:   var p: int = 0;
3:   var A: [1..N] int;
4:   local { p = z; }
5:   z = A(0) + z;
6: }

1. A is definitely local (declared in this scope)
2. p and z are definitely local (the local block asserts it)
3. So line 5 is a definitely-local access (avoiding runtime affinity checking)

Page 18: LLVM-based Communication Optimizations for PGAS Programs

18

An optimization example: Coalescing

Before:
for i in 1..N {
  … = A(i);
}

After:
localA = A;          // perform bulk transfer
for i in 1..N {
  … = localA(i);     // converted to definitely-local access
}
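The before/after pair can be sketched in C with communication counters. This is a model under assumptions, not the generated code: `remote_read` stands in for a fine-grained element GET, and the bulk transfer is modeled as one memcpy into a caller-provided local buffer.

```c
#include <string.h>

static long elem_gets = 0;  /* fine-grained GETs issued */
static long bulk_gets = 0;  /* bulk transfers issued */

/* Stand-in for a one-element remote GET. */
static int remote_read(const int *a, int i) {
    elem_gets++;
    return a[i];
}

/* Before coalescing: one GET per array element (n messages). */
long sum_before(const int *A, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += remote_read(A, i);
    return s;
}

/* After coalescing: one bulk transfer, then all accesses are local. */
long sum_after(const int *A, int n, int *localA) {
    bulk_gets++;
    memcpy(localA, A, n * sizeof *A);   /* the single bulk GET */
    long s = 0;
    for (int i = 0; i < n; i++)
        s += localA[i];                 /* definitely-local access */
    return s;
}
```

This differs from aggregation in scope: aggregation merges a fixed sequence of adjacent scalar accesses, while coalescing vectorizes a whole remote array traversal into one transfer plus a local loop.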

Page 19: LLVM-based Communication Optimizations for PGAS Programs

19

Performance Evaluations: Benchmarks

Application       Size
Smith-Waterman    185,600 x 192,000
Cholesky Decomp   10,000 x 10,000
NPB EP            CLASS = D
Sobel             48,000 x 48,000
SSCA2 Kernel 4    SCALE = 16
Stream EP         2^30

Page 20: LLVM-based Communication Optimizations for PGAS Programs

20

Performance Evaluations: Platforms

Cray XC30™ Supercomputer @ NERSC
 Node: Intel Xeon E5-2695 @ 2.40GHz x 24 cores, 64GB of RAM
 Interconnect: Cray Aries interconnect with Dragonfly topology

Westmere Cluster @ Rice
 Node: Intel Xeon CPU X5660 @ 2.80GHz x 12 cores, 48GB of RAM
 Interconnect: Quad data rate (QDR) InfiniBand

Page 21: LLVM-based Communication Optimizations for PGAS Programs

21

Performance Evaluations: Details of Compiler & Runtime

Compiler: Chapel Compiler version 1.9.0, LLVM 3.3

Runtime: GASNet-1.22.0
 Cray XC: aries conduit
 Westmere Cluster: ibv conduit

Qthreads-1.10
 Cray XC: 2 shepherds, 24 workers / shepherd
 Westmere Cluster: 2 shepherds, 6 workers / shepherd

Page 22: LLVM-based Communication Optimizations for PGAS Programs

22

BRIEF SUMMARY OF PERFORMANCE EVALUATIONS

Performance Evaluation

Page 23: LLVM-based Communication Optimizations for PGAS Programs

23

Results on the Cray XC (LLVM-unopt vs. LLVM-allopt)

[Chart: performance improvement over LLVM-unopt for SW, Cholesky, Sobel, StreamEP, EP, and SSCA2, broken down by Coalescing, Locality Opt, and Aggregation; per-benchmark speedups are 2.1x, 19.5x, 1.1x, 2.4x, 1.4x, and 1.3x. Higher is better.]

4.6x performance improvement relative to LLVM-unopt on the same # of locales on average (1, 2, 4, 8, 16, 32, 64 locales)

Page 24: LLVM-based Communication Optimizations for PGAS Programs

24

Results on Westmere Cluster (LLVM-unopt vs. LLVM-allopt)

[Chart: performance improvement over LLVM-unopt for SW, Cholesky, Sobel, StreamEP, EP, and SSCA2, broken down by Coalescing, Locality Opt, and Aggregation; per-benchmark speedups are 2.3x, 16.9x, 1.1x, 2.5x, 1.3x, and 2.3x. Higher is better.]

4.4x performance improvement relative to LLVM-unopt on the same # of locales on average (1, 2, 4, 8, 16, 32, 64 locales)

Page 25: LLVM-based Communication Optimizations for PGAS Programs

25

DETAILED RESULTS & ANALYSIS OF CHOLESKY DECOMPOSITION

Performance Evaluation

Page 26: LLVM-based Communication Optimizations for PGAS Programs

26

Cholesky Decomposition dependencies

[Figure: tiled Cholesky decomposition; the triangular tile grid is labeled with owning nodes (0-3), with tiles block-distributed across Node0, Node1, Node2, and Node3, inducing inter-node dependencies.]

Page 27: LLVM-based Communication Optimizations for PGAS Programs

27

Metrics

1. Performance & Scalability
   Baseline (LLVM-unopt) vs. LLVM-based Optimizations (LLVM-allopt)
2. The dynamic number of communication API calls
3. Analysis of optimized code
4. Performance comparison
   Conventional C-backend vs. LLVM-backend

Page 28: LLVM-based Communication Optimizations for PGAS Programs

28

Performance Improvement by LLVM (Cholesky on the Cray XC)

[Chart: speedup over LLVM-unopt on 1 locale]

Locales      1    2    4    8    16   32
LLVM-unopt   1.0  0.1  0.1  0.2  0.2  0.3
LLVM-allopt  2.6  2.7  3.7  4.1  4.3  4.5

LLVM-based communication optimizations show scalability

Page 29: LLVM-based Communication Optimizations for PGAS Programs

29

Communication API call elimination by LLVM (Cholesky on the Cray XC)

[Chart: dynamic number of communication API calls, normalized to LLVM-unopt (100%)]

              LLVM-allopt   Improvement
LOCAL_GET     12.1%         8.3x
REMOTE_GET    0.2%          500x
LOCAL_PUT     89.2%         1.1x
REMOTE_PUT    100.0%        -

Page 30: LLVM-based Communication Optimizations for PGAS Programs

30

Analysis of optimized code

LLVM-unopt:
for jB in zero..tileSize-1 {
  for kB in zero..tileSize-1 {    // 4 GETs
    for iB in zero..tileSize-1 {  // 9 GETs + 1 PUT
}}}

LLVM-allopt:
1. ALLOCATE LOCAL BUFFER
2. PERFORM BULK TRANSFER
for jB in zero..tileSize-1 {
  for kB in zero..tileSize-1 {    // 1 GET
    for iB in zero..tileSize-1 {  // 1 GET + 1 PUT
}}}

Page 31: LLVM-based Communication Optimizations for PGAS Programs

31

Performance comparison with C-backend

[Chart: speedup over LLVM-unopt on 1 locale]

Locales      1    2    4    8    16   32   64
C-backend    2.8  0.1  0.1  0.1  0.3  0.7  0.4
LLVM-unopt   1.0  0.1  0.1  0.2  0.2  0.3  0.4
LLVM-allopt  2.6  2.7  3.7  4.1  4.3  4.5  4.5

C-backend is faster on a single locale!

Page 32: LLVM-based Communication Optimizations for PGAS Programs

32

Current limitation

In LLVM 3.3, many optimizations assume that the pointer size is the same across all address spaces.

For LLVM code generation, a wide pointer is a 64-bit packed pointer: locale (16 bits) | addr (48 bits).
For C code generation, it is a 128-bit struct pointer, so field accesses are direct:
ptr.locale; ptr.addr;

With the packed representation they become bit operations:
ptr >> 48;          // locale
ptr & 48BITS_MASK;  // addr

1. Needs more instructions
2. Loses opportunities for alias analysis
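The 64-bit packed layout described above can be sketched in C; the helper names (`pack`, `locale_of`, `addr_of`) are illustrative, not the names used in the Chapel compiler:

```c
#include <stdint.h>

/* 64-bit packed wide pointer: locale in the top 16 bits,
 * local address in the low 48 bits. */
#define ADDR_BITS 48
#define ADDR_MASK ((UINT64_C(1) << ADDR_BITS) - 1)

uint64_t pack(uint16_t locale, uint64_t addr) {
    return ((uint64_t)locale << ADDR_BITS) | (addr & ADDR_MASK);
}

uint16_t locale_of(uint64_t p) { return (uint16_t)(p >> ADDR_BITS); }

uint64_t addr_of(uint64_t p)   { return p & ADDR_MASK; }
```

Every field access costs a shift or a mask, which is the extra-instruction overhead the slide points out, and because the address is hidden inside an integer, LLVM's alias analysis can no longer see it as a pointer.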

Page 33: LLVM-based Communication Optimizations for PGAS Programs

33

Conclusions

LLVM-based communication optimizations for PGAS programs are a promising way to optimize PGAS programs in a language-agnostic manner.

Preliminary evaluation with 6 Chapel applications:
 Cray XC30 Supercomputer: 4.6x average performance improvement
 Westmere Cluster: 4.4x average performance improvement

Page 34: LLVM-based Communication Optimizations for PGAS Programs

34

Future work

Extend LLVM IR to support parallel programs with PGAS and explicit task parallelism via a higher-level IR:

Parallel Programs (Chapel, X10, CAF, HC, …)
→ Runtime-Independent Optimizations (e.g. task parallel constructs): 1. RI-PIR Gen, 2. Analysis, 3. Transformation
→ Runtime-Specific Optimizations (e.g. GASNet API): 1. RS-PIR Gen, 2. Analysis, 3. Transformation
→ LLVM
→ Binary

Page 35: LLVM-based Communication Optimizations for PGAS Programs

35

Acknowledgements

Special thanks to: Brad Chamberlain (Cray), Rafael Larrosa Jimenez (UMA), Rafael Asenjo Plaza (UMA), Habanero Group at Rice

Page 36: LLVM-based Communication Optimizations for PGAS Programs

36

Backup slides

Page 37: LLVM-based Communication Optimizations for PGAS Programs

37

Compilation Flow

Chapel Programs → AST Generation and Optimizations, then either:

C path:    C-code Generation → C Programs → Backend Compiler's Optimizations (e.g. gcc -O3) → Binary
LLVM path: LLVM IR Generation → LLVM IR → LLVM Optimizations → Binary