1 ASSIST: A Feedback-Directed Optimization source to source transformation tool for HPC applications William Jalby, Y. Lebras, Andres S. Charif-Rubial UVSQ/ECR 11th Parallel Tools Workshop 11/09/2017 ASSIST

ASSIST: A Feedback-Directed Optimization

source to source transformation tool for

HPC applications

William Jalby, Y. Lebras, Andres S. Charif-Rubial

UVSQ/ECR

11th Parallel Tools Workshop – 11/09/2017

Outline

1. Introduction: motivation, goals

2. ASSIST

• Requirements

• Implementation & Design

• Available Transformations

3. Examples and Experimental Results

• ASSIST PGO Versus Intel PGO

• Other Transformations Applied to Real Applications

4. Conclusion

I - INTRODUCTION

Motivations

Combining source-level knowledge with static/dynamic performance analyses is very attractive for accurate performance diagnosis

Source code vs. actually executed code

Better understanding of memory-related issues (dependencies, array accesses)

Feedback-Directed Optimization (FDO) / Profile-Guided Optimization (PGO) is a well-known optimization approach used by compilers, but it has limitations:

Lack of information of what is really done

Limited in performance information used (loop trip count, branch behavior)

Limited in transformation power

Cannot be configured by the user

Goals

Basic idea: MAQAO is good at diagnosing performance problems; we need to go further and fix them.

ASSIST is an “auto-tuning” framework: for us, auto-tuning essentially means fully automated

Exploiting MAQAO’s metrics & knowledge

Detecting & exploiting information from source code

Transformation-driven framework: ideally detect whether a transformation is beneficial or not

Full control on transformations

Help developers to maintain their code

Ensure portability

Ease code refactoring (e.g. change data types across a program)

Provide users with a means to supply extra information that cannot be encoded in the program (i.e. programming language limitations)

II - ASSIST

Implementation & Design

Optimization Process

Requirements

Compiler infrastructure requirements

Allows manipulation of the Abstract Syntax Tree (AST)

Performs source-to-source transformation

Handles the Fortran, C and C++ languages

The Rose Compiler

Meets all these criteria

Robust for these languages

No equivalent when we started

Implementation & Design

ASSIST: Automatic Source-to-Source assISTant

Supports the following input languages

• Fortran 77, 90, 95, 2003 / C / C++03

Readable output

• Special effort on indentation and spaces

Easy to use with a simple user interface

• Annotations

• Configuration file

Target audience

• Users with the ability to modify/annotate the code

• Application developers

Integrated as a MAQAO Module

• Take advantage of the interconnection between the core (binary manipulation and analysis layers) and the modules

• Use the modules’ output to perform transformation(s)

• Extend MAQAO to source code manipulation

Available Transformations

Three types of transformations

AST Modifier

• Unroll

• Full unroll

• Interchange

• Strip mining

• Tiling

• Loop/Function Specialization

Directive(s) insertion

• Loop count (involving dynamic analyses)

Mix of both

• Block Vectorization

User Interface

Annotations – Source code annotation

Configuration file – Describes, line by line, which transformation is performed on which statement

Transformations

Specialization

Transformation type: AST Modifier

Specializing integer parameters gives the compiler optimization opportunities

• Constant propagation

• Partial Dead Code Elimination

• Loop unrolling, tiling, block vectorization, etc.

Single values or ranges can be defined

Two distinct cases

• Loop specialization

• Function specialization

Transformations

Loop Specialization Example

• Set bounds

• Conservative: keeps a generic version
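
A minimal sketch of what loop specialization with a preserved generic version can look like in C. The function name `scale` and the specialized bound 3 are illustrative assumptions; this is not ASSIST's actual output.

```c
#include <assert.h>

/* Hypothetical kernel: multiplies the first n elements of a by s. */
void scale(double *a, int n, double s)
{
    assert(n >= 0);
    if (n == 3) {
        /* Specialized version: the bound is a compile-time constant,
           so the compiler is free to fully unroll or vectorize it. */
        for (int i = 0; i < 3; i++)
            a[i] *= s;
    } else {
        /* Generic version, kept as a fallback for every other bound. */
        for (int i = 0; i < n; i++)
            a[i] *= s;
    }
}
```

Both branches compute the same result; only the information available to the compiler differs.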

Transformations

Function Specialization

• Partial Dead Code Elimination

• More information to perform another transformation
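
As an illustration of how function specialization enables partial dead code elimination, here is a hedged C sketch. The names `update`, `update_mode0` and `update_dispatch`, and the parameter `mode`, are hypothetical and chosen only for this example.

```c
#include <assert.h>

/* Generic version: 'mode' is tested on every iteration. */
void update(double *x, int n, int mode)
{
    for (int i = 0; i < n; i++) {
        if (mode == 0)
            x[i] += 1.0;
        else
            x[i] *= 2.0;
    }
}

/* Specialized for mode == 0: the other branch is dead code and is gone. */
void update_mode0(double *x, int n)
{
    for (int i = 0; i < n; i++)
        x[i] += 1.0;
}

/* Dispatcher: uses the specialized version when it applies,
   otherwise falls back to the generic one. */
void update_dispatch(double *x, int n, int mode)
{
    assert(n >= 0);
    if (mode == 0)
        update_mode0(x, n);
    else
        update(x, n, mode);
}
```

The specialized body has no branch left in the loop, which also makes it a simpler candidate for further transformations.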

Transformations

Loop count

Loop oriented transformation of type : Directives insertion

Loop count knowledge enables the compiler to perform optimizations

• The compiler cannot always guess the loop trip count at compile time => it may refuse to vectorize

• Most of the time, it simplifies

The control flow (fewer loop versions)

The choice of the vectorization/unrolling

Requires dynamic feedback

Performed by VPROF (MAQAO module)

Returns the number of iterations of loops (min, max & average)

Limitation

• Loops’ bounds are dataset dependent

Example

Dynamic feedback example

Original loop

for (i=0; i < NUM_KEYS; i++)
    key_buff_ptr[key_buff_ptr2[i]]++;

Extract of VPROF’s output

Exploiting the feedback

maqao s2s \
    -vprof_xp=/home/ylebras/vprof_dir/vprof.csv \
    -bin=/home/ylebras/NBP3.3.1/NPB3.3.1-SER/bin/is.B.x

Returns a file with the corresponding directives

#pragma loop_count max=134217728, min=134217728, avg=134217728
for (i=0; i < NUM_KEYS; i++) {
    key_buff_ptr[key_buff_ptr2[i]]++;
}
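
To make the link between trip-count feedback and the inserted directive concrete, the following is a small sketch of how min, max and average values could be turned into a `loop_count` directive. It is not VPROF's or ASSIST's implementation; the `trip_stats` struct and both function names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Trip-count statistics for one loop (hypothetical layout). */
struct trip_stats {
    long min;
    long max;
    long avg;
};

/* Computes min, max and integer average of n observed trip counts. */
struct trip_stats summarize_trips(const long *trips, int n)
{
    assert(n > 0);
    struct trip_stats s = { trips[0], trips[0], 0 };
    long long sum = 0;
    for (int i = 0; i < n; i++) {
        if (trips[i] < s.min) s.min = trips[i];
        if (trips[i] > s.max) s.max = trips[i];
        sum += trips[i];
    }
    s.avg = (long)(sum / n);
    return s;
}

/* Prints an Intel-style loop_count directive for the given statistics. */
void print_loop_count(const struct trip_stats *s)
{
    printf("#pragma loop_count min=%ld, max=%ld, avg=%ld\n",
           s->min, s->max, s->avg);
}
```

When every execution of the loop has the same trip count, as in the IS example, the three values coincide.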

Transformations

Block Vectorization

Loop oriented transformation of type : Directives insertion & AST modifier

Performing a loop decomposition increases the vectorization ratio

Increasing the vectorization ratio by :

• Forcing the vectorization (“SIMD” Directive)

• Avoiding dynamic or static loop peeling transformation (use of UNALIGNED PRAGMA)

If the loop bound is not known at compile time

• The loop will be specialized by checking the modulo of a given input
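
The decomposition idea can be sketched in C as follows. This is only an illustration under stated assumptions: `VLEN` is taken to be 4 (doubles per 256-bit AVX2 register), `axpy_blocked` is a made-up kernel, and the standard OpenMP `simd` directive stands in for the “SIMD” directive mentioned above.

```c
#include <assert.h>

#define VLEN 4  /* assumed: 4 doubles per 256-bit AVX2 register */

/* y[i] += a * x[i], decomposed into fixed-size blocks plus a residual. */
void axpy_blocked(double *y, const double *x, double a, int n)
{
    assert(n >= 0);
    int main_end = n - (n % VLEN);

    /* Main part: each block has exactly VLEN iterations, so the inner
       loop has a constant trip count that maps onto one vector. */
    for (int i = 0; i < main_end; i += VLEN) {
        #pragma omp simd
        for (int j = 0; j < VLEN; j++)
            y[i + j] += a * x[i + j];
    }

    /* Residual: the remaining n % VLEN iterations. */
    for (int i = main_end; i < n; i++)
        y[i] += a * x[i];
}
```

Compilers without OpenMP support enabled simply ignore the directive, so the sketch stays valid C either way.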

[Figure: loop not vectorized by the compiler. Target: AVX2. Body: DP]

[Figure: loop not vectorized by the compiler, after loop decomposition, with residual]

Example

Example of the block vectorization performed in AVBP (target architecture : Skylake)

Original loop

Extract of CQA’s output

In this case, “nproduct” is often called with the value “3”

Exploiting the CQA feedback

Example

Example of the block vectorization performed in AVBP (target architecture : Skylake using AVX2)

Step 1 – Specialization of the loop

Step 2 – Apply block vectorization

Keep a generic version of the code

Results

CQA report before and after block vectorization

Before

The loop is partially vectorized (33% of SSE/AVX instructions are used in vector mode): only 50% of vector length is used.

33% of SSE/AVX loads are used in vector mode.

33% of SSE/AVX stores are used in vector mode.

After

The loop is vectorized (all SSE/AVX instructions are used in vector mode) but on 75% of vector length.

Transformations

Configuration file sample

• File: Source file path

• Arch: Architectures to support.

• Target a loop by its line number or by a label attached on the loop

A way to annotate an application without adding directives to the source code

III – Experimental Results

Results

Test cases

NPB-3.3.1-SER (Fortran77/C) https://www.nas.nasa.gov/publications/npb.html

• NAS Parallel Benchmarks

Applications

AVBP (Fortran95) http://www.cerfacs.fr/avbp7x/

• A parallel CFD code that solves the three-dimensional compressible Navier-Stokes equations on unstructured and hybrid grids

Yales2 (Fortran2003) https://www.coria-cfd.fr/index.php/YALES2

• YALES2 aims at the solving of two-phase combustion from primary atomization to pollutant prediction on massive complex meshes

Warp3D (Fortran77) http://www.warp3d.net/

• A research code for the solution of large-scale, 3-D solid models subjected to static and dynamic loads

ABINIT (Fortran90) https://www.abinit.org

• ABINIT is a software suite to calculate the optical, mechanical, vibrational, and other observable properties of materials

Results

Experimental setup

Compiled with icc 17.0.4

Intel Skylake (Intel® Xeon® Platinum 8170 CPU @ 2.10 GHz)

Multiple (around 30) executions, to be statistically meaningful and avoid outliers

PGO performance comparison

Original version

ICC’s PGO

ASSIST’s PGO-like approach (loop count transformation)

Results of other transformations

Block Vectorization

Specialization

Results on NAS

Speedups with the ICC’s PGO versus loop count transformation compared to the original version

Number of loops processed with the loop count transformation

Benchmark   Loops
BT.B        34
CG.B        11
DC.B        5
EP.B        2
FT.B        6
IS.B        14
LU.B        49
MG.B        18
SP.B        79
UA.B        80

No significant results

Many loop bounds have been hard-coded

Results on AVBP, Yales2 & Warp3D

Speedups with the ICC’s PGO versus loop count transformation compared to the original version

Number of loops processed with the loop count transformation

Case          Loops
1D_COFFE      122
3D_Cylinder   162
SIMPLE        158
NASA          149
test_68       57

Baseline: original version

Results on AVBP(model = SIMPLE)

Speedup by function before and after applying function/loop specialization and block vectorization

Baseline: original version

Results on AVBP(model = SIMPLE)

Execution time by function before and after applying function/loop specialization and block vectorization

Results on ABINIT(Ti-256)

Speedup with function specialization + tiling versus only specialization versus ICC’s PGO compared to the original version

                     Time (sec)   Speedup
Original version     1.14         1.00
icc's PGO            1.14         1.00
ASSIST Spe           1.1          1.04
ASSIST Spe+Tiling    0.65         1.75
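
The speedup values reported for ABINIT are consistent with the usual definition, assumed here, of original time divided by transformed time:

```latex
\[
\text{speedup} = \frac{T_{\text{original}}}{T_{\text{transformed}}},
\qquad
\frac{1.14}{1.1} \approx 1.04,
\qquad
\frac{1.14}{0.65} \approx 1.75
\]
```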

IV - Conclusion

Conclusion

A framework performing selective source-to-source transformations/optimizations guided by static/dynamic performance analysis.

An open source FDO tool

• Harnessing static and dynamic analyses from MAQAO

• Defining transformations on a per architecture basis either automatically or by the user

• Transformations done directly or by pragmas

Encouraging results

• Using the loop count transformation alone is already competitive with Intel’s PGO

• Block vectorization only needs a static analysis of the binary and provides significant speedup when the compiler fails to vectorize efficiently

• Automatic specialization improves both maintainability and performance

Future work

Enhance our FDO tool

• Keep working on function/loop specialization, both from annotations and automatically, using feedback from MAQAO tools

• Use more data from dynamic feedback (hardware counters, static analyses)

• Enable the tool to launch MAQAO modules (autotuning mode) based on the detected opportunities

Unified view of source and binary level analyses

• Help application developers understand the gap between how the code should run and how it actually performs

Continue to work with our application developer partners on code maintainability features

Keep on adding other transformations based on MAQAO’s research work to detect more optimization opportunities

• Use multiple datasets as input

• Detect values for specialization

• …

Thanks

Any questions?

Requirements

Find a compiler infrastructure that performs source-to-source transformations and handles the Fortran, C and C++ languages

Tool      License   C   C++   Fortran   Source-to-source   Documentation   Weakness
GNU       OSI       ✓   ✓     ✓         ~                  ~               GPL license; misses information in AST
Cetus     GPL       ✓   x     x         ✓                  ✓               Handles only C
Par4All   MIT       ✓   x     ✓         ✓                                  Only for parallelism
LLVM      BSD       ✓   ✓     ~         ~                  ~               No Fortran when we started; now first version of Flang
Rose      BSD       ✓   ✓     ✓         ✓                  ✓               EDG license for C/C++
Orio      BSD       ~   x     x         ~                  x               Only subset of C to other languages

✓ Requirement OK

~ Theoretically possible / Weak

x Requirement not met

Transformations

Unroll

• Unrolls the body of a loop by a factor of N

• Reduces the number of instructions that control the loop

• Reduces branch penalties

• Helps the compiler vectorize
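
A minimal C sketch of unrolling by a factor of 4, with a remainder loop for trip counts that are not a multiple of 4. The function name `sum_unroll4` is an assumption made for this example.

```c
#include <assert.h>

/* Sums n doubles with the loop body unrolled by a factor of 4. */
double sum_unroll4(const double *a, int n)
{
    assert(n >= 0);
    double s = 0.0;
    int i = 0;

    /* Unrolled part: one loop test and one branch per 4 elements. */
    for (; i + 3 < n; i += 4) {
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
    }

    /* Remainder loop for the last n % 4 elements. */
    for (; i < n; i++)
        s += a[i];

    return s;
}
```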

Transformations

Full Unroll

• The loop is replaced by its fully unrolled body

• Same advantages as unrolling

• Removes the loop overhead
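
For a loop whose trip count is a small known constant, full unrolling can be sketched as below. The 3-element dot product `dot3` is a hypothetical example chosen for brevity.

```c
/* Dot product of two 3-element vectors.
   Rolled form:  for (i = 0; i < 3; i++) s += a[i] * b[i];
   Fully unrolled form below: no counter, no test, no branch. */
double dot3(const double *a, const double *b)
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}
```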

Transformations

Interchange

• Better access to array elements

• Moving from column-major to row-major traversal, or vice versa.
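
A short C sketch of loop interchange. C stores arrays in row-major order, so the interchanged version accesses memory with stride 1. The matrix size `N` and both function names are illustrative assumptions.

```c
#define N 4  /* illustrative matrix size */

/* Before: the inner loop walks down a column (stride of N doubles). */
void add_one_before(double m[N][N])
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            m[i][j] += 1.0;
}

/* After interchange: the inner loop walks along a row (stride 1). */
void add_one_after(double m[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            m[i][j] += 1.0;
}
```

In Fortran, which is column-major, the favourable loop order is the opposite one.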

Transformations

Strip Mine

• Reorganizes a loop to iterate over blocks of data sized to fit in the cache
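
Strip mining can be sketched in C as follows. The strip size `STRIP` of 64 elements and the function name are assumptions for illustration; a real strip size would be chosen from the cache parameters of the target.

```c
#include <assert.h>

#define STRIP 64  /* illustrative strip size */

/* Doubles every element of a, iterating strip by strip. */
void double_strip_mined(double *a, int n)
{
    assert(n >= 0);
    /* Outer loop: steps from one strip to the next. */
    for (int ii = 0; ii < n; ii += STRIP) {
        int end = (ii + STRIP < n) ? ii + STRIP : n;
        /* Inner loop: iterations of the original loop within one strip. */
        for (int i = ii; i < end; i++)
            a[i] *= 2.0;
    }
}
```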

Transformations

Tiling / Blocking

• Strip mining applied to two or more dimensions
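
As a sketch of tiling, the following C code transposes a matrix tile by tile. The sizes `N` and `TILE` and the function name `transpose_tiled` are illustrative assumptions.

```c
#define N 8     /* illustrative matrix size */
#define TILE 4  /* illustrative tile size */

/* Transposes src into dst, one TILE x TILE block at a time. */
void transpose_tiled(double dst[N][N], double src[N][N])
{
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            /* Work confined to a single tile. */
            for (int i = ii; i < ii + TILE && i < N; i++)
                for (int j = jj; j < jj + TILE && j < N; j++)
                    dst[j][i] = src[i][j];
}
```

Each tile touches a bounded set of rows of both matrices, which is what keeps the working set small.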

Transformations

Generic Block Vectorization

• If the loop bound is not known

The loop will be specialized by checking the modulo of a given input
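
One possible reading of the modulo check, sketched in C. This is an assumption about the shape of the generated code, not ASSIST's actual output: `VLEN`, `scale_blocks` and `scale_generic` are hypothetical names, and the OpenMP `simd` directive stands in for a vendor SIMD directive.

```c
#include <assert.h>

#define VLEN 4  /* assumed vector length in elements */

/* Version for n that is a multiple of VLEN: no residual loop. */
static void scale_blocks(double *a, int n, double s)
{
    for (int i = 0; i < n; i += VLEN) {
        #pragma omp simd
        for (int j = 0; j < VLEN; j++)
            a[i + j] *= s;
    }
}

/* Run-time dispatch on n % VLEN, since n is unknown at compile time. */
void scale_generic(double *a, int n, double s)
{
    assert(n >= 0);
    int rem = n % VLEN;
    if (rem == 0) {
        scale_blocks(a, n, s);
    } else {
        /* Block part first, then the rem leftover iterations. */
        scale_blocks(a, n - rem, s);
        for (int i = n - rem; i < n; i++)
            a[i] *= s;
    }
}
```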

Results

AVBP (SIMPLE): block vectorization versus function or loop specialization. Execution time and speedup (compared to the original version)

                Original   Function specialization   Loop specialization   Block vectorization (on best case)
Function        time(s)    Speedup    time(s)        Speedup    time(s)    Speedup    time(s)
grad_4obj       3.862      1.62       2.38           1.55       2.49       2.04       1.89
scatter_o_add   3.78       0.85       4.44           1.21       3.13       0.97       3.88
scatter_add     4.164      1          4.16           0.99       4.22       1.38       3.01
scatter_o_sub   2.63       0.98       2.69           1          2.62       1.21       2.17
gather_o_cpy    16.324     0.81       20.12          1.04       15.68      1.28       12.76
balance_cor     0.492      1          0.49           1          0.49       1.24       0.39
central         0.86       1.35       0.64           1.59       0.54       1.85       0.46
central_nv      0.945      1.6        0.59           1.21       0.78       2.65       0.36
mass_product    2.238      1.02       2.84           1.27       2.69       2.58       1.49
laxwe           2.278      0.79       2.23           0.83       1.8        1.51       0.88