
A Reproducible Research Methodology for Designing and Conducting Faithful Simulations of Dynamic Task-based Scientific Applications

Luka Stanisic
Inria Bordeaux Sud-Ouest, France

MPCDF seminar, Garching
February 24, 2017


Background

Bachelor (CS specialty), EE faculty, Belgrade, Serbia
Research Master (parallelism specialty), Grenoble, France: benchmarking, CPU cache modeling, ARM vs Intel
PhD (supervisors A. Legrand & J.-F. Méhaut), Grenoble, France: modeling and simulation of dynamic task-based applications, methodology for reproducible research, statistical analysis, trace visualizations
PostDoc, Bordeaux, France: performance optimization, large scale simulations, modeling complex kernels, simulating openQCD

(Timeline spanning 2011–2017)


Parallel Programming Challenges

Communications and data placement
Synchronization of the workers
Computation duration variability, scalability
Exploiting hybrid machines
Choosing granularity, portability of code and performance

(Illustration: theory vs. practice)

Different Programming Approaches

Traditional, explicit programming models (MPI, CUDA, OpenMP, pthreads, ...)
Perfect control → maximal achievable performance
Efficient granularity → advanced numerical features
Monolithic codes → hard to develop and maintain
Fixed scheduling → sensitive to variability
Hard and long to optimize for performance portability

Recent task-based programming models (PaRSEC, OmpSs, Charm++, StarPU, ...)
Single, abstract programming model based on a DAG
Runtime system responsible for dynamic scheduling
Portability of code and performance
Introduces runtime system overhead
Developing an omnipotent runtime → new challenges


Task-based Programming Paradigm

Tiled Cholesky algorithm using the Sequential Task Flow (STF) model:

    for (j = 0; j < N; j++) {
        POTRF (RW, A[j][j]);
        for (i = j+1; i < N; i++)
            TRSM (RW, A[i][j], R, A[j][j]);
        for (i = j+1; i < N; i++) {
            SYRK (RW, A[i][i], R, A[i][j]);
            for (k = j+1; k < i; k++)
                GEMM (RW, A[i][k], R, A[i][j], R, A[k][j]);
        }
    }
    wait();

(Resulting task graph with POTRF, TRSM, SYRK and GEMM nodes)
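In StarPU's sequential task flow API, each pseudocode call above corresponds to a task submission whose access modes let the runtime infer the dependencies. The following minimal C sketch shows how the inner GEMM submission might look; the codelet name cl_gemm, the tile-handle array and the helper function are illustrative assumptions, not code taken from the slides:

    #include <starpu.h>

    #define NT 8                              /* number of tile rows/columns (illustrative) */

    extern struct starpu_codelet cl_gemm;     /* hypothetical codelet wrapping the BLAS GEMM */
    extern starpu_data_handle_t A[NT][NT];    /* tile handles registered beforehand */

    /* Submit one GEMM task; the RW/R access modes encode the data dependencies. */
    static void submit_gemm(int i, int j, int k)
    {
        starpu_task_insert(&cl_gemm,
                           STARPU_RW, A[i][k],
                           STARPU_R,  A[i][j],
                           STARPU_R,  A[k][j],
                           0);
    }

    /* The wait() of the pseudocode corresponds to starpu_task_wait_for_all(). */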


Need for Regular Performance Evaluation

Native experiments:
Complex systems
Wide variety of setups
Faithful but expensive

Model, equations, theory:
PRAM, BSP, DAG
Scheduling bounds
Quick trends but simplistic

Simulation: running real code with a machine abstraction

Advantages:
Reproducible executions (performance, bugs)
Predictions on unavailable architectures (extrapolation)
Richer experimental design possible

Difficulties:
Implementing more than a simple prototype
Hard to validate its reliability


Research Statement

Is it possible to perform a clean, coherent, reproducible study of HPC applications executed on top of dynamic task-based runtime systems, using simulation?


Outline

1. Simulating Task-based Applications
   Coupling StarPU Runtime System and SimGrid Simulator
   Tackling Irregular Numerical Codes
   Use Cases
   Summary
2. Methodology for Reproducible Research
3. Conclusion

StarPU and SimGrid

StarPU (Inria Bordeaux)
Dynamic runtime system for hybrid architectures (CPU, GPU, MPI)
Opportunistic scheduling of a task graph guided by performance models
Features dense, sparse and FMM applications

SimGrid (Inria Grenoble, Lyon, Rennes, ...)
Scalable simulation framework for distributed systems
Sound fluid network models accounting for heterogeneity and contention
Modeling with threads rather than only trace replay → ability to simulate dynamic applications
Portable, open source and easily extendable

The same approach could be applied to any task-based runtime system.


Devised Workflow: StarPU + SimGrid

Calibration: run StarPU natively once to collect a performance profile (run once!)
Simulation: replay the application with StarPU on top of SimGrid using that profile (quickly simulate many times)

Implementation Principles

Emulation: executing real applications in a synthetic environment
Simulation: replacing process execution by delays based on performance models

StarPU applications and the runtime system are emulated → similar scheduling
Thread synchronizations, actual computations, memory allocations and data transfers are simulated → need for good computational kernel and communication models

The control part of StarPU is modified to inject runtime system, communication and computation delays into SimGrid

Modeling Runtime System

Simulation delays (increasing simulated time) for:
Process synchronizations
Memory allocations on CPU or GPU
Submission of data transfer requests

Example for CUDA memory allocation in StarPU:

    ...
    #ifdef STARPU_SIMGRID
        MSG_process_sleep((float) dim * alloc_cost_per_byte);
    #else
        if (_starpu_can_submit_cuda_task()) {
            cudaError_t cures;
            cures = cudaHostAlloc(A, dim, cudaHostAllocPortable);
    ...

Modeling Communication

Components of hybrid platforms have differing characteristics
Correctly modeling their communication is of primary importance
Built on exhaustively validated existing SimGrid network models
Flexible flow-based contention model

(Figure: candidate platform models connecting the CPU and the GPUs: (a) fatpipe (crude) model, (b) complete graph (pragmatic) model, (c) treelike (elaborated) model)

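As a rough intuition for what a flow-based contention model computes, the toy C sketch below estimates transfer times when several concurrent flows share a link's bandwidth equally. This is only an illustration under simplifying assumptions; SimGrid's validated fluid models perform a far more refined bandwidth sharing than this equal split.

    /* Toy flow-based contention model (illustration only, not SimGrid's model):
     * n concurrent flows share a link equally, so one transfer takes
     * latency + size / (bandwidth / n). */
    #include <stdio.h>

    static double transfer_time(double size_bytes, double latency_s,
                                double bandwidth_Bps, int concurrent_flows)
    {
        double share = bandwidth_Bps / (double) concurrent_flows;
        return latency_s + size_bytes / share;
    }

    int main(void)
    {
        /* One 64 MiB tile over an 8 GB/s link, alone vs. with 3 rival flows. */
        double size = 64.0 * 1024 * 1024;
        printf("alone  : %.4f s\n", transfer_time(size, 1e-5, 8e9, 1));
        printf("4 flows: %.4f s\n", transfer_time(size, 1e-5, 8e9, 4));
        return 0;
    }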

Modeling Computation

Actual computation results are irrelevant → only the computation time matters
Task = kernel in the task-based paradigm
Execution of tasks is replaced by simulation delays
Average duration for regular kernels
Additional techniques to optionally account for variability

(Kernels: GEMM, SYRK, TRSM, POTRF)
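To illustrate the principle, the C sketch below shows how a kernel execution can degenerate into a simulated-time delay taken from the calibrated model when simulation support is compiled in. The function and structure names are assumptions made for illustration, not actual StarPU internals; only the MSG_process_sleep call and the STARPU_SIMGRID guard come from the slides.

    #ifdef STARPU_SIMGRID
    #include <simgrid/msg.h>        /* header path depends on the SimGrid version */
    #endif

    /* Calibrated timing model of one kernel (illustrative structure). */
    struct kernel_model {
        double mean_duration;       /* average duration measured during calibration, in seconds */
    };

    /* Run a task: natively execute the kernel, or merely advance simulated time. */
    static void run_task(const struct kernel_model *model,
                         void (*real_kernel)(void *), void *args)
    {
    #ifdef STARPU_SIMGRID
        (void) real_kernel; (void) args;
        MSG_process_sleep(model->mean_duration);   /* only the duration matters */
    #else
        real_kernel(args);                         /* actual computation */
    #endif
    }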

Overview of Simulation Accuracy

(Figure: native vs. SimGrid performance in GFlop/s as a function of the matrix dimension (20K–80K), for Cholesky and LU factorizations, on Hannibal: 3 QuadroFX5800, Attila: 3 TeslaC2050, Mirage: 3 TeslaM2070, Conan: 3 TeslaM2075, Frogkepler: 2 K20, Pilipili2: 2 K40 and Idgraf: 8 TeslaC2050)

Publications
[1] L. Stanisic, S. Thibault, A. Legrand, B. Videau, and J.-F. Méhaut. Faithful Performance Prediction of a Dynamic Task-Based Runtime System for Heterogeneous Multi-Core Architectures. Concurrency and Computation: Practice and Experience, page 16, May 2015.
[2] L. Stanisic, S. Thibault, A. Legrand, B. Videau, and J.-F. Méhaut. Modeling and Simulation of a Dynamic Task-Based Runtime System for Heterogeneous Multi-core Architectures. In Euro-Par 2014, 20th International Conference on Parallel Processing, LNCS 8632, pages 50–62, Porto, Portugal, Aug. 2014.


Outline

1. Simulating Task-based Applications
   Coupling StarPU Runtime System and SimGrid Simulator
   Tackling Irregular Numerical Codes
   Use Cases
   Summary
2. Methodology for Reproducible Research
3. Conclusion

Difference Between Regular and Irregular Kernels

Regular kernels always use the same block size → duration is (relatively) stable
Irregular kernel durations depend on their input parameters → simple average values are not enough

(Figure: histograms of the number of occurrences vs. duration, from a native execution, for the Do_subtree, Activate, Panel, Update, Assemble and Deactivate kernels)

Modeling Duration of Complex Computation Kernels

Using statistical analysis and multiple linear regression
Extended StarPU to automatically support such models
Kernel duration estimations useful for both simulation and native executions (scheduling)

(Figure: native vs. SimGrid duration histograms for the Do_subtree, Activate, Panel, Update, Assemble and Deactivate kernels)
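To make the idea concrete, a multiple linear regression model predicts a kernel's duration as a linear combination of features derived from its input parameters (for instance, tile sizes). The C sketch below is purely illustrative: the feature choices and coefficient values are assumptions, not the model actually used in StarPU.

    /* Illustrative linear-regression duration model (assumed features and
     * coefficients, not StarPU's actual model):
     * duration = beta0 + beta1*f1 + beta2*f2 + ... for features f_i derived
     * from the kernel's input parameters, with the betas fitted offline. */
    #include <stdio.h>

    #define NFEATURES 3

    struct lr_model {
        double intercept;
        double beta[NFEATURES];
    };

    /* Hypothetical features for a panel factorization of an m x n block:
     * f0 = m*n, f1 = m*n*n, f2 = n. */
    static double predict_duration(const struct lr_model *m, double rows, double cols)
    {
        double f[NFEATURES] = { rows * cols, rows * cols * cols, cols };
        double d = m->intercept;
        for (int i = 0; i < NFEATURES; i++)
            d += m->beta[i] * f[i];
        return d;   /* predicted duration, e.g. in milliseconds */
    }

    int main(void)
    {
        /* Coefficients would come from fitting calibration traces; these are dummies. */
        struct lr_model panel = { .intercept = 0.12, .beta = { 2.1e-7, 3.4e-9, 1.5e-3 } };
        printf("predicted Panel duration: %.3f ms\n", predict_duration(&panel, 512, 128));
        return 0;
    }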

Simulating Irregular Numerical Libraries

Chameleon solver: dense linear algebra library
qr_mumps solver: MUMPS multi-frontal factorization
ScalFMM library: simulating N-body interactions using the FMM
QDWH solver: QR-based Dynamically Weighted Halley (ongoing)

(Figures: native vs. SimGrid durations [s] for qr_mumps on a set of sparse matrices and for ScalFMM with 2M to 64M particles)

Publications
[3] L. Stanisic, E. Agullo, A. Buttari, A. Guermouche, A. Legrand, F. Lopez, and B. Videau. Fast and Accurate Simulation of Multithreaded Sparse Linear Algebra Solvers. Parallel and Distributed Systems (ICPADS), Dec. 2015.
[4] E. Agullo, B. Bramas, O. Coulaud, L. Stanisic, and S. Thibault. Modeling Irregular Kernels of Task-based Codes: Illustration with the Fast Multipole Method. Submitted to Transactions on Mathematical Software (TOMS); Research Report 9036, Inria Bordeaux, 35 pages, Feb. 2017.

Outline

1. Simulating Task-based Applications
   Coupling StarPU Runtime System and SimGrid Simulator
   Tackling Irregular Numerical Codes
   Use Cases
   Summary
2. Methodology for Reproducible Research
3. Conclusion

Comparing Different Schedulers

Differences between scheduler performances are faithfully predicted
DMDAR and DMDAS are locality-aware schedulers → fewer transfers between GPUs

(Figure: native vs. SimGrid GFlop/s as a function of the matrix dimension (20K–80K) for the DMDA, DMDAR and DMDAS schedulers)

Studying Memory Consumption

Minimizing the memory footprint is very important for such applications
Remember that scheduling is dynamic, so consecutive native experiments have different outputs

(Figures: allocated memory [GiB] over time [ms] for four consecutive native experiments, and for three native runs compared with a SimGrid run)

Extrapolating to Larger Machines

Predicting performance in an idealized context
Studying the parallelization limits of the problem

(Figure: extrapolation of native vs. SimGrid duration [s] as the number of threads grows from 4 to 400)

Outline

1. Simulating Task-based Applications
   Coupling StarPU Runtime System and SimGrid Simulator
   Tackling Irregular Numerical Codes
   Use Cases
   Summary
2. Methodology for Reproducible Research
3. Conclusion

Achievements

Works great for small hybrid setups with dense, sparse and FMM StarPU applications
Not only a prototype, already used by other researchers

Our solution makes it possible to:
Debug applications on a commodity laptop in a reproducible way
Detect problems with real experiments using reliable comparison
Test different scheduling alternatives
Evaluate memory footprint
Quickly and accurately evaluate the impact of various scheduling/application parameters:

qr_mumps   Cores   RAM        Evaluation   Makespan
Native     40      58.0 GiB   157 s        141 s
SimGrid    1       1.5 GiB    57 s         142 s

Outline

1. Simulating Task-based Applications
   Coupling StarPU Runtime System and SimGrid Simulator
   Tackling Irregular Numerical Codes
   Use Cases
   Summary
2. Methodology for Reproducible Research
3. Conclusion

Challenges of Experimental Studies in HPC

Large, hybrid, prototype hardware/software (hard to control)
Costly experiments with numerous parameters
Non-deterministic executions (overall duration, traces, ...)
Workflows specific to the studies (hardly applicable in general)

→ difficult to make research results reproducible

Reproducible Research: Trying to Bridge the Gap

(Diagram, inspired by Roger D. Peng's lecture on reproducible research, May 2014: Scientific Question → Protocol (Design of Experiments) → Experiment Code (workload injector, VM recipes, ...) → Measured Data → Processing Code → Analytic Data → Analysis Code → Computational Results → Presentation Code → Figures, Tables, Numerical Summaries → Text → Published Article. The author works forward from nature/the system to the article; the reader only starts from the published article.)

Our approach: use a lightweight combination of existing generic tools


Experiments

Full control of the design of experiments
Automate the process
Gather as much useful meta-data as possible for each experiment

(Figure: mind map of the meta-data gathered per experiment: time, experiment plan, memory allocation, operating system, sequence order, repetitions, element type, allocation technique, scheduling priority, CPU frequency, core pinning, dedication, optimization, loop unrolling, Intel, ARM, cycles, size, stride, architecture, compilation, kernel, bandwidth)

Publication
[5] L. Stanisic, L. M. Schnorr, A. Degomme, F. Heinrich, A. Legrand, and B. Videau. Characterizing the Performance of Modern Architectures Through Opaque Benchmarks: Pitfalls Learned the Hard Way. Submitted to the International Workshop on Reproducibility in Parallel Computing (REPPAR), 2017.

Analysis

Write papers/reports with a completely reproducible analysis
Rely on literate programming tools (IPython/Jupyter, Org-mode)
Modular scripting approach (shell + R)

(Figure: example analysis output: Gantt charts of CPU and CUDA resources executing dgemm, dpotrf, dsyrk and dtrsm tasks, with idle/sleeping time, critical paths, per-resource utilization percentages and per-iteration task counts)

Publication
[6] V. G. Pinto, L. Stanisic, A. Legrand, L. M. Schnorr, and S. Thibault. Analyzing Dynamic Task-Based Applications on Hybrid Platforms: An Agile Scripting Approach. 3rd Workshop on Visual Performance Analysis (VPA), Nov. 2016, Salt Lake City, United States.

Workflow for the Whole Research Process

Documentation and experimentation journal (laboratory notebook)
Unique Git branching system for better project history

(Figure: Git branch layout with src and data branches, experiment branches xp/foo1 and xp/foo2, and an article branch art/art1)

Publications
[7] L. Stanisic, A. Legrand, and V. Danjean. An Effective Git And Org-Mode Based Workflow For Reproducible Research. ACM SIGOPS Operating Systems Review, 49:61–70, 2015. Special Topic: Repeatability and Sharing of Experimental Artifacts.
[8] L. Stanisic and A. Legrand. Effective Reproducible Research with Org-Mode and Git. In 1st International Workshop on Reproducibility in Parallel Computing, Porto, Portugal, Aug. 2014.

Achievements

Design:
Original approach based on well-known tools
Helps fill the author/reader gap in our context
Applicable and extendable to other research fields

Application:
Used this approach for many studies, presentations and papers
Efficiently handled ≈10,000 experiments (40 GiB) and ≈2,000 commits

Evangelism:
Our closest colleagues are successfully adopting this approach
Presented our methods on numerous occasions (RR webinar, conferences, workshops, ANR project meetings, ...)

Outline

1. Simulating Task-based Applications
   Coupling StarPU Runtime System and SimGrid Simulator
   Tackling Irregular Numerical Codes
   Use Cases
   Summary
2. Methodology for Reproducible Research
3. Conclusion


Experience

Modeling, simulation and performance evaluation

Methodology for reproducible research

Statistical analysis, visualizations

Code and performance debugging and optimizations

Working with large, hybrid, prototype hardware and software

Contributions to many large code projects:

StarPU (C), SimGrid (C/C++), qr_mumps (C/Fortran), ScalFMM (C++), Chameleon (C/Fortran)


Summary

Regular algorithms
Dynamic task-based HPC applications
Research methodology
Benchmarks, basic modeling
Numerical (irregular) libraries
Performance optimization, large scale executions
Real-life applications, collaboration with other domain experts

(Timeline spanning 2013–2019)

Thank you!

http://mescal.imag.fr/membres/luka.stanisic/


Ongoing Research: Multiple Nodes

StarPU-MPI + SimGrid for large-scale distributed-memory studies

Requires combining two modules of the SimGrid framework → technically challenging, internals need to be rewritten
Large number of resources, kernels and communications in parallel → simulator performance needs to be optimized
Multiple network models (PCI bus and Ethernet/InfiniBand) → contention is harder to model