

An Effective Dynamic Scheduling Runtime and Tuning System for Heterogeneous Multi- and Many-Core Desktop Platforms

Authors: Alécio P. D. Binotto, Carlos E. Pereira, Arjan Kuijper, André Stork, and Dieter W. Fellner

Presenter: ytchen, 2012.09.19

Outline
• Introduction
• Motivation
• System
• Experiment results
• Related work
• Conclusion


Introduction
• High-performance platforms are commonly required for scientific and engineering algorithms that must deal appropriately with timing constraints.
• Both computation time and overall performance need to be optimized.
• Efficiency matters both for huge domain sizes and for small problems.

Introduction
• Our dynamic scheduling method combines a first assignment phase for a set of high-level tasks (e.g., algorithms), based on a pre-processing benchmark that acquires basic performance samples of the tasks on the PUs, with a runtime phase that obtains real performance measurements of the tasks and feeds a performance database.


Motivation
• 3D Computational Fluid Dynamics (CFD) involves large computations:
  o velocity field
  o local pressure
• Examples:
  o planes
  o cars

Motivation
• Three iterative solvers for systems of linear equations (SLEs): Jacobi, Red-Black Gauss-Seidel, and Conjugate Gradient.
  o Jacobi: an iterative method for solving a system of linear equations; it converges when the matrix is diagonally dominant, i.e., the absolute value of each diagonal element dominates the other entries in its row.
  o Red-Black Gauss-Seidel: an iterative method used to solve a linear system of equations resulting from the finite-difference discretization of partial differential equations.
  o Conjugate Gradient: an algorithm for the numerical solution of particular systems of linear equations, namely those whose matrix is symmetric and positive-definite.
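As a minimal sketch of the first of these solvers (not the paper's OpenCL implementation), the Jacobi method updates every unknown from the previous iterate using only the off-diagonal entries of its row:

```python
def jacobi(A, b, x0=None, iters=200, tol=1e-10):
    """Solve A x = b by Jacobi iteration; A as a list of rows.
    Converges when A is diagonally dominant."""
    n = len(b)
    x = [0.0] * n if x0 is None else list(x0)
    for _ in range(iters):
        x_new = []
        for i in range(n):
            # Sum of off-diagonal contributions using the OLD iterate only.
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x_new.append((b[i] - s) / A[i][i])
        if max(abs(x_new[i] - x[i]) for i in range(n)) < tol:
            return x_new
        x = x_new
    return x
```

Because each component update reads only the previous iterate, the inner loop is embarrassingly parallel, which is what makes the method attractive on GPUs.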


System overview
• Unit of Allocation (UA): each unit of work submitted to the scheduler is represented as a task.

Platform-Independent Programming Model
• OpenCL
• In its basic principle, the API encapsulates implementations of a task (methods, algorithms, parts of code, etc.) for different PUs, leveraging intrinsic hardware features while keeping the tasks platform independent.
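As a rough illustration of this encapsulation idea (this is not the authors' API; all names here are hypothetical), a task can carry one implementation per PU type and be dispatched to whichever PU the scheduler selects:

```python
class Task:
    """One logical task holding one implementation per processing-unit type."""

    def __init__(self, name):
        self.name = name
        self.impls = {}  # PU type -> callable

    def register(self, pu_type, fn):
        self.impls[pu_type] = fn

    def run(self, pu_type, *args):
        # Dispatch to the implementation for the PU the scheduler chose.
        return self.impls[pu_type](*args)


saxpy = Task("saxpy")
saxpy.register("cpu", lambda a, x, y: [a * xi + yi for xi, yi in zip(x, y)])
# On a real system this entry would launch an OpenCL kernel instead:
saxpy.register("gpu", lambda a, x, y: [a * xi + yi for xi, yi in zip(x, y)])
```

The caller never needs to know which PU runs the task; the scheduler picks the `pu_type` argument.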

Profiler and Database
• The profiler monitors tasks' execution times and stores them, together with the tasks' characteristics, in a timing performance database.
• Recorded characteristics include input data (size and type) and data transfers between PUs, among others.

Profiler and Database
• Performance is measured on the host (CPU) by counting clock ticks, which intrinsically accounts for data-transfer times between the CPU and the PU, possible initialization and synchronization times on the PUs, and latency.
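A minimal sketch of such host-side profiling (the database layout and function names are assumptions, not the paper's): timing the whole call on the host clock automatically folds transfers, initialization, and synchronization into the measurement.

```python
import time


def profile_run(db, task, pu, size, fn, *args):
    """Execute fn and time it on the host clock; the measured span therefore
    includes any transfer/initialization/synchronization the call performs."""
    t0 = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - t0
    # Key the database by (task, PU, domain size) for later lookups.
    db.setdefault((task, pu, size), []).append(elapsed)
    return result


db = {}
profile_run(db, "jacobi", "gpu", 1024, sum, range(1024))
```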

Dynamic Scheduler
• First, it establishes an initial scheduling guess over the PUs when the application(s) start.
  o First Assignment Phase (FAP)
• Second, for every newly arriving task, it performs a scheduling step by consulting the timing database.
  o Runtime Assignment Phase (RAP)

First Assignment Phase (FAP)
• Given a set of tasks with predefined costs for the PUs stored in the database, the first assignment phase schedules the tasks over the asymmetric PUs.
• The goal is the lowest total execution time, where:
  o m: the number of PUs (here m = 2)
  o n: the number of considered tasks
  o i: task index
  o j: processor index
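One simple way to realize this objective (a sketch under the assumption that "lowest total execution time" means the parallel completion time; this is an exhaustive baseline, not the paper's heuristic ALG.2):

```python
from itertools import product


def fap_exhaustive(cost):
    """cost[i][j]: predicted time of task i on PU j (from the benchmark database).
    Returns the assignment minimizing the makespan over all m^n assignments."""
    n, m = len(cost), len(cost[0])
    best_assign, best_time = None, float("inf")
    for assign in product(range(m), repeat=n):
        loads = [0.0] * m  # accumulated work per PU
        for i, j in enumerate(assign):
            loads[j] += cost[i][j]
        if max(loads) < best_time:
            best_assign, best_time = assign, max(loads)
    return best_assign, best_time
```

Exhaustive search is only feasible for small n (it is the "Optimal" baseline in the experiments); for larger task sets a heuristic is needed.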


Runtime Assignment Phase (RAP)
• The arrival of new tasks is modeled as a FIFO (First In, First Out) queue.
• Assignment reconfiguration: tasks that were already scheduled but not yet executed change their assignment if doing so promotes a performance gain.
• When there is no database entry for a task with a specific domain size, the lookup function retrieves the data from the task with the most similar domain size.
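The nearest-size fallback can be sketched as follows (the database layout and function name are hypothetical, matching the profiling sketch above only by assumption):

```python
def lookup_time(db, task, pu, size):
    """Recorded time for (task, pu, size); if that exact domain size is
    missing, fall back to the entry with the most similar size."""
    if (task, pu, size) in db:
        return db[(task, pu, size)]
    # Collect all sizes recorded for this task on this PU.
    sizes = [s for (t, p, s) in db if t == task and p == pu]
    if not sizes:
        return None  # no data at all: the scheduler must guess
    nearest = min(sizes, key=lambda s: abs(s - size))
    return db[(task, pu, nearest)]


db = {("jacobi", "gpu", 128): 1.2, ("jacobi", "gpu", 512): 3.9}
```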


Experiment results
• Domain sizes and execution costs of the tasks on the PUs

Experiment results
• Comparison of allocation heuristics
  o PU indices: 0 = GPU, 1 = CPU

Experiment results
• Overhead of the dynamic scheduling using ALG.2 and its gain in comparison to scheduling all tasks to the GPU

Experiment results
• Scheduling techniques for 24 tasks
  o Overhead: the time to perform the scheduling
  o Solve time: the execution time to compute the tasks
  o Total time: overhead + solve time
  o Error: the total time of a technique in comparison to the optimal solution without its overhead
    • e.g., (7660 - 6130) / 6130 ≈ 25%
  o Optimal: exhaustive search
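The error metric above is a plain relative difference against the optimal total time; for the example values on the slide:

```python
total_time = 7660  # total time of the evaluated technique
optimal = 6130     # exhaustive-search total time, without overhead

error = (total_time - optimal) / optimal
print(f"{error:.1%}")  # prints 25.0%
```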

Experiment results
• Scheduling 24 tasks in the FAP + 42 tasks arriving in the RAP


Related work
• Distributed processing on a CPU-GPU platform
• Scheduling on a CPU-GPU platform
  o HEFT (Heterogeneous Earliest Finish Time)

Related work: comparison with StarPU

                   StarPU                  this paper
  execution model  codelets                OpenCL
  method           low-level               high-level
  motivation       matrix multiplication   CFD
  system           runtime system          scheduling database


Conclusion
• This paper presents a context-aware runtime and tuning system aimed at reducing the execution time of engineering applications.
• We combined a model for a first scheduling pass, based on an off-line performance benchmark, with a runtime model that keeps track of the real execution time of the tasks, with the goal of extending the scheduling process of OpenCL.

Conclusion
• We achieved an execution-time gain of 21.77% in comparison to the static assignment of all tasks to the GPU, with a scheduling error of only 0.25% compared to exhaustive search.

Thanks for listening!