

An Effective Dynamic Scheduling Runtime and Tuning System for Heterogeneous Multi- and Many-Core Desktop Platforms

Authors: Alécio P. D. Binotto, Carlos E. Pereira, Arjan Kuijper, André Stork, and Dieter W. Fellner

Presenter: ytchen, 2012.09.19

Outline
• Introduction
• Motivation
• System
• Experiment results
• Related work
• Conclusion


Introduction
• High-performance platforms are commonly required for scientific and engineering algorithms that must deal appropriately with timing constraints.
• Both computation time and overall performance need to be optimized.
• Efficiency matters both for huge domain sizes and for small problems.

Introduction
• Our dynamic scheduling method combines a first assignment phase for a set of high-level tasks (e.g., algorithms), based on a pre-processing benchmark that acquires basic performance samples of the tasks on the PUs, with a runtime phase that obtains real performance measurements of the tasks and feeds a performance database.


Motivation
• 3D Computational Fluid Dynamics (CFD) involves large computations:
  o velocity field
  o local pressure
• Examples:
  o planes
  o cars

Motivation
• Three iterative solvers for systems of linear equations (SLEs): Jacobi, Red-Black Gauss-Seidel, and Conjugate Gradient.
  o Jacobi: an iterative method for solving a system of linear equations; it converges when the matrix is diagonally dominant, i.e., the absolute value of each diagonal element dominates the other entries in its row.
  o Red-Black Gauss-Seidel: an iterative method used to solve a linear system of equations resulting from the finite-difference discretization of partial differential equations.
  o Conjugate Gradient: an algorithm for the numerical solution of particular systems of linear equations, namely those whose matrix is symmetric and positive-definite.
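As a minimal sketch of the first of these solvers (not the paper's OpenCL implementation), the Jacobi method updates every unknown from the previous iterate using only the off-diagonal entries of its row:

```python
def jacobi(A, b, x0=None, iters=200, tol=1e-10):
    """Solve A x = b by Jacobi iteration; A as a list of rows.
    Converges when A is diagonally dominant."""
    n = len(b)
    x = [0.0] * n if x0 is None else list(x0)
    for _ in range(iters):
        x_new = []
        for i in range(n):
            # Sum of off-diagonal contributions using the OLD iterate only.
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x_new.append((b[i] - s) / A[i][i])
        if max(abs(x_new[i] - x[i]) for i in range(n)) < tol:
            return x_new
        x = x_new
    return x
```

Because each component update reads only the previous iterate, the inner loop is embarrassingly parallel, which is what makes the method attractive on GPUs.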


System overview
• Unit of Allocation (UA): each unit of work submitted to the scheduler is represented as a task.

Platform-Independent Programming Model
• OpenCL
• In its basic principle, the API encapsulates implementations of a task (methods, algorithms, parts of code, etc.) for different PUs, leveraging intrinsic hardware features while keeping the tasks platform independent.
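As a rough illustration of this encapsulation idea (this is not the authors' API; all names here are hypothetical), a task can carry one implementation per PU type and be dispatched to whichever PU the scheduler selects:

```python
class Task:
    """One logical task holding one implementation per processing-unit type."""

    def __init__(self, name):
        self.name = name
        self.impls = {}  # PU type -> callable

    def register(self, pu_type, fn):
        self.impls[pu_type] = fn

    def run(self, pu_type, *args):
        # Dispatch to the implementation for the PU the scheduler chose.
        return self.impls[pu_type](*args)


saxpy = Task("saxpy")
saxpy.register("cpu", lambda a, x, y: [a * xi + yi for xi, yi in zip(x, y)])
# On a real system this entry would launch an OpenCL kernel instead:
saxpy.register("gpu", lambda a, x, y: [a * xi + yi for xi, yi in zip(x, y)])
```

The caller never needs to know which PU runs the task; the scheduler picks the `pu_type` argument.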

Profiler and Database
• The profiler monitors tasks' execution times and stores them, together with the tasks' characteristics, in a timing performance database.
• Recorded characteristics include input data (size and type) and data transfers between PUs, among others.

Profiler and Database
• Performance is measured on the host (CPU) by counting clock ticks, which intrinsically accounts for data-transfer times between the CPU and the PU, possible initialization and synchronization times on the PUs, and latency.
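A minimal sketch of such host-side profiling (the database layout and function names are assumptions, not the paper's): timing the whole call on the host clock automatically folds transfers, initialization, and synchronization into the measurement.

```python
import time


def profile_run(db, task, pu, size, fn, *args):
    """Execute fn and time it on the host clock; the measured span therefore
    includes any transfer/initialization/synchronization the call performs."""
    t0 = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - t0
    # Key the database by (task, PU, domain size) for later lookups.
    db.setdefault((task, pu, size), []).append(elapsed)
    return result


db = {}
profile_run(db, "jacobi", "gpu", 1024, sum, range(1024))
```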

Dynamic Scheduler
• First, it establishes an initial scheduling guess over the PUs when the application(s) start.
  o First Assignment Phase (FAP)
• Second, for every newly arriving task, it performs a scheduling step by consulting the timing database.
  o Runtime Assignment Phase (RAP)

First Assignment Phase (FAP)
• Given a set of tasks with predefined costs for the PUs stored in the database, the first assignment phase schedules the tasks over the asymmetric PUs.
• The goal is the lowest total execution time, where:
  o m: the number of PUs (here m = 2)
  o n: the number of considered tasks
  o i: task index
  o j: processor index
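One simple way to realize this objective (a sketch under the assumption that "lowest total execution time" means the parallel completion time; this is an exhaustive baseline, not the paper's heuristic ALG.2):

```python
from itertools import product


def fap_exhaustive(cost):
    """cost[i][j]: predicted time of task i on PU j (from the benchmark database).
    Returns the assignment minimizing the makespan over all m^n assignments."""
    n, m = len(cost), len(cost[0])
    best_assign, best_time = None, float("inf")
    for assign in product(range(m), repeat=n):
        loads = [0.0] * m  # accumulated work per PU
        for i, j in enumerate(assign):
            loads[j] += cost[i][j]
        if max(loads) < best_time:
            best_assign, best_time = assign, max(loads)
    return best_assign, best_time
```

Exhaustive search is only feasible for small n (it is the "Optimal" baseline in the experiments); for larger task sets a heuristic is needed.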


Runtime Assignment Phase (RAP)
• The arrival of new tasks is modeled as a FIFO (First In, First Out) queue.
• Assignment reconfiguration: tasks that were already scheduled but not yet executed change their assignment if doing so promotes a performance gain.
• When there is no database entry for a task with a specific domain size, the lookup function retrieves the data from the task with the most similar domain size.
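The nearest-size fallback can be sketched as follows (the database layout and function name are hypothetical, matching the profiling sketch above only by assumption):

```python
def lookup_time(db, task, pu, size):
    """Recorded time for (task, pu, size); if that exact domain size is
    missing, fall back to the entry with the most similar size."""
    if (task, pu, size) in db:
        return db[(task, pu, size)]
    # Collect all sizes recorded for this task on this PU.
    sizes = [s for (t, p, s) in db if t == task and p == pu]
    if not sizes:
        return None  # no data at all: the scheduler must guess
    nearest = min(sizes, key=lambda s: abs(s - size))
    return db[(task, pu, nearest)]


db = {("jacobi", "gpu", 128): 1.2, ("jacobi", "gpu", 512): 3.9}
```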


Experiment results
• Domain sizes and execution costs of the tasks on the PUs

Experiment results
• Comparison of allocation heuristics
  o PU indices: 0 = GPU, 1 = CPU

Experiment results
• Overhead of the dynamic scheduling using ALG.2 and its gain in comparison to scheduling all tasks to the GPU

Experiment results
• Scheduling techniques for 24 tasks
  o Overhead: the time to perform the scheduling
  o Solve time: the execution time to compute the tasks
  o Total time: overhead + solve time
  o Error: the total time of a technique in comparison to the optimal solution without its overhead
    • e.g., (7660 - 6130) / 6130 ≈ 25%
  o Optimal: exhaustive search
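The error metric above is a plain relative difference against the optimal total time; for the example values on the slide:

```python
total_time = 7660  # total time of the evaluated technique
optimal = 6130     # exhaustive-search total time, without overhead

error = (total_time - optimal) / optimal
print(f"{error:.1%}")  # prints 25.0%
```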

Experiment results
• Scheduling 24 tasks in the FAP + 42 tasks arriving in the RAP


Related work
• Distributed processing on a CPU-GPU platform
• Scheduling on a CPU-GPU platform
  o HEFT (Heterogeneous Earliest Finish Time)

Related work: comparison with StarPU

                   StarPU                  this paper
  execution model  codelets                OpenCL
  method           low-level               high-level
  motivation       matrix multiplication   CFD
  system           runtime system          scheduling database


Conclusion
• This paper presents a context-aware runtime and tuning system aimed at reducing the execution time of engineering applications.
• We combined a model for a first scheduling pass, based on an off-line performance benchmark, with a runtime model that keeps track of the real execution time of the tasks, with the goal of extending the scheduling process of OpenCL.

Conclusion
• We achieved an execution-time gain of 21.77% in comparison to the static assignment of all tasks to the GPU, with a scheduling error of only 0.25% compared to exhaustive search.

Thanks for listening!