1
Our system is a combinaon of server- side and client-side navigaon which aims to provide the most efficient navi- gaon in the context of smart cies. The routes are provided with regard to a global opmum for a city. This is a complex task which requires the com- putaonal power of the HPC infrastructure. Efficient operaon of the system is supported by the custom domain-specific lan- guage (DSL) LARA and the autotuning framework being developed within the ANTAREX project. Probabilisc Time-Dependent Travel Time Computaon algorithm has been selected for the demonstraon of the DSL and autotuning framework. The input for the algorithm is a departure me for a selected route represented as a line of road segments. The algorithm provides esmated proba- bility distribuon of the travel me along the specified route. The computaon is based on the Monte Carlo simulaon. It randomly selects probabil- ity speed profiles on road segments and computes travel me at the end of the route. Number of random samples greatly affects precision of the result. Many simulaons have to be executed in order to sasfy demand for the precise results of the simula- on in the context of the smart city. Self-adapve Navigaon System Instrumentaon via DSL Autotuning Integraon aspectdef MeasureExecTime { input fname, store, end select function.call{name==fname} end apply insert.before %{auto startTime = std::chrono::high_resolution_clock::now();}%; if(store != undefined) { insert.after ‘[[store]] = std::chrono::duration_cast<std::chrono::millisecon ds>(std::chrono::high_resolution_clock::now() - startTime).count();’; } else { insert.after std::chrono::duration_cast<std::chrono::milliseco nds>(std::chrono::high_resolution_clock::now() - startTime).count();’; } end } LARA ASPECT LARA ASPECT aspectdef ObserveScalability { input fname, inc, max_threads, runs end select function.call{name==fname} end apply insert.before %{‘ for (int threads = 1; threads <= [[max_threads]]; threads += [[inc]]) { omp_set_num_threads(threads); std::vector<long> times([[runs]]); std::vector<int> results([[runs]]); for(int r = 0; r < runs; r++) { }% insert.replace ‘results[r] = [[$call.statement]]’; insert.after %{‘ } int sum = 0; for (int r = 0; r < [[runs]]; ++r) { sum += times[r]; } float average = (sum/runs); } }% end aspectdef MeasureScalability { Call MeasureExecTime(“times[r]”); Call ObserveScalability (“RunMonteCarloAll”, 1, 16, 100); } INSERT OPENMP PRAGMA aspectdef OMPParFors { input fname end select function{name==fname}.loop end apply if($loop.is_parfor && $loop.is_outermost) { insert.before ‘#pragma omp parallel for’; } end } std::vector<float> RunMonteCarloAll(int samples, int secs,) { ... #pragma omp parallel for for (int s = 0; s < samples; ++s) { for (int d = 0; d < 7; ++d) { for (int i = 0; i < intervalsPerDay; ++i) { viRngUniform(VSL_RNG_METH_UNIF_STD, rndStream, probsSize, probs,0, INDEX_RESOL); travelTimes[(d * intervalsPerDay * samples) + (i * samples) + s] = GetRandomTravelTime(secs, probs); } } } ... INSTRUMENTED CODE EXPOSE SOFTWARE KNOBS aspectdef OMPCtrThreads { input fname, knob=“threads” end select function.call{name==fname} end apply insert.before %{omp_set_dynamic(0); omp_set_num_threads([[$knob]]);}%; } end } INTEGRATE AUTOTUNER aspectdef AutoTuneThreads { input fname end select function.call{name==“omp_set_num_threads”} end apply insert.before ‘Margo::MC::update([[$call.arg]]);; end select function{name==fname} end apply insert.before ‘Margo::MC::start_monitor();; insert.after ‘Margo::MC::end_monitor();; end } ... omp_set_dynamic(0); Margo::MC::update(&threads); omp_set_num_threads(threads); Margo::MC::start_monitor(); RunMonteCarloAll(segments, samples); Margo::MC::end_monitor(); ... INSTRUMENTED GOALS AND PARAMETERS <!-- SW-KNOB SECTION --> <knob name="num_threads" var_name="threads" var_type="int”/> <!-- GOAL SECTION --> <goal name=“MC_responseTime_goal” monitor=“MC_responseTime_monitor" dFun="Average" cFun="LT" value=“MC_SLA” /> <!-- OPTIMIZATION SECTION --> <state name="power_optimized" starting="yes" > <minimize combination="linear"> <metric name=“MC_avg_power" coef="1.0"/> </minimize> <subject to=“MC_responseTime” metric_name="MC_responseTime" priority="1" /> </state> DSL and Autotuning Tools for Code Opmisaon on HPC Inspired by Navigaon Use Case Jan Marnovič, Kateřina Slaninová and Marn Golasowski IT4Innovaons, VŠB - Technical University of Ostrava Radim Cmar Sygic Gianluca Palermo, Davide Gadioli, and Crisna Silvano DEIB, Politecnico di Milano Joao M. P. Cardoso, and Joao Bispo FEUP, University of Porto AutoTuning and Adapvity appRoach for Energy efficient eXascale HPC systems Scalability measurement of the simulaon code instrumented by the DSL No manual code changes are required. Performance of the code under different execuon condions is measured au- tomacally by combinaon of several LARA aspects. The condions are defined by several parameters: Number of OpenMP threads Thread scheduling strategy (compact, scaer) Cache usage strategy Scalability on OpenMP threads Single compute node, 2 x Intel® Xeon E5-2680v3 @ 2.5GHz, 12cores The main goal of the ANTAREX project is to provide a breakthrough approach to express by a Domain Specific Language the applicaon self-adapvity and to runme manage and autotune applicaons for green and heterogeneous High Performance Compung (HPC) systems up to the Exascale level. Dynamic self-monitoring and self-adapvity or «autotuning» HPC applicaons with respect to changing work- loads, operang condions and compung resources Programming models and languages to express self-adapvity and extra-funconal properes Introducing a separaon of concerns between extra-funconal strategies (self-adapvity, parallelisaon, energy/thermal management) and applicaon funconality by the design of a new aspect-oriented Do- main Specific Language Exploing heterogeneous compung resources in Green HPC plaorms by runme resource and power manage- ment ANTAREX Project Goals The domain specific language (DSL) and toolflow approach proposed in the ANTAREX project can pro- vide the mechanisms needed for addressing properly the problems previously described. The DSL be- ing developed is based on the LARA DSL and allows to specify strategies for code instrumentaon and code transformaons, including the required code adaptaon for dynamic autotuning. Integraon of mechanisms and strategies to on-line adapt the applicaon behaviour and the plaorm con- figuraon with respect to changing workloads, operang condions and compung resources. Searching the best combinaon of the soſtware knobs impacng the given met- rics (for example execuon me, energy consumpon, etc.) Definion of the goals to drive the opmi- zaon of the execuon environment (for example to comply with the target SLA) Thread scheduling strategy Intel® Xeon Phi 7210P @ 1.2 GHz, 61 cores Cache usage strategy Intel® Xeon Phi 7210P @ 1.2 GHz, 61 cores Apart from the instrumentaon, the aspects can be used to define the opmizaon logic which would select the best environment pa- rameters (knobs) according to selected metric (speedup, energy consumpon, etc.). Library of different instrumentaon strategies can be created, providing various useful anal- yses of the code performance. The main ad- vantage of this approach is separaon of the strategies (including instrumentaon and code transformaons) from the actual source code. The strategies can be programmed as generic and applicable to wide range of different source codes. INSTRUMENTED SOURCE CODE for (int threads = 1; threads <= max_threads; threads += increment) { omp_set_num_threads(threads); std::vector<long> times(runs); std::vector<int> results(runs); for(int r = 0; r < runs; r++) { auto startTime = std::chrono::high_resolution_clock::now(); results[r] = RunMonteCarloAll(segments, samples)[0]; times[r] = std::chrono::duration_cast<std::chrono::milliseconds> (std::chrono::high_resolution_clock::now() - startTime).count(); } int sum = 0; for (int r = 0; r < runs; ++r) { sum += times[r]; } float average = (sum/runs); } MeasureExecTime ObserveScalability AUTOTUNING Toolset workflow ANTAREX is supported by the EU H2020 FET-HPC program under grant 671623.

Instrumentation via DSL Self adaptive Navigation System ...jbispo/pubs/SC2016_poster.pdf · route represented as a line of road segments. The algorithm provides estimated proba-bility

  • Upload
    others

  • View
    9

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Instrumentation via DSL Self adaptive Navigation System ...jbispo/pubs/SC2016_poster.pdf · route represented as a line of road segments. The algorithm provides estimated proba-bility

Our system is a combination of server-

side and client-side navigation which

aims to provide the most efficient navi-

gation in the context of smart cities.

The routes are provided with regard to

a global optimum for a city. This is a

complex task which requires the com-

putational power of the HPC infrastructure. Efficient operation of

the system is supported by the custom domain-specific lan-

guage (DSL) LARA and the autotuning framework being developed within the ANTAREX project.

Probabilistic Time-Dependent Travel Time Computation algorithm has been selected for the demonstration of the DSL and

autotuning framework. The input for the algorithm is a departure time for a selected

route represented as a line of road segments. The algorithm provides estimated proba-

bility distribution of the travel time along the specified route.

The computation is based on the Monte Carlo simulation. It randomly selects probabil-

ity speed profiles on road segments and computes travel time at the end of the route.

Number of random samples

greatly affects precision of

the result. Many simulations

have to be executed in order

to satisfy demand for the

precise results of the simula-

tion in the context of the

smart city.

Self-adaptive Navigation System Instrumentation via DSL Autotuning Integration

aspectdef MeasureExecTime { input fname, store, end select function.call{name==fname} end apply insert.before %{auto startTime = std::chrono::high_resolution_clock::now();}%; if(store != undefined) { insert.after ‘[[store]] = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - startTime).count();’;

} else { insert.after ‘std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - startTime).count();’; } end }

LARA ASPECT LARA ASPECT aspectdef ObserveScalability { input fname, inc, max_threads, runs end select function.call{name==fname} end apply insert.before %{‘ for (int threads = 1; threads <= [[max_threads]]; threads += [[inc]]) { omp_set_num_threads(threads); std::vector<long> times([[runs]]); std::vector<int> results([[runs]]); for(int r = 0; r < runs; r++) { }% insert.replace ‘results[r] = [[$call.statement]]’; insert.after %{‘ } int sum = 0; for (int r = 0; r < [[runs]]; ++r) { sum += times[r]; } float average = (sum/runs); } }% end

aspectdef MeasureScalability {

Call MeasureExecTime(“times[r]”);

Call ObserveScalability(“RunMonteCarloAll”, 1, 16, 100);

}

INSERT OPENMP PRAGMA aspectdef OMPParFors { input fname end select function{name==fname}.loop end apply

if($loop.is_parfor && $loop.is_outermost) { insert.before ‘#pragma omp parallel for’; } end }

std::vector<float> RunMonteCarloAll(int samples, int secs,) { ... #pragma omp parallel for for (int s = 0; s < samples; ++s) { for (int d = 0; d < 7; ++d) { for (int i = 0; i < intervalsPerDay; ++i) { viRngUniform(VSL_RNG_METH_UNIF_STD, rndStream, probsSize, probs,0, INDEX_RESOL); travelTimes[(d * intervalsPerDay * samples) + (i * samples) + s] = GetRandomTravelTime(secs, probs); } } } ...

INSTRUMENTED CODE

EXPOSE SOFTWARE KNOBS aspectdef OMPCtrThreads { input fname, knob=“threads” end select function.call{name==fname} end apply insert.before %{omp_set_dynamic(0); omp_set_num_threads([[$knob]]);}%; } end }

INTEGRATE AUTOTUNER aspectdef AutoTuneThreads { input fname end select function.call{name==“omp_set_num_threads”} end apply insert.before ‘Margo::MC::update([[$call.arg]]);’; end select function{name==fname} end apply insert.before ‘Margo::MC::start_monitor();’; insert.after ‘Margo::MC::end_monitor();’; end }

... omp_set_dynamic(0); Margo::MC::update(&threads); omp_set_num_threads(threads); Margo::MC::start_monitor(); RunMonteCarloAll(segments, samples); Margo::MC::end_monitor(); ...

INSTRUMENTED

GOALS AND PARAMETERS <!-- SW-KNOB SECTION --> <knob name="num_threads" var_name="threads" var_type="int”/> <!-- GOAL SECTION --> <goal name=“MC_responseTime_goal” monitor=“MC_responseTime_monitor" dFun="Average" cFun="LT" value=“MC_SLA” /> <!-- OPTIMIZATION SECTION --> <state name="power_optimized" starting="yes" > <minimize combination="linear"> <metric name=“MC_avg_power" coef="1.0"/> </minimize> <subject to=“MC_responseTime” metric_name="MC_responseTime" priority="1" /> </state>

DSL and Autotuning Tools for Code Optimisation

on HPC Inspired by Navigation Use Case

Jan Martinovič, Kateřina Slaninová

and Martin Golasowski IT4Innovations,

VŠB - Technical University of Ostrava

Radim Cmar Sygic Gianluca Palermo, Davide Gadioli,

and Cristina Silvano

DEIB, Politecnico di Milano

Joao M. P. Cardoso, and Joao Bispo

FEUP, University of Porto

AutoTuning and Adaptivity appRoach for

Energy efficient eXascale HPC systems

Scalability measurement of the simulation code instrumented by the DSL

No manual code changes are required. Performance of the code under different execution conditions is measured au-tomatically by combination of several LARA aspects. The conditions are defined by several parameters:

Number of OpenMP threads

Thread scheduling strategy (compact, scatter)

Cache usage strategy

Scalability on OpenMP threads Single compute node, 2 x Intel® Xeon E5-2680v3 @ 2.5GHz, 12cores

The main goal of the ANTAREX project is to provide a breakthrough approach to express by a Domain Specific Language

the application self-adaptivity and to runtime manage and autotune applications for green and heterogeneous High

Performance Computing (HPC) systems up to the Exascale level.

Dynamic self-monitoring and self-adaptivity or «autotuning» HPC applications with respect to changing work-

loads, operating conditions and computing resources

Programming models and languages to express self-adaptivity and extra-functional properties

Introducing a separation of concerns between extra-functional strategies (self-adaptivity, parallelisation,

energy/thermal management) and application functionality by the design of a new aspect-oriented Do-

main Specific Language Exploiting heterogeneous computing resources in Green HPC platforms by runtime resource and power manage-

ment

ANTAREX Project Goals

The domain specific language (DSL) and toolflow approach proposed in the ANTAREX project can pro-

vide the mechanisms needed for addressing properly the problems previously described. The DSL be-

ing developed is based on the LARA DSL and allows to specify strategies for code instrumentation and

code transformations, including the required code adaptation for dynamic autotuning.

Integration of mechanisms and strategies to on-line adapt the application behaviour and the platform con-

figuration with respect to changing workloads, operating conditions and computing resources.

Searching the best combination of the

software knobs impacting the given met-

rics (for example execution time, energy

consumption, etc.)

Definition of the goals to drive the optimi-

zation of the execution environment (for

example to comply with the target SLA)

Thread scheduling strategy Intel® Xeon Phi 7210P @ 1.2 GHz, 61 cores

Cache usage strategy Intel® Xeon Phi 7210P @ 1.2 GHz, 61 cores

Apart from the instrumentation, the aspects

can be used to define the optimization logic

which would select the best environment pa-

rameters (knobs) according to selected metric

(speedup, energy consumption, etc.).

Library of different instrumentation strategies

can be created, providing various useful anal-

yses of the code performance. The main ad-

vantage of this approach is separation of the

strategies (including instrumentation and code

transformations) from the actual source code.

The strategies can be programmed as generic

and applicable to wide range of different

source codes.

INSTRUMENTED SOURCE CODE

for (int threads = 1; threads <= max_threads; threads += increment) { omp_set_num_threads(threads); std::vector<long> times(runs); std::vector<int> results(runs); for(int r = 0; r < runs; r++) { auto startTime = std::chrono::high_resolution_clock::now();

results[r] = RunMonteCarloAll(segments, samples)[0]; times[r] = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - startTime).count(); } int sum = 0; for (int r = 0; r < runs; ++r) { sum += times[r]; } float average = (sum/runs); }

MeasureExecTime

ObserveScalability

AUTOTUNING

Toolset workflow

ANTAREX is supported by the EU H2020 FET-HPC program under grant 671623.