



Dynamic task migration in HPC for Exascale challenge

M. Rodríguez Pascual, J.A. Moríñigo, R. Mayo-García

Centro de Investigaciones Energéticas, Medioambientales y Tecnológicas CIEMAT. Madrid, Spain

Reduce energy consumption by hibernating idle nodes

What are we doing?

New possibilities in scheduling policies

RES 11th Users' Conference, 28-29 Sept. 2017, Santiago de Compostela

- Implement full support for checkpoint/restart in Slurm through the use of DMTCP, a checkpointing library

- Create dynamic scheduling algorithms

- Create tools for system administration

- We can migrate serial, MPI, and hybrid MPI+OpenMP codes

- The whole process is transparent to end users

Available possibilities

Migration of single jobs

How are we doing it? Why are we doing it?

Migration process:
- Checkpoint the running job
- Copy the image to the destination node
- Resume the job there
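The three migration steps can be sketched as the command sequence a migration controller would issue. A minimal dry-run illustration in Python; the host names, coordinator port, and image path are hypothetical, and while `dmtcp_command` and `dmtcp_restart` are DMTCP's standard CLI tools, exact flags may differ between DMTCP versions:

```python
# Sketch of the three-step migration: checkpoint, copy, resume.
# Hosts, port, and paths below are placeholders for the example.

def migration_commands(coord_host, coord_port, dest_node, image_dir):
    """Return the shell commands for one job migration."""
    return [
        # 1. Ask the DMTCP coordinator to checkpoint the running job.
        f"dmtcp_command --coord-host {coord_host} "
        f"--coord-port {coord_port} --checkpoint",
        # 2. Copy the checkpoint image(s) to the destination node.
        f"scp {image_dir}/ckpt_*.dmtcp {dest_node}:{image_dir}/",
        # 3. Restart the job from the image(s) on the destination node.
        f"ssh {dest_node} dmtcp_restart --coord-host {coord_host} "
        f"--coord-port {coord_port} {image_dir}/ckpt_*.dmtcp",
    ]

if __name__ == "__main__":
    for cmd in migration_commands("frontend", 7779, "node42", "/scratch/job123"):
        print(cmd)
```

In the real system these steps are driven by Slurm rather than issued by hand, so the sketch only shows the ordering of operations.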

Software Stack:

- Resource manager: SLURM
- MPI library: MVAPICH
- OpenMP
- Checkpointing mechanism: DMTCP
- Container: Docker

Because it is great!

We want to provide a whole set of new tools for HPC management

This solution can:

- Provide fault tolerance, essential for long-running and highly parallel applications in the exascale era

- Increase performance: spread tasks across the cluster to maximize disk and network availability

- Reduce energy consumption: concentrate tasks in part of the cluster so the rest remains idle and can be hibernated or powered down

- Reduce communication overhead: place tasks that communicate close to each other

Altogether, this allows creating sophisticated scheduling policies
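As an illustration, the energy-saving policy above (concentrate tasks so freed nodes can be hibernated) can be sketched as a greedy consolidation step. This is a minimal sketch, not the scheduler actually implemented: node names, the per-node capacity, and the first-fit heuristic are assumptions for the example.

```python
# Greedy consolidation sketch: migrate tasks off lightly loaded nodes so
# whole nodes become idle and can be hibernated. A real scheduler would
# also weigh migration cost, data locality, and fault tolerance.

def consolidate(load, capacity):
    """load: {node: number of running tasks}; mutated in place.
    Returns (migrations as (src, dst, n_tasks), nodes left idle)."""
    # Try to empty the least-loaded nodes first.
    nodes = sorted(load, key=lambda n: load[n])
    migrations, idle = [], []
    for src in nodes:
        if src in idle or load[src] == 0:
            continue
        # First-fit: move tasks onto the most-loaded nodes with room.
        targets = sorted((n for n in load if n != src and n not in idle),
                         key=lambda n: load[n], reverse=True)
        plan, need = [], load[src]
        for dst in targets:
            room = capacity - load[dst]
            if room <= 0:
                continue
            moved = min(room, need)
            plan.append((src, dst, moved))
            need -= moved
            if need == 0:
                break
        if need == 0:  # migrate only if the node can be fully emptied
            for _, dst, moved in plan:
                load[dst] += moved
            load[src] = 0
            migrations.extend(plan)
            idle.append(src)
    return migrations, idle
```

For example, with `load = {"n1": 1, "n2": 3, "n3": 4}` and capacity 4, the single task on `n1` moves to `n2`, leaving `n1` idle and eligible for hibernation.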

We expect it to have an impact on the next generation of HPC systems

Open Problems

- Ensure scalability on very large parallel jobs

- Migration on heterogeneous infrastructures

- Provide support for Docker with DMTCP


Migration of parallel jobs

Migration of software containers

Future work

- Create a profiling tool able to monitor memory usage.

- Provide support for GPU and Xeon Phi

- Support heterogeneous clusters

- Inter-cluster job migration

Reduce MPI network overhead

Increase data locality

Maximize performance

This work was supported by the COST Action NESUS (IC1305) and partially funded by the Spanish Ministry of Economy, Industry and Competitiveness project CODEC2 (TIN2015-63562-R) with FEDER funds, and by the EU H2020 project HPC4E (grant agreement no. 689772).