Contributions for Resource and Job Management inHigh Performance Computing
Yiannis Georgiou
Introduction
High Performance Computing is defined by:

Infrastructures
- Supercomputers, Clusters, Grids, Peer-to-Peer systems and, lately, Clouds

Applications
- Climate prediction, protein folding, crash simulation, high-energy physics, astrophysics, animation for movie and video game productions

System Software
- Operating systems, runtime systems, resource management, I/O systems, interfacing to external environments
Introduction
High Performance Computing Evolutions

Computational infrastructures

Advances in computer architecture (Top500 supercomputer sites):
- multiprocessor technologies: 85% quad-core (2010), up from 1% dual-core (2005)
- use of clusters: 80% (2010), up from 42% (2003)
- peak performance, 1st position: from GigaFlops (1997) to PetaFlops (2008)
- increase of energy consumption: average 397 kW (2010), up from 257 kW (2008)

Scientific Applications
- need for more computing power
- quality of services
- deep parallelism
- fault tolerance

What is the impact of those evolutions upon the Resource and Job Management System?
Introduction
Resource and Job Management
The goal of a Resource and Job Management System (RJMS) is to satisfy users' demands for computation and to assign user jobs upon the computational resources in an efficient manner.

RJMS Importance

Strategic position but complex internals:
- direct and constant knowledge of resources and jobs
- multifaceted procedures with complex internal functions
Introduction
Study of Resource and Job Management Systems
Studying the RJMS
- How can the Resource and Job Management deal with the new technological evolutions and the needs of applications?

Experimentation methodology to evaluate the RJMS
- How can we evaluate the performance of a RJMS and its internal functions?

Motivations for system exploitation improvements
- Logs of real parallel workloads show important un-utilization periods (e.g. the LLNL Atlas cluster, with 9,216 cores, showed 64.1% system utilization during 8 months).

Improving the RJMS
- How can we obtain beneficial system exploitation through a Resource and Job Management System by taking advantage of the system's un-utilized resources?
Introduction
Plan
Introduction
Resource and Job Management Systems analysis
  Concepts and State of the art
  RJMS different approaches
  Functionalities Comparison

Experimental Methodology for RJMS evaluation
  Real-Scale Experimentation with Workload Injection
  Performance Evaluation of some opensource RJMS

Approaches for System Exploitation Improvements
  Optimizing resources exploitation with Malleability techniques
  Improving system utilization in a lightweight grid context
  Energy Efficient Management techniques

Conclusions and Perspectives

Appendix - References
Resource and Job Management Systems analysis
Resource and Job Management Systems analysis Concepts and State of the art
RJMS abstraction layers
This assignment involves three principal abstraction layers:
- the Job Management layer
- the Scheduling layer
- and the Resource Management layer
Research Challenges
Network topological constraints
Internal node topological constraints
- Hardware affinities: multiprocessors, memory hierarchies...

Heterogeneity
- Resources (multicore SMPs, NUMA, GPUs) and infrastructures (Clusters, Multiclusters, Grids, Clouds, Desktop Grids...)

Energy Efficiency
- Needs to be taken into account: RJMS knowledge of resources and workloads

Resource Management
- New techniques and algorithms (e.g. best-fit algorithms, CPUSETs, fault-tolerance techniques...)

Job Management
- New concepts and expressions (e.g. types of jobs, interfacing APIs...)

Scheduling
- Larger pool of possibilities; needs optimized logics and algorithms
- Advanced scheduling policies (e.g. gang scheduling, preemption...)
Resource and Job Management Systems analysis RJMS different approaches
Used Resource and Job Management Systems
Open Source RJMS
- SLURM
- TORQUE
- OAR
- MAUI
- CONDOR
- SGE (before Oracle)

Commercial RJMS
- Loadleveler
- LSF
- MOAB
- PBSPro
- OGE (Oracle Grid Engine)

Contribution: RJMS comparison study
- Quantifiable functionalities evaluation of opensource and commercial RJMS
- Experimental performance evaluation of opensource RJMS
SLURM ...towards scalability
Designed for scalability
- one central controller and a daemon upon each computing node
- daemons upon all nodes for secure authentication (munged)
- management of resources with partitions (groups with specific characteristics)
- general-purpose plug-in mechanism

Special features
- highly scalable launching and scheduling algorithms
- best-fit topology-aware (network and internal node) scheduling
- advanced scheduling policies (preemption, gang scheduling)
- evolving jobs support (shrink in size)

Facts
- Highly scalable but less flexible to support different environments.
- Used in 40% of the most powerful production systems in the TOP500.
OAR ...towards versatility
Design based on high-level components
- Relational database (MySQL/PostgreSQL) as the kernel to store and exchange data
- Script languages (Perl, Ruby) for the execution engine and modules
- No daemon on computing nodes; use of GNU/Linux components: SSH, CPUSETs, Taktuk

Special features
- Hierarchical management of resources and request expressions
- Heterogeneous resources (e.g. licences, storage capacity, network capacity...)
- RESTful API for interfacing with external environments (ISV applications, web portals, grids...)
- Multiple task types (e.g. besteffort, environment deployment, moldable)

Facts
- Highly versatile with short development cycles, but not scalable enough.
- Used in production (e.g. Grid5000, Ciment) and in research (e.g. Green-Net, DSL-Lab)
Resource and Job Management Systems analysis Functionalities Comparison
RJMS Quantifiable Functionalities Comparison
Quantifying functionalities support by RJMS
- Resource Management: resources treatment, job launching, task placement, high availability...
- Job Management: job declaration, job control, monitoring, interfacing, quality of services...
- Scheduling: scheduling algorithms, queues management, advanced reservations...

Overall Evaluation / RJMS Software    SLURM  TORQUE  OAR  SGE  MAUI  LSF  ...
Overall Resource Management (/10)      7.1    5.2    6.9  5.9  1.9   6.9  ...
Overall Job Management (/10)           5.1    5.1    5.5  6    3.1   6.8  ...
Scheduling (/10)                       6      3      5.7  5.7  5.5   5.7  ...
Overall Evaluation Points (/10)        6.2    5.7    5.1  6    5.9   6.4  ...
Experimental Methodology for RJMS evaluation
Motivations for new experimental methodology
Experimental methodologies for HPC
- large number of parameters and conditions
- simulators or emulators are valuable but not enough
- need for real-scale experimentation

Workload modelling
- Performance evaluation by executing a sequence of jobs.
- Two common ways to use a workload for system evaluation:
  1. either a workload log (trace)
  2. or a workload model (a synthetic workload like ESP)

Contribution: developed RJMS evaluation methodology
- Real-scale experimentation with workload injection
Experimental Methodology for RJMS evaluation Real-Scale Experimentation with Workload Injection
Experimental Methodology - General Principles
1. Real-scale experimentation upon dedicated platforms (like Grid5000): control and reproduction of experiments.
2. Injection of characterised workloads (like the ESP benchmark) to observe the behaviour of the RJMS under particular conditions.
3. Extraction of the produced workload trace and post-treatment analysis of the results.
Grid5000 Experimental platform
- Grid5000 experimental grid [BOLZE'06]: a large-scale distributed platform that can be easily controlled, reconfigured and monitored.
- Deep reconfiguration mechanisms, so that each user can use exactly the environment needed for their experiments.

[Figure: the reconfigurable stack - applications, environment/middleware, OS (Linux, FreeBSD, ...), hardware/network - with layers specifiable and configurable per experiment]

https://www.grid5000.fr/
ESP Benchmark
- provides a quantitative evaluation of launching and scheduling via a single metric [1]
- complete independence from the hardware performance

ESP Efficiency = Theoretic Duration / Measured Duration    (1)

Job Type  Fraction of job size      Job size for a 512-core  Count of job  Target run time
          relative to total system  cluster (in cores)       instances     (seconds)
A         0.03125                   16                       75            267
B         0.06250                   32                       9             322
...       ...                       ...                      ...           ...
L         0.12500                   64                       36            366
M         0.25000                   128                      15            187
Z         1.00000                   512                      2             100
Total                                                        230

Table: ESP benchmark [KRAMER'08] characteristics (case for a 512-core cluster)

http://www.nersc.gov/projects/esp.php
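Equation (1) can be checked directly against the table: the theoretic duration is the total demanded work (fraction of system x instance count x target run time, summed over job types), and the efficiency is its ratio to the measured makespan. A minimal sketch, using only the table rows shown above (the full ESP mix contains more job types, so the total below is illustrative):

```python
def theoretic_duration(job_mix):
    """Lower bound on the makespan: total work divided by system capacity.
    `job_mix` holds (fraction of system, instance count, target run time)
    tuples, as in the ESP table above."""
    return sum(frac * count * runtime for frac, count, runtime in job_mix)

def esp_efficiency(job_mix, measured_duration):
    """ESP Efficiency = Theoretic Duration / Measured Duration (Eq. 1)."""
    return theoretic_duration(job_mix) / measured_duration

# Subset of the 512-core ESP mix shown in the table (types A, B, L, M, Z).
mix = [
    (0.03125, 75, 267),  # type A: 16 cores
    (0.06250, 9, 322),   # type B: 32 cores
    (0.12500, 36, 366),  # type L: 64 cores
    (0.25000, 15, 187),  # type M: 128 cores
    (1.00000, 2, 100),   # type Z: full system, 512 cores
]

print(round(esp_efficiency(mix, measured_duration=5000), 3))  # 0.671
```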
Proposed TOPO-ESP-NAS Benchmark
- Modification of the default ESP benchmark for topology-aware scheduling evaluation
- Use of the NAS benchmarks: the CG application is the most communication sensitive.
- No fixed target run times for the executed applications
- The theoretic duration of Eq. (2) is calculated considering the ideal execution times for each different class of jobs

TOPO-ESP-NAS Efficiency = Theoretic Duration (Ideal Topology) / Measured Duration    (2)
Steps for RJMS Experimentation

Developed Xionee: https://gforge.inria.fr/projects/xionee/
A collection of tools to manage (inject, extract and analyze) workload traces.
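A workload injector such as Xionee replays a trace by submitting each job to the RJMS at its recorded submission time. A toy sketch of the replay loop (the `submit` callback is a placeholder for the RJMS's real submission command, e.g. sbatch or oarsub; a real injector would sleep between submissions instead of advancing a virtual clock):

```python
def inject_workload(trace, submit):
    """Replay a workload trace: call submit(job) at each job's recorded
    submission instant. `trace` is a list of dicts with 'submit_time'
    and 'cores' keys (a simplified stand-in for SWF-style workload logs)."""
    log = []
    now = 0.0
    for job in sorted(trace, key=lambda j: j["submit_time"]):
        # A real injector sleeps until job["submit_time"]; here we just
        # advance a virtual clock so the sketch stays testable.
        now = max(now, job["submit_time"])
        submit(job)
        log.append((now, job["cores"]))
    return log

trace = [
    {"submit_time": 5.0, "cores": 32},
    {"submit_time": 0.0, "cores": 16},
]
print(inject_workload(trace, submit=lambda job: None))  # [(0.0, 16), (5.0, 32)]
```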
Experimental Methodology for RJMS evaluation Performance Evaluation of some opensource RJMS
RJMS ESP Efficiency Comparisons
Experimental Testbed details
- Dedicated cluster: 1 RJMS central controller, 8 computing nodes - 64 cores (dual-CPU quad-core Intel Xeon E5420 2.5 GHz, 8 GB RAM, InfiniBand 20G).

RJMS / policy      OAR            SLURM   TORQUE+Maui
backfill           83.7%          83.9%   83.1%
preemption         Not Supported  84.9%   85.4%
gang-scheduling    Not Supported  94.8%   Not Supported

Table: ESP benchmark efficiency percentage for different RJMS and scheduling policies - OAR, SLURM and TORQUE+Maui experiments upon a cluster of 64 resources

Results Analysis
- gang scheduling gives the best performance, allowing the efficient filling of all the 'holes' in the scheduling space
- however, this is due to the simplicity of the particular application: suspend/resume happens in memory, no swapping is needed
- preemption performs better than backfill due to the 2 higher-priority "all resources" jobs (overhead for backfilling, direct execution with preemption).
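The backfilling behaviour discussed above can be illustrated with a very small EASY-backfill-style sketch (illustrative only - not the SLURM or Maui implementation, and omitting the runtime-based reservation that protects the blocked head job from delay):

```python
def easy_backfill(queue, free_cores):
    """Tiny EASY-backfill-style sketch. `queue` is FIFO-ordered jobs as
    (name, cores) tuples. The head job starts if it fits; otherwise it
    blocks and smaller jobs behind it may be backfilled into the holes."""
    started = []
    if not queue:
        return started, free_cores
    head_name, head_cores = queue[0]
    if head_cores <= free_cores:        # head job fits: start it
        started.append(head_name)
        free_cores -= head_cores
    # Try to backfill the remaining jobs into whatever is left free.
    for name, cores in queue[1:]:
        if cores <= free_cores:
            started.append(name)
            free_cores -= cores
    return started, free_cores

# 64-core cluster with 32 cores free: the 48-core head job blocks,
# while the smaller jobs behind it are backfilled.
started, free = easy_backfill([("A", 48), ("B", 16), ("C", 8)], free_cores=32)
print(started, free)  # ['B', 'C'] 8
```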
Scaling the size of the cluster: ESP Efficiency
Experimental Testbed details
- Dedicated cluster: 1 SLURM central controller, 288 computing nodes - 9216 cores (quad-CPU octo-core Intel Xeon 7500, InfiniBand).
- Part of the BULL-CEA Tera 100 cluster during its installation-testing phase (1.25 PetaFlops theoretical peak performance).

Cluster size (number of cores)                         512     9216
Average jobs waiting time (sec)                        2766    2919
Total workload execution time (sec)                    12992   13099
ESP efficiency for SLURM backfill+preemption policy    82.9%   82.3%

[Figures: system utilization (cores in use over time, with job start time impulses) for the ESP synthetic workload under SLURM, on 512 cores and on 9216 cores]
Scaling the size of the cluster: Jobs Waiting Time
[Figure: cumulative distribution function of job waiting times for the ESP benchmark under the SLURM backfill+preemption scheduler, comparing the 512-core (SLURM512) and 9216-core (SLURM9216) runs]
Scaling the number of submitted jobs: Backfill vs. Preemption policy

- Submission burst of small-granularity jobs (1 core per job) for simple sleep executions
- 2nd case without preemption between jobs
- Degradation problems with backfill+preemption

[Figure: instant throughput for 7000 submitted jobs (1 core each) upon a 10240-core cluster, backfill+preemption mode]
Network Topology Aware Placement Evaluations
Experimental Testbed details
- Dedicated cluster: 1 SLURM central controller, 128 computing nodes - 4096 cores (quad-CPU octo-core Intel Xeon 7500).
- Network topological constraints: 2 different islands, 64 nodes per island; higher bandwidth within the same island.

Topo-ESP-NAS results           Theoretic (ideal)  4096 cores,         4096 cores,
                                                  not topology-aware  topology-aware
Total execution time (sec)     12227              17518               16985
Average wait time (sec)        -                  4575                4617
Average execution time (sec)   -                  502                 483
Efficiency for Topo-ESP-NAS    100%               69.8%               72.0%
Jobs on 1 island               228                165                 183
Jobs on 2 islands              2                  65                  47

Table: TOPO-ESP-NAS benchmark results for a 4096-resource cluster
Approaches for System Exploitation Improvements
Motivations for exploitation of un-utilized resources
- Systems may present important under-utilization periods due to:
  - job interarrival times and cancelation rates
  - the complexity of workloads, with variations in resource demands and execution times
- Issue: the high dynamicity of user workloads may result in bad scheduling performance, producing long-waiting jobs and big turnaround times.

Workload Trace   From      Until     Months  CPUs   Jobs     Users  Utilization %
LANL O2K         Nov 1999  Apr 2000  5       2,048  121,989  337    64.0
OSC Cluster      Jan 2000  Nov 2001  22      57     80,714   254    43.1
SDSC BLUE        Apr 2000  Jan 2003  32      1,152  250,440  468    76.2
HPC2N            Jul 2002  Jan 2006  42      240    527,371  258    72.0
SDSC DataStar    Mar 2004  Apr 2005  13      1,664  96,089   460    62.8
LPC EGEE         Aug 2004  May 2005  9       140    244,821  57     20.8
LLNL uBGL        Nov 2006  Jun 2007  7       2,048  112,611  62     56.1
LLNL Atlas       Nov 2006  Jun 2007  8       9,216  60,332   132    64.1
LLNL Thunder     Jan 2007  Jun 2007  5       4,008  128,662  283    87.6

Table: Logs of Real Parallel Workloads from Production Systems [FEITELSON'02]
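Utilization figures like those in the table can be recomputed from a trace: sum each job's cores x runtime and divide by the cluster's capacity over the observed span. A minimal sketch, assuming jobs are given as (start, runtime, cores) tuples:

```python
def system_utilization(jobs, total_cores):
    """Fraction of available core-seconds actually consumed.
    `jobs` is a list of (start_time, runtime, cores) tuples; the
    observation window runs from the earliest start to the latest
    completion in the trace."""
    if not jobs:
        return 0.0
    span_start = min(start for start, _, _ in jobs)
    span_end = max(start + runtime for start, runtime, _ in jobs)
    used = sum(runtime * cores for _, runtime, cores in jobs)
    return used / ((span_end - span_start) * total_cores)

# Two jobs on a 64-core cluster over a 100-second window.
jobs = [(0, 100, 32), (0, 50, 32)]
print(system_utilization(jobs, total_cores=64))  # 0.75
```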
Taking advantage of the clusters' unutilized resources...

1. ...through dynamic jobs: adapting dynamic jobs to resource availabilities
2. ...through a lightweight grid: aggregating unutilized cluster resources for use in global computing
3. ...for energy efficiency: allowing unutilized resources to be powered off in order to reduce the clusters' energy consumption
Approaches for System Exploitation Improvements Optimizing resources exploitation with Malleability techniques
Background Information
Who decides    When decided:
               At submission   During execution
Application    Rigid           Evolving
System         Moldable        Malleable

Table: Classification of parallel applications [FEITELSON'96]

Motivations for malleability support upon an RJMS
- a lot of research made in theory:
  - programming dynamic applications
  - RJMS and dynamicity
  - communication protocols between RJMS and applications
- ...but few adaptive parallel applications exist in the real world
- and limited RJMS support for dynamic applications
Implementation details

Malleability support implementation upon the OAR RJMS
- the malleability automaton for decision making
- a resource discovery command
- a communication protocol
- scheduling of only one malleable job is allowed at a time
- a normal job for the rigid part, to guarantee termination
- the malleability worker for the management of dynamic operations
- Best Effort jobs for adaptability and non-interference with the cluster

Malleability support: one architecture, two techniques
- Dynamic MPI
- Dynamic CPUSET Mapping
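The decision step of such a malleability automaton can be sketched as follows (an illustrative toy, not OAR's actual automaton): the rigid part is never touched, while the best-effort part grows to absorb free cores and shrinks when the normal workload reclaims them.

```python
def resize_decision(current_extra, free_cores):
    """Grow/shrink decision for the best-effort part of a malleable job
    (toy sketch). `current_extra` is the number of best-effort cores
    currently held; `free_cores` is what the resource-discovery command
    reports as unused by the normal workload."""
    target = max(0, free_cores)
    if target > current_extra:
        return ("grow", target - current_extra)
    if target < current_extra:
        return ("shrink", current_extra - target)
    return ("steady", 0)

print(resize_decision(current_extra=8, free_cores=20))  # ('grow', 12)
print(resize_decision(current_extra=8, free_cores=2))   # ('shrink', 6)
```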
Dynamic MPI technique
[Figure: growing and shrinking of a malleable job across 4 cluster nodes of 4 cores each; legend: application process, rigid job (malleable), best-effort job (malleable), rigid job (external)]

Details
- supports only malleable applications written with the MPI_Comm_spawn primitives of the MPI-2 norm
- lib-dynamicMPI [CERA'06]: a library for communication between the MPI_Comm_spawn calls and OAR
- LAM-MPI implementation of MPI-2 (lamgrow/lamshrink)
Dynamic CPUSET Mapping
[Figure: growing and shrinking of a malleable job within one node's 4 cores; legend as in the Dynamic MPI figure]

Details
- CPUSETs: Linux kernel objects for partitioning a multiprocessor machine by creating execution areas.
- on-the-fly manipulation of the amount of allocated cores per node.
- works with every parallel application
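In practice, the on-the-fly manipulation amounts to rewriting the `cpuset.cpus` file of the job's cpuset (mount points vary by kernel and cgroup setup; `/sys/fs/cgroup/cpuset/...` is one common layout, and the job directory name below is hypothetical). A sketch that computes the kernel's compact core-list syntax, with the privileged write left commented out:

```python
def cpuset_cpus_string(cores):
    """Render a set of core ids in the compact 'cpuset.cpus' list
    syntax, e.g. {0, 1, 2, 5} -> '0-2,5'."""
    cores = sorted(cores)
    ranges, start = [], None
    for i, c in enumerate(cores):
        if start is None:
            start = c
        if i + 1 == len(cores) or cores[i + 1] != c + 1:
            ranges.append(str(start) if start == c else f"{start}-{c}")
            start = None
    return ",".join(ranges)

def grow_cpuset(job_cpuset_dir, current, extra):
    """Return the enlarged core set and the string that would be written
    to <job_cpuset_dir>/cpuset.cpus (write commented out in this sketch)."""
    new = set(current) | set(extra)
    value = cpuset_cpus_string(new)
    # with open(f"{job_cpuset_dir}/cpuset.cpus", "w") as f:  # needs root
    #     f.write(value)
    return new, value

# 'oar_job42' is a hypothetical cpuset directory name.
new, value = grow_cpuset("/sys/fs/cgroup/cpuset/oar_job42", {0, 1}, {2, 3, 6})
print(value)  # 0-3,6
```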
Real-Scale Experimentation Testbed
Infrastructure and Applications Characteristics

Infrastructure:
- Dedicated cluster (Grid5000): 1 OAR server and 32 computing nodes (IBM System x3455 with dual-CPU dual-core AMD Opteron 2218 (2.6 GHz / 2 MB L2 cache / 800 MHz) and 4 GB memory per node, with Gigabit Ethernet).

Applications:
- Mandelbrot, reprogrammed as malleable using the MPI_Comm_spawn primitives of the MPI-2 norm, for Dynamic MPI.
- NAS class C benchmarks with their MPI 3.3 implementation: BT (Block Tridiagonal solver) and CG (Conjugate Gradient) for Dynamic CPUSET Mapping.

Experiment details
- Workload trace injection from the DAS-2 clusters (a slice of 5 hours with 40% cluster utilization), representing the normal workload of the cluster.
- One malleable job at a time is submitted and runs upon the free resources, i.e. those not used by the normal workload.
- The dynamic malleable approaches are compared with static moldable-besteffort job submission (a moldable job that starts its execution upon all free cluster resources and remains unchanged until the end of its execution).
Malleability evaluation
[Figure: malleable jobs executing the BT application upon the free resources of the normal workload - cores used over the 5-hour run]

[Figure: moldable-besteffort jobs executing the BT application upon the free resources of the normal workload - cores used over the 5-hour run]

Results in percentages
- Normal workload utilization (injected): 40%
- Malleable jobs utilization (measured): 57%
- Moldable-besteffort jobs utilization (measured): 32%
- Gain: 25% of system utilization

Future Work
- Extend the prototype to support multiple malleable jobs
- Deeper experimentation of shared-memory contexts for CPUSET Mapping
Approaches for System Exploitation Improvements Improving system utilization in a lightweight grid context
Motivations for system utilization improvements
Context
- The lightweight grid CIGRI: aggregation of the clusters' un-utilized resources for bag-of-tasks applications, through local OAR besteffort jobs

Related Work
- different from mainstream grids like Globus...
- similar to alternative desktop grids / volunteer computing like Condor, OurGrid, BOINC (Seti@home)...

Problematic
- Improving system utilization (more jobs can be executed in case of un-utilized resources)...
- but resource volatility...
- renders killed jobs' computation useless: they restart from the beginning
Developments and Future Research
Developments
- Improve beneficial system utilization through a checkpoint/restart technique.
- Checkpointing strategies: periodic and triggered (with a grace-time delay)
- System-level checkpoint/restart
- Evaluation of real-scale grid scenarios

[Figure: grid utilization of CIGRI besteffort jobs for 5 h and 60% local cluster workload (triggered checkpoints) - cluster utilization (#jobs x #CPUs x sec) per job outcome (Total, Terminated, Running, Error, Fail.NotValuable, Fail.Valuable) across five clusters (4 x 32 nodes, 1 x 72 nodes), against the maximum utilization potential for one to five clusters]

Results and Future Research
Bad results for periodic, better results for triggered checkpoints:
- Big overhead of system-level checkpointing; the best approach is application-level checkpointing
- Calculation of the grace-time delay by pre-estimating the checkpointing time.

Real-scale grid experimentation is complicated and energy consuming:
- Simulated and virtualized experiments before the real-scale experimentation
Approaches for System Exploitation Improvements Energy Efficient Management techniques
Motivations for Energy Efficient RJMS
Problematic
I Energy consumption increases with cluster size.
I Idle machines are a waste of energy.
[Figure: Power consumption [Watt] vs. time [s] (0–26000 s) — energy consumption of trace file execution with 89.62% of system utilization and NAS BT benchmark]
Total Energy Consumed:
I NORMAL Management: 53.62 kWh
I GREEN Management: 47.54 kWh
Total Execution Time:
I NORMAL: 26557 s
I GREEN: 26638 s
Energy Efficiency Graphs
Developments and Experimentation
I Power OFF nodes while not in use; power them ON when needed.
I PowerSaving jobs that enable DVFS techniques: HDD spindown (sdparm) and CPU frequency scaling (cpufreq)
I Consider trade-offs: energy consumption VS jobs' waiting time or application performance
I Real-scale experimentation using Grid5000 and the Green-NET framework (wattmeters + RRD).
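The Power ON/OFF model above can be sketched as a simple green-management policy. This is illustrative Python with assumed names, not the actual OAR implementation: shut down nodes idle longer than a threshold, but keep a small pool of idle nodes alive for fast response to interactive jobs.

```python
# Sketch of a green RJMS policy: power off nodes idle longer than a
# threshold, while keeping `keep_alive` idle nodes up for fast response
# to interactive jobs. Names and thresholds are illustrative.

def green_decide(nodes, now, idle_threshold_s=600, keep_alive=2):
    """Return the names of nodes to power off.

    `nodes` maps name -> {"state": "idle"|"busy"|"off", "idle_since": t}.
    """
    # oldest-idle nodes first, so the most recently idled ones stay up
    idle = sorted((n for n, s in nodes.items() if s["state"] == "idle"),
                  key=lambda n: nodes[n]["idle_since"])
    to_off = []
    for name in idle:
        if len(idle) - len(to_off) <= keep_alive:
            break  # preserve the keep-alive pool for interactive jobs
        if now - nodes[name]["idle_since"] >= idle_threshold_s:
            to_off.append(name)
    return to_off

nodes = {"n1": {"state": "idle", "idle_since": 0},
         "n2": {"state": "idle", "idle_since": 100},
         "n3": {"state": "idle", "idle_since": 950},
         "n4": {"state": "busy", "idle_since": None}}
print(green_decide(nodes, now=1000))  # ['n1'] — n2/n3 stay up, n4 is busy
```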
Results and Future Research
Green Management with Power ON/OFF model:
I In a cluster with 89.7% utilization, managed a 10.1% gain in energy consumption.
I The increase of jobs' waiting times can be treated by keeping nodes alive for fast response to interactive jobs, or by workload profiling and prediction.
Power jobs with DVFS techniques:
I NAS benchmarks for DVFS techniques evaluation.
I Good trade-offs, but not for all benchmarks. Application profiling and DVFS triggering whenever needed.
54 / 68
Conclusions and Perspectives
Plan
Introduction
Resource and Job Management Systems analysis
  Concepts and State of the art
  RJMS different approaches
  Functionalities Comparison
Experimental Methodology for RJMS evaluation
  Real-Scale Experimentation with Workload Injection
  Performance Evaluation of some open-source RJMS
Approaches for System Exploitation Improvements
  Optimizing resources exploitation with Malleability techniques
  Improving system utilization in a lightweight grid context
  Energy Efficient Management techniques
Conclusions and Perspectives
Appendix - References
56 / 68
Conclusions and Perspectives
Conclusions
RJMS importance in the HPC stack
I More complex internal functions to deal with technological evolutions and application needs
I Efficient Resource and Job Management became a multifaceted problem: system utilization, application performance through constraints, energy consumption...
I RJMS different approaches like: OAR with versatility, SLURM with scalability
Thesis Contributions:
I RJMS quantifiable functionalities evaluations
I Real-Scale Experimental Methodology for Resource and Job Management Systems
  I Controlled platforms with reproducibility and workload injection
  I RJMS performance evaluation comparisons
I Taking advantage of un-utilized resources for system exploitation improvements
  I malleability techniques for dynamically adapted workloads
  I checkpointing strategies for optimization of beneficial system utilization in a lightweight grid
  I energy-efficient management techniques for energy reductions
57 / 68
Conclusions and Perspectives
Perspectives
Resource and Job Management Systems
... and preparation for the exascale
I Will the centralized model of the RJMS stand the scaling in numbers of nodes and cores?
I Partition the infrastructures and use techniques for fast job submission like the Falkon system [RAICU'07]
I RJMS support for dynamically adapted workloads will benefit system utilization, fault tolerance and energy efficiency

Real-Scale Experimentation for RJMS
I Adapted versions of the ESP benchmark for experimentation with different parameters of a system
I The experimentation for this 4-year thesis is equivalent to more than 30000 days (about 82 years) of 1-CPU machine usage.
I Virtualization techniques (diminish the energy consumption while keeping realistic conditions)
58 / 68
Appendix - References
Bibliography
Raphael Bolze, Franck Cappello, Eddy Caron, Michel Dayde, Frederic Desprez, Emmanuel Jeannot, Yvon Jegou, Stephane Lanteri, Julien Leduc, Noredine Melab, Guillaume Mornet, Raymond Namyst, Pascale Primet, Benjamin Quetier, Olivier Richard, El-Ghazali Talbi, and Iréa Touche. Grid'5000: a large scale and highly reconfigurable experimental grid testbed. Int. Journal of High Performance Computing Applications, 20(4):481–494, 2006.
Marcia Cera, Guilherme Pezzi, Elton Mathias, Nicolas Maillard, and Philippe Navaux. Improving the dynamic creation of processes in MPI-2. In 13th European PVM/MPI Users' Group Meeting, volume 4192 of LNCS, pages 247–255, Bonn, Germany, 2006.
Parallel Workloads Archive. http://www.cs.huji.ac.il/labs/parallel/workload/logs.html.
Dror G. Feitelson and Larry Rudolph.Toward convergence in job schedulers for parallel supercomputers.In Job Scheduling Strategies for Parallel Processing, pages 1–26. Springer-Verlag, 1996.
William TC Kramer.PERCU: A Holistic Method for Evaluating High Performance Computing Systems.PhD thesis, EECS Department, University of California, Berkeley, Nov 2008.
Ioan Raicu, Yong Zhao, Catalin Dumitrescu, Ian Foster, and Mike Wilde. Falkon: a fast and light-weight task execution framework. In IEEE/ACM International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'07), 2007.
59 / 68
Appendix - References
60 / 68
Appendix - References
Trade-offs of dynamically adapted workloads

Strategies                        Overall System Util.   Terminated Jobs   Error Jobs   Avg. Waiting Time (rigid jobs)
Malleable Dynamic-CPUSET Mapping  97%                    8                 0            81 sec
Malleable Dynamic-MPI             98%                    8                 0            44 sec
Moldable-besteffort               72%                    5                 4            8 sec

Table: Comparison of dynamic and static approaches for improvements of system exploitation
61 / 68
Appendix - References
Energy Efficient Resource Management

[Figure: Power consumption [Watt] vs. time [s] (0–26000 s) — energy consumption of trace file execution with 89.62% of system utilization and NAS BT benchmark]

Total Energy Consumed:
I NORMAL Management: 53.62 kWh
I GREEN Management: 47.54 kWh
Total Execution Time:
I NORMAL: 26557 s
I GREEN: 26638 s
62 / 68
Appendix - References
Energy Reductions Trade-offs

[Figure: CDF of jobs [%] over wait time [s] (0–800 s) with 89.62% of system utilization and NAS BT benchmark, GREEN vs. NORMAL management]
63 / 68
Appendix - References
DVFS techniques trade-offs

Method    HDD Spin             CPU Freq             HDD Spin + CPU Freq
Gain %    Energy/Performance   Energy/Performance   Energy/Performance
EP        2.5% / 0%            10.3% / -18.9%       12.2% / -20.5%
SP        1.6% / 0.3%          8.5% / -1.3%         10.2% / -1.5%
BT        2% / -0.4%           9% / -5.4%           10.4% / -5.5%
LU        2.2% / 0.2%          9.5% / -7.6%         11.5% / -10.8%
CG        2% / -0.13%          8.2% / -1.4%         10% / -3.1%
IS        1.4% / 1.5%          6.4% / -1.5%         10% / -7.2%
MG        1.2% / -1.1%         8.2% / -0.5%         9.8% / -3.4%
Overall   1.8% / 0.05%         8.5% / -5.2%         10.5% / -7.4%

Table: Gain percentage, Energy VS Performance (Execution Time)
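One way to read the table above: a DVFS setting is only worthwhile when the energy gain outweighs the performance loss. A minimal sketch (the "gain exceeds slowdown" acceptance rule is an illustrative assumption, not the thesis criterion) that flags benchmarks where the combined HDD + CPU setting pays off:

```python
# Flag NAS benchmarks where the HDD Spin + CPU Freq DVFS setting gains
# more in energy than it costs in performance. Values are copied from
# the table above; the acceptance rule itself is illustrative.

combined = {  # benchmark: (energy gain %, performance delta %)
    "EP": (12.2, -20.5), "SP": (10.2, -1.5), "BT": (10.4, -5.5),
    "LU": (11.5, -10.8), "CG": (10.0, -3.1), "IS": (10.0, -7.2),
    "MG": (9.8, -3.4),
}

def worthwhile(energy_gain, perf_delta):
    """Accept the setting only if the energy saved exceeds the time lost."""
    return energy_gain > -perf_delta

good = sorted(b for b, (e, p) in combined.items() if worthwhile(e, p))
print(good)  # EP is rejected: its 20.5% slowdown dwarfs the 12.2% saving
```

This matches the slide's observation of "good trade-offs, but not for all benchmarks": per-application profiling decides when to trigger DVFS.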
[BCC+06], [Kra08], [fei], [FR96], [CPM+06], [RZD+07]
Back.
64 / 68
Appendix - References
Lightweight grid experimentation
I CIGRI grid utilisation for a 5-cluster, 200-node grid, No-checkpoints strategy (default)

[Figure: Cluster utilisation (#Jobs x #CPU x sec) per job type (Total, Terminated, Running, Error, Fail.NotValuable, Fail.Valuable) — grid utilisation of CIGRI best-effort jobs for 5h and 60% local cluster workload (No Checkpoints); 5 clusters (4 of 32 nodes, 1 of 72 nodes), with max. utilisation potential lines for 1 to 5 clusters (32n, 64n, 96n, 128n, 200n)]
66 / 68
Appendix - References
Lightweight grid experimentation
I CIGRI grid utilisation for a 5-cluster, 200-node grid, Triggered-checkpoints strategy

[Figure: Cluster utilisation (#Jobs x #CPU x sec) per job type (Total, Terminated, Running, Error, Fail.NotValuable, Fail.Valuable) — grid utilisation of CIGRI best-effort jobs for 5h and 60% local cluster workload (Triggered Checkpoints); 5 clusters (4 of 32 nodes, 1 of 72 nodes), with max. utilisation potential lines for 1 to 5 clusters (32n, 64n, 96n, 128n, 200n)]
67 / 68
Appendix - References
Lightweight grid experiments Results
I State of grid jobs for 5 hours of experimentation on 5 clusters, 200 nodes (dual Opteron 2.0GHz, 2GB RAM), 60% local workload

Strategies / Jobs       Total   Terminated   Running   Error   Inter. Failures   Inter. Failures
                                                               Not Valuable      Valuable
Triggered checkpoints   1377    762          74        103     226               212
No checkpoints          1420    739          74        8       581               0
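From the table above one can quantify how much of the scavenged work was beneficial. In this quick computation, "beneficial" (terminated jobs plus valuable interrupted failures, i.e. jobs resumable from a checkpoint) is an illustrative reading of the columns:

```python
# Compare the two strategies from the table above: jobs whose work
# survives (terminated + interrupted-but-valuable, resumable from a
# checkpoint) versus jobs whose computation is entirely lost.

runs = {  # strategy: (total, terminated, running, error, fail_not_valuable, fail_valuable)
    "triggered": (1377, 762, 74, 103, 226, 212),
    "none":      (1420, 739, 74, 8, 581, 0),
}

summary = {name: {"beneficial": term + val, "lost": not_val}
           for name, (total, term, run, err, not_val, val) in runs.items()}

print(summary["triggered"])  # {'beneficial': 974, 'lost': 226}
print(summary["none"])       # {'beneficial': 739, 'lost': 581}
```

The triggered-checkpoints strategy turns most would-be-lost interruptions into resumable work, which is the "beneficial system utilization" improvement the slides claim.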
Back.
68 / 68