Contributions for Resource and Job Management inHigh Performance Computing
Yiannis Georgiou
Introduction
High Performance Computing is defined by:

Infrastructures
- Supercomputers, Clusters, Grids, Peer-to-Peer systems and, lately, Clouds

Applications
- Climate prediction, protein folding, crash simulation, high-energy physics, astrophysics, animation for movie and video game productions

System Software
- Operating systems, runtime systems, resource management, I/O systems, interfacing to external environments
Introduction
High Performance Computing Evolutions

Computational infrastructures

Advances in computer architecture (Top500 supercomputer sites):
- multiprocessor technologies: 85% quad-core (2010), up from 1% dual-core (2005)
- use of clusters: 80% (2010), up from 42% (2003)
- peak performance, 1st position: from GigaFlops (1997) to PetaFlops (2008)
- increase of energy consumption: average 397 kW (2010), up from 257 kW (2008)

Scientific Applications
- need for more computing power
- quality of services
- deep parallelism
- fault tolerance

What is the impact of those evolutions upon the Resource and Job Management System?
Introduction
Resource and Job Management
The goal of a Resource and Job Management System (RJMS) is to satisfy users' demands for computation and to assign user jobs upon the computational resources in an efficient manner.

RJMS Importance

Strategic position but complex internals:
- direct and constant knowledge of resources and jobs
- multifaceted procedures with complex internal functions
Introduction
Study of Resource and Job Management Systems
Studying the RJMS
- How can the Resource and Job Management deal with the new technological evolutions and the needs of applications?

Experimentation methodology to evaluate the RJMS
- How can we evaluate the performance of a RJMS and its internal functions?

Motivations for system exploitation improvements
- Logs of real parallel workloads show important un-utilization periods (e.g. the LLNL Atlas cluster, with 9,216 cores, showed 64.1% system utilization during 8 months).

Improving the RJMS
- How can we obtain beneficial system exploitation through a Resource and Job Management System by taking advantage of the system's un-utilized resources?
Introduction
Plan
Introduction
Resource and Job Management Systems analysis
  Concepts and State of the art
  RJMS different approaches
  Functionalities Comparison

Experimental Methodology for RJMS evaluation
  Real-Scale Experimentation with Workload Injection
  Performance Evaluation of some opensource RJMS

Approaches for System Exploitation Improvements
  Optimizing resources exploitation with Malleability techniques
  Improving system utilization in a lightweight grid context
  Energy Efficient Management techniques

Conclusions and Perspectives

Appendix - References
Resource and Job Management Systems analysis
Resource and Job Management Systems analysis Concepts and State of the art
RJMS abstraction layers
This assignment involves three principal abstraction layers:
- the Job Management layer
- the Scheduling layer
- and the Resource Management layer
Research Challenges
Network topological constraints
Internal node topological constraints
- Hardware affinities: multiprocessors, memory hierarchies...

Heterogeneity
- Resources (multicore SMPs, NUMA, GPUs) and infrastructures (Clusters, Multiclusters, Grids, Clouds, Desktop Grids...)

Energy Efficiency
- Needs to be taken into account: RJMS knowledge of resources and workloads

Resource Management
- New techniques and algorithms (e.g. best-fit algorithms, CPUSETs, fault-tolerance techniques...)

Job Management
- New concepts and expressions (e.g. types of jobs, interfacing APIs...)

Scheduling
- Larger pool of possibilities; needs optimized logics and algorithms
- Advanced scheduling policies (e.g. gang scheduling, preemption...)
Resource and Job Management Systems analysis RJMS different approaches
Used Resource and Job Management Systems
Open Source RJMS
- SLURM
- TORQUE
- OAR
- MAUI
- CONDOR
- SGE (before Oracle)

Commercial RJMS
- Loadleveler
- LSF
- MOAB
- PBSPro
- OGE (Oracle Grid Engine)

Contribution: RJMS comparison study
- Quantifiable functionalities evaluation of opensource and commercial RJMS
- Experimental performance evaluation of opensource RJMS
SLURM ...towards scalability
Designed for scalability
- one central controller and a daemon upon each computing node
- daemons upon all nodes for secure authentication (munged)
- management of resources with partitions (groups with specific characteristics)
- general-purpose plug-in mechanism

Special features
- highly scalable launching and scheduling algorithms
- best-fit topology-aware (network and internal node) scheduling
- advanced scheduling policies (preemption, gang scheduling)
- evolving jobs support (shrink in size)

Facts
- Highly scalable but less flexible to support different environments.
- Used in 40% of the most powerful production systems in the TOP500.
OAR ...towards versatility
Design based on high-level components
- Relational database (MySQL/PostgreSQL) as the kernel to store and exchange data
- Script languages (Perl, Ruby) for the execution engine and modules
- No daemon on computing nodes; use of GNU/Linux components: SSH, CPUSETs, Taktuk

Special features
- Hierarchical management of resources and request expressions
- Heterogeneous resources (e.g. licences, storage capacity, network capacity...)
- RESTful API for interfacing with external environments (ISV applications, web portals, grids...)
- Multiple task types (e.g. besteffort, environment deployment, moldable)

Facts
- Highly versatile with short development cycles, but not scalable enough.
- Used in production (e.g. Grid5000, Ciment) and in research (e.g. Green-Net, DSL-Lab)
Resource and Job Management Systems analysis Functionalities Comparison
RJMS Quantifiable Functionalities Comparison
Quantifying functionalities support by RJMS
- Resource Management: resources treatment, job launching, task placement, high availability...
- Job Management: job declaration, job control, monitoring, interfacing, quality of services...
- Scheduling: scheduling algorithms, queues management, advanced reservations...

Overall Evaluation / RJMS Software    SLURM  TORQUE  OAR  SGE  MAUI  LSF  ...
Overall Resource Management (/10)      7.1    5.2    6.9  5.9  1.9   6.9  ...
Overall Job Management (/10)           5.1    5.1    5.5  6    3.1   6.8  ...
Scheduling (/10)                       6      3      5.7  5.7  5.5   5.7  ...
Overall Evaluation Points (/10)        6.2    5.7    5.1  6    5.9   6.4  ...
Experimental Methodology for RJMS evaluation
Motivations for new experimental methodology
Experimental methodologies for HPC
- large number of parameters and conditions
- simulators or emulators are valuable but not enough
- need for real-scale experimentation

Workload modelling
- Performance evaluation by executing a sequence of jobs.
- Two common ways to use a workload for system evaluation:
  1. either a workload log (trace)
  2. or a workload model (a synthetic workload like ESP)

Contribution: developed RJMS evaluation methodology
- Real-scale experimentation with workload injection
Experimental Methodology for RJMS evaluation Real-Scale Experimentation with Workload Injection
Experimental Methodology - General Principles
1. Real-scale experimentation upon dedicated platforms (like Grid5000): control and reproduction of experiments.
2. Injection of characterised workloads (like the ESP benchmark) to observe the behaviour of the RJMS under particular conditions.
3. Extraction of the produced workload trace and post-treatment analysis of the results.
Grid5000 Experimental platform
- Grid5000 experimental grid [BOLZE'06]: a large-scale distributed platform that can be easily controlled, reconfigured and monitored.
- Deep reconfiguration mechanisms, so that each user can use exactly the environment needed for their experiments.

[Figure: the reconfigurable stack - applications, environment/middleware, OS (Linux, FreeBSD, ...), hardware/network - with layers specifiable and configurable per experiment]

https://www.grid5000.fr/
ESP Benchmark
- provides a quantitative evaluation of launching and scheduling via a single metric [1]
- complete independence from the hardware performance

ESP Efficiency = Theoretic Duration / Measured Duration    (1)

Job Type  Fraction of job size      Job size for a 512-core  Count of job  Target run time
          relative to total system  cluster (in cores)       instances     (seconds)
A         0.03125                   16                       75            267
B         0.06250                   32                       9             322
...       ...                       ...                      ...           ...
L         0.12500                   64                       36            366
M         0.25000                   128                      15            187
Z         1.00000                   512                      2             100
Total                                                        230

Table: ESP benchmark [KRAMER'08] characteristics (case for a 512-core cluster)

http://www.nersc.gov/projects/esp.php
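Equation (1) can be checked directly against the table: the theoretic duration is the total demanded work (fraction of system x instance count x target run time, summed over job types), and the efficiency is its ratio to the measured makespan. A minimal sketch, using only the table rows shown above (the full ESP mix contains more job types, so the total below is illustrative):

```python
def theoretic_duration(job_mix):
    """Lower bound on the makespan: total work divided by system capacity.
    `job_mix` holds (fraction of system, instance count, target run time)
    tuples, as in the ESP table above."""
    return sum(frac * count * runtime for frac, count, runtime in job_mix)

def esp_efficiency(job_mix, measured_duration):
    """ESP Efficiency = Theoretic Duration / Measured Duration (Eq. 1)."""
    return theoretic_duration(job_mix) / measured_duration

# Subset of the 512-core ESP mix shown in the table (types A, B, L, M, Z).
mix = [
    (0.03125, 75, 267),  # type A: 16 cores
    (0.06250, 9, 322),   # type B: 32 cores
    (0.12500, 36, 366),  # type L: 64 cores
    (0.25000, 15, 187),  # type M: 128 cores
    (1.00000, 2, 100),   # type Z: full system, 512 cores
]

print(round(esp_efficiency(mix, measured_duration=5000), 3))  # 0.671
```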
Proposed TOPO-ESP-NAS Benchmark
- Modification of the default ESP benchmark for topology-aware scheduling evaluation
- Use of the NAS benchmarks: the CG application is the most communication sensitive.
- No fixed target run times for the executed applications
- The theoretic duration of Eq. (2) is calculated considering the ideal execution times for each different class of jobs

TOPO-ESP-NAS Efficiency = Theoretic Duration (Ideal Topology) / Measured Duration    (2)
Steps for RJMS Experimentation

Developed Xionee: https://gforge.inria.fr/projects/xionee/
A collection of tools to manage (inject, extract and analyze) workload traces.
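A workload injector such as Xionee replays a trace by submitting each job to the RJMS at its recorded submission time. A toy sketch of the replay loop (the `submit` callback is a placeholder for the RJMS's real submission command, e.g. sbatch or oarsub; a real injector would sleep between submissions instead of advancing a virtual clock):

```python
def inject_workload(trace, submit):
    """Replay a workload trace: call submit(job) at each job's recorded
    submission instant. `trace` is a list of dicts with 'submit_time'
    and 'cores' keys (a simplified stand-in for SWF-style workload logs)."""
    log = []
    now = 0.0
    for job in sorted(trace, key=lambda j: j["submit_time"]):
        # A real injector sleeps until job["submit_time"]; here we just
        # advance a virtual clock so the sketch stays testable.
        now = max(now, job["submit_time"])
        submit(job)
        log.append((now, job["cores"]))
    return log

trace = [
    {"submit_time": 5.0, "cores": 32},
    {"submit_time": 0.0, "cores": 16},
]
print(inject_workload(trace, submit=lambda job: None))  # [(0.0, 16), (5.0, 32)]
```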
Experimental Methodology for RJMS evaluation Performance Evaluation of some opensource RJMS
RJMS ESP Efficiency Comparisons
Experimental Testbed details
- Dedicated cluster: 1 RJMS central controller, 8 computing nodes - 64 cores (dual-CPU quad-core Intel Xeon E5420 2.5 GHz, 8 GB RAM, InfiniBand 20G).

RJMS / policy      OAR            SLURM   TORQUE+Maui
backfill           83.7%          83.9%   83.1%
preemption         Not Supported  84.9%   85.4%
gang-scheduling    Not Supported  94.8%   Not Supported

Table: ESP benchmark efficiency percentage for different RJMS and scheduling policies - OAR, SLURM and TORQUE+Maui experiments upon a cluster of 64 resources

Results Analysis
- gang scheduling gives the best performance, allowing the efficient filling of all the 'holes' in the scheduling space
- however, this is due to the simplicity of the particular application: suspend/resume happens in memory, no swapping is needed
- preemption performs better than backfill due to the 2 higher-priority "all resources" jobs (overhead for backfilling, direct execution with preemption).
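The backfilling behaviour discussed above can be illustrated with a very small EASY-backfill-style sketch (illustrative only - not the SLURM or Maui implementation, and omitting the runtime-based reservation that protects the blocked head job from delay):

```python
def easy_backfill(queue, free_cores):
    """Tiny EASY-backfill-style sketch. `queue` is FIFO-ordered jobs as
    (name, cores) tuples. The head job starts if it fits; otherwise it
    blocks and smaller jobs behind it may be backfilled into the holes."""
    started = []
    if not queue:
        return started, free_cores
    head_name, head_cores = queue[0]
    if head_cores <= free_cores:        # head job fits: start it
        started.append(head_name)
        free_cores -= head_cores
    # Try to backfill the remaining jobs into whatever is left free.
    for name, cores in queue[1:]:
        if cores <= free_cores:
            started.append(name)
            free_cores -= cores
    return started, free_cores

# 64-core cluster with 32 cores free: the 48-core head job blocks,
# while the smaller jobs behind it are backfilled.
started, free = easy_backfill([("A", 48), ("B", 16), ("C", 8)], free_cores=32)
print(started, free)  # ['B', 'C'] 8
```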
Scaling the size of the cluster: ESP Efficiency
Experimental Testbed details
- Dedicated cluster: 1 SLURM central controller, 288 computing nodes - 9216 cores (quad-CPU octo-core Intel Xeon 7500, InfiniBand).
- Part of the BULL-CEA Tera 100 cluster during its installation-testing phase (1.25 PetaFlops theoretical peak performance).

Cluster size (number of cores)                         512     9216
Average jobs waiting time (sec)                        2766    2919
Total workload execution time (sec)                    12992   13099
ESP efficiency for SLURM backfill+preemption policy    82.9%   82.3%

[Figures: system utilization (cores in use over time, with job start time impulses) for the ESP synthetic workload under SLURM, on 512 cores and on 9216 cores]
Scaling the size of the cluster: Jobs Waiting Time
[Figure: cumulative distribution function of job waiting times for the ESP benchmark under the SLURM backfill+preemption scheduler, comparing the 512-core (SLURM512) and 9216-core (SLURM9216) runs]
Scaling the number of submitted jobs: Backfill vs. Preemption policy

- Submission burst of small-granularity jobs (1 core per job) for simple sleep executions
- 2nd case without preemption between jobs
- Degradation problems with backfill+preemption

[Figure: instant throughput for 7000 submitted jobs (1 core each) upon a 10240-core cluster, backfill+preemption mode]
Network Topology Aware Placement Evaluations
Experimental Testbed details
- Dedicated cluster: 1 SLURM central controller, 128 computing nodes - 4096 cores (quad-CPU octo-core Intel Xeon 7500).
- Network topological constraints: 2 different islands, 64 nodes per island; higher bandwidth within the same island.

Topo-ESP-NAS results           Theoretic (ideal)  4096 cores,         4096 cores,
                                                  not topology-aware  topology-aware
Total execution time (sec)     12227              17518               16985
Average wait time (sec)        -                  4575                4617
Average execution time (sec)   -                  502                 483
Efficiency for Topo-ESP-NAS    100%               69.8%               72.0%
Jobs on 1 island               228                165                 183
Jobs on 2 islands              2                  65                  47

Table: TOPO-ESP-NAS benchmark results for a 4096-resource cluster
Approaches for System Exploitation Improvements
Motivations for exploitation of un-utilized resources
- Systems may present important under-utilization periods due to:
  - job interarrival times and cancelation rates
  - the complexity of workloads, with variations in resource demands and execution times
- Issue: the high dynamicity of user workloads may result in bad scheduling performance, producing long-waiting jobs and big turnaround times.

Workload Trace   From      Until     Months  CPUs   Jobs     Users  Utilization %
LANL O2K         Nov 1999  Apr 2000  5       2,048  121,989  337    64.0
OSC Cluster      Jan 2000  Nov 2001  22      57     80,714   254    43.1
SDSC BLUE        Apr 2000  Jan 2003  32      1,152  250,440  468    76.2
HPC2N            Jul 2002  Jan 2006  42      240    527,371  258    72.0
SDSC DataStar    Mar 2004  Apr 2005  13      1,664  96,089   460    62.8
LPC EGEE         Aug 2004  May 2005  9       140    244,821  57     20.8
LLNL uBGL        Nov 2006  Jun 2007  7       2,048  112,611  62     56.1
LLNL Atlas       Nov 2006  Jun 2007  8       9,216  60,332   132    64.1
LLNL Thunder     Jan 2007  Jun 2007  5       4,008  128,662  283    87.6

Table: Logs of Real Parallel Workloads from Production Systems [FEITELSON'02]
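Utilization figures like those in the table can be recomputed from a trace: sum each job's cores x runtime and divide by the cluster's capacity over the observed span. A minimal sketch, assuming jobs are given as (start, runtime, cores) tuples:

```python
def system_utilization(jobs, total_cores):
    """Fraction of available core-seconds actually consumed.
    `jobs` is a list of (start_time, runtime, cores) tuples; the
    observation window runs from the earliest start to the latest
    completion in the trace."""
    if not jobs:
        return 0.0
    span_start = min(start for start, _, _ in jobs)
    span_end = max(start + runtime for start, runtime, _ in jobs)
    used = sum(runtime * cores for _, runtime, cores in jobs)
    return used / ((span_end - span_start) * total_cores)

# Two jobs on a 64-core cluster over a 100-second window.
jobs = [(0, 100, 32), (0, 50, 32)]
print(system_utilization(jobs, total_cores=64))  # 0.75
```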
Taking advantage of the clusters' unutilized resources...

1. ...through dynamic jobs: adapting dynamic jobs to resource availabilities
2. ...through a lightweight grid: aggregating unutilized cluster resources for use in global computing
3. ...for energy efficiency: allowing unutilized resources to be powered off in order to reduce the clusters' energy consumption
Approaches for System Exploitation Improvements Optimizing resources exploitation with Malleability techniques
Background Information
Who decides    When decided:
               At submission   During execution
Application    Rigid           Evolving
System         Moldable        Malleable

Table: Classification of parallel applications [FEITELSON'96]

Motivations for malleability support upon an RJMS
- a lot of research made in theory:
  - programming dynamic applications
  - RJMS and dynamicity
  - communication protocols between RJMS and applications
- ...but few adaptive parallel applications exist in the real world
- and limited RJMS support for dynamic applications
Implementation details

Malleability support implementation upon the OAR RJMS
- the malleability automaton for decision making
- a resource discovery command
- a communication protocol
- scheduling of only one malleable job is allowed at a time
- a normal job for the rigid part, to guarantee termination
- the malleability worker for the management of dynamic operations
- Best Effort jobs for adaptability and non-interference with the cluster

Malleability support: one architecture, two techniques
- Dynamic MPI
- Dynamic CPUSET Mapping
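The decision step of such a malleability automaton can be sketched as follows (an illustrative toy, not OAR's actual automaton): the rigid part is never touched, while the best-effort part grows to absorb free cores and shrinks when the normal workload reclaims them.

```python
def resize_decision(current_extra, free_cores):
    """Grow/shrink decision for the best-effort part of a malleable job
    (toy sketch). `current_extra` is the number of best-effort cores
    currently held; `free_cores` is what the resource-discovery command
    reports as unused by the normal workload."""
    target = max(0, free_cores)
    if target > current_extra:
        return ("grow", target - current_extra)
    if target < current_extra:
        return ("shrink", current_extra - target)
    return ("steady", 0)

print(resize_decision(current_extra=8, free_cores=20))  # ('grow', 12)
print(resize_decision(current_extra=8, free_cores=2))   # ('shrink', 6)
```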
Dynamic MPI technique
[Figure: growing and shrinking of a malleable job across 4 cluster nodes of 4 cores each; legend: application process, rigid job (malleable), best-effort job (malleable), rigid job (external)]

Details
- supports only malleable applications written with the MPI_Comm_spawn primitives of the MPI-2 norm
- lib-dynamicMPI [CERA'06]: a library for communication between the MPI_Comm_spawn calls and OAR
- LAM-MPI implementation of MPI-2 (lamgrow/lamshrink)
Dynamic CPUSET Mapping
[Figure: growing and shrinking of a malleable job within one node's 4 cores; legend as in the Dynamic MPI figure]

Details
- CPUSETs: Linux kernel objects for partitioning a multiprocessor machine by creating execution areas.
- on-the-fly manipulation of the amount of allocated cores per node.
- works with every parallel application
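In practice, the on-the-fly manipulation amounts to rewriting the `cpuset.cpus` file of the job's cpuset (mount points vary by kernel and cgroup setup; `/sys/fs/cgroup/cpuset/...` is one common layout, and the job directory name below is hypothetical). A sketch that computes the kernel's compact core-list syntax, with the privileged write left commented out:

```python
def cpuset_cpus_string(cores):
    """Render a set of core ids in the compact 'cpuset.cpus' list
    syntax, e.g. {0, 1, 2, 5} -> '0-2,5'."""
    cores = sorted(cores)
    ranges, start = [], None
    for i, c in enumerate(cores):
        if start is None:
            start = c
        if i + 1 == len(cores) or cores[i + 1] != c + 1:
            ranges.append(str(start) if start == c else f"{start}-{c}")
            start = None
    return ",".join(ranges)

def grow_cpuset(job_cpuset_dir, current, extra):
    """Return the enlarged core set and the string that would be written
    to <job_cpuset_dir>/cpuset.cpus (write commented out in this sketch)."""
    new = set(current) | set(extra)
    value = cpuset_cpus_string(new)
    # with open(f"{job_cpuset_dir}/cpuset.cpus", "w") as f:  # needs root
    #     f.write(value)
    return new, value

# 'oar_job42' is a hypothetical cpuset directory name.
new, value = grow_cpuset("/sys/fs/cgroup/cpuset/oar_job42", {0, 1}, {2, 3, 6})
print(value)  # 0-3,6
```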
Real-Scale Experimentation Testbed
Infrastructure and Applications Characteristics

Infrastructure:
- Dedicated cluster (Grid5000): 1 OAR server and 32 computing nodes (IBM System x3455 with dual-CPU dual-core AMD Opteron 2218 (2.6 GHz / 2 MB L2 cache / 800 MHz) and 4 GB memory per node, with Gigabit Ethernet).

Applications:
- Mandelbrot, reprogrammed as malleable using the MPI_Comm_spawn primitives of the MPI-2 norm, for Dynamic MPI.
- NAS class C benchmarks with their MPI 3.3 implementation: BT (Block Tridiagonal solver) and CG (Conjugate Gradient) for Dynamic CPUSET Mapping.

Experiment details
- Workload trace injection from the DAS-2 clusters (a slice of 5 hours with 40% cluster utilization), representing the normal workload of the cluster.
- One malleable job at a time is submitted and runs upon the free resources, i.e. those not used by the normal workload.
- The dynamic malleable approaches are compared with static moldable-besteffort job submission (a moldable job that starts its execution upon all free cluster resources and remains unchanged until the end of its execution).
Malleability evaluation
[Figure: malleable jobs executing the BT application upon the free resources of the normal workload - cores used over the 5-hour run]

[Figure: moldable-besteffort jobs executing the BT application upon the free resources of the normal workload - cores used over the 5-hour run]

Results in percentages
- Normal workload utilization (injected): 40%
- Malleable jobs utilization (measured): 57%
- Moldable-besteffort jobs utilization (measured): 32%
- Gain: 25% of system utilization

Future Work
- Extend the prototype to support multiple malleable jobs
- Deeper experimentation of shared-memory contexts for CPUSET Mapping
Approaches for System Exploitation Improvements Improving system utilization in a lightweight grid context
Motivations for system utilization improvements
Context
- The lightweight grid CIGRI: aggregation of the clusters' un-utilized resources for bag-of-tasks applications, through local OAR besteffort jobs

Related Work
- different from mainstream grids like Globus...
- similar to alternative desktop grids / volunteer computing like Condor, OurGrid, BOINC (Seti@home)...

Problematic
- Improving system utilization (more jobs can be executed in case of un-utilized resources)...
- but resource volatility...
- renders killed jobs' computation useless: they restart from the beginning
Developments and Future Research
Developments
- Improve beneficial system utilization through a checkpoint/restart technique.
- Checkpointing strategies: periodic and triggered (with a grace-time delay)
- System-level checkpoint/restart
- Evaluation of real-scale grid scenarios

[Figure: grid utilization of CIGRI besteffort jobs for 5 h and 60% local cluster workload (triggered checkpoints) - cluster utilization (#jobs x #CPUs x sec) per job outcome (Total, Terminated, Running, Error, Fail.NotValuable, Fail.Valuable) across five clusters (4 x 32 nodes, 1 x 72 nodes), against the maximum utilization potential for one to five clusters]

Results and Future Research
Bad results for periodic, better results for triggered checkpoints:
- Big overhead of system-level checkpointing; the best approach is application-level checkpointing
- Calculation of the grace-time delay by pre-estimating the checkpointing time.

Real-scale grid experimentation is complicated and energy consuming:
- Simulated and virtualized experiments before the real-scale experimentation
Approaches for System Exploitation Improvements Energy Efficient Management techniques
Motivations for Energy Efficient RJMS
Problematic
I Energy consumption increases with cluster size.
I Idle machines are a waste of energy.
[Figure: Power consumption [Watt] vs. time [s] (0–26000 s) — energy consumption of trace file execution with 89.62% of system utilization and NAS BT benchmark]
Total Energy Consumed:
I NORMAL Management: 53.62 kWh
I GREEN Management: 47.54 kWh
Total Execution Time:
I NORMAL: 26557 s
I GREEN: 26638 s
Energy Efficiency Graphs
Developments and Experimentation
I Power OFF nodes while not in use; power them ON when needed.
I PowerSaving jobs that enable DVFS techniques: HDD spindown (sdparm) and CPU frequency scaling (cpufreq)
I Consider trade-offs: energy consumption VS jobs' waiting time or application performance
I Real-scale experimentation using Grid5000 and the Green-NET framework (wattmeters + RRD).
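The Power ON/OFF model above can be sketched as a simple green-management policy. This is illustrative Python with assumed names, not the actual OAR implementation: shut down nodes idle longer than a threshold, but keep a small pool of idle nodes alive for fast response to interactive jobs.

```python
# Sketch of a green RJMS policy: power off nodes idle longer than a
# threshold, while keeping `keep_alive` idle nodes up for fast response
# to interactive jobs. Names and thresholds are illustrative.

def green_decide(nodes, now, idle_threshold_s=600, keep_alive=2):
    """Return the names of nodes to power off.

    `nodes` maps name -> {"state": "idle"|"busy"|"off", "idle_since": t}.
    """
    # oldest-idle nodes first, so the most recently idled ones stay up
    idle = sorted((n for n, s in nodes.items() if s["state"] == "idle"),
                  key=lambda n: nodes[n]["idle_since"])
    to_off = []
    for name in idle:
        if len(idle) - len(to_off) <= keep_alive:
            break  # preserve the keep-alive pool for interactive jobs
        if now - nodes[name]["idle_since"] >= idle_threshold_s:
            to_off.append(name)
    return to_off

nodes = {"n1": {"state": "idle", "idle_since": 0},
         "n2": {"state": "idle", "idle_since": 100},
         "n3": {"state": "idle", "idle_since": 950},
         "n4": {"state": "busy", "idle_since": None}}
print(green_decide(nodes, now=1000))  # ['n1'] — n2/n3 stay up, n4 is busy
```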
Results and Future Research
Green Management with Power ON/OFF model:
I In a cluster with 89.7% utilization, managed a 10.1% gain in energy consumption.
I The increase of jobs' waiting times can be treated by keeping nodes alive for fast response to interactive jobs, or by workload profiling and prediction.
Power jobs with DVFS techniques:
I NAS benchmarks for DVFS techniques evaluation.
I Good trade-offs, but not for all benchmarks. Application profiling and DVFS triggering whenever needed.
54 / 68
Conclusions and Perspectives
Plan
Introduction
Resource and Job Management Systems analysis
  Concepts and State of the art
  RJMS different approaches
  Functionalities Comparison
Experimental Methodology for RJMS evaluation
  Real-Scale Experimentation with Workload Injection
  Performance Evaluation of some open-source RJMS
Approaches for System Exploitation Improvements
  Optimizing resources exploitation with Malleability techniques
  Improving system utilization in a lightweight grid context
  Energy Efficient Management techniques
Conclusions and Perspectives
Appendix - References
56 / 68
Conclusions and Perspectives
Conclusions
RJMS importance in the HPC stack
I More complex internal functions to deal with technological evolutions and application needs
I Efficient Resource and Job Management became a multifaceted problem: system utilization, application performance through constraints, energy consumption...
I RJMS different approaches like: OAR with versatility, SLURM with scalability
Thesis Contributions:
I RJMS quantifiable functionalities evaluations
I Real-Scale Experimental Methodology for Resource and Job Management Systems
  I Controlled platforms with reproducibility and workload injection
  I RJMS performance evaluation comparisons
I Taking advantage of un-utilized resources for system exploitation improvements
  I malleability techniques for dynamically adapted workloads
  I checkpointing strategies for optimization of beneficial system utilization in a lightweight grid
  I energy-efficient management techniques for energy reductions
57 / 68
Conclusions and Perspectives
Perspectives
Resource and Job Management Systems
... and preparation for the exascale
I Will the centralized model of the RJMS stand the scaling in numbers of nodes and cores?
I Partition the infrastructures and use techniques for fast job submission like the Falkon system [RAICU'07]
I RJMS support for dynamically adapted workloads will benefit system utilization, fault tolerance and energy efficiency

Real-Scale Experimentation for RJMS
I Adapted versions of the ESP benchmark for experimentation with different parameters of a system
I The experimentation for this 4-year thesis is equivalent to more than 30000 days (about 82 years) of 1-CPU machine usage.
I Virtualization techniques (diminish the energy consumption while keeping realistic conditions)
58 / 68
Appendix - References
Bibliography
Raphael Bolze, Franck Cappello, Eddy Caron, Michel Dayde, Frederic Desprez, Emmanuel Jeannot, Yvon Jegou, Stephane Lanteri, Julien Leduc, Noredine Melab, Guillaume Mornet, Raymond Namyst, Pascale Primet, Benjamin Quetier, Olivier Richard, El-Ghazali Talbi, and Iréa Touche. Grid'5000: a large scale and highly reconfigurable experimental grid testbed. Int. Journal of High Performance Computing Applications, 20(4):481–494, 2006.
Marcia Cera, Guilherme Pezzi, Elton Mathias, Nicolas Maillard, and Philippe Navaux. Improving the dynamic creation of processes in MPI-2. In 13th European PVM/MPI Users' Group Meeting, volume 4192 of LNCS, pages 247–255, Bonn, Germany, 2006.
Parallel Workloads Archive. http://www.cs.huji.ac.il/labs/parallel/workload/logs.html.
Dror G. Feitelson and Larry Rudolph.Toward convergence in job schedulers for parallel supercomputers.In Job Scheduling Strategies for Parallel Processing, pages 1–26. Springer-Verlag, 1996.
William TC Kramer.PERCU: A Holistic Method for Evaluating High Performance Computing Systems.PhD thesis, EECS Department, University of California, Berkeley, Nov 2008.
Ioan Raicu, Yong Zhao, Catalin Dumitrescu, Ian Foster, and Mike Wilde. Falkon: a fast and light-weight task execution framework. In IEEE/ACM International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'07), 2007.
59 / 68
Appendix - References
60 / 68
Appendix - References
Trade-offs of dynamically adapted workloads

Strategies                        Overall System Util.   Terminated Jobs   Error Jobs   Avg. Waiting Time (rigid jobs)
Malleable Dynamic-CPUSET Mapping  97%                    8                 0            81 sec
Malleable Dynamic-MPI             98%                    8                 0            44 sec
Moldable-besteffort               72%                    5                 4            8 sec

Table: Comparison of dynamic and static approaches for improvements of system exploitation
61 / 68
Appendix - References
Energy Efficient Resource Management

[Figure: Power consumption [Watt] vs. time [s] (0–26000 s) — energy consumption of trace file execution with 89.62% of system utilization and NAS BT benchmark]

Total Energy Consumed:
I NORMAL Management: 53.62 kWh
I GREEN Management: 47.54 kWh
Total Execution Time:
I NORMAL: 26557 s
I GREEN: 26638 s
62 / 68
Appendix - References
Energy Reductions Trade-offs

[Figure: CDF of jobs [%] over wait time [s] (0–800 s) with 89.62% of system utilization and NAS BT benchmark, GREEN vs. NORMAL management]
63 / 68
Appendix - References
DVFS techniques trade-offs

Method    HDD Spin             CPU Freq             HDD Spin + CPU Freq
Gain %    Energy/Performance   Energy/Performance   Energy/Performance
EP        2.5% / 0%            10.3% / -18.9%       12.2% / -20.5%
SP        1.6% / 0.3%          8.5% / -1.3%         10.2% / -1.5%
BT        2% / -0.4%           9% / -5.4%           10.4% / -5.5%
LU        2.2% / 0.2%          9.5% / -7.6%         11.5% / -10.8%
CG        2% / -0.13%          8.2% / -1.4%         10% / -3.1%
IS        1.4% / 1.5%          6.4% / -1.5%         10% / -7.2%
MG        1.2% / -1.1%         8.2% / -0.5%         9.8% / -3.4%
Overall   1.8% / 0.05%         8.5% / -5.2%         10.5% / -7.4%

Table: Gain percentage, Energy VS Performance (Execution Time)
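One way to read the table above: a DVFS setting is only worthwhile when the energy gain outweighs the performance loss. A minimal sketch (the "gain exceeds slowdown" acceptance rule is an illustrative assumption, not the thesis criterion) that flags benchmarks where the combined HDD + CPU setting pays off:

```python
# Flag NAS benchmarks where the HDD Spin + CPU Freq DVFS setting gains
# more in energy than it costs in performance. Values are copied from
# the table above; the acceptance rule itself is illustrative.

combined = {  # benchmark: (energy gain %, performance delta %)
    "EP": (12.2, -20.5), "SP": (10.2, -1.5), "BT": (10.4, -5.5),
    "LU": (11.5, -10.8), "CG": (10.0, -3.1), "IS": (10.0, -7.2),
    "MG": (9.8, -3.4),
}

def worthwhile(energy_gain, perf_delta):
    """Accept the setting only if the energy saved exceeds the time lost."""
    return energy_gain > -perf_delta

good = sorted(b for b, (e, p) in combined.items() if worthwhile(e, p))
print(good)  # EP is rejected: its 20.5% slowdown dwarfs the 12.2% saving
```

This matches the slide's observation of "good trade-offs, but not for all benchmarks": per-application profiling decides when to trigger DVFS.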
[BCC+06], [Kra08], [fei], [FR96], [CPM+06], [RZD+07]
Back.
64 / 68
Appendix - References
Lightweight grid experimentation
I CIGRI grid utilisation for a 5-cluster, 200-node grid, No-checkpoints strategy (default)

[Figure: Cluster utilisation (#Jobs x #CPU x sec) per job type (Total, Terminated, Running, Error, Fail.NotValuable, Fail.Valuable) — grid utilisation of CIGRI best-effort jobs for 5h and 60% local cluster workload (No Checkpoints); 5 clusters (4 of 32 nodes, 1 of 72 nodes), with max. utilisation potential lines for 1 to 5 clusters (32n, 64n, 96n, 128n, 200n)]
66 / 68
Appendix - References
Lightweight grid experimentation
I CIGRI grid utilisation for a 5-cluster, 200-node grid, Triggered-checkpoints strategy

[Figure: Cluster utilisation (#Jobs x #CPU x sec) per job type (Total, Terminated, Running, Error, Fail.NotValuable, Fail.Valuable) — grid utilisation of CIGRI best-effort jobs for 5h and 60% local cluster workload (Triggered Checkpoints); 5 clusters (4 of 32 nodes, 1 of 72 nodes), with max. utilisation potential lines for 1 to 5 clusters (32n, 64n, 96n, 128n, 200n)]
67 / 68
Appendix - References
Lightweight grid experiments Results
I State of grid jobs for 5 hours of experimentation on 5 clusters, 200 nodes (dual Opteron 2.0GHz, 2GB RAM), 60% local workload

Strategies / Jobs       Total   Terminated   Running   Error   Inter. Failures   Inter. Failures
                                                               Not Valuable      Valuable
Triggered checkpoints   1377    762          74        103     226               212
No checkpoints          1420    739          74        8       581               0
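From the table above one can quantify how much of the scavenged work was beneficial. In this quick computation, "beneficial" (terminated jobs plus valuable interrupted failures, i.e. jobs resumable from a checkpoint) is an illustrative reading of the columns:

```python
# Compare the two strategies from the table above: jobs whose work
# survives (terminated + interrupted-but-valuable, resumable from a
# checkpoint) versus jobs whose computation is entirely lost.

runs = {  # strategy: (total, terminated, running, error, fail_not_valuable, fail_valuable)
    "triggered": (1377, 762, 74, 103, 226, 212),
    "none":      (1420, 739, 74, 8, 581, 0),
}

summary = {name: {"beneficial": term + val, "lost": not_val}
           for name, (total, term, run, err, not_val, val) in runs.items()}

print(summary["triggered"])  # {'beneficial': 974, 'lost': 226}
print(summary["none"])       # {'beneficial': 739, 'lost': 581}
```

The triggered-checkpoints strategy turns most would-be-lost interruptions into resumable work, which is the "beneficial system utilization" improvement the slides claim.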
Back.
68 / 68