Power-Aware Job Scheduling on Production HPC Systems



Scheduling challenges on large-scale production supercomputers:
o Changing and conflicting scheduling goals
o Inaccurate job information provided by users (e.g., walltime)
o System fragmentation
o High cost of an improper scheduling strategy

Software challenges in building extreme-scale supercomputers:
o Reliability
o I/O performance
o Energy efficiency

Motivated by the above challenges, we propose to build an integrated job scheduling framework on top of the production resource manager Cobalt [4].

Motivations
Key Observations on HPC Systems
Job Scheduling with Power Budget
Case Study for Blue Gene/Q Trace

Job Traces

Acknowledgments

This work is supported in part by National Science Foundation grants.

Job Power Aware Scheduling Methodology

Solution Overview

System Overview

Production Resource Management System

• Cobalt is an open-source resource manager developed at Argonne National Laboratory.
• Component-based architecture, written in Python.
• Deployed on a number of Blue Gene systems.
• Supports full-system simulation (with Qsim [4]).

Job traces used in the simulation were collected from Intrepid, a production Blue Gene/P system at Argonne National Laboratory. Intrepid is a 40-rack Blue Gene/P system with 40,960 quad-core nodes. It debuted at No. 3 on the TOP500 list in June 2008. Its compute nodes are connected by a 3D torus network. The machine runs leadership-class workloads. The following is a sample workload used in our simulation:

Publications
[1] X. Yang, Z. Zhou, S. Wallace, Z. Lan, W. Tang, S. Coghlan, and M. Papka, "Cutting Energy Costs of HPC Systems," submitted to SC'2013, 2013.
[2] Z. Zhou, Z. Lan, W. Tang, and N. Desai, "Reducing Energy Costs for IBM Blue Gene/P via Power-Aware Job Scheduling," in JSSPP'13, 2013.
[3] S. Wallace, V. Vishwanath, S. Coghlan, Z. Lan, and M. Papka, "Measuring Power Consumption on IBM Blue Gene/Q," in HPPAC'13, 2013.
[4] W. Tang, D. Ren, Z. Lan, and N. Desai, "Adaptive Metric-Aware Job Scheduling for Production Supercomputers," in ICPPW'12, 2012.
[5] W. Tang, Z. Lan, N. Desai, D. Buettner, and Y. Yu, "Reducing Fragmentation on Torus-Connected Supercomputers," in Proc. of IPDPS'11, 2011.

• CQSim is an event-driven simulator for HPC systems.
• Trace-based simulation.
• Supports different scheduling policies.
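To illustrate what event-driven, trace-based simulation means here, the following is a minimal sketch and not CQSim's actual code. The function name `simulate_fcfs`, the tuple layout of a trace record, and the plain FCFS policy are all assumptions made for this example.

```python
import heapq
from collections import deque


def simulate_fcfs(trace, total_nodes):
    """Replay a job trace under FCFS and return {job_id: start_time}.

    trace: list of (job_id, submit_time, runtime, nodes) tuples
    (a hypothetical record layout chosen for this sketch).
    """
    events = []  # heap of (time, order, kind, job); order breaks ties
    for order, job in enumerate(sorted(trace, key=lambda j: j[1])):
        heapq.heappush(events, (job[1], order, "submit", job))
    order = len(events)

    free = total_nodes
    queue = deque()
    starts = {}
    while events:
        now, _, kind, job = heapq.heappop(events)
        if kind == "submit":
            queue.append(job)
        else:  # "finish": the job releases its nodes
            free += job[3]
        # FCFS: start jobs strictly from the head of the queue while they fit.
        while queue and queue[0][3] <= free:
            head = queue.popleft()
            free -= head[3]
            starts[head[0]] = now
            heapq.heappush(events, (now + head[2], order, "finish", head))
            order += 1
    return starts


# Illustrative three-job trace on a 1024-node machine.
trace = [("j1", 0, 100, 1024), ("j2", 10, 50, 512), ("j3", 20, 30, 512)]
starts = simulate_fcfs(trace, total_nodes=1024)
```

A different scheduling policy would replace only the inner `while` loop, which is the sense in which such a simulator can support multiple policies.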

• Job power consumption differs: 20 to 33 kW/rack on BG/P and 30 to 90 kW/rack on BG/Q.
• Electricity prices vary and can change by as much as a factor of 10 from one hour to the next.
• System utilization must not be impacted.
• Jobs show very different characteristics across systems and over time.
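The first two observations combine into the cost argument: if power draw can be moved toward cheaper hours, the bill drops even when total energy is unchanged. A small sketch of that arithmetic follows. The function `energy_cost`, the hourly sampling, and the prices are assumptions for illustration; only the 30 to 90 kW/rack range comes from the observations above.

```python
def energy_cost(power_kw_by_hour, price_per_kwh_by_hour):
    """Total cost of a power trace sampled once per hour."""
    if len(power_kw_by_hour) != len(price_per_kwh_by_hour):
        raise ValueError("power and price series must have the same length")
    # Each sample covers one hour, so kW * 1 h = kWh.
    return sum(p * c for p, c in zip(power_kw_by_hour, price_per_kwh_by_hour))


# Hypothetical 4-hour window: two off-peak hours, then two on-peak hours.
prices = [0.03, 0.03, 0.09, 0.09]    # $/kWh, illustrative values only
flat = [60.0, 60.0, 60.0, 60.0]      # kW for one rack, constant draw
shifted = [90.0, 90.0, 30.0, 30.0]   # same total energy, moved off-peak

flat_cost = energy_cost(flat, prices)
shifted_cost = energy_cost(shifted, prices)
```

Both traces consume 240 kWh, but the shifted trace costs less because more of the energy is drawn while the price is low.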

Job power distribution on BG/Q

Job arrival rate on BG/Q in December 2012.

Methodology:
o Power budget
o Scheduling window
o 0-1 Knapsack model
o Greedy sorting
o Job power profiling
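As a rough sketch of how greedy sorting could work with a power budget and a scheduling window, the code below orders the jobs in the window by power per node and admits them while they fit. This is an assumed reading, not the poster's exact policy: the `Job` fields, the function `greedy_select`, and the choice to favor low-power jobs on-peak and high-power jobs off-peak are all hypothetical.

```python
from typing import NamedTuple


class Job(NamedTuple):
    name: str
    nodes: int
    power_kw: float  # assumed to come from job power profiling


def greedy_select(window, power_budget_kw, free_nodes, on_peak):
    """Pick jobs from the scheduling window without exceeding the budget."""
    # On-peak: try low power-per-node jobs first; off-peak: power-hungry first.
    ordered = sorted(window, key=lambda j: j.power_kw / j.nodes,
                     reverse=not on_peak)
    chosen = []
    for job in ordered:
        if job.nodes <= free_nodes and job.power_kw <= power_budget_kw:
            chosen.append(job)
            free_nodes -= job.nodes
            power_budget_kw -= job.power_kw
    return chosen


window = [Job("A", 1024, 30.0), Job("B", 1024, 60.0), Job("C", 2048, 70.0)]
on_peak_pick = greedy_select(window, 100.0, 4096, on_peak=True)
off_peak_pick = greedy_select(window, float("inf"), 4096, on_peak=False)
```

With a 100 kW budget on-peak, jobs A and C are admitted and B waits; off-peak, with no effective budget, all three run with B first.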

Evaluation:
o Energy cost saving
o System utilization rate
o Fairness
o Average wait time
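Three of these metrics have common definitions that can be sketched directly; fairness is omitted because the poster does not say how it is measured. The record layout (`submit`, `start`, `end`, `nodes`) and the function names are assumptions for this example.

```python
def average_wait(jobs):
    """Mean time between submission and start."""
    return sum(j["start"] - j["submit"] for j in jobs) / len(jobs)


def utilization(jobs, total_nodes, t_begin, t_end):
    """Fraction of node-time in [t_begin, t_end] occupied by jobs."""
    used = sum(
        j["nodes"] * (min(j["end"], t_end) - max(j["start"], t_begin))
        for j in jobs
        if j["end"] > t_begin and j["start"] < t_end
    )
    return used / (total_nodes * (t_end - t_begin))


def cost_saving(baseline_cost, new_cost):
    """Relative energy cost saving versus a baseline such as FCFS."""
    return (baseline_cost - new_cost) / baseline_cost


# Illustrative two-job schedule on a 1024-node machine.
jobs = [
    {"submit": 0, "start": 0, "end": 100, "nodes": 512},
    {"submit": 10, "start": 50, "end": 150, "nodes": 512},
]
wait = average_wait(jobs)
util = utilization(jobs, total_nodes=1024, t_begin=0, t_end=200)
saving = cost_saving(100.0, 90.0)
```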

Problem statement:
o How can jobs with different power profiles be scheduled so as to save as much energy cost as possible while affecting system utilization and scheduling fairness as little as possible?

0-1 Knapsack Model:

To determine a binary vector X such that
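In the standard 0-1 knapsack form, one plausible statement of this model is given below. The symbols are assumptions, not the poster's notation: the scheduling window holds J jobs, job i requests n_i nodes and draws p_i kW, P is the power budget, and x_i = 1 means job i is selected to run, so that X = (x_1, ..., x_J).

```latex
\begin{aligned}
\text{maximize}\quad & \sum_{i=1}^{J} n_i \, x_i \\
\text{subject to}\quad & \sum_{i=1}^{J} p_i \, x_i \le P, \\
& x_i \in \{0, 1\}, \quad i = 1, \dots, J
\end{aligned}
```

Under this reading, the objective keeps as many nodes busy as possible while the constraint caps the total power of the selected jobs.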

Dynamic programming:
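A 0-1 knapsack of this kind is usually solved with the textbook dynamic program over the budget. The sketch below assumes integer power values in kW, takes the node count as the value to maximize, and uses a hypothetical function name `knapsack_select`; it is not taken from the poster.

```python
def knapsack_select(jobs, power_budget_kw):
    """Choose jobs maximizing total nodes within an integer power budget.

    jobs: list of (name, nodes, power_kw) with integer power_kw
    (an assumption needed for the table-based dynamic program).
    Returns (total_nodes, [names of chosen jobs]).
    """
    # best[w] = best (total nodes, chosen names) using at most w kW.
    best = [(0, [])] * (power_budget_kw + 1)
    for name, nodes, power in jobs:
        # Walk the budget downward so each job is used at most once.
        for w in range(power_budget_kw, power - 1, -1):
            candidate = best[w - power][0] + nodes
            if candidate > best[w][0]:
                best[w] = (candidate, best[w - power][1] + [name])
    return best[power_budget_kw]


# Illustrative window of three jobs under a 100 kW budget.
jobs = [("A", 1024, 30), ("B", 1024, 60), ("C", 2048, 70)]
total_nodes, chosen = knapsack_select(jobs, 100)
```

The running time is proportional to the number of jobs in the window times the budget, which stays small when the scheduling window is short.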

Example of Job Power Aware Scheduling

During the first half of the month, the machine ran acceptance-testing jobs, so most jobs were large. The second half was used for early science by users, so most jobs were small (e.g., 1-rack).

• Control power usage during the on-peak pricing period.
• Save up to 23% of energy cost.
• Limited negative effect on scheduling fairness.
• System utilization is only slightly impacted.
• Comprehensive sensitivity study conducted by tuning the power range and pricing ratio.

Energy cost savings

System utilization

Results from ANL BG/P trace.

Results from SDSC-Blue trace.

• Energy cost savings of up to 10% with our job power aware scheduling: the monthly energy cost saving ranges from 0.5% to 10% with Greedy and from 2% to 10% with Knapsack.

• Utilization degradation is always less than 5%.

•The variation of average wait time is within 10 seconds.

We collected the job trace from Mira, the new 48-rack IBM Blue Gene/Q machine at Argonne, in December 2012. Compared with FCFS, our design achieves monthly energy cost savings of 5.4% and 9.98% when using a 10-second scheduling interval and a 30-second frequency, respectively.

Average daily power consumption

Average daily system utilization

Zhou Zhou1, Xu Yang1, Sean Wallace, Zhiling Lan1, Wei Tang2, Narayan Desai2, Susan Coghlan2, and Mike E. Papka2

1Illinois Institute of Technology, 2Argonne National Laboratory
