Resource Management on Blue Waters David King, Sr. HPC Engineer [email protected] July 13, 2020 UNCLASSIFIED


Resource Management on Blue Waters

David King, Sr. HPC Engineer
[email protected]

July 13, 2020

UNCLASSIFIED


Agenda

• Job Scheduling Goals
• Understanding the Needs of the Users
• Configuration Parameterization
• Incentivizing User Behavior
• Topology Awareness
• Weekly Resource Management Discussion

Managing HPC Systems and Centers


Job Scheduling Goals

• Know and prioritize goals to measure against
• Scheduling goals are frequently in conflict: it’s a balancing act
• If there is dissatisfaction with the scheduler
  1. Identify if goals are being met by configuration
  2. Question whether the dissatisfaction is tolerable, or if goals need adjustment
  3. If goals are adjusted, then adjust configuration to match and then monitor


Job Scheduling Goals (on Blue Waters)

• No user or project is favored by policy
• Large jobs have higher priority
• System is highly utilized
• Job turnaround time is minimized
• Debug queue with fastest turnaround time
• Users can attain higher priority with higher charge to allocation
• Scheduler commands are responsive
• Predictable job start times
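
One way to picture how several of these goals can coexist is a weighted priority score. The sketch below is illustrative only: the `Job` fields, the `priority` function, and all three weights are hypothetical and are not the Blue Waters configuration. It assumes that a larger share of the machine and a higher charge factor both raise priority, and that time spent queued raises it slowly.

```python
from dataclasses import dataclass

# Hypothetical weights, chosen only to make the example readable.
SIZE_WEIGHT = 1000.0    # favors large jobs
CHARGE_WEIGHT = 500.0   # favors jobs that accept a higher charge
AGE_WEIGHT = 1.0        # slowly favors jobs that have waited longer


@dataclass
class Job:
    nodes: int
    charge_factor: float  # 1.0 means the standard charge
    hours_queued: float


def priority(job: Job, system_nodes: int) -> float:
    """Return a priority score; higher scores are scheduled first."""
    size_term = SIZE_WEIGHT * job.nodes / system_nodes
    charge_term = CHARGE_WEIGHT * (job.charge_factor - 1.0)
    age_term = AGE_WEIGHT * job.hours_queued
    return size_term + charge_term + age_term
```

Note that no term depends on who submitted the job, which is one way to keep the first goal (no user or project favored by policy) while still ranking jobs.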


Understanding the Needs of the Users

• Evaluate requirements of users
  • Wall clock time
  • Job turnaround time
  • System availability
  • Multitenancy
• Variables Beyond Control
  • Job geometry (requested resources, such as walltime or nodes)
  • Job volume submitted
  • Walltime accuracy
  • Application stability
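
Walltime accuracy can be made measurable even though it cannot be controlled. A minimal sketch, assuming accuracy is defined as used walltime divided by requested walltime (a common convention, not one stated in the slides); the function names are hypothetical.

```python
from statistics import mean


def walltime_accuracy(used_hours: float, requested_hours: float) -> float:
    """Fraction of the requested walltime that the job actually used."""
    if requested_hours <= 0:
        raise ValueError("requested_hours must be positive")
    return used_hours / requested_hours


def mean_walltime_accuracy(jobs) -> float:
    """Average accuracy over (used_hours, requested_hours) pairs."""
    return mean(walltime_accuracy(used, req) for used, req in jobs)
```

A job that requests 24 hours and finishes in 6 has an accuracy of 0.25. Low values make backfill decisions and start-time predictions less reliable, because the scheduler plans around the requested time.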


Configuration Parameterization

• Identify the tools to manipulate scheduling behavior
  • QoS, Queues, Reservations, Fairshare
• Avoid unnecessarily complex configurations
• Queues might be configured for varying:
  • Priority
  • Time
  • Job size
  • Resource type
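
The four queue dimensions above can be sketched as a small data structure. This is a toy model, not a real scheduler configuration: the queue names, limits, and the `pick_queue` helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Queue:
    name: str
    priority: int        # higher runs sooner
    max_hours: float     # walltime limit
    max_nodes: int       # job size limit
    resource_type: str   # e.g. "cpu" or "gpu"


# Hypothetical queues, kept deliberately few to avoid needless complexity.
QUEUES = [
    Queue("debug", priority=100, max_hours=0.5, max_nodes=16, resource_type="cpu"),
    Queue("normal", priority=10, max_hours=48.0, max_nodes=20000, resource_type="cpu"),
    Queue("gpu", priority=10, max_hours=48.0, max_nodes=4000, resource_type="gpu"),
]


def pick_queue(nodes: int, hours: float, resource_type: str) -> Optional[Queue]:
    """Return the highest-priority queue whose limits admit the job, if any."""
    eligible = [
        q for q in QUEUES
        if q.resource_type == resource_type
        and nodes <= q.max_nodes
        and hours <= q.max_hours
    ]
    return max(eligible, key=lambda q: q.priority, default=None)
```

With these values, a small short job lands in the debug queue, while anything over 48 hours fits nowhere and returns `None`.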


Incentivizing User Behavior

• Discounts provide user incentives to encourage a submission behavior
  • This can be as easy as changing the charge factor for a specific queue
  • Examples: seasonal submission lull, specific job sizes, preemptible queues, backfillable jobs
• Scheduler product built-ins will vary – custom efforts sometimes necessary
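
The effect of a per-queue charge factor is simple arithmetic. The sketch below assumes the charge is node-hours multiplied by the queue's factor; the queue names and factor values are hypothetical and only show how a discount or premium changes what is billed to an allocation.

```python
# Hypothetical charge factors; real values are a site policy decision.
CHARGE_FACTORS = {
    "normal": 1.0,
    "high": 2.0,   # premium: higher priority for a higher charge
    "low": 0.5,    # discount: e.g. preemptible or backfillable work
}


def charge(nodes: int, hours: float, queue: str) -> float:
    """Node-hours billed to the allocation for one job."""
    return nodes * hours * CHARGE_FACTORS[queue]
```

A 100-node, 10-hour job costs 1000 node-hours in the normal queue, 500 in the discounted queue, and 2000 in the premium queue.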


Topology Awareness

• Placing jobs in network locations optimal for tightly coupled communication
• Can be beneficial to some applications by improving performance and runtime consistency
• Represents a constraint and can affect turnaround time
• May reduce utilization, but increase overall throughput through average performance enhancement
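
The last point is a trade-off that a little arithmetic makes concrete. This sketch assumes throughput is proportional to utilization multiplied by average application speed relative to a baseline; the model and the numbers in the usage note are illustrative assumptions, not measurements from Blue Waters.

```python
def relative_throughput(utilization: float, relative_speed: float) -> float:
    """Work completed per unit time, relative to a fully used baseline system."""
    return utilization * relative_speed


def break_even_speed(base_utilization: float, topo_utilization: float) -> float:
    """Average speedup needed for topology-aware placement to pay off."""
    return base_utilization / topo_utilization
```

If placement constraints lower utilization from 0.95 to 0.90 but applications run 10% faster on average, relative throughput rises from 0.95 to about 0.99. Under the same assumption, the break-even speedup is 0.95 / 0.90, roughly 1.056.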


Weekly Resource Management Discussion

• Review tickets submitted that are scheduling related
• View storage utilization and usage
• View system utilization
• Look at wait times per queue and per user
• Look at scheduler performance (response time)
• Review for any user behavior that could potentially affect system procedures and policy
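
The per-queue and per-user wait-time review can be sketched as a simple aggregation over job records. The record layout (dictionaries with `queue`, `user`, `submit`, and `start` keys, times in hours) and the helper name are hypothetical; a real site would pull these fields from its scheduler accounting logs or a tool such as XDMoD.

```python
from collections import defaultdict
from statistics import mean


def mean_wait_by(records, key):
    """Group jobs by `key` ("queue" or "user") and average their wait times."""
    waits = defaultdict(list)
    for rec in records:
        waits[rec[key]].append(rec["start"] - rec["submit"])
    return {group: mean(values) for group, values in waits.items()}
```

Calling `mean_wait_by(records, "queue")` and `mean_wait_by(records, "user")` on the same records gives the two views listed above.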


Filesystem Activity


Filesystem Load and Response Time


Wait Times (Xdmod) and Historical Utilization


Scheduler Statistics and Iteration Time


Questions?

• Email: [email protected]
