Resource Management Analysis and Accounting
Mike Showerman, Mark Klein, Joshi Fullop and Jeremy Enos
NCSA Blue Waters
mshow@ncsa.illinois.edu


Interfaces to Accounting data

• Allocations and accounting database
  • Command line
  • Portal
  • User interface
• Spreadsheets of operational metrics
  • For reporting to management (going away, I hope)
• Integrated System Console


Complex policies can cause confusion

• Prioritizing specific workloads leads to inefficiencies, and that's OK
  • We are capability-job focused
• Can be challenging to determine the source of a utilization drop
  • Policy
  • System issue
  • Available workload


Full feature view


More typical view


Feature descriptions

• Draining
• Missing or mismatch
• Unavailable
• Drain thrashing


Moab Torque Alps


Source: CUG 2013 paper by Matt Ezell (ORNL)

[Figure labels: qsub; Allocation and accounting database]


Challenges in usage data: A man with one source of usage data knows his usage

• MAM interface (previously Gold)
  • The good
    • Real-time logging and allocation integration
    • Includes reservations in accounting
  • The bad
    • Undocumented
    • No retry if data fails to send
      • Path goes across the HSN. You would be shocked how often there is a failure sending data to an outside server.
    • Incomplete data (gres and more)
• Components communicate but do not coordinate


Semi-manual analysis

Alpsevents query

JobId   Date                 Epoch       Event     ResId  apid
485968  2013-11-13 23:33:37  1384407217  bound     1700   0
485968  2013-11-13 23:33:39  1384407219  placed    1700   2418935
485968  2013-11-13 23:33:40  1384407220  released  1700   2418935
485968  2013-11-13 23:33:40  1384407220  placed    1700   2418936
485968  2013-11-13 23:35:03  1384407303  released  1700   2418936
485968  2013-11-13 23:35:03  1384407303  canceled  1700   2418934
485968  2013-11-13 23:35:04  1384407304  removed   1700   2418934
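The event rows above can be reduced to per-application run times with a few lines of Python. This is a minimal sketch, not the tooling used on Blue Waters: the `events` list is typed in by hand from the Alpsevents query, and the function name `apid_durations` is hypothetical. It pairs each `placed` event with the `released` event for the same apid and also measures the reservation span from `bound` to `removed`.

```python
# Rows copied from the Alpsevents query: (epoch, event, apid).
events = [
    (1384407217, "bound", 0),
    (1384407219, "placed", 2418935),
    (1384407220, "released", 2418935),
    (1384407220, "placed", 2418936),
    (1384407303, "released", 2418936),
    (1384407303, "canceled", 2418934),
    (1384407304, "removed", 2418934),
]

def apid_durations(rows):
    """Return {apid: seconds between its 'placed' and 'released' events}."""
    placed = {}      # apid -> epoch of the pending 'placed' event
    durations = {}
    for epoch, event, apid in rows:
        if event == "placed":
            placed[apid] = epoch
        elif event == "released" and apid in placed:
            durations[apid] = epoch - placed.pop(apid)
    return durations

durations = apid_durations(events)
app_seconds = sum(durations.values())

# Reservation span: first 'bound' event to last 'removed' event.
bound = min(epoch for epoch, event, _ in events if event == "bound")
removed = max(epoch for epoch, event, _ in events if event == "removed")
reservation_seconds = removed - bound

print(durations)
print(app_seconds, reservation_seconds)
```

For job 485968 this gives 84 seconds inside applications (1 s and 83 s for the two apids) out of an 87 second reservation.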


Accounting Database

Field         Value
job_id        485968
login         fiedler
account       jme
machine       nid11293
group_name    vendor_cray
start_time    11/13/2013 11:33:03 PM
end_time      11/13/2013 11:55:30 PM
queue         normal
wallduration  1099
charge        6.11
nodes         20
processors    640
qos           sub_25p

shredded_job_pbs

Field                    Value
shredded_job_pbs_id      185482
job_id                   485968
job_array_index          -1
host                     BW
queue                    normal
user                     fiedler
groupname                vendor_cray
ctime                    1384406706
qtime                    1384406706
start                    1384407217
end                      1384407304
etime                    1384406706
exit_status              0
session                  22725
jobname                  test_links
owner                    fiedler@h2ologin1
account                  jme
exec_host                NodesRemoved
resources_used_vmem      157659136
resources_used_mem       11030528
resources_used_walltime  87
resources_used_nodes     20
resources_used_cpus      640
resources_used_cput      1
resource_list_nodes      20:ppn=32:xe
resource_list_neednodes  20:ppn=32:xe
resource_list_walltime   300
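Putting the three records for job 485968 side by side shows that they do not report the same wall time. The sketch below is illustrative only: the values are copied from the records above, the dictionary keys are informal labels, and the node-hour formula for `charge` is an assumption that happens to reproduce the stored value, not a documented charging rule.

```python
# Values copied from the records above for job 485968.
accounting_db = {"wallduration": 1099, "nodes": 20, "charge": 6.11}
torque = {"start": 1384407217, "end": 1384407304, "resources_used_walltime": 87}
alps = {"bound": 1384407217, "removed": 1384407304}

# Wall time in seconds as each source would report it.
walltime_by_source = {
    "accounting_db": accounting_db["wallduration"],
    "torque": torque["end"] - torque["start"],
    "alps": alps["removed"] - alps["bound"],
}

# Assumption: charge is node-hours, i.e. nodes * wallduration / 3600.
assumed_charge = round(
    accounting_db["nodes"] * accounting_db["wallduration"] / 3600, 2
)

# Flag the job for manual review when the sources disagree on wall time.
spread = max(walltime_by_source.values()) - min(walltime_by_source.values())
needs_review = spread > 0

print(walltime_by_source)
print(assumed_charge, spread, needs_review)
```

Torque and Alps agree on 87 seconds, while the accounting database holds 1099 seconds, a gap of 1012 seconds that a script like this would flag for a closer look.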


Integrated System Console

• Does a wide array of tasks
  • Just focusing on the relevant parts here
• Event and log processing and storage engine
  • Trigger alerts based on event templates
• Parse and store logged data
  • Moab/Torque/Alps/HSN/storage/nodes
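As an illustration of template-driven alerting, the sketch below matches log lines against regular-expression templates and records an alert for each hit. It is not the ISC implementation: the template names, the patterns, and the sample log lines are all hypothetical.

```python
import re

# Hypothetical event templates: template name -> compiled pattern.
TEMPLATES = {
    "alps_cancel": re.compile(r"\bcancel(?:ed)?\b.*\bapid[ =](?P<apid>\d+)"),
    "node_down": re.compile(r"\b(?P<node>nid\d+)\b.*\bdown\b"),
}

def match_events(lines):
    """Return a list of (template_name, captured_fields) for matching lines."""
    alerts = []
    for line in lines:
        for name, pattern in TEMPLATES.items():
            m = pattern.search(line)
            if m:
                alerts.append((name, m.groupdict()))
    return alerts

# Made-up log lines, for demonstration only.
sample = [
    "alps: canceled reservation 1700 apid 2418934",
    "torque: job 485968 started",
    "hsn: nid11293 marked down",
]
alerts = match_events(sample)
print(alerts)
```

Keeping the templates in a plain mapping means a new alert is a one-line addition, and the captured fields can be stored alongside the parsed log record.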


ISC will do more

• Integrated data gives us the power to make better decisions
  • When Alps sees more than one cancel: is the filesystem down? If so, take accounting action; if not, alert.
  • Moab time/Alps time/Torque time out of sync… Adjust charging?
  • Filesystem issue: should walltime limits be increased?
  • Hole in torus… prevent some jobs from starting?
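The first rule on this slide can be written as a small decision function. This is a sketch of the logic as stated, assuming the filesystem status is available as a boolean from some other monitor; the function name, the return labels, and the input shape are hypothetical.

```python
def cancel_rule(alps_events, filesystem_down):
    """Apply the 'more than one cancel' rule to one job's Alps events.

    alps_events: list of event names for the job, e.g. ["bound", "placed"].
    filesystem_down: True if a filesystem outage overlaps the job
                     (assumed to come from a separate monitor).
    """
    cancels = sum(1 for event in alps_events if event == "canceled")
    if cancels <= 1:
        return "none"                # rule does not fire
    if filesystem_down:
        return "accounting_action"   # known cause: take accounting action
    return "alert"                   # no known cause: alert

# One cancel, as for job 485968 in the Alpsevents query: nothing to do.
print(cancel_rule(["bound", "placed", "released", "canceled", "removed"], False))
# Repeated cancels during a filesystem outage.
print(cancel_rule(["placed", "canceled", "placed", "canceled"], True))
# Repeated cancels with a healthy filesystem.
print(cancel_rule(["placed", "canceled", "placed", "canceled"], False))
```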


Additional data to collect

• Where is the time really going?
  • Sources: Moab/Torque/Alps issue, HSN, filesystem
• Do you account for it in detail? Suspect time?


[Figure labels: Job Time, Overhead, Moab Time, Torque Time, Alps Time, User Time]
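One way to approach the question of where the time is going is to subtract the time the application actually ran from the time the job occupied its nodes. The sketch below assumes that "job time" is the Torque wall time and that "user time" is the sum of the Alps placed-to-released intervals; both mappings are assumptions made for illustration, as is the function name.

```python
def overhead_breakdown(job_seconds, user_seconds):
    """Split job time into user time and overhead.

    Returns (overhead_seconds, overhead_fraction_of_job_time).
    """
    if job_seconds <= 0:
        raise ValueError("job_seconds must be positive")
    overhead = job_seconds - user_seconds
    return overhead, overhead / job_seconds

# Job 485968: 87 s of Torque wall time, 84 s inside Alps applications
# (1 s + 83 s for the two apids in the Alpsevents query).
overhead, fraction = overhead_breakdown(87, 84)
print(overhead, round(fraction, 3))
```

Splitting the overhead further into Moab, Torque, and Alps shares would need a timestamp from each layer for the same job, which is the additional data this slide asks for.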


Node state Accounting

• Job failures can cause nodes to become suspect
  • Often very large subsets of the system
  • Overhead has not been quantified
• Extension of the SDB database to trigger on state changes
  • Store node state change data in ISC
• Account for reduced availability
  • Begin collecting MTTI data
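Once node state changes are stored, reduced availability and MTTI can be derived from them directly. The sketch below is a minimal illustration under stated assumptions: the state-change records are invented, any state other than `up` counts as unavailable, an interrupt is defined as a transition out of `up`, and MTTI is taken as the observation window divided by the number of interrupts. None of these definitions come from the slides.

```python
# Hypothetical node state changes: (epoch_seconds, node, new_state).
changes = [
    (0,    "nid00001", "up"),
    (0,    "nid00002", "up"),
    (1000, "nid00001", "suspect"),
    (1600, "nid00001", "up"),
    (2500, "nid00002", "down"),
    (3100, "nid00002", "up"),
]
WINDOW_END = 3600  # end of the observation window, in seconds

def unavailable_seconds(rows, window_end):
    """Sum, over all nodes, the seconds spent in any state other than 'up'."""
    last = {}   # node -> (epoch, state) of its most recent change
    total = 0
    for epoch, node, state in sorted(rows):
        if node in last and last[node][1] != "up":
            total += epoch - last[node][0]
        last[node] = (epoch, state)
    # Close out nodes that are still not 'up' at the end of the window.
    for epoch, state in last.values():
        if state != "up":
            total += window_end - epoch
    return total

def count_interrupts(rows):
    """Count transitions out of 'up' (assumed definition of an interrupt)."""
    last_state = {}
    count = 0
    for _, node, state in sorted(rows):
        if last_state.get(node) == "up" and state != "up":
            count += 1
        last_state[node] = state
    return count

lost = unavailable_seconds(changes, WINDOW_END)
n_nodes = len({node for _, node, _ in changes})
availability = 1 - lost / (n_nodes * WINDOW_END)
n_interrupts = count_interrupts(changes)
mtti = WINDOW_END / n_interrupts  # sample data has interrupts; guard for zero in real use

print(lost, round(availability, 3), n_interrupts, mtti)
```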
