Resource Management Analysis and Accounting
Mike Showerman, Mark Klein, Joshi Fullop and Jeremy Enos
NCSA Blue [email protected]
Interfaces to Accounting data
• Allocations and accounting database
• Command line
• Portal
• User interface
• Spreadsheets of operational metrics
• For reporting to management (going away I hope)
• Integrated System Console
Complex policies can cause confusion
• Prioritizing specific workloads leads to inefficiencies – That’s OK
• We are capability job focused
• Can be challenging to determine source of utilization drop
• Policy
• System issue
• Available workload
Full feature view
More typical view
Feature descriptions
Xtreme Accounting
[Figure labels: Draining; Missing or mismatch; Unavailable; Drain Thrashing; Moab, Torque, Alps]
Source: CUG 2013 paper by Matt Ezell (ORNL) [email protected]
[Diagram labels: Allocation and accounting database; qsub]
Challenges in usage data: A man with one source of usage data knows his usage
• MAM interface (previously Gold)
• The good
• Realtime logging allocation integration
• Includes reservations in accounting
• The bad
• Undocumented
• No retry if data fails to send
• Path goes across HSN. You would be shocked how often there is a failure sending data to an outside server.
• Incomplete data (gres and more)
• Components communicate but do not coordinate
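The "no retry" gap listed above could in principle be closed with a small spool-and-retry wrapper. The following is a minimal sketch only: `send`, `send_with_retry`, and `flush_spool` are hypothetical names, and nothing here reflects the actual MAM interface.

```python
import time
from collections import deque
from typing import Callable, Deque


def send_with_retry(record: str, send: Callable[[str], None],
                    spool: Deque[str], attempts: int = 3,
                    base_delay: float = 1.0) -> bool:
    """Try to send one usage record; spool it locally if every attempt fails.

    `send` is a hypothetical callable that raises OSError on failure.
    """
    for attempt in range(attempts):
        try:
            send(record)
            return True
        except OSError:
            if attempt < attempts - 1:
                # Exponential backoff between attempts.
                time.sleep(base_delay * (2 ** attempt))
    spool.append(record)  # keep the record instead of silently dropping it
    return False


def flush_spool(send: Callable[[str], None], spool: Deque[str]) -> int:
    """Resend spooled records in order; stop at the first failure."""
    sent = 0
    while spool:
        try:
            send(spool[0])
        except OSError:
            break
        spool.popleft()
        sent += 1
    return sent
```

Spooling locally means a failure on the path to the outside server delays a record rather than losing it.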
Semi-Manual analysis
Alpsevents Query

JobId   Date                 Epoch       Event     ResId  apid
485968  2013-11-13 23:33:37  1384407217  bound     1700   0
485968  2013-11-13 23:33:39  1384407219  placed    1700   2418935
485968  2013-11-13 23:33:40  1384407220  released  1700   2418935
485968  2013-11-13 23:33:40  1384407220  placed    1700   2418936
485968  2013-11-13 23:35:03  1384407303  released  1700   2418936
485968  2013-11-13 23:35:03  1384407303  canceled  1700   2418934
485968  2013-11-13 23:35:04  1384407304  removed   1700   2418934
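The semi-manual step above can be sketched in a few lines of Python. The rows are copied from the Alpsevents query; treating placed-to-released as application run time and bound-to-removed as the reservation lifetime is an assumption made for illustration, not something the slides state.

```python
# (epoch, event, apid) rows copied from the Alpsevents query for job 485968.
EVENTS = [
    (1384407217, "bound", 0),
    (1384407219, "placed", 2418935),
    (1384407220, "released", 2418935),
    (1384407220, "placed", 2418936),
    (1384407303, "released", 2418936),
    (1384407303, "canceled", 2418934),
    (1384407304, "removed", 2418934),
]


def app_runtimes(events):
    """Seconds between 'placed' and 'released' for each apid (assumed meaning)."""
    placed = {}
    runtimes = {}
    for epoch, event, apid in events:
        if event == "placed":
            placed[apid] = epoch
        elif event == "released" and apid in placed:
            runtimes[apid] = epoch - placed.pop(apid)
    return runtimes


def reservation_lifetime(events):
    """Seconds between the first 'bound' and the first 'removed' event."""
    bound = next(epoch for epoch, event, _ in events if event == "bound")
    removed = next(epoch for epoch, event, _ in events if event == "removed")
    return removed - bound


runtimes = app_runtimes(EVENTS)
lifetime = reservation_lifetime(EVENTS)
# Reservation time not covered by any placed application.
overhead = lifetime - sum(runtimes.values())
print(runtimes, lifetime, overhead)
```

Under these assumptions the reservation lasts 87 seconds, of which 84 are covered by the two placed applications.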
Accounting Database

Column        Value
job_id        485968
login         fiedler
account       jme
machine       nid11293
group_name    vendor_cray
start_time    11/13/2013 11:33:03 PM
end_time      11/13/2013 11:55:30 PM
queue         normal
wallduration  1099
charge        6.11
nodes         20
processors    640
qos           sub_25p
shredded_job_pbs

Column                    Value
shredded_job_pbs_id       185482
job_id                    485968
job_array_index           -1
host                      BW
queue                     normal
user                      fiedler
groupname                 vendor_cray
ctime                     1384406706
qtime                     1384406706
start                     1384407217
end                       1384407304
etime                     1384406706
exit_status               0
session                   22725
jobname                   test_links
owner                     fiedler@h2ologin1
account                   jme
exec_host                 NodesRemoved
resources_used_vmem       157659136
resources_used_mem        11030528
resources_used_walltime   87
resources_used_nodes      20
resources_used_cpus       640
resources_used_cput       1
resource_list_nodes       20:ppn=32:xe
resource_list_neednodes   20:ppn=32:xe
resource_list_walltime    300
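The two records for job 485968 can be compared directly. This sketch uses only the values shown above; the formula `seconds * nodes / 3600` is an assumption about how the charge is derived (it happens to reproduce the listed 6.11), not a documented rule.

```python
from datetime import datetime


def node_hours(seconds: int, nodes: int) -> float:
    """Assumed charging formula: node-hours, rounded to two decimals."""
    return round(seconds * nodes / 3600, 2)


# Values copied from the accounting database record.
DB_FORMAT = "%m/%d/%Y %I:%M:%S %p"
db_start = datetime.strptime("11/13/2013 11:33:03 PM", DB_FORMAT)
db_end = datetime.strptime("11/13/2013 11:55:30 PM", DB_FORMAT)
db_span = int((db_end - db_start).total_seconds())  # end_time - start_time
db_node_hours = node_hours(1099, 20)                # from wallduration

# Values copied from the shredded_job_pbs record.
torque_elapsed = 1384407304 - 1384407217            # end - start
torque_node_hours = node_hours(87, 20)              # from resources_used_walltime

print(db_span, db_node_hours, torque_elapsed, torque_node_hours)
```

The three durations for the same job (1347 s between the database timestamps, 1099 s of charged wallduration, 87 s in the Torque record) do not agree, which is the kind of mismatch a single integrated view is meant to expose.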
Integrated System Console
• Does a wide array of tasks
• Just focusing on relevant parts
• Event and log processing and storage engine
• Trigger alerts based on event templates
• Parse and store logged data
• Moab/torque/alps/hsn/storage/nodes
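Template-driven parsing of this kind can be illustrated with regular expressions. This is a sketch under stated assumptions, not ISC's implementation: the template names, the patterns, and the log line formats they match are all invented for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class Template:
    name: str
    pattern: re.Pattern
    alert: bool  # whether a match should also raise an alert


# Hypothetical templates; real Moab/Torque/ALPS log formats differ.
TEMPLATES = [
    Template("alps_cancel",
             re.compile(r"alps .*canceled resid=(?P<resid>\d+)"), True),
    Template("job_start",
             re.compile(r"torque .*job (?P<job>\d+) started"), False),
]


def process(line, store, alerts):
    """Match a log line against the templates; store it and alert if flagged."""
    for template in TEMPLATES:
        match = template.pattern.search(line)
        if match:
            event = {"template": template.name, **match.groupdict()}
            store.append(event)
            if template.alert:
                alerts.append(event)
            return event
    return None  # unmatched lines are ignored in this sketch
```

Keeping the templates as data means new event types can be added without changing the processing loop.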
ISC will do more
• Integrated data give us the power to make better decisions
• When alps sees more than 1 cancel… is the filesystem down? If so, take accounting action; if not, alert
• Moab time/alps time/torque time out of sync… Adjust charging?
• Filesystem issue: should walltime limits be increased?
• Hole in torus… prevent some jobs from starting?
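The first two questions above amount to simple decision rules once the data sit in one place. A minimal sketch, in which the function names, the returned action labels, and the sample timestamps are all hypothetical:

```python
def cancel_action(cancel_count: int, filesystem_down: bool) -> str:
    """Rule from the slide: more than one cancel triggers a decision."""
    if cancel_count <= 1:
        return "none"
    return "accounting_action" if filesystem_down else "alert"


def clock_skew(times: dict) -> int:
    """Largest difference, in seconds, between per-component timestamps."""
    return max(times.values()) - min(times.values())


# Illustrative timestamps for one event as seen by each component.
sample = {"moab": 1384407217, "torque": 1384407217, "alps": 1384407219}
print(cancel_action(2, filesystem_down=True), clock_skew(sample))
```

Whether a given skew should change the charge is a policy question the slide leaves open; the sketch only measures it.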
Additional data to collect
• Where is the time really going?
• Sources: moab/torque/alps issue, hsn, filesystem
• Do you account for it in detail? Suspect time?
[Figure labels: Job Time; Overhead; Moab Time; Torque Time; Alps Time; User Time]
Node state Accounting
• Job failures can cause nodes to become suspect
• Often very large subsets of the system
• Overhead has not been quantified
• Extension of the SDB database to trigger on state changes
• Store node state change data in ISC
• Account for reduced availability
• Begin collecting MTTI data
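Once node state changes are stored, reduced availability and MTTI can be computed from them. The sketch below assumes a hypothetical record layout of `(epoch, node, state)`, treats any state other than "up" as unavailable, and uses the simple definition MTTI = observation window / number of interrupts; none of this is taken from the SDB schema.

```python
def downtime_node_seconds(changes, window_end):
    """Sum node-seconds spent in any state other than 'up'.

    `changes` is a list of (epoch, node, state) tuples sorted by epoch.
    """
    down_since = {}
    lost = 0
    for epoch, node, state in changes:
        if state != "up" and node not in down_since:
            down_since[node] = epoch
        elif state == "up" and node in down_since:
            lost += epoch - down_since.pop(node)
    # Nodes still unavailable at the end of the window.
    for start in down_since.values():
        lost += window_end - start
    return lost


def availability(changes, n_nodes, window_start, window_end):
    """Fraction of node-seconds in the window during which nodes were up."""
    total = n_nodes * (window_end - window_start)
    return 1 - downtime_node_seconds(changes, window_end) / total


def mtti(window_seconds, interrupts):
    """Mean time to interrupt over the window (infinite if none occurred)."""
    return window_seconds / interrupts if interrupts else float("inf")
```

For example, with four nodes over a 1000-second window, one node suspect from t=100 to t=400 and another down from t=500 onward, 800 of 4000 node-seconds are lost.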
Xtreme Accounting