DPM Monitoring
Wahid Bhimji, University of Edinburgh
Apr-10 Wahid Bhimji – Files access
Intro
• New DPM developer Alejandro Álvarez Ayllón is working on new Nagios-based DPM monitoring
– List of probes: https://twiki.cern.ch/twiki/bin/view/EGEE/LCGDMMonitoring
– Bridge to examples running at CERN: http://aalvarez.web.cern.ch/aalvarez/cgi/bridge.py/gt-septic/nagios3/
• He's happy to add more probes (very responsive). He also wants feedback on sensible WARN / FAIL values
• We can also contribute our own probes
LCGDM plugins
• Check validity of host certificates
– check_hostcert
– Warning and critical configurable: days until the certificate expires
• DB password lifetime
– check_oracle_expiration
– Warning and critical configurable: days until the password expires
– Connection string, user and password can be specified
• Disk partitions activity (bytes/s in and out)
– check_partition_activity
– No warning or critical criteria
– Individual disks can be selected
• CPU utilization (System/Idle/IOwait/IRQ)
– check_cpu
– Warning and critical configurable: upper limit of CPU percentage per category
• Network activity: bytes/s in and out (and error percentage)
– check_network
– No warning or critical criteria
– Individual interfaces can be selected
• Pool free space plus filesystem status
– check_dpm_pool
– Warning and critical configurable: free space per subsystem or per pool, specified as bytes (with suffixes K, M, G, T, P)
– Individual pools can be selected, but not filesystems
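To make the warning/critical idea concrete, here is a minimal sketch of how a free-space threshold with K/M/G/T/P suffixes could be evaluated and mapped onto the standard Nagios exit codes. It is not the check_dpm_pool implementation: the function names are hypothetical, and treating the suffixes as multiples of 1024 is an assumption.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumption: suffixes are binary multiples (1K = 1024 bytes).
_SUFFIXES = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4, "P": 1024**5}


def parse_size(text):
    """Convert a string such as '500G' or '2T' into a number of bytes."""
    text = text.strip().upper()
    if text and text[-1] in _SUFFIXES:
        return int(float(text[:-1]) * _SUFFIXES[text[-1]])
    return int(text)


def free_space_status(free_bytes, warn, crit):
    """Return a Nagios status for the given free space.

    The status is CRITICAL below the critical threshold, WARNING below
    the warning threshold, and OK otherwise.
    """
    if free_bytes < parse_size(crit):
        return CRITICAL
    if free_bytes < parse_size(warn):
        return WARNING
    return OK
```

For example, `free_space_status(parse_size("1.5T"), warn="2T", crit="1T")` yields WARNING, because 1.5T lies between the two thresholds.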
LCGDM probes cont.
• Collecting information about disk server activity (network, disk I/O, memory, number of connections), splitting the information between sequential I/O (gridFTP and rfcp) and random I/O (rfio and xroot)
– check_process can be used for that, except for disk I/O and network usage (apparently a kernel patch is needed for that)
– Warning and critical configurable: number of instances, % of CPU, % of memory, number of threads, number of connections, number of file descriptors
– Individual processes can be selected
• DPNS ping
– check_dpns
– Warning and critical configurable: ping time in milliseconds
– Can be used remotely
• GridFTP
– check_gridftp
– No warning criteria. Critical if a file cannot be uploaded or downloaded, or if the comparison is not successful
– Can be used remotely
• Published information
– check_dpm_infosys
– No warning criteria. Critical if any of the requested information is not being published
– Can be used remotely
• RFIO
– check_rfio
– Everything that applies to the GridFTP probe. Can NOT be executed locally
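The upload / download / compare logic described for the GridFTP and RFIO probes can be sketched as below. This is an illustration only, not the check_gridftp code: `roundtrip_check` is a hypothetical name, and the `upload` and `download` callables are placeholders for whatever transfer client a site actually uses.

```python
import hashlib
import os
import tempfile

# Standard Nagios plugin exit codes (only the two used here).
OK, CRITICAL = 0, 2


def _sha256(path):
    """Return the SHA-256 hex digest of a file."""
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()


def roundtrip_check(upload, download, payload=b"nagios-probe-test\n"):
    """Write a test file, upload it, download it again, and compare.

    `upload(local_path)` and `download(local_path)` are caller-supplied
    (hypothetical) functions that raise an exception on failure.
    Returns a (status, message) pair; there is no WARNING state.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src")
        dst = os.path.join(tmp, "dst")
        with open(src, "wb") as handle:
            handle.write(payload)
        try:
            upload(src)
            download(dst)
            same = _sha256(src) == _sha256(dst)
        except Exception as exc:
            return CRITICAL, f"transfer failed: {exc}"
        if not same:
            return CRITICAL, "downloaded file differs from uploaded file"
        return OK, "round trip succeeded"
```

Passing the transfer steps in as callables keeps the comparison logic independent of the protocol, so the same check could wrap a GridFTP or an RFIO client.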
From NAGIOS itself
• DB activity and size – NAGIOS: check_oracle, check_mysql
• Number of processes and threads in use – NAGIOS: check_procs (not threads, though)
• Check if filesystem correctly mounted – NAGIOS: check_disk already does this
• Disk partitions: used and free – NAGIOS: check_disk
• Memory: swap, free and used – NAGIOS: check_swap
• Load average – NAGIOS: check_load
From grid-monitoring
• Check validity of CRLs – crls from org.sam.sec
• Check validity of CAs – check_ca_dist
• Number of sockets used for RFIO and number of sockets used for gridFTP – check_netstat.pl from Nagios Exchange can be used for that.
• Socket count – check_netstat.pl does that and much more.
• Directory size – check_dirsize.sh may be useful.
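As a rough illustration of what a directory-size probe measures, the sketch below sums file sizes under a directory tree. It is not check_dirsize.sh (a separate shell script); `directory_size` is a hypothetical helper, and skipping symbolic links is a design assumption.

```python
import os


def directory_size(path):
    """Return the total size in bytes of regular files under `path`.

    Symbolic links are skipped so that linked files are not counted
    twice and dangling links do not raise errors.
    """
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total
```

The result could then be compared against warning and critical limits in the same way as the other probes.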
Can plot stuff with pnp4nagios
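Graphing tools such as pnp4nagios work from the performance data that a plugin prints after a `|` in its output line, in the form `label=value[UOM];warn;crit;min;max`. The sketch below builds such a line; the helper names and the example label are hypothetical, and labels containing spaces (which would need quoting) are not handled.

```python
def perfdata(label, value, uom="", warn="", crit="", minimum="", maximum=""):
    """Format one performance-data item as label=value[UOM];warn;crit;min;max.

    Trailing empty fields are dropped, which the plugin guidelines allow.
    """
    fields = [f"{label}={value}{uom}", warn, crit, minimum, maximum]
    return ";".join(str(field) for field in fields).rstrip(";")


def plugin_line(status_text, *items):
    """Join the human-readable status text and any perfdata items."""
    if not items:
        return status_text
    return f"{status_text} | {' '.join(items)}"


if __name__ == "__main__":
    # Hypothetical output for a ping-style probe with 100/500 ms thresholds.
    print(plugin_line("PING OK", perfdata("time", 12.5, "ms", 100, 500)))
```

Including the warning and critical values in the performance data lets the graph show the thresholds alongside the measured value.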
Conclusions / Questions
• This is nice - Take a look at the probes and give me or Alex some feedback
• Or try it out yourself. Not tied to any release: http://etics-repository.cern.ch:8080/repository/pm/volatile/repomd/name/lcgdm_head_sl5_x86_64_gcc412/index.html
• Do we want to add performance info into this?
– Like what was in GridPPDPMMonitor
– Summer student Martin (see DPM Stressing talk) could _maybe_ do some of that