DPM Monitoring Wahid Bhimji University of Edinburgh, Apr-10 1 Wahid Bhimji – Files access


DPM Monitoring

Wahid Bhimji, University of Edinburgh

Apr-10


Intro
• New DPM developer Alejandro Álvarez Ayllón is working on new Nagios-based DPM monitoring
  – List of probes: https://twiki.cern.ch/twiki/bin/view/EGEE/LCGDMMonitoring
  – Bridge to examples running at CERN: http://aalvarez.web.cern.ch/aalvarez/cgi/bridge.py/gt-septic/nagios3/
• He's happy to add more probes (very responsive). He also wants feedback on sensible WARN / FAIL values
• We can also contribute our own probes


LCGDM plugins
• Check validity of host certificates
  – check_hostcert
  – Warning and critical configurable: days until the certificate expires
• DB password lifetime
  – check_oracle_expiration
  – Warning and critical configurable: days until the password expires
  – Connection string, user and password can be specified
• Disk partitions activity (bytes/s in and out)
  – check_partition_activity
  – No warning or critical criteria
  – Individual disks can be selected
• CPU utilization (System/Idle/IOwait/IRQ)
  – check_cpu
  – Warning and critical configurable: upper limit of CPU percentage per category
• Network activity: bytes/s in and out (and error percentage)
  – check_network
  – No warning or critical criteria
  – Individual interfaces can be selected
• Pool free space plus filesystem status
  – check_dpm_pool
  – Warning and critical configurable: free space per subsystem or per pool, specified as bytes (with suffixes K, M, G, T, P)
  – Individual pools can be selected, but not filesystems
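To make the threshold convention for free space concrete, here is a minimal sketch of how a size with a K/M/G/T/P suffix could be turned into bytes and compared against WARN and CRITICAL limits. It is not the check_dpm_pool implementation: the function names are hypothetical, and treating the suffixes as powers of 1024 is an assumption (the real probe may use powers of 1000). The numeric states are the standard Nagios plugin exit codes.

```python
import re

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumption: suffixes are powers of 1024; the real probe may differ.
_MULTIPLIERS = {
    "": 1,
    "K": 1024,
    "M": 1024 ** 2,
    "G": 1024 ** 3,
    "T": 1024 ** 4,
    "P": 1024 ** 5,
}


def parse_size(text):
    """Convert a string such as '500G' or '2T' into a number of bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMGTP]?)\s*", text.upper())
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    number, suffix = match.groups()
    return int(float(number) * _MULTIPLIERS[suffix])


def free_space_state(free_bytes, warn, crit):
    """Return a Nagios state for a free-space reading.

    warn and crit are size strings; less free space is worse, so the
    critical limit is checked first.
    """
    if free_bytes < parse_size(crit):
        return CRITICAL
    if free_bytes < parse_size(warn):
        return WARNING
    return OK
```

For example, `free_space_state(parse_size("1T"), warn="2T", crit="500G")` yields WARNING, since 1T is below the warning limit but above the critical one.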


LCGDM probes cont.
• Collecting information about disk server activity (network, disk I/O, memory, number of connections), splitting the information between sequential I/O (gridFTP and rfcp) and random I/O (rfio and xroot)
  – check_process can be used for that, except for disk I/O and network usage (apparently a kernel patch is needed for those)
  – Warning and critical configurable: number of instances, % of CPU, % of memory, number of threads, number of connections, number of file descriptors
  – Individual processes can be selected
• DPNS ping
  – check_dpns
  – Warning and critical configurable: ping time in milliseconds
  – Can be used remotely
• GridFTP
  – check_gridftp
  – No warning criteria. Critical if a file cannot be uploaded or downloaded, or if the comparison is not successful
  – Can be used remotely
• Published information
  – check_dpm_infosys
  – No warning criteria. Critical if any of the requested information is not being published
  – Can be used remotely
• RFIO
  – check_rfio
  – Everything that applies to the GridFTP probe. Can NOT be executed locally.
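The GridFTP and RFIO criteria (critical if upload, download or comparison fails, with no warning level) can be sketched as a round-trip test. This is an illustration only, not the code of check_gridftp or check_rfio: `round_trip_check` is a hypothetical name, and the `upload` and `download` callables are placeholders the caller would supply to wrap whatever transfer client is in use.

```python
import hashlib
import os
import shutil
import tempfile

# Standard Nagios plugin exit codes used by this sketch.
OK, CRITICAL = 0, 2


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()


def round_trip_check(upload, download, payload=b"nagios transfer test\n"):
    """Upload a test file, fetch it back and compare the two copies.

    upload(local_path) must send the file to the storage element and
    download(local_path) must write the fetched copy to local_path.
    Both are hypothetical hooks supplied by the caller.
    """
    workdir = tempfile.mkdtemp()
    try:
        source = os.path.join(workdir, "source")
        fetched = os.path.join(workdir, "fetched")
        with open(source, "wb") as handle:
            handle.write(payload)
        try:
            upload(source)
        except Exception as error:
            return CRITICAL, f"upload failed: {error}"
        try:
            download(fetched)
        except Exception as error:
            return CRITICAL, f"download failed: {error}"
        if not os.path.exists(fetched):
            return CRITICAL, "download produced no file"
        if sha256_of(source) != sha256_of(fetched):
            return CRITICAL, "downloaded file differs from uploaded file"
        return OK, "upload, download and comparison succeeded"
    finally:
        # Always clean up the scratch directory.
        shutil.rmtree(workdir, ignore_errors=True)
```

Passing the transfer steps in as callables keeps the check independent of any particular client, so the same logic could serve both protocols.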


From NAGIOS itself

• DB activity and size – NAGIOS: check_oracle, check_mysql

• Number of processes and threads in use – NAGIOS: check_procs (not threads, though)

• Check if filesystem correctly mounted – NAGIOS: check_disk already does this

• Disk partitions: used and free – NAGIOS: check_disk

• Memory: swap, free and used – NAGIOS: check_swap

• Load average – NAGIOS: check_load


From grid-monitoring

• Check validity of CRLs – crls from org.sam.sec

• Check validity of CAs – check_ca_dist

• Number of sockets used for RFIO and number of sockets used for gridFTP – check_netstat.pl from Nagios Exchange can be used for that.

• Socket count – check_netstat.pl does that and much more.

• Directory size – check_dirsize.sh may be useful.
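As an illustration of what counting sockets per service involves, the sketch below tallies established TCP connections by local port from text in the layout printed by `netstat -tn`. It is not check_netstat.pl. The function name is hypothetical, and the port numbers (5001 for RFIO, 2811 for gridFTP) are assumptions that should be adjusted to the local configuration.

```python
from collections import Counter

# Assumed port numbers; adjust to the local configuration.
SERVICE_PORTS = {"rfio": 5001, "gridftp": 2811}


def count_sockets(netstat_text, service_ports=SERVICE_PORTS):
    """Count ESTABLISHED TCP connections per service from netstat-style text."""
    port_to_service = {port: name for name, port in service_ports.items()}
    counts = Counter({name: 0 for name in service_ports})
    for line in netstat_text.splitlines():
        fields = line.split()
        # Expected columns: proto, recv-q, send-q, local, foreign, state.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local_address, state = fields[3], fields[5]
        if state != "ESTABLISHED":
            continue
        # Take the text after the last colon so IPv6 addresses also work.
        port_text = local_address.rsplit(":", 1)[-1]
        if port_text.isdigit() and int(port_text) in port_to_service:
            counts[port_to_service[int(port_text)]] += 1
    return dict(counts)
```

The resulting counts could then be compared against WARN / FAIL limits in the same way as any other probe value.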


Can plot stuff with pnp4nagios
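Plotting relies on the performance data that a plugin prints after a "|" on its output line, in the form label=value[UOM];warn;crit;min;max described in the Nagios plugin guidelines. A minimal sketch of building such a line follows; the helper names and the example values are hypothetical, not taken from the LCGDM probes.

```python
def format_perfdata(label, value, unit="", warn="", crit="", minimum="", maximum=""):
    """Build one perfdata item: 'label'=value[UOM];warn;crit;min;max.

    Unset trailing fields are dropped, which the format allows.
    """
    item = f"'{label}'={value}{unit};{warn};{crit};{minimum};{maximum}"
    return item.rstrip(";")


def plugin_line(status_text, *perfdata):
    """Join the human-readable status and the perfdata items with ' | '."""
    if not perfdata:
        return status_text
    return f"{status_text} | {' '.join(perfdata)}"


if __name__ == "__main__":
    # Illustrative values only.
    print(plugin_line("DPM_POOL OK", format_perfdata("free", 1024, "B", 200, 100, 0, 4096)))
```

Any probe that emits its measured value this way gets a time series without further changes to the probe logic.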


Conclusions / Questions

• This is nice. Take a look at the probes and give me or Alex some feedback
• Or try it out yourself. Not tied to any release:
  http://etics-repository.cern.ch:8080/repository/pm/volatile/repomd/name/lcgdm_head_sl5_x86_64_gcc412/index.html
• Do we want to add performance info into this?
  – Like what was in GridPPDPMMonitor
  – Summer student Martin (see DPM Stressing talk) could _maybe_ do some of that
