www.eu-eela.eu E-science grid facility for Europe and Latin America Watchdog: A job monitoring solution inside the EELA-2 Infrastructure Riccardo Bruno, Roberto Barbera, Elisa Ingrà INFN Sez. Catania (Italy) 2nd EELA-2 Conference Choroni (Venezuela), 25-27.11.2009


www.eu-eela.eu

E-science grid facility for Europe and Latin America

Watchdog: A job monitoring solution inside the EELA-2 Infrastructure

Riccardo Bruno, Roberto Barbera, Elisa Ingrà

INFN Sez. Catania (Italy)

2nd EELA-2 Conference

Choroni (Venezuela), 25-27.11.2009


Job Monitoring in gLite

Before gLite v3.1 no job monitoring system was available:
• Jobs running on the WNs were treated as black boxes
• No prompt retrieval of the job status (Done/Abort/…)
• Output Sandbox available only after the WMS recognizes job completion
• This situation was unsuitable for jobs requiring very long computation times


[Figure: Jobs → WMS → CEs → WNs; Output Sandbox marked "?"]

www.eu-eela.eu

Analysis
• Need
– Get in touch with the jobs running on the WNs (especially long-running jobs), monitoring and controlling their execution.
• How
– Perform job control and monitoring using Grid services in the least invasive way for the application.
• Observations
– Almost all Grid jobs are driven by a main shell script (the pilot script), which can:
  Get precious info in case of faults
  Pilot complex batch workflows
– Both AMGA and SE+LFC can be used as a basic Grid information system:
  lfc-* and lcg-* tools are already available for Grid file management
  The mdcli AMGA command can be used by jobs on the WNs
  The cp command can be used in case of a shared file system on the WN
  The latency of the CLI tools is very low compared with the duration of long-running jobs


Requirements

• Monitor job execution in a timely way by watching the files the job produces while it runs on the WN
– File snapshots will be reported to LFC+SE, AMGA servers or mounted shared file systems
• The monitoring tool should be configurable according to the user's needs
– The monitoring tool will consist only of bash script files
– A few shell environment variables configure the monitoring behavior
• Control the job execution directly on the WN
– It is possible to send user commands to the WN
– It is possible to change the monitoring settings while the Grid job runs
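The snapshot requirement above can be pictured with a minimal sketch for the mounted shared file system case, where a plain cp is enough. This is not the Watchdog code: the variable names (WD_TARGET_DIR, WD_INTERVAL, WD_FILES, WD_STOP_FILE), the stop-file convention and the timestamp_PID_filename naming are assumptions made for illustration only.

```shell
#!/bin/sh
# Hypothetical sketch, not the real Watchdog: every name below is an assumption.
WD_TARGET_DIR="${WD_TARGET_DIR:-/tmp/wd_snapshots}"  # assumed mounted shared FS path
WD_INTERVAL="${WD_INTERVAL:-60}"                     # assumed seconds between cycles
WD_FILES="${WD_FILES:-file.out file.err}"            # assumed space-separated watch list
WD_STOP_FILE="${WD_STOP_FILE:-./wd.stop}"            # assumed flag file that ends the loop

# Copy one watched file to the target directory under a timestamped name.
wd_snapshot() {
    local src="$1" stamp
    [ -f "$src" ] || return 0            # file not produced yet: nothing to report
    stamp="$(date +%y%m%d%H%M%S)"
    mkdir -p "$WD_TARGET_DIR" || return 1
    cp "$src" "$WD_TARGET_DIR/${stamp}_$$_$(basename "$src")"
}

# Take snapshots every WD_INTERVAL seconds until the stop file appears,
# then take one final snapshot so the last output is not lost.
wd_watch() {
    local f
    while [ ! -e "$WD_STOP_FILE" ]; do
        for f in $WD_FILES; do wd_snapshot "$f"; done
        sleep "$WD_INTERVAL"
    done
    for f in $WD_FILES; do wd_snapshot "$f"; done
}
```

A pilot script could run `wd_watch &` before the application and create the stop file afterwards. Reporting to LFC+SE or AMGA would replace the cp call with the corresponding Grid tools.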


The Watchdog
• The Watchdog consists of a set of shell scripts to be included in the JDL InputSandbox and then called by the pilot script.
• Watchdog features:
– It starts in the background before the Grid job runs
– It runs as long as the main job
– The monitoring process can be controlled until the pilot script has finished
– Easily configurable and customizable
– It does not compromise the CPU power of the WN
– It can be used with MPI jobs
– Files may be reported fully or partially (only the latest changes)
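Partial reporting (only the latest changes) can be sketched by remembering how many bytes of a file have already been reported. The function name wd_report_delta and the use of a per-file state file are assumptions for illustration; the slides do not say how the Watchdog tracks changes.

```shell
#!/bin/sh
# Hypothetical sketch of partial reporting; not taken from the Watchdog sources.
# $1 = watched file, $2 = state file holding the byte count already reported.
wd_report_delta() {
    local src="$1" state="$2" old=0 new
    [ -f "$src" ] || return 0
    if [ -f "$state" ]; then old="$(cat "$state")"; fi
    new=$(( $(wc -c < "$src") ))         # arithmetic strips any padding from wc
    if [ "$new" -lt "$old" ]; then
        old=0                            # file was truncated or replaced: start over
    fi
    if [ "$new" -gt "$old" ]; then
        tail -c "+$((old + 1))" "$src"   # print everything after the first $old bytes
    fi
    echo "$new" > "$state"
}
```

Each call prints only the newly appended bytes, so concatenating the reported pieces in order yields the full file again.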


WD Main Components
• watchdog.sh
– The WD core script; it is responsible for job monitoring, file snapshot reporting and user command execution
• watchdog.ctrl
– This script controls the execution of the WD core script; it can start, stop, pause and resume the WD. It can also be used to alter the time interval, add/remove files to watch and change the reporting strategy (full/partial)
• watchdog.conf
– This script contains all the environment variables needed to configure the WD
– The use of AMGA reporting requires more files
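One common way to get start, stop, pause and resume semantics in plain shell is a PID file plus signals. The sketch below shows that general technique only; it is an assumption that the real watchdog.ctrl works along these lines, and wd_ctrl, wd_loop and WD_PID_FILE are hypothetical names.

```shell
#!/bin/sh
# Hypothetical sketch of a start/stop/pause/resume controller (PID file + signals).
WD_PID_FILE="${WD_PID_FILE:-./wd.pid}"   # assumed location of the PID file

# Placeholder for the monitoring loop; a real loop would report snapshots here.
wd_loop() {
    while :; do sleep 1; done
}

wd_ctrl() {
    local pid
    case "$1" in
        start)
            wd_loop > /dev/null 2>&1 &     # run in the background, detached from our output
            echo "$!" > "$WD_PID_FILE" ;;
        pause)
            kill -STOP "$(cat "$WD_PID_FILE")" ;;
        resume)
            kill -CONT "$(cat "$WD_PID_FILE")" ;;
        stop)
            pid="$(cat "$WD_PID_FILE")"
            kill "$pid" 2>/dev/null || true
            kill -CONT "$pid" 2>/dev/null || true   # a paused process only sees TERM once resumed
            rm -f "$WD_PID_FILE" ;;
        *)
            echo "usage: wd_ctrl start|pause|resume|stop" >&2
            return 2 ;;
    esac
}
```

Changing the time interval or the watch list at run time would need an extra channel, for example a small file that the loop re-reads on every cycle.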


WD Additional Components

• getinfo.sh / setinfo.sh, getcontent.sh / setcontent.sh (AMGA)
– Utilities to set/get the information reported by the WD to/from the AMGA metadata catalog
• uuencode / uudecode (sharutils) (AMGA)
– Executables needed by the WD to encode binaries and multiline text content into the AMGA metadata catalog in Base64 text format.
– In EELA-2 (prod VO) available in: $VO_PROD_VO_EU_EELA_EU_SW_DIR
• wdcli
– CLI application to let the user interact with the WD
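The reason for the Base64 step is that a metadata attribute holds plain text, while a snapshot may be binary or span many lines. The sketch below illustrates the round trip with the coreutils base64 command, swapped in here for uuencode/uudecode only because it is more commonly installed; wd_encode and wd_decode are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical sketch: Base64 round trip using coreutils base64 instead of
# the uuencode/uudecode tools named in the slides.

# Encode a file as a single line of Base64 text.
wd_encode() {
    base64 < "$1" | tr -d '\n'
}

# Decode a Base64 string (first argument) back to the original bytes on stdout.
wd_decode() {
    printf '%s' "$1" | base64 -d
}
```

For example, `wd_decode "$(wd_encode file.out)" > copy.out` reproduces file.out byte for byte.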


WD Usage


1. Configure the Watchdog by editing the watchdog.conf file

2. Applications using the Watchdog MUST include these files in the InputSandbox: watchdog.sh, watchdog.ctrl and watchdog.conf, plus uuencode and uudecode in case of AMGA reporting (alternatively, add VO_PROD_VO_EU_EELA_EU_SW_DIR to the PATH on the WN)

3. Call watchdog.ctrl from the pilot script

App JDL:
Type = "Job";
JobType = "Normal";
Executable = "/bin/bash";
StdOutput = "file.out";
StdError = "file.err";
InputSandbox = {"watchdog.sh", "watchdog.ctrl", "watchdog.conf", "uuencode", "uudecode", "AppPilotScript.sh"};
OutputSandbox = {"MyApp.out", "MyApp.err", "watchdog.log", "watchdog.err"};
Arguments = "AppPilotScript.sh";

AppPilotScript.sh:
#!/bin/sh
# prepare and start the watchdog
PATH=${VO_PROD_VO_EU_EELA_EU_SW_DIR}/:${PATH}:.
chmod +x watchdog.*
./watchdog.ctrl start
# run the application …
# use ./watchdog.ctrl
# to control the WD at any time
# stop the watchdog and wait until it completes
./watchdog.ctrl stop
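If the application fails, a pilot script written as a straight sequence may lose the application's exit status behind the final stop call. A possible variation, offered only as a sketch, wraps the application so that the watchdog is always stopped and the application's own exit code is returned. The helper name run_with_watchdog and the WD_CTRL variable are assumptions; only the start and stop arguments of watchdog.ctrl come from the slides.

```shell
#!/bin/sh
# Hypothetical wrapper around the start/stop calls shown in the pilot script.
WD_CTRL="${WD_CTRL:-./watchdog.ctrl}"   # control script shipped in the InputSandbox

# Start the watchdog, run the given command, stop the watchdog,
# and return the exit status of the command itself.
run_with_watchdog() {
    local rc=0
    "$WD_CTRL" start || return 1
    "$@" || rc=$?                        # keep the application's status
    "$WD_CTRL" stop
    return "$rc"
}
```

A pilot script could then end with something like `run_with_watchdog ./MyApp > MyApp.out 2> MyApp.err`, where MyApp stands for the user's executable.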


WD Interaction


<BASEPATH>/6-tPC2d2knO7m6GP2XC7-Q
_watchdog/
091002232421_wdcli_cmd1.cmd
091002232421_wdcli_cmd1.err
091002232421_wdcli_cmd1.out
...
091002232729_wdcli_cmd7.cmd
091002232729_wdcli_cmd7.err
091002232729_wdcli_cmd7.out
WDEND or WDPID
WDENV
WDHST
cmdlist/
wdcli_cmd8
091002231841_13156_file.err
091002231853_13156_file.out
091002231904_13156_watchdog.err
091002232836_13156_watchdog.log

[Diagram labels: 6-tPC2d2knO7m6GP2XC7-Q; Flags; WD Control DIR; WD CMD Exe DIR (CMD, OUT, ERR); File snapshots. Components shown: watchdog.conf and watchdog.sh on the WN; LFC/AMGA/mounted shared FS]
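The listing suggests a simple file-based command channel: requests appear under cmdlist/, and each executed command leaves a timestamped .cmd/.out/.err triple. The sketch below shows how such a queue could be drained on a mounted shared file system. The reading of the layout, the function name wd_run_pending and the WD_CTRL_DIR variable are assumptions, not a description of the actual watchdog.sh.

```shell
#!/bin/sh
# Hypothetical sketch of a file-based command queue; layout inferred from the slide.
WD_CTRL_DIR="${WD_CTRL_DIR:-./_watchdog}"   # assumed control directory

# Execute every pending request found in cmdlist/ and store its results.
wd_run_pending() {
    local req name stamp base
    for req in "$WD_CTRL_DIR"/cmdlist/*; do
        [ -f "$req" ] || continue            # empty queue: the glob stays literal
        name="$(basename "$req")"
        stamp="$(date +%y%m%d%H%M%S)"
        base="$WD_CTRL_DIR/${stamp}_${name}"
        mv "$req" "$base.cmd"                # take the request out of the queue
        # stdin comes from /dev/null because interactive commands are not supported;
        # a failing user command must not stop the watchdog itself
        sh "$base.cmd" < /dev/null > "$base.out" 2> "$base.err" || true
    done
}
```

Calling wd_run_pending once per monitoring cycle would bound the command latency by the watching interval.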


wdcli

• CLI to ease the user's interaction with the WD
– 20091124164201 wd>
• Uses the watchdog.conf file to get the user configuration
• Principal commands:
– set: Set MODE (LFC/AMGA/mounted shared FS)
– show jobs: Get the list of monitored jobs
– Attach to a monitored job
– show snapshots: Get the list of file snapshots
– View the snapshot content
– Get generic info: ENV, PID, CE, WN, Proxy …
– exec: Execute a given command
  Interactive commands are not allowed
  It is possible to call the watchdog.ctrl command (use the -n option!)
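Because snapshot names start with a fixed-width timestamp, the most recent snapshot of a file can be found by name ordering alone. The helper below sketches this for the mounted shared file system mode; wd_latest_snapshot is a hypothetical name and is not a wdcli command.

```shell
#!/bin/sh
# Hypothetical helper: print the newest snapshot of a watched file.
# $1 = directory holding the snapshots, $2 = watched file name (e.g. file.out)
wd_latest_snapshot() {
    local dir="$1" name="$2" f latest=""
    for f in "$dir"/*_"$name"; do
        [ -f "$f" ] || continue
        latest="$f"      # globs expand in sorted order, so the last match is the newest
    done
    [ -n "$latest" ] || return 1
    printf '%s\n' "$latest"
}
```

A call such as `cat "$(wd_latest_snapshot "$JOBDIR" file.out)"` would then show the latest reported content, with JOBDIR standing for the job's snapshot directory.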


WD in EELA-2
• Presented for the first time at E2GRIS1 in Itacuruçá (Brazil)
– G-HMMER/G-InterProScan (Bioinformatics): get semi-real-time info to be published on the Web
– CrossFire (Civil Protection): get semi-real-time info to view the simulation output
• Presented for the second time at E2GRIS2 in Querétaro (Mexico)
– HeMoLab (Bioinformatics): long-running jobs, check output files while running
– AeroVANT (Engineering): long-running jobs, get data while running
– BioMD (Bioinformatics): long-running job, monitor the simulation
– Seismic Sensors (planned) (Earth Science): monitor the job execution
• Cinefilia (Recommender Systems): monitor the computation


Conclusions
• WD mainly used for:
– Job monitoring (long-running jobs)
– Checking/getting the data produced by the job
• WD used as:
– A debugging helper tool
– An application component (CrossFire)
• WD is easy to integrate but needs a precise configuration
– EELA-2 has 2 different AMGA servers using different access rights (EU and LA)
– EELA-2 does not have the sharutils (uuencode/uudecode) package installed on the WNs. These tools are available under the WN path VO_PROD_VO_EU_EELA_EU_SW_DIR; alternatively, put the ‘uu**code’ commands in the InputSandbox
– Several EELA-2 WNs were using a different BDII, so some users were unable to retrieve the snapshot content easily (LFC)


Future

• Improve the user interaction
– Improve wdcli (given its success at E2GRIS2)
– Create tools to easily build web-based front ends
– Provide tools to reconstruct a file monitored incrementally
• Ease the application integration (AMGA)
– Remove the dependency on uuencode/uudecode
– Provide watchdog.conf file templates for VOs
• Improve the monitoring
– Provide an independent watching cycle for each file
– Provide a sandboxing mechanism for file I/O from/to the WN


Questions?