EGEE-III INFSO-RI-222667 Enabling Grids for E-sciencE www.eu-egee.org Overview of STEP09 monitoring issues Julia Andreeva, IT/GS 09.07.2009 STEP09 Postmortem Workshop


EGEE-III INFSO-RI-222667

Enabling Grids for E-sciencE

www.eu-egee.org

Overview of STEP09 monitoring issues

Julia Andreeva, IT/GS

09.07.2009

STEP09 Postmortem Workshop


Jumping to conclusions

• A variety of tests were run during STEP09, so a variety of monitoring systems were used.

• We were certainly not running blind and could follow pretty well what was going on.

• To follow the experiment activities, the VO-specific monitoring systems were used in most cases.

• To check the health of the services and of the sites, the VOs mostly relied on centrally provided monitoring systems such as SAM and SLS.


Short questionnaire sent to all 4 experiments

• Do you think that your VO had all the necessary monitoring tools, and that they provided the functionality required to follow STEP09?

• What has to be improved?

• Which monitoring systems were used for each particular test?

• Was it possible to see the overall picture (all 4 experiments)?

• Wish list …

Many thanks to everyone who provided input and sent answers.


ALICE

• ALICE did not suffer from any lack of monitoring information and was able to follow STEP09 activities pretty well.

• For both transfer (rate) and job processing, ALICE used its native monitoring service based on MonAlisa. For transfer efficiency and errors, ALICE used Dashboard.

• For the overall picture of transfers, ALICE used GridView.

• No particular requests regarding monitoring.


ATLAS

• In general, ATLAS did have the monitoring infrastructure necessary to follow STEP09, though some issues were seen and there is room for improvement (my conclusion from the ATLAS answers).


ATLAS transfer monitoring

For data transfer, Dashboard was used.

Good for overall data transfer monitoring. Problem noticed: it can magnify a single error so much that it is hard to see anything else (filtering out known problems would be useful).

What is missing for specific operational needs:
1. Monitoring of broken subscriptions
2. Monitoring of queues of subscriptions
3. Monitoring of subscriptions not picked up
4. Information ordered by source
5. Development of drill-down plots giving efficiency and bandwidth consumed in a given time period
6. Some work on pre-stage monitoring, especially for staged files and datasets

The work Ricardo did on the 2D plots, generated on the client side, looks like a very healthy development. This is probably the way to go for the more flexible monitoring ATLAS needs in the future.
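
The wish to filter out known problems, so that one dominant error does not hide everything else, could look roughly like the sketch below. It is only an illustration: the error records, site names and the `known_problems` set are hypothetical and do not come from the Dashboard.

```python
from collections import Counter

# Hypothetical error records: (source_site, error_message).
# These are made-up values, not real Dashboard data.
errors = [
    ("SITE_A", "SRM timeout"),
    ("SITE_A", "SRM timeout"),
    ("SITE_A", "SRM timeout"),
    ("SITE_B", "checksum mismatch"),
    ("SITE_C", "no space left on device"),
]

# Problems the operators have already acknowledged (assumed list).
known_problems = {"SRM timeout"}


def summarize_errors(records, ignore=frozenset()):
    """Count errors per (source, message), skipping acknowledged messages."""
    return Counter(
        (site, message) for site, message in records if message not in ignore
    )


all_errors = summarize_errors(errors)
remaining = summarize_errors(errors, ignore=known_problems)

# Ordering by source is then a matter of sorting the keys.
for (site, message), count in sorted(remaining.items()):
    print(site, message, count)
```

With the known problem ignored, the two less frequent errors become visible instead of being drowned out by the repeated one.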


ATLAS job processing monitoring

• PANDA and Dashboard were used for production and analysis.

• Production monitoring is in good shape. PANDA is very useful for debugging possible problems; Dashboard provides better historical views.

• Monitoring of analysis jobs needs considerable improvement. Problems seen with Dashboard Job Monitoring for analysis:

1) Instability of the MonAlisa server, which had to be rebooted almost every day. This might be a configuration problem: the CMS MonAlisa server works perfectly well under a much higher load than the ATLAS one. To be checked with MonAlisa experts.

2) In general, the ATLAS version of Dashboard job monitoring differs from the CMS one, which is constantly improving (with work on both the CRAB and the Dashboard side). The modifications made to the CMS Dashboard have to be applied to the ATLAS instance.


ATLAS (continued)

• Monitoring of the central services
ATLAS considers SLS a good infrastructure for service monitoring and uses it to monitor its services.

• Looking at the overall picture (4 VOs)
Not so much. The WLCG daily operations meetings usually communicated the necessary information.

• General comments regarding future development
- At the moment all monitoring is an aggregation of lower-level information. ATLAS needs to find a way to build an ATLAS Grid Dashboard that looks at higher-level metrics, e.g. the number of functional test datasets subscribed in the last 6 hours (if this is low, there is trouble).

- In the future ATLAS foresees slow control systems built on this monitoring, so all monitoring systems should provide a machine-readable format, not just plots.
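
As an illustration of such a higher-level metric delivered in machine-readable form, the sketch below counts functional test subscriptions in the last 6 hours and emits a small JSON report. The function name, the field names, the threshold and the sample timestamps are all assumptions made for this example; they do not describe an existing ATLAS interface.

```python
import json
from datetime import datetime, timedelta


def functional_test_metric(subscription_times, now, window_hours=6, minimum=10):
    """Count subscriptions inside the time window and flag a low count.

    `minimum` is an assumed alarm threshold, chosen here for illustration.
    """
    window_start = now - timedelta(hours=window_hours)
    count = sum(1 for t in subscription_times if window_start <= t <= now)
    return {
        "metric": "functional_test_datasets_subscribed",
        "window_hours": window_hours,
        "count": count,
        "status": "ok" if count >= minimum else "alarm",
    }


# Made-up subscription timestamps: 1, 2, 5, 7 and 9 hours before `now`.
now = datetime(2009, 6, 10, 12, 0)
times = [now - timedelta(hours=h) for h in (1, 2, 5, 7, 9)]

report = functional_test_metric(times, now, minimum=5)
print(json.dumps(report))
```

A slow control system could poll a report like this and react to the `status` field, which is not possible when the same information is only available as a plot.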


CMS

• Same as for ATLAS. In general, CMS has in place the monitoring infrastructure necessary to follow its computing activities in detail, though some work and improvements are foreseen.


CMS transfer monitoring

• PHEDEX was used. No particular issues were mentioned in the CMS reports regarding transfer monitoring


CMS production monitoring

• CMS used multiple systems:

T0AST for T0 monitoring, native glidein monitoring, and CMS Dashboard for monitoring of the reprocessing.

• Known issues (in fact known since CCRC08)

- Insufficient reporting from ProdAgent to Dashboard.

ProdAgent (PA) does not report to Dashboard the job status information from the user interface, for example when a job is killed or aborted.

CPU time, wall-clock time and the number of processed events are not reported from ProdAgent to Dashboard either.


CMS analysis monitoring

• Users mainly rely on the output of the ‘CRAB -status’ command and on Dashboard Task monitoring.

Dashboard Task monitoring is used extensively by 50-100 users daily (80-130 distinct analysis users submit jobs to the Grid each day).

• For STEP09 the overall picture was required. The CMS Dashboard interactive UI, the CMS Dashboard programmatic interface and native glidein monitoring were used.

• Issues
- Reporting to Dashboard from jobs submitted via the CRAB server to condor glideins was still being debugged during STEP09. As a result, the Dashboard statistics for glidein jobs were a bit higher than in reality.

- Dashboard historical views provide information in terms of jobs, not in terms of CPU or wall-clock time. CPU and wall-clock distributions are being added in the new version of the historical view, which is under development.

• Improvements foreseen
- Understand and provide a comprehensive picture for the Analysis Support team. Most of the information needed already exists in Dashboard. The Dashboard team is working together with CMS to come up with an appropriate interface for Analysis Support shifters.

The twiki page created by CMS for the STEP09 analysis test also provides good input for Dashboard developers.
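
The difference between a view in terms of jobs and a view in terms of CPU or wall-clock time can be seen in a small sketch. The job records and site names below are invented for illustration and are not Dashboard data; the aggregation is an assumption about how such a view might be computed, not the Dashboard implementation.

```python
from collections import defaultdict

# Hypothetical job records: (site, cpu_seconds, wallclock_seconds).
jobs = [
    ("T2_A", 3000, 3600),
    ("T2_A", 1000, 3600),
    ("T2_B", 7000, 7200),
]


def summarize_by_site(records):
    """Aggregate job count, CPU time and wall-clock time per site."""
    totals = defaultdict(lambda: {"jobs": 0, "cpu": 0, "wallclock": 0})
    for site, cpu, wall in records:
        entry = totals[site]
        entry["jobs"] += 1
        entry["cpu"] += cpu
        entry["wallclock"] += wall
    return dict(totals)


summary = summarize_by_site(jobs)
for site, entry in sorted(summary.items()):
    print(site, entry)
```

In this made-up sample T2_A ran more jobs, while T2_B consumed more CPU time, which is the kind of difference a job-only historical view cannot show.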


CMS (continued)

• Looking at the overall picture (all 4 experiments)

Same as ATLAS: CMS was too busy to see what the other experiments were doing.

When CMS needed to understand issues at a particular site, it mostly relied on input provided by the site administrators.

CMS did not have a chance to validate new systems such as SiteView, mostly because of time constraints.


LHCb

• For both transfer and data processing monitoring, LHCb used the DIRAC portal, which provided sufficient information to follow STEP09 activities.

• For the status of CEs at the sites, LHCb used the SAM portal and the Dashboard interface for VO-specific SAM tests.

• Foreseen improvements:

Correlate monitoring and accounting information from DIRAC, SAM test results, the GGUS portal and GOCDB downtime information for more automated management of LHCb computing resources, for example to avoid situations where a site is banned without good reason.
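
One way to picture this correlation is a rule that bans a site only when several independent sources agree. The sketch below is purely illustrative: the `SiteStatus` fields, the thresholds and the "at least two signals" policy are assumptions made for the example, not LHCb policy, and no real DIRAC, SAM, GGUS or GOCDB interface is called.

```python
from dataclasses import dataclass


@dataclass
class SiteStatus:
    """Hypothetical per-site snapshot gathered from the four sources."""
    sam_tests_failing: bool       # from SAM test results
    job_failure_rate: float       # from DIRAC monitoring, 0.0 to 1.0
    open_tickets: int             # from the GGUS portal
    in_declared_downtime: bool    # from GOCDB


def ban_decision(status, max_failure_rate=0.5, min_signals=2):
    """Return (ban, evidence); ban only if enough sources agree (assumed policy)."""
    evidence = []
    if status.sam_tests_failing:
        evidence.append("SAM tests failing")
    if status.job_failure_rate > max_failure_rate:
        evidence.append("high job failure rate")
    if status.open_tickets > 0:
        evidence.append("open GGUS ticket")
    if status.in_declared_downtime:
        evidence.append("declared downtime")
    return len(evidence) >= min_signals, evidence


# A single bad signal is not enough to ban under this assumed policy.
lone_signal = SiteStatus(False, 0.8, 0, False)
# Two agreeing signals are.
two_signals = SiteStatus(True, 0.1, 1, False)

print(ban_decision(lone_signal))
print(ban_decision(two_signals))
```

Returning the evidence list together with the decision keeps a record of why a site was banned, which addresses the "without good reason" concern directly.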


Conclusions

• The existing monitoring systems, though not perfect, did provide the information necessary to follow the STEP09 activities.

• The issues and problems seen during STEP09 define the short-term development plans in the monitoring area.
