

SC4 Workshop

“Operational Requirements for Core Services”

James Casey, IT-GD, CERN
CERN, 21st June 2005


Summary

• Issues as expressed by sites
  • ASGC, CNAF, FNAL, GRIDKA, PIC, RAL, TRIUMF
• My synopsis of the most important issues
• Where we are on them…
• What are possible solutions in the longer term


ASGC – Features missing in core services

• Local/remote diagnostic tests to verify the functionality and configuration (see the sketch after this list). This would be helpful for:
  • Verifying your configuration
  • Generating test results that can be used as the basis for local monitoring
• Detailed step-by-step troubleshooting guides
• Example configurations for complex services
  • e.g. VOMS, FTS
• Some error messages could be improved to provide more information and facilitate troubleshooting
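
A minimal sketch of the kind of local diagnostic test ASGC asks for, assuming a generic gLite-style node; the endpoint name, paths and thresholds are placeholders, not an agreed test suite:

  #!/bin/sh
  # Minimal local diagnostic sketch: verify basic configuration and
  # service reachability. Endpoint and paths are placeholders.
  fail=0

  # Host certificate present and valid for at least another day?
  if ! openssl x509 -in /etc/grid-security/hostcert.pem -noout -checkend 86400
  then
      echo "ERROR: host certificate missing or expires within 24h"
      fail=1
  fi

  # Any CRL older than two days? (stale CRLs are a common silent failure)
  find /etc/grid-security/certificates -name '*.r0' -mtime +2 \
      -exec echo "WARNING: stale CRL:" {} \;

  # Is the local SRM port reachable? (placeholder endpoint)
  if ! nc -z -w 5 se.example.ch 8443; then
      echo "ERROR: cannot reach SRM endpoint se.example.ch:8443"
      fail=1
  fi

  exit $fail

Run regularly, the exit code and messages could feed the local monitoring mentioned in the first bullet.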


CNAF – Outstanding issues (1/2)

• Accounting (monthly reports):
  • CPU usage in KSI2K-days (DGAS)
  • Wall-clock time in KSI2K-days (DGAS)
  • Disk space used in TB
  • Disk space allocated in TB
  • Tape space used in TB
  • Validation of the raw data gathered, by comparison via different tools
• Monitoring of data transfer: GridView and SAM?
• More FTS monitoring tools necessary (traffic load per channel, per VO)
• Routing in the LHC Optical Private Network?
  • A backup connection to FZK is becoming urgent, and a lot of traffic between non-associated T1-T1 and T1-T2 sites is using the production network infrastructure


CNAF – Outstanding issues (2/2)

• Implementation of an LHC OPN monitoring infrastructure is still in its infancy
• SE reliability when in unattended mode: greatly improved with the latest Castor2 upgrade
• Castor2 performance during concurrent import and export activities


FNAL – Middleware additions

• It would be useful to have better hooks in the grid services to enable monitoring for 24/7 systems
  • We are implementing our own tests to connect to the paging system
  • If the services had reasonable health monitors we could connect to, it might spare us re-implementing, or missing, an important element to monitor (a probe sketch follows)
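
A sketch of the kind of health hook FNAL describes, assuming the site pages on a non-zero exit code (Nagios-style: 0 = OK, 1 = WARNING, 2 = CRITICAL); the endpoint and threshold are placeholders:

  #!/bin/sh
  # Health probe sketch with pager-friendly exit codes.
  ENDPOINT=fts.example.ch
  PORT=8443

  start=$(date +%s)
  if ! nc -z -w 10 "$ENDPOINT" "$PORT"; then
      echo "CRITICAL: $ENDPOINT:$PORT not reachable"
      exit 2
  fi
  elapsed=$(( $(date +%s) - start ))

  if [ "$elapsed" -gt 5 ]; then
      echo "WARNING: $ENDPOINT:$PORT slow to answer (${elapsed}s)"
      exit 1
  fi
  echo "OK: $ENDPOINT:$PORT reachable in ${elapsed}s"
  exit 0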


GRIDKA – Feature Requests

• Improved (internal) monitoring
• Developers do not always seem to be aware that hosts can have more than one network interface
  • Hosts should be reachable via their long-lived alias; the actual hostname should be unimportant (for reachability, not for security)

• Error messages should make sense and be human readable!
• A simple example:
  $ glite-gridftp-ls gsiftp://f01-015-105-r.gridka.de/pnfs/gridka.de/
    (typo in the hostname ^^^)
  t3076401696:p17226: Fatal error: [Thread System] GLOBUSTHREAD: pthread_mutex_destroy() failed
  [Thread System] mutex is locked (EBUSY)
  Aborted
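
As an illustration only (this is not part of gLite): the same failure could be caught up front with a plain-language message by resolving the host before calling the real client. The wrapper name is hypothetical:

  #!/bin/sh
  # glite-gridftp-ls-checked (hypothetical wrapper): give a readable error
  # for the common case of a mistyped hostname, then hand over to the tool.
  URL=$1
  HOST=$(echo "$URL" | sed -e 's|^[a-z]*://||' -e 's|[/:].*||')

  if ! getent hosts "$HOST" > /dev/null; then
      echo "ERROR: cannot resolve host '$HOST' - check the URL for typos" >&2
      exit 1
  fi
  exec glite-gridftp-ls "$URL"

With the typo above, this would print "cannot resolve host" instead of the mutex traceback.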


PIC – Some missing Features

• In general:
  • Clearer error messages
  • Difficult to operate (e.g. it should be possible to reboot a host without affecting the service)
• SEs:
  • Missing a procedure for “draining” an SE or gently taking it out of production
  • Difficult to control access: for some features to be tested the SE must be published in the BDII, but once it is there, there is no way to control who can access it
• gLite-CE:
  • A simple way to gather the DN of the submitter, given the local batch job ID (GGUS-9323)
• FTS:
  • Unable to delete a channel which has “cancelled” transfers
  • Difficult to see (a) that the service is having problems, and (b) then to debug them


RAL – Missing Features in File Transfer Service

• Could collect more information (endpoints) dynamically
  • This is happening now in 1.5
• Logs
  • Comparing a successful and a failed transfer is quite tricky
  • I can show you two 25-line logs, one for a failed and one for a successful srmcopy; the logs are completely identical
  • Having log files that are easy to parse for alerts or errors is of course very useful (a sketch follows this list)
• Offsite monitoring
  • How do we know a service at CERN is dead?
  • And what is provided to interface it to local T1 monitoring?
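
A sketch of the easy-to-parse ideal RAL describes, assuming (this is not the real FTS log format) that error lines carry a recognizable marker; the path and patterns are placeholders:

  #!/bin/sh
  # Scan a transfer log for error markers and emit one alert line per hit.
  LOG=${1:-/var/log/fts/transfer.log}

  grep -n -i -E 'error|fail|abort|timeout' "$LOG" |
  while IFS=: read -r lineno text; do
      echo "ALERT: $LOG:$lineno: $text"
  done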


TRIUMF – Core Services (1/2)

• 'yaim', like any tool that wants to be general and complete, ends up being complicated to implement, to debug and to maintain
  • Trying to do a lot from two scripts (install_node and configure_node) and one environment file (node-info.def) bypasses a basic principle of Unix system management: use small, independent tools, and combine them to achieve your goal
• Often a 'configure_node' process needs to be run multiple times to get it right
  • It would help a lot if it did not repeat the useless, already completed, time-consuming 'config_crl' (a guard sketch follows)
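
One way the re-run cost could be avoided, sketched here as a stamp-file guard (illustrative; this is not yaim code, and the stamp path is a placeholder):

  #!/bin/sh
  # Skip an already-completed, time-consuming step on re-runs by
  # recording success in a stamp file.
  STAMP=/var/run/yaim-config_crl.done

  run_config_crl() {
      echo "fetching CRLs..."   # stand-in for the real, slow config_crl
  }

  # re-run only if never completed, or if the stamp is older than a day
  if [ ! -f "$STAMP" ] || [ -n "$(find "$STAMP" -mtime +1)" ]; then
      run_config_crl && touch "$STAMP"
  else
      echo "config_crl completed recently, skipping"
  fi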


TRIUMF – Core Services (2/2)

• An enhancement for the yaim configure process: it would be useful if configure_node contained a hook to run a user-defined post-configuration step (sketched below)
  • There is frequently some local issue that needs to be addressed, and we would like a line in the script that calls a local, generic script that we manage ourselves and that would not be overwritten during 'yaim' updates
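
The requested hook could be as small as one guarded call at the end of configure_node; a sketch, with a hypothetical script name under a site-managed path:

  # hypothetical last lines of configure_node
  POST_HOOK=/opt/site/etc/local-post-configure.sh
  if [ -x "$POST_HOOK" ]; then
      echo "running site post-configuration hook: $POST_HOOK"
      "$POST_HOOK" || echo "WARNING: site post-configuration hook failed"
  fi

Because the hook lives outside the yaim tree, a 'yaim' update would not overwrite it.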

• The really big hurdle will always be the Tier 2s (there is a large number of sites out there)
  • The whole process is just difficult for the Tier 2s
  • It doesn't really matter all that much what the Tier 1s say: they will and must cope
  • One should be aggressively soliciting feedback from the Tier 2s


Top 5…

• Better logging
  • Missing information (e.g. DN in the transfer log)
  • Hard-to-understand logs
• Better diagnostic tools
  • How do I verify my configuration is correct?
  • … and functional for all VOs?
• Troubleshooting guides
• Better error messages from tools
• Monitoring
  • … and interfaces to allow central/remote components to be interfaced to local monitoring systems


Logging

• FTS logs have several problems:
  • The only access to the logs is via interactive login on the transfer node
• Plans to have full info in the DB
  • Will come after the schema upgrade in the next FTS release
  • CLI tools/web interface to retrieve them
• The intermediate stage is to have the final reason in the DB
  • An outstanding bug sets this to AGENT_ERROR for 90% of messages; should be fixed soon (I hope!)
• Logs are not understandable
  • When the SRM v2.2 rewrite is done, a lot of cleanup will (need to) happen


Diagnostic tools/ Troubleshooting guides

• SAM (Site Availability Monitoring) is the solution for diagnostics
  • Can run validation tests as any VO, and see the results
• The system is in its infancy
  • Tests need expanding
  • But the system is very easy to write tests for
  • … and the web interface is quite nice to use
• Troubleshooting guides
  • These are acknowledged as needed for all services
  • T-2 tutorials helped in gathering some of this material
  • Look at the tutorials from last week in Indico for more info


SAM 2

• Tests run as the operations VO: ops
  • Sensor test submission available for all VOs
  • Critical test set per VO (defined using FCR)
• Availability monitoring
  • Aggregation of results over a certain time
  • Site services: CE, SE, sBDII, SRM
  • Central services: FTS, LFC, RB
  • Status calculated every hour → availability (a small aggregation sketch follows)
  • Current (last 24 hours), daily, weekly, monthly
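
The aggregation step, sketched under the assumption of a plain file with one status per hour (1 = critical tests passed, 0 = failed); SAM's real store is a database, so this is only the arithmetic:

  #!/bin/sh
  # daily availability = fraction of OK hours over the last 24 entries
  STATUS_FILE=${1:-hourly-status.txt}

  tail -24 "$STATUS_FILE" | awk '
      { total++; ok += $1 }
      END { if (total) printf "availability (last %dh): %.1f%%\n", total, 100*ok/total }
  '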


SAM Portal – main (screenshot)


SAM – sensor page (screenshot)


Monitoring

• It's acknowledged that GridView is not enough
  • It's good for “static” displays, but not good for interactive debugging
• We're looking at other tools to parse the data
  • SLAC have interesting tools for monitoring netflow data, which is very similar in format to the info we have in the globus XFERLOGs (a parsing sketch follows)
  • And they are even thinking of alarm systems
• I'm interested to know what types of features such a debugging/monitoring system should have
• We'd keep it all integrated in a GridView-like system
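
A sketch of the kind of parsing meant here, assuming XFERLOG-style records of space-separated KEY=VALUE fields with NBYTES and DEST among them (the field names are an assumption about the local log format):

  #!/bin/sh
  # Sum transferred bytes per destination from KEY=VALUE transfer records.
  awk '
  {
      split("", f)              # reset the per-record field map
      for (i = 1; i <= NF; i++) {
          n = index($i, "=")
          if (n) f[substr($i, 1, n - 1)] = substr($i, n + 1)
      }
      bytes[f["DEST"]] += f["NBYTES"]
  }
  END {
      for (d in bytes) printf "%-40s %15.0f bytes\n", d, bytes[d]
  }
  ' "${1:-/var/log/gridftp/gridftp.log}"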


Netflow et al.

• Peaks at known capacities and RTTs
• RTTs might suggest windows are not optimized
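
Why un-tuned windows cap throughput: a single TCP stream cannot exceed window/RTT, so the window must cover the bandwidth-delay product. A worked example, with an assumed 1 Gb/s path and 150 ms RTT:

  \[
    \text{throughput} \le \frac{\text{window}}{\text{RTT}}
    \quad\Longrightarrow\quad
    \text{window} \ge B \cdot \text{RTT} = 125~\mathrm{MB/s} \times 0.15~\mathrm{s} \approx 19~\mathrm{MB}
  \]

A common 64 KB default window on the same path would cap a stream near 64 KB / 0.15 s ≈ 3.5 Mb/s, hence the interest in multiple streams or alternative TCP stacks discussed below.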


Mining data for sites (screenshot)


Diurnal behavior (screenshot)


One month for one site (screenshot)


Effect of multiple streams

• Dilemma: what do you recommend?
  • Maximize throughput, but this is unfair: it pushes other flows aside
  • Use another TCP stack, e.g. BIC-TCP, H-TCP, etc.


Thank you …