Production Coordination Staff Retreat July 21, 2010 Dan Fraser – Production Coordinator

Page 1

Production Coordination

Staff Retreat
July 21, 2010

Dan Fraser – Production Coordinator

Page 2

Production WBS

Identify & resolve OSG issues
  Prod calls, look at usage patterns, metrics …

T3 Liaison
  Docs, Security, Xrootd, WLCG client, Coord w/ Hiro, Asoka…

Manage relationships & raise issues with the ET when necessary
  T1/T2/T3 Admins – identify requirements

Assist the ET team w/ Operations, Sites, VOs, & Education/training

Lead & Coordinate plans to improve the OSG facility
  Glide-in effort (paper, VO transitions, testing format)

Page 3

Some Production Examples…

Effort from the entire team

SE-only solution for Atlas T3s

ITIL-like processes for Operations

Updated process for CA testing prior to production

CEMON issues (hanging the BDII)

CERN BDII not reporting (RG data limit exceeded)

CERN BDII not a high priority (pushed but no movement)

Transitioning sites to use the new Gratia collector address (in progress)

Urgent security updates for sites running Condor/Gratia

Addressing VO Issues:
  LIGO Production running well (Rob E.)
    Between Rank #1 & #2 on OSG
  SBGRID reaching new peaks (~7000 parallel jobs)

Page 4

A View from the Production Coordinator

What are the biggest problems in OSG?

Supporting VOs is difficult

Lowering the barrier for scaling across sites
  Site differences often require site-by-site investigation
  Effort in progress to understand this (Dan, Abhishek)
  Big win possible with Glide-ins
  New paper comparing job submission strategies

How to get opportunistic storage
  Current method is to talk to each site…
  New strategies being explored (Tanya, Brian, Dan)

Page 5

OSG Health Monitoring

All links now on the production page

https://twiki.grid.iu.edu/bin/view/Production/WebHome

Usage Charts

Weekly Calls

OSG Data movement

Job/Error ratios

DOE display showing last 24 hours

and much more …

Page 6

Solving Production Problems

Solving problems is a TEAM sport

The weekly production call has key people from all the teams that are needed to solve problems
  CMS, Atlas, LIGO, VOs, Engage, Integration, Sites, STG, Security, Operations, Metrics

Problems accurately prioritized and channeled to the correct avenue

Sometimes solved on the call.

Forewarning to prepare for upcoming issues.

Page 7

Example Problems

Handling of job pre-emption (LIGO / D0)

VO Package Validation probe needed
  GIP “truth in advertising”

LIGO switch to GT2 and also Condor-G job submission

Condor scaling limits in GridMon (Atlas)

Globus LSF gatekeeper bug (D0/CMS)

Security Drill successes (for T1)

Gratia probe introduction & ITB testing

Page 8

Example Issues cont.

STEP09 monitoring (partially successful)

IceCube management of opportunistic storage

Gratia file transfer data catch up

Transition from VORS to myOSG

New location for RSV probes and ability to update from the “production” cache
  Also, ensure that config_OSG does not update the probes automatically

Root Cause Analysis of CMS BDII outage

Page 9

Example Issues cont.

Plan to localize data transfer information and upload summary transfer packets.

Globus memory leak was causing frequent reboots at BNL.

Site name mapping problem to enable different names internal to OSG.

OIM display difference (http vs https)

Site admin meeting & materials prep to help sites upgrade to OSG 1.2.

Page 10

Example Issues cont.

Condor problem with directory creation in a multiple gateway scenario. (Nebraska)

Gratia collector problem with handling records that accumulate faster than they can be processed.

LIGO/Pegasus transition to use BDII data instead of central probe data.