LHCOPN: Operations status
Guillaume.Cessieux @ cc.in2p3.fr
Network team, FR-CCIN2P3
LHCOPN meeting, Barcelona, 2010-06-29
Outline
Operations status
– TTS stats
– Change management
– Backup tests
Ongoing
– Relationships with WLCG
– Around GGUS
What was reported in the TTS?
395 tickets in the TTS since 2009-02
– 381 solved (96%)
– 7 in progress
• Normal ongoing issues or scheduled work
– 5 unsolved
• Mainly performance issues not understood
• Duplicate or erroneous tickets
• Cancelled or postponed work
– 2 assigned
• Twiki review pending (CA-TRIUMF, NDGF)
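As a quick consistency check on the figures above, the following minimal sketch recomputes the total and the share of each status. The counts are taken from the slide; the dictionary layout and the `share` helper are illustrative assumptions, not part of the TTS.

```python
# Ticket counts as reported on the slide (TTS, since 2009-02).
counts = {"solved": 381, "in progress": 7, "unsolved": 5, "assigned": 2}
total = sum(counts.values())


def share(n, total):
    """Return n as a percentage of total, rounded to the nearest integer."""
    return round(100 * n / total)


# Print one line per status, then the overall total.
for status, n in counts.items():
    print(f"{status:12s} {n:4d}  {share(n, total):3d}%")
print(f"{'total':12s} {total:4d}")
```

The four categories add up to the 395 tickets quoted, and the solved share rounds to the 96% shown.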
5 long-standing issues
1 infrastructure
– #55697: 2010-03-10, FR-CCIN2P3, BGP flapping with CH-CERN
• Ongoing issue, root cause not yet found, ~1 flap/day, not service affecting
4 administrative
– #48335: 2009-04-30, Additional prefix for CA-TRIUMF
• Missing notification of acceptance from NDGF, UK-T1-RAL and US-FNAL-CMS
– #52959: 2009-11-04, UK-T1-RAL, Review of LHCOPN twiki
• Only the routing policies remain to be updated
– #56415: 2010-03-12, NDGF, Review of LHCOPN twiki
• Not started
– #56417: 2010-03-12, CA-TRIUMF, Review of LHCOPN twiki
• Not started
The Ops phone conference does not seem very successful at getting these solved
Overall breakdown per category and type of problem
80% of tickets are L2 related events
Number of tickets put in the TTS per month
AVG: 23 tickets/month
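The average can be roughly cross-checked against the 395 tickets reported since 2009-02. The sketch below assumes the counting window runs from February 2009 through June 2010 (the month of this meeting), both included; the exact window behind the plot is not stated on the slide, and `months_inclusive` is a hypothetical helper.

```python
from datetime import date


def months_inclusive(start, end):
    """Number of calendar months from start to end, both months included."""
    return (end.year - start.year) * 12 + (end.month - start.month) + 1


# Assumed window: first month of TTS data to the month of the meeting.
first = date(2009, 2, 1)
last = date(2010, 6, 1)

n_months = months_inclusive(first, last)
average = 395 / n_months  # 395 tickets in total, as reported earlier

print(f"{n_months} months, {average:.1f} tickets/month")
```

Under that assumption the window spans 17 months, which is consistent with the rounded figure of 23 tickets per month.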
Ticket ownership per site
Nearly 1/4 of tickets
NL-T1 has 6 LHCOPN links
Ownership of tickets per month per site
Kind of tickets per month
KPI-1: Infrastructure vs operations behavior
Less than 15 “significant” events / month?
Change management
Only 5 tickets flagged as « change »!
– Is the infrastructure that stable?
Flag set on GGUS submit interface
Conclusion on TTS stats
L2 events are regular and therefore well managed
NL-T1 seems to have a very good implementation of the Ops model
Administrative stuff frozen
– Twiki review, change management etc.
• Not fascinating but the vital minimum
Decrease in the monthly number of tickets
– Feeling from sites that not all tickets are useful
– Need to ensure the vital minimum is there by correlating with monitoring
Backup tests?
Previously agreed: each resilience possibility should be demonstrated at least once a year
– Failures can count as a test if they are properly reported (particularly paths’ symmetry)
Only two sites have reported a backup test or a demonstration of backup efficiency for 2010
• https://twiki.cern.ch/twiki/bin/view/LHCOPN/LhcopnBackupTestsResults2010
No recent change in the infrastructure so no need to test?
Following MDM deployment related issues
Only deployment issues?
– Physical set up etc.
– Interaction with sites
Should be tracked through tickets
– Still in GN3 helpdesk system?
• GN3 people have no access to GGUS
• LHCOPN people have no access to GN3 helpdesk
Should be visible
– How, where?
What’s missing to go ahead?
Network SLD
– What is a « significant » event requiring care etc.
Monitoring
– Do we have service impacting events?
– Correlation with Operations
– Evidence instead of feelings
• Particularly for performance issues
Fill the gap between WLCG Ops and LHCOPN Ops
– Gap by design but bridge expected
Relationships with WLCG (1/4)
Lot of work previously done by Wayne
– Clear overview during Vancouver’s presentation
• http://indico.cern.ch/materialDisplay.py?contribId=17&materialId=slides&confId=59842
• Agreement from WLCG about this!
• Only missing careful implementation?
Minimum relationships should be made of
– Exchanges during meetings
– Operational exchange
• Clear process and KPI around
– Facilitated with tickets’ linking
• Dashboard of service affecting issues
– Sharing LHCOPN monitoring information
Relationships with WLCG (2/4)
Main stoppers
– Meetings
• Not acting and represented as a whole community through a LHCOPN representative or “liaison officer”
• Too often asked to be there « just in case »
– Operational exchanges
• Complex and hard to get used to, with very few issues involving WLCG (~1 every 3 months?)
– Post mortem analysis is hard as a lot of exchanges seem to be off the record
– Now a highly resilient network
– A lot of things are sites’ internal processes
• Common use of GGUS is giving a false feeling of relationships
– We are not doing user support!
• Mistake to assume we can handle all network issues from our isolated island with our closed set of supporters
– Need coordination and action from other teams (storage…)
– Problem to interact with WLCG supporters
Relationships with WLCG (3/4)
Sample expected workflow for WLCG inquiries:
[Flowchart; recoverable labels: Experiment, WLCG GGUS, Site Contact, Relevant WLCG Team, Site Network Team, Internal Ticket System, LHCOPN GGUS, Relevant Network Team; decision points “Networking?” (Yes/No) and “LHCOPN related?” (Yes/No)]
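The two decision points of this workflow can be sketched as a small routing function. The arrows of the original flowchart are not recoverable from the text, so the ordering below is only one plausible reading: an inquiry enters through WLCG GGUS, reaches the site contact, and is escalated to LHCOPN GGUS only when it is both a networking issue and LHCOPN related. The function name `route_inquiry` and the exact paths are assumptions made for illustration.

```python
from typing import List


def route_inquiry(is_networking: bool, is_lhcopn_related: bool = False) -> List[str]:
    """Return the assumed chain of handlers for a WLCG inquiry."""
    # Assumed common entry point for every inquiry.
    path = ["Experiment", "WLCG GGUS", "Site Contact"]

    # Decision 1: "Networking?" No -> stays within WLCG support.
    if not is_networking:
        return path + ["WLCG GGUS", "Relevant WLCG Team"]

    # Networking issue: handed to the site's network team internally.
    path += ["Internal Ticket System", "Site Network Team"]

    # Decision 2: "LHCOPN related?" Yes -> escalate through LHCOPN GGUS.
    if is_lhcopn_related:
        return path + ["LHCOPN GGUS", "Relevant Network Team"]

    # Not LHCOPN related: remains with the site network team.
    return path


print(" -> ".join(route_inquiry(is_networking=True, is_lhcopn_related=True)))
```

Whatever the exact arrows were, the point this reading captures is that experiments never submit to LHCOPN GGUS directly: the site contact and the site network team filter inquiries first.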
Relationships with WLCG (4/4)
Workplan
– A dashboard showing tickets impacting WLCG
• Done: Particular view on the dashboard
– Ability to link WLCG and LHCOPN tickets
• Upcoming: Parent/Child relationship
– Cross reference still here (no associated workflow)
• But problem to interact with WLCG supporters
– No cross helpdesk access to update tickets
– On site processes
• Push for careful implementation of « Site’s contact »?
– Internal site’s processes
Around GGUS (1/6): GGUS status list
Around GGUS (2/6): LHCOPN submit interface
Around GGUS (3/6): WLCG submit interface
Around GGUS (4/6): Merging Pros
Should we unify/merge LHCOPN helpdesk within the standard GGUS?
+ Consider networks like other resources (computing, storage, software...)
• Networks are not standalone resources, coordination between sites required
+ Maybe better fit in reporting
• True
+ Now standard way to send enquiries to sites?
• Yes for Grid issues, not always for network teams, which are less Grid centred and unwilling to go at project level
• But for a project’s dedicated network?
+ Maybe some central manpower could be gained
+ Regularly chasing pending tickets...
• Very unclear who can do that, and if this will be successful (cf. twiki review)
+ Less specific software and support from GGUS
• No key economy for them: Still using same database, hosts etc. and sharing some code
+ Ease interactions with WLCG supporters
• Issues evolving in two different worlds
• Write access to our helpdesk restricted to network teams
Around GGUS (5/6): Merging Cons
– We have something stable and working
• Definitely, but that should not prevent improvements
– Completely tailored for us and closely matching our operational model
• Seems hard to merge frontends and unify workflows
– Be far from interferences with Grid world
• Isolation could be achieved with particular views?
• Was a key concern from network teams
– Not shaped to do user support
• But coordinating network teams
• Maintenance not in GGUS
• No strong preference from the GGUS team
• Confirmed, not a problem for them
Around GGUS (6/6): Conclusion about merging
Our helpdesk was designed to coordinate network teams, not to support WLCG users
– Really different from standard GGUS
– Appears as an internal coordination tool
Benefits not so clear, was mainly thought to ease integration in WLCG Ops
– But we are not doing Grid Ops
• Network issue ≠ Grid issue
• Networks are not standalone resources (storage, cpu etc.)
– Similar to software issues handled externally (in savannah)
• We should not be customer facing
– Selected inquiries going through storage teams (“Site contact”)
Let’s also see how EGI will converge around user support
Conclusion about LHCOPN Operations
Ops status: Clear room for improvement
– Processes are followed unequally by sites, for lack of a clear feeling of usefulness and of evidence of network failures
– L2 events well handled while administrative workflow is forgotten
WLCG relationships to be implemented and nurtured
– Performance issues need smart and timely solving
– Skeleton of coordination with WLCG Ops to be improved
No outstanding benefit to unify LHCOPN helpdesk with WLCG’s one
– Maybe better and enough to carefully link our workflow with WLCG Ops
Wait for monitoring & SLDs before next set of improvements
– Timeline?
– Particularly revitalise tickets’ handling and ensure the minimum is there
Questions
1. Pushing for administrative things to be done?
– twiki review, backup tests etc.
2. LHCOPN representative?
– Maybe not responsible for Ops but more liaising as a single contact point
– Share and justify the workload
3. GGUS merging
– Opinion?