LHCOPN: Operations status
Guillaume.Cessieux @ cc.in2p3.fr
Network team, FR-CCIN2P3
LHCOPN meeting, Barcelona, 2010-06-29



Page 2: Outline

Operations status
– TTS stats
– Change management
– Backup tests

Ongoing
– Relationships with WLCG
– Around GGUS

Page 3: What was reported in the TTS?

395 tickets in the TTS since 2009-02
– 381 solved (96%)
– 7 in progress
  • Normal ongoing issues or scheduled work
– 5 unsolved
  • Mainly performance issues not understood
  • Duplicate or erroneous tickets
  • Cancelled or postponed work
– 2 assigned
  • Twiki review pending (CA-TRIUMF, NDGF)
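The shares behind this breakdown can be re-derived from the four counts on the slide. A minimal Python sketch; the dictionary and variable names are illustrative and do not come from the TTS itself.

```python
# Ticket counts per status, as reported on the slide (TTS, since 2009-02).
status_counts = {
    "solved": 381,
    "in progress": 7,
    "unsolved": 5,
    "assigned": 2,
}

# The four statuses should add up to the 395 tickets reported.
total = sum(status_counts.values())

# Share of each status, rounded to the nearest whole percent.
shares = {status: round(100 * n / total) for status, n in status_counts.items()}

for status, n in status_counts.items():
    print(f"{status:12s} {n:4d}  {shares[status]:3d}%")
```

The counts sum to 395 and the solved share rounds to 96%, matching the figures quoted on the slide.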

Page 4: 5 long-standing issues

1 infrastructure issue
– #55697: 2010-03-10, FR-CCIN2P3, BGP flapping with CH-CERN
  • Ongoing issue, root cause not yet found, ~1 flap/day, not service affecting

4 administrative issues
– #48335: 2009-04-30, Additional prefix for CA-TRIUMF
  • Missing notification of acceptance from NDGF, UK-T1-RAL and US-FNAL-CMS
– #52959: 2009-11-04, UK-T1-RAL, Review of LHCOPN twiki
  • Only the routing policies remain to be updated
– #56415: 2010-03-12, NDGF, Review of LHCOPN twiki
  • Not started
– #56417: 2010-03-12, CA-TRIUMF, Review of LHCOPN twiki
  • Not started

The Ops phone conference does not seem very effective at getting these solved

Page 5: Overall breakdown per category and type of problem

80% of tickets are L2 related events

Page 6: Number of tickets put in the TTS per month

AVG: 23 tickets/month
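As a rough cross-check, this average is consistent with the 395 tickets reported in the TTS since 2009-02. The sketch below assumes the count runs through the month of the meeting (2010-06) and that both end months are counted in full; the slides do not state that window, and the helper name `months_inclusive` is illustrative.

```python
from datetime import date


def months_inclusive(start: date, end: date) -> int:
    """Number of calendar months from start to end, counting both end months."""
    return (end.year - start.year) * 12 + (end.month - start.month) + 1


total_tickets = 395             # from the TTS figures reported earlier
first_month = date(2009, 2, 1)  # "since 2009-02"
last_month = date(2010, 6, 1)   # ASSUMPTION: month of the meeting

n_months = months_inclusive(first_month, last_month)
average = total_tickets / n_months
print(f"{n_months} months, {average:.1f} tickets/month")
```

Under these assumptions the window spans 17 months, which gives an average that rounds to 23 tickets/month.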

Page 7: Ticket ownership per site

Nearly 1/4 of tickets
NL-T1 has 6 LHCOPN links

Page 8: Ownership of tickets per month per site

Page 9: Kind of tickets per month

Page 10: KPI-1: Infrastructure vs operations behavior

Less than 15 “significant” events / month?

Page 11: Change management

Only 5 tickets flagged as « change »!
– Is the infrastructure that stable?

Flag set on GGUS submit interface

Page 12: Conclusion on TTS stats

L2 events are regular, hence well managed
NL-T1 seems to have a very good implementation of the Ops model
Administrative stuff frozen
– Twiki review, change management etc.
  • Not fascinating but the vital minimum
Decrease in the monthly number of tickets
– Feeling from sites that not all tickets are useful
– Need to ensure the vital minimum is there by correlating with monitoring

Page 13: Backup tests?

Previously agreed: each resilience possibility should be demonstrated at least once a year
– Failures can count as a test if they are properly reported (particularly paths’ symmetry)
Only two sites have reported a backup test or a demonstration of backup efficiency for 2010
  • https://twiki.cern.ch/twiki/bin/view/LHCOPN/LhcopnBackupTestsResults2010
No recent change in the infrastructure, so no need to test?

Page 14: Following MDM deployment related issues

Only deployment issues?
– Physical set up etc.
– Interaction with sites
Should be tracked through tickets
– Still in GN3 helpdesk system?
  • GN3 people have no access to GGUS
  • LHCOPN people have no access to GN3 helpdesk
Should be visible
– How, where?

Page 15: What’s missing to go ahead?

Network SLD
– What is a « significant » event requiring care etc.
Monitoring
– Do we have service impacting events?
– Correlation with Operations
– Evidence instead of feelings
  • Particularly for performance issues
Fill the gap between WLCG Ops and LHCOPN Ops
– Gap by design but bridge expected

Page 16: Relationships with WLCG (1/4)

A lot of work previously done by Wayne
– Clear overview during the Vancouver presentation
  • http://indico.cern.ch/materialDisplay.py?contribId=17&materialId=slides&confId=59842
  • Agreement from WLCG about!
  • Only missing careful implementation?
Minimum relationships should be made of
– Exchanges during meetings
– Operational exchange
  • Clear process and KPI around
    – Facilitated with tickets’ linking
  • Dashboard of service affecting issues
– Sharing LHCOPN monitoring information

Page 17: Relationships with WLCG (2/4)

Main stoppers
– Meetings
  • Not acting and represented as a whole community through a LHCOPN representative or “liaison officer”
  • Too often asked to be there « just in case »
– Operational exchanges
  • Complex and hard to get used to, with very few issues involving WLCG (~1 every 3 months?)
    – Post mortem analysis hard as a lot of exchanges seem to be off the record
    – Now high resiliency network
    – A lot of things are sites’ internal processes
  • Common use of GGUS is giving a false feeling of relationships
    – We are not doing user support!
  • Mistake to assume we can handle all network issues from our isolated island with our closed set of supporters
    – Need coordination and action from other teams (storage…)
    – Problem to interact with WLCG supporters

Page 18: Relationships with WLCG (3/4)

Sample expected workflow for WLCG inquiries:

[Flowchart; only the labels are recoverable]
– Actors: Experiment, Site Contact, Site Network Team, Relevant WLCG Team, Relevant Network Team
– Ticket systems: WLCG GGUS, LHCOPN GGUS, Internal Ticket System
– Decision points: Networking? (Yes/No), LHCOPN related? (Yes/No)
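The two decision points named in this workflow (Networking? and LHCOPN related?) can be read as a small routing function. The sketch below is only one plausible reading: the extracted slide text does not preserve the arrows of the flowchart, so the mapping from answers to ticket systems is an assumption, and the names `Queue` and `route_inquiry` are hypothetical.

```python
from enum import Enum


class Queue(Enum):
    """Ticket systems named on the slide."""
    WLCG_GGUS = "WLCG GGUS"
    LHCOPN_GGUS = "LHCOPN GGUS"
    INTERNAL = "Internal Ticket System"


def route_inquiry(is_networking: bool, is_lhcopn_related: bool) -> Queue:
    """Pick a ticket system for an inquiry handled by the Site Contact.

    ASSUMED mapping (not confirmed by the slide):
      - not a networking issue        -> stays in WLCG GGUS
      - networking and LHCOPN related -> LHCOPN GGUS
      - networking but site-local     -> site's internal ticket system
    """
    if not is_networking:
        return Queue.WLCG_GGUS
    if is_lhcopn_related:
        return Queue.LHCOPN_GGUS
    return Queue.INTERNAL


# Example: a networking problem that involves an LHCOPN link.
print(route_inquiry(is_networking=True, is_lhcopn_related=True).value)
```

Whatever the exact arrows were, the point of the slide is that only inquiries answering "yes" to both questions should reach the LHCOPN helpdesk.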

Page 19: Relationships with WLCG (4/4)

Workplan
– A dashboard showing tickets impacting WLCG
  • Done: Particular view on the dashboard
– Ability to link WLCG and LHCOPN tickets
  • Upcoming: Parent/Child relationship
    – Cross reference still here (no associated workflow)
  • But problem to interact with WLCG supporters
    – No cross helpdesk access to update tickets
– On site processes
  • Push for careful implementation of « Site’s contact »?
    – Internal site’s processes

Page 20: Around GGUS (1/6): GGUS status list

Page 21: Around GGUS (2/6): LHCOPN submit interface

Page 22: Around GGUS (3/6): WLCG submit interface

Page 23: Around GGUS (4/6): Merging Pros

Should we unify/merge LHCOPN helpdesk within the standard GGUS?
+ Consider networks like other resources (computing, storage, software...)
  • Networks are not standalone resources, coordination between sites required
+ Maybe better fit in reporting
  • True
+ Now standard way to send enquiries to sites?
  • Yes for Grid issues, not always for network teams, less Grid centred, unwilling to go at project level
  • But for a project’s dedicated network?
+ Maybe some central manpower could be gained
+ Regularly chasing pending tickets...
  • Very unclear who can do that, and if this will be successful (cf. twiki review)
+ Less specific software and support from GGUS
  • No key economy for them: Still using same database, hosts etc. and sharing some code
+ Ease interactions with WLCG supporters
  • Issues evolving in two different worlds
  • Write access to our helpdesk restricted to network teams

Page 24: Around GGUS (5/6): Merging Cons

– We have something stable and working
  • Definitely, but that should not prevent improvements
– Completely tailored for us and closely matching our operational model
  • Seems hard to merge frontends and unify workflows
– Be far from interferences with Grid world
  • Isolation could be achieved with particular views?
  • Was a key concern from network teams
– Not shaped to do user support
  • But coordinating network teams
  • Maintenance not in GGUS
• No strong preference from the GGUS team
  • Confirmed, not a problem for them

Page 25: Around GGUS (6/6): Conclusion about merging

Our helpdesk was designed to coordinate network teams, not to support WLCG users
– Really different from standard GGUS
– Appears as an internal coordination tool
Benefits not so clear, was mainly thought to ease integration in WLCG Ops
– But we are not doing Grid Ops
  • Network issue ≠ Grid issue
  • Networks are not standalone resources (storage, cpu etc.)
    – Similar to software issues handled externally (in savannah)
  • We should not be customer facing
    – Selected inquiries going through storage teams (“Site contact”)
Let’s also see how EGI will converge around user support

Page 26: Conclusion about LHCOPN Operations

Ops status: Clear room for improvement
– Unequal following of processes by sites, because a clear feeling of usefulness and evidence of network failures are missing
– L2 events well handled while administrative workflow is forgotten
WLCG relationships to be implemented and nurtured
– Performance issues need smart and timely solving
– Skeleton of coordination with WLCG Ops to be improved
No outstanding benefit to unify LHCOPN helpdesk with WLCG’s one
– Maybe better and enough to carefully link our workflow with WLCG Ops
Wait for monitoring & SLDs before next set of improvements
– Timeline?
– Particularly revitalise tickets’ handling and ensure the minimum is there

Page 27: Questions

1. Pushing for administrative things to be done?
– twiki review, backup tests etc.
2. LHCOPN representative?
– Maybe not responsible for Ops but more liaising as a single contact point
– Share and justify the workload
3. GGUS merging
– Opinion?