LHCOPN operational working group report Guillaume Cessieux (FR-CCIN2P3 / EGEE-SA2) on behalf of the Ops WG LHCOPN meeting, 2008-10-16, Copenhagen

LHCOPN operational working group report

Guillaume Cessieux (FR-CCIN2P3 / EGEE-SA2) on behalf of the Ops WG

LHCOPN meeting, 2008-10-16, Copenhagen

Schedule

1 - Ops WG:
• Ops WG: Who, what, when
• The proposed tightened operational model
• Remaining work

2 - Things around GGUS and demo

Background

• Last LHCOPN meeting (CERN, June 2008): actions on Operations
– A1: Put together a working group to complete the ops models and publish
– A2: Take input from ISHARE work of GN2
– A3: Clarify the operational issues with E2ECU
  • What is the status of the E2ECU? What does it manage?
  • What is the perfSONAR deployment status?
  • How is the E2ECU service measured?
– A4: Demonstration of the GGUS/OPN ticketing system at the next meeting
– A5: Regular tests must be part of the operational procedures

Current way of following up LHCOPN problems

Essence of:
• Confusion
– No guidelines, no roles, no responsibilities
• Hope
– Mail with 10 people in CC

Result:
– Running around like headless chickens (c)
– No transparency
– Operational model required

Operational working group: reloaded

• 11 members after public call for membership
• Very interesting mix of viewpoints
– 1 NREN, 5 sites, DANTE, EGEE
• Administrative things
– project-lhcopn-opswg AT cern.ch
– https://twiki.cern.ch/twiki/bin/view/LHCOPN/OpsWG

Members:
– Emma Apted (DANTE)
– David Foster (CH-CERN)
– Ludwig Pregernig (CH-CERN)
– Gerard Bernabeu (ES-PIC)
– Bruno Hoeft (DE-KIT)
– Franck Simon (RENATER, FR)
– James Casey (CH-CERN, EGEE-SA1)
– Xavier Jeannin (CNRS, EGEE-SA2)
– Robin Tasker (UK-T1-RAL)
– Guillaume Cessieux (FR-CCIN2P3, EGEE-SA2)
– Edoardo Martelli (CH-CERN)

Main work done

• Two fruitful meetings
  • September 9-10th @ CERN
  – http://indico.cern.ch/conferenceDisplay.py?confId=37175
  • October 9-10th @ CERN
  – http://indico.cern.ch/conferenceDisplay.py?confId=38583
• The new method of documenting proved powerful
• Tightening the operational model
– Concrete proposal
– Light and driven by things currently working

The proposed operational model

• Now structured, explained and published on the twiki: https://twiki.cern.ch/twiki/bin/view/LHCOPN/OperationalModel
• Key changes
– Simplification
– E2ECU’s role
– Grid interactions removed

Structure of the Ops model

• Foundation
– Drawing convention
– Actors & information repository management
• Processes:
– Incident
– Change
• Maintenance

Drawing conventions

[Diagram: legend of the notation used in the process diagrams]
• Elements: actors (A, B, C, D, with the current implementation in brackets), information repositories (1, 2), processes (E)
• * = possible initiator of the process
• Relations:
– A is responsible for 1 (the set up, not its contents)
– A starts process E
– A « interacts » with B
– B reads and writes into 1
– C reads into 2
– 2 notifies D (alarms…)
– 1 and 2 exchange TT
• Optional (relations) or not yet existing (actors and information repositories), e.g. B may « interact » with C
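The legend above can be read as a small typed graph. The following is only a sketch of that reading: the class and function names (`Relation`, `Edge`, `can_read`, `can_write`) are made up for illustration and are not part of the operational model.

```python
from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    """Relation kinds taken from the drawing-convention legend."""
    RESPONSIBLE_FOR = "is responsible for"
    STARTS = "starts process"
    INTERACTS = "interacts with"
    READS = "reads"
    READS_WRITES = "reads and writes"
    NOTIFIES = "notifies"
    EXCHANGES_TT = "exchanges TT with"


@dataclass(frozen=True)
class Edge:
    source: str
    relation: Relation
    target: str
    optional: bool = False  # "optional" marking of the legend


# The example relations shown on the legend slide.
LEGEND = [
    Edge("A", Relation.RESPONSIBLE_FOR, "1"),
    Edge("A", Relation.STARTS, "E"),
    Edge("A", Relation.INTERACTS, "B"),
    Edge("B", Relation.READS_WRITES, "1"),
    Edge("C", Relation.READS, "2"),
    Edge("2", Relation.NOTIFIES, "D"),
    Edge("1", Relation.EXCHANGES_TT, "2"),
    Edge("B", Relation.INTERACTS, "C", optional=True),
]


def describe(edge: Edge) -> str:
    """Render an edge the way the legend words it."""
    text = f"{edge.source} {edge.relation.value} {edge.target}"
    return f"{text} (optional)" if edge.optional else text


def can_read(actor: str, repository: str, edges=LEGEND) -> bool:
    """True if the actor has a read or read/write relation to the repository."""
    allowed = (Relation.READS, Relation.READS_WRITES)
    return any(e.source == actor and e.target == repository
               and e.relation in allowed for e in edges)


def can_write(actor: str, repository: str, edges=LEGEND) -> bool:
    """True only for a read/write relation."""
    return any(e.source == actor and e.target == repository
               and e.relation is Relation.READS_WRITES for e in edges)
```

Each process diagram in the following slides could be written as such a list of edges, which makes questions like "who may write into the TTS?" checkable.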

LHCOPN Actors

[Diagram: actors of the LHCOPN]
• Organisations: Grid projects (LCG (EGEE)); sites (T0/T1); L2 network providers (GÉANT2, NRENs…), European / non-European, public/private
• Actors: LCU, NOC / router operators, Grid data managers, L2 NOC, DANTE Operation
• Other labels: Infrastructure, Operators, Users

Actors and information repositories management

[Diagram: which actor is responsible for which information repository]
• Legend: actor; information repository; relation "A is responsible for B"
• Actors: DANTE Operation, LCU (ENOC), Grid project operation (EGEE SA1), L2 NOC
• Information repositories: Grid TTS (GGUS), LHCOPN TTS (GGUS), global web repository (Twiki), L2 monitoring (perfSONAR e2emon), L3 monitoring
• Items listed: operational procedures, operational contacts, technical information, change management DB, statistics reports
• Other labels: MDM, BGP

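Information access

[Diagram: who reads or writes which information repository]
• Legend: "A reads B"; "A reads and writes B"
• Actors: sites, LCU (ENOC), L2 network providers, L2 NOCs
• Information repositories: LHCOPN TTS (GGUS), L2 monitoring (perfSONAR e2emon), L3 monitoring, global web repository (Twiki), statistics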

Problem management process

[Diagram: entry point into incident management, steps numbered 1 to 5]
• Legend: "A goes to process B"; "A reads B"; "A interacts with B"
• Site: * router operators, * Grid data manager
• Information repositories: global web repository (Twiki), L2 - L3 monitoring, LHCOPN TTS (GGUS)
• Flow: Start, L3 incident management, OK, L2 incident management, OK, escalated incident management
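One possible reading of the flow on this slide is a simple chain: look at L3 first, move to L2 if L3 is fine, and escalate if L2 is fine too. The sketch below assumes that each "OK" label means "this layer was checked and found healthy", which the slide does not state explicitly. The function name is hypothetical.

```python
# Order in which the incident processes appear on the slide.
FLOW = (
    "L3 incident management",
    "L2 incident management",
    "escalated incident management",
)


def next_process(l3_ok: bool, l2_ok: bool) -> str:
    """Pick the process to enter, assuming 'OK' means the layer is healthy."""
    if not l3_ok:
        return FLOW[0]  # problem located at L3
    if not l2_ok:
        return FLOW[1]  # L3 is fine, problem located at L2
    return FLOW[2]      # both layers look fine, yet the problem persists
```

For example, `next_process(l3_ok=True, l2_ok=False)` selects the L2 incident management process.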

L3 Incident management process

[Diagram: steps 1.1, 1.2, (1.3), 1.4 and 2]
• Legend: "A notifies B"; "A interacts with B"; "A reads and writes B"; "A goes to process B"
• Source site involved: * router operators, Grid data manager
• Site involved: router operators
• Other sites
• Information repository: LHCOPN TTS (GGUS)
• Linked process: L2 incident management

Scope: Router down, BGP filtering, bad routing...

L2 Incident management process

[Diagram: steps 1.1, 1.2, 1.3, 2 and (3)]
• Legend: "A notifies B"; "A interacts with B"; "A reads and writes B"
• Possible initiators (*): L2 NOC, router operators, end of L3 incident management
• Sites linked: Grid data manager, router operators
• All sites
• Information repository: LHCOPN TTS (GGUS)
• Linked process: escalated incident management (3)

Scope: Dark fibre outages...

Escalated incident management

• If the problem is not understood or solved within a reasonable delay
• Backup process
– Started by router operator
– Phone conference with all potentially involved actors
– Work plan to fix the issue to be decided
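The escalation rule can be sketched as a timeout check. This is an illustration only: the slides give no value for the "reasonable delay", so `REASONABLE_DELAY` below is an arbitrary placeholder, and the function names are hypothetical.

```python
from datetime import datetime, timedelta

# Placeholder only: the slides do not define the "reasonable delay".
REASONABLE_DELAY = timedelta(hours=4)


def should_escalate(opened_at: datetime, now: datetime, solved: bool,
                    delay: timedelta = REASONABLE_DELAY) -> bool:
    """Escalate when the incident is still unsolved after the delay."""
    return (not solved) and (now - opened_at > delay)


def escalation_steps(involved_actors) -> list:
    """The three steps of the backup process, in the order of the slide."""
    return [
        "router operator starts the escalated process",
        f"phone conference with: {', '.join(sorted(involved_actors))}",
        "agree on a work plan to fix the issue",
    ]
```

Keeping the delay as a parameter leaves room for a value to be agreed later by the community.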

Change vs Maintenance

• Change management is a top process
– Tracks and documents changes
• Change with impacts
– Carried out together with a maintenance
• Some maintenance involves no change...

L3 Change Management

[Diagram: steps 1.1, 1.2, 2.1, 2.2, (2.3), 3 and (4)]
• Legend: "A notifies B"; "A interacts with B"; "A reads and writes B"
• Source site: * router operators, Grid data manager
• Affected sites: router operators
• Linked sites, all sites
• Information repositories: global web repository (Twiki), monitoring, LHCOPN TTS (GGUS)
• Linked process: L3 maintenance management

Scope: IP addresses change, new prefix propagated, new filtering

L2 Change Management

[Diagram: steps 1.1, 1.2, 1.3, 2.1, 2.2, 2.3, 3 and (4)]
• Legend: "A notifies B"; "A interacts with B"; "A reads and writes B"
• Possible initiator (*): L2 NOC
• Linked site: Grid data manager, router operators
• Linked sites, affected sites: router operators
• All sites
• Information repositories: global web repository (Twiki), monitoring, LHCOPN TTS (GGUS)
• Linked processes: L2 maintenance management, L3 change management

Scope: New LHCOPN L2 link, change of L2 network provider for a segment...

L3 Maintenance management process

[Diagram: steps 1.1, 1.2, 2 and 3]
• Legend: "A notifies B"; "A interacts with B"; "A reads and writes B"
• Source sites: * router operators, Grid data manager
• Impacted sites: router operators
• All sites
• Information repository: LHCOPN TTS (GGUS)

Scope: scheduled power outage on site, router IOS upgrade, ...

L2 Maintenance management process

[Diagram: steps 1.1, 1.2, 1.3, 1.4 and 2]
• Legend: "A notifies B"; "A interacts with B"; "A reads and writes B"
• Possible initiator (*): L2 NOC
• Linked sites: Grid data manager, router operators
• All sites
• Information repository: LHCOPN TTS (GGUS)

Scope: optical transmitter to be changed, fibre physically rerouted...
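The "Scope" lines of the six process slides can be gathered into one lookup table. The sketch below does only that: the example events are copied from the slides, while the exact-match lookup and the name `process_for` are illustrative assumptions, not a classification rule defined by the Ops WG.

```python
from typing import Optional

# Example events per process, as given in the "Scope" lines of the slides.
SCOPES = {
    "L3 incident management": [
        "router down", "BGP filtering", "bad routing"],
    "L2 incident management": [
        "dark fibre outage"],
    "L3 change management": [
        "IP addresses change", "new prefix propagated", "new filtering"],
    "L2 change management": [
        "new LHCOPN L2 link", "change of L2 network provider for a segment"],
    "L3 maintenance management": [
        "scheduled power outage on site", "router IOS upgrade"],
    "L2 maintenance management": [
        "optical transmitter to be changed", "fibre physically rerouted"],
}


def process_for(event: str) -> Optional[str]:
    """Return the process whose scope lists this event, or None if unknown."""
    wanted = event.strip().lower()
    for process, examples in SCOPES.items():
        if any(wanted == example.lower() for example in examples):
            return process
    return None
```

The scope lists on the slides end with "...", so the table is not exhaustive and an unknown event returns `None` rather than a guess.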

Sample workflow for L2 incident/maintenance

- Delay and reliability of the propagation
+ The way it currently works!

[Diagram: steps 1 to 4]
• Path: Site A, NREN A, * NREN B, NREN C, Site B
• LHCOPN TTS (GGUS), all sites, users

Remaining areas of work

• Grid interactions (the users!)
• Grid data managers
• Authentication for LHCOPN community
– Certificate!
– Restricted area on twiki... But with twiki/CERN account
• Quality assessment
– Network, processes, monitoring (L2)
• Implementation details
– Tools (GGUS...), notifications, communication channels...
– Lot of work...

Conclusion about ops model

• Not perfect ... open to improvements
– Ops WG ready for improvement process
– Constructive feedback is welcome
• Need also to know if this is suitable
– Key responsibilities on sites
– Commitment from actors to follow it?
• Guillaume.Cessieux AT cc.in2p3.fr
• Implementation needed ASAP

GGUS supporting the LHCOPN

Guillaume Cessieux (FR-CCIN2P3 / EGEE-SA2)

Thanks to the GGUS team

LHCOPN meeting, 2008-10-16, Copenhagen

Strong acknowledgements

• to the GGUS team (DE-KIT / EGEE-SA1), particularly:
– Torsten Antoni
– Helmut Dres
– Guenter Grein

For providing the LHCOPN helpdesk

Schedule

• What is GGUS
• Why GGUS
• Live Demo!
– Main screenshots in slides
• Remaining work

What is GGUS

Why GGUS for the LHCOPN?

+ Existing, reliable, mature, secure, well known
+ Key features: tracking of events, reminders, notifications...
+ Grid world
+ Successful experience with the ENOC
+ Web interface and web services access
- Very complex
- Grid world? Sustainability as part of EGEE

What we have for LHCOPN

• Ticket handling
– By router operators
– Tickets ‘public’ for anyone authenticated to GGUS
  • Only ‘support staff’ can act on them
• Dashboard for LHCOPN tickets
– Will be mapped onto a calendar = planning
• Really tailored for the LHCOPN
– Grid complexity removed! VO, ROC, TPM...
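The two access levels described on this slide (any authenticated user can view tickets, only support staff can act on them) amount to a small permission check. The sketch below is an illustration only: `User`, `can_view` and `can_act` are hypothetical names and do not reflect the actual GGUS implementation or its interfaces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    authenticated: bool          # e.g. holds a certificate known to the helpdesk
    support_staff: bool = False  # e.g. a router operator registered as support staff


def can_view(user: User) -> bool:
    """Tickets are 'public' for anyone authenticated."""
    return user.authenticated


def can_act(user: User) -> bool:
    """Only authenticated support staff may update tickets."""
    return user.authenticated and user.support_staff
```

With this split, a Grid user who is authenticated can follow an LHCOPN ticket but cannot modify it.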

GGUS architecture

[Diagram: support units connected to the central GGUS application]
• Central application (GGUS), with web portal and interface
• TPM
• VO support: ALICE, BIOMED, ESR, …
• Operations support: ROC 1 … ROC 12, with RC 1 … RC X
• Deployment support: DS 1 … DS 5
• Middleware support: MS 1 … MS 8
• Network support: ENOC
• LHCOPN support

Live Demo

Submit form

Ticket view & history

Update form

LHCOPN dashboard

Remaining work

• Authentication!
– List of certificates to be gathered
• Strategy for reminders and notifications
– Target e-mails to be gathered
→ Private area on twiki!
• Template for common tickets
– Minimum information required…
• Light documentation

Conclusion around GGUS

• LHCOPN helpdesk available
• Several details to be sorted out before production use
– Authentication, notification, workflow, documentation, ...
– Not yet perfect, will follow the ops model
• Will it be accepted?

Questions & discussion
