LHCOPN operational working group report
Guillaume Cessieux (FR-CCIN2P3 / EGEE-SA2) on behalf of the Ops WG
LHCOPN meeting, 2008-10-16, Copenhagen
Schedule
1 - Ops WG:
• Ops WG: Who, what, when
• The proposed tightened operational model
• Remaining work
2 – Things around GGUS and demo
2GCX - LHCOPN meeting - 2008-10-16
Background
• Last LHCOPN meeting (CERN, June 2008): actions on Operations
– A1: Put together a working group to complete the ops models and publish
– A2: Take input from ISHARE work of GN2
– A3: Clarify the operational issues with E2ECU
  • What is the status of the E2ECU? What does it manage?
  • What is the perfSONAR deployment status?
  • How is the E2ECU service measured?
– A4: Demonstration of the GGUS/OPN ticketing system at the next meeting
– A5: Regular tests must be part of the operational procedures
Current way to follow LHCOPN troubles
Essence of:
• Confusion
– No guidelines, no roles, no responsibilities
• Hope
– Mail with 10 people in CC
Result:
– Running around like headless chickens (c)
– No transparency
– Operational model required
Operational working group: reloaded
• 11 members after public call for membership
– Emma Apted (DANTE)
– Gerard Bernabeu (ES-PIC)
– James Casey (CH-CERN, EGEE-SA1)
– Guillaume Cessieux (FR-CCIN2P3, EGEE-SA2)
– David Foster (CH-CERN)
– Bruno Hoeft (DE-KIT)
– Xavier Jeannin (CNRS, EGEE-SA2)
– Edoardo Martelli (CH-CERN)
– Ludwig Pregernig (CH-CERN)
– Franck Simon (RENATER, FR)
– Robin Tasker (UK-T1-RAL)
• Very interesting mix of viewpoints
– 1 NREN, 5 sites, DANTE, EGEE
• Administrative things
– project-lhcopn-opswg AT cern.ch
– https://twiki.cern.ch/twiki/bin/view/LHCOPN/OpsWG
Main work done
• Two fruitful meetings
– September 9-10th @ CERN: http://indico.cern.ch/conferenceDisplay.py?confId=37175
– October 9-10th @ CERN: http://indico.cern.ch/conferenceDisplay.py?confId=38583
• The new documentation method proved powerful
• Tightening the operational model
– Concrete proposal
– Light and driven by things currently working
The proposed operational model
• Now structured, explained and published on twiki: https://twiki.cern.ch/twiki/bin/view/LHCOPN/OperationalModel
• Key changes
– Simplification
– E2ECU’s role
– Grid interactions removed
Structure of the Ops model
• Foundation
– Drawing convention
– Actors & Information repository management
• Processes:
– Incident
– Change
• Maintenance
Drawing conventions
• Shapes: actor (A, B, C, D), information repository (1, 2), process (E)
• * Actor A (current implementation): the star marks a possible initiator of the process
• A is responsible for 1 (the set-up, not its contents)
• A starts process E
• A « interacts » with B
• B may « interact » with C
• B reads and writes into 1
• C reads into 2
• 2 notifies D (alarms…)
• 1 and 2 exchange TT
• A separate marking denotes optional (relations) or not yet existing (actors and information repositories)
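The conventions above amount to a small labelled graph: nodes are actors, information repositories and processes, and each arrow carries one relation type. The Python sketch below is illustrative only. The names `Relation`, `Edge`, `Diagram` and `readers_of` are assumptions introduced here and are not part of the operational model; only the relation labels and the A/B/C/D/E/1/2 examples come from the slide.

```python
from dataclasses import dataclass, field
from enum import Enum


class Relation(Enum):
    """Relation types named in the drawing conventions."""
    RESPONSIBLE_FOR = "is responsible for"  # actor -> repository (set-up only)
    STARTS = "starts"                       # actor -> process
    INTERACTS = "interacts with"            # actor -> actor
    READS = "reads"                         # actor -> repository
    READS_WRITES = "reads and writes"       # actor -> repository
    NOTIFIES = "notifies"                   # repository -> actor
    EXCHANGES_TT = "exchanges TT with"      # repository <-> repository


@dataclass(frozen=True)
class Edge:
    source: str
    relation: Relation
    target: str
    optional: bool = False  # optional relation, as in the legend


@dataclass
class Diagram:
    edges: list = field(default_factory=list)

    def add(self, source, relation, target, optional=False):
        self.edges.append(Edge(source, relation, target, optional))

    def readers_of(self, repository):
        """Actors that read (or read and write) the given repository."""
        return sorted(
            e.source for e in self.edges
            if e.target == repository
            and e.relation in (Relation.READS, Relation.READS_WRITES)
        )


# The example relations shown on the legend slide.
legend = Diagram()
legend.add("A", Relation.RESPONSIBLE_FOR, "1")
legend.add("A", Relation.STARTS, "E")
legend.add("A", Relation.INTERACTS, "B")
legend.add("B", Relation.INTERACTS, "C", optional=True)  # "B may interact with C"
legend.add("B", Relation.READS_WRITES, "1")
legend.add("C", Relation.READS, "2")
legend.add("2", Relation.NOTIFIES, "D")
legend.add("1", Relation.EXCHANGES_TT, "2")
```

Encoding the process diagrams this way would make questions such as "who reads the ticketing system?" answerable by a simple query instead of by reading the drawing.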
LHCOPN Actors
[Diagram of the actors; the surviving labels are listed below]
• Grid Projects (LCG (EGEE))
• Sites (T0/T1)
• L2 network providers (GÉANT2, NRENs…): European / non-European, public / private
• LCU
• NOC / Router operators
• Grid data managers
• L2 NOC
• DANTE Operation
• Other labels: Infrastructure, Operators, Users
Actors and information repositories management
[Diagram; legend: actor, information repository, "A is responsible for B"]
• Actors: DANTE Operation, LCU (ENOC), Grid Project operation (EGEE SA1), L2 NOC
• Information repositories: Grid TTS (GGUS), LHCOPN TTS (GGUS), Global web repository (Twiki), L2 monitoring (perfSONAR e2emon), L3 monitoring
• Other labels: MDM, BGP, Operational procedures, Operational contacts, Technical information, Change management DB, Statistics reports
Information access
[Diagram; legend: "A reads B", "A reads and writes B"]
• Actors: Sites, LCU (ENOC), L2 network providers, L2 NOC (shown three times)
• Information repositories: LHCOPN TTS (GGUS), L2 monitoring (perfSONAR e2emon), L3 monitoring, Global web repository (Twiki), Statistics
Problem management process
[Diagram; legend: "A reads B", "A interacts with B", "A goes to process B"; steps numbered 1 to 5]
• Actors: Site (* Router operators, * Grid Data manager)
• Information repositories: Global web repository (Twiki), L2 - L3 monitoring, LHCOPN TTS (GGUS)
• Flow: Start → L3 incident management → OK → L2 incident management → OK → escalated incident management
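One plausible reading of the flow on this slide is a chain of stages tried in order: L3 incident management first, then L2 incident management, then escalated incident management, stopping as soon as a stage resolves the problem. The sketch below encodes that reading. The function name `run_problem_management` and the idea of passing one callable per stage are assumptions made for illustration; the slide only names the three stages and their order.

```python
from typing import Callable, Dict, List, Tuple

# Stage names in the order they appear in the flow on the slide.
STAGES = [
    "L3 incident management",
    "L2 incident management",
    "escalated incident management",
]


def run_problem_management(
    handlers: Dict[str, Callable[[], bool]]
) -> Tuple[str, List[str]]:
    """Walk the stages in order until one reports the problem as solved.

    handlers maps each stage name to a callable returning True when the
    stage solved the problem. Returns the resolving stage (or "unresolved")
    together with the list of stages that were tried.
    """
    tried = []
    for stage in STAGES:
        tried.append(stage)
        if handlers[stage]():
            return stage, tried
    return "unresolved", tried
```

For example, a dark fibre outage would not be solved at L3 and would be handed to the L2 stage:

```python
result = run_problem_management({
    "L3 incident management": lambda: False,
    "L2 incident management": lambda: True,
    "escalated incident management": lambda: True,
})
```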
L3 Incident management process
Scope: Router down, BGP filtering, bad routing...
[Diagram; legend: "A notifies B", "A interacts with B", "A reads and writes B", "A goes to process B"]
• Actors: Source site involved (* Router operators, Grid Data manager), Site involved (Router operators), Other sites
• Information repository: LHCOPN TTS (GGUS)
• Linked process: L2 incident management
• Step labels: 1.1, 1.2, (1.3), 1.4, 2
L2 Incident management process
Scope: Dark fibre outages...
[Diagram; legend: "A notifies B", "A interacts with B", "A reads and writes B"]
• Actors: Sites linked (* Router operators, Grid Data manager), * L2 NOC, All sites
• Information repository: LHCOPN TTS (GGUS)
• Possible initiator: * End of L3 incident management
• Linked process: escalated incident management (3)
• Step labels: 1.1, 1.2, 1.3, 2, (3)
Escalated incident management
• If problem not understood or solved within reasonable delay
• Backup process
– Started by router operator
– Phoneconf with all potentially involved actors
– Workplan to fix issue to be decided
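The escalation trigger can be sketched as a simple time-based check. This is a minimal illustration under stated assumptions: the slides do not quantify "reasonable delay", so the 4-hour value below is a placeholder, and the function name `should_escalate` is hypothetical. The condition is read here as "the delay has elapsed and the problem is still not understood or not solved".

```python
from datetime import datetime, timedelta

# ASSUMPTION: the slides do not define "reasonable delay"; 4 hours is a placeholder.
REASONABLE_DELAY = timedelta(hours=4)

# Steps of the backup process, as listed on the slide.
ESCALATION_STEPS = [
    "Router operator starts the escalated process",
    "Phone conference with all potentially involved actors",
    "Decide on a workplan to fix the issue",
]


def should_escalate(opened_at, now, understood, solved, delay=REASONABLE_DELAY):
    """Return True once the delay has elapsed and the problem is still
    not understood or not solved (one reading of the slide's condition)."""
    if now - opened_at <= delay:
        return False
    return not (understood and solved)
```

Keeping the delay as a parameter leaves room for the working group to agree on a different value, or on different values for L2 and L3 incidents.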
Change vs Maintenance
• Change management is a top process
– Tracks and documents
• Change with impacts
– Committed with maintenance
• Some maintenances without change...
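The relationship between the two processes can be summarised as a small decision rule: a change always goes through change management, anything with an impact also goes through maintenance management, and a maintenance may exist without any change. The sketch below assumes two boolean inputs, `is_change` and `has_impact`; both names and the function `required_processes` are hypothetical and only illustrate the rule.

```python
def required_processes(is_change: bool, has_impact: bool) -> list:
    """Return the processes an event has to follow.

    - A change is always tracked and documented by change management.
    - An event with impact is committed through maintenance management.
    - A maintenance without change only needs maintenance management.
    """
    processes = []
    if is_change:
        processes.append("change management")
    if has_impact:
        processes.append("maintenance management")
    return processes
```

Under this reading, a router IOS upgrade with downtime but no configuration change would map to `required_processes(False, True)`, while a new prefix announced without any outage would map to `required_processes(True, False)`.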
L3 Change Management
Scope: IP addresses change, new prefix propagated, new filtering
[Diagram; legend: "A notifies B", "A interacts with B", "A reads and writes B"]
• Actors: Source site (* Router operators, Grid Data manager), Linked sites, Affected sites (Router operators), All sites
• Information repositories: Global web repository (Twiki), Monitoring, LHCOPN TTS (GGUS)
• Linked process: L3 maintenance management
• Step labels: 1.1, 1.2, 2.1, 2.2, (2.3), 3, (4)
L2 Change Management
Scope: New LHCOPN L2 link, change of L2 network provider for a segment...
[Diagram; legend: "A notifies B", "A interacts with B", "A reads and writes B"]
• Actors: Linked site (Router operators, Grid Data manager), * L2 NOC, Linked sites, Affected sites (Router operators), All sites
• Information repositories: Global web repository (Twiki), Monitoring, LHCOPN TTS (GGUS)
• Linked processes: L2 maintenance management, L3 change management
• Step labels: 1.1, 1.2, 1.3, 2.1, 2.2, 2.3, 3, (4)
L3 Maintenance management process
Scope: scheduled power outage on site, router IOS upgrade, ...
[Diagram; legend: "A notifies B", "A interacts with B", "A reads and writes B"]
• Actors: Source sites (* Router operators, Grid Data manager), Impacted sites (Router operators), All sites
• Information repository: LHCOPN TTS (GGUS)
• Step labels: 1.1, 1.2, 2, 3
L2 Maintenance management process
Scope: optical transmitter to be changed, fibre physically rerouted...
[Diagram; legend: "A notifies B", "A interacts with B", "A reads and writes B"]
• Actors: * L2 NOC, Linked sites (Router operators, Grid Data manager), All sites
• Information repository: LHCOPN TTS (GGUS)
• Step labels: 1.1, 1.2, 1.3, 1.4, 2
Sample workflow for L2 incident/maintenance
- Delay and reliability of the propagation
+ The way it currently works!
[Diagram; steps numbered 1 to 4]
• Elements: Site A, NREN A, * NREN B, NREN C, Site B, LHCOPN TTS (GGUS), All sites, Users
Remaining areas of work
• Grid interactions (the users!)
• Grid data managers
• Authentication for LHCOPN community
– Certificate!
– Restricted area on twiki... But with twiki/CERN account
• Quality assessment
– Network, processes, monitoring (L2)
• Implementation details
– Tools (GGUS...), notifications, communication channels...
– Lot of work...
Conclusion about ops model
• Not perfect ... open to improvements
– Ops WG ready for improvement process
– Constructive feedback is welcome
• Need also to know if this is suitable
– Key responsibilities on sites
– Commitment from actors to follow it?
• Guillaume.Cessieux AT cc.in2p3.fr
• Implementation needed ASAP
GGUS supporting the LHCOPN
Guillaume Cessieux (FR-CCIN2P3 / EGEE-SA2)
Thanks to the GGUS team
LHCOPN meeting, 2008-10-16, Copenhagen
Strong acknowledgements
• to the GGUS team (DE-KIT / EGEE-SA1), particularly:
– Torsten Antoni
– Helmut Dres
– Guenter Grein
For providing the LHCOPN helpdesk
Schedule
• What is GGUS
• Why GGUS
• Live Demo!
– Main screenshots in slides
• Remaining work
What is GGUS
Why GGUS for the LHCOPN?
+ Existing, reliable, mature, secure, well known
+ Key features: tracking of events, reminders, notifications...
+ Grid world
+ Successful experience with the ENOC
+ Web interface and web services access
- Very complex
- Grid world? Sustainability as part of EGEE
What we have for LHCOPN
• Ticket handling
– By router operators
– Tickets ‘public’ for anyone authenticated to GGUS
• Acting only by ‘support staff’
• Dashboard for LHCOPN tickets
– Will be mapped on a calendar = planning
• Really tailored for the LHCOPN
– Grid complexity removed! VO, ROC, TPM...
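The access rule stated above (tickets visible to anyone authenticated to GGUS, actions reserved to support staff) can be sketched as two predicates. This is not the GGUS API: the `User` class and the functions `can_view` and `can_act` are hypothetical names used only to make the rule explicit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    """Hypothetical user record; not a GGUS data structure."""
    name: str
    authenticated: bool = False   # authenticated to the helpdesk
    support_staff: bool = False   # member of 'support staff'


def can_view(user: User) -> bool:
    """Tickets are 'public' for anyone authenticated."""
    return user.authenticated


def can_act(user: User) -> bool:
    """Only authenticated support staff may act on tickets."""
    return user.authenticated and user.support_staff


router_operator = User("router-op", authenticated=True, support_staff=True)
site_reader = User("reader", authenticated=True)
anonymous = User("anonymous")
```

Separating viewing from acting keeps tickets transparent to the whole community while limiting who can change their state.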
GGUS architecture
[Diagram: a central application (GGUS), with an interface and a web portal, surrounded by support units]
• TPM
• VO Support: ALICE, BIOMED, ESR, …
• Deployment Support: DS 1 … DS 5
• Middleware Support: MS 1 … MS 8
• Operations Support: ROC 1 … ROC 12
• RC 1 … RC X (shown several times)
• Network Support - ENOC
• LHCOPN Support
Live Demo
Submit form
Ticket view & history
Update form
LHCOPN dashboard
Remaining work
• Authentication!
– List of certificates to be gathered
• Strategy for reminders and notifications
– Target e-mails to be gathered
→ Private area on twiki!
• Template for common tickets
– Minimum information required…
• Light documentation
Conclusion around GGUS
• LHCOPN helpdesk available
• Several details to be sorted out before production use
– Authentication, notification, workflow, documentation, ...
– Not yet perfect – will follow the ops model
• Will it be accepted?
Questions & discussion