MonALISA capabilities for the LHCOPN
LHCOPN meeting, March 2010, London

MonALISA Team: Iosif Legrand, Harvey Newman, Ramiro Voicu, Costin Grigoras, Ciprian Dobre, Alexandru Costan
USLHCNet Team: Harvey Newman, Artur Barczyk, Ramiro Voicu, Azher Mughal, Sandor Rozsa


Outline

MonALISA Framework
  Architecture
  Data handling
  Automatic actions
USLHCNet
  Network topology
  Monitoring modules
  Reliable monitoring & accounting
  Alarms & triggers
Conclusions

Ramiro Voicu, LHCOPN, London, March 2010

The MonALISA Architecture

Regional or Global High Level Services, Repositories & Clients

Secure and reliable communication
Dynamic load balancing
Scalability & Replication
AAA for Clients

Distributed Dynamic Registration and Discovery, based on a lease mechanism and remote events

Diagram labels: Network of JINI-Lookup Services (Secure & Public); MonALISA services; Proxies; HL services; Agents

Distributed System for gathering and analyzing information based on mobile agents: Customized aggregation, Triggers, Actions

Fully Distributed System with no Single Point of Failure
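The lease-based registration named on this slide can be illustrated with a minimal sketch. It is not the JINI API or MonALISA code: `LeaseRegistry`, its method names, and the 30-second lease are hypothetical choices made for illustration. The idea shown is that a service stays discoverable only while it keeps renewing its lease, so a crashed service drops out of lookups without explicit deregistration.

```python
import time
from typing import Callable, Dict, List


class LeaseRegistry:
    """Toy lookup service: entries expire unless their lease is renewed."""

    def __init__(self, lease_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._lease_seconds = lease_seconds
        self._clock = clock  # injectable so the behaviour can be tested
        self._expiry: Dict[str, float] = {}

    def register(self, service_name: str) -> None:
        """Grant (or renew) a lease for the named service."""
        self._expiry[service_name] = self._clock() + self._lease_seconds

    renew = register  # renewing is the same operation as registering

    def discover(self) -> List[str]:
        """Return services whose lease is still valid, dropping expired ones."""
        now = self._clock()
        self._expiry = {n: t for n, t in self._expiry.items() if t > now}
        return sorted(self._expiry)
```

A service that stops calling `renew` simply vanishes from `discover()` once its lease runs out, which is one way to avoid a central liveness checker.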

MonALISA Service & Data Handling

Data Store
Data Cache Service & DB
Configuration Control (SSL)
Predicates & Agents
Data (via ML Proxy)
Applications Clients or Higher Level Services
WS Clients and service
WebService
WSDL / SOAP
Lookup Service
Lookup Service
Registration
Discovery
Postgres
AGENTS
FILTERS / TRIGGERS
Monitoring Modules
Collects any type of information
Dynamic (Re)Loading
Push and Pull


Two levels of decisions:

local (autonomous),

global (correlations).

Actions triggered by:

values above/below given thresholds,

absence/presence of values,

correlations between any values.

Action types:

alerts (emails, instant messages, Atom feeds),

running an external command,

automatic chart annotations in the repository,

running custom code, such as securely ordering an ML service to (re)start a site service.
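The three trigger conditions listed above (thresholds, absence or presence of values, correlations) might be expressed as simple predicates over the latest samples. The sketch below is an illustration under assumptions, not the MonALISA actions API: the function names, the `Samples` layout, and the 300-second staleness window are all hypothetical.

```python
from typing import Dict, Tuple

# Latest sample per parameter name: (timestamp in seconds, value).
Samples = Dict[str, Tuple[float, float]]


def above_threshold(samples: Samples, key: str, limit: float) -> bool:
    """Threshold trigger: fires when the latest value exceeds the limit."""
    sample = samples.get(key)
    return sample is not None and sample[1] > limit


def is_absent(samples: Samples, key: str, now: float,
              max_age: float = 300.0) -> bool:
    """Absence trigger: fires when no value arrived within max_age seconds."""
    sample = samples.get(key)
    return sample is None or now - sample[0] > max_age


def correlated(samples: Samples, key_a: str, limit_a: float,
               key_b: str, limit_b: float) -> bool:
    """Correlation trigger: fires only when both conditions hold together."""
    return (above_threshold(samples, key_a, limit_a)
            and above_threshold(samples, key_b, limit_b))
```

Any of these predicates could then be wired to one of the action types (an alert, an external command, an annotation), with local services evaluating the first two and a global service evaluating correlations across sites.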

Local and Global Decision Framework

ML Service
ML Service
Global ML Services
Actions based on global information
Actions based on local information
• Traffic • Jobs • Hosts • Apps
• Temperature • Humidity • A/C Power • …
Sensors
Local decisions
Global decisions

USLHCNet

USLHCNet provides transatlantic connections of the Tier1 computing facilities at Fermilab and Brookhaven with the Tier0 and Tier1 facilities at CERN as well as Tier1s elsewhere in Europe and Asia.

Together with ESnet, Internet2 and GEANT, USLHCNet supports connections between the Tier2 centers.

The USLHCNet core infrastructure uses Ciena Core Director devices, which provide time-division multiplexing and packet-forwarding protocols that support virtual circuits with bandwidth guarantees. These virtual circuits make it possible to develop efficient data transfer services with support for QoS and priorities.

Hybrid network: uses both Ciena CD and Force10 routers

6 transatlantic 10G links at the moment


USLHCNet ML weather map

Monitoring modules

We developed a set of monitoring modules for USLHCNet network devices:

Force10 (SNMP & sFlow)
  Traffic per interface
  sFlow traffic
  Link status monitoring
Ciena Core Director (TL1 – Transaction Language 1)
  ETTP (Ethernet Termination Point) traffic
  EFLOW (Ethernet Flow) traffic
  OSRP (routing protocol) topology
  VCG Provisioned / Available Bandwidth
  Dynamic circuits inside the optical core of the network
Ping module/MLPing trigger, which sends alarms in case of packet loss
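A packet-loss alarm of the kind described for the ping module could be reduced to a ratio check. This is only a sketch: the function names and the 1% default threshold are assumptions, not MLPing's actual configuration.

```python
def packet_loss(sent: int, received: int) -> float:
    """Fraction of probes lost, in the range 0.0 to 1.0."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    if not 0 <= received <= sent:
        raise ValueError("received must be between 0 and sent")
    return (sent - received) / sent


def should_alarm(sent: int, received: int, max_loss: float = 0.01) -> bool:
    """True when the measured loss exceeds the tolerated fraction."""
    return packet_loss(sent, received) > max_loss
```

In practice the threshold and the number of probes per cycle would be tuned per link, since a single lost probe out of ten already reads as 10% loss.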

USLHCNet monitoring

MonALISA @GVA
MonALISA @CHI
MonALISA @NYC
MonALISA @AMS
Protocols: SNMP, TL1

USLHCNet redundant monitoring

MonALISA @GVA
MonALISA @CHI
MonALISA @NYC
MonALISA @AMS

Each circuit is monitored at both ends by at least two MonALISA services; the monitored data is aggregated by global filters in the repository.

Local and global filters

Based on the MonALISA actions framework, a set of triggers has been deployed inside the service to notify the USLHCNet network engineers by email, SMS and IM in case of problems.

The filters developed for the USLHCNet repository aggregate the redundant monitoring data (traffic and link status) collected from all the MonALISA services.

The link status is computed as a logical “AND” between both end points of a link. This also cross checks the status reported by the hardware equipment.

We collect data in two repository instances, each with replicated database back-ends. These instances are load-balanced dynamically through DNS.
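The link-status rule described above (a logical AND between both end points, fed by redundant monitoring) can be sketched as follows. How redundant reports for a single end point are merged is not stated in the slides; treating the end point as up when any service reports it up is an assumption, and the function names are hypothetical.

```python
from typing import Iterable


def endpoint_up(reports: Iterable[bool]) -> bool:
    """Combine redundant reports for one end point.

    Assumption: the end point counts as up if any monitoring service saw it
    up; the slides do not say how redundant reports are merged.
    """
    return any(reports)


def link_up(end_a_reports: Iterable[bool], end_b_reports: Iterable[bool]) -> bool:
    """Link status as the logical AND of both end points."""
    return endpoint_up(end_a_reports) and endpoint_up(end_b_reports)
```

Requiring both ends to agree means a one-sided fault still marks the link as down, which is what lets the aggregate cross-check the status reported by the hardware.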


USLHCNet: Precise measurements for the Operational Status on the WAN Link

Operations & management assisted by agent-based software
Used on the new CIENA equipment used for network management

USLHCNet: ALL EFLOW traffic - last 2 months

USLHCNet: Accounting for Integrated Traffic

USLHCNet: Ciena alarms monitoring

Topology monitoring and discovery

NETWORKS
AS
ROUTERS

Real Time Topology Discovery & Display

Storage discovery in Alice

France
Italy
USA
Russia
Nordic Countries

distance(IP, IP)
  Same IP-class network
  Common domain name
  Same AS
  Same country (+ function of RTT between the respective AS-es if known)
  If distance between the AS-es is known, use it
  Same continent
  Far away

distance(IP, Set<IP>): Client's public IP to all known IPs for the storage

C. Grigoras (Alice) – ACAT 2010
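The tiered `distance(IP, IP)` rule on this slide can be read as an ordered cascade of checks. The sketch below is one possible reading, not the Alice implementation: the `Host` record and its fields, the numeric ranks, the use of a shared IPv4 /24 as a stand-in for "same IP-class network", the scaling of the inter-AS measure, and taking the minimum for `distance(IP, Set<IP>)` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Iterable, Optional


@dataclass(frozen=True)
class Host:
    """Hypothetical per-IP metadata; how it is obtained is out of scope."""
    ip: str
    domain: str
    asn: int
    country: str
    continent: str


def _scaled(as_distance: float) -> float:
    """Map a non-negative inter-AS measure into the range [0, 1)."""
    return as_distance / (1.0 + as_distance)


def distance(a: Host, b: Host, as_distance: Optional[float] = None) -> float:
    """Rank-like distance: smaller means closer."""
    # Assumption: "same IP-class network" approximated by a shared IPv4 /24.
    if ip_address(b.ip) in ip_network(f"{a.ip}/24", strict=False):
        return 0.0
    if a.domain == b.domain:
        return 1.0
    if a.asn == b.asn:
        return 2.0
    if a.country == b.country:
        # Refine within the tier when an inter-AS measure (e.g. RTT) is known.
        return 3.0 + (_scaled(as_distance) if as_distance is not None else 0.5)
    if as_distance is not None:
        return 4.0 + _scaled(as_distance)
    if a.continent == b.continent:
        return 5.0
    return 6.0  # far away


def distance_to_set(client: Host, storage: Iterable[Host]) -> float:
    """Assumption: a storage is as close as its nearest known IP."""
    return min(distance(client, s) for s in storage)
```

Sorting candidate storage elements by `distance_to_set` would then give a client an ordered list to try, closest first.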

FDT Bandwidth tests in Alice (E2E av bw)

Newer kernel, tuned TCP buffers
Default kernels, default TCP buffers
Different trends = different kernels
100 Mbps network card
1 Gbps network card

http://monalisa.cern.ch/FDT/

Conclusions

The MonALISA framework provides a flexible and reliable monitoring infrastructure

350+ installed services, 1.5M+ unique parameters, 25 kHz value update rate

Truly distributed architecture with no single points of failure

Highly modular platform

Automatic decision-making capability at both local and global levels

USLHCNet provides a hybrid network with support for circuit oriented network services

Monitoring this infrastructure proved to be a challenging task, but we are running with 99.5+% monitoring uptime (100% in the last 6 months)

We are investigating dynamic provisioning of circuits from collaborating agents

http://monalisa.caltech.edu

http://repository.uslhcnet.org


Monitoring Optical Switches

Dynamic restoration of lightpath if a segment has problems

Controlling Optical Planes
Automatic Path Recovery

CERN Geneva
CALTECH Pasadena
Starlight
Manlan
USLHCNet
Internet2

“Fiber cut” simulations
The traffic moves from one transatlantic line to the other one
FDT transfer (CERN – CALTECH) continues uninterrupted
TCP fully recovers in ~20 s

FDT Transfer plot: 4 fiber cut simulations (marked 1 to 4), 200+ MBytes/sec from a 1U node