IT Monitoring WG IT/CS Monitoring System

CERN IT Department

CH-1211 Genève 23

Switzerland

www.cern.ch/it

IT Monitoring WG

IT/CS Monitoring System

Virginie Longo, September 14th 2011

Summary

CS Monitoring Systems
• Spectrum CA
• Performance Analysis
• Other Tools

Data Storage

Requirements
• NMS Status
• Requirements
• Research

CS Monitoring systems

Spectrum CA

Description:
• Commercial tool
• Fault-management-oriented system
• Root cause analysis / alarm correlation
• Topology view
• Service Manager => relation with SLS view
• Basic performance manager

Volumes:
• ~3000 devices monitored
• Supports 3K Laser devices for simple alarms (UP/DOWN)
• Thousands of attributes polled and analyzed
• 6 GB of event data over 30 days

Monitoring protocols:
• SNMP and ICMP
=> Information fed only by SNMP (no remote agent)
• A few others supported: DNS / DHCP / TRACEROUTE / NTP / HTTP
• A few home-made scripts for DHCP and web monitoring
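The root cause analysis and topology view mentioned above can be illustrated with a small sketch. This is not Spectrum's algorithm, only a generic topology-based suppression idea: starting from the monitoring station, a device that is down and still adjacent to a healthy, reachable device is reported as a probable cause, while down devices hidden behind it are treated as symptoms. The function name `root_causes` and the example topology are hypothetical.

```python
from collections import deque

def root_causes(topology, monitor, down):
    """Return the down devices nearest to the monitor on each path.

    topology: dict mapping a device name to its list of neighbours
    monitor:  name of the monitoring station (start of the search)
    down:     set of devices that do not answer polls
    """
    down = set(down)
    causes = set()
    seen = {monitor}
    queue = deque([monitor])
    while queue:
        node = queue.popleft()
        for neighbour in topology.get(node, ()):
            if neighbour in seen:
                continue
            seen.add(neighbour)
            if neighbour in down:
                causes.add(neighbour)    # first failed hop on this path
            else:
                queue.append(neighbour)  # keep walking through healthy devices
    return causes

# Hypothetical topology: one core router, two distribution and three access switches.
topology = {
    "nms": ["core1"],
    "core1": ["nms", "dist1", "dist2"],
    "dist1": ["core1", "access1", "access2"],
    "dist2": ["core1", "access3"],
    "access1": ["dist1"],
    "access2": ["dist1"],
    "access3": ["dist2"],
}

down = {"dist1", "access1", "access2"}
print(sorted(root_causes(topology, "nms", down)))
```

With `dist1` down, the two access switches behind it cannot be reached, so only `dist1` is reported and the other two alarms can be suppressed.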

Alarm Monitoring

Spectrum Architecture (Storage system)

[Diagram. Labelled components:
• Spectrum DB: models, topology, current polling values, alarms
• SNMP
• SSLogger
• Oracle: Stats (CSR)
• Oracle: Alarm History (LANDB)
• Alarm Notifier
• MySQL: Events
• Remote MySQL: Service Manager
• SLS
• Devices Info
Legend: Spectrum system / non-Spectrum system]

Performance Analysis

Statistics architecture:
- Mix of a home-made system and the Spectrum tool
- Data extracted from Spectrum into an Oracle DB
- Data consolidated into RRDs
- Displayed on the Netstat website (PHP)

Volumes:
- ~9000 models (ports + devices) for 24K RRDs
- 36 metrics
- 157 attributes
- ~160K entries loaded into the Oracle DB per 5-minute poll
- Data kept 1 month in Oracle
- 2 years of consolidated data in RRDs

Note: a metric is a group of attributes, e.g. Bandwidth = in/out bits and in/out packets.
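To give a feel for the volumes above, the following sketch works through the arithmetic implied by the slide figures. It assumes that "1 month" means 30 days and that the ~160K entries are loaded on every 5-minute poll; the attribute names in `METRICS` are hypothetical and only illustrate the "metric = group of attributes" note.

```python
ENTRIES_PER_POLL = 160_000   # ~160K entries per poll (from the slide)
POLL_INTERVAL_MIN = 5        # 5-minute poll (from the slide)
RETENTION_DAYS = 30          # assumption: "1 month" taken as 30 days

polls_per_day = 24 * 60 // POLL_INTERVAL_MIN               # 288 polls per day
entries_per_day = ENTRIES_PER_POLL * polls_per_day          # rows added per day
entries_retained = entries_per_day * RETENTION_DAYS         # rows held in Oracle
insert_rate = ENTRIES_PER_POLL / (POLL_INTERVAL_MIN * 60)   # average rows per second

# A metric groups several attributes; the attribute names are hypothetical.
METRICS = {
    "Bandwidth": ["in_bits", "out_bits", "in_packets", "out_packets"],
}

print(f"polls per day:    {polls_per_day}")
print(f"entries per day:  {entries_per_day:,}")
print(f"entries retained: {entries_retained:,}")
print(f"average inserts:  {insert_rate:.0f} rows/s")
```

Under these assumptions the Oracle DB receives about 46 million rows per day and holds roughly 1.4 billion rows at the end of the retention period, which is why consolidation into RRDs is used for the long-term history.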

Other Tools

Syslog event recording:
- Gathers all logs from network devices
- Stored in an Oracle DB
- Accessible from CSDB
- Filtering and propagation by notification

LHCOPN: perfSONAR tool
- Decentralized network tool
- Regular OWD (one-way delay), latency and throughput tests
- Other tools such as traceroute
- LHCOPN network analysis

Implementation ongoing: testing phase with a 1 Gb link; security tests not yet complete. (www.perfsonar.net)
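The "filtering and propagation by notification" step of the Syslog recording could look roughly like the sketch below. The slides do not describe the actual rules, so the rule table, the recipient names and the entry fields (`severity`, `message`) are all hypothetical. Severities follow the standard syslog scale, from 0 (emergency) to 7 (debug), where a lower number is more severe.

```python
import re

# Hypothetical rules: (least severe level still accepted, message pattern, recipient).
RULES = [
    (3, re.compile(r"link.*down", re.IGNORECASE), "network-ops"),
    (2, re.compile(r"."), "on-call"),  # any non-empty message at critical level or worse
]

def notifications(entry):
    """Return the recipients whose rule matches a parsed syslog entry."""
    return [
        recipient
        for max_severity, pattern, recipient in RULES
        if entry["severity"] <= max_severity and pattern.search(entry["message"])
    ]

print(notifications({"severity": 3, "message": "Link Gi0/1 down"}))
print(notifications({"severity": 6, "message": "Link Gi0/1 down"}))
```

Keeping the rules as data makes it possible to change the filtering without touching the code that reads the log table.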

Data storage

Data Storage

Summary:
• Spectrum proprietary DBs for core and alarms
• MySQL database for events and Service Manager
• Oracle database for stats (CSR) and alarm history (LANDB)
• Oracle database for Syslog info
• Standalone MySQL database for perfSONAR tools

=> Too many different types of storage.
=> Missing correlation between Syslog and SNMP.
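As an illustration of the missing Syslog/SNMP correlation, the sketch below joins the two sources on device name and a time window. It is only one possible approach: the record layout (`device`, `time`, `alarm`, `message`), the two-minute window and the sample data are assumptions, and in practice the rows would come from the separate Spectrum and Oracle stores listed above.

```python
from datetime import datetime, timedelta

def correlate(alarms, syslog, window=timedelta(minutes=2)):
    """Attach to each alarm the syslog messages from the same device within +/- window."""
    by_device = {}
    for entry in syslog:
        by_device.setdefault(entry["device"], []).append(entry)

    result = []
    for alarm in alarms:
        related = [
            entry["message"]
            for entry in by_device.get(alarm["device"], [])
            if abs(entry["time"] - alarm["time"]) <= window
        ]
        result.append((alarm["device"], alarm["alarm"], related))
    return result

# Hypothetical sample records.
alarms = [
    {"device": "sw-01", "time": datetime(2011, 9, 14, 10, 0, 30), "alarm": "LINK DOWN"},
]
syslog = [
    {"device": "sw-01", "time": datetime(2011, 9, 14, 10, 0, 5),
     "message": "Interface Gi1/0/1 changed state to down"},
    {"device": "sw-01", "time": datetime(2011, 9, 14, 9, 30, 0),
     "message": "Configuration saved"},
    {"device": "sw-02", "time": datetime(2011, 9, 14, 10, 0, 10),
     "message": "Fan failure"},
]

for device, alarm, messages in correlate(alarms, syslog):
    print(device, alarm, messages)
```

With a single centralized store, the same join could be expressed as one query instead of application code.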

Requirements

NMS Status

• Advantages:
- Efficient root cause analysis
- Correct event/alarm management
- High availability
- Really good topology views (useful for the intervention group)
- Supports NICE users
- Very good level of filtering (topology, alarms)
- Notification support

• Negative points / weaknesses:
- Expensive
- Polling limit is almost reached (a new version with a completely redesigned polling system will arrive in 2 years)
- Not a performance system: cannot handle 50K statistics
- Integrating devices from non-certified manufacturers is complex
- Data collection mostly limited to SNMP (changes ongoing)

Requirements

Mandatory:
- Root cause analysis
- High-frequency polling: 1-2 min for critical nodes, 3-5 min for others
- Network topology representation
- Notifications (SMS / mail / XMPP) and a general console
- Distributed environment
- High-availability system
- Complete performance management
- IPv6 support

Nice to have:
- Autodiscovery system
- Mobile version
- Centralized Oracle database

Numbers and storage time:
- Polling capacity for at least 5K nodes
- Performance statistics for 56K ports
- Data lifetime: 1 month without aggregation, maximum with aggregation
- Device alarms: around 2 years
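The polling requirements above translate into a sustained polling rate that a candidate tool must handle. The sketch below computes the bounds for 5K nodes at the fastest and slowest intervals quoted, plus one mixed case. The share of critical nodes is not given in the slides, so the 10% figure is purely hypothetical.

```python
def polls_per_second(nodes, interval_min):
    """Average number of node polls per second for a given polling interval."""
    return nodes / (interval_min * 60)

NODES = 5_000  # "at least 5K nodes" (from the slide)

upper = polls_per_second(NODES, 1)  # every node polled every minute
lower = polls_per_second(NODES, 5)  # every node polled every 5 minutes

# Hypothetical mix: 10% critical nodes at 2 min, the rest at 5 min.
CRITICAL_PERCENT = 10
critical_nodes = NODES * CRITICAL_PERCENT // 100
mixed = (polls_per_second(critical_nodes, 2)
         + polls_per_second(NODES - critical_nodes, 5))

print(f"bounds: {lower:.1f} to {upper:.1f} polls/s, mixed example: {mixed:.1f} polls/s")
```

Each node poll typically reads several attributes, so the number of individual values collected per second is a multiple of these figures.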

Research

Tools that fit the requirements better:
• Icinga: Nagios-like (fork) (Not Yet Tested)
• Zabbix: large polling scale, open source, notifications, Oracle database, distributed (NYT) (http://www.zabbix.com/features.php)
• SolarWinds: commercial, but includes performance management and is less expensive (NYT)
• OpenNMS: (http://www.opennms.org/about/)
- Open source, completely customizable
- High-frequency polling with a distributed environment
- Event correlation, alarm management, notifications
- Support for many data collection methods (SNMP, HTTP, JMX, JDBC, NAGIOS-NSCLIENT)

Links:
• http://en.wikipedia.org/wiki/Comparison_of_network_monitoring_systems
• http://www.slac.stanford.edu/xorg/nmtf/nmtf-tools.html

Thanks. Questions?
