HEPiX – 9 May 2008 – Network monitoring in ATLAS – [email protected]
Advanced Monitoring Techniques for the ATLAS TDAQ Network
Matei Ciobotaru
CERN; University of California, Irvine; “Politehnica” University of Bucharest
on behalf of the ATLAS Networking Group: B. Martin, A. Al-Shabibi, S. Batraneanu, S. Stancu, L. Leahu, L. Darlea, M. Ivanovici
The ATLAS TDAQ Network – Role
The ATLAS Trigger and Data Acquisition Network (TDAQ) handles the data transfers from the ATLAS detector to the analysis and storage nodes
Built with Gigabit Ethernet switches and routers
Sustained rates of 150 Gbit/s
The experiment relies on the network to function 24/7 with a minimal number of failures
ATLAS detector
TDAQ system
The ATLAS TDAQ Network – Photos
2500 computers installed in 90 racks
2 concentrator switches per rack
5 “big” chassis-based devices at the core
Almost 3000 devices and 5000 network connections…
How to make sure everything is working correctly?
Inside this talk
Requirements in terms of network management
Commercial software we are using
Tools we developed in-house
Services for users, integration with ATLAS
Plans for the future
The big picture
ATLAS Requirements
Installation
– Ease the equipment registration, inventory and verification
– Configure the devices
Operation
– Check the state of health of devices and links
– Monitor traffic conditions, raise alarms when needed
– Assist the user in navigating the realm of information
– Integration with the ATLAS TDAQ software
Diagnostics
– Provide aids to the admin in case something goes wrong
– Be able to suggest solutions to problems
Complexity
Manage a large local area network which has to be very reliable and which has very high throughput requirements
Equipment registration
ATLAS equipment needs to be registered in four databases
Only some databases support batch registrations; the others require manual intervention, which may lead to inconsistencies
Developed a web application to cope with this situation
– Central place for querying all the information about a device
– Ability to cross-check the data across all databases to detect incomplete/incorrect registrations
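The cross-check idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual web application: the database names (`inventory_db`, `network_db`) and the `mac` field are hypothetical placeholders.

```python
# Hypothetical sketch: cross-check device registrations across databases.
# Database names and record fields are illustrative assumptions.

def cross_check(databases):
    """Return {device: [problem, ...]} for inconsistent registrations.

    `databases` maps a database name to {device_name: {field: value}}.
    """
    all_devices = set()
    for records in databases.values():
        all_devices.update(records)

    problems = {}
    for device in sorted(all_devices):
        issues = []
        # Incomplete registration: the device is absent from some database.
        for db_name in sorted(databases):
            if device not in databases[db_name]:
                issues.append("missing in %s" % db_name)
        # Incorrect registration: a shared field has conflicting values.
        seen = {}
        for db_name in sorted(databases):
            for field, value in databases[db_name].get(device, {}).items():
                seen.setdefault(field, set()).add(value)
        for field in sorted(seen):
            if len(seen[field]) > 1:
                issues.append("conflicting %s" % field)
        if issues:
            problems[device] = issues
    return problems


example = {
    "inventory_db": {
        "sw-01": {"mac": "00:90:27:8F:94:E3"},
        "sw-02": {"mac": "00:90:27:8F:94:E4"},
    },
    "network_db": {
        "sw-01": {"mac": "00:90:27:8F:94:E3"},
    },
}
report = cross_check(example)
```

Here `report` flags only `sw-02`, which was registered in one database but not in the other.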
Equipment inventory
Network diagrams for ATLAS are made in Microsoft Visio using the NetDesign package
We created tools which discover what really exists in the network (what is connected where)
Developed an application which compares the two data sources (Visio and auto-discovery); mismatches are detected and corrected in the field if necessary
For the network documentation, we also automatically generate a printable “report” with all the connectivity
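Comparing the documented and the discovered topology amounts to a set difference over links. The sketch below assumes each link is given as a pair of (device, port) endpoints; this data layout and the function names are assumptions made for illustration, not the format used by Visio or by the discovery tools.

```python
# Hypothetical sketch: compare documented links against discovered links.
# A link is a pair of (device, port) endpoints; direction does not matter.

def normalize(link):
    """Represent a link as an order-independent set of its two endpoints."""
    return frozenset(link)


def compare_topologies(documented, discovered):
    """Return links present in only one of the two data sources."""
    doc = set(normalize(link) for link in documented)
    disc = set(normalize(link) for link in discovered)
    return {
        "missing_in_network": doc - disc,  # drawn in the diagram, not found
        "undocumented": disc - doc,        # found in the field, not drawn
    }


documented = [
    (("core", "1/1"), ("rack1", "0/1")),
    (("core", "1/2"), ("rack2", "0/1")),
]
discovered = [
    (("rack1", "0/1"), ("core", "1/1")),  # same link, endpoints swapped
    (("core", "1/3"), ("rack2", "0/1")),  # cabled to a different core port
]
diff = compare_topologies(documented, discovered)
```

Storing each link as a `frozenset` makes the comparison independent of the order in which the two ends are reported.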
Visio
Network Discovery
Network configuration (1)
In ATLAS we have more than 200 switches
– Different vendors
– Different mechanisms for configuration and monitoring (telnet, SNMP, web)
Q: How to access all devices in a transparent manner?
– A: Bring them all under a common denominator (common interface)
Q: How to automate network management tasks?
– A: Write scripts (little programs)
sw_script = Set of Python modules which can be used as building blocks for network management solutions
Common programming interface to all devices (object-oriented)
“Intelligent” tools for configuration and monitoring can be developed
switches + scripting = sw_script
http://cern.ch/ciobota/projects/sw_script/
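The "common denominator" idea can be illustrated with a small object-oriented sketch. It does not reproduce the sw_script source: the class names, the `get_link_state` method and the second IP address are hypothetical, and the two "vendors" are in-memory stand-ins for devices reached via telnet, SNMP or web.

```python
# Hypothetical sketch of a common interface over vendor-specific devices.
# Class names and get_link_state are illustrative, not the sw_script API.

class Switch(object):
    """Common interface; each vendor subclass supplies the access logic."""

    def __init__(self, ip_address):
        self.ip_address = ip_address

    def get_port_names(self):
        raise NotImplementedError

    def get_link_state(self, port):
        raise NotImplementedError


class VendorASwitch(Switch):
    """Stand-in device whose port table is a dict: port -> 'up'/'down'."""

    def __init__(self, ip_address, port_table):
        Switch.__init__(self, ip_address)
        self._port_table = port_table

    def get_port_names(self):
        return sorted(self._port_table)

    def get_link_state(self, port):
        return self._port_table[port]


class VendorBSwitch(Switch):
    """Stand-in device that reports ports as (name, is_up) pairs."""

    def __init__(self, ip_address, port_list):
        Switch.__init__(self, ip_address)
        self._port_list = port_list

    def get_port_names(self):
        return [name for name, _ in self._port_list]

    def get_link_state(self, port):
        for name, is_up in self._port_list:
            if name == port:
                return "up" if is_up else "down"
        raise KeyError(port)


def ports_down(switches):
    """Vendor-independent tool: list (ip_address, port) for every down port."""
    down = []
    for sw in switches:
        for port in sw.get_port_names():
            if sw.get_link_state(port) != "up":
                down.append((sw.ip_address, port))
    return down


switches = [
    VendorASwitch("192.168.100.59", {"1/1": "up", "1/2": "down"}),
    VendorBSwitch("192.168.100.60", [("ge-0/0", True), ("ge-0/1", False)]),
]
result = ports_down(switches)
```

The tool `ports_down` never needs to know which vendor it is talking to, which is the point of a common programming interface.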
Interactive session with sw_script
# Start the Python interpreter
$ python2.5
# Load the sw_script module
>>> import sw_script
# Create an object associated with the switch (a Cisco device in this case)
>>> sw = sw_script.Cisco_Catalyst_6500_Switch(ip_address="192.168.100.59")
# List the ports available on this device
>>> sw.get_port_names()
['1/1', '1/2', '1/3', '1/4', ....
# Get all the information available for an interface
>>> sw.get("1/4")
[('rx_packets', 519.0), ('rx_bytes', 127937.0), ('rx_discards', 0.0), ('rx_errors', 0.0),
 ('tx_packets', 11199.0), ('tx_bytes', 1111661.0), ('tx_discards', 0.0), ('tx_errors', 0.0),
 ('description', 'GigabitEthernet1/4'), ('link_state', 'up'), ('mac_addr', ['00:90:27:8F:94:E3'])]
# Set the description (ifAlias) of an interface
>>> sw.set_interface_alias("1/4", "Uplink to Core Router")
# Show the serial number of this device
>>> print sw.get_serial_number()
FOC0913U075
sw_script is responsible for more than half of our network management toolbox
Features
– Supports devices from different vendors
– Network topology auto-discovery
– Can do traffic monitoring in real-time
– Works as a module, can be easily embedded into other apps
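Real-time traffic monitoring from counters such as `rx_bytes` in the session above typically means sampling the counter twice and converting the difference into a rate. The sketch below is a generic illustration, not sw_script code; it assumes the counter is cumulative and wraps around at a fixed width (32 or 64 bits, as SNMP interface counters do), and the function name is hypothetical.

```python
# Hypothetical sketch: turn two samples of a cumulative byte counter
# into a throughput figure, tolerating one counter wrap-around.

def rate_bits_per_second(prev_bytes, curr_bytes, interval_s, counter_bits=64):
    """Return the average rate in bit/s between two counter samples."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # Modular subtraction gives the right delta even if the counter wrapped once.
    delta_bytes = (curr_bytes - prev_bytes) % (2 ** counter_bits)
    return delta_bytes * 8.0 / interval_s


# Two samples taken 30 seconds apart.
normal = rate_bits_per_second(1000, 4750, 30)

# A 32-bit counter that wrapped between the two samples.
wrapped = rate_bits_per_second(2 ** 32 - 100, 200, 10, counter_bits=32)
```

If a counter can wrap more than once between samples, the result is underestimated; sampling more often or using wider counters avoids this.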
Network configuration (2)
In ATLAS, we have programs which use sw_script to perform configuration changes on devices:
– defining VLANs
– enabling protocols: spanning tree, time synchronization, etc.
– setting interface aliases (descriptions)
We use Python scripts to perform unattended firmware upgrades
For keeping track of configuration files we plan to use ZipTie (open-source software)
Basic monitoring
Spectrum, a software package from Computer Associates, for device health and traffic monitoring (used by the CERN IT department)
Monitors devices, raises alarms in case of failures
Auto-discovery for almost all network connections
Historical info
– Gathers statistics from all devices
– Throughput and error rates saved every 30 seconds
Limitations
– The Spectrum GUI is hard to use
– It is not easy to integrate with 3rd party apps
– Limited support for network performance monitoring
– Basic support for querying historical traffic data
– No support for device configuration
– Virtually no features for diagnostics
Spectrum GUI
We developed software to fill in the gaps
Navigating in the realm of monitoring data
Spectrum produces 3 plots for each network interface. With 5000 ports, we will have 15000 plots to look at…
We developed tools to browse, query and analyze the traffic plots.
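Searching and aggregating can be pictured as selecting the time series whose names match a pattern and summing them sample by sample. The sketch below is a minimal illustration; the `switch:port` naming scheme and the function name are assumptions, not the format of the Spectrum data.

```python
# Hypothetical sketch: select traffic series by name pattern and sum them.
# Series names of the form "switch:port" are an illustrative assumption.
import fnmatch


def aggregate(series_by_port, pattern):
    """Sum, sample by sample, all series whose name matches `pattern`."""
    matching = [
        samples
        for name, samples in sorted(series_by_port.items())
        if fnmatch.fnmatchcase(name, pattern)
    ]
    if not matching:
        return []
    # Only aggregate over the time range covered by every selected series.
    length = min(len(samples) for samples in matching)
    return [sum(samples[i] for samples in matching) for i in range(length)]


data = {
    "sw-rack01:1/1": [1, 2, 3],
    "sw-rack01:1/2": [10, 20, 30],
    "sw-core:3/1": [100, 100, 100],
}
rack_total = aggregate(data, "sw-rack01:*")
```

One aggregate plot per rack or per switch replaces dozens of per-port plots when looking for the overall picture.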
Network browser
Searching and aggregating plots
Scanning for traffic events
Integration with ATLAS software
Network Panel
– Shows network monitoring information relevant to an ATLAS data acquisition run
Alarm Watcher
– Forwards alarms from Spectrum into the ATLAS “official” messaging channels
IS Feeder
– Publishes network statistics to the Information Services, a monitoring sub-system in ATLAS
The network Panel
Network visualization – 2D approach
Application which shows a topological map of the network
Colors the connections in real time according to their state and usage
Overloaded links are easy to spot
Good navigation features (zoom, pan)
Based on GUESS, a Java application for visualizing graphs
– http://graphexploration.cond.org/
We developed a network monitoring plug-in for GUESS
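Coloring a connection by state and usage boils down to a small mapping function. The sketch below is illustrative only: it is not the GUESS plug-in, and the thresholds and color names are arbitrary assumptions.

```python
# Hypothetical sketch: choose a display color for a link from its state
# and utilization. Thresholds and color names are arbitrary assumptions.

def link_color(state, utilization):
    """Map link state and utilization (0.0 to 1.0) to a color name."""
    if state != "up":
        return "gray"  # link down or unknown: usage is not meaningful
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0.0 and 1.0")
    if utilization >= 0.9:
        return "red"  # overloaded
    if utilization >= 0.5:
        return "orange"  # busy
    return "green"  # lightly loaded


colors = [
    link_color("down", 0.0),
    link_color("up", 0.95),
    link_color("up", 0.5),
    link_color("up", 0.1),
]
```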
Network visualization – 3D approach (1)
Each object contains a panel with traffic information (updated in real-time)
Containers (racks, rooms) show aggregate values
Technologies used: X3D, Java and the Octaga Player
3D model of the network
Racks, switches and computers
Furniture in the 3D space
Navigation similar to Google Earth
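The aggregate value shown on a container can be computed by grouping per-device traffic by the rack that holds each device. The sketch below is a minimal illustration; the device names, the rack mapping and the "unassigned" fallback are hypothetical.

```python
# Hypothetical sketch: roll up per-device traffic into per-rack totals.
# Device names and the rack mapping are illustrative assumptions.
from collections import defaultdict


def aggregate_by_rack(traffic_by_device, rack_of):
    """Return {rack: total traffic} given {device: traffic} and {device: rack}."""
    totals = defaultdict(float)
    for device, mbps in traffic_by_device.items():
        # Devices without a known rack are grouped under "unassigned".
        totals[rack_of.get(device, "unassigned")] += mbps
    return dict(totals)


traffic = {"pc-1": 100.0, "pc-2": 250.0, "pc-3": 40.0}
rack_of = {"pc-1": "rack-A", "pc-2": "rack-A", "pc-3": "rack-B"}
rack_totals = aggregate_by_rack(traffic, rack_of)
```

The same function applied to a rack-to-room mapping would give the totals for rooms.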
Network visualization – 3D approach (2)
Real-time traffic monitoring
Connections for one switch (with traffic values)
The ATLAS applications running now in the network
Real-time global top (most active connections)
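A "global top" view is essentially a top-N selection over the current rates of all connections. The sketch below shows one way to do this with the standard library; the connection names and rates are made up for illustration.

```python
# Hypothetical sketch: pick the N most active connections from a snapshot
# of current rates. Connection names and values are made up.
import heapq


def top_connections(rates, n=5):
    """Return the n (connection, rate) pairs with the highest rate."""
    return heapq.nlargest(n, rates.items(), key=lambda item: item[1])


snapshot = {
    "core:1/1": 5.0,
    "core:1/2": 50.0,
    "rack07:0/1": 20.0,
}
top_two = top_connections(snapshot, n=2)
```

`heapq.nlargest` avoids sorting all 5000 connections when only the first few are displayed.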
Diagnostics
For immediate response, we look in Spectrum and in the sw_script web pages
Human inspection of traffic plots (aggregates) – we search for abnormal patterns and correlations between plots
We have a collection of scripts to test different things
– Checking that machines are configured properly and connections are ok
For bandwidth-related issues we use iperf
All the network operations are documented in a knowledge base (wiki)
Plans for the future
Better visualization techniques for traffic plots
Analysis tools for monitoring data. Pattern detection and recognition (periodic events, monotonic variations, etc.)
Add support for sFlow, an industry standard for statistical packet sampling – very useful to diagnose network congestion
Design and implement an expert system which will help us troubleshoot network issues
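As an idea of where pattern detection could start, the sketch below looks for monotonic variations by measuring the longest strictly increasing run in a series of samples. It is only a possible first step under simple assumptions, not an existing tool; the function name is hypothetical.

```python
# Hypothetical sketch: detect a monotonic variation as a long run of
# strictly increasing samples. A possible starting point only.

def longest_increasing_run(samples):
    """Return the length of the longest strictly increasing run."""
    best = run = 1 if samples else 0
    for prev, curr in zip(samples, samples[1:]):
        # Extend the current run on an increase, otherwise start a new one.
        run = run + 1 if curr > prev else 1
        best = max(best, run)
    return best


error_counts = [3, 1, 2, 3, 4, 2]
run_length = longest_increasing_run(error_counts)
```

A run longer than some chosen threshold on an error-rate series could then be flagged for inspection; real traffic data would likely need smoothing first.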
The big picture
Historical traffic data
Real-time traffic info
Dynamic web pages: browse, search and aggregate
2D and 3D network visualization
ATLAS software – network status and alarms
Equipment configuration
Device health monitoring
Equipment auto-discovery, inventory and registration
Legend: commercial package (Spectrum); in-house development (sw_script & co.)