ECHO: A System Monitoring and Management Tool
Yitao Duan and Dawey Huang

Challenge
How can we manage all these machines?

Goal
Aimed at networked system management
Better tools for
- Discovering system states
- Enhancing system availability
- Monitoring network and system statistics
- Error detection and correction
- Fault tolerance for specific network applications (such as a web server)

Overview
- Distributed agents gather information
- Centralized Control Unit (CCU) monitors and analyzes data, and takes control action if needed
- Script language for automatic decision making
- Web browser user interface

[Diagram labels: SNMP Tool, EchoMe Daemon, Centralized Control Unit]
- Information collection: machine information, network information
- Information analysis: individual machine analysis, collaborative network analysis
- Action: system modification, network routing

Information Collection
Two approaches investigated
- EchoMe Daemons running on hosts and reporting system information to the server
- SNMP to discover router connectivity and states
Daemon mostly for collecting local information; much more detailed
SNMP for network connectivity

EchoMe Daemon
1. Automatically discover a node (node reporting stage)
   - EchoMe Daemon starts up at machine boot
   - Sends OS type/machine info to the CCU
   - Registers a session in the CCU
2. The CCU sends the node a monitor program based on the node's OS/machine type and executes it on the node.
3. The monitor program periodically sends information packets to the CCU.
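The three EchoMe Daemon stages can be illustrated with a minimal in-memory sketch. This is not the authors' implementation: the `CCU` class, its method names, the `MONITORS` table, and the packet fields are all hypothetical, and real network transport and timing are left out.

```python
from dataclasses import dataclass, field

# Hypothetical monitor programs keyed by OS type; the slides do not name them.
MONITORS = {"Linux": "monitor_linux", "FreeBSD": "monitor_freebsd"}


@dataclass
class CCU:
    """In-memory stand-in for the Centralized Control Unit."""
    sessions: dict = field(default_factory=dict)
    packets: list = field(default_factory=list)

    def register(self, hostname, os_type):
        # Stage 1: the node reports itself and a session is registered.
        session_id = len(self.sessions) + 1
        self.sessions[session_id] = {"host": hostname, "os": os_type}
        return session_id

    def select_monitor(self, session_id):
        # Stage 2: choose a monitor program based on the node's OS type.
        os_type = self.sessions[session_id]["os"]
        return MONITORS.get(os_type, "monitor_generic")

    def receive(self, session_id, packet):
        # Stage 3: accept one information packet from a registered node.
        if session_id not in self.sessions:
            raise KeyError(f"unknown session {session_id}")
        self.packets.append((session_id, packet))


def run_node(ccu, hostname, os_type, samples):
    """Walk one node through registration, monitor selection, and reporting."""
    session_id = ccu.register(hostname, os_type)
    monitor = ccu.select_monitor(session_id)
    for sample in samples:  # a real daemon would report on a timer instead
        ccu.receive(session_id, {"monitor": monitor, **sample})
    return session_id, monitor


ccu = CCU()
session_id, monitor = run_node(ccu, "node1", "Linux", [{"load1": 0.5}, {"load1": 0.7}])
```

Keeping monitor selection on the CCU side mirrors the slide's point that the node only needs a small bootstrap daemon; everything OS-specific is pushed down after registration.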
Router Connectivity Discovery by SNMP
- Routers implement SNMP
- Program can run on any host within Millennium
- Given a router (obtainable from the local host's gateway information), query its ipRouteTable
- Traverse all its neighboring routers, performing the same query
- Recursion stops at a specified distance

System Information
- Number and speed of the CPUs
- Total physical and swap memory installed
- System clock
- Uptime
- Kernel version
- Percent CPU user, nice, system and idle
- One, five and fifteen minute load averages
- Number of running processes and total number of processes
- Amount of free, shared, buffered, cached and swap memory

Network Information
- Network interfaces
  - /proc/net/dev or CTL_NET/AF_LINK
  - SNMP: interfaces.ifTable
- ARP cache (direct neighbors)
  - /proc/net/arp or RTF_LLINFO
  - SNMP: ip.ipNetToMediaTable
- Route table
  - /proc/net/route or NET_RT_DUMP
  - SNMP: ip.ipRouteTable

Information Analysis
- CCU: a relational database
- Front end, parsing engine
- Individual Node Analysis
- Collaborative Analysis

Parsing Engine
- IPACKET is in standard XML format
- IPACKET uses incremental updates: a new packet specifies only the differences from the previous packet
- Parsing Engine parses the IPACKET into objects and inserts them into iface accordingly

DATA IFACE Tables
- The client node registers a unique nodeid in iface_node_table
- It starts a session for reporting information to the CCU
- Each time, the client node reports information by sending up an information packet (ipacket)
- The CCU processes this packet, creates a unique statement id from iface_index_table, and parses the information into each iface_?DATA_table

Individual Node Analysis
- Clean up iface_?data_table by transferring and categorizing data into each node's own data table
- A background process runs on the CCU
- Examples:
  - Network statistics over time table
  - Network route change reporting
  - Network usage of nodes (packets, tcp/udp connection counts)
  - Node system state over time table
  - Node configuration change table

Collaborative Analysis
- Group specific information in the iface_?data_tables and ninfo_?data_tables to generate special tables for user viewing/analysis
- Examples:
  - Network connectivity graph
  - Network graph between two nodes or routes
  - Network snapshot table
  - All nodes' current network statistics table
  - All nodes' current state table

Interface to View Analysis
- Web interface
  - Viewable in a web browser
  - Web session
  - Display analysis
  - Take action input from user
- Java Servlet + JSP
  - Security control
  - Data objects map to tables in collaborative analysis

Action
- Daemon capable of receiving and executing binary programs from the CCU
- Command module issues commands in response to certain events
  - Add pseudo interface to a host
  - Reroute a host
  - Initialize new program
  - Etc.

Security
- OpenSSL encryption
- EchoMe Daemon runs as nobody
- System modification: program needs to do suexec (root password required)

System Stat Table

Transcripts for SNMP Router Discovery
Iterating neighbors of IP address: (A9E533A1)
IP address: (A9E533E9)
IP address: (A9E533A5)
IP address: (A9E533A7)
IP address: (A9E533A8)
IP address: (A9E53221)
IP address: (A9E53281)
IP address: (A9E533A6)
IP address: (A9E533A9)
IP address: (A9E533EA)
In getIPRouteTable. nHops = 8
Setting target to

Partial Router Connectivity on Millennium Discovered by SNMP

Conclusion
- Information collection methods feasible
- Automatic discovery
- Comprehensive and accurate information about system
- Needs user feedback

Future Work
- More (or fewer) features based on user feedback
- User interface
- More on information analysis and decision making
- Fully deploy on Millennium
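As a supplement to the "Router Connectivity Discovery by SNMP" slide, the distance-limited traversal can be sketched as follows. This is a sketch under stated assumptions rather than the authors' program: the real tool queries each router's ipRouteTable over SNMP, whereas here `get_next_hops` is a hypothetical callback backed by a made-up topology (`r1` to `r5`), and the traversal is written iteratively with a queue instead of recursively so that each router is recorded at its shortest hop distance. The `hex_to_dotted` helper assumes the transcript's values such as `A9E533A1` are IPv4 addresses printed in hexadecimal.

```python
from collections import deque


def hex_to_dotted(hex_addr):
    """Convert an 8-digit hex string such as 'A9E533A1' to dotted-quad form."""
    raw = bytes.fromhex(hex_addr)
    if len(raw) != 4:
        raise ValueError("expected exactly 4 bytes")
    return ".".join(str(b) for b in raw)


def discover(start, get_next_hops, max_distance):
    """Return {router: hop distance} for routers within max_distance of start.

    get_next_hops(router) stands in for an SNMP ipRouteTable query and must
    return an iterable of neighboring router identifiers.
    """
    distance = {start: 0}
    queue = deque([start])
    while queue:
        router = queue.popleft()
        if distance[router] == max_distance:
            continue  # stop expanding at the specified distance
        for neighbor in get_next_hops(router):
            if neighbor not in distance:  # skip routers already visited
                distance[neighbor] = distance[router] + 1
                queue.append(neighbor)
    return distance


# Hypothetical topology used only to exercise the traversal.
TOPOLOGY = {"r1": ["r2", "r3"], "r2": ["r4"], "r3": [], "r4": ["r5"], "r5": []}
found = discover("r1", lambda r: TOPOLOGY.get(r, []), max_distance=2)
```

With `max_distance=2`, router `r5` is never queried because it sits three hops from `r1`; the visited set also keeps the traversal from looping when two routers list each other as neighbors.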