Unlocking Systems and Data: The Key to Network Management Innovation
Charles Kalmanek
Internet & Network Systems Research V.P., AT&T Labs-Research
2006 IEEE/IFIP Network Operations and Management Symposium
CRK 04/21/23 2
Vision for IP Network Management
Approach
• Manage the entire network, not network elements
• Instrument the network, rely on direct correlation of real data
• Model interactions to predict the effects of actions in advance
• Automate as much as possible, audit results
Goal: A robust, global, multi-service IP/MPLS network
[Figure: measure/control loop around the network. Labels: Measure (topology, configuration, workflow; offered traffic, routing, fault); Network-wide model (auditing, “what-if,” etc.); Control (provisioning, changes to the network); Design goals, policies.]
Why It’s Hard
Scale & diversity challenges
• Large, distributed networks (100,000’s of NE’s)
• Complex, diverse building blocks
• Ongoing maintenance, spanning multiple time zones
• Fragile IP network control planes
• Complex software systems on top
Constant change
• Architectural change, new features & services, new protocols…
• Customers join, leave, change/upgrade service
• Network “events”: failures, migrations, upgrades, etc.
Measurement and data challenges
• Inadequate implementation of the basics
• Data often locked up in NM systems “smokestacks”
• Diverse data sources, with highly variable data quality
• Limited direct measurements of causality
• Inadequate ability to trace events across the network
Tier-1 Service Provider Network
[Figure: Tier-1 service provider network. Labels: Customer Network, CPE, CE, Access Network, LEC, customer-facing PE interfaces, PoP, PE, P, Metro, Intercity, OC-48 or OC-192 DWDM, DWDM systems.]
Legend
• PoP: Point-of-Presence
• P: Backbone (core) Router
• PE: Provider Edge Router
• CE: Customer Edge Router
Rough stats
• 100s of offices
• 100s of Ps, 1000s of PEs, 10000s of CEs
• 100,000s of transport facilities
(Enterprise customer networks rival ISP’s in size & complexity!)
Unlocking Network Data
Measurement data is essential to running the network
• Marketing and customer acquisition
• Network and customer care
• Network engineering and capacity management
• Research to improve / evolve the network
If you don’t have the data, you can’t design, manage, secure, or improve the network
If you can’t evolve systems, you can’t evolve the network
Example 1: Fault/performance management
Example 2: Router Provisioning
Network Troubleshooting
Goals
• Automate the entire life cycle of event detection and repair for every performance impacting event
– Detect, Localize, Diagnose, Fix, Verify
• Drive short and long term network, operations & systems improvements
– Use forensics to reveal chronic events
Systems and Tools
• Active and passive performance monitoring
– Each data source has its unique value and limitations
• Maintenance and troubleshooting require correlation across multiple data sets
– Associations of customers to access circuits, router interfaces, network policies, network elements, monitoring systems, …
Example: Cross-Layer Troubleshooting
IP composite link: multiple SONET links combined together
• Example: 5 OC192s
• IP routing does not take bandwidth into account.
– On component failure: how to decide between mechanisms to take traffic off the link, as a function of remaining capacity?
[Figure: logical IP link between LA and NY carrying 3 units of traffic. With only 1 unit of capacity remaining, the same 3 units of traffic cause congestion.]
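The composite-link question can be made concrete with a small capacity check. The following is a minimal sketch, not the method used in the network described here: the class and function names, the one-unit-per-member capacity, and the 0.8 utilization threshold are all assumptions chosen to mirror the "3 units of traffic, 1 unit of capacity" picture.

```python
from dataclasses import dataclass


@dataclass
class CompositeLink:
    """A logical IP link built from several equal-capacity members (hypothetical model)."""
    name: str
    members_total: int
    members_up: int
    offered_units: float          # traffic currently routed over the logical link
    unit_capacity: float = 1.0    # assumed capacity of one member, in the same units

    def remaining_capacity(self) -> float:
        return self.members_up * self.unit_capacity


def should_cost_out(link: CompositeLink, max_utilization: float = 0.8) -> bool:
    """Return True if traffic should be moved off the link.

    IP routing still sees a single logical link after a member fails,
    so this decision has to be made outside the routing protocol.
    The threshold is an assumed policy value.
    """
    capacity = link.remaining_capacity()
    if capacity == 0:
        return True
    return link.offered_units / capacity > max_utilization


healthy = CompositeLink("LA-NY", members_total=5, members_up=5, offered_units=3.0)
degraded = CompositeLink("LA-NY", members_total=5, members_up=1, offered_units=3.0)
print(should_cost_out(healthy))   # utilization 3/5
print(should_cost_out(degraded))  # utilization 3/1
```

With all five members up the link runs at 60% of capacity and stays in service; with one member left the offered traffic is three times the remaining capacity, which is the congestion case in the figure.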
Example: Cross-Layer Troubleshooting (cont.)
Detect:
• Packet loss from active measurements for a set of PE pairs
Localize/Diagnose:
• Temporal correlation: PE-PE measurement alerts occurring at the same time as flapping on several composite link members
• Spatial correlation: paths where packet loss occurs contain flapping composite link components (PE-PE measurements mapped to paths via route monitoring)
Diagnose:
• Congestion due to composite link component flapping
Fix:
• Short term: “cost out” the link
• Permanent: repair failing components
Verify:
• Packet loss alerts disappear
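The temporal and spatial correlation steps can be sketched in a few lines. This is an illustrative sketch only: the data shapes (alerts as PE-pair/time tuples, paths as lists of link names, flaps as per-link timestamps), the function name `correlate`, and the five-minute window are assumptions, not details from the slides.

```python
from datetime import datetime, timedelta


def correlate(alerts, paths, flaps, window=timedelta(minutes=5)):
    """Return {pe_pair: set of links} where a loss alert lines up with link flaps.

    alerts: list of (pe_pair, alert_time) from active measurements
    paths:  {pe_pair: [link, ...]} as derived from route monitoring
    flaps:  {link: [flap_time, ...]} from fault data
    """
    suspects = {}
    for pe_pair, alert_time in alerts:
        # Spatial correlation: only links on the measured path are candidates.
        for link in paths.get(pe_pair, []):
            # Temporal correlation: a flap close in time to the alert.
            if any(abs(alert_time - t) <= window for t in flaps.get(link, [])):
                suspects.setdefault(pe_pair, set()).add(link)
    return suspects


t0 = datetime(2006, 4, 3, 12, 0)
alerts = [(("PE1", "PE2"), t0), (("PE3", "PE4"), t0)]
paths = {("PE1", "PE2"): ["LA-NY"], ("PE3", "PE4"): ["CHI-DC"]}
flaps = {"LA-NY": [t0 - timedelta(minutes=2)]}
print(correlate(alerts, paths, flaps))
```

Only the PE1-PE2 alert is attributed to a link here, because its path crosses the flapping LA-NY link within the time window; the PE3-PE4 alert remains unexplained and would need other data sources.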
Example: Chronic Control Plane Outage
Detect
• Active performance monitoring shows high loss at a PE
Localize/Diagnose
• Correlation of performance alerts, fault data, routing updates, configuration, and workflow logs reveals recurring pattern
– OSPF sessions flap during customer provisioning on some PE platforms
• Diagnosis: BGP starves OSPF processing on this class of PEs
Fix
• Short-term: process changes to control provisioning on this class of PE
• Long-term: better OSPF and BGP process scheduler for PE
Verify
• High loss disappears at the PE
Data Distribution Problem
• Many, diverse data feeds required
• Labor-intensive and error-prone to create and maintain each feed
• Ad-hoc development to convert, copy, encrypt, & ingest the data
• Several groups with business critical functions need network data
• Stringent delivery requirements (security, timeliness, reliability)
Network data
• Network inventory
• Route monitors, BGP tables
• SNMP link utilization & faults
• Syslog info (status, health, events)
• Active path monitoring
• Netflow
• Other: workflow, VoIP, transport
Customer data
• Access: location, circuit ID, IP addresses, CE platform, LEC interface, layer 2 info (Frame Relay, Ethernet, DSL, Private Line,…), router info (hardware, software version)
• Trouble tickets
• Performance and SLA reports
• Service orders
Data Correlation Framework
Flexible data/systems architecture
• Pluggable data-source specific collectors
• Data distribution bus
• Common real time and archival data store
• Variety of network management applications on top
Evolving domain knowledge
• It’s an iterative process: exploratory data mining (EDM)
– Apply statistical tools, visualization, “hunches,” …
– Export results to “case manager” for analysis
Diagnosis engines
• Near real-time drill down, forensics
• Temporal and spatial event clustering
• Scalable statistical mechanisms to uncover correlations
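As one simple illustration of temporal event clustering, events from different sources can be grouped whenever consecutive timestamps fall within a maximum gap. This is a generic gap-based technique offered as a sketch; the slides do not say which clustering method the diagnosis engines use, and the event tuple layout and the `max_gap` parameter are assumptions.

```python
def cluster_by_time(events, max_gap):
    """Group events into clusters separated by more than max_gap seconds.

    events: iterable of (timestamp_seconds, source, message) tuples.
    Returns a list of clusters, each a time-ordered list of events.
    """
    clusters = []
    for event in sorted(events, key=lambda e: e[0]):
        # Extend the current cluster if this event is close to the previous one.
        if clusters and event[0] - clusters[-1][-1][0] <= max_gap:
            clusters[-1].append(event)
        else:
            clusters.append([event])
    return clusters


events = [
    (305, "probe", "loss PE1-PE2"),
    (0, "syslog", "link flap LA-NY"),
    (10, "probe", "loss PE1-PE2"),
    (20, "snmp", "link down LA-NY"),
    (300, "syslog", "OSPF adjacency change"),
]
for cluster in cluster_by_time(events, max_gap=30):
    print([source for _, source, _ in cluster])
```

Clusters that mix sources (syslog, probes, SNMP) are the interesting ones for drill-down, since they suggest a single underlying event seen through several data feeds.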
Data/Systems Architecture
[Figure: data/systems architecture. Collectors fed by the network: OA&M topology I/F, Netflow collector, L3 control plane collector, active probe collector, syslog collector, CDR collector, SNMP collector. Core: Data Distribution Bus (DDB) and Data Store Component (DSC). Applications, each with a GUI: real-time network management applications, end-to-end reporting application, planning application, surveillance application. Access: internal portal, customer portal.]
Data Distribution Bus
• Publish/subscribe system handling all incoming data feeds
• Supports multiple transport options, normalizes data to “standard” formats
• Reliably delivers data to consumers
Data Store Component
• Efficient long-term storage of operational data
• Automatic generation of schema, loading scripts, access scripts, data aging allowing non-DBAs to manage warehouse
Network data is available to multiple applications allowing auditing, correlation, reporting, EDM, …
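The publish/subscribe and normalization roles of the bus can be illustrated with an in-process sketch. Everything here is hypothetical: the class name, method names, and the toy syslog format are assumptions, and the sketch leaves out the transport options and reliable delivery that the real bus is described as providing.

```python
from collections import defaultdict


class DataDistributionBus:
    """Toy in-process publish/subscribe bus with per-feed normalization."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # feed name -> list of callbacks
        self._normalizers = {}                 # feed name -> raw-to-record function

    def register_normalizer(self, feed, normalize):
        self._normalizers[feed] = normalize

    def subscribe(self, feed, callback):
        self._subscribers[feed].append(callback)

    def publish(self, feed, raw):
        # Normalize once, then hand the same record to every consumer.
        normalize = self._normalizers.get(feed, lambda r: r)
        record = normalize(raw)
        for callback in self._subscribers[feed]:
            callback(record)


def parse_syslog(line):
    """Assumed toy format: '<host> <message>'."""
    host, message = line.split(" ", 1)
    return {"host": host, "message": message}


bus = DataDistributionBus()
bus.register_normalizer("syslog", parse_syslog)

surveillance, reporting = [], []
bus.subscribe("syslog", surveillance.append)
bus.subscribe("syslog", reporting.append)

bus.publish("syslog", "pe1 OSPF neighbor down")
print(surveillance)
print(reporting)
```

The point of the design is that each feed is converted once, at the bus, rather than separately by every consuming application.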
Router Provisioning
Goal: translate service intent to network reality
• Get hardware & circuits to the right place at the right time
• Access & update network inventory databases
• Configure routers to establish and verify the service
Challenges
• Huge diversity at network element layer (dependencies on hardware & software versions, physical configuration, vendor, etc.)
• Low level configuration languages, no abstraction layer, multiple ways of achieving the same thing
• Config generator must consider hardware limitations, service definition, customer order info, additional customer info, etc.
• Commercial tools offer limited customizability, only solve pieces of the problem
• Initial provisioning is only part of the life cycle problem (network-wide changes, firmware mgt, auditing, CE-PE coordination, change requests, …)
Configuration File Analysis
Detect/Fix Discords
• Non-compliance to architectural intent
– e.g., errors in route-maps for VPNs crossing routing domains
• Config time-bombs
– e.g., gaps in the ACL perimeter defense
Additional Benefits
• Assessment, Bootstrapping automation, Decision Support
Technology
• Parsers, Algorithms, Rules and Queries encoding domain expertise: e.g., ACL analysis
[Figure: auditing loop. Labels: router configuration (polled), low level standard form (tables), customer/network database, queries, discords, fix, provisioning.]
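The parse-to-tables-then-query idea can be sketched as follows. This is a simplified illustration, not the actual auditing tool: the config fragment uses Cisco IOS-style `interface` and `ip access-group` lines, while the parser, the table layout, and the rule that every interface needs an inbound ACL are assumptions made for the example.

```python
SAMPLE_CONFIG = """\
interface Serial0/0
 ip address 10.0.0.1 255.255.255.252
 ip access-group 101 in
!
interface Serial0/1
 ip address 10.0.0.5 255.255.255.252
!
"""


def parse_interfaces(config_text):
    """Turn a router config into rows: one dict per interface (low-level table form)."""
    rows = []
    current = None
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("interface "):
            current = {"interface": stripped.split(None, 1)[1], "acl_in": None}
            rows.append(current)
        elif line[:1].isspace() and current is not None:
            # Indented lines belong to the current interface stanza.
            parts = stripped.split()
            if len(parts) == 4 and parts[:2] == ["ip", "access-group"] and parts[3] == "in":
                current["acl_in"] = parts[2]
        else:
            current = None  # '!' or any top-level line ends the stanza
    return rows


def find_acl_gaps(rows):
    """Query the table for discords: interfaces with no inbound ACL (assumed rule)."""
    return [row["interface"] for row in rows if row["acl_in"] is None]


print(find_acl_gaps(parse_interfaces(SAMPLE_CONFIG)))
```

In practice the rule would be joined against the customer/network database so that only interfaces that are supposed to be on the perimeter are flagged.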
Automated CPE Router Provisioning
• Technical Questionnaire (Service Level)
– E.g., Web form
• Device/service specific templates, with embedded variables and callouts to computations and databases
– E.g., callouts for ports, IP addresses, ACL clauses, …
• Detailed Device Configuration commands, bundled as a “configlet” (Network Element Level)
• Logic: allocations of ports, IP addresses, VRFs, …
Template-driven Config Generation
Executing templates in a given context (stored in a database) produces configs, similar to code generation
– Evolves easily to integrate new features, router models, access types, resiliency options
– Eliminates errors, reduces holds
– Ensures conformance to engineering guidelines
Example: BGP configuration
router bgp <BGP_1.CE_ASN>
no synchronization
bgp log-neighbor-changes
network <WAN_IF_1.NETIP:computeIpMask_Netip(<WAN_IF_1.IF_IP>,255.255.255.252)> mask 255.255.255.252
network <WAN_IF_2.NETIP:computeIpMask_Netip(<WAN_IF_2.IF_IP>,255.255.255.252)> mask 255.255.255.252
network <ROUTER.LOOPBACKIP> mask 255.255.255.255
Context Substitution: e.g., <BGP_1.CE_ASN>, <ROUTER.LOOPBACKIP>
Functional Substitution: e.g., computeIpMask_Netip(<WAN_IF_1.IF_IP>,255.255.255.252)
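A small renderer shows how the two substitution kinds could be executed. This is a sketch under stated assumptions, not the actual template engine: the `render` function, the regular expressions, and the context values are hypothetical, and `computeIpMask_Netip` is assumed to return the network address for a given interface IP and mask.

```python
import ipaddress
import re


def compute_netip(ip, mask):
    """Assumed meaning of computeIpMask_Netip: network address of ip under mask."""
    return str(ipaddress.ip_interface(f"{ip}/{mask}").network.network_address)


FUNCTIONS = {"computeIpMask_Netip": compute_netip}

# <OBJECT.FIELD> is looked up in the context (context substitution).
CONTEXT_VAR = re.compile(r"<([A-Z0-9_]+\.[A-Z0-9_]+)>")
# <OBJECT.FIELD:func(arg, ...)> is computed (functional substitution).
FUNC_CALL = re.compile(r"<[A-Z0-9_.]+:(\w+)\(([^()<>]*)\)>")


def render(template, context):
    # Pass 1: fill in plain context variables, including those nested in calls.
    text = CONTEXT_VAR.sub(lambda m: context[m.group(1)], template)

    # Pass 2: evaluate the function callouts on the now-literal arguments.
    def call(match):
        args = [arg.strip() for arg in match.group(2).split(",")]
        return FUNCTIONS[match.group(1)](*args)

    return FUNC_CALL.sub(call, text)


TEMPLATE = """\
router bgp <BGP_1.CE_ASN>
 no synchronization
 bgp log-neighbor-changes
 network <WAN_IF_1.NETIP:computeIpMask_Netip(<WAN_IF_1.IF_IP>,255.255.255.252)> mask 255.255.255.252
 network <ROUTER.LOOPBACKIP> mask 255.255.255.255
"""

# Illustrative values only; in the slides the context is stored in a database.
CONTEXT = {
    "BGP_1.CE_ASN": "65001",
    "WAN_IF_1.IF_IP": "10.1.1.2",
    "ROUTER.LOOPBACKIP": "192.168.0.1",
}

print(render(TEMPLATE, CONTEXT))
```

Because the template, the context, and the function table are separate, supporting a new router model or access type means adding a template or a function rather than changing the rendering code.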
Conclusions
Unlocking data and fault/performance management systems enables innovation
• Exploratory data mining and data correlation are essential to forensics and network maintenance automation
• Approach: Flexible data distribution and data storage architecture
Unlocking provisioning systems enables innovation
• Bottom-up analysis is a useful tool for discord-detection, etc.
• Template driven approach allows network engineering to add new network features without new systems development
Challenges are legion…
• How to overcome proprietary data models, systems thwarting forensics?
• How to efficiently find needles in (massive) data haystacks?
• How to raise the level of provisioning abstraction?
• How to reduce the systems drag on network feature and architecture change?