
Network Management

Richard Mortier

Microsoft Research, Cambridge

(Guest lecture, Digital Communications II)

Overview

- Introduction
- Abstractions
- IP network components
- IP network management protocols
- Pulling it all together
- An alternative approach

Overview

- Introduction: what's it all about then?
- Abstractions
- IP network components
- IP network management protocols
- Pulling it all together
- An alternative approach

What is network management?

One point of view: a large field full of acronyms
- EMS, TMN, NE, CMIP, CMISE, OSS, ASN.1, TL1, EML, FCAPS, ITU, ...
- (Don't ask me what all of those mean, I don't care!)

From the Jargon File:
- In 1989, a random of the journalistic persuasion asked hacker Paul Boutin "What do you think will be the biggest problem in computing in the 90s?" Paul's straight-faced response: "There are only 17,000 three-letter acronyms." (To be exact, there are 26^3 = 17,576.)

We'll ignore most of them

What is network management?

Computer networks are considered to have three operating timescales:
- Data: packet forwarding [ μs, ms ]
- Control: flows/connections [ secs, mins ]
- Management: aggregates, networks [ hours, days ]

…so we’re concerned with “the network” rather than particular devices

Standardization is key!

Overview

- Introduction
- Abstractions: ISO FCAPS, TMN EMS, ATM
- IP network components
- IP network management protocols
- Pulling it all together
- An alternative approach

ISO FCAPS: functional separation

- Fault: recognize, isolate, correct, log faults
- Configuration: collect, store, track configurations
- Accounting: collect statistics, bill users, enforce quotas
- Performance: monitor trends, set thresholds, trigger alarms
- Security: identify, secure, manage risks

TMN EMS: administrative separation

Telecommunications Management Network / Element Management System
"...simple but elegant..." (!) (my emphasis)

- NEL: network elements (switches, transmission systems)
- EML: element management (devices, links)
- NML: network management (capacity, congestion)
- SML: service management (SLAs, time-to-market)
- BML: business management (RoI, market share, blah)

The B-ISDN reference model

Asynchronous Transfer Mode "cube" (see IAP lectures, maybe)

Plane management: the whole network
...vs layer management: specific layers
- Topology, configuration, fault, operations, accounting, performance

[Figure: the B-ISDN "cube": user, control, and management planes over the higher layers, ATM adaptation layer, ATM layer, and physical layer; plane management spans the whole stack, layer management applies within each layer.]

Network management

Models of general communication networks tend to be quite abstract and exceedingly tedious!
- Many practitioners still seem excited about OO programming, WIMP interfaces, etc
- ...probably because implementation is hard due to so many excessively long and complex standards!

My view: the basic "need-to-know" requirements are

1. What should be happening? [ c ]
2. What is happening? [ f, p, a ]
3. What shouldn't be happening? [ f, s ]
4. What will be happening? [ p, a ]

(the letters refer to the FCAPS functions above)

Network management

We'll concentrate on IP networks
- Still acronym city: ICMP, SNMP, MIB, RFC
- Sample size: ~10^2 routers, ~10^5 hosts

We'll concentrate on the network core
- Routers, not hosts

We'll ignore "service management"
- DNS, AD, file stores, etc

Overview

- Introduction
- Abstractions
- IP network components: IP primer, router configuration
- IP network management protocols
- Pulling it all together
- An alternative approach

IP primer (you probably know all this)

- Destination-routed packets, no connections
- Time-to-live field: allows removal of looping packets
- Routers forward packets based on routeing tables
  - Tables populated by routeing protocols
- Routers and protocols operate independently
  - ...although protocols aim to build consistent state
- RFCs ~= standards
  - Often much looser semantics than e.g. ISO, ITU standards
  - Compare for example OSPF [RFC2328] and IS-IS [RFC1142, RFC1195], two link-state routeing protocols

So, how do you build an IP network?

1. Buy (lease) routers
2. Buy (lease) fibre
3. Connect them all together
4. Configure routers appropriately
5. Configure end-systems appropriately

Assume you've done 1–3 and someone else is doing 5...

Router configuration

- Initialization: name the router, set up boot options, set up authentication options
- Configure interfaces
  - Loopback, ethernet, fibre, ATM
  - Subnet/mask, filters, static routes
  - Shutdown (or not), queueing options, full/half duplex
- Configure routeing protocols (OSPF, BGP, IS-IS, ...)
  - Process number, addresses to accept routes from, networks to advertise
- Access lists, filters, ...
  - Numeric id, permit/deny, subnet/mask, protocol, port
  - Route-maps, matching routes rather than data traffic
- Other configuration aspects: traps, syslog, etc

Router configuration fragments

hostname FOOBAR
!
boot system flash slot0:a-boot-image.bin
boot system flash bootflash:
logging buffered 100000 debugging
logging console informational
aaa new-model
aaa authentication login default tacacs local
aaa authentication login consoleport none
aaa authentication ppp default if-needed tacacs
aaa authorization network tacacs
!
ip tftp source-interface Loopback0
no ip domain-lookup
ip name-server 10.34.56.78
!
ip multicast-routing
ip dvmrp route-limit 7000
ip cef distributed

interface Loopback0
 description router-1.network.corp.com
 ip address 10.65.21.43 255.255.255.255
!
interface FastEthernet0/0/0
 description Link to New York
 ip address 10.65.43.21 255.255.255.128
 ip access-group 175 in
 ip helper-address 10.65.12.34
 ip pim sparse-mode
 ip cgmp
 ip dvmrp accept-filter 98 neighbor-list 99
 full-duplex
!
interface FastEthernet4/0/0
 no ip address
 ip access-group 183 in
 ip pim sparse-mode
 ip cgmp
 shutdown
 full-duplex

router ospf 2
 log-adjacency-changes
 passive-interface FastEthernet0/0/0
 passive-interface FastEthernet0/1/0
 passive-interface FastEthernet1/0/0
 passive-interface FastEthernet1/1/0
 passive-interface FastEthernet2/0/0
 passive-interface FastEthernet2/1/0
 passive-interface FastEthernet3/0/0
 network 10.65.23.45 0.0.0.255 area 1.0.0.0
 network 10.65.34.56 0.0.0.255 area 1.0.0.0
 network 10.65.43.0 0.0.0.127 area 1.0.0.0

access-list 24 remark Mcast ACL
access-list 24 permit 239.255.255.254
access-list 24 permit 224.0.1.111
access-list 24 permit 239.192.0.0 0.3.255.255
access-list 24 permit 232.192.0.0 0.3.255.255
access-list 24 permit 224.0.0.0 0.0.0.255
access-list 1011 deny 0000.0000.0000 ffff.ffff.ffff ffff.ffff.ffff 0000.0000.0000 0xD1 2 eq 0x42
access-list 1011 permit 0000.0000.0000 ffff.ffff.ffff 0000.0000.0000 ffff.ffff.ffff

tftp-server slot1:some-other-image.bin
tacacs-server host 10.65.0.2
tacacs-server key xxxxxxxx
rmon event 1 trap Trap1 description "CPU Utilization>75%" owner config
rmon event 2 trap Trap2 description "CPU Utilization>95%" owner config

Lots of quite large and fragile text files
- 100s/1000s of routers, 100s/1000s of lines per config
- Errors are hard to find and have non-obvious results
- Router configuration is also editable on-line

How to keep track of them all?
- Naming schemes, directory hierarchies, CVS
- ssh upload and atomic commit to router
- Perhaps even a database

State of the art is pretty basic
- Few tools to check consistency
- Generally generate configurations from templates (a minimal sketch follows) and have a human-intensive process to control access to running configs
- Topic of current research [Feamster et al]
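To make the template point concrete, here is a minimal sketch in Python using only the standard library; the stanza layout and field names are hypothetical, and a real pipeline adds validation, review, and version control on top:

from string import Template

# Hypothetical interface stanza; real configs contain many more commands.
IFACE = Template("""\
interface $name
 description $descr
 ip address $addr $mask
!""")

def render(interfaces):
    # One stanza per interface record; order is preserved.
    return "\n".join(IFACE.substitute(i) for i in interfaces)

print(render([{"name": "FastEthernet0/0/0",
               "descr": "Link to New York",
               "addr": "10.65.43.21", "mask": "255.255.255.128"}]))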

Router configuration

[Screenshot of a router-configuration tool: "this counts as quite advanced!"]

Overview

- Introduction
- Abstractions
- IP network components
- IP network management protocols: ICMP, SNMP, Netflow
- Pulling it all together
- An alternative approach

ICMP

Internet Control Message Protocol [RFC792]
- IP protocol #1; in-band "control"
- Variety of message types:
  - echo/echo reply [ PING (packet internet groper) ]
  - time exceeded [ TRACEROUTE ]
  - destination unreachable, redirect
  - source quench

Ping (Packet INternet Groper)

- Test for liveness
  - ...also used to measure (round-trip) latency
- Send an ICMP echo; a valid IP host [RFC1122, RFC1123] must reply with an ICMP echo response (see the sketch below)
- Subnet PING? Useful, but often not available/deprecated
  - "ACK" implosion could be a problem
  - RFCs ~= standards
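As a concrete illustration of the echo/echo-reply exchange, a minimal Python sketch (raw sockets need root privileges; a real tool would match the reply's id/seq fields and handle losses):

import os
import socket
import struct
import time

def checksum(data):
    # RFC 1071 Internet checksum over 16-bit words.
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    total = (total >> 16) + (total & 0xffff)
    total += total >> 16
    return ~total & 0xffff

def ping(host, timeout=1.0):
    # One ICMP echo request (type 8, code 0); returns RTT in seconds.
    sock = socket.socket(socket.AF_INET, socket.SOCK_RAW,
                         socket.getprotobyname("icmp"))
    sock.settimeout(timeout)
    ident = os.getpid() & 0xffff
    hdr = struct.pack("!BBHHH", 8, 0, 0, ident, 1)   # checksum field = 0
    pkt = struct.pack("!BBHHH", 8, 0, checksum(hdr + b"x"), ident, 1) + b"x"
    start = time.time()
    sock.sendto(pkt, (host, 0))
    sock.recvfrom(1024)   # naive: accepts any ICMP delivered to this socket
    return time.time() - start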

Traceroute

Which route do my packets take to their destination?
- Send UDP packets with increasing time-to-live values
- A compliant IP host must respond with ICMP "time exceeded", so each host along the path is triggered to respond in turn (see the sketch below)

Not quite that simple
- One router, many IP addresses: which source address?
  - Router control processor, inbound or outbound interface
- Routes are often asymmetric, so return path != outbound path
- Routes change
- Do we want full-mesh host-host routes anyway?!
  - Size of data set, amount of probe traffic
- This is topology; what about load on links?
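A minimal sketch of the mechanism, with dest given as a dotted-quad address (again needing root for the raw receive socket; it ignores per-probe timing and does not distinguish time-exceeded from the final port-unreachable):

import socket

def traceroute(dest, max_hops=30, port=33434):
    # Probe with empty UDP datagrams of increasing TTL; each hop that
    # expires the TTL answers with an ICMP time-exceeded message.
    for ttl in range(1, max_hops + 1):
        recv = socket.socket(socket.AF_INET, socket.SOCK_RAW,
                             socket.getprotobyname("icmp"))
        recv.settimeout(2.0)
        send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        send.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, ttl)
        send.sendto(b"", (dest, port))
        try:
            _, (hop, _) = recv.recvfrom(512)   # source of the ICMP reply
        except socket.timeout:
            hop = "*"
        send.close()
        recv.close()
        print(ttl, hop)
        if hop == dest:   # destination replied (with port unreachable)
            break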

SNMP

Protocol to manage information tables at devices
- Provides get, set, trap, notify operations
  - get, set: read, write values
  - trap: signal a condition (e.g. threshold exceeded)
  - notify: reliable trap

Complexity mostly in the MIB design
- Some standard tables, but many vendor specific
- Non-critical, so tables are often populated incorrectly
- Many tens of MIBs (thousands of lines) per device
- Different versions, different data, different semantics

Yet another configuration tracking problem
- Inter-relationships between MIBs
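For flavour, a single SNMP get of a device's sysDescr, sketched with the third-party pysnmp package; the address and the "public" community string are placeholder assumptions:

from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                          ContextData, ObjectType, ObjectIdentity, getCmd)

# One SNMPv2c GET of SNMPv2-MIB::sysDescr.0 against a hypothetical router.
err, status, index, var_binds = next(getCmd(
    SnmpEngine(),
    CommunityData("public", mpModel=1),        # mpModel=1 selects v2c
    UdpTransportTarget(("10.65.21.43", 161)),
    ContextData(),
    ObjectType(ObjectIdentity("SNMPv2-MIB", "sysDescr", 0))))
if err:
    print(err)
else:
    for name, value in var_binds:
        print(name.prettyPrint(), "=", value.prettyPrint())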

IPFIX

IETF working group: export of flow-based data out of IP network devices
- Developing a suitable protocol based on Cisco NetFlow™ v9 [RFC3954, RFC3955]

Statistics reporting
- Set up a template, then send data records matching that template

Many variables
- Packet/flow counters, rule matches; quite flexible
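The export format is simple at the top level. A sketch of parsing the 20-byte NetFlow v9 export header per RFC 3954 (parsing the template and data FlowSets that follow, the interesting part, is omitted):

import struct

def parse_v9_header(datagram):
    # Header fields per RFC 3954: version, record count, sysUptime (ms),
    # UNIX secs, sequence number, source id.
    version, count, uptime, secs, seq, source = struct.unpack(
        "!HHIIII", datagram[:20])
    if version != 9:
        raise ValueError("not a NetFlow v9 datagram")
    return {"count": count, "uptime_ms": uptime, "unix_secs": secs,
            "sequence": seq, "source_id": source}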

Overview

- Introduction
- Abstractions
- IP network components
- IP network management protocols
- Pulling it all together: network mapping, statistics gathering, control
- An alternative approach

An hypothetical NMS

GUI around ICMP (ping, traceroute), SNMP, etc

Recursive host discovery (sketched below)
- Broadcast ping, ARP, default gateway: start somewhere
- Recursively SNMP-query for known hosts/connected networks
- Ping known hosts to test liveness
- Iterate

Display topology: allow "drill-down" to particular devices

Configure and monitor known devices
- Trap, Netflow™, syslog message destinations
- Counter thresholds, CPU utilization threshold, fault reporting
- Particular faults or fault patterns
- Interface statistics and graphs
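The discovery loop is essentially a breadth-first search. A sketch, where is_alive and snmp_neighbours are hypothetical callbacks wrapping the ping and SNMP machinery already described:

from collections import deque

def discover(seeds, is_alive, snmp_neighbours):
    # Breadth-first sweep: ping each candidate, then ask it (via SNMP,
    # e.g. its ARP and route tables) for further candidates.
    known = set(seeds)
    queue = deque(seeds)
    while queue:
        host = queue.popleft()
        if not is_alive(host):
            continue
        for n in snmp_neighbours(host):
            if n not in known:
                known.add(n)
                queue.append(n)
    return known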

A real NOC (Network Operations Centre) [image from AT&T]

An hypothetical NMS

All very straightforward? No, not really
- A lot of software engineering: corner cases, traceroute interpretation, NATs, etc
- MIBs may contain rubbish
- Can only view inside your own network anyway

Efficiency
- Rate-pacing discovery traffic: ping implosion/explosion
- SNMP overloading router CPUs
- Tunnelled, encrypted protocols becoming prevalent

Using NMSs is also not straightforward
- How to set up "correct" thresholds?
- How to decide when something "bad" has happened?
- How to present (or even interpret) reams and reams of data?

Overview

- Introduction
- Abstractions
- IP network components
- IP network management protocols
- Pulling it all together
- An alternative approach

From the edges…

Edge-based network management platform
- Collect flow information from hosts, and
- Combine it with topology information from routeing protocols
- Enable visualization, analysis, simulation, control

Avoid problems of not-quite-standard interfaces
- Management support is typically 'non-critical' (i.e. buggy) and not extensively tested for inter-operability

Do the work where resources are plentiful
- Hosts have lots of cycles and little traffic (relatively)
- Protocol visibility: see into tunnels, IPSec, etc

ENMA

System outline

[Diagram: hosts export packet/flow observations and accept control; a routeing-protocol listener supplies topology; both feed a distributed database holding the traffic matrix ({srcs} × {dsts}) and the set of routes, which in turn drives visualization and a simulator.]

Where is my traffic going today?
- Pictures of current topology and traffic
- Routes + flows + forwarding rules: the BIG PICTURE

In fact, where did my traffic go yesterday?
- Keep historical data for capacity planning, etc
- A platform for anomaly detection: historical data suggests "normality"; live monitoring allows anomalies to be detected (see the sketch below)
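A deliberately crude sketch of that idea: flag a live measurement that sits far outside the historical distribution (real detectors must handle seasonality, trends, and multi-variate structure):

import statistics

def is_anomalous(history, current, threshold=3.0):
    # history: past samples of some counter (needs at least two points);
    # flag values more than `threshold` standard deviations from the mean.
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return stdev > 0 and abs(current - mean) / stdev > threshold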

Where might my traffic go tomorrow?
- Plug into a simulator back-end
  - Discrete event simulator, flow allocation solver
- Run multiple 'what-if' scenarios
  - ...failures
  - ...reconfigurations
  - ...technology deployments
- E.g. "What happens if we coalesce all the Exchange servers in one data-centre?"

Where should my traffic be going?
- Close the loop: compute link weights to implement policy goals
- Recompute on the order of hours/days
- Allows more dynamic policies
  - Modify network configuration to track e.g. time-of-day load changes
  - Make the network more efficient (~cheaper)?

Where are we now?

Three major components
- Flow collection
- Route collection
- Distributed database

Building prototypes, simulating the system

Data collection

Flow collection
- Hosts track active flows using a low-overhead event-posting infrastructure, ETW (Event Tracing for Windows)
- Built a prototype device-driver provider & user-space consumer
- Used packet traces for a feasibility study on (client, server)
  - Peaks at (165, 5667) live and (39, 567) active flows per sec

Route collection
- OSPF is link-state: passively collect link-state adverts
- Extension of my work at Sprint (for IS-IS and BGP); also been done at AT&T (NSDI'04 paper)

The distributed database

Logically contains:
1. Traffic flow matrix (bandwidths), {srcs} × {dsts}
2. ...each entry annotated with the current route from src to dst
N.B. src/dst might be e.g. (IP end-point, application)

Large dynamic data set suggests aggregation
- Related work: { distributed, continuous query, temporal } databases; sensor networks
- Potential starting points: Astrolabe or SDIMS (SIGCOMM'04)

- Where/what/how much to aggregate?
- Is data read- or write-dominated?
- Which is more dynamic, flow or topology data?
- Can the system successfully self-tune?

The distributed database

Construct traffic matrix from flow monitoring
- Hosts can supply the flows they source and sink
- Only need a subset of this data to get the complete traffic matrix

Construct topology from route collection
- OSPF supplies topology → routes

Wish to be able to answer queries like:
- "Who are the top-10 traffic generators?"
  - Easy to aggregate; don't care about topology
- "What is the load on link l?"
  - Can aggregate from hosts, but need to know routes (see the sketch below)
- "What happens if we remove links {l…m}?"
  - Interaction between traffic matrix, topology, even flow control
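The link-load query shows why routes matter: the load on a link is the sum of the traffic-matrix entries whose route crosses it. A sketch over plain dictionaries (the data layout is a hypothetical one):

def link_load(link, traffic_matrix, routes):
    # traffic_matrix: (src, dst) -> bandwidth
    # routes:         (src, dst) -> sequence of links from src to dst
    return sum(bw for pair, bw in traffic_matrix.items()
               if link in routes[pair])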

The distributed database

Building simulation model
- OSPF data gives topology, event list, routes
- Simple load model to start with (load ~ # subnets)
- Precedence matrix (from SPF) reduces the flow-data query set
- Can we do as well as/better than e.g. NetFlow? Accuracy/coverage trade-off

How should we distribute the DB?
- Just OSPF data? Just flow data? A mixture?
- How many levels of aggregation? How many nodes do queries touch?

What sort of API is suitable?
- Example queries for sample applications

Summary

- Introduction: what is network management?
- Abstractions: ISO FCAPS, TMN EMS, ATM
- IP network components: IP, routers, configurations
- IP network management protocols: ICMP, SNMP, etc
- Pulling it all together: outline of a network management system
- An alternative approach: from the edges

The end

Questions? Answers?

http://www.cisco.com/
http://www.routergod.com/
http://www.ietf.org/
http://ipmon.sprintlabs.com/pyrt/
http://www.nanog.org/

Backup slides

Internet routeing
- OSPF
- BGP

Internet routeing

Q: how to get a packet from node to destination?

A1: advertise all reachable destinations and apply a consistent cost function (distance vector)

A2: learn network topology and compute consistent shortest paths (link state)
- Each node (1) discovers and advertises adjacencies; (2) builds a link state database; (3) computes shortest paths

A1, A2: forward to next-hop using longest-prefix-match

OSPF (~link state routeing)

Q: how to route a given packet from any node to its destination?

A: learn the network topology; compute shortest paths

For each node:
- Discover adjacencies (~immediate neighbours); advertise
- Build link state database (~network topology)
- Compute shortest paths to all destination prefixes (see the sketch below)
- Forward to next-hop using longest-prefix-match (~most specific route)
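The shortest-path step is ordinary Dijkstra over the link-state database. A minimal sketch (the real protocol computes per-prefix routes and handles equal-cost ties):

import heapq

def shortest_paths(lsdb, source):
    # lsdb: node -> list of (neighbour, cost) adjacencies,
    # as assembled from flooded link-state advertisements.
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                      # stale heap entry
        for v, cost in lsdb.get(u, ()):
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u               # walk prev to recover next-hops
                heapq.heappush(heap, (nd, v))
    return dist, prev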

BGP (~path vector routeing)

Q: how to route a given packet from any node to its destination?
A: neighbours tell you the destinations they can reach; pick the cheapest option

For each node:
- Receive (destination, cost, next-hop) for all destinations known to the neighbour
- Longest-prefix-match among next-hops for a given destination
- Advertise the selected (destination, cost+, next-hop') for all known destinations

Selection process is complicated
- Routes can be modified/hidden at all three stages
- General mechanism for application of policy (a simplified selection sketch follows)
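A toy sketch of the receive-select-advertise cycle; real BGP selection involves many more tie-breakers (local-pref, AS-path length, MED, ...) and per-neighbour policy:

def select(advertisements):
    # advertisements: iterable of (prefix, cost, next_hop) from neighbours;
    # keep the cheapest route per prefix.
    best = {}
    for prefix, cost, next_hop in advertisements:
        if prefix not in best or cost < best[prefix][0]:
            best[prefix] = (cost, next_hop)
    return best

def export(best, link_cost, self_addr):
    # Re-advertise selected routes with our cost added and ourselves as
    # next hop (policy may also modify/hide routes at this stage).
    return [(p, c + link_cost, self_addr) for p, (c, _) in best.items()]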