1

State of Network Monitoring and Analysis in the US

Les Cottrell, KC Claffy, Brian Tierney, Ronn Ritke, Hans-Werner Braun

Prepared for the LSN meeting at NSF Washington, 6/10/03

Partially funded by DOE/MICS Field Work Proposal on Internet End-to-end Performance Monitoring (IEPM), by the SciDAC base program

2

Outline

• Goal for the network monitoring & analysis talk:
  – identify the R&D gaps and large-scale deployment issues for DOE, NSF, DARPA, NASA, NSA, NIST, etc. (the federal agencies that fund network research in the US)
• Two complementary presentations
  – High performance networking measurement needs for Science (E2E): Les
  – Consumer grade & net-centric measurement needs: kc
• Science network measurement needs
  – The end-to-end challenge, illustrations
  – Solution
  – End-to-end monitoring goals
  – Current issues
    • Problem analysis, measurement infrastructure, analysis tools, standards, collaborations
  – Benefits to Science
  – Consequences of not addressing issues
  – Why not leave to industry
  – Appendix
    • What is being done today
      – Who is measuring?
      – Who is using the measurements?
      – What is being measured?
      – What tools are being used?

3

The Problem

• Distributed systems are very hard
  – "A distributed system is one in which I can't get my work done because a computer I've never heard of has failed." (Butler Lampson)
• When building distributed systems, we often observe unexpectedly low performance
    • the reasons for which are usually not obvious
• The bottlenecks can be in any of the following components:
  – the applications
  – the operating systems
  – the disks, network adapters, bus, memory, etc. on either the sending or receiving host
  – the network switches and routers, and so on
• Problems may not be logical
  – Most problems are operator errors, configurations, bugs

4

Anatomy of a Problem

[Figure: the end-to-end path, with the party responsible for each segment: Applications Developer, System Administrator, LAN Administrator, Campus Networking, Gigapop, Backbone, Gigapop, Campus Networking, LAN Administrator, System Administrator, Applications Developer.]

How do you solve a problem along a path?

Speech bubbles in the figure:
• “Hey, this is not working right!”
• “The computer is working OK”
• “Talk to the other guys”
• “Everything is AOK”
• “No other complaints”
• “The network is lightly loaded”
• “All the lights are green”
• “We don’t see anything wrong”
• “Looks fine”
• “Others are getting in ok”
• “Not our problem”

5

Problem examples: Help, it’s not working

• I’ve lost my connection
• Despite over-provisioned networks, users cannot get the throughput they expect
  – Wizard gap
• What should I expect the performance to be?
• It sometimes works …
• What am I, as a scientist, supposed to do?
• Need tools/measurements to detect problems and identify their location, cause and time of occurrence

Is the Grid server down, is the network partitioned, is there heavy congestion, did DNS fail, is a firewall preventing access …

[Figure: plots with Mbits/s axes and curves labelled “Wizard” and “Typical user”.]
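The questions above (did DNS fail, is the server down, is a firewall preventing access) can be approached as an ordered series of checks. The following is a minimal sketch in Python, not a tool from the talk. The function name `diagnose`, the result labels, and the injectable `resolve` and `connect` parameters are all assumptions made for illustration, and the labels are coarse guesses at a cause rather than proof.

```python
import socket
from typing import Callable


def diagnose(host: str, port: int, timeout: float = 3.0,
             resolve: Callable = socket.getaddrinfo,
             connect: Callable = socket.create_connection) -> str:
    """Classify a connection problem with a coarse, ordered set of checks.

    The returned labels are hypothetical names chosen for this sketch.
    """
    # Step 1: can the host name be resolved at all?
    try:
        resolve(host, port)
    except socket.gaierror:
        return "dns_failure"

    # Step 2: can a TCP connection be opened to the service port?
    try:
        conn = connect((host, port), timeout)
    except ConnectionRefusedError:
        # The host answered but nothing is listening on the port.
        return "service_down"
    except socket.timeout:
        # No answer in time: a firewall may be dropping packets,
        # or the path may be partitioned or heavily congested.
        return "timeout_firewall_or_partition"
    except OSError:
        # For example "network is unreachable" or "no route to host".
        return "network_unreachable"
    conn.close()
    return "ok"
```

Calling `diagnose("www.example.org", 80)` runs the checks against a real host. Passing substitute `resolve` and `connect` callables allows the classification logic to be exercised without touching the network.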

6

The Solution

• A complete End-to-End monitoring framework that includes the following components:
  – instrumentation tools (application, middleware, and OS monitoring)
  – host and network sensors (host and network monitoring)
  – sensor management / activation tools
  – event publication service
  – event archive service
  – event analysis and visualization tools
  – a common set of protocols for describing, exchanging, and locating monitoring data
• Need for applications (e.g. Grid middleware), diagnosis, perf. analysis
  – toolkit for streamlined problem diagnosis: detection, location, isolation & reporting
    • glue to multiple sources of information, traceroute archives, router info, delay/loss archives, on-demand tests, baselines
    • analysis and heuristics
  – E2Epi working on solution, but only funded for coordination, not for all the underlying work
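To make the relationship between sensors, the event publication service, and the event archive service more concrete, here is a small in-process sketch in Python. It is not the framework the slide describes. The class names `Event`, `EventBus`, and `Archive`, the metric name `rtt_ms`, and the 200 ms alert threshold are hypothetical, and a real deployment would publish over the network rather than through direct function calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Event:
    """One monitoring event produced by a sensor (hypothetical fields)."""
    source: str
    metric: str
    value: float
    timestamp: float


class EventBus:
    """Toy publication service: delivers each event to subscribers of its metric."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[Event], None]]] = {}

    def subscribe(self, metric: str, handler: Callable[[Event], None]) -> None:
        self._subscribers.setdefault(metric, []).append(handler)

    def publish(self, event: Event) -> None:
        for handler in self._subscribers.get(event.metric, []):
            handler(event)


class Archive:
    """Toy archive service: keeps every event it receives for later queries."""

    def __init__(self) -> None:
        self.events: List[Event] = []

    def store(self, event: Event) -> None:
        self.events.append(event)

    def query(self, metric: str, since: float = 0.0) -> List[Event]:
        return [e for e in self.events
                if e.metric == metric and e.timestamp >= since]


bus = EventBus()
archive = Archive()
alerts: List[Event] = []

# The archive and a simple analysis rule both consume the same event stream.
bus.subscribe("rtt_ms", archive.store)
bus.subscribe("rtt_ms", lambda e: alerts.append(e) if e.value > 200 else None)

# A sensor publishes two readings; only the second crosses the threshold.
bus.publish(Event("sensor-a", "rtt_ms", 35.0, 1.0))
bus.publish(Event("sensor-a", "rtt_ms", 350.0, 2.0))
```

The point of the design is that sensors do not need to know who consumes their data: archiving, alerting, and visualization can each subscribe independently.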

7

End-2-End Monitoring Goals

• Have to solve E2E performance: it is THE critical metric for the user, and it is not just a backbone bandwidth problem
• Improve end-to-end data throughput for data-intensive applications in high-speed WAN environments
• Provide the ability to do performance analysis and fault detection in a Grid computing environment
• Provide accurate, detailed, and adaptive monitoring of all distributed computing components, including the network
• Unfortunately, network management research has historically been very under-funded, because it is difficult to get funding bodies to recognize this as legitimate networking research (IAB Concerns & Recommendations Regarding Internet Research & Evolution)
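One well-known reason end-to-end throughput falls short of backbone bandwidth is TCP window size: a single connection can have at most one window of data in flight per round trip, so its throughput is bounded by window / RTT, and filling a path needs a window of about bandwidth × RTT. The sketch below works through that arithmetic in Python. The function names and the example figures (a 1 Gbit/s path, 100 ms RTT, a 64 KB window) are illustrative assumptions, not measurements from the talk.

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to fill the path."""
    return bandwidth_bps * rtt_s / 8


def window_limited_throughput_bps(window_bytes: float, rtt_s: float) -> float:
    """Upper bound on one TCP connection's throughput for a given window."""
    return window_bytes * 8 / rtt_s


# Illustrative numbers only: a 1 Gbit/s path with 100 ms round-trip time.
needed_window = bdp_bytes(1e9, 0.100)

# The same path with a 64 KB window (65535 bytes, the limit without window scaling).
untuned_limit = window_limited_throughput_bps(65535, 0.100)

print(f"window needed to fill the path: {needed_window / 1e6:.1f} MB")
print(f"throughput limit with a 64 KB window: {untuned_limit / 1e6:.2f} Mbit/s")
```

Under these assumed figures the untuned connection is limited to a few Mbit/s on a path that could carry a thousand, which is one concrete form the gap between a typical user and a network wizard can take.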

8

Current Issues 1: Problem Analysis

• Cultivate systematic studies of problems: their causes, how to discover them, how to report them, how to by-pass them
  – Analysis to help decide which problems are the most important, and to see how they are tackled manually today
  – Decide which problems are most cost-effective to address by developing tools that assist in diagnosis

9

Current issues 2: Measurement Infrastructures

• Need to build infrastructure to support troubleshooting:
  – Requires repetitive and on-demand measurements with appropriate security model.
  – Provide recommended/accepted set of tools for delay, RTT, loss, route tracking, "bandwidth" estimation.
    • Include archiving and access to data, analysis and reporting of repetitive data.
  – Allow for evaluation, validation and comparison of new measurement tools, TCP stacks, applications (e.g. file transfer).
  – Reverse traceroute, looking glass, remote tcpdump (e.g. SCNM), remote testing of connection (ANL NDT)
  – Traceroute archives
  – Make tools easier to comprehend and use by scientists
  – Encourage efforts such as Internet2 E2Epi efforts to provide measurements inside the cloud
    • Extend to ESnet & other NRNs, and beyond
    • Fund collaboration across boundaries
  – Ubiquitous coverage (requires multiple toolkits): inter-agency, international, hi-speed, digital divide, long term and current
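As an illustration of repetitive measurements with archiving, the sketch below reduces one round of probe results to loss and RTT statistics and appends the summary to an archive. It is a minimal Python sketch under stated assumptions: the function names, the dictionary keys, and the in-memory list used as the archive are hypothetical, and the probes themselves (for example ping) are assumed to have been run elsewhere.

```python
from statistics import mean
from typing import Dict, List, Optional, Tuple


def summarize_probes(rtts_ms: List[Optional[float]]) -> Dict[str, Optional[float]]:
    """Summarize one round of probes; None marks a probe that got no reply."""
    answered = [r for r in rtts_ms if r is not None]
    sent = len(rtts_ms)
    return {
        "sent": sent,
        "loss_pct": 100.0 * (sent - len(answered)) / sent if sent else None,
        "min_ms": min(answered) if answered else None,
        "avg_ms": mean(answered) if answered else None,
        "max_ms": max(answered) if answered else None,
    }


# Hypothetical archive: one (timestamp, source, destination, summary) per round.
archive: List[Tuple[float, str, str, Dict[str, Optional[float]]]] = []


def record(timestamp: float, src: str, dst: str,
           rtts_ms: List[Optional[float]]) -> None:
    """Store the summary of one measurement round for later analysis."""
    archive.append((timestamp, src, dst, summarize_probes(rtts_ms)))


# One round of four probes between two hypothetical hosts, one probe lost.
record(0.0, "host-a.example.org", "host-b.example.org", [10.0, None, 20.0, 30.0])
```

Storing summaries rather than raw probes keeps the archive small, at the cost of losing per-packet detail; which trade-off is right depends on the analysis planned.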

10

Current issues 3: Analysis tools

• Provide measurement tools to accurately & quickly identify performance problems,
  – to automatically take action to investigate and provide information for:
    • Scientist
    • Grid support “NOC”
    • Network administrator or network person
  – Promote well understood, accepted metrics for customers for realistic, enforceable SLAs,
    • provide acceptable limits,
    • provide tools to track
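One simple way a tool could identify a performance problem automatically is to compare each new measurement with a baseline built from recent history. The sketch below does this for a metric where larger is worse, such as RTT, using the median and the median absolute deviation. This is one possible heuristic, not the method proposed in the talk; the function name `is_anomalous`, the multiplier `k`, the minimum history length, and the 5% floor on the spread are all assumptions.

```python
from statistics import median
from typing import List


def is_anomalous(history: List[float], current: float, k: float = 3.0) -> bool:
    """Flag `current` if it is far above the baseline formed by `history`.

    Intended for metrics where larger values are worse (e.g. RTT in ms).
    """
    if len(history) < 5:
        return False  # too little history to form a baseline

    baseline = median(history)
    # Median absolute deviation: a spread estimate that tolerates outliers.
    mad = median(abs(x - baseline) for x in history)
    # Floor the spread so a perfectly flat history does not flag tiny changes.
    spread = max(mad, 0.05 * abs(baseline), 1e-9)
    return current > baseline + k * spread
```

A flagged value would then trigger the follow-up actions the slide lists, such as on-demand tests or a report to the scientist, the Grid support NOC, or the network administrator.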

11

Current issues 4: Standards

• All the above requires:
  – easy to use standard ways (e.g. web services) for applications to access data from existing and new monitoring projects.
  – standard naming conventions and schemas.
• This will provide the ability to share information from multiple measurement infrastructure projects
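To show why common naming conventions and schemas matter, the sketch below maps the results of two different tools into a single record type and serializes it. Everything here is hypothetical: the `Measurement` fields are not taken from the NMWG or GLUE schemas, and the input dictionaries are invented shapes, not the real output formats of ping or iperf.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Measurement:
    """A hypothetical common record shared by all measurement projects."""
    tool: str
    metric: str
    source: str
    destination: str
    value: float
    unit: str
    timestamp: str


def from_delay_result(raw: dict) -> Measurement:
    """Map an invented delay-tool result into the common record."""
    return Measurement("ping", "rtt", raw["from"], raw["to"],
                       raw["avg_rtt_ms"], "ms", raw["time"])


def from_throughput_result(raw: dict) -> Measurement:
    """Map an invented throughput-tool result, converting Mbit/s to bit/s."""
    return Measurement("iperf", "throughput", raw["client"], raw["server"],
                       raw["mbps"] * 1e6, "bit/s", raw["when"])


def to_json(m: Measurement) -> str:
    """Serialize with sorted keys so every producer emits the same layout."""
    return json.dumps(asdict(m), sort_keys=True)


delay = from_delay_result({"from": "a.example.org", "to": "b.example.org",
                           "avg_rtt_ms": 42.0, "time": "2003-06-10T12:00:00Z"})
rate = from_throughput_result({"client": "a.example.org",
                               "server": "b.example.org",
                               "mbps": 94.5, "when": "2003-06-10T12:05:00Z"})
print(to_json(delay))
print(to_json(rate))
```

Once both tools emit the same field names and units, an application can query either one without tool-specific parsing, which is the sharing benefit the slide points to.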

12

Current issues 5: Collaboration

• Need to build multi-disciplinary teams (incent orthogonal groups to work with one another):
  – include people close to eventual customers (scientists, operational folks)
    • to ensure what is developed is useful and tested out in realistic environments
  – include vendors and providers in funded projects to bridge the gaps
• E2Epi is funded to provide coordination
• Multi-agency funding!
  – This is not a problem a single agency can address
  – Science applications cross multi-agency networks, but there are barriers to interagency network monitoring collaborations

13

Benefits to Science

• Network reaches its potential
  – enable new ways of doing science:
    • data intensive science (astrophysics, global weather, seismology, medicine),
    • remote instrument control (SNS, fusion (ITER), surgery),
    • remote visualization/insight (Terascale supernova, climate modeling),
    • world-wide collaboration enabling (LHC, ITER)
  – enables scientists to do science
    • Wizard gap closure, not fighting the network, network becomes a catalyst
  – Without good troubleshooting capabilities, the Grid vision will fail
  – Predictability, planning, expectations, raising the bar

14

What happens if we do not address the issues

• Data continues to ship inefficiently by truck/plane (FedEx)
  – Long delays (2 weeks), degraded collaboration, US scientists continue to lose leadership
  – Increased costs (manpower costs, lack of automation)
• Inadequate reliability or performance for new applications (e.g. Grid fails to reach its potential)
• New capabilities do not emerge in US:
  – remote instrument control, real-time video, media distribution…
  – US science loses leadership to Japan, Europe, Canada

15

Why not leave it to industry

• Industry won’t do it (“it’s not my problem”):
  – Has its interest and hands full elsewhere
  – It’s hard, does not sell products, little Return on Investment
  – Historically poor record, competitive concerns
    • Management features are late in product development cycle
    • Early success with SNMP and Netflow
    • Commercial Network Management Platforms (e.g. OpenView, Tivoli) had limited success (network oriented, not user), not cost effective
• ISPs only measure their own nets, not E2E; SLA guarantees are not cross-provider

16

More Information

• Some Measurement Infrastructures:
  – CAIDA list: www.caida.org/analysis/performance/measinfra/
  – AMP: amp.nlanr.net/, PMA: http://pma.nlanr.net
  – IEPM/PingER home site: www-iepm.slac.stanford.edu/
  – IEPM-BW site: www-iepm.slac.stanford.edu/bw
  – NIMI: ncne.nlanr.net/nimi/
  – RIPE: www.ripe.net/test-traffic/
  – NWS: nws.cs.ucsb.edu/
  – Internet2 PiPES: e2epi.internet2.edu/
• Tools
  – CAIDA measurement taxonomy: www.caida.org/tools/
  – SLAC Network Tools: www.slac.stanford.edu/xorg/nmtf/nmtf-tools.html
• Internet research needs:
  – www.ietf.org/internet-drafts/draft-iab-research-funding-00.txt

17

Appendix: Current Practices

18

Who is Measuring?

• CAIDA (skitter, macroscopic …)
• NLANR (e.g. AMP: active, PMA: passive)
• LBL (e.g. netest)
• SLAC/FNAL (e.g. PingER, IEPM-BW)
• PSC (NIMI)
• RICE (INCITE)
• Europe: RIPE (Eu ISPs), PPMCG
• NWS
• Internet2 (PiPES, IETF/IPPM, Netflow)
• Sprint, ATT Research
• Commercial (Keynote, Matrix, internetweather…)
• For more see www.caida.org/analysis/performance/measinfra

19

Who is using the measurements (customers)?

• Users
  – “Why is the performance not what I would like or expect?”
    • Set expectations, build case to complain to ISP
  – What should I expect, what applications are likely to work
• Planners: observe growth, decide when upgrades are needed, make cases for upgrades
• Network engineers: pin-point problem, provide information to providers
• Providers: “where is the problem and what is it”, best bang for the buck
• Grid applications users/developers look forward to using measurements,
  – e.g. Grid Resource Broker data placement
    • Requires APIs (e.g. web services), common naming conventions (e.g. NMWG, GLUE schema …) etc.
• Security: anomalies
• Researchers: modeling, theory testing, scaling laws
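The Grid Resource Broker use case above can be sketched as follows: given recent measurements to each site holding a copy of a file, pick the site with the lowest predicted transfer time. This is a deliberately simple Python model, assuming transfer time is one round trip plus file size divided by measured throughput. The function name `pick_replica`, the dictionary keys, the site names, and the figures are hypothetical.

```python
from typing import Dict


def pick_replica(file_size_bytes: float,
                 replicas: Dict[str, Dict[str, float]]) -> str:
    """Return the replica site with the lowest predicted transfer time.

    `replicas` maps a site name to its latest measurements:
    {"rtt_s": round-trip time in seconds, "throughput_bps": bits per second}.
    """
    def predicted_seconds(m: Dict[str, float]) -> float:
        # Simple model: one round trip to start, then size / throughput.
        return m["rtt_s"] + file_size_bytes * 8 / m["throughput_bps"]

    return min(replicas, key=lambda site: predicted_seconds(replicas[site]))


# Hypothetical measurements for two sites holding the same file.
measurements = {
    "site-a": {"rtt_s": 0.02, "throughput_bps": 50e6},   # near but slow
    "site-b": {"rtt_s": 0.15, "throughput_bps": 400e6},  # far but fast
}

large = pick_replica(1e9, measurements)   # 1 GB file
small = pick_replica(1e3, measurements)   # 1 KB file
print(large, small)
```

With these assumed figures the large file is fetched from the distant, faster site and the small file from the nearby one, which shows why a broker needs both delay and throughput measurements rather than a single "best site" ranking.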

20

What is being Measured 1/2

Purpose | Operations | Research
End-to-end | Ping/traceroute, bandwidth, application performance | Bandwidth estimation
Network centric | SNMP, MRTG, Netflow … | Topology / tomography, mapping, security

Other taxonomies: active vs passive

21

What is being measured 2/2?

• Delays, RTT, loss, jitter, availability
• “Bandwidth” estimation
  – TCP & UDP throughputs
  – Packet pair techniques
  – Packet length techniques (pchar …)
• Topology/tomography, routing
• Utilization, errors
• Security
• Evaluation of new protocols
• Applications (many commercial packages)
  – Email, DB, www …
• One off: traffic characterization at borders and IXPs
  – Exception, providers do not make information public

22

What tools are being used

• Delays etc.: ping, OWAMP, GPS

• “Bandwidth”: iperf, pathload, pipechar, netest, ABwE

• Utilization: SNMP

• Topology/tomography: traceroute, skitter, INCITE

• Routing: RIPE, routeviews

• Traffic characterization: netflow, NeTraMet, tcpdump, coralreef

• Visualization: MRTG, RRD, netgeo, geoplot, tcptrace, xplot
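As an example of how utilization is derived from SNMP, the sketch below turns two readings of an interface octet counter into a percentage of link speed. It assumes a 32-bit counter (such as `ifInOctets`) that wraps around at most once between the two polls, and it covers one direction of traffic only. The function name and the sample values are hypothetical, and fetching the counters from a device is left out.

```python
COUNTER32_MODULUS = 2 ** 32  # a 32-bit SNMP counter wraps to 0 after 2**32 - 1


def utilization_pct(octets_then: int, octets_now: int,
                    interval_s: float, link_speed_bps: float) -> float:
    """Percent utilization of a link between two polls of an octet counter.

    Assumes the counter wrapped at most once during the interval.
    """
    if interval_s <= 0 or link_speed_bps <= 0:
        raise ValueError("interval and link speed must be positive")
    # The modulo handles a wrap: a smaller 'now' value means the counter
    # passed its maximum and restarted from zero.
    delta_octets = (octets_now - octets_then) % COUNTER32_MODULUS
    return 100.0 * delta_octets * 8 / (interval_s * link_speed_bps)


# Invented readings: 37.5 MB received in a 5-minute poll on a 100 Mbit/s link.
print(utilization_pct(0, 37_500_000, 300, 100e6))
```

On fast links a 32-bit counter can wrap more than once within a long polling interval, in which case this calculation under-reports; shorter intervals or 64-bit counters avoid that.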
