ESnet NMTF/NMFG - Status. Les Cottrell, SLAC & Dave Martin, HEPNRC <cottrell@slac.stanford.edu>, <dem@hep.net>. Presented at the ESCC Meeting, JLAB, Oct 1997.
/afs/slac/u/sf/cottrell/talk/escc/oct97 1
ESnet NMTF/NMFG - Status
Les Cottrell, SLAC & Dave Martin, HEPNRC
<cottrell@slac.stanford.edu>, <dem@hep.net>
Presented at the ESCC Meeting, JLAB, Oct 1997
Outline of Talk
• What happened to the NMTF/NMFG?
• What are we measuring?
• How are we measuring?
• Tools we are using/developing
• Coordination with others
• Next Steps
• Summary
What happened to the NMTF/NMFG?
• It evolved
– Some of the original members (BNL & ORNL) were unable to continue the effort
– SLAC & HEPNRC retained focus on monitoring
– ICFA concerned about impact of network performance on HENP research
• Created NTF with various WGs, one on Monitoring
• More focus on HENP issues and international links
• Embraced work done by NMTF/NMFG and supported continued development
• Brought in new partners, in particular INFN, CERN as well as other collection sites
Mission etc. of the ICFA-NTF WG on Monitoring
• Mission of Group:
– Obtain as uniform a picture as possible of the present performance of the connectivity used by the ICFA community
• Two meetings so far, CHEP97 (Apr-97) & Santa Fe (Sep-97)
• Produced an interim status report for Sep-97
• Will update for Dec-97, with a final report Apr-98.
Our Main Metric is Ping
• “Universally available”, easy to understand
– no software for clients to install
• Low network impact
• Provides loss, response time, reachability, unpredictability
• Select hosts carefully; concerns over routers, loaded hosts etc. (provide guidelines)
• Does provide useful measures
Ping Response Time vs Bytes
Ping Response vs Web Response
[Figure: HTTP GET response (ms, 0 to 2000) plotted against minimum ping response (ms, 0 to 500), with the lines y = 2x, y = 1.7135x + 719.83 and y = 2.5726x]
Method - Measurement
• Each Collection site keeps list of remote hosts to ping at sites it is interested in
• Every 30 mins ping each remote host with 11 * 100 byte followed by 10 * 1000 byte pings
• Min separation of pings is 1 second, timeout 20 seconds
• Throw away first ping
• Measure response, packet loss, host unreachable (no answer to any ping)
• Record data and make available
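The per-host bookkeeping listed above can be sketched in Python as follows. This is an illustration only, not the timeping Perl script: the function name summarize_pings, the input layout (one round-trip time in ms per ping, None for a timeout) and the choice to judge reachability on the pings kept after the discard are all assumptions.

```python
from statistics import mean


def summarize_pings(rtts_ms):
    """Summarize one burst of same-size pings to one remote host.

    rtts_ms: one entry per ping in send order, holding the round-trip
    time in ms, or None when no reply arrived before the timeout.
    (Hypothetical input layout, chosen for this sketch.)
    """
    if len(rtts_ms) < 2:
        raise ValueError("need at least two pings: the first is discarded")
    kept = rtts_ms[1:]  # throw away the first ping
    replies = [r for r in kept if r is not None]
    return {
        "sent": len(kept),
        "loss_pct": 100.0 * (len(kept) - len(replies)) / len(kept),
        # Unreachable: no answer to any of the kept pings (assumption).
        "unreachable": not replies,
        "min_ms": min(replies) if replies else None,
        "avg_ms": mean(replies) if replies else None,
        "max_ms": max(replies) if replies else None,
    }


# 11 x 100 byte pings; one of the ten kept pings timed out
burst = [40.0, 31.0, None, 29.0, 33.0, 30.0, 32.0, 31.0, 30.0, 29.0, 35.0]
print(summarize_pings(burst))
```

The same function would be called a second time for the 10 x 1000 byte pings; whether the discard applies to that second burst as well is not stated on the slide.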
Architecture
• Three Types of Sites
– Remote Sites - need only to respond to ping packets
– Collecting Sites
• Collecting Data: Perl Script Pings Nodes, Records Data in common documented format
• Serving Data: CGI/Perl Script makes Data Available to Analysis Sites
• WWW CGI tools make reports available
– Analysis Sites
• Retrieving Data: Perl Script Retrieves Data from Collecting Sites
• Analysis: SAS Program Analyzes Data and Generates Graphs
• Reports: WWW Form Makes Customized Reports Available
Architecture
[Diagram: Collecting sites send pings to Remote sites; Analysis sites (e.g. HEPNRC, e.g. SLAC) retrieve data from Collecting sites via HTTP and provide reports & data over the WWW. Other labels: Archive, Cache]
Available Tools - Data Collection
• Collect data (timeping)
– HEPNRC rearchitected, developed & documented
– Deployed at 12 sites in 6 countries
• ARM, BNL, CERN, CMU, DoE/GMTN, HEPNRC/FNAL, INFN/CNAF, KEK, Hungary, RAL, SLAC, UMD
– DESY, IN2P3, TRIUMF, MSU, Beijing also expressed interest, plus commercial sites
• Data available (pingdata) in common format
– Data collected available from collection site via HTTP
– Allows data for specific times to be retrieved
Current Deployment
[Map: legend distinguishes ESnet sites, N. American sites and International sites (each monitored from SLAC) and Monitoring sites. Labelled sites: HEPNRC/FNAL, RAL, INFN/CNAF, CERN, RMKI/KFKI, BNL, KEK, CMU, UMD, SLAC, DESY]
Analysis / Archive Site
• Gathers & archives data
– HEPNRC gathers data from collection sites a few times daily
– Archives the data (200 Mbytes/month)
– Works with collection sites to resolve problems
– Provides Web access to archive data via form (ping_data.pl)
Access to Raw Data
Analysis / Archive Site
• Gathers & archives data
– HEPNRC gathers data from collection sites a few times daily
– Archives the data (200 Mbytes/month)
– Works with collection sites to resolve problems
– Provides Web access to archive data via form (ping_data.pl)
• Provides Web form to allow simple plotting (graph_pings.pl), uses SAS for speed
Form to Select Analysis Graphs
[Figure: analysis graph generated by HEPNRC @ Fermilab, 02SEP97]
Analysis Tools for Collection Sites
• Short-term analysis / reports
– Recent data (e.g. last 30 days cached)
• Web sortable table of latest measurements, colored for quality
Ping Loss Quality
Distributions for Host Groups
[Figure: bar chart of the share (y-axis labelled Percentile, 0% to 70%) of each loss-quality band for the host groups Esnet, ISP Local, International, NAmericaE and NAmericaW. Annotated (host-months, median loss) pairs: (76, 5.46), (183, 7.18), (150, 0.79), (199, 6.3), (188, 6.21)]
• <= 1% loss: Good
• > 1% & <= 5% loss: Acceptable
• > 5% & <= 12% loss: Poor
• > 12% & <= 25% loss: Bad
• > 25% loss: Unusable
• Similar to Internet Weather Report (<6%, <12%, > 12%)
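The quality bands in the chart legend (<= 1% Good, <= 5% Acceptable, <= 12% Poor, <= 25% Bad, above that Unusable) amount to a simple threshold lookup. A minimal Python sketch follows; the function name loss_quality is hypothetical and this is not the code behind the reports.

```python
def loss_quality(loss_pct):
    """Map a packet-loss percentage to the quality band from the legend."""
    if loss_pct < 0 or loss_pct > 100:
        raise ValueError("loss must be a percentage between 0 and 100")
    if loss_pct <= 1:
        return "Good"
    if loss_pct <= 5:
        return "Acceptable"
    if loss_pct <= 12:
        return "Poor"
    if loss_pct <= 25:
        return "Bad"
    return "Unusable"


# Median losses annotated on the chart (group assignment not recoverable here)
for median_loss in (0.79, 5.46, 6.21, 6.3, 7.18):
    print(median_loss, loss_quality(median_loss))
```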
Analysis Tools for Collection Sites
• Short-term analysis / reports
– Recent data (e.g. last 30 days cached)
• Web sortable table of latest measurements, colored for quality, with output (TSV) for Excel (connectivity.pl)
Latest Ping Measurements
Raw Data from last 24 Hours
Latest Ping Measurements
Ping Performance for Last 180 Days
Analysis Tools for Collection Sites
• Short-term analysis / reports
– Recent data (e.g. last 30 days cached)
• Web sortable table of latest measurements, colored for quality, with output (TSV) for Excel (connectivity.pl)
• Web form to select sites and time frames to be plotted (ping_data_plot.pl)
Request Plot of Collection Site Data
Plot from Collection Site
Tools in Development
• Re-engineering SLAC long term reports
– exception report
Exception Reports
Last 10 Weeks Ping Data
Nodename      Mean 10wks  Mean 1 wk  Stdev 10wks  # Std      Mean 10wks  Mean 1 wk
              % Loss      % Loss     Loss         From Mean  Avg (ms)    Avg (ms)
BNL.GOV       0.1         0.4        0.1          2.4        65.4        66.9
CALTECH.EDU   0.3         0.1        0.6          -0.3       22.9        22.7
CEBAF.GOV     0.1         0          0.2          -0.3       62.9        57.6
CERN.CH       4           5.7        1.5          1.2        242.9       236.9
Callouts on the screenshot:
• Color highlights extent of exception
• Click here to burrow down to more information
• Click to sort by column
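The slide does not give the formula behind the "# Std From Mean" column. One plausible reading, assumed here, is the one-week mean loss minus the ten-week mean loss, divided by the ten-week standard deviation. A small Python sketch under that assumption (the function name std_from_mean is hypothetical):

```python
def std_from_mean(mean_10wk, mean_1wk, stdev_10wk):
    """Distance of the latest week from the ten-week mean, in standard deviations.

    Assumed formula, not confirmed by the slide. Returns None when the
    ten-week standard deviation is zero, since the ratio is then undefined.
    """
    if stdev_10wk == 0:
        return None
    return (mean_1wk - mean_10wk) / stdev_10wk


# Rounded % loss figures as displayed in the table: (10 wk mean, 1 wk mean, stdev)
rows = {"CALTECH.EDU": (0.3, 0.1, 0.6), "CERN.CH": (4.0, 5.7, 1.5)}
for node, (m10, m1, sd) in rows.items():
    print(node, round(std_from_mean(m10, m1, sd), 2))
```

Because the table shows rounded inputs, values recomputed this way will not match the displayed column exactly.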
Tools in Development
• Re-engineering SLAC long term reports
– exception report
– last 180 days
180 Days SLAC - Stanford
[Figure: ping performance between SLAC and Stanford, Feb-97 to Aug-97. Annotations, in extraction order: Uwave & routing problems; Direct connect; 20 ms; 5.5 ms; Loss < 1%; Via ESnet; Loss 3-6%; 30 ms]
Tools in Development
• Re-engineering SLAC long term reports
– exception report
– last 180 days
– monthly points going back for years in tabular form with quality coloring, sorting & hyperlinks
• Loss (by site, and by group of sites)
• Response (by site, and by group of sites)
• Reachability (by site, and by group of sites)
• % time network “Quiescent” or “Busy”
Ping Loss History
TSV Output to Excel for Further Analysis
Ping Response by Group
[Figure: Median Monthly Prime Time Ping Response Time from Jan-95 thru Jul-97, seen from SLAC. y-axis: median ping round trip response time for 100 byte pings (0 to 450); x-axis: month. Series: Esnet, N America W, N America E, International, each with an exponential trend line]
Annotations:
• Main cause of apparent increase in Esnet response is adding MIT to monitoring
• Big contribution to International from 2 sites (CN and SU); without these 2 sites median response time is closer to 300 ms
• Added ihep.cn to monitoring
• Big improvement in response time to Dresden
• DESY got bad
Prime-time Packet Loss by Group
[Figure: Monthly Median Prime Time Ping Packet Loss by Group, Jan-95 thru Jul-97, seen from SLAC. y-axis: median monthly ping packet loss (0 to 20); x-axis: month. Series: Esnet, N. America W, N. America E, International, each with an exponential trend line]
“Quiescent” Frequency by Group
[Figure: Monthly Median Frequency for No Packet Loss by Group from Jan-95 thru Jul-96, seen from SLAC. y-axis: median monthly percentage frequency of no packet loss (50 to 100); x-axis: month, with ticks from Jan-95 to Jul-97. Series: Esnet, N America W, N America E, International]
Annotations:
• Colorado & US NW bad
• Carleton, McGill & Virginia Tech bad
• Dresden, ETH & RAL bad
• Esnet peers with vBNS
International Site “Busy” Frequency
[Figure: Monthly International Site Frequency of Non-zero loss (out of 10 pings) from Jan-9 thru Jul-97, from SLAC. y-axis: % frequency of non-zero pings (0 to 90). Series: CERN.CH, ETHZ.CH, IN2P3.FR, DESY.DE, PHY.TU-DRESDE, FZU.CZ, GE.INFN.IT, LNF.INFN.IT, NA.INFN.IT, PD.INFN.IT, ROMA1.INFN.IT, TS.INFN., RL.AC.UK, KEK.JP, IHEP.AC.CN, INP.NSK.SU, with exponential trend lines for RL.AC.UK, ROMA1.INFN.IT, CERN.CH and DESY.DE]
Annotations:
• RL.UK
• UK - US link upgraded
• Italian nodes track & look good
• CERN & IN2P3 track
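The two frequency charts count how often a measurement sample shows no packet loss (“Quiescent”) and how often it shows non-zero loss out of its 10 pings (“Busy”). A minimal Python sketch of that tally follows; the function name quiescent_busy_pct and the input layout (one lost-ping count per sample) are assumptions, and the monthly median across sites used in the group chart is not reproduced.

```python
def quiescent_busy_pct(lost_per_sample):
    """Return (quiescent %, busy %) from per-sample counts of lost pings.

    A sample is quiescent when it lost no pings and busy otherwise.
    """
    if not lost_per_sample:
        raise ValueError("no samples")
    total = len(lost_per_sample)
    busy = sum(1 for lost in lost_per_sample if lost > 0)
    return 100.0 * (total - busy) / total, 100.0 * busy / total


# Eight samples of 10 pings each; two of them lost at least one ping
samples = [0, 0, 1, 0, 3, 0, 0, 0]
quiescent, busy = quiescent_busy_pct(samples)
print(f"quiescent {quiescent:.1f}%  busy {busy:.1f}%")
```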
Tools in Development
• Re-engineering SLAC long term reports
– exception report
– last 180 days
– monthly points going back for years in tabular form with quality coloring, sorting & hyperlinks
• Loss (by site, and by group of sites)
• Response (by site, and by group of sites)
• Reachability (by site, and by group of sites)
• % time network “Quiescent” or “Busy”
• Ten Worst links in HEP
Ten Worst HEP Links
Ranked by % Packets Lost
Source                Destination                Ping  % of Time    % of Packets  Average RT  Standard Deviation
                                                 Size  Unreachable  Lost          Delay (ms)  of Avg RT Delay
sgiserv.rmki.kfki.hu  www.jinr.dubna.su          100   3.1          40.4          1305        576
sgiserv.rmki.kfki.hu  fnal.fnal.gov              100   0.5          39.8          670         173
sgiserv.rmki.kfki.hu  unixhub.slac.stanford.edu  100   1.0          38.4          710         173
sgiserv.rmki.kfki.hu  hepnrc.hep.net             100   1.6          38.1          677         178
sgiserv.rmki.kfki.hu  www.hep.anl.gov            100   0.5          37.9          670         170
sgiserv.rmki.kfki.hu  www.slac.stanford.edu      100   1.0          36.2          717         166
sgiserv.rmki.kfki.hu  w4.lns.cornell.edu         100   6.8          35.2          658         173
dxcnaf.cnaf.infn.it   www.jinr.dubna.su          100   0.6          33.5          933         512
hepnrc.hep.net        www.phys.s.u-tokyo.ac.jp   100   20           30.8          242         36.7
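Ranking links by % packets lost is a sort on one column. A short Python sketch using three rows from the table; the LinkStats type and worst_links function are hypothetical names, not part of the reporting tools.

```python
from typing import NamedTuple


class LinkStats(NamedTuple):
    source: str
    destination: str
    pct_packets_lost: float


def worst_links(links, count=10):
    """Return up to `count` links, highest packet loss first."""
    ranked = sorted(links, key=lambda link: link.pct_packets_lost, reverse=True)
    return ranked[:count]


links = [
    LinkStats("hepnrc.hep.net", "www.phys.s.u-tokyo.ac.jp", 30.8),
    LinkStats("sgiserv.rmki.kfki.hu", "www.jinr.dubna.su", 40.4),
    LinkStats("dxcnaf.cnaf.infn.it", "www.jinr.dubna.su", 33.5),
]
for link in worst_links(links):
    print(link.source, link.destination, link.pct_packets_lost)
```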
What are Typical Uses
• Setting Expectations
• Service Level Contract
• Choosing ISPs
• Identifying problems, and verifying solutions
• Planning for upgrades
Summary to Help Choose Upgrades
Prime Time Packet Loss Jun-Aug 97
[Figure: Jun-Aug 1997 prime time ping packet loss between SLAC & 60 sites. Horizontal bars per site; x-axis: 100 byte % ping packet loss (0 to 80); y-axis: site within group ranked by ping packet loss. Groups: N. America W, N. America E, International, ESnet. Series: Jun-97, Jul-97, Aug-97]
Site labels (some truncated in the chart): PHYSICS.UCLA.EDU, SLAC.STANFORD.ED, BNL.GOV, CALTECH.EDU, FNAL.GOV, NA.INFN.IT, TS.INFN., DESY.DE, GE.INFN.IT, NIC.ES.NET, LLNL.GOV, LBL.GOV, UTEXAS.EDU, UTDALLAS.EDU, PHYSICS.YALE., NEVIS.COLUMBIA, PRINCETON.EDU, PHYSICS.WISC, NSCP.UMD.EDU, PHYSICS.PURDUE, UCHICAGO.EDU, LNS.CORNELL.ED, PHYSICS.LSA.U, PHYSICS.UPENN, HEP.UIUC.EDU, HEP.UMN.EDU, PHY.OLEMISS.E, PHYS.VT.EDU, PHYSICS.CARLETON, PHA.JHU.EDU, PVAMU.EDU, PHY.DUKE.EDU, ORNL.GOV, MIT.EDU, CEBAF.GOV, KEK.JP, PHY.TU-DRESDE, LNF.INFN.IT, INP.NSK.SU, RL.AC.UK, ETHZ.CH, ROMA1.INFN.IT, CERN.CH, FZU.CZ, PD.INFN.IT, PHYSICS.MCGILL.C, PAS.ROCHESTE, UTK.EDU, HARVARD.EDU, UCDAVIS.EDU, UCSC.EDU, ARM.GOV, COLORADO.EDU, UOREGON.EDU, PHYSICS.UCSB.E, WASHINGTON.EDU, PS.UCI.EDU, TRIUMF.CA, UCSD.EDU, STANFORD.EDU, PHS.UC.EDU, IN2P3.FR, IHEP.AC.CN, PHYSICS.COLOS
Coordination etc.
• XIWT/IPWT Interest/deployment
XIWT/IPWT interest
• Austin meeting in Sep-97
– available tools presented by developers: IWR, CAIDA/NLANR, Intel, Auto Industry/Bellcore, IETF/IPPM Surveyor …
• XIWT/IPWT want to:
– Measure performance of members' own networks
– Get tests to validate and understand what to recommend to other commercial customers and for what purposes
– Build a community within XIWT so that it can evolve to address harder issues
• Selected our tools to initially deploy at 6 sites
– includes Intel, SBC, HAI, BellSouth, CNRI, NIST
Coordination etc.
• XIWT/IPWT Interest/deployment
• MICS funded joint SLAC/LBL proposal on Internet End-to-end performance monitoring for 1 year
• LBL/NIMI project
NIMI (1)
• NIMI = National Internet Measurement Infrastructure, a collaboration of LBL/PSC (V. Paxson, M. Mathis, J. Mahdavi).
• It is a software suite (not hardware). Deploy on “measurement hosts” around the Internet for black box infrastructure measurements.
• Ready for deployment Nov-97. Perl daemon with treno, Poisson packet generation for loss & delays.
• Hooks for other tools such as pathchar, tcpanaly.
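Poisson packet generation means sending probes at times whose gaps are drawn from an exponential distribution, so that probes do not lock onto periodic network behaviour. The Python sketch below illustrates the idea only; it is not NIMI code, and the function name poisson_send_times and its parameters are hypothetical.

```python
import random


def poisson_send_times(rate_per_s, duration_s, seed=None):
    """Return probe send times (seconds from start) forming a Poisson process.

    Gaps between consecutive probes are exponentially distributed with
    mean 1 / rate_per_s; times at or beyond duration_s are dropped.
    """
    if rate_per_s <= 0:
        raise ValueError("rate_per_s must be positive")
    rng = random.Random(seed)  # seeded generator keeps the schedule repeatable
    times = []
    t = rng.expovariate(rate_per_s)
    while t < duration_s:
        times.append(t)
        t += rng.expovariate(rate_per_s)
    return times


# On average one probe every 2 seconds, for one minute
schedule = poisson_send_times(rate_per_s=0.5, duration_s=60, seed=1997)
print(len(schedule), "probes scheduled")
```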
NIMI (2)
• Challenges: accurate clock synchronization (one way measurements), scaling to millions of nimids (NB end-to-end measurement strategies are usually not cost free, some things may be over-measured), data retrieval, new measurement strategies.
• There is no central management.
• Both HEPNRC & SLAC plan to install NIMI hosts (PCs running FreeBSD) at their sites
Coordination etc.
• XIWT/IPWT interest/deployment
• MICS funded joint SLAC/LBL proposal on Internet End-to-end performance monitoring for 1 year
• LBL/NIMI project
• Proposed joint work with NLANR to extend Mapnet Java tools to view our data
NLANR Mapnet Tool
• Java Applet
• Zoom & pan
• Select ISPs
• Color:
– ISP
– bandwidth
• Mouse over
– link details
– node details
Maproute (from NLANR)
• Shapes show function
– router at NAP, at transit backbone, at ISP
• Color shows variance of transit time
• Meshes of paths to destination show flaps
• Can zoom in to get site information etc.
Coordination etc.
• XIWT/IPWT interest/deployment
• MICS funded joint SLAC/LBL proposal on Internet End-to-end performance monitoring for 1 year
• LBL/NIMI project
• Proposed joint work with NLANR to extend Mapnet Java tools to view our data
• Will submit paper to IETF for this December
• Surveyor installation proposed at ESnet sites
Surveyor
• PC Hardware with GPS located at ANS & 23 CSG partner sites
• Measure one way loss & response time using clock synchronization, metrics defined by IETF/IPPM
• 8 sites now operational, monitor 56 paths ((N-1)*N)
• Results show can have big asymmetries (asymmetric loading & routing)
• Willing to deploy (at their cost) at 5 DOE sites
• For more see http://www.advanced.org/csg-ippm/
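The (N-1)*N path count follows from measuring each direction separately: every ordered pair of distinct sites is its own one-way path. A quick Python check, using placeholder site names rather than the real Surveyor sites:

```python
from itertools import permutations


def one_way_paths(sites):
    """Return every ordered (source, destination) pair of distinct sites."""
    return list(permutations(sites, 2))


sites = [f"site{i}" for i in range(1, 9)]  # 8 placeholder site names
paths = one_way_paths(sites)
print(len(paths))  # 8 * 7 = 56
```

Both (site1, site2) and (site2, site1) appear in the result, which is what lets one-way measurements expose asymmetric loss and delay.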
Asymmetric One-way Delays
[Figure: two panels, Advanced to U Chicago and U Chicago to Advanced, each showing loss (0% to 20%) and delay (0 ms to 300 ms) on an x-axis running from 0 to 24]
Next Steps
• Longer term reports (10 week exceptions, 180 days, monthly going back forever)
• Provide monthly summary tables with lots of statistical measures to allow faster generation of long term reports, and more robust metrics
• Extend grouping, e.g. by AS, country, time zones crossed, more geographic regions, user selectable, by experiment, by community, by collection site
• Summaries (c.f. Weather Map, top 10s, weekly, Consumer Reports)
• NIMI/Surveyor install, NLANR tools, help XIWT
Summary
• 12 sites, 6 countries collecting data on > 400 links
• Need care selecting remote sites
• Deployment of data collection went well
• Collection sites easy to maintain after initial install
• Biggest effort at the moment (> 1 FTE) is in:
– Tool definition & development
– Data gathering & archiving (looking after pathologies)
• Gearing up to extend SAS tools and attendant scripts
• Lot of interest & collaboration outside ESnet
To Join
• Collection site needs:
– perl5 & HTTP server
– install timeping & pingdata (need only cgi-bin access, not root)
– Decide on links to monitor
– Get an analysis site to retrieve & generate graphs, or at least get connectivity.pl & ping_data_plot.pl
• Need volunteers to work on analysis scripts, some of it will require SAS; also need Java applets to visualize
More Information
• Monitoring WG home page (includes links to the status report, meeting notes, how to access data, and get & install code etc.)
– http://www.slac.stanford.edu/xorg/icfa/ntf/home.html
• WAN Monitoring at SLAC has lots of links
– http://www.slac.stanford.edu/comp/net/wan-mon.html
• Tutorial on WAN Monitoring
– http://www.slac.stanford.edu/comp/net/wan-mon/tutorial.html