3/4/98 \\pcbackup\users\cottrell\icfa\icfa-mar98.ppt
Internet Monitoring - Results
Les Cottrell SLAC
Presented at the ICFA Meeting, CERN, Mar 1998
Partially funded by MICS joint SLAC/LBL proposal on Internet End-to-end Performance Monitoring (IEPM)
Outline of Talk
• What, why & how are we (ESnet/HENP community) measuring?
• What PingER measurement reports are available and what do they show
– (short), intermediate & long term
– grouping and multi-site visualization
• Traffic volume & Traceroute measurements
• Summary
– Deployment/development, Internet Performance, Next Steps
– Collaborations
– NIMI/IPWT
Why go to the effort?
• Apparent quality of Internet getting worse as size and demands increase
• Internet woefully under-measured & under-instrumented
• Internet very diverse - no single path typical
• Users need:
– realistic expectations, planning information
– guidelines for setting and validating SLAs
– information to help in identifying problems
– help to decide where to apply resources
Importance of Response Time
• Time is the scarcest and most valuable commodity
– Studies in the late 70's and early 80's showed the economic value of Rapid Response Time
• 0-0.4s High productivity interactive response
• 0.4-2s Fully interactive regime
• 2-12s Sporadically interactive regime
• 12s-600s Break in contact regime
• >600s Batch regime
– Threshold around 4-5s, above which complaints increase rapidly
– Voice has threshold around 100ms
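The regimes listed above can be turned into a simple lookup. This is a minimal sketch, not part of the original talk: the function name is hypothetical, and the slide does not say which regime a value exactly on a boundary (e.g. 0.4 s) belongs to, so assigning it to the lower regime is an assumption.

```python
def response_regime(seconds: float) -> str:
    """Map a response time in seconds to the regime named on the slide.

    Assumption: a value exactly on a boundary falls in the lower regime.
    """
    if seconds < 0:
        raise ValueError("response time cannot be negative")
    if seconds <= 0.4:
        return "High productivity interactive response"
    if seconds <= 2:
        return "Fully interactive regime"
    if seconds <= 12:
        return "Sporadically interactive regime"
    if seconds <= 600:
        return "Break in contact regime"
    return "Batch regime"


# Illustrative response times only, not measured data.
for t in (0.1, 1.0, 5.0, 60.0, 900.0):
    print(f"{t:7.1f} s -> {response_regime(t)}")
```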
Perception of Poor Packet Loss
• Above 4-6% packet loss, video conferencing becomes irritating and non-native language speakers become unable to communicate.
• Long delays of 4 seconds or more, occurring at a frequency of 4-5% or more, are also irritating for interactive activities such as telnet and X windows.
• Above 10-12% packet loss there is an unacceptable level of back-to-back packet loss and extremely long timeouts, connections start to get broken, and video conferencing is unusable.
Our Main Metric is Ping
• “Universally available”, easy to understand
– no software for clients to install
• Low network impact
• Provides useful real world measures of loss, response time, reachability, unpredictability
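As a rough illustration of how loss, response time and reachability can be derived from one run of pings, here is a minimal sketch. The function name and the sample round-trip times are hypothetical, and a lost packet is represented by None. Unpredictability is left out because the slide does not define how it is computed.

```python
from statistics import median
from typing import Optional, Sequence


def summarize_pings(rtts_ms: Sequence[Optional[float]]) -> dict:
    """Summarize one ping run; None marks a packet that got no reply."""
    sent = len(rtts_ms)
    if sent == 0:
        raise ValueError("need at least one ping sample")
    replies = [r for r in rtts_ms if r is not None]
    return {
        "sent": sent,
        "received": len(replies),
        "loss_pct": 100.0 * (sent - len(replies)) / sent,
        "reachable": bool(replies),  # at least one reply came back
        "min_rtt_ms": min(replies) if replies else None,
        "median_rtt_ms": median(replies) if replies else None,
    }


# Made-up round-trip times in ms for ten pings, two of them lost.
sample = [182.0, 175.5, None, 190.2, 176.1, None, 181.3, 177.0, 179.4, 250.8]
summary = summarize_pings(sample)
print(summary)
```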
Ping Response vs Web Response 1/2
[Figure: HTTP GET Response (ms), 0 to 2000, vs. Minimum Ping Response (ms), 0 to 500, with lines y = 2x, y = 1.71x + 720 and y = 2.57x]
Ping Response vs Web Response 2/2
Residuals for Min HTTP-GET = 2 * Min Ping
[Figure: histogram of residuals (ms), -400 to 1000, showing Frequency (0 to 200) and Cumulative % (0% to 100%)]
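The residuals for the model Min HTTP-GET = 2 * Min Ping are simply the observed minimum HTTP GET time minus twice the minimum ping time. A minimal sketch, using made-up measurement pairs rather than the data behind the plot:

```python
def residuals_ms(pairs, slope=2.0):
    """Residual = observed min HTTP GET time - slope * min ping time (both in ms)."""
    return [http_ms - slope * ping_ms for ping_ms, http_ms in pairs]


# Hypothetical (min ping ms, min HTTP GET ms) pairs, for illustration only.
pairs = [(50.0, 110.0), (120.0, 230.0), (200.0, 460.0), (310.0, 600.0)]
res = residuals_ms(pairs)
print(res)
```

A positive residual means the web request took longer than the 2 * ping model predicts; a negative one means it was faster.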
Ranked packet loss for 3 months
[Figure: ranked packet loss curves labelled Rome, UK, Stanford and Cincinnati]
Sawtooth Effect
Packet Loss for UK sites Jan-95 thru Jan-98 seen from SLAC
[Figure: % prime time packet loss (0 to 40) by month, Jan-95 thru Jan-98, for gla.ac.uk and rl.ac.uk. Annotations: Added 45 Mbps (quadrupled capacity); 2 * capacity (+ 2 Mbps); 3 * capacity; + 9 Mbps; Holidays]
RAL Last 180 Days plot
Lines are simply cubic spline fits to aid the eye
Upper green and black points are response time in ms
Red & blue are weekday loss
Cyan are weekend loss
Note weekend/weekday differences (cyan vs blue)
Note Xmas/New Year lull
Also note quick onset of saturation at end August & September
Italian sites look similar to each other
Packet Loss for Italian sites Jan-95 thru Jan-98 seen from SLAC
[Figure: % prime time packet loss (0 to 40) by month, Jan-95 thru Jan-98. Legend: ge.infn.it, lnf.infn.it, na.infn.it, pd.infn.it, roma1.infn.it, ts.infn.it, cern.ch, desy.de, ethz.ch, fzu.cz, gla.ac.uk, ihep.ac.cn, inp.nsk.su, kek.jp, phy.tu-dresden.de, rl.ac.uk, rmki.kfki.hu]
Representative International HENP Site Loss Jan-95 thru Nov-97
• Note RL (UK) loss saw-tooths as UK-US bandwidth is added (Apr-96, Feb-97, Aug-97)
Aggregation
• Group measurements, for example:
– by area (e.g. N. America E, N. America W, W. Europe/Japan, others, by country)
– trans-oceanic links, intercontinental links
– separation e.g. number of hops, time zones crossed, IXPs crossed
– ISP (ESnet, vBNS/I2, ...)
– by monitoring site
– one site seen from multiple sites
– common interest/affiliation (XIWT, HENP …)
– user selectable
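Grouping amounts to picking an attribute of each monitored link and summarizing the links that share a value. The sketch below is only an illustration: the record layout, site names and loss values are invented, and the median is used as the summary because the later slides report medians.

```python
from collections import defaultdict
from statistics import median

# Hypothetical per-link monthly loss records; names and values are made up.
links = [
    {"monitor": "SLAC", "remote": "site-a", "area": "W. Europe", "loss_pct": 6.0},
    {"monitor": "SLAC", "remote": "site-b", "area": "W. Europe", "loss_pct": 10.0},
    {"monitor": "CERN", "remote": "site-c", "area": "Japan", "loss_pct": 0.5},
    {"monitor": "CERN", "remote": "site-d", "area": "W. Europe", "loss_pct": 2.0},
    {"monitor": "SLAC", "remote": "site-e", "area": "Japan", "loss_pct": 1.5},
]


def median_loss_by(records, key):
    """Group records by the given attribute and return the median loss per group."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec["loss_pct"])
    return {name: median(values) for name, values in groups.items()}


by_area = median_loss_by(links, "area")        # group by area
by_monitor = median_loss_by(links, "monitor")  # group by monitoring site
print(by_area)
print(by_monitor)
```

The same function covers the other groupings on the slide (ISP, affiliation, and so on) as long as each record carries the corresponding attribute.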
Group Selection (all sites monitoring CERN)
[Figure labels: CMU, CMU, CNAF, RL, FNAL, SLAC, DESY, DESY, Carleton, RMKI, RMKI, CERN, KEK]
Select one of these groups
Group Response Time Jan-95 thru Nov-97
• Improved between 1 and 2.5% / month
• Response & Loss similar improvements
– care with new sites
Network Quiescence
• Frequency of zero packet loss (for all time - not cut on prime time)
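The quiescence measure described here is the share of samples in which no packets were lost. A minimal sketch, with a hypothetical function name and made-up per-sample loss percentages:

```python
def zero_loss_frequency(loss_pcts):
    """Percentage of samples in which no packets were lost."""
    if not loss_pcts:
        raise ValueError("need at least one sample")
    zero = sum(1 for loss in loss_pcts if loss == 0)
    return 100.0 * zero / len(loss_pcts)


# Made-up loss percentages for eight samples, covering all hours of the day.
samples = [0, 0, 10, 0, 20, 0, 0, 30]
quiescence = zero_loss_frequency(samples)
print(quiescence)
```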
Ping Loss Quality
• Want quick to grasp indicator of link quality
• Loss is the most sensitive indicator
– loss of packet requires ~ 4 sec TCP retry timeout
– Studies on economic value of response time by IBM showed there is a threshold around 4-5 secs where complaints increase.
– 0-1% = Good, 1-2.5% = Acceptable
– 2.5%-5% = Poor, 5%-12% = Very Poor
– > 12% = Bad
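The quality bands above map directly onto a small classifier. This is a sketch only: the function name is hypothetical, and because the slide's ranges share their end points (1%, 2.5%, 5%), treating a boundary value as belonging to the better band is an assumption.

```python
def loss_quality(loss_pct: float) -> str:
    """Classify a packet loss percentage using the bands from the slide.

    Assumption: a value exactly on a boundary gets the better label.
    """
    if not 0 <= loss_pct <= 100:
        raise ValueError("loss must be between 0 and 100 percent")
    if loss_pct <= 1:
        return "Good"
    if loss_pct <= 2.5:
        return "Acceptable"
    if loss_pct <= 5:
        return "Poor"
    if loss_pct <= 12:
        return "Very Poor"
    return "Bad"


# Illustrative loss values only.
for loss in (0.3, 2.0, 4.0, 8.0, 20.0):
    print(f"{loss:5.1f}% -> {loss_quality(loss)}")
```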
Quality Distributions
• ESnet median good quality
• All other groups poor or very poor
• Critical to have good peering
Multi Collection Site Visualization
[Figure: grid with axes Remote Sites and Collection Sites]
Intercontinental Grouping (Loss)
• Move mouse over ? to see # links
• Looks pretty bad for intercontinental use
Top Level Domain Grouping (Loss)
• Mouseover red dots gives more information on TLD (e.g. ch=Switzerland)
• Diagonals are within TLD
TLD (Response Time)
Grouping Details
Select group
Select metric
Sort
Color for quality
Also provides Excel for DIY at bottom
Recent Transoceanic trends
Median Transoceanic Packet Loss for 172 Links seen from 14 Sites
[Figure: % packet loss (0 to 16) by month, Sep-97 thru Feb-98, showing 25%, Median and 75% curves with an exponential fit to the median]
9.5% improvement/month
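An exponential trend such as the one fitted to the median can be estimated by a least-squares line through the logarithm of the monthly values. The sketch below is an assumption about how a per-month improvement figure could be derived; the talk does not say exactly how its figure was computed. The series is synthetic, not the plotted data.

```python
import math


def monthly_improvement_pct(values):
    """Fit ln(value) against month index by least squares.

    Returns the fitted percentage decrease per month (negative if increasing).
    """
    n = len(values)
    if n < 2 or any(v <= 0 for v in values):
        raise ValueError("need at least two positive values")
    xs = range(n)
    ys = [math.log(v) for v in values]
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    return 100.0 * (1.0 - math.exp(slope))


# Synthetic series that falls by exactly 10% each month, for illustration only.
series = [10.0 * 0.9 ** month for month in range(6)]
improvement = monthly_improvement_pct(series)
print(round(improvement, 2))
```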
By Monitoring Site
HENP Median Monthly Packet Loss for 163 links seen from 7 Monitoring Sites
[Figure: median monthly packet loss (0 to 16) by month, Sep-97 thru Feb-98. Legend: INFN/CNAF Median, HEPNRC/FNAL Median, SLAC Median, DESY Median, RMKI/KFKI Median, CERN Median, KEK Median]
Site   #    %/month
INFN   14   26%
FNAL   46   1.5%
SLAC   18   15%
DESY   13   -12%
RMKI   5
CERN   61   17%
KEK    6
CERN Monitoring TLDs
HENP Packet loss seen from CERN for 4 main groups of TLDs
[Figure: % packet loss (0.0 to 10.0) by month, Sep-97 thru Feb-98. Legend: JP, ESnet, EDU, CA]
#    % Improvement
4    16%
7    43%
39   8.5%
5    4.2%
ESnet bytes accepted by site for Jan ‘98
[Figure labels: Exchanges, LBL/ESnet]
US HENP Traffic Growth
Exponential growth of 3-6% per month
Monthly ESnet Bytes Accepted from Major US HENP Labs Jun-90 thru Jan-98
[Figure: bytes accepted/month on a log scale (1E+9 to 1E+13), Jun-90 thru Oct-97, for anl, bnl, cebaf, fnal, lbl and slac, each with an exponential fit]
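To put the monthly growth rates in perspective, a constant percentage growth per month can be converted into a doubling time with ln 2 / ln(1 + rate). The sketch below applies that standard formula to the 3% and 6% end points, which come out at a little under two years and about one year respectively. The function name is hypothetical.

```python
import math


def doubling_time_months(monthly_growth_pct: float) -> float:
    """Months needed for traffic to double at a constant monthly growth rate."""
    if monthly_growth_pct <= 0:
        raise ValueError("growth rate must be positive")
    return math.log(2) / math.log(1 + monthly_growth_pct / 100.0)


for rate in (3.0, 6.0):
    months = doubling_time_months(rate)
    print(f"{rate}%/month -> doubles in {months:.1f} months")
```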
CERN-US E1(2Mbps) link
Added 2nd 2Mbps link
Multi Router Traffic Grapher (MRTG)
Traffic Volume for Germany (DFN)
[Figure: DFN T1 Utilization for 15 Jan ‘98 (5 min averages). Green = to US, Blue = from US]
[Figure: ESnet Serial Traffic Peaks -- December 1997 (created Jan 15, 1998), Line Speed = 15360000 bits/sec. Number of Peaks vs Percent of Utilization: # of 2 min periods in Dec-96 with peak utilization > y %, with # Samples (0 to 25,000) against utilization (0% to 100%). Series: cebaf1-dfn2 and dfn2-cebaf1 at 2 min, 10 min and 60 min, labelled From US and To US]
Capacity/Load Ratios
• Looking at the link capacity/average load
– Most ESnet links show ratios of a few to several tens
– The international links (CERN-Perryman (~4), DFN (~5), Italy (~4), KEK (~10), Canada (15)) show ratios of 4-15
– The worst link appears to be the MAE-W-ESnet link at about 1.5 ratio
• However this may not be the bottleneck link
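The ratio discussed here is the link capacity divided by the average load, so a ratio of 4 corresponds to 25% average utilization. A minimal sketch with a hypothetical function name and invented traffic figures:

```python
def capacity_load_ratio(capacity_bps: float, average_load_bps: float) -> float:
    """Ratio of link capacity to average carried load (both in bits/sec)."""
    if capacity_bps <= 0 or average_load_bps <= 0:
        raise ValueError("capacity and load must be positive")
    return capacity_bps / average_load_bps


# Hypothetical example: a 2 Mbps link carrying 0.5 Mbps on average.
ratio = capacity_load_ratio(2_000_000, 500_000)
average_utilization_pct = 100.0 / ratio
print(ratio, average_utilization_pct)
```

Because the load is an average, a link with a comfortable ratio can still saturate at peak times.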
Bottlenecks
• Identification
– Traceroute
• from/to multiple sites can identify common path segments in the maps
• Can see onset of losses with traceping
– Pathchar can identify bottlenecks
• Then need to work on:
– avoiding bottlenecks (new peering)
– getting bottleneck owners to improve
– this is difficult, lots of potential bottlenecks, bottlenecks move, not under our control
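One way to find common path segments is to break each traceroute into hop-to-hop pairs and count how many routes share each pair. This is a sketch of that idea only, not the method used by the tools named in the talk. The hop names are invented placeholders, and real traceroute output would first need parsing into hop lists.

```python
from collections import Counter


def shared_segments(routes, min_routes=2):
    """Count how many routes traverse each hop-to-hop segment.

    Returns only the segments seen on at least min_routes routes.
    """
    counts = Counter()
    for hops in routes.values():
        # Use a set so a segment is counted once per route, even if repeated.
        counts.update(set(zip(hops, hops[1:])))
    return {seg: n for seg, n in counts.items() if n >= min_routes}


# Hypothetical hop lists from three monitoring sites to the same target.
routes = {
    "monitor-1": ["m1-gw", "isp-a", "exchange-x", "isp-b", "target-gw"],
    "monitor-2": ["m2-gw", "isp-c", "exchange-x", "isp-b", "target-gw"],
    "monitor-3": ["m3-gw", "isp-a", "exchange-x", "isp-b", "target-gw"],
}
common = shared_segments(routes)
for segment, n in sorted(common.items(), key=lambda item: -item[1]):
    print(segment, n)
```

Segments shared by every route are the natural first candidates to check when all monitoring sites see loss to the same target.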
TracePing (Oxford)
[Figure annotation: Multiple routes seen]
Traceroute
• Reverse traceroute servers
• Traceping
• TopologyMap
– Ellipses show node on route
– Open ellipse is measurement node
– Blue ellipse not reachable
• Keeping history
From TRIUMF
GUI Traceroute (e.g. VisualRoute)
Summary
• Deployment/Development
– ESnet/HENP has 14 Collection sites in 8 countries collecting data on > 500 links involving 22 countries
– XIWT/IPWT deployed ~ 10 collection sites using PingER tools
– 600MB/month/link, 6 bps/link, .25 FTE @ analysis site, 1.5-2.5 FTE on analysis
– HEPNRC gathering, archiving
– Long term reports being ported to HEPNRC from SLAC
– Long term analysis today usually requires tool like SAS
Summary
• Deployment/Development
• Internet Performance
– Performance within ESnet is good
– Performance between ESnet & other sites is poor to very poor on average
• one of main causes is congestion points, so peering is critical
– Intercontinental performance is very poor to bad
– ESnet traffic accepted from major HENP labs growing by 3-6% per month
– Response time improving by 1-2% / month
– Packet loss improving between SLAC & other sites by 3% / month
Summary
• Deployment/Development
• Internet Performance (continued):
– Links to sites outside N. America vary from good (KEK) to bad
– Some of the bad sites are to be expected, e.g. FSU, China, Czech Republic, some surprises such as UK
– CERN, France, Germany acceptable to poor
Summary
• Deployment/Development
• Internet Performance
• Next Steps
– Improve tools:
– Make long term reports at Analysis site available & understandable
– Look into prediction (extrapolations, develop models, configure and validate with data)
– Pursue IETF Surveyor & NIMI deployment
National Internet Measurement Infrastructure (NIMI)
• Secure, scalable infrastructure for scheduling monitoring, gathering data
• Minimal amount of human intervention
• Inexpensive probe built on PC FreeBSD platform
• Dynamic - can add/modify measurement suites, initially includes:
– Traceroute
– TReno - measures bulk transfer thruput
– Poip - one way ping
Asymmetric One-way Delays
[Figure: Loss (0% to 20%) and Delay (0ms to 300ms) panels, x-axis 0 to 24, for Advanced to U Chicago and U Chicago to Advanced]
NIMI
• Deployed at PSC, LBL, FNAL, platforms being configured at SLAC & CERN
• As NIMI becomes more real will start to use as infrastructure for IPPM Surveyors
• Security
– allows full policy control over any box you own or delegation of all or subsets
– uses ACLs with authentication for requests, and encryption to prevent sniffing
Summary
• Deployment/Development
• Internet Performance
• Next Steps
• Lots of collaboration:
– SLAC & HEPNRC
– 14 collection sites, ~ 400 remote sites
– Collection site tools CERN & CNAF/ICFA
– Oxford/TracePing
– MapPing/MAPNet/NLANR
– TRIUMF Traceroute topology Map
– NIMI/LBNL & Surveyor/IETF
– XIWT/IPWT
– Talks at IETF, XIWT, ICFA, ESCC ...
More Information
• ICFA Monitoring WG home page (links to status report, meeting notes, how to access data, and code)
– http://www.slac.stanford.edu/xorg/icfa/ntf/home.html
• WAN Monitoring at SLAC has lots of links
– http://www.slac.stanford.edu/comp/net/wan-mon.html
• Tutorial on WAN Monitoring
– http://www.slac.stanford.edu/comp/net/wan-mon/tutorial.html
• MapPing Tool:
– http://www.slac.stanford.edu/~warrenm/work/java/newjava/mapping.html
• NIMI
– http://www.psc.edu/~mahdavi/nimi_paper/NIMI.html