PIPE Dreams
Trouble Shooting Network Performance for Production Science Data Grids
Presented by Warren Matthews at CHEP’03, San Diego March 24-28, 2003
Abstract
The vision of science grids allocating resources to analyze huge quantities of HENP data clearly depends on reliable network performance. Tools developed at SLAC in conjunction with the Internet2 PIPES project will help to ensure this. In this talk, these tools will be discussed, and the procedure for publishing performance data, in particular using the Globus toolkit's MDS and web services, will be reviewed. The subsequent analysis and trouble-shooting methodology will be discussed with real-world examples from the Particle Physics Data Grid (PPDG) and the European Data Grid (EDG).
Overview
• What is the problem?
• What is PIPES?
• Network performance monitoring
• Problem identification
Network Monitoring for the Grid
[Diagram: requestor, Resource Broker, three Farms and three Data stores, connected by The Network]
• The Data Grid consists of many components that must interoperate
Allocate Resources
[Diagram: requestors, Resource Broker, three Farms and three Data stores, connected by The Network]
• The resource broker must be fully informed
• Measurement is required!
[Diagram annotations on the path to the requestor: 12% packet loss; OC48 at 80% utilization]
What is PIPES?
• Internet2
• End-to-end performance initiative
• PI Performance Evaluation System (PIPES)
• PIPES Monitoring Platform (PMP)
• Overlap with goals of HENP
• Tremendous resources
IEPM-BW
• Package developed at SLAC
  – Measurement Engine
    • Iperf, bbftp, bbcp, ping, traceroute
    • Abwe, owamp, udpmon, gridftp
  – Job Manager
  – Data Storage and data server
  – Analysis Engine
[Map: monitoring sites and monitored hosts. Networks shown: ESnet, Abilene, CalREN, JAnet, Geant, APAN, CESnet. Sites shown include SLAC, Stanford, NERSC, LANL, JLAB, TRIUMF, KEK, FNAL, ANL, CERN, IN2P3, Caltech, SDSC, BNL, RAL, UCL, Rice, NCSA, UFL, RIKEN, INFN-Roma, INFN-Milan, ORNL and Utah. Legend: EDG, PPDG/GriPhyN, Monitoring Site]
[Map: BaBar Grid. Sites: Imperial, INFN-Padua, SLAC, Manchester, Bristol, Dresden, IN2P3, RAL, Stanford. Networks: CalREN, Abilene, Renater, DFN, JAnet, NNW, TVN, SWERN, ESnet, Geant. Link speeds: 622 Mbps, 1 Gbps, 2.5 Gbps, 10 Gbps]
[Figure: Throughput from SLAC to RAL between May 2002 and February 2003. Series: iperf, bbcpmem, bbcpdisk, bbftp. Vertical axis from 0 to 250000 (units not shown); horizontal axis dated 5/13/2002 to 2/17/2003]
Problem Identification
• Typical Scenario
  – User complains file transfer is slow
  – Net admin runs ping, traceroute, iperf test
  – Complain to upstream provider
• Proactive
  – What do we mean by throughput?
  – How do we know there was a performance hit?
  – Our approach is diurnal changes
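One way to read "diurnal changes" is to judge each measurement against what is usual for the same hour of day, so that a normal afternoon dip is not mistaken for a fault. The following is a minimal sketch under that assumption, not the IEPM implementation; the function names, the sample values and the 50% cut-off are all illustrative.

```python
from collections import defaultdict
from statistics import median


def hourly_baseline(samples):
    """Median throughput per hour of day; samples are (hour, mbps) pairs."""
    by_hour = defaultdict(list)
    for hour, mbps in samples:
        by_hour[hour].append(mbps)
    return {hour: median(values) for hour, values in by_hour.items()}


def is_performance_hit(hour, mbps, baseline, fraction=0.5):
    """True if mbps is below `fraction` of the usual value for that hour."""
    usual = baseline.get(hour)
    if usual is None:
        return False  # no history for this hour, so no judgement is made
    return mbps < fraction * usual


# Hypothetical history: fast at 03:00, slower at 15:00.
history = [(3, 300.0), (3, 320.0), (3, 310.0),
           (15, 120.0), (15, 100.0), (15, 110.0)]
baseline = hourly_baseline(history)

afternoon = is_performance_hit(15, 105.0, baseline)  # normal for 15:00
night = is_performance_hit(3, 105.0, baseline)       # far below 03:00 norm
print(afternoon, night)
```

The same 105 Mbps reading is unremarkable in the afternoon but counts as a hit at night, which is the point of comparing like with like.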
Alarms
• Too much to keep track of
• Rather not wait for complaints
• Automated Alarms
• Rolling average à la RIPE-TT
  – May not be the best approach
• AMP Automated Detection System
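A rolling-average alarm can be sketched in a few lines. This is only an illustration of the general idea, not the RIPE-TT or AMP algorithm; the class name, the window of 5 samples and the 60% threshold are assumptions chosen for the example.

```python
from collections import deque


class RollingAverageAlarm:
    """Flag a sample that falls well below the mean of recent samples."""

    def __init__(self, window=5, fraction=0.6):
        self.history = deque(maxlen=window)  # most recent `window` samples
        self.fraction = fraction

    def update(self, value):
        """Record a sample; return True if it should raise an alarm."""
        alarm = False
        # Only judge once a full window of earlier samples is available.
        if len(self.history) == self.history.maxlen:
            mean = sum(self.history) / len(self.history)
            alarm = value < self.fraction * mean
        self.history.append(value)  # every sample joins the baseline
        return alarm


detector = RollingAverageAlarm(window=5, fraction=0.6)
readings = [500, 510, 490, 505, 495, 250, 498]  # hypothetical Mbps values
flags = [detector.update(r) for r in readings]
print(flags)
```

Because every sample, including an anomalous one, enters the window, a sustained drop gradually becomes the new baseline and stops alarming. That is one reason a plain rolling average may not be the best approach.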
Limitations
• Could be over an hour before alarm is generated
• More frequent measurements impact the network and measurements overlap
• Low impact tools allow finer grained measurement
  – Use NWS multi-variate method
  – Use SCIDAC ABwE tool
  – Use PingER, OWAMP
[Figure: Available Bandwidth Estimate between SLAC and Caltech in February 2003. Vertical axis: bandwidth in Mbps, 0 to 200; horizontal axis: 2/25/2003 0:00 to 2/27/2003 0:00]
Publishing
• Many monitoring projects exist; publishing data allows them to interoperate
• MDS
  – EDG NM Schema
• Web Services
  – GLUE NE Schema
• GGF NMWG
  – Hierarchy Doc
  – Tools Doc
./get_data
2003 3 18 6 1 41 1.61 1.601 1.62 0
Net Rat
• Alarm System
  – Multiple tools
  – Multiple measurement points
  – Trigger further measurements
  – Cross reference off site stats
• Informant database
• No measurement is ‘authoritative’
  – Cannot even believe a measurement
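If no single measurement is trusted, one simple policy is to act only when several independent tools agree. The sketch below illustrates that idea; it is not the Net Rat logic, and the function name, the tool verdicts and the two-tool minimum are hypothetical.

```python
def corroborated(reports, min_agree=2):
    """Decide whether enough tools agree that performance is degraded.

    `reports` maps a tool name to True when that tool sees a problem.
    Returns (raise_alarm, sorted list of tools reporting a problem).
    """
    flagged = sorted(tool for tool, degraded in reports.items() if degraded)
    return len(flagged) >= min_agree, flagged


# Hypothetical verdicts from three tools for one path.
single = corroborated({"iperf": True, "bbftp": False, "ping": False})
several = corroborated({"iperf": True, "bbftp": True, "ping": False})
print(single)
print(several)
```

A lone complaint from one tool would instead be a reason to trigger further measurements rather than to raise an alarm.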
Log
03/20/2003 20:13:46 ALARM pcgiga throughput=305.224 ctresh=512.95 athresh=312.91
03/20/2003 20:13:48 TRACE no change in route detected
03/20/2003 20:16:07 CALM Throughput within acceptable limits. ALARM CANCELLED
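Log entries of this shape are easy to post-process. The sketch below parses the three entries shown on this slide and works out how long the alarm lasted. The layout (timestamp, keyword, free text with optional `name=value` pairs) is inferred from this one sample only, so the regular expressions and field names are assumptions, and the meaning of `ctresh` and `athresh` is not interpreted.

```python
import re
from datetime import datetime

# Assumed layout: "MM/DD/YYYY HH:MM:SS KEYWORD free text".
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) (\w+) (.*)$")
PAIR_RE = re.compile(r"(\w+)=(\d+(?:\.\d+)?)")


def parse_line(line):
    """Split one log entry into time, level, message and numeric values."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None  # not a line in the assumed layout
    stamp, level, message = match.groups()
    return {
        "time": datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S"),
        "level": level,
        "message": message,
        "values": {k: float(v) for k, v in PAIR_RE.findall(message)},
    }


LOG = """\
03/20/2003 20:13:46 ALARM pcgiga throughput=305.224 ctresh=512.95 athresh=312.91
03/20/2003 20:13:48 TRACE no change in route detected
03/20/2003 20:16:07 CALM Throughput within acceptable limits. ALARM CANCELLED
"""

entries = [parse_line(line) for line in LOG.splitlines()]
raised = next(e for e in entries if e["level"] == "ALARM")
cleared = next(e for e in entries if e["level"] == "CALM")
duration = (cleared["time"] - raised["time"]).total_seconds()
print(raised["values"], duration)
```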
Toward a Monitoring Infrastructure
• MAGGIE
  – Measurement and Analysis package built on NIMI/Akenti
• EDEE
  – production-quality Data Grid for Europe
More Information
• IEPM Home Page
• IEPM-BW
• I2 E2E and PIPES
• RIPE-TT
• AMP Automated Event Detection
• NWS
• ABWE
End
This talk made possible by the IEPM team at SLAC (Les Cottrell, Connie Logg, Jiri Navratil, Jerrod Williams, Fabrizio Coccetti), and the many developers and maintainers around the world.