Slide 1
IEEE Real Time 2005, Stockholm, 4-10 June
Investigating the Network Performance of Remote Real-Time Computing Farms for the ATLAS Trigger DAQ
Richard Hughes-Jones, University of Manchester
In collaboration with:
Bryan Caron, University of Alberta
Krzysztof Korcyl, IFJ PAN Krakow
Catalin Meirosu, Politehnica University of Bucuresti & CERN
Jakob Langgard Nielsen, Niels Bohr Institute
Slide 2
Introduction
Poster: On the potential use of Remote Computing Farms in the ATLAS TDAQ System
Slide 3
ATLAS Computing Model
[Diagram: the ATLAS computing model, from the detector to the desktop]
• Trigger & Event Builder: ~PByte/s from the detector; 320 MByte/s into the Event Filter (~7.5 MSI2k)
• Tier 0 (CERN centre): PBytes of disk and a tape robot; 10 GByte/s internally; ~5 PByte/year, no simulation
• Tier 1 regional centres (UK at RAL, plus US, French and Dutch centres): 100 - 1000 MB/s links; ~75 MB/s/T1 for ATLAS; ~2 PByte/year/T1
• Tier 2 centres, ~200 kSI2k each, on 622 Mb/s - 1 Gbit/s links, with physics data caches and desktops: ~200 TByte/year/T2; e.g. a Northern Tier of Sheffield, Manchester, Liverpool and Lancaster, ~0.25 TIPS
• Scale: one PC (2004) = ~1 kSpecInt2k
• High-bandwidth networks, many processors and experts at remote sites enable remote-institute filtering, calibration and monitoring
Slide 4
Remote Computing Concepts
[Diagram: the ATLAS detectors and Level 1 Trigger feed the ROBs; the Level 2 Trigger (L2PUs) reads the ROBs over the Data Collection Network; the Event Builders (SFIs) feed local Event Processing Farms (PFs) and the SFOs over the Back End Network, with mass storage in the experimental area and CERN B513; lightpaths across GÉANT connect remote Event Processing Farms in Copenhagen, Edmonton, Krakow and Manchester]
Slide 5
ATLAS Remote Farms – Network Connectivity
Slide 6
ATLAS Application Protocol
• Event request: the EFD requests an event from the SFI; the SFI replies with the event (~2 Mbytes)
• Processing of the event
• Return of the computation: the EF asks the SFO for buffer space, the SFO sends OK, and the EF transfers the results of the computation
• tcpmon: an instrumented TCP request-response program that emulates the Event Filter EFD-to-SFI communication (a sketch of the exchange follows below)
[Diagram: message sequence between the Event Filter Daemon (EFD), the SFI and the SFO - request event, send event data, process event, request buffer, send OK, send processed event - with the request-response time recorded as a histogram]
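To make the exchange concrete, here is a minimal Python sketch of a tcpmon-style client. The host, port and fixed message sizes are illustrative assumptions; the real tcpmon is a compiled tool from the UDPmon/TCPmon kit.

import socket
import time

SFI_HOST = "sfi.example.cern.ch"   # hypothetical emulated-SFI address
SFI_PORT = 14000                   # hypothetical port
REQUEST = b"R" * 64                # small event request
RESPONSE_SIZE = 2 * 1024 * 1024    # SFI replies with a ~2 Mbyte event

def request_response(sock):
    """Send one request, read the full response, return the elapsed seconds."""
    t0 = time.perf_counter()
    sock.sendall(REQUEST)
    remaining = RESPONSE_SIZE
    while remaining > 0:
        chunk = sock.recv(min(65536, remaining))
        if not chunk:
            raise ConnectionError("server closed the connection")
        remaining -= len(chunk)
    return time.perf_counter() - t0

with socket.create_connection((SFI_HOST, SFI_PORT)) as s:
    latencies = [request_response(s) for _ in range(100)]
print("mean request-response time: %.1f ms"
      % (1e3 * sum(latencies) / len(latencies)))

Histogramming the per-request times, as tcpmon does, gives the request-response latency distributions shown later.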
Slide 7
Networks and End Hosts
Slide 8
End Hosts & NICs: CERN-nat-Manc.
• Use UDP packets to characterise the host, NIC & network: request-response latency, throughput, packet loss and re-ordering
• Test hosts: SuperMicro P4DP8 motherboard; dual Xeon 2.2 GHz CPUs; 400 MHz system bus; 64 bit 66 MHz PCI / 133 MHz PCI-X bus
[Plots (pcatb121-nat-gig6, 13 Aug 04): received wire rate (Mbit/s), packet loss (%) and number of re-ordered packets versus inter-frame spacing (0-40 µs), for frame sizes from 50 to 1472 bytes; latency histograms N(t) for 256, 512 and 1400 byte frames, spanning ~20900-21500 µs]
• The network can sustain 1 Gbit/s of UDP traffic
• The average server can lose the smaller packets: the loss is caused by a lack of processing power in the PC receiving the traffic
• Out-of-order packets are due to the WAN routers; lightpaths look like extended LANs and show no re-ordering
A sketch of the sender side of this kind of UDP test follows below.
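For illustration, a minimal Python sketch of a UDPmon-style sender: frames of a chosen size carry a sequence number and are paced at a fixed inter-frame spacing, so a matching receiver can derive wire rate, loss and re-ordering. The receiver address and the busy-wait pacing are assumptions; the real tool paces frames far more precisely using the CPU cycle counter.

import socket
import struct
import time

DEST = ("receiver.example.net", 14100)  # hypothetical receiver
FRAME_SIZE = 1472                       # payload that fills a 1500-byte MTU
SPACING_US = 12                         # inter-frame spacing, microseconds
N_FRAMES = 10000

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
payload = bytearray(FRAME_SIZE)
next_send = time.perf_counter()
for seq in range(N_FRAMES):
    struct.pack_into("!I", payload, 0, seq)  # sequence number in the first 4 bytes
    while time.perf_counter() < next_send:   # crude busy-wait pacing
        pass
    sock.sendto(payload, DEST)
    next_send += SPACING_US * 1e-6

Sweeping SPACING_US and FRAME_SIZE reproduces the axes of the plots above.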
Slide 9
Using Web100 TCP Stack Instrumentation to analyse the application protocol - tcpmon
Slide 10
tcpmon: TCP Activity Manc-CERN Req-Resp
[Plot: DataBytesOut and DataBytesIn (deltas) versus time (ms)]
• Round trip time 20 ms; 64 byte request (green), 1 Mbyte response (blue)
• TCP is in slow start: the 1st event takes 19 rtt, or ~380 ms
[Plot: DataBytesOut, DataBytesIn (deltas) and CurCwnd versus time (ms)]
• The TCP congestion window gets re-set on each request: a TCP stack implementation detail that reduces Cwnd after inactivity
• Even after 10 s, each response takes 13 rtt, or ~260 ms
[Plot: achievable TCP throughput (Mbit/s) and Cwnd versus time (ms)]
• Transfer achievable throughput: 120 Mbit/s
A sketch of sampling the Web100 counters behind these plots follows below.
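For reference, a sketch of how per-connection Web100 counters such as DataBytesOut and CurCwnd can be sampled while the application runs. It assumes a Web100-patched kernel and, purely for illustration, a text export of each connection's variables; the real kit decodes binary /proc/web100 files through libweb100.

import glob
import time

def read_web100_vars(conn_dir):
    """Parse one snapshot of a connection's variables (assumed 'name: value' text format)."""
    vars_ = {}
    with open(conn_dir + "/read") as f:    # hypothetical text export
        for line in f:
            name, _, value = line.partition(":")
            vars_[name.strip()] = value.strip()
    return vars_

for _ in range(100):                       # sample at ~10 Hz for 10 s
    for conn_dir in glob.glob("/proc/web100/[0-9]*"):
        v = read_web100_vars(conn_dir)
        print(conn_dir, v.get("DataBytesOut"), v.get("CurCwnd"))
    time.sleep(0.1)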
Slide 11
tcpmon: TCP Activity Manc-CERN Req-Resp, TCP stack tuned
• Round trip time 20 ms; 64 byte request (green), 1 Mbyte response (blue)
• TCP starts in slow start: the 1st event takes 19 rtt, or ~380 ms
[Plot: DataBytesOut and DataBytesIn (deltas) versus time (ms)]
[Plot: achievable TCP throughput (Mbit/s) and Cwnd versus time (ms)]
[Plot: PktsOut, PktsIn (deltas) and CurCwnd versus time (ms)]
• The TCP congestion window grows nicely; the response takes 2 rtt after ~1.5 s; rate ~10/s (with the 50 ms wait)
• Transfer achievable throughput grows to 800 Mbit/s
The tuning amounts to sizing the socket buffers to the bandwidth-delay product of the path, as sketched below.
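A minimal sketch of that tuning for the 1 Gbit/s, 20 ms Manc-CERN path: request TCP socket buffers of at least the bandwidth-delay product, so the congestion window is never clamped by the advertised window. (The kernel must also permit buffers this large, e.g. via net.core.rmem_max and net.ipv4.tcp_rmem on Linux.)

import socket

RATE_BPS = 1_000_000_000                 # path capacity, bits/s
RTT_S = 0.020                            # round trip time, seconds
bdp_bytes = int(RATE_BPS * RTT_S / 8)    # bandwidth-delay product = 2.5 Mbytes

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, bdp_bytes)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bdp_bytes)
print("requested %d byte socket buffers" % bdp_bytes)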
Slide 12
tcpmon: TCP Activity Alberta-CERN Req-Resp, TCP stack tuned
• Round trip time 150 ms; 64 byte request (green), 1 Mbyte response (blue)
• TCP starts in slow start: the 1st event takes 11 rtt, or ~1.67 s (a worked check follows below)
• The TCP congestion window is in slow start to ~1.8 s, then congestion avoidance
• Response in 2 rtt after ~2.5 s; rate 2.2/s (with the 50 ms wait)
• Transfer achievable throughput grows slowly from 250 to 800 Mbit/s
[Plots: DataBytesOut and DataBytesIn (deltas) versus time; achievable TCP throughput (Mbit/s) and Cwnd versus time (ms); PktsOut, PktsIn (deltas) and CurCwnd versus time (ms)]
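A worked check of the "1st event takes 11 rtt" figure. Assuming an initial congestion window of one segment and simple doubling per round trip (a sketch that ignores delayed ACKs), slow start needs roughly log2 of the segment count in rounds, plus one rtt for the request itself:

MSS = 1448                    # segment payload, bytes (as seen in the tcpdump traces)
EVENT_BYTES = 1_000_000       # the 1 Mbyte response
RTT = 0.150                   # Alberta-CERN round trip time, seconds
INITIAL_CWND = 1              # segments; assumed for this estimate

segments = -(-EVENT_BYTES // MSS)   # ceiling division: ~691 segments
rounds, sent, cwnd = 0, 0, INITIAL_CWND
while sent < segments:
    sent += cwnd                    # one window's worth per round trip
    cwnd *= 2                       # slow-start doubling
    rounds += 1
total_rtts = rounds + 1             # + 1 rtt for the request itself
print(rounds, "slow-start rounds;", total_rtts, "rtt =",
      round(total_rtts * RTT, 2), "s")   # 10 rounds; 11 rtt = 1.65 s

which matches the ~1.67 s observed for the first event.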
Slide 13
SC2004 Disk-Disk bbftp
• bbftp file transfer program, uses TCP/IP
• UKLight path: London-Chicago-London; PCs: Supermicro + 3Ware RAID0
• MTU 1500 bytes; socket size 22 Mbytes; rtt 177 ms; SACK off
• Move a 2 Gbyte file; Web100 plots:
• Standard TCP: average 825 Mbit/s (bbcp: 670 Mbit/s)
• Scalable TCP: average 875 Mbit/s (bbcp: 701 Mbit/s, ~4.5 s of overhead)
• Disk-TCP-Disk at 1 Gbit/s is here!
[Plots (standard TCP and Scalable TCP): instantaneous and average achievable bandwidth (Mbit/s) and CurCwnd versus time (ms) over the ~20 s transfer]
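A quick consistency check on the quoted averages, nothing more than arithmetic: a 2 Gbyte file at these rates takes roughly 20 s, matching the time span of the Web100 plots.

FILE_BITS = 2 * 2**30 * 8    # a 2 Gbyte file, in bits
for name, mbit in [("standard TCP", 825), ("Scalable TCP", 875)]:
    seconds = FILE_BITS / (mbit * 1e6)
    print("%s: %.1f s for 2 Gbytes at %d Mbit/s" % (name, seconds, mbit))
    # standard TCP: 20.8 s; Scalable TCP: 19.6 s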
Slide 14
Time Series of Request-Response Latency
[Plot: round trip latency (ms) versus request time (s), 0-300 s]
Alberta – CERN:
• Round trip time 150 ms; 1 Mbyte of data returned
• Stable for ~150 s at 300 ms; falls to 160 ms with ~80 µs variation
[Plot: detail of the round trip latency, 160.30-160.60 ms, for requests at 200-250 s]
Manchester – CERN:
• Round trip time 20 ms; 1 Mbyte of data returned
• Stable for ~18 s at ~42.5 ms; then alternate points at 29 and 42.5 ms
[Plot: round trip latency (ms) versus request time (s), 0-100 s]
Slide 15
Using the Trigger DAQ Application
Slide 16
Time Series of T/DAQ event rate
• Manchester – CERN: round trip time 20 ms; 1 Mbyte of data returned
• 3 nodes: one Gigabit Ethernet + two 100 Mbit; 2 nodes: two 100 Mbit; 1 node: one 100 Mbit
• Event rate: the tcpmon transfer time is ~42.5 ms; adding the time to return the data gives ~95 ms, so the expected rate is 10.5/s (worked check below)
• Observe ~6/s for the gigabit node; reason: the TCP buffers could not be set large enough in the T/DAQ application
[Plots: event rate (events/s) and number of remote nodes versus time (s, 0-400); frequency histogram of events/s (0-12)]
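The expected-rate arithmetic, spelled out (the return time is derived so that the total matches the quoted ~95 ms):

TRANSFER_MS = 42.5     # 1 Mbyte event transfer time, measured with tcpmon
RETURN_MS = 52.5       # assumed time to return the processed data
TOTAL_MS = TRANSFER_MS + RETURN_MS            # ~95 ms per event
print("expected rate: %.1f events/s" % (1000.0 / TOTAL_MS))   # ~10.5/s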
Slide 17
Tcpdump of the Trigger DAQ Application
Slide 18
tcpdump of the T/DAQ dataflow at SFI (1): CERN-Manchester, 1.0 Mbyte event
• The remote EFD requests an event from the SFI: the incoming event request is followed by an ACK
• The SFI sends the event as N 1448-byte packets, limited by the TCP receive buffer; as the TCP ACKs arrive, more data is sent
• Time: 115 ms (~4 events/s)
A sketch of taking such a capture follows below.
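A sketch of how a capture like this can be taken on the SFI node. The interface name and port number are assumptions; tcpdump itself, with its standard -i/-s/-w options, is the real tool used here. Truncating to 96-byte snapshots keeps the TCP headers while dropping the event payload.

import subprocess

IFACE = "eth0"             # hypothetical data-network interface
EFD_PORT = 10000           # hypothetical EFD-SFI port
subprocess.run([
    "tcpdump",
    "-i", IFACE,                   # capture on the data interface
    "-s", "96",                    # snap length: headers only
    "-w", "sfi_dataflow.pcap",     # raw packets for offline analysis
    "port", str(EFD_PORT),         # only the EFD-SFI conversation
], check=True)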
Slide 19
Tcpdump of TCP Slowstart at SFI (2): CERN-Manchester, 1.0 Mbyte event
• The remote EFD sends its first event request to the SFI
• The SFI sends the event as N 1448-byte packets, limited by TCP slow start; as the ACKs arrive, more data is sent
• Time: 320 ms
Slide 20
tcpdump of the T/DAQ dataflow for SFI & SFO: CERN-Manchester, another test run, 1.0 Mbyte event
• The remote EFD requests events from the SFI
• The remote EFD sends the computation back to the SFO
• Link setup and TCP slow start are visible; the links are closed by the application
Slide 21
Some First Conclusions
• The TCP protocol dynamics strongly influence the behaviour of the application. Care is required with the application design, e.g. the use of timeouts.
• With the correct TCP buffer sizes:
  - It is not throughput but the round-trip nature of the application protocol that determines performance.
  - Requesting the 1-2 Mbytes of data takes 1 or 2 round trips.
  - TCP slow start (the opening of Cwnd) considerably lengthens the time for the first block of data.
  - Implementation "improvements" (Cwnd reduction) kill performance!
• When the TCP buffer sizes are too small (the default):
  - The amount of data sent is limited on each rtt.
  - Data is sent, and arrives, in bursts.
  - It takes many round trips to send 1 or 2 Mbytes.
• The end hosts themselves:
  - CPU power is required for the TCP/IP stack as well as the application.
  - Packets can be lost in the IP stack due to lack of processing power.
Slide 22
Summary
We are investigating the technical feasibility of remote real-time computing for ATLAS.
• We have exercised multiple 1 Gbit/s connections between CERN and universities located in Canada, Denmark, Poland and the UK. The network providers are very helpful and interested in our experiments.
• We developed a set of tests for characterising the network connections. Network behaviour is generally good - e.g. little packet loss is observed.
• Backbones tend to be over-provisioned; however, access links and campus LANs need care.
• Properly configured end nodes are essential for getting good results with real applications.
• Collaboration between the experts from the application and network teams is progressing well, and is required to achieve performance.
• Although the application is ATLAS-specific, the information presented on the network interactions is applicable to other areas, including: remote iSCSI; remote database access; real-time Grid computing, e.g. real-time interactive medical image processing.
Slide 23
Thanks to all who helped, including:
National Research Networks: Canarie, Dante, DARENET, Netera, PSNC and UKERNA
“ATLAS remote farms”: J. Beck Hansen, R. Moore, R. Soluk, G. Fairey, T. Bold, A. Waananen, S. Wheeler, C. Bee
“ATLAS online and dataflow software”: S. Kolos, S. Gadomski, A. Negri, A. Kazarov, M. Dobson, M. Caprini, P. Conde, C. Haeberli, M. Wiesmann, E. Pasqualucci, A. Radu
Slide 24
More Information - Some URLs
• Real-Time Remote Farm site: http://csr.phys.ualberta.ca/real-time
• UKLight web site: http://www.uklight.ac.uk
• DataTAG project web site: http://www.datatag.org/
• UDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/~rich/ (Software & Tools)
• Motherboard and NIC tests: http://www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt & http://datatag.web.cern.ch/datatag/pfldnet2003/
• “Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards”, FGCS Special Issue 2004: http://www.hep.man.ac.uk/~rich/ (Publications)
• TCP tuning information: http://www.ncne.nlanr.net/documentation/faq/performance.html & http://www.psc.edu/networking/perf_tune.html
• TCP stack comparisons: “Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks”, Journal of Grid Computing 2004: http://www.hep.man.ac.uk/~rich/ (Publications)
• PFLDnet: http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/
• Dante PERT: http://www.geant2.net/server/show/nav.00d00h002
Slide 25
Any Questions?
Slide 26
Backup Slides
Slide 27
End Hosts & NICs: CERN-Manc.
• Use UDP packets to characterise the host & NIC: request-response latency, throughput, packet loss and re-ordering
• Test hosts: SuperMicro P4DP8 motherboard; dual Xeon 2.2 GHz CPUs; 400 MHz system bus; 66 MHz 64 bit PCI bus
[Plots (pcatb89-gig6, 18 Jul 04): received wire rate (Mbit/s), packet loss (%) and number of re-ordered packets versus inter-frame spacing (0-40 µs), for frame sizes from 50 to 1472 bytes; latency histograms N(t) for 64, 512 and 1400 byte frames, spanning ~20900-21500 µs]
Slide 28
TCP (Reno) - Details
The time for TCP to recover its throughput after 1 lost packet is given by:

$\tau = \dfrac{C \cdot RTT^2}{2 \cdot MSS}$

where $C$ is the path capacity and $MSS$ is the maximum segment size. For an rtt of ~200 ms:
[Plot: time to recover (s, log scale from 10^-4 to 10^5) versus rtt (0-200 ms), for 10 Mbit, 100 Mbit, 1 Gbit, 2.5 Gbit and 10 Gbit paths; typical rtts marked: UK 6 ms, Europe 20 ms, USA 150 ms]
A sketch evaluating the formula at those rtts follows below.
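For illustration, a sketch evaluating the recovery-time formula at the rtts marked on the plot (an MSS of 1448 bytes is assumed):

MSS_BITS = 1448 * 8    # maximum segment size, in bits
RATES = {"10 Mbit": 1e7, "100 Mbit": 1e8, "1 Gbit": 1e9,
         "2.5 Gbit": 2.5e9, "10 Gbit": 1e10}
RTTS = {"UK": 0.006, "Europe": 0.020, "USA": 0.150}

for place, rtt in RTTS.items():
    for name, capacity in RATES.items():
        tau = capacity * rtt * rtt / (2 * MSS_BITS)   # tau = C*RTT^2 / (2*MSS)
        print("%s (%.0f ms), %s: %.3g s" % (place, rtt * 1e3, name, tau))
# e.g. 1 Gbit at the USA rtt of 150 ms: ~970 s, a quarter of an hour to recover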