Network Performance for ATLAS Real-Time Remote Computing Farm Study
Alberta, CERN, Cracow, Manchester, NBI
MOTIVATION
Several experiments, including ATLAS at the Large Hadron Collider (LHC) and D0 at Fermilab, have expressed interest in using remote computing farms to process and analyse, in real time, the information from particle collision events. Different architectures have been suggested, from pseudo-real-time file transfer with subsequent remote processing to the real-time requesting of individual events described here.
To test the feasibility of using remote farms for real-time processing, a collaboration was set up between members of the ATLAS Trigger/DAQ community, with support from several national research and education network operators (DARENET, CANARIE, Netera, PSNC, UKERNA and DANTE), to demonstrate a proof of concept and measure end-to-end network performance. The testbed was centred at CERN and used three different types of wide-area high-speed network infrastructure to link the remote sites:
• an end-to-end lightpath (SONET circuit) to the University of Alberta in Canada
• standard Internet connectivity to the University of Manchester in the UK and the Niels Bohr Institute in Denmark
• a Virtual Private Network (VPN), composed of an MPLS tunnel over the GÉANT network and an Ethernet VPN over the PIONIER network, to IFJ PAN Krakow in Poland.
Remote Computing Concepts
[Figure: Remote computing architecture. The ATLAS detectors feed the Level 1 Trigger and the ROBs; L2PUs form the Level 2 Trigger; SFIs act as Event Builders feeding the local event processing farms (PFs), with SFOs writing to mass storage at CERN B513. The Data Collection Network and Back End Network connect, via GÉANT and lightpaths, to remote event processing farms in Copenhagen, Edmonton, Krakow and Manchester.]
CERN-Manchester TCP Activity
TCP/IP behaviour of the ATLAS Request-Response Application Protocol observed with Web100
64-byte requests in green; 1-Mbyte responses in blue. TCP in slow start takes 19 round trips, or ~380 ms.
TCP congestion window in red. This is reset by TCP on each request because the application sends no data over the network in the interim; TCP obeys RFC 2581 and RFC 2861.
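The start-up cost paid on each request can be illustrated with a small simulation (a sketch, not the measurement code: the initial window and ACK-per-segment doubling are assumed values, with no losses or delayed ACKs). It counts the round trips needed to push one response through classic slow start:

```python
# Sketch: count round trips needed to deliver a response during TCP slow start.
# Assumptions (not from the measurement): initial cwnd of 1 segment, cwnd
# doubles every RTT (one ACK per segment), no losses, no delayed ACKs.

def slow_start_rounds(total_segments, initial_cwnd=1):
    """Round trips to send total_segments, doubling cwnd each RTT."""
    cwnd, sent, rounds = initial_cwnd, 0, 0
    while sent < total_segments:
        sent += cwnd          # one window of segments per round trip
        cwnd *= 2             # classic slow-start doubling
        rounds += 1
    return rounds

# A 1-Mbyte response of ~380 packets (as counted by Web100 on this path):
print(slow_start_rounds(380), "round trips under idealised doubling")  # -> 9
```

Real connections grow more slowly than this idealised doubling (delayed ACKs, limited ACK clocking), and with the RFC 2861 window reset every fresh request pays the start-up cost again, which is why the observed transfer needed 19 round trips.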
[Plot: DataBytesOut (delta) and DataBytesIn (delta) vs time (ms).]
[Plot: DataBytesOut (delta), DataBytesIn (delta) and CurCwnd (value) vs time (ms).]
[Plot: PktsOut (delta), PktsIn (delta) and CurCwnd (value) vs time (ms).]
Observation of the Status of Standard TCP with Web100
Observation of TCP with no Congestion window reduction
TCP congestion window in red grows steadily. Request-response takes 2 RTT after 1.5 s. Rate ~10 events/s with 50 ms processing time.
Achievable transfer throughput grows to 800 Mbit/s. Data is transferred when the application requires it.
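The throughput figures follow from the usual window-limited estimate, throughput ≈ cwnd × 8 / RTT. A quick check with illustrative numbers (the 2-Mbyte window and 20 ms RTT here are assumptions for the sketch, not values read off the plots):

```python
# Window-limited TCP throughput estimate: throughput = cwnd * 8 / rtt.
# The cwnd and RTT values below are illustrative assumptions.

def throughput_mbit(cwnd_bytes, rtt_s):
    """Achievable throughput in Mbit/s for a given window and RTT."""
    return cwnd_bytes * 8 / rtt_s / 1e6

# e.g. a 2-Mbyte congestion window on a 20 ms path:
print(round(throughput_mbit(2_000_000, 0.020)), "Mbit/s")  # -> 800 Mbit/s
```

This is why the achievable rate tracks the congestion window in the plots: once the window stops being reset, throughput climbs with it.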
[Plot: TCP achievable throughput (Mbit/s) and Cwnd vs time (ms), with annotations marking request-responses of 3 round trips and 2 round trips.]
The ATLAS Application Protocol
[Diagram: ATLAS application protocol timeline. Request event → Send event data → Process event → Request buffer → Send OK → Send processed event.]
Request-Response time (Histogram)
Event Filter EFD, SFI and SFO.
Event request: the EFD requests an event from the SFI; the SFI replies with the event data.
Processing of the event occurs.
Return of computation: the EF asks the SFO for buffer space; the SFO sends OK; the EF transfers the results of the computation.
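The event-request leg of this exchange can be sketched as a plain TCP request-response loop. The message sizes follow the poster (64-byte request, ~1-Mbyte response); the framing, function names and loopback server are assumptions for illustration, not the ATLAS code:

```python
# Sketch of the request-response event protocol over TCP.
# A client sends a small fixed-size request; the server replies with a
# ~1-Mbyte event. Framing, names and sizes are illustrative assumptions.
import socket
import threading

REQUEST_SIZE = 64            # 64-byte event request
EVENT_SIZE = 1_000_000       # ~1-Mbyte event response

def serve_one(listener):
    """Accept one connection, read the request, send back one event."""
    conn, _ = listener.accept()
    with conn:
        conn.recv(REQUEST_SIZE)              # read the event request
        conn.sendall(b"\0" * EVENT_SIZE)     # reply with the event data

def request_event(host, port):
    """Send one request and read the full event response."""
    with socket.create_connection((host, port)) as s:
        s.sendall(b"R" * REQUEST_SIZE)
        data = b""
        while len(data) < EVENT_SIZE:
            chunk = s.recv(65536)
            if not chunk:
                break
            data += chunk
        return data

listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]
t = threading.Thread(target=serve_one, args=(listener,))
t.start()
event = request_event("127.0.0.1", port)
t.join()
listener.close()
print(len(event), "bytes received")
```

The key property for TCP is visible even in this sketch: the connection is idle between the small request and the large response, which is exactly the traffic pattern that triggers the congestion-window reset discussed above.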
CERN-Alberta TCP Activity
64-byte requests in green; 1-Mbyte responses in blue. TCP in slow start takes 12 round trips, or ~1.67 s.
Observation of TCP with no congestion window reduction, with Web100.
TCP congestion window in red grows gradually after slow start. Request-response takes 2 RTT after ~2.5 s. Rate ~2.2 events/s with 50 ms processing time.
Achievable transfer throughput grows from 250 to 800 Mbit/s.
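The quoted event rates are consistent with a simple steady-state model: rate ≈ 1 / (n·RTT + t_proc), with n ≈ 2 round trips per request-response and t_proc the 50 ms processing time. The RTT values below are back-of-envelope assumptions for each path, so the results are rough estimates rather than measurements:

```python
# Steady-state event-rate estimate: rate = 1 / (n_rtt * rtt + t_proc).
# RTT values are rough per-path assumptions, not measured figures.

def event_rate(rtt_s, n_rtt=2, t_proc=0.050):
    """Events/s for a request-response of n_rtt round trips plus processing."""
    return 1.0 / (n_rtt * rtt_s + t_proc)

print(round(event_rate(0.020), 1), "events/s")   # assumed ~20 ms RTT (CERN-Manchester)
print(round(event_rate(0.140), 1), "events/s")   # assumed ~140 ms RTT (CERN-Alberta)
```

The Manchester estimate (~11 events/s) matches the observed ~10 events/s; the Alberta path measured ~2.2 events/s against an idealised ~3 events/s, suggesting some per-event overhead beyond the two round trips on the long path.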
[Plot: DataBytesOut (delta) and DataBytesIn (delta) vs time (ms), with a 2-round-trip annotation.]
[Plot: PktsOut (delta), PktsIn (delta) and CurCwnd (value) vs time (ms).]
[Plot: TCP achievable throughput (Mbit/s) and Cwnd vs time (ms).]
Principal partners
Web100 parameters on the server located at CERN (the data source).
Green – small requests; blue – big responses. TCP ACK packets are also counted (in each direction). One response = 1 MB ≈ 380 packets.
64-byte request, 1-Mbyte response.
CERN-Krakow TCP Activity
Steady-state request-response latency ~140 ms; rate ~7.2 events/s. The first event takes 600 ms due to TCP slow start.