Slide 1
TCP/IP and Other Transports for High Bandwidth Applications
Real Applications on Real Networks
Richard Hughes-Jones University of Manchester
www.hep.man.ac.uk/~rich/ then “Talks” then look for “Brasov”
Slide 2
What we might cover!
This is what researchers find when they try to use high performance networks.
Real Applications on Real Networks:
Disk-2-disk applications on real networks
- Memory-2-memory tests
- Comparison of different data moving applications
- The effect (improvement) of different TCP stacks
- Transatlantic disk-2-disk at Gigabit speeds
Remote Computing Farms
- The effect of distance
- Protocol vs implementation
Radio Astronomy e-VLBI
- Users with data that is random noise!
Thanks to those who allowed me to use their slides: Sylvain Ravot (CERN), Les Cottrell (SLAC), Brian Tierney (LBL), Robin Tasker (DL), Ralph Spencer (Jodrell Bank)
Slide 3
"Server Quality" Motherboards
SuperMicro P4DP8-2G (P4DP6): dual Xeon, 400/533 MHz front side bus
6 PCI / PCI-X slots on 4 independent PCI buses: 64 bit 66 MHz PCI, 100 MHz PCI-X, 133 MHz PCI-X
Dual Gigabit Ethernet
Adaptec AIC-7899W dual channel SCSI
UDMA/100 bus master EIDE channels with data transfer rates of 100 MB/s burst
Slide 4
"Server Quality" Motherboards
Boston/Supermicro H8DAR: two dual-core Opterons
200 MHz DDR memory, theory BW: 6.4 Gbit/s
HyperTransport
2 independent PCI buses, 133 MHz PCI-X
2 Gigabit Ethernet
SATA
(PCI-e)
Slide 5
UK Transfers MB-NG and SuperJANET4
Throughput for real users
Slide 6
Topology of the MB – NG Network
Key: Gigabit Ethernet; 2.5 Gbit POS access; MPLS; admin. domains
[Diagram: Manchester Domain (man01, man02, man03 with HW RAID), UCL Domain (lon01, lon02, lon03), RAL Domain (ral01, ral02 with HW RAID), linked across the UKERNA Development Network; edge and boundary routers are Cisco 7609s]
Slide 7
Topology of the Production Network
Key: Gigabit Ethernet; 2.5 Gbit POS access; 10 Gbit POS
[Diagram: Manchester Domain (man01, HW RAID) to RAL Domain (ral01, HW RAID) across the production network: 3 routers, 2 switches]
Slide 8
iperf Throughput + Web100
SuperMicro on MB-NG network: HighSpeed TCP at line speed, 940 Mbit/s; DupACKs? <10 (expect ~400)
BaBar on Production network: Standard TCP, 425 Mbit/s; DupACKs 350-400 – re-transmits
Slide 9
Applications: Throughput Mbit/s
HighSpeed TCP, 2 GByte file, RAID5, SuperMicro + SuperJANET
Applications tested: bbcp, bbftp, apache, GridFTP
Previous work used RAID0 (not disk limited)
Slide 10
bbftp: What else is going on? Scalable TCP
BaBar + SuperJANET
SuperMicro + SuperJANET
Congestion window – duplicate ACKs
Variation not TCP related? Disk speed / bus transfer / application
Slide 11
bbftp: Host & Network Effects
2 GByte file, RAID5 disks: 1200 Mbit/s read, 600 Mbit/s write
Scalable TCP
BaBar + SuperJANET: instantaneous 220-625 Mbit/s
SuperMicro + SuperJANET: instantaneous 400-665 Mbit/s for 6 s, then 0-480 Mbit/s
SuperMicro + MB-NG: instantaneous 880-950 Mbit/s for 1.3 s, then 215-625 Mbit/s
Slide 12
Average Transfer Rates Mbit/s

App      TCP Stack   SuperMicro   SuperMicro       BaBar on      SC2004 on
                     on MB-NG     on SuperJANET4   SuperJANET4   UKLight
Iperf    Standard    940          350-370          425           940
         HighSpeed   940          510              570           940
         Scalable    940          580-650          605           940
bbcp     Standard    434          290-310          290
         HighSpeed   435          385              360
         Scalable    432          400-430          380
bbftp    Standard    400-410      325              320           825
         HighSpeed   370-390      380
         Scalable    430          345-532          380           875
apache   Standard    425          260              300-360
         HighSpeed   430          370              315
         Scalable    428          400              317
Gridftp  Standard    405          240
         HighSpeed   320
         Scalable    335

New stacks give more throughput
Rate decreases
Slide 13
Transatlantic Disk to Disk Transfers
With UKLight
SuperComputing 2004
Slide 14
SC2004 UKLIGHT Overview
[Diagram: MB-NG 7600 OSR (Manchester), ULCC UKLight, UCL HEP, UCL network; UKLight 10G with four 1GE channels; UKLight 10G; SURFnet/EuroLink 10G with two 1GE channels; NLR Lambda NLR-PITT-STAR-10GE-16; Chicago Starlight; Amsterdam; SC2004: Caltech Booth (UltraLight IP, Caltech 7600), SLAC Booth (Cisco 6509); K2 and Ci hosts]
Slide 15
SCINet
Collaboration at SC2004 Setting up the BW Bunker
The BW Challenge at the SLAC Booth
Working with S2io, Sun, Chelsio
Slide 16
Transatlantic Ethernet: TCP Throughput Tests
Supermicro X5DPE-G2 PCs: dual 2.9 GHz Xeon CPU, FSB 533 MHz
1500 byte MTU, 2.6.6 Linux kernel
Memory-memory TCP throughput, Standard TCP
Wire rate throughput of 940 Mbit/s
Work in progress to study: implementation detail, advanced stacks, effect of packet loss, sharing
[Web100 plots: achieved TCP bandwidth (Mbit/s), average bandwidth and CurCwnd vs time, for the full ~140 s run and for the first 10 s]
Slide 17
SC2004 Disk-Disk bbftp
bbftp file transfer program uses TCP/IP
UKLight path: London-Chicago-London; PCs: Supermicro + 3Ware RAID0
MTU 1500 bytes; socket size 22 Mbytes; rtt 177 ms; SACK off
Move a 2 GByte file; Web100 plots:
Standard TCP: average 825 Mbit/s (bbcp: 670 Mbit/s)
Scalable TCP: average 875 Mbit/s (bbcp: 701 Mbit/s, ~4.5 s of overhead)
Disk-TCP-Disk at 1 Gbit/s
[Web100 plots: achieved TCP bandwidth (Mbit/s), average bandwidth and CurCwnd vs time for the Standard TCP and Scalable TCP transfers]
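The 22 Mbyte socket size quoted above matches the bandwidth-delay product of this path; a minimal sketch of that check, assuming a 1 Gbit/s line rate and the quoted 177 ms rtt:

```python
# Bandwidth-delay product: the TCP socket buffer needed to keep a long path full.
# The 1 Gbit/s line rate and 177 ms rtt are taken from the slide; the helper is illustrative.

def bdp_bytes(line_rate_bit_s: float, rtt_s: float) -> float:
    """Return the bandwidth-delay product in bytes."""
    return line_rate_bit_s * rtt_s / 8.0

if __name__ == "__main__":
    bdp = bdp_bytes(1e9, 0.177)
    print("BDP = %.1f MBytes" % (bdp / 1e6))   # ~22 MBytes, the socket size used here
```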
Slide 18
Network & Disk Interactions (work in progress)
Hosts: Supermicro X5DPE-G2 motherboards; dual 2.8 GHz Xeon CPUs with 512 kbyte cache and 1 Mbyte memory; 3Ware 8506-8 controller on a 133 MHz PCI-X bus configured as RAID0; six 74.3 GByte Western Digital Raptor WD740 SATA disks; 64 kbyte stripe size
Measure memory to RAID0 transfer rates with & without UDP traffic
[Plots: RAID0 6-disk 1 GByte write throughput (64k blocks, 3w8506-8) vs trial number, without UDP and with 1500-byte and 9000-byte MTU UDP traffic; % CPU kernel-mode load correlations for 8k and 64k blocks, with fitted lines y = -1.017x + 178.3 and y = -1.048x + 174.4 and the reference line y = 178 - 1.05x]
Disk write: 1735 Mbit/s
Disk write + 1500 MTU UDP: 1218 Mbit/s, a drop of 30%
Disk write + 9000 MTU UDP: 1400 Mbit/s, a drop of 19%
Slide 19
Remote Computing Farms in the ATLAS TDAQ Experiment
Slide 20
Remote Computing Concepts
[Diagram: ATLAS Detectors – Level 1 Trigger feeding the ROBs; Level 2 Trigger (L2PUs) and Event Builders (SFIs) on the Data Collection Network (~PByte/s); SFIs feeding Local Event Processing Farms (PFs) over the Back End Network; SFOs to mass storage (320 MByte/s) in the Experimental Area / CERN B513; Remote Event Processing Farms (PFs) at Copenhagen, Edmonton, Krakow and Manchester reached over GÉANT lightpaths via a switch]
Slide 21
ATLAS Remote Farms – Network Connectivity
Slide 22
ATLAS Application Protocol
Event Request: EFD requests an event from SFI; SFI replies with the event (~2 Mbytes)
Processing of the event
Return of computation: EF asks SFO for buffer space; SFO sends OK; EF transfers the results of the computation
tcpmon – an instrumented TCP request-response program – emulates the Event Filter EFD to SFI communication.
[Sequence diagram, Event Filter Daemon EFD vs SFI and SFO: Request event → Send event data → Process event → Request buffer → Send OK → Send processed event, repeated over time; the request-response time is histogrammed]
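tcpmon itself is not reproduced here; the sketch below is a hypothetical, much simplified request-response client that mimics the EFD-to-SFI exchange (64-byte request, ~2 Mbyte response) and records the request-response time, the quantity histogrammed on the next slides. The host, port and sizes are illustrative assumptions.

```python
import socket
import time

HOST, PORT = "sfi.example.org", 5000     # hypothetical SFI emulator endpoint
REQUEST = b"EVENT_REQ".ljust(64)         # 64-byte request, as on the slide
EVENT_SIZE = 2 * 1024 * 1024             # ~2 Mbyte event

def request_response(sock):
    """Send one event request, read the whole event, return the elapsed time in seconds."""
    t0 = time.perf_counter()
    sock.sendall(REQUEST)
    remaining = EVENT_SIZE
    while remaining > 0:
        chunk = sock.recv(min(65536, remaining))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        remaining -= len(chunk)
    return time.perf_counter() - t0

def main(n_events=100):
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Ask for a large receive buffer before connecting, so the window
    # scaling negotiated on this long-rtt path can actually be used.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4 * 1024 * 1024)
    sock.connect((HOST, PORT))
    times = [request_response(sock) for _ in range(n_events)]
    sock.close()
    print("mean request-response time: %.1f ms" % (1e3 * sum(times) / len(times)))

if __name__ == "__main__":
    main()
```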
Slide 23
Using Web100 TCP Stack Instrumentation
to analyse application protocol - tcpmon
Slide 24
tcpmon: TCP Activity Manc-CERN Req-Resp
[Plot: DataBytesOut and DataBytesIn (deltas) vs time]
Round trip time 20 ms
64 byte Request (green), 1 Mbyte Response (blue)
TCP in slow start: the 1st event takes 19 rtt or ~380 ms
[Plot: DataBytesOut, DataBytesIn (deltas) and CurCwnd vs time]
The TCP congestion window gets re-set on each request – a TCP stack implementation detail that reduces Cwnd after inactivity
Even after 10 s, each response takes 13 rtt or ~260 ms
[Plot: achieved TCP throughput (Mbit/s) and Cwnd vs time]
Transfer achievable throughput: 120 Mbit/s
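The Cwnd-reduction-after-inactivity behaviour noted above is, on Linux, commonly associated with the slow-start-after-idle logic. A hedged sketch of how one might inspect and, for testing, relax it on a reasonably modern kernel follows; whether the 2.6.6 kernel used in these tests exposed this sysctl is not claimed here.

```python
# Hedged sketch: inspect and optionally relax the "slow start after idle" behaviour
# that shrinks Cwnd on an idle connection. Newer Linux kernels expose it as a sysctl;
# availability on the 2.6.6 kernel used for these measurements is an assumption.
from pathlib import Path

SYSCTL = Path("/proc/sys/net/ipv4/tcp_slow_start_after_idle")

def show():
    if SYSCTL.exists():
        print("tcp_slow_start_after_idle =", SYSCTL.read_text().strip())
    else:
        print("sysctl not present on this kernel")

def disable():
    # Requires root; 0 keeps Cwnd across idle periods so each request
    # does not restart from a tiny congestion window.
    SYSCTL.write_text("0\n")

if __name__ == "__main__":
    show()
```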
Slide 25
tcpmon: TCP Activity Manc-CERN Req-Resp, TCP stack tuned
Round trip time 20 ms
64 byte Request (green), 1 Mbyte Response (blue)
TCP starts in slow start: the 1st event takes 19 rtt or ~380 ms
[Plots: DataBytesOut and DataBytesIn (deltas) vs time; achieved TCP throughput (Mbit/s) and Cwnd vs time; PktsOut, PktsIn (deltas) and CurCwnd vs time]
The TCP congestion window grows nicely; the response takes 2 rtt after ~1.5 s; rate ~10/s (with a 50 ms wait)
Transfer achievable throughput grows to 800 Mbit/s
Slide 26
tcpmon: TCP Activity Alberta-CERN Req-Resp, TCP stack tuned
Round trip time 150 ms
64 byte Request (green), 1 Mbyte Response (blue)
TCP starts in slow start: the 1st event takes 11 rtt or ~1.67 s
The TCP congestion window is in slow start to ~1.8 s, then congestion avoidance
Response in 2 rtt after ~2.5 s; rate 2.2/s (with a 50 ms wait)
Transfer achievable throughput grows slowly from 250 to 800 Mbit/s
[Plots: DataBytesOut and DataBytesIn (deltas) vs time; achieved TCP throughput (Mbit/s) and Cwnd vs time; PktsOut, PktsIn (deltas) and CurCwnd vs time]
Slide 27
Time Series of Request-Response Latency
[Plot: round-trip latency (ms) vs request time (s)]
Alberta – CERN: round trip time 150 ms; 1 Mbyte of data returned; stable for ~150 s at 300 ms; falls to 160 ms with ~80 μs variation
[Plot: expanded view of the Alberta-CERN latency, 160.30-160.60 ms]
Manchester – CERN: round trip time 20 ms; 1 Mbyte of data returned; stable for ~18 s at ~42.5 ms, then alternating points at 29 and 42.5 ms
[Plot: Manchester-CERN round-trip latency (ms) vs request time (s)]
Slide 28
Using the Trigger DAQ Application
Slide 29
Time Series of T/DAQ event rate
Manchester – CERN: round trip time 20 ms; 1 Mbyte of data returned
3 nodes: 1 Gigabit Ethernet + two 100 Mbit; 2 nodes: two 100 Mbit nodes; 1 node: one 100 Mbit node
Event Rate: using the tcpmon transfer time of ~42.5 ms and adding the time to return the data gives 95 ms, i.e. an expected rate of 10.5/s. We observe ~6/s for the gigabit node. Reason: TCP buffers could not be set large enough in the T/DAQ application.
[Plots: event rate (events/s) and number of remote nodes vs time (s); histogram of the frequency of events/s]
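The expected event rate above is simply the reciprocal of the per-event time; a quick worked check using the numbers from this slide:

```python
# Worked check of the expected event rate quoted on the slide.
transfer_ms = 42.5                 # tcpmon transfer time for the 1 Mbyte event
return_ms = 95.0 - transfer_ms     # time to return the data (slide quotes ~95 ms in total)
total_ms = transfer_ms + return_ms

print("expected rate = %.1f events/s" % (1000.0 / total_ms))   # ~10.5/s, vs the ~6/s observed
```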
Slide 30
Tcpdump of the Trigger DAQ Application
Slide 31
tcpdump of the T/DAQ dataflow at SFI (1): CERN-Manchester, 1.0 Mbyte event
Remote EFD requests event from SFI
Incoming event request
Followed by ACK
N 1448 byte packets
SFI sends event
Limited by TCP receive buffer
Time 115 ms (~4 ev/s)
When TCP ACKs arrive
more data is sent.
●●●
Slide 32
tcpdump of TCP Slowstart at SFI (2): CERN-Manchester, 1.0 Mbyte event
Remote EFD requests event from SFI
First event request
N 1448 byte packets
SFI sends event
Limited by TCP Slowstart
Time 320 ms
When ACKs arrive
more data sent.
Slide 33
tcpdump of the T/DAQ dataflow for SFI & SFO: CERN-Manchester – another test run, 1.0 Mbyte event
Remote EFD requests events from SFI
Remote EFD sends the computation back to SFO; links closed by the application
Link setup & TCP slowstart
Slide 34
Some Conclusions
The TCP protocol dynamics strongly influence the behaviour of the application. Care is required with the application design, e.g. the use of timeouts.
With the correct TCP buffer sizes:
- It is not throughput but the round-trip nature of the application protocol that determines performance.
- Requesting the 1-2 Mbytes of data takes 1 or 2 round trips.
- TCP Slowstart (the opening of Cwnd) considerably lengthens the time for the first block of data.
- Implementation "improvements" (Cwnd reduction) kill performance!
When the TCP buffer sizes are too small (the default):
- The amount of data sent is limited on each rtt.
- Data is sent and arrives in bursts.
- It takes many round trips to send 1 or 2 Mbytes.
The end hosts themselves: CPU power is required for the TCP/IP stack as well as the application; packets can be lost in the IP stack due to lack of processing power.
Although the application is ATLAS-specific, the network interaction is applicable to other areas, including remote iSCSI, remote database access, and real-time Grid computing – e.g. real-time interactive medical image processing.
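As a concrete illustration of "the correct TCP buffer sizes", the sketch below requests large per-socket buffers before connecting. The 8 MByte figure is illustrative only and should be matched to the path's bandwidth-delay product; the kernel limits must also allow it (see the summary slide).

```python
import socket

def make_tuned_socket(rcvbuf=8 * 1024 * 1024, sndbuf=8 * 1024 * 1024):
    """Create a TCP socket with large send/receive buffers.

    The buffers must cover the bandwidth-delay product of the path
    (e.g. ~2.5 MByte for 1 Gbit/s at a 20 ms rtt, ~19 MByte at 150 ms);
    the 8 MByte defaults here are illustrative only.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Set buffer sizes before connect() so window scaling is negotiated accordingly.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, sndbuf)
    return s
```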
Slide 35
Radio Astronomy
e-VLBI
Slide 36
Radio Astronomy (with help from Ralph Spencer, Jodrell Bank)
The study of celestial objects at wavelengths from <1 mm to >1 m.
Sensitivity for continuum sources ∝ 1/√(Bτ), where B = bandwidth and τ = integration time.
High resolution is achieved by interferometers.
Some radio-emitting X-ray binary stars in our own galaxy:
GRS 1915+105 (MERLIN)
SS433 (MERLIN and European VLBI)
Cygnus X-1 (VLBA)
Slide 37
Earth-Rotation Synthesis and Fringes
Need ~12 hours for full synthesis, though not necessarily collecting data for all that time. NB: trade-off between B and τ for sensitivity.
Telescope data are correlated in pairs: N(N-1)/2 baselines
MERLIN u-v coverage
Fringes obtained with the correct signal phase
Slide 38
The European VLBI Network: EVN
Detailed radio imaging uses antenna networks over 100s-1000s km
At faintest levels, sky teems with galaxies being formed
Radio penetrates cosmic dust - see process clearly
Telescopes in place … disk recording at 512 Mbit/s
A real-time connection allows greater response, reliability and sensitivity
Slide 39
EVN-NREN (diagram of telescope connectivity)
Westerbork, Netherlands – dedicated Gbit link
Onsala, Sweden – Gbit link
Jodrell Bank, UK
Dwingeloo – DWDM link
Cambridge, UK – MERLIN
Medicina, Italy
Chalmers University of Technology, Gothenburg
Torun, Poland – Gbit link
Slide 40
UDP Throughput Manchester-Dwingeloo (Nov 2003)
4th year project: Adam Mathews, Steve O'Toole
Throughput vs packet spacing
Manchester: 2.0 GHz Xeon; Dwingeloo: 1.2 GHz PIII
Near wire rate, 950 Mbit/s (NB the SLAC-CERN record stands at 6.6 Gbit/s)
Plots also show the packet loss and the CPU kernel load on sender and receiver.
[Plots: received wire rate (Mbit/s), % packet loss, % kernel load on sender and receiver vs spacing between frames, Gnt5-DwMk5 and DwMk5-Gnt5, 1472-byte packets, Nov 2003]
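The throughput-vs-spacing curves summarised above follow directly from the frame size and the inter-frame spacing; a small sketch of the offered wire-rate calculation, where the 66-byte per-frame overhead is the usual UDP/IP/Ethernet framing assumption (headers, CRC, preamble and inter-frame gap):

```python
# Offered wire rate for evenly spaced UDP frames, as plotted against frame spacing.
UDP_IP_ETH_OVERHEAD = 8 + 20 + 14 + 4 + 8 + 12   # UDP + IP + Eth hdr + CRC + preamble + IFG (bytes)

def wire_rate_mbit(payload_bytes, spacing_us):
    """Rate on the wire (Mbit/s) when one frame is sent every spacing_us microseconds."""
    bits_per_frame = 8 * (payload_bytes + UDP_IP_ETH_OVERHEAD)
    return bits_per_frame / spacing_us          # bits per microsecond == Mbit/s

for spacing in (12, 13, 15, 20, 30):
    print("%3d us -> %6.0f Mbit/s" % (spacing, wire_rate_mbit(1472, spacing)))
```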
Slide 41
Packet loss distribution
Cumulative distribution of the packet-loss probability p(t), each bin 12 s wide, compared with a Poisson distribution.
Long-range effects in the data?
Slide 42
26th January 2005 UDP Tests
Simon Casey (PhD project)
Between JBO and JIVE in Dwingeloo, using the production network. Period of high packet loss (3%):
Slide 43
The GÉANT2 Launch June 2005
Slide 44
e-VLBI at the GÉANT2 Launch, June 2005
[Map: Jodrell Bank (UK), Medicina (Italy), Torun (Poland), Dwingeloo (DWDM link)]
Slide 45
e-VLBI UDP Data Streams
Slide 46
UDP Performance: 3 Flows on GÉANT
Throughput, 5 hour run:
Jodrell – JIVE: 2.0 GHz dual Xeon to 2.4 GHz dual Xeon, 670-840 Mbit/s
Medicina (Bologna) – JIVE: 800 MHz PIII to mark623 1.2 GHz PIII, 330 Mbit/s, limited by the sending PC
Torun – JIVE: 2.4 GHz dual Xeon to mark575 1.2 GHz PIII, 245-325 Mbit/s, limited by security policing (>400 Mbit/s → 20 Mbit/s)?
Throughput over a 50 min period: the period is ~17 min
[Plots: received wire rate (Mbit/s) vs time (10 s steps) for the Jodrell, Medicina and Torun flows, 14 Jun 05; full run and 50 min detail]
Slide 47
UDP Performance: 3 Flows on GÉANT
Packet loss & re-ordering:
Jodrell: 2.0 GHz Xeon – loss 0-12%, reordering significant
Medicina: 800 MHz PIII – loss ~6%, reordering insignificant
Torun: 2.4 GHz Xeon – loss 6-12%, reordering insignificant
[Plots: number of re-ordered and lost packets vs time (10 s bins) for the Torun, Jodrell and Medicina flows, 14 Jun 05]
Slide 48
18 Hour Flows on UKLight: Jodrell – JIVE, 26 June 2005
Throughput: Jodrell – JIVE, 2.4 GHz dual Xeon to 2.4 GHz dual Xeon, 960-980 Mbit/s
Traffic through SURFnet
Packet loss: only 3 groups with 10-150 lost packets each; no packets lost the rest of the time
Packet re-ordering: none
[Plots: received wire rate (Mbit/s) vs time (10 s steps) for man03-jivegig1, 26 Jun 05, with a zoom on the 5000-5200 interval; packet loss vs time]
Slide 49
Summary, Conclusions
The host is critical: motherboards, NICs, RAID controllers and disks matter.
The NICs should be well designed: they should use 64 bit 133 MHz PCI-X (66 MHz PCI can be OK); the NIC/drivers need efficient CSR access, clean buffer management and good interrupt handling.
Worry about the CPU-memory bandwidth as well as the PCI bandwidth: data crosses the memory bus at least 3 times.
Separate the data transfers – use motherboards with multiple 64 bit PCI-X buses; 32 bit 33 MHz is too slow for Gigabit rates; 64 bit 33 MHz is >80% used.
Choose a modern high-throughput RAID controller; consider SW RAID0 of RAID5 HW controllers.
Need plenty of CPU power for sustained 1 Gbit/s transfers.
Packet loss is a killer: check campus links & equipment, and the access links to backbones.
New stacks are stable and give better response & performance, but you still need to set the TCP buffer sizes! Check other kernel settings, e.g. window-scale.
Application architecture & implementation are also important: the interaction between HW, protocol processing and the disk sub-system is complex.
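A hedged example of the kind of kernel settings referred to above for a Linux host: the helper below prints sysctl lines sized from a path's bandwidth-delay product. The sysctl names are the standard 2.4/2.6-era ones; the values are illustrative only.

```python
# Illustrative helper: print Linux sysctl lines sized from a path's bandwidth-delay product.
# net.core.rmem_max/wmem_max, net.ipv4.tcp_rmem/tcp_wmem and tcp_window_scaling are
# standard kernel knobs; the factor of 2 headroom here is an assumption.

def suggest_sysctls(line_rate_bit_s, rtt_s):
    bdp = int(line_rate_bit_s * rtt_s / 8)          # bytes needed to fill the pipe
    return "\n".join([
        "net.core.rmem_max = %d" % (2 * bdp),
        "net.core.wmem_max = %d" % (2 * bdp),
        "net.ipv4.tcp_rmem = 4096 87380 %d" % (2 * bdp),
        "net.ipv4.tcp_wmem = 4096 65536 %d" % (2 * bdp),
        "net.ipv4.tcp_window_scaling = 1",
    ])

print(suggest_sysctls(1e9, 0.177))   # e.g. the 1 Gbit/s transatlantic path with rtt 177 ms
```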
Slide 50
More Information – Some URLs
Real-Time Remote Farm site: http://csr.phys.ualberta.ca/real-time
UKLight web site: http://www.uklight.ac.uk
DataTAG project web site: http://www.datatag.org/
UDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/~rich/ (Software & Tools)
Motherboard and NIC tests: http://www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt & http://datatag.web.cern.ch/datatag/pfldnet2003/
"Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards", FGCS Special Issue 2004: http://www.hep.man.ac.uk/~rich/ (Publications)
TCP tuning information: http://www.ncne.nlanr.net/documentation/faq/performance.html & http://www.psc.edu/networking/perf_tune.html
TCP stack comparisons: "Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks", Journal of Grid Computing 2004: http://www.hep.man.ac.uk/~rich/ (Publications)
PFLDnet: http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/
Dante PERT: http://www.geant2.net/server/show/nav.00d00h002
Slide 51
Any Questions?
Slide 52
Backup Slides
Slide 53
Latency Measurements
UDP/IP packets sent between back-to-back systems; processed in a similar manner to TCP/IP but not subject to flow control & congestion avoidance algorithms. Used the UDPmon test program.
Latency: round trip times measured using request-response UDP frames; latency as a function of frame size.
The slope is given by s = Σ (db/dt)⁻¹, summed over the data paths: mem-mem copy(s) + PCI + Gig Ethernet + PCI + mem-mem copy(s).
The intercept indicates processing times + HW latencies.
Histograms of 'singleton' measurements tell us about: the behaviour of the IP stack, the way the HW operates, and interrupt coalescence.
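The slope and intercept are normally obtained from a straight-line fit to the measured latency vs frame size; a minimal sketch of that fit, using placeholder (not measured) values:

```python
# Least-squares fit of latency vs frame size: the slope estimates the summed
# inverse data rates along the path, the intercept the fixed processing + HW latencies.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx            # slope (us/byte), intercept (us)

frame_bytes = [64, 512, 1024, 1472]          # placeholder frame sizes
rtt_us      = [62.0, 71.5, 82.0, 91.0]       # placeholder round-trip latencies

slope, intercept = fit_line(frame_bytes, rtt_us)
print("slope = %.2f ns/byte, intercept = %.1f us" % (slope * 1e3, intercept))
```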
Slide 54
Throughput Measurements
UDP Throughput: send a controlled stream of UDP frames spaced at regular intervals.
Sender → Receiver sequence (see the sketch after this list):
- Zero stats; OK done
- Send data frames at regular intervals: n bytes per frame, a set number of packets, a set wait time between frames
- Record the time to send and the time to receive; histogram the inter-packet time
- Signal end of test; OK done
- Get remote statistics; the receiver sends back: number received, number lost + loss pattern, number out-of-order, CPU load & number of interrupts, 1-way delay
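UDPmon itself is not reproduced here; the sketch below is a hypothetical, much simplified sender implementing the scheme above: a controlled stream of n-byte UDP frames at a fixed spacing, each carrying a sequence number so the receiver can count loss and re-ordering. Host, port and parameters are illustrative assumptions.

```python
import socket
import struct
import time

DEST = ("receiver.example.org", 14144)   # hypothetical UDPmon-style receiver
PAYLOAD = 1472                           # bytes of UDP payload per frame
SPACING_US = 13.0                        # inter-packet spacing, ~wire rate on GigE
N_PACKETS = 100000

def send_stream():
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    body = bytes(PAYLOAD - 4)
    next_send = time.perf_counter()
    for seq in range(N_PACKETS):
        # A 4-byte sequence number lets the receiver detect loss and re-ordering.
        sock.sendto(struct.pack("!I", seq) + body, DEST)
        next_send += SPACING_US * 1e-6
        while time.perf_counter() < next_send:   # busy-wait pacing for precise spacing
            pass

if __name__ == "__main__":
    t0 = time.perf_counter()
    send_stream()
    elapsed = time.perf_counter() - t0
    print("sent %d frames in %.2f s (%.0f Mbit/s payload rate)"
          % (N_PACKETS, elapsed, N_PACKETS * PAYLOAD * 8 / elapsed / 1e6))
```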
Slide 55
PCI Bus & Gigabit Ethernet Activity
PCI Activity: Logic Analyzer with PCI probe cards in the sending PC, a Gigabit Ethernet fiber probe card, and PCI probe cards in the receiving PC
[Diagram: in each PC, CPU – memory – chipset – NIC over the PCI bus, with a Gigabit Ethernet probe between the two NICs and a logic analyser display]
Possible bottlenecks
Slide 56
End Hosts & NICs: CERN-nat-Manc.
Use UDP packets to characterise the host, NIC & network: request-response latency, throughput, packet loss, re-ordering
SuperMicro P4DP8 motherboard, dual Xeon 2.2 GHz CPU, 400 MHz system bus, 64 bit 66 MHz PCI / 133 MHz PCI-X bus
[Plots, pcatb121-nat-gig6, 13 Aug 04: received wire rate (Mbit/s), % packet loss and number of re-ordered packets vs spacing between frames for packet sizes from 50 to 1472 bytes; latency histograms N(t) for 256, 512 and 1400 byte packets]
The network can sustain 1 Gbit/s of UDP traffic.
The average server can lose smaller packets; packet loss is caused by lack of processing power in the PC receiving the traffic.
Out-of-order packets are due to WAN routers; lightpaths look like extended LANs and have no re-ordering.
Slide 57
TCP (Reno) – Details
The time for TCP to recover its throughput after 1 lost packet is given by t = C · RTT² / (2 · MSS), where C is the link capacity.
For an rtt of ~200 ms:
[Plot: time to recover (s) vs rtt (ms) for 10 Mbit, 100 Mbit, 1 Gbit, 2.5 Gbit and 10 Gbit links; markers at UK 6 ms, Europe 20 ms, USA 150 ms]
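A quick numerical check of the recovery-time formula for the rtt values marked on the plot, assuming a 1460-byte MSS; the formula itself is the standard Reno estimate reconstructed from the slide fragments.

```python
# Time for standard TCP to recover its full rate after one loss: t = C * rtt^2 / (2 * MSS).
MSS_BITS = 1460 * 8

def recovery_time_s(capacity_bit_s, rtt_s):
    return capacity_bit_s * rtt_s ** 2 / (2 * MSS_BITS)

for rtt_ms, label in [(6, "UK"), (20, "Europe"), (150, "USA")]:
    for rate in (1e8, 1e9, 1e10):
        t = recovery_time_s(rate, rtt_ms / 1e3)
        print("%-6s rtt %3d ms, %5.1f Gbit/s -> %8.1f s" % (label, rtt_ms, rate / 1e9, t))
```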
Slide 58
Network & Disk Interactions
Disk write: mem-disk 1735 Mbit/s; tends to be in 1 die
Disk write + UDP 1500: mem-disk 1218 Mbit/s; both dies at ~80%
Disk write + CPU memory-bandwidth load: mem-disk 1341 Mbit/s; 1 CPU at ~60%, the other at 20%; large user-mode usage; below the cut = high BW; high BW = die 1 used
Disk write + CPU load: mem-disk 1334 Mbit/s; 1 CPU at ~60%, the other at 20%; all CPUs saturated in user mode
[Plots: % CPU system-mode load L3+4 vs % CPU system-mode load L1+2 for 8k and 64k block writes in the write, write + UDP, write + memory-bandwidth and write + CPU-load cases, showing total and kernel CPU load, with fitted lines (e.g. y = -1.02x + 215.6, y = -1.05x + 206.5) and the reference line y = 178 - 1.05x; throughput (Mbit/s) vs trial number for the memory-bandwidth write case]