
Slide 1

TCP/IP Overview & Performance

Richard Hughes-Jones, The University of Manchester
e-VLBI Network Meeting, 28 Jan 2005

MB-NG

Slide 2

TCP (Reno) – What’s the problem?

TCP has two phases:
Slow start – probe the network to estimate the available bandwidth; cwnd grows exponentially.
Congestion avoidance – the main data-transfer phase; the transfer rate grows "slowly" (linearly).

AIMD and High Bandwidth – Long Distance networks
Poor performance of TCP in high-bandwidth wide-area networks is due in part to the TCP congestion control algorithm (sketched below).
For each ACK in an RTT without loss:
cwnd -> cwnd + a / cwnd (Additive Increase, a = 1)
For each window experiencing loss:
cwnd -> cwnd – b · cwnd (Multiplicative Decrease, b = 1/2)

Packet loss is a killer !!
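
A minimal sketch (not code from the talk) of the AIMD rules above, tracking cwnd in segments across a few RTTs with a single loss:

```python
# Minimal sketch (illustration only): Reno-style AIMD congestion-window
# evolution in units of segments, using the a and b given above.

def aimd(cwnd, rtts, loss_rtts, a=1.0, b=0.5):
    """Return cwnd after each RTT; RTTs listed in loss_rtts see a loss."""
    history = []
    for rtt in range(rtts):
        if rtt in loss_rtts:
            cwnd -= b * cwnd        # multiplicative decrease on a loss
        else:
            cwnd += a               # ~a per RTT (a/cwnd per ACK, cwnd ACKs per RTT)
        history.append(cwnd)
    return history

print(aimd(cwnd=100.0, rtts=10, loss_rtts={5}))
```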

Slide 3

TCP (Reno) – Details

Time for TCP to recover its throughput after a single lost packet is given by:

τ = C · RTT² / (2 · MSS)

where C is the link capacity; for an rtt of ~200 ms: ~2 min.

[Plot: time to recover (s, log scale 0.0001–100,000) vs rtt (0–200 ms) for 10 Mbit, 100 Mbit, 1 Gbit, 2.5 Gbit and 10 Gbit links]
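
A small sketch of the recovery-time formula above, evaluated for the link speeds in the plot; the 1500-byte segment size is an assumption:

```python
# Minimal sketch (assumed 1500-byte segments): tau = C * RTT^2 / (2 * MSS),
# the recovery-time formula above, for the link speeds shown in the plot.

MSS_BITS = 1500 * 8

def recovery_time_s(capacity_bps, rtt_s, mss_bits=MSS_BITS):
    """Seconds for standard TCP to regain full rate after one lost packet."""
    return capacity_bps * rtt_s ** 2 / (2 * mss_bits)

for name, bps in [("10 Mbit", 10e6), ("100 Mbit", 100e6), ("1 Gbit", 1e9),
                  ("2.5 Gbit", 2.5e9), ("10 Gbit", 10e9)]:
    print(f"{name:>9}: {recovery_time_s(bps, 0.200):10.1f} s at rtt 200 ms")
```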

Slide 4

Investigation of new TCP Stacks

The AIMD algorithm – Standard TCP (Reno)
For each ACK in an RTT without loss:
cwnd -> cwnd + a / cwnd (Additive Increase, a = 1)
For each window experiencing loss:
cwnd -> cwnd – b · cwnd (Multiplicative Decrease, b = 1/2)

High Speed TCP
a and b vary with the current cwnd, using a table.
a increases more rapidly with larger cwnd – the cwnd returns to the 'optimal' size for the network path sooner.
b decreases less aggressively and, as a consequence, so does the cwnd; the effect is a smaller drop in throughput.

Scalable TCP
a and b are fixed adjustments for the increase and decrease of cwnd.
a = 1/100 – the increase is greater than for TCP Reno.
b = 1/8 – the decrease on loss is less than for TCP Reno.
Scalable over any link speed (update rules sketched below).

Fast TCP

Uses round trip time as well as packet loss to indicate congestion with rapid convergence to fair equilibrium for throughput.

HSTCP-LP, H-TCP, BiC-TCP
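
A minimal sketch (illustration only) contrasting the Reno and Scalable update rules listed above; window values are in segments:

```python
# Minimal sketch (illustration only): per-ACK and per-loss cwnd updates for
# standard TCP (Reno) and Scalable TCP as described above.

def reno_ack(cwnd, a=1.0):
    return cwnd + a / cwnd           # additive increase, ~+1 segment per RTT

def reno_loss(cwnd, b=0.5):
    return cwnd - b * cwnd           # halve the window on loss

def scalable_ack(cwnd, a=0.01):
    return cwnd + a                  # fixed increment per ACK -> ~+1% per RTT

def scalable_loss(cwnd, b=0.125):
    return cwnd - b * cwnd           # back off by only 1/8 on loss

cwnd = 10000.0                       # a large window on a long fat pipe
print("after loss  (Reno, Scalable):", reno_loss(cwnd), scalable_loss(cwnd))
print("after 1 RTT (Reno, Scalable):", cwnd + 1.0, cwnd * 1.01)
```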

Slide 5

Packet Loss and new TCP Stacks – TCP Response Function

Throughput vs loss rate – the further a curve extends to the right, the faster the recovery.
Packets are dropped in the kernel.
MB-NG: rtt 6 ms; DataTAG: rtt 120 ms.
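
The "theory" comparison is commonly the standard-TCP response function (Mathis et al.), throughput ≈ MSS / (RTT · sqrt(2p/3)). A small sketch, assuming a 1500-byte MSS and the DataTAG rtt:

```python
# Minimal sketch (a common approximation, not necessarily the exact theory
# curve plotted here): achievable standard-TCP throughput vs packet loss rate.

from math import sqrt

def reno_throughput_mbit(mss_bytes, rtt_s, loss_rate):
    return mss_bytes * 8 / (rtt_s * sqrt(2 * loss_rate / 3)) / 1e6

for n in (1_000, 10_000, 100_000, 1_000_000, 10_000_000):   # drop 1 packet in n
    rate = reno_throughput_mbit(1500, 0.120, 1 / n)
    print(f"drop 1 in {n:>10}: ~{rate:8.1f} Mbit/s at rtt 120 ms")
```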

Slide 6

Packet Loss and new TCP Stacks – TCP Response Function

UKLight London–Chicago–London, rtt 180 ms, 2.6.6 kernel.
Agreement with theory is good.

[Plots: sculcc1-chi-2 iperf 13Jan05 – TCP achievable throughput (Mbit/s, log and linear scales) vs packet drop rate (1 in n, 100 to 10^8) for A0 1500, A1 HSTCP, A2 Scalable, A3 HTCP, A5 BICTCP, A8 Westwood and A7 Vegas, with the A0 and Scalable theory curves]

Slide 7

High Throughput Demonstrations

[Diagram: man03 at Manchester (Geneva) and lon01 at London (Chicago), dual Xeon 2.2 GHz PCs, each on 1 GEth through a Cisco 7609 and Cisco GSR to the 2.5 Gbit SDH MB-NG core]

Send data with TCP; drop packets.

Monitor TCP with Web100

Slide 8

High Performance TCP – MB-NG

Drop 1 in 25,000; rtt 6.2 ms; recovery in 1.6 s.
[Web100 plots for Standard, HighSpeed and Scalable TCP]

Slide 9

High Performance TCP – DataTAG

Different TCP stacks tested on the DataTAG network; rtt 128 ms; drop 1 in 10^6.
High-Speed: rapid recovery.
Scalable: very fast recovery.
Standard: recovery would take ~20 min.

Slide 10

On the way to Higher Bandwidth

Slide 11

End Hosts & NICs – SuperMicro P4DP6

Latency

Throughput

Bus Activity

Use UDP packets from udpmon to characterise the host & NIC.
SuperMicro P4DP6 motherboard; dual Xeon 2.2 GHz CPUs; 400 MHz system bus; 66 MHz, 64-bit PCI bus.
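
A minimal sketch of this kind of probe (not udpmon itself): send UDP frames of a chosen size at a fixed inter-packet spacing, carrying a sequence number so the receiver can detect loss; the address, port and spacing are placeholders:

```python
# Minimal sketch (not udpmon): pace UDP frames at a requested spacing so the
# receiver can map received wire rate and loss against that spacing.

import socket, time

def send_burst(dest=("192.168.0.2", 5001), size=1472, spacing_us=15, count=10000):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    payload = bytearray(size)
    spacing = spacing_us / 1e6
    start = time.perf_counter()
    for i in range(count):
        payload[0:4] = i.to_bytes(4, "big")        # sequence number for loss detection
        sock.sendto(payload, dest)
        # busy-wait to hold the requested inter-packet spacing
        while time.perf_counter() - start < (i + 1) * spacing:
            pass
    sock.close()

if __name__ == "__main__":
    send_burst()
```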

[Plot: gig6-7 Intel PCI 66 MHz 27nov02 – received wire rate (Mbit/s) vs transmit time per frame (0–40 us) for frame sizes 50–1472 bytes]
[Histograms: N(t) vs latency (us) for 64, 512, 1024 and 1400 byte frames, Intel 64-bit 66 MHz NIC]
[Plot: latency (us) vs message length (0–3000 bytes), Intel 64-bit 66 MHz NIC; fits y = 0.0093x + 194.67 (send PCI) and y = 0.0149x + 201.75 (receive PCI)]
[PCI bus activity traces: send and receive PCI, 1400 bytes to NIC and 1400 bytes to memory, PCI Stop asserted]

Slide 12

Network switch limits behaviour

End-to-end UDP packets from udpmon:
Only 700 Mbit/s throughput.
Lots of packet loss.
The packet loss distribution shows the throughput is limited.

[Plots: w05gva-gig6_29May04_UDP – received wire rate (Mbit/s) and % packet loss vs spacing between frames (0–40 us) for frame sizes 50–1472 bytes; 1-way delay (us) vs packet number for a 12 us wait, full trace and zoom on packets 500–550]

Slide 13

TCP Window Scale factor not set correctly

SC2004 London–Chicago–London tests.
Server-quality hosts – 2.8 GHz dual Xeon; 133 MHz PCI-X bus.
The TCP window scale factor should allow the pipe to be filled: Delay*BW ≈ 22 Mbytes.
Web100 output shows:
Cwnd does not open.
Data is sent at line speed, but as one burst per rtt; data stops when Cwnd is reached.
Average throughput: 100 Mbit/s – limited by the sender.
A kernel configuration problem.
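
A sketch of the arithmetic: the bandwidth-delay product for a 1 Gbit/s path at rtt 177 ms is about 22 Mbytes, and the socket buffer (and hence the window scale factor) must be allowed to reach it. The values and socket calls below are illustrative, not the test configuration:

```python
# Minimal sketch (assumed values): size the socket buffer to the
# bandwidth-delay product so the window scale factor can open the pipe.

import socket

def bdp_bytes(rate_bps, rtt_s):
    return int(rate_bps * rtt_s / 8)

bdp = bdp_bytes(1e9, 0.177)                    # ~22 Mbytes, as on the slide
print(f"Delay*BW = {bdp / 1e6:.1f} Mbytes")

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Request the BDP as send/receive buffer; the kernel must also allow it
# (net.core.wmem_max / rmem_max and the tcp_rmem / tcp_wmem limits).
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, bdp)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bdp)
print("requested SO_SNDBUF:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
```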

[Plot: TCP achieved rate (Mbit/s) and Cwnd vs time (0–10,000 ms); series: instantaneous BW, average BW, CurCwnd]

Slide 14

Network & Disk Interactions

Hosts: Supermicro X5DPE-G2 motherboards; dual 2.8 GHz Xeon CPUs with 512 kbyte cache and 1 Mbyte memory; 3Ware 8506-8 controller on a 133 MHz PCI-X bus configured as RAID0; six 74.3 GByte Western Digital Raptor WD740 SATA disks; 64 kbyte stripe size.
Measure memory-to-RAID0 transfer rates with & without UDP traffic.
[Plot: RAID0 6 disks, 1 Gbyte write, 64k, 3w8506-8 – throughput (Mbit/s) vs trial number]
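
A minimal sketch (assumed, not the original test harness) of one such trial: write 1 Gbyte to the array in 64 kbyte blocks and report the achieved rate; the target path is a placeholder:

```python
# Minimal sketch: time a 1 Gbyte write in 64 kbyte blocks to the RAID0 array
# and report Mbit/s, the kind of memory-to-disk measurement repeated per trial.

import os, time

def write_test(path="/raid0/testfile", total=1 << 30, block=64 * 1024):
    buf = b"\0" * block
    t0 = time.time()
    with open(path, "wb", buffering=0) as f:
        for _ in range(total // block):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())              # include the time to reach the disks
    secs = time.time() - t0
    return total * 8 / secs / 1e6         # Mbit/s

if __name__ == "__main__":
    print(f"write throughput: {write_test():.0f} Mbit/s")
```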

[Plots: RAID0 6-disk 1 Gbyte write throughput (Mbit/s) vs trial number with 1500-byte and 9000-byte MTU UDP traffic; % CPU system mode L3+4 vs % CPU system mode L1+2 for 8k and 64k transfers, with fits y = -1.017x + 178.32 and y = -1.0479x + 174.44 (≈ y = 178 - 1.05x)]

Disk write: 1735 Mbit/s.
Disk write + 1500 MTU UDP: 1218 Mbit/s – a 30% drop.
Disk write + 9000 MTU UDP: 1400 Mbit/s.
CPU load.

Slide 15

iperf Throughput + Web100

SuperMicro on the MB-NG network, HighSpeed TCP – average: line speed, 940 Mbit/s; DupACKs < 10 (expect ~400).
BaBar host on the production network, Standard TCP – average: 425 Mbit/s; DupACKs 350–400 – re-transmits.

Slide 16

Disk–Disk bbftp

The bbftp file transfer program uses TCP/IP.
UKLight path: London–Chicago–London; PCs: Supermicro + 3Ware RAID0.
MTU 1500 bytes; socket size 22 Mbytes; rtt 177 ms; SACK off.
Move a 2 Gbyte file. Web100 plots:

Standard TCP: average 825 Mbit/s.
Scalable TCP: average 875 Mbit/s.

[Web100 plots: TCP achieved rate (Mbit/s) and Cwnd vs time (0–20,000 ms) for the two stacks; series: instantaneous BW, average BW, CurCwnd]

Slide 17

Parameters to Consider – Only some of them!

Use server-quality hosts.
Check that UDP packets achieve the expected bandwidth.
Watch for poor (old) or wrongly configured routers / switches, and overloaded access links – campus or country.
Hunt down packet loss at your desired sending rate.
Fill the pipe with packets in flight: set the socket buffer to 2 * Delay*BW.
Kernel configuration settings (see the sketch below):
Allow large socket buffer (TCP window) settings.
Set the length of the transmit queue large (~2000).
The TCP window scale factor should allow the pipe to be filled.
Disallow "moderation" in the TCP stack.
Consider turning off SACKs in 2.4.x kernels, and maybe up to 2.6.6.
Use large MTUs – reduces CPU load.
Enable interrupt coalescence – reduces CPU load.
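
A sketch that turns the list above into concrete settings for an assumed 1 Gbit/s, ~180 ms path; the sysctl names are the usual Linux ones and the values are illustrative only, not recommendations from the talk:

```python
# Minimal sketch (assumed figures): print kernel settings matching the list
# above for a 1 Gbit/s path with rtt ~180 ms.

def two_bdp(rate_bps, rtt_s):
    return int(2 * rate_bps * rtt_s / 8)        # 2 * Delay*BW in bytes

buf = two_bdp(1e9, 0.180)                       # ~45 Mbytes

settings = {
    "net.core.rmem_max": buf,                   # allow large receive buffers
    "net.core.wmem_max": buf,                   # allow large send buffers
    "net.ipv4.tcp_rmem": f"4096 87380 {buf}",   # min / default / max
    "net.ipv4.tcp_wmem": f"4096 65536 {buf}",
    "net.ipv4.tcp_moderate_rcvbuf": 0,          # disallow "moderation"
    "net.ipv4.tcp_sack": 0,                     # consider off for 2.4.x - 2.6.6
}

for key, value in settings.items():
    print(f"sysctl -w {key}='{value}'")
print("ip link set eth0 txqueuelen 2000")       # large transmit queue (~2000)
```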

Slide 18

Real Time TCP in e-VLBI

Slide 19

Does TCP delay the data? Work in progress!!

Send blocks of data (10 kbytes) at regular intervals.
Drop every 10,000th packet.
Measure the arrival time of the data.
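
A minimal sketch of the measurement (not the actual e-VLBI test code): pace 10 kbyte blocks over TCP on a fixed schedule and report how late each block arrives relative to that schedule; the address and the 10 ms interval are assumptions:

```python
# Minimal sketch: stream fixed-size blocks over TCP at a regular interval and,
# on the receiver, compare each block's arrival with its expected time.

import socket, time

BLOCK = 10 * 1024          # 10 kbyte blocks, as on the slide
INTERVAL = 0.01            # assumed send interval (10 ms)

def sender(dest=("192.168.0.2", 6000), blocks=10000):
    sock = socket.create_connection(dest)
    payload = b"\0" * BLOCK
    start = time.time()
    for n in range(blocks):
        sock.sendall(payload)
        delay = start + (n + 1) * INTERVAL - time.time()
        if delay > 0:                       # pace the sender to the schedule
            time.sleep(delay)
    sock.close()

def receiver(port=6000):
    srv = socket.socket()
    srv.bind(("", port))
    srv.listen(1)
    conn, _ = srv.accept()
    buf, first, n = b"", None, 0
    while True:
        chunk = conn.recv(65536)
        if not chunk:
            break
        buf += chunk
        while len(buf) >= BLOCK:            # one complete block has arrived
            now = time.time()
            first = now if first is None else first
            late = now - (first + n * INTERVAL)
            print(f"block {n}: {late * 1000:+.2f} ms vs expected arrival")
            buf = buf[BLOCK:]
            n += 1
```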

[Plots: mark5-g6_A0_10k_26Jan05 – delta t (ms) vs block number (0–10,000) for expected-send and expected-receive times (us); TCP achieved rate (Mbit/s) vs packets in, showing instantaneous BW and BW in]

Slide 20

More Information – Some URLs

MB-NG project web site: http://www.mb-ng.net/
DataTAG project web site: http://www.datatag.org/
UDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/~rich/net
Motherboard and NIC tests: www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt & http://datatag.web.cern.ch/datatag/pfldnet2003/
"Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards", FGCS special issue, 2004.
TCP tuning information may be found at: http://www.ncne.nlanr.net/documentation/faq/performance.html & http://www.psc.edu/networking/perf_tune.html
TCP stack comparisons: "Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks", Journal of Grid Computing, 2004.

Slide 21

Backup Slides

Slide 22

UKLight in the UK

Slide 23

SC2004 UKLIGHT Overview

[Diagram: SC2004 UKLight overview – Manchester (MB-NG 7600 OSR), ULCC UKlight and UCL HEP / UCL network; UKlight 10G (four 1GE channels) to Chicago Starlight; Surfnet / EuroLink 10G (two 1GE channels) to Amsterdam; NLR Lambda NLR-PITT-STAR-10GE-16 to the SC2004 show floor with the Caltech Booth (UltraLight IP, Caltech 7600) and SLAC Booth (Cisco 6509)]

Slide 24

Topology of the MB-NG Network

[Diagram: Key – Gigabit Ethernet, 2.5 Gbit POS access, MPLS admin. domains. Manchester domain (man01, man02, man03), UCL domain (lon01, lon02, lon03) and RAL domain (ral01, ral02), each with edge/boundary Cisco 7609 routers, interconnected across the UKERNA development network; HW RAID on the end hosts]

Slide 25

The Bandwidth Challenge at SC2003

Peak bandwidth 23.21 Gbit/s; 6.6 TBytes in 48 minutes.
10 Gbit/s throughput from SC2003 to Chicago & Amsterdam.

[Plot: router traffic to Abilene – throughput (Gbit/s, 0–10) vs time on 19 Nov 2003, 15:59–17:25, for Phoenix–Chicago and Phoenix–Amsterdam]

Phoenix–Amsterdam: 4.35 Gbit/s with HighSpeed TCP; rtt 175 ms, window 200 MB.

Slide 26

Average Transfer Rates (Mbit/s)

App      TCP Stack    SuperMicro on MB-NG    SuperMicro on SuperJANET4    BaBar on SuperJANET4
Iperf    Standard     940                    350-370                      425
Iperf    HighSpeed    940                    510                          570
Iperf    Scalable     940                    580-650                      605
bbcp     Standard     434                    290-310                      290
bbcp     HighSpeed    435                    385                          360
bbcp     Scalable     432                    400-430                      380
bbftp    Standard     400-410                325                          320
bbftp    HighSpeed    370-390                380                          -
bbftp    Scalable     430                    345-532                      380
apache   Standard     425                    260                          300-360
apache   HighSpeed    430                    370                          315
apache   Scalable     428                    400                          317
Gridftp  Standard     405                    240                          -
Gridftp  HighSpeed    320                    -                            -
Gridftp  Scalable     335                    -                            -

Slide 27

bbftp: Host & Network Effects

2 Gbyte file; RAID5 disks: 1200 Mbit/s read, 600 Mbit/s write.
Scalable TCP:
BaBar + SuperJANET – instantaneous 220–625 Mbit/s.
SuperMicro + SuperJANET – instantaneous 400–665 Mbit/s for 6 s, then 0–480 Mbit/s.
SuperMicro + MB-NG – instantaneous 880–950 Mbit/s for 1.3 s, then 215–625 Mbit/s.

Slide 28

Applications: Throughput Mbit/s

HighSpeed TCP; 2 GByte file; RAID5; SuperMicro + SuperJANET.
[Chart: throughput for bbcp, bbftp, apache and Gridftp]
Previous work used RAID0 (not disk limited).

Slide 29

Host, PCI & RAID Controller Performance

RAID0 (striped) & RAID5 (striped with redundancy).
Controllers: 3Ware 7506 parallel 66 MHz; 3Ware 7505 parallel 33 MHz; 3Ware 8506 Serial ATA 66 MHz; ICP Serial ATA 33/66 MHz.
Tested on a dual 2.2 GHz Xeon Supermicro P4DP8-G2 motherboard.
Disks: Maxtor 160 GB, 7200 rpm, 8 MB cache.
Read-ahead kernel tuning: /proc/sys/vm/max-readahead

Slide 30

RAID Controller Performance

[Tables/plots: RAID0 and RAID5 read speed and write speed for each controller]