
1

Testbeds

Les Cottrell

Site visit to SLAC by DoE program managers Thomas Ndousse & Mary Anne Scott

April 27, 2005

www.slac.stanford.edu/grp/scs/net/talk05/testbeds-apr05.ppt

Partially funded by DOE/MICS Field Work Proposal on Internet End-to-end Performance Monitoring (IEPM)

2

UltraLight
• CalTech/UMich lead, NSF funded project
  – Hybrid circuits (IP & dedicated)

3

UltraScienceNet
• ORNL lead, DoE funded
  – Dedicated circuits

4

UL Testbed 10Gbits/s
• Sunnyvale (interim until ESnet 10Gbps circuits reach SLAC, July 2005):
  – Currently UltraLight
  – Cisco 6509 from the UltraLight proposal
  – Four Sun v20z 1.8GHz Opterons loaned from BaBar
  – 10GE TOE NICs loaned from Chelsio
  – Four Neterion (S2io) 10GE NICs purchased
• Installed with Solaris 10 and Linux 2.6
  – Will get file server from Caltech
  – Remote management
    • Purchased/installed terminal server to provide console access
    • Purchased/installed remote power management
  – Connect Cisco to 10Gbps UltraLight circuit
  – Interim USN IP connection imminent

5

Sunnyvale set up
• Hosts have Solaris 10, Linux 2.6, Neterion & Chelsio 10GE NICs

[Diagram: compute servers A2–A6 attached to the 10Gbits/s UltraLight network (192.84.86.x); a 10Mbps management network (134.164.37.x) with a hub links a terminal server for console access and remote power management, reached via CENIC (http://137.164.37.3).]

6

Approaching 10Gbps performance

• Jumbo frames (1500 Bytes std => 9000 Bytes): factor of 6 improvement in recovery rate
  – Not an IEEE standard
  – May break some UDP applications
  – Not supported on many LANs
• Sender mods only; the HENP model is a few big senders and lots of smaller receivers (see the sketch below)
  – Simplifies deployment: only a few hosts at a few sending sites
  – So no Dynamic Right Sizing (DRS) at the receiver
• XCP/ECN need router modifications, so they are hard to deploy on the existing Internet
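Below is a minimal sketch (not from the slides) of what a sender-only change looks like in practice: the application requests a large socket send buffer to cover the path's bandwidth-delay product, while the receiver is left untouched. The receiver host/port and the 16MB buffer size are hypothetical.

```python
# Minimal sketch (not from the slides) of a sender-only change: request a
# large socket send buffer to cover the path's bandwidth-delay product.
# The receiver host/port and the 16MB buffer size are hypothetical.
import socket

DEST = ("receiver.example.org", 5001)
SND_BUF = 16 * 1024 * 1024  # assumed bandwidth-delay product

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, SND_BUF)  # set before connect
sock.connect(DEST)

chunk = b"\0" * (1 << 20)   # 1MB application writes
for _ in range(1000):       # push ~1GB of test data
    sock.sendall(chunk)
sock.close()
```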

7

Hardware Assists
• For 1Gbits/s paths, CPU, bus etc. are not a problem
• For 10Gbits/s they are important
• NIC assistance to the CPU is becoming popular (see the sketch below):
  – Checksum offload
  – Interrupt coalescence
  – Large send/receive offload (LSO/LRO)
  – TCP Offload Engine (TOE)
• Several vendors offer 10Gbits/s NICs, at least one offers a 1Gbits/s NIC
• But a TOE currently restricts you to the NIC vendor’s TCP implementation
• Most focus is on the LAN
  – Cheap alternative to InfiniBand, Myrinet etc.
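A hedged sketch of how one might check which of these NIC assists are currently enabled, assuming a Linux host with the ethtool utility available; the interface name is a placeholder and the exact output wording can vary by driver.

```python
# Sketch (assumes a Linux host with the ethtool utility on the PATH);
# the interface name is hypothetical.  Prints which of the offload
# assists discussed above are currently enabled on the NIC.
import subprocess

IFACE = "eth2"  # hypothetical 10GE interface

out = subprocess.run(["ethtool", "-k", IFACE],
                     capture_output=True, text=True, check=True).stdout
for line in out.splitlines():
    # lines typically look like "tcp-segmentation-offload: on"
    if any(key in line for key in ("checksum",
                                   "segmentation-offload",
                                   "large-receive-offload")):
        print(line.strip())
```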

8

10Gbps test
• Sunfire Vx0z, Linux & Solaris 10, Chelsio & Neterion
• Back-to-back (LAN) testing at SLAC
• SNV to LA
• At SC2004, using two 10Gbps dedicated paths between Pittsburgh and Sunnyvale
  – Using Solaris 10 (build 69) and Linux 2.6
  – On Sunfire Vx0z (dual & quad 2.4GHz 64 bit AMD Opterons) with PCI-X 133MHz 64 bit
  – Only 1500 Byte MTUs
• Achievable performance limits (using iperf; an illustrative run is sketched below)
  – TOE (Chelsio) vs no TOE (Neterion (S2io))
  – LSO vs no LSO support
  – Solaris 10 vs Linux
• UDTv2 evaluation
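For illustration, a minimal sketch of the kind of iperf run behind such achievable-throughput measurements; the host name, window size, stream count and duration are placeholders, not the actual test parameters.

```python
# Illustrative sketch of an achievable-throughput run in the style used
# above (iperf TCP test); host, window, streams and duration are placeholders.
import subprocess

HOST = "v20z-snv.example.org"      # hypothetical Sunnyvale test host
cmd = ["iperf", "-c", HOST,
       "-w", "8M",                 # TCP window (socket buffer size)
       "-P", "2",                  # number of parallel streams
       "-t", "60"]                 # test duration in seconds
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)               # iperf reports the achieved bandwidth
```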

9

CPU Utilization
• Receiver needs 20% less CPU than sender for high throughput

[Chart: CPU utilization (%) vs achievable throughput, SLAC to CENIC-LA; MTU 9000 Bytes, 1 stream, v20z dual 1.8GHz Opteron, S2io/Neterion, sender and receiver both with LSO. Linear fits: sender %CPU y = 18.116x (R² = 0.9972), receiver %CPU y = 14x (R² = 0.9782).]

For Neterion with LSO & Linux: Sender appears to use more CPU than receiver as the throughput increases

• Single stream limited by 1.8GHz CPU

10

Effect of Jumbos
• Throughput SLAC to CENIC-LA (1 stream, 2MB window, with LSO, Neterion (S2io)/Linux):
  – 1500B MTU: 1.8 Gbps
  – 9000B MTU: 6 Gbps
• Sender CPU: GHz/Gbps (single stream with LSO, Neterion/Linux; normalization sketched below):
  – 1500B MTU = 0.5 ± 0.13 GHz/Gbps
  – 9000B MTU = 0.3 ± 0.07 GHz/Gbps
  – Factor 1.7 improvement
For Neterion with LSO & Linux on the WAN, jumbo frames have a huge effect on performance and also improve CPU utilization.
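The GHz/Gbps figure is simply CPU cost per unit of throughput; a small sketch of the arithmetic, using illustrative inputs consistent with the 9000B case on a 1.8GHz Opteron.

```python
# Sketch of the GHz/Gbps normalization used on these slides:
# CPU cost per unit throughput = (CPU utilization fraction * clock) / rate.
def ghz_per_gbps(cpu_util_percent, clock_ghz, throughput_gbps):
    return (cpu_util_percent / 100.0) * clock_ghz / throughput_gbps

# Illustrative values: one 1.8GHz CPU fully busy while driving ~6 Gbps
# gives ~0.3 GHz/Gbps, in line with the 9000B MTU figure above.
print(round(ghz_per_gbps(100, 1.8, 6.0), 2))
```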

11

Effect of LSO
• v20z 1.8GHz, Linux 2.6, S2io, 2 streams SLAC to Caltech, 8MB window:
  – With LSO: 7.4Gbits/s
  – Without LSO: 5.4Gbits/s
• LAN (3 streams, 164KB window):
  – Solaris => Linux: 6.4Gbps (no LSO support in Solaris 10 at the moment)
  – Linux => Solaris 10: 4.8Gbps (LSO turned off at sender)
  – Linux => Solaris 10: 7.54Gbps (LSO turned on)

[Chart: CPU utilization (%) vs achievable throughput (Gbits/s), SNV to Caltech; MTU 9000B, 1 stream, txqueuelen 1000. Linear fits: with LSO y = 0.1727x (R² = 0.9975), without LSO y = 0.2343x (R² = 0.9934); ratio ~1.4.]

For Neterion with Linux on the LAN, LSO improves CPU utilization by a factor of 1.4 (the slope ratio is checked below). If one is CPU limited this will also improve throughput.
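The factor of ~1.4 follows directly from the two fitted slopes on the chart; a trivial check:

```python
# The ~1.4 CPU saving follows from the two fitted slopes on the chart
# (CPU utilization per Gbit/s, with and without LSO).
slope_with_lso = 0.1727
slope_without_lso = 0.2343
print(round(slope_without_lso / slope_with_lso, 2))  # ~1.36, i.e. ~1.4
```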

12

Solaris vs Linux
• Send from one to the other, single stream
• Compare sending from Linux with Neterion + LSO against sending from Solaris 10 without LSO
  – LSO support for Solaris coming soon

[Chart: Achievable throughput (Gbits/s) vs window size (0–60 MB) on the LAN, 1 stream, MTU 9400B; curves for Linux-to-Solaris and Solaris-to-Linux. Median Linux: 5.9 Gbps; median Solaris: 6.2 Gbps.]

• With one stream the Solaris sender sends faster
• Solaris has slightly better GHz/Gbps: Solaris 0.287 ± 0.001, Linux 0.303 ± 0.001

13

Solaris vs Linux multi-streams
When optimizing for multiple streams, a Linux + LSO sender is better.

[Chart: Achievable throughput (Gbits/s) vs number of streams (0–20) on the LAN, MTU 9400B, S2io, for 1MB, 2MB and 4MB windows; Solaris and Linux curves, peaking around 7.5Gbps (Linux) and 6.4Gbps (Solaris).]

• Solaris without LSO performs poorly with multiple streams (LSO or OS related?)
  – Its GHz/Gbps is poorer than Linux + LSO for multiple streams

14

Chelsio
• Chelsio to Chelsio (TOE)
• With 2.4GHz V20zs from Pittsburgh to SNV
• 1500 Byte MTUs
• Reliably able to get 7.4-7.5 Gbps (16 streams)
• GHz/Gbps for Chelsio (MTU=1500B) ~ Neterion (9000B)

15

SLAC Connection
• Part of ESnet Bay Area MAN
  – Will be 4 × 10GE circuits, 2 in and 2 out for the ring
  – Qwest will connect to Stanford in the next fortnight
  – Then cross-connect to SLAC/Stanford fibers and thus to SLAC
• Working with Stanford to identify fiber pairs

16

SC2004: Tenth of a Terabit/s Challenge
• Joint Caltech, SLAC, FNAL, CERN, UF, SDSC, BR, KR, …
• 10 × 10 Gbps waves to HEP on show floor
• Bandwidth challenge: aggregate throughput of 101.13 Gbps
• FAST TCP

17

Bandwidth Challenge

Large collaboration of academia and industry. Took a lot of “wizards” to make it work.

>100 Gbps aggregate

The prize!

18

Conclusions
• UDT limit was ~4.45Gbits/s
  – CPU limited
• TCP limit was about 7.5±0.07 Gbps, regardless of:
  – Whether LAN (back to back) or WAN
• TCP gating factor = PCI-X 133MHz ≡ 7.5Gbps (see the sketch below)
• One host with 4 CPUs & 2 NICs sent 11.5±0.2Gbps to two dual-CPU hosts with 1 NIC each
• Two hosts to two hosts (1 NIC/host) on one 10Gbps link: 9.07Gbps goodput forward & 5.6Gbps reverse
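A quick sketch of why a 64-bit PCI-X bus at 133MHz caps TCP near 7.5Gbps; the ~12% overhead factor is an assumption used only to illustrate the gap between the raw bus rate and the observed limit.

```python
# Sketch: raw ceiling of a 64-bit, 133MHz PCI-X bus, and the rough
# usable figure after bus/protocol overheads (the 12% overhead factor
# is an assumption for illustration).
bus_clock_hz = 133e6
bus_width_bits = 64
raw_gbps = bus_clock_hz * bus_width_bits / 1e9
print(round(raw_gbps, 1))         # ~8.5 Gbits/s theoretical
print(round(raw_gbps * 0.88, 1))  # ~7.5 Gbits/s usable (assumed overhead)
```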

19

Conclusions
• Jumbos can be a big help

• LSO is helpful (Neterion)

• For best throughput Linux+LSO sender better

• Without LSO Solaris provides more throughput

• Solaris without LSO has problems with multiple streams

• TOE (Chelsio) allows one to avoid 9000Byte MTUs

20

Conclusions
• Need testing on real networks
  – Controlled simulation & emulation are critical for understanding
  – BUT need to verify, and results can look different than expected
• Needs an honest independent broker (SLAC)
  – Don’t care who wins; have the contacts, reputation, testbeds etc.
  – Not really funded for this

21

Next Steps
• Evaluate various offloads (TOE, LSO, LRO ...)
• Evaluate OS support: Solaris 10 support of LSO, untangle Solaris vs Linux, Chelsio/TOE on Solaris, leverage industry contacts
• New buses: PCI-X 266MHz and PCI-Express are important; need NICs/hosts that support them, then evaluate
• Install IEPM-BW on the 10Gbps testbed
  – Evaluate existing tools at 10Gbits/s
  – Explore new tools for 10Gbits/s
• Exploit relationships with Neterion/Chelsio to work on packet pair timing aided by NICs
• Install passive tools (on 10Gbps testbeds and work with BNL to help achieve mission)
  – Evaluate Netflow measurement & analysis at 10Gbits/s
    • Privacy issues
  – Use SNMP to access MIBs, utilization etc.

22

Acknowledgements
• Gary Buhrmaster (SLAC), Parakram Khandpur (SLAC), Harvey Newman (Caltech), Yang Xia (Caltech), Xun Su (Caltech), Dan Nae (Caltech), Sylvain Ravot (Caltech), Richard Hughes-Jones (Manchester University), Michael Chen (Chelsio), Larry McIntosh (Sun), Frank Leers (Sun), Leonid Grossman (Neterion), Alex Aizman (Neterion)
• SLAC, Caltech, Manchester University, Chelsio, Sun, Neterion (S2io)

23

Further Information
• Web site with lots of plots & analysis
  – www.slac.stanford.edu/grp/scs/net/papers/pfld05/ruchig/Fairness/
• Inter-protocol comparison (Journal of Grid Computing, PFLD04)
  – www.slac.stanford.edu/cgi-wrap/getdoc/slac-pub-10402.pdf
• SC2004 details
  – www-iepm.slac.stanford.edu/monitoring/bulk/sc2004/

24

From     LSO   MTU (Bytes)   Median Thru (Gbps)   IQR     GHz/Gbps   IQR
Linux    On    9400          7.395                0.015   0.416      0.015
Linux    On    1500          2.03                 0.1     0.275      0.073
Linux    Off   9400          4.75                 0.055   0.375      0.006
Solaris  Off   9400          6.2                  0.02    0.287      0.001
Solaris  Off   1500          1.3                  0.415   0.59       0.115

25

When will it have an impact

• ESnet traffic doubling per year since 1990
• SLAC capacity increasing by 90%/year since 1982
  – SLAC Internet traffic increased by a factor of 2.5 in the last year
• International throughput increased by a factor of 10 in 4 years
• So traffic increases by a factor of 10 every 3.5 to 4 years (arithmetic sketched below), so in:
  – 3.5 to 5 years: 622 Mbps => 10Gbps
  – 3-4 years: 155 Mbps => 1Gbps
  – 3.5-5 years: 45Mbps => 622Mbps
• 2010-2012:
  – 100s of Gbits/s for high speed production network end connections
  – 10Gbps will be mundane for R&E and business
  – Home broadband: doubling ~every year, 100Mbits/s by end of decade
  – Aggressive goal: 1Gbps to all Californians by 2010

[Chart: Throughput from US (Mbits/s) over time.]
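The projection dates above follow from simple compound-growth arithmetic; a small sketch:

```python
# Sketch of the compound-growth arithmetic: time for a factor-of-10
# increase at a given annual growth rate.
import math

print(round(math.log(10, 2.0), 1))  # ~3.3 years if traffic doubles yearly
print(round(math.log(10, 1.9), 1))  # ~3.6 years at 90%/year growth
```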

26

What was special?
• End-to-end, application-to-application, single and multi-stream (not just internal backbone aggregate speeds)
• TCP has not run out of steam yet; it scales from modem speeds into the multi-Gbits/s region
  – TCP well understood, mature, many good features: reliability etc.
  – Friendly on shared networks
• New TCP stacks only need to be deployed at the sender
  – Often just a few data sources, many destinations
  – No modifications to backbone routers etc.
  – No need for jumbo frames
• Used Commercial Off The Shelf (COTS) hardware and software

27

What was Special (2/2)
• Raise the bar on expectations for applications and users
  – Some applications can use Internet backbone speeds
  – Provide planning information
• The network is looking less like a bottleneck and more like a catalyst/enabler
  – Reduce the need to colocate data and CPU
  – No longer ship literally truck or plane loads of data around the world
  – Worldwide collaborations of people working with large amounts of data become increasingly possible

28

Who needs it?

• HENP – current driver
  – Multi-hundreds of Mbits/s and multi-TByte files/day transferred across the Atlantic today
    • SLAC BaBar experiment already has a PByte stored
  – Tbits/s and ExaBytes (10^18 Bytes) stored within a decade
• Data intensive science:
  – Astrophysics, global weather, bioinformatics, fusion, seismology…
• Industries such as aerospace, medicine, security …
• Future:
  – Media distribution
    • 1 Gbits/s = 2 full length DVD movies/minute
    • 100 Gbits/s is equivalent to (see sketch below):
      – Downloading the Library of Congress in < 14 minutes
      – Three full length DVDs in a second
    • Will sharing movies be like sharing music today?
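A back-of-envelope check of these equivalences, assuming ~4.7 GB per DVD and ~10 TB for the Library of Congress (both assumed sizes, not from the slides):

```python
# Back-of-envelope check of the equivalences above.  The DVD size
# (~4.7 GB) and Library of Congress size (~10 TB) are assumptions.
link_gbps = 100
bytes_per_sec = link_gbps * 1e9 / 8           # 12.5 GB/s

dvd_bytes = 4.7e9
print(round(bytes_per_sec / dvd_bytes, 1))    # ~2.7 DVDs per second

loc_bytes = 10e12
print(round(loc_bytes / bytes_per_sec / 60))  # ~13 minutes for the LoC
```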