
CANS 12/1

J. Bunn, D. Nae, H. Newman, S. Ravot, X. Su, Y. Xia
California Institute of Technology

How we quadrupled the network speed world record: Bandwidth Challenge at SC 2004


Motivation and objectives

High Speed TeraByte Transfers for Physics: one-tenth of a terabit per second achieved (101 Gbps)

Demonstrating that many 10 Gbps wavelengths can be used efficiently over various continental and transoceanic distances

Preview of the globally distributed grid system that is now being developed in preparation for the next generation of high-energy physics experiments at CERN's Large Hadron Collider (LHC).

Monitoring the WAN performance using MonALISA
Major partners: Caltech, FNAL, SLAC
Major sponsors: Cisco, S2io, HP, Newisys
Major networks: NLR, Abilene, ESnet, LHCnet, AMPATH, TeraGrid


Network (I)


5 NLR waves dedicated to HEP

[Network map: NLR 10GE waves (e.g. NLR-PITT-LOSA-10GE-15, NLR-PITT-STAR-10GE-13, NLR-PITT-JACK-10GE-19) linking the SC04 show floor in Pittsburgh with SEA, SVL, LA, SD, CHI/Starlight, WDC and JAX, serving Caltech/Newman, FL/Avery, SLAC, HOPI, Optiputer and UW/Research Channel; all lines 10GE.]

National Lambda Rail (www.nlr.net)
Cisco long-haul DWDM equipment: ONS 15808 and ONS 15454
10 GE LAN PHY


Network (II)

80 10GE ports
50 10GE network adapters
10 Cisco 7600/6500 dedicated to the experiment
Backbone for the experiment built in two days
Four dedicated wavelengths of NLR to the Caltech booth:
Los Angeles (2 waves), Chicago, Jacksonville

One dedicated wavelength of NLR to the SLAC-FNAL booth from Sunnyvale

Connections (L3 peerings) to the show floor via SCinet: ESnet, TeraGrid, Abilene

Balance traffic between WAN links
UltraLight autonomous system dedicated to the experiment (AS32361)
L2 10 GE connections to LA, JACK and CHI
Dedicated LSPs (MPLS Label Switched Paths) on Abilene


Flows Matrix


Selecting the data transport protocol

Tests between CERN and Caltech
Capacity = OC-192 (9.5 Gbps); 264 ms round-trip latency; 1 flow
Sending station: Tyan S2882 motherboard, 2 x Opteron 2.4 GHz, 2 GB DDR
Receiving station (CERN OpenLab): HP rx4640, 4 x 1.5 GHz Itanium-2, zx1 chipset, 8 GB memory
Network adapter: S2io 10 GE

Linux TCP   Linux Westwood+   Linux BIC TCP   FAST
3.0 Gbps    5.0 Gbps          7.3 Gbps        4.1 Gbps


Selecting 10 GE NICs

                              S2io adapter             Intel adapter
Max. b2b TCP throughput       7.5 Gbps                 5.5 Gbps
Rx frame buffer capacity      64 MB                    256 KB
MTU                           9600 bytes               16114 bytes
IPv4 TCP Large Send Offload   Max offload size 64 KB   Partial; max offload size 32 KB

TSO must have hardware support in the NIC. It allows the TCP layer to send a larger-than-normal segment of data, e.g. 64 KB, to the driver and then to the NIC; the NIC then fragments the large packet into smaller (<= MTU) packets.

Benefits: TSO can reduce CPU overhead by 10-15% and increase TCP responsiveness.
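
The CPU saving is easy to see with rough numbers. Below is a minimal sketch (not from the slides; the segment sizes are illustrative assumptions) of how many per-segment trips through the stack one TSO hand-off replaces:

```python
# Back-of-the-envelope illustration: with TSO the host hands the driver/NIC one
# large segment and the NIC performs the per-MTU fragmentation in hardware.
TSO_SEGMENT = 64 * 1024   # bytes passed to the NIC in one TSO send (assumed)
MSS_STANDARD = 1460       # on-wire segment size with a 1500-byte MTU
MSS_JUMBO = 8960          # approximate segment size with a ~9000-byte MTU

for mss in (MSS_STANDARD, MSS_JUMBO):
    trips = TSO_SEGMENT / mss
    print(f"MSS {mss} bytes: one 64 KB hand-off replaces ~{trips:.0f} per-segment sends")
```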

p = C * RTT^2 / (2 * MSS)

p: time to recover to full rate
C: capacity of the link
RTT: round-trip time
MSS: maximum segment size
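
For a sense of scale, here is a minimal sketch (in Python, not part of the original slides) that plugs the CERN-Caltech test parameters from the previous slide into this formula; the MSS values for standard and jumbo frames are assumptions:

```python
# Time for a single standard-TCP (Reno) flow to regrow to full rate after one
# loss, p = C * RTT^2 / (2 * MSS), on the 9.5 Gbps / 264 ms test path above.

def recovery_time(capacity_bps: float, rtt_s: float, mss_bytes: int) -> float:
    """Seconds to return to full rate after a single packet loss."""
    return capacity_bps * rtt_s ** 2 / (2 * mss_bytes * 8)

capacity = 9.5e9   # OC-192 payload, bits per second
rtt = 0.264        # round-trip time, seconds

for mss in (1460, 8960):  # assumed MSS for 1500-byte and ~9000-byte MTUs
    t = recovery_time(capacity, rtt, mss)
    print(f"MSS {mss} bytes: ~{t:.0f} s (~{t / 3600:.1f} h) to recover")
```

Even with jumbo frames, a single loss costs over an hour of ramp-up on this path, which is why alternatives such as Westwood+, BIC and FAST were evaluated.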



101 Gigabit Per Second Mark

[Bandwidth Challenge traffic plot: 101 Gbps peak; dips due to unstable end-systems; end of demo marked.]

Source: Bandwidth Challenge committee


PIT-CHI NLR wave utilization

[Utilization plot during the BWC: 7.2 Gbps to FNAL, 5 Gbps from FNAL, 2 Gbps to and 2 Gbps from Caltech@Starlight.]

FAST TCP
Performance to/from FNAL is very stable
Bandwidth utilization during BWC = 80%


Abilene Wave #1

FAST TCP
Traffic to CERN via Abilene & CHI
Bandwidth utilization during BWC = 65%

[Utilization plot during the BWC: instability due to traffic coming from Brazil.]


Abilene Wave #2


TCP Reno
Traffic to LA via Abilene & NY
Bandwidth utilization during BWC = 52.5%
Max. throughput is better with Reno, but FAST is more stable
Each time a packet is lost, the Reno flow has to be restarted


Internet2 Land Speed Record submission

Multi-stream: 6.86 Gbps x 27 kkm
Path: GVA-CHI-LA-PIT-CHI-GVA
The LHCnet is crossed in both directions
RTT = 466 ms
FAST TCP (18 streams)

Monitoring of LHCnet


An internationally collaborative effort: Brazilian example


2 + 1 Gbps network record on the Brazilian network



A few lessons

Logistics: server crashes, NIC driver problems, last-minute kernel tuning, router misconfiguration, optics issues.

Embarrassment of riches: mixing and matching 3 10G SCinet drops with 2 Abilene and 3 TeraGrid waves; better planning needed.

Mixing small flows (1 Gbps) with large flows (10 Gbps) and different RTTs can be tricky.

Are the small flows just filling in the leftover bandwidth, or do they negatively affect the large flows going over long distances? Different levels of FAST aggressiveness + packet losses on the path.

Efficient topological design: instead of a star-like topology that sources and sinks all traffic at the show floor, route traffic through a large router at the show floor but to remote servers.

Systematic tuning of the servers
Server crashes were the main reason for performance problems
10GbE NIC driver improvements
Better ventilation at the show
Using a large default socket buffer versus setting a large window in applications (iperf); see the sketch below
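
On the last point, here is a minimal Python sketch (not from the slides; the buffer size is an illustrative assumption) contrasting an application that sets its own window, as iperf does with -w, against one that relies on large system defaults:

```python
import socket

BUF_BYTES = 32 * 1024 * 1024  # ~bandwidth-delay-product-sized buffer (assumed)

# (a) Application sets the window explicitly, as iperf does with -w.
#     On Linux this fixes the buffer and disables autotuning for the socket;
#     the kernel may clamp the request to net.core.wmem_max / rmem_max.
explicit = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
explicit.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUF_BYTES)
explicit.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUF_BYTES)

# (b) Application leaves the buffers alone and relies on large system defaults
#     (e.g. net.ipv4.tcp_wmem / tcp_rmem raised by the administrator), letting
#     the kernel autotune the window.
default = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

print("explicitly set SO_SNDBUF:", explicit.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
print("system default SO_SNDBUF:", default.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
```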


Partners