CANS 12/1
How we quadrupled the network speed world record: Bandwidth Challenge at SC 2004
J. Bunn, D. Nae, H. Newman, S. Ravot, X. Su, Y. Xia
California Institute of Technology
Motivation and objectives
- High-speed terabyte transfers for physics: one-tenth of a terabit per second achieved (101 Gbps)
- Demonstrating that many 10 Gbps wavelengths can be used efficiently over various continental and transoceanic distances
- Preview of the globally distributed grid system now being developed in preparation for the next generation of high-energy physics experiments at CERN's Large Hadron Collider (LHC)
- Monitoring the WAN performance using MonALISA
Major partners: Caltech, FNAL, SLAC
Major sponsors: Cisco, S2io, HP, Newisys
Major networks: NLR, Abilene, ESnet, LHCnet, AMPATH, TeraGrid
5 NLR Waves dedicated to HEP
[Network map: NLR lambdas across SEA, SVL, LA, SD, CHI (Starlight), PSC, WDC and JAX, terminating at the SC04 showfloor in Pittsburgh. Waves assigned to CalTech/Newman, FL/Avery, SLAC, HOPI, Optiputer and UW/Research Channel. Wave IDs: NLR-SEAT-SAND-10GE-7, NLR-STAR-SEAT-10GE-6, NLR-PITT-STAR-10GE-13/16, NLR-PITT-LOSA-10GE-14/15, NLR-PITT-SUNN-10GE-17, NLR-PITT-SEAT-10GE-18, NLR-PITT-WASH-10GE-21, NLR-PITT-JACK-10GE-19. All lines 10GE.]
National Lambda Rail (www.nlr.net)
Cisco long-haul DWDM equipment: ONS 15808 and ONS 15454
10 GE LAN PHY
Network (II)
- 80 10GE ports
- 50 10GE network adapters
- 10 Cisco 7600/6500 routers dedicated to the experiment
- Backbone for the experiment built in two days
- Four dedicated NLR wavelengths to the Caltech booth: Los Angeles (2 waves), Chicago, Jacksonville
- One dedicated NLR wavelength to the SLAC-FNAL booth from Sunnyvale
- Connections (L3 peerings) to the showfloor via SCinet: ESnet, TeraGrid, Abilene
- Traffic balanced between WAN links
- UltraLight autonomous system dedicated to the experiment (AS32361)
- L2 10 GE connections to LA, JACK and CHI
- Dedicated LSPs (MPLS Label Switched Paths) on Abilene
Selecting the data transport protocol
Tests between CERN and Caltech:
- Capacity = OC-192 (9.5 Gbps); 264 ms round-trip latency; 1 flow
- Sending station: Tyan S2882 motherboard, 2 x Opteron 2.4 GHz, 2 GB DDR
- Receiving station (CERN OpenLab): HP rx4640, 4 x 1.5 GHz Itanium-2, zx1 chipset, 8 GB memory
- Network adapter: S2io 10 GE
Single-flow throughput by protocol:
  Linux TCP        3.0 Gbps
  Linux Westwood+  5.0 Gbps
  Linux BIC TCP    7.3 Gbps
  FAST             4.1 Gbps
Selecting 10 GE NICs

                               S2io adapter              Intel adapter
  Max. b2b TCP throughput      7.5 Gbps                  5.5 Gbps
  Rx frame buffer capacity     64 MB                     256 KB
  MTU                          9600 bytes                16114 bytes
  IPv4 TCP Large Send Offload  Max. offload size 64k     Partial; max. offload size 32k
TSO (TCP Segmentation Offload) requires hardware support in the NIC. It allows the TCP layer to hand a larger-than-normal segment of data, e.g. 64 KB, to the driver and then to the NIC; the NIC then fragments the large segment into smaller (<= MTU) packets.
Benefits: TSO can reduce CPU overhead by 10-15% and increase TCP responsiveness.
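On Linux, TSO support can be checked and toggled with ethtool; a sketch of the relevant commands (the interface name eth0 and the 9000-byte MTU are placeholders, not values from the slides):

```shell
# Inspect the current offload settings reported by the driver
ethtool -k eth0 | grep -i segmentation

# Enable TCP Segmentation Offload if the NIC/driver supports it
ethtool -K eth0 tso on

# Jumbo frames complement TSO on long fat pipes; 9000 is a common jumbo MTU
ip link set dev eth0 mtu 9000
```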
Time to recover to full rate after a single loss (standard TCP):
  p = (C * RTT^2) / (2 * MSS)
  p: time to recover to full rate
  C: capacity of the link
  RTT: round-trip time
  MSS: maximum segment size
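Plugging the CERN-Caltech test path into this formula shows why loss recovery dominates at these speeds. A minimal sketch; the 1460- and 8960-byte MSS values are typical standard and jumbo-frame figures, not numbers from the slides:

```python
def recovery_time(capacity_bps, rtt_s, mss_bytes):
    """p = C * RTT^2 / (2 * MSS): seconds for a standard TCP flow to
    regrow its window to full link rate after a single loss."""
    return capacity_bps * rtt_s ** 2 / (2 * mss_bytes * 8)

# CERN-Caltech test path: OC-192 (9.5 Gbps), 264 ms round-trip time
C, RTT = 9.5e9, 0.264

print(recovery_time(C, RTT, 1460))   # standard Ethernet MSS: ~28,000 s (~7.9 hours)
print(recovery_time(C, RTT, 8960))   # jumbo-frame MSS: ~4,600 s (~77 minutes)
```

A single loss costing hours of recovery is why jumbo frames and alternative congestion-control algorithms mattered so much in these tests.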
101 Gigabit Per Second Mark
[Chart: aggregate throughput reaching 101 Gbps; annotations mark unstable end-systems and the end of the demo]
Source: Bandwidth Challenge committee
PIT-CHI NLR wave utilization
[Chart: wave utilization during the BWC: 7.2 Gbps to FNAL, 5 Gbps from FNAL, 2 Gbps from Caltech@Starlight, 2 Gbps to Caltech@Starlight]
- FAST TCP
- Performance to/from FNAL is very stable
- Bandwidth utilization during BWC = 80%
Abilene Wave #1
- FAST TCP
- Traffic to CERN via Abilene & CHI
- Bandwidth utilization during BWC = 65%
- Instability due to traffic coming from Brazil
Abilene Wave #2
- TCP Reno
- Traffic to LA via Abilene & NY
- Bandwidth utilization during BWC = 52.5%
- Max. throughput is better with Reno, but FAST is more stable
- Each time a packet is lost, the Reno flow has to be restarted
Internet2 Land Speed Record submission
- Multi-stream: 6.86 Gbps x 27 kkm
- Path: GVA-CHI-LA-PIT-CHI-GVA; the LHCnet is crossed in both directions
- RTT = 466 ms
- FAST TCP (18 streams)
[Chart: monitoring of LHCnet]
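The window sizes such a record implies can be estimated from the bandwidth-delay product, using only the figures on this slide. A back-of-the-envelope sketch:

```python
def bdp_bytes(rate_bps, rtt_s):
    # Bandwidth-delay product: bytes that must be in flight to keep the pipe full
    return rate_bps * rtt_s / 8

aggregate = 6.86e9   # achieved throughput, in bits per second
rtt = 0.466          # GVA-CHI-LA-PIT-CHI-GVA round-trip time, in seconds
streams = 18         # FAST TCP streams used for the record

total = bdp_bytes(aggregate, rtt)
per_stream = total / streams
print(total / 2**20)        # ~381 MiB in flight across the whole path
print(per_stream / 2**20)   # ~21 MiB of window needed per stream
```

Windows of this size are far beyond typical OS defaults, which is why the socket-buffer tuning discussed in the lessons slide was essential.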
An internationally collaborative effort: Brazilian example
2+1 Gbps network record on the Brazilian network
A few lessons
- Logistics: server crashes, NIC driver problems, last-minute kernel tuning, router misconfiguration, optics issues.
- Embarrassment of riches: mixing and matching 3 10G SCinet drops with 2 Abilene and 3 TeraGrid waves calls for better planning.
- Mixing small flows (1 Gbps) with large flows (10 Gbps) and different RTTs can be tricky:
  - Are the small flows just filling in the leftover bandwidth, or do they negatively affect the large flows going over long distances?
  - Different levels of FAST aggressiveness + packet losses on the path.
- Efficient topological design instead of a star-like topology that sources and sinks all traffic at the showfloor (routing traffic through a large router at the showfloor but to remote servers).
- Systematic tuning of the servers:
  - Server crashes were the main cause of performance problems.
  - 10GbE NIC driver improvements.
  - Better ventilation at the show.
  - Using a large default socket buffer versus setting a large window in applications (iperf).
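The last point, large system defaults versus per-application window settings, corresponds to an application setting SO_SNDBUF/SO_RCVBUF explicitly, the way iperf does with its -w flag. A sketch; the 16 MiB request is illustrative, not a value from the slides:

```python
import socket

# Request a 16 MiB send buffer explicitly; the alternative is raising the
# system-wide defaults (net.core.wmem_default / net.ipv4.tcp_wmem on Linux)
# so every application gets large buffers without code changes.
requested = 16 * 2**20

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))  # system default

s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, requested)
# Linux caps the granted size at net.core.wmem_max and reports double the
# requested value to account for kernel bookkeeping overhead.
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
s.close()
```

If the kernel cap is below the request, the application silently gets a smaller window, which is one way long-RTT flows end up underperforming despite "correct" application settings.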