Long-Distance Experiments of the Data Reservoir System
Data Reservoir / GRAPE-DR Project
K. Hiraki, University of Tokyo
List of experiment members
• The University of Tokyo
• Fujitsu Computer Technologies
• WIDE Project
• Chelsio Communications

Cooperating networks (West to East)
• WIDE
• APAN / JGN2
• IEEAF / Tyco Telecommunications
• Pacific Northwest Gigapop
• CANARIE (CA*net 4)
• StarLight
• Abilene
• SURFnet
• SCinet
• The Universiteit van Amsterdam
• CERN
Computing Systems for Real Scientists
• Fast CPUs, huge memory and disks, good graphics
– Cluster technology, DSM technology, graphics processors
– Grid technology
• Very fast remote file accesses
– Global file systems, data-parallel file systems, replication facilities
• Transparency to local computation
– No complex middleware, and no or only small modifications to existing software
• Real scientists are not computer scientists
• Computer scientists are not a workforce for real scientists
Objectives of Data Reservoir / GRAPE-DR (1)
• Sharing scientific data between distant research institutes
– Physics, astronomy, earth science, simulation data
• High utilization of available bandwidth
– Transferred file data rate > 90% of available bandwidth, including header overheads and initial negotiation overheads (a quick arithmetic check follows below)
• OS and file system transparency
– Storage-level data sharing (high-speed iSCSI protocol on stock TCP)
– Fast single-file transfer
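To see how demanding the > 90% target is, here is a back-of-the-envelope overhead calculation (my arithmetic, assuming standard frames carrying TCP with the timestamp option; not from the slides): per-frame overheads alone cap TCP goodput near 94% of the wire rate.

```python
# Illustrative check of the ">90% of available bandwidth" target for
# standard 1500-byte Ethernet frames (assumption: TCP timestamps enabled).

PREAMBLE, ETH_HDR, FCS, IFG = 8, 14, 4, 12      # bytes on the wire per frame
MTU = 1500
IP_HDR, TCP_HDR, TCP_TS_OPT = 20, 20, 12        # IPv4 + TCP + timestamp option

wire_bytes = PREAMBLE + ETH_HDR + MTU + FCS + IFG    # 1538 bytes per frame
payload = MTU - IP_HDR - TCP_HDR - TCP_TS_OPT        # 1448 bytes of file data

print(f"max goodput fraction: {payload / wire_bytes:.3f}")   # ~0.941
```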
Objectives of Data Reservoir / GRAPE-DR (2)
• GRAPE-DR: very high-speed processor
– 2004 – 2008
– Successor of the GRAPE-6 astronomical simulator
• 2 PFLOPS on a 128-node cluster system (arithmetic check below)
– 1 GFLOPS / processor
– 1024 processors / chip
• Semi-general-purpose processing
– N-body simulation, gridless fluid dynamics
– Linear solvers, molecular dynamics
– High-order database searching (genome, protein data, etc.)
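A quick sanity check of these headline figures (my arithmetic, illustrative only): 1 GFLOPS per processor and 1024 processors per chip give 1 TFLOPS per chip, so 2 PFLOPS over 128 nodes implies on the order of 2,000 chips, roughly 16 per node.

```python
# Sanity check of the GRAPE-DR headline numbers (illustrative arithmetic).
flops_per_proc = 1e9        # 1 GFLOPS per processor
procs_per_chip = 1024       # processors per chip
target_flops = 2e15         # 2 PFLOPS system target
nodes = 128

chips = target_flops / (flops_per_proc * procs_per_chip)
print(f"~{chips:.0f} chips total, ~{chips / nodes:.0f} chips per node")
# ~1953 chips total, ~15 chips per node
```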
Data-intensive scientific computation through global networks
[Figure: data sources (Belle experiments, the X-ray astronomy satellite ASUKA, the SUBARU telescope, Nobeyama Radio Observatory (VLBI), nuclear experiments, the Digital Sky Survey, and the GRAPE-6 simulator) connect through a very high-speed network to Data Reservoir nodes; data analysis at the University of Tokyo proceeds through local accesses to distributed shared files.]
Basic Architecture
[Figure: two Data Reservoir nodes linked by a high-latency, very-high-bandwidth network; data is distributed and shared between the sites' cache disks (DSM-like architecture) via disk-block-level parallel, multi-stream transfer, while each site serves local file accesses.]
File accesses on Data Reservoir
[Figure: scientific detectors and user programs reach file servers through an IP switch; file servers stripe data across disk servers (1st-level striping at the file servers, 2nd-level striping across the disk servers), with disk access by iSCSI. Servers are IBM x345 (2.6 GHz x 2).]
Global Data Transfer
[Figure: at each site, scientific detectors and user programs attach to file servers through an IP switch; the disk servers at the two sites perform iSCSI bulk transfer across the global network.]
1st Generation Data Reservoir
• 1st generation Data Reservoir (2001 – 2002)
• Efficient use of network bandwidth (SC2002 technical paper)
• Single filesystem image (by striping of iSCSI disks)
• Separation of users' local file accesses from remote disk-to-disk replication
• Low-level data sharing (disk-block level, iSCSI protocol)
• 26 servers for 570 Mbps data transfer
• High sustained efficiency (≈93%)
• Low TCP performance
• Unstable TCP performance
Problems in the BWC2002 experiments
• Low TCP bandwidth due to packet losses
– TCP congestion window size control
– Very slow recovery from the fast-recovery phase (> 20 min; estimated below)
• Imbalance among parallel iSCSI streams
– Packet scheduling by switches and routers
– Users and other network users care only about the total behavior of the parallel TCP streams
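The "> 20 min" figure is roughly what standard Reno congestion control predicts: after a loss the window is halved and regrows by about one segment per RTT, so recovery takes W/2 RTTs. A rough estimate with numbers in the ballpark of the BWC2002 path (the 200 ms RTT is my assumption; the rate is from the slides):

```python
# Rough Reno recovery estimate: after a loss, cwnd regrows ~1 MSS per RTT,
# so recovering the lost half of the window takes W/2 RTTs.

rate_bps = 600e6        # ~600 Mbps peak (BWC2002)
rtt = 0.2               # ~200 ms Tokyo-Baltimore RTT (assumed)
mss = 1460              # bytes per segment

window = rate_bps * rtt / (8 * mss)    # ~10,270 segments to fill the pipe
recovery = (window / 2) * rtt          # ~1,030 s
print(f"recovery after one loss: ~{recovery / 60:.0f} minutes")   # ~17 min
```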
2nd Generation Data Reservoir
• 2nd generation Data Reservoir (2003)
• Inter-layer coordination for parallel TCP streams (SC2004 technical paper)
– Transmission Rate Controlled TCP (TRC-TCP) for Long Fat pipe Networks (see the pacing arithmetic below)
– "Dulling Edges of Cooperative Parallel streams" (DECP): load balancing among parallel TCP streams
– LFN tunneling NIC (Comet TCP)
• 10 times the 2002 performance
– 7.01 Gbps (disk to disk, TRC-TCP), 7.53 Gbps (Comet TCP)
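What "transmission rate controlled" means in practice is replacing window-sized bursts with evenly spaced packets. At these rates the required inter-frame gap is around a microsecond, which is one reason hardware support (the Comet TCP NIC) matters; a minimal sketch of the gap arithmetic:

```python
# Inter-frame spacing required for even pacing at a target wire rate.
# Gaps near one microsecond are hard to hit in software, hence NIC support.

def pacing_gap_us(rate_bps: float, frame_bytes: int = 1538) -> float:
    """Time between frame starts, in microseconds, at the given rate."""
    return frame_bytes * 8 / rate_bps * 1e6

for gbps in (1.0, 7.5):
    print(f"{gbps:>4} Gbps -> {pacing_gap_us(gbps * 1e9):.2f} us per frame")
# 1.0 Gbps -> 12.30 us;  7.5 Gbps -> 1.64 us
```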
[Figure: BWC2003 network map, with distances of 24,000 km (15,000 miles), 15,680 km (9,800 miles), and 8,320 km (5,200 miles); links OC-192 and OC-48 x 3 / GbE x 4; Juniper T320 router.]
Results
• BWC2002
– Tokyo → Baltimore, 10,800 km (6,700 miles)
– Peak bandwidth (on network): 600 Mbps
– Average file transfer bandwidth: 560 Mbps
– Bandwidth-distance product: 6,048 terabit-meters/second (recomputed below)
• BWC2003
– Phoenix → Tokyo → Portland → Tokyo, 24,000 km (15,000 miles)
– Average file transfer bandwidth: 7.5 Gbps
– Bandwidth-distance product: ≈168 petabit-meters/second
– More than 25 times the BWC2002 performance (bandwidth-distance product)
– Still low performance per TCP stream (460 Mbps)
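The record metric is simply payload bandwidth times path length, in bit-meters per second (the slides' "kilometers" unit does not match the stated values; meters does). Recomputing both results; note that the ≈168 Pbit·m/s figure corresponds to roughly 7.0 Gbps sustained over the 24,000 km path:

```python
# Bandwidth-distance product: average payload rate x path length (bit-m/s).

def bd_product(rate_bps: float, km: float) -> float:
    return rate_bps * km * 1e3

print(f"BWC2002: {bd_product(560e6, 10_800) / 1e12:,.0f} Tbit-m/s")  # 6,048
print(f"BWC2003: {bd_product(7.0e9, 24_000) / 1e15:.0f} Pbit-m/s")   # 168
```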
Problems in the BWC2003 experiments
• Still low single-stream TCP performance
– NICs are GbE
– Software overheads
• Large number of server and storage boxes for 10 Gbps iSCSI disk service
– Disk density
– TCP single-stream performance
– CPU usage for managing iSCSI and TCP
3rd Generation Data Reservoir
• Hardware and software basis for 100 Gbps distributed data-sharing systems
• 10 Gbps disk data transfer by a single Data Reservoir server
• Transparent support for multiple filesystems (detection of modified disk blocks)
• Hardware (FPGA) implementation of inter-layer coordination mechanisms
• 10 Gbps Long Fat pipe Network emulator and 10 Gbps data logger
Features of a single-server Data Reservoir
• Low-cost, high-bandwidth disk service
– A server + two RAID cards + 50 disks
– Small size, scalable to a 100 Gbps Data Reservoir (10 servers, 2 racks)
• Mother system for the GRAPE-DR simulation engine
• Compatible with the existing Data Reservoir system
• Single-stream TCP performance (must be more than 5 Gbps)
• Automatic adaptation to network conditions
– Maximize multi-stream TCP performance
– Load balancing among parallel TCP streams
Utilization of a 10 Gbps network
• A single-box 10 Gbps Data Reservoir server
• Quad Opteron server with multiple PCI-X buses (prototype: SUN V40z server)
• Two Chelsio T110 TCP-offloading NICs
• Disk arrays for the necessary disk bandwidth
• Data Reservoir software (iSCSI daemon, disk driver, data transfer manager)
[Figure: quad Opteron server (SUN V40z, Linux 2.6.6) with multiple PCI-X buses hosting two Chelsio T110 TCP NICs (10GBASE-SR to a 10G Ethernet switch) and two SCSI adaptors (Ultra320 SCSI to the disk arrays); the Data Reservoir software runs on the host.]
Single-stream TCP performance
• Importance of a hardware-supported "inter-layer optimization" mechanism (the window arithmetic below shows the scale of the problem)
– Comet TCP hardware (TCP tunnel over a non-TCP protocol)
– Chelsio T110 NIC (pacing by a network processor)
– Dual-ported FPGA-based NIC (under development)
• Standard TCP, standard Ethernet frame size
• Major problems
– Conflict between long-RTT and short-RTT flows
– Detection of optimal parameters
[Photos: Chelsio T110; dual-ported NIC by U. of Tokyo.]
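The scale of the single-stream problem is set by the bandwidth-delay product: the sender must keep a full RTT's worth of data in flight. For the paths in these experiments that means hundreds of megabytes of TCP window (rate and RTT below are taken from the SC2004 run):

```python
# Bandwidth-delay product: bytes that must be in flight to fill the pipe.

def bdp_bytes(rate_bps: float, rtt_s: float) -> float:
    return rate_bps * rtt_s / 8

inflight = bdp_bytes(7.21e9, 0.433)     # SC2004 path: 7.21 Gbps, 433 ms RTT
print(f"window needed: ~{inflight / 2**20:.0f} MiB")    # ~372 MiB
```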
Tokyo-CERN experiment (Oct. 2004)
• CERN – Amsterdam – Chicago – Seattle – Tokyo
– SURFnet – CA*net 4 – IEEAF/Tyco – WIDE
– 18,500 km WAN PHY connection
• Performance results
– 7.21 Gbps (TCP payload), standard Ethernet frame size, iperf
– 7.53 Gbps (TCP payload), 8K jumbo frames, iperf
– 8.8 Gbps disk-to-disk performance
• 9 servers, 36 disks
• 36 parallel TCP streams
Network topology of the CERN-Tokyo experiment
[Figure: route Tokyo (T-LEX) – Seattle (Pacific Northwest Gigapop) – Vancouver – Minneapolis – Chicago (StarLight) – Amsterdam (NetherLight) – CERN (Geneva), over WIDE/IEEAF, CA*net 4, and SURFnet, using 10GBASE-LW. Data Reservoir at Univ. of Tokyo: IBM x345 servers (dual Intel Xeon 2.4 GHz, 2 GB memory; Linux 2.6.6 on Nos. 2-7, Linux 2.4.x on No. 1) on GbE, plus an Opteron server (dual Opteron 248, 2.2 GHz, 1 GB memory, Linux 2.6.6) with a Chelsio T110 NIC, behind a Fujitsu XG800 12-port switch, a Foundry BI MG8, a Foundry FEX x448, a Foundry NetIron 40G, and an Extreme Summit 400. Data Reservoir at CERN (Geneva): the same server configuration with a Chelsio T110 NIC.]
Difficulties in the CERN-Tokyo experiment
• Detecting SONET-level errors
– Synchronization errors
– Low-rate packet losses
• Lack of tools for debugging the network
– Loopback at L1 is the only useful method
– No tools for performance bugs
• Very long latency
SC2004 Bandwidth Challenge (Nov. 2004)
• CERN – Amsterdam – Chicago – Seattle – Tokyo, plus Chicago – Pittsburgh
– SURFnet – CA*net 4 – IEEAF/Tyco – WIDE – JGN2 – Abilene
– 31,248 km (RTT 433 ms); 20,675 km under the shorter-distance LSR rules
• Performance results
– 7.21 Gbps (TCP payload), standard Ethernet frame size, iperf
– 1.6 Gbps disk-to-disk performance
• 1 server, 8 disks
• 8 parallel TCP streams
Bandwidth Challenge: Pittsburgh-Tokyo-CERN single-stream TCP
31,248 km connection between the SC2004 booth and CERN through Tokyo; latency 433 ms RTT.
[Figure: the Data Reservoir booth at SC2004 (Pittsburgh, SCinet backbone) connects over Abilene to Chicago (StarLight), over APAN/JGN II to Tokyo (T-LEX, APAN), over IEEAF/Tyco/WIDE across the Pacific to Seattle (Pacific Northwest Gigapop), over CANARIE through Vancouver, Calgary, and Minneapolis back to Chicago, and over SURFnet across the Atlantic through Amsterdam (NetherLight, University of Amsterdam) to the Data Reservoir project at CERN (Geneva). End hosts at both ends: Opteron servers (dual Opteron 248, 2.2 GHz, 1 GB memory, Linux 2.6.6) with Chelsio T110 NICs behind Fujitsu XG800 switches. Path equipment includes ONS 15454s, HDXc cross-connects, Juniper T640 and T320 routers, Procket routers, Force10 switches, and Foundry NetIron 40G and BI MG8 switches; the legend distinguishes routers or L3 switches, L3 switches used as L2, and L1/L2 switches.]
Pittsburgh-Tokyo-CERN single-stream TCP speed measurement
• Server configuration
– CPU: dual AMD Opteron 248, 2.2 GHz
– Motherboard: RioWorks HDAMA rev. E
– Memory: 1 GB, Corsair TwinX CMX512RE-3200LLPT x 2, PC3200, CL=2
– OS: Linux kernel 2.6.6
– Disk: Seagate IDE 80 GB, 7200 rpm (disk speed not essential for performance)
• Network interface card
– Chelsio T110 (10GBASE-SR), TCP offload engine (TOE)
– PCI-X/133 MHz I/O bus connection
– TOE function ON
• Technical points
– Traditional tuning and optimization methods
• Congestion window size, buffer size, queue sizes, etc.
• Ethernet frame size (standard frame, jumbo frame)
– Tuning and optimization by inter-layer coordination (Univ. of Tokyo)
• Fine-grained pacing by the offload engine
• Optimization of the slow-start phase
• Utilization of flow control
• Performance of single-stream TCP on the 31,248 km circuit
– 7.21 Gbps (TCP payload), standard Ethernet frames ⇒ 224,985 terabit-meters/second, 80% more than the previous Internet Land Speed Record (checked below)
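Checking the headline number (my arithmetic): about 7.2 Gbps over the full 31,248 km path gives the ≈225,000 Tbit·m/s quoted here, while the 20,675 km distance allowed under LSR rules gives the ≈148,850 Tbit·m/s that appears as the official record on the history chart later in this deck:

```python
# Land Speed Record metric: TCP payload rate x distance, in bit-meters/s.

rate_bps = 7.2e9                                  # ~7.2 Gbps TCP payload
for km, label in ((31_248, "full path"), (20_675, "LSR-rules distance")):
    print(f"{rate_bps * km * 1e3 / 1e12:,.0f} Tbit-m/s ({label})")
# 224,986 Tbit-m/s (full path); 148,860 Tbit-m/s (LSR-rules distance)
```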
[Figures: the network used in the experiment, spanning Pittsburgh, Chicago, Minneapolis, Calgary, Vancouver, Seattle, Tokyo, Amsterdam, and Geneva (CERN) over WIDE, APAN/JGN II, IEEAF/Tyco/WIDE, CANARIE, SURFnet, and Abilene; and the observed bandwidth on the network at the Tokyo T-LEX NetIron 40G.]
Difficulties in the SC2004 LSR challenge
• Method for performance debugging
– Dividing the path (CERN to Tokyo, Tokyo to Pittsburgh)
• Jumbo frame performance (unknown reason)
• Stability of the network
• QoS problems in the Procket 8800 routers
Year-end experiments
• Targets
– > 30,000 km LSR distance
– L3 switching at Chicago and Amsterdam
– Period of the experiment
• 12/20 – 1/3
• Holiday season, when the public research networks are relatively idle
• System configuration
– A pair of Opteron servers with Chelsio T110 NICs (at N-Otemachi)
– Another pair of Opteron servers with Chelsio T110 NICs for competing-traffic generation
– ClearSight 10 Gbps packet analyzer for packet capturing
Single-stream TCP: Tokyo – Chicago – Amsterdam – NY – Chicago – Tokyo
[Figure: from the Data Reservoir at Univ. of Tokyo (Opteron servers with Chelsio T110 NICs, a Fujitsu XG800 switch, and a ClearSight 10 Gbps capture system) through T-LEX over WIDE and IEEAF/Tyco to Seattle (Pacific Northwest Gigapop), over CANARIE through Vancouver, Calgary, and Minneapolis to Chicago (StarLight), over SURFnet to Amsterdam (NetherLight, University of Amsterdam), back over SURFnet to New York (MANLAN), then over Abilene to Chicago and over APAN/JGN2/TransPAC back to Tokyo. Links are OC-192 and 10 GbE WAN PHY; path equipment includes ONS 15454 and OME 6550 optical gear, HDXc cross-connects, Juniper T640s, Cisco 12416s and a 6509, Procket 8801/8812 routers, Force10 E1200/E600 switches, and a Foundry NetIron 40G; the legend distinguishes routers or L3 switches from L1/L2 switches.]
Figure 1a. Network topology (v0.06, Dec 21, 2004, [email protected])
[Detailed diagram: Data Reservoir servers (opteron1 and opteron3 with T110 NICs) at Tokyo/N-Otemachi connect through the T-LEX NetIron 40G and run via Seattle (UW), Vancouver, Chicago (StarLight), NYC (MANLAN), and Amsterdam (UvA). Equipment along the path includes ONS 15454s, OME 6500s, HDXc cross-connects, T640s, Cisco 12416s (AR5, CR1) and a 6509, Procket 8801/8812 routers, Force10 E1200/E600 switches, a Fujitsu XG switch, and a ClearSight 10GE analyzer on a tap. Links are OC-192 and 10GE-LW/LR/SR over WIDE, IEEAF, JGN II, TransPAC, CA*net 4, SURFnet, and Abilene; VLAN A (911), VLAN B (10), and VLAN D (914) carry the traffic across these "operational and production-oriented high-performance research and education networks".]
Figure 2. Network connection
[Map of the network used in the experiment: Tokyo – Seattle – Vancouver – Calgary – Minneapolis – Chicago – NYC – Amsterdam, over IEEAF/Tyco/WIDE, CANARIE (CA*net 4), SURFnet, APAN/JGN2, and Abilene; routers or L3 switches and L1/L2 switches are marked.]
Difficulties in the Tokyo-AMS-Tokyo experiment
• Packet loss and flow control (see the loss-rate bound below)
– About 1 loss per several minutes with flow control off
– About 100 losses per several minutes with flow control on
– Flow control at the edge switch does not work properly
• Network behavior differs by direction
– Artificial bursty behavior caused by switches and routers
– Outgoing packets from the NIC are paced properly
• Jumbo frame packet losses
– Packet losses begin from MTU 4500
– The NetIron 40G and Force10 E1200/E600 are on the path
• Lack of peak network flow information
– 5-minute average bandwidth is useless for TCP traffic
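Why "one loss per several minutes" still matters: the well-known Mathis et al. approximation, BW ≈ (MSS/RTT) · 1.22/√p, bounds the loss rate a single standard TCP stream can tolerate. At these speeds and RTTs the tolerable p is vanishingly small (illustrative; the experimental streams relied on pacing and offload rather than plain Reno):

```python
# Mathis et al. approximation: BW ~= (MSS_bits / RTT) * 1.22 / sqrt(p).
# Solved for p: the highest loss rate that still sustains the target rate.

def max_loss_rate(rate_bps: float, rtt_s: float, mss_bytes: int = 1460) -> float:
    return (1.22 * mss_bytes * 8 / (rtt_s * rate_bps)) ** 2

p = max_loss_rate(7.2e9, 0.433)     # ~7.2 Gbps over the 433 ms RTT path
print(f"tolerable loss rate: {p:.1e}")   # ~2.1e-11 -- essentially zero loss
```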
Network traffic on routers and switches
[Graphs: traffic at the StarLight Force10 E1200, the University of Amsterdam Force10 E600, the Abilene T640 (NYCM to CHIN), and the TransPAC Procket 8801, with the submitted run marked.]
History of the single-stream IPv4 Internet Land Speed Record
[Chart: distance-bandwidth product (Tbit·m/s, log scale from 1,000 to 1,000,000) by year, 2000 – 2007. Marked points: 2004/11/9, Data Reservoir project / WIDE project, 148,850 Tbit·m/s (Internet Land Speed Record); 2004/12/25, Data Reservoir project / WIDE project / SURFnet / CANARIE, 216,300 Tbit·m/s (experimental result).]
Setting for instrumentation
• Capture packets at both the source end and the destination end
• Packet filtering by src-dst IP address (sketched after the figure below)
• Capture up to 512 MB
• Precise packet counting was very useful
[Figure: source server → network under test → destination server, with a ClearSight 10 Gbps capture system attached through a Fujitsu XG800 10G L2 switch; a variant taps the source-side link (tap + XG800 + ClearSight).]
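A minimal sketch of the filtered capture described above, using scapy rather than the ClearSight analyzer actually employed (the interface name and addresses are hypothetical):

```python
# Sketch of src/dst-filtered packet capture (the experiment used a dedicated
# ClearSight analyzer; scapy is shown only to illustrate the filtering idea).
from scapy.all import sniff, wrpcap

SRC, DST = "192.0.2.1", "192.0.2.2"     # hypothetical source/destination IPs

pkts = sniff(iface="eth1",                          # tap-facing interface
             filter=f"host {SRC} and host {DST}",   # BPF: both directions
             count=100_000)                         # bounded capture
wrpcap("capture.pcap", pkts)
print(f"captured {len(pkts)} packets")
```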
Packet count example
[Screenshot of packet counts from the capture system.]
Traffic monitor example
[Screenshot of traffic monitor output.]
Possible arrangement of devices
• 1U lossless packet-capturing box developed at U. of Tokyo
– Two 10GBASE-LR interfaces
– Lossless capture up to 1 GB
– Precise time stamping
• Packet-generating server
– 3U Opteron server + 10 Gbps Ethernet adaptor
– Pushes the network at more than 5 Gbps TCP/UDP
• Taps and a tap-selecting switch
– Currently, port mirroring is not good enough for instrumentation
– Capture backbone traffic as well as server in/out
2-port packet-capturing box
• Originally designed for emulating a 10 Gbps network (toy model below)
– Two 10GBASE-LR interfaces
– 0 to 1000 ms artificial latency (very long FIFO queue)
– Error rate 0 to 0.1% (0.001% steps)
• Lossless packet capturing
– Utilizes the FIFO queue capability
– Data uploadable through USB
– Cheaper than ClearSight or Agilent Technologies analyzers
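The latency/error emulation amounts to a long delay FIFO with a configurable random drop probability at ingress. A toy software model of that behavior (illustrative only; not the box's actual hardware implementation):

```python
# Toy model of the box's delay + loss emulation (the real device implements
# this in hardware with a very long FIFO).
import random
from collections import deque

class DelayLossEmulator:
    def __init__(self, delay_ms: float, loss_rate: float):
        self.delay_s = delay_ms / 1e3       # 0 to 1000 ms artificial latency
        self.loss_rate = loss_rate          # e.g. 0.001 for a 0.1% error rate
        self.fifo = deque()                 # (release_time, packet) pairs

    def ingress(self, now: float, pkt: bytes) -> None:
        if random.random() >= self.loss_rate:    # drop with prob. loss_rate
            self.fifo.append((now + self.delay_s, pkt))

    def egress(self, now: float) -> list:
        out = []
        while self.fifo and self.fifo[0][0] <= now:  # release matured packets
            out.append(self.fifo.popleft()[1])
        return out
```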
Setting for instrumentation
• Capture packets at both the source end and the destination end
• Packet filtering by src-dst IP address
[Figure: a backbone router or switch, plus a server with a 10G Ethernet adaptor, attach to the network under test through the 2-port packet-capture box; an alternative places the box on a TAP into the network under test.]
More expensive plan
• Capture packets at both the source end and the destination end
• Packet filtering by src-dst IP address
[Figure: as above, but with TAPs on both sides of the network under test, each feeding its own 2-port packet-capture box.]
Summary
• Single-stream TCP
– We removed the TCP-related difficulties
– Now I/O bus bandwidth is the bottleneck
– Cheap and simple servers can enjoy 10 Gbps networks
• Lack of methodology for high-performance network debugging
– 3 days of debugging (working overnight)
– 1 day of stable operation (usable for measurements)
– The network seems to get fatigued; some trouble always happens
– We need something more effective
• Detailed issues
– Flow control (and QoS)
– Buffer sizes and policies
– Optical-level settings
Thanks