
USATLAS Network/Storage and Load Testing Jay Packard Dantong Yu Brookhaven National Lab



Page 1

USATLAS Network/Storage and Load Testing

Jay Packard

Dantong Yu

Brookhaven National Lab

Page 2

Outline

USATLAS Network/Storage Infrastructures: Platform for Performing Load Tests.

Load Test Motivation and Goals.

Load Test Status Overview.

Critical Components in Load Testing: Control and Monitoring, Network Monitoring, and Weather Maps.

Detailed Plots for Single- vs. Multiple-host Load Tests.

Problems.

Proposed Solutions: Network Research and Its Role in Dynamic Layer 2 Circuits between BNL and US ATLAS Tier 2 sites.

Page 3

BNL 20 Gig-E Architecture Based on Cisco 6513

20 Gbps LAN for LHCOPN.

20 Gbps for Production IP.

Full Redundancy: can survive the failure of any network switch.

No Firewall for LHCOPN, as shown by the green lines.

Two Firewalls for all other IP networks.

Cisco Firewall Services Module (FWSM), a line card plugged into the Cisco chassis with 5 x 1 Gbps capacity, allows outgoing connections.

Page 4

dCache and Network Integration

[Diagram: BNL dCache and network integration. Components shown include the HPSS Mass Storage System; dCache SRM and core servers; GridFTP doors (8 nodes); write pool / farm pool (434 nodes / 360 TB); T0 export pool (>=30 nodes); new farm pool (80 nodes, 360 TB raw); Thumpers (30 nodes, 720 TB raw); load-testing hosts; and the new Panda server and Panda DB. Links: 2x10 Gb/s WAN to ESnet over the BNL LHC OPN VLAN and Tier 1 VLANs, 20 Gb/s LAN paths, 8 x 1 Gb/s and N x 1 Gb/s connections to the pools, plus logical connections for the FTS-controlled and srmcp paths.]

Page 5

Tier 2 Network Example: ATLAS Great Lakes Tier 2

Page 6

Need More Details of Tier 2 Network/Storage Infrastructures

We hope to see architectural maps from each Tier 2 in the site reports, describing the integration of the Tier 2 network with the production and testing storage systems.

Page 7

Goal

Develop a toolkit for testing and viewing I/O performance at various middleware layers (network, grid-ftp, FTS) in order to isolate problems.

Single-host transfer optimization at each layer.
120 MB/s is ideal for memory-to-memory transfer and high-performance storage.
40 MB/s is ideal for disk transfer to a regular worker node.

Multi-host transfer optimization for sites with 10 Gbps connectivity.
Starting point: sustained 200 MB/s (about 1.6 Gbps) disk-to-disk transfer for 10 minutes between Tier 1 and each Tier 2 (goal set by Rob Gardner). Then increase disk-to-disk transfer to 400 MB/s.
For sites with a 1 Gbps bottleneck, we should max out the network capacity.

Page 8

Status Overview

A MonALISA control application has been developed for specifying single-host transfer, protocol, duration, size, stream range, TCP buffer range, etc. It is currently run only by Jay Packard at BNL, but may eventually be run by multiple administrators at other sites within the MonALISA framework.

A MonALISA monitoring plugin has been developed to display current results in graphs. They are available in the MonALISA client (http://monalisa.cacr.caltech.edu/ml_client/MonaLisa.jnlp) and will soon be available on a web page.

Page 9

Status Overview...

Have been performing single-host tests for the past few months. Types:
Network memory to memory (using iperf)
Grid-ftp memory to memory (using globus-url-copy)
Grid-ftp memory to disk (using globus-url-copy)
Grid-ftp disk to disk (using globus-url-copy)

At least one host at each site has been TCP tuned, which has produced dramatic improvements at some sites (e.g., 5 MB/s to 100 MB/s in the iperf tests).

If a Tier 2 has 10 Gbps connectivity, there is significant improvement for a single TCP stream, from 50 Mbps to close to 1 Gbps (IU, UC, BU, UMich).

If a Tier 2 has a 1 Gbps bottleneck, network performance can be improved with multiple TCP streams; simply tuning the TCP buffer size cannot improve single-stream performance due to bandwidth competition (a per-socket buffer sketch follows at the end of this list).

Discovered problems: dirty fiber, CRC errors on a network interface, and moderate TCP buffer sizes; details can be found in Shawn's talk.

Coordinating between Michigan and BNL (Hiro Ito, Shawn McKee, Robert Gardner, Jay Packard) to measure and optimize total throughput using FTS disk-to-disk transfers. We are trying to leverage high-performance storage (Thumpers at BNL and a Dell NAS at Michigan) to achieve our goal.
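For reference, below is a minimal, hypothetical Java sketch of what a per-socket TCP buffer setting looks like at the application level; the actual load tests set buffers through iperf and globus-url-copy options plus host-level kernel tuning, not through this code, and the host name and port are placeholders.

import java.net.InetSocketAddress;
import java.net.Socket;

public class TcpBufferExample {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint; substitute a real test host and port.
        String host = "iperf-server.example.org";
        int port = 5001;

        Socket socket = new Socket();
        // Request 8 MB send/receive buffers, matching the 8000000-byte
        // tcpBufferBytes value used in the load-test configuration.
        // The OS may grant less than requested (kernel limits apply).
        socket.setSendBufferSize(8_000_000);
        socket.setReceiveBufferSize(8_000_000);
        socket.connect(new InetSocketAddress(host, port), 10_000);

        System.out.println("Effective send buffer:    " + socket.getSendBufferSize());
        System.out.println("Effective receive buffer: " + socket.getReceiveBufferSize());
        socket.close();
    }
}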

Page 10

MonALISA Control Application

Our Java class implements MonALISA's AppInt interface as a plug-in.

900 lines of code currently.

Does the following (a standalone sketch of this flow appears at the end of this slide):
Generates and prepares source files for disk-to-disk transfer
Starts up the remote iperf server and local iperf client, using globus-job-run remotely and ssh locally
Runs iperf or grid-ftp for a period of time and collects output
Parses output for average and maximum throughput
Generates output understood by the monitoring plugin
Cleans up destination files
Stops iperf servers

Flexible to account for heterogeneous sites (e.g., killing iperf is done differently on a managed-fork gatekeeper; one site runs BWCTL instead of iperf). This flexibility in the code requires frequently watching the output of the application and augmenting the code to handle many circumstances.
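Below is a minimal, standalone Java sketch of this flow; it is not the actual 900-line MonALISA AppInt plug-in (whose interface is not reproduced here), and the destination host, iperf options, and emitted log fields are illustrative. It launches a local iperf client, parses the reported throughput, and prints a line in the format consumed by the monitoring plugin.

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class LoadTestSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder destination; the real tests paired a BNL host such as
        // dct00.usatlas.bnl.gov with a host at each Tier 2.
        String remoteHost = "tier2-host.example.org";
        int streams = 1;
        int tcpBufferBytes = 8_000_000;

        // Run an iperf client for a fixed period; -P sets parallel streams,
        // -w the TCP window, -f M reports throughput in MBytes/sec.
        ProcessBuilder pb = new ProcessBuilder(
                "iperf", "-c", remoteHost, "-t", "120",
                "-P", String.valueOf(streams),
                "-w", String.valueOf(tcpBufferBytes),
                "-f", "M");
        pb.redirectErrorStream(true);
        Process proc = pb.start();

        double maxMBps = 0.0;
        try (BufferedReader out = new BufferedReader(
                new InputStreamReader(proc.getInputStream()))) {
            String line;
            while ((line = out.readLine()) != null) {
                // iperf summary lines end with "...  <rate> MBytes/sec"
                if (line.contains("MBytes/sec")) {
                    String[] tok = line.trim().split("\\s+");
                    double mbps = Double.parseDouble(tok[tok.length - 2]);
                    maxMBps = Math.max(maxMBps, mbps);
                }
            }
        }
        proc.waitFor();

        // Emit a line in the monitoring plugin's format:
        // time, site name, module, host pair, statistic, value
        System.out.printf("%d, BNL_ITB_Test1, Loadtest, bnl->tier2, "
                + "network_m2m_avg_01s_08m, %.2f%n",
                System.currentTimeMillis(), maxMBps);
    }
}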

Page 11

MonALISA Control Application...

• Generates the average and maximum throughput during a 2-minute interval, which is required for the throughput to “ramp up”.

• Sample configuration for grid-ftp memory-to-disk:

command=gridftp_m2d
startHours=4,16
envScript=/opt/OSG_060/setup.sh
fileSizeKB=5000000
streams=1, 2, 4, 8, 12
repetitions=1
repetitionDelaySec=1
numSrcHosts=1
timeOutSec=120
tcpBufferBytes=4000000, 8000000
hosts=dct00.usatlas.bnl.gov, atlas-g01.bu.edu/data5/dq2-cache/test/, atlas.bu.edu/data5/dq2-cache/test/, umfs02.grid.umich.edu/atlas/data08/dq2/test/, umfs05.aglt2.org/atlas/data16/dq2/test/, dq2.aglt2.org/atlas/data15/mucal/test/, iut2-dc1.iu.edu/pnfs/iu.edu/data/test/, uct2-dc1.uchicago.edu/pnfs/uchicago.edu/data/ddm1/test/, gk01.swt2.uta.edu/ifs1/dq2_test/storageA/test/, tier2-02.ochep.ou.edu/ibrix/data/dq2-cache/test/, ouhep00.nhn.ou.edu/raid2/dq2-cache/test/, osgserv04.slac.stanford.edu/xrootd/atlas/dq2/tmp/
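As an illustration only, the sketch below loads a key=value configuration like the one above with java.util.Properties and expands the streams x tcpBufferBytes combinations for each destination host; the file name is hypothetical, and the expansion logic is inferred from the stream/buffer suffixes in the published statistic names (e.g. 01s_08m), not copied from the real control application.

import java.io.FileReader;
import java.util.Properties;

public class LoadTestConfig {
    public static void main(String[] args) throws Exception {
        // Hypothetical file holding the key=value pairs shown above.
        Properties cfg = new Properties();
        try (FileReader in = new FileReader("loadtest.properties")) {
            cfg.load(in);
        }

        String command = cfg.getProperty("command");             // e.g. gridftp_m2d
        String[] streams = cfg.getProperty("streams").split("\\s*,\\s*");
        String[] buffers = cfg.getProperty("tcpBufferBytes").split("\\s*,\\s*");
        String[] hosts = cfg.getProperty("hosts").split("\\s*,\\s*");

        // Enumerate one test per (host, stream count, TCP buffer size) triple.
        for (String host : hosts) {
            for (String s : streams) {
                for (String b : buffers) {
                    System.out.printf("%s: host=%s streams=%s tcpBufferBytes=%s%n",
                            command, host, s, b);
                }
            }
        }
    }
}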

Page 12

MonALISA Monitoring Application

Java class that implements MonALISA's MonitoringModule interface.

Much simpler than the controlling application (only 180 lines of code).

Parses the log file produced by the controlling application, in the format (time, site name, module, host pair, statistic, value):
1195623346000, BNL_ITB_Test1, Loadtest, bnl->uta(dct00->ndt), network_m2m_avg_01s_08m, 6.42
(01s = 1 stream, 08m = TCP buffer size of 8 MB; a parsing sketch follows at the end of this slide)

Data is pulled by the MonALISA server, which displays a graph upon demand.
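The sketch below parses one such log line into its fields; it is illustrative only and does not reproduce the real MonitoringModule implementation or MonALISA's result types.

public class LoadTestRecord {
    final long timeMillis;
    final String site, module, hostPair, statistic;
    final double value;

    LoadTestRecord(long timeMillis, String site, String module,
                   String hostPair, String statistic, double value) {
        this.timeMillis = timeMillis;
        this.site = site;
        this.module = module;
        this.hostPair = hostPair;
        this.statistic = statistic;
        this.value = value;
    }

    // Parse e.g. "1195623346000, BNL_ITB_Test1, Loadtest,
    //             bnl->uta(dct00->ndt), network_m2m_avg_01s_08m, 6.42"
    static LoadTestRecord parse(String line) {
        String[] f = line.split("\\s*,\\s*");
        return new LoadTestRecord(Long.parseLong(f[0]), f[1], f[2], f[3], f[4],
                Double.parseDouble(f[5]));
    }

    public static void main(String[] args) {
        LoadTestRecord r = parse("1195623346000, BNL_ITB_Test1, Loadtest, "
                + "bnl->uta(dct00->ndt), network_m2m_avg_01s_08m, 6.42");
        // The statistic name encodes the test: network_m2m = network memory to
        // memory, 01s = 1 stream, 08m = 8 MB TCP buffer.
        System.out.println(r.site + " " + r.statistic + " = " + r.value);
    }
}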

Page 13

Single-host Tests

• Too many graphs to show them all, but two key graphs will be shown. For one stream:

Page 14

Single-host Tests...

• For 12 streams (notice disk-to-disk improvement):

Page 15

Multi-host tests

Using FTS to perform tests from BNL to Michigan initially, and then to other Tier 2 sites.

The goal is a sustained 200 MB/s disk-to-disk transfer for 10 minutes from Tier 1 to each Tier 2. This can be in addition to existing traffic.

Trying to find the optimum number of streams and TCP buffer size to use by finding the optimum for single-host transfers between two high-performance machines.
Disk-to-disk, one-stream performance from BNL's Thumper to Michigan's Dell NAS is low, at 2 MB/s, whereas iperf memory-to-memory with one stream gives 66 MB/s between the same hosts (Nov 21, 07). Should this be higher for one stream?

Found that more streams give higher throughput, but too many cannot be used, especially with a large TCP buffer size, or the applications will crash.

Disk-to-disk throughput is currently so low that a larger TCP buffer doesn't matter.

Page 16

Multi-host Tests and Monitoring

Monitoring uses netflow graphs rather than MonALISA, available at http://netmon.usatlas.bnl.gov/netflow/tier2.html. Some sites will likely require the addition of more storage pools and doors, each TCP tuned, to achieve the goal.

Page 17

Problems

Getting reliable testing results amidst existing traffic:
Each test runs for a couple of minutes and produces several samples, so hopefully a window exists when traffic is low, during which the maximum is attained.
The applications could be changed to output the maximum of the last few tests (tricky to implement).
Use dedicated network circuits: TeraPaths.

Disk-to-disk bottleneck:
Not sure if the problem is the hardware or the storage software (e.g., dCache, Xrootd).
FUSE (Filesystem in Userspace), or an in-memory filesystem, could help isolate storage-software degradation; Bonnie could help isolate hardware degradation.
Is there anyone who could offer disk performance expertise? Discussed in Shawn McKee's presentation, 'Optimizing USATLAS Data Transfers.'

Progress is happening slowly due to a lack of in-depth coordination, scheduling difficulties, and a lack of manpower (Jay is working at ~1/3 FTE). There is too much on the agenda at the Computing Integration and Operations meeting to allow for in-depth coordination.
Ideas for improvement

Page 18

TeraPaths and Its Role in Improving Network Connectivity between BNL and US ATLAS Tier 2 Sites

The problem: support efficient, reliable, predictable peta-scale data movement in modern high-speed networks.
Multiple data flows with varying priority.
Default “best effort” network behavior can cause performance and service disruption problems.

Solution: enhance network functionality with QoS features to allow prioritization and protection of data flows.
Treat the network as a valuable resource.
Schedule network usage (how much bandwidth and when).
Techniques: DiffServ (DSCP), PBR, MPLS tunnels, dynamic circuits (VLANs).

Collaboration with ESnet (OSCARS) and Internet2 (DRAGON) to dynamically create end-to-end paths and dynamically forward traffic into these paths. Software is being deployed to US ATLAS Tier 2 sites.
Option 1: Layer 3: MPLS tunnels (UMich and SLAC)
Option 2: Layer 2: VLANs (BU, UMich, demonstrated at SC'07)

Page 19

Northeast Tier 2 Dynamic Network Links

Page 20

Questions?