PRP-FIONA workshop, Bozeman, August 2, 2018
Basics of Data Transfer
Shawfeng Dong
Principal Cyberinfrastructure Engineer
University of California, Santa Cruz
Outline
• Data Transfer Nodes (DTNs)
• Limitation of Common File Transfer Tools
• Fast Data Transfer Utilities
Data Transfer Nodes (DTNs)
• DTNs are typically Linux servers built with high-quality components and configured specifically for wide area data transfer
• Types of DTNs (https://fasterdata.es.net/science-dmz/science-dmz-architecture/)
  o Fat DTNs
  o Thin DTNs
  o Clustered DTNs
Fat DTN in Simple Science DMZ
https://fasterdata.es.net/science-dmz/science-dmz-architecture/
Thin DTNs in HPC Facility
https://fasterdata.es.net/science-dmz/science-dmz-architecture/
Clustered DTNs
https://fasterdata.es.net/science-dmz/science-dmz-architecture/
Data Transfer Tools
• Common File Transfer Tools
  o FTP, scp/sftp, rsync, wget, curl, etc.
• Fast Data Transfer Utilities
  o GridFTP
  o Globus
  o bbcp
  o bbFTP
  o FDT (Fast Data Transfer)
  o Etc.
Selecting a File Transfer Tool
• Security model
  1. Anonymous (e.g., FTP, HTTP): anyone can access the data
  2. Simple password (e.g., FTP, HTTP): most sites no longer allow this method since the password can be easily captured
  3. Password encrypted (e.g., bbcp, bbftp, GridFTP, Globus, FDT): the control channel is encrypted, but the data is unencrypted
  4. Everything encrypted (e.g., scp, sftp, rsync over ssh, GridFTP, HTTPS-based web server): both control and data channels are encrypted
• Support for parallel data streams
https://fasterdata.es.net/data-transfer-tools/background/
Limitation of Common File Transfer Tools
• To obtain maximum throughput over a high-speed WAN, one needs to use a file transfer tool that supports parallel data streams.
• Unfortunately, almost none of the commonly used file transfer tools support parallel data streams.
  o FTP, scp/sftp, rsync, wget, curl, etc.
scp and sftp
• The OpenSSH versions of scp and sftp have a built-in 1 MB buffer (previously only 64 KB in OpenSSH older than version 4.7) that severely limits performance on a WAN!
• The HPN patch from PSC makes it possible to optimize single stream performance on a WAN.
https://fasterdata.es.net/data-transfer-tools/
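Why a fixed buffer hurts on a WAN can be shown with a rough calculation. The sketch below treats the buffer as a fixed flow-control window, so a single stream can carry at most one window of data per round trip (throughput <= window / RTT). The round-trip times are illustrative assumptions, not measurements from this talk.

```python
def max_single_stream_throughput(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound in bytes/s for one stream limited by a fixed window: window / RTT."""
    if rtt_seconds <= 0:
        raise ValueError("rtt_seconds must be positive")
    return window_bytes / rtt_seconds


ONE_MB = 1024 * 1024        # buffer size quoted for current OpenSSH
SIXTY_FOUR_KB = 64 * 1024   # buffer size quoted for OpenSSH older than 4.7

# Assumed round-trip times, chosen only for illustration.
EXAMPLE_RTTS = [
    ("LAN, 1 ms", 0.001),
    ("regional, 10 ms", 0.010),
    ("cross-country, 70 ms", 0.070),
]

for label, rtt in EXAMPLE_RTTS:
    new_mbps = max_single_stream_throughput(ONE_MB, rtt) * 8 / 1e6
    old_mbps = max_single_stream_throughput(SIXTY_FOUR_KB, rtt) * 8 / 1e6
    print(f"{label}: 1 MB buffer <= {new_mbps:,.0f} Mb/s, "
          f"64 KB buffer <= {old_mbps:,.1f} Mb/s")
```

Under these assumptions the bound shrinks in proportion to the round-trip time, which is why a tool that is fast on a LAN can be slow across the country.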
GridFTP
• Two-channel protocol, like FTP
• Control Channel
  o Communication link (TCP) over which commands and responses flow
  o Low bandwidth; encrypted and integrity protected by default
• Data Channel
  o Communication link(s) over which the actual data of interest flows
  o Multiple simultaneous TCP streams
  o High bandwidth; authentication by default; encryption and integrity protection optional
http://www.mcs.anl.gov/~kettimut/talks/gridnets09.pdf
GridFTP Control Channels
• GSIFTP
  o Uses Grid Security Infrastructure (GSI) for authentication, and optionally for data channel encryption
  o Based on X.509 certificates
  o Allows third-party transfer
  o URL: gsiftp://hostname/path/to/remote/file
• SSHFTP (GridFTP-over-SSH)
  o Uses SSH for authentication
  o URL: sshftp://hostname/path/to/remote/file
  o See ESnet's GridFTP Quick Start Guide on how to enable sshftp
GridFTP Clients
• The Globus Toolkit provides a GridFTP CLI client called globus-url-copy
• Globus Online: a fast, reliable file transfer service that makes it easy for users to move data between two GridFTP servers, or between a GridFTP server and a user’s machine (Windows, Mac or Linux) using Globus Connect Personal
• The Globus Command Line Interface, called globus
BBCP
• Written by Andy Hanushevsky at SLAC as a tool for the BaBar collaboration
• "Short of access to a GridFTP site, bbcp appears to be the fastest, most convenient single-node method for transferring data", Harry Mangalam of UC Irvine
• Does not require a server process running on the remote host; the remote bbcp is invoked via ssh
• Uses SSH for authentication, similar to SSHFTP (GridFTP-over-SSH)
• Supports parallel TCP data streams
BBCP Installation
• For simplicity, download a precompiled bbcp binary executable and place it in /usr/local/bin:
cd /usr/local/bin/
wget http://www.slac.stanford.edu/~abh/bbcp/bin/amd64_rhel60/bbcp
chmod +x bbcp
• On the server host, append two lines to /etc/services:
bbcpfirst 60000/tcp # bbcp
bbcplast 60100/tcp # bbcp
• Open inbound TCP ports 60000-60100
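Before opening the firewall, it can help to confirm that the two /etc/services entries are present and describe a sensible range. The following is a minimal sketch that parses services-style text; the function name `bbcp_port_range` and the sample text are illustrative assumptions and are not part of bbcp.

```python
from typing import Tuple


def bbcp_port_range(services_text: str) -> Tuple[int, int]:
    """Return (first, last) TCP ports from bbcpfirst/bbcplast entries."""
    ports = {}
    for line in services_text.splitlines():
        # Drop trailing comments, then split into "name port/proto".
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] in ("bbcpfirst", "bbcplast"):
            port, _, proto = fields[1].partition("/")
            if proto == "tcp":
                ports[fields[0]] = int(port)
    if "bbcpfirst" not in ports or "bbcplast" not in ports:
        raise ValueError("bbcpfirst and bbcplast must both be defined for tcp")
    if ports["bbcpfirst"] > ports["bbcplast"]:
        raise ValueError("bbcpfirst must not be greater than bbcplast")
    return ports["bbcpfirst"], ports["bbcplast"]


# Sample text mirroring the two lines from the slide, plus one unrelated entry.
SAMPLE = """\
ssh 22/tcp # The Secure Shell (SSH) Protocol
bbcpfirst 60000/tcp # bbcp
bbcplast 60100/tcp # bbcp
"""

first, last = bbcp_port_range(SAMPLE)
print(f"bbcp data ports: {first}-{last} ({last - first + 1} ports)")
```

On a real host, the same function could be applied to the contents of /etc/services, and the resulting range compared with the inbound firewall rule.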
Example #1: Testing your DTN against an ESnet DTN
• ESnet has deployed a set of test hosts for high-speed disk-to-disk testing
• Running globus-url-copy on your DTN:
# make sure you can connect to the server
globus-url-copy -list ftp://sunn-dtn.es.net:2811/data1/
# copy a 10G file using 4 parallel streams
globus-url-copy -vb -fast -p 4 \
ftp://sunn-dtn.es.net:2811/data1/10G.dat \
file:///tmp/test.out
Example #1: Testing your DTN against an ESnet DTN
• Using Globus Online, provided you have:
  o Installed Globus Connect Server on your DTN
  o Configured a Globus Authentication/Authorization method for your DTN: MyProxy, OAuth, or CILogon
  o Created an Endpoint for your DTN
• Globus Quick Start Guide at ESnet
Example #2: Moving data from NERSC to UCSC
• Case scenario:
  o I have a valid account at NERSC
  o I wanted to move some of my simulation data from NERSC to UCSC
• There are four data transfer nodes deployed at NERSC which allow interactive use: dtn0[1-4].nersc.gov
• Each NERSC DTN has four 10-gigabit Ethernet links for transfers over the network and two FDR IB connections to the filesystem
Example #2: Moving data from NERSC to UCSC
• Using BBCP to copy data from NERSC
# download a 100GB data file from NERSC to our DTN at UCSC
bbcp -z -P 10 -S "ssh -x -a -oFallBackToRsh=no %I -l %U %H /usr/common/usg/bin/bbcp" \
[email protected]::/global/cscratch1/sd/shawdong/100GB.dat \
/bigdata/dong/100GB-bbcp.dat
• We got a remarkable data transfer speed of about 994 MB/s!
Example #2: Moving data from NERSC to UCSC
• Using BBCP to copy data from NERSC
# download a 100GB data file (on Edison's Lustre filesystem) from NERSC
bbcp -z -P 8 -S "ssh -x -a -oFallBackToRsh=no %I -l %U %H /usr/common/usg/bin/bbcp" \
[email protected]::/global/cscratch1/sd/shawdong/100GB.dat \
/bigdata/dong/100GB-bbcp.dat
# download one of Eli Dart's test datasets (on GPFS)
bbcp -z -P 8 -S "ssh -x -a -oFallBackToRsh=no %I -l %U %H /usr/common/usg/bin/bbcp" \
[email protected]::/global/project/projectdirs/mpccc/dart/test-data/50G.dat \
/bigdata/dong/50G-bbcp.dat
• We got a remarkable data transfer speed of 1.1 GB/s!
Example #2: Moving data from NERSC to UCSC
• Using GridFTP command line tools to copy data from NERSC is slightly more involved (SSHFTP is not enabled at NERSC)
# initialize my MyProxy certificate
myproxy-logon -l shawdong -s nerscca.nersc.gov
# download one of Eli Dart's datasets (on GPFS)
globus-url-copy -list
globus-url-copy -vb -fast -p 8 \
gsiftp://[email protected]/global/project/projectdirs/mpccc/dart/test-data/50G.dat \
file:/bigdata/dong/50GB-gsiftp.dat
• We got a whopping average transfer speed of 1.487 GB/s, with a peak speed of almost 2 GB/s!
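To put the reported rates in perspective, the sketch below converts them to Gb/s and estimates how long a 100 GB file would take at each rate. It assumes decimal units (1 GB = 10^9 bytes) and treats each reported figure as a sustained average; the labels in `REPORTED_RATES` are informal names for the runs above.

```python
def to_gbit_per_s(bytes_per_s: float) -> float:
    """Convert a rate in bytes/s to gigabits/s (decimal units)."""
    return bytes_per_s * 8 / 1e9


def transfer_seconds(size_bytes: float, bytes_per_s: float) -> float:
    """Time to move size_bytes at a constant rate of bytes_per_s."""
    if bytes_per_s <= 0:
        raise ValueError("bytes_per_s must be positive")
    return size_bytes / bytes_per_s


# Rates quoted in Example #2, in bytes per second.
REPORTED_RATES = {
    "bbcp, first run": 994e6,
    "bbcp, second run": 1.1e9,
    "globus-url-copy, average": 1.487e9,
}

FILE_SIZE = 100e9  # 100 GB, the size of the bbcp test file

for run, rate in REPORTED_RATES.items():
    print(f"{run}: {to_gbit_per_s(rate):.2f} Gb/s, "
          f"100 GB in about {transfer_seconds(FILE_SIZE, rate):.0f} s")
```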
Example #2: Moving data from NERSC to UCSC
• NERSC recommends using Globus Online to move significant amounts of data between NERSC and other sites
• The endpoint name of the NERSC DTNs is "NERSC DTN"
Example #3: Moving data from OLCF to UCSC
Case scenario:
• One of my collaborators had 2TB of simulation data at OLCF
• We wanted to move the data from OLCF to UCSC to perform analysis and visualization
• My OLCF account had expired
• My collaborator didn't have an account on the DTN at UCSC, and we didn't want to create a new account just for this one-time data transfer
Example #3: Moving data from OLCF to UCSC
Solution: using bbcp
1. My collaborator sent me his SSH public key
2. I appended his SSH public key to my ~/.ssh/authorized_keys on the DTN at UCSC
3. If I am paranoid, I could further restrict the authorized key
4. On one of the OLCF DTNs, my collaborator moved the data to UCSC
# move a file to UCSC DTN
bbcp -i /path/to/his/private/key -P 8 data.tar \
[email protected]:/mnt/pulpos/cross/data.tar
# move a directory to UCSC DTN
bbcp -i /path/to/his/private/key -P 8 -r -A some_folder \
[email protected]:/mnt/pulpos/cross/some_folder
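Step 3 does not say how the key would be restricted. One possible approach, sketched here as an assumption rather than what was actually done, is to prefix the key in ~/.ssh/authorized_keys with OpenSSH options such as from="..." and the no-* forwarding options. The helper name `restricted_key_line`, the host pattern, and the key text below are hypothetical placeholders.

```python
def restricted_key_line(public_key: str, allowed_from: str) -> str:
    """Build an authorized_keys line that limits where and how a key may be used."""
    if '"' in allowed_from:
        raise ValueError("allowed_from must not contain double quotes")
    options = [
        f'from="{allowed_from}"',  # accept the key only from matching hosts
        "no-agent-forwarding",
        "no-port-forwarding",
        "no-pty",
        "no-X11-forwarding",
    ]
    return ",".join(options) + " " + public_key.strip()


# Hypothetical values for illustration only.
EXAMPLE_KEY = "ssh-ed25519 AAAAexamplekeymaterial collaborator@example"
EXAMPLE_PATTERN = "dtn*.example.org"

print(restricted_key_line(EXAMPLE_KEY, EXAMPLE_PATTERN))
```

The resulting line would replace the plain key appended in step 2. Whether every option is compatible with a given bbcp setup is worth testing before relying on it, and the entry can be removed once the one-time transfer is done.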