PRP-FIONA workshop, Bozeman, August 2, 2018

Basics of Data Transfer

Shawfeng Dong

Principal Cyberinfrastructure Engineer

University of California, Santa Cruz

Outline

• Data Transfer Nodes (DTNs)

• Limitation of Common File Transfer Tools

• Fast Data Transfer Utilities

Data Transfer Nodes (DTNs)

• DTNs are typically Linux servers built with high-quality components and configured specifically for wide area data transfer

• Types of DTNs (https://fasterdata.es.net/science-dmz/science-dmz-architecture/)

o Fat DTNs

o Thin DTNs

o Clustered DTNs

Fat DTN in Simple Science DMZ

https://fasterdata.es.net/science-dmz/science-dmz-architecture/

Thin DTNs in HPC Facility

https://fasterdata.es.net/science-dmz/science-dmz-architecture/

Clustered DTNs

https://fasterdata.es.net/science-dmz/science-dmz-architecture/

Data Transfer Tools

• Common File Transfer Tools

o FTP, scp/sftp, rsync, wget, curl, etc.

• Fast Data Transfer Utilities

o GridFTP

o Globus

o bbcp

o bbFTP

o FDT (Fast Data Transfer)

o Etc.

Selecting a File Transfer Tool

• Security Model

1. anonymous (e.g., FTP, HTTP): anyone can access the data

2. simple password (e.g., FTP, HTTP): most sites no longer allow this method since the password can be easily captured

3. password encrypted (e.g., bbcp, bbFTP, GridFTP, Globus, FDT): control channel is encrypted, but data is unencrypted

4. everything encrypted (e.g., scp, sftp, rsync over ssh, GridFTP, HTTPS-based web server): both control and data channels are encrypted

• Support for parallel data streams

https://fasterdata.es.net/data-transfer-tools/background/

Limitation of Common File Transfer Tools

• To obtain maximum throughput over a high-speed WAN, one needs to use a file transfer tool that supports parallel data streams.

• Unfortunately, almost none of the commonly used file transfer tools support parallel data streams.

o FTP, scp/sftp, rsync, wget, curl, etc.
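
One way to see why parallel streams matter is the bandwidth-delay product: the number of bytes that must be in flight to keep a path full. The sketch below is only illustrative. The 10 Gb/s link, the 50 ms round-trip time, and the 4 MiB per-stream window are assumed values, not measurements from this talk, and the estimate ignores packet loss, which is another reason parallel streams help.

```shell
# Bandwidth-delay product (BDP): bytes that must be in flight to keep a path full.
# Arguments: link speed in Gb/s, round-trip time in ms (both integers).
bdp_bytes() {
    # Gb/s -> bytes/s is x 125000000; ms -> s is / 1000; combined: x 125000
    echo $(( $1 * 125000 * $2 ))
}

# Streams needed if each stream can keep at most window_bytes in flight.
# Arguments: link speed in Gb/s, RTT in ms, per-stream window in bytes.
streams_needed() {
    bdp=$(bdp_bytes "$1" "$2")
    window_bytes=$3
    echo $(( (bdp + window_bytes - 1) / window_bytes ))   # ceiling division
}

bdp_bytes 10 50                 # prints 62500000 (about 62.5 MB)
streams_needed 10 50 4194304    # prints 15
```

With these assumed numbers, a single stream limited to a 4 MiB window could fill only about one fifteenth of the path.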

scp and sftp

• The OpenSSH versions of scp and sftp have a built-in 1 MB buffer (previously only 64 KB in OpenSSH older than version 4.7) that severely limits performance on a WAN!

• The HPN patch from PSC makes it possible to optimize single stream performance on a WAN.

https://fasterdata.es.net/data-transfer-tools/
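
As a rough illustration of the buffer limit, a sender that can keep at most one buffer of data in flight per round trip cannot exceed buffer size divided by round-trip time. The sketch below applies that bound to the two buffer sizes mentioned above. The 50 ms round-trip time is an assumed value for a cross-country path, and the function name is made up for this example.

```shell
# Upper bound on single-stream throughput when at most buffer_bytes can be
# in flight per round trip: throughput <= buffer / RTT.
# Arguments: buffer size in bytes, round-trip time in milliseconds.
# Prints the bound in megabits per second (1 Mb = 10^6 bits).
max_mbps() {
    awk -v b="$1" -v r="$2" \
        'BEGIN { printf "%.1f\n", (b * 8) / (r / 1000) / 1e6 }'
}

max_mbps 1048576 50   # 1 MB buffer, assumed 50 ms RTT: prints 167.8
max_mbps 65536 50     # 64 KB buffer, same RTT: prints 10.5
```

Under this assumption, even the 1 MB buffer caps a single stream far below the capacity of a 10-gigabit link.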

GridFTP

• Two channel protocol like FTP

• Control Channel

o Communication link (TCP) over which commands and responses flow

o Low bandwidth; encrypted and integrity protected by default

• Data Channel

o Communication link(s) over which the actual data of interest flows

o Multiple simultaneous TCP streams

o High bandwidth; authentication by default; encryption and integrity protection optional

http://www.mcs.anl.gov/~kettimut/talks/gridnets09.pdf

GridFTP Control Channels

• GSIFTP

o Uses Grid Security Infrastructure (GSI) for authentication, and optionally for data channel encryption

o Based on X.509 certificates

o Allows third-party transfer

o URL: gsiftp://hostname/path/to/remote/file

• SSHFTP (GridFTP-over-SSH)

o Uses SSH for authentication

o URL: sshftp://hostname/path/to/remote/file

o See ESnet's GridFTP Quick Start Guide on how to enable sshftp

GridFTP Clients

• The Globus Toolkit provides a GridFTP CLI client called globus-url-copy

• Globus Online: a fast, reliable file transfer service that makes it easy for users to move data between two GridFTP servers, or between a GridFTP server and a user's machine (Windows, Mac or Linux) using Globus Connect Personal.

• The Globus Command Line Interface, called globus

BBCP

• Written by Andy Hanushevsky at SLAC as a tool for the BaBar collaboration

• "Short of access to a GridFTP site, bbcp appears to be the fastest, most convenient single-node method for transferring data", Harry Mangalam of UC Irvine

• Does not require a server daemon running on the remote host; the remote bbcp is invoked via ssh

• Uses SSH for authentication, similar to SSHFTP (GridFTP-over-SSH)

• Supports parallel TCP data streams

BBCP Installation

• For simplicity, download a precompiled bbcp binary executable and place it in /usr/local/bin:

cd /usr/local/bin/

wget http://www.slac.stanford.edu/~abh/bbcp/bin/amd64_rhel60/bbcp

chmod +x bbcp

• On the server host, append 2 lines to /etc/services:

bbcpfirst 60000/tcp # bbcp

bbcplast 60100/tcp # bbcp

• Open inbound TCP ports 60000-60100
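
The /etc/services step can be scripted so that it is safe to run more than once. The sketch below is one possible approach, not part of bbcp itself. The function name is made up, and the file path is a parameter so the function can be tried on a scratch file before touching /etc/services (which requires root).

```shell
# Append the two bbcp port entries to a services file unless already present.
# Argument: path to the services file.
add_bbcp_ports() {
    services_file=$1
    grep -q '^bbcpfirst[[:space:]]' "$services_file" ||
        printf 'bbcpfirst\t60000/tcp\t# bbcp\n' >> "$services_file"
    grep -q '^bbcplast[[:space:]]' "$services_file" ||
        printf 'bbcplast\t60100/tcp\t# bbcp\n' >> "$services_file"
}

# Try it on a scratch file first
scratch=$(mktemp)
add_bbcp_ports "$scratch"
add_bbcp_ports "$scratch"   # second call adds nothing
cat "$scratch"
rm -f "$scratch"
```

How the ports are opened depends on the firewall in use. On a host running firewalld, for example, the rule would be along the lines of `firewall-cmd --permanent --add-port=60000-60100/tcp` followed by `firewall-cmd --reload`.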

Example #1: Testing your DTN against an ESnet DTN

• ESnet has deployed a set of test hosts for high-speed disk-to-disk testing

• Running globus-url-copy on your DTN:

# make sure you can connect to server
globus-url-copy -list ftp://sunn-dtn.es.net:2811/data1/

# copy 10G file using 4 parallel streams
globus-url-copy -vb -fast -p 4 \
ftp://sunn-dtn.es.net:2811/data1/10G.dat \
file:///tmp/test.out

Example #1: Testing your DTN against an ESnet DTN

• Using Globus Online, if you have:

o Installed Globus Connect Server on your DTN;

o Configured a Globus Authentication/Authorization method for your DTN: MyProxy, OAuth, or CILogon;

o Created an Endpoint for your DTN

• Globus Quick Start Guide at ESnet

Example #2: Moving data from NERSC to UCSC

• Case scenario:

o I have a valid account at NERSC

o I wanted to move some of my simulation data from NERSC to UCSC

• There are four data transfer nodes deployed at NERSC which allow interactive use: dtn0[1-4].nersc.gov

• Each NERSC DTN has four 10-gigabit ethernet links for transfers over the network and two FDR IB connections to the filesystem

Example #2: Moving data from NERSC to UCSC

• Using BBCP to copy data from NERSC

# download a 100GB data file from NERSC to our DTN at UCSC
bbcp -z -P 10 -S "ssh -x -a -oFallBackToRsh=no %I -l %U %H /usr/common/usg/bin/bbcp" \
[email protected]::/global/cscratch1/sd/shawdong/100GB.dat \
/bigdata/dong/100GB-bbcp.dat

• We got a remarkable data transfer speed of 994 MB/s!

Example #2: Moving data from NERSC to UCSC

• Using BBCP to copy data from NERSC

# download a 100GB data file (on Edison's Lustre filesystem) from NERSC
bbcp -z -P 8 -S "ssh -x -a -oFallBackToRsh=no %I -l %U %H /usr/common/usg/bin/bbcp" \
[email protected]::/global/cscratch1/sd/shawdong/100GB.dat \
/bigdata/dong/100GB-bbcp.dat

# download one of Eli Dart's test datasets (on GPFS)
bbcp -z -P 8 -S "ssh -x -a -oFallBackToRsh=no %I -l %U %H /usr/common/usg/bin/bbcp" \
[email protected]::/global/project/projectdirs/mpccc/dart/test-data/50G.dat \
/bigdata/dong/50G-bbcp.dat

• We got a remarkable data transfer speed of 1.1 GB/s!

Example #2: Moving data from NERSC to UCSC

• Using GridFTP command line tools to copy data from NERSC is slightly more involved (SSHFTP is not enabled at NERSC)

# initialize my MyProxy certificate
myproxy-logon -l shawdong -s nerscca.nersc.gov

# download one of Eli Dart's datasets (on GPFS)
globus-url-copy -list

globus-url-copy -vb -fast -p 8 \
gsiftp://[email protected]/global/project/projectdirs/mpccc/dart/test-data/50G.dat \
file:/bigdata/dong/50GB-gsiftp.dat

• We got a whopping average transfer speed of 1.487 GB/s, with a peak speed of almost 2 GB/s!
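
To compare the rates reported in this example (994 MB/s and 1.1 GB/s with bbcp, 1.487 GB/s with GridFTP) against network link speeds, it helps to convert bytes per second to bits per second. The sketch below assumes the reported figures use decimal units (1 MB = 10^6 bytes); the function name is made up for this illustration.

```shell
# Convert a rate in MB/s (10^6 bytes per second) to Gb/s (10^9 bits per second).
# Argument: rate in MB/s.
mbytes_to_gbits() {
    awk -v r="$1" 'BEGIN { printf "%.2f\n", r * 8 / 1000 }'
}

mbytes_to_gbits 994    # bbcp, Lustre source:  prints 7.95
mbytes_to_gbits 1100   # bbcp, GPFS source:    prints 8.80
mbytes_to_gbits 1487   # GridFTP, GPFS source: prints 11.90
```

Under that assumption, the GridFTP average is more than a single 10-gigabit link can carry, while both bbcp results would fit within one.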

Example #2: Moving data from NERSC to UCSC

• NERSC recommends using Globus Online to move significant amounts of data between NERSC and other sites

• The endpoint name of the NERSC DTNs is "NERSC DTN"

Example #3: Moving data from OLCF to UCSC

Case scenario:

• One of my collaborators had 2TB of simulation data at OLCF

• We wanted to move the data from OLCF to UCSC to perform analysis and visualization

• My OLCF account had expired

• My collaborator didn't have an account on the DTN at UCSC; and we didn't want to create a new account just for this one-time data transfer

Example #3: Moving data from OLCF to UCSC

Solution: using bbcp

1. My collaborator sent me his SSH public key

2. I appended his SSH public key to my ~/.ssh/authorized_keys on the DTN at UCSC

3. If I am paranoid, I could further restrict the authorized key

4. On one of the OLCF DTNs, my collaborator moved the data to UCSC

# move a file to UCSC DTN
bbcp -i /path/to/his/private/key -P 8 data.tar \
[email protected]:/mnt/pulpos/cross/data.tar

# move a directory to UCSC DTN
bbcp -i /path/to/his/private/key -P 8 -r -A some_folder \
[email protected]:/mnt/pulpos/cross/some_folder
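
Step 3 (restricting the authorized key) could look something like the sketch below, which prefixes the key with OpenSSH authorized_keys options so that it is accepted only from matching hosts and cannot be used for forwarding or an interactive terminal. The host pattern and the key are placeholders, the function name is made up, and the talk does not say which restrictions were actually applied. It would be worth confirming that bbcp still works with the chosen options before relying on them.

```shell
# Build a restricted authorized_keys entry: the key is accepted only from
# hosts matching from_pattern, with forwarding and pty allocation disabled.
# Arguments: host pattern, public key line.
restricted_key_line() {
    from_pattern=$1
    public_key=$2
    printf 'from="%s",no-agent-forwarding,no-port-forwarding,no-X11-forwarding,no-pty %s\n' \
        "$from_pattern" "$public_key"
}

# Placeholder pattern and key, for illustration only
restricted_key_line '*.example.org' 'ssh-ed25519 AAAAplaceholder collaborator@example.org'
```

The printed line would be appended to ~/.ssh/authorized_keys in place of the bare public key, and removed again once the one-time transfer is done.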