High Performance GridFTP Transport of Earth System Grid (ESG) Data

Page 1: High Performance GridFTP Transport of Earth System Grid (ESG) Data

High Performance GridFTP Transport of Earth System Grid (ESG) Data

Center for Enabling Distributed Petascale Science

Page 2: High Performance GridFTP Transport of Earth System Grid (ESG) Data

Description

  • Transfer 10 TB of climate data onto the SC09 show floor from three sites: the Argonne Leadership Computing Facility (ALCF), the National Energy Research Scientific Computing Center (NERSC), and LLNL.

  • As the data arrives at its destination in the University of Utah’s SC09 booth, it will be stored on disks provided by Data Direct Networks.

  • The data will be processed using climate data analysis and visualization tools and then publicly displayed, along with graphs depicting the characteristics of the transfer.

Page 3: High Performance GridFTP Transport of Earth System Grid (ESG) Data

End-to-End Flow

Page 4: High Performance GridFTP Transport of Earth System Grid (ESG) Data

Scientific Purpose

  • Climate data is moved in this challenge.
      • Climate science is a highly collaborative discipline, and its datasets are distributed across the globe.
      • An interesting feature of climate data is that individual files are not very large compared to those of other sciences.
      • Climate researchers, however, need to move hundreds or thousands of files in a single transfer.
      • The total volume of data to be moved across the network is massive.

  • Multiple TB of data from the Climate Research Program Coupled Model Intercomparison Project, Phase 3 (CMIP3) is moved.
      • This data was used in the Intergovernmental Panel on Climate Change (IPCC) Fourth Assessment Report (AR4).
      • This data is used in anticipation of the approaching IPCC Fifth Assessment Report (AR5).

Page 5: High Performance GridFTP Transport of Earth System Grid (ESG) Data

How Computing and Network map into Climate Modeling Efforts

Each climate modeling task maps onto these strategic objectives.

from:

Page 6: High Performance GridFTP Transport of Earth System Grid (ESG) Data

Network Challenges in ESG

  • Independent gateways federating metadata and users
  • Individual data nodes responsible for publishing services
  • Designed for model output data sets

Page 7: High Performance GridFTP Transport of Earth System Grid (ESG) Data

Technical Approach and Methods

  • Transfers initiated by the climate community can be between a client and a server, or between two remote servers with the user initiating the transfer from a third machine.

  • GridFTP and other data movement tools developed by the Center for Enabling Distributed Petascale Science (CEDPS) are ideal for these types of transfers.
      • GridFTP is optimized for high-bandwidth, wide-area networks.
      • The Globus implementation of GridFTP provides a software suite optimized for a broad range of data access applications, including bulk file transfer and data extraction from complex storage systems.

Page 8: High Performance GridFTP Transport of Earth System Grid (ESG) Data

GridFTP Advantages

  • Performance: orders-of-magnitude performance improvements over standard FTP.
      • Uses parallel TCP streams and non-TCP protocols such as UDT.
      • Coordinated transfer using multiple computers at source and destination.

  • Secure: GridFTP supports the PKI/X.509-based Grid Security Infrastructure (GSI), with simple options to encrypt and integrity-check data.
      • GridFTP also supports SSH security.

  • Robust: restart markers allow interrupted transfers to restart with minimal delay and overhead.

  • Extensible: clear abstractions to interface with various transport protocols and with different storage systems.
      • Completely shields the user from the complexities of underlying storage systems, including tape archives such as HPSS.
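The restart-marker idea above can be illustrated with a small sketch. It assumes, purely for illustration, that the receiver reports the byte ranges it has safely stored as half-open `(start, end)` pairs; the function names are hypothetical and this is not the GridFTP wire format or the Globus implementation. Given those ranges, the sender only needs to resend what is missing.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # half-open byte range [start, end)


def merge_ranges(ranges: List[Range]) -> List[Range]:
    """Merge overlapping or touching byte ranges into a sorted list."""
    merged: List[Range] = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def remaining_ranges(file_size: int, received: List[Range]) -> List[Range]:
    """Return the byte ranges that still have to be transferred."""
    missing: List[Range] = []
    cursor = 0  # first byte not yet known to be stored
    for start, end in merge_ranges(received):
        if start > cursor:
            missing.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < file_size:
        missing.append((cursor, file_size))
    return missing


if __name__ == "__main__":
    # Hypothetical markers collected before a 1000-byte transfer was interrupted.
    markers = [(0, 400), (600, 800), (350, 500)]
    print(remaining_ranges(1000, markers))
```

On restart, only the ranges returned by `remaining_ranges` would be requested again, which is why an interrupted transfer costs little more than the data that was actually lost.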

Page 9: High Performance GridFTP Transport of Earth System Grid (ESG) Data

Key GridFTP Features used in the Challenge

  • Concurrency and pipelining
      • Pipelining allows the client to maintain multiple outstanding, unacknowledged transfer commands simultaneously.
      • Greatly improves performance for transfers of lots of small files.

[Figure: message timelines for three files (File Request 1-3, DATA 1-3, ACK 1-3), comparing a traditional transfer with pipelining.]
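To see why pipelining matters for many small files, here is a rough back-of-the-envelope model. It is a simplification under stated assumptions, not a measurement from the challenge: the traditional case pays one round trip of command latency per file, while the pipelined case overlaps the commands with the data and pays that latency only once. The file count, file size, round-trip time, and bandwidth below are made-up illustrative values.

```python
def traditional_time(num_files: int, file_bytes: float,
                     rtt_s: float, bandwidth_bps: float) -> float:
    """Each file waits one round trip for its request before data flows."""
    per_file = rtt_s + (file_bytes * 8) / bandwidth_bps
    return num_files * per_file


def pipelined_time(num_files: int, file_bytes: float,
                   rtt_s: float, bandwidth_bps: float) -> float:
    """Requests are queued ahead of time, so only the first round trip is exposed."""
    return rtt_s + num_files * (file_bytes * 8) / bandwidth_bps


if __name__ == "__main__":
    # Illustrative assumptions: 1000 files of 10 MB, 50 ms RTT, 1 Gbit/s path.
    args = (1000, 10e6, 0.05, 1e9)
    print(f"traditional: {traditional_time(*args):.1f} s")
    print(f"pipelined:   {pipelined_time(*args):.1f} s")
```

The gap between the two grows with the number of files and with the round-trip time, which matches the wide-area, many-small-files profile of climate data described earlier.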

Page 10: High Performance GridFTP Transport of Earth System Grid (ESG) Data

GridFTP Clients and Netlogger

  • Three different GridFTP clients are used to move the 10 TB data set for the challenge:
      • Globus.org: hosted data movement service
      • BDM: Bulk Data Mover
      • globus-url-copy

  • NetLogger is used to monitor transfers and troubleshoot problems:
      • Distributed performance analysis and troubleshooting
      • Standard log format and best practices
      • Log collection tools
      • Log parser
      • Data analysis tools
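As a sketch of what a standard log format buys you, the snippet below parses a log line written as space-separated `name=value` pairs, which is the general style of the NetLogger best-practices format. The sample line, the event name, and the `nbytes` and `dest` fields are invented for illustration, and `parse_log_line` is a hypothetical helper, not part of the NetLogger toolkit.

```python
import shlex
from typing import Dict


def parse_log_line(line: str) -> Dict[str, str]:
    """Split a 'name=value name2="quoted value"' log line into a dict."""
    fields: Dict[str, str] = {}
    # shlex.split keeps quoted values (which may contain spaces) in one token.
    for token in shlex.split(line):
        name, sep, value = token.partition("=")
        if not sep:
            raise ValueError(f"not a name=value pair: {token!r}")
        fields[name] = value
    return fields


if __name__ == "__main__":
    # Made-up example line; field names other than ts/event are assumptions.
    sample = ('ts=2009-11-17T18:39:18.114369Z event=example.transfer.end '
              'level=Info nbytes=1048576 dest="SC09 show floor"')
    record = parse_log_line(sample)
    print(record["event"], int(record["nbytes"]), record["dest"])
```

Once every component logs in one regular format, collection, parsing, and plotting can be shared across sites instead of being rewritten per tool.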

Page 11: High Performance GridFTP Transport of Earth System Grid (ESG) Data

What is the Globus.org Data Movement Service (a.k.a. DataKoa)?

  • A new Globus data movement service
      • The same vision, but an updated implementation
      • Hosted
      • Domain-independent, multi-use

  • Enables scientists to focus on domain-specific work
      • Manages technology failures
      • Sends notifications of interesting events

  • Enables non-experts to easily and efficiently move data
      • No operations overhead
      • Minimal user-side software installation
      • User interfaces require no special expertise
      • Built-in data transport configuration expertise

Page 12: High Performance GridFTP Transport of Earth System Grid (ESG) Data

Globus.org Data Movement Service

[Figure: a laptop client, Globus.org, GridFTP Server A, and GridFTP Server B.]

  • The client connects to Globus.org and submits requests. It can then disappear from the network.

  • Globus.org orchestrates the transfer between GridFTP servers.

Page 13: High Performance GridFTP Transport of Earth System Grid (ESG) Data

What is BDM?

  • BDM: Bulk Data Mover
      • Scalable data movement management tool
      • Calls GridFTP file transfers

  • Designed for climate community (Earth System Grid) needs
      • Efficient and reliable transfer management from the user’s point of view
      • Simple for a novice user to install and maintain
      • Scalable to large data volumes
      • Scalable to large numbers of files
      • Efficient handling of extreme variance in file sizes

  • Scalable to future performance expectations
      • Network performance improvements: 100 Gbps and beyond
      • Storage performance improvements: distributed, parallel, SSD, etc.
      • Multiple transfer protocol support

  • Able to work with other applications with similar needs

  • Information
      • http://sdm.lbl.gov/bdm
      • Contact: Dean Williams [email protected]

Page 14: High Performance GridFTP Transport of Earth System Grid (ESG) Data

globus-url-copy

  • Commonly used, scriptable command-line GridFTP client

  • Supports various transfer optimizations, including parallel TCP streams and concurrent file transfers

  • New features
      • Fault tolerant
          • Stores state in a file
          • Restarting globus-url-copy transfers only the remaining data
      • Associates multiple physical endpoints with a single logical endpoint
          • Load balances across all the physical endpoints
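The optimizations listed above map onto command-line options. The sketch below only builds and prints a globus-url-copy command line; it does not run it. The flags shown (`-vb` for performance output, `-p` for parallel streams, `-cc` for concurrent files, `-pp` for pipelining, `-r` for recursive copy) are quoted from memory of the client's help text and should be checked against `globus-url-copy -help` on your installation. The host names and paths are hypothetical.

```python
import shlex
from typing import List


def build_copy_command(src_url: str, dst_url: str,
                       parallel_streams: int = 4,
                       concurrent_files: int = 4,
                       pipelining: bool = True) -> List[str]:
    """Assemble a globus-url-copy argument list (flags are assumptions to verify)."""
    cmd = [
        "globus-url-copy",
        "-vb",                         # report transfer performance
        "-p", str(parallel_streams),   # parallel TCP streams per file
        "-cc", str(concurrent_files),  # files transferred at the same time
    ]
    if pipelining:
        cmd.append("-pp")              # keep several requests outstanding
    # Recursive directory copy; directory URLs end with a slash.
    cmd += ["-r", src_url, dst_url]
    return cmd


if __name__ == "__main__":
    # Hypothetical endpoints, for illustration only.
    command = build_copy_command(
        "gsiftp://source.example.org/data/cmip3/",
        "gsiftp://booth.example.org/incoming/cmip3/",
        parallel_streams=8,
        concurrent_files=16,
    )
    print(shlex.join(command))
```

Building the argument list separately from executing it makes the tuning parameters easy to vary per site, and the printed line can be pasted into a shell once the flags are confirmed.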

9/15/09 Argonne National Laboratory

Page 15: High Performance GridFTP Transport of Earth System Grid (ESG) Data

NetLogger BWC Deployment

[Figure: deployment diagram. Labels: ALCF, LLNL, NERSC, LBNL; GridFTP servers; SC09 show floor; data; logs; NetLogger DB; plots on the web.]

Page 16: High Performance GridFTP Transport of Earth System Grid (ESG) Data

Data Direct Networks Silicon Storage Architecture (S2A)

Page 17: High Performance GridFTP Transport of Earth System Grid (ESG) Data

ESnet Science Data Network

  • A good network is as important as having the right tools and applications.
      • We needed a good network that would move these datasets at high speeds to the convention center.
      • ESnet was the perfect fit to pull data from the national labs.

  • The Science Data Network (SDN) and the On-Demand Secure Circuit and Advance Reservation System (OSCARS)
      • guarantee that we will have a dedicated circuit on the network for the duration of the challenge.
      • We don’t have to compete with anyone else for bandwidth.
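A dedicated circuit makes transfer time predictable, so it is worth doing the arithmetic. The sketch below estimates how long a 10 TB data set takes at a given sustained rate. The rates and the efficiency factor are assumptions chosen for illustration, since the slides do not state the reserved bandwidth, and 1 TB is taken as 10^12 bytes.

```python
def transfer_hours(total_bytes: float, rate_bps: float,
                   efficiency: float = 1.0) -> float:
    """Hours needed to move total_bytes at rate_bps, scaled by an efficiency factor."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    seconds = (total_bytes * 8) / (rate_bps * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    ten_tb = 10e12  # 10 TB, using decimal terabytes
    # Assumed sustained rates; the actual circuit capacity is not given in the slides.
    for gbps in (1, 10):
        hours = transfer_hours(ten_tb, gbps * 1e9)
        print(f"{gbps:>2} Gbit/s: {hours:.1f} hours")
```

Lowering `efficiency` below 1.0 gives a quick way to account for protocol overhead or disk limits at either end.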


Page 18: High Performance GridFTP Transport of Earth System Grid (ESG) Data

Data Analysis and Visualization

  • The data were analyzed using the Climate Data Analysis Tools (CDAT), developed by the Program for Climate Model Diagnosis and Intercomparison (PCMDI).

  • CDAT is a suite of interrelated diagnostic software tools.
      • Flexible, portable, adaptable, efficient, easy to use, shareable, and free
      • Capable of operating in a distributed environment

  • 3D interface provided by the ViSUS plugin, developed at the SCI Institute at the University of Utah and at LLNL
      • Streaming and progressive data flow
      • Integrated analysis and illustration tools


Page 19: High Performance GridFTP Transport of Earth System Grid (ESG) Data

Data Analysis and Visualization

Full Video is available at http://www.sci.utah.edu/~pascucci/tmp/climate_video/

Page 20: High Performance GridFTP Transport of Earth System Grid (ESG) Data

Overarching Research Agenda

  • The climate community is expecting to generate petabytes of simulated data for analysis and future climate predictions.

  • In the next few years, climate researchers will be moving terabytes of data to collaborators across the globe for the IPCC Fifth Assessment Report (AR5), which will be published in 2013.

  • Moving large amounts of data seamlessly, reliably, and quickly is required to make sense of the enormous AR5 climate data set.
      • This will help scientists understand climatic imbalances and the potential impacts of future climate change scenarios.


Page 21: High Performance GridFTP Transport of Earth System Grid (ESG) Data

Overarching Research Agenda

  • This demonstration highlights the tools and services that will help climate researchers transport their data quickly and reliably.

  • We hope that the lessons learned in this experiment will help us to do this better.

  • We aim to improve the transport and monitoring tools further, helping not only climate researchers but also other researchers get their science done faster than before.
