Secure, Collaborative, Web Service enabled and Bittorrent Inspired High-speed Scientific Data Transfer Framework

Introduction

• Scientific applications generate terabytes or even petabytes of data.
  – High-energy physics
    • CERN is funded jointly by 20 European countries, with 3000 staff supporting 6500 researchers in 35 nations
    • The Large Hadron Collider (LHC) project will create 15 petabytes of data per year
  – Fusion power
  – Climate modeling
  – Earthquake engineering
  – Astronomy
  – Biology

Data Intensive Science: 2000-2015[1]

• Scientific discovery increasingly driven by data collection
  – Computationally intensive analyses
  – Massive data collections
  – Data distributed across networks of varying capability
  – Internationally distributed collaborations
• Dominant factor: data growth (1 Petabyte = 1000 TB)
  – 2000: ~0.5 Petabyte
  – 2005: ~10 Petabytes
  – 2010: ~100 Petabytes
  – 2015: ~1000 Petabytes?

Requirements for Scientific Data Transfer

• Transferring scientific data at large scale requires
  – efficient,
  – high-performance,
  – reliable,
  – secure,
  – policy-aware management
  – optimum use of resources (CPU, storage, network bandwidth)

Background

• Several existing systems have successfully addressed the above requirements:
  – GridFTP
  – GridFTPXIO
  – GridHTTP
  – TeraGrid Copy (TGCP)
  – The Replica Location Service (RLS)
  – gLite

GridFTP

• Extension of the standard FTP protocol
• Reliable
• Secure
• High performance
• Efficient
• The de facto standard for transferring data in many Grid projects
• However, GridFTP does not offer a web service interface.

GridFTP (cont.)

• Additional features supported by the GridFTP protocol
  – Grid Security Infrastructure (GSI) and Kerberos support
  – Support for reliable and restartable data transfer: a transfer restarts from the point of failure when a failure occurs
  – Partial file transfer: transfer of selected regions of a file
  – Parallel data transfer: multiple TCP streams between two network endpoints to improve bandwidth
  – Third-party control of data transfer: the ability to control transfers between storage servers from a remote (third-party) server
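
Partial, parallel and restartable transfer all rest on treating a file as a set of byte ranges. The following is a minimal sketch of that idea only, not GridFTP's wire protocol or any Globus API; the helper names `split_ranges` and `remaining_ranges` are hypothetical.

```python
def split_ranges(file_size, streams):
    """Split [0, file_size) into `streams` contiguous (start, end) byte ranges."""
    if streams < 1:
        raise ValueError("streams must be >= 1")
    base, extra = divmod(file_size, streams)
    ranges, start = [], 0
    for i in range(streams):
        # The first `extra` ranges carry one additional byte each.
        length = base + (1 if i < extra else 0)
        ranges.append((start, start + length))
        start += length
    return ranges


def remaining_ranges(ranges, completed):
    """Ranges still to be sent after a failure (restart from the point of failure)."""
    return [r for r in ranges if r not in completed]


if __name__ == "__main__":
    parts = split_ranges(10, 4)          # one range per parallel stream
    print(parts)
    print(remaining_ranges(parts, {parts[0], parts[1]}))
```

Each range could be handed to its own TCP stream; after a failure only the ranges not yet acknowledged need to be resent.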

GridHTTP

• Allows large (gigabyte) files to be transferred at optimal speeds using HTTP
• Does not deviate from existing HTTP standards, but describes how to use existing headers and methods to produce an encrypted data stream
• Supports bulk data transfers via unencrypted HTTP
• Supports authentication and authorization with the usual grid credentials over HTTP

GridFTPXIO

• The Globus eXtensible Input/Output (XIO) System
• Provides an abstraction layer over transport protocols.
• Enables different I/O problems to be presented uniformly through a simple open/close/read/write (OCRW) interface.
• A support framework for developing communication protocols.
• An interface that enables existing applications written with XIO to access their hardware.
• Primary usage scenarios
  – Independence from the Transmission Control Protocol (TCP)
  – Ease of adding GridFTP support to third-party applications
  – Ease of providing GridFTP access to data storage
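
The value of an open/close/read/write (OCRW) abstraction is that application code does not change when the underlying transport does. The sketch below illustrates the pattern in Python under that assumption; it is not the Globus XIO C API, and `Driver`, `MemoryDriver` and `copy` are hypothetical names.

```python
from abc import ABC, abstractmethod


class Driver(ABC):
    """Uniform OCRW interface that every transport driver implements."""

    @abstractmethod
    def open(self, target): ...

    @abstractmethod
    def read(self, nbytes): ...

    @abstractmethod
    def write(self, data): ...

    @abstractmethod
    def close(self): ...


class MemoryDriver(Driver):
    """Toy driver backed by in-memory buffers, standing in for TCP, UDP, files, etc."""

    def __init__(self):
        self._buffers = {}
        self._name = None
        self._pos = 0

    def open(self, target):
        self._name = target
        self._buffers.setdefault(target, bytearray())
        self._pos = 0

    def read(self, nbytes):
        buf = self._buffers[self._name]
        chunk = bytes(buf[self._pos:self._pos + nbytes])
        self._pos += len(chunk)
        return chunk  # empty bytes signals end of data

    def write(self, data):
        self._buffers[self._name].extend(data)
        return len(data)

    def close(self):
        self._name = None


def copy(src, dst, src_target, dst_target, chunk_size=4):
    """Application code written only against the OCRW interface."""
    src.open(src_target)
    dst.open(dst_target)
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        total += dst.write(chunk)
    src.close()
    dst.close()
    return total
```

Because `copy` touches only the four OCRW calls, swapping `MemoryDriver` for a driver built on another transport would not require changing it.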

TeraGrid Copy (TGCP)

• The TeraGrid Copy (TGCP) solution includes three main components:
  – GridFTP Service
  – RFT Service
  – TGCP shell script
• In the striped configuration,
  – the GridFTP service runs on several nodes of a cluster
  – the data to be transferred is partitioned among the nodes
  – each node may use several parallel streams to attain the maximum performance

TGCP (cont.)

• The tgcp script can use the globus-url-copy tool in either
  – (A) third-party transfer mode, or
  – (B) conventional GridFTP client mode

TGCP (cont.)

• The RFT Service is used to manage the transfer.
• It adds reliability to the transfer request.
• The transfer will be completed even if a failure occurs during the transfer.
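
The reliability described above can be pictured as a retry loop that keeps a restart marker, so a failure never forces the transfer to begin again from the first byte. This is a sketch under stated assumptions, not the RFT service interface: `reliable_transfer`, `send_chunk` and `max_attempts` are hypothetical names.

```python
def reliable_transfer(send_chunk, total_chunks, max_attempts=5):
    """Send chunks 0..total_chunks-1, retrying a failed chunk instead of restarting.

    `send_chunk(i)` is a caller-supplied function that raises OSError on failure.
    Returns the number of chunks delivered.
    """
    next_chunk = 0   # restart marker: everything before it is already delivered
    attempts = 0     # consecutive failures on the current chunk
    while next_chunk < total_chunks:
        try:
            send_chunk(next_chunk)
        except OSError:
            attempts += 1
            if attempts >= max_attempts:
                raise  # give up only after repeated failures on the same chunk
            continue
        next_chunk += 1
        attempts = 0
    return next_chunk
```

A persistent service would additionally store the restart marker outside the process, so the transfer could survive a crash of the managing service itself.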

The Replica Location Service (RLS)

• Provides a framework for tracking the physical locations of data that has been replicated.
• Maps logical names to physical names.
• Replication of data items can
  – reduce access latency,
  – improve data locality,
  – increase robustness, scalability and performance for distributed applications.
• Does not operate in isolation:
• It is used with other components such as the Reliable File Transfer service, GridFTP and the Metadata Catalog Service.

RLS (cont.)

• The current RLS implementation has the following features.
  – Local Replica Catalogs (LRCs)
  – Replica Location Indices (RLIs)
  – LRCs send information about their state to RLIs using soft state protocols.
  – Optional "Bloom Filter" compression can be used to summarize the contents of the LRC.
  – The current RLS implementation maintains static information about the LRCs and RLIs participating in the distributed system.
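
To make the LRC and Bloom filter ideas concrete, here is a small sketch of a catalog that maps logical names to physical names and can summarize its contents as a Bloom filter, the kind of compact state an LRC could send to an RLI. It is an illustration, not the Globus RLS implementation; the class names, filter sizes and example URLs are all assumptions.

```python
import hashlib


class BloomFilter:
    """Compact set summary: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


class LocalReplicaCatalog:
    """Maps each logical name to the physical names of its replicas."""

    def __init__(self):
        self.mappings = {}

    def register(self, logical_name, physical_name):
        self.mappings.setdefault(logical_name, []).append(physical_name)

    def lookup(self, logical_name):
        return list(self.mappings.get(logical_name, []))

    def summary(self):
        """Bloom filter over the logical names held by this catalog."""
        bf = BloomFilter()
        for name in self.mappings:
            bf.add(name)
        return bf


if __name__ == "__main__":
    lrc = LocalReplicaCatalog()
    lrc.register("lfn://run42", "gsiftp://host-a.example.org/data/run42")  # hypothetical names
    lrc.register("lfn://run42", "gsiftp://host-b.example.org/data/run42")
    print("lfn://run42" in lrc.summary(), lrc.lookup("lfn://run42"))
```

An index holding only the filter can answer "this catalog might hold the name" cheaply and then query the catalog itself for the actual physical locations.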

Our proposal: GridTorrent

• We propose a new peer-to-peer file distribution protocol for transferring scientific data at acceptable speed
• This is similar to the way GridFTP redefined the FTP protocol to adapt it to scientific data transfer
• Many studies show that BitTorrent can be used for scientific applications

GridTorrent Architecture

[Figure: GridTorrent architecture diagram. Components shown: User (with Web Browser, GridTorrent WS Client and GridTorrent Client, drawn for two users), Regular Web Server, Database Server, GridTorrent Web Service Gateway. Labeled interactions: 1. Register to system; 2. Return credential; 3. Publish/Browse Content; 4. Connect to GT WS Gateway; Record User Settings; Update Information; Exchange data.]

Advantages

• Saves resources by taking advantage of the unused upload capacity of downloaders.
  – CPU
  – Network bandwidth
  – Disk
• Reliable
• Jobs can be started and stopped using a web interface
• Can be deployed on any system
• Secure

Initial Test Results

• File size is around 185 MB
• LAN test result:
  – Sources were on gridfarm machines (Bloomington, IN) and the client was on the complexity machine (Indianapolis, IN)
  – Transfer speed is 71 Mbps.
  – PTCP transfer speed is around 80 Mbps under the same conditions.
  – Bandwidth usage of each source:

    seed1   seed2   seed3   seed4
    44 MB   53 MB   47 MB   41 MB

• WAN test result:
  – As in the LAN tests, sources were on gridfarm machines (Bloomington, IN) and the client was on the pipeline3 machine (San Diego, CA).
  – Transfer speed is 17 Mbps.
  – PTCP transfer speed is around 27 Mbps under the same conditions.

    seed1   seed2   seed3   seed4
    52 MB   45 MB   43 MB   48 MB
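
For a sense of scale, the reported rates can be converted into wall-clock time for the 185 MB test file. The sketch below assumes decimal units (1 MB = 10^6 bytes, 1 Mbps = 10^6 bits per second), which the slides do not state; `transfer_seconds` is a hypothetical helper, and the rates are the figures quoted above.

```python
def transfer_seconds(size_mb, rate_mbps):
    """Seconds needed to move `size_mb` megabytes at `rate_mbps` megabits per second."""
    return size_mb * 8 / rate_mbps


FILE_MB = 185  # approximate test file size from the slide

REPORTED_RATES_MBPS = [
    ("LAN, reported transfer speed", 71),
    ("LAN, PTCP", 80),
    ("WAN, reported transfer speed", 17),
    ("WAN, PTCP", 27),
]

for label, rate in REPORTED_RATES_MBPS:
    print(f"{label}: {transfer_seconds(FILE_MB, rate):.1f} s")
```

Under these assumptions the LAN runs finish in roughly twenty seconds and the WAN runs in roughly one to one and a half minutes, so at this file size connection setup and peer discovery may account for a noticeable share of the measured time.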

Why Bittorrent?

• Alternative peer-to-peer protocols
  – FastTrack
  – Gnutella
  – eDonkey
  – Direct Connect
  – Ares
• Why BitTorrent?
  – Better bandwidth utilization
  – Unprecedented speeds
  – Up to 7 MB/s from the Internet
  – Limits free riding: tit-for-tat
  – Limits leech attacks: upload and download are coupled
  – Spurious files are not propagated
  – Ability to resume a download
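
Two of the properties listed above lend themselves to a short illustration: tit-for-tat (upload slots go to the peers that upload to us fastest) and rejection of spurious data (each piece is checked against a SHA-1 hash published in the torrent metadata). This is a simplified sketch, not a BitTorrent client; it omits optimistic unchoking, and the function names and the `slots` default are assumptions.

```python
import hashlib


def choose_unchoked(upload_rates, slots=4):
    """Tit-for-tat: grant upload slots to the peers that gave us the most data.

    `upload_rates` maps a peer id to the rate at which that peer uploaded to us.
    """
    ranked = sorted(upload_rates, key=upload_rates.get, reverse=True)
    return set(ranked[:slots])


def piece_is_valid(piece, expected_sha1_hex):
    """Accept a piece only if its SHA-1 digest matches the expected value."""
    return hashlib.sha1(piece).hexdigest() == expected_sha1_hex


if __name__ == "__main__":
    rates = {"peerA": 10, "peerB": 50, "peerC": 0, "peerD": 30}
    print(choose_unchoked(rates, slots=2))   # the free rider peerC gets no slot

    good = b"scientific data block"
    digest = hashlib.sha1(good).hexdigest()
    print(piece_is_valid(good, digest), piece_is_valid(b"tampered", digest))
```

Because a corrupted or forged piece fails the hash check, it is discarded and requested again instead of being passed on to other peers.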

Why Bittorrent?

• BitTorrent has proved suitable for distributing very large files.
• Many companies use BitTorrent as a distribution protocol
  – Amazon S3
  – Microsoft's Avalanche (inspired by BitTorrent)
  – Blizzard (game production company)
  – Movie studios

Research Issues

• The current BitTorrent protocol is designed for today's network environment
• Modifications needed to provide pure scientific data transfer
  – modification of message format and frequency
  – parallel TCP/UDP
  – UDP
  – Web Service oriented client
• Requirements for pure scientific data transfer
  – Security
  – Content access management
  – Searching capability