Collaborative Framework for High-Performance P2P-based Data Transfer in Scientific Computing. Ali Kaplan, [email protected]. Advisor: Prof. Geoffrey C. Fox. 2/02/2009



Secure, Collaborative, Web Service enabled and Bittorrent Inspired High-speed Scientific Data Transfer Framework

Collaborative Framework for High-Performance P2P-based Data Transfer in Scientific Computing
Ali Kaplan
[email protected]
Advisor: Prof. Geoffrey C. Fox
2/02/2009

Outline
  • Introduction
  • Background
  • Motivation and Research Issues
  • GridTorrent Framework Architecture
  • Measurements and Analysis
  • Contributions and Future Works

Data, Data, more Data
  • Computational science is becoming data intensive
  • Scientists face mountains of data that stem from three sources [1]:
    • New scientific instruments double their output every year or so
    • Simulations generate floods of data
    • The Internet and computational Grid allow the replication, creation, and recreation of more data [2]

Note: The Internet and computational Grid make all these archives accessible to anyone, anywhere, allowing the replication, creation, and recreation of more data.

Data, Data, more Data (cont.)
  • Scientific discovery is increasingly driven by data collection [3]
    • Computationally intensive analyses
    • Massive data collections
    • Data distributed across networks of varying capability
    • Internationally distributed collaborations
  • Data Intensive Science: 2000-2015
    • Dominant factor: data growth (1 Petabyte = 1000 TB)
    • 2000: ~0.5 Petabyte
    • 2005: ~10 Petabytes
    • 2010: ~100 Petabytes
    • 2015: ~1000 Petabytes?

Note: The scientific community has a large set of distributed data.

Scientific Application Examples
  • Scientific applications that generate petabytes of data are very diverse:
    • Fusion power
    • Climate modeling
    • Earthquake engineering
    • Astronomy
    • Bioinformatics
    • High-energy physics

Scientific Application Examples (cont.)
  • Climate modeling: the Community Climate System Model and other simulation applications generate 1.5 petabytes/year
  • Bioinformatics: the Pacific Northwest National Laboratory is building new confocal microscopes that will generate 5 petabytes/year
  • High-energy physics: the Large Hadron Collider (LHC) project at CERN will create 15 petabytes/year

Note: The National Center for Atmospheric Research (NCAR)
Note: CERN is funded jointly by 20 European countries, with 3000 staff supporting 6500 researchers in 35 nations.
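To put the yearly volumes quoted above in network terms, they can be converted to the average rate needed to move each data set once per year. This is a rough sketch only: the decimal units follow the deck's "1 Petabyte = 1000 TB", while the 365-day year and the uniform production rate are assumptions made for illustration.

```python
# Assumptions (not from the slides): a 365-day year and data produced
# at a uniform rate all year.
SECONDS_PER_YEAR = 365 * 24 * 3600
BYTES_PER_PETABYTE = 10**15  # decimal, matching "1 Petabyte = 1000 TB"


def sustained_gbit_per_s(petabytes_per_year: float) -> float:
    """Average rate in Gbit/s needed to move a yearly volume once."""
    bits_per_year = petabytes_per_year * BYTES_PER_PETABYTE * 8
    return bits_per_year / SECONDS_PER_YEAR / 1e9


# Yearly volumes quoted on the slide.
YEARLY_VOLUMES_PB = {
    "Climate modeling": 1.5,
    "Bioinformatics (PNNL microscopes)": 5,
    "High-energy physics (LHC)": 15,
}

for name, volume in YEARLY_VOLUMES_PB.items():
    print(f"{name}: {sustained_gbit_per_s(volume):.2f} Gbit/s")
```

Under these assumptions the LHC figure works out to roughly 3.8 Gbit/s sustained, before any replication to collaborating sites.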

Background
  • Systems for transferring bulk data
    • Network level solutions
    • System level solutions
    • Application level solutions

Note: Supercomputers must be balanced systems: not just CPU farms but also petascale I/O and networking arrays.

Background (cont.)
[Figure labels: Cost, Prevalence]

System Level Solutions
  - Require modifications to the operating system of the machine, the network apparatus, or both
  + Yield very good performance
  - Expensive solutions
  - Not applicable to every system
  • Group Transport Protocol for Lambda-Grids (GTP)

Network Level Solutions
  • Network Attached Storage (NAS)
    • File-level storage system attached to a traditional network
    • Uses higher-level protocols
    • Does not allow direct access to individual storage
    • Simpler and more economical solution than SAN
  • Storage Area Network (SAN)
    • Storage devices attached directly to the LAN
    • Utilizes low-level network protocols (Fibre Channel)
    • Handles large data transfers
    • Provides better performance

Application Level Solutions
  + Use parallel streaming to improve performance
  + Require no modifications to underlying systems
  + Inexpensive
  + Broader use
  +/- May require an auxiliary component for data management
  - May not be as fast as network/system level solutions
  • Types of application level solutions
    • TCP-based solutions
    • UDP-based solutions

TCP-Based Solutions
  + Harness the good features of TCP
  + Reliability
  +/- Built-in congestion control mechanism (TCP window)
  + Require no changes to the existing system
  + Easy to implement
  + Broader use
  - Not suitable for real-time applications
  • GridFTP, GridHTTP, bbFTP and bbcp
  • Mainly use FTP or HTTP as the base protocol
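The parallel streaming listed under Application Level Solutions amounts to cutting a file into byte ranges and moving each range over its own connection. The sketch below illustrates only that idea: threads copying between in-memory buffers stand in for real network streams, and the function names are hypothetical, not taken from GridFTP or any other tool named above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def split_into_ranges(file_size: int, num_streams: int) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs that cover the file exactly once."""
    if file_size < 0 or num_streams < 1:
        raise ValueError("file_size must be >= 0 and num_streams >= 1")
    base, extra = divmod(file_size, num_streams)
    ranges = []
    offset = 0
    for i in range(num_streams):
        # The first `extra` streams carry one additional byte each.
        length = base + (1 if i < extra else 0)
        if length:
            ranges.append((offset, length))
        offset += length
    return ranges


def parallel_copy(source: bytes, num_streams: int) -> bytes:
    """Copy `source` using one worker per range (a stand-in for one stream)."""
    target = bytearray(len(source))

    def copy_range(byte_range: Tuple[int, int]) -> None:
        offset, length = byte_range
        # Each worker writes a disjoint, same-sized slice of the target.
        target[offset:offset + length] = source[offset:offset + length]

    with ThreadPoolExecutor(max_workers=num_streams) as pool:
        list(pool.map(copy_range, split_into_ranges(len(source), num_streams)))
    return bytes(target)


print(split_into_ranges(10, 3))
```

Because every range carries its own offset, the receiver can write each piece straight to its final position regardless of arrival order.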

UDP-Based Solutions
  + Small segment header overhead (8 vs. 20 bytes)
  - Unreliable
  +/- Require additional mechanisms for reliability and congestion control (at the application level)
  + May overcome existing problems of TCP
  + May make UDP faster
  - Integration with existing systems requires some changes and effort
  • SABUL, UDT, FOBS, RBUDP, Tsunami, and UFTP
  • Mainly utilize rate-based control mechanisms
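The "8 vs. 20 bytes" comparison can be made concrete by computing the share of each segment taken up by the transport header. This is a small illustration: the payload sizes are arbitrary examples, the 20-byte figure is the minimum TCP header without options, and IP and link-layer headers are ignored.

```python
UDP_HEADER_BYTES = 8   # fixed UDP header size
TCP_HEADER_BYTES = 20  # minimum TCP header size, without options


def header_overhead(payload_bytes: int, header_bytes: int) -> float:
    """Fraction of a segment occupied by the transport header."""
    if payload_bytes <= 0:
        raise ValueError("payload_bytes must be positive")
    return header_bytes / (header_bytes + payload_bytes)


# Illustrative payload sizes (assumed, not from the slides).
for payload in (64, 512, 1400):
    udp = header_overhead(payload, UDP_HEADER_BYTES)
    tcp = header_overhead(payload, TCP_HEADER_BYTES)
    print(f"{payload:>5} B payload: UDP {udp:.2%}, TCP {tcp:.2%}")
```

The gap narrows as payloads grow, which suggests that for bulk transfers the header saving matters less than the choice of congestion control.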

Auxiliary Components
  • Used for file indexing and discovery
  • GridFTP utilizes the Replica Location Service (RLS)
    • Local Replica Catalogs (LRCs)
    • Replica Location Indices (RLIs)
    • LRCs send information about their state to RLIs using soft state protocols
    • Optional "Bloom filter" compression can be used to summarize the contents of the LRC
    • The current RLS implementation maintains static information about the LRCs and RLIs participating in the distributed system
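A Bloom filter lets a catalog advertise its contents as a fixed-size bit array that may report false positives but never false negatives. The sketch below shows the general technique only and is not the RLS implementation; the class name, the filter size, the hash construction, and the logical file names are all assumptions for illustration.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Fixed-size set summary: no false negatives, some false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


# Hypothetical logical file names held by one catalog.
summary = BloomFilter()
for name in ("lfn://climate/run-001.nc", "lfn://climate/run-002.nc"):
    summary.add(name)

print(summary.might_contain("lfn://climate/run-001.nc"))
print(len(summary.bits), "bytes sent instead of the full catalog listing")
```

An index node holding only this summary can rule out catalogs that definitely lack a file and query the remaining ones to confirm.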

Motivation and Research Issues
  • Problems of existing solutions
    • Built on the client/server model
      • Why not P2P?
    • Mainly utilize FTP/HTTP-type protocols
      • Suffer from the drawbacks of FTP/HTTP
      • Modification is very difficult
      • Require building some vital services as separate modules
    • Use existing system resources inefficiently

Motivation and Research Issues (cont.)
  • If a P2P model can be the solution:
    • Which P2P model is the right one?
    • What additional features does it require?
      • Collaborative framework
      • P2P client communication hub
      • Moderate security
    • Is it scalable?
    • How does it perform?
    • What is its overhead?
    • How flexible and extensible is it?

GridTorrent Framework Architecture

Collaboration and Content Manager
  • An interface between users and the system
  • Capabilities:
    • Share content
    • Browse content
    • Download content
    • Add/remove group
    • Add/remove users for a particular content (Access Right Controls)
    • Add/remove users for a particular group (Access Right Controls)
  • Everything is metadata

GridTorrent Framework Architecture

WS-Tracker Service
  • The communication hub of the system
  • Loosely coupled, flexible, and extensible
  • Delivers tasks to GridTorrent clients
  • Updates task status in the database
  • Stores and serves .torrent files

GridTorrent Framework Architecture
[Figure: interaction among the Database, the WS-Tracker Service, and the GridTorrent Client. Labels: Ask for tasks, Get Available Tasks, Deliver Task, Deliver .torrent file, Update Records]

Task
  • A task is simply metadata (wrapped actions)
    • Request
    • Response
    • Periodic
    • Non-periodic
  • Instructs a GridTorrent client what to do with whom
  • Created by users
  • Exchanged between the WS-Tracker service and GridTorrent clients

Task Format
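The exchange of tasks between the WS-Tracker service and a GridTorrent client could look roughly like the sketch below. The actual task format is not recoverable from this transcript, so every field name, status value, and method name here is hypothetical, and an in-memory dictionary stands in for the database.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Task:
    """Hypothetical task record: metadata wrapping one action."""
    task_id: int
    client_id: str           # which GridTorrent client should act
    action: str              # what to do, e.g. "download" (assumed name)
    peer: str                # with whom
    kind: str = "request"    # "request" or "response"
    periodic: bool = False   # periodic or non-periodic
    status: str = "pending"


class TrackerSketch:
    """Stand-in for the tracker hub; a dict replaces the database."""

    def __init__(self) -> None:
        self._tasks: Dict[int, Task] = {}

    def submit(self, task: Task) -> None:
        self._tasks[task.task_id] = task

    def get_available_tasks(self, client_id: str) -> List[Task]:
        # Deliver each pending task for this client exactly once.
        delivered = [
            t for t in self._tasks.values()
            if t.client_id == client_id and t.status == "pending"
        ]
        for task in delivered:
            task.status = "delivered"
        return delivered

    def update_status(self, task_id: int, status: str) -> None:
        self._tasks[task_id].status = status

    def status_of(self, task_id: int) -> str:
        return self._tasks[task_id].status


tracker = TrackerSketch()
tracker.submit(Task(task_id=1, client_id="clientA", action="download", peer="clientB"))

for task in tracker.get_available_tasks("clientA"):  # the client asks for tasks
    print(f"clientA: {task.action} from {task.peer}")
    tracker.update_status(task.task_id, "done")      # the tracker updates records
```

Keeping the client in a polling role means the tracker never has to open a connection to a peer, which fits the loosely coupled hub described on the WS-Tracker Service slide.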

GridTorrent Client
  • Modular architecture
    • Provides extensibility and flexibility
  • Built on a P2P file sharing protocol
    • Enables efficient use of idle resources
  • Provides adequate security
    • Authentication
    • Authorization
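The security slides in this deck describe a two-step check: a peer is first authenticated and checked against an access control list, then receives a data port number and a passkey that it must present again before the download starts. The sketch below is one possible reading of that flow. The shared-secret credential, the class and method names, and the port number are assumptions; the transcript does not say how authentication is actually performed.

```python
import hmac
import secrets
from typing import Dict, Optional, Set, Tuple


class PeerSecuritySketch:
    """Hypothetical security module of the serving peer (PeerB)."""

    def __init__(self, credentials: Dict[str, str], acl: Set[str], data_port: int) -> None:
        self._credentials = credentials      # peer id -> shared secret (assumed scheme)
        self._acl = acl                      # peers allowed to fetch this content
        self._data_port = data_port          # kept private until A&A succeeds
        self._passkeys: Dict[str, str] = {}  # saved for the second verification

    def authenticate_and_authorize(self, peer_id: str, secret: str) -> Optional[Tuple[int, str]]:
        expected = self._credentials.get(peer_id)
        if expected is None or not hmac.compare_digest(expected.encode(), secret.encode()):
            return None                      # authentication failed: reject
        if peer_id not in self._acl:
            return None                      # not in ACL: reject
        passkey = secrets.token_hex(16)
        self._passkeys[peer_id] = passkey
        return self._data_port, passkey

    def verify_passkey(self, peer_id: str, passkey: str) -> bool:
        saved = self._passkeys.get(peer_id)
        return saved is not None and hmac.compare_digest(saved.encode(), passkey.encode())


peer_b = PeerSecuritySketch(
    credentials={"peerA": "s3cret", "peerC": "pw"},
    acl={"peerA"},
    data_port=6881,  # arbitrary example port
)

grant = peer_b.authenticate_and_authorize("peerA", "s3cret")
if grant is not None:
    port, passkey = grant
    # PeerA's data sharing module would now connect to `port` and send `passkey`.
    print("download allowed:", peer_b.verify_passkey("peerA", passkey))
```

Handing out the data port only after the first check keeps it unadvertised, so only the security port is exposed to unauthenticated peers.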

Note (GridTorrent Client): Utilizes regular and parallel stream connections (other transfer mechanisms could be used).

Security in GridTorrent Client
  • Only the Security Module port is public
  • Each peer has to be authenticated and authorized (A&A) before starting the download process
  • After a successful A&A, the peer receives the data port number and a passkey
  • Peers use the passkey for a second verification just before the download process
  • If everything is valid and successful, the actual data download starts

Security in GridTorrent Client-I
[Flowchart. Participants: PeerA's Security Module, PeerA's Data Sharing Module, PeerB's Security Module]
  1. PeerA starts the authentication process
  2. PeerB handles PeerA's request
  3. Authorization successful? (No: reject connection)
  4. PeerA in ACL? (No: reject connection)
  5. PeerB gives PeerA the data port number and a passkey, and saves the passkey for further use
  6. PeerA's Data Sharing Module connects to the received data port and sends the passkey to start the download process
  7. Passkey verification (No: reject connection)
  8. PeerB starts the data transfer process

Measurements and Analysis
  • The set of benchmarks
    • Performance
    • Overhead
  • PTCP was used as the transfer method for comparison
  • Test beds used in these benchmarks
    • LAN (Bloomington, IN-Indianapolis, IN)
    • WAN (Bloomington, IN-Tallahassee, FL)

LAN Test Setup
[Figure labels: PTCP, GridTorrent]

LAN Test Result

WAN Test-I Setup
[Figure labels: PTCP, GridTorrent]

WAN Test-I Result

WAN Test-II Setup
[Figure labels: PTCP, GridTorrent]

WAN Test-II Result

Evaluation of Test Results
  • GridTorrent provides better or the same performance on the WAN
  • PTCP reaches its maximum data transfer speed at 15 streams
  • Utilizing PTCP in GridTorrent yields a higher data transfer rate
  • Total size of the overhead messages is between 148 and 169 KB for transferring a 300 MB file
  • Scalability is not an issue due to the bulk data transfer characteristic
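The overhead figures above can be expressed as a share of the transferred file. This is a quick check, assuming binary units (1 MB = 1024 KB), which the slide does not specify.

```python
KB = 1024
MB = 1024 * KB  # assumption: binary units; the slide does not say


def overhead_percent(overhead_kb: float, payload_mb: float) -> float:
    """Overhead message volume as a percentage of the payload size."""
    return 100 * (overhead_kb * KB) / (payload_mb * MB)


low = overhead_percent(148, 300)
high = overhead_percent(169, 300)
print(f"overhead: {low:.3f}% to {high:.3f}% of the 300 MB file")
```

Either way the units are read, the control traffic stays at roughly 0.05 percent of the payload.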

Contributions
  • System research
    • A collaborative framework with a P2P-based data moving technique
      • Efficient, scalable, and modular
    • Integration with SOA to increase modularity, flexibility, and extensibility
    • Strategies for increasing performance and scalability
    • Unification of many useful techniques, such as reliable file transfer, third-party transfer, and disk allocation, in a simple but efficient way
    • Benchmarks to evaluate GridTorrent performance
  • System software
    • Design and implementation of an infrastructure consisting of the GridTorrent client, the WS-Tracker service, and the Collaborative framework

Future Works
  • Utilizing other high-performance low-level TCP- or UDP-based data transfer protocols in the data layer
  • Improving the existing P2P technique
  • Adapting the existing system to support dynamic (real-time) content
  • Developing and deploying an intelligent source selection algorithm in the WS-Tracker Service
  • Security
    • Security framework for the WS-Tracker Service if necessary
  • Transforming the Collaborative framework into portlets for reusability

[Architecture figure labels: Data Transfer Modules, Management Modules, Java PTCP Socket, Java WS Security, Data Sharing Algorithm Layer, Core Modules Layer, Grid Interface, Java TCP Socket, Security Layer, ..., Task Manager, WS-Tracker Client, Java CoG Kit, Security Manager, Torrent Data Sharing Logic]