Outline
• Introduction
• Background
• Motivation and Research Issues
• GridTorrent Framework Architecture
• Measurements and Analysis
• Contributions and Future Works
2/02/2009 2
Data, Data, more Data
• Computational science is becoming data intensive
• Scientists are faced with mountains of data that stem from three sources [1]:
  1. New scientific instruments, whose data generation grows monotonically
  2. Simulations, which generate floods of data
  3. The Internet and computational Grid, which allow the replication, creation, and recreation of more data [2]
Data, Data, more Data (cont.)
• Scientific discovery increasingly driven by data collection [3]
  – Computationally intensive analyses
  – Massive data collections
  – Data distributed across networks of varying capability
  – Internationally distributed collaborations
• Data Intensive Science: 2000-2020 [4]
  – Dominant factor: data growth (1 Petabyte = 1000 TB)
  – 2000: ~0.5 Petabyte
  – 2007: ~10 Petabytes
  – 2013: ~100 Petabytes
  – 2020: ~1000 Petabytes?
Scientific Application Examples
• Scientific applications that generate petabytes of data are very diverse:
  – Fusion power
  – Climate modeling
  – Astronomy
  – High-energy physics
  – Bioinformatics
  – Earthquake engineering
Scientific Application Examples (cont.)
Some examples:
• Climate modeling
  – Community Climate System Model and other simulation applications generate 1.5 petabytes/year
• Bioinformatics
  – The Pacific Northwest National Laboratory is building new confocal microscopes that will generate 5 petabytes/year
• High-energy physics
  – The Large Hadron Collider (LHC) project at CERN will create 100 petabytes/year
Background
• Systems for transferring bulk data
  – Network level solutions
  – System level solutions
  – Application level solutions
Background (cont.)
[Figure: axis labels Cost and Prevalence]
Network Level Solutions
• Network Attached Storage (NAS)
  – File-level storage system attached to a traditional network
  – Uses higher-level protocols
  – Does not allow direct access to individual storage
  – Simpler and more economical solution than SAN
• Storage Area Network (SAN)
  – Storage devices attached directly to LAN
  – Utilizes low-level network protocols (Fibre Channel)
  – Handles large data transfers
  – Provides better performance
System Level Solutions
- Require modifications to
  – the operating system of the machine
  – the network apparatus
  – or both
+ Yield very good performance
- Expensive solutions
- Not applicable to every system
Example: Group Transport Protocol for Lambda-Grids (GTP)
Application Level Solutions
+ Use parallel streaming to improve performance
+ Tweak TCP buffer size to improve performance
+ Require no modifications to underlying systems
+ Inexpensive
+ Prevalent use
+- May require auxiliary component for data management
- May not be as fast as Network/System level solutions
Types of application level solutions
• TCP-based solutions
• UDP-based solutions
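Parallel streaming, listed above as the first advantage, divides one transfer among several connections. The following is a minimal sketch of only the range-splitting step, assuming each stream carries one contiguous byte range; the function name `split_into_streams` is hypothetical and does not come from any of the tools named in these slides.

```python
def split_into_streams(total_bytes: int, num_streams: int) -> list[tuple[int, int]]:
    """Split [0, total_bytes) into contiguous (start, end) ranges, one per stream."""
    if total_bytes < 0 or num_streams < 1:
        raise ValueError("total_bytes must be >= 0 and num_streams >= 1")
    base, extra = divmod(total_bytes, num_streams)
    ranges = []
    start = 0
    for index in range(num_streams):
        # The first `extra` streams carry one additional byte each.
        length = base + (1 if index < extra else 0)
        ranges.append((start, start + length))
        start += length
    return ranges


# Example: a 300 MB file (the size used later in the overhead measurement) over 4 streams.
stream_ranges = split_into_streams(300 * 1024 * 1024, 4)
```

Each stream would then send its own range over a separate connection, and the receiver reassembles the ranges by offset.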
TCP-Based Solutions
+ Harness the good features of TCP
  + Reliability
  +- Built-in congestion control mechanism (TCP window)
+ Require no changes to existing systems
+ Easy to implement
+ Prevalent use
- Not suitable for real-time applications
Examples: GridFTP, GridHTTP, bbFTP and bbcp
• Use mainly FTP or HTTP as base protocol
UDP-Based Solutions
+ Small segment header overhead (8 vs. 20 bytes)
- Unreliable
+- Require additional mechanisms for reliability and congestion control (at application level)
  + May overcome existing problems of TCP
  + May make UDP faster
  - Integration with existing systems requires some changes and effort
Examples: SABUL, UDT, FOBS, RBUDP, Tsunami, and UFTP
• Utilize mainly rate-based control mechanisms
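To put the 8 vs. 20 byte header difference in perspective, the sketch below computes the payload that fits in one packet for each transport. It assumes a 1500-byte Ethernet MTU and a 20-byte IPv4 header without options; the constant and function names are illustrative only.

```python
MTU = 1500          # bytes, assumed Ethernet MTU
IP_HEADER = 20      # bytes, assumed IPv4 header without options
TCP_HEADER = 20     # bytes, TCP header without options
UDP_HEADER = 8      # bytes


def payload_per_packet(transport_header: int) -> int:
    """Bytes of application data that fit in one MTU-sized packet."""
    return MTU - IP_HEADER - transport_header


tcp_payload = payload_per_packet(TCP_HEADER)   # 1460
udp_payload = payload_per_packet(UDP_HEADER)   # 1472
relative_gain = udp_payload / tcp_payload - 1  # roughly 0.8 percent
```

Under these assumptions the smaller header alone buys less than one percent of extra payload, which suggests that any larger speedup of UDP-based tools has to come from their application-level rate control.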
Auxiliary Components
• Used for file indexing and discovery
• GridFTP utilizes the Replica Location Service (RLS)
  – Local Replica Catalogs (LRCs)
  – Replica Location Indices (RLIs)
  – LRCs send information about their state to RLIs using soft state protocols
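The soft state idea mentioned above can be illustrated with a small in-memory index: entries reported by an LRC expire unless they are refreshed. This is only a sketch under that assumption, not the RLS implementation; the class name, method names, and the time-to-live value are hypothetical.

```python
class ReplicaLocationIndex:
    """Toy RLI: remembers which LRC reported a logical name, until the entry expires."""

    def __init__(self, ttl_seconds: float):
        self.ttl_seconds = ttl_seconds
        # logical name -> {lrc_id: expiry time}
        self._entries: dict[str, dict[str, float]] = {}

    def soft_state_update(self, lrc_id: str, logical_names: list[str], now: float) -> None:
        """Record (or refresh) the names an LRC currently holds."""
        for name in logical_names:
            self._entries.setdefault(name, {})[lrc_id] = now + self.ttl_seconds

    def lookup(self, logical_name: str, now: float) -> list[str]:
        """Return the LRCs whose report for this name has not expired."""
        holders = self._entries.get(logical_name, {})
        return sorted(lrc for lrc, expiry in holders.items() if expiry > now)


rli = ReplicaLocationIndex(ttl_seconds=30)
rli.soft_state_update("lrc-a", ["dataset-1"], now=0)
fresh = rli.lookup("dataset-1", now=10)   # entry still valid
stale = rli.lookup("dataset-1", now=31)   # entry expired without a refresh
```

Because stale entries simply time out, the index does not need an explicit delete message when a catalog disappears.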
Motivation and Research Issues
Problems of Existing Solutions
• Built on the client/server model
  – Why not P2P?
• Utilize mainly FTP/HTTP-type protocols
  – Suffer from the drawbacks of FTP/HTTP
  – Modification is very difficult
• Require some vital services to be built as separate modules
• Use existing system resources inefficiently
Comparison of BitTorrent and GridTorrent's Architecture
BitTorrent | GridTorrent | Reason
P2P data-sharing protocol | P2P data-sharing protocol | No change
Simple HTTP Client | SOA-based Tracker Client | To enable advanced operations exchange with WS-Tracker Service
- | Task Manager | To enable execution of advanced operations in Client such as remote sharing and ACL
Web Server based Tracker | Advanced SOA-based Tracker | To allow the system to build and to handle complex actions required by scientific community
- | Security Manager | To provide authentication and authorization mechanism
- | Collaboration and Content Manager | To empower users to control access rights to their content, to start remote sharing and downloading processes, and to permit interactions between them
- | Supporting Multiple Streams | To further improve data transmission performance
Collaboration and Content Manager
• An interface between users and the system
• Capabilities:
  – Share content
  – Browse content
  – Download content
  – Add/remove group
  – Add/remove users for a particular content (Access Right Controls)
  – Add/remove users for a particular group (Access Right Controls)
• Everything is metadata
WS-Tracker Service component of GridTorrent Framework Architecture
WS-Tracker Service
• The communication hub of the system
• Loosely coupled, flexible and extensible
• Delivers tasks to GridTorrent clients
• Updates task status in database
• Stores and serves .torrent files
[Figure: interactions among the GridTorrent Client, the WS-Tracker Service, and the Database. Arrow labels: Ask for tasks, Get Available Tasks, Deliver Task, Deliver .torrent file, Update Records]
Task
• A task is simply metadata (wrapped actions)
  – Request
  – Response
  – Periodic
  – Non-periodic
• Instructs a GridTorrent client what to do with whom
• Created by users
• Exchanged between WS-Tracker service and GridTorrent client
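Since a task is described as metadata that wraps an action, one way to picture it is as a small record. The sketch below is an assumption for illustration only: the field names are hypothetical and the actual task format is defined by the slides' own figure, which is not reproduced in this text. The example values are taken from the "Task List Request" row of the tasks overview.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    REQUEST = "request"
    RESPONSE = "response"


@dataclass(frozen=True)
class Task:
    """Hypothetical task record; field names are assumptions, not the real format."""
    name: str
    creator: str
    source: str
    destination: str
    kind: Kind
    periodic: bool


# Values as listed for task no. 1 in the tasks overview.
task_list_request = Task(
    name="Task List Request",
    creator="GTFC",
    source="GTFC",
    destination="WS-Tracker",
    kind=Kind.REQUEST,
    periodic=True,
)
```

Keeping the record immutable reflects the idea that a task is exchanged as-is between the WS-Tracker service and a client.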
Task Format
Tasks overview
No | Task Name | Creator | Source | Destination | Category
1 | Task List Request | GTFC | GTFC | WS-Tracker | request, periodic
2 | Share Content Request | User | WS-Tracker | GTFC | request, nonperiodic
3 | Share Content Response | GTFC | GTFC | WS-Tracker | response, nonperiodic
4 | Download Content Request | User | WS-Tracker | GTFC | request, nonperiodic
5 | Download Content Response | GTFC | GTFC | WS-Tracker | response, periodic
6 | ACL Request | GTFC | GTFC | WS-Tracker | request, periodic
7 | ACL Response | User | WS-Tracker | GTFC | response
8 | Update Status | GTFC | GTFC | WS-Tracker | periodic
GridTorrent Client component of GridTorrent Framework Architecture
GridTorrent Client
• Modular architecture
  – Provides extensibility and flexibility
• Built on a P2P file sharing protocol
  – Enables efficient use of idle resources
• Provides adequate security
  – Authentication
  – Authorization
[Figure: GridTorrent client architecture]
• Layers: Data Sharing Algorithm Layer, Core Modules Layer, Security Layer, Grid Interface
• Components: Torrent Data Sharing Logic; Data Transfer Modules (Java TCP Socket, Java PTCP Socket, ...); Management Modules (Task Manager, WS-Tracker Client); Security Manager; Java WS Security; Java CoG Kit
• Utilizes regular and parallel stream connections (other transfer mechanisms could be used)
[Flowchart: interaction between PeerA's Security Module, PeerB's Security Module, and PeerA's Data Sharing Module]
1. PeerA starts the authentication process
2. PeerB handles PeerA's request
3. Authorization successful? If no, reject connection
4. PeerA in ACL? If no, reject connection
5. PeerB gives PeerA the data port number and a passkey, and saves the passkey for further use
6. PeerA connects to the received data port and sends the passkey to start the download process
7. Passkey verification. If it fails, reject connection
8. PeerB starts the data transfer process
Security in GridTorrent Client
• Only the security port number on which the Security Manager listens is publicly known to other peers
• Each peer has to be authenticated and authorized (A&A) before starting the download process
• After a successful A&A, the peer receives the data port number and a passkey
• Peers use the passkey for a second verification just before the download process
• If everything is valid and successful, the actual data download starts
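The two-step check described above (A&A first, passkey verification just before the download) might be sketched as follows. This is an illustration under stated assumptions rather than the GridTorrent code: authentication itself is reduced to a boolean supplied by the caller, and the class name, method names, and port number are hypothetical.

```python
import hmac
import secrets


class SecurityManager:
    """Toy model of the peer that serves data (PeerB in the flowchart)."""

    def __init__(self, acl: set[str], data_port: int):
        self.acl = set(acl)
        self.data_port = data_port
        self._passkeys: dict[str, str] = {}

    def authorize(self, peer_id: str, authenticated: bool):
        """Return (data_port, passkey) after a successful A&A, or None to reject."""
        if not authenticated or peer_id not in self.acl:
            return None
        passkey = secrets.token_hex(16)
        self._passkeys[peer_id] = passkey  # saved for the second verification
        return self.data_port, passkey

    def verify_passkey(self, peer_id: str, presented: str) -> bool:
        """Second check, performed when the peer connects to the data port."""
        expected = self._passkeys.get(peer_id)
        if expected is None:
            return False
        return hmac.compare_digest(expected.encode(), presented.encode())


peer_b = SecurityManager(acl={"PeerA"}, data_port=50000)  # port value is arbitrary
grant = peer_b.authorize("PeerA", authenticated=True)
```

The data port is disclosed only inside `grant`, which mirrors the point that only the security port is publicly known.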
Measurements and Analysis
• The set of benchmarks
  – Performance
  – Overhead
• PTCP transfer method used for comparison
  – Parallel streaming is one of the major performance improvement methods
  – It has a structure similar to GridTorrent's
• Test beds used in these benchmarks
  – LAN (Bloomington, IN to Indianapolis, IN)
  – WAN (Bloomington, IN to Tallahassee, FL)
Modeling of PTCP and GridTorrent
[Figure: PTCP with 3 streams; GridTorrent with 3 sources]
LAN Test Setup
[Figure: PTCP; GridTorrent]
Theoretical and Practical Limits
• RTT = 0.30 ms
• Theoretical bandwidth = 1000 Mbps
• Maximum TCP bandwidth = 0.9493 × 1000 = 949 Mbps
  – Ethernet's Maximum Transmission Unit = 1500 bytes
  – TCP header = 20 bytes
  – IP header = 20 bytes
  – Ethernet's additional per-frame overhead (preamble, header, frame check sequence, inter-frame gap) = 38 bytes
  – U = (1500 - 20 - 20) / (1500 + 38) = 0.94928
• Measured bandwidth with Iperf = 857 Mbps
  – Server side: iperf -s -w 256k
  – Client side: iperf -c <hostname> -w 512k -P 50
  – http://www.noc.ucf.edu/Tools/Iperf/
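The utilization figure above can be checked with a few lines of arithmetic. The sketch uses only the numbers given on this slide; the function and variable names are illustrative.

```python
def tcp_utilization(mtu: int = 1500, tcp_header: int = 20,
                    ip_header: int = 20, ethernet_overhead: int = 38) -> float:
    """Fraction of the raw link rate left for TCP payload."""
    return (mtu - tcp_header - ip_header) / (mtu + ethernet_overhead)


utilization = tcp_utilization()          # 1460 / 1538
max_tcp_mbps = utilization * 1000        # on a 1000 Mbps link
measured_fraction = 857 / max_tcp_mbps   # share of the limit reached by Iperf
```

With these inputs the measured 857 Mbps is about 90 percent of the computed TCP limit.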
LAN Test Result (RTT = 0.30 ms)
[Figure: throughput (Mbps, axis 0 to 100) versus number of streams/sources (axis 0 to 16) for PTCP and GridTorrent]
WAN Test-I Setup
[Figure: PTCP; GridTorrent with regular socket]
Theoretical and Practical Limits
• RTT = 50 ms
• Theoretical bandwidth = 1000 Mbps
• Maximum TCP bandwidth = 0.9493 × 1000 = 949 Mbps
• Measured bandwidth with Iperf = 30.2 Mbps
  – Server side: iperf -s -w 256k
  – Client side: iperf -c <hostname> -w 256k -P 50
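The round-trip time matters here because a single TCP stream cannot send more than one window of data per round trip. The sketch below computes that ceiling, assuming the `-w 256k` setting means a 256 KiB window that the operating system grants unchanged; it is a back-of-the-envelope bound and does not predict the measured value.

```python
def window_limited_mbps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on one TCP stream's throughput: one window per round trip."""
    return window_bytes * 8 / rtt_seconds / 1e6


WINDOW_BYTES = 256 * 1024  # assumed meaning of "-w 256k"

wan_per_stream = window_limited_mbps(WINDOW_BYTES, 0.050)    # RTT = 50 ms
lan_per_stream = window_limited_mbps(WINDOW_BYTES, 0.00030)  # RTT = 0.30 ms
```

On the WAN the ceiling is about 42 Mbps per stream, far below the link rate, whereas on the LAN the same window allows several Gbps and is therefore not the limiting factor. This is one reason parallel streams or multiple sources are expected to matter more on high-latency paths.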
WAN Test-I Result (RTT = 50 ms)
[Figure: throughput (Mbps, axis 0 to 120) versus number of streams/sources (axis 0 to 16) for PTCP and GridTorrent]
WAN Test-II Setup
[Figure: PTCP; GridTorrent with 4 parallel sockets]
WAN Test-II Result (RTT = 50 ms)
[Figure: throughput (Mbps, axis 0 to 150) versus number of streams/sources (axis 0 to 16) for PTCP and GridTorrent]
Evaluation of Test Results
• GridTorrent provides better or the same performance on WAN
• PTCP reaches maximum data transfer speed at 15 streams
• Utilizing PTCP in GridTorrent yields a higher data transfer rate
• Total size of the overhead messages is between 148 and 169 KB for transferring a 300 MB file
• Scalability is not an issue due to the bulk data transfer characteristic
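To see how small the reported message overhead is relative to the payload, the sketch below expresses it as a percentage of the 300 MB file. It assumes 1 MB = 1024 KB; the names are illustrative.

```python
FILE_SIZE_KB = 300 * 1024  # 300 MB file, assuming 1 MB = 1024 KB


def overhead_percent(overhead_kb: float) -> float:
    """Overhead messages as a percentage of the transferred file size."""
    return overhead_kb / FILE_SIZE_KB * 100


low_percent = overhead_percent(148)
high_percent = overhead_percent(169)
```

Under that assumption the overhead stays between roughly 0.05 and 0.06 percent of the file size.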
Characteristics of Participation in Scientific Community
• Number of participants is on the scale of 10s, 100s, 1000s
• Fully distributed
• Team work
• CERN: The European Organization for Nuclear Research
  – The world's largest particle physics laboratory
  – Supported by twenty European member states
  – Currently the workplace of approximately 2,600 full-time employees
  – Some 7,931 scientists and engineers representing 580 universities and research facilities
  – 80 nationalities
Advantages of GridTorrent
• More peers, more available services
• Unlike the client/server model, more peers mitigate the load on the server
• Optimal resource usage
  – Computing power
  – Storage space
  – Bandwidth
• Very efficient for replica systems
• P2P networks are more scalable than the client/server model
• Reliable file transfer
  – Resume capability when data transfer is interrupted
• Third-party transfer
• Disk allocation before actual data transfer
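Transmission sequence matrix of PTCP
(- marks a cell left blank on the slide)
Time (sec) | S-C1 | S-C2 | S-C3 | C1 | C2 | C3
1 | N1 | - | - | N1 | - | -
2 | N2 | - | - | N1,N2 | - | -
3 | N3 | - | - | N1,N2,N3 | - | -
4 | - | N1 | - | - | N1 | -
5 | - | N2 | - | - | N1,N2 | -
6 | - | N3 | - | - | N1,N2,N3 | -
7 | - | - | N1 | - | - | N1
8 | - | - | N2 | - | - | N1,N2
9 | - | - | N3 | - | - | N1,N2,N3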
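Transmission sequence matrix of GridTorrent
(- marks a cell left blank on the slide)
Time (sec) | S-C1 | S-C2 | S-C3 | C1-C2 | C2-C3 | C1-C3 | C1 | C2 | C3
1 | N1 | - | - | - | - | - | N1 | - | -
2 | - | N2 | - | - | - | N1 | N1 | N2 | N1
3 | - | - | N3 | N1 | - | - | N1 | N1,N2 | N1,N3
4 | - | - | - | N2 | N2 | - | N1,N2 | N1,N2 | N1,N2,N3
5 | - | - | - | - | N3 | N3 | N1,N2,N3 | N1,N2,N3 | -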
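The two matrices above can be replayed to compare completion times. The sketch below is only an illustration: the GridTorrent schedule encodes one reading of the flattened matrix (which link carries each piece is an assumption where the column alignment was lost), and the function name is hypothetical.

```python
PIECES = ("N1", "N2", "N3")
CLIENTS = ("C1", "C2", "C3")


def completion_time(schedule):
    """Replay per-second (sender, receiver, piece) transfers; return the second
    at which every client holds every piece, or None if that never happens."""
    held = {client: set() for client in CLIENTS}
    held["S"] = set(PIECES)  # the server starts with the whole file
    for second, transfers in enumerate(schedule, start=1):
        # A sender may only forward pieces it held before this second began.
        for sender, _receiver, piece in transfers:
            if piece not in held[sender]:
                raise ValueError(f"{sender} does not hold {piece} at second {second}")
        for _sender, receiver, piece in transfers:
            held[receiver].add(piece)
        if all(held[client] == set(PIECES) for client in CLIENTS):
            return second
    return None


# PTCP matrix: the server sends one piece per second, client after client.
ptcp_schedule = [[("S", client, piece)] for client in CLIENTS for piece in PIECES]

# GridTorrent matrix (assumed reading): peers forward pieces to each other.
gridtorrent_schedule = [
    [("S", "C1", "N1")],
    [("S", "C2", "N2"), ("C1", "C3", "N1")],
    [("S", "C3", "N3"), ("C1", "C2", "N1")],
    [("C2", "C1", "N2"), ("C2", "C3", "N2")],
    [("C3", "C1", "N3"), ("C3", "C2", "N3")],
]

print(completion_time(ptcp_schedule), completion_time(gridtorrent_schedule))  # prints: 9 5
```

In this reading the server uploads each piece only once in the GridTorrent case, and the remaining copies travel between clients, which is where the shorter completion time comes from.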
Contributions
• System research
  – A collaborative framework with a P2P-based data moving technique
  – Efficient, scalable and modular
  – Integrating with SOA to increase modularity, flexibility and extensibility
  – Strategies for increasing performance and scalability
  – Unification of many useful techniques such as reliable file transfer, third-party transfer and disk allocation in a simple but efficient way
  – Benchmarks to evaluate GridTorrent performance
• System software
  – Designing and implementing an infrastructure consisting of the GridTorrent client, WS-Tracker service, and Collaborative framework
Future Works
• Utilizing other high-performance low-level TCP or UDP based data transfer protocols in the data layer
• Improving the existing P2P technique
• Certification handling service for different certificates
• Adapting the existing system to support dynamic (real-time) content
• Developing and deploying an intelligent source selection algorithm in the WS-Tracker Service
• Security
  – Security framework for WS-Tracker Service if necessary
• Transforming the Collaborative framework into portlets for reusability
References
1. Bell, G.; Gray, J.; Szalay, A. Petascale computational systems. Computer, Volume 39, Issue 1, Jan. 2006, pp. 110-112
2. Graham, S.L.; Snir, M.; Patterson, C.A. (eds). Getting Up To Speed: The Future of Supercomputing. NAE Press, 2004, ISBN 0-309-09502-6
3. Foster, I. Overview of Grid Computing. http://www-fp.mcs.anl.gov/~foster/Talks/ResearchLibraryGroupGridsApril2002.ppt, last seen 2007
4. Science-Driven Network Requirements for ESnet. http://www.es.net/ESnet4/Case-Study-Requirements-Update-With-Exec-Sum-v5.doc, last seen 2007
1. Create MyFile.torrent
2. Upload MyFile.torrent
3. Join to Tracker
4. Find and obtain MyFile.torrent
5. Join Tracker Node
6. Tracker Node replies with list of peers = {Seed Node}
7. Download pieces of content
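The tracker's role in the steps above (a peer joins, and the tracker replies with the peers already known for that torrent) can be sketched with an in-memory registry. This is a simplification for illustration: swarms are keyed by the .torrent file name, whereas real BitTorrent trackers identify a swarm by a hash of the torrent's metadata; the class name, method name, and the peer name "Downloader" are hypothetical.

```python
class Tracker:
    """Toy tracker: remembers which peers have joined each torrent's swarm."""

    def __init__(self):
        self._swarms: dict[str, list[str]] = {}

    def announce(self, torrent_name: str, peer: str) -> list[str]:
        """Register the peer and return the other peers already in the swarm."""
        swarm = self._swarms.setdefault(torrent_name, [])
        others = [known for known in swarm if known != peer]
        if peer not in swarm:
            swarm.append(peer)
        return others


tracker = Tracker()
seed_reply = tracker.announce("MyFile.torrent", "Seed Node")         # no peers yet
downloader_reply = tracker.announce("MyFile.torrent", "Downloader")  # learns about the seed
```

After the second announce the downloader knows where to fetch pieces from, and the tracker itself never handles file content.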