Portable, Scalable, and High-Performance I/O Forwarding on Massively Parallel Systems
Computation and I/O Performance Imbalance
Leadership-class computational scale:
– >100,000 processes
– Multi-core architectures
– Lightweight operating systems on compute nodes
Leadership-class storage scale:
– >100 servers
– Cluster file systems
– Commercial storage hardware
Compute and storage imbalance in current leadership-class systems hinders application I/O performance:
– 1 GB/s of storage throughput for every 10 TF of computation (performance gap)
– The gap has increased by a factor of 10 in recent years
DOE FastOS2 I/O Forwarding Scalability Layer (IOFSL) Project
Goal: Design, build, and distribute a scalable, unified high-end computing I/O forwarding software layer that would be adopted by the DOE Office of Science and NNSA.
– Reduce the number of file system operations that the parallel file system handles
– Provide function shipping at the file system interface level
– Offload file system functions from simple or full OS client processes to a variety of targets
– Support multiple parallel file system solutions and networks
– Integrate with MPI-IO and any hardware features designed to support efficient parallel I/O
Outline
I/O Forwarding Scalability Layer (IOFSL) Overview
IOFSL Deployment on Argonne's IBM Blue Gene/P Systems
IOFSL Deployment on Oak Ridge's Cray XT Systems
Optimizations and Results
– Pipelining in IOFSL
– Request Scheduling and Merging in IOFSL
– IOFSL Request Processing
Future Work and Summary
HPC I/O Software Stack
IOFSL Architecture
Client
- MPI-IO using the ZoidFS ROMIO interface (a minimal client-side MPI-IO sketch follows after the architecture diagram)
- POSIX using libsysio or FUSE
Network
- Transmit messages using BMI over TCP/IP, MX, IB, Portals, and ZOID
- Messages encoded using XDR
Server
- Delegates I/O to backend file systems using native drivers or libsysio
[Architecture diagram: on the client processing node, ROMIO, libsysio, and FUSE sit on top of the ZOIDFS client and its network API; across the system network, the I/O forwarding server's network API feeds the ZOIDFS server, which drives PVFS, POSIX/libsysio, Lustre, and GPFS backends.]
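From the application's point of view the forwarding layer is transparent: the program uses plain MPI-IO, and ROMIO's ZoidFS driver ships the resulting operations to the IOFSL server on the I/O node. The following is a minimal client-side sketch under that assumption; the file path and per-rank transfer size are illustrative, not taken from the slides.

/* Minimal MPI-IO client: each rank writes one block at a rank-dependent
 * offset.  With IOFSL in the path, ROMIO forwards these calls through
 * ZoidFS instead of touching the parallel file system directly.
 * The path "/scratch/out.dat" and the 1 MiB block size are illustrative. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    int rank;
    static char buf[1 << 20];    /* 1 MiB of data per process */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File_open(MPI_COMM_WORLD, "/scratch/out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Rank-dependent offset; the I/O node aggregates these requests. */
    MPI_Offset offset = (MPI_Offset)rank * sizeof(buf);
    MPI_File_write_at(fh, offset, buf, sizeof(buf), MPI_BYTE,
                      MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}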
Argonne’s IBM Blue Gene/P Systems
Figure courtesy of Robert Ross, ANL
IOFSL Deployment on Argonne’s IBM Blue Gene/P Systems
[Deployment diagram: compute nodes run the IOFSL and ZOID clients and reach the I/O nodes (IONs) over the tree network; the IONs run the IOFSL and ZOID servers and act as PVFS2 and GPFS clients, reaching the PVFS2 and GPFS storage servers over a 10 Gbit Ethernet network.]
Initial IOFSL Results on Argonne’s IBM Blue Gene/P Systems
[Figure: IOR read and IOR write average bandwidth (MiB/s) versus message size (4 to 65536 KiB), comparing CIOD and IOFSL.]
Initial IOFSL Results on Argonne’s IBM Blue Gene/P Systems
[Figure: average bandwidth (MiB/s) versus number of clients (64 to 1024), comparing CIOD (non-collective, t=8M) and IOFSL (TASK mode, t=8M).]
Oak Ridge’s Cray XT Systems
Enterprise Storage: controllers and large racks of disks are connected via InfiniBand. 48 DataDirect S2A9900 controller pairs with 1 TByte drives and 4 InfiniBand connections per pair.
Storage Nodes: run parallel file system software and manage incoming FS traffic. 192 dual quad-core Xeon servers with 16 GBytes of RAM each.
SION Network: provides connectivity between OLCF resources and primarily carries storage traffic. 3000+ port 16 Gbit/sec InfiniBand switch complex.
Lustre Router Nodes: run parallel file system client software and forward I/O operations from HPC clients. 192 (XT5) and 48 (XT4) single dual-core Opteron nodes with 8 GB of RAM each.
[Diagram: Jaguar XT5 (SeaStar2+ 3D torus at 9.6 GBytes/sec), Jaguar XT4, and other systems (Viz, Clusters) attach through 16 Gbit/sec InfiniBand to the storage infrastructure; annotated aggregate bandwidths are 384 GBytes/s (three links), 96 GBytes/s, and 366 GBytes/s, with the disks attached over Serial ATA at 3 Gbit/sec.]
Figure courtesy of Galen Shipman, ORNL
IOFSL Deployment on Oak Ridge’s Cray XT Systems
[Deployment diagram: compute nodes run the IOFSL clients and reach the IOFSL servers on the Lustre router nodes over a TCP network; the router nodes run the Lustre clients and reach the Lustre storage servers over the 16 Gbit InfiniBand network.]
Initial IOFSL Results on Oak Ridge’s Cray XT Systems
[Figure: average bandwidth (MiB/s) versus number of clients (128 to 4096), comparing IOFSL (TASK mode, t=8M) and the native XT4 (non-collective, t=8M).]
IOFSL Optimization #1: Pipeline Data Transfers
Motivation
– Limits on the amount of memory available on I/O nodes
– Limits on the number of posted network operations
– Need to overlap network operations and file system operations for sustained throughput
Solution: Pipeline data transfers between the IOFSL client and server (a sketch of the segmentation step follows below)
– Negotiate the pipeline transfer buffer size
– Data buffers are aggregated or segmented at the negotiated buffer size
– Issue network transfer requests for each pipeline buffer
– Reformat pipeline buffers into the original buffer sizes
Serial and parallel pipeline modes are currently supported
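To make the segmentation step concrete, here is a small illustrative loop, not IOFSL code: the post_transfer callback and the function name are hypothetical, and a real server would keep several pipeline buffers in flight rather than posting them one at a time.

/* Hypothetical sketch: split one large write into pipeline-sized chunks
 * and hand each chunk to a (stand-in) network transfer routine. */
#include <stddef.h>

typedef int (*post_transfer_fn)(const char *chunk, size_t len,
                                size_t file_offset);

static int pipeline_write(const char *buf, size_t total, size_t pipeline_size,
                          post_transfer_fn post_transfer)
{
    for (size_t off = 0; off < total; off += pipeline_size) {
        /* The last chunk may be shorter than the negotiated buffer size. */
        size_t len = (total - off < pipeline_size) ? total - off
                                                   : pipeline_size;
        int rc = post_transfer(buf + off, len, off);
        if (rc != 0)
            return rc;   /* propagate the transfer error */
    }
    return 0;
}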
Pipeline Data Transfer Results for Different IOFSL Server Configurations
[Figure: average bandwidth (MiB/s) versus pipeline buffer size (256 to 8192 MiB on the axis), comparing Server Config #1 (SM events) and Server Config #2 (TASK events).]
IOFSL Optimization #2: Request Scheduling and Merging
Request scheduling aggregates several requests into a bulk I/O request
– Reduces the number of client accesses to the file systems
– With pipeline transfers, overlaps network and storage I/O accesses
Two scheduling modes supported
– FIFO mode aggregates requests as they arrive
– Handle-Based Round-Robin (HBRR) mode iterates over all active file handles to aggregate requests
Request merging identifies and aggregates noncontiguous requests into contiguous requests (a range-merging sketch follows below)
– Brute Force mode iterates over all pending requests
– Interval Tree mode compares requests that are on similar ranges
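The merging idea can be illustrated with a small sketch that coalesces adjacent or overlapping byte ranges into larger contiguous requests. It is roughly in the spirit of the brute-force mode; the io_range structure and the sort-then-sweep approach are assumptions for illustration, not IOFSL's actual data structures.

/* Illustrative range merging: sort pending requests by file offset and
 * coalesce ranges that touch or overlap, so several small requests
 * become one contiguous request to the file system. */
#include <stdlib.h>

struct io_range {
    unsigned long long offset;   /* file offset in bytes */
    unsigned long long length;   /* request length in bytes */
};

static int cmp_offset(const void *a, const void *b)
{
    const struct io_range *x = a, *y = b;
    return (x->offset > y->offset) - (x->offset < y->offset);
}

/* Merges in place; returns the number of contiguous ranges that remain. */
static size_t merge_ranges(struct io_range *r, size_t n)
{
    if (n == 0)
        return 0;
    qsort(r, n, sizeof(*r), cmp_offset);

    size_t out = 0;
    for (size_t i = 1; i < n; i++) {
        if (r[i].offset <= r[out].offset + r[out].length) {
            /* Overlapping or adjacent: extend the current range. */
            unsigned long long end = r[i].offset + r[i].length;
            if (end > r[out].offset + r[out].length)
                r[out].length = end - r[out].offset;
        } else {
            r[++out] = r[i];     /* gap found: start a new range */
        }
    }
    return out + 1;
}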
IOFSL Request Scheduling and Merging Results with the IOFSL GridFTP Driver
[Figure: average bandwidth (MiB/s) versus number of clients (8 to 128), with and without request scheduling.]
[Diagram: MPI and other applications access IOFSL through MPI-IO or FUSE; the IOFSL server's GridFTP driver forwards their I/O over the WAN to GridFTP servers in front of a high-performance storage system and an archival storage system.]
IOFSL Optimization #3: Request Processing and Event Mode
Multi-Threaded Task Mode
– New thread for executing each I/O request
– Simple implementation
– Thread contention and scalability issues
State Machine Mode (see the sketch below)
– Use a fixed number of threads from a thread pool to execute I/O requests
– Divide I/O requests into smaller units of work
– The thread pool schedules I/O requests to run non-blocking units of work (data manipulation, pipeline calculations, request merging)
– Yield execution of I/O requests on blocking resource accesses (network communication, timer events, memory allocations)
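As a rough illustration of the state machine mode, the sketch below breaks one forwarded write into resumable steps; the state names, the structure, and the step function are hypothetical and only meant to show how a pooled worker runs non-blocking work and yields at blocking points.

/* Hypothetical state machine for one forwarded write request.  A pool
 * thread calls write_request_step() repeatedly; whenever the request
 * would block (network receive, backend I/O), the function returns so
 * the thread can pick up another request instead of waiting. */
enum write_state { WS_RECV_DATA, WS_MERGE, WS_ISSUE_IO, WS_REPLY, WS_DONE };

struct write_request {
    enum write_state state;
    /* buffers, file handle, offsets, ... */
};

/* Returns 1 while the request still has pending work, 0 when finished. */
static int write_request_step(struct write_request *req)
{
    switch (req->state) {
    case WS_RECV_DATA:
        /* post a non-blocking receive for the next pipeline buffer,
         * then yield until the data has arrived */
        req->state = WS_MERGE;
        return 1;
    case WS_MERGE:
        /* non-blocking work: request merging, pipeline bookkeeping */
        req->state = WS_ISSUE_IO;
        return 1;
    case WS_ISSUE_IO:
        /* hand the contiguous buffer to the backend file system and
         * yield until the I/O completes */
        req->state = WS_REPLY;
        return 1;
    case WS_REPLY:
        /* send the completion reply back to the client */
        req->state = WS_DONE;
        return 1;
    case WS_DONE:
    default:
        return 0;
    }
}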
IOFSL Request Processing and Event Mode: Argonne’s IBM Blue Gene/P Results
[Figure: average bandwidth (MiB/s) versus number of clients (64 to 1024), comparing CIOD (non-collective, t=8M), IOFSL TASK mode (t=8M), and IOFSL SM mode (t=8M).]
IOFSL Request Processing and Event Mode: Oak Ridge’s Cray XT4 Results
[Figure: average bandwidth (MiB/s) versus number of clients (128 to 4096), comparing IOFSL TASK mode (t=8M), IOFSL SM mode (t=8M), and the native XT4 (non-collective, t=8M).]
Current and Future Work
Scaling and tuning of IOFSL on IBM BG/P and Cray XT systems
Collaborative caching layer between IOFSL servers
Security infrastructure
Integrating IOFSL with end-to-end I/O tracing and visualization tools for the NSF HECURA IOVIS/Jupiter project
Project Participants and Support
Argonne National Laboratory: Rob Ross, Pete Beckman, Kamil Iskra, Dries Kimpe, Jason Cope
Los Alamos National Laboratory: James Nunez, John Bent, Gary Grider, Sean Blanchard, Latchesar Ionkov, Hugh Greenberg
Oak Ridge National Laboratory: Steve Poole, Terry Jones
Sandia National Laboratories: Lee Ward
University of Tokyo: Kazuki Ohta, Yutaka Ishikawa
The IOFSL project is supported by the DOE Office of Science and NNSA.
IOFSL Software Access, Documentation, and Links
IOFSL Project Website: http://www.iofsl.org
IOFSL Wiki and Developers Website: http://trac.mcs.anl.gov/projects/iofsl/wiki
Access to the IOFSL public git repository: git clone http://www.mcs.anl.gov/research/projects/iofsl/git iofsl
Recent publications:
K. Ohta, D. Kimpe, J. Cope, K. Iskra, R. Ross, and Y. Ishikawa, "Optimization Techniques at the I/O Forwarding Layer," IEEE Cluster 2010 (to appear).
D. Kimpe, J. Cope, K. Iskra, and R. Ross, "Grids and HPC: Not as Different as You Might Think," Para 2010 mini-symposium on Real-Time Access and Processing of Large Data Sets, April 2010.