Portable, Scalable, and High-Performance I/O Forwarding on Massively Parallel Systems 

[email protected]

Computation and I/O Performance Imbalance

• Leadership-class computational scale:
  – >100,000 processes
  – Multi-core architectures
  – Lightweight operating systems on compute nodes
• Leadership-class storage scale:
  – >100 servers
  – Cluster file systems
  – Commercial storage hardware
• Compute and storage imbalance in current leadership-class systems hinders application I/O performance
  – 1 GB/s of storage throughput for every 10 TF of computation
  – The performance gap has increased by a factor of 10 in recent years

DOE FastOS2 I/O Forwarding Scalability Layer (IOFSL) Project

Goal: Design, build, and distribute a scalable, unified high-end computing I/O forwarding software layer that would be adopted by the DOE Office of Science and NNSA.
– Reduce the number of file system operations that the parallel file system handles
– Provide function shipping at the file system interface level
– Offload file system functions from simple or full OS client processes to a variety of targets
– Support multiple parallel file system solutions and networks
– Integrate with MPI-IO and any hardware features designed to support efficient parallel I/O

Outline

• I/O Forwarding Scalability Layer (IOFSL) Overview
• IOFSL Deployment on Argonne’s IBM Blue Gene/P Systems
• IOFSL Deployment on Oak Ridge’s Cray XT Systems
• Optimizations and Results
  – Pipelining in IOFSL
  – Request Scheduling and Merging in IOFSL
  – IOFSL Request Processing
• Future Work and Summary

HPC I/O Software Stack

[Figure: the HPC I/O software stack]

• We need to make I/O software as efficient as possible

IOFSL Architecture

• Client
  – MPI-IO using the ZoidFS ROMIO interface
  – POSIX using libsysio or FUSE
• Network
  – Transmits messages using BMI over TCP/IP, MX, IB, Portals, and ZOID
  – Messages encoded using XDR
• Server
  – Delegates I/O to backend file systems using native drivers or libsysio (a hypothetical function-shipping sketch follows the figure below)

[Figure: IOFSL architecture. On the client processing node, ROMIO, libsysio, and FUSE sit on top of the ZOIDFS client, which crosses the system network via the network API to the ZOIDFS server on the I/O forwarding server; the server dispatches to PVFS, POSIX, and libsysio backends (e.g., Lustre and GPFS).]
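To make the function-shipping idea concrete, here is a minimal hypothetical C++ sketch of what a forwarded write might look like on the wire. The `ForwardedWrite` struct and `encode` helper are illustrative assumptions only; the real ZOIDFS interface is richer (list-I/O style operations), and its messages are XDR-encoded over BMI, as noted above.

```cpp
#include <cstdint>
#include <cstring>
#include <iostream>
#include <vector>

// Hypothetical request layout, for illustration only; the real ZOIDFS
// protocol and its XDR encoding differ. The key idea is that the client
// ships the *operation*, not individual POSIX calls.
struct ForwardedWrite {
    uint64_t handle;              // opaque file handle, resolved server-side
    uint64_t offset;              // file offset for the write
    std::vector<uint8_t> data;    // payload
};

// Flatten the operation into one message; the server decodes it and
// dispatches a single call to the backend driver (e.g., PVFS or libsysio).
std::vector<uint8_t> encode(const ForwardedWrite& req) {
    std::vector<uint8_t> buf(sizeof(req.handle) + sizeof(req.offset) +
                             req.data.size());
    std::memcpy(buf.data(), &req.handle, sizeof(req.handle));
    std::memcpy(buf.data() + sizeof(req.handle), &req.offset,
                sizeof(req.offset));
    std::memcpy(buf.data() + sizeof(req.handle) + sizeof(req.offset),
                req.data.data(), req.data.size());
    return buf;
}

int main() {
    ForwardedWrite w{42, 1024, std::vector<uint8_t>(4096, 0xab)};
    std::cout << "wire message: " << encode(w).size() << " bytes\n";
}
```

Because the whole operation travels as one message, the server is free to batch, schedule, and merge requests before touching the file system, which is what the optimizations later in the talk exploit.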

Argonne’s IBM Blue Gene/P Systems

Figure courtesy of Robert Ross, ANL

IOFSL Deployment on Argonne’s IBM Blue Gene/P Systems

[Figure: IOFSL deployment on Blue Gene/P. Compute nodes run the IOFSL and ZOID clients and reach the IOFSL and ZOID servers on the I/O nodes (IONs) over the tree network; the IONs act as PVFS2 and GPFS clients and connect to the PVFS2 and GPFS storage servers over a 10 Gbit Ethernet network.]

Initial IOFSL Results on Argonne’s IBM Blue Gene/P Systems

[Figure: IOR read (left) and IOR write (right) average bandwidth (MiB/s) vs. message size (4 KiB to 64 MiB), comparing CIOD and IOFSL]

Initial IOFSL Results on Argonne’s IBM Blue Gene/P Systems

[Figure: average bandwidth (MiB/s) vs. number of clients (64 to 1024), comparing CIOD (non-collective, t=8M) and IOFSL (TASK mode, t=8M)]

Oak Ridge’s Cray XT Systems

• Enterprise storage: controllers and large racks of disks connected via InfiniBand; 48 DataDirect S2A9900 controller pairs with 1 TB drives and 4 InfiniBand connections per pair; 3 Gbit/s Serial ATA to the disks (366 GB/s aggregate)
• Storage nodes: run the parallel file system software and manage incoming FS traffic; 192 dual quad-core Xeon servers with 16 GB of RAM each
• SION network: provides connectivity between OLCF resources and primarily carries storage traffic; 3000+ port, 16 Gbit/s InfiniBand switch complex with 384 GB/s aggregate paths (96 GB/s to the XT4)
• Lustre router nodes: run the parallel file system client software and forward I/O operations from HPC clients; 192 (XT5) and 48 (XT4) dual-core Opteron nodes with 8 GB of RAM each
• Compute platforms: Jaguar XT5 (9.6 GB/s SeaStar2+ 3D torus) and Jaguar XT4, plus other systems (viz, clusters)

Figure courtesy of Galen Shipman, ORNL

IOFSL Deployment on Oak Ridge’s Cray XT Systems

[Figure: IOFSL deployment on the Cray XT. Compute nodes run IOFSL clients and connect over a TCP network to IOFSL servers on the Lustre router nodes, which act as Lustre clients and reach the Lustre storage servers over a 16 Gbit InfiniBand network.]

Initial IOFSL Results on Oak Ridge’s Cray XT Systems

[Figure: average bandwidth (MiB/s) vs. number of clients (128 to 4096), comparing IOFSL (TASK mode, t=8M) and native XT4 (non-collective, t=8M)]

IOFSL Optimization #1: Pipeline Data Transfers

• Motivation
  – Limits on the amount of memory available on I/O nodes
  – Limits on the number of posted network operations
  – Need to overlap network operations and file system operations for sustained throughput
• Solution: pipeline data transfers between the IOFSL client and server (see the sketch below)
  – Negotiate the pipeline transfer buffer size
  – Data buffers are aggregated or segmented at the negotiated buffer size
  – Issue network transfer requests for each pipeline buffer
  – Reformat pipeline buffers into the original buffer sizes
• Serial and parallel pipeline modes are currently supported
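A minimal sketch of the serial pipeline mode, assuming hypothetical `recv_chunk` and `fs_write` primitives that stand in for a posted network receive and a backend file system write (the real IOFSL code paths differ): while pipeline buffer i is written to the file system, the receive for buffer i+1 is already in flight, giving the network/storage overlap motivated above.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

// Stub network receive and backend write so the sketch runs standalone.
std::vector<char> recv_chunk(std::size_t max_bytes) {
    return std::vector<char>(max_bytes, 'x');   // pretend network payload
}
void fs_write(const std::vector<char>& /*buf*/, std::size_t /*offset*/) {}

// Receive `total` bytes in pipeline buffers of at most `chunk` bytes,
// overlapping the receive of buffer i+1 with the write of buffer i.
void pipelined_write(std::size_t total, std::size_t chunk) {
    if (total == 0 || chunk == 0) return;
    std::size_t offset = 0;
    auto pending = std::async(std::launch::async, recv_chunk,
                              std::min(chunk, total));
    while (offset < total) {
        std::vector<char> buf = pending.get();       // buffer i has arrived
        std::size_t next = offset + buf.size();
        if (next < total)                             // prefetch buffer i+1
            pending = std::async(std::launch::async, recv_chunk,
                                 std::min(chunk, total - next));
        fs_write(buf, offset);                        // overlaps the receive
        offset = next;
    }
}

int main() {
    pipelined_write(4u << 20, 1u << 20);   // 4 MiB in 1 MiB pipeline buffers
}
```

The parallel mode mentioned above would keep several buffers in flight at once; the negotiated buffer size bounds the I/O node memory this consumes.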

Pipeline Data Transfer Results for Different IOFSL Server Configurations

[Figure: average bandwidth (MiB/s) vs. pipeline buffer size, comparing server configuration #1 (SM events) and server configuration #2 (TASK events)]

IOFSL Optimization #2: Request Scheduling and Merging

• Request scheduling aggregates several requests into a bulk I/O request
  – Reduces the number of client accesses to the file systems
  – With pipeline transfers, overlaps network and storage I/O accesses
• Two scheduling modes are supported
  – FIFO mode aggregates requests as they arrive
  – Handle-Based Round-Robin (HBRR) mode iterates over all active file handles to aggregate requests
• Request merging identifies and aggregates noncontiguous requests into contiguous requests (see the sketch below)
  – Brute Force mode iterates over all pending requests
  – Interval Tree mode compares requests that are on similar ranges
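As a concrete illustration of request merging, here is a minimal sketch (not the IOFSL implementation) of the brute-force flavor: sort the pending byte ranges on one file handle and coalesce ranges that touch or overlap into a single bulk request. The `Request` struct and `merge_requests` function are assumptions for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

// A pending I/O request on one file handle: byte range [offset, offset+len).
struct Request {
    uint64_t offset;
    uint64_t len;
};

// Coalesce contiguous or overlapping requests into bulk requests.
std::vector<Request> merge_requests(std::vector<Request> reqs) {
    if (reqs.empty()) return reqs;
    std::sort(reqs.begin(), reqs.end(),
              [](const Request& a, const Request& b) {
                  return a.offset < b.offset;
              });
    std::vector<Request> merged{reqs.front()};
    for (std::size_t i = 1; i < reqs.size(); ++i) {
        Request& last = merged.back();
        if (reqs[i].offset <= last.offset + last.len) {  // touches/overlaps
            last.len = std::max(last.offset + last.len,
                                reqs[i].offset + reqs[i].len) - last.offset;
        } else {
            merged.push_back(reqs[i]);                   // gap: new bulk request
        }
    }
    return merged;
}

int main() {
    // Two contiguous requests and one distant request become two bulk requests.
    for (const auto& r : merge_requests({{0, 4096}, {4096, 4096}, {16384, 4096}}))
        std::cout << "offset=" << r.offset << " len=" << r.len << "\n";
}
```

The interval-tree mode reaches the same result without scanning every pending request, which matters once many requests are queued per handle; HBRR decides which handle's queue gets merged and issued next.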

IOFSL Request Scheduling and Merging Results with the IOFSL GridFTP Driver

[Figure: average bandwidth (MiB/s) vs. number of clients (8 to 128), with and without request scheduling]

[Figure: experimental setup. MPI and POSIX applications reach the IOFSL server through MPI-IO and FUSE; the IOFSL GridFTP driver forwards I/O across a WAN to GridFTP servers fronting a high-performance storage system and an archival storage system.]

IOFSL Optimization #3: Request Processing and Event Mode

• Multi-Threaded Task Mode
  – A new thread executes each I/O request
  – Simple implementation
  – Thread contention and scalability issues
• State Machine Mode (see the sketch below)
  – Uses a fixed number of threads from a thread pool to execute I/O requests
  – Divides I/O requests into smaller units of work
  – The thread pool schedules I/O requests to run non-blocking units of work (data manipulation, pipeline calculations, request merging)
  – Yields execution of I/O requests on blocking resource accesses (network communication, timer events, memory allocations)
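A minimal sketch of the state-machine idea, built around a hypothetical `StepPool` type (an assumption for illustration, not the IOFSL event code): a fixed pool of workers executes small non-blocking steps, and a request that reaches a blocking point posts its continuation back to the pool instead of parking a thread, avoiding the thread-per-request contention of task mode.

```cpp
#include <chrono>
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Fixed pool of workers that run queued units of work (steps).
class StepPool {
public:
    explicit StepPool(unsigned nthreads) {
        for (unsigned i = 0; i < nthreads; ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~StepPool() {
        { std::lock_guard<std::mutex> g(m_); done_ = true; }
        cv_.notify_all();
        for (auto& t : workers_) t.join();
    }
    // Enqueue one non-blocking unit of work.
    void post(std::function<void()> step) {
        { std::lock_guard<std::mutex> g(m_); q_.push(std::move(step)); }
        cv_.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> step;
            {
                std::unique_lock<std::mutex> g(m_);
                cv_.wait(g, [this] { return done_ || !q_.empty(); });
                if (q_.empty()) return;        // shutting down, queue drained
                step = std::move(q_.front());
                q_.pop();
            }
            step();   // never blocks; blocking work reposts a continuation
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> q_;
    std::vector<std::thread> workers_;
    bool done_ = false;
};

int main() {
    StepPool pool(4);   // fixed thread count, unlike thread-per-request mode
    // One I/O request expressed as two steps: decode, then (after a point
    // where a real server would wait on the network) write. The second
    // step is posted as a continuation rather than blocking a worker.
    pool.post([&pool] {
        std::cout << "step 1: decode request\n";
        pool.post([] { std::cout << "step 2: write to file system\n"; });
    });
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
}
```

Because a worker is only ever occupied by a short, non-blocking step, a handful of threads can keep many requests in flight, which is the scalability advantage the SM results on the next slides illustrate.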

IOFSL Request Processing and Event Mode: Argonne’s IBM Blue Gene/P Results

[Figure: average bandwidth (MiB/s) vs. number of clients (64 to 1024), comparing CIOD (non-collective, t=8M), IOFSL TASK mode (t=8M), and IOFSL SM mode (t=8M)]

IOFSL Request Processing and Event Mode: Oak Ridge’s Cray XT4 Results

[Figure: average bandwidth (MiB/s) vs. number of clients (128 to 4096), comparing IOFSL TASK mode (t=8M), IOFSL SM mode (t=8M), and native XT4 (non-collective, t=8M)]

Current and Future Work

• Scaling and tuning of IOFSL on IBM BG/P and Cray XT systems
• Collaborative caching layer between IOFSL servers
• Security infrastructure
• Integrating IOFSL with end-to-end I/O tracing and visualization tools for the NSF HECURA IOVIS/Jupiter project

Project Participants and Support

• Argonne National Laboratory: Rob Ross, Pete Beckman, Kamil Iskra, Dries Kimpe, Jason Cope
• Los Alamos National Laboratory: James Nunez, John Bent, Gary Grider, Sean Blanchard, Latchesar Ionkov, Hugh Greenberg
• Oak Ridge National Laboratory: Steve Poole, Terry Jones
• Sandia National Laboratories: Lee Ward
• University of Tokyo: Kazuki Ohta, Yutaka Ishikawa

The IOFSL project is supported by the DOE Office of Science and NNSA.

IOFSL Software Access, Documentation, and Links

• IOFSL Project Website: http://www.iofsl.org
• IOFSL Wiki and Developers Website: http://trac.mcs.anl.gov/projects/iofsl/wiki
• Access to the IOFSL public git repository: git clone http://www.mcs.anl.gov/research/projects/iofsl/git iofsl
• Recent publications:
  – K. Ohta, D. Kimpe, J. Cope, K. Iskra, R. Ross, and Y. Ishikawa, "Optimization Techniques at the I/O Forwarding Layer," IEEE Cluster 2010 (to appear).
  – D. Kimpe, J. Cope, K. Iskra, and R. Ross, "Grids and HPC: Not as Different as You Might Think," Para 2010 mini-symposium on Real-Time Access and Processing of Large Data Sets, April 2010.

Questions?

Jason Cope
[email protected]