22
Amit Chourasia 1+ , Christopher Thompson 2+ & Steven Clark 2* + XSEDE ECRT Consultants * HUBZERO Consultants 1 San Diego Supercomputer Center, UC San Diego 2 Purdue University Research Team: Iowa State University Bhaskar Ganapathysubramanian (PI) David Ackerman (Post doc) Alex Lofquist (Graduate student) PolyRun - Polymer Microstructure ExploraVon HPC Gateway Project duraVon: Apr 2016 - Mar 2018

PolyRun - Polymer Microstructure Exploraon HPC Gateway

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

AmitChourasia1+,ChristopherThompson2+&StevenClark2* +XSEDEECRTConsultants *HUBZEROConsultants 1SanDiegoSupercomputerCenter,UCSanDiego 2PurdueUniversity

ResearchTeam:IowaStateUniversityBhaskarGanapathysubramanian(PI)

DavidAckerman(Postdoc)

AlexLofquist(Graduatestudent)

PolyRun-PolymerMicrostructureExploraVonHPCGateway

ProjectduraVon:Apr2016-Mar2018

PresentaVonOverview

1.  MoVvaVon&cauVon2.  Needandcapabilityassessment3.  GatewayDevelopment

a.  GatewayframeworkselecVonb.  DiaGriddevelopmentprocessc.  SubmissioninterfacerefinementandintegraVond.  VisualizaVonintegraVone.  Datamanagement

4.  Outcomes

MoVvaVonDevelopagatewayformaterialscienVsts– MinimizelearningcurveforusingPolyRun–  Eliminatesetuphurdles–  Providescalableplabormforusers–  EasierdatamanagementandvisualizaVon–  Provideconsistentandrichgraphicaluserinterface

Cau$onIsthegatewayreallyneeded?EmbraceoperaVonandmaintenancebeyondECRTeffort?

Needandcapabilityassessment

Exis$ngcapabili$es–  PolyRun-HPCcode–  Originalsubmissionsystem

Needs–  VaryingcomputaVonscenarios–  UseexisVnggraphicalsubmissionprototype–  Batchjobsubmissionsystem–  InteracVvevisualizaVon–  Usermanagement

PolyRun

HPCcodetostudyandaidindesignofpolymermaterials– Finiteelement– Codebase:C&MPI– Scalesupto1000sofcores– Typicallyrunwithparameterensemble

Originalsubmissionsystem

Asastand-alonedesktopapplicaVon,PolyRunsubmifedfromuser’sworkstaVonasindividualusersoverSSHtoresearchgroup’slocalcampusclusters.

PolyRun

CampusClusters

SSH(user)

ProposedSoluVon:DiaGridUsersinteractwithLinuxdesktopthroughVNC-likewebviewer.LinuxdesktopisavirtualcontainerinsideDiaGrid.–  LinuxenvironmentcanrunanydesktopapplicaVon

•  PolyRun&Debian-basedpackages–  DesktopenvironmentallowsinteracVonwithmulVpleprogramssimultaneously•  PolyRunandvisualizaVontoolsside-by-side.

–  ProgramsremaininDiaGridserver.•  Usersdonotdownloadorinstallanything.•  AllinteracVonisdoneviawebbrowser.

ProposedSoluVon:DiaGridHUBzeroprovides‘submit’systemformanagingjobsubmissions–  Submitdoestheactualworkofrunningremotesshor‘qsub’styleprograms,hidesqueues,allocaVons,etc.

–  Submitmanagesbasicdatatransferto/fromjob.–  SubmijngjobscentrallyasasingleDiaGridcommunityaccountmeansnewgatewayuserscanbemanagedlocallywithoutremoteaccountsneeded.

–  RequiresmodificaVonofPolyRuntoexecute‘submit’insteadofSSHandreadoutputof‘submit’forstatusupdates.

SelectedSoluVon:DiaGrid

AsaHUBzeroplaborm,theDiaGridsiteiscapableofmeeVngtheprojectneeds.

DiaGridDevelopmentProcessDevelopmentonHUBzeroplabormslikeDiaGridisaformalizedprocess:1.  Eachprogramcreatesanew“project.”2.  ResearcherscommitcodetoSVNrepository.3.  DeploycodetoexecuVonenvironmentandtestit.4.  ResearchersapprovecodeorcommitnewmodificaVons.5.  Onceapproved,projectis“published”allowingittoappearfor

DiaGridusersandbeexecuted.

Decideifthepublishedprogramispublicoravailabletospecificusergroups.Decideiftheprogramcodeisavailabletothepublicorkeptclosedsource.

PreparingandInstallingPolyRunResearchersuploadedexisVngPolyRunprogramintotheirprojectonDiaGridandmodifieditforHUBzero:–  RequestinstallPolyRundependenciesonDiagrid– ModifyPolyRuntouseSubmitsystemanddatamanagement

–  Exploreandextend•  VisualizaVoninteracVonviaVisIt•  DatasharingviaGlobus

–  DiaGridgenerallydoesnothostRAWHPCdata

Originalsubmissionsystem

Asastand-alonedesktopapplicaVon,PolyRunsubmifedfromuser’sworkstaVonasindividualusersoverSSHtoresearchgroup’slocalcampusclusters.

PolyRun

CampusClusters

SSH(user)

DiaGrid.org

DiaGridsubmissionsystemPolyRuninsideDiaGridwasmodifiedtocalllocal‘submit’command.‘Submit’daemonsendsjobstoremoteXSEDEclustersusingcommunityaccount,runninglocalsubmissioncommandsasappropriateforeachdesVnaVon.

RemoteClusters

WebFrontend

WebCMS&DBs

ExecuVonHost(LinuxVirtual

ContainerSessions)

UserSession

UserSession…

PolyRun

GlobusCLI

Submit(communityaccount)PolyRun

User’sStorage

User’sStorage

Refined/newsubmissioninterface

DevelopedinQT1.  SetconfiguraVon2.  Checkstatus3.  Viewresults4.  SearchAlsodevelopedaWebApp,butnotsupportedattheVme.

JobconfiguraVoninterface

Fig. 1. The job configuration screen with a sample of the QML code used to generate it.

system. In addition, for simulations that will run onHPC resources, there may be significant barriers to over-come in gaining access before starting any calculations.Minimizing set up can get users involved in answeringinteresting science questions quickly.

• Maintain the power of the HPC software: The polymerscience software was designed to support both individualsimulation runs as well as broad parameter sweep cam-paigns. The parameter sweeps are of particular interestto researchers as they allow exploration of the phasespace of potential polymer configurations. These sweepsare computationally demanding and require numeroussimulation jobs to be submitted at once. If the gatewaydoes not continue to support this capability, its valuewould be substantially reduced.

All of the development was targeted to meet one of theabove goals. The first approach was a Qt-based desktop appli-cation designed to manage creation of job scripts. This helpedaddress the first and last priorities and proved successful. Evenin a rough form, it was utilized by an undergraduate for hishonors research project [2], [3]. Cutting the learning curveby providing a familiar GUI allowed him to get started withthe interesting part of generating results without needing tolearn Linux terminal commands. Despite that success, he stillrequired help in getting the system set up and support fordata sweeps suffered due to issues with job management.Furthermore, despite being written with the cross-platform

Qt library, porting the application to Windows and Mac wasdeemed too time-consuming.

While promising, this approach did not meet the prioritieslisted above. The research group’s focus and expertise is withthe science of polymers and the development of the simulationsoftware. Expansion of the GUI to meet the given needs wasoutside of that area of expertise. To address these needs,assistance was requested through the Extended CollaborativeSupport Services [4] program of XSEDE [5]. This programprovided the required technical expertise in networking, visu-alization, and virtual machines.

Given limited development time, an appealing approachwas to deploy the existing Qt software within DiaGrid [6], aHubZero [7] based virtual machine run by Purdue University.This choice offered several advantages: 1) it preserved existingwork that had proven successful, 2) it enabled the use ofexisting, well tested and well supported HPC job submissionand tracking software [8], 3) ongoing development costs werereduced by only needing to support a single platform (theLinux based virtual machine), 4) user setup is limited tocreating an account on the DiaGrid system, 5) HPC jobscan be routed through an existing allocation if desired, 6)upgrades automatically propagate for all users, and 7) wideavailability via an externally supported host. Several of thesefeatures could be met in other ways, but the reduction inupfront development and ongoing support effort made this acompelling approach.

Jobstatusinterface

Fig. 2. Job status matrix for parameter sweep of jobs. The inset shows the code used to generate the simulation jobs in the status matrix.

III. IMPLEMENTATION ON VIRTUAL MACHINE

Each of our primary goals, along with the approach to meetit and challenges encountered, is outlined below.

A. Graphical Interface

As mentioned earlier, this is the foundation of the project.The original interface was designed as a stand-alone Qt ap-plication. The normal text based configuration was convertedto an input form using QML (Fig. 1). Related items are con-nected via scripting (e.g. changing from 2D to 3D simulationstriggers a change in the required number of simulation boxdimensions). Jobs are managed through a submission pageand a status page provides information on the progress. Oncecompleted, the results can be visualized using snapshots takenduring the simulation and plots of data values of interest.Representative images of the intermediate and final systemstates are generated on the computing resource using VisIt[9] with predefined views that illustrate the results of thesimulation. This enables a rapid evaluation of the resultsand permits the researcher to observe the progress of thesimulation. The raw data files may be retrieved from thecomputing resource through the user interface to enable further

processing. Porting the Qt code to the virtual machine requiredseveral non-standard Qt libraries to be installed, necessitatingupdates to the underlying DiaGrid system. A larger issue waslimited support for HTML widgets within the supported Qtenvironment. This prevented the use of HTML and JavaScript(which had been used in rapid prototyping of the user interfacecomponents) within the widgets used for layout and displayof information.

B. Minimal User SetupThe user is required to create an account on the DiaGrid

system, and needs to be given access to an HPC allocation.Beyond that, the virtual machine environment is already setup to run the system. Jobs are run on HPC resources using acommunity account with a user specfied HPC resource billing.This enables users to either utilize an existing allocation whichthey are granted access to, or to obtain an allocation on theirown for their project. The Diagrid system provides the accountauthorization.

C. Retain HPC CapabilitiesA critical component of the simulation software is the ability

to submit large batches of related jobs. Without this feature, it

Jobstatusinterface

Fig. 3. The results page allowing users to visualize the structures and timeevolution of data items of interest.

would not be possible to achieve one of the primary scientificgoals: the creation of phase diagrams from large numbersof simulations. A recent project utilizing this code requiredapproximately 5000 simulations jobs, with most grouped intobatches of 10-200 jobs that covered a span of parameters.To support this, a batch submission feature is included whichcan generate jobs with a range of parameters. Internally, themanagement of large batches of jobs is handled automaticallyby HubZero’s job submission middleware. The user is keptupdated on the progress of the batch jobs via a status pageas shown in Fig. 2. Job results can be individually inspectedvia the results page (Fig. 3). The inset shown on the statuspage is an example of the code used to create the list of batchjobs. While the goal of the project is to minimize the needfor users to write any sort of code, it proved unavoidablein this case. A fully graphical means of specifying jobswould require significant development time and be limited tothe capabilities provided by the developer – likely requiringadditional development effort every time a new use case isneeded. The code based generation avoids these issues. Tominimize the burden on users arising from this choice, thefollowing strategies are employed: 1) several detailed examplecases are available to users in the documentation, 2) theJavaScript language was chosen as it has a large number offreely available tutorials a user can reference, 3) a validationof code is performed with error messages which specify exactline numbers of problems, and 4) users are able to previewthe resulting job list prior to accepting it. These strategies areexpected to make it feasible for users to develop batch jobswith minimal difficulties.

D. Data Management

This proved to be the most complicated part of development.The fundamental challenge stemmed from the need for users tohave access to the large amounts of simulation data generatedon a system they, in general, will not have direct accessto. Data from a modest size simulation run could reach 100GB when keeping only final configuration data. Simulationson more complex polymer systems would have larger data

production. As shown in Fig. 3, the time evolution of thephysical structure is of interest. Storing intermediate timepoints could lead to 100 times more data than keeping onlythe end results. This is not an unmanageable amount of databut due to the size, it is not feasible to move all the datato the DiaGrid system. It is impractical and unnecessary toautomatically move the data to the user’s local system. Not allof this is data is ultimately necessary, but the determinationof which data is needed must be made by the user duringanalysis. The user will ultimately need access to the data thatis deemed relevant. The goal of this system is to help theuser filter through the data to decide what is relevant and thenmake that data available for further analysis. Our approachis two fold. First, we provide representative previews of alldata to enable the user to determine which data is needed forfuture analysis. This is intended to reduce the number of filesthat need to be transferred. Second, an interface is provided tothe Globus data management service [10], [11]. Data createdthrough the simulations is transferred to long term storageon the HPC cluster and automatically made accessible bythe GUI tool using the web-based Globus interface. Duringthe simulation setup, the user can provide a Globus emailthat the results will be shared with. The user is then able todownload the specific data files needed. Files can be transferedthrough Globus to a local system or to another HPC systemfor further post processing or visualization. File ownership ismaintained by the gateway, however access to specfic files isgoverned by Globus sharing permission which prevents usersfrom accessing data they did not create. Storage of data onthe HPC system will be governed by a data retention policythat may adapt over time as the gateway matures.

IV. CONCLUSION

This tool has been deployed on the DiaGrid system withinitial support for the Comet supercomputer at the San DiegoSupercomputer Center. The process of porting the Qt-basedapplication to the DiaGrid system presented several challengesdue to the nature of running on a virtual machine. The moti-vation of porting an existing program to the virtual machinewas practicality. Utilizing an existing, working system waspreferable to creating a new system. This was one of severaldesign trade offs needed to balance user experience and limiteddevelopment time. Referencing the key goals listed abovehelped guide the development process to ensure the resultinggateway met its intended function. Future work will focus ongrowth of a user community and expanding support for thedifferent polymer systems within the software.

ACKNOWLEDGMENTS

DMA, AL, and BG were partially supported by NSF GrantNo. DMR-1435587 and CMMI-1149365.

This work used the Extreme Science and Engineering Dis-covery Environment (XSEDE), which is supported by NationalScience Foundation grant number ACI-1548562 using theComet system at the San Diego Supercomputer Center.

VisualizaVonIdenVfyvisualizaVonscenariosandneedsFourscenarioswereevaluated–  IndependentbatchvisualizaVon–  SequenValcomputaVonandvisualizaVon–  CoupledcomputaVon&visualizaVon–  InteracVvevisualizaVononVMviaDiaGridStandardvisualizaVonscenarioswereimplementedProvidedbatchscripts,sessionfiles

Datamanagement

•  ResultsarecopiedtoDiaGridinusersaccount•  LargedataisleronCometandaccessibleviaGlobusSharing– Documenta$onhfps://www.seedme.org/node/182516

– NotavailableatotherXSEDESP’s

Projectoutcomes•  PolyRunGatewayreadyforusers•  VisualizaVontrainingofferedtotheresearchteam•  PublicaVons

–  AlecLofquist,DavidMAckerman,StevenClark,ChristopherThompson,AmitChourasia,BaskarGanapathysubramanian."PolyRun-PolymerMicrostructureExploraVonHPCGateway".PresentedattheGateways2018Conference,AusVn,TX,Sep25-27,2018

–  ScienceorientedpaperunderconsideraVonbyPI•  ExcellentcollaboraVonwithresearchteam&HubZeroteam

–  Researchteam:Overallvision+submissioninterface–  ECRTteam:ProjectfacilitaVon+HubZerointegraVon+visualizaVon–  HubZeroteam:Development+administraVon+operaVon

InfutureExtendrefinedsubmissionsystemtousecomputaVononlocalcluster

Lessonslearned

•  IntegraVngGlobusdatasharing•  DiaGridcapabiliVesknowhowandextensionstosupporttheGateway

•  Bi-weeklymeeVngscatalyzedworkprogress•  GatewaydevelopmentistediousandrequiresgreatdealofafenVontodetails–  PreciserequirementsgeneraVonwithcorrespondingimplementaVontakesseveraliteraVons

•  Mentoringresearchteamisnice,buttakeseffort

AcknowledgementsThismaterialisbaseduponworksupportedbytheNaVonalScienceFoundaVonunderGrantNo.DMR-1435587andCMMI-1149365.ThisworkusedtheExtremeScienceandEngineeringDiscoveryEnvironment(XSEDE),whichissupportedbyNaVonalScienceFoundaVongrantnumberACI-1548562usingtheCometsystemattheSanDiegoSupercomputerCenter.

"Anyopinions,findings,andconclusionsorrecommendaVonsexpressedinthismaterialarethoseoftheauthor(s)anddonotnecessarilyreflecttheviewsoftheNaVonalScienceFoundaVon."