MultiJob PanDA Pilot
Oleynik Danila, 28/05/2015
Overview
• Initial PanDA pilot concept & HPC
• Motivation
• PanDA Pilot workflow in a nutshell
• MultiJob Pilot in detail
Initial PanDA pilot concept & HPC
• Pilot definition: «The Panda pilot is an execution environment used to prepare the computing element, request the actual payload (a production or user analysis job), execute it, and clean up when the payload has finished»
• One limitation of HPC facilities is the restricted number of jobs (pilots) that can be launched under one account (usually fewer than 10), although a single job may occupy a lot of resources (tens to hundreds of nodes)
• For the moment ATLAS has no payloads that can be executed on more than one node (MPI)
Motivation
• No way to get MPI ATLAS production payloads quickly
• HPC resources should be used as efficiently as possible. There is no gain in launching only a few PanDA jobs simultaneously when many more resources are available
  – The potential output of a machine like Titan is comparable with, at least, a Tier2 center
• A possible solution that allows a significant increase in the efficiency of HPC usage is to launch a set of PanDA jobs assembled as one MPI job
PanDA Pilot workflow in a nutshell
• The basic steps in the pilot workflow are:
  – Retrieve job information
  – Set up environment
  – StageIn input data
  – Execute payloads
  – StageOut output data and logs
• During execution, the pilot monitors available disk resources and output files, and updates the PanDA server with the status of the PanDA job
MultiJob Pilot
• The current implementation of the MultiJob pilot uses the same workflow and framework as the regular PanDA pilot
• Most core components and basic procedures of the regular pilot were modified to serve multiple jobs in different states
• The procedures for intercommunication between the runJob and Monitor processes were slightly redesigned (without changing the technology)
• The current version was designed as a “proof of concept”
MultiJob Pilot. Requesting jobs.
• For the moment there is no method on the PanDA server to retrieve a set of jobs
• The set of jobs is collected from the server in a loop, one by one. One request takes ~1 sec, so this will not scale well to a large number of jobs
• It’s important to collect jobs from only one task per bunch, to avoid complications with environment setup later
• The number of requested jobs is fitted to the available backfill resources
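The collection loop described above might look roughly like the following sketch. It is not the pilot's actual code: `fetch_one_job` is a hypothetical stand-in for the single-job request to the PanDA server, and the `taskID` key of the job description is an assumed name.

```python
from typing import Callable, List, Optional


def collect_jobs(fetch_one_job: Callable[[], Optional[dict]],
                 backfill_slots: int) -> List[dict]:
    """Request jobs one by one until the backfill slots are filled.

    fetch_one_job is a hypothetical stand-in for the single-job request;
    it returns a job description (dict) or None when no job is available.
    """
    jobs: List[dict] = []
    task_id = None
    while len(jobs) < backfill_slots:
        job = fetch_one_job()
        if job is None:
            break  # the server has nothing more to give
        if task_id is None:
            task_id = job["taskID"]  # the first job fixes the task for the set
        if job["taskID"] != task_id:
            # Mixing tasks would complicate environment setup, so stop here.
            # What happens to this extra job is left open in this sketch.
            break
        jobs.append(job)
    return jobs
```

With one request per job at roughly 1 sec each, filling N slots this way costs about N seconds, which is why a bulk request on the server side would be needed at larger scale.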
MultiJob Pilot. Environment setup and verification
• Environment setup is in most cases experiment-specific
• Organized for each job in the set
• Optimized by reducing the number of repeated identical checks
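One way to avoid repeating identical checks is to cache each check result under the value it depends on. The sketch below assumes, purely for illustration, that the check depends on a `release` field of the job description; `check`, `release` and `id` are hypothetical names, not the pilot's own.

```python
from typing import Callable, Dict, Hashable, List


def verify_environment(jobs: List[dict],
                       check: Callable[[Hashable], bool],
                       key: str = "release") -> Dict[int, bool]:
    """Run `check` once per distinct job[key] and reuse the result."""
    cache: Dict[Hashable, bool] = {}
    results: Dict[int, bool] = {}
    for job in jobs:
        value = job[key]
        if value not in cache:
            cache[value] = check(value)  # the expensive check runs only here
        results[job["id"]] = cache[value]
    return results
```

Since all jobs in a set come from one task, most of them would share the same key, so the check would typically run once for the whole set.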
MultiJob Pilot. StageIn
• Optimized by reducing the number of remote stage-ins when the data has already been copied locally (for another job in the set)
  – This simple optimization gives a significant reduction in the total stage-in time
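A minimal sketch of this optimization, assuming all jobs in the set share one local working directory and that each job description lists its input file names under a hypothetical `inputs` key. `copy_remote` is a placeholder for whatever remote copy tool the site uses.

```python
import os
from typing import Callable, List


def stage_in(jobs: List[dict], workdir: str,
             copy_remote: Callable[[str, str], None]) -> int:
    """Copy each distinct input file once; return the number of remote copies."""
    remote_copies = 0
    for job in jobs:
        for name in job["inputs"]:
            local = os.path.join(workdir, name)
            if os.path.exists(local):
                continue  # already staged in for another job in the set
            copy_remote(name, local)
            remote_copies += 1
    return remote_copies
```

The saving grows with the overlap between jobs: if every job in the set reads the same input files, the remote transfer cost is paid only once per file.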
MultiJob Pilot. Payload execution
• The number of jobs is adjusted once more according to the available backfill
• Jobs that do not fit will fail with sub-status “rejected”
• PanDA jobs are launched as separate MPI ranks through a special wrapper
  – Transformation name and input parameters are passed through a file
  – CPU consumption time and trf exit code are published in a rank report file
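The per-rank part of such a wrapper could be sketched as follows. The file layout is an assumption made for illustration: a JSON list with one entry per rank holding `transformation` and `parameters`, and a report named `rank_<N>.json`. In a real MPI job the rank would come from the MPI library (for example `MPI.COMM_WORLD.Get_rank()` in mpi4py); here it is passed as an argument so the sketch runs without MPI.

```python
import json
import os
import subprocess


def run_rank(rank: int, jobs_file: str, report_dir: str) -> dict:
    """Run the payload assigned to one rank and write its report file."""
    with open(jobs_file) as f:
        jobs = json.load(f)  # assumed layout: a list indexed by rank
    job = jobs[rank]

    before = os.times()
    proc = subprocess.run([job["transformation"], *job["parameters"]])
    after = os.times()

    report = {
        "rank": rank,
        "exit_code": proc.returncode,
        # CPU time consumed by the finished child process (user + system).
        "cpu_time": (after.children_user - before.children_user)
                    + (after.children_system - before.children_system),
    }
    with open(os.path.join(report_dir, f"rank_{rank}.json"), "w") as f:
        json.dump(report, f)
    return report
```

After the MPI job ends, the pilot can read the report files and update the status of each PanDA job separately.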
MultiJob Pilot. StageOut
• Does not require special optimization for the moment, since it is not a time-critical operation on HPC
  – Optimization will be reviewed as the scale grows to hundreds of PanDA jobs launched simultaneously by one pilot
First results
• The MultiJob pilot was tested with jobs from a validated ATLAS production task
• 1000 jobs were executed (100000 events generated)
• The scale was increased from 3 to 20 simultaneously launched jobs
• No significant increase in the execution time of simultaneously launched jobs was observed