JRA7 and SAGA
Malcolm Illingworth, EPCC
OGF19
Chapel Hill 29/01 – 02/02 2007
DEISA Objectives
• To deploy and operate a persistent, production quality, distributed supercomputing environment with continental scope
• To enable scientific discovery across a broad spectrum of science and technology. Scientific impact (enabling new science) is the only criterion for success.
• User transparency (users should not be aware of complex grid technologies) and application transparency
• Minimal intrusion on applications
JRA7 Objectives
“To develop a single way of coordinating and integrating OGSA-based services for distributed resource management in a heterogeneous environment, and to use this to integrate a variety of existing user-level tools to provide the necessary high-level services in:
- authentication, authorisation and accounting;
- job preparation, submission and monitoring;
- data movement for job input and output;
- other areas to be determined by DEISA user requirements.”
DESHL: DEISA Services for the Heterogeneous management Layer
Current status and future plans
• Started in May 2004
• Decision taken to follow SAGA mid-2005
• Project finishes in April 2008
• DESHL command line tool deployed and tested at all 11 DEISA sites
• DESHL training included at DEISA user training sessions since July 2005
• Some take-up from outside of DEISA
• Recent focus on usability and robustness
• DESHL 4.1 due for release in April
• Possible inclusion by eDEISA for life sciences portal development (integration with EngineFrame)
The Big Picture
Standards-based interfaces to allow user-level tools to interact across heterogeneous sites.
[Diagram: JRA7 DESHL above two HPC sites; each site shows Data-Mgt, UNICORE DRM and Information services over Data, HPC, Network and Resources]
DEISA Services for the Heterogeneous management Layer
Batch Job service
Data Management service
Information service
User tools
User
Job
At a local site a user wants to run a job on the DEISA heterogeneous environment.
DESHL v4.1 Components
[Diagram: DESHL client (Command Line Tool, SAGA Client Library, Grid Access Library, ARCON Client library) and server (UNICORE Gateway)]
Command line tool functionality
• The precise set of operations is based upon application requirements, but focus has been on file transfer and job submission.
• Data Transfer
– upload/download files between local workstation and DEISA site
– delete a file at a DEISA site
– determine if a file exists on a DEISA site
– list the contents of a directory on a DEISA site
– rename a file on a DEISA site
– copy/move a file between DEISA sites
• Job Management
– determine the DEISA sites to which a user can submit a batch job
– submit a batch job to a DEISA site
– terminate a batch job at a DEISA site
– view the status of a batch job on a DEISA site
– retrieve job stdout and stderr
Client Library
• Provides factory classes for access to remote job services and remote file systems
• Specific implementation classes are specified via a properties file and hidden from the caller
• Changes in implementation should not be visible to caller
• Remote resources configured locally via configuration file
• Jobs specified to CLT as SAGA directive scripts
• SAGA directives translated to JSDL script
• JSDL script is submitted to a site via Grid Library
• Grid Library returns a Task object for submitted JSDL script
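The properties-driven factory described above can be sketched as follows. This is a minimal illustration, not the DESHL implementation: `RemoteFileService`, `UnicoreFileService`, `ClientFactorySketch` and the property key `fileservice.impl` are all hypothetical names chosen for the example.

```java
import java.util.Properties;

/** Hypothetical service interface; stands in for a SAGA-style interface. */
interface RemoteFileService {
    String describe();
}

/** Hypothetical implementation class, selected only through configuration. */
class UnicoreFileService implements RemoteFileService {
    public String describe() {
        return "unicore-backed file service";
    }
}

/** Factory that hides the concrete class behind a property lookup. */
class ClientFactorySketch {
    // Assumed property key; the real DESHL key is not given in the slides.
    static final String IMPL_KEY = "fileservice.impl";

    static RemoteFileService create(Properties config) {
        String className = config.getProperty(IMPL_KEY);
        if (className == null) {
            throw new IllegalStateException("Missing property: " + IMPL_KEY);
        }
        try {
            // Load and instantiate the configured class reflectively, so the
            // caller only ever sees the interface type.
            Class<?> impl = Class.forName(className);
            return (RemoteFileService) impl.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        config.setProperty(IMPL_KEY, UnicoreFileService.class.getName());
        System.out.println(create(config).describe());
    }
}
```

Swapping the implementation then only requires changing the property value; calling code that uses `RemoteFileService` is unaffected.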
SAGA Factory Classes
• SAGA interfaces obtained from factory classes
• DESHLNSDir dir = DESHLClientFactory.getNSDirFactory().getInstance(Session session);
• JobService js = DESHLClientFactory.getJobServiceFactory().getInstance(Session session);
• Caller identities are provided via a Session object containing the appropriate context objects
• TODO: currently have UnicoreContext interface extending Context; will refactor to SAGA-compliant attribute-based Context
• TODO: rename DESHLNSDir to NSDir
NSDir interface (1)
public interface DESHLNSDir {
String[] list( String dir ) throws SAGAException, BadParameterException,
DoesNotExistException;
boolean exists(String name) throws SAGAException, BadParameterException;
boolean isDir(String name) throws SAGAException, BadParameterException,
DoesNotExistException;
boolean isFile(String name) throws SAGAException, BadParameterException,
DoesNotExistException;
NSDir Interface (2)
void copy(String source, String target, int[] copyFlags) throws SAGAException, BadParameterException,
DoesNotExistException, IncorrectStateException;
void move(String source, String target, int[] moveFlags) throws SAGAException,BadParameterException,
DoesNotExistException,IncorrectStateException;
void remove(String target, int[] removeFlags) throws SAGAException, BadParameterException,
DoesNotExistException,IncorrectStateException;
void makeDir(String target, int[] makeDirFlags) throws SAGAException, BadParameterException,
IncorrectStateException;
NSDir Interface (3)
• Methods implemented but not currently used:
– (no persistence in CLT application, not currently relevant)
String getURL() throws SAGAException;
String getName() throws SAGAException;
void changeDir(String dir) throws SAGAException, BadParameterException, DoesNotExistException;
int getNumEntries() throws SAGAException;
String getEntry(int entry) throws SAGAException, BadParameterException;
Job Service Interface
public interface JobService {
Job submitJob( JobDefinition jobDef )
throws SAGAException;
String[] list(boolean showAllDetails)
throws SAGAException;
Job getJob( String jobId ) throws SAGAException;
/* not specified by SAGA but very useful */
public String[] listJobsForSite(String siteName, boolean showAllDetails) throws SAGAException;
}
JobDefinition
• Contains job description as set of SAGA attributes
• JobDefinition interface extends Attribute interface
• Implementation defines the set of attributes we support
• CLT reads SAGA definitions from a text file to build job definition
Example simple job submission script:
#!/bin/bash
# Test job script for DESHL using SAGA.
#
# SAGA JobDefinition based directives:
#$ SAGA_FileTransfer = file:///jobs/hello.sh#HOME > hello.sh
#$ SAGA_HostList = ssl://admin.hpcx.ac.uk:4433/EPCC%20HPCx
#$ SAGA_JobCmd = hello.sh
#$ SAGA_JobName = example job script
More complex example …
# SAGA JobDefinition based directives:
#$ SAGA_JobCmd = a.out
#$ SAGA_FileTransfer = file:///unicore/a.out#HOME > a.out
#$ SAGA_HostList = ssl://admin.hpcx.ac.uk:4433/EPCC%20HPCx
#$ SAGA_FileTransfer = file:///TestOutput#HOME < TestOutput
#$ SAGA_JobEnv = account_no=e24-sa
#$ SAGA_JobEnv = stack_limit=200MB
#$ SAGA_Memory = 24400
#$ SAGA_NumTasks = 16
#$ SAGA_NumCpus = 1
#$ SAGA_WallClockSoftLimit = 3600
Currently supported attributes
• SAGA_JobCmd
• SAGA_JobArgs
• SAGA_JobEnv
• SAGA_JobName
• SAGA_FileTransfer
• SAGA_HostList (note: only one host can currently be specified, DEISA does not have a broker)
• SAGA_NumTasks
• SAGA_NumCpus (interpreted as number of threads per task)
• SAGA_Memory (host uses value to calculate stack and heap)
• SAGA_WallClockSoftLimit
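How the CLT might read `#$` directives from a job script into attributes can be sketched as below. This is an illustrative parser only, under the assumption that each directive has the form `#$ NAME = value` and that an attribute such as SAGA_JobEnv may appear more than once; `SagaDirectiveParser` is a hypothetical name, not a DESHL class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class SagaDirectiveParser {
    // The attribute names listed on the "Currently supported attributes" slide.
    static final Set<String> SUPPORTED = new HashSet<>(Arrays.asList(
            "SAGA_JobCmd", "SAGA_JobArgs", "SAGA_JobEnv", "SAGA_JobName",
            "SAGA_FileTransfer", "SAGA_HostList", "SAGA_NumTasks",
            "SAGA_NumCpus", "SAGA_Memory", "SAGA_WallClockSoftLimit"));

    /** Collects directive values per attribute name, in script order. */
    static Map<String, List<String>> parse(List<String> lines) {
        Map<String, List<String>> attrs = new LinkedHashMap<>();
        for (String line : lines) {
            if (!line.startsWith("#$")) {
                continue; // ordinary comment or shell command
            }
            String body = line.substring(2).trim();
            int eq = body.indexOf('='); // split on the first '=' only
            if (eq < 0) {
                throw new IllegalArgumentException("No '=' in directive: " + line);
            }
            String name = body.substring(0, eq).trim();
            String value = body.substring(eq + 1).trim();
            if (!SUPPORTED.contains(name)) {
                throw new IllegalArgumentException("Unsupported attribute: " + name);
            }
            attrs.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
        return attrs;
    }

    public static void main(String[] args) {
        List<String> script = Arrays.asList(
                "#!/bin/bash",
                "#$ SAGA_JobCmd = a.out",
                "#$ SAGA_JobEnv = account_no=e24-sa",
                "#$ SAGA_JobEnv = stack_limit=200MB",
                "#$ SAGA_NumTasks = 16");
        System.out.println(parse(script));
    }
}
```

Splitting on the first `=` only keeps values such as `account_no=e24-sa` intact.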
Job Interface
• Uses subset of SAGA Job interface.
• Due to translation steps (SAGA-JSDL-AJO), not possible to retrieve SAGA job definition from remote host.
public interface Job {
    String getJobId();
    JobState getJobState();
    String getJobStateDetail();
    void terminate();
    /* Not specified by SAGA but required by UNICORE to
     * retrieve output from USPACE and free resources. */
    void cleanUp( File toDir );
}
Example job submission
Session session;
…
// get the class factory
JobServiceFactory factory = DESHLClientFactory.getJobServiceFactory();
// get an instance of the job service from the factory
JobService js = factory.getInstance(session);
JobDefBuilder jobDefBuilder = new JobDefBuilder();
... // build up job definition from file or arguments
// get the constructed job definition
JobDefinition jobDef = jobDefBuilder.create();
// submit the job, return a job instance
Job submittedJob = js.submitJob( jobDef );
// get the job identifier, e.g. to display to the user
String jobID = submittedJob.getJobId();
// get the job instance again from the job identifier
Job remoteJob = js.getJob(jobID);
// get the job's state
JobState jobState = remoteJob.getJobState();
// retrieve the job output to a specified directory
remoteJob.cleanUp(new File("/home/malcolm/joboutputdir"));
Example copy operation
Session session;
int copyFlags[] = { NSDirFlags.copyFlags_NoRecursive, NSDirFlags.NoOverwrite };
String source = "ssl://admin.hpcx.ac.uk:4433/EPCC%20HPCx/home/malcolm/test.dat";
String target = "ssl://admin.hpcx.ac.uk:4433/IDRIS%20ZAHIR/home/malcolm/test.dat";
// get an instance of the factory
NSDirFactory factory = DESHLClientFactory.getNSDirFactory();
// get an instance of the NSDir interface from the factory
NSDir dir = factory.getInstance(session);
// verify the source file exists
boolean sourceFileExists = dir.exists(source);
// copy the file to the other site
dir.copy(source, target, copyFlags);
// verify the file turned up at the remote site
boolean targetFileExists = dir.exists(target);
Grid Access Library (roctopus)
• Presents a generalised object-oriented model for interacting with a UNICORE grid, not purely for DESHL
• Provides a general interface that can have multiple implementations
• Jobs submitted to a Site as JSDL scripts, returns a Task.
• Presents Task interface to represent executing jobs.
• All of this hidden from the user/application developer
• Authentication/Authorisation is by existing UNICORE mechanisms, i.e. long-lived X.509 pairs
[Class diagram: Grid, Site, Storage and File, linked by associations with multiplicities 1 and 0..*]
Grid Library interface
• Provides dedicated functions for file management/transfer
• Job submission/management via rich Task interface
• Job submitted as JSDL, Task instance returned
• List of tasks at a remote site can be retrieved and manipulated
Example:
JobDefinition jobDef;
…
XmlJobDefinitionDocument jsdl = JobDefJSDLConverter.jobDefToJSDL( jobDef );
host = new UnicoreLocation( unicoreLocationStr );
Site site = grid.locateSite( host );
final Task task = site.submit( jobSubmission );
task.startASync( new File[] {} );
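The slides do not show what `JobDefJSDLConverter` produces. As a rough sketch of the SAGA-to-JSDL step, the following builds a minimal JSDL document from a job name and command. The element names and namespaces are those of JSDL 1.0 and its POSIX application extension; the class `JsdlSketch`, its method names, and the mapping of SAGA_JobName and SAGA_JobCmd onto these elements are assumptions made for illustration.

```java
class JsdlSketch {
    static final String JSDL_NS = "http://schemas.ggf.org/jsdl/2005/11/jsdl";
    static final String POSIX_NS = "http://schemas.ggf.org/jsdl/2005/11/jsdl-posix";

    /** Escapes the characters that are unsafe in XML element content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Assumed mapping: SAGA_JobName -> JobName, SAGA_JobCmd -> Executable. */
    static String toJsdl(String jobName, String jobCmd) {
        StringBuilder xml = new StringBuilder();
        xml.append("<jsdl:JobDefinition xmlns:jsdl=\"").append(JSDL_NS)
           .append("\" xmlns:jsdl-posix=\"").append(POSIX_NS).append("\">\n");
        xml.append("  <jsdl:JobDescription>\n");
        xml.append("    <jsdl:JobIdentification>\n");
        xml.append("      <jsdl:JobName>").append(escape(jobName)).append("</jsdl:JobName>\n");
        xml.append("    </jsdl:JobIdentification>\n");
        xml.append("    <jsdl:Application>\n");
        xml.append("      <jsdl-posix:POSIXApplication>\n");
        xml.append("        <jsdl-posix:Executable>").append(escape(jobCmd))
           .append("</jsdl-posix:Executable>\n");
        xml.append("      </jsdl-posix:POSIXApplication>\n");
        xml.append("    </jsdl:Application>\n");
        xml.append("  </jsdl:JobDescription>\n");
        xml.append("</jsdl:JobDefinition>\n");
        return xml.toString();
    }

    public static void main(String[] args) {
        System.out.print(toJsdl("example job script", "hello.sh"));
    }
}
```

A real converter would also need to cover file staging, resource limits and environment variables, which are omitted here.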
Current Issues (1)
– SAGA defines job identifiers as ‘[backend url]-[native id]’
• Example ‘[ssh://remote.host.net:22/]-[1234]’
– (We escape out any characters likely to be a problem on the command line)
– Fine programmatically …
– From a CLT perspective, not user friendly
$ deshl submit -q ssl://myhost.ac.uk:4433/myNJS sleeper.sh
Your job: ssl%3A%2F%2Fmyhost.ac.uk%3A4433%2FmyNJS%2F957383131, has been successfully submitted.
$ deshl status ssl%3A%2F%2Fmyhost.ac.uk%3A4433%2FmyNJS%2F957383131
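The escaped identifier shown above is consistent with percent-encoding of the site URL and native job id. A small sketch using the standard `java.net.URLEncoder` reproduces it; that DESHL itself uses this class is an assumption, and `JobIdEscaping` is a hypothetical name.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class JobIdEscaping {
    /** Makes a job identifier safe to paste on a shell command line. */
    static String escape(String rawJobId) {
        return URLEncoder.encode(rawJobId, StandardCharsets.UTF_8);
    }

    /** Recovers the original identifier from its escaped form. */
    static String unescape(String escapedJobId) {
        return URLDecoder.decode(escapedJobId, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String raw = "ssl://myhost.ac.uk:4433/myNJS/957383131";
        String escaped = escape(raw);
        // prints ssl%3A%2F%2Fmyhost.ac.uk%3A4433%2FmyNJS%2F957383131
        System.out.println(escaped);
        // prints the original identifier again
        System.out.println(unescape(escaped));
    }
}
```

Note that `URLEncoder` implements form encoding, so a space would become `+` rather than `%20`; that does not matter for identifiers like the one above.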
Current Issues (2)
– Could save job id to a file and use simpler naming convention
– DESHL allows aliases to be defined for remote sites
$ deshl submit -q myHost sleeper.sh
Your job myHost%2F957383131 has been successfully submitted
nsdir.copy("myhosta/home/malcolm/test.dat", "myhostb/home/malcolm/test.dat");
– Aliases are currently specified and handled outside of the SAGA standard; we would like to include this as an optional attribute in the context
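Alias handling of this kind can be sketched as a lookup that expands the leading path segment into a full site URL. The class `SiteAliases`, its methods, and the mapping of `myhosta` to a particular URL are hypothetical; the slides do not describe how DESHL stores or resolves aliases.

```java
import java.util.HashMap;
import java.util.Map;

class SiteAliases {
    private final Map<String, String> aliases = new HashMap<>();

    /** Registers a short name for a full site URL. */
    void define(String alias, String siteUrl) {
        aliases.put(alias, siteUrl);
    }

    /** Expands "alias/path" to "siteUrl/path"; full URLs pass through unchanged. */
    String resolve(String target) {
        if (target.contains("://")) {
            return target; // already a full URL
        }
        int slash = target.indexOf('/');
        String head = slash < 0 ? target : target.substring(0, slash);
        String tail = slash < 0 ? "" : target.substring(slash);
        String siteUrl = aliases.get(head);
        if (siteUrl == null) {
            throw new IllegalArgumentException("Unknown site alias: " + head);
        }
        return siteUrl + tail;
    }

    public static void main(String[] args) {
        SiteAliases sites = new SiteAliases();
        // Assumed mapping, for illustration only.
        sites.define("myhosta", "ssl://admin.hpcx.ac.uk:4433/EPCC%20HPCx");
        System.out.println(sites.resolve("myhosta/home/malcolm/test.dat"));
    }
}
```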
Current Issues (3)
• Retrieving job definition:
– Not currently supported …
– Job definition originally as SAGA script
– Not possible to retrieve original SAGA job definition from remote host, as host does not receive or understand this; would need to rely on local persistence
– May be possible to get JSDL description, reverse translate to SAGA
– (could store original SAGA script in a local database with job id)
• Debugging / Exception reporting:
– Layered architecture can be difficult to debug.
– Sometimes unclear if a problem is in middleware or on remote host; very clear exception reporting required, or user will tend to blame middleware for operational problems on host.
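The local-persistence idea for job definitions could look roughly like the sketch below, which keeps each original SAGA script in a file named after its job id. It uses plain files rather than a database, and `LocalJobDefinitionStore`, its methods and the `.saga` suffix are hypothetical choices for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class LocalJobDefinitionStore {
    private final Path dir;

    LocalJobDefinitionStore(Path dir) {
        this.dir = dir;
    }

    /** Convenience constructor for experiments: stores scripts in a temp directory. */
    static LocalJobDefinitionStore inTempDir() {
        try {
            return new LocalJobDefinitionStore(Files.createTempDirectory("deshl-jobs"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Percent-encodes the job id so that it is a valid file name. */
    private Path fileFor(String jobId) {
        return dir.resolve(URLEncoder.encode(jobId, StandardCharsets.UTF_8) + ".saga");
    }

    /** Saves the original SAGA script at submission time. */
    void save(String jobId, String sagaScript) {
        try {
            Files.write(fileFor(jobId), sagaScript.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the script that was saved for this job id. */
    String load(String jobId) {
        try {
            return new String(Files.readAllBytes(fileFor(jobId)), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        LocalJobDefinitionStore store = inTempDir();
        String jobId = "ssl://myhost.ac.uk:4433/myNJS/957383131";
        store.save(jobId, "#$ SAGA_JobCmd = sleeper.sh\n");
        System.out.print(store.load(jobId));
    }
}
```

Because the definition never leaves the client in SAGA form, a store like this only helps on the workstation that submitted the job.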