JRA7 and SAGA Malcolm Illingworth, EPCC OGF19 Chapel Hill 29/01 – 02/02 2007


JRA7 and SAGA

Malcolm Illingworth, EPCC

OGF19

Chapel Hill 29/01 – 02/02 2007


DEISA Objectives

• To deploy and operate a persistent, production quality, distributed supercomputing environment with continental scope

• To enable scientific discovery across a broad spectrum of science and technology. Scientific impact (enabling new science) is the only criterion for success.

• User and application transparency (users should not be aware of complex grid technologies)

• Minimal intrusion on applications


JRA7 Objectives

“To develop a single way of coordinating and integrating OGSA-based services for distributed resource management in a heterogeneous environment, and to use this to integrate a variety of existing user-level tools to provide the necessary high-level services in:

- authentication, authorisation and accounting;

- job preparation, submission and monitoring;

- data movement for job input and output;

- other areas to be determined by DEISA user requirements.”

DESHL: DEISA Services for the Heterogeneous management Layer


Current status and future plans

• Started in May 2004
• Decision taken to follow SAGA mid-2005
• Project finishes in April 2008
• DESHL command line tool deployed and tested at all 11 DEISA sites
• DESHL training included at DEISA user training sessions since July 2005
• Some take-up from outside of DEISA
• Recent focus on usability and robustness
• DESHL 4.1 due for release in April
• Possible inclusion by eDEISA for life sciences portal development (integration with EngineFrame)


The Big Picture

Standards-based interfaces to allow user-level tools to interact across heterogeneous sites.

[Diagram: at a local site, a user wants to run a job on the DEISA heterogeneous environment. User tools pass the job to JRA7 DESHL (DEISA Services for the Heterogeneous management Layer), which provides a batch job service, a data management service and an information service. Each HPC site is shown with UNICORE, DRM, data management and information components on top of its data, HPC and network resources.]


DESHL v4.1 Components

[Diagram: the DESHL client comprises the Command Line Tool, the SAGA Client Library, the Grid Access Library and the ARCON Client Library; the server side is the UNICORE Gateway.]


Command line tool functionality

• The precise set of operations is based upon application requirements, but focus has been on file transfer and job submission.

• Data Transfer
  – upload/download files between local workstation and DEISA site
  – delete a file at a DEISA site
  – determine if a file exists on a DEISA site
  – list the contents of a directory on a DEISA site
  – rename a file on a DEISA site
  – copy/move a file between DEISA sites

• Job Management
  – determine the DEISA sites to which a user can submit a batch job
  – submit a batch job to a DEISA site
  – terminate a batch job at a DEISA site
  – view the status of a batch job on a DEISA site
  – retrieve job stdout and stderr


Client Library

• Provides factory classes for access to remote job services and remote file systems

• Specific implementation classes are specified via a properties file and hidden from the caller

• Changes in implementation should not be visible to caller

• Remote resources configured locally via configuration file

• Jobs specified to CLT as SAGA directive scripts

• SAGA directives translated to JSDL script

• JSDL script is submitted to a site via Grid Library

• Grid Library returns a Task object for submitted JSDL script
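
The factory idea above (implementation classes named in a properties file and hidden from the caller) can be sketched in plain Java as follows. This is only an illustration under stated assumptions: `RemoteFileSystem`, `InMemoryFileSystem`, `ClientFactory` and the property key `filesystem.impl` are hypothetical names and are not taken from the DESHL code base.

```java
import java.util.Properties;

// Hypothetical stand-in for a SAGA-style interface such as the NSDir one.
interface RemoteFileSystem {
    boolean exists(String name);
}

// Hypothetical implementation; callers never name this class directly.
class InMemoryFileSystem implements RemoteFileSystem {
    public boolean exists(String name) {
        return "/home/demo/test.dat".equals(name);
    }
}

// Factory that picks the implementation class from configuration.
class ClientFactory {
    private final Properties config;

    ClientFactory(Properties config) {
        this.config = config;
    }

    RemoteFileSystem newFileSystem() {
        // "filesystem.impl" is an assumed property key for this sketch.
        String className = config.getProperty("filesystem.impl");
        if (className == null) {
            throw new IllegalStateException("filesystem.impl is not configured");
        }
        try {
            // Load the configured class reflectively and call its no-arg constructor.
            return Class.forName(className)
                    .asSubclass(RemoteFileSystem.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot create " + className, e);
        }
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        config.setProperty("filesystem.impl", InMemoryFileSystem.class.getName());
        RemoteFileSystem fs = new ClientFactory(config).newFileSystem();
        System.out.println(fs.exists("/home/demo/test.dat"));
    }
}
```

Swapping the implementation then only requires changing the property value; the calling code keeps working against the interface.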


SAGA Factory Classes

• SAGA interfaces obtained from factory classes

• DESHLNSDir dir = DESHLClientFactory.getNSDirFactory().getInstance(session);

• JobService js = DESHLClientFactory.getJobServiceFactory().getInstance(session);

• Caller identities provided via a Session object containing appropriate context objects

• TODO: currently have UnicoreContext interface extending Context; will refactor to SAGA-compliant attribute-based Context

• TODO: rename DESHLNSDir to NSDir


NSDir interface (1)

public interface DESHLNSDir {

    String[] list(String dir) throws SAGAException, BadParameterException, DoesNotExistException;

    boolean exists(String name) throws SAGAException, BadParameterException;

    boolean isDir(String name) throws SAGAException, BadParameterException, DoesNotExistException;

    boolean isFile(String name) throws SAGAException, BadParameterException, DoesNotExistException;


NSDir Interface (2)

    void copy(String source, String target, int[] copyFlags) throws SAGAException, BadParameterException, DoesNotExistException, IncorrectStateException;

    void move(String source, String target, int[] moveFlags) throws SAGAException, BadParameterException, DoesNotExistException, IncorrectStateException;

    void remove(String target, int[] removeFlags) throws SAGAException, BadParameterException, DoesNotExistException, IncorrectStateException;

    void makeDir(String target, int[] makeDirFlags) throws SAGAException, BadParameterException, IncorrectStateException;


NSDir Interface (3)

• Methods implemented but not currently used:
  – (no persistence in CLT application, not currently relevant)

String getURL() throws SAGAException;

String getName() throws SAGAException;

void changeDir(String dir) throws SAGAException, BadParameterException, DoesNotExistException;

int getNumEntries() throws SAGAException;

String getEntry(int entry) throws SAGAException, BadParameterException;


Job Service Interface

public interface JobService {

    Job submitJob(JobDefinition jobDef) throws SAGAException;

    String[] list(boolean showAllDetails) throws SAGAException;

    Job getJob(String jobId) throws SAGAException;

    /* not specified by SAGA but very useful */
    String[] listJobsForSite(String siteName, boolean showAllDetails) throws SAGAException;

}


JobDefinition

• Contains job description as set of SAGA attributes
• JobDefinition interface extends Attribute interface
• Implementation defines the set of attributes we support
• CLT reads SAGA definitions from a text file to build job definition

Example simple job submission script:

#!/bin/bash
# Test job script for DESHL using SAGA.
#
# SAGA JobDefinition based directives:
#$ SAGA_FileTransfer = file:///jobs/hello.sh#HOME > hello.sh
#$ SAGA_HostList = ssl://admin.hpcx.ac.uk:4433/EPCC%20HPCx
#$ SAGA_JobCmd = hello.sh
#$ SAGA_JobName = example job script


More complex example …

# SAGA JobDefinition based directives:

#$ SAGA_JobCmd = a.out

#$ SAGA_FileTransfer = file:///unicore/a.out#HOME > a.out

#$ SAGA_HostList = ssl://admin.hpcx.ac.uk:4433/EPCC%20HPCx

#$ SAGA_FileTransfer = file:///TestOutput#HOME < TestOutput

#$ SAGA_JobEnv = account_no=e24-sa

#$ SAGA_JobEnv = stack_limit=200MB

#$ SAGA_Memory = 24400

#$ SAGA_NumTasks = 16

#$ SAGA_NumCpus = 1

#$ SAGA_WallClockSoftLimit = 3600
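
The directive scripts above follow a simple `#$ SAGA_Name = value` pattern, so reading them into attributes can be sketched as below. This is not the DESHL parser; the class name `SagaDirectiveParser` and the choice to keep repeated directives (such as `SAGA_JobEnv`) as a list of values are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SagaDirectiveParser {
    private static final String PREFIX = "#$";

    // Collects "#$ SAGA_Name = value" lines; repeated names accumulate values in order.
    static Map<String, List<String>> parse(List<String> lines) {
        Map<String, List<String>> attributes = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.startsWith(PREFIX)) {
                continue; // ordinary shell line or comment
            }
            String directive = trimmed.substring(PREFIX.length());
            int eq = directive.indexOf('='); // split on the first '=' only
            if (eq < 0) {
                continue;
            }
            String name = directive.substring(0, eq).trim();
            String value = directive.substring(eq + 1).trim();
            if (!name.startsWith("SAGA_")) {
                continue;
            }
            attributes.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
        return attributes;
    }

    public static void main(String[] args) {
        List<String> script = Arrays.asList(
                "#!/bin/bash",
                "# SAGA JobDefinition based directives:",
                "#$ SAGA_JobCmd = a.out",
                "#$ SAGA_JobEnv = account_no=e24-sa",
                "#$ SAGA_JobEnv = stack_limit=200MB",
                "#$ SAGA_NumTasks = 16");
        parse(script).forEach((name, values) -> System.out.println(name + " -> " + values));
    }
}
```

Splitting on the first `=` only keeps values such as `account_no=e24-sa` intact.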


Currently supported attributes

• SAGA_JobCmd
• SAGA_JobArgs
• SAGA_JobEnv
• SAGA_JobName
• SAGA_FileTransfer
• SAGA_HostList (note: only one host can currently be specified, DEISA does not have a broker)
• SAGA_NumTasks
• SAGA_NumCpus (interpreted as number of threads per task)
• SAGA_Memory (host uses value to calculate stack and heap)
• SAGA_WallClockSoftLimit
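
To give a feel for the SAGA-to-JSDL translation step, the sketch below renders four of the attributes listed above as a JSDL 1.0 document using the standard JSDL and JSDL-POSIX element names. The mapping actually used by DESHL is not shown on these slides, so the attribute-to-element choices here are an assumption, and `SagaToJsdlSketch` is a hypothetical class.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SagaToJsdlSketch {
    static final String JSDL_NS = "http://schemas.ggf.org/jsdl/2005/11/jsdl";
    static final String POSIX_NS = "http://schemas.ggf.org/jsdl/2005/11/jsdl-posix";

    // Minimal XML escaping for element text and attribute values.
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;")
                   .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Assumed mapping: JobName -> JobName, JobCmd -> Executable,
    // JobArgs -> Argument, JobEnv -> Environment.
    static String toJsdl(String jobName, String jobCmd,
                         List<String> jobArgs, Map<String, String> jobEnv) {
        StringBuilder xml = new StringBuilder();
        xml.append("<jsdl:JobDefinition xmlns:jsdl=\"").append(JSDL_NS)
           .append("\" xmlns:jsdl-posix=\"").append(POSIX_NS).append("\">\n");
        xml.append("  <jsdl:JobDescription>\n");
        xml.append("    <jsdl:JobIdentification>\n");
        xml.append("      <jsdl:JobName>").append(escape(jobName)).append("</jsdl:JobName>\n");
        xml.append("    </jsdl:JobIdentification>\n");
        xml.append("    <jsdl:Application>\n");
        xml.append("      <jsdl-posix:POSIXApplication>\n");
        xml.append("        <jsdl-posix:Executable>").append(escape(jobCmd))
           .append("</jsdl-posix:Executable>\n");
        for (String arg : jobArgs) {
            xml.append("        <jsdl-posix:Argument>").append(escape(arg))
               .append("</jsdl-posix:Argument>\n");
        }
        for (Map.Entry<String, String> env : jobEnv.entrySet()) {
            xml.append("        <jsdl-posix:Environment name=\"").append(escape(env.getKey()))
               .append("\">").append(escape(env.getValue()))
               .append("</jsdl-posix:Environment>\n");
        }
        xml.append("      </jsdl-posix:POSIXApplication>\n");
        xml.append("    </jsdl:Application>\n");
        xml.append("  </jsdl:JobDescription>\n");
        xml.append("</jsdl:JobDefinition>\n");
        return xml.toString();
    }

    public static void main(String[] args) {
        Map<String, String> env = new LinkedHashMap<>();
        env.put("account_no", "e24-sa");
        System.out.print(toJsdl("example job script", "a.out", Arrays.asList("--verbose"), env));
    }
}
```

Resource-related attributes such as SAGA_NumTasks or SAGA_Memory are left out of this sketch on purpose, since their JSDL representation depends on choices the slides do not describe.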


Job Interface

• Uses subset of SAGA Job interface.
• Due to translation steps (SAGA to JSDL to AJO), not possible to retrieve SAGA job definition from remote host.

public interface Job {
    String getJobId();
    JobState getJobState();
    String getJobStateDetail();
    void terminate();

    /* Not specified by SAGA but required by UNICORE to
     * retrieve output from USPACE and free resources. */
    void cleanUp(File toDir);
}


Example job submission

Session session;
…

// get the class factory
JobServiceFactory factory = DESHLClientFactory.getJobServiceFactory();

// get an instance of the job service from the factory
JobService js = factory.getInstance(session);

JobDefBuilder jobDefBuilder = new JobDefBuilder();
... // build up job definition from file or arguments

// get the constructed job definition
JobDefinition jobDef = jobDefBuilder.create();

// submit the job, return a job instance
Job submittedJob = js.submitJob(jobDef);

// get the job identifier, e.g. to display to the user
String jobID = submittedJob.getJobId();

// get the job instance again from the job identifier
Job remoteJob = js.getJob(jobID);

// get the job's state
JobState jobState = remoteJob.getJobState();

// retrieve the job output to a specified directory
remoteJob.cleanUp(new File("/home/malcolm/joboutputdir"));


Example copy operation

Session session;
int[] copyFlags = { NSDirFlags.copyFlags_NoRecursive, NSDirFlags.NoOverwrite };
String source = "ssl://admin.hpcx.ac.uk:4433/EPCC%20HPCx/home/malcolm/test.dat";
String target = "ssl://admin.hpcx.ac.uk:4433/IDRIS%20ZAHIR/home/malcolm/test.dat";

// get an instance of the factory
NSDirFactory factory = DESHLClientFactory.getNSDirFactory();

// get an instance of the NSDir interface from the factory
DESHLNSDir dir = factory.getInstance(session);

// verify the source file exists
boolean sourceFileExists = dir.exists(source);

// copy the file to the other site
dir.copy(source, target, copyFlags);

// verify the file turned up at the remote site
boolean targetFileExists = dir.exists(target);


Grid Access Library (roctopus)

• Presents a generalised object-oriented model for interacting with a UNICORE grid, not purely for DESHL

• Provides a general interface that can have multiple implementations

• Jobs are submitted to a Site as JSDL scripts; submission returns a Task.

• Presents Task interface to represent executing jobs.

• All of this hidden from the user/application developer

• Authentication/authorisation is by existing UNICORE mechanisms, i.e. long-lived X.509 pairs

[Class diagram: Grid, Site, Storage and File, connected by associations with multiplicities 1 and 0..*.]


Grid Library interface

• Provides dedicated functions for file management/transfer
• Job submission/management via rich Task interface
• Job submitted as JSDL, Task instance returned
• List of tasks at a remote site can be retrieved and manipulated

Example:

JobDefinition jobDef;

// translate the SAGA job definition to JSDL
XmlJobDefinitionDocument jsdl = JobDefJSDLConverter.jobDefToJSDL(jobDef);

// locate the target site
UnicoreLocation host = new UnicoreLocation(unicoreLocationStr);
Site site = grid.locateSite(host);

// jobSubmission is presumably built from the JSDL document; the slide does not show this step
final Task task = site.submit(jobSubmission);

task.startASync(new File[] {});


Current Issues (1)

– SAGA defines job identifiers as ‘[backend url]-[native id]’

  • Example: ‘[ssh://remote.host.net:22/]-[1234]’

– (We escape any characters likely to be a problem on the command line)

– Fine programmatically …

– From a CLT perspective, not user friendly:

$ deshl submit -q ssl://myhost.ac.uk:4433/myNJS sleeper.sh

Your job: ssl%3A%2F%2Fmyhost.ac.uk%3A4433%2FmyNJS%2F957383131, has been successfully submitted.

$ deshl status ssl%3A%2F%2Fmyhost.ac.uk%3A4433%2FmyNJS%2F957383131
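
The escaped identifiers shown above look like standard percent-encoding of the site URL plus the native job id. Assuming that is the scheme (the slides do not say how DESHL escapes them), the round trip can be sketched with the JDK's `URLEncoder` and `URLDecoder`; `JobIdEscaping` is a hypothetical helper class.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class JobIdEscaping {
    // Percent-encode characters such as ':' and '/' so the id is safe on a command line.
    static String escape(String jobId) {
        try {
            return URLEncoder.encode(jobId, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    // Reverse the encoding to recover the site URL and native id.
    static String unescape(String escapedJobId) {
        try {
            return URLDecoder.decode(escapedJobId, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) {
        String jobId = "ssl://myhost.ac.uk:4433/myNJS/957383131";
        String escaped = escape(jobId);
        System.out.println(escaped);           // ssl%3A%2F%2Fmyhost.ac.uk%3A4433%2FmyNJS%2F957383131
        System.out.println(unescape(escaped)); // ssl://myhost.ac.uk:4433/myNJS/957383131
    }
}
```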


Current Issues (2)

– Could save job id to a file and use simpler naming convention

– DESHL allows aliases to be defined for remote sites:

$ deshl submit -q myHost sleeper.sh

Your job myHost%2F957383131 has been successfully submitted

nsdir.copy("myhosta/home/malcolm/test.dat", "myhostb/home/malcolm/test.dat", copyFlags);

– Aliases are currently specified and handled outside of the SAGA standard; we would like to include this as an optional attribute in the context
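
One way to picture the alias mechanism is a lookup table that expands the leading path segment into a full site URL before the call reaches the SAGA layer. The sketch below is an assumption about how such expansion could work, not a description of the DESHL implementation; `SiteAliases` and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class SiteAliases {
    private final Map<String, String> aliases = new HashMap<>();

    // Register a short alias for a full site URL.
    void define(String alias, String siteUrl) {
        aliases.put(alias, siteUrl);
    }

    // Expand "alias/rest" to "siteUrl/rest"; anything else is returned unchanged.
    String expand(String name) {
        int slash = name.indexOf('/');
        String head = slash < 0 ? name : name.substring(0, slash);
        String tail = slash < 0 ? "" : name.substring(slash);
        String siteUrl = aliases.get(head);
        return siteUrl == null ? name : siteUrl + tail;
    }

    public static void main(String[] args) {
        SiteAliases aliases = new SiteAliases();
        aliases.define("myHost", "ssl://myhost.ac.uk:4433/myNJS");
        System.out.println(aliases.expand("myHost/957383131"));
        System.out.println(aliases.expand("ssl://other.ac.uk:4433/NJS/1"));
    }
}
```

Full URLs pass through untouched because their first segment (for example `ssl:`) is not a registered alias.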


Current Issues (3)

• Retrieving job definition:
  – Not currently supported …
  – Job definition originally as SAGA script
  – Not possible to retrieve original SAGA job definition from remote host, as host does not receive or understand this; would need to rely on local persistence
  – May be possible to get JSDL description and reverse-translate to SAGA
  – (could store original SAGA script in a local database with job id)

• Debugging / exception reporting:
  – Layered architecture can be difficult to debug.
  – Sometimes unclear whether a problem is in middleware or on remote host; very clear exception reporting is required, or users will tend to blame middleware for operational problems on host.
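
The local persistence option mentioned above could be as simple as saving each submitted SAGA script in a local directory under a file name derived from the job id. The following is a minimal sketch of that idea under assumed names (`LocalJobStore`, the `.saga` suffix, percent-encoded file names); a real deployment might well use a database instead, as the slide suggests.

```java
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

class LocalJobStore {
    private final Path directory;

    LocalJobStore(Path directory) throws IOException {
        this.directory = Files.createDirectories(directory);
    }

    // Remember the original SAGA script for a submitted job.
    void save(String jobId, String sagaScript) throws IOException {
        Files.write(fileFor(jobId), sagaScript.getBytes(StandardCharsets.UTF_8));
    }

    // Look the script up again later; empty if this client never stored it.
    Optional<String> load(String jobId) throws IOException {
        Path file = fileFor(jobId);
        if (!Files.exists(file)) {
            return Optional.empty();
        }
        return Optional.of(new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
    }

    // Job ids contain ':' and '/', so percent-encode them into a flat file name.
    private Path fileFor(String jobId) {
        try {
            return directory.resolve(URLEncoder.encode(jobId, "UTF-8") + ".saga");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) throws IOException {
        LocalJobStore store = new LocalJobStore(Files.createTempDirectory("deshl-jobs"));
        String jobId = "ssl://myhost.ac.uk:4433/myNJS/957383131";
        store.save(jobId, "#$ SAGA_JobCmd = hello.sh\n");
        System.out.println(store.load(jobId).orElse("(not found)"));
    }
}
```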


Questions … ?

http://forge.nesc.ac.uk/projects/deisa-jra7/
